When Machines Are the Audience

I recently received an email from someone at the Woodrow Wilson Center that began in the following way: “Dear Sir/Madam: I was wondering if you might share the following fellowship opportunity with the members of your list…The Africa Program is pleased to announce that it is now accepting applications…” The email was, of course, tagged as spam by my email software, since it looked suspiciously like what the U.S. Secret Service calls a 419 fraud scheme, or a scam where someone (generally from Africa) asks you to send them your bank account information so they can smuggle cash out of their country (the transfer then occurs in the opposite direction, in case you were wondering). Checking the email against a statistical list of high-likelihood spam triggers identified the repeated use of words such as “application,” “generous,” “Africa,” and “award,” as well as the phrases “submitted electronically” and the opening “Dear Sir/Madam.” The email piqued my curiosity because over the past year I’ve started altering some of my email writing to avoid precisely this problem of a “false positive” spam label, e.g., never sending just an attachment with no text (a class spam trigger) and avoiding the use of phrases such as “Hey, you’ve got to look at this.” In other words, I’ve semi-consciously started writing for a new audience: machines. One of the central theories of humanities disciplines such as literature and history is that our subjects write for an audience (or audiences). What happens when machines are part of this audience?

As the Woodrow Wilson Center email shows, the fact that digital text is machine readable suddenly makes the use of specific words problematic, because keyword searches can much more easily uncover these words (and perhaps act on them) than in a world of paper. It would be easy to find, for instance, all of the emails about Monica Lewinsky in the 40 million Clinton White House emails saved by the National Archives because “Lewinsky” is such an unusual word. Flipping that logic around, if I were currently involved in a White House scandal, I would studiously avoid the use of any identifying keywords (e.g., “Abramoff”) in my email correspondence.

In other cases, this keyword visibility is desirable. For instance, if I were a writer today thinking about my Word files, I would consider including or excluding certain words from each file for future research (either by myself or by others). Indeed, the “smart folder” technology in Apple’s Spotlight search or the upcoming Windows Vista search can automatically group documents based on the presence of a keyword or set of keywords. When people ask me how they can create a virtual network of websites on a historical topic, I often respond by saying that they could include at the bottom of each web page in the network a unique invented string of characters (e.g., “medievalhistorynetwork”). After Google indexes all of the web pages with this string, you could easily create a specialized search engine that scans only these particular sites.

“Machine audience consciousness” has probably already infected many other realms of our writing. Have some other examples? Let me know and I’ll post them here.

No Computer Left Behind

In this week’s issue of the Chronicle of Higher Education Roy Rosenzweig and I elaborate on the implications of my H-Bot software, and of similar data-mining services and the web in general. “No Computer Left Behind” (cover story in the Chronicle Review; alas, subscription required, though here’s a copy at CHNM) is somewhat more polemical than our recent article in First Monday (“Web of Lies? Historical Knowledge on the Internet”). In short, we argue that just as the calculator—an unavoidable modern technology—muscled its way into the mathematics exam room, devices to access and quickly scan the vast store of historical knowledge on the Internet (such as PDAs and smart phones) will inevitably disrupt the testing—and thus instruction—of humanities subjects. As the editors of the Chronicle put it in their headline: “The multiple-choice test is on its deathbed.” This development is to be praised; just as the teaching of mathematics should be about higher principles rather than the rote memorization of multiplication tables, the teaching of subjects like history should be freed by new technologies to focus once again (as it was before a century of multiple-choice exams) on more important principles such as the analysis and synthesis of primary sources. Here are some excerpts from the article.

“What if students will have in their pockets a device that can rapidly and accurately answer, say, multiple-choice questions about history? Would teachers start to face a revolt from (already restive) students, who would wonder why they were being tested on their ability to answer something that they could quickly find out about on that magical device?

“It turns out that most students already have such a device in their pockets, and to them it’s less magical than mundane. It’s called a cellphone. That pocket communicator is rapidly becoming a portal to other simultaneously remarkable and commonplace modern technologies that, at least in our field of history, will enable the devices to answer, with a surprisingly high degree of accuracy, the kinds of multiple-choice questions used in thousands of high-school and college history classes, as well as a good portion of the standardized tests that are used to assess whether the schools are properly “educating” our students. Those technological developments are likely to bring the multiple-choice test to the brink of obsolescence, mounting a substantial challenge to the presentation of history—and other disciplines—as a set of facts or one-sentence interpretations and to the rote learning that inevitably goes along with such an approach…

“At the same time that the Web’s openness allows anyone access, it also allows any machine connected to it to scan those billions of documents, which leads to the second development that puts multiple-choice tests in peril: the means to process and manipulate the Web to produce meaningful information or answer questions. Computer scientists have long dreamed of an adequately large corpus of text to subject to a variety of algorithms that could reveal underlying meaning and linkages. They now have that corpus, more than large enough to perform remarkable new feats through information theory.

“For instance, Google researchers have demonstrated (but not yet released to the general public) a powerful method for creating ‘good enough’ translations—not by understanding the grammar of each passage, but by rapidly scanning and comparing similar phrases on countless electronic documents in the original and second languages. Given large enough volumes of words in a variety of languages, machine processing can find parallel phrases and reduce any document into a series of word swaps. Where once it seemed necessary to have a human being aid in a computer’s translating skills, or to teach that machine the basics of language, swift algorithms functioning on unimaginably large amounts of text suffice. Are such new computer translations as good as a skilled, bilingual human being? Of course not. Are they good enough to get the gist of a text? Absolutely. So good the National Security Agency and the Central Intelligence Agency increasingly rely on that kind of technology to scan, sort, and mine gargantuan amounts of text and communications (whether or not the rest of us like it).

“As it turns out, ‘good enough’ is precisely what multiple-choice exams are all about. Easy, mechanical grading is made possible by restricting possible answers, akin to a translator’s receiving four possible translations for a sentence. Not only would those four possibilities make the work of the translator much easier, but a smart translator—even one with a novice understanding of the translated language—could home in on the correct answer by recognizing awkward (or proper) sounding pieces in each possible answer. By restricting the answers to certain possibilities, multiple-choice questions provide a circumscribed realm of information, where subtle clues in both the question and the few answers allow shrewd test takers to make helpful associations and rule out certain answers (for decades, test-preparation companies like Kaplan Inc. have made a good living teaching students that trick). The ‘gaming’ of a question can occur even when the test taker doesn’t know the correct answer and is not entirely familiar with the subject matter…

“By the time today’s elementary-school students enter college, it will probably seem as odd to them to be forbidden to use digital devices like cellphones, connected to an Internet service like H-Bot, to find out when Nelson Mandela was born as it would be to tell students now that they can’t use a calculator to do the routine arithmetic in an algebra equation. By providing much more than just an open-ended question, multiple-choice tests give students—and, perhaps more important in the future, their digital assistants—more than enough information to retrieve even a fairly sophisticated answer from the Web. The genie will be out of the bottle, and we will have to start thinking of more meaningful ways to assess historical knowledge or ‘ignorance.'”

Job Openings at CHNM for 2006-2007

Do you have technical skills and would like to apply those talents to expand and improve online learning and scholarship? Does your inner geek thrive in an academic setting? Do you want to be on the cutting edge of digital research? The Center for History and New Media is hiring! We have three openings for jobs that begin in the summer of 2006. The Center is a fantastic, exciting place to work, as I can attest. Here are the job descriptions; please feel free to contact me if you have any questions, and please forward these descriptions to others who might be interested.

Digital Historian: This is a one- to two-year position (depending on funding) at the rank of Research Assistant Professor at the Center for History and New Media (CHNM), which is closely affiliated with the Department of History and Art History at George Mason University. A PhD or advanced ABD in History or a closely related field is required. We are especially interested in people with some or all of the following credentials, but they are not required for the position: 1. experience in digital history or digital libraries; 2. strong technical background in new technology and new media; 3. administrative and organizational experience; 4. background in the history of science, technology, and industry, broadly defined. Please send letter of application, CV or resume, and three letters of recommendation (or dossier) to chnm@gmu.edu or Center for History and New Media, George Mason University, 4400 University Drive MS 1E7, Fairfax, VA 22030. Electronic submissions encouraged. Please use subject line “Digital Historian.” We will begin considering applications 1 April, 2006.

Webmaster and Technical Coordinator: The Center for History and New Media (CHNM) at George Mason University (GMU) anticipates an opening in Summer 2006 for a Webmaster/ Technical Coordinator to maintain the CHNM server and oversee the CHNM computer lab. This is a permanent classified staff position that is particularly appropriate for someone with combined interest in technology and history. We are seeking an energetic, responsible, well-organized person who is equally able to work independently, as part of a team, and as a supervisor. Specific background and experience is less important than the ability to learn new technical skills quickly. But knowledge of some combination of the following would be particularly helpful: scripting languages (especially PHP); database-driven web applications (especially using MySQL); command-line configuration of Red Hat Linux and Apache; web design (CSS, Dreamweaver, Photoshop); and Mac and Windows operating systems. In addition to overseeing a Red Hat Linux server, the Webmaster will help develop web database applications, construct websites, purchase and maintain software and hardware for the lab, and supervise part-time staff. Salary: $35-39K plus excellent benefits. Please email resume, three references, links to prior web-based work, and cover letter about technology background and interest in history to chnm@gmu.edu. Please use subject line “Webmaster.” We will begin considering applications on 4/15/2006 and continue until the position is filled.

Web Developer: The Center for History and New Media (CHNM) at George Mason University anticipates an entry-level opening for a web and multimedia developer in Summer 2006. We require an energetic and well-organized person to work on a variety of innovative, web-based history projects. This is a grant-funded, one-to-two-year position that is particularly appropriate for someone with a combined interest in technology and history. Specific background and experience is less important than the ability to learn new technical skills quickly. But knowledge of some combination of the following would be particularly helpful: web-database applications (MySQL and PHP), web design (CSS, Dreamweaver, Photoshop), multimedia (Flash, including ActionScript), and Final Cut Pro. Salary: $34K plus excellent benefits. Please email resume, three references, links to any prior web-based work (or programming examples), and a cover letter about technology background and interest in history to chnm@gmu.edu. Please use subject line “web developer.” We will begin considering applications on 4/15/2006 and continue until the position is filled.

Doing Digital History June 2006 Workshop

If your work deals in some way with the history of science, technology, or industry, and you would like to learn how to create online history projects, the Echo Project at the Center for History and New Media is running another one of our free, week-long workshops. The workshop covers the theory and practice of digital history; the ways that digital technologies can facilitate the research, teaching, writing and presentation of history; genres of online history; website infrastructure and design; document digitization; the process of identifying and building online history audiences; and issues of copyright and preservation.

As one of the teachers for this workshop, I can say somewhat immodestly that it’s really a great way to get up to speed on the many (sometimes complicated) elements necessary for website development. Unfortunately space is limited, so be sure to apply online by March 10, 2006. The workshop will take place from June 12-16, 2006, at George Mason University’s Arlington campus, right outside of Washington, DC. It is co-sponsored by the American Historical Association and the National History Center, and funded by the Alfred P. Sloan Foundation. There is no registration fee, and a limited number of fellowships are available to defray the costs of travel and lodging for graduate students and young scholars. Hope to see you there!

Impact of Field v. Google on the Google Library Project

I’ve finally had a chance to read the federal district court ruling in a case, Field v. Google, that has not been covered much (except in the technology press), but which has obvious and important implications for the upcoming battle over the legality of Google’s library digitization project. The case, Field v. Google, involved a lawyer who dabbles in some online poetry, and who was annoyed that Google’s spider cached a version of his copyrighted ode to delicious tea (“Many of us must have it iced, some of us take it hot and combined with milk, and others are not satisfied unless they know that only the rarest of spices and ingredients are contained therein…”). Field sued Google for copyright infringement; Google argued fair use. Field lost the case, with most of his points rejected by the court. The Electronic Frontier Foundation has hailed Google’s victory as a significant one, and indeed there are some very good aspects of the ruling for the book copying case. But there also seem to be some major differences between Google’s wholesale copying of websites and its wholesale copying of books that the court implicitly recognized. The following seem to be the advantages and disadvantages of this ruling for Google, the University of Michigan, and others who wish to see the library project reach completion.

Courts have traditionally used four factors to determine fair use—the purpose of the copying, the nature of the work, the extent of the copying, and the effect on the market of the work.

On purpose, the court ruled that Google’s cache was not simply a copy of that work, but added substantial value that was important to users of Google’s search engine. Users could still read Field’s poetry even if his site was down; they could compare Google’s cache with the original site to see if any changes had been made; they could see their search terms highlighted in the page. Furthermore, with a clear banner across the top Google tells its users that this is a copy and provides a link to the original. It also provides methods for website owners to remove their pages from the cache. This emphasis on opt out seems critical, since Google has argued that book publishers can simply tell them if they don’t want their books digitized. Also, the court ruled that the Google’s status as a commercial enterprise doesn’t matter here. Advantage for Google et al.

On the nature of the work, the court looked less at the quality of Field’s writing (“Simple flavors, simple aromas, simple preparation…”) than at Field’s intentions. Since he “sought to make his works available to the widest possible audience for free” by posting his poems on the Internet, and since Field was aware that he could (through the robots.txt file) exclude search engines from indexing his site, the court thought Field’s case with respect to this fair use factor was weakened. But book publishers and authors fighting Google will argue that they do not intend this free and wide distribution. Disadvantage for Google et al.

One would think that the third factor, the extent of the copying, would be a clear loser for Google, since they copy entire web pages as a matter of course. But the Nevada court ruled that because Google’s cache serves “multiple transformative and socially valuable purposes…that could not be effectively accomplished by using only portions” of web pages, and because Google points users to the original texts, this wholesale copying was OK. You can see why Google’s lawyers are overjoyed by this part of the ruling with respect to the book digitization project. Big advantage for Google et al.

Perhaps the cruelest part of the ruling had to do with the fourth factor of fair use, the effect on the market of the work. The court determined from its reading of Field’s ode to tea that “there is no evidence of any market for Field’s works.” Ouch. But there is clearly a market for many books that remain in copyright. And since the Google library project has just begun we don’t have any economic data about Google Book Search’s impact on the market for hard copies. No clear winner here.

In additional, the Nevada court added a critical fifth factor for determining fair use in this case: “Google’s Good Faith.” By providing ways to include and exclude materials from its cache, by providing a way to complain to the company, and by clearly spelling out its intentions in the display of the cache, the court determined that Google was acting in good faith—it was simply trying to provide a useful service and had no intention to profit from Field’s obsession with tea. Google has a number of features that replicate this sense of good faith in its book program, like providing links to libraries and booksellers, methods for publishers and authors to complain, and techniques for preventing user copies of copyrighted works. Advantage for Google et al.

A couple of final points that may work against Google. First, the court made a big deal out of the fact that the cache copying was completely automated, which the Google book project is clearly not. Second, the ruling constantly emphasizes the ability of Field to opt out of the program, but upset book publishers and authors believe this should be opt in, and it’s quite possible another court could agree with that position, which would weaken many of the points made above.

Google, the Khmer Rouge, and the Public Good

Like Daniel into the lion’s den, Mary Sue Coleman, the President of the University of Michigan, yesterday went in front of the Association of American Publishers to defend her institution’s participation in Google’s massive book digitization project. Her speech, “Google, the Khmer Rouge and the Public Good,” is an impassioned defense of the project, if a bit pithy at certain points. It’s worth reading in its entirety, but here are some highlights with commentary.

In two prior posts, I wondered what will happen to those digital copies of the in-copyright books the university receives as part of its deal with Google. Coleman obviously knew that this was a major concern of her audience, and she went overboard to satisfy them: “Believe me, students will not be reading digital copies of ‘Harry Potter’ in their dorm rooms…We will safeguard the entirety of this archive with the same diligence we accord our most sensitive materials at the University: medical records, Defense Department data, and highly infectious disease agents used in research.” I’m not sure if books should be compared to infectious disease agents, but it seems clear that the digital copies Michigan receives are not likely to make it into “the wild” very easily.

Coleman reminded her audience that for a long time the books in the Michigan library did not circulate and were only accessible to the Board of Regents and the faculty (no students allowed, of course). Finally Michigan President James Angell declared that books were “not to be locked up and kept away from readers, but to be placed at their disposal with the utmost freedom.” Coleman feels that the Google project is a natural extension of that declaration, and more broadly, of the university’s mission to disseminate knowledge.

Ultimately, Coleman turns from more abstract notions of sharing and freedom to the more practical considerations of how students learn today: “When students do research, they use the Internet for digitized library resources more than they use the library proper. It’s that simple. So we are obligated to take the resources of the library to the Internet. When people turn to the Internet for information, I want Michigan’s great library to be there for them to discover.” Sounds about right to me.

Wikipedia vs. Encyclopaedia Britannica Keyword Shootout Results

In my post “Wikipedia vs. Encyclopaedia Britannica for Digital Research”, I asked you to compare two lists of significant keywords and phrases, derived from matching articles on George H. W. Bush in Wikipedia and the Encyclopaedia Britannica. Which one is a better keyword profile—a data mining list that could be used to find other documents on the first President Bush in a sea of documents—and which list do you think was derived from Wikipedia? The people have spoken and it’s time to open the envelope.

Incredibly, as of this writing everyone who has voted has chosen list #2 as being the better of the two, with 79% of the voters believing that this list was extracted from Wikipedia. Well, the majority is half right.

First, a couple of caveats. For some reason Yahoo’s Term Extraction service returned more terms for the second article than the first (I’m not sure why, but my experience has been that the service is fickle in this way). In addition, the second article is much shorter than the first, and Yahoo has a maximum character length for documents it will process. I suspect that the first article was truncated on its way to Yahoo’s server. Regardless, I agree that the second list is better (though it may have been helped by these factors).

But it may surprise some that list #2 comes from the Encyclopaedia Britannica rather than Wikipedia. There are clearly a lot of Wikipedia true believers out there (including, at times, myself). Despite its flaws, however, I still think Wikipedia will probably do just as well for keyword profiling of documents as the Encyclopaedia Britannica. And qualitative considerations are essentially moot since the Encyclopaedia Britannica has rendered itself useless anyway for data-mining purposes by gating its content.

Digital History on Focus 580

From the shameless plug dept.: If you missed Roy Rosenzweig’s and my appearance on the Kojo Nnamdi Show, I’ll be on Focus 580 this Friday, February 3, 2006, at 11 AM ET/10 AM CT on the Illinois NPR station WILL. (If you don’t live in the listening area for WILL, their website also has a live stream of the audio.) I’ll be discussing Digital History: A Guide to Gathering, Preserving, and Presenting the Past on the Web and answering questions from the audience. If you’re reading this message after February 3, you can download the MP3 file of the show.

Wikipedia vs. Encyclopaedia Britannica for Digital Research

In a prior post I argued that the recent coverage of Wikipedia has focused too much on one aspect of the online reference source’s openness—the ability of anyone to edit any article—and not enough on another aspect of Wikipedia’s openness—the ability of anyone to download or copy the entire contents of its database and use it in virtually any way they want (with some commercial exceptions). I speculated that, as I discovered in my data-mining work with H-Bot, which uses Wikipedia in its algorithms, having an open and free resource such as this could be very important for future digital research—e.g., finding all of the documents about the first President Bush in a giant, untagged corpus on the American presidency. For a piece I’m writing for D-Lib Magazine, I decided to test this theory by pulling out significant keywords and phrases from matching articles in Wikipedia and the Encyclopaedia Britannica on George H. W. Bush to see if one was better than the other for this purpose. Which resource is better? Here are the unedited term lists, derived by running plain text versions of each article through Yahoo’s Term Extraction web service. Vote on which one you think is a better profile, and I’ll reveal which list belongs to which reference work later this week.

Article #1
president bush
saddam hussein
fall of the berlin wall
tiananmen square
thanksgiving day
american troops
manuel noriega
halabja
invasion of panama
gulf war
help
saudi arabia
united nations
berlin wall

Article #2
president george bush
george bush
mikhail gorbachev
soviet union
collapse
reunification of germany
thurgood marshall
union
clarence thomas
joint chiefs of staff
cold war
manuel antonio noriega
iraq
george
nonaggression pact
david h souter
antonio noriega
president george

How Much Google Knows About You

As the U.S. Justice Department put pressure on Google this week to hand over their search records in a questionable pursuit of evidence for an overturned pornography law, I wondered: How much information does Google really know about us? Strangely, at nearly the same time an email arrived from Google (one of the Google Friends Newsletters) telling me that they had just launched Google Personal Search Trends. Someone in the legal department must not have vetted that email: Google Personal Search Trends reveals exactly how much they know about you. So, how much?

A lot. If you have a Google account (you have one if you have a software developer’s username, a Gmail account, or other Google service account), you can login to your Personal Search Trends page and find out. I logged in and even though I’ve never checked a box or filled out a consent form saying that I don’t mind if Google collects information about my search habits, there appeared a remarkable and slightly unsettling series of charts and tables about me and what I’m interested in.

You can discover not only your top 10 search phrases but also the top 10 sites you visit and the top 10 links you click on. Like Santa, Google knows when you are awake and when you are sleeping—amazingly, no searches for me between midnight and 6 AM ET over the past 12 months. And comparing my search habits with its vast database of users, Google Personal Search Trends tells me that I might also like go to websites on RSS, Charles Dickens, Frankenstein, search engine optimization, and Virginia Tech football. (It’s very wrong about that last one, which I hope it only derives from my search terms and websites visited and not also from the IP address of my laptop in an office on the campus of a Virginia state university.)

Of course, you begin to wonder: wouldn’t someone else like to see this same set of charts and tables? Couldn’t they glean a tremendous amount of information about me? This disturbing feeling grows when you do some more investigation of what Google’s storing on your hard drive in addition to theirs. For instance, if you use Google’s Book Search, they know through a cookie stored on your computer which books you’ve looked at—as well as how many pages of each book (so they can block you from reading too much of a copyrighted book).

Seems like the time is ripe for Google to offer its users a similar deal to the one TiVo has had for years: If you want us to provide the “best” search experience—extras in addition to the basic web search such as personalized search results and recommendations based on what you seem to like—you must provide us with some identifying information; if you want to search the web without these extras, then so be it—we’ll only save your searches on a fully anonymous basis for our internal research. Surely when government entities and private investigators hear about Google Personal Search Trends, they’ll want to have a look. One suspects that in China and perhaps the United States too, someone’s already doing just that.

css.php