Category: Archives

Understanding reCAPTCHA

One of the things I added to this blog when I moved from my own software to WordPress was the red and yellow box in the comments section, which defends this blog against comment spam by asking commenters to decipher a couple of words. Such challenge-response systems are called CAPTCHAs (a tortured and unmellifluous acronym of “completely automated public Turing test to tell computers and humans apart”). What really caught my imagination about the CAPTCHA I’m using, called reCAPTCHA, is that it uses words from books scanned by the Internet Archive/Open Content Alliance. Thus at the same time commenters solve the word problems they are effectively serving as human OCR machines.

To date, about two million words have been deciphered using reCAPTCHA (see the article in Technology Review lauding reCAPTCHA’s mastermind, Luis von Ahn), which is a great start but, by my calculation (assuming 100,000 words per average book), amounts to only the equivalent of about 20 books. Of course, it’s really much more than that because the words in reCAPTCHA are the hardest ones to decipher by machine and are sprinkled among thousands of books.

Indeed, that is the true genius of reCAPTCHA—it “tells computers and humans apart” by first using OCR software to find words computers can’t decipher, then feeding those words to humans, who can decipher them (proving themselves human). Therefore a spammer running OCR software (as many of them do to decipher lesser CAPTCHAs) will have great difficulty cracking it. If you would like an in-depth lesson about how reCAPTCHA (and CAPTCHAs in general) works, take a listen to Steve Gibson’s podcast on the subject.
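For the technically curious, the mechanism is simple enough to sketch in a few lines of Python. As I understand it, each challenge pairs a word the system has already deciphered with one the OCR software could not read; the commenter is verified against the known word, and the reading of the unknown word is recorded as a transcription vote (with agreement across many users eventually settling the word). This is a sketch of the principle only, with invented names, not the actual reCAPTCHA code or API.

import random

# A toy model of the idea, not the real reCAPTCHA service. Word images are
# represented by IDs; the server knows the answer for just one of the pair.
KNOWN = {"img_0042": "liberty"}   # control word already deciphered with confidence
UNKNOWN = ["img_0099"]            # word the OCR software could not read
votes = {}                        # human readings collected for unknown words

def make_challenge():
    pair = [random.choice(list(KNOWN)), random.choice(UNKNOWN)]
    random.shuffle(pair)          # the user cannot tell which word is which
    return pair

def check_response(pair, typed_words):
    answers = dict(zip(pair, typed_words))
    control = next(i for i in pair if i in KNOWN)
    unknown = next(i for i in pair if i not in KNOWN)
    # Humanity is judged only against the word the system already knows...
    if answers[control].strip().lower() != KNOWN[control]:
        return False
    # ...while the reading of the unknown word is stored as one OCR "vote."
    votes.setdefault(unknown, []).append(answers[unknown].strip().lower())
    return True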

The brilliance of reCAPTCHA and its simultaneous assistance to the digital commons leads one to ponder: What other aspects of digitization, cataloging, and research could be aided by giving a large, distributed group of humans the bits that computers have great difficulty with?

And imagine the power of this system if all 60 million CAPTCHAs answered daily were reCAPTCHAs instead. Why not convert your blog or login system to reCAPTCHA today?

Shakespeare’s Hard Drive

Congrats to Matt Kirschenbaum on his thought-provoking article in the Chronicle of Higher Education, “Hamlet.doc? Literature in a Digital Age.” Matt makes two excellent points. First, “born digital” literature presents incredible new opportunities for research, because manuscripts written on computers retain significant metadata and draft tracking that allows for major insights into an author’s thought and writing process. Second, scholars who wish to study such literature in the future need to be proactive in pushing for writing environments, digital standards, and archival storage that will ensure these advantages remain accessible and persist over time.

“The Object of History” Site Launches

Thanks to the hard work of my colleagues at the Center for History and New Media, led by Sharon Leon, you can now go behind the scenes with the curators of the National Museum of American History. This month the discussion begins with the famous Greensboro Woolworth’s lunch counter and the origins of the Civil Rights movement. Each month will highlight a new object and its corresponding context, delivered in rich multimedia and with the opportunity to chat with the curators themselves.

A Closer Look at the National Archives-Footnote Agreement

I’ve spent the past two weeks trying to get a better understanding of the agreement signed by the National Archives and Footnote, about which I raised several concerns in my last post. Before making further (possibly unfounded) criticisms I thought it would be a good idea to talk to both NARA and Footnote. So I picked up the phone and found several people eager to clarify things. At NARA, Jim Hastings, director of access programs, was particularly helpful in explaining their perspective. (Alas, NARA’s public affairs staff seemed to have only the sketchiest sense of key details.) Most helpful—and most eager to rebut my earlier post—were Justin Schroepfer and Peter Drinkwater, the marketing director and product lead at Footnote. Much to their credit, Justin and Peter patiently answered most of my questions about the agreement and the operation of the Footnote website.

Surprisingly, everyone I spoke to at both NARA and Footnote emphasized that despite the seemingly set-in-stone language of the legal agreement, there is a great deal of latitude in how it is executed, and they asked me to spread the word about how historians and the general public can weigh in. It has received virtually no publicity, but NARA is currently in a public comment phase for the Footnote (a/k/a iArchives) agreement. Scroll down to the bottom of the “Comment on Draft Policy” page at NARA’s website and you’ll find a request for public comment (you should email your thoughts to Vision@nara.gov). It’s a little odd to have a request for comment after the ink is dry on an agreement or policy, and this URL probably should have been included in the press release of the Footnote agreement, but I do think after speaking with them that both NARA and Footnote are receptive to hearing responses to the agreement. Indeed, in response to this post and my prior post on the agreement, Footnote has set up a web page, “Finding the Right Balance,” to receive feedback from the general public on the issues I’ve raised. They also asked me to round up professional opinion on the deal.

I assume Footnote will explain their policies in greater depth on their blog, but we agreed that it would be helpful to record some important details of our conversations in this space. Here are the answers Justin and Peter gave to a few pointed questions.

When I first went to the Footnote site, I was unpleasantly surprised that it required registration even to look at “milestone” documents like Lincoln’s draft of the Gettysburg Address. (Unfortunately, Footnote doesn’t have a list of all of its free content yet, so it’s hard to find such documents.) Justin and Peter responded that when they launched the site there was an error in the document viewer, so they had to add authentication to all document views. A fix was rolled out on January 23, and it’s now possible to view these important documents without registering.

You do need to register, however, to print or download any document, whether it’s considered “free” or “premium.” Why? Justin and Peter candidly noted that although they have done digitization projects before, the National Archives project, which contains millions of critical—and public domain—documents, is a first for them. They are understandably worried about the “leakage” of documents from their site, and want to take it one step at a time. So to start they will track all downloads to see how much escapes, especially in large batches. I noted that downloading and even reusing these documents (even en masse) very well might be legal, despite Footnote’s terms of service, because the scans are “slavish” copies of the originals, which are not protected by copyright. Footnote lawyers are looking at copyright law and what other primary-source sites are doing, and they say that they view these initial months as a learning experience to see if the terms of service can or should change. Footnote’s stance on copyright law and terms of usage will clearly be worth watching.

Speaking of terms of usage, I voiced a similar concern about Footnote’s policies toward minors. As you’ll recall, Footnote’s terms of service say the site is intended for those 18 and older, thus seeming to turn away the many K-12 classes that could take advantage of it. Justin and Peter were most passionate on this point. They told me that Footnote would like to give free access to the site for the K-12 market, but pointed to the restrictiveness of U.S. child protection laws. Because the Footnote site allows users to upload documents as well as view them, they worry about what youngsters might find there in addition to the NARA docs. These laws also mandate the “over 18” clause because the site captures personal information. It seems to me that there’s probably a technical solution that could be found here, similar to the one PBS.org uses to provide K-12 teaching materials without capturing information from the students.

Footnote seems willing to explore such a possibility, but again, Justin and Peter chalked up problems to the newness of the agreement and their inexperience running an interactive site with primary documents such as these. Footnote’s lawyers consulted (and borrowed, in some cases) the boilerplate language from terms of service at other sites, like Ancestry.com. But again, the Footnote team emphasized that they are going to review the policies and look into flexibility under the laws. They expect to tweak their policies in the coming months.

So, now is your chance to weigh in on those potential changes. If you do send a comment to either Footnote or NARA, try to be specific in what you would like to see. For instance, at the Center for History and New Media we are exploring the possibility of mining historical texts, which will only be possible to do on these millions of NARA documents if the Archives receives not only the page images from Footnote but also the OCRed text. (The handwritten documents cannot be automatically transcribed using optical character recognition, of course, but there are many typescript documents that have been converted to machine-readable text.) NARA has not asked to receive the text for each document back from Footnote—only the metadata and a combined index of all documents. There was some discussion that NARA is not equipped to handle the flood of data that a full-text database would entail. Regardless, I believe it would be in the best interest of historical researchers to have NARA receive this database, even if they are unable to post it to the web right away.

The Flawed Agreement between the National Archives and Footnote, Inc.

I suppose it’s not breaking news that libraries and archives aren’t flush with cash. So it must be hard for a director of such an institution when a large corporation, or even a relatively small one, comes knocking with an offer to digitize one’s holdings in exchange for some kind of commercial rights to the contents. But as a historian worried about open access to our cultural heritage, I’m a little concerned about the new agreement between Footnote, Inc. and the United States National Archives. And I’m surprised that somehow this agreement has thus far flown under the radar of all of those who attacked the troublesome Smithsonian/Showtime agreement. Guess what? From now until 2012 it will cost you $100 a year, or even more offensively, $1.99 a page, for online access to critical historical documents such as the Papers of the Continental Congress.

This was the agreement signed by Archivist of the United States Allen Weinstein and Footnote, Inc., a Utah-based digital archives company, on January 10, 2007. For the next five years, unless you have the time and money to travel to Washington, you’ll have to fork over money to Footnote to take a peek at Civil War pension documents or the case files of the early FBI. The National Archives says this agreement is “non-exclusive”—I suppose crossing their fingers that Google will also come along and make a deal—but researchers shouldn’t hold their breaths for other options.

Footnote.com, the website that provides access to these millions of documents, charges for anything more than viewing a small thumbnail of a page or photograph. Supposedly the value-added of the site (aside from being able to see detailed views of the documents) is that it allows you to save and annotate documents in your own library, and share the results of your research (though not the original documents). Hmm, I seem to remember that there’s a tool being developed that will allow you to do all of that—for free, no less.

Moreover, you’ll also be subject to some fairly onerous terms of usage on Footnote.com, especially considering that this is our collective history and that all of these documents are out of copyright. (For a detailed description of the legal issues involved here, please see Chapter 7 of Digital History, “Owning the Past?”, especially the section covering the often bogus claims of copyright on scanned archival materials.) I’ll let the terms speak for themselves (plus one snide aside): “Professional historians and others conducting scholarly research may use the Website [gee, thanks], provided that they do so within the scope of their professional work, that they obtain written permission from us before using an image obtained from the Website for publication, and that they credit the source. You further agree that…you will not copy or distribute any part of the Website or the Service in any medium without Footnote.com’s prior written authorization.”

Couldn’t the National Archives have at least added a provision to the agreement with Footnote to allow students free access to these documents? I guess not; from the terms of usage: “The Footnote.com Website is intended for adults over the age of 18.” What next? Burly bouncers carding people who want to see the Declaration of Independence?

Raw Archives and Hurricane Katrina

Several weeks ago during my talk on the “Possibilities and Problems of Digital History and Digital Collections” at the joint meeting of the Council of State Archivists, the National Association of Government Archives and Records Administrators, and the Society of American Archivists (CoSA, NAGARA, and SAA), I received a pointed criticism from an audience member during the question-and-answer period. Having just shown the September 11 Digital Archive, the questioner wanted to know how this qualified as an “archive,” since archives are generally based upon rigorous principles of value, selection, and provenance. It’s a valid critique—though a distinction that might be lost on a layperson who is unaware of archival science and might consider their shoebox of photos an “archive.” Maybe it’s time for a new term: the raw archive. On the Internet, these raw archives are all around us.

Just think about Flickr, Blogger, or even (dare I speak its name) YouTube. These sites are documenting—perhaps in an exhibitionist way, but documenting nonetheless—the lives of millions of people. They are also aggregating that documentation in an astonishing way that was not possible before the web. They are not archives in the traditional sense, instead eschewing selection biases for a come one, come all attitude that has produced collections of photos, articles, and videos several orders of magnitude larger than anything in the physical world. They may be easy to disparage, but I suspect they will be extraordinarily useful for future historians and researchers.

Or I should say would be, if they were being run by entities that are concerned with the very long run. But the Flickrs of the web are companies, and have little commitment to store their contents for ten, much less a hundred, years.

That’s why more institutions with a long-term view, such as universities, libraries, and museums, need to think about getting into the raw archive business. We in the noncommercial world should be incredibly thankful for the Internet Archive, which has probably done the most in this respect. Institutions that are oriented toward the long run have to think about adding the raw to their already substantial holdings of the “cooked” (that is, traditional archives).

Our latest contribution to this effort is the Hurricane Digital Memory Bank, which has just undergone a redesign and which now has over 5000 contributions. It’s a great example of what can be done with the raw, when thought about with the researcher, rather than voyeur, in mind. On this anniversary of Hurricane Katrina, I invite you to add your recollections, photos, and other raw materials to the growing archive. And please tell others. We have a come one, come all attitude toward contributions, and need as many people as possible to help us build the (raw) archive.

Mapping What Americans Did on September 11

I gave a talk a couple of days ago at the annual meeting of the Society for American Archivists (to a great audience—many thanks to those who were there and asked such terrific questions) in which I showed how researchers in the future will be able to intelligently search, data mine, and map digital collections. As an example, I presented some preliminary work I’ve done on our September 11 Digital Archive combining text analysis with geocoding to produce overlays on Google Earth that show what people were thinking or doing on 9/11 in different parts of the United States. I promised a follow-up article in this space for those who wanted to learn how I was able to do this. The method provides an overarching view of patterns in a large collection (in the case of the September 11 Digital Archive, tens of thousands of stories), which can then be prospected further to answer research questions. Let’s start with the end product: two maps (a wide view and a detail) of those who were watching CNN on 9/11 (based on a text analysis of our stories database, and colored blue) and those who prayed on 9/11 (colored red).

Google Earth map of the United States showing stories with CNN viewing (blue) and stories with prayer (red) [view full-size version for better detail]

Detail of the Eastern United States [view full-size version for better detail]

By panning and zooming, you can see some interesting patterns. Some of these patterns may be obvious to us, but a future researcher with little knowledge of our present could find out easily (without reading thousands of stories) that prayer was more common in rural areas of the U.S. in our time, and that there was especially a dichotomy between the very religious suburbs (or really, the exurbs) of cities like Dallas and the mostly urban CNN-watchers. (I’ll present more surprising data in this space as we approach the fifth anniversary of 9/11.)

OK, here’s how to replicate this. First, a caveat. Since I have direct access to the September 11 Digital Archive database, as well as the ability to run server-to-server data exchanges with Google and Yahoo (through their API programs), I was able to put together a method that may not be possible for some of you without some programming skills and direct access to similar databases. For those in this blog’s audience who do have that capacity, here’s the quick, geeky version: using regular expressions, form an SQL query into the database you are researching to find matching documents; select geographical information (either from the metadata, or, if you are dealing with raw documents, pull identifying data from the main text by matching, say, 5-digit numbers for zip codes); put these matches into an array, and then iterate through the array to send each location to either Yahoo’s or Google’s geocoding service via their maps API; take the latitude and longitude from the result set from Yahoo or Google and add these to your array; iterate again through the array to create a KML (Keyhole Markup Language) file by wrapping each field with the appropriate KML tag.
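For those who want that pipeline spelled out, here is a rough sketch in Python. It is only a sketch under assumptions: the table and column names are invented stand-ins for the real schema, the geocoding step is faked with a tiny lookup table (the real version makes a server-to-server call to Yahoo’s or Google’s geocoding API with a developer key), and the KML is pared down to a single style and bare placemarks.

import re
import sqlite3
from xml.sax.saxutils import escape

ZIP_RE = re.compile(r"\b\d{5}\b")  # five-digit zip codes in the metadata or the story text

def find_matches(db_path, keyword="CNN"):
    # Table and column names here are hypothetical, not the archive's actual schema.
    conn = sqlite3.connect(db_path)
    rows = conn.execute("SELECT id, text, zip FROM stories WHERE text LIKE ?",
                        ("%" + keyword + "%",))
    for story_id, text, zip_code in rows:
        m = ZIP_RE.search(str(zip_code or "") + " " + text)  # metadata first, then the text
        if m:
            yield story_id, m.group(0)

# Stand-in for Yahoo's or Google's geocoding service; the real version sends each
# location out over HTTP and reads the latitude and longitude from the response.
FAKE_GEOCODER = {"20001": (38.91, -77.02), "10001": (40.75, -73.99)}

def geocode(zip_code):
    return FAKE_GEOCODER.get(zip_code)

def to_kml(points, color="7fff0000"):
    # One shared style (semi-transparent blue dots) plus one Placemark per story;
    # note that KML lists coordinates as longitude,latitude.
    placemarks = "".join(
        "<Placemark><name>%s</name><styleUrl>#dot</styleUrl>"
        "<Point><coordinates>%f,%f,0</coordinates></Point></Placemark>"
        % (escape(str(name)), lon, lat) for name, lat, lon in points)
    return ("<?xml version='1.0' encoding='UTF-8'?><kml><Document>"
            "<Style id='dot'><IconStyle><color>%s</color><scale>0.8</scale>"
            "</IconStyle></Style>%s</Document></kml>") % (color, placemarks)

if __name__ == "__main__":
    points = []
    for story_id, zip_code in find_matches("september11.db"):
        hit = geocode(zip_code)
        if hit:
            points.append((story_id, hit[0], hit[1]))
    with open("cnn.kml", "w") as out:
        out.write(to_kml(points))

The resulting file opens directly in Google Earth, and the color value works the same way as in the hand-edited example later in this post.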

For everyone else, here’s the simplest method I could find for reproducing the maps I created. We’re going to use a web-based front end for Yahoo’s geocoding API, Phillip Holmstrand’s very good free service, and then modify the results a bit to make them a little more appropriate for scholarly research.

First of all, you need to put together a spreadsheet in Excel (or Access or any other spreadsheet program; you can also just create a basic text document with columns and tabs between fields so it looks like a spreadsheet). Hopefully you will not be doing this manually; if you can get a tab-delimited text export from the collection you wish to research, that would be ideal. One or more columns should identify the location of the matching document. Make separate columns for street address, city, state/province, and zip codes (if you only have one or a few of these, that’s totally fine). If you have a distinct URL for each document (e.g., a letter or photograph), put that in another column; same for other information such as a caption or description and the title of the document (again, if any). You don’t need these non-location columns; the only reason to include them is if you wish to click on a dot on Google Earth and bring up the corresponding document in your web browser (for closer reading or viewing).

Be sure to title each column, i.e., use text in the topmost cell with specific titles for the columns, with no spaces. I recommend “street_address,” “city,” “state,” “zip_code,” “title,” “description,” and “url” (again, you may only have one or more of these; for the CNN example I used only the zip codes). Once you’re done with the spreadsheet, save it as a tab-delimited text file by using that option in Excel (or Access or whatever) under the menu item “Save as…”
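To make the format concrete, a minimal export using only zip codes and URLs might look something like this once saved as tab-delimited text (the values and URLs here are invented; the columns are separated by single tabs):

zip_code	url
20001	http://example.org/stories/101
75034	http://example.org/stories/102
10013	http://example.org/stories/103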

Now open that new file in a text editor like Notepad on the PC or Textedit on the Mac (or BBEdit or anything else other than a word processor, since Word, e.g., will reformat the text). Make sure that it still looks roughly like a spreadsheet, with the title of the columns at the top and each column separated by some space. Use “Select all” from the “Edit” menu and then “Copy.”

Now open your web browser and go to Phillip Holmstrand’s geocoding website and go through the steps. “Step #1” should have “tab delimited” selected. Paste your columned text into the big box in “Step #2” (you will need to highlight the example text that’s already there and delete it before pasting so that you don’t mingle your data with the example). Click “Validate Source” in “Step #3.” If you’ve done everything right thus far, you will get a green message saying “validated.”

In “Step #4” you will need to match up the titles of your columns with the fields that Yahoo accepts, such as address, zip code, and URL. Phillip’s site is very smart and will try to do this automatically for you, but double-check that it has done the matching correctly (if you use the column titles I suggest, it should work perfectly). Remember, you don’t need to select every one of these parameters if you don’t have a column for each. Just leave them blank.

Click “Run Geocoder” in “Step #5” and watch as the latitudes and longitudes appear in the box in “Step #6.” Wait until the process is totally done. Phillip’s site will then map the first 100 points on a built-in Yahoo map, but we are going to take our data with us and modify it a bit. Select “Download to Google Earth (KML) File” at the bottom of “Step #6.” Remember where you save the file. The default name for that file will be “BatchGeocode.kml”. Feel free to change the name, but be sure to keep “.kml” at the end.
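Incidentally, if you are curious about what the downloaded file contains, each geocoded row is stored as a KML “Placemark.” The exact tags may differ slightly from service to service, but a placemark looks roughly like this (with invented values):

<Placemark><name>20001</name><description>http://example.org/stories/101</description>
<styleUrl>#A</styleUrl><Point><coordinates>-77.02,38.91,0</coordinates></Point></Placemark>

Note that KML lists longitude before latitude inside the coordinates tag.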

While Phillip’s site takes care of a lot of steps for you, if you try right away to open the KML file in Google Earth you will notice that all of the points are blazing white. This is fine for some uses (show me where the closest Starbucks is right now!), but scholarly research requires the ability to compare different KML files (e.g., between CNN viewers and those who prayed). So we need to implement different colors for distinct datasets.

Open your KML file in a text editor like Notepad or Textedit. Don’t worry if you don’t know XML or HTML (if you do know these languages, you will feel a bit more comfortable). Right near the top of the document, there will be a section that looks like this:

<Style id="A"><IconStyle><scale>0.8</scale><Icon><href>root://icons/
palette-4.png</href><x>30</x><w>32</w><h>32</h></Icon></IconStyle>
<LabelStyle><scale>0</scale></LabelStyle></Style>

To color the dots that this file produces on Google Earth, we need to add a set of “color tags” between <IconStyle> and <scale>. Using your text editor, insert “<color></color>” at that point. Now you should have a section that looks like this:

<Style id="A"><IconStyle><color></color><scale>0.8</scale><Icon><href>root:
//icons/palette-4.png</href><x>30</x><w>32</w><h>32</h></Icon></IconStyle>
<LabelStyle><scale>0</scale></LabelStyle></Style>

We’re almost done, but unfortunately things get a little more technical. Google uses what’s called an ABGR value for defining colors in Google Earth files. ABGR stands for “alpha, blue, green, red.” In other words, you will have to tell the program how much blue, green, and red you want in the color, plus the alpha value, which determines how opaque or transparent the dot is. Alas, each of these four parts must be expressed in a two-digit hexadecimal format ranging from “00” (no amount) to “ff” (full amount). Combining each of these two-digit values gives you the necessary full string of eight characters. (I know, I know—why not just <color>red</color>? Don’t ask.) Anyhow, a fully opaque red dot would be <color>ff0000ff</color>, since that value has full (“ff”) opacity and full (“ff”) red value (opacity being the first and second places of the eight characters and red being the seventh and eighth places of the eight characters). Welcome to the joyous world of ABGR.
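If you would rather not puzzle out the hex digits by hand, here is a small Python helper (my own addition, not one of the tutorial’s required steps) that assembles the eight-character string from ordinary red, green, and blue values between 0 and 255, plus an alpha value:

def kml_color(red, green, blue, alpha=127):
    # Google Earth expects the channels in alpha, blue, green, red order,
    # each as a two-digit hexadecimal value from 00 (none) to ff (full).
    return "%02x%02x%02x%02x" % (alpha, blue, green, red)

print(kml_color(0, 0, 255))  # semi-transparent blue -> 7fff0000
print(kml_color(255, 0, 0))  # semi-transparent red  -> 7f0000ff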

Let me save you some time. I like to use 50% opacity so I can see through dots. That helps give a sense of mass when dots are close to or on top of each other, as is often the case in cities. (You can also vary the size of the dots, but let’s wait for another day on that one.) So: semi-transparent red is “7f0000ff”; semi-transparent blue is “7fff0000”; semi-transparent green is “7f00ff00”; semi-transparent yellow is “7f00ffff”. (Yes, green and red really do add up to yellow here; these are values for light, not paint.) So for blue dots that you can see through, as in the CNN example, the final code should have “7fff0000” inserted between <color> and </color>, resulting in:

<Style id="A"><IconStyle><color>7fff0000</color><scale>0.8</scale><Icon><href>
root://icons/palette-4.png</href><x>30</x><w>32</w><h>32</h></Icon>
</IconStyle><LabelStyle><scale>0</scale></LabelStyle></Style>

When you’ve inserted your color choice, save the KML document in your text editor and run the Google Earth application. From within that application, choose “Open…” from the “File” menu and select the KML file you just edited. Google Earth will load the data and you will see colored dots on your map. To compare two datasets, as I did with prayer and CNN viewership, simply open more than one KML file. You can toggle each set of dots on and off by clicking the checkboxes next to their filenames in the middle section of the panel on the left. Zoom and pan, add other datasets (such as population statistics), add a third or fourth KML file. Forget about all the tech stuff and begin your research.

[For those who just want to try out using a KML file for research in Google Earth, here are a few from the September 11 Digital Archive. Right-click (or control-click on a Mac) to save the files to your computer, then open them within Google Earth, which you can download from here. These are files mapping the locations of: those who watched CNN; those who watched Fox News (far fewer than CNN since Fox News was just getting off the ground, but already showing a much more rural audience compared to CNN); and those who prayed on 9/11.]

Digital History on Focus 580

From the shameless plug dept.: If you missed Roy Rosenzweig’s and my appearance on the Kojo Nnamdi Show, I’ll be on Focus 580 this Friday, February 3, 2006, at 11 AM ET/10 AM CT on the Illinois NPR station WILL. (If you don’t live in the listening area for WILL, their website also has a live stream of the audio.) I’ll be discussing Digital History: A Guide to Gathering, Preserving, and Presenting the Past on the Web and answering questions from the audience. If you’re reading this message after February 3, you can download the MP3 file of the show.

Kojo Nnamdi Show Questions

Roy Rosenzweig and I had a terrific time on The Kojo Nnamdi Show today. If you missed the radio broadcast you can listen to it online on the WAMU website. There were a number of interesting calls from the audience, and we promised several callers that we would answer a couple of questions off the air; here they are.

Barbara from Potomac, MD asks, “I’m wondering whether new products that claim to help compress and organize data (I think one is called “C-Gate” [Kathy, an alert reader of this blog, has pointed out that Barbara probably means the giant disk drive company Seagate]) help out [to solve the problem of storing digital data for the long run]? The ads claim that you can store all sorts of data—from PowerPoint presentations and music to digital files—in a two-ounce standalone disk or other device.”

As we say in the book, we’re skeptical of using rare and/or proprietary formats to store digital materials for the long run. Despite the claims of many companies about new and novel storage devices, it’s unclear whether these specialized devices will be accessible in ten or a hundred years. We recommend sticking with common, popular formats and devices (at this point, probably standard hard drives and CD- or DVD-ROMs) if you want to have the best odds of preserving your materials for the long run. The National Institute of Standards and Technology (NIST) provides a good summary of how to store optical media such as CDs and DVDs for long periods of time.

Several callers asked where they could go if they have materials on old media, such as reel-to-reel or 8-track tapes, that they want to convert to a digital format.

You can easily find online some of the companies we mentioned that will (for a fee) transfer your old media to new, digital formats. Google for the media you have (e.g., “8-track tape”) along with the words “conversion services” or “transfer services.” I probably overestimated the cost for these services; most conversions will cost less than $100 per tape. However, the older the media, the more expensive it will be. I’ll continue to look into places in the Washington area that might provide these services for free, such as libraries and archives.

Digital History on The Kojo Nnamdi Show

From the shameless plug dept.: Roy Rosenzweig and I will be discussing our book Digital History: A Guide to Gathering, Preserving, and Presenting the Past on the Web this Tuesday, January 10, on The Kojo Nnamdi Show. The show is produced at Washington’s NPR station, WAMU. We’re on live from noon to 1 PM EST, and you’ll be able to ask us questions by phone (1-800-433-8850), via email (kojo@wamu.org), or through the web. The show will be replayed from 8-9 PM EST on Tuesday night, and syndicated via iTunes and other outlets as part of NPR’s terrific podcast series (look for The Kojo Nnamdi Show/Tech Tuesday). You’ll also be able to get the audio stream directly from the show’s website. I’ll probably answer some additional questions from the audience in this space.