Look at the bottom of this page for Illustrated New York: The Metropolis of To-day (1888), digitized by Google at the University of Michigan Library. Using the natural language processing of Google Maps to scan the text for addresses, the locations and surrounding text are placed onto a map of lower Manhattan. A great example of the power of historical data mining and the combination of digital resources via APIs (made easier for Google, of course, because this is all in-house). Kudos to the Google Book Search team.
I gave a talk a couple of days ago at the annual meeting of the Society of American Archivists (to a great audience; many thanks to those who were there and asked such terrific questions) in which I showed how researchers in the future will be able to intelligently search, data mine, and map digital collections. As an example, I presented some preliminary work I’ve done on our September 11 Digital Archive combining text analysis with geocoding to produce overlays on Google Earth that show what people were thinking or doing on 9/11 in different parts of the United States. I promised a follow-up article in this space for those who wanted to learn how I was able to do this. The method provides an overarching view of patterns in a large collection (in the case of the September 11 Digital Archive, tens of thousands of stories), which can then be prospected further to answer research questions. Let’s start with the end product: two maps (a wide view and a detail) of those who were watching CNN on 9/11 (based on a text analysis of our stories database, and colored blue) and those who prayed on 9/11 (colored red).
Google Earth map of the United States showing stories with CNN viewing (blue) and stories with prayer (red)
Detail of the Eastern United States
By panning and zooming, you can see some interesting patterns. Some of these patterns may be obvious to us, but a future researcher with little knowledge of our present could find out easily (without reading thousands of stories) that prayer was more common in rural areas of the U.S. in our time, and that there was especially a dichotomy between the very religious suburbs (or really, the exurbs) of cities like Dallas and the mostly urban CNN-watchers. (I’ll present more surprising data in this space as we approach the fifth anniversary of 9/11.)
OK, here’s how to replicate this. First, a caveat. Since I have direct access to the September 11 Digital Archive database, as well as the ability to run server-to-server data exchanges with Google and Yahoo (through their API programs), I was able to put together a method that may not be possible for some of you without some programming skills and direct access to similar databases. For those in this blog’s audience who do have that capacity, here’s the quick, geeky version: using regular expressions, form an SQL query into the database you are researching to find matching documents; select geographical information (either from the metadata, or, if you are dealing with raw documents, pull identifying data from the main text by matching, say, 5-digit numbers for zip codes); put these matches into an array, and then iterate through the array to send each location to either Yahoo’s or Google’s geocoding service via their maps API; take the latitude and longitude from the result set from Yahoo or Google and add these to your array; iterate again through the array to create a KML (Keyhole Markup Language) file by wrapping each field with the appropriate KML tag.
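For the programmers, the steps above can be sketched roughly as follows. This is a minimal illustration in Python, not the code behind the maps shown here. The sample stories are invented, the function names are hypothetical, and `SAMPLE_LOOKUP` (with approximate coordinates) stands in for a real call to a geocoding service. A bare 5-digit match will also catch numbers that are not zip codes, so real data would need checking.

```python
import re
from xml.sax.saxutils import escape

# A standalone 5-digit number is treated as a candidate zip code.
ZIP_PATTERN = re.compile(r"\b\d{5}\b")


def find_zip_codes(text):
    """Return every standalone 5-digit number in the text, in order."""
    return ZIP_PATTERN.findall(text)


def geocode(zip_codes, lookup):
    """Pair each zip code with a (lat, lon) from `lookup`, skipping unknowns.

    `lookup` is a plain dict standing in for a geocoding service.
    """
    points = []
    for zip_code in zip_codes:
        if zip_code in lookup:
            lat, lon = lookup[zip_code]
            points.append((zip_code, lat, lon))
    return points


def build_kml(points):
    """Wrap (name, lat, lon) tuples in KML tags. KML wants longitude first."""
    placemarks = []
    for name, lat, lon in points:
        placemarks.append(
            "  <Placemark>\n"
            f"    <name>{escape(name)}</name>\n"
            f"    <Point><coordinates>{lon},{lat}</coordinates></Point>\n"
            "  </Placemark>"
        )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<kml xmlns="http://www.opengis.net/kml/2.2">\n'
        "<Document>\n" + "\n".join(placemarks) + "\n</Document>\n</kml>\n"
    )


# Invented sample data, for illustration only.
stories = [
    "We were watching CNN all morning in our apartment, zip 10001.",
    "I prayed with my neighbors in 75201.",
    "CNN was on at the office; I no longer remember the zip code.",
]
SAMPLE_LOOKUP = {"10001": (40.75, -73.99), "75201": (32.79, -96.80)}

# Find matching stories, pull out zip codes, geocode them, and build the KML.
matches = [s for s in stories if re.search(r"\bCNN\b", s)]
zips = [z for s in matches for z in find_zip_codes(s)]
kml = build_kml(geocode(zips, SAMPLE_LOOKUP))
print(kml)
```

In a real run, the list of stories would come from your database query and the lookup would be replaced by requests to whichever geocoding service you have access to.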
For everyone else, here’s the simplest method I could find for reproducing the maps I created. We’re going to use a web-based front end for Yahoo’s geocoding API, Phillip Holmstrand’s very good free service, and then modify the results a bit to make them a little more appropriate for scholarly research.
First of all, you need to put together a spreadsheet in Excel (or Access or any other spreadsheet program; you can also just create a basic text document with columns and tabs between fields so it looks like a spreadsheet). Hopefully you will not be doing this manually; if you can get a tab-delimited text export from the collection you wish to research, that would be ideal. One or more columns should identify the location of the matching document. Make separate columns for street address, city, state/province, and zip codes (if you only have one or a few of these, that’s totally fine). If you have a distinct URL for each document (e.g., a letter or photograph), put that in another column; same for other information such as a caption or description and the title of the document (again, if any). You don’t need these non-location columns; the only reason to include them is if you wish to click on a dot on Google Earth and bring up the corresponding document in your web browser (for closer reading or viewing).
Be sure to title each column, i.e., use text in the topmost cell with specific titles for the columns, with no spaces. I recommend “street_address,” “city,” “state,” “zip_code,” “title,” “description,” and “url” (again, you may only have one or more of these; for the CNN example I used only the zip codes). Once you’re done with the spreadsheet, save it as a tab-delimited text file by using that option in Excel (or Access or whatever) under the menu item “Save as…”
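If your collection can be exported by a script rather than through a spreadsheet program, the same tab-delimited layout is easy to produce directly. Here is a minimal sketch in Python using the column titles suggested above. The function name and the two sample records are made up for illustration, and columns that no record fills in are simply left out.

```python
import csv
import io

# The column titles recommended above, in a fixed order.
COLUMNS = ["street_address", "city", "state", "zip_code",
           "title", "description", "url"]


def to_tab_delimited(records, columns=COLUMNS):
    """Return tab-delimited text with a header row.

    Only columns that at least one record actually fills in are written.
    """
    used = [c for c in columns if any(r.get(c) for r in records)]
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=used, delimiter="\t",
                            extrasaction="ignore", lineterminator="\n")
    writer.writeheader()
    for record in records:
        writer.writerow({c: record.get(c, "") for c in used})
    return buffer.getvalue()


# Invented sample records, for illustration only.
records = [
    {"zip_code": "10001", "title": "Story A"},
    {"zip_code": "75201", "title": "Story B"},
]
text = to_tab_delimited(records)
print(text)
```

Saving `text` to a plain text file gives you something you can paste straight into the geocoder.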
Now open that new file in a text editor like Notepad on the PC or TextEdit on the Mac (or BBEdit or anything else other than a word processor, since Word, e.g., will reformat the text). Make sure that it still looks roughly like a spreadsheet, with the title of the columns at the top and each column separated by some space. Use “Select all” from the “Edit” menu and then “Copy.”
Now open your web browser and go to Phillip Holmstrand’s geocoding website and go through the steps. “Step #1” should have “tab delimited” selected. Paste your columned text into the big box in “Step #2” (you will need to highlight the example text that’s already there and delete it before pasting so that you don’t mingle your data with the example). Click “Validate Source” in “Step #3.” If you’ve done everything right thus far, you will get a green message saying “validated.”
In “Step #4” you will need to match up the titles of your columns with the fields that Yahoo accepts, such as address, zip code, and URL. Phillip’s site is very smart and so will try to do this automatically for you, but you may need to be sure that it has done the matching correctly (if you use the column titles I suggest, it should work perfectly). Remember, you don’t need to select each one of these parameters if you don’t have a column for every one. Just leave them blank.
Click “Run Geocoder” in “Step #5” and watch as the latitudes and longitudes appear in the box in “Step #6.” Wait until the process is totally done. Phillip’s site will then map the first 100 points on a built-in Yahoo map, but we are going to take our data with us and modify it a bit. Select “Download to Google Earth (KML) File” at the bottom of “Step #6.” Remember where you save the file. The default name for that file will be “BatchGeocode.kml”. Feel free to change the name, but be sure to keep “.kml” at the end.
While Phillip’s site takes care of a lot of steps for you, if you try right away to open the KML file in Google Earth you will notice that all of the points are blazing white. This is fine for some uses (show me where the closest Starbucks is right now!), but scholarly research requires the ability to compare different KML files (e.g., between CNN viewers and those who prayed). So we need to implement different colors for distinct datasets.
Open your KML file in a text editor like Notepad or TextEdit. Don’t worry if you don’t know XML or HTML (if you do know these languages, you will feel a bit more comfortable). Right near the top of the document, there will be a section that looks like this:
To color the dots that this file produces on Google Earth, we need to add a set of “color tags” between <IconStyle> and <scale>. Using your text editor, insert “<color></color>” at that point. Now you should have a section that looks like this:
We’re almost done, but unfortunately things get a little more technical. Google uses what’s called an ABGR value for defining colors in Google Earth files. ABGR stands for “alpha, blue, green, red.” In other words, you will have to tell the program how much blue, green, and red you want in the color, plus the alpha value, which determines how opaque or transparent the dot is. Alas, each of these four parts must be expressed in a two-digit hexadecimal format ranging from “00” (no amount) to “ff” (full amount). Combining each of these two-digit values gives you the necessary full string of eight characters. (I know, I know: why not just <color>red</color>? Don’t ask.) Anyhow, a fully opaque red dot would be <color>ff0000ff</color>, since that value has full (“ff”) opacity and full (“ff”) red value (opacity being the first and second places of the eight characters and red being the seventh and eighth places of the eight characters). Welcome to the joyous world of ABGR.
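If you would rather not assemble these strings by hand, the arithmetic is easy to script. Here is a small sketch in Python (the function name `kml_color` is my own invention, not part of any library) that takes ordinary red, green, blue, and opacity values from 0 to 255 and returns the eight characters in the alpha, blue, green, red order that KML’s <color> tag expects.

```python
def kml_color(red, green, blue, alpha=255):
    """Return an eight-character KML color string in aabbggrr order.

    Each argument is an integer from 0 (none) to 255 (full).
    """
    for value in (red, green, blue, alpha):
        if not 0 <= value <= 255:
            raise ValueError("each component must be between 0 and 255")
    # Two lowercase hex digits per component: alpha, blue, green, red.
    return f"{alpha:02x}{blue:02x}{green:02x}{red:02x}"


print(kml_color(255, 0, 0))        # fully opaque red
print(kml_color(0, 0, 255, 127))   # roughly half-transparent blue
```

An alpha of 127 is hexadecimal “7f,” which is about 50% opacity.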
Let me save you some time. I like to use 50% opacity so I can see through dots. That helps give a sense of mass when dots are close to or on top of each other, as is often the case in cities. (You can also vary the size of the dots, but let’s wait for another day on that one.) So: semi-transparent red is “7f0000ff”; semi-transparent blue is “7fff0000”; semi-transparent green is “7f00ff00”; semi-transparent yellow is “7f00ffff”. (No, green and red don’t make yellow when you mix paint, but they do when you mix light, as a screen does. Don’t ask.) So for blue dots that you can see through, as in the CNN example, the final code should have “7fff0000” inserted between <color> and </color>, resulting in:
When you’ve inserted your color choice, save the KML document in your text editor and run the Google Earth application. From within that application, choose “Open…” from the “File” menu and select the KML file you just edited. Google Earth will load the data and you will see colored dots on your map. To compare two datasets, as I did with prayer and CNN viewership, simply open more than one KML file. You can toggle each set of dots on and off by clicking the checkboxes next to their filenames in the middle section of the panel on the left. Zoom and pan, add other datasets (such as population statistics), add a third or fourth KML file. Forget about all the tech stuff and begin your research.
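If you end up recoloring more than a couple of files, the hand edit described above can be scripted. What follows is a rough sketch in Python, assuming a simple KML file like the one the geocoder produces. The function name is hypothetical and the sample snippet, including its scale value, is made up for illustration. It puts a <color> tag right after each opening <IconStyle> tag, replacing one if it is already there.

```python
import re

# An opening <IconStyle> tag, the whitespace after it, and (optionally)
# a <color> element that already sits directly inside it.
ICON_STYLE = re.compile(r"(<IconStyle[^>]*>)(\s*)(?:<color>[^<]*</color>\s*)?")


def set_icon_color(kml_text, color):
    """Return kml_text with <color>color</color> at the start of each IconStyle."""
    if not re.fullmatch(r"[0-9a-fA-F]{8}", color):
        raise ValueError("color must be eight hex digits in aabbggrr order")

    def add_color(match):
        opening, space = match.group(1), match.group(2)
        # Reuse the file's own whitespace so the indentation stays tidy.
        return f"{opening}{space}<color>{color}</color>{space}"

    return ICON_STYLE.sub(add_color, kml_text)


# Made-up sample snippet, for illustration only.
sample = "<Style><IconStyle>\n  <scale>0.5</scale>\n</IconStyle></Style>"
blue = set_icon_color(sample, "7fff0000")
print(blue)
```

To use it on a real file, read the text of your downloaded KML file, pass it through `set_icon_color` with the color you want, and write the result to a new file ending in “.kml” so that each dataset keeps its own color.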
[For those who just want to try out using a KML file for research in Google Earth, here are a few from the September 11 Digital Archive. Right-click (or control-click on a Mac) to save the files to your computer, then open them within Google Earth, which you can download from here. These are files mapping the locations of: those who watched CNN; those who watched Fox News (far fewer than CNN since Fox News was just getting off the ground, but already showing a much more rural audience compared to CNN); and those who prayed on 9/11.]
As the saying goes, imitation is the sincerest form of flattery. So at the Center for History and New Media, we’re currently feeling extremely flattered that our initiatives in collecting and presenting recent history—the Echo Project (covering the history of science, technology, and industry), the September 11 Digital Archive, and the Hurricane Digital Memory Bank—are being imitated by people using a wave of new websites that help them locate recollections, images, and other digital objects on a map. Here’s an example from the mapping site Platial:
And a similar map from our 9/11 project:
Of course, we’re delighted to have imitators (and indeed, in turn we have imitated others), since we are trying to disseminate as widely as possible methods for saving the digital record of the present for future generations. It’s great to see new sites like Platial, CommunityWalk, and Wayfaring providing easy-to-use, collaborative maps that scattered groups of people can use to store photos, memories, and other artifacts.
I was interviewed yesterday by CNN about a new project at the Center for History and New Media, the Hurricane Digital Memory Bank, which uses digital technology to record memories, photographs, and other media related to Hurricanes Katrina, Rita, and Wilma. (CNN is going to feature the project sometime this week on its program The Situation Room.) The HDMB is a democratic historical project similar to our September 11 Digital Archive, which saved the recollections and digital files of tens of thousands of contributors from around the world; this time we’re trying to save thousands of perspectives on what occurred on the Gulf Coast in the fall of 2005. What amazes me is how the interest in online historical projects and collections has exploded recently. Several of the web projects I’ve co-directed over the last five years have engaged in collecting history online. But even a project with as prominent a topic as September 11 took a long time to be picked up by the mass media. This time CNN called us just a few weeks after we launched the website, and before we had done any real publicity. Here are three developments from the last two years that I think account for this sharply increased interest.
Technologies enabling popular writing (blogs) and image sharing (e.g., Flickr) have moved into the mainstream, creating an unprecedented wave of self-documentation and historicizing. Blogs, of course, have given millions of people a taste for daily or weekly self-documentation unseen since the height of diary use in the late nineteenth century. And it used to be fairly complicated to set up an online gallery of one’s photos. Now you can do it with no technical know-how whatsoever, and it’s become much easier for others to find these photos (partly due to tagging/folksonomies). The result is that millions of photographs are being shared daily and the general public is getting used to the instantaneous documentation of events. Look at what happened in the hours after the London subway bombings: the photographic documentation of the event that accumulated on photo-sharing sites within two days would formerly have taken archivists months or even years to compile.
New web services are making combinations of these democratic efforts at documentation feasible and compelling. Our big innovation for the HDMB is to locate each contribution on an interactive map (using the Google Maps API), which allows one to compare the experiences and images from one place (e.g. an impoverished parish in New Orleans) with another (e.g., a wealthier suburb of Baton Rouge). (Can someone please come up with a better word for these combinations than the current “mashups”?) Through the savvy use of unique Technorati or Flickr tags, a scattered group of friends or colleagues can now automatically associate a group of documents or photographs to create an instant collection on an event or issue.
The mass media has almost completely reversed its formerly antagonistic posture toward new media. CNN now has at least two dedicated “Internet reporters” who look for new websites and scan blogs, once disparaged as the last refuge of unpublishable amateurs, for news and commentary. In the last year the blogosphere has actually broken several stories (e.g., the Dan Rather document scandal), and many journalists have started their own blogs. The Washington Post has just hired its first full-time blogger. Technorati now tracks over 24 million blogs; even if 99% of those are discussing the latest on TomKat (the celebrity marriage) or Tomcat (the Apache server technology for Java), there are still a lot of new, interesting perspectives out there to be recorded for posterity.