Category: Google

Google Books: Champagne or Sour Grapes?

Is it possible to have a balanced discussion of Google’s outrageously ambitious and undoubtedly flawed project to scan tens of millions of books in dozens of research libraries? I have noted in this space the advantages and disadvantages of Google Books—sometimes both at one time. Heck, the only time this blog has ever been seriously “dugg” is when I noted the appearance of fingers in some Google scans. Google Books is an easy target.

This week Paul Duguid has received a lot of positive press (e.g., Peter Brantley, if:book) for his dressing down of Google Books, “Inheritance and loss? A brief survey of Google Books.” It’s a very clever article, using poorly scanned Google copies of Lawrence Sterne’s absurdist and raunchy comedy Tristram Shandy to reveal the extent of Google’s folly and their “disrespect” for physical books.

I thought I would enjoy reading Duguid’s article, but I found myself oddly unenthusiastic by the end.

Of course Google has poor scans—as the saying goes, haste makes waste—but this is not a scientific survey of the percentage of pages that are unreadable or missing (surely less than 0.1% in my viewing of scores of Victorian books). Nor does the article note that Google might have possible remedies for some of these inadequacies. For example, they almost certainly have higher-resolution, higher-contrast scans that are different than the lo-res ones they display (a point made at the Million Books workshop; they use the originals for OCR), which they can revisit to produce better copies for the web. Just as they have recently added commentary to Google News, they could have users flag problematic pages. Truly bad books could be rescanned or replaced by other libraries’ versions.

Most egregiously, none of the commentaries I have seen on Duguid’s jeremiad have noted the telling coda to the article: “This paper is based on a talk given to the Society of Scholarly Publishers, San Francisco, 6 June 2007. I am grateful to the Society for the invitation.” The question of playing to the audience obviously arises.

Google Books will never be perfect, or even close. Duguid is right that it disrespects age-old, critical elements of books. (Although his point that Google disrespects metadata strangely fails to note that Google is one of the driving forces behind the Future of Bibliographic Control meetings, which are all about metadata.) Google Books is the outcome, like so many things at Google, of a mathematical challenge: How can you scan tens of millions of books in five years? It’s easy to say they should do a better job and get all the details right, but if you do the calculations of that assessment, you’ll probably see that the perfect library scanning project would take 50 years rather than 5. As in OCR, getting from 98% to 100% accuracy would probably take an order of magnitude longer and be an order of magnitude more expensive. That’s the trade-off they have decided to make, and as a company interested in search, where near-100% accuracy is unnecessary (I have seen OCR specialists estimate that even 90% accuracy is perfectly fine for search), it must have been an easy decision to make.

Complaining about the quality, thoroughness, and fidelity of Google’s (public) scans distracts us from the larger problem of Google Books. As I have argued repeatedly in this space, the real problem—especially for those in the digital humanities but also for many others—is that Google Books is not open. Recently they have added the ability to view some books in “plain text” (i.e., the OCRed text, but it’s hard to copy text from multiple pages at once), and even in some cases to download PDFs of public domain works. But those moves don’t go far enough for scholarly needs. We need what Cliff Lynch of CNI has called “computational access,” a higher level of access that is less about reading a page image on your computer than applying digital tools and analyses to many pages or books at one time to create new knowledge and understanding.

An API would be ideal for this purpose if Google doesn’t want to expose their entire collection. Google has APIs for most of their other projects—why not Google Books?

[Image courtesy of Ubisoft.]

August 16, 2007
Google Books: What’s Not to Like?

The American Historical Association’s Rob Townsend takes some sharp jabs at Google’s ambitious library scanning project. Some of the comments are equally sharp.

May 9, 2007
Google Book Search Now Maps Locations in the Text

Look at the bottom of this page for Illustrated New York: The Metropolis of To-day (1888), digitized by Google at the University of Michigan Library. Using the natural language processing of Google Maps to scan the text for addresses, the locations and surrounding text are placed onto a map of lower Manhattan. A great example of the power of historical data mining and the combination of digital resources via APIs (made easier for Google, of course, because this is all in-house). Kudos to the Google Book Search team.

January 26, 2007
Five Catalonian Libraries Join the Google Library Project

The Google Library Project has, for the most part, focused on American libraries, thus pushing the EU to mount a competing project; will this announcement (which includes the National Library of Barcelona), coming on the heels of an agreement with the Complutense University of Madrid, signal the beginning of Google making inroads in Europe?

January 12, 2007
Mapping What Americans Did on September 11

I gave a talk a couple of days ago at the annual meeting of the Society for American Archivists (to a great audience—many thanks to those who were there and asked such terrific questions) in which I showed how researchers in the future will be able to intelligently search, data mine, and map digital collections. As an example, I presented some preliminary work I’ve done on our September 11 Digital Archive combining text analysis with geocoding to produce overlays on Google Earth that show what people were thinking or doing on 9/11 in different parts of the United States. I promised a follow-up article in this space for those who wanted to learn how I was able to do this. The method provides an overarching view of patterns in a large collection (in the case of the September 11 Digital Archive, tens of thousands of stories), which can then be prospected further to answer research questions. Let’s start with the end product: two maps (a wide view and a detail) of those who were watching CNN on 9/11 (based on a text analysis of our stories database, and colored blue) and those who prayed on 9/11 (colored red).

Google Earth map of the United States showing stories with CNN viewing (blue) and stories with prayer (red) [view full-size version for better detail]

Detail of the Eastern United States [view full-size version for better detail]

By panning and zooming, you can see some interesting patterns. Some of these patterns may be obvious to us, but a future researcher with little knowledge of our present could find out easily (without reading thousands of stories) that prayer was more common in rural areas of the U.S. in our time, and that there was especially a dichotomy between the very religious suburbs (or really, the exurbs) of cities like Dallas and the mostly urban CNN-watchers. (I’ll present more surprising data in this space as we approach the fifth anniversary of 9/11.)

OK, here’s how to replicate this. First, a caveat. Since I have direct access to the September 11 Digital Archive database, as well as the ability to run server-to-server data exchanges with Google and Yahoo (through their API programs), I was able to put together a method that may not be possible for some of you without some programming skills and direct access to similar databases. For those in this blog’s audience who do have that capacity, here’s the quick, geeky version: using regular expressions, form an SQL query into the database you are researching to find matching documents; select geographical information (either from the metadata, or, if you are dealing with raw documents, pull identifying data from the main text by matching, say, 5-digit numbers for zip codes); put these matches into an array, and then iterate through the array to send each location to either Yahoo’s or Google’s geocoding service via their maps API; take the latitude and longitude from the result set from Yahoo or Google and add these to your array; iterate again through the array to create a KML (Keynote Markup Language) file by wrapping each field with the appropriate KML tag.

For everyone else, here’s the simplest method I could find for reproducing the maps I created. We’re going to use a web-based front end for Yahoo’s geocoding API, Phillip Holmstrand’s very good free service, and then modify the results a bit to make them a little more appropriate for scholarly research.

First of all, you need to put together a spreadsheet in Excel (or Access or any other spreadsheet program; you can also just create a basic text document with columns and tabs between fields so it looks like a spreadsheet). Hopefully you will not be doing this manually; if you can get a tab-delimited text export from the collection you wish to research, that would be ideal. One or more columns should identify the location of the matching document. Make separate columns for street address, city, state/province, and zip codes (if you only have one or a few of these, that’s totally fine). If you have a distinct URL for each document (e.g., a letter or photograph), put that in another column; same for other information such as a caption or description and the title of the document (again, if any). You don’t need these non-location columns; the only reason to include them is if you wish to click on a dot on Google Earth and bring up the corresponding document in your web browser (for closer reading or viewing).

Be sure to title each column, i.e., use text in the topmost cell with specific titles for the columns, with no spaces. I recommend “street_address,” “city,” “state,” zip_code,” “title,” “description,” and “url” (again, you may only have one or more of these; for the CNN example I used only the zip codes). Once you’re done with the spreadsheet, save it as a tab-delimited text file by using that option in Excel (or Access or whatever) under the menu item “Save as…”

Now open that new file in a text editor like Notepad on the PC or Textedit on the Mac (or BBEdit or anything else other than a word processor, since Word, e.g., will reformat the text). Make sure that it still looks roughly like a spreadsheet, with the title of the columns at the top and each column separated by some space. Use “Select all” from the “Edit” menu and then “Copy.”

Now open your web browser and go to Phillip Holmstrand’s geocoding website and go through the steps. “Step #1” should have “tab delimited” selected. Paste your columned text into the big box in “Step #2” (you will need to highlight the example text that’s already there and delete it before pasting so that you don’t mingle your data with the example). Click “Validate Source” in “Step #3.” If you’ve done everything right thus far, you will get a green message saying “validated.”

In “Step #4” you will need to match up the titles of your columns with the fields that Yahoo accepts, such as address, zip code, and URL. Phillip’s site is very smart and so will try to do this automatically for you, but you may need to be sure that it has done the matching correctly (if you use the column titles I suggest, it should work perfectly). Remember, you don’t need to select each one of these parameters if you don’t have a column for every one. Just leave them blank.

Click “Run Geocoder” in “Step #5” and watch as the latitudes and longitudes appear in the box in “Step #6.” Wait until the process is totally done. Phillip’s site will then map the first 100 points on a built-in Yahoo map, but we are going to take our data with us and modify it a bit. Select “Download to Google Earth (KML) File” at the bottom of “Step #6.” Remember where you save the file. The default name for that file will be “BatchGeocode.kml”. Feel free to change the name, but be sure to keep “.kml” at the end.

While Phillip’s site takes care of a lot of steps for you, if you try right away to open the KML file in Google Earth you will notice that all of the points are blazing white. This is fine for some uses (show me where the closest Starbucks is right now!), but scholarly research requires the ability to compare different KML files (e.g., between CNN viewers and those who prayed). So we need to implement different colors for distinct datasets.

Open your KML file in a text editor like Notepad or Textedit. Don’t worry if you don’t know XML or HTML (if you do know these languages, you will feel a bit more comfortable). Right near the top of the document, there will be a section that looks like this:

<Style id=”A”><IconStyle><scale>0.8</scale><Icon><href>root://icons/
palette-4.png</href><x>30</x><w>32</w><h>32</h></Icon></IconStyle>
<LabelStyle><scale>0</scale></LabelStyle></Style>

To color the dots that this file produces on Google Earth, we need to add a set of “color tags” between <IconStyle> and <scale>. Using your text editor, insert “<color></color>” at that point. Now you should have a section that looks like this:

<Style id=”A”><IconStyle><color></color><scale>0.8</scale><Icon><href>root:
//icons/palette-4.png</href><x>30</x><w>32</w><h>32</h></Icon></IconStyle>
<LabelStyle><scale>0</scale></LabelStyle></Style>

We’re almost done, but unfortunately things get a little more technical. Google uses what’s called an ABRG value for defining colors in Google Earth files. ABRG stands for “alpha, blue, green, red.” In other words, you will have to tell the program how much blue, green, and red you want in the color, plus the alpha value, which determines how opaque or transparent the dot is. Alas, each of these four parts must be expressed in a two-digit hexidecimal format ranging from “00” (no amount) to “ff” (full amount). Combining each of these two-digit values gives you the necessary full string of eight characters. (I know, I know—why not just <color>red</color>? Don’t ask.) Anyhow, a fully opaque red dot would be <color>ff00ff00</color>, since that value has full (“ff”) opacity and full (“ff”) red value (opacity being the first and second places of the eight characters and red being the fifth and sixth places of the eight characters). Welcome to the joyous world of ABRG.

Let me save you some time. I like to use 50% opacity so I can see through dots. That helps give a sense of mass when dots are close to or on top of each other, as is often the case in cities. (You can also vary the size of the dots, but let’s wait for another day on that one.) So: semi-transparent red is “7f00ff00”; semi-transparent blue is “7fff0000”; semi-transparent green is “7f0000ff”; semi-transparent yellow is “7f00ffff”. (No, green and red don’t make yellow, but they do in this case. Don’t ask.) So for blue dots that you can see through, as in the CNN example, the final code should have “7fff0000″ inserted between <color> and </color>, resulting in:

<Style id=”A”><IconStyle><color>7fff0000</color><scale>0.8</scale><Icon><href>
root://icons/palette-4.png</href><x>30</x><w>32</w><h>32</h></Icon>
</IconStyle><LabelStyle><scale>0</scale></LabelStyle></Style>

When you’ve inserted your color choice, save the KML document in your text editor and run the Google Earth application. From within that application, choose “Open…” from the “File” menu and select the KML file you just edited. Google Earth will load the data and you will see colored dots on your map. To compare two datasets, as I did with prayer and CNN viewership, simply open more than one KML file. You can toggle each set of dots on and off by clicking the checkboxes next to their filenames in the middle section of the panel on the left. Zoom and pan, add other datasets (such as population statistics), add a third or fourth KML file. Forget about all the tech stuff and begin your research.

[For those who just want to try out using a KML file for research in Google Earth, here are a few from the September 11 Digital Archive. Right-click (or control-click on a Mac) to save the files to your computer, then open them within Google Earth, which you can download from here. These are files mapping the locations of: those who watched CNN; those who watched Fox News (far fewer than CNN since Fox News was just getting off the ground, but already showing a much more rural audience compared to CNN); and those who prayed on 9/11.]

August 8, 2006
Google Fingers

No, it’s not another amazing new piece of software from Google, which will type for you (though that would be nice). Just something that I’ve noticed while looking at many nineteenth-century books in Google’s massive digitization project. The following screenshot nicely reminds us that at the root of the word “digitization” is “digit,” which is from the Latin word “digitus,” meaning finger. It also reminds us that despite our perception of Google as a collection of computer geniuses, and despite their use of advanced scanning technology, their library project involves an almost unfathomable amount of physical labor. I’m glad that here and there, the people doing this difficult work (or at least their fingers) are being immortalized.

[The first page of a Victorian edition of Plato’s Euthyphron, a dialogue about the origin and nature of piety. Insert your own joke here about Google’s “Don’t be evil” motto.]

June 26, 2006
Google Book Search Blog

For those interested in the Google book digitization project (one of my three copyright-related stories to watch for 2006), Google launched an official blog yesterday. Right now “Inside Google Book Search” seems more like “Outside Google Book Search,” with a first post celebrating the joys of books and discovery, and with a set of links lauding the project, touting “success stories,” and soliciting participation from librarians, authors, and publishers. Hopefully we’ll get more useful insider information about the progress of the project, hints about new ways of searching millions of books, and other helpful tips for scholars in the near future. As I recently wrote in an article in D-Lib Magazine, Google’s project has some serious—perhaps fatal—flaws for those in the digital humanities (not so for the competing, but much smaller, Open Content Alliance). In particular, it would be nice to have more open access to the text (rather than mere page images) of pre-1923 books (i.e., those that are out of copyright). Of course, I’m a historian of the Victorian era who wants to scan thousands of nineteenth-century books using my own digital tools, not a giant company that may want to protect its very expensive investment in digitizing whole libraries.

May 9, 2006
The Single Box Humanities Search

I recently polled my graduate students to see where they turn to begin research for a paper. I suppose this shouldn’t come as a surprise: the number one answer—by far—was Google. Some might say they’re lazy or misdirected, but the allure of that single box—and how well it works for most tasks—is incredibly strong. Try getting students to go to five or six different search engines for gated online databases such as ProQuest Academic and JSTOR—all of which have different search options and produce a complex array of results compared to Google. I was thinking about this recently as I tested the brand new scholarly search engine from Microsoft, Windows Live Academic. Windows Live Academic is a direct competitor to Google Scholar, which has been in business now for over a year but is still in “beta” (like most Google products). Both are trying to provide that much-desired single box for academic researchers. And while those in the sciences may eventually be happy with this new option from Microsoft (though it’s currently much rougher than Google’s beta, as you’ll see), like Google Scholar, Windows Live Academic is a big disappointment for students, teachers, and professors in the humanities. I suspect there are three main reasons for this lack of a high-quality single box humanities search.

First, a quick test of Google Scholar and Windows Live Academic. Can either one produce the source of the famous “frontier thesis,” probably the best-known thesis in American historiography?

Clearly, the usefulness of these search results are dubious, especially Windows Live Academic (The Political Economy of Land Conflict in the Eastern Brazilian Amazon as the top result?). Why can’t these giant companies do better than this for humanities searches?

Obviously, the people designing and building these “academic” search engines are from a distinct subset of academia: computer science and mathematical fields such as physics. So naturally they focus on their own fields first. Both Google Scholar and Windows Live Academic work fairly well if you would like to know about black holes or encryption. Moreover, “scholarship” in these fields generally means articles, not books. Google Scholar and Windows Live Academic are dominated by journal-based publications, though both sometimes show books in their search results. But when Google Scholar does so, these books seem to appear because articles that match the search terms cite these works, not because of the relevance of the text of the books themselves.

In addition, humanities articles aren’t as easy as scientific papers to subject to bibliometrics—methods such as citation analysis that reveal the most important or influential articles in a field. Science papers tend to cite many more articles (and fewer books) in a way that makes them subject to extensive recursive analysis. Thus a search on “search” on Google Scholar aptly points a researcher to Sergey Brin’s and Larry Page’s seminal paper outlining how Google would work, because hundreds of other articles on search technology dutifully refer to that paper in their opening paragraph or footnote.

Most important, however, is the question of open access. Outlets for scientific articles are more open and indexable by search engines than humanities journals. In addition to many major natural and social science journals, CiteSeer (sponsored by Microsoft) and ArXiv.org make hundreds of thousands of articles on computer science, physics, and mathematics freely available. This disparity in openness compared to humanities scholarship is slowly starting to change—the American Historical Review, for instance, recently made all new articles freely available online—but without a concerted effort to open more gates, finding humanities papers through a single search box will remain difficult to achieve. Microsoft claims in its FAQ for Windows Live Academic that it will get around to including better results for subjects like history, but like Google they are going to have a hard time doing that well without open historical resources.

UPDATE [18 April 2006]: Microsoft has contacted me about this post; they are interested in learning more about what humanities scholars expect from a specialized academic search engine.

UPDATE [21 April 2006]: Bill Turkel makes the great point that Google’s main search does a much better job than Google Scholar at finding the original article and author of the frontier thesis:

April 17, 2006
Mapping Recent History

As the saying goes, imitation is the sincerest form of flattery. So at the Center for History and New Media, we’re currently feeling extremely flattered that our initiatives in collecting and presenting recent history—the Echo Project (covering the history of science, technology, and industry), the September 11 Digital Archive, and the Hurricane Digital Memory Bank—are being imitated by people using a wave of new websites that help them locate recollections, images, and other digital objects on a map. Here’s an example from the mapping site Platial:

And similar map from our 9/11 project:

Of course, we’re delighted to have imitators (and indeed, in turn we have imitated others), since we are trying to disseminate as widely as possible methods for saving the digital record of the present for future generations. It’s great to see new sites like Platial, CommunityWalk, and Wayfaring providing easy-to-use, collaborative maps that scattered groups of people can use to store photos, memories, and other artifacts.

April 11, 2006
Search Engine Optimization for Smarties

A Google search for “Sputnik” gives you an authoritative site from NASA in the top ten search results, but also a web page from the skydiver and ballroom-dancing enthusiast Michael Wright. This wildly democratic mix of sources perennially leads some educators to wring their hands about the state of knowledge, as yet another op-ed piece in the New York Times does today (“Searching for Dummies” by Edward Tenner). It’s a strange moment for the Times to publish this kind of lament; it seems like an op-ed left over from 1997, and as I’ve previously written in this space (and elsewhere with Roy Rosenzweig), contrary to Tenner’s one example of searching in vain for “World History,” online historical information is actually getting better, not worse (especially if you assess the web as a whole rather than complain about a few top search results). Anyway, Tenner does make one very good point: “More owners of free high-quality content should learn the tradecraft of tweaking their sites to improve search engine rankings.” This “tradecraft” is generally called “search engine optimization,” and I’ve long thought I should let those in academia (and other creators of reliable, noncommercial digital resources) in on the not-so-secret ways you can move your website higher up in the Google rankings (as well as in the rankings of other search engines).

1. Start with an appropriate domain name. Ideally, your domain should contain the top keywords you expect people searching for your topic to type into Google. At CHNM we love the name “Echo” for our history of science website, but we probably should have made the URL historyofscience.gmu.edu rather than echo.gmu.edu. Professors like to name digital projects something esoteric or poetic, preferably in Greek or Latin. That’s fine. But make the URL something more meaningful (and yes, more prosaic, if necessary) for search engines. If you read Google’s Web Search API documentation, you’ll realize that their spider can actually parse domain names for keywords, even if you run these words together.

2. If you’ve already launched your website, don’t change its address if it already has a lot of links to it. “Inbound” links are the currency of Google rankings. (You can check on how many links there are to your site by typing “link:[your domain name here]” into Google.) We can’t change Echo’s address now, because it’s already got hundreds of links to it, and those links count for a lot. (Despite the poetic name, we’re in the top ten for “history of science.”) There are some fancy ways to “redirect” sites from an old domain to a new one, but it’s tricky.

3. Get as many links to your site as you can from high-quality, established, prominent websites. Here’s where academics and those working in museums and libraries are at an advantage. You probably already have access to some very high-ranking, respected sites. Work at the Smithsonian or the Library of Congress? Want an extremely high-ranking website on any topic? Simply link to the new website (appropriately named, of course) from the home page of your main site (the home page is generally the best page to get a link from). Wait a month or two and you’re done, because www.si.edu and www.loc.gov wield enormous power in Google’s mathematical ranking system. A related point is…

4. Ask other sites to link to your site using the keywords you want. If you have a site on the Civil War, a bad link is one that says, “Like the Civil War? Check out this site.” A helpful link is one that says, “This is a great site on the Civil War.” If you use the Google Sitemap service, it will tell you what the most popular keywords are in links to your site.

5. Include keywords in file names and directory names across your site, and don’t skimp on the letters. This point is similar to #1, only for subtopics and pages on your site. Have a bibliography of Civil War books? Name the file “civilwarbibliography.html” rather than just “biblio.html” or some nonsense letters or numbers.

6. Speaking of nonsense letters and numbers, if your site is database-driven, recast ungainly numbers and letters in the URL (known in geek-speak as the “query string”), e.g., change www.yoursite.org/archive.cfm?author=x15y&text=15325662&lng=eng to www.yoursite.org/archive/rousseau/emile/english_translation.html. Have someone who knows how to do “URL rewriting” change those URLs to readable strings (if you use the Apache web server software, as 70% of sites do, the software that does this is called “mod_rewrite”; it still keeps those numbers and letters in memory, but doesn’t let the human or machine audiences see them).

7. Be very careful about hiring someone to optimize your site, and don’t do anything shifty like putting white text with your keywords on a white background. Read Google’s warning about search engine optimization and shady methods and their propensity to ban sites for subterfuge.

8. Don’t bother with metatags. Google and other search engines don’t care about these old, hidden HTML tags that were supposed to tell search engines what a web page was about.

9. Be patient. For most sites, it’s a slow rise to the top, accumulating links, awareness in the real world and on the web, etc. Moreover, there is definitely a first-mover advantage—being highly ranked creates a virtuous circle, because by being in the top ten, other sites link to your site because they find it more easily than others. Thus Michael Wright’s page on Sputnik, which is nine years old, remains stubbornly in the top ten. But one of the advantages a lot of academic and nonprofit sites have over the Michael Wrights of the world is that we’re part of institutions that are in it for the long run (and don’t have ballroom dancing classes). I’m more sanguine than Edward Teller that in near future, great sites, many of them from academia, will rise to the top, and be found by all of those Google-centric students the educators worry about.

But these sites (and their producers) could use a little push. Hope this helps.

(You might also want to read the chapter Roy and I wrote on building an audience for your website in Digital History, especially the section that includes a discussion of how Google works, as well as another section of the book on “Site Structure and Good URLs.”)

March 26, 2006