Is it possible to have a balanced discussion of Google’s outrageously ambitious and undoubtedly flawed project to scan tens of millions of books in dozens of research libraries? I have noted in this space the advantages and disadvantages of Google Books—sometimes both at one time. Heck, the only time this blog has ever been seriously “dugg” is when I noted the appearance of fingers in some Google scans. Google Books is an easy target.
This week Paul Duguid has received a lot of positive press (e.g., Peter Brantley, if:book) for his dressing down of Google Books, “Inheritance and loss? A brief survey of Google Books.” It’s a very clever article, using poorly scanned Google copies of Laurence Sterne’s absurdist and raunchy comic novel Tristram Shandy to reveal the extent of Google’s folly and their “disrespect” for physical books.
I thought I would enjoy reading Duguid’s article, but I found myself oddly unenthusiastic by the end.
Of course Google has poor scans—as the saying goes, haste makes waste—but this is not a scientific survey of the percentage of pages that are unreadable or missing (surely less than 0.1% in my viewing of scores of Victorian books). Nor does the article note that Google has possible remedies for some of these inadequacies. For example, they almost certainly have higher-resolution, higher-contrast scans than the low-res ones they display (a point made at the Million Books workshop; they use the originals for OCR), which they could revisit to produce better copies for the web. Just as they have recently added commentary to Google News, they could have users flag problematic pages. Truly bad books could be rescanned or replaced by other libraries’ copies.
Most egregiously, none of the commentaries I have seen on Duguid’s jeremiad have noted the telling coda to the article: “This paper is based on a talk given to the Society of Scholarly Publishers, San Francisco, 6 June 2007. I am grateful to the Society for the invitation.” The question of playing to the audience obviously arises.
Google Books will never be perfect, or even close. Duguid is right that it disrespects age-old, critical elements of books. (Although his point that Google disrespects metadata strangely fails to note that Google is one of the driving forces behind the Future of Bibliographic Control meetings, which are all about metadata.) Google Books is the outcome, like so many things at Google, of a mathematical challenge: How can you scan tens of millions of books in five years? It’s easy to say they should do a better job and get all the details right, but if you run the numbers on that assessment, you’ll probably see that the perfect library scanning project would take 50 years rather than 5. As in OCR, getting from 98% to 100% accuracy would probably take an order of magnitude longer and be an order of magnitude more expensive. That’s the trade-off they have decided to make, and as a company interested in search, where near-100% accuracy is unnecessary (I have seen OCR specialists estimate that even 90% accuracy is perfectly fine for search), it must have been an easy decision to make.
Complaining about the quality, thoroughness, and fidelity of Google’s (public) scans distracts us from the larger problem of Google Books. As I have argued repeatedly in this space, the real problem—especially for those in the digital humanities but also for many others—is that Google Books is not open. Recently they have added the ability to view some books in “plain text” (i.e., the OCRed text, though it’s hard to copy text from multiple pages at once), and even in some cases to download PDFs of public domain works. But those moves don’t go far enough for scholarly needs. We need what Cliff Lynch of CNI has called “computational access,” a higher level of access that is less about reading a page image on your computer than about applying digital tools and analyses to many pages or books at one time to create new knowledge and understanding.
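To make “computational access” concrete: the simplest version is the ability to run an analysis over the plain text of many books at once rather than paging through images. Here is a minimal, hypothetical sketch in Python—the texts and the task (corpus-wide word frequencies, a staple of digital humanities work) are invented for illustration and are not any actual Google Books interface:

```python
# A toy example of "computational access": analyze a whole corpus of
# plain-text OCR output at once, instead of reading page images one by one.
# The corpus below is invented; in practice it would be the OCRed text of
# many scanned books.
from collections import Counter
import re

def word_frequencies(texts):
    """Tally word counts across every text in the corpus."""
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"[a-z']+", text.lower()))
    return counts

# Stand-in for the plain-text OCR of several scanned books.
corpus = [
    "The railway changed the Victorian city.",
    "Victorian readers followed the railway novel.",
]
freqs = word_frequencies(corpus)
print(freqs.most_common(3))  # the three most frequent words across the corpus
```

Trivial as it is, even this kind of analysis is impossible when the only access to a collection is one page image at a time in a browser.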
An API would be ideal for this purpose if Google doesn’t want to expose their entire collection. Google has APIs for most of their other projects—why not Google Books?
[Image courtesy of Ubisoft.]
“the real problem—especially for those in the digital humanities but also for many others—is that Google Books is not open.”
Yes, absolutely. Why not an API for Google Books?
As Brewster Kahle pointed out in a Library Journal interview posted yesterday, “Google says it has the right to scan people’s books to create web services, yet it doesn’t allow other people to scan its scans to create web services. I say let’s have the Golden Rule apply: do unto others.”
Or better yet, why don’t libraries and archives participate in the Open Content Alliance project instead of letting Google come in and hijack their collections?
As Kahle states, “It takes guts on the part of our leadership to keep librarians first-class members of this information world, not just in a service role of feeding the machine and then checking out at the end of the day because everything’s going to be handled by some great search engine in the sky. No. It should be handled by us. We have the tools to build this open world right now. We can invest in ourselves, in the traditions that we come from. This is a choice.”
Oy, everyone wants to be the first-class member, don’t they? Google wants to be head honcho, so it locks its stacks down. The librarians want to remain the gatekeepers, so they scramble to keep the books “in their place” (the library, no doubt).
Maybe the real issue is that the notion of gatekeeper, owner, first-class member needs to be ditched in favor of more useful, functioning, and robust models. Think open-source.
[…] Books, and Patrick Leary, author of “Googling the Victorians.” I’m sticking with my original negative opinion of the article, with which Leary completely agrees […]
[…] readers of this blog know of my aversion to jeremiads about Google, but Rob’s piece is well-reasoned and I agree with much of what he […]
[…] [with OpenSocial and the Open Handset Alliance], why not join the Open Content Alliance?” As I’ve noted in this space, openness is the preeminent question about Google Books, rather than questions of scan or search […]
[…] Shandy to illustrate problems in scanning quality and metadata. But, as Dan Cohen argues in Google Books: Champagne or Sour Grapes?, Google is making a defensible trade-off between rapid, mass digitization and quality control; […]
[…] problems in scanning quality and in the metadata. However, as Dan Cohen argues in “Google Books: Champagne or Sour Grapes?”, Google is operating with a defensible balance between speed, mass digitization, and quality […]
[…] As predicted in this space six months ago, Google has added the ability for users to report missing or poorly scanned pages in their Book Search. (From my post “Google Books: Champagne or Sour Grapes?“: “Just as they have recently added commentary to Google News, they could have users flag problematic pages.”) […]
Hello Dr. Cohen,
Just a quick comment regarding Google Books. Their project is indeed ambitious and will ultimately result in some messy work. However, I don’t know that they are the ones to blame when it comes to scan quality. In my own experience reviewing some of these works (especially Civil War unit histories from the late nineteenth and early twentieth centuries), I have found that the scans are being done by libraries (Harvard comes to mind in at least two instances). The only time that Google does the scanning is when somebody wants their book in Google Books specifically to be presented as a “preview.” In that case, the author either scans their own book and sends in the scans, or sends Google a copy of the book. When the book is sent to Google, they dismantle it in order to make higher-quality scans.
for that, m, I do know that