Why Google Books Should Have an API

[This post is a version of a message I sent to the listserv for CenterNet, the consortium of digital humanities centers. Google has expressed interest in helping CenterNet by providing a (limited) corpus of full texts from their Google Books program, but I have been arguing for an API instead. My sense is that this idea has considerable support but that there are also some questions about the utility of an API, including from within Google.]

My argument for an API over an extracted corpus of books begins with a fairly simple observation: how are we to choose a particular dataset for Google to compile for us? I’m a scholar of the Victorian era, so a large corpus from the nineteenth century would be great, but how about those who study the Enlightenment? If we choose novels, what about those (like me) who focus on scientific literature? Moreover, many of us wish to do more expansive horizontal (across genres in a particular age) and vertical (within the same genre but through large spans of time) analyses. How do we accommodate the wishes of everyone who does computational research in the humanities?

Perhaps some of the misunderstanding here is about the kinds of research a humanities scholar might do as opposed to, say, the computational linguist, who might make use of a dataset or corpus (generally a broad and/or normalized one) to assess the nature of (a) language itself, examine frequencies and patterns of words, or address computer science problems such as document classification. Some of these corpora can provide a historian like me with insights as long as the time span involved is long enough and each document includes important metadata such as publication date (e.g., you can trace the rise and fall of certain historical themes using BYU’s Time Magazine corpus).

But there are many other analyses that humanities scholars could undertake with an API, especially one that allowed them to first search for books of possible interest and then to operate on the full texts of that ad hoc corpus. An example from my own research: in my last book I argued that mathematics was “secularized” in the nineteenth century, and part of my evidence was that mathematical treatises, which normally contained religious language in the early nineteenth century, lost such language by the end of the century. By necessity, researching in the pre-Google Books era, my textual evidence was limited–I could only read a certain number of treatises and chose to focus on the writing of high-profile mathematicians.

How would I go about supporting this thesis today using Google Books? I would of course love to have an exhaustive corpus of mathematical treatises. But in my book I also used published books of poems, sermons, and letters about math. In other words, it’s hard to know exactly what to assemble in advance–just treatises would leave out much of the story and evidence.

Ideally, I would like to use an API to find books that matched a complicated set of criteria (it would be even better if I could use regular expressions to find the many variants of religious language and also to find religious language relatively close to mentions of mathematics), and then use get_cache to acquire the full OCRed text of these matching books. From that ad hoc corpus I would want to do some further computational analyses on my own server, such as extracting references to touchstones for the divine vision of mathematics (e.g., Plato’s later works, geometry rather than number theory), and perhaps even do some aggregate analyses (from which works did British mathematicians most often acquire this religious philosophy of mathematics?). I would also want to examine these patterns over time to see if indeed the bond between religion and mathematics declined in the late Victorian era.
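To make the proximity idea concrete, here is a minimal sketch in Python of the kind of analysis that could run on one's own server once the full texts were in hand. Everything in it is an assumption for illustration: the two word lists are tiny stand-ins for the real vocabulary of religious and mathematical language, the function names are hypothetical, and the input is simply a list of (publication year, OCRed text) pairs rather than anything returned by an actual Google Books API.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny vocabularies; a real study would need
# far richer lists of variants for both kinds of language.
RELIGIOUS = re.compile(r"(?:god|divine|divinity|creator|almighty|providence)", re.I)
MATHEMATICAL = re.compile(r"(?:mathemat\w*|geomet\w*|algebra\w*|arithmetic\w*)", re.I)


def count_cooccurrences(text, window=10):
    """Count (religious word, mathematical word) pairs that sit within
    `window` words of each other in one text."""
    words = re.findall(r"[A-Za-z]+", text)
    religious_at = [i for i, w in enumerate(words) if RELIGIOUS.fullmatch(w)]
    math_at = [i for i, w in enumerate(words) if MATHEMATICAL.fullmatch(w)]
    return sum(1 for r in religious_at for m in math_at if abs(r - m) <= window)


def cooccurrences_by_decade(books, window=10):
    """Sum the pair counts per decade for (year, text) pairs."""
    totals = Counter()
    for year, text in books:
        totals[year // 10 * 10] += count_cooccurrences(text, window)
    return dict(totals)


# Two invented sentences standing in for full OCRed treatises.
sample_books = [
    (1820, "Geometry reveals the divine order that God impressed upon creation."),
    (1890, "Geometry is the study of the properties of space."),
]
print(cooccurrences_by_decade(sample_books))
```

Raw pair counts like these would need to be normalized by the length of each text before any real claim about decline over time could be made; the sketch only shows that the search-then-analyze workflow is simple once the full text is available.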

This is precisely the model I use for my Syllabus Finder. I first find possible syllabi using an algorithm-based set of searches of Google (via the unfortunately deprecated SOAP Search API) while also querying local Center for History and New Media databases for matches. Since I can then extract the full texts of matching web pages from Google (using the API’s cache function), I can do further operations, such as pulling book assignments out of the syllabi (using regular expressions).
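The last step, pulling book assignments out of a syllabus with regular expressions, might look roughly like the following Python sketch. This is not the Syllabus Finder's actual code: the pattern, the function name, and the assumed citation shape ("Author, Title (Publisher, Year)" on its own line) are all hypothetical, and real syllabi are far messier than this.

```python
import re

# Assumed citation shape: "Author, Title (Publisher, Year)" or
# "Author, Title (Year)", one per line. Purely illustrative.
BOOK_LINE = re.compile(
    r"^[ \t]*(?P<author>[A-Z][\w.\- ]*?),\s+"    # author, up to the first comma
    r"(?P<title>[^()\n]+?)\s*"                    # title, up to the parenthesis
    r"\((?:[^()\n]*,\s*)?(?P<year>\d{4})\)",      # optional publisher, then year
    re.MULTILINE,
)


def extract_books(syllabus_text):
    """Return (author, title, year) tuples for lines that look like citations."""
    return [
        (m.group("author").strip(), m.group("title").strip(), int(m.group("year")))
        for m in BOOK_LINE.finditer(syllabus_text)
    ]


sample_syllabus = """Week 1: Introduction
Required reading:
Thomas Kuhn, The Structure of Scientific Revolutions (University of Chicago Press, 1962)
Steven Shapin, The Scientific Revolution (1996)
Office hours: Tuesdays, 2-4 pm
"""

for book in extract_books(sample_syllabus):
    print(book)
```

The point is less the pattern itself than the order of operations: search first to find candidate documents, fetch their full texts, and only then run whatever extraction the research question calls for.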

It seems to me that a model is already in place at Google for such an API for Google Books: their special university researcher’s version of the Search API. That kind of restricted but powerful API program might be ideal because 1) I don’t think an API would be useful without the get_OCRed_text function, which (let’s face it) liberates information that is currently very hard to get even though Google has recently released a plain text view of (only some of) its books; and 2) many of us want to ping the Google Books API with more than the standard daily hit limit for Google APIs.

[Image credit: the best double-entendre cover I could find on Google Books: No Way Out by Beverly Hastings.]

Comments

Alexis says:

That’s funny, I was just last week whining that Google Books doesn’t have an API, or, for that matter, even the ability to do a moderately complex search using metadata at all. Of course, my complaint was a little less academic than your own (being entitled, I believe, “Google Book Search Blows.”)

Hope Greenberg says:

Thank you for suggesting this. It’s hard to believe that Google fell for the classic blunder that many digital text collections make: the belief that people want to find and read specific works rather than mine them for information. As a historian focusing on the 19th century (in my case, on material culture as well as women’s reading and writing) I want to look at things like what women were reading and how they wrote about it, or about how books and magazines feature in women’s writing, or where works by women writers are found in periodicals of the time. Additionally, I want to find examples of what women bought, what they wore, how they bought it, and how they spoke about their purchases.

As a different example, a colleague who studies proverbs is interested in finding examples of specific proverbs in works, then exploring how they are used. In other words, he is looking for the phrase first, then at the context.

In both these examples, trying to choose a select body of works is counterproductive. The broader the selection of materials, the more useful the collection.

Alexis says:

Ooohh!!!! Update!

As of 3 minutes ago, I entered a generic search term (“social”) and Google Books offered a subject search to me! It appears to work using subject:yoursearch as the method.

http://books.google.com/books?q=social+subject:%22Democracy%22&as_brr=1

It still doesn’t allow spidering or offer an API, but allowing broader use of metadata is a huge development, if you ask me.

Alexis says:

Dan,

I’ve been prodding Google Books a lot over the last week or so, and I found some interesting correlations between it and the OCLC Worldcat site. I’m kind of wondering if they aren’t sharing a backend somehow, which might explain Google’s reluctance to allow an API (OCLC prohibits the use of automated processes on its data).

See my post at http://redheadedstepchild.org/lists/scratchpad/entry58/ for a description of how I came to this hypothesis. It’s pure speculation at this point, but some of the parallels are interesting.

[…] the sort of thing that would allow their digital book collection join the microformat web. Dan Cohen recently made his own pitch for a Google Books API, while Alexis Turner has found tantalizing evidence that Google is already sharing their book data […]

[…] integrated into the collection (as Token-X is with the Willa Cather Archive), invoked through an API, or run on collections that we build ourselves by downloading relevant resources. And we need to […]

[…] Cohen has great arguments for the importance of a digitized collection like Google Books not only having an API, but having a good […]

[…] years in this space I have been arguing for the necessity of such access (first envisioned, to give due credit, by Cliff Lynch of CNI). […]

[…] a CSE. This could be done “by hand,” but many observers have noted the desirability of having an API to call results for domain targets against such content pools. Bridging “in” and […]
