Initial Thoughts on the Google Books Ngram Viewer and Datasets

First and foremost, you have to be the most jaded or cynical scholar not to be excited by the release of the Google Books Ngram Viewer and (perhaps even more exciting for the geeks among us) the associated datasets. In the same way that the main Google Books site has introduced many scholars to the potential of digital collections on the web, Google Ngrams will introduce many scholars to the possibilities of digital research. There are precious few easy-to-use tools that allow one to explore text-mining patterns and anomalies; perhaps only Wordle has the same dead-simple, addictive quality as Google Ngrams. Digital humanities needs gateway drugs. Kudos to the pushers on the Google Books team.

Second, on the concurrent launch of “Culturomics”: Naming new fields is always contentious, as is declaring precedence. Yes, it was slightly annoying to have the Harvard/MIT scholars behind this coinage and the article that launched it, Michel et al., stake out supposedly new ground without making sufficient reference to prior work and even (ahem) some vaguely familiar, if simpler, graphs and intellectual justifications. Yes, “Culturomics” sounds like an 80s new wave band. If we’re going to coin neologisms, let’s at least go with Sean Gillies’ satirical alternative: Freakumanities. No, there were no humanities scholars in sight in the Culturomics article. But I’m also sure that longtime “humanities computing” scholars consider advocates of “digital humanities” like me Johnnies-come-lately. Luckily, digital humanities is nice, and so let us all welcome Michel et al. to the fold, applaud their work, and do what we can to learn from their clever formulations. (But c’mon, Cantabs, at least return the favor by following some people on Twitter.)

Third, on the quality and utility of the data: To be sure, there are issues. Some big ones. Mark Davies makes some excellent points about why his Corpus of Historical American English (COHA) might be a better choice for researchers, including more nuanced search options and better variety and normalization of the data. Natalie Binder asks some tough questions about Google’s OCR. On Twitter many of us were finding serious problems with the long “s” before 1800 (Danny Sullivan got straight to the naughty point with his discourse on the history of the f-bomb). But the Freakumanities, er, Culturomics guys themselves talk about this problem in their caveats, as does Google.
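
To see why the long “s” wreaks such havoc: pre-1800 printers set most medial s’s as the long s (ſ), which OCR tends to read as “f,” so “best” comes out as “beft.” Here is a minimal, purely illustrative sketch (my own toy heuristic, not anything Google or the Culturomics team describes) of a dictionary-based repair, and of why naive repairs are dangerous:

```python
# Illustration of the long-s OCR problem: OCR reads the long s (ſ)
# as "f", so "best" becomes "beft". A crude repair heuristic: swap
# f -> s when the f-form is not a known word but the s-form is.
KNOWN_WORDS = {"best", "case", "same", "fast"}  # tiny stand-in lexicon

def repair(word):
    if word in KNOWN_WORDS:
        return word  # already a known word; leave it alone
    candidate = word.replace("f", "s")
    return candidate if candidate in KNOWN_WORDS else word

print([repair(w) for w in ["beft", "cafe", "fast", "fame"]])
```

Note the last case: because my stand-in lexicon happens to omit “fame,” the heuristic corrupts a perfectly good word into “same.” That ambiguity (fame/same, and yes, Danny Sullivan’s pair) is exactly why the pre-1800 numbers can’t be trusted without careful correction.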

Moreover, the data will improve. The Google n-grams are already over a year old, and the plan is to release new data as soon as it can be compiled. In addition, unlike single-language tools such as COHA, Google Ngrams is multilingual. For the first time, historians working on Chinese, French, German, and Spanish sources can do what many of us working in English have been doing for some time. Professors love to look a gift horse in the mouth. But let’s also ride the horse and see where it takes us.

So where does it take us? My initial tests on the viewer and examination of the datasets—which, unlike the public site, allow you to count words not only by overall instances but, critically, by number of pages those instances appear on and number of works they appear in—hint at much work to be done:
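
For those who want to go straight to the datasets: as I read them, the released files are plain tab-separated tables, one row per n-gram per year, with match count, page count, and volume count as columns (the column layout here is my assumption, matching the three counts described above). A minimal sketch of tallying them in Python, using an inline sample rather than the real multi-gigabyte files:

```python
# Sketch of parsing the raw n-gram files, assuming rows of the form
# ngram \t year \t match_count \t page_count \t volume_count
from collections import defaultdict

sample = """marriage\t1820\t512\t498\t120
marriage\t1821\t530\t501\t131
arranged marriage\t1820\t3\t3\t2"""  # made-up illustrative rows

counts_by_year = defaultdict(lambda: [0, 0, 0])  # matches, pages, volumes
for line in sample.splitlines():
    ngram, year, matches, pages, volumes = line.split("\t")
    totals = counts_by_year[(ngram, int(year))]
    totals[0] += int(matches)
    totals[1] += int(pages)
    totals[2] += int(volumes)

print(counts_by_year[("marriage", 1820)])  # [512, 498, 120]
```

The page and volume counts are what make the downloads more interesting than the public viewer: a word that appears 500 times in one book is a very different signal from a word that appears once each in 500 books.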

1) The best possibilities for deeper humanities research are likely in the longer n-grams, not in the unigrams. While everyone obsesses about individual words (guilty here too of unigramism) or about proper names (which are generally bigrams), more elaborate and interesting interpretations are likelier in the 4- and 5-grams since they begin to provide some context. For instance, if you want to look at the history of marriage, charting the word itself is far less interesting than seeing if it co-occurs with words like “loving” or “arranged.” (This is something we learned in working on our NEH-funded grant on text mining for historians.)
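
To make that concrete, here is a toy version of the comparison (the counts are invented for illustration, not real Ngram data): rather than charting “marriage” alone, chart which modifier dominates it in a given year.

```python
# Hypothetical bigram counts per year (illustrative numbers only):
counts = {
    "arranged marriage": {1900: 12, 1950: 40, 2000: 95},
    "loving marriage":   {1900: 30, 1950: 55, 2000: 60},
}

shares = {}
for year in (1900, 1950, 2000):
    a = counts["arranged marriage"][year]
    b = counts["loving marriage"][year]
    shares[year] = a / (a + b)  # "arranged" share of the two bigrams
    print(f"{year}: 'arranged' share = {shares[year]:.2f}")
```

A shifting share between the two bigrams tells a story about the discourse around marriage that the raw frequency of the unigram never could.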

2) We should remember that some of the best uses of Google’s n-grams will come from using this data along with other data. My gripe with the “Culturomics” name was that it implied (from “genomics”) that some single massive dataset, like the human genome, will be the be-all and end-all for cultural research. But much of the best digital humanities work has come from mashing up data from different domains. Creative scholars will find ways to use the Google n-grams in concert with other datasets from cultural heritage collections.

3) Despite my occasional griping about the Culturomists, they did some rather clever things with statistics in the latter part of their article to tease out cultural trends. We historians and humanists should be looking carefully at the more complex formulations of Michel et al., when they move beyond linguistics and unigram patterns to investigate in shrewd ways topics like how fleeting fame is and whether the suppression of authors by totalitarian regimes works. Good stuff.

4) For me, the biggest problem with the viewer and the data is that you cannot seamlessly move from distant reading to close reading, from the bird’s eye view to the actual texts. Historical trends often need to be investigated in detail (another lesson from our NEH grant), and it’s not entirely clear if you move from Ngram Viewer to the main Google Books interface that you’ll get the book scans the data represents. That’s why I have my students use Mark Davies’ Time Magazine Corpus when we begin to study historical text mining—they can easily look at specific magazine articles when they need to.

How do you plan to use the Google Books Ngram Viewer and its associated data? I would love to hear your ideas for smart work in history and the humanities in the comments, and will update this post with my own further thoughts as they occur to me.

44 thoughts on “Initial Thoughts on the Google Books Ngram Viewer and Datasets”

  1. (too long for a tweet)

    Ngrams’ multilingualism is a bit suspect: my very first Russian search yielded texts in Serbian (OK) and Gaelic (?).

    At the same time, though, Davies curates (smaller) Spanish and Portuguese corpora in addition to the COHA. Considering that, as far as I can tell, one can’t *search* multilingually on Ngrams, there’s no real difference between a drop-down menu and clicking on a different part of corpus.byu.edu.

    So while you’re right to say that the data will improve (regarding my top quibble), saying that COHA is not multilingual but ngrams is seems like an unfair comparison.

  2. Have to admit I’m having a very hard time figuring out how to make it useful. I need to go right to the texts, and I need something more like proximity search. It’s somewhat useful to chart the frequency of the word “slave,” but I already knew it was used a lot, and I’m not sure I gain much more by knowing it peaked in 1860. If you enter word pairs it actually gives you the wrong impression: the most interesting stuff happens when words are paired, but it gives you separate lines which only intersect in frequency. In that sense it does more harm than good; it reinforces some kind of odd idea that words bear no relation to other words. (While I don’t know for a fact that there are linguists who believe that, I’d risk a significant sum betting that there are.)

    It looks to me like a lot of time and effort spent to do something fairly useless. I’m having a very hard time seeing what I could do with it; maybe someone can show me.

  3. This is very good news, that Google has started to provide tools and datasets to researchers. I am not sure about the value of this first delivery, but hopefully more is to come, such as references between keywords and authors.
    I have been using Google Books for cross-reference analysis for a while. You can see an example in a short paper I published in The Information Society, Vol. 26:2 (“A New Type of Historical Knowledge”). At least I didn’t create a new discipline name for this 🙂
    Shai Ophir

  4. I’m skeptical. I tried “informer” and got a huge spike in 1820:

    http://ngrams.googlelabs.com/graph?content=informer&year_start=1800&year_end=2000&corpus=0&smoothing=3

    Is that because people suddenly decided to inform or because Google happened to scan a ton of statutes for that particular year that mentioned informers? I think the latter:

    http://www.google.com/search?q=%22informer%22&tbs=bks:1,cdr:1,cd_min:1819,cd_max:1823&lr=lang_en

    So far it is more fun to find errors like these than do actual research.

  5. I am an architectural historian, and it would be really useful for me to isolate certain genres like builders’ guides. This would facilitate comparison of observed field data (like the instances of mansard roofs) to written material. There are plenty of research opportunities in examining the gaps between the two, but as of yet it is quite difficult to quickly examine the corpus of building and homemaking guides. Even if there weren’t many gaps, it’d be useful as a verification of decades of fieldwork.

    One day. A boy can dream. In any event, the inclusion of non-architecturally focused material gives you an idea of how large certain ideas loomed in popular conversation, which is also useful.

    Random note: I searched “Regan.” Predictably, lots of people were writing about him during his presidency, afterward not so much. Then in 2000 there is a Regan explosion.

    I thought it was interesting. Myth making perhaps or maybe the dialogue around all presidents follows this model.

    http://bit.ly/g4N8B1
    http://bit.ly/ftyTLj

  6. The fact that the raw data are available is a significant plus imo. I agree the COHA is quite excellent, and better in many ways, but as far as I can tell it’s a pure web-query interface: the underlying data are not available for download, at least not to the general public.

  7. As an MA student in DH in Ireland, I can see the petty scenario of why anyone would bother to build a tool. As far as Culturomics is concerned, I think it was way too presumptuous to label something that includes only a fraction of the world’s published material…. I love Ngram Viewer, but I would not support it as a Culturomic evaluation. It certainly provides an indicator, but the corpus is only a sampling. On the fact that someone built a tool to help with this, I would have to say it is pretty brilliant given the corpus they have in front of them…
