Initial Thoughts on the Google Books Ngram Viewer and Datasets

First and foremost, you have to be the most jaded or cynical scholar not to be excited by the release of the Google Books Ngram Viewer and (perhaps even more exciting for the geeks among us) the associated datasets. In the same way that the main Google Books site has introduced many scholars to the potential of digital collections on the web, Google Ngrams will introduce many scholars to the possibilities of digital research. There are precious few easy-to-use tools that allow one to explore text-mining patterns and anomalies; perhaps only Wordle has the same dead-simple, addictive quality as Google Ngrams. Digital humanities needs gateway drugs. Kudos to the pushers on the Google Books team.

Second, on the concurrent launch of “Culturomics”: Naming new fields is always contentious, as is declaring precedence. Yes, it was slightly annoying to have the Harvard/MIT scholars behind this coinage and the article that launched it, Michel et al., stake out supposedly new ground without making sufficient reference to prior work and even (ahem) some vaguely familiar, if simpler, graphs and intellectual justifications. Yes, “Culturomics” sounds like an 80s new wave band. If we’re going to coin neologisms, let’s at least go with Sean Gillies’ satirical alternative: Freakumanities. No, there were no humanities scholars in sight in the Culturomics article. But I’m also sure that longtime “humanities computing” scholars consider advocates of “digital humanities” like me Johnnies-come-lately. Luckily, digital humanities is nice, and so let us all welcome Michel et al. to the fold, applaud their work, and do what we can to learn from their clever formulations. (But c’mon, Cantabs, at least return the favor by following some people on Twitter.)

Third, on the quality and utility of the data: To be sure, there are issues. Some big ones. Mark Davies makes some excellent points about why his Corpus of Historical American English (COHA) might be a better choice for researchers, including more nuanced search options and better variety and normalization of the data. Natalie Binder asks some tough questions about Google’s OCR. On Twitter many of us were finding serious problems with the long “s” before 1800 (Danny Sullivan got straight to the naughty point with his discourse on the history of the f-bomb). But the Freakumanities, er, Culturomics guys themselves talk about this problem in their caveats, as does Google.
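
To see why the long “s” wreaks such havoc: pre-1800 printers set most medial s’s as the long s (ſ), which OCR tends to read as “f,” so “best” comes out as “beft.” Here is a minimal, purely illustrative sketch (my own toy heuristic, not anything Google or the Culturomics team describes) of a dictionary-based repair, and of why naive repairs are dangerous:

```python
# Illustration of the long-s OCR problem: OCR reads the long s (ſ)
# as "f", so "best" becomes "beft". A crude repair heuristic: swap
# f -> s when the f-form is not a known word but the s-form is.
KNOWN_WORDS = {"best", "case", "same", "fast"}  # tiny stand-in lexicon

def repair(word):
    if word in KNOWN_WORDS:
        return word  # already a known word; leave it alone
    candidate = word.replace("f", "s")
    return candidate if candidate in KNOWN_WORDS else word

print([repair(w) for w in ["beft", "cafe", "fast", "fame"]])
```

Note the last case: because my stand-in lexicon happens to omit “fame,” the heuristic corrupts a perfectly good word into “same.” That ambiguity (fame/same, and yes, Danny Sullivan’s pair) is exactly why the pre-1800 numbers can’t be trusted without careful correction.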

Moreover, the data will improve. The Google n-grams are already over a year old, and the plan is to release new data as soon as it can be compiled. In addition, unlike single-language tools such as COHA, Google Ngrams is multilingual. For the first time, historians working on Chinese, French, German, and Spanish sources can do what many of us working in English have been doing for some time. Professors love to look a gift horse in the mouth. But let’s also ride the horse and see where it takes us.

So where does it take us? My initial tests on the viewer and examination of the datasets—which, unlike the public site, allow you to count words not only by overall instances but, critically, by number of pages those instances appear on and number of works they appear in—hint at much work to be done:
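
For those who want to go straight to the datasets: as I read them, the released files are plain tab-separated tables, one row per n-gram per year, with match count, page count, and volume count as columns (the column layout here is my assumption, matching the three counts described above). A minimal sketch of tallying them in Python, using an inline sample rather than the real multi-gigabyte files:

```python
# Sketch of parsing the raw n-gram files, assuming rows of the form
# ngram \t year \t match_count \t page_count \t volume_count
from collections import defaultdict

sample = """marriage\t1820\t512\t498\t120
marriage\t1821\t530\t501\t131
arranged marriage\t1820\t3\t3\t2"""  # made-up illustrative rows

counts_by_year = defaultdict(lambda: [0, 0, 0])  # matches, pages, volumes
for line in sample.splitlines():
    ngram, year, matches, pages, volumes = line.split("\t")
    totals = counts_by_year[(ngram, int(year))]
    totals[0] += int(matches)
    totals[1] += int(pages)
    totals[2] += int(volumes)

print(counts_by_year[("marriage", 1820)])  # [512, 498, 120]
```

The page and volume counts are what make the downloads more interesting than the public viewer: a word that appears 500 times in one book is a very different signal from a word that appears once each in 500 books.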

1) The best possibilities for deeper humanities research are likely in the longer n-grams, not in the unigrams. While everyone obsesses about individual words (guilty here too of unigramism) or about proper names (which are generally bigrams), more elaborate and interesting interpretations are likelier in the 4- and 5-grams since they begin to provide some context. For instance, if you want to look at the history of marriage, charting the word itself is far less interesting than seeing if it co-occurs with words like “loving” or “arranged.” (This is something we learned in working on our NEH-funded grant on text mining for historians.)
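
To make that concrete, here is a toy version of the comparison (the counts are invented for illustration, not real Ngram data): rather than charting “marriage” alone, chart which modifier dominates it in a given year.

```python
# Hypothetical bigram counts per year (illustrative numbers only):
counts = {
    "arranged marriage": {1900: 12, 1950: 40, 2000: 95},
    "loving marriage":   {1900: 30, 1950: 55, 2000: 60},
}

shares = {}
for year in (1900, 1950, 2000):
    a = counts["arranged marriage"][year]
    b = counts["loving marriage"][year]
    shares[year] = a / (a + b)  # "arranged" share of the two bigrams
    print(f"{year}: 'arranged' share = {shares[year]:.2f}")
```

A shifting share between the two bigrams tells a story about the discourse around marriage that the raw frequency of the unigram never could.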

2) We should remember that some of the best uses of Google’s n-grams will come from using this data along with other data. My gripe with the “Culturomics” name was that it implied (from “genomics”) that some single massive dataset, like the human genome, will be the be-all and end-all for cultural research. But much of the best digital humanities work has come from mashing up data from different domains. Creative scholars will find ways to use the Google n-grams in concert with other datasets from cultural heritage collections.

3) Despite my occasional griping about the Culturomists, they did some rather clever things with statistics in the latter part of their article to tease out cultural trends. We historians and humanists should be looking carefully at the more complex formulations of Michel et al., when they move beyond linguistics and unigram patterns to investigate in shrewd ways topics like how fleeting fame is and whether the suppression of authors by totalitarian regimes works. Good stuff.

4) For me, the biggest problem with the viewer and the data is that you cannot seamlessly move from distant reading to close reading, from the bird’s eye view to the actual texts. Historical trends often need to be investigated in detail (another lesson from our NEH grant), and it’s not entirely clear if you move from Ngram Viewer to the main Google Books interface that you’ll get the book scans the data represents. That’s why I have my students use Mark Davies’ Time Magazine Corpus when we begin to study historical text mining—they can easily look at specific magazine articles when they need to.

How do you plan to use the Google Books Ngram Viewer and its associated data? I would love to hear your ideas for smart work in history and the humanities in the comments, and will update this post with my own further thoughts as they occur to me.

44 thoughts on “Initial Thoughts on the Google Books Ngram Viewer and Datasets”

  1. (too long for a tweet)

    Ngrams’ multilingualism is a bit suspect: my very first Russian search yielded texts in Serbian (OK) and Gaelic (?).

    At the same time, though, Davies curates (smaller) Spanish and Portuguese corpora in addition to the COHA. Considering that, as far as I can tell, one can’t *search* multilingually on Ngrams, there’s no real difference between a drop-down menu and clicking on a different part of corpus.byu.edu.

    So while you’re right to say that the data will improve (regarding my top quibble), saying that COHA is not multilingual but ngrams is seems like an unfair comparison.

  2. Have to admit I’m having a very hard time figuring out how to make it useful. I need to go right to the texts, and I need something more like proximity search. It’s somewhat useful to chart the frequency of the word “slave,” but I already knew it was used a lot, and I’m not sure I gain much more by knowing it peaked in 1860. If you enter word pairs it actually gives you the wrong impression: the most interesting stuff happens when words are paired, but it gives you separate lines which only intersect in frequency. In that sense it does more harm than good; it reinforces some kind of odd idea that words bear no relation to other words. (While I don’t know for a fact that there are linguists who believe that, I’d risk a significant sum betting that there are.)

    It looks to me like a lot of time and effort spent to do something fairly useless. I’m having a very hard time seeing what I could do with it; maybe someone can show me.

  3. This is very good news, that Google has started to provide tools and datasets to researchers. I am not sure about the value of this first delivery, but hopefully more is to come, such as references between keywords and authors.
    I have been using Google Books for cross-reference analysis for a while. You can see an example in a short paper I published in The Information Society, Vol. 26:2 (“A New Type of Historical Knowledge”). At least I didn’t create a new discipline name for this 🙂
    Shai Ophir

  4. I’m skeptical. I tried “informer” and got a huge spike in 1820:

    http://ngrams.googlelabs.com/graph?content=informer&year_start=1800&year_end=2000&corpus=0&smoothing=3

    Is that because people suddenly decided to inform or because Google happened to scan a ton of statutes for that particular year that mentioned informers? I think the latter:

    http://www.google.com/search?q=%22informer%22&tbs=bks:1,cdr:1,cd_min:1819,cd_max:1823&lr=lang_en

    So far it is more fun to find errors like these than do actual research.

  5. I am an architectural historian, and it would be really useful for me to isolate certain genres like builders’ guides. This would facilitate comparison of observed field data (like the instances of mansard roofs) to written material. There are plenty of research opportunities in examining the gaps between the two, but as of yet it is quite difficult to quickly examine the corpus of building and homemaking guides. Even if there weren’t many gaps, it’d be useful as a verification of decades of fieldwork.

    One day. A boy can dream. In any event, the inclusion of non-architecturally focused material gives you an idea of how large certain ideas loomed in popular conversation, which is also useful.

    Random note: I searched “Regan.” Predictably, lots of people were writing about him during his presidency, afterward not so much. Then in 2000 there is a Regan explosion.

    I thought it was interesting. Myth making perhaps or maybe the dialogue around all presidents follows this model.

    http://bit.ly/g4N8B1
    http://bit.ly/ftyTLj

  6. The fact that the raw data are available is a significant plus imo. I agree the COHA is quite excellent, and better in many ways, but as far as I can tell it’s a pure web-query interface: the underlying data are not available for download, at least not to the general public.

  7. As an MA student in DH in Ireland, I can see the petty scenario of why anyone would bother to build a tool. As far as Culturomics is concerned, I think it was way too presumptuous to label something that includes only a fraction of the world’s published material…. I love Ngram Viewer, but I would not support it as a Culturomic evaluation. It certainly provides an indicator, but the corpus is only a sampling. On the fact that someone built a tool to help with this, I would have to say it is pretty brilliant given the corpus they have in front of them…
