Initial Thoughts on the Google Books Ngram Viewer and Datasets

First and foremost, you have to be the most jaded or cynical scholar not to be excited by the release of the Google Books Ngram Viewer and (perhaps even more exciting for the geeks among us) the associated datasets. In the same way that the main Google Books site has introduced many scholars to the potential of digital collections on the web, Google Ngrams will introduce many scholars to the possibilities of digital research. There are precious few easy-to-use tools that allow one to explore text-mining patterns and anomalies; perhaps only Wordle has the same dead-simple, addictive quality as Google Ngrams. Digital humanities needs gateway drugs. Kudos to the pushers on the Google Books team.

Second, on the concurrent launch of “Culturomics”: Naming new fields is always contentious, as is declaring precedence. Yes, it was slightly annoying to have the Harvard/MIT scholars behind this coinage and the article that launched it, Michel et al., stake out supposedly new ground without making sufficient reference to prior work and even (ahem) some vaguely familiar, if simpler, graphs and intellectual justifications. Yes, “Culturomics” sounds like an 80s new wave band. If we’re going to coin neologisms, let’s at least go with Sean Gillies’ satirical alternative: Freakumanities. No, there were no humanities scholars in sight in the Culturomics article. But I’m also sure that longtime “humanities computing” scholars consider advocates of “digital humanities” like me Johnnies-come-lately. Luckily, digital humanities is nice, and so let us all welcome Michel et al. to the fold, applaud their work, and do what we can to learn from their clever formulations. (But c’mon, Cantabs, at least return the favor by following some people on Twitter.)

Third, on the quality and utility of the data: To be sure, there are issues. Some big ones. Mark Davies makes some excellent points about why his Corpus of Historical American English (COHA) might be a better choice for researchers, including more nuanced search options and better variety and normalization of the data. Natalie Binder asks some tough questions about Google’s OCR. On Twitter many of us were finding serious problems with the long “s” before 1800 (Danny Sullivan got straight to the naughty point with his discourse on the history of the f-bomb). But the Freakumanities, er, Culturomics guys themselves talk about this problem in their caveats, as does Google.

Moreover, the data will improve. The Google n-grams are already over a year old, and the plan is to release new data as soon as it can be compiled. In addition, unlike text-mining tools like COHA, Google Ngrams is multilingual. For the first time, historians working on Chinese, French, German, and Spanish sources can do what many of us have been doing for some time. Professors love to look a gift horse in the mouth. But let’s also ride the horse and see where it takes us.

So where does it take us? My initial tests on the viewer and examination of the datasets—which, unlike the public site, allow you to count words not only by overall instances but, critically, by number of pages those instances appear on and number of works they appear in—hint at much work to be done:

1) The best possibilities for deeper humanities research are likely in the longer n-grams, not in the unigrams. While everyone obsesses about individual words (guilty here too of unigramism) or about proper names (which are generally bigrams), more elaborate and interesting interpretations are likelier in the 4- and 5-grams since they begin to provide some context. For instance, if you want to look at the history of marriage, charting the word itself is far less interesting than seeing if it co-occurs with words like “loving” or “arranged.” (This is something we learned in working on our NEH-funded grant on text mining for historians.)

2) We should remember that some of the best uses of Google’s n-grams will come from using this data along with other data. My gripe with the “Culturomics” name was that it implied (from “genomics”) that some single massive dataset, like the human genome, will be the be-all and end-all for cultural research. But much of the best digital humanities work has come from mashing up data from different domains. Creative scholars will find ways to use the Google n-grams in concert with other datasets from cultural heritage collections.

3) Despite my occasional griping about the Culturomists, they did some rather clever things with statistics in the latter part of their article to tease out cultural trends. We historians and humanists should be looking carefully at the more complex formulations of Michel et al., when they move beyond linguistics and unigram patterns to investigate in shrewd ways topics like how fleeting fame is and whether the suppression of authors by totalitarian regimes works. Good stuff.

4) For me, the biggest problem with the viewer and the data is that you cannot seamlessly move from distant reading to close reading, from the bird’s eye view to the actual texts. Historical trends often need to be investigated in detail (another lesson from our NEH grant), and it’s not entirely clear that, if you move from the Ngram Viewer to the main Google Books interface, you’ll get the book scans the data represent. That’s why I have my students use Mark Davies’ Time Magazine Corpus when we begin to study historical text mining: they can easily look at specific magazine articles when they need to.
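
To make the first point a little more concrete, here is a minimal Python sketch of the marriage example: it scans 5-gram rows for a target word and tallies, per year, how many works contain it alongside a companion word such as “loving” or “arranged.” Treat it as an illustration under assumptions, not a recipe. I am assuming tab-separated rows in the order n-gram, year, instance count, page count, volume count; check the column order against the files you actually download. The function name and the sample rows are invented for the example.

```python
from collections import defaultdict


def cooccurrence_by_year(rows, target, companions):
    """Tally volume counts per year for n-grams that contain `target`
    together with each companion word.

    ASSUMPTION: each row is "ngram<TAB>year<TAB>instances<TAB>pages<TAB>volumes".
    Note that summing volume counts across different n-grams can count
    the same book more than once, so read the totals as rough signals.
    """
    totals = {word: defaultdict(int) for word in companions}
    for row in rows:
        fields = row.rstrip("\n").split("\t")
        if len(fields) != 5:
            continue  # skip rows that do not match the assumed layout
        ngram, year, _instances, _pages, volumes = fields
        tokens = ngram.lower().split()
        if target not in tokens:
            continue
        for word in companions:
            if word in tokens:
                totals[word][int(year)] += int(volumes)
    return {word: dict(years) for word, years in totals.items()}


# Invented sample rows, for illustration only (not real Google data).
SAMPLE_ROWS = [
    "a loving marriage is a\t1900\t10\t9\t5",
    "the arranged marriage of her\t1900\t4\t4\t3",
    "an arranged marriage was not\t1900\t2\t2\t2",
    "loving marriage and a happy\t1950\t7\t7\t6",
    "the history of the world\t1900\t50\t40\t30",
]

if __name__ == "__main__":
    print(cooccurrence_by_year(SAMPLE_ROWS, "marriage", ["loving", "arranged"]))
```

Counting by works rather than by raw instances is a deliberate choice here: a single book that repeats a phrase hundreds of times should not swamp the trend. Swapping in the page or instance column is a one-line change.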

How do you plan to use the Google Books Ngram Viewer and its associated data? I would love to hear your ideas for smart work in history and the humanities in the comments, and will update this post with my own further thoughts as they occur to me.

December 19, 2010

Books, Google, Text Mining

44 responses to “Initial Thoughts on the Google Books Ngram Viewer and Datasets”

Moacir

December 19, 2010

(too long for a tweet)

Ngrams’s multilingualness is a bit suspect–my very first Russian search yielded texts in Serbian (ok) and Gaelic (?).

At the same time, though, Davies curates (smaller) Spanish and Portuguese corpora in addition to the COHA. Considering that, as far as I can tell, one can’t *search* multilingually on ngrams, there’s no real difference between a drop down menu and clicking on a different part of corpus.byu.edu.

So while you’re right to say that the data will improve (regarding my top quibble), saying that COHA is not multilingual but ngrams is seems like an unfair comparison.

mike o’malley

December 20, 2010

Have to admit I’m having a very hard time figuring out how to make it useful. I need to go right to the texts, and I need something more like proximity search. It’s somewhat useful to chart the frequency of the word “slave,” but I already knew it was used a lot and I’m not sure I gain much more by knowing it peaked in 1860. If you enter word pairs it actually gives you the wrong impression–the most interesting stuff happens when words are paired. But it gives you separate lines which only intersect in frequency. In that sense it does more harm than good–it reinforces some kind of odd idea that words bear no relation to other words. (While I don’t know for a fact that there are linguists who believe that, I’d risk a significant sum betting that there are.)

It looks to me like a lot of time and effort spent to do something fairly useless. I’m having a very hard time seeing what I could do with it–maybe someone can show me.

Martin Foys

December 20, 2010

Yes. When this gets linked up to topic modeling, things are going to get interesting.

chris forster · Google nGrams: Quick Response to Mike O’Malley

December 20, 2010

[…] search which someone much cleverer than I came up with of “beft/best.” (Dan Cohen mentions this example with reference to Danny Sullivan’s post.) That one image confirms what we already know about […]

Shai Ophir

December 21, 2010

It is very good news that Google has started to provide tools and datasets to researchers. I am not sure about the value of this first delivery, but hopefully there is more to come – such as references between keywords and authors.
I have been using Google Books for cross-reference analysis for a while. You can see an example in a short paper I published in The Information Society, Vol 26:2 (“A New Type of Historical Knowledge”). At least I didn’t create a new discipline name for this 🙂
Shai Ophir

My 11 Favorite Things from the Interwebz This Year | an/archivista

December 21, 2010

[…] occurrence over time. In addition to the tool, Google’s also making the raw data available. Dan Cohen calls the viewer a “gateway drug to the digital humanities,” and I hope that gateway […]

Elena Razlogova

December 22, 2010

I’m skeptical. I tried “informer” and got a huge spike in 1820:

http://ngrams.googlelabs.com/graph?content=informer&year_start=1800&year_end=2000&corpus=0&smoothing=3

Is that because people suddenly decided to inform or because google happened to scan a ton of statutes for that particular year that mentioned informers? I think the latter:

http://www.google.com/search?q=%22informer%22&tbs=bks:1,cdr:1,cd_min:1819,cd_max:1823&lr=lang_en

So far it is more fun to find errors like these than do actual research.

Mike Gushard

December 22, 2010

I am an architectural historian and it would be really useful for me to isolate certain genres like builder’s guides. This would facilitate comparison of observed field data (like the instances of mansard roofs) to written material. There are plenty of research opportunities in examining the gaps between the two, but as of yet it is quite difficult to quickly examine the corpus of building and home making guides. Even if there weren’t many gaps it’d be useful as a verification of decades of fieldwork.

One day. A boy can dream. In any event, the inclusion of non-architecturally focused material gives you an idea of how large certain ideas loomed in popular conversation, which is also useful.

Random note: I searched “Regan.” Predictably, lots of people were writing about him during his presidency, afterward not so much. Then in 2000 there is a Regan explosion.

I thought it was interesting. Myth making perhaps or maybe the dialogue around all presidents follows this model.

http://bit.ly/g4N8B1
http://bit.ly/ftyTLj

Mike Gushard

December 22, 2010

*Reagan not Regan of course.

This is what I get for trying to tap a reply out on my phone. Damn you auto-complete!

Mark N.

December 27, 2010

The fact that the raw data are available is a significant plus imo. I agree the COHA is quite excellent, and better in many ways, but as far as I can tell it’s a pure web-query interface: the underlying data are not available for download, at least not to the general public.

Dave

December 30, 2010

We created a Facebook page for people to share interesting ngrams. You can check it out here:

http://www.facebook.com/nteresting.ngrams

L’interprétation des graphiques produits par Ngram Viewer » Article » OWNI, Digital Journalism

January 11, 2011

[…] methodological and epistemological questions to the Socioargu article as well as to those by Dan Cohen [en] and Olivier Ertzscheid, and to the discussion on Language Log […]

New toy from Google Labs « MLibrary Chatter: The Shhh! Stops Here

January 12, 2011

[…] the Google Books Ngram Viewer is not just a toy, as Dan Cohen, historian and digital humanities guru, explains. It is kind of fun, though, and also offers a glimpse of the potential for exploration and research […]

Inaugural edition of the Digital Humanities Blog Carnival | nicomachus.net

January 17, 2011

[…] Initial Thoughts on the Google Books Ngram Viewer and Datasets, Dan Cohen shares his reflections on how Google’s Ngram Viewer might be useful to humanities […]

N-Grams « the long nineteenth century

March 14, 2011

[…] fun to play with, but, as I’m sure all my hermeneutically suspicious readers know, there are plenty of objections to taking the findings seriously. The team of non-digital-humanist scientists behind […]

Jonathan Stray » A computational journalism reading list

April 2, 2011

[…] On a related note, Google N-gram viewer lets you look at the frequency of short phrases within 4% of all books published, ever. The excellent paper gives examples of how to use this for cultural research. Dan Cohen has important criticisms. […]

jumping on the ngram bandwagon at RMCLAS | parezco y digo

April 5, 2011

[…] excitement and wariness over the prospects for a humanist mining of the corpus. (See, for example, Dan Cohen and Mike O’Malley for historians who are cautiously optimistic and crankily skeptical […]

Kulturomia i Google Ngram Viewer – historiaimedia.org

May 31, 2011

[…] data mining, even if the quality of this research leaves much to be desired. Dan Cohen argues that this project may be of great importance for promoting the idea of humanities research […]

Week One Lecture, Part Two of Two: Close Readings, Distant Readings, and Everything In-Between « Literature in a Wired World

July 11, 2011

[…] (collection) of work (note that the viewer has some major issues, some of which Dan Cohen discusses in this blog post; read more on how it works here). Again, it’s a simple tool: you enter one or more words and […]

Brian Sarnacki | <!– History Grad Student –>

September 1, 2011

[…] One of McNeely’s final thoughts, that academics in the laboratories are beginning to confront “humanities scholars on their own turf,” (273) provides the best example of why humanists must assert their position in the laboratory. McNeely’s suggestion also appears particularly accurate in the light of the “discovery” of “Culturomics.” A good example of the university-industrial complex that dominates many research universities, Harvard and MIT researchers teamed up with the Google Books project to examine an unprecedented amount of written works throughout history. While a useful tool, the, as McNeely puts it, “hubris that comes from transgressing disciplinary boundaries” (273) is painfully clear. Leading humanists from any number of the universities in the Boston area could have been included, but were not. If there had been more humanists, perhaps the academics who “founded” “Culturomics” may have realized this type of research had already been happening for years. […]

Data visualization; or the outside corner as fool’s errand | Dylan Mulvin

October 1, 2011

[…] be a lot of ambivalence about what data visualization can accomplish. For every response from the Digital Humanities crew on n-grams there is a management seminar on data visualization for “business […]

My Sandbox · Digging into Data

October 29, 2011

[…] second article by Dan Cohen, “Initial Thoughts on the GoogleBooks Ngram viewer and datasets” provides more current reflexions on where the potential of data mining currently stands. Google […]

Data Mining: Instead of Finding the Needle in the Haystack, Realizing that the Hay may be More Interesting! « The Journey to Enlightenment: Making the Leap to the Digital Age

October 30, 2011

[…] on Ngrams and surveying the Voyeur tools and pondering tag clouds, I think I can safely echo Mike O’Malley’s comments (comment #2) that finding the utility seems rather mystifying. Do I need to know the frequency of […]

Gateway Drugs, Statistical Analysis, and Text Mining | sackerman51

October 31, 2011

[…] the readings on data mining for this week, I got a little sidetracked thinking about Professor Cohen’s analysis that “Digital Humanities needs gateway drugs. Kudos to the pushers on the Google books […]

Information and Data in the Digital Age | History in the Digital

January 13, 2012

[…] by Google. Although their assertion that they are breaking new ground is annoying (since it ignores previous work), their approach to things like the appearance and disappearance of fame is interesting and raises […]

Google and Digital Humanities | THATCamp Ohio State University

February 22, 2012

[…] The Ngram Viewer uses the raw data (OCR’d text) from the Google Books project and lets the user search for the incidence of particular words in published books over the last couple of centuries. (Try it, it’s fun!) Compared to the controversy surrounding the Google Books project, the debates around the Ngram viewer have been small potatoes, but they sure make interesting reading for DHers. Dan Cohen wrote a great blog post about it. […]

Critique of Google’s Ngram Viewer: « jhwhistory

March 23, 2012

[…] D, ‘Initial thoughts on the Google Books Ngram Viewer and datasets’, http://www.dancohen.org/2010/12/19/initial-thoughts-on-the-google-books-ngram-viewer-and-datasets/, consulted on […]

Hist 696: Digging into Data « Wandering but not Lost

March 24, 2012

[…] second article by Dan Cohen, “Initial Thoughts on the GoogleBooks Ngram viewer and datasets” provides more current reflexions on where the potential of data mining currently stands. Google […]

Reminder | Adventures in Digital History 3.0

March 25, 2012

[…] Oct 2006). “Applying Quantitative Analysis to Classic Lit,” Wired, Dec. 2009; Cohen, Google Books, Ngrams and Culturomics; Rob Nelson, Mining the Dispatch. This entry was posted in Announcements, dh2012 by jmcclurken. […]

shinenkan

October 1, 2012

[…] [3] http://www.dancohen.org/2010/12/19/initial-thoughts-on-the-google-books-ngram-viewer-and-datasets/ […]

Sapping Attention: Keeping the words in Topic Models

January 10, 2013

[…] first time I myself read up on topic modeling was after seeing it referenced in the comments to Dan Cohen's first post about Google Ngrams.) Bookworm is obviously similar to Ngrams: it's designed to keep the Ngrams strategy of […]

Out of Vogue, Out of Mind? – Old Colonies in the New Empire | Christopher M. Church

February 15, 2013

[…] more on the advantages and limitations of Google nGrams, please see here and […]

What is datamining, and does it encourage the creation of a specific kind of history? | chicirafoster

April 18, 2013

[…] http://www.dancohen.org/2010/12/19/initial-thoughts-on-the-google-books-ngram-viewer-and-datasets/; consulted 10 April […]

Participations » Prendre les procédures au sérieux

May 14, 2013

[…] on the Google Books Ngram Viewer and Datasets », Dan Cohen’s Digital Humanities Blog, http://www.dancohen.org/2010/12/19/initial-thoughts-on-the-google-books-ngram-viewer-and-datasets/ (accessed […]

Google Ngrams Viewer: How good is it really? | UH Digital History Blog

March 1, 2014

[…] Overall Google Ngram Viewer has a lot to offer historians. It allows them to see patterns or trends in data over a longer period than would be possible if they were researching through traditional methods. It stores a vast amount of data in a small space which can be accessed immediately. Finally, it offers historians a simple and manageable tool in the emerging and sometimes complicated discipline of Digital History, as Dan Cohen discusses here. […]

How wary do historians have to be when using Ngram Viewer? | sarahburginhistory

March 3, 2014

[…] The site has Application Programming Interface allowing the user to manipulate the data by entering predictive words and to export information for their own use/research. Under advanced usage there is a guide on how to perform a more in depth search using features like part-of-speech tags and wild card, inflection and case insensitive searches. There are ngram compositions which are operators that can be used to combine ngrams for use in topic modelling and better comparisons and analysis leading to more interesting interpretations. […]

How does the digital change the nature of historical research? | cherylmcentenaryblog

April 23, 2014

[…] Daniel, J., ‘Initial Thoughts on the Google Books Ngram Viewer and Datasets’, http://www.dancohen.org/2010/12/19/initial-thoughts-on-the-google-books-ngram-viewer-and-datasets/; consulted 1st March […]

How does the digital change the nature of historical research? | UH Digital History Blog

April 23, 2014

[…] Daniel, J., ‘Initial Thoughts on the Google Books Ngram Viewer and Datasets’, http://www.dancohen.org/2010/12/19/initial-thoughts-on-the-google-books-ngram-viewer-and-datasets/; consulted 1st March […]

The Committee on the Present Danger and Neoconservatism in Ngrams – The Soviet Threat

July 4, 2014

[…] brand-new Google Books Ngram Viewer as a ‘gateway drug’ into the digital humanities (http://www.dancohen.org/2010/12/19/initial-thoughts-on-the-google-books-ngram-viewer-and-datasets/). I’ve been playing around with it recently and I’m […]

L’interprétation des graphiques produits par Ngram Viewer | Déjà Vu

November 6, 2014

[…] methodological and epistemological [questions] to the socioargu article as well as to those by Dan Cohen and Olivier Ertzscheid, and to the discussion on Language […]

A Critique of Culturomics: An Annotated Bibliography – Exploring Digital Humanities

November 28, 2014

[…] Cohen, Dan. “Initial Thoughts on the Google Books Ngram Viewer and Datasets.” Dan Cohen 19 Dec. 2010. Web. 19 Nov. 2014. <http://www.dancohen.org/2010/12/19/initial-thoughts-on-the-google-books-ngram-viewer-and-datasets/>. […]

S.C. Healy

December 11, 2014

As an MA student in DH in Ireland, I can see the petty scenario of why anyone would bother to build a tool. As far as culturomics is concerned, I think it was way too presumptuous to label something that only included a fraction of the published world available to the world who had publishing…. I love Ngram Viewer, but I would not support it as a Culturomic evaluation. It certainly provides an indicator, but the corpus is a sampling at the very least. On the fact that someone built a tool to help with this – I would have to say this is pretty brilliant on the corpus they have in front of them…

The Committee on the Present Danger and Neoconservatism in Ngrams – Still Out in the Cold

April 2, 2015

[…] brand-new Google Books Ngram Viewer as a ‘gateway drug’ into the digital humanities (http://www.dancohen.org/2010/12/19/initial-thoughts-on-the-google-books-ngram-viewer-and-datasets/). I’ve been playing around with it recently and I’m […]

10. Digital Historiography | History 9817

November 14, 2015

[…] Cohen, “Initial Thoughts on the Google Books Ngram Viewer and Datasets,” DanCohen.org, 19 September […]
