Category: Books

The Idealization of the Book

Let me take the liberty of being the last academic with a blog to comment on the launch of Amazon’s new e-book reader, the Kindle. And let me also not waste any time on its design, screen, wireless technology, business model, or its uncanny resemblance to the Sinclair ZX80 I used in seventh grade. What little I have to say about the Kindle has less to do with the “e” than with the “book” part of it.

Although I’m generally an early adopter of technology, the Kindle—and indeed all e-book readers—strikes me as similar to the “photoplays,” the filmed stage performances, that followed the introduction of film in the early twentieth century. In Hamlet on the Holodeck, Janet Murray distinguishes between “additive” and “expressive” features of new media, and notes that photoplays were “a merely additive art form (photography plus theatre).” Only when filmmakers learned to use montage, close-ups, zooms, and the like as part of storytelling did photoplays give way to the new expressive form of movies.

Just as many of those who were used to plays assumed the highest form of film would be the fixed-camera photographing of Shakespeare, those used to books assume the highest form of digital reading will be the book transported to a dedicated electronic device. This idealization of reading paper as the highest form of intellectual consumption has led so many to believe that we need an electronic book reader like Amazon’s just-released Kindle: to do real reading we have to take a text from our computer and put it onto a book-ish device that’s as close to paper as possible.

Wrong. Just as people kept going to plays, people will continue to read books (albeit perhaps fewer) while they adjust to online reading for many other purposes. Only a rigid elitist would insist that book reading is optimal in all cases. What book or journal allows me to keep up with the work of over 250 scholars in the digital humanities? My RSS reader does, and quite well. And while some of us older folks may idealize the daily reading of newspapers (in addition to loving books, I subscribe to two newspapers out of that same affection), we might as well admit that online reading is a far better way to stay informed about many topics—just ask sports junkies. Or compare the breadth and depth of coverage of Web trends between the New York Times’s business section and the TechCrunch blog.

Matt Kirschenbaum of the Maryland Institute for Technology in the Humanities eloquently covers this issue in an analysis of the state of reading that is far more subtle than the one found in the overanxious National Endowment for the Arts’ report To Read or Not to Read: A Question of National Consequence. (Unfortunately Matt’s article is behind the Chronicle of Higher Ed’s electronic gates; when will they join the New York Times and the Wall Street Journal in opening those gates and becoming part of the online discussion?) Matt highlights the many new forms of reading uncatalogued—or worse, dismissed—by the NEA report, including the exponentially growing forms of online reading that young people take for granted. While not idealizing these new forms, Matt notes that they can (contrary to the NEA’s belief) involve serious thought, and that they can engender writing as well.

As Matt points out, people are already voraciously reading on their computers, and when they read in an electronic format they want to take full advantage of the medium—link to texts from their blog or syllabus, email them, connect them to the universe of other writing and other people online.

To be sure, the reading of books has declined and there are elements of that decline to worry about. But let’s also remember that very little of what kids read offline is Proust, and not all of what kids read online is their Facebook news feed.

Update: The Chronicle of Higher Ed has made a rare exception to their gating and provided an open access copy of Matt’s article.

Symposium on the Future of Scholarly Communication

For those who missed it, between October 12 and 27, 2007, there was a very thoughtful and insightful online discussion of how the publication of scholarship is changing—or trying to change—in the digital age. Participating in the discussion were Ed Felten, David Robinson, Paul DiMaggio, and Andrew Appel from Princeton University (the symposium was hosted by the Center for Information Technology Policy at Princeton), Ira Fuchs of the Mellon Foundation, Peter Suber of the indispensable Open Access News blog (and philosophy professor at Earlham College), Stan Katz, the President Emeritus of the American Council of Learned Societies, and Laura Brown of Ithaka (and formerly the President of Oxford University Press USA).

The symposium is really worth reading from start to finish. (Alas, one of the drawbacks of hosting a symposium on a blog is that it keeps everything in reverse chronological order; it would be great if CITP could flip the posts now that the discussion has ended.) But for those of us in the humanities the most relevant point is that we are going to have a much harder transition to an online model of scholarship than our colleagues in the sciences. The main reason is that for us the highest form of scholarship is the book, whereas in the sciences it is the article, which is far more easily put online, posted in various forms (including as pre- and e-prints), and networked to other articles (through, e.g., citation analysis). In addition, we’re simply not as technologically savvy. As Paul DiMaggio points out, “every computer scientist who received his or her Ph.D. in computer science after 1980 or so has a website” (on which they can post their scholarly production), whereas the number is about 40% for political scientists and, I’m sure, far lower for historians and literature professors.

I’m planning a long post in this space on the possible ways for humanities professors to move from print to open online scholarship; this discussion is great food for thought.

Tony Grafton on Digital Texts and Reading

Anthony Grafton was the first person to turn me onto intellectual history. His seminar on ideas in the Renaissance was one of the most fascinating courses I took at Princeton, and I still remember well Tony rocking in his seat, looking a bit like a young Karl Marx, making brilliant connections among a broad array of sources.

So it’s not unexpected given his wide-ranging interests but still terrific to see a scholar who has spent so much time with early books thinking deeply about “digitization and its discontents” in his article “Future Reading” in the latest issue of The New Yorker. And it’s even more gratifying to see Tony note in his online companion piece to “Future Reading,” “Adventures in Wonderland,” that “One of the best ways to get a handle on the sprawling world of digital sources is through George Mason University’s Center for History and New Media.”

Steven Johnson at the Italian Embassy

Well, they didn’t have my favorite wine (Villa Cafaggio Chianti Classico Riserva, if you must know), but I had a nice evening at the Italian Embassy in Washington. The occasion was the start of a conference, “Using New Technologies to Explore Cultural Heritage,” jointly sponsored by the National Endowment for the Humanities and the Consiglio Nazionale delle Ricerche (National Research Council) of Italy. The setting was the embassy’s postmodern take on the Florentine palazzo (see below); the speaker was bestselling author and member of the digerati Steven Johnson (Everything Bad Is Good for You: How Today’s Popular Culture Is Actually Making Us Smarter; Outside.in).

Italian Embassy

Steven Johnson

Johnson’s talk was entitled “The Open Book: The Future of Text in the Digital Age.” (I present his thoughts here without criticism; it’s late.) Johnson argued that despite all of the hand-wringing and dire predictions, the book was not in decline. Indeed, he thought that because of new media, books have new channels to expand into. While some believed ten years ago that we were entering an age of image and video, the rise of the web instead led to the continued dominance of text, online and off. He noted that more hardcover books were sold in 2006 than in 2005, and more in 2005 than in 2004. Newspapers have huge online audiences that dwarf their paper readership, thus strengthening their importance to culture.

Johnson pointed to four important innovations in online writing:

1) Collaborative writing is in a golden age because of the Internet. One need only look at Wikipedia, especially the social process of its underlying discussion pages (in addition to the surface article pages).

2) Fan fiction is also in its heyday. There are almost 300,000 (!) fan-written, unauthorized sequels to Harry Potter on fanfiction.net. There are even countless reviews of this fan fiction.

3) Blogging has become an important force, and great for authors. Blogs often provide unpolished comments about books by readers that are just as helpful as professional reviews.

4) Discovery of relevant materials and passages has been made much easier by new media–just think about the difference between research for a book now and roaming through the stacks in a library. Software like DEVONthink has made scholarship easier by connecting hidden dots and sorting through masses of text.

Finally, Johnson argued that despite the allure of the web, physical books are still the best way for an author to get inside someone’s head and convince them about something important. The book still has much greater weight and impact than even the most important blog post.

Google Books: Is It Good for History?

The September 2007 issue of the American Historical Association’s Perspectives is now available online, and it is worth reading Rob Townsend’s article “Google Books: Is It Good for History?” The article is an update of Rob’s much-debated post on the AHA blog in May, and I believe this revised version now reads as the best succinct critique of Google Books available (at least from the perspective of scholars). Rob finds fault with Google’s poor scans, frequently incorrect metadata, and too-narrow interpretation of the public domain.

Regular readers of this blog know of my aversion to jeremiads about Google, but Rob’s piece is well-reasoned and I agree with much of what he says.

Why Google Books Should Have an API

No Way Out

[This post is a version of a message I sent to the listserv for CenterNet, the consortium of digital humanities centers. Google has expressed interest in helping CenterNet by providing a (limited) corpus of full texts from their Google Books program, but I have been arguing for an API instead. My sense is that this idea has considerable support but that there are also some questions about the utility of an API, including from within Google.]

My argument for an API over an extracted corpus of books begins with a fairly simple observation: how are we to choose a particular dataset for Google to compile for us? I’m a scholar of the Victorian era, so a large corpus from the nineteenth century would be great, but how about those who study the Enlightenment? If we choose novels, what about those (like me) who focus on scientific literature? Moreover, many of us wish to do more expansive horizontal (across genres in a particular age) and vertical (within the same genre but through large spans of time) analyses. How do we accommodate the wishes of everyone who does computational research in the humanities?

Perhaps some of the misunderstanding here is about the kinds of research a humanities scholar might do as opposed to, say, the computational linguist, who might make use of a dataset or corpus (generally a broad and/or normalized one) to assess the nature of (a) language itself, examine frequencies and patterns of words, or address computer science problems such as document classification. Some of these corpora can provide a historian like me with insights as long as the time span involved is long enough and each document includes important metadata such as publication date (e.g., you can trace the rise and fall of certain historical themes using BYU’s Time Magazine corpus).
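
To make the corpus-with-metadata point concrete, here is a minimal sketch (not tied to the BYU corpus or any particular API) of how a historian might trace a theme over time once each document carries a publication date; the document list and the term list below are purely illustrative placeholders.

```python
import re
from collections import Counter

# Minimal sketch: trace a theme across a dated corpus.
# `documents` stands in for any corpus whose items carry a publication year;
# the sample texts and the term list are invented for illustration.
documents = [
    (1805, "the divine order revealed in geometry"),
    (1895, "a rigorous proof, with no appeal to providence"),
]
theme = re.compile(r"\b(divine|providence|creator)\b", re.IGNORECASE)

hits, docs = Counter(), Counter()
for year, text in documents:
    decade = (year // 10) * 10
    docs[decade] += 1
    hits[decade] += len(theme.findall(text))

for decade in sorted(docs):
    print(f"{decade}s: {hits[decade] / docs[decade]:.1f} matches per document")
```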

But there are many other analyses that humanities scholars could undertake with an API, especially one that allowed them to first search for books of possible interest and then to operate on the full texts of that ad hoc corpus. An example from my own research: in my last book I argued that mathematics was “secularized” in the nineteenth century, and part of my evidence was that mathematical treatises, which normally contained religious language in the early nineteenth century, lost such language by the end of the century. Because I was researching in the pre-Google Books era, my textual evidence was necessarily limited: I could only read a certain number of treatises, and I chose to focus on the writing of high-profile mathematicians.

How would I go about supporting this thesis today using Google Books? I would of course love to have an exhaustive corpus of mathematical treatises. But in my book I also used published books of poems, sermons, and letters about math. In other words, it’s hard to know exactly what to assemble in advance–just treatises would leave out much of the story and evidence.

Ideally, I would like to use an API to find books that matched a complicated set of criteria (it would be even better if I could use regular expressions to find the many variants of religious language and also to find religious language relatively close to mentions of mathematics), and then use get_cache to acquire the full OCRed text of these matching books. From that ad hoc corpus I would want to do some further computational analyses on my own server, such as extracting references to touchstones for the divine vision of mathematics (e.g., Plato’s later works, geometry rather than number theory), and perhaps even do some aggregate analyses (from which works did British mathematicians most often acquire this religious philosophy of mathematics?). I would also want to examine these patterns over time to see if indeed the bond between religion and mathematics declined in the late Victorian era.
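
To show what such a workflow might look like in practice, here is a purely hypothetical sketch: no such Google Books API exists, and search_books and get_cache are invented stand-ins for the criteria-based search and full-text retrieval described above.

```python
import re

# Hypothetical workflow sketch; search_books and get_cache are invented
# stand-ins for an imagined Google Books API, not real calls.

def search_books(query):
    """Would return identifiers of books matching a (possibly complex) query."""
    return []

def get_cache(book_id):
    """Would return the full OCRed text of one matching book."""
    return ""

religious = re.compile(r"\b(divine|creator|providen\w*)\b", re.IGNORECASE)
mathematical = re.compile(r"\b(geometr\w*|mathemat\w*)\b", re.IGNORECASE)

# Build an ad hoc corpus, then run further analyses locally.
corpus = {bid: get_cache(bid) for bid in search_books("mathematics treatise 1800-1900")}
mixed = [bid for bid, text in corpus.items()
         if religious.search(text) and mathematical.search(text)]
print(f"{len(mixed)} of {len(corpus)} books mix religious and mathematical language")
```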

This is precisely the model I use for my Syllabus Finder. I first find possible syllabi using an algorithm-based set of searches of Google (via the unfortunately deprecated SOAP Search API) while also querying local Center for History and New Media databases for matches. Since I can then extract the full texts of matching web pages from Google (using the API’s cache function), I can do further operations, such as pulling book assignments out of the syllabi (using regular expressions).
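
As an illustration of that last step (and not the actual Syllabus Finder code), a regex pass over the cached text of a page already identified as a syllabus might look something like this; the pattern and sample text are invented, and real syllabi require far more careful handling.

```python
import re

# Illustrative only: a naive "Author, Title" pattern run over cached syllabus text.
syllabus_text = """
Required reading: Eric Foner, Reconstruction (Harper, 1988).
Week 3: James McPherson, Battle Cry of Freedom.
"""

assignment = re.compile(r"([A-Z][a-z]+ [A-Z][a-zA-Z]+),\s+([A-Z][^.(\n]+)")
for author, title in assignment.findall(syllabus_text):
    print(f"{author}: {title.strip()}")
```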

It seems to me that a model is already in place at Google for such an API for Google Books: their special university researcher’s version of the Search API. That kind of restricted but powerful API program might be ideal because 1) I don’t think an API would be useful without the get_OCRed_text function, which (let’s face it) liberates information that is currently very hard to get even though Google has recently released a plain text view of (only some of) its books; and 2) many of us want to ping the Google Books API with more than the standard daily hit limit for Google APIs.

[Image credit: the best double-entendre cover I could find on Google Books: No Way Out by Beverly Hastings.]

Debating Paul Duguid’s Google Books Lament

Over at the O’Reilly Radar, Peter Brantley reprints an interesting debate between Paul Duguid, author of the much-discussed recent article about the quality of Google Books, and Patrick Leary, author of “Googling the Victorians.” I’m sticking with my original negative opinion of the article, an opinion with which Leary completely agrees.

Google Books: Champagne or Sour Grapes?

Beyond Good and Evil

Is it possible to have a balanced discussion of Google’s outrageously ambitious and undoubtedly flawed project to scan tens of millions of books in dozens of research libraries? I have noted in this space the advantages and disadvantages of Google Books—sometimes both at one time. Heck, the only time this blog has ever been seriously “dugg” is when I noted the appearance of fingers in some Google scans. Google Books is an easy target.

This week Paul Duguid has received a lot of positive press (e.g., Peter Brantley, if:book) for his dressing down of Google Books, “Inheritance and loss? A brief survey of Google Books.” It’s a very clever article, using poorly scanned Google copies of Laurence Sterne’s absurdist and raunchy comedy Tristram Shandy to reveal the extent of Google’s folly and their “disrespect” for physical books.

I thought I would enjoy reading Duguid’s article, but I found myself oddly unenthusiastic by the end.

Of course Google has poor scans—as the saying goes, haste makes waste—but Duguid’s article is not a scientific survey of the percentage of pages that are unreadable or missing (surely less than 0.1% in my viewing of scores of Victorian books). Nor does the article note that Google might have remedies for some of these inadequacies. For example, they almost certainly have higher-resolution, higher-contrast scans than the lo-res ones they display (a point made at the Million Books workshop; they use the originals for OCR), which they can revisit to produce better copies for the web. Just as they have recently added commentary to Google News, they could have users flag problematic pages. Truly bad books could be rescanned or replaced by other libraries’ versions.

Most egregiously, none of the commentaries I have seen on Duguid’s jeremiad have noted the telling coda to the article: “This paper is based on a talk given to the Society of Scholarly Publishers, San Francisco, 6 June 2007. I am grateful to the Society for the invitation.” The question of playing to the audience obviously arises.

Google Books will never be perfect, or even close. Duguid is right that it disrespects age-old, critical elements of books. (Although his point that Google disrespects metadata strangely fails to note that Google is one of the driving forces behind the Future of Bibliographic Control meetings, which are all about metadata.) Google Books is the outcome, like so many things at Google, of a mathematical challenge: How can you scan tens of millions of books in five years? It’s easy to say they should do a better job and get all the details right, but if you run the numbers, you’ll probably see that the perfect library scanning project would take 50 years rather than 5. As in OCR, getting from 98% to 100% accuracy would probably take an order of magnitude longer and be an order of magnitude more expensive. That’s the trade-off they have decided to make, and for a company interested in search, where near-100% accuracy is unnecessary (I have seen OCR specialists estimate that even 90% accuracy is perfectly fine for search), it must have been an easy decision.

Complaining about the quality, thoroughness, and fidelity of Google’s (public) scans distracts us from the larger problem of Google Books. As I have argued repeatedly in this space, the real problem—especially for those in the digital humanities but also for many others—is that Google Books is not open. Recently they have added the ability to view some books in “plain text” (i.e., the OCRed text, though it’s hard to copy text from multiple pages at once), and even in some cases to download PDFs of public domain works. But those moves don’t go far enough for scholarly needs. We need what Cliff Lynch of CNI has called “computational access,” a higher level of access that is less about reading a page image on your computer than about applying digital tools and analyses to many pages or books at one time to create new knowledge and understanding.

An API would be ideal for this purpose if Google doesn’t want to expose their entire collection. Google has APIs for most of their other projects—why not Google Books?

[Image courtesy of Ubisoft.]

A Companion to Digital Humanities

The entirety of this major work (640 pages, 37 chapters), edited by Susan Schreibman, Ray Siemens, and John Unsworth, is now available online. Kudos to the editors and to Blackwell Publishing for putting it on the web for free.

Google Fingers

No, it’s not another amazing new piece of software from Google that will type for you (though that would be nice). Just something that I’ve noticed while looking at many nineteenth-century books in Google’s massive digitization project. The following screenshot nicely reminds us that at the root of the word “digitization” is “digit,” which is from the Latin word “digitus,” meaning finger. It also reminds us that despite our perception of Google as a collection of computer geniuses, and despite their use of advanced scanning technology, their library project involves an almost unfathomable amount of physical labor. I’m glad that here and there, the people doing this difficult work (or at least their fingers) are being immortalized.

[The first page of a Victorian edition of Plato’s Euthyphron, a dialogue about the origin and nature of piety. Insert your own joke here about Google’s “Don’t be evil” motto.]