Category: Libraries

The “Google Five” Describe Progress, Challenges

Among other things learned by the original five libraries that signed up with Google to have their collections digitized is this gem: “About one percent of the Bodleian Library’s books have uncut pages, meaning they’ve never been opened.” I used to find books like this at Yale and felt quite bad for their authors. Imagine all of the effort that goes into writing a book–and then no one, for hundreds of years at Oxford or Yale (bookish places with esoteric interests, wouldn’t you say?), takes even a peek inside. Evidently the long tail doesn’t quite extend all the way.

Five Catalonian Libraries Join the Google Library Project

The Google Library Project has, for the most part, focused on American libraries, thus pushing the EU to mount a competing project; will this announcement (which includes the National Library of Barcelona), coming on the heels of an agreement with the Complutense University of Madrid, signal the beginning of Google making inroads in Europe?

Impact of Field v. Google on the Google Library Project

I’ve finally had a chance to read the federal district court ruling in a case, Field v. Google, that has not been covered much (except in the technology press), but which has obvious and important implications for the upcoming battle over the legality of Google’s library digitization project. The case, Field v. Google, involved a lawyer who dabbles in some online poetry, and who was annoyed that Google’s spider cached a version of his copyrighted ode to delicious tea (“Many of us must have it iced, some of us take it hot and combined with milk, and others are not satisfied unless they know that only the rarest of spices and ingredients are contained therein…”). Field sued Google for copyright infringement; Google argued fair use. Field lost the case, with most of his points rejected by the court. The Electronic Frontier Foundation has hailed Google’s victory as a significant one, and indeed there are some very good aspects of the ruling for the book copying case. But there also seem to be some major differences between Google’s wholesale copying of websites and its wholesale copying of books that the court implicitly recognized. The following seem to be the advantages and disadvantages of this ruling for Google, the University of Michigan, and others who wish to see the library project reach completion.

Courts have traditionally used four factors to determine fair use—the purpose of the copying, the nature of the work, the extent of the copying, and the effect on the market of the work.

On purpose, the court ruled that Google’s cache was not simply a copy of that work, but added substantial value that was important to users of Google’s search engine. Users could still read Field’s poetry even if his site was down; they could compare Google’s cache with the original site to see if any changes had been made; they could see their search terms highlighted in the page. Furthermore, with a clear banner across the top Google tells its users that this is a copy and provides a link to the original. It also provides methods for website owners to remove their pages from the cache. This emphasis on opt out seems critical, since Google has argued that book publishers can simply tell them if they don’t want their books digitized. Also, the court ruled that the Google’s status as a commercial enterprise doesn’t matter here. Advantage for Google et al.

On the nature of the work, the court looked less at the quality of Field’s writing (“Simple flavors, simple aromas, simple preparation…”) than at Field’s intentions. Since he “sought to make his works available to the widest possible audience for free” by posting his poems on the Internet, and since Field was aware that he could (through the robots.txt file) exclude search engines from indexing his site, the court thought Field’s case with respect to this fair use factor was weakened. But book publishers and authors fighting Google will argue that they do not intend this free and wide distribution. Disadvantage for Google et al.

One would think that the third factor, the extent of the copying, would be a clear loser for Google, since they copy entire web pages as a matter of course. But the Nevada court ruled that because Google’s cache serves “multiple transformative and socially valuable purposes…that could not be effectively accomplished by using only portions” of web pages, and because Google points users to the original texts, this wholesale copying was OK. You can see why Google’s lawyers are overjoyed by this part of the ruling with respect to the book digitization project. Big advantage for Google et al.

Perhaps the cruelest part of the ruling had to do with the fourth factor of fair use, the effect on the market of the work. The court determined from its reading of Field’s ode to tea that “there is no evidence of any market for Field’s works.” Ouch. But there is clearly a market for many books that remain in copyright. And since the Google library project has just begun we don’t have any economic data about Google Book Search’s impact on the market for hard copies. No clear winner here.

In additional, the Nevada court added a critical fifth factor for determining fair use in this case: “Google’s Good Faith.” By providing ways to include and exclude materials from its cache, by providing a way to complain to the company, and by clearly spelling out its intentions in the display of the cache, the court determined that Google was acting in good faith—it was simply trying to provide a useful service and had no intention to profit from Field’s obsession with tea. Google has a number of features that replicate this sense of good faith in its book program, like providing links to libraries and booksellers, methods for publishers and authors to complain, and techniques for preventing user copies of copyrighted works. Advantage for Google et al.

A couple of final points that may work against Google. First, the court made a big deal out of the fact that the cache copying was completely automated, which the Google book project is clearly not. Second, the ruling constantly emphasizes the ability of Field to opt out of the program, but upset book publishers and authors believe this should be opt in, and it’s quite possible another court could agree with that position, which would weaken many of the points made above.

Google, the Khmer Rouge, and the Public Good

Like Daniel into the lion’s den, Mary Sue Coleman, the President of the University of Michigan, yesterday went in front of the Association of American Publishers to defend her institution’s participation in Google’s massive book digitization project. Her speech, “Google, the Khmer Rouge and the Public Good,” is an impassioned defense of the project, if a bit pithy at certain points. It’s worth reading in its entirety, but here are some highlights with commentary.

In two prior posts, I wondered what will happen to those digital copies of the in-copyright books the university receives as part of its deal with Google. Coleman obviously knew that this was a major concern of her audience, and she went overboard to satisfy them: “Believe me, students will not be reading digital copies of ‘Harry Potter’ in their dorm rooms…We will safeguard the entirety of this archive with the same diligence we accord our most sensitive materials at the University: medical records, Defense Department data, and highly infectious disease agents used in research.” I’m not sure if books should be compared to infectious disease agents, but it seems clear that the digital copies Michigan receives are not likely to make it into “the wild” very easily.

Coleman reminded her audience that for a long time the books in the Michigan library did not circulate and were only accessible to the Board of Regents and the faculty (no students allowed, of course). Finally Michigan President James Angell declared that books were “not to be locked up and kept away from readers, but to be placed at their disposal with the utmost freedom.” Coleman feels that the Google project is a natural extension of that declaration, and more broadly, of the university’s mission to disseminate knowledge.

Ultimately, Coleman turns from more abstract notions of sharing and freedom to the more practical considerations of how students learn today: “When students do research, they use the Internet for digitized library resources more than they use the library proper. It’s that simple. So we are obligated to take the resources of the library to the Internet. When people turn to the Internet for information, I want Michigan’s great library to be there for them to discover.” Sounds about right to me.

Rough Start for Digital Preservation

How hard will it be to preserve today’s digital record for tomorrow’s historians, researchers, and students? Judging by the preliminary results of some attempts to save for the distant future the September 11 Digital Archive (a project I co-directed), it won’t be easy. While there are some bright spots to the reports in D-Lib Magazine last month on the efforts of four groups to “ingest” (or digitally accession) the thousands of files from the 9/11 collection, the overall picture is a little bit sobering. And this is a fairly well-curated (though by no means perfect) collection. Just imagine what ingesting a messy digital collection, e.g., the hard drive of your average professor, would entail. Here are some of the important lessons from these early digital preservation attempts, as I see it.

But first, a quick briefing on the collection. The September 11 Digital Archive is a joint project of the Center for History and New Media at George Mason University and the American Social History Project/Center for Media and Learning at the Graduate Center of the City University of New York. From January 2002 to the present (though mostly in the first two years) it has collected via the Internet (and some analog means, later run through digitization processes) about 150,000 objects, ranging from emails and BlackBerry communications to voicemail and digital audio, to typed recollections, photographs, and art. I think it’s a remarkable collection that will be extremely valuable to researchers in the future who wish to understand the attacks of 9/11 and their aftermath. In September 2003, the Library of Congress agreed to accession the collection, one of its first major digital accessions.

We started the project as swiftly as possible after 9/11, with the sense that we should do our best on the preservation front, but also with the understanding that we would probably have to cut some corners if we wanted to collect as much as we could. We couldn’t deliberate for months about the perfect archival structure or information architecture or wait for the next release of DSpace. Indeed, I wrote most of the code for the project in a week or so over the holiday break at the end of 2001. Not my best PHP programming effort ever, but it worked fine for the project. And as Clay Shirky points out in the D-Lib opening piece, this is likely to be the case for many projects—after all, projects that spend a lot of time and effort on correct metadata schemes and advanced hardware and software probably are going to be in the position to preserve their own materials anyway. The question is what will happen when more normal collections are passed from their holders to preservation outfits, such as the Library of Congress.

All four of the groups that did a test ingest of our 9/11 collection ran into some problems, though not necessarily at the points they expected. Harvard, Johns Hopkins, Old Dominion, and Stanford encountered some hurdles, beginning with my first point:

You can’t trust anything, even simple things like file types. The D-Lib reports note that a very small but still significant percentage of files in the 9/11 collection seemed to not be the formats they presented themselves as. What amazes me reading this is that I wrote some code to validate file types as they were being uploaded by contributors onto our server, using some powerful file type assessment tools built into PHP and Apache (our web server software). Obviously these validations failed to work perfectly. When you consider handling billion-object collections, even a 1% (or .1%) error rate is a lot. Which leads me to point #2…

We may have to modify to preserve. Although for generations archival science has emphasized keeping objects in their original format, I wonder if it might have been better if (as we had thought about at first on the 9/11 project) we had converted files contributed by the general public into just a few standardized formats. For instance, we could have converted (using the powerful ImageMagick server software) all of the photographs into one of the JPEG formats (yes, there are more than one, which turned out to be a pain). We would have “destroyed” the original photograph in the upload process—indeed, worse than that from a preservation perspective, we would have compressed it again, losing some information—but we could have presented the Library of Congress with a simplified set of files. That simplification process leads me to point #3…

Simple almost always beats complex when it comes to computer technology. I have incredible admiration for preservation software such as DSpace and Fedora, and I tend toward highly geeky solutions, but I’m much more pessimistic than those who believe that we are on the verge of preservation solutions that will keep digital files for centuries. Maybe it’s the historian of the Victorian age in me, reminding myself of the fate of so many nineteenth-century books that were not acid-free and so are deteriorating slowly in libraries around the world. Anyway, it was nice to see Shirky conclude in a similar vein that it looks like digital preservation efforts will have to be “data-centric” rather than “tool-centric” or “process-centric.” Specific tools will fade away over time, and so will ways of processing digital materials. Focusing on the data itself and keeping those files intact (and in use—that which is frequently used will be preserved) is critical. We’ll hopefully be able to access those saved files in the future with a variety of tools and using a variety of processes that haven’t even been invented yet.

2006: Crossroads for Copyright

The coming year is shaping up as one in which a number of copyright and intellectual property issues will be highly contested or resolved, likely having a significant impact on academia and researchers who wish to use digital materials in the humanities. In short, at stake in 2006 are the ground rules for how professors, teachers, and students may carry out their work using computer technology and the Internet. Here are three major items to follow closely.

Item #1: What Will Happen to Google’s Massive Digitization Project?

The conflict between authors, publishers, and Google will probably reach a showdown in 2006, with either the beginning of court proceedings or some kind of compromise. Google believes it has a good case for continuing to digitize library books, even those still under copyright; some authors and most publishers believe otherwise. So far, not much in the way of compromise. Indeed, if you have been following the situation carefully, it’s clear that each side is making clever pre-trial maneuvers to bolster their case. Google cleverly changed the name of its project to Google Book Search from Google Print, which emphasizes not the (possibly illegal) wholesale digitization of printed works but the fact that the program is (as Google’s legal briefs assert) merely a parallel project to their indexing of the web. The implication is that if what they’re doing with their web search is OK (for which they also need to make copies, albeit of born-digital pages), then Google Book Search is also OK. As Larry Lessig, Siva Vaidhyanathan, and others have highlighted, if the ruling goes against Google given this parallelism (“it’s all in the service of search”), many important web services might soon be illegal as well.

Meanwhile, the publishers have made some shrewd moves of their own. They have announced a plan to work with Amazon to accept micropayments for a few page views from a book (e.g., a recipe). And HarperCollins recently decided to embark on its own digitization program, ostensibly to provide book searches through its website. If you look at the legal basis of fair use (which Google is professing for its project), you’ll understand why these moves are important to the publishers: they can now say that Google’s project hurts the market for their works, even if Google shows only a small amount of a copyrighted book. In addition, a judge can no longer rule that Google is merely providing a service of great use to the public that the publishers themselves are unable or unwilling to provide. And I thought the only smart people in this debate were on Google’s side.

If you haven’t already read it, I recommend looking at my notes on what a very smart lawyer and a digital visionary have to say about the impending lawsuits.

Item #2: Chipping Away at the DMCA

In the first few months of 2006, the Copyright Office of the United States will be reviewing the dreadful Digital Millenium Copyright Act—one of the biggest threats to scholars who wish to use digital materials. The DMCA has effectively made many researchers, such as film studies professors, criminals, because they often need to circumvent rights management protection schemes on devices like DVDs to use them in a classroom or for in-depth study (or just to play them on certain kinds of computers). This circumvention is illegal under the law, even if you own the DVD. Currently there are only four minor exemptions to the DMCA, so it is critical that other exemptions for teachers, students, and scholars be granted. If you would like to help out, you can go to the Copyright Office’s website in January and sign your name to various efforts to carve out exemptions. One effort you can join, for instance, is spearheaded by Peter Decherney and others at the University of Pennsylvania. They want to clear the way for fully legal uses of audiovisual works in educational settings. Please contact me if you would like to add your name to that important effort.

Item #3: Libraries Reach a Crossroads

In an upcoming post I plan to discuss at length a fascinating article (to be published in 2006) by Rebecca Tushnet, a Georgetown law professor, that highlights the strange place at which libraries have arrived in the digital age. Libraries are the center of colleges and universities (often quite literally), but their role has been increasingly challenged by the Internet and the protectionist copyright laws this new medium has engendered. Libraries have traditionally been in the long-term purchasing and preservation business, but they increasing spend their budgets on yearly subscriptions to digital materials that could disappear if their budgets shrink. They have also been in the business of sharing their contents as widely as possible, to increase knowledge and understanding broadly in society; in this way, they are unique institutions with “special concerns not necessarily captured by the end-consumer-oriented analysis with which much copyright scholarship is concerned,” as Prof. Tushnet convincingly argues. New intellectual property laws (such as the DMCA) threaten this special role of libraries (aloof from the market), and if they are going to maintain this role, 2006 will have to be the year they step forward and reassert themselves.

Introduction to Firefox Scholar

This week in the electronic version, and next week in the print version, the Chronicle of Higher Education is running an article (subscription required) on a new software project I’m co-directing, Firefox Scholar, which will be a set of extensions to the popular open source web browser that will help researchers, teachers, and students. My thanks to the many people who have emailed who are interested in the project. For them and for others who would like to know more, here’s a brief summary of Firefox Scholar from our grant proposal to the Institute for Museum and Library Services, which has generously provided $250,000 to initiate the project. Please contact me if you would like occasional updates on the project or would like a beta release of the browser when it is available in the late summer of 2006.

The web browser has become the primary means for accessing information, documents, and artifacts from libraries and museums around the country and the world, thanks in large part to the tremendous commitment these institutions have made to bringing their collections online (as either simple citations or complete text and images). Unfortunately for scholars, while tens of millions of dollars have been spent to create digital resources, far less funding and effort has been allocated for the development of tools to facilitate the use of these resources. The browser remains merely a passive window allowing one to view, but not easily collect, annotate, or manipulate these objects. Moreover, from the user’s perspective individual library and museum collections remain just that—separate websites with distinct designs and different ways of displaying their information, making traditional scholarly practices of bringing together and studying objects of interest from across these collections unnecessarily difficult.

Firefox Scholar, a set of tools incorporated into popular, open, and free web software, will address these major problems by creating a web browser that is “smarter” in two key ways. First, one tool will enable the browser to intelligently sense when its user is viewing a digital library or museum object; this will allow the browser to capture information from the page automatically, such as the creator, title, date of creation, and copyright information. Second, another tool will store and organize this information, as well as full copies of items and web pages (not just their citation information) if so desired by the user and permitted by the institution’s site, allowing the user to sort, annotate, search, and manipulate these individualized collections created for scholarly purposes. Critically, all of this will occur within the web browser itself, not in a separate, standalone application; the web browser will be used not just to discover information, but also to collect, organize, and analyze scholarly materials.