A Closer Look at the National Archives-Footnote Agreement

I’ve spent the past two weeks trying to get a better understanding of the agreement signed by the National Archives and Footnote, about which I raised several concerns in my last post. Before making further (possibly unfounded) criticisms I thought it would be a good idea to talk to both NARA and Footnote. So I picked up the phone and found several people eager to clarify things. At NARA, Jim Hastings, director of access programs, was particularly helpful in explaining their perspective. (Alas, NARA’s public affairs staff seemed to have only the sketchiest sense of key details.) Most helpful—and most eager to rebut my earlier post—were Justin Schroepfer and Peter Drinkwater, the marketing director and product lead at Footnote. Much to their credit, Justin and Peter patiently answered most of my questions about the agreement and the operation of the Footnote website.

Surprisingly, everyone I spoke to at both NARA and Footnote emphasized that despite the seemingly set-in-stone language of the legal agreement, there is a great deal of latitude in how it is executed, and they asked me to spread the word about how historians and the general public can weigh in. It has received virtually no publicity, but NARA is currently in a public comment phase for the Footnote (a/k/a iArchives) agreement. Scroll down to the bottom of the “Comment on Draft Policy” page at NARA’s website and you’ll find a request for public comment (you should email your thoughts to Vision@nara.gov). It’s a little odd to have a request for comment after the ink is dry on an agreement or policy, and this URL probably should have been included in the press release of the Footnote agreement, but I do think after speaking with them that both NARA and Footnote are receptive to hearing responses to the agreement. Indeed, in response to this post and my prior post on the agreement, Footnote has set up a web page, “Finding the Right Balance,” to receive feedback from the general public on the issues I’ve raised. They also asked me to round up professional opinion on the deal.

I assume Footnote will explain their policies in greater depth on their blog, but we agreed that it would be helpful to record some important details of our conversations in this space. Here are the answers Justin and Peter gave to a few pointed questions.

When I first went to the Footnote site, I was unpleasantly surprised that it required registration even to look at “milestone” documents like Lincoln’s draft of the Gettysburg Address. (Unfortunately, Footnote doesn’t have a list of all of its free content yet, so it’s hard to find such documents.) Justin and Peter responded that when they launched the site there was an error in the document viewer, so they had to add authentication to all document views. A fix was rolled out on January 23, and it’s now possible to view these important documents without registering.

You do need to register, however, to print or download any document, whether it’s considered “free” or “premium.” Why? Justin and Peter candidly noted that although they have done digitization projects before, the National Archives project, which contains millions of critical—and public domain—documents, is a first for them. They are understandably worried about the “leakage” of documents from their site, and want to take it one step at a time. So to start they will track all downloads to see how much escapes, especially in large batches. I noted that downloading and even reusing these documents (even en masse) very well might be legal, despite Footnote’s terms of service, because the scans are “slavish” copies of the originals, which are not protected by copyright. Footnote lawyers are looking at copyright law and what other primary-source sites are doing, and they say that they view these initial months as a learning experience to see if the terms of service can or should change. Footnote’s stance on copyright law and terms of usage will clearly be worth watching.

Speaking of terms of usage, I voiced a similar concern about Footnote’s policies toward minors. As you’ll recall, Footnote’s terms of service say the site is intended for those 18 and older, thus seeming to turn away the many K-12 classes that could take advantage of it. Justin and Peter were most passionate on this point. They told me that Footnote would like to give free access to the site for the K-12 market, but pointed to the restrictiveness of U.S. child protection laws. Because the Footnote site allows users to upload documents as well as view them, they worry about what youngsters might find there in addition to the NARA docs. These laws also mandate the “over 18” clause because the site captures personal information. It seems to me that there’s probably a technical solution that could be found here, similar to the one PBS.org uses to provide K-12 teaching materials without capturing information from the students.

Footnote seems willing to explore such a possibility, but again, Justin and Peter chalked up problems to the newness of the agreement and their inexperience running an interactive site with primary documents such as these. Footnote’s lawyers consulted (and borrowed, in some cases) the boilerplate language from terms of service at other sites, like Ancestry.com. But again, the Footnote team emphasized that they are going to review the policies and look into flexibility under the laws. They expect to tweak their policies in the coming months.

So, now is your chance to weigh in on those potential changes. If you do send a comment to either Footnote or NARA, try to be specific in what you would like to see. For instance, at the Center for History and New Media we are exploring the possibility of mining historical texts, which will only be possible to do on these millions of NARA documents if the Archives receives not only the page images from Footnote but also the OCRed text. (The handwritten documents cannot be automatically transcribed using optical character recognition, of course, but there are many typescript documents that have been converted to machine-readable text.) NARA has not asked to receive the text for each document back from Footnote—only the metadata and a combined index of all documents. There was some discussion that NARA is not equipped to handle the flood of data that a full-text database would entail. Regardless, I believe it would be in the best interest of historical researchers to have NARA receive this database, even if they are unable to post it to the web right away.
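To make concrete what access to the OCRed text would enable, here is a minimal sketch of the simplest kind of mining a full-text database of these documents would allow: counting term frequencies across a corpus of machine-readable pages. (The function name and the toy stopword list are my own illustrations, not anything NARA or Footnote has specified.)

```python
import re
from collections import Counter

# A toy stopword list; real text mining would use a fuller one.
STOPWORDS = frozenset({"the", "of", "and", "to", "a", "in", "for"})

def term_frequencies(documents):
    """Count word frequencies across a corpus of OCRed page texts."""
    counts = Counter()
    for text in documents:
        # Lowercase and pull out alphabetic runs; OCR noise is ignored.
        for word in re.findall(r"[a-z]+", text.lower()):
            if word not in STOPWORDS:
                counts[word] += 1
    return counts
```

Even something this crude, run over millions of typescript pages, would let researchers chart the rise and fall of terms across decades of government records, which is why receiving only page images and metadata, without the text, forecloses so much.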

Google Book Search Now Maps Locations in the Text

Look at the bottom of this page for Illustrated New York: The Metropolis of To-day (1888), digitized by Google at the University of Michigan Library. Google Maps’ natural language processing scans the text for addresses, and the locations and surrounding text are placed onto a map of lower Manhattan. A great example of the power of historical data mining and the combination of digital resources via APIs (made easier for Google, of course, because this is all in-house). Kudos to the Google Book Search team.
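Google’s address detection is proprietary, but a crude stand-in suggests the idea: scan digitized text for address-like strings (a street number followed by capitalized words and a street suffix), which could then be handed to a geocoder and plotted. The pattern below is my own rough approximation, not Google’s method.

```python
import re

# Crude heuristic: a 1-5 digit street number, up to two capitalized
# words, and a common street suffix. Real NLP does far better.
ADDRESS_RE = re.compile(
    r"\b\d{1,5}\s+(?:[A-Z][a-z]+\s+){0,2}"
    r"(?:Street|Avenue|Broadway|Square|Place|Lane)\b")

def find_addresses(text):
    """Return address-like strings found in a passage of digitized text."""
    return ADDRESS_RE.findall(text)
```

Run over an 1888 guidebook, matches like “120 Broadway” become points on a map, with the surrounding sentence serving as the marker’s caption.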

The Hindle Fellowship in the History of Technology

Those with a recent doctorate in the history of technology or a related field might be interested in applying for this fellowship, which provides $10,000 for the purpose of preparing a dissertation for publication. The deadline is April 1, 2007.

Blackboard’s Entry into Web 2.0 Unveiled: Scholar.com

Maybe they should have kept it veiled. I’m surprised at how poorly designed this site is (surely a freshman who knows Ruby on Rails and a little Photoshop could have put together a better social bookmarking site in a week), not to mention that additions to the site are limited to users of the Blackboard course management system. How do they plan to get the scale necessary for network effects? From students who are thrilled by the new functionality of the website they have to go to for their classes?

The Flawed Agreement between the National Archives and Footnote, Inc.

I suppose it’s not breaking news that libraries and archives aren’t flush with cash. So it must be hard for a director of such an institution when a large corporation, or even a relatively small one, comes knocking with an offer to digitize one’s holdings in exchange for some kind of commercial rights to the contents. But as a historian worried about open access to our cultural heritage, I’m a little concerned about the new agreement between Footnote, Inc. and the United States National Archives. And I’m surprised that somehow this agreement has thus far flown under the radar of all of those who attacked the troublesome Smithsonian/Showtime agreement. Guess what? From now until 2012 it will cost you $100 a year, or even more offensively, $1.99 a page, for online access to critical historical documents such as the Papers of the Continental Congress.

This was the agreement signed by Archivist of the United States Allen Weinstein and Footnote, Inc., a Utah-based digital archives company, on January 10, 2007. For the next five years, unless you have the time and money to travel to Washington, you’ll have to fork over money to Footnote to take a peek at Civil War pension documents or the case files of the early FBI. The National Archives says this agreement is “non-exclusive”—I suppose crossing their fingers that Google will also come along and make a deal—but researchers shouldn’t hold their breath for other options.

Footnote.com, the website that provides access to these millions of documents, charges for anything more than viewing a small thumbnail of a page or photograph. Supposedly the value-added of the site (aside from being able to see detailed views of the documents) is that it allows you to save and annotate documents in your own library, and share the results of your research (though not the original documents). Hmm, I seem to remember that there’s a tool being developed that will allow you to do all of that—for free, no less.

Moreover, you’ll also be subject to some fairly onerous terms of usage on Footnote.com, especially considering that this is our collective history and that all of these documents are out of copyright. (For a detailed description of the legal issues involved here, please see Chapter 7 of Digital History, “Owning the Past?”, especially the section covering the often bogus claims of copyright on scanned archival materials.) I’ll let the terms speak for themselves (plus one snide aside): “Professional historians and others conducting scholarly research may use the Website [gee, thanks], provided that they do so within the scope of their professional work, that they obtain written permission from us before using an image obtained from the Website for publication, and that they credit the source. You further agree that…you will not copy or distribute any part of the Website or the Service in any medium without Footnote.com’s prior written authorization.”

Couldn’t the National Archives have at least added a provision to the agreement with Footnote to allow students free access to these documents? I guess not; from the terms of usage: “The Footnote.com Website is intended for adults over the age of 18.” What next? Burly bouncers carding people who want to see the Declaration of Independence?

Readings for a Field in Digital History

An incredibly helpful list from Bill Turkel of nearly a hundred books that either directly or indirectly address issues central to the study of digital history.

Five Catalonian Libraries Join the Google Library Project

The Google Library Project has, for the most part, focused on American libraries, thus pushing the EU to mount a competing project; will this announcement (which includes the National Library of Barcelona), coming on the heels of an agreement with the Complutense University of Madrid, signal the beginning of Google making inroads in Europe?

A Companion to Digital Humanities

The entirety of this major work (640 pages, 37 chapters), edited by Susan Schreibman, Ray Siemens, and John Unsworth, is now available online. Kudos to the editors and to Blackwell Publishing for putting it on the web for free.

Creating a Blog from Scratch, Part 8: Full Feeds vs. Partial Feeds

One seemingly minor aspect of blogs I failed to consider carefully when I programmed this site was the composition of its feed. (Frankly, I was more concerned with the merely technical question of how to write code that spits out a valid RSS or Atom feed.) Looking at a lot of blogs and their feeds, I just assumed that the standard way of doing it was to put a small part of the full post in the feed—e.g., the first 50 words or the first paragraph—and then let the reader click through to the full post on your site. I noticed that some bloggers put their entire blog in their feed, but as a new blogger—one who had just spent a lot of time redesigning his old website to accommodate a blog—I couldn’t figure out why anyone would want to do that, since it seemed to render the blog’s own site irrelevant. It may seem minor, but a year later I’ve realized that there is, in part, a philosophical difference between a full and partial feed. Choosing which type of feed you are going to use means making a choice about the nature of your blog—and, surprisingly, the nature of your ego too. Subscribers to this blog’s feed have probably noticed that as of my last post I’ve switched from a partial feed to a full feed, so you already know the outcome of the debate I’ve had in my head about this distinction, but let me explain my reasoning and the advantages and disadvantages of full and partial feeds.
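The technical side of the choice really is small. A minimal sketch of a hand-rolled RSS 2.0 generator (the function and channel details here are illustrative, not the actual code behind this site) shows that full versus partial comes down to a single line:

```python
import xml.etree.ElementTree as ET

def first_words(text, n=50):
    """Truncate a post to its first n words, the usual 'partial feed' style."""
    words = text.split()
    return " ".join(words[:n]) + ("…" if len(words) > n else "")

def build_rss(posts, full_feed=True):
    """Build a minimal RSS 2.0 document from (title, link, body) tuples."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = "Example Blog"
    ET.SubElement(channel, "link").text = "http://example.org/"
    ET.SubElement(channel, "description").text = "A hand-rolled feed"
    for title, link, body in posts:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = title
        ET.SubElement(item, "link").text = link
        # The entire full-vs-partial debate lives in this one expression:
        ET.SubElement(item, "description").text = (
            body if full_feed else first_words(body))
    return ET.tostring(rss, encoding="unicode")
```

Since the code is trivial either way, the decision is purely editorial, which is exactly why it ends up being philosophical.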

Putting the entire content of your blog into your feed has many practical advantages. Most obviously, it saves your readers the extra step of clicking on a link in their feed reader to view your full post. They can read your blog offline as well as online, and more easily access it on a non-computer device like a cell phone. Machine audiences can also take advantage of the full feed, searching it for keywords desired by other machines or people. For instance, most blog search engines allow you to set up feeds for posts from any blogger that contain certain words or phrases.

More important, providing a full feed conforms better with a philosophy I’ve tried to promote in this space, one of open access and the sharing of knowledge. A full feed allows for the easy redistribution of your writing and the combination of your posts with others on similar topics from other bloggers. A full feed is closer to “open source” than a feed that is tied to a particular site. For this reason, until the advent of in-feed advertising, most professional bloggers had partial feeds, which forced readers to click through to the site and view the advertising displayed next to the full text of each post.

Even from the perspective of a non-commercial blogger—or more precisely the perspective of that blogger’s ego—full feeds can be slightly problematic. A liberated, full feed is less identifiably from you. As literary theorists know well, reading environments have a significant impact on the reception of a text. A full feed means that most of your blog’s audience will be reading it without the visual context of your site (its branding, in ad-speak), instead looking at the text in the homogenized reading environment of a feed reader. I’ve just switched from NetNewsWire to Google Reader to browse other blogs, and I especially like the way that Google’s feed reader provides a seamless stream of blog posts, one after the other, on a scrolling web page. I’m able to scan the many blogs I read quickly and easily. That reading style and context, however, makes me much less aware of specific authors. It makes the academic blogosphere seem like a stream of posts by a collective consciousness. Perhaps that’s fine from an information consumption standpoint, but it’s not so wonderful if you believe that individual voices and perspectives matter a great deal. Of course, some writers cut through the clutter and make me aware of their distinctive style and thoughts, but most don’t.

At the Center for History and New Media, we’ve been thinking a lot about the blog as a medium for academic conversation and publication—and even promotion and tenure—and the homogenized feed reader environment is a bit unsettling. Yes, it can be called academic narcissism, but maintaining authorial voice and also being able to measure the influence of individual voices is important to the future of academic blogging.

I’ve already mentioned in this space that I would like to submit this blog as part of my tenure package, for my own good, of course, but also to make a statement that blogs can and should be a part of the tenure review process and academic publication in general. But tenure committees, which generally focus on peer-reviewed writing, will need to see some proof of a blog’s use and impact. Right now the best I can do is to provide some basic stats about the readership of this blog, such as subscriptions to the feed.

But with a full feed, you can slowly lose track of your audience. Providing your entire posts in the feed allows anyone to resyndicate it, aggregate it, mash it up, or simply copy it. I must admit, I am a little leery of this possibility. To be sure, there are great uses for aggregation and resyndication. This blog is resyndicated on a site dedicated to the future of the academic cyberinfrastructure, and I’m honored that someone thought to include this modest blog among so many terrific blogs charting the frontiers of libraries, technology, and research. On the other hand, even before I started this blog I had experiences where content from my site appeared somewhere else for less virtuous reasons. I don’t have time to tell the full story here, but in 2005 an unscrupulous web developer used text from my website and a small trick called a “302 redirect” to boost the Google rankings of one of his clients. It was more amusing than infuriating—for a while a dentist in Arkansas had my bio instead of his. More seriously, millions of spam blogs scrape content from legitimate blogs, a process made much easier if you provide a full feed. And there are dozens of feed aggregators that will create a website from other people’s content without their permission. Regardless of the purpose, above board or below, I have no way of knowing about readers or subscribers to my blog when it appears in these other contexts.

But these concerns do not outweigh the spirit and practical advantages of a full feed. So enjoy the new feed—unless you’re that Arkansas dentist.

Part 9: The Conclusion

Creating a Blog from Scratch, Part 7: Tags, What Are They Good For?

Evidently quite a few things. In the past few years, tags have been attached to virtually everything, from web links to photos to bars. The University of Pennsylvania has recently introduced a way for those on campus to tag items in their online catalog, Franklin. With the arrival of the Zotero server this year, it will be possible for the community of Zotero users to collaboratively tag almost any object of research, from books to sculptures to letters. For their promoters, tags are a low-cost, democratic advance over traditional systems of cataloging. Detractors disparage tags as lacking the rigor of those tried-and-true methods. As I started to think about the composition of this blog, all I wanted to know was, why do so many blogs have tags all over them and what function or functions do they serve? Do I need them? What are they good for?

I have to admit that when I started this blog I had a visceral dislike of tags, probably because I was approaching them from the perspective of an academic who liked the precision and professionalism of the card catalog and encyclopedia. Tags seemed fatally flawed as putative successors to Library of Congress subject headings or the indexes in the back of books. I still believe the much-ballyhooed “tag clouds,” or sets of tags of various sizes arranged in a pattern to show the contents of a blog, book, or site, are poor substitutes for a good index of a work—not only because indexes are usually done by professionals who know what to highlight and how to summarize those topics, but also because indexes tell little stories through their levels, modifiers, and page numbers. For instance, here’s a section of the index the talented Jim O’Brien did for my book Equations from God:

Euclid, 165; in mathematics education, 147, 148, 214n185; Elements by, 21, 106, 138, 179, 180, 214n185; long-lasting influence of, 21, 58, 79, 147, 164, 174; waning influence of, in late Victorian era, 138, 148, 164, 178-179, 180 (see also non-Euclidean geometry)

At a glance you can tell the story line about Euclid—the ancient Greek mathematician’s incredibly long relevance (well into the modern era), and his eventual fall from grace in the nineteenth century in the face of a new kind of geometry. Some have proposed adding the hierarchical levels and other index-like features to tags to approach this level of usefulness, but that misses the point of tagging: it works because it’s done in a simple, generally offhand way. Add a lot of thought and hurdles to the process, and you’ll kill tagging. Tagging is a classic case of the “good enough” besting the “perfect” in new media.

Despite my hesitancy, I figured that there must be some reason to use tags on this blog. So I included them in the database but chose, due to my initial aversion, not to show them all over my site like many blogs do. They would just sit in the background and in the RSS feed. It turned out that was a very good compromise as I began to appreciate that tags are good at some functions that traditional taxonomies don’t address.

Much of the antagonism between the promoters and detractors of tags seems to arise from the sense—I believe, the incorrect sense—that they are competitors for the same market. But when you look at tags in actual use, it’s clear that they serve a number of functions that are distinct from traditional cataloging functions and that make them poor replacements for high-quality categorization.

For example, look at the variety of tags on a highly used folksonomic site like del.icio.us, the granddaddy of social bookmarking. To be sure, there are some fine categorizations of websites. But del.icio.us also harbors a large number of tags with other aims. Coexisting with tags that might be at home in a Library of Congress subject heading (e.g., “history”) are tags like “readlater” (busy people marking a site as worth going back to when they get the chance), “hist301” (a tag used by students in a particular class for a particular semester), “natn” (used by listeners of the podcast “Net at Nite” to submit websites to the hosts for consideration), and of course every possible variation of “cool” (to signify a site’s…coolness).

Awareness of these other kinds of tags made me realize that what distinguishes tags from traditional forms of categorization, aside from the obvious amateur/democratic vs. professional distinction, is that while both are forms of description, tags often have specific audiences and time frames in mind, while traditional categorizations (such as Library of Congress subject headings) have only a vague general audience in mind and try to be as timeless as possible.

This distinction is particularly true when you realize that tags are strongly interwoven with feeds (RSS). Since people can subscribe to the feed of a tag, tagging a blog post in effect places it into a live, running stream of alerts to an awaiting audience. Want to alert John Musser, who maintains the list of APIs I have frequently referred to in this space, about a new API? Just tag a blog post “API” or “APIs” and I suspect John will hear about it very soon, as will a very large audience of those interested in knitting together information on the web.

Thus tags have a great utility on the “live” web, as the blog search engine Technorati calls it, as well as for personal uses of an individual or microaudiences like a college class or even for inane commentary (“awesome”). Yet I still feel that as an entrée into a blog, as the equivalent of scanning a table of contents or the index of a book, they are fairly poor. I had planned to expose my internal tags of posts to the audience of this blog in some “traditional” blog way—at the bottom of each post, down the left sidebar, in a tag cloud—but it didn’t seem helpful. If someone wants to find all of my posts on copyright, they can search for them in the upper right search box. And the tag clouds I’ve tried all seem to misrepresent the overall thrust of this blog since (like everyone else using tags) I haven’t put a lot of thought into the tags.
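For what it’s worth, the mechanics behind the tag clouds I experimented with are simple, which is part of why they are so easy to produce and so easy to misread. A minimal sketch (my own illustration, not the code of any particular blog platform): count how often each tag appears and scale its font size linearly between a minimum and maximum.

```python
from collections import Counter

def tag_cloud(tagged_posts, min_px=12, max_px=32):
    """Map each tag to a font size (in pixels) scaled by its frequency.

    tagged_posts is a list of tag lists, one per post.
    """
    counts = Counter(tag for tags in tagged_posts for tag in tags)
    lo, hi = min(counts.values()), max(counts.values())
    span = (hi - lo) or 1  # avoid dividing by zero when all counts match
    return {tag: min_px + round((n - lo) * (max_px - min_px) / span)
            for tag, n in counts.items()}
```

Note that the cloud reflects nothing but raw frequency of offhand labels, which is precisely why it misrepresents the thrust of a blog whose author, like me, never put much thought into the tags.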

My hunch early on was that tags are best heard from but not seen, and I think I was mostly right about that.

Next up in the series: I make my first change to the blog, from a partial feed to a full feed, and explain the advantages and disadvantages of both—and why I’ve decided to switch.

Part 8: Full Feeds vs. Partial Feeds