Archives, Libraries, Museums, Science Fiction

Haunted by the Past


Top: The Scarif Archive in Rogue One / Bottom: Robotic storage facility in the Mansueto Library at the University of Chicago

Ever since Jyn Erso and Cassian Andor extracted the Death Star plans from a digital repository on the planet Scarif in Rogue One, libraries, archives, and museums have played an important role in tentpole science fiction films. From Luke Skywalker’s library of Jedi wisdom books in The Last Jedi, to Blade Runner 2049’s multiple storage media for DNA sequences, to a fateful scene in an ethnographic museum in Black Panther, the imposing and evocative halls of cultural heritage organizations have been in the foreground of the imagined future.

There have been scattered instances of cultural memory institutions in such films in the past—my colleagues in the library will recall, with some eye-rolling, the librarian Jocasta Nu in Star Wars, Episode II: Attack of the Clones—but the appearances of these institutions in recent speculative fiction on the screen seem especially relevant and rich, and central to the films’ plots.

Which raises the question: Why are today’s science fiction films obsessed with libraries, archives, and museums?

The answer of course is rooted in how science fiction has always pursued a heightened understanding of our very real present. At the same time that these movies portray an imagined future, they are also exploring our current anxiety about the past and how it is stored; how we simultaneously wish to leave the past behind, and how it may also be impossible to shake it. They indicate that we live in an age that has an extremely strained relationship with history itself. These films are processing that anxiety on Hollywood’s big screen at a time when our small screens, social media, and browser histories document and preserve so much of what we do and say.

Luke Skywalker’s collection of rare books in The Last Jedi neatly captures the tension inherent in these movies. In an egg-shaped stone hut reminiscent of (and indeed filmed in) the rural parts of western Ireland where Christian monasteries were established in the Middle Ages, Luke’s archive of Jedi books represents a profound bond to the traditional wisdom of the Jedi cult. Yet as the movie proceeds, it is clear that these volumes are also a strong link in the chain that holds Luke back. Ultimately his little library is not a source of knowledge, but one of angst. It makes him surly and disassociated from present possibilities, and he must ultimately sever himself from the past that is encapsulated in paper. Burning the books becomes a necessary precursor to his taking action, and to moving to the metaphysical (and more real) plane of the Jedi.

Black Panther uses two characters, rather than one, to embody the tense dynamic between setting history aside and being unable to let it go: the dueling figures of T’Challa (Black Panther) and N’Jadaka (Erik Killmonger). T’Challa understands that black people have been abused and enslaved, globally, for centuries. And yet he imagines a day when Wakanda steps beyond this past, and integrates its society and advanced technology with the outside world that has done so much wrong to them. He is a forward-looking optimist.

N’Jadaka, on the other hand, seethes with anger about the past, and how it is so vividly documented in the halls of cultural heritage institutions. Before he declines into a more monochromatic villain, he experiences frankly justifiable rage at what whites have done with black culture—namely, stolen and stored it like an alien, and lesser, culture, in glass-cased museums. A pivotal scene in one such museum reflects the troubled genesis of institutions such as the Pitt Rivers Museum, which collected artifacts of non-white culture from the British Empire to be viewed and dissected by professors in Oxford.

In one of the most memorable lines of Public Enemy’s It Takes a Nation of Millions to Hold Us Back, the seminal rap album that documents what happened to African slaves and their descendants in the United States, Flavor Flav shouts “I got a right to be hostile!” given this terrible history. A poster of that album is on the wall of N’Jadaka’s father’s apartment in Oakland, and it frames, like the glass case in the museum, the young man’s views of the world in which his ancestors have been constantly subjugated.

Blade Runner 2049 is even more unrelentingly pessimistic about the future and its connection to the past. In the movie’s opening, we are told that the documentary evidence of that past has been wiped out in a catastrophic electronic pulse that destroyed digital photographs and electronic records. As we learn, however, not all archives are lost. While personal images and documents that were never printed are gone forever, some plutocratic corporations maintain archival records, and we see several of them in the film: digital media as well as formats encased in glass spheres and more recognizable microfilm. Nevertheless, these archives are imperfect, like so much in the film. Even a leather-bound handwritten book of records in a wasteland orphanage has critical pages ripped out.

Because it is based on the work of Philip K. Dick, who was obsessed with libraries as part of a larger obsession with memory and reality, Blade Runner 2049 ultimately binds not only the past and present together, but the archival and the alive. Humans and replicants, the movie seems to argue, are simply incarnations of archival records, fleshy beings made up of the synthetic or parental DNA that forms their core information architecture and the libraries of memories that are either fabricated or lived. This uneasy fusion is at the dark core of the film and its philosophical examination of the permeable boundary between the real and the artificial.

For all of these films, the past constantly threatens to come back to haunt the present. (Just ask those on the Death Star.) In turn, these big-screen portrayals of imagined libraries, archives, and museums should make us reconsider how what we preserve and make accessible reflects—and perhaps determines—who we really are.


Archives, History, Twitter

The Significance of the Twitter Archive at the Library of Congress

It started with some techies casually joking around, and ended with the President of the United States being its most avid user. In between, it became the site of comedy and protest, several hundred million human users and countless bots, the occasional exchange of ideas and a constant stream of outrage.

All along, the Library of Congress was preserving it all. Billions of tweets, saved over 12 years, now rub shoulders with books, manuscripts, recordings, and film among the Library’s extensive holdings.

On December 31, however, this archiving will end. The day after Christmas, the Library announced that it would no longer save all tweets after that date, but would instead choose tweets to preserve “on a very selective basis,” for major events, elections, and political import. The rest of Twitter’s giant stream will flow by, untapped and ephemeral.

The Twitter archive may not be the record of our humanity that we wanted, but it’s the record we have. Due to Twitter’s original terms of service and the public availability of most tweets, which stand in contrast to many other social media platforms, such as Facebook and Snapchat, we are unlikely to preserve anything else like it from our digital age.

Undoubtedly many would consider that a good thing, and would say that the Twitter archive deserves the kind of mockery that flourishes on the platform itself. What can we possibly learn from the unchecked ramblings and ravings of so many, condensed to so few characters?

Yet it’s precisely this offhandedness and enforced brevity that make the Twitter archive intriguing. Researchers have precious few sources for the plain-spoken language and everyday activities and thought of a large swath of society.

Most of what is archived is indeed chosen on a very selective basis, assessed for historical significance at the time of preservation. Until the rise of digital documents and communications, the idea of “saving it all” seemed ridiculous, and even now it seems like a poor strategy given limited resources. Archives have always had to make tough choices about what to preserve and what to discard.

However, it is also true that we cannot always anticipate what future historians will want to see and read from our era. Much of what is now studied from the past is material that somehow, fortunately, escaped the trash bin. Cookbooks give us a sense of what our ancestors ate and celebrated. Pamphlets and, more recently, zines document ideas and cultures outside the mainstream.

Historians have also used records in unanticipated ways. Researchers have come to realize that the Proceedings of the Old Bailey, transcriptions from London’s central criminal court, are the only record we have of the spoken words of many people who lived centuries ago but were not in the educated or elite classes. That we have them talking about the theft of a pig rather than the thought of Aristotle only gives us greater insight into the lived experience of their time.

The Twitter archive will have similar uses for researchers of the future, especially given its tremendous scale and the unique properties of the platform behind the short messages we see on it. Preserved with each tweet, but hidden from view, is additional information about tweeters and their followers. Using sophisticated computational methods, it is possible to visualize large-scale connections within the mass of users that will provide a good sense of our social interactions, communities, and divisions.

Since Twitter launched a year before the release of the iPhone, and flourished along with the smartphone, the archive is also a record of what happened when computers evolved from desktop to laptop to the much more personal embrace of our hands.

Since so many of us now worry about the impact of these devices and social media on our lives and mental health, this story and its lessons may ultimately be depressing. As we are all aware, of course, history and human expression are not always sweetness and light.

We should feel satisfied rather than dismissive that we will have a dozen years of our collective human expression to look back on, the amusing and the ugly, the trivial and, perhaps buried deep within the archive, the profound.

Archives, Crowdsourcing, Digital Public Library of America, History, Open Access

Roy’s World

In one of his characteristically humorous and self-effacing autobiographical stories, Roy Rosenzweig recounted the uneasy feeling he had when he was working on an interactive CD-ROM about American history in the 1990s. The medium was brand new, and to many in academia, superficial and cartoonish compared to a serious scholarly monograph.

Roy worried about how his colleagues and others in the profession would view the shiny disc on the social history of the U.S., and his role in creating it. After a hard day at work on this earliest of digital histories, he went to the gym, and above his treadmill was a television tuned to Entertainment Tonight. Mary Hart was interviewing Fabio, fresh off the great success of his “I Can’t Believe It’s Not Butter” ad campaign. “What’s next for Fabio?” Hart asked him. He replied: “Well, Mary, I’m working on an interactive CD-ROM.”

Roy Rosenzweig

Ten years ago today Roy Rosenzweig passed away. Somehow it has now been longer since he died than the period of time I was fortunate enough to know him. It feels like the opposite, given the way the mind sustains so powerfully the memory of those who have had a big impact on you.

The field that Roy founded, digital history, has also aged. So many more historians now use digital media and technology to advance their discipline that it no longer seems new or odd like an interactive CD-ROM.

But what hasn’t changed is Roy’s more profound vision for digital history. If anything, more than ever we live in Roy’s imagined world. Roy’s passion for open access to historical documents has come to fruition in countless online archives and the Digital Public Library of America. His drive to democratize not only access to history but also the historical record itself—especially its inclusion of marginalized voices—can be seen in the recent emphasis on community archive-building. His belief that history should be a broad-based shared enterprise, rather than the province of the ivory tower, can be found in crowdsourcing efforts and tools that allow for widespread community curation, digital preservation, and self-documentation.

It still hurts that Roy is no longer with us. Thankfully his mission and ideas and sensibilities are as vibrant as ever.

Archives, DPLA, Libraries, Museums, News

The Digital Public Library of America, Me, and You

Twenty years ago Roy Rosenzweig imagined a compelling mission for a new institution: “To use digital media and computer technology to democratize history—to incorporate multiple voices, reach diverse audiences, and encourage popular participation in presenting and preserving the past.” I’ve been incredibly lucky to be a part of that mission for over twelve years, at what became the Roy Rosenzweig Center for History and New Media, with the last five and a half years as director.

Today I am announcing that I will be leaving the center, and my professorship at George Mason University, the home of RRCHNM, but I am not leaving Roy’s powerful vision behind. Instead, I will be extending his vision—one now shared by so many—on a new national initiative, the Digital Public Library of America. I will be the founding executive director of the DPLA.

The DPLA, which you will be hearing much more about in the coming months, will be connecting the riches of America’s libraries, archives, and museums so that the public can access all of those collections in one place; providing a platform, with an API, for others to build creative and transformative applications upon; and advocating strongly for a public option for reading and research in the twenty-first century. The DPLA will in no way replace the thousands of public libraries that are at the heart of so many communities across this country, but instead will extend their commitment to the public sphere, and provide them with an extraordinary digital attic and the technical infrastructure and services to deliver local cultural heritage materials everywhere in the nation and the world.

The DPLA has been in the planning stages for the last few years, but is about to spin out of Harvard’s Berkman Center for Internet and Society and move from vision to reality. It will officially launch, as an independent nonprofit, on April 18 at the Boston Public Library. I will move to Boston with my family this summer to lead the organization, which will be based there. It is such a great honor to have this opportunity.

Until then I will be transitioning from my role as director of RRCHNM, and my academic life at Mason. Everything at the center will be in great hands, of course; as anyone who visits the center immediately grasps, it is a highly collaborative and nonhierarchical place with an amazing staff and an especially experienced and innovative senior staff. They will continue to shape “the future of the past,” as Roy liked to put it. I will miss my good friends at the center, but I still expect to work closely with them, since so many critical software initiatives, educational projects, and digital collections are based at RRCHNM. A search for a new director will begin shortly. I will also greatly miss my colleagues in Mason’s wonderful Department of History and Art History.

At the same time, I look forward to collaborating with new friends, both in the Boston office of the DPLA and across the United States. The DPLA is a unique, special idea—you don’t get to build a massive new library every day. It is apt that the DPLA will launch at the Boston Public Library’s McKim Building, with those potent words carved into stone above its entrance: “Free to all.” The architect Charles Follen McKim rightly called it “a palace for the people,” where anyone could enter to learn, create, and be entertained by the wonders of books and other forms of human expression.

We now have the chance to build something like this for the twenty-first century—a rare, joyous possibility in our too-often cynical age. I hope you will join me in this effort, with your ideas, your contributions, your energy, and your public spirit.

Let’s build the Digital Public Library of America together.

Archives, Pedagogy, Text Mining

A Million Syllabi

Today I’m releasing a database of over a million syllabi gathered by my Syllabus Finder tool from 2002 to 2009. My hope is that this unique corpus will be helpful for a broad range of researchers. I’m fairly sure this is the largest collection of syllabi ever gathered, probably by several orders of magnitude.

I created the Syllabus Finder in 2002 when Google released their first API to access their search engine. The initial API included the ability to grab cached HTML from millions of web pages, which I realized could then be scanned using high-relevancy keywords to identify pages that were most likely syllabi. In addition to my lousy PHP code that got it up and running, the brilliant Simon Kornblith wrote some additional code to make it work well. The result was a tool that was quite popular (1.3 million queries) until Google deprecated their original API in 2009 in favor of (what I consider to be) a less useful API. (With the original API you could basically clone the search engine, which I’m sure was not popular at the Googleplex.)

If you are interested in the kind of research that can be done on these syllabi, please read my Journal of American History article “By the Book: Assessing the Place of Textbooks in U.S. Survey Courses.” For that article I used regular expressions to pull book titles out of a thousand American history surveys to see how textbooks and other works are used by instructors. Some hidden elements emerged. I’m excited to see what creative ideas other scholars and researchers come up with for this large database.
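For readers who want to try something similar, here is a minimal sketch of regex-based title extraction. It is not the code used for the article: the pattern (treating text inside `<i>`, `<em>`, or `<cite>` tags as a candidate title), the function names, and the sample pages are all assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical heuristic: book titles on syllabus pages are often italicized,
# so capture the contents of <i>, <em>, and <cite> elements.
TITLE_PATTERN = re.compile(
    r"<(i|em|cite)\b[^>]*>(.*?)</\1\s*>", re.IGNORECASE | re.DOTALL
)

def extract_titles(html):
    """Return candidate titles found in one cached HTML page."""
    titles = []
    for _tag, inner in TITLE_PATTERN.findall(html):
        text = re.sub(r"<[^>]+>", "", inner)   # strip any nested tags
        text = " ".join(text.split())          # normalize whitespace
        if len(text.split()) >= 2:             # skip one-word emphasis such as "Note"
            titles.append(text)
    return titles

def count_titles(pages):
    """Tally candidate titles across many pages."""
    counts = Counter()
    for html in pages:
        counts.update(extract_titles(html))
    return counts

# Invented sample pages, for illustration only.
sample_pages = [
    "<p>Required: <i>A Hypothetical Survey Text</i>, 3rd ed. <i>Note</i>: bring it to class.</p>",
    "<li><em>A Hypothetical  Survey Text</em></li><li><cite>Imaginary Primary Sources</cite></li>",
]
counts = count_titles(sample_pages)
```

A heuristic like this will pick up italicized phrases that are not books, so any real analysis would need hand-checking of the results.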

Some important clarifications and caveats:

1) I’m providing this archive in the same spirit (and under the same regulations) that the Internet Archive provides web corpora (indeed, this corpus could probably be recreated from the Internet Archive’s Wayback Machine, albeit after a lot of work). To the best of my knowledge, and because of the way they were obtained, all of the documents this database contains were posted on the open web, and were cached (or not) respecting open-web standards such as robots.txt. It does not contain any syllabi that were posted in private places, such as gated Blackboard installations. Indeed, I suspect that most of these syllabi come from universities where it is expected that professors post syllabi in an open fashion (as is the case here at Mason), or from professors like me who believe that openness is good for scholarship and teaching. But as with the Internet Archive, if you are the creator of a syllabus and really can’t sleep unless it is purged from this research database, contact me.

2) This database is provided as is and without support. I get enough email and unfortunately cannot answer questions. If you are appreciative, you can make a tax-free donation to the Center for History and New Media, for which you will receive a hug from me. The database is intended for non-commercial use of the type seen in my JAH article.

3) The database is an SQL dump consisting of 1.4 million rows. The columns are syllabiID (the Syllabus Finder’s unique identifier), url (web address of the syllabus at the time it was found), title (of the web page the syllabus was on), date_added (when it was added to the Syllabus Finder database), and chnm_cache (the HTML of the page on the date it was added). The database is 804 MB uncompressed. The corpus is heavily U.S.-centric because web pages were matched to English-language words, and for a time the Syllabus Finder only took pages from .edu domains (thus leaving out, e.g., URLs from non-.edu domains).

4) Because the Syllabus Finder was completely automated, some percentage of the 1.4 million documents are not syllabi (my best guess is about 20%). Most often these incorrect matches are associated course documents such as assignments, which are interesting in their own right. But some are oddball documents that just looked like syllabi to the algorithms. I have made no attempt to weed them out.
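To make the column layout described above concrete, here is a small sketch that builds an in-memory SQLite table with the same five columns and runs two sanity checks. The table name `syllabi`, the column types, and the sample rows are assumptions; the dump itself may use a different table name and SQL dialect.

```python
import sqlite3
from urllib.parse import urlparse

conn = sqlite3.connect(":memory:")
# Column names follow the description above; the table name and types are assumed.
conn.execute(
    """CREATE TABLE syllabi (
           syllabiID  INTEGER PRIMARY KEY,
           url        TEXT,
           title      TEXT,
           date_added TEXT,
           chnm_cache TEXT
       )"""
)

# Invented rows, for illustration only.
rows = [
    (1, "http://history.example.edu/hist101/syllabus.html", "HIST 101", "2003-09-01", "<html>Syllabus</html>"),
    (2, "http://history.example.edu/hist202/", "HIST 202", "2005-01-15", ""),
    (3, "http://english.example.edu/eng150.html", "ENG 150", "2007-08-20", "<html>Course schedule</html>"),
]
conn.executemany("INSERT INTO syllabi VALUES (?, ?, ?, ?, ?)", rows)

def rows_per_host(conn):
    """Count rows by the hostname in the url column."""
    counts = {}
    for (url,) in conn.execute("SELECT url FROM syllabi"):
        host = urlparse(url).hostname or ""
        counts[host] = counts.get(host, 0) + 1
    return counts

def missing_cache_count(conn):
    """Count rows whose cached HTML is NULL or empty."""
    query = "SELECT COUNT(*) FROM syllabi WHERE chnm_cache IS NULL OR chnm_cache = ''"
    return conn.execute(query).fetchone()[0]

hosts = rows_per_host(conn)
missing = missing_cache_count(conn)
```

Checking for empty chnm_cache values before running any text analysis is a cheap way to confirm how much of the HTML is actually present in a given export.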

If you understand all of this clearly, then here’s a million syllabi for you: CHNM Syllabus Finder Corpus, Version 1.0 (30 March 2011) (265 MB download, zipped SQL file)

UPDATE 1 (11pm 3/30/11): Matt Burton has helpfully provided a torrent for this file. If you can, please use it instead of the direct download.

UPDATE 2 (9pm 3/31/11): Unfortunately I should have checked the exported database before posting. Version 1.0 does indeed have the URLs, titles, and dates of about 1.45 million syllabi but it is missing a majority of the HTML caches of those syllabi. I am working to recreate the full database, which will be much larger and more useful.

Archives, Custom

Digital Ephemera and the Calculus of Importance

[Thoughts prompted by an invitation to write a piece on the significance of “Notes, Lists, and Everyday Inscriptions” for The New Everyday, an innovative experiment in web publishing sponsored by MediaCommons. Since the editors of this edition of The New Everyday asked for something out of the ordinary for their curated collection, I thought it was time to unveil my Gladwell-esque theory of how criminal profiling and archival priorities share a mathematical foundation.]

How important are small written ephemera such as notes, especially now that we create an almost incalculable number of them on digital services such as Twitter? Ever since the Library of Congress surprised many with its announcement that it would accession the billions of public tweets since 2006, the subject has been one of significant debate. Critics lamented what they felt was a lowering of standards by the library—a trendy, presentist diversion from its national mission of saving historically valuable knowledge. In their minds, Twitter is a mass of worthless and mundane musings by the unimportant, and thus obviously unworthy of an archivist’s attention. The humorist Andy Borowitz summarized this cultural critique in a mocking headline: “Library of Congress to Acquire Entire Twitter Archive; Will Rename Itself ‘Museum of Crap.’”

Few readers of this blog will be surprised to find that I take a rather different view of the matter. How could we not want to preserve a vast record of everyday life and thoughts from tens of millions of people, however mundane? (For more on my views of the Twitter/Library of Congress debate, and to inflate my ego, please consult articles from the New York Times, the Washington Post, and Slate.)

As any practicing historian knows, some of the most critical collections of primary sources are ephemera that someone luckily saved for the future. For example, historians of the English Civil War are deeply thankful that Humphrey Bartholomew had the presence of mind to save 50,000 pamphlets (once considered throwaway pieces of hack writing) from the seventeenth century and give them to a library at Oxford. Similarly, I recently discovered during a behind-the-scenes tour of the Cambridge University Library that the library’s off-limits tower, long rumored by undergraduates to be filled with pornography, is actually stocked with old genre fiction such as Edwardian spy novels. (See photographic evidence, below.) Undoubtedly the librarians of 1900 were embarrassed by the stuff; today, social historians and literary scholars can rejoice that they didn’t throw these cheap volumes out. As I have argued in this space, scholars have uses for archives that archivists cannot anticipate.

But let me set aside for a moment my optimistic disposition about the Twitter archive and instead meet the critics halfway. Suppose that we really don’t know if the archive will be useful or not—or worse, perhaps we are relatively sure it will be utterly worthless. Does that necessarily mean that the Library of Congress should not have accessioned it? I was thinking about this fair-minded version of the “What to save?” conundrum recently when I remembered a penetrating article about criminal profiling, which, of all things, helpfully reveals the correct calculus about the importance of digital ephemera such as tweets.

* * *

The act of stopping certain air travelers for additional checks—to give them more costly attention—is a difficult task riven by conflicting theories of whom to check and (as mathematicians know) associated search algorithms. Do utterly random checks work best? Should the extra searches focus on certain groups or certain bits of information (one-way tickets, cash purchases)? Many on the right (which is also home, I suspect, to many of the critics who scoff at the Twitter archive) believe in strong profiling—that is, spending nearly the entire budget and time of the Transportation Security Administration profiling Middle Easterners and Muslims. Many on the left counter that this strong profiling leads to insidious stereotyping.

A more powerful critique of strong profiling was advanced last year by the computational statistician William Press in “Strong Profiling is Not Mathematically Optimal for Discovering Rare Malfeasors” (Proceedings of the National Academy of Sciences, 2009). Press acknowledges that the issue of profiling (whether for terrorists at the airport or for criminals in a traffic stop) has enormous social and political implications. But he seeks to answer a more basic question: does strong profiling actually work? Or is there a more optimal mathematical formula for spending scarce time and resources to achieve the desired outcome?

Press examines two idealized mathematical cases. The first, the “authoritarian” strategy, assumes that we have perfect surveillance of society and precisely know the odds that someone will be a criminal (and thus worthy of additional screening). The second, the “democratic” strategy, assumes that our knowledge of people is messy and incomplete. In that case of imperfect information the mathematics is much more complex, because we can’t assign a reliable probability of criminality to each person and then give them security attention at an intensity commensurate to that value. It turns out that in the democratic case, the fuzzier mathematics strongly suggest a broader range of attention.

Moreover, even beyond the obvious fact that the democratic model is closest to real life, the democratic algorithm for profiling is better than the authoritarian model, even if that state of omniscient knowledge were achievable. Even if we had Minority Report-style knowledge, or even if we believed that the universe of potential criminals was entirely a subset of a particular group, it would be unwise to fully rely on this knowledge. To do so would lead to “oversampling,” an inefficient overemphasis on particular individuals. Of course we should pay attention to those with the maximum probability of being a criminal. But we also have to mix into our algorithm some attention to those who are seemingly innocent to achieve the best outcome—to stop the most crimes.

Through some mathematics we need not get into here, Press concludes that the optimal formula for paying attention to subjects is to avoid using the straight probability that each person is a criminal and instead use the square root of that value. For instance, if you feel Person A is 100 times more likely to be a terrorist than Person B, you should spend 10 times, not 100 times, the resources on Person A over Person B. Moreover, as our certainty about potential suspects decreases, the democratic sampling model becomes increasingly efficient compared to the authoritarian model.
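For those who do want a taste of the arithmetic, here is a minimal sketch of square-root allocation. It assumes a simplified model that may not match the paper's full treatment: each check is drawn independently, so a person sampled with probability q takes 1/q checks on average to reach. The prior values and function names are invented for illustration.

```python
import math

def sqrt_weights(priors):
    """Sampling probabilities proportional to the square root of each prior."""
    roots = [math.sqrt(p) for p in priors]
    total = sum(roots)
    return [r / total for r in roots]

def expected_checks(priors, sampling):
    """Expected number of checks before the one malfeasor is sampled,
    assuming independent draws (an assumption of this sketch)."""
    return sum(p / q for p, q in zip(priors, sampling))

# Hypothetical priors: Person A is judged 100 times more likely than each of three others.
raw = [100.0, 1.0, 1.0, 1.0]
priors = [x / sum(raw) for x in raw]

uniform = [1.0 / len(priors)] * len(priors)   # purely random checks
strong = list(priors)                         # strong profiling: attention proportional to prior
root = sqrt_weights(priors)                   # square-root allocation

ratio = root[0] / root[1]                     # attention on A relative to one other person
cost_uniform = expected_checks(priors, uniform)
cost_strong = expected_checks(priors, strong)
cost_root = expected_checks(priors, root)
```

Under these assumptions, Person A receives 10 times the attention of each other person rather than 100 times, strong profiling does no better than purely random checks, and the square-root allocation has the lowest expected cost of the three.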

Although couched in the language of crime prevention, what Press is really talking about is the calculus of importance. As Press himself notes, “The idea of sampling by square-root probabilities is quite general and can have many other applications.”

* * *

As it turns out, the calculus of importance is the same for the Transportation Security Administration and for the Library of Congress. Press’s conclusions apply directly to the archivist’s dilemma of how to spend limited resources on saving objects in a digital age. The criminals in our library scenario are people or documents likely to be important to future researchers; innocents are those whom future historians will find uninteresting. Additional screening is the act of archiving—that is, selection for greater attention.

What does this mean for the archiving of digital ephemera such as status updates—those little, seemingly worthless online notes? It means we should continue to expend the majority of resources on those documents and people of most likely future interest, but not to the exclusion of objects and figures that currently seem unimportant.

In other words, if you believe that the notebooks of a known writer are likely to be 100 times more important to future historians and researchers than the blog of a nobody, you should spend 10, not 100, times the resources in preserving those notebooks over the blog. It’s still a considerable gap, but much less than the traditional (authoritarian) model would suggest. The calculus of importance thus implies that libraries and archives should consciously pursue contents such as those in the Cambridge University Library tower, even if they feel it runs counter to common sense.

So even if the skeptics are right and the Twitter archive is a boondoggle for the Library of Congress, it is the correct kind of bet on the future value of digital ephemera, the equivalent of the TSA spending 10% of their budget to examine more closely threats other than those posed by twentysomething Arabs.

The accessioning of the Twitter archive by the Library of Congress is not an expensive affair. Tweets are small digital objects, and even billions of them fit on a few cheap drives. Even with digital asset management, IT labor across time, and electricity costs, storing billions of tweets is economical, especially compared to the cost of storing physical books. University of Michigan Librarian Paul Courant has calculated that the present value of the cost to store a book on library shelves in perpetuity is about $100 (mostly in physical plant costs). An equivalent electronic text costs just $5.

This vast disparity only serves to reinforce the calculus of importance and archival imperatives of institutions such as the Library of Congress. The library and other keepers of our cultural heritage should be doing much more to save the digital ephemera of our age, no matter what we contemporaries think of these scrawls on the web. You never know when a historian will pan a bit of gold out of that seemingly worthless stream.

Archives, History

Virtual Museum of the Gulag Seized

Depressing and not getting enough notice: masked police recently raided the office of the Russian human rights group Memorial, which has been digitally cataloguing the artifacts and names of those affected by the Soviet Gulag. The police took drives containing biographical information on more than 50,000 victims of Stalinist repression and over 10,000 digital photographs, among other unique archival documents. We worked with Memorial on our Gulag history project. (Thanks to Elena Razlogova for bringing this to my attention.)