Custom – Dan Cohen

Digital Ephemera and the Calculus of Importance

[Thoughts prompted by an invitation to write a piece on the significance of “Notes, Lists, and Everyday Inscriptions” for The New Everyday, an innovative experiment in web publishing sponsored by MediaCommons. Since the editors of this edition of The New Everyday asked for something out of the ordinary for their curated collection, I thought it was time to unveil my Gladwell-esque theory of how criminal profiling and archival priorities share a mathematical foundation.]

How important are small written ephemera such as notes, especially now that we create an almost incalculable number of them on digital services such as Twitter? Ever since the Library of Congress surprised many with its announcement that it would accession the billions of public tweets since 2006, the subject has been one of significant debate. Critics lamented what they felt was a lowering of standards by the library—a trendy, presentist diversion from its national mission of saving historically valuable knowledge. In their minds, Twitter is a mass of worthless and mundane musings by the unimportant, and thus obviously unworthy of an archivist’s attention. The humorist Andy Borowitz summarized this cultural critique in a mocking headline: “Library of Congress to Acquire Entire Twitter Archive; Will Rename Itself ‘Museum of Crap.’”

Few readers of this blog will be surprised to find that I take a rather different view of the matter. How could we not want to preserve a vast record of everyday life and thoughts from tens of millions of people, however mundane? (For more on my views of the Twitter/Library of Congress debate, and to inflate my ego, please consult articles from the New York Times, the Washington Post, and Slate.)

As any practicing historian knows, some of the most critical collections of primary sources are ephemera that someone luckily saved for the future. For example, historians of the English Civil War are deeply thankful that Humphrey Bartholomew had the presence of mind to save 50,000 pamphlets (once considered throwaway pieces of hack writing) from the seventeenth century and give them to a library at Oxford. Similarly, I recently discovered during a behind-the-scenes tour of the Cambridge University Library that the library’s off-limits tower, long rumored by undergraduates to be filled with pornography, is actually stocked with old genre fiction such as Edwardian spy novels. (See photographic evidence, below.) Undoubtedly the librarians of 1900 were embarrassed by the stuff; today, social historians and literary scholars can rejoice that they didn’t throw these cheap volumes out. As I have argued in this space, scholars have uses for archives that archivists cannot anticipate.

But let me set aside for a moment my optimistic disposition about the Twitter archive and instead meet the critics halfway. Suppose that we really don’t know whether the archive will be useful, or worse, that we are relatively sure it will be utterly worthless. Does that necessarily mean that the Library of Congress should not have accessioned it? I was thinking about this fair-minded version of the “What to save?” conundrum recently when I remembered a penetrating article about criminal profiling, which, of all things, helpfully reveals the correct calculus about the importance of digital ephemera such as tweets.

* * *

The act of stopping certain air travelers for additional checks—to give them more costly attention—is a difficult task riven by conflicting theories of whom to check and (as mathematicians know) associated search algorithms. Do utterly random checks work best? Should the extra searches focus on certain groups or certain bits of information (one-way tickets, cash purchases)? Many on the right (which is also home, I suspect, to many of the critics who scoff at the Twitter archive) believe in strong profiling—that is, spending nearly the entire budget and time of the Transportation Security Administration profiling Middle Easterners and Muslims. Many on the left counter that this strong profiling leads to insidious stereotyping.

A more powerful critique of strong profiling was advanced last year by the computational statistician William Press in “Strong Profiling is Not Mathematically Optimal for Discovering Rare Malfeasors” (Proceedings of the National Academy of Sciences, 2009). Press acknowledges that the issue of profiling (whether for terrorists at the airport or for criminals in a traffic stop) has enormous social and political implications. But he seeks to answer a more basic question: does strong profiling actually work? Or is there a mathematically better formula for spending scarce time and resources to achieve the desired outcome?

Press examines two idealized mathematical cases. The first, the “authoritarian” strategy, assumes that we have perfect surveillance of society and precisely know the odds that someone will be a criminal (and thus worthy of additional screening). The second, the “democratic” strategy, assumes that our knowledge of people is messy and incomplete. In that case of imperfect information the mathematics is much more complex, because we can’t assign a reliable probability of criminality to each person and then give them security attention at an intensity commensurate to that value. It turns out that in the democratic case, the fuzzier mathematics strongly suggest a broader range of attention.

Moreover, even beyond the obvious fact that the democratic model is closest to real life, the democratic algorithm for profiling is better than the authoritarian model, even if that state of omniscience were achievable. Even if we had Minority Report-style knowledge, or even if we believed that the universe of potential criminals was entirely a subset of a particular group, it would be unwise to fully rely on this knowledge. To do so would lead to “oversampling,” an inefficient overemphasis on particular individuals. Of course we should pay attention to those with the maximum probability of being a criminal. But we also have to mix into our algorithm some attention to those who are seemingly innocent to achieve the best outcome: stopping the most crimes.

Through some mathematics we need not get into here, Press concludes that the optimal formula for paying attention to subjects is to avoid using the straight probability that each person is a criminal and instead use the square root of that value. For instance, if you feel Person A is 100 times more likely to be a terrorist than Person B, you should spend 10 times, not 100 times, the resources on Person A over Person B. Moreover, as our certainty about potential suspects decreases, the democratic sampling model becomes increasingly efficient compared to the authoritarian model.
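For readers who do want a glimpse of the mathematics, here is a minimal sketch of where the square root can come from. It is my own simplified reconstruction, not Press's full argument, and it rests on assumptions I am adding: each person i has a prior probability p_i of being the one malfeasor, each screening independently picks person i with probability q_i (sampling with replacement), and the goal is to minimize the expected number of screenings before the malfeasor is picked.

```latex
% Assumptions (mine, for illustration): priors p_i with \sum_i p_i = 1,
% screening probabilities q_i with \sum_i q_i = 1, independent draws.
\begin{align*}
  % If person i is the malfeasor, the wait is geometric with mean 1/q_i.
  \mathbb{E}[\text{screenings}] &= \sum_i \frac{p_i}{q_i} \\
  % Minimize subject to \sum_i q_i = 1 using a Lagrange multiplier \lambda.
  \frac{\partial}{\partial q_i}\Bigl(\sum_j \frac{p_j}{q_j}
      + \lambda \sum_j q_j\Bigr)
    &= -\frac{p_i}{q_i^{2}} + \lambda = 0 \\
  \Longrightarrow\quad q_i &= \frac{\sqrt{p_i}}{\sum_j \sqrt{p_j}} \\
  % The 100-to-1 example from the text:
  \frac{q_A}{q_B} &= \sqrt{\frac{p_A}{p_B}} = \sqrt{100} = 10
\end{align*}
```

Under these assumptions, attention proportional to p_i itself is not the minimizer; attention proportional to the square root of p_i is.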

Although couched in the language of crime prevention, what Press is really talking about is the calculus of importance. As Press himself notes, “The idea of sampling by square-root probabilities is quite general and can have many other applications.”

* * *

As it turns out, the calculus of importance is the same for the Transportation Security Administration and for the Library of Congress. Press’s conclusions apply directly to the archivist’s dilemma of how to spend limited resources on saving objects in a digital age. The criminals in our library scenario are people or documents likely to be important to future researchers; innocents are those whom future historians will find uninteresting. Additional screening is the act of archiving—that is, selection for greater attention.

What does this mean for the archiving of digital ephemera such as status updates, those little, seemingly worthless online notes? It means we should continue to expend the majority of resources on those documents and people of most likely future interest, but not to the exclusion of objects and figures that currently seem unimportant.

In other words, if you believe that the notebooks of a known writer are likely to be 100 times more important to future historians and researchers than the blog of a nobody, you should spend 10, not 100, times the resources in preserving those notebooks over the blog. It’s still a considerable gap, but much less than the traditional (authoritarian) model would suggest. The calculus of importance thus implies that libraries and archives should consciously pursue contents such as those in the Cambridge University Library tower, even if they feel it runs counter to common sense.
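To make the arithmetic concrete, here is a small sketch of how an archive might split a preservation budget under the two rules. The item names, the importance estimates, and the budget figure are all hypothetical, and the `allocate` function is my own illustration rather than anything taken from Press's paper.

```python
import math


def allocate(budget, importance, rule="sqrt"):
    """Split a budget across items according to estimated importance.

    rule="proportional" weights each item by its raw importance estimate
    (the "authoritarian" model); rule="sqrt" weights each item by the
    square root of that estimate (the "democratic" model).
    """
    if rule == "sqrt":
        weights = {name: math.sqrt(value) for name, value in importance.items()}
    elif rule == "proportional":
        weights = dict(importance)
    else:
        raise ValueError(f"unknown rule: {rule!r}")
    total = sum(weights.values())
    return {name: budget * weight / total for name, weight in weights.items()}


# Hypothetical estimates: the notebooks are judged 100 times more
# important to future researchers than the blog.
importance = {"writer_notebooks": 100.0, "unknown_blog": 1.0}

sqrt_plan = allocate(1100.0, importance, rule="sqrt")
prop_plan = allocate(1100.0, importance, rule="proportional")

for label, plan in (("square root", sqrt_plan), ("proportional", prop_plan)):
    ratio = plan["writer_notebooks"] / plan["unknown_blog"]
    print(f"{label}: {plan} (ratio {ratio:.0f} to 1)")
```

With these made-up numbers the square-root rule gives the blog 100 of the 1,100 units, while the proportional rule leaves it with about 11.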

So even if the skeptics are right and the Twitter archive is a boondoggle for the Library of Congress, it is the correct kind of bet on the future value of digital ephemera, the equivalent of the TSA spending 10% of their budget to examine more closely threats other than those posed by twentysomething Arabs.

The accessioning of the Twitter archive by the Library of Congress is not an expensive affair. Tweets are small digital objects, and even billions of them fit on a few cheap drives. Even with digital asset management, IT labor across time, and electricity costs, storing billions of tweets is economical, especially compared to the cost of storing physical books. University of Michigan Librarian Paul Courant has calculated that the present value of the cost to store a book on library shelves in perpetuity is about $100 (mostly in physical plant costs). An equivalent electronic text costs just $5.

This vast disparity only serves to reinforce the calculus of importance and archival imperatives of institutions such as the Library of Congress. The library and other keepers of our cultural heritage should be doing much more to save the digital ephemera of our age, no matter what we contemporaries think of these scrawls on the web. You never know when a historian will pan a bit of gold out of that seemingly worthless stream.

May 17, 2010 2 Comments

Introducing Digital Humanities Now

Do the digital humanities need journals? Although I’m very supportive of the new journals that have launched in the last year, and although I plan to write for them from time to time, there’s something discordant about a nascent field—one so steeped in new technology and new methods of scholarly communication—adopting a format that is struggling in the face of digital media.

I often say to non-digital humanists that every Friday at 5 I know all of the most important books, articles, projects, and news of the week—without the benefit of a journal, a newsletter, or indeed any kind of formal publication by a scholarly society. I pick up this knowledge by osmosis from the people I follow online.

I subscribe to the blogs of everyone working centrally or tangentially in digital humanities. As I have argued from the start, and against the skeptics and traditionalists who think blogs can only be narcissistic, half-baked diaries, these outlets are just publishing platforms by another name, and in my area there are an incredible number of substantive ones.

More recently, social media such as Twitter has provided a surprisingly good set of pointers toward worthy materials I should be reading or exploring. (And as happened with blogs five years ago, the critics are now dismissing Twitter as unscholarly, missing the filtering function it somehow generates among so many unfiltered tweets.) I follow as many digital humanists as I can on Twitter, and created a comprehensive list of people in digital humanities. (You can follow me @dancohen.)

For a while I’ve been trying to figure out a way to show this distilled “Friday at 5” view of digital humanities to those new to the field, or those who don’t have time to read many blogs or tweets. This week I saw a tweet from Tom Scheinfeldt (who in turn saw a tweet from James Neal) about a new service called Twittertim.es, which creates a real-time publication consisting of articles highlighted by people you follow on Twitter. I had a thought: what if I combined the activities of several hundred digital humanities scholars with Twittertim.es?

Digital Humanities Now is a new web publication that is the experimental result of this thought. It aggregates thousands of tweets and the hundreds of articles and projects those tweets point to, and boils everything down to the most-discussed items, with commentary from Twitter. A slightly longer discussion of how the publication was created can be found on the DHN “About” page.
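For the curious, the general idea of boiling a stream of tweets down to its most-discussed links can be sketched in a few lines. This is only an illustration of the concept, not the actual Twittertim.es or DHN algorithm, which I have not described here; the function name, the thresholds, and the sample tweets are all hypothetical.

```python
import re
from collections import Counter

# Deliberately simple URL matcher; real tweets would need more care
# (shortened links, trailing punctuation, and so on).
URL_PATTERN = re.compile(r"https?://\S+")


def most_discussed(tweets, top_n=3, min_mentions=2):
    """Rank linked items by how many tweets point to them.

    `tweets` is a list of (author, text) pairs. Returns up to `top_n`
    tuples of (url, mention_count, commentary), keeping only items
    mentioned at least `min_mentions` times.
    """
    counts = Counter()
    commentary = {}
    for author, text in tweets:
        # set() so a tweet that repeats the same link counts only once
        for url in set(URL_PATTERN.findall(text)):
            counts[url] += 1
            commentary.setdefault(url, []).append((author, text))
    ranked = [(url, n) for url, n in counts.most_common() if n >= min_mentions]
    return [(url, n, commentary[url]) for url, n in ranked[:top_n]]


# Hypothetical sample data.
tweets = [
    ("alice", "Great piece on grading http://example.org/grading"),
    ("bob", "Worth reading: http://example.org/grading"),
    ("carol", "New archive search http://example.org/law"),
    ("dave", "RT http://example.org/grading via alice"),
    ("erin", "On limited attention http://example.org/law"),
    ("frank", "My lunch http://example.org/lunch"),
]

front_page = most_discussed(tweets)
for url, mentions, notes in front_page:
    print(mentions, url, [author for author, _ in notes])
```

In this toy run the link mentioned once never reaches the front page, which is the filtering effect the publication depends on.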

Does the process behind DHN work? From the early returns, the algorithms have done fairly well, putting on the front page articles on grading in a digital age, bringing high-speed networking to liberal arts colleges, Google’s law archive search, and (appropriately enough) a talk on how to deal with streams of content given limited attention. Perhaps Digital Humanities Now will show a need for the light touch of a discerning editor. This could certainly be added on top of the raw feed of all items of interest (about 50 a day, out of which only 2 or 3 make it into DHN), but I like the automated simplicity of DHN 1.0.

Despite what I’m sure will be some early hiccups, my gut is that some version of this idea could serve as a rather decent new form of publication that focuses the attention of those in a particular field on important new developments and scholarly products. I’m not holding my breath that someday scholars will put an appearance in DHN on their CVs. But as I recently told an audience of executive directors of scholarly societies at an American Council of Learned Societies meeting, if you don’t do something like this, someone else will.

I suppose DHN is a prod to them and others to think about new forms of scholarly validation and attention, beyond the journal. Ultimately, journals will need the digital humanities more than we need them.

November 18, 2009 27 Comments

The Spider and the Web: Results

A couple of weeks ago at the Digital Dilemmas Symposium in New York I tried something new: using Twitter to replicate digitally the traditional “author’s query,” where a scholar asks readers of a journal for assistance with a research project. I believe the results of this experiment are instructive about the significant advantages—and some disadvantages—for academia of what has come to be known as crowdsourcing.

For those who didn’t follow this experiment live via Twitter, you should first read the two initial posts in this series. The experiment was fairly simple: I prepared followers of my blog and my Twitter feed (as of this writing I have roughly the same number of blog subscribers and Twitter followers, about 1,600 on each service) by noting that I would reveal a historical puzzle at a particular time. At the beginning of my talk in New York, my blog auto-posted the scan of an object found in a Victorian archaeological dig, which I simultaneously tweeted.

I asked those following me online to work together to figure out what the object was. Participants in the experiment could post live comments on Twitter, and others could follow along by searching for the #digdil09 hashtag. (A hashtag is a hopefully unique string of characters that enables a search of Twitter to reveal all comments at a specific conference or on a particular subject.) I encouraged everyone to talk to each other and leverage each other’s knowledge. In addition, I set up what in the age of the print journal would have been a ridiculous deadline: only one hour for the crowd to solve the mystery. For a bit of theater (“stunt lecturing”?) I flashed the Twitter stream behind me from time to time during my talk.

It took much less time than an hour for a solution: nine minutes, to be exact, for a preliminary answer and 29 minutes for a fairly rich description of the object to emerge from the collective responses of roughly a hundred participants. Solution: the object was an ornamental gorget from the Cahokia tribe.

[Image: tweet screenshot (spider_tweet_2)]

What happened along the way was as interesting as the result (which I have to admit was rather satisfying given the possibility of a live crowd in NYC laughing at me for using Twitter). First, Twitter was remarkably effective in multiplying my voice. Indeed, in the first five minutes about a dozen others on Twitter retweeted (rebroadcast) my mystery to their followers. This “Twitter multiplier effect” meant that within minutes many thousands of people got word of my experiment; over 1,900 actually viewed the object on my blog. And I’m lucky enough to have a particularly knowledgeable crowd following me on Twitter, as you can see from the word cloud of my followers’ bios.

Once the race was on, solvers took two distinct paths toward a solution. The first path was the one I was trying to encourage: some quick thoughts about facets of the object, followed by scholarly debate. I mentioned that the object was made out of shell but was found far away from water in the Midwest (of the U.S.), which led to some interesting speculation about origins and movement of Native Americans, Europeans, and Africans. Others focused on the iconography of the spider; what could it symbolize and which cultures used it? These were decent lines of inquiry that one could imagine in the back pages of a Victorian journal.

[Image: tweet screenshot (spider_tweet_5)]

[Image: tweet screenshot (spider_tweet_4)]

Twitter is mocked for its almost comical terseness, but even the most hardened Twitter skeptic must admit tweets such as these are far from useless assistance. And the power of this crowdsourcing is even more evident as you look at the full discussion trail as researchers pick up information from each other to take their speculations a step further.

The experiment was not, however, an unalloyed success, partly due to a mistake I made in setting it up. In hindsight I gave away too much in my original post, mentioning St. Clair and the fact that the piece was made out of shell. Alas, Googling keywords such as these (as well as the obvious “spider”) immediately gets one hot on the trail of the solution. It’s clear from the stream of tweets that a good portion of the solving audience took the “Google knows all” approach rather than the “scholarly discussion” approach.

I suppose even this aspect of the experiment is not uninteresting; I’ll leave it to others in the comments below to discuss the merits of the “Google” approach, as well as the merits (and demerits) of this experiment in general.

[Afterword: As many have pointed out on Twitter, the experiment would have been better had I not posted an object that could be found online. To be honest, I thought I had found an unusual object with no scanned version; it shows how much has been digitized, and how good search is even on a small amount of metadata.]

April 29, 2009 16 Comments