Author: Dan Cohen

Rough Start for Digital Preservation

How hard will it be to preserve today’s digital record for tomorrow’s historians, researchers, and students? Judging by the preliminary results of some attempts to save the September 11 Digital Archive (a project I co-directed) for the distant future, it won’t be easy. While there are some bright spots in last month’s D-Lib Magazine reports on the efforts of four groups to “ingest” (or digitally accession) the thousands of files from the 9/11 collection, the overall picture is a bit sobering. And this is a fairly well-curated (though by no means perfect) collection. Just imagine what ingesting a messy digital collection, e.g., the hard drive of your average professor, would entail. Here are some of the important lessons from these early digital preservation attempts, as I see them.

But first, a quick briefing on the collection. The September 11 Digital Archive is a joint project of the Center for History and New Media at George Mason University and the American Social History Project/Center for Media and Learning at the Graduate Center of the City University of New York. From January 2002 to the present (though mostly in the first two years) it has collected via the Internet (and some analog means, later run through digitization processes) about 150,000 objects, ranging from emails and BlackBerry communications to voicemail and digital audio, to typed recollections, photographs, and art. I think it’s a remarkable collection that will be extremely valuable to researchers in the future who wish to understand the attacks of 9/11 and their aftermath. In September 2003, the Library of Congress agreed to accession the collection, one of its first major digital accessions.

We started the project as swiftly as possible after 9/11, with the sense that we should do our best on the preservation front, but also with the understanding that we would probably have to cut some corners if we wanted to collect as much as we could. We couldn’t deliberate for months about the perfect archival structure or information architecture or wait for the next release of DSpace. Indeed, I wrote most of the code for the project in a week or so over the holiday break at the end of 2001. Not my best PHP programming effort ever, but it worked fine for the project. And as Clay Shirky points out in the D-Lib opening piece, this is likely to be the case for many projects—after all, projects that spend a lot of time and effort on correct metadata schemes and advanced hardware and software are probably going to be in a position to preserve their own materials anyway. The question is what will happen when more ordinary collections are passed from their holders to preservation outfits, such as the Library of Congress.

All four of the groups that did a test ingest of our 9/11 collection ran into some problems, though not necessarily at the points they expected. Harvard, Johns Hopkins, Old Dominion, and Stanford encountered some hurdles, beginning with my first point:

You can’t trust anything, even simple things like file types. The D-Lib reports note that a very small but still significant percentage of files in the 9/11 collection turned out not to be the formats they claimed to be. What amazes me reading this is that I wrote some code to validate file types as they were being uploaded by contributors onto our server, using some powerful file type assessment tools built into PHP and Apache (our web server software). Obviously these validations failed to work perfectly. When you consider handling billion-object collections, even a 1% (or .1%) error rate is a lot. Which leads me to point #2…
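
For the curious, here is a minimal sketch of what that kind of server-side check looks like in PHP. This is not the archive’s actual validation code; the form field name and the whitelist of types are illustrative assumptions:

```php
<?php
// Minimal sketch of server-side MIME-type checking, not the archive's actual
// validation code. The form field name and whitelist are illustrative.
$allowed = ['image/jpeg', 'image/gif', 'image/png', 'text/plain', 'application/pdf'];

$tmpFile = $_FILES['contribution']['tmp_name'];

// Inspect the file's magic bytes instead of trusting the client-reported type.
$finfo = finfo_open(FILEINFO_MIME_TYPE);
$detected = finfo_file($finfo, $tmpFile);
finfo_close($finfo);

if (!in_array($detected, $allowed, true)) {
    die("Rejected upload: detected type '$detected' is not on the whitelist.");
}
// Even this is imperfect: some formats share byte signatures, and truncated or
// corrupted files can still slip through, which is how small error rates creep in.
?>
```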

We may have to modify to preserve. Although for generations archival science has emphasized keeping objects in their original format, I wonder if it might have been better if (as we had thought about at first on the 9/11 project) we had converted files contributed by the general public into just a few standardized formats. For instance, we could have converted (using the powerful ImageMagick server software) all of the photographs into one of the JPEG formats (yes, there are more than one, which turned out to be a pain). We would have “destroyed” the original photograph in the upload process—indeed, worse than that from a preservation perspective, we would have compressed it again, losing some information—but we could have presented the Library of Congress with a simplified set of files. That simplification process leads me to point #3…
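
As a rough illustration of the kind of normalization we considered, the following sketch shells out to ImageMagick’s command-line convert tool from PHP. The paths and quality setting are assumptions for illustration, not code we actually ran:

```php
<?php
// Sketch of normalizing an uploaded image to a standard JPEG by shelling out to
// ImageMagick's `convert` tool. The paths and quality value are illustrative,
// and recompression is lossy (the preservation trade-off noted above).
$source = escapeshellarg($_FILES['photo']['tmp_name']);
$target = escapeshellarg('/archive/objects/' . uniqid('img_') . '.jpg');

exec("convert $source -quality 90 $target", $output, $status);

if ($status !== 0) {
    error_log("ImageMagick conversion failed with exit code $status");
}
?>
```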

Simple almost always beats complex when it comes to computer technology. I have incredible admiration for preservation software such as DSpace and Fedora, and I tend toward highly geeky solutions, but I’m much more pessimistic than those who believe that we are on the verge of preservation solutions that will keep digital files for centuries. Maybe it’s the historian of the Victorian age in me, reminding myself of the fate of so many nineteenth-century books that were not acid-free and so are deteriorating slowly in libraries around the world. Anyway, it was nice to see Shirky conclude in a similar vein that it looks like digital preservation efforts will have to be “data-centric” rather than “tool-centric” or “process-centric.” Specific tools will fade away over time, and so will ways of processing digital materials. Focusing on the data itself and keeping those files intact (and in use—that which is frequently used will be preserved) is critical. We’ll hopefully be able to access those saved files in the future with a variety of tools and using a variety of processes that haven’t even been invented yet.

2006: Crossroads for Copyright

The coming year is shaping up as one in which a number of copyright and intellectual property issues will be highly contested or resolved, likely having a significant impact on academia and researchers who wish to use digital materials in the humanities. In short, at stake in 2006 are the ground rules for how professors, teachers, and students may carry out their work using computer technology and the Internet. Here are three major items to follow closely.

Item #1: What Will Happen to Google’s Massive Digitization Project?

The conflict between authors, publishers, and Google will probably reach a showdown in 2006, with either the beginning of court proceedings or some kind of compromise. Google believes it has a good case for continuing to digitize library books, even those still under copyright; some authors and most publishers believe otherwise. So far, not much in the way of compromise. Indeed, if you have been following the situation carefully, it’s clear that each side is making pre-trial maneuvers to bolster its case. Google cleverly changed the name of its project from Google Print to Google Book Search, which emphasizes not the (possibly illegal) wholesale digitization of printed works but the fact that the program is (as Google’s legal briefs assert) merely a parallel project to their indexing of the web. The implication is that if what they’re doing with their web search is OK (for which they also need to make copies, albeit of born-digital pages), then Google Book Search is also OK. As Larry Lessig, Siva Vaidhyanathan, and others have highlighted, if the ruling goes against Google given this parallelism (“it’s all in the service of search”), many important web services might soon be illegal as well.

Meanwhile, the publishers have made some shrewd moves of their own. They have announced a plan to work with Amazon to accept micropayments for a few page views from a book (e.g., a recipe). And HarperCollins recently decided to embark on its own digitization program, ostensibly to provide book searches through its website. If you look at the legal basis of fair use (which Google is professing for its project), you’ll understand why these moves are important to the publishers: they can now say that Google’s project hurts the market for their works, even if Google shows only a small amount of a copyrighted book. In addition, a judge can no longer rule that Google is merely providing a service of great use to the public that the publishers themselves are unable or unwilling to provide. And I thought the only smart people in this debate were on Google’s side.

If you haven’t already read it, I recommend looking at my notes on what a very smart lawyer and a digital visionary have to say about the impending lawsuits.

Item #2: Chipping Away at the DMCA

In the first few months of 2006, the Copyright Office of the United States will be reviewing the dreadful Digital Millennium Copyright Act—one of the biggest threats to scholars who wish to use digital materials. The DMCA has effectively made criminals of many researchers, such as film studies professors, because they often need to circumvent rights management protection schemes on media like DVDs to use them in a classroom or for in-depth study (or just to play them on certain kinds of computers). This circumvention is illegal under the law, even if you own the DVD. Currently there are only four minor exemptions to the DMCA, so it is critical that other exemptions for teachers, students, and scholars be granted. If you would like to help out, you can go to the Copyright Office’s website in January and sign your name to various efforts to carve out exemptions. One effort you can join, for instance, is spearheaded by Peter Decherney and others at the University of Pennsylvania. They want to clear the way for fully legal uses of audiovisual works in educational settings. Please contact me if you would like to add your name to that important effort.

Item #3: Libraries Reach a Crossroads

In an upcoming post I plan to discuss at length a fascinating article (to be published in 2006) by Rebecca Tushnet, a Georgetown law professor, that highlights the strange place at which libraries have arrived in the digital age. Libraries are the center of colleges and universities (often quite literally), but their role has been increasingly challenged by the Internet and the protectionist copyright laws this new medium has engendered. Libraries have traditionally been in the long-term purchasing and preservation business, but they increasingly spend their budgets on yearly subscriptions to digital materials that could disappear if their budgets shrink. They have also been in the business of sharing their contents as widely as possible, to increase knowledge and understanding broadly in society; in this way, they are unique institutions with “special concerns not necessarily captured by the end-consumer-oriented analysis with which much copyright scholarship is concerned,” as Prof. Tushnet convincingly argues. New intellectual property laws (such as the DMCA) threaten this special role of libraries (aloof from the market), and if they are going to maintain this role, 2006 will have to be the year they step forward and reassert themselves.

Creating a Blog from Scratch, Part 4: Searching for a Good Search

It often surprises those who have never looked at server logs (the detailed statistics about a website) that a tremendous percentage of site visitors come from searches. In the case of the Center for History and New Media, this is a staggering 400,000 unique visitors a month out of about one million. Furthermore, many of these visitors ignore a website’s navigation and go right to the site search box to complete their quest for information. While I’m not a big fan of consultants that tell webmasters to sacrifice virtually everything for usability, I do feel that searching has been undervalued by digital humanities projects, in part because so much effort goes into digitization, markup, interpretation, and other time-consuming tasks. But there’s another, technical reason too: it’s actually very hard to create an effective search—one, for instance, that finds phrases as well as single words, that is able to rank matches well, and that is easy to maintain through software and server upgrades. In this installment of “Creating a Blog from Scratch” (for those who missed them, here are parts 1, 2, and 3) I’ll take you behind the scenes to explain the pluses and minuses of the various options for adding a search feature to a blog, or any database-driven website for that matter.

There are basically four options for searching a website that is generated out of a database: 1) have the database do it for you, since it already has indexing and searching built in; 2) install another software package on your server that spiders your site, indexes it, and powers your search; 3) use an application programming interface (API) from Google, Yahoo, or MSN to power the search, taking search results from this external source and shoehorning them into your website’s design; 4) outsource the search entirely by passing search queries to Google, Yahoo, or MSN’s website, with a modifier that says “only search my site for these words.”

Option #1 seems like the simplest. Just create an SQL statement (a line of code in database lingo) that sends the visitor’s query to the database software—in the case of this blog, the popular MySQL—and have it return a list of entries that match the query. Unfortunately, I’ve been using MySQL extensively for five years now and have found its ability to match such queries less than adequate. First, until the most recent version of MySQL, it would not handle phrase searching at all, so you would have to strip quotation marks out of queries and fool the user into believing your site could do something that it couldn’t (that is, do a search like Google could). Second, I have found its indexing and ranking schemes to be far behind what you expect from a major search engine. Maybe this has changed in version 5, but for many years it seemed as if MySQL was using search principles from the early 1990s, where the number of times a word appeared on the page signified how well the page matched the query (rather than the importance of the place of each instance of the word on the page, or even better, how important the document was in the constellation of pages that contained that word). MySQL will return a relevance score for each match, but it’s a crude measure. I’m still not convinced, even with the major upgrades in version 5, that MySQL’s searching is acceptable for demanding users.
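
For reference, Option #1 boils down to a query along these lines. The table and column names are hypothetical, a FULLTEXT index on the searched columns is assumed, and boolean mode (which accepts quoted phrases in recent versions) is shown alongside a relevance score used for ordering:

```sql
-- Hypothetical table and column names; requires a FULLTEXT index on (title, body).
-- Boolean mode (available since the MySQL 4.x series) accepts quoted phrases;
-- the MATCH in the SELECT list returns a rough relevance score for ordering.
SELECT id, title,
       MATCH(title, body) AGAINST('digital preservation') AS relevance
FROM posts
WHERE MATCH(title, body) AGAINST('"digital preservation"' IN BOOLEAN MODE)
ORDER BY relevance DESC
LIMIT 20;
```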

Option #2 is to install specialized search packages such as the open source ht://Dig on your server, point it to your blog (or website) and let it spider the whole thing, just as Google or Yahoo does from the outside. These software packages can do a decent job indexing and swiftly finding documents that seem more relevant than the rankings in MySQL. But using them obviously requires installing and maintaining another complicated piece of software, and I’ve found that spiders have a way of wandering beyond the parameters you’ve set for them, or flaking out during server upgrades. (Over the last few days, for instance, I’ve had two spiders request hundreds of posts from this blog that don’t exist. Maybe they can see into the future.) Anecdotally, I also think that the search results are better from commercial services such as Google or Yahoo.

I’ve become increasingly enamored of Option #3, which is to use APIs, or direct server-to-server communications, with the indices maintained by Google, Yahoo, or Microsoft. The advantage of these APIs is that they provide you with very high quality search results and query handling (at least for Google and Yahoo; MSN is far behind). Ranking is done properly, with the most important documents (e.g., blog posts that many other bloggers link to or that you have referenced many times on your own site) coming up first if there are multiple hits in the search results. And these search giants have far more sophisticated ways of handling phrase searches (even long ones) and boolean searches than MySQL. The disadvantage of APIs is that for some reason the indices made available to software developers are only a fraction the size of the main indices for these search engines, and are only updated about once a month. So visitors may not find recent material, or some material that is ranked fairly low, through API searches. Another possibility for Option #3 is to use the API for a blog search engine, rather than a broad search engine. For instance, Technorati has a blog-specific search API. Since Technorati automatically receives a ping from my Atom feed every time I post (via FeedBurner), it’s possible that this (or another blog search engine) will ultimately provide a solid API-based search.

I’ve been experimenting with ways of getting new material into the main Google index swiftly (i.e., within a day or two rather than a month or two), and have come up with a good enough solution that I have chosen Option #4: outsourcing the search entirely to Google, by using their free (though unfortunately ad-supported) site-specific search. With little fanfare, this year Google released Google Sitemaps, which provides an easy way for those who maintain websites, especially database-driven ones, to specify, using an XML schema, where all of their web pages are. (Spiders often miss web pages generated out of a database because there are often so many of them, and some of these pages may not be linked to.) While not guaranteeing that everything in your sitemap will be crawled and indexed, Google does say that it makes it easier for them to crawl your site more effectively. (By the way, Google’s recent acquisition of 5 percent of AOL seems to have been, at least ostensibly, very much about providing AOL with better crawls, thus upping the visibility of their millions of pages without messing with Google’s ranking schemes.) And—here’s the big news if you’ve made it this far—I’ve found that having a sitemap gets new blog posts into the main Google index extremely fast. Indeed, usually within 24 hours of submitting a new post Google downloads my updated sitemap (created automatically by a PHP script I’ve written), sees the new URL for the post, and adds it to its index. This means that I can very effectively use Google’s main search engine for this blog, although because I’m not using the API I can’t format the results page to match the design of my site exactly.
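
To make the mechanics concrete, here is a minimal sketch of a sitemap-generating script in the spirit of the one described above. It is not my actual script: the database credentials, table, and URLs are placeholders, and the namespace shown is the sitemaps.org schema, which is structured the same way as the original Google Sitemaps format:

```php
<?php
// Minimal sketch of a sitemap generator, not the actual script described above.
// Assumes a hypothetical `posts` table with `url_slug` and `updated` columns;
// the namespace is the sitemaps.org schema (the original Google Sitemaps format
// was structured the same way).
header('Content-type: application/xml');

$db = new mysqli('localhost', 'user', 'password', 'blog');
$result = $db->query('SELECT url_slug, updated FROM posts ORDER BY updated DESC');

echo '<?xml version="1.0" encoding="UTF-8"?>' . "\n";
echo '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">' . "\n";
while ($row = $result->fetch_assoc()) {
    echo "  <url>\n";
    echo '    <loc>http://www.example.org/posts/' . htmlspecialchars($row['url_slug']) . "</loc>\n";
    echo '    <lastmod>' . date('Y-m-d', strtotime($row['updated'])) . "</lastmod>\n";
    echo "  </url>\n";
}
echo "</urlset>\n";
?>
```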

One final note, and I think an important one for those looking to increase the visibility of their blog posts (or any web page created from a database) in Google’s search results: have good URLs, i.e., ones with important keywords rather than meaningless numbers or letters. Database-driven sites often have poor URLs featuring ugly strings of variables, which is a shame, since server technology (such as Apache’s mod_rewrite) allows webmasters to replace these variables with more memorable words. Moreover, Google, Yahoo, and other search engines clearly favor keywords in URLs (very apparent when you begin to work with Google’s Web API), assigning them a high value when determining the relevance of a web page to a query. Some blog software (like Blogger, owned by Google) automatically creates good URLs, while many other software packages do not—typically emphasizing the date of a post in the URL or the page number in the blog. For my own blogging software, I designed a special field in the database just for URLs, so I can craft a particularly relevant and keyword-laden string. Mod_rewrite takes care of the rest, translating this string into an ID number that is then used to retrieve the page you’re reading from the database.
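
Here is a hedged sketch of the kind of rewrite rule involved; the URL pattern and script name are illustrative assumptions rather than this site’s actual configuration:

```apache
# Sketch of keyword-friendly URLs via Apache's mod_rewrite (in .htaccess or the
# server config). The pattern and script name are illustrative, not this site's
# actual rules: a request for /posts/searching-for-a-good-search is handed to a
# PHP script that looks up the slug in the database and fetches the post's ID.
RewriteEngine On
RewriteRule ^posts/([a-z0-9-]+)/?$ /post.php?slug=$1 [L,QSA]
```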

For many reasons, including making it accessible to alternative platforms such as audio browsers and cell phones, I wanted to generate this page in strict XHTML, unlike my old website, which had poor coding practices left over from the 1990s. Unfortunately, as the next post in this series details, I failed terribly in the pursuit of this goal, and this floundering made me think twice about writing my own blogging software when existing packages like WordPress will generate XHTML for you, with no fuss.

Part 5: What is XHTML, and Why Should I Care?

Creating a Blog from Scratch, Part 3: The Double Life of Blogs

In the first two posts in this series, I discussed the origins of blogs and how they led to certain elements in popular blog software that were in some cases good and in others bad for my own purposes—to start a blog that consisted of short articles on the intersection of digital technology, the humanities, and related topics (rather than my personal life or links with commentary). What I didn’t realize as I set about writing my own blog software from scratch for this project was that in truth a blog leads two lives: one as a website and another as a feed, or syndicated digest of the more complete website. Understanding this double life and the role and details of RSS feeds led to further thoughts about how to design a blog, and how certain choices are encoded into blogging software. Those choices, as I’ll explain in this post, determine to a large extent what kind of blog you are writing.

Creating a blog from scratch is a great first project for someone new to web programming and databases. It involves putting things into a database (writing a post), taking them out to insert into a web page, and perhaps coding some secondary scripts such as a search engine or a feed generator. It stretches all of the scripting muscles without pulling them. And in my case, I had already decided (as discussed in the prior post) I didn’t need (or want) a lot of bells and whistles, like a system for allowing comments or trackback/ping features. So I began to write my blogging application with the assumption that it would be a very straightforward affair.

Indeed, at first designing the database couldn’t have been simpler. The columns (as database fields are called in MySQL) were naturally an ID number, a title, the body of the post (the text you’re reading right now), and the time posted (while I disparaged time in my last post I thought it was important enough to have this field for reference). But then I began to consider two things that were critical and that I thought popular blogging software did well: automatically generating an RSS feed (or feeds) and—for some blog software—creating URLs for each post that nicely contain keywords from the title. This latter feature is very important for visibility in search engines (the topic of my next post in this series).
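
As a sketch, that original schema amounts to little more than the following (the column names are my own choices for illustration, not necessarily the ones this blog uses):

```sql
-- A sketch of the original four-column schema described above; the names are
-- illustrative. (Extra fields for the feed summary and the keyword URL come up
-- in the discussion that follows.)
CREATE TABLE posts (
    id        INT UNSIGNED NOT NULL AUTO_INCREMENT,
    title     VARCHAR(255) NOT NULL,
    body      TEXT NOT NULL,
    posted_at DATETIME NOT NULL,
    PRIMARY KEY (id)
);
```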

I knew a lot about search engines but very little about RSS feeds, and when I started to think about what my blogging application would need to autogenerate its own feed, I realized the database had to be slightly more elaborate than my original schema. I had of course heard of RSS before, and the idea of syndication. But before I began to think about this blog and write the software that drives it, I hadn’t really understood its significance or its complexity. It’s supposed to be “Really Simple Syndication,” as one of the definitions for the acronym RSS asserts, and indeed the XML schemas for various RSS feeds are relatively simple.

But they also contain certain assumptions. For instance, all RSS feeds have a description of each blog post (in the Atom feed, it’s called the “summary”), but these descriptions vary from feed to feed and among different RSS feed types. For some it is the first paragraph of the post, for others the first 100 characters, and for still others it’s a specially crafted “teaser” (to use the TV news lingo for lines like “Coming up, what every parent needs to know to prevent a dingo from eating their baby”). Along with the title of the post, this snippet is generally the only thing many people see—the people who don’t visit your blog’s website but only scan it in syndication.

So what kind of description did I want, and how to create it? I liked the idea of an autogenerated snippet—press “post” and you’re done. On the other hand, choosing a random number of characters out of a hat that would work well for every post seemed silly. I’m used to writing abstracts for publications, so that was another possibility. But then I would have to do even more work after I finished writing a post. And I also needed to factor in that the description had to prod people to look at the whole post on my website, something the “teaser” people understood. So I decided to compromise and leave part to the code and part to me: I would have the software take the first paragraph of the entire post and use that as the default summary for the feed, but I had the software put a copy of this paragraph in the database so I could edit it if I wanted to. Automated but alterable. And I also decided that I needed to change my normal academic writing style to have more of an enticing end to every first paragraph. The first paragraph would be somewhere between an abstract and a teaser.
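
A minimal sketch of that automated-but-alterable default might look like this; the function name and sample text are illustrative, and the stored copy would simply live in an extra column alongside the post:

```php
<?php
// Sketch of the "automated but alterable" summary: take the first paragraph of
// a post as the default feed description, then keep a copy in its own database
// column so it can be hand-edited later. Names and sample text are illustrative.
function default_summary($body) {
    // Treat a blank line as the paragraph break; fall back to the whole body.
    $paragraphs = preg_split('/\r?\n\s*\r?\n/', trim($body));
    return $paragraphs[0];
}

$post = "First paragraph, ending with something enticing.\n\nSecond paragraph...";
echo default_summary($post); // prints the first paragraph only
?>
```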

Having made this decision and others like it for the other fields in the feed (should I have “channels”? why does there have to be an “author” field in a feed for a solo blog?), I had to choose which feed to create: RSS 1.0, RSS 2.0, Atom? Why there are so many feed types somewhat eludes me; I find it weird to go to some blogs that have little “chicklets” for every feed under the sun. I’ve now read a lot about the history of RSS, but it seems like something that should have become a single international standard early on. Anyway, I wanted the maximum “subscribability” for my blog but didn’t want to suffer through writing a PHP script to generate every single kind of feed.

So, time for outsourcing. I created a single script to generate an Atom 1.0 feed (I think the best choice if you’re just starting out), which is picked up by FeedBurner and recast into all of those slightly incompatible other feed types depending on what the subscriber needs.
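
For the curious, a minimal Atom 1.0 feed with a single entry looks roughly like this; the titles, URLs, and IDs below are placeholders rather than this blog’s actual feed:

```xml
<?xml version="1.0" encoding="utf-8"?>
<!-- A minimal Atom 1.0 feed with one entry. The titles, URLs, and IDs are
     placeholders, not this blog's actual feed. -->
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example Blog</title>
  <link href="http://www.example.org/" />
  <link rel="self" href="http://www.example.org/atom.xml" />
  <id>http://www.example.org/</id>
  <updated>2005-12-15T12:00:00Z</updated>
  <author><name>Dan Cohen</name></author>
  <entry>
    <title>Creating a Blog from Scratch, Part 3: The Double Life of Blogs</title>
    <link href="http://www.example.org/posts/the-double-life-of-blogs" />
    <id>http://www.example.org/posts/the-double-life-of-blogs</id>
    <updated>2005-12-15T12:00:00Z</updated>
    <summary>The first paragraph of the post serves as this summary, automated but alterable.</summary>
  </entry>
</feed>
```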

This would not be the first time I would get lazy and outsource. In the next part of this series, I discuss the different ways one can provide search for a blog, and why I’m currently letting Google do the heavy lifting.

Part 4: Searching for a Good Search

The Wikipedia Story That’s Being Missed

With all of the hoopla over Wikipedia in the recent weeks (covered in two prior posts), most of the mainstream as well as tech media coverage has focused on the openness of the democratic online encyclopedia. Depending on where you stand, this openness creates either a Wild West of publishing, where anything goes and facts are always changeable, or an innovative mode of mostly anonymous collaboration that has managed to construct in just a few years an information resource that is enormous, often surprisingly good, and frequently referenced. But I believe there is another story about Wikipedia that is being missed, a story unrelated to its (perhaps dubious) openness. This story is about Wikipedia being free, in the sense of the open source movement—the fact that anyone can download the entirety of Wikipedia and use it and manipulate it as they wish. And this more hidden story begins when you ask, Why would Google and Yahoo be so interested in supporting Wikipedia?

This year Google and Yahoo pledged to give various freebies to Wikipedia, such as server space and bandwidth (the latter can be the most crippling expense for large, highly trafficked sites with few sources of income). To be sure, both of these behemoth tech companies are filled with geeks who appreciate the anti-authoritarian nature of the Wikipedia project, and probably a significant portion of the urge to support Wikipedia comes from these common sentiments. Of course, it doesn’t hurt that Google and Yahoo buy their bandwidth in bulk and probably have some extra lying around, so to speak.

But Google and Yahoo, as companies at the forefront of search and data-mining technologies and business models, undoubtedly get an enormous benefit from an information resource that is not only open and editable but also free—not just free as in beer but free as in speech. First of all, affiliate companies that Yahoo and Google use to respond to queries, such as Answers.com, use Wikipedia as their main source, benefiting greatly from being able to repackage Wikipedia content (free speech) and from using it without paying (free beer). And Google has recently introduced an automated question-answering service that I suspect will use Wikipedia as one of its resources (if it doesn’t already).

But in the longer term, I think that Google and Yahoo have additional reasons for supporting Wikipedia that have more to do with the methodologies behind complex search and data-mining algorithms, algorithms that need full, free access to fairly reliable (though not necessarily perfect) encyclopedia entries.

Let me provide a brief example that I hope will show the value of having such a free resource when you are trying to scan, sort, and mine enormous corpora of text. Let’s say you have a billion unstructured, untagged, unsorted documents related to the American presidency in the last twenty years. How would you differentiate between documents that were about George H. W. Bush (Sr.) and George W. Bush (Jr.)? This is a tough information retrieval problem because both presidents are often referred to as just “George Bush” or “Bush.” Using data-mining algorithms such as Yahoo’s remarkable Term Extraction service, you could pull out of the Wikipedia entries for the two Bushes the most common words and phrases that were likely to show up in documents about each (e.g., “Berlin Wall” and “Barbara” vs. “September 11” and “Laura”). You would still run into some disambiguation problems (“Saddam Hussein,” “Iraq,” “Dick Cheney” would show up a lot for both), but this method is actually quite a powerful start to document categorization.
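
Here is a toy sketch of that scoring idea in PHP. It is not how Yahoo’s Term Extraction service works internally, just an illustration of matching documents against distinguishing terms pulled from each Wikipedia entry; the term lists are simply the examples mentioned above:

```php
<?php
// Toy illustration of the disambiguation idea above (not Yahoo's Term
// Extraction service): score a document against distinguishing terms drawn
// from each president's Wikipedia entry and pick the higher-scoring label.
// The term lists are just the examples from the paragraph above.
$signatures = [
    'George H. W. Bush' => ['Berlin Wall', 'Barbara'],
    'George W. Bush'    => ['September 11', 'Laura'],
];

function categorize($document, $signatures) {
    $bestLabel = null;
    $bestScore = -1;
    foreach ($signatures as $label => $terms) {
        $score = 0;
        foreach ($terms as $term) {
            // Count case-insensitive occurrences of each distinguishing term.
            $score += substr_count(strtolower($document), strtolower($term));
        }
        if ($score > $bestScore) {
            $bestScore = $score;
            $bestLabel = $label;
        }
    }
    return $bestLabel;
}

echo categorize('Barbara and George Bush watched the Berlin Wall come down.', $signatures);
// prints "George H. W. Bush"
?>
```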

I’m sure Google and Yahoo are running much more complex processes over the tens of gigabytes of text on Wikipedia than this, but it’s clear from my own work on H-Bot (which uses its own cache of Wikipedia) that having a constantly updated, easily manipulated encyclopedia-like resource is of tremendous value, not just to the millions of people who access Wikipedia every day, but to the search companies that often send traffic in Wikipedia’s direction.

Update [31 Jan 2006]: I’ve run some tests on the data mining example given here in a new post. See Wikipedia vs. Encyclopaedia Britannica for Digital Research.

Creating a Blog from Scratch, Part 2: Advantages and Disadvantages of Popular Blog Software

In the first post in this series I briefly recounted the early history of blogs (all of five years ago) and noted how many of their current uses have diverged from two early incarnations (as a place to store interesting web links and as the online equivalent of a diary). Unfortunately, these early, dominant forms gave rise to existing blog software that, at least in my mind, is problematic. This “encoding” of original purposes into the basic structure of software is common in software development, and it often leads to features and configurations in later releases that are undesirable to a large number of users. In this post, I discuss the advantages and disadvantages of common blog packages—often deeply encoded into the software.

There are many good reasons to use popular blog software like Movable Type, Blogger, or WordPress, almost too many to mention in this space. Here are some of the most important reasons, many of them obvious and others perhaps less so:

  • From a single download or by signing up with a service, you get a high level of functionality immediately and can focus on the content of your blog rather than its programming.
  • Their web designs make them instantly recognizable as the genre “blog,” thus making new visitors feel comfortable. For instance, most of them list posts in reverse chronological order, with “archives” that contain posts segmented by months and calendars marking days on which you have recently posted.
  • They allow people with little time or technical expertise to generate a site with well-formed, standards-compliant web code (most recently XHTML).
  • They automatically generate an RSS feed.
  • They have large user bases and active developers, which makes for relatively quick responses to annoyances such as blog spam.
  • They have lots of neat “social” features, such as feedback mechanisms (e.g., comments), and tracking (to see who has linked to one of your posts).
  • Some blog software automatically creates relatively good URLs (more on that in a later post in this series, and why good URLs are important to a blog).
  • Some blog services allow you to post via email and phone in addition to using a web browser.

Some of the disadvantages of popular blog software are merely the flip side of some of these advantages:

  • Even with the many templates blog software comes with, their web designs make most blogs look alike. Yes, you can easily figure out that a site is a blog, but on the other hand they begin to blend together in the mind’s eye. The web is made for variety, not sameness. In addition, you really have to work hard to fit a blog seamlessly into a broader site.
  • The tyranny of the calendar. There’s too much attention to chronology rather than content and the associations between that content. You can almost hear your blog software saying,
    “Boy, Dan had a pretty thin November, posting-wise,” taunting you with that empty calendar, or calling attention to the fact that your last post was “56 days ago.” Quality should triumph over quantity or frequency. Taking the emphasis off of time—perhaps not entirely, but a great deal—seemed to me to be a good first step for my own blog software (you’ll note that I only have a greyed-out date below the big red headline and a tiny “date string” in the buttons for each post). Obviously it makes sense to have recent posts highest on the page, but there may also be older posts that are still relevant or popular with visitors that you would like to highlight or reshuffle back into the mix. “Categories” have helped somewhat in this regard, and now post tagging (folksonomy) presents more hope. But I want to have full control over the position of my posts, recategorize them at will, have breakouts (like this series), different visual presentations, etc. And no thank you to the calendars or monthly archives.
  • Large installed user bases, as those who use Microsoft products will tell you, lead to unsavory attacks. Note the enormous proliferation of blog spam in the last year, mostly done by automated programs that know exactly how to find WordPress comment fields, Movable Type comment fields, etc. Sure, there are now mechanisms for defending against these attacks, but when you really think about it…
  • The comment feature of blogs is vastly overrated anyway. My back-of-the-envelope calculation is that 1% of blog comments are useful to other readers. A truly important comment will be emailed to the writer of the blog, as I encourage readers to do at the end of every post. Moreover, increasingly a better place for you to comment on someone’s blog is on your own blog, with a link to their post. Indeed, that’s what Technorati and other blog search engines have figured out, and now you can acquire a feed of comments about your blog from these third parties without opening up your blog to comment spam. (This also eliminates the need for trackback technology in your blog software.) So: no comments on my blog. Sorry. Don’t need the hassle of deleting even the occasional blog spam, and as readers of this blog have already done in droves (thanks!), you can email me if you need to. I’ll be happy to post your comments in this space if they help clarify a topic or make important corrections.
  • The search function is often not very good on blogs, even though search is how many people navigate sites. And trying to have a search function that simultaneously searches a blog and a wider site can be very complicated.
  • As with most software, there is a factor of “lock-in” when you choose an existing blog software package or service. It’s not entirely simple to export your material to a different piece of software. And many blog software packages have made this worse by encouraging posts written with non-standard (i.e., non-XHTML) characters that are used for formatting or style (as with Textile) and are converted to XHTML equivalents on the fly. This makes writing blog posts slightly faster. But if you export those posts, you will lose the important character translations.

Following this assessment of the advantages and disadvantages of popular blog software, I set about creating my own basic software that would easily fit into the web design you see here. Of course, I was throwing the baby out with the bath water by writing my own blog code. Couldn’t I just turn off the comments feature? Didn’t I want that easy XHTML compliance? Come on, are the designs so bad (they’re actually not, especially WordPress’s, but they are fairly similar across blogs)? Don’t I want to be able to phone in a post, or email one from a BlackBerry? (OK, the answer is no on both of those counts.)

But as I mentioned at the beginning of this series, I wanted to learn by doing and making. I didn’t know much about RSS. Which kind of RSS feed was best? How do you make an RSS feed, anyhow? I’ve thought a great deal about searching and data-mining, but what was the best way to search a blog? Were there ways to make a blog more searchable?

With these questions and concerns in mind, I started writing a simple PHP/MySQL application, and began to think about how I would make up for the lack of some of the advantages I’ve outlined above (hint: outsourcing). In the next post in this series, I’ll walk you through the basic setup and puzzle at the variety of RSS feeds.

Part 3: The Double Life of Blogs

Creating a Blog from Scratch, Part 1: What is a Blog, Anyway?

If you look at the bottom of this page, you won’t see any of the telltale signs that it is generated by a blog software package like Blogger, Movable Type, or WordPress. When I was redesigning this site and wanted to add a blog to it, I made the perhaps foolhardy decision to write my own blogging software. Why, you might ask, would I recreate the proverbial wheel? As I’ll explain in several other columns in this space, writing your own software is one of the best ways to learn—not only about how to write software, but also about genres, since it forces you to think about (and rethink) some of the assumptions that go into the construction of software written for specific genres. The first question I therefore asked myself was, What is a blog, anyway?

Seems like an easy question. But it’s really not. Blogs began literally as “Web logs,” as logs of links to websites that people thought were interesting and wanted to share with others. Hip readers of this blog will recognize, however, that this task has recently shifted to new “Web 2.0” services like del.icio.us, Furl, and Digg. Many blogs continue to include links to other websites, of course, perhaps with some commentary added, but the blog is quickly becoming the wrong place to merely list a bunch of links.

Following this initial phase, the blog became a place for early adopters to write about events in their lives. In other words, the closest cognate to an offline genre was the diary. As I’ll argue in my next post in this series, this phase has made a permanent (and not entirely positive) mark on blogging software. Let me say for now that it led to what I would call “the tyranny of the calendar” (note all of those calendars on blogs).

Many bloggers (including this one), however, aren’t writing diary entries. We’re passing along information to readers, some of it topical and time-based and some of it not. For instance, I found a great post on Wally Grotophorst’s blog about getting rid of the need to type in passwords (which I do a lot). Is this topic essentially about Friday, October 28th, 2005 at 2:43 pm as WordPress emphasizes in Wally’s archive? No. It’s about “Faster SSH Logins,” as his effectively terse title suggests. This led me to my first idea for my own blogging software: emphasize, above all, the subject matter and the content of each post.

It also made me realize something even more basic. At heart a blog is very simple: it’s merely a way to dynamically create a site out of a series of “posts,” in the same way that a website consists of a series of web pages. Blogs now do all kinds of things: rant and rave, sell products, provide useful tips, record profound thoughts. With this in mind I started writing some PHP code and setting up a database that would serve this blog, but in a much simpler way than Blogger, Movable Type, and WordPress.

In Part II of this series, I’ll discuss the advantages and disadvantages of existing blogging software, and how I tried to retain the positives, remove the negatives, and create a much slimmer but highly useful and flexible piece of software that anyone could write with just a little bit of programming experience.

Part 2: Advantages and Disadvantages of Popular Blog Software

Nature Compares Science Entries in Wikipedia with Encyclopaedia Britannica

In an article published tomorrow, but online now, the journal Nature reveals the results of a (relatively small) study it conducted to compare the accuracy of Wikipedia with Encyclopaedia Britannica—at least in the natural sciences. The results may strike some as surprising.

As Jim Giles summarizes in the special report: “Among 42 entries tested, the difference in accuracy was not particularly great: the average science entry in Wikipedia contained around four inaccuracies; Britannica, about three…Only eight serious errors, such as misinterpretations of important concepts, were detected in the pairs of articles reviewed, four from each encyclopaedia. But reviewers also found many factual errors, omissions or misleading statements: 162 and 123 in Wikipedia and Britannica, respectively.”

These results, obtained by sending experts such as the Princeton historian of science Michael Gordin matching entries from the democratic/anarchical online source and the highbrow, edited reference work and having them go over the articles with a fine-toothed comb, should feed into the current debate over the quality of online information. My colleague Roy Rosenzweig has written a much more in-depth (and illuminating) comparison of Wikipedia with print sources in history, due out next year in the Journal of American History, which should spark an important debate in the humanities. I suspect that the Wikipedia articles in history are somewhat different from those in the sciences—it seems from Nature’s survey that there may be more professional scientists contributing to Wikipedia than professional historians—but a couple of the basic conclusions are the same: the prose on Wikipedia is not so terrific but most of its facts are indeed correct, to a far greater extent than Wikipedia’s critics would like to admit.

Alexa Web Search Platform Debuts

I’m currently working on an article for D-Lib Magazine explaining in greater depth how some of my tools that use search engine APIs (such as the Syllabus Finder and H-Bot) work. These APIs, such as the services from Google and Yahoo, allow somewhat more direct access to mammoth web databases than you can get through these companies’ more public web interfaces. I thought it would be helpful for the article to discuss some of the advantages and drawbacks of these services, and was just outlining one of my major disappointments with their programming interfaces—namely, that you can’t run sophisticated text analysis on their servers, but have to do post-processing on your own server once you get a set of results back—when Alexa announced the release of its Web Search Platform. The AWSP allows you to do just what I’ve been wanting to do on an extremely large (4 billion web page) corpus: scan through it in the same way that employees at Yahoo and Google can, using advanced algorithms and manipulating as large a set of results as you can handle, rather than mere dozens of relevant pages. Here’s what’s notable about AWSP for researchers and digital humanists.

  • Yahoo and Google hobble their APIs by only including a subset of their total web crawl. They seem leery of giving the entire 8 billion pages (in the case of the Google index) to developers. My calculation is that only about 1 in 5 pages in the main Google index makes it into their API index. AWSP provides access to the full crawl on their servers, plus the prior crawl and any crawl in progress. This means that AWSP probably provides the largest dataset researchers can presently access, about 3 times larger than Google or Yahoo (my rough guess from using their APIs is that those datasets are only about 1.5 billion pages, versus about 4 billion for AWSP). It seems ridiculous that this could make a difference (do I really need 250 terabytes of text rather than 75?), but when you’re searching for low-ranking documents like syllabi it could make a big difference. Moreover, with at least two versions of every webpage, it’s conceivable you could write a vertical search engine to compare differences across time on the web.
  • They seem to be using a similar setup to the Ning web application environment to allow nonprogrammers to quickly create a specialized search by cloning a similar search that someone else has already developed. No deep knowledge of a programming language needed (possibly…stay tuned).
  • You can download entire datasets, no matter how large, something that’s impossible on Yahoo and Google. So rather than doing my own crawl for 600,000 syllabi—which broke our relatively high-powered server—you can have AWSP do it for you and then grab the dataset.
  • You can also have AWSP host any search engine you create, which removes a lot of the hassle of setting up a search engine (database software, spider, scripting languages, etc.).
  • OK, now the big drawback. As economists say, there’s no such thing as a free lunch. In the case of AWSP, their business model differs from the Google and Yahoo APIs. Google and Yahoo are trying to give developers just enough so that they create new and interesting applications that rely on but don’t compete directly with Google and Yahoo. AWSP charges (unlike Google and Yahoo) for use, though the charges seem modest for a digital humanities application. While a serious new search engine that would data-mine the entire web might cost in the thousands of dollars, my back of the envelope calculation is that it would cost less than $100 (that is, paid to Alexa, aside from the programming time) to reproduce the Syllabus Finder, plus about $100 per year to provide it to users on their server.

I’ll report more details and thoughts as I test the service further.

First Monday is Second Tuesday This Month

For those who have been asking about the article I wrote with Roy Rosenzweig on the reliability of historical information on the web (summarized in a previous post), it has just appeared on the First Monday website, perhaps a little belatedly given the name of the journal.