Creating a Blog from Scratch, Part 1: What is a Blog, Anyway?

If you look at the bottom of this page, you won’t see any of the telltale signs that it is generated by a blog software package like Blogger, Moveable Type, or WordPress. When I was redesigning this site and wanted to add a blog to it, I made the perhaps foolhardy decision to write my own blogging software. Why, you might ask, would I recreate the proverbial wheel? As I’ll explain in several other columns in this space, writing your own software is one of the best ways to learn—not only about how to write software, but also about genres and to think about (and rethink) some of the assumptions that go into the construction of software written for specific genres. The first question I therefore asked myself was, What is a blog, anyway?

Seems like an easy question. But it’s really not. Blogs began literally as “Web logs,” as logs of links to websites that people thought were interesting and wanted to share with others. Hip readers of this blog will recognize, however, that this task has recently shifted to new “Web 2.0” services like del.icio.us, Furl, and Digg. Many blogs continue to include links to other websites, of course, perhaps with some commentary added, but the blog is quickly becoming the wrong place to merely list a bunch of links.

Following this initial purpose, the blog became a place for early adopters to write about events in their lives. In other words, the closest cognate to an offline genre was the diary. As I’ll argue in my next post in this series, this phase has made a permanent (and not entirely positive) mark on blogging software. Let me say for now that it led to what I would call “the tryanny of the calendar” (note all of those calendars on blogs).

Many bloggers (including this one), however, aren’t writing diary entries. We’re passing along information to readers, some of it topical and time-based and some of it not. For instance, I found a great post on Wally Grotophorst’s blog about getting rid of the need to type in passwords (which I do a lot). Is this topic essentially about Friday, October 28th, 2005 at 2:43 pm as WordPress emphasizes in Wally’s archive? No. It’s about “Faster SSH Logins,” as his effectively terse title suggests. This led me to my first idea for my own blogging software: emphasize, above all, the subject matter and the content of each post.

It also made me realize something even more basic. At heart a blog is very simple: it’s merely a way to dynamically create a site out of a series of “posts,” in the same way that a website consists of a series of web pages. Blogs now do all kinds of things: rant and rave, sell products, provide useful tips, record profound thoughts. With this in mind I started writing some PHP code and setting up a database that would serve this blog, but in a much simpler way than Blogger, Moveable Type, and WordPress.

In Part II of this series, I’ll discuss the advantages and disadvantages of existing blogging software, and how I tried to retain the positives, remove the negatives, and create a much slimmer but highly useful and flexible piece of software that anyone could write with just a little bit of programming experience.

Part 2: Advantages and Disadvantages of Popular Blog Software

Nature Compares Science Entries in Wikipedia with Encyclopaedia Britannica

In an article published tomorrow, but online now, the journal Nature reveals the results of a (relatively small) study it conducted to compare the accuracy of Wikipedia with Encyclopaedia Britannica—at least in the natural sciences. The results may strike some as surprising.

As Jim Giles summarizes in the special report: “Among 42 entries tested, the difference in accuracy was not particularly great: the average science entry in Wikipedia contained around four inaccuracies; Britannica, about three…Only eight serious errors, such as misinterpretations of important concepts, were detected in the pairs of articles reviewed, four from each encyclopaedia. But reviewers also found many factual errors, omissions or misleading statements: 162 and 123 in Wikipedia and Britannica, respectively.”

These results, obtained by sending experts such as the Princeton historian of science Michael Gordin matching entries from the democratic/anarchical online source and the highbrow, edited reference work and having them go over the articles with a fine-toothed comb, should feed into the current debate over the quality of online information. My colleague Roy Rosenzweig has written a much more in-depth (and illuminating) comparison of Wikipedia with print sources in history, due out next year in the Journal of American History, which should spark an important debate in the humanities. I suspect that the Wikipedia articles in history are somewhat different than those in the sciences—it seems from Nature‘s survey that there may be more professional scientists contributing to Wikipedia than professional historians—but couple of the basic conclusions are the same: the prose on Wikipedia is not so terrific but most of its facts are indeed correct, to a far greater extent than Wikipedia’s critics would like to admit.

Alexa Web Search Platform Debuts

I’m currently working on an article for D-Lib Magazine explaining in greater depth how some of my tools that use search engine APIs work (such as the Syllabus Finder and H-Bot). These APIs, such as the services from Google and Yahoo, allow somewhat more direct access to mammoth web databases than you can get through these companies’ more public web interfaces. I thought it would be helpful for the article to discuss some of the advantages and drawbacks of these services, and was just outlining one of my major disappointments with their programming interfaces—namely, that you can’t run sophisticated text analysis on their servers, but have to do post-processing on your own server once you get a set of results back—when it was announced that Alexa released its Web Search Platform. The AWSP allows you to do just what I’ve been wanting to do on an extremely large (4 billion web page) corpus: scan through it in the same way that employees at Yahoo and Google can do, using advanced algorithms and manipulating as large a set of results as you can handle, rather than mere dozens of relevant pages. Here’s what’s notable about AWSP for researchers and digital humanists.

  • Yahoo and Google hobble their APIs by only including a subset of their total web crawl. They seem leery of giving the entire 8 billion pages (in the case of the Google index) to developers. My calculation is that only about 1 in 5 pages in the main Google index makes it into their API index. AWSP provides access to the full crawl on their servers, plus the prior crawl and any crawl in progress. This means that AWSP probably provides the largest dataset researchers can presently access, about 3 times larger than Google or Yahoo (my rough guess from using their APIs is that those datasets are only about 1.5 billion pages, versus about 4 billion for AWSP). It seems ridiculous that this could make a difference (do I really need 250 terabytes of text rather than 75?), but when you’re searching for low-ranking documents like syllabi it could make a big difference. Moreover, with at least two versions of every webpage, it’s conceivable you could write a vertical search engine to compare differences across time on the web.
  • They seem to be using a similar setup to the Ning web application environment to allow nonprogrammers to quickly create a specialized search by cloning a similar search that someone else has already developed. No deep knowledge of a programming language needed (possibly…stay tuned).
  • You can download entire datasets, no matter how large, something that’s impossible on Yahoo and Google. So rather than doing my own crawl for 600,000 syllabi—which broke our relatively high-powered server—you can have AWSP do it for you and then grab the dataset.
  • You can also have AWSP host any search engine you create, which removes a lot of the hassle of setting up a search engine (database software, spider, scripting languages, etc.).
  • OK, now the big drawback. As economists say, there’s no such thing as a free lunch. In the case of AWSP, their business model differs from the Google and Yahoo APIs. Google and Yahoo are trying to give developers just enough so that they create new and interesting applications that rely on but don’t compete directly with Google and Yahoo. AWSP charges (unlike Google and Yahoo) for use, though the charges seem modest for a digital humanities application. While a serious new search engine that would data-mine the entire web might cost in the thousands of dollars, my back of the envelope calculation is that it would cost less than $100 (that is, paid to Alexa, aside from the programming time) to reproduce the Syllabus Finder, plus about $100 per year to provide it to users on their server.

I’ll report more details and thoughts as I test the service further.

Introduction to Firefox Scholar

This week in the electronic version, and next week in the print version, the Chronicle of Higher Education is running an article (subscription required) on a new software project I’m co-directing, Firefox Scholar, which will be a set of extensions to the popular open source web browser that will help researchers, teachers, and students. My thanks to the many people who have emailed who are interested in the project. For them and for others who would like to know more, here’s a brief summary of Firefox Scholar from our grant proposal to the Institute for Museum and Library Services, which has generously provided $250,000 to initiate the project. Please contact me if you would like occasional updates on the project or would like a beta release of the browser when it is available in the late summer of 2006.

The web browser has become the primary means for accessing information, documents, and artifacts from libraries and museums around the country and the world, thanks in large part to the tremendous commitment these institutions have made to bringing their collections online (as either simple citations or complete text and images). Unfortunately for scholars, while tens of millions of dollars have been spent to create digital resources, far less funding and effort has been allocated for the development of tools to facilitate the use of these resources. The browser remains merely a passive window allowing one to view, but not easily collect, annotate, or manipulate these objects. Moreover, from the user’s perspective individual library and museum collections remain just that—separate websites with distinct designs and different ways of displaying their information, making traditional scholarly practices of bringing together and studying objects of interest from across these collections unnecessarily difficult.

Firefox Scholar, a set of tools incorporated into popular, open, and free web software, will address these major problems by creating a web browser that is “smarter” in two key ways. First, one tool will enable the browser to intelligently sense when its user is viewing a digital library or museum object; this will allow the browser to capture information from the page automatically, such as the creator, title, date of creation, and copyright information. Second, another tool will store and organize this information, as well as full copies of items and web pages (not just their citation information) if so desired by the user and permitted by the institution’s site, allowing the user to sort, annotate, search, and manipulate these individualized collections created for scholarly purposes. Critically, all of this will occur within the web browser itself, not in a separate, standalone application; the web browser will be used not just to discover information, but also to collect, organize, and analyze scholarly materials.

Reliability of Information on the Web

Given the current obsession with the reliability (or more often in media coverage, the unreliability) of information on the web—the New York Times weighed in on the matter yesterday, and USA Today carried a scathing op-ed last week—I feel lucky that an article Roy Rosenzweig and I wrote entitled “Web of Lies? Historical Information on the Internet” happens to appear today in First Monday. If you’re interested in the subject, it’s probably best to read the full article, but I’ll provide a quick summary of our argument here.

Using my H-Bot software tool, Roy and I scanned the Internet to assess the quality of online information about history. In short, we found that while critics are correct that there are many error-riddled web pages, on the whole the web presents a relatively sound portrayal of historical facts through a process of consensus. With the right tools, these facts can be extracted from the web, leaving the more problematic web pages aside.

Moreover, this process of historical data mining on the web should prompt further discussion about the significance of all of this historical information online. To do some of our own prompting, we had a special multiple-choice test-taking version of H-Bot take the National Assessment of Educational Progress U.S. History exam using nothing but the web and some fancy algorithms borrowed from computer science. [Spoiler alert: it passed.] This raises new questions that move far beyond simple debates over the reliability of information on the web and into the very nature of teaching, learning, and research in our digital age.

Clifford Lynch and Jonathan Band on Google Book Search

The topic for the November 2005 Washington DC Area Forum on Technology and the Humanities focused on “Massive Digitization Programs and Their Long-Term Implications: Google Print, the Open Content Alliance, and Related Developments.” The two speakers at the forum, Clifford Lynch and Jonathan Band, are among the most intelligent and thought-provoking commentators on the significance of Google’s Book Search project (formerly known as Google Print, with the Google Print Library Project being the company’s attempt to digitize millions of books at the University of Michigan, Stanford, Harvard, Oxford, and the New York Public Library). These are my notes from the forum, highlighting not the basics of the project, which have been covered well in the mainstream media, but angles and points that may interest the readers of this blog.

Clifford Lynch has been the Director of the Coalition for Networked Information (CNI) since July 1997. CNI, jointly sponsored by the Association of Research Libraries and Educause, includes about 200 member organizations concerned with the use of information technology and networked information to enhance scholarship and intellectual productivity. Prior to joining CNI, Lynch spent 18 years at the University of California Office of the President, the last 10 as Director of Library Automation. Lynch, who holds a Ph.D. in Computer Science from the University of California, Berkeley, is an adjunct professor at Berkeley’s School of Information Management and Systems.

Jonathan Band is a Washington-based attorney who helps shape the laws governing intellectual property and the Internet through a combination of legislative and appellate advocacy. He has represented library and technology clients with respect to the drafting of the Digital Millennium Copyright Act (DMCA), database protection legislation, and other statutes relating to copyrights, spam, cybersecurity, and indecency. He received his BA from Harvard College and his JD from Yale Law School. He worked in the Washington, D.C. office of Morrison & Foerster for nearly 20 years before opening his own law firm earlier this year.

Clifford Lynch

  • one of things that have made conversion of back runs of journals easy is the concentration of copyright in the journal owners, rather than the writers of articles
  • contrast this with books, where copyrights are much more elusive
  • strange that the university presses of these same univs. in the google print library project were among the first complainers about the project
  • there’s a lot more to the availability of out of copyright material than copyright law—for instance, look at the policies of museums, which don’t let you take photographs of their out of copyright paintings
  • same thing will likely happen with google print
  • while there has been a lot of press about the dynamic action plan for european digitization, it is probably a plan w/o a budget
  • important to remember that there has been a string of visionary literature—e.g., H.G. Wells’s “worldbrain”—promoting making the world’s knowledge accessible to everyone—knowledge’s power to make people’s lives better—not a commercial view—this feeling was also there at the beginning of the Internet
  • legal justifications have been made for policy decisions that are really bad
  • large scale open access corpora are now showing great value, using data mining applications: see the work of the intelligence community, pharmaceutical industry—will the humanities follow with these large digitization projects
  • we are entering an era that will give new value to ontologies, gazetteers, etc., to aid in searching large corpora
  • if google loses this case, search engines might be outlawed [Lawrence Lessig makes this point on his blog too —DC]
  • because of insane copyright law like sonny bono act there might be a bifurcation of the world into the digitized world of pre-1923 and the copyrighted, gated post-1923 world

Jonathan Band

  • fair use is at base about economics and morality—thus the cases (authors, publishers) against google are interesting cases in a broad social sense, not just pure law
  • only 20% of the books being digitized are out of copyright (approx.)
  • for certain works, like a dictionary, where even a snippet would have an economic impact on the copyright holder, google will probably not make even a snippet available
  • copyright owners say copyright is opt-in, not opt-out (as Google is making it in their progam)—it seems dumb, but this is a big legal issue for these cases
  • owners are correct that copyright is normally an opt-in experience—the owner must be contacted first before you make a use of their work, except when it’s fair use—then you don’t need to ask
  • thus the case will really be about fair use
  • key precendent: kelly vs. arribasoft: image search, found in favor of the search engine; kelly was a cantankerous photographer of the West who posted his photos on his website but didn’t want them copied by arribasoft (2 years ago; ended in 9th circuit); court found that search engine was a transformative use and useful for the public, even though it’s commercial use; court couldn’t find any negative economic impact on the market for kelly’s work [this case is covered in chapter 7 of Digital History —DC]
  • google’s case compares very favorably with arribasoft
  • publishers have weaker case because they are now saying that putting something on the web means that you’re giving an implied license to copy (no implied license for books)—but they’ve argued before that copyright applies just as strongly on the web
  • bot exclusion headers (robots.txt)—respected by search enginesvbut that sounds like opt-out, not opt-in—so publishers also probably shouldn’t be pointing to that in their case
  • publishers are also pointing to the google program for publishers, in which publishers allow google to scan their books and then they share in revenues—publishers are saying that the google library program is undermining this market, where publishers license their material; transaction costs of setting up a similar program for library books would be enormous–indeed it can’t be done: google is probably spending $750 million to scan 30 mil. books (at $25/bk); it would probably cost $1000/bk if you had to clear rights for scanning; no one would ever be able to pay for clearing rights like this, so what google is doing is broad and shallow vs. deep but narrow, which is what you could do if you cleared rights—many of these other digitization projects (e.g., Microsoft) are only doing 100K books at most
  • if google doesn’t succeed at this project, no one else will be able to do it—so if we agree that this book search project is a useful thing, then as a social matter Google should be allowed to do it under fair use
  • what’s the cost to the authors other than a little loss of control?

Do APIs Have a Place in the Digital Humanities?

Since the 1960s, computer scientists have used application programming interfaces (APIs) to provide colleagues with robust, direct access to their databases and digital tools. Access via APIs is generally far more powerful than simple web-based access. APIs often include complex methods drawn from programming languages—precise ways of choosing materials to extract, methods to generate statistics, ways of searching, culling, and pulling together disparate data—that enable outside users to develop their own tools or information resources based on the work of others. In short, APIs hold great promise as a method for combining and manipulating various digital resources and tools in a free-form and potent way.

Unfortunately, even after four decades APIs remain much more common in the sciences and the commercial realm—for example, the APIs provided by search behemoths Google and Yahoo—than in the humanities. There are some obvious reasons for this disparity. By supplying an API, the owners of a resource or tool generally bear most of the cost (on their taxed servers, in technical support and staff time) while receiving little or no (immediate) benefit. Moreover, by essentially making an end-run around the common or “official” ways of accessing a tool or project (such as a web search form for a digital archive), an API may devalue the hard work and thoughtfulness put into the more public front end for a digital project. It is perhaps unsurprising that given these costs even Google and Yahoo, which have the financial strength and personnel to provide APIs for their search engines, continue to keep these programs hobbled—after all, programmers can use their APIs to create derivative search engines that compete directly with Google’s or Yahoo’s results pages, with none of the diverting (and profitable) text advertising.

So why should projects in the digital humanities provide APIs, especially given their often limited (or nonexistent) funding compared to a Google or Yahoo? The reason IBM conceived APIs in the first place, and still today the reason many computer scientists find APIs highly beneficial, is that unlike other forms of access they encourage the kind of energetic and creative grass-roots and third-party development that in the long run—after the initial costs borne by the API’s owner—maximize the value and utility of a digital resource or tool. Motivated by many different goals and employing many different methodologies, users of APIs often take digital resources or tools in directions completely unforeseen by their owners. APIs have provided fertile ground for thousands of developers to experiment with the tremendous indices and document caches maintained by Google and Yahoo. New resources based on these APIs appear weekly, some of them hinting at new methods for digital research, data visualization techniques, and novel ways to data-mine texts and synthesize knowledge.

Is it possible—and worthwhile—for digital humanities projects to provide such APIs for their resources and tools? Which resources or tools would be best suited for an API, and how will the creators of these projects sustain such an additional burden? And are there other forms of access or interoperability that have equal or greater benefits with fewer associated costs?

Welcome to My Blog

Like so many others who enjoy the sound of their own voice and the sight of their own words on a printed page—I would estimate this group as a majority of humanity—I have increasingly felt the urge to write a blog. Blogging has obviously emerged as one of the remarkable, unique products of the web, providing for the first time a nearly frictionless way to immediately reach a worldwide audience with your thoughts.

Having written for paper media, I’ve experienced the frustration of the glacial pace of most publications. In academia this problem is particularly acute. For instance, I completed the first draft of a book chapter I wrote on nineteenth-century mathematics in May of 2002; I finally got to see it in print in May of 2005. Even in the best cases (and there are not many), an academic journal article generally takes a full year from the time you have completed most of the work on the article to the time it shows up on the pages of the journal.

On the other hand, maybe there’s not much urgency in seeing the latest on Victorian mathematics. As far as I know, all of the mathematicians I discuss in the book chapter remain dead, or at least oddly unproductive; those who are interested in their lives and work would just as well wait for a considerate, thoughtful, and complete article regardless of how slowly it took to arrive in print. And unlike in the sciences, there is rarely concern about precedent. My book on the larger history of pure mathematics in the Victorian era has taken about full decade between inception and completion, but I haven’t had many sleepless nights worrying that someone else has duplicated my work or theories.

So here’s the rub, and I suspect I’m not alone in this view: while I’m attracted to the instant gratification of publishing to the web, I’ve more often than not found blogs to be dissatisfying. Perhaps it’s absurd elitism or years of reading overly long tomes. But it’s a feeling that’s hard to shake. The ease with which one can post means that it’s often too easy to post the half-baked and the half-written.

So for this blog I’ve tried to set a higher mark for myself (the elitism now unites with an unwise masochism). While my posts may not be daily, I hope that they will function more like well thought out mini-articles, and transfer to this blog’s audience my understanding of the digital humanities in as great a depth as possible.

Stay tuned for posts explaining how to do for yourself experimental digital work (e.g., how to use the Google Maps API to build your own interactive historical map); posts communicating in a plainspoken way some of the more complex topics in computer science in ways that hopefully will spark ideas among humanists; and posts exploring the implications of new technologies and methodologies for teaching, learning, and researching in a digital age.

I hope that you’ll also join the conversation by emailing me at dcohen@gmu.edu if you have any comments or suggestions.

css.php