Author: Dan Cohen

Q and A on Firefox Scholar

Thanks so much to everyone who emailed me over the past week in response to hearing about Firefox Scholar. It’s great to get a sense that a wide range of people (from a number of countries) feel that the time has come for this kind of enhanced scholarly web browser, and it gives our team at the Center for History and New Media a great deal of confidence as we move forward. I’ve received a lot of questions about the project, so I thought I would answer some of the common ones here.

I thought the name of this software was going to be Smartfox. Where did Firefox Scholar come from?
Yes, it’s true that the original name of the project was Smartfox, but after filing our grant proposal to the Institute of Museum and Library Services we discovered that someone had “created” (if we can generously call it that) a copy of an early version of Firefox and renamed it Smartfox. They also acquired the smartfox.org domain name. As so often happens with overlapping domains and names, if you search on Google for “Smartfox” you now get a confusing mix of hits which makes it seem like we are already producing the software. To make things clearer for everyone, we’ve changed the name to Firefox Scholar (which we actually like better). We plan on posting information about the project at firefoxscholar.org (but not yet).

Are you going to use X, Y, or Z protocol/schema to acquire metadata about books, articles, and other objects on the web?
The short answer is yes. In other words, we plan on taking advantage of all existing standards and metadata schemas to acquire metadata into one’s Firefox Scholar folder. OAI, OpenURL, COinS—you name it, we want Firefox Scholar to be compatible with it and take advantage of it. The idea of Firefox Scholar is that there will be tiny widgets for each of these that will enable the browser to natively recognize and store their formats.
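
To give a sense of what recognizing one of these formats might involve, here is a minimal sketch in Python (not Firefox Scholar code, which does not exist yet). It relies on the COinS convention of placing an OpenURL ContextObject, written as URL-encoded key/value pairs, in the title attribute of a span with the class Z3988. The sample HTML, the book record in it, and the names CoinsParser and extract_coins are all invented for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import parse_qs

# Invented sample page: one COinS span describing a made-up book.
SAMPLE_HTML = (
    '<p>A catalog record.</p>'
    '<span class="Z3988" title="ctx_ver=Z39.88-2004'
    '&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Abook'
    '&amp;rft.btitle=An+Example+Book'
    '&amp;rft.au=Jane+Doe'
    '&amp;rft.date=2005"></span>'
)


class CoinsParser(HTMLParser):
    """Collect the metadata carried by COinS spans (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.records = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        classes = (attr_map.get("class") or "").split()
        if tag == "span" and "Z3988" in classes:
            # The title attribute holds URL-encoded key=value pairs;
            # parse_qs decodes them into a dict of lists, since a key
            # such as rft.au may legitimately repeat.
            self.records.append(parse_qs(attr_map.get("title") or ""))


def extract_coins(html):
    """Return one dict of metadata fields per COinS span found in html."""
    parser = CoinsParser()
    parser.feed(html)
    parser.close()
    return parser.records


records = extract_coins(SAMPLE_HTML)
for record in records:
    print(record.get("rft.btitle"), record.get("rft.au"), record.get("rft.date"))
```

A real widget would also need to cope with pages that carry several records and with the other formats mentioned above, each of which has its own conventions.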

I want to beta test the software as soon as it’s available. And by the way, when will it be available?
Great! Just let me know via email and I’ll put you on the list. We plan to make an early version of the software available in the late summer of 2006.

Will the software be available for Internet Explorer?
Sorry, it’s only a set of extensions for Firefox. We understand that a lot of people use Internet Explorer, but among other things (better security, free and openly available source code) Firefox has terrific ways of extending and enhancing it. In turn, we would like to make our own software extensible and enhanceable, and encourage other developers to make additions to it.

I’m a developer and would like to get involved with the project. How can I do so?
Please contact me.

Introduction to Firefox Scholar

This week in the electronic version, and next week in the print version, the Chronicle of Higher Education is running an article (subscription required) on a new software project I’m co-directing, Firefox Scholar, which will be a set of extensions to the popular open source web browser that will help researchers, teachers, and students. My thanks to the many people who have emailed to express interest in the project. For them and for others who would like to know more, here’s a brief summary of Firefox Scholar from our grant proposal to the Institute of Museum and Library Services, which has generously provided $250,000 to initiate the project. Please contact me if you would like occasional updates on the project or would like a beta release of the browser when it is available in the late summer of 2006.

The web browser has become the primary means for accessing information, documents, and artifacts from libraries and museums around the country and the world, thanks in large part to the tremendous commitment these institutions have made to bringing their collections online (as either simple citations or complete text and images). Unfortunately for scholars, while tens of millions of dollars have been spent to create digital resources, far less funding and effort has been allocated for the development of tools to facilitate the use of these resources. The browser remains merely a passive window allowing one to view, but not easily collect, annotate, or manipulate these objects. Moreover, from the user’s perspective individual library and museum collections remain just that—separate websites with distinct designs and different ways of displaying their information, making traditional scholarly practices of bringing together and studying objects of interest from across these collections unnecessarily difficult.

Firefox Scholar, a set of tools incorporated into popular, open, and free web software, will address these major problems by creating a web browser that is “smarter” in two key ways. First, one tool will enable the browser to intelligently sense when its user is viewing a digital library or museum object; this will allow the browser to capture information from the page automatically, such as the creator, title, date of creation, and copyright information. Second, another tool will store and organize this information, as well as full copies of items and web pages (not just their citation information) if so desired by the user and permitted by the institution’s site, allowing the user to sort, annotate, search, and manipulate these individualized collections created for scholarly purposes. Critically, all of this will occur within the web browser itself, not in a separate, standalone application; the web browser will be used not just to discover information, but also to collect, organize, and analyze scholarly materials.

Reliability of Information on the Web

Given the current obsession with the reliability (or more often in media coverage, the unreliability) of information on the web—the New York Times weighed in on the matter yesterday, and USA Today carried a scathing op-ed last week—I feel lucky that an article Roy Rosenzweig and I wrote entitled “Web of Lies? Historical Information on the Internet” happens to appear today in First Monday. If you’re interested in the subject, it’s probably best to read the full article, but I’ll provide a quick summary of our argument here.

Using my H-Bot software tool, Roy and I scanned the Internet to assess the quality of online information about history. In short, we found that while critics are correct that there are many error-riddled web pages, on the whole the web presents a relatively sound portrayal of historical facts through a process of consensus. With the right tools, these facts can be extracted from the web, leaving the more problematic web pages aside.

Moreover, this process of historical data mining on the web should prompt further discussion about the significance of all of this historical information online. To do some of our own prompting, we had a special multiple-choice test-taking version of H-Bot take the National Assessment of Educational Progress U.S. History exam using nothing but the web and some fancy algorithms borrowed from computer science. [Spoiler alert: it passed.] This raises new questions that move far beyond simple debates over the reliability of information on the web and into the very nature of teaching, learning, and research in our digital age.
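
For readers who want a feel for how "consensus" can separate sound facts from error-riddled pages, here is a deliberately simplified sketch in Python. It is not H-Bot's actual algorithm, which the full article describes; it simply pulls four-digit years out of text snippets and picks the one that the most snippets agree on. The snippets are invented for illustration (one contains a deliberate error), and the function name consensus_year is an assumption of this sketch.

```python
import re
from collections import Counter

# Matches four-digit years from 1000 through 2099.
YEAR_PATTERN = re.compile(r"\b(1[0-9]{3}|20[0-9]{2})\b")


def consensus_year(snippets):
    """Return the year mentioned by the most snippets, or None if no year appears."""
    counts = Counter()
    for snippet in snippets:
        # Count each year once per snippet so that a single repetitive
        # page cannot outvote several independent ones.
        counts.update(set(YEAR_PATTERN.findall(snippet)))
    if not counts:
        return None
    return counts.most_common(1)[0][0]


# Invented snippets standing in for web pages; the third has a transposed date.
snippets = [
    "Darwin published On the Origin of Species in 1859.",
    "On the Origin of Species (1859) introduced natural selection.",
    "Origin of Species appeared in 1895.",
    "Published in 1859, the book sold out quickly.",
]

answer = consensus_year(snippets)
print(answer)
```

The erroneous page is simply outvoted, which is the intuition behind leaving the more problematic web pages aside.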

Clifford Lynch and Jonathan Band on Google Book Search

The November 2005 Washington DC Area Forum on Technology and the Humanities focused on “Massive Digitization Programs and Their Long-Term Implications: Google Print, the Open Content Alliance, and Related Developments.” The two speakers at the forum, Clifford Lynch and Jonathan Band, are among the most intelligent and thought-provoking commentators on the significance of Google’s Book Search project (formerly known as Google Print, with the Google Print Library Project being the company’s attempt to digitize millions of books at the University of Michigan, Stanford, Harvard, Oxford, and the New York Public Library). These are my notes from the forum, highlighting not the basics of the project, which have been covered well in the mainstream media, but angles and points that may interest the readers of this blog.

Clifford Lynch has been the Director of the Coalition for Networked Information (CNI) since July 1997. CNI, jointly sponsored by the Association of Research Libraries and Educause, includes about 200 member organizations concerned with the use of information technology and networked information to enhance scholarship and intellectual productivity. Prior to joining CNI, Lynch spent 18 years at the University of California Office of the President, the last 10 as Director of Library Automation. Lynch, who holds a Ph.D. in Computer Science from the University of California, Berkeley, is an adjunct professor at Berkeley’s School of Information Management and Systems.

Jonathan Band is a Washington-based attorney who helps shape the laws governing intellectual property and the Internet through a combination of legislative and appellate advocacy. He has represented library and technology clients with respect to the drafting of the Digital Millennium Copyright Act (DMCA), database protection legislation, and other statutes relating to copyrights, spam, cybersecurity, and indecency. He received his BA from Harvard College and his JD from Yale Law School. He worked in the Washington, D.C. office of Morrison & Foerster for nearly 20 years before opening his own law firm earlier this year.

Clifford Lynch

  • one of the things that has made conversion of back runs of journals easy is the concentration of copyright in the journal owners, rather than the writers of articles
  • contrast this with books, where copyrights are much more elusive
  • strange that the university presses of these same univs. in the google print library project were among the first complainers about the project
  • there’s a lot more to the availability of out of copyright material than copyright law—for instance, look at the policies of museums, which don’t let you take photographs of their out of copyright paintings
  • same thing will likely happen with google print
  • while there has been a lot of press about the dynamic action plan for european digitization, it is probably a plan w/o a budget
  • important to remember that there has been a string of visionary literature—e.g., H.G. Wells’s “worldbrain”—promoting making the world’s knowledge accessible to everyone—knowledge’s power to make people’s lives better—not a commercial view—this feeling was also there at the beginning of the Internet
  • legal justifications have been made for policy decisions that are really bad
  • large scale open access corpora are now showing great value, using data mining applications: see the work of the intelligence community, pharmaceutical industry—will the humanities follow with these large digitization projects?
  • we are entering an era that will give new value to ontologies, gazetteers, etc., to aid in searching large corpora
  • if google loses this case, search engines might be outlawed [Lawrence Lessig makes this point on his blog too —DC]
  • because of insane copyright law like sonny bono act there might be a bifurcation of the world into the digitized world of pre-1923 and the copyrighted, gated post-1923 world

Jonathan Band

  • fair use is at base about economics and morality—thus the cases (authors, publishers) against google are interesting cases in a broad social sense, not just pure law
  • only 20% of the books being digitized are out of copyright (approx.)
  • for certain works, like a dictionary, where even a snippet would have an economic impact on the copyright holder, google will probably not make even a snippet available
  • copyright owners say copyright is opt-in, not opt-out (as Google is making it in their program)—it seems dumb, but this is a big legal issue for these cases
  • owners are correct that copyright is normally an opt-in experience—the owner must be contacted first before you make a use of their work, except when it’s fair use—then you don’t need to ask
  • thus the case will really be about fair use
  • key precedent: kelly vs. arribasoft: image search, found in favor of the search engine; kelly was a cantankerous photographer of the West who posted his photos on his website but didn’t want them copied by arribasoft (2 years ago; ended in 9th circuit); court found that search engine was a transformative use and useful for the public, even though it’s commercial use; court couldn’t find any negative economic impact on the market for kelly’s work [this case is covered in chapter 7 of Digital History —DC]
  • google’s case compares very favorably with arribasoft
  • publishers have weaker case because they are now saying that putting something on the web means that you’re giving an implied license to copy (no implied license for books)—but they’ve argued before that copyright applies just as strongly on the web
  • bot exclusion headers (robots.txt)—respected by search engines—but that sounds like opt-out, not opt-in—so publishers also probably shouldn’t be pointing to that in their case
  • publishers are also pointing to the google program for publishers, in which publishers allow google to scan their books and then they share in revenues—publishers are saying that the google library program is undermining this market, where publishers license their material; transaction costs of setting up a similar program for library books would be enormous–indeed it can’t be done: google is probably spending $750 million to scan 30 mil. books (at $25/bk); it would probably cost $1000/bk if you had to clear rights for scanning; no one would ever be able to pay for clearing rights like this, so what google is doing is broad and shallow vs. deep but narrow, which is what you could do if you cleared rights—many of these other digitization projects (e.g., Microsoft) are only doing 100K books at most
  • if google doesn’t succeed at this project, no one else will be able to do it—so if we agree that this book search project is a useful thing, then as a social matter Google should be allowed to do it under fair use
  • what’s the cost to the authors other than a little loss of control?
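
[Band’s point about bot exclusion headers is easier to see with a small example. The sketch below uses Python’s standard urllib.robotparser module; the robots.txt contents, the crawler name ExampleBot, the example.org addresses, and the helper function allowed are all invented for illustration. Note that a crawler is permitted wherever the site owner has said nothing, which is what makes the mechanism opt-out rather than opt-in. —DC]

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt for illustration; real sites publish their own.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def allowed(robots_txt, user_agent, url):
    """Report whether user_agent may fetch url under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Not mentioned in robots.txt, so crawling is permitted by default.
public_ok = allowed(ROBOTS_TXT, "ExampleBot", "http://example.org/books/page1.html")

# Explicitly excluded: the owner has opted this path out.
private_ok = allowed(ROBOTS_TXT, "ExampleBot", "http://example.org/private/page1.html")

# An empty robots.txt excludes nothing, so everything is permitted.
silent_ok = allowed("", "ExampleBot", "http://example.org/private/page1.html")

print(public_ok, private_ok, silent_ok)
```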

Do APIs Have a Place in the Digital Humanities?

Since the 1960s, computer scientists have used application programming interfaces (APIs) to provide colleagues with robust, direct access to their databases and digital tools. Access via APIs is generally far more powerful than simple web-based access. APIs often include complex methods drawn from programming languages—precise ways of choosing materials to extract, methods to generate statistics, ways of searching, culling, and pulling together disparate data—that enable outside users to develop their own tools or information resources based on the work of others. In short, APIs hold great promise as a method for combining and manipulating various digital resources and tools in a free-form and potent way.
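
To make the contrast with a web search form concrete, here is a toy sketch in Python of the kind of access an API might offer. Everything in it is hypothetical: the ArchiveAPI class, its search and count_by_decade methods, and the sample records are invented for illustration and do not describe any existing project's interface. The point is only that an outside user can select materials precisely and generate statistics in a few lines.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    """One record in a hypothetical digital archive."""
    title: str
    creator: str
    year: int


class ArchiveAPI:
    """A toy, in-memory stand-in for a project's programmatic interface."""

    def __init__(self, items):
        self._items = list(items)

    def search(self, keyword=None, start_year=None, end_year=None):
        """Return the items that match every criterion that was supplied."""
        results = []
        for item in self._items:
            if keyword is not None and keyword.lower() not in item.title.lower():
                continue
            if start_year is not None and item.year < start_year:
                continue
            if end_year is not None and item.year > end_year:
                continue
            results.append(item)
        return results

    def count_by_decade(self, **criteria):
        """Generate a simple statistic: the number of matching items per decade."""
        decades = Counter((item.year // 10) * 10 for item in self.search(**criteria))
        return dict(sorted(decades.items()))


# Invented sample records, for illustration only.
archive = ArchiveAPI([
    Item("Letter on railway construction", "Unknown", 1848),
    Item("Railway timetable", "Unknown", 1852),
    Item("Map of canal routes", "Unknown", 1857),
    Item("Railway accident report", "Unknown", 1861),
])

railway_items = archive.search(keyword="railway", end_year=1860)
railway_by_decade = archive.count_by_decade(keyword="railway")
print([item.title for item in railway_items])
print(railway_by_decade)
```

A web form could answer the first question, but the second (a count per decade) is the sort of derived result that usually requires programmatic access.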

Unfortunately, even after four decades APIs remain much more common in the sciences and the commercial realm—for example, the APIs provided by search behemoths Google and Yahoo—than in the humanities. There are some obvious reasons for this disparity. By supplying an API, the owners of a resource or tool generally bear most of the cost (on their taxed servers, in technical support and staff time) while receiving little or no (immediate) benefit. Moreover, by essentially making an end-run around the common or “official” ways of accessing a tool or project (such as a web search form for a digital archive), an API may devalue the hard work and thoughtfulness put into the more public front end for a digital project. It is perhaps unsurprising that given these costs even Google and Yahoo, which have the financial strength and personnel to provide APIs for their search engines, continue to keep these programs hobbled—after all, programmers can use their APIs to create derivative search engines that compete directly with Google’s or Yahoo’s results pages, with none of the diverting (and profitable) text advertising.

So why should projects in the digital humanities provide APIs, especially given their often limited (or nonexistent) funding compared to a Google or Yahoo? The reason IBM conceived APIs in the first place, and still today the reason many computer scientists find APIs highly beneficial, is that unlike other forms of access they encourage the kind of energetic and creative grass-roots and third-party development that in the long run—after the initial costs borne by the API’s owner—maximizes the value and utility of a digital resource or tool. Motivated by many different goals and employing many different methodologies, users of APIs often take digital resources or tools in directions completely unforeseen by their owners. APIs have provided fertile ground for thousands of developers to experiment with the tremendous indices and document caches maintained by Google and Yahoo. New resources based on these APIs appear weekly, some of them hinting at new methods for digital research, data visualization techniques, and novel ways to data-mine texts and synthesize knowledge.

Is it possible—and worthwhile—for digital humanities projects to provide such APIs for their resources and tools? Which resources or tools would be best suited for an API, and how will the creators of these projects sustain such an additional burden? And are there other forms of access or interoperability that have equal or greater benefits with fewer associated costs?

Welcome to My Blog

Like so many others who enjoy the sound of their own voice and the sight of their own words on a printed page—I would estimate this group as a majority of humanity—I have increasingly felt the urge to write a blog. Blogging has obviously emerged as one of the remarkable, unique products of the web, providing for the first time a nearly frictionless way to immediately reach a worldwide audience with your thoughts.

Having written for paper media, I’ve experienced the frustration of the glacial pace of most publications. In academia this problem is particularly acute. For instance, I completed the first draft of a book chapter I wrote on nineteenth-century mathematics in May of 2002; I finally got to see it in print in May of 2005. Even in the best cases (and there are not many), an academic journal article generally takes a full year from the time you have completed most of the work on the article to the time it shows up on the pages of the journal.

On the other hand, maybe there’s not much urgency in seeing the latest on Victorian mathematics. As far as I know, all of the mathematicians I discuss in the book chapter remain dead, or at least oddly unproductive; those who are interested in their lives and work would just as well wait for a considerate, thoughtful, and complete article regardless of how long it took to arrive in print. And unlike in the sciences, there is rarely concern about precedent. My book on the larger history of pure mathematics in the Victorian era has taken about a full decade between inception and completion, but I haven’t had many sleepless nights worrying that someone else has duplicated my work or theories.

So here’s the rub, and I suspect I’m not alone in this view: while I’m attracted to the instant gratification of publishing to the web, I’ve more often than not found blogs to be dissatisfying. Perhaps it’s absurd elitism or years of reading overly long tomes. But it’s a feeling that’s hard to shake. The ease with which one can post means that it’s often too easy to post the half-baked and the half-written.

So for this blog I’ve tried to set a higher mark for myself (the elitism now unites with an unwise masochism). While my posts may not be daily, I hope that they will function more like well thought out mini-articles, and transfer to this blog’s audience my understanding of the digital humanities in as great a depth as possible.

Stay tuned for posts explaining how to do experimental digital work yourself (e.g., how to use the Google Maps API to build your own interactive historical map); posts explaining some of the more complex topics in computer science in a plainspoken way that will hopefully spark ideas among humanists; and posts exploring the implications of new technologies and methodologies for teaching, learning, and researching in a digital age.

I hope that you’ll also join the conversation by emailing me at dcohen@gmu.edu if you have any comments or suggestions.