An Actual Use for Windows on the Mac

OK, so you can now run Windows on a Mac. So what? For most of us in the humanities, all we need is already on the Mac, which (in addition to intangibles such as the Mac’s design) is why so many of us remain stubbornly attached to Apple’s computers while over the last twenty years almost everyone else has moved to the more generic platform of the PC. Most educational, graphics, and web development software is available for the Mac. (For those in the social and natural sciences, on the other hand, many important software packages are either not available for the Mac or come out later than they do for the PC.) But perhaps there’s the rub. Since many of us only use Macs—especially those of us who build academic or museum websites—we often don’t see how most people view our sites. Because websites often render differently on different operating systems and web browsers, not checking how your site will look (and perform, if you are using dynamic web technologies) on a PC with IE (still 85% of web surfers) is unwise. Now with Parallels Workstation—the Windows-on-Mac solution that doesn’t require rebooting your computer to switch OSes—you can literally have a window into the world of Windows sitting on your desktop in parallel with your Mac applications. For instance, here’s a screenshot of my Mac desktop with Firefox for the Mac running on the left, and IE for Windows running on the right:

Looks to me like I need to work on the font size differential between Macs and PCs.

This parallelism of operating systems is incredibly handy for web development on a single machine. At the Center for History and New Media we have gone through phases where we have paid for services that send us static images of our websites on different platforms and in different browsers. We also spend a lot of time running from our Macs over to PCs to check how everything is looking. Now we can do this all on one machine, easily and instantaneously.

Now I just need to install another window for the 2% of web surfers using Linux…

The Single Box Humanities Search

I recently polled my graduate students to see where they turn to begin research for a paper. I suppose this shouldn’t come as a surprise: the number one answer—by far—was Google. Some might say they’re lazy or misdirected, but the allure of that single box—and how well it works for most tasks—is incredibly strong. Try getting students to go to five or six different search engines for gated online databases such as ProQuest and JSTOR—each with its own search options and a far more complicated array of results than Google’s. I was thinking about this recently as I tested Microsoft’s brand-new scholarly search engine, Windows Live Academic, a direct competitor to Google Scholar, which has been in business now for over a year but is still in “beta” (like most Google products). Both are trying to provide that much-desired single box for academic researchers. And while those in the sciences may eventually be happy with this new option from Microsoft (though it’s currently much rougher than Google’s beta, as you’ll see), like Google Scholar, Windows Live Academic is a big disappointment for students, teachers, and professors in the humanities. I suspect there are three main reasons for this lack of a high-quality single box humanities search.

First, a quick test of Google Scholar and Windows Live Academic. Can either one produce the source of the famous “frontier thesis,” probably the best-known thesis in American historiography?

Clearly, the usefulness of these search results is dubious, especially Windows Live Academic’s (The Political Economy of Land Conflict in the Eastern Brazilian Amazon as the top result?). Why can’t these giant companies do better than this for humanities searches?

Obviously, the people designing and building these “academic” search engines are from a distinct subset of academia: computer science and mathematical fields such as physics. So naturally they focus on their own fields first. Both Google Scholar and Windows Live Academic work fairly well if you would like to know about black holes or encryption. Moreover, “scholarship” in these fields generally means articles, not books. Google Scholar and Windows Live Academic are dominated by journal-based publications, though both sometimes show books in their search results. But when Google Scholar does so, these books seem to appear because articles that match the search terms cite these works, not because of the relevance of the text of the books themselves.

In addition, humanities articles aren’t as easy as scientific papers to subject to bibliometrics—methods such as citation analysis that reveal the most important or influential articles in a field. Science papers tend to cite many more articles (and fewer books) in a way that makes them subject to extensive recursive analysis. Thus a search on “search” on Google Scholar aptly points a researcher to Sergey Brin’s and Larry Page’s seminal paper outlining how Google would work, because hundreds of other articles on search technology dutifully refer to that paper in their opening paragraph or footnote.
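Here is a toy sketch of that kind of citation counting (in Python, with an invented citation graph; real bibliometrics is of course far more sophisticated). The point is simply that articles citing articles can be tallied automatically, while citations to books lead out of the indexed corpus:

```python
from collections import Counter

# A tiny invented citation graph: each "paper" lists the works it cites.
# In the sciences most cited items are themselves indexed articles, so the
# counting can be repeated over the whole corpus; in the humanities many
# citations point to books that never appear in the index at all.
citations = {
    "brin_page_1998": [],
    "search_paper_a": ["brin_page_1998"],
    "search_paper_b": ["brin_page_1998", "search_paper_a"],
    "search_paper_c": ["brin_page_1998"],
    "humanities_article": ["turner_frontier_book"],  # cites a book, not an indexed article
}

# Tally incoming citations for every cited work.
incoming = Counter(cited for cited_works in citations.values() for cited in cited_works)

for work, count in incoming.most_common():
    status = "indexed article" if work in citations else "book (outside the index)"
    print(f"{work}: {count} citation(s), {status}")
```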

Most important, however, is the question of open access. Outlets for scientific articles are more open and indexable by search engines than humanities journals. In addition to many major natural and social science journals, CiteSeer (sponsored by Microsoft) and ArXiv.org make hundreds of thousands of articles on computer science, physics, and mathematics freely available. This disparity in openness compared to humanities scholarship is slowly starting to change—the American Historical Review, for instance, recently made all new articles freely available online—but without a concerted effort to open more gates, finding humanities papers through a single search box will remain difficult to achieve. Microsoft claims in its FAQ for Windows Live Academic that it will get around to including better results for subjects like history, but like Google they are going to have a hard time doing that well without open historical resources.

UPDATE [18 April 2006]: Microsoft has contacted me about this post; they are interested in learning more about what humanities scholars expect from a specialized academic search engine.

UPDATE [21 April 2006]: Bill Turkel makes the great point that Google’s main search does a much better job than Google Scholar at finding the original article and author of the frontier thesis:

Mapping Recent History

As the saying goes, imitation is the sincerest form of flattery. So at the Center for History and New Media, we’re currently feeling extremely flattered that our initiatives in collecting and presenting recent history—the Echo Project (covering the history of science, technology, and industry), the September 11 Digital Archive, and the Hurricane Digital Memory Bank—are being imitated by a wave of new websites that help people place recollections, images, and other digital objects on a map. Here’s an example from the mapping site Platial:

And a similar map from our 9/11 project:

Of course, we’re delighted to have imitators (and indeed, in turn we have imitated others), since we are trying to disseminate as widely as possible methods for saving the digital record of the present for future generations. It’s great to see new sites like Platial, CommunityWalk, and Wayfaring providing easy-to-use, collaborative maps that scattered groups of people can use to store photos, memories, and other artifacts.

Measuring the Audience of a Digital Humanities Project

Karen Motylewski of the Institute of Museum and Library Services recently pressed an audience of new IMLS grantees to think about how they might measure the success of their digital projects. As she was well aware, academics often bristle at the quantitative measurement of the audience for their websites because it smacks of commercialism. Also, we professors and librarians and curators generally avoid taking classes in such base topics as marketing. But Karen has a point. Indeed, Roy Rosenzweig and I devote an entire chapter in Digital History to how to build an audience—not for commercial or narcissistic reasons, but because an academic digital project should be, as we say, “useful and used.” I started this blog to explain in greater depth some of the projects and research I’m working on in the digital humanities, but I also did it (as readers of my five-part series on “Creating a Blog from Scratch” will know; 1, 2, 3, 4, 5) to learn first-hand about the composition of blogs and the technologies behind them. Writing my own code for this blog forced me to examine in detail—and occasionally rethink—some blogging conventions (technical, design, and content). And one of the benefits of doing so has been a realization that I have significantly underestimated the power of RSS. I now think it may be the best measurement of utility for an academic website, far better than server logs or other quantitative measurements. Let me explain why.

Think of your reading habits—specifically, periodicals. You probably subscribe to a newspaper, a magazine or two (or three), and perhaps some academic or specialist journals. Every time you go to the dentist, you also probably voraciously read all of those salacious magazines and lifestyle handbooks you don’t subscribe to. If you’re in a particularly bad waiting room, you probably read anything that’s lying around, even if you would never buy those magazines at a newsstand. As anyone in the magazine or newspaper business will tell you, what they really want is subscribers, not casual, one-time readers. Subscribers have shown a level of interest in, and dedication to, a periodical that is several levels above all other readers.

Now look carefully at web server logs—the trail of a website’s readers. Most visitors to a typical website are like the third type of magazine reader—simply passing through on the way to get their cavities filled. They generally come from search engines, quickly scan a page, and leave, their IP address never to be seen again.

Moreover, up to three-quarters of traffic to most websites is from bots (e.g., Google’s indexing spider)—a machine audience that you probably care little about, except as a way to drive traffic to your site from search requests. On this site in March 2006, the human audience looked at about 10,000 pages; machines requested over 26,000 pages. This doesn’t even take into account “referrer spam,” which consists of fake requests to your server to make it look like another website is sending a lot of traffic your way. In March, superhott.com was the number one “referrer” to this blog. Great.
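If you are curious about the split on your own site, a rough pass over the raw logs is enough to see it. Here is a minimal sketch (Python; the log file name and the short list of bot signatures are placeholders, and a real log needs a much longer list):

```python
# A rough sketch: count requests from obvious bots versus everything else
# in an Apache-style access log. "access.log" and the signature list below
# are placeholders; substitute your own log and a fuller list of crawlers.
BOT_SIGNATURES = ("googlebot", "slurp", "msnbot", "crawler", "spider")

bot_hits = 0
human_hits = 0

with open("access.log") as log:
    for line in log:
        # The user agent appears at the end of each line in the "combined" format.
        if any(signature in line.lower() for signature in BOT_SIGNATURES):
            bot_hits += 1
        else:
            human_hits += 1

total = bot_hits + human_hits
if total:
    print(f"Bot requests: {bot_hits} ({100 * bot_hits / total:.0f}%); other requests: {human_hits}")
```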

So now we are down to about 10% of the top line number of “visitors” to your website. You are likely getting depressed. But here’s where another point Roy and I make comes into play: “think about community, not numbers of visitors.” That other 10% includes a number of people who love your site and what it has to offer but only visit every once in a while.

Then there are the subscribers. RSS truly provides an online analog to periodical subscriptions; “subscriptions” is a very good word for it since subscribers receive each update automatically. RSS finally allows digital humanities projects to assess how many people are really committed to a site. Notably, this number may or may not follow overall site traffic patterns. For instance, here’s a comparison of server logs for this site with RSS subscriptions:

In the noise of all of the bot traffic and uninterested visitors (top chart; the orange bar represents unique visitors, the dark blue is page views), I’m grateful that subscriptions to this blog (bottom chart) have climbed steadily since its inception four months ago. Should this blog have the enormous traffic of a BoingBoing? No. That’s not why I started it. I’m trying to reach a fairly specific audience that is several orders of magnitude smaller than the big tech/geek audience for BoingBoing. Success means reaching and having a conversation with those people—the people who I believe are doing critical work for the future of education, libraries, and the humanities—not with a mass audience. I hope this site is slowly creeping toward that modest goal. By tracking RSS subscriptions, other digital humanities projects can also see if they’re reaching their envisioned audience.

But how do you use RSS if your site isn’t a blog? If your site is a digital collection or archive, you can add a “news about this site” or “new features/new additions” RSS feed, as we have done for the Hurricane Digital Memory Bank. If your project involves software development, you can put code update announcements into an RSS feed. Even if your site is relatively static, new services such as watchthiswebsite.com will send out notifications of site changes to interested parties. Once you have an RSS feed (you should link to it from your home page so that RSS-aware browsers can find it quickly), you can then use services such as Feedburner to track RSS subscriptions more carefully.
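The feed itself is nothing exotic, just a small XML file you regenerate whenever something changes. Here is a minimal sketch (Python; the item titles, links, and file name are all hypothetical) of a “new additions” feed for a digital collection:

```python
from xml.sax.saxutils import escape

# Hypothetical "new additions" announcements for a digital collection.
items = [
    ("New oral histories added", "http://example.org/collection/oral-histories"),
    ("Improved search interface", "http://example.org/news/search-update"),
]

entries = "".join(
    f"<item><title>{escape(title)}</title><link>{escape(link)}</link></item>"
    for title, link in items
)

feed = (
    '<?xml version="1.0"?>'
    '<rss version="2.0"><channel>'
    "<title>Example Collection: New Additions</title>"
    "<link>http://example.org/</link>"
    "<description>Updates and additions to the collection</description>"
    f"{entries}"
    "</channel></rss>"
)

# Write the feed where the web server can serve it, and link to it from the
# home page (with a <link rel="alternate" type="application/rss+xml"> tag)
# so that RSS-aware browsers and aggregators can discover it.
with open("new_additions.xml", "w") as f:
    f.write(feed)
```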

With all of its faults and problems, I suspect we will soon be saying, “The server log is dead.” Long live RSS.

The Final Four’s Impact on Websites

I work at George Mason University. Unless you live off the grid (and if so, how are you reading this?), you’ve probably heard that our basketball team is in the Final Four this weekend. There has been a great deal of talk around campus about the impact this astonishing feat will have on the university’s stature and undergraduate admissions. But what about its effect on Mason’s websites? A bit of unscientific evidence from Alexaholic, which creates website traffic graphs using data from Amazon.com’s Alexa web service:

Our domain has gone from being about the 5300th most popular on the web to about 2100th since Mason was selected (controversially) for the tournament on March 12. OK, we’re not exactly in Yahoo territory, but we’ve bypassed dozens of other universities in our steep two-week climb.

Search Engine Optimization for Smarties

A Google search for “Sputnik” gives you an authoritative site from NASA in the top ten search results, but also a web page from the skydiver and ballroom-dancing enthusiast Michael Wright. This wildly democratic mix of sources perennially leads some educators to wring their hands about the state of knowledge, as yet another op-ed piece in the New York Times does today (“Searching for Dummies” by Edward Tenner). It’s a strange moment for the Times to publish this kind of lament; it seems like an op-ed left over from 1997, and as I’ve previously written in this space (and elsewhere with Roy Rosenzweig), contrary to Tenner’s one example of searching in vain for “World History,” online historical information is actually getting better, not worse (especially if you assess the web as a whole rather than complain about a few top search results). Anyway, Tenner does make one very good point: “More owners of free high-quality content should learn the tradecraft of tweaking their sites to improve search engine rankings.” This “tradecraft” is generally called “search engine optimization,” and I’ve long thought I should let those in academia (and other creators of reliable, noncommercial digital resources) in on the not-so-secret ways you can move your website higher up in the Google rankings (as well as in the rankings of other search engines).

1. Start with an appropriate domain name. Ideally, your domain should contain the top keywords you expect people searching for your topic to type into Google. At CHNM we love the name “Echo” for our history of science website, but we probably should have made the URL historyofscience.gmu.edu rather than echo.gmu.edu. Professors like to name digital projects something esoteric or poetic, preferably in Greek or Latin. That’s fine. But make the URL something more meaningful (and yes, more prosaic, if necessary) for search engines. If you read Google’s Web Search API documentation, you’ll realize that their spider can actually parse domain names for keywords, even if you run these words together.

2. If you’ve already launched your website, don’t change its address if it already has a lot of links to it. “Inbound” links are the currency of Google rankings. (You can check on how many links there are to your site by typing “link:[your domain name here]” into Google.) We can’t change Echo’s address now, because it’s already got hundreds of links to it, and those links count for a lot. (Despite the poetic name, we’re in the top ten for “history of science.”) There are some fancy ways to “redirect” sites from an old domain to a new one, but it’s tricky.

3. Get as many links to your site as you can from high-quality, established, prominent websites. Here’s where academics and those working in museums and libraries are at an advantage. You probably already have access to some very high-ranking, respected sites. Work at the Smithsonian or the Library of Congress? Want an extremely high-ranking website on any topic? Simply link to the new website (appropriately named, of course) from the home page of your main site (the home page is generally the best page to get a link from). Wait a month or two and you’re done, because www.si.edu and www.loc.gov wield enormous power in Google’s mathematical ranking system. A related point is…

4. Ask other sites to link to your site using the keywords you want. If you have a site on the Civil War, a bad link is one that says, “Like the Civil War? Check out this site.” A helpful link is one that says, “This is a great site on the Civil War.” If you use the Google Sitemap service, it will tell you what the most popular keywords are in links to your site.

5. Include keywords in file names and directory names across your site, and don’t skimp on the letters. This point is similar to #1, only for subtopics and pages on your site. Have a bibliography of Civil War books? Name the file “civilwarbibliography.html” rather than just “biblio.html” or some nonsense letters or numbers.

6. Speaking of nonsense letters and numbers, if your site is database-driven, recast ungainly numbers and letters in the URL (known in geek-speak as the “query string”), e.g., change www.yoursite.org/archive.cfm?author=x15y&text=15325662&lng=eng to www.yoursite.org/archive/rousseau/emile/english_translation.html. Have someone who knows how to do “URL rewriting” change those URLs to readable strings (if you use the Apache web server software, as 70% of sites do, the software that does this is called “mod_rewrite”; it still keeps those numbers and letters in memory, but doesn’t let the human or machine audiences see them). A rough sketch of what this mapping looks like appears after this list.

7. Be very careful about hiring someone to optimize your site, and don’t do anything shifty like putting white text with your keywords on a white background. Read Google’s warning about search engine optimization and shady methods and their propensity to ban sites for subterfuge.

8. Don’t bother with metatags. Google and other search engines don’t care about these old, hidden HTML tags that were supposed to tell search engines what a web page was about.

9. Be patient. For most sites, it’s a slow rise to the top, accumulating links, awareness in the real world and on the web, etc. Moreover, there is definitely a first-mover advantage—being highly ranked creates a virtuous circle, because once you’re in the top ten, other sites link to your site simply because they find it more easily than others. Thus Michael Wright’s page on Sputnik, which is nine years old, remains stubbornly in the top ten. But one of the advantages a lot of academic and nonprofit sites have over the Michael Wrights of the world is that we’re part of institutions that are in it for the long run (and don’t have ballroom dancing classes). I’m more sanguine than Edward Tenner that in the near future, great sites, many of them from academia, will rise to the top and be found by all of those Google-centric students the educators worry about.
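Here, as promised in point 6, is a rough illustration of that URL mapping (in Python rather than Apache’s mod_rewrite, and with invented lookup tables standing in for the database’s internal IDs):

```python
import re

# Invented lookup tables standing in for the database's internal IDs.
AUTHOR_IDS = {"rousseau": "x15y"}
TEXT_IDS = {"emile": "15325662"}
LANGUAGE_CODES = {"english_translation": "eng"}

def rewrite(readable_path):
    """Translate a readable URL back into the internal query string,
    roughly what an Apache mod_rewrite rule does before a request
    reaches the database-driven script."""
    match = re.match(r"^/archive/([^/]+)/([^/]+)/([^/]+)\.html$", readable_path)
    if match is None:
        return readable_path  # not an archive URL; pass it through unchanged
    author, text, language = match.groups()
    return "/archive.cfm?author={}&text={}&lng={}".format(
        AUTHOR_IDS.get(author, author),
        TEXT_IDS.get(text, text),
        LANGUAGE_CODES.get(language, language),
    )

print(rewrite("/archive/rousseau/emile/english_translation.html"))
# /archive.cfm?author=x15y&text=15325662&lng=eng
```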

But these sites (and their producers) could use a little push. Hope this helps.

(You might also want to read the chapter Roy and I wrote on building an audience for your website in Digital History, especially the section that includes a discussion of how Google works, as well as another section of the book on “Site Structure and Good URLs.”)

Google Adds Topic Clusters to Search Results

Google has been very conservative about changing their search results page. Indeed, the design of the page and the information presented has changed little since the search engine’s public introduction in 1998. Innovations have literally been marginal: Google has added helpful spelling corrections (“Did you mean…?”), related search terms, and news items near the top of the page, and of course the ubiquitous text ads to the right. But the primary search results block has remained fairly untouched. Competitors have come and gone (mostly the latter), promoting new—and they say better—ways of browsing masses of information. But Google’s clean, relevant list has brushed off these upstarts. So it surprised me when I was doing some fact checking on a book I’m finishing to see the following search results page:

As you can see, Google has evidently introduced a search results page that clusters relevant web pages by subject matter. Google has often disparaged other search engines that do this sort of clustering, like the gratingly named Clusty and Vivisimo, perhaps because Google’s engineers must be some of the few geeks who understand that regular human beings don’t particularly care for fancier ways of structuring or visualizing search results. Just the text, ma’am.

But while this addition of clustering (based on the information theory of document classification, as I recently discussed in D-Lib and in a popular prior blog post) to Google’s search results page is surprising, the way they’ve done it is typically simple and useful. No little topic folders in a sidebar; no floating circles connected by relationship lines. The page registers the same visually, but it’s more helpful. I was looking for the year in which the Victorian artist C.R. Ashbee died, and the first three results are about him. Then, above the fold, there’s a block of another three results that are mildly set apart (note the light grey lines), asking if I meant to look up information about the Ashbee Lacrosse League (with a link to the full results for that topic), then back to the artist. The page reads like a conversation, without any annoying, overly fancy technical flourishes: “Here’s some info about C.R. Ashbee…oh, did you mean the lacrosse league?…if you didn’t here’s some more about the artist.”
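For the curious, here is a toy sketch of the underlying idea (Python, with invented result snippets): group results whose snippets share more vocabulary with one another than with the rest, which is enough to pull the lacrosse league apart from the artist. Google’s actual method is surely far more sophisticated.

```python
from collections import Counter
import math

# Invented snippets standing in for search results on "Ashbee".
results = {
    "r1": "charles robert ashbee arts and crafts designer and architect",
    "r2": "c r ashbee arts and crafts guild of handicraft designer",
    "r3": "ashbee lacrosse league schedule scores teams",
    "r4": "ashbee lacrosse league youth teams registration",
}

def vector(text):
    return Counter(text.split())

def cosine(a, b):
    shared = set(a) & set(b)
    dot = sum(a[word] * b[word] for word in shared)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

vectors = {key: vector(text) for key, text in results.items()}

# Greedy single-pass clustering: attach each result to the first cluster
# whose seed snippet it resembles closely enough, otherwise start a new cluster.
clusters = []
for key, vec in vectors.items():
    for cluster in clusters:
        if cosine(vec, vectors[cluster[0]]) > 0.3:
            cluster.append(key)
            break
    else:
        clusters.append([key])

print(clusters)  # [['r1', 'r2'], ['r3', 'r4']]
```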

Now I just hope they add this clustering to their Web Search API, which would really help out with H-Bot, my automated historical fact finder.

What Would You Do With a Million Books?

What would you do with a million digital books? That’s the intriguing question this month’s D-Lib Magazine asked its contributors, as an exercise in understanding what might happen when massive digitization projects from Google, the Open Content Alliance, and others come to fruition. I was lucky enough to be asked to write one of the responses, “From Babel to Knowledge: Data Mining Large Digital Collections,” in which I discuss in much greater depth the techniques behind some of my web-based research tools. (A bonus for readers of the article: learn about the secret connection between cocktail recipes and search engines.) Most important, many of the contributors make recommendations for owners of any substantial online resource. My three suggestions, summarized here, focus on why openness is important (beyond just “free beer” and “free speech” arguments), the relatively unexplored potential of application programming interfaces (APIs), and the curious implications of information theory.

1. More emphasis needs to be placed on creating APIs for digital collections. Readers of this blog have seen this theme in several prior posts, so I won’t elaborate on it again here, though it’s a central theme of the article.

2. Resources that are free to use in any way, even if they are imperfect, are more valuable than those that are gated or use-restricted, even if those resources are qualitatively better. The techniques discussed in my article require the combination of dispersed collections and programming tools, which can only happen if each of these services or sources is openly available on the Internet. Why use Wikipedia (as I do in my H-Bot tool), which can be edited—or vandalized—by anyone? Not only can one send out a software agent to scan entire articles on the Wikipedia site (whereas the same spider is turned away by the gated Encyclopaedia Britannica), one can instruct a program to download the entire Wikipedia and store it on one’s server (as we have done at the Center for History and New Media), and then subject that corpus to more advanced manipulations (a small sketch of such a scan appears after this list). While flawed, Wikipedia is thus extremely valuable for data-mining purposes. For the same reason, the Open Content Alliance digitization project (involving Yahoo, Microsoft, and the Internet Archive, among others) will likely prove more useful for advanced digital research than Google’s far more ambitious library scanning project, which only promises a limited kind of search and retrieval.

3. Quantity may make up for a lack of quality. We humanists care about quality; we greatly respect the scholarly editions of texts that grace the well-tended shelves of university research libraries and disdain the simple, threadbare paperback editions that populate the shelves of airport bookstores. The former provides a host of helpful apparatuses, such as a way to check on sources and an index, while the latter merely gives us plain, unembellished text. But the Web has shown what can happen when you aggregate a very large set of merely decent (or even worse) documents. As the size of a collection grows, you can begin to extract information and knowledge from it in ways that are impossible with small collections, even if the quality of individual documents in that giant corpus is relatively poor.
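Here is the small sketch promised in point 2 (Python; the article is just an example, and the crude word-frequency count stands in for far more interesting analyses). It is the simplest possible version of the scan that an open resource permits and a gated one refuses:

```python
import re
import urllib.request
from collections import Counter

# Fetch one full Wikipedia article via the standard Special:Export interface.
# The article here is only an example; a real project would pull down the
# whole corpus and run much more sophisticated analyses over it.
url = "https://en.wikipedia.org/wiki/Special:Export/Frederick_Jackson_Turner"
request = urllib.request.Request(url, headers={"User-Agent": "open-content-sketch/0.1 (example)"})

with urllib.request.urlopen(request) as response:
    raw = response.read().decode("utf-8")

# Strip the markup crudely and count the remaining words.
text = re.sub(r"<[^>]+>", " ", raw.lower())
words = re.findall(r"[a-z]+", text)
print(Counter(word for word in words if len(word) > 4).most_common(10))
```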

Where Are the Noncommercial APIs?

Readers of this blog know that one of my pet peeves as someone trying to develop software tools for scholars, teachers, and students is the lack of application programming interfaces (APIs) for educational resources. APIs greatly facilitate the use of these resources and allow third parties to create new services on top of them, such as the Google Maps “mashups” that have become a phenomenon in the last year. (Please see my post “Do APIs Have a Place in the Digital Humanities?” as well as the Hurricane Digital Memory Bank for more on APIs and to see what a historical mashup looks like.) Now a clearinghouse for APIs shows the extent to which noncommercial resources—and especially those in the humanities—have been left out in the cold in this promising new phase of the web. Count with me the total number of noncommercial, educationally oriented APIs out of the nearly 200 listed on Programmable Web.

That’s right, for the humanities the answer is one: the Library of Congress’s somewhat clunky SRU (Search/Retrieve via URL). Maybe in a broader definition you could count the API from the BBC archive, though it seems to be more about current events. The Internet Archive’s API is currently focused on facilitating uploads into its system rather than, say, historical data mining of the web. A potentially rich API for finding book information, ISBNdb.com, seems promising, but shouldn’t there be a noncommercial entity offering this service (I assume ISBNdb.com will eventually charge for or limit this important service)?

By my count the only other noncommercial APIs are from large U.S. government scientific institutions such as NASA, NIH, and NOAA. Surely this long list is missing some other APIs out there, such as one for OAI-PMH. If so, let Programmable Web know—most “Web 2.0” developers are looking here first to get ideas for services, and we don’t need more mashups focusing on the real estate market.
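Even the one humanities API on the list is bare-bones: an SRU request is just a URL you assemble by hand and an XML response you parse yourself. Here is a minimal sketch (Python; the base URL is a placeholder, so substitute the address of the SRU server you actually want to query):

```python
import urllib.parse

# A minimal SRU (Search/Retrieve via URL) request. The base URL below is a
# placeholder; substitute the address of a real SRU server, such as one of
# the Library of Congress's published endpoints.
BASE_URL = "http://sru.example.org/catalog"

params = {
    "version": "1.1",
    "operation": "searchRetrieve",
    "query": '"frontier thesis"',
    "maximumRecords": "5",
}

url = BASE_URL + "?" + urllib.parse.urlencode(params)
print(url)

# To fetch and inspect the XML response from a real server:
# import urllib.request
# with urllib.request.urlopen(url) as response:
#     print(response.read().decode("utf-8"))
```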

When Machines Are the Audience

I recently received an email from someone at the Woodrow Wilson Center that began in the following way: “Dear Sir/Madam: I was wondering if you might share the following fellowship opportunity with the members of your list…The Africa Program is pleased to announce that it is now accepting applications…” The email was, of course, tagged as spam by my email software, since it looked suspiciously like what the U.S. Secret Service calls a 419 fraud scheme, or a scam where someone (generally from Africa) asks you to send them your bank account information so they can smuggle cash out of their country (the transfer then occurs in the opposite direction, in case you were wondering). Checking the email against a statistical list of high-likelihood spam triggers identified the repeated use of words such as “application,” “generous,” “Africa,” and “award,” as well as the phrases “submitted electronically” and the opening “Dear Sir/Madam.” The email piqued my curiosity because over the past year I’ve started altering some of my email writing to avoid precisely this problem of a “false positive” spam label, e.g., never sending just an attachment with no text (a classic spam trigger) and avoiding the use of phrases such as “Hey, you’ve got to look at this.” In other words, I’ve semi-consciously started writing for a new audience: machines. One of the central theories of humanities disciplines such as literature and history is that our subjects write for an audience (or audiences). What happens when machines are part of this audience?
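For those who have never peeked inside a spam filter, the crudest version is just a weighted word list. Here is a toy sketch (Python, with invented trigger phrases, weights, and threshold; real filters learn their weights statistically from large corpora of mail) that shows why the fellowship announcement scores so badly:

```python
# A toy spam scorer: the trigger phrases, weights, and threshold are all
# invented for illustration. Real filters derive these values statistically
# from large collections of spam and legitimate mail.
TRIGGERS = {
    "dear sir/madam": 3.0,
    "bank account": 4.0,
    "generous": 1.0,
    "award": 1.0,
    "application": 0.5,
    "africa": 1.5,
    "submitted electronically": 1.0,
}
THRESHOLD = 4.0

def spam_score(message):
    text = message.lower()
    return sum(weight for phrase, weight in TRIGGERS.items() if phrase in text)

announcement = (
    "Dear Sir/Madam: The Africa Program is pleased to announce a generous "
    "award. Applications must be submitted electronically."
)

score = spam_score(announcement)
print(score, "-> flagged as spam" if score >= THRESHOLD else "-> delivered")
```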

As the Woodrow Wilson Center email shows, the fact that digital text is machine readable suddenly makes the use of specific words problematic, because keyword searches can much more easily uncover these words (and perhaps act on them) than in a world of paper. It would be easy to find, for instance, all of the emails about Monica Lewinsky in the 40 million Clinton White House emails saved by the National Archives because “Lewinsky” is such an unusual word. Flipping that logic around, if I were currently involved in a White House scandal, I would studiously avoid the use of any identifying keywords (e.g., “Abramoff”) in my email correspondence.

In other cases, this keyword visibility is desirable. For instance, if I were a writer today thinking about my Word files, I would consider including or excluding certain words from each file for future research (either by myself or by others). Indeed, the “smart folder” technology in Apple’s Spotlight search or the upcoming Windows Vista search can automatically group documents based on the presence of a keyword or set of keywords. When people ask me how they can create a virtual network of websites on a historical topic, I often respond by saying that they could include at the bottom of each web page in the network a unique invented string of characters (e.g., “medievalhistorynetwork”). After Google indexes all of the web pages with this string, you could easily create a specialized search engine that scans only these particular sites.
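That last trick is as simple as it sounds. Here is a minimal sketch (Python; the member URLs and the marker string are invented) that fetches candidate pages and keeps only the ones carrying the network’s marker, which is essentially what a specialized search engine would do at much larger scale:

```python
import urllib.request

# Hypothetical member pages and the invented marker string that sites in
# the network agree to place at the bottom of each page.
MARKER = "medievalhistorynetwork"
CANDIDATE_PAGES = [
    "http://example.org/medieval/charters.html",
    "http://example.edu/courses/feudalism.html",
]

members = []
for url in CANDIDATE_PAGES:
    try:
        with urllib.request.urlopen(url) as response:
            html = response.read().decode("utf-8", errors="replace")
    except OSError:
        continue  # unreachable page; skip it
    if MARKER in html:
        members.append(url)

print("Pages carrying the network marker:", members)
```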

“Machine audience consciousness” has probably already infected many other realms of our writing. Have some other examples? Let me know and I’ll post them here.