Education Text Mining

10 Most Popular Philosophy Syllabi

It’s time once again to find the most influential syllabi in a discipline—this time, philosophy—as determined by data gleaned from the Syllabus Finder. As with my earlier analysis of the most popular history syllabi, the following list was compiled by running a series of calculations to determine the number of times Syllabus Finder users glanced at a syllabus (had it turn up in a search), the number of times Syllabus Finder users inspected a syllabus (actually went from the Syllabus Finder website to the website of the syllabus to do further reading), and the overall “attractiveness” of a syllabus (defined as the ratio of full reads to mere glances). It goes without saying (but I’ll say it) that this methodology is unscientific and gives an advantage to older syllabi, but it still probably provides a good sense of the most visible and viewed syllabi on the web. Anyway, here are the ten most popular philosophy syllabi.

#1 – Philosophy of Art and Beauty, Julie Van Camp, California State University, Long Beach, Spring 1998 (total of 3992 points)

#2 – Introduction to Philosophy, Andreas Teuber, Brandeis University, Fall 2004 (3699 points)

#3 – Law, Philosophy, and the Humanities, Julie Van Camp, California State University, Long Beach, Fall 2003 (3174 points)

#4 – Introduction to Philosophy, Jonathan Cohen, University of California, San Diego, Fall 1999 (2448 points)

#5 – Comparative Methodology, Bryan W. Van Norden, Vassar College, multiple semesters (1944 points)

#6 – Aesthetics, Steven Crowell, Rice University, Fall 2003 (1913 points)

#7 – Philosophical Aspects of Feminism, Lisa Schwartzman, Michigan State University, Spring 2001 (1782 points)

#8 – Morality and Society, Christian Perring, University of Kentucky, Spring 1996 (1912 points)

#9 – Gay and Lesbian Philosophy, David Barber, University of Maryland, Spring 2002 (1442 points)

#10 – Social and Political Philosophy, Eric Barnes, Mount Holyoke College, Fall 1999 (1395 points)

I will leave it to readers of this blog to assess and compare these syllabi, but two brief comments. First of all, the diversity of topics within this list is notable compared to the overwhelming emphasis on American history among the most popular history syllabi. Aesthetics, politics, law, morality, gender, sexuality, and methodology are all represented. Second, congratulations to Julie Van Camp of California State University, Long Beach, who becomes the first professor with two top syllabi in a discipline. Professor Van Camp was a very early adopter of the web, having established a personal home page almost ten years ago with links to all of her syllabi. Van Camp should watch her back, however; Andreas Teuber of Brandeis is coming up quickly with what seems to be the Platonic ideal of an introductory course on philosophy. In less than two years since its inception his syllabus has been very widely consulted.

[The fine print of how the rankings were determined: 1 point was awarded for each time a syllabus showed up in a Syllabus Finder search result; 10 points were awarded for each time a Syllabus Finder user clicked through to view the entire syllabus; 100 points were awarded for each percent of “attractiveness,” where 100% attractive means that every time a syllabus made an appearance in a search result it was clicked on for further information. For instance, the top syllabus appeared in 2164 searches and was clicked on 125 times (5.78% of the searches), for a point total of 2164 + (125 X 10) + (5.78 X 100) = 3992.]
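For anyone who wants to rerun the arithmetic, here is a minimal sketch of the point scheme in Python. The function name is my own invention, and rounding the attractiveness percentage to two decimals is an assumption based on the 5.78% figure in the worked example; it also assumes a syllabus has appeared in at least one search.

```python
def syllabus_points(appearances: int, clicks: int) -> int:
    """Score a syllabus with the point scheme described in the fine print.

    Assumes appearances > 0. The function name and the two-decimal
    rounding are illustrative assumptions, not the original tool's code.
    """
    # Attractiveness: percentage of search appearances that led to a click.
    attractiveness = round(100 * clicks / appearances, 2)
    # 1 point per appearance, 10 per click, 100 per percent of attractiveness.
    total = appearances + 10 * clicks + 100 * attractiveness
    return round(total)


# The top syllabus: 2164 search appearances, 125 click-throughs.
print(syllabus_points(2164, 125))  # 3992
```

Note how heavily the last term weighs: a syllabus that rarely shows up in searches but is clicked nearly every time can outscore one that appears often and is mostly ignored.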

Books Digitization Google Text Mining

Google Book Search Blog

For those interested in the Google book digitization project (one of my three copyright-related stories to watch for 2006), Google launched an official blog yesterday. Right now “Inside Google Book Search” seems more like “Outside Google Book Search,” with a first post celebrating the joys of books and discovery, and with a set of links lauding the project, touting “success stories,” and soliciting participation from librarians, authors, and publishers. Hopefully we’ll get more useful insider information about the progress of the project, hints about new ways of searching millions of books, and other helpful tips for scholars in the near future. As I recently wrote in an article in D-Lib Magazine, Google’s project has some serious—perhaps fatal—flaws for those in the digital humanities (not so for the competing, but much smaller, Open Content Alliance). In particular, it would be nice to have more open access to the text (rather than mere page images) of pre-1923 books (i.e., those that are out of copyright). Of course, I’m a historian of the Victorian era who wants to scan thousands of nineteenth-century books using my own digital tools, not a giant company that may want to protect its very expensive investment in digitizing whole libraries.

Programming Web Design

Using AJAX Wisely

Since its name was coined on February 18, 2005, AJAX (for Asynchronous JavaScript and XML) has been a much-discussed new web technology. For those not involved in web production, essentially AJAX is a method for dynamically changing parts of a web page without reloading the entire thing; like other dynamic technologies such as Flash, it makes the web browser seem more like a desktop application than a passive window for reading documents. Unlike Flash, however, AJAX applications have generally focused less on interactive graphics (and the often cartoony elements that are now associated with Flash) and more on advanced presentation of text and data, making it attractive to those in academia, libraries, and museums. It’s easy to imagine, for instance, an AJAX-based online library catalog that would allow for an easy refinement of a book search (reordering or adding new possibilities) without a new query submission for each iteration. Despite such promise, or perhaps because of the natural lag between commercial and noncommercial implementations of web technologies, AJAX has not been widely used in academia. That’s fine. Unlike the dot-coms, we should first be asking: What are appropriate uses for AJAX?

As with all technologies, it’s important that AJAX be used in a way that advances the pedagogical, archival, or analytical goals of a project, and with a recognition of its advantages and disadvantages. Such sober assessment is often difficult, however, in the face of hype. Let me put one prick in the AJAX bubble, though, which can help us orient the technology properly: AJAX often scrubs away useful URLs—the critical web addresses students, teachers, and scholars rely on to find and cite web pages and digital objects. For some, the ability to reference documents accurately over time is less of a concern compared to functionality and fancy design—but the lack of URLs for specific “documents” (in the broad sense of the word) on some AJAX sites makes them troubling for academic use. Brewster Kahle, the founder of the Internet Archive, surmised that his archive may hold the blog of a future president; if she’s using some of the latest AJAX-based websites, we historians will have a very hard time finding her early thoughts because they won’t have a fixed (and indexable) address.

If not implemented carefully, AJAX (like Flash) could end up like the lamentable 1990s web technology “frames,” which could, for instance, hide the exact address of a scanned medieval folio in a window distinct from the site’s navigation, as in the Koninklijke Bibliotheek’s Medieval Illuminated Manuscripts site—watch how the URL at the top of your browser never changes as you click on different folios, frustrating anyone who wants to reference a specific page. Accurate citations are a core requirement for academic work. We need to be able to reference URLs that aren’t simply a constantly changing, fluid environment.

At the Center for History and New Media, our fantastic web developers Jim Safley and Nate Agrin have implemented AJAX in the right way, I believe, for our Hurricane Digital Memory Bank. In prior projects that gathered recollections and digital objects like photographs for future researchers, such as the September 11 Digital Archive, we worried about making the contribution form too long. We wanted as many people as possible to contribute, but we also knew that itchy web surfers are often put off by multi-page forms to fill out.

Jim and Nate solved this tension brilliantly by making the contribution form for the Hurricane Digital Memory Bank dynamic using AJAX. The form is relatively short but certain sections can change or expand to accept different kinds of objects, text, or geographical information depending on the interactions of the user with the form and accompanying map. It is simultaneously rich and unimposing. When you click on a link that says “Provide More Information” a new section of the form extends beyond the original.

Once a contribution has been accepted, however, it’s assigned a useful, permanent web address that can be referenced easily. Each digital object in the archive, from video to audio to text, has its own unique identifier, which is made explicit at the bottom of the window for that object (e.g., “Cite as: Object #139, Hurricane Digital Memory Bank: Preserving the Stories of Katrina, Rita, and Wilma, 17 November 2005, <>”).

AJAX will likely have a place in academic digital projects—just a more narrow place than out on the wild web.

Mac Virtualization Web Design Windows

An Actual Use for Windows on the Mac

OK, so you can now run Windows on a Mac. So what? For most of us in the humanities, all we need is already on the Mac, which (in addition to intangibles such as the Mac’s design) is why so many of us remain stubbornly attached to Apple’s computers while over the last twenty years almost everyone else has moved to the more generic platform of the PC. Most educational, graphics, and web development software is available for the Mac. (For those in the social and natural sciences, on the other hand, many important software packages are either not available for the Mac or come out later than they do for the PC.) But perhaps there’s the rub. Since many of us only use Macs—especially those that build academic or museum websites—we often don’t see how most people view our sites. Since websites often render differently on different operating systems and web browsers, not checking how your site will look (and perform, if you are using dynamic web technologies) on a PC with IE (still 85% of web surfers) is unwise. Now with Parallels Workstation—the Windows-on-Mac solution that doesn’t require rebooting your computer to switch OSes—you can literally have a window into the world of Windows sitting on your desktop in parallel with your Mac applications. For instance, here’s a screenshot of my Mac desktop with Firefox for the Mac running on the left, and IE for Windows running on the right:

Looks to me like I need to work on the font size differential between Macs and PCs.

This parallelism of operating systems is incredibly handy for web development on a single machine. At the Center for History and New Media we have gone through phases where we have paid for services that send us static images of our websites on different platforms and in different browsers. We also spend a lot of time running from our Macs over to PCs to check how everything is looking. Now we can do this all on one machine, easily and instantaneously.

Now I just need to install another window for the 2% of web surfers using Linux…

Google Open Access Search

The Single Box Humanities Search

I recently polled my graduate students to see where they turn to begin research for a paper. I suppose this shouldn’t come as a surprise: the number one answer—by far—was Google. Some might say they’re lazy or misdirected, but the allure of that single box—and how well it works for most tasks—is incredibly strong. Try getting students to go to five or six different search engines for gated online databases such as ProQuest Academic and JSTOR—all of which have different search options and produce a complex array of results compared to Google. I was thinking about this recently as I tested the brand new scholarly search engine from Microsoft, Windows Live Academic. Windows Live Academic is a direct competitor to Google Scholar, which has been in business now for over a year but is still in “beta” (like most Google products). Both are trying to provide that much-desired single box for academic researchers. And while those in the sciences may eventually be happy with this new option from Microsoft (though it’s currently much rougher than Google’s beta, as you’ll see), like Google Scholar, Windows Live Academic is a big disappointment for students, teachers, and professors in the humanities. I suspect there are three main reasons for this lack of a high-quality single box humanities search.

First, a quick test of Google Scholar and Windows Live Academic. Can either one produce the source of the famous “frontier thesis,” probably the best-known thesis in American historiography?

Clearly, the usefulness of these search results is dubious, especially those from Windows Live Academic (The Political Economy of Land Conflict in the Eastern Brazilian Amazon as the top result?). Why can’t these giant companies do better than this for humanities searches?

Obviously, the people designing and building these “academic” search engines are from a distinct subset of academia: computer science and mathematical fields such as physics. So naturally they focus on their own fields first. Both Google Scholar and Windows Live Academic work fairly well if you would like to know about black holes or encryption. Moreover, “scholarship” in these fields generally means articles, not books. Google Scholar and Windows Live Academic are dominated by journal-based publications, though both sometimes show books in their search results. But when Google Scholar does so, these books seem to appear because articles that match the search terms cite these works, not because of the relevance of the text of the books themselves.

In addition, humanities articles aren’t as easy as scientific papers to subject to bibliometrics—methods such as citation analysis that reveal the most important or influential articles in a field. Science papers tend to cite many more articles (and fewer books) in a way that makes them subject to extensive recursive analysis. Thus a search on “search” on Google Scholar aptly points a researcher to Sergey Brin’s and Larry Page’s seminal paper outlining how Google would work, because hundreds of other articles on search technology dutifully refer to that paper in their opening paragraph or footnote.
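To make the idea concrete, here is a minimal sketch of the simplest form of citation analysis: counting how often each paper is cited by others. The data is a made-up toy example with hypothetical paper names, and real bibliometric systems (including whatever Google Scholar does internally, which I don't know) go well beyond a raw count.

```python
from collections import Counter

# Hypothetical toy data: each paper maps to the list of papers it cites.
citations = {
    "paper_a": ["seminal_paper"],
    "paper_b": ["seminal_paper", "paper_a"],
    "paper_c": ["seminal_paper"],
    "seminal_paper": [],
}


def citation_counts(graph):
    """Count how many distinct papers in the graph cite each paper."""
    counts = Counter()
    for cited_list in graph.values():
        # Use a set so a paper citing the same work twice counts once.
        counts.update(set(cited_list))
    return counts


ranked = citation_counts(citations).most_common()
print(ranked[0])  # ('seminal_paper', 3)
```

A field in which articles cite mostly books that sit outside the index gives a count like this far less to work with, which is part of the problem for the humanities.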

Most important, however, is the question of open access. Outlets for scientific articles are more open and indexable by search engines than humanities journals. In addition to many major natural and social science journals, CiteSeer (sponsored by Microsoft) and make hundreds of thousands of articles on computer science, physics, and mathematics freely available. This disparity in openness compared to humanities scholarship is slowly starting to change—the American Historical Review, for instance, recently made all new articles freely available online—but without a concerted effort to open more gates, finding humanities papers through a single search box will remain difficult to achieve. Microsoft claims in its FAQ for Windows Live Academic that it will get around to including better results for subjects like history, but like Google they are going to have a hard time doing that well without open historical resources.

UPDATE [18 April 2006]: Microsoft has contacted me about this post; they are interested in learning more about what humanities scholars expect from a specialized academic search engine.

UPDATE [21 April 2006]: Bill Turkel makes the great point that Google’s main search does a much better job than Google Scholar at finding the original article and author of the frontier thesis:

Google History Maps Mashups

Mapping Recent History

As the saying goes, imitation is the sincerest form of flattery. So at the Center for History and New Media, we’re currently feeling extremely flattered that our initiatives in collecting and presenting recent history—the Echo Project (covering the history of science, technology, and industry), the September 11 Digital Archive, and the Hurricane Digital Memory Bank—are being imitated by people using a wave of new websites that help them locate recollections, images, and other digital objects on a map. Here’s an example from the mapping site Platial:

And a similar map from our 9/11 project:

Of course, we’re delighted to have imitators (and indeed, in turn we have imitated others), since we are trying to disseminate as widely as possible methods for saving the digital record of the present for future generations. It’s great to see new sites like Platial, CommunityWalk, and Wayfaring providing easy-to-use, collaborative maps that scattered groups of people can use to store photos, memories, and other artifacts.

Audience RSS Stats

Measuring the Audience of a Digital Humanities Project

Karen Motylewski of the Institute of Museum and Library Services recently pressed an audience of recent IMLS grantees to think about how they might measure the success of their digital projects. As she was well aware, academics often bristle at the quantitative measurement of the audience for their websites because it smacks of commercialism. Also, we professors and librarians and curators generally avoid taking classes in such base topics as marketing. But Karen has a point. Indeed, Roy Rosenzweig and I devote an entire chapter in Digital History to how to build an audience—not for commercial or narcissistic reasons, but because an academic digital project should be, as we say, “useful and used.” I started this blog to explain in greater depth some of the projects and research I’m working on in the digital humanities, but I also did it (as readers of my five-part series on “Creating a Blog from Scratch” will know; 1, 2, 3, 4, 5) to learn first-hand about the composition of blogs and the technologies behind them. Writing my own code for this blog forced me to examine in detail—and occasionally rethink—some blogging conventions (technical, design, and content). And one of the benefits of doing so has been a realization that I have significantly underestimated the power of RSS. I now think it may be the best measurement of utility for an academic website, far better than server logs or other quantitative measurements. Let me explain why.

Think of your reading habits—specifically, periodicals. You probably subscribe to a newspaper, a magazine or two (or three), and perhaps some academic or specialist journals. Every time you go to the dentist, you also probably voraciously read all of those salacious magazines and lifestyle handbooks you don’t subscribe to. If you’re in a particularly bad waiting room, you probably read anything that’s lying around, even if you would never buy those magazines at a newsstand. As anyone in the magazine or newspaper business will tell you, what they really want is subscribers, not casual, one-time readers. Subscribers have shown a level of interest in, and dedication to, a periodical that is several levels above all other readers.

Now look carefully at web server logs—the trail of a website’s readers. Most visitors to a typical website are like the third type of magazine reader—simply passing through on the way to get their cavities filled. They generally come from search engines, quickly scan a page, and leave, their IP address never to be seen again.

Moreover, up to three-quarters of traffic to most websites is from bots (e.g., Google’s indexing spider)—a machine audience that you probably care little about, except as a way to drive traffic to your site from search requests. On this site in March 2006, the human audience looked at about 10,000 pages; machines requested over 26,000 pages. This doesn’t even take into account “server spam,” which consists of fake requests to your server to make it look like another website is sending a lot of traffic your way. In March, was the number one “referrer” to this blog. Great.
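If you want to estimate the machine share of your own traffic, a rough sketch follows. It assumes you have already pulled the user-agent string out of each log line, and the marker list is a hypothetical starting point of my own, not a complete catalog of crawlers; server spam with forged user agents will slip through.

```python
# Hypothetical substrings that suggest a request came from a crawler.
BOT_MARKERS = ("bot", "spider", "crawler", "slurp")


def is_bot(user_agent: str) -> bool:
    """Guess whether a user-agent string belongs to a bot."""
    ua = user_agent.lower()
    return any(marker in ua for marker in BOT_MARKERS)


def bot_share(user_agents):
    """Fraction of requests whose user agent looks like a bot."""
    agents = list(user_agents)
    if not agents:
        return 0.0
    return sum(is_bot(ua) for ua in agents) / len(agents)


# Figures from this post: about 10,000 human and 26,000 machine page requests.
print(round(26000 / (10000 + 26000), 2))  # 0.72
```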

So now we are down to about 10% of the top line number of “visitors” to your website. You are likely getting depressed. But here’s where another point Roy and I make comes into play: “think about community, not numbers of visitors.” That other 10% includes a number of people who love your site and what it has to offer but only visit every once in a while.

Then there are the subscribers. RSS truly provides an online analog to periodical subscriptions; “subscriptions” is a very good word for it since subscribers receive each update automatically. RSS finally allows digital humanities projects to assess how many people are really committed to a site. Notably, this number may or may not follow overall site traffic patterns. For instance, here’s a comparison of server logs for this site with RSS subscriptions:

In the noise of all of the bot traffic and uninterested visitors (top chart; the orange bar represents unique visitors, the dark blue is page views), I’m grateful that subscriptions to this blog (bottom chart) have climbed steadily since its inception four months ago. Should this blog have the enormous traffic of a BoingBoing? No. That’s not why I started it. I’m trying to reach a fairly specific audience that is several orders of magnitude smaller than the big tech/geek audience for BoingBoing. Success means reaching and having a conversation with those people—the people who I believe are doing critical work for the future of education, libraries, and the humanities—not with a mass audience. I hope this site is slowly creeping toward that modest goal. By tracking RSS subscriptions, other digital humanities projects can also see if they’re reaching their envisioned audience.

But how do you use RSS if your site isn’t a blog? If your site is a digital collection or archive, you can add a “news about this site” or “new features/new additions” RSS feed, as we have done for the Hurricane Digital Memory Bank. If your project involves software development, you can put code update announcements into an RSS feed. Even if your site is relatively static, new services such as will send out notifications of site changes to interested parties. Once you have an RSS feed (you should link to it from your home page so that RSS-aware browsers can find it quickly), you can then use services such as Feedburner to track RSS subscriptions more carefully.
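The home page link mentioned above is conventionally a `<link rel="alternate" type="application/rss+xml">` tag in the page's head. Here is a small sketch, using only Python's standard library, that checks whether a page advertises a feed this way. The sample page and its `/feed.xml` path are placeholders of my own, not the address of any real feed.

```python
from html.parser import HTMLParser


class FeedLinkFinder(HTMLParser):
    """Collect href values of RSS autodiscovery <link> tags."""

    def __init__(self):
        super().__init__()
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attributes = dict(attrs)
        rel = (attributes.get("rel") or "").lower().split()
        mime = (attributes.get("type") or "").lower()
        href = attributes.get("href")
        if "alternate" in rel and mime == "application/rss+xml" and href:
            self.feeds.append(href)


def find_feeds(html: str):
    """Return the feed URLs advertised in an HTML document."""
    finder = FeedLinkFinder()
    finder.feed(html)
    return finder.feeds


# Hypothetical home page; the feed path is a placeholder.
page = """<html><head>
<link rel="alternate" type="application/rss+xml" title="News" href="/feed.xml">
</head><body></body></html>"""
print(find_feeds(page))  # ['/feed.xml']
```

If this returns an empty list for your own home page, RSS-aware browsers probably won't notice your feed either.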

With all of its faults and problems, I suspect we will soon be saying, “The server log is dead.” Long live RSS.

Audience Stats

The Final Four’s Impact on Websites

I work at George Mason University. Unless you live off the grid (and if so, how are you reading this?), you’ve probably heard that our basketball team is in the Final Four this weekend. There has been a great deal of talk around campus about the impact this astonishing feat will have on the university’s stature and undergraduate admissions. But what about its effect on Mason’s websites? A bit of unscientific evidence from Alexaholic, which creates website traffic graphs using data from’s Alexa web service:

Our domain has gone from being about the 5300th most popular on the web to about 2100th since Mason was selected (controversially) for the tournament on March 12. OK, we’re not exactly in Yahoo territory, but we’ve bypassed dozens of other universities in our steep two-week climb.

Audience Google Search Tutorials

Search Engine Optimization for Smarties

A Google search for “Sputnik” gives you an authoritative site from NASA in the top ten search results, but also a web page from the skydiver and ballroom-dancing enthusiast Michael Wright. This wildly democratic mix of sources perennially leads some educators to wring their hands about the state of knowledge, as yet another op-ed piece in the New York Times does today (“Searching for Dummies” by Edward Tenner). It’s a strange moment for the Times to publish this kind of lament; it seems like an op-ed left over from 1997, and as I’ve previously written in this space (and elsewhere with Roy Rosenzweig), contrary to Tenner’s one example of searching in vain for “World History,” online historical information is actually getting better, not worse (especially if you assess the web as a whole rather than complain about a few top search results). Anyway, Tenner does make one very good point: “More owners of free high-quality content should learn the tradecraft of tweaking their sites to improve search engine rankings.” This “tradecraft” is generally called “search engine optimization,” and I’ve long thought I should let those in academia (and other creators of reliable, noncommercial digital resources) in on the not-so-secret ways you can move your website higher up in the Google rankings (as well as in the rankings of other search engines).

1. Start with an appropriate domain name. Ideally, your domain should contain the top keywords you expect people searching for your topic to type into Google. At CHNM we love the name “Echo” for our history of science website, but we probably should have made the URL rather than Professors like to name digital projects something esoteric or poetic, preferably in Greek or Latin. That’s fine. But make the URL something more meaningful (and yes, more prosaic, if necessary) for search engines. If you read Google’s Web Search API documentation, you’ll realize that their spider can actually parse domain names for keywords, even if you run these words together.

2. If you’ve already launched your website, don’t change its address if it already has a lot of links to it. “Inbound” links are the currency of Google rankings. (You can check on how many links there are to your site by typing “link:[your domain name here]” into Google.) We can’t change Echo’s address now, because it’s already got hundreds of links to it, and those links count for a lot. (Despite the poetic name, we’re in the top ten for “history of science.”) There are some fancy ways to “redirect” sites from an old domain to a new one, but it’s tricky.

3. Get as many links to your site as you can from high-quality, established, prominent websites. Here’s where academics and those working in museums and libraries are at an advantage. You probably already have access to some very high-ranking, respected sites. Work at the Smithsonian or the Library of Congress? Want an extremely high-ranking website on any topic? Simply link to the new website (appropriately named, of course) from the home page of your main site (the home page is generally the best page to get a link from). Wait a month or two and you’re done, because and wield enormous power in Google’s mathematical ranking system. A related point is…

4. Ask other sites to link to your site using the keywords you want. If you have a site on the Civil War, a bad link is one that says, “Like the Civil War? Check out this site.” A helpful link is one that says, “This is a great site on the Civil War.” If you use the Google Sitemap service, it will tell you what the most popular keywords are in links to your site.

5. Include keywords in file names and directory names across your site, and don’t skimp on the letters. This point is similar to #1, only for subtopics and pages on your site. Have a bibliography of Civil War books? Name the file “civilwarbibliography.html” rather than just “biblio.html” or some nonsense letters or numbers.

6. Speaking of nonsense letters and numbers, if your site is database-driven, recast ungainly numbers and letters in the URL (known in geek-speak as the “query string”), e.g., change to Have someone who knows how to do “URL rewriting” change those URLs to readable strings (if you use the Apache web server software, as 70% of sites do, the software that does this is called “mod_rewrite”; it still keeps those numbers and letters in memory, but doesn’t let the human or machine audiences see them).

7. Be very careful about hiring someone to optimize your site, and don’t do anything shifty like putting white text with your keywords on a white background. Read Google’s warning about search engine optimization and shady methods and their propensity to ban sites for subterfuge.

8. Don’t bother with metatags. Google and other search engines don’t care about these old, hidden HTML tags that were supposed to tell search engines what a web page was about.

9. Be patient. For most sites, it’s a slow rise to the top, accumulating links, awareness in the real world and on the web, etc. Moreover, there is definitely a first-mover advantage—being highly ranked creates a virtuous circle, because by being in the top ten, other sites link to your site because they find it more easily than others. Thus Michael Wright’s page on Sputnik, which is nine years old, remains stubbornly in the top ten. But one of the advantages a lot of academic and nonprofit sites have over the Michael Wrights of the world is that we’re part of institutions that are in it for the long run (and don’t have ballroom dancing classes). I’m more sanguine than Edward Tenner that in the near future, great sites, many of them from academia, will rise to the top, and be found by all of those Google-centric students the educators worry about.

But these sites (and their producers) could use a little push. Hope this helps.

(You might also want to read the chapter Roy and I wrote on building an audience for your website in Digital History, especially the section that includes a discussion of how Google works, as well as another section of the book on “Site Structure and Good URLs.”)
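For the curious, points 5 and 6 above can be sketched in a few lines of Python. This is only an illustration of the idea behind readable URLs, not how mod_rewrite itself is configured. The function names, the lookup table, and the record number are all hypothetical.

```python
import re


def slugify(title: str, sep: str = "") -> str:
    """Turn a human-readable title into a keyword-rich URL segment.

    With the default separator the words run together, as in the
    file-name example in point 5.
    """
    words = re.findall(r"[a-z0-9]+", title.lower())
    return sep.join(words)


# Hypothetical lookup from readable slug to the record id the database uses.
SLUG_TO_ID = {slugify("Civil War Bibliography"): 42}


def resolve(path: str):
    """Map a readable path to an internal record id, or None if unknown."""
    return SLUG_TO_ID.get(path.strip("/"))


print(slugify("Civil War Bibliography"))  # civilwarbibliography
print(resolve("/civilwarbibliography"))   # 42
```

The visitor and the search engine see only the readable path; the ungainly number stays behind the scenes, which is the same division of labor a URL rewriting rule provides.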

APIs Google Search Text Mining

Google Adds Topic Clusters to Search Results

Google has been very conservative about changing their search results page. Indeed, the design of the page and the information presented has changed little since the search engine’s public introduction in 1998. Innovations have literally been marginal: Google has added helpful spelling corrections (“Did you mean…?”), related search terms, and news items near the top of the page, and of course the ubiquitous text ads to the right. But the primary search results block has remained fairly untouched. Competitors have come and gone (mostly the latter), promoting new—and they say better—ways of browsing masses of information. But Google’s clean, relevant list has brushed off these upstarts. So it surprised me when I was doing some fact checking on a book I’m finishing to see the following search results page:

As you can see, Google has evidently introduced a search results page that clusters relevant web pages by subject matter. Google has often disparaged other search engines that do this sort of clustering, like the gratingly named Clusty and Vivisimo, perhaps because Google’s engineers must be some of the few geeks who understand that regular human beings don’t particularly care for fancier ways of structuring or visualizing search results. Just the text, ma’am.

But while this addition of clustering (based on the information theory of document classification, as I recently discussed in D-Lib and in a popular prior blog post) to Google’s search results page is surprising, the way they’ve done it is typically simple and useful. No little topic folders in a sidebar; no floating circles connected by relationship lines. The page registers the same visually, but it’s more helpful. I was looking for the year in which the Victorian artist C.R. Ashbee died, and the first three results are about him. Then, above the fold, there’s a block of another three results that are mildly set apart (note the light grey lines), asking if I meant to look up information about the Ashbee Lacrosse League (with a link to the full results for that topic), then back to the artist. The page reads like a conversation, without any annoying, overly fancy technical flourishes: “Here’s some info about C.R. Ashbee…oh, did you mean the lacrosse league?…if you didn’t, here’s some more about the artist.”

Now I just hope they add this clustering to their Web Search API, which would really help out with H-Bot, my automated historical fact finder.