Archives Pedagogy Text Mining

A Million Syllabi

Today I’m releasing a database of over a million syllabi gathered by my Syllabus Finder tool from 2002 to 2009. My hope is that this unique corpus will be helpful for a broad range of researchers. I’m fairly sure this is the largest collection of syllabi ever gathered, probably by several orders of magnitude.

I created the Syllabus Finder in 2002 when Google released their first API to access their search engine. The initial API included the ability to grab cached HTML from millions of web pages, which I realized could then be scanned using high-relevancy keywords to identify pages that were most likely syllabi. In addition to my lousy PHP code that got it up and running, the brilliant Simon Kornblith wrote some additional code to make it work well. The result was a tool that was quite popular (1.3 million queries) until Google deprecated their original API in 2009 in favor of (what I consider to be) a less useful API. (With the original API you could basically clone, which I’m sure was not popular at the Googleplex.)

If you are interested in the kind of research that can be done on these syllabi, please read my Journal of American History article “By the Book: Assessing the Place of Textbooks in U.S. Survey Courses.” For that article I used regular expressions to pull book titles out of a thousand American history surveys to see how textbooks and other works are used by instructors. Some hidden elements emerged. I’m excited to see what creative ideas other scholars and researchers come up with for this large database.

Some important clarifications and caveats:

1) I’m providing this archive in the same spirit (and under same regulations) that the Internet Archive provides web corpora (indeed, this corpus could probably be recreated from the Internet Archive’s Wayback Machine, albeit after a lot of work). To the best of my knowledge, and because of the way they were obtained, all of the documents this database contains were posted on the open web, and were cached (or not) respecting open-web standards such as robots.txt. It does not contain any syllabi that were posted in private places, such as gated Blackboard installations. Indeed, I suspect that most of these syllabi come from universities where it is expected that professors post syllabi in an open fashion (as is the case here at Mason), or from professors like me who believe that openness is good for scholarship and teaching. But as with the Internet Archive, if you are the creator of a syllabus and really can’t sleep unless it is purged from this research database, contact me.

2) This database is provided as is and without support. I get enough email and unfortunately cannot answer questions. If you are appreciative, you can make a tax-free donation to the Center for History and New Media, for which you will receive a hug from me. The database is intended for non-commercial use of the type seen in my JAH article.

3) The database is an SQL dump consisting of 1.4 million rows. The columns are syllabiID (the Syllabus Finder’s unique identifier), url (web address of the syllabus at the time it was found), title (of the web page the syllabus was on), date_added (when it was added to the Syllabus Finder database), and chnm_cache (the HTML of the page on the date it was added). The database is 804 MB uncompressed. The corpus is heavily U.S.-centric because web pages were matched to English-language words, and for a time the Syllabus Finder only took pages from .edu domains (thus leaving out, e.g., URLs).

4) Because the Syllabus Finder was completely automated, some percentage of the 1.4 million documents are not syllabi (my best guess is about 20%). Most often these incorrect matches are associated course documents such as assignments, which are interesting in their own right. But some are oddball documents that just looked like syllabi to the algorithms. I have made no attempt to weed them out.

If you understand all of this clearly, then here’s a million syllabi for you: CHNM Syllabus Finder Corpus, Version 1.0 (30 March 2011) (265 MB download, zipped SQL file)

UPDATE 1 (11pm 3/30/11): Matt Burton has helpfully provided a torrent for this file. If you can, please use it instead of the direct download.

UPDATE 2 (9pm 3/31/11): Unfortunately I should have checked the exported database before posting. Version 1.0 does indeed have the URLs, titles, and dates of about 1.45 million syllabi but it is missing a majority of the HTML caches of those syllabi. I am working to recreate the full database, which will be much larger and more useful.