A Million Syllabi

Today I’m releasing a database of over a million syllabi gathered by my Syllabus Finder tool from 2002 to 2009. My hope is that this unique corpus will be helpful for a broad range of researchers. I’m fairly sure this is the largest collection of syllabi ever gathered, probably by several orders of magnitude.

I created the Syllabus Finder in 2002 when Google released their first API to access their search engine. The initial API included the ability to grab cached HTML from millions of web pages, which I realized could then be scanned using high-relevancy keywords to identify pages that were most likely syllabi. In addition to my lousy PHP code that got it up and running, the brilliant Simon Kornblith wrote some additional code to make it work well. The result was a tool that was quite popular (1.3 million queries) until Google deprecated their original API in 2009 in favor of (what I consider to be) a less useful API. (With the original API you could basically clone google.com, which I’m sure was not popular at the Googleplex.)
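To make the keyword-scanning idea concrete, here is a minimal sketch in Python. It is not the Syllabus Finder's actual PHP code. The keyword list, the weights, and the threshold are all hypothetical, since the post does not say which terms or cutoffs were used.

```python
import re

# Hypothetical keyword weights; the Syllabus Finder's real list is not given in the post.
KEYWORDS = {
    "syllabus": 3,
    "office hours": 2,
    "required texts": 2,
    "course description": 2,
    "grading": 1,
    "readings": 1,
}


def strip_tags(html: str) -> str:
    """Crude tag removal, good enough for keyword counting on cached HTML."""
    return re.sub(r"<[^>]+>", " ", html)


def syllabus_score(html: str) -> int:
    """Sum the weights of the keywords that appear at least once in the page text."""
    text = re.sub(r"\s+", " ", strip_tags(html)).lower()
    return sum(weight for phrase, weight in KEYWORDS.items() if phrase in text)


def looks_like_syllabus(html: str, threshold: int = 5) -> bool:
    """Flag a page as a probable syllabus when its score reaches the (assumed) threshold."""
    return syllabus_score(html) >= threshold
```

A scorer this simple will also flag assignments and other course documents, which is consistent with the false matches described in the caveats below.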

If you are interested in the kind of research that can be done on these syllabi, please read my Journal of American History article “By the Book: Assessing the Place of Textbooks in U.S. Survey Courses.” For that article I used regular expressions to pull book titles out of a thousand American history surveys to see how textbooks and other works are used by instructors. Some hidden elements emerged. I’m excited to see what creative ideas other scholars and researchers come up with for this large database.
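As a rough illustration of that regular-expression approach, the sketch below pulls candidate book titles out of syllabus HTML and counts how many syllabi mention each one. It is not the method from the JAH article. It rests on the assumption that titles are usually wrapped in `<i>`, `<em>`, `<u>`, or `<cite>` tags, and the function names are made up for this example.

```python
import re
from collections import Counter

# Assumption: book titles are marked up with italic, emphasis, underline, or cite tags.
TITLE_RE = re.compile(r"<(i|em|u|cite)\b[^>]*>(.*?)</\1>", re.IGNORECASE | re.DOTALL)


def candidate_titles(html):
    """Return the text of emphasized spans that are at least two words long."""
    titles = []
    for _tag, inner in TITLE_RE.findall(html):
        text = re.sub(r"<[^>]+>", "", inner)          # drop any nested tags
        text = re.sub(r"\s+", " ", text).strip(" .,;:")
        if len(text.split()) >= 2:                    # skip one-word emphasis such as "Note"
            titles.append(text)
    return titles


def title_counts(pages):
    """Count, for each candidate title, the number of pages that mention it."""
    counts = Counter()
    for html in pages:
        counts.update({title.lower() for title in candidate_titles(html)})
    return counts
```

Calling `title_counts(pages).most_common(20)` on a list of cached pages would give a first, noisy ranking that still needs manual cleanup.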

Some important clarifications and caveats:

1) I’m providing this archive in the same spirit (and under the same regulations) that the Internet Archive provides web corpora (indeed, this corpus could probably be recreated from the Internet Archive’s Wayback Machine, albeit after a lot of work). To the best of my knowledge, and because of the way they were obtained, all of the documents this database contains were posted on the open web, and were cached (or not) respecting open-web standards such as robots.txt. It does not contain any syllabi that were posted in private places, such as gated Blackboard installations. Indeed, I suspect that most of these syllabi come from universities where it is expected that professors post syllabi in an open fashion (as is the case here at Mason), or from professors like me who believe that openness is good for scholarship and teaching. But as with the Internet Archive, if you are the creator of a syllabus and really can’t sleep unless it is purged from this research database, contact me.

2) This database is provided as is and without support. I get enough email and unfortunately cannot answer questions. If you are appreciative, you can make a tax-free donation to the Center for History and New Media, for which you will receive a hug from me. The database is intended for non-commercial use of the type seen in my JAH article.

3) The database is an SQL dump consisting of 1.4 million rows. The columns are syllabiID (the Syllabus Finder’s unique identifier), url (web address of the syllabus at the time it was found), title (of the web page the syllabus was on), date_added (when it was added to the Syllabus Finder database), and chnm_cache (the HTML of the page on the date it was added). The database is 804 MB uncompressed. The corpus is heavily U.S.-centric because web pages were matched to English-language words, and for a time the Syllabus Finder only took pages from .edu domains (thus leaving out, e.g., .ac.uk URLs).

4) Because the Syllabus Finder was completely automated, some percentage of the 1.4 million documents are not syllabi (my best guess is about 20%). Most often these incorrect matches are associated course documents such as assignments, which are interesting in their own right. But some are oddball documents that just looked like syllabi to the algorithms. I have made no attempt to weed them out.
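For anyone importing the dump, a quick sanity check on the columns described in caveat 3 might look like the sketch below. It uses Python's built-in `sqlite3` module purely for illustration. The table name `syllabi` is an assumption (the post names only the columns), and the real dump may need converting before SQLite can read it.

```python
import sqlite3


def cache_coverage(conn, table="syllabi"):
    """Return (total_rows, rows_with_cached_html) for the given table.

    The table name is interpolated into the query, so pass only trusted names.
    """
    total, cached = conn.execute(
        "SELECT COUNT(*), "
        "SUM(CASE WHEN chnm_cache IS NOT NULL AND chnm_cache <> '' THEN 1 ELSE 0 END) "
        f"FROM {table}"
    ).fetchone()
    return total, cached or 0  # SUM() yields NULL on an empty table


def make_demo_db():
    """Build a tiny in-memory table with the same columns as the dump (invented rows)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE syllabi (syllabiID INTEGER PRIMARY KEY, url TEXT, "
        "title TEXT, date_added TEXT, chnm_cache TEXT)"
    )
    conn.executemany(
        "INSERT INTO syllabi (url, title, date_added, chnm_cache) VALUES (?, ?, ?, ?)",
        [
            ("http://example.edu/a", "Course A", "2002-10-01", "<html>A</html>"),
            ("http://example.edu/b", "Course B", "2005-03-12", ""),
            ("http://example.edu/c", "Course C", "2008-09-02", None),
        ],
    )
    return conn
```

Comparing the two numbers returned by `cache_coverage` shows at a glance how many rows actually carry page HTML.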

If you understand all of this clearly, then here’s a million syllabi for you: CHNM Syllabus Finder Corpus, Version 1.0 (30 March 2011) (265 MB download, zipped SQL file)

UPDATE 1 (11pm 3/30/11): Matt Burton has helpfully provided a torrent for this file. If you can, please use it instead of the direct download.

UPDATE 2 (9pm 3/31/11): Unfortunately I should have checked the exported database before posting. Version 1.0 does indeed have the URLs, titles, and dates of about 1.45 million syllabi but it is missing a majority of the HTML caches of those syllabi. I am working to recreate the full database, which will be much larger and more useful.

Comments

[…] for History and New Media at George Mason University released a really interesting dataset today: a million syllabi culled from the web, from […]

mcburton says:

Hey all,
I just got a torrent tracker up and running and am seeding from home. Spread the word & help seed if you can!
http://tweedpiratebay.appspot.com/static/chnm_syllabus_finder_corpus.torrent

Let me know if it breaks
@mcburton

What a great project, Dan. Thanks for making this open for everyone to play with!

Lev Manovich says:

Great!

I was thinking for a while how nice it will be to get lots of syllabi and then look at stats for books used, terms etc

thank you,

Lev

Jason Priem says:

You might want to check out:

Kousha, K., & Thelwall, M. (2008). Assessing the impact of disciplinary research on teaching: An automatic analysis of online syllabuses. Journal of the American Society for Information Science and Technology, 59(13), 2060-2069. doi:10.1002/asi.20920

They suggest inclusion in syllabi constitutes a neglected (and measurable) dimension of scholarly impact, at least in some fields. But they start from articles, and then do web searches to see if they’re in syllabi. With this dataset, you could approach it from the opposite direction, starting with syllabi.

[…] at George Mason University, hopes the repository of one million syllabi he posted today on his Web site will help fuel academic […]

Douglas Knox says:

Dan, this is wonderful, creative, innovative almost ten years ago and no less so today. The interest in your release of this as a data set is heartening. I read and accepted the warranty in downloading, and have no regrets. Reluctantly, though, I have to question a detail relating to orders of magnitude.

In the file available for download it looks more like just under 17,000 syllabi. There are indeed more than 1.4 million rows in the database, but for most of them the chnm_cache field is empty — anything with an ID number over 20,823, or anything harvested after 2002. I double-checked this with spot inspection of the unzipped SQL file. Is there more data that didn’t make it through export? If it were a million files, wouldn’t it be more like maybe 20-40 gigabytes? A syllabus is more likely to be 10K or more than it is to be just 1 kilobyte. Have I miscalculated somewhere?

Even on an “as is” basis, more than 16,000 syllabi are plenty already to be interesting. At least 370 “blink” tags in the service of higher education in 2002 are evidence of a near-forgotten world now.

[…] fun stuff on teaching this week. First, Dan Cohen released his Million Syllabi into the world. He released it as a .sql file (for obvious and good reasons), but it would be more useful to me […]

[…] Dan Cohen has just released a database of over one million course syllabi, gathered from the Internet between 2002 and 2009. The data is […]

[…] probably saw that Dan Cohen has released a million syllabi for text analysis and data-mining; at Snarkmarket, they’re having a […]

Alex Garcia says:

Dan,

Awesome data, but as Douglas said above – having the full 1,000,000 records would be even better :). Are there any hopes that you will publish the full database?

Btw, I see that there are a few PDFs in the mix, and I could not open a single one of them… Did they get damaged during export?

Alex

I can’t get the torrent file to open — Vuze gives me an error.

[…] Cohen releases a database of over a million academic syllabi automatically collected […]

Paul Dixon says:

Is there any update on this, or is the data too hard to recover cleanly?

Dan Cohen says:

@Paul: still working on it. Hoping to make some progress soon.

For a curriculum project, we worked on something similar specifically for African Studies in 2000. We didn’t set up a query, but found syllabi and entered URLs into a searchable database. Many of the links are dead, and of course, there were no resources to update this. Here is the link:
http://africa.berkeley.edu/academics/SyllabiSelector.php
I look forward to browsing your database.

[…] came across this link for over 1.4 million syllabi, as compiled by Dan Cohen, over at CHNM. Granted, he admits that as […]

Harpreet Singh says:

Dan, is the link to the syllabus finder tool broken? Where can I download the full 1 million syllabi? Thank you.

Dan Cohen says:

@Harpreet: For now, you can get the data set here. We are still working on getting the full text of the majority of the syllabi. Email me if you think you can help on that front.

[…] on the “Million Syllabi Project Hack-a-thon“, where “we explore new ways of using the million syllabi dataset gathered by Dan Cohen’s Syllabus Finder Tool” (from the web site). 10 years worth of […]

[…] enables student and instructor inputs and a data mining and visualization tool that draws on the Syllabus Finder database, the Internet Archive, and the Common Crawl tool and corpus to produce within-system and broad […]

[…] exponential increase in information and data it has enabled; Dan Cohen’s recent release of a million syllabi as a single searchable database is a case in point. Nowhere are the quantitative dimensions of this […]

[…] from various institutions, scraping the Web (with inspiration from Dan Cohen’s earlier Syllabus Finder project), and begging UNC’s Sakai people for data dumps. Then, while presenting on a Digital […]

[…] course and give it away under some type of Creative Commons licensing. There have been a variety of efforts to collect and publish syllabi, which might help researchers and intrepid faculty willing to mine […]

[…] not. Sometimes some digital tool or platform that seems like a wonderful thing fizzles, like Dan Cohen’s marvelous Syllabus Finder, R.I.P., but at least eventually something more robust comes along. Even commercial tools get […]

[…] project to attempt to gather syllabuses together. The syllabus data came primarily from a project in the early 2000s by Dan Cohen while at George Mason University. He scraped the web for links to […]

[…] the University of North Carolina-Chapel Hill, and Swarthmore College, built off the 2002-2009 “Million Syllabi” database created by Dan Cohen, the Executive Director of the Digital Public Library of […]
