A Million Syllabi

Today I’m releasing a database of over a million syllabi gathered by my Syllabus Finder tool from 2002 to 2009. My hope is that this unique corpus will be helpful for a broad range of researchers. I’m fairly sure this is the largest collection of syllabi ever gathered, probably by several orders of magnitude.

I created the Syllabus Finder in 2002 when Google released their first API to access their search engine. The initial API included the ability to grab cached HTML from millions of web pages, which I realized could then be scanned using high-relevancy keywords to identify pages that were most likely syllabi. In addition to my lousy PHP code that got it up and running, the brilliant Simon Kornblith wrote some additional code to make it work well. The result was a tool that was quite popular (1.3 million queries) until Google deprecated their original API in 2009 in favor of (what I consider to be) a less useful API. (With the original API you could basically clone google.com, which I’m sure was not popular at the Googleplex.)
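To make the keyword idea concrete, here is a minimal Python sketch of scoring a cached page by weighted keywords. It is not the Syllabus Finder's actual code (that was PHP). The keyword list, the weights, the threshold, and the function names are all assumptions made up for illustration.

```python
import re

# Hypothetical keyword weights; the real Syllabus Finder's list is not published here.
KEYWORDS = {
    "syllabus": 3,
    "office hours": 2,
    "required texts": 2,
    "grading": 1,
    "week 1": 1,
}


def syllabus_score(html):
    """Sum the weights of the keywords that appear in the page's visible text."""
    text = re.sub(r"<[^>]+>", " ", html).lower()  # crude tag stripping
    text = re.sub(r"\s+", " ", text)              # collapse whitespace
    return sum(weight for kw, weight in KEYWORDS.items() if kw in text)


def looks_like_syllabus(html, threshold=4):
    """Treat a page as a likely syllabus once its score reaches the (assumed) threshold."""
    return syllabus_score(html) >= threshold


page = "<html><h1>HIST 120 Syllabus</h1><p>Office hours: Tue 2-4</p><p>Grading policy</p></html>"
print(syllabus_score(page), looks_like_syllabus(page))
```

A scheme this simple will also match assignments and other course documents that share the same vocabulary, so some false positives are to be expected.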

If you are interested in the kind of research that can be done on these syllabi, please read my Journal of American History article “By the Book: Assessing the Place of Textbooks in U.S. Survey Courses.” For that article I used regular expressions to pull book titles out of a thousand American history surveys to see how textbooks and other works are used by instructors. Some hidden elements emerged. I’m excited to see what creative ideas other scholars and researchers come up with for this large database.
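As a rough illustration of the regular-expression approach, the Python sketch below pulls titles out of reading-list lines shaped like "Author, Title (Publisher, Year)" and tallies them across syllabi. The pattern, the function name, and the sample syllabi (which use placeholder authors and titles) are assumptions for the example, not the expressions used for the article. Real syllabi cite books in many more formats than this.

```python
import re
from collections import Counter

# Assumed citation shape: "Author Name, Book Title (Publisher, 2001)" at the start of a line.
CITATION = re.compile(
    r"^\s*(?P<author>[A-Z][\w.\- ]+?),\s+"
    r"(?P<title>[^()\n]+?)\s+"
    r"\((?P<publisher>[^,()\n]+),\s*(?P<year>\d{4})\)",
    re.MULTILINE,
)


def count_titles(syllabi):
    """Count how often each matched book title appears across a list of syllabus texts."""
    counts = Counter()
    for text in syllabi:
        for match in CITATION.finditer(text):
            counts[match.group("title").strip()] += 1
    return counts


# Placeholder syllabi with invented authors, titles, and publishers.
sample = [
    "Required texts:\n"
    "Jane Doe, A Sample History of the Republic (Example Press, 2001)\n"
    "John Roe, Documents for the Survey (Placeholder Books, 1998)\n"
    "Office hours: Tuesday 2-4",
    "Books:\nJane Doe, A Sample History of the Republic (Example Press, 2001)\n",
]
for title, n in count_titles(sample).most_common():
    print(n, title)
```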

Some important clarifications and caveats:

1) I’m providing this archive in the same spirit (and under the same regulations) that the Internet Archive provides web corpora (indeed, this corpus could probably be recreated from the Internet Archive’s Wayback Machine, albeit after a lot of work). To the best of my knowledge, and because of the way they were obtained, all of the documents this database contains were posted on the open web, and were cached (or not) respecting open-web standards such as robots.txt. It does not contain any syllabi that were posted in private places, such as gated Blackboard installations. Indeed, I suspect that most of these syllabi come from universities where it is expected that professors post syllabi in an open fashion (as is the case here at Mason), or from professors like me who believe that openness is good for scholarship and teaching. But as with the Internet Archive, if you are the creator of a syllabus and really can’t sleep unless it is purged from this research database, contact me.

2) This database is provided as is and without support. I get enough email and unfortunately cannot answer questions. If you are appreciative, you can make a tax-free donation to the Center for History and New Media, for which you will receive a hug from me. The database is intended for non-commercial use of the type seen in my JAH article.

3) The database is an SQL dump consisting of 1.4 million rows. The columns are syllabiID (the Syllabus Finder’s unique identifier), url (web address of the syllabus at the time it was found), title (of the web page the syllabus was on), date_added (when it was added to the Syllabus Finder database), and chnm_cache (the HTML of the page on the date it was added). The database is 804 MB uncompressed. The corpus is heavily U.S.-centric because web pages were matched to English-language words, and for a time the Syllabus Finder only took pages from .edu domains (thus leaving out, e.g., .ac.uk URLs).

4) Because the Syllabus Finder was completely automated, some percentage of the 1.4 million documents are not syllabi (my best guess is about 20%). Most often these incorrect matches are associated course documents such as assignments, which are interesting in their own right. But some are oddball documents that just looked like syllabi to the algorithms. I have made no attempt to weed them out.

If you understand all of this clearly, then here’s a million syllabi for you: CHNM Syllabus Finder Corpus, Version 1.0 (30 March 2011) (265 MB download, zipped SQL file)
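Once the dump is loaded, a quick sanity check is to count how many rows actually carry cached HTML. The Python sketch below uses an in-memory SQLite table as a stand-in. The column names come from the description above, but the table name `syllabi`, the demo rows, and the use of SQLite rather than whatever database the dump targets are assumptions.

```python
import sqlite3


def build_demo_db():
    """Create a tiny in-memory stand-in for the corpus (table name is hypothetical)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE syllabi ("
        "syllabiID INTEGER PRIMARY KEY, url TEXT, title TEXT, "
        "date_added TEXT, chnm_cache TEXT)"
    )
    rows = [  # invented example rows, not real corpus data
        (1, "http://example.edu/hist120", "HIST 120", "2002-10-01", "<html>Syllabus</html>"),
        (2, "http://example.edu/engl201", "ENGL 201", "2003-02-14", ""),
        (3, "http://example.edu/biol101", "BIOL 101", "2004-09-03", "<html>Week 1</html>"),
    ]
    conn.executemany("INSERT INTO syllabi VALUES (?, ?, ?, ?, ?)", rows)
    return conn


def cache_coverage(conn):
    """Return (total_rows, rows_with_nonempty_cache)."""
    total, with_cache = conn.execute(
        "SELECT COUNT(*), "
        "SUM(CASE WHEN chnm_cache IS NOT NULL AND chnm_cache <> '' THEN 1 ELSE 0 END) "
        "FROM syllabi"
    ).fetchone()
    return total, with_cache or 0


total, with_cache = cache_coverage(build_demo_db())
print(f"{with_cache} of {total} rows have cached HTML")
```

The same SELECT should work with little or no change against the real table once it is imported, provided the table name is adjusted to match the dump.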

UPDATE 1 (11pm 3/30/11): Matt Burton has helpfully provided a torrent for this file. If you can, please use it instead of the direct download.

UPDATE 2 (9pm 3/31/11): Unfortunately I should have checked the exported database before posting. Version 1.0 does indeed have the URLs, titles, and dates of about 1.45 million syllabi but it is missing a majority of the HTML caches of those syllabi. I am working to recreate the full database, which will be much larger and more useful.

March 30, 2011

Archives, Pedagogy, Text Mining

33 responses to “A Million Syllabi”

A Snarkmarket mini-collaboration: Snarksyllabi « Snarkmarket

March 30, 2011

[…] for History and New Media at George Mason University released a really interesting dataset today: a million syllabi culled from the web, from […]

mcburton

March 30, 2011

Hey all,
I just got a torrent tracker up and running and am seeding from home. Spread the word & help seed if you can!
http://tweedpiratebay.appspot.com/static/chnm_syllabus_finder_corpus.torrent

Let me know if it breaks
@mcburton

Syllabi Data Mining « Jonathan Tregear

March 31, 2011

[…] A Million Syllabi http://www.dancohen.org/2011/03/30/a-million-syllabi/ […]

Brian Croxall

March 31, 2011

What a great project, Dan. Thanks for making this open for everyone to play with!

Lev Manovich

March 31, 2011

Great!

I had been thinking for a while how nice it would be to get lots of syllabi and then look at stats for books used, terms, etc.

thank you,

Lev

Jason Priem

March 31, 2011

You might want to check out:

Kousha, K., & Thelwall, M. (2008). Assessing the impact of disciplinary research on teaching: An automatic analysis of online syllabuses. Journal of the American Society for Information Science and Technology, 59(13), 2060-2069. doi:10.1002/asi.20920

They suggest inclusion in syllabi constitutes a neglected (and measurable) dimension of scholarly impact, at least in some fields. But they start from articles, and then do web searches to see if they’re in syllabi. With this dataset, you could approach it from the opposite direction, starting with syllabi.

New Million-Syllabi Repository Could Reveal Trends in Teaching – Wired Campus – The Chronicle of Higher Education

March 31, 2011

[…] at George Mason University, hopes the repository of one million syllabi he posted today on his Web site will help fuel academic […]

Douglas Knox

March 31, 2011

Dan, this is wonderful, creative, innovative almost ten years ago and no less so today. The interest in your release of this as a data set is heartening. I read and accepted the warranty in downloading, and have no regrets. Reluctantly, though, I have to question a detail relating to orders of magnitude.

In the file available for download it looks more like just under 17,000 syllabi. There are indeed more than 1.4 million rows in the database, but for most of them the chnm_cache field is empty — anything with an ID number over 20,823, or anything harvested after 2002. I double-checked this with spot inspection of the unzipped SQL file. Is there more data that didn’t make it through export? If it were a million files, wouldn’t it be more like maybe 20-40 gigabytes? A syllabus is more likely to be 10K or more than it is to be just 1 kilobyte. Have I miscalculated somewhere?

Even on an “as is” basis, more than 16,000 syllabi are plenty already to be interesting. At least 370 “blink” tags in the service of higher education in 2002 are evidence of a near-forgotten world now.

New Million-Syllabi Repository Could Reveal Trends in Teaching « The EdTech News Blog

April 1, 2011

[…] at George Mason University, hopes the repository of one million syllabi he posted today on his Web site will help fuel academic […]

Friday Quick Hits and Varia « The New Archaeology of the Mediterranean World

April 1, 2011

[…] fun stuff on teaching this week. First, Dan Cohen released his Million Syllabi into the world. He released it as a .sql file (for obvious and good reasons), but it would be more useful to me […]

Weekly News Roundup | MindShift

April 1, 2011

[…] Dan Cohen has just released a database of over one million course syllabi, gathered from the Internet between 2002 and 2009. The data is […]

Weekend Reading: Carnival Edition – ProfHacker – The Chronicle of Higher Education

April 1, 2011

[…] probably saw that Dan Cohen has released a million syllabi for text analysis and data-mining; at Snarkmarket, they’re having a […]

Ed-Tech Weekly News Roundup | Hack Education

April 2, 2011

[…] Dan Cohen has just released a database of over one million course syllabi, gathered from the Internet between 2002 and 2009. The data is […]

Alex Garcia

April 3, 2011

Dan,

Awesome data, but as Douglas said above – having the full 1,000,000 records would be even better :). Are there any hopes that you will publish the full database?

Btw, I see that there are a few PDFs in the mix, and I could not open a single one of them… Did they get damaged during export?

Alex

Brett Boessen

April 3, 2011

I can’t get the torrent file to open — Vuze gives me an error.

This is Something…

April 4, 2011

[…] for History and New Media at George Mason University released a really interesting dataset today: a million syllabi culled from the web, from […]

Euromachs Blog » Blog Archive » Web Readings Weekly Roundup

April 5, 2011

[…] A Million Syllabi […]

Recent Linkage 12 « Signifying Media

April 8, 2011

[…] Cohen releases a database of over a million academic syllabi automatically collected […]

Paul Dixon

May 9, 2011

Is there any update on this, or is the data too hard to recover cleanly?

Dan Cohen

May 10, 2011

@Paul: still working on it. Hoping to make some progress soon.

Martha Saavedra

May 26, 2011

For a curriculum project, we worked on something similar specifically for African Studies in 2000. We didn’t set up a query, but found syllabi and entered URLs into a searchable database. Many of the links are dead, and of course, there were no resources to update this. Here is the link:
http://africa.berkeley.edu/academics/SyllabiSelector.php
I look forward to browsing your database.

A million syllabi « My History 511 Blog Site

February 10, 2012

[…] came across this link for over 1.4 million syllabi, as compiled by Dan Cohen, over at CHNM. Granted, he admits that as […]

Harpreet Singh

February 21, 2012

Dan, is the link to the syllabus finder tool broken? Where can I download the full 1 million syllabi? Thank you.

Dan Cohen

February 22, 2012

@Harpreet: For now, you can get the data set here. We are still working on getting the full text of the majority of the syllabi. Email me if you think you can help on that front.

Learning from other people – Academic Summer Camp (except in winter???) « Nick Falkner

June 2, 2012

[…] on the “Million Syllabi Project Hack-a-thon“, where “we explore new ways of using the million syllabi dataset gathered by Dan Cohen’s Syllabus Finder Tool” (from the web site). 10 years worth of […]

Craft and Joseph-Nicholas named first DIL/IAH Faculty Fellows – Digital Innovation Lab

November 20, 2012

[…] enables student and instructor inputs and a data mining and visualization tool that draws on the Syllabus Finder database, the Internet Archive, and the Common Crawl tool and corpus to produce within-system and broad […]

Craft and Joseph-Nicholas named first DIL/IAH Faculty Fellows Carolina Digital Humanities Initiative

November 20, 2012

[…] enables student and instructor inputs and a data mining and visualization tool that draws on the Syllabus Finder database, the Internet Archive, and the Common Crawl tool and corpus to produce within-system and broad […]

Burnable Books | Medieval Studies in the Age of Big Data: A serial forum

December 13, 2012

[…] exponential increase in information and data it has enabled; Dan Cohen’s recent release of a million syllabi as a single searchable database is a case in point. Nowhere are the quantitative dimensions of this […]

new semester, new project | the ivi project: inquire, visualize, innovate

August 23, 2013

[…] from various institutions, scraping the Web (with inspiration from Dan Cohen’s earlier Syllabus Finder project), and begging UNC’s Sakai people for data dumps. Then, while presenting on a Digital […]

Free is better. Why I’m giving away my course. | A better train wreck.

June 2, 2014

[…] course and give it away under some type of create commons licensing. There have been a variety of efforts to collect and publish syllabi, which might help researchers and intrepid faculty willing to mine […]

Embracing ephemerality in the digital humanities | history, CLASS

February 2, 2016

[…] not. Sometimes some digital tool or platform that seems like a wonderful thing fizzles, like Dan Cohen’s marvelous Syllabus Finder, R.I.P., but at least eventually something more robust comes along. Even commercial tools get […]

More Than a Million Syllabuses at Your Fingertips – Artificial Intelligence Online

August 4, 2016

[…] project to attempt to gather syllabuses together. The syllabus data came primarily from a project in the early 2000s by Dan Cohen while at George Mason University. He scraped the web for links to […]

Sharing Syllabi: What’s Gained, What Challenges Remain | After Class

October 12, 2018

[…] the University of North Carolina-Chapel Hill, and Swarthmore College, built off the 2002-2009 “Million Syllabi” database created by Dan Cohen, the Executive Director of the Digital Public Library of […]

