Categories
APIs Text Mining Web Services Wikis Yahoo

Wikipedia vs. Encyclopaedia Britannica for Digital Research

In a prior post I argued that the recent coverage of Wikipedia has focused too much on one aspect of the online reference source’s openness—the ability of anyone to edit any article—and not enough on another aspect of Wikipedia’s openness—the ability of anyone to download or copy the entire contents of its database and use it in virtually any way they want (with some commercial exceptions). I speculated that, as I discovered in my data-mining work with H-Bot, which uses Wikipedia in its algorithms, having an open and free resource such as this could be very important for future digital research—e.g., finding all of the documents about the first President Bush in a giant, untagged corpus on the American presidency. For a piece I’m writing for D-Lib Magazine, I decided to test this theory by pulling out significant keywords and phrases from matching articles in Wikipedia and the Encyclopaedia Britannica on George H. W. Bush to see if one was better than the other for this purpose. Which resource is better? Here are the unedited term lists, derived by running plain text versions of each article through Yahoo’s Term Extraction web service. Vote on which one you think is a better profile, and I’ll reveal which list belongs to which reference work later this week.

Article #1
president bush
saddam hussein
fall of the berlin wall
tiananmen square
thanksgiving day
american troops
manuel noriega
halabja
invasion of panama
gulf war
help
saudi arabia
united nations
berlin wall

Article #2
president george bush
george bush
mikhail gorbachev
soviet union
collapse
reunification of germany
thurgood marshall
union
clarence thomas
joint chiefs of staff
cold war
manuel antonio noriega
iraq
george
nonaggression pact
david h souter
antonio noriega
president george
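
For readers curious how lists like these are produced: Yahoo's Term Extraction service is a black box, but a crude frequency-based stand-in (my own sketch, not Yahoo's algorithm, which also extracts multi-word phrases like "fall of the berlin wall") conveys the basic technique of pulling salient terms out of plain text:

```python
import re
from collections import Counter

# A tiny illustrative stopword list; a real extractor would use a much larger one.
STOPWORDS = {"the", "of", "a", "an", "in", "on", "and", "to", "was",
             "he", "his", "for", "with", "that", "as", "by", "at"}

def top_terms(text, n=10):
    """Return the n most frequent non-stopword words in a plain-text article."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS)
    return [term for term, _ in counts.most_common(n)]
```

Run on either article's plain text, this yields a rough single-word profile; phrase detection and statistical weighting (e.g., TF-IDF against a reference corpus) are what separate a toy like this from a real term extractor.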

Categories
Google Privacy Search Text Mining

How Much Google Knows About You

As the U.S. Justice Department put pressure on Google this week to hand over their search records in a questionable pursuit of evidence for an overturned pornography law, I wondered: How much information does Google really know about us? Strangely, at nearly the same time an email arrived from Google (one of the Google Friends Newsletters) telling me that they had just launched Google Personal Search Trends. Someone in the legal department must not have vetted that email: Google Personal Search Trends reveals exactly how much they know about you. So, how much?

A lot. If you have a Google account (you have one if you have a software developer’s username, a Gmail account, or other Google service account), you can log in to your Personal Search Trends page and find out. I logged in, and even though I’ve never checked a box or filled out a consent form saying that I don’t mind if Google collects information about my search habits, there appeared a remarkable and slightly unsettling series of charts and tables about me and what I’m interested in.

You can discover not only your top 10 search phrases but also the top 10 sites you visit and the top 10 links you click on. Like Santa, Google knows when you are awake and when you are sleeping—amazingly, no searches for me between midnight and 6 AM ET over the past 12 months. And comparing my search habits with its vast database of users, Google Personal Search Trends tells me that I might also like to visit websites on RSS, Charles Dickens, Frankenstein, search engine optimization, and Virginia Tech football. (It’s very wrong about that last one, which I hope it only derives from my search terms and websites visited and not also from the IP address of my laptop in an office on the campus of a Virginia state university.)

Of course, you begin to wonder: wouldn’t someone else like to see this same set of charts and tables? Couldn’t they glean a tremendous amount of information about me? This disturbing feeling grows when you do some more investigation of what Google’s storing on your hard drive in addition to theirs. For instance, if you use Google’s Book Search, they know through a cookie stored on your computer which books you’ve looked at—as well as how many pages of each book (so they can block you from reading too much of a copyrighted book).

Seems like the time is ripe for Google to offer its users a similar deal to the one TiVo has had for years: If you want us to provide the “best” search experience—extras in addition to the basic web search such as personalized search results and recommendations based on what you seem to like—you must provide us with some identifying information; if you want to search the web without these extras, then so be it—we’ll only save your searches on a fully anonymous basis for our internal research. Surely when government entities and private investigators hear about Google Personal Search Trends, they’ll want to have a look. One suspects that in China and perhaps the United States too, someone’s already doing just that.

Categories
Education Technology

“Legal Cheating” in the Wall Street Journal

In a forthcoming article in the Chronicle of Higher Education, Roy Rosenzweig and I argue that the ubiquity of the Internet in students’ lives and advances in digital information retrieval threaten to erode multiple-choice testing, and much of standardized testing in general. A revealing article in this weekend’s Wall Street Journal shows that some schools are already ahead of the curve: “In a wireless age where kids can access the Internet’s vast store of information from their cellphones and PDAs, schools have been wrestling with how to stem the tide of high-tech cheating. Now some educators say they have the answer: Change the rules and make it legal. In doing so, they’re permitting all kinds of behavior that had been considered off-limits just a few years ago.” So which anything-goes schools are permitting this behavior, and what exactly are they doing?

The surprise is that it is actually occurring in the more rigorous and elite public and private schools, and they are allowing students to bring Internet-enabled devices into the exam room. Moreover, they are backed not by liberal education professors but by institutions such as the Bill and Melinda Gates Foundation and pragmatic observers of the information economy. As the WSJ (as well as Roy and I) points out, their argument parallels the one made for introducing calculators into mathematics education in the 1980s, which eventually led to the inclusion of those formerly taboo devices on the SAT in 1994, a move that few have since criticized. Today, if the Internet is one of the main tools workers use in a digital age, why not include it in test-taking? After all, asserts M.I.T. economist Frank Levy, it’s more important to be able to locate and piece together information about the World Bank than to know when it was founded. “This is the way the world works,” Harvard Director of Admissions Marlyn McGrath commonsensically notes.

Of course, the bigger question, only partially addressed by the WSJ article, is how the use of these devices will change instruction in fields such as history. From elementary through high school, such instruction has often been filled with the rote memorization of dates and facts, which are easily testable (and rapidly graded) on multiple-choice forms. But we should remember that the multiple-choice test is only a century old; there have been, and there will surely be again, more instructive ways to teach and test such rich disciplines as history, literature, and philosophy.

Categories
Amazon Audience Books Publishing

First Impressions of Amazon Connect

Having already succumbed to the siren’s song that prodded me narcissistically to create a blog, I had very little resistance left when Amazon.com emailed me to ask if I might like to join the beta of a program that allows authors to reach potential buyers and existing owners of their books by writing blog-like posts. Called “Amazon Connect,” this service will soon be made available to the authors of all of the books available for purchase on Amazon. Here are some notes about my experience joining the program (and how you can join if you’re an author), some thoughts about what Amazon Connect might be able to do, and some insider information about their upcoming launch.

First, the inside scoop. As far as I can tell, Amazon Connect began around Thanksgiving 2005 with a pilot that enlisted about a dozen authors. It has been slowly expanding since then but is still in beta, and a quiet beta at that. It’s unlikely you’ve seen an Amazon Connect section on one of their web pages. However, I recently learned from the Amazon Connect team that in early February the service will have its official launch, with a big publicity push.

After that point, each post an author makes will appear on the Amazon.com page for his or her book(s). I found out by writing a post of my own that this feature is actually already enabled, as you can see by looking at the page for Digital History (scroll down the page a bit to see my post).

But the launch will also entail a much more significant change—to the home page of Amazon.com itself, which is of course individualized for each user. Starting in February, on the home page of every Amazon user who has purchased your book(s), your posts will show up immediately. Since it’s unlikely that a purchaser of a book will return to that book’s buy page, this appearance on the Amazon home page is important: Authors will effectively gain the ability to send messages to a sizable number of their readers.

Since generally it has been impossible to compile a decent contact list for those who buy a specific book (unless you’re in the NSA or CIA), Amazon’s idea is intriguing. While Amazon Connect is clearly intended to sell more books, and the writing style they advocate is less than academic (“a conversational, first-person tone”), it’s remarkable to think that the author of a scholarly monograph might be able to reach a good portion of their audience this way. Indeed, I suspect that for authors of academic press books that might not sell hundreds of thousands of copies, the proportion of buyers who use Amazon is much higher than for popular books (since a higher percentage of popular books are sold at physical Barnes & Noble and Borders stores, and increasingly at Costco and Wal-Mart). Could Amazon Connect foster smaller communities of authors and readers for more esoteric topics?

If you are an author and would like to join the Amazon Connect beta in time for the February launch, here’s what you need to do:

1) First, you must have an Amazon account. If you already have one, go to the special Amazon Connect website, login, and claim your book(s) using the “Register Your Bibliography” link. This involves listing the contact info for your publisher, editor, publicist, or other third party that can verify that you are actually the author of the book(s) you list. About a week later you’ll get an email confirming that you have been verified.

2) Create a profile. You are required to upload a photo, write a short biography, and provide some other information about yourself (such as your email address) that you can choose to share with your audience (I didn’t fill a lot of this out, such as my favorite movies).

3) Once you’ve been added to the system, you can start writing posts. Good luck saying hello to your readers, and remember Amazon Connect rule #5: “No boring content”!

Categories
Academia Email Internet Technology Web

Data on How Professors Use Technology

Rob Townsend, the Assistant Director of Research and Publications at the American Historical Association and the author of many insightful (and often indispensable) reports about the state of higher education, writes with some telling new data from the latest National Study of Postsecondary Faculty (conducted by the U.S. Department of Education roughly every five years since 1987). Rob focused on several questions about the use of technology in colleges and universities. The results are somewhat surprising and thought-provoking.

Here are two relatively new questions, exactly as they are written on the survey form (including the boldface in the first question; more on that later), which you can download from the Department of Education website. “[FILL INSTNAME]” is obviously replaced in the actual questionnaire by the faculty member’s institution.

Q39. During the 2003 Fall Term at [FILL INSTNAME], did you have one or more web sites for any of your teaching, advising, or other instructional duties? (Web sites used for instructional duties might include the syllabus, readings, assignments, and practice exams for classes; might enable communication with students via listservs or online forums; and might provide real-time computer-based instruction.)

Q41: During the 2003 Fall Term at [FILL INSTNAME], how many hours per week did you spend
communicating by e-mail (electronic mail) with your students?

Using the Department of Education’s web service to create bar graphs from their large data set, Rob generated these two charts:

Rob points out that historians are on the low end of e-mail usage in the academy, though it seems not too far off from other disciplines in the humanities and social sciences. A more statistically significant number to get (and probably impossible using this data set) would be the time spent on e-mail per student, since the number of students varies widely among the disciplines. [Update: Within hours of this post Rob had crunched the numbers and came up with an average of 2 minutes per student for history instructors (2.8 hours of e-mail per week, or 168 minutes, divided by an average of 83 students).]

For me, the surprising chart is the first one, on the adoption of the web in teaching, advising, or other instructional duties. Only about a 5-10% rise in the use of the web from 1998 to 2003 for most disciplines, and a decline for English and Literature? This, during a period of enormous, exponential growth in the web, a period that also saw many institutions of higher education mandate that faculty put their syllabi on the Internet (often paying for expensive course management software to do so)?

I have two theories about this chart, with the possibility that both theories are having an effect on the numbers. First, I wonder if that boldfaced “you” in Q39 made a number of professors answer “no” if technically they had someone else (e.g., a teaching assistant or department staffer) put their syllabus or other course materials online. I did some further research after hearing from Rob and noticed that buried in the 1998 survey questionnaire was a slightly different wording, with no boldface: “During the 1998 Fall Term, did you have websites for any of the classes you taught?” Maybe those wordsmiths in English and Literature were parsing the language of the 2003 question a little too closely (or maybe they were just reading it correctly, unlike faculty members from the other disciplines).

My second theory is a little more troubling for cyber-enthusiasts who believe that the Internet will take over the academy in the next decade, fully changing the face of research and instruction. Take a look at this chart from the Pew Internet and American Life Project:

Note how after an initial surge in Internet adoption in the late 1990s the rate of growth has slowed considerably. A minority, small but significant, will probably never adopt the Internet as an important, daily medium of interaction and information. If we believe the Department of Education numbers, within this minority is apparently a sizable segment of professors. According to additional data extracted by Rob Townsend, it looks like this segment is about 16% of history professors and about 21% of English and Literature professors. (These are faculty members who in the fall of 2003 did not use e-mail or the web at all in their instruction.) Remarkably, among all disciplines about a quarter (24.2%) of the faculty fall into this no-tech group. Seems to me it’s going to be a long, long time before that number is reduced to zero.

Categories
Google History Search Text Mining

10 Most Popular History Syllabi

My Syllabus Finder search engine has been in use for three years now, and I thought it would be interesting to look back at the nearly half-million searches and 640,000 syllabi it has handled to see which syllabi have been the most popular. The following list was compiled by running a series of calculations to determine the number of times Syllabus Finder users glanced at a syllabus (had it turn up in a search), read a syllabus (actually went from the Syllabus Finder website to the website of the syllabus to do further reading), and “attractiveness” of a syllabus (defined as the ratio of full reads to mere glances). Here are the most popular history syllabi on the web.

#1 – U.S. History to 1870 (Eric Mayer, Victor Valley College, total of 6104 points)

#2 – America in the Progressive Era (Robert Bannister, Swarthmore College, 6000 points)

#3 – The American Colonies (Bruce Dorsey, Swarthmore College, 5589 points)

#4 – The American Civil War (Sheila Culbert, Dartmouth College, 5521 points)

#5 – Early Modern Europe (Andrew Plaa, Columbia University, 5485 points)

#6 – The United States since 1945 (Robert Griffith, American University, 5109 points)

#7 – American Political and Social History II (Robert Dykstra, University at Albany, State University of New York, 5048 points)

#8 – The World Since 1500 (Sarah Watts, Wake Forest University, 4760 points)

#9 – The Military and War in America (Nicholas Pappas, Sam Houston State University, 4740 points)

#10 – World Civilization I (Jim Jones, West Chester University of Pennsylvania, 4636 points)

This is, of course, a completely unscientific study. It obviously gives an advantage to older syllabi, since those courses have been online longer and thus could show up in search results for several years. On the other hand, the ten syllabi listed here range almost uniformly from 1998 to 2005.

Whatever its faults, the study does provide a good sense of the most visible and viewed syllabi on the web (high Google rankings help these syllabi get into a lot of Syllabus Finder search results), and I hope it provides a sense of the kinds of syllabi people frequently want to consult (or crib)—mostly introductory courses in American history. The variety of institutions represented is also notable (and holds true beyond the top ten; no domination by, e.g., Ivy League schools). I’ll probably do some more sophisticated analyses when I have the time; if there’s interest from this blog’s audience I’ll calculate the most popular history syllabi from 2005 courses, or the top ten for other topics. If you would like to read a far more elaborate (and scientific) data-mining study I did using the Syllabus Finder, please take a look at “By the Book: Assessing the Place of Textbooks in U.S. Survey Courses.”

[How the rankings were determined: 1 point was awarded for each time a syllabus showed up in a Syllabus Finder search result; 10 points were awarded for each time a Syllabus Finder user clicked through to view the entire syllabus; 100 points were awarded for each percent of “attractiveness,” where 100% attractive meant that every time a syllabus made an appearance in a search result it was clicked on for further information. For instance, the top syllabus appeared in 1211 searches and was clicked on 268 times (22.13% of the searches), for a point total of 1211 + (268 X 10) + (22.13 X 100) = 6104.]
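
The arithmetic above is easy to reproduce; here is the scoring rule as a short Python function (the rounding of the attractiveness term to the nearest whole point is my assumption about how the totals were computed):

```python
def syllabus_points(glances: int, clicks: int) -> int:
    """Score a syllabus: 1 point per appearance in a search result (glance),
    10 points per click-through (full read), and 100 points per percent of
    "attractiveness" (clicks as a percentage of glances)."""
    attractiveness = clicks / glances * 100  # percent of glances that became reads
    return glances + 10 * clicks + round(attractiveness * 100)

# The #1 syllabus: 1211 glances, 268 clicks (22.13% attractive)
print(syllabus_points(1211, 268))  # → 6104
```

Note how heavily the formula weights attractiveness: the ratio term contributes more than a third of the winner’s total, so a rarely seen but frequently clicked syllabus can outrank one that merely appears in many searches.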

Categories
Archives History News Preservation Web

Kojo Nnamdi Show Questions

Roy Rosenzweig and I had a terrific time on The Kojo Nnamdi Show today. If you missed the radio broadcast you can listen to it online on the WAMU website. There were a number of interesting calls from the audience, and we promised several callers that we would answer a couple of questions off the air; here they are.

Barbara from Potomac, MD asks, “I’m wondering whether new products that claim to help compress and organize data (I think one is called “C-Gate” [Kathy, an alert reader of this blog, has pointed out that Barbara probably means the giant disk drive company Seagate]) help out [to solve the problem of storing digital data for the long run]? The ads claim that you can store all sorts of data—from PowerPoint presentations and music to digital files—in a two-ounce standalone disk or other device.”

As we say in the book, we’re skeptical of using rare and/or proprietary formats to store digital materials for the long run. Despite the claims of many companies about new and novel storage devices, it’s unclear whether these specialized devices will be accessible in ten or a hundred years. We recommend sticking with common, popular formats and devices (at this point, probably standard hard drives and CD- or DVD-ROMs) if you want to have the best odds of preserving your materials for the long run. The National Institute of Standards and Technology (NIST) provides a good summary of how to store optical media such as CDs and DVDs for long periods of time.

Several callers asked where they could go if they have materials on old media, such as reel-to-reel or 8-track tapes, that they want to convert to a digital format.

You can easily find online some of the companies we mentioned that will (for a fee) transfer your old media onto new formats. Google for the media you have (e.g., “8-track tape”) along with the words “conversion services” or “transfer services.” I probably overestimated the cost for these services; most conversions will cost less than $100 per tape. However, the older the media the more expensive it will be. I’ll continue to look into places in the Washington area that might provide these services for free, such as libraries and archives.

Categories
Archives History Preservation Web

Digital History on The Kojo Nnamdi Show

From the shameless plug dept.: Roy Rosenzweig and I will be discussing our book Digital History: A Guide to Gathering, Preserving, and Presenting the Past on the Web this Tuesday, January 10, on The Kojo Nnamdi Show. The show is produced at Washington’s NPR station, WAMU. We’re on live from noon to 1 PM EST, and you’ll be able to ask us questions by phone (1-800-433-8850), via email (kojo@wamu.org), or through the web. The show will be replayed from 8-9 PM EST on Tuesday night, and syndicated via iTunes and other outlets as part of NPR’s terrific podcast series (look for The Kojo Nnamdi Show/Tech Tuesday). You’ll also be able to get the audio stream directly from the show’s website. I’ll probably answer some additional questions from the audience in this space.

Categories
Accessibility Blogs HTML Programming Software Standards Web

Creating a Blog from Scratch, Part 5: What is XHTML, and Why Should I Care?

In prior posts in this series (1, 2, 3, and 4), I described with some glee my rash abandonment of common blogging software in favor of writing my own. For my purposes there seemed to be some key disadvantages to these popular packages, including an overemphasis on the calendar (I just saw the definition of a blog at the South by Southwest Interactive Festival—”a page with dated entries”—which, to paraphrase Woody Allen, is like calling War and Peace “a book about Russia”), a sameness to their designs, and comments that are rarely helpful and often filled with spam. But one of the greatest advantages of recent blog software packages is that they generally write standards-compliant code. More specifically, blog software like WordPress automatically produces XHTML. Some of you might be asking, what is XHTML, and who cares? And why would I want to spend a great deal of effort ensuring that this blog complied strictly with this language?

The large digital library contingent that reads this blog could probably enumerate many reasons why XHTML compliance is important, but I had two reasons in mind when I started this blog. (Actually, I had a third, more secretive reason that I’ll mention first: Roy Rosenzweig and I argue in our book Digital History that XHTML will likely be critical for digital humanists to adhere to in the future—don’t want to be accused of being a hypocrite.) For those for whom web acronyms are Greek, XHTML is a sibling of XML, a more rigorously structured and flexible language than the HTML that underlies most of the web. XHTML is better prepared than HTML to be platform-independent; because it separates formatting from content, XHTML (like XML) can be reconfigured easily for very different environments (using, e.g., different style sheets). HTML, with formatting and content inextricably combined, for the most part assumes that you are using a computer screen and a web browser. Theoretically XHTML can be dynamically and instantaneously recast to work on many different devices (including a personal computer). This flexibility is becoming an increasingly important feature as people view websites on a variety of platforms (not just a normal computer screen, e.g., but cell phones or audio browsers for the blind). Indeed, according to the server logs for this blog, 1.6% of visitors are using a smart phone, PDA, or other means to read this blog, a number that will surely grow. In short, XHTML seems better prepared than regular HTML to withstand the technological changes of the coming years, and theoretically should be more easily preserved than older methods of displaying information on the web. For these and other reasons a 2001 report the Smithsonian commissioned recommended the institution move to XHTML from HTML.

Of course, with standards compliance comes extra work. (And extra cost. Just ask webmasters at government agencies trying to make their websites comply with Section 508, the mandatory accessibility rules for federal information resources.) Aside from a brief flirtation with the what-you-see-is-what-you-get, write-the-HTML-for-you program Dreamweaver in the late 1990s, I’ve been composing web pages using a text editor (the superb BBEdit) for over ten years, so my hands are used to typing certain codes in HTML, in the same way you get used to a QWERTY keyboard. XHTML is not that dissimilar from HTML, but it still has enough differences to make life difficult for those used to HTML: you have to remember to close every tag, and some attributes related to formatting are in strange new locations. One small example of the minor infractions I frequently trip up on writing XHTML: the oft-used break tag that adds a line to a web page must “close itself” with a slash before the end bracket (not <br>, but <br />). But I figured doing this blog would give me a good incentive to start writing everything in strict XHTML.
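
To make the strictness concrete, here is a small fragment that validates as XHTML 1.0: lowercase element names, quoted attribute values, every element closed, and empty elements such as br and img self-closed. The looser HTML habits (uppercase tags, unquoted attributes, a bare <br>) would all fail strict validation:

```html
<p class="intro">
  Every element is explicitly closed.<br />
  <img src="photo.jpg" alt="XHTML also requires an alt attribute on images" />
</p>
```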

Yeah, right. I clearly haven’t been paying enough attention to detail. The page you’re reading likely still has dozens of little coding errors that make it fail strict compliance with the World Wide Web Consortium’s XHTML standard. (If you would like a humbling experience that brings to mind receiving a pop quiz back from your third-grade teacher with lots of red ink on it, try the W3C’s XHTML Validator.) I haven’t had enough time to go back and correct all of those little missing slashes and quotation marks. WordPress users out there can now begin their snickering; their blog software does such mundane things for them, and many proudly (and annoyingly) display little “XHTML 1.0 compliant” badges on their sites. Go ahead, rub it in.

After I realized that it would take serious effort to bring my code up to code, so to speak, I sat back and did the only thing I could do: rationalize. I didn’t really need strict XHTML compliance because through some design sleight-of-hand I had already been able to make this blog load well on a wide range of devices. I learned from other blog software that if you put the navigation on the right rather than the more common left you see on most websites, the body of each post shows up first on a PDA or smart phone. It also means that blind visitors don’t have to suffer through a long list of your other posts before getting to the article they want to read.

As far as XHTML is concerned, I’ll be brushing up on that this summer. Unless I move this blog to WordPress by then.

Part 6: One Year Later

Categories
Blogs Google History Maps Mashups Tagging

Hurricane Digital Memory Bank Featured on CNN

I was interviewed yesterday by CNN about a new project at the Center for History and New Media, the Hurricane Digital Memory Bank, which uses digital technology to record memories, photographs, and other media related to Hurricanes Katrina, Rita, and Wilma. (CNN is going to feature the project sometime this week on its program The Situation Room.) The HDMB is a democratic historical project similar to our September 11 Digital Archive, which saved the recollections and digital files of tens of thousands of contributors from around the world; this time we’re trying to save thousands of perspectives on what occurred on the Gulf Coast in the fall of 2005. What amazes me is how the interest in online historical projects and collections has exploded recently. Several of the web projects I’ve co-directed over the last five years have engaged in collecting history online. But even a project with as prominent a topic as September 11 took a long time to be picked up by the mass media. This time CNN called us just a few weeks after we launched the website, and before we’d done any real publicity. Here are three developments from the last two years I think account for this sharply increased interest.

Technologies enabling popular writing (blogs) and image sharing (e.g., Flickr) have moved into the mainstream, creating an unprecedented wave of self-documentation and historicizing. Blogs, of course, have given millions of people a taste for daily or weekly self-documentation unseen since the height of diary use in the late nineteenth century. And it used to be fairly complicated to set up an online gallery of one’s photos. Now you can do it with no technical know-how whatsoever, and it’s become much easier for others to find these photos (partly due to tagging/folksonomies). The result is that millions of photographs are being shared daily and the general public is getting used to the instantaneous documentation of events. Look at what happened in the hours after the London subway bombings: within two days, photo-sharing sites held photographic documentation of the event that formerly would have taken archivists months or even years to compile.

New web services are making combinations of these democratic efforts at documentation feasible and compelling. Our big innovation for the HDMB is to locate each contribution on an interactive map (using the Google Maps API), which allows one to compare the experiences and images from one place (e.g. an impoverished parish in New Orleans) with another (e.g., a wealthier suburb of Baton Rouge). (Can someone please come up with a better word for these combinations than the current “mashups”?) Through the savvy use of unique Technorati or Flickr tags, a scattered group of friends or colleagues can now automatically associate a group of documents or photographs to create an instant collection on an event or issue.

The mass media has almost completely reversed its formerly antagonistic posture toward new media. CNN now has at least two dedicated “Internet reporters” who look for new websites and scan blogs for news and commentary—once disparaged as the last refuge of unpublishable amateurs. In the last year the blogosphere has actually broken several stories (e.g., the Dan Rather document scandal), and many journalists have started their own blogs. The Washington Post has just hired its first full-time blogger. Technorati now tracks over 24 million blogs; even if 99% of those are discussing the latest on TomKat (the celebrity marriage) or Tomcat (the Java servlet container), there are still a lot of new, interesting perspectives out there to be recorded for posterity.