Category: APIs

Creating a Blog from Scratch, Part 4: Searching for a Good Search

It often surprises those who have never looked at server logs (the detailed statistics about a website) that a tremendous percentage of site visitors come from searches. In the case of the Center for History and New Media, this is a staggering 400,000 unique visitors a month out of about one million. Furthermore, many of these visitors ignore a website’s navigation and go right to the site search box to complete their quest for information. While I’m not a big fan of consultants that tell webmasters to sacrifice virtually everything for usability, I do feel that searching has been undervalued by digital humanities projects, in part because so much effort goes into digitization, markup, interpretation, and other time-consuming tasks. But there’s another, technical reason too: it’s actually very hard to create an effective search—one, for instance, that finds phrases as well as single words, that is able to rank matches well, and that is easy to maintain through software and server upgrades. In this installment of “Creating a Blog from Scratch” (for those who missed them, here are parts 1, 2, and 3) I’ll take you behind the scenes to explain the pluses and minuses of the various options for adding a search feature to a blog, or any database-driven website for that matter.

There are basically four options for searching a website that is generated out of a database: 1) have the database do it for you, since it already has indexing and searching built in; 2) install another software package on your server that spiders your site, indices it, and powers your search; 3) use an application programming interface (API) from Google, Yahoo, or MSN to power the search, taking search results from this external source and shoehorning them into your website’s design; 4) outsourcing the search entirely by passing search queries to Google, Yahoo, or MSN’s website, with a modifier that says “only search my site for these words.”

Option #1 seems like the simplest. Just create an SQL statement (a line of code in database lingo) that sends the visitor’s query to the database software—in the case of this blog, the popular MySQL—and have it return a list of entries that match the query. Unfortunately, I’ve been using MySQL extensively for five years now and have found its ability to match such queries less than adequate. First of all, until the most recent version of the MySQL it would not handle phrase searching at all, so you would have to strip quotation marks out of queries and fool the user into believing your site could do something that it couldn’t (that is, do a search like Google could). Secondly, I have found its indexing and ranking schemes to be far behind what you expect from a major search engine. Maybe this has changed in version 5, but for many years it seemed as if MySQL was using search principles from the early 1990s, where the number of times a word appeared on the page signified how well the page matched the query (rather than the importance of the place of each instance of the word on the page, or even better, how important the document was in the constellation of pages that contained that word). MySQL will return a fraction from 0 to 1 for the relevance of a match, but it’s a crude measure. I’m still not convinced, even with the major upgrades in version 5, that MySQL’s searching is acceptable for demanding users.

Option #2 is to install specialized search packages such as the open source ht://Dig on your server, point it to your blog (or website) and let it spider the whole thing, just as Google or Yahoo does from the outside. These software packages can do a decent job indexing and swiftly finding documents that seem more relevant than the rankings in MySQL. But using them obviously requires installing and maintaining another complicated piece of software, and I’ve found that spiders have a way of wandering beyond the parameters you’ve set for them, or flaking out during server upgrades. (Over the last few days, for instance, I’ve had two spiders request hundreds of posts from this blog that don’t exist. Maybe they can see into the future.) Anecdotally, I also think that the search results are better from commercial services such as Google or Yahoo.

I’ve become increasingly enamored of Option #3, which is to use APIs, or direct server-to-server communications, with the indices maintained by Google, Yahoo, or Microsoft. The advantage of these APIs is that they provide you with very high quality search results and query handling (at least for Google and Yahoo; MSN is far behind). Ranking is done properly, with the most important documents (e.g., blog posts that many other bloggers link to or that you have referenced many times on your own site) coming up first if there are multiple hits in the search results. And these search giants have far more sophisticated ways of handling phrase searches (even long ones) and boolean searches than MySQL. The disadvantage of APIs is that for some reason the indices made available to software developers are only a fraction the size of the main indices for these search engines, and are only updated about once a month. So visitors may not find recent material, or some material that is ranked fairly low, through API searches. Another possibility for Option #3 is to use the API for a blog search engine, rather than a broad search engine. For instance, Technorati has a blog-specific search API. Since Technorati automatically receives a ping from my Atom feed every time I post (via FeedBurner), it’s possible that this (or another blog search engine) will ultimately provide a solid API-based search.

I’ve been experimenting with ways of getting new material into the main Google index swiftly (i.e., within a day or two rather than a month or two), and have come up with a good enough solution that I have chosen Option #4: outsourcing the search entirely to Google, by using their free (though unfortunately ad-supported) site-specific search. With little fanfare, this year Google released Google Sitemaps, which provides an easy way for those who maintain websites, especially database-driven ones, to specify where all of their web pages are using an XML schema. (Spiders often miss web pages generated out of a database because there are often so many of them, and some of these pages may not be linked to.) While not guaranteeing that everything in your sitemap will be crawled and indexed, Google does say that it makes it easier for them to crawl your site more effectively. (By the way, Google’s recent acquisition of 5 percent of AOL seems to have been, at least ostensibly, very much about providing AOL with better crawls, thus upping the visibility of their millions of pages without messing with Google’s ranking schemes.) And—here’s the big news if you’ve made it this far—I’ve found that having a sitemap gets new blog posts into the main Google index extremely fast. Indeed, usually within 24 hours of submitting a new post Google downloads my updated sitemap (created automatically by a PHP script I’ve written), sees the new URL for the post, and adds it to its index. This means that I can very effectively use the Google’s main search engine for this blog, although because I’m not using the API I can’t format the results page to match the design of my site exactly.

One final note, and I think an important one for those looking to increase the visibility of their blog posts (or any web page created from a database) in Google’s search results: have good URLs, i.e., ones with important keywords rather than meaningless numbers or letters. Database-driven sites often have such poor URLs featuring an ugly string of variables, which is a shame, since server technology (such as Apache’s mod_rewrite) allows webmasters to replace these variables with more memorable words. Moreover, Google, Yahoo, and other search engines clearly favor keywords in URLs (very apparent when you begin to work with Google’s Web API), assigning them a high value when determining the relevance of a web page to a query. Some blog software automatically creates good URLs (like Blogger, owned by Google), while many other software packages do not—typically emphasizing the date of a post in the URL or the page number in the blog. For my own blogging software, I designed a special field in the database just for URLs, so I can craft a particularly relevant and keyword-laden string. Mod_rewrite takes care of the rest, translating this string into an ID number that’s retrieved by the database to generate the page you’re reading.

For many reasons, including making it accessible to alternative platforms such as audio browsers and cell phones, I wanted to generate this page in strict XHTML, unlike my old website, which had poor coding practices left over from the 1990s. Unfortunately, as the next post in this series details, I failed terribly in the pursuit of this goal, and this floundering made me think twice about writing my own blogging software when existing packages like WordPress will generate XHTML for you, with no fuss.

Part 5: What is XHTML, and Why Should I Care?

Alexa Web Search Platform Debuts

I’m currently working on an article for D-Lib Magazine explaining in greater depth how some of my tools that use search engine APIs work (such as the Syllabus Finder and H-Bot). These APIs, such as the services from Google and Yahoo, allow somewhat more direct access to mammoth web databases than you can get through these companies’ more public web interfaces. I thought it would be helpful for the article to discuss some of the advantages and drawbacks of these services, and was just outlining one of my major disappointments with their programming interfaces—namely, that you can’t run sophisticated text analysis on their servers, but have to do post-processing on your own server once you get a set of results back—when it was announced that Alexa released its Web Search Platform. The AWSP allows you to do just what I’ve been wanting to do on an extremely large (4 billion web page) corpus: scan through it in the same way that employees at Yahoo and Google can do, using advanced algorithms and manipulating as large a set of results as you can handle, rather than mere dozens of relevant pages. Here’s what’s notable about AWSP for researchers and digital humanists.

  • Yahoo and Google hobble their APIs by only including a subset of their total web crawl. They seem leery of giving the entire 8 billion pages (in the case of the Google index) to developers. My calculation is that only about 1 in 5 pages in the main Google index makes it into their API index. AWSP provides access to the full crawl on their servers, plus the prior crawl and any crawl in progress. This means that AWSP probably provides the largest dataset researchers can presently access, about 3 times larger than Google or Yahoo (my rough guess from using their APIs is that those datasets are only about 1.5 billion pages, versus about 4 billion for AWSP). It seems ridiculous that this could make a difference (do I really need 250 terabytes of text rather than 75?), but when you’re searching for low-ranking documents like syllabi it could make a big difference. Moreover, with at least two versions of every webpage, it’s conceivable you could write a vertical search engine to compare differences across time on the web.
  • They seem to be using a similar setup to the Ning web application environment to allow nonprogrammers to quickly create a specialized search by cloning a similar search that someone else has already developed. No deep knowledge of a programming language needed (possibly…stay tuned).
  • You can download entire datasets, no matter how large, something that’s impossible on Yahoo and Google. So rather than doing my own crawl for 600,000 syllabi—which broke our relatively high-powered server—you can have AWSP do it for you and then grab the dataset.
  • You can also have AWSP host any search engine you create, which removes a lot of the hassle of setting up a search engine (database software, spider, scripting languages, etc.).
  • OK, now the big drawback. As economists say, there’s no such thing as a free lunch. In the case of AWSP, their business model differs from the Google and Yahoo APIs. Google and Yahoo are trying to give developers just enough so that they create new and interesting applications that rely on but don’t compete directly with Google and Yahoo. AWSP charges (unlike Google and Yahoo) for use, though the charges seem modest for a digital humanities application. While a serious new search engine that would data-mine the entire web might cost in the thousands of dollars, my back of the envelope calculation is that it would cost less than $100 (that is, paid to Alexa, aside from the programming time) to reproduce the Syllabus Finder, plus about $100 per year to provide it to users on their server.

I’ll report more details and thoughts as I test the service further.

Do APIs Have a Place in the Digital Humanities?

Since the 1960s, computer scientists have used application programming interfaces (APIs) to provide colleagues with robust, direct access to their databases and digital tools. Access via APIs is generally far more powerful than simple web-based access. APIs often include complex methods drawn from programming languages—precise ways of choosing materials to extract, methods to generate statistics, ways of searching, culling, and pulling together disparate data—that enable outside users to develop their own tools or information resources based on the work of others. In short, APIs hold great promise as a method for combining and manipulating various digital resources and tools in a free-form and potent way.

Unfortunately, even after four decades APIs remain much more common in the sciences and the commercial realm—for example, the APIs provided by search behemoths Google and Yahoo—than in the humanities. There are some obvious reasons for this disparity. By supplying an API, the owners of a resource or tool generally bear most of the cost (on their taxed servers, in technical support and staff time) while receiving little or no (immediate) benefit. Moreover, by essentially making an end-run around the common or “official” ways of accessing a tool or project (such as a web search form for a digital archive), an API may devalue the hard work and thoughtfulness put into the more public front end for a digital project. It is perhaps unsurprising that given these costs even Google and Yahoo, which have the financial strength and personnel to provide APIs for their search engines, continue to keep these programs hobbled—after all, programmers can use their APIs to create derivative search engines that compete directly with Google’s or Yahoo’s results pages, with none of the diverting (and profitable) text advertising.

So why should projects in the digital humanities provide APIs, especially given their often limited (or nonexistent) funding compared to a Google or Yahoo? The reason IBM conceived APIs in the first place, and still today the reason many computer scientists find APIs highly beneficial, is that unlike other forms of access they encourage the kind of energetic and creative grass-roots and third-party development that in the long run—after the initial costs borne by the API’s owner—maximize the value and utility of a digital resource or tool. Motivated by many different goals and employing many different methodologies, users of APIs often take digital resources or tools in directions completely unforeseen by their owners. APIs have provided fertile ground for thousands of developers to experiment with the tremendous indices and document caches maintained by Google and Yahoo. New resources based on these APIs appear weekly, some of them hinting at new methods for digital research, data visualization techniques, and novel ways to data-mine texts and synthesize knowledge.

Is it possible—and worthwhile—for digital humanities projects to provide such APIs for their resources and tools? Which resources or tools would be best suited for an API, and how will the creators of these projects sustain such an additional burden? And are there other forms of access or interoperability that have equal or greater benefits with fewer associated costs?