Friday, October 06, 2006 - Posts

Web Page Data Retrieval

I've been asked about the tool I used to gather the Terra Nova data in the previous post.

Basically, it uses the Internet Explorer object model to automate an instance of IE6.  I wrote a series of primitives to do things like navigate to a page, wait until the page has finished loading, enter data into a field, click a link, retrieve a value from the page and so on.  There were about 15 such primitive functions in all.  It was then fairly straightforward to write a scripting language on top of them so that I (or other users) could extract data from sites without resorting to programming.  I also added a scheduler so that scripts can run on a regular basis and update files overnight with data from various pharma, oil and energy sources.
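To give a flavour of how a scripting layer can sit on top of a handful of primitives, here is a minimal sketch in Python.  It is not FIDO's actual code or syntax: the command names (NAVIGATE, WAIT, ENTER, CLICK, GET ... AS), the `FakeBrowser` stand-in and the example URL are all invented for illustration, and a real version would drive the browser's object model instead of a dictionary.

```python
import shlex


class FakeBrowser:
    """Hypothetical stand-in for an automated browser; it only records calls."""

    def __init__(self, pages):
        self.pages = pages  # url -> {field name: value}
        self.url = None
        self.log = []

    def navigate(self, url):
        self.url = url
        self.log.append(("navigate", url))

    def wait(self):
        # A real primitive would poll until the page has finished loading.
        self.log.append(("wait",))

    def enter(self, field, text):
        self.pages[self.url][field] = text
        self.log.append(("enter", field, text))

    def click(self, link):
        self.log.append(("click", link))

    def get(self, field):
        return self.pages[self.url][field]


def run_script(script, browser):
    """Run a one-command-per-line script; return the values collected by GET."""
    results = {}
    for raw in script.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        verb, *args = shlex.split(line)  # shlex keeps "quoted text" together
        verb = verb.upper()
        if verb == "NAVIGATE":
            browser.navigate(args[0])
        elif verb == "WAIT":
            browser.wait()
        elif verb == "ENTER":
            browser.enter(args[0], args[1])
        elif verb == "CLICK":
            browser.click(args[0])
        elif verb == "GET":
            # Assumed form: GET <field> AS <result name>
            results[args[2]] = browser.get(args[0])
        else:
            raise ValueError(f"unknown command: {verb}")
    return results


SCRIPT = """
# made-up example: fetch one value from a made-up page
NAVIGATE http://example.test/prices
WAIT
ENTER symbol "TN"
CLICK "Show price"
GET price AS tn_price
"""

browser = FakeBrowser({"http://example.test/prices": {"price": "42.10"}})
print(run_script(SCRIPT, browser))
```

The point of the design is that the interpreter is just a dispatch loop: each script verb maps onto one primitive, so adding a new primitive means adding one more branch.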

The whole thing took a couple of days to write, and since it fetches data, it is called FIDO, meaning FIDO Is Data Oriented.  There is a very old tradition, sadly largely in abeyance, that when one writes a useful tool of which one is proud, it gets a self-referential or recursive acronym.  FIDO contributed nicely to my end-of-year review last year.

The particular question I was asked was how to specify which data to get from a page.  There are three tricks I use to do this, depending on the page:

  1. If the data is always the only, first, third or nth instance of a given style or class on a page, I build an XML object containing the page HTML and parse accordingly.
  2. If the data always appears in the same place on a page (useful for EIA tabular pages), I use IE's ability to export to Excel and retrieve the relevant cell location.  This is great if it works, but useless if the page varies.
  3. If the page varies and there might be an uncertain number of similarly formatted page elements, but the field always follows another field with a certain value (e.g. a title label), I use method one but retrieve all such elements and take the one that follows the label.
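Methods one and three can be sketched roughly as follows.  This is an illustration only, not the tool's code: it swaps Python's standard `html.parser` in for the XML object, the function names and the sample page are made up, and it assumes the label and its value carry the same class.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag; ignored when tracking nesting depth.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class ClassTextCollector(HTMLParser):
    """Collect the text of every element carrying a given class, in page order."""

    def __init__(self, wanted_class):
        super().__init__()
        self.wanted = wanted_class
        self.depth = 0  # > 0 while inside a matching element
        self.texts = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth:
            self.depth += 1  # nested tag inside a match
        elif self.wanted in classes:
            self.depth = 1
            self.texts.append("")

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.texts[-1] += data


def all_with_class(html, cls):
    parser = ClassTextCollector(cls)
    parser.feed(html)
    parser.close()
    return [text.strip() for text in parser.texts]


def nth_with_class(html, cls, n):
    """Method one: the nth (1-based) element of a given class."""
    return all_with_class(html, cls)[n - 1]


def value_after_label(html, cls, label):
    """Method three: fetch all such elements, take the one after the label."""
    texts = all_with_class(html, cls)
    return texts[texts.index(label) + 1]


# Made-up sample page for illustration.
PAGE = """
<table>
  <tr><td class="cell">Region</td><td class="cell">North</td></tr>
  <tr><td class="cell">Price</td><td class="cell">61.25</td></tr>
</table>
"""

print(nth_with_class(PAGE, "cell", 2))
print(value_after_label(PAGE, "cell", "Price"))
```

Method three keeps working if rows are added or removed above the label, which is exactly the case where a fixed position breaks.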

Almost everything can be done with a mixture of these approaches.
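For completeness, the fixed-position idea behind method two can be approximated without Excel.  The sketch below is an assumption-laden substitute, not the IE export itself: it flattens table rows into a grid using the standard `html.parser` and reads one cell by row and column.  The names and the sample figures are invented, and nested tables are not handled.

```python
from html.parser import HTMLParser


class TableGrid(HTMLParser):
    """Flatten every table row on the page into a list of rows of cell text."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self.in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th") and self.rows:
            self.rows[-1].append("")
            self.in_cell = True

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:
            self.rows[-1][-1] += data


def cell_at(html, row, col):
    """Return the text of the cell at a 0-based (row, col) position."""
    grid = TableGrid()
    grid.feed(html)
    grid.close()
    return grid.rows[row][col].strip()


# Made-up sample table for illustration.
REPORT = (
    "<table>"
    "<tr><th>Week</th><th>Stocks</th></tr>"
    "<tr><td>1</td><td>330.1</td></tr>"
    "<tr><td>2</td><td>328.7</td></tr>"
    "</table>"
)

print(cell_at(REPORT, 2, 1))
```

As with the Excel route, this is only as reliable as the page layout: one extra row at the top and the same coordinates return the wrong value.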

The highlighting I discussed is done with a five-line piece of JavaScript which colours the relevant fields in bright blue.  I wrote it as the client-side component of a server-side search routine for our Broadvision servers.  That's Broadvision, the horrible application server whose documentation is so opaque that being able to use it is at once very well-remunerated and barely worth it.