posted on Friday, October 06, 2006 5:49 AM by Endie

Web Page Data Retrieval

I've been asked about the tool I used to gather the Terra Nova data in the previous post.

Basically, it uses the Internet Explorer object model to automate an instance of IE6.  I wrote a series of primitives to do things like navigate to a page; wait until the page has finished loading; enter data into a field; click a link; retrieve a value from the page; and so on.  There were about 15 such primitive functions in all.  It was then fairly straightforward to write a scripting language on top of them, so that I (or other users) could extract data from sites without resorting to programming.  I also added a scheduler to allow scripts to run on a regular basis and update files overnight with data from various pharma, oil and energy sources.
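As a rough illustration of the "primitives plus a small scripting language" idea, here is a minimal Python sketch. It is not FIDO's code or syntax: the command names (`navigate`, `enter`, `get`), the `FakeBrowser` class and the example URL are all hypothetical, and the browser is an in-memory stand-in rather than an automated IE instance, so the sketch can run anywhere.

```python
import shlex


class FakeBrowser:
    """Hypothetical stand-in for an automated browser; it records calls instead of driving IE."""

    def __init__(self, pages):
        self.pages = pages      # url -> {field name: value}
        self.current = {}       # fields of the page currently "loaded"
        self.log = []           # every primitive call, in order

    def navigate(self, url):
        self.log.append(("navigate", url))
        self.current = dict(self.pages.get(url, {}))

    def enter(self, field, text):
        self.log.append(("enter", field, text))
        self.current[field] = text

    def get(self, field):
        self.log.append(("get", field))
        return self.current.get(field)


def run_script(script, browser):
    """Run one command per line and return the values collected by 'get'."""
    commands = {
        "navigate": browser.navigate,
        "enter": browser.enter,
        "get": browser.get,
    }
    results = []
    for line_no, line in enumerate(script.splitlines(), start=1):
        # shlex handles quoted arguments and '#' comments for us.
        parts = shlex.split(line, comments=True)
        if not parts:
            continue
        name, args = parts[0], parts[1:]
        if name not in commands:
            raise ValueError(f"line {line_no}: unknown command {name!r}")
        value = commands[name](*args)
        if name == "get":
            results.append(value)
    return results


# Hypothetical script: one primitive per line, arguments separated by spaces.
SCRIPT = """
# fetch one value from a made-up page
navigate http://example.test/prices
enter search "heating oil"
get crude
"""

browser = FakeBrowser({"http://example.test/prices": {"crude": "59.76"}})
print(run_script(SCRIPT, browser))
```

The point of the design is that each script line maps onto exactly one primitive, so adding a command means adding one function and one dictionary entry.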

The whole thing took a couple of days to write, and it contributed nicely to my end-of-year review last year.  Since it fetches data, it is called FIDO, meaning FIDO Is Data Oriented.  There is a very old tradition, sadly largely in abeyance, that when one writes a useful tool of which one is proud, it gets a self-referential or recursive acronym.

The question I was particularly asked was how to specify what particular data to get from a page.  There are three tricks I use to do this, depending on the page:

  1. If the data is always the only, first, third or nth instance of a given style or class on a page, I build an XML object containing the page HTML and parse it accordingly.
  2. If the data always appears in the same place on a page (useful for EIA tabular pages), I use IE's ability to export to Excel and retrieve the relevant cell location.  This is great when it works, but useless if the page varies.
  3. If the page varies, and there might be an uncertain number of similarly formatted page elements, but the field always follows another field with a certain value (e.g. a title label), I use method one but retrieve all such elements and pick the one that follows the label.

Almost everything can be done with a mixture of these approaches.
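Tricks one and three can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses the standard library's `html.parser` rather than an XML object built from IE's page HTML, the class name `cell` and the sample page are made up, and it assumes the label and the value carry the same class.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag; ignored so nesting depth stays correct.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class ClassTextCollector(HTMLParser):
    """Collect the text of every element carrying a given class, in page order."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.texts = []
        self._depth = 0       # > 0 while inside a matching element
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self._depth:
            self._depth += 1
        elif self.css_class in classes:
            self._depth = 1
            self._buffer = []

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self._depth:
            return
        self._depth -= 1
        if self._depth == 0:
            self.texts.append("".join(self._buffer).strip())

    def handle_data(self, data):
        if self._depth:
            self._buffer.append(data)


def texts_with_class(html, css_class):
    parser = ClassTextCollector(css_class)
    parser.feed(html)
    parser.close()
    return parser.texts


def nth_with_class(html, css_class, n):
    """Trick one: the n-th (1-based) element with the class."""
    return texts_with_class(html, css_class)[n - 1]


def value_after_label(html, css_class, label):
    """Trick three: fetch all such elements, return the one that follows the label."""
    texts = texts_with_class(html, css_class)
    for current, following in zip(texts, texts[1:]):
        if current == label:
            return following
    return None


# Made-up sample page for illustration only.
PAGE = """
<table>
  <tr><td class="cell">Region</td><td class="cell">North</td></tr>
  <tr><td class="cell">Price</td><td class="cell">59.76</td></tr>
</table>
"""

print(nth_with_class(PAGE, "cell", 2))
print(value_after_label(PAGE, "cell", "Price"))
```

The label lookup keeps working if rows are added or reordered, which is exactly the case where a fixed position (trick one) or a fixed cell (trick two) breaks down.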

The highlighting I discussed is done with a five-line piece of JavaScript which colours the relevant fields in bright blue.  I wrote it as the client-side component of a server-side search routine for our Broadvision servers.  That's Broadvision, the horrible application server whose terrible documentation is so opaque that being able to use it is at once very well remunerated and barely worth it.

Comments

# re: Web Page Data Retrieval

Monday, October 09, 2006 11:34 AM by Buck
I have not needed to do this in a while but used to use the Webl and Rebol languages (and, under pressure, Perl) which made it pretty easy to grab and process pages.

Sadly Webl was a Compaq project that died when Compaq was taken over by HP (I still have the last beta). Rebol, by Carl Sassenrath (the Amiga OS designer), is still on the go and has expanded somewhat since I last visited the site. It still has free versions though (http://rebol.com).

# re: Web Page Data Retrieval

Tuesday, October 10, 2006 8:24 AM by Endie
Buck Godot?

And yes, you are right that there are tools and languages out there I *should* have used, that would have been a more efficient means of doing the job (although Perl has the disadvantage that I can't even remember or understand what my own code does from the one occasion I used it in anger). But this was an example of me wanting to try something and finding an excuse.

If I did it again, for instance, I would use it as an excuse to play with some Ruby.

I didn't know that Carl Sassenrath was still active in the field. The Amiga OS was a piece of art.

# re: Web Page Data Retrieval

Tuesday, October 10, 2006 1:28 PM by Buck Godot
Not a "My Language is better than..." comment, just of interest. Ruby looks very good, and Ruby on Rails appears to be squaring up to J2EE as well.