[uf-discuss] Scraping or parsing?

Wed Feb 28 04:44:15 PST 2007

* When my user agent encounters a HTML document it can use various
pattern-matching rules to extract embedded data, e.g. it sees the
string "vevent" in an attribute and uses the conventions for hCalendar
to pull out the event details.

This heuristic-based approach to document interpretation is commonly
known as screenscraping [1].
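
To make the heuristic concrete, here is a minimal sketch in Python of the kind of sniffing described above. The names VeventSniffer and looks_like_hcalendar are made up for illustration, and checking only the class attribute is an assumption about where an agent would look; real consumers will differ.

```python
from html.parser import HTMLParser


class VeventSniffer(HTMLParser):
    """Hypothetical scraper: flags any element whose class attribute
    contains the token "vevent", regardless of what the document declares."""

    def __init__(self):
        super().__init__()
        self.found = False

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            # class is a whitespace-separated list of tokens
            if name == "class" and value and "vevent" in value.split():
                self.found = True


def looks_like_hcalendar(html):
    """Guess, purely from pattern matching, whether hCalendar data is present."""
    sniffer = VeventSniffer()
    sniffer.feed(html)
    return sniffer.found
```

Nothing here consults the document for a statement of intent; the agent simply guesses from a matching string.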

* When my user agent encounters a HTML document it can follow the HTML
specification, specifically the part on Meta Data Profiles [2]. If it
finds a URI corresponding to the hCalendar profile, it can use the
conventions for hCalendar (which are encoded in a machine-readable
form in the XMDP document at the profile URI) to pull out the event
details.

This deterministic approach to document interpretation is commonly
known as parsing [3].
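
By way of contrast, a rough sketch of the profile check in Python. It reads the profile attribute of the head element, which HTML 4.01 defines as a whitespace-separated list of URIs. The profile URI used is only the example location suggested later in this message, not a published one, and HeadProfileReader and declares_profile are names invented for the sketch. Fetching and interpreting the XMDP document itself is left out.

```python
from html.parser import HTMLParser

# Assumption: hypothetical profile URI, not (yet) a published location.
HCALENDAR_PROFILE = "http://microformats.org/profiles/hcalendar"


class HeadProfileReader(HTMLParser):
    """Collects the URIs listed in the profile attribute of <head>."""

    def __init__(self):
        super().__init__()
        self.profiles = []

    def handle_starttag(self, tag, attrs):
        if tag == "head":
            # profile holds one or more URIs separated by white space
            value = dict(attrs).get("profile") or ""
            self.profiles = value.split()


def declares_profile(html, profile_uri=HCALENDAR_PROFILE):
    """True only if the publisher has explicitly declared the profile."""
    reader = HeadProfileReader()
    reader.feed(html)
    return profile_uri in reader.profiles
```

A document whose body happens to contain class="vevent" but whose head declares no profile is not treated as hCalendar here; the decision rests on what the publisher stated, not on what the agent spotted.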

Being able to reliably parse documents means that there's a much
better chance of the publisher's intent being preserved. This point is
considerably more significant when the markup is likely to have some
machine-processing rather than being directly rendered to the user:
intelligent responses to mistakes are considerably more difficult for
computers. The preservation of the publisher's intent, their authorised
statements, is particularly important in an environment where
republication is not uncommon and provenance tracking often desirable.
With scraping the chain of authority is broken at the first link.

What's more, data extraction is easier and more efficient if there's a
profile URI in place. Once the head of the doc has been read, the
agent has all the information it needs on how to process the body. No
speculative comparisons between the body content and a list of known
attribute strings.

Right now the best that can be offered for most microformats is
calculated guesswork. What's needed to get beyond this is the minting
of reasonably stable profile URIs and the XMDP documents placed at
those URIs.

XMDP profiles have already been drafted for many of the microformats
(e.g. there's one for hCalendar at [4]).

I really don't understand the lack of activity from the core devotees*
of microformats on this. A minimal piece of server admin - publish the
existing profiles at e.g. http://microformats.org/profiles/hcalendar
and it's done. Yet it must be approaching a year since it was accepted
that profiles should have URIs [5]. What does this say of the
microformats process? OK, arguably microformats can solve 80% of the
embedded data problem without profile URIs. But ignoring the profile
part of the HTML spec makes a mockery of "based on existing
interoperable standards".

I really like Tantek's definition: "Microformats are the way to
publish and share information on the web with higher fidelity." [6].
Right now the conventions only offer a marginal improvement in
fidelity, because for the most part it's still just screenscraping.

Can someone please take this small step as soon as possible? I believe
it will make a huge difference in the long term.

Cheers,
Danny.

* I'd rather avoid the negative connotations of "cabal" ;-)

[1] http://en.wikipedia.org/wiki/Screen_scraping
[2] http://www.w3.org/TR/html401/struct/global.html#h-7.4.4.3
[3] http://en.wikipedia.org/wiki/Parsing
[4] http://dannyayers.com/microformats/hcalendar-profile
[5] http://microformats.org/wiki/profile-uris
[6] http://microformats.org/wiki/what-are-microformats

-- 

http://dannyayers.com