[uf-discuss] generic microformat parsing heuristics?

Phil Dawes phil at phildawes.net
Mon Nov 7 05:42:25 PST 2005

Hi All,

I've recently been playing with microformats a bit and have added some 
basic hCard and hCalendar parsing to my structured data aggregator 
program JAM*VAT[1] (enough to parse Tantek's 
http://tantek.com/log/2005/10.html page). Unfortunately this is proving 
much more complicated than I originally thought, and I was wondering if 
there is a bigger picture that I'm missing.

So my question is:
Is there a set of heuristics that can be employed to parse all of the 
microformats generically, or at least to get reasonable results?

I ask this because JAM*VAT is able to employ some basic heuristics[2] to 
parse pretty much any data-oriented XML format into a set of reasonable 
semantic statements (JAM*VAT uses a very simple scheme for representing 
semantic statements[3]). I'd like to be able to do something similar 
with semantic XHTML.
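
To make the question concrete, here is a rough sketch of the kind of 
generic heuristic I mean. It is only an illustration under assumptions 
of my own, not JAM*VAT's actual code: every element with a class 
attribute becomes a statement, an element that contains other classed 
elements is treated as a subject, and a classed element with no classed 
descendants becomes a property whose value is its text. The names 
ClassTripleParser and class_triples, the "nodeN" identifiers and the 
(subject, property, value) tuple layout are all hypothetical.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag; skipped so the stack stays balanced.
VOID = {"br", "img", "hr", "input", "meta", "link"}


class ClassTripleParser(HTMLParser):
    """Naive heuristic: turn class attributes into (subject, property, value)."""

    def __init__(self):
        super().__init__()
        self.stack = []    # one entry per open element: a dict, or None if unclassed
        self.triples = []
        self.counter = 0

    def _nearest(self):
        # Closest open ancestor that carries a class attribute, if any.
        for node in reversed(self.stack):
            if node is not None:
                return node
        return None

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        classes = (dict(attrs).get("class") or "").split()
        node = None
        if classes:
            self.counter += 1
            parent = self._nearest()
            if parent is not None:
                parent["has_children"] = True
            node = {"id": f"node{self.counter}", "classes": classes,
                    "text": [], "has_children": False, "parent": parent}
        self.stack.append(node)

    def handle_data(self, data):
        # Text counts towards every open classed element.
        for node in self.stack:
            if node is not None:
                node["text"].append(data)

    def handle_endtag(self, tag):
        # Assumes well-formed XHTML, so the end tag matches the top of the stack.
        if tag in VOID or not self.stack:
            return
        node = self.stack.pop()
        if node is None:
            return
        parent = node["parent"]
        subject = parent["id"] if parent is not None else "document"
        if node["has_children"]:
            value = node["id"]  # container: point at the new subject
        else:
            value = " ".join("".join(node["text"]).split())  # leaf: collapsed text
        for name in node["classes"]:
            self.triples.append((subject, name, value))


def class_triples(markup):
    parser = ClassTripleParser()
    parser.feed(markup)
    parser.close()
    return parser.triples
```

Fed a small hCard-like fragment such as 
<div class="vcard"><span class="fn">Jane Doe</span></div>, this yields 
one statement for the fn property and one linking the document to the 
vcard node. It deliberately ignores values carried in attributes (for 
example the href of a link), which is exactly the sort of per-format 
detail that makes me wonder whether a generic approach can work.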

Many thanks,


[1] http://phildawes.net/jamvat/
[3] http://tagtriples.sf.net/

