[uf-dev] python microformats parser
Phil Dawes
phil at phildawes.net
Mon Nov 14 13:26:41 PST 2005
Hi Dev list,
I've put together a simple python microformat parser for use in python
projects (including my own structured-data-aggregator thingy JAM*VAT[1]).
Download from here:
http://phildawes.net/microformats/microformatparser.html
Unpack and type:
> python microformatparser.py http://tantek.com/log/2005/10.html
and it prints:
vevent
dtstart : 2005-11-11
dtend : 2005-11-14
summary : Hackers conference
vevent
dtstart : 2006-02-27
dtend : 2006-03-04
summary : W3C Technical Plenary
location : Sofitel, Mandelieu, France
[..etc..]
vcard
url : http://tantek.com/
logo : http://tantek.com/icon80px.jpg
fn : Tantek Çelik
My original aim was to produce a microformat parser that would have a
good go at parsing any semantic xhtml (given a starting point - e.g.
class="vcard").
I actually got quite far with this (the jamvat demo installation[1]
currently runs this parser). The main problem was deciding when a
property is a parent of other properties -
e.g. the structure:
<address class="vcard" id="hcard">
  <a class="url fn" href="http://tantek.com/">
    <img src="/icon80px.jpg" class="logo" alt="">
    Tantek Çelik
  </a>
</address>
..could easily be interpreted by a naive parser as having the structure:
vcard {
  url : http://tantek.com/
  fn : {
    logo: http://tantek.com/icon80px.jpg
  }
}
instead of the more correct:
vcard {
  url : http://tantek.com/
  fn : Tantek Çelik
  logo: http://tantek.com/icon80px.jpg
}
Having bashed my head against this a bit, I decided to start from the
other direction: using a hardcoded schema, and then adding genericity
where possible. So this is a first stab at that schema-driven approach - it's
driven by a simple datastructure which tells it which properties to look
out for, and also which ones can be 'parents' of other properties.
Here's the structure in v0.1:
-----------------
vcardprops = MicroformatSchema(['fn','family-name', 'given-name',
'additional-name', 'honorific-prefix', 'honorific-suffix', 'nickname',
'sort-string','url','email','type','tel','post-office-box',
'extended-address', 'street-address', 'locality', 'region',
'postal-code', 'country-name', 'label', 'latitude', 'longitude', 'tz',
'photo', 'logo', 'sound', 'bday','title', 'role','organization-name',
'organization-unit','category', 'note','class', 'key', 'mailer', 'uid',
'rev'],['n','email','adr','geo','org','tel'])
veventprops = MicroformatSchema(
    ["summary","url","dtstart","dtend","location"],[])
SCHEMAS= {'vcard':vcardprops,'vevent':veventprops}
-----------------
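To make the schema-driven idea concrete, here is a minimal sketch of how a parser could use those two lists. It is not the code from microformatparser.py: the MicroformatSchema class below is a hypothetical stand-in with the same two-list constructor, SchemaDrivenParser and its attribute names are made up for illustration, the schema is cut down to a handful of properties, and it assumes Python 3's standard html.parser. Because the parent check comes first, a name that appears in both lists (like 'email' or 'tel') would always be treated as a parent here, which is a simplification.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class MicroformatSchema:
    # Hypothetical stand-in; the real class in microformatparser.py may differ.
    def __init__(self, props, parentprops):
        self.props = set(props)              # names recognised as properties
        self.parentprops = set(parentprops)  # names allowed to contain others


VOID_TAGS = {"img", "br", "hr", "input", "meta", "link"}  # never closed


class SchemaDrivenParser(HTMLParser):
    def __init__(self, rootclass, schema, baseurl=""):
        super().__init__()
        self.rootclass = rootclass
        self.schema = schema
        self.baseurl = baseurl
        self.items = []  # one dict per root element found
        self.stack = []  # one frame per open element inside a root

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if not self.stack:
            # Outside any root: only an element with the root class matters.
            if self.rootclass in classes:
                self.items.append({})
                self._push(tag, self.items[-1], [])
            return
        target = self.stack[-1]["target"]
        textprops = []
        for name in classes:
            if name in self.schema.parentprops:
                # Only declared parents open a nested structure.
                target = target.setdefault(name, {})
            elif name in self.schema.props:
                if tag == "a" and name == "url":
                    target[name] = urljoin(self.baseurl, attrs.get("href") or "")
                elif tag == "img":
                    target[name] = urljoin(self.baseurl, attrs.get("src") or "")
                else:
                    # Value is the element's text, filled in at the end tag.
                    textprops.append((target, name))
        self._push(tag, target, textprops)

    def _push(self, tag, target, textprops):
        if tag not in VOID_TAGS:
            self.stack.append(
                {"target": target, "textprops": textprops, "chunks": []})

    def handle_data(self, data):
        # Text belongs to every element that is currently open.
        for frame in self.stack:
            frame["chunks"].append(data)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.stack:
            return
        frame = self.stack.pop()
        text = " ".join("".join(frame["chunks"]).split())
        for target, name in frame["textprops"]:
            target[name] = text


# Cut-down schema for the example; the full lists are in the v0.1 structure.
vcardprops = MicroformatSchema(
    ["fn", "url", "logo", "given-name", "family-name"],
    ["n", "adr", "geo", "org"])

HTML = """
<address class="vcard" id="hcard">
<a class="url fn" href="http://tantek.com/">
<img src="/icon80px.jpg" class="logo" alt="">
Tantek Çelik
</a>
</address>
"""

parser = SchemaDrivenParser(
    "vcard", vcardprops, "http://tantek.com/log/2005/10.html")
parser.feed(HTML)
print(parser.items)
```

Since 'fn' is not listed as a parent, the nested img does not turn it into a container: url, fn and logo all end up as siblings in one dict, which is the "more correct" reading above.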
Hope this is of use to somebody!
Cheers,
Phil
[1] http://phildawes.net/jamvat/