[uf-dev] python microformats parser

Phil Dawes phil at phildawes.net
Mon Nov 14 13:26:41 PST 2005


Hi Dev list,

I've put together a simple Python microformat parser for use in Python 
projects (including my own structured-data-aggregator thingy JAM*VAT[1]).

Download from here:

http://phildawes.net/microformats/microformatparser.html

Unpack and type:

 > python microformatparser.py http://tantek.com/log/2005/10.html

and it prints:

vevent
     dtstart : 2005-11-11
     dtend : 2005-11-14
     summary : Hackers conference
vevent
     dtstart : 2006-02-27
     dtend : 2006-03-04
     summary : W3C Technical Plenary
     location : Sofitel, Mandelieu, France

[..etc..]

vcard
     url : http://tantek.com/
     logo : http://tantek.com/icon80px.jpg
     fn : Tantek Çelik

My original aim was to produce a microformat parser that would have a 
good go at parsing any semantic XHTML, given a starting point (e.g. 
class="vcard").

I actually got quite far with this (the jamvat demo installation[1] 
currently runs that earlier parser). The main problem was deciding when 
a property is a parent of other properties. For example, the structure:

         <address class="vcard" id="hcard">
             <a class="url fn" href="http://tantek.com/">
                 <img src="/icon80px.jpg" class="logo" alt="">
                 Tantek Çelik
             </a>
         </address>

..could easily be interpreted by a naive parser as having the structure:

  vcard {
    url : http://tantek.com/
    fn : {
        logo : http://tantek.com/icon80px.jpg
    }
  }

instead of the more correct:

  vcard {
    url : http://tantek.com/
    fn : Tantek Çelik
    logo : http://tantek.com/icon80px.jpg
  }

Having bashed my head against this a bit, I decided to start from the 
other direction: use a hardcoded schema, then add genericity where 
possible. This is a first stab at that approach. It's driven by a simple 
data structure which tells it which properties to look out for, and also 
which ones can be 'parents' of other properties. Here's the structure in 
v0.1:

-----------------

# First list: the properties to look out for.
# Second list: the properties that can be 'parents' of other properties.
vcardprops = MicroformatSchema(
    ['fn', 'family-name', 'given-name', 'additional-name',
     'honorific-prefix', 'honorific-suffix', 'nickname', 'sort-string',
     'url', 'email', 'type', 'tel', 'post-office-box', 'extended-address',
     'street-address', 'locality', 'region', 'postal-code', 'country-name',
     'label', 'latitude', 'longitude', 'tz', 'photo', 'logo', 'sound',
     'bday', 'title', 'role', 'organization-name', 'organization-unit',
     'category', 'note', 'class', 'key', 'mailer', 'uid', 'rev'],
    ['n', 'email', 'adr', 'geo', 'org', 'tel'])

veventprops = MicroformatSchema(
    ["summary", "url", "dtstart", "dtend", "location"], [])

SCHEMAS = {'vcard': vcardprops, 'vevent': veventprops}

-----------------
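
To illustrate the idea, here is a rough Python 3 sketch of how a 
structure like this could drive a parser. It is not the code from the 
download: the body of MicroformatSchema, the URL_PROPS set, the 
SchemaParser class and the parse() helper are all made up for this 
example, the schema is cut down to a handful of properties, and it 
assumes well-nested markup. The point is only that a property not listed 
as a 'parent' never gets to own the properties found inside it.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class MicroformatSchema:
    """Hypothetical stand-in: known property names, plus those that may nest."""

    def __init__(self, props, parent_props):
        self.props = set(props)
        self.parent_props = set(parent_props)


# Cut-down schema for this example only, not the full v0.1 list.
SCHEMAS = {"vcard": MicroformatSchema(["fn", "url", "logo", "locality"], ["adr"])}
URL_PROPS = {"url", "logo", "photo"}   # assumption: value is taken from href/src
VOID_TAGS = {"img", "br", "hr", "input", "meta", "link"}   # never closed


class SchemaParser(HTMLParser):
    def __init__(self, base_url=""):
        super().__init__()
        self.base_url = base_url
        self.results = []           # (format name, property dict) pairs
        self.scope = (None, None)   # (schema, dict currently being filled)
        self.open = []              # one frame per open non-void element
        self.sinks = []             # text properties still collecting data

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        frame = {"tag": tag, "scope": self.scope, "sinks": 0}
        schema, target = self.scope
        for name in (attrs.get("class") or "").split():
            if name in SCHEMAS:                       # start of a microformat
                schema, target = SCHEMAS[name], {}
                self.results.append((name, target))
            elif schema is not None and name in schema.parent_props:
                target[name] = {}                     # allowed to nest
                target = target[name]
            elif schema is not None and name in schema.props:
                # Plain property: recorded in the current dict, never a parent.
                link = attrs.get("href") or attrs.get("src")
                if name in URL_PROPS and link:
                    target[name] = urljoin(self.base_url, link)
                elif tag not in VOID_TAGS:
                    self.sinks.append((target, name, []))
                    frame["sinks"] += 1
        if tag not in VOID_TAGS:
            self.scope = (schema, target)
            self.open.append(frame)

    def handle_data(self, data):
        for _, _, parts in self.sinks:
            parts.append(data)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.open or self.open[-1]["tag"] != tag:
            return                                    # ignore stray end tags
        frame = self.open.pop()
        for _ in range(frame["sinks"]):
            target, name, parts = self.sinks.pop()
            target[name] = " ".join("".join(parts).split())   # tidy whitespace
        self.scope = frame["scope"]


def parse(html, base_url=""):
    parser = SchemaParser(base_url)
    parser.feed(html)
    parser.close()
    return parser.results


HCARD = """
<address class="vcard" id="hcard">
    <a class="url fn" href="http://tantek.com/">
        <img src="/icon80px.jpg" class="logo" alt="">
        Tantek Çelik
    </a>
</address>
"""

results = parse(HCARD, base_url="http://tantek.com/log/2005/10.html")
for name, props in results:
    print(name, props)
```

Because 'fn' is in the property list but not in the parent list, the 
logo inside the <a> element ends up as a sibling of fn under vcard 
rather than nested beneath it. A property such as 'adr', which is listed 
as a parent, would get its own nested dict instead.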

Hope this is of use to somebody!

Cheers,

Phil

[1] http://phildawes.net/jamvat/


