[uf-discuss] generic microformat parsing heuristics?

Tue Nov 8 13:49:38 PST 2005

On Nov 7, 2005, at 12:32 PM, David House wrote:
> On 07/11/05, Mark Pilgrim <pilgrim at gmail.com> wrote:
>
>> No, nor should any effort be expended in such a pursuit.  cf.
>> http://microformats.org/discuss/mail/microformats-discuss/2005-October/001175.html
>>  "We don't care about the general case."  This is just the general
>> case rearing its ugly head on the parsing side, instead of the
>> production side.
>
> I strongly disagree. I appreciate that not generalising is one of the
> fundamental principles of microformats, but I don't think it applies
> to building parsers.
>
> One of the main advantages of a generalised microformat parser is that
> it allows us to write less code when the next compound microformat
> comes out. First off we only had hCard and hCal. Now with things like
> hReview and hAtom coming in, don't you think it would be useful to
> have a decent base which we could just extend when the next
> microformat comes out?

We already have some of this:

* http://trac.labnotes.org/cgi-bin/trac.cgi/wiki/MicroParserRuby
* http://trac.labnotes.org/cgi-bin/trac.cgi/wiki/MicroParserPHP
* XPath
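
As a rough illustration of the XPath option (a sketch only, not taken from either MicroParser project): in full XPath 1.0 a class-token match is usually written as //*[contains(concat(' ', normalize-space(@class), ' '), ' vcard ')]. Python's standard xml.etree.ElementTree implements only a subset of XPath, so the sketch below uses the supported .//*[@class] step and does the token test in Python. The helper name find_by_class and the sample markup are made up for the example, and the input is assumed to be well-formed XHTML.

```python
import xml.etree.ElementTree as ET


def find_by_class(root, class_name):
    """Return descendants of root whose class attribute contains class_name.

    ElementTree supports only a subset of XPath, so the XPath step selects
    every element that has a class attribute and the token test is plain Python.
    """
    return [
        el for el in root.findall(".//*[@class]")
        if class_name in el.get("class").split()
    ]


# Hypothetical sample document, assumed to be well-formed XHTML.
doc = ET.fromstring(
    '<div>'
    '<div class="vcard"><span class="fn">Jane Doe</span>'
    '<a class="url" href="http://example.org/">site</a></div>'
    '<p class="vcardish">not a card</p>'
    '</div>'
)

cards = find_by_class(doc, "vcard")
names = [el.text for card in cards for el in find_by_class(card, "fn")]
print(len(cards), names)
```

Splitting the class attribute on whitespace matters here: a plain substring test would also match the unrelated class "vcardish".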

> The bare fact of the matter is that microformats (I'm talking solely
> about compound microformats throughout this email) have a lot in
> common with each other! Let's have a look at what a parser of any
> compound microformat would have to do:
>
> 1. Parse HTML.
> 2. Find elements with the base class.
> 3. Keep count of open elements so it can tell when we're out of the
> base-classed element.
> 4. Extract property-value pairs from some attributes (href on <a>s,
> title on <abbr> and so on) and from contents of elements.
> 5. Deal with other things like the type and value pseudo-properties,
> plural properties and so on.
> 6. Assemble the data in some kind of array, or straight to a specific
> output if it's only designed for one specific type of output (like
> X2V).
>
> Now, if we built a generic parser which allowed its individual
> extensions to execute some specialised code (like code to deal with
> implied N optimisation in hCard), the general parser could handle the
> above 6 steps. There's no reason why a robust generic microformat
> parser can't exist. It would be harder than writing one for just a
> single microformat, but not significantly.
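
The six steps above could be sketched roughly as follows. This is a minimal illustration in Python, not David's PHP parser, and everything named here (GenericParser, the properties mapping, the postprocess hook) is hypothetical. It assumes balanced markup, treats every property as plural, ignores property classes on the base element itself, and skips the type/value pseudo-properties from step 5. The postprocess callback stands in for format-specific code such as hCard's implied N optimisation, which is not implemented here.

```python
from html.parser import HTMLParser

# Elements that never get an end tag, so they must not affect the depth count.
VOID = {"br", "hr", "img", "input", "link", "meta"}


class GenericParser(HTMLParser):
    """Collect property values found inside elements carrying a base class."""

    def __init__(self, base_class, properties, postprocess=None):
        super().__init__()                      # step 1: HTMLParser does the parsing
        self.base_class = base_class
        self.properties = properties            # assumed shape: name -> "text" or "url"
        self.postprocess = postprocess          # hypothetical per-format hook
        self.items = []                         # step 6: finished items
        self.current = None                     # item being built, None outside one
        self.depth = 0                          # step 3: open elements in the item
        self.capture = []                       # stack of (names, depth, text chunks)

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if self.current is None:
            if self.base_class in classes:      # step 2: found the base class
                self.current, self.depth = {}, 1
            return
        if tag not in VOID:
            self.depth += 1
        names = [c for c in classes if c in self.properties]
        # Step 4: some values come from attributes rather than element content.
        if tag == "abbr" and attrs.get("title"):
            self._store(names, attrs["title"])
            names = []
        elif tag == "a" and attrs.get("href"):
            url_names = [n for n in names if self.properties[n] == "url"]
            self._store(url_names, attrs["href"])
            names = [n for n in names if n not in url_names]
        if names and tag not in VOID:
            self.capture.append((names, self.depth, []))

    def handle_data(self, data):
        for _, _, chunks in self.capture:       # feed text to every open capture
            chunks.append(data)

    def handle_endtag(self, tag):
        if self.current is None or tag in VOID:
            return
        if self.capture and self.capture[-1][1] == self.depth:
            names, _, chunks = self.capture.pop()
            self._store(names, " ".join("".join(chunks).split()))
        self.depth -= 1
        if self.depth == 0:                     # left the base-classed element
            item, self.current = self.current, None
            self.items.append(self.postprocess(item) if self.postprocess else item)

    def _store(self, names, value):
        for name in names:                      # every property is kept as a list
            self.current.setdefault(name, []).append(value)


# Hypothetical sample markup for trying the sketch out.
SAMPLE = """
<div class="vcard">
  <a class="url fn" href="http://example.org/">Jane Doe</a>
  <span class="tel">555-0100</span><br>
  <span class="tel">555-0101</span>
</div>
<p class="fn">outside any vcard</p>
"""

parser = GenericParser("vcard", {"fn": "text", "url": "url", "tel": "text"})
parser.feed(SAMPLE)
print(parser.items)
```

Under these assumptions a new compound format only needs a base class name, a property table and, where required, a postprocess function. Real-world markup with unclosed elements would break the depth counting and would need a more forgiving HTML parser underneath.
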
>
> The reason why I'm writing this is because I'm in the middle of
> writing a generic compound microformat parser (PHP). It's nearing
> completion actually, just dealing with specifics of plural properties.
> It's designed to be robust and easy to extend to things like hAtom.
> The experience has shown me that parsing is a lot less 80/20 than
> you'd think: because microformat authors are against things like the
> edge case, such things will never have to be parsed.
>
> All in all, a generic microformat parser can exist, has advantages
> over a specialised one,

We can answer that when we have one.

> and therefore should happen.

No one's saying don't build one; it just may not be useful, and may in
fact be a diversion.

-ryan
--
Ryan King
ryan at technorati.com