[uf-discuss] generic microformat parsing heuristics?

Mon Nov 7 12:32:00 PST 2005

On 07/11/05, Mark Pilgrim <pilgrim at gmail.com> wrote:
> No, nor should any effort be expended in such a pursuit.  c.f.
> http://microformats.org/discuss/mail/microformats-discuss/2005-October/001175.html
>  "We don't care about the general case."  This is just the general
> case rearing its ugly head on the parsing side, instead of the
> production side.

I strongly disagree. I appreciate that not generalising is one of the
fundamental principles of microformats, but I don't think it applies
to building parsers.

One of the main advantages of a generalised microformat parser is that
it allows us to write less code when the next compound microformat
comes out. At first we only had hCard and hCalendar. Now with things like
hReview and hAtom coming in, don't you think it would be useful to
have a decent base which we could just extend when the next
microformat comes out?

The bare fact of the matter is that microformats (I'm talking solely
about compound microformats throughout this email) have a lot in
common with each other! Let's have a look at what a parser of any
compound microformat would have to do:

1. Parse HTML.
2. Find elements with the base class.
3. Keep count of open elements so it can tell when we're out of the
base-classed element.
4. Extract property-value pairs from some attributes (href on <a>s,
title on <abbr> and so on) and from contents of elements.
5. Deal with other things like the type and value pseudo-properties,
plural properties and so on.
6. Assemble the data in some kind of array, or straight to a specific
output if it's only designed for one specific type of output (like
X2V).
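
To make the six steps concrete, here is a minimal sketch in Python
(not the PHP parser this message describes). It uses the standard
library's html.parser. The names CompoundParser and parse_compound,
and the idea of passing in a set of known property names, are
assumptions made for illustration. The type and value
pseudo-properties from step 5 are left out, and the attribute rule
in step 4 is deliberately crude.

```python
from html.parser import HTMLParser

# Void elements never get an end tag, so they must not change the depth count.
VOID = {"area", "base", "br", "col", "embed", "hr", "img",
        "input", "link", "meta", "source", "track", "wbr"}


class CompoundParser(HTMLParser):          # step 1: parse HTML
    def __init__(self, base_class, properties):
        super().__init__()
        self.base_class = base_class
        self.properties = set(properties)  # hypothetical: known property names
        self.items = []       # step 6: one dict per base-classed element
        self.depth = 0        # step 3: open elements inside the current item
        self.open_props = []  # (depth, names, text parts) still being read

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if self.depth == 0:
            # Step 2: only an element carrying the base class starts an item.
            if self.base_class in classes and tag not in VOID:
                self.items.append({})
                self.depth = 1
            return
        if tag not in VOID:
            self.depth += 1
        names = [c for c in classes if c in self.properties]
        if not names:
            return
        # Step 4 (simplified): href on <a> and title on <abbr> win over text.
        if tag == "a" and attrs.get("href"):
            self._store(names, attrs["href"])
        elif tag == "abbr" and attrs.get("title"):
            self._store(names, attrs["title"])
        elif tag not in VOID:
            self.open_props.append((self.depth, names, []))

    def handle_data(self, data):
        # Text belongs to every property element that is currently open.
        for _, _, parts in self.open_props:
            parts.append(data)

    def handle_endtag(self, tag):
        if self.depth == 0 or tag in VOID:
            return
        if self.open_props and self.open_props[-1][0] == self.depth:
            _, names, parts = self.open_props.pop()
            self._store(names, "".join(parts).strip())
        self.depth -= 1  # reaching 0 means we have left the base-classed element

    def _store(self, names, value):
        # Step 5 (partly): every property is plural, so values go in lists.
        for name in names:
            self.items[-1].setdefault(name, []).append(value)


def parse_compound(html, base_class, properties):
    parser = CompoundParser(base_class, properties)
    parser.feed(html)
    parser.close()
    return parser.items


if __name__ == "__main__":
    sample = ('<div class="vcard"><span class="fn">Jane Doe</span>'
              '<a class="url" href="http://example.org/">site</a></div>')
    print(parse_compound(sample, "vcard", {"fn", "url"}))
```

Nothing in the class is specific to hCard: the base class and the
property names are the only inputs. The sketch does assume that
every non-void element is explicitly closed, so markup with omitted
end tags would throw the depth count off.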

Now, if we built a generic parser which allowed its individual
extensions to execute some specialised code (like code to deal with
implied N optimisation in hCard), the general parser could handle the
above 6 steps. There's no reason why a robust generic microformat
parser can't exist. It would be harder than writing one for just a
single microformat, but not significantly.
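
As a rough illustration of that split, the sketch below (Python, with
hypothetical names Format, implied_n and finish) describes each
microformat as data plus one optional hook that the generic code
calls on every parsed item. The implied-N rule shown is a simplified
stand-in that only handles a two-word fn; the property lists are
partial and only there for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, List

Item = Dict[str, List[str]]  # property name -> list of values


def no_op(item: Item) -> Item:
    """Default hook: formats with no special cases change nothing."""
    return item


@dataclass
class Format:
    base_class: str
    properties: FrozenSet[str]
    post_process: Callable[[Item], Item] = no_op  # format-specific code


def implied_n(item: Item) -> Item:
    """Simplified implied-N: split a two-word fn when no n is given."""
    if "n" in item or len(item.get("fn", [])) != 1:
        return item
    words = item["fn"][0].split()
    if len(words) != 2:
        return item
    result = dict(item)  # leave the caller's dict untouched
    result["given-name"] = [words[0]]
    result["family-name"] = [words[1]]
    return result


HCARD = Format("vcard", frozenset({"fn", "n", "url", "tel"}), implied_n)
HREVIEW = Format("hreview", frozenset({"summary", "rating", "description"}))


def finish(fmt: Format, raw_items: List[Item]) -> List[Item]:
    """Generic final step: run the format's hook over every parsed item."""
    return [fmt.post_process(item) for item in raw_items]
```

Under this arrangement, supporting a new compound microformat means
adding one more Format value, and a hook only if that format has
special cases of its own.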

I'm writing this because I'm in the middle of writing a generic
compound microformat parser (in PHP). It's nearing completion,
actually; I'm just dealing with the specifics of plural properties.
It's designed to be robust and easy to extend to things like hAtom.
The experience has shown me that parsing is a lot less 80/20 than
you'd think: because microformat authors deliberately avoid edge
cases, a parser never has to deal with them.

All in all, a generic microformat parser can exist, has advantages
over a specialised one, and therefore should happen.

--
-David House, dmhouse at gmail.com, http://xmouse.ithium.net