[uf-discuss] generic microformat parsing heuristics?

Charles Iliya Krempeaux supercanadian at gmail.com
Mon Nov 7 13:20:57 PST 2005


Hello,

On 11/7/05, Phil Dawes <phil at phildawes.net> wrote:
> Hi Tantek,
>
> Tantek Çelik wrote:
>
>  > Phil,
>  >
>  > Take a look at hCard parsing:
>  >
>  >  http://microformats.org/wiki/hcard-parsing
>  >
>  > Much of which is embodied there generalizes to other microformats.
>  >
>
> Excellent - many thanks.
>
> Out of interest, do you think that a generic microformats parser _can_
> be written?
> (e.g. something that could parse hcard, hcal et al out of xhtml without
> prior knowledge of their precise schemas?)

I'd say: no.

> I ask because we're starting to wonder about embedding our own internal
> microformats into webapps at work[1] (e.g. maybe for financial reference
> data), but we'd want to be able to use off-the-shelf generic tools to
> parse, aggregate and query the custom data.

To do what you want, you'd need a way of "marking" every Microformat
usage as being a Microformat, so you could easily "lift out" all the
Microformat nuggets.  For example, if you required every Microformat
to carry class="microformat", then you could just lift out every
element whose class includes "microformat".  (Note, though, that a
class attribute can list more than one name.  For example
class="microformat vcard".  So you'd match on the individual name,
not on the whole attribute value.)
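
To make the "lift out" idea concrete, here is a rough sketch in
Python.  It assumes the hypothetical class="microformat" marker from
above (which Microformats do NOT define), assumes the input is
well-formed XHTML without namespaces, and uses a made-up function
name, lift_marked.

```python
import xml.etree.ElementTree as ET

# Hypothetical marker class; no Microformat spec requires it.
MARKER = "microformat"


def lift_marked(xhtml, marker=MARKER):
    """Return every element whose class attribute lists `marker`."""
    root = ET.fromstring(xhtml)
    found = []
    for elem in root.iter():
        # class holds space-separated names, so compare name by name
        # instead of comparing the whole attribute value.
        if marker in elem.get("class", "").split():
            found.append(elem)
    return found


SAMPLE = """<div>
  <div class="microformat vcard"><span class="fn">Phil Dawes</span></div>
  <p class="note">not marked</p>
  <div class="microformats-like">different name, so no match</div>
</div>"""

hits = lift_marked(SAMPLE)
print([h.get("class") for h in hits])  # ['microformat vcard']
```

Splitting the attribute is what lets class="microformat vcard" match
while class="microformats-like" does not.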

But Microformats make no such requirement.  (That makes them easier
to author, but it means a parser needs knowledge of each Microformat
before parsing.)  Which is why you need to know about each and every
Microformat you want to deal with ahead of time.
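
Here is a rough sketch of what "knowing ahead of time" looks like in
practice.  The table below is illustrative and far from complete: it
lists a few class names from hCard ("vcard", "fn", "org", "tel") and
hCalendar ("vevent", "summary", "dtstart", "location").  The
"fx-rate" block is a made-up in-house format, and extract_known is a
made-up name.  It is nowhere near the real hCard parsing rules.

```python
import xml.etree.ElementTree as ET

# Illustrative, incomplete: root class name -> property class names.
KNOWN_FORMATS = {
    "vcard": {"fn", "org", "tel"},                 # hCard
    "vevent": {"summary", "dtstart", "location"},  # hCalendar event
}


def extract_known(xhtml):
    """Return (root_class, properties) for formats in KNOWN_FORMATS."""
    root = ET.fromstring(xhtml)
    results = []
    for elem in root.iter():
        for root_class in elem.get("class", "").split():
            props = KNOWN_FORMATS.get(root_class)
            if props is None:
                continue  # unknown class name: nothing says it is a format
            data = {}
            for child in elem.iter():
                for name in child.get("class", "").split():
                    if name in props:
                        text = "".join(child.itertext()).strip()
                        data.setdefault(name, text)
            results.append((root_class, data))
    return results


SAMPLE = """<div>
  <div class="vcard"><span class="fn">Phil Dawes</span></div>
  <div class="fx-rate"><span class="currency">CAD</span></div>
</div>"""

print(extract_known(SAMPLE))  # [('vcard', {'fn': 'Phil Dawes'})]
```

The "fx-rate" block is silently skipped, because nothing in the
markup itself says it is a Microformat.  You would have to add it to
the table first.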

(I guess it is somewhat analogous to HTML vs XHTML.  To parse XHTML,
you can use any XML parser.  And XML parsers are easy because you can
tell... without knowing anything about the XML application... which
elements have "beginning" and "ending" tags, and which elements have
just a single standalone tag.  But with HTML, and SGML in general, you
do NOT know which elements have "beginning" and "ending" tags and
which elements only have a single standalone tag without knowing
about the format ahead of time.)
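
A small sketch of that difference, using only the Python standard
library.  The names parses_as_xml, OpenElementTracker and
unclosed_after are made up for this example, and the tracker is a toy
that only follows open elements, not a real HTML parser.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

XHTML = "<p>one<br/>two</p>"  # the empty element announces itself
HTML = "<p>one<br>two</p>"    # nothing says <br> has no end tag


def parses_as_xml(markup):
    """True if a generic XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


class OpenElementTracker(HTMLParser):
    """Tracks open elements, given a set of tags known to be standalone."""

    def __init__(self, void_elements):
        super().__init__()
        self.void_elements = void_elements
        self.open_elements = []

    def handle_starttag(self, tag, attrs):
        if tag not in self.void_elements:
            self.open_elements.append(tag)

    def handle_endtag(self, tag):
        if self.open_elements and self.open_elements[-1] == tag:
            self.open_elements.pop()


def unclosed_after(markup, void_elements):
    """Return the elements still considered open after parsing."""
    tracker = OpenElementTracker(void_elements)
    tracker.feed(markup)
    tracker.close()
    return tracker.open_elements


print(parses_as_xml(XHTML))          # True
print(parses_as_xml(HTML))           # False
print(unclosed_after(HTML, set()))   # ['p', 'br']
print(unclosed_after(HTML, {"br"}))  # []
```

Without being told that "br" is standalone, the tracker never sees
the paragraph close.  With that one piece of format knowledge, it
does.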


See ya

--
     Charles Iliya Krempeaux, B.Sc.

     charles @ reptile.ca
     supercanadian @ gmail.com

     developer weblog: http://ChangeLog.ca/
___________________________________________________________________________
 Never forget where you came from

