[uf-new] Microformats parsing, in general (was: hAudio final draft)

Mon Jun 18 15:18:53 PDT 2007

(moving largely parsing discussion to microformats-dev, microformats-new
bcc'd)

On 6/18/07 2:27 PM, "Brian Suda" <brian.suda at gmail.com> wrote:

> On 6/18/07, Tantek Çelik <tantek at cs.stanford.edu> wrote:
>> This is likely to be precisely why we may need to solve this problem by
>> continuing the mfo discussion.
> 
> --- Part of the reason the MSO discussion died is because it didn't
                             MFO

> actually solve anything.

No, it helps abstract when to stop looking into a node for property values.
Full stop. Nothing more, nothing less.

>> If you look at the current known alternatives:
>> 
>> 1. require parsers to update whenever new nestable microformats are
>> introduced, and precisely define rules for handling known/common nesting
>> cases (to at a minimum avoid wasting time on straw-man arguments).
> 
> --- i do NOT like this alternative because it makes the assumption
> that you WANT the data to be two different things. For instance, if i
> have a URL as a child of hCard. Then the common parsing rules might
> say, when that hCard is a location of an hCalendar ignore the URL, but
> what happens when i WANT that URL to be part of the hCalendar - this
> leads to incorrect assumptions.

I assert that this case, "when you want the URL (of the hCard) to be part of
the hCalendar", is *way* less than 20% of cases.  If you think this is a real
issue, let's start with at least one concrete example you have seen where it
is true.

> I would rather let the PUBLISHER be as
> explicit as they want or not, rather than parsers attempt to
> interpret their intents.

I agree with that methodological statement, yet perhaps we are coming to two
different conclusions.

>> 2. add a new class name to indicate an encapsulation scope (e.g. "mfo") when
>> embedding
>>  - = one new class name, only in cases where nesting occurs.
> 
> --- The problem with MSO is something like the following:
                       MFO

> - hCalendar
> -- location (MSO)
> --- hcard
> ---- URL

<snip> 

This is a false strawman example.  MFO is only for root microformat class
names, not for arbitrary properties.

e.g. class="vcard mfo",  NOT class="location mfo".
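
To make the "when to stop looking" point concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions, not an agreed parsing algorithm: the helper names (class_names, value_of, collect_properties), the sample markup, and the example.org URLs are all hypothetical, and it assumes well-formed XHTML so the standard library XML parser can read it. An element carrying "mfo" can still be a property of its parent (here "location"), but the parser does not look inside it.

```python
import xml.etree.ElementTree as ET

def class_names(el):
    """Return the space-separated class names of an element as a list."""
    return el.get("class", "").split()

def value_of(el):
    """Illustrative simplification: links yield href, other elements yield text."""
    href = el.get("href")
    if href is not None:
        return href
    return " ".join("".join(el.itertext()).split())

def collect_properties(root, names, stop_class="mfo"):
    """Collect property values under root, never descending into stop_class nodes."""
    found = {}

    def walk(el):
        for child in el:
            cls = class_names(child)
            for name in cls:
                if name in names:
                    found.setdefault(name, []).append(value_of(child))
            if stop_class in cls:
                continue  # opaque nested microformat: stop looking into this node
            walk(child)

    walk(root)
    return found

# Hypothetical sample markup: an hCalendar event whose location is an hCard.
SNIPPET = """
<div class="vevent">
  <span class="summary">Microformats dinner</span>
  <a class="url" href="http://example.org/dinner">details</a>
  <div class="location vcard mfo">
    <span class="fn org">Example Cafe</span>
    <a class="url" href="http://example.org/cafe">cafe site</a>
  </div>
</div>
"""

event = ET.fromstring(SNIPPET.strip())
props = collect_properties(event, {"summary", "url", "location", "fn"})
```

In this sketch the event ends up with only its own "url"; the hCard's "url" and "fn" stay with the hCard because the walk stops at the "vcard mfo" element.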

second example snipped for same false assumption.

> From what i remember MSO didn't actually solve anything, it just
> created more problems. This is why IMHO it was never pursued any
> further than just a thought.

No, it wasn't pursued due to lack of time, and because it was a lower
priority than other pursuits.

>> 3. replicate/prefix property class names for each microformat e.g. audio-fn
>>  - = numerous new class names
>> 
>> It is pretty clear that #3 is the worst in terms of complexity (most new class
>> names) and that it would affect the most people (publishers).  So we
>> should seek to avoid #3 since that violates the principles the most.
> 
> --- each microformat can also define its parsing rules. For instance,
> hAtom only looks for rel-tag NOT inside an hentry. there is no reason
> that a media format can't define that an FN can ONLY be taken when it
> is NOT a child of an hCard, but then this limits the way people can
> publish.

These specific parsing rules are already part of the #1 option I mentioned.
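
As a rough sketch of what option #1 looks like inside a parser, the Python below keeps a table of known nesting cases. Everything here is an assumption made for illustration: the SKIP_INSIDE table, the find_property helper, the two example rules, and the sample markup are hypothetical and not taken from any spec, and the input is assumed to be well-formed XHTML.

```python
import xml.etree.ElementTree as ET

# Hypothetical rule table for option #1: for a given (container format, property),
# the nested root class names inside which that property must be ignored.
SKIP_INSIDE = {
    ("haudio", "fn"): {"vcard"},
    ("vevent", "url"): {"vcard"},
}

def find_property(root, fmt, prop):
    """Return values of prop under root, applying the known nesting rules for fmt."""
    blocked = SKIP_INSIDE.get((fmt, prop), set())
    values = []

    def walk(el):
        for child in el:
            cls = set(child.get("class", "").split())
            if cls & blocked:
                continue  # known nested microformat: its properties are not ours
            if prop in cls:
                values.append(" ".join("".join(child.itertext()).split()))
            walk(child)

    walk(root)
    return values

# Hypothetical sample markup: an audio item with an hCard nested inside it.
SNIPPET = """
<div class="haudio">
  <span class="fn">Some Track</span>
  <span class="contributor vcard"><span class="fn">Some Artist</span></span>
</div>
"""

audio = ET.fromstring(SNIPPET.strip())
with_rule = find_property(audio, "haudio", "fn")
without_rule = find_property(audio, "hnew", "fn")  # a format the table does not know
```

The second call shows the cost of this option: a container format that is missing from the table picks up the nested hCard's "fn" as well, so the table has to be updated whenever a new nestable microformat is introduced.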

>> #2 adds some incremental authoring complexity in some cases.
> 
> --- i am against MSO, it is un-needed, adds complexity and doesn't
> actually solve much.

Based on the misspelling and false strawman examples, I think you may be
against something that is not being proposed.

>> #1 is something that we can probably still do today, since the number of
>> microformats is small (a good reason to keep the overall number small), the
>> number of parser implementations is small, and parser implementers are both
>> involved in the community and able to update their code quite quickly
>> (cc'ing microformats-dev accordingly).
>> 
>> 
>> Therefore it is reasonable IMHO to:
>> 
>> Pursue #1 in the short term until we have solved #2 in the medium term.
> 
> --- i think this can be fixed without either of these options. If we
> spend the time actually examining real data in the wild, i think we
> will find that many of these theoretical issues will either disappear,

This is a good approach of course.

> or we will have some exact examples that we can further explore and
> encode the rules in the format itself rather than trying to work with
> any of the above options...

Hence I prefer pursuing #1 first as well.

> #1 doesn't sit well with me because it causes an exponential code
> growth and potential to introduce more and more bugs.

Not necessarily.  I don't accept the assertion that exponential code growth
is required.  I'm optimistic that the patterns that emerge will solve a lot
of this.

> Each format simply represents data, which can be divisible from each
> other. If there are hCards on the page, that is simply people data -
> no matter what it is nested in - i should be able to extract them
> independently of their scope.

Agreed.
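
For what it's worth, extracting hCards independently of their scope is the easy part. A minimal Python sketch, assuming well-formed XHTML and using hypothetical helper names (find_roots, first_text) and made-up sample markup:

```python
import xml.etree.ElementTree as ET

def find_roots(root, root_class="vcard"):
    """Return every element carrying root_class, however deeply it is nested."""
    return [el for el in root.iter()
            if root_class in el.get("class", "").split()]

def first_text(el, prop):
    """Return the text of the first element at or under el with class prop, or None."""
    for node in el.iter():
        if prop in node.get("class", "").split():
            return " ".join("".join(node.itertext()).split())
    return None

# Hypothetical sample markup: one hCard nested in an event, one standalone.
SNIPPET = """
<div>
  <div class="vevent">
    <span class="summary">Microformats dinner</span>
    <div class="location vcard"><span class="fn org">Example Cafe</span></div>
  </div>
  <div class="vcard"><span class="fn">Jane Example</span></div>
</div>
"""

page = ET.fromstring(SNIPPET.strip())
cards = find_roots(page)
names = [first_text(card, "fn") for card in cards]
```

Both hCards come out regardless of what they are nested in. The open question in this thread is the reverse direction: which properties of a nested hCard the enclosing microformat should treat as its own.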

> Introducing constraints i think makes
> things more complex, so i think this should be avoided.

In general, yes, we are trying to minimize complexity.

Sometimes it is difficult to avoid adding complexity *somewhere*, and thus
the key point in this discussion is where to put the necessary added
complexity.

Thanks,

Tantek