[uf-discuss] Scraping or parsing?
Scott Reynen
scott at randomchaos.com
Tue Mar 6 06:31:01 PST 2007
On Mar 6, 2007, at 2:18 AM, Joe Andrieu wrote:
> Scott Reynen wrote:
>> On Mar 2, 2007, at 2:40 PM, Michael MD wrote:
>>
>>> I don't see how special cases where something has to be extracted
>>> in a different way are expressed in the profiles.
>>
>> Michael didn't see how that was expressed in profiles because it's
>> *not* expressed in the profiles. That doesn't mean profile URIs
>> aren't useful, just that they don't solve the problem of
>> communicating parsing instructions.
>
> Scott, I think different profile URIs do express "where something
> has to be extracted in a different way." Different profile URIs
> can mean different extraction rules.
>
> The rules are not actually in the profiles themselves, but the use
> of profile URIs does what Michael was asking about.
I think we're reading Michael's "different" as referring to two
different kinds of difference. I think Michael was looking for the
rules in the profiles themselves. Profile URIs do express that
something needs to be extracted in a different way *from another
profile* (what I'm calling disambiguation), but they don't express
differences *from default microformat parsing rules* (what I'm
calling parsing instructions).
For the specific example Michael asked about, rel-tag: the value of
most microformat properties is extracted from the node content, but
rel="tag" is a special case where the value should be extracted from
the last segment of the href attribute instead. That specific
difference is *not* machine-readable from a profile, and not just
because there is no profile for rel-tag. hCard's class="url" has a
similarly non-standard parsing rule (use the href attribute instead
of node content), and this difference is not machine-readable from
the profile either:
http://www.w3.org/2006/03/hcard
It's not even human-readable. There is nothing anywhere in the hCard
profile saying "url values should come from the href attribute,"
which is what I think Michael was looking for. The only place to
find that difference is in the referenced microformats wiki, where it
is only human-readable.
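To make the three extraction rules concrete, here is a rough Python
sketch, not a real microformats parser. The function names, the
example.com markup, and the choice to ignore a trailing slash are all
assumptions made for illustration. The point is that each rule has to
be written by hand from the human-readable spec:

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse, unquote

def default_value(el):
    # Default rule: the property value is the element's text content.
    return "".join(el.itertext()).strip()

def url_value(el):
    # hCard class="url" on a link: use the href attribute, not the text.
    return el.get("href")

def rel_tag_value(el):
    # rel="tag": use the last path segment of the href.
    # Assumption for this sketch: empty segments (trailing slash) are skipped.
    path = urlparse(el.get("href", "")).path
    segments = [s for s in path.split("/") if s]
    return unquote(segments[-1]) if segments else None

# Hypothetical markup, invented for the example.
card = ET.fromstring(
    '<div class="vcard">'
    '<span class="fn">Example Person</span> '
    '<a class="url" href="http://example.com/">home page</a> '
    '<a rel="tag" href="http://example.com/tags/micro%20formats">mf</a>'
    '</div>'
)

fn = default_value(card.find(".//*[@class='fn']"))
url = url_value(card.find(".//*[@class='url']"))
tag = rel_tag_value(card.find(".//a[@rel='tag']"))
```

Nothing in the markup or in a profile URI tells the code which of the
three functions applies to which property; that mapping lives only in
the parser author's head.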
> As I
> understand it, the profile itself need not even be dereferenced by
> consuming applications. In that way, it is more of an identifier
> than a locator.
Right. An identifier is useful for disambiguation. A locator would
be necessary for parsing instructions.
> And in fact, profile URIs are the only mechanism we have for
> version control.
Right, but version control only requires disambiguation, not parsing
instructions.
> So if parsing rules change with a new version, the
> only way a consuming app would know to apply the new/old parsing is
> because of the profile URI.
Sure, but that won't tell a parser what the new parsing rules
actually are, only that they've changed.
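As a sketch of what I mean (hypothetical names throughout; the rule
table and the second URI are made up for illustration), a consuming
app can use the profile URI purely as a lookup key into rules it
already has built in, without ever fetching the profile:

```python
HCARD_PROFILE = "http://www.w3.org/2006/03/hcard"

# Rules hard-coded by the parser author after reading the wiki.
# The table format here is invented for this sketch.
KNOWN_RULES = {
    HCARD_PROFILE: {"fn": "node-content", "url": "href-attribute"},
}

def rules_for(profile_uri):
    """Look up built-in rules by profile URI; the URI is never dereferenced."""
    try:
        return KNOWN_RULES[profile_uri]
    except KeyError:
        # The URI says "this is something else" but not what the rules are.
        raise LookupError("unrecognized profile: " + profile_uri)

hcard_rules = rules_for(HCARD_PROFILE)

try:
    # Hypothetical URI standing in for a future version of the profile.
    rules_for("http://example.org/profiles/hcard-next")
    new_version_known = True
except LookupError:
    new_version_known = False
```

The unknown URI is enough to tell the app not to apply the old rules
(disambiguation), but someone still has to add an entry to the table
by hand before it can parse anything.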
> For context, Michael's original question in the archive is at
> http://microformats.org/discuss/mail/microformats-discuss/2007-March/008891.html
Here's the part I believe indicates a desire for parsing rules, not
just disambiguation:
> (eg for rel-tag it needs to split the url in the href attribute and
> get the last part)
But Michael can, of course, better clarify for himself exactly what
he was looking for and not finding.
Peace,
Scott