[uf-discuss] Scraping or parsing?

Scott Reynen scott at randomchaos.com
Tue Mar 6 06:31:01 PST 2007


On Mar 6, 2007, at 2:18 AM, Joe Andrieu wrote:

> Scott Reynen wrote:
>> On Mar 2, 2007, at 2:40 PM, Michael MD wrote:
>>
>>> I don't see how special cases where something has to be extracted
>>> in a different way are expressed in the profiles.
>>
>> Michael didn't see how that was expressed in profiles because it's
>> *not* expressed in the profiles.  That doesn't mean profile URIs
>> aren't useful, just that they don't solve the problem of
>> communicating parsing instructions.
>
> Scott, I think different profile URIs do express "where something  
> has to be extracted in a different way."  Different profile URIs
> can mean different extraction rules.
>
> The rules are not actually in the profiles themselves, but the use  
> of profile URIs does what Michael was asking about.

I think we're interpreting Michael's "different" to refer to two  
different types of differences.  I think Michael was looking for the  
rules in the profiles themselves.  Profile URIs do express that  
something needs to be extracted in a different way *from another  
profile* (what I'm calling disambiguation), but they don't express  
differences *from default microformat parsing rules* (what I'm  
calling parsing instructions).

Take the specific example Michael asked about, rel-tag.  The value of  
most microformat properties is extracted from the node content, but  
rel="tag" is a special case where the value should be extracted from  
the last segment of the href attribute instead.  That specific  
difference is *not* machine-readable from a profile, and not just  
because there is no profile for rel-tag.  hCard's class="url" has a  
similarly nonstandard parsing rule (use the href attribute instead of  
node content), and this difference is not machine-readable from the  
profile either:

http://www.w3.org/2006/03/hcard

It's not even human-readable.  There is nothing anywhere in the hCard  
profile saying "url values should come from the href attribute,"  
which is what I think Michael was looking for.  The only place to  
find that difference is in the referenced microformats wiki, where it  
is only human-readable.
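
To make the two special cases concrete, here is a rough sketch of the
kind of rule a parser currently has to hard-code.  The function names
are made up for illustration, and ignoring trailing slashes and
percent-decoding the segment are my assumptions about reasonable
behavior, not something any profile states:

```python
from urllib.parse import urlsplit, unquote


def rel_tag_value(href):
    """Return the last non-empty path segment of a rel="tag" href."""
    path = urlsplit(href).path
    # Assumption: empty segments (e.g. from a trailing slash) are skipped.
    segments = [s for s in path.split("/") if s]
    return unquote(segments[-1]) if segments else ""


def property_value(class_name, rel, href, text):
    """Pick a property's value using the hard-coded special cases."""
    if rel == "tag":
        # rel-tag: value is the last segment of the href
        return rel_tag_value(href)
    if class_name == "url":
        # hCard url: value is the href attribute itself
        return href
    # default rule: value is the node content
    return text
```

Nothing in the profile document would let a parser derive either of
the two "if" branches; they have to come from reading the wiki.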

> As I
> understand it, the profile itself need not even be dereferenced by  
> consuming applications. In that way, it is more of an identifier
> than a locator.

Right.  An identifier is useful for disambiguation.  A locator would  
be necessary for parsing instructions.

> And in fact, profile URIs are the only mechanism we have for  
> version control.

Right, but version control only requires disambiguation, not parsing  
instructions.

> So if parsing rules change with a new version, the
> only way a consuming app would know to apply the new/old parsing is  
> because of the profile URI.

Sure, but that won't tell a parser what the new parsing rules  
actually are, only that they've changed.
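
As a sketch of what I mean, a parser that treats the profile URI purely
as an identifier ends up looking something like this.  The rule table
and its format are hypothetical, as is the second URI in the usage
note; only the hCard profile URI is real:

```python
HCARD_PROFILE = "http://www.w3.org/2006/03/hcard"

# Hypothetical table, written by hand from the human-readable wiki.
# The profile URI is only a lookup key; it is never dereferenced.
KNOWN_RULES = {
    HCARD_PROFILE: {"url": "href"},
}


def rules_for(profile_uri):
    """Return the hand-written rules for a known profile URI, else None."""
    return KNOWN_RULES.get(profile_uri)
```

Given some new URI such as "http://example.org/hcard-v2", rules_for()
returns None: the parser can tell the rules are different, but it has
no way to learn what they are until someone updates the table.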

> For context, Michael's original question in the archive is at
> http://microformats.org/discuss/mail/microformats-discuss/2007-March/008891.html

Here's the part I believe indicates a desire for parsing rules, not  
just disambiguation:

> (eg for rel-tag it needs to split the url in the href attribute and
> get the last part)

But Michael can, of course, better clarify for himself exactly what  
he was looking for and not finding.

Peace,
Scott


