[uf-discuss] Automated microformat parsing using XPath

NewsAgent 2000grad newsletter at 2000grad.com
Mon Aug 14 13:22:10 PDT 2006


Hello,

I am new to this list, but I ran into a problem related to the one
described below while trying to write a microformat application that
needs less format-specific knowledge.
I was looking for a more general approach to finding the value of a
microformatted element whose class name is known.

I was using rules like:
- If, e.g., the microformat element 'description' is found on an A HREF,
then the URL shall be the value,
- if it is on an ABBR, then the value is the title,
- if it is found on a DIV or SPAN, then the value is the content inside,
- if on an IMG, then it's the SRC value
- ...

Is there such a general rule?

One exception would be a sub-property (class="value") between the
tags; then obviously that would be taken as the value of the
microformatted element.
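
A minimal JavaScript sketch of the rules above, assuming `el` is a DOM
element that already carries the known class name. The function name
`microformatValue` is hypothetical, and falling back to the text content
when the expected attribute is missing is an assumption, not something any
microformat spec is quoted as saying here.

```javascript
// Hypothetical helper: pick the value of a microformatted element from
// where the element type suggests it lives.
function microformatValue(el) {
  // Exception first: a nested class="value" element wins.
  var nested = el.getElementsByClassName('value');
  if (nested.length > 0) {
    return nested[0].textContent;
  }

  // Map the element type to the attribute that holds the value, if any.
  var attribute = null;
  switch (el.tagName.toUpperCase()) {
    case 'A':    attribute = 'href';  break;
    case 'ABBR': attribute = 'title'; break;
    case 'IMG':  attribute = 'src';   break;
  }

  // DIV, SPAN and everything else: use the content inside the element.
  // Assumption: also fall back to the content if the attribute is absent.
  var value = attribute ? el.getAttribute(attribute) : null;
  return value !== null ? value : el.textContent;
}
```

Note that `getAttribute('href')` returns the attribute as written, so a
relative URL would still need to be resolved against the page address.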

Please correct me if I got something terribly wrong :-)

Henrich C. Pöhls


brian suda wrote:
> There are several things to look out for... i'll answer a few, then
> suggest we move this to the mf-dev list if there are more specific
> questions.
> 
> 1) the portion of the XPath: contains(@class, 'description') will also
> match 'descriptions' (plural), because it only looks for the string
> CONTAINED in the @class. You will need to expand that to something like:
>   contains(concat(' ', normalize-space(@class), ' '), ' description ')
> This pads both sides with spaces and then searches for the term, also
> padded with spaces.
> 
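
The padded-class test from point 1 can be sketched as a small helper that
builds the predicate, plus the same check in plain JavaScript for
comparison. The names `classTokenPredicate` and `hasClassToken` are
hypothetical, and the sketch assumes the class name contains no quotes or
whitespace.

```javascript
// Build an XPath 1.0 predicate that matches `token` only as a whole,
// space-separated class name (so 'description' does not match 'descriptions').
// Assumption: `token` contains no quote characters or whitespace.
function classTokenPredicate(token) {
  return "contains(concat(' ', normalize-space(@class), ' '), ' " + token + " ')";
}

// The same test on a raw class attribute string, mirroring normalize-space():
// collapse whitespace runs, strip the ends, pad with spaces, then search.
function hasClassToken(classAttr, token) {
  var normalized = (classAttr || '')
    .replace(/[ \t\r\n]+/g, ' ')
    .replace(/^ | $/g, '');
  return (' ' + normalized + ' ').indexOf(' ' + token + ' ') !== -1;
}

// Example: the vevent/description lookup from the original question,
// rewritten with whole-token matching.
var descriptionXPath =
  '//*[' + classTokenPredicate('vevent') + ']' +
  '//*[' + classTokenPredicate('description') + ']';
```

The resulting `descriptionXPath` string can be passed to
`document.evaluate` in place of the plain `contains(@class, ...)` version.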
> 2) Depending on both the microformat property (URL, UID, etc.) and the
> element it is on, you will look in different places:
> if name() = 'a' and the @class contains 'url' then
>   // look on the @href
> end if
> 
> You will also need to consider data that is found on an ABBR element.
> If there is a microformat property and it is on an ABBR element, then
> the value is extracted from the @title.
> 
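
Point 2 says the lookup depends on both the property and the element. A
rough sketch of that dispatch follows; `valueForProperty` and
`URL_PROPERTIES` are hypothetical names, and treating exactly 'url' and
'uid' as URL-valued is an assumption taken from the examples in this
thread, not a complete list.

```javascript
// Assumption: these are the properties whose value is a URL.
var URL_PROPERTIES = ['url', 'uid'];

// Return the value of `property` as carried by the DOM element `el`.
function valueForProperty(property, el) {
  var tag = el.tagName.toLowerCase();

  // Any property on an ABBR: the value comes from @title.
  if (tag === 'abbr' && el.getAttribute('title') !== null) {
    return el.getAttribute('title');
  }

  // A URL-valued property on an A: the value comes from @href.
  if (tag === 'a' && URL_PROPERTIES.indexOf(property) !== -1 &&
      el.getAttribute('href') !== null) {
    return el.getAttribute('href');
  }

  // Otherwise use the text content of the element.
  return el.textContent;
}
```

With an element like the `class="url uid fn"` link quoted below, this would
give the href for 'url' and 'uid' but the link text for 'fn'.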
> We have a repository of XSLT code, which has many working XPATHs already
> written, feel free to browse them at http://hg.microformats.org/
> 
> If you are not already part of the mf-dev list, an administrator will
> have to add you.
> 
> -brian
> 
> Matt Augustine wrote:
>> I have written simple parsers for hCard and hCal in javascript that use
>> XPath to parse the microformat properties from an arbitrary xhtml
>> document.  In general, for each known property I have code like this:
>>
>> node = document.evaluate(
>>   "//*[contains(@class, 'vevent')]//*[contains(@class, 'description')]",
>>   hCalXmlNode, null, 0 /* XPathResult.ANY_TYPE */, null).iterateNext();
>>
>> if (node) {self.Description = node.textContent;}
>>
>> This works great in most cases, but I'm having trouble with the case
>> where the exact location of the data (which attribute, inner element
>> etc.) is unknown.  For example, UIDs might be represented as:
>>
>> <a rel="contact friend" class="url uid fn"
>>    href="http://beta.plazes.com/plaze/cd21e1717f61ba9cf9df9006038da172/">fiahless</a>
>>
>> How would I parse the value without special casing to look in the href
>> attribute if the containing element is an <a>?  An XPath expression like
>> the one above would yield "fiahless" instead of
>> "http://beta.plazes.com/plaze/cd21e1717f61ba9cf9df9006038da172/".
>>
>>
>> Matt Augustine
>> _______________________________________________
>> microformats-discuss mailing list
>> microformats-discuss at microformats.org
>> http://microformats.org/mailman/listinfo/microformats-discuss
>>
>>   
> 


