[uf-discuss] Automated microformat parsing using XPath
NewsAgent 2000grad
newsletter at 2000grad.com
Mon Aug 14 13:22:10 PDT 2006
Hello,
I am new to this list, but the problem I came across while trying to
write a more format-knowledge-independent microformat application is
somewhat related to the one described here.
I was looking for a more "general approach" to finding the value of a
known microformatted element (well, the class name is known).
I was using rules like:
- If e.g. the microformat element 'description' is found on an A element
with an HREF, then the URL shall be the value,
- if it is on an ABBR, then the value will be the title,
- if it is found on a DIV or SPAN, then the value is the element's content,
- if on an IMG, then it is the SRC value,
- ...
Is there such a general rule?
One exception would be if a sub-property (class="value") is between the
tags; then obviously this would be taken as the value for the
microformatted element.
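
To make the rules above concrete, here is a minimal sketch of them in JavaScript. It only encodes the rules exactly as listed in this message; whether they hold for every property is the open question. The element shape ({ tag, attrs, text, children }) and the function names propertyValue, hasClass, textOf and findValueChildren are hypothetical, chosen so the sketch runs without a DOM. Concatenating several class="value" parts is also an assumption of this sketch.

```javascript
// Hypothetical element shape for this sketch: { tag, attrs, text, children }.
// In a browser the same checks would read el.tagName, el.getAttribute(...)
// and el.textContent instead.

// True if the class attribute contains `name` as a whole token.
function hasClass(el, name) {
  const cls = (el.attrs && el.attrs["class"]) || "";
  return cls.split(/\s+/).includes(name);
}

// Simplified text content: the element's own text, then its children's text.
function textOf(el) {
  const own = el.text || "";
  const kids = (el.children || []).map(textOf).join("");
  return own + kids;
}

// Collect descendants carrying class="value" (the sub-property exception).
function findValueChildren(el) {
  const out = [];
  for (const child of el.children || []) {
    if (hasClass(child, "value")) out.push(child);
    else out.push(...findValueChildren(child));
  }
  return out;
}

// Apply the rules in order: class="value" first, then element-specific
// attributes, then the element's content as the fallback.
function propertyValue(el) {
  const values = findValueChildren(el);
  if (values.length > 0) return values.map(textOf).join(""); // assumption: concatenate
  const tag = el.tag.toLowerCase();
  const attrs = el.attrs || {};
  if (tag === "a" && "href" in attrs) return attrs.href;
  if (tag === "abbr" && "title" in attrs) return attrs.title;
  if (tag === "img" && "src" in attrs) return attrs.src;
  return textOf(el).trim();
}

// The UID example from the quoted message below, as a plain object.
const uid = {
  tag: "a",
  attrs: {
    "class": "url uid fn",
    href: "http://beta.plazes.com/plaze/cd21e1717f61ba9cf9df9006038da172/"
  },
  text: "fiahless"
};
console.log(propertyValue(uid)); // prints the href, not "fiahless"
```

Note that this element also carries 'fn', for which the link text is probably the wanted value, so a purely element-based rule may not be enough on its own.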
Please correct me if I got something terribly wrong :-)
Henrich C. Pöhls
brian suda wrote:
> There are several things to look out for... I'll answer a few, then
> suggest we move this to the mf-dev list if there are more specific
> questions.
>
> 1) The portion of the XPath contains(@class, 'description') will also
> match 'descriptions' (plural), because it only checks whether the
> string is CONTAINED in the @class. You will need to expand that to
> something like:
>
>   contains(concat(' ', normalize-space(@class), ' '), ' description ')
>
> This pads both sides of the class value with spaces and then searches
> for the term, also padded with spaces.
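
The padded test described in point 1 can be sketched as a small helper that builds the predicate, next to a plain-JavaScript mirror of both tests to show where they differ. The names classTest, substringMatch and tokenMatch are hypothetical, and the mirror only approximates XPath's normalize-space() with a regular expression.

```javascript
// Build an XPath predicate that matches a whole class token, not a substring.
function classTest(name) {
  return "contains(concat(' ', normalize-space(@class), ' '), ' " + name + " ')";
}

// Plain-JavaScript mirror of the naive test: contains(@class, name).
function substringMatch(classAttr, name) {
  return classAttr.includes(name);
}

// Plain-JavaScript mirror of the padded test (approximates normalize-space).
function tokenMatch(classAttr, name) {
  const padded = " " + classAttr.trim().replace(/\s+/g, " ") + " ";
  return padded.includes(" " + name + " ");
}

// The hCalendar description query with both class tests made token-safe.
const xpath =
  "//*[" + classTest("vevent") + "]//*[" + classTest("description") + "]";
console.log(xpath);

console.log(substringMatch("descriptions", "description")); // true (false positive)
console.log(tokenMatch("descriptions", "description"));     // false
console.log(tokenMatch("summary  description", "description")); // true
```

In a browser, the resulting string could be passed to document.evaluate() in place of the shorter expression.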
>
> 2) Depending on both the microformat property (URL, UID, etc.) and the
> element it appears on, you will look in different places:
> if node() = 'a' and @class='url' then
> // look on the @href
> end if
>
> You will also need to consider data that is found on an ABBR element.
> If a microformat property is on an ABBR element, then the value is
> extracted from the @title.
>
> We have a repository of XSLT code, which has many working XPATHs already
> written, feel free to browse them at http://hg.microformats.org/
>
> If you are not already part of the mf-dev list, an administrator will
> have to add you.
>
> -brian
>
> Matt Augustine wrote:
>> I have written simple parsers for hCard and hCal in JavaScript that use
>> XPath to parse the microformat properties from an arbitrary XHTML
>> document. In general, for each known property I have code like this:
>>
>> node = document.evaluate(
>>     "//*[contains(@class, 'vevent')]//*[contains(@class, 'description')]",
>>     hCalXmlNode, null, 0 /*XPathResult.ANY_TYPE*/, null).iterateNext();
>>
>> if (node) {self.Description = node.textContent;}
>>
>> This works great in most cases, but I'm having trouble with the case
>> where the exact location of the data (which attribute, inner element
>> etc.) is unknown. For example, UIDs might be represented as:
>>
>> <a rel="contact friend" class="url uid fn"
>>    href="http://beta.plazes.com/plaze/cd21e1717f61ba9cf9df9006038da172/">fiahless</a>
>>
>> How would I parse the value without special casing to look in the href
>> attribute if the containing element is an <a>? An XPath expression like
>> the one above would yield "fiahless" instead of
>> "http://beta.plazes.com/plaze/cd21e1717f61ba9cf9df9006038da172/".
>>
>>
>> Matt Augustine
>> _______________________________________________
>> microformats-discuss mailing list
>> microformats-discuss at microformats.org
>> http://microformats.org/mailman/listinfo/microformats-discuss
>>
>>