Difference between revisions of "parsing-microformats"

From Microformats Wiki

Revision as of 05:14, 8 July 2006

Microformat Parsing

Microformat parsing mechanisms that depend on documents having even minimal XML properties, such as well-formedness, may fail when consuming non-well-formed content. Tidy or, even better, CyberNeko may be a useful workaround. In particular, Brian Suda's frequently cited X2V hCard and hCalendar discovery and transformation prototypes use XSLT and "tidy" any non-well-formed input before processing it.

Most microformats tend to be agnostic about details such as the exact element type used.

Developers can use tools like XPath, which assume well-formedness, on content that is well-formed (either as found on the web or after running it through Tidy). Mark Pilgrim's Universal Feed Parser, for example, suggests that it may be possible to sanitize user HTML to an extent that it is suitable for later processing as XML.

Parsing class values

When parsing class values, care must be taken:

  1. Class attributes may contain multiple class names, e.g. class="foo vcard bar".
  2. Class attributes may contain class names that merely contain the class name used by a microformat, e.g. class="foovcardbar", class="foovcard", class="vcardbar". These must not be treated as matches.
  3. Multiple class names are separated by one or more whitespace characters.
  4. Class names are case sensitive.

See http://www.w3.org/TR/html401/struct/global.html#h-7.5.2.

JavaScript example

// Match "vcard" only as a whole whitespace-delimited name; \b would also match "foo-vcard".
if (element.className.match(/(^|\s)vcard(\s|$)/)) ...
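
The class-name rules listed above can also be checked by splitting the attribute value instead of using a regular expression. The following is only a minimal sketch: the helper name hasClassName and its parameters are hypothetical and are not part of any microformats tool or library.

```javascript
// Hypothetical helper: returns true if the class attribute value `classAttr`
// contains `name` as a complete, case-sensitive class name.
function hasClassName(classAttr, name) {
  if (!classAttr) {
    return false; // missing or empty attribute: nothing to match
  }
  // Split on runs of space, tab, line feed, form feed, and carriage return.
  const names = classAttr.split(/[ \t\n\f\r]+/);
  // Exact string comparison keeps the match case sensitive and rejects
  // names that merely contain the target, such as "foovcardbar".
  return names.includes(name);
}
```

For instance, hasClassName("foo vcard bar", "vcard") is true, while hasClassName("foovcardbar", "vcard"), hasClassName("foo-vcard", "vcard") and hasClassName("VCARD", "vcard") are all false.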

XSLT example

<xsl:if test="contains(
   concat(
       ' ',
       concat(normalize-space(@class),' ')
   ),
   ' vcard '
 )" > ...

See also the XPath generator at http://balloon.hobix.com/xpath-generator, which helps you generate those long, ugly XPath queries.

Parsing rel/rev values

Parsing rel and rev values is similar to parsing class values except for the following differences:

  1. rel and rev values should be separated by one space.
  2. rel and rev values are case insensitive.

See http://www.w3.org/TR/html401/types.html#type-links.
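
The differences above can be sketched in JavaScript as follows. This is an illustration under stated assumptions, not a reference implementation: the helper name hasRelValue is hypothetical, and the sketch deliberately tolerates runs of whitespace between values even though a single space is expected.

```javascript
// Hypothetical helper: returns true if the rel (or rev) attribute value
// `relAttr` contains `value` as a complete, case-insensitive link type.
function hasRelValue(relAttr, value) {
  if (!relAttr) {
    return false; // missing or empty attribute: nothing to match
  }
  // Lower-case both sides so the comparison is case insensitive.
  const wanted = value.toLowerCase();
  // Assumption: be lenient and split on runs of whitespace, not only
  // on a single space.
  const values = relAttr.toLowerCase().split(/[ \t\n\f\r]+/);
  return values.includes(wanted);
}
```

For instance, hasRelValue("Tag nofollow", "tag") is true, while hasRelValue("tagged", "tag") is false.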

See Also