parsing-microformats: Difference between revisions

From Microformats Wiki

Latest revision as of 06:46, 16 October 2009

Parsing Microformats

Microformat parsing mechanisms that depend on documents having even minimal XML properties, such as well-formedness, may fail when consuming non-well-formed content. Tidy or, even better, CyberNeko may be a useful workaround. In particular, X2V uses XSLT and runs Tidy to clean any non-well-formed input before processing it.

Parsing class values

Care must be taken when parsing class values:

  1. Class attributes may contain multiple class names separated by whitespace, e.g. class="foo vcard bar".
  2. Class attributes may contain class names that merely contain the class name used by a microformat, e.g. class="foovcardbar", class="foovcard", or class="vcardbar", none of which are hCards.
  3. Multiple class names can be separated by one or more whitespace characters.
  4. Class names are case sensitive. Microformats class names are always all lowercase (per naming-principles and naming-conventions).

See http://www.w3.org/TR/html401/struct/global.html#h-7.5.2.
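
The four rules above can be sketched as a small helper. This is only an illustrative sketch: the names classNames and hasClassName are hypothetical, and the separator set (space, tab, line feed, form feed, carriage return) is an assumption about which characters count as whitespace.

```javascript
// Whitespace that separates class names. Assumption: the ASCII whitespace
// set (space, tab, LF, FF, CR) rather than JavaScript's broader \s.
var CLASS_SEPARATOR = /[ \t\n\f\r]+/;

// Split a raw class attribute value into its individual class names.
function classNames(classAttr) {
  return (classAttr || "").split(CLASS_SEPARATOR).filter(function (name) {
    return name.length > 0; // drop empties from leading/trailing whitespace
  });
}

// True only when `name` is one complete, case-sensitive class name.
function hasClassName(classAttr, name) {
  return classNames(classAttr).indexOf(name) !== -1;
}
```

With this sketch, hasClassName("foo vcard bar", "vcard") is true, while "foovcard" and "VCard" both fail the test. A word-boundary regular expression such as /\bvcard\b/ is not an equivalent shortcut, because it also matches inside a class name like vcard-extra.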

JavaScript example

The Ultimate getElementsByClassName JavaScript function may be useful. It is a standalone function that takes the root element, a tag name, and a class name, so you can do:

var adrs = getElementsByClassName(document, "*", "adr");

or:

var cities = getElementsByClassName(document, "*", "locality");
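
To show what a class-name lookup of this kind has to do internally, here is a minimal sketch that walks a tree and collects matching nodes. It is not the code of the helper linked above: findByClassName is a hypothetical name, and the nodes are assumed to be plain objects with className and children properties rather than real DOM elements.

```javascript
// Collect every node in a tree whose class attribute contains `name` as a
// complete class name. Nodes are hypothetical plain objects of the form
// { className: "...", children: [...] }, not real DOM elements.
function findByClassName(root, name) {
  var matches = [];
  (function visit(node) {
    // Split on runs of whitespace so "locality  region" yields two names.
    var names = (node.className || "").split(/[ \t\n\f\r]+/);
    if (names.indexOf(name) !== -1) {
      matches.push(node);
    }
    (node.children || []).forEach(visit);
  })(root);
  return matches; // in document (depth-first) order
}
```

The same traversal would apply to DOM elements, since they also expose className and children.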

XSLT example

<xsl:if test="contains(
   concat (' ', normalize-space(@class),' '),
   ' vcard '
   )" > ...

JavaScript can also perform XSLT natively in browsers like Firefox. See Firefox extensions for performing XSL transformations in Firefox without JavaScript.
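
The XPath test above works by padding the normalized class value with spaces, so that every class name, including the first and last, is surrounded by exactly one space on each side. The following sketch mirrors that logic in JavaScript purely to make the trick visible; normalizeSpace and xpathStyleClassTest are hypothetical names.

```javascript
// Mirror of XPath normalize-space(): collapse runs of XML whitespace
// (space, tab, CR, LF) to one space and trim both ends.
function normalizeSpace(value) {
  return value.replace(/[ \t\r\n]+/g, " ").replace(/^ | $/g, "");
}

// Mirror of: contains(concat(' ', normalize-space(@class), ' '), ' name ')
function xpathStyleClassTest(classAttr, name) {
  var padded = " " + normalizeSpace(classAttr) + " ";
  return padded.indexOf(" " + name + " ") !== -1;
}
```

Without the padding, a plain substring test would wrongly accept values such as class="foovcard".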

XQuery example

XQuery, which also builds on XPath, can be used in a similar way:

<div style="background-color:yellow;">
{
  for $a in doc()//div[@class='vcard']
  let $b := $a/div[@class='fn org' or @class='org fn']
  let $c := $a/div[@class='adr']
  return ($b, $c, <br />)
}
</div>

For example, this could be used against http://technorati.com/about/contact.html. See Firefox extensions for getting XQuery in Firefox.

Note that the class tests above are simplified for demonstration purposes. They should really use the more complicated XPath expression from the XSLT example, which allows for other classes on the element, variations in whitespace, etc.

Simple XPath expressions can also be used, as these are considered to be valid XQueries.
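
Because the more robust class test is verbose, it can be convenient to generate it. The sketch below builds the XPath 1.0 predicate as a string; tokenPredicate is a hypothetical helper, and restricting names to letters, digits, hyphens, and underscores is an assumption made here to avoid quoting problems.

```javascript
// Build an XPath 1.0 predicate that tests an attribute for one complete,
// whitespace-separated token. `attr` defaults to "class".
function tokenPredicate(name, attr) {
  // Assumption: only simple names are supported, so no quote escaping is needed.
  if (!/^[A-Za-z0-9_-]+$/.test(name)) {
    throw new Error("unsupported token: " + name);
  }
  return "contains(concat(' ', normalize-space(@" + (attr || "class") +
    "), ' '), ' " + name + " ')";
}

// Path for every div carrying the vcard class, whatever else is in @class.
var vcardPath = "//div[" + tokenPredicate("vcard") + "]";
```

A test for two classes in any order, such as fn and org, can then be written by joining two predicates with "and" instead of enumerating 'fn org' and 'org fn'.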

Parsing rel/rev values

Parsing rel and rev values is similar to parsing class values except for the following differences:

  1. rel and rev values should be separated by one space.
  2. rel and rev values are case insensitive.

See http://www.w3.org/TR/html401/types.html#type-links.
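
The difference from class parsing can be sketched as follows. The names relValues and hasRel are hypothetical. As a design choice of this sketch, it tolerates runs of whitespace between values even though a single space is expected, and it lowercases both sides to get a case-insensitive comparison.

```javascript
// Split a rel or rev attribute value into lowercase link types.
function relValues(relAttr) {
  return (relAttr || "")
    .toLowerCase()
    .split(/[ \t\n\f\r]+/)
    .filter(Boolean); // drop empty strings from stray whitespace
}

// Case-insensitive test for one complete rel/rev value.
function hasRel(relAttr, value) {
  return relValues(relAttr).indexOf(value.toLowerCase()) !== -1;
}
```

For instance, hasRel("Tag NoFollow", "nofollow") is true, while hasRel("tagging", "tag") is false.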

See Also