Revision as of 18:37, 6 August 2005

hCard parsing

by Tantek Çelik

introduction

When I first conceived of hCard, it was clear to me how to unambiguously parse arbitrary (X)HTML (and anything in which arbitrary (X)HTML can be embedded, e.g. RSS, Atom, "generic XML"), both for the existence of hCards and for hCard properties and values.

I worked directly with Brian Suda to capture these thoughts in an implementation, and Brian wrote X2V, an XSLT script that converts hCards to vCards, thus simultaneously demonstrating the parsability of hCards and the immediate utility of hCard content interoperating with widespread existing vCard applications.

I am now documenting those thoughts directly here so that additional implementations, rather than having to reverse engineer X2V, can be built directly from these elementary concepts.

scope

Although this page is written specifically to explain how to parse hCard, the concepts and algorithms it contains serve as an example of how other compound microformats are to be parsed.

root class name

Each compound microformat starts with a root element with a relatively unique class name. By that I mean a class name which isn't simply a common word, and is unlikely to have been used outside the context of the microformat. By choosing such a root class name the microformat avoids (for all practical purposes) colliding with existing class names that may exist within the (X)HTML context. This is essential to enabling such compound microformats to be embedded inside current, existing content, as well as future content.

Fortunately this is not a new problem to solve. The root object names chosen for vCard (RFC 2426) and iCalendar (RFC 2445) similarly had to avoid such collisions and did so by choosing names that were unlikely to have been introduced into a MIME object context. The principle of reuse dictates that we should reuse the names for these root objects in those RFCs rather than invent our own. Given the same semantics, a design should reuse the names, rather than inventing a second name for the same semantic (a common design mistake made in environments that require namespaces).

In the vCard specification, the names are case-insensitive due to the (lack of) requirements of their context. (X)HTML class names are case-sensitive per those specifications. Thus we are required to pick a canonical case for the class name equivalents of vCard object and property names. All lowercase is chosen to follow the precedent (i.e. reuse the pattern) set by XHTML, which similarly had to canonicalize the case of element and attribute names that it took from HTML4, which itself was case-insensitive due to its context (SGML). Additionally, reasons for avoiding mixed case (e.g. camel case) in the context of class names may be found in the essay A Touch of Class (http://tantek.com/log/2002/12.html#L20021216t2238), specifically the section titled Class sensitivity (http://tantek.com/log/2002/12.html#atoc_csensitivity).

Thus the root class name of an hCard is "vcard".
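
The canonicalization argument above can be illustrated with a few lines of Python. This is only a sketch: the helper name class_name_for is hypothetical, and it assumes that the class name equivalent of a vCard name is simply its lowercase form, as the paragraph above describes for the root name.

```python
def class_name_for(vcard_name):
    """Map a case-insensitive vCard name to its lowercase class name.

    Hypothetical helper for illustration; it only lowercases the name.
    """
    return vcard_name.lower()


# vCard treats all of these spellings as the same object name.
spellings = ["VCARD", "vCard", "vcard"]
canonical = {class_name_for(s) for s in spellings}

# (X)HTML class names are compared case-sensitively, so only the
# canonical lowercase spelling matches the root class name.
matches = [s == "vcard" for s in spellings]
```

Every spelling collapses to the single class name "vcard", while a case-sensitive comparison accepts only the lowercase form, which is why one canonical case has to be chosen.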

finding hCards

An (X)HTML document indicates that it may contain hCards by referencing the hCard XMDP profile. See XMDP (http://gmpg.org/xmdp/description) for more details.

A parser finds hCards in an (X)HTML context by looking for elements with the root class name "vcard" just as the following CSS class selector does:

 .vcard

For example, the following CSS style rule sets the background of all hCards to green:

 .vcard { background: green; }

Note that the (X)HTML class attribute is a space-separated set of class names.

Thus all of the following are valid hCard root elements:

  • <div class="vcard"> </div>
  • <span class="attendee vcard"> </span>
  • <address class="vcard author"> </address>
  • <li class="reviewer vcard first"> </li>
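
The matching rule behind these examples can be sketched in Python as follows. The function name has_root_class is hypothetical, and str.split() is used as an approximation of how (X)HTML separates class names (it splits on any whitespace, which is slightly broader than the markup specifications define).

```python
def has_root_class(class_attr, root="vcard"):
    """Return True if the class attribute value contains `root` as a whole name.

    Hypothetical helper: treats the attribute as a whitespace-separated
    set of names and compares them case-sensitively.
    """
    # split() with no argument tolerates leading, trailing, and repeated spaces.
    return root in (class_attr or "").split()


# The class attribute values from the four root elements listed above.
examples = ["vcard", "attendee vcard", "vcard author", "reviewer vcard first"]
results = [has_root_class(value) for value in examples]
```

Comparing whole names rather than searching for a substring matters: a value such as "vcards" or "my-vcard" does not make an element an hCard root.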

Once the root element of an hCard is found, that element and all its descendants are all that is needed to parse the hCard.

Thus it is possible for a well-formed hCard to be extracted from an overall non-well-formed context, provided the parser is able to find elements by class name within that non-well-formed context.
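
As a rough illustration of locating hCard roots in markup that is not well-formed, the sketch below uses the tolerant HTMLParser class from the Python standard library. It is not how X2V works (X2V is an XSLT script); the names VCardFinder and find_vcard_roots and the sample markup are assumptions made for this example, and the sketch only finds root elements without parsing their properties.

```python
from html.parser import HTMLParser


class VCardFinder(HTMLParser):
    """Record every start tag whose class list contains 'vcard'."""

    def __init__(self):
        super().__init__()
        self.roots = []  # (tag name, normalized class list) per root found

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None when absent.
        classes = (dict(attrs).get("class") or "").split()
        if "vcard" in classes:
            self.roots.append((tag, " ".join(classes)))


def find_vcard_roots(markup):
    """Return the hCard root elements found in `markup`, in document order."""
    finder = VCardFinder()
    finder.feed(markup)
    finder.close()
    return finder.roots


# The <p> and <b> elements are never closed, so the context is not
# well-formed, yet the hCard root inside it can still be located.
sample = '<p>Unclosed <b>bold <div class="attendee vcard">Tantek</div>'
roots = find_vcard_roots(sample)
```

Because the parser reports start tags as it encounters them, the unclosed elements around the hCard do not prevent its root from being found.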

parsing hCard properties

The complete list of class names for hCard properties is documented in the hCard profile.

Work In Progress

I'm still in the process of writing this document. Please avoid non-trivial edits. Thanks, Tantek

References

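Normative References

  • vCard (RFC 2426)
  • XHTML 1.0 Recommendation (http://w3.org/TR/XHTML1)
  • HTML 4.01 Recommendation (http://w3.org/TR/html401)
  • XMDP (http://gmpg.org/xmdp/)

Informative References

  • CSS1 Recommendation (http://w3.org/TR/REC-CSS1)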