hCard parsing
by Tantek Çelik
introduction
When I first conceived of hCard, it was clear to me how to unambiguously parse both for the existence of hCards in arbitrary (X)HTML (and anywhere that arbitrary (X)HTML can be embedded, e.g. RSS, Atom, "generic XML"), and hCard properties and values.
I worked directly with Brian Suda to capture these thoughts in an implementation, and Brian wrote X2V, an XSLT script that converts hCards to vCards, thus simultaneously demonstrating the parsability of hCards, and the immediate utility of hCard content interoperating with widespread existing vCard applications.
I am now documenting those thoughts directly here so that additional implementations, rather than having to reverse engineer X2V, can be built directly from these elementary concepts.
scope
Although this page is written specifically to explain how to parse hCard, the concepts and algorithms contained herein serve as an example of how other compound microformats are to be parsed.
URL handling
An hCard parser may begin with a URL to retrieve.
If the URL lacks a fragment identifier, then the parser should parse the entire retrieved resource for hCards.
If the URL has a fragment identifier, then the parser should parse only the node indicated by the fragment identifier and its descendants, looking for hCards, starting with the indicated node, which may itself be a single hCard.
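As a minimal sketch of the first step, the fragment identifier can be separated from the URL before retrieval. The function name `split_target` and the example URLs are illustrative assumptions, not part of hCard; only the standard library's `urllib.parse.urldefrag` is used.

```python
from urllib.parse import urldefrag

def split_target(url):
    """Return (resource_url, fragment) for a URL handed to the parser.

    fragment is None when the URL has no fragment identifier, meaning the
    whole retrieved resource should be searched for hCards.
    """
    resource, fragment = urldefrag(url)
    return resource, (fragment or None)

# With a fragment, only the identified node and its descendants are parsed.
print(split_target("http://example.com/people.html#jane"))
# Without one, the entire resource is parsed.
print(split_target("http://example.com/people.html"))
```

Locating the node named by the fragment (for example by matching an `id` attribute) is left out of this sketch.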
root class name
Each compound microformat starts with a root element with a relatively unique class name. By that I mean a class name which isn't simply a common word, and is unlikely to have been used outside the context of the microformat. By choosing such a root class name the microformat avoids (for all practical purposes) colliding with existing class names that may exist within the (X)HTML context. This is essential to enabling such compound microformats to be embedded inside current, existing content, as well as future content.
Fortunately this is not a new problem to solve. The root object names chosen for vCard (RFC 2426) and iCalendar (RFC 2445) similarly had to avoid such collisions and did so by choosing names that were unlikely to have been introduced into a MIME object context. The principle of reuse dictates that we should reuse the names for these root objects in those RFCs rather than invent our own. Given the same semantics, a design should reuse the names, rather than inventing a second name for the same semantic (a common design mistake made in environments that require namespaces).
In the vCard specification, the names are case-insensitive due to the (lack of) requirements of their context. (X)HTML class names are case sensitive per those specifications. Thus we are required to pick a canonical case for the class name equivalents of vCard object and property names. All lowercase is chosen to follow the precedent (i.e. reuse the pattern) set by XHTML, which similarly had to canonicalize the case of element and attribute names that it took from HTML4, which itself was case-insensitive due to its context (SGML). Additionally, reasons for avoiding mixed-case (e.g. camel case) in the context of class names may be found in the essay A Touch of Class, specifically, the section titled Class sensitivity.
Thus the root class name of an hCard is "vcard".
finding hCards
An (X)HTML document indicates that it may contain hCards by referencing the hCard XMDP profile. See XMDP for more details.
A parser finds hCards in an (X)HTML context by looking for elements with the root class name "vcard" just as the following CSS class selector does:
.vcard
For example, the following CSS style rule sets the background of all hCards to green:
.vcard { background: green; }
Note that the (X)HTML class attribute is a space separated set of class names.
Thus all of the following are valid hCard root elements:
<div class="vcard"> </div>
<span class="attendee vcard"> </span>
<address class="vcard author"> </address>
<li class="reviewer vcard first"> </li>
Once the root element of an hCard is found, that element and all its descendants are all that is needed to parse the hCard.
Thus it is possible for a well-formed hCard to be extracted from an overall non well-formed context, if the parser has the ability to find elements by class name within that non well-formed context.
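The lookup described above can be sketched with Python's standard `html.parser`, which does not require well-formed input. The names `VcardRootFinder` and `find_vcard_roots` are hypothetical. The important detail is that the class attribute is split into tokens and compared case-sensitively, rather than searched as a substring.

```python
from html.parser import HTMLParser

class VcardRootFinder(HTMLParser):
    """Record (tag, class tokens) for each start tag whose class set contains 'vcard'."""

    def __init__(self):
        super().__init__()
        self.roots = []

    def handle_starttag(self, tag, attrs):
        # The class attribute is a space-separated set of names, so compare
        # whole tokens: "vcards" or "VCARD" must not match.
        tokens = (dict(attrs).get("class") or "").split()
        if "vcard" in tokens:
            self.roots.append((tag, tokens))

def find_vcard_roots(markup):
    finder = VcardRootFinder()
    finder.feed(markup)
    finder.close()
    return finder.roots

sample = (
    '<div class="vcard"> </div>'
    '<span class="attendee vcard"> </span>'
    '<p class="vcards">not an hCard</p>'
    '<li class="reviewer vcard first"> </li>'
)
for tag, tokens in find_vcard_roots(sample):
    print(tag, tokens)
```

This sketch only reports where hCards start; a fuller parser would also keep each root's descendants for property extraction.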
hCard properties
The complete list of class names for hCard properties is documented in the hCard profile.
forward compatible parsing
When parsing the contents of an hCard, any unrecognized class names must be ignored.
Similarly, unrecognized values for hCard properties must also be ignored.
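One way to sketch this rule is to filter an element's class names against the set the parser knows about and silently drop the rest. `KNOWN_PROPERTIES` below is an illustrative subset chosen for this example; the authoritative list is the one in the hCard profile.

```python
# Assumption: a small illustrative subset, not the full hCard property list.
KNOWN_PROPERTIES = {"fn", "n", "adr", "email", "url", "tel"}

def recognized_properties(class_value):
    """Return the class names the parser recognizes, in document order.

    Unrecognized names (styling hooks, future properties) are skipped
    without raising an error, which keeps the parser forward compatible.
    """
    return [name for name in class_value.split() if name in KNOWN_PROPERTIES]

print(recognized_properties("fn highlight x-future-property url"))
print(recognized_properties("sidebar"))
```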
finding hCard properties
To parse an hCard for an hCard property (e.g. "fn"), the parser simply looks for the first element with that class name inside the hCard.
This can also be expressed as the first element that matches this CSS selector:
.vcard .fn
Some properties, like "fn", should only appear once, and thus the parser stops looking for the property after it has found one occurrence. Additional occurrences are ignored.
Other properties, like "adr", "email", "url", "tel", etc., may (and often do) appear more than once, and thus the parser continues to look for each occurrence within the contents of the hCard.
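The difference between the two kinds of properties can be sketched as follows. The input is assumed to be the markup of a single hCard; the `SINGULAR` and `PLURAL` sets are illustrative subsets, and `PropertyCollector` and `collect_properties` are hypothetical names. Only element text is collected here; attribute-based values are out of scope for this sketch.

```python
from html.parser import HTMLParser

SINGULAR = {"fn"}                         # assumption: illustrative subset
PLURAL = {"adr", "email", "url", "tel"}   # assumption: illustrative subset
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}  # never get an end tag

class PropertyCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.values = {}   # property name -> list of text buffers
        self.open = []     # stack of (tag, buffers that element is filling)

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        buffers = []
        for name in (dict(attrs).get("class") or "").split():
            if name in SINGULAR and name in self.values:
                continue   # already found once: later occurrences are ignored
            if name in SINGULAR or name in PLURAL:
                buf = []
                self.values.setdefault(name, []).append(buf)
                buffers.append(buf)
        self.open.append((tag, buffers))

    def handle_data(self, data):
        # Text belongs to every open property element that contains it.
        for _tag, buffers in self.open:
            for buf in buffers:
                buf.append(data)

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag name, if any.
        for i in range(len(self.open) - 1, -1, -1):
            if self.open[i][0] == tag:
                del self.open[i:]
                break

def collect_properties(hcard_markup):
    collector = PropertyCollector()
    collector.feed(hcard_markup)
    collector.close()
    return {name: ["".join(buf).strip() for buf in bufs]
            for name, bufs in collector.values.items()}

card = ('<div class="vcard"><span class="fn">Jane Doe</span> '
        '<span class="fn">Ignored</span> '
        '<span class="tel">111</span> <span class="tel">222</span></div>')
print(collect_properties(card))
```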
parsing hCard properties and values
Once an element for a property is found, the contents of the element are used for the value.
There are several exceptions to accommodate semantic XHTML and more semantic equivalents.
For the "email" property in particular, when the element is:
<a href="mailto:...">
: use the value of the 'href' attribute, omitting the "mailto:" prefix in the attribute.
For properties of type URL or URI, parsers MUST handle relative URLs and normalize them to their respective absolute URLs. In addition, when the element for a property is:
<a href>
: use the value of the 'href' attribute.
<img src>
: use the value of the 'src' attribute.
<object data>
: use the value of the 'data' attribute.
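The "email" and URL rules above can be sketched as two small functions. The names `email_value` and `url_value`, the argument shapes (a tag name, a dict of attributes, the element's text, and a base URL), and the example addresses are all assumptions made for illustration. Relative URLs are resolved with the standard library's `urllib.parse.urljoin`.

```python
from urllib.parse import urljoin

# Which attribute carries the value when a URL-typed property sits on this element.
URL_ATTRIBUTE = {"a": "href", "img": "src", "object": "data"}

def email_value(tag, attrs, text):
    """Value of an "email" property: the href minus "mailto:", else the text."""
    href = (attrs.get("href") or "") if tag == "a" else ""
    if href.lower().startswith("mailto:"):
        return href[len("mailto:"):]
    return text.strip()

def url_value(tag, attrs, text, base_url):
    """Value of a URL-typed property, normalized to an absolute URL."""
    attr = URL_ATTRIBUTE.get(tag)
    raw = attrs.get(attr) if attr else None
    if raw is None:
        raw = text.strip()   # fall back to the element's contents
    return urljoin(base_url, raw)

base = "http://example.com/people/jane.html"
print(email_value("a", {"href": "mailto:jane@example.com"}, "Email Jane"))
print(url_value("a", {"href": "../about/"}, "About", base))
print(url_value("img", {"src": "photo.jpg"}, "", base))
```

The base URL is assumed here to be the URL the document was retrieved from.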
For properties NOT of type URL or URI, when the element for a property is:
<img alt>
: use the value of the 'alt' attribute.
For all properties, when the element for a property is:
<abbr>
: use the value of the 'title' attribute.
- For properties which take an ISO8601 datetime value, parsers *should* pad any necessary precision (e.g. seconds), and *should* normalize any datetimes with timezone offsets (e.g. 20050814T2305-0700) into UTC (20050815T060500Z).
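The datetime padding and normalization can be sketched as below. The function name `normalize_datetime` and the accepted input forms (basic or extended ISO8601 with at least hours and minutes) are assumptions of this sketch; values it does not recognize are returned unchanged rather than guessed at.

```python
import re
from datetime import datetime, timedelta

# Date, "T", hours and minutes, optional seconds, optional "Z" or numeric offset.
_ISO = re.compile(
    r"^(\d{4})-?(\d{2})-?(\d{2})T(\d{2}):?(\d{2})(?::?(\d{2}))?"
    r"(Z|[+-]\d{2}:?\d{2})?$"
)

def normalize_datetime(value):
    match = _ISO.match(value.strip())
    if not match:
        return value                      # not a form this sketch understands
    year, month, day, hour, minute = (int(g) for g in match.group(1, 2, 3, 4, 5))
    second = int(match.group(6) or 0)     # pad missing seconds with 00
    zone = match.group(7)
    try:
        stamp = datetime(year, month, day, hour, minute, second)
    except ValueError:
        return value                      # e.g. month 13: leave untouched
    if zone is None:
        return stamp.strftime("%Y%m%dT%H%M%S")   # no timezone given: only pad
    if zone != "Z":
        sign = -1 if zone[0] == "-" else 1
        digits = zone[1:].replace(":", "")
        offset = timedelta(hours=int(digits[:2]), minutes=int(digits[2:]))
        stamp -= sign * offset            # local time minus offset gives UTC
    return stamp.strftime("%Y%m%dT%H%M%SZ")

print(normalize_datetime("20050814T2305-0700"))  # 20050815T060500Z
```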
hCard sub-properties
There are some hCard properties whose values themselves have structure (AKA structured type value) and are composed of multiple pieces, which we refer to as sub-properties.
For example, the "n" property consists of the sub-properties "family-name", "given-name", "additional-names", "honorific-prefixes", and "honorific-suffixes".
E.g. from section 3.1.2 of RFC 2426, modified to include Ph.D.
N:Public;John;Quinlan;Mr.;Esq.,Ph.D.
In hCard this "n" property would be marked up as
<span class="n">
 <span class="honorific-prefixes">Mr.</span>
 <span class="given-name">John</span>
 <span class="additional-names">Quinlan</span>
 <span class="family-name">Public</span>,
 <span class="honorific-suffixes">Esq.,Ph.D.</span>
</span>
Which would be rendered as:
Mr. John Quinlan Public, Esq.,Ph.D.
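To illustrate how the sub-properties map back to the structured vCard value, here is a minimal sketch that reassembles the N line from already-extracted sub-property text. The function name `vcard_n_line` is hypothetical, and escaping of special characters inside values is deliberately omitted.

```python
# Field order of the vCard N value, as in the RFC 2426 example above.
N_ORDER = (
    "family-name",
    "given-name",
    "additional-names",
    "honorific-prefixes",
    "honorific-suffixes",
)

def vcard_n_line(sub_properties):
    """Build the vCard N line; missing sub-properties become empty fields."""
    fields = [sub_properties.get(name, "") for name in N_ORDER]
    return "N:" + ";".join(fields)

parsed = {
    "honorific-prefixes": "Mr.",
    "given-name": "John",
    "additional-names": "Quinlan",
    "family-name": "Public",
    "honorific-suffixes": "Esq.,Ph.D.",
}
print(vcard_n_line(parsed))  # N:Public;John;Quinlan;Mr.;Esq.,Ph.D.
```

Note that the order of the fields in the vCard value is fixed, while the order of the sub-property elements in the markup is free to follow natural display order.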
ISSUE: Should we make plural sub-property names into singular versions and simply allow multiple instances? I.e. the singular honorific prefix would make more sense if it was classed as such, and the list implied by the value for honorific-suffixes could be made more explicit (and thus more easily machine parseable):
<span class="n">
 <span class="honorific-prefix">Mr.</span>
 <span class="given-name">John</span>
 <span class="additional-names">Quinlan</span>
 <span class="family-name">Public</span>,
 <span class="honorific-suffix">Esq.</span>,
 <span class="honorific-suffix">Ph.D.</span>
</span>
I am leaning towards adopting singular class name equivalents for plural property and sub-property names. -Tantek
hCard property parameters
Some hCard properties have one or more parameters, most often "type", with an enumerated set of values. We represent the specific value of the parameter as a class name on an element inside the element representing the property.
For example, the "adr" property has a "type" parameter which takes the values: "dom", "intl", "postal", "parcel", "home", "work", "pref".
Currently the way to encode the "type" of an "adr" property is with a nested element that has a class name value of the "type" (or types) of the "adr".
Example with the "tel" property with a value of type "work":
<span class="tel"><span class="work">+1.123.456.7890</span></span>
You could then use CSS to style every "tel" one way and a work "tel" another:
.tel { color: black; }
.tel .work { color: red; }
ISSUE: Restricting the "type" sub-property values to being expressed in class names seems less than ideal. It's taking a piece of information which is very often visible in the content, and forcing it to be invisible.
Here is an example of an extensive bit of contact information on a web page:
http://www.patchlink.com/company/contact.html
Mailing Address
3370 N. Hayden Road, #123-175
Scottsdale, AZ 85251-6632
Physical Address
8515 E Anderson
Scottsdale, AZ 85255
Note that the type information for each "adr" is explicit in the content. This content could be marked up like this:
<div class="adr">
 <abbr style="display:block" class="type" title="postal,parcel">Mailing Address</abbr>
 <div class="street-address">3370 N. Hayden Road, #123-175</div>
 <span class="locality">Scottsdale</span>, <span class="region">AZ</span> <span class="postal-code">85251-6632</span>
</div>
<div class="adr">
 <abbr style="display:block" class="type" title="work,pref">Physical Address</abbr>
 <div class="street-address">8515 E Anderson</div>
 <span class="locality">Scottsdale</span>, <span class="region">AZ</span> <span class="postal-code">85255</span>
</div>
I am strongly thinking of switching hCard to requiring that the "type" parameter be marked-up when content is available (like the above two examples), and ditching the type-value-as-another class name pattern. - Tantek
There are some potential problems with the "type" parameter for TEL and EMAIL properties. Since there are no defined sub-properties (unlike ADR's postal-code, locality, etc.) the entire node value of TEL is taken as the value. For example:
<span class="tel">+1.123.456.7890 <abbr class="type" title="work">(work)</abbr></span>
would be represented in vCard as:
TEL;TYPE=work:+1.123.456.7890 (work)
The additional "(work)" part is there because it is part of the node value of class="tel". There are two ways to avoid this:
- make the parsers smart enough to ignore descendants with class="type"
- introduce another sub-property class="value"
<span class="tel"><span class="value">+1.123.456.7890</span> <abbr class="type" title="work">(work)</abbr></span>
Then parsers would only need to look for class="value" and take the node value of that rather than class="tel".
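The second option can be sketched as follows. This assumes the proposal discussed above (an abbr with class="type" carrying the type in its title, and an optional class="value" element), which is still an open issue rather than settled hCard behavior. `TelParser` and `tel_to_vcard` are hypothetical names, and the sketch assumes every start tag in the input has a matching end tag.

```python
from html.parser import HTMLParser

class TelParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.stack = []        # class tokens of each currently open element
        self.all_text = []     # every piece of text inside the "tel" element
        self.value_text = []   # only text inside a class="value" element
        self.types = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        tokens = (attrs.get("class") or "").split()
        self.stack.append(tokens)
        # Assumption: the type is read from the title of <abbr class="type">.
        if tag == "abbr" and "type" in tokens and attrs.get("title"):
            self.types.extend(t.strip() for t in attrs["title"].split(","))

    def handle_endtag(self, tag):
        if self.stack:
            self.stack.pop()

    def handle_data(self, data):
        self.all_text.append(data)
        if any("value" in tokens for tokens in self.stack):
            self.value_text.append(data)

def tel_to_vcard(tel_markup):
    parser = TelParser()
    parser.feed(tel_markup)
    parser.close()
    # Prefer class="value" when present; otherwise use the whole node value.
    text = "".join(parser.value_text or parser.all_text).strip()
    params = ";TYPE=" + ",".join(parser.types) if parser.types else ""
    return "TEL" + params + ":" + text

print(tel_to_vcard('<span class="tel"><span class="value">+1.123.456.7890</span> '
                   '<abbr class="type" title="work">(work)</abbr></span>'))
```

Without a class="value" element the same function falls back to the full node value, reproducing the "(work)" problem described above.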
Work In Progress
I'm still in the process of writing this document. Please avoid non-trivial edits. Thanks, Tantek
References
Normative References
- hCard
- vCard (RFC 2426)
- XHTML 1.0 Recommendation
- HTML 4.01 Recommendation
- XMDP