hCard parsing

Revision as of 23:02, 26 July 2010 by Tantek (Talk | contribs)



by Tantek Çelik


introduction

When I first conceived of hCard, it was clear to me how to unambiguously parse both for the existence of hCards in arbitrary (X)HTML (and anywhere that arbitrary (X)HTML can be embedded, e.g. RSS, Atom, "generic XML"), and hCard properties and values.

I worked directly with Brian Suda to capture these thoughts in an implementation, and Brian wrote X2V, an XSLT script that converts hCards to vCards, thus simultaneously demonstrating the parsability of hCards, and the immediate utility of hCard content interoperating with widespread existing vCard applications.

I am now documenting those thoughts directly here so that additional implementations, rather than having to reverse engineer X2V, can be built directly from these elementary concepts.

scope

Although this page is written specifically to explain how to parse hCard, the concepts and algorithms contained herein serve as an example of how other compound microformats are to be parsed.

URL handling

An hCard parser may begin with a URL to retrieve.

If the URL lacks a fragment identifier, then the parser should parse the entire retrieved resource for hCards.

If the URL has a fragment identifier, then the parser should parse only the node indicated by the fragment identifier and its descendants, looking for hCards, starting with the indicated node, which may itself be a single hCard.
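The two cases above can be sketched in Python as follows. This is only an illustration, not a normative algorithm: the function name `select_parse_root` is hypothetical, the document is assumed to be already retrieved and parsed as well-formed XHTML, and the fragment identifier is assumed to name an element by its `id` attribute.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urldefrag


def select_parse_root(document_root, url):
    """Return the element under which to look for hCards.

    Assumption: the fragment identifier refers to an element's id attribute.
    Returns None when the fragment matches no element.
    """
    _, fragment = urldefrag(url)
    if not fragment:
        # No fragment identifier: parse the entire retrieved resource.
        return document_root
    for element in document_root.iter():
        if element.get("id") == fragment:
            # Parse only this node and its descendants.
            return element
    return None


doc = ET.fromstring(
    '<body>'
    '<div id="a" class="vcard"><span class="fn">Ann</span></div>'
    '<div id="b" class="vcard"><span class="fn">Bob</span></div>'
    '</body>'
)
whole = select_parse_root(doc, "http://example.com/people")
only_b = select_parse_root(doc, "http://example.com/people#b")
```

Here `whole` is the entire document, while `only_b` is the single element with id "b", which happens to be an hCard itself.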

root class name

Each compound microformat starts with a root element with a relatively unique class name. By that I mean a class name which isn't simply a common word, and is unlikely to have been used outside the context of the microformat. By choosing such a root class name the microformat avoids (for all practical purposes) colliding with existing class names that may exist within the (X)HTML context. This is essential to enabling such compound microformats to be embedded inside current, existing content, as well as future content.

Fortunately this is not a new problem to solve. The root object names chosen for vCard (RFC 2426) and iCalendar (RFC 2445) similarly had to avoid such collisions and did so by choosing names that were unlikely to have been introduced into a MIME object context. The principle of reuse dictates that we should reuse the names for these root objects in those RFCs rather than invent our own. Given the same semantics, a design should reuse the names, rather than inventing a second name for the same semantic (a common design mistake made in environments that require namespaces).

In the vCard specification, the names are case-insensitive due to the (lack of) requirements of their context. (X)HTML class names are case sensitive per those specifications. Thus we are required to pick a canonical case for the class name equivalents of vCard object and property names. All lowercase is chosen to follow the precedent (i.e. reuse the pattern) set by XHTML, which similarly had to canonicalize the case of element and attribute names that it took from HTML4, which itself was case-insensitive due to its context (SGML). Additionally, reasons for avoiding mixed-case (e.g. camel case) in the context of class names may be found in the essay A Touch of Class, specifically, the section titled Class sensitivity.

Thus the root class name of an hCard is "vcard".

finding hCards

An (X)HTML document indicates that it may contain hCards by referencing the hCard XMDP profile. See XMDP for more details.

A parser finds hCards in an (X)HTML context by looking for elements with the root class name vcard just as the following CSS class selector does:

 .vcard

For example, the following CSS style statement sets the background of all hCards to green:

 .vcard { background: green; }

Note that the (X)HTML class attribute is a space separated set of class names.

Thus all of the following are valid hCard root elements:

Once the root element of an hCard is found, that element and all its descendants (except those inside nested hCards) are all that is needed to parse the hCard.

Thus it is possible for a well-formed hCard to be extracted from an overall non well-formed context, if the parser has the ability to find elements by class name within that non well-formed context.
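As a rough illustration of the class matching described in this section, here is a minimal Python sketch. The helper names `has_class` and `find_hcard_roots` are hypothetical, and the sketch assumes the content has already been parsed into an element tree.

```python
import xml.etree.ElementTree as ET


def has_class(element, name):
    """True if name is one of the space separated class names on element."""
    return name in (element.get("class") or "").split()


def find_hcard_roots(root):
    """Return every element with the root class name, in document order."""
    return [el for el in root.iter() if has_class(el, "vcard")]


doc = ET.fromstring(
    '<body>'
    '<div class="vcard">a</div>'
    '<div class="person vcard featured">b</div>'
    '<div class="vcards">c</div>'
    '</body>'
)
roots = find_hcard_roots(doc)
```

Note that the match is on whole class names, as with the CSS selector, so "vcards" is not treated as an hCard root, while "person vcard featured" is.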

nested hCards

When parsing an hCard, if a parser finds a nested hCard, it should treat that hCard as its own object, and treat properties of the nested hCard as only belonging to the nested hCard, not the containing hCard.

This is essential for example for handling use of the "agent" property to nest an hCard that is the agent of another hCard. See hCard examples from RFC2426 AGENT Example 2 for an actual example.

Similarly, parsers should treat nested hCalendar, hReview, hResume, and xFolk in the same way: properties inside them MUST only apply to the nested microformat, not to the containing microformat.

All references below to "inside the hCard", "within the contents of the hCard", and similar phrasing MUST be interpreted taking this nesting rule into account.

hCard properties

The complete list of class names for hCard properties is documented in the hCard profile.

forward compatible parsing

When parsing the contents of an hCard, any unrecognized class names must be ignored.

Similarly, unrecognized values for hCard properties must also be ignored.
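A small Python sketch of both rules follows. The set `KNOWN_PROPERTIES` is deliberately partial and purely illustrative (the authoritative list is in the hCard profile), `ADR_TYPES` holds the type values of the "adr" property from RFC 2426, and the function names are hypothetical.

```python
# Partial, illustrative list only; see the hCard profile for the full set.
KNOWN_PROPERTIES = {"fn", "n", "adr", "tel", "email", "url", "org", "note"}

# Enumerated values of the type parameter of "adr" (RFC 2426).
ADR_TYPES = {"dom", "intl", "postal", "parcel", "home", "work", "pref"}


def recognized_properties(class_attribute):
    """Keep only the class names that are known hCard property names."""
    return [name for name in class_attribute.split() if name in KNOWN_PROPERTIES]


def recognized_adr_types(raw_types):
    """Silently drop any value that is not an enumerated adr type."""
    return [value for value in raw_types if value in ADR_TYPES]


props = recognized_properties("highlight fn nickname-2099")
types = recognized_adr_types(["work", "galactic", "pref"])
```

In both cases the unrecognized items are ignored without raising an error, which is what allows older parsers to keep working on content that uses later additions.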

finding hCard properties

To parse an hCard for an hCard property (e.g. "fn"), the parser simply looks for the first element with that class name inside the hCard.

This can also be expressed as the first element that matches this CSS selector:

.vcard .fn /* note exception for nested hCards, see below */

Some properties, like "fn", should only appear once, and thus the parser stops looking for the property after it has found the first occurrence in document order. Additional occurrences are ignored.

Other properties, like "adr", "email", "url", "tel", etc., may (and often do) appear more than once, and thus the parser continues to look for each occurrence within the contents of the hCard.

not finding nested hCard properties

Per the nested hCards rule, properties inside a nested hCard MUST NOT apply to the current hCard being parsed. E.g. elements with class name "fn" that match this CSS selector:

.vcard .vcard .fn

MUST NOT affect the outer hCard.
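One way to implement both the property search and the nesting exception is a document-order traversal that does not descend into nested hCards. The Python sketch below is an illustration under assumptions: the helper names are hypothetical, and it treats the root element of a nested hCard as still visible to the outer hCard (so that a property such as "agent" placed on that element can be found), while everything inside it is skipped.

```python
import xml.etree.ElementTree as ET


def has_class(element, name):
    return name in (element.get("class") or "").split()


def iter_own_descendants(hcard):
    """Yield descendants in document order without entering nested hCards."""
    for child in hcard:
        yield child
        if not has_class(child, "vcard"):
            yield from iter_own_descendants(child)


def find_property(hcard, name, singular=False):
    """Return the matching elements; only the first one for singular properties."""
    matches = [el for el in iter_own_descendants(hcard) if has_class(el, name)]
    return matches[:1] if singular else matches


card = ET.fromstring(
    '<div class="vcard">'
    '<span class="fn">Outer Name</span>'
    '<div class="agent vcard">'
    '<span class="fn">Agent Name</span>'
    '<a class="email">agent@example.com</a>'
    '</div>'
    '<a class="email">a@example.com</a>'
    '<a class="email">b@example.com</a>'
    '<span class="fn">Ignored</span>'
    '</div>'
)
fn = find_property(card, "fn", singular=True)
emails = find_property(card, "email")
```

With this input, `fn` holds only the first "fn" of the outer hCard, and `emails` holds the two outer email elements but not the agent's.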

parsing hCard properties and values

In general, once an element for a property is found, that element is used for the value.

In particular, once an element for a property is found:

  1. first, look for class value children and use them as described below
  2. otherwise, if there is a more semantic exception, use that as described below
  3. finally, always fall back to using the contents of the element for the value

class value handling

For all properties, if the element for a property has one or more children with a class name of "value", then concatenate the node values for all those child elements with class name of "value", in their document order, and use that concatenation as the value of the property. (also called value excerpting)
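The rule above, together with the fallback to the contents of the element, might be sketched in Python like this. The function names are hypothetical, "node value" is taken here to mean the concatenated text content of an element, and only direct children are examined, following the wording of this section.

```python
import xml.etree.ElementTree as ET


def has_class(element, name):
    return name in (element.get("class") or "").split()


def node_text(element):
    """Concatenated text content of an element and its descendants."""
    return "".join(element.itertext())


def property_value(prop):
    """Apply value excerpting, else fall back to the element's contents."""
    values = [child for child in prop if has_class(child, "value")]
    if values:
        # Concatenate in document order.
        return "".join(node_text(v) for v in values)
    return node_text(prop)


tel = ET.fromstring(
    '<span class="tel"><span class="type">work</span>: '
    '<span class="value">+1.123</span><span class="value">.456.7890</span></span>'
)
fn = ET.fromstring('<span class="fn">John Public</span>')
```

For `tel`, the text outside the two "value" children (the type and the colon) is left out of the result; for `fn`, which has no "value" children, the whole content is used.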

more semantic exceptions

There are several exceptions to accommodate semantic XHTML and more semantic equivalents.

email property

For the "email" property in particular, when the element is:

tel property

For the "tel" property in particular, when the element is:

properties of type URL or URI

For properties that may take type URL or URI, parsers MUST handle relative URLs/URIs and normalize them to their respective absolute URLs/URIs, following the containing document's language's rules for resolving relative URLs/URIs (e.g. <base> for HTML, xml:base for XML).
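For the HTML case, the normalization could look like the following Python sketch. It is a simplification under assumptions: only a `<base href>` element is considered (xml:base is not handled), the function names are hypothetical, and the example URLs are made up.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin


def effective_base(document_root, document_url):
    """Base URL for the document: <base href> if present, else the document URL."""
    base = document_root.find(".//base")
    if base is not None and base.get("href"):
        return urljoin(document_url, base.get("href"))
    return document_url


def resolve(document_root, document_url, href):
    """Normalize a possibly relative URL to an absolute URL."""
    return urljoin(effective_base(document_root, document_url), href)


with_base = ET.fromstring(
    '<html><head><base href="http://example.org/people/"/></head>'
    '<body><a class="url" href="tantek/">Tantek</a></body></html>'
)
without_base = ET.fromstring('<html><body/></html>')

a = resolve(with_base, "http://example.com/page.html", "tantek/")
b = resolve(without_base, "http://example.com/a/b.html", "../about")
```

An absolute URL passed as `href` comes back unchanged, so the same function can be applied to every URL or URI valued property.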

properties of type URL or URI or UID

For properties that may take type URL, URI, or UID, when the element for that property is:

properties not of type URL or URI or UID

For properties with values NOT of type URL, URI, or UID, when the element for a property is:

all properties

For all properties, when the element for a property is:

white-space handling

hCard parsers should handle white-space parsing per XML white-space handling rules, with the following two additions:

  1. <pre> handling. Any content parsed as part of an hCard property that is inside a <pre> element must preserve all white-space per XML white-space preservation rules.
  2. <br /> handling. Any occurrence of a <br /> inside the element(s) for a value must be treated as a line break (\n) in the respective location in the value.
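The two additions can be sketched in Python as below. Note the loud assumption: outside of <pre>, this sketch collapses each run of white-space into a single space, which is only one possible reading of "XML white-space handling rules" and is not stated by this page. The function name is hypothetical.

```python
import re
import xml.etree.ElementTree as ET


def text_with_whitespace_rules(element, in_pre=False):
    """Text of element with <br/> as a line break and <pre> content preserved."""
    if element.tag == "br":
        return "\n"
    in_pre = in_pre or element.tag == "pre"

    def normalize(text):
        if not text:
            return ""
        # Assumption: collapse white-space runs outside <pre>; keep them inside.
        return text if in_pre else re.sub(r"\s+", " ", text)

    parts = [normalize(element.text)]
    for child in element:
        parts.append(text_with_whitespace_rules(child, in_pre))
        parts.append(normalize(child.tail))
    return "".join(parts)


note = ET.fromstring(
    '<div class="note">Line   one<br/>Line\n two<pre>  keep\n  this</pre></div>'
)
result = text_with_whitespace_rules(note)
```

In `result`, the <br/> has become a line break, the white-space inside <pre> is untouched, and the runs of white-space elsewhere have been collapsed.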

hCard sub-properties

There are some hCard properties whose values themselves have structure (AKA structured type value) and are composed of multiple pieces, which we refer to as sub-properties.

For example, the "n" property consists of the sub-properties "family-name", "given-name", "additional-name", "honorific-prefix", and "honorific-suffix".

E.g. from section 3.1.2 of RFC 2426, modified to include Ph.D.

N:Public;John;Quinlan;Mr.;Esq.,Ph.D.

In hCard this "n" property would be marked up as

<span class="n">
 <span class="honorific-prefix">Mr.</span>
 <span class="given-name">John</span>
 <span class="additional-name">Quinlan</span>
 <span class="family-name">Public</span>,
 <span class="honorific-suffix">Esq.</span>,
 <span class="honorific-suffix">Ph.D.</span>
</span>

Which would be rendered as:

Mr. John Quinlan Public, Esq., Ph.D.
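To show how sub-properties map back to the structured vCard value, here is a minimal Python sketch that rebuilds the N line from the markup above. The function name `n_to_vcard` is hypothetical, and the sketch does not escape commas or semicolons inside values, which a real converter would need to do.

```python
import xml.etree.ElementTree as ET

# Component order of the N property in vCard (RFC 2426).
N_ORDER = [
    "family-name",
    "given-name",
    "additional-name",
    "honorific-prefix",
    "honorific-suffix",
]


def has_class(element, name):
    return name in (element.get("class") or "").split()


def n_to_vcard(n_element):
    """Build a vCard N line; repeated sub-properties are joined with commas."""
    components = []
    for name in N_ORDER:
        values = [
            "".join(el.itertext()).strip()
            for el in n_element.iter()
            if has_class(el, name)
        ]
        components.append(",".join(values))
    return "N:" + ";".join(components)


n = ET.fromstring(
    '<span class="n">'
    ' <span class="honorific-prefix">Mr.</span>'
    ' <span class="given-name">John</span>'
    ' <span class="additional-name">Quinlan</span>'
    ' <span class="family-name">Public</span>,'
    ' <span class="honorific-suffix">Esq.</span>,'
    ' <span class="honorific-suffix">Ph.D.</span>'
    '</span>'
)
line = n_to_vcard(n)
```

The punctuation between the spans is presentation only and does not end up in the vCard value, because only the sub-property elements are read.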

hCard property parameters

Some hCard properties have one or more parameters, most often "type", with an enumerated set of values. We represent the specific value of the parameter as a class name on an element inside the element representing the property.

For example, the "adr" property has a type parameter which takes the values: "dom", "intl", "postal", "parcel", "home", "work", "pref".

The "type" parameter is treated like a sub-property.

To encode the "type" of an "adr" property, a nested element with class="type" is used to mark up the value of the type parameter.

Example with the "tel" property with a value of type "work":

<span class="tel">
 <span class="type">work</span>: 
 <span class="value">+1.123.456.7890</span>
</span>

Value excerpting

Note the element with class="value" used in the above example.

Sometimes only part of an element which is the equivalent for a property should be used for the value of the property. This typically occurs when a property has a subtype, like TEL. For this purpose, the special class name "value" is introduced to excerpt out the subset of the element that is the value of the property.

Per the section in hCard on type with unspecified value, if the subtype is specified on a property, and there is no descendant of the property element with class name of "value", then the remainder (excluding the subtype) of the property element is considered the value.
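The interaction between "type" and "value" can be sketched in Python as follows, including the fallback where the remainder of the element (excluding the subtype) is the value. All function names are hypothetical, and for simplicity the type is read from the text content of direct children with class "type".

```python
import xml.etree.ElementTree as ET


def has_class(element, name):
    return name in (element.get("class") or "").split()


def text_excluding(element, excluded_class):
    """Text content of element, skipping subtrees that carry excluded_class."""
    parts = [element.text or ""]
    for child in element:
        if not has_class(child, excluded_class):
            parts.append(text_excluding(child, excluded_class))
        parts.append(child.tail or "")
    return "".join(parts)


def parse_typed_property(prop):
    """Return (types, value) for a property such as tel."""
    types = ["".join(c.itertext()).strip() for c in prop if has_class(c, "type")]
    values = [c for c in prop if has_class(c, "value")]
    if values:
        value = "".join("".join(v.itertext()) for v in values)
    else:
        # No "value" element: the remainder, excluding the subtype, is the value.
        value = text_excluding(prop, "type")
    return types, value.strip()


def tel_to_vcard(prop):
    types, value = parse_typed_property(prop)
    params = ";TYPE=" + ",".join(types) if types else ""
    return "TEL" + params + ":" + value


excerpted = ET.fromstring(
    '<span class="tel"><span class="type">work</span>: '
    '<span class="value">+1.123.456.7890</span></span>'
)
remainder = ET.fromstring(
    '<span class="tel"><span class="type">work</span> +1.123.456.7890</span>'
)
```

Both inputs produce the same vCard line; in the first the colon is dropped because it sits outside the "value" element, and in the second the value is whatever is left once the "type" element is excluded.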

Include Pattern and Table Headers

When processing elements from the include-pattern or table headers inclusion methods, such elements should be processed as if they were inline.

Proposed Additions

These are proposed additions to hCard parsing. Implementations MAY follow these conventions in order to gain implementation experience, and SHOULD report back on the results.

DEL element handling

When dealing with an HTML document that is hCard encoded, the parser SHOULD honor the <del> element.

There are two possibilities here (adopting both may be possible):

1. Skip any occurrences of <del> elements and their contents entirely inside the contents of a property.

2. If the <del> element is used for a property itself, it could be useful as a way of communicating the tombstoning / obsoleting of that particular property value. Thus, while a parser that is converting to a vCard SHOULD simply do what is indicated in (1), applications which parse hCard directly (rather than only supporting vCard) COULD treat such occurrences of <del> elements as a way to remove obsolete information (with user confirmation of course) from a local contact information store.
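Possibility (1) is straightforward to sketch in Python. The function name `text_without_del` is hypothetical, and since this whole section is a proposal, the sketch should be read as an experiment rather than settled behavior.

```python
import xml.etree.ElementTree as ET


def text_without_del(element):
    """Text content of element, skipping <del> elements and their contents."""
    if element.tag == "del":
        return ""
    parts = [element.text or ""]
    for child in element:
        parts.append(text_without_del(child))
        # Text following a <del> is still part of the property.
        parts.append(child.tail or "")
    return "".join(parts)


tel = ET.fromstring(
    '<span class="tel"><del>+1.999.999.9999</del> +1.123.456.7890</span>'
)
value = text_without_del(tel).strip()
```

If the property element itself is a <del>, this function returns an empty string, which matches what a vCard converter would do under (1); handling (2) would require the application to look at the element before calling it.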

Plain Text Formatting of Structural/Semantic HTML

There are several structural/semantic elements in HTML which have useful default styling which could be converted into ASCII (AKA Plain Text) equivalents as a low resolution way of communicating that structure. Note that <br /> and <pre> are already handled in the section above titled White-space Handling.

When parsing the hCard note property (or description in hCalendar and hReview), hierarchically convert the following HTML tags into their plain text styling equivalents.


More challenging elements


Use of CSS computed styles instead of HTML default styles

Rather than assuming the default presentation for these elements, and using that as the basis of plain text formatting, a parser could use the respective equivalent computed style properties instead. However, requiring an hCard parser to also implement Cascading Style Sheets (e.g. CSS1) is out of scope. Some environments (e.g. a browser DOM) may already provide this information, and in that case, it may be easy for an hCard parser (e.g. a client-side JavaScript parser) to use computed style properties. E.g. instead of the elements above, the following computed styles could be used:

This is enough extra work that I'm not sure it is worth spending the time documenting more equivalents. The above are sufficient to illustrate the possibility.

Outstanding Issues

ISSUE 3

It might be worth defining the parsing in terms of the DOM, so that it applies to HTML and XHTML equally without ambiguity.

Resolved Issues

This section is informative.

The following issues have been explored and resolved inline in the text of hcard-parsing above.

Resolved as of 2005-09-16

ISSUE 1

Should we make plural sub-property names into singular versions and simply allow multiple instances? I.e. the singular honorific prefix would make more sense if it was classed as such, and the list implied by the value for honorific-suffixes could be made more explicit (and thus more easily machine parseable):

<span class="n">
 <span class="honorific-prefix">Mr.</span>
 <span class="given-name">John</span>
 <span class="additional-names">Quinlan</span>
 <span class="family-name">Public</span>,
 <span class="honorific-suffix">Esq.</span>,
 <span class="honorific-suffix">Ph.D.</span>
</span>

RESOLUTION: Adopt singular class name equivalents for plural property and sub-property names.


ISSUE 2

Restricting the "type" sub-property values to being expressed in class names seems less than ideal. It's taking a piece of information which is very often visible in the content, and forcing it to be invisible.

Here is an example of an extensive bit of contact information on a web page:

http://www.patchlink.com/company/contact.html

Mailing Address
3370 N. Hayden Road, #123-175
Scottsdale, AZ 85251-6632

Physical Address
8515 E Anderson
Scottsdale, AZ 85255

Note that the type information for each "adr" is explicit in the content. This content could be marked up like this:

<div class="adr">
<abbr style="display:block" class="type" title="postal,parcel">Mailing Address</abbr>
<div class="street-address">3370 N. Hayden Road, #123-175</div>
<span class="locality">Scottsdale</span>, <span class="region">AZ</span>
<span class="postal-code">85251-6632</span>
</div>
<div class="adr">
<abbr style="display:block" class="type" title="work,pref">Physical Address</abbr>
<div class="street-address">8515 E Anderson</div>
<span class="locality">Scottsdale</span>, <span class="region">AZ</span> 
<span class="postal-code">85255</span>
</div>

RESOLUTION: The "type" parameter MUST be marked up when content is available (like the above two examples). We are dropping the pattern of expressing the type value as another class name.

In addition, there are some potential problems with the "type" parameter for the TEL and EMAIL properties. Since they have no defined sub-properties (unlike ADR's postal-code, locality, etc.), the entire node value of TEL is taken as the value. For example:

<span class="tel">+1.123.456.7890 <abbr class="type" title="work">(work)</abbr></span>

would be represented in vCard as:

TEL;TYPE=work:+1.123.456.7890 (work)

We are introducing another sub-property, class="value", to enable excerpting the value of a property from within the element for that property.

<span class="tel"><span class="value">+1.123.456.7890</span> <abbr class="type" title="work">(work)</abbr></span>

Parsers would then first need to look for class="value" and, if it exists, take its node value rather than the node value of the element with class="tel".

If one or more child elements with the class name of "value" are present inside the element for a property, then concatenate the node values of those child elements (in the order found) and use that as the value of the property. This would be before using the node value of the element for a property itself.

References

Normative References

Informative References

Related Pages

The hCard specification is a work in progress. As additional aspects are discussed, understood, and written, they will be added. These thoughts, issues, and questions are kept in separate pages.

