hcard-parsing-fr
hCard parsing
by Tantek Çelik
(translation in progress: Christophe Ducamp)
introduction
When I first designed hCard, it was clear to me how to unambiguously parse the existence of hCards in arbitrary (X)HTML (and anywhere that could embed arbitrary (X)HTML, e.g. RSS, Atom, "generic XML"), and the properties and values of the hCards.
I worked directly with Brian Suda to capture these thoughts in an implementation, and Brian wrote X2V, an XSLT script that converts hCards into vCards, thereby simultaneously demonstrating both the parsability of hCards and the immediate utility of hCard content interoperating with widely deployed existing vCard applications.
I am now documenting these thoughts directly on this page, so that additional implementations can be built from these basic concepts, rather than having to reverse engineer X2V.
scope
Although this page is written specifically to explain how to parse hCard, the concepts and algorithms contained within serve as an example of how other compound microformats are to be parsed.
URL handling
An hCard parser may begin with a URL to retrieve.
If the URL lacks a fragment identifier, then the parser should parse the entire retrieved resource for hCards.
If the URL has a fragment identifier, then the parser should parse only the node indicated by the fragment identifier and its descendants, looking for hCards, starting with the indicated node, which may itself be a single hCard.
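The fragment rule above can be sketched in a few lines of Python. This is only an illustration: the function name `select_parse_root` is hypothetical, the document is assumed to be already fetched and well-formed XHTML without namespaces, and the fragment is assumed to refer to an `id` attribute.

```python
from urllib.parse import urldefrag
import xml.etree.ElementTree as ET

def select_parse_root(url, document):
    """Return the element a parser should start from for the given URL.

    `document` is an already-fetched, well-formed XHTML string (an
    assumption made to keep this sketch free of network access).
    """
    _, fragment = urldefrag(url)
    root = ET.fromstring(document)
    if not fragment:
        return root  # no fragment: parse the whole resource
    for element in root.iter():
        if element.get("id") == fragment:
            return element  # fragment: parse only this node and its descendants
    return None  # the fragment does not identify any node
```

The returned element may itself be the hCard root, so a caller would check it for the root class name before looking at its descendants.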
root class name
Each compound microformat starts with a root element with a relatively unique class name. By that I mean a class name that is not just a common word and is unlikely to be used outside the context of the microformat. By picking such a root class name, the microformat avoids (for all practical purposes) conflicting with existing class names that may exist in the (X)HTML context. This is essential to allowing such compound microformats to be embedded in current, existing content as well as future content.
Fortunately this is not a new problem to solve. The root object names chosen for vCard (RFC 2426) and iCalendar (RFC 2445) similarly had to avoid such collisions, and did so by choosing names that were unlikely to be introduced inside a MIME context object. The principle of reuse dictates that we should reuse the names for these root objects from those RFCs rather than invent our own. Given the same semantics, a design should reuse the names rather than invent a second name for the same semantics (a common design mistake made in environments that require namespaces).
In the vCard specification, names are case-insensitive due to the (lack of) requirements of their context. (X)HTML class names are case-sensitive per their specifications. Thus we are required to pick a canonical case for the class name equivalents of vCard object names and property names. All lowercase is chosen, following the precedent set (i.e. following the reuse pattern) by XHTML, which similarly had to canonicalize the case of the element and attribute names it took from HTML4, which itself was case-insensitive due to its context (SGML). In addition, reasons for avoiding mixed case (e.g. camelCase) in class names can be found in the article A Touch of Class, specifically the section titled Class sensitivity.
Thus the root class name for an hCard is "vcard".
finding hCards
An (X)HTML document indicates that it may contain hCards by referencing the hCard XMDP profile. See XMDP for more details.
A parser finds hCards in an (X)HTML context by looking for elements with the root class name "vcard", just as the following CSS class selector does:
.vcard
For example, the following CSS style rule sets the background of all hCards to green:
.vcard { background: green; }
Note that the (X)HTML class attribute is a space-separated list of class names.
Thus all of the following are valid hCard root elements:
<div class="vcard"> </div>
<span class="attendee vcard"> </span>
<address class="vcard author"> </address>
<li class="reviewer vcard first"> </li>
Once the root element of an hCard is found, that element and all its descendants are all that is required to parse the hCard.
Thus it is possible for a well-formed hCard to be extracted from a non-well-formed general context, if the parser is capable of finding elements by class name in that non-well-formed context.
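As a rough illustration of matching the root class name, the sketch below splits the class attribute on white-space and compares names case-sensitively. The helper names `has_class` and `find_hcards` are hypothetical, and the input is assumed to be well-formed so that the standard library XML parser can read it; a parser meant for non-well-formed contexts would need a more tolerant way of finding elements.

```python
import xml.etree.ElementTree as ET

def has_class(element, name):
    """True if `name` is one of the space-separated class names on `element`."""
    return name in element.get("class", "").split()

def find_hcards(root):
    """Return every element carrying the root class name "vcard", in document order."""
    return [element for element in root.iter() if has_class(element, "vcard")]
```

Because the attribute is split into whole names, `class="vcards"` and `class="VCARD"` do not match, while `class="reviewer vcard first"` does.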
hCard properties
The complete list of class names for hCard properties is documented in the hCard profile.
forward compatible parsing
When parsing the contents of an hCard, any unrecognized class names must be ignored.
Similarly, unrecognized values for hCard properties must also be ignored.
finding hCard properties
To parse an hCard for an hCard property (e.g. "fn"), the parser simply looks for the first element with that class name inside the hCard.
This can also be expressed as the first element that matches this CSS selector:
.vcard .fn
Some properties, like "fn", should only appear once, and thus the parser stops looking for the property after it has found one occurrence. Additional occurrences are ignored.
Other properties, like "adr", "email", "url", "tel", etc., may (and often do) appear more than once, and thus the parser continues to look for each occurrence within the contents of the hCard.
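A minimal sketch of this lookup, assuming well-formed input, might look as follows. The `SINGULAR` and `PLURAL` sets are illustrative subsets taken from the examples above, not the full list from the hCard profile, and `find_properties` is a hypothetical name. Unrecognized class names are simply skipped, which is one way to get the forward compatible behavior described earlier.

```python
import xml.etree.ElementTree as ET

# Illustrative subsets only; the full list is in the hCard profile.
SINGULAR = {"fn"}
PLURAL = {"adr", "email", "url", "tel"}

def find_properties(hcard):
    """Collect property text from the descendants of an hCard root element."""
    found = {}
    for element in hcard.iter():
        if element is hcard:
            continue  # only look inside the hCard
        text = "".join(element.itertext()).strip()
        for name in element.get("class", "").split():
            if name in SINGULAR:
                found.setdefault(name, text)  # keep the first occurrence only
            elif name in PLURAL:
                found.setdefault(name, []).append(text)  # keep every occurrence
            # any other class name is unrecognized and ignored
    return found
```

For a card with two "fn" elements and two "tel" elements, the result holds the first "fn" as a string and both "tel" values as a list.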
parsing hCard properties and values
Once an element for a property is found, the contents of the element are used for the value.
There are several exceptions to accommodate semantic XHTML and more semantic equivalents.
For the "email" property in particular, when the element is:
<a href="mailto:..."> OR <area href="mailto:...">
: parse the value of the 'href' attribute, omitting the "mailto:" prefix and any "?" query suffix (if present) in the attribute. For details on the "mailto:" URL scheme, see RFC 2368.
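The "mailto:" rule amounts to two string operations. The sketch below is a simplification: the function name is hypothetical, and it does not decode percent-escapes or handle multiple recipients.

```python
def email_from_href(href):
    """Strip the "mailto:" prefix and any "?" query suffix from an href value."""
    value = href
    if value.lower().startswith("mailto:"):
        value = value[len("mailto:"):]
    # Everything from the first "?" onward is the query suffix.
    return value.split("?", 1)[0]
```

For example, `email_from_href("mailto:jane@example.com?subject=Hi")` yields `jane@example.com`.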
For properties that may take type URL or URI, parsers MUST handle relative URLs/URIs and normalize them to their respective absolute URLs/URIs, following the containing document's language's rules for resolving relative URLs/URIs (e.g. <base> for HTML, xml:base for XML).
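One way to sketch the normalization in Python is with `urllib.parse.urljoin`. The function name `absolutize` and its parameters are hypothetical; the `base_href` argument stands in for whatever base the document declares (for instance the href of an HTML <base> element), and finding that declaration is left out.

```python
from urllib.parse import urljoin

def absolutize(value, document_url, base_href=None):
    """Resolve a possibly relative URL against the document's effective base."""
    # A declared base may itself be relative to the document URL.
    base = urljoin(document_url, base_href) if base_href else document_url
    return urljoin(base, value)
```

With no declared base, `absolutize("photo.jpg", "http://example.com/people/jane.html")` gives `http://example.com/people/photo.jpg`.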
For properties that may take type URL, URI, or UID, when the element for that property is:
<a href> OR <area href>
: use the value of the 'href' attribute.
<img src>
: use the value of the 'src' attribute. If the 'src' is a "data:" URL, use the MIME type in that "data:" URL for the TYPE subproperty; otherwise, if the 'type' attribute is present, use that for the TYPE subproperty.
<object data>
: use the value of the 'data' attribute. If the 'data' is a "data:" URL, use the MIME type in that "data:" URL for the TYPE subproperty; otherwise, if the 'type' attribute is present, use that for the TYPE subproperty.
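The TYPE selection for <img> and <object> can be sketched as below. The function name `type_subproperty` is hypothetical, and the handling of a "data:" URL that omits its MIME type (returning nothing) is an assumption, since the text above does not cover that case.

```python
def type_subproperty(url, type_attr=None):
    """Pick the TYPE subproperty for a 'src' or 'data' attribute value."""
    if url.lower().startswith("data:"):
        # A data: URL looks like data:<mime>[;params],<payload>
        header = url[len("data:"):].split(",", 1)[0]
        mime = header.split(";", 1)[0]
        return mime or None  # assumption: no TYPE if the MIME type is omitted
    return type_attr  # may be None when the 'type' attribute is absent
```

So a value starting with `data:image/png;base64,` gives `image/png`, while an ordinary URL falls back to the 'type' attribute when there is one.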
For properties with values NOT of type URL, URI, or UID, when the element for a property is:
<img alt> OR <area alt>
: use the value of the 'alt' attribute.
For all properties, when the element for a property is:
<abbr>
: use the value of the 'title' attribute if present, otherwise the contents of the element.
- For properties which take an ISO8601 datetime value, parsers *should* pad any necessary precision (e.g. seconds), and *should* normalize any datetimes with timezone offsets (e.g. 20050814T2305-0700) into UTC (20050815T060500Z). Note that floating dates and times MUST NOT be made into UTC/Z absolute time zoned values.
<br /> OR <hr />
: the value is the empty string. These two elements do not represent any semantics and thus it is probably an error (at least an abuse) for an author to use them with microformats class names. Nonetheless, if found, treat the value as empty.
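The datetime padding and normalization described above can be sketched as follows. This is a limited illustration: it accepts only the compact ISO8601 form used in the example (YYYYMMDDTHHMM with optional seconds and an optional Z or ±HHMM offset), and `normalize_datetime` is a hypothetical name.

```python
import re
from datetime import datetime, timedelta

# Compact form only: date, "T", hours, minutes, optional seconds, optional zone.
_PATTERN = re.compile(r"^(\d{4})(\d{2})(\d{2})T(\d{2})(\d{2})(\d{2})?(Z|[+-]\d{4})?$")

def normalize_datetime(value):
    """Pad seconds; convert offset datetimes to UTC; leave floating values unzoned."""
    match = _PATTERN.match(value)
    if match is None:
        raise ValueError(f"unsupported datetime: {value!r}")
    year, month, day, hour, minute, second, zone = match.groups()
    moment = datetime(int(year), int(month), int(day),
                      int(hour), int(minute), int(second or 0))
    if zone is None:
        return moment.strftime("%Y%m%dT%H%M%S")  # floating: never add "Z"
    if zone != "Z":
        sign = 1 if zone[0] == "+" else -1
        offset = timedelta(hours=int(zone[1:3]), minutes=int(zone[3:5])) * sign
        moment -= offset  # local time minus offset gives UTC
    return moment.strftime("%Y%m%dT%H%M%SZ")
```

Applied to the example from the text, `normalize_datetime("20050814T2305-0700")` returns `20050815T060500Z`, while the floating value `20050814T2305` is only padded to `20050814T230500`.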
For all properties, if the element for a property has one or more children with a class name of "value", then concatenate the node values for all those child elements with class name of "value" in their document order, and use that concatenation as the value of the property.
white-space handling
hCard parsers should handle white-space parsing per XML white-space handling rules, with the following two additions:
- <pre> handling. Any content parsed as part of an hCard property that is inside a <pre> element must preserve all white-space per XML white-space preservation rules.
- <br /> handling. Any occurrence of a <br /> inside the element(s) for a value must be treated as a line break (\n) in the respective location in the value.
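The <br /> rule can be sketched with a small recursive walk. The function name `node_value` is hypothetical and the input is assumed to be well-formed. The sketch never collapses white-space, so <pre> content is preserved trivially; a fuller parser that normalizes white-space elsewhere would need to special-case <pre>.

```python
import xml.etree.ElementTree as ET

def node_value(element):
    """Concatenate text content, turning each <br /> into a line break."""
    parts = [element.text or ""]
    for child in element:
        if child.tag == "br":
            parts.append("\n")
        else:
            parts.append(node_value(child))
        parts.append(child.tail or "")  # text that follows the child element
    return "".join(parts)
```

For `<div>1 Main St<br />Springfield <b>IL</b></div>` this produces the two-line value "1 Main St" followed by "Springfield IL".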
hCard sub-properties
There are some hCard properties whose values themselves have structure (AKA structured type value) and are composed of multiple pieces, which we refer to as sub-properties.
For example, the "n" property consists of the sub-properties "family-name", "given-name", "additional-name", "honorific-prefix", and "honorific-suffix".
E.g. from section 3.1.2 of RFC 2426, modified to include Ph.D.
N:Public;John;Quinlan;Mr.;Esq.,Ph.D.
In hCard this "n" property would be marked up as
<span class="n"> <span class="honorific-prefix">Mr.</span> <span class="given-name">John</span> <span class="additional-name">Quinlan</span> <span class="family-name">Public</span>, <span class="honorific-suffix">Esq.</span>, <span class="honorific-suffix">Ph.D.</span> </span>
Which would be rendered as:
Mr. John Quinlan Public, Esq., Ph.D.
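To make the mapping between the markup and the vCard line concrete, here is a hedged sketch that rebuilds the N value from the sub-property class names. The name `n_to_vcard` is hypothetical, the field order follows the N example above, repeated sub-properties are joined with commas, and escaping of special characters in values is not handled.

```python
import xml.etree.ElementTree as ET

# Order of the structured N value, as in the vCard example above.
N_ORDER = ["family-name", "given-name", "additional-name",
           "honorific-prefix", "honorific-suffix"]

def n_to_vcard(n_element):
    """Build a vCard N line from an element classed "n"."""
    fields = {name: [] for name in N_ORDER}
    for element in n_element.iter():
        for name in element.get("class", "").split():
            if name in fields:
                fields[name].append("".join(element.itertext()).strip())
    return "N:" + ";".join(",".join(fields[name]) for name in N_ORDER)
```

Fed the markup shown above, the function reproduces the line `N:Public;John;Quinlan;Mr.;Esq.,Ph.D.`.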
hCard property parameters
Some hCard properties have one or more parameters, most often "type", with an enumerated set of values. We represent the specific value of the parameter as a class name on an element inside the element representing the property.
For example, the "adr" property has a type parameter which takes the values: "dom", "intl", "post", "parcel", "home", "work", "pref".
The "type" parameter is treated like a sub-property.
To encode the "type" of an "adr" property, a nested element with class="type" is used to mark up the value of the type parameter.
Example with the "tel" property with a value of type "work":
<span class="tel"> <span class="type">work</span>: <span class="value">+1.123.456.7890</span> </span>
value excerpting
Note the element with class="value" used in the above example.
Sometimes only part of an element which is the equivalent for a property should be used for the value of the property. This typically occurs when a property has a subtype, like TEL. For this purpose, the special class name "value" is introduced to excerpt out the subset of the element that is the value of the property.
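The interplay of "type" and "value" can be sketched as below, assuming well-formed input. All function names are hypothetical. Child elements classed "value" are concatenated in document order; when there are none, the whole element content is used, which is why an unexcerpted "tel" would carry its type label into the value.

```python
import xml.etree.ElementTree as ET

def has_class(element, name):
    return name in element.get("class", "").split()

def text_of(element):
    """An <abbr> with a title contributes its title; anything else its text content."""
    if element.tag == "abbr" and element.get("title") is not None:
        return element.get("title")
    return "".join(element.itertext())

def property_value(element):
    """Concatenate child elements classed "value"; fall back to the whole element."""
    excerpts = [child for child in element if has_class(child, "value")]
    if excerpts:
        return "".join(text_of(child) for child in excerpts)
    return text_of(element)

def tel_to_vcard(tel):
    """Render a "tel" element as a vCard TEL line with its TYPE parameters."""
    types = [text_of(child).strip() for child in tel if has_class(child, "type")]
    params = "".join(f";TYPE={kind}" for kind in types)
    return f"TEL{params}:{property_value(tel).strip()}"
```

For the "tel" example above this gives `TEL;TYPE=work:+1.123.456.7890`.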
Proposed Additions
These are proposed additions to hCard parsing. Implementations MAY follow these conventions in order to gain implementation experience, and SHOULD report back on the results.
DEL element handling
When dealing with an HTML document that is hCard encoded, the parser SHOULD honor the <del> element.
There are two possibilities here (adopting both may be possible):
1. Skip any occurrences of <del> elements and their contents entirely inside the contents of a property.
2. If the <del> element is used for a property itself, it could be useful as a way of communicating the tombstoning / obsoleting of that particular property value, and thus while a parser that is converting to a vCard SHOULD simply do what is indicated in (1), applications which parse hCard directly (rather than only supporting vCard) COULD treat such occurrences of <del> elements as a way to remove obsolete information (with user confirmation of course) from a local contact information store.
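Option 1 is straightforward to sketch: while collecting the text of a property, skip any <del> element and everything inside it, but keep the text that follows it. The function name `value_without_del` is hypothetical and the input is assumed to be well-formed.

```python
import xml.etree.ElementTree as ET

def value_without_del(element):
    """Text content of a property element, ignoring <del> elements and their contents."""
    parts = [element.text or ""]
    for child in element:
        if child.tag != "del":
            parts.append(value_without_del(child))
        parts.append(child.tail or "")  # text after the child is always kept
    return "".join(parts)
```

For `<span class="title">Senior <del>Junior </del>Engineer</span>` the value becomes "Senior Engineer".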
Plain Text Formatting of Structural/Semantic HTML
There are several structural/semantic elements in HTML which have useful default styling which could be converted into ASCII (AKA Plain Text) equivalents as a low resolution way of communicating that structure. Note that <br /> and <pre> are already handled in the section above titled White-space Handling.
When parsing the DESCRIPTION property, hierarchically convert the following HTML tags into their plain text styling equivalents.
<div>, </div>, <dl>, </dl>, <dt>, </li>, </dd>
- Append a soft \n to the output. By "soft \n", we mean only do so if there isn't already a line break (in contrast to a "hard" (implied by default) \n). Two things in particular are needed to ensure that <div> <div> does not result in two \n characters in a row:
  - only output the \n if something other than whitespace (including \n) was output immediately previously.
  - omit any immediately subsequent whitespace characters.
<li>
- Append a soft \n and then " * ". (Note: Indenting the contents of the list item is not particularly practical, since that would require line-breaking, and that would depend on knowing the width at which the plain text is rendered. Wrapping to 70 characters may be a good assumption for plain text email, but is probably a very bad assumption for vCard output.)
</dt>
- Append ":\n".
<dd>
- Append a soft \n and then "  " (two space ASCII 32 characters).
<h1>, </h1>, <h2>, </h2>, <h3>, </h3>, <h4>, </h4>, <h5>, </h5>, <h6>, </h6>
- Append a soft \n followed by a hard \n. (Note: we may want to consider some conventions to indicate the heading level. Perhaps only the relative heading level inside the property matters, e.g. whatever level HTML heading is seen first is treated as a first level heading, then any subsequent HTML heading elements are treated relative to that original heading (this is because it is likely that the property is embedded somewhere deep inside an HTML document following higher heading levels). Any subsequent higher level headings should perhaps cause a warning, and then simply be treated as a first level heading. Given that, the straw proposal for heading syntax from Ian Hickson is one reasonable possibility, with the only issue being that for first and second level headings, how wide to make the line of '-'s or '='s, which is a similar problem to the line-breaking problem noted above when considering indenting the contents of list-items. Thus perhaps it might be sufficient to simply set a first level heading in ALL CAPS (same as the third level heading in Ian's proposed syntax), and let second and deeper level headings be simply implied by the "one line of text with two line breaks both before and after" convention. Rarely has there been more than one level of heading found within a DESCRIPTION property, and I've never seen more than two even if it is possible.)
<p>, </p>
- Append a soft \n followed by a hard \n. (Note: typical books indent the start of a paragraph approximately three spaces.)
<q>, </q>
- Append a double-quote '"' character.
<sub>
- Append an open parenthesis "(".
</sub>
- Append a close parenthesis ")".
<sup>
- Append an open bracket "[".
</sup>
- Append a close bracket "]". <sup> elements are often used for footnotes, which in plain text are often formatted as bracketed numbers.
<table>, </table>, <tbody>, </tbody>, <thead>, </thead>, <tfoot>, </tfoot>, <tr>, </tr>, <caption>, </caption>
- Append a soft \n. Of course one could try to do a lot more with representing the structure of the table, but that is almost certainly more work than it is worth, never mind the complexities introduced by COLSPAN, ROWSPAN etc. At least by approximating the table sections and rows the table may be more readable.
</td>, </th>
- Append a space and a tab character (ASCII 32, ASCII 9 respectively). It's not clear what subsequent application would be able to make use of this visually, but at least the tabular nature of the structure is indicated and it makes it possible to copy and paste the table into something that handles tabular content like a spreadsheet and have the tabular structure reflected.
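A subset of these conversions can be sketched with Python's standard `html.parser`. This is an illustration of the "soft \n" idea rather than a complete formatter: the class and function names are hypothetical, and headings, paragraphs and tables are left out to keep the example short.

```python
from html.parser import HTMLParser

class PlainTextFormatter(HTMLParser):
    """Convert a few structural elements to plain text (partial sketch)."""

    def __init__(self):
        super().__init__()
        self.out = []
        self.skip_space = False  # drop whitespace right after a soft \n

    def emit(self, text):
        if self.skip_space:
            text = text.lstrip()
            if not text:
                return
            self.skip_space = False
        self.out.append(text)

    def emit_raw(self, text):
        self.out.append(text)
        self.skip_space = False

    def soft_newline(self):
        current = "".join(self.out)
        # Only break the line if the last thing written was not whitespace.
        if current and not current[-1].isspace():
            self.out.append("\n")
        self.skip_space = True

    def handle_starttag(self, tag, attrs):
        if tag in ("div", "dl", "dt"):
            self.soft_newline()
        elif tag == "li":
            self.soft_newline()
            self.emit_raw(" * ")
        elif tag == "dd":
            self.soft_newline()
            self.emit_raw("  ")
        elif tag in ("q", "sub", "sup"):
            self.emit_raw({"q": '"', "sub": "(", "sup": "["}[tag])

    def handle_endtag(self, tag):
        if tag in ("div", "dl", "li", "dd"):
            self.soft_newline()
        elif tag == "dt":
            self.emit_raw(":\n")
        elif tag in ("q", "sub", "sup"):
            self.emit_raw({"q": '"', "sub": ")", "sup": "]"}[tag])

    def handle_data(self, data):
        self.emit(data)

def to_plain_text(html):
    formatter = PlainTextFormatter()
    formatter.feed(html)
    formatter.close()  # flush any buffered trailing text
    return "".join(formatter.out)
```

With this sketch, nested <div> elements around a single word yield one trailing line break rather than two, and each <li> starts on its own line with a " * " prefix.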
More challenging elements
<ol>
- it would be nice to number list items inside an ordered list rather than bullet them, but keeping track of list item numbers/counts is a non-trivial piece of state information for the parser to deal with, and thus we are omitting this behavior for now.
Using CSS computed styles instead of HTML default styles
Rather than assuming the default presentation for these elements, and using that for the basis of plain text formatting, a parser could use the respective equivalent computed style properties and use those instead. However, requiring an hCard parser to also implement Cascading Style Sheets (e.g. CSS1) is out of scope. Some environments (i.e. a browser DOM) may already provide this information, and in that case, it may be easy for an hCard parser (e.g. a clientside javascript parser) to use computed style properties. E.g. instead of the elements above, the following computed styles could be used:
- display:block - Append a soft \n
- text-indent (non-zero value, on an element with display:block or display:list-item) - Append a number of spaces equivalent to the value of the text-indent property divided by the computed font-size property of the element.
- margin-top, margin-bottom (non-zero value, on an element with display:block or display:list-item) - Append a number of "\n" equivalent to the value divided by the computed font-size property of the element. Obviously this won't properly collapse vertical margins.
- display:list-item - Append a soft \n followed by " * "
- etc.
This is enough extra work that I'm not sure it is worth spending the time documenting more equivalents. The above are sufficient to illustrate the possibility.
Outstanding issues
ISSUE 3
Might be worth considering defining the parsing in terms of the DOM, so that it applies to HTML and XHTML equally without ambiguity.
Resolved issues
This section is informative.
The following issues have been explored and resolved
Resolved on 16 September 2005
ISSUE 1
Should we make plural sub-property names into singular versions and simply allow multiple instances? I.e. the singular honorific prefix would make more sense if it was classed as such, and the list implied by the value for honorific-suffixes could be made more explicit (and thus more easily machine parseable):
<span class="n"> <span class="honorific-prefix">Mr.</span> <span class="given-name">John</span> <span class="additional-names">Quinlan</span> <span class="family-name">Public</span>, <span class="honorific-suffix">Esq.</span>, <span class="honorific-suffix">Ph.D.</span> </span>
RESOLUTION: Adopt singular class name equivalents for plural property and sub-property names.
ISSUE 2
Restricting the "type" sub-property values to being expressed in class names seems less than ideal. It's taking a piece of information which is very often visible in the content, and forcing it to be invisible.
Here is an example of an extensive bit of contact information on a web page:
http://www.patchlink.com/company/contact.html
Mailing Address
3370 N. Hayden Road, #123-175
Scottsdale, AZ 85251-6632
Physical Address
8515 E Anderson
Scottsdale, AZ 85255
Note that the type information for each "adr" is explicit in the content. This content could be marked up like this:
<div class="adr"> <abbr style="display:block" class="type" title="postal,parcel">Mailing Address</abbr> <div class="street-address">3370 N. Hayden Road, #123-175</div> <span class="locality">Scottsdale</span>, <span class="region">AZ</span> <span class="postal-code">85251-6632</span> </div> <div class="adr"> <abbr style="display:block" class="type" title="work,pref">Physical Address</abbr> <div class="street-address">8515 E Anderson</div> <span class="locality">Scottsdale</span>, <span class="region">AZ</span> <span class="postal-code">85255</span> </div>
RESOLUTION: The "type" parameter MUST be marked-up when content is available (like the above two examples). We are ditching the type-value-as-another class name pattern.
In addition, there are some potential problems with the "type" parameter for the TEL and EMAIL properties. Since there are no defined sub-properties (unlike ADR's postal-code, locality, etc.), the entire node value of TEL is taken as the value. For example:
<span class="tel">+1.123.456.7890 <abbr class="type" title="work">(work)</abbr></span>
would be represented in vCard as:
TEL;TYPE=work:+1.123.456.7890 (work)
We are introducing another sub-property, class="value", to enable excerpting the value of a property from the element for that property.
<span class="tel"><span class="value">+1.123.456.7890</span> <abbr class="type" title="work">(work)</abbr></span>
Then parsers would first need to look for class="value" and take the node value of that if it exists rather than class="tel".
If one or more child elements with the class name of "value" are present inside the element for a property, then concatenate the node values of those child elements (in the order found) and use that as the value of the property. This would be before using the node value of the element for a property itself.
References
Normative References
- hCard
- vCard (RFC 2426)
- XHTML 1.0 Recommendation
- HTML 4.01 Recommendation
- XMDP