- 1 Brainstorming for hCard parsing
- 1.1 Additional Semantic HTML handling
- 1.2 multiple type parsing
- 1.3 fax and modem hyperlink parsing
- 1.4 Ambiguous name components
- 1.5 ADR with no children
- 1.6 See also
Brainstorming for hCard parsing
Add thoughts and proposals for improving or extending hCard parsing to this section of hCard brainstorming, and be sure to include URLs to examples of hCards in the wild that could benefit from parsing rule changes.
Additional Semantic HTML handling
acronym element handling
- Explicitly treat acronym the same as abbr, per the semantics of the 'title' attribute on acronym in particular, as defined in HTML 4.01.
- Explicitly treat acronym the same as span, and discourage use of acronym.
input element handling
One element I forgot at the time was the input element, specifically <input type="text">. Another I forgot was the textarea element.
<input type="text" value="...">: use the value of the 'value' attribute. If there is no 'value' attribute then treat the value as empty. Interactive user-agents MUST use the current value of the element.
- consider other input types also (e.g. checkbox, radio, hidden) and specify how to parse them as well.
<textarea>: use the text contents of the element. Interactive user-agents MUST use the current value of the element.
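The two rules above for <input type="text"> and <textarea> can be sketched roughly as follows. This is a minimal illustration built on Python's standard html.parser, not part of any existing hCard parser; the names FormValueParser and form_values are hypothetical, and the sketch only sees the initially loaded markup, not the current value an interactive user agent would use.

```python
from html.parser import HTMLParser


class FormValueParser(HTMLParser):
    """Collect (class names, value) pairs from text inputs and textareas."""

    def __init__(self):
        super().__init__()
        self.results = []       # list of (classes, value) tuples
        self._textarea = None   # classes of the textarea being read, if any
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "input" and (attrs.get("type") or "text").lower() == "text":
            # No 'value' attribute means the value is treated as empty.
            self.results.append((classes, attrs.get("value") or ""))
        elif tag == "textarea":
            self._textarea = classes
            self._buffer = []

    def handle_data(self, data):
        # Only text inside a textarea contributes to a value.
        if self._textarea is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "textarea" and self._textarea is not None:
            self.results.append((self._textarea, "".join(self._buffer)))
            self._textarea = None


def form_values(html):
    """Return the (classes, value) pairs found in an HTML fragment."""
    parser = FormValueParser()
    parser.feed(html)
    parser.close()
    return parser.results
```

The class names are returned alongside each value so that a caller can map them onto hCard properties such as given-name or family-name.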
If you go to a site that needs your contact info for something, say an ecommerce site for checkout, and if the form fields are marked up with hCard semantics per the above, then perhaps we could consider having that mean "insert hCard here".
Interactive user agents (e.g. the Operator plugin for Firefox) could detect such "insert hCard here" semantics in forms on pages, and let you "pre-fill" them with *your* hCard info. All of a sudden we would have a standard for forms auto-fill, rather than all the hacks that have gone into browsers since 1999 (starting with IE4.5/Mac, which I'm pretty sure was the first to do forms auto-fill of an entire form with a single button press, not just auto-complete of each form field individually).
For more on this, see the 2007 August blog post hCard autofill? by David Baron, a Mozilla employee.
This way new sites could simply conform to the standard, rather than depend on hacks that parse label values etc., infer meaning from them, and sometimes get it wrong.
Internationalization advantages: hCard annotated form inputs would also be more international. Browsers would no longer need to guess which field is the "name" or "telephone" field in every language, so they could do forms auto-fill on any site regardless of language, not just English.
Tantek 16:24, 23 Jul 2007 (PDT)
One key summary:
The options discussed in a hypothetical hCard input system so far appear to be:
1) create a new root class other than vcard to indicate a form that's fillable with hCard data.
<form class="vcard-input" ...>
  <fieldset class="fn">
    <input type="text" class="given-name" name="first_name" />
    <input type="text" class="family-name" name="last_name" />
  </fieldset>
  ...
</form>
Benefits: Doesn't overcomplicate hCard with new parsing rules, doesn't require rewrite of existing parsers to ignore 'unparsable' data.
Drawbacks: Requires completely new parsers to be written. Existing parsers would ignore data even if a valid hCard could be extracted.
2) extend hCard's parsing rules to cover form elements, and rely on the FORM/INPUT semantics to indicate that the data is inputtable.
<form ...>
  <div class="vcard">
    <fieldset class="fn">
      <input type="text" class="given-name" name="first_name" value="Rob" />
      <input type="text" class="family-name" name="last_name" value="Manson" />
    </fieldset>
    ...
  </div>
  <div class="vcard">
    <fieldset class="fn">
      <input type="text" class="given-name" name="first_name" value="Scott" />
      <input type="text" class="family-name" name="last_name" value="Reynen" />
    </fieldset>
    ...
  </div>
</form>
Benefits: Small addition to existing format rather than new one. Semantics of an input form and the eventual display format are the same.
Drawbacks: Existing parsers would/could parse forms as invalid hCards, would need re-writing.
Should this be extended beyond just hCard?
Key Issues/discussion points
- Extending parsing rules to extract value attributes from <input type="text|hidden"> fields
  - Negative: this requires adding a special case to existing parsers to handle these elements
  - Positive: this could help to enable microformats-based automatic form filling
  - Negative: the same mechanism could also enable abusive automated form filling (e.g. spam automation)
- Existing server side and client side scripts use non-hCard field names, so class is the most seamless extension point
  - Positive: this is in line with the current parsing model
- Many parsers (e.g. Operator Firefox plugin) parse the loaded HTML, not the dynamic DOM
  - Negative: the parser doesn't pick up any form data updated after the page has loaded; e.g. even though textarea appears to parse OK, only the initially loaded value can ever be exported
- Forms may contain more than one hCard, so using <FORM class="vcard"> should not be required
  - Positive: this minimises the changes to current parsing rules
- Empty values should be ignored when extracting hCards
- hCards with all empty values should be ignored when listing/extracting hCards
- Which form elements should be supported beyond input fields
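The two points above about empty values could be applied as a post-processing step once form fields have been extracted. The sketch below assumes each candidate hCard is already represented as a dictionary of property name to string value; that representation and the function name drop_empty are illustrative assumptions, not something defined by the hCard spec.

```python
def drop_empty(cards):
    """Remove empty property values, then drop cards left with no properties."""
    cleaned = []
    for card in cards:
        # Whitespace-only values are treated as empty.
        kept = {name: value for name, value in card.items() if value.strip()}
        if kept:
            cleaned.append(kept)
    return cleaned
```

With this approach a blank form containing two vcard blocks would yield no hCards at all, while a partly filled block would yield only its non-empty properties.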
multiple type parsing
- Multiple Type parsing / Type Optimization: The spec allows for, and the hCard authoring tips demonstrate, the use of multiple type designations for a single value of tel. The syntax used in the authoring examples, where each type appears to need its own element, seems like it could become cumbersome. As these type designations are all single 'word' strings, it may be possible to implement additional parsing rules to allow for multiple types inside the same HTML element. Handling delimiters may be an issue [space, comma, etc?], and some in-the-wild usage of multiple types would need to be located and examined before considering additional parsing rules along these lines [ ChrisCasciano 10:21, 16 Apr 2007 (PDT) ]
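As a rough illustration of the idea, a parser could split the text of a single type element into several type values. The choice of whitespace and commas as delimiters below is an assumption made only for this sketch (the delimiter question is still open, as noted above), and split_types is a hypothetical helper name.

```python
import re


def split_types(text):
    """Split the text of one 'type' element into individual type values."""
    # Assumption: types are separated by whitespace and/or commas.
    return [t.lower() for t in re.split(r"[\s,]+", text.strip()) if t]
```

For example, split_types("Home, Fax") would give the two types "home" and "fax" for one tel value.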
fax and modem hyperlink parsing
For the "tel" property in particular, when the element is:
<area href="fax:...">: parse the value of the 'href' attribute, omitting the "fax:" prefix and any "?" query suffix (if present). For details on the "fax:" URL scheme, see RFC 2806. In addition, treat this 'tel' property instance as having subproperty type "fax" in addition to any explicit subproperty type specified on the 'tel' property.
<area href="modem:...">: parse the value of the 'href' attribute, omitting the "modem:" prefix and any "?" query suffix (if present). For details on the "modem:" URL scheme, see RFC 2806. In addition, treat this 'tel' property instance as having subproperty type "modem" in addition to any explicit subproperty type specified on the 'tel' property.
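The fax: and modem: rules above might look like this in a parser. This is a minimal sketch; parse_tel_href and its explicit_types parameter are hypothetical names, and the function deals only with the 'href' string, not with locating the element.

```python
def parse_tel_href(href, explicit_types=()):
    """Return (value, types) for a fax: or modem: href, or None otherwise."""
    scheme, sep, rest = href.partition(":")
    scheme = scheme.lower()
    if not sep or scheme not in ("fax", "modem"):
        return None
    value = rest.split("?", 1)[0]   # drop any "?" query suffix
    types = list(explicit_types)    # keep explicitly specified types
    if scheme not in types:
        types.append(scheme)        # add the type implied by the scheme
    return value, types
```

The implied type is appended rather than substituted, so a tel marked up with an explicit "work" type and a fax: href ends up with both types.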
Ambiguous name components
When automatically publishing hCards from pre-existing data, it's not necessarily possible to tell which words in a name map to which hCard properties. When the structure of a name is unknown, it is hard to ensure an automatically published hCard remains valid.
There's currently no easy answer to this.
One implementation suggestion is a 'best-guess' algorithm, something along the lines of:
- If the name is one word, attempt implied nickname optimization
- If the name is two words, attempt implied n optimization
- For three or more words:
  - Perform a lookup against known sub-name combinations (e.g. 'Sarah Jane', 'Vander Wal')
  - Apply the grammar "given-name additional-name(s) family-name"
The principle behind this suggestion is that it's better to make a good guess and potentially miscategorize an ambiguous name component than to generate an invalid hCard.
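The best-guess steps above can be sketched as below. Several things here are assumptions made for illustration: the tiny KNOWN_COMPOUNDS table stands in for a real lookup of sub-name combinations, two-word names are assumed to be in "given-name family-name" order, and the function names are hypothetical. The result is a guess and may miscategorize a component, which is the accepted trade-off.

```python
# Hypothetical lookup table of multi-word name parts; a real one would be larger.
KNOWN_COMPOUNDS = {("sarah", "jane"), ("vander", "wal")}


def merge_compounds(words):
    """Join adjacent word pairs found in KNOWN_COMPOUNDS into one component."""
    merged, i = [], 0
    while i < len(words):
        pair = tuple(w.lower() for w in words[i:i + 2])
        if len(pair) == 2 and pair in KNOWN_COMPOUNDS:
            merged.append(words[i] + " " + words[i + 1])
            i += 2
        else:
            merged.append(words[i])
            i += 1
    return merged


def guess_name(fn):
    """Guess hCard name properties for a formatted name of unknown structure."""
    words = fn.split()
    if not words:
        return {}
    if len(words) == 1:
        # One word: treat it as a nickname.
        return {"nickname": words[0]}
    if len(words) >= 3:
        # Three or more words: look for known sub-name combinations first.
        words = merge_compounds(words)
    # Apply "given-name additional-name(s) family-name".
    result = {"given-name": words[0], "family-name": words[-1]}
    if len(words) > 2:
        result["additional-name"] = words[1:-1]
    return result
```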
ADR with no children
Parsers (Operator, Tails, Almost Universal Microformat Parser) currently expect adr to have one or more sub-properties. It is not clear from the hCard spec that this is mandatory (though the vCard RFC requires it); nor is it always possible for an address field in a templated (or CMS) web site to be defined with such granularity.
Consider Wikipedia, whose templates often have a "locale" or "place" field, used, for example, on these articles about railway stations:
- Old Street
- "Place" ("locale" in the template) is a street
- "Place" is a local district
- "Place" is a city
Likewise, the Wikipedia template for organisations, in which a "headquarters" address (for a business, for example) may contain a full or partial postal address, or just a city/county or city/country pair:
implied single adr subproperty
I propose that, where adr has content, but no explicit sub-properties, there should be a default sub-property to which that content is allocated, in order that it is captured by user agents, and can later be manually tweaked (in, say, an address book programme) by users if so desired. This would satisfy the vCard requirement for child-of-adr, and adhere to the general principle to "be strict in what you send but generous in what you receive".
- Note that there may be other reasons to consider this suggestion, such as "ease of authoring". Another way of looking at this suggestion is as an "adr/extended-address shorthand". Tantek 08:28, 26 Mar 2007 (PDT)
- there is also a LABEL property which is NOT structured data, but purely a text string to be used when labeling. LABEL purpose: To specify the formatted text corresponding to delivery address of the object the vCard represents. Brian 13:18, 30 Mar 2007 (UTC)
- On re-reading this, it seems that none of the addresses given in my examples meet the criteria of being "formatted text corresponding to delivery address". Andy Mabbett 03:35, 17 Apr 2007 (PDT)
Of the available sub-property options:
I suggest that "extended-address" is the most sensible sub-property to use, for this purpose. Andy Mabbett 03:57, 26 Mar 2007 (PDT)
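In parser terms, the proposed default could be as small as the following sketch. It assumes the parser already has the explicit sub-properties of an adr as a dictionary plus the element's text content; apply_implied_adr is a hypothetical name, and "extended-address" is used as the default only because it is the sub-property suggested above.

```python
def apply_implied_adr(subproperties, text_content):
    """Fall back to 'extended-address' when adr has text but no sub-properties."""
    if subproperties:
        # Explicit sub-properties always win; nothing is implied.
        return dict(subproperties)
    text = " ".join(text_content.split())   # normalise whitespace
    return {"extended-address": text} if text else {}
```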
implied adr subproperties
This may also be too difficult/complex to be dependable or interoperable, but it is worth at least documenting our considerations and analysis either way.
<td class="adr">Austin, USA</td>
We could first define a canonical ordering of how to parse for comma (and perhaps in some cases space) separated adr subproperties within an adr string e.g.:
Given a dictionary of country names and abbreviations, it may be feasible to parse for a country name at the end of the adr string, and then apply country/locale specific parsing rules to the rest of the adr string.
- from a theoretical dictionary of country names:
  - US|USA|United States|United States of America|Etats-Unis d'Amerique
    - parse the remainder of the adr string backwards as follows:
      - preceding that, look for a 5 or 9 digit (with optional dash '-' separator between digits 5 and 6) postal-code, and if found use it for the 'postal-code'
      - preceding that, look for the name of a US state (e.g. California or any of the other states or territories available from a canonical list) or 2 letter state abbreviation (e.g. CA), and if found use it for the 'region' subproperty
      - preceding that, look for the name of a US city (e.g. San Francisco, Los Angeles or any other US city available from a canonical list) or common city abbreviation (e.g. SF, LA), and if found use it for the 'locality'
      - preceding that, look for common extended address details, such as: #|apt|apartment|suite|ste followed by a word consisting of letters and numbers, and if found use it for the 'extended-address'
      - preceding that, look for a common street name bracketed by the street number (an integer with optional fraction and/or letter), and an optional street type (av|ave|avenue|blvd|boulevard|cir|circle|pl|place|st|street), and if found use it for the 'street-address'
      - preceding that, look for a common post office box, with the pobox literal string: pob|pobox|PO Box followed by a word consisting of numbers and letters, and if found use it for the 'post-office-box'
  - ... other countries
The above heuristic (not quite well specified enough to be an algorithm, yet) would allow parsing of the IBM Employee Directory result documented above.
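A partial sketch of the backwards-parsing heuristic for the US case is given below. It covers only the country, postal-code, region and locality steps; the extended-address, street-address and post-office-box steps are left out, and whatever is not recognised is returned untouched. The small dictionaries are illustrative stand-ins for the canonical lists mentioned above, and parse_us_adr is a hypothetical name.

```python
import re

# Tiny illustrative dictionaries; real ones would come from canonical lists.
US_NAMES = {"us", "usa", "united states", "united states of america"}
US_STATES = {"ca", "california", "tx", "texas"}
US_CITIES = {"austin", "san francisco", "sf", "los angeles", "la"}
# 5 digits, or 9 digits with an optional dash between digits 5 and 6.
ZIP_RE = re.compile(r"^\d{5}(-?\d{4})?$")


def parse_us_adr(text):
    """Best-effort split of a comma-separated adr string.

    Returns (subproperties, leftover_parts), or None if the string does
    not end in a recognised US country name.
    """
    parts = [p.strip() for p in text.split(",") if p.strip()]
    if not parts or parts[-1].lower() not in US_NAMES:
        return None
    adr = {"country-name": parts.pop()}

    # The last remaining part may end in a postal code ("CA 94103").
    if parts:
        words = parts[-1].split()
        if words and ZIP_RE.match(words[-1]):
            adr["postal-code"] = words.pop()
            if words:
                parts[-1] = " ".join(words)
            else:
                parts.pop()
    if parts and parts[-1].lower() in US_STATES:
        adr["region"] = parts.pop()
    if parts and parts[-1].lower() in US_CITIES:
        adr["locality"] = parts.pop()
    # Street, extended-address and PO box rules are omitted in this sketch.
    return adr, parts
```

Applied to the earlier example string "Austin, USA", this would yield a 'locality' of Austin and a 'country-name' of USA with nothing left over.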
There are a lot of existing geocoder APIs that turn unstructured addresses into structured ones; we should examine these for patterns and best practices, e.g. Google's geocoder. geopy calls multiple ones.
adr without children FAQ
I think for now the simplest and most interoperable approach (and what I think implementations already do) is to make this an FAQ, because the spec already doesn't say to do anything with an adr that has no subproperties.
Q: What should a parser do with an "adr" property lacking any subproperties?
A: A parser SHOULD do nothing with such an "adr" property. A parser MAY provide the text content of such an "adr" property in the results of its parsing as a freeform value of the "adr" property. Note that the vCard standard does not allow for any such freeform value of its "adr" property (in vCard the "adr" property MUST be structured) and thus that MAY suggestion to parsers only applies in situations (such as APIs, JSON return values) where it is possible to return a freeform value for the "adr" property.
Tantek 09:20, 2 Aug 2007 (PDT)
See also
- hCard cheatsheet - hCard properties
- hCard creator (feedback) - create your own hCard.
- hCard authoring - learn how to add hCard markup to your existing contact info.
- hCard examples - example usage of various classes within hCard.
- hCard examples in the wild - an on-going list of websites which use hCards.
- hCard supporting user profiles - sites with user profiles marked up with hCard - a very common example.
- hCard FAQ - if you have any questions about hCard, check here.
- hCard implementations - websites or tools which either generate or parse hCards.
- hCard parsing - normative details of how to parse hCards.
- hCards and pages - semantic distinctions between different hCards on a page, and how to identify each
- hcard-user-interface - techniques and issues surrounding user-interfaces to author, publish, and display hCards.
- hCard profile - the XMDP profile for hCard
- hCard singular properties - an explanation of the list of singular properties in hCard.
- hCard tests - a wiki page with actual embedded hCards to try parsing.
- hCard advocacy - encourage others to use hCard
- hCard "to do" - jobs to do
The hCard specification is a work in progress. As additional aspects are discussed, understood, and written, they will be added. These thoughts, issues, and questions are kept in separate pages.
- hCard brainstorming - brainstorms and other explorations relating to hCard.
- hCard feedback - general feedback (as opposed to specific issues).
- hCard issues - specific issues with the specification.
- vCard errata - corrections to the vCard specification, which underlies hCard.
- vCard suggestions - suggested improvements to the vCard specification.