Brainstorming for hCard parsing
Add thoughts/proposals to improve/add to hCard parsing here in this section in hCard brainstorming, and be sure to include URLs to examples of hCards in the wild which could benefit from parsing rule changes.
Additional Semantic HTML handling
acronym element handling
- Explicitly treat acronym the same as abbr, per the semantics of the 'title' attribute on acronym in particular, as defined in HTML 4.01.
- Explicitly treat acronym the same as span, and discourage use of
input element handling
One element I forgot at the time was the input element, specifically <input type="text">. Another I forgot was the textarea element.
<input type="text" value="...">: use the value of the 'value' attribute. If there is no 'value' attribute then treat the value as empty. Interactive user-agents MUST use the current value of the element.
- consider other input types also (e.g. checkbox, radio, hidden) and specify how to parse them as well.
<textarea>: use the text contents of the element. Interactive user-agents MUST use the current value of the element.
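The two rules above can be sketched for a non-interactive (page-load) parser as follows. This is a minimal illustration, not a reference implementation: the names FormValueParser and parse_form_values are hypothetical, an input with no 'type' attribute is assumed to be a text input, and the "current value" requirement for interactive user agents is out of scope because a static parser only sees the markup as served.

```python
from html.parser import HTMLParser


class FormValueParser(HTMLParser):
    """Collect (class name, value) pairs from text inputs and textareas."""

    def __init__(self):
        super().__init__()
        self.values = []               # list of (class_name, value) tuples
        self._textarea_classes = None  # class names of the open textarea, if any
        self._textarea_text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        # Assumption: a missing 'type' attribute means a text input.
        if tag == "input" and (attrs.get("type") or "text").lower() == "text":
            # No 'value' attribute: treat the value as empty.
            value = attrs.get("value") or ""
            for name in classes:
                self.values.append((name, value))
        elif tag == "textarea":
            self._textarea_classes = classes
            self._textarea_text = []

    def handle_data(self, data):
        if self._textarea_classes is not None:
            self._textarea_text.append(data)

    def handle_endtag(self, tag):
        if tag == "textarea" and self._textarea_classes is not None:
            # Use the text contents of the element.
            text = "".join(self._textarea_text)
            for name in self._textarea_classes:
                self.values.append((name, text))
            self._textarea_classes = None


def parse_form_values(html):
    parser = FormValueParser()
    parser.feed(html)
    parser.close()
    return parser.values
```

Other input types (checkbox, radio, hidden) are deliberately skipped here, since their parsing is still an open question in the list above.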
If you go to a site that needs your contact info for something, say an ecommerce site for checkout, and if the form fields are marked up with hCard semantics per the above, then perhaps we could consider having that mean "insert hCard here".
Interactive user agents (e.g. Operator on Firefox) could detect such "insert hCard here" semantics in forms on pages, and let you "pre-fill" with *your* hCard info. Then all of a sudden we have a standard for forms auto-fill, rather than all the hacks that have gone into browsers since 1999 (starting with IE4.5/Mac, the first to do forms auto-fill of an entire form with a single button press, not just auto-complete of each form field individually).
This way new sites could simply conform to the standard, rather than depend on hacks which parse label values etc. and imply things and get them wrong sometimes.
i18n advantages: hCard annotated form inputs would also be more international, thus avoiding the need for each browser to guess what is the "name" and "telephone" field in every language, so they can do forms auto-fill on any site regardless of language, not just English.
Tantek 16:24, 23 Jul 2007 (PDT)
See hcard-input-examples for research on examples of contact info input forms.
By specifying a consistent way to markup contact info (person or venue/organization) input forms, we could enable both:
- hCard forms auto-fill
- hCard copy and paste (pasting in particular)
blog posts on hCard forms fill
For more on this, see the following blog posts:
- 2007 August blog post hCard autofill? by David Baron, a Mozilla employee.
- 2008-03-04 blog post Fill in those pesky forms with hCards by John Allsopp, author of the microformats book "Microformats: Empowering Your Markup for Web 2.0".
- http://shapeshed.com/examples/hcardme/ - open source and a page that demonstrates entering a URL and importing hCard contact info from that URL into the form on the page.
One key summary by Ciaran McNulty:
The options discussed in a hypothetical hCard input system from that post:
option new vcard input root class
1) create a new root class other than vcard to indicate a form that's fillable with hCard data.
<form class="vcard-input" ...>
  <fieldset class="fn">
    <input type="text" class="given-name" name="first_name" />
    <input type="text" class="family-name" name="last_name" />
  </fieldset>
  ...
</form>
- Doesn't overcomplicate hCard with new parsing rules,
- doesn't require rewrite of existing parsers to ignore 'unparsable' data.
- Requires completely new parsers to be written.
- Existing parsers would ignore data even if a valid hCard could be extracted.
- -1 I think it is preferable to try to make hCard work with existing classes for this user scenario rather than adding another scenario-specific class name. Adding scenario-specific class names also does not scale to other microformats in general (requiring additional class names for each microformat). Tantek 19:17, 8 June 2009 (UTC)
option add input elements to hCard parsing
2) extend hCard's parsing rules to cover form elements, and rely on the FORM/INPUT semantics to indicate that stuff is inputtable.
<form ...>
  <div class="vcard">
    <fieldset class="fn">
      <input type="text" class="given-name" name="first_name" value="Rob" />
      <input type="text" class="family-name" name="last_name" value="Manson" />
    </fieldset>
    ...
  </div>
  <div class="vcard">
    <fieldset class="fn">
      <input type="text" class="given-name" name="first_name" value="Scott" />
      <input type="text" class="family-name" name="last_name" value="Reynen" />
    </fieldset>
    ...
  </div>
</form>
- Small addition to existing format rather than new one.
- Semantics of an input form and the eventual display format are the same.
- Existing parsers would/could parse forms without values as invalid hCards.
See discussion points for more details and follow-up on benefits / drawbacks.
forms auto fill for all microformats
- http://microformats.org/discuss/mail/microformats-discuss/2005-September/001059.html Should this be extended beyond just hCard?
Many raised by RobManson.
- Extending parsing rules to extract value attributes from <input type="text|hidden"> fields
- -1 (unattributed, perhaps rhetorical) : this requires adding a bit of special-casing to existing parsers to handle these elements
- +1 (unattributed, perhaps rhetorical) : this could help to enable microformat based auto form filling
- +1 The parsing rules for forms elements must be specified anyway, and thus it makes sense to see if they can be specified in such a way to at least enable forms autofill functionality. Tantek 19:17, 8 June 2009 (UTC)
- Existing server side and client side scripts use non-hCard field names so class is the most seamless extension point
- +1 (unattributed, perhaps rhetorical) : this is in line with the current parsing model
- +1 Tantek 19:17, 8 June 2009 (UTC)
- Some parsers (e.g. X2V) only parse the loaded HTML, not the dynamic DOM (Operator parses the page DOM).
- -1 (unattributed, perhaps rhetorical) : such a parser doesn't pick up any updated form data after the page has loaded, e.g. even though textarea appears to parse OK, it's only ever the initially loaded value that can be exported.
- +1 hcard-parsing should provide additional guidance on page load parsing vs dynamic DOM handling as necessary to handle both types of implementations. Tantek 19:17, 8 June 2009 (UTC)
- Forms may contain more than one hCard so using <form class="vcard"> should not be required.
- +1 (unattributed, perhaps rhetorical) : this minimizes the changes to current parsing rules
- +1 For example a <fieldset> could be used by an author instead, or even a div between the form and the inputs. Tantek 19:17, 8 June 2009 (UTC)
- Empty values should be ignored when extracting hCards
- +1 for vCards at least, perhaps into JSON as well. Tantek 19:17, 8 June 2009 (UTC)
- hCards with all empty values should be ignored when listing/extracting hCards
- +1 for vCards at least, perhaps into JSON as well. Tantek 19:17, 8 June 2009 (UTC)
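The last two discussion points (ignore empty values, and ignore hCards whose values are all empty) could look roughly like this as a post-processing step. It is only a sketch: prune_empty is a hypothetical helper, and it assumes each extracted hCard is a flat dict mapping a property name to a string value, which is a simplification of what real parsers return.

```python
def prune_empty(hcards):
    """Drop empty property values, then drop hCards left with no values."""
    pruned = []
    for card in hcards:
        # Whitespace-only values are treated as empty (an assumption).
        kept = {name: value for name, value in card.items() if value.strip()}
        if kept:
            pruned.append(kept)
    return pruned
```

For a blank checkout form this returns an empty list, so a parser following option 2 would not report the unfilled form as an hCard.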
Which form elements should be supported beyond input fields?
- title select that lists mr/mrs/ms/dr/etc.
honorific-prefix in particular, yes. Tantek 19:17, 8 June 2009 (UTC)
- checkboxes to choose which addresses to use
- +0 not sure how to make this work without a specific example to analyze. Tantek 19:17, 8 June 2009 (UTC)
- +1 (unattributed, perhaps rhetorical) this would simplify parsing and server side form processing as only single input fields for each value need to be used/validated
- +0 (unattributed, perhaps rhetorical) : either way any auto form filling will be more complex beyond simple <input type="text|hidden"> fields
- -1 (unattributed, perhaps rhetorical) hypothetical comment assuming more complexity beyond.
multiple type parsing
- Multiple Type parsing / Type Optimization: The spec allows for, and the hcard-authoring examples demonstrate, the use of multiple type designations for a single value of tel. The syntax used in the authoring examples, where each type gets its own element, seems like it could become cumbersome. As these type designations are all single 'word' strings, it may be possible to implement additional parsing rules to allow for multiple types inside the same HTML element. Handling delimiters may be an issue [space, comma, etc.?], and some in-the-wild usage of multiple types would need to be located and examined before considering additional parsing rules along these lines [ ChrisCasciano 10:21, 16 Apr 2007 (PDT) ]
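As a rough illustration of what such a rule might do, the sketch below splits the text of a single type element on whitespace and commas and keeps only recognized tel type values. The function name split_types and the choice of delimiters are assumptions made for illustration; the delimiter question raised above is not settled by this.

```python
import re

# tel type values from hCard/vCard, lowercased for comparison.
TEL_TYPES = {
    "voice", "home", "msg", "work", "pref", "fax", "cell",
    "video", "pager", "bbs", "modem", "car", "isdn", "pcs",
}


def split_types(text):
    """Split one type string into several types, dropping unknown words."""
    tokens = re.split(r"[\s,]+", text.strip().lower())
    return [token for token in tokens if token in TEL_TYPES]
```

Filtering against the known list is one way to avoid treating stray words as types, at the cost of ignoring any type value outside that list.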
fax and modem hyperlink parsing
For the "tel" property in particular, when the element is:
<area href="fax:...">: parse the value of the 'href' attribute, omitting the "fax:" prefix and any "?" query suffix (if present), in the attribute. For details on the "fax:" URL scheme, see RFC 2806. In addition, treat this 'tel' property instance as having subproperty type "fax" in addition to any explicit subproperty type specified on the 'tel' property.
<area href="modem:...">: parse the value of the 'href' attribute, omitting the "modem:" prefix and any "?" query suffix (if present), in the attribute. For details on the "modem:" URL scheme, see RFC 2806. In addition, treat this 'tel' property instance as having subproperty type "modem" in addition to any explicit subproperty type specified on the 'tel' property.
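A minimal sketch of the two rules above, assuming the href has already been read from the element. The helper name parse_tel_href and the shape of the returned dict are hypothetical; the prefix stripping, query-suffix stripping, and implied type follow the text.

```python
def parse_tel_href(href, explicit_types=()):
    """Return the tel value and types for a fax: or modem: URL, else None."""
    scheme, sep, rest = href.partition(":")
    scheme = scheme.lower()
    if not sep or scheme not in ("fax", "modem"):
        return None
    # Omit any "?" query suffix, if present.
    number = rest.split("?", 1)[0]
    # The scheme is an implied type, in addition to any explicit types.
    types = list(explicit_types)
    if scheme not in types:
        types.append(scheme)
    return {"value": number, "type": types}
```

For example, parse_tel_href("fax:+1-555-0100", ["work"]) yields the value "+1-555-0100" with types "work" and "fax".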
Ambiguous name components
When automatically publishing hCards from pre-existing data, it's not necessarily possible to tell which words in a name map to which hCard properties. When the structure of a name is unknown, it is hard to ensure an automatically published hCard remains valid.
There's currently no easy answer to this.
One implementation suggestion is a 'best-guess' algorithm, something along the lines of:
- If the name is one word, attempt implied nickname optimization
- If the name is two words, attempt implied n optimization
- For three or more words
- Perform a lookup against known sub-name combinations (e.g. 'Sarah Jane', 'Vander Wal')
- Apply the grammar "given-name additional-name(s) family-name"
The principle behind this suggestion is that it's better to make a good guess and potentially miscategorize an ambiguous name component than to generate an invalid hCard.
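The best-guess steps above might be sketched like this. Everything here is illustrative: guess_name and KNOWN_COMPOUNDS are hypothetical names, the lookup table holds only the two examples from the list, and the two-word case is reduced to "given-name family-name" without the comma and initial handling of the full implied n optimization.

```python
# Hypothetical lookup of multi-word sub-names; a real table would be far larger.
KNOWN_COMPOUNDS = {"sarah jane", "vander wal"}


def guess_name(name):
    """Best-guess mapping of a free-form name onto hCard name properties."""
    words = name.split()
    if not words:
        return {}
    if len(words) == 1:
        # One word: implied nickname.
        return {"nickname": words[0]}
    tokens = []
    i = 0
    while i < len(words):
        pair = " ".join(words[i:i + 2])
        # Only merge known compounds when there are three or more words.
        if len(words) > 2 and pair.lower() in KNOWN_COMPOUNDS:
            tokens.append(pair)
            i += 2
        else:
            tokens.append(words[i])
            i += 1
    # Grammar: given-name additional-name(s) family-name.
    return {
        "given-name": tokens[0],
        "additional-name": tokens[1:-1],
        "family-name": tokens[-1],
    }
```

With this table, "Thomas Vander Wal" keeps "Vander Wal" together as the family-name instead of treating "Vander" as an additional-name.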
ADR with no children
Parsers (Operator, Tails, Almost Universal Microformat Parser) currently expect adr to have one or more sub-properties. It is not clear from the hCard spec that that's mandatory (though the vCard RFC requires it); nor is it always possible for an address field in a templated (or CMS) web site to be defined with such granularity.
Consider Wikipedia, whose templates often have a "locale" or "place" field, used, for example, on these articles about railway stations:
- Old Street
- "Place" ("locale" in the template) is a street
- "Place" is a local district
- "Place" is a city
Likewise, the Wikipedia template for organisations, in which a "headquarters" address (for a business, for example) may contain a full or partial postal address, or just a city/county or city/country pair:
implied single adr subproperty
I propose that, where adr has content, but no explicit sub-properties, there should be a default sub-property to which that content is allocated, in order that it is captured by user agents, and can later be manually tweaked (in, say, an address book programme) by users if so desired. This would satisfy the vCard requirement for child-of-adr, and adhere to the general principle to "be strict in what you send but generous in what you receive".
- Note that there may be other reasons to consider this suggestion, such as "ease of authoring". Another way of looking at this suggestion is as an "adr/extended-address shorthand". Tantek 08:28, 26 Mar 2007 (PDT)
- there is also a LABEL property which is NOT structured data, but purely a text string to be used when labeling. LABEL purpose: To specify the formatted text corresponding to delivery address of the object the vCard represents. Brian 13:18, 30 Mar 2007 (UTC)
- On re-reading this, it seems that none of the addresses given in my examples meet the criteria of being "formatted text corresponding to delivery address". Andy Mabbett 03:35, 17 Apr 2007 (PDT)
Of the available sub-property options:
I suggest that "extended-address" is the most sensible sub-property to use, for this purpose. Andy Mabbett 03:57, 26 Mar 2007 (PDT)
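If this proposal were adopted, the parser-side change could be as small as the sketch below. The function name apply_implied_adr and its arguments are hypothetical: it assumes the parser has already collected the adr element's text content and a dict of any explicit sub-properties it found.

```python
def apply_implied_adr(adr_text, subproperties):
    """Fall back to extended-address when adr has text but no sub-properties."""
    if subproperties:
        # Explicit sub-properties always win; nothing is implied.
        return dict(subproperties)
    text = " ".join(adr_text.split())  # collapse runs of whitespace
    return {"extended-address": text} if text else {}
```

An adr with explicit sub-properties is returned unchanged, so existing structured hCards would not be affected.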
implied adr subproperties
This may also be too difficult/complex to be dependable or interoperable, but it is worth at least documenting our considerations and analysis either way.
<td class="adr">Austin, USA</td>
We could first define a canonical ordering of how to parse for comma (and perhaps in some cases space) separated adr subproperties within an adr string e.g.:
Given a dictionary of country names and abbreviations, it may be feasible to parse for a country name at the end of the adr string, and then apply country/locale specific parsing rules to the rest of the adr string.
- from a theoretical dictionary of country names:
- US|USA|United States|United States of America|Etats-Unis d'Amerique
- parse the remainder of the adr string backwards as follows:
- preceding that, look for a 5 or 9 digit (with optional dash '-' separator between digits 5 and 6) postal-code, and if found use it for the 'postal-code'
- preceding that, look for the name of a US state (e.g. California or any of the other states or territories available from a canonical list) or 2 letter state abbreviation (e.g. CA), and if found use it for the 'region' subproperty
- preceding that, look for the name of a US city (e.g. San Francisco, Los Angeles or any other US city available from a canonical list) or common city abbreviation (e.g. SF, LA), and if found use it for the 'locality'
- preceding that, look for common extended address details, such as: #|apt|apartment|suite|ste followed by a word consisting of letters and numbers, and if found use it for the 'extended-address'
- preceding that, look for a common street name bracketed by the street number (an integer with optional fraction and/or letter), and an optional street type (av|ave|avenue|blvd|boulevard|cir|circle|pl|place|st|street), and if found use it for the 'street-address'
- preceding that, look for a common post office box, with the pobox literal string: pob|pobox|PO Box followed by a word consisting of numbers and letters, and if found use it for the 'post-office-box'
- ... other countries
The above heuristic (not quite well specified enough to be an algorithm, yet) would allow parsing of the IBM Employee Directory result documented above.
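To make the first few backward steps concrete, here is a partial sketch for comma-separated US addresses. It only covers country-name, postal-code, region and locality; whatever is left is returned under a hypothetical "remainder" key rather than being split into extended-address, street-address and post-office-box. The function name parse_us_adr and the tiny state and city tables are stand-ins for the canonical lists the heuristic assumes.

```python
import re

US_COUNTRY_NAMES = {"us", "usa", "united states", "united states of america",
                    "etats-unis d'amerique"}
# Tiny sample tables; real canonical lists are assumed to exist elsewhere.
US_STATES = {"CA": "California", "NY": "New York", "TX": "Texas"}
US_CITIES = {"austin", "san francisco", "los angeles"}


def is_us_state(token):
    names = {name.lower() for name in US_STATES.values()}
    return token.upper() in US_STATES or token.lower() in names


def parse_us_adr(text):
    """Parse a comma-separated adr string backwards from the country name."""
    parts = [part.strip() for part in text.split(",") if part.strip()]
    if not parts or parts[-1].lower() not in US_COUNTRY_NAMES:
        return {}  # no recognized country at the end: do not guess
    result = {"country-name": parts.pop()}
    if parts:
        # 5 or 9 digits, optional dash between digits 5 and 6; the code
        # may share a segment with the state, as in "CA 94107".
        match = re.search(r"\s*\b(\d{5}(?:-?\d{4})?)$", parts[-1])
        if match:
            result["postal-code"] = match.group(1)
            parts[-1] = parts[-1][:match.start()].strip()
            if not parts[-1]:
                parts.pop()
    if parts and is_us_state(parts[-1]):
        result["region"] = parts.pop()
    if parts and parts[-1].lower() in US_CITIES:
        result["locality"] = parts.pop()
    if parts:
        result["remainder"] = ", ".join(parts)
    return result
```

Applied to the "Austin, USA" example above, this yields country-name "USA" and locality "Austin"; a string that does not end in a recognized country name yields nothing.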
There are a lot of existing geocoder APIs that turn unstructured addresses into structured ones; we should examine these for patterns and best practices, e.g. Google's geocoder. geopy calls multiple ones.
adr without children FAQ
I think for now the simplest and most interoperable approach (and what I think implementations already do) is to make this an FAQ, because the spec already doesn't say to do anything with an adr without any subproperty.
Q: What should a parser do with an "adr" property lacking any subproperties?
A: A parser SHOULD do nothing with such an "adr" property. A parser MAY provide the text content of such an "adr" property in the results of its parsing as a freeform value of the "adr" property. Note that the vCard standard does not allow for any such freeform value of its "adr" property (in vCard the "adr" property MUST be structured) and thus that MAY suggestion to parsers only applies in situations (such as APIs, JSON return values) where it is possible to return a freeform value for the "adr" property.
Tantek 09:20, 2 Aug 2007 (PDT)
- ignore (0) in tel property values
- e.g. British/UK telephone numbers often have a (0) in them to indicate how to dial #s locally.
- introduce 'tel' as a special type, so that it can trigger parsing for "tel:" URLs in
- as well as incorporate the special "ignore (0)" rule above.
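The "ignore (0)" rule could be sketched as a small normalization step. The helper name strip_trunk_zero is hypothetical, and the sketch simply removes every literal "(0)" without checking that a country code precedes it, which a real rule would probably want to require.

```python
import re


def strip_trunk_zero(tel):
    """Remove a parenthesized (0) from a tel value and tidy the spacing."""
    cleaned = re.sub(r"\s*\(0\)\s*", " ", tel)
    return " ".join(cleaned.split())
```

For instance, "+44 (0)20 7946 0958" becomes "+44 20 7946 0958", which is the form to dial from outside the UK.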
Some nice to haves (parser related only in that they may require additional parsing related code)
- tel content to markup generator that generates the "tel:" URL
- could incorporate into the hCard Creator.
- tel validator