hcard-parsing-brainstorming

(Difference between revisions)

Jump to: navigation, search

Revision as of 17:01, 17 January 2008

Contents

Brainstorming for hCard parsing

See separate hCard parsing page for current hCard parsing rules; and hcard-brainstorming for more general discussion.

Add thoughts/proposals to improve/add to hCard parsing here in this section in hCard brainstorming, and be sure to include URLs to examples of hCards in the wild which could benefit from parsing rule changes.

Additional Semantic HTML handling

acronym element handling

Choices:

input element handling

In hcard-parsing, I've defined special-case handling for several elements according to more semantic exceptions, e.g. textual properties on the img element use the 'alt' attribute.

One element I forgot at the time was the input element, specifically, <input type="text">. Another I forgot was the textarea element.

The simple suggestion is to add the following to hcard-parsing, specifically to the all properties sub-section:

Tantek

forms auto-fill

If you go to a site that needs your contact info for something, say an ecommerce site for checkout, and if the form fields are marked up with hCard semantics per the above, then perhaps we could consider having that mean "insert hCard here".

Interactive useragents (e.g. operator on firefox) could detect such "insert hCard here" semantics in forms on pages, and let you "pre-fill" with *your* hCard info, and then all of a sudden we have a standard for forms auto-fill, rather than all the hacks that have gone into browsers since 1999 (starting with IE4.5/Mac which I'm pretty sure was the first to do forms auto-fill of an entire form with a single button press - not just auto-complete of each form field individually).

Obviously this would make sense to build into *existing* forms auto-fill features in Firefox and IE, and any other browsers that support it.

For more on this, see the 2007 August blog post hCard autofill? by David Baron, a Mozilla employee.

This way new sites could simply conform to the standard, rather than depend on hacks which parse label values etc. and imply things and get them wrong sometimes.

i18n advantages: hCard annotated form inputs would also be more international, thus avoiding the need for each browser to guess what is the "name" and "telephone" field in every language, so they can do forms auto-fill on any site regardless of language, not just English.

Tantek 16:24, 23 Jul 2007 (PDT)

Background discussion:

Key threads:


Somewhat related:

One key summary:

The options discussed in a hypothetical hCard input system so far appear to be:

1) create a new root class other than vcard to indicate a form that's fillable with hCard data.

Proposed markup:

<form class="vcard-input" ...>
   <fieldset class="fn">
      <input type="text" class="given-name" name="first_name" />
      <input type="text" class="family-name" name="last_name" />
   </fieldset>
   ...
</form>
 Benefits:
     Doesn't overcomplicate hCard with new parsing rules,
     doesn't require rewrite of existing parsers to ignore 'unparsable' data.
 Drawbacks:
     Requires completely new parsers to be written.
     Existing parsers would ignore data even if a valid hCard could be extracted.

2) extend hCard's parsing rules to cover form elements and relying on the FORM/INPUT semantics to indicate that stuff is inputtable.

Proposed markup:

<form ...>
<div class="vcard">
   <fieldset class="fn">
      <input type="text" class="given-name" name="first_name" value="Rob" />
      <input type="text" class="family-name" name="last_name" value="Manson" />
   </fieldset>
   ...
</div>
<div class="vcard">
   <fieldset class="fn">
      <input type="text" class="given-name" name="first_name" value="Scott" />
      <input type="text" class="family-name" name="last_name" value="Reynen" />
   </fieldset>
   ...
</div>
</form>
 Benefits:
     Small addition to existing format rather than new one.
     Semantics of an input form and the eventual display format are the same.
 Drawbacks:
     Existing parsers would/could parse forms as invalid hCards, would need re-writing.

Broader question:

Should this be extended beyond just hCard?

Key Issues/discussion points

 - Negative : this require adding a bit of special case to existing parsers to handle these elements
 - Positive : this could help to enable uf based auto form filling
 - Negative : this could help to enable uf based auto form filling (e.g. spam automation)
 - Positive : this is in line with the current parsing model
 - Negative : parser doesn't pickup any updated form data after the page has loaded
 - e.g. even though textarea appears to parse ok - it's only ever the initially loaded value that can be exported
 - Positive : this minimises the changes to current parsing rules
 - Examples
   - title select that lists mr/mrs/ms/dr/etc.
   - checkboxes to choose which addresses to use
 - Option : simplify extension to only support input fields and recommend that select's, radio buttons and checkboxes update related hidden input fields with simple javascript (e.g. onChange/Click="this.form.elements[this.className].value = this.value")
   - Unworkable.  Cannot require clientside javascript.
 - Positive : this would simplify parsing and server side form processing as only single input fields for each value need to be used/validated
 - Negative : hCard forms then require javascript if they use form elements other than basic <input type="text|hidden">
 - Comment : either way any auto form filling will be more complex beyond simple <input type="text|hidden"> fields
   - hypothetical comment assuming more complexity beyond.

RobManson

multiple type parsing

fax and modem hyperlink parsing

For the "tel" property in particular, when the element is:

Ambiguous name components

When automatically publishing hCards from pre-existing data, it's not necessarily possible to tell which words in a name map to which hCard properties. When the structure of a name is unknown, it is hard to ensure an automatically published hCard remains valid.

There's currently no easy answer to this.

One implementation suggestion is a 'best-guess' algorithm, something along the lines of:

  1. If the name is one word, attempt implied nickname optimization
  2. If the name is two words, attempt implied n optimization
  3. For three or more words
    1. Perform a lookup against known sub-name combinations (e.g. 'Sarah Jane', 'Vander Wal')
    2. Apply the grammar "given-name additional-name(s) family-name"

The principal behind this suggestion is that it's better to make a good guess and potentially miscategorize an ambiguous name component than to generate an invalid hCard.

ADR with no children

Parsers (Operator, Tails, Almost Universal Microformat Parser) currently expect adr to have one or more sub-properties. It is not clear from the hCard spec that that's mandatory (though the vCard RFC requires it); nor is it always possible for an address field in a templated (or CMS) web site to be defined with such granularity.

Consider Wikipedia, whose templates often have a "locale" or "place" field, used, for example, on these articles about railway stations:

Likewise, the Wikipedia template for organisations, in which a "headquarters" address (for a business, for example) may contain a full or partial postal address, or just a city/county or city/country pair:

implied single adr subproperty

I propose that, where adr has content, but no explicit sub-properties, there should be a default sub-property to which that content is allocated, in order that it is captured by user agents, and can later be manually tweaked (in, say, an address book programme) by users if so desired. This would satisfy the vCard requirement for child-of-adr, and adhere to the general principle to "be strict in what you send but generous in what you receive".

Of the available sub-property options:

I suggest that "extended-address" is the most sensible sub-property to use, for this purpose. Andy Mabbett 03:57, 26 Mar 2007 (PDT)

implied adr subproperties

It may be possible for parsers to parse out adr subproperties from a contiguous adr string. This would be an optimization for both adr and hCard.

This may also be too difficult/complex to be dependable or interoperable, but it is worth at least documenting our considerations and analysis either way.

Examples:

IBM's Employee Directory search returns hCards with the "adr" property which contain the "locality" and "country-name" data but unfortunately without being marked up as such, e.g.:

<td class="adr">Austin, USA</td>

We could first define a canonical ordering of how to parse for comma (and perhaps in some cases space) separated adr subproperties within an adr string e.g.:

Given a dictionary of country names and abbreviations, it may be feasible to parse for a country name at the end of the adr string, and then apply country/locale specific parsing rules to the rest of the adr string.

E.g.

The above heuristic (not quite well specified enough to be an algorithm, yet) would allow parsing of the IBM Employee Directory result documented above.

There are a lot of existing geocoder APIs that turn unstructured addresses into structured ones - we should examine these for patterns and best practices. eg Google's geocoder geopy calls multiple ones

adr without children FAQ

I think for now the simplest and most interoperable (and what I think implementations already do) is to make this an FAQ (because the spec already doesn't say to do anything with adr without any subproperty)

Q: What should a parser do with an "adr" property lacking any subproperties?

A: A parser SHOULD do nothing with such an "adr" property. A parser MAY provide the text content of such an "adr" property in the results of its parsing as a freeform value of the "adr" property. Note that the vCard standard does not allow for any such freeform value of its "adr" property (in vCard the "adr" property MUST be structured) and thus that MAY suggestion to parsers only applies in situations (such as APIs, JSON return values) where it is possible to return a freeform value for the "adr" property.

Tantek 09:20, 2 Aug 2007 (PDT)


See also

The hCard specification is a work in progress. As additional aspects are discussed, understood, and written, they will be added. These thoughts, issues, and questions are kept in separate pages.

hcard-parsing-brainstorming was last modified: Wednesday, December 31st, 1969

Views