hcard-parsing-brainstorming: Difference between revisions

Latest revision as of 23:16, 24 July 2010

Brainstorming for hCard parsing

See separate hCard parsing page for current hCard parsing rules; and hcard-brainstorming for more general discussion.

Add thoughts/proposals to improve/add to hCard parsing here in this section in hCard brainstorming, and be sure to include URLs to examples of hCards in the wild which could benefit from parsing rule changes.

Additional Semantic HTML handling

acronym element handling

Choices:

  • Explicitly treat acronym the same as abbr, per semantics of the 'title' attribute on acronym in particular, as defined in HTML4.01.
    • +1 I prefer this option as it better follows the semantics of HTML4.01, and for those familiar with HTML4.01, follows the principle of least surprise. Tantek 19:17, 8 June 2009 (UTC)
  • Explicitly treat acronym the same as span, and discourage use of acronym.

input element handling

In hcard-parsing, I've defined special-case handling for several elements according to more semantic exceptions, e.g. textual properties on the img element use the 'alt' attribute.

One element I forgot at the time was the input element, specifically, <input type="text">. Another I forgot was the textarea element.

The simple suggestion is to add the following to hcard-parsing, specifically to the all properties sub-section:

  • <input type="text" value="...">: use the value of the 'value' attribute. If there is no 'value' attribute then treat the value as empty. Interactive user-agents MUST use the current value of the element.
    • consider other input types also (e.g. checkbox, radio, hidden) and specify how to parse them as well.
  • <textarea>: use the text contents of the element. Interactive user-agents MUST use the current value of the element.

Tantek
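
The proposed rule for text inputs and textareas could be sketched roughly as follows. This is only an illustration for a non-interactive parser working on static markup (so it sees the initial value, not the current value an interactive user agent would use); the names `FormValueParser` and `extract_form_values` are made up for this example and are not from any existing parser.

```python
from html.parser import HTMLParser


class FormValueParser(HTMLParser):
    """Collect (class name, value) pairs from text inputs and textareas."""

    def __init__(self):
        super().__init__()
        self.values = []               # list of (class_name, value) tuples
        self._textarea_classes = None  # class names of the textarea being read
        self._textarea_text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "input" and (attrs.get("type") or "").lower() == "text":
            # Use the 'value' attribute; a missing attribute means empty.
            value = attrs.get("value") or ""
            self.values.extend((name, value) for name in classes)
        elif tag == "textarea":
            self._textarea_classes = classes
            self._textarea_text = []

    def handle_data(self, data):
        # Only text inside a textarea is collected.
        if self._textarea_classes is not None:
            self._textarea_text.append(data)

    def handle_endtag(self, tag):
        if tag == "textarea" and self._textarea_classes is not None:
            text = "".join(self._textarea_text)
            self.values.extend((name, text) for name in self._textarea_classes)
            self._textarea_classes = None


def extract_form_values(html):
    """Return the (class name, value) pairs found in an HTML string."""
    parser = FormValueParser()
    parser.feed(html)
    parser.close()
    return parser.values
```

Other input types (checkbox, radio, hidden) are deliberately skipped here, since how to parse them is still an open question above.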

forms auto-fill

If you go to a site that needs your contact info for something, say an ecommerce site for checkout, and if the form fields are marked up with hCard semantics per the above, then perhaps we could consider having that mean "insert hCard here".

Interactive user agents (e.g. Operator on Firefox) could detect such "insert hCard here" semantics in forms on pages, and let you "pre-fill" with *your* hCard info. All of a sudden we would have a standard for forms auto-fill, rather than all the hacks that have gone into browsers since 1999 (starting with IE4.5/Mac, the first to do forms auto-fill of an entire form with a single button press, not just auto-complete of each form field individually).

Obviously this would make sense to build into *existing* forms auto-fill features in Firefox and IE, and any other browsers that support it.

This way new sites could simply conform to the standard, rather than depend on hacks which parse label values etc. and imply things and get them wrong sometimes.

i18n advantages: hCard annotated form inputs would also be more international, thus avoiding the need for each browser to guess what is the "name" and "telephone" field in every language, so they can do forms auto-fill on any site regardless of language, not just English.

Tantek 16:24, 23 Jul 2007 (PDT)

input examples

See hcard-input-examples for research on examples of contact info input forms.

By specifying a consistent way to mark up contact info (person or venue/organization) input forms, we could enable both:

  • hCard forms auto-fill
  • hCard copy and paste (pasting in particular)

blog posts on hCard forms fill

For more on this, see the following blog posts:

related implementations

background discussion

Key threads:


Somewhat related:

One key summary by Ciaran McNulty:

The options discussed in a hypothetical hCard input system from that post:

option new vcard input root class

1) create a new root class other than vcard to indicate a form that's fillable with hCard data.

Proposed markup:

<form class="vcard-input" ...>
   <fieldset class="fn">
      <input type="text" class="given-name" name="first_name" />
      <input type="text" class="family-name" name="last_name" />
   </fieldset>
   ...
</form>

  • Benefits:
    • Doesn't overcomplicate hCard with new parsing rules,
    • doesn't require rewrite of existing parsers to ignore 'unparsable' data.
  • Drawbacks:
    • Requires completely new parsers to be written.
    • Existing parsers would ignore data even if a valid hCard could be extracted.
  • -1 I think it is preferable to try to make hCard work with existing classes for this user scenario rather than adding another scenario-specific class name. Adding scenario-specific class names also does not scale to other microformats in general (requiring additional class names for each microformat). Tantek 19:17, 8 June 2009 (UTC)

option add input elements to hCard parsing

2) extend hCard's parsing rules to cover form elements, and rely on the FORM/INPUT semantics to indicate that stuff is inputtable.

Proposed markup:

<form ...>
<div class="vcard">
   <fieldset class="fn">
      <input type="text" class="given-name" name="first_name" value="Rob" />
      <input type="text" class="family-name" name="last_name" value="Manson" />
   </fieldset>
   ...
</div>
<div class="vcard">
   <fieldset class="fn">
      <input type="text" class="given-name" name="first_name" value="Scott" />
      <input type="text" class="family-name" name="last_name" value="Reynen" />
   </fieldset>
   ...
</div>
</form>

  • Benefits:
    • Small addition to existing format rather than new one.
    • Semantics of an input form and the eventual display format are the same.
  • Drawbacks:
    • Existing parsers would/could parse forms without values as invalid hCards.

See discussion points for more details and follow-up on benefits / drawbacks.

forms auto fill for all microformats

Broader question:

discussion points

Many raised by RobManson.

  • Extending parsing rules to extract value attributes from <input type="text|hidden"> fields
    • -1 (unattributed, perhaps rhetorical) : this requires adding a bit of special-case handling to existing parsers for these elements
    • +1 (unattributed, perhaps rhetorical) : this could help to enable microformat based auto form filling
    • +1 The parsing rules for forms elements must be specified anyway, and thus it makes sense to see if they can be specified in such a way to at least enable forms autofill functionality. Tantek 19:17, 8 June 2009 (UTC)
  • Existing server side and client side scripts use non-hCard field names so class is the most seamless extension point
    • +1 (unattributed, perhaps rhetorical) : this is in line with the current parsing model
    • +1 Tantek 19:17, 8 June 2009 (UTC)
  • Some parsers (e.g. X2V) only parse the loaded HTML, not the dynamic DOM (Operator parses the page DOM).
    • -1 (unattributed, perhaps rhetorical) : the parser doesn't pick up any updated form data after the page has loaded, e.g. even though textarea appears to parse ok, it's only ever the initially loaded value that can be exported.
    • +1 hcard-parsing should provide additional guidance on page load parsing vs dynamic DOM handling as necessary to handle both types of implementations. Tantek 19:17, 8 June 2009 (UTC)
  • Forms may contain more than one hCard so using <form class="vcard"> should not be required.
    • +1 (unattributed, perhaps rhetorical) : this minimizes the changes to current parsing rules
    • +1 For example a <fieldset> could be used by an author instead, or even a div between the form and the inputs. Tantek 19:17, 8 June 2009 (UTC)
  • Empty values should be ignored when extracting hCards
    • +1 for vCards at least, perhaps into JSON as well. Tantek 19:17, 8 June 2009 (UTC)
  • hCards with all empty values should be ignored when listing/extracting hCards
    • +1 for vCards at least, perhaps into JSON as well. Tantek 19:17, 8 June 2009 (UTC)
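
The last two points (ignore empty values, and ignore hCards whose values are all empty) might look something like this in a parser's post-processing step. It is a sketch only: it assumes each parsed hCard is a plain dict mapping property names to string values, which is a made-up representation for this example, and the function names are hypothetical.

```python
def drop_empty_values(hcard):
    """Return a copy of one parsed hCard without empty or whitespace-only values."""
    return {name: value for name, value in hcard.items() if value and value.strip()}


def extract_nonempty_hcards(hcards):
    """Drop empty values from each hCard, then drop hCards left with no values."""
    cleaned = (drop_empty_values(card) for card in hcards)
    return [card for card in cleaned if card]
```

With this, a blank form marked up as an hCard yields no hCards at all, which addresses the drawback noted for option 2.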


Which form elements should be supported beyond input fields?

  • title select that lists mr/mrs/ms/dr/etc.
    • +1 honorific-prefix in particular, yes. Tantek 19:17, 8 June 2009 (UTC)
  • checkboxes to choose which addresses to use
    • +0 not sure how to make this work without a specific example to analyze. Tantek 19:17, 8 June 2009 (UTC)
  • Option : simplify extension to only support input fields and recommend that selects, radio buttons and checkboxes update related hidden input fields with simple javascript (e.g. onChange/Click="this.form.elements[this.className].value = this.value")
    • -1 (unattributed, perhaps rhetorical) Unworkable. Cannot require clientside javascript.
    • +1 (unattributed, perhaps rhetorical) this would simplify parsing and server side form processing as only single input fields for each value need to be used/validated
    • -1 (unattributed, perhaps rhetorical) hCard forms then require javascript if they use form elements other than basic <input type="text|hidden">
    • +0 (unattributed, perhaps rhetorical)  : either way any auto form filling will be more complex beyond simple <input type="text|hidden"> fields
      • -1 (unattributed, perhaps rhetorical) hypothetical comment assuming more complexity beyond.
    • -1 requiring javascript is a non-starter. microformats must work as POSH. Tantek 19:17, 8 June 2009 (UTC)

multiple type parsing

  • Multiple Type parsing / Type Optimization: The spec allows for, and the hcard-authoring examples demonstrate, the use of multiple type designations for a single value of tel. The syntax used in the authoring examples, where each type is marked up separately, seems like it could become cumbersome. As these type designations are all single 'word' strings, it may be possible to implement additional parsing rules to allow for multiple types inside the same HTML element. Handling delimiters may be an issue [space, comma, etc?], and some in-the-wild usage of multiple types would need to be located and examined before considering additional parsing rules along these lines [ ChrisCasciano 10:21, 16 Apr 2007 (PDT) ]

fax and modem hyperlink parsing

For the "tel" property in particular, when the element is:

  • <a href="fax:..."> OR <area href="fax:..."> : parse the value of the 'href' attribute, omitting the "fax:" prefix and any "?" query suffix (if present), in the attribute. For details on the "fax:" URL scheme, see RFC 2806. In addition, treat this 'tel' property instance as having subproperty type "fax" in addition to any explicit subproperty type specified on the 'tel' property.
  • <a href="modem:..."> OR <area href="modem:..."> : parse the value of the 'href' attribute, omitting the "modem:" prefix and any "?" query suffix (if present), in the attribute. For details on the "modem:" URL scheme, see RFC 2806. In addition, treat this 'tel' property instance as having subproperty type "modem" in addition to any explicit subproperty type specified on the 'tel' property.
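
A rough sketch of the two rules above, assuming the parser has already pulled out the 'href' attribute and any explicit types for the 'tel' property. The function name `parse_tel_href` and its return shape are invented for illustration. Only the scheme prefix and the "?" query suffix are removed, as described; anything else in the URL is left in the value untouched.

```python
def parse_tel_href(href, explicit_types=()):
    """Return (value, types) for a fax: or modem: href, or None for other schemes."""
    scheme, sep, rest = href.strip().partition(":")
    scheme = scheme.lower()
    if not sep or scheme not in ("fax", "modem"):
        return None
    # Omit any "?" query suffix from the value.
    value = rest.split("?", 1)[0]
    # The scheme becomes an implied type, in addition to explicit types.
    types = list(explicit_types)
    if scheme not in types:
        types.append(scheme)
    return value, types
```

For example, `parse_tel_href("modem:+3585551234567", ["work"])` would give the value `+3585551234567` with types `work` and `modem`.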

Ambiguous name components

When automatically publishing hCards from pre-existing data, it's not necessarily possible to tell which words in a name map to which hCard properties. When the structure of a name is unknown, it is hard to ensure an automatically published hCard remains valid.

There's currently no easy answer to this.

One implementation suggestion is a 'best-guess' algorithm, something along the lines of:

  1. If the name is one word, attempt implied nickname optimization
  2. If the name is two words, attempt implied n optimization
  3. For three or more words
    1. Perform a lookup against known sub-name combinations (e.g. 'Sarah Jane', 'Vander Wal')
    2. Apply the grammar "given-name additional-name(s) family-name"

The principle behind this suggestion is that it's better to make a good guess and potentially miscategorize an ambiguous name component than to generate an invalid hCard.
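
The 'best-guess' steps could be sketched like this. Everything here is an assumption for illustration: the lookup table `KNOWN_PAIRS` holds just the two examples mentioned above, the function name is made up, and the one-word and two-word cases are simplified stand-ins for the implied nickname and implied n optimizations (the comma and initial handling of the real rules is not attempted).

```python
# Hypothetical lookup of known two-word sub-names, lowercased.
KNOWN_PAIRS = {("sarah", "jane"), ("vander", "wal")}


def guess_name_parts(name):
    """Best-guess split of a formatted name into hCard name properties."""
    words = name.split()
    if not words:
        return {}
    if len(words) == 1:
        # Step 1: one word, treat it as a nickname.
        return {"nickname": words[0]}
    if len(words) == 2:
        # Step 2: two words, assume given-name then family-name.
        return {"given-name": words[0], "family-name": words[1]}
    # Step 3.1: merge known adjacent pairs into single tokens.
    tokens = []
    i = 0
    while i < len(words):
        pair = tuple(w.lower() for w in words[i:i + 2])
        if pair in KNOWN_PAIRS:
            tokens.append(" ".join(words[i:i + 2]))
            i += 2
        else:
            tokens.append(words[i])
            i += 1
    # Step 3.2: given-name additional-name(s) family-name.
    parts = {"given-name": tokens[0], "family-name": tokens[-1]}
    if len(tokens) > 2:
        parts["additional-name"] = tokens[1:-1]
    return parts
```

A real publisher would need a much larger lookup table, and would still get some names wrong; the point is only that the output always has a usable structure.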

ADR with no children

Parsers (Operator, Tails, Almost Universal Microformat Parser) currently expect adr to have one or more sub-properties. It is not clear from the hCard spec that that's mandatory (though the vCard RFC requires it); nor is it always possible for an address field in a templated (or CMS) web site to be defined with such granularity.

Consider Wikipedia, whose templates often have a "locale" or "place" field, used, for example, on these articles about railway stations:

Likewise, the Wikipedia template for organisations, in which a "headquarters" address (for a business, for example) may contain a full or partial postal address, or just a city/county or city/country pair:

implied single adr subproperty

I propose that, where adr has content, but no explicit sub-properties, there should be a default sub-property to which that content is allocated, in order that it is captured by user agents, and can later be manually tweaked (in, say, an address book programme) by users if so desired. This would satisfy the vCard requirement for child-of-adr, and adhere to the general principle to "be strict in what you send but generous in what you receive".

  • Note that there may be other reasons to consider this suggestion, such as "ease of authoring". Another way of looking at this suggestion is as a "adr/extended-address shorthand". Tantek 08:28, 26 Mar 2007 (PDT)
  • there is also a LABEL property which is NOT structured data, but purely a text string to be used when labeling. LABEL purpose: To specify the formatted text corresponding to delivery address of the object the vCard represents. Brian 13:18, 30 Mar 2007 (UTC)
    • On re-reading this, it seems that none of the addresses given in my examples meet the criteria of being "formatted text corresponding to delivery address". Andy Mabbett 03:35, 17 Apr 2007 (PDT)

Of the available sub-property options:

  • street-address
  • extended-address
  • region
  • locality

I suggest that "extended-address" is the most sensible sub-property to use, for this purpose. Andy Mabbett 03:57, 26 Mar 2007 (PDT)

implied adr subproperties

It may be possible for parsers to parse out adr subproperties from a contiguous adr string. This would be an optimization for both adr and hCard.

This may also be too difficult/complex to be dependable or interoperable, but it is worth at least documenting our considerations and analysis either way.

Examples:

IBM's Employee Directory search returns hCards with the "adr" property which contain the "locality" and "country-name" data but unfortunately without being marked up as such, e.g.:

<td class="adr">Austin, USA</td>

We could first define a canonical ordering of how to parse for comma (and perhaps in some cases space) separated adr subproperties within an adr string e.g.:

  • 'post-office-box'
  • 'street-address'
  • 'extended-address'
  • 'locality'
  • 'region'
  • 'postal-code'
  • 'country-name'

Given a dictionary of country names and abbreviations, it may be feasible to parse for a country name at the end of the adr string, and then apply country/locale specific parsing rules to the rest of the adr string.

E.g.

  • from a theoretical dictionary of country names:
    • US|USA|United States|United States of America|Etats-Unis d'Amerique
  • parse the remainder of the adr string backwards as follows:
    • preceding that, look for a 5 or 9 digit (with optional dash '-' separator between digits 5 and 6) postal-code, and if found use it for the 'postal-code'
    • preceding that, look for the name of a US state (e.g. California or any of the other states or territories available from a canonical list) or 2 letter state abbreviation (e.g. CA), and if found use it for the 'region' subproperty
    • preceding that, look for the name of a US city (e.g. San Francisco, Los Angeles or any other US city available from a canonical list) or common city abbreviation (e.g. SF, LA), and if found use it for the 'locality'
    • preceding that, look for common extended address details, such as: #|apt|apartment|suite|ste followed by a word consisting of letters and numbers, and if found use it for the 'extended-address'
    • preceding that, look for a common street name bracketed by the street number (an integer with optional fraction and/or letter), and an optional street type (av|ave|avenue|blvd|boulevard|cir|circle|pl|place|st|street), and if found use it for the 'street-address'
    • preceding that, look for a common post office box, with the pobox literal string: pob|pobox|PO Box followed by a word consisting of numbers and letters, and if found use it for the 'post-office-box'
  • ... other countries

The above heuristic (not quite well specified enough to be an algorithm, yet) would allow parsing of the IBM Employee Directory result documented above.
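
To make the heuristic a little more concrete, here is a deliberately small sketch of the backwards parse for the US case. It only handles comma-separated parts and only the country-name, postal-code, region and locality steps; the street-level steps are left out. The dictionaries are tiny made-up samples standing in for the "canonical lists" mentioned above, and `parse_us_adr` is a hypothetical name.

```python
import re

# Tiny illustrative dictionaries; real ones would come from canonical lists.
US_NAMES = {"us", "usa", "united states", "united states of america"}
US_STATES = {"california", "ca", "texas", "tx"}
US_CITIES = {"austin", "san francisco", "sf", "los angeles", "la"}
# 5 or 9 digits, with an optional dash between digits 5 and 6.
US_POSTAL_CODE = re.compile(r"\d{5}(-?\d{4})?")


def parse_us_adr(adr):
    """Guess adr subproperties from a comma-separated string ending in a US name."""
    parts = [p.strip() for p in adr.split(",") if p.strip()]
    result = {}
    if not parts or parts[-1].lower() not in US_NAMES:
        return result  # other countries are not handled in this sketch
    result["country-name"] = parts.pop()
    # Work backwards through the remaining parts, as described above.
    if parts and US_POSTAL_CODE.fullmatch(parts[-1]):
        result["postal-code"] = parts.pop()
    if parts and parts[-1].lower() in US_STATES:
        result["region"] = parts.pop()
    if parts and parts[-1].lower() in US_CITIES:
        result["locality"] = parts.pop()
    return result
```

Applied to the IBM example, "Austin, USA" comes out as locality "Austin" and country-name "USA". Space-separated forms such as a state followed directly by a postal code would need extra tokenizing that this sketch does not attempt.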

There are a lot of existing geocoder APIs that turn unstructured addresses into structured ones; we should examine these for patterns and best practices, e.g. Google's geocoder. geopy calls multiple ones.

adr without children FAQ

I think for now the simplest and most interoperable approach (and what I think implementations already do) is to make this an FAQ, because the spec already doesn't say to do anything with adr without any subproperty.

Q: What should a parser do with an "adr" property lacking any subproperties?

A: A parser SHOULD do nothing with such an "adr" property. A parser MAY provide the text content of such an "adr" property in the results of its parsing as a freeform value of the "adr" property. Note that the vCard standard does not allow for any such freeform value of its "adr" property (in vCard the "adr" property MUST be structured) and thus that MAY suggestion to parsers only applies in situations (such as APIs, JSON return values) where it is possible to return a freeform value for the "adr" property.

Tantek 09:20, 2 Aug 2007 (PDT)


tel parsing

  • ignore (0) in tel property values
    • e.g. British/UK telephone numbers often have a (0) in them to indicate how to dial #s locally.
  • introduce 'tel' as a special type, so that it can trigger parsing for "tel:" URLs in a href elements.
    • as well as incorporate the special "ignore (0)" rule above.
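
A possible shape for these two rules, as a sketch only. It assumes the parser hands over the element's text content and, when the element is a hyperlink, its 'href' attribute; preferring the "tel:" URL over the text content is my reading of the proposal, not something specified yet. Both function names are made up.

```python
import re


def normalize_tel(value):
    """Drop a parenthesised (0) and tidy whitespace in a tel value."""
    without_zero = re.sub(r"\(\s*0\s*\)", "", value)
    return " ".join(without_zero.split())


def tel_from_element(text, href=None):
    """Use a tel: href when present, else the text; ignore (0) either way."""
    if href and href.strip().lower().startswith("tel:"):
        raw = href.strip()[len("tel:"):]
    else:
        raw = text
    return normalize_tel(raw)
```

So a UK-style value written as "+44 (0) 20 7946 0018" would be parsed as "+44 20 7946 0018".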

Some nice to haves (parser related only in that they may require additional parsing related code)

  • tel content to markup generator that generates the "tel:" URL
    • could incorporate into the hCard Creator.
  • tel validator

See also

The hCard specification is a work in progress. As additional aspects are discussed, understood, and written, they will be added. These thoughts, issues, and questions are kept in separate pages.