machine-data: Difference between revisions

From Microformats Wiki

Revision as of 12:03, 26 June 2008

Machine Data in Microformats

Microformats are designed to mark up human-consumable information, as commonly found in the wild. In a number of exceptional cases, however, it has been necessary to specify precise data formats for particular properties. Formats for dates, times and locations are standardised in a way that doesn't always match the way information is visibly published. This is necessary to make the data understandable to parsers. Similarly, some keywords in hCard must be written in English (the telephone ‘type’, for example).

These data formats must be fixed to make the data parsable by machines; the cost for a parser of supporting every commonly published date-time format in the world (including approximations like ‘five minutes ago’) is too high, as is the cost of handling international translations (for example, a mobile telephone is ‘cell’ in US English but ‘mobile’ in British English).

In some cases, the human version of the data can be semantically described as an abbreviated form of the machine data, and the machine data may also be human consumable. For example, the date-design-pattern uses HTML's abbr element to expand a human date representation into its ISO 8601 form: ‘January 1st’ is an abbreviated form of ‘2008-01-01’. The latter is also legible to humans (and can be exposed to them through tool-tips and assistive screen readers).

In other cases, this machine data is not legible to humans. In hAudio, the duration property uses ISO 8601, resulting in machine data of PT3M23S; not understandable to humans, and therefore not a valid expansion of ‘three minutes and twenty-three seconds’.
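
To illustrate why a fixed format such as PT3M23S is convenient for machines even though it is opaque to people, here is a minimal Python sketch of a duration reader. It assumes only the hours, minutes and seconds subset of ISO 8601 durations; the function name parse_duration is hypothetical and not part of any microformats parser.

```python
import re

# Handles only the time-based subset (hours, minutes, seconds), e.g. "PT3M23S".
# Date components, weeks and fractional values are deliberately not supported.
_DURATION = re.compile(r"^PT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?$")


def parse_duration(value: str) -> int:
    """Return the total number of seconds in a simple ISO 8601 duration."""
    match = _DURATION.match(value)
    # Reject strings that do not match, and a bare "PT" with no components.
    if match is None or not any(match.groups()):
        raise ValueError(f"unsupported duration: {value!r}")
    hours, minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return hours * 3600 + minutes * 60 + seconds


print(parse_duration("PT3M23S"))  # 203
```

A parser for the human form (‘three minutes and twenty-three seconds’, ‘3:23’, ‘3 min 23 s’ and so on) would need to handle far more variation than this single pattern.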

Cases of Fixed Data Formats in Microformats

The following are all current uses of fixed format machine data required by the various microformats.

hCalendar

  • Uses ISO 8601 for dtstart, dtend, duration and rdate

hCard

  • Telephone type keywords: voice, home, msg, work, pref, fax, cell, video, pager, bbs, modem, car, isdn, pcs.
  • Address type keywords: INTL, POSTAL, PARCEL, WORK, dom, home, pref.
  • Email type keywords: INTERNET, x400, pref.
  • Uses ISO 8601 date for bday
  • ISO 8601 time zone for tz
  • Telephone numbers require a numerical form, whilst they may be published in alphanumeric form: e.g. +1-555-FORMATS
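
As a rough illustration of the gap between the published and the numerical form of a telephone number, the following Python sketch maps keypad letters to digits. It assumes the usual telephone keypad letter layout; to_numeric is a hypothetical helper, not something defined by hCard.

```python
# Letter groups as found on the usual telephone keypad (an assumption of this sketch).
_KEYPAD = {
    "2": "ABC", "3": "DEF", "4": "GHI", "5": "JKL",
    "6": "MNO", "7": "PQRS", "8": "TUV", "9": "WXYZ",
}
_LETTER_TO_DIGIT = {
    letter: digit for digit, letters in _KEYPAD.items() for letter in letters
}


def to_numeric(phone: str) -> str:
    """Replace keypad letters with digits; leave every other character untouched."""
    return "".join(_LETTER_TO_DIGIT.get(ch.upper(), ch) for ch in phone)


print(to_numeric("+1-555-FORMATS"))  # +1-555-3676287
```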

hReview

  • Uses an ISO 8601 date-time for dtreviewed
  • Uses fixed-point integer values from 0-5 for rating (publishers may, for example, display a percentage rating)

hAtom

  • Uses ISO 8601 date-time for updated
  • Uses ISO 8601 date-time for published

hResume

  • Uses ISO 8601 date for individual experience items.
  • Uses ISO 8601 date for individual education items.

Geo

  • Requires latitude and longitude in decimal form (1.23232;-2.343535), but may be published in degrees: N 37° 24.491, W 122° 08.313
  • Locations are most often published just as place names (not abbreviated co-ordinates)
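
The conversion from the published degrees form to the decimal form that geo requires can be sketched as follows. This is a minimal Python example that assumes the exact ‘N 37° 24.491, W 122° 08.313’ layout (hemisphere letter, whole degrees, decimal minutes, latitude first); to_decimal is a hypothetical name, and real published coordinates vary far more than this pattern allows.

```python
import re

# Hemisphere letter, whole degrees, then decimal minutes, e.g. "N 37° 24.491".
_COORD = re.compile(r"([NSEW])\s*(\d+)°\s*(\d+(?:\.\d+)?)")


def to_decimal(text: str) -> tuple[float, float]:
    """Convert 'N 37° 24.491, W 122° 08.313' to (latitude, longitude)."""
    parts = _COORD.findall(text)
    # Expect exactly one latitude (N/S) followed by one longitude (E/W).
    if len(parts) != 2 or parts[0][0] not in "NS" or parts[1][0] not in "EW":
        raise ValueError(f"unsupported coordinate format: {text!r}")

    def signed(hemisphere: str, degrees: str, minutes: str) -> float:
        value = int(degrees) + float(minutes) / 60
        # South and west are negative in decimal notation.
        return -value if hemisphere in "SW" else value

    return signed(*parts[0]), signed(*parts[1])


lat, lon = to_decimal("N 37° 24.491, W 122° 08.313")
print(f"{lat:.6f};{lon:.6f}")  # 37.408183;-122.138550
```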

hAudio

  • Uses ISO 8601 for track duration, e.g. PT3M23S

Embedding Fixed Data Formats in Microformats

There are currently four supported methods of including these fixed data formats in a microformatted document.

As Visible Page Content

You may use the standard class-design-pattern to mark up the data visibly in the page.

Ben was born on <span class="bday">1984-02-09</span>.

We're meeting up on Northumberland Avenue (<span class="geo">51.507033,-0.126343</span>).

As An Abbreviation

In some cases, the data formats specified make valid expansions of common human forms, such as dates in an hCard birthday field:

Ben was born on <abbr class="bday" title="1984-02-09">9th February</abbr>

Note, however, that not all data formats are valid expansions. In HTML, the abbr element is working semantically at a text level, not a data level. Both the abbreviated form (the inner text) and the expanded form (the title) need to be consumable by humans.
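
From a parser's point of view, the visible-content and abbreviation methods differ only in where the value is read from. The Python sketch below shows one way that choice might look, using the standard library's html.parser. The class PropertyParser and the function parse_property are hypothetical names for illustration, and the sketch assumes well-formed mark-up with no void elements (such as br) inside the property.

```python
from html.parser import HTMLParser


class PropertyParser(HTMLParser):
    """Collect the value of one class-named property from an HTML fragment."""

    def __init__(self, prop: str):
        super().__init__()
        self.prop = prop
        self.value = None   # final property value, once found
        self._depth = 0     # > 0 while inside the property element
        self._text = []     # inner text collected for the non-abbr case

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self._depth:
            self._depth += 1
        elif self.value is None and self.prop in (attrs.get("class") or "").split():
            if tag == "abbr" and attrs.get("title"):
                self.value = attrs["title"]   # abbr pattern: read the title
            else:
                self._depth = 1               # otherwise read the inner text

    def handle_endtag(self, tag):
        if self._depth:
            self._depth -= 1
            if self._depth == 0:
                self.value = "".join(self._text).strip()

    def handle_data(self, data):
        if self._depth:
            self._text.append(data)


def parse_property(html: str, prop: str):
    parser = PropertyParser(prop)
    parser.feed(html)
    parser.close()
    return parser.value


print(parse_property(
    'Ben was born on <abbr class="bday" title="1984-02-09">9th February</abbr>',
    "bday",
))  # 1984-02-09
print(parse_property('<span class="bday">1984-02-09</span>', "bday"))  # 1984-02-09
```

Both forms yield the same machine value; the difference lies entirely in what the human reader sees.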

This means that in hAudio, using an abbreviation for duration is incorrect:

<abbr class="duration" title="PT3M23S">3 minutes, 23 seconds</abbr>

Whilst the data ‘PT3M23S’ is an expanded form of ‘3 minutes, 23 seconds’, the text is not; ‘PT3M23S’ is nonsense to most human beings. abbr is an element that describes the text, not the data. HTML4 has no way to mark up arbitrary data.

As Supplementary Data using the value-excerption-pattern

The machine data form can be included alongside any human legible text, and hidden using another layer of the browser stack (namely, CSS). This behaviour is documented as the value-excerption-pattern, and derived from the value excerpting behaviour in hCard.

So, for example, when describing a location by name, but still wanting to include geo for the machine-readable location:

<span class="geo">Northumberland Avenue, London <span class="value">51.507033,-0.126343</span></span>

Then, optionally use CSS to hide the data you don't want displayed:

.geo > .value { display: none; }

The same pattern works for the hAudio duration example given above:

<span class="duration">3 minutes and 23 seconds <span class="value">PT3M23S</span></span>

And again, the optional CSS:

.haudio .duration > .value { display: none; }

Of course, this does result in a dependency on CSS to make the data invisible to users, and will result in the machine data being displayed alongside the human form in any user agent without CSS support. That's a compromise that has to be resolved based on the requirements of individual sites.
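
The value-excerpting behaviour described above can be sketched in Python as follows. This is an illustrative reading of the pattern rather than a reference implementation: the class ValueExcerptParser and the function excerpt are hypothetical names, only the first matching property is read, and the mark-up is assumed to be well-formed with no void elements inside the property.

```python
from html.parser import HTMLParser


class ValueExcerptParser(HTMLParser):
    """Read a property, preferring the text of any class="value" descendants."""

    def __init__(self, prop: str):
        super().__init__()
        self.prop = prop
        self._depth = 0          # nesting depth inside the property element
        self._value_depth = 0    # nesting depth inside a class="value" element
        self._all_text = []      # every piece of text inside the property
        self._value_text = []    # text found inside class="value" elements only
        self._has_value = False
        self._done = False

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self._value_depth:
            self._value_depth += 1
        elif self._depth and "value" in classes:
            self._value_depth = 1
            self._has_value = True
        if self._depth:
            self._depth += 1
        elif self.prop in classes and not self._done:
            self._depth = 1

    def handle_endtag(self, tag):
        if self._value_depth:
            self._value_depth -= 1
        if self._depth:
            self._depth -= 1
            if self._depth == 0:
                self._done = True

    def handle_data(self, data):
        if self._depth:
            self._all_text.append(data)
        if self._value_depth:
            self._value_text.append(data)

    def result(self):
        if not self._done:
            return None
        # Use the excerpted value when one exists, else the whole inner text.
        chosen = self._value_text if self._has_value else self._all_text
        return "".join(chosen).strip()


def excerpt(html: str, prop: str):
    parser = ValueExcerptParser(prop)
    parser.feed(html)
    parser.close()
    return parser.result()


print(excerpt(
    '<span class="geo">Northumberland Avenue, London '
    '<span class="value">51.507033,-0.126343</span></span>',
    "geo",
))  # 51.507033,-0.126343
```

Note that the parser never consults CSS: hiding the value with display: none affects only what the user sees, not what is extracted.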

As an empty abbr element

Value excerpting can be combined with the parsing rules of the ABBR pattern to embed data without exposing it to humans.

<span class="duration">3 minutes and 23 seconds <abbr class="value" title="PT3M23S"></abbr></span>

The optional CSS:

.haudio .duration > .value { display: none; }

The dependence on CSS is significantly reduced, as user agents will not render empty elements to the page (they take no physical space, and thus do not expose tool-tips), nor are empty elements exposed to assistive technology. Whilst this effectively hides the machine data from humans, it still relies on an abuse of the semantics of the abbr element (an empty, zero-length string is not an abbreviation). It is best regarded as a parsing quirk, but if none of the other available options can work for you, this will parse.

It is recommended that you comment uses of this technique in your code as a temporary hack, and check back for future techniques that provide this functionality more gracefully.
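
The reason this form parses is that the title-reading rule of the ABBR pattern is applied to the element carrying class value. A minimal Python sketch of that combination follows, under the assumption that a non-empty title on an abbr takes precedence over its inner text; ValueCollector is a hypothetical name and the code is not taken from any existing parser.

```python
from html.parser import HTMLParser


class ValueCollector(HTMLParser):
    """Gather the contents of class="value" elements, preferring abbr titles."""

    def __init__(self):
        super().__init__()
        self.values = []
        self._in_value = 0   # nesting depth inside a text-bearing value element

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self._in_value:
            self._in_value += 1
        elif "value" in (attrs.get("class") or "").split():
            if tag == "abbr" and attrs.get("title"):
                # ABBR pattern: the machine data lives in the title attribute,
                # so the element itself may be empty.
                self.values.append(attrs["title"])
            else:
                self.values.append("")   # filled in by handle_data
                self._in_value = 1

    def handle_endtag(self, tag):
        if self._in_value:
            self._in_value -= 1

    def handle_data(self, data):
        if self._in_value:
            self.values[-1] += data


collector = ValueCollector()
collector.feed(
    '<span class="duration">3 minutes and 23 seconds '
    '<abbr class="value" title="PT3M23S"></abbr></span>'
)
print(collector.values)  # ['PT3M23S']
```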

Proposed Methods

As Invisible Supplementary Data

The main content of this section, which focussed on using the value-excerption-pattern in a way that is hidden at the mark-up level, has been moved into value-excerption-pattern-issues for easier tracking.

These are proposals; note that they are not endorsed or supported. You should not use any of these patterns when publishing a page, but you may like to get involved to help develop these ideas.

Proposals for more publisher-friendly mark-up for machine-data

  • Extend the value-excerption-pattern to parse title attributes from empty elements with class value. Empty elements are ignored by browsers in the DOM, so won't render a tool-tip or be exposed to screen readers. Of course, it means putting an empty element into mark-up. Please post responses to this on the value-excerption-pattern-issues page.
  • Embed the data into the class attribute, where it will not be rendered or exposed.
  • Add a ‘ufusetitle’ class name to particular elements, as a processing instruction to parsers to read the title attribute rather than the inner text. However, this adds a new concept of putting ‘parsing instructions’ into the class attribute, which is currently only used to extend the semantics of elements.
  • Break with requirements for valid HTML and adopt the RDFa content attribute as a means of embedding data (or another custom attribute). This results in invalid HTML.
    • Depends on what DTD you use. It would not be much work to create a DTD that added a "content" attribute to a few elements. Or simply validate against the RDFa DTD.
  • Use the title attribute on any HTML element for data embedding, and prefix the machine data with the string ‘data:’. Users exposed to the data string are given some context. This reduces the impact of the problem on the user side, and assistive technology is less likely to expose the title attribute of generic elements such as span than that of abbr. This avoids adding empty elements to a page, but continues to expose the machine data to human users through tool-tips.
  • Repurpose the input element with a type=hidden attribute. Browsers hide this element from users completely. However, this would stretch the semantics of input, using a forms device for output.

A final suggestion is that on a case-by-case basis, all the above documented machine-data patterns could be reworked in a human accessible, publisher compatible, internationalizable and machine-parsable manner. Previously, such solutions have not been forthcoming.

Acknowledgements

  • James Craig and Bruce Lawson for suggesting the use of an empty span as a means of embedding data without it being exposed in assistive technology
  • Jeremy Keith for assistance refining the invisible data extension to value-excerpting

Related Pages