machine-data: Difference between revisions

From Microformats Wiki
m (Replace <entry-title> with {{DISPLAYTITLE:}})
 
(18 intermediate revisions by 6 users not shown)
Line 1: Line 1:
<h1>Machine Data in Microformats</h1>
{{DISPLAYTITLE:Machine Data in Microformats}}


{{TOC-right}}
{{TOC-right}}
Line 17: Line 17:
===hCalendar===
===hCalendar===


* Uses ISO 8601 for <code>dtstart</code>, <code>dtend</code>, <code>duration</code> and <code>rdate</code>
* Uses ISO 8601 for <code>dtstart</code>, <code>dtend</code>, <code>duration</code>, <code>rdate</code> and <code>exdate</code>
* enumerated value for the <code>role</code> subproperty of the <code>attendee</code> property. Example documented in [[hcalendar-brainstorming#hCard_attendees|hCalendar brainstorming: hCard attendees]]


===hCard===
===hCard===
Line 29: Line 30:


===hReview===
===hReview===
* Uses an ISO 8601 date-time for <code>dtreviewed</code>
* Uses an ISO 8601 date-time for <code>dtreviewed</code>
* Uses fixed-point integer values from <var>0</var>-<var>5</var> for <code>rating</code> (publishers may, for example, display a percentage rating)


===hAtom===
===hAtom===
Line 51: Line 50:


* Uses ISO 8601 for track <code>duration</code>, e.g. <code>PT3M23S</code>
* Uses ISO 8601 for track <code>duration</code>, e.g. <code>PT3M23S</code>
== Misconceptions of Fixed Data Formats in Microformats==
There are also cases (at least one) of apparent fixed data formats in microformats which should not require a separate value to be provided.  It is useful to document these as a way to clear up apparent misconceptions.
===hReview===
* ''Uses fixed-point integer values from <var>0</var>-<var>5</var> for <code>rating</code> (publishers may, for example, display a percentage rating)''
There are several misconceptions here.
# The ''default'' rating values in [[hReview]] are [[hreview#In_General|from 1.0-5.0]] (not 0-5)
# hReview permits the author to state their own 'worst' to 'best' range for any given 'rating'.
Thus a publisher that wants to display a percentage rating can do so by simply specifying a 'worst' value for a rating of 0, and a 'best' value for a rating of 100.  Then the actual percentage rating can simply be marked up inline and no separate machine value is necessary.


==Embedding Fixed Data Formats in Microformats==
==Embedding Fixed Data Formats in Microformats==
Line 78: Line 89:
Whilst the ''data'' ‘PT3M23S’ is an expanded form of ‘3 minutes, 23 seconds’, the text is not; ‘PT3M23S’ is nonsense to most human beings. <code>abbr</code> is an element that describes the ''text'', not the data. HTML4 has no way to mark up arbitrary data.
Whilst the ''data'' ‘PT3M23S’ is an expanded form of ‘3 minutes, 23 seconds’, the text is not; ‘PT3M23S’ is nonsense to most human beings. <code>abbr</code> is an element that describes the ''text'', not the data. HTML4 has no way to mark up arbitrary data.


===As Supplementary Data using the value-excerption-pattern===
===Using the value-class-pattern===
 
The machine data form can be included alongside any human legible text, and hidden using another layer of the browser stack (namely, CSS). This behaviour is documented as the [[value-excerption-pattern]], and derived from the [[hcard#Value_excerpting|value excerpting]] behaviour in hCard.
 
So, for example, when describing a location by name, but still wanting to include [[geo]] for the machine-readable location:
 
<code>&lt;span class="geo">Northumberland Avenue, London &lt;span class="value">51.507033,-0.126343&lt;/span>&lt;/span></code>
 
Then, optionally use CSS to hide the data you don't want displayed:
 
<code>.geo > .value { display: none; }</code>
 
The same pattern works for the [[hAudio]] <code>duration</code> example given above:
 
<code>&lt;span class="duration">3 minutes and 23 seconds &lt;span class="value">PT3M23S&lt;/span>&lt;/span></code>
 
And again, the optional CSS:
 
<code>.haudio .duration > .value { display: none; }</code>
 
Of course, this does result in a dependency on CSS to make the data invisible to users, and will result in the machine data being displayed alongside the human form in any user agent without CSS support. That's a compromise that has to be resolved based on the requirements of individual sites.
 
== Proposed Methods ==
 
===As Invisible Supplementary Data===
 
''This section is a proposed extension to value-excerpting, is currently open to active discussion and is not currently supported in parsers. You '''must not''' implement this in pages at this time.''
 
Value excerpting is already implemented as a means of extracting data from within a microformat property. Where the element (''any'' element) with <code>class="value"</code> is also empty (containing no inner-text), parsers should instead read the value of the <code>title</code> attribute of that element.
 
So, the following code will read the inner-text of the element, as per current implementations:
 
<code>&lt;span class="dtstart">Tomorrow lunchtime &lt;span class="value">2008-05-17T12:00:00+0100&lt;/span>&lt;/span></code>
 
The data format, poorly legible to most humans, remains visible in the page. To make the machine data invisible at an HTML level, the following can also be parsed:
 
<code>&lt;span class="dtstart">Tomorrow lunchtime &lt;span class="value" title="2008-05-17T12:00:00+0100">&lt;/span>&lt;/span></code>
 
The <code>span</code> with <code>class="value"</code> is empty, therefore the parser must read the value of the <code>title</code> attribute instead. Where the element is ''not empty'', the <code>title</code> attribute is ignored.
 
Empty elements are invisible within the page, and take up no physical space, so the <code>title</code> is not exposed as a tool-tip. Also, the entire element is ignored by assistive technology such as screen readers, therefore the data is not exposed to any user. ''This assistive technology claim is based on informed, expert advice, but is awaiting confirmation through testing. That testing is forthcoming''.
 
HTML has no pure way of including machine data inline. The use of value excerpting, which can be applied to any HTML element, is the least obtrusive way to embed data into HTML, without overloading existing element semantics and without browser compatibility issues.
 
==== Problems ====


* '''Violates the microformats [[principle]] of visible data.''' Numerous previous efforts (e.g. markup in comments etc.) have walked down that path of "dark data" and failed in practice. We must hold ourselves to higher standards than any XML/RDF solution.  It's part of what sets microformats apart from so many other failed efforts at data representation on the web. We must not go down the path of dark data. IMHO that principle is inviolable for [[microformats]]. [[User:Tantek|Tantek]]
See [[value-class-pattern]].
** The approach here is that we have ''exceptional'' situations where we are requiring data to be duplicated for machines. They are exceptions which have existed in microformats since hCard, and this is a pattern to handle those exceptions and '''only''' those exceptions in response to the problems people have publishing them. The specification for this could be written to make it a per-property opt-in device, only for those properties identified above. '''This is not a ‘generic data embedding’ device''' and in line with the cited principles, should not be allowed to become one. --[[User:BenWard|BenWard]] 05:17, 25 Jun 2008 (PDT)
** An alternative, I suppose, would be to recognise all of the above data format examples as being in violation of the microformats principle, since authors are hiding them in favour of their own content. Every instance of fixed data formats in microformats that force authors to break the invisible data principle would need to be eliminated in favour of accessible, i18n compatible replacements, including those in hCard which are 1:1 mappings from vCard. We _could_ undertake that, but previous discussions (people being advised to misuse ABBR for translation of the vCard telephone types, for example) have already suggested that supporting the visible publishing is too complex. --[[User:BenWard|BenWard]] 05:17, 25 Jun 2008 (PDT)
* '''Worsens the [[principles#related|DRY]] violation''' by separating the human visible version and machine readable version into separate elements.  Duplicate data itself is bad, but at least by keeping the duplicates local on the same element (as the existing abbr-pattern does), the risk of drift/divergence is reduced. The greater the distance in content of the duplicates, the greater the risk of drift/divergence, and thus the lower the quality of data. This has been illustrated by the divergence of invisible metadata in the head of a document versus the content in the body, and even more so across documents.
**The machine-data form is kept as a sibling of the human form, and in distance in code, is not much further away than the data stored on a single element's <code>title</code> attribute. Further, the specification for this could demand the value element be placed as the _first child_ of the parent property, forcing it to be published immediately after the property element. --[[User:BenWard|BenWard]] 05:17, 25 Jun 2008 (PDT)
* Some parsers (particularly those that run incoming HTML through [http://tidy.sf.net Tidy] to convert it into well-formed XML) may strip empty inline elements. A workaround may be to allow (or even require) hard white space (i.e. <code>&amp;nbsp;</code>) within the element with class="value".
**It is, however, trivial to patch and build Tidy not to do this (keeping empty elements where that element also has a class attribute). Parser writers need to feed back on whether using a custom build is impossible to their solution, but since Tidy can be made to work, the problem can likely be alleviated. Ben Ward has put up an experimental build of Tidy with patched element-dropping behaviour here: [http://ben-ward.co.uk/files/tidy-microformats.zip tidy-microformats.zip]
*** Tidy is not just used in parsers, but also by publishers, as part of CMSes, etc.


=== Other Proposals ===


* Use a ‘<code>ufusetitle</code>’ class name to particular elements, as a processing instruction to parsers to read the <code>title</code> attribute rather than inner text. However, this adds a new concept of putting ‘parsing instructions’ into the <code>class</code> attribute, which is currently only used to extend the <em>semantics</em> of elements.
* Break with requirements for valid HTML and adopt the RDFa <code>content</code> attribute as a means of embedding data (or another custom attribute). This results in invalid HTML.
** Depends on what DTD you use. It would not be much work to create a DTD that added a "content" attribute to a few elements. Or simply validate against the RDFa DTD.
* Use the title attribute on <em>any</em> HTML element for data embedding, and prefix the machine data with the string ‘<code>data:</code>’. Users exposed to the data string are given some context. This reduces the impact of the problem at the user side, and assistive technology is less likely to expose the title attribute of generic elements such as <code>span</code> as they are for <code>abbr</code>. This avoids adding empty elements to a page, but continues to expose the machine data to human users through tool-tips.
* Repurpose the <code>input</code> element with a <code>type=hidden</code> attribute. Browsers hide this element from users completely. However, this would stretch the semantics of <code>input</code>; using an <em>input</em>, forms device for <em>output</em>.


==Acknowledgements==


* James Craig and Bruce Lawson for the suggestion using an empty <code>span</code> as a means of embedding data without it being exposed in assistive technology
* [http://adactio.com Jeremy Keith] for assistance refining the invisible data extension to value-excerpting


==Related Pages==
==Related Pages==

Latest revision as of 16:28, 18 July 2020


Microformats are designed to mark up human-consumable information, as commonly found in the wild. In a number of exceptional cases, however, it has been necessary to specify precise data formats for particular properties. Formats for dates, times and locations are standardised in a way that doesn't always match the way the information is visibly published. This is necessary to make the data understandable to parsers. Similarly, there are keywords in hCard that must be written in English (telephone ‘type’, for example).

It is necessary for these data formats to be fixed to make the data parsable by machines; the cost for a parser to support every commonly published date-time format in the world (including approximations like ‘five minutes ago’) is too high, as is handling international translation (such as mobile telephones; US-English ‘cell’ published as British English ‘mobile’).

In some cases, the human version of the data can be semantically described as an abbreviated form of the machine data, and the machine data may also be human consumable. For example, the date-design-pattern uses HTML's abbr element to expand one human date representation into the ISO 8601 form date: ‘January 1st’ is an abbreviated form of ‘2008-01-01’. The latter is also legible to humans (and can be exposed to them through tool-tips and assistive screen readers).

In other cases, this machine data is not legible to humans. In hAudio, the duration property uses ISO 8601, resulting in machine data of PT3M23S; not understandable to humans, and therefore not a valid expansion of ‘three minutes and twenty-three seconds’.
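To illustrate why the ISO 8601 form suits parsers better than readers, here is a minimal Python sketch that turns a duration such as PT3M23S into a number of seconds. The function name `duration_to_seconds` is hypothetical, and the pattern is an assumption that covers only days, hours, minutes and whole seconds; years, months, weeks and fractional values are deliberately left out.

```python
import re

# Simplified ISO 8601 duration pattern (an assumption of this sketch):
# optional days, then an optional time part with hours, minutes, seconds.
_DURATION = re.compile(
    r"^P(?:(?P<days>\d+)D)?"
    r"(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?(?:(?P<seconds>\d+)S)?)?$"
)


def duration_to_seconds(value: str) -> int:
    """Convert a simple ISO 8601 duration string to a total in seconds."""
    match = _DURATION.match(value)
    # "P" and "PT" match the pattern but carry no components, so reject them.
    if match is None or value in ("P", "PT"):
        raise ValueError(f"unsupported duration: {value!r}")
    parts = {name: int(text) if text else 0
             for name, text in match.groupdict().items()}
    hours = parts["days"] * 24 + parts["hours"]
    return (hours * 60 + parts["minutes"]) * 60 + parts["seconds"]


print(duration_to_seconds("PT3M23S"))  # 3 minutes and 23 seconds
```

A publisher-facing tool could use the same components in reverse to generate the human text, which is the direction authors actually care about.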

Cases of Fixed Data Formats in Microformats

The following are all current uses of fixed format machine data required by the various microformats.

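hCalendar

  • Uses ISO 8601 for dtstart, dtend, duration, rdate and exdate
  • Enumerated value for the role subproperty of the attendee property. Example documented in hCalendar brainstorming: hCard attendees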

hCard

  • Telephone type keywords: voice, home, msg, work, pref, fax, cell, video, pager, bbs, modem, car, isdn, pcs.
  • Address type keywords: INTL, POSTAL, PARCEL, WORK, dom, home, pref.
  • Email type keywords: INTERNET, x400, pref.
  • Uses ISO 8601 date for bday
  • ISO 8601 time zone for tz
  • Telephone numbers require a numerical form, whilst phone numbers may be published in alphanumeric form, e.g. +1-555-FORMATS
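The gap between the published and the numerical telephone form can be sketched in a few lines of Python. The mapping below assumes the conventional telephone keypad letter layout (2 = ABC through 9 = WXYZ); this page does not define that mapping, and the helper name `phone_to_numeric` is hypothetical.

```python
# Conventional keypad layout, assumed here; not defined by hCard itself.
_KEYPAD = {
    "2": "ABC", "3": "DEF", "4": "GHI", "5": "JKL",
    "6": "MNO", "7": "PQRS", "8": "TUV", "9": "WXYZ",
}
_LETTER_TO_DIGIT = {
    letter: digit for digit, letters in _KEYPAD.items() for letter in letters
}


def phone_to_numeric(published: str) -> str:
    """Replace keypad letters with digits, leaving other characters as-is."""
    return "".join(_LETTER_TO_DIGIT.get(ch.upper(), ch) for ch in published)


print(phone_to_numeric("+1-555-FORMATS"))  # +1-555-3676287
```

The conversion is one-way: a parser cannot recover FORMATS from the digits, which is why the human form and the machine form both need to be published.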

hReview

  • Uses an ISO 8601 date-time for dtreviewed

hAtom

  • Uses ISO 8601 date-time for updated
  • Uses ISO 8601 date-time for published

hResume

  • Uses ISO 8601 date for individual experience items.
  • Uses ISO 8601 date for individual education items.

Geo

  • Requires latitude and longitude in decimal form (1.23232;-2.343535), but may be published in degrees: N 37° 24.491, W 122° 08.313
  • Locations are most often published just as place names (not abbreviated co-ordinates)
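As a rough illustration of the conversion a publisher has to perform for geo, the following Python sketch turns the degrees and decimal minutes form shown above into signed decimal degrees. The function name `degrees_to_decimal` and the accepted input pattern are assumptions of this sketch; real-world coordinates are published in many more variants (seconds, trailing hemisphere letters and so on).

```python
import re

# Matches e.g. "N 37° 24.491": hemisphere letter, whole degrees, decimal minutes.
_COORD = re.compile(r"([NSEW])\s*(\d+)°\s*(\d+(?:\.\d+)?)")


def degrees_to_decimal(published: str) -> tuple[float, float]:
    """Return (latitude, longitude) in decimal degrees."""
    found = {}
    for hemisphere, degrees, minutes in _COORD.findall(published):
        value = int(degrees) + float(minutes) / 60
        if hemisphere in "SW":  # south and west are negative
            value = -value
        axis = "latitude" if hemisphere in "NS" else "longitude"
        found[axis] = value
    if set(found) != {"latitude", "longitude"}:
        raise ValueError(f"could not read both coordinates from {published!r}")
    return found["latitude"], found["longitude"]


lat, lon = degrees_to_decimal("N 37° 24.491, W 122° 08.313")
print(f"{lat:.6f};{lon:.6f}")
```

Place names, the more common published form, cannot be converted this way at all without an external gazetteer.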

hAudio

  • Uses ISO 8601 for track duration, e.g. PT3M23S

Misconceptions of Fixed Data Formats in Microformats

There are also cases (at least one) of apparent fixed data formats in microformats which should not require a separate value to be provided. It is useful to document these as a way to clear up apparent misconceptions.

hReview

  • Uses fixed-point integer values from 0-5 for rating (publishers may, for example, display a percentage rating)

There are several misconceptions here.

  1. The default rating values in hReview are from 1.0-5.0 (not 0-5)
  2. hReview permits the author to state their own 'worst' to 'best' range for any given 'rating'.

Thus a publisher that wants to display a percentage rating can do so by simply specifying a 'worst' value for a rating of 0, and a 'best' value for a rating of 100. Then the actual percentage rating can simply be marked up inline and no separate machine value is necessary.
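A consumer can compare ratings published on different scales by normalising each one against its stated range. The Python sketch below assumes the defaults described above (worst 1.0, best 5.0); the function names are hypothetical and this is not taken from any particular hReview parser.

```python
def rating_fraction(rating: float, worst: float = 1.0, best: float = 5.0) -> float:
    """Position of a rating within its worst-to-best range, from 0.0 to 1.0."""
    if best == worst:
        raise ValueError("worst and best must differ")
    return (rating - worst) / (best - worst)


def to_default_scale(rating: float, worst: float, best: float) -> float:
    """Re-express a rating on the default 1.0 to 5.0 scale."""
    return 1.0 + rating_fraction(rating, worst, best) * (5.0 - 1.0)


# A percentage rating published inline with worst=0 and best=100.
print(rating_fraction(80, worst=0, best=100))
print(to_default_scale(80, worst=0, best=100))
```

Because the range travels with the rating, the visible "80" is itself the machine value and nothing needs to be hidden or duplicated.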

Embedding Fixed Data Formats in Microformats

There are currently three supported methods of including these fixed data formats in a microformatted document.

As Visible Page Content

You may use the standard class-design-pattern to mark up the data visibly in the page.

Ben was born on <span class="bday">1984-02-09</span>.

We're meeting up on Northumberland Avenue (<span class="geo">51.507033,-0.126343</span>).

As An Abbreviation

In some cases, the data formats specified make valid expansions of common human forms, such as dates in an hCard birthday field:

Ben was born on <abbr class="bday" title="1984-02-09">9th February</abbr>

Note, however, that not all data formats are valid expansions. In HTML, the abbr element is working semantically at a text level, not a data level. Both the abbreviated form (the inner text) and the expanded form (the title) need to be consumable by humans.

This means that in hAudio, using an abbreviation for duration is incorrect:

<abbr class="duration" title="PT3M23S">3 minutes, 23 seconds</abbr>

Whilst the data ‘PT3M23S’ is an expanded form of ‘3 minutes, 23 seconds’, the text is not; ‘PT3M23S’ is nonsense to most human beings. abbr is an element that describes the text, not the data. HTML4 has no way to mark up arbitrary data.
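From the parser's side, the visible-content and abbreviation methods differ only in where the value is read from. The Python sketch below uses the standard library's `html.parser` to show that choice: for an `abbr` carrying a `title`, the title is taken; otherwise the element's inner text is used. The class and function names are hypothetical, and the sketch ignores nesting of same-named elements and the value-class-pattern, so it is an illustration rather than a conforming microformats parser.

```python
from html.parser import HTMLParser


class PropertyExtractor(HTMLParser):
    """Find the first element with a given class name and read its value."""

    def __init__(self, prop: str):
        super().__init__()
        self.prop = prop
        self.value = None
        self._capturing = False   # True while inside the matched element
        self._tag = None          # tag name of the matched element
        self._text = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        classes = (attributes.get("class") or "").split()
        if self.value is not None or self._capturing or self.prop not in classes:
            return
        if tag == "abbr" and attributes.get("title"):
            # Abbreviation method: the machine value lives in the title.
            self.value = attributes["title"]
        else:
            # Visible-content method: collect the inner text instead.
            self._capturing = True
            self._tag = tag

    def handle_data(self, data):
        if self._capturing:
            self._text.append(data)

    def handle_endtag(self, tag):
        if self._capturing and tag == self._tag:
            self.value = "".join(self._text).strip()
            self._capturing = False


def extract_property(html: str, prop: str):
    parser = PropertyExtractor(prop)
    parser.feed(html)
    parser.close()
    return parser.value


print(extract_property(
    'Ben was born on <abbr class="bday" title="1984-02-09">9th February</abbr>',
    "bday",
))
```

Both of the bday examples above yield the same machine value, which is the point of the pattern: the author chooses the visible form, and the parser still receives ISO 8601.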

Using the value-class-pattern

See value-class-pattern.



Related Pages