microformats2: Difference between revisions

Revision as of 01:16, 11 April 2011

2004: In early February microformats were introduced as a concept at eTech, and in September hCard and hCalendar were proposed at FOO Camp.

2010:

34% of webdevs use microformats (2010 State of Web Development survey)
1.88 billion hCards (per Yahoo SearchMonkey)
36 million hCalendar events (ibid)

XFN -> Social Graph API -> Web as Social Network / Address Book

AUTHORS and PUBLISHING

How can we make it easier for authors to publish microformats?

Currently the simplest hCard:

<span class="vcard">
  <span class="fn">
    Chris Messina
  </span>
</span>

requires 2 elements (nested, with perhaps at least one being pre-existing), and 2 class names

Web authors/designers are used to the simplicity of most HTML tags, e.g. to mark up a heading:

<h1>Chris Messina</h1>

requires just 1 element.

How can we make microformats just as easy?

Proposal: allow root class name only.

This would enable:

<h1 class="vcard">Chris Messina</h1>

requiring only 1 class name for the simplest case.

Can we do even better?

One of the most common questions asked about hCard is:

Why does hCard use vcard as the root class name?

This slight inconsistency between the name of the format and the name of the root class name consistently causes confusion in a large percentage of newcomers to microformats.

Looks like a typo (just one letter difference)
Ambiguity in discussions, e.g. put "vcard" in your HTML - meaning, class name, or a link to a .vcf file?
Extra bit to remember when marking up a microformat
- in contrast to hReview, hListing, hRecipe, etc. which all have root class name same as name of microformat (lowercased).

Though in microformats we believe very strongly in the principle of reuse, we have to admit that experience/evidence has shown this may be a case where we re-used something too far beyond its original meaning. Thus:

Proposal: use root class name "hcard" instead of "vcard" for future hCards.

This would result in:

<h1 class="hcard">Chris Messina</h1>

making the simple case even simpler:

Just 1 additional class name, named the same as the format you're adding. Think hCard, markup class="hcard".

It's very important for the simple case to be as simple as possible, to enable the maximum number of people to get started with minimum effort.

From there on, it's ok to require incremental effort for incremental return.

E.g. to add any additional information about a person, add explicit property names.

How does this simple root-only case work?

root class name reflects name of the microformat
every microformat must require at most 1 property (preferably 0)
- admit that requiring a field in an application just results in noise (the 90210 problem - apps which require zip code get lots of false 90210 entries), and specify that any application use cases which appear to "require" specific properties must instead define how to imply sensible defaults for them.
when only a root class name is specified, imply the entire text contents of the element as the value of the primary property of the microformat. e.g.
- "hcard" implies "fn"
- hcalendar event - "hevent" - implies "summary"
- "hreview" implies "summary"
- "hentry" implies "entry-summary" (perhaps collapse into "summary" - in practice they're not sufficiently semantically distinct to require separate property names)
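
The root-only rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a reference parser: the `IMPLIED_PROPERTY` table simply restates the list above, the names `RootOnlyParser` and `parse_root_only` are hypothetical, and the sketch assumes well-formed markup and a root element that carries no explicit properties.

```python
from html.parser import HTMLParser

# Implied primary property per root class name, restating the list above.
IMPLIED_PROPERTY = {
    "hcard": "fn",
    "hevent": "summary",
    "hreview": "summary",
    "hentry": "entry-summary",
}

# Void elements never get an end tag, so they are skipped for depth tracking.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class RootOnlyParser(HTMLParser):
    """Hypothetical sketch: treats the text of a root element as its primary property."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._root = None   # root class name of the element being read
        self._depth = 0     # nesting depth inside that root element
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        if self._root is not None:
            self._depth += 1
            return
        for name in (dict(attrs).get("class") or "").split():
            if name in IMPLIED_PROPERTY:
                self._root, self._depth, self._text = name, 1, []
                break

    def handle_data(self, data):
        if self._root is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag in VOID or self._root is None:
            return
        self._depth -= 1
        if self._depth == 0:
            # Collapse whitespace and store the text under the implied property.
            value = " ".join("".join(self._text).split())
            self.items.append({"type": self._root,
                               IMPLIED_PROPERTY[self._root]: value})
            self._root = None


def parse_root_only(html):
    parser = RootOnlyParser()
    parser.feed(html)
    parser.close()
    return parser.items


print(parse_root_only('<h1 class="hcard">Chris Messina</h1>'))
# [{'type': 'hcard', 'fn': 'Chris Messina'}]
```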

Additional simplifications

What more can we simplify about microformats?

Numerous individuals have provided the feedback that whenever there is more than one level of hierarchy in a microformat, many (most?) developers get confused - in particular Kavi Goel of Google / Rich Snippets provided this feedback at a microformats dinner. Thus depending on multiple levels of hierarchy is likely resulting in a loss of authorability, perhaps even accuracy as confusion undoubtedly leads to more errors. Thus:

Proposal: simplify all microformats to flat sets of properties.

What this means:

all microformats are simply an object with a set of properties with values.
no more subproperties- drop the notion of subproperties.
use composition of multiple microformats for any further hierarchy, e.g. the "location" of an hCalendar event can be an hCard, or the "agent" of one hCard can be another hCard.

For example for hCard this would mean the following specific changes to keep relevant functionality:

drop "n", promote all "n" subproperties to full properties
- given-name, family-name, additional-name, honorific-prefix, honorific-suffix
treat "geo" as a nested microformat
treat "adr" as a nested microformat (what to do about adr's "type"?)
treat "org" as a flat string and drop "organization-name" and "organization-unit" (in practice rarely used, also not revealed or ignored in contact management user interfaces - e.g. Address Book)

Example: add a middle initial to the previous example Chris Messina's name, and markup each name component:

<h1 class="hcard">
 <span class="fn">
  <span class="given-name">Chris</span>
  <abbr class="additional-name">R.</abbr>
  <span class="family-name">Messina</span>
 </span>
</h1>

Note:

use of an explicit span with "fn" to markup his entire formatted name
use of the abbr element to explicitly indicate the semantic that "R." is merely an abbreviation for his additional-name.

COMMUNITY and TOOLS

(that) USE MICROFORMATS

parser / parsing
structured
getting the data out
json - 1:1 mapping

parsing microformats currently requires

a list of root class names of each microformat to be parsed
a list of properties for each specific microformat, along with knowledge of the type of each property in order to parse their data from potentially different portions of the HTML markup
some number of format-specific rules (markup/content optimizations)

This has meant that whenever a new microformat is drafted/specified/adopted, parsers need to be updated to handle it correctly, at a minimum to parse them when inside other microformats and avoid errantly implying properties from one to the other (containment, mfo problem).

I think there is a fairly simple solution to #1 and #2 from the above list, and we can make progress towards minimizing #3. In short:

Proposal: a set of naming conventions for microformat root class names and properties that make it obvious when:

a class name represents a microformat root class name
a class name represents a microformat property name
a class name represents a specific type of microformat property

In particular - derived from the real world examples of existing proven microformats (rather than any abstraction of what a schema should have)

"h-*" for root class names, e.g. "h-card", "h-event", "h-entry"
"p-*" for simple (text) properties, e.g. "p-fn", "p-summary"
"u-*" for URL properties, e.g. "u-url", "u-photo", "u-logo"
"d-*" for datetime properties, e.g. "d-start", "d-end", "d-bday" (initially I had proposed "dt-*" but Chris Messina suggested reducing it to "d-*" so that all prefixes were a single letter - made sense).
"n-*" for (one or more) numbers, e.g. "n-rating", "n-geo", leaving the semantics of more than one number up to specific format. e.g. for an "n-rating" inside an "h-review", the first number would presumably be the rating value, when only two numbers the second would be the "best" value (e.g. rated <span class="n-rating">3 out of 4</span>), when three numbers the second would be the "worst" and the third would be the "best" (e.g. <span class="n-rating">7.5 out of 1 to 10</span>). similarly "n-geo" would specify the first number to be the latitude and the second to be the longitude.

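The number semantics suggested above for "n-rating" and "n-geo" might be sketched as follows. This is a rough illustration, not a settled parsing rule: the function names, the result keys, and the regular expression are assumptions, and the example coordinates are made up.

```python
import re

# Matches integers and decimals with an optional leading minus sign.
# Caveat: a range written as "1-10" would be read as 1 and -10.
NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def parse_numbers(text):
    """Return every number found in the text, in order, as floats."""
    return [float(n) for n in NUMBER.findall(text)]


def parse_rating(text):
    """Interpret 1, 2, or 3 numbers as value / value, best / value, worst, best."""
    nums = parse_numbers(text)
    if len(nums) == 1:
        return {"value": nums[0]}
    if len(nums) == 2:
        return {"value": nums[0], "best": nums[1]}
    if len(nums) == 3:
        return {"value": nums[0], "worst": nums[1], "best": nums[2]}
    raise ValueError("expected 1 to 3 numbers, found %d" % len(nums))


def parse_geo(text):
    """Interpret the first number as latitude and the second as longitude."""
    nums = parse_numbers(text)
    if len(nums) != 2:
        raise ValueError("expected 2 numbers, found %d" % len(nums))
    return {"latitude": nums[0], "longitude": nums[1]}


print(parse_rating("3 out of 4"))          # {'value': 3.0, 'best': 4.0}
print(parse_rating("7.5 out of 1 to 10"))  # {'value': 7.5, 'worst': 1.0, 'best': 10.0}
print(parse_geo("37.7749, -122.4194"))     # {'latitude': 37.7749, 'longitude': -122.4194}
```
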
possibly also:

"e-*" for properties where the entire contained element hierarchy is the value, e.g. "e-content" (formerly "entry-content") for hAtom.
"i-*" for ID properties, e.g. "i-uid" (if this is the only one, then perhaps we just always re-use "uid" or collapse with "u-*" into "u-id".)
"t-*" for time duration, e.g. "t-duration" in hCalendar, hAudio, hRecipe (note also Google's hRecipe extensions "preptime", "cooktime", "totaltime")

and:

reserve all other single-letter-dash prefixes for future use. In practice we have seen very little (if any) use of single-letter-dash prefixing of class names by web developers/designers, and thus in practice we think this will have little if any impact/collisions. Certainly far fewer than existing generic microformat property class names like "title", "note", "summary".
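
As a rough sketch of how these conventions let a tool recognize microformat class names without per-format knowledge, the following Python classifies a class name by its prefix. The kind labels and the name `classify` are assumptions made for illustration; only the prefix letters themselves come from the lists above.

```python
import re

# Prefix letters from the proposal above; the labels on the right are illustrative.
PREFIX_KINDS = {
    "h": "root",
    "p": "text",
    "u": "url",
    "d": "datetime",
    "n": "number",
    "e": "embedded",
    "i": "id",
    "t": "duration",
}

# A single lowercase letter, a dash, then a dash-separated lowercase name.
PREFIXED = re.compile(r"^([a-z])-([a-z0-9]+(?:-[a-z0-9]+)*)$")


def classify(class_name):
    """Return (kind, name) for a prefixed class name, or None for ordinary class names."""
    match = PREFIXED.match(class_name)
    if match is None:
        return None
    prefix, name = match.groups()
    # Any other single-letter-dash prefix is reserved for future use.
    return (PREFIX_KINDS.get(prefix, "reserved"), name)


print(classify("h-card"))        # ('root', 'card')
print(classify("p-given-name"))  # ('text', 'given-name')
print(classify("title"))         # None
```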

Example: taking that simple heading hCard example forward:

<h1 class="h-card">Chris Messina</h1>

As part of microformats 2.0 we would immediately define root class names and property names for all existing microformats and drafts consistent with this naming convention, and require support thereof from all new implementations, as well as strongly encouraging existing implementations to adopt the simplified microformats 2.0 syntax and mechanism.

As a community we would continue to use the microformats process both for researching and determining the need for new microformats, and for naming new microformat property names for maximum re-use and interoperability of a shared vocabulary.

If it turns out we need a new property type in the future, we can use one of the remaining single-letter-prefixes to add it to microformats 2.0. This would require updating of parsers of course, but in practice the number of different types of properties has grown very slowly, and we know from other schema/programming languages that there's always some small limited number of scalar/atomic property types that you need, and using those you can create compound types/objects that represent richer / more complicated types of data.

ADVANTAGES

This has numerous advantages:

better maintainability - much more obvious to web authors/designers/publishers which class names are for/from microformats.
no chance of collision - for all practical purposes with existing class names and thus avoiding any need to add more complex CSS style rules to prevent unintended styling effects.
simple universal parsing - parsers can now do a simple stream-parse (or in-order DOM tree walk) and parse out all microformat objects, properties, and values, without having to know anything about any specific microformats.

More examples: here is that same heading example with name components:

<h1 class="h-card">
 <span class="p-fn">
  <span class="p-given-name">Chris</span>
  <abbr class="p-additional-name">R.</abbr>
  <span class="p-family-name">Messina</span>
 </span>
</h1>

with a hyperlink to Chris's URL:

<h1 class="h-card">
 <a class="p-fn u-url" href="http://factoryjoe.com/">
  <span class="p-given-name">Chris</span>
  <abbr class="p-additional-name">R.</abbr>
  <span class="p-family-name">Messina</span>
 </a>
</h1>
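
To illustrate the "simple universal parsing" advantage, here is a minimal sketch of a generic parser for markup like the examples above. It is not a specification of microformats 2.0 parsing: the output shape, the names `GenericParser` and `parse`, and the handling of only "p-*" and "u-*" properties are assumptions, and nested objects and malformed markup are not handled.

```python
from html.parser import HTMLParser

VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


def _norm(parts):
    """Join text fragments and collapse runs of whitespace."""
    return " ".join("".join(parts).split())


class GenericParser(HTMLParser):
    """Hypothetical sketch: collects h-* objects with their p-* and u-* properties."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._item = None    # object currently being read, if any
        self._frames = []    # (classes, attrs, text parts) per open element

    def _record(self, classes, attrs, text):
        for name in classes:
            if name.startswith("p-"):
                value = text
            elif name.startswith("u-"):
                # URL properties prefer an attribute over the inner text.
                value = attrs.get("href") or attrs.get("src") or text
            else:
                continue
            self._item["properties"].setdefault(name[2:], []).append(value)

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if self._item is None:
            roots = [c for c in classes if c.startswith("h-")]
            if not roots or tag in VOID:
                return
            self._item = {"type": roots, "properties": {}}
        if tag in VOID:
            self._record(classes, attrs, "")
        else:
            self._frames.append((classes, attrs, []))

    def handle_data(self, data):
        # Text belongs to every open element, so outer properties see inner text.
        for _, _, text in self._frames:
            text.append(data)

    def handle_endtag(self, tag):
        if self._item is None or tag in VOID or not self._frames:
            return
        classes, attrs, text = self._frames.pop()
        self._record(classes, attrs, _norm(text))
        if not self._frames:
            # The root element closed: keep its full text and finish the object.
            self._item["text"] = _norm(text)
            self.items.append(self._item)
            self._item = None


def parse(html):
    parser = GenericParser()
    parser.feed(html)
    parser.close()
    return parser.items
```

Applied to the hyperlinked example above, this sketch would yield one object of type "h-card" with "fn", "given-name", "additional-name", "family-name", and "url" properties, without the parser knowing anything specific about hCard.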

COMPATIBILITY

microformats 2.0 is backwards compatible in that it permits content authors to mark up with both old and new class names for compatibility with old tools.

Here is a simple example:

<h1 class="h-card vcard">
 <span class="fn">Chris Messina</span>
</h1>

A microformats 2.0 parser would see the class name "h-card" and imply the one required property from the contents, while a microformats 1.0 parser would find the class name "vcard" and then look for the class name "fn". No data duplication is required. This is a very important continuing application of the DRY principle.

And the above hyperlinked example with both sets of class names:

<h1 class="h-card vcard">
 <a class="p-fn u-url n fn url" href="http://factoryjoe.com/">
  <span class="p-given-name given-name">Chris</span>
  <abbr class="p-additional-name additional-name">R.</abbr>
  <span class="p-family-name family-name">Messina</span>
 </a>
</h1>
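
An authoring tool could add the old class names automatically. The sketch below is only an assumption about how that might work: the function name `with_legacy_classes` is hypothetical, the root table covers just a few well-known formats, and stripping the prefix is only applied to "p-*" and "u-*" names because other properties are renamed between versions (e.g. "e-content" was "entry-content"). It also does not add structural 1.0 class names such as the "n" seen in the example above.

```python
# Old root class names for a few formats; h-card -> vcard is from the example above.
LEGACY_ROOTS = {
    "h-card": "vcard",
    "h-event": "vevent",
    "h-entry": "hentry",
    "h-review": "hreview",
}


def with_legacy_classes(class_attr):
    """Return the class attribute with old-style names appended after the new ones."""
    names = class_attr.split()
    result = list(names)
    for name in names:
        if name in LEGACY_ROOTS:
            legacy = LEGACY_ROOTS[name]
        elif name[:2] in ("p-", "u-"):
            # Assumes the old property name is the new one minus its prefix.
            legacy = name[2:]
        else:
            continue
        if legacy not in result:
            result.append(legacy)
    return " ".join(result)


print(with_legacy_classes("h-card"))        # h-card vcard
print(with_legacy_classes("p-fn u-url"))    # p-fn u-url fn url
```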

VENDOR EXTENSIONS

(this section was only discussed verbally and not written up during discussions - capturing here as it is topical)

Proprietary extensions to formats have typically been short-lived experimental failures, with one big recent exception.

Proprietary or experimental CSS3 property implementations have been very successful.

There has been much use of border radius properties and animations/transitions which use CSS properties with vendor-specific prefixes like:

-moz-border-radius
-webkit-border-radius

etc.

Note that these are merely string prefixes, not bound to any URL, and thus not namespaces in any practical sense of the word. This is quite an important distinction, as avoiding the need to bind to a URL has made them easier to support and use.

This use of vendor specific CSS properties has in recent years allowed the larger web design/development/implementor communities to experiment and iterate on new CSS features while the features were being developed and standardized.

The benefits have been two-fold:

designers have been able to make more attractive sites sooner (at least in some browsers)
features have been market / real-world tested before being fully standardized, thus resulting in better features

Implementers have used/introduced "x-" prefixes for IETF MIME/content-types for experimental content-types, MIME parameter extensions, and HTTP header extensions, per RFC 2045 Section 6.3, RFC 3798 section 3.3, and Wikipedia: HTTP header fields - non-standard headers (could use RFC reference instead) respectively, like:

application/x-latex (per Wikipedia Internet media type: Type x)
x-spam-score (in email headers)
X-Pingback (per Wikipedia:Pingback)

Some standard types started as experimental "x-" types, thus demonstrating this experiment first, standardize later approach has worked for at least some cases:

image/x-png (standardized as image/png, both per RFC2083)

There have been times when specific sites have wanted to extend microformats beyond the set of properties in the microformat, and they currently lack any experimental way to do so: a way to try a feature (or even a whole format) and see if it is interesting in the real world before bothering to pursue researching and walking it through the microformats process. Thus:

Proposal:

'*-x-' + meaningful name for root and property class names
- where "*" indicates the single-character-prefix as defined above
- where "x" indicates a literal 'x' for an experimental extension
- OR "x" indicates a vendor prefix (more than one character, e.g. like CSS vendor extension abbreviations, or some stock symbols, avoiding first words/phrases/abbreviations of microformats properties like dt-)
- e.g.
- "h-bigco-one-ring" - a hypothetical "bigco" vendor-specific "onering" microformat root class name.
- "p-goog-preptime" - to represent Google's "preptime" property extension to hRecipe (aside: "duration" may be another property type to consider separate from "datetime" as it may be subject to different parsing rules.)
- "p-x-prep-time" - a possible experimental property name to be added to hRecipe upon consideration/documentation of real-world usage/uptake.
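
One practical consequence of this syntax is that a vendor prefix cannot be told apart from the first word of an ordinary property name (e.g. "p-given-name") by shape alone, which is why the proposal asks vendors to avoid such words. The sketch below assumes a parser keeps a list of known vendor prefixes; the `KNOWN_VENDORS` set and the name `parse_extension` are hypothetical.

```python
import re

# Hypothetical registry of vendor prefixes, using the examples above.
KNOWN_VENDORS = {"bigco", "goog"}

# Type prefix, then "x" or a vendor prefix, then the meaningful name.
EXTENSION = re.compile(r"^([a-z])-([a-z0-9]+)-(.+)$")


def parse_extension(class_name):
    """Return a description of an extension class name, or None if it is not one."""
    match = EXTENSION.match(class_name)
    if match is None:
        return None
    prefix, marker, name = match.groups()
    if marker == "x":
        return {"prefix": prefix, "kind": "experimental", "vendor": None, "name": name}
    if marker in KNOWN_VENDORS:
        return {"prefix": prefix, "kind": "vendor", "vendor": marker, "name": name}
    # Anything else is treated as an ordinary class name, e.g. "p-given-name".
    return None


print(parse_extension("p-x-prep-time"))
# {'prefix': 'p', 'kind': 'experimental', 'vendor': None, 'name': 'prep-time'}
print(parse_extension("h-bigco-one-ring"))
# {'prefix': 'h', 'kind': 'vendor', 'vendor': 'bigco', 'name': 'one-ring'}
print(parse_extension("p-given-name"))
# None
```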

Background - this proposal is a composition of the following (at least somewhat) successful vendor extension syntaxes

CSS 2.1 4.1.2.1 Vendor-specific extensions
IETF MIME/content-type "x-*" extensions per RFC 2045 Section 6.3. [1]
IETF MIME experimental fields (e.g. x-spam-score)
HTTP header extensions (e.g. x-pingback)

FURTHER THOUGHTS REGARDING HUNGARIAN PREFIXING

Microformats 2.0 proposes using an explicit [a-z]- prefix on properties, to differentiate them from other uses of the class attribute, and identify them as microformat properties, such that they can be parsed generically.

The differentiation use case is supported by anecdotal evidence of sites (such as Facebook, Twitter, Yahoo) removing microformats or breaking objects in page edits. The addition of a prefix assists self-documentation of code.
The generic parsing use case is supported by Google Rich Snippets, Yahoo Search Monkey, and extensible plugins like Operator and the Firefox microformats parser. Although these extract microformats from the page, they are intermediate systems between the page content and the actual interpretation of the data. They need to parse all objects from a page, and then another developer or application will interpret some of them into something else.

The µf2 proposal goes further, though, into a small vocabulary of Hungarian prefixes of properties based on data type. This increases the level of understanding required to read microformats, and reduces the benefit of all microformat properties having a consistent identifying prefix.

Hungarian notation itself is controversial amongst programmers. Plenty find it uglifies their code, can be a cause of confusion (especially when very-short prefixes are used, or esoteric types, or where the declared set of types differs from the available types in other programming languages.) Others support its benefits to type identification.

Critically, however, there is no clear indication that either of the above use cases requires types to be strongly identified.

For identifying µf in pages, a differentiator is required from regular classnames. There is no evidence of further requirement to differentiate between properties beyond their name (and existing criticisms of Hungarian notation suggest it can harm understandability.)
For generic parsing, there is no requirement that datatypes be established at extraction time. Data types will instead be applied by the developers of apps and widgets that build on the generic parsers.
A counter argument may be that special properties in microformats—such as URLs, or images—need to be identified because in microformats it is common to parse an attribute (href, or src) rather than inner text of an element for these properties. However, in the context of extracting and then interpreting HTML in other contexts this is insufficient: for example, though an image only exists as a single property in vcard, in HTML it is both a URL to a resource and a text string (alt) representing an accessible fallback. A ‘generic extractor’ of microformats from a page must capture all of this information from HTML, so that the interpreting application can choose which data type is most relevant to its context. Likewise, an application interpreting a URL may also consider using the original inner text as an inferred label. Both pieces of data are useful, and a generic parser should not discard elemental semantics at the extraction level.

Given this, Hungarian prefixes are of no benefit to parsers (and may in fact harm applications down the chain if parsing is prematurely strict). It would then be sufficient not to embed data types in property names, and instead to settle on a single property prefix that differentiates all properties consistently. This would reduce the prefixes to just 3:

h would indicate a root class name. An ‘object in HTML’.
p would indicate a property within an object.
x would indicate an experimental extension to an object.

--BenWard 01:16, 11 April 2011 (UTC)
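
The untyped extraction argued for above could look roughly like the following sketch, which keeps the tag, all attributes, and the inner text of every "p-*" element and leaves interpretation to the consuming application. The record layout, the names `UntypedExtractor` and `extract`, and the sample file name are assumptions for illustration only.

```python
from html.parser import HTMLParser

VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class UntypedExtractor(HTMLParser):
    """Hypothetical sketch: records p-* elements without deciding their data type."""

    def __init__(self):
        super().__init__()
        self.properties = []
        self._open = []   # one record per open element

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        names = [c[2:] for c in (attrs.get("class") or "").split()
                 if c.startswith("p-")]
        record = {"names": names, "tag": tag, "attrs": attrs, "text": ""}
        if tag in VOID:
            # Void elements such as img have no inner text; keep their attributes.
            if names:
                self.properties.append(record)
        else:
            self._open.append(record)

    def handle_data(self, data):
        for record in self._open:
            record["text"] += data

    def handle_endtag(self, tag):
        if tag in VOID or not self._open:
            return
        record = self._open.pop()
        record["text"] = " ".join(record["text"].split())
        if record["names"]:
            self.properties.append(record)


def extract(html):
    parser = UntypedExtractor()
    parser.feed(html)
    parser.close()
    return parser.properties


photo = extract('<img class="p-photo" src="chris.jpg" alt="Chris Messina">')[0]
print(photo["attrs"]["src"], "|", photo["attrs"]["alt"])  # chris.jpg | Chris Messina
```

An application that wants a URL reads `attrs["src"]` or `attrs["href"]`, while one that wants a label reads `attrs["alt"]` or `text`; nothing is discarded at extraction time.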

USERS

Need more tools and interfaces that:

publish
copy/paste
right-click on a microformat
share
search results

discussed some existing tools, such as H2VX, which converts hCard to vCard and hCalendar to iCalendar

how would we re-implement Live Clipboard today, making it easier for publishers and developers?

@@ Line 278: / Line 278: @@
 * IETF MIME experimental fields (e.g. x-spam-score)
 * HTTP header extensions (e.g. x-pingback)
+==== FURTHER THOUGHTS REGARDING HUNGARIAN PREFIXING ====
+Microformats 2.0 proposes using an explicit <code>[a-z]-</code> prefix on properties, to differentiate them from other uses of the class attribute, and identify them as microformat properties, such that they can be parsed generically.
+* The differentiation use case is supported by anecdotal evidence of sites (such as Facebook, Twitter, Yahoo) removing microformats or breaking objects in page edits. The addition of a prefix assists self-documentation of code.
+* The generic parsing use case is supported by Google Rich Snippets, Yahoo Search Monkey, and extensible plugins like Operator and the Firefox microformats parser. Although these extract microformats from the page, they are intermediate systems between the page content and the actual interpretation of the data. They need to parse all objects from a page, and then another developer or application will interpret some of them into something else.
+The µf2 proposal goes further, though, into a small vocabulary of [http://en.wikipedia.org/wiki/Hungarian_notation Hungarian] prefixes of properties based on data type. This increases the level of understanding required to read microformats, and reduces the benefit of all microformat properties having a consistent identifying prefix.
+Hungarian notation itself is controversial amongst programmers. Plenty find it uglifies their code, can be a cause of confusion (especially when very-short prefixes are used, or esoteric types, or where the declared set of types differs from the available types in other programming languages.) Others support its benefits to type identification.
+Critically, however, there is no clear indication that either of the above use cases requires types to be strongly identified.
+* For identifying µf in pages, a differentiator is required from regular classnames. There is no evidence of further requirement to differentiate between properties beyond their name (and existing criticisms of Hungarian notation suggest it can harm understandability.)
+* For generic parsing, there is no requirement that datatypes be established at extraction time. Data types will instead be applied by the developers of apps and widgets that build on the generic parsers.
+* A counter argument may be that special properties in microformats—such as URLs, or images—need to be identified because in microformats it is common to parse an attribute (href, or src) rather than inner text of an element for these properties. However, in the context of extracting and then interpreting HTML in other contexts this is insufficient: for example, though an image only exists as a single property in vcard, in HTML it is both a URL to a resource ''and'' a text string (alt) representing an accessible fallback. A ‘generic extractor’ of microformats from a page must capture all of this information from HTML, so that the interpreting application can choose which data type is most relevant to its context. Likewise, an application interpreting a URL may also consider using the original inner text as an inferred label. Both pieces of data are useful, and a generic parser should not discard elemental semantics at the extraction level.
+Given this, Hungarian prefixes are of no benefit to parsers (and may in fact harm applications down the chain if parsing is prematurely strict). It would then be sufficient not to embed data types in property names, and instead to settle on a single property prefix that differentiates all properties consistently. This would reduce the prefixes to just 3:
+* <code>h</code> would indicate a root class name. An ‘object in HTML’.
+* <code>p</code> would indicate a property within an object.
+* <code>x</code> would indicate an experimental extension to an object.
+--[[User:BenWard|BenWard]] 01:16, 11 April 2011 (UTC)
 === USERS ===