microformats2-parsing-brainstorming

From Microformats Wiki

Revision as of 01:12, 16 October 2012

Author: Ben Ward

Microformats 2 proposes a new, all-encompassing syntax modification of prefixes that will allow microformats to be parsed from pages by processors without prior knowledge of a vocabulary. The core components of this model are quite simple, and quite simple to implement, but there are a number of conflicts with the functionality of existing microformats parsers that need to be handled. This page documents a proposed model to separate these concerns clearly in a way that can be applied to the documentation of generic microformats parsing rules, and the documentation of individual vocabularies.

Parsing Microformats 2.0 Syntax: Extraction vs. Interpretation

A microformats ‘1.0’ parser performs the following function:

  • Given a piece of HTML content, discover a known microformat, extract it, apply various extraction patterns based upon the HTML mark-up used (e.g. include pattern, abbr patterns, date-time patterns, value-title pattern), apply various content optimisations where applicable, and return the result in an object native to the programming language.

This is performing two types of function: Extraction of data from an HTML document or fragment, and interpretation and optimisation of that content to match the rules set out by a vocabulary specification.

It is only possible to write a generic parser that covers the first half of this task: Extraction, and application of global rules based on HTML elements and patterns common to all formats.

The purpose of a generic parser (as supported by use cases such as search engines, and other crawlers) is:

To provide a way for tools to extract rich data from a page for native storage, such that the data may be interpreted later by applications. This allows microformats to be crawled, and indexed, and removes the need to include complex HTML parsing within every implementation of microformat data.

Microformats will continue to define various vocabulary-specific optimisations, as part of the design to be optimised for authors. For example: the fn pattern in hcard, or the lat;long pattern in geo, as well as default values for properties, such as the maximum rating in an hreview.

  • Actually, no, as it is defined currently, microformats 2 drops vocabulary-specific optimizations. Such optimizations have often been too inapplicable, error prone or i18n-unsafe (e.g. fn to given-name + family-name fails for both numerous cases where middlenames/initials are used, and in general in numerous Asian languages where given/family name order is the reverse of Western English conventions, or languages with multiple family-names, e.g. Spanish - see hcard-issues-resolved for more). This is a deliberate cutting of a "feature" from microformats 1, it is a deliberate model simplification design decision. Tantek 12:43, 4 October 2011 (UTC)

Microformats 2.0 should refer only to extraction of microformats. Vocabularies should in turn document their appropriate optimisations, which will need to be applied by implementations, or a companion to an extractor, which I'll refer to here as an ‘interpreter’.

  • Vocabularies will no longer have optimizations. This is again deliberate, as they've been shown to be more error prone than helpful. Thus there should be no need for any vocabulary-specific 'interpreters'. However, due to design quirks in various legacy/interchange formats, export conversion algorithms to those legacy/interchange formats will require some additional legacy-format-specific rules (e.g. odd "required" rules in Atom or vCard will require specific synthesis rules, limitations in said formats will require filtering of some values, e.g. vcard3 BDAY disallows vague birthdays like year-month and --month-day - subsequently allowed in vcard4). Tantek 12:43, 4 October 2011 (UTC)

A microformats 2.0 ‘extractor’, in combination with the functionality of a domain and format-aware ‘interpreter’ (either another shared component, or part of the implementation itself) would be equivalent to a microformats 1.0 ‘parser.’

  • A microformats 2.0 parser is both generic (no knowledge of specific vocabularies), and lacks any/all such vocabulary-specific rules as compared to a microformats 1.0 parser, with the exception of 1) a limited list of well-established/interoperable backward compat root class names (of current microformats that are or can be soon shown to be specifications/standards per the process), 2) flat sets of backward compat property names (some with prefix/name specific conversion) for each of those backward compat root class names. This is a deliberate design decision that makes microformats 2 more "micro", and yes this means that even with such backward compat support, this simple form of backward compat may mean that some existing microformats 1 content breaks. We'll assess those and iterate on a documented case-by-case basis rather than attempt to maintain theoretical 100% backward compatibility (since many current microformats format-specific-features are either unused, or may have caused more problems than solutions). Tantek 12:43, 4 October 2011 (UTC)

N.B. I'll rewrite some of these as microformats-2-parsing-faq to help better clarify. The reasoning that led to most of these design decisions is documented in the microformats 2: About This Brainstorm section and following sections. I'll recheck those sections to see if/where reasoning for some of the above noted design decisions may have been missed, and back-fill accordingly. This is necessary because microformats 2 is an evolutionary result of simultaneously addressing both numerous generic issues as well as various common format-specific problems in microformats 1 syntax and vocabularies. The very number of changes may make it more challenging (from a microformats 1 perspective) to see why any particular design change has been made. Tantek 12:43, 4 October 2011 (UTC)

Parsing Literal Values

It is proposed for microformats 2.0 that all microformats be parseable from just their root element, e.g. <p class="h-card">Ben Ward</p> would create an hCard with the following properties after parsing:

{ 
  'type': ['h-card'],
  'properties': {
     'name': ['Ben Ward']
  }
}

This is a four-fold change from the current hCard:

  1. type is generically identifiable as a microformat root, even in parsed form. The use of the 'h-' prefix persists into the type of the object. This is deliberately so, as a result of re-using the JSON data model of microdata which itself is re-using a common JSON convention, such that microformatted data is clearly distinguishable (as opposed to any other random schema that may be using a similar data model).
  2. root-class-only support. Per microformats-2-implied-properties, the name property is implied by the entirety of the root class name element.
  3. 'name' instead of 'fn'. As also documented in microformats-2-implied-properties, the continuous challenges/problems and need to repeatedly re-explain 'fn' over the years combined with the real-world market response of nearly every other party doing a person vocabulary renaming 'fn' to 'name', microformats 2 makes this change as well.
  4. There is no automatic parse-time inferring of 'given-name': ['Ben'] and 'family-name': ['Ward']. Any such inferring *might* be made by a vCard converter, but is left up to that specific application (not all applications) built on that vocabulary, though even in that case it may not be necessary, as an empty "N:;;;" vCard property is sufficient to satisfy the N property requirement of vCard, and also causes no problems when imported into various vcard-implementations.

The extractor is required to understand that when a microformats object specifies no explicit child properties, it must treat h-card as having a p-name. But the parser is generic, so it also treats h-review, h-entry, h-recipe, h-geo as having a ‘p-name’.
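To make the implied 'name' rule concrete, here is a minimal sketch in Python using only the standard library's html.parser. It is not a microformats 2 parser: the class name ImpliedNameExtractor and the helper extract are hypothetical, and it assumes non-nested roots, reads only 'p-' properties on child elements, and ignores every other prefix and all backward compat rules. It only shows that the fallback is applied to any 'h-' root, without vocabulary knowledge.

```python
from html.parser import HTMLParser

# Void elements never get an end tag, so they are skipped for depth tracking.
VOID = {'area', 'base', 'br', 'col', 'embed', 'hr', 'img', 'input',
        'link', 'meta', 'source', 'track', 'wbr'}


class ImpliedNameExtractor(HTMLParser):
    """Hypothetical sketch: h-* roots, p-* child properties, implied 'name'."""

    def __init__(self):
        super().__init__()
        self.items = []        # finished objects
        self._current = None   # object for the root currently open
        self._depth = 0        # element depth inside the current root
        self._text = []        # all text seen inside the current root
        self._prop = None      # (names, depth, text parts) of an open p-* element

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        classes = (dict(attrs).get('class') or '').split()
        if self._current is None:
            roots = [c for c in classes if c.startswith('h-')]
            if roots:
                self._current = {'type': roots, 'properties': {}}
                self._depth, self._text, self._prop = 1, [], None
            return
        self._depth += 1
        names = [c[2:] for c in classes if c.startswith('p-')]
        if names and self._prop is None:
            self._prop = (names, self._depth, [])

    def handle_data(self, data):
        if self._current is None:
            return
        self._text.append(data)
        if self._prop is not None:
            self._prop[2].append(data)

    def handle_endtag(self, tag):
        if tag in VOID or self._current is None:
            return
        if self._prop is not None and self._prop[1] == self._depth:
            names, _, parts = self._prop
            for name in names:
                values = self._current['properties'].setdefault(name, [])
                values.append(''.join(parts).strip())
            self._prop = None
        self._depth -= 1
        if self._depth == 0:
            props = self._current['properties']
            if not props:
                # No explicit properties: imply 'name' from the root's text,
                # whatever the vocabulary (h-card, h-review, h-geo, ...).
                props['name'] = [' '.join(''.join(self._text).split())]
            self.items.append(self._current)
            self._current = None


def extract(html):
    parser = ImpliedNameExtractor()
    parser.feed(html)
    parser.close()
    return parser.items
```

Calling extract('<p class="h-card">Ben Ward</p>') yields the object shown earlier in this section, and the same fallback fires for an h-review or h-geo root that has no explicit properties.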

As a result, specific vocabularies are evolved to drop their specific form of name (e.g. fn, summary, entry-title) and simplified to use a common 'name' property instead.

Note: while the overwhelming majority of real world publishing/consuming uses of microformats do so with proper nouns which have names (and thus this parser-level incorporation of an implied 'name'), there are some formats that do not have a 'name' semantic. For example, geo, adr, and possibly if/when developed, units of measure, length, cost. The current thinking is that the benefits to the far greater proper-noun use-case of microformats outweigh the technical inelegance of having an extra/ignored 'name' property on formats that lack such a semantic.

Some formats also may appear in theory to better imply some other property, e.g. a review might be thought to imply its content, not its name, and an Atom entry its content, not its title, but in practice (actual publishing patterns) this is not the case. Typically, brief unstructured reviews (or mentions thereof) provide a summary (often hyperlinked to an expanded structured form) of that review, not its content, and similarly, brief unstructured posts (e.g. RSS items) have historically most often been link blog items which include the title of an item and a link. Short status updates as well established by Twitter are newer and would seem to imply purely content with no title, at least semantically, however, even Twitter populates the RSS title and Atom entry title of their feeds with the content. It's not clear what went into that decision, however, that's likely irrelevant, as the outcome turns out to be emergent consistency among publishing behaviors.

To avoid overloading or undermining the semantics of a vocabulary, I propose that we handle this at the extractor level in a simpler fashion: Define a new property for literal data, that an extractor will provide if no other information was available. All interpreters may then be instructed that in the event that an object has no properties, it can attempt to interpret the literal value from the page instead.

  • This was one of the design iterations I went through which led me to the current implied 'name' design. Another iteration was the ability for a vocabulary to specify a single required property which was implied if there were no properties provided. However, the combination of the fact that in most cases such single required properties were quite name-like, and that a vocabulary-specific rule like that would then bind parsers to specific vocabularies (even so slightly) led me to collapse them into implying a 'name'. It's not perfect, but it's the best alternative so far that balances practical convenience of publishing/consuming, avoids vocabulary-specific knowledge in the parser, and technical (in)elegance. Tantek 13:48, 4 October 2011 (UTC)

In existing microformats, the closest existing example we have for this is the label property in hCard, which is used to represent the literal address label for a place. It is a corresponding piece of fn, org and adr in combination, but has no structure in and of itself. Possibly, every microformat could have a label form where structured data is unavailable.

However in practice, the hCard label property is both little understood and little used. It's not even clear that it ought to be kept for microformats 2 (no known consumers, very few (if any?) real-world non-test publishers). This disuse is likely a good indicator that we should avoid basing anything on its design.

Alternatively, value is used throughout microformats to target a generic value (e.g. in combination with price in hListing.) It has been proposed that when parsing properties that are also themselves microformats, we create native objects of the form:

   {
       'value': '1900 12th Street, San Francisco, CA 94'
     , 'type': ['adr']
     , 'properties': {
           'street-address': '1900 12th Street'
         , 'etc': 'etc'
       }
   }

We could apply this same pattern to the root level:

   {
       'type': ['hcard']
     , 'properties': {}
     , 'value': 'Ben Ward'
   }

In this case, an interpreter or implementation is responsible for using value in place of fn, or restructuring the object. It would be the responsibility of each vocabulary to define its root property. The parsing layer of microformats 2.0 would not impose semantics or naming onto that.

For another example, geo would end up like this:

   {
       'type': ['geo']
     , 'properties': {}
     , 'value': '1.3232;-0.543'
   }
  • This is an alternative I've been considering as well: Tantek 13:48, 4 October 2011 (UTC)
    • 'value' is more generic than 'name' (applies to more vocabularies) with the trade-off that it naturally has less (weaker) semantics.
      • +1 I think that having naturally weaker semantics would be appropriate for this parsing functionality. —BenWard 07:24, 5 October 2011 (UTC)
    • The interesting thing that this analysis has revealed is that there appear to be two distinct clusters of microformats, the much more commonly used/understood/useful proper-noun microformats which mark up things with names (people, events, reviews, recipes), and the less used compound-data microformats which are often used inside other microformats and just have some sort of semi-structured value (adr, geo, measure, and perhaps even things like tel). Perhaps this is implying the possibility and some degree of utility for two microformats root class name prefixes, 'h-' for existing proper-noun microformats, and something else ('m-' for microformat/molecule?, 's-' for structured-value?, 'v-' for value (though historically "v-"/"v." has meant "vendor-specific")?) for unnamed structured data microformats.
      • This more and more feels like a good idea, and I'm leaning toward "s-" for struct / structure / structured value. "s-" works just like "h-" except that it doesn't imply any properties at parse time. We can try it and see what happens. There's also no harm if publishers just use "h-" structures, they just (possibly) get a few extra properties if they happen to omit properties.
    • Parallels the same JSON when a property has both a string value and is a structure itself.
      • Changed my mind on this. The parallel is not quite there. 'name'/'url'/'photo' are only implied if there are NO properties, where as the JSON string value + structure convention *always* provides a 'value'. Tantek 22:39, 4 October 2011 (UTC)
      • And due to this difference in behavior ('value' is there when nested properties are present, whereas 'name' is only implied when there are no properties specified), I think it's correct to keep them separate, i.e. stick with implied 'name'. Tantek 14:56, 5 October 2011 (UTC)
    • However, I'm still currently leaning towards the practical convenience of providing a 'name' for the vast majority of microformats uses, rather than diluting this feature for the sake of avoiding implying inapplicable semantics to the few plain structured data microformats, and even then, only when no properties are explicitly specified! I'd rather introduce a new root prefix for those than lose the simplicity and utility of implied 'name'. Tantek 13:48, 4 October 2011 (UTC)
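For comparison, the root-level 'value' alternative discussed above would push work onto an interpreter. The following Python sketch illustrates that alternative only, not the implied 'name' design the thread leans towards. The function name interpret and the mapping of 'hcard' to 'fn' and of 'geo' to 'latitude'/'longitude' are assumptions made for illustration; each vocabulary would have to define its own root property.

```python
def interpret(obj):
    """Hypothetical interpreter: move a literal root 'value' into properties.

    Only acts when the extracted object has no properties at all, which is
    the case the root-level 'value' proposal is meant to cover.
    """
    props = {name: list(values) for name, values in obj.get('properties', {}).items()}
    types = list(obj.get('type', []))
    value = obj.get('value')

    if not props and value is not None:
        if 'geo' in types:
            # Assumed vocabulary rule: geo's literal form is "lat;long".
            lat, sep, lon = value.partition(';')
            if sep:
                props['latitude'] = [lat.strip()]
                props['longitude'] = [lon.strip()]
        elif 'hcard' in types:
            # Assumed vocabulary rule: hCard's root property is 'fn'.
            props['fn'] = [value]

    return {'type': types, 'properties': props}
```

Note that every branch here is vocabulary-specific knowledge, which is exactly what the generic parser is trying to avoid carrying.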

Parsing properties from rel attributes

--BenWard 07:24, 5 October 2011 (UTC):

  • Currently, hAtom parses `bookmark` as a permalink
  • Various microformats parse `rel=tag` as tags
  • The current proposal for parsing does not allow parsing properties from rel attributes.

Microformats parsers could instead extract all link relationships from rel attributes within a microformat object, parsing them as if a u- prefixed property.

  • Minor nit: Rather than same as a u- prefixed property, I think such "rel" properties should be parsed purely from the href attribute on <a> and <area> elements and nothing more. I would strongly disagree to extending rel to apply to other elements with URLs like img src, object data, or to apply to elements in general like div. That's the path that RDFa has taken and caused much confusion as a result. Tantek 07:39, 5 October 2011 (UTC)
    • Agree: That seems like a perfectly reasonable restriction. --BenWard 08:29, 5 October 2011 (UTC)

This results in:

  • Continuing use of the rel attribute in HTML, thereby building on HTML semantics rather than bypassing them or ignoring them in favour of something less meaningful.
  • Parsed hAtom objects contain a property named bookmark, in place of permalink.
  • All microformats that use rel-tag would contain a property named… tag. Perfect.

Since rel attributes are not overloaded for other functionality like class is, and other uses of rel within content are low (and non-semantic uses are nil, to the best of my knowledge) the risk of property pollution would be extremely low.

Note, with regard to this last point, that a generic microformats parser will parse false-positive properties, and will parse objects in combined chunks, rather than individually by format. Extracted objects will often not represent a vocabulary without further processing.
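As a rough illustration of the restriction agreed above (rel values taken only from the href attribute of <a> and <area> elements), here is a small Python sketch using the standard library's html.parser. The names RelCollector and collect_rels are hypothetical, and the sketch works on a whole fragment rather than scoping rels to an enclosing microformat object, which a real parser would need to do.

```python
from html.parser import HTMLParser


class RelCollector(HTMLParser):
    """Hypothetical sketch: collect rel values from <a> and <area> href only."""

    def __init__(self):
        super().__init__()
        self.rels = {}   # rel name -> list of href values, in document order

    def handle_starttag(self, tag, attrs):
        # rel on img, object, div etc. is deliberately ignored.
        if tag not in ('a', 'area'):
            return
        attributes = dict(attrs)
        href = attributes.get('href')
        if not href:
            return
        for rel in (attributes.get('rel') or '').split():
            self.rels.setdefault(rel, []).append(href)


def collect_rels(html):
    parser = RelCollector()
    parser.feed(html)
    parser.close()
    return parser.rels
```

A link such as <a rel="bookmark tag" href="/post/1"> contributes the same href to both 'bookmark' and 'tag', in line with all properties being multi-value.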

  • This sounds like it might be workable. Let's try it and see how well authors "get it". - Tantek
  • Possible issue: do we have any collisions between class property names and rel names? (I don't think so offhand, but useful to ask the question). - Tantek
    • None that I can think of in microformats. There is the case of Google's rel=author and p-author in hAtom. However, the next point, about mfo scoping, would cover it in most situations (rel-author on a hyperlink within an hcard wouldn't be applied to the hentry.) The one situation in a parse tree where it's ambiguous would be this:
<a class="p-author h-card" 
   rel="author" 
   href="http://benward.me">
   Ben Ward
</a>
    • I can think of two quite reasonable solutions:
      • 1. Declare that class properties take precedence over rel properties of the same name, discarding rel values if a class is also found, or
      • 2. Since all properties are now multi-value anyway, the hAtom object could be parsed as:
 {
   'type': ['h-entry'],
   'properties': {
     'author': [
        {
          'value': ['Ben Ward'], /* from the p-author     */
          'type': ['h-card'],    /* from the h-card ...   */
          'properties': { 
            'name': ['Ben Ward'], 
            'url': ['http://benward.me']
          }
        },
        'http://benward.me'      /* from the rel="author" */
     ]
   }
 }
    • BenWard 08:29, 5 October 2011 (UTC)
      • Option 2 makes sense and is consistent with the rest of the multi-value parsing/handling. - Tantek 14:56, 5 October 2011 (UTC)
      • What about without the 'p-author'?
<a class="h-card" 
   rel="author" 
   href="http://benward.me">
   Ben Ward
</a>

Should that be parsed as:

 {
   'type': ['h-entry'],
   'properties': {
     'author': [
        {
          'type': ['h-card'],    /* from the h-card ...   */
          'properties': { 
            'name': ['Ben Ward'], 
            'url': ['http://benward.me']
          }
        },
        'http://benward.me'      /* from the rel="author" */
     ]
   }
 }

Or

 {
   'type': ['h-entry'],
   'properties': {
     'author': [
        {
          'value': 'http://benward.me', /* from the rel="author" */
          'type': ['h-card'],           /* from the h-card ...   */
          'properties': { 
            'name': ['Ben Ward'], 
            'url': ['http://benward.me']
          }
        }
     ]
   }
 }
      • And if the former, then we're presumably saying that the value parsed due to the presence of a rel is always its own value, and does not combine with any other structures. I am fine with this, but I wanted to make sure we are ok with that explicitly. Tantek 14:56, 5 October 2011 (UTC)
        • +1 I think that since the rel attribute is specifically concerned with the relation to an href attribute, it should not be combined with other structures that are rightly declared using classes.
          • The more I've thought about this and how consuming applications may want to treat rel semantics, the more it seems correct to keep rel semantics distinct from class semantics. Class semantics are quite general/flexible, whereas rel is quite specific, naming something else in terms of a relationship from the current page/microformat's perspective. I think we should consider putting rel values in their own 'rel' collection, separate from the 'properties' collection. E.g. the original rel-author p-author h-card markup example would be parsed into this:
 {
   'type': ['h-entry'],
   'properties': {
     'author': [
        {
          'value': ['Ben Ward'], /* from the p-author     */
          'type': ['h-card'],    /* from the h-card ...   */
          'properties': { 
            'name': ['Ben Ward'], 
            'url': ['http://benward.me']
          }
        }
     ]
   },
   'rel': {
     'author': ['http://benward.me'] /* from the rel="author" */
   }
 }
          • and if a post had multiple authors:
 {
   'type': ['h-entry'],
   'properties': {
     'author': [
        {
          'value': ['Ben Ward'], /* from p-author     */
          'type': ['h-card'],    /* from h-card ...   */
          'properties': { 
            'name': ['Ben Ward'], 
            'url': ['http://benward.me']
          }
        },
        {
          'value': ['Tantek Çelik'], /* from 2nd p-author     */
          'type': ['h-card'],        /* from 2nd h-card ...   */
          'properties': { 
            'name': ['Tantek Çelik'], 
            'url': ['http://tantek.com']
          }
        }
     ]
   },
   'rel': {
     'author': [
       'http://benward.me',      /* from rel="author" */
       'http://tantek.com'       /* from 2nd rel="author" */
     ]
   }
 }
          • This preserves the semantic distinction between rel and properties in general, and leaves it up to a higher-level application to implement any logic around showing "more info" about a rel-author, e.g. by correlating the rel-author URL with the 'url' of an hCard it found in the same entry. However, note that even in the earlier JSON data model, the rel-author value just shows up as another property value, and any higher level application would still have to do some correlation logic. At least with this JSON data model, applications that may be looking for a rel value in particular, or a property value in particular can do so without having one unintentionally pollute the other. Tantek 17:33, 6 October 2011 (UTC)


  • Presumably we'd apply all the same property scoping rules to rel scoping as well. E.g. a rel hyperlink inside a microformat won't be seen by any containing microformat. - Tantek
    • Correct, it should be parsed in the same scope as all other class properties in the object.
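The correlation step that the separate 'rel' collection leaves to a higher-level application could look roughly like the Python sketch below. The function name authors_with_cards is hypothetical, and matching a rel-author URL to an h-card by exact string equality with one of the card's 'url' values is an assumption; a real consumer would probably want to normalise URLs first.

```python
def authors_with_cards(entry):
    """Pair each rel-author URL with the author h-cards that list the same 'url'.

    `entry` is assumed to be a parsed h-entry with separate 'properties'
    and 'rel' collections, as in the examples above.
    """
    # Keep only structured author values; plain strings carry no 'url' to match.
    cards = [
        author
        for author in entry.get('properties', {}).get('author', [])
        if isinstance(author, dict)
    ]
    matches = {}
    for url in entry.get('rel', {}).get('author', []):
        matches[url] = [
            card for card in cards
            if url in card.get('properties', {}).get('url', [])
        ]
    return matches
```

A rel-author URL with no matching h-card simply maps to an empty list, so an application looking only at 'rel' or only at 'properties' is unaffected by the other collection.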

Other Interpretation/Parsing Notes

Collection of other unresolved parsing issues in a generic model:

This is good material for documenting as microformats-2-issues, microformats-2-faq, and perhaps some of the more technical details in microformats-2-parsing-faq.

  • The include pattern references other elements from elsewhere in a document. A generic parser needs to track IDs and fill them in after walking the DOM. (also, itemref if adopted.)
  • Will itemref always map to an item property name?
    • No, itemref maps to one or more elements by ids, and their children. Those referenced elements may have property class names themselves, or they may contain elements that do. Tantek
  • hAtom implies author from an hCard in a page that uses an address element. This requires format knowledge, but a generic parser does not currently track the element type of a property node. Should it?
    • It should not. element-specific handling (e.g. using "alt" from img, and "title" from abbr) is completely done at parse time. The JSON data model does not reflect which element type or attribute the value came from. Additionally, hAtom is an example where we created far too many vocabulary-specific rules, in practice they're not necessary, and only complicate the microformat for both publisher understanding and parser implementation. Tantek
  • hAtom defines that the highest level heading within an entry implies entry-title. This particular optimisation might be better off dead.
    • Agreed, this is gone in microformats 2. Tantek
  • hAtom defines that permalinks be parsed from rel attributes, not class
    • In practice this has been one of the more problematic/error prone aspects of hAtom implementations, and it's also inconsistent with other microformats (although hReview tried to use both rel permalinks and "url"). The dependence upon rel-bookmark for permalinks is dropped in h-atom in preference to re-using "u-url" and "u-uid". Tantek
  • XFN is entirely built on rel (although, has various other differences from structural microformats, as do vote-links, so perhaps are excluded from this discussion and will always be handled by dedicated parsers/queries regardless?)
    • The best (easiest and most reliable) use of 'rel' microformats in practice is when they are orthogonal to 'class' microformats. This is true both with XFN and some newer rel values like rel-author. In addition, it was very clear at the recent schema.org workshop's syntax session that RDFa's decision to apparently arbitrarily mix use of 'rel' and 'property' attributes for specifying different types of properties (it wasn't clear to people in the room when you use which for what) has caused a high degree of confusion among publishers and thus high error-rates. Thus if anything we should learn from both the mistakes of RDFa and our own experiences with even very deliberate/specific mixing of rel microformats in class microformats, and keep them defined as separate orthogonal building blocks that work together, but don't depend on each other. Tantek
      • Relatedly to this: rel-tag in hAtom. --BenWard 06:50, 5 October 2011 (UTC)
        • Yes, and two related things here. First, despite my (and others') objection and (past) interoperable post/entry-specific treatment by Technorati and Ice Rocket, Hixie has redefined rel-tag in HTML5 to mean applying to the whole page, not a single post. Second, I've explicitly added 'p-category' to the draft 'h-atom' vocabulary in microformats-2. Tantek 07:12, 5 October 2011 (UTC)
  • HTML's time element includes an optional pubdate attribute. Simply: We should parse this as dt-published. --BenWard 06:12, 10 October 2011 (UTC)
    • *If* there is even some reasonable data on actual use of the "pubdate" attribute (I don't think there is, frankly, especially with the removal of the algorithm to produce Atom from HTML5), then we could consider parsing "pubdate" as a backwards-compatible option for "dt-published". As a general rule, however, it is bad (demonstrably/experienced) design to depend on additional attributes (c.f. RDFa confusion over "property" vs. "rel"), especially for an instance where no additional attribute is necessary. I would leave this out for now until there is non-trivial (more than just test pages or folks who've written HTML5 books, ahem) use in the wild. When there is such use in the wild, it should be documented on a wiki page. We don't want to encourage more complex (additional attribute) publishing as a result of supporting it. Tantek 12:12, 10 October 2011 (UTC)
  • value-class-pattern: In microformats-2, since there are no sub-properties, there will presumably no longer be a 'value' property in any parsed model. Properties such as 'tel > type' in hCard are, as I recall, deprecated due to underuse anyway, so 'tel > value' becomes redundant. (There's also potentially some clarification around 'price > value' in hListing, whereby value was used in a pattern.) So, what does this mean for value class parsing, with regard to value-title patterns and date separation patterns? Are we looking for a 'p-value' and 'p-value-title' classname, but treating them specially (excluding them from regular property parsing)? Or, are we giving them a special prefix (v-text, v-title)? That seems confusing, but could be a concept. I'm fine with p- for both, and just having the parser ignore them since they're special, but need clarification and naming confirmation. --BenWard 09:35, 10 October 2011 (UTC)
    • A few things:
      • 1. Yes, no more subproperties. 'tel' becomes just 'p-tel'. If there is demand for a structured 'tel' value, then we can use that demand (and research into publishing in practice) to brainstorm and create an 'h-tel' structured telephone number (with perhaps fields like 'type', 'extension', some indication of it being local dialing (an extra 0 in some countries) or international dialing, etc.) Or, we address the different 'tel' types as their own flat properties (again as justified by research), e.g. perhaps 'p-tel-fax', or 'p-tel-mobile'. Something for hcard-2-brainstorming.
      • 2. For prices, e.g. hListing, either we're going to need to encode how to parse monetary amounts including monetary symbols, or consider creating an 'h-price' structured price. Not sure what the right answer is here, again, will need to be informed by analysis of documented actual price publication practices.
      • 3. We should avoid introducing a new prefix 'v-' just for value-class-pattern. As we've noted elsewhere, each new prefix adds complexity and should be avoided without substantial advantage.
      • 4. Using 'p-value-title' is strange, as it would be an exception to 'p-' parsing, since it would get the value from the 'title' attribute whereas 'p-' properties don't normally do that (exception: abbr).
      • 5. Using 'p-value' is also strange, as it wouldn't generate a 'value' property in the JSON data model.
      • 6. Class name 'value-title' is already sufficiently prefixed - we have neither found nor heard of any collisions in practice.
      • 7. Class name 'value' can, by its simpler naming nature, be expected to potentially collide with other web designer class name usage though we have no documentation/mention thereof. We could consider a renaming, or providing of alternative, such as 'value-string', or 'value-content', etc. However, let's keep that as a backup plan to use only if/when evidence is presented that we need to.
    • Conclusions: for now, in microformats-2, keep using 'value' and 'value-title' as defined in the value-class-pattern, and add the additional (obvious) interpretation that value class pattern: date and time parsing applies to all 'dt-' properties. - Tantek 12:12, 10 October 2011 (UTC)

see also