microformats2-parsing-issues: Difference between revisions

From Microformats Wiki

Revision as of 19:23, 27 May 2015

This page is for documenting issues with the microformats2-parsing specification.

issues

Open issues in various states of partial resolution from none to nearly resolved.

whitespace collapsing revisited

2015-05-27: (raised by Kevin Marks per Glenn Jones)

Revising the microformats tests to conform to the "don't collapse whitespace" rule below reveals some non-intuitive cases. Preserving whitespace in addresses is somewhat defensible, but in an implied name it is often unhelpful, as it preserves whitespace that is not visible to users and is present only for authoring reasons.

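For example, this test (https://github.com/microformats/tests/commit/a325e0e9bc2089507e69b1883f7065a3316e07c2#diff-d577012c1438978a571c4049179607f0) shows how extraneous whitespace ends up in the name: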

<div class="h-review-aggregate">
    <div class="p-item h-event">
        <h3 class="p-name">Fullfrontal</h3>
        <p class="p-description">A one day JavaScript Conference held in Brighton</p>
        <p><time class="dt-start" datetime="2012-11-09">9th November 2012</time></p>    
    </div> 
    
    <p class="p-rating">
        <span class="p-average value">9.9</span> out of 
        <span class="p-best">10</span> 
        based on <span class="p-count">62</span> reviews
    </p>
</div>

gives a parsed result of:

{
    "items": [{
        "type": ["h-review-aggregate"],
        "properties": {
            "item": [{
                "value": "Fullfrontal\nA one day JavaScript Conference held in Brighton\n9th November 2012",
                "type": ["h-event"],
                "properties": {
                    "name": ["Fullfrontal"],
                    "description": ["A one day JavaScript Conference held in Brighton"],
                    "start": ["2012-11-09"]
                }
            }],
            "rating": ["9.9"],
            "average": ["9.9"],
            "best": ["10"],
            "count": ["62"],
            "name": ["Fullfrontal\nA one day JavaScript Conference held in Brighton\n9th November 2012\n\n\n9.9 out of \n        10 \n        based on 62 reviews"]
        }
    }],
    "rels": {}
}

The value is a reasonable textual representation of the event, but the implied name is full of spurious whitespace that any consumer would have to strip.

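h-review (https://github.com/microformats/tests/commit/4c9690b53b0a2f40440abac8e609c51ac7dd6d56) has similar issues.

Glenn also raised label in the h-adr example (http://testrunner-47055.onmodulus.net/test/microformats-v2/h-adr/geo/), but in this case vCard LABEL is supposed to preserve newlines, so this is less clear.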

Options:

1. Keep as is, and every parser client has to post-process for common cases.

2. Make implied name and value normalise whitespace.

3. Just make implied name normalise whitespace.

4. Put \n in textual forms if there is a <p> tag in the original.
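
As a rough illustration of option 3, the following Python sketch normalises whitespace in an implied name only. The function name `normalise_implied_name` is hypothetical, and reading "normalise" as "collapse every whitespace run to a single space and trim the ends" is an assumption; the issue does not define it.

```python
def normalise_implied_name(text: str) -> str:
    """Collapse each run of whitespace to one space and trim both ends.

    Assumption: "normalise" means HTML-style collapsing; the issue text
    does not pin this down.
    """
    # str.split() with no argument splits on any whitespace run and drops
    # leading/trailing whitespace, so joining with " " collapses everything.
    return " ".join(text.split())


# The implied name from the parsed result above.
implied_name = (
    "Fullfrontal\nA one day JavaScript Conference held in Brighton\n"
    "9th November 2012\n\n\n9.9 out of \n        10 \n"
    "        based on 62 reviews"
)
print(normalise_implied_name(implied_name))
```

Under option 3 the nested item's value would keep its newlines; only the implied name would pass through a step like this.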


uf2 children inside a classic microformats root class name

2015-020: (raised by kylewm) What should microformats2 children inside a classic microformats root class name do?

Options:

1. Nothing. Any unattached uf2 children inside a classic microformats root are ignored. Problems:

  • However, there is then a possible surprise if/when the author upgrades the classic microformats root to uf2: all of a sudden, all the new uf2 children show up.
  • Another downside: the author adds uf2 markup and can't figure out why nothing is happening (because somewhere up the tree, in code they didn't touch, there are classic microformats hiding these unattached uf2 children).

2. Show up in the children collection of the classic microformats root

  • Feels most predictable. When you add uf2 root class names anywhere, they will show up in the JSON output hierarchy.
  • When you convert ancestor classic microformats root class names to uf2 root class names, no surprise in terms of which microformats show up. Same children collection.
  • +1 Thus I'm leaning towards this one, despite the fact that classic microformats never had a concept of generic unattached children. Tantek 04:55, 21 January 2015 (UTC)
  • ++ Consensus at 2015-01-20 - option that presents the least surprises in the most cases.

3. Show up as peers to the classic microformats root. Issue(s):

  • Has the surprise aspect that if/when you convert the classic root class name to a uf2 root class name, the former peers become unattached children.
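
Option 2 can be sketched over a toy element tree. This is a minimal Python illustration, not parser code from any implementation: `Node`, `root_types` and `parse` are hypothetical names, `CLASSIC_ROOTS` lists only three classic root class names, and property parsing is left out entirely so that only the placement of unattached children is shown.

```python
from dataclasses import dataclass, field

# Small illustrative subset of classic root class names and their
# microformats2 equivalents.
CLASSIC_ROOTS = {"vcard": "h-card", "hentry": "h-entry", "vevent": "h-event"}


@dataclass
class Node:
    classes: list
    children: list = field(default_factory=list)


def root_types(node):
    """Return mf2 types: explicit h-* names, else mapped classic roots."""
    mf2 = [c for c in node.classes if c.startswith("h-")]
    if mf2:
        return mf2
    return [CLASSIC_ROOTS[c] for c in node.classes if c in CLASSIC_ROOTS]


def parse(node):
    """Return the list of items found at or below this node.

    A root nested inside another root (classic or uf2) ends up in that
    ancestor's "children" list, which is the behaviour of option 2.
    """
    nested = [item for child in node.children for item in parse(child)]
    types = root_types(node)
    if not types:
        return nested  # not a root: pass nested items up to the caller
    item = {"type": types, "properties": {}}
    if nested:
        item["children"] = nested
    return [item]


# A classic hentry with an unattached uf2 h-card somewhere below it.
tree = Node(["hentry"], [Node([], [Node(["h-card"])])])
print(parse(tree))
```

Replacing `"hentry"` with `"h-entry"` in the sample tree yields the same output, which is the "no surprise" property that the consensus favoured.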

any h- root class name overrides and stops backcompat root

2015-020: The presence of any h-* root class name overrides and stops any backcompat parsing of classic microformats root class names. Tantek 04:55, 21 January 2015 (UTC)

Thoughts?

  • Tom & Kyle - implementable with the same backcompat root flag as needed for restricting backcompat root class name to only seeing backcompat property class names
  • ++ Consensus at 2015-01-20 - option that presents the least surprises in the most cases.

backcompat classic microformats should only see backcompat properties

2015-020: When parsing a microformats vocabulary that indicates a backcompat root class name (and thus an absence of the microformats2 equivalent on the same element), parsers must only look for the backcompat properties that are specified explicitly for that backcompat root class. Tantek 04:04, 21 January 2015 (UTC)

Reasoning: such behavior was never expected by authors, and crossing a classic microformats root class name with microformats2 property names was never explicitly expected nor specified to work.

Thoughts?

  • Tom & Kyle - implementable with the same backcompat root flag as needed for
  • ++ Consensus at 2015-01-20 - option that presents the least surprises in the most cases.

microformats2 root class names should only see microformats2 properties

2015-020: When parsing a microformats2 root class name, only explicit microformats2 properties should be parsed. Any backcompat property names must be ignored. Tantek 04:04, 21 January 2015 (UTC)

Reasoning: such microformats2 authors should be expected to do all their microformats markup with microformats2 class names - this is a deliberate expectation so that their microformats aren't polluted with other (classic microformats) coincidentally named generic class names.

Thoughts?

  • ++ Consensus at 2015-01-20 - option that presents the least surprises in the most cases.
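
The three resolutions above (an h-* root overrides backcompat, backcompat roots only see backcompat properties, microformats2 roots only see microformats2 properties) all hinge on one per-root mode. The Python sketch below shows one way such a "backcompat root flag" could work; the function names are hypothetical and `BACKCOMPAT_PROPERTIES` is a small illustrative subset, not a complete backcompat table.

```python
# Illustrative subset only: backcompat property class names per classic root,
# mapped to their microformats2 equivalents.
BACKCOMPAT_PROPERTIES = {
    "vcard": {"fn": "p-name", "url": "u-url", "photo": "u-photo"},
    "hentry": {"entry-title": "p-name", "entry-content": "e-content"},
}
MF2_PREFIXES = ("p-", "u-", "dt-", "e-")


def backcompat_root(classes):
    """Return the classic root name in effect on an element, or None.

    Any h-* root class name on the same element switches backcompat off.
    """
    if any(c.startswith("h-") for c in classes):
        return None
    for c in classes:
        if c in BACKCOMPAT_PROPERTIES:
            return c
    return None


def visible_properties(root_classes, property_classes):
    """Return the mf2 property names a root "sees" on a descendant element."""
    classic = backcompat_root(root_classes)
    if classic is not None:
        table = BACKCOMPAT_PROPERTIES[classic]
        # Backcompat root: only its explicitly specified classic properties.
        return [table[c] for c in property_classes if c in table]
    # microformats2 root: only prefixed microformats2 property class names.
    return [c for c in property_classes if c.startswith(MF2_PREFIXES)]


print(visible_properties(["vcard"], ["fn", "p-org", "url"]))
print(visible_properties(["h-card"], ["fn", "p-org"]))
```

The first call ignores `p-org` because the root is classic; the second ignores `fn` because the root is microformats2.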

implied properties on backcompat parsing unlikely to be intended

Since classic microformats had no notion of implied properties, when implied property parsing occurs on backward compat classic microformats root class names, it is unlikely that any implied property (p-name u-url u-photo) was ever intended by the author of the classic microformat. Tantek 02:43, 30 December 2014 (UTC) Examples:

Proposed resolution:

  • Be explicit in implied property parsing that it must only be done for explicit 'h-*' root class name microformats, not for any (back)compat parsing of microformats. Please comment on this proposal with "** comment" on new lines below. Tantek 02:43, 30 December 2014 (UTC)
    • +1 This makes a lot of sense to me. We should strive to parse mf1 as it was intended by the author, and I think you're right that implied rules are unlikely to be what was intended Kylewm 03:22, 30 December 2014 (UTC)
    • RESOLVED at 2015-01-20 meetup. Tantek 04:09, 21 January 2015 (UTC)

implied properties when an explicit class is provided

Should "u-url" still be implied if another explicit class is already provided, as below? This is a contrived example, but it is taken from Bridgy's unit tests.

<article class="h-entry">
  <a class="u-like-of" href="http://orig.domain/baz">liked this</a>
</article>

In this case, http://orig.domain/baz is almost certainly not the u-url, so IMO it would be better to leave it out —Kylewm 15:10, 7 October 2014 (UTC)

Proposed resolution:

  • Changed my mind. Simpler to do nothing. Example provided is artificially constructed and does not reflect the kind of real-world confusion that would justify making implied properties more complicated. Tantek 06:26, 21 January 2015 (UTC)
  • ++ Consensus on do nothing for this case. At 2015-01-20

link elements and u- parsing

  • Raised by tantek on 2014-07-08 on irc: should the parsing specification for handling u- properties be modified to include the link element? The potential downside is that invisible-metadata-is-considered-harmful; however, all known real world examples of link are semi-visible data (not fully hidden).

There are potential cases for wanting to use link as an alternative to a (and area), such as a whole page where the root html element is an h-card and the properties are included across the page: some in visible data in the body while others are in the head as link elements. Example:

One specific use-case is the semi-visible link rel="shortcut icon" href="..." - which is visible sometimes in browser UI, and also when a user chooses "Add to Home Screen" on a mobile device. Such page level icons may be used as a u-photo or u-logo of the containing h-* object on the html element.

  • http://adactio.com/about/myself/ on 2014-190
    • could use <html class=h-card> - page is all about Jeremy Keith the person
    • icon / logo is only on <link> tag which could use class=u-logo:
      <link rel="shortcut icon apple-touch-icon" type="image/png" href="/icon.png" />

Another specific use-case is a post permalink page, e.g. with <html class=h-entry>

Another use-case is publishing links to PGP/GPG keys linked from the head which is currently handled by <link rel=pgpkey> which is already supported in existing microformats2 rel parsing of link rel elements. Thus there is an (admittedly weak) argument for consistently parsing both <link rel> and <link class="u-*">.

E.g. inside that aforementioned real world <html class=h-entry> post permalink page example,

  • why should <link rel="in-reply-to"> work
  • but not <link class="u-in-reply-to"> ?

The slightly stronger argument for consistency of link handling is that it simplifies the publisher (and parser) model:

  • <a> and <area> work for both rel and class
  • why does <link> only work for rel ?
  • it would be simpler if all three tags just worked (in the same way) for both rel and class

Should the parsing spec be modified to handle these cases? Tom Morris 09:25, 9 July 2014 (UTC)

    • I'm generally in favour. It'd be good to see what other parser developers think. —Tom Morris 10:16, 9 July 2014 (UTC)
    • Adding this to the parsers won't be an issue. The question is: should the door be opened to hidden mf data? Upon further reflection, there seems to be no need to distinguish between rel=property and class=u-property on link elements. So I am in favour, for consistency. Kartik 18:30, 2014-07-09 (EST)
    • RESOLVED at 2015-01-20 meetup. Make link consistent with a.
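
A minimal Python sketch of "make link consistent with a", assuming a parser that already reads u-* values from the href of a and area. The class name `UrlPropertyCollector` and the sample markup are made up for illustration, and the sketch skips root detection and relative URL resolution.

```python
from html.parser import HTMLParser

# Elements whose href attribute supplies a u-* value in this sketch.
HREF_ELEMENTS = {"a", "area", "link"}


class UrlPropertyCollector(HTMLParser):
    """Collect u-* properties from a, area and link elements alike."""

    def __init__(self):
        super().__init__()
        self.properties = {}

    def handle_starttag(self, tag, attrs):
        # Also reached for self-closing tags such as <link ... />, because
        # the default handle_startendtag delegates to handle_starttag.
        attributes = dict(attrs)
        href = attributes.get("href")
        if tag not in HREF_ELEMENTS or href is None:
            return
        for name in (attributes.get("class") or "").split():
            if name.startswith("u-"):
                # Strip the "u-" prefix to get the property name.
                self.properties.setdefault(name[2:], []).append(href)


collector = UrlPropertyCollector()
collector.feed(
    '<link class="u-logo" rel="shortcut icon" type="image/png" href="/icon.png" />'
    '<a class="u-url" href="/about/">About</a>'
)
print(collector.properties)
```

Because link is just one more entry in `HREF_ELEMENTS`, all three tags share one code path for class-based parsing, which is the simplification argued for above.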

e- parsing iframe srcdoc

  • Proposal: addition of a new e-* parsing rule for iframe elements with srcdoc attributes. E.G.
<div class="h-entry">
 <iframe class="e-content" srcdoc="<p>A paragraph of HTML with &quot;quoted quotes&quot; &amp;amp; doubly quoted ampersands</p>" />
</div>
{
 "items": [{
  "type": ["h-entry"],
  "properties": {
   "content": ["<p>A paragraph of HTML with &quot;quoted quotes&quot; &amp; doubly quoted ampersands</p>"]
  }
 }]
}

This would allow, for example, HTML comments to be sandboxed inside iframes but still parsable as microformats.

I believe the correct processing would be to leave &quot; entities as they are but to unescape any doubly-escaped ampersands.

    • Is there any use case for that? —Tom Morris 12:32, 14 September 2013 (UTC)
    • +1 we need documentation of use case and existing sites publishing iframe srcdoc like this - Tantek 00:47, 15 September 2013 (UTC)
    • Rejected by consensus at 2015-01-20 meetup due to lack of real-world use cases / existing sites. Tantek 06:26, 21 January 2015 (UTC)

How to interpret mf2 properties on select

Should select elements with properties be treated any differently, and if so, how?

Awaiting real world examples / stronger use-cases, until then no special treatment of select elements with properties:

  • Are there any real world examples of select elements with microformats properties?
  • What would the use-case be for putting a microformats property class name on a select element?
  • Nothing special. By consensus at 2015-01-20 meetup due to lack of real-world use cases / existing sites. Tantek 06:26, 21 January 2015 (UTC)

How to interpret mf2 root name on form

See what to do about root class names on <form> elements in particular:

Awaiting real world examples / stronger use-cases, until then, no special treatment of root class names on <form> elements:

  • Are there any real world examples of a <form> element with a microformats root class name?
  • hcard-input is one possible use-case, is anyone attempting to use forms for hCard input, e.g. with scripts to help make it work?
  • Are there other use-cases for putting a microformats root class name on a <form> element?
  • As of 2015-01-20 - no consensus - need more input as to when/why this is useful to do anything special.


Parsing Literal Values

Issue raised by: Ben Ward

It is proposed for microformats2 that all microformats be parsable from just their root element, e.g. <p class="h-card">Ben Ward</p> would create an hCard with the following properties after parsing:

{ 
  'type': ['h-card'],
  'properties': {
     'name': ['Ben Ward']
  }
}

This is a four-fold change from the current hCard:

  1. type is generically identifiable as a microformat root, even in parsed form. The use of the 'h-' prefix persists into the type of the object. This is deliberately so, as a result of re-using the JSON data model of microdata which itself is re-using a common JSON convention, such that microformatted data is clearly distinguishable (as opposed to any other random schema that may be using a similar data model).
  2. root-class-only support. Per microformats-2-implied-properties, the name property is implied by the entirety of the root class name element.
  3. 'name' instead of 'fn'. As also documented in microformats-2-implied-properties, the continuous challenges/problems and need to repeatedly re-explain 'fn' over the years combined with the real-world market response of nearly every other party doing a person vocabulary renaming 'fn' to 'name', microformats 2 makes this change as well.
  4. There is no automatic parse-time inferring of 'given-name': ['Ben'] and 'family-name': ['Ward']. Any such inferring *might* be made by a vCard converter, but is left up to that specific application (not all applications) built on that vocabulary, though even in that case it may not be necessary, as an empty "N:;;;" vCard property is sufficient to satisfy the N property requirement of vCard, and also causes no problems when imported into various vcard-implementations.

It is required of the extractor to understand that when a microformats object specifies no explicit child properties, that it must treat h-card as having a p-name. But, the parser is generic, so it also treats h-review, h-entry, h-recipe, h-geo as having a ‘p-name’.
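
The implied-name behaviour described so far can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the input is a single well-formed element, explicit properties are only looked for on direct children with a p-* class name, and `parse_root` is a hypothetical name rather than any parser's API.

```python
import xml.etree.ElementTree as ET


def parse_root(markup: str) -> dict:
    """Parse one root element, implying 'name' when no properties exist."""
    element = ET.fromstring(markup)
    types = [c for c in element.get("class", "").split() if c.startswith("h-")]
    properties = {}
    for child in element:
        for name in child.get("class", "").split():
            if name.startswith("p-"):
                text = "".join(child.itertext())
                properties.setdefault(name[2:], []).append(text)
    if not properties:
        # No explicit properties: the root's whole text becomes the name.
        # The rule is generic, so it applies to h-geo as much as to h-card.
        properties["name"] = ["".join(element.itertext())]
    return {"type": types, "properties": properties}


print(parse_root('<p class="h-card">Ben Ward</p>'))
print(parse_root('<span class="h-geo">1.3232;-0.543</span>'))
```

The second call shows the trade-off discussed in this issue: a format without a name semantic still receives an implied name.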

As a result, specific vocabularies are evolved to drop their specific form of name (e.g. fn, summary, entry-title) and simplified to use a common 'name' property instead.

Note: while the overwhelming majority of real world publishing/consuming uses of microformats do so with proper nouns which have names (and thus this parser-level incorporation of an implied 'name'), there are some formats that do not have a 'name' semantic. For example, geo, adr, and possibly if/when developed, units of measure, length, cost. The current thinking is that the benefits to the far greater proper-noun use-case of microformats outweigh the technical inelegance of having an extra/ignored 'name' property on formats that lack such a semantic.

Some formats also may appear in theory to better imply some other property, e.g. a review might be thought to imply its content, not its name, and an Atom entry its content, not its title, but in practice (actual publishing patterns) this is not the case. Typically, brief unstructured reviews (or mentions thereof) provide a summary (often hyperlinked to an expanded structured form) of that review, not its content, and similarly, brief unstructured posts (e.g. RSS items) have historically most often been link blog items which include the title of an item and a link. Short status updates as well established by Twitter are newer and would seem to imply purely content with no title, at least semantically, however, even Twitter populates the RSS title and ATOM entry title of their feeds with the content. It's not clear what went into that decision, however, that's likely irrelevant, as the outcome turns out to be emergent consistency among publishing behaviors.

To avoid overloading or undermining the semantics of a vocabulary, I propose that we handle this at the extractor level in a simpler fashion: Define a new property for literal data, that an extractor will provide if no other information was available. All interpreters may then be instructed that in the event that an object has no properties, it can attempt to interpret the literal value from the page instead.

  • This was one of the design iterations I went through which led me to the current implied 'name' design. Another iteration was the ability for a vocabulary to specify a single required property which was implied if there were no properties provided. However, the combination of the fact that in most cases such single required properties were quite name-like, and that a vocabulary-specific rule like that would then bind parsers to specific vocabularies (even so slightly) led me to collapse them into implying a 'name'. It's not perfect, but it's the best alternative so far that balances practical convenience of publishing/consuming, avoids vocabulary-specific knowledge in the parser, and technical (in)elegance. Tantek 13:48, 4 October 2011 (UTC)

In existing microformats, the closest existing example we have for this is the label property in hCard, which is used to represent the literal address label for a place. It is a corresponding piece of fn, org and adr in combination, but has no structure in and of itself. Possibly, every microformat could have a label form where structured data is unavailable.

However in practice, the hCard label property is both little understood and little used. It's not even clear that it ought to be kept for microformats 2 (no known consumers, very few (if any?) real-world non-test publishers). This disuse is likely a good indicator that we should avoid basing anything on its design.

Alternatively, value is used throughout microformats to target a generic value (e.g. in combination with price in hListing.) It has been proposed that when parsing properties that are also themselves microformats, we create native objects of the form:

   {
       'value': '1900 12th Street, San Francisco, CA 94'
     , 'type': ['h-adr']
     , 'properties': {
           'street-address': '1900 12th Street'
         , 'etc': 'etc'
       }
   }

We could apply this same pattern to the root level:

   {
       'type': ['h-card']
     , 'properties': {}
     , 'value': 'Ben Ward'
   }

In this case, an interpreter or implementation is responsible for using value in place of fn, or restructuring the object. It would be the responsibility of each vocabulary to define its root property. The parsing layer of microformats 2.0 would not impose semantics or naming onto that.

For another example, h-geo would end up like this:

   {
       'type': ['h-geo']
     , 'properties': {}
     , 'value': '1.3232;-0.543'
   }
  • This is an alternative I've been considering as well: Tantek 13:48, 4 October 2011 (UTC)
    • 'value' is more generic than 'name' (applies to more vocabularies) with the trade-off that it naturally has less (weaker) semantics.
      • +1 I think that having naturally weaker semantics would be appropriate for this parsing functionality. —BenWard 07:24, 5 October 2011 (UTC)
    • The interesting thing that this analysis has revealed is that there appear to be two distinct clusters of microformats, the much more commonly used/understood/useful proper-noun microformats which markup things with names (people, events, reviews, recipes), and the less used compound-data microformats which are often used inside other microformats and just have some sort of semi-structured value (adr, geo, measure, and perhaps even things like tel). Perhaps this is implying the possibility and some degree of utility for two microformats root class name prefixes, 'h-' for existing proper-noun microformats, and something else ('m-' for microformat/molecule?, 's-' for structured-value?, 'v-' for value (though historically "v-"/"v." has meant "vendor-specific")?) for unnamed structured data microformats.
      • This more and more feels like a good idea, and I'm leaning toward "s-" for struct / structure / structured value. "s-" works just like "h-" except that it doesn't imply any properties at parse time. We can try it and see what happens. There's also no harm if publishers just use "h-" structures, they just (possibly) get a few extra properties if they happen to omit properties.
    • Parallels the same JSON when a property has both a string value and is a structure itself.
      • Changed my mind on this. The parallel is not quite there. 'name'/'url'/'photo' are only implied if there are NO properties, whereas the JSON string value + structure convention *always* provides a 'value'. Tantek 22:39, 4 October 2011 (UTC)
      • And due to this difference in behavior ('value' is there when nested properties are present, whereas 'name' is only implied when there are no properties specified), I think it's correct to keep them separate, i.e. stick with implied 'name'. Tantek 14:56, 5 October 2011 (UTC)
    • However, I'm still currently leaning towards the practical convenience of providing a 'name' for the vast majority of microformats uses, rather than diluting this feature for the sake of avoiding implying inapplicable semantics to the few plain structured data microformats, and even then, only when no properties are explicitly specified! I'd rather introduce a new root prefix for those than lose the simplicity and utility of implied 'name'. Tantek 13:48, 4 October 2011 (UTC)

resolved

When to collapse whitespace in properties

The spec doesn’t explicitly require whitespace to be collapsed or not. The official mf2 test suite requires it to be collapsed.

Reasons why whitespace shouldn’t be collapsed:

  • Plaintext property representations of syntax-highlighted code, poetry and song lyrics require whitespace to be present
  • Whether or not whitespace is an important part of the content being parsed is determined by CSS white-space and CANNOT be inferred from HTML markup alone

Resolution 2013-11-12: Agreed, whitespace should not be collapsed (other than normal HTML5 parsing rules). The spec now refers to "textContent" rather than "innertext" to make this explicit.

How to interpret mf2 classnames on form inputs

E.G. how to parse:

<input class="u-url" value="https://brennannovak.com/notes/338" />

Examples in the wild: https://brennannovak.com/notes/338

See proposal:

Resolution 2013-11-12: Per that proposal, p- u- dt- properties on input[value] elements now use the value attribute.
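
The input[value] resolution can be sketched in Python as follows. `InputValueCollector` is a hypothetical name, and the sketch ignores root detection and everything except input elements.

```python
from html.parser import HTMLParser

# Prefixes that read the value attribute on input elements per the resolution.
VALUE_PREFIXES = ("p-", "u-", "dt-")


class InputValueCollector(HTMLParser):
    """Collect p-, u- and dt- properties from input[value] elements."""

    def __init__(self):
        super().__init__()
        self.properties = {}

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        value = attributes.get("value")
        if tag != "input" or value is None:
            return
        for name in (attributes.get("class") or "").split():
            if name.startswith(VALUE_PREFIXES):
                # "u-url" -> "url", "dt-start" -> "start"
                prop = name.split("-", 1)[1]
                self.properties.setdefault(prop, []).append(value)


collector = InputValueCollector()
collector.feed('<input class="u-url" value="https://brennannovak.com/notes/338" />')
print(collector.properties)
```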

mixture of microformats2 and classic microformats classnames on different elements

Some sites in the wild have mistakenly combined classic mf and mf2 markup in ways which misrepresent the content if parsed in BC mode.

Typically this is caused by putting classic and mf2 classnames for the same vocabulary on different elements, e.g:

<body class="hentry">
 <article class="h-entry">
  <h1 class="p-name"></h1>
 </article>
</body>

Sites where this has been observed:

Discussion:

  • As far as I can tell, the problems in all of these examples were caused by mf2 markup being injected by a WordPress plugin, but classic mf classnames being present further up the DOM in the themes. When parsed in compatibility mode, the classic mf classnames are transformed into mf2 classnames, making the original mf2 classnames look like children of empty items.
  • Turns out this isn’t theme-specific, WordPress injects hentry via PHP [1]. The bug with the WordPress mf2 plugin is resolved as of 2013-10-22 --bw 13:38, 22 October 2013 (UTC)

e- and p- escaping levels

  • The fact that the parsed value of any element with .e-* is at a different level of escaping to the parsed values of p-*, dt-* etc. without any indication of how the property was parsed in the output is a security problem. For example:
input:

<p class="h-card">
 <span class="p-name">&lt;tag&gt;</span>
</p>

output:

{
    "items": [
        {
            "type": [
                "h-card"
            ],
            "properties": {
                "name": [
                    "<tag>"
                ]
            }
        }
    ]
}

input:

<p class="h-card">
 <span class="e-name">&lt;tag&gt;</span>
</p>

output:

{
    "items": [
        {
            "type": [
                "h-card"
            ],
            "properties": {
                "name": [
                    "&lt;tag&gt;"
                ]
            }
        }
    ]
}
  • As a parser developer, the most straightforward way I can think of solving this is to add an option (enabled by default) which encodes HTML special characters on all non e-* properties, so the developer knows that all property values are going to be at the same level of escaping. --bw 20:00, 15 June 2013 (UTC)
    • Your suggestion of auto-HTML-encoding p-*/u-*/dt-* property values is the most sensible I think. I would NOT make it an option, as it makes sense to write consistent microformats2 consumers. - Tantek 07:18, 5 July 2013 (UTC)
    • Can you think of any existing apps/consumers of microformats2 via the parser that would break? What would indieweb comments parsers do? - Tantek 07:18, 5 July 2013 (UTC)
      • The only breakage which might occur would be over-encoding of non e-* properties, but I’ll release this update as v0.2.0 and warn people about the changes. The worst thing which could happen is that some comments look a bit weird, as opposed to the current worst possible scenario of easy XSS attacks --bw 12:55, 5 July 2013 (UTC)
      • We should also decide exactly which characters get encoded — just angle brackets, or quotes/ampersands as well? --bw 12:55, 5 July 2013 (UTC)
      • I am not sure about this, it seems more like a helper function rather than a core feature of the parser. Personally I would like to store data as text and encode only when I am going to use it and I know the format it is going to be used in. --Glenn Jones 9:54, 14 July 2013 (UTC)
      • After the discussion on the indiewebcamp IRC with Barnaby Walters I now understand the XSS issue that this change is trying to address. A rogue author could include HTML with scripts to execute an XSS attack. These could be masked by switching prefixes, i.e. p-* to e-*, on a well-used property. As the consumer does not see the prefix in the JSON output they have no idea if a property will contain HTML or text. I will update my two parsers and the test suite --Glenn Jones 8:02, 17 July 2013 (UTC)
    • So what about an author setting a property to e-* when it would normally be p-*, dt-* or u-*, i.e.
<div class="h-card"><p class="e-name"><script> alert('xss test') </script></p></div>
  • Resolved by changes to the parsing spec: all properties are plaintext (non-HTML escaped), e-* properties result in a dictionary with value = plaintext version, html = raw HTML version
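
The resolved output shape can be illustrated with a short Python sketch. It assumes a single well-formed element as input, `parse_property` is a hypothetical name, and the html member here is a re-serialisation of the element's contents rather than the untouched source bytes.

```python
import xml.etree.ElementTree as ET
from html import escape


def parse_property(markup: str) -> dict:
    """Return {property name: parsed value} for one well-formed element."""
    element = ET.fromstring(markup)
    text = "".join(element.itertext())
    result = {}
    for name in element.get("class", "").split():
        prefix, _, prop = name.partition("-")
        if prefix == "e" and prop:
            # Re-serialise the contents as HTML: escaped leading text, then
            # each child element (tostring also emits the child's tail text).
            inner = escape(element.text or "", quote=False) + "".join(
                ET.tostring(child, encoding="unicode") for child in element
            )
            result[prop] = {"value": text, "html": inner}
        elif prefix == "p" and prop:
            # Plain-text property: a bare string, never HTML-escaped.
            result[prop] = text
    return result


print(parse_property('<span class="p-name">&lt;tag&gt;</span>'))
print(parse_property('<span class="e-name">&lt;tag&gt;</span>'))
```

A consumer can now tell the two cases apart by type: a string is always plain text, and HTML only ever appears under the html key of a dictionary.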


br hr empty string

  • The parsing rule 'else if br.p-x or hr.p-x, then return "" (empty string)' for p-* can cause any code consuming the API to become quite bloated. It means that you have to test every array value to see if it is an empty string. It is also unclear to me what the purpose of this mark-up pattern is. Glenn Jones
    • Upon reconsidering this, I agree with you, this is an unlikely use case. If a publisher wants to explicitly set an empty property "p-foo" they can simply write <span class="p-foo"></span> which looks explicit. Whereas BR and HR tags are often just presentational, so we should both not encourage usage of them for semantics, and anyone that did use them would be subject to likely loss of semantics upon a redesign (that got rid of those particular BR and HR tags). I'm going to remove them from the parsing spec. - Tantek 15:29, 10 February 2013 (UTC)

datetime examples without T delimiter

  • The examples in the wiki microformats-2 pages such as h-entry and h-entry had datetime without the 'T' delimiter between date and time, i.e.
<time class="dt-published" datetime="2013-06-13 12:00:00">13<sup>th</sup> June 2013</time>

I have updated the pages. As far as I know this is a new pattern for dates. Was it a mistake in the examples, or is it a new datetime pattern?

    • The HTML5 "time" element, and "datetime" attribute allow for space " " as a separator between date and time as well as "T", thus we allow it for microformats as well. The " " separator is preferred as the date and time are more readable when separated by a space. The examples noted in those specs deliberately use this. - Tantek 18:48, 15 July 2013 (UTC)
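
For consumers, the two separators are interchangeable. As a small consumer-side illustration (an assumption about one possible consumer, not part of the parsing spec), Python's `datetime.fromisoformat` accepts either form; `parse_datetime_value` is a hypothetical helper name.

```python
from datetime import datetime


def parse_datetime_value(value: str) -> datetime:
    """Parse a dt-* value that uses either " " or "T" between date and time."""
    # fromisoformat accepts any single character as the date/time separator.
    return datetime.fromisoformat(value)


spaced = parse_datetime_value("2013-06-13 12:00:00")
with_t = parse_datetime_value("2013-06-13T12:00:00")
print(spaced == with_t)
```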

rel-alternate absent optional attributes

  • What should rel-alternate parsing do when one of the optional attributes specified (hreflang or media or both) is not there? The options seem to be:
    1. leave the corresponding key out of the alternate JSON object
      • This one. Leave the corresponding key out.
    2. include the corresponding key in the alternate JSON object, but set the value to the JSON null object
    3. include the corresponding key in the alternate JSON object, but set the value to a blank string
    4. something I haven't thought of

I haven't checked the existing implementations, but Barnaby said he's not sure what the appropriate way to deal with it is either. —Tom Morris 15:41, 9 August 2013 (UTC)
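
A minimal Python sketch of option 1, the one chosen above, might look like this. The helper name `build_alternate` and the `url` key are assumptions made for illustration; the input is assumed to be a plain dict of the element's attributes.

```python
def build_alternate(attrs):
    """Hypothetical helper: build one rel-alternate entry from an attribute dict."""
    alternate = {"url": attrs["href"]}
    # Copy optional attributes only when the author actually supplied them,
    # so an absent attribute means an absent key (no null, no empty string).
    for key in ("hreflang", "media"):
        if key in attrs:
            alternate[key] = attrs[key]
    return alternate


full = build_alternate(
    {"href": "http://example.com/fr", "hreflang": "fr", "media": "handheld"}
)
bare = build_alternate({"href": "http://example.com/feed"})
```

Consumers can then test for presence with `"hreflang" in alternate` instead of comparing against null or a blank string.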

rel-alternate and type attribute

  • Should rel-alternate parsing also pick up the type attribute? It’s fairly widely used, e.g. for Atom feeds.
    • Numerous existing sites/pages have various rel-alternate uses with a type attribute for feeds/APIs so that's good enough to add this for help with discovery in general. Rel parsing updated. - Tantek 00:47, 15 September 2013 (UTC)

Extraction vs Interpretation

Issue raised by: Ben Ward

A microformats ‘1.0’ parser performs the following function:

  • Given a piece of HTML content, discover a known microformat, extract it, apply various extraction patterns based upon the HTML mark-up used (e.g. include pattern, abbr patterns, date-time patterns, value-title pattern), apply various content optimisations where applicable, and return the result in an object native to the programming language.

This is performing two types of function: Extraction of data from an HTML document or fragment, and interpretation and optimisation of that content to match the rules set out by a vocabulary specification.

It is only possible to write a generic parser that covers the first half of this task: Extraction, and application of global rules based on HTML elements and patterns common to all formats.

The purpose of a generic parser (as supported by use cases such as search engines, and other crawlers) is:

To provide a way for tools to extract rich data from a page for native storage, such that the data may be interpreted later by applications. This allows microformats to be crawled, and indexed, and removes the need to include complex HTML parsing within every implementation of microformat data.

Microformats will continue to define various vocabulary-specific optimisations, as part of the design to be optimised for authors. For example: The fn pattern in hcard, or the lat;long pattern in geo, as well as default values for properties, such as the maximum rating in an hreview.

  • Actually, no, as it is defined currently, microformats 2 drops vocabulary-specific optimizations. Such optimizations have often been too inapplicable, error prone or i18n-unsafe (e.g. fn to given-name + family-name fails for both numerous cases where middlenames/initials are used, and in general in numerous Asian languages where given/family name order is the reverse of Western English conventions, or languages with multiple family-names, e.g. Spanish - see hcard-issues-resolved for more). This is a deliberate cutting of a "feature" from microformats 1, it is a deliberate model simplification design decision. Tantek 12:43, 4 October 2011 (UTC)

Extraction resolution

Proposed resolution:

Microformats2 should refer only to extraction of microformats. Vocabularies should in turn document their appropriate optimisations, which will need to be applied by implementations, or a companion to an extractor, which I'll refer to here as an ‘interpreter’.

  • Vocabularies will no longer have optimizations; this is again deliberate, as they've been shown to be more error prone than helpful. Thus there should be no need for any vocabulary-specific 'interpreters'. However, due to design quirks in various legacy/interchange formats, export conversion algorithms to those legacy/interchange formats will require some additional legacy-format-specific rules (e.g. odd "required" rules in Atom or vCard will require specific synthesis rules, limitations in said formats will require filtering of some values, e.g. vcard3 BDAY disallows vague birthdays like year-month and --month-day - subsequently allowed in vcard4). Tantek 12:43, 4 October 2011 (UTC)

A microformats2 ‘extractor’, in combination with the functionality of a domain and format-aware ‘interpreter’ (either another shared component, or part of the implementation itself) would be equivalent to a microformats 1.0 ‘parser.’

  • A microformats2 parser is both generic (no knowledge of specific vocabularies), and lacks any/all such vocabulary-specific rules as compared to a microformats 1.0 parser, with the exception of 1) a limited list of well-established/interoperable backward compat root class names (of current microformats that are or can be soon shown to be specifications/standards per the process), 2) flat sets of backward compat property names (some with prefix/name specific conversion) for each of those backward compat root class names. This is a deliberate design decision that makes microformats 2 more "micro", and yes this means that even with such backward compat support, this simple form of backward compat may mean that some existing microformats 1 content breaks. We'll assess those and iterate on a documented case-by-case basis rather than attempt to maintain theoretical 100% backward compatibility (since many current microformats format-specific-features are either unused, or may have caused more problems than solutions). Tantek 12:43, 4 October 2011 (UTC)

N.B. I'll rewrite some of these as microformats2-parsing-faq to help better clarify. The reasoning that led to most of these design decisions is documented in the microformats 2: About This Brainstorm section and following sections. I'll recheck those sections to see if/where reasoning for some of the above noted design decisions may have been missed, and back-fill accordingly. This is necessary because microformats2 is an evolutionary result of simultaneously addressing both numerous generic issues as well as various common format-specific problems in microformats1 syntax and vocabularies. The very number of changes may make it more challenging (from a microformats1 perspective) to see why any particular design change has been made. Tantek 12:43, 4 October 2011 (UTC)

This issue can be moved from resolved to closed once the above-mentioned write-ups have occurred.


Parsing properties from rel attributes

tl;dr resolution: As of 2013, microformats2-parsing parses the rel values of all link and a elements with an href at document scope, and produces canonical JSON accordingly. - Tantek

Issue raised by BenWard 07:24, 5 October 2011 (UTC):

  • Currently, hAtom parses `bookmark` as a permalink
  • Various microformats parse `rel=tag` as tags
  • The current proposal for parsing does not allow parsing properties from rel attributes.

Microformats parsers could instead extract all link relationships from rel attributes within a microformat object, parsing them as if they were u- prefixed properties.

  • Minor nit: Rather than same as a u- prefixed property, I think such "rel" properties should be parsed purely from the href attribute on <a> and <area> elements and nothing more. I would strongly disagree to extending rel to apply to other elements with URLs like img src, object data, or to apply to elements in general like div. That's the path that RDFa has taken and caused much confusion as a result. Tantek 07:39, 5 October 2011 (UTC)
    • Agree: That seems like a perfectly reasonable restriction. --BenWard 08:29, 5 October 2011 (UTC)

This results in:

  • Continuing use of the rel attribute in HTML, thereby building on HTML semantics rather than bypassing them or ignoring them in favour of something less meaningful.
  • Parsed hAtom objects contain a property named bookmark, in place of permalink.
  • All microformats that use rel-tag would contain a property named… tag. Perfect.

Since rel attributes are not overloaded for other functionality like class is, and other uses of rel within content are low (and non-semantic uses are nil, to the best of my knowledge) the risk of property pollution would be extremely low.

Note, with regard to this last point, that a generic microformats parser will parse false-positive properties, and will parse objects in combined chunks, rather than individually by format. Extracted objects will often not represent a vocabulary without further processing.

  • This sounds like it might be workable. Let's try it and see how well authors "get it". - Tantek
  • Possible issue: do we have any collisions between class property names and rel names? (I don't think so offhand, but useful to ask the question). - Tantek
    • None that I can think of in microformats. There is the case of Google's rel=author and p-author in hAtom. However, the next point, about mfo scoping, would cover it in most situations (rel-author on a hyperlink within an hcard wouldn't be applied to the hentry.) The one situation in a parse tree where it's ambiguous would be this:
<a class="p-author h-card" 
   rel="author" 
   href="http://benward.me">
   Ben Ward
</a>
    • I can think of two quite reasonable solutions:
      • 1. Declare that class properties take precedence over rel properties of the same name, discarding rel values if a class is also found, or
      • 2. Since all properties are now multi-value anyway, the hAtom object could be parsed as:
 {
   'type': ['h-entry'],
   'properties': {
     
     'author': [
        {
          'value': ['Ben Ward'], /* from the p-author     */
          'type': ['h-card'],    /* from the h-card ...   */
          'properties': {
            'name': ['Ben Ward'],
            'url': ['http://benward.me']
          }
        },
        'http://benward.me'      /* from the rel="author" */
     ],
     
   }
 }
    • BenWard 08:29, 5 October 2011 (UTC)
      • Option 2 makes sense and is consistent with the rest of the multi-value parsing/handling. - Tantek 14:56, 5 October 2011 (UTC)
      • What about without the 'p-author'?
<a class="h-card" 
   rel="author" 
   href="http://benward.me">
   Ben Ward
</a>

Should that be parsed as:

 {
   'type': ['h-entry'],
   'properties': {
     
     'author': [
        {
          'type': ['h-card'],    /* from the h-card ...   */
          'properties': {
            'name': ['Ben Ward'],
            'url': ['http://benward.me']
          }
        },
        'http://benward.me'      /* from the rel="author" */
     ],
     
   }
 }

Or

 {
   'type': ['h-entry'],
   'properties': {
     
     'author': [
        {
          'value': 'http://benward.me', /* from the rel="author" */
          'type': ['h-card'],           /* from the h-card ...   */
          'properties': {
            'name': ['Ben Ward'],
            'url': ['http://benward.me']
          }
        },
        
     ],
     
   }
 }
      • And if the former, then we're presumably saying that the value parsed due to the presence of a rel is always its own value, and does not combine with any other structures. I am fine with this, but I wanted to make sure we are ok with that explicitly. Tantek 14:56, 5 October 2011 (UTC)
        • +1 I think that since the rel attribute is specifically concerned with the relation to an href attribute, it should not be combined with other structures that are rightly declared using classes.
          • The more I've thought about this and how consuming applications may want to treat rel semantics, the more it seems correct to keep rel semantics distinct from class semantics. Class semantics are quite general/flexible, whereas rel is quite specific, naming something else in terms of a relationship from the current page/microformat's perspective. I think we should consider putting rel values in their own 'rel' collection, separate from the 'properties' collection. E.g. the original rel-author p-author h-card markup example would be parsed into this:
 {
   'type': ['h-entry'],
   'properties': {
     
     'author': [
        {
          'value': ['Ben Ward'], /* from the p-author     */
          'type': ['h-card'],    /* from the h-card ...   */
          'properties': {
            'name': ['Ben Ward'],
            'url': ['http://benward.me']
          }
        }
     ],
     
   },
   'rel': {
     'author': ['http://benward.me'] /* from the rel="author" */
   }
 }
          • and if a post had multiple authors:
 {
   'type': ['h-entry'],
   'properties': {
     
     'author': [
        {
          'value': ['Ben Ward'], /* from p-author     */
          'type': ['h-card'],    /* from h-card ...   */
          'properties': {
            'name': ['Ben Ward'],
            'url': ['http://benward.me']
          }
        },
        {
          'value': ['Tantek Çelik'], /* from 2nd p-author     */
          'type': ['h-card'],        /* from 2nd h-card ...   */
          'properties': {
            'name': ['Tantek Çelik'],
            'url': ['http://tantek.com']
          }
        }
     ],
     
   },
   'rel': {
     'author': [
       'http://benward.me',      /* from rel="author" */
       'http://tantek.com'       /* from 2nd rel="author" */
     ]
   }
 }
          • This preserves the semantic distinction between rel and properties in general, and leaves it up to a higher-level application to implement any logic around showing "more info" about a rel-author, e.g. by correlating the rel-author URL with the 'url' of an hCard it found in the same entry. However, note that even in the earlier JSON data model, the rel-author value just shows up as another property value, and any higher level application would still have to do some correlation logic. At least with this JSON data model, applications that may be looking for a rel value in particular, or a property value in particular can do so without having one unintentionally pollute the other. Tantek 17:33, 6 October 2011 (UTC)
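
The correlation step described above could be sketched in Python as follows. This assumes the proposed data model with a separate 'rel' collection next to 'properties'; the function name `authors_with_rel` is hypothetical, and URLs are compared as exact strings with no normalisation.

```python
def authors_with_rel(entry):
    """Return author h-cards whose 'url' also appears in the entry's rel-author list."""
    rel_urls = set(entry.get("rel", {}).get("author", []))
    matched = []
    for author in entry.get("properties", {}).get("author", []):
        if not isinstance(author, dict):
            continue  # plain string values carry no nested h-card
        urls = author.get("properties", {}).get("url", [])
        if rel_urls.intersection(urls):
            matched.append(author)
    return matched


entry = {
    "type": ["h-entry"],
    "properties": {
        "author": [
            {
                "value": ["Ben Ward"],
                "type": ["h-card"],
                "properties": {"name": ["Ben Ward"], "url": ["http://benward.me"]},
            },
            {
                "value": ["Tantek Çelik"],
                "type": ["h-card"],
                "properties": {"name": ["Tantek Çelik"], "url": ["http://tantek.com"]},
            },
        ]
    },
    "rel": {"author": ["http://benward.me"]},
}
matched_names = [a["properties"]["name"][0] for a in authors_with_rel(entry)]
```

Here only the first h-card is matched, because only its URL appears in the rel-author list. Applications that do not care about rel can ignore the 'rel' key entirely.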


  • Presumably we'd apply all the same property scoping rules to rel scoping as well. E.g. a rel hyperlink inside a microformat won't be seen by any containing microformat. - Tantek
    • Correct, it should be parsed in the same scope as all other class properties in the object.
      • Update: all rel microformats are now parsed at page-scope. Per-microformat scoping of rel has been found to be too confusing in practice (and against the general semantic of rel expressed in the HTML/HTML5 specs). Tantek 01:00, 10 July 2014 (UTC)
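
A page-scope collection of rel values, restricted to hyperlink elements as discussed earlier in this issue, can be sketched in Python with the standard library. The class name `RelCollector`, the output shape (a dict from rel name to list of URLs) and the example.com URL are assumptions for illustration; relative URLs are not resolved here.

```python
from html.parser import HTMLParser


class RelCollector(HTMLParser):
    """Collects rel values from a, area and link elements across the whole document."""

    def __init__(self):
        super().__init__()
        self.rels = {}

    def handle_starttag(self, tag, attrs):
        if tag not in ("a", "area", "link"):
            return  # rel is deliberately not extended to img, object, div, ...
        attributes = dict(attrs)
        href = attributes.get("href")
        rel = attributes.get("rel")
        if not href or not rel:
            return
        for name in rel.split():
            urls = self.rels.setdefault(name, [])
            if href not in urls:
                urls.append(href)


def collect_rels(html):
    collector = RelCollector()
    collector.feed(html)
    collector.close()
    return collector.rels


page = """
<div class="h-entry">
  <a class="p-author h-card" rel="author" href="http://benward.me">Ben Ward</a>
  <a rel="tag" href="http://example.com/tags/microformats">microformats</a>
</div>
"""
rels = collect_rels(page)
```

The collector never looks at class names or nesting, so the same rel values come out regardless of which microformat the hyperlinks sit in.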


This issue can be moved from resolved to closed once we've verified that all the above-mentioned and implied needs to write things up have occurred.

see also