microformats2 parsing

(Difference between revisions)

Jump to: navigation, search
(Encoding properties)
(New datetime pattern issue)
Line 250: Line 250:
* The parsing rule 'else if br.p-x or hr.p-x, then return "" (empty string)' for p-* can cause any code consuming the API to become quite bloated. It means that you have test every array value to see if its an empty string. It is also unclear to me what the purpose of this mark-up pattern is for  [[User:GlennJones|Glenn Jones]]
* The parsing rule 'else if br.p-x or hr.p-x, then return "" (empty string)' for p-* can cause any code consuming the API to become quite bloated. It means that you have test every array value to see if its an empty string. It is also unclear to me what the purpose of this mark-up pattern is for  [[User:GlennJones|Glenn Jones]]
** Upon reconsidering this, I agree with you, this is an unlikely use case. If a publisher wants to explicitly set an empty property "p-foo" they can simply write <code>&lt;span class="p-foo">&lt;/span></code> which looks explicit. Whereas BR and HR tags are often just presentational, so we should both not encourage usage of them for semantics, and anyone that did use them would be subject to likely loss of semantics upon a redesign (that got rid of those particular BR and HR tags). I'm going to remove them from the parsing spec. - [[User:Tantek|Tantek]] 15:29, 10 February 2013 (UTC)
** Upon reconsidering this, I agree with you, this is an unlikely use case. If a publisher wants to explicitly set an empty property "p-foo" they can simply write <code>&lt;span class="p-foo">&lt;/span></code> which looks explicit. Whereas BR and HR tags are often just presentational, so we should both not encourage usage of them for semantics, and anyone that did use them would be subject to likely loss of semantics upon a redesign (that got rid of those particular BR and HR tags). I'm going to remove them from the parsing spec. - [[User:Tantek|Tantek]] 15:29, 10 February 2013 (UTC)
 +
</div>
 +
 +
<div class="discussion">
 +
* The examples in the wiki microformats-2 pages such h-entry and h-entry had datetime without the 'T' delimiter between date and time. ie
 +
<source lang="html4strict">
 +
<time class="dt-published" datetime="2013-06-13 12:00:00">13<sup>th</sup> June 2013</time>
 +
</source>
 +
I have updated the pages. As far as I known this is a new pattern for dates. Was it a mistake in the examples or is it a new datetime pattern.
</div>
</div>

Revision as of 10:25, 14 July 2013


One of the goals of microformats2 is to greatly simplify parsing of microformats, in particular, by making parsing independent of any one vocabulary. This page briefly documents the microformats2 parsing algorithm for doing so.

Contents

implementations

Main article: microformats2#Implementations

There are open source microformats2 parsers available for Javascript, node.js, PHP, and Ruby.

algorithm

parse a document for microformats

To parse a document for microformats:

{
  "items": [],
  "rels": {}
}

Parsers may simultaneously parse the document for both class and rel microformats (e.g. in a single tree traversal).

parse an element for class microformats

To parse an element for class microformats:

parse an element for properties

parsing a p- property

To parse an element for a p-x property value:

parsing a u- property

To parse an element for a u-x property value:

parsing a dt- property

To parse an element for a dt-x property value:

parsing an e- property

To parse an element for a e-x property value:

parsing for implied properties

To imply properties: (where h-x is the root microformat element being parsed)

parse a hyperlink element for rel microformats

To parse a hyperlink element for rel microformats: (where * is the hyperlink element)

rel parse examples

Here are some examples to show how parsed rels may be reflected into the JSON (empty items key).

E.g. parsing this markup:

<a rel="author" href="http://example.com/a">author a</a>
<a rel="author" href="http://example.com/b">author b</a>
<a rel="in-reply-to" href="http://example.com/1">post 1</a>
<a rel="in-reply-to" href="http://example.com/2">post 2</a>
<a rel="alternate home"
   href="http://example.com/fr"
   media="handheld"
   hreflang="fr">French mobile homepage</a>

Would generate this JSON:

{
  "items": [],
  "rels": { 
    "author": [ "http://example.com/a", "http://example.com/b" ],
    "in-reply-to": [ "http://example.com/1", "http://example.com/2" ] 
  },
  "alternates": [{
     "url": "http://example.com/fr", 
     "rel": "home", 
     "media": "handheld", 
     "hreflang": "fr" 
  }]
}

Another parse output example can be found here:

what do the CSS selector expressions mean

Use SelectORacle to expand any of the above CSS selector expressions into longform English prose.

questions

See the FAQ:

issues

Proposal: addition of a new e-* parsing rule for iframe elements with srcdoc attributes. E.G.

<div class="h-entry">
 <iframe class="e-content" srcdoc="<p>A paragraph of HTML with &quot;quoted quotes&quot; &amp;amp; doubly quoted ampersands</p>" />
</div>
{
 "items": [{
  "type": ["h-entry"],
  "properties": {
   "content": ["<p>A paragraph of HTML with &quot;quoted quotes&quot; &amp; doubly quoted ampersands</p>"]
  }
 }]
}

This would allow, for example, HTML comments to be sandboxed inside iframes but still parsable as microformats.

I believe the correct processing would be to leave " entities as they are but to unescape any doubly-escaped ampersands.

Should rel-alternate parsing also pick up the type attribute? It’s fairly widely used, e.g. for ATOM feeds.

The fact that the parsed value of any element with .e-* is at a different level of escaping to the parsed values of p-*, dt-* etc. without any indication of how the property was parsed in the output is a security problem. For example:

input output
<p class="h-card">
 <span class="p-name">&lt;tag&gt;</span>
</p>
{
    "items": [
        {
            "type": [
                "h-card"
            ],
            "properties": {
                "name": [
                    "<tag>"
                ]
            }
        }
    ]
}
<p class="h-card">
 <span class="e-name">&lt;tag&gt;</span>
</p>
{
    "items": [
        {
            "type": [
                "h-card"
            ],
            "properties": {
                "name": [
                    "&lt;tag&gt;"
                ]
            }
        }
    ]
}
  • As a parser developer, the most straightforward way I can think of solving this is to add an option (enabled by default) which encodes HTML special characters on all non e-* properties, so the developer knows that all property values are going to be at the same level of escaping. --bw 20:00, 15 June 2013 (UTC)
    • Your suggestion of auto-HTML-encoding p-*/u-*/dt-* property values is the most sensible I think. I would NOT make it an option, as it makes sense write consistent microformats2 consumers. - Tantek 07:18, 5 July 2013 (UTC)
    • Can you think of any existing apps/consumers of microformats2 via the parser that would break? What would indieweb comments parsers do? - Tantek 07:18, 5 July 2013 (UTC)
      • The only breakage which might occur would be over-encoding of non e-* properties, but I’ll release this update as v0.2.0 and warn people about the changes. The worst thing which could happen is that some comments look a bit weird, as opposed to the current worst possible scenario of easy XSS attacks --bw 12:55, 5 July 2013 (UTC)
      • We should also decide exactly which characters get encoded — just angle brackets, or quotes/ampersands as well? --bw 12:55, 5 July 2013 (UTC)
      • I am not sure about this, it seems more like a helper function rather than a core feature of the parser. Personally I would like to store data as text and encode only when I am going to use and I known the format it is going to be use in. --Glenn Jones 9:54, 14 July 2013 (UTC)
  • The parsing rule 'else if br.p-x or hr.p-x, then return "" (empty string)' for p-* can cause any code consuming the API to become quite bloated. It means that you have test every array value to see if its an empty string. It is also unclear to me what the purpose of this mark-up pattern is for Glenn Jones
    • Upon reconsidering this, I agree with you, this is an unlikely use case. If a publisher wants to explicitly set an empty property "p-foo" they can simply write <span class="p-foo"></span> which looks explicit. Whereas BR and HR tags are often just presentational, so we should both not encourage usage of them for semantics, and anyone that did use them would be subject to likely loss of semantics upon a redesign (that got rid of those particular BR and HR tags). I'm going to remove them from the parsing spec. - Tantek 15:29, 10 February 2013 (UTC)
  • The examples in the wiki microformats-2 pages such h-entry and h-entry had datetime without the 'T' delimiter between date and time. ie
<time class="dt-published" datetime="2013-06-13 12:00:00">13<sup>th</sup> June 2013</time>

I have updated the pages. As far as I known this is a new pattern for dates. Was it a mistake in the examples or is it a new datetime pattern.

see also

microformats2 parsing was last modified: Wednesday, December 31st, 1969

Views