microformats2-parsing: Difference between revisions
GlennJones (talk | contribs) (but u-* which should be HTTP encoded) |
GlennJones (talk | contribs) (Updated my point) |
Revision as of 07:21, 17 July 2013
<entry-title>microformats2 parsing</entry-title>
One of the goals of microformats2 is to greatly simplify parsing of microformats, in particular, by making parsing independent of any one vocabulary. This page briefly documents the microformats2 parsing algorithm for doing so.
implementations
There are open source microformats2 parsers available for Javascript, node.js, PHP, and Ruby.
algorithm
parse a document for microformats
To parse a document for microformats:
- start with an empty JSON "items" array and "rels" hash:
{
"items": [],
"rels": {}
}
- parse the root element for class microformats, adding to the JSON items array accordingly
- parse all hyperlink (<link> <a>) elements for rel microformats, adding to the JSON rels hash accordingly
- return the resulting JSON
Parsers may simultaneously parse the document for both class and rel microformats (e.g. in a single tree traversal).
parse an element for class microformats
To parse an element for class microformats:
- parse element class for root class name(s) "h-x" (and backcompat)
- if not found, parse child elements for microformats (depth first, doc order)
- else if found, start parsing a new microformat
  - parse child elements (document order) by:
    - parse a child element for properties (p-,u-,dt-,e-)
      - add properties found to current microformat
    - parse a child element for microformats (recurse)
      - if that child element itself has a microformat and is a property element, add it into the array of values for that property
      - else add found elements that are microformats to the "children" array
  - imply properties for the found microformat (see below)
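The first step above, spotting root and property class names, could be sketched roughly as follows. This is only an illustration: the function name `classify_classes` and the exact character pattern after each prefix are assumptions rather than part of the algorithm text, and backcompat class names are not handled.

```python
import re

# Assumption: lowercase letters, digits and hyphens may follow the prefix.
ROOT_RE = re.compile(r"^h-[a-z0-9-]+$")
PROP_RE = re.compile(r"^(p|u|dt|e)-([a-z0-9-]+)$")


def classify_classes(class_attr):
    """Split a class attribute into root names and (prefix, name) pairs."""
    roots, props = [], []
    for token in class_attr.split():
        if ROOT_RE.match(token):
            roots.append(token)
            continue
        match = PROP_RE.match(token)
        if match:
            props.append((match.group(1), match.group(2)))
    return roots, props
```

An element whose class attribute yields both a root name and a property name corresponds to the "has a microformat and is a property element" case in the list above.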
parse an element for properties
parsing a p- property
To parse an element for a p-x property value:
- parse the element for the value-class-pattern, if a value is found then return it.
- if abbr.p-x[title], then return the title attribute
- else if data.p-x[value], then return the value attribute
- else if img.p-x[alt] or area.p-x[alt], then return the alt attribute
- else return the innertext of the element, replacing any nested <img> elements with their alt attribute if present, or otherwise their src attribute if present.
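As a rough sketch of the p- rules above, assuming the markup is well-formed enough for Python's `xml.etree.ElementTree` (a real parser would use an HTML parser) and skipping the value-class-pattern step entirely. The names `parse_p` and `text_with_img_alt` are hypothetical.

```python
import xml.etree.ElementTree as ET


def text_with_img_alt(el):
    """Inner text of el, with nested <img> replaced by alt, else src."""
    parts = [el.text or ""]
    for child in el:
        if child.tag == "img":
            alt = child.get("alt")
            parts.append(alt if alt is not None else child.get("src", ""))
        else:
            parts.append(text_with_img_alt(child))
        parts.append(child.tail or "")
    return "".join(parts)


def parse_p(el):
    """Return the p-* value of el (value-class-pattern not handled)."""
    if el.tag == "abbr" and el.get("title") is not None:
        return el.get("title")
    if el.tag == "data" and el.get("value") is not None:
        return el.get("value")
    if el.tag in ("img", "area") and el.get("alt") is not None:
        return el.get("alt")
    return text_with_img_alt(el)
```

For example, `parse_p(ET.fromstring('<abbr class="p-name" title="Full Name">FN</abbr>'))` takes the abbr branch and returns the title attribute.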
parsing a u- property
To parse an element for a u-x property value:
- parse the element for the value-class-pattern, if a value is found then return it.
- if a.u-x[href] or area.u-x[href], then get the href attribute
- else if img.u-x[src], then get the src attribute
- else if object.u-x[data], then get the data attribute
- if there is a gotten value, return the normalized absolute URL of it, following the containing document's language's rules for resolving relative URLs (e.g. in HTML, use the current URL context as determined by the page, and first <base> element if any).
- else if abbr.u-x[title], then return the title attribute
- else if data.u-x[value], then return the value attribute
- else return the innertext of the element.
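The u- rules might be sketched like this, again skipping the value-class-pattern step. The name `parse_u` is hypothetical, and `base_url` is assumed to already reflect the page URL and any <base> element; `urljoin` from the standard library stands in for the document language's URL resolution rules.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# Which attribute carries the URL for each element type listed in the rules.
URL_ATTRS = {"a": "href", "area": "href", "img": "src", "object": "data"}


def parse_u(el, base_url):
    """Return the u-* value of el (value-class-pattern not handled)."""
    attr = URL_ATTRS.get(el.tag)
    if attr is not None and el.get(attr) is not None:
        # Only "gotten" values are resolved against the base URL.
        return urljoin(base_url, el.get(attr))
    if el.tag == "abbr" and el.get("title") is not None:
        return el.get("title")
    if el.tag == "data" and el.get("value") is not None:
        return el.get("value")
    return "".join(el.itertext())
```

Note that, following the list above, only values taken from href, src or data are made absolute; title, value and innertext results are returned as found.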
parsing a dt- property
To parse an element for a dt-x property value:
- parse the element for the value-class-pattern including the date and time parsing rules, if a value is found then return it.
- if time.dt-x[datetime] or ins.dt-x[datetime] or del.dt-x[datetime], then return the datetime attribute
- else if abbr.dt-x[title], then return the title attribute
- else if data.dt-x[value], then return the value attribute
- else return the innertext of the element.
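A comparable sketch for dt- properties, with the value-class-pattern and its date and time parsing rules left out. `parse_dt` is a hypothetical name, and the value is returned as written, without any normalization.

```python
import xml.etree.ElementTree as ET


def parse_dt(el):
    """Return the dt-* value of el (value-class-pattern not handled)."""
    if el.tag in ("time", "ins", "del") and el.get("datetime") is not None:
        return el.get("datetime")
    if el.tag == "abbr" and el.get("title") is not None:
        return el.get("title")
    if el.tag == "data" and el.get("value") is not None:
        return el.get("value")
    return "".join(el.itertext())
```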
parsing an e- property
To parse an element for an e-x property value:
- return the innerHTML of the element by using the HTML spec: Serializing HTML Fragments algorithm.
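The following only approximates that serialization, using `xml.etree.ElementTree` with its `html` output method on well-formed input. It is not an implementation of the HTML spec algorithm, and `inner_html` is a hypothetical name.

```python
import xml.etree.ElementTree as ET
from html import escape


def inner_html(el):
    """Approximate innerHTML: escaped leading text plus serialized children."""
    parts = [escape(el.text or "", quote=False)]
    for child in el:
        # tostring() also emits the child's tail text, already escaped.
        parts.append(ET.tostring(child, encoding="unicode", method="html"))
    return "".join(parts)
```

The point to notice is that text content comes back entity-encoded, unlike the plain innertext returned for p-, u- and dt- properties.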
parsing for implied properties
To imply properties: (where h-x is the root microformat element being parsed)
- if no explicit "name" property,
- then imply by:
- if img.h-x then use its alt attribute for name
- else if abbr.h-x[title] then use its title attribute for name
- else if .h-x>img:only-child then use that img alt for name
- else if .h-x>abbr:only-child[title] then use that abbr title for name
- else if .h-x>:only-child>img:only-child use that img alt for name
- else if .h-x>:only-child>abbr:only-child[title] use that abbr title for name
- else use the innertext of the .h-x for name
- drop leading & trailing white-space from name, including nbsp
- if no explicit "photo" property,
- then imply by:
- if img.h-x[src] then use src for photo
- else if object.h-x[data] then use data for photo
- else if .h-x>img[src]:only-of-type then use that img src for photo
- else if .h-x>object[data]:only-of-type then use that object data for photo
- else if .h-x>:only-child>img[src]:only-of-type then use that img src for photo
- else if .h-x>:only-child>object[data]:only-of-type then use that object data for photo
- if no explicit "url" property,
- then imply by:
- if a.h-x[href] then use href for url
- else if .h-x>a[href]:only-of-type then use that a[href] for url
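The implied "name" and "url" rules could be sketched as below, on the assumption of well-formed markup parsed with `xml.etree.ElementTree`. The function names are hypothetical. Two interpretations are assumptions: an img without an alt attribute falls through to the next rule, and the implied url is returned as written rather than resolved.

```python
import xml.etree.ElementTree as ET


def only_child(el):
    """Return the sole element child of el, or None (text nodes ignored)."""
    kids = list(el) if el is not None else []
    return kids[0] if len(kids) == 1 else None


def name_from(el):
    """alt of an img or title of an abbr, else None."""
    if el is None:
        return None
    if el.tag == "img":
        return el.get("alt")
    if el.tag == "abbr":
        return el.get("title")
    return None


def imply_name(root):
    child = only_child(root)
    grandchild = only_child(child)
    for candidate in (root, child, grandchild):
        name = name_from(candidate)
        if name is not None:
            break
    else:
        name = "".join(root.itertext())
    # Drop leading and trailing white-space, including nbsp.
    return name.strip(" \t\n\r\f\u00a0")


def imply_url(root):
    if root.tag == "a" and root.get("href") is not None:
        return root.get("href")
    links = [c for c in root if c.tag == "a"]  # :only-of-type counts every <a>
    if len(links) == 1 and links[0].get("href") is not None:
        return links[0].get("href")
    return None
```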
parse a hyperlink element for rel microformats
To parse a hyperlink element for rel microformats: (where * is the hyperlink element)
- if the "rel" attribute of the element is empty then exit
- set url to the value of the "href" attribute of the element, normalized to be an absolute URL following the containing document's language's rules for resolving relative URLs (e.g. in HTML, use the current URL context as determined by the page, and first <base> element if any).
- treat the "rel" attribute of the element as a space-separated set of rel values
- if the set of rel values does NOT have "alternate" then
  - for each rel value (rel-value)
    - if there is no key rel-value in the rels hash then create it with an empty array as its value
    - add url to the array of the key rel-value in the rels hash
  - end for
- else
  - if there is no top level "alternates" key in the JSON, then create it with an empty array as its value
  - add a new hash to the array with keys:
    - "url": url
    - "rel": the set of rel values appended with spaces, except "alternate"
    - "media": the value of the "media" attribute
    - "hreflang": the value of the "hreflang" attribute
- end if
rel parse examples
Here are some examples to show how parsed rels may be reflected into the JSON (empty items key).
E.g. parsing this markup:
<a rel="author" href="http://example.com/a">author a</a>
<a rel="author" href="http://example.com/b">author b</a>
<a rel="in-reply-to" href="http://example.com/1">post 1</a>
<a rel="in-reply-to" href="http://example.com/2">post 2</a>
<a rel="alternate home"
href="http://example.com/fr"
media="handheld"
hreflang="fr">French mobile homepage</a>
Would generate this JSON:
{
"items": [],
"rels": {
"author": [ "http://example.com/a", "http://example.com/b" ],
"in-reply-to": [ "http://example.com/1", "http://example.com/2" ]
},
"alternates": [{
"url": "http://example.com/fr",
"rel": "home",
"media": "handheld",
"hreflang": "fr"
}]
}
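The rel algorithm and the example above can be sketched as follows. Each hyperlink element is represented here by a plain dict of its attributes, which is an assumption made to keep the sketch free of HTML parsing; `parse_rels` is a hypothetical name. Omitting "media" and "hreflang" when the attribute is absent is also an assumption, since the algorithm does not say what to do in that case.

```python
from urllib.parse import urljoin


def parse_rels(links, base_url):
    """Build the rels hash and alternates array from link attribute dicts."""
    result = {"items": [], "rels": {}}
    for attrs in links:
        rel_values = attrs.get("rel", "").split()
        if not rel_values:
            continue  # empty or missing rel: nothing to do
        url = urljoin(base_url, attrs.get("href", ""))
        if "alternate" not in rel_values:
            for rel_value in rel_values:
                result["rels"].setdefault(rel_value, []).append(url)
        else:
            alternate = {
                "url": url,
                "rel": " ".join(v for v in rel_values if v != "alternate"),
            }
            for key in ("media", "hreflang"):
                if key in attrs:
                    alternate[key] = attrs[key]
            result.setdefault("alternates", []).append(alternate)
    return result
```

Feeding it the five links from the markup above, as dicts of their rel, href, media and hreflang attributes, yields the same structure as the JSON shown.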
Another parse output example can be found here:
what do the CSS selector expressions mean
Use SelectORacle to expand any of the above CSS selector expressions into longform English prose.
questions
See the FAQ:
issues
Proposal: addition of a new e-* parsing rule for iframe elements with srcdoc attributes, e.g.
<div class="h-entry">
<iframe class="e-content" srcdoc="<p>A paragraph of HTML with &quot;quoted quotes&quot; &amp;amp; doubly quoted ampersands</p>" />
</div>
{
"items": [{
"type": ["h-entry"],
"properties": {
"content": ["<p>A paragraph of HTML with &quot;quoted quotes&quot; &amp; doubly quoted ampersands</p>"]
}
}]
}
This would allow, for example, HTML comments to be sandboxed inside iframes but still parsable as microformats.
I believe the correct processing would be to leave &quot; entities as they are but to unescape any doubly-escaped ampersands.
Should rel-alternate parsing also pick up the type attribute? It’s fairly widely used, e.g. for Atom feeds.
The fact that the parsed value of any element with .e-* is at a different level of escaping to the parsed values of p-*, dt-* etc. without any indication of how the property was parsed in the output is a security problem. For example:
input | output |
---|---|
<p class="h-card"><span class="p-name">&lt;tag&gt;</span></p> | {"items": [{"type": ["h-card"], "properties": {"name": ["<tag>"]}}]} |
<p class="h-card"><span class="e-name">&lt;tag&gt;</span></p> | {"items": [{"type": ["h-card"], "properties": {"name": ["&lt;tag&gt;"]}}]} |
- As a parser developer, the most straightforward way I can think of solving this is to add an option (enabled by default) which encodes HTML special characters on all non e-* properties, so the developer knows that all property values are going to be at the same level of escaping. --bw 20:00, 15 June 2013 (UTC)
- Your suggestion of auto-HTML-encoding p-*/u-*/dt-* property values is the most sensible I think. I would NOT make it an option, as it makes sense to write consistent microformats2 consumers. - Tantek 07:18, 5 July 2013 (UTC)
- Can you think of any existing apps/consumers of microformats2 via the parser that would break? What would indieweb comments parsers do? - Tantek 07:18, 5 July 2013 (UTC)
- The only breakage which might occur would be over-encoding of non e-* properties, but I’ll release this update as v0.2.0 and warn people about the changes. The worst thing which could happen is that some comments look a bit weird, as opposed to the current worst possible scenario of easy XSS attacks --bw 12:55, 5 July 2013 (UTC)
- We should also decide exactly which characters get encoded — just angle brackets, or quotes/ampersands as well? --bw 12:55, 5 July 2013 (UTC)
- I am not sure about this; it seems more like a helper function than a core feature of the parser. Personally I would like to store data as text and encode only when I am going to use it and I know the format it is going to be used in. --Glenn Jones 9:54, 14 July 2013 (UTC)
- After the discussion on the indiewebcamp IRC with Barnaby Walters I now understand the XSS issue that this change is trying to address. A rogue author could include HTML with scripts to execute an XSS attack. These could be masked by switching prefixes, i.e. p-* to e-*, on a well-used property. As the consumer does not see the prefix in the JSON output, they have no idea whether a property will contain HTML or text. I will update my two parsers and the test suite --Glenn Jones 8:02, 17 July 2013 (UTC)
- The parsing rule 'else if br.p-x or hr.p-x, then return "" (empty string)' for p-* can cause any code consuming the API to become quite bloated. It means that you have to test every array value to see if it is an empty string. It is also unclear to me what the purpose of this mark-up pattern is. --Glenn Jones
- Upon reconsidering this, I agree with you, this is an unlikely use case. If a publisher wants to explicitly set an empty property "p-foo" they can simply write
<span class="p-foo"></span>
which looks explicit. Whereas BR and HR tags are often just presentational, so we should both not encourage usage of them for semantics, and anyone that did use them would be subject to likely loss of semantics upon a redesign (that got rid of those particular BR and HR tags). I'm going to remove them from the parsing spec. - Tantek 15:29, 10 February 2013 (UTC)
- The examples in the wiki microformats-2 pages such as h-entry and h-entry had datetime without the 'T' delimiter between date and time, i.e.
<time class="dt-published" datetime="2013-06-13 12:00:00">13<sup>th</sup> June 2013</time>
I have updated the pages. As far as I know this is a new pattern for dates. Was it a mistake in the examples, or is it a new datetime pattern?
- The HTML5 "time" element, and "datetime" attribute allow for space " " as a separator between date and time as well as "T", thus we allow it for microformats as well. The " " separator is preferred as the date and time are more readable when separated by a space. The examples noted in those specs deliberately use this. - Tantek 18:48, 15 July 2013 (UTC)
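Returning to the escaping issue raised earlier in this list, the proposal to HTML-encode every non e-* value could look roughly like this. It is a sketch only: `encode_property` is a hypothetical name, and which characters to encode was still an open question in that thread, so encoding ampersands, angle brackets and quotes via `html.escape` is an assumption.

```python
from html import escape


def encode_property(prefix, value):
    """Bring p-/u-/dt- values to the same escaping level as e- values."""
    if prefix == "e":
        return value  # already serialized HTML
    return escape(value, quote=True)
```

With this in place, the p-name and e-name inputs from the table above both produce "&lt;tag&gt;", so a consumer no longer needs to know which prefix was used.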
see also
- microformats2
- microformats2-parsing-faq
- microformats2-parsing-brainstorming - for background, thinking, exploring possibilities
- microformats2-parsing-rdf
- microformats2-implied-properties