microformats2-parsing: Difference between revisions

Revision as of 02:15, 11 May 2013

<entry-title>microformats2 parsing</entry-title>

One of the goals of microformats2 is to greatly simplify parsing of microformats, in particular, by making parsing independent of any one vocabulary. This page briefly documents the microformats2 parsing algorithm for doing so.

implementations

Main article: microformats2#Implementations

There are open source microformats2 parsers available for JavaScript, node.js, PHP, and Ruby.

algorithm

parse a document for microformats

To parse a document for microformats:

start with an empty JSON "items" array and "rels" hash:

{
  "items": [],
  "rels": {}
}

parse the root element for class microformats, adding to the JSON items array accordingly
parse all hyperlink (<link> <a>) elements for rel microformats, adding to the JSON rels hash accordingly
return the resulting JSON

Parsers may simultaneously parse the document for both class and rel microformats (e.g. in a single tree traversal).
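As a rough illustration of the steps above, here is a minimal Python sketch of the top-level routine. It assumes the document has already been parsed into an xml.etree.ElementTree tree (so it only suits well-formed markup), and the two callbacks, parse_class_microformats and parse_rel_microformats, are hypothetical stand-ins for the class and rel procedures described in the following sections.

```python
import xml.etree.ElementTree as ET

def parse_document(root, parse_class_microformats, parse_rel_microformats):
    """Return the top-level result for a parsed document tree.

    Both callbacks are hypothetical placeholders for the class and rel
    parsing procedures; they are not defined by this page.
    """
    # Start with an empty "items" array and "rels" hash.
    result = {"items": [], "rels": {}}
    # Parse the root element for class microformats.
    result["items"].extend(parse_class_microformats(root))
    # Visit every hyperlink element (<link>, <a>) in document order.
    for el in root.iter():
        if el.tag in ("link", "a"):
            parse_rel_microformats(el, result)
    return result
```

A single-traversal parser would fold both passes into one walk of the tree; the two passes are kept separate here only to mirror the listed steps.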

parse an element for class microformats

To parse an element for class microformats:

parse element class for root class name(s) "h-x" (and backcompat)
- if not found, parse child elements for microformats (depth first, doc order)
- else if found, start parsing a new microformat
  - parse child elements (document order) by:
    - parse a child element for properties (p-,u-,dt-,e-)
      - add properties found to current microformat
    - parse a child element for microformats (recurse)
      - if that child element itself has a microformat and is a property element, add it into the array of values for that property
      - else add found elements that are microformats to the "children" array
  - imply properties for the found microformat (see below)
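
The first decision in this procedure is which class names on an element count as root names and which as property names. The sketch below is one possible way to classify them in Python. It assumes that a name is simply the prefix followed by at least one more character, it ignores backcompat class names entirely, and the function names are made up for illustration.

```python
PROPERTY_PREFIXES = ("p-", "u-", "dt-", "e-")

def root_class_names(class_attr):
    """Return the "h-x" root class names found in a class attribute value."""
    return [t for t in class_attr.split()
            if t.startswith("h-") and len(t) > 2]

def property_class_names(class_attr):
    """Return (prefix, name) pairs for the property class names found."""
    found = []
    for token in class_attr.split():
        for prefix in PROPERTY_PREFIXES:
            if token.startswith(prefix) and len(token) > len(prefix):
                # e.g. "dt-published" becomes ("dt", "published")
                found.append((prefix[:-1], token[len(prefix):]))
                break
    return found
```

An element whose class attribute yields both a root name and a property name is the "microformat that is also a property" case from the list above.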

parse an element for properties

parsing a p- property

To parse an element for a p-x property value:

parse the element for the value-class-pattern, if a value is found then return it.
if abbr.p-x[title], then return the title attribute
else if data.p-x[value], then return the value attribute
else if img.p-x[alt] or area.p-x[alt], then return the alt attribute
else return the innertext of the element, replacing any nested <img> elements with their alt attribute if present, or otherwise their src attribute if present.
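
The p- rules above could be sketched in Python roughly as follows. This is an illustration under assumptions: elements are xml.etree.ElementTree nodes from well-formed markup, the value-class-pattern step is left out, and the helper names are invented for the example.

```python
import xml.etree.ElementTree as ET

def text_with_img_fallback(el):
    """Inner text of el, with each nested <img> replaced by its alt
    attribute if present, or otherwise its src attribute if present."""
    parts = [el.text or ""]
    for child in el:
        if child.tag == "img":
            alt = child.get("alt")
            parts.append(alt if alt is not None else child.get("src", ""))
        else:
            parts.append(text_with_img_fallback(child))
        parts.append(child.tail or "")
    return "".join(parts)

def parse_p_property(el):
    # The value-class-pattern step is omitted from this sketch.
    if el.tag == "abbr" and el.get("title") is not None:
        return el.get("title")
    if el.tag == "data" and el.get("value") is not None:
        return el.get("value")
    if el.tag in ("img", "area") and el.get("alt") is not None:
        return el.get("alt")
    return text_with_img_fallback(el)
```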

parsing a u- property

To parse an element for a u-x property value:

parse the element for the value-class-pattern, if a value is found then return it.
if a.u-x[href] or area.u-x[href], then get the href attribute
else if img.u-x[src], then get the src attribute
else if object.u-x[data], then get the data attribute
if there is a gotten value, return the normalized absolute URL of it, following the containing document's language's rules for resolving relative URLs (e.g. in HTML, use the current URL context as determined by the page, and first <base> element if any).
else if abbr.u-x[title], then return the title attribute
else if data.u-x[value], then return the value attribute
else return the innertext of the element.
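
A possible Python rendering of the u- rules is shown below. It assumes xml.etree.ElementTree nodes, skips the value-class-pattern step, and expects the caller to pass a base_url that already accounts for any <base> element; resolution of relative URLs is delegated to the standard library's urljoin.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# Which attribute carries the URL for each element type.
URL_ATTRIBUTES = {"a": "href", "area": "href", "img": "src", "object": "data"}

def parse_u_property(el, base_url):
    # The value-class-pattern step is omitted from this sketch.
    attr = URL_ATTRIBUTES.get(el.tag)
    if attr is not None and el.get(attr) is not None:
        # A gotten value is returned as an absolute URL.
        return urljoin(base_url, el.get(attr))
    if el.tag == "abbr" and el.get("title") is not None:
        return el.get("title")
    if el.tag == "data" and el.get("value") is not None:
        return el.get("value")
    return "".join(el.itertext())
```

Note that only values taken from href, src, or data are resolved against the base URL; the title, value, and innertext fallbacks are returned unchanged, as in the list above.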

parsing a dt- property

To parse an element for a dt-x property value:

parse the element for the value-class-pattern including the date and time parsing rules, if a value is found then return it.
if time.dt-x[datetime] or ins.dt-x[datetime] or del.dt-x[datetime], then return the datetime attribute
else if abbr.dt-x[title], then return the title attribute
else if data.dt-x[value], then return the value attribute
else return the innertext of the element.
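
The dt- rules follow the same shape. The short Python sketch below assumes xml.etree.ElementTree nodes and leaves out the value-class-pattern step together with its date and time parsing rules, so it returns the raw attribute or text value without normalizing it.

```python
import xml.etree.ElementTree as ET

def parse_dt_property(el):
    # The value-class-pattern and date/time parsing rules are omitted here.
    if el.tag in ("time", "ins", "del") and el.get("datetime") is not None:
        return el.get("datetime")
    if el.tag == "abbr" and el.get("title") is not None:
        return el.get("title")
    if el.tag == "data" and el.get("value") is not None:
        return el.get("value")
    return "".join(el.itertext())
```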

parsing an e- property

To parse an element for an e-x property value:

return the innerHTML of the element by using the HTML spec: Serializing HTML Fragments algorithm.

parsing for implied properties

To imply properties: (where h-x is the root microformat element being parsed)

if no explicit "name" property,
then imply by:
- if img.h-x then use its alt attribute for name
- else if abbr.h-x[title] then use its title attribute for name
- else if .h-x>img:only-child then use that img alt for name
- else if .h-x>abbr:only-child[title] then use that abbr title for name
- else if .h-x>:only-child>img:only-child use that img alt for name
- else if .h-x>:only-child>abbr:only-child[title] use that abbr title for name
- else use the innertext of the .h-x for name
- drop leading & trailing white-space from name, including nbsp
if no explicit "photo" property,
then imply by:
- if img.h-x[src] then use src for photo
- else if object.h-x[data] then use data for photo
- else if .h-x>img[src]:only-of-type then use that img src for photo
- else if .h-x>object[data]:only-of-type then use that object data for photo
- else if .h-x>:only-child>img[src]:only-of-type then use that img src for photo
- else if .h-x>:only-child>object[data]:only-of-type then use that object data for photo
if no explicit "url" property,
then imply by:
- if a.h-x[href] then use href for url
- else if .h-x>a[href]:only-of-type then use that a[href] for url
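
To make the selector-based rules more concrete, here is a hedged Python sketch of the "name" and "url" cases (the "photo" case follows the same pattern as "url"). It assumes xml.etree.ElementTree nodes, that the caller has already checked that no explicit property exists, and that a missing alt attribute on a matched img yields an empty string; the function names are illustrative only.

```python
import xml.etree.ElementTree as ET

WHITESPACE = " \t\n\r\f\v\u00a0"  # ASCII white-space plus nbsp

def only_child(el):
    """Return el's single element child, or None (text nodes are ignored)."""
    children = list(el)
    return children[0] if len(children) == 1 else None

def imply_name(root):
    # Candidates in rule order: the root, its only child, that child's only child.
    candidates = [root]
    child = only_child(root)
    if child is not None:
        candidates.append(child)
        grandchild = only_child(child)
        if grandchild is not None:
            candidates.append(grandchild)
    name = None
    for el in candidates:
        if el.tag == "img":
            name = el.get("alt", "")  # assumption: a missing alt yields ""
            break
        if el.tag == "abbr" and el.get("title") is not None:
            name = el.get("title")
            break
    if name is None:
        name = "".join(root.itertext())
    # Drop leading and trailing white-space, including nbsp.
    return name.strip(WHITESPACE)

def imply_url(root):
    if root.tag == "a" and root.get("href") is not None:
        return root.get("href")
    # .h-x>a[href]:only-of-type means exactly one <a> child, and it has an href.
    anchors = [c for c in root if c.tag == "a"]
    if len(anchors) == 1 and anchors[0].get("href") is not None:
        return anchors[0].get("href")
    return None
```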

parse a hyperlink element for rel microformats

To parse a hyperlink element for rel microformats: (where * is the hyperlink element)

if the "rel" attribute of the element is empty then exit
set url to the value of the "href" attribute of the element, normalized to be an absolute URL following the containing document's language's rules for resolving relative URLs (e.g. in HTML, use the current URL context as determined by the page, and first <base> element if any).
treat the "rel" attribute of the element as a space-separated set of rel values
if the set of rel values does NOT have "alternate" then
- for each rel value (rel-value)
  - if there is no key rel-value in the rels hash then create it with an empty array as its value
  - add url to the array of the key rel-value in the rels hash
- end for
else
- if there is no top level "alternates" key in the JSON, then create it with an empty array as its value
- add a new hash to the array with keys:
  - "url": url
  - "rel": the rel values other than "alternate", joined with single spaces
  - "media": the value of the "media" attribute
  - "hreflang": the value of the "hreflang" attribute
end if
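
The rel procedure above could look roughly like this in Python. The sketch takes the element's attributes as a plain dict rather than a DOM node, expects a base_url supplied by the caller, and makes two assumptions that the steps do not spell out: duplicate rel values are collapsed while keeping their order, and the "media" and "hreflang" keys are left out when the attribute is absent.

```python
from urllib.parse import urljoin

def parse_rel_element(attrs, base_url, result):
    """Update result (with "rels" and possibly "alternates") for one hyperlink.

    attrs is a dict of the element's attributes; the name and signature
    are illustrative, not part of any specification.
    """
    rel = attrs.get("rel", "")
    if not rel.strip():
        return  # empty or missing rel attribute: nothing to do
    url = urljoin(base_url, attrs.get("href", ""))
    # Space-separated set of rel values, first occurrence order preserved.
    rel_values = list(dict.fromkeys(rel.split()))
    if "alternate" not in rel_values:
        for value in rel_values:
            result["rels"].setdefault(value, []).append(url)
    else:
        alternate = {
            "url": url,
            "rel": " ".join(v for v in rel_values if v != "alternate"),
        }
        # Assumption: keys for absent attributes are omitted.
        for key in ("media", "hreflang"):
            if key in attrs:
                alternate[key] = attrs[key]
        result.setdefault("alternates", []).append(alternate)
```

For example, calling it with {"rel": "me author", "href": "/about"} and a base of http://example.com/ adds the same absolute URL under both the "me" and "author" keys of the rels hash.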

what do the CSS selector expressions mean

Use SelectORacle to expand any of the above CSS selector expressions into longform English prose.

questions

See the FAQ:

microformats2-parsing-faq

issues

The parsing rule 'else if br.p-x or hr.p-x, then return "" (empty string)' for p-* can cause any code consuming the API to become quite bloated. It means that you have to test every array value to see if it's an empty string. It is also unclear to me what the purpose of this markup pattern is. - Glenn Jones
- Upon reconsidering this, I agree with you, this is an unlikely use case. If a publisher wants to explicitly set an empty property "p-foo" they can simply write <span class="p-foo"></span> which looks explicit. Whereas BR and HR tags are often just presentational, so we should both not encourage usage of them for semantics, and anyone that did use them would be subject to likely loss of semantics upon a redesign (that got rid of those particular BR and HR tags). I'm going to remove them from the parsing spec. - Tantek 15:29, 10 February 2013 (UTC)

@@ Line 20: / Line 20: @@
 * parse all hyperlink (<code>&lt;link> &lt;a></code>) elements for rel microformats, adding to the JSON rels hash accordingly
 * return the resulting JSON
+Parsers may simultaneously parse the document for both class and rel microformats (e.g. in a single tree traversal).
 === parse an element for class microformats ===
@@ Line 90: / Line 91: @@
 ** if a.h-x[href] then use href for url
 ** else if .h-x>a[href]:only-of-type then use that a[href] for url
+=== parse a hyperlink element for rel microformats ===
+To parse a hyperlink element for rel microformats: (where * is the hyperlink element)
+* if the "rel" attribute of the element is empty then exit
+* set url to the value of the "href" attribute of the element, normalized to be an absolute URL following the containing document's language's rules for resolving relative URLs (e.g. in HTML, use the current URL context as determined by the page, and first <code>&lt;base&gt;</code> element if any).
+* treat the "rel" attribute of the element as a space-separated set of rel values
+* if the set of rel values does NOT have "alternate" then
+** for each rel value (rel-value)
+*** if there is no key rel-value in the rels hash then create it with an empty array as its value
+*** add url to the array of the key rel-value in the rels hash
+** end for
+* else
+** if there is no top level "alternates" key in the JSON, then create it with an empty array as its value
+** add a new hash to the array with keys:
+*** "url": url
+*** "rel": the rel values other than "alternate", joined with single spaces
+*** "media": the value of the "media" attribute
+*** "hreflang": the value of the "hreflang" attribute
+* end if
 == what do the CSS selector expressions mean ==
