Difference between revisions of "microformats2-parsing"

From Microformats Wiki
(explicit reference to HTML parsing rules at start, add note HTML parsing rules section and template example, and jgarber test case in the wild)
(→‎parsing a p- property: added resolution of relative URLs detail to p- parsing textContent)
Line 47: Line 47:
 
  * else if data.p-x[value] or input.p-x[value], then return the value attribute
  * else if img.p-x[alt] or area.p-x[alt], then return the alt attribute
- * else return the textContent of the element, replacing any nested <code>&lt;img></code> elements with their <code>alt</code> attribute if present, or otherwise their <code>src</code> attribute if present.
+ * else return the textContent of the element, replacing any nested <code>&lt;img></code> elements with their <code>alt</code> attribute if present, or otherwise their <code>src</code> attribute if present, resolving any relative URLs.

  ==== parsing a u- property ====

Revision as of 11:39, 17 July 2014

microformats2 parsing specification

Tantek Çelik (Editor)


microformats2 is a simple, open format for marking up data in HTML. The microformats2 parsing specification describes how to implement a microformats2 parser.

One of the goals of microformats2 is to greatly simplify parsing of microformats, in particular, by making parsing independent of any one vocabulary. This specification documents the microformats2 parsing algorithm for doing so.

Per CC0, to the extent possible under law, the editors have waived all copyright and related or neighboring rights to this work. In addition, as of 2021-04-13, the editors have made this specification available under the Open Web Foundation Agreement Version 1.0.

algorithm

parse a document for microformats

To parse a document for microformats, follow the HTML parsing rules and do the following:

  • start with an empty JSON "items" array and "rels" hash:
{
 "items": [],
 "rels": {}
}
  • parse the root element for class microformats, adding to the JSON items array accordingly
  • parse all hyperlink (<link> <a>) elements for rel microformats, adding to the JSON rels hash accordingly
  • return the resulting JSON

Parsers may simultaneously parse the document for both class and rel microformats (e.g. in a single tree traversal).

parse an element for class microformats

To parse an element for class microformats:

  • parse element class for root class name(s) "h-*" and backcompat root classes
    • if not found, parse child elements for microformats (depth first, doc order)
    • else if found, start parsing a new microformat
      • parse child elements (document order) by:
        • parse a child element for properties (p-*,u-*,dt-*,e-*)
          • add properties found to current microformat
        • parse a child element for microformats (recurse)
          • if that child element itself has a microformat ("h-*" or backcompat roots) and is a property element, add it into the array of values for that property as a { } structure with:
            • value: "string value of the property element",
            • type: [array of microformat type(s) on the child element],
            • properties: { } - to be filled in when that child element itself is parsed for microformats properties as part of the recursion
          • else add found elements that are microformats to the "children" array
      • imply properties for the found microformat (see below)
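
The first step above, deciding whether an element is a root and/or a property, can be sketched as a small class-name classifier. This is an illustrative sketch only: the function names are hypothetical, the name pattern (lowercase words joined by hyphens) is an assumption not stated in this revision, and backcompat root classes such as "vcard" are not handled.

```python
import re

# Assumption: names after the prefix are lowercase a-z words joined by hyphens.
ROOT_RE = re.compile(r"^h-[a-z]+(-[a-z]+)*$")
PROP_RE = re.compile(r"^(p|u|dt|e)-([a-z]+(?:-[a-z]+)*)$")


def classify_classes(class_attr):
    """Split a class attribute into root types and (prefix, name) properties."""
    roots, props = [], []
    for name in class_attr.split():
        if ROOT_RE.match(name):
            roots.append(name)
        else:
            match = PROP_RE.match(name)
            if match:
                props.append((match.group(1), match.group(2)))
    return roots, props


def nested_property_value(value, roots):
    """Build the { } structure for a child that is both a property and a root.

    "properties" starts empty and is filled in during the recursive parse.
    """
    return {"value": value, "type": roots, "properties": {}}
```

For example, `classify_classes("p-author h-card")` yields one root type and one property, which is the case where the nested structure is added to the "author" property rather than to "children".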

parse an element for properties

parsing a p- property

To parse an element for a p-x property value:

  • parse the element for the Value Class Pattern, if a value is found then return it.
  • if abbr.p-x[title], then return the title attribute
  • else if data.p-x[value] or input.p-x[value], then return the value attribute
  • else if img.p-x[alt] or area.p-x[alt], then return the alt attribute
  • else return the textContent of the element, replacing any nested <img> elements with their alt attribute if present, or otherwise their src attribute if present, resolving any relative URLs.
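
The p- rules above can be sketched as follows. This is a minimal sketch under stated assumptions: Python's `xml.etree.ElementTree` stands in for a real HTML parser (so it only works on well-formed snippets), the Value Class Pattern step is omitted, and `parse_p`, `text_content` and the `base_url` parameter are hypothetical names.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin


def text_content(el, base_url):
    """textContent with nested <img> replaced by alt, else by resolved src."""
    parts = [el.text or ""]
    for child in el:
        if child.tag == "img":
            if "alt" in child.attrib:
                parts.append(child.attrib["alt"])
            elif "src" in child.attrib:
                # Relative URLs are resolved against the assumed base URL.
                parts.append(urljoin(base_url, child.attrib["src"]))
        else:
            parts.append(text_content(child, base_url))
        parts.append(child.tail or "")
    return "".join(parts)


def parse_p(el, base_url):
    # Value Class Pattern parsing would come first; it is omitted here.
    if el.tag == "abbr" and "title" in el.attrib:
        return el.attrib["title"]
    if el.tag in ("data", "input") and "value" in el.attrib:
        return el.attrib["value"]
    if el.tag in ("img", "area") and "alt" in el.attrib:
        return el.attrib["alt"]
    return text_content(el, base_url)
```

With a base URL of `http://example.com/page`, an element such as `<span class="p-name">Jane <img src="/logo.png"/> Doe</span>` would produce a value containing the absolute image URL in place of the image.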

parsing a u- property

To parse an element for a u-x property value:

  • if a.u-x[href] or area.u-x[href], then get the href attribute
  • else if img.u-x[src], then get the src attribute
  • else if object.u-x[data], then get the data attribute
  • if there is a gotten value, return the normalized absolute URL of it, following the containing document's language's rules for resolving relative URLs (e.g. in HTML, use the current URL context as determined by the page, and first <base> element if any).
  • else parse the element for the Value Class Pattern, if a value is found then return it.
  • else if abbr.u-x[title], then return the title attribute
  • else if data.u-x[value] or input.u-x[value], then return the value attribute
  • else return the textContent of the element.
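
A sketch of the u- rules, again using `xml.etree.ElementTree` as a stand-in for an HTML parser and skipping the Value Class Pattern step. The names `parse_u` and `base_url` are hypothetical, and `base_url` is assumed to already reflect the page URL and any `<base>` element.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin


def parse_u(el, base_url):
    got = None
    if el.tag in ("a", "area") and "href" in el.attrib:
        got = el.attrib["href"]
    elif el.tag == "img" and "src" in el.attrib:
        got = el.attrib["src"]
    elif el.tag == "object" and "data" in el.attrib:
        got = el.attrib["data"]
    if got is not None:
        # Return the absolute form of the gotten value.
        return urljoin(base_url, got)
    # Value Class Pattern parsing would go here; it is omitted in this sketch.
    if el.tag == "abbr" and "title" in el.attrib:
        return el.attrib["title"]
    if el.tag in ("data", "input") and "value" in el.attrib:
        return el.attrib["value"]
    return "".join(el.itertext())
```

Note that only the attribute-derived values are resolved to absolute URLs in this sketch; the title, value and textContent fallbacks are returned as found, matching the order of the steps above.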

parsing a dt- property

To parse an element for a dt-x property value:

  • parse the element for the Value Class Pattern including the date and time parsing rules, if a value is found then return it.
  • if time.dt-x[datetime] or ins.dt-x[datetime] or del.dt-x[datetime], then return the datetime attribute
  • else if abbr.dt-x[title], then return the title attribute
  • else if data.dt-x[value] or input.dt-x[value], then return the value attribute
  • else return the textContent of the element.

parsing an e- property

To parse an element for an e-x property value:

  • return a dictionary with two keys:
    • html: the innerHTML of the element by using the HTML spec: Serializing HTML Fragments algorithm.
    • value: the textContent of the element, replacing any nested <img> elements with their alt attribute if present, or otherwise their src attribute if present.
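
The shape of the e- result can be illustrated with a rough sketch. It makes several simplifying assumptions: `xml.etree.ElementTree` replaces a real HTML parser, its `method="html"` serializer is only an approximation of the HTML spec's fragment serialization algorithm, and the `<img>` replacement in "value" is left out. `parse_e` is a hypothetical name.

```python
import html
import xml.etree.ElementTree as ET


def parse_e(el):
    # Approximate innerHTML: the element's leading text plus each child
    # serialized as HTML (ElementTree includes the child's tail text).
    inner = html.escape(el.text or "", quote=False)
    for child in el:
        inner += ET.tostring(child, encoding="unicode", method="html")
    # Plain textContent; nested <img> replacement is omitted in this sketch.
    return {"html": inner, "value": "".join(el.itertext())}
```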

parsing for implied properties

To imply properties: (where h-x is the root microformat element being parsed)

  • if no explicit "name" property,
  • then imply by:
    • if img.h-x then use its alt attribute for name
    • else if abbr.h-x[title] then use its title attribute for name
    • else if .h-x>img:only-child then use that img alt for name
    • else if .h-x>abbr:only-child[title] then use that abbr title for name
    • else if .h-x>:only-child>img:only-child use that img alt for name
    • else if .h-x>:only-child>abbr:only-child[title] use that abbr title for name
    • else use the textContent of the .h-x for name
    • drop leading & trailing white-space from name, including nbsp
  • if no explicit "photo" property,
  • then imply by:
    • if img.h-x[src] then use src for photo
    • else if object.h-x[data] then use data for photo
    • else if .h-x>img[src]:only-of-type:not[.h-*] then use that img src for photo
    • else if .h-x>object[data]:only-of-type:not[.h-*] then use that object data for photo
    • else if .h-x>:only-child>img[src]:only-of-type:not[.h-*] then use that img src for photo
    • else if .h-x>:only-child>object[data]:only-of-type:not[.h-*] then use that object data for photo
    • if there is a gotten photo value, return the normalized absolute URL of it, following the containing document's language's rules for resolving relative URLs (e.g. in HTML, use the current URL context as determined by the page, and first <base> element if any).
  • if no explicit "url" property,
  • then imply by:
    • if a.h-x[href] then use href for url
    • else if .h-x>a[href]:only-of-type:not[.h-*] then use that a[href] for url
    • if there is a gotten url value, return the normalized absolute URL of it, following the containing document's language's rules for resolving relative URLs (e.g. in HTML, use the current URL context as determined by the page, and first <base> element if any).
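
As an illustration, the implied "name" rules can be sketched like this (implied "photo" and "url" follow the same pattern and are not shown). The sketch assumes well-formed input parsed with `xml.etree.ElementTree`, treats a missing alt attribute as an empty string, and uses the hypothetical names `imply_name` and `_from_only_child`.

```python
import xml.etree.ElementTree as ET


def _from_only_child(parent):
    """Name from parent>img:only-child or parent>abbr:only-child[title]."""
    kids = list(parent)  # element children only, as :only-child ignores text
    if len(kids) != 1:
        return None
    kid = kids[0]
    if kid.tag == "img":
        return kid.get("alt", "")
    if kid.tag == "abbr" and "title" in kid.attrib:
        return kid.get("title")
    return None


def imply_name(root):
    if root.tag == "img":
        name = root.get("alt", "")
    elif root.tag == "abbr" and "title" in root.attrib:
        name = root.get("title")
    else:
        name = _from_only_child(root)
        if name is None and len(root) == 1:
            # .h-x>:only-child>img:only-child and the abbr equivalent
            name = _from_only_child(root[0])
        if name is None:
            name = "".join(root.itertext())
    # Drop leading and trailing white-space, including nbsp (U+00A0).
    return name.strip(" \t\r\n\f\u00a0")
```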

parse a hyperlink element for rel microformats

To parse a hyperlink element for rel microformats: (where * is the hyperlink element)

  • if the "rel" attribute of the element is empty then exit
  • set url to the value of the "href" attribute of the element, normalized to be an absolute URL following the containing document's language's rules for resolving relative URLs (e.g. in HTML, use the current URL context as determined by the page, and first <base> element if any).
  • treat the "rel" attribute of the element as a space-separated set of rel values
  • if the set of rel values does NOT have "alternate" then
    • for each rel value (rel-value)
      • if there is no key rel-value in the rels hash then create it with an empty array as its value
      • add url to the array of the key rel-value in the rels hash
    • end for
  • else
    • if there's no top-level "alternates" array, then create it as an empty array.
    • add a new hash to the top-level "alternates" array with keys for each of these attributes when present:
      • "url": url
      • "rel": the set of rel values, except "alternate", joined with spaces
      • "media": the value of the "media" attribute
      • "hreflang": the value of the "hreflang" attribute
      • "type": the value of the "type" attribute
  • end if

rel parse examples

Here are some examples to show how parsed rels may be reflected into the JSON (empty items key).

E.g. parsing this markup:

<a rel="author" href="http://example.com/a">author a</a>
<a rel="author" href="http://example.com/b">author b</a>
<a rel="in-reply-to" href="http://example.com/1">post 1</a>
<a rel="in-reply-to" href="http://example.com/2">post 2</a>
<a rel="alternate home"
   href="http://example.com/fr"
   media="handheld"
   hreflang="fr">French mobile homepage</a>

Would generate this JSON:

{
  "items": [],
  "rels": { 
    "author": [ "http://example.com/a", "http://example.com/b" ],
    "in-reply-to": [ "http://example.com/1", "http://example.com/2" ] 
  },
  "alternates": [{
     "url": "http://example.com/fr", 
     "rel": "home", 
     "media": "handheld", 
     "hreflang": "fr" 
  }]
}
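
The rel parsing steps can be sketched with Python's standard `html.parser` module. This is a minimal sketch, not a reference implementation: `RelParser`, `parse_rels` and the `base_url` parameter are hypothetical names, and `base_url` is assumed to already account for the page URL and any `<base>` element.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class RelParser(HTMLParser):
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.result = {"items": [], "rels": {}}

    def handle_starttag(self, tag, attrs):
        if tag not in ("a", "link"):
            return
        attrs = dict(attrs)
        rels = (attrs.get("rel") or "").split()
        if not rels:
            return  # empty or missing rel attribute: nothing to do
        url = urljoin(self.base_url, attrs.get("href") or "")
        if "alternate" not in rels:
            for rel in rels:
                self.result["rels"].setdefault(rel, []).append(url)
        else:
            alt = {"url": url,
                   "rel": " ".join(r for r in rels if r != "alternate")}
            for key in ("media", "hreflang", "type"):
                if attrs.get(key) is not None:
                    alt[key] = attrs[key]
            self.result.setdefault("alternates", []).append(alt)


def parse_rels(markup, base_url):
    parser = RelParser(base_url)
    parser.feed(markup)
    parser.close()
    return parser.result
```

Feeding the example markup above to `parse_rels` with a base URL of `http://example.com/` should yield a dictionary equal to the JSON shown above.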

Another parse output example can be found here:

what do the CSS selector expressions mean

This section is non-normative.

Use SelectORacle to expand any of the above CSS selector expressions into longform English prose.

Exception:

  • :not[.h-*] is not a valid CSS selector but is used here to mean:
    • does not have any class names that start with "h-"
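
Because :not[.h-*] cannot be handed to a CSS selector engine, an implementation would likely check it directly. A minimal sketch, with the hypothetical helper name `has_no_h_class`:

```python
def has_no_h_class(class_attr):
    """True if no class name in the attribute value starts with "h-"."""
    return not any(name.startswith("h-") for name in (class_attr or "").split())
```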

note HTML parsing rules

This section is non-normative.

microformats2 parsers are expected to follow HTML parsing rules, which include, for example:

questions

See the FAQ:

issues

See the issues page:

implementations

Main article: microformats2#Implementations

There are open source microformats2 parsers available for JavaScript, Node.js, PHP, Ruby and Python.

test suite

See:

Ports to other languages are encouraged.

see also