microformats2-parsing: Difference between revisions
(s/h-*/h-x to make it easier to copy/paste CSS selector expressions into tools which expand them into prose like Selectoracle, add prefix property parsing rules)
Revision as of 23:39, 16 October 2012
<entry-title>microformats2 parsing</entry-title>
One of the goals of microformats2 is to greatly simplify parsing of microformats, in particular, by making parsing independent of any one vocabulary.
parsing algorithm
parse a document for microformats
To parse a document for microformats:
- start with an empty JSON items array
- parse the root element for microformats
parse an element for microformats
To parse an element for microformats:
- parse element class for root class name(s) "h-x" (and backcompat)
  - if found, start parsing a new microformat
    - parse contained elements for properties (depth first, doc order)
      - parse an element for microformats (recurse)
    - imply properties (see below)
- parse element class for properties (p-,u-,dt-,e-)
- add properties found (with any nested microformats) to current microformat
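
The two procedures above can be sketched in Python. This is a minimal illustration, not a reference implementation: it assumes well-formed markup that `xml.etree.ElementTree` can read, takes every property value as plain inner text, and skips backcompat class names, the value-class-pattern, and implied properties. The `type` and `properties` keys of each item, and all function names, are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

PROPERTY_PREFIXES = ("p-", "u-", "dt-", "e-")

def class_names(el):
    return (el.get("class") or "").split()

def root_names(el):
    # Root class names such as "h-card"; backcompat names are not handled.
    return [c for c in class_names(el) if c.startswith("h-")]

def property_names(el):
    return [c for c in class_names(el) if c.startswith(PROPERTY_PREFIXES)]

def parse_microformat(el):
    # Start a new microformat and scan contained elements depth first,
    # in document order.
    item = {"type": root_names(el), "properties": {}}
    for child in el:
        collect_properties(child, item)
    return item

def collect_properties(el, item):
    # A property element that is also a root becomes a nested microformat.
    # A nested root with no property class is ignored in this sketch.
    nested = parse_microformat(el) if root_names(el) else None
    for name in property_names(el):
        key = name.split("-", 1)[1]
        value = nested if nested is not None else "".join(el.itertext()).strip()
        item["properties"].setdefault(key, []).append(value)
    if nested is None:
        for child in el:
            collect_properties(child, item)

def parse_element(el, items):
    if root_names(el):
        items.append(parse_microformat(el))
    else:
        for child in el:
            parse_element(child, items)

def parse_document(markup):
    items = []  # start with an empty items array
    parse_element(ET.fromstring(markup), items)
    return items
```

Calling `parse_document` on a snippet with one `h-card` that contains a `p-org h-card` yields a single item whose `org` property holds the nested item.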
parsing a p- property
To parse an element for a p-x property value:
- parse the element for the value-class-pattern; if a value is found, then return it.
- if abbr.p-x[title], then return the title attribute
- else if data.p-x[value], then return the value attribute
- else if br.p-x or hr.p-x, then return "" (empty string)
- else if img.p-x[alt] or area.p-x[alt], then return the alt attribute
- else return the innertext of the element.
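
As a rough sketch, the p- rules map onto a chain of checks. The code assumes the caller has already confirmed that the element carries a p-x class, and it takes the value-class-pattern as a hypothetical callback that returns `None` when no value is found; `parse_p_value` and `inner_text` are names invented for this example.

```python
import xml.etree.ElementTree as ET

def inner_text(el):
    # Concatenate all text inside the element, in document order.
    return "".join(el.itertext())

def parse_p_value(el, value_class_pattern=lambda el: None):
    value = value_class_pattern(el)  # hypothetical hook, not implemented here
    if value is not None:
        return value
    tag = el.tag.lower()
    if tag == "abbr" and el.get("title") is not None:
        return el.get("title")
    if tag == "data" and el.get("value") is not None:
        return el.get("value")
    if tag in ("br", "hr"):
        return ""
    if tag in ("img", "area") and el.get("alt") is not None:
        return el.get("alt")
    return inner_text(el)
```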
parsing a u- property
To parse an element for a u-x property value:
- parse the element for the value-class-pattern; if a value is found, then return it.
- if a.u-x[href] or area.u-x[href], then get the href attribute
- else if img.u-x[src], then get the src attribute
- else if object.u-x[data], then get the data attribute
- if one of the three preceding steps obtained a value, return the normalized absolute URL of it, following the containing document's language's rules for resolving relative URLs.
- else if abbr.u-x[title], then return the title attribute
- else if data.u-x[value], then return the value attribute
- else return the innertext of the element.
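
The u- rules differ from the p- rules mainly in the URL resolution step. The sketch below uses `urllib.parse.urljoin` as a stand-in for "normalized absolute URL"; that choice, the `base_url` parameter, and the value-class-pattern callback are assumptions for the example, and a `<base>` element in the document is not taken into account.

```python
from urllib.parse import urljoin
import xml.etree.ElementTree as ET

def parse_u_value(el, base_url, value_class_pattern=lambda el: None):
    value = value_class_pattern(el)  # hypothetical hook, not implemented here
    if value is not None:
        return value
    tag = el.tag.lower()
    raw = None
    if tag in ("a", "area") and el.get("href") is not None:
        raw = el.get("href")
    elif tag == "img" and el.get("src") is not None:
        raw = el.get("src")
    elif tag == "object" and el.get("data") is not None:
        raw = el.get("data")
    if raw is not None:
        # Resolve against the document URL; urljoin approximates the
        # normalization the rule asks for.
        return urljoin(base_url, raw)
    if tag == "abbr" and el.get("title") is not None:
        return el.get("title")
    if tag == "data" and el.get("value") is not None:
        return el.get("value")
    return "".join(el.itertext())
```

Note that only values taken from `href`, `src`, or `data` are resolved; the `title`, `value`, and inner text fallbacks are returned unchanged, as in the list above.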
parsing a dt- property
To parse an element for a dt-x property value:
- parse the element for the value-class-pattern including the date and time parsing rules; if a value is found, then return it.
- if time.dt-x[datetime] or ins.dt-x[datetime] or del.dt-x[datetime], then return the datetime attribute
- else if abbr.dt-x[title], then return the title attribute
- else if data.dt-x[value], then return the value attribute
- else return the innertext of the element.
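
A sketch of the dt- rules follows the same shape. The value-class-pattern with its date and time parsing rules is represented only by a hypothetical callback, and the returned string is not validated or normalized as a date; both are simplifications made for the example.

```python
import xml.etree.ElementTree as ET

def parse_dt_value(el, value_class_pattern=lambda el: None):
    value = value_class_pattern(el)  # hypothetical hook, not implemented here
    if value is not None:
        return value
    tag = el.tag.lower()
    if tag in ("time", "ins", "del") and el.get("datetime") is not None:
        return el.get("datetime")
    if tag == "abbr" and el.get("title") is not None:
        return el.get("title")
    if tag == "data" and el.get("value") is not None:
        return el.get("value")
    return "".join(el.itertext())
```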
parsing an e- property
To parse an element for an e-x property value:
- return the innerHTML of the element by using the HTML spec: Serializing HTML Fragments algorithm.
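
For illustration only, an approximation of innerHTML can be built with Python's `xml.etree.ElementTree`. This is not the HTML spec's fragment serialization algorithm: it assumes well-formed input and relies on ElementTree's own `method="html"` serializer, so output may differ from a browser's in edge cases. The function name `parse_e_value` is an assumption.

```python
import html
import xml.etree.ElementTree as ET

def parse_e_value(el):
    # Text before the first child, escaped the way markup text would be.
    parts = [html.escape(el.text or "", quote=False)]
    for child in el:
        # tostring() emits the child's markup followed by its tail text.
        parts.append(ET.tostring(child, encoding="unicode", method="html"))
    return "".join(parts)
```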
parsing for implied properties
To imply properties: (where h-x is the root microformat element being parsed)
- if no explicit "name" property, then imply by:
  - if img.h-x then use its alt attribute for name
  - else if .h-x>img:only-node then use that img alt for name
  - else if .h-x>:only-node>img:only-node use that img alt for name
  - else use the innertext of the .h-x for name
  - drop leading & trailing white-space from name, including nbsp
- if no explicit "photo" property, then imply by:
  - if img.h-x[src] then use src for photo
  - else if .h-x>img[src]:only-of-type then use that img src for photo
  - else if .h-x>:only-child>img[src]:only-of-type then use that img src for photo
- if no explicit "url" property, then imply by:
  - if a.h-x[href] then use href for url
  - else if .h-x>a[href]:only-of-type then use that a[href] for url
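
The implied-property rules can be sketched as three small fallbacks. Two readings here are assumptions: `:only-node` is treated like `:only-child` (the sole child element, ignoring text nodes), and implied photo and url values are returned as written, without URL resolution. All function names are invented for the example.

```python
import xml.etree.ElementTree as ET

def only_child(el):
    # Assumed reading of ':only-node' / ':only-child': the sole child element.
    children = list(el)
    return children[0] if len(children) == 1 else None

def only_of_type(el, tag, attr):
    # Child matching '>tag[attr]:only-of-type', or None.
    same_tag = [c for c in el if c.tag == tag]
    if len(same_tag) == 1 and same_tag[0].get(attr) is not None:
        return same_tag[0]
    return None

def implied_name(el):
    child = only_child(el)
    grandchild = only_child(child) if child is not None else None
    if el.tag == "img":
        name = el.get("alt", "")
    elif child is not None and child.tag == "img":
        name = child.get("alt", "")
    elif grandchild is not None and grandchild.tag == "img":
        name = grandchild.get("alt", "")
    else:
        name = "".join(el.itertext())
    return name.strip()  # str.strip() also removes U+00A0 (nbsp)

def implied_photo(el):
    if el.tag == "img" and el.get("src") is not None:
        return el.get("src")
    img = only_of_type(el, "img", "src")
    if img is None:
        child = only_child(el)
        if child is not None:
            img = only_of_type(child, "img", "src")
    return img.get("src") if img is not None else None

def implied_url(el):
    if el.tag == "a" and el.get("href") is not None:
        return el.get("href")
    a = only_of_type(el, "a", "href")
    return a.get("href") if a is not None else None

def imply_properties(el, properties):
    # Fill in only the properties that were not found explicitly.
    rules = (("name", implied_name), ("photo", implied_photo), ("url", implied_url))
    for key, imply in rules:
        if key not in properties:
            value = imply(el)
            if value is not None:
                properties[key] = [value]
    return properties
```

For a root such as `<a class="h-card" href="/ana"><img src="ana.png" alt="Ana Lee" /></a>` with no explicit properties, this fills in name, photo, and url from the `alt`, `src`, and `href` attributes respectively.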
what do the CSS selector expressions mean
Use SelectORacle to expand any of the above CSS selector expressions into longform English prose.
see also
- microformats2
- microformats2-implied-properties
- microformats2-parsing-brainstorming - for background, thinking, exploring possibilities