microformats2-parsing: Difference between revisions
(parse a document vs parse an element)
(s/h-*/h-x to make it easier to copy/paste CSS selector expressions into tools which expand them into prose like Selectoracle, add prefix property parsing rules)
Revision as of 23:39, 16 October 2012
<entry-title>microformats2 parsing</entry-title>
One of the goals of microformats2 is to greatly simplify parsing of microformats, in particular, by making parsing independent of any one vocabulary.
parsing algorithm
parse a document for microformats
To parse a document for microformats:
- start with an empty JSON items array
- parse the root element for microformats
parse an element for microformats
To parse an element for microformats:
- parse element class for root class name(s) "h-x" (and backcompat)
  - if found, start parsing a new microformat
    - parse contained elements for properties (depth first, doc order)
      - parse an element for microformats (recurse)
      - imply properties (see below)
- parse element class for properties (p-,dt-,u-,e-)
- add properties found (with any nested microformats) to current microformat
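
The document and element steps above can be sketched as a single recursive walk. This is only an illustration under several assumptions that are not part of the algorithm text: it uses Python's `xml.etree.ElementTree`, so the input must be well-formed markup; the dictionary shape (`type`, `properties`), the stripping of the prefix from property names, and the function names `parse_document` and `parse_element` are hypothetical; backcompat root class names and implied properties are left out; and property values are reduced to plain text as a placeholder for the prefix-specific rules.

```python
import xml.etree.ElementTree as ET

PROPERTY_PREFIXES = ("p-", "u-", "dt-", "e-")


def parse_element(el, items, current=None):
    """Parse one element and its descendants (depth first, document order)."""
    classes = (el.get("class") or "").split()
    roots = [c for c in classes if c.startswith("h-")]
    props = [c for c in classes if c.startswith(PROPERTY_PREFIXES)]

    # A root class name starts a new microformat, which then collects
    # the properties of the contained elements.
    microformat = {"type": roots, "properties": {}} if roots else None
    scope = microformat if roots else current

    for child in el:
        parse_element(child, items, scope)

    # Implied properties would be computed here for `microformat`.

    if roots:
        value = microformat  # nested microformat used as the value
    else:
        value = "".join(el.itertext()).strip()  # placeholder value

    if current is not None and props:
        for prop in props:
            name = prop.split("-", 1)[1]  # assumption: drop the prefix
            current["properties"].setdefault(name, []).append(value)
    elif roots:
        # Assumption: a microformat that is not a property of another
        # microformat is appended to the top-level items array.
        items.append(microformat)


def parse_document(markup):
    """Start with an empty items array and parse the root element."""
    items = []
    parse_element(ET.fromstring(markup), items)
    return items
```

Calling `parse_document` on markup such as `<div class="h-card"><span class="p-name">Ann</span></div>` yields a one-element list that can be passed to `json.dumps` to obtain the JSON items array.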
parsing a p- property
To parse an element for a p-x property value:
- parse the element for the value-class-pattern, if a value is found then return it.
- if abbr.p-x[title], then return the title attribute
- else if data.p-x[value], then return the value attribute
- else if br.p-x or hr.p-x, then return "" (empty string)
- else if img.p-x[alt] or area.p-x[alt], then return the alt attribute
- else return the innertext of the element.
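
As a rough sketch, the p- rules above map onto a chain of checks. The function name `parse_p_property` is hypothetical, the value-class-pattern step is omitted, the input is assumed to be well-formed markup parsed with `xml.etree.ElementTree`, and innertext is approximated by concatenating the element's text nodes.

```python
import xml.etree.ElementTree as ET


def parse_p_property(el):
    """Return the p-x value of `el` (value-class-pattern step omitted)."""
    tag = el.tag.lower()
    if tag == "abbr" and el.get("title") is not None:
        return el.get("title")
    if tag == "data" and el.get("value") is not None:
        return el.get("value")
    if tag in ("br", "hr"):
        return ""
    if tag in ("img", "area") and el.get("alt") is not None:
        return el.get("alt")
    # Approximation of innertext: all descendant text in document order.
    return "".join(el.itertext())


abbr = ET.fromstring('<abbr class="p-x" title="California">CA</abbr>')
print(parse_p_property(abbr))
```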
parsing a u- property
To parse an element for a u-x property value:
- parse the element for the value-class-pattern, if a value is found then return it.
- if a.u-x[href] or area.u-x[href], then get the href attribute
- else if img.u-x[src], then get the src attribute
- else if object.u-x[data], then get the data attribute
- if one of the attributes above yielded a value, return its normalized absolute URL, following the containing document's language's rules for resolving relative URLs.
- else if abbr.u-x[title], then return the title attribute
- else if data.u-x[value], then return the value attribute
- else return the innertext of the element.
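
The u- rules can be sketched in the same way. The assumptions here: `parse_u_property` and its `base_url` parameter are hypothetical names, the value-class-pattern step is omitted, and `urllib.parse.urljoin` stands in for the document language's URL resolution rules (it resolves relative references against a base but does not account for, say, an HTML `base` element).

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# Which attribute carries the URL for each element type.
URL_ATTRIBUTES = {"a": "href", "area": "href", "img": "src", "object": "data"}


def parse_u_property(el, base_url):
    """Return the u-x value of `el` (value-class-pattern step omitted)."""
    tag = el.tag.lower()
    attr = URL_ATTRIBUTES.get(tag)
    if attr is not None and el.get(attr) is not None:
        # A URL attribute was found: resolve it against the base URL.
        return urljoin(base_url, el.get(attr))
    if tag == "abbr" and el.get("title") is not None:
        return el.get("title")
    if tag == "data" and el.get("value") is not None:
        return el.get("value")
    # Approximation of innertext: all descendant text in document order.
    return "".join(el.itertext())


link = ET.fromstring('<a class="u-x" href="ann">Ann</a>')
print(parse_u_property(link, "http://example.com/people/"))
```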
parsing a dt- property
To parse an element for a dt-x property value:
- parse the element for the value-class-pattern including the date and time parsing rules, if a value is found then return it.
- if time.dt-x[datetime] or ins.dt-x[datetime] or del.dt-x[datetime], then return the datetime attribute
- else if abbr.dt-x[title], then return the title attribute
- else if data.dt-x[value], then return the value attribute
- else return the innertext of the element.
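
A comparable sketch for the dt- rules follows. It returns the attribute or text unchanged: the value-class-pattern step and its date and time parsing rules are not implemented, and `parse_dt_property` is a hypothetical name operating on well-formed markup parsed with `xml.etree.ElementTree`.

```python
import xml.etree.ElementTree as ET


def parse_dt_property(el):
    """Return the dt-x value of `el` (value-class-pattern step omitted)."""
    tag = el.tag.lower()
    if tag in ("time", "ins", "del") and el.get("datetime") is not None:
        return el.get("datetime")
    if tag == "abbr" and el.get("title") is not None:
        return el.get("title")
    if tag == "data" and el.get("value") is not None:
        return el.get("value")
    # Approximation of innertext: all descendant text in document order.
    return "".join(el.itertext())


stamp = ET.fromstring(
    '<time class="dt-x" datetime="2012-10-16T23:39">16 October 2012</time>'
)
print(parse_dt_property(stamp))
```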
parsing an e- property
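To parse an element for an e-x property value:
- return the innerHTML of the element by using the HTML spec: Serializing HTML Fragments algorithm (http://www.whatwg.org/specs/web-apps/current-work/multipage/the-end.html#serializing-html-fragments).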
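
For illustration only, the innerHTML of an element can be approximated with `xml.etree.ElementTree` by joining the element's leading text with the serialization of each child. This is not the HTML spec's fragment serialization algorithm: ElementTree's `method="html"` output may differ from it in escaping and attribute details, and `parse_e_property` is a hypothetical name.

```python
import xml.etree.ElementTree as ET


def parse_e_property(el):
    """Approximate the innerHTML of `el`, excluding el's own tags."""
    parts = [el.text or ""]
    for child in el:
        # tostring() serializes the child and the text that follows it.
        parts.append(ET.tostring(child, encoding="unicode", method="html"))
    return "".join(parts)


content = ET.fromstring('<div class="e-x">Hi <b>there</b> you</div>')
print(parse_e_property(content))
```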
parsing for implied properties
To imply properties: (where h-x is the root microformat element being parsed)
- if no explicit "name" property,
- then imply by:
  - if img.h-x then use its alt attribute for name
  - else if .h-x>img:only-node then use that img alt for name
  - else if .h-x>:only-node>img:only-node then use that img alt for name
  - else use the innertext of the .h-x for name
  - drop leading & trailing white-space from name, including nbsp
- if no explicit "photo" property,
- then imply by:
  - if img.h-x[src] then use src for photo
  - else if .h-x>img[src]:only-of-type then use that img src for photo
  - else if .h-x>:only-child>img[src]:only-of-type then use that img src for photo
- if no explicit "url" property,
- then imply by:
  - if a.h-x[href] then use href for url
  - else if .h-x>a[href]:only-of-type then use that a[href] for url
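
The implied-property rules above can be sketched as three small functions. Several points are assumptions rather than part of the rules: `:only-node` is read here as "the only child element" (text nodes are ignored), innertext is approximated by concatenating text nodes, URLs are returned without resolving them, and all function names are hypothetical. Each function is meant to be called only when the corresponding explicit property is absent.

```python
import xml.etree.ElementTree as ET


def only_child(el):
    """Return the single child element of `el`, or None."""
    children = list(el)
    return children[0] if len(children) == 1 else None


def only_of_type(el, tag):
    """Return the child with this tag if it is the only one, else None."""
    matches = [c for c in el if c.tag.lower() == tag]
    return matches[0] if len(matches) == 1 else None


def imply_name(root):
    child = only_child(root)
    grandchild = only_child(child) if child is not None else None
    # Try img.h-x, then .h-x>img, then .h-x>*>img, in that order.
    for candidate in (root, child, grandchild):
        if candidate is not None and candidate.tag.lower() == "img":
            name = candidate.get("alt") or ""
            break
    else:
        name = "".join(root.itertext())
    # str.strip() also removes U+00A0 (nbsp) in Python 3.
    return name.strip()


def imply_photo(root):
    if root.tag.lower() == "img" and root.get("src") is not None:
        return root.get("src")
    for parent in (root, only_child(root)):
        if parent is None:
            continue
        img = only_of_type(parent, "img")
        if img is not None and img.get("src") is not None:
            return img.get("src")
    return None


def imply_url(root):
    if root.tag.lower() == "a" and root.get("href") is not None:
        return root.get("href")
    link = only_of_type(root, "a")
    if link is not None and link.get("href") is not None:
        return link.get("href")
    return None


card = ET.fromstring(
    '<a class="h-card" href="http://example.com/">'
    '<img src="ann.jpg" alt=" Ann Example "/></a>'
)
print(imply_name(card), imply_photo(card), imply_url(card))
```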
what do the CSS selector expressions mean
Use SelectORacle (http://gallery.theopalgroup.com/selectoracle/) to expand any of the above CSS selector expressions into longform English prose.
see also
- microformats2
- microformats2-implied-properties
- microformats2-parsing-brainstorming - for background, thinking, exploring possibilities