microformats2-parsing: Difference between revisions

From Microformats Wiki

Revision as of 23:39, 16 October 2012

microformats2 parsing

One of the goals of microformats2 is to greatly simplify parsing of microformats, in particular, by making parsing independent of any one vocabulary.

parsing algorithm

parse a document for microformats

To parse a document for microformats:

  • start with an empty JSON items array
  • parse the root element for microformats

parse an element for microformats

To parse an element for microformats:

  • parse element class for root class name(s) "h-x" (and backward-compatible root class names)
    • if found, start parsing a new microformat
      • parse contained elements for properties (depth first, doc order)
        • parse an element for microformats (recurse)
      • imply properties (see below)
  • parse element class for properties (p-,dt-,u-,e-)
  • add properties found (with any nested microformats) to current microformat
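
The recursion above can be sketched roughly as follows. This is only an illustration under several assumptions: the markup is well-formed enough for Python's xml.etree.ElementTree, backcompat class names and implied properties are skipped, property values are a plain-text placeholder rather than the per-prefix rules, and a microformat that carries no property class is appended to the top-level items list (this page does not say where such a microformat goes). The names parse_element and parse_document are hypothetical.

```python
import xml.etree.ElementTree as ET

PROPERTY_PREFIXES = ("p-", "dt-", "u-", "e-")


def parse_element(el, items, current=None):
    """Parse el and its descendants depth first, in document order."""
    classes = el.get("class", "").split()
    roots = [c for c in classes if c.startswith("h-")]
    props = [c for c in classes if c.startswith(PROPERTY_PREFIXES)]

    # A root class name starts a new microformat.
    microformat = {"type": roots, "properties": {}} if roots else None

    # Placeholder value: the per-prefix parsing rules are not applied here.
    value = microformat if roots else "".join(el.itertext()).strip()

    if props and current is not None:
        # Add the property (or nested microformat) to the current microformat.
        for name in props:
            key = name.split("-", 1)[1]
            current["properties"].setdefault(key, []).append(value)
    elif roots:
        # Assumption: a microformat without a property class is a top-level item.
        items.append(microformat)

    # Contained elements contribute to the new microformat if one was started,
    # otherwise to the enclosing one.
    for child in el:
        parse_element(child, items, microformat if roots else current)


def parse_document(markup):
    """Start with an empty items list and parse the root element."""
    items = []
    parse_element(ET.fromstring(markup), items)
    return items
```

Calling parse_document on a small h-card that contains a nested "p-org h-card" yields one top-level item whose "org" property holds the nested microformat.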

parsing a p- property

To parse an element for a p-x property value:

  • parse the element for the value-class-pattern, if a value is found then return it.
  • if abbr.p-x[title], then return the title attribute
  • else if data.p-x[value], then return the value attribute
  • else if br.p-x or hr.p-x, then return "" (empty string)
  • else if img.p-x[alt] or area.p-x[alt], then return the alt attribute
  • else return the innertext of the element.
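
A minimal sketch of these p- rules, assuming the element is an xml.etree.ElementTree element and leaving out the value-class-pattern step (it is defined on its own page). The function name parse_p is hypothetical.

```python
import xml.etree.ElementTree as ET


def parse_p(el):
    """Return the p-x value of el; the value-class-pattern step is omitted."""
    tag = el.tag.lower()
    if tag == "abbr" and el.get("title") is not None:
        return el.get("title")
    if tag == "data" and el.get("value") is not None:
        return el.get("value")
    if tag in ("br", "hr"):
        return ""
    if tag in ("img", "area") and el.get("alt") is not None:
        return el.get("alt")
    # Fall back to the text content of the element and its descendants.
    return "".join(el.itertext())
```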

parsing a u- property

To parse an element for a u-x property value:

  • parse the element for the value-class-pattern, if a value is found then return it.
  • if a.u-x[href] or area.u-x[href], then get the href attribute
  • else if img.u-x[src], then get the src attribute
  • else if object.u-x[data], then get the data attribute
  • if a value was obtained from one of the attributes above, return its normalized absolute URL, resolving relative URLs according to the rules of the containing document's language.
  • else if abbr.u-x[title], then return the title attribute
  • else if data.u-x[value], then return the value attribute
  • else return the innertext of the element.
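
The u- rules could look something like the sketch below. It assumes an xml.etree.ElementTree element and a base_url parameter supplied by the caller (hypothetical; a real parser would also have to honor the document's own base URL rules), uses urllib.parse.urljoin as a stand-in for full URL normalization, and omits the value-class-pattern step.

```python
from urllib.parse import urljoin
import xml.etree.ElementTree as ET


def parse_u(el, base_url):
    """Return the u-x value of el; the value-class-pattern step is omitted."""
    tag = el.tag.lower()
    url = None
    if tag in ("a", "area") and el.get("href") is not None:
        url = el.get("href")
    elif tag == "img" and el.get("src") is not None:
        url = el.get("src")
    elif tag == "object" and el.get("data") is not None:
        url = el.get("data")
    if url is not None:
        # Resolve a relative URL against the document's base URL.
        return urljoin(base_url, url)
    if tag == "abbr" and el.get("title") is not None:
        return el.get("title")
    if tag == "data" and el.get("value") is not None:
        return el.get("value")
    return "".join(el.itertext())
```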

parsing a dt- property

To parse an element for a dt-x property value:

  • parse the element for the value-class-pattern including the date and time parsing rules, if a value is found then return it.
  • if time.dt-x[datetime] or ins.dt-x[datetime] or del.dt-x[datetime], then return the datetime attribute
  • else if abbr.dt-x[title], then return the title attribute
  • else if data.dt-x[value], then return the value attribute
  • else return the innertext of the element.
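
As a rough illustration of the dt- rules, again assuming an xml.etree.ElementTree element and skipping the value-class-pattern step and its date and time parsing rules. The function name parse_dt is hypothetical.

```python
import xml.etree.ElementTree as ET


def parse_dt(el):
    """Return the dt-x value of el; the value-class-pattern step is omitted."""
    tag = el.tag.lower()
    if tag in ("time", "ins", "del") and el.get("datetime") is not None:
        return el.get("datetime")
    if tag == "abbr" and el.get("title") is not None:
        return el.get("title")
    if tag == "data" and el.get("value") is not None:
        return el.get("value")
    return "".join(el.itertext())
```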

parsing an e- property

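To parse an element for an e-x property value:

  • return the innerHTML of the element by using the HTML spec: Serializing HTML Fragments algorithm (http://www.whatwg.org/specs/web-apps/current-work/multipage/the-end.html#serializing-html-fragments).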
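
An e- property takes the element's innerHTML. The sketch below only approximates that with xml.etree.ElementTree's HTML output method; it is not an implementation of the HTML spec's fragment serialization algorithm, and the function name parse_e is hypothetical.

```python
import html
import xml.etree.ElementTree as ET


def parse_e(el):
    """Approximate innerHTML: escaped leading text plus each serialized child."""
    parts = [html.escape(el.text or "", quote=False)]
    for child in el:
        # tostring() also emits the text that follows the child (its tail).
        parts.append(ET.tostring(child, encoding="unicode", method="html"))
    return "".join(parts)
```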

parsing for implied properties

To imply properties: (where h-x is the root microformat element being parsed)

  • if no explicit "name" property,
  • then imply by:
    • if img.h-x then use its alt attribute for name
    • else if .h-x>img:only-node then use that img alt for name
    • else if .h-x>:only-node>img:only-node then use that img alt for name
    • else use the innertext of the .h-x for name
    • drop leading & trailing white-space from name, including nbsp
  • if no explicit "photo" property,
  • then imply by:
    • if img.h-x[src] then use src for photo
    • else if .h-x>img[src]:only-of-type then use that img src for photo
    • else if .h-x>:only-child>img[src]:only-of-type then use that img src for photo
  • if no explicit "url" property,
  • then imply by:
    • if a.h-x[href] then use href for url
    • else if .h-x>a[href]:only-of-type then use that a[href] for url
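
The implied-property rules might be sketched as below. Several points are assumptions rather than anything this page states: ":only-node" is read as "the sole child element, with no non-whitespace text beside it", a missing alt attribute is treated as an empty string, implied URLs are left unresolved, and property values are stored as one-element lists. All function names are hypothetical.

```python
import xml.etree.ElementTree as ET


def only_node(parent):
    """Sole child of parent with no text beside it (assumed :only-node)."""
    children = list(parent)
    if len(children) != 1:
        return None
    child = children[0]
    if (parent.text or "").strip() or (child.tail or "").strip():
        return None
    return child


def only_of_type(parent, tag, attr):
    """The child matching tag[attr]:only-of-type, or None."""
    matches = [c for c in parent if c.tag == tag]
    if len(matches) == 1 and matches[0].get(attr) is not None:
        return matches[0]
    return None


def imply_name(root):
    if root.tag == "img":
        name = root.get("alt", "")
    else:
        child = only_node(root)
        grandchild = only_node(child) if child is not None else None
        if child is not None and child.tag == "img":
            name = child.get("alt", "")
        elif grandchild is not None and grandchild.tag == "img":
            name = grandchild.get("alt", "")
        else:
            name = "".join(root.itertext())
    # str.strip() also removes U+00A0 (nbsp).
    return name.strip()


def imply_photo(root):
    if root.tag == "img" and root.get("src") is not None:
        return root.get("src")
    img = only_of_type(root, "img", "src")
    children = list(root)
    if img is None and len(children) == 1:
        # .h-x>:only-child>img[src]:only-of-type
        img = only_of_type(children[0], "img", "src")
    return img.get("src") if img is not None else None


def imply_url(root):
    if root.tag == "a" and root.get("href") is not None:
        return root.get("href")
    link = only_of_type(root, "a", "href")
    return link.get("href") if link is not None else None


def imply_properties(root, properties):
    """Fill in name, photo and url only where no explicit value exists."""
    implied = {"name": imply_name(root), "photo": imply_photo(root),
               "url": imply_url(root)}
    for key, value in implied.items():
        if key not in properties and value is not None:
            properties[key] = [value]
    return properties
```

For an a.h-card that wraps a single img, this fills in all three properties; explicit properties that are already present are left untouched.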

what do the CSS selector expressions mean

Use SelectORacle (http://gallery.theopalgroup.com/selectoracle/) to expand any of the above CSS selector expressions into longform English prose.

see also