Parsing Algorithm (Brainstorming)
Introduction
This document is intended to replace parsing once it reaches consensus, as it is somewhat more detailed.
This page explains how to parse the properties of a compound microformat once we have located the root element, which we shall call root. It deals with simple properties which have no sub-properties (such as fn in hCard), and with properties which may include an embedded microformat (such as agent in hCard) — 95% of properties do fall into these categories. Most other properties can be parsed by performing a "pivot": treating the property element as a microformat in its own right and finding sub-properties using the techniques on this page.
Contributors
Primary Author: Toby Inkster (TobyInk)
Licence
The author of this page releases the algorithms described into the public domain.
Additionally, a working implementation of these algorithms (in Perl) is available as part of Swignition. The implementation is not released into the public domain, but is licensed under the GNU General Public License (version 3).
General Algorithm
This algorithm is described in procedural terms, but the same ideas could be used in a functional or declarative programming language. A DOM-like tree of the HTML is preferable, for quick, easy access to parent/child nodes. A getElementsByClassName-like function is needed — if your programming environment does not provide one, then you will need to write one yourself. (Bear in mind that the class attribute is a whitespace delimited list, and classes are case sensitive.)
- Make a copy of the DOM tree to operate on. All future references to DOM traversal and manipulation refer to this clone.
- Implement the include pattern by removing any nodes with
class="include"and replacing them with the node which they point to. (Unless the node pointed to is an ancestor node of the node withclass="include"!!) - Parse all properties that may contain nested microformats. For example within hCard, the agent property may contain a nested hCard. For each property
propwhich may contain a nested microformat:- Create an empty array to store the value(s) of
propin. Call thisA. - Find all elements with
class="prop"that are descended fromroot. Call this listP. - Create an empty list
E. - For each element
pinP:- Let
okbe true. - Let
aby the parent node ofp. - While
ais not the root element of this microformat:- If
ahas a class attribute which indicates that it is a nested microformat (e.g.class="vevent") then setokto false - Set
ato the parent ofa.
- If
- If
okis true, then pushpontoE.
- Let
- OUTER: For each element
einE:- INNER: For each microformat root class name that might be present (e.g. within an hCalendar event's
locationproperty you might reasonably expect to find an hCard, and adr or a geo!):- Create an empty list
F. - Search for any elements within
ewhich have the microformat root class name. Bear in mind that perhaps this element might bee— this is allowed. For each such element,e2:- Parse
e2as an embedded microformat and place the result inFif it is valid. - Run the DESTROY_ELEMENT function on
e2
- Parse
- If
Fis not empty:- Move the first item in
FtoA - Jump out of the INNER loop.
- Move the first item in
- Create an empty list
- If
Ais not empty:- Change the
classattribute one, removingpropfrom the class list. - If property
propis singular, jump out of the OUTER loop.
- Change the
- INNER: For each microformat root class name that might be present (e.g. within an hCalendar event's
- If
propis a singular property, letA[0]be its value. If plural, letAbe the list of its values.
- Create an empty array to store the value(s) of
- Run the DESTROYER function on
root.- Depending on what microformat you're parsing, you will want to pass a further parameter to DESTROYER indicating a list of microformats to "save" from destruction. This allows for the simple string/url/datetime/etc properties to be included within nested microformats. So far, the only case I've found where this is beneficial is when parsing hCard, it is useful to save adr and geo from destruction. This enables the
fnproperty to be set on address components. example
- Depending on what microformat you're parsing, you will want to pass a further parameter to DESTROYER indicating a list of microformats to "save" from destruction. This allows for the simple string/url/datetime/etc properties to be included within nested microformats. So far, the only case I've found where this is beneficial is when parsing hCard, it is useful to save adr and geo from destruction. This enables the
- Parse all properties. There are three different categories of property — singular, plural and concatenated. Most properties are either singular or plural, but a handful are concatenated, such as
entry-summaryin hAtom. For each propertypropwithinroot:- Create an empty array to store the value(s) of
propin. Call thisA. - Find all elements with
class="prop"that are descended fromroot. - For each element
e, run this:- Find the value of
e, using the techniques in the section below. - If the value of
eis not NULL, add it toA - If the
propis a singular property andAis not empty, jump out of this foreach loop.
- Find the value of
- If
propis a singular property, then its value isA[0]. - If
propis a plural property, then its values areA. - If
propis a concatenated property, then its values are formed by concatenating the values ofAtogether.
- Create an empty array to store the value(s) of
Finding Values
There are at least five different types of property that can be parsed, each of which requires different techniques:
- HTML properties, such as
entry-contentin hAtom - URI properties, such as
urlin hCard - ID properties, such as
uidin hCard - Datetime properties, such as
dtstartin hCalendar - Plain text properties, such as
titlein hCard
Arguments can be made for duration properties and numeric properties to also have variations in the algorithm, but for now, we'll just treat them as plain text properties.
Generally speaking, the mechanism for going from an element e to a value is to use the first "hit" from the following:
- Look at relevant attributes (e.g.
datetimeif we're looking for a date, orhrefif we're looking for a URI). - Look for any descendants with
class="value". - Look at the node contents.
This is described in more detail below, for each type of property.
HTML Properties
These are the easiest to parse. Given an element e, just use the HTML representation of its DOM node. Some DOM implementations make this available as .outerHTML.
URI Properties
Certain HTML elements are capable of linking to other resources. The most obvious is <a> though there are many others. The following list of linking elements is derived from Perl's HTML::Tagset module:
{
'a' => ['href'],
'applet' => ['codebase', 'archive', 'code'],
'area' => ['href'],
# 'base' => ['href'],
'bgsound' => ['src'],
'blockquote' => ['cite'],
# 'body' => ['background'],
'del' => ['cite'],
'embed' => ['src', 'pluginspage'],
'form' => ['action'],
'frame' => ['src', 'longdesc'],
'iframe' => ['src', 'longdesc'],
# 'ilayer' => ['background'],
'img' => ['src', 'lowsrc', 'longdesc', 'usemap'],
'input' => ['src', 'usemap'],
'ins' => ['cite'],
'isindex' => ['action'],
# 'head' => ['profile'],
'layer' => ['src'], # 'background'
'link' => ['href'],
'object' => ['data', 'classid', 'codebase', 'archive', 'usemap'],
'q' => ['cite'],
'script' => ['src', 'for'],
# 'table' => ['background'],
# 'td' => ['background'],
# 'th' => ['background'],
# 'tr' => ['background'],
'xmp' => ['href'],
}
Note that some are commented out as they might be too counter-intuitive to implement!
If we're parsing an element e and looking for a URI, here is the algorithm we use:
- Set variable
uto NULL. - Search
efor any descendent elements withclass="value". Call this listV. - Add the element
eitself to the listV, at the front of the list. - OUTER: for each element
vfrom listV:- If
vis a linking element from the above list- INNER: for each attribute
aassociated that the tag name ofvin the above list- If
ais set- Set
uto the contents ofa - Jump out of the OUTER loop.
- Set
- If
- INNER: for each attribute
- If
- If
uis not null, and is a relative URI, convert it to an absolute URI.
The URI has hopefully been found in u. If no URI has been found, then fall back to plain text parsing.
UID Properties
UID properties are parsed similarly to URL properties, but with a slightly modified algorithm, allowing for UIDs to be specified in the id attribute. The following example has a UID of "http://example.com/page#foo".
<base href="http://example.com/page" /> <div class="uid" id="foo">...</div>
The modified algorithm used is:
- Set variable
uto NULL. - Search
efor any descendent elements withclass="value". Call this listV. - Add the element
eitself to the listV, at the front of the list. - OUTER: for each element
vfrom listV:- If
vis a linking element from the above list- INNER: for each attribute
aassociated that the tag name ofvin the above list- If
ais set- Set
uto the contents ofa - Jump out of the OUTER loop.
- Set
- If
- INNER: for each attribute
- If
vhas anidattribute set- Set
uto the contents ofid, with the character "#" prepended - Jump out of the OUTER loop.
- Set
- If
- If
uis not null, and is a relative URI, convert it to an absolute URI.
Again, if no u has been found by the algorithm, then fall back to parsing it as a plain text property.
Datetime Properties
Parsing property prop, if class="prop" is found on element e.
- If element
ehas an attributedatetime, then the content of that attribute is the value and the rest of these steps should be skipped. - Create a list
D, which is empty. - Create a list
Vof elements withclass="value". - For each element
vinV:- If
vhas an attributedatetime, then add the content of that attribute toD - Otherwise, run the STRINGIFY function on
vand add the result toD
- If
- If
Dis empty, then run the STRINGIFY function oneand let the result be the value, and skip the rest of these steps. - If
Dcontains only one item, and it looks like an ISO date or ISO datetime, then let that be the value, and skip the rest of these steps. - If
Dcontains two items, and the first looks like an ISO date, and the second like a time, concatenate them, joining with an upper case 'T', let that be the value, and skip the rest of these steps. - If
Dcontains three items, and the first looks like an ISO date, the second like a time, and the last like a timezone (may need normalisation), concatenate them, joining the first two with an upper case 'T' and the last one with no intervening character, let that be the value, and skip the rest of these steps. - Concatenate all the items in
Dand let that be the value.
The final value should be interpreted as liberally as possible with regards to punctuation as an ISO date or ISO datetime.
Normalizing Timezones
Where S is a sign (+ or -) and the letters a, b, c, d are numerals, then:
- Sa → S0a00
- Sab → Sab00
- Sabc → S0abc
- Sa: → S0a00
- Sab: → Sab00
- Sa:b → S0ab0
- S:ab → S00ab
- Sa:bc → S0abc
- Sab:c → Sabc0
- Sab:cd → Sabcd
Plain text Properties
To obtain the value of the property, run STRINGIFY on the property node.
Stringification
The STRINGIFY function performs a text serialisation of an HTML node, with a few adjustments to implement the ABBR pattern. It uses a helper function, _STRINGIFY.
STRINGIFY
- First parameter: element to stringify,
e. - Second parameter: whether to perform value excerpting - default yes.
- Third parameter: whether to perform abbr pattern - default yes.
- If
eis an<abbr>or<acronym>element, and has atitleattribute, then return that attribute. - If you want to implement any proposed alternatives to the ABBR pattern, then here is the place to do so.
- If value excerpting is enabled:
- Create an empty list
S - Search for any descendant elements of
ewithclass="value". Put these into a listV. - For each element
vinV- Recursion: call STRINGIFY on
v, disabling value excerpting but enabling the ABBR pattern. Add the result toS.
- Recursion: call STRINGIFY on
- Concatenate the items in
Sto form a string. If this string is not empty, then return the string.
- Create an empty list
- Run _STRINGIFY on
e, trim excess white space from the result and return it.
_STRINGIFY
This is a somewhat simplified version of the real algorithm that I use. You probably want to refine it by adding better whitespace handling rules (e.g. line breaks after block elements, asterisks for list items, etc).
_STRINGIFY is called with one parameter, the element e to be stringified.
- If
eis text node (not an element), then return it. - If
eis an<img>tag, return thealttext. - If
eis an<input>tag, return the text of thevalueattribute. - If
eis an<br>tag, return a linebreak character. - If
eis an<del>tag, return a zero-length string. - Otherwise, create an empty list
S. - For each direct child node
cofe:- Run _STRINGIFY on
cand add the result toS.
- Run _STRINGIFY on
- Concatenate the items in list
Sand return them.
Destruction
In some cases it is necessary to render an element and its children opaque to later microformat processing. The two functions DESTROYER and DESTROY_ELEMENT deal with this.
DESTROYER
This function aims to find any microformats within an element and make them opaque to later parsing.
- First parameter: element to operate on,
e. - Second parameter: list of microformats to "save" from destruction,
S.
- Create a list
Mof all known compound microformat root classes. This list should include any microformats you know about, even if your parser does not include support for them. A sample list is included below. - For each item in
S: remove fromM. - For each descendent element
dofe(not includingeas a descendent of itself):- If the class list of
dincludes one or more of the class names inM, then:- Run the DESTROY_ELEMENT function on
d. - Modify the
classattribute ofd, removing any classes which appear inM.
- Run the DESTROY_ELEMENT function on
- If the class list of
Compound microformat root classes
- mfo
- vcard
- adr
- geo
- hreview
- xfolkentry
- hresume
- biota
- vcalendar
- vevent
- vtodo
- valarm
- vfreebusy
- hfeed
- hentry
- hslice
- haudio
- hmeasure
- hangle
- hmoney
- hlisting
- figure
- hproduct
- hmedia
Note that XOXO and XMDP are excluded from this list, as they are not compound microformats.
DESTROY_ELEMENT
The aim of this function is to make the innards of a particular element opaque to microformat parsing. The algorithm for "destroying" element e is as follows:
- Search for all elements which descend from
e. For each such elementd:- Set the
classattribute to the empty string. - Set the
relattribute to the empty string. - Set the
revattribute to the empty string.
- Set the
Hierarchy
Generally speaking, when looking for a property p and an element with class p is found, we look for the value of p in the following places and use the first value that has been found:
- Appropriate attributes on
p - Appropriate attributes on any descendants of
pwithclass="value"(using first in case of URLs/UIDs, concatenating otherwise) - Contents of all descendants of
pwithclass="value", concatenated - Contents of
p
Example
Looking for a URL, these should all be parsed as p="http://example.com/use-this".
<a class="p" href="http://example.com/use-this">...</a>
<span class="p">...<a class="value" href="http://example.com/use-this">...</a>...</span>
<span class="p">...<b class="value">http://example.com/use-this</b>...</span>
<span class="p">http://example.com/use-this</span>
Examples of Parsable HTML
Nested Microformats Examples
<div class="vcard">
<span class="fn">Jack Bauer</span>
<div class="agent">
Agent:
<div class="vcard">
<span class="fn">Chloe O'Brian</span>
</div>
</div>
</div>
<div class="vcard">
<span class="fn">Jack Bauer</span>
<div class="agent vcard">
Agent: <span class="fn">Chloe O'Brian</span>
</div>
</div>
OK, this one's getting complicated, but still works:
<div class="vcard">
<span class="fn">Queen Elizabeth II</span>'s
<span class="agent vcard">
<span class="role">representative in Australia</span> is
<span class="title">Governor General</span>
<a class="fn url" href="http://www.gg.gov.au/">Michael Jeffery</a>.
<span class="agent vcard">
You can contact him through his <span class="role">secretary</span>,
<span class="fn">Malcolm Hazell</span>.
</span>
</span>
</div>
Here's an "fn" inside an address:
<div class="vcard">
<p class="adr">
<span class="org">
<strong class="fn organization-unit extended-address">Chemistry Library</strong>,
<span class="organization-name extended-address">Fictional Institute of Science</span>,
</span>
<span class="street-address">123 Example Street</span>,
<span class="locality">Testville</span>.
</p>
</div>
Plain Text Examples
All the following examples parse as "Foobar baz".
<span class="note">Foobar baz</span>
<img class="note" alt="Foobar baz" src="foobar-baz.jpeg">
<span class="note"> <b class="value">Foo</b> quux <b class="value">bar baz</b> xyzzy. </span>
<abbr class="note" title="Foobar baz">FBB</abbr>
<div class="note"> <span class="value">Foobar</span> <abbr class="value" title=" baz">FBB</abbr> </div>
<span class="note">Foo<b>bar</b> baz</span>
URL Examples
The following are all parsed as the same URL.
<a class="url" href="http://example.com/page">Page</a>
<p class="url"> <a href="http://example.com/not-this">Not this</a> <a class="value" href="http://example.com/page">Page</a> </p>
<p class="url"> <a class="value" href="http://example.com/page">Page</a> <a class="value" href="http://example.com/not-this">Not this</a> </p>
<p class="url"> <span class="value">http://example.com/</span> Not this <span class="value">page</span> </p>
<img class="url" alt="foo" src="http://example.com/page"
longdesc="http://example.com/not-this">
<!-- (invalid, but works) --> <img class="url" longdesc="http://example.com/page">
<!-- (strange, but true) --> <p class="url"> <img src="http://example.com/not-this" alt="http://example.com/page"> </p>
<!-- (but, by contrast) --> <p class="url"> <img src="http://example.com/page" alt="http://example.com/not-this" class="value"> </p>
UID Examples
<a class="uid" href="http://example.com/page">Page</a>
<span class="uid" id="this">Not this</span>
<a class="uid" href="http://example.com/page" id="not-this">Page</a>
<div class="uid"> <span class="value" id="this">Not this</span> <a href="http://example.com/not-this" class="value">Not this</a> </div>
Datetime Examples
<time class="dtstart" datetime="2008-07-21">Monday</time>
<p class="dtstart"> <time class="value" datetime="2008-07-21">Monday</time> at <time class="value" datetime="21:30">9:30pm</time>. </p>
<p class="dtstart"> <time class="value" datetime="2008-07-21">Monday</time> at <time class="value" datetime="21:30">9:30pm</time> <time class="value" datetime="+1">(UK)</time>. </p>
<p class="dtstart"> <abbr class="value" title="2008-07-21">Monday</time> at <abbr class="value" title="T21:30">9:30pm</time> <abbr class="value" title="+0100">(UK)</time>. </p>
Combination Examples
<div class="vcard">
<h1 class="fn">Toby Inkster</h1>
<ins class="rev url" datetime="2008-07-21T21:30:00+0100">
I launched my <a href="http://example.com/" class="value">new
website</a> today with the help of
<span class="vcard">
<span class="role">graphic designer</span>
<span class="fn">Joe Bloggs</span>.
</span>
</ins>
</div>
Parsed as:
First hCard:
- fn = "Toby Inkster"
- rev = 2008-07-21T21:30:00+0100
- url = http://example.com/
Second hCard:
- fn = "Joe Bloggs"
- role = "graphic designer"