parsing-brainstorming

From Microformats Wiki
parsing-brainstorming /
Revision as of 01:17, 25 July 2008 by TobyInk (talk | contribs) (→‎General Algorithm: add link)
Jump to navigation Jump to search

Introduction

This is an attempt to get some of my thoughts on parsing, from practical experience implementing Cognition, out of my head and onto the wiki. Hopefully it will replace Parsing once it reaches consensus, as this document is somewhat more detailed. It deals with how to parse the properties of a compound microformat once we have located the root element, which we shall call root. It deals with simple properties which have no sub-properties (such as fn in hCard), and with properties which may include an embedded microformat (such as agent in hCard) — 95% of properties do fall into these categories. (Many of the others can be parsed by treating the property element as root and then finding sub-properties using the techniques on this page, but some do occasionally require special handling.) TobyInk

Note: as a courtesy, I'd like to ask people not to edit this page for the next few days, until I have gotten the initial version stable. Thanks. TobyInk 01:14, 21 Jul 2008 (PDT)

Contributors

Primary Author: Toby Inkster (TobyInk)

Licence

The author of this page releases the algorithms described into the public domain.

Additionally, a working implementation of these algorithms (in Perl) is available as part of Cognition. The implementation is not released into the public domain, but is licensed under the GNU General Public License (version 3).

General Algorithm

  1. Make a copy of the DOM tree to operate on. All future references to DOM traversal and manipulation refer to this clone.
  2. Implement the include pattern by removing any nodes with class="include" and replacing them with the node which they point to. (Unless the node pointed to is a parent node of the node with class="include"!!)
  3. Parse all properties that may contain nested microformats. For example within hCard, the agent property may contain a nested hCard. For each property prop which may contain a nested microformat:
    1. Create an empty array to store the value(s) of prop in. Call this A.
    2. OUTER: Find all elements with class="prop" that are descended from root. For each of those elements e:
      1. INNER: For each microformat root class name that might be present (e.g. within an hCalendar event's location property you might reasonably expect to find an hCard, and adr or a geo!):
        1. Create an empty list F.
        2. Search for any elements within e which have the microformat root class name. Bear in mind that perhaps this element might be e — this is allowed. For each such element, e2:
          1. Parse e2 as an embedded microformat and place the result in F if it is valid.
          2. Run the DESTROY_ELEMENT function on e2
        3. If F is not empty:
          1. Move the first item in F to A
          2. Jump out of the INNER loop.
      2. If A is not empty:
        1. Change the class attribute on e, removing prop from the class list.
        2. If property prop is singular, jump out of the OUTER loop.
    3. If prop is a singular property, let A[0] be its value. If plural, let A be the list of its values.
  4. Run the DESTROYER function on root.
    • Depending on what microformat you're parsing, you will want to pass a further parameter to DESTROYER indicating a list of microformats to "save" from destruction. This allows for the simple string/url/datetime/etc properties to be included within nested microformats. So far, the only case I've found where this is beneficial is when parsing hCard, it is useful to save adr and geo from destruction. This enables the fn property to be set on address components, e.g.:
      <div class="vcard">
 <p class="adr">
   <span class="org">
     <strong class="fn organization-unit extended-address">Chemistry Library</strong>,
     <span class="organization-name extended-address">Fictional Institute of Science</span>,
   </span>
   <span class="street-address">123 Example Street</span>,
   <span class="locality">Testville</span>.
 </p>

</div>

  1. Parse all properties. There are three different categories of property — singular, plural and concatenated. Most properties are either singular or plural, but a handful are concatenated, such as entry-summary in hAtom 0.1. For each property prop within root:
    1. Create an empty array to store the value(s) of prop in. Call this A.
    2. Find all elements with class="prop" that are descended from root.
    3. For each element e, run this:
      1. <a href="#Finding_Values">Find the value of e</a>, using the techniques in the section below.
      2. If the value of e is not NULL, add it to A
      3. If the prop is a singular property and A is not empty, jump out of this foreach loop.
    4. If prop is a singular property, then its value is A[0].
    5. If prop is a plural property, then its values are A.
    6. If prop is a concatenated property, then its values are formed by concatenating the values of A together.

Finding Values

There are at least five different types of property that can be parsed, each of which requires different techniques:

  • HTML properties, such as entry-content in hAtom
  • URI properties, such as url in hCard 1.0
  • ID properties, such as uid in hCard
  • Datetime properties, such as dtstart in hCalendar 1.0
  • Plain text properties, such as title in hCard

Arguments can be made for duration properties and numeric properties to also have variations in the algorithm, but for now, we'll just treat them as plain text properties.

Generally speaking, the mechanism for going from an element e to a value is to use the first "hit" from the following:

  1. Look at relevant attributes (e.g. datetime if we're looking for a date, or href if we're looking for a URI).
  2. Look for any descendants with class="value".
  3. Look at the node contents.

This is described in more detail below, for each type of property.

HTML Properties

These are the easiest to parse. Given an element e, just use the HTML representation of its DOM node. Some DOM implementations make this available as .outerHTML.

URI Properties

Certain HTML elements are capable of linking to other resources. The most obvious is <a> though there are many others. The following list of linking elements is derived from Perl's HTML::Tagset module:

{
	'a'       => ['href'],
	'applet'  => ['codebase', 'archive', 'code'],
	'area'    => ['href'],
#	'base'    => ['href'],
	'bgsound' => ['src'],
	'blockquote' => ['cite'],
#	'body'    => ['background'],
	'del'     => ['cite'],
	'embed'   => ['src', 'pluginspage'],
	'form'    => ['action'],
	'frame'   => ['src', 'longdesc'],
	'iframe'  => ['src', 'longdesc'],
#	'ilayer'  => ['background'],
	'img'     => ['src', 'lowsrc', 'longdesc', 'usemap'],
	'input'   => ['src', 'usemap'],
	'ins'     => ['cite'],
	'isindex' => ['action'],
#	'head'    => ['profile'],
	'layer'   => ['src'], # 'background'
	'link'    => ['href'],
	'object'  => ['data', 'classid', 'codebase', 'archive', 'usemap'],
	'q'       => ['cite'],
	'script'  => ['src', 'for'],
#	'table'   => ['background'],
#	'td'      => ['background'],
#	'th'      => ['background'],
#	'tr'      => ['background'],
	'xmp'     => ['href'],
}

Note that some are commented out as they might be too counter-intuitive to implement!

If we're parsing an element e and looking for a URI, here is the algorithm we use:

  1. Set variable u to NULL.
  2. Search e for any descendent elements with class="value". Call this list V.
  3. Add the element e itself to the list V, at the front of the list.
  4. OUTER: for each element v from list V:
    1. If v is a linking element from the above list
      1. INNER: for each attribute a associated that the tag name of v in the above list
        1. If a is set
          1. Set u to the contents of a
          2. Jump out of the OUTER loop.
  5. If u is not null, and is a relative URI, convert it to an absolute URI.

The URI has hopefully been found in u. If no URI has been found, then fall back to plain text parsing.

UID Properties

UID properties are parsed similarly to URL properties, but with a slightly modified algorithm, allowing for UIDs to be specified in the id attribute. The following example has a UID of "http://example.com/page#foo".

<base href="http://example.com/page" />
<div class="uid" id="foo">...</div>

The modified algorithm used is:

  1. Set variable u to NULL.
  2. Search e for any descendent elements with class="value". Call this list V.
  3. Add the element e itself to the list V, at the front of the list.
  4. OUTER: for each element v from list V:
    1. If v is a linking element from the above list
      1. INNER: for each attribute a associated that the tag name of v in the above list
        1. If a is set
          1. Set u to the contents of a
          2. Jump out of the OUTER loop.
    2. If v has an id attribute set
      1. Set u to the contents of id, with the character "#" prepended
      2. Jump out of the OUTER loop.
  5. If u is not null, and is a relative URI, convert it to an absolute URI.

Again, if no u has been found by the algorithm, then fall back to parsing it as a plain text property.

Datetime Properties

Parsing property prop, if class="prop" is found on element e.

  1. If element e has an attribute datetime, then the content of that attribute is the value and the rest of these steps should be skipped.
  2. Create a list D, which is empty.
  3. Create a list V of elements with class="value".
  4. For each element v in V:
    1. If v has an attribute datetime, then add the content of that attribute to D
    2. Otherwise, run the STRINGIFY function on v and add the result to D
  5. If D is empty, then run the STRINGIFY function on e and let the result be the value, and skip the rest of these steps.
  6. If D contains only one item, and it looks like an ISO date or ISO datetime, then let that be the value, and skip the rest of these steps.
  7. If D contains two items, and the first looks like an ISO date, and the second like a time, concatenate them, joining with an upper case 'T', let that be the value, and skip the rest of these steps.
  8. If D contains three items, and the first looks like an ISO date, the second like a time, and the last like a timezone (may need normalisation), concatenate them, joining the first two with an upper case 'T' and the last one with no intervening character, let that be the value, and skip the rest of these steps.
  9. Concatenate all the items in D and let that be the value.

The final value should be interpreted as liberally as possible with regards to punctuation as an ISO date or ISO datetime.

Normalizing Timezones

Where S is a sign (+ or -) and the letters a, b, c, d are numerals, then:

  • Sa → S0a00
  • Sab → Sab00
  • Sabc → S0abc
  • Sa: → S0a00
  • Sab: → Sab00
  • Sa:b → S0ab0
  • S:ab → S00ab
  • Sa:bc → S0abc
  • Sab:c → Sabc0
  • Sab:cd → Sabcd

Plain text Properties

To obtain the value of the property, run STRINGIFY on the property node.

Stringification

The STRINGIFY function performs a text serialisation of an HTML node, with a few adjustments to implement the ABBR pattern. It uses a helper function, _STRINGIFY.

STRINGIFY

  • First parameter: element to stringify, e.
  • Second parameter: whether to perform value excerpting - default yes.
  • Third parameter: whether to perform abbr pattern - default yes.
  1. If e is an <abbr> or <acronym> element, and has a title attribute, then return that attribute.
  2. If you want to implement any proposed alternatives to the ABBR pattern, then here is the place to do so.
  3. If value excerpting is enabled:
    1. Create an empty list S
    2. Search for any descendant elements of e with class="value". Put these into a list V.
    3. For each element v in V
      1. Recursion: call STRINGIFY on v, disabling value excerpting but enabling the ABBR pattern. Add the result to S.
    4. Concatenate the items in S to form a string. If this string is not empty, then return the string.
  4. Run _STRINGIFY on e, trim excess white space from the result and return it.

_STRINGIFY

This is a somewhat simplified version of the real algorithm that I use. You probably want to refine it by adding better whitespace handling rules (e.g. line breaks after block elements, asterisks for list items, etc).

_STRINGIFY is called with one parameter, the element e to be stringified.

  1. If e is text node (not an element), then return it.
  2. If e is an <img> tag, return the alt text.
  3. If e is an <input> tag, return the text of the value attribute.
  4. If e is an <br> tag, return a linebreak character.
  5. If e is an <del> tag, return a zero-length string.
  6. Otherwise, create an empty list S.
  7. For each direct child node c of e:
    1. Run _STRINGIFY on c and add the result to S.
  8. Concatenate the items in list S and return them.

Destruction

In some cases it is necessary to render an element and its children opaque to later microformat processing. The two functions DESTROYER and DESTROY_ELEMENT deal with this.

DESTROYER

This function aims to find any microformats within an element and make them opaque to later parsing.

  • First parameter: element to operate on, e.
  • Second parameter: list of microformats to "save" from destruction, S.
  1. Create a list M of all known compound microformat root classes. This list should include any microformats you know about, even if your parser does not include support for them. A sample list is included below.
  2. For each item in S: remove from M.
  3. For each descendent element d of e (not including e as a descendent of itself):
    1. If the class list of d includes one or more of the class names in M, then:
      1. Run the DESTROY_ELEMENT function on d.
      2. Modify the class attribute of d, removing any classes which appear in M.

Compound microformat root classes

  • mfo
  • vcard
  • adr
  • geo
  • vcalendar
  • vevent
  • vtodo
  • valarm
  • vfreebusy
  • hfeed
  • hentry
  • hslice
  • hreview
  • hresume
  • xfolkentry
  • biota
  • haudio
  • hmeasure
  • hangle
  • hmoney
  • hlisting

Note that XOXO and XMDP are excluded from this list, as they are not compound microformats.

DESTROY_ELEMENT

The aim of this function is to make the innards of a particular element opaque to microformat parsing. The algorithm for "destroying" element e is as follows:

  1. Search for all elements which descend from e. For each such element d:
    1. Set the class attribute to the empty string.
    2. Set the rel attribute to the empty string.
    3. Set the rev attribute to the empty string.

Examples of Parsable HTML

URL Examples

The following are all parsed as the same URL.

<a class="url" href="http://example.com/page">Page</a>
<p class="url">
  <a href="http://example.com/not-this">Not this</a>
  <a class="value" href="http://example.com/page">Page</a>
</p>
<p class="url">
  <a class="value" href="http://example.com/page">Page</a>
  <a class="value" href="http://example.com/not-this">Not this</a>
</p>
<p class="url">
  <span class="value">http://example.com/</span>
  Not this
  <span class="value">page</span>
</p>
<img class="url" alt="foo" src="http://example.com/page"
    longdesc="http://example.com/not-this">
<!-- (invalid, but works) -->
<img class="url" longdesc="http://example.com/page">
<!-- (strange, but true) -->
<p class="url">
  <img src="http://example.com/not-this"
  alt="http://example.com/page">
</p>

Datetime Examples

<time class="dtstart" datetime="2008-07-21">Monday</time>
<p class="dtstart">
  <time class="value" datetime="2008-07-21">Monday</time> at
  <time class="value" datetime="21:30">9:30pm</time>.
</p>
<p class="dtstart">
  <time class="value" datetime="2008-07-21">Monday</time> at
  <time class="value" datetime="21:30">9:30pm</time>
  <time class="value" datetime="+1">(UK)</time>.
</p>
<p class="dtstart">
  <abbr class="value" title="2008-07-21">Monday</time> at
  <abbr class="value" title="T21:30">9:30pm</time>
  <abbr class="value" title="+0100">(UK)</time>.
</p>

Combination Examples

<ins class="rev url" datetime="2008-07-21T21:30:00+0100">
  I launched my <a href="http://example.com/" class="value">new
  website</a> today.
</ins>

Parsed as: