parsing-brainstorming

From Microformats Wiki
Revision as of 08:14, 21 July 2008 by TobyInk (talk | contribs)

This is an attempt to get some of my thoughts on parsing, drawn from practical experience implementing Cognition, out of my head and onto the wiki. Once it reaches consensus, it will hopefully replace parsing, as this document is somewhat more detailed. It deals with how to parse the properties of a compound microformat once we have located the root element, which we shall call root. It only covers simple properties, meaning those with no sub-properties, but about 90% of properties fall into this category. (Many of the others can be parsed by treating the property element as root and then finding sub-properties using the techniques on this page.) TobyInk

Note: as a courtesy, I'd like to ask people not to edit this page for the next few days, until I have gotten the initial version stable. Thanks. TobyInk 01:14, 21 Jul 2008 (PDT)

General Algorithm

  1. Make a copy of the DOM tree and operate on it.
  2. Implement the include pattern by removing any nodes with class="include" and replacing them with the node which they point to.
  3. Parse each property using the DOM clone.
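Steps 1 and 2 could be sketched in Python roughly as follows. This is a minimal sketch rather than a reference implementation: it uses the standard library's xml.dom.minidom, the helper names (apply_includes, find_by_id, has_class) are made up for illustration, and it assumes that an include points at its target through a fragment reference ("#id") in an href or data attribute.

```python
from xml.dom import minidom


def has_class(el, name):
    """True if the class attribute of el contains the token name."""
    return name in el.getAttribute("class").split()


def find_by_id(node, target_id):
    """Depth-first search for the element whose id attribute is target_id."""
    if node.nodeType == node.ELEMENT_NODE and node.getAttribute("id") == target_id:
        return node
    for child in node.childNodes:
        found = find_by_id(child, target_id)
        if found is not None:
            return found
    return None


def apply_includes(root):
    """Return a deep copy of root with class="include" nodes replaced."""
    clone = root.cloneNode(True)  # step 1: never touch the original tree
    includes = [el for el in clone.getElementsByTagName("*")
                if has_class(el, "include")]
    for inc in includes:
        # Assumption: the reference is a fragment in href or data.
        ref = inc.getAttribute("href") or inc.getAttribute("data")
        if not ref.startswith("#"):
            continue
        target = find_by_id(clone, ref[1:])
        if target is not None and inc.parentNode is not None:
            # step 2: swap the include for a copy of the node it points to
            inc.parentNode.replaceChild(target.cloneNode(True), inc)
    return clone
```

Property parsing (step 3) then runs against the returned clone. Note that the copied node keeps its id attribute, so the clone may contain duplicate ids; that is harmless for parsing but worth knowing about.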

There are three different categories of property — singular, plural and concatenated. Most properties are either singular or plural, but a handful are concatenated, such as entry-summary in hAtom. The general algorithm for parsing a property prop within root is:

  1. Create an empty array to store the value(s) of prop in. Call this A.
  2. Find all elements with class="prop" that are descended from root, taking mfo into account.
  3. For each element e, run this:
    1. Find the value of e, using the techniques in the section below.
    2. If the value of e is not NULL, add it to A.
    3. If prop is a singular property and A is not empty, jump out of the loop.
  4. If prop is a singular property, then its value is A[0], or NULL if A is empty.
  5. If prop is a plural property, then its values are A.
  6. If prop is a concatenated property, then its value is formed by concatenating the values in A, using joiner as the joining string. (The string joiner will be specified later.)
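The steps above can be sketched in Python as follows. The names parse_property, get_value and kind are hypothetical. Locating the elements (step 2, including any mfo handling) is left to the caller, and joiner must be supplied by the caller for concatenated properties, because its value has not been specified yet.

```python
def parse_property(elements, get_value, kind, joiner=None):
    """Collect the value(s) of one property.

    elements  -- the elements found in step 2, in document order
    get_value -- callback returning the value of an element, or None
    kind      -- "singular", "plural" or "concatenated"
    joiner    -- joining string, required for concatenated properties
    """
    values = []  # step 1: the array A
    for e in elements:  # step 3
        value = get_value(e)
        if value is not None:
            values.append(value)
        if kind == "singular" and values:
            break  # the first usable value wins
    if kind == "singular":
        return values[0] if values else None
    if kind == "plural":
        return values
    if kind == "concatenated":
        if joiner is None:
            raise ValueError("joiner must be given for concatenated properties")
        return joiner.join(values)
    raise ValueError("unknown property kind: %r" % (kind,))
```

Passing get_value as a callback keeps this loop independent of the property type, so the same function can be reused with each of the value-finding techniques.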

Finding Values

There are at least five different types of property that can be parsed, each of which requires different techniques:

  • HTML properties, such as entry-content in hAtom
  • URI properties, such as url in hCard
  • UID properties, such as uid in hCard
  • Datetime properties, such as dtstart in hCalendar
  • Plain text properties, such as title in hCard

Arguments can be made for duration properties and numeric properties to also have variations in the algorithm, but for now, we'll just treat them as plain text properties.

HTML Properties

These are the easiest to parse. Given an element e, just use the HTML representation of its DOM node. Some DOM implementations make this available as .outerHTML.
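As a rough illustration in Python, the standard library's xml.dom.minidom has no .outerHTML, but its toxml() method serializes an element together with its children. The function name html_value is hypothetical, and the output is an XML serialization, which only approximates what a browser's .outerHTML would give.

```python
from xml.dom import minidom


def html_value(e):
    """Return the markup of element e, including e's own start and end tags."""
    return e.toxml()
```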

URI Properties

Certain HTML elements are capable of linking to other resources. The most obvious is <a>, though there are many others. The following list of linking elements is derived from Perl's HTML::Tagset module:

{
	'a'       => ['href'],
	'applet'  => ['codebase', 'archive', 'code'],
	'area'    => ['href'],
	'base'    => ['href'],
	'bgsound' => ['src'],
	'blockquote' => ['cite'],
#	'body'    => ['background'],
	'del'     => ['cite'],
	'embed'   => ['src', 'pluginspage'],
	'form'    => ['action'],
	'frame'   => ['src', 'longdesc'],
	'iframe'  => ['src', 'longdesc'],
#	'ilayer'  => ['background'],
	'img'     => ['src', 'lowsrc', 'longdesc', 'usemap'],
	'input'   => ['src', 'usemap'],
	'ins'     => ['cite'],
	'isindex' => ['action'],
	'head'    => ['profile'],
	'layer'   => ['src'], # 'background'
	'link'    => ['href'],
	'object'  => ['data', 'classid', 'codebase', 'archive', 'usemap'],
	'q'       => ['cite'],
	'script'  => ['src', 'for'],
#	'table'   => ['background'],
#	'td'      => ['background'],
#	'th'      => ['background'],
#	'tr'      => ['background'],
	'xmp'     => ['href'],
}

Note that some are commented out as they might be too counter-intuitive to implement!

If we're parsing an element e and looking for a URI, here is the algorithm we use:

  1. Set variable u to NULL.
  2. Search e for any descendant elements with class="value". Call this list V.
  3. Add the element e itself to the list V, at the front of the list.
  4. OUTER: for each element v from list V:
    1. If v is a linking element from the above list
      1. INNER: for each attribute a associated with the tag name of v in the above list
        1. If a is set
          1. Set u to the contents of a
          2. Jump out of the OUTER loop.
  5. If u is not NULL, and is a relative URI, convert it to an absolute URI.

If the algorithm succeeds, u now holds the URI. If no URI has been found, then fall back to plain text parsing.
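A possible Python rendering of the URI algorithm is shown below. It is a sketch under assumptions: LINK_ATTRS is a transcription of the uncommented entries in the list above, the names uri_value and base_uri are invented here, "a is set" is interpreted as "the attribute is present", and the caller is assumed to already know the base URI of the document.

```python
from urllib.parse import urljoin
from xml.dom import minidom

# Linking elements and their URI-bearing attributes, in priority order.
LINK_ATTRS = {
    "a": ["href"],
    "applet": ["codebase", "archive", "code"],
    "area": ["href"],
    "base": ["href"],
    "bgsound": ["src"],
    "blockquote": ["cite"],
    "del": ["cite"],
    "embed": ["src", "pluginspage"],
    "form": ["action"],
    "frame": ["src", "longdesc"],
    "iframe": ["src", "longdesc"],
    "img": ["src", "lowsrc", "longdesc", "usemap"],
    "input": ["src", "usemap"],
    "ins": ["cite"],
    "isindex": ["action"],
    "head": ["profile"],
    "layer": ["src"],
    "link": ["href"],
    "object": ["data", "classid", "codebase", "archive", "usemap"],
    "q": ["cite"],
    "script": ["src", "for"],
    "xmp": ["href"],
}


def uri_value(e, base_uri):
    """Return the absolute URI found on e or a class="value" descendant."""
    # Steps 2 and 3: e itself first, then descendants with class "value".
    candidates = [e] + [d for d in e.getElementsByTagName("*")
                        if "value" in d.getAttribute("class").split()]
    for v in candidates:  # OUTER
        for attr in LINK_ATTRS.get(v.tagName.lower(), ()):  # INNER
            if v.hasAttribute(attr):
                # Step 5: urljoin leaves absolute URIs unchanged.
                return urljoin(base_uri, v.getAttribute(attr))
    return None  # caller should fall back to plain text parsing
```

Returning None rather than raising makes the fall-back to plain text parsing a simple check at the call site.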

UID Properties

UID properties are parsed similarly to URI properties, but with a slightly modified algorithm that allows UIDs to be specified in the id attribute. The following example has a UID of "http://example.com/page#foo".

<base href="http://example.com/page" />
<div class="uid" id="foo">...</div>

The modified algorithm used is:

  1. Set variable u to NULL.
  2. Search e for any descendant elements with class="value". Call this list V.
  3. Add the element e itself to the list V, at the front of the list.
  4. OUTER: for each element v from list V:
    1. If v is a linking element from the list in the URI Properties section
      1. INNER: for each attribute a associated with the tag name of v in that list
        1. If a is set
          1. Set u to the contents of a
          2. Jump out of the OUTER loop.
    2. If v has an id attribute set
      1. Set u to the contents of id, with the character "#" prepended
      2. Jump out of the OUTER loop.
  5. If u is not NULL, and is a relative URI, convert it to an absolute URI.

Again, if no u has been found by the algorithm, then fall back to parsing it as a plain text property.
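The modified algorithm might look like this in Python. As before, this is only a sketch: uid_value and base_uri are made-up names, and LINK_ATTRS is deliberately abbreviated to three entries to keep the example short (a real parser would use the full list of linking elements).

```python
from urllib.parse import urljoin
from xml.dom import minidom

# Abbreviated for the example; see the full list of linking elements.
LINK_ATTRS = {
    "a": ["href"],
    "img": ["src", "lowsrc", "longdesc", "usemap"],
    "object": ["data", "classid", "codebase", "archive", "usemap"],
}


def uid_value(e, base_uri):
    """Return the absolute UID found on e or a class="value" descendant."""
    candidates = [e] + [d for d in e.getElementsByTagName("*")
                        if "value" in d.getAttribute("class").split()]
    for v in candidates:  # OUTER
        # Linking attributes take priority over the id attribute.
        for attr in LINK_ATTRS.get(v.tagName.lower(), ()):  # INNER
            if v.hasAttribute(attr):
                return urljoin(base_uri, v.getAttribute(attr))
        if v.hasAttribute("id"):
            # "#" + id is a relative reference, resolved against the base.
            return urljoin(base_uri, "#" + v.getAttribute("id"))
    return None  # caller should fall back to plain text parsing


doc = minidom.parseString('<div class="uid" id="foo">...</div>')
print(uid_value(doc.documentElement, "http://example.com/page"))
```

With the base URI from the example above, the id "foo" resolves to the UID given in the text, "http://example.com/page#foo".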

Datetime Properties

(Hopefully I'll write this later today.)

Plain text Properties

(Hopefully I'll write this later today.)

Stringification

(Hopefully I'll write this later today.)