parsing-brainstorming: Difference between revisions
Revision as of 13:29, 21 July 2008
This is an attempt to get some of my thoughts on parsing, from practical experience implementing Cognition, out of my head and onto the wiki. Hopefully it will replace parsing once it reaches consensus, as this document is somewhat more detailed. It deals with how to parse the properties of a compound microformat once we have located the root element, which we shall call root. It only deals with simple properties which have no sub-properties, but 90% of properties do fall into this category. (And many of the others can be parsed by treating the property element as root and then finding sub-properties using the techniques on this page.) TobyInk
Note: as a courtesy, I'd like to ask people not to edit this page for the next few days, until I have gotten the initial version stable. Thanks. TobyInk 01:14, 21 Jul 2008 (PDT)
General Algorithm
- Make a copy of the DOM tree and operate on it.
- Implement the include pattern by removing any nodes with class="include" and replacing them with the node which they point to.
- Parse each property using the DOM clone.
There are three different categories of property — singular, plural and concatenated. Most properties are either singular or plural, but a handful are concatenated, such as entry-summary in hAtom. The general algorithm for parsing a property prop within root is:
- Create an empty array to store the value(s) of prop in. Call this A.
- Find all elements with class="prop" that are descended from root, taking mfo into account.
- For each element e, run this:
  - Find the value of e, using the techniques in the section below.
  - If the value of e is not NULL, add it to A.
  - If prop is a singular property and A is not empty, jump out of this foreach loop.
- If prop is a singular property, then its value is A[0].
- If prop is a plural property, then its values are A.
- If prop is a concatenated property, then its value is formed by concatenating the values of A together using joiner as a joining character. (The string joiner will be specified later.)
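The steps above can be sketched in Python as follows. This is only an illustration under some assumptions: the names parse_property, find_value and kind are hypothetical, the candidate elements are assumed to have been located already (including any mfo handling), and returning None for a singular property with no values is a choice made here, not something the algorithm specifies.

```python
from typing import Any, Callable, Iterable, List, Optional, Union


def parse_property(
    elements: Iterable[Any],
    find_value: Callable[[Any], Optional[str]],
    kind: str,
    joiner: str = "",
) -> Union[None, str, List[str]]:
    """Collect the value(s) of one property from its candidate elements.

    `elements` are the elements with class="prop" found under root.
    `find_value` stands in for the value-finding techniques described
    in the next section. `kind` is "singular", "plural" or "concatenated".
    """
    values: List[str] = []  # the array A
    for e in elements:
        value = find_value(e)
        if value is not None:
            values.append(value)
        # A singular property stops at the first usable value.
        if kind == "singular" and values:
            break

    if kind == "singular":
        # Assumption: an empty A yields None rather than an error.
        return values[0] if values else None
    if kind == "plural":
        return values
    if kind == "concatenated":
        return joiner.join(values)
    raise ValueError("unknown property kind: " + kind)
```

For example, with find_value set to the identity function, the input ["a", None, "b"] gives "a" for a singular property and ["a", "b"] for a plural one.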
Finding Values
There are at least five different types of property that can be parsed, each of which requires different techniques:
- HTML properties, such as entry-content in hAtom
- URI properties, such as url in hCard
- ID properties, such as uid in hCard
- Datetime properties, such as dtstart in hCalendar
- Plain text properties, such as title in hCard
Arguments can be made for duration properties and numeric properties to also have variations in the algorithm, but for now, we'll just treat them as plain text properties.
HTML Properties
These are the easiest to parse. Given an element e, just use the HTML representation of its DOM node. Some DOM implementations make this available as .outerHTML.
URI Properties
Certain HTML elements are capable of linking to other resources. The most obvious is <a> though there are many others. The following list of linking elements is derived from Perl's HTML::Tagset module:
{
'a' => ['href'],
'applet' => ['codebase', 'archive', 'code'],
'area' => ['href'],
'base' => ['href'],
'bgsound' => ['src'],
'blockquote' => ['cite'],
# 'body' => ['background'],
'del' => ['cite'],
'embed' => ['src', 'pluginspage'],
'form' => ['action'],
'frame' => ['src', 'longdesc'],
'iframe' => ['src', 'longdesc'],
# 'ilayer' => ['background'],
'img' => ['src', 'lowsrc', 'longdesc', 'usemap'],
'input' => ['src', 'usemap'],
'ins' => ['cite'],
'isindex' => ['action'],
'head' => ['profile'],
'layer' => ['src'], # 'background'
'link' => ['href'],
'object' => ['data', 'classid', 'codebase', 'archive', 'usemap'],
'q' => ['cite'],
'script' => ['src', 'for'],
# 'table' => ['background'],
# 'td' => ['background'],
# 'th' => ['background'],
# 'tr' => ['background'],
'xmp' => ['href'],
}
Note that some are commented out as they might be too counter-intuitive to implement!
If we're parsing an element e and looking for a URI, here is the algorithm we use:
- Set variable u to NULL.
- Search e for any descendant elements with class="value". Call this list V.
- Add the element e itself to the list V, at the front of the list.
- OUTER: for each element v from list V:
  - If v is a linking element from the above list
    - INNER: for each attribute a associated with the tag name of v in the above list
      - If a is set
        - Set u to the contents of a
        - Jump out of the OUTER loop.
- If u is not null, and is a relative URI, convert it to an absolute URI.
The URI has hopefully been found in u. If no URI has been found, then fall back to plain text parsing.
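A rough Python sketch of this search, using the standard library's xml.dom.minidom and urllib.parse.urljoin, might look like the following. The names find_uri, has_class, descendants and LINK_ATTRS are hypothetical, the attribute table is only a small subset of the list above, and the input is assumed to be well-formed XML. Returning None is the signal to fall back to plain text parsing.

```python
from urllib.parse import urljoin
from xml.dom import minidom

# Abbreviated subset of the linking-element table, for illustration only.
LINK_ATTRS = {
    "a": ["href"],
    "area": ["href"],
    "img": ["src", "longdesc"],
    "link": ["href"],
    "object": ["data"],
    "q": ["cite"],
}


def has_class(el, name):
    """True if the element's class attribute contains the given token."""
    return name in el.getAttribute("class").split()


def descendants(el):
    """Yield all descendant elements of el in document order."""
    for child in el.childNodes:
        if child.nodeType == child.ELEMENT_NODE:
            yield child
            yield from descendants(child)


def find_uri(e, base):
    """Return an absolute URI for property element e, or None."""
    # List V: e itself first, then any class="value" descendants.
    candidates = [e] + [d for d in descendants(e) if has_class(d, "value")]
    for v in candidates:  # OUTER loop
        for attr in LINK_ATTRS.get(v.tagName.lower(), []):  # INNER loop
            if v.hasAttribute(attr):
                # urljoin leaves absolute URIs as they are.
                return urljoin(base, v.getAttribute(attr))
    return None


doc = minidom.parseString(
    '<span class="url"><a class="value" href="/home">Home</a></span>'
)
uri = find_uri(doc.documentElement, "http://example.com/page")
```

Here the outer span is not a linking element, so the search moves on to the class="value" descendant and resolves its href against the base.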
UID Properties
UID properties are parsed similarly to URL properties, but with a slightly modified algorithm, allowing for UIDs to be specified in the id attribute. The following example has a UID of "http://example.com/page#foo".
<base href="http://example.com/page" /> <div class="uid" id="foo">...</div>
The modified algorithm used is:
- Set variable u to NULL.
- Search e for any descendant elements with class="value". Call this list V.
- Add the element e itself to the list V, at the front of the list.
- OUTER: for each element v from list V:
  - If v is a linking element from the above list
    - INNER: for each attribute a associated with the tag name of v in the above list
      - If a is set
        - Set u to the contents of a
        - Jump out of the OUTER loop.
  - If v has an id attribute set
    - Set u to the contents of id, with the character "#" prepended
    - Jump out of the OUTER loop.
- If u is not null, and is a relative URI, convert it to an absolute URI.
Again, if no u has been found by the algorithm, then fall back to parsing it as a plain text property.
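The extra id step can be illustrated on its own with a few lines of Python. The function name uid_from_id is hypothetical, and this sketch covers only the id fallback, not the linking-element search that precedes it. It reproduces the example above: an element with id="foo" on a page whose base is http://example.com/page.

```python
from urllib.parse import urljoin
from xml.dom import minidom


def uid_from_id(v, base):
    """The id fallback only: treat the id attribute as a fragment identifier."""
    if v.hasAttribute("id"):
        # "#foo" is a relative URI; resolve it against the document base.
        return urljoin(base, "#" + v.getAttribute("id"))
    return None


doc = minidom.parseString('<div class="uid" id="foo">...</div>')
uid = uid_from_id(doc.documentElement, "http://example.com/page")
```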
Datetime Properties
When parsing property prop, suppose class="prop" is found on element e. Then:
- If element e has an attribute datetime, then the content of that attribute is the value and the rest of these steps should be skipped.
- Create a list D, which is empty.
- Create a list V of elements with class="value".
- For each element v in V:
  - If v has an attribute datetime, then add the content of that attribute to D
  - Otherwise, run the STRINGIFY function on v and add the result to D
- If D is empty, then run the STRINGIFY function on e and let the result be the value, and skip the rest of these steps.
- If D contains only one item, and it looks like an ISO date or ISO datetime, then let that be the value, and skip the rest of these steps.
- If D contains two items, and the first looks like an ISO date, and the second like a time, concatenate them, joining with an upper case 'T', let that be the value, and skip the rest of these steps.
- If D contains three items, and the first looks like an ISO date, the second like a time, and the last like a timezone (may need normalisation), concatenate them, joining the first two with an upper case 'T' and the last one with no intervening character, let that be the value, and skip the rest of these steps.
- Concatenate all the items in D and let that be the value.
The final value should be interpreted as an ISO date or ISO datetime, as liberally as possible with regard to punctuation.
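The later steps, which combine the collected list D into one value, could be sketched in Python like this. The function name combine_datetime_parts and the regular expressions are assumptions: the page does not define what "looks like" an ISO date, time or timezone, so the patterns here are deliberately loose guesses. The empty-D case (falling back to STRINGIFY on e) and timezone normalisation are left to the caller.

```python
import re

# Assumed, fairly liberal patterns; the page does not specify them.
ISO_DATE = re.compile(r"\d{4}-?\d{2}-?\d{2}")
ISO_DATETIME = re.compile(
    r"\d{4}-?\d{2}-?\d{2}T\d{2}(:?\d{2}){0,2}(Z|[+-]\d{2}(:?\d{2})?)?"
)
TIME = re.compile(r"\d{1,2}(:?\d{2}){0,2}")
TIMEZONE = re.compile(r"Z|[+-]\d{0,2}:?\d{0,2}")


def combine_datetime_parts(d):
    """Combine the non-empty list D of strings into a single value."""
    if len(d) == 1 and (ISO_DATE.fullmatch(d[0]) or ISO_DATETIME.fullmatch(d[0])):
        return d[0]
    if len(d) == 2 and ISO_DATE.fullmatch(d[0]) and TIME.fullmatch(d[1]):
        return d[0] + "T" + d[1]
    if (
        len(d) == 3
        and ISO_DATE.fullmatch(d[0])
        and TIME.fullmatch(d[1])
        and TIMEZONE.fullmatch(d[2])
    ):
        # Date and time joined with 'T'; timezone appended directly.
        return d[0] + "T" + d[1] + d[2]
    # Final fallback: plain concatenation of everything in D.
    return "".join(d)
```

For instance, the parts "2008-07-21" and "13:29" combine to "2008-07-21T13:29".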
Normalizing Timezones
Where S is a sign (+ or -) and the letters a, b, c, d are numerals, then:
- Sa → S0a00
- Sab → Sab00
- Sabc → S0abc
- Sa: → S0a00
- Sab: → Sab00
- Sa:b → S0ab0
- S:ab → S00ab
- Sa:bc → S0abc
- Sab:c → Sabc0
- Sab:cd → Sabcd
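The rules in this list amount to padding the hour part on the left and the minute part on the right. A small Python sketch of that reading follows; the function name normalize_timezone is hypothetical, and the handling of inputs not covered by the list (returned unchanged) is an assumption.

```python
import re


def normalize_timezone(tz):
    """Normalise a sloppy timezone offset to the form Shhmm."""
    m = re.fullmatch(r"([+-])(\d*)(:?)(\d*)", tz)
    if not m:
        return tz  # e.g. "Z": not covered by the rules, left alone
    sign, first, colon, second = m.groups()
    if not (first or second):
        return tz  # a bare sign or "+:" carries no digits
    if colon:
        if len(first) > 2 or len(second) > 2:
            return tz
        hours = first.rjust(2, "0")     # "a" -> "0a", "" -> "00"
        minutes = second.ljust(2, "0")  # "b" -> "b0", "" -> "00"
    elif len(first) <= 2:
        hours, minutes = first.rjust(2, "0"), "00"  # Sa, Sab
    elif len(first) == 3:
        hours, minutes = "0" + first[0], first[1:]  # Sabc
    elif len(first) == 4:
        hours, minutes = first[:2], first[2:]       # already Sabcd
    else:
        return tz
    return sign + hours + minutes
```

So "+5" becomes "+0500", "-5:3" becomes "-0530", and "+:30" becomes "+0030".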
Plain text Properties
To obtain the value of the property, run STRINGIFY on the property node.
Stringification
The STRINGIFY function performs a text serialisation of an HTML node, with a few adjustments to implement the ABBR pattern. It uses a helper function, _STRINGIFY.
STRINGIFY
First parameter: element to stringify, e.
Second parameter: whether to perform value excerpting - default yes.
Third parameter: whether to perform abbr pattern - default yes.
- If e is an <abbr> or <acronym> element, and has a title attribute, then return that attribute.
- If value excerpting is enabled:
  - Create an empty list S
  - Search for any descendant elements of e with class="value". Put these into a list V.
  - For each element v in V
    - Recursion: call STRINGIFY on v, disabling value excerpting but enabling the abbr pattern. Add the result to S.
  - Concatenate the items in S to form a string. If this string is not empty, then return the string.
- Run _STRINGIFY on e, trim excess white space from the result and return it.
_STRINGIFY
This is a somewhat simplified version of the real algorithm that I use. You probably want to refine it by adding better whitespace handling rules (e.g. line breaks after block elements, asterisks for list items, etc).
_STRINGIFY is called with one parameter, the element e to be stringified.
- If e is a text node (not an element), then return it.
- If e is an <img> tag, return the alt text.
- If e is an <input> tag, return the text of the value attribute.
- If e is a <br> tag, return a linebreak character.
- If e is a <del> tag, return a zero-length string.
- Otherwise, create an empty list S.
- For each direct child node c of e:
  - Run _STRINGIFY on c and add the result to S.
- Concatenate the items in list S and return them.
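Both functions can be sketched in Python on top of xml.dom.minidom, which exposes text as separate nodes in the way the steps assume. This is an illustrative reading and not the Cognition code: the helper names has_class and descendants are hypothetical, "trim excess white space" is taken to mean stripping the ends only, the abbr check is skipped when the third parameter is off, and non-text, non-element nodes such as comments are assumed to contribute nothing.

```python
from xml.dom import Node, minidom


def has_class(el, name):
    """True if the element's class attribute contains the given token."""
    return name in el.getAttribute("class").split()


def descendants(el):
    """Yield all descendant elements of el in document order."""
    for child in el.childNodes:
        if child.nodeType == Node.ELEMENT_NODE:
            yield child
            yield from descendants(child)


def _stringify(node):
    """Helper: plain text serialisation of a single node."""
    if node.nodeType == Node.TEXT_NODE:
        return node.data
    if node.nodeType != Node.ELEMENT_NODE:
        return ""  # assumption: comments and the like are ignored
    tag = node.tagName.lower()
    if tag == "img":
        return node.getAttribute("alt")
    if tag == "input":
        return node.getAttribute("value")
    if tag == "br":
        return "\n"
    if tag == "del":
        return ""
    return "".join(_stringify(c) for c in node.childNodes)


def stringify(e, excerpt=True, abbr=True):
    """Text value of element e, with value excerpting and the ABBR pattern."""
    if abbr and e.tagName.lower() in ("abbr", "acronym") and e.hasAttribute("title"):
        return e.getAttribute("title")
    if excerpt:
        parts = [
            stringify(v, excerpt=False, abbr=True)
            for v in descendants(e)
            if has_class(v, "value")
        ]
        joined = "".join(parts)
        if joined:
            return joined
    return _stringify(e).strip()


doc = minidom.parseString(
    '<span class="tel"><span class="type">home</span> '
    '<abbr class="value" title="+441234567890">01234 567890</abbr></span>'
)
tel = stringify(doc.documentElement)
```

In this example value excerpting picks out the abbr element, and the ABBR pattern then returns its title, so the visible text "01234 567890" and the type label are both ignored.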