# Parsing

This is a braindump. This page will need cleaning up, so take everything with a grain of salt at the moment.

For now, start by reading hCard parsing, as it has more detail and has been more thoroughly reviewed and implemented.

- I've documented my own thoughts on parsing which flesh out some of the ideas and go into more detail on algorithms and stuff. TobyInk 06:33, 21 Jul 2008 (PDT)

## By Element

This is a matrix of element and type. For each element, it should describe under what circumstances a value is extracted and where that value comes from. The list of elements is taken from http://www.w3.org/TR/html4/index/elements.html, with some Microformats in HTML5 elements added, in particular those with special parsing needs.

See Semantic HTML for a definitive list of elements.

### data types

(This probably needs a better name.) There are two types in microformats: protocol types and strings. Strings may hold integers (such as ratings), plain text (such as a note), or datetimes (such as dtstart). Protocol types are UIDs, URLs, email addresses, and sometimes telephone and fax numbers.

If a cell contains a comma-separated list, the sources are listed in order of precedence. For instance, the ABBR element is @title,node-value: if @title is present, it is used; if not, the parser falls back to node-value; and if there is no node-value, the value is NULL.
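The fallback order described above can be sketched as follows. This is only an illustration, not a reference implementation: the `SOURCES` mapping copies a handful of rows from the matrix, `extract_value` is a hypothetical helper name, the input is assumed to be a well-formed XHTML fragment, and an empty node-value is assumed to count as missing.

```python
import xml.etree.ElementTree as ET

# Hypothetical subset of the matrix: element -> (protocol sources, string sources).
SOURCES = {
    "a": (["@href", "node-value"], ["node-value"]),
    "abbr": (["@title", "node-value"], ["@title", "node-value"]),
    "img": (["@src"], ["@alt"]),
    "time": (["@datetime", "node-value"], ["@datetime", "node-value"]),
}
# Elements not listed above fall back to node-value for both types.
DEFAULT = (["node-value"], ["node-value"])


def extract_value(element, kind):
    """Return the first available source for `kind` ("protocol" or "string"), else None."""
    protocol_sources, string_sources = SOURCES.get(element.tag.lower(), DEFAULT)
    sources = protocol_sources if kind == "protocol" else string_sources
    for source in sources:
        if source.startswith("@"):
            # Attribute source: None when the attribute is absent.
            value = element.get(source[1:])
        else:
            # node-value: concatenated text content; empty text counts as missing (assumption).
            value = "".join(element.itertext()) or None
        if value is not None:
            return value
    return None  # corresponds to NULL in the prose above


abbr = ET.fromstring('<abbr title="2008-07-21">21 July</abbr>')
print(extract_value(abbr, "string"))
```

With @title present, the call returns the attribute value; removing the attribute makes it fall back to the element's text.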

| Element | protocol | string |
|---|---|---|
| A | @href,node-value | node-value |
| ABBR | @title,node-value | @title,node-value |
| ACRONYM | @title,node-value | @title,node-value |
| ADDRESS | node-value | node-value |
| APPLET | ??? | ???(node-value) |
| AREA | @href,node-value | node-value |
| B | node-value | node-value |
| BASE (valid?) | @href | |
| BASEFONT (valid?) | | |
| BDO (valid?) | | |
| BIG | node-value | node-value |
| BLOCKQUOTE | @cite?,node-value | node-value |
| BODY | node-value | node-value |
| BR (valid?) | | |
| BUTTON | @value? | @value? |
| CAPTION | node-value | node-value |
| CENTER | node-value | node-value |
| CITE | node-value | node-value |
| CODE | node-value | node-value |
| COL | node-value | node-value |
| COLGROUP | node-value | node-value |
| DATA | @value,node-value | @value,node-value |
| DD | node-value | node-value |
| DEL | @cite,node-value | node-value |
| DFN | node-value | node-value |
| DIR | node-value | node-value |
| DIV | node-value | node-value |
| DL | node-value | node-value |
| DT | node-value | node-value |
| EM | node-value | node-value |
| FIELDSET | node-value | node-value |
| FONT | node-value | node-value |
| FORM | @action?,node-value | node-value |
| FRAME | @src?,node-value | node-value |
| FRAMESET | node-value | node-value |
| H1 | node-value | node-value |
| H2 | node-value | node-value |
| H3 | node-value | node-value |
| H4 | node-value | node-value |
| H5 | node-value | node-value |
| H6 | node-value | node-value |
| HEAD (valid?) | node-value | node-value |
| HR (valid?) | node-value | node-value |
| HTML (valid?) | node-value | node-value |
| I | node-value | node-value |
| IFRAME | @src? | node-value |
| IMG | @src | @alt |
| INPUT | @value? | @value? |
| INS | @cite,node-value | node-value |
| ISINDEX (valid?) | | |
| KBD | node-value | node-value |
| LABEL | node-value | node-value |
| LEGEND | node-value | node-value |
| LI | node-value | node-value |
| LINK (valid?) | | |
| MAP | node-value | node-value |
| MENU (valid?) | | |
| META (valid?) | | |
| NOFRAMES | node-value | node-value |
| NOSCRIPT | node-value | node-value |
| OBJECT | @data,node-value | node-value |
| OL | node-value | node-value |
| OPTGROUP (valid?) | node-value | node-value |
| OPTION | node-value | node-value |
| P | node-value | node-value |
| PARAM (?) | node-value | node-value |
| PRE | node-value | node-value |
| Q | node-value | node-value |
| S | node-value | node-value |
| SAMP | node-value | node-value |
| SCRIPT | node-value | node-value |
| SELECT (valid?) | node-value | node-value |
| SMALL | node-value | node-value |
| SPAN | node-value | node-value |
| STRIKE | node-value | node-value |
| STRONG | node-value | node-value |
| STYLE (valid?) | node-value | node-value |
| SUB | node-value | node-value |
| SUP | node-value | node-value |
| TABLE (valid?) | node-value | node-value |
| TBODY | node-value | node-value |
| TD | node-value | node-value |
| TEXTAREA | node-value | node-value |
| TFOOT | node-value | node-value |
| TH | node-value | node-value |
| THEAD | node-value | node-value |
| TIME | @datetime,node-value | @datetime,node-value |
| TITLE | node-value | node-value |
| TR | node-value | node-value |
| TT | node-value | node-value |
| U | node-value | node-value |
| UL | node-value | node-value |
| VAR | node-value | node-value |

## New Elements

When a new Semantic HTML element is introduced, follow these steps to update microformats to handle the new element.

1. add the element to SemanticHTML MediaWiki extension, which enables creating wiki live markup examples
2. update parsing rules accordingly (on Parsing and hCard parsing wiki pages)
3. create/iterate actual live markup examples on wiki with real world content examples
4. implement experimental new parsing support, test on wiki examples. optionally deploy for broader testing (e.g. dev.)
5. if results are as expected/predicted, create test case from example markup with results as expected. if not then re-assess how parsing should work and go to step 2.
6. add parsing support to additional implementations
7. have individual implementations test/deploy broadly as they see fit
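As one possible shape for the test case mentioned in the steps above, the sketch below pairs example markup with its expected parse result and checks a parser against it. Everything here is hypothetical: `ParsingTestCase`, `experimental_parse`, and `run_case` are illustrative names, and the stand-in parser only handles class names on a well-formed XHTML fragment, using @datetime on TIME before falling back to node-value.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class ParsingTestCase:
    name: str
    markup: str      # example markup taken from the wiki
    expected: dict   # property name -> expected value


def experimental_parse(markup):
    """Hypothetical stand-in parser: maps each class name to the element's value."""
    result = {}
    for el in ET.fromstring(markup).iter():
        for cls in el.get("class", "").split():
            if el.tag == "time" and el.get("datetime") is not None:
                result[cls] = el.get("datetime")
            else:
                result[cls] = "".join(el.itertext())
    return result


def run_case(case, parse=experimental_parse):
    """Return (passed, actual) so that a mismatch can be inspected and the rules re-assessed."""
    actual = parse(case.markup)
    return actual == case.expected, actual


case = ParsingTestCase(
    name="time element with datetime attribute",
    markup='<time class="dtstart" datetime="2008-07-21">21 July</time>',
    expected={"dtstart": "2008-07-21"},
)
passed, actual = run_case(case)
print(case.name, "passed" if passed else f"failed: {actual}")
```

Keeping the expected result next to the markup lets the same case be reused when parsing support is added to additional implementations.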