microformats2-parsing-brainstorming
Author: Ben Ward
Microformats 2 proposes a new, all-encompassing syntax modification based on prefixes that will allow microformats to be parsed from pages by processors without prior knowledge of a vocabulary. The core components of this model are quite simple to implement, but a number of conflicts emerge with the functionality of existing microformats parsers, and these need to be handled. This page documents a proposed model to separate these concerns clearly, in a way that can be applied both to the documentation of generic microformats parsing rules and to the documentation of individual vocabularies.
Parsing Microformats 2.0 Syntax: Extraction vs. Interpretation
A microformats ‘1.0’ parser performs the following function:
- Given a piece of HTML content, discover a known microformat, extract it, apply various extraction patterns based upon the HTML mark-up used (e.g. include pattern, abbr patterns, date-time patterns, value-title pattern), apply various content optimisations where applicable, and return the result in an object native to the programming language.
This is performing two types of function: Extraction of data from an HTML document or fragment, and interpretation and optimisation of that content to match the rules set out by a vocabulary specification.
It is only possible to write a generic parser that covers the first half of this task: Extraction, and application of global rules based on HTML elements and patterns common to all formats.
The purpose of a generic parser (as supported by use cases such as search engines, and other crawlers) is:
To provide a way for tools to extract rich data from a page for native storage, such that the data may be interpreted later by applications. This allows microformats to be crawled, and indexed, and removes the need to include complex HTML parsing within every implementation of microformat data.
Microformats will continue to define various vocabulary-specific optimisations as part of the design to be optimised for authors. For example: the fn pattern in hcard, or the lat;long pattern in geo, as well as default values for properties, such as the maximum rating in an hreview.
Microformats 2.0 should refer only to extraction of microformats. Vocabularies should in turn document their appropriate optimisations, which will need to be applied by implementations, or a companion to an extractor, which I'll refer to here as an ‘interpreter’.
A microformats 2.0 ‘extractor’, in combination with the functionality of a domain and format-aware ‘interpreter’ (either another shared component, or part of the implementation itself) would be equivalent to a microformats 1.0 ‘parser.’
Parsing Literal Values
It is proposed for microformats 2.0 that all microformats be parseable from just their root element, e.g.
Ben Ward
would create an hCard with the following properties after extraction and interpretation:
{
'type': ['hcard']
, 'properties': {
'fn': ['Ben Ward']
, 'given-name': ['Ben']
, 'family-name': ['Ward']
}
}
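The given-name and family-name values in this result come from interpretation, not extraction. A minimal Python sketch of that step follows. It assumes the simplest case only, an fn of exactly two whitespace-separated words, and the function name imply_name_parts is hypothetical rather than part of any existing parser.

```python
def imply_name_parts(properties):
    """Return a copy of an hcard's properties with given-name and
    family-name implied from a two-word fn (simplified rule, assumed here)."""
    result = {name: list(values) for name, values in properties.items()}

    # Never override name parts the author marked up explicitly.
    if "given-name" in result or "family-name" in result:
        return result

    names = result.get("fn", [])
    words = names[0].split() if names else []

    # Only the unambiguous two-word case is handled in this sketch.
    if len(words) == 2:
        result["given-name"] = [words[0]]
        result["family-name"] = [words[1]]
    return result


print(imply_name_parts({"fn": ["Ben Ward"]}))
# {'fn': ['Ben Ward'], 'given-name': ['Ben'], 'family-name': ['Ward']}
```

A rule like this belongs to the hcard vocabulary, so a generic extractor could not apply it without knowing which format it is looking at.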
It is required of the extractor to understand that when a microformats object specifies no explicit child properties, it must treat h-card as p-fn. But the parser is generic, so it would also treat h-review, h-entry, h-recipe and h-geo as ‘p-fn’. For a parser this is somewhat acceptable, but it is unacceptable to the vocabularies, many of which do not define fn. Even if all variants of fn, summary and entry-title were consolidated to fn (which it is also proposed be renamed to name), it cannot be claimed with certainty that every one of these formats provides anything semantically valid in that case. What, for example, is the ‘name’ of a geo? It has none. Likewise, a review described with just a single field would more likely map to the review's content, not its name, and an Atom entry to its content, not its title.
To avoid overloading or undermining the semantics of a vocabulary, I propose that we handle this at the extractor level in a simpler fashion: Define a new property for literal data, that an extractor will provide if no other information was available. All interpreters may then be instructed that in the event that an object has no properties, it can attempt to interpret the literal value from the page instead.
This enables a parser to operate generically without incorrectly applying semantics, and allows applications and interpreters to work with parsed data in a simple and defined way.
In existing microformats, the closest example we have for this is the label property in hCard, which is used to represent the literal address label for a place. It corresponds to fn, org and adr in combination, but has no structure in and of itself.
Possibly, every microformat could have a label form where structured data is unavailable.
Alternatively, value is used throughout microformats to target a generic value (e.g. in combination with price in hListing). It has been proposed that when parsing properties that are also themselves microformats, we create native objects of the form:
{
'value': '1900 12th Street, San Francisco, CA 94'
, 'type': ['adr']
, 'properties': {
'street-address': '1900 12th Street'
, 'etc': 'etc'
}
}
We could apply this same pattern to the root level:
{
'type': ['hcard']
, 'properties': {}
, 'value': 'Ben Ward'
}
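To illustrate how an extractor might arrive at this shape, here is a minimal Python sketch built on the standard library's html.parser. Several things are assumptions of the sketch, not of the proposal: the names Extractor and extract, the span element in the usage line, the handling of only h- roots and p- properties without nesting, and the choice to report the root class name verbatim (h-card) instead of the hcard spelling used above.

```python
from html.parser import HTMLParser

# Elements with no end tag; they must not affect depth tracking.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class Extractor(HTMLParser):
    """Vocabulary-agnostic sketch: collects h-* roots and their p-* properties,
    and falls back to a literal `value` when a root has no properties."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._depth = 0
        self._root = None  # (depth, item, text parts) of the open root
        self._prop = None  # (depth, names, text parts) of the open property

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        self._depth += 1
        classes = (dict(attrs).get("class") or "").split()
        if self._root is None:
            types = [c for c in classes if c.startswith("h-")]
            if types:
                item = {"type": types, "properties": {}}
                self._root = (self._depth, item, [])
        elif self._prop is None:
            names = [c[2:] for c in classes if c.startswith("p-")]
            if names:
                self._prop = (self._depth, names, [])

    def handle_data(self, data):
        if self._root is not None:
            self._root[2].append(data)
        if self._prop is not None:
            self._prop[2].append(data)

    def handle_endtag(self, tag):
        if tag in VOID:
            return
        if self._prop is not None and self._depth == self._prop[0]:
            _, names, parts = self._prop
            text = " ".join("".join(parts).split())
            for name in names:
                self._root[1]["properties"].setdefault(name, []).append(text)
            self._prop = None
        if self._root is not None and self._depth == self._root[0]:
            _, item, parts = self._root
            if not item["properties"]:
                # No explicit properties: expose the literal text only.
                item["value"] = " ".join("".join(parts).split())
            self.items.append(item)
            self._root = None
        self._depth -= 1


def extract(html):
    parser = Extractor()
    parser.feed(html)
    parser.close()
    return parser.items


print(extract('<span class="h-card">Ben Ward</span>'))
# [{'type': ['h-card'], 'properties': {}, 'value': 'Ben Ward'}]
```

The extractor never decides what the literal text means; it only records that nothing more specific was marked up.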
In this case, an interpreter or implementation is responsible for using value in place of fn, or restructuring the object. It would be the responsibility of each vocabulary to define its root property. The parsing layer of microformats 2.0 would not impose semantics or naming onto that.
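One way an interpreter could apply a vocabulary-defined root property is sketched below in Python. The ROOT_PROPERTY table and the interpret function are hypothetical; the only mapping taken from this page is hcard to fn, and any other entries would have to come from the vocabularies themselves.

```python
import copy

# Hypothetical table: the property each vocabulary says its literal value stands for.
# Vocabularies with no sensible root property are simply left out.
ROOT_PROPERTY = {"hcard": "fn"}


def interpret(obj, root_property=ROOT_PROPERTY):
    """Return a copy of an extracted object, promoting `value` to the
    vocabulary's root property when no explicit properties were found."""
    result = copy.deepcopy(obj)
    if result["properties"] or "value" not in result:
        return result
    for type_name in result["type"]:
        name = root_property.get(type_name)
        if name is not None:
            result["properties"][name] = [result["value"]]
            break
    return result


extracted = {"type": ["hcard"], "properties": {}, "value": "Ben Ward"}
print(interpret(extracted)["properties"])
# {'fn': ['Ben Ward']}
```

An object whose type has no entry in the table passes through unchanged, leaving the application free to use value however it sees fit.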
For another example, geo would end up like this:
{
'type': ['geo']
, 'properties': {}
, 'value': '1.3232;-0.543'
}
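A geo-aware interpreter could then apply the lat;long pattern to that literal value. The following Python sketch assumes the latitude and longitude property names from the geo vocabulary; the function name interpret_geo is hypothetical, and values are kept as strings without any numeric validation.

```python
def interpret_geo(obj):
    """Return a copy of an extracted geo object, splitting a literal
    'lat;long' value into latitude and longitude properties."""
    props = {name: list(values) for name, values in obj["properties"].items()}
    if not props and "value" in obj:
        lat, sep, lon = obj["value"].partition(";")
        # Only accept the value when both halves are present.
        if sep and lat.strip() and lon.strip():
            props["latitude"] = [lat.strip()]
            props["longitude"] = [lon.strip()]
    return {**obj, "properties": props}


extracted = {"type": ["geo"], "properties": {}, "value": "1.3232;-0.543"}
print(interpret_geo(extracted)["properties"])
# {'latitude': ['1.3232'], 'longitude': ['-0.543']}
```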
Other Interpretation/Parsing Notes
Collection of other unresolved parsing issues in a generic model:
- The include pattern references other elements from elsewhere in a document. A generic parser needs to track IDs and fill them in after walking the DOM (also itemref, if adopted).
- Will itemref always map to an item property name?
- hAtom implies author from an hcard in a page that uses an address element. This requires format knowledge, but a generic parser does not currently track the element type of a property node. Should it?
- hAtom defines that the highest level heading within an entry implies entry-title. This particular optimisation might be better off dead.
- hAtom defines that permalinks be parsed from rel attributes, not class.
- xfn is entirely built on rel (although it has various other differences from structural microformats, as do vote-links, so perhaps these are excluded from this discussion and will always be handled by dedicated parsers/queries regardless?)
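For the first item in the list above (tracking IDs and filling them in after walking the DOM), a two-pass approach might look like the Python sketch below. It works on a simplified tree of plain dicts, not a real DOM; the node keys id, include and children, and the function names index_ids and resolve_includes, are all hypothetical.

```python
def index_ids(node, index=None):
    """First pass: walk the tree and remember each node that carries an id."""
    index = {} if index is None else index
    if "id" in node:
        index.setdefault(node["id"], node)  # first occurrence of an id wins
    for child in node.get("children", []):
        index_ids(child, index)
    return index


def resolve_includes(node, index, active=frozenset()):
    """Second pass: return a copy of the tree in which every resolvable
    include reference is replaced by the node it points to."""
    children = []
    for child in node.get("children", []):
        ref = child.get("include", "").lstrip("#")
        if ref and ref in index and ref not in active:
            # `active` holds the ids being expanded, which guards against cycles.
            child = resolve_includes(index[ref], index, active | {ref})
        else:
            # Unresolvable or cyclic references are left in place.
            child = resolve_includes(child, index, active)
        children.append(child)
    return {**node, "children": children}


document = {"children": [
    {"id": "author", "class": "fn", "text": "Ben Ward", "children": []},
    {"class": "hcard", "children": [{"include": "#author", "children": []}]},
]}

resolved = resolve_includes(document, index_ids(document))
print(resolved["children"][1]["children"][0]["text"])
# Ben Ward
```

Deferring resolution to a second pass means a reference can point to an element that appears later in the document than the include itself.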