[uf-new] Extensible and open-ended microformats?

Fri Jul 15 05:34:42 PDT 2011

Hi everyone,

The Semantic Web is cumbersome and machine-oriented. Microformats declare that they are "Designed for humans first and machines second", and they support backward compatibility in a lightweight way. Is that enough? Everyone understands that microformats are not infinitely extensible and open-ended, but why not try to imagine what would change if they were?

The main obstacle to open-ended microformats is the word "format". No doubt formats are necessary: they order information so that other software (and humans) can recognize it. But a format usually implies a structure fixed by a specification, and the data collapses if a rule is added or omitted outside the scope of that specification. This is not appropriate behavior if we put "humans first". Natural language is a format too, but of a quite different nature: it consists of a restricted set of rules plus an infinite set of identifiers, which have fine-grained and open-ended compatibility. Of course, we cannot use natural language as is, because its identifiers and rules are ambiguous. Is a compromise possible? Why not? In fact, hypertext is such a compromise, and so are microformats, though with certain shortcomings.

Let's take hCard and hCalendar as examples:

<span class="tel">
    <span class="type">home</span>:
    <span class="value">+1.415.555.1212</span>
</span>

The microformats.org site was launched
 on 2005-06-20 
 at the Supernova Conference 
 in San Francisco, CA, USA.

What do we see here?

1. You cannot add more "formatted" details unless the format provides for them. There is no "state" for "CA".
2. You have to remember all the abbreviations a format uses. This is not a problem for a truly micro format, but it becomes one if the set of tags is big enough or if you use a lot of microformats.
3. Can several microformats overlap? For example, can a phone be mentioned inside an hCard record? Theoretically, yes; in practice, it works as long as the two formats have no coinciding tags. Can one microformat record be threaded through another? For example, if we had an hCardPlus format, it might include both "Supernova Conference" as an event and "San Francisco", "CA", and "USA" as location elements (they might even be in different sentences or paragraphs).
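Point 1 can be illustrated with a minimal sketch of a consumer that knows only a fixed vocabulary. This is not a real hCard parser: the class names FixedFormatParser and parse_tel are hypothetical, the vocabulary is cut down to the three classes from the "tel" snippet above, and the extra "state" span is added only to show what happens to a detail the format does not provide for.

```python
from html.parser import HTMLParser

# Assumed, cut-down vocabulary for illustration; not the full hCard property list.
CONTAINER = "tel"
PROPERTIES = {"type", "value"}


class FixedFormatParser(HTMLParser):
    """Collects text only from elements whose class is in the fixed vocabulary."""

    def __init__(self):
        super().__init__()
        self.stack = []    # class (or None) of each currently open element
        self.fields = {}   # recognized property -> text
        self.ignored = []  # class names the fixed vocabulary does not know

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if cls is not None and cls != CONTAINER and cls not in PROPERTIES:
            self.ignored.append(cls)
            cls = None  # treat the element as if it carried no class at all
        self.stack.append(cls)

    def handle_endtag(self, tag):
        if self.stack:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text and self.stack and self.stack[-1] in PROPERTIES:
            self.fields[self.stack[-1]] = text


def parse_tel(markup):
    parser = FixedFormatParser()
    parser.feed(markup)
    parser.close()
    return parser.fields, parser.ignored


SNIPPET = """
<span class="tel">
    <span class="type">home</span>:
    <span class="value">+1.415.555.1212</span>
    <span class="state">CA</span>
</span>
"""

fields, ignored = parse_tel(SNIPPET)
print(fields)   # the two properties the vocabulary knows
print(ignored)  # the extra detail is silently dropped
```

The "state" span is valid HTML and readable by a human, but the fixed vocabulary has no place for it, so the consumer loses it.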

Is there a remedy? The idea is simple: (a) use identifiers to make natural language less ambiguous and to make formats fine-grained and open-ended, and (b) use a restricted set of simple rules to relate elements to each other. This means we may use identifiers as "elemental microformats" and a set of rules for ordering (nesting, sequencing) tags and attributes, with one exception: the new set of rules should allow elements to overlap, which is generally considered an "unformat" feature, though "humans first" languages use it widely. "Simple rules" implies "humans first" (not "experts first"), that is, humans prefer "is" and "has" rules over "class", "property", etc.
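To make the overlap exception concrete, here is a rough sketch that keeps annotations beside the text as character ranges instead of nested tags. It is only one possible way to model this, not the markup of the proposal: the Annotation type, the helper names, and the identifiers "event" and "location" are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    start: int       # index of the first character covered
    end: int         # index just past the last character covered
    relation: str    # a simple rule such as "is" or "has"
    identifier: str  # hypothetical identifier, e.g. "event"


TEXT = "launched at the Supernova Conference in San Francisco, CA, USA."


def annotate(text, phrase, relation, identifier):
    """Attach an identifier to the first occurrence of phrase in text."""
    start = text.index(phrase)
    return Annotation(start, start + len(phrase), relation, identifier)


def overlaps(a, b):
    """True if the two ranges share at least one character."""
    return a.start < b.end and b.start < a.end


def contains(a, b):
    """True if range a fully covers range b (what tag nesting can express)."""
    return a.start <= b.start and b.end <= a.end


event = annotate(TEXT, "Supernova Conference in San Francisco", "is", "event")
place = annotate(TEXT, "San Francisco, CA, USA", "is", "location")

print(overlaps(event, place))                          # they share "San Francisco"
print(contains(event, place), contains(place, event))  # neither nests in the other
```

Two ranges that overlap without one containing the other cannot be written as properly nested tags, which is why the rules would need an explicit exception for them.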

For example:

The microformats.org site was launched
 on 2005-06-20 
 at the Supernova Conference
 in San Francisco, CA, USA.
 Please call us: +1.415.555.1212

The example is lengthier, but what is the difference?

1. "Summary" need not be specific to hCalendar: it can relate to any text and express its content in a more abstract and concise form. Therefore, it uses the "is" relation.
2. "Start" relates to the entire record and is declared as a "date". This is more complex than the microformat example, but it is also more flexible, because:
 a. "dtstart" is an abbreviation, which is recognizable only if the entire format is recognized.
 b. "s-of" is used because the date may occur outside this text altogether, for example: "The microformats.org was launched at ... . ... <many paragraphs> ... We are blessed with microformats since 20 Jun, 2005".
 c. "date" need not be specific to hCalendar.
 d. Explicit formatting helps to avoid ambiguities like 2005-06-06 (of course, you can check the format specification, but we stick to "humans first", remember? Explicit formatting saves human time on the user side).
3. The Supernova 2005 conference has a globally unique composite identifier, but it may consist of two simple identifiers, for "Supernova conference" and "2005 year", which are combined into the composite one.
4. San Francisco has a globally unique identifier too, but the definitions for "location", "CA", and "USA" are omitted, because they are derivatives of "San Francisco".
5. The phone markup is not lengthy compared with the original hCard record, because the type is part of the identifier and the value is expressed with one tag. The former hCard record has a cross-reference to the former hCalendar record.

The new definition has drawbacks: a fixed format is replaced with a format that is assembled in place by explicitly declaring relations. But who needs fixed formats? In fact, machines do. Humans prefer explicit explanations. And no validation, please. A human may spell Jun 20, 2005 in hundreds of ways (not counting different languages), and that is not the human's fault. The fault lies in the absence of means for converging these spellings to the same value. At the same time, nothing prevents us from composing dynamic microformats and reusing them throughout a page, a site, or the entire Web.
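As a small illustration of "converging" several spellings to the same value, here is a sketch using Python's standard datetime module. The list of accepted spellings and the function name converge are assumptions; a real list would be much longer, and the month names are matched in the current locale (assumed here to be English or the default C locale).

```python
from datetime import date, datetime

# Assumed handful of spellings; humans use many more than these.
FORMATS = (
    "%Y-%m-%d",   # 2005-06-20
    "%b %d, %Y",  # Jun 20, 2005
    "%d %b, %Y",  # 20 Jun, 2005
    "%B %d, %Y",  # June 20, 2005
)


def converge(spelling):
    """Return the date a spelling stands for, or None if it is not understood.

    An unknown spelling is not rejected as an error: the text stays as the
    human wrote it, it just carries no machine-readable value yet.
    """
    for fmt in FORMATS:
        try:
            return datetime.strptime(spelling.strip(), fmt).date()
        except ValueError:
            continue
    return None


for spelling in ("2005-06-20", "Jun 20, 2005", "20 Jun, 2005"):
    print(spelling, "->", converge(spelling))
```

All three spellings end up as the same value, and the writer is never told that the input is "invalid".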

Of course, you may ask the evident question: what if someone uses "phone", whereas someone else uses "telephone"? This is the point that is still not treated appropriately by either the Semantic Web or microformats. The answer is: non-URI decentralized routed identification.

Why non-URI? Theoretically, because a URI refers to an information resource, whereas identifiers should refer to things, which are, in turn, referred to by URIs. Practically, identification may be supported not by a single site but by multiple sites or a cloud. Moreover, it requires routing, which may target different destinations. Also, what should be done with identifiers that refer to dead sites?

Why decentralized? Because the meaning of identifiers may differ depending on who uses them and for what purpose. Additionally, it makes no sense to store in one place identifiers that are too specific, such as one for a single hotel room (which that specific hotel nevertheless really needs) or for an event that concerns only a small town.

Why routed? Because an identifier may be routed to different places where information can be found, depending on your preferences, your company's preferences (such as security), etc. The same applies to derivatives. A fixed format forces you to put San Francisco into a "location" tag (though it might also be called "geographical position", "place", "city", "metropolitan area", etc.). First, this is a tautology. Second, listing them all as "keywords" is a waste of time, because you don't know all the synonyms and all the related words and phrases. Third, it is not flexible: if a new buzzword is coined, you are unlikely to update all the old records it concerns. Routing takes care of this: identifying the words "San Francisco" as a reference to the city of San Francisco is enough to infer all its synonyms and related words (with an explicit definition of exactly how they are related).
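A toy sketch of what routing could look like follows. Everything in it is hypothetical: the identifier "san-francisco", the relation names, the two routes, and their contents are invented for illustration, and a real system would query remote sources rather than in-memory dictionaries.

```python
# Hypothetical routes: each one maps an identifier to explicitly named relations.
PUBLIC_ROUTE = {
    "san-francisco": {
        "is": ["city", "location", "place"],
        "part-of": ["CA", "USA"],
    },
}
COMPANY_ROUTE = {
    "san-francisco": {"is": ["sales region"]},
}


def resolve(identifier, routes):
    """Merge what each route knows about an identifier, in preference order."""
    merged = {}
    for route in routes:
        for relation, terms in route.get(identifier, {}).items():
            bucket = merged.setdefault(relation, [])
            for term in terms:
                if term not in bucket:
                    bucket.append(term)
    return merged


def matches(identifier, query, routes):
    """True if query is related to the identifier through any listed relation."""
    return any(query in terms for terms in resolve(identifier, routes).values())


my_routes = [COMPANY_ROUTE, PUBLIC_ROUTE]  # company preferences come first
print(resolve("san-francisco", my_routes))
print(matches("san-francisco", "USA", my_routes))
print(matches("san-francisco", "sales region", [PUBLIC_ROUTE]))
```

The record itself only says "this is San Francisco". Whether that also answers a search for "city", "USA", or a company-specific term depends on which routes the reader uses, so a newly coined term is added to a route once instead of to every old record.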

More flexible and open-ended ways have their own cost: they are less efficient to process and they are unpredictable. But this is how we think and how we speak. Who needs flexibility, humans or machines? Who needs open-endedness, humans or machines? I guess humans, right?

If you are interested, you can find more on this:

http://on-meaning.blogspot.com/2011/06/great-blunders-of-modern-it-and-their.html

Thanks,
Yuriy.

PS: This letter includes examples in a microformat-like style (though not quite), whereas my proposal assumes the optional use of non-URI identifiers and new tags and attributes.