[uf-discuss] Human and machine readable data format

Jason Karns jason.karns at gmail.com
Sat Jul 12 10:23:20 PDT 2008


> The premise that publishers will pick any old format is merely an
> assertion with no evidence. Please show us an example somewhere else
> where this has happened, or perhaps a better argument than merely
> insisting on the "obvious" truth of it.
>
> The way I see it, if they publish in the wrong format, then the
> parsers won't pick up the date. This is what happens with microformats
> already. I don't know about anyone else, but when I publish a
> microformat, I test whether parsers can read it correctly. I do the
> same thing with any html. If a publisher can't take the time to test,
> and publish in the correct format then they take the consequences.
> it's exactly the same with any other technology. Why should
> microformats be any different? Why do you think making a microformat
> resemble natural language drastically changes this set of rules?
>
The problem is not as simple as testing in a parser to verify that the
format is correct.  NLP is too difficult to be easily solved in every
parser.  The outcome will be that different parsers handle different
levels of NLP, each parsing only a subset of the accepted 'natural
language formats'. This is similar to the way many parsers are now.
(Many parsers handle different portions of the specs. Few handle the
entire spec. Case in point: the include pattern.)  Even assuming the
very extreme case that all parsers handle the same string formats, no
parser will ever handle every possible language permutation.
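
To make the subset problem concrete, here is a minimal sketch in
Python. The format list and the function name are assumptions of mine
for illustration, not taken from any real parser; the point is only
that a parser built on a fixed list of accepted formats silently
rejects everything outside that list.

```python
from datetime import datetime

# Hypothetical set of human-style date formats one parser chooses to support.
# A different parser would likely pick a different list.
SUPPORTED_FORMATS = ["%B %d, %Y", "%d %B %Y", "%m/%d/%Y"]


def parse_human_date(text):
    """Return a date if text matches a supported format, otherwise None."""
    for fmt in SUPPORTED_FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue  # this format did not match; try the next one
    return None


if __name__ == "__main__":
    for sample in ["July 12, 2008", "12 July 2008", "12th of July, 2008"]:
        print(repr(sample), "->", parse_human_date(sample))
```

The first two samples parse; the third is perfectly readable to a human
but falls outside the list, so the publisher's date is simply lost.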

The only solution that will result in practical parser use will
*require* some amount of data duplication.  Just as you stated:
1. Metadata and information hiding are out of the question.
2. Putting ISO 8601 style dates ("machine dates") in any place where a
human can see them or have them read aloud is "the problem" that we are
trying to solve, so we can't do that.
3. The date cannot resemble anything a human might want to read.

One of the above rules must be broken. #2 is the problem, as you said.
Breaking #3 will result in a 'spec' that will never be fully
implemented in all parsers and will thus never be practical for
publishing. #1 therefore must be broken.  I don't understand why this
is even an argument at this point. The abbr-pattern was already
accepted even though it violates this principle. The only reason it is
rejected now is the semantics of the @title attribute. Thus any
solution that violates principle #1 in the same way as the
abbr-pattern should also be acceptable, so long as it does not suffer
from the same accessibility issue.

Any sort of class="data-*" solution seems to be an acceptable
compromise (and a compromise is what is required). It keeps the data
machine-readable without making parsing impractical. It keeps the
machine data out of human-readable context (@title). And it keeps the
duplicate data near the human-readable version for maintenance.
(Though I take exception to the duplicate-data principle, as most
publishers use automated tools that easily duplicate data without
causing staleness issues.)
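
As a rough sketch of how little a parser would need for this, here is
some Python. The exact class naming ("data-" followed directly by the
machine value), the sample markup, and the names DataClassExtractor
and extract are all assumptions for illustration; nothing here is an
agreed syntax. It also assumes well-nested markup with no void
elements such as <br>.

```python
from html.parser import HTMLParser

DATA_PREFIX = "data-"  # assumed prefix; the actual syntax is undecided


class DataClassExtractor(HTMLParser):
    """Collect (machine_value, human_text) pairs from class="data-*" elements."""

    def __init__(self):
        super().__init__()
        self._stack = []   # one [machine_value_or_None, text_parts] per open element
        self.results = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        value = next(
            (c[len(DATA_PREFIX):] for c in classes if c.startswith(DATA_PREFIX)),
            None,
        )
        self._stack.append([value, []])

    def handle_data(self, data):
        # Text belongs to every element that is currently open.
        for entry in self._stack:
            entry[1].append(data)

    def handle_endtag(self, tag):
        if self._stack:
            value, parts = self._stack.pop()
            if value is not None:
                self.results.append((value, "".join(parts).strip()))


def extract(html):
    parser = DataClassExtractor()
    parser.feed(html)
    parser.close()
    return parser.results


if __name__ == "__main__":
    sample = '<span class="dtstart data-2008-07-12">Saturday, July 12th</span>'
    print(extract(sample))
```

The machine value is read with a plain string-prefix check, with no NLP
involved, and it sits on the same element as the human-readable text,
so the two copies stay next to each other for maintenance.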

~Jason

