[uf-discuss] re: HTML5 support

Wed Jul 21 02:27:44 PDT 2010

On Tue, 20 Jul 2010 21:55:38 +0200, Angelo Gladding <angelo at gladding.name>  
wrote:

> On Tue, Jul 20, 2010 at 3:25 AM, Philip Jägenstedt <philipj at opera.com>  
> wrote:
>> On Tue, 20 Jul 2010 06:05:06 +0200, Angelo Gladding  
>> <angelo at gladding.name>
>> wrote:
>>
>>> Can an enlightened soul describe in which ways microdata is actually
>>> superior to profiled poshformats?
>>
>> Microdata should be compared to the class attributes and the various
>> patterns that microformats use, not any specific vocabulary.
>
> Of course. Let me clarify. A `microformat` is a poshformat that has
> undergone a relatively laborious process of research and brainstorming
> to capture real world user requirements to make a minimal vocabulary
> that can capture ~80% of current usage patterns. Microdata is a set of
> rules governing a syntax. Hence my comparison of microdata to
> poshformats, which are essentially microformats sans the due
> diligence.

Right, designing vocabularies is hard and requires due diligence. That's  
true no matter what the syntax is.

>> The main benefit is that parsing becomes well-defined
>
> Ain't that the truth.
>
>> and simple.
>
> Or is it? I wonder how different the two sets of supporting algorithms
> might look face to face once fully documented and implemented.
>
> The Microformats wiki makes the following comparison to microdata:
>
> 1. `itemprop` - is a more specific version of class, for field names.
> 2. `subject` - allows semantically linking within the page.
> Conceptually similar to the include-pattern.
> 3. `itemref` - allows including properties elsewhere on the page that
> are not descendants of itemscope. Takes space-separated ids (for
> example itemref="address phone" would include the elements with
> id="address" and id="phone"). Conceptually similar to the
> include-pattern.
> 4. `content` - on the meta element can be used to include invisible
> data that is not part of the content. As current browsers move meta
> inside <head>, make sure to include via `itemref`. Conceptually
> similar to the 'value-title' feature of the value-class-pattern.
> 5. `itemscope` - identifies blocks to be marked as structured data.
> Conceptually similar to the mfo brainstorming.
> 6. `itemtype` - to specify the type for an item (for example:
> itemtype="http://microformats.org/profile/hcard").

What wiki page is this from? subject has been replaced by itemid. I can't
understand what the similarity with the include-pattern could possibly be,
though.

> Distilled down:
>
> 1. @class
> 2/3. include-pattern/table-header-pattern
> 4. value-class-pattern
> 5. "mfo"
> 6. rel-profile
>
> Sounds to me like the same sort of desire for absolute normativity
> that [non-HTML5] XHTML once attempted to burden the entirety of
> humanity with. Ironically, HTML5 has deprecated such a style in favor
> of a seemingly more flexible Microformat-esque syntax.

Putting XHTML2 aside, one of the main achievements of HTML5 is having  
formalized how to parse all the sloppy, broken HTML out there (a.k.a. "tag  
soup"). While the syntax is flexible to authors, there's no flexibility  
whatsoever for an implementor in how to parse it. The result will always be
the same. In my view, microdata is to microformats what the HTML5 parser  
is to HTML4. It makes it possible to parse, without ever guessing, all the  
microdata items on a page. While it's really easy to write a microformat  
parser in JavaScript, you're not going to see that built into a browser,  
where each vocabulary needs a new parser. Microdata also hasn't been  
implemented by any browser yet, but I'm pretty sure it's going to happen  
if it takes off.

> <span itemscope itemtype="http://microformats.org/profile/hcard">

> Considering your affiliation with Opera, what might I ask are your
> feelings about Operator?

I've heard of it before, it looks like a custom Opera distribution? It has  
nothing to do with microformats or microdata as far as I can tell.

>> which really isn't really practical with microformats when the
>> data is hidden in class attributes together with everything else.
>
> As I alluded to above I see this as a complete non-issue yet you are
> most certainly not the first to bring it up. What am I missing?

If a browser is going to support some kind of embedded data vocabularies  
(like events or contacts), the code for parsing it isn't going to be  
written in JavaScript using the DOM, it's going to be in C++ or C  
operating on the internal datastructures of the browser. To support a  
specific microformat vocabulary, one would have to look through all the  
classes on all elements to find the "root" element, then speculatively  
search its children for the other structures of the microformat. Given
that all of the constructs used in microformats are also used for
completely different things, most of the data you inspect isn't
actually going to be what you're looking for. Since one has to do this for
all documents parsed (and not "on demand" like when finding a particular
class using document.getElementsByClassName) my guess is that it's going
to be slow. What's worse, you'll have to write more of this complicated,
slow code for each vocabulary you want to support.

If the data is put in new attributes like itemprop, the code for parsing
it will be simpler and you won't have to write it again for every
vocabulary you support; you can just reuse your getItems(x) implementation
to find all items of type x and go from there.
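
To illustrate the point, here is a rough sketch in Python (standard library
only) of a vocabulary-agnostic extractor. It is not the microdata algorithm
from the spec: it ignores itemref, itemid, URL-valued attributes and the
content attribute on meta, and takes every property value from the element's
text. The names MicrodataSketch and get_items are made up for this example,
and using fn/org as property names under the hcard itemtype is an assumption
for illustration only.

```python
from html.parser import HTMLParser

# Elements that never have an end tag, so they never get a stack frame.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class MicrodataSketch(HTMLParser):
    """Simplified walker: collects itemscope/itemprop, nothing else."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.top_level = []  # items that are not the value of a property
        self.stack = []      # one frame per currently open element

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        a = dict(attrs)
        frame = {"tag": tag, "item": None, "prop": a.get("itemprop"), "text": []}
        if "itemscope" in a:
            frame["item"] = {"type": a.get("itemtype"), "properties": {}}
        self.stack.append(frame)

    def handle_data(self, data):
        # Text counts towards every open property element that is not an item.
        for frame in self.stack:
            if frame["prop"] is not None and frame["item"] is None:
                frame["text"].append(data)

    def handle_endtag(self, tag):
        if tag in VOID or not any(f["tag"] == tag for f in self.stack):
            return
        while self.stack:
            frame = self.stack.pop()
            self.close_frame(frame)
            if frame["tag"] == tag:
                break

    def close_frame(self, frame):
        is_item = frame["item"] is not None
        value = frame["item"] if is_item else "".join(frame["text"]).strip()
        # The nearest enclosing itemscope owns this property, if there is one.
        owner = next((f["item"] for f in reversed(self.stack)
                      if f["item"] is not None), None)
        if frame["prop"] is not None and owner is not None:
            for name in frame["prop"].split():
                owner["properties"].setdefault(name, []).append(value)
        elif is_item:
            self.top_level.append(frame["item"])


def get_items(html, item_type=None):
    """Return top-level items, optionally filtered by itemtype."""
    parser = MicrodataSketch()
    parser.feed(html)
    parser.close()
    while parser.stack:  # flush elements left open at end of input
        parser.close_frame(parser.stack.pop())
    return [i for i in parser.top_level
            if item_type is None or i["type"] == item_type]


doc = """
<div itemscope itemtype="http://microformats.org/profile/hcard">
  <span itemprop="fn">Jane Doe</span>
  <span itemprop="org">Example Corp</span>
</div>
<p class="fn org">class names are never inspected</p>
"""
items = get_items(doc, "http://microformats.org/profile/hcard")
print(items)
```

Nothing in the walker knows about hcard; the same code would pick up an
event or review vocabulary unchanged, and the class attribute on the last
paragraph is never looked at.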

Now, this is all theoretical since no browser has implemented this yet (I
tried a bit in my free time, but had too little of it). If you don't care about
browsers, then of course it doesn't matter. If microformats work for you  
then keep using them. I'm just saying that there's a better way forward.

>>> Might a "humans first, machines second" CJKV internationalization of
>>> `n` optimization be to analyze the contents of the `fn`'s @lang and
>>> inner text and use either or both to better determine name order?
>>
>> The main problem with this is that due to lazy copy-pasting, lang="en"
>> is often used even when the language isn't English. Also, in the case of
>> e.g. Facebook, lang="en" would be correct for the page itself, but
>> people's names aren't in English anyway.
>
> Check out http://ja-jp.facebook.com/people/gong-ye-zhong/100000456401743
>
> <html lang=ja>...<div class=vcard>...<a class=fn ... >宮野衆</a>...</div>
>
> 宮野 can log in today and, without any cooperation from Facebook, append
> a U+200B (zero-width space [1]) to his first name (regardless of the
> input taking the form of one or two boxes), and immediately reap the
> benefits of such an `n` optimization without negatively affecting UI,
> sort order, etc.
>
> [1] http://en.wikipedia.org/wiki/Zero-width_space

I don't speak Japanese, but I think 宮野 is the family name and 衆 is the  
given name. By not doing anything the 'n' optimization will incorrectly  
guess that the family name is 宮野衆 and given name unknown. By inserting  
a zero-width space, it will instead incorrectly guess that 宮野 is the  
given name and 衆 is the family name. Either way it's incorrect.
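
A small Python sketch of the kind of guess being discussed, under loud
assumptions: it only handles the "exactly two parts means given name then
family name" case, it treats U+200B as a separator (which is the proposal
above, not something I know parsers to do), and it leaves out the other
special cases of the real 'n' optimization. The function name implied_n is
made up for this example.

```python
import re


def implied_n(fn):
    """Guess name parts from a formatted name; None unless exactly two parts."""
    # Split on whitespace or on a zero-width space (U+200B). The latter is
    # an assumption taken from the proposal, not established parser behavior.
    parts = [p for p in re.split(r"[\s\u200b]+", fn) if p]
    if len(parts) != 2:
        return None
    given, family = parts  # Western order is simply assumed
    return {"given-name": given, "family-name": family}


print(implied_n("Jane Doe"))
print(implied_n("宮野\u200b衆"))
print(implied_n("宮野衆"))
```

With the zero-width space inserted, the sketch reports 宮野 as the given
name and 衆 as the family name; without it, it makes no two-part guess at
all. Neither outcome recovers the family-name-first order.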

>> The only way to get it right is to ask the user both for the full name,
>> given name and family name, something I haven't ever seen.
>
> If you haven't seen it, then it isn't even a single way to get it
> right -- another byproduct of Microformats philosophy I believe.
> However, if optimizations can yield 80%+ positive results when viewed
> in aggregate I personally give a little bit of magic a big thumbs up.

I guess we're not going by the population of the earth then, since China,  
Japan, Vietnam and South Korea account for 23.36% of it.  
(http://en.wikipedia.org/wiki/List_of_countries_by_population)

>> The most practical solution is to not guess at all, and I don't know
>> of any negative effects of this.
>
> I just see a tiny hint of dehumanization. ;)

Seriously though, what are the negative effects? I'm betting that the
number of people who make good use of having the given name and family
name separately in their address book isn't large enough to justify
screwing it up for the population of East Asia.

-- 
Philip Jägenstedt
Core Developer
Opera Software