[uf-discuss] 2 billion hCards! gathering material for a "microformats.org turns 5" blog post

Thu Jul 8 01:25:03 PDT 2010

Toby,

On Wed, Jul 7, 2010 at 4:28 PM, Toby Inkster <tai at g5n.co.uk> wrote:
> On Wed, 7 Jul 2010 08:24:52 -0700
> Tantek Çelik <tantek at cs.stanford.edu> wrote:
>

<snip academic discussion of fb: being a URL scheme or not>

>> > The page at http://wordpress.org/ does actually contain 3 triples if
>> > evaluated as RDFa 1.0, though they're each the result of RDFa
>> > grandfathering in certain HTML 4/XHTML 1 semantics.
>>
>> No, it might contain 3 RDF triples - but they're not RDF*a*.
>
> It contains three attributes which are described by the XHTML+RDFa spec,
> and which, when processed according to the RDFa spec, each produce a
> triple.
>
>> Just because a page can be parsed/converted into another format does
>> not mean it "contains" that format.
>
> The page at http://wordpress.org/ doesn't need to be converted to RDFa.
> It is RDFa. (It doesn't use an RDFa DTD, though many seem to believe
> that judging an XML document's type by its DTD is a layering violation.)
>
> It would need to be converted if you wanted RDF/XML, Turtle or JSON.
> But it doesn't need to be converted to RDFa; it is RDFa.

These assertions of "is RDFa" on grandfathered formats/syntaxes are
deceptive because it's essentially claiming implied credit/branding
for something that had nothing to do with RDFa.

E.g. if some future version of the XHTML+RDFa spec describes how to
process microformats (given the trend of the RDFa specs to grandfather
in more and more syntax, it's reasonable to predict that this will
happen), then you can make the same claim there, that all uses of
microformats are RDFa, which then dilutes the phrase "is RDFa" to the
point of meaninglessness.

Such reclassifying of previously non-RDFa markup as RDFa is, as I
said, clouding a definition at best, and deceptive/dishonest at worst.

It is still just conversion of a *previous* syntax, defined *outside*
and *predating* RDFa.
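
To make the conversion point concrete, here is a minimal sketch
(Python, standard library only) of turning plain rel links into
subject/predicate/object tuples. It is not a conforming RDFa processor:
the names rel_triples and RelCollector, the sample markup, and the
small KNOWN_RELS subset are assumptions made up for illustration, and
hrefs are not resolved against the base URL. The markup it reads is
ordinary (X)HTML; the triples only exist after the conversion step.

```python
from html.parser import HTMLParser

# XHTML vocabulary namespace used for rel values (assumed prefix mapping).
XHV = "http://www.w3.org/1999/xhtml/vocab#"

# Illustrative subset of link types only; not the full list from any spec.
KNOWN_RELS = {"license", "next", "prev", "stylesheet", "alternate"}


class RelCollector(HTMLParser):
    """Collects (subject, predicate, object) tuples from rel attributes."""

    def __init__(self, base):
        super().__init__()
        self.base = base      # page URL, used as the subject of each triple
        self.triples = []

    def handle_starttag(self, tag, attrs):
        if tag not in ("a", "link"):
            return
        attributes = dict(attrs)
        href = attributes.get("href")
        rel = attributes.get("rel")
        if not href or not rel:
            return
        # rel is a space-separated list of link types.
        for value in rel.lower().split():
            if value in KNOWN_RELS:
                self.triples.append((self.base, XHV + value, href))


def rel_triples(html, base):
    """Hypothetical helper: convert rel links in `html` to triples."""
    collector = RelCollector(base)
    collector.feed(html)
    collector.close()
    return collector.triples


sample = '<a rel="license" href="http://example.org/license">License</a>'
for triple in rel_triples(sample, "http://example.org/"):
    print(triple)
```

The input in this sketch carries no RDFa-specific attributes at all,
which is the distinction being drawn: the author wrote a rel link, and
the triple is an output of processing it.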

Another analogy: you could make a new spec called BrandXSemantics
(BXS) that defined processing of all syntaxes like meta tags,
microformats, RDFa, microdata etc. that claimed that all such syntaxes
were BXS, but such a claim is of little utility and would merely serve
to artificially inflate claims about BXS being more popular than
microformats or RDFa or microdata - this is essentially what this kind
of "grandfathering" in RDFa is doing.

Claiming "It is RDFa" is also deceptive from the point of view of
developer behavior, which is illustrated by your next point.

>> Saying so is deceptively mis-using the word "contains" at best, and
>> playing semantic games at worst.
>>
>> Just because a page has hAtom does not mean it "contains" Atom.
>
> No, it "contains" hAtom and can possibly be converted to Atom (atom:id
> concerns notwithstanding).
>
> The page at http://wordpress.org/ contains RDFa and can be converted to
> RDF/XML.
>
>> The question of comparison is deliberately chosen to illuminate what
>> are developers actually coding? What syntax? Not what can you "infer",
>> "parse as", or "convert to".
>
> In the case of http://wordpress.org/, they have coded RDFa. Thanks to
> the fact that RDFa grandfathered in some semantics from earlier
> versions of (X)HTML, they may not have been *knowingly* doing so.

Claiming that some code is RDFa when it clearly was not *knowingly*
written/intended as such points out the key flaw - if you're talking
about what developers are adopting, then their intent, and what they
are explicitly choosing to do, is what matters. Thus comparisons like
Google's Rich Snippets adoption table make sense to contrast developer
adoption of different format approaches.

>> > The question "how many pages contain RDFa?" is only meaningful if
>> > certain qualifications are added... Does broken RDFa count?
>>
>> broken RDFa counts, but only to demonstrate the difficulty of coding
>> RDFa, not instances of RDF(a). one of the reasons that Google found
>> so little RDFa may be because much of it was broken. this is one of
>> the common problems with namespaces in data.
>
> Do twitter's 100 million plus broken hCards demonstrate the difficulty
> of coding microformats?

If there are problems with Twitter's hCards, please document the
specific problems on the respective issues page; that way we can
better verify the problem report(s), investigate possible causes, and
suggest fixes to Twitter as well.

I've added a placeholder section for this:

http://microformats.org/wiki/hcard-supporting-user-profiles-issues#Twitter

> I imagine that the reason Google found so little RDFa is because they
> were only counting RDFa that used their own RDFa vocabulary, and
> neglecting to count *all* RDFa. Without more information on their
> testing process I can't verify that though.

My understanding from RDF(a) advocates is that one of the design
principles of RDF(a) is its infinite extensibility and philosophy of
encouraging everyone to make up their own vocabulary (which is often
contrasted with microformats' opposite design principle of deliberate
re-use of shared vocabularies for better interoperability and
communication).

Google using their own RDFa vocabulary is a direct consequence of this
principle/philosophy of RDF(a)/namespaces etc., and thus if there's a
problem with that approach, it merely calls into question that
principle/philosophy of RDF(a)/namespaces.

> This would be analogous to Wikipedia surveying usage levels of rel-tag
> by searching for rel-tag links to http://en.wikipedia.org/wiki/* only.

It's not analogous because rel-tag doesn't explicitly state nor
encourage sites to only use their own rel-tags, whereas RDF(a) does
encourage making up and using your own vocabularies.

>>> Do grandfathered rel/rev values count? &c.
>>
>> rel/rev syntax and values work without RDFa - they're not RDFa,
>> despite RDFa's attempt to subsume them (and even errantly claim/imply
>> credit in the spec, e.g. rel-license).
>
> I don't think the RDFa spec claims credit for anything in particular.
> It reuses a lot of (X)HTML attributes and rel/rev values, but is rather
> silent on their origins.

Right - it's that "silent on their origins" which is sloppy at best
and plagiaristic (implying first invention/credit by absence of
citation of prior art) at worst.

I'll follow up with a more detailed description of where/when RDFa
claims/implies credit for work that predates RDFa. E.g. the
introduction of rel='license' in an example following a section that
states "examples to illustrate how Alice can use RDFa" [1] is one such
errant/deceptive implication that rel="license" is RDFa. It fails to
provide citations to the invention/introduction of rel="license" [2],
which IMHO borders on plagiarism: writing something that implies
credit for something that was invented by another beforehand, and
omitting the reference to prior art.

[1] http://www.w3.org/TR/2008/NOTE-xhtml-rdfa-primer-20081014/#id84491

[2] http://microformats.org/wiki/history
2004-02-11 http://tantek.com/presentations/2004etech/realworldsemanticspres.html

The counter-argument is that perhaps it is/was a case of simultaneous
invention, which I would prefer to give more weight to, except that
the microformats introduction of rel-license was explicitly
discussed/mentioned afterwards on the Creative Commons mailing list[3]
where many related subsequent RDF discussions were had:

[3] http://lists.ibiblio.org/pipermail/cc-metadata/2004-February/000290.html

>> 1. theoretical strawman[1]
>> 2. google.com/robots.txt prevents this from counting in any "search"
>
> I think you're neglecting the serious point that page counts on the Web
> are not especially significant - it's easy to generate many millions of
> pages from a single template.

If it's a "serious point" - please provide data to substantiate that
criticism rather than merely asserting that Yahoo Search Monkey
returns numbers that "are not especially significant" - I think the
Yahoo Search Monkey developers deserve more benefit of the doubt.

> There are probably much more interesting measures than page counts. To
> evaluate the health of a format, it's just as important -- perhaps more
> important -- to look at how many active consumers there are.

By all means, propose alternative concrete "more interesting measures"
and how you would measure them.

Until then, the concrete Yahoo Search Monkey measures are the most
interesting measures of web-wide microformats adoption to date.

Sarven,

On Wed, Jul 7, 2010 at 3:53 PM, Sarven Capadisli <info at csarven.ca> wrote:
>
> I'm not sure about exact numbers, but a StatusNet instance (e.g.,
> http://identi.ca/ ), has hCards for all users and groups. It includes
> representative hCards.
>
> Updated wiki.

Thanks much Sarven!

Do you know *when* Identica added hCard support? (I'd really prefer to
keep this blog post to recognizing specific deployments in the past
year)

Also, do you know how many Identica/status.net profiles there are today?

Please feel free to add answers to those directly to Identica's entry
on the hCard supporting user profiles page:

http://microformats.org/wiki/hcard-supporting-user-profiles

Thanks,

Tantek

-- 
http://tantek.com/