[uf-discuss] 2 billion hCards! gathering material for a "microformats.org turns 5" blog post

Wed Jul 7 02:25:38 PDT 2010

Jeremy,

> Well, this isn't huge in terms of numbers but it's something that makes my day to day work a whole lot smoother:
>
> 37 Signals have added hCards to Basecamp:
> http://answers.37signals.com/basecamp/556-any-chance-of-adding-hcards

This is great news! The few times I've used Basecamp, I remember
being quite frustrated by the lack of hCard support and simple person
info portability. Great to see that 37 Signals has added hCards.

Peter,

On Tue, Jul 6, 2010 at 1:27 AM, Peter Mika <pmika at yahoo-inc.com> wrote:
> Hi Ed,
>
> The comparison to the number of people online is misleading, because the
> microformat stats quoted (both the Google and Yahoo figures) include
> duplicate counting. One of my illustrative examples is news.stanford.edu,
> where the microformat annotation is in the template, and thus every single
> page has exactly the same microformat markup, i.e. the address of Stanford
> University.

On the other hand, there are also numerous pages with multiple hCards
per page: directory listings, friends lists, about pages for
companies listing their executives, and so on.

The wiki has many such examples already:

http://microformats.org/wiki/hcard-examples-in-wild

There are certainly:
* multiple pages with the same hCard.
* pages with multiple hCards.

This was my experience with the microformats indexer we built at
Technorati back in the day.

It's hard to know how these average out.

You would have to write a bunch more code (e.g. really good deduping)
to figure it out.

Lacking that, we should cite *pages* with hCards rather than total
hCards, so that the Search Monkey stat is more accurate.
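
To illustrate the difference between those counts, here is a rough
Python sketch. It is not the Technorati indexer; the input shape (a
dict from URL to already-parsed hCards), the function names and the
sample data are all made up for illustration, and the "same hCard"
test is a naive normalized comparison rather than really good deduping.

```python
from collections import Counter

def normalize(value):
    # Collapse whitespace and case so trivially different copies compare equal.
    return " ".join(value.split()).lower()

def hcard_key(card):
    # Assumed identity for an hCard: its sorted (property, value) pairs.
    return tuple(sorted((prop, normalize(val)) for prop, val in card.items()))

def summarize(pages):
    """pages maps a URL to the list of hCards (as dicts) found on that page."""
    seen = Counter()
    for cards in pages.values():
        for card in cards:
            seen[hcard_key(card)] += 1
    return {
        "pages_with_hcards": sum(1 for cards in pages.values() if cards),
        "total_hcards": sum(seen.values()),
        "distinct_hcards": len(seen),
    }

# Made-up sample: a template hCard repeated on two pages, plus one
# page with two hCards and one page with none.
pages = {
    "http://example.edu/news/1": [{"fn": "Example University", "adr": "1 Main St"}],
    "http://example.edu/news/2": [{"fn": "Example  University", "adr": "1 main st"}],
    "http://example.com/about": [{"fn": "Alice Example"}, {"fn": "Bob Example"}],
    "http://example.com/empty": [],
}
print(summarize(pages))
```

The three numbers answer different questions, which is why the stat
needs to say which one it is citing.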

> The second point to make is that RDFa usage is underreported by [1]. Compare
>
> searchmonkey:com.yahoo.page.rdf.rdfa
>
> with
>
> searchmonkey:com.yahoo.page.uf.hcard
>
> These indicate that there are 2.7B pages with RDFa

I think this may be an errant number based on the way that Search
Monkey normalizes things internally to RDFa (because of an unfortunate
premature architectural decision that they then became stuck with - as
it was related to me by Paul Tarjan).

OR (and this deserves a little analysis)

Those pages don't actually all (if any?) contain RDFa.

Look at the first page of results.

E.g. Wordpress.org results don't have any RDFa.

View source, and the only thing you see that even remotely resembles
RDFa is:

<meta property="fb:page_id" content="...">

- which is simply the use of a "property" attribute that is invalid
in XHTML 1.0. The qname prefix "fb:" is not defined anywhere.

This is not RDFa, this is simply a <meta> tag using a new (invalid)
syntax. That is, using "property" instead of the standard HTML 4.01
"name" attribute:

<meta name="fb:page_id" content="...">

Similarly with CNN.com, download.cnet.com, online.wsj.com.

OTOH, www.vistaprint.ca, digg.com, www.joomlart.com, www.webmd.com
don't even have "property" attributes. Who knows why they're listed in
that result page. No evidence of any RDFa on those pages.

www.metacafe.com does appear to define an "og" qname and use it in a
"property" attribute.

And that's it for the first page of results for that query,
"searchmonkey:com.yahoo.page.rdf.rdfa".

Only 1 out of 10 results on the first page actually had any RDFa,
and that one was invisible <meta> data at that.

For the most part, that query does not appear to return pages with
RDFa, either in any valid sense or in the intended sense of marking
up existing visible content with additional attributes.
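
The view-source check described above can be roughly automated. The
following Python sketch is only a first-pass filter under stated
assumptions: it looks for "property" attributes and reports whether
their prefix is declared anywhere in the document via xmlns:prefix.
It ignores element scoping and says nothing about whether the markup
is otherwise valid RDFa. The class name, function name and sample
markup are made up for illustration.

```python
from html.parser import HTMLParser

class PropertyScan(HTMLParser):
    """Collects xmlns:* declarations and "property" attribute values."""

    def __init__(self):
        super().__init__()
        self.declared = set()   # prefixes declared via xmlns:prefix="..."
        self.properties = []    # (tag, property value) pairs

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name.startswith("xmlns:"):
                self.declared.add(name[len("xmlns:"):])
            elif name == "property" and value:
                self.properties.append((tag, value))

def scan(html):
    """Return (tag, property, prefix_is_declared) for each property attribute."""
    parser = PropertyScan()
    parser.feed(html)
    parser.close()
    report = []
    for tag, value in parser.properties:
        prefix = value.split(":", 1)[0] if ":" in value else ""
        report.append((tag, value, prefix in parser.declared))
    return report

# Made-up sample markup; the content values and namespace URI are placeholders.
undeclared = '<html><head><meta property="fb:page_id" content="123"></head></html>'
declared = ('<html xmlns:og="http://example.org/ns#"><head>'
            '<meta property="og:title" content="x"></head></html>')

print(scan(undeclared))
print(scan(declared))
```

A page with no "property" attributes at all yields an empty list.
Anything this flags as declared would still need a manual look to see
whether it marks up visible content or is only <meta> data.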

Perhaps a challenge could be posed - how many results of that query do
you have to look through before you find a legitimate "marking up
visible data" instance of RDFa?

In 4 pages of results (40 results) I found only 2, and both were on
the Creative Commons site. That is not a big surprise given that Ben
Adida both co-chairs the RDFa WG and works for Creative Commons. But
there were no others.

It seems that RDFa usage is grossly exaggerated (by at least a factor
of 20, going by 2 legitimate instances in 40 results) by the Yahoo
Search Monkey "searchmonkey:com.yahoo.page.rdf.rdfa" query.

> compared to 2B pages with
> hCard. There are many caveats to these numbers, but they are more or less on
> equal footing.

They're not even close (at least an order of magnitude difference), as
the above debunking of the RDFa results demonstrates.

Ed,

> Ed Summers wrote:
>>
>> On Sat, Jul 3, 2010 at 10:18 PM, Tantek Çelik <tantek at cs.stanford.edu>
>> wrote:
>>
>>>
>>> Some additional recent news:
>>> * microformats has 94% marketshare compared to alternatives (e.g.
>>> RDFa) according to Google (announced at the Semantic Technology
>>> conference)
>>>  -
>>> http://www.readwriteweb.com/archives/google_semantic_web_push_rich_snippets_usage_grow.php
>>>  - http://www.readwriteweb.com/images/richsnippets_june10b.jpg
>>>
>>
>> Was it clear if Google's stats were comparing all microformat usage
>> with usage of only their particular rich snippet vocabulary [1]? I'd
>> be surprised if it was *all* RDFa vocabulary use, since that would
>> mean that Google are indexing all RDFa on the web. John Breslin asked
>> a similar question in the comments on that RWW post [2].

This is an excellent question.

In particular the context (and numbers) of that slide appear to be
rich snippet specific - both for microformats and RDFa.

That is, comparing particular microformats for rich snippets, and
particular RDFa for rich snippets - 94% of the instances of markup for
rich snippets they found were done with microformats.

Good catch, Ed. That's an important detail to call out.

Thanks everyone for the corrections and additions.  I've updated the
wiki accordingly:

http://microformats.org/wiki/microformats-turns-5

Please let me know if I've missed anything else - I'm going to go
ahead and write this up tomorrow morning.

Thanks,

Tantek

-- 
http://tantek.com/