[uf-discuss] Canonical hCards (was: Search on CSS element)

Wed Jan 24 03:18:41 PST 2007

On 1/24/07, David Janes <davidjanes at blogmatrix.com> wrote:
> Do you (Tantek + all) agree with the following "architecture", or at
> least think it's worth pursuing further:
>
> (a) hCards without additional markup; "url" is used to lookup a URL
> (b) at the URL we can either find:
> (b.i) the authoritative hCard; OR
> (b.ii) a pointer to authoritative URL with the authoritative hCard
> (c) it's easy to find the authoritative hCard on the authoritative URL
>
> I'm sure we have the technology to do (b.ii), I just don't know if
> anyone has done it. Anyone?

I believe that http://rubhub.com/main/ acts in a similar manner. When
it finds XFN values it continues to crawl those URLs.

In a similar vein, an hCard spider could find hCards with a URL in a
page. It could then follow that URL to the person's page and inspect
it for hCards. If none are found, it could simply follow all rel-me
links. Since rel-me is published by the author of the page,
[it is a safe assumption?] that the subsequently requested pages are
also controlled by the author. hCards could then be looked for on
those pages as well. The problem arises when multiple hCards are
encountered on a page - which is the authoritative hCard? This issue
is not a problem with the spider, but with the mechanism to say "THIS
hCard is the one you want" (you suggested an anchor link #vcard).
Using some heuristics, though, it might be possible to match the URL
of the ORIGINAL hCard that started this spidering against any hCards
found in the rel-me crawl. If the URLs match, then you could (with
some degree of certainty) collapse the values into a more robust
hCard.
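
As a rough sketch of the "follow all rel-me links" step, something
like the following could pull the rel-me URLs out of a fetched page.
It uses only Python's standard library; the class and function names,
and the /contact/ and /publications/ paths in the sample markup, are
made up for illustration and not taken from any existing spider.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class RelMeCollector(HTMLParser):
    """Collects href values of <a> elements whose rel contains "me"."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        # rel is a space-separated list, e.g. rel="me friend"
        rel_values = (attr.get("rel") or "").lower().split()
        href = attr.get("href")
        if "me" in rel_values and href:
            # Resolve relative links against the page URL
            self.links.append(urljoin(self.base_url, href))


def rel_me_links(html, base_url):
    """Return absolute URLs of all rel-me links found in the HTML."""
    parser = RelMeCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical markup standing in for a fetched homepage
sample = """
<a rel="me" href="/contact/">Contact</a>
<a rel="me" href="/publications/">Publications</a>
<a rel="friend met" href="http://example.org/">A friend</a>
"""
print(rel_me_links(sample, "http://suda.co.uk/"))
```

Note that "met" in the third link is not picked up, because rel is
split into whole tokens before looking for "me".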

For example, say I leave a comment on XYZ blog. It cites me as the
author and uses an hCard:
<p class="vcard">posted by: <a class="fn url"
href="http://suda.co.uk/">Brian Suda</a></p>

So the spider will find my URL in XYZ blog and begin to spider
suda.co.uk for any rel-me links. It finds some on the homepage, to my
contact page and to my publications page. On the publications page it
finds several hCards for various events, organizations, etc. Each of
those is compared to the original from XYZ blog. The FNs don't match,
the URLs don't match - so with a high degree of probability they are
NOT me. Then the spider visits my contact page. It finds more hCards
(we'll say 2). It compares the first one, and the FNs and URLs don't
match. It compares the second one, and the URLs match, the FNs match -
with a high degree of certainty they are the same person and the data
can be merged.
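
To make the compare-and-merge step concrete, here is a minimal sketch
that assumes the hCards have already been parsed into plain
dictionaries. The function names, the organization card, and the
email address are hypothetical; the matching rule (both FN and URL
must agree) is just the heuristic described above, not an established
algorithm.

```python
def same_person(original, candidate):
    """True only if both cards carry a matching, non-empty fn and url."""
    fn_a = (original.get("fn") or "").strip().casefold()
    fn_b = (candidate.get("fn") or "").strip().casefold()
    # Crude URL comparison: ignore only a trailing slash
    url_a = (original.get("url") or "").strip().rstrip("/")
    url_b = (candidate.get("url") or "").strip().rstrip("/")
    return bool(fn_a and url_a) and fn_a == fn_b and url_a == url_b


def merge(original, candidates):
    """Copy extra properties from matching candidates into a new card."""
    merged = dict(original)
    for card in candidates:
        if same_person(original, card):
            for key, value in card.items():
                # Keep values already present; only fill in gaps
                merged.setdefault(key, value)
    return merged


# The hCard found in the comment on XYZ blog
original = {"fn": "Brian Suda", "url": "http://suda.co.uk/"}

# Two hypothetical hCards found on the contact page
contact_page_cards = [
    {"fn": "Some Organization", "url": "http://example.org/"},
    {"fn": "Brian Suda", "url": "http://suda.co.uk",
     "email": "brian@example.com"},
]

print(merge(original, contact_page_cards))
```

Only the second card contributes to the result, since the first one
matches on neither FN nor URL.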

By doing this, we introduce NO new technology or mark-up to find
authoritative data. We are using already existing microformats (XFN),
the data is visible (@rel on links in the page instead of <link>), and
it allows for the market to compete and build better spiders.

We also have the lesser-used UID to uniquely identify hCards. As we
talked about long ago, UID and URL should be the same thing. If we can
further develop that idea, then comparing and collapsing on URL/UID
will give us an even higher degree of certainty.
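
If UID and URL do end up being the same thing, the comparison could
be done on a single normalized key. The sketch below is one possible
way to do that with Python's urllib.parse; the identity_key name and
the normalization rules (lowercase scheme and host, ignore trailing
slash, query and fragment) are my assumptions, not anything agreed on
for hCard.

```python
from urllib.parse import urlsplit


def identity_key(card):
    """Return a comparable key from uid if present, else url.

    Returns None when the card has neither property.
    """
    raw = card.get("uid") or card.get("url")
    if not raw:
        return None
    parts = urlsplit(raw.strip())
    # Scheme and host are case-insensitive; the path is not
    host = (parts.hostname or "").lower()
    path = parts.path.rstrip("/")
    # Assumption: query and fragment are ignored for identity purposes
    return (parts.scheme.lower(), host, path)


a = {"fn": "Brian Suda", "uid": "HTTP://Suda.co.uk"}
b = {"fn": "Brian Suda", "url": "http://suda.co.uk/"}
print(identity_key(a) == identity_key(b))
```

Two cards would then be candidates for collapsing whenever their keys
are equal and not None.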

> The reason I'm asking is I'm more willing to plug work into this, but
> not if you've already decided that this approach will not work.

--- I think it will work just fine, but I also think we have all the
tools needed right now.

People 'say' they want this feature, but I don't think they have
explored the possible solutions available currently. If we are going
to try to tell folks to add something like <link rel="id.hcard"..../>
why not just rel-me? That gets us several things at once and solves
this same issue for hCards as well as hCalendar, hResume ...

The downside is that things like X2V or other hCard->vCard services
should, I would say, NOT spider links. This approach is better suited
to a social network app that caches the data than to one generating
vCards on the fly.

I think this happily solves the 80% in the 80/20 of use-cases right
now. I'm also sure smarter people than me can take this to the next
step, implement something like this, and do an even better job of
getting quality data out.

-brian

-- 
brian suda
http://suda.co.uk