[uf-discuss] Bases

Mon Dec 5 05:47:39 PST 2005

Charles Iliya Krempeaux wrote:

> On 12/5/05, Chris Messina <chris.messina at gmail.com> wrote:
>> On 12/4/05, Scott Reynen <scott at randomchaos.com> wrote:
>>
>>> Personally, I suspect there's just not enough microformatted
>>> content out there yet to make it worth Google's cycles parsing
>>> it." [2].  But I thought it better to try and prove myself wrong
>>> with some code than to just speculate about it.
>>
>> Um, why are we waiting for Google? I mean, besides technorati, aren't
>> microformats kind of the next frontier for "smart" search engines?
>>
>> The "web as distributed database" sounds pretty damn appealing to me.
>
> If you want to search all of it, and want to do it in a reasonable
> amount of time, indexing helps.

Right, that's the first problem I ran into.  If you want to crawl the
whole web, you have to index the whole web.  And there's not enough
microformatted data out there to be worth indexing the whole web to
get at it.  Even when the crawler is restricted to one node away from
a found microformat, only 293 out of 5163 URLs (about 5.7%) currently
contain microformats.  Crawl the entire web and that percentage
quickly approaches zero.  Google Base, on the other hand, gets
valuable structured data out of 100% of submissions.  Advantage Google.
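
To make the hit-rate figure concrete, here is a rough sketch of how such a
check could work.  It is not the crawler's actual code: the names
HCardDetector, has_hcard and hit_rate are made up for illustration, and it
assumes that any element with class "vcard" counts as an hCard, which is a
simplification of real microformat parsing.

```python
from html.parser import HTMLParser


class HCardDetector(HTMLParser):
    """Flags a page if any element carries the class name 'vcard'."""

    def __init__(self):
        super().__init__()
        self.found = False

    def handle_starttag(self, tag, attrs):
        # The class attribute may hold several space-separated names,
        # and may be missing or valueless (None).
        classes = (dict(attrs).get("class") or "").split()
        if "vcard" in classes:
            self.found = True


def has_hcard(html):
    """Return True if the HTML string appears to contain an hCard."""
    detector = HCardDetector()
    detector.feed(html)
    return detector.found


def hit_rate(pages):
    """Fraction of pages (a dict of url -> html) with at least one hCard."""
    if not pages:
        return 0.0
    hits = sum(1 for html in pages.values() if has_hcard(html))
    return hits / len(pages)


# The counts quoted in this message: 293 of 5163 crawled URLs.
quoted_rate = 293 / 5163
print(f"{quoted_rate:.1%}")
```

Feeding hit_rate the pages a crawler has already fetched would give the
same kind of ratio; the last lines just redo the arithmetic on the counts
above.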

David Janes -- BlogMatrix wrote:

> we have developed a crawler that collects FOAF profiles from the
> Web and uploads them into Google Base.

I've been thinking about doing the same with the Microformat Base  
data, but I don't really care to deal with the potential copyright  
issues:

> If you want to be removed from Google Base, please send us a mail  
> and we will remove you.

If anyone else wants to be responsible for that, I'd be glad to make  
an Atom feed of hCards, which you could convert to a Google Base  
upload format.

Peace,
Scott