[uf-discuss] Bases
Scott Reynen
scott at randomchaos.com
Mon Dec 5 05:47:39 PST 2005
Charles Iliya Krempeaux wrote:
> On 12/5/05, Chris Messina <chris.messina at gmail.com> wrote:
>> On 12/4/05, Scott Reynen <scott at randomchaos.com> wrote:
>>
>>> Personally, I suspect there's just not enough microformatted
>>> content out there yet to make it worth Google's cycles parsing
>>> it." [2]. But I thought it better to try and prove myself wrong
>>> with some code than to just speculate about it.
>>
>> Um, why are we waiting for Google? I mean, besides technorati, aren't
>> microformats kind of the next frontier for "smart" search engines?
>>
>> The "web as distributed database" sounds pretty damn appealing to me.
>
> If you want to search all of it, and want to do it in a reasonable
> amount of time, indexing helps.
Right, that's the first problem I ran into. If you want to crawl the
whole web, you have to index the whole web. And there's not enough
microformatted data out there to be worth indexing the whole web to
get at it. Even restricting the crawler to one node away from a
found microformat, only 293 out of 5163 URLs (about 5.7%) currently
contain microformats. Across the entire web, that percentage quickly
approaches zero. Google Base, on the other hand, gets valuable
structured data out of 100% of submissions. Advantage Google.
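
To make the "one node away" restriction and the hit-rate arithmetic concrete, here is a rough Python sketch. It is not the crawler's actual code: the function names, the link-graph dictionary, and the example URLs are all made up for illustration. Only the 293 and 5163 counts come from the numbers above.

```python
def one_hop_frontier(links, seeds):
    """Return the seed URLs plus every URL linked directly from a seed.

    links: dict mapping a URL to the list of URLs it links to (hypothetical)
    seeds: URLs where a microformat has already been found
    """
    frontier = set(seeds)
    for url in seeds:
        # Follow links exactly one hop; do not recurse any further.
        frontier.update(links.get(url, ()))
    return frontier


def hit_rate(hits, total):
    """Percentage of crawled URLs that contained a microformat."""
    return 100.0 * hits / total if total else 0.0


# Toy link graph with placeholder URLs (assumption, not real crawl data).
links = {
    "http://a.example/": ["http://b.example/", "http://c.example/"],
    "http://b.example/": ["http://d.example/"],
}
frontier = one_hop_frontier(links, {"http://a.example/"})

# The counts reported above: 293 URLs with microformats out of 5163 crawled.
rate = hit_rate(293, 5163)
print(f"{rate:.1f}% of crawled URLs contained microformats")
```

Note that d.example is two hops from the seed, so it never enters the frontier; that is the whole point of the restriction.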
David Janes -- BlogMatrix wrote:
> we have developed a crawler that collects FOAF profiles from the
> Web and uploads them into Google Base.
I've been thinking about doing the same with the Microformat Base
data, but I don't really care to deal with the potential copyright
issues:
> If you want to be removed from Google Base, please send us a mail
> and we will remove you.
If anyone else wants to be responsible for that, I'd be glad to make
an Atom feed of hCards, which you could convert to a Google Base
upload format.
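
For anyone curious what such a feed might look like, here is a minimal Python sketch using only the standard library. It assumes each hCard has already been parsed into a dict with hypothetical "fn" and "url" keys; the function name, feed title, and example data are illustrative, and a complete Atom feed would also need author information. The conversion to a Google Base upload format is not shown.

```python
import xml.etree.ElementTree as ET

ATOM = "http://www.w3.org/2005/Atom"
XHTML = "http://www.w3.org/1999/xhtml"


def hcard_feed(cards, feed_id, updated):
    """Build an Atom feed string with one entry per hCard.

    cards: list of dicts with "fn" and "url" keys (assumed shape)
    feed_id: IRI identifying the feed
    updated: RFC 3339 timestamp, reused for every entry in this sketch
    """
    ET.register_namespace("", ATOM)  # serialize Atom as the default namespace
    feed = ET.Element(f"{{{ATOM}}}feed")
    ET.SubElement(feed, f"{{{ATOM}}}title").text = "hCards"
    ET.SubElement(feed, f"{{{ATOM}}}id").text = feed_id
    ET.SubElement(feed, f"{{{ATOM}}}updated").text = updated
    for card in cards:
        entry = ET.SubElement(feed, f"{{{ATOM}}}entry")
        ET.SubElement(entry, f"{{{ATOM}}}title").text = card["fn"]
        ET.SubElement(entry, f"{{{ATOM}}}id").text = card["url"]
        ET.SubElement(entry, f"{{{ATOM}}}updated").text = updated
        ET.SubElement(entry, f"{{{ATOM}}}link", href=card["url"])
        # Embed the hCard itself as XHTML content so consumers can re-parse it.
        content = ET.SubElement(entry, f"{{{ATOM}}}content", type="xhtml")
        div = ET.SubElement(content, f"{{{XHTML}}}div")
        vcard = ET.SubElement(div, f"{{{XHTML}}}div", {"class": "vcard"})
        link = ET.SubElement(
            vcard, f"{{{XHTML}}}a", {"class": "url fn", "href": card["url"]}
        )
        link.text = card["fn"]
    return ET.tostring(feed, encoding="unicode")


# Placeholder data, not taken from the Microformat Base index.
xml = hcard_feed(
    [{"fn": "Example Person", "url": "http://example.com/person"}],
    "http://example.com/hcards.atom",
    "2005-12-05T05:47:39-08:00",
)
print(xml)
```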
Peace,
Scott