[uf-discuss] Storing Microformats

Ryan King ryan at technorati.com
Sun Sep 23 18:27:14 PDT 2007


On Sep 17, 2007, at 12:44 PM, Paul Kinlan wrote:
> I have created a C#/.Net Stream-based Microformat parser
> (http://www.codeplex.com/microformat) and I am trying to create some
> reference applications to show it off.
>
> I am in the process of creating an "Operator" like plugin for IE (It
> currently parses and displays the microformats that have been found on
> a page).
>
> One of the other ideas that I am toying with is a Microformat spider,
> that crawls the web looking for microformats, storing them and then
> allowing them to be searched.   My question is: How are people storing
> the data present in microformats so that they can be searched and
> maintained and consumed in applications etc?

In short, I use MySQL tables: one for each microformat and one for
each elemental type that can be many-to-many (images, photos, tags,
etc.). The elemental-type tables have polymorphic many-to-many
relationships with the tables for the formats themselves.
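
To make the "polymorphic many-to-many" idea concrete, here is a minimal sketch. It is not Technorati's actual schema: the table names (hcards, hevents, tags, taggings), the columns, and the use of SQLite in place of MySQL are all assumptions chosen so the example runs on its own. The join table stores a type name alongside the id, so one tags table can point at rows in any format table.

```python
import sqlite3

# Hypothetical schema: every table and column name here is illustrative only.
SCHEMA = """
CREATE TABLE hcards  (id INTEGER PRIMARY KEY, fn TEXT NOT NULL);
CREATE TABLE hevents (id INTEGER PRIMARY KEY, summary TEXT NOT NULL);
CREATE TABLE tags    (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
-- Polymorphic join table: (item_type, item_id) identifies a row in
-- whichever format table item_type names.
CREATE TABLE taggings (
    tag_id    INTEGER NOT NULL REFERENCES tags(id),
    item_type TEXT    NOT NULL,
    item_id   INTEGER NOT NULL,
    PRIMARY KEY (tag_id, item_type, item_id)
);
"""


def build_demo_db():
    """Create an in-memory database with one hCard and one hEvent sharing a tag."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO hcards (id, fn) VALUES (1, 'Example Person')")
    conn.execute("INSERT INTO hevents (id, summary) VALUES (1, 'Example Meetup')")
    conn.execute("INSERT INTO tags (id, name) VALUES (1, 'microformats')")
    conn.executemany(
        "INSERT INTO taggings (tag_id, item_type, item_id) VALUES (?, ?, ?)",
        [(1, "hcard", 1), (1, "hevent", 1)],
    )
    return conn


def items_for_tag(conn, tag_name):
    """Return (item_type, item_id) pairs for every item carrying the tag."""
    rows = conn.execute(
        "SELECT tg.item_type, tg.item_id "
        "FROM taggings tg JOIN tags t ON t.id = tg.tag_id "
        "WHERE t.name = ? "
        "ORDER BY tg.item_type, tg.item_id",
        (tag_name,),
    )
    return rows.fetchall()
```

The trade-off with this layout is that the database cannot enforce a foreign key on (item_type, item_id), so the application has to keep those references consistent itself.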

We also build search indexes, currently using Ferret
[http://ferret.davebalmain.com/trac/], but hopefully soon switching to
our standard Lucene infrastructure at Technorati.

We cache all objects in memcache with indefinite timeouts (all cache  
clearing is done proactively). This includes all related items in one  
cache entry.
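
As a rough illustration of caching with no expiry and proactive clearing, the sketch below uses a plain dict as a stand-in for memcache. The class name ObjectCache, the key format, and the loader callback are hypothetical and not taken from the setup described above. Entries never time out; the code path that writes to the database is expected to call invalidate() for the affected key.

```python
class ObjectCache:
    """Cache whole objects (with their related items) until explicitly cleared."""

    def __init__(self, load_from_db):
        # load_from_db is a hypothetical callback that returns the full
        # object, related items included, for a given key.
        self._load = load_from_db
        self._store = {}  # stand-in for a memcache client

    def get(self, key):
        # No expiry check: an entry stays until invalidate() removes it.
        if key not in self._store:
            self._store[key] = self._load(key)
        return self._store[key]

    def invalidate(self, key):
        # Called by the write path whenever the object or any of its
        # related items changes.
        self._store.pop(key, None)
```

Because each entry holds the object together with its related items, a change to any related item (a tag, say) has to invalidate every object entry that embeds it.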

When it comes down to it, it's all a matter of scale. When we were  
indexing 10^5 and 10^6 items, we would actually parse some of the  
markup on the fly when someone did a search. Sounds crazy, but it
worked all right for a while (I blame Tantek). Now we parse it all out
into a relatively normalized model. We're at 10^8 or so items now. If  
we hit another order of magnitude we'll have to rethink things and  
probably take some stuff (like BLOBs) out of the relational database  
and put them somewhere else.

-ryan

