[uf-discuss] Storing Microformats
ryan at technorati.com
Sun Sep 23 18:27:14 PDT 2007
On Sep 17, 2007, at 12:44 PM, Paul Kinlan wrote:
> I have created a C#/.Net Stream-based Microformat parser
> (http://www.codeplex.com/microformat) and I am trying to create some
> reference applications to show it off.
> I am in the process of creating an "Operator" like plugin for IE (It
> currently parses and displays the microformats that have been found on
> a page).
> One of the other ideas that I am toying with is a Microformat spider,
> that crawls the web looking for microformats, storing them and then
> allowing them to be searched. My question is: How are people storing
> the data present in microformats so that they can be searched and
> maintained and consumed in applications etc?
In short, I use MySQL tables: one for each microformat and one for
each elemental type that can be many-to-many (images, photos, tags,
etc.). The elemental tables have polymorphic many-to-many relationships
with the tables for the formats themselves.
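A minimal sketch of that kind of layout, not the actual Technorati schema. It uses SQLite through Python's sqlite3 module rather than MySQL so that it runs standalone, and every table, column, and function name here (hcards, hreviews, tags, taggings, taggable_type, tag_item, items_tagged) is a hypothetical chosen for illustration. The join table stores the name of the format alongside the row id, which is what makes the relationship polymorphic:

```python
import sqlite3

# Hypothetical schema: one table per microformat, one per shared
# elemental type (only tags shown), and one polymorphic join table.
SCHEMA = """
CREATE TABLE hcards   (id INTEGER PRIMARY KEY, fn TEXT, url TEXT);
CREATE TABLE hreviews (id INTEGER PRIMARY KEY, summary TEXT, rating REAL);
CREATE TABLE tags     (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
CREATE TABLE taggings (
    tag_id        INTEGER NOT NULL REFERENCES tags(id),
    taggable_type TEXT    NOT NULL,  -- which format table the row lives in
    taggable_id   INTEGER NOT NULL,  -- primary key within that table
    PRIMARY KEY (tag_id, taggable_type, taggable_id)
);
"""

def tag_item(db, taggable_type, taggable_id, name):
    """Attach a tag to a row of any format table."""
    db.execute("INSERT OR IGNORE INTO tags (name) VALUES (?)", (name,))
    (tag_id,) = db.execute(
        "SELECT id FROM tags WHERE name = ?", (name,)
    ).fetchone()
    db.execute(
        "INSERT OR IGNORE INTO taggings VALUES (?, ?, ?)",
        (tag_id, taggable_type, taggable_id),
    )

def items_tagged(db, name):
    """Return (format, id) pairs for every item carrying the tag."""
    rows = db.execute(
        "SELECT tg.taggable_type, tg.taggable_id "
        "FROM taggings tg JOIN tags t ON t.id = tg.tag_id "
        "WHERE t.name = ? "
        "ORDER BY tg.taggable_type, tg.taggable_id",
        (name,),
    )
    return rows.fetchall()

def demo():
    db = sqlite3.connect(":memory:")
    db.executescript(SCHEMA)
    db.execute("INSERT INTO hcards (id, fn, url) "
               "VALUES (1, 'Example Person', 'http://example.com/')")
    db.execute("INSERT INTO hreviews (id, summary, rating) "
               "VALUES (1, 'Example review', 4.0)")
    tag_item(db, "hcard", 1, "example")
    tag_item(db, "hreview", 1, "example")
    return items_tagged(db, "example")
```

The trade-off with this pattern is that the database cannot enforce a foreign key on taggable_id, since the target table varies per row, so integrity has to be kept in application code.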
We also build search indexes, currently using Ferret
[http://ferret.davebalmain.com/trac/], but hopefully soon switching to
our standard Lucene infrastructure at Technorati.
We cache all objects in memcache with indefinite timeouts (all cache
clearing is done proactively). This includes all related items in one
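The caching approach described above (no expiry, with entries cleared explicitly whenever the underlying data changes) might look roughly like the following. This is only a sketch under stated assumptions: FakeCache is an in-process stand-in for memcached so the example runs without a server, and Store, load, save, and the key format are hypothetical names, not the code we run.

```python
class FakeCache:
    """In-process stand-in for memcached; entries never expire on their own."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def set(self, key, value):
        self._data[key] = value

    def delete(self, key):
        self._data.pop(key, None)


class Store:
    """Hypothetical data layer: read-through cache, cleared on every write."""

    def __init__(self, cache):
        self.cache = cache
        self.rows = {}      # stands in for the relational database
        self.db_reads = 0   # counts how often the "database" is consulted

    def _key(self, kind, item_id):
        return f"{kind}:{item_id}"

    def load(self, kind, item_id):
        key = self._key(kind, item_id)
        cached = self.cache.get(key)
        if cached is not None:
            return cached
        # Cache miss: read from the database and cache with no timeout.
        self.db_reads += 1
        value = self.rows[(kind, item_id)]
        self.cache.set(key, value)
        return value

    def save(self, kind, item_id, value):
        self.rows[(kind, item_id)] = value
        # Proactive clearing: drop the stale entry as part of the write,
        # rather than waiting for a timeout to expire it.
        self.cache.delete(self._key(kind, item_id))
```

With no timeouts, a cached object is only as fresh as the write path is disciplined: any code that changes a row without going through something like save leaves a stale entry behind indefinitely.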
When it comes down to it, it's all a matter of scale. When we were
indexing 10^5 and 10^6 items, we would actually parse some of the
markup on the fly when someone did a search. Sounds crazy, but it
worked all right for a while (I blame Tantek). Now we parse it all out
into a relatively normalized model. We're at 10^8 or so items now. If
we hit another order of magnitude we'll have to rethink things and
probably take some stuff (like BLOBs) out of the relational database
and put them somewhere else.