[uf-discuss] ufXtract - new microformats parser
glenn.jones at madgex.com
Sun Nov 25 10:09:42 PST 2007
I have been working hard on a new microformats parser (ufXtract) to help
explore the real-world issues of creating portable social networks.
Although I have previously designed a number of spiders that can find the
most common hCard and XFN structures, this is my first full-blown
parser. It has been built from the ground up to take configuration
objects which allow the parsing of different microformats or POSH
patterns. It was important that I could parse more general patterns such
as the combined hCard-XFN pattern being promoted for use with friends lists.
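To illustrate what a configuration-driven parser can look like, here is a minimal sketch in Python rather than ufXtract's own C#. The configuration fields ("root", "properties") and every name in it are assumptions made for this example, not ufXtract's actual format. It also assumes well-formed HTML and only collects element text, whereas a real hCard parser would read some values from attributes such as href.

```python
from html.parser import HTMLParser

# Hypothetical configuration objects: field names are illustrative only.
HCARD = {"root": "vcard", "properties": ["fn", "org"]}
HENTRY = {"root": "hentry", "properties": ["entry-title", "entry-content"]}

# Elements that never have an end tag, so they must not touch the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}
ROOT = object()  # stack marker for the element that opened the current item


class ConfiguredParser(HTMLParser):
    """Collects the text of class-named properties inside each root element."""

    def __init__(self, config):
        super().__init__()
        self.config = config
        self.items = []      # one dict per root element found
        self.current = None  # item being filled, or None outside a root
        self.stack = []      # one marker per open element inside the root

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.current is None:
            if self.config["root"] in classes:
                self.current = {}
                self.stack = [ROOT]
            return
        # Inside a root: remember which property (if any) this element opens.
        prop = next((p for p in self.config["properties"] if p in classes), None)
        if prop is not None:
            self.current.setdefault(prop, "")
        self.stack.append(prop)

    def handle_data(self, data):
        if self.current is None:
            return
        # Text belongs to every property element that is currently open.
        for marker in self.stack:
            if isinstance(marker, str):
                self.current[marker] += data

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or self.current is None or not self.stack:
            return
        if self.stack.pop() is ROOT:
            self.items.append({k: v.strip() for k, v in self.current.items()})
            self.current = None


def parse(html, config):
    """Parse html with the given configuration and return the found items."""
    parser = ConfiguredParser(config)
    parser.feed(html)
    parser.close()
    return parser.items
```

The point of the design is that supporting another pattern means passing a different configuration object, for example parse(page, HENTRY) instead of parse(page, HCARD), with no change to the parsing code.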
After some further testing I am going to start producing a number of
portable social network demos and posts. This should also provide
others with experimental APIs. By sharing this early work I hope in
some way to add to the important technical and architectural discussions
that are taking place.
I have already added hCard-XFN, rel="me", rel="next" and hAtom to the
parser. These are the four cornerstone microformats/patterns required to
gather profile and content from other social networks, although for
technical/speed reasons ufXtract currently parses only the hEntry
sub-element of hAtom.
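As a rough illustration of the rel-based patterns mentioned above, the following Python sketch collects links whose rel attribute contains "me" or "next". It is not taken from ufXtract. The class and function names are hypothetical, and it makes no attempt to resolve relative URLs or to follow the links it finds.

```python
from html.parser import HTMLParser


class RelLinkCollector(HTMLParser):
    """Collects (rel value, href) pairs for the wanted rel values."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = set(wanted)
        self.links = []  # pairs in document order

    def handle_starttag(self, tag, attrs):
        if tag not in ("a", "link"):
            return
        attributes = dict(attrs)
        href = attributes.get("href")
        # rel is a space-separated list of values, e.g. rel="friend met".
        rels = (attributes.get("rel") or "").lower().split()
        if href:
            for rel in rels:
                if rel in self.wanted:
                    self.links.append((rel, href))


def find_rel_links(html, wanted=("me", "next")):
    """Return the wanted rel links found in html."""
    collector = RelLinkCollector(wanted)
    collector.feed(html)
    collector.close()
    return collector.links
```

A crawler gathering a profile could use the rel="me" results to discover a person's other pages and the rel="next" results to walk through paginated friends lists.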
The component also contains extensible output options. So far I have
built a simple text format for debugging, plus JSON and XML for building
services. For the more technically minded, ufXtract is a .NET component
written in C#. It uses a combination of DOM structures and XPath
expressions. It can typically parse a page in 50-200ms.
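One common way to keep output options extensible is a small registry of formatter functions. The Python sketch below shows that idea for text, JSON and XML output. The registry, the function names and the XML element names are all assumptions made for the example and say nothing about how ufXtract's C# code is organised.

```python
import json
import xml.etree.ElementTree as ET

FORMATTERS = {}  # output name -> function taking a list of item dicts


def formatter(name):
    """Decorator that registers an output format under the given name."""
    def register(func):
        FORMATTERS[name] = func
        return func
    return register


@formatter("text")
def to_text(items):
    # Plain indented listing, intended for debugging.
    lines = []
    for index, item in enumerate(items):
        lines.append(f"item {index}")
        for key, value in item.items():
            lines.append(f"  {key}: {value}")
    return "\n".join(lines)


@formatter("json")
def to_json(items):
    return json.dumps(items, sort_keys=True)


@formatter("xml")
def to_xml(items):
    # Property names go in an attribute so any name is safe to emit.
    root = ET.Element("items")
    for item in items:
        node = ET.SubElement(root, "item")
        for key, value in item.items():
            child = ET.SubElement(node, "property", name=key)
            child.text = value
    return ET.tostring(root, encoding="unicode")


def render(items, fmt):
    """Render parsed items using the registered formatter named fmt."""
    return FORMATTERS[fmt](items)
```

With this shape, adding a new output format means writing one more decorated function, and callers choose a format by name, for example render(items, "json").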
At the moment I am building a test suite to fine-tune the component's
compliance. It still has some small issues with most of the compound
microformats, which I am trying to address.
If you have any comments or want to point out any issues, please give me
as much feedback as possible.