[uf-discuss] ufXtract's portable social network parser

Mon Dec 3 14:34:20 PST 2007

ufXtract's portable social network parser is a combination of the
ufXtract microformats parser and a spider which follows rel="me" links.
It has been designed to extract profiles and friends lists from social
networks and other sites which have microformats support. The parser
returns two main collections of data: all the rel="me" links and any
hCard-XFN patterns.
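
As an illustration of the first collection, here is a minimal sketch of
pulling rel="me" links out of a page. It is not ufXtract's code (the
parser itself is not shown in this post); the class and function names
are hypothetical, and it uses only Python's standard library.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class RelMeCollector(HTMLParser):
    """Collect the href of every <a> or <link> whose rel contains "me"."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag not in ("a", "link"):
            return
        attr = dict(attrs)
        # rel is a space-separated token list, e.g. rel="me nofollow"
        rel_tokens = (attr.get("rel") or "").lower().split()
        href = attr.get("href")
        if "me" in rel_tokens and href:
            # Resolve relative links against the page they came from.
            self.links.append(urljoin(self.base_url, href))


def rel_me_links(html, base_url):
    """Return absolute URLs of all rel="me" links in the HTML string."""
    collector = RelMeCollector(base_url)
    collector.feed(html)
    return collector.links
```

Matching on whole tokens rather than substrings matters here: a value
such as rel="friend met" must not be mistaken for rel="me".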

The parser API
http://lab.backnetwork.com/ufXtract-psn/ 

A demo using JavaScript and JSON 
http://lab.backnetwork.com/ufXtract-psn/demo01.htm

The Parser
You can restrict the parser to a single domain or let it follow links
across multiple domains. Currently, the number of pages that will be
parsed is limited to 20. Each collection item is given an additional
source-url attribute to identify its origin.
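
The behaviour described above can be sketched roughly as follows. This
is an assumption about how such a spider might work, not the actual
implementation: crawl, get_links and single_domain are hypothetical
names, and get_links stands in for fetching a page and extracting its
rel="me" links. Only the limit of 20 pages and the source-url attribute
come from the description above.

```python
from collections import deque
from urllib.parse import urlparse

MAX_PAGES = 20  # the page limit mentioned in the post


def crawl(start_url, get_links, single_domain=True, max_pages=MAX_PAGES):
    """Breadth-first walk over rel="me" links.

    get_links(url) is a hypothetical callback returning the rel="me"
    URLs found on that page.
    """
    start_host = urlparse(start_url).netloc
    queue = deque([start_url])
    seen = {start_url}
    items = []
    pages_parsed = 0

    while queue and pages_parsed < max_pages:
        url = queue.popleft()
        pages_parsed += 1
        for link in get_links(url):
            # Tag each item with the page it was found on.
            items.append({"url": link, "source-url": url})
            if link in seen:
                continue
            # In single-domain mode, record the link but do not follow it.
            if single_domain and urlparse(link).netloc != start_host:
                continue
            seen.add(link)
            queue.append(link)
    return items
```

In single-domain mode the sketch still reports off-domain links; it
simply does not follow them. Whether the real parser does the same is
not stated here.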

There is support for both XML and JSON output, for both client and
server-side development. 

The parser also uses a version of the representative hCard concept,
which tries to identify the hCard representing the profile owner. The
implementation is a little more complex than described on the
microformats wiki as it extends over multiple pages and domains. This
means you may find multiple representative hCards from one call to the
API, but there should only ever be one per URL.
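
To make the "one per URL" idea concrete, here is a simplified sketch.
The rule order (uid and url matching the page, then a url carrying
rel="me", then a lone hCard) is loosely modelled on the proposal on the
microformats wiki and should be treated as an assumption; the parser's
real logic is described above as more complex. The dict keys and
function names are hypothetical.

```python
def representative_hcard(page_url, hcards):
    """Return the hCard most likely to describe the page owner, or None.

    Each hCard is a plain dict; the keys "url", "uid" and
    "url_is_rel_me" are a hypothetical shape chosen for this sketch.
    """
    # 1. An hCard whose uid and url both equal the page URL.
    for card in hcards:
        if card.get("uid") == page_url and card.get("url") == page_url:
            return card
    # 2. An hCard whose url is also marked up with rel="me".
    for card in hcards:
        if card.get("url") and card.get("url_is_rel_me"):
            return card
    # 3. The only hCard on the page.
    if len(hcards) == 1:
        return hcards[0]
    return None


def representatives_by_url(pages):
    """Map each page URL to at most one representative hCard."""
    result = {}
    for page_url, hcards in pages.items():
        card = representative_hcard(page_url, hcards)
        if card is not None:
            result[page_url] = card
    return result
```

A crawl over several pages can therefore yield several representative
hCards, but keying the result by page URL keeps it to one per URL.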

The Demo 
I believe there are a number of different ways that this functionality
could be designed into web sites. So I have provided a simple interface
design to demonstrate one possibility. It's a bit of a homage to the
getsatisfaction.com registration page with a few extra twists. I would
like to thank my co-worker James Wragg who created the JavaScript for
the demo. 

Of the sites listed on the demo, last.fm and ma.gnolia.com return the
best results. The other sites have differing levels of portable social
network support. It also works well against blogs such as adactio.com or
tantek.com that are marked up with rel="me". It's worth trying out the
two depth search levels.

Pages not parsing 
On some sites, such as twitter.com, you may find that only certain pages
are parsed. These sites often have good microformats support, but parts
of their functionality are locked behind logins. The parser does not
support authenticated sessions, as this would mean asking the user to
pass me their log-in details, which is a really bad idea. If I can lay
my hands on good OpenID and/or OAuth C# libraries, I will try to
implement some different types of authentication.

Research
This is all research work and still under development; I have placed it
on the web for others to experiment with. I hope you enjoy playing with it.

Glenn