[uf-discuss] ANN: Curiosity - an XPath based Microformats scraper

Piercarlo Slavazza piercarlos at gmail.com
Thu Mar 23 05:40:42 PST 2006

Hi all,

I would like to announce the first release of Curiosity, a .NET
XPath-based screen scraper and push platform, which can be used straight
away for scraping Microformats data.

***  http://www.go-curiosity.com  ***

Curiosity can extract data from ANY web page, because it uses Tidy
(http://tidy.sourceforge.net/) to convert the page into XHTML.
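
As a rough illustration of why the XHTML conversion matters: once a page
is well-formed XML, an ordinary XML parser and XPath expressions can be
applied to it. The sketch below is Python rather than .NET and uses only
the limited XPath subset of the Python standard library. The hCard
snippet is a hand-written stand-in for Tidy output, and every name in it
(PAGE, has_class, extract_hcard_names) is an assumption made for this
example, not part of Curiosity.

```python
import xml.etree.ElementTree as ET

XHTML_NS = {"x": "http://www.w3.org/1999/xhtml"}

# Hand-written stand-in for the XHTML that Tidy would produce (assumption).
PAGE = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <div class="vcard">
      <span class="fn">Jane Example</span>
      <a class="url" href="http://example.org/">home page</a>
    </div>
    <div class="vcard contact">
      <span class="fn n">John Sample</span>
    </div>
  </body>
</html>"""


def has_class(element, name):
    """True if `name` is one of the space-separated class tokens."""
    return name in element.get("class", "").split()


def extract_hcard_names(xhtml_text):
    """Return the text of every `fn` element found inside a `vcard` div."""
    root = ET.fromstring(xhtml_text)
    names = []
    # ElementTree's XPath subset has no contains(), so select every div
    # carrying a class attribute and check the class tokens in Python.
    for card in root.findall(".//x:div[@class]", XHTML_NS):
        if not has_class(card, "vcard"):
            continue
        for node in card.findall(".//*[@class]"):
            if has_class(node, "fn"):
                names.append((node.text or "").strip())
    return names


print(extract_hcard_names(PAGE))
```

A full XPath 1.0 engine could express the class test directly in the
expression; the Python-side filtering here only works around the
standard library's subset.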

For every page defined in its configuration, Curiosity maintains a
history of the data extracted in the past: hence, it can easily identify
new, modified and deleted items.
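
The idea of comparing the current extraction with the stored history can
be sketched as follows. This is only a minimal Python illustration, not
Curiosity's actual implementation: it assumes each extracted item has a
stable key (here, a URL) and that a snapshot is a plain dictionary from
key to item data. The function name diff_snapshots and the sample data
are hypothetical.

```python
def diff_snapshots(previous, current):
    """Compare two {item_key: item_data} snapshots of the same page.

    Returns a dict with the items that are new, modified, or deleted
    in `current` relative to `previous`.
    """
    new = {k: v for k, v in current.items() if k not in previous}
    deleted = {k: v for k, v in previous.items() if k not in current}
    # For modified items keep both versions: (old_value, new_value).
    modified = {
        k: (previous[k], v)
        for k, v in current.items()
        if k in previous and previous[k] != v
    }
    return {"new": new, "modified": modified, "deleted": deleted}


# Hypothetical snapshots from two successive scraping runs.
yesterday = {
    "http://example.org/a": {"title": "First item"},
    "http://example.org/b": {"title": "Second item"},
}
today = {
    "http://example.org/b": {"title": "Second item (updated)"},
    "http://example.org/c": {"title": "Third item"},
}

changes = diff_snapshots(yesterday, today)
for kind, items in changes.items():
    print(kind, sorted(items))
```

Keying on something stable matters here: if items were compared by
position on the page, a single insertion would make every later item
look modified.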

Moreover, Curiosity can be configured with (XPath-based) crawling rules,
form-based authentication (even over HTTPS) and proxy settings.

The extracted data can be handled by an extensible architecture of
"providers": Curiosity is equipped with providers for sending the data
by email, and for creating RSS feeds and uploading them via FTP.
If you want to build your own provider, you just have to implement a
simple interface using a .NET language (the C# sources of a sample custom
provider for pushing data into an MS Jet database are available).
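
To give a feel for the provider idea, here is a small Python sketch of a
plug-in contract of this kind. It is not the real Curiosity interface,
which is a .NET interface whose members are not described in this
message: the names Provider, handle, InMemoryProvider and dispatch are
all hypothetical, chosen only to show how extracted items could be
handed to interchangeable back ends.

```python
from abc import ABC, abstractmethod


class Provider(ABC):
    """Hypothetical provider contract: receives a batch of extracted items."""

    @abstractmethod
    def handle(self, page_url, items):
        """Process the items extracted from `page_url`."""


class InMemoryProvider(Provider):
    """Sample provider that simply records what it receives.

    A real provider would send an email, write a feed, or insert rows
    into a database at this point.
    """

    def __init__(self):
        self.received = []

    def handle(self, page_url, items):
        for item in items:
            self.received.append((page_url, item))


def dispatch(providers, page_url, items):
    """Hand the same extracted items to every configured provider."""
    for provider in providers:
        provider.handle(page_url, items)


store = InMemoryProvider()
dispatch([store], "http://example.org/", [{"fn": "Jane Example"}])
print(store.received)
```

Because the scraper only depends on the abstract contract, adding a new
destination means writing one more class and leaving the rest untouched.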

To engineer the XPath scraping rules, a visual tool named Curiosity
Studio is supplied: it is just a matter of selecting the relevant text
in an embedded Internet Explorer, and its XPath is computed
automatically.

Finally, Curiosity can also be run in application server mode: this way,
the scraping facilities can be invoked by means of SOAP web services.

USE in a private domain, and for NON COMMERCIAL use in research projects.

piercarlo slavazza
