[uf-discuss] ANN: Curiosity - an XPath based Microformats scraper
piercarlos at gmail.com
Thu Mar 23 05:40:42 PST 2006
I would like to announce the first release of Curiosity, a .NET
XPath-based screen scraper and push platform, which can be used straight
away to scrape Microformats data.
*** http://www.go-curiosity.com ***
Curiosity can extract data from ANY web page, because it uses Tidy
(http://tidy.sourceforge.net/) to convert the page into XHTML.
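
To illustrate why well-formed XHTML matters, here is a minimal Python
sketch of XPath-style extraction of an hCard from an already-tidied
page. It is not Curiosity's own code: the sample markup and the
extract_hcards function are made up for the example, and Python's
ElementTree supports only a subset of XPath.

```python
import xml.etree.ElementTree as ET

# Hypothetical, already-tidied XHTML fragment containing one hCard.
XHTML = """<html>
  <body>
    <div class="vcard">
      <span class="fn">Jane Doe</span>
      <a class="url" href="http://example.org/">home page</a>
    </div>
  </body>
</html>"""


def extract_hcards(xhtml_text):
    """Return one dict per element whose class attribute is exactly 'vcard'."""
    root = ET.fromstring(xhtml_text)
    cards = []
    # [@class='vcard'] is an exact attribute match, so elements with
    # several class names (e.g. "vcard author") are not matched here.
    for card in root.findall(".//*[@class='vcard']"):
        name = card.find(".//*[@class='fn']")
        link = card.find(".//*[@class='url']")
        cards.append({
            "fn": name.text.strip() if name is not None and name.text else None,
            "url": link.get("href") if link is not None else None,
        })
    return cards


print(extract_hcards(XHTML))
```

The same queries would fail on typical "tag soup" HTML, which is the
reason for a tidying step before any XPath is applied.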
For every page defined in its configuration, Curiosity maintains a
history of the data extracted in the past: hence, it can easily identify
new, modified and deleted items.
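
The idea of comparing a new extraction with the stored history can be
sketched as follows. This is only an assumption about how such a
comparison might work, not a description of Curiosity's internals: the
diff_snapshots function and the choice of keying items by a string
identifier are hypothetical.

```python
def diff_snapshots(previous, current):
    """Compare two {item_key: item_data} snapshots of the same page.

    Returns the keys that are new, modified, or deleted in `current`
    relative to `previous`, each as a sorted list.
    """
    new = sorted(k for k in current if k not in previous)
    deleted = sorted(k for k in previous if k not in current)
    modified = sorted(
        k for k in current if k in previous and current[k] != previous[k]
    )
    return {"new": new, "modified": modified, "deleted": deleted}


# Hypothetical snapshots: yesterday's extraction and today's.
yesterday = {"a": "Jane Doe", "b": "John Roe", "c": "Old entry"}
today = {"a": "Jane Doe", "b": "John Roe Jr.", "d": "New entry"}

print(diff_snapshots(yesterday, today))
```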
Moreover, Curiosity can be configured with (XPath-based) crawling rules,
form-based authentication (even over HTTPS) and proxy settings.
The extracted data can be handled by an extensible architecture of
"providers": Curiosity is equipped with providers for sending the data
by email, and for creating RSS feeds and uploading them via FTP.
If you want to build your own provider, you just have to implement a
simple interface using any .NET language (the C# sources of a sample
custom provider for pushing data into an MS Jet database are available).
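
As a rough picture of what a custom provider involves, here is a Python
sketch of the pattern. The real interface is a .NET one and is not
reproduced here: the Provider class, its handle method and the
SqliteProvider below are all hypothetical, and SQLite stands in for the
MS Jet database of the actual sample.

```python
import sqlite3
from abc import ABC, abstractmethod


class Provider(ABC):
    """Hypothetical provider contract; not Curiosity's actual interface."""

    @abstractmethod
    def handle(self, items):
        """Receive a list of extracted items (dicts) and deliver them."""


class SqliteProvider(Provider):
    """Example provider that stores extracted items in a SQLite table."""

    def __init__(self, connection):
        self.connection = connection
        self.connection.execute(
            "CREATE TABLE IF NOT EXISTS items (name TEXT, url TEXT)"
        )

    def handle(self, items):
        # Each item is a dict with the keys 'fn' and 'url'.
        self.connection.executemany(
            "INSERT INTO items (name, url) VALUES (:fn, :url)", items
        )
        self.connection.commit()
        return len(items)


conn = sqlite3.connect(":memory:")
provider = SqliteProvider(conn)
stored = provider.handle([{"fn": "Jane Doe", "url": "http://example.org/"}])
print(stored, conn.execute("SELECT name, url FROM items").fetchall())
```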
To help engineer the XPath scraping rules, a visual tool named
Curiosity Studio is supplied: it is just a matter of selecting the
relevant text in an embedded Internet Explorer and its XPath will be
Finally, Curiosity can also be run in application server mode: this way,
the scraping facilities can be invoked by means of SOAP web services.
Curiosity is available FREE OF CHARGE FOR NON-COMMERCIAL AND PERSONAL
USE in a private domain, and for NON-COMMERCIAL use in research projects.