[uf-discuss] Re: Perl microformat parsing

Toby A Inkster mail at tobyinkster.co.uk
Sat Feb 23 06:50:12 PST 2008


Takatsugu Shigeta wrote:

> Web::Scraper
> http://search.cpan.org/dist/Web-Scraper/

This looks like a handy module for some purposes, but it's not sufficient 
for fully parsing microformats:

1. It doesn't seem to support the rel attribute on links, so it will fall 
down when looking for rel-tag (which is used for encoding categories in 
hcard). 

2. For images, the alt text should normally be returned (except for a few 
properties like photo and logo in hcard), but this module doesn't read alt 
text. [ It seems that my code kicks Operator's ass in this dept ;-) ]

3. The module won't handle nested hcards properly. Nested hcards are 
sometimes used for the "agent" property, and sometimes just to be damn 
annoying.

Probably other reasons too, but I can't be bothered to think them all 
through. These three ought to be enough to deter people from using it for 
serious parsing though.
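
To make points 1 and 2 concrete, here is a minimal sketch in Python (standard library only). It is not taken from any of the modules discussed here; the class name TagAndAltCollector and the sample markup are invented for illustration. It assumes that rel is treated as a space-separated token list and that the tag name is the last path segment of the link's href, as rel-tag defines it.

```python
from html.parser import HTMLParser
from urllib.parse import unquote, urlparse


class TagAndAltCollector(HTMLParser):
    """Hypothetical helper: collects rel-tag names and image alt text."""

    def __init__(self):
        super().__init__()
        self.tags = []       # tag names taken from rel="tag" links
        self.alt_texts = []  # alt text of every <img> that carries one

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a":
            # rel holds space-separated tokens, so split before testing.
            rels = (attrs.get("rel") or "").lower().split()
            if "tag" in rels and attrs.get("href"):
                # The tag name is the last path segment of the href.
                path = urlparse(attrs["href"]).path.rstrip("/")
                self.tags.append(unquote(path.rsplit("/", 1)[-1]))
        elif tag == "img" and attrs.get("alt") is not None:
            self.alt_texts.append(attrs["alt"])


sample = """
<div class="vcard">
  <img class="fn" src="me.png" alt="Toby Inkster">
  <a class="category" rel="tag" href="http://example.com/tags/perl">Perl</a>
  <a href="http://example.com/about">About</a>
</div>
"""

collector = TagAndAltCollector()
collector.feed(sample)
print(collector.tags)
print(collector.alt_texts)
```

The second link is ignored because it has no rel attribute. A real parser would also need the photo and logo exceptions mentioned in point 2, which this sketch leaves out.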

To parse microformats properly you need a DOM, or something of similar 
sophistication. For what it's worth, for alpha3 of my code I've switched 
to XML::LibXML, which is an alternative to XML::DOM. It copes much better 
with parsing random HTML off the web, and has namespace support should I 
decide I need it for anything.
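
As an illustration of why a tree helps with nested hCards, here is a rough Python sketch using xml.dom.minidom from the standard library. It is not XML::LibXML and not the alpha3 code; the helper names own_properties and text_of are invented, and the sample is assumed to be well-formed XHTML (minidom will not cope with random HTML off the web). The idea is to stop descending whenever a nested vcard is reached, so its properties are not credited to the outer card.

```python
from xml.dom import minidom


def class_tokens(element):
    """Return the element's class attribute as a list of tokens."""
    return element.getAttribute("class").split()


def text_of(node):
    """Concatenate all text beneath a node."""
    if node.nodeType == node.TEXT_NODE:
        return node.data
    return "".join(text_of(child) for child in node.childNodes)


def own_properties(root, names):
    """Collect property elements owned by root, skipping nested hCards."""
    found = {}

    def walk(node):
        for child in node.childNodes:
            if child.nodeType != child.ELEMENT_NODE:
                continue
            tokens = class_tokens(child)
            for name in names:
                if name in tokens:
                    found.setdefault(name, []).append(child)
            # A nested vcard owns everything below it: do not descend.
            if "vcard" not in tokens:
                walk(child)

    walk(root)
    return found


sample = """<div class="vcard">
  <span class="fn">Toby Inkster</span>
  <div class="agent vcard">
    <span class="fn">Joe Bloggs</span>
  </div>
</div>"""

outer = minidom.parseString(sample).documentElement
props = own_properties(outer, ["fn", "agent"])
names = [text_of(e).strip() for e in props["fn"]]
agent = props["agent"][0]
agent_names = [text_of(e).strip() for e in own_properties(agent, ["fn"])["fn"]]

# A flat search, by contrast, finds both fn elements under the outer card.
naive = [e for e in outer.getElementsByTagName("span") if "fn" in class_tokens(e)]
print(names, agent_names, len(naive))
```

The flat search is roughly what a selector-only scraper gives you, which is why the nested agent's name leaks into the outer card.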

A preview of what I've already got working for alpha3:

	- Partial support for RDFa. (All the important bits.)

	- Support for CURIEs.

	- Internal storage of all parsed data as RDF triples, with the
	  ability to dump parsed metadata (including microformats) as
	  valid RDF.
	  (The web interface offers a choice of Perl object dump or
	  formatted RDF.)

	- Support for tag URIs <http://taguri.org> and geo URIs
	  <http://geouri.org>.
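
To show what CURIE expansion and triple storage amount to, here is a small Python sketch. It is an illustration under assumptions, not the alpha3 internals: the function names, the prefix map, the page URL and the choice of the W3C vCard namespace for hCard properties are all mine. Triples are kept as plain tuples and dumped as N-Triples.

```python
# Prefix map: the dc namespace is the Dublin Core 1.1 element set; using the
# W3C vCard namespace for hCard properties is an assumption of this sketch.
PREFIXES = {
    "dc": "http://purl.org/dc/elements/1.1/",
    "vcard": "http://www.w3.org/2006/vcard/ns#",
}


def expand_curie(curie, prefixes):
    """Expand 'prefix:reference' into a full URI using a prefix map."""
    prefix, sep, reference = curie.partition(":")
    if not sep or prefix not in prefixes:
        raise ValueError(f"cannot expand CURIE: {curie!r}")
    return prefixes[prefix] + reference


def escape_literal(value):
    """Escape backslashes, quotes and newlines for an N-Triples literal."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def to_ntriples(triples):
    """Serialise (subject, predicate, object, object_is_literal) tuples."""
    lines = []
    for subject, predicate, obj, obj_is_literal in triples:
        rendered = f'"{escape_literal(obj)}"' if obj_is_literal else f"<{obj}>"
        lines.append(f"<{subject}> <{predicate}> {rendered} .")
    return "\n".join(lines)


subject = "http://example.com/page#foo"  # hypothetical page and fragment
triples = [
    (subject, expand_curie("vcard:fn", PREFIXES), "Toby Inkster", True),
    (subject, expand_curie("vcard:url", PREFIXES), "http://tobyinkster.co.uk/", False),
]
output = to_ntriples(triples)
print(output)
```

Keeping everything as triples means microformat data and RDFa data end up in the same shape, whatever markup they came from.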

There are still a few things that I want to take care of before I release 
alpha3 publicly. One thing I'm looking forward to getting working is 
uniting metadata from different sources more fully. Say, for example, 
there is an hCard like this:

	<div class="vcard" id="foo">
		<span class="fn">Toby Inkster</span>
	</div>

And somewhere else on the page I have some RDFa:

	<span about="#foo" property="dc:creator">Joe Bloggs</span>

Then it will know that my hCard was created by Joe Bloggs. Not quite there 
yet, but nearly.
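
The merge could be sketched like this in Python (standard library only). This is a guess at the general idea rather than the alpha3 implementation. It assumes the hCard root carries class "vcard" and the id "foo" (no leading "#"), so that about="#foo" resolves to the same URI. The page URL, the prefix map and the function names are hypothetical, and nested hCards are ignored for brevity.

```python
from collections import defaultdict
from urllib.parse import urljoin
from xml.dom import minidom

BASE = "http://example.com/page"  # hypothetical page URL
PREFIXES = {  # assumed vocabulary choices for this sketch
    "dc": "http://purl.org/dc/elements/1.1/",
    "vcard": "http://www.w3.org/2006/vcard/ns#",
}

PAGE = """<body>
  <div class="vcard" id="foo">
    <span class="fn">Toby Inkster</span>
  </div>
  <span about="#foo" property="dc:creator">Joe Bloggs</span>
</body>"""


def expand(curie):
    """Expand a CURIE using the prefix map above."""
    prefix, _, reference = curie.partition(":")
    return PREFIXES[prefix] + reference


def text_of(node):
    """Concatenate all text beneath a node."""
    if node.nodeType == node.TEXT_NODE:
        return node.data
    return "".join(text_of(child) for child in node.childNodes)


def extract_triples(document, base):
    """Return (subject, predicate, object) triples from both sources."""
    triples = []
    for element in document.getElementsByTagName("*"):
        classes = element.getAttribute("class").split()
        # Microformat source: an hCard with an id gets the subject base#id.
        if "vcard" in classes and element.hasAttribute("id"):
            subject = urljoin(base, "#" + element.getAttribute("id"))
            for child in element.getElementsByTagName("*"):
                if "fn" in child.getAttribute("class").split():
                    triples.append((subject, expand("vcard:fn"), text_of(child).strip()))
        # RDFa source: about names the subject, property the predicate.
        if element.hasAttribute("about") and element.hasAttribute("property"):
            subject = urljoin(base, element.getAttribute("about"))
            predicate = expand(element.getAttribute("property"))
            triples.append((subject, predicate, text_of(element).strip()))
    return triples


# Group by subject: both sources resolve to the same URI, so they merge.
merged = defaultdict(dict)
for s, p, o in extract_triples(minidom.parseString(PAGE), BASE):
    merged[s][p] = o

card = merged["http://example.com/page#foo"]
print(card)
```

Because both the id and the about value resolve to the same absolute URI, the fn from the hCard and the dc:creator from the RDFa end up attached to one subject.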

-- 
Toby A Inkster BSc (Hons) ARCS
[Geek of HTML/SQL/Perl/PHP/Python/Apache/Linux]
[OS: Linux 2.6.17.14-mm-desktop-9mdvsmp, up 24 days, 20:42.]

                               Bottled Water
          http://tobyinkster.co.uk/blog/2008/02/18/bottled-water/


