[uf-discuss] Re: Perl microformat parsing
Toby A Inkster
mail at tobyinkster.co.uk
Sat Feb 23 06:50:12 PST 2008
Takatsugu Shigeta wrote:
> Web::Scraper
> http://search.cpan.org/dist/Web-Scraper/
This looks like a handy module for some purposes, but it's not sufficient
for fully parsing microformats:
1. It doesn't seem to support the rel attribute on links, so it will fall
down when looking for rel-tag (which is used for encoding categories in
hCard).
2. For images, the alt text should normally be returned (except for a few
properties like photo and logo in hCard), but this module doesn't read alt
text. [ It seems that my code kicks Operator's ass in this dept ;-) ]
3. The module won't handle nested hCards properly. Nested hCards are
sometimes used for the "agent" property, and sometimes just to be damn
annoying.
Probably other reasons too, but I can't be bothered to think them all
through. These three ought to be enough to deter people from using it for
serious parsing though.
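
To illustrate points 1 and 2, here is a rough sketch (in Python rather than
Perl, and not taken from my parser) of what a DOM-based reader has to do.
It assumes the input is well-formed XHTML so that the standard library's
xml.etree.ElementTree can load it; the snippet, the example.org URL and the
helper names text_value and rel_tag_value are all made up for the example.

```python
import xml.etree.ElementTree as ET
from urllib.parse import unquote, urlparse

# Hypothetical, well-formed test markup (real pages are rarely this tidy).
SNIPPET = """
<div class="vcard">
  <img class="fn" src="toby.png" alt="Toby Inkster" />
  <a rel="tag" class="category" href="http://example.org/tags/perl">Perl hacker</a>
</div>
"""

def classes(el):
    """Return the element's class attribute as a list of tokens."""
    return (el.get("class") or "").split()

def text_value(el):
    """Plain-text property value: alt text for <img>, else the text content."""
    if el.tag == "img":
        return el.get("alt", "")
    return "".join(el.itertext()).strip()

def rel_tag_value(el):
    """rel-tag: the tag is the last path segment of the href, not the link text."""
    path = urlparse(el.get("href", "")).path
    return unquote(path.rstrip("/").rsplit("/", 1)[-1])

root = ET.fromstring(SNIPPET)
card = {}
for el in root.iter():
    if "fn" in classes(el):
        card["fn"] = text_value(el)
    if "category" in classes(el):
        rels = (el.get("rel") or "").split()
        value = rel_tag_value(el) if "tag" in rels else text_value(el)
        card.setdefault("category", []).append(value)

print(card)
```

Note that the category comes out as "perl" (from the href), not "Perl
hacker" (the link text), which is exactly the distinction a scraper that
ignores rel will miss.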
To parse microformats properly you need a DOM, or something of similar
sophistication. For what it's worth, for alpha3 of my code I've switched
to XML::LibXML, which is an alternative to XML::DOM. It copes much better
with parsing random HTML off the web, and has namespace support should I
decide I need it for anything.
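
The nested hCard case (point 3 above) shows why a tree matters. The sketch
below is only an illustration, again in Python with xml.etree.ElementTree
and hand-written well-formed markup; own_properties is a hypothetical
helper, not a function from my code or from XML::LibXML. It walks the tree
and refuses to descend into a nested card, so each card only collects its
own properties. (hCard's root class name is "vcard".)

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: an outer card whose "agent" is itself an hCard.
SNIPPET = """
<div class="vcard">
  <span class="fn">Toby Inkster</span>
  <div class="agent vcard">
    <span class="fn">Joe Bloggs</span>
  </div>
</div>
"""

def classes(el):
    """Return the element's class attribute as a list of tokens."""
    return (el.get("class") or "").split()

def own_properties(card, name):
    """Elements with class `name` under `card`, skipping nested hCards' contents."""
    found = []

    def walk(el):
        for child in el:
            if name in classes(child):
                found.append(child)
            if "vcard" in classes(child):
                continue  # a nested card owns everything below it
            walk(child)

    walk(card)
    return found

outer = ET.fromstring(SNIPPET)
inner = own_properties(outer, "agent")[0]
outer_fn = [el.text for el in own_properties(outer, "fn")]
inner_fn = [el.text for el in own_properties(inner, "fn")]
print(outer_fn, inner_fn)
```

A flat "find everything with class fn" query would return both names for
the outer card; the tree walk keeps them apart.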
A preview of what I've already got working for alpha3:
- Partial support for RDFa. (All the important bits.)
- Support for CURIEs.
- Internally store all parsed data as RDF triples. Ability to
  dump parsed metadata (including microformats) as valid RDF.
  (The web interface offers a choice of Perl object dump or
  formatted RDF.)
- Support for tag URIs <http://taguri.org> and geo URIs
  <http://geouri.org>.
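
For anyone unfamiliar with CURIEs and triples, the idea can be sketched as
follows. This is a simplified Python illustration, not my parser's internal
representation: the prefix table, the example.org subject and the function
names expand_curie and to_ntriples are assumptions for the example, and the
literal escaping only covers backslashes and double quotes.

```python
# Hypothetical prefix table; a real parser would read prefixes from the page.
PREFIXES = {
    "dc": "http://purl.org/dc/elements/1.1/",
    "foaf": "http://xmlns.com/foaf/0.1/",
}

def expand_curie(curie, prefixes=PREFIXES):
    """Expand 'prefix:reference' into a full URI using the prefix table."""
    prefix, sep, reference = curie.partition(":")
    if not sep or prefix not in prefixes:
        raise ValueError(f"cannot expand CURIE: {curie!r}")
    return prefixes[prefix] + reference

def to_ntriples(triples):
    """Serialise (subject, predicate, object, object_is_uri) tuples as N-Triples."""
    lines = []
    for subject, predicate, obj, obj_is_uri in triples:
        if obj_is_uri:
            rendered = f"<{obj}>"
        else:
            escaped = obj.replace("\\", "\\\\").replace('"', '\\"')
            rendered = f'"{escaped}"'
        lines.append(f"<{subject}> <{predicate}> {rendered} .")
    return "\n".join(lines)

triples = [
    ("http://example.org/page#foo", expand_curie("dc:creator"), "Joe Bloggs", False),
]
output = to_ntriples(triples)
print(output)
```

Storing everything as triples means microformat data and RDFa data end up
in the same shape, so one dump routine can serve both.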
There are still a few things that I want to take care of before I release
alpha3 publicly. One thing I'm looking forward to getting working is
uniting metadata from different sources more fully. Say, for example,
there is an hCard like this:
<div class="vcard" id="foo">
<span class="fn">Toby Inkster</span>
</div>
And somewhere else on the page I have some RDFa:
<span about="#foo" property="dc:creator">Joe Bloggs</span>
Then it will know that my hCard was created by Joe Bloggs. Not quite there
yet, but nearly.
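
The joining step can be sketched like this. It is a Python illustration
under stated assumptions, not the alpha3 code: the base URI is made up, the
card's root is assumed to carry hCard's "vcard" class and an id of "foo"
(so that "#foo" resolves to it), dc:creator is expanded by hand, and the
predicate URI used for fn is simply picked for the example.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

BASE = "http://example.org/page"  # hypothetical page address
PAGE = """
<body>
  <div class="vcard" id="foo">
    <span class="fn">Toby Inkster</span>
  </div>
  <span about="#foo" property="dc:creator">Joe Bloggs</span>
</body>
"""
# Predicate URIs chosen for illustration only.
VCARD_FN = "http://www.w3.org/2006/vcard/ns#fn"
DC_CREATOR = "http://purl.org/dc/elements/1.1/creator"

def classes(el):
    """Return the element's class attribute as a list of tokens."""
    return (el.get("class") or "").split()

root = ET.fromstring(PAGE)
triples = []
for el in root.iter():
    # Microformat side: the card's id becomes a fragment on the page URI.
    if "vcard" in classes(el) and el.get("id"):
        subject = urljoin(BASE, "#" + el.get("id"))
        for child in el.iter():
            if "fn" in classes(child):
                triples.append((subject, VCARD_FN, child.text))
    # RDFa side: @about is resolved against the same base URI.
    if el.get("about") is not None and el.get("property") == "dc:creator":
        subject = urljoin(BASE, el.get("about"))
        triples.append((subject, DC_CREATOR, el.text))

# Group by subject: both sources now describe the same resource.
by_subject = {}
for subject, predicate, value in triples:
    by_subject.setdefault(subject, {})[predicate] = value
print(by_subject)
```

Because both the id and the about attribute resolve to the same URI, the
fn from the hCard and the creator from the RDFa land on one subject.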
--
Toby A Inkster BSc (Hons) ARCS
[Geek of HTML/SQL/Perl/PHP/Python/Apache/Linux]
[OS: Linux 2.6.17.14-mm-desktop-9mdvsmp, up 24 days, 20:42.]
Bottled Water
http://tobyinkster.co.uk/blog/2008/02/18/bottled-water/