[uf-discuss] Re: Perl microformat parsing

Takatsugu Shigeta takatsugu.shigeta at gmail.com
Fri Feb 22 10:39:30 PST 2008


Hi Toby,

If all you need is to scrape web pages,
I'd recommend the following CPAN module.

Web::Scraper
http://search.cpan.org/dist/Web-Scraper/

== sample code ==
#!/usr/bin/perl

use strict;
use warnings;
use Data::Dumper;
use URI;
use Web::Scraper;

my $url = 'http://diveintomark.org/projects/greasemonkey/hcard/tests/2-4-2-vcard.xhtml';

# Collect the text of every fn, tel and title element inside an hCard;
# the '[]' suffix gathers all matches for a key into an array reference.
my $fn = scraper {
    process '.vcard .fn', 'fn[]' => 'TEXT';
    process '.vcard .tel', 'tel[]' => 'TEXT';
    process '.vcard .title', 'title[]' => 'TEXT';
    result 'fn', 'tel', 'title';
}->scrape(URI->new($url));

print Dumper $fn;

== sample output ==
$ perl hcard.pl
$VAR1 = {
          'tel' => [
                     '+1-919-555-7878'
                   ],
          'title' => [
                       'Area Administrator, Assistant'
                     ],
          'fn' => [
                    'Joe Friday'
                  ]
        };
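
Web::Scraper also lets you nest one scraper inside another, which helps
when a page carries several hCards and you want the fields grouped per
card. Here is a small sketch of that pattern; the selectors are generic
hCard class names and the wiki URL is just an example, so treat it as an
illustration rather than tested code.

#!/usr/bin/perl

use strict;
use warnings;
use Data::Dumper;
use URI;
use Web::Scraper;

# The inner scraper runs once per matched .vcard node, so each
# element of 'cards' is one complete hCard record.
my $cards = scraper {
    process '.vcard', 'cards[]' => scraper {
        process '.fn',  'fn'  => 'TEXT';
        process '.tel', 'tel' => 'TEXT';
        process '.url', 'url' => '@href';  # '@attr' extracts an attribute
    };
    result 'cards';
};

print Dumper $cards->scrape(URI->new('http://microformats.org/wiki/hcard'));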

Thanks.

-- shigeta

On Thu, Feb 21, 2008 at 7:14 PM, Toby A Inkster <mail at tobyinkster.co.uk> wrote:
> Rob Manson wrote:
>
>  > Here's a patch that proves this is the problem, using a quick and
>  > dirty regex fix:
>  >
>  > 848d847
>  > <       $html =~ s/\&nbsp\;//igm;
>  >
>  > I tried it on both a simple hcard like
>  > http://microformats.org/wiki/User:RobManson and the full hcard page
>  > (which is veeeeery slow to parse) http://microformats.org/wiki/hcard and
>  > the patch fixes it.
>
>  Thanks for your hint. The XML::Parser module is able to fetch DTDs and use
>  them, so it should be able to handle expansion of named entities by itself
>  -- the only problem was that I had disabled that, partly to cut down on
>  bandwidth usage, but also because I thought validating against the DTD
>  would break too many pages. Anyway, I've re-enabled it, and this seems to
>  have fixed more pages than it's broken. I'm guessing that XML::Parser does
>  not validate against the DTD -- it just uses it to expand entities.
>
>  With regard to speed: that's because I'm using LWP::RobotUA instead of
>  LWP::UserAgent. It downloads robots.txt (and honours it) and also enforces
>  a delay between requests, 1 minute by default. I thought I had set the
>  delay to 10 seconds, but I was trying to set it in the LWP::RobotUA
>  constructor, which apparently does not work. The delay is now set to
>  5 seconds and works, which has made the parser significantly faster.
>
>  New version (0.1-alpha2.1):
>
>  Online:   http://buzzword.org.uk/cognition/cognition-0.1-alpha2.1.pl
>  Download: http://buzzword.org.uk/cognition/cognition-0.1-alpha2.1.txt
>
>  This successfully parses both the pages you mentioned above.
>
>  Thanks again,
>
>
>  --
>  Toby A Inkster BSc (Hons) ARCS
>  [Geek of HTML/SQL/Perl/PHP/Python/Apache/Linux]
>  [OS: Linux 2.6.17.14-mm-desktop-9mdvsmp, up 22 days, 16:20.]
>
>
>                                Bottled Water
>           http://tobyinkster.co.uk/blog/2008/02/18/bottled-water/
>
>
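
A footnote on the entity expansion discussed above: expat, which
XML::Parser wraps, only knows the five built-in XML entities, so a bare
&nbsp; is fatal unless the external DTD that declares it is read. The
option controlling that is ParseParamEnt; the sketch below is my own
illustration rather than code from Cognition, and it needs network
access to fetch the W3C DTD.

#!/usr/bin/perl

use strict;
use warnings;
use XML::Parser;

# ParseParamEnt => 1 lets expat read the external DTD subset, which is
# where XHTML declares named entities such as &nbsp;. Without it the
# parse dies with an "undefined entity" error.
my $parser = XML::Parser->new(ParseParamEnt => 1);

my $xhtml = <<'END';
<?xml version="1.0"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>test</title></head>
<body><p>one&nbsp;two</p></body>
</html>
END

$parser->parse($xhtml);
print "parsed OK\n";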
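
And on the RobotUA delay: LWP::RobotUA measures its delay in minutes,
which is easy to trip over when you are thinking in seconds. A minimal
sketch, setting the delay through the delay() accessor (which, per the
message above, works where the constructor argument did not); the agent
string, contact address and URL are placeholders.

#!/usr/bin/perl

use strict;
use warnings;
use LWP::RobotUA;

# LWP::RobotUA fetches and honours robots.txt and sleeps between
# requests to the same server.
my $ua = LWP::RobotUA->new(
    agent => 'my-crawler/0.1',    # placeholder agent string
    from  => 'me@example.org',    # contact address (required)
);
$ua->delay(5 / 60);               # delay() takes minutes: 5 seconds

my $res = $ua->get('http://microformats.org/wiki/hcard');
print $res->is_success ? $res->decoded_content : $res->status_line;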

