[uf-discuss] Parsing XFN in PHP
Ryan Parman
ryan.lists.warpshare at gmail.com
Thu Apr 10 09:05:47 PDT 2008
As someone with a background in parsing RSS/Atom, I can say from years
of experience that RSS is only occasionally XML and that you typically
find far more HTML in a feed than XML. And parsing HTML can be a bitch.
If an XML parser is conforming to the XML spec, it'll fail on a number
of cases -- most notably, ill-formed XML. Standard, validated HTML
4.01 is ill-formed XML. libxml/libxml2, DOM, and SimpleXML are all
conforming, and I believe that SAX is too, although I've never used it
before. I don't know about XPath. Therefore, parsing data from
anything other than perfectly-formed XHTML (served with an XML-
friendly mime type, as per RFC 3023) is expected to fail. Frequently.
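To make the failure mode concrete, here is a minimal sketch. It assumes PHP 5's DOM extension and a made-up sample string; neither the markup nor the variable names come from a real project. The strict XML loader rejects ordinary HTML 4.01-style markup. For contrast, the same extension also exposes libxml's separate, lenient HTML loader, which is what the quoted example further down relies on.

```php
<?php
// Hypothetical sample: HTML 4.01-style markup that is not well-formed XML
// (unclosed <p> and <br>, unquoted attribute values).
$html = '<html><body><p>First<br>'
      . '<a href=http://example.com/ rel=me>me</a></body></html>';

// Collect libxml errors internally instead of emitting PHP warnings.
libxml_use_internal_errors(true);

// Conforming XML parser: the unquoted attribute is a fatal error.
$asXml = new DOMDocument();
$xmlOk = $asXml->loadXML($html);
libxml_clear_errors();

// libxml's HTML parser: tolerant of the same input.
$asHtml = new DOMDocument();
$htmlOk = $asHtml->loadHTML($html);
libxml_clear_errors();

echo 'loadXML:  ', $xmlOk ? 'parsed' : 'failed', PHP_EOL;
echo 'loadHTML: ', $htmlOk ? 'parsed' : 'failed', PHP_EOL;
```

How closely the lenient loader's recovery matches what a browser would do with badly broken pages is a separate question, and this sketch does not try to answer it.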
You could attempt to go another route and use regular expressions, but
regexes can be slow, and certain recent versions of PCRE in PHP 5 were
buggy, causing segmentation faults (i.e., the PHP executable crashes)
on complex expressions.
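For illustration only, a regex-only attempt at pulling rel="me" links might look like the sketch below. The function name regex_rel_me_links() and the sample markup are hypothetical, and the patterns assume double-quoted attributes, so anything else is silently missed. It keeps each pattern small and applies them in steps rather than using one large expression.

```php
<?php
/**
 * Hypothetical helper: return href values of <a> tags whose rel list
 * contains "me". Assumes double-quoted attributes only.
 */
function regex_rel_me_links($html)
{
    $links = array();

    // Step 1: grab each opening <a ...> tag.
    if (!preg_match_all('/<a\s[^>]*>/i', $html, $tags)) {
        return $links;
    }

    foreach ($tags[0] as $tag) {
        // Step 2: require "me" as a whole token inside rel="...".
        if (!preg_match('/\srel="(?:[^"]*\s)?me(?:\s[^"]*)?"/i', $tag)) {
            continue;
        }
        // Step 3: extract the href value from the same tag.
        if (preg_match('/\shref="([^"]*)"/i', $tag, $m)) {
            $links[] = $m[1];
        }
    }

    return $links;
}

$sample = '<a href="http://example.com/" rel="me">Me</a> '
        . '<a href="http://example.org/" rel="friend met">Friend</a>';
print_r(regex_rel_me_links($sample));
```

Note that an unquoted rel=me, which is legal HTML 4.01, slips straight past these patterns. That is the kind of gap that turns into bug reports.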
In SimplePie, we use a hybrid wherein we do some up-front checking and
fixing with string parsing, then pass the RSS string through libxml-
based parsing functions, then use PCRE regexes on various XML-Array
nodes to pull out specific bits of data. It's the most reliable method
that we could come up with for a syntax (RSS) that at least *attempts*
to be XML. The problem is that *most* of the world's web pages aren't
XML or even trying to be XML... they're straight-up, old-skool HTML.
And you will absolutely run into problems.
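The three stages of that hybrid can be sketched roughly as follows. This is not SimplePie's actual code: the function name hybrid_extract_links(), the specific string fixes, and the sample feed are all assumptions made for the example, and it uses SimpleXML as the libxml-based parser.

```php
<?php
/**
 * Hypothetical three-stage pipeline: string fixes, libxml parse, then
 * PCRE on the text of individual nodes.
 */
function hybrid_extract_links($feed)
{
    // Stage 1: string-level fixes. Drop junk before the first '<' and
    // escape ampersands that do not start an entity reference.
    $start = strpos($feed, '<');
    if ($start === false) {
        return array();
    }
    $feed = substr($feed, $start);
    $feed = preg_replace(
        '/&(?![A-Za-z][A-Za-z0-9]*;|#[0-9]+;|#x[0-9A-Fa-f]+;)/',
        '&amp;',
        $feed
    );

    // Stage 2: hand the cleaned string to a libxml-based parser.
    libxml_use_internal_errors(true);
    $xml = simplexml_load_string($feed);
    libxml_clear_errors();
    if ($xml === false) {
        return array();
    }

    // Stage 3: PCRE on the HTML embedded in each <description> node.
    $links = array();
    foreach ($xml->xpath('//item/description') as $node) {
        if (preg_match_all('/href="([^"]+)"/i', (string) $node, $m)) {
            $links = array_merge($links, $m[1]);
        }
    }
    return $links;
}

// Made-up feed with leading whitespace and a bare ampersand in the title.
$feed = "\n  " . '<?xml version="1.0"?><rss version="2.0"><channel>'
      . '<title>Q & A</title><item><description>'
      . '&lt;a href="http://example.com/" rel="me"&gt;me&lt;/a&gt;'
      . '</description></item></channel></rss>';
print_r(hybrid_extract_links($feed));
```

Every fix in stage 1 is a guess about which kinds of breakage show up in the wild, so a real library ends up with many more of them.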
"But we can do it in web browsers!" What do web browsers have that PHP
developers don't? An HTML parser. As far as I know there are no HTML
parsers written for PHP (or any other language that I'm aware of). One
of the other SimplePie devs (Geoffrey Sneddon) had gotten together
with another developer and attempted to start writing one for PHP 5.2,
but gave up shortly after reading a number of specification documents
that need to be read and understood before being able to do this
properly.
"But can't we just hack something together that's 'good enough'?" If
you want to support it, sure. But expect bug reports and feature
requests -- LOTS of them. Then you'll go and re-write stuff to make it
better and more compliant, and people will begin complaining because
some behavior changed, and they're mad about that. Oh, and let's not
forget how people will bitch you out because "it doesn't work like
[insert web browser here]," and that you must be some sort of "stupid,
lazy developer who can't get it right." Granted, these people are
complete morons, but after you've put tons of your time and energy
into this project to make it as good as possible, stuff like this can
get demoralizing. And once you get tired of the verbal abuse and stop
working on the project a few months in, you'll start getting lots and
lots of complaints and requests for somebody else to take over the
project -- but nobody else has taken the time to read through the
relevant spec docs like you have, and it'll take them a really long
time to get up to speed. Long enough, in fact, that the project may
never get picked back up.
_________________________________________________
I said all of that to make these points:
1) Parsing HTML is hard -- especially when the only tools available
are for another language (XML). If you need to screw something in, but
screwdrivers don't exist, do you use a hammer? An elegantly folded
paperclip? A combination of both?
2) *Reliably* parsing microformats out of *most* (X)HTML with object-
oriented PHP 5.x is going to be a big project. If you're diligent
about commenting your code so that others can understand what's going
on, I'd expect a PHP5 library to be at least 1 megabyte. You'll need
to account for an unprecedented number of completely idiotic markup
faults.
3) If you want to attempt a project like this, get a team of people
together. You could probably start with 1-2 people who can evaluate
the needs of a project like this, and write some initial code. Open up
to the community early to start accepting feedback. Once this project
gets rolling, I'd expect no fewer than 5-6 people working on it to make
any notable progress in a reasonable timeframe of 1-2 years. (It's an
open source project, remember? Evenings and weekends, baby!) Break the
project down into modules and assign them to different developers.
Those developers should be prepared to read several specification
documents in order to understand the correct way to do things. Oh, and
create an automated unit testing suite. It'll save you tons of time in
testing.
--
Ryan Parman
<http://ryanparman.com>
On Apr 10, 2008, at 6:01 AM, Ciaran McNulty wrote:
> On Thu, Apr 10, 2008 at 1:40 PM, Mark Ng <mark at markng.me.uk> wrote:
>> XFN itself is fairly easy to deal with by just throwing pages through
>> tidy and using DOM/SAX/XPath, surely? I made a rudimentary parser to
>> do this some time ago. The code is a little ugly to publish, but I
>> don't mind sharing privately.
>
> Here's a *very* hacky code example from when I just wanted to check my
> 'me' links - I include it here just to demonstrate how simple XFN can
> be and hopefully it's apparent how easy it would be to work up into a
> nice objecty system for spidering:
>
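> <?php
>
> $url = 'http://ciaranmcnulty.com/';
> // Fetch the page; @ hides warnings so failures are reported below.
> if ($html = @file_get_contents($url)) {
>     $dom = new DOMDocument();
>     // loadHTML() uses libxml's HTML parser rather than the XML one.
>     if (@$dom->loadHTML($html)) {
>         $xpath = new DOMXPath($dom);
>         // Match <a> elements whose space-separated rel contains "me".
>         $query = "//a[contains(concat(' ', normalize-space(@rel), ' '), ' me ')]";
>         if ($nodes = $xpath->query($query)) {
>             foreach ($nodes as $node) {
>                 echo $node->getAttribute('href'), PHP_EOL;
>             }
>         }
>     } else {
>         echo 'Could not parse HTML', PHP_EOL;
>     }
> } else {
>     echo 'Could not fetch file', PHP_EOL;
> }
> ?>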