[uf-discuss] Parsing XFN in PHP

Thu Apr 10 09:05:47 PDT 2008

As someone with a background in parsing RSS/Atom, I can say from years  
of experience that RSS is only occasionally XML and that you typically  
find far more HTML in a feed than XML. And parsing HTML can be a bitch.

A parser that conforms to the XML spec is required to fail on a number
of cases -- most notably, anything that isn't well-formed XML. Standard,
validated HTML 4.01 is not well-formed XML. libxml/libxml2's XML
parser, DOM, and SimpleXML are all conforming, and I believe that SAX
is too, although I've never used it before. XPath is a query language
rather than a parser, so it only sees whatever tree the parser managed
to build. Therefore, parsing data with XML tools from anything other
than well-formed XHTML (served with an XML-friendly MIME type, as per
RFC 3023) is expected to fail. Frequently.
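
To make that failure concrete, here is a minimal sketch using PHP 5's
DOM extension. The sample markup and variable names are made up for
illustration; the snippet is legal HTML 4.01 (the <p> is never closed
and <br> has no end tag), but it is not well-formed XML, so the XML
parser should refuse it.

```php
<?php
// Collect libxml errors quietly instead of emitting PHP warnings.
libxml_use_internal_errors(true);

// Made-up sample: valid HTML 4.01 Strict, but not well-formed XML.
$html = '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">'
      . '<html><head><title>t</title></head>'
      . '<body><p>Hello<br>world</body></html>';

// loadXML() uses the strict XML parser and returns false on a fatal error.
$doc = new DOMDocument();
$xmlOk = $doc->loadXML($html);

// Keep the collected errors so we can show why it failed.
$xmlErrors = libxml_get_errors();
libxml_clear_errors();

echo 'loadXML: ', ($xmlOk ? 'parsed' : 'failed'), PHP_EOL;
foreach ($xmlErrors as $error) {
    echo '  line ', $error->line, ': ', trim($error->message), PHP_EOL;
}
```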

You could attempt to go another route and use regular expressions, but
regexes can be slow, and certain recent versions of PCRE in PHP 5 were
buggy, causing segmentation faults (i.e. the PHP executable crashes)
on complex expressions.

In SimplePie, we use a hybrid wherein we do some up-front checking and
fixing with string parsing, then pass the RSS string through
libxml-based parsing functions, then use PCRE regexes on various nodes
of the resulting XML array to pull out specific bits of data. It's the
most reliable method that we could come up with for a syntax (RSS)
that at least *attempts*
to be XML. The problem is that *most* of the world's web pages aren't  
XML or even trying to be XML... they're straight-up, old-skool HTML.  
And you will absolutely run into problems.
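
Very roughly, the three stages look something like the sketch below.
This is not SimplePie's actual code: the function name, the specific
fixes, and the sample feed are all made up to illustrate the shape of
the approach (string fix-ups, then a libxml-based parse, then regexes
on individual node values).

```php
<?php
// Hypothetical helper, NOT SimplePie's real implementation.
function sketch_feed_titles($raw)
{
    // 1) Up-front string fixing: drop a UTF-8 BOM and any junk before the first '<'.
    $raw = preg_replace('/^\xEF\xBB\xBF/', '', $raw);
    $start = strpos($raw, '<');
    if ($start === false) {
        return array();
    }
    $raw = substr($raw, $start);

    // 2) libxml-based parsing; errors are collected rather than printed.
    $previous = libxml_use_internal_errors(true);
    $xml = simplexml_load_string($raw);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    if ($xml === false) {
        return array();
    }

    // 3) Regex clean-up on individual node values: titles often carry escaped HTML.
    $titles = array();
    foreach ($xml->xpath('//item/title') as $node) {
        $text = preg_replace('/<[^>]*>/', '', (string) $node); // strip tags
        $titles[] = trim(html_entity_decode($text, ENT_QUOTES, 'UTF-8'));
    }
    return $titles;
}

// Made-up feed: BOM and whitespace before the root, escaped HTML inside the title.
$feed = "\xEF\xBB\xBF\n  <rss version=\"2.0\"><channel>"
      . "<item><title>&lt;b&gt;Hello&lt;/b&gt; &amp;amp; welcome</title></item>"
      . "</channel></rss>";
print_r(sketch_feed_titles($feed));
```

Note that step 2 still gives up entirely on input that isn't
well-formed, which is exactly the problem with pointing this kind of
pipeline at ordinary HTML pages.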

"But we can do it in web browsers!" What do web browsers have that PHP
developers don't? An HTML parser built to cope with real-world markup.
As far as I know there are no parsers of that caliber written for PHP
(or any other language that I'm aware of); libxml2's HTML mode, which
is what DOMDocument::loadHTML() uses, is more forgiving than its XML
mode, but it doesn't recover from errors the way browsers do. One of
the other SimplePie devs (Geoffrey Sneddon) got together with another
developer and started writing one for PHP 5.2, but they gave up
shortly after reading a number of the specification documents that
have to be read and understood before this can be done properly.

"But can't we just hack something together that's 'good enough?'" If  
you want to support it, sure. But expect bug reports and feature  
requests -- LOTS of them. Then you'll go and re-write stuff to make it  
better and more compliant, and people will begin complaining because  
some behavior changed, and they're mad about that. Oh, and let's not  
forget how people will bitch you out because "it doesn't work like  
[insert web browser here]," and that you must be some sort of "stupid,  
lazy developer who can't get it right." Granted, these people are  
complete morons, but after you've put tons of your time and energy  
into this project to make it as good as possible, stuff like this can  
get demoralizing. And once you get tired of the verbal abuse and stop
working on the project a few months in, you'll start getting lots and
lots of complaints and requests for somebody else to take over the
project -- but nobody else has taken the time to read through the
relevant spec docs like you have, and it'll take them a really long
time to get up to speed. Long enough, in fact, that the project may
never get picked back up.

_________________________________________________

I said all of that to make these points:

1) Parsing HTML is hard -- especially when the only tools available  
are for another language (XML). If you need to screw something in, but  
screwdrivers don't exist, do you use a hammer? An elegantly folded
paperclip? A combination of both?

2) *Reliably* parsing microformats out of *most* (X)HTML with object- 
oriented PHP 5.x is going to be a big project. If you're diligent  
about commenting your code so that others can understand what's going  
on, I'd expect a PHP5 library to be at least 1 megabyte. You'll need  
to account for an unprecedented number of completely idiotic markup  
faults.

3) If you want to attempt a project like this, get a team of people  
together. You could probably start with 1-2 people who can evaluate  
the needs of a project like this, and write some initial code. Open up  
to the community early to start accepting feedback. Once this project  
gets rolling, I'd expect no fewer than 5-6 people working on it to make
any notable progress in a reasonable timeframe of 1-2 years. (It's an  
open source project, remember? Evenings and weekends, baby!) Break the  
project down into modules and assign them to different developers.  
Those developers should be prepared to read several specification  
documents in order to understand the correct way to do things. Oh, and  
create an automated unit testing suite. It'll save you tons of time in  
testing.
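
As a rough idea of what the smallest possible automated suite could
look like, here is a fixture-driven sketch. The function under test
(sketch_rel_tokens) and the fixture table are hypothetical, made up for
this example; a real suite would load its fixtures from files full of
ugly real-world markup and would likely use a proper test framework.

```php
<?php
// Hypothetical function under test: split a rel attribute into lower-cased tokens.
function sketch_rel_tokens($rel)
{
    $tokens = preg_split('/\s+/', strtolower(trim($rel)), -1, PREG_SPLIT_NO_EMPTY);
    return array_values(array_unique($tokens));
}

// Fixture table: input => expected tokens.
$cases = array(
    'me'              => array('me'),
    "  friend\tmet  " => array('friend', 'met'),
    'ME Me me'        => array('me'),
    ''                => array(),
);

// Run every case and report anything that doesn't match exactly.
$failures = 0;
foreach ($cases as $input => $expected) {
    $actual = sketch_rel_tokens($input);
    if ($actual !== $expected) {
        $failures++;
        echo 'FAIL: ', var_export($input, true), ' gave ', var_export($actual, true), PHP_EOL;
    }
}
echo $failures === 0 ? 'all cases passed' : "$failures case(s) failed", PHP_EOL;
```

Every bug report then becomes one more row in the table, which is how
you keep behavior from changing underneath people without noticing.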

--
Ryan Parman
<http://ryanparman.com>

On Apr 10, 2008, at 6:01 AM, Ciaran McNulty wrote:
> On Thu, Apr 10, 2008 at 1:40 PM, Mark Ng <mark at markng.me.uk> wrote:
>> XFN itself is fairly easy to deal with by just throwing pages through
>> tidy and using DOM/SAX/XPath, surely? I made a rudimentary parser to
>> do this some time ago. The code is a little ugly to publish, but I
>> don't mind sharing privately.
>
> Here's a *very* hacky code example from when I just wanted to check my
> 'me' links - I include it here just to demonstrate how simple XFN can
> be and hopefully it's apparent how easy it would be to work up into a
> nice objecty system for spidering:
>
> <?php
>
> // Fetch a page and print the href of every rel="me" link.
> $url = 'http://ciaranmcnulty.com/';
> if($html = @file_get_contents($url)){
> 	$dom = new DOMDocument();
> 	// @ hides the warnings libxml raises on real-world markup
> 	if(@$dom->loadHTML($html)){
> 		$xpath = new DOMXPath($dom);
> 		// Pad rel with spaces so ' me ' only matches a whole token
> 		$query = "//a[contains(concat(' ', normalize-space(@rel), ' '), ' me ')]";
> 		if($nodes = $xpath->query($query)){
> 			foreach($nodes as $node){
> 				echo $node->getAttribute('href'), PHP_EOL;
> 			}
> 		}
> 	}
> 	else{ echo 'Could not parse HTML', PHP_EOL; }
> }
> else{  echo 'Could not fetch file', PHP_EOL; }
> ?>
> _______________________________________________
> microformats-discuss mailing list
> microformats-discuss at microformats.org
> http://microformats.org/mailman/listinfo/microformats-discuss
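
For reference, the quoted snippet could be worked up into something
slightly more "objecty" along the lines of the sketch below. The class
name and sample markup are made up, and it inherits every limitation of
libxml2's HTML parsing discussed above; it only shows the shape such a
wrapper might take, returning every link's rel tokens instead of just
the 'me' links.

```php
<?php
// Hypothetical wrapper around the quoted XPath approach; the class name is made up.
class SketchXfnExtractor
{
    // Returns a list of array('href' => string, 'rel' => array of tokens).
    public function extract($html)
    {
        if ($html === '') {
            return array();
        }

        // Real-world markup triggers many libxml warnings; collect them silently.
        $previous = libxml_use_internal_errors(true);
        $dom = new DOMDocument();
        $ok = $dom->loadHTML($html);
        libxml_clear_errors();
        libxml_use_internal_errors($previous);
        if (!$ok) {
            return array();
        }

        $links = array();
        $xpath = new DOMXPath($dom);
        foreach ($xpath->query('//a[@rel][@href]') as $node) {
            // rel is a whitespace-separated token list; compare case-insensitively.
            $rel = preg_split('/\s+/', strtolower($node->getAttribute('rel')), -1, PREG_SPLIT_NO_EMPTY);
            $links[] = array('href' => $node->getAttribute('href'), 'rel' => $rel);
        }
        return $links;
    }
}

// Made-up sample with an unclosed <a> and <br>, plus a link without rel.
$html = '<p><a href="http://example.com/a" rel="me">A</a>'
      . '<a href="http://example.com/b" rel="friend met">B<br>'
      . '<a href="http://example.com/c">C</a>';

$extractor = new SketchXfnExtractor();
$links = $extractor->extract($html);
foreach ($links as $link) {
    echo $link['href'], ' => ', implode(', ', $link['rel']), PHP_EOL;
}
```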