[uf-dev] Parsing XFN in PHP

Geoffrey Sneddon foolistbar at googlemail.com
Fri Apr 11 04:45:03 PDT 2008


On 10 Apr 2008, at 18:34, Toby A Inkster wrote:
> Ryan Parman wrote:
>
>> "But we can do it in web browsers!" What do web browsers have that
>> PHP developers don't? An HTML parser. As far as I know there are no
>> HTML parsers written for PHP (or any other language that I'm aware
>> of).
>
> http://www.php.net/manual/en/function.dom-domdocument-loadhtml.php

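That doesn't really work. libxml2's HTML parser does not follow the
error-recovery rules that browsers apply, so it is nothing like what
is needed for real-world compatibility. Just take a look at misnested
tags like <b><i>foo</b>bar</i>, or at <plaintext>foo</plaintext><b>bar,
where everything after <plaintext> has to be treated as text.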
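
To make the first example concrete, here is a rough sketch in Python
(not PHP, and not how libxml2 or any browser works internally). It
uses the standard library's html.parser only as a tokenizer and keeps
a plain stack of open elements, which is the model a naive parser
tends to assume. The names StackChecker and find_mismatches are made
up for illustration.

```python
from html.parser import HTMLParser


class StackChecker(HTMLParser):
    """Track open elements on a plain stack and record every end tag
    that does not close the innermost open element."""

    def __init__(self):
        super().__init__()
        self.open_elements = []
        self.mismatches = []

    def handle_starttag(self, tag, attrs):
        self.open_elements.append(tag)

    def handle_endtag(self, tag):
        if self.open_elements and self.open_elements[-1] == tag:
            self.open_elements.pop()
        else:
            # A strict stack model has no answer here; a real HTML
            # parser needs explicit recovery rules for this case.
            self.mismatches.append((tag, list(self.open_elements)))


def find_mismatches(markup):
    """Return (end_tag, open_elements_at_that_point) for each end tag
    that a strictly nested model cannot handle."""
    checker = StackChecker()
    checker.feed(markup)
    checker.close()
    return checker.mismatches


print(find_mismatches("<b><i>foo</b>bar</i>"))
```

The </b> arrives while <i> is still the innermost open element, so a
strictly nested tree cannot represent the input as written. What the
parser does at that point is exactly where implementations diverge
unless they follow the same rules.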


On 11 Apr 2008, at 08:33, Toby A Inkster wrote:
> Another option is XML_HTMLSax3 from PEAR:
> http://pear.php.net/package/XML_HTMLSax3

This really seems to parse nothing more than an XML-like subset of
SGML, and is therefore equally useless for parsing HTML. See the two
examples above again, as well as things like <b<i>hi</i></b> (note the
omitted >).

Real-world HTML content really does rely on specific parsing rules,
and deviating from them will just cause problems. To get anything
useful, you'd really need to implement your own HTML parser, likely
starting from the HTML 5 parsing algorithm. Then you run into issues
with the DOM requiring XML well-formedness, so you can't have "a@" as
a localName. (To reuse the example from public-html a few days ago:
you need to parse <a@> <a#> </a@> correctly, even though all of those
tag names contain characters that you can't legally store in the DOM.)
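
A quick way to see the naming problem, again sketched in Python
rather than PHP: ask an XML parser whether a name can be used as an
element name at all. This uses xml.etree.ElementTree as a stand-in for
the DOM's name checks, which is an assumption for the sake of the
example. The helper is_wellformed_element_name is hypothetical and
only meant for simple names like the ones above.

```python
import xml.etree.ElementTree as ET


def is_wellformed_element_name(name):
    """Return True if <name/> parses as well-formed XML.

    Illustration only: the name is pasted into markup unescaped, so
    this is not a general-purpose validator.
    """
    try:
        ET.fromstring("<" + name + "/>")
    except ET.ParseError:
        return False
    return True


for candidate in ("a", "a@", "a#"):
    print(candidate, is_wellformed_element_name(candidate))
```

"a" is accepted, while "a@" and "a#" are rejected, so an HTML parser
that builds an XML-style DOM has to decide what to do with such tags
instead of storing their names as-is.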


--
Geoffrey Sneddon
<http://gsnedders.com/>


