[uf-dev] Parsing XFN in PHP
Geoffrey Sneddon
foolistbar at googlemail.com
Fri Apr 11 04:45:03 PDT 2008
On 10 Apr 2008, at 18:34, Toby A Inkster wrote:
> Ryan Parman wrote:
>
>> "But we can do it in web browsers!" What do web browsers have that
>> PHP developers don't? An HTML parser. As far as I know there are no
>> HTML parsers written for PHP (or any other language that I'm aware
>> of).
>
> http://www.php.net/manual/en/function.dom-domdocument-loadhtml.php
That doesn't really work. libxml2's HTML parsing is nothing like what
is actually needed for real-world compatibility. Just take a look at
things like <b><i>foo</b>bar</i> (misnested tags), or
<plaintext>foo</plaintext><b>bar (everything after <plaintext> has to
be treated as text).
On 11 Apr 2008, at 08:33, Toby A Inkster wrote:
> Another option is XML_HTMLSax3 from PEAR:
> http://pear.php.net/package/XML_HTMLSax3
This really seems like nothing more than a parser for a subset of SGML
similar to XML, and is therefore equally useless for parsing HTML. See
the two examples above again, as well as things like <b<i>hi</i></b>
(note the omitted >).
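As a rough sketch of why tag-by-tag event reporting isn't enough, here
is the first example fed through a plain tokenizer. This uses Python's
standard-library html.parser purely as a stand-in (it is not
XML_HTMLSax3 or libxml2, and the helper names are made up): the events
come out misnested, so a consumer that assumes XML-style nesting fails.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Record raw token events; no tree construction or error recovery."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))


def tokenize(markup):
    """Return the list of token events for a markup string."""
    parser = EventLogger()
    parser.feed(markup)
    parser.close()
    return parser.events


def naive_nesting_ok(events):
    """True if every end tag closes the most recently opened element."""
    stack = []
    for kind, value in events:
        if kind == "start":
            stack.append(value)
        elif kind == "end":
            if not stack or stack.pop() != value:
                return False
    return not stack


events = tokenize("<b><i>foo</b>bar</i>")
print(events)
print(naive_nesting_ok(events))
```

The </b> arrives while <i> is still open, so the XML-style check
fails. Deciding what tree to build from those events is exactly the
part that needs HTML-specific rules.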
Real-world HTML content really does rely on specific parsing rules,
and deviating from them will just cause problems. To get anything
useful, you'd really need to implement your own HTML parser, likely
starting from the HTML 5 spec. Even then you can run into issues with
the DOM requiring XML well-formedness, so you can't have "a@" as a
localName (to reuse the example from public-html a few days ago, you
need to parse <a@> <a#> </a@> correctly, despite all those tags having
characters that you can't legally store in the DOM).
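
To make the well-formedness constraint concrete, a minimal sketch in
Python, using the standard library's xml.etree.ElementTree as a
stand-in for any XML-based toolchain (it is not a DOM implementation,
and the helper name is hypothetical): an XML parser rejects exactly
the element names that an HTML parser has to cope with.

```python
import xml.etree.ElementTree as ET


def is_well_formed_xml(markup):
    """Return True if markup parses as a well-formed XML document."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True


# "a" is a legal XML name; "a@" and "a#" are not.
for sample in ("<a/>", "<a@/>", "<a#/>"):
    print(sample, is_well_formed_xml(sample))
```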
--
Geoffrey Sneddon
<http://gsnedders.com/>