<html><body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space; ">Forwarding Geoffrey's off-list message sent to the original thread:<br><div><br></div><div><br><html>Begin forwarded message:</html><br class="Apple-interchange-newline"><blockquote type="cite"><div><div style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; "><font face="Helvetica" size="3" color="#000000" style="font: 12.0px Helvetica; color: #000000"><b>From: </b></font><font face="Helvetica" size="3" style="font: 12.0px Helvetica">Geoffrey Sneddon &lt;<a href="mailto:foolistbar@googlemail.com">foolistbar@googlemail.com</a>&gt;</font></div><div style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; "><font face="Helvetica" size="3" color="#000000" style="font: 12.0px Helvetica; color: #000000"><b>Date: </b></font><font face="Helvetica" size="3" style="font: 12.0px Helvetica">April 11, 2008 4:45:03 AM PDT</font></div><div style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; "><font face="Helvetica" size="3" color="#000000" style="font: 12.0px Helvetica; color: #000000"><b>To: </b></font><font face="Helvetica" size="3" style="font: 12.0px Helvetica">Toby A Inkster &lt;<a href="mailto:mail@tobyinkster.co.uk">mail@tobyinkster.co.uk</a>&gt;, Ryan Parman &lt;<a href="mailto:ryan.lists.warpshare@gmail.com">ryan.lists.warpshare@gmail.com</a>&gt;</font></div><div style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; "><font face="Helvetica" size="3" color="#000000" style="font: 12.0px Helvetica; color: #000000"><b>Subject: </b></font><font face="Helvetica" size="3" style="font: 12.0px Helvetica"><b>Re: (Off-list) Parsing XFN in PHP</b></font></div><div style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; min-height: 14px; "><br></div> </div><div><br>On 10 Apr 2008, at 18:34, Toby A Inkster wrote:<br><blockquote type="cite">Ryan Parman 
wrote:<br></blockquote><blockquote type="cite"><br></blockquote><blockquote type="cite"><blockquote type="cite">"But we can do it in web browsers!" What do web browsers have that PHP<br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite">developers don't? An HTML parser. As far as I know there are no HTML<br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite">parsers written for PHP (or any other language that I'm aware of).<br></blockquote></blockquote><blockquote type="cite"><br></blockquote><blockquote type="cite"><a href="http://www.php.net/manual/en/function.dom-domdocument-loadhtml.php">http://www.php.net/manual/en/function.dom-domdocument-loadhtml.php</a><br></blockquote><br>That doesn't really work. libxml2's HTML parsing is nothing like what is actually needed for real-world compatibility. Just take a look at things like &lt;b&gt;&lt;i&gt;foo&lt;/b&gt;bar&lt;/i&gt;, or &lt;plaintext&gt;foo&lt;/plaintext&gt;&lt;b&gt;bar.<br><br><br>On 11 Apr 2008, at 08:33, Toby A Inkster wrote:<br><blockquote type="cite">Another option is XML_HTMLSax3 from PEAR:<br></blockquote><blockquote type="cite"><a href="http://pear.php.net/package/XML_HTMLSax3">http://pear.php.net/package/XML_HTMLSax3</a><br></blockquote><br>This really seems to handle nothing more than an XML-like subset of SGML, and is therefore equally useless at parsing HTML. See the above two examples again, as well as things like &lt;b&lt;i&gt;hi&lt;/i&gt;&lt;/b&gt; (note the omitted &gt;).<br><br>Real-world HTML content really does rely on specific parsing rules, and attempting to deviate from them will just result in issues. In terms of anything useful, you'd really need to implement your own HTML parser, likely starting from HTML 5. 
Then you can run into issues with the DOM requiring XML well-formedness, so you can't have "a@" as a localName (to reuse the example from public-html a few days ago, you need to parse &lt;a@&gt; &lt;a#&gt; &lt;/a@&gt; correctly, despite all of those tag names containing characters that you can't legally store in the DOM).<br><br><br>--<br>Geoffrey Sneddon<br>&lt;<a href="http://gsnedders.com/">http://gsnedders.com/</a>&gt;<br><br></div></blockquote></div><br></body></html>