[uf-discuss] Re: Parsing XFN in PHP
Ryan Parman
ryan.lists.warpshare at gmail.com
Thu Apr 10 11:34:23 PDT 2008
On Apr 10, 2008, at 10:04 AM, Julian Bond wrote:
> Ryan Parman <ryan.lists.warpshare at gmail.com> Thu, 10 Apr 2008 09:05:47
>> As someone with a background in parsing RSS/Atom, I can say from
>> years of experience that RSS is only occasionally XML and that you
>> typically find far more HTML in a feed than XML. And parsing HTML
>> can be a bitch.
>
> Big snip.
>
> Woah! That's enough to put one off even starting on parsing and
> reading uF. Which makes uF all a bit pointless. Oh dear. :(
Sarcasm noted. ;)
> I suspect though that this Gordian knot can be cut. It seems quite
> likely that any page marked up with uF is good enough that HTML-Tidy
> won't remove too many uF marked up elements. If that's the case,
> then Fetch html -> HTML-Tidy -> XML parsing is going to get 99% of
> the job done and successfully extract the uF marked data. But that
> HTML-Tidy step is going to be indispensable. It just plain won't
> work without it. And the shortcut that reduces even that step is
> DomDocument->loadHtml($html), which is effectively doing the same thing.
On Apr 10, 2008, at 10:34 AM, Toby A Inkster wrote:
> http://www.php.net/manual/en/function.dom-domdocument-loadhtml.php
This is interesting -- especially if it works. However, the version
information is noted as CVS-only. Is this in a shipping version of PHP
yet?
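
For anyone who wants to try the loadHTML() route on XFN, here is a rough sketch of what it might look like, assuming the DOM extension is present in your PHP build. The function name extract_xfn_links() and the shortened list of XFN values are made up for illustration; this is not a complete or tested parser.

```php
<?php
// Hypothetical helper: collect links whose rel attribute carries XFN values.
function extract_xfn_links($html)
{
    // Abbreviated list of XFN values, for illustration only; extend as needed.
    $xfn = array('friend', 'acquaintance', 'contact', 'met',
                 'co-worker', 'colleague', 'kin', 'me');

    $doc = new DOMDocument();
    // Keep libxml's complaints about malformed markup out of the output.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = array();
    $xpath = new DOMXPath($doc);
    foreach ($xpath->query('//a[@rel][@href]') as $a) {
        // rel holds a space-separated list of tokens.
        $rel = strtolower(trim($a->getAttribute('rel')));
        $tokens = preg_split('/\s+/', $rel, -1, PREG_SPLIT_NO_EMPTY);
        // Keep only the tokens that are XFN values.
        $matched = array_values(array_intersect($tokens, $xfn));
        if ($matched) {
            $links[] = array(
                'href' => $a->getAttribute('href'),
                'rel'  => $matched,
            );
        }
    }
    return $links;
}
```

Calling extract_xfn_links($html) on a fetched page should return one entry per link that has at least one recognized XFN value, and skip links whose rel holds only other tokens such as nofollow.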
Using HTML-Tidy is a fairly big gotcha for most people on shared
hosting. I don't know the stats, but I would guess that not many
hosting providers have this installed. I have access to dedicated
hardware, so I'm definitely interested in this (assuming it works as
expected, of course), but I'm concerned about the community at large.
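
One possible way around the shared-hosting problem is to detect Tidy at runtime and fall back to loadHTML() alone when it is missing. The sketch below assumes the tidy extension's tidy_repair_string() where available; load_html_document() is a hypothetical name, and whether the fallback path is good enough for real-world pages is exactly the open question.

```php
<?php
// Hypothetical helper: clean with Tidy when the extension is loaded,
// otherwise hand the raw markup straight to DOMDocument::loadHTML().
function load_html_document($html)
{
    if (extension_loaded('tidy')) {
        // Ask Tidy for XHTML output; keep the raw input if the repair fails.
        $repaired = tidy_repair_string($html, array('output-xhtml' => true), 'utf8');
        if (is_string($repaired) && $repaired !== '') {
            $html = $repaired;
        }
    }

    $doc = new DOMDocument();
    // Keep libxml's complaints about malformed markup out of the output.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    return $doc;
}
```

The caller gets a DOMDocument either way, so the extraction code does not need to know whether Tidy was installed on the host.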
On Apr 10, 2008, at 10:04 AM, Julian Bond wrote:
> It would be interesting to do some interop testing and see just how
> bad a web page has to be before the uF starts getting missed.
I agree.
--
Ryan Parman
<http://ryanparman.com>