[uf-discuss] Parsing XFN in PHP
danbri at danbri.org
Thu Apr 10 12:20:26 PDT 2008
Julian Bond wrote:
> Ryan Parman <ryan.lists.warpshare at gmail.com> Thu, 10 Apr 2008 09:05:47
>> As someone with a background in parsing RSS/Atom, I can say from
>> years of experience that RSS is only occasionally XML and that you
>> typically find far more HTML in a feed than XML. And parsing HTML can
>> be a bitch.
> Big snip.
> Woah! That's enough to put one off even starting on parsing and
> reading uF. Which makes uF all a bit pointless. Oh dear. :(
> I suspect though that this Gordian knot can be cut. It seems quite
> likely that any page marked up with uF is good enough that HTML-Tidy
> won't remove too many uF marked up elements. If that's the case, then
> Fetch html -> HTML-Tidy -> XML parsing is going to get 99% of the job
> done and successfully extract the uF marked data.
Aside re 'nofollow':
If you're scrubbing HTMLish character streams with arbitrary other code
to make XHTML, do take care that you're not accidentally scrubbing
rel='nofollow' from comment areas while leaving in potentially
mischievous "rel='me'" claims. I don't know the default behaviour of
HTML Tidy or similar tools, but this risk is worth bearing in mind.
"If a link has the rel value "nofollow", then a "me" rel value DOES
NOT indicate an identity relationship. That is, only rel attributes with
the value "me", and WITHOUT the value "nofollow" indicate an identity
relationship assertion."
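To make the quoted rule concrete, here is a minimal sketch of it, written in Python with only the standard library rather than PHP. The names RelMeCollector, identity_links and newly_trusted_links are made up for illustration and are not part of any microformats library. It assumes rel is a space-separated token list compared case-insensitively, and it makes no claim about what HTML Tidy actually does. The second function flags links that only start to look like identity claims after a cleanup pass, which is the risk described above.

```python
from html.parser import HTMLParser


class RelMeCollector(HTMLParser):
    """Collect hrefs of <a>/<link> elements with rel "me" and no "nofollow"."""

    def __init__(self):
        super().__init__()
        self.identity_hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag not in ("a", "link"):
            return
        attr_map = dict(attrs)
        href = attr_map.get("href")
        # rel holds space-separated tokens; a valueless attribute arrives as None.
        rel_tokens = (attr_map.get("rel") or "").lower().split()
        # The quoted rule: "me" counts only when "nofollow" is absent.
        if href and "me" in rel_tokens and "nofollow" not in rel_tokens:
            self.identity_hrefs.append(href)


def identity_links(html):
    """Return hrefs that assert an identity relationship under the quoted rule."""
    parser = RelMeCollector()
    parser.feed(html)
    parser.close()
    return parser.identity_hrefs


def newly_trusted_links(raw_html, cleaned_html):
    """Return hrefs that are identity claims in the cleaned markup only.

    A non-empty result suggests the cleanup step dropped a 'nofollow'
    while leaving the 'me' in place.
    """
    before = set(identity_links(raw_html))
    return [href for href in identity_links(cleaned_html) if href not in before]
```

Running newly_trusted_links over the markup before and after the scrubbing step, and treating any returned href as suspect, is one cheap way to notice a tool that strips 'nofollow'.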
While it might seem odd for a 'nofollow' to be stripped while a 'me' is
left in place, I've seen enough hostility to the 'nofollow' idea
floating around that it is certainly possible some HTML cleanup tools
will drop that markup. For example,