[uf-discuss] stats on well formed XHTML

Wed Jan 16 15:04:38 PST 2008

On Jan 16, 2008, at 12:41 AM, Kevin Burton wrote:

> Has anyone done any large scale audits of XHTML in the wild to
> determine the percentage that parse correctly?

Yes, Ian Hickson at Google did a survey of about 1B pages and found
that over 90% had *well-formedness* errors. I can't find a reference
offhand, but it may be buried somewhere in [#webstats].

> I'm thinking about deploying one in Spinn3r but I'd rather focus on
> other tasks if this has already been done.

I'd suggest working on other tasks. :)

> I'm curious about the assumptions one could make when assuming that
> XHTML is well formed.

You know what they say about assumptions.

> Specifically, the probability that a naive non-XML parser can make
> while indexing the content.

I'm not sure what you mean here, but I'd recommend against running an
XML parser over web content. Use something like the HTML5 parsing
algorithm [#html5-parsing] instead.
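
To make the difference concrete, here's a rough Python sketch. The
sample markup and the helper names (`is_well_formed`, `TextCollector`,
`extract_text`) are made up for illustration. Note that the standard
library's `html.parser` is only a stand-in for a lenient parser; it does
not implement the HTML5 parsing algorithm, so in practice you'd swap in
a library that does.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical sample: typical tag soup served as "XHTML"
# (unclosed <br> and <p>, mis-nested <b>/<i>).
SAMPLE = "<html><body><p>Unclosed paragraph<br><b>bold <i>text</b></i></body></html>"


def is_well_formed(markup: str) -> bool:
    """Return True only if a strict XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


class TextCollector(HTMLParser):
    """Lenient parser: collects text nodes and never rejects the input."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def extract_text(markup: str) -> str:
    """Pull indexable text out of markup, well-formed or not."""
    parser = TextCollector()
    parser.feed(markup)
    parser.close()  # flush any buffered trailing text
    return "".join(parser.chunks)


if __name__ == "__main__":
    print(is_well_formed(SAMPLE))  # the strict XML parser gives up
    print(extract_text(SAMPLE))    # the lenient parser still yields text
```

The XML parser stops at the first error and gives you nothing to index,
while the lenient parser still hands back the text.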

-ryan

[webstats]: http://code.google.com/webstats/
[html5-parsing]: http://whatwg.org/specs/web-apps/current-work/#parsing