[uf-discuss] stats on well formed XHTML

Wed Jan 16 19:31:42 PST 2008

Kevin Burton wrote:
>> I'm not sure what you mean here, but I'd recommend against using an
>> XML parser against web content and instead use something like the
>> HTML5 parsing algorithm [#html5-parsing].
>>     
>
> Yes... I'm just trying to avoid using a full HTML parser (DOM or not)
> to avoid garbage generation and processor overhead.
>   
I use a streaming (SAX-like) HTML5 parser every day; because the
algorithm is defined in terms of an underlying state machine, it's
actually quite a bit faster than what I had been using. Furthermore,
many edge cases that might otherwise have gone unnoticed are dealt
with cleanly.
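
To illustrate the streaming approach, here is a minimal sketch. It
assumes Python's standard-library html.parser as a stand-in: that module
is a lenient event-driven tokenizer, not an implementation of the HTML5
parsing algorithm, and it is not the parser referred to above. The names
LinkCollector and collect_links are hypothetical. Input is fed in chunks
and callbacks fire as tags are recognized, so no DOM is ever built.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects href values from <a> tags as events arrive; no tree is built."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased; attrs is a list of
        # (name, value) pairs, where value is None for bare attributes.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def collect_links(chunks):
    """Feed an iterable of text chunks to the parser and return the hrefs."""
    parser = LinkCollector()
    for chunk in chunks:
        # A tag split across two chunks is buffered until it is complete.
        parser.feed(chunk)
    parser.close()
    return parser.links
```

Sloppy input such as unclosed elements or unquoted attribute values is
tolerated here, but the recovery rules are the module's own and may
differ from what the HTML5 specification prescribes.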

There are bigger problems that you'll face if you're indexing content, 
e.g. encoding issues. Tokenizing HTML shouldn't be one of them.
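
As one example of the encoding problem, a crawler has to turn raw bytes
into text before tokenizing, and the declared charset is often missing
or wrong. The sketch below is an assumption about one common fallback
order (declared charset, then UTF-8, then windows-1252), not a method
described in this thread; decode_page is a hypothetical helper.

```python
from typing import Optional


def decode_page(raw: bytes, declared: Optional[str] = None) -> str:
    """Decode page bytes, trying the declared charset first, then UTF-8."""
    candidates = [declared, "utf-8"] if declared else ["utf-8"]
    for encoding in candidates:
        try:
            return raw.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            # LookupError covers charset labels Python does not recognize.
            continue
    # Last resort: windows-1252, replacing the few undefined bytes.
    return raw.decode("windows-1252", errors="replace")
```

The final fallback never raises, so every page yields some text to
index, at the cost of possibly mis-rendering bytes from other legacy
encodings.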

~Derrick