[uf-discuss] stats on well formed XHTML
Derrick Lyndon Pallas
derrick at pallas.us
Wed Jan 16 19:31:42 PST 2008
Kevin Burton wrote:
>> I'm not sure what you mean here, but I'd recommend against using an
>> XML parser against web content and instead use something like the
>> HTML5 parsing algorithm [#html5-parsing].
> Yes... I'm just trying to avoid using a full HTML parser (DOM or not)
> to avoid garbage generation and processor overhead.
I use a streaming (SAX-like) HTML5 parser every day; because it's
defined in terms of the underlying state machine, it's actually quite a
bit faster than what I had been using. Furthermore, many edge cases that
might otherwise have gone unnoticed are dealt with cleanly.
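
As a rough illustration of the streaming, event-driven style described
above (this is not the parser referred to in this message), here is a
minimal sketch using Python's standard-library html.parser. That module
is tolerant of malformed markup but does not implement the HTML5 parsing
algorithm, so treat it as a stand-in for the general shape only. The
names LinkCollector and collect_links are hypothetical.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects href values from <a> start tags as events arrive.

    No DOM tree is built; the only state kept is the list of links.
    """

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def collect_links(chunks):
    """Feed the parser one chunk at a time, as if reading from a socket."""
    parser = LinkCollector()
    for chunk in chunks:
        parser.feed(chunk)  # incomplete tags are buffered until more data arrives
    parser.close()
    return parser.links
```

Calling collect_links() with an iterable of text chunks returns the
hrefs in document order, even when a tag is split across two chunks or
the markup is never closed.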
There are bigger problems that you'll face if you're indexing content,
e.g. encoding issues. Tokenizing HTML shouldn't be one of them.
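
To make the encoding point concrete, here is a small sketch of one
common fallback strategy when decoding fetched bytes. The order of
candidates (declared charset, then UTF-8, then windows-1252) and the
name decode_page are assumptions for illustration, not something
prescribed in this thread; a real indexer would also look at BOMs and
<meta> declarations.

```python
def decode_page(raw, declared=None):
    """Decode raw bytes, trying a declared charset before common fallbacks.

    Returns a (text, encoding_used) tuple.
    """
    for enc in (declared, "utf-8", "windows-1252"):
        if not enc:
            continue
        try:
            return raw.decode(enc), enc
        except (UnicodeDecodeError, LookupError):
            # Wrong or unknown charset label: try the next candidate.
            continue
    # Last resort: keep going, marking undecodable bytes with U+FFFD.
    return raw.decode("utf-8", errors="replace"), "utf-8"
```

A mislabelled or unlabelled page then degrades to a best-effort decode
instead of aborting the indexing run.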