[uf-discuss] Scraping or parsing?

Sun Mar 4 00:14:33 PST 2007

Danny Ayers wrote:
> Just as an aside (and I'm open to accusations of
> "architecture astronautics" here), if adding a profile
> attribute is hard for webmasters, the right answer is to
> make it easier rather than working around its absence.
> The <head> of a HTML document is an important part of the
> chain of authoritative metadata [1].

My pedantic side wants to yell "Yea! Right on!" but my pragmatic side tells
me that taking such a position is completely impractical because of the
proliferation of blogs, wikis, and CMSes that empower users to publish content
with no access to the <head>.  Access may be denied because the content
publisher is using software on servers they have no access to or because
they can't change the source code of their web app as they are not
technical enough/don't have admin access to the servers/don't want to fork
the source/have company policies that disallow mods/don't have the
source/etc.

You could say "Well the answer then is to get all the developers of all
those apps to provide the content publishers access to the <head> and then
get all the existing apps in the field replaced with the new versions!"
However, I think you'll agree that requiring such an approach is impractical
when it is possible to craft a workaround. Even if we could somehow dictate
the above, it would likely take a decade before most content publishers had
access to <head>.  After all, look how long it took to get the major
browsers to add (some) support for certain standards, and they numbered far
fewer than 10. There are hundreds of web apps for content publishing with
tens of millions of server installations; I don't see them being 'fixed' any
time soon. :-(

FWIW.

-- 
-Mike Schinkel
http://www.mikeschinkel.com/blogs/
http://www.welldesignedurls.org
http://atlanta-web.org - http://t.oolicio.us
"It never ceases to amaze how many people will proactively debate away
attempts to improve the web..."