[uf-discuss] class="tag"
Duncan Cragg
uf-discuss at cilux.org
Tue Jul 1 13:49:33 PDT 2008
Ciaran McNulty wrote:
> On Sun, Jun 29, 2008 at 3:07 PM, Duncan Cragg <uf-discuss at cilux.org> wrote:
>
>> Those of us who favour opaque URLs (actually for practical reasons such as
>> clean separation of concerns, maintainability, etc.) are unhappy with being
>> forced into a semantic URL schema when using rel-tag.
>>
> Can you go into a bit more detail, or point to a resource explaining
> the benefits of opaque URLs? It's something I've not come across
> before and I'd be intrigued to see the reasons behind it.
>
I'll do both. Here's a resource explaining it - I addressed the subject
in this blog post:
http://duncan-cragg.org/blog/post/content-types-and-uris-rest-dialogues/
That is a very transparent URL (see: I'm not obsessive about it!).
The trouble with my URL is that it mixes three concerns:
1. making a connection to my server and kicking off HTTP
2. identifying a resource (with a completely opaque string) within HTTP
3. kicking off some Python code with an argument string
It's concerns 1. and 3. that I'm talking about; URLs are already opaque to HTTP.
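To make that split concrete, here's a rough Python sketch (purely
illustrative) of where each concern lives in that URL:

  from urllib.parse import urlsplit

  parts = urlsplit('http://duncan-cragg.org/blog/post/content-types-and-uris-rest-dialogues/')
  # 1. making a connection: scheme and host say where to open the socket
  parts.scheme, parts.netloc   # ('http', 'duncan-cragg.org')
  # 2. identifying a resource: to HTTP itself the path is just an opaque key
  parts.path                   # '/blog/post/content-types-and-uris-rest-dialogues/'
  # 3. application dispatch: my blog engine re-parses that same path to decide
  #    which Python code to run and which post to render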
As soon as you allow syntax or a schema into URLs - as soon as you start
using anything other than long random numbers - you've got a problem of
namespace allocation and schema standardisation. I refer to "Zooko's
Triangle" on my blog's right rail, which discusses the trade-off between
global uniqueness, security and memorability.
_________________________________________
On 1.: Unless you're running fancy P2P algorithms, it's hard to argue
against putting a big hint in the URL to say where to go to find the
resource. But don't forget that you needn't go to that server - you
could ask an intermediary proxy - which is kind of a simplistic P2P
algorithm...
However, there is a case for arguing that DNS has been a failure: it
isn't any easier to type a URL when you know you have to be so
precise to avoid scam sites. And it isn't any easier to use it to
identify a site when you have to avoid the likes of
www.yahoo.com.baddies.com or www.google.randomtld . You may as well just
use IP addresses; they're as hard to type and as useless to read. Most
programs come with a copy-paste function to save some typing...
Add to this lack of security (and other security holes) the absurd
scramble for domain-name real estate, bad behaviour such as domain
squatting, and so on, and it's looking like a system that only system
admins and crooks benefit from.
Most people (including myself) would type 'acme' into Google instead of
'acme.com' into the URL bar, to give an extra level of intelligence,
familiarity, trust and user interface consistency.
_________________________________________
But really it's 3. that bothers me most: using URLs to pass
human-readable strings to an application 'above' HTTP.
A transparent URL string is always a query string (whether it has a '?'
or not) - in other words, it can be ambiguous and return, not definitely
one result, but zero or many. We probably get zero results when we
'hack' a URL or when the site gets reorganised. We gloss over the
many-results case by returning a single page that we call 'query
results'. But by letting in zero or many resources so easily, we've
loosened the Web by removing the definite 1-to-1 mapping of URL to
resource.
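To put that in database terms (a toy sketch with made-up data, not
anyone's real code): a transparent URL behaves like a query, an opaque
one like a key lookup:

  all_posts = [{'id': 'a1b2', 'tags': ['semweb']},
               {'id': 'c3d4', 'tags': ['rest', 'semweb']}]
  posts_by_id = {p['id']: p for p in all_posts}

  # transparent URL such as /tag/semweb: really a query - zero, one or many matches
  matches = [p for p in all_posts if 'semweb' in p['tags']]

  # opaque URL such as /a1b2: a key lookup - exactly one resource, or a definite 404
  resource = posts_by_id.get('a1b2')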
Hackable URLs should not be part of a self-respecting website's user
interface. We would give a better user experience if we took the URL bar
away and replaced it with a 'jump to first clipboard web link' button,
for those copy-paste situations. Such a button would intelligently parse
the text on the clipboard for URLs and jump to the first location
discovered. A good information architecture and user interaction design
makes hackable URLs irrelevant.
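Such a button needs nothing cleverer than this sort of thing (a toy
sketch; the real one would live in the browser, not in a script):

  import re

  def first_clipboard_link(clipboard_text):
      # find the first thing that looks like an http(s) URL and jump to it
      match = re.search(r'https?://\S+', clipboard_text)
      return match.group(0) if match else None

  first_clipboard_link('latest semweb posts: http://tagbeat.com/3720a-993117b')
  # -> 'http://tagbeat.com/3720a-993117b'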
Another problem is when people start using their knowledge of the URL
structure to generate new URLs - it may be acceptable or encouraged
(even prescribed in an HTML GET form), but each time it happens, we're
creating a unique mini-contract - another non-standard schema. The Web
thrives on URL proliferation, not on schema proliferation!
The need for URLs to be reliable - to always return what they are
expected to return each time they're used - means that whatever URL
schema or namespace you come up with is something you're stuck with -
people or even programs may depend on it. But there's no standards body
or namespace body looking after the bigger picture for you. Your
mistakes may haunt you for a long time.
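A sketch of the difference (the /tag/<name> layout here is invented,
just to show the shape of the contract):

  # the 'mini-contract': clients that build URLs like this are now wedded to the
  # /tag/<name> layout - the site can never reorganise without breaking them
  def tag_url(tag):
      return 'http://example.org/tag/' + tag

  # versus simply following an opaque link the server handed out in its content:
  latest_semweb = 'http://tagbeat.com/3720a-993117b'   # no schema knowledge needed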
Also, query URLs are inherently /not/ reliable - the resource they
return is /expected/ to change, which again makes their (re)-use less
desirable.
Clearly, the W3C's unfortunate 'httpRange-14' issue would never have
occurred with opaque URLs. In other words, opaque, semantics-free HTTP
URIs are /always/ dereferenceable to 'information resources' and /never/
refer to cars! Strings that are part of a car domain model belong inside
/content/, not in links to content - they belong above HTTP. I'm not
fully conversant in the Semantic Web domain, but I suspect there are
issues in there caused by mixing up the globally unique identifier
strings used to build information structures with strings that are
semantically meaningful over those structures and that can dereference
to sets.
So my main objection to transparent URLs is the way they mix up the
mechanism for linking up the Web with a mechanism for querying it. The
Web works fine using HTTP and opaque URLs. We have POST and Content-Type
and OpenSearch schemas to query the Web.
_________________________________________
Practical examples...
You can return opaque links to time-ordered collections listing the
latest documents to be tagged 'semweb':
<a class="tag" href="http://tagbeat.com/3720a-993117b">semweb</a>
Keep your URLs opaque (like GUIDs in databases) and put your application
data and queries in the content (like SQL queries and result sets in
databases). Give your query content resources a first-class schema - see
OpenSearch - and even their own URLs. POST these queries to opaque
collection URLs. Make your result sets transient (returned in the POST
response, thus no-cache by default). Result sets should only be
'grounded' (thus linkable and cacheable) if explicitly asked for in the
query, when you should redirect to a new resource in the POST response.
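For instance, something along these lines (only a sketch - the query
fields and headers are invented, not any particular spec):

  import http.client

  # POST a query document to the opaque collection URL; the result set comes back
  # in the response body and is transient (no-cache), unless the query asked to be
  # 'grounded', in which case the server redirects to a freshly minted opaque URL
  query = 'tag=semweb&ground=false'
  conn = http.client.HTTPConnection('tagbeat.com')
  conn.request('POST', '/3720a-993117b', body=query,
               headers={'Content-Type': 'application/x-www-form-urlencoded'})
  response = conn.getresponse()
  if response.status == 303:
      print('grounded result set at', response.getheader('Location'))
  else:
      print(response.read())   # transient result set, not for linking or caching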
Of course, you can still surround the UUID/GUID part of your opaque URLs
with human-readable string decorations, as long as they're never used to
dereference the resource but serve only as mnemonics or for
search-engine optimisation.
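So a URL like http://tagbeat.com/3720a-993117b/latest-semweb-posts is
fine, as long as the server only ever looks at the opaque part (again, a
hypothetical sketch):

  import re

  def resource_key(path):
      # only the leading opaque token identifies the resource; anything after it
      # is decoration for humans and search engines, and is ignored here
      return re.match(r'/([0-9a-f-]+)', path).group(1)

  resource_key('/3720a-993117b/latest-semweb-posts')   # -> '3720a-993117b'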
_________________________________________
I've gone on at length (again!), but I hope you've had the patience to
take in my point of view. =0)
Cheers!
Duncan Cragg
PS I work at the Financial Times over the river from you - but I was a
URL opacitist /before/ having to wrangle with the FT CMS...!