title attribute and abbreviated class names (Was:[uf-discuss]Currency Quickpoll: Preliminary results)

Sat Oct 14 19:39:09 PDT 2006

On Oct 14, 2006, at 3:27 PM, Mike Schinkel wrote:

>>> Your examples seem to leave a lot of ambiguity about what things
>>> mean,
>
> I'm new to proposing microformats, so I clearly have a lot to learn, but
> that said I don't see where what I was proposing was ambiguous. Can you give
> me explicit examples where allowing default assumptions for the most common
> use cases will by necessity lead to ambiguity?  It seems to me that either
> something will be specified or if not it will default?  That seems non
> ambiguous to me. Am I wrong?

I'm not entirely sure we're talking about the same thing anymore,  
after reading this exchange:

On Oct 14, 2006, at 3:55 PM, Mike Schinkel wrote:

>>> That said, why not make the "symbol" markup optional?
>
> That's IMO is an additional good idea.

I thought that was basically what you were advocating, but you called  
it an /additional/ good idea, so I'm not sure what it's an addition  
to.  I thought what you suggested was to allow for explicit  
differentiation between the currency identifier and the amount, but  
in certain cases where such differentiation can be made by matching a  
regular expression, allow for markup without explicit  
differentiation, leaving the differentiation implicitly to the parser  
to figure out.  For example, this would be valid:

本が<span class="money"><abbr class="amount" title="1000">一千</abbr><abbr class="currency" title="JPY">円</abbr></span>

because it doesn't fit the pattern you suggested, but this would also  
be valid:

The book is <span class="money">$5.99</span>.

because it does follow the pattern, where everything that's not  
within a certain character group is considered a currency symbol  
(i.e. "$").  If this isn't what you're suggesting, then I'm not clear  
on what you're suggesting.
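
To make those two cases concrete, here is a minimal sketch in Python of a parser that prefers explicit markup and falls back to the raw text when no parts are identified.  The names MoneyParser and parse_money are made up for illustration, and the class names "money", "amount", and "currency" are just the ones used in the examples above, not anything that has been agreed on.

```python
from html.parser import HTMLParser


class MoneyParser(HTMLParser):
    """Collect explicit amount/currency parts from a class="money" element,
    or the raw text when no parts are marked up (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # nesting depth inside the money element (0 = outside)
        self.parts = {}   # explicit parts found, keyed by class name
        self.text = []    # raw text seen inside the money element

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if self.depth:
            # Already inside the money element: look for explicit parts.
            # (Void elements such as <br> would upset this depth count;
            # that is ignored here to keep the sketch short.)
            self.depth += 1
            for name in ("amount", "currency"):
                if name in classes:
                    self.parts[name] = attrs.get("title")
        elif "money" in classes:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.text.append(data)


def parse_money(markup):
    parser = MoneyParser()
    parser.feed(markup)
    parser.close()
    if parser.parts:
        # Explicit differentiation: nothing left to guess.
        return ("explicit", parser.parts.get("amount"), parser.parts.get("currency"))
    # Implicit differentiation: the parser is left holding a raw string.
    return ("implicit", "".join(parser.text), None)


print(parse_money('The book is <span class="money">$5.99</span>.'))
# ('implicit', '$5.99', None)
```

In the implicit case the parser still has to split "$5.99" into a symbol and an amount, which is where the pattern matching comes in.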

But if this is what you're suggesting, I think you're underestimating  
the complexity involved in defining which characters might be part of  
an amount and which characters might be part of a currency symbol.  I  
do a lot of parsing via regular expressions and a large part of my  
interest in microformats comes from witnessing the failure rate in  
such parsing.  There's always another unexpected format popping up  
and before you know it, the regular expression is a page long.  For a  
quick taste, see this page listing regular expressions written to  
parse currency values:

http://regexlib.com/Search.aspx?k=currency

And those are all defining legitimate input much more strictly than  
would be appropriate for the web at large.

To specifically answer your question of what doesn't work with  
[A-Za-z0-9]: there's the decimal point, which is part of the amount  
rather than the currency symbol; there are commas, which are also  
part of the amount rather than the currency symbol; and whitespace  
characters (of which there are many) shouldn't be considered part of  
either the amount or the currency symbol.  That's all I can think of  
right now, but I have no doubt there's much more I haven't thought  
of, and it's that much more I'm worried about.  So if we come up with  
a definition that includes all of that, now we're talking about  
explaining to authors that they can only leave out the currency  
markup if their class="money" tag contains only letters, numbers,  
decimal points, commas, and whitespace.  Otherwise they have to  
explicitly identify the individual parts.
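
Here's a rough Python sketch of the problem.  It assumes the rule is that characters matching [A-Za-z0-9] form the amount and everything else forms the currency symbol; naive_split and wider_split are names I've made up for illustration, and the widened character class is only my guess at a fix, not a complete definition.

```python
import re


def naive_split(text):
    # Assumed rule: [A-Za-z0-9] is the amount, everything else is the symbol.
    amount = "".join(re.findall(r"[A-Za-z0-9]", text))
    symbol = "".join(re.findall(r"[^A-Za-z0-9]", text))
    return symbol, amount


def wider_split(text):
    # Widened guess: decimal points and commas join the amount,
    # and whitespace belongs to neither part.
    amount = "".join(re.findall(r"[A-Za-z0-9.,]", text))
    symbol = "".join(re.findall(r"[^A-Za-z0-9.,\s]", text))
    return symbol, amount


print(naive_split("$5.99"))       # ('$.', '599')
print(naive_split("$1,000.00"))   # ('$,.', '100000')
print(wider_split("$ 1,000.00"))  # ('$', '1,000.00')
```

The widened version handles these inputs, but every addition to the character class is one more rule that authors have to know before they can safely leave out the explicit markup.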

I think this is already more confusing than just always identifying  
the individual parts, I think it's still likely to cause problems,  
and I think it's only helping a slight majority that is quickly  
becoming a minority.  English language web pages make up only about  
55% of the web today, and that percentage is quickly shrinking.  So  
say I'm publishing my currency in English, and you're trying to ease  
my implementation burden, so I don't have to explicitly define my  
currency symbol and parsers will just figure it out for me.  What if  
I want my whitespace to be marked up with HTML entities? E.g.:

The book costs <span class="money">$&nbsp;5.99</span>

That's not an unlikely scenario.  I actually publish currency values  
like that, when someone wants a space to separate the $ from the  
amount, but they don't want the two getting split onto separate  
lines.  Are we going to include that in the regular expression too or  
do I need to explicitly identify my symbol?  If it's not allowed, how  
will that be explained clearly enough that I won't make this mistake  
and wind up with my currency symbol wrongly interpreted as "$&nbsp;",  
which doesn't map to any known currency, and will lose my space if  
it's replaced by another currency symbol?  This is the kind of  
ambiguity that doesn't really help publishers.  And if it is in the  
regular expression, how are we going to explain to publishers that  
it's okay?  Looks like unnecessary complication to me.
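
As a small Python illustration of that case: the sketch below assumes a rule where digits, decimal points, and commas form the amount and everything else is the symbol.  The function name split_symbol is hypothetical, and whether a parser sees the raw entity or the decoded character depends on how it reads the page.

```python
import html
import re

raw = "$&nbsp;5.99"           # the text as written in the page source
decoded = html.unescape(raw)  # the entity becomes U+00A0 (no-break space)


def split_symbol(text):
    # Assumed rule: digits, decimal points, and commas are the amount;
    # every other character is treated as part of the currency symbol.
    symbol = "".join(re.findall(r"[^0-9.,]", text))
    amount = "".join(re.findall(r"[0-9.,]", text))
    return symbol, amount


print(split_symbol(raw))      # ('$&nbsp;', '5.99')
print(split_symbol(decoded))  # ('$\xa0', '5.99')

# Stripping whitespace recovers "$", but the no-break space the author
# wanted is gone if the symbol is later swapped for another one.
print(split_symbol(decoded)[0].strip())  # $
```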

> But one final point on this; has this been discussed this with those
> who make the decisions for markup used at the largest sites: Google, eBay,
> Amazon, etc.?  Just curious? (and I don't mean to push this, it's just that
> being pedantic is in my nature, unfortunately. :)

There are people from Yahoo! on this list, and Technorati's pretty  
big too, so they'd be good people to say whether or not they really  
care how long the class names are.

Peace,
Scott