title attribute and abbreviated classnames(Was:[uf-discuss]Currency Quickpoll: Preliminary results)

Mike Schinkel mikeschinkel at gmail.com
Wed Oct 18 13:41:29 PDT 2006


>> <span class="money" title="USD">$5.99</span>
>> I still think this is bad semantics.  I don't think "USD" is really a title for "$5.99".  

I'll accept that.  

>> I'd propose this as an alternative:
>> <abbr class="currency" title="USD">$</abbr>5.99 

Okay... But is it a good idea to have a microformat as a prefix/suffix instead of as a container? (general question - I hope it hasn't been answered before...)  

If so, you'll also need (note the space after 35.66):

	35.66 <abbr class="currency" title="DKK">kr</abbr> 

However, at the risk of being shot for heresy, has anyone considered allowing this?

	<abbr class="currency usd">$5.99</abbr>
	<abbr class="currency dkk">35.66 kr</abbr>

OR (something tells me this is even worse, but...):

	<abbr class="money currency-usd">$5.99</abbr>
	<abbr class="money currency-dkk">35.66 kr</abbr>

I'm sure there is something just so wrong about this, but part of the reason I'm on this list is to learn. So why not?
Additionally, that would allow:

	<abbr class="currency usd" title="5.99">Five Dollars and 99 cents</abbr>
	<abbr class="currency dkk" title="35.66">Thirty Five point 66 Kroners</abbr>

OR (for orthogonality):

	<abbr class="money currency-usd" title="5.99">Five Dollars and 99 cents</abbr>
	<abbr class="money currency-dkk" title="35.66">Thirty Five point 66 Kroners</abbr>
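
To make the class-based variants above concrete, here is a rough sketch of how a parser might pull the currency code out of the class attribute. The function name currencyFromClass is hypothetical, and the assumption that the code is always a three-letter token (as in USD or DKK) is mine, not something the list has agreed on.

```javascript
// Hypothetical helper, not part of any proposal: reads a three-letter
// currency code from a class attribute value.
// Assumed spellings: "currency usd" and "money currency-usd".
function currencyFromClass(classAttr) {
  const tokens = classAttr.trim().split(/\s+/);

  // Prefixed form: a single "currency-xxx" token carries the code.
  for (const token of tokens) {
    const prefixed = /^currency-([a-z]{3})$/i.exec(token);
    if (prefixed) return prefixed[1].toUpperCase();
  }

  // Unprefixed form: a three-letter token next to the "currency" marker class.
  if (tokens.includes("currency")) {
    const code = tokens.find((t) => t !== "currency" && /^[a-z]{3}$/i.test(t));
    if (code) return code.toUpperCase();
  }

  return null; // no currency information found
}
```

In this sketch the unprefixed form would also accept any other three-letter class (say, "red") as a currency code, which may be one practical argument for the prefixed spelling.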

Just a thought...?

-Mike
P.S. Damn I wish HTML had allowed "rel" for all tags including <span> and <abbr>.  Or that we could just use it anyway and not get shot for heresy. :)


-----Original Message-----
From: microformats-discuss-bounces at microformats.org [mailto:microformats-discuss-bounces at microformats.org] On Behalf Of Scott Reynen
Sent: Tuesday, October 17, 2006 10:30 AM
To: Microformats Discuss
Subject: Re: title attribute and abbreviated classnames(Was:[uf-discuss]Currency Quickpoll: Preliminary results)

I've started replying to this a few times and become stuck in trying to fit what I'm trying to say in the existing thread, so I'm just going to make some points completely detached from the thread.

First, I think Mike is right that the vast majority of published money formats allow parsers to infer the distinction between the currency symbol and the amount.  But this inference is already possible without a microformat.  What's missing currently is:

1) an indication of which specific currency the symbol refers to.
2) the ability to mark up money that doesn't fit this pattern.

I think it's best to either cover #1 or both, but I think it's too complicated for publishers to provide what amounts to two distinct microformats depending on a relatively complex pattern definition.  That is, if we're going simple (only #1), I think we should go only simple, and add the complex form to cover #2 later.

So to cover #1, Mike has suggested:

<span class="money" title="USD">$5.99</span>

I still think this is bad semantics.  I don't think "USD" is really a title for "$5.99".  I'd propose this as an alternative:

<abbr class="currency" title="USD">$</abbr>5.99

That is, mark up the currency as currency, and treat any adjacent numbers as the amount.
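
As a rough illustration of the "adjacent numbers" rule, the sketch below scans an HTML string for the currency abbr and picks up a number directly before or after it. Everything here is an assumption for illustration: the name findMoney, the use of a regular expression in place of a real HTML parser, the exact attribute order in the markup, and the choice to prefer the number that follows the symbol when both are present.

```javascript
// Sketch only: a regex over the raw string stands in for a real HTML parser.
// NUM is an assumed definition of "amount": digits, optionally followed by
// groups of "." or "," plus more digits (so a trailing period is not captured).
const NUM = "[0-9]+(?:[.,][0-9]+)*";
const CURRENCY_ABBR = new RegExp(
  `(?:(${NUM})\\s*)?<abbr class="currency" title="([A-Z]{3})">[^<]*</abbr>(?:\\s*(${NUM}))?`,
  "g"
);

// Hypothetical function: returns one { currency, amount } per marked-up symbol
// that has a number next to it.
function findMoney(html) {
  const results = [];
  for (const m of html.matchAll(CURRENCY_ABBR)) {
    // m[1] is a number before the symbol, m[3] a number after it.
    const amount = m[3] !== undefined ? m[3] : m[1];
    if (amount !== undefined) {
      results.push({ currency: m[2], amount });
    }
  }
  return results;
}
```

With this sketch, both the prefix form ($ before 5.99) and a suffix form (35.66 before kr) yield the amount as a string, leaving the interpretation of "." and "," to the caller.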

To cover #2, I think we need an additional class="money" container, and a class="amount" markup for the amount, and this could be added without changing the parsing rules for the simple form I've suggested above.  I think it would be best to start with either simple or complex and look at adding the alternative after the microformat has gained some adoption.

I don't think regular expressions should be included in the spec at all.  If we're going to define amounts based on character ranges, we should describe those character ranges in plain English because most people, even most tech geeks, don't understand regular expressions at all.
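
For what it's worth, the character-range rule can be written down without any regular expression. The sketch below is one possible looping parser, under assumptions that are mine: amount characters are the digits 0-9, period, and comma; spaces (including the non-breaking space) only separate; everything else is the currency symbol; and a symbol is required. Unlike the regular expression quoted below, it accepts more than one separator, as in 1,234.56. The name splitMoney is hypothetical.

```javascript
// Sketch of a looping parser for the simple form, with no regular expressions.
function splitMoney(text) {
  const isDigit = (ch) => ch >= "0" && ch <= "9";
  const isAmountChar = (ch) => isDigit(ch) || ch === "." || ch === ",";
  const isSpace = (ch) => ch === " " || ch === "\u00a0";

  // Collect contiguous runs of "amount" or "symbol" characters.
  const runs = [];
  let current = null;
  for (const ch of text) {
    if (isSpace(ch)) {
      current = null; // a space always ends the current run
      continue;
    }
    const kind = isAmountChar(ch) ? "amount" : "symbol";
    if (current === null || current.kind !== kind) {
      current = { kind, value: "" };
      runs.push(current);
    }
    current.value += ch;
  }

  // Simple form: exactly one symbol run and one amount run, in either order.
  if (runs.length !== 2 || runs[0].kind === runs[1].kind) return null;
  const amount = runs.find((r) => r.kind === "amount").value;
  const symbol = runs.find((r) => r.kind === "symbol").value;

  // The amount must start with a digit; otherwise fall back to the complex form.
  if (!isDigit(amount[0])) return null;
  return { symbol, amount };
}
```

A null result here means "does not fit the simple form", which is the case where explicit markup of the individual parts would be needed.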

Peace,
Scott

On Oct 15, 2006, at 4:40 PM, Mike Schinkel wrote:

> Scott:
>
> Thanks for the reply. It probably got confusing on my part; I will try 
> to resolve that here if possible.
>
>>> I thought what you suggested was to allow for explicit 
>>> differentiation between the currency identifier and the amount, but 
>>> in certain cases where such differentiation can be made by matching 
>>> a regular expression, allow for markup without explicit 
>>> differentiation, leaving the differentiation implicitly to the 
>>> parser to figure out.  For example, this would be valid:...
>>> because it does follow the pattern, where everything that's not 
>>> within a certain character group is considered a currency symbol 
>>> (i.e. "$").  If this isn't what you're suggesting, then I'm not 
>>> clear on what you're suggesting.
>
> You got it 100%.  But I did make a mistake in my example as I didn't 
> mean to include alpha [A-Za-z]. It should just have been digits, 
> periods, and commas [0-9\.\,]; everything else would be the currency 
> symbol. I wasn't explicit about the following, but I will be now: no 
> spaces (or &nbsp;), and the currency figure must be
> contiguous and either prefix or suffix a collection of digits.   
> Anything else, and you need the complexity.
>
> Although I am not good with regex, I opened my regex book and my regex 
> test and crafted this regex which I think identifies 100% of the 
> special case to which I referred:
>
> ^([^0-9,\. ]*)([0-9]+[\.,]?[0-9]*)([^0-9,\. ]*)$
>
> In that regex, if $2 has a value, that's the amount.  If $1 OR $3 has 
> a value, then it's the symbol.  If it doesn't match, you *must* use 
> the complex form.  (btw, it would also be really easy to write a 
> recursive descent and/or a looping parser in javascript or other 
> languages to parse this, and we could publish those as reference
> implementations.)  We publish the regex (or a better written one) and 
> the recursive descent parsers and say if it matches, you can use the 
> simple form, otherwise the complex form.
>
> So the following could use the simple form:
>
> 	The book is <span class="money" title="USD">$5.99</span>.
> 	In Brazil, the book would be <span class="money" title="BRL">R$12.84</span>.
> 	In Denmark, the price would be <span class="money" title="DKK">35.66kr</span>.
>
> BTW, it wouldn't be hard to include spaces in the regex and it might 
> be a good idea to go ahead and do that. If so, you can use the 
> javascript replace() string function (or similar in other
> languages) to first normalize the string to contain only real 
> spaces and no &nbsp; like so:
>
> 	s.replace(/&nbsp;/g, " ")
>
> where "s" is the innertext for the <span> and then use this regex on 
> the result:
>
> 	^([^0-9,\. ]*)[ ]?([0-9]+[\.,]?[0-9]*)[ ]?([^0-9,\. ]*)$
>
> Where again $1 OR $3 will be the symbol and $2 will be the amount.   
> That would make these possible.
>
> 	The book is <span class="money" title="USD">$&nbsp;5.99</span>.
> 	In Brazil, the book would be <span class="money" title="BRL">R$ 12.84</span>.
> 	In Denmark, the price would be <span class="money" title="DKK">35.66 kr</span>.
>
> Yes, it is a little more difficult for the person writing the parser, 
> but there will be orders of magnitude more people writing the HTML 
> than writing parsers, and besides, we can provide a working regex and 
> reference implementation functions that will be good for 99% of cases 
> and just say "Here; use it!"
>
>>> http://regexlib.com/Search.aspx?k=currency
>
> I reviewed that and it appears that most of the regexes submitted do 
> essentially the same thing, each correcting for something others didn’t 
> do (like handling leading zeros); did I misread?
>
>>> and I think it's only helping a slight majority that is quickly 
>>> becoming a minority.  English language web pages only comprise about 
>>> 55% of the web today, and that percent is quickly shrinking.  So I'm 
>>> publishing my currency in English, and you're trying to ease my 
>>> implementation burden, so I don't have to explicitly define my 
>>> currency symbol and parsers will just figure it out for me.
>
> I respectfully think it won't be in the minority; I think it will be 
> the vast majority.  And it will work in other languages besides 
> English such as German, Spanish, French, Portuguese, Russian, Arabic, 
> and so on; any that use digits + periods/commas for representing 
> numbers.  It seems the only languages in any significant use that it 
> doesn't work for are those written in multibyte characters, which will 
> require the complexity because, frankly, they are complex.
>
>>> I think this is already more confusing than just always identifying 
>>> the individual parts, I think it's still likely to cause problems, 
>>> ..
>
> Requiring identification of individual parts is less confusing in an 
> abstract manner because you don’t assume anything, but it is more 
> difficult to learn because it requires everyone that implements it 
> grok the entire spec to be able to use it.  By offering a simpler 
> version, (I assert that) most people won't have to learn all of 
> the details because they will just use the simple version.  So it 
> could be described as such:
>
> 	The Money microformat has a simple version that applies in most cases,
> 	and a complex version for when you really need control or if you are
> 	using multibyte character sets. You can use the simple version if the
> 	markup to which you want to add this microformat is limited to:
> 		1.) currency symbols (e.g. $, £, etc.),
> 		2.) spaces,
> 		3.) digits (i.e. 0-9), and
> 		4.) decimal separators (comma "," or period ".")
> 	
> 	For example:
>
> 		The book is <span class="money" title="USD">$&nbsp;5.99</span>.
> 		In Brazil, the book would be <span class="money" title="BRL">R$ 12.84</span>.
> 		In Denmark, the price would be <span class="money" title="DKK">35.66 kr</span>.
>
> 	If however you want to mark up money represented in much more complex
> 	ways, you'll need to use the more complex version, for example:
>
> 		<p class="money">It'll cost you <abbr class="amount" title="50.00">fifty</abbr>
> 		<abbr class="currency" title="GBP">quid</abbr>, mate!</p>
>
> 		<span class="money">Can you spare <abbr class="amount" title="10">ten</abbr>
> 		<abbr class="currency" title="USD"><span class="unit">dollars</span></abbr>?</span>
>
> By describing it this way, people who can use the simple version  
> are never even required to drill down and learn the complex way.   
> This seems infinitely easier for the vast majority of people than for 
> them to have to grok the entire spec right off the bat.
> Frankly, when I first saw it I thought "It isn't really going to be 
> this complex, is it?"  I thought the theme behind microformats was 
> "Make the simplest addition to HTML markup required." That's one of 
> the reasons I was so drawn to the initiative.
>
> I actually think you'll end up with more invalid microformats if 
> people are required to implement the current proposal because it is 
> complex enough that it would be relatively easy for someone to get 
> wrong. By having a simpler format, you'll minimize the chance those 
> people get it wrong, and those who do go to the more complex form are 
> more likely to really study it and get it right, and there will be 
> fewer people overloading the experts with questions about it 
> (IMO).
>
> Question: Maybe we should vet this with typical web developers who are 
> NOT involved with the microformats initiative?  We could go out and 
> ask workaday web site developers and web site maintainers
> their opinion on the subject of what is easier to comprehend?   
> Honestly, I'm giving my opinion but I could find out my opinion is in 
> a tiny minority. Or vice versa.
>
> BTW, is there a plan to create a series of microformat validator pages 
> where someone could go and enter a URL and have it extract all the 
> data it found for a given microformat?  Without this, I think people 
> will end up creating lots of pages with invalid microformats.  And it 
> would need to be done for *each* microformat.
>
>>> There are people from Yahoo! on this list, and Technorati's pretty 
>>> big too, so they'd be good people to say whether or not they really 
>>> care how long the class names are.
> Yeah, I already said "Okay, concern addressed" in an earlier reply.
>
> Anyway, I'm hoping that my earlier mistake of including [A-Za-z]  
> was the main reason you objected and that you'll agree with a small  
> scope minimum form like I'm proposing.
>
> -Mike Schinkel
> http://www.mikeschinkel.com/blog
> http://www.welldesignedurls.org/
>
> P.S. On another note, another question just occurred to me: why are  
> you using "money" and not "hMoney?"
>
>
>
> -----Original Message-----
> From: microformats-discuss-bounces at microformats.org  
> [mailto:microformats-discuss-bounces at microformats.org] On Behalf Of  
> Scott Reynen
> Sent: Saturday, October 14, 2006 10:39 PM
> To: Microformats Discuss
> Subject: Re: title attribute and abbreviated class names(Was:[uf- 
> discuss]Currency Quickpoll: Preliminary results)
>
> On Oct 14, 2006, at 3:27 PM, Mike Schinkel wrote:
>
>>>> Your examples seem to leave a lot of ambiguity about what things
>>>> mean,
>>
>> I'm new to proposing microformats, so I clearly have a lot to learn,
>> but that said I don't see where what I was proposing was ambiguous.
>> Can you give me explicit examples where allowing default assumptions
>> for the most common use cases will by necessity lead to  
>> ambiguity?  It
>> seems to me that either something will be specified or if not it will
>> default?  That seems non ambiguous to me. Am I wrong?
>
> I'm not entirely sure we're talking about the same thing anymore,  
> after reading this exchange:
>
> On Oct 14, 2006, at 3:55 PM, Mike Schinkel wrote:
>
>>>> That said, why not make the "symbol" markup optional?
>>
>> That IMO is an additional good idea.
>
> I thought that was basically what you were advocating, but you  
> called it an /additional/ good idea, so I'm not sure what it's an  
> addition to.  I thought what you suggested was to allow for  
> explicit differentiation between the currency identifier and the  
> amount, but in certain cases where such differentiation can be made  
> by matching a regular expression, allow for markup without explicit  
> differentiation, leaving the differentiation implicitly to the  
> parser to figure out.  For example, this would be valid:
>
> 本が<span class="money"><abbr class="amount" title="1000">一千</abbr><abbr class="currency" title="JPY">円</abbr></span>
>
> because it doesn't fit the pattern you suggested, but this would  
> also be valid:
>
> The book is <span class="money">$5.99</span>.
>
> because it does follow the pattern, where everything that's not  
> within a certain character group is considered a currency symbol  
> (i.e. "$").  If this isn't what you're suggesting, then I'm not  
> clear on what you're suggesting.
>
> But if this is what you're suggesting, I think you're  
> underestimating the complexity involved in defining which  
> characters might be part of an amount and which characters might be  
> part of a currency symbol.  I do a lot of parsing via regular  
> expressions and a large part of my interest in microformats comes  
> from witnessing the failure rate in such parsing.  There's always  
> another unexpected format popping up and before you know it, the  
> regular expression is a page long.  See this page for a list of  
> regular expressions for identifying the information that needs to  
> be parsed from currency values for a quick
> taste:
>
> http://regexlib.com/Search.aspx?k=currency
>
> And those are all defining legitimate input much more strictly than  
> would be appropriate for the web at large.
>
> To specifically answer your question of what doesn't work with  
> [A-Za-z0-9], there's the decimal point, which is part of the amount  
> rather than the currency symbol, and there are any commas, which are  
> also part of the amount rather than the currency symbol, and any  
> whitespace characters (of which there are many) should be considered  
> part of neither the amount nor the currency symbol.  That's all  
> I can think of right now, but I have no doubt there's much more I  
> haven't thought of, and it's that much more I'm worried about.  So  
> if we come up with a definition that includes all of that, now  
> we're talking about explaining to authors that they can only leave  
> out the currency markup if their class="money" tag is only  
> containing letters, numbers, decimal points, commas, and  
> whitespace.  Otherwise they have to explicitly identify the  
> individual parts.
>
> I think this is already more confusing than just always identifying  
> the individual parts, I think it's still likely to cause problems,  
> and I think it's only helping a slight majority that is quickly  
> becoming a minority.  English language web pages only comprise  
> about 55% of the web today, and that percent is quickly shrinking.   
> So I'm publishing my currency in English, and you're trying to ease  
> my implementation burden, so I don't have to explicitly define my  
> currency symbol and parsers will just figure it out for me.  What  
> if I want my whitespace to be marked up with HTML entities? E.g.:
>
> The book costs <span class="money">$&nbsp;5.99</span>
>
> That's not an unlikely scenario.  I actually publish currency  
> values like that, when someone wants a space to separate the $ from  
> the amount, but they don't want the two getting  split onto  
> separate lines.  Are we going to include that in the regular  
> expression too or do I need to explicitly identify my symbol?  If  
> it's not allowed, how will that be explained clearly enough that I  
> won't make this mistake and wind up with my currency symbol wrongly  
> interpreted as "$&nbsp;", which doesn't map to any known currency,  
> and will lose my space if it's replaced by another currency  
> symbol?  This is the kind of ambiguity that doesn't really help  
> publishers.  And if it is in the regular expression, how are we  
> going to explain to publishers that it's okay?  Looks like  
> unnecessary complication to me.
>
>> But one final point on this; has this been discussed this with those
>> who make the decisions for markup used at the largest sites:
>> Google, eBay,
>> Amazon, etc.?  Just curious? (and I don't mean to push this, it's  
>> just
>> that being pedantic is in my nature, unfortunately. :)
>
> There are people from Yahoo! on this list, and Technorati's pretty  
> big too, so they'd be good people to say whether or not they really  
> care how long the class names are.
>
> Peace,
> Scott


_______________________________________________
microformats-discuss mailing list
microformats-discuss at microformats.org
http://microformats.org/mailman/listinfo/microformats-discuss


