measure-brainstorming: Difference between revisions

From Microformats Wiki

Revision as of 21:06, 26 July 2023

Measure Microformat Brainstorming

This page collects ideas on how to use semantic XHTML to represent measures unambiguously.

HTML proposals

There is some evidence of past proposals to add measures/quantities directly to HTML, which should be considered in the context of any overall markup solution:

quantity element

https://lists.w3.org/Archives/Public/public-whatwg-archive/2009Aug/0234.html proposed a quantity element

[…]

I've been looking at the meter element, which specifically states that 
"There is no explicit way to specify units in the meter element, but the 
units may be specified in the title attribute in free-form text."

Having used the web for the past 15 years I've always felt that it's a 
shame when you run into a page with a set of measurements and those 
can't be interpreted automatically in a sensible fashion. Especially 
with the fact that there are both imperial and metric units still around 
in this day and age.

An backwards compatible inline element to specify a quantity would be 
rather trivial:

<quantity unit="cm">12 cm</quantity>
<quantity unit="kg">2 kg</quantity>

With this implementation a number inside the quantity element would be 
interpreted as the numerical value of the unit. Other characters would 
be ignored.

[…]


microformats2

Problem: no existing microformats-2 structures can represent measure values.

Proposal: h-measure, h-angle, h-money

Based on existing old draft schema

Example Readable Text

weighs 79.3kg

100W Light Bulb

Price: $9.90

Example Markup

<span class="h-measure">
  <data class="p-type" value="weight">weighs</data> 
  <data class="p-num" value="79300">79.3</data><data class="p-unit" value="g">kg</data>
</span>

<span class="h-measure">
  <span class="p-num">100</span><span class="p-unit">W</span> 
  <span class="p-item">Light Bulb</span>
</span>

<span class="h-money">
  <span class="p-type">Price</span>:
  <data class="p-unit" value="USD">$</data><span class="p-num">9.90</span>
</span>

Example Parsed JSON

{
  "type": [
    "measure"
  ],
  "properties": {
    "num": [79300],
    "unit": ["g"],
    "type": ["weight"]
  }
}

{
  "type": [
    "measure"
  ],
  "properties": {
    "num": [100],
    "unit": ["W"],
    "item": ["Light Bulb"]
  }
}

{
  "type": [
    "money"
  ],
  "properties": {
    "num": [9.90],
    "unit": ["USD"],
    "type": ["price"]
  }
}

Issues

  • The type of measurement is not always written out in the English text, so a property such as "weight" may have to be added even though the word does not appear in the sentence.
  • Requires three new root-level definitions (measure, angle, money)


Proposal: m-* values

Based on the existing proposed n-* prefix documented here: microformats2-prefixes#prefixes_for_future_consideration

Example Readable Text

79.3kg

100W Light Bulb

Price: $9.90

Example Markup

<span class="m-weight">
  <data class="p-num" value="79300">79.3</data><data class="p-unit" value="g">kg</data>
</span>

<span class="m-power">
  <span class="p-num">100</span><span class="p-unit">W</span> <span class="p-item">Light Bulb</span>
</span>

Price: <span class="m-price"><data class="p-unit" value="USD">$</data><span class="p-num">9.90</span></span>

Example Parsed JSON

{
  "weight": [
    {
      "type": [
        "measure"
      ],
      "properties": {
        "num": [79300],
        "unit": ["g"]
      }
    }
  ]
}

{
  "power": [
    {
      "type": [
        "measure"
      ],
      "properties": {
        "num": [100],
        "unit": ["W"],
        "item": ["Light Bulb"]
      }
    }
  ]
}

{
  "price": [
    {
      "type": [
        "measure"
      ],
      "properties": {
        "num": [9.90],
        "unit": ["USD"]
      }
    }
  ]
}

Issues

  • Requires the definition of a "type" in order for the value to have a property name in the parsed JSON.
  • How to indicate the measure type (standard, angular, money)? Or is this distinction even important?


Proposal: h-measure named with p-*

This is basically a combination of the two above, dropping the individual types (standard, angular, money) in favor of just using "measure", and allowing these properties to be named the same way the author of an h-entry is named using p-author.

Example Readable Text

weighs 79.3kg

100W Light Bulb

Price: $9.90

Example Markup

Standalone measure: <span class="h-measure">
  <span class="p-name">Weight of a widget</span>: <data class="p-num">1.5</data><data class="p-unit">kg</data>
</span>

Example as properties of an h-card:

<div class="h-card">
 <span class="p-name">Joe Bloggs</span>
 
 weighs 
 <span class="h-measure p-weight">
   <data class="p-num" value="79300">79.3</data><data class="p-unit" value="g">kg</data>
 </span>
</div>

<div class="h-product">
 Price as property of an h-product: 
 <span class="h-measure p-price">
  <data class="p-unit" value="USD">$</data><span class="p-num">9.90</span>
 </span>
</div>

Example Parsed JSON

{
    "items": [
        {
            "type": [
                "h-measure"
            ],
            "properties": {
                "name": [
                    "Weight of a widget"
                ],
                "num": [
                    "1.5"
                ],
                "unit": [
                    "kg"
                ]
            }
        },
        {
            "type": [
                "h-card"
            ],
            "properties": {
                "weight": [
                    {
                        "type": [
                            "h-measure"
                        ],
                        "properties": {
                            "num": [
                                "79300"
                            ],
                            "unit": [
                                "g"
                            ],
                            "name": [
                                "79.3kg"
                            ]
                        },
                        "value": "79.3kg"
                    }
                ],
                "name": [
                    "Joe Bloggs"
                ]
            }
        },
        {
            "type": [
                "h-product"
            ],
            "properties": {
                "price": [
                    {
                        "type": [
                            "h-measure"
                        ],
                        "properties": {
                            "unit": [
                                "USD"
                            ],
                            "num": [
                                "9.90"
                            ],
                            "name": [
                                "$9.90"
                            ]
                        },
                        "value": "$9.90"
                    }
                ],
                "name": [
                    "Price as property of an h-product: \r\n \r\n  $9.90"
                ]
            }
        }
    ]
}

Issues

  • There is no distinction between "standard", "angular" and "money" types other than the units present. This may or may not be important, need to gather some use cases and examples for each. Aaronpk 20:07, 16 June 2013 (UTC)
  • If there is no p-* property defined, the h-measure will end up in the "children" array. Aaronpk 20:07, 16 June 2013 (UTC)
  • This only works when h-measure is a child microformat, i.e. the example given actually makes no sense. But otherwise I feel this is the sanest proposal --bw 20:05, 16 June 2013 (UTC)
    • If the p-* property is omitted from the h-measure, wouldn't this just become a new top-level item in the items array? Aaronpk 20:07, 16 June 2013 (UTC)
      • I updated the examples showing how this could work standalone and/or as a property with live output from php-mf2 --bw 20:23, 16 June 2013 (UTC)

Examples in the Wild

Old pre-mf2 brainstorming

Draft Schema

Rationale: The names "type" and "item" are taken from hReview.

open issue! Is tolerance needed? It is useful for some circumstances, but perhaps not common enough to be included in the spec.

open issue! A dtmeasured property may be useful, especially for hmoney, as prices fluctuate.

Standard Measure Schema

  • hmeasure
    • num {1} (numeric)
    • unit {1} (unit)
    • item? (text | hCard | hCalendar)
    • type ? (text, e.g. "height", "width", "weight")
    • tolerance ? (percentage | hmeasure)

Angular Measure Schema

  • hangle
    • num {1} (degree)
    • item? (text | hCard | hCalendar)
    • type ? (text, e.g. "angle of elevation")
    • tolerance ? (percentage | hangle)

Money Schema

  • hmoney
    • num {1} (numeric)
    • unit {1} (ISO 4217 code)
    • item? (text | hCard | hCalendar)
    • type ? (text, e.g. "price", "salary", "exchange rate")
    • tolerance ? (percentage | hmoney)

num: The Value

Arbitrary white space MAY be included in the value to improve readability (but only when the num class is explicitly used, not when minimisation is employed). Parsers MUST strip out all white space before further processing.

In the standard and money schemas, the value MUST be a number, formatted according to the following EBNF pattern:

non-zero-digit = "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9" ;
digit          = "0" | non-zero-digit ;
natural        = non-zero-digit , {digit} ;
integer        = "0" | [ "-" ] , natural ;
dot-decimal    = integer , "." , {digit} ;
comma-decimal  = integer , "," , {digit} ;
e-sign         = "e" | "E" ;
mantissa       = dot-decimal | comma-decimal | integer ;
sci-number     = mantissa , e-sign , integer ;
number         = dot-decimal | comma-decimal | integer | sci-number ;

This roughly corresponds to a subset of C syntax for floating points and integers, excluding octal and hexadecimal representations. However, note that both commas and stops may be used as decimal points.

The Unicode minus sign (U+2212) and ASCII-compatible hyphen-minus (U+002D) MUST both be treated as acceptable indicators of a negative number. In addition, the symbols ¼ (U+00BC), ½ (U+00BD) and ¾ (U+00BE) SHOULD be supported as aliases for 0.25, 0.5 and 0.75 respectively.
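The number rules above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the function name `parse_num` is hypothetical, the regular expression follows a strict reading of the EBNF (so "-0" and leading zeros are rejected), and the fraction symbols are only recognised when they make up the whole value, which is an assumption the draft does not spell out.

```python
import re

# Regular expression mirroring the EBNF "number" production.
_INTEGER = r"(?:0|-?[1-9][0-9]*)"
_DECIMAL = rf"{_INTEGER}[.,][0-9]*"
_MANTISSA = rf"(?:{_DECIMAL}|{_INTEGER})"
_NUMBER = re.compile(rf"(?:{_MANTISSA}[eE]{_INTEGER}|{_DECIMAL}|{_INTEGER})")

# Aliases the draft says SHOULD be supported.
_FRACTIONS = {"\u00bc": 0.25, "\u00bd": 0.5, "\u00be": 0.75}


def parse_num(text):
    """Return the value of a num string as a float, or None if it is invalid."""
    # Strip all white space first, then treat U+2212 like the ASCII hyphen-minus.
    compact = "".join(text.split()).replace("\u2212", "-")
    if compact in _FRACTIONS:
        return _FRACTIONS[compact]
    if _NUMBER.fullmatch(compact) is None:
        return None
    # Both comma and stop are decimal points; Python's float() only accepts the stop.
    return float(compact.replace(",", "."))
```

With this sketch, `parse_num("2,99792458e8")` yields 299792458.0, while `parse_num("1,000,000,000")` yields `None` because commas are not allowed as thousands separators.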

In the angular measure schema, a measure is expressed as a combination of up to three numeric components: called degrees, minutes and seconds. Any combination of these components may be used, except when degrees and seconds are given minutes MUST be present. The components MUST appear in the correct order (degrees, minutes, seconds). Each component must match the production rule for "mantissa" above, with the following additional constraints:

  • Only the first component can bear a minus sign. Subsequent components "inherit" the negativity (or lack thereof) from their predecessors.
  • All components except the last must match the production rule for "integer".

The numeric components MUST be indicated by appending a suffix to each component. Valid suffixes are:

  • degree: "deg", U+00B0 degree symbol (°)
  • minute: "min", straight single quote ('), U+2032 prime (′)
  • second: "sec", straight double quote ("), U+2033 double prime (″)
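A rough Python sketch of the angular rules above, converting a degrees/minutes/seconds string to decimal degrees. The function name `parse_angle` and the choice to return decimal degrees are assumptions made for illustration; the draft only defines which strings are valid.

```python
import re

_INT = r"(?:0|-?[1-9][0-9]*)"
_MANTISSA = rf"(?:{_INT}[.,][0-9]*|{_INT})"
# Suffixes listed in the draft for each component.
_DEG = "(?:deg|\u00b0)"
_MIN = "(?:min|'|\u2032)"
_SEC = '(?:sec|"|\u2033)'
_ANGLE = re.compile(
    rf"(?:(?P<deg>{_MANTISSA}){_DEG})?"
    rf"(?:(?P<min>{_MANTISSA}){_MIN})?"
    rf"(?:(?P<sec>{_MANTISSA}){_SEC})?"
)


def parse_angle(text):
    """Return the angle in decimal degrees, or None if the string is invalid."""
    compact = "".join(text.split()).replace("\u2212", "-")
    match = _ANGLE.fullmatch(compact)
    if not compact or match is None:
        return None
    parts = [match.group(name) for name in ("deg", "min", "sec")]
    present = [part for part in parts if part is not None]
    # Degrees and seconds without minutes is not allowed.
    if parts[0] is not None and parts[2] is not None and parts[1] is None:
        return None
    # Only the first component may carry a minus sign.
    if any(part.startswith("-") for part in present[1:]):
        return None
    # Every component except the last must be an integer.
    if any(re.fullmatch(_INT, part) is None for part in present[:-1]):
        return None
    sign = -1.0 if present[0].startswith("-") else 1.0
    total = 0.0
    for part, divisor in zip(parts, (1.0, 60.0, 3600.0)):
        if part is not None:
            total += abs(float(part.replace(",", "."))) / divisor
    return sign * total
```

The sign of the first component is applied to the whole sum, which is one way to model later components "inheriting" negativity.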

Examples

  • 1729 (the smallest number that can be expressed as the sum of two cubes in two different ways)
  • 1.61803399 (the golden ratio)
  • 2,99792458e8 (the speed of light in a vacuum, measured in metres per second)
  • -40 (value at which Celsius and Fahrenheit scales are equal)
  • 1,000,000,000 (Invalid: commas may be used as decimal points, but not for grouping thousands.)
  • 57.2958 deg (1 radian, in degrees)
  • -57° 17′ 44.8″ (-1 radian, in degrees, minutes and seconds)
  • 4° 30″ (Invalid: no minutes)
  • 4° -30′ (Invalid: only first component may be negative)

Issues

closed issue Will the name of this class (value) cause problems for parsers due to value excerpting?

  • Changed value to num

open issue! What about 5′ 10″ used to mean 5 foot, 10 inches?

  • Possible solution:
<abbr title="70 inch">5′ 10″</abbr>

unit: The Unit of Measurement

In the standard schema, the "unit" class is defined as an arbitrary string.

SI Units

Any unit may be used, but authors SHOULD attempt to use official SI units of measurement where appropriate.

Parsers that treat the unit as anything other than an opaque string SHOULD recognise the following case-sensitive list of units, derived from the SI list of base units and common derived units, with the addition of bits and bytes, which are commonly used on web pages. (Note that gram appears in this table instead of kilogram. This is deliberate.)

Unit | Symbols | Aliases
metre | m | meter
gram | g | gramme
second | s, sec |
ampere | A | amp
candela | cd |
mole | mol |
kelvin | K, K (U+212A) |
newton | N |
pascal | Pa |
joule | J |
watt | W |
coulomb | C |
volt | V |
ohm | Ω (U+03A9), Ω (U+2126) |
siemens | S |
farad | F |
weber | Wb |
henry | H |
tesla | T |
hertz | Hz |
byte | B |
bit | b |
litre | L, l, ℓ (U+2113) | liter
Celsius | ℃ (U+2103), °C (U+00B0 followed by capital C) |
radian | rad |
lumen | lx |
becquerel | Bq |
gray | Gy |
sievert | Sv |
katal | kat |
steradian | sr |

10^n | Prefix | Symbol
10^24 | yotta- | Y
10^21 | zetta- | Z
10^18 | exa- | E
10^15 | peta- | P
10^12 | tera- | T
10^9 | giga- | G
10^6 | mega- | M
10^3 | kilo- | k
10^2 | hecto- | h
10^1 | deca- | da
10^0 | (none) | (none)
10^−1 | deci- | d
10^−2 | centi- | c
10^−3 | milli- | m
10^−6 | micro- | µ (U+00B5), μ (U+03BC), u
10^−9 | nano- | n
10^−12 | pico- | p
10^−15 | femto- | f
10^−18 | atto- | a
10^−21 | zepto- | z
10^−24 | yocto- | y


The full names for SI prefixes SHOULD only be combined with the full names for the units (or their aliases). Likewise the symbols for SI prefixes SHOULD only be combined with the symbols for the units.

  • kilometre
  • milligramme
  • μL
  • microV (not recommended)
  • kgram (not recommended)
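One way a parser might split a prefixed unit into a power of ten and a base unit is sketched below in Python. The function name `split_unit` is hypothetical, the unit lists are abbreviated to a handful of entries from the table above, and the sketch takes the strict reading in which mixed forms such as "microV" and "kgram" are rejected rather than merely discouraged.

```python
# Prefix symbols and full prefix names mapped to their power of ten.
SYMBOL_PREFIXES = {
    "Y": 24, "Z": 21, "E": 18, "P": 15, "T": 12, "G": 9, "M": 6, "k": 3,
    "h": 2, "da": 1, "d": -1, "c": -2, "m": -3,
    "\u00b5": -6, "\u03bc": -6, "u": -6,
    "n": -9, "p": -12, "f": -15, "a": -18, "z": -21, "y": -24,
}
NAME_PREFIXES = {
    "yotta": 24, "zetta": 21, "exa": 18, "peta": 15, "tera": 12, "giga": 9,
    "mega": 6, "kilo": 3, "hecto": 2, "deca": 1, "deci": -1, "centi": -2,
    "milli": -3, "micro": -6, "nano": -9, "pico": -12, "femto": -15,
    "atto": -18, "zepto": -21, "yocto": -24,
}
# Abbreviated lists; a real parser would carry the full table.
UNIT_SYMBOLS = {"m", "g", "s", "A", "K", "N", "Pa", "J", "W", "V", "Hz",
                "B", "b", "L", "l", "cd", "mol"}
UNIT_NAMES = {"metre", "meter", "gram", "gramme", "second", "ampere", "amp",
              "kelvin", "watt", "volt", "litre", "liter", "byte", "bit"}


def split_unit(token):
    """Return (power_of_ten, base_unit), or None if the token is not recognised."""
    for prefixes, units in ((SYMBOL_PREFIXES, UNIT_SYMBOLS),
                            (NAME_PREFIXES, UNIT_NAMES)):
        # An exact unit match wins, so "cd" is candela and "m" is metre.
        if token in units:
            return 0, token
        for prefix, exponent in prefixes.items():
            if token.startswith(prefix) and token[len(prefix):] in units:
                return exponent, token[len(prefix):]
    return None
```

Checking for an exact unit before trying prefixes avoids reading "Pa" or "cd" as a prefix plus a leftover letter.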

Combining units

Units may be multiplied by separating them with white space, or divided using a slash (/) or U+2215 division slash (∕). Units may be raised to an integer power using a caret character. The Unicode superscript numerals 2 to 9 (U+00B2, U+00B3, U+2074-79) MUST be supported as aliases for raising to the appropriate integer powers. Multiplication binds more tightly than division, so "kg m / s" is read as (kg m) / s.

Examples:

  • <span class="unit">kg m / s</span>
  • <span class="unit">m/s^2</span>
  • <span class="unit">meter³</span>
  • <abbr class="unit" title="μm">micron</abbr>
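The combination rules can be sketched as a small Python function that reduces a unit string to a map of base units and integer exponents. The name `parse_compound_unit` is hypothetical, and the sketch assumes that everything after a slash belongs to the denominator, which is one reading of multiplication taking priority over division.

```python
import re

# Superscript numerals 2 to 9 and their ASCII equivalents.
_SUPERSCRIPTS = str.maketrans(
    "\u00b2\u00b3\u2074\u2075\u2076\u2077\u2078\u2079", "23456789"
)
# A single factor: a unit name, optionally followed by ^ and an integer power.
_FACTOR = re.compile(r"(?P<unit>[^\s\^0-9-]+)(?:\^(?P<power>-?[0-9]+))?")


def parse_compound_unit(text):
    """Return {unit: exponent} for a combined unit string, or None if malformed."""
    # Rewrite superscript digits as caret powers, e.g. "meter³" becomes "meter^3".
    text = re.sub(
        "[\u00b2\u00b3\u2074-\u2079]",
        lambda m: "^" + m.group().translate(_SUPERSCRIPTS),
        text,
    )
    exponents = {}
    for index, segment in enumerate(re.split("[/\u2215]", text)):
        # The first segment is the numerator; later segments are divisors.
        sign = 1 if index == 0 else -1
        for factor in segment.split():
            match = _FACTOR.fullmatch(factor)
            if match is None:
                return None
            power = int(match.group("power") or 1)
            unit = match.group("unit")
            exponents[unit] = exponents.get(unit, 0) + sign * power
    return exponents
```

For instance, `parse_compound_unit("m/s^2")` gives `{"m": 1, "s": -2}`.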

Angular units

Units MUST NOT be given for measurements expressed in the angular schema: the degree itself is the unit. If the standard schema is used, units may be given in radians (rad).

Other / Non-SI Units

Authors MAY specify units other than those defined above, but SHOULD NOT assume that parsers will be able to interpret them. Authors using other units MAY provide a rel=glossary link to a page or fragment that defines the units.

Explicitly Defining a Unit

hmeasure may be used with the <dfn> element to explicitly define a unit in terms of pre-defined units. The "title" attribute (if any) is taken to be an alias of the unit name.

<p class="hmeasure" id="dfn-inch">
  An <dfn class="item" title="in">inch</dfn> is defined as
  <span class="num">0.0254</span> <span class="unit">m</span>.
</p>

Other instances of hmeasure may then refer to this definition, implicitly:

<p class="hmeasure">
  The <span class="item">action figure</span> has a <span class="type">height</span> of
  <span class="num">5</span> <span class="unit">in</span>.
</p>

or explicitly:

<p class="hmeasure">
  The <span class="item">action figure</span> has a <span class="type">height</span> of
  <span class="num">5</span>
  <a class="unit" rel="glossary" href="#dfn-inch">in</a>.
</p>
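To illustrate how a parser might use such a definition, the Python sketch below keeps a registry of author-defined units and rewrites a measurement in terms of pre-defined units. The names `define_unit` and `to_predefined` are hypothetical, and extracting the definition from the <dfn> markup is assumed to have happened already.

```python
# Registry of author-defined units: name or alias -> (factor, underlying unit).
_definitions = {}


def define_unit(name, num, unit, alias=None):
    """Record that one `name` equals `num` of `unit`; the alias comes from the title attribute."""
    _definitions[name] = (num, unit)
    if alias is not None:
        _definitions[alias] = (num, unit)


def to_predefined(num, unit):
    """Follow definitions until a unit with no definition of its own is reached."""
    seen = set()
    while unit in _definitions and unit not in seen:
        seen.add(unit)  # guards against circular definitions
        factor, unit = _definitions[unit]
        num *= factor
    return num, unit


# The inch definition from the example above.
define_unit("inch", 0.0254, "m", alias="in")
```

With the inch defined this way, `to_predefined(5, "in")` returns a value of roughly 0.127 together with the unit "m".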

open issue! Fahrenheit is reasonably common in some parts of the world. As °C and °F do not share their zero points, it is impossible to use this pattern to define °F. °F thus remains an opaque string with no meaning assigned to it by this spec. Should we add it to the list of pre-defined units?

Currency Units

If the money schema is being used, the unit is not an arbitrary string. It MUST be a three-letter ISO 4217 code. The following aliases for the four largest reserve currencies (as of 2008) are allowed:

Unit | Aliases
EUR | €
GBP | £
JPY | ¥
USD | $

Other currencies MAY be displayed using these symbols only through the ABBR design pattern:

<span class="hmoney">
  <abbr class="unit" title="AUD">$</abbr><span class="num">5.00</span>
</span>

item: The Thing Being Measured

An hCard, hCalendar event or textual description of the item being measured may be supplied.

<p class="hmeasure">
  <span class="item vcard">The <span class="fn">Great Wall</span> of
  <span class="adr"><span class="country-name">China</span></span></span>
  is about <span class="num">6 700</span> <abbr class="unit" title="km">kilometres</abbr>
  <abbr title="length" class="type">long</abbr>.
</p>

If the item is not an hCard, hCalendar component or other recognised embedded microformat, then its contents are taken to be a string.

The item is optional.

The Item URI

If the item is not an embedded hCard or hCalendar event, and is an <a> element or other linking element, then parsers should parse the URI and the node contents. The item URI is considered a significant way of determining what entity the hmeasure is describing. For example:

  • If the item URI matches the UID for a known contact (e.g. an hCard somewhere on the page, or another page being parsed) then the hmeasure is taken to describe this contact (i.e. person, organisation, etc).
  • A similar meaning can be implied when the item URI matches the UID for a known hCalendar event.

For example:

<div class="vcard">
  <a class="fn url uid" href="http://alice.example.net">Alice Jones</a>,
  <span class="adr">
    <span class="locality">Sydney</span>,
    <span class="country-name">Australia</span>.
  </span>
</div>
... further down the page ...
<span class="hmeasure">
  <a class="item" href="http://alice.example.net">Alice's</a>
  <span class="type">height</span> is
  <span class="num">180</span> <span class="unit">cm</span>
</span>

type: The Dimension

The type specifies the dimension being measured. A measurement in, say, metres may be ambiguous because it could refer to a depth, a height, a length or a width. The optional type parameter allows you to specify a human-readable dimension.

tolerance: The Error Tolerance

An optional tolerance may be specified as a percentage or as a nested hmeasure/hmoney.

Examples:

<span class="hmeasure">
  <span class="type">Height</span>:
  <span class="num">5</span> <span class="unit">m</span>
  ± <span class="tolerance">2%</span>
</span>
<span class="hmoney">
  <span class="unit">$</span><span class="num">5.00</span>
  ± <span class="tolerance hmoney"><span class="unit">$</span><span class="num">1.00</span></span>
</span>

When no tolerance is provided, a default tolerance of 0% MUST NOT be assumed — the tolerance is simply unknown.

Minimisation Techniques

hmeasure

If no num is given, then the first number conforming to the EBNF above is taken to be the numerical value of the measurement. If no unit is given, then the entire string within the "hmeasure" (less the numerical value, item, type and tolerance) is taken to be the unit.

For example:

<span class="hmeasure">3 pints <span class="item">beer</span></span>
  • Num: 3
  • Unit: "pints"
  • Item: "beer"
<span class="hmeasure">4 m</span>
  • Num: 4
  • Unit: metre
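A minimal Python sketch of this minimisation, assuming the item, type and tolerance have already been removed from the string. The name `implied_measure` is hypothetical, and the regular expression is a compact but equivalent form of the EBNF number production.

```python
import re

_INT = r"(?:0|-?[1-9][0-9]*)"
# Integer, optional decimal part, optional exponent: equivalent to the EBNF "number".
_NUMBER = re.compile(rf"{_INT}(?:[.,][0-9]*)?(?:[eE]{_INT})?")


def implied_measure(text):
    """Return (num, unit) implied by a minimised hmeasure string, or None."""
    match = _NUMBER.search(text)
    if match is None:
        return None
    num = float(match.group().replace(",", "."))
    # Everything except the number is taken to be the unit.
    remainder = text[:match.start()] + " " + text[match.end():]
    unit = " ".join(remainder.split())
    return num, unit
```

Because the number is located with a search rather than by splitting on white space, a string such as "79.3kg" also yields a number and a unit, which is one possible answer to the white space question raised below.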

open issue! What about cases where there is no white space? SI says white space should always separate the quantity and unit, but in practice, many people do not include white space in measures.

closed issue When no unit is explicitly given, how do we know which of the following two behaviors to take: assume unit minimisation and follow the procedures here, or assume the angular schema and treat the number as a degree/minute/second value?

  • Changed root element class for angular schema to hangle

hmoney

If no num is given, then the first number conforming to the EBNF above is taken to be the numerical value. If no unit is given, the first three-letter word (or single character alias) is taken to be the unit. White space between the implied unit and implied number is considered optional. The following are all equivalent:

<span class="hmoney"><span class="unit">EUR</span> <span class="num">1,00</span></span>
<span class="hmoney">EUR <span class="num">1,00</span></span>
<span class="hmoney">EUR1,00</span>
<span class="hmoney">1,00 EUR</span>
<span class="hmoney">1.00 <abbr class="unit" title="EUR">euro</abbr></span>
<span class="hmoney">€1,00</span>
<abbr class="hmoney" title="EUR 1,00">a euro</abbr>
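The text-only forms in the list above can be checked with a short Python sketch. The name `implied_money` is hypothetical, the function works on the text content after any abbr titles have been resolved, three-letter codes are assumed to be upper case, and € is assumed to be the alias for EUR.

```python
import re

_INT = r"(?:0|-?[1-9][0-9]*)"
_NUMBER = re.compile(rf"{_INT}(?:[.,][0-9]*)?(?:[eE]{_INT})?")
# Single-character aliases for the four reserve currencies.
_ALIASES = {"\u20ac": "EUR", "\u00a3": "GBP", "\u00a5": "JPY", "$": "USD"}
_UNIT = re.compile(r"\b[A-Z]{3}\b|[\u20ac\u00a3\u00a5$]")


def implied_money(text):
    """Return (unit, num) implied by a minimised hmoney string, or None."""
    number = _NUMBER.search(text)
    if number is None:
        return None
    # Remove the number so that "EUR1,00" leaves a clean "EUR" behind.
    rest = text[:number.start()] + " " + text[number.end():]
    unit = _UNIT.search(rest)
    if unit is None:
        return None
    code = _ALIASES.get(unit.group(), unit.group())
    return code, float(number.group().replace(",", "."))
```

Under these assumptions "EUR 1,00", "EUR1,00", "1,00 EUR" and "€1,00" all produce the same pair.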

Minimising Tolerance

If the tolerance is not a percentage (i.e. it is a nested hmeasure/hmoney) and it does not contain a unit (either explicit, or by minimisation rules), then the unit is taken to be the unit of the parent hmeasure/hmoney.

If no explicit tolerance is given, the hmeasure string should be examined for an occurrence of the substring "±". If this is present, the substring after it, continuing to the end of the hmeasure string, is taken to be a tolerance. If the tolerance contains a "%" character, the tolerance is taken to be a percentage. Otherwise it is taken to be an implicit nested hmeasure/hmoney.
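The "±" rule might be sketched in Python as follows. The name `split_tolerance` and the tuple shapes it returns are assumptions for illustration; parsing the nested measure itself, and inheriting the parent's unit, is left to the caller.

```python
def split_tolerance(text):
    """Split a minimised measure string into (measure_text, tolerance).

    tolerance is None when no "±" is present, ("percent", value) for a
    percentage, or ("nested", text) for an implicit nested hmeasure/hmoney.
    """
    main, sign, tail = text.partition("\u00b1")
    if not sign:
        # No tolerance given: it is unknown, not zero.
        return text.strip(), None
    tail = tail.strip()
    if "%" in tail:
        percent = float(tail.replace("%", "").strip().replace(",", "."))
        return main.strip(), ("percent", percent)
    return main.strip(), ("nested", tail)
```

For example, `split_tolerance("5 m ± 2%")` returns `("5 m", ("percent", 2.0))`.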

Implied Item

If no item is present, then the item MAY be inferred from nesting. If the hmeasure (or hangle, hmoney) is nested within an hCard or hCalendar event, then the implied item is the person, organisation or place represented by the hCard, or the event represented in hCalendar.

Future versions of this specification may add other implied item minimisation techniques.

Worked example

The following example shows a series of expansions taken by a parser encountering a minimised hmoney:

<span class="hmoney">$1.54 ± 0.01</span>

The "±" sign introduces a tolerance, which does not include a "%" symbol, so is treated as a nested hmoney.

<span class="hmoney">$1.54 ±<span class="hmoney tolerance">0.01</span></span>

No explicit units or values are given in either hmoney, so units and numerical values are extracted as per hmoney minimisation:

<span class="hmoney"><span class="unit">$</span><span class="num">1.54</span>
±<span class="hmoney tolerance"><span class="num">0.01</span></span></span>

The nested hmoney contains no unit, so it inherits its unit from the parent hmoney:

<span class="hmoney"><span class="unit">$</span><span class="num">1.54</span>
±<span class="hmoney tolerance"><span class="unit">$</span> <span class="num">0.01</span></span></span>

Parsed values:

  • Unit: USD
  • Num: 1.54
  • Tolerance:
    • Unit: USD
    • Num: 0.01

Examples

An example weather forecast using hmeasure, adr, geo and hCalendar with the include pattern:

<div>
    Weather for
    <span id="loc-lewes">
        <span class="adr location">
            <span class="locality">Lewes</span>,
            <span class="region">East Sussex</span>
        </span>
        (<span class="geo">50.8730;0.005</span>)
    </span>,
    <span class="vevent item" id="day-20080325">
        <a class="include" href="#loc-lewes"></a>
        <span class="summary">Tuesday</span>
        <abbr class="dtstart" title="2008-03-25">25 March</abbr>
        <abbr class="dtend" title="2008-03-26"></abbr>
    </span>:
    <span class="hmeasure">
        <a class="include" href="#day-20080325"></a>
        <abbr title="Maximum temperature" class="type">High</abbr>
        8 ℃
    </span>,
    <span class="hmeasure">
        <a class="include" href="#day-20080325"></a>
        <abbr title="Minimum temperature" class="type">Low</abbr>
        0 ℃
    </span>
</div>

(The above example does not necessarily represent best practice. Authors should make themselves aware of the accessibility issues currently being discussed around the include and abbr design patterns.)

Parsing Hints

This section is informative.

num parsing

This Perl code shows how a number can be parsed according to the EBNF production in this spec. Its author (Toby Inkster) releases the following code into the public domain:

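#!/usr/bin/perl
use strict;
use warnings;

# Each pattern mirrors one production of the EBNF above.
my $nonZeroDigit = qr/[1-9]/;
my $digit        = qr/[0-9]/;
my $natural      = qr/$nonZeroDigit$digit*/;
my $integer      = qr/(?:0|-?$natural)/;
my $decimal      = qr/(?:$integer)[.,]$digit*/;
my $mantissa     = qr/(?:$decimal|$integer)/;
my $sciNumber    = qr/(?:$mantissa)[Ee]$integer/;
my $number       = qr/($sciNumber|$decimal|$integer)/;

print "/$number/\n";
while (my $line = <>)
{
	# Strip all white space before matching, as the spec requires.
	$line =~ s/\s+//g;
	# Only report a number when the line actually contains one.
	if ($line =~ $number)
	{
		print "Number found: $1\n";
	}
}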

unit parsing

Parsers should note that (with the exception of certain non-ASCII characters, which can be converted manually first) all the pre-defined non-currency units can be understood by the GNU units program. A parser could act as a wrapper to a GNU units installation, or make use of a GNU units-based web service to convert between units.

Guillaume Lebleu

Basic example with elementary unit using the abbr pattern and the UNECE code (see measure-formats)

<span class="length">5 <abbr class="unit" title="FOT">Feet</abbr></span>

Optional "value" could be useful in some cases, for instance when the value is provided in plain text:

<span class="length"><abbr class="value" title="5">Five</abbr> <abbr class="unit" title="FOT">Feet</abbr></span>


Andy Mabbett

Converter Extension

This Firefox extension may be of interest. Note, though, that it's been criticised for having a "nag" screen: Converter AndyMabbett 15:32, 3 Oct 2006 (PDT)

This is the author of that extension. I don't want to go much into this, but I just want to clarify this briefly. The part with the nag screen is wrong on two counts: (1) that dialog isn't there anymore, and (2) even if it was there, you only needed to read a paragraph and click a button to make it go away forever -- but you don't have to take my word for it, install it for yourselves and see. Andy's report is accurate however -- the extension was criticized for that dialog (that's what you get from your free extension's users when you ask for 15 seconds of their time in return for hundreds of hours of your time). --BogdanStancescu 09:35, 9 Oct 2006 (PDT)

Wikipedia converter

Wikipedia's Convert Template automatically converts from metric to imperial and vice versa. It's worth noting the measurements it supports.

Google calculator

A Google search, e.g. for "0.6 miles" returns a metric conversion. See also Google calculator help.

HTML Entities

  • For squared and cubic values, the HTML entities &sup2; and &sup3; should be borne in mind.
  • For temperatures and angles, the HTML entity &deg; exists.
  • The following currency entities exist:
    • ¤ - &curren; - currency
    • ¢ - &cent; - cent
    • £ - &pound; - pound
    • ¥ - &yen; - yen
    • € - &euro; - Euro

Bogdan Stăncescu

Here are my findings related to automatic parsing of measurements on web pages while developing the Converter extension. Please ask away if you want me to go into more detail on any of the topics -- I'm not sure which of my experiences are relevant to microformats, so I'm going to give you an overview of my conclusions.

By way of introduction, the Converter is a Firefox extension which tries to convert all measurements it finds in any web page to their Imperial or metric counterpart (e.g. Fahrenheit to Celsius, and Celsius to Fahrenheit; meters to feet and feet to meters). There are two steps to the conversion process: (1) identifying the measurements in the page, and (2) converting them. As expected, the conversion part is trivial, at least conceptually. The parsing is the tricky bit, and that's where the Converter's challenges become relevant for microformats.

Here are the main challenges I have encountered while writing the Converter:

Presentation standardization
The first, biggest and most obvious challenge is lack of almost any de facto standardization in respect to data presentation. What I mean is that although the units themselves are more or less standardized (more on that later), they are presented in various ways within web pages. Take these examples: "50 foot monster", "50 ft monster", "50 feet monster", "50-foot monster", "50-feet monster" -- and my personal favorite, "fifty-foot monster" (more on this later);
Note that a microformat, in particular one using the abbr-design-pattern, would make each of these examples less ambiguous, if not unambiguous. See below --Guillaume_Lebleu:
<span class="height"><span class="value">50</span><abbr class="unit" title="FOT">foot</abbr></span> monster
<span class="height"><span class="value">50</span><abbr class="unit" title="FOT">ft</abbr></span> monster
<span class="height"><span class="value">50</span>-<abbr class="unit" title="FOT">foot</abbr></span> monster
<span class="height"><span class="value">50</span><abbr class="unit" title="FOT">feet</abbr></span> monster
<span class="height"><abbr class="value" title="50">fifty</abbr><abbr class="unit" title="FOT">foot</abbr></span> monster
Of course; as far as I could gather, that's actually the purpose of microformats -- bridging the gap between what humans and machines can understand, no? --BogdanStancescu 00:30, 11 Oct 2006 (PDT)
Unit standardization
I live in Europe, where I've always used the metric system. As such, this probably was a much bigger nasty surprise for me than it is for a user of the Imperial/U.S. Customary system: in the Imperial system, the units themselves vary depending on where you are -- miles, pints, and a whole lot of other units come in many different flavors, but they're all written the same in regular usage;
Language
"1 meter" vs. "1 metre" is a reasonable difference -- but non-SI units are usually translated. Even some SI units have different plurals, depending on the language, although in theory SI units are actually denoted by symbols, not "words", so as to make them non-translatable, and truly international (hence the name of the SI). I haven't really given much thought to a solution towards parsing these, because I find it overwhelming for the time being.
The sheer number of units
Surprisingly, most people don't realize just how many units we humans have invented. Just take a look here: asknumbers.com -- see how many categories there are? Now click on Flow Rate -- a non-ubiquitous type of measurement. Three sub-categories for flow rates alone! Now click on Volume Flow Rate and take a look at the number of units in those lists. Remember, those are just in one of the three categories for flow rate! The UNECE standard mentioned in the measure formats page is useful to define just that -- a standard set of units. But in practice there are a lot more being used out there.
Do you have examples from the Web (a URL) of non-UNECE units? One possibility would be to provide the ability for a unit to be defined as a division of products of other units. This is consistent with the measure-formats#Systeme_International, which defines 7 base units and all other units as derived units (of course some units, even though they are derived, are much more easily represented as simple ones). This is what XBRL has done for financial/accounting/reporting. See currency-formats#XBRL and the theoretical example (ampere acre per second) below --Guillaume_Lebleu:
Unfortunately I have almost no URLs with measurements, although I've been in the "business" for a while. The reason is that I collect URLs of pages I encounter which are not properly parsed by the Converter, and when I release a version which understands them, I delete the URLs. Also, I never intended to cover all units in the Converter myself, for a multitude of reasons -- therefore I was never interested in the more exotic ones.

Guillaume Lebleu's example

<span class="unit">
<abbr class="unit" title="AMP">Ampere</abbr> <abbr class="unit" title="ACR">acre</abbr> <span class="divide">per</span> <abbr class="unit" title="SEC">second</abbr>
</span>
Regarding your idea of breaking down the units in base units, that's something I've also been toying with in my head for the Converter. For my particular application, it's technically more difficult to implement this breakdown. For microformats, it would be easier, but there still remains at least one potential problem: you end up with a huge mess in the page. If a standard is too complicated to follow, one tends to give up altogether.
Consider a document which actually discusses some sort of current variation per farm, and therefore needs to repeatedly refer to ampere acres per second. For human use, they'd simply define the AAS somewhere at the top of the document, and then refer to AAS, KAAS or MAAS as needed. Maybe a similar approach should be considered for microformats as well:
We define the 
<span class="unit_definition">
  <span class="unit_name">AAS</span>
  as
  <abbr class="unit" title="AMP">Ampere</abbr>
  <abbr class="unit" title="ACR">acre</abbr>
  <span class="divide">per</span>
  <abbr class="unit" title="SEC">second</abbr>
</span>.
And then use the "AAS" throughout the document as any other pre-defined unit. How would you define (and use) the KAAS (1000 AAS) or MAAS (1,000,000 AAS) though? Is there any standard way already to use data multipliers in microformats? Or should we discuss that? Or is it out of scope? --BogdanStancescu 00:30, 11 Oct 2006 (PDT)
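One way to picture the define-once idea, including the KAAS and MAAS multiples, is the following Python sketch. Nothing here is an agreed format: the `DerivedUnit` class, its `scaled` method and the `convert` function are hypothetical names, and the unit codes are simply the ones used in the example above.

```python
from dataclasses import dataclass
from fractions import Fraction


@dataclass(frozen=True)
class DerivedUnit:
    name: str
    numerator: tuple    # unit codes that are multiplied together
    denominator: tuple  # unit codes that are divided by
    factor: Fraction = Fraction(1)  # multiple of the plain definition

    def scaled(self, name, factor):
        """Return a multiple of this unit under a new name, e.g. KAAS = 1000 AAS."""
        return DerivedUnit(name, self.numerator, self.denominator, self.factor * factor)


# "AAS" defined once, then reused like any pre-defined unit
AAS = DerivedUnit("AAS", ("AMP", "ACR"), ("SEC",))
KAAS = AAS.scaled("KAAS", 1000)
MAAS = AAS.scaled("MAAS", 1_000_000)


def convert(amount, source, target):
    """Convert between units built from the same breakdown into other units."""
    if (source.numerator, source.denominator) != (target.numerator, target.denominator):
        raise ValueError("units are not built from the same units")
    return amount * source.factor / target.factor
```

Under these assumptions `convert(2, MAAS, KAAS)` gives 2000, so a multiplier only needs to be stated once, at the point where the multiple is named.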

That's all I can think of as major hurdles right now. If I remember anything else, I'll post here. Please do give me feedback here if you want to ask more about any of the topics I touched above, or if you have other questions I might be able to reply to. --BogdanStancescu 12:08, 9 Oct 2006 (PDT)


Discoleo

Measurement Classification

Because it is easier to provide examples, I will first list examples.

Categorical vs Ordinal Data

Various measurements may produce NON-Numerical values:

  • a pain scale: most severe, very severe, severe, ...
  • or the TNM tumour classification system: T0, Tx, T1, T2, T3, T4, N0, ...


There is even a more fundamental issue related to numbers themselves, e.g.:

  • Lists or years are sometimes written using Roman numerals
    • however, the strings corresponding to Roman numerals, when sorted alphabetically, do NOT retain the correct order
    • i.e. C (100) precedes L (50), which precedes X (10)
  • there are other numbering schemes
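The sorting problem with Roman numerals can be shown with a short Python sketch. The function name `roman_to_int` is only illustrative, and the sketch assumes well-formed numerals in the usual subtractive notation.

```python
ROMAN = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}


def roman_to_int(numeral):
    """Convert a well-formed Roman numeral to an integer."""
    values = [ROMAN[ch] for ch in numeral.upper()]
    total = 0
    for i, v in enumerate(values):
        # subtractive notation: a smaller value before a larger one is subtracted
        if i + 1 < len(values) and v < values[i + 1]:
            total -= v
        else:
            total += v
    return total


numerals = ["X", "C", "L", "IX", "XIV"]
alphabetical = sorted(numerals)               # string order
numeric = sorted(numerals, key=roman_to_int)  # order by the value they denote
```

Sorting the strings puts C before L before X, while sorting by the converted value restores the intended order.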

A Single Value / Data Point

This is the simplest data format and pretty straightforward to implement.

  • the distance between 2 cities is 40 km
  • the velocity is 62 mph
  • most other simple entries (...)

An Interval Measurement

  • time: the shop is open between 6am - 6pm on every day of the week, except Saturdays from 9am - 4pm and Sundays from 9am - 1pm

This is more about an interval measurement. Every variable can have 2 (or more) values, e.g.:

  • the levels of rainfall were between 25mm - 35mm
  • the maximum velocity of various cars was 220 - 250 km/h

Should these values be stored as separate values? [e.g. low / high] Or should the microformats be able to store an interval?

See also the examples for statistical summaries below.

  • Mark up each as a separate measurement, and wrap them in a "range" microformat? Andy Mabbett 11:36, 22 Nov 2006 (PST)

Matrices

  • the GPS coordinates are 12°14' N and 25°55' E
  • the dimension of the box is 3m x 2m x 0.55m
    • this is three separate, single measurements, surely? Andy Mabbett 09:21, 22 Nov 2006 (PST)
    • 3 x 2 x 0.55 cubic meter, still 3 measurements, BUT given as cubic meter => ONE measurement?
      • Who writes 3x2x0.55 cubic meter? You'd write "3.3m3" Andy Mabbett 11:36, 22 Nov 2006 (PST)
    • the surface was 2 x 3 square feet ???
      • Who writes 2x3 sq ft? You'd write "2ftx3ft" or "6ft2" Andy Mabbett 11:36, 22 Nov 2006 (PST)


  • IF we write "3.3m3" or "6ft2", we lose information
  • IF I want a surface, I would prefer the square feet unit, and NOT ...feet x ...feet
  • writing markup for every measure will bloat the code considerably
    • data matrices would be very effective here
      • how would you make such a matrix? There are different ways in which such information can be "compounded" (length per time = speed, length * length = area). Maybe we can group those measurements by surrounding information that gives the context. --Emil 02:50, 25 Dec 2006 (PST)

Statistical Measurements

Often, a group of data is summarized using a statistic:

  • the mean length was 1.3m (SD 0.12m, group size 22)
  • the median age was 42 years (interquartile range 95% 18 - 97)

Measurement Scales

Accuracy vs. Precision

QUESTIONS

  • How detailed should a measurement be stored?
    • Microformats aren't for storing measurements; they're for "labelling" the measurements that are already present. Andy Mabbett 09:23, 22 Nov 2006 (PST)
  • If Accuracy and precision are relevant to the measurement, how do we store these?

Standardization of Measurement

  • sometimes we may need to store the calibration information / calibration curves
  • we may need to store the reference point the measurement is based on
  • we may need to store the normal values
    • biomedical measurements are often laboratory dependent, so it does NOT make sense to have the measurement without the corresponding normal values
    • e.g. anti-Hepatitis B surface antigen antibody (anti-HBs) titer: 32 mIU/ml
      • normal: 0 (non-infected, non-past infection, non-immunity)
      • protective immunity: >10 mIU/ml
      • interpretation is however more complex, depending on other tests as well

Emil Thies

From my understanding, this microformat should concentrate on the notation of a measurement. So either some aspects will have to be covered elsewhere to improve its automatic use, or this microformat uses only some base information (units / dimensions) and derives everything else from those base / built-in ones.

Dimension vs. Unit vs. Scale vs. Measurement

A measurement is the combination of a number (value) and a unit (kind).

  • 3km (3 kilometres = 3,000 metres)

A unit is a view of a measure of a dimension. There are two ways in which units can differ from each other:

  • Units Differ by Scale (Prefix)
    • 3km is the same as 3,000 metres or 300,000 cm (it's the same unit with a different prefix, which works like a factor for the value, to reduce the number of symbols / digits. The scale should be an element of its own, and we can make use of the standard prefixes as defined in The Unified Code for Units of Measure or MathML).
  • Different units of the same dimension can be converted into each other.
    • Metre is a unit of the dimension length.
    • Foot is a unit of the dimension length.

A Dimension is a base-dimension (see SI-System) or a compound dimension.

  • length is a base dimension
  • time is a base dimension
  • speed is a compound dimension (length per time). Therefore a measurement of speed has one number and two units combined by a math expression, which together form their own unit, e.g. 10 m/s (10 metres per second).

If we express a measurement in a microformat by its unit, the dimension is provided indirectly by the unit. But a microformat which uses a measurement as one of its parts needs to define the dimension, so that the choice of unit is left to the user. So we could have a general measurement element which allows all kinds of units. As derived formats, we can have sub-formats which limit the list of units (or define an alternate list) by only allowing specific dimension(s).

E.G.

  • currency-proposal, with the money element, which uses the same elements: value (should then replace amount), scale (should be introduced), and unit (should replace currency), which is limited to the ISO 4217 list.
  • length, which only allows units which measure the dimension length, like FOT, MTR ...

Identification of Units

There are so many units around, and not only those currently in use. There are deprecated ones, such as those of the Roman Empire. For example, "Foot" is not a unique identification of a unit. There are not only the British and U.S. feet; there are, for example, some old German ones from before those areas joined the international Metre Convention in 1875:

  • 25 cm in Hessen
  • 28,935 cm in Bremen
  • 29,641 cm in Oldenburg
  • 29,1859 cm in Bayern
  • 30,385 cm in Meiningen-Hildburghausen
  • 31,385 cm in Preußen
  • 31,608 cm in Wien/Österreich
  • 32,61 cm in Bad Homburg vor der Höhe
  • 33 1/3 cm in der Pfalz

So there is a need for a unique identification of those units. I have found two approaches so far:

In MathML

MathML defines the construction of a URI like:

http://base/units/unit name[/context][/country][#prefix]

http://.../units/foot/de

But as you can see, there is currently no way to distinguish the different German feet by region within Germany. Furthermore, the context is so variable that the same unit can be described by different URLs.

In OpenMath

OpenMath defines the units inside of content directories:

http://www.openmath.org/cd/units_us1.xhtml#foot_us_survey

So there is a unique URL for each unit, but not every unit is covered.

Transformation of Units

A real benefit is the automatic transformation of a unit, so that the writer can write the measurement in his own context (e.g. in U.S. feet, or, in a quote from an ancient text, in Roman Empire feet) and the reader can get a transformation into his context (e.g. the value in metres). Therefore additional transformation information is needed. And there are several different kinds of transformation:

units of same dimension

e.g. foot to metre
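A rough Python sketch of such a same-dimension transformation, once each "foot" has its own identifier. The identifier strings and the name `feet_to_metres` are made up for illustration. The Bremen and Preußen lengths are taken from the list earlier in this section, and the international foot is 30.48 cm.

```python
from decimal import Decimal

# Length of one "foot" in centimetres, keyed by a hypothetical identifier.
FOOT_CM = {
    "foot/de/bremen": Decimal("28.935"),
    "foot/de/preussen": Decimal("31.385"),
    "foot/international": Decimal("30.48"),
}


def feet_to_metres(value, unit_id):
    """Convert a number of feet of the identified kind to metres."""
    return Decimal(value) * FOOT_CM[unit_id] / Decimal(100)
```

The point of the sketch is that the conversion is trivial once the unit is unambiguous; without the identifier, "10 foot" could mean anything between roughly 2.5 m and 3.3 m.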

units of compound but same dimension

e.g. metre/s and mach-number

compound measurement context

This switch works up to 5 Ampere by 220 Volt

The reader might want to know which wattage of device he can attach (1100 watts would be the answer).

The dimension of the box is 3m x 2m x 0.55m

There might be questions like:

  • volume (3.3 m³)
  • surface (17.5 m²)
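The derived answers in the two examples above follow from simple arithmetic, sketched here in Python. The function names are illustrative only, and the sketch assumes the individual measurements have already been extracted and share compatible units (amperes and volts, or metres throughout).

```python
from decimal import Decimal


def power_watts(current_amperes, voltage_volts):
    """Power = current x voltage."""
    return Decimal(current_amperes) * Decimal(voltage_volts)


def box_volume(width, height, depth):
    """Volume of a box: product of the three edge lengths."""
    return Decimal(width) * Decimal(height) * Decimal(depth)


def box_surface(width, height, depth):
    """Surface of a box: twice the sum of the three distinct face areas."""
    w, h, d = Decimal(width), Decimal(height), Decimal(depth)
    return 2 * (w * h + w * d + h * d)
```

This only works if the three lengths stay available as separate measurements, which is the information that is lost when only "3.3 m³" is published.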

Approach

A general measurement should make use of the following information:

value: a number which represents the amount of the measurement. The number should follow one of the following representations:

  • integer (positive and negative): e.g. -1, 0, 1
  • decimal fraction (positive and negative): e.g. -2.5, 0.123
  • natural fraction (positive and negative): e.g. -2/3, 3/7
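A small Python sketch of a value parser restricted to the three representations listed above. The name `parse_value` and the exact regular expression are assumptions, since the proposal does not spell out a grammar. Exact fractions are used so that "2/3" is not rounded.

```python
import re
from fractions import Fraction

# whole number, decimal fraction, or fraction of two whole numbers, optionally signed
_VALUE = re.compile(r"^[+-]?(\d+|\d+\.\d+|\d+/\d+)$")


def parse_value(text):
    """Parse a measurement value into an exact Fraction."""
    text = text.strip()
    if not _VALUE.match(text):
        raise ValueError(f"not a supported value: {text!r}")
    return Fraction(text)
```

Anything outside the three forms, such as exponent notation, is rejected by this sketch instead of being guessed at.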

scale: a factor used to reduce the number of digits needed in the value. The scale should be either

unit: the unit used for the measurement. The unit should follow one of the following representations:

  • built-in short-form like defined on Standards for Trade and Electronic Business (or any other defined list which will be defined as the standard list for this format)
  • a reference to a unit definition. (I think there is the need of a markup/language to define new units and/or the transformation between units).


<span class="measurement"><abbr class="value" title="5">Five</abbr> <abbr class="scale" title="k">kilo</abbr> <abbr class="unit" title="MTR">metre</abbr></span>

when we have a defined sub-measurement format for length, it could also be written:

<span class="length"><abbr class="value" title="5">Five</abbr> <abbr class="scale" title="k">kilo</abbr> <abbr class="unit" title="MTR">metre</abbr></span>

List of possible Sub-Formats

Here is a (first) list of possible keywords for sub-formats and their unit list or compound kind:

  • money - unit limited to the ISO 4217 list (or could be a separate currency microformat)
  • length - unit limited to e.g. MTR (Metre), FOT (Foot) ...
    • area
      • Either a measurement with units like MTK (Square Metre), FTK (Square Foot)
      • or a compound format with elements (width:length, height:length)
    • volume
      • Either a measurement with units like MTQ (Cubic Metre), FTQ (Cubic Foot), LTR (Litre) ...
      • or a compound format with elements (width:length, height:length, depth:length)
  • time or duration or period - unit limited to e.g. sec (second), min (minute) ...
  • frequency - unit limited to Hertz
  • mass or weight - unit limited to GRM (Gram), ...
  • power or electricity - unit limited to AMP (Ampere), OHM (Ohm), ...


Straw man

Based on Taylor Cowan's currency suggestion, and subsequent mailing list discussion, the following straw man (rendering the above sub-formats unnecessary) is proposed:

        <span class="hmeasure">
          [value]
        </span>

        <abbr class="hmeasure" title="[value]">
          [text]
        </abbr>

Where "value" is a number-type pair ("3Kg", "456g") using SI or other standard unit-codes and where parsers must accept the formats:

  • [unit-code][number]
  • [unit-code][space][number]
  • [number][unit-code]
  • [number][space][unit-code]

and where the acceptable codes are to be determined.
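To make the four accepted orderings concrete, here is a Python sketch of a parser for the straw man's value string. Since the acceptable codes are still to be determined, the sketch simply assumes a code is a run of ASCII letters and a number is an optionally signed decimal; `parse_hmeasure_value` is a hypothetical name.

```python
import re

_NUMBER = r"[+-]?\d+(?:\.\d+)?"
_CODE = r"[A-Za-z]+"  # assumption: unit codes are plain letters

_NUMBER_FIRST = re.compile(rf"^({_NUMBER}) ?({_CODE})$")
_CODE_FIRST = re.compile(rf"^({_CODE}) ?({_NUMBER})$")


def parse_hmeasure_value(value):
    """Return (number, unit code) for any of the four accepted orderings."""
    value = value.strip()
    match = _NUMBER_FIRST.match(value)
    if match:
        return float(match.group(1)), match.group(2)
    match = _CODE_FIRST.match(value)
    if match:
        return float(match.group(2)), match.group(1)
    raise ValueError(f"not a number/unit-code pair: {value!r}")
```

For example, "3Kg", "456 g", "USD 9.90" and "g456" would all be accepted, each yielding the number and the code regardless of their order.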

Further comment is invited. A test page is available, at http://www.westmidlandbirdclub.com/test/measure.htm

Notes

  • This is extensible, using agreed new codes for unusual or archaic measurements (say "FUR" for "furlong"); such codes could be contained in the microformat's profile.
  • Otherwise, it works as-is for sub-divisions of currencies:
        <abbr class="hmeasure" title="635mm">
          2' 1"
        </abbr>
(2' 1" is "two feet one inch" in imperial measurement).

Issues

        <span class="hmeasure">
          The <span class="unit-code">kg</span> weight was, in total <span class="value">5</span>.
        </span>
  • If so, where would this be used? And are "unit-code" and "value" appropriate class-names?
  • Measurement errors are fundamental in many technical and scientific fields; they must be supported. LucaPost

       <span class="hmeasure">
           <a href="/depth" rel="tag" class="data-name">Depth</a>:
             ( <span class="data-value">2.17</span> +/-
                  <span class="data-error"> 0.02</span> )
                  x 10<sup class="exp">3</sup>
                  <abbr class="unit-measure" title="m">meters</abbr>.
        </span>

  1. Here the actual physical quantity is better 'defined' with rel-tag, and the optional data-error is clearly identified with its own span; alternatively parsers might identify the data-error part by looking for the '±' html-entity.
  2. The standard scientific notation requires the data and the error values to be rounded to the same number of digits; the exponential notation in powers of ten is useful to have a singular format for values of any order of magnitude.
  3. data-error and exp are not needed outside scientific contexts, thus they would be optional; the above HTML still represents a semantic structure when they're left out.
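As an illustration of the three notes above, the following Python sketch combines the extracted data-value, data-error and exp parts into one quantity. The `Measurement` class and `build_measurement` function are hypothetical, and the same-digits check is just one possible reading of note 2.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import Optional


@dataclass
class Measurement:
    value: Decimal
    error: Optional[Decimal]  # None when no data-error was given
    unit: str


def build_measurement(value, unit, error=None, exp=0):
    """Combine data-value, optional data-error and optional exp (power of ten)."""
    mantissa = Decimal(value.strip())
    scale = Decimal(10) ** int(exp)
    if error is None:
        return Measurement(mantissa * scale, None, unit)
    err = Decimal(error.strip())
    # note 2: value and error are expected to be rounded to the same digit
    if mantissa.as_tuple().exponent != err.as_tuple().exponent:
        raise ValueError("value and error must be rounded to the same digit")
    return Measurement(mantissa * scale, err * scale, unit)
```

For the depth example this gives 2170 m with an error of 20 m, and leaving out both optional parts still yields a plain value and unit.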

Suggested amendment 1

  • Use only:
        <abbr class="hmeasure" title="[value]">
          [text]
        </abbr>

Where "value" is a number-type pair ("3 kg", "456 g") using SI or other standard unit-codes where the parser must accept the following formats:

Notes

  • The only values allowed are SI values and prefixes
        <abbr class="hmeasure" title="635 mm">
          2' 1"
        </abbr>

        <abbr class="hmeasure" title="635 km/s">
          635 kilometers per second
        </abbr>

        <abbr class="hmeasure" title="0.5 m^3/s^2">
          half a cubic metre per second squared
        </abbr>
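A Python sketch of how the title values in these examples might be broken down, using the solidus and caret described under "Supported SI Markup". The grammar is an assumption: one number, one space, then unit terms separated by "/", where everything after the first solidus counts as a divisor. The function names are illustrative.

```python
import re

_TERM = re.compile(r"^([A-Za-z]+)(?:\^(-?\d+))?$")


def parse_unit_expression(expr):
    """Map each unit code to its exponent, e.g. 'm^3/s^2' -> {'m': 3, 's': -2}."""
    exponents = {}
    for index, term in enumerate(expr.split("/")):
        match = _TERM.match(term.strip())
        if not match:
            raise ValueError(f"cannot parse unit term: {term!r}")
        power = int(match.group(2) or 1)
        sign = 1 if index == 0 else -1  # terms after a solidus divide
        code = match.group(1)
        exponents[code] = exponents.get(code, 0) + sign * power
    return exponents


def parse_title(title):
    """Split a title such as '0.5 m^3/s^2' into a number and unit exponents."""
    number, unit_expr = title.split(" ", 1)
    return float(number), parse_unit_expression(unit_expr)
```

The sketch leaves prefixed codes such as "km" untouched; separating the prefix from the unit is a different step.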

Supported SI Prefixes

  • yotta- Y Septillion Quadrillion 1 000 000 000 000 000 000 000 000
  • zetta- Z Sextillion Trilliard (thousand trillion) 1 000 000 000 000 000 000 000
  • exa- E Quintillion Trillion 1 000 000 000 000 000 000
  • peta- P Quadrillion Billiard (thousand billion) 1 000 000 000 000 000
  • tera- T Trillion Billion 1 000 000 000 000
  • giga- G Billion Milliard (thousand million) 1 000 000 000
  • mega- M Million 1 000 000
  • kilo- k Thousand 1 000
  • hecto- h Hundred 100
  • deca- da Ten 10
  • deci- d Tenth 0.1
  • centi- c Hundredth 0.01
  • milli- m Thousandth 0.001
  • micro- u Millionth 0.000 001
    • There is already a Unicode character for the micro sign: µ (U+00B5). Better to use it than to substitute a "u". TobyInk 03:56, 18 Nov 2007 (PST)
  • nano- n Billionth Milliardth 0.000 000 001
  • pico- p Trillionth Billionth 0.000 000 000 001
  • femto- f Quadrillionth Billiardth 0.000 000 000 000 001
  • atto- a Quintillionth Trillionth 0.000 000 000 000 000 001
  • zepto- z Sextillionth Trilliardth 0.000 000 000 000 000 000 001
  • yocto- y Septillionth Quadrillionth 0.000 000 000 000 000 000 000 001
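Splitting a code such as "km" or "mm" into prefix and unit is not entirely trivial, because some symbols collide ("m" is both milli and metre, "cd" could be read as centi-day). The Python sketch below shows one possible rule: a code that is itself a known unit is never split. The unit set is limited to the base SI symbols listed on this page, and both "u" and "µ" are accepted for micro; `split_prefix` is a hypothetical name.

```python
# Power of ten for each prefix symbol in the table above.
PREFIX_EXPONENT = {
    "Y": 24, "Z": 21, "E": 18, "P": 15, "T": 12, "G": 9, "M": 6,
    "k": 3, "h": 2, "da": 1, "d": -1, "c": -2, "m": -3,
    "u": -6, "µ": -6, "n": -9, "p": -12, "f": -15, "a": -18,
    "z": -21, "y": -24,
}

# Assumption: only these unit symbols are known to the sketch.
BASE_UNITS = {"m", "g", "s", "A", "K", "mol", "cd"}


def split_prefix(code):
    """Return (power of ten, unit symbol) for a possibly prefixed unit code."""
    if code in BASE_UNITS:
        return 0, code  # "m" is metre and "cd" is candela, not a bare prefix
    # try longer prefixes first so that "da" is tested before "d"
    for prefix in sorted(PREFIX_EXPONENT, key=len, reverse=True):
        unit = code[len(prefix):]
        if code.startswith(prefix) and unit in BASE_UNITS:
            return PREFIX_EXPONENT[prefix], unit
    raise ValueError(f"unknown unit code: {code!r}")
```

Adding more units to the set (day "d", minute "min", inch "in") increases the number of collisions, which is worth keeping in mind when choosing the final code list.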

Supported SI Units

  • meter (m) - length
  • gram (g) - mass
  • kilogram (kg) - mass
  • second (s) - time
  • ampere (A) - electric current
  • kelvin (K) - thermodynamic temperature
  • mole (mol) - amount of substance
  • candela (cd) - luminous intensity

Supported Derived SI Units

  • hertz (Hz) - frequency
  • newton (N) - force, weight
  • pascal (Pa) - pressure, stress
  • joule (J) - energy, work, heat
  • watt (W) - power, radiant flux
  • coulomb (C) - electric charge or electric flux
  • volt (V) - voltage, electrical potential difference, electromotive force
  • farad (F) - electric capacitance
  • ohm (ohm) - electric resistance, impedance, reactance
  • siemens (S) - electrical conductance
  • weber (Wb) - magnetic flux
  • tesla (T) - magnetic field
  • henry (H) - inductance
  • lumen (lm) - luminous flux
  • lux (lx) - illuminance
  • becquerel (Bq) - radioactivity (decays per unit time)
  • sievert (Sv) - equivalent dose (of ionizing radiation)
  • katal (kat) - catalytic activity

Supported Non-SI Units

  • minute (min) - time
  • hour (h) - time
  • day (d) - time
  • radian (rad) - angle
  • degree of arc (deg) - angle
    • Use instead U+00B0 (°, degree) TobyInk 04:06, 18 Nov 2007 (PST)
  • minute of arc (') - angle
    • Use instead U+2032 (′, prime) TobyInk 04:06, 18 Nov 2007 (PST)
  • second of arc ('') - angle
    • Use instead U+2033 (″, double-prime) TobyInk 04:06, 18 Nov 2007 (PST)
  • steradian (sr) - solid angle
  • square degree (deg^2) - solid angle
  • litre (L) - volume
  • tonne (t) - mass

Units Defined by Microformats.org

  • Celsius (cel) - temperature
    • Use U+2103 (℃, degree Celsius) TobyInk 04:07, 18 Nov 2007 (PST)
  • bit (bit) - computing
  • year (y) - time
  • inch (in) - length
  • foot (ft) - length

Supported SI Markup

  • solidus (/) - divisor
    • Division slash (∕, U+2215) more appropriate TobyInk 04:09, 18 Nov 2007 (PST)
  • caret (^) - exponentiation

See also