[uf-discuss] Human and machine readable data format

Thu Jul 3 05:39:32 PDT 2008

On Thu, Jul 3, 2008 at 7:04 PM, Dan Brickley <danbri at danbri.org> wrote:
> Breton Slivka wrote:
>
>> I offer the challenge to those developers: If you sincerely believe
>> that simple internationalized date parsing is an unsolvable or
>> difficult problem (which, as I have pointed out has been solved
>> numerous times already, with two examples), please present your
>> evidence. Why is avoiding this work more important than Accessibility?
>> Why is avoiding this work more important than avoiding hidden
>> metadata?

> Imagine the English language permutations of "Tuesday the fourteenth of July,
> next year" in terms of word order. Then allow for all natural languages (in
> all written scripts). And don't forget we use a variety of calendars. Big
> job. In theory it could be attempted; but the culture around here is averse
> to 'theoretical' solutions.
>

Once again this straw man is trotted out. Who is discussing this type
of solution other than to specifically discredit the approach as too
hard?

I certainly am not suggesting this kind of wide-ranging natural
language parser. I haven't seen anyone else seriously suggesting it.
It's a foolish undertaking, and it's obviously a foolish undertaking.
Then WHY OH WHY does this keep being brought up as though it were
being seriously discussed? Where does this idea keep coming from?

Let me give a rough example of a parser that would work, would be
simple to write, and whose input format could be read aloud by a
screen reader.

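function parser(datestring, locale) {
  // English month names; array index + 1 is the month number.
  const enMonths = ["January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December"];
  const lang = locale.toLowerCase();
  let m, day, monthName, year;

  if (lang === "en-us") {
    // US order, e.g. "July 3rd, 2008": month, day with ordinal suffix, year.
    m = /^([A-Za-z]+) ([1-3]?[0-9])(?:st|nd|rd|th), ([0-9]{1,4})$/.exec(datestring);
    if (m) { monthName = m[1]; day = m[2]; year = m[3]; }
  } else if (lang === "en-au" || lang === "en-uk" || lang === "en-gb") {
    // AU/UK order, e.g. "3rd July, 2008" ("en-gb" is the standard UK tag).
    m = /^([1-3]?[0-9])(?:st|nd|rd|th) ([A-Za-z]+), ([0-9]{1,4})$/.exec(datestring);
    if (m) { day = m[1]; monthName = m[2]; year = m[3]; }
  }
  if (!m) return null; // unsupported locale, or the string is not in its format

  // Month name to number; indexOf is zero-based and gives -1 if unknown.
  const month = lang.startsWith("en") ? enMonths.indexOf(monthName) + 1 : 0;
  if (month === 0) return null;

  return [Number(year), month, Number(day)];
}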

This is a simple example. There may well be better techniques for doing
this than regexes (or not), but the point is that you can make a
human-READABLE format without having to cover the whole spectrum of
human expression. Instead, you have ONE precise format for US dates,
ONE precise format for UK dates, ONE precise format for Japanese
dates, and so on. You put a date in this format in the title of an
ABBR, and you can say whatever you want about the date, in whatever
language you like, in the contents of the ABBR. The parser shouldn't
care about the contents. It's just looking at the title. It already
is. The only change from the current pattern is that we'd be using a
less geeky and obscure format than ISO 8601. The lang attribute of the
ABBR element tells the parser which format is in use.
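
To make the ABBR idea concrete, here is a minimal sketch in JavaScript.
Everything named in it is my own assumption for illustration: the
FORMATS table, the function names parseTitle and readAbbrDate, and in
particular the Japanese pattern, which is just one plausible choice for
"the" Japanese format. A real parser would also read the attributes
through the DOM rather than with regexes.

```javascript
// Hypothetical table: ONE date format per locale, keyed by lowercase lang tag.
// `order` lists which date part each capture group holds.
const EN_MONTHS = ["january", "february", "march", "april", "may", "june",
  "july", "august", "september", "october", "november", "december"];

const FORMATS = {
  "en-us": { re: /^([A-Za-z]+) (\d{1,2})(?:st|nd|rd|th), (\d{1,4})$/,
             order: ["month", "day", "year"] },
  "en-gb": { re: /^(\d{1,2})(?:st|nd|rd|th) ([A-Za-z]+), (\d{1,4})$/,
             order: ["day", "month", "year"] },
  // Assumed format for Japanese, e.g. "2008年7月3日".
  "ja":    { re: /^(\d{1,4})年(\d{1,2})月(\d{1,2})日$/,
             order: ["year", "month", "day"] },
};

// Parse a title string using the single format registered for `lang`.
// Returns [year, month, day], or null if the locale or string is not supported.
function parseTitle(title, lang) {
  const format = FORMATS[lang.toLowerCase()];
  if (!format) return null;
  const match = format.re.exec(title);
  if (!match) return null;
  const parts = {};
  format.order.forEach((name, i) => { parts[name] = match[i + 1]; });
  // The month is either already a number or an English month name.
  const month = /^\d+$/.test(parts.month)
    ? Number(parts.month)
    : EN_MONTHS.indexOf(parts.month.toLowerCase()) + 1;
  if (month < 1 || month > 12) return null;
  return [Number(parts.year), month, Number(parts.day)];
}

// Read lang and title from the first <abbr> tag; the element's contents
// are never inspected. Regexes keep the sketch free of DOM dependencies.
function readAbbrDate(html) {
  const tag = /<abbr\b[^>]*>/i.exec(html);
  if (!tag) return null;
  const lang = /\blang="([^"]*)"/i.exec(tag[0]);
  const title = /\btitle="([^"]*)"/i.exec(tag[0]);
  if (!lang || !title) return null;
  return parseTitle(title[1], lang[1]);
}

// Logs the array [2008, 7, 3]; the text "last Thursday" is ignored.
console.log(readAbbrDate(
  '<abbr lang="en-us" title="July 3rd, 2008">last Thursday</abbr>'));
```

Supporting another locale in this sketch means adding one entry to
FORMATS, which is the whole of the per-locale work I'm asking parser
authors to do.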

Honestly, how difficult is it for a parser author to collect one format
for each locale? I've seen far more heroic efforts on simpler things.
How difficult is it for content publishers to learn ONE format (the
one for their own locale)? We're already asking them to learn a more
difficult format!

Yes, it's more complicated than parsing ISO 8601. But it's not boiling
the ocean. This isn't a binary decision we're facing. It's not a
choice between "I could implement it in an hour" simplicity and
"human-level" AI. A compromise has to be made if we are to make any
progress.