[uf-new] A microformat for Machine translation software readable words

Tom Morris tom at tommorris.org
Sat Mar 21 05:31:05 PST 2009


On Sat, Mar 21, 2009 at 10:31, Mindaugas Indriunas <inyuki at gmail.com> wrote:
> The implications of this kind of microformat could be far reaching. It
> could result in better machine translation, and possibly something
> like Wikipedia written in one language (that is, in concepts defined
> through use of multitude of all existing human languages and
> dictionaries), yet displayed in a preferred human language
> automatically...
>

The W3C have an Incubator Group in place to try and push 'CWL', the
Common Web Language. The idea is that instead of writing Web
documents in existing natural languages, one writes them in this
semantically-rich markup language, which then gets machine translated.

Here is their charter:
http://www.w3.org/2005/Incubator/cwl-ei/charter

I put up a blog post about it a while back, where I snarkily called it
Esperanto-over-HTTP:
http://tommorris.org/blog/2008/07/01#When:22:29:49

A microformat that sits atop an existing machine-readable language (X/HTML) and
existing natural languages is a lot less impractical than something
like CWL. That said, the idea that general web documents will end up
filled with semantically unambiguous identifiers instead of words is
ambitious to say the least.

Both this proposal and the CWL proposal suffer from the problem that
they'll turn the richness of human languages into machine slop. Human
languages have given us Plato, Dante, the Song of Solomon, Eliot and
Shakespeare. A highly efficient method to turn that into something
like a Java stack trace is perhaps less than ideal. Maybe, in a
hundred years' time, we might get some kind of XML Esperanto thing
going on, but first we need to solve the big problems - the common
blobs of data, the common relationships between the things those blobs
of data represent. This is how it is in the real world - there's a
reason why things like the signs at hospitals, train stations,
airports and trams are made internationally readable with a greater
degree of urgency than, say, television shows. If you turn up at a
hospital and don't speak much of the native language, you risk death.
If you can't watch Lost, big deal.

If you think that this approach has a shot, I think the best way is to
produce a demo - write an example in X/HTML and show how linguistic
disambiguation could make for better machine translation. You need to
get the guts working first, then if it's necessary, a microformat can
come later.
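
To make that concrete, here is a rough sketch of what such a demo
might look like. Everything in it is an assumption made up for
illustration: the class name "sense", the use of the title attribute
to carry a sense identifier, the identifiers themselves ("bank:river",
"bank:finance") and the tiny English-to-French lexicon. It is not a
proposed microformat, just a way of showing that a sense annotation
lets a translator pick the right word where the surface text alone is
ambiguous.

```python
from html.parser import HTMLParser

# Hypothetical markup: a span with class "sense" whose title attribute
# carries a made-up sense identifier. None of this is a real microformat.
SAMPLE = (
    '<p>I sat on the <span class="sense" title="bank:river">bank</span> '
    'near the <span class="sense" title="bank:finance">bank</span>.</p>'
)

# Toy sense-keyed English-to-French lexicon, invented for illustration.
LEXICON_FR = {
    "bank:river": "rive",
    "bank:finance": "banque",
}


class SenseExtractor(HTMLParser):
    """Collect (surface word, sense id) pairs from annotated spans."""

    def __init__(self):
        super().__init__()
        self._sense = None  # sense id of the span we are inside, if any
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "span" and "sense" in classes:
            self._sense = attrs.get("title")

    def handle_data(self, data):
        # Only text inside an annotated span is recorded.
        if self._sense is not None:
            self.pairs.append((data.strip(), self._sense))

    def handle_endtag(self, tag):
        if tag == "span":
            self._sense = None


def translate_annotated(html):
    """Return (original word, translated word) for each annotated span."""
    parser = SenseExtractor()
    parser.feed(html)
    parser.close()
    # Fall back to the original word when the sense is not in the lexicon.
    return [(word, LEXICON_FR.get(sense, word)) for word, sense in parser.pairs]


if __name__ == "__main__":
    print(translate_annotated(SAMPLE))
```

The same English word comes out as two different French words because
the lookup is keyed on the sense identifier rather than on the word
itself. A real demo would need to feed those identifiers into an
actual translation engine, which is the hard part.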

-- 
Tom Morris
http://tommorris.org/


More information about the microformats-new mailing list