[uf-discuss] NLP was Apple Data Detectors

Brian Suda brian.suda at gmail.com
Fri Feb 8 10:54:04 PST 2008

2008/2/8, Guillaume Lebleu <guillaume at lebleu.org>:
> I understand the challenge of disambiguation and the value microformats
> bring in terms of easier parser implementation and more reliable
> information consumption experience.

--- without the explicit additional mark-up declaring something to be
of a certain type, we are left with just Natural Language Processing
(NLP = http://en.wikipedia.org/wiki/Natural_language_processing)

This is a dangerous and slippery slope. NLP sounds like a great idea,
but it has never lived up to the hype preceding it. NLP is language
specific, so while it might be great that Apple Data Detectors work in
English, supporting NLP for every language makes the code explode
quickly. Microformats, while requiring extra mark-up, can accommodate
ISO dates in any language.
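
As a rough sketch of that last point (not taken from any real parser),
the Python below reads the ISO date from the title attribute of an
element marked class="dtstart", following the hCalendar abbr pattern.
The class name DtstartParser, the helper extract_dtstart and the two
sample snippets are made up for illustration, and it assumes a
date-only value in YYYY-MM-DD form.

```python
from datetime import datetime
from html.parser import HTMLParser


class DtstartParser(HTMLParser):
    """Collect the title attribute of every element with class 'dtstart'."""

    def __init__(self):
        super().__init__()
        self.dates = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if "dtstart" in classes and attrs.get("title"):
            # Assumption: date-only ISO value such as 2008-02-08.
            self.dates.append(datetime.strptime(attrs["title"], "%Y-%m-%d"))


def extract_dtstart(html):
    """Return the dtstart dates found in an HTML fragment (hypothetical helper)."""
    parser = DtstartParser()
    parser.feed(html)
    return parser.dates


# Same machine-readable date, different human-readable languages.
english = '<abbr class="dtstart" title="2008-02-08">February 8th, 2008</abbr>'
french = '<abbr class="dtstart" title="2008-02-08">le 8 février 2008</abbr>'

print(extract_dtstart(english) == extract_dtstart(french))
```

The visible text never gets parsed, so the same few lines work whatever
language the page is written in.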

Geonames has a service that attempts to find places and give them geo
coordinates. You can judge for yourself how well NLP can or cannot
correctly extract data.

I Want Sandy (http://iwantsandy.com/) attempts to parse dates and
times, but it usually needs some help or a well-structured format.
While that is not impossible to do, you end up writing in a way that
isn't natural.

> The challenge for average people
> writing microformats can't be underestimated though. I strongly believe
> that the time where disambiguation costs are the lowest are at
> publishing time, but this is also the time where you are focused on the
> english content, not the microformats.

The danger here is that you are attempting to "have your cake and eat
it too". There will always be an effort on someone's part to explain
this data. AI is not there yet, and probably won't get there any time
soon, so microformats are the lightest-weight way to add the
information needed to help machines without overburdening the

The ideal solution would be some sort of plugin in the CMS, so you
can simply highlight areas, push a button, and have it add the
microformatted information, or (like Microsoft Writer) an hCalendar
plugin, so you fill out a form and it puts it all inline with the
mark-up for you. Both of these efforts lighten the load on the
publisher while keeping the mark-up to remove ambiguities.
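
To give an idea of how little such a plugin would have to do, here is a
minimal Python sketch that turns form fields into an hCalendar
fragment. The function name hcalendar_event and its parameters are
hypothetical and not part of any existing CMS or editor API; the class
names (vevent, summary, dtstart, location) are the ones hCalendar
defines.

```python
from html import escape


def hcalendar_event(summary, start_iso, start_text, location=None):
    """Build an hCalendar fragment from form input (hypothetical helper).

    start_iso is the machine-readable date (e.g. 2008-02-08) and
    start_text is whatever the author wants readers to see.
    """
    parts = [
        '<div class="vevent">',
        f'  <span class="summary">{escape(summary)}</span>',
        f'  <abbr class="dtstart" title="{escape(start_iso)}">'
        f'{escape(start_text)}</abbr>',
    ]
    if location:
        parts.append(f'  <span class="location">{escape(location)}</span>')
    parts.append('</div>')
    return "\n".join(parts)


print(hcalendar_event("Launch party", "2008-02-08", "Friday the 8th", "London"))
```

The publisher only fills in the fields; the mark-up that removes the
ambiguity is generated for them.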


brian suda

