[uf-dev] Human and machine readable data format
danny.ayers at gmail.com
Sun Jun 29 10:35:56 PDT 2008
2008/6/29 Glenn Jones <glenn.jones at madgex.com>:
> As we turn around on the spot about the machine data issue, the question of
> Natural Language Processing (NLP) has come up again. The main problem
> with any form of NLP is there are too many ambiguities in reading dates
> or any other form of freeform human written text. I don't want us to go
> down this path; it is unworkable with currently available technologies.
I'm sure others are more capable than I of giving good responses to your
date format suggestions. But I find it interesting you should bring NLP up
over here. I'm afraid I can't resist chipping in on that ;-)
So the basic scenario is presumably that the producer(s) wish to convey
information to the consumer(s). [Either of which may be human or largely
* With an isolated Plain Old Semantic HTML document, the majority of the
information is encoded in human-readable text, enhanced with markup elements
(e.g. for emphasis).
* With HTML+HTTP, we get extra semantics through linking - even if it's just
"pageA is somehow related to pageB".
* With microformats there can be communication of machine-readable data
embedded in the HTML.
- caveat: as generally found in the wild, interpretation of the message from
producer to consumer relies on them both having prior knowledge of the
conventions of microformats.org - effectively a registry of keywords (though
only discoverable with manual intervention - Google etc)
- however where @profile URIs are provided, the consumer can "follow their
nose" to these other resources to discover the semantics intended by the
* Other languages are available (notably RDF, in this context especially
RDFa and microformats used in concert with GRDDL) where there is, thanks to
the 'follow your nose' discovery of URIs/HTTP, a more direct route to
In all these cases, at the end of the chain (of authority) there will be a
human element - the folks that designed the super-duper furniture ontology
may have their own world view that differs from those of others in the
furniture trade. They may simply have got stuff wrong. Fortunately use of
URIs allows potentially conflicting statements (in data, as in Web
documents) to coexist, and it's up to the consumer to apply their own
judgement on what to trust (based on provenance etc).
Now in the case of NLP, consumer-side heuristics will be applied to extract
something from text which *may* correspond to the producer's intended
message. So now not only do you have issues of provenance/trust, there's
also the margin of error of the heuristics to be factored in.
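
To make that margin of error concrete, here is a minimal sketch of the simplest possible consumer-side heuristic: trying two common numeric date conventions against the same string. The function name and the choice of formats are mine, purely for illustration; real NLP over freeform text has far more cases than this.

```python
from datetime import datetime


def candidate_dates(text):
    """Return every calendar date a slash-separated string could mean.

    Hypothetical helper: it only tries two numeric conventions.
    """
    formats = {
        "day/month/year": "%d/%m/%Y",
        "month/day/year": "%m/%d/%Y",
    }
    readings = {}
    for label, fmt in formats.items():
        try:
            readings[label] = datetime.strptime(text, fmt).date()
        except ValueError:
            # This convention cannot produce a valid date for this string.
            pass
    return readings


# Both conventions yield a valid (but different) date, so the consumer
# has to guess which one the producer meant.
ambiguous = candidate_dates("06/07/2008")

# Only one convention fits here, because there is no month 29.
unambiguous = candidate_dates("29/06/2008")
```

When more than one reading comes back, the consumer has no way to tell from the string alone which one was intended; that guess is the extra error term on top of provenance/trust.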
Overall, this seems to be a situation with a range of communication
possibilities - from lo-fidelity tag soup markup up to generally unambiguous
hi-fidelity communication thanks to data expressed as microformats with
@profile URIs, or (more or less equivalently) using web data oriented
languages such as RDF.
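
For contrast with the heuristic case, here is a rough sketch of the hi-fidelity end: a consumer that reads an explicitly marked-up date and collects any @profile URIs it could follow to discover the intended semantics. The sample markup, the example.org profile URI and the class name MicroformatSketch are all assumptions made up for illustration, and this is nowhere near a conforming microformats parser.

```python
from html.parser import HTMLParser


class MicroformatSketch(HTMLParser):
    """Collect @profile URIs and machine-readable dtstart values."""

    def __init__(self):
        super().__init__()
        self.profiles = []  # URIs a consumer could "follow its nose" to
        self.dtstart = []   # machine-readable dates found in title attributes

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "head" and attrs.get("profile"):
            # The profile attribute is a space-separated list of URIs.
            self.profiles.extend(attrs["profile"].split())
        classes = (attrs.get("class") or "").split()
        if "dtstart" in classes and attrs.get("title"):
            self.dtstart.append(attrs["title"])


# Hypothetical document: the profile URI is a placeholder, not a real profile.
SAMPLE = """<html><head profile="http://example.org/profile/hcalendar">
<title>Meetup</title></head>
<body><p class="vevent"><span class="summary">Meetup</span> on
<abbr class="dtstart" title="2008-06-29">29 June</abbr></p></body></html>"""

parser = MicroformatSketch()
parser.feed(SAMPLE)
parser.close()
```

Here the producer has stated the date outright, so the consumer does no guessing; what remains is whether to trust the producer and the profile it points at.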
Going back to the "extra semantics through linking" remark above, in
whichever of the above approaches the data is expressed and/or interpreted,
the value of that data can be significantly increased through using linked
data techniques. Yeah, I had to get that in.
Bottom line is that the Web is a vastly broad church, and ideally we should
be maximising the benefit from all these approaches in as interoperable a
fashion as possible - something like the old "think global act local" slogan.