2008/6/29 Glenn Jones &lt;<a href="mailto:glenn.jones@madgex.com">glenn.jones@madgex.com</a>&gt;: <div class="gmail_quote"><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;"> As we turnaround on the spot about machine data issue, the question of Natural Language Processing (NLP) has come up again. The main problem with any form of NLP is there are too many ambiguities in reading dates or any other form of freeform human written text. I don't want us to go down this path; it is unworkable with currently available technologies.</blockquote><div> I'm sure others are more capable than I am of giving good responses to your date format suggestions. But I find it interesting that you should bring NLP up over here. I'm afraid I can't resist chipping in on that ;-)<br><br>
So the basic scenario is presumably that the producer(s) wish to convey information to the consumer(s). [Either of which may be human or largely automated systems]<br><br>
* With an isolated Plain Old Semantic HTML document, the majority of the information is encoded in human-readable text, enhanced with markup elements (e.g. for emphasis).<br>
* With HTML+HTTP, we get extra semantics through linking, even if it's just that pageA is somehow related to pageB.<br>
* With microformats there can be communication of machine-readable data embedded in the HTML. Caveat: as generally found in the wild, interpretation of the message from producer to consumer relies on them both having prior knowledge of the conventions of <a href="http://microformats.org">microformats.org</a>, which is effectively a registry of keywords (though only discoverable with manual intervention - Google etc). However, where @profile URIs are provided, the consumer can "follow their nose" to these other resources to discover the semantics intended by the producer.<br>
* Other languages are available (notably RDF, in this context especially RDFa and microformats used in concert with GRDDL) where there is, thanks to the 'follow your nose' discovery of URIs/HTTP, a more direct route to machine-interpretability.<br><br>
In all these cases, at the end of the chain (of authority) there will be a human element. The folks that designed the super-duper furniture ontology may have their own world view that differs from those of others in the furniture trade. They may simply have got stuff wrong. Fortunately, use of URIs allows potentially conflicting statements (in data, as in Web documents) to coexist, and it's up to the consumer to apply their own judgement on what to trust (based on provenance etc).<br><br>
Now in the case of NLP, consumer-side heuristics will be applied to extract something from text which *may* correspond to the producer's intended message. So now not only do you have issues of provenance/trust, there's also the margin of error of the heuristics to be factored in.<br><br>
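To make that margin of error concrete, here is a minimal sketch in Python of a consumer-side date heuristic. The function name candidate_dates and the two formats it tries are assumptions chosen purely for illustration; they are not taken from any microformats parser or NLP toolkit.

```python
from datetime import datetime

def candidate_dates(text):
    """Return every date a naive consumer-side heuristic could read from text.

    Hypothetical helper for illustration: it only tries two numeric
    conventions, day-first and month-first.
    """
    formats = ["%d/%m/%Y", "%m/%d/%Y"]
    found = set()
    for fmt in formats:
        try:
            found.add(datetime.strptime(text, fmt).date())
        except ValueError:
            # This convention cannot produce a valid date for the text.
            pass
    return sorted(found)

# Two readings survive, so the consumer has to guess the producer's intent.
print(candidate_dates("01/02/2008"))

# Only the day-first reading is valid here, so the guess happens to be safe.
print(candidate_dates("29/06/2008"))
```

When more than one candidate comes back, the consumer is guessing. A value the producer marks up explicitly in a machine-readable form removes that particular guess, though the provenance/trust question remains.<br><br>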
Overall, this seems to be a situation with a range of communication possibilities, from lo-fidelity tag soup markup up to generally unambiguous hi-fidelity communication thanks to data expressed as microformats with @profile URIs, or (more or less equivalently) using web data oriented languages such as RDF.<br><br>
Going back to the "extra semantics through linking" remark above: in whichever of the above approaches the data is expressed and/or interpreted, the value of that data can be significantly increased by using linked data techniques. Yeah, I had to get that in. <a href="http://en.wikipedia.org/wiki/Linked_Data">http://en.wikipedia.org/wiki/Linked_Data</a><br><br>
Bottom line is that the Web is a vastly broad church, and ideally we should be maximising the benefit from all these approaches in as interoperable a fashion as possible - something like the old "think global act local" slogan. </div></div>Cheers, Danny. -- <a href="http://dannyayers.com">http://dannyayers.com</a> ~ <a href="http://blogs.talis.com/nodalities/this_weeks_semantic_web/">http://blogs.talis.com/nodalities/this_weeks_semantic_web/</a>