value-class-pattern-issues: Difference between revisions
Revision as of 12:37, 26 June 2008
Value Excerption Pattern Issues
Open issues concerning the parsing of the value excerption pattern.
Open Issues
These issues are awaiting resolution and reflection in the specification, but may not be blockers on the implementation of the specification.
Excluded Fields
There seem to be some properties within which value excerpting is NOT allowed (or should not be allowed!) e.g. "type" in hCard. TobyInk 07:38, 22 May 2008 (PDT)
- You mean type as a sub-property of tel? That's one of the identified machine-data items that needs a means of including the publisher's choice of text along with the microformat-specified one. Not to conflate two separate issues, but just noting that separation of type text and type value needs to be handled somewhere, and value-excerption-pattern could be considered as part of the solution. BenWard 07:54, 22 May 2008 (PDT)
- It makes sense to exclude some fields, since excerpting from them seems unintuitive, and excluding them avoids many of the nested-microformat problems, which may in turn avoid a messier mfo pattern. E.g. entry-summary and entry-content in hAtom: both could very feasibly contain nested formats of any kind, but it doesn't strike me as useful to segregate them into "value" at all. BenWard 10:41, 6 Jun 2008 (PDT)
- A totally different alternative: make value-excerption opt-in. It would need a bit of effort to go through all the specs and clarify, but it might actually make more sense. It's a useful pattern for some properties (especially those with data patterns). BenWard 10:41, 6 Jun 2008 (PDT)
- Setting the rules over depth-of-parsing (see below) to children-only would obviate the remaining need for this issue.
White-space behaviour when concatenating value nodes
We specify that no characters get inserted between concatenated occurrences of ‘value’. Need to audit all properties to ensure that this behaviour would be correct in all cases.
Possibly specify that individual properties can override this behaviour, specifying a separator character. Possibly specify that this should be a provision of parsing implementations, so as to maintain flexibility for future publishing.
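The concatenation behaviour and the proposed per-property override can be sketched as follows. This is only an illustration: the SEPARATOR_OVERRIDES table, the function names and the "label" entry are hypothetical and do not come from any specification, and the markup is assumed to be well-formed enough for Python's xml.etree.ElementTree.

```python
import xml.etree.ElementTree as ET

# Hypothetical override table: property name -> separator string.
# Nothing like this is specified; it only illustrates the proposal.
SEPARATOR_OVERRIDES = {"label": " "}


def has_class(element, name):
    """True if the element's class attribute contains the given token."""
    return name in element.get("class", "").split()


def concatenated_value(prop_element, prop_name):
    """Join the text of every descendant 'value' node of a property."""
    parts = [
        "".join(node.itertext())
        for node in prop_element.iter()
        if node is not prop_element and has_class(node, "value")
    ]
    # Default behaviour: nothing is inserted between occurrences.
    return SEPARATOR_OVERRIDES.get(prop_name, "").join(parts)


tel = ET.fromstring(
    '<span class="tel"><span class="value">+44</span> (0) '
    '<span class="value">1234</span> <span class="value">5678</span></span>'
)
print(concatenated_value(tel, "tel"))    # default: no separator
print(concatenated_value(tel, "label"))  # hypothetical override: one space
```

Keeping the separator in a lookup table rather than in the parsing logic is one way to leave room for properties to be audited and overridden individually.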
Depth of Parsing
Currently any descendant is parsed, which causes issues if a microformat field using the value-excerption-pattern is nested within another.
Example:
<div class="hentry vevent">
    <h1 class="entry-title summary">Party on Sunday!</h1>
    <div class="updated published">Tuesday <span class="value">2008-06-17</span></div>
    <p class="entry-content description">We're having a party on <span class="dtstart">Sunday, at 7pm! 
        <span class="value">2008-06-22T19:00:00+0100</span></span>. Please bring your friends!</p>
</div>
In this example, hAtom and hCalendar are interleaved. The DTSTART property of the event is contained within the entry-content of the hAtom entry, using the value-excerption-pattern to include the machine-data datetime. However, with full descendant parsing, the hAtom model will come out as the following:
ENTRY
    ENTRY-TITLE=Party on Sunday!
    UPDATED=2008-06-17
    PUBLISHED=2008-06-17
    ENTRY-CONTENT=2008-06-22T19:00:00+0100
- e.g. an hCalendar vevent nested inside hAtom entry-content must not result in entry-content parsing as 20080627T12:34:00+100.
- e.g. hCalendar defines organizer, which may be an hCard, which may have a tel property containing a sub-property value. Under these parsing rules, the entire organizer field would be parsed as the telephone number.
- Cognition copes with this OK -- the organizer is parsed as a full contact with an hCard - not just a number. TobyInk 07:38, 22 May 2008 (PDT)
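The problem can be reproduced with a short sketch of full-descendant parsing. The function names below are illustrative only and do not come from any real parser; the markup is the interleaved hAtom/hCalendar example from this section, parsed with Python's xml.etree.ElementTree on the assumption that it is well-formed.

```python
import xml.etree.ElementTree as ET

HTML = """<div class="hentry vevent">
    <h1 class="entry-title summary">Party on Sunday!</h1>
    <div class="updated published">Tuesday <span class="value">2008-06-17</span></div>
    <p class="entry-content description">We're having a party on <span class="dtstart">Sunday, at 7pm!
        <span class="value">2008-06-22T19:00:00+0100</span></span>. Please bring your friends!</p>
</div>"""


def has_class(element, name):
    """True if the element's class attribute contains the given token."""
    return name in element.get("class", "").split()


def find_property(root, name):
    """Return the first element carrying the given property class."""
    return next(el for el in root.iter() if has_class(el, name))


def parse_all_descendants(prop):
    """Use every descendant 'value' node if any exist, else all text."""
    values = [el for el in prop.iter() if el is not prop and has_class(el, "value")]
    if values:
        return "".join("".join(v.itertext()) for v in values)
    return "".join(prop.itertext())


root = ET.fromstring(HTML)
for name in ("entry-title", "updated", "entry-content"):
    print(name.upper() + "=" + parse_all_descendants(find_property(root, name)))
```

Because the value node belonging to dtstart is also a descendant of entry-content, the whole entry content collapses to the datetime.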
Possible resolutions:
- Specify the mfo (‘microformat object’) class be used when nesting microformats, as a processing instruction to parsers not to parse unrelated nested items
- Specify that value must only be read from children, not from all descendants. Restrictive, unlikely to work for existing hCard TEL usage.
- Specify the above (parse children, not all descendants), but allow individual properties (such as tel) to override and parse all descendants. This would result in a parse-depth flag on all fields, and many getting overridden for all descendants, but again, seems to be a well structured solution. Property name dictionaries in parsers would have to include the depth flag with the property.
  - This may break existing published microformats. TobyInk 07:25, 22 Jun 2008 (PDT)
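The third resolution (children-only by default, with a per-property depth flag) might look roughly like the sketch below. The PARSE_DEPTH dictionary, its flag names and the function names are assumptions made for illustration; no parser is known to use this exact structure.

```python
import xml.etree.ElementTree as ET

# Hypothetical property dictionary carrying the depth flag.
# Properties not listed default to "children".
PARSE_DEPTH = {"tel": "descendants"}


def has_class(element, name):
    """True if the element's class attribute contains the given token."""
    return name in element.get("class", "").split()


def value_nodes(prop, prop_name):
    """Collect 'value' nodes at the depth allowed for this property."""
    if PARSE_DEPTH.get(prop_name, "children") == "descendants":
        candidates = (el for el in prop.iter() if el is not prop)
    else:
        candidates = iter(prop)  # direct children only
    return [el for el in candidates if has_class(el, "value")]


def parse_property(prop, prop_name):
    """Return the excerpted value, or the whitespace-normalised text."""
    nodes = value_nodes(prop, prop_name)
    if nodes:
        return "".join("".join(n.itertext()) for n in nodes)
    return " ".join("".join(prop.itertext()).split())


entry = ET.fromstring(
    '<p class="entry-content">Party on <span class="dtstart">Sunday '
    '<span class="value">2008-06-22</span></span>.</p>'
)
tel = ET.fromstring(
    '<span class="tel"><span class="type">home</span> '
    '<em><span class="value">+1 555 0100</span></em></span>'
)
print(parse_property(entry, "entry-content"))  # nested value is ignored
print(parse_property(entry[0], "dtstart"))     # direct child value is used
print(parse_property(tel, "tel"))              # flag allows deeper value
```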
 
Parsing title from Empty value Elements
As a solution to the invisible data requirements sometimes presented by machine-data in microformats, a parsing rule is proposed: where the value element is empty (contains no non-whitespace characters), the title attribute is parsed instead.
e.g. <span class="dtstart">Tuesday the 24th at 6pm <span class="value" title="20080624T180000+1000"></span></span>
Note that due to a quirk in parsers, this technique can already be used in some parsers where the empty value node is also an abbr element. That is semantically incorrect use of abbr, though.
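A minimal sketch of the proposed rule, assuming well-formed markup and using Python's xml.etree.ElementTree. The function names are illustrative and not taken from any existing parser.

```python
import xml.etree.ElementTree as ET


def has_class(element, name):
    """True if the element's class attribute contains the given token."""
    return name in element.get("class", "").split()


def value_text(value_element):
    """Return the element text, or its title when the element is empty."""
    text = "".join(value_element.itertext())
    title = value_element.get("title")
    # "Empty" here means no non-whitespace characters in the content.
    if text.strip() == "" and title is not None:
        return title
    return text


prop = ET.fromstring(
    '<span class="dtstart">Tuesday the 24th at 6pm '
    '<span class="value" title="20080624T180000+1000"></span></span>'
)
value = next(el for el in prop if has_class(el, "value"))
print(value_text(value))
```

A value element with visible content keeps its existing behaviour, so the title is only consulted in the empty case.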
Problems
- Violates the microformats principle of visible data. Numerous previous efforts (e.g. markup in comments etc.) have walked down that path of "dark data" and failed in practice. We must hold ourselves to higher standards than any XML/RDF solution.  It's part of what sets microformats apart from so many other failed efforts at data representation on the web. We must not go down the path of dark data. IMHO that principle is inviolable for microformats. Tantek
- The approach here is that we have exceptional situations where we are requiring data to be duplicated for machines. They are exceptions which have existed in microformats since hCard, and this is a pattern to handle those exceptions and only those exceptions in response to the problems people have publishing them. The specification for this could be written to make it a per-property opt-in device, only for those properties identified above. This is not a ‘generic data embedding’ device and, in line with the cited principles, should not be allowed to become one. --BenWard 05:17, 25 Jun 2008 (PDT)
- An alternative, I suppose, would be to recognise all of the above data format examples as being in violation of the microformats principle, since authors are hiding them in favour of their own content. Every instance of fixed data formats in microformats that force authors to break the invisible data principle would need to be eliminated in favour of accessible, i18n compatible replacements, including those in hCard which are 1:1 mappings from vCard. We _could_ undertake that, but previous discussions (people being advised to misuse ABBR for translation of the vCard telephone types, for example) have already suggested that supporting the visible publishing is too complex. --BenWard 05:17, 25 Jun 2008 (PDT)
- Additionally, the use of terms such as ‘dark data’ is inappropriate for this discussion, which is focused on functional, practical solutions to the identified problem. The term is emotive, and aggressive toward other, completely unrelated technologies (such as RDF), which is irrelevant to solving this issue. Precisely, the machine-data in this technique is ‘non-visible machine-data’, and is being approached with specific regard to the microformats principle of design for humans first. --BenWard 05:34, 26 Jun 2008 (PDT)
 
- Worsens the DRY violation by separating the human visible version and machine readable version into separate elements.  Duplicate data itself is bad, but at least by keeping the duplicates local on the same element (as the existing abbr-pattern does), the risk of drift/divergence is reduced. The greater the distance in content of the duplicates, the greater the risk of drift/divergence, and thus the lower the quality of data. This has been illustrated by the divergence of invisible metadata in the head of a document versus the content in the body, and even more so across documents.
- The machine-data form is kept as a sibling of the human form and, in terms of distance in code, is not much further away than data stored in a single element's title attribute. Further, the specification for this could demand the value element be placed as the _first child_ of the parent property, forcing it to be published immediately after the property element. --BenWard 05:17, 25 Jun 2008 (PDT)
 
- Some parsers (particularly those that run incoming HTML through Tidy to convert it into well-formed XML) may strip empty inline elements. A workaround may be to allow (or even require) hard white space (i.e. &nbsp;) within the element with class="value".
  - It is, however, trivial to patch and build Tidy not to do this (keeping empty elements where that element also has a class attribute). Parser writers need to feed back on whether using a custom build is impossible for their solution, but since Tidy can be made to work, the problem can likely be alleviated. Ben Ward has put up an experimental build of Tidy with patched element-dropping behaviour here: tidy-microformats.zip
- Tidy is not just used in parsers, but also by publishers, as part of CMSes, etc.
 
 
Additional Notes
- This is parsable, needs to be specced.
- Suggest restricting to instances where a single value element exists, e.g.
  - Disallow concatenation of multiple embedded values
  - Disallow embedded values from being appended to visible data.
 
- This pattern exists to solve the machine data problem, and restricting it more will discourage it being used for hiding other, useful data.
- Perhaps restrict the value element to be the first-child (excluding white-space text-nodes), forcing the data to be kept physically close to the µf property in code. Keeps data close, helps maintenance issues. Intends to alleviate some invisible data concerns.
- Restrict opt-in to specific properties. Do not allow it to be parsed globally. Fail parsers which implement it globally.
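Taken together, the restrictions suggested in these notes could be checked along the following lines. This is a sketch under stated assumptions: the OPT_IN set is a hypothetical list of properties (none has been agreed), the function names are invented, and markup is assumed well-formed for Python's xml.etree.ElementTree.

```python
import xml.etree.ElementTree as ET

# Hypothetical opt-in list; which properties qualify is undecided.
OPT_IN = {"dtstart", "dtend"}


def has_class(element, name):
    """True if the element's class attribute contains the given token."""
    return name in element.get("class", "").split()


def embedded_value(prop, prop_name):
    """Return the title of an empty 'value' element, or None if any
    of the suggested restrictions is not met."""
    if prop_name not in OPT_IN:
        return None  # not parsed globally
    children = list(prop)
    if not children or not has_class(children[0], "value"):
        return None  # value must be the first child element
    if (prop.text or "").strip():
        return None  # only white-space text may precede it
    values = [el for el in prop.iter() if el is not prop and has_class(el, "value")]
    if len(values) != 1:
        return None  # no concatenation of multiple embedded values
    first = children[0]
    if "".join(first.itertext()).strip():
        return None  # not empty, so not an embedded value
    return first.get("title")


first_child = ET.fromstring(
    '<span class="dtstart"><span class="value" title="20080624T180000+1000">'
    '</span>Tuesday the 24th at 6pm</span>'
)
trailing = ET.fromstring(
    '<span class="dtstart">Tuesday the 24th at 6pm '
    '<span class="value" title="20080624T180000+1000"></span></span>'
)
print(embedded_value(first_child, "dtstart"))  # accepted
print(embedded_value(trailing, "dtstart"))     # rejected: not first child
print(embedded_value(first_child, "summary"))  # rejected: not opted in
```

Note that under the first-child restriction the earlier example, where the empty value element follows the visible text, would no longer be parsed.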
 
Closed Issues
These issues are closed, and either dismissed with reason, or the specification has been updated in resolution.
Nested value
Should <span class="value">Foo <span class="value">Bar</span></span> parse as foo bar or bar? Should value elements be allowed to be nested within value elements?
Resolution: Disallowed. Deemed complex to parse, and unnecessary when publishing.
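Since nesting is disallowed, a parser or validator may want to detect it. The resolution does not say how a parser should react, so the sketch below only reports the condition; the function names are illustrative.

```python
import xml.etree.ElementTree as ET


def has_class(element, name):
    """True if the element's class attribute contains the given token."""
    return name in element.get("class", "").split()


def has_nested_value(root):
    """True if any 'value' element contains another 'value' element."""
    for el in root.iter():
        if has_class(el, "value") and any(
            has_class(inner, "value") for inner in el.iter() if inner is not el
        ):
            return True
    return False


nested = ET.fromstring('<span class="value">Foo <span class="value">Bar</span></span>')
flat = ET.fromstring(
    '<span class="tel"><span class="value">Foo</span> <span class="value">Bar</span></span>'
)
print(has_nested_value(nested))
print(has_nested_value(flat))
```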