microformats2-parsing-issues

(Difference between revisions)

(parsing a dt- property: Value-class-pattern parsing should instruct to use a single space as the separator)
Current revision (22:17, 25 April 2017) (view source)
(any h- root class name overrides and stops backcompat root: add examples)
 
(46 intermediate revisions not shown.)
Line 1: Line 1:
-
This page is for documenting issues with the [[microformats2-parsing]] specification.
+
This page documents issues with the [[microformats2-parsing]] specification before 2016-06-20.
-
== issues ==
+
'''See https://github.com/microformats/microformats2-parsing/issues for current and new issues!'''
-
Open issues in various states of partial resolution from none to nearly resolved.
+
-
=== parsing a dt- property ===
+
'''See [[microformats2-parsing#change_control|change control]] for how to move issues forward.'''
-
* Should instruct to replace a "T" separator with a single space.
+
-
* Value-class-pattern parsing should instruct to use a single space as the separator.
+
-
* Value-class-pattern should instruct to keep the authored level of specificity, rather than implying 00 seconds when not present. http://microformats.org/wiki/value-class-pattern##If+by+parsing+the+%22value%22
+
-
Log: https://indiewebcamp.com/irc/2016-04-25#t1461606553653
+
{{warn|Note: all current issues resolved as of 2016-06-04}}
 +
* Pending edits to this page, and the [[microformats2-parsing]] specification.
-
<div class="discussion">
+
== issues ==
-
* +0 [[User:Kylewm|Kylewm]] on replacing "T" as the separator. Would you please clarify whether that is only for value class pattern/assembling dates from components, or is it proposing to *always* normalize dt's?
+
Open issues in various states of partial resolution from none to nearly resolved.
-
** +1 [[User:Tantek|Tantek]] definitely value class pattern/assembling dates from components should use " " instead of "T" as separator.
+
-
** +0 [[User:Tantek|Tantek]] slight pref (but unsure) for replace a "T" separator with a single space in other dt-* parsing.
+
-
** +1 [[User:GlennJones|Glenn]] happy to move to single space separator for dates built from the value-class pattern.
+
-
** -1 [[User:GlennJones|Glenn]] I think we should pass through the authored format of a date as default output. We should process the content as little as possible, so it is as authored. We can then add parser options to force one of the date formats such as ISO profiles HTML5 or W3C if we need consistency. This is the approach I have taken.
+
-
** See related specific issue: [[microformats2-parsing-issues#Standard_datetime_format]]
+
-
* +1 [[User:Kylewm|Kylewm]] on not implying seconds
+
-
* +1 [[User:GlennJones|Glenn]] on not implying seconds. Authored level of specificity should always be kept in dates.
+
-
</div>
+
=== unicode generation in JSON ===
 +
STATUS: 2016-06-05 apparent implementers consensus at IndieWeb Summit issues resolution session.
 +
* WAITING FOR: 1+ implementation to support and validate
-
Consensus resolutions:
 
-
* '''Drop value-class-pattern implying 00 seconds.''' Note: keeping/implying 00 minutes due to common human usage of whole hours to specifically mean "on the hour" which is 00 minutes. The same implied precision does not exist for seconds in practice.
 
-
** 2016-05-18 [[vcp]] updated with this resolution.
 
-
* '''Value-class-pattern parsing should instruct to use a single space as the separator.'''
 
-
** 2016-05-18 [[vcp]] updated with this resolution.
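A minimal sketch of the two resolutions above, in Python. The function name <code>join_vcp_datetime</code> is hypothetical and not taken from any parser; it assumes the date and time parts have already been collected from the <code>value</code> elements, and it leaves out the implied 00 minutes and am/pm handling.

```python
def join_vcp_datetime(date_part, time_part=None):
    """Join value-class-pattern date and time parts.

    A single space is used as the separator (not "T"), and the time is
    passed through at its authored level of specificity: no ":00"
    seconds are appended when the author did not write them.
    """
    if not time_part:
        # Date only: nothing to join, nothing to imply.
        return date_part
    return f"{date_part} {time_part}"
```

For example, <code>join_vcp_datetime("2016-05-18", "12:55")</code> gives <code>2016-05-18 12:55</code>, with no seconds added.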
 
-
 
-
=== implied name when alt="" ===
 
-
 
-
The implied name rule
 
-
 
-
* else if .h-x>img:only-child[alt]:not[.h-*] then use that img alt for name
 
-
 
-
is slightly under-specified for the case where alt is provided but intentionally blank. The desired behavior is to use the img alt attribute only if it is non-empty. For example:
 
-
 
-
<pre>
 
-
<a class="h-card" href="https://kylewm.com">
 
-
  <img src="https://kylewm.com/photo.jpg" alt="">
 
-
  Kyle
 
-
</a>
 
-
</pre>
 
-
 
-
 
-
The PHP and JS parsers already seem to return the desired result ("Kyle" in the above example). The Python parser uses the alt text and returns "".
 
-
 
-
Proposal: modify the spec to explicitly exclude img elements whose alt is empty:
 
-
 
-
* else if .h-x>img:only-child[alt]:not([alt=""]):not[.h-*] then use that img alt for name
 
-
 
-
And audit the other implied rules for similar cases.
 
-
 
-
<div class="discussion">
 
-
* +1 [[User:Tantek|Tantek]] this makes sense to me, and as far as I can tell, for the other cases too for *implied* properties:
 
-
** area[alt], abbr[title], and all other attributes where there is an existence test, there should be a :not[alt=""] empty test, for implied p-name, u-photo, u-url
 
-
* +1 [[User:GRegorLove|gRegor]] sounds good to me.
 
-
* +1 [[User:Kylewm|Kylewm]] We've added this in mf2py too now, and I'm happy with it.
 
-
* +1 [[User:GlennJones|Glenn]] This is often defined by the underlying HTML parsing library, which will remove attributes that do not have values.
 
-
* ...
 
-
</div>
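The proposed empty-alt test can be sketched as follows, in Python. This is only an illustration: <code>implied_name</code> and its two arguments are hypothetical, and the other implied-name rules (abbr title, nested only-child cases) are left out.

```python
def implied_name(only_child_img_attrs, text_content):
    """Return the implied name for a root element.

    only_child_img_attrs: attribute dict of the root's only-child <img>,
    or None when there is no such child (hypothetical input shape).
    text_content: the plain text of the root element.
    """
    if only_child_img_attrs is not None:
        alt = only_child_img_attrs.get("alt")
        # Proposed rule: an alt that is present but empty does not count,
        # so parsing falls through to the text content.
        if alt:
            return alt
    return text_content.strip()
```

With the h-card example above (an img with <code>alt=""</code> followed by the text "Kyle"), this sketch falls through to the text and returns "Kyle".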
 
-
 
-
=== img fallback in p- ===
 
-
Trying to make an author h-card without too many extra elements I first did:
 
-
<pre><div class="p-author h-card">
 
-
<a href="/" class="p-org p-name"><img class="u-logo" src="/static/logo.jpg" >mention.tech</a>
 
-
</div></pre>
 
-
 
-
rather than:
 
-
<pre><div class="p-author h-card">
 
-
<a href="/" class="p-org p-name"><img class="u-logo" src="/static/logo.jpg" alt="">mention.tech</a>
 
-
</div></pre>
 
-
 
-
I was surprised that the p-name and p-org took the src and the plaintext and concatenated them, giving <code>http://mention-tech.appspot.com/static/logo.jpgmention.tech</code>, though that is the current spec (a separate php-mf2 bug ignored the empty alt when I added it).
 
-
 
-
While this is what the spec says, I can't think of a scenario where concatenating a string to a URL gives a useful result. Instead:
 
-
 
-
Proposal:
 
-
* If we fall back on the src of an img because it has no alt, I propose we put a space at the beginning and end. As whitespace is stripped from the beginning and end of p- values, this should still give the url in the simplest case, but avoid creating nonsensical URLs in cases like this.
 
-
 
-
<div class="discussion">
 
-
* +1 [[User:Kylewm|Kylewm]] reasoning given here makes sense
 
-
* +1 [[User:Tantek|Tantek]] agreed with reasoning
 
-
* ...
 
-
</div>
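A rough sketch of the proposal, in Python with the standard library HTML parser. The class and function names are made up for illustration, and the sketch does not resolve the src to an absolute URL as a real parser would.

```python
from html.parser import HTMLParser


class PlainTextValue(HTMLParser):
    """Collect the text of a fragment, replacing <img> with alt or src."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attrs = dict(attrs)
        if attrs.get("alt") is not None:
            # alt present (even when empty): use it as the replacement text.
            self.parts.append(attrs["alt"])
        elif attrs.get("src"):
            # Proposed change: pad the src fallback with spaces so it
            # cannot run into the neighbouring text.
            self.parts.append(" " + attrs["src"] + " ")

    def handle_data(self, data):
        self.parts.append(data)


def p_value(fragment):
    """Return the p- text value of an HTML fragment (illustrative only)."""
    parser = PlainTextValue()
    parser.feed(fragment)
    parser.close()
    # Leading and trailing whitespace is stripped from p- values.
    return "".join(parser.parts).strip()
```

For the first markup example above, this gives <code>/static/logo.jpg mention.tech</code> rather than the two strings run together, and an img on its own still yields just its src.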
 
-
 
-
=== de-dupe URLs? ===
 
-
Currently, Known templates end up linking to the author's url in the h-card twice. This leads to duplicate URLs in the parsed output, which makes jf2 conversion insert a children element.
 
-
Should we be deduping URLs? Or is this a GIGO issue?
 
-
 
-
<div class="discussion">
 
-
* -1 [[User:Kylewm|Kylewm]] I can't necessarily think of a case where two of the same URL values is useful, but it feels like the parser's job to preserve the fidelity of the input. (this has been fixed in Known's markup btw [https://github.com/idno/Known/issues/1372])
 
-
* -1 [[User:Tantek|Tantek]] on de-duping for mf2 json. jf2 can do what it prefers, no specific opinion on that.
 
-
* -1 [[User:GlennJones|Glenn]] for both the reasons mentioned already
 
-
* ...
 
-
</div>
 
-
 
-
=== unicode generation in JSON ===
 
Currently we will convert HTML entities into unicode as part of the parsing process. However, these and other non-ASCII characters can be output as escaped unicode in the generated JSON.
Currently we will convert HTML entities into unicode as part of the parsing process. However, these and other non-ASCII characters can be output as escaped unicode in the generated JSON.
Broadly this is OK, as we assume JSON parsers should be able to handle this accordingly.
Broadly this is OK, as we assume JSON parsers should be able to handle this accordingly.
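To illustrate why escaped output is broadly OK, here is a small Python example using the standard <code>json</code> module. The property name and value are made up; the point is only that the escaped and unescaped serialisations decode to the same data.

```python
import json

# "1", U+2014 (the character that &mdash; decodes to), "2"
value = "1\u20142"

# Default: non-ASCII characters are written as \uXXXX escapes.
escaped = json.dumps({"name": [value]})

# Alternative: keep the characters as they are in the output text.
literal = json.dumps({"name": [value]}, ensure_ascii=False)

# A conforming JSON parser reads both back as the same string.
same = json.loads(escaped) == json.loads(literal)
```

Here <code>escaped</code> contains the six characters <code>\u2014</code> while <code>literal</code> contains the character itself, and <code>same</code> is true.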
Line 113: Line 30:
** When parsing e- properties, HTML entities should be left escaped in the "html" value. This is important when parsing a reply-context; if the original post contains an escaped HTML code snippet, I want the reply context to show the same code snippet, rather than converting it all into real tags.
** When parsing e- properties, HTML entities should be left escaped in the "html" value. This is important when parsing a reply-context; if the original post contains an escaped HTML code snippet, I want the reply context to show the same code snippet, rather than converting it all into real tags.
** e.g. <code>"content": [{"html": "&amp;lt;b&amp;gt;1&amp;mdash;2&amp;lt;/b&amp;gt;", "value": "&lt;b&gt;1&mdash;2&lt;/b&gt;"}]</code>.  
** e.g. <code>"content": [{"html": "&amp;lt;b&amp;gt;1&amp;mdash;2&amp;lt;/b&amp;gt;", "value": "&lt;b&gt;1&mdash;2&lt;/b&gt;"}]</code>.  
-
</div>
+
* +0 [[User:WillNorris|willnorris]] (+1 to saying that microformats parsers should standardize on UTF-8 for e-* text; however, I feel like e-* html should be left as unscathed as possible. html encoding may harken to a time before UTF-8, but if the content was authored that way, we shouldn't necessarily be changing that)
-
 
+
-
=== ignore u-camelCase properties ===
+
-
Due to Suit CSS (and others? citations?) recent (2015-?) use of "u-*" class names for so-called "[http://davidtheclark.com/on-utility-classes/ utility classes]", we are seeing some false positives in a few very rare instances, e.g.: [http://www.unmung.com/mf2?url=http%3A%2F%2Fwww.kevinmarks.com%2Ftwitterutils.html&html=&pretty=on this twitter markup]
+
-
 
+
-
(Nearly) all these "utility classes" use camelCase for the class name suffixes, thus we can filter them out by looking for camelCase (since microformats class name conventions are always all lowercase and hyphenated), or even just looking for (and rejecting) *any* capital letters.
+
-
 
+
-
Proposal:
+
-
* microformats2 parsers MUST IGNORE u-* classnames where the * has any uppercase letter(s).
+
-
 
+
-
<div class="discussion">
+
-
* +1 [[User:Tantek|Tantek]] Let's get this fix rolling quickly to avoid further pollution.
+
-
* +1 [[User:Barnabywalters|Barnaby]] php-mf2 already ignores classnames with capitalised prefixes, ignoring any classnames with capital letters seems totally reasonable
+
-
* +1 [[User:Kylewm|Kylewm]] agree with rejecting property names that include capital letters (specifically detecting camelCase seems harder to define)
+
-
* +1 [[User:GlennJones|Glenn]] agreed, a simple change which should help avoid further pollution
+
-
* ...
+
-
</div>
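The proposed filter is small enough to sketch directly. This Python version is an assumption about how a parser might apply it; both function names are hypothetical.

```python
def is_ignored_utility_class(class_name):
    """True for u-* class names whose suffix has any uppercase letter."""
    return class_name.startswith("u-") and any(
        ch.isupper() for ch in class_name[2:]
    )


def url_property_names(class_attr):
    """Return the u-* property names found in a class attribute value."""
    return [
        name[2:]
        for name in class_attr.split()
        if name.startswith("u-")
        and len(name) > 2
        and not is_ignored_utility_class(name)
    ]
```

Given <code>class="u-url u-floatLeft u-photo u-textTruncate"</code>, only <code>url</code> and <code>photo</code> survive; the camelCase utility classes are dropped without needing to detect camelCase specifically.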
+
-
 
+
-
=== exclude style elements before parsing ===
+
-
[http://logs.glob.uno/?c=freenode%23microformats#c85457 2016-01-25 raised in #microformats]
+
-
 
+
-
Ran into an issue of a <style> element being parsed as plain text in a p-name. Should [[microformats2-parsing]] be updated to indicate <style> should be excluded when parsing? Appears to implicitly fall under [[microformats2-parsing#note_HTML_parsing_rules]]
+
-
 
+
-
Sample link: http://veganstraightedge.com/notes/2016/01/16/tonight-s-dinner-tacocleanse-beverly-hills-c
+
-
 
+
-
The <script> tag can be similarly problematic.
+
-
 
+
-
Proposal: Drop both <script> and <style> elements completely when parsing any property (including e-* HTML values). [[User:Tantek|Tantek]] 01:01, 29 February 2016 (UTC)
+
-
 
+
-
<div class="discussion">
+
-
Please discuss and/or give +1/0/-1 feedback
+
-
* +1 [[User:Tantek|Tantek]] as proposer
+
-
* +1 [[User:Aaronpk|aaronpk]] as a consumer of HTML from an e-* property, I will always be sanitizing the HTML and removing <script> and <style> anyway
+
-
* +1 [[User:Kylewm|kylewm]]
+
-
* 0 [[User:Barnabywalters|Barnaby]] +1 to removing the contents of <script> and <style> from all plaintext properties (and 'value' property in HTML dicts), -1 to removing <script> and <style> from HTML. That’s a job for a sanitization stage. As aaronpk points out, sanitization will have to be done anyway if the content is to be reposted, so doing so in the parser doesn’t actually save anyone any work, but removes information which could be useful to people (example use cases: publishing posts with embedded per-post styling, publishing interactive HTML documents with embedded javascript)
+
-
** +1 this seems like reasonable feedback to make a new refined proposal. [[User:Tantek|Tantek]] 20:37, 13 March 2016 (UTC)
+
-
** +1 I like the revised proposal and am happy to change my vote to this [[User:Aaronpk|Aaronpk]] 21:16, 13 March 2016 (UTC)
+
-
** +1 Totally agree with narrowing the proposal. All the problems I've had with script and style tags come from plaintext properties, and agree that they may even be useful to some consumers of the HTML properties (e.g. an embedded YouTube video) [[User:Kylewm|Kylewm]] 23:40, 13 March 2016 (UTC)
+
-
</div>
+
-
 
+
-
Proposal 2: Drop both <script> and <style> elements completely when parsing any property (except for e-* HTML values, which preserve all markup). [[User:Tantek|Tantek]] 20:37, 13 March 2016 (UTC)
+
-
 
+
-
<div class="discussion">
+
-
Please discuss and/or give +1/0/-1 feedback
+
-
* +1 [[User:Tantek|Tantek]] as proposer
+
-
* +1 [[User:Barnabywalters|Barnaby]]
+
-
* +1 [[User:Kylewm|Kylewm]] leave sanitization to the sanitizers!
+
-
* +1 [[User:GlennJones|Glenn]]
+
-
* ...
+
-
</div>
+
-
 
+
-
=== use poster if no src on video for u props ===
+
-
[https://indiewebcamp.com/irc/2015-12-13#t1450035721661 2015-12-13 raised in #indiewebcamp]
+
-
 
+
-
There is a use-case of marking up the "poster" of a video element as the u-featured of an [[h-entry]]. To do that, we need to change [[microformats2-parsing#parsing_a_u-_property|u- property parsing]] to look at the poster attribute of the video element after it has looked for the src attribute.
+
-
<blockquote>" else if video.u-x[poster], then get the poster attribute "</blockquote>
+
-
 
+
-
Real-world example of markup in the wild:
+
-
* http://veganstraightedge.com/videos/2013/5/31/1/backyard-squirrel-buddy
+
-
** and likely all other videos posted there.
+
-
 
+
-
Background discussion that led to this proposal:
+
-
* https://indiewebcamp.com/irc/2015-12-13#t1450035721661
+
-
 
+
-
This seems very straightforward so I've added it as PROPOSED directly in the parsing spec. This issue is for tracking the discussion.
+
-
 
+
-
Feedback from parser implementers please!
+
-
<div class="discussion">
+
-
* +1 [[User:Barnabywalters|Barnaby]] easy to implement and based on real-world markup, no objections
+
-
* +1 [[User:Kylewm|Kylewm]] sgtm
+
-
* +1 [[User:GlennJones|Glenn]]
+
-
* ...
+
-
</div>
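The ordering described above (src first, then poster) could look like this in Python. <code>video_u_value</code> is a hypothetical helper that takes the attributes of a <code>video.u-x</code> element as a dict; resolving the result to an absolute URL is left out.

```python
def video_u_value(attrs):
    """Return the u- value for a <video class="u-x"> element, or None.

    attrs: dict of the element's attributes (hypothetical input shape).
    """
    if "src" in attrs:
        # Existing rule: video.u-x[src] uses the src attribute.
        return attrs["src"]
    if "poster" in attrs:
        # Proposed rule: else if video.u-x[poster], use the poster attribute.
        return attrs["poster"]
    # Neither attribute: the caller continues with the remaining u- rules.
    return None
```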
+
-
 
+
-
=== uf2 children on backcompat properties ===
+
-
[http://logs.glob.uno/?c=freenode%23microformats&s=today#c84632 2015-11-24 raised by Calli] in #microformats
+
-
 
+
-
Related but different from [[#uf2_children_inside_a_classic_microformats_root_class_name]], when there is a uf2 child directly on a backcompat property, what should happen? E.g.
+
-
 
+
-
<source lang=html4strict>
+
-
<div class="vcard">
+
-
<div class="adr h-adr">
+
-
  <div class="locality">MF1</div>
+
-
  <div class="p-locality">MF2</div>
+
-
</div>
+
-
</div>
+
-
</source>
+
-
 
+
-
What is the expected behavior and parser output?
+
-
 
+
-
<source lang=javascript>
+
-
"items": [{
+
-
  "type": ["h-card"],
+
-
  "properties": {
+
-
    "adr": [{
+
-
      "value": "MF1MF2",
+
-
      "type": ["h-adr"],
+
-
      "properties": {
+
-
        "locality": ["MF2"],
+
-
        "name": ["MF1MF2"]
+
-
      }
+
-
    }]
+
-
  } 
+
-
}]
+
-
</source>
+
-
 
+
-
<div class="discussion">
+
-
* Proposal: the nested "adr h-adr" child is treated as an mf2 object, not backcompat, and thus the resulting parsed "locality" property has a single value of "MF2". Proposed by Calli, noting that Glenn Jones's microformatshiv gets that result currently, and it would be easier for him (Calli) to implement this way.
+
-
** +1 Tantek, seems reasonable and the reasoning provided is good (we have one implementation this way already)
+
-
** +1 Kyle, this is consistent with the resolution to the related issue
+
-
** +1 Calli, yes, this is easier for me to implement (than taking both MF1 and MF2 properties) because it is consistent - for me, consistency is the controlling factor in favor rather than ease of parser implementation
+
-
** +1 Barnaby, php-mf2’s mf1 backcompat produces this exact result, and it makes a lot of sense to me
+
-
** ...
+
-
</div>
+
-
 
+
-
Another example:
+
-
 
+
-
<source lang=html4strict>
+
-
<div class="vcard">
+
-
  <div class="adr h-custom">
+
-
    <div class="locality">MF1</div>
+
-
    <div class="p-locality">MF2</div>
+
-
  </div>
+
-
</div>
+
-
</source>
+
-
 
+
-
<source lang=javascript>
+
-
"items": [{
+
-
  "type": ["h-card"],
+
-
  "properties": {
+
-
    "adr": [{
+
-
      "value": "MF1MF2",
+
-
      "type": ["h-custom"],
+
-
      "properties": {
+
-
        "locality": ["MF2"],
+
-
        "name": ["MF1MF2"]
+
-
      }
+
-
    }]
+
-
  } 
+
-
}]
+
-
</source>
+
-
 
+
-
Per the [[#any_h-_root_class_name_overrides_and_stops_backcompat_root]] resolution, the class name "h-custom" overrides the use of "adr" as a backcompat root.
+
-
 
+
-
=== default generated HTML ===
+
-
2015-09-08 raised by Tantek in #indiewebcamp
+
-
 
+
-
Should there be a default (perhaps not quite "canonical") way to map/generate HTML+microformats2 from a parsed mf2 JSON output?
+
-
 
+
-
E.g. straw proposal:
+
-
* JSON/mf2 -> [[XOXO]]+mf2
+
-
 
+
-
Existing work / mappings:
+
-
* https://github.com/snarfed/granary/blob/master/granary/microformats2.py#L295
+
-
 
+
-
Related to:
+
-
* https://github.com/snarfed/granary/issues/31
+
-
 
+
-
Use-cases:
+
-
* default webview / presentation for a site that stores mf2 JSON output
+
-
* possibly a way to implement a distributed HTTP webcache retrieval protocol
+
-
 
+
-
Thoughts?
+
-
<div class="discussion">
+
-
* +1 [[User:Tantek|Tantek]] I think we should have this, but am open to proposals on specifics!
+
-
* +1 [[User:GlennJones|Glenn]] Also think this is worth looking at, but I am not sure it should be part of the parser spec. Feels like it should be built as a separate library and have its own spec on the microformats wiki.
+
-
* +1 [[User:Barnabywalters|Barnaby]] agreed with Glenn, this would be a nice thing to have, but IMO it’s out of scope for the parser and should be specified separately. Personally I would probably implement it separately too, depending on how much work it is.
+
-
* -1 [[User:Kylewm|Kylewm]] A pretty display would be a nice debugging tool, but I'm -1 the proposal to define a specific, default HTML output. The two proposed use-cases are totally buildable without it.
+
-
* ...
+
</div>
</div>
Line 303: Line 52:
As a separate new point we need to consider "exclude tags" lists for parsed text from html.  We should consider <code><noscript></code>, <code><noframes></code> and <code><template></code>; there may be others, as I have not gone through all the tags in the current HTML spec. Also we should consider what to do about the more common pattern of fallback text within media tags <code><video></code>, <code><audio></code> etc. This should be explicitly discussed in the parsing rules. At the moment my experimental text normalisation does exclude tags, but the default text parse does not. Currently the fallback content in media tags like <code><video></code> is added to the parsed text. 12:56, 25 September 2015 (UTC)
As a separate new point we need to consider "exclude tags" lists for parsed text from html.  We should consider <code><noscript></code>, <code><noframes></code> and <code><template></code>; there may be others, as I have not gone through all the tags in the current HTML spec. Also we should consider what to do about the more common pattern of fallback text within media tags <code><video></code>, <code><audio></code> etc. This should be explicitly discussed in the parsing rules. At the moment my experimental text normalisation does exclude tags, but the default text parse does not. Currently the fallback content in media tags like <code><video></code> is added to the parsed text. 12:56, 25 September 2015 (UTC)
* [[User:Barnabywalters|Barnaby]] in theory, as the video and audio data by default can’t be included in plaintext properties, and the fallback content (much like img alt attributes) should be somehow human-readable and useful, I would suggest keeping it in plaintext properties. I’d like to see some real-world examples of what fallback content people are using — if it’s links or plaintext descriptions this approach could work well, if people are writing instructions saying “install flash” or “update your browser” it’s not going to produce very pretty results
* [[User:Barnabywalters|Barnaby]] in theory, as the video and audio data by default can’t be included in plaintext properties, and the fallback content (much like img alt attributes) should be somehow human-readable and useful, I would suggest keeping it in plaintext properties. I’d like to see some real-world examples of what fallback content people are using — if it’s links or plaintext descriptions this approach could work well, if people are writing instructions saying “install flash” or “update your browser” it’s not going to produce very pretty results
 +
* +1 in theory to stripping it out for the same reasons Tantek mentioned above.  I haven't tested this in go, but would be surprised if it had issues.  [[User:WillNorris|WillNorris]] 22:37, 5 June 2016 (UTC)
* ...
* ...
</div>
</div>
-
 
-
=== Standard datetime format ===
 
-
2015-07-28
 
-
 
-
http://microformats.org/wiki/microformats2-parsing#parsing_a_dt-_property does not specify any standard format to use for datetimes. e.g.  <pre>2015-07-28T12:55:33</pre> vs <pre>2015-07-28 12:55:33</pre>
 
-
Would be good to standardize this to compare various parser outputs.
 
-
 
-
2015-07-29: This subject is (somewhat) covered in http://microformats.org/wiki/iso-8601. As it stands, the JavaScript parsers support output in the 3 main profiles, 'W3C Note', 'RFC 3339' and 'HTML5', plus 'auto', which keeps the author's format. The default date output for the JavaScript parsers is the same format as the date was originally authored in. This can be changed by setting the options.dateFormat switch to any of the other profiles mentioned. It would be good if other parsers also had a switch to force output to a common profile so we could compare various parser outputs, but I think the default should be how a date was authored. All output, whatever the profile, should also keep the authored level of specificity, i.e. not adding minutes or seconds if they are not in the input date string. This is important if you want to compare parser outputs.
 
-
 
-
The only exception to this is where dates and times are combined, such as the implied h-event rule for dt-start and dt-end, where I output in the HTML5 style 2015-07-29 12:55:33, as there is no predefined author preference and the HTML5 profile is more human readable. [[User:GlennJones|Glenn Jones]] 11:02, 29 July 2015 (UTC)
 
-
 
-
<div class="discussion">
 
-
* +1 [[User:Tantek|Tantek]] output more human readable <code>2015-07-28 12:55:33</code> canonically, with authored level of specificity, i.e. not adding minutes or seconds if they are not in the input date string.
 
-
** Let's update any test cases as needed for this. - [[User:Tantek|Tantek]]
 
-
* ...
 
-
</div>
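For comparing parser outputs, the space-separated form suggested in the discussion above could be produced with something like the following Python sketch. It is one possible normalisation, not an agreed rule: <code>normalize_separator</code> is a hypothetical name, only the <code>YYYY-MM-DD</code> date form is handled, and everything after the separator is kept exactly as authored.

```python
import re

# A "T" that sits between a YYYY-MM-DD date and a two-digit hour.
_DATE_T = re.compile(r"^(\d{4}-\d{2}-\d{2})T(?=\d{2})")


def normalize_separator(dt_value):
    """Replace the date/time "T" separator with a single space.

    Minutes, seconds and timezone are left as authored, so the level of
    specificity in the input is preserved.
    """
    return _DATE_T.sub(r"\1 ", dt_value)
```

A value such as <code>2015-07-28T12:55</code> becomes <code>2015-07-28 12:55</code>, while a bare date is returned unchanged.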
 
-
 
-
 
-
=== implied date for dt properties both mf2 and backcompat ===
 
-
The [[value-class-pattern#microformats2_parsers|value class pattern dt-* date proposal]] should apply to both mf2 dt-* properties, and backcompat classic microformats, to preserve the hAtom / hCalendar optimizations noted on that page, but in a generic way.
 
-
 
-
<div class="discussion">
 
-
* +1 [[User:Tantek|Tantek]] 16:12, 18 August 2015 (UTC)
 
-
* +1 [[User:Glenn Jones|Glenn Jones]] 20 August 2015
 
-
</div>
 
-
 
-
2015-08-21: [[User:Glenn Jones|Glenn Jones]]
 
-
Now implemented in microformat-shiv; it can be tested at http://microformatshiv.com/editor.html
 
-
 
=== implied properties when an explicit class is provided ===
=== implied properties when an explicit class is provided ===
Line 373: Line 94:
**** Yet over-implied p-names appear to cause problems with many Bridgy webmention consuming use-cases[https://indiewebcamp.com/irc/2016-04-01/line/1459530869915], that's a good case - [[User:Tantek|Tantek]] 22:11, 1 April 2016 (UTC)
**** Yet over-implied p-names appear to cause problems with many Bridgy webmention consuming use-cases[https://indiewebcamp.com/irc/2016-04-01/line/1459530869915], that's a good case - [[User:Tantek|Tantek]] 22:11, 1 April 2016 (UTC)
**** may need a broader rule, like any explicit p-* property on an element stops implied p-name. - [[User:Tantek|Tantek]] 22:11, 1 April 2016 (UTC)
**** may need a broader rule, like any explicit p-* property on an element stops implied p-name. - [[User:Tantek|Tantek]] 22:11, 1 April 2016 (UTC)
-
*** ... provide input on this refined proposal here
+
**** p-name issue forked off and filed separately: https://github.com/microformats/microformats2-parsing/issues/6
 +
*** +1 consensus in the room at IWS 2016 (willnorris, gRegor, kylewm, tantek): do this specifically for implied URL, do nothing for now for name and photo. [[User:WillNorris|WillNorris]] 22:56, 5 June 2016 (UTC)
</div>
</div>
Line 482: Line 204:
Currently if I put u-* on an iframe it gets the value of the fallback text. This seems a shame. Getting the URL seems a sensible answer.
Currently if I put u-* on an iframe it gets the value of the fallback text. This seems a shame. Getting the URL seems a sensible answer.
[[User:Kevin Marks|Kevin Marks]] 09:07, 11 July 2015 (UTC)
[[User:Kevin Marks|Kevin Marks]] 09:07, 11 July 2015 (UTC)
 +
 +
<div class="discussion">
 +
* +1 from the room at IWS 2016 [[User:WillNorris|WillNorris]] 22:57, 5 June 2016 (UTC)
 +
</div>
=== i- parsing iframe src ===
=== i- parsing iframe src ===
Line 618: Line 344:
== resolved ==
== resolved ==
 +
Most recent resolved issues first:
 +
 +
=== exclude style elements before parsing ===
 +
2016-06-05 RESOLVED. 2016-07-14 spec updated.
 +
 +
[http://logs.glob.uno/?c=freenode%23microformats#c85457 2016-01-25 raised in #microformats]
 +
 +
Ran into an issue of a <style> element being parsed as plain text in a p-name. Should [[microformats2-parsing]] be updated to indicate <style> should be excluded when parsing? Appears to implicitly fall under [[microformats2-parsing#note_HTML_parsing_rules]]
 +
 +
Sample link: http://veganstraightedge.com/notes/2016/01/16/tonight-s-dinner-tacocleanse-beverly-hills-c
 +
 +
The <script> tag can be similarly problematic.
 +
 +
Proposal: Drop both <script> and <style> elements completely when parsing any property (including e-* HTML values). [[User:Tantek|Tantek]] 01:01, 29 February 2016 (UTC)
 +
 +
<div class="discussion">
 +
Please discuss and/or give +1/0/-1 feedback
 +
* +1 [[User:Tantek|Tantek]] as proposer
 +
* +1 [[User:Aaronpk|aaronpk]] as a consumer of HTML from an e-* property, I will always be sanitizing the HTML and removing <script> and <style> anyway
 +
* +1 [[User:Kylewm|kylewm]]
 +
* 0 [[User:Barnabywalters|Barnaby]] +1 to removing the contents of <script> and <style> from all plaintext properties (and 'value' property in HTML dicts), -1 to removing <script> and <style> from HTML. That’s a job for a sanitization stage. As aaronpk points out, sanitization will have to be done anyway if the content is to be reposted, so doing so in the parser doesn’t actually save anyone any work, but removes information which could be useful to people (example use cases: publishing posts with embedded per-post styling, publishing interactive HTML documents with embedded javascript)
 +
** +1 this seems like reasonable feedback to make a new refined proposal. [[User:Tantek|Tantek]] 20:37, 13 March 2016 (UTC)
 +
** +1 I like the revised proposal and am happy to change my vote to this [[User:Aaronpk|Aaronpk]] 21:16, 13 March 2016 (UTC)
 +
** +1 Totally agree with narrowing the proposal. All the problems I've had with script and style tags come from plaintext properties, and agree that they may even be useful to some consumers of the HTML properties (e.g. an embedded YouTube video) [[User:Kylewm|Kylewm]] 23:40, 13 March 2016 (UTC)
 +
</div>
 +
 +
Proposal 2: Drop both <script> and <style> elements completely when parsing any property (except for e-* HTML values, which preserve all markup). [[User:Tantek|Tantek]] 20:37, 13 March 2016 (UTC)
 +
 +
<div class="discussion">
 +
Please discuss and/or give +1/0/-1 feedback
 +
* +1 [[User:Tantek|Tantek]] as proposer
 +
* +1 [[User:Barnabywalters|Barnaby]]
 +
* +1 [[User:Kylewm|Kylewm]] leave sanitization to the sanitizers!
 +
* +1 [[User:GlennJones|Glenn]]
 +
* +1 implemented in go parser [[User:WillNorris|WillNorris]] 22:16, 5 June 2016 (UTC)
 +
</div>
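Proposal 2 can be sketched for plaintext values with Python's standard library HTML parser. The names below are invented for the example; the e-* "html" value is not touched by this sketch and would keep all markup, as the proposal says.

```python
from html.parser import HTMLParser


class TextWithoutScriptStyle(HTMLParser):
    """Collect text content, skipping <script> and <style> elements."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0  # greater than 0 while inside script/style

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)


def plaintext_value(fragment):
    """Return the text of a fragment with script/style content dropped."""
    parser = TextWithoutScriptStyle()
    parser.feed(fragment)
    parser.close()
    return "".join(parser.parts).strip()
```

A p-name whose element contains a <code><style></code> block then parses to the visible text only, which addresses the case that raised this issue.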
 +
 +
 +
=== default generated HTML ===
 +
2016-06-05 RESOLVED. No change to spec.
 +
 +
2015-09-08 raised by Tantek in #indiewebcamp
 +
 +
Should there be a default (perhaps not quite "canonical") way to map/generate HTML+microformats2 from a parsed mf2 JSON output?
 +
 +
E.g. straw proposal:
 +
* JSON/mf2 -> [[XOXO]]+mf2
 +
 +
Existing work / mappings:
 +
* https://granary-demo.appspot.com/?input=json-mf2&output=html#input
 +
* https://github.com/snarfed/granary/blob/master/granary/microformats2.py#L295
 +
 +
Related to:
 +
* https://github.com/snarfed/granary/issues/31
 +
 +
Use-cases:
 +
* default webview / presentation for a site that stores mf2 JSON output
 +
* possibly a way to implement a distributed HTTP webcache retrieval protocol
 +
 +
Thoughts?
 +
<div class="discussion">
 +
* +1 [[User:Tantek|Tantek]] I think we should have this, but am open to proposals on specifics!
 +
* +1 [[User:GlennJones|Glenn]] Also think this is worth looking at, but I am not sure it should be part of the parser spec. Feels like it should be built as a separate library and have its own spec on the microformats wiki.
 +
* +1 [[User:Barnabywalters|Barnaby]] agreed with Glenn, this would be a nice thing to have, but IMO it’s out of scope for the parser and should be specified separately. Personally I would probably implement it separately too, depending on how much work it is.
 +
* -1 [[User:Kylewm|Kylewm]] A pretty display would be a nice debugging tool, but I'm -1 the proposal to define a specific, default HTML output. The two proposed use-cases are totally buildable without it.
 +
* -1 Agree with Kyle above... this sounds like a great tool that someone should build and we could even publish "recommended" markup if you don't already have your own template, but this doesn't really belong in the mf2 spec itself. [[User:WillNorris|WillNorris]] 22:32, 5 June 2016 (UTC)
 +
</div>
 +
 +
 +
=== uf2 children on backcompat properties ===
 +
2016-06-05. RESOLVED. Verified 2016-07-14 [[microformats2-parsing#parse_an_element_for_class_microformats|parse an element for class microformats]] appears to already enforce this behavior. No additional spec changes made.
 +
 +
[http://logs.glob.uno/?c=freenode%23microformats&s=today#c84632 2015-11-24 raised by Calli] in #microformats
 +
 +
Related but different from [[#uf2_children_inside_a_classic_microformats_root_class_name]], when there is a uf2 child directly on a backcompat property, what should happen? E.g.
 +
 +
<source lang=html4strict>
 +
<div class="vcard">
 +
<div class="adr h-adr">
 +
  <div class="locality">MF1</div>
 +
  <div class="p-locality">MF2</div>
 +
</div>
 +
</div>
 +
</source>
 +
 +
What is the expected behavior and parser output?
 +
 +
<source lang=javascript>
 +
"items": [{
 +
  "type": ["h-card"],
 +
  "properties": {
 +
    "adr": [{
 +
      "value": "MF1MF2",
 +
      "type": ["h-adr"],
 +
      "properties": {
 +
        "locality": ["MF2"],
 +
        "name": ["MF1MF2"]
 +
      }
 +
    }]
 +
  } 
 +
}]
 +
</source>
 +
 +
Another example:
 +
 +
<source lang=html4strict>
 +
<div class="vcard">
 +
  <div class="adr h-acme-some-acme-object">
 +
    <div class="locality">MF1</div>
 +
    <div class="p-locality">MF2</div>
 +
  </div>
 +
</div>
 +
</source>
 +
 +
<source lang=javascript>
 +
"items": [{
 +
  "type": ["h-card"],
 +
  "properties": {
 +
    "adr": [{
 +
      "value": "MF1MF2",
 +
      "type": ["h-acme-some-acme-object"],
 +
      "properties": {
 +
        "locality": ["MF2"],
 +
        "name": ["MF1MF2"]
 +
      }
 +
    }]
 +
  } 
 +
}]
 +
</source>
 +
 +
Per the [[#any_h-_root_class_name_overrides_and_stops_backcompat_root]] resolution, the class name "h-acme-some-acme-object" overrides the use of "adr" as a backcompat root.
 +
 +
<div class="discussion">
 +
* Proposal: the nested "adr h-adr" child is treated as an mf2 object, not backcompat, and thus the resulting parsed "locality" property has a single value of "MF2". Proposed by Calli, noting that Glenn Jones's microformatshiv gets that result currently, and it would be easier for him (Calli) to implement this way.
 +
** +1 Tantek, seems reasonable and the reasoning provided is good (we have one implementation this way already)
 +
** +1 Kyle, this is consistent with the resolution to the related issue
 +
** +1 Calli, yes, this is easier for me to implement (than taking both MF1 and MF2 properties) because it is consistent - for me, consistency is the controlling factor in favor rather than ease of parser implementation
 +
** +1 Barnaby, php-mf2’s mf1 backcompat produces this exact result, and it makes a lot of sense to me
 +
** +1 makes sense [[User:WillNorris|WillNorris]] 22:23, 5 June 2016 (UTC)
 +
</div>
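The resolution above can be sketched as a small root-type lookup. This is a minimal illustration, not parser code from any library: the function name <code>root_types</code> and the three-entry <code>BACKCOMPAT_ROOTS</code> table are assumptions made for the example.

```python
# Hypothetical, deliberately incomplete table of classic root class names.
BACKCOMPAT_ROOTS = {"vcard": "h-card", "adr": "h-adr", "vevent": "h-event"}


def root_types(class_names):
    """Return the microformat types implied by an element's class names."""
    mf2_roots = [c for c in class_names if c.startswith("h-")]
    if mf2_roots:
        # Any h-* root class overrides and stops backcompat root parsing.
        return sorted(mf2_roots)
    # No mf2 root present: fall back to classic root class names.
    return sorted(BACKCOMPAT_ROOTS[c] for c in class_names if c in BACKCOMPAT_ROOTS)


print(root_types(["adr", "h-adr"]))
print(root_types(["adr", "h-acme-some-acme-object"]))
print(root_types(["adr"]))
```

In both examples from this issue the nested element is treated purely as an mf2 object, which is why only the "p-locality" value would be collected for it.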
 +
 +
=== use poster if no src on video for u props ===
 +
2016-06-05. RESOLVED. [http://microformats.org/wiki/index.php?title=microformats2-parsing&diff=65620&oldid=65604 SPEC UPDATED 2016-06-23].
 +
[https://indiewebcamp.com/irc/2015-12-13#t1450035721661 2015-12-13 raised in #indiewebcamp]
 +
 +
There is a use case for marking up the "poster" of a video element as the u-featured of an [[h-entry]]. To do that, we need to change [[microformats2-parsing#parsing_a_u-_property|u- property parsing]] to look at the poster attribute of the video element after it has looked for the src attribute.
 +
<blockquote>" else if video.u-x[poster], then get the poster attribute "</blockquote>
 +
 +
Real-world example of markup in the wild:
 +
* http://veganstraightedge.com/videos/2013/5/31/1/backyard-squirrel-buddy
 +
** and likely all other videos posted there.
 +
 +
Background discussion that led to this proposal:
 +
* https://indiewebcamp.com/irc/2015-12-13#t1450035721661
 +
 +
This seems very straightforward so I've added it as PROPOSED directly in the parsing spec. This issue is for tracking the discussion.
 +
 +
Feedback from parser implementers please!
 +
<div class="discussion">
 +
* +1 [[User:Barnabywalters|Barnaby]] easy to implement and based on real-world markup, no objections
 +
* +1 [[User:Kylewm|Kylewm]] sgtm
 +
* +1 [[User:GlennJones|Glenn]]
 +
* +1 implemented in go library [[User:WillNorris|WillNorris]] 22:19, 5 June 2016 (UTC)
 +
</div>
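A minimal sketch of the proposed ordering for a video element, assuming the attributes are already available as a plain dict. The function name <code>u_value_for_video</code> is hypothetical, and relative URL resolution is left out.

```python
def u_value_for_video(attrs):
    """Pick the u- property value for a <video class="u-x"> element.

    attrs is a dict of attribute name -> value (an assumption of this sketch).
    """
    if "src" in attrs:
        return attrs["src"]
    if "poster" in attrs:
        # Proposed addition: only consulted when there is no src attribute.
        return attrs["poster"]
    return None  # the caller continues with the remaining u- parsing rules


print(u_value_for_video({"poster": "/squirrel.jpg"}))
print(u_value_for_video({"src": "/squirrel.mp4", "poster": "/squirrel.jpg"}))
```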
 +
 +
=== de-dupe URLs? ===
 +
2016-06-05. REJECTED. NO SPEC CHANGE.
 +
 +
Currently, Known templates end up linking to the author's url in the h-card twice. This leads to duplicate URLs in the parsed output, which makes jf2 conversion insert a children element.
 +
Should we be deduping URLs? Or is this a GIGO issue?
 +
 +
<div class="discussion">
 +
* -1 [[User:Kylewm|Kylewm]] I can't necessarily think of a case where two of the same URL values is useful, but it feels like the parser's job to preserve the fidelity of the input. (this has been fixed in Known's markup btw [https://github.com/idno/Known/issues/1372])
 +
* -1 [[User:Tantek|Tantek]] on de-duping for mf2 json. jf2 can do what it prefers, no specific opinion on that.
 +
* -1 [[User:GlennJones|Glenn]] for both the reasons mentioned already
 +
* -1 willnorris
 +
</div>
 +
 +
=== img fallback in p- ===
 +
2016-06-05. ACCEPTED. SPEC UPDATED.
 +
 +
Trying to make an author h-card without too many extra elements I first did:
 +
<pre><div class="p-author h-card">
 +
<a href="/" class="p-org p-name"><img class="u-logo" src="/static/logo.jpg" >mention.tech</a>
 +
</div></pre>
 +
 +
rather than:
 +
<pre><div class="p-author h-card">
 +
<a href="/" class="p-org p-name"><img class="u-logo" src="/static/logo.jpg" alt="">mention.tech</a>
 +
</div></pre>
 +
 +
I was surprised that the p-name and p-org  took the src and the plaintext and concatenated them giving <code>http://mention-tech.appspot.com/static/logo.jpgmention.tech</code>, though that is the current spec (a separate php-mf2 bug ignored the empty alt when I added it).
 +
 +
While this is what the spec says, I can't think of a scenario where concatenating a string to a URL gives a useful result. Instead:
 +
 +
Proposal:
 +
* If we fall back on the src of an img because it has no alt, I propose we put a space at the beginning and end. As whitespace is stripped from the beginning and end of p- values, this should still give the URL in the simplest case, but avoid creating nonsensical URLs in cases like this.
 +
 +
<div class="discussion">
 +
* +1 [[User:Kylewm|Kylewm]] reasoning given here makes sense
 +
* +1 [[User:Tantek|Tantek]] agreed with reasoning
 +
* +1 [[User:GRegorLove|gRegor]] agreed
 +
* +1 totally makes sense to me [[User:WillNorris|WillNorris]] 21:21, 5 June 2016 (UTC)
 +
* ...
 +
</div>
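The proposal can be sketched with Python's standard <code>html.parser</code>. This is an illustration under assumptions: the class and function names are made up for the example, relative URLs are not resolved, and script/style handling is omitted.

```python
from html.parser import HTMLParser


class PTextExtractor(HTMLParser):
    """Collect text for a p- property, replacing img elements inline."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attrs = dict(attrs)
        if "alt" in attrs:
            # An authored alt (even an empty one) replaces the image.
            self.parts.append(attrs["alt"] or "")
        elif "src" in attrs:
            # Proposed change: pad the src fallback with spaces so it cannot
            # fuse with adjacent text; outer whitespace is trimmed later.
            self.parts.append(" " + attrs["src"] + " ")

    def handle_data(self, data):
        self.parts.append(data)


def p_value(html):
    extractor = PTextExtractor()
    extractor.feed(html)
    extractor.close()
    return "".join(extractor.parts).strip()


print(p_value('<a href="/" class="p-org p-name"><img class="u-logo" src="/static/logo.jpg" >mention.tech</a>'))
```

With the padding, the src and the text stay separated by a space, while an img that is the only content still yields just its src after trimming.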
 +
 +
=== namespacing for better integrability ===
 +
2016-06-05. REJECTED.
 +
 +
All the implied class names may conflict with existing stylesheets, because the prefixes used are too short and are not proper namespaces for what follows them ("p-", "u-", "e-", "h-", "dt-", "x-", ...) and too many of these short prefixes are used.
 +
 +
You should add support for namespacing with an arbitrary "MYCARD" name, using a second class on the same root element that uses class "h-card":
 +
 +
<pre><div class="p-author h-card h-card-ns-MYCARD"><!-- this defines the "MYCARD" namespace used below -->
 +
<a href="/" class="MYCARD-p-org MYCARD-p-name"><img class="MYCARD-u-logo" src="/static/logo.jpg" alt="">mention.tech</a>
 +
<p class="e-form"><!-- "e-form" is not recognized, because not in a known namespace -->
 +
...
 +
</p>
 +
</div></pre>
 +
 +
This is important because tools are autogenerating class names and stylesheets for HTML and associating them with other functions not intended for vCards.
 +
 +
In fact this support should be added in ALL microformats, not just for vCards...
 +
 +
And this would greatly reduce the ambiguities in microformat parsers by allowing them to be more selective. In fact, with the namespace used as a common prefix for all properties, parsers could be faster. Additionally, it would allow easier editing of vcards in HTML, for operations like find/replace, or even for automated replacements using regexp searches.
 +
 +
It would also allow nested vcards created by different tools using their own private extensions not to conflict with each other on these extensions, if they can be properly namespaced.
 +
 +
Note: these defined namespaces are automatically replaceable by parsers if they regenerate a new composite document (they could be removed by tools if there are no conflicts, or shortened, or made unique by changing them to another arbitrary name).
 +
 +
The other solution would be to use namespaces on the HTML attribute names themselves, notably class:
 +
 +
<pre><div class="p-author h-card h-card-ns-MYCARD"><!-- this defines the "MYCARD" namespace used below -->
 +
<a href="/" MYCARD:class="p-org p-name"><img MYCARD:class="u-logo" src="/static/logo.jpg" alt="">mention.tech</a>
 +
<p class="e-form"><!-- "e-form" is not recognized, because not in a known namespace -->
 +
...
 +
</p>
 +
</div></pre>
 +
 +
But this solution will not work reliably in strict XHTML or XML parsers if there's no XML namespace definition, or this could invalidate the document on basic DOM parsers for HTML (e.g. in MediaWiki, unknown HTML attributes are discarded so that MYCARD:class="..." would not appear at all in the final HTML, only class="..." is accepted).
 +
 +
Note: this would also cleanly solve problems like the one described in [[#ignore u-camelCase properties]] below!
 +
 +
Finally, it would allow multiple microformats to coexist in the same document. Only the root element is distinctive, but the "p-*", "u-*", "dt-*" elements will collide: which microformat should interpret them? This is easy to solve by assigning to the root ("h-<microformat>") element of each microformat a namespace that will be used in its content, such as "h-card-ns-MYCARD" for assigning the "MYCARD" namespace to the "h-card" microformat, or "h-goog-doc-ns-MYDOC" to assign the "MYDOC" namespace to the "h-goog-doc" microformat that google may want to develop for Google Docs, or "h-x-doubleclick-X78954218" for assigning the "X78954218" namespace that would be used in a "x-doubleclick" custom microformat developed by doubleclick with contents using "X78954218-p-*", "X78954218-u-*", "X78954218-e-*", "X78954218-dt-*".
 +
 +
[[User:Verdy p|Verdy p]] 03:54, 1 June 2016 (UTC)
 +
 +
<div class="discussion">
 +
* -1 [[User:GRegorLove|gRegor]]: see [[namespaces-considered-harmful]]; also seems to solve only hypothetical problems. Are there real-world parsing collision examples?
 +
* -1 agreed with gRegor above. I would certainly want to see real-world parsing problems before adding such a heavyweight "solution". [[User:WillNorris|WillNorris]] 21:16, 5 June 2016 (UTC)
 +
* -1 [[User:Tantek|Tantek]]: historically none of these namespace setting/using proposals have actually survived in the wild on the web, they all get co-opted to treating the shorthands/prefixes in a hardcoded way, e.g. og: etc. All evidence to date is against such proposals, plus there's no concrete examples provided to motivate this change, only theory.
 +
</div>
 +
 +
=== consistent implied name url from grandchildren of root ===
 +
2016-06-05. ACCEPTED. SPEC UPDATED.
 +
 +
See https://github.com/microformats/tests/issues/50
 +
 +
Summary:
 +
<blockquote>
 +
Proposal to update spec to include the following at the end of implied url parsing rules:
 +
* else if .h-x>:only-child>a[href]:only-of-type:not[.h-*] then use that [href] for url
 +
* else if .h-x>:only-child>area[href]:only-of-type:not[.h-*] then use that [href] for url
 +
these are identical to the existing rules with the addition of the :only-child selector.
 +
</blockquote>
 +
 +
<div class="discussion">
 +
* +1 [[User:Tantek|Tantek]]
 +
* +1 [[User:WillNorris|willnorris]]
 +
* +1 gRegor
 +
* +1 kylewm
 +
</div>
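The added rule can be sketched with <code>xml.etree.ElementTree</code> on well-formed markup. This is only an illustration of the selector as quoted above; the function names are hypothetical, and a real parser would run this after the existing implied url rules and resolve relative URLs.

```python
import xml.etree.ElementTree as ET


def has_root_class(element):
    """True if the element carries any h-* class name."""
    return any(c.startswith("h-") for c in element.get("class", "").split())


def implied_url_from_grandchild(root):
    """.h-x>:only-child>a[href]:only-of-type:not[.h-*], then the same for area."""
    children = list(root)
    if len(children) != 1:
        return None  # the intermediate element must be an only child
    only_child = children[0]
    for tag in ("a", "area"):
        same_type = [e for e in only_child if e.tag == tag]
        if len(same_type) == 1:  # :only-of-type
            link = same_type[0]
            if link.get("href") is not None and not has_root_class(link):
                return link.get("href")
    return None


root = ET.fromstring('<div class="h-card"><p><a href="http://example.com/">Name</a></p></div>')
print(implied_url_from_grandchild(root))
```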
 +
 +
=== need more not h-* to avoid child root implying properties ===
 +
2016-06-05. ACCEPTED. SPEC UPDATED.
 +
 +
See https://github.com/microformats/tests/issues/52 for an example of this
 +
 +
Proposal:
 +
* any element being used to imply a property
 +
* any intermediate :only-child to get to a grandchild element to imply a property
 +
Should also be restricted to :not[.h-*]
 +
 +
E.g. <code>>:only-child></code> should be <code>>:only-child:not[.h-*]></code>
 +
 +
<div class="discussion">
 +
* +1 [[User:Tantek|Tantek]]
 +
* +1 willnorris
 +
* +1 gRegor
 +
* +1 kylewm
 +
</div>
 +
 +
=== Standard datetime format ===
 +
2016-06-04 (before). REJECTED.
 +
 +
2015-07-28
 +
 +
http://microformats.org/wiki/microformats2-parsing#parsing_a_dt-_property does not specify any standard format to use for datetimes. e.g.  <pre>2015-07-28T12:55:33</pre> vs <pre>2015-07-28 12:55:33</pre>
 +
Would be good to standardize this to compare various parser outputs.
 +
 +
2015-07-29: This subject is (somewhat) covered in http://microformats.org/wiki/iso-8601. As it stands, the JavaScript parsers support output in the 3 main profiles, 'W3C Note', 'RFC 3339' and 'HTML5', plus 'auto', which keeps the author's format. The default date output for the JavaScript parsers is the same format as the date was originally authored in. This can be changed by setting the options.dateFormat switch to any of the other profiles mentioned. It would be good if other parsers also had a switch to force output to a common profile so we could compare various parser outputs, but I think the default should be how a date was authored. All output, whatever the profile, should also keep the authored level of specificity, i.e. not adding minutes or seconds if they are not in the input date string. This is important if you want to compare parser outputs.
 +
 +
The only exception to this is where dates and times are combined, such as the implied h-event rule for dt-start and dt-end, where I output in the HTML5 style 2015-07-29 12:55:33, as there is no predefined author preference and the HTML5 profile is more human readable. [[User:GlennJones|Glenn Jones]] 11:02, 29 July 2015 (UTC)
 +
 +
<div class="discussion">
 +
* -1 [[User:Tantek|Tantek]] we are maintaining whole properties as authored, with authored level of specificity, i.e. not adding minutes or seconds if they are not in the input date string, and [[vcp]] cases handled in separate issue.
 +
** Consensus in room at IWC 2016 session also. Resolving accordingly.
 +
</div>
 +
 +
 +
=== implied date for dt properties both mf2 and backcompat ===
 +
2016-06-04 (before). ACCEPTED. SPEC UPDATED.
 +
 +
The [[value-class-pattern#microformats2_parsers|value class pattern dt-* date proposal]] should apply to both mf2 dt-* properties, and backcompat classic microformats, to preserve the hAtom / hCalendar optimizations noted on that page, but in a generic way.
 +
 +
<div class="discussion">
 +
* +1 [[User:Tantek|Tantek]] 16:12, 18 August 2015 (UTC)
 +
* +1 [[User:Glenn Jones|Glenn Jones]] 20 August 2015
 +
</div>
 +
 +
2015-08-21: [[User:Glenn Jones|Glenn Jones]]
 +
Now implemented in microformat-shiv; it can be tested at http://microformatshiv.com/editor.html
 +
 +
And [[vcp]] updated too. [[User:Tantek|Tantek]] 22:53, 5 June 2016 (UTC)
 +
 +
=== implied name when alt="" ===
 +
 +
The implied name rule
 +
 +
* else if .h-x>img:only-child[alt]:not[.h-*] then use that img alt for name
 +
 +
is slightly under-specified for the case where alt is provided but intentionally blank. The desired behavior is to use the img alt tag only if it is non-empty. For example:
 +
 +
<pre>
 +
<a class="h-card" href="https://kylewm.com">
 +
  <img src="https://kylewm.com/photo.jpg" alt="">
 +
  Kyle
 +
</a>
 +
</pre>
 +
 +
 +
The PHP and JS parsers already seem to return the desired result ("Kyle" in the above example). The Python parser uses the alt text and returns "".
 +
 +
Proposal: modify the spec to explicitly exclude these tags:
 +
 +
* else if .h-x>img:only-child[alt]:not([alt=""]):not[.h-*] then use that img alt for name
 +
 +
And audit the other implied rules for similar cases.
 +
 +
<div class="discussion">
 +
* +1 [[User:Tantek|Tantek]] this makes sense to me, and as far as I can tell, for the other cases too for *implied* properties:
 +
** area[alt], abbr[title], and all other attributes where there is an existence test, there should be a :not[alt=""] empty test, for implied p-name, u-photo, u-url
 +
* +1 [[User:GRegorLove|gRegor]] sounds good to me.
 +
* +1 [[User:Kylewm|Kylewm]] We've added this in mf2py too now, and I'm happy with it.
 +
* +1 [[User:GlennJones|Glenn]] This is often defined by the underlying HTML parsing library, which will remove attributes that do not have a value.
 +
* ...
 +
</div>
 +
 +
* 2016-05-30 Resolved and incorporated into [[microformats2-parsing]]. [[User:Tantek|Tantek]] 23:32, 30 May 2016 (UTC)
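A minimal sketch of the non-empty test, assuming the img attributes are available as a dict and that the later implied name rules are represented by the root's text content. The function name is hypothetical.

```python
def implied_name(img_attrs, text_content):
    """Use img alt for the implied name only when it is present and non-empty."""
    alt = img_attrs.get("alt")
    if alt:
        return alt
    # alt missing or intentionally blank: fall through to the later rules,
    # represented in this sketch by the trimmed text content of the root.
    return text_content.strip()


print(implied_name({"src": "https://kylewm.com/photo.jpg", "alt": ""}, "\n  \n  Kyle\n"))
```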
 +
 +
=== parsing a dt- property ===
 +
* Should instruct to replace a "T" separator with a single space.
 +
* Value-class-pattern parsing should instruct to use a single space as the separator.
 +
* Value-class-pattern should instruct to keep the authored level of specificity, rather than implying 00 seconds when not present. http://microformats.org/wiki/value-class-pattern##If+by+parsing+the+%22value%22
 +
 +
Log: https://indiewebcamp.com/irc/2016-04-25#t1461606553653
 +
 +
<div class="discussion">
 +
* +0 [[User:Kylewm|Kylewm]] on replacing "T" as the separator. Would you please clarify whether that is only for value class pattern/assembling dates from components, or is it proposing to *always* normalize dt's?
 +
** +1 [[User:Tantek|Tantek]] definitely value class pattern/assembling dates from components should use " " instead of "T" as separator.
 +
** +0 [[User:Tantek|Tantek]] slight pref (but unsure) for replace a "T" separator with a single space in other dt-* parsing.
 +
** +1 [[User:GlennJones|Glenn]] happy to move to single space separator for dates built from the value-class pattern.
 +
** -1 [[User:GlennJones|Glenn]] I think we should pass through the authored format of a date as default output. We should process the content as little as possible, so it is as authored. We can then add parser options to force one of the date formats such as ISO profiles HTML5 or W3C if we need consistency. This is the approach I have taken.
 +
** See related specific issue: [[microformats2-parsing-issues#Standard_datetime_format]]
 +
* +1 [[User:Kylewm|Kylewm]] on not implying seconds
 +
* +1 [[User:GlennJones|Glenn]] on not implying seconds. Authored level of specificity should always be kept in dates.
 +
 +
</div>
 +
 +
Consensus resolutions:
 +
* '''Drop value-class-pattern implying 00 seconds.''' Note: keeping/implying 00 minutes due to common human usage of whole hours to specifically mean "on the hour" which is 00 minutes. The same implied precision does not exist for seconds in practice.
 +
** 2016-05-18 [[vcp]] updated with this resolution.
 +
* '''Value-class-pattern parsing should instruct to use a single space as the separator.'''
 +
** 2016-05-18 [[vcp]] updated with this resolution.
 +
 +
Dropped:
 +
* replace a "T" separator (when authored that way) with a single space
 +
** No consensus on this, some opposition. Prefer to "process the content as little as possible, so it is as authored".
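The two consensus resolutions can be sketched as follows. This assumes the date and time parts were already extracted by value-class-pattern parsing and that the time part is in HH, HH:MM or HH:MM:SS form with no timezone suffix; the function name is made up for the example.

```python
def assemble_datetime(date_part, time_part):
    """Join separately authored date and time values from the value class pattern."""
    if ":" not in time_part:
        # Whole hours are read as "on the hour", so 00 minutes is implied.
        time_part = time_part + ":00"
    # Seconds are never implied: the authored level of specificity is kept.
    # A single space (not "T") separates the assembled parts.
    return date_part + " " + time_part


print(assemble_datetime("2015-07-29", "12:55"))
print(assemble_datetime("2015-07-29", "12"))
```

A dt- value authored as one string (for example with a "T" separator) is not touched by this step, in line with the dropped proposal above.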
 +
 +
=== ignore u-camelCase properties ===
 +
RESOLVED. [http://microformats.org/wiki/index.php?title=microformats2-parsing&diff=65418&oldid=65356 SPEC UPDATED 2016-02-29].
 +
 +
Due to Suit CSS (and others? citations?) recent (2015-?) use of "u-*" class names for so-called "[http://davidtheclark.com/on-utility-classes/ utility classes]", we are seeing some false positives in a few very rare instances, e.g.: [http://www.unmung.com/mf2?url=http%3A%2F%2Fwww.kevinmarks.com%2Ftwitterutils.html&html=&pretty=on this twitter markup]
 +
 +
(Nearly) all these "utility classes" use camelCase for the class name suffixes, thus we can filter them out by looking for camelCase (since microformats class name conventions are always all lowercase and hyphenated), or even just looking for (and rejecting) *any* capital letters.
 +
 +
For your own site, it might be a good idea to prefix the "utility classes" e.g. [http://danmall.me/articles/cooking-with-design-systems/ Cooking with Design Systems by Dan Mall]
 +
 +
Proposal:
 +
* microformats2 parsers MUST IGNORE u-* classnames where the * has any uppercase letter(s).
 +
 +
<div class="discussion">
 +
* +1 [[User:Tantek|Tantek]] Let's get this fix rolling quickly to avoid further pollution.
 +
* +1 [[User:Barnabywalters|Barnaby]] php-mf2 already ignores classnames with capitalised prefixes, ignoring any classnames with capital letters seems totally reasonable
 +
* +1 [[User:Kylewm|Kylewm]] agree with rejecting property names that include capital letters (specifically detecting camelCase seems harder to define)
 +
* +1 [[User:GlennJones|Glenn]] agreed, a simple change which should help avoid further pollution
 +
* +1 (see also below) [[User:WillNorris|WillNorris]] 22:15, 5 June 2016 (UTC)
 +
</div>
 +
 +
Additional proposal: (same reasoning, filter out more crap)
 +
* microformats2 parsers MUST IGNORE all property classnames where the property name has any capital letters or numerals (0-9).
 +
* microformats2 class names MUST contain only lowercase letters and dash: /[a-z\-]/
 +
 +
<div class="discussion">
 +
* +1 [[User:Tantek|Tantek]] Let's get this fix rolling quickly to avoid further pollution.
 +
* +1 I've already implemented this in the go library, additionally extending it to all properties, not just u-* [[User:WillNorris|WillNorris]] 22:15, 5 June 2016 (UTC)
 +
</div>
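A sketch of the filtering proposed above, applied to all property prefixes. The regular expression is an assumption of this example: it is slightly stricter than /[a-z\-]/ in that it also rejects leading, trailing, and doubled dashes.

```python
import re

# Prefix, then lowercase words separated by single dashes (assumed pattern).
PROPERTY_CLASS = re.compile(r"^(p|u|dt|e)-([a-z]+(?:-[a-z]+)*)$")


def property_names(class_attr):
    """Return (prefix, name) pairs for class names that look like mf2 properties."""
    names = []
    for class_name in class_attr.split():
        match = PROPERTY_CLASS.match(class_name)
        if match:
            names.append((match.group(1), match.group(2)))
    return names


print(property_names("u-url u-textLeft u-size1of2 p-name"))
```

Utility classes such as "u-textLeft" or "u-size1of2" are skipped because they contain capital letters or numerals.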
 +
=== When to collapse whitespace in properties ===
** added "on that same element" as that was what we were discussing/implying in this issue. [[User:Tantek|Tantek]] 22:49, 18 September 2015 (UTC)
</div>
</div>
 +
 +
Example:
 +
 +
<source lang=html4strict>
 +
<div class="adr h-adr">
 +
  <div class="locality">MF1</div>
 +
  <div class="p-locality">MF2</div>
 +
</div>
 +
</source>
 +
 +
Expected parser output:
 +
 +
<source lang=javascript>
 +
"items": [{
 +
  "type": ["h-adr"],
 +
  "properties": {
 +
    "locality": ["MF2"],
 +
    "name": ["MF1MF2"]
 +
  }
 +
}]
 +
</source>
 +
 +
 +
Or with a custom root mf2 classname:
 +
 +
<source lang=html4strict>
 +
<div class="adr h-acme-address">
 +
  <div class="locality">MF1</div>
 +
  <div class="p-locality">MF2</div>
 +
</div>
 +
</source>
 +
 +
Expected parser output:
 +
 +
<source lang=javascript>
 +
"items": [{
 +
  "type": ["h-acme-address"],
 +
  "properties": {
 +
    "locality": ["MF2"],
 +
    "name": ["MF1MF2"]
 +
  }
 +
}]
 +
</source>
=== backcompat classic microformats should only see backcompat properties ===

Current revision

This page documents issues with the microformats2-parsing specification before 2016-06-20.

See https://github.com/microformats/microformats2-parsing/issues for current and new issues!

See change control for how to move issues forward.

Note: all current issues resolved as of 2016-06-04

Contents

issues

Open issues in various states of partial resolution from none to nearly resolved.

unicode generation in JSON

STATUS: 2016-06-05 apparent implementers consensus at IndieWeb Summit issues resolution session.

Currently we will convert HTML entities into unicode as part of the parsing process. However, these and other non-ASCII characters can be output as escaped unicode in the generated JSON. Broadly this is OK, as we assume JSON parsers should be able to handle this accordingly. However, it does mean the text is somewhat ambiguous and unclear, especially when complex unicode codepoints like emoji are involved.

Secondarily, when the parsed output of an e- element is presented, having \u escaped text in the HTML is not really valid, and utf8 would be preferred. That way the JSON output could safely pass through a naive string concatenation model as well as a valid unicode decoder (some languages do not cope with astral plane unicode well, yet utf8 safely encodes them).

See https://github.com/tommorris/mf2py/issues/65 for further discussion of this.

I have tweaked unmung to output utf8 instead for inline data entry, eg: this case

  • +1 Kylewm with a caveat.
    • Returned JSON SHOULD (not MUST) be UTF8 rather than ASCII with \u encoding because it is easier to read and debug.
    • When parsing e- properties, HTML entities should be left escaped in the "html" value. This is important when parsing a reply-context; if the original post contains an escaped HTML code snippet, I want the reply context to show the same code snippet, rather than converting it all into real tags.
    • e.g. "content": [{"html": "&lt;b&gt;1&mdash;2&lt;/b&gt;", "value": "<b>1—2</b>"}].
  • +0 willnorris (+1 to saying that microformats parsers should standardize on UTF-8 for e-* text, however I feel like e-* html should be left as unscathed as possible. html encoding may harken to a time before UTF-8, but if the content was authored that way, we shouldn't necessarily be changing that)
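The difference between the two JSON output styles can be shown with Python's standard json module. The sample data below is made up for the illustration; both strings decode to the same value.

```python
import json

# Hypothetical parsed output containing an accented letter and an emoji.
parsed = {"items": [{"type": ["h-entry"],
                     "properties": {"name": ["caf\u00e9 \U0001F600"]}}]}

# Default: non-ASCII characters become \u escapes (astral characters become
# surrogate pairs), which is valid JSON but hard to read.
escaped = json.dumps(parsed)

# ensure_ascii=False keeps the characters as-is, so the result can be
# written out as UTF-8 and read directly when debugging.
readable = json.dumps(parsed, ensure_ascii=False)

print(escaped)
print(readable)
print(json.loads(escaped) == json.loads(readable))
```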

Noscript skip/parse

2015-07-28

Should mf parsers skip <noscript> tag in the HTML, like the <template> tag mentioned in http://microformats.org/wiki/microformats2-parsing#note_HTML_parsing_rules ?

mf2py skips <noscript> when using the html5lib DOM parser but not when using the lxml parser. Example use of <noscript>: https://kartikprabhu.com/ featured images have a no-JavaScript fallback image inside <noscript> with class='u-featured' markup.

  • +1 Tantek skip noscript (and inside) because in today's typical browsing contexts, nothing in noscript is displayed, thus we should discourage marking up effectively invisible content.
  • -1 Glenn This subject does need to be addressed, but differently from the proposed change. My personal view is that e-* html should be passed through raw and then the consumer can process it in a way they see fit. It's a case for helper libraries. I am about to build a helper library to do this based on the Readability code to post-process e-* html. Other people may want to take different approaches; defining this in the spec feels like a move into a whole area of new functionality.
  • 0 Barnaby to me this should be treated the same as the script/style contents — removed completely from all plaintext properties, but left unaltered in raw HTML.
  • +0 Kylewm Is it possible a js-only client-rendered site would want to serve microformats in a noscript block? I know we encourage people to do better progressive-enhancement than that, but not everyone does, and I'd prefer it to no microformats.

As a separate new point, we need to consider "exclude tags" lists for parsed text from html. We should consider <noscript>, <noframes> and <template>; there may be others, as I have not gone through all the tags in the current HTML spec. Also we should consider what to do about the more common pattern of fallback text within media tags <video>, <audio> etc. This should be explicitly discussed in the parsing rules. At the moment my experimental text normalisation does exclude tags, but the default text parse does not. Currently the fallback content in media tags like <video> is added to the parsed text. 12:56, 25 September 2015 (UTC)

  • Barnaby in theory, as the video and audio data by default can’t be included in plaintext properties, and the fallback content (much like img alt attributes) should be somehow human-readable and useful, I would suggest keeping it in plaintext properties. I’d like to see some real-world examples of what fallback content people are using — if it’s links or plaintext descriptions this approach could work well, if people are writing instructions saying “install flash” or “update your browser” it’s not going to produce very pretty results
  • +1 in theory to stripping it out for the same reasons Tantek mentioned above. I haven't tested this in go, but would be surprised if it had issues. WillNorris 22:37, 5 June 2016 (UTC)
  • ...

implied properties when an explicit class is provided

Should "u-url" still be implied if another explicit class is already provided?

Should "p-name" still be implied if another explicit class is already provided?

Here is a somewhat contrived "u-url" related example, taken from Bridgy's unit tests.

<article class="h-entry">
  <a class="u-like-of" href="http://orig.domain/baz">liked this</a>
</article>

In this case, http://orig.domain/baz is almost certainly not the u-url, so IMO it would be better to leave it out —Kylewm 15:10, 7 October 2014 (UTC)

2015-01-20 consensus

  • Changed my mind. Simpler to do nothing. Example provided is artificially constructed, does not reflect likely real world confusion of if we make implied properties more complicated. Tantek 06:26, 21 January 2015 (UTC)
  • ++ Consensus on do nothing for this case. At 2015-01-20

Proposed resolution:

  • Changed again. Due to indiewebcamp.com/edit use-case, this now makes sense for all implied properties. That is:
    • If an element has any explicit property class name(s) on it, then it must not be used to imply any properties. Tantek 20:50, 27 May 2015 (UTC)
      • +1 this seems reasonable, if a publisher is going to add an mf2 class, it is unlikely they want other classes automatically implied from the same value Aaronpk 23:19, 29 November 2015 (UTC)
      • -1 for now. To my knowledge, this has only been observed in artificially constructed unit tests and examples, and it adds some weird edge cases that are hard to reason about. Kylewm 00:46, 1 December 2015 (UTC)
      • -0 failed consensus, this proposal is rejected. - Tantek 22:11, 1 April 2016 (UTC)
    • Refined: Or should this be refined per parsing prefix? Tantek 22:59, 18 September 2015 (UTC)
      • Any explicit "p-*" property means no implied "p-name" from that element
      • Any explicit "u-*" property means no implied "u-url" nor "u-photo" from that element.
      • ... provide input on this refined proposal here
    • Suggestion: split this issue up per property. I think it makes sense for u-url and can be easily added to the spec as e.g. ".h-x>a[href]:only-of-type:not[.h-*,.u-*]". Kylewm 00:46, 1 December 2015 (UTC)
      • Obviously wrong to assume a link explicitly pointing elsewhere is our "url"
      • More difficult to make a case for photo
        • though if there is an img with u-featured but not u-photo, it's likely that was an explicit author decision (e.g. for an article) - Tantek 22:11, 1 April 2016 (UTC)
      • Especially difficult to make a case for name.
        • "name" is implied at least by [.h-x textContent], so often the excluded element would end up included anyway.
        • Yet over-implied p-names appear to cause problems with many Bridgy webmention consuming use-cases[1], that's a good case - Tantek 22:11, 1 April 2016 (UTC)
        • may need a broader rule, like any explicit p-* property on an element stops implied p-name. - Tantek 22:11, 1 April 2016 (UTC)
        • p-name issue forked off and filed separately: https://github.com/microformats/microformats2-parsing/issues/6
      • +1 consensus in the room at IWS 2016 (willnorris, gRegor, kylewm, tantek): do this specifically for implied URL, do nothing for now for name and photo. WillNorris 22:56, 5 June 2016 (UTC)
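The refined selector ".h-x>a[href]:only-of-type:not[.h-*,.u-*]" adds one check on the candidate link's class names, sketched below. The function name is hypothetical, and the rest of the implied url rule (only-of-type, href present) is assumed to be handled by the caller.

```python
def may_imply_url(class_attr):
    """True if a link with these class names may still be used for implied url."""
    class_names = class_attr.split()
    # Neither a nested root (h-*) nor an explicit u-* property may be present.
    return not any(c.startswith(("h-", "u-")) for c in class_names)


print(may_imply_url("u-like-of"))
print(may_imply_url(""))
```

For the Bridgy example above, the "u-like-of" link is excluded, so no url is implied for the h-entry.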

whitespace collapsing revisited

2015-05-27: (raised by Kevin Marks per Glenn Jones)

Revising the microformats tests to conform to the "don't collapse whitespace" rule below reveals some non-intuitive cases. Preserving whitespace in addresses is somewhat defensible, but in an implied name it is often unhelpful, as it preserves whitespace that is not visible to users and is there only for authoring reasons.

For example: this test shows how extraneous whitespace ends up in the name

<div class="h-review-aggregate">
    <div class="p-item h-event">
        <h3 class="p-name">Fullfrontal</h3>
        <p class="p-description">A one day JavaScript Conference held in Brighton</p>
        <p><time class="dt-start" datetime="2012-11-09">9th November 2012</time></p>    
    </div> 
    
    <p class="p-rating">
        <span class="p-average value">9.9</span> out of 
        <span class="p-best">10</span> 
        based on <span class="p-count">62</span> reviews
    </p>
</div>

gives a parsed result of:

{
    "items": [{
        "type": ["h-review-aggregate"],
        "properties": {
            "item": [{
                "value": "Fullfrontal\nA one day JavaScript Conference held in Brighton\n9th November 2012",
                "type": ["h-event"],
                "properties": {
                    "name": ["Fullfrontal"],
                    "description": ["A one day JavaScript Conference held in Brighton"],
                    "start": ["2012-11-09"]
                }
            }],
            "rating": ["9.9"],
            "average": ["9.9"],
            "best": ["10"],
            "count": ["62"],
            "name": ["Fullfrontal\nA one day JavaScript Conference held in Brighton\n9th November 2012\n\n\n9.9 out of \n        10 \n        based on 62 reviews"]
        }
    }],
    "rels": {}
}

The value is a reasonable textual representation of the event, but the implied name is full of spurious whitespace that any consumer would have to strip.

h-review has similar issues

2015-05-28: (Addition by Glenn Jones)

The example below shows the type of markup most affected by the "implied name" and "don't collapse whitespace" rules working together to produce output that is hard to use without further processing.

<a class="h-card" href="http://glennjones.net">
     <span class="p-given-name">Glenn</span>
     <span class="p-family-name">Jones</span>
</a>

The output for the name property from the above HTML would be Glenn\r\n Jones using the trimming of leading/trailing whitespace suggested in the parsing spec. I could of course move the spans onto one line, but it feels fragile to make HTML whitespace-sensitive like this, and HTML templating environments often take that level of whitespace control away from authors anyway.

There are issues with both approaches: keeping whitespace, returns and tabs from parsed HTML, or collapsing that whitespace. If we return the whitespace, the text is badly formatted for humans, because the whitespace was only added to make the HTML code understandable and in most cases was not meant to be used or read outside that context. If we collapse the whitespace, whitespace-sensitive text from <pre> etc. can be incorrectly formatted.

Providing a CSS-aware innerText feature would produce the most usable output, but this is too complex and time-consuming to build for most parser developers. In the face of no perfect solution I have taken the 80:20 view: errant whitespace causes me considerably more problems than badly formatted <pre> content, so I collapse whitespace on all text returned.

This feature is a non-CSS aware version of innerText. It does not cover all rendering edge cases, but enough to produce practical output.
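
A minimal sketch of what such non-CSS-aware collapsing could look like, in Python. The function name and the exact set of whitespace characters are assumptions made for illustration; this is not the code of the node parser mentioned above.

```python
import re

# Assumption: "whitespace" here means the ASCII characters HTML treats as
# inter-element space (space, tab, LF, FF, CR).
_HTML_WS = re.compile(r"[ \t\n\f\r]+")


def collapse_whitespace(text):
    """Replace each run of whitespace with one space, then trim the ends."""
    return _HTML_WS.sub(" ", text).strip(" ")


# Text content of the h-card example above, as authored across several lines.
implied_name = "\r\n     Glenn\r\n     Jones\r\n"
print(collapse_whitespace(implied_name))  # Glenn Jones
```

The same function would also flatten the contents of a <pre> element, which is the trade-off described above.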

For now, I have started changing the node parser to flag "white-space collapsing" as an experimental feature which is off by default (see http://glennjones.net/tools/microformats/), but personally I will parse everything with this on as I find it the most practical solution.

Not sure where that leaves me on the options below.


Options:

Choose from:

  1. keep as is and every parser client has to post process for common cases.
  2. keep as is but have mf2 parser trim leading/trailing whitespace (likely to provide desired result and be reasonably backcompat)
    • +1 my preference of the two options. Tantek 20:45, 27 May 2015 (UTC)
    • +1 though this doesn't solve any of the problems discussed above, it's still worth doing Kevin Marks 16:41, 28 May 2015 (UTC)
    • +1 will help parsers be more consistent with each other, and I haven't ever encountered a case where preserving leading/trailing whitespace was desirable Kylewm 20:45, 8 June 2015 (UTC)

2015-06-08 option 2 resolved by consensus and implementation in mf2py.
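
As a rough illustration of option 2 (the function name is hypothetical, and this is not mf2py's code), trimming only touches the ends of a value, so interior authoring whitespace survives:

```python
def trim_property_value(text):
    """Option 2: drop leading/trailing whitespace, keep interior whitespace."""
    return text.strip()


raw_name = "\n     Glenn\n     Jones\n"
trimmed = trim_property_value(raw_name)
print(repr(trimmed))  # 'Glenn\n     Jones'
```

This matches the comment above that option 2 is worth doing even though it does not address the interior whitespace cases.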

Somewhat orthogonal:

Because there are both code markup and specific vocabulary (label) needs for preserving whitespace, we are compelled to preserve in general, perhaps except for very specific limited generic cases (e.g. trim leading/trailing, "value" parsing, implied name). Tantek

u- parsing iframe src

Currently if I put u-* on an iframe it gets the value of the fallback text. This seems a shame. Getting the URL seems a sensible answer. Kevin Marks 09:07, 11 July 2015 (UTC)

  • +1 from the room at IWS 2016 WillNorris 22:57, 5 June 2016 (UTC)
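
A sketch of the proposed behaviour using Python's standard html.parser. The class name, the u-comments property and the example.com URLs are assumptions for illustration only, and the rest of the u-* parsing rules are left out.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class IframeUrlCollector(HTMLParser):
    """Collect u-* properties from <iframe src> elements (hypothetical helper)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.properties = {}

    def handle_starttag(self, tag, attrs):
        if tag != "iframe":
            return
        attr = dict(attrs)
        src = attr.get("src")
        if not src:
            return  # no src: a parser would fall back to its existing rules
        for name in (attr.get("class") or "").split():
            if name.startswith("u-") and len(name) > 2:
                # Strip the "u-" prefix and resolve the URL against the page.
                self.properties.setdefault(name[2:], []).append(
                    urljoin(self.base_url, src)
                )


collector = IframeUrlCollector("https://example.com/post/1")
collector.feed('<iframe class="u-comments" src="/comments/1">fallback text</iframe>')
print(collector.properties)  # {'comments': ['https://example.com/comments/1']}
```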

i- parsing iframe src

More controversially, what about using an iframe for transclusion? A use case here is comments on a static site. Currently, on e.g. http://www.kevinmarks.com/microformatschema.html the comments are injected via JS, making them opaque to parsers and thus precluding further parsing such as salmentions.

If instead they were an iframe embedding them, a parser could optionally fetch its contents, parse them, and include them in the parsed mf2 output at that point. Overloading u-* for this seems wrong; e-* as below for srcdoc would have a different effect; this implies a new prefix directive would be needed. A strawman i-* (for include) may work. Kevin Marks 09:07, 11 July 2015 (UTC)

e- parsing iframe srcdoc

  • Proposal: addition of a new e-* parsing rule for iframe elements with srcdoc attributes. E.G.
<div class="h-entry">
 <iframe class="e-content" srcdoc="<p>A paragraph of HTML with &quot;quoted quotes&quot; &amp;amp; doubly quoted ampersands</p>" />
</div>
{
 "items": [{
  "type": ["h-entry"],
  "properties": {
   "content": ["<p>A paragraph of HTML with &quot;quoted quotes&quot; &amp; doubly quoted ampersands</p>"]
  }
 }]
}

This would allow, for example, HTML comments to be sandboxed inside iframes but still parsable as microformats.

I believe the correct processing would be to leave &quot; entities as they are but to unescape any doubly-escaped ampersands.

    • Is there any use case for that? —Tom Morris 12:32, 14 September 2013 (UTC)
    • +1 we need documentation of use case and existing sites publishing iframe srcdoc like this - Tantek 00:47, 15 September 2013 (UTC)
    • Rejected by consensus at 2015-01-20 meetup due to lack of real world use cases / existing sites. Tantek 06:26, 21 January 2015 (UTC)

How to interpret mf2 properties on select

How should select elements with properties be treated any differently?

Awaiting real world examples / stronger use-cases, until then no special treatment of select elements with properties:

  • Are there any real world examples of select elements with microformats properties?
  • What would the use-case be for putting a microformats property class name on a select element?
  • Nothing special. By consensus at 2015-01-20 meetup due to lack of real world use cases / existing sites. Tantek 06:26, 21 January 2015 (UTC)

How to interpret mf2 root name on form

See what to do about root class names on <form> elements in particular:

Awaiting real world examples / stronger use-cases, until then, no special treatment of root class names on <form> elements:

  • Are there any real world examples of a <form> element with a microformats root class name?
  • hcard-input is one possible use-case, is anyone attempting to use forms for hCard input, e.g. with scripts to help make it work?
  • Are there other use-cases for putting a microformats root class name on a <form> element?
  • As of 2015-01-20 - no consensus - need more input as to when/why this is useful to do anything special.


Parsing Literal Values

Issue raised by: Ben Ward

It is proposed for microformats2 that all microformats be parsable from just their root element, e.g. <p class="h-card">Ben Ward</p> would create an hCard with the following properties after parsing:

{ 
  'type': ['h-card'],
  'properties': {
     'name': ['Ben Ward']
  }
}

This is a four-fold change from the current hCard:

  1. type is generically identifiable as a microformat root, even in parsed form. The use of the 'h-' prefix persists into the type of the object. This is deliberately so, as a result of re-using the JSON data model of microdata which itself is re-using a common JSON convention, such that microformatted data is clearly distinguishable (as opposed to any other random schema that may be using a similar data model).
  2. root-class-only support. Per microformats-2-implied-properties, the name property is implied by the entirety of the root class name element.
  3. 'name' instead of 'fn'. As also documented in microformats-2-implied-properties, the continuous challenges/problems and need to repeatedly re-explain 'fn' over the years combined with the real-world market response of nearly every other party doing a person vocabulary renaming 'fn' to 'name', microformats 2 makes this change as well.
  4. There is no automatic parse-time inferring of 'given-name': ['Ben'] and 'family-name': ['Ward']. Any such inferring *might* be made by a vCard converter, but is left up to that specific application (not all applications) built on that vocabulary, though even in that case it may not be necessary, as an empty "N:;;;" vCard property is sufficient to satisfy the N property requirement of vCard, and also causes no problems when imported into various vcard-implementations.

It is required of the extractor to understand that when a microformats object specifies no explicit child properties, that it must treat h-card as having a p-name. But, the parser is generic, so it also treats h-review, h-entry, h-recipe, h-geo as having a ‘p-name’.

As a result, specific vocabularies are evolved to drop their specific form of name (e.g. fn, summary, entry-title) and simplified to use a common 'name' property instead.

Note: while the overwhelming majority of real world publishing/consuming uses of microformats do so with proper nouns which have names (and thus this parser-level incorporation of an implied 'name'), there are some formats that do not have a 'name' semantic. For example, geo, adr, and possibly if/when developed, units of measure, length, cost. The current thinking is that the benefits to the far greater proper-noun use-case of microformats outweigh the technical inelegance of having an extra/ignored 'name' property on formats that lack such a semantic.

Some formats also may appear in theory to better imply some other property, e.g. a review might be thought to imply its content, not its name, and an Atom entry its content, not its title, but in practice (actual publishing patterns) this is not the case. Typically, brief unstructured reviews (or mentions thereof) provide a summary (often hyperlinked to an expanded structured form) of that review, not its content, and similarly, brief unstructured posts (e.g. RSS items) have historically most often been link blog items which include the title of an item and a link. Short status updates as well established by Twitter are newer and would seem to imply purely content with no title, at least semantically, however, even Twitter populates the RSS title and ATOM entry title of their feeds with the content. It's not clear what went into that decision, however, that's likely irrelevant, as the outcome turns out to be emergent consistency among publishing behaviors.

To avoid overloading or undermining the semantics of a vocabulary, I propose that we handle this at the extractor level in a simpler fashion: Define a new property for literal data, that an extractor will provide if no other information was available. All interpreters may then be instructed that in the event that an object has no properties, it can attempt to interpret the literal value from the page instead.

In existing microformats, the closest existing example we have for this is the label property in hCard, which is used to represent the literal address label for a place. It is a corresponding piece of fn, org and adr in combination, but has no structure in and of itself. Possibly, every microformat could have a label form where structured data is unavailable.

However in practice, the hCard label property is both little understood and little used. It's not even clear that it ought to be kept for microformats 2 (no known consumers, very few (if any?) real-world non-test publishers). This disuse is likely a good indicator that we should avoid basing anything on its design.

Alternatively, value is used throughout microformats to target a generic value (e.g. in combination with price in hListing.) It has been proposed that when parsing properties that are also themselves microformats, we create native objects of the form:

   {
       'value': '1900 12th Street, San Francisco, CA 94'
     , 'type': ['h-adr']
     , 'properties': {
           'street-address': '1900 12th Street'
         , 'etc': 'etc'
       }
   }

We could apply this same pattern to the root level:

   { 
       type: [h-card]
     , properties: {}
     , value: 'Ben Ward'
   }

In this case, an interpreter or implementation is responsible for using value in place of fn, or restructuring the object. It would be the responsibility of each vocabulary to define its root property. The parsing layer of microformats 2.0 would not impose semantics or naming onto that.

For another example, h-geo would end up like this:

   {
       type: [h-geo]
     , properties: {}
     , value: '1.3232;-0.543'
   }
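
Under this proposal the interpreter, not the parser, decides what the literal value means. A hedged Python sketch of how a consumer might do that for h-card (the function name is hypothetical, and it assumes type is a list of strings and value is a plain string):

```python
def display_name(item):
    """Prefer an explicit name; otherwise fall back to the literal value."""
    names = item.get("properties", {}).get("name", [])
    if names:
        return names[0]
    return item.get("value", "")


literal_card = {"type": ["h-card"], "properties": {}, "value": "Ben Ward"}
print(display_name(literal_card))  # Ben Ward
```

An h-geo consumer would instead split the same kind of literal value into coordinates, which is the vocabulary-specific responsibility described above.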

resolved

Most recent resolved issues first:

exclude style elements before parsing

2016-06-05 RESOLVED. 2016-07-14 spec updated.

2016-01-25 raised in #microformats

Ran into an issue of a <style> element being parsed as plain text in a p-name. Should microformats2-parsing be updated to indicate <style> should be excluded when parsing? Appears to implicitly fall under microformats2-parsing#note_HTML_parsing_rules

Sample link: http://veganstraightedge.com/notes/2016/01/16/tonight-s-dinner-tacocleanse-beverly-hills-c

The <script> tag can be similarly problematic.

Proposal: Drop both <script> and <style> elements completely when parsing any property (including e-* HTML values). Tantek 01:01, 29 February 2016 (UTC)

Please discuss and/or give +1/0/-1 feedback

  • +1 Tantek as proposer
  • +1 aaronpk as a consumer of HTML from an e-* property, I will always be sanitizing the HTML and removing <script> and <style> anyway
  • +1 kylewm
  • 0 Barnaby +1 to removing the contents of <script> and <style> from all plaintext properties (and 'value' property in HTML dicts), -1 to removing <script> and <style> from HTML. That’s a job for a sanitization stage. As aaronpk points out, sanitization will have to be done anyway if the content is to be reposted, so doing so in the parser doesn’t actually save anyone any work, but removes information which could be useful to people (example use cases: publishing posts with embedded per-post styling, publishing interactive HTML documents with embedded javascript)
    • +1 this seems like reasonable feedback to make a new refined proposal. Tantek 20:37, 13 March 2016 (UTC)
    • +1 I like the revised proposal and am happy to change my vote to this Aaronpk 21:16, 13 March 2016 (UTC)
    • +1 Totally agree with narrowing the proposal. All the problems I've had with script and style tags come from plaintext properties, and agree that they may even be useful to some consumers of the HTML properties (e.g. an embedded YouTube video) Kylewm 23:40, 13 March 2016 (UTC)

Proposal 2: Drop both <script> and <style> elements completely when parsing any property (except for e-* HTML values, which preserve all markup). Tantek 20:37, 13 March 2016 (UTC)

Please discuss and/or give +1/0/-1 feedback
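
For illustration, a small sketch of Proposal 2 for plain-text properties, built on Python's standard html.parser. The class and function names are hypothetical; e-* HTML values would bypass this step and keep all markup.

```python
from html.parser import HTMLParser


class PlainTextExtractor(HTMLParser):
    """Collect text content while skipping <script> and <style> subtrees."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Text inside a skipped element never reaches the property value.
        if self.skip_depth == 0:
            self.parts.append(data)


def plain_text(html_fragment):
    extractor = PlainTextExtractor()
    extractor.feed(html_fragment)
    extractor.close()
    return "".join(extractor.parts)


fragment = '<span class="p-name"><style>.x{color:red}</style>Tacos</span>'
print(plain_text(fragment))  # Tacos
```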


default generated HTML

2016-06-05 RESOLVED. No change to spec.

2015-09-08 raised by Tantek in #indiewebcamp

Should there be a default (perhaps not quite "canonical") way to map/generate HTML+microformats2 from a parsed mf2 JSON output?

E.g. straw proposal:

Existing work / mappings:

Related to:

Use-cases:

Thoughts?

  • +1 Tantek I think we should have this, but am open to proposals on specifics!
  • +1 Glenn Also think this is worth looking at, but I am not sure it should be part of the parser spec. Feels like it should be built as a separate library and have its own spec on the microformats wiki.
  • +1 Barnaby agreed with Glenn, this would be a nice thing to have, but IMO it’s out of scope for the parser and should be specified separately. Personally I would probably implement it separately too, depending on how much work it is.
  • -1 Kylewm A pretty display would be a nice debugging tool, but I'm -1 the proposal to define a specific, default HTML output. The two proposed use-cases are totally buildable without it.
  • -1 Agree with Kyle above... this sounds like a great tool that someone should build and we could even publish "recommended" markup if you don't already have your own template, but this doesn't really belong in the mf2 spec itself. WillNorris 22:32, 5 June 2016 (UTC)


uf2 children on backcompat properties

2016-06-05. RESOLVED. Verified 2016-07-14 that "parse an element for class microformats" appears to already enforce this behavior. No additional spec changes made.

2015-11-24 raised by Calli in #microformats

Related but different from #uf2_children_inside_a_classic_microformats_root_class_name, when there is a uf2 child directly on a backcompat property, what should happen? E.g.

<div class="vcard">
 <div class="adr h-adr">
  <div class="locality">MF1</div>
  <div class="p-locality">MF2</div>
 </div>
</div>

What is the expected behavior and parser output?

"items": [{ 
  "type": ["h-card"],
  "properties": {
    "adr": [{
      "value": "MF1MF2",
      "type": ["h-adr"],
      "properties": {
        "locality": ["MF2"],
        "name": ["MF1MF2"]
       }
     }]
   }  
}]

Another example:

<div class="vcard">
  <div class="adr h-acme-some-acme-object">
    <div class="locality">MF1</div>
    <div class="p-locality">MF2</div>
  </div>
</div>
"items": [{ 
  "type": ["h-card"],
  "properties": {
    "adr": [{
      "value": "MF1MF2",
      "type": ["h-acme-some-acme-object"],
      "properties": {
        "locality": ["MF2"],
        "name": ["MF1MF2"]
       }
     }]
   }  
}]

Per the #any_h-_root_class_name_overrides_and_stops_backcompat_root resolution, the class name "h-acme-some-acme-object" overrides the use of "adr" as a backcompat root.

  • Proposal: the nested "adr h-adr" child is treated as an mf2 object, not backcompat, and thus the resulting parsed "locality" property has a single value of "MF2". Proposed by Calli, noting that Glenn Jones's microformatshiv gets that result currently, and it would be easier for him (Calli) to implement this way.
    • +1 Tantek, seems reasonable and the reasoning provided is good (we have one implementation this way already)
    • +1 Kyle, this is consistent with the resolution to the related issue
    • +1 Calli, yes, this is easier for me to implement (than taking both MF1 and MF2 properties) because it is consistent - for me, consistency is the controlling factor in favor rather than ease of parser implementation
    • +1 Barnaby, php-mf2’s mf1 backcompat produces this exact result, and it makes a lot of sense to me
    • +1 makes sense WillNorris 22:23, 5 June 2016 (UTC)
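
The decision this resolution implies can be sketched as below. The function name is hypothetical and the mapping covers only three classic root names, as an assumption for the example; a real parser has a much longer backcompat table.

```python
# Assumption: a tiny subset of classic root names and their mf2 equivalents.
BACKCOMPAT_ROOTS = {"vcard": "h-card", "adr": "h-adr", "hentry": "h-entry"}


def root_types(class_names):
    """Return (types, use_backcompat_rules) for an element's class list."""
    mf2_roots = [c for c in class_names if c.startswith("h-") and len(c) > 2]
    if mf2_roots:
        # Any h-* root wins: classic property names inside are not parsed.
        return mf2_roots, False
    classic = [BACKCOMPAT_ROOTS[c] for c in class_names if c in BACKCOMPAT_ROOTS]
    return classic, bool(classic)


print(root_types(["adr", "h-adr"]))  # (['h-adr'], False)
print(root_types(["vcard"]))         # (['h-card'], True)
```

With backcompat rules switched off for the "adr h-adr" element, only "p-locality" is read, which gives the single "MF2" value shown in the expected output.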

use poster if no src on video for u props

2016-06-05. RESOLVED. SPEC UPDATED 2016-06-23.

2015-12-13 raised in #indiewebcamp

There is a use-case of marking up the "poster" of a video element as the u-featured of an h-entry. To do that, we need to change u- property parsing to look at the poster attribute of the video element after it has looked for the src attribute.

" else if video.u-x[poster], then get the poster attribute "

Real-world example of markup in the wild:

Background discussion that led to this proposal:

This seems very straightforward so I've added it as PROPOSED directly in the parsing spec. This issue is for tracking the discussion.

Feedback from parser implementers please!

  • +1 Barnaby easy to implement and based on real-world markup, no objections
  • +1 Kylewm sgtm
  • +1 Glenn
  • +1 implemented in go library WillNorris 22:19, 5 June 2016 (UTC)
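
A sketch of the lookup order for a video element, assuming the element's attributes are already available as a dict. The function name and the example URLs are hypothetical, and the other u-* rules (a[href], img[src] and so on) are omitted.

```python
from urllib.parse import urljoin


def video_url_value(attrs, base_url):
    """Return the u-* value for a <video> element, or None if neither applies."""
    # Order follows the issue text: src first, then poster.
    # Assumption: an attribute counts as soon as it is present.
    for attribute in ("src", "poster"):
        value = attrs.get(attribute)
        if value is not None:
            return urljoin(base_url, value)
    return None  # the parser's remaining u-* rules would apply here


print(video_url_value({"class": "u-featured", "poster": "/img/still.jpg"},
                      "https://example.com/notes/1"))
# https://example.com/img/still.jpg
```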

de-dupe URLs?

2016-06-05. REJECTED. NO SPEC CHANGE.

Currently, Known templates end up linking to the author's url in the h-card twice. This leads to duplicate URLs in the parsed output, which makes jf2 conversion insert a children element. Should we be deduping URLs? Or is this a GIGO issue?

  • -1 Kylewm I can't necessarily think of a case where two of the same URL values is useful, but it feels like the parser's job to preserve the fidelity of the input. (this has been fixed in Known's markup btw [2])
  • -1 Tantek on de-duping for mf2 json. jf2 can do what it prefers, no specific opinion on that.
  • -1 Glenn for both the reasons mentioned already
  • -1 willnorris
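
Since the parser keeps duplicates, a consumer or converter that wants unique values can remove them after parsing. A minimal sketch, with a hypothetical function name and example URLs; it assumes the values are plain strings, so embedded microformat objects would need separate handling.

```python
def dedupe_preserving_order(values):
    """Drop repeated values while keeping first-seen order."""
    return list(dict.fromkeys(values))


parsed_urls = [
    "https://example.com/",
    "https://example.com/",
    "https://example.com/about",
]
print(dedupe_preserving_order(parsed_urls))
# ['https://example.com/', 'https://example.com/about']
```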

img fallback in p-

2016-06-05. ACCEPTED. SPEC UPDATED.

Trying to make an author h-card without too many extra elements I first did:

<div class="p-author h-card">
<a href="/" class="p-org p-name"><img class="u-logo" src="/static/logo.jpg" >mention.tech</a>
</div>

rather than:

<div class="p-author h-card">
<a href="/" class="p-org p-name"><img class="u-logo" src="/static/logo.jpg" alt="">mention.tech</a>
</div>

I was surprised that the p-name and p-org took the src and the plaintext and concatenated them giving http://mention-tech.appspot.com/static/logo.jpgmention.tech, though that is the current spec (a separate php-mf2 bug ignored the empty alt when I added it).

While this is what the spec says, I can't think of a scenario where concatenating a string to a URL gives a useful result. Instead:

Proposal:

  • +1 Kylewm reasoning given here makes sense
  • +1 Tantek agreed with reasoning
  • +1 gRegor agreed
  • +1 totally makes sense to me WillNorris 21:21, 5 June 2016 (UTC)
  • ...

namespacing for better integrability

2016-06-05. REJECTED.

All the implied class names may conflict with existing stylesheets, because the prefixes used are too short and are not proper namespaces for what follows them ("p-", "u-", "e-", "h-", "dt-", "x-", ...) and too many of these short prefixes are used.

You should add support for namespacing with an arbitrary "MYCARD" name, using a second class on the same root element that uses class "h-card":

<div class="p-author h-card h-card-ns-MYCARD"><!-- this defines the "MYCARD" namespace used below -->
<a href="/" class="MYCARD-p-org MYCARD-p-name"><img class="MYCARD-u-logo" src="/static/logo.jpg" alt="">mention.tech</a>
<p class="e-form"><!-- "e-form" is not recognized, because not in a known namespace -->
...
</p>
</div>

This is important because tools are autogenerating class names and stylesheets for HTML and associating them with other functions not intended for vCards.

In fact this support should be added in ALL microformats, not just for vCards...

And this will greatly reduce the ambiguities in microformat parsers by allowing them to be more selective. In fact, with the namespace used as a common prefix for all properties, parsers could be faster; additionally it would allow easier editing of vcards in HTML, for operations like find/replace, or even for automated replacements using regexp searches.

It would also allow nested vcards created from different tools using their own private extensions to not conflict with each other on these extensions, if they can be properly namespaced.

Note: these defined namespaces are automatically replaceable by parsers if they regenerate a new composite document (they could be removed by tools if there are no conflicts, or shortened, or made unique by changing them to another arbitrary name).

The other solution would be to use namespaces on the HTML attribute names themselves, notably class:

<div class="p-author h-card h-card-ns-MYCARD"><!-- this defines the "MYCARD" namespace used below -->
<a href="/" MYCARD:class="p-org p-name"><img MYCARD:class="u-logo" src="/static/logo.jpg" alt="">mention.tech</a>
<p class="e-form"><!-- "e-form" is not recognized, because not in a known namespace -->
...
</p>
</div>

But this solution will not work reliably in strict XHTML or XML parsers if there's no XML namespace definition, or this could invalidate the document on basic DOM parsers for HTML (e.g. in MediaWiki, unknown HTML attributes are discarded so that MYCARD:class="..." would not appear at all in the final HTML, only class="..." is accepted).

Note: this would also cleanly solve problems like the one described in #ignore u-camelCase properties below!

Finally, it would allow multiple microformats to coexist in the same document. Only the root element is distinctive, but the "p-*", "u-*", "dt-*" elements will collide: which microformat should interpret them? This is easy to solve by assigning to the root "h-<microformat>" element of each microformat a namespace that will be used in its content, such as "h-card-ns-MYCARD" for assigning the "MYCARD" namespace to the "h-card" microformat, or "h-goog-doc-ns-MYDOC" to assign the "MYDOC" namespace to the "h-goog-doc" microformat that google may want to develop for Google Docs, or "h-x-doubleclick-X78954218" for assigning the "X78954218" namespace that would be used in a "x-doubleclick" custom microformat developed by doubleclick with contents using "X78954218-p-*", "X78954218-u-*", "X78954218-e-*", "X78954218-dt-*".

Verdy p 03:54, 1 June 2016 (UTC)

  • -1 gRegor: see namespaces-considered-harmful; also seems to solve only hypothetical problems. Are there real-world parsing collision examples?
  • -1 agreed with gRegor above. I would certainly want to see real world parsing problems before adding such a heavyweight "solution". WillNorris 21:16, 5 June 2016 (UTC)
  • -1 Tantek: historically none of these namespace setting/using proposals have actually survived in the wild on the web, they all get co-opted to treating the shorthands/prefixes in a hardcoded way, e.g. og: etc. All evidence to date is against such proposals, plus there are no concrete examples provided to motivate this change, only theory.

consistent implied name url from grandchildren of root

2016-06-05. ACCEPTED. SPEC UPDATED.

See https://github.com/microformats/tests/issues/50

Summary:

Proposal to update spec to include the following at the end of implied url parsing rules:
  • else if .h-x>:only-child>a[href]:only-of-type:not[.h-*] then use that [href] for url
  • else if .h-x>:only-child>area[href]:only-of-type:not[.h-*] then use that [href] for url
these are identical to the existing rules with the addition of the :only-child selector.

need more not h-* to avoid child root implying properties

2016-06-05. ACCEPTED. SPEC UPDATED.

See https://github.com/microformats/tests/issues/52 for an example of this

Proposal:

Should also be restricted to :not[.h-*]

E.g. >:only-child> should be >:only-child:not[.h-*]>

  • +1 Tantek
  • +1 willnorris
  • +1 gRegor
  • +1 kylewm

Standard datetime format

2016-06-04 (before). REJECTED.

2015-07-28

http://microformats.org/wiki/microformats2-parsing#parsing_a_dt-_property does not specify any standard format to use for datetimes. e.g.
2015-07-28T12:55:33
vs
2015-07-28 12:55:33

Would be good to standardize this to compare various parser outputs.

2015-07-29: This subject is (somewhat) covered in http://microformats.org/wiki/iso-8601 As it stands the JavaScript parsers support output in the 3 main profiles, 'W3C Note', 'RFC 3339' and 'HTML5', plus 'auto', which keeps the author's format. The default date output for the JavaScript parsers is the same format as the date was originally authored in. This can be changed by setting the options.dateFormat switch to any of the other profiles mentioned. It would be good if other parsers also had a switch to force output to a common profile so we could compare various parser outputs, but I think the default should be how a date was authored. All output, whatever the profile, should also keep the authored level of specificity, i.e. not adding minutes or seconds if they are not in the input date string. This is important if you want to compare parser outputs.

The only exception to this is where date and times are combined, such as the implied h-event rule for dt-start and dt-end, where I output in the HTML5 style 2015-07-29 12:55:33, as there is no predefined author preference and the HTML5 profile is more human readable. Glenn Jones 11:02, 29 July 2015 (UTC)

  • -1 Tantek we are maintaining whole properties as authored, with authored level of specificity, i.e. not adding minutes or seconds if they are not in the input date string, and vcp cases handled in separate issue.
    • Consensus in room at IWC 2016 session also. Resolving accordingly.


implied date for dt properties both mf2 and backcompat

2016-06-04 (before). ACCEPTED. SPEC UPDATED.

The value class pattern dt-* date proposal should apply to both mf2 dt-* properties, and backcompat classic microformats, to preserve the hAtom / hCalendar optimizations noted on that page, but in a generic way.

2015-08-21: Glenn Jones Now implemented in microformat-shiv can be tested at http://microformatshiv.com/editor.html

And vcp updated too. Tantek 22:53, 5 June 2016 (UTC)

implied name when alt=""

The implied name rule

is slightly under-specified for the case where alt is provided but intentionally blank. The desired behavior is to use the img alt attribute only if it is non-empty. For example:

<a class="h-card" href="https://kylewm.com">
  <img src="https://kylewm.com/photo.jpg" alt="">
  Kyle
</a>


The PHP and JS parsers already seem to return the desired result ("Kyle" in the above example). The Python parser uses the alt text and returns "".

Proposal: modify the spec to explicitly exclude these tags:

And audit the other implied rules for similar cases.

  • +1 Tantek this makes sense to me, and as far as I can tell, for the other cases too for *implied* properties:
    • area[alt], abbr[title], and all other attributes where there is an existence test, there should be a :not[alt=""] empty test, for implied p-name, u-photo, u-url
  • +1 gRegor sounds good to me.
  • +1 Kylewm We've added this in mf2py too now, and I'm happy with it.
  • +1 Glenn This is often defined by the underlying HTML parsing library, which will remove attributes that do not have a value.
  • ...
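
A simplified sketch of the desired behavior. The function name is hypothetical, and it assumes the caller has already located the single img child and the root's text content; the only-child checks of the real implied name rules are left out.

```python
def implied_name(img_attrs, text_content):
    """Use img alt only when it is present and non-empty; else use the text."""
    alt = (img_attrs or {}).get("alt")
    if alt:  # None and "" both fall through to the text content
        return alt
    return text_content.strip()


print(implied_name({"src": "https://kylewm.com/photo.jpg", "alt": ""}, "\n  Kyle\n"))
# Kyle
```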

parsing a dt- property

Log: https://indiewebcamp.com/irc/2016-04-25#t1461606553653

  • +0 Kylewm on replacing "T" as the separator. Would you please clarify whether that is only for value class pattern/assembling dates from components, or is it proposing to *always* normalize dt's?
    • +1 Tantek definitely value class pattern/assembling dates from components should use " " instead of "T" as separator.
    • +0 Tantek slight pref (but unsure) for replace a "T" separator with a single space in other dt-* parsing.
    • +1 Glenn happy to move to single space separator for dates built from the value-class pattern.
    • -1 Glenn I think we should pass through the authored format of a date as default output. We should process the content as little as possible, so it is as authored. We can then add parser options to force one of the date formats such as ISO profiles HTML5 or W3C if we need consistency. This is the approach I have taken.
    • See related specific issue: microformats2-parsing-issues#Standard_datetime_format
  • +1 Kylewm on not implying seconds
  • +1 Glenn on not implying seconds. Authored level of specificity should always be kept in dates.
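
The two points with agreement above (a single space as separator for dates assembled from value-class components, and no implied seconds) can be sketched as follows. The function name is hypothetical and the inputs are assumed to be already-extracted date and time strings.

```python
def assemble_datetime(date_part, time_part=""):
    """Join date and time values with one space, adding nothing else."""
    if not time_part:
        return date_part
    return f"{date_part} {time_part}"


print(assemble_datetime("2015-07-28", "12:55"))     # 2015-07-28 12:55
print(assemble_datetime("2015-07-28", "12:55:33"))  # 2015-07-28 12:55:33
```

A dt-* value authored as one whole string is not touched by this sketch, in line with the preference above for passing through the authored format.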

Consensus resolutions:

Dropped:

ignore u-camelCase properties

RESOLVED. SPEC UPDATED 2016-02-29.

Due to Suit CSS's (and others'? citations?) recent (2015-?) use of "u-*" class names for so-called "utility classes", we are seeing some false positives in a few very rare instances, e.g.: this twitter markup

(Nearly) all these "utility classes" use camelCase for the class name suffixes, thus we can filter them out by looking for camelCase (since microformats class name conventions are always all lowercase and hyphenated), or even just looking for (and rejecting) *any* capital letters.

For your own site, it might be a good idea to prefix the "utility classes" e.g. Cooking with Design Systems by Dan Mall

Proposal:

  • +1 Tantek Let's get this fix rolling quickly to avoid further pollution.
  • +1 Barnaby php-mf2 already ignores classnames with capitalised prefixes, ignoring any classnames with capital letters seems totally reasonable
  • +1 Kylewm agree with rejecting property names that include capital letters (specifically detecting camelCase seems harder to define)
  • +1 Glenn agreed, a simple change which should help avoid further pollution
  • +1 (see also below) WillNorris 22:15, 5 June 2016 (UTC)

Additional proposal: (same reasoning, filter out more crap)

  • +1 Tantek Let's get this fix rolling quickly to avoid further pollution.
  • +1 I've already implemented this in the go library, additionally extending it to all properties, not just u-* WillNorris 22:15, 5 June 2016 (UTC)
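
A sketch of the filter discussed above: reject any property class name that contains a capital letter. The function names are hypothetical, and applying the check to every prefix rather than only u-* is an assumption that follows the last comment.

```python
PREFIXES = ("p-", "u-", "dt-", "e-")


def is_property_class(name):
    """True for a prefixed class name that contains no capital letters."""
    for prefix in PREFIXES:
        if name.startswith(prefix) and len(name) > len(prefix):
            return name == name.lower()
    return False


def property_classes(class_attr):
    """Return the class names that would be treated as mf2 properties."""
    return [name for name in class_attr.split() if is_property_class(name)]


print(property_classes("u-url u-floatLeft u-textTruncate p-name"))
# ['u-url', 'p-name']
```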

When to collapse whitespace in properties

The spec doesn’t explicitly require whitespace to be collapsed or not. The official mf2 test suite requires it to be collapsed.

Reasons why whitespace shouldn’t be collapsed:

Resolution 2013-11-12: Agreed, whitespace should not be collapsed (other than normal HTML5 parsing rules). The spec now refers to "textContent" rather than "innertext" to make this explicit.

How to interpret mf2 classnames on form inputs

E.G. how to parse:

<input class="u-url" value="https://brennannovak.com/notes/338" />

Examples in the wild: https://brennannovak.com/notes/338

See proposal:

Resolution 2013-11-12: Per that proposal, p- u- dt- properties on input[value] elements now use the value attribute.

mixture of microformats2 and classic microformats classnames on different elements

Some sites in the wild have mistakenly combined classic mf and mf2 markup in ways which misrepresent the content if parsed in BC mode.

Typically this is caused by putting classic and mf2 classnames for the same vocabulary on different elements, e.g:

<body class="hentry">
 <article class="h-entry">
  <h1 class="p-name"></h1>
 </article>
</body>

Sites where this has been observed:

Discussion:

e- and p- escaping levels

  • The fact that the parsed value of any element with .e-* is at a different level of escaping to the parsed values of p-*, dt-* etc. without any indication of how the property was parsed in the output is a security problem. For example:
Input (p-name):
<p class="h-card">
 <span class="p-name">&lt;tag&gt;</span>
</p>
Output:
{
    "items": [
        {
            "type": [
                "h-card"
            ],
            "properties": {
                "name": [
                    "<tag>"
                ]
            }
        }
    ]
}
Input (e-name):
<p class="h-card">
 <span class="e-name">&lt;tag&gt;</span>
</p>
Output:
{
    "items": [
        {
            "type": [
                "h-card"
            ],
            "properties": {
                "name": [
                    "&lt;tag&gt;"
                ]
            }
        }
    ]
}
  • As a parser developer, the most straightforward way I can think of solving this is to add an option (enabled by default) which encodes HTML special characters on all non e-* properties, so the developer knows that all property values are going to be at the same level of escaping. --bw 20:00, 15 June 2013 (UTC)
    • Your suggestion of auto-HTML-encoding p-*/u-*/dt-* property values is the most sensible I think. I would NOT make it an option, as it makes sense to write consistent microformats2 consumers. - Tantek 07:18, 5 July 2013 (UTC)
    • Can you think of any existing apps/consumers of microformats2 via the parser that would break? What would indieweb comments parsers do? - Tantek 07:18, 5 July 2013 (UTC)
      • The only breakage which might occur would be over-encoding of non e-* properties, but I’ll release this update as v0.2.0 and warn people about the changes. The worst thing which could happen is that some comments look a bit weird, as opposed to the current worst possible scenario of easy XSS attacks --bw 12:55, 5 July 2013 (UTC)
      • We should also decide exactly which characters get encoded — just angle brackets, or quotes/ampersands as well? --bw 12:55, 5 July 2013 (UTC)
      • I am not sure about this, it seems more like a helper function rather than a core feature of the parser. Personally I would like to store data as text and encode only when I am going to use it and I know the format it is going to be used in. --Glenn Jones 9:54, 14 July 2013 (UTC)
      • After the discussion on the indiewebcamp IRC with Barnaby Walters I now understand the XSS issue that this change is trying to address. A rogue author could include HTML with scripts to execute an XSS attack. These could be masked by switching prefixes, i.e. p-* to e-*, on a well-used property. As the consumer does not see the prefix in the JSON output they have no idea if a property will contain HTML or text. I will update my two parsers and the test suite --Glenn Jones 8:02, 17 July 2013 (UTC)
    • So what about an author setting a property to e-* when it would normally be p-*, dt-* or u-*, i.e.
<div class="h-card"><p class="e-name"><script> alert('xss test') </script></p></div>
  • Resolved by changes to the parsing spec: all properties are plaintext (non-HTML escaped), e-* properties result in a dictionary with value = plaintext version, html = raw HTML version
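A sketch of how a consumer might handle values under that resolution, assuming the output shape described above (plain strings for p-*/u-*/dt-*, and a dictionary with "value" and "html" keys for e-*). The sample data and the helper names `plain_text` and `render_text` are hypothetical:

```python
import html

# Hypothetical parsed properties: "name" came from p-name (plain text),
# "content" came from e-content (dict with plain text and raw HTML).
properties = {
    "name": ["<tag>"],
    "content": [{"value": "Hello", "html": "<b>Hello</b>"}],
}


def plain_text(value):
    """Return the plain-text form of a property value, whatever its prefix was."""
    return value["value"] if isinstance(value, dict) else value


def render_text(value):
    """Escape the plain-text form at output time, just before inserting into HTML."""
    return html.escape(plain_text(value))


print(render_text(properties["name"][0]))
print(render_text(properties["content"][0]))
```

Because every value has a plaintext form, the consumer escapes exactly once at the point of use, and only touches the "html" key when it deliberately wants markup (and is prepared to sanitize it).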


br hr empty string

  • The parsing rule 'else if br.p-x or hr.p-x, then return "" (empty string)' for p-* can cause any code consuming the API to become quite bloated. It means that you have to test every array value to see if it is an empty string. It is also unclear to me what the purpose of this mark-up pattern is. Glenn Jones
    • Upon reconsidering this, I agree with you, this is an unlikely use case. If a publisher wants to explicitly set an empty property "p-foo" they can simply write <span class="p-foo"></span> which looks explicit. Whereas BR and HR tags are often just presentational, so we should both not encourage usage of them for semantics, and anyone that did use them would be subject to likely loss of semantics upon a redesign (that got rid of those particular BR and HR tags). I'm going to remove them from the parsing spec. - Tantek 15:29, 10 February 2013 (UTC)

datetime examples without T delimiter

  • The examples in the wiki microformats-2 pages such as h-entry and h-entry had datetime without the 'T' delimiter between date and time, i.e.
<time class="dt-published" datetime="2013-06-13 12:00:00">13<sup>th</sup> June 2013</time>

I have updated the pages. As far as I know this is a new pattern for dates. Was it a mistake in the examples or is it a new datetime pattern?

    • The HTML5 "time" element, and "datetime" attribute allow for space " " as a separator between date and time as well as "T", thus we allow it for microformats as well. The " " separator is preferred as the date and time are more readable when separated by a space. The examples noted in those specs deliberately use this. - Tantek 18:48, 15 July 2013 (UTC)
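For consumers, accepting both separators can be cheap. A small sketch using Python's standard library, which already accepts either "T" or a space between the date and time parts; the helper name `parse_dt` is an assumption for illustration:

```python
from datetime import datetime


def parse_dt(value):
    """Parse an ISO 8601 style date-time written with either separator."""
    # datetime.fromisoformat accepts any single character between the
    # date and time parts, so "T" and " " both work.
    return datetime.fromisoformat(value)


spaced = parse_dt("2013-06-13 12:00:00")
with_t = parse_dt("2013-06-13T12:00:00")
print(spaced == with_t)
```

This only covers values that include full date and time components in the forms `fromisoformat` supports; it is not a complete dt-* parser.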

rel-alternate absent optional attributes

  • What should rel-alternate parsing do when one of the optional attributes specified (hreflang or media or both) is not there? The options seem to be:
    1. leave the corresponding key out of the alternate JSON object
      • This one. Leave the corresponding key out.
    2. include the corresponding key in the alternate JSON object, but set the value to the JSON null object
    3. include the corresponding key in the alternate JSON object, but set the value to a blank string
    4. something I haven't thought of

I haven't checked the existing implementations, but Barnaby said he's not sure what the appropriate way to deal with it is either. —Tom Morris 15:41, 9 August 2013 (UTC)
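Option 1 (leave the key out) could look like the following sketch. The function name `alternate_entry`, the attribute dictionary input, and the "url" key are assumptions for illustration, not the spec's wording:

```python
def alternate_entry(attrs):
    """Build an alternate object from an element's attributes (hypothetical helper).

    Optional attributes that are absent or empty are simply omitted,
    rather than being emitted as null or as a blank string.
    """
    entry = {"url": attrs["href"]}
    for key in ("hreflang", "media"):
        if attrs.get(key):
            entry[key] = attrs[key]
    return entry


print(alternate_entry({"href": "/fr/", "hreflang": "fr"}))
```

Consumers can then test for presence with `"media" in entry` instead of comparing against null or "".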

rel-alternate and type attribute

Status: incorporated into microformats2-parsing

  • Should rel-alternate parsing also pick up the type attribute? It’s fairly widely used, e.g. for ATOM feeds.
    • Numerous existing sites/pages have various rel-alternate uses with a type attribute for feeds/APIs so that's good enough to add this for help with discovery in general. Rel parsing updated. - Tantek 00:47, 15 September 2013 (UTC)

Extraction vs Interpretation

Issue raised by: Ben Ward

A microformats ‘1.0’ parser performs the following function:

This is performing two types of function: Extraction of data from an HTML document or fragment, and interpretation and optimisation of that content to match the rules set out by a vocabulary specification.

It is only possible to write a generic parser that covers the first half of this task: Extraction, and application of global rules based on HTML elements and patterns common to all formats.

The purpose of a generic parser (as supported by use cases such as search engines, and other crawlers) is:

To provide a way for tools to extract rich data from a page for native storage, such that the data may be interpreted later by applications. This allows microformats to be crawled, and indexed, and removes the need to include complex HTML parsing within every implementation of microformat data.

Microformats will continue to define various vocabulary-specific optimisations, as part of the design to be optimised for authors. For example: The fn pattern in hcard, or the lat;long pattern in geo, as well as default values for properties, such as the maximum rating in an hreview.

Extraction resolution

Proposed resolution:

Microformats2 should refer only to extraction of microformats. Vocabularies should in turn document their appropriate optimisations, which will need to be applied by implementations, or a companion to an extractor, which I'll refer to here as an ‘interpreter’.

A microformats2 ‘extractor’, in combination with the functionality of a domain and format-aware ‘interpreter’ (either another shared component, or part of the implementation itself) would be equivalent to a microformats 1.0 ‘parser.’

N.B. I'll rewrite some of these as microformats2-parsing-faq to help better clarify. The reasoning that led to most of these design decisions is documented in the microformats 2: About This Brainstorm section and following sections. I'll recheck those sections to see if/where reasoning for some of the above noted design decisions may have been missed, and back-fill accordingly. This is necessary because microformats2 is an evolutionary result of simultaneously addressing both numerous generic issues as well as various common format-specific problems in microformats1 syntax and vocabularies. The very number of changes may make it more challenging (from a microformats1 perspective) to see why any particular design change has been made. Tantek 12:43, 4 October 2011 (UTC)

This issue can be moved from resolved to closed once the above-mentioned write-ups have occurred.


Parsing properties from rel attributes

tl;dr resolution: As of 2013, microformats2-parsing handles parsing all link and a href rel values at document scope level, and producing canonical JSON accordingly. - Tantek

Issue raised by BenWard 07:24, 5 October 2011 (UTC):

Microformats parsers could instead extract all link relationships from rel attributes within a microformat object, parsing them as if a u- prefixed property.

This results in:

Since rel attributes are not overloaded for other functionality like class is, and other uses of rel within content are low (and non-semantic uses are nil, to the best of my knowledge) the risk of property pollution would be extremely low.

Note, with regard to this last point, that a generic microformats parser will parse false-positive properties, and will parse objects in combined chunks, rather than individually by format. Extracted objects will often not represent a vocabulary without further processing.

  • This sounds like it might be workable. Let's try it and see how well authors "get it". - Tantek
  • Possible issue: do we have any collisions between class property names and rel names? (I don't think so offhand, but useful to ask the question). - Tantek
    • None that I can think of in microformats. There is the case of Google's rel=author and p-author in hAtom. However, the next point, about mfo scoping, would cover it in most situations (rel-author on a hyperlink within an hcard wouldn't be applied to the hentry.) The one situation in a parse tree where it's ambiguous would be this:
<a class="p-author h-card" 
   rel="author" 
   href="http://benward.me">
   Ben Ward
</a>
    • I can think of two quite reasonable solutions:
      • 1. Declare that class properties take precedence over rel properties of the same name, discarding rel values if a class is also found, or
      • 2. Since all properties are now multi-value anyway, the hAtom object could be parsed as:
{
   'type': ['h-entry'],
   'properties': {'author': [
        {
          'value': ['Ben Ward'], /* from the p-author     */
          'type': ['h-card'],    /* from the h-card ...   */
          'properties': { 
            'name': ['Ben Ward'], 
            'url': ['http://benward.me']
          }
        },
        'http://benward.me'      /* from the rel="author" */
     ]}
 }
    • BenWard 08:29, 5 October 2011 (UTC)
      • Option 2 makes sense and is consistent with the rest of the multi-value parsing/handling. - Tantek 14:56, 5 October 2011 (UTC)
      • What about without the 'p-author'?
<a class="h-card" 
   rel="author" 
   href="http://benward.me">
   Ben Ward
</a>

Should that be parsed as:

{
   'type': ['h-entry'],
   'properties': {'author': [
        {
          'type': ['h-card'],    /* from the h-card ...   */
          'properties': { 
            'name': ['Ben Ward'], 
            'url': ['http://benward.me']
          }
        },
        'http://benward.me'      /* from the rel="author" */
     ]}
 }

Or

{
   'type': ['h-entry'],
   'properties': {'author': [
        {
          'value': 'http://benward.me', /* from the rel="author" */
          'type': ['h-card'],          /* from the h-card ...   */
          'properties': { 
            'name': ['Ben Ward'], 
            'url': ['http://benward.me']
          }
        }
     ]}
 }
      • And if the former, then we're presumably saying that the value parsed due to the presence of a rel is always its own value, and does not combine with any other structures. I am fine with this, but I wanted to make sure we are ok with that explicitly. Tantek 14:56, 5 October 2011 (UTC)
        • +1 I think that since the rel attribute is specifically concerned with the relation to an href attribute, it should not be combined with other structures that are rightly declared using classes.
          • The more I've thought about this and how consuming applications may want to treat rel semantics, the more it seems correct to keep rel semantics distinct from class semantics. Class semantics are quite general/flexible, whereas rel is quite specific, naming something else in terms of a relationship from the current page/microformat's perspective. I think we should consider putting rel values in their own 'rel' collection, separate from the 'properties' collection. E.g. the original rel-author p-author h-card markup example would be parsed into this:
{
   'type': ['h-entry'],
   'properties': {'author': [
        {
          'value': ['Ben Ward'], /* from the p-author     */
          'type': ['h-card'],    /* from the h-card ...   */
          'properties': { 
            'name': ['Ben Ward'], 
            'url': ['http://benward.me']
          }
        }
     ]},
   'rel': {
     'author': ['http://benward.me'] /* from the rel="author" */
   }
 }
          • and if a post had multiple authors:
{
   'type': ['h-entry'],
   'properties': {'author': [
        {
          'value': ['Ben Ward'], /* from p-author     */
          'type': ['h-card'],    /* from h-card ...   */
          'properties': { 
            'name': ['Ben Ward'], 
            'url': ['http://benward.me']
          }
        },
        {
          'value': ['Tantek Çelik'], /* from 2nd p-author     */
          'type': ['h-card'],        /* from 2nd h-card ...   */
          'properties': { 
            'name': ['Tantek Çelik'], 
            'url': ['http://tantek.com']
          }
        }
     ]},
   'rel': {
     'author': [
       'http://benward.me',      /* from rel="author" */
       'http://tantek.com'       /* from 2nd rel="author" */
     ]
   }
 }
          • This preserves the semantic distinction between rel and properties in general, and leaves it up to a higher-level application to implement any logic around showing "more info" about a rel-author, e.g. by correlating the rel-author URL with the 'url' of an hCard it found in the same entry. However, note that even in the earlier JSON data model, the rel-author value just shows up as another property value, and any higher level application would still have to do some correlation logic. At least with this JSON data model, applications that may be looking for a rel value in particular, or a property value in particular can do so without having one unintentionally pollute the other. Tantek 17:33, 6 October 2011 (UTC)


  • Presumably we'd apply all the same property scoping rules to rel scoping as well. E.g. a rel hyperlink inside a microformat won't be seen by any containing microformat. - Tantek
    • Correct, it should be parsed in the same scope as all other class properties in the object.
      • Update: all rel microformats are now parsed at page-scope. Per-microformat scoping of rel has been found to be too confusing in practice (and against the general semantic of rel expressed in the HTML/HTML5 specs) Tantek 01:00, 10 July 2014 (UTC)


This issue can be moved from resolved to closed once we've verified that all the above-mentioned and implied needs to write things up have occurred.

deduping of rels

Status: incorporated into microformats2-parsing

2015-06-02 by Kevin Marks

Many sites have multiple duplicate rel links to the same url - a very common case is WordPress home pages eg ma.tt

Each post on the page has a block like

<div class="entry-meta">
<span class="date">
<a href="http://ma.tt/2015/05/beethoven-mozart-bach/" 
    title="Permalink to Beethoven, Mozart, Bach" rel="bookmark">
<time class="entry-date" datetime="2015-05-31T22:42:00+00:00">May 31, 2015</time></a></span>
<span class="categories-links">
<a href="http://ma.tt/category/asides/" rel="category tag">Asides</a></span>
<span class="author vcard">
<a class="url fn n" href="http://ma.tt/author/saxmatt/" 
    title="View all posts by Matt" rel="author">Matt</a></span>
</div><!-- .entry-meta -->

As currently defined, the parser will create duplicate entries in rels for each post:

    "rels": {
        "category": [
            "https://ma.tt/category/asides/", 
            "https://ma.tt/category/asides/", 
            "https://ma.tt/category/asides/", 
            "https://ma.tt/category/asides/"
        ], 
        "author": [
            "https://ma.tt/author/saxmatt/", 
            "https://ma.tt/author/saxmatt/", 
            "https://ma.tt/author/saxmatt/", 
            "https://ma.tt/author/saxmatt/"
        ], 
        "tag": [
            "https://ma.tt/category/asides/", 
            "https://ma.tt/category/asides/", 
            "https://ma.tt/category/asides/", 
            "https://ma.tt/category/asides/"
        ], 
        "home": [
            "https://ma.tt/", 
            "https://ma.tt/"
        ], 
}

and in the rel-urls we will also see:

    "rel-urls": {
        "https://ma.tt/author/saxmatt/": {
            "rels": [
                "author", 
                "author", 
                "author", 
                "author"
            ], 
            "text": "Matt", 
            "title": "View all posts by Matt"
        }, 
…
        "https://ma.tt/": {
            "rels": [
                "home", 
                "home"
            ], 
            "text": "Matt Mullenweg", 
            "title": "Matt Mullenweg"
        }, 
…
        "https://ma.tt/category/asides/": {
            "rels": [
                "category", 
                "tag", 
                "category", 
                "tag", 
                "category", 
                "tag", 
                "category", 
                "tag"
            ], 
            "text": "Asides"
        }

These duplicates are unhelpful for parser consumers. We should:

    "rels": {
        "category": [
            "https://ma.tt/category/asides/"
        ], 
        "author": [
            "https://ma.tt/author/saxmatt/"
        ], 
        "tag": [
            "https://ma.tt/category/asides/"
        ], 
        "home": [
            "https://ma.tt/"
        ], 
}
…
    "rel-urls": {
        "https://ma.tt/author/saxmatt/": {
            "rels": [
                "author"
            ], 
            "text": "Matt", 
            "title": "View all posts by Matt"
        }, 
…
        "https://ma.tt/": {
            "rels": [
                "home"
            ], 
            "text": "Matt Mullenweg", 
            "title": "Matt Mullenweg"
        }, 
…
        "https://ma.tt/category/asides/": {
            "rels": [
                "category", 
                "tag"
            ], 
            "text": "Asides"
        }


  • +1 as the proposer. A version of mf2py that does this is running at unmung.com. See for ma.tt or a simple test Kevin Marks 23:49, 2 June 2015 (UTC)
  • +1 makes sense to me, and not having them be sets in the current spec is likely an oversight on my part. Thanks for noting this issue. Tantek 05:28, 3 June 2015 (UTC)
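Treating both collections as ordered sets can be sketched as follows. This is an illustration of the idea only, not the mf2py implementation; the function names and the `(url, rel attribute)` input shape are assumptions:

```python
def add_unique(values, item):
    """Append item to the list only if it is not already present (keeps first-seen order)."""
    if item not in values:
        values.append(item)


def collect_rels(links):
    """Build de-duplicated "rels" and "rel-urls" collections.

    links: iterable of (url, rel_attribute_value) pairs in document order.
    """
    rels, rel_urls = {}, {}
    for url, rel_attr in links:
        for rel in rel_attr.split():
            add_unique(rels.setdefault(rel, []), url)
            add_unique(rel_urls.setdefault(url, {"rels": []})["rels"], rel)
    return rels, rel_urls


# Four posts on one page, each repeating the same category link.
links = [("https://ma.tt/category/asides/", "category tag")] * 4
rels, rel_urls = collect_rels(links)
print(rels)
print(rel_urls)
```

Keeping lists (rather than Python sets) preserves document order in the JSON output while still removing the repeats.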

include alternates in rels

Status: incorporated into microformats2-parsing

2015-06-01 by Tantek, per inconsistency noted by Kevin Marks.

As fallout from the adoption and implementation of 'rel-urls' per microformats2-parsing-brainstorming#more_information_for_rel-based_formats, we should:

  • +1 Tantek as the documenter of this issue, and attempting to represent what I think KevinMarks intended with "rels" and "rel-urls".
  • +1 This makes sense to me, as the rels and rel-urls should match so you can lookup in rels first, then get details about urls from rel-urls. We can drop "alternates" independently from this change. Kevin Marks 00:23, 2 June 2015 (UTC)
    • +1 drop "alternates" independently now made a separate issue. Tantek 03:42, 6 June 2015 (UTC)
  • +1 Makes sense. And I agree with Kevin; I think parsers should deprecate "alternates" for now and drop it after a version cycle or two. Kylewm 00:30, 2 June 2015 (UTC)
  • implemented in my fork of mf2py and running on unmung Kevin Marks 01:29, 2 June 2015 (UTC)

Empty properties overridden by implied rules against user expectation

Status: resolved, existing behavior correct, no changes to parsing spec.

2015-07-03: raised by Glenn Jones

Emma Kuo brought up an issue (https://github.com/glennjones/microformat-node/issues/22) based on following the indieweb note pattern, where the content of a note is given both the e-content and p-name classes. If the element containing the note has only non-text content, such as an image, the p-name can have an unexpected value. Here is the example she gave:

<div class="h-entry">
        <a href="http://this.site/photo" class="u-url"></a>
        <div class="e-content p-name"><img src="photo.jpg" class="u-photo"/></div>
        Some extraneous text
        <div class="h-cite">
            <a href="http://someother.site/like" class="u-url"></a>
            <a href="http://this.site/photo" class="u-like-of"></a>
            <div class="e-content p-name">liked this</div>
        </div>
    </div>

At the moment I parse this as follows: if a property (p-name) is empty, do not add it to the output. In this case "empty" is classed as not containing any non-whitespace text. As far as I know there is no guidance on how to handle "empty" properties in the microformats parsing rules, so I followed the convention of JSON APIs not to return "empty" properties.

The side effect of the above is that p-name also has a number of "implied rules". The "implied rules" try to automatically fill properties like p-name if there is no defined value. In the example above it uses the textContent of the parent h-entry, so the value of the h-entry>p-name is the text content of the h-cite, i.e. "liked this".

Options:

1. We should not allow the "implied name rule" to get textContent from within a child h-*

  • +1 I believe this is inline with how we parse properties and will meet user/author expectations Glenn Jones 9:22, 3 July 2015 (UTC)
  • -1 A nested h-* is still part of the content of the parent h-*, I don't quite understand the rationale for excluding it. For example, I may include lots of h-cards in the body of a post that references people and wouldn't want them to be excluded from the implied name generation. Kylewm 14:40, 3 July 2015 (UTC)
  • -1 I'm not sure this would solve the problem because auto-filled text could come from the parent h-* ("some extraneous text" in the example above) Emma Kuo 20:50, 4 July 2015 (UTC)
  • -1 I agree with the other -1s. This would break some of the simplicity of the model. Tantek 05:12, 14 July 2015 (UTC)

2. We should not execute the "implied rules" where there is an author defined "empty" property.

  • -1 Although the output would meet author expectations it is complex for parsers as they will have to keep state for each property through the whole series of parsing rules. Glenn Jones 9:22, 3 July 2015 (UTC)
  • +1 An explicit, empty, p-name property should prevent an implicit p-name from being generated. For example tantek.com includes <span class="p-name"></span> at the start of the h-feed to prevent a giant name from being auto-generated. From my reading of the parsing spec, I don't see any reason that blank strings should be excluded from parsing. (mf2py and php-mf2 will both happily include empty strings in their output) Kylewm 14:40, 3 July 2015 (UTC)
  • +1 We already have interop on this between mf2py and phpmf2, as well as people depending on it to explicitly set empty property values. Tantek 05:12, 14 July 2015 (UTC)
  • +1 As we already have interop with two parsers and solid user issue from Emma we should take this approach. Glenn Jones 11:51, 29 July 2015 (UTC)

2015-08-21: Glenn Jones Now implemented in microformat-shiv can be tested at http://microformatshiv.com/editor.html
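Option 2 can be sketched as a check on whether the author supplied the property at all, rather than on whether its value is non-empty. The function below is a deliberately simplified stand-in for the implied name rule (it ignores the img/alt and nested-element cases), and its name and arguments are assumptions:

```python
def finalize_name(properties, root_text):
    """Apply a simplified implied-name rule.

    properties: dict of explicitly parsed properties for one h-* item.
    root_text: textContent of the h-* root element.
    An explicit p-name, even an empty string, suppresses the implied name.
    """
    if "name" not in properties:
        properties["name"] = [root_text.strip()]
    return properties


# Explicit empty p-name: kept as-is, nothing is implied.
print(finalize_name({"name": [""]}, "Some extraneous text liked this"))
# No p-name authored: the implied rule fills it in.
print(finalize_name({}, "  Hello world  "))
```

The parser therefore has to keep empty strings in its explicit property output; dropping them before this step would re-enable the implied rule.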

uf2 children inside a classic microformats root class name

Status: incorporated into microformats2-parsing

2015-020: (raised by kylewm) What should microformats2 children inside a classic microformats root class name do?

Options:

1. Nothing. Any unattached uf2 children inside a classic microformats root are ignored. Problems:

  • However then there's a possible surprise if/when the author upgrades the classic microformats root to uf2, then all of a sudden all the new uf2 children show-up.
  • Another downside: author adds uf2 markup, can't figure out why nothing is happening (because somewhere up the tree in code they didn't touch is classic microformats that are hiding these unattached uf2 children).

2. Show up in the children collection of the classic microformats root

  • Feels most predictable. When you add uf2 root class names anywhere, they will show up in the JSON output hierarchy.
  • When you convert ancestor classic microformats root class names to uf2 root class names, no surprise in terms of which microformats show up. Same children collection.
  • +1 Thus I'm leaning towards this one, despite the fact that classic microformats never had a concept of generic unattached children. Tantek 04:55, 21 January 2015 (UTC)
  • +1 I think this is the best option. I will implement it and update the wiki once it's in the JavaScript parser. Glenn Jones 11:55, 29 July 2015 (UTC)
  • ++ Consensus at 2015-01-20 - option that presents the least surprises in the most cases.

3. Show up as peers to the classic microformats root. Issue(s)

  • Has the surprise aspect of if/when you convert the classic root class name to a uf2 root class name, the former peers become unattached children.

2015-08-21: Glenn Jones Now implemented in microformat-shiv can be tested at http://microformatshiv.com/editor.html

any h- root class name overrides and stops backcompat root

Status: resolved, awaiting implementation attempt/experience.

2015-020: The presence of any h-* root class name overrides and stops any backcompat parsing of classic microformats root class names on that same element. Tantek 04:55, 21 January 2015 (UTC)

Thoughts?

  • Tom & Kyle - implementable with the same backcompat root flag as needed for restricting backcompat root class name to only seeing backcompat property class names
  • ++ Consensus at 2015-01-20 - option that presents the least surprises in the most cases.
  • I don't think I understand this rule. If I were to stop all parsing of classic microformats in the presence of any h-* root in a document then some of the other rules such as "uf2 children inside a classic microformats root class name" do not make sense. Could this item be expanded and explained a bit more? Glenn Jones 12:13, 29 July 2015 (UTC)
    • added "on that same element" as that was what we were discussing/implying in this issue. Tantek 22:49, 18 September 2015 (UTC)

Example:

<div class="adr h-adr">
  <div class="locality">MF1</div>
  <div class="p-locality">MF2</div>
</div>

Expected parser output:

"items": [{ 
  "type": ["h-adr"],
  "properties": {
    "locality": ["MF2"],
    "name": ["MF1MF2"]
  }
}]


Or with a custom root mf2 classname:

<div class="adr h-acme-address">
  <div class="locality">MF1</div>
  <div class="p-locality">MF2</div>
</div>

Expected parser output:

"items": [{ 
  "type": ["h-acme-address"],
  "properties": {
    "locality": ["MF2"],
    "name": ["MF1MF2"]
  }
}]
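The per-element decision in the two examples above can be sketched as follows. The mapping of classic root names is an illustrative subset, and the function name and the returned backcompat flag are assumptions modelled on the "backcompat root flag" mentioned in the discussion:

```python
import re

MF2_ROOT_RE = re.compile(r"^h-[a-z]+(-[a-z]+)*$")

# Illustrative subset of classic root class names and their mf2 equivalents.
BACKCOMPAT_ROOTS = {"adr": "h-adr", "vcard": "h-card", "hentry": "h-entry"}


def root_types(class_attr):
    """Return (types, backcompat) for one element's class attribute.

    Any h-* root class name on the element wins, and classic root class
    names on that same element are then ignored entirely.
    """
    names = class_attr.split()
    mf2 = [n for n in names if MF2_ROOT_RE.match(n)]
    if mf2:
        return mf2, False
    classic = [BACKCOMPAT_ROOTS[n] for n in names if n in BACKCOMPAT_ROOTS]
    return classic, bool(classic)


print(root_types("adr h-adr"))
print(root_types("adr h-acme-address"))
print(root_types("adr"))
```

The flag returned here is the same one a parser would consult to decide which property class names (mf2 or backcompat) to look for inside that root.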

backcompat classic microformats should only see backcompat properties

Status: incorporated into microformats2-parsing

2015-020: When parsing a microformats vocabulary that indicates a backcompat root class name (and thus an absence of the microformats2 equivalent on the same element), parsers must only look for the backcompat properties that are specified explicitly for that backcompat root class. Tantek 04:04, 21 January 2015 (UTC)

Reasoning: such behaviour was never expected by authors, and crossing a classic microformats root class name with microformats2 property names was never explicitly expected nor specified to work.

Thoughts?

  • Tom & Kyle - implementable with the same backcompat root flag as needed for
  • +1 I think this will help backcompat parsing. I will implement it and update the wiki once it's in the JavaScript parser. The test suite will also need updating. Glenn Jones 11:55, 29 July 2015 (UTC)
  • ++ Consensus at 2015-01-20 - option that presents the least surprises in the most cases.

2015-08-21: Glenn Jones Now implemented in microformat-shiv can be tested at http://microformatshiv.com/editor.html Currently you need to switch on the option "Block overlapping properties from different microformat versions"

microformats2 root class names should only see microformats2 properties

Status: incorporated into microformats2-parsing

2015-020: When parsing a microformats2 root class name, only explicit microformats2 properties should be parsed. Any backcompat property names must be ignored. Tantek 04:04, 21 January 2015 (UTC)

Reasoning: such microformats2 authors should be expected to do all their microformats markup with microformats2 class names - this is a deliberate expectation so that their microformats aren't polluted with other (classic microformats) coincidentally named generic class names.

Thoughts?

  • +1 I think this will help backcompat parsing. I will implement it and update the wiki once it's in the JavaScript parser. The test suite will also need updating. Glenn Jones 11:55, 29 July 2015 (UTC)
  • ++ Consensus at 2015-01-20 - option that presents the least surprises in the most cases.

2015-08-21: Glenn Jones Now implemented in microformat-shiv can be tested at http://microformatshiv.com/editor.html Currently you need to switch on the option "Block overlapping properties from different microformat versions"

implied properties on backcompat parsing unlikely to be intended

Status: incorporated into microformats2-parsing

Since classic microformats had no notion of implied properties, when implied property parsing occurs on backward compat classic microformats root class names, it is unlikely that any implied property (p-name u-url u-photo) was ever intended by the author of the classic microformat. Tantek 02:43, 30 December 2014 (UTC) Examples:

Proposed resolution:

  • Be explicit in implied property parsing that it must only be done for explicit 'h-*' root class name microformats, not for any (back)compat parsing of microformats. Please comment on this proposal with "** comment" on new lines below. Tantek 02:43, 30 December 2014 (UTC)
    • +1 This makes a lot of sense to me. We should strive to parse mf1 as it was intended by the author, and I think you're right that implied rules are unlikely to be what was intended Kylewm 03:22, 30 December 2014 (UTC)
    • +1 I think this will help backcompat parsing, but there are two major things to consider. It may well break some consumer code, as the output for a microformat currently always has the name property and there may not be defensive code to check this is true when we remove the implied name rule for classic microformats. The test suite will also need major updating, as all the test output for classic microformats will change. I will look into implementing this and report back to the wiki. Glenn Jones 12:13, 29 July 2015 (UTC)
    • Also it should be made clear that we are only removing the implied rules from classic microformats parsing and not the value property? Glenn Jones 12:13, 29 July 2015 (UTC)
      • The "parsing for implied properties" section only references name, photo, url properties. Where (in the spec) is the confusion about "value" coming from? Tantek
    • RESOLVED at 2015-01-20 meetup. Tantek 04:09, 21 January 2015 (UTC)
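A minimal sketch of the resolution above, assuming a parser that decides per root element whether to run implied property parsing. The function name and the simplified root class pattern are hypothetical and are not taken from the specification or from any existing parser.

```python
import re

# Simplified pattern for a microformats2 root class name such as "h-card".
# Assumption: the real specification's pattern may be stricter or looser.
MF2_ROOT = re.compile(r"^h-[a-z]+(-[a-z]+)*$")


def implied_properties_allowed(class_attr):
    """Return True only when the element carries an explicit h-* root class.

    Classic (backcompat) roots such as "vcard" or "hentry" never match,
    so implied name/photo/url parsing is skipped for them.
    """
    return any(MF2_ROOT.match(name) for name in class_attr.split())


print(implied_properties_allowed("h-card"))        # explicit mf2 root
print(implied_properties_allowed("vcard"))         # classic root only
print(implied_properties_allowed("h-card vcard"))  # mf2 root alongside a classic root
```

With this check, an element with only a classic root class gets no implied properties, while an element that also carries an explicit h-* root is treated as microformats2.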

2015-08-21: Glenn Jones: Now implemented in microformat-shiv; it can be tested at http://microformatshiv.com/editor.html. Currently you need to switch on the option "Set implied properties by microformat version".

link elements and u- parsing

Status: incorporated into microformats2-parsing

  • Raised by tantek on 2014-07-08 on irc: should the parsing specification for handling u- properties be modified to include the link element? The potential downside is that invisible-metadata-is-considered-harmful; however, all known real world examples of link are semi-visible data (not fully hidden).

There are potential cases for wanting to use link as an alternative to a (and area), such as a whole page where the root html element is an h-card and the properties are included across the page: some in visible data in the body while others are in the head as link elements. Example:

One specific use-case is the semi-visible link rel="shortcut icon" href="..." - which is visible sometimes in browser UI, and also when a user chooses "Add to Home Screen" on a mobile device. Such page level icons may be used as a u-photo or u-logo of the containing h-* object on the html element.

  • http://adactio.com/about/myself/ on 2014-190
    • could use <html class=h-card> - page is all about Jeremy Keith the person
    • icon / logo is only on <link> tag which could use class=u-logo:
      <link rel="shortcut icon apple-touch-icon" type="image/png" href="/icon.png" />

Another specific use-case is a post permalink page, e.g. with <html class=h-entry>

Another use-case is publishing links to PGP/GPG keys from the head. This is currently handled by <link rel=pgpkey>, which is already supported in existing microformats2 rel parsing of link rel elements. Thus there is an (admittedly weak) argument for consistently parsing both <link rel> and <link class="u-*">.

E.g. inside that aforementioned real world <html class=h-entry> post permalink page example,

  • why should <link rel="in-reply-to"> work
  • but not <link class="u-in-reply-to"> ?

The slightly stronger argument for consistency of link handling is that it simplifies the publisher (and parser) model:

  • <a> and <area> work for both rel and class
  • why does <link> only work for rel ?
  • it would be simpler if all three tags just worked (in the same way) for both rel and class

Should the parsing spec be modified to handle these cases? Tom Morris 09:25, 9 July 2014 (UTC)

    • I'm generally in favour. It'd be good to see what other parser developers think. —Tom Morris 10:16, 9 July 2014 (UTC)
    • Adding this to the parsers won't be an issue. The question is: should the door be opened to hidden mf data? Upon further reflection, there seems to be no need to distinguish between rel=property and class=u-property on link elements. So I am in favour, for consistency. Kartik 18:30, 2014-07-09 (EST)
    • RESOLVED at 2015-01-20 meetup. Make link consistent with a.
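A rough sketch of what "make link consistent with a" could look like in a parser, using only Python's standard library. The class name `UPropertyCollector` and the example markup are hypothetical, and the sketch deliberately ignores root scoping and nesting: it collects every u-* property on a, area, and link elements with an href, which a real microformats2 parser would not do.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Elements whose href is used as the value of a u-* property.
# "link" is included here so it is handled the same way as "a" and "area".
URL_ELEMENTS = {"a", "area", "link"}


class UPropertyCollector(HTMLParser):
    """Collect u-* properties from href-bearing elements (flat, no scoping)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.properties = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        href = attrs.get("href")
        if tag not in URL_ELEMENTS or href is None:
            return
        for name in (attrs.get("class") or "").split():
            if name.startswith("u-"):
                # Resolve relative URLs against the (assumed) page URL.
                value = urljoin(self.base_url, href)
                self.properties.setdefault(name[2:], []).append(value)


page = (
    '<html class="h-card"><head>'
    '<link class="u-logo" rel="icon" href="/icon.png" />'
    '</head><body><a class="u-url" href="/">Me</a></body></html>'
)
collector = UPropertyCollector("http://example.com/about/")
collector.feed(page)
print(collector.properties)
```

Because link sits in the same set as a and area, no separate code path is needed for properties published in the head.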

drop alternates collection

Status: incorporated into microformats2-parsing

2015-06-01 by Tantek, per inconsistency noted by Kevin Marks.

As fallout from the adoption and implementation of 'rel-urls' per microformats2-parsing-brainstorming#more_information_for_rel-based_formats, we should:

  • +1 Tantek, as the documenter of this issue and attempting to represent what I think KevinMarks intended with "rels" and "rel-urls" in the original issue (now "include alternates in rels"): we no longer need "alternates", and those with client consuming code have universally indicated that they would rather use rel-urls anyway. Tantek 03:42, 6 June 2015 (UTC)
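To illustrate why consuming code may not need a separate "alternates" collection, here is a small sketch that recovers alternate links from a parsed result's "rel-urls" object. The helper name and the sample data are hypothetical, and the sketch assumes each "rel-urls" entry is keyed by URL and carries a "rels" list plus optional attributes such as "hreflang".

```python
def alternates_from_rel_urls(parsed):
    """Return one dict per URL whose rels include "alternate".

    Assumes parsed["rel-urls"] maps each URL to a dict with a "rels" list
    and optional attributes such as "hreflang" or "type".
    """
    result = []
    for url, info in parsed.get("rel-urls", {}).items():
        if "alternate" in info.get("rels", []):
            entry = {"url": url}
            # Carry over every attribute except the rels list itself.
            entry.update({k: v for k, v in info.items() if k != "rels"})
            result.append(entry)
    return result


# Hypothetical parsed output, for illustration only.
parsed = {
    "items": [],
    "rels": {"alternate": ["http://example.com/fr"], "me": ["http://example.com/me"]},
    "rel-urls": {
        "http://example.com/fr": {"rels": ["alternate"], "hreflang": "fr"},
        "http://example.com/me": {"rels": ["me"]},
    },
}
print(alternates_from_rel_urls(parsed))
```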

see also

microformats2-parsing-issues was last modified: Tuesday, April 25th, 2017
