microformats2-parsing-issues: Difference between revisions

Revision as of 13:38, 22 October 2013

This article is a stub. You can help the microformats.org wiki by expanding it.

This page is for documenting issues with the microformats2-parsing specification.

issues

When to collapse whitespace in properties

The spec doesn’t explicitly require whitespace to be collapsed or not. The official mf2 test suite requires it to be collapsed.

Reasons why whitespace shouldn’t be collapsed:

Plaintext property representations of syntax-highlighted code, poetry and song lyrics require whitespace to be present
Whether or not whitespace is an important part of the content being parsed is determined by the CSS white-space property and CANNOT be inferred from HTML markup alone
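To make the trade-off concrete, here is a minimal sketch of what collapsing does to whitespace-sensitive content. The function name `collapse_whitespace` is hypothetical and is not taken from the spec or the test suite; it only illustrates the common "runs of whitespace become one space" behaviour.

```python
import re


def collapse_whitespace(text):
    """Collapse every run of whitespace into one space and trim both ends."""
    return re.sub(r"\s+", " ", text).strip()


# Hypothetical plaintext value of a p-* property wrapping two lines of verse.
poem = "Roses are red,\n  Violets are blue"

collapsed = collapse_whitespace(poem)
# The line break and the indentation are gone after collapsing.
print(repr(poem))
print(repr(collapsed))
```

A consumer that receives only the collapsed string has no way to recover the original line structure, which is the concern raised for code, poetry and lyrics.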

mixture of microformats2 and classic microformats classnames on different elements

Some sites in the wild have mistakenly combined classic mf and mf2 markup in ways which misrepresent the content if parsed in BC mode.

Typically this is caused by putting classic and mf2 classnames for the same vocabulary on different elements, e.g.:

<body class="hentry">
 <article class="h-entry">
  <h1 class="p-name"></h1>
 </article>
</body>

Sites where this has been observed:

Discussion:

As far as I can tell, the problems in all of these examples were caused by mf2 markup being injected by a WordPress plugin while classic mf classnames were present further up the DOM in the themes. When parsed in compatibility mode, the classic mf classnames are transformed into mf2 classnames, making the original mf2 classnames look like children of empty items.
Turns out this isn’t theme-specific: WordPress injects hentry via PHP [1]. The bug with the WordPress mf2 plugin is resolved as of 2013-10-22 --bw 13:38, 22 October 2013 (UTC)
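The effect described above can be reproduced with a small sketch. It is not a real microformats parser: `RootCollector`, `parse_roots` and the two-entry `CLASSIC_ROOTS` mapping are hypothetical names introduced for illustration, and the code only tracks the nesting of root items (it assumes well-formed markup with no void elements).

```python
from html.parser import HTMLParser

# Assumption: a tiny subset of the classic-to-mf2 root classname mapping.
CLASSIC_ROOTS = {"hentry": "h-entry", "vcard": "h-card"}


class RootCollector(HTMLParser):
    """Records how microformat root items nest, optionally in BC mode."""

    def __init__(self, backcompat):
        super().__init__()
        self.backcompat = backcompat
        self.items = []   # top-level items
        self.stack = []   # one entry per open element: an item dict or None

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        types = [c for c in classes if c.startswith("h-")]
        if self.backcompat and not types:
            # Classic classnames only count when no mf2 root is on the element.
            types = [CLASSIC_ROOTS[c] for c in classes if c in CLASSIC_ROOTS]
        item = None
        if types:
            item = {"type": types, "children": []}
            parent = next((i for i in reversed(self.stack) if i is not None), None)
            (parent["children"] if parent else self.items).append(item)
        self.stack.append(item)

    def handle_endtag(self, tag):
        if self.stack:
            self.stack.pop()


def parse_roots(markup, backcompat):
    collector = RootCollector(backcompat)
    collector.feed(markup)
    collector.close()
    return collector.items


MARKUP = """<body class="hentry">
 <article class="h-entry">
  <h1 class="p-name"></h1>
 </article>
</body>"""

print(parse_roots(MARKUP, backcompat=False))
print(parse_roots(MARKUP, backcompat=True))
```

With `backcompat=False` the `h-entry` on the article is the only top-level item. With `backcompat=True` the body becomes an outer `h-entry` with no properties of its own, and the real entry appears as its child.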

e- parsing iframe srcdoc

Proposal: addition of a new e-* parsing rule for iframe elements with srcdoc attributes, e.g.:

<div class="h-entry">
 <iframe class="e-content" srcdoc="<p>A paragraph of HTML with &quot;quoted quotes&quot; &amp;amp; doubly quoted ampersands</p>" />
</div>

{
 "items": [{
  "type": ["h-entry"],
  "properties": {
   "content": ["<p>A paragraph of HTML with &quot;quoted quotes&quot; &amp; doubly quoted ampersands</p>"]
  }
 }]
}

This would allow, for example, HTML comments to be sandboxed inside iframes but still parsable as microformats.

I believe the correct processing would be to leave &quot; entities as they are but to unescape any doubly-escaped ampersands.
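A minimal sketch of that processing, assuming the parser has access to the raw, still-escaped source text of the srcdoc attribute (many HTML parsers decode attribute values before exposing them, so this is an assumption). The function name `srcdoc_to_html` is hypothetical.

```python
def srcdoc_to_html(raw_srcdoc):
    """Undo one level of ampersand escaping and leave other entities alone.

    `raw_srcdoc` is assumed to be the attribute value exactly as written
    in the page source, before any entity decoding.
    """
    return raw_srcdoc.replace("&amp;", "&")


raw = (
    "<p>A paragraph of HTML with &quot;quoted quotes&quot; "
    "&amp;amp; doubly quoted ampersands</p>"
)
print(srcdoc_to_html(raw))
```

Applied to the srcdoc value from the example, this yields the string shown in the proposed JSON output: the `&quot;` entities survive and `&amp;amp;` becomes `&amp;`.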

- Is there any use case for that? —Tom Morris 12:32, 14 September 2013 (UTC)
- +1 we need documentation of use case and existing sites publishing iframe srcdoc like this - Tantek 00:47, 15 September 2013 (UTC)

resolved

e- and p- escaping levels

The fact that the parsed value of any element with .e-* is at a different level of escaping to the parsed values of p-*, dt-* etc. without any indication of how the property was parsed in the output is a security problem. For example:

input:

<p class="h-card">
 <span class="p-name">&lt;tag&gt;</span>
</p>

output:

{
    "items": [
        {
            "type": [
                "h-card"
            ],
            "properties": {
                "name": [
                    "<tag>"
                ]
            }
        }
    ]
}

input:

<p class="h-card">
 <span class="e-name">&lt;tag&gt;</span>
</p>

output:

{
    "items": [
        {
            "type": [
                "h-card"
            ],
            "properties": {
                "name": [
                    "&lt;tag&gt;"
                ]
            }
        }
    ]
}

As a parser developer, the most straightforward way I can think of solving this is to add an option (enabled by default) which encodes HTML special characters on all non e-* properties, so the developer knows that all property values are going to be at the same level of escaping. --bw 20:00, 15 June 2013 (UTC)
- Your suggestion of auto-HTML-encoding p-*/u-*/dt-* property values is the most sensible I think. I would NOT make it an option, as it makes sense to write consistent microformats2 consumers. - Tantek 07:18, 5 July 2013 (UTC)
- Can you think of any existing apps/consumers of microformats2 via the parser that would break? What would indieweb comments parsers do? - Tantek 07:18, 5 July 2013 (UTC)
  - The only breakage which might occur would be over-encoding of non e-* properties, but I’ll release this update as v0.2.0 and warn people about the changes. The worst thing which could happen is that some comments look a bit weird, as opposed to the current worst possible scenario of easy XSS attacks --bw 12:55, 5 July 2013 (UTC)
  - We should also decide exactly which characters get encoded — just angle brackets, or quotes/ampersands as well? --bw 12:55, 5 July 2013 (UTC)
  - I am not sure about this, it seems more like a helper function than a core feature of the parser. Personally I would like to store data as text and encode only when I am going to use it and I know the format it is going to be used in. --Glenn Jones 9:54, 14 July 2013 (UTC)
  - After the discussion on the indiewebcamp IRC with Barnaby Walters I now understand the XSS issue that this change is trying to address. A rogue author could include HTML with scripts to execute an XSS attack. These could be masked by switching prefixes, i.e. p-* to e-*, on a well-used property. As the consumer does not see the prefix in the JSON output, they have no idea whether a property will contain HTML or text. I will update my two parsers and the test suite --Glenn Jones 8:02, 17 July 2013 (UTC)
- So what about an author setting a property to e-* when it would normally be p-*, dt-* or u-*, i.e.

<div class="h-card"><p class="e-name"><script> alert('xss test') </script></p></div>

- per microformats2 parsing discussion 2013-09-14, parsers should never automatically attempt to HTML-special-characters encode - as that would provide the client of the parser a false sense of security. It's *always* up to client code to escape any text being output to HTML *at the moment it is output to HTML* and never before, because they can never trust that any text from storage/elsewhere has for sure been escaped or not. - Tantek 18:07, 17 October 2013 (UTC)
- Should we not encode e-* as well, so that the consumer can decode at their own risk? --Glenn Jones 18:42, 21 July 2013 (UTC)
  - No, never, per above point from microformats2 parsing discussion 2013-09-14 - Tantek 18:07, 17 October 2013 (UTC)
- See microformats2 parsing discussion 2013-09-14 etherpad: https://etherpad.mozilla.org/microformats2parsing for more details on the resolution to this issue - and incorporate here (then move to resolved section below). - Tantek 18:07, 17 October 2013 (UTC)
Resolved by changes to the parsing spec: all properties are plaintext (non-HTML escaped), e-* properties result in a dictionary with value = plaintext version, html = raw HTML version
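A minimal sketch of what this resolution means for consuming code, in line with the point above that escaping belongs at the moment of output. The function name `render_property` is hypothetical. The sketch assumes p-* values arrive as plain strings and e-* values as dictionaries with `value` and `html` keys, as described in the resolution.

```python
from html import escape


def render_property(value):
    """Return a string that is safe to place inside HTML element content."""
    if isinstance(value, dict) and "html" in value:
        # e-* property: take the plaintext version and escape it.
        # Outputting value["html"] instead would need sanitising first,
        # which is outside the scope of this sketch.
        return escape(value["value"])
    # p-*, u-*, dt-* property: unescaped plaintext, so escape on output.
    return escape(value)


print(render_property("<tag>"))
print(render_property({"value": "<tag>", "html": "&lt;tag&gt;"}))
```

Because both kinds of property are escaped at output time, an author switching a property from p-* to e-* cannot change what reaches the page unescaped.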

br hr empty string

The parsing rule 'else if br.p-x or hr.p-x, then return "" (empty string)' for p-* can cause any code consuming the API to become quite bloated. It means that you have to test every array value to see if it's an empty string. It is also unclear to me what the purpose of this mark-up pattern is. --Glenn Jones
- Upon reconsidering this, I agree with you, this is an unlikely use case. If a publisher wants to explicitly set an empty property "p-foo" they can simply write <span class="p-foo"></span> which looks explicit. Whereas BR and HR tags are often just presentational, so we should both not encourage usage of them for semantics, and anyone that did use them would be subject to likely loss of semantics upon a redesign (that got rid of those particular BR and HR tags). I'm going to remove them from the parsing spec. - Tantek 15:29, 10 February 2013 (UTC)

datetime examples without T delimiter

The examples in the wiki microformats-2 pages such as h-entry and h-entry had datetime values without the 'T' delimiter between date and time, i.e.

<time class="dt-published" datetime="2013-06-13 12:00:00">13<sup>th</sup> June 2013</time>

I have updated the pages. As far as I know this is a new pattern for dates. Was it a mistake in the examples, or is it a new datetime pattern?

- The HTML5 "time" element, and "datetime" attribute allow for space " " as a separator between date and time as well as "T", thus we allow it for microformats as well. The " " separator is preferred as the date and time are more readable when separated by a space. The examples noted in those specs deliberately use this. - Tantek 18:48, 15 July 2013 (UTC)
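As an illustration for consumers, here is a small sketch showing that both separators can be treated as the same instant. It uses Python's standard `datetime.fromisoformat`, which accepts either separator; the function name `parse_dt_value` is hypothetical and this is not how any particular microformats parser is implemented.

```python
from datetime import datetime


def parse_dt_value(value):
    """Parse a dt-* value that uses either ' ' or 'T' between date and time."""
    return datetime.fromisoformat(value)


with_space = parse_dt_value("2013-06-13 12:00:00")
with_t = parse_dt_value("2013-06-13T12:00:00")
print(with_space == with_t)
```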

rel-alternate absent optional attributes

What should rel-alternate parsing do when one of the optional attributes specified (hreflang or media or both) is not there? The options seem to be:
1. leave the corresponding key out of the alternate JSON object
  - This one. Leave the corresponding key out.
2. include the corresponding key in the alternate JSON object, but set the value to the JSON null object
3. include the corresponding key in the alternate JSON object, but set the value to a blank string
4. something I haven't thought of

I haven't checked the existing implementations, but Barnaby said he's not sure what the appropriate way to deal with it is either. —Tom Morris 15:41, 9 August 2013 (UTC)
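Option 1 can be sketched as follows. The function name `build_alternate` and the `url` and `rel` key names are assumptions made for illustration; only the `hreflang` and `media` attributes come from the discussion above.

```python
def build_alternate(url, rel, attrs):
    """Build an alternate object, omitting keys for attributes that are absent.

    `attrs` is a dict of the attributes found on the link element.
    """
    alternate = {"url": url, "rel": rel}
    for key in ("hreflang", "media"):
        if attrs.get(key):
            # Only present, non-empty attributes produce a key.
            alternate[key] = attrs[key]
    return alternate


print(build_alternate("http://example.com/fr", "home", {"hreflang": "fr"}))
print(build_alternate("http://example.com/print", "home", {}))
```

Consumers can then test for the presence of a key instead of having to compare against null or an empty string.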

rel-alternate and type attribute

Should rel-alternate parsing also pick up the type attribute? It’s fairly widely used, e.g. for Atom feeds.
- Numerous existing sites/pages have various rel-alternate uses with a type attribute for feeds/APIs so that's good enough to add this for help with discovery in general. Rel parsing updated. - Tantek 00:47, 15 September 2013 (UTC)

@@ Line 30: / Line 30: @@
 * http://notizblog.org/2013/06/18/the-rise-of-the-indieweb/ (fixed)
-As far as I can tell, the problems in all of these examples were caused by mf2 markup being injected by a wordpress plugin, but classic mf classnames being present further up the DOM in the themes. When parsed in compatibility mode, the classic mf classnames are transformed into mf2 classnames, making the original mf2 classnames look like children of empty items.
+Discussion:
+* As far as I can tell, the problems in all of these examples were caused by mf2 markup being injected by a wordpress plugin, but classic mf classnames being present further up the DOM in the themes. When parsed in compatibility mode, the classic mf classnames are transformed into mf2 classnames, making the original mf2 classnames look like children of empty items.
+* Turns out this isn’t theme-specific, WordPress injects hentry via PHP [http://indiewebcamp.com/irc/2013-10-22/line/1382448759]. The bug with the wordpress mf2 plugin is resolved as of [http://indiewebcamp.com/irc/2013-10-22/line/1382449035 2013-10-22] --[[User:Barnabywalters|bw]] 13:38, 22 October 2013 (UTC)
 === e- parsing iframe srcdoc ===
