{{DISPLAYTITLE:microformats2 parsing brainstorming}}
This page is for brainstorming, discussion, and other questions and explorations about [[microformats2]] parsing.

For the microformats2 parsing algorithm, see:
* [[microformats2-parsing]]

For filing issues / problems with microformats2-parsing, see:
* https://github.com/microformats/microformats2-parsing/issues
** [[microformats2-parsing-issues|Resolved issues before 2016-06-20]]

__TOC__
== Parse img alt ==
Per https://github.com/microformats/microformats2-parsing/issues/2, any u-* property (e.g. u-photo, u-featured) that extracts a 'src' attribute from an img tag currently loses the associated 'alt' text alternative. If the consuming application later wants to display that u-* property as an img, it has to either omit the text alternative or synthesize a fake one.

It is desirable to maintain the association between the image src and alt from the original markup, through the parsing process, up until a consuming application wishes to re-present the image with the text alternative.

There are a number of possibilities / approaches here worth brainstorming:

=== Include alt property in parent object ===
# explicit authoring: require the author to use a new 'p-alt' property on the image to cause parsing and extraction of the text alternative.
#* Problem(s): fails for multiple images, some of which may or may not have alt attrs or corresponding p-alt properties (and fragile: forgetting one p-alt throws off the parallel lists of u-* and p-alt).
# implicit p-alt: for every img that is parsed for a u-* property, the parser could generate a p-alt property with the alt text as its value.
#* Problem(s): fragile again for similar reasons: not all u-*s may be on img elements, and not all imgs in the source may have alt attrs.
# implicit p-alt only for implied u-photo
#* This is better since there can only be one implied u-photo, and thus if there is a p-alt, it must be associated with the one u-photo.
#* Problem(s): does not work for other u-* image properties, e.g. u-featured.
<code><nowiki><div class="h-entry"><img src="http://example.com/photo.jpg" alt="Example" class="u-photo p-alt"></div></nowiki></code>

<code><nowiki>{"type":["h-entry"],"properties":{"photo":["http://example.com/photo.jpg"],"alt":["Example"]}}</nowiki></code>
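The fragility of parallel u-* and p-alt lists can be illustrated with a small sketch. The parsed data and the helper name <code>pairByIndex</code> below are made up for illustration; they are not output from any real parser.

```javascript
// Hypothetical parsed properties for an h-entry with three images,
// where only the first and third image carry alt text (and so a p-alt).
const properties = {
  photo: ["a.jpg", "b.jpg", "c.jpg"],
  alt: ["text about a", "text about c"],
};

// Index-based pairing is all a consumer can do with two flat lists.
function pairByIndex(props) {
  return props.photo.map((url, i) => ({ url, alt: props.alt[i] }));
}

const paired = pairByIndex(properties);
// paired[1] now carries the alt text that belongs to c.jpg,
// and paired[2] has no alt text at all.
```

Once a single image lacks alt text, every later pairing is shifted, which is the problem noted for the p-alt approaches.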
=== Make photo property an object ===
1. use "h-image" on any u-* on img elements to imply a structure with paired photo and 'name' text alternative, e.g. <blockquote><code><img src="a.jpg" alt="text about a" class="u-featured h-image"/></code></blockquote> which would result in a u-featured property with one value: an h-image structure that itself has the implied properties u-photo "a.jpg" and p-name "text about a". Similarly, the author can use the object tag for the same result: <blockquote><code><object data="a.jpg" class="u-featured h-image">text about a</object></code></blockquote> In either case the same microformats JSON would be generated, which is correct, as in both cases there is an image with a fallback text alternative. The specific HTML used should not matter; the semantic of pairing the image with the text alternative is communicated the same way for both.
* Challenge: requires author use of additional classname "h-image".
* Benefit: does not require a change to the parsing algorithm.
<source lang=html4strict>
<div class="h-entry">
  <img src="http://example.com/eg.jpg" alt="Photo of an example" class="u-featured h-image">
</div>
</source>
<source lang=javascript>
{
  "type": ["h-entry"],
  "properties": {
    "featured": [{
      "type": ["h-image"],
      "properties": {
        "photo": ["http://example.com/eg.jpg"],
        "name": ["Photo of an example"]
      }
    }]
  }
}
</source> [http://pin13.net/mf2/?id=20160719001154920]
2. have u-* on an <img> automatically create an object if there is a non-empty 'alt' attribute. <br/>If a u-* property is parsed on an <img> element with a non-empty 'alt' attribute, then: <br/>
Create a structure similar to the e-content nested structure that provides the URL as the "value" and the text alternative as the "alt".
* Advantage: no additional microformats markup needed from author
* Challenge: Many (most?) existing published u-photo properties will now return an object instead of a string, and consuming applications may not be expecting an object for a photo
** Mitigation: If this is done as an explicit parser library upgrade, consuming applications can decide when to take the upgrade, and can fix their u-photo handling to accept either a string or an object before upgrading their microformats2 parsing library instance.
<source lang=html4strict>
<div class="h-entry">
  <img src="http://example.com/eg.jpg" alt="Photo of an example" class="u-featured">
</div>
</source>
<source lang=javascript>
{
  "type": ["h-entry"],
  "properties": {
    "featured": [{
      "value": "http://example.com/eg.jpg",
      "alt": "Photo of an example"
    }]
  }
}
</source>
... more brainstorming needed
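A minimal parser-side sketch of brainstorm "2.", assuming the img element has already been reduced to a plain record with <code>src</code> and <code>alt</code> fields. The function name is hypothetical, and URL resolution is left out.

```javascript
// Sketch only: `img` stands in for a parsed <img> element.
// Returns an object when there is non-empty alt text, otherwise the URL string.
function parseImgForUrlProperty(img) {
  const hasAlt = typeof img.alt === "string" && img.alt !== "";
  if (hasAlt) {
    return { value: img.src, alt: img.alt };
  }
  // Missing or empty alt: keep the existing behaviour and return the plain URL.
  return img.src;
}
```

An img without alt text, or with an empty alt attribute, still yields a plain string, so existing markup of that kind would parse as before.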
=== img alt thoughts ===
Thoughts about img alt brainstorm proposals. Feel free to offer counterpoints with nested items and/or alternative preferences/opinions with (potentially multiple) top level items!
<div class="discussion">
* Tantek: I am '''leaning towards "Make photo property an object" brainstorm "2."''' because it feels more "automatic" and thus provides lower friction to more accessibility. Less (author) work for "alt" information to get passed through to the JSON result, and thus more potentially re-usable by consuming applications that want to preserve or re-emit the pairing of a photo and its fallback text alternative. -- [[User:Tantek|Tantek]] 00:53, 19 July 2016 (UTC)
* Aaron: I am leaning towards ''2'' because it takes less work on the part of publishers as well as consumers. From the publisher POV, if they add the alt attribute, that should be all they need to do, it seems odd to make them do additional work to make that show up in the parsed result. From the consumer side, some implementations will not need changing since when looking for a string value, they already use either the string directly or look for the "value" of the property if it's an object. Making consumers handle a new h- object just to read alt text seems overkill.
** Additionally, if the alt attribute is an empty string, this should be considered the same as if it were missing, so that the photo value will be the URL string rather than the object in this case as well
* Kevin: 2 makes sense to me as well, as this is a very specific need. If we want an image object with more substructure as 1 implies, that should be a new object type that follows the [[process]] - there is a case for that based on usage of figure/figcaption etc. but caption is not alt, and using name for it implies that it is. [[User:Kevin Marks|Kevin Marks]] 01:50, 19 July 2016 (UTC)
* Bear: The thoughts given above for option 2 make the most sense as a library writer and consumer, tying this change to a parser implementation's major version change will (should) give everyone notice and time to adjust
...
* (unanimity copied to GitHub)
</div>
When it looks like thoughts are naturally converging, we should take that emergent convergence back to the github thread for proper back/forth discussion and figuring out of details.

https://github.com/microformats/microformats2-parsing/issues/2
* [[User:Tantek|Tantek]] 22:10, 1 August 2016 (UTC): Thanks Aaron, Kevin, Bear - based on the unanimous support of one particular brainstorm proposal, that proposal has been moved to the GitHub issue, and any follow-up about it (corrections, refinements, iterations) should occur there:
** https://github.com/microformats/microformats2-parsing/issues/2#issuecomment-236708854
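On the consumer side, handling "string or object" for a photo value could look roughly like the sketch below. The helper names are hypothetical; the only assumption about the data is the "value" / "alt" shape brainstormed above.

```javascript
// Accepts either a plain URL string or a { value, alt } object.
function photoUrl(photo) {
  return typeof photo === "string" ? photo : photo.value;
}

// Returns the text alternative when the parser provided one, otherwise null,
// leaving the decision about a fallback to the consuming application.
function photoAlt(photo) {
  if (typeof photo === "string") return null;
  return typeof photo.alt === "string" ? photo.alt : null;
}
```

A consumer that already reads "value" from object-valued properties would only need something like <code>photoAlt</code> in addition.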
== Parse language information ==
Raised by [[User:VoxPelli|VoxPelli]] 18:04, 23 July 2015 (UTC)
* 2016-060: Update: and parse "id" attribute. [[User:Tantek|Tantek]] 16:39, 29 February 2016 (UTC) (see Additionally below)
* 2016-07-13: Update: created [https://github.com/microformats/microformats2-parsing/issues/3 GitHub issue] for this brainstorm [[User:VoxPelli|VoxPelli]] 14:34, 13 July 2016 (UTC)

Currently there’s no way to tell the language of parsed microformats, even if those microformats have been marked up with HTML "lang" attributes.

There are examples in the wild of people marking up pages in such a way:
* [http://voxpelli.com/ VoxPelli.com] has a "lang" attribute on the h-entry of his [http://voxpelli.com/2011/03/sista-dagen-p-good-old/ Swedish articles] to signify that the article is Swedish even though the rest of the site is English.
* Stephanie [http://climbtothestars.org/archives/2013/09/17/basic-bilingual-1-0-plugin-for-wordpress-blog-in-more-than-one-language/ uses a WordPress plugin] that adds summaries in other languages at the start of her content.
* [https://seblog.nl/ Seblog.nl] has a <code>lang="nl"</code> attribute on the <code><html></code> of each page, and uses a <code>lang="en"</code> on the p-name, p-summary and e-content of an h-entry if the CMS field 'lang' is set to "en" (or any language other than "nl"). This signifies that the article is English, but the rest of the page is Dutch (including the textual representation of the date). ([https://seblog.nl/2017/01/02/2/screenshots example])

Proposal is to add a new "lang" keyword to h-* and e-* objects so that the following example:
<source lang=html4strict>
<div class="h-entry" lang="sv">
  <h1 class="p-name">En svensk titel</h1>
  <div class="e-content" lang="en">With an <em>english</em> summary</div>
  <div class="e-content">Och <em>svensk</em> huvudtext</div>
</div>
</source>
Would be parsed into something like:
<source lang=javascript>
{
  "type": ["h-entry"],
  "lang": "sv",
  "properties": {
    "name": ["En svensk titel"],
    "content": [
      {
        "lang": "en",
        "html": "With an <em>english</em> summary",
        "value": "With an english summary"
      },
      {
        "html": "Och <em>svensk</em> huvudtext",
        "value": "Och svensk huvudtext"
      }
    ]
  }
}
</source>
This was [http://indiewebcamp.com/irc/2015-07-23#t1437667712078 brainstormed on the IndieWebCamp IRC channel], where the example above came up.
* Pull request for implementation in microformat-node added 2015-07-23 https://github.com/glennjones/microformat-node/pull/23
** Closed 2015-09-08 because the library has changed and parsing is now handled by microformat-shiv. New issue opened there: https://github.com/glennjones/microformat-shiv/issues/22
* Issue around implementation in php-mf2 added 2016-05-07 https://github.com/indieweb/php-mf2/issues/96
** Released 2017-05-27 in v0.3.2 behind a feature flag.

Additionally: consider the same for "id" attributes (use-case: rel=feed local discovery of a nested h-feed on the home page), specifically, parsing the first instance of any "id" attribute (ignoring later duplicate id attribute values on any subsequent elements).

And alternatively: consider parsing as "html-id" and "html-lang" prefixed properties in the parsed result, e.g.
* '''Q:''' Why parse with the "html-" prefix?
* '''A:''' "html-lang and html-id to avoid confusing them with a possible actual property p-lang or p-id (which we don't have but might / could, especially from a vocabulary agnostic parser perspective)" https://chat.indieweb.org/microformats/2017-05-30#t1496166813294000
<source lang=html4strict>
<div class="h-entry" lang="sv" id="postfrag123">
  <h1 class="p-name">En svensk titel</h1>
  <div class="e-content" lang="en">With an <em>english</em> summary</div>
  <div class="e-content">Och <em>svensk</em> huvudtext</div>
</div>
</source>
Would be parsed into something like:
<source lang=javascript>
{
  "type": ["h-entry"],
  "html-id": "postfrag123",
  "html-lang": "sv",
  "properties": {
    "name": ["En svensk titel"],
    "content": [
      {
        "html-lang": "en",
        "html": "With an <em>english</em> summary",
        "value": "With an english summary"
      },
      {
        "html": "Och <em>svensk</em> huvudtext",
        "value": "Och svensk huvudtext"
      }
    ]
  }
}
</source>
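The "first instance of any id" rule could be sketched as follows. This is only an illustration: it assumes the parser already has the microformat root elements in document order as plain records, which simplifies away ids on other elements, and the function name and input shape are hypothetical.

```javascript
// `roots` is a hypothetical list of { type, id } records in document order.
// Only the first element carrying a given id gets an "html-id" in the output.
function assignHtmlIds(roots) {
  const seen = new Set();
  return roots.map((root) => {
    const out = { type: root.type };
    if (root.id && !seen.has(root.id)) {
      seen.add(root.id);
      out["html-id"] = root.id;
    }
    return out;
  });
}
```

A later element that repeats an id is still parsed; it simply does not receive an "html-id".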
=== Language inheritance ===
If the "lang" attribute is not specified on a particular element, it is inherited from the nearest ancestor that specifies one (or, failing that, from the HTTP Content-Language header).

HTML5: https://www.w3.org/TR/html5/dom.html#the-lang-and-xml:lang-attributes<br>
HTML4: https://www.w3.org/TR/html4/struct/dirlang.html#h-8.1.2

Proposal: Determine and include the inherited "lang" value on *every* microformat object that directly specifies a lang or that has an ancestor that does, e.g. if <html lang="en">, then every object in the output will have "lang": "en" unless a nearer lang attribute overrides it.
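The inheritance lookup described above can be sketched with plain records that have a <code>lang</code> field and a <code>parent</code> pointer. This is a simplification, not the HTML algorithm in full: a real implementation would walk DOM ancestors, and this sketch treats an empty lang the same as a missing one. The function and parameter names are made up.

```javascript
// Walk from the node up through its ancestors and return the first lang found.
// `contentLanguage` is the optional value of an HTTP Content-Language header.
function resolveLang(node, contentLanguage) {
  for (let current = node; current; current = current.parent) {
    if (current.lang) return current.lang;
  }
  return contentLanguage || undefined;
}

// Usage: <html lang="en"> containing an h-entry with lang="sv" and a plain e-content.
const html = { lang: "en", parent: null };
const entry = { lang: "sv", parent: html };
const content = { parent: entry };
const entryContentLang = resolveLang(content); // nearest ancestor with lang is the h-entry
```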
=== Pronouns in different languages ===
Language is also useful context when defining [[pronouns]], discussed a bit [https://github.com/idno/Known/pull/1426#issuecomment-217626923 here].
<source lang=html4strict>
<div class="h-card" lang="en">
  <span class="p-x-pronoun-nominative">he</span> /
  <span class="p-x-pronoun-oblique">him</span> /
  <span class="p-x-pronoun-possessive">his</span>
</div>
</source>
would parse as
<source lang=javascript>
{
  "type": ["h-card"],
  "lang": "en",
  "properties": {
    "x-pronoun-nominative": ["he"],
    "x-pronoun-oblique": ["him"],
    "x-pronoun-possessive": ["his"]
  }
}
</source>
It could also be useful to specify multiple languages within a single h-card (pardon me if I butcher Swedish pronouns)
<source lang=html4strict>
<div class="h-card">
  <span lang="en" class="p-x-pronoun-nominative">he</span> /
  <span lang="en" class="p-x-pronoun-oblique">him</span> /
  <span lang="en" class="p-x-pronoun-possessive">his</span>
  <span lang="sv" class="p-x-pronoun-nominative">han</span> /
  <span lang="sv" class="p-x-pronoun-possessive">hans</span> /
  <span lang="sv" class="p-x-pronoun-oblique">honom</span>
</div>
</source>
which might parse as
<source lang=javascript>
{
  "type": ["h-card"],
  "properties": {
    "x-pronoun-nominative": [{"lang": "en", "value": "he"}, {"lang": "sv", "value": "han"}],
    "x-pronoun-possessive": [{"lang": "en", "value": "his"}, {"lang": "sv", "value": "hans"}],
    "x-pronoun-oblique": [{"lang": "en", "value": "him"}, {"lang": "sv", "value": "honom"}]
  }
}
</source>
or alternatively, we could introduce a new microformat h-x-pronoun to wrap a set of pronouns
<source lang=html4strict>
<div class="h-card">
  <div class="p-x-pronoun h-x-pronoun" lang="en">
    <span class="p-nominative">he</span> /
    <span class="p-oblique">him</span> /
    <span class="p-possessive">his</span>
  </div>
  <div class="p-x-pronoun h-x-pronoun" lang="sv">
    <span class="p-nominative">han</span> /
    <span class="p-possessive">hans</span> /
    <span class="p-oblique">honom</span>
  </div>
</div>
</source>
parsed as
<source lang=javascript>
{
  "type": ["h-card"],
  "properties": {
    "x-pronoun": [{
      "type": ["h-x-pronoun"],
      "lang": "en",
      "properties": {
        "nominative": ["he"],
        "oblique": ["him"],
        "possessive": ["his"]
      }
    }, {
      "type": ["h-x-pronoun"],
      "lang": "sv",
      "properties": {
        "nominative": ["han"],
        "possessive": ["hans"],
        "oblique": ["honom"]
      }
    }]
  }
}
</source>
<div class="discussion">
Discussion:
* [[User:Kylewm|Kylewm]] Including the "lang" attribute in h- and e- properties makes a ton of sense to me.
* [[User:Kylewm|Kylewm]] I like the idea of introducing an h-x-pronoun container that can define all the different pronoun forms for a particular language
* [[User:Zegnat|Martijn]] Turns out that the neat summary of different p-x-pronoun-* per language from the second example is never going to happen. Objective case (here <i>oblique</i>) exists in English and then suddenly doesn’t exist at all in e.g. German.
* [[User:Zegnat|Martijn]] The container is still a viable option because it gives a clear language split. Within the container, completely different case names would be used though. German would get properties for nominative, accusative, genitive, dative, and possessive cases. Every language will require its own documentation for properties, and some like Finnish would require up to 13 properties.
* [[User:Zegnat|Martijn]] I propose an entirely different way of marking up pronouns. See [[h-card-brainstorming]].
* ...
</div>
== Canonicalization of datetime output ==
Status: resolved, awaiting implementation attempt/experience.

It would be useful to choose a (more) uniform output format for datetimes to make it easier for users of the parser to consume datetimes. Microformats2 parsers already do sophisticated pattern matching to recognize date vs. time vs. datetimes, so converting this to any specific format should not add overhead.

== Add meta http-equiv to microformats2 parsing model ==
Status: disagreement, awaiting implementation attempt/experience.

Similar to document level parsing of <code>rel</code> attributes, it makes sense simultaneously to parse <code><meta http-equiv></code> elements, perhaps treating "Status" in a special way (only using first number (sequence of digits) for its "value").
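The special handling suggested for "Status" (first sequence of digits only) might be sketched as below. The function name is hypothetical, and returning the digits as a string rather than a number is an assumption made here, not something the brainstorm specifies.

```javascript
// Extract the first run of digits from a meta http-equiv="Status" content value.
// Returns the digits as a string, or null when the content holds no digits.
function statusValue(content) {
  const match = /\d+/.exec(content);
  return match ? match[0] : null;
}

const notFound = statusValue("404 Not Found");
```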
<div class="discussion">
* Interesting thought. Are you suggesting a top level "http-equivs:" collection similar to "rels:" in the parsed output? Should we consider "metas:" instead or in addition? [[User:Tantek|Tantek]] 19:31, 6 January 2015 (UTC)
* What's the use case for this? Also, http-equiv on its own is useless. http-equiv is only a supplement to the data stored in headers. And headers aren't always there: what happens in the context of someone debugging a page who pastes the source into the textarea of an mf2 parser? Without a compelling use case for including headers (and then over-riding some of them with http-equivs), I'm not sure why an implementor want to do this. -- [[User:TomMorris|Tom Morris]] 00:25, 8 May 2015 (UTC)
</div>
</div>
==MIME type==
See [[microformats2-mime-type]]
----
*** 7. Class name 'value' can, by its simpler naming nature, be expected to potentially collide with other web designer class name usage though we have no documentation/mention thereof. We could consider a renaming, or providing of alternative, such as 'value-string', or 'value-content', etc. However, let's keep that as a backup plan to use only if/when evidence is presented that we need to.
** Conclusions: for now, in microformats-2, keep using 'value' and 'value-title' as defined in the [[value-class-pattern]], and add the additional (obvious) interpretation that [[value-class-pattern#Date_and_time_parsing|value class pattern: date and time parsing]] applies to all 'dt-' properties. - [[User:Tantek|Tantek]] 12:12, 10 October 2011 (UTC)
== Incorporated 2015-05-28 ==
The following brainstorms were incorporated 2015-05-28.

== more information for alternates ==
Raised 2015-04-24 by [[User:Kevin Marks|Kevin Marks]]

The existing <code>alternate</code> parsing is omitting <code>title</code> - that should be added. The <code>text</code> would make sense to add here too.

Use-case: labels for presenting alternates
<div class="discussion">
* +1 Makes sense. [[User:Tantek|Tantek]] 03:41, 25 April 2015 (UTC)
</div>

== more information for rel-based formats ==
Raised 2015-04-18 by [[User:Kevin Marks|Kevin Marks]]

Related github test suite issue: https://github.com/microformats/tests/issues/16

Several rel-based formats have additional information that is useful beyond the link itself, which is all we capture at the moment. As I am trying to update the Universal feedparser to support mf2 based I will show examples from the [https://github.com/kevinmarks/feedparser/tree/365623a9470e99246f393a8c1f49a0db567826b8/feedparser/tests/microformats testcases] there.

The main change is to add a <code>rel-urls</code> entry for more information about the attributes and text of the URLs pointed to by rels in the document.

A fork of mf2py that implements these changes is at https://github.com/kevinmarks/mf2py
=== rel-tag ===
<code><a rel="tag" href="http://del.icio.us/tag/tech">Technology</a> </code>

currently parses to:

<code>{"rels": {"tag": ["http://del.icio.us/tag/tech"]}, "items": []} </code>

This loses the link text, which is useful as a label.

We add a <code>rel-urls</code> element to the parsed output with this extra data, which can be looked up from the rels. This doesn't break backward compatibility and works better with xfn (see below):
<code><pre>
{
  "rels": {
    "tag": [
      "http://del.icio.us/tag/tech"
    ]
  },
  "items": [],
  "rel-urls": {
    "http://del.icio.us/tag/tech": {
      "rels": [
        "tag"
      ],
      "text": "Technology"
    }
  }
}
</pre></code>
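From a consumer's point of view, looking up the label for each tag URL might look like the following sketch. It assumes only the "rels" and "rel-urls" shapes shown above; the helper name <code>labelsForRel</code> is made up.

```javascript
// For a given rel value, return each URL together with its link text (if any).
function labelsForRel(parsed, rel) {
  const urls = (parsed.rels && parsed.rels[rel]) || [];
  const relUrls = parsed["rel-urls"] || {};
  return urls.map((url) => ({
    url,
    text: relUrls[url] ? relUrls[url].text : undefined,
  }));
}

// Usage with the rel-tag output above.
const parsedTagExample = {
  rels: { tag: ["http://del.icio.us/tag/tech"] },
  items: [],
  "rel-urls": {
    "http://del.icio.us/tag/tech": { rels: ["tag"], text: "Technology" },
  },
};
const tagLabels = labelsForRel(parsedTagExample, "tag");
```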
=== xfn ===
<code><a rel="coworker" href="http://example.com/johndoe">John Doe</a></code>

currently parses to:

<code>{"rels": {"coworker": ["http://example.com/johndoe"]}, "items": []}</code>

This loses the link text, which is the person's name. Suggested output using the <code>rel-urls</code> object:
<code><pre>
{
  "rels": {
    "coworker": [
      "http://example.com/johndoe"
    ]
  },
  "items": [],
  "rel-urls": {
    "http://example.com/johndoe": {
      "rels": [
        "coworker"
      ],
      "text": "John Doe"
    }
  }
}
</pre></code>
with multiple xfn values

<code><a rel="coworker friend" href="http://example.com/johndoe">John Doe</a></code>

we get this:
<code><pre>
{
  "rels": {
    "coworker": [
      "http://example.com/johndoe"
    ],
    "friend": [
      "http://example.com/johndoe"
    ]
  },
  "items": [],
  "rel-urls": {
    "http://example.com/johndoe": {
      "rels": [
        "coworker",
        "friend"
      ],
      "text": "John Doe"
    }
  }
}
</pre></code>
=== rel-enclosure ===
<code><a rel="enclosure" href="http://example.com/movie.mp4" type="video/mpeg" title="real title">my movie</a></code>

currently parses to:

<code>{"rels": {"enclosure": ["http://example.com/movie.mp4"]}, "items": []}</code>

This loses the link text, which is the title, and the attributes, which give the type. Suggested output:
<code><pre>
{
  "rels": {
    "enclosure": [
      "http://example.com/movie.mp4"
    ]
  },
  "items": [],
  "rel-urls": {
    "http://example.com/movie.mp4": {
      "rels": [
        "enclosure"
      ],
      "text": "my movie",
      "type": "video/mpeg",
      "title": "real title"
    }
  }
}
</pre></code>
This generalises to other rels too, such as [[rel-feed]] and [[rel-alternate]], which have type, lang etc. attributes.

(updated to include changes from feedback below) [[User:Kevin Marks|Kevin Marks]] 22:13, 26 April 2015 (UTC)
=== attributes parsed ===
Attributes currently parsed are:
* <code>hreflang</code> for alternate and enclosure
* <code>media</code> for alternate and enclosure
* <code>title</code> for alternate and enclosure
* <code>type</code> for alternate and enclosure
Attributes we may consider parsing if we have a use case are:
* <code>sizes</code> for icon - need use-case documentation
* <code>coords</code> for area - possibly for people tagging - no examples yet, and unnecessary as people-tagging requires using h-* microformats
* <code>shape</code> for area - possibly for people tagging - no examples yet, and unnecessary as people-tagging requires using h-* microformats
In addition there is a special attribute <s><code>name</code> </s><code>text</code>, which is the text contents of the link. It is useful in rel-tag, rel-enclosure and xfn, and in alternate when used for feeds. It's also clarifying for rel-me links.

Tantek [http://logs.glob.uno/?c=freenode%23microformats&s=today#c79057 suggests] we use <code>textContent</code> for this instead, and make it a single string, not a list as <code>name</code> is elsewhere in mf2 parsing.
* Update: "text" is good enough, and "textContent" is ugly camelCase. [[User:Tantek|Tantek]] 04:39, 29 May 2015 (UTC)
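Building "rels" and "rel-urls" together could be sketched as follows. This is an illustration under stated assumptions rather than the mf2py fork's code: links are assumed to arrive as plain records with <code>rel</code>, <code>href</code>, <code>text</code> and optional attributes; the four attributes are copied for any rel value rather than only alternate and enclosure; and when the same URL is linked more than once, the first text and attribute values are kept.

```javascript
// Attributes copied into rel-urls entries (the four listed as currently parsed).
const COPIED_ATTRS = ["hreflang", "media", "title", "type"];

// `links` is a hypothetical list of { rel, href, text, ...attrs } records in document order.
function buildRels(links) {
  const rels = Object.create(null);
  const relUrls = Object.create(null);
  for (const link of links) {
    const relValues = link.rel.split(/\s+/).filter(Boolean);
    if (relValues.length === 0) continue;
    if (!relUrls[link.href]) relUrls[link.href] = { rels: [] };
    const entry = relUrls[link.href];
    for (const rel of relValues) {
      if (!rels[rel]) rels[rel] = [];
      if (!rels[rel].includes(link.href)) rels[rel].push(link.href);
      if (!entry.rels.includes(rel)) entry.rels.push(rel);
    }
    for (const attr of COPIED_ATTRS) {
      if (link[attr] !== undefined && entry[attr] === undefined) entry[attr] = link[attr];
    }
    if (link.text !== undefined && entry.text === undefined) entry.text = link.text;
  }
  return { rels, "rel-urls": relUrls };
}
```

Keying "rel-urls" by URL means a link with several rel values, or repeated links to the same URL, still produce a single entry.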
=== feedback on more rel info ===
<div class="discussion">
# "name" is bad because it misleadingly conflates with use of "name" elsewhere in microformats2.
#* Suggested alternative: [https://developer.mozilla.org/en-US/docs/Web/API/Node/textContent textContent] - since that's literally what is being returned there. [[User:Tantek|Tantek]] 02:35, 25 April 2015 (UTC)
#** as all other mf2 keys are lowercase-with-hyphens, [http://logs.glob.uno/?c=freenode%23microformats&s=today#c79101 Tantek suggests] 'text' as that isn't going to be an html [[User:Kevin Marks|Kevin Marks]] 07:28, 25 April 2015 (UTC)
# no need for array for "name"/textContent - since there is always only one at most
#* E.g. should be "textContent": "my movie" [[User:Tantek|Tantek]] 02:35, 25 April 2015 (UTC)
#* Update: "text": "my movie" [[User:Tantek|Tantek]] 04:39, 29 May 2015 (UTC)
# "urls" key is misleading - implies all URLs in the document, which is neither true, nor desired (takes much more parsing time and work and code)
#* Suggested alternative: "rel-urls". And open to better alternatives too. [[User:Tantek|Tantek]] 02:35, 25 April 2015 (UTC)
#** If we are trying to extend the number of properties returned from a rel without breaking the old structure why don't we call the new structure something like "rels-extended" [[User:GlennJones|Glenn Jones]] 12:29, 1 June 2015 (UTC)
#*** Extension is not the point, but rather to use them complementary. One structure for look-up of any rel value, hence "rels", which returns you a list of URLs. Then you can lookup those URLs in the new mapping, by URL, hence it is called "rel-urls" - that's the point to use them in conjunction and that's why rel-urls is named what it is. [[User:Tantek|Tantek]] 22:03, 1 June 2015 (UTC)
# Why is the structure of "rel-urls" different to the "alternates" structure. Should the "url" not just be added as a property and not as a key. Creating two data structures for one type of object seems inconsistent. It adds cognitive load to anyone trying to understand the JSON structure [[User:GlennJones|Glenn Jones]] 12:29, 1 June 2015 (UTC)
#* I was trying to avoid breaking the existing <code>rels</code> structure and use of it - I did implement a variant that put the structure inside rels, and it became cumbersome and repetitive where there were multiple rels on a url (xfn cases). Denormalising as properties of the URL made more sense. It also dedupes if there is repetitive linking to the same URL, eg a series of posts with rel-author on each. [[User:Kevin Marks|Kevin Marks]] 20:05, 1 June 2015 (UTC)
# If the rel is a "tag" then the main value we need to return should be the last path component of the URL, not the link text? Should we add another output property ie "tag" [[User:GlennJones|Glenn Jones]] 12:29, 1 June 2015 (UTC)
#* No need to return last path segment of the URL, because the URL is already there - and that's just a library/framework utility function to get the last path segment of a URL. [[User:Tantek|Tantek]] 22:03, 1 June 2015 (UTC)
# As currently described, the URL from <code>alternates</code> is repeated in the <code>rel-urls</code> structure. If we are doing this, surely <code>alternate</code> should be in <code>rels</code> too? I assumed a mapping between them. [[User:Kevin Marks|Kevin Marks]] 20:05, 1 June 2015 (UTC)
## edit showing this variant: http://microformats.org/wiki/index.php?title=microformats2-parsing&oldid=65021#parse_a_hyperlink_element_for_rel_microformats
#* Yes it makes sense to drop "alternates" assuming the backcompat impact is low, put alternates in "rels" along with everything else, and direct people to use rels and rel-urls for alternates functionality. Evidence this is an acceptable even preferable approach.[http://indiewebcamp.com/irc/2015-06-01/line/1433195247005] Will add an issue accordingly. [[User:Tantek|Tantek]] 22:03, 1 June 2015 (UTC)
</div>
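If "alternates" is dropped in favour of "rels" plus "rel-urls", a consumer could recover alternates functionality with a lookup along these lines. The helper name and the sample feed URL are hypothetical; only the two structures discussed above are assumed.

```javascript
// Find alternate links of a given MIME type by combining rels and rel-urls.
function alternatesOfType(parsed, mimeType) {
  const urls = (parsed.rels && parsed.rels.alternate) || [];
  const relUrls = parsed["rel-urls"] || {};
  return urls
    .filter((url) => relUrls[url] && relUrls[url].type === mimeType)
    .map((url) => ({
      url,
      title: relUrls[url].title,
      hreflang: relUrls[url].hreflang,
    }));
}

// Usage with made-up data.
const parsedAlternateExample = {
  rels: { alternate: ["http://example.com/feed.atom", "http://example.com/sv/"] },
  "rel-urls": {
    "http://example.com/feed.atom": { rels: ["alternate"], type: "application/atom+xml", title: "Atom feed" },
    "http://example.com/sv/": { rels: ["alternate"], type: "text/html", hreflang: "sv" },
  },
};
const atomFeeds = alternatesOfType(parsedAlternateExample, "application/atom+xml");
```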
== Incorporated 2015-06-06 == | |||
== Nested h-* objects' "value" property == | |||
Status: resolved, resolution iterated, one real world implementation proven implementability, incorporated | |||
* 2015-06-06 incorporated into [[microformats2-parsing]] | |||
Raised 2015-01-06 by [[User:Kylewm]]; | |||
If a child element has a microformat (h-*) and is a property element (p-*, u-*, dt-*, e-*), the parser will add a "value" property to the resulting object. The value should attempt to be a useful representation of the object for consumers that do not have semantic knowledge of the particular h-* type. Ref: [[microformats2-parsing#parse_an_element_for_class_microformats]]. | |||
To determine the "value", we parse the property element simply (as if it did not have a h-* class), which works well for simple h-* objects, e.g. <code><a class="u-like-of h-cite" href="...">...</a></code> | |||
<div class="discussion"> | |||
* To handle more complex microformats, I propose that "value" for a p-* property element take on the first explicit "name" property of the nested microformat, and for a u-* property, the first explicit "url" property. Parsing will fall back on the current rules if an explicit property does not exist. | |||
** This makes sense to me, and fits with the use-cases and examples I've seen. [[User:Tantek|Tantek]] 19:31, 6 January 2015 (UTC) | |||
** A similar (possibly simpler?) formulation would use the implied name and url rules to determine the "value" for p-* and u-* properties respectively | |||
*** I don't think that's needed, as there are already implied rules on a property that should handle that. I'd start with just the "first explicit" scoping to be more conservative, and then see if we find any use-cases that (and existing implied rules) don't/doesn't catch. [[User:Tantek|Tantek]] 19:31, 6 January 2015 (UTC) | |||
**** Agreement at [[2015-01-20]] meetup. | |||
</div> | |||
For example: | |||
<code><pre> | |||
<div class="h-entry"> | |||
<div class="u-in-reply-to h-cite"> | |||
<a class="p-author h-card" href="http://example.com">Example Author</a> | |||
<a class="p-name u-url" href="http://example.com/post">Example Post</a> | |||
</div> | |||
</div> | |||
</pre></code> | |||
The nested u-in-reply-to object would parse as | |||
<code><pre> | |||
... | |||
"in-reply-to": [{ | |||
"type": ["h-cite"], | |||
"properties": { | |||
"name": ["Example Post"], | |||
"url": ["http://example.com/post"], | |||
"author": [{ | |||
"type":["h-card"], | |||
"properties": { | |||
"url": ["http://example.com"], | |||
"name": ["Example Author"] | |||
}, | |||
"value": "Example Author" | |||
}] | |||
}, | |||
"value": "http://example.com/post" | |||
}] | |||
... | |||
</pre></code> | |||
where the outer "value" gets the in-reply-to h-cite's u-url property, and the inner "value" gets the author's p-name property. | |||
<div class="discussion"> | |||
* Because there are no implied properties of the dt-* and e-* types, and no obvious defaults, the value rules for these types would not change. | |||
** A possibility for dt-* h-*: The dt-* could take either the first dt-* of the h-*, or (perhaps if no dt-* in the h-*,) the first <code><time></code> element inside. [[User:Tantek|Tantek]] 19:31, 6 January 2015 (UTC) | |||
** First dt-* seems reasonable, predictable, and usable. Consensus at [[2015-01-20]] meetup. | |||
** Update 2015-05-29: no known use-cases for first dt-* or first e-*, and implementing that "would require some refactoring" (in mf2py at least per kylewm), thus until there's a use-case for first dt-*/e-* inside, let's treat "dt-* h-*" and "e-* h-*" as before. [[User:Tantek|Tantek]] . In particular: | |||
*** p-* h-* - value from first "name" as proposed above | |||
*** u-* h-* - value from first "url" as proposed above | |||
*** e-* h-* - value is already defined for e-* parsing, nothing special here | |||
*** dt-* h-* - value from normal dt-* parsing - nothing special. | |||
*** +1 totally agree, let's wait for use cases of e-* dt-* [[User:Kylewm|Kylewm]] 19:44, 29 May 2015 (UTC) | |||
</div> | |||
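The per-prefix rules above can be sketched in a few lines of Python. This is only an illustration, not code from mf2py or any other parser: the function name `nested_value` and its arguments are hypothetical, and it assumes `nested["properties"]` holds only explicitly marked-up properties at the point it is called (that is, before any implied name/url rules run).

```python
def nested_value(prefix, nested, simple_value):
    """Pick the "value" for a property element that is also an h-* microformat.

    prefix       -- property prefix of the element: "p", "u", "dt" or "e"
    nested       -- the parsed nested object ({"type": [...], "properties": {...}})
    simple_value -- result of parsing the element as a plain property,
                    used as the fallback
    """
    props = nested.get("properties", {})
    # p-* h-*: first explicit "name" of the nested microformat
    if prefix == "p" and props.get("name"):
        return props["name"][0]
    # u-* h-*: first explicit "url" of the nested microformat
    if prefix == "u" and props.get("url"):
        return props["url"][0]
    # dt-* h-* and e-* h-*: nothing special, keep the normally parsed value
    return simple_value
```

For the h-cite example above, `nested_value("u", cite, ...)` would return the cite's first "url", and the inner `nested_value("p", card, ...)` would return the card's first "name".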
* Implemented in mf2py 2015-06-01 https://github.com/tommorris/mf2py/commit/edc895ef5a780bcee654e6644a688688934517b0 | |||
* Added to microformats test suite (experimental) 2015-06-01 https://github.com/microformats/tests/commit/90c8a7d8e96c7160036a298e13f16d9ddaec218e | |||
== see also == | == see also == |
Latest revision as of 16:29, 18 July 2020
This page is for brainstorming, discussion, and other questions and explorations about microformats2 parsing.
For the microformats2 parsing algorithm, see:
For filing issues / problems with microformats2-parsing, see:
Parse img alt
Per https://github.com/microformats/microformats2-parsing/issues/2 currently any u-* property (e.g. u-photo, u-featured) that extracts a 'src' attr from an img tag loses any associated 'alt' text alternative, and if at some point the consuming application wants to display that u-* property as an img, they have to either omit or synthesize a fake text alternative.
It is desirable to somehow maintain that image src and alt association from the original markup, through the parsing process, up until a consuming application wishes to re-present the image with the text alternative.
There are a number of possibilities / approaches here worth brainstorming:
Include alt property in parent object
- explicit authoring: require the author to use a new 'p-alt' property on the image to cause parsing and extraction of the text alternative.
- Problem(s): fails for multiple images, some of which may or may not have alt attrs or corresponding p-alt properties (and fragile, forgetting one p-alt throws off the parallel lists of u-* and p-alt).
- implicit p-alt: for every img that is parsed for a u-* property, the parse could generate a p-alt property with value.
- Problem(s): fragile again for similar reasons, not all u-*s may be on img elements, or may not have alt attrs for all imgs in the source.
- implicit p-alt only for implied u-photo
- This is better since there can only be one implied u-photo, and thus if there is a p-alt, it must be associated with the one u-photo
- Problem(s): does not work for other u-* image properties e.g. u-featured
<div class="h-entry"><img src="http://example.com/photo.jpg" alt="Example" class="u-photo p-alt"></div>
{"type":["h-entry"],"properties":{"photo":["http://example.com/photo.jpg"],"alt":["Example"]}}
Make photo property an object
1. use "h-image" on any u-* on img elements to imply a structure with paired photo and 'name' text alternative, e.g.
<img src="a.jpg" alt="text about a" class="u-featured h-image"/>
which would result in a u-featured property with one value, a structure of an h-image with itself having implied properties of a u-photo of "a.jpg" and a p-name of the "text about a". Similarly the author can use the object tag for the same result:
<object data="a.jpg" class="u-featured h-image">text about a</object>
In either case, the same microformats JSON would be generated, which is correct, as in both cases, there is an image with a fallback text alternative. The specific HTML used should not matter. The semantic of pairing the image with the text alternative is communicated the same way for both.
- Challenge: requires author use of additional classname "h-image".
- Benefit: does not require a change to the parsing algorithm
<div class="h-entry">
<img src="http://example.com/eg.jpg" alt="Photo of an example" class="u-featured h-image">
</div>
{
"type":["h-entry"],
"properties":{
"featured":[{
"type":["h-image"],
"properties":{
"photo":["http://example.com/eg.jpg"],
"name":["Photo of an example"]
}
}]
}
}
2. have u-* on an <img> automatically create an object if there is a non-empty 'alt' attribute.
If a u-* property is parsed on an <img> element with a non-empty 'alt' attribute, then:
Create a structure similar to the e-content nested structure that provides the "value" as the URL, and an "alt" as the text alternative.
- Advantage: no additional microformats markup needed from author
- Challenge: Many (most?) existing published u-photo properties will now return an object instead of a string, and consuming applications may not be expecting an object for a photo
- Mitigation: If this is done as an explicit parser library upgrade, consuming applications may decide when to take this parser upgrade and thus fix their u-photo handling to look for string or object before upgrading their microformats2 parsing library instance.
<div class="h-entry">
<img src="http://example.com/eg.jpg" alt="Photo of an example" class="u-featured">
</div>
{
"type":["h-entry"],
"properties":{
"featured":[{
"value":"http://example.com/eg.jpg",
"alt":"Photo of an example"
}]
}
}
... more brainstorming needed
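For brainstorm "2.", the mitigation asks consuming applications to accept either a string or an object for a u-* image property. A minimal sketch of such handling follows; the helper name `image_url_and_alt` is hypothetical and the object shape ("value" plus "alt") is the one brainstormed above, not a finalized format.

```python
def image_url_and_alt(value):
    """Return (url, alt) for a u-* image value that is a string or an object."""
    if isinstance(value, str):
        # Plain URL string: no text alternative was available.
        return value, None
    if isinstance(value, dict):
        # Brainstormed object form: {"value": url, "alt": text alternative}
        return value.get("value"), value.get("alt")
    raise TypeError("expected a URL string or a value/alt object")
```

A consumer that adopts this kind of check before upgrading its parser library should keep working with both old and new output.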
img alt thoughts
Thoughts about img alt brainstorm proposals. Feel free to offer counterpoints with nested items and/or alternative preferences/opinions with (potentially multiple) top level items!
- Tantek: I am leaning towards "Make photo property an object" brainstorm "2." because it feels more "automatic" and thus provides lower friction to more accessibility. Less (author) work for "alt" information to get passed through to the JSON result, and thus more potentially re-usable by consuming applications that want to preserve or re-emit the pairing of a photo and its fallback text alternative. -- Tantek 00:53, 19 July 2016 (UTC)
- Aaron: I am leaning towards 2 because it takes less work on the part of publishers as well as consumers. From the publisher POV, if they add the alt attribute, that should be all they need to do, it seems odd to make them do additional work to make that show up in the parsed result. From the consumer side, some implementations will not need changing since when looking for a string value, they already use either the string directly or look for the "value" of the property if it's an object. Making consumers handle a new h- object just to read alt text seems overkill.
- Additionally, if the alt attribute is an empty string, this should be considered the same as if it were missing, so that the photo value will be the URL string rather than the object in this case as well
- Kevin: 2 makes sense to me as well, as this is a very specific need. If we want an image object with more substructure as 1 implies, that should be a new object type that follows the process - there is a case for that based on usage of figure/figcaption etc. but caption is not alt, and using name for it implies that it is. Kevin Marks 01:50, 19 July 2016 (UTC)
- Bear: The thoughts given above for option 2 make the most sense as a library writer and consumer, tying this change to a parser implementation's major version change will (should) give everyone notice and time to adjust
...
- (unanimity copied to GitHub)
When it looks like thoughts are naturally converging, we should take that emergent convergence back to the github thread for proper back/forth discussion and figuring out of details.
https://github.com/microformats/microformats2-parsing/issues/2
- Tantek 22:10, 1 August 2016 (UTC): Thanks Aaron, Kevin, Bear - based on the unanimous support of one particular brainstorm proposal, that proposal has been moved to the GitHub issue, and any follow-up about it (corrections, refinements, iterations) should occur there:
Parse language information
Raised by VoxPelli 18:04, 23 July 2015 (UTC)
- 2016-060: Update: and parse "id" attribute. Tantek 16:39, 29 February 2016 (UTC) (see Additionally below)
- 2016-07-13: Update: created GitHub issue for this brainstorm VoxPelli 14:34, 13 July 2016 (UTC)
Currently there’s no way to tell the language of parsed microformats even if those microformats have been marked up with HTML "lang"-attributes.
There are examples in the wild of people marking up pages in such a way:
- VoxPelli.com has a "lang"-attribute on the h-entry of his Swedish articles to signify that the article is Swedish even though the rest of the site is English.
- Stephanie uses a WordPress plugin that adds summaries of other languages at the start of her content.
- Seblog.nl has a lang="nl"-attribute on the <html> of each page, and uses a lang="en" on the p-name, p-summary and e-content of a h-entry if the CMS-field 'lang' is set to "en" (or any language other than "nl"). This is to signify that the article is English, but the rest of the page Dutch (including the textual representation of the date). (example)
Proposal is to add a new "lang" keyword to h-* and e-* objects so that the following example:
<div class="h-entry" lang="sv">
<h1 class="p-name">En svensk titel</h1>
<div class="e-content" lang="en">With an <em>english</em> summary</div>
<div class="e-content">Och <em>svensk</em> huvudtext</div>
</div>
Would be parsed into something like:
{
"type": ["h-entry"],
"lang": "sv",
"properties": {
"name": ["En svensk titel"],
"content": [
{
"lang": "en",
"html": "With an <em>english</em> summary",
"value": "With an english summary"
},
{
"html": "Och <em>svensk</em> huvudtext",
"value": "Och svensk huvudtext"
}
]
}
}
This was brainstormed on the IndieWebCamp IRC-channel where the mentioned example came up.
- Pull request for implementation in microformat-node added 2015-07-23 https://github.com/glennjones/microformat-node/pull/23
- Closed 2015-09-08 because the library has changed and parsing is now handled by microformat-shiv. New issue opened there: https://github.com/glennjones/microformat-shiv/issues/22
- Issue around implementation in php-mf2 added 2016-05-07 https://github.com/indieweb/php-mf2/issues/96
- Released 2017-05-27 in v0.3.2 behind a feature flag.
Additionally: consider the same for "id" attributes (use-case: rel=feed local discovery of a nested h-feed on the home page), specifically, parsing the first instance of any "id" attribute (ignoring latter duplicate id attribute values on any subsequent elements).
And alternatively: consider parsing as "html-id" and "html-lang" prefixed properties in the parsed result, e.g.
- Q: Why parse with the "html-" prefix?
- A: "html-lang and html-id to avoid confusing them with a possible actual property p-lang or p-id (which we don't have but might / could, especially from a vocabulary agnostic parser perspective)" https://chat.indieweb.org/microformats/2017-05-30#t1496166813294000
<div class="h-entry" lang="sv" id="postfrag123">
<h1 class="p-name">En svensk titel</h1>
<div class="e-content" lang="en">With an <em>english</em> summary</div>
<div class="e-content">Och <em>svensk</em> huvudtext</div>
</div>
Would be parsed into something like:
{
"type": ["h-entry"],
"html-id": "postfrag123",
"html-lang": "sv",
"properties": {
"name": ["En svensk titel"],
"content": [
{
"html-lang": "en",
"html": "With an <em>english</em> summary",
"value": "With an english summary"
},
{
"html": "Och <em>svensk</em> huvudtext",
"value": "Och svensk huvudtext"
}
]
}
}
Language inheritance
If the "lang" attribute is not specified for a particular element, it is inherited from the nearest ancestor that specifies one (or, failing that, from the HTTP Content-Language header).
HTML5: https://www.w3.org/TR/html5/dom.html#the-lang-and-xml:lang-attributes
HTML4: https://www.w3.org/TR/html4/struct/dirlang.html#h-8.1.2
Proposal: Determine and include the inherited "lang" value on *every* microformat object that directly specifies a lang or that has an ancestor that does, e.g. if <html lang="en">, then every object in the output will have "lang": "en".
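One way to picture the inheritance proposal is a stack of effective languages kept while walking the document. The sketch below uses Python's standard `html.parser`; the class name `LangTracker`, its `found` list and the `default_lang` argument (standing in for a Content-Language header value) are all hypothetical names for illustration. It also simplifies by treating an empty lang="" like a missing attribute.

```python
from html.parser import HTMLParser

# Void elements never get an end tag, so they are not pushed on the stack.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class LangTracker(HTMLParser):
    """Record the effective language of every h-* root element."""

    def __init__(self, default_lang=None):
        super().__init__()
        self.default_lang = default_lang  # e.g. from Content-Language
        self.stack = []   # (tag, effective lang) for each open element
        self.found = []   # (h-* class name, effective lang)

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        inherited = self.stack[-1][1] if self.stack else self.default_lang
        # Own lang attribute wins, otherwise inherit from the nearest ancestor.
        lang = attrs.get("lang") or inherited
        for name in (attrs.get("class") or "").split():
            if name.startswith("h-"):
                self.found.append((name, lang))
        if tag not in VOID:
            self.stack.append((tag, lang))

    def handle_endtag(self, tag):
        # Pop back to the matching open element, tolerating unclosed children.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break
```

A real parser would attach the language to each output object rather than collect pairs, but the lookup rule is the same.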
Pronouns in different languages
Language is also useful context when defining pronouns, discussed a bit here[2].
<div class="h-card" lang="en">
<span class="p-x-pronoun-nominative">he</span> /
<span class="p-x-pronoun-oblique">him</span> /
<span class="p-x-pronoun-possessive">his</span>
</div>
would parse as
{
"type": ["h-card"],
"lang": "en",
"properties": {
"x-pronoun-nominative": ["he"],
"x-pronoun-possessive": ["his"],
"x-pronoun-oblique": ["him"]
}
}
It could also be useful to specify multiple languages within a single h-card (pardon me if I butcher Swedish pronouns)
<div class="h-card">
<span lang="en" class="p-x-pronoun-nominative">he</span> /
<span lang="en" class="p-x-pronoun-oblique">him</span> /
<span lang="en" class="p-x-pronoun-possessive">his</span>
<span lang="sv" class="p-x-pronoun-nominative">han</span> /
<span lang="sv" class="p-x-pronoun-possessive">hans</span> /
<span lang="sv" class="p-x-pronoun-oblique">honom</span>
</div>
which might parse as
{
"type": ["h-card"],
"properties": {
"x-pronoun-nominative": [{"lang": "en", "value": "he"}, {"lang": "sv", "value": "han"}],
"x-pronoun-possessive": [{"lang": "en", "value": "his"}, {"lang": "sv", "value": "hans"}],
"x-pronoun-oblique": [{"lang": "en", "value": "him"}, {"lang": "sv", "value": "honom"}]
}
}
or alternatively, we could introduce a new microformat h-x-pronoun to wrap a set of pronouns
<div class="h-card">
<div class="p-x-pronoun h-x-pronoun" lang="en">
<span class="p-nominative">he</span> /
<span class="p-oblique">him</span> /
<span class="p-possessive">his</span>
</div>
<div class="p-x-pronoun h-x-pronoun" lang="sv">
<span class="p-nominative">han</span> /
<span class="p-possessive">hans</span> /
<span class="p-oblique">honom</span>
</div>
</div>
parsed as
{
"type": ["h-card"],
"properties": {
"x-pronoun": [{
"type": ["h-x-pronoun"],
"lang": "en",
"properties": {
"nominative": ["he"],
"possessive": ["his"],
"oblique": ["him"]
}
}, {
"type": ["h-x-pronoun"],
"lang": "sv",
"properties": {
"nominative": ["han"],
"possessive": ["hans"],
"oblique": ["honom"]
}
}]
}
}
Discussion:
- Kylewm Including the "lang" attribute in h- and e- properties makes a ton of sense to me.
- Kylewm I like the idea of introducing an h-x-pronoun container that can define all the different pronoun forms for a particular language
- Martijn Turns out that the neat summary of different p-x-pronoun-* per language from the second example is never going to happen. Objective case (here oblique) exists in English and then suddenly doesn’t exist at all in e.g. German.
- Martijn The container is still a viable option because it gives a clear language split. Within the container, completely different case names would be used though. German would get properties for nominative, accusative, genitive, dative, and possessive cases. Every language will require its own documentation for properties, and some like Finnish would require up to 13 properties.
- Martijn I propose an entirely different way of marking up pronouns. See h-card-brainstorming.
- ...
Canonicalization of datetime output
Status: resolved, awaiting implementation attempt/experience.
It would be useful to choose a (more) uniform output format for datetimes to make it easier for users of the parser to consume datetimes. Microformats2 parsers already do sophisticated pattern matching to recognize date vs. time vs. datetimes, so converting this to any specific format should not add overhead.
Specifically:
- Choose either 'T' or space as the date/time separator.
- Prefer space as it is more human friendly/readable, which matters even for syntaxes/formats, as humans still develop and debug them. Tantek 19:31, 6 January 2015 (UTC)
- Choose either +XXYY or +XX:YY as the timezone specification (and convert 'Z' to +0000).
- Would appreciate some study / input here as to which timezone offset syntax is more human friendly. I lean slightly toward +/-NNNN (without the colon) because in the context of seeing a time, leaving out the colon makes it less likely the offset will be confused for a time. E.g. "07:00-08:00" looks like 7-8am, even if it meant 07:00 in PST. Tantek 19:31, 6 January 2015 (UTC)
- Space is fine - consensus 2015-01-20 meetup.
- Parsers should not attempt to make datetimes more exact than specified. They should not add time, seconds, or timezone if omitted in the original. Kylewm 04:02, 14 May 2014 (UTC)
- Agreed. Tantek 19:31, 6 January 2015 (UTC)
- or month, day per Tom Morris
- consensus 2015-01-20 meetup
- Counterpoint: PHP's builtin date parsing does not require strict formatting. And the equivalent functionality for Python is provided by the widely used python-dateutil library. Kylewm 19:02, 14 May 2014 (UTC)
- However we cannot (must not) depend on either PHP or Python's "smart" "fixing" or Postelian "liberal handling", or any other language/framework's for that matter, as they all differ in how "intelligent" they are. Tantek 19:31, 6 January 2015 (UTC)
Perhaps just provide a guideline for these based on the above consensus.
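As a rough illustration of such a guideline, the sketch below normalizes only what was discussed: a space as the date/time separator, 'Z' converted to +0000, and offsets written without a colon. The space separator reflects the 2015-01-20 consensus; the colon-free offset is only the leaning expressed above, so treat it as an assumption. The function name `canonicalize` and the regular expression are hypothetical, and nothing is added that the author omitted (no seconds, time, or timezone).

```python
import re

# Date (calendar or ordinal), optionally followed by a time and a timezone.
_DATETIME = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2}|\d{4}-\d{3})"
    r"(?:[T ](?P<time>\d{2}:\d{2}(?::\d{2})?)"
    r"(?P<tz>Z|[+-]\d{2}:?\d{2})?)?$"
)


def canonicalize(value):
    """Normalize separator and timezone syntax without adding precision."""
    match = _DATETIME.match(value.strip())
    if not match:
        return value  # leave anything unrecognized untouched
    out = match.group("date")
    if match.group("time"):
        out += " " + match.group("time")  # space instead of 'T'
        tz = match.group("tz")
        if tz:
            # 'Z' becomes +0000; "+XX:YY" becomes "+XXYY"
            out += "+0000" if tz == "Z" else tz.replace(":", "")
    return out
```

For example, `canonicalize("2015-01-06T19:31:00Z")` gives `"2015-01-06 19:31:00+0000"`, while a bare date such as `"2015-01-06"` is returned unchanged.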
Add meta http-equiv to microformats2 parsing model
Status: disagreement, awaiting implementation attempt/experience.
Similar to document level parsing of rel attributes, it makes sense simultaneously to parse <meta http-equiv> elements, perhaps treating "Status" in a special way (only using first number (sequence of digits) for its "value").
Use case: IndieWeb "deleted" indication inline in content for static file services that don't support HTTP return codes.
HTTP Header example:
- Content-Type: text/html; charset=utf-8
HTML equivalent:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
Related:
- Interesting thought. Are you suggesting a top level "http-equivs:" collection similar to "rels:" in the parsed output? Should we consider "metas:" instead or in addition? Tantek 19:31, 6 January 2015 (UTC)
- What's the use case for this? Also, http-equiv on its own is useless. http-equiv is only a supplement to the data stored in headers. And headers aren't always there: what happens in the context of someone debugging a page who pastes the source into the textarea of an mf2 parser? Without a compelling use case for including headers (and then over-riding some of them with http-equivs), I'm not sure why an implementor would want to do this. —Tom Morris 00:25, 8 May 2015 (UTC)
E.g. from https://gist.github.com/aaronpk/10297489
<meta http-equiv="Status" content="410 GONE"/>
{
"items": [],
"rels": {},
"http": {
"status": 410
}
}
- Maybe make this an optional pass in the parser? - Tom Morris 2015-01-20
- For now, don't bother with metas until someone provides a use-case. Tom Morris
- Agreed on both counts. Tantek 06:56, 21 January 2015 (UTC)
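If someone does want to experiment with this as an optional pass, the "Status" handling described above (use only the first sequence of digits) could look roughly like the following. This is a sketch using Python's standard `html.parser`, not part of any existing mf2 parser; the class name `HttpEquivParser`, the lowercased keys, and the "first occurrence wins" rule are assumptions made for the example.

```python
import re
from html.parser import HTMLParser


class HttpEquivParser(HTMLParser):
    """Collect <meta http-equiv> values into a dict like the "http" key above."""

    def __init__(self):
        super().__init__()
        self.http = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        name = attrs.get("http-equiv")
        content = attrs.get("content")
        if not name or content is None:
            return
        key = name.lower()
        if key in self.http:
            return  # assumption: the first occurrence wins
        if key == "status":
            # Only the first number (sequence of digits) is used.
            digits = re.search(r"\d+", content)
            if digits:
                self.http[key] = int(digits.group())
        else:
            self.http[key] = content
```

Feeding it the `<meta http-equiv="Status" content="410 GONE"/>` example leaves `{"status": 410}` in `parser.http`.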
MIME type
Other Interpretation Parsing Notes
Note: most of these need to be written up as separate microformats2-parsing-issues
Author: Ben Ward
Microformats 2 proposes a new, all encompassing syntax modification of prefixes that will allow microformats to be parsed from pages by processors without prior knowledge of a vocabulary. The core components of this model are quite simple, and quite simple to implement, but there are a number of conflicts that emerge with the functionality of existing microformats parsers that need to be handled. This page documents a proposed model to separate these concerns clearly in a way that can be applied to the documentation of generic microformats parsing rules, and the documentation of individual vocabularies.
Collection of other unresolved parsing issues in a generic model:
This is good material for documenting as microformats-2-issues, microformats-2-faq, and perhaps some of the more technical details in microformats-2-parsing-faq.
- The include pattern references other elements from elsewhere in a document. A generic parser needs to track IDs and fill them in after walking the DOM. (also, itemref if adopted.)
- The current thinking per microformats-2-brainstorming is to adopt itemref and drop the include-pattern. Tantek
- Will itemref always map to an item property name?
- No, itemref maps to one or more elements by ids, and their children. Those referenced elements may have property class names themselves, or they may contain elements that do. Tantek
- hAtom implies author from an hCard in a page that uses an address element. This requires format knowledge, but a generic parser does not currently track the element type of a property node. Should it?
- It should not. element-specific handling (e.g. using "alt" from img, and "title" from abbr) is completely done at parse time. The JSON data model does not reflect which element type or attribute the value came from. Additionally, hAtom is an example where we created far too many vocabulary-specific rules, in practice they're not necessary, and only complicate the microformat for both publisher understanding and parser implementation. Tantek
- hAtom defines that the highest level heading within an entry implies entry-title. This particular optimisation might be better off dead.
- Agreed, this is gone in microformats 2. Tantek
- hAtom defines that permalinks be parsed from rel attributes, not class
- In practice this has been one of the more problematic/error prone aspects of hAtom implementations, and it's also inconsistent with other microformats (although hReview tried to use both rel permalinks and "url"). The dependence upon rel-bookmark for permalinks is dropped in h-atom in preference to re-using "u-url" and "u-uid". Tantek
- XFN is entirely built on rel (although, has various other differences from structural microformats, as do vote-links, so perhaps are excluded from this discussion and will always be handled by dedicated parsers/queries regardless?)
- The best (easiest and most reliable) use of 'rel' microformats in practice is when they are orthogonal to 'class' microformats. This is true both with XFN and some newer rel values like rel-author. In addition, it was very clear at the recent schema.org workshop's syntax session that RDFa's decision to apparently arbitrarily mix use of 'rel' and 'property' attributes for specifying different types of properties (it wasn't clear to people in the room when you use which for what) has caused a high degree of confusion among publishers and thus high error-rates. Thus if anything we should learn from both the mistakes of RDFa and our own experiences with even very deliberate/specific mixing of rel microformats in class microformats, and keep them defined as separate orthogonal building blocks that work together, but don't depend on each other. Tantek
- Relatedly to this: rel-tag in hAtom. --BenWard 06:50, 5 October 2011 (UTC)
- Yes, and two related things here. First, despite my (and others') objection and (past) interoperable post/entry-specific treatment by Technorati and Ice Rocket, Hixie has redefined rel-tag in HTML5 to mean applying to the whole page, not a single post. Second, I've explicitly added 'p-category' to the draft 'h-atom' vocabulary in microformats-2. Tantek 07:12, 5 October 2011 (UTC)
- HTML's time element includes an optional pubdate attribute. Simply: We should parse this as dt-published. --BenWard 06:12, 10 October 2011 (UTC)
- *If* there is even some reasonable data on actual use of the "pubdate" attribute (I don't think there is, frankly, especially with the removal of the algorithm to produce Atom from HTML5), then we could consider parsing "pubdate" as backwards compatible option for "dt-published". As a general rule, however, it is bad (demonstrably/experienced) design to depend on additional attributes (c.f. RDFa confusion over "property" vs. "rel"), especially for an instance where no additional attribute is necessary. I would leave this out for now until there is non-trivial (more than just test pages or folks who've written HTML5 books, ahem) use in the wild. When there is such use in the wild, it should be documented on a wiki page. We don't want to encourage more complex (additional attribute) publishing as a result of supporting it. Tantek 12:12, 10 October 2011 (UTC)
- value-class-pattern: In microformats-2, since there are no sub-properties, there will presumably no longer be a 'value' property in any parsed model. Properties such as 'tel > type' in hCard are, as I recall, deprecated due to underuse anyway, so 'tel > value' becomes redundant. (There's also potentially some clarification around 'price > value' in hListing, whereby value was used in a pattern.) So, what does this mean for value class parsing, with regard to value-title patterns and date separation patterns? Are we looking for a 'p-value' and 'p-value-title' classname, but treating them specially (excluding them from regular property parsing.) Or, are we giving them a special prefix (v-text, v-title? That seems confusing, but could be a concept.) I'm fine with p- for both, and just having the parser ignore them since they're special, but need clarification and naming confirmation. --BenWard 09:35, 10 October 2011 (UTC)
- A few things:
- 1. Yes, no more subproperties. 'tel' becomes just 'p-tel'. If there is demand for a structured 'tel' value, then we can use that demand (and research into publishing in practice) to brainstorm and create an 'h-tel' structured telephone number (with perhaps fields like 'type', 'extension', some indication of it being local dialing (an extra 0 in some countries) or international dialing, etc.) Or, we address the different 'tel' types as their own flat properties (again as justified by research), e.g. perhaps 'p-tel-fax', or 'p-tel-mobile'. Something for hcard-2-brainstorming.
- 2. For prices, e.g. hListing, either we're going to need to encode how to parse monetary amounts including monetary symbols, or consider creating an 'h-price' structured price. Not sure what the right answer is here, again, will need to be informed by analysis of documented actual price publication practices.
- 3. We should avoid introducing a new prefix 'v-' just for value-class-pattern. As we've noted elsewhere, each new prefix adds complexity and should be avoided without substantial advantage.
- 4. Using 'p-value-title' is strange, as it would be an exception to 'p-' parsing, since it would get the value from the 'title' attribute whereas 'p-' properties don't normally do that (exception: abbr).
- 5. Using 'p-value' is also strange, as it wouldn't generate a 'value' property in the JSON data model.
- 6. Class name 'value-title' is already sufficiently prefixed - we've found or even heard of no collisions in practice.
- 7. Class name 'value' can, by its simpler naming nature, be expected to potentially collide with other web designer class name usage though we have no documentation/mention thereof. We could consider a renaming, or providing of alternative, such as 'value-string', or 'value-content', etc. However, let's keep that as a backup plan to use only if/when evidence is presented that we need to.
- Conclusions: for now, in microformats-2, keep using 'value' and 'value-title' as defined in the value-class-pattern, and add the additional (obvious) interpretation that value class pattern: date and time parsing applies to all 'dt-' properties. - Tantek 12:12, 10 October 2011 (UTC)
incorporated 2015-05-28
The following brainstorms were incorporated 2015-05-28.
more information for alternates
Raised 2015-04-24 by Kevin Marks
The existing alternate parsing is omitting title - that should be added. The text would make sense to add here too.
Use-case: labels for presenting alternates
- +1 Makes sense. Tantek 03:41, 25 April 2015 (UTC)
more information for rel-based formats
Raised 2015-04-18 by Kevin Marks
Related github test suite issue: https://github.com/microformats/tests/issues/16
Several rel-based formats have additional information that is useful beyond the link itself, which is all we capture at the moment. As I am trying to update the Universal feedparser to support mf2, I will show examples from the testcases there.
The main change is to add a "rel-urls" entry with more information about the attributes and text of the URLs pointed to by rels in the document.
A fork of mf2py that implements these changes is at https://github.com/kevinmarks/mf2py
rel-tag
<a rel="tag" href="http://del.icio.us/tag/tech">Technology</a>
currently parses to:
{"rels": {"tag": ["http://del.icio.us/tag/tech"]}, "items": []}
This loses the link text, which is useful as a label.
We add a "rel-urls" element to the parsed output with this extra data, which can be looked up from the rels. This doesn't break backward compatibility and works better with xfn (see below).
{
"rels": {
"tag": [
"http://del.icio.us/tag/tech"
]
},
"items": [],
"rel-urls": {
"http://del.icio.us/tag/tech": {
"rels": [
"tag"
],
"text": "Technology"
}
}
}
xfn
<a rel="coworker" href="http://example.com/johndoe">John Doe</a>
currently parses to:
{"rels": {"coworker": ["http://example.com/johndoe"]}, "items": []}
This loses the link text, which is the person's name. Suggested output using the "rel-urls" object:
{
"rels": {
"coworker": [
"http://example.com/johndoe"
]
},
"items": [],
"rel-urls": {
"http://example.com/johndoe": {
"rels": [
"coworker"
],
"text": "John Doe"
}
}
}
with multiple xfn values
<a rel="coworker friend" href="http://example.com/johndoe">John Doe</a>
we get this:
{
"rels": {
"coworker": [
"http://example.com/johndoe"
],
"friend": [
"http://example.com/johndoe"
]
},
"items": [],
"rel-urls": {
"http://example.com/johndoe": {
"rels": [
"coworker",
"friend"
],
"text": "John Doe"
}
}
}
rel-enclosure
<a rel="enclosure" href="http://example.com/movie.mp4" type="video/mpeg" title="real title">my movie</a>
currently parses to:
{"rels": {"enclosure": ["http://example.com/movie.mp4"]}, "items": []}
This loses the link text, which is the title, and the attributes, which give the type. Suggested output:
{
"rels": {
"enclosure": [
"http://example.com/movie.mp4"
]
},
"items": [],
"rel-urls": {
"http://example.com/movie.mp4": {
"rels": [
"enclosure"
],
"text": "my movie",
"type": "video/mpeg",
"title": "real title"
}
}
}
This generalises to other rel's too, such as rel-feed and rel-alternate that have type, lang etc attributes.
(updated to include changes from feedback below) Kevin Marks 22:13, 26 April 2015 (UTC)
attributes parsed
Attributes currently parsed are:
- hreflang for alternate and enclosure
- media for alternate and enclosure
- title for alternate and enclosure
- type for alternate and enclosure
Attributes we may consider parsing if we have a use case are:
- sizes for icon - need use-case documentation
- coords for area - possibly for people tagging - no examples yet, and unnecessary as people-tagging requires using h-* microformats
- shape for area - possibly for people tagging - no examples yet, and unnecessary as people-tagging requires using h-* microformats
In addition there is a special attribute name "text", which is the text contents of the link. This is useful in rel-tag, rel-enclosure and xfn, and in alternate when used for feeds. It's also clarifying for rel-me links.
Tantek suggests we use textContent for this instead, and make it a single string, not a list as "name" is elsewhere in mf2 parsing.
- Update: "text" is good enough, and "textContent" is ugly camelCase. Tantek 04:39, 29 May 2015 (UTC)
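To make the proposal concrete, here is a minimal sketch of how a parser might collect "rels" and "rel-urls" together, using only the Python standard library. The names RelCollector and parse_rels are hypothetical and are not the mf2py API. The sketch assumes hrefs are already absolute (no base URL resolution) and that the first text and attribute values seen for a URL win when the same URL is linked more than once.

```python
from html.parser import HTMLParser

# Attributes copied into "rel-urls", per the "attributes parsed" list.
REL_ATTRS = ("hreflang", "media", "title", "type")


class RelCollector(HTMLParser):
    """Hypothetical collector for "rels" and "rel-urls" (illustration only)."""

    def __init__(self):
        super().__init__()
        self.rels = {}
        self.rel_urls = {}
        self._open = None  # (url, text chunks) for the <a> currently being read

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag not in ("a", "link", "area"):
            return
        if not attrs.get("rel") or not attrs.get("href"):
            return
        url = attrs["href"]  # assumption: already absolute
        entry = self.rel_urls.setdefault(url, {"rels": []})
        for rel in attrs["rel"].split():
            urls = self.rels.setdefault(rel, [])
            if url not in urls:
                urls.append(url)
            if rel not in entry["rels"]:
                entry["rels"].append(rel)
        for name in REL_ATTRS:
            # Assumption: keep the first value seen for each attribute.
            if attrs.get(name) and name not in entry:
                entry[name] = attrs[name]
        if tag == "a":
            self._open = (url, [])

    def handle_data(self, data):
        if self._open is not None:
            self._open[1].append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._open is not None:
            url, chunks = self._open
            text = "".join(chunks).strip()
            if text:
                # A single string, not a list; first text seen wins.
                self.rel_urls[url].setdefault("text", text)
            self._open = None


def parse_rels(html):
    """Return only the "rels" and "rel-urls" parts of a parse result."""
    collector = RelCollector()
    collector.feed(html)
    collector.close()
    return {"rels": collector.rels, "rel-urls": collector.rel_urls}


result = parse_rels(
    '<a rel="enclosure" href="http://example.com/movie.mp4" '
    'type="video/mpeg" title="real title">my movie</a>'
)
```

Keying "rel-urls" by URL means a link with several rel values, as in the xfn case, yields a single entry whose "rels" list names all of them.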
feedback on more rel info
- "name" is bad because it misleadingly conflates with use of "name" elsewhere in microformats2.
- Suggested alternative: textContent - since that's literally what is being returned there. Tantek 02:35, 25 April 2015 (UTC)
- as all other mf2 keys are lowercase-with-hyphens, Tantek suggests 'text' as that isn't going to be an html Kevin Marks 07:28, 25 April 2015 (UTC)
- no need for array for "name"/textContent - since there is always only one at most
- "urls" key is misleading - implies all URLs in the document, which is neither true, nor desired (takes much more parsing time and work and code)
- Suggested alternative: "rel-urls". And open to better alternatives too. Tantek 02:35, 25 April 2015 (UTC)
- If we are trying to extend the number of properties returned from a rel without breaking the old structure, why don't we call the new structure something like "rels-extended"? Glenn Jones 12:29, 1 June 2015 (UTC)
- Extension is not the point, but rather to use them in a complementary way. One structure is for look-up of any rel value, hence "rels", which returns you a list of URLs. Then you can look up those URLs in the new mapping, by URL, hence it is called "rel-urls". The point is to use them in conjunction, and that's why rel-urls is named what it is. Tantek 22:03, 1 June 2015 (UTC)
- Why is the structure of "rel-urls" different to the "alternates" structure? Should the "url" not just be added as a property and not as a key? Creating two data structures for one type of object seems inconsistent. It adds cognitive load to anyone trying to understand the JSON structure. Glenn Jones 12:29, 1 June 2015 (UTC)
- I was trying to avoid breaking the existing "rels" structure and use of it - I did implement a variant that put the structure inside rels, and it became cumbersome and repetitive where there were multiple rels on a url (xfn cases). Denormalising as properties of the URL made more sense. It also dedupes if there is repetitive linking to the same URL, eg a series of posts with rel-author on each. Kevin Marks 20:05, 1 June 2015 (UTC)
- If the rel is a "tag" then the main value we need to return should be the last path component of the URL, not the link text? Should we add another output property ie "tag" Glenn Jones 12:29, 1 June 2015 (UTC)
- No need to return last path segment of the URL, because the URL is already there - and that's just a library/framework utility function to get the last path segment of a URL. Tantek 22:03, 1 June 2015 (UTC)
- As currently described, the URL from "alternates" is repeated in the "rel-urls" structure. If we are doing this, surely "alternate" should be in "rels" too? I assumed a mapping between them. Kevin Marks 20:05, 1 June 2015 (UTC)
- Yes it makes sense to drop "alternates" assuming the backcompat impact is low, put alternates in "rels" along with everything else, and direct people to use rels and rel-urls for alternates functionality. Evidence this is an acceptable even preferable approach.[3] Will add an issue accordingly. Tantek 22:03, 1 June 2015 (UTC)
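The complementary use of the two structures discussed above (look up a rel value in "rels" to get URLs, then look up each URL in "rel-urls") could be sketched as follows. The helper name labels_for_rel is hypothetical, and the parsed data is simply the multiple-value xfn example from this page.

```python
def labels_for_rel(parsed, rel):
    """Return (url, text) pairs for every URL that carries the given rel value."""
    rel_urls = parsed.get("rel-urls", {})
    return [
        (url, rel_urls.get(url, {}).get("text"))
        for url in parsed.get("rels", {}).get(rel, [])
    ]


# Parsed output copied from the multiple-value xfn example.
parsed = {
    "rels": {
        "coworker": ["http://example.com/johndoe"],
        "friend": ["http://example.com/johndoe"],
    },
    "items": [],
    "rel-urls": {
        "http://example.com/johndoe": {
            "rels": ["coworker", "friend"],
            "text": "John Doe",
        }
    },
}

friends = labels_for_rel(parsed, "friend")
```

A consumer that only knows the old "rels" structure keeps working unchanged, since "rel-urls" is consulted only when the extra label or attributes are wanted.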
Incorporated 2015-06-06
Nested h-* objects' "value" property
Status: resolved, resolution iterated, implementability proven by one real-world implementation, incorporated
- 2015-06-06 incorporated into microformats2-parsing
Raised 2015-01-06 by User:Kylewm
If a child element has a microformat (h-*) and is a property element (p-*, u-*, dt-*, e-*), the parser will add a "value" property to the resulting object. The value should attempt to be a useful representation of the object for consumers that do not have semantic knowledge of the particular h-* type. Ref: microformats2-parsing#parse_an_element_for_class_microformats.
To determine the "value", we parse the property element simply (as if it did not have a h-* class), which works well for simple h-* objects, e.g. <a class="u-like-of h-cite" href="...">...</a>
- To handle more complex microformats, I propose that "value" for a p-* property element take on the first explicit "name" property of the nested microformat, and for a u-* property, the first explicit "url" property. Parsing will fall back on the current rules if an explicit property does not exist.
- This makes sense to me, and fits with the use-cases and examples I've seen. Tantek 19:31, 6 January 2015 (UTC)
- A similar (possibly simpler?) formulation would use the implied name and url rules to determine the "value" for p-* and u-* properties respectively
- I don't think that's needed, as there are already implied rules on a property that should handle that. I'd start with just the "first explicit" scoping to be more conservative, and then see if we find any use-cases that (and existing implied rules) don't/doesn't catch. Tantek 19:31, 6 January 2015 (UTC)
- Agreement at 2015-01-20 meetup.
For example:
<div class="h-entry">
<div class="u-in-reply-to h-cite">
<a class="p-author h-card" href="http://example.com">Example Author</a>
<a class="p-name u-url" href="http://example.com/post">Example Post</a>
</div>
</div>
The nested u-in-reply-to object would parse as
...
"in-reply-to": [{
"type": ["h-cite"],
"properties": {
"name": ["Example Post"],
"url": ["http://example.com/post"],
"author": [{
"type":["h-card"],
"properties": {
"url": ["http://example.com"],
"name": ["Example Author"]
},
"value": "Example Author"
}]
},
"value": "http://example.com/post"
}]
...
where the outer "value" gets the in-reply-to h-cite's u-url property, and the inner "value" gets the author's p-name property.
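A rough sketch of this "value" rule, for illustration only: the function name nested_value and its arguments are hypothetical, and it assumes the caller passes only the explicitly marked-up properties of the nested microformat (a real parser would have to track which properties were implied) and that the first value is a plain string.

```python
def nested_value(prefix, explicit_properties, simple_value):
    """Pick "value" for a nested h-* object that is also a property.

    prefix: the property prefix ("p", "u", "dt" or "e").
    explicit_properties: explicit properties of the nested microformat.
    simple_value: what simple property parsing of the element would return.
    """
    if prefix == "p":
        key = "name"
    elif prefix == "u":
        key = "url"
    else:
        # Other prefixes keep the simple parsing result in this sketch.
        return simple_value
    values = explicit_properties.get(key)
    # Fall back on the current rules when no explicit property exists.
    return values[0] if values else simple_value


h_cite = {"name": ["Example Post"], "url": ["http://example.com/post"]}
h_card = {"name": ["Example Author"], "url": ["http://example.com"]}

outer = nested_value("u", h_cite, "fallback from simple u-* parsing")
inner = nested_value("p", h_card, "fallback from simple p-* parsing")
```

Here outer is the h-cite's first "url" and inner is the h-card's first "name", matching the two "value" entries in the example.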
- Because there are no implied properties of the dt-* and e-* types, and no obvious defaults, the value rules for these types would not change.
- A possibility for dt-* h-*: The dt-* could take either the first dt-* of the h-*, or (perhaps if no dt-* in the h-*) the first <time> element inside. Tantek 19:31, 6 January 2015 (UTC)
- First dt-* seems reasonable, predictable, and usable. Consensus at 2015-01-20 meetup.
- Update 2015-05-29: no known use-cases for first dt-* or first e-*, and implementing that "would require some refactoring" (in mf2py at least per kylewm), thus until there's a use-case for first dt-*/e-* inside, let's treat "dt-* h-*" and "e-* h-*" as before. Tantek . In particular:
- p-* h-* - value from first "name" as proposed above
- u-* h-* - value from first "url" as proposed above
- e-* h-* - value is already defined for e-* parsing, nothing special here
- dt-* h-* - value from normal dt-* parsing - nothing special.
- +1 totally agree, let's wait for use cases of e-* dt-* Kylewm 19:44, 29 May 2015 (UTC)
- Implemented in mf2py 2015-06-01 https://github.com/tommorris/mf2py/commit/edc895ef5a780bcee654e6644a688688934517b0
- Added to microformats test suite (experimental) 2015-06-01 https://github.com/microformats/tests/commit/90c8a7d8e96c7160036a298e13f16d9ddaec218e