microformats2-parsing-brainstorming: Difference between revisions

From Microformats Wiki
(→‎Other Interpretation/Parsing Notes: Suggest parsing time/@pubdate as .dt-published.)
m (Replace <entry-title> with {{DISPLAYTITLE:}})
 
(97 intermediate revisions by 12 users not shown)
Line 1: Line 1:
Author: [[User:BenWard|Ben Ward]]
{{DISPLAYTITLE:microformats2 parsing brainstorming}}
 
This page is for brainstorming, discussion, and other questions and explorations about [[microformats2]] parsing.
 
For the microformats2 parsing algorithm, see:
* [[microformats2-parsing]]
 
For filing issues / problems with microformats2-parsing, see:
* https://github.com/microformats/microformats2-parsing/issues
** [[microformats2-parsing-issues|Resolved issues before 2016-06-20]]


[[microformats-2|Microformats 2]] proposes a new, all-encompassing syntax modification of [[microformats-2-prefixes|prefixes]] that will allow microformats to be parsed from pages by processors without prior knowledge of a vocabulary. The core components of this model are quite simple to implement, but a number of conflicts emerge with the functionality of existing microformats parsers and need to be handled. This page documents a proposed model to separate these concerns clearly in a way that can be applied to the documentation of generic microformats parsing rules and to the documentation of individual vocabularies.
__TOC__


==Parsing Microformats 2.0 Syntax: Extraction vs. Interpretation==
== Parse img alt ==
Per https://github.com/microformats/microformats2-parsing/issues/2 currently any u-* property (e.g. u-photo, u-featured) that extracts a 'src' attr from an img tag loses any associated 'alt' text alternative, and if at some point the consuming application wants to display that u-* property as an img, they have to either omit or synthesize a fake text alternative.


A microformats ‘1.0’ parser performs the following function:
It is desirable to somehow maintain that image src and alt association from the original markup, through the parsing process, up until a consuming application wishes to re-present the image with the text alternative.


* Given a piece of HTML content, discover a known microformat, extract it, apply various extraction patterns based upon the HTML mark-up used (e.g. include pattern, <code>abbr</code> patterns, date-time patterns, value-title pattern), apply various content optimisations where applicable, and return the result in an object native to the programming language.
There are a number of possibilities / approaches here worth brainstorming:


This is performing two types of function: Extraction of data from an HTML document or fragment, ''and'' interpretation and optimisation of that content to match the rules set out by a vocabulary specification.
=== Include alt property in parent object ===
# explicit authoring: require the author to use a new 'p-alt' property on the image to cause parsing and extraction of the text alternative.
#* Problem(s): fails for multiple images, some of which may or may not have alt attrs or corresponding p-alt properties (and fragile, forgetting one p-alt throws off the parallel lists of u-* and p-alt).
# implicit p-alt: for every img that is parsed for a u-* property, the parse could generate a p-alt property with value.
#* Problem(s): fragile again for similar reasons, not all u-*s may be on img elements, or may not have alt attrs for all imgs in the source.
# implicit p-alt only for implied u-photo
#* This is better since there can only be one implied u-photo, and thus if there is a p-alt, it must be associated with the one u-photo
#* Problem(s): does not work for other u-* image properties e.g. u-featured


It is only possible to write a generic parser that covers the first half of this task: Extraction, and application of global rules based on HTML elements and patterns common to all formats.
<code><nowiki><div class="h-entry"><img src="http://example.com/photo.jpg" alt="Example" class="u-photo p-alt"></div></nowiki></code>


The purpose of a generic parser (as supported by use cases such as search engines, and other crawlers) is:  
<code><nowiki>{"type":["h-entry"],"properties":{"photo":["http://example.com/photo.jpg"],"alt":["Example"]}}</nowiki></code>


To provide a way for tools to extract rich data from a page for native storage, such that the data may be interpreted later by applications. This allows microformats to be crawled, and indexed, and removes the need to include complex HTML parsing within every implementation of microformat data.
=== Make photo property an object ===
1. use "h-image" on any u-* on img elements to imply a structure with paired photo and 'name' text alternative, e.g. <blockquote><code>&lt;img src="a.jpg" alt="text about a" class="u-featured h-image"/></code></blockquote> which would result in a u-featured property with one value, a structure of an h-image with itself having implied properties of a u-photo of "a.jpg" and a p-name of the "text about a". Similarly the author can use the object tag for the same result: <blockquote><code>&lt;object data="a.jpg" class="u-featured h-image">text about a</object></code></blockquote> In either case, the same microformats JSON would be generated, which is correct, as in both cases, there is an image with a fallback text alternative. The specific HTML used should not matter. The semantic of pairing the image with the text alternative is communicated the same way for both.


Microformats will continue to define various vocabulary-specific optimisations as part of the design to be optimised for authors. For example: the <code>fn</code> pattern in [[hcard]], or the <code>lat;long</code> pattern in [[geo]], as well as default values for properties, such as the maximum rating in an [[hreview]].
* Challenge: requires author use of additional classname "h-image".
* Actually, no, as it is defined currently, microformats 2 ''drops'' vocabulary-specific optimizations. Such optimizations have often been too inapplicable, error prone or i18n-unsafe (e.g. fn to given-name + family-name fails for both numerous cases where middlenames/initials are used, and in general in numerous Asian languages where given/family name order is the reverse of Western English conventions, or languages with multiple family-names, e.g. Spanish - see [[hcard-issues-resolved]] for more). This is a deliberate cutting of a "feature" from microformats 1, it is a deliberate model simplification design decision. [[User:Tantek|Tantek]] 12:43, 4 October 2011 (UTC)
* Benefit: does not require a change to the parsing algorithm


Microformats 2.0 should refer only to ''extraction'' of microformats. Vocabularies should in turn document their appropriate optimisations, which will need to be applied by implementations, or a companion to an extractor, which I'll refer to here as an ‘interpreter’.
<source lang=html4strict>
* Vocabularies will no longer have optimizations; this is again deliberate, as they've been shown to be more error prone than helpful. Thus there should be no need for any vocabulary-specific 'interpreters'. However, due to design quirks in various legacy/interchange formats, ''export conversion algorithms'' to those legacy/interchange formats will require some additional legacy-format-specific rules (e.g. odd "required" rules in [[Atom]] or [[vCard]] will require specific synthesis rules, limitations in said formats will require filtering of some values, e.g. [[vcard3]] BDAY disallows vague birthdays like year-month and --month-day - subsequently allowed in [[vcard4]]). [[User:Tantek|Tantek]] 12:43, 4 October 2011 (UTC)
<div class="h-entry">
<img src="http://example.com/eg.jpg" alt="Photo of an example" class="u-featured h-image">
</div>
</source>
 
<source lang=javascript>
{
"type":["h-entry"],
"properties":{
  "featured":[{
    "type":["h-image"],
    "properties":{
      "photo":["http://example.com/eg.jpg"],
      "name":["Photo of an example"]
    }
  }]
}
}
</source> [http://pin13.net/mf2/?id=20160719001154920]


A microformats 2.0 ‘extractor’, in combination with the functionality of a domain and format-aware ‘interpreter’ (either another shared component, or part of the implementation itself) would be equivalent to a microformats 1.0 ‘parser.’
* A microformats 2.0 parser is both generic (no knowledge of specific vocabularies), and lacks any/all such vocabulary-specific rules as compared to a microformats 1.0 [[hcard-parsing|parser]] with the exception of 1) a limited list of well-established/interoperable backward compat root class names (of current [[microformats]] that are or can be soon shown to be specifications/standards per the [[process]]), 2) flat sets of backward compat property names (some with prefix/name specific conversion) for each of those backward compat root class names.  This is a deliberate design decision that makes microformats 2 more "micro", and yes this means that even with such backward compat support, this simple form of backward compat may mean that some existing microformats 1 content breaks. We'll assess those and iterate on a documented case-by-case basis rather than attempt to maintain theoretical 100% backward compatibility (since many current microformats format-specific-features are either unused, or may have caused more problems than solutions). [[User:Tantek|Tantek]] 12:43, 4 October 2011 (UTC)


N.B. I'll rewrite some of these as microformats-2-parsing-faq to help better clarify. The reasoning that led to most of these design decisions is documented in the [[microformats-2#About_This_Brainstorm|microformats 2: About This Brainstorm]] section and following sections. I'll recheck those sections to see if/where reasoning for some of the above noted design decisions may have been missed, and back-fill accordingly. This is necessary because [[microformats 2]] is an evolutionary result of simultaneously addressing both numerous generic [[issues]] as well as various common [[hcard-issues-resolved|format]]-[[hcalendar-issues-resolved|specific]] [[mfo|problems]] in microformats 1 syntax and vocabularies. The very number of changes may make it more challenging (from a microformats 1 perspective) to see why any particular design change has been made. [[User:Tantek|Tantek]] 12:43, 4 October 2011 (UTC)
2. have u-* on an &lt;img> automatically create an object if there is a non-empty 'alt' attribute. <br/>If a u-* property is parsed on an &lt;img> element with a non-empty 'alt' attribute, then: <br/>
Create a structure similar to the e-content nested structure that provides the "value" as the URL, and an "alt" as the text alternative.


==Parsing Literal Values==
* Advantage: no additional microformats markup needed from author
* Challenge: Many (most?) existing published u-photo properties will now return an object instead of a string, and consuming applications may not be expecting an object for a photo
** Mitigation: If this is done as an explicit parser library upgrade, consuming applications may decide when to take this parser upgrade and thus fix their u-photo handling to look for string or object before upgrading their microformats2 parsing library instance.
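A minimal sketch of the parse-side rule in proposal 2, for illustration only. The function name <code>parseImgUrlValue</code> and the plain-object representation of the img attributes are assumptions made here, not part of any existing parser; resolving relative URLs is deliberately left out.

```javascript
// Hypothetical helper: `attrs` is a plain object holding an <img> element's
// attributes, e.g. { src: "...", alt: "..." }. Not any real parser's API.
function parseImgUrlValue(attrs) {
  const src = attrs.src;
  const alt = attrs.alt;
  // A missing or empty alt keeps the existing behaviour: a plain URL string.
  if (typeof alt !== "string" || alt === "") {
    return src;
  }
  // A non-empty alt yields the proposed structure pairing URL and alt text.
  return { value: src, alt: alt };
}

console.log(parseImgUrlValue({ src: "http://example.com/eg.jpg", alt: "Photo of an example" }));
console.log(parseImgUrlValue({ src: "http://example.com/eg.jpg", alt: "" }));
```

Because images without a text alternative still produce a string, previously published markup lacking alt would parse exactly as before under this sketch.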


It is proposed for microformats 2.0 that all microformats be parseable from just their root element, e.g. <code>&lt;p class="h-card">Ben Ward&lt;/p></code> would create an hCard with the following properties after parsing:
<source lang=html4strict>
<div class="h-entry">
<img src="http://example.com/eg.jpg" alt="Photo of an example" class="u-featured">
</div>
</source>


<source lang=javascript>
<source lang=javascript>
{  
{
  'type': ['h-card'],
"type":["h-entry"],
  'properties': {
"properties":{
    'name': ['Ben Ward']
  "featured":[{
   }
    "value":"http://example.com/eg.jpg",
    "alt":"Photo of an example"
   }]
}
}
}
</source>
</source>
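On the consuming side, the "string or object" handling mentioned in the mitigation might look roughly like the sketch below. <code>toImage</code> and its return shape are hypothetical names chosen for this example; the only assumption about the parsed data is that a u-* value is either a URL string or an object with "value" and "alt".

```javascript
// Hypothetical consumer helper: normalise one parsed u-* value into a
// { url, alt } pair, whichever of the two forms the parser produced.
function toImage(propertyValue) {
  if (typeof propertyValue === "string") {
    // Plain URL string: no text alternative is known.
    return { url: propertyValue, alt: null };
  }
  // Object form: "value" carries the URL, "alt" the text alternative.
  const alt = typeof propertyValue.alt === "string" ? propertyValue.alt : null;
  return { url: propertyValue.value, alt: alt };
}

const featured = [
  "http://example.com/plain.jpg",
  { value: "http://example.com/eg.jpg", alt: "Photo of an example" }
];
console.log(featured.map(toImage));
```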


This is a four-fold change from the current [[hCard]]:
... more brainstorming needed
# type is generically identifiable as a microformat root, even in parsed form. The use of the 'h-' prefix persists into the type of the object. This is deliberately so, as a result of re-using the JSON data model of microdata which itself is re-using a common JSON convention, such that microformatted data is clearly distinguishable (as opposed to any other random schema that may be using a similar data model).
 
# root-class-only support. Per [[microformats-2-implied-properties]], the ''name'' property is implied by the entirety of the root class name element.
=== img alt thoughts ===
# 'name' instead of 'fn'. As also documented in [[microformats-2-implied-properties]], the continuous challenges/problems and need to repeatedly re-explain 'fn' over the years combined with the real-world market response of nearly every other party doing a person vocabulary renaming 'fn' to 'name', microformats 2 makes this change as well.
Thoughts about img alt brainstorm proposals. Feel free to offer counterpoints with nested items and/or alternative preferences/opinions with (potentially multiple) top level items!
# There is no automatic parse-time inferring of <code>'given-name': ['Ben']</code> and <code>'family-name': ['Ward']</code>. Any such inferring *might* be made by a vCard converter, but is left up to that specific application (not all applications) built on that vocabulary, though even in that case it may not be necessary, as an empty "N:;;;" [[vCard]] property is sufficient to satisfy the N property requirement of [[vCard]], and also causes no problems when imported into various [[vcard-implementations]].
 
<div class="discussion">
* Tantek: I am '''leaning towards "Make photo property an object" brainstorm "2."''' because it feels more "automatic" and thus provides lower friction to more accessibility. Less (author) work for "alt" information to get passed through to the JSON result, and thus more potentially re-usable by consuming applications that want to preserve or re-emit the pairing of a photo and its fallback text alternative. -- [[User:Tantek|Tantek]] 00:53, 19 July 2016 (UTC)
* Aaron: I am leaning towards ''2'' because it takes less work on the part of publishers as well as consumers. From the publisher POV, if they add the alt attribute, that should be all they need to do, it seems odd to make them do additional work to make that show up in the parsed result. From the consumer side, some implementations will not need changing since when looking for a string value, they already use either the string directly or look for the "value" of the property if it's an object. Making consumers handle a new h- object just to read alt text seems overkill.
** Additionally, if the alt attribute is an empty string, this should be considered the same as if it were missing, so that the photo value will be the URL string rather than the object in this case as well
* Kevin: 2 makes sense to me as well, as this is a very specific need. If we want an image object with more substructure as 1 implies, that should be a new object type that follows the [[process]] - there is a case for that based on usage of figure/figcaption etc. but caption is not alt, and using name for it implies that it is. [[User:Kevin Marks|Kevin Marks]] 01:50, 19 July 2016 (UTC)
* Bear: The thoughts given above for option 2 make the most sense as a library writer and consumer, tying this change to a parser implementation's major version change will (should) give everyone notice and time to adjust
...
* (unanimity copied to GitHub)
</div>
 
When it looks like thoughts are naturally converging, we should take that emergent convergence back to the github thread for proper back/forth discussion and figuring out of details.
 
https://github.com/microformats/microformats2-parsing/issues/2


It is required of the extractor to understand that when a microformats object specifies no explicit child properties, that it must treat <code>h-card</code> as having a <code>p-name</code>. But, the parser is generic, so it also treats <code>h-review</code>, <code>h-entry</code>, <code>h-recipe</code>, <code>h-geo</code> as having a ‘<code>p-name</code>’.
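The vocabulary-independent nature of this rule can be sketched as follows. The function <code>applyImpliedName</code> and its arguments are hypothetical, and the sketch assumes the root element's text content has already been extracted.

```javascript
// Hypothetical sketch: imply a 'name' only when a root has no explicit
// properties, regardless of which h-* type it is.
function applyImpliedName(types, properties, rootText) {
  const result = { type: types, properties: { ...properties } };
  if (Object.keys(result.properties).length === 0) {
    result.properties.name = [rootText.trim()];
  }
  return result;
}

console.log(applyImpliedName(["h-card"], {}, "Ben Ward"));
console.log(applyImpliedName(["h-geo"], {}, "1.3232;-0.543"));
```

Note that the same code path handles h-card and h-geo alike, which is the trade-off discussed here: the parser needs no vocabulary knowledge, at the cost of a 'name' on formats that have no such semantic.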
* [[User:Tantek|Tantek]] 22:10, 1 August 2016 (UTC): Thanks Aaron, Kevin, Bear - based on the unanimous support of one particular brainstorm proposal, that proposal has been moved to the GitHub issue, and any follow-up about it (corrections, refinements, iterations) should occur there:
** https://github.com/microformats/microformats2-parsing/issues/2#issuecomment-236708854


As a result, specific vocabularies are evolved to drop their specific form of name (e.g. fn, summary, entry-title) and simplified to use a common 'name' property instead.
== Parse language information ==
Raised by [[User:VoxPelli|VoxPelli]] 18:04, 23 July 2015 (UTC)
* 2016-060: Update: and parse "id" attribute. [[User:Tantek|Tantek]] 16:39, 29 February 2016 (UTC) (see Additionally below)
* 2016-07-13: Update: created [https://github.com/microformats/microformats2-parsing/issues/3 GitHub issue] for this brainstorm [[User:VoxPelli|VoxPelli]] 14:34, 13 July 2016 (UTC)


Note: while the overwhelming majority of real world publishing/consuming uses of microformats do so with proper nouns which have names (and thus this parser-level incorporation of an implied 'name'), there are some formats that do not have a 'name' semantic. For example, [[geo]], [[adr]], and possibly if/when developed, units of measure, length, cost. The current thinking is that the benefits to the far greater proper-noun use-case of microformats outweigh the technical inelegance of having an extra/ignored 'name' property on formats that lack such a semantic.
Currently there’s no way to tell the language of parsed microformats even if those microformats have been marked up with HTML "lang" attributes.


Some formats also may appear in theory to better imply some other property, e.g. a review might be thought to imply its ''content'', not its name, and an Atom entry its ''content'', not its title, but in practice (actual publishing patterns) this is not the case. Typically, brief unstructured reviews (or mentions thereof) provide a ''summary'' (often hyperlinked to an expanded structured form) of that review, not its content, and similarly, brief unstructured posts (e.g. RSS items) have historically most often been link blog items which include the title of an item and a link. Short status updates as well established by Twitter are newer and would seem to imply purely content with no title, at least semantically, however, even Twitter populates the RSS title and ATOM entry title of their feeds with the content. It's not clear what went into that decision, however, that's likely irrelevant, as the outcome turns out to be emergent consistency among publishing behaviors.
There are examples in the wild of people marking up pages in such a way:


To avoid overloading or undermining the semantics of a vocabulary, I propose that we handle this at the extractor level in a simpler fashion: Define a new property for literal data, that an extractor will provide if no other information was available. All ''interpreters'' may then be instructed that in the event that an object has no properties, it can attempt to interpret the literal value from the page instead.
* [http://voxpelli.com/ VoxPelli.com] has a "lang" attribute on the h-entry of his [http://voxpelli.com/2011/03/sista-dagen-p-good-old/ Swedish articles] to signify that the article is Swedish even though the rest of the site is English.
* This was one of the design iterations I went through which led me to the current implied 'name' design. Another iteration was the ability for a vocabulary to specify a single required property which was implied if there were no properties provided. However, the combination of the fact that in most cases such single required properties were quite name-like, and that a vocabulary-specific rule like that would then bind parsers to specific vocabularies (even so slightly) led me to collapse them into implying a 'name'. It's not perfect, but it's the best alternative so far that balances practical convenience of publishing/consuming, avoids vocabulary-specific knowledge in the parser, and technical (in)elegance. [[User:Tantek|Tantek]] 13:48, 4 October 2011 (UTC)
* Stephanie [http://climbtothestars.org/archives/2013/09/17/basic-bilingual-1-0-plugin-for-wordpress-blog-in-more-than-one-language/ uses a WordPress plugin] that adds summaries of other languages at the start of her content.
* [https://seblog.nl/ Seblog.nl] has a <code>lang="nl"</code> attribute on the <code>&lt;html></code> of each page, and uses a <code>lang="en"</code> on the p-name, p-summary and e-content of an h-entry if the CMS field 'lang' is set to "en" (or any language other than "nl"). This signifies that the article is in English while the rest of the page is in Dutch (including the textual representation of the date). ([https://seblog.nl/2017/01/02/2/screenshots example])


In existing microformats, the closest existing example we have for this is the <code>label</code> property in hCard, which is used to represent the literal address label for a place. It is a corresponding piece of <code>fn</code>, <code>org</code> and <code>adr</code> in combination, but has no structure in and of itself. Possibly, every microformat could have a <code>label</code> form where structured data is unavailable.
The proposal is to add a new "lang" keyword to h-* and e-* objects so that the following example:


However in practice, the hCard <code>label</code> property is both little understood and little used. It's not even clear that it ought to be kept for microformats 2 (no known consumers, very few (if any?) real-world non-test publishers). This disuse is likely a good indicator that we should avoid basing anything on its design.
<source lang=html4strict>
<div class="h-entry" lang="sv">
  <h1 class="p-name">En svensk titel</h1>
  <div class="e-content" lang="en">With an <em>english</em> summary</div>
  <div class="e-content">Och <em>svensk</em> huvudtext</div>
</div>
</source>


Alternatively, <code>value</code> is used throughout microformats to target a generic value (e.g. in combination with <code>price</code> in hListing.) It has been proposed that when parsing properties that are also themselves microformats, we create native objects of the form:
Would be parsed into something like:


    {
<source lang=javascript>
        'value': '1900 12th Street, San Francisco, CA 94'
{
      , 'type': ['adr']
  "type": ["h-entry"],
       , 'properties': {
  "lang": "sv",
            'street-address': '1900 12th Street'
  "properties": {
          , 'etc': 'etc'
    "name": ["En svensk titel"],
         }
    "content": [
     }
       {
        "lang": "en",
        "html": "With an <em>english</em> summary",
        "value": "With an english summary"
      },
      {
        "html": "Och <em>svensk</em> huvudtext",
         "value": "Och svensk huvudtext"
      }
     ]
  }
}
</source>


We could apply this same pattern to the root level:
This was [http://indiewebcamp.com/irc/2015-07-23#t1437667712078 brainstormed on the IndieWebCamp IRC-channel] where the mentioned example came up.


    {
* Pull request for implementation in microformat-node added 2015-07-23 https://github.com/glennjones/microformat-node/pull/23
        type: [hcard]
** Closed 2015-09-08 because the library has changed and parsing is now handled by microformat-shiv. New issue opened there: https://github.com/glennjones/microformat-shiv/issues/22
      , properties: {}
* Issue around implementation in php-mf2 added 2016-05-07 https://github.com/indieweb/php-mf2/issues/96
      , value: 'Ben Ward'
** Released 2017-05-27 in v0.3.2 behind a feature flag.
    }


In this case, an interpreter or implementation is responsible for using <code>value</code> in place of <code>fn</code>, or restructuring the object. It would be the responsibility of each vocabulary to define its root property. The parsing layer of microformats 2.0 would not impose semantics or naming onto that.
Additionally: consider the same for "id" attributes (use-case: rel=feed local discovery of a nested h-feed on the home page), specifically, parsing the first instance of any "id" attribute (ignoring latter duplicate id attribute values on any subsequent elements).
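The "first instance wins" handling of duplicate id values could be sketched like this. The element shape (plain objects with an optional <code>id</code>, listed in document order) and the function name are assumptions for illustration only.

```javascript
// Hypothetical sketch: keep only the first element seen for each id value,
// ignoring later elements that repeat the same id.
function firstElementPerId(elementsInDocumentOrder) {
  const firstById = new Map();
  for (const el of elementsInDocumentOrder) {
    if (typeof el.id === "string" && el.id !== "" && !firstById.has(el.id)) {
      firstById.set(el.id, el);
    }
  }
  return firstById;
}

const elements = [
  { id: "postfrag123", className: "h-entry" },
  { className: "p-name" },
  { id: "postfrag123", className: "h-feed" } // duplicate id, ignored
];
console.log(firstElementPerId(elements).get("postfrag123").className);
```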


For another example, geo would end up like this:
And alternatively: consider parsing as "html-id" and "html-lang" prefixed properties in the parsed result, e.g.


    {
* '''Q:''' Why parse with the "html-" prefix?
        type: [geo]
* '''A:''' "html-lang and html-id to avoid confusing them with a possible actual property p-lang or p-id (which we don't have but might / could, especially from a vocabulary agnostic parser perspective)" https://chat.indieweb.org/microformats/2017-05-30#t1496166813294000
      , properties: {}
      , value: '1.3232;-0.543'
    }


* This is an alternative I've been considering as well:  [[User:Tantek|Tantek]] 13:48, 4 October 2011 (UTC)
<source lang=html4strict>
** 'value' is more generic than 'name' (applies to more vocabularies) with the trade-off that it naturally has less (weaker) semantics.
<div class="h-entry" lang="sv" id="postfrag123">
*** +1 I think that having naturally weaker semantics would be appropriate for this parsing functionality. —[[User:BenWard|BenWard]] 07:24, 5 October 2011 (UTC)
  <h1 class="p-name">En svensk titel</h1>
** The interesting thing that this analysis has revealed is that there appear to be two distinct clusters of microformats, the much more commonly used/understood/useful proper-noun microformats which markup things with names (people, events, reviews, recipes), and the less used compound-data microformats which are often used ''inside'' other microformats and just have some sort of semi-structured value (adr, geo, measure, and perhaps even things like tel). Perhaps this is implying the possibility and some degree of utility for ''two'' microformats root class name prefixes, 'h-' for existing proper-noun microformats, and something else ('m-' for microformat/molecule?, 's-' for structured-value?, 'v-' for value (though historically "v-"/"v." has meant "vendor-specific")?) for unnamed structured data microformats.
  <div class="e-content" lang="en">With an <em>english</em> summary</div>
*** This more and more feels like a good idea, and I'm leaning toward "s-" for struct / structure / structured value. "s-" works just like "h-" except that it doesn't imply any properties at parse time. We can try it and see what happens. There's also no harm if publishers just use "h-" structures, they just (possibly) get a few extra properties if they happen to omit properties.
  <div class="e-content">Och <em>svensk</em> huvudtext</div>
** Parallels the same JSON when a property has both a string value ''and'' is a structure itself.
</div>
*** Changed my mind on this. The parallel is not quite there. 'name'/'url'/'photo' are only implied if there are NO properties, whereas the JSON string value + structure convention *always* provides a 'value'. [[User:Tantek|Tantek]] 22:39, 4 October 2011 (UTC)
</source>
*** And due to this difference in behavior ('value' is there when nested properties are present, whereas 'name' is only implied when there are no properties specified), I think it's correct to keep them separate, i.e. stick with implied 'name'. [[User:Tantek|Tantek]] 14:56, 5 October 2011 (UTC)
** However, I'm still currently leaning towards the practical convenience of providing a 'name' for the vast majority of microformats uses, rather than diluting this feature for the sake of avoiding implying inapplicable semantics to the few plain structured data microformats, and even then, only when no properties are explicitly specified! I'd rather introduce a new root prefix for those than lose the simplicity and utility of implied 'name'. [[User:Tantek|Tantek]] 13:48, 4 October 2011 (UTC)


==Parsing properties from rel attributes==
Would be parsed into something like:


--[[User:BenWard|BenWard]] 07:24, 5 October 2011 (UTC):
<source lang=javascript>
{
  "type": ["h-entry"],
  "html-id": "postfrag123",
  "html-lang": "sv",
  "properties": {
    "name": ["En svensk titel"],
    "content": [
      {
        "html-lang": "en",
        "html": "With an <em>english</em> summary",
        "value": "With an english summary"
      },
      {
        "html": "Och <em>svensk</em> huvudtext",
        "value": "Och svensk huvudtext"
      }
    ]
  }
}
</source>


* Currently, hAtom parses `bookmark` as a permalink
=== Language inheritance ===
* Various microformats parse `rel=tag` as tags
* The current proposal for parsing does not allow parsing properties from rel attributes.


Microformats parsers could instead extract ''all'' link relationships from rel attributes within a microformat object, parsing them as if they were u- prefixed properties.
If the "lang" attribute is not specified for a particular element, it is inherited from the nearest ancestor that specifies one (or, failing that, from the HTTP Content-Language header).
* Minor nit: Rather than the same as a u- prefixed property, I think such "rel" properties should be parsed purely from the <code>href</code> attribute on <code>&lt;a&gt;</code> and <code>&lt;area&gt;</code> elements and nothing more. I would strongly disagree with extending rel to apply to other elements with URLs like img src, object data, or to apply to elements in general like div. That's the path that RDFa has taken and caused much confusion as a result. [[User:Tantek|Tantek]] 07:39, 5 October 2011 (UTC)
** Agree: That seems like a perfectly reasonable restriction. --[[User:BenWard|BenWard]] 08:29, 5 October 2011 (UTC)
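A rough sketch of rel parsing under the restriction agreed above (only <code>href</code> on a and area elements). The element shape and the name <code>addRelProperties</code> are hypothetical, and appending rel values to any existing class-derived values of the same name is just one possible way to handle collisions.

```javascript
// Hypothetical sketch: `elements` are plain objects { tag, rel, href }
// found inside one microformat object; `properties` holds values already
// parsed from class names.
function addRelProperties(properties, elements) {
  const result = {};
  for (const [name, values] of Object.entries(properties)) {
    result[name] = [...values];
  }
  for (const el of elements) {
    const tag = (el.tag || "").toLowerCase();
    // Only <a> and <area> with both rel and href contribute rel properties.
    if ((tag !== "a" && tag !== "area") || !el.href || !el.rel) {
      continue;
    }
    // rel is a space-separated token list; each token becomes a property.
    for (const rel of el.rel.split(/\s+/).filter(Boolean)) {
      if (!result[rel]) {
        result[rel] = [];
      }
      result[rel].push(el.href);
    }
  }
  return result;
}

console.log(addRelProperties({}, [
  { tag: "a", rel: "bookmark", href: "http://example.com/post/1" },
  { tag: "a", rel: "tag", href: "http://example.com/tags/mf2" },
  { tag: "img", rel: "tag", href: "http://example.com/ignored" }
]));
```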


This results in:
HTML5: https://www.w3.org/TR/html5/dom.html#the-lang-and-xml:lang-attributes<br>
HTML4: https://www.w3.org/TR/html4/struct/dirlang.html#h-8.1.2


* Continuing use of the <code>rel</code> attribute in HTML, thereby building on HTML semantics rather than bypassing them or ignoring them in favour of something less meaningful.
Proposal: Determine and include the inherited "lang" value on *every* microformat object that directly specifies a lang or that has an ancestor that does, e.g. if &lt;html lang="en"&gt;, then every object in the output will have "lang": "en".
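The inheritance lookup might be sketched as below. The node shape (<code>{ lang, parent }</code>) and the function name <code>effectiveLang</code> are assumptions for illustration; treating an empty lang attribute as "not specified" is a simplification of HTML's actual rules, where an empty value means the language is unknown.

```javascript
// Hypothetical sketch: walk up from a node to the nearest ancestor with a
// lang value, falling back to the Content-Language header value if given.
function effectiveLang(node, contentLanguage) {
  for (let current = node; current; current = current.parent) {
    if (typeof current.lang === "string" && current.lang !== "") {
      return current.lang;
    }
  }
  return contentLanguage || undefined;
}

const html = { lang: "en" };
const entry = { parent: html };                  // inherits "en"
const content = { lang: "sv", parent: entry };   // overrides with "sv"
console.log(effectiveLang(entry), effectiveLang(content));
```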
* Parsed hAtom objects contain a property named <code>bookmark</code>, in place of <code>permalink</code>.
* All microformats that use <code>rel-tag</code> would contain a property named… <code>tag</code>. Perfect.


Since <code>rel</code> attributes are not overloaded for other functionality like class is, and other uses of <code>rel</code> within content are low (and non-semantic uses are nil, to the best of my knowledge) the risk of property pollution would be extremely low.
=== Pronouns in different languages ===


Note, with regard to this last point, that a generic microformats parser ''will'' parse false-positive properties, and ''will'' parse objects in combined chunks, rather than individually by format. Extracted objects will often not represent a vocabulary without further processing.
Language is also useful context when defining [[pronouns]], discussed a bit here[https://github.com/idno/Known/pull/1426#issuecomment-217626923].


<div class="discussion">
* This sounds like it might be workable. Let's try it and see how well authors "get it". - [[User:Tantek|Tantek]]
* Possible issue: do we have any collisions between class property names and rel names? (I don't think so offhand, but useful to ask the question). - [[User:Tantek|Tantek]]
** None that I can think of in microformats. There is the case of Google's <code>rel=author</code> and <code>p-author</code> in hAtom. However, the next point, about mfo scoping, would cover it in most situations (rel-author on a hyperlink within an hcard wouldn't be applied to the hentry.) The one situation in a parse tree where it's ambiguous would be this:
<source lang=html4strict>
<source lang=html4strict>
<a class="p-author h-card"
<div class="h-card" lang="en">
  rel="author"  
  <span class="p-x-pronoun-nominative">he</span> /
  href="http://benward.me">
  <span class="p-x-pronoun-possessive">him</span> /
  Ben Ward
  <span class="p-x-pronoun-oblique">his</span>
</a>
</div>
</source>
</source>
** I can think of two quite reasonable solutions:
 
*** 1. Declare that class properties take precedence over rel properties of the same name, discarding rel values if a class is also found, or
would parse as
*** 2. Since all properties are now multi-value anyway, the hAtom object could be parsed as:
 
<source lang=javascript>
<source lang=javascript>
{
{
  'type': ['h-entry'],
  "type": ["h-card"],
  'properties': {
  "lang": "en",
    …
  "properties": {
    'author': [
    "x-pronoun-nominative": ["he"],
        {
     "x-pronoun-possessive": ["him"],
          'value': ['Ben Ward'], /* from the p-author     */
    "x-pronoun-oblique": ["his"]
          'type': ['h-card'],    /* from the h-card ...  */
  }
          'properties': {
}
            'name': ['Ben Ward'],  
            'url': ['http://benward.me']
        },
        'http://benward.me'      /* from the rel="author" */
    ],
    …
  }
}
</source>
</source>
** —[[User:BenWard|BenWard]] 08:29, 5 October 2011 (UTC)
 
*** Option 2 makes sense and is consistent with the rest of the multi-value parsing/handling. - [[User:Tantek|Tantek]] 14:56, 5 October 2011 (UTC)
It could also be useful to specify multiple languages within a single h-card (pardon me if I butcher Swedish pronouns)
*** What about without the 'p-author'?
 
<source lang=html4strict>
<source lang=html4strict>
<a href="h-card"  
<div class="h-card">
  rel="author"  
  <span lang="en" class="p-x-pronoun-nominative">he</span> /
  href="http://benward.me">
  <span lang="en" class="p-x-pronoun-possessive">him</span> /
  Ben Ward
  <span lang="en" class="p-x-pronoun-oblique">his</span>
</a>
  <span lang="sv" class="p-x-pronoun-nominative">han</span> /
  <span lang="sv" class="p-x-pronoun-possessive">hans</span> /
  <span lang="sv" class="p-x-pronoun-oblique">honom</span>
</div>
</source>
</source>
Should that be parsed as:
 
which might parse as
 
<source lang=javascript>
<source lang=javascript>
{
{
  'type': ['h-entry'],
  "type": ["h-card"],
  'properties': {
  "properties": {
    …
    "x-pronoun-nominative": [{"lang": "en", "value": "he"}, {"lang": "sv", "value": "han"}],
    'author': [
    "x-pronoun-possessive": [{"lang": "en", "value": "him"}, {"lang": "sv", "value": "hans"}],
        {
    "x-pronoun-oblique": [{"lang": "en", "value": "his"}, {"lang": "sv", "value": "honom"}]
          'type': ['h-card'],   /* from the h-card ...  */
  }
          'properties': {  
}
            'name': ['Ben Ward'],  
</source>
            'url': ['http://benward.me']
 
        },
or alternatively, we could introduce a new microformat h-x-pronoun to wrap a set of pronouns
        'http://benward.me'      /* from the rel="author" */
 
    ],
<source lang=html4strict>
    …
<div class="h-card">
  }
  <div class="p-x-pronoun h-x-pronoun" lang="en">
}
    <span class="p-nominative">he</span> /
    <span class="p-possessive">him</span> /
    <span class="p-oblique">his</span>
  </div>
  <div class="p-x-pronoun h-x-pronoun" lang="sv">
    <span class="p-nominative">han</span> /
    <span class="p-possessive">hans</span> /
    <span class="p-oblique">honom</span>
  </div>
</div>
</source>
</source>
Or
 
 
parsed as
 
<source lang=javascript>
<source lang=javascript>
{
{
  'type': ['h-entry'],
  "type": ["h-card"],
  'properties': {
  "properties": {
    …
    "x-pronoun": [{
    'author': [
      "type": ["h-x-pronoun"],
        {
      "lang": "en",
          'value': 'http://benward.me' /* from the rel="author" */
      "properties": {
          'type': ['h-card'],         /* from the h-card ...  */
        "nominative": ["he"],
          'properties': {  
        "possessive": ["him"],
            'name': ['Ben Ward'],  
        "oblique": ["his"]
            'url': ['http://benward.me']
      }
        },
    }, {
       
      "type": ["h-x-pronoun"],
    ],
      "lang": "sv",
    …
      "properties": {
  }
        "nominative": ["han"],
}
        "possessive": ["hans"],
        "oblique": ["honom"]
      }
    }]
  }
}
</source>
</source>
*** And if the former, then we're presumably saying that the value parsed due to the presence of a rel is always its own value, and does not combine with any other structures. I am fine with this, but I wanted to make sure we are ok with that explicitly. [[User:Tantek|Tantek]] 14:56, 5 October 2011 (UTC)
 
**** +1 I think that since the rel attribute is specifically concerned with the relation to an href attribute, it should not be combined with other structures that are rightly declared uses classes.
 
***** The more I've thought about this and how consuming applications may want to treat rel semantics, the more it seems correct to keep rel semantics distinct from class semantics. Class semantics are quite general/flexible, whereas rel is quite specific, naming something else in terms of a relationship from the current page/microformat's perspective. I think we should consider putting rel values in their own 'rel' collection, separate from the 'properties' collection. E.g. the original rel-author p-author h-card markup example would be parsed into this:
 
<source lang=javascript>
<div class="discussion">
{
Discussion:
  'type': ['h-entry'],
* [[User:Kylewm|Kylewm]] Including the "lang" attribute in h- and e- properties makes a ton of sense to me.
  'properties': {
* [[User:Kylewm|Kylewm]] I like the idea of introducing an h-x-pronoun container that can define all the different pronoun forms for a particular language
    …
* [[User:Zegnat|Martijn]] Turns out that the neat summary of different p-x-pronoun-* per language from the second example is never going to happen. Objective case (here <i>oblique</i>) exists in English and then suddenly doesn’t exist at all in e.g. German.
    'author': [
* [[User:Zegnat|Martijn]] The container is still a viable option because it gives a clear language split. Within the container, completely different case names would be used though. German would get properties for nominative, accusative, genitive, dative, and possessive cases. Every language will require its own documentation for properties, and some like Finnish would require up to 13 properties.
        {
* [[User:Zegnat|Martijn]] I propose an entirely different way of marking up pronouns. See [[h-card-brainstorming]].
          'value': ['Ben Ward'], /* from the p-author    */
* ...
          'type': ['h-card'],   /* from the h-card ...   */
 
          'properties': {
</div>
            'name': ['Ben Ward'],
 
            'url': ['http://benward.me']
== Canonicalization of datetime output ==
        }
Status: resolved, awaiting implementation attempt/experience.
    ],
 
    …
It would be useful to choose a (more) uniform output format for datetimes to make it easier for users of the parser to consume datetimes. Microformats2 parsers already do sophisticated pattern matching to recognize date vs. time vs. datetimes, so converting this to any specific format should not add overhead.
  }
 
  'rel': {
<div class="discussion">
    'author': ['http://benward.me'] /* from the rel="author" */
Specifically:
  }
* Choose either 'T' or space as the date/time separator.
}
** Prefer space as it is more human friendly/readable, which matters even for syntaxes/formats, as humans still develop and debug them. [[User:Tantek|Tantek]] 19:31, 6 January 2015 (UTC)
* Choose either +XXYY or +XX:YY as the timezone specification (and convert 'Z' to +0000).
** Would appreciate some study / input here as to which timezone offset syntax is more human friendly. I lean slightly toward +/-NNNN (without the colon) because in the context of seeing a time, leaving out the colon makes it less likely the offset will be confused for a time. E.g. "07:00-08:00" looks like 7-8am, even if it meant 07:00 in PST. [[User:Tantek|Tantek]] 19:31, 6 January 2015 (UTC)
** Space is fine - consensus [[2015-01-20]] meetup.
* Parsers should ''not'' attempt to make datetimes more exact than specified. They should not add time, seconds, or timezone if omitted in the original. [[User:Kylewm|Kylewm]] 04:02, 14 May 2014 (UTC)
** Agreed. [[User:Tantek|Tantek]] 19:31, 6 January 2015 (UTC)
** or month, day per Tom Morris
** consensus [[2015-01-20]] meetup
 
* Counterpoint: PHP's builtin date parsing does not require strict formatting. And the equivalent functionality for Python is provided by the widely used python-dateutil library. [[User:Kylewm|Kylewm]] 19:02, 14 May 2014 (UTC)
** However we cannot (must not) depend on either PHP or Python's "smart" "fixing" or Postelian "liberal handling", or any other language/framework's for that matter, as they all differ in how "intelligent" they are. [[User:Tantek|Tantek]] 19:31, 6 January 2015 (UTC)
</div>
 
Perhaps just provide a guideline for these based on the above consensus.
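
A minimal Python sketch of what such a guideline could look like in code. It assumes the points discussed above: a space as the date/time separator, a timezone offset written without a colon (which the discussion leans toward but does not settle), 'Z' converted to +0000, and no precision added beyond what the author wrote. The function name <code>canonicalize_datetime</code> and the regular expression are illustrative assumptions, not part of any existing parser.

```python
import re

# Illustrative pattern: a full date, optionally followed by a time and an offset.
_DATETIME = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2})"
    r"(?:[T ](?P<time>\d{2}:\d{2}(?::\d{2})?)"
    r"(?P<tz>Z|[+-]\d{2}:?\d{2})?)?$"
)


def canonicalize_datetime(value):
    """Return value in a uniform format without adding missing precision."""
    match = _DATETIME.match(value.strip())
    if match is None:
        return value  # unrecognized input is passed through untouched
    out = match.group("date")
    if match.group("time"):
        out += " " + match.group("time")  # space rather than 'T'
    tz = match.group("tz")
    if tz:
        # 'Z' becomes +0000; +XX:YY becomes +XXYY
        out += "+0000" if tz == "Z" else tz.replace(":", "")
    return out
```

A date without a time, or a time without seconds or offset, comes back with exactly the precision it went in with.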
 
== Add meta http-equiv to microformats2 parsing model ==
Status: disagreement, awaiting implementation attempt/experience.
 
Similar to document level parsing of <code>rel</code> attributes, it makes sense simultaneously to parse <code>&lt;meta http-equiv&gt;</code> elements, perhaps treating "Status" in a special way (only using first number (sequence of digits) for its "value").
 
Use case: IndieWeb "deleted" indication inline in content for static file services that don't support HTTP return codes.
* http://indiewebcamp.com/deleted#HTML_meta_http-equiv_for_status
 
HTTP Header example:
* <samp>Content-Type: text/html; charset=utf-8</samp>
HTML equivalent:  
* <code>&lt;meta http-equiv="Content-Type" content="text/html; charset=utf-8"&gt;</code>
 
Related:
* https://www.w3.org/International/O-HTTP-charset
 
<div class="discussion">
* Interesting thought. Are you suggesting a top level "http-equivs:" collection similar to "rels:" in the parsed output? Should we consider "metas:" instead or in addition? [[User:Tantek|Tantek]] 19:31, 6 January 2015 (UTC)
* What's the use case for this? Also, http-equiv on its own is useless. http-equiv is only a supplement to the data stored in headers. And headers aren't always there: what happens in the context of someone debugging a page who pastes the source into the textarea of an mf2 parser? Without a compelling use case for including headers (and then over-riding some of them with http-equivs), I'm not sure why an implementor would want to do this. --[[User:TomMorris|Tom Morris]] 00:25, 8 May 2015 (UTC)
</div>
 
E.g. from https://gist.github.com/aaronpk/10297489
 
<source lang="html4strict">
<meta http-equiv="Status" content="410 GONE"/>
</source>
</source>
***** and if a post had multiple authors:
 
<source lang=javascript>
<source lang="javascript">
{
{
  'type': ['h-entry'],
"items": [],
  'properties': {
"rels": {},
    …
"http": {
    'author': [
"status": 410
        {
          'value': ['Ben Ward'], /* from p-author    */
          'type': ['h-card'],    /* from h-card ...  */
          'properties': {  
            'name': ['Ben Ward'],
            'url': ['http://benward.me']
        },
        {
          'value': ['Tantek Çelik'], /* from 2nd p-author    */
          'type': ['h-card'],        /* from 2nd h-card ...  */
          'properties': {
            'name': ['Tantek Çelik'],
            'url': ['http://tantek.com']
        },
    ],
    …
  }
  'rel': {
    'author': [
      'http://benward.me',      /* from rel="author" */
      'http://tantek.com'      /* from 2nd rel="author" */
    ]
  }
  }
  }
}
</source>
</source>
***** This preserves the semantic distinction between rel and properties in general, and leaves it up to a higher-level application to implement any logic around showing "more info" about a rel-author, e.g. by correlating the rel-author URL with the 'url' of an hCard it found in the same entry. However, note that even in the earlier JSON data model, the rel-author value just shows up as another property value, and any higher level application would still have to do some correlation logic. At least with this JSON data model, applications that may be looking for a rel value in particular, or a property value in particular can do so without having one unintentionally pollute the other. [[User:Tantek|Tantek]] 17:33, 6 October 2011 (UTC)


<div class="discussion">
* Maybe make this an optional pass in the parser? - Tom Morris [[2015-01-20]]
* For now, don't bother with metas until someone provides a use-case. Tom Morris
* Agreed on both counts. [[User:Tantek|Tantek]] 06:56, 21 January 2015 (UTC)
</div>
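
For anyone who wants to experiment with this as an optional pass, here is a rough Python sketch using only the standard library. It follows the Status handling described above (only the first sequence of digits is used) and the output shape of the example in this section (a top-level "http" key). The class and function names are hypothetical, and keeping the first occurrence of a repeated http-equiv is an assumption the discussion does not address.

```python
import re
from html.parser import HTMLParser


class HttpEquivCollector(HTMLParser):
    """Collects <meta http-equiv> values; illustrative only."""

    def __init__(self):
        super().__init__()
        self.http = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        name = attr.get("http-equiv")
        content = attr.get("content")
        if not name or content is None:
            return
        key = name.lower()
        if key == "status":
            # "Status" is special: keep only the first run of digits.
            digits = re.search(r"\d+", content)
            if digits:
                self.http.setdefault(key, int(digits.group()))
        else:
            # Assumption: the first occurrence of an http-equiv wins.
            self.http.setdefault(key, content)


def parse_http_equivs(html):
    """Return a parse result containing only the hypothetical "http" collection."""
    collector = HttpEquivCollector()
    collector.feed(html)
    collector.close()
    return {"items": [], "rels": {}, "http": collector.http}
```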
==MIME type==
See [[microformats2-mime-type]]
----


* Presumably we'd apply all the same property scoping rules to rel scoping as well. E.g. a rel hyperlink inside a microformat won't be seen by any containing microformat. - [[User:Tantek|Tantek]]
==Other Interpretation Parsing Notes==
** Correct, it should be parsed in the same scope as all other class properties in the object.
Note: most of these need to be written up as separate [[microformats2-parsing-issues]]
</div>
 
Author: [[User:BenWard|Ben Ward]]


==Other Interpretation/Parsing Notes==
[[microformats-2|Microformats 2]] proposes a new, all-encompassing syntax modification of [[microformats-2-prefixes|prefixes]] that will allow microformats to be parsed from pages by processors without prior knowledge of a vocabulary. The core components of this model are quite simple, and quite simple to implement, but a number of conflicts emerge with the functionality of existing microformats parsers and need to be handled. This page documents a proposed model to separate these concerns clearly, in a way that can be applied to the documentation of generic microformats parsing rules and to the documentation of individual vocabularies.


Collection of other unresolved parsing issues in a generic model:
**** Yes, and two related things here. First, despite my (and others') objection and (past) interoperable post/entry-specific treatment by Technorati and Ice Rocket, Hixie has redefined rel-tag in [[HTML5]] to mean applying to the whole page, not a single post. Second, I've explicitly added 'p-category' to the draft 'h-atom' vocabulary in [[microformats-2]]. [[User:Tantek|Tantek]] 07:12, 5 October 2011 (UTC)
* HTML's <code>time</code> element includes an optional <code>pubdate</code> attribute. Simply: We should parse this as <code>dt-published</code>. --[[User:BenWard|BenWard]] 06:12, 10 October 2011 (UTC)
** *If* there is even some reasonable data on actual use of the "pubdate" attribute (I don't think there is, frankly, especially with the removal of the algorithm to produce Atom from HTML5), then we could consider parsing "pubdate" as backwards compatible option for "dt-published". As a general rule, however, it is bad (demonstrably/experienced) design to depend on additional attributes (c.f. RDFa confusion over "property" vs. "rel"), especially for an instance where no additional attribute is necessary. I would leave this out for now until there is non-trivial (more than just test pages or folks who've written HTML5 books, ahem) use in the wild. When there is such use in the wild, it should be documented on a wiki page. We don't want to encourage more complex (additional attribute) publishing as a result of supporting it. [[User:Tantek|Tantek]] 12:12, 10 October 2011 (UTC)
* [[value-class-pattern]]: In microformats-2, since there are no sub-properties, there will presumably no-longer be a 'value' property in any parsed model. Properties such as 'tel > type' in hCard are, as I recall, deprecated due to underuse anyway, so 'tel > value' becomes redundant. (There's also potentially some clarification around 'price > value' in hListing, whereby value was used in a pattern. So, what does this mean for value class parsing, with regard to value-title patterns and date separation patterns. Are we looking for a 'p-value' and 'p-value-title' classname, but treating them specially (excluding them from regular property parsing.) Or, are we giving them a special prefix (v-text, v-title? That seems confusing, but could be a concept.) I'm fine with p- for both, and just having the parser ignore them since they're special, but need clarification and naming confirmation. --[[User:BenWard|BenWard]] 09:35, 10 October 2011 (UTC)
** A few things:
*** 1. Yes, no more subproperties. 'tel' becomes just 'p-tel'. If there is demand for a structured 'tel' value, then we can use that demand (and research into publishing in practice) to brainstorm and create an 'h-tel' structured telephone number (with perhaps fields like 'type', 'extension', some indication of it being local dialing (an extra 0 in some countries) or international dialing, etc.) Or, we address the different 'tel' types as their own flat properties (again as justified by research), e.g. perhaps 'p-tel-fax', or 'p-tel-mobile'. Something for hcard-2-brainstorming.
*** 2. For prices, e.g. hListing, either we're going to need to encode how to parse monetary amounts including monetary symbols, or consider creating an 'h-price' structured price. Not sure what the right answer is here, again, will need to be informed by analysis of documented actual price publication practices.
*** 3. We should avoid introducing a new prefix 'v-' just for value-class-pattern. As we've noted elsewhere, each new prefix adds complexity and should be avoided without substantial advantage.
*** 4. Using 'p-value-title' is strange, as it would be an exception to 'p-' parsing, since it would get the value from the 'title' attribute whereas 'p-' properties don't normally do that (exception: abbr).
*** 5. Using 'p-value' is also strange, as it wouldn't generate a 'value' property in the JSON data model.
*** 6. Class name 'value-title' is already sufficiently prefixed - we've found or even heard of no collisions in practice.
*** 7. Class name 'value' can, by its simpler naming nature, be expected to potentially collide with other web designer class name usage though we have no documentation/mention thereof. We could consider a renaming, or providing of alternative, such as 'value-string', or 'value-content', etc. However, let's keep that as a backup plan to use only if/when evidence is presented that we need to.
** Conclusions: for now, in microformats-2, keep using 'value' and 'value-title' as defined in the [[value-class-pattern]], and add the additional (obvious) interpretation that [[value-class-pattern#Date_and_time_parsing|value class pattern: date and time parsing]] applies to all 'dt-' properties. - [[User:Tantek|Tantek]] 12:12, 10 October 2011 (UTC)
== incorporated 2015-05-28 ==
The following brainstorms were incorporated 2015-05-28.
== more information for alternates ==
Raised 2015-04-24 by [[User:Kevin Marks|Kevin Marks]]
The existing <code>alternate</code> parsing is omitting <code>title</code> - that should be added.  The <code>text</code> would make sense to add here too.
Use-case: labels for presenting alternates
<div class="discussion">
* +1 Makes sense. [[User:Tantek|Tantek]] 03:41, 25 April 2015 (UTC)
</div>
== more information for rel-based formats ==
Raised 2015-04-18 by [[User:Kevin Marks|Kevin Marks]]
Related github test suite issue: https://github.com/microformats/tests/issues/16
Several rel-based formats have additional information that is useful beyond the link itself, which is all we capture at the moment. As I am trying to update the Universal feedparser to support mf2, I will show examples from the [https://github.com/kevinmarks/feedparser/tree/365623a9470e99246f393a8c1f49a0db567826b8/feedparser/tests/microformats testcases] there.
The main change is to add a <code>rel-urls</code> entry that carries more information about the attributes and text of the URLs pointed to by rels in the document.
A fork of mf2py that implements these changes is at https://github.com/kevinmarks/mf2py
=== rel-tag ===
<code><a rel="tag" href="http://del.icio.us/tag/tech">Technology</a> </code>
currently parses to:
<code>{"rels": {"tag": ["http://del.icio.us/tag/tech"]}, "items": []} </code>
This loses the link text, which is useful as a label.
We add a <code>rel-urls</code> element to the parsed output with this extra data that can be looked up from the rels, which doesn't break backward compatibility and works better with xfn (see below)
<code><pre>
{
    "rels": {
        "tag": [
            "http://del.icio.us/tag/tech"
        ]
    },
    "items": [],
    "rel-urls": {
        "http://del.icio.us/tag/tech": {
            "rels": [
                "tag"
            ],
            "text": "Technology"
        }
    }
}
</pre></code>
=== xfn ===
<code><a rel="coworker" href="http://example.com/johndoe">John Doe</a></code>
currently parses to:
<code>{"rels": {"coworker": ["http://example.com/johndoe"]}, "items": []}</code>
This loses the link text, which is the person's name. Suggested output using the <code>rel-urls</code> object:
<code><pre>
{
    "rels": {
        "coworker": [
            "http://example.com/johndoe"
        ]
    },
    "items": [],
    "rel-urls": {
        "http://example.com/johndoe": {
            "rels": [
                "coworker"
            ],
            "text": "John Doe"
        }
    }
}
</pre></code>
with multiple xfn values
<code><a rel="coworker friend" href="http://example.com/johndoe">John Doe</a></code>
we get this:
<code><pre>
{
    "rels": {
        "coworker": [
            "http://example.com/johndoe"
        ],
        "friend": [
            "http://example.com/johndoe"
        ]
    },
    "items": [],
    "rel-urls": {
        "http://example.com/johndoe": {
            "rels": [
                "coworker",
                "friend"
            ],
            "text": "John Doe"
        }
    }
}
</pre></code>
=== rel-enclosure ===
<code><a rel="enclosure" href="http://example.com/movie.mp4" type="video/mpeg" title="real title">my movie</a></code>
currently parses to:
<code>{"rels": {"enclosure": ["http://example.com/movie.mp4"]}, "items": []}</code>
This loses the link text, which serves as the title, and the attributes, which give the type. Suggested output:
<code><pre>
{
    "rels": {
        "enclosure": [
            "http://example.com/movie.mp4"
        ]
    },
    "items": [],
    "rel-urls": {
        "http://example.com/movie.mp4": {
            "rels": [
                "enclosure"
            ],
            "text": "my movie",
            "type": "video/mpeg",
            "title": "real title"
        }
    }
}
</pre></code>
This generalises to other rels too, such as [[rel-feed]] and [[rel-alternate]], which have type, lang, and similar attributes.
(updated to include changes from feedback below) [[User:Kevin Marks|Kevin Marks]] 22:13, 26 April 2015 (UTC)
=== attributes parsed ===
Attributes currently parsed are:
* <code>hreflang</code> for alternate and enclosure
* <code>media</code> for alternate and enclosure
* <code>title</code> for alternate and enclosure
* <code>type</code> for alternate and enclosure
Attributes we may consider parsing if we have a use case are
* <code>sizes</code> for icon - need use-case documentation
* <code>coords</code> for area - possibly for people tagging - no examples yet, and unnecessary as people-tagging requires using h-* microformats
* <code>shape</code> for area - possibly for people tagging - no examples yet, and unnecessary as people-tagging requires using h-* microformats
In addition there is a special attribute <s><code>name</code> </s><code>text</code> which is the text contents of the link, which is useful in rel-tag, rel-enclosure, and xfn, and in alternate when used for feeds. It also clarifies rel-me links.
Tantek [http://logs.glob.uno/?c=freenode%23microformats&s=today#c79057 suggests] we use <code>textContent</code> for this instead, and make it a single string, not a list as <code>name</code> is elsewhere in mf2 parsing.
* Update: "text" is good enough, and "textContent" is ugly camelCase. [[User:Tantek|Tantek]] 04:39, 29 May 2015 (UTC)
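
To make the proposed output concrete, here is a rough Python sketch that builds <code>rels</code> and <code>rel-urls</code> from markup using only the standard library. It is not the mf2py implementation: the class and function names are hypothetical, URLs are assumed to be absolute already (no resolution against a base URL), the attributes listed above are copied for every rel value rather than only alternate and enclosure, and the first occurrence of a URL supplies its text and attributes.

```python
from html.parser import HTMLParser

# Attributes listed above as currently parsed.
COPIED_ATTRS = ("hreflang", "media", "title", "type")


class RelCollector(HTMLParser):
    """Collects rel values and per-URL details; illustrative only."""

    def __init__(self):
        super().__init__()
        self.rels = {}
        self.rel_urls = {}
        self._open = None  # rel-urls entry of the <a> whose text is being read

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag not in ("a", "link", "area"):
            return
        if not attr.get("rel") or not attr.get("href"):
            return
        url = attr["href"]
        entry = self.rel_urls.setdefault(url, {"rels": []})
        for rel in attr["rel"].split():
            urls = self.rels.setdefault(rel, [])
            if url not in urls:
                urls.append(url)
            if rel not in entry["rels"]:
                entry["rels"].append(rel)
        for name in COPIED_ATTRS:
            if attr.get(name) and name not in entry:
                entry[name] = attr[name]
        if tag == "a" and "text" not in entry:
            entry["text"] = ""  # a single string, not a list
            self._open = entry

    def handle_data(self, data):
        if self._open is not None:
            self._open["text"] += data

    def handle_endtag(self, tag):
        if tag == "a":
            self._open = None


def parse_rels(html):
    """Return the rels-related part of a parse result for the given markup."""
    collector = RelCollector()
    collector.feed(html)
    collector.close()
    return {"rels": collector.rels, "items": [], "rel-urls": collector.rel_urls}
```

Because details are keyed by URL, a link carrying several rel values (the xfn case) is stored once, and repeated links to the same URL are deduplicated.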
=== feedback on more rel info ===
<div class="discussion">
# "name" is bad because it misleadingly conflates with use of "name" elsewhere in microformats2.
#* Suggested alternative: [https://developer.mozilla.org/en-US/docs/Web/API/Node/textContent textContent] - since that's literally what is being returned there. [[User:Tantek|Tantek]] 02:35, 25 April 2015 (UTC)
#** as all other mf2 keys are lowercase-with-hyphens, [http://logs.glob.uno/?c=freenode%23microformats&s=today#c79101 Tantek suggests] 'text' as that isn't going to be an html [[User:Kevin Marks|Kevin Marks]] 07:28, 25 April 2015 (UTC)
# no need for array for "name"/textContent - since there is always only one at most
#* E.g. should be "textContent": "my movie" [[User:Tantek|Tantek]] 02:35, 25 April 2015 (UTC)
#* Update: "text": "my movie" [[User:Tantek|Tantek]] 04:39, 29 May 2015 (UTC)
# "urls" key is misleading - implies all URLs in the document, which is neither true, nor desired (takes much more parsing time and work and code)
#* Suggested alternative: "rel-urls". And open to better alternatives too. [[User:Tantek|Tantek]] 02:35, 25 April 2015 (UTC)
#** If we are trying to extend the number of properties returned from a rel without breaking the old structure, why don't we call the new structure something like "rels-extended"? [[User:GlennJones|Glenn Jones]] 12:29, 1 June 2015 (UTC)
#*** Extension is not the point; rather, the two structures are meant to be used in a complementary way. One structure is for look-up of any rel value, hence "rels", which returns you a list of URLs. Then you can look up those URLs in the new mapping, by URL, hence it is called "rel-urls" - the point is to use them in conjunction, and that's why rel-urls is named what it is. [[User:Tantek|Tantek]] 22:03, 1 June 2015 (UTC)
# Why is the structure of "rel-urls" different from the "alternates" structure? Should the "url" not just be added as a property rather than as a key? Creating two data structures for one type of object seems inconsistent. It adds cognitive load for anyone trying to understand the JSON structure. [[User:GlennJones|Glenn Jones]] 12:29, 1 June 2015 (UTC)
#* I was trying to avoid breaking the existing <code>rels</code> structure and use of it - I did implement a variant that put the structure inside rels, and it became cumbersome and repetitive where there were multiple rels on a url (xfn cases). Denormalising as properties of the URL made more sense. It also dedupes if there is repetitive linking to the same URL, eg a series of posts with rel-author on each. [[User:Kevin Marks|Kevin Marks]] 20:05, 1 June 2015 (UTC)
# If the rel is a "tag" then the main value we need to return should be the last path component of the URL, not the link text? Should we add another output property ie "tag" [[User:GlennJones|Glenn Jones]] 12:29, 1 June 2015 (UTC)
#* No need to return last path segment of the URL, because the URL is already there - and that's just a library/framework utility function to get the last path segment of a URL. [[User:Tantek|Tantek]] 22:03, 1 June 2015 (UTC)
# As currently described, the URL from <code>alternates</code> is repeated in the <code>rel-urls</code> structure. If we are doing this, surely <code>alternate</code> should be in <code>rels</code> too? I assumed a mapping between them. [[User:Kevin Marks|Kevin Marks]] 20:05, 1 June 2015 (UTC)
## edit showing this variant: http://microformats.org/wiki/index.php?title=microformats2-parsing&oldid=65021#parse_a_hyperlink_element_for_rel_microformats
#* Yes, it makes sense to drop "alternates" assuming the backcompat impact is low, put alternates in "rels" along with everything else, and direct people to use rels and rel-urls for alternates functionality. There is evidence that this is an acceptable, even preferable, approach.[http://indiewebcamp.com/irc/2015-06-01/line/1433195247005] Will add an issue accordingly. [[User:Tantek|Tantek]] 22:03, 1 June 2015 (UTC)
</div>
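
The two-step lookup described in the feedback above (find URLs by rel value in <code>rels</code>, then look each URL up in <code>rel-urls</code>) might look like this in a consuming application. This is only a sketch; the function name <code>labels_for_rel</code> is hypothetical.

```python
def labels_for_rel(parsed, rel):
    """Return (url, text) pairs for every URL that carries the given rel value."""
    pairs = []
    for url in parsed.get("rels", {}).get(rel, []):
        # Details are keyed by URL; a missing entry simply yields empty text.
        details = parsed.get("rel-urls", {}).get(url, {})
        pairs.append((url, details.get("text", "")))
    return pairs
```

A consumer that only needs the URLs can keep reading <code>rels</code> exactly as before.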
== Incorporated 2015-06-06 ==
== Nested h-* objects' "value" property ==
Status: resolved, resolution iterated, implementability proven by one real-world implementation, incorporated
* 2015-06-06 incorporated into [[microformats2-parsing]]
Raised 2015-01-06 by [[User:Kylewm|Kylewm]].
If a child element has a microformat (h-*) and is a property element (p-*, u-*, dt-*, e-*), the parser will add a "value" property to the resulting object. The value should attempt to be a useful representation of the object for consumers that do not have semantic knowledge of the particular h-* type. Ref: [[microformats2-parsing#parse_an_element_for_class_microformats]].
To determine the "value", we parse the property element simply (as if it did not have a h-* class), which works well for simple h-* objects, e.g. <code><a class="u-like-of h-cite" href="...">...</a></code>
<div class="discussion">
* To handle more complex microformats, I propose that "value" for a p-* property element take on the first explicit "name" property of the nested microformat, and for a u-* property, the first explicit "url" property. Parsing will fall back on the current rules if an explicit property does not exist.
** This makes sense to me, and fits with the use-cases and examples I've seen. [[User:Tantek|Tantek]] 19:31, 6 January 2015 (UTC)
** A similar (possibly simpler?) formulation would use the implied name and url rules to determine the "value" for p-* and u-* properties respectively
*** I don't think that's needed, as there are already implied rules on a property that should handle that. I'd start with just the "first explicit" scoping to be more conservative, and then see if we find any use-cases that it (and the existing implied rules) doesn't catch. [[User:Tantek|Tantek]] 19:31, 6 January 2015 (UTC)
**** Agreement at [[2015-01-20]] meetup.
</div>
For example:
<code><pre>
<div class="h-entry">
  <div class="u-in-reply-to h-cite">
    <a class="p-author h-card" href="http://example.com">Example Author</a>
    <a class="p-name u-url" href="http://example.com/post">Example Post</a>
  </div>
</div>
</pre></code>
The nested u-in-reply-to object would parse as
<code><pre>
...
"in-reply-to": [{
  "type": ["h-cite"],
  "properties": {
    "name": ["Example Post"],
    "url": ["http://example.com/post"],
    "author": [{
      "type":["h-card"],
      "properties": {
        "url": ["http://example.com"],
        "name": ["Example Author"]
      },
      "value": "Example Author"
    }]
  },
  "value": "http://example.com/post"
}]
...
</pre></code>
where the outer "value" gets the in-reply-to h-cite's u-url property, and the inner "value" gets the author's p-name property.
<div class="discussion">
* Because there are no implied properties of the dt-* and e-* types, and no obvious defaults, the value rules for these types would not change.
** A possibility for dt-* h-*: The dt-* could take either the first dt-* of the h-*, or (perhaps if no dt-* in the h-*,) the first <code>&lt;time&gt;</code> element inside. [[User:Tantek|Tantek]] 19:31, 6 January 2015 (UTC)
** First dt-* seems reasonable, predictable, and usable. Consensus at [[2015-01-20]] meetup.
** Update 2015-05-29: no known use-cases for first dt-* or first e-*, and implementing that "would require some refactoring" (in mf2py at least per kylewm), thus until there's a use-case for first dt-*/e-* inside, let's treat "dt-* h-*" and "e-* h-*" as before. [[User:Tantek|Tantek]]. In particular:
*** p-* h-* - value from first "name" as proposed above
*** u-* h-* - value from first "url" as proposed above
*** e-* h-* - value is already defined for e-* parsing, nothing special here
*** dt-* h-* - value from normal dt-* parsing - nothing special.
*** +1 totally agree, let's wait for use cases of e-* dt-* [[User:Kylewm|Kylewm]] 19:44, 29 May 2015 (UTC)
</div>
* Implemented in mf2py 2015-06-01 https://github.com/tommorris/mf2py/commit/edc895ef5a780bcee654e6644a688688934517b0
* Added to microformats test suite (experimental) 2015-06-01 https://github.com/microformats/tests/commit/90c8a7d8e96c7160036a298e13f16d9ddaec218e
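
A minimal Python sketch of the value rule agreed above, assuming the nested item has already been parsed into the usual type/properties structure. The function name <code>nested_value</code> and its <code>fallback</code> parameter are hypothetical and not taken from mf2py or any other parser. The sketch also cannot tell explicit properties from implied ones, which a real parser would need to do to honour the "first explicit" scoping.

```python
def nested_value(prefix, item, fallback):
    """Pick the "value" for a nested microformat used as a property.

    prefix   -- prefix of the property class name ("p", "u", "e" or "dt")
    item     -- already-parsed nested item ({"type": [...], "properties": {...}})
    fallback -- whatever normal parsing of that prefix would have produced
    """
    # p-* h-* takes the first "name", u-* h-* takes the first "url";
    # e-* h-* and dt-* h-* keep their normal parsing result.
    key = {"p": "name", "u": "url"}.get(prefix)
    if key is None:
        return fallback
    values = item.get("properties", {}).get(key, [])
    if not values:
        return fallback
    first = values[0]
    # A property value may itself be a nested object; use its "value" then.
    if isinstance(first, dict):
        return first.get("value", fallback)
    return first


cite = {
    "type": ["h-cite"],
    "properties": {
        "name": ["Example Post"],
        "url": ["http://example.com/post"],
    },
}
print(nested_value("u", cite, "fallback"))  # http://example.com/post
```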


== see also ==
* [[microformats-2]]
* [[microformats2]]
* [[microformats-2-brainstorming]]
* [[microformats2-brainstorming]]
* [[microformats-2-prefixes]]
* [[microformats2-prefixes]]
* [[microformats-2-faq]]
* [[microformats2-faq]]
* [[microformats-2-issues]]
* [[microformats2-issues]]
* [[microformats-2-parsing-faq]]
* [[microformats2-parsing-faq]]

Latest revision as of 16:29, 18 July 2020


This page is for brainstorming, discussion, and other questions and explorations about microformats2 parsing.

For the microformats2 parsing algorithm, see:

  • microformats2-parsing

For filing issues / problems with microformats2-parsing, see:

  • https://github.com/microformats/microformats2-parsing/issues
    • Resolved issues before 2016-06-20

Parse img alt

Per https://github.com/microformats/microformats2-parsing/issues/2 currently any u-* property (e.g. u-photo, u-featured) that extracts a 'src' attr from an img tag loses any associated 'alt' text alternative, and if at some point the consuming application wants to display that u-* property as an img, they have to either omit or synthesize a fake text alternative.

It is desirable to somehow maintain that image src and alt association from the original markup, through the parsing process, up until a consuming application wishes to re-present the image with the text alternative.

There are a number of possibilities / approaches here worth brainstorming:

Include alt property in parent object

  1. explicit authoring: require the author to use a new 'p-alt' property on the image to cause parsing and extraction of the text alternative.
    • Problem(s): fails for multiple images, some of which may or may not have alt attrs or corresponding p-alt properties (and fragile, forgetting one p-alt throws off the parallel lists of u-* and p-alt).
  2. implicit p-alt: for every img that is parsed for a u-* property, the parser could generate a p-alt property whose value is the img's alt text.
    • Problem(s): fragile again for similar reasons, not all u-*s may be on img elements, or may not have alt attrs for all imgs in the source.
  3. implicit p-alt only for implied u-photo
    • This is better since there can only be one implied u-photo, and thus if there is a p-alt, it must be associated with the one u-photo
    • Problem(s): does not work for other u-* image properties e.g. u-featured

<div class="h-entry"><img src="http://example.com/photo.jpg" alt="Example" class="u-photo p-alt"></div>

{"type":["h-entry"],"properties":{"photo":["http://example.com/photo.jpg"],"alt":["Example"]}}

Make photo property an object

1. use "h-image" on any u-* on img elements to imply a structure with paired photo and 'name' text alternative, e.g.

<img src="a.jpg" alt="text about a" class="u-featured h-image"/>

which would result in a u-featured property with one value, a structure of an h-image with itself having implied properties of a u-photo of "a.jpg" and a p-name of the "text about a". Similarly the author can use the object tag for the same result:

<object data="a.jpg" class="u-featured h-image">text about a</object>

In either case, the same microformats JSON would be generated, which is correct, as in both cases, there is an image with a fallback text alternative. The specific HTML used should not matter. The semantic of pairing the image with the text alternative is communicated the same way for both.

  • Challenge: requires author use of additional classname "h-image".
  • Benefit: does not require a change to the parsing algorithm
<div class="h-entry">
 <img src="http://example.com/eg.jpg" alt="Photo of an example" class="u-featured h-image">
</div>
{
"type":["h-entry"],
"properties":{
  "featured":[{
    "type":["h-image"],
    "properties":{
      "photo":["http://example.com/eg.jpg"],
      "name":["Photo of an example"]
    }
  }]
}
}

[1]


2. have u-* on an <img> automatically create an object if there is a non-empty 'alt' attribute.
If a u-* property is parsed on an <img> element with a non-empty 'alt' attribute, then:
Create a structure similar to the e-content nested structure that provides the "value" as the URL, and an "alt" as the text alternative.

  • Advantage: no additional microformats markup needed from author
  • Challenge: Many (most?) existing published u-photo properties will now return an object instead of a string, and consuming applications may not be expecting an object for a photo
    • Mitigation: If this is done as an explicit parser library upgrade, consuming applications may decide when to take this parser upgrade and thus fix their u-photo handling to look for string or object before upgrading their microformats2 parsing library instance.
<div class="h-entry">
 <img src="http://example.com/eg.jpg" alt="Photo of an example" class="u-featured">
</div>
{
"type":["h-entry"],
"properties":{
  "featured":[{
    "value":"http://example.com/eg.jpg",
    "alt":"Photo of an example"
  }]
}
}
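
A rough Python sketch of both sides of this proposal, offered as an illustration rather than as any parser's actual behaviour. The function names <code>parse_u_img</code> and <code>url_and_alt</code> are hypothetical, and the src is assumed to have been resolved to an absolute URL already.

```python
def parse_u_img(src, alt):
    """Parser side: result of a u-* property found on an <img>."""
    # Only a non-empty alt produces an object, per the proposal;
    # otherwise the result stays a plain URL string as before.
    if alt:
        return {"value": src, "alt": alt}
    return src


def url_and_alt(prop):
    """Consumer side: accept either the old string or the new object."""
    if isinstance(prop, dict):
        return prop.get("value"), prop.get("alt")
    return prop, None


featured = parse_u_img("http://example.com/eg.jpg", "Photo of an example")
print(url_and_alt(featured))
```

A consumer written this way keeps working with parsers that still return plain strings, which is the mitigation described above.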

... more brainstorming needed

img alt thoughts

Thoughts about img alt brainstorm proposals. Feel free to offer counterpoints with nested items and/or alternative preferences/opinions with (potentially multiple) top level items!

  • Tantek: I am leaning towards "Make photo property an object" brainstorm "2." because it feels more "automatic" and thus provides lower friction to more accessibility. Less (author) work for "alt" information to get passed through to the JSON result, and thus more potentially re-usable by consuming applications that want to preserve or re-emit the pairing of a photo and its fallback text alternative. -- Tantek 00:53, 19 July 2016 (UTC)
  • Aaron: I am leaning towards 2 because it takes less work on the part of publishers as well as consumers. From the publisher POV, if they add the alt attribute, that should be all they need to do, it seems odd to make them do additional work to make that show up in the parsed result. From the consumer side, some implementations will not need changing since when looking for a string value, they already use either the string directly or look for the "value" of the property if it's an object. Making consumers handle a new h- object just to read alt text seems overkill.
    • Additionally, if the alt attribute is an empty string, this should be considered the same as if it were missing, so that the photo value will be the URL string rather than the object in this case as well
  • Kevin: 2 makes sense to me as well, as this is a very specific need. If we want an image object with more substructure as 1 implies, that should be a new object type that follows the process - there is a case for that based on usage of figure/figcaption etc. but caption is not alt, and using name for it implies that it is. Kevin Marks 01:50, 19 July 2016 (UTC)
  • Bear: The thoughts given above for option 2 make the most sense as a library writer and consumer, tying this change to a parser implementation's major version change will (should) give everyone notice and time to adjust

...

  • (unanimity copied to GitHub)

When it looks like thoughts are naturally converging, we should take that emergent convergence back to the github thread for proper back/forth discussion and figuring out of details.

https://github.com/microformats/microformats2-parsing/issues/2

Parse language information

Raised by VoxPelli 18:04, 23 July 2015 (UTC)

  • 2016-060: Update: and parse "id" attribute. Tantek 16:39, 29 February 2016 (UTC) (see Additionally below)
  • 2016-07-13: Update: created GitHub issue for this brainstorm VoxPelli 14:34, 13 July 2016 (UTC)

Currently there’s no way to tell the language of parsed microformats even if those microformats have been marked up with HTML "lang"-attributes.

There are examples in the wild of people marking up pages in such a way:

  • VoxPelli.com has a "lang"-attribute on the h-entry of his Swedish articles to signify that the article is Swedish even though the rest of the site is English.
  • Stephanie uses a WordPress plugin that adds summaries of other languages at the start of her content.
  • Seblog.nl has a lang="nl"-attribute on the <html> of each page, and uses a lang="en" on the p-name, p-summary and e-content of an h-entry if the CMS-field 'lang' is set to "en" (or any language other than "nl"). This signifies that the article is English, but the rest of the page is Dutch (including the textual representation of the date). (example)

Proposal is to add a new "lang" keyword to h-* and e-* objects so that the following example:

<div class="h-entry" lang="sv">
  <h1 class="p-name">En svensk titel</h1>
  <div class="e-content" lang="en">With an <em>english</em> summary</div>
  <div class="e-content">Och <em>svensk</em> huvudtext</div>
</div>

Would be parsed into something like:

{
  "type": ["h-entry"],
  "lang": "sv",
  "properties": {
    "name": ["En svensk titel"],
    "content": [
      {
        "lang": "en",
        "html": "With an <em>english</em> summary",
        "value": "With an english summary"
      },
      {
        "html": "Och <em>svensk</em> huvudtext",
        "value": "Och svensk huvudtext"
      }
    ]
  }
}

This was brainstormed on the IndieWebCamp IRC-channel where the mentioned example came up.

Additionally: consider the same for "id" attributes (use-case: rel=feed local discovery of a nested h-feed on the home page), specifically, parsing the first instance of any "id" attribute (ignoring later duplicate id attribute values on any subsequent elements).

And alternatively: consider parsing as "html-id" and "html-lang" prefixed properties in the parsed result, e.g.

<div class="h-entry" lang="sv" id="postfrag123">
  <h1 class="p-name">En svensk titel</h1>
  <div class="e-content" lang="en">With an <em>english</em> summary</div>
  <div class="e-content">Och <em>svensk</em> huvudtext</div>
</div>

Would be parsed into something like:

{
  "type": ["h-entry"],
  "html-id": "postfrag123",
  "html-lang": "sv",
  "properties": {
    "name": ["En svensk titel"],
    "content": [
      {
        "html-lang": "en",
        "html": "With an <em>english</em> summary",
        "value": "With an english summary"
      },
      {
        "html": "Och <em>svensk</em> huvudtext",
        "value": "Och svensk huvudtext"
      }
    ]
  }
}

Language inheritance

If the "lang" attribute is not specified for a particular element, it is inherited from the nearest ancestor that specifies one (or, failing that, from the HTTP Content-Language header).

HTML5: https://www.w3.org/TR/html5/dom.html#the-lang-and-xml:lang-attributes
HTML4: https://www.w3.org/TR/html4/struct/dirlang.html#h-8.1.2

Proposal: Determine and include the inherited "lang" value on *every* microformat object that directly specifies a lang or that has an ancestor that does, e.g. if <html lang="en">, then every object in the output will have "lang": "en".
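
The inheritance lookup can be sketched in a few lines of Python. This is an illustration only: the function name <code>effective_lang</code> and the way ancestors are passed in (a list of lang attribute values, nearest first) are assumptions, not part of any parser.

```python
def effective_lang(langs_nearest_first, content_language=None):
    """Return the language that applies to an element, or None if unknown.

    langs_nearest_first -- lang attribute values of the element itself and
                           then each ancestor up to <html>; None where the
                           attribute is absent
    content_language    -- value of the HTTP Content-Language header, if any
    """
    for lang in langs_nearest_first:
        if lang is None:
            continue  # attribute absent: keep looking further up
        # In HTML an explicit lang="" means "unknown" and stops the lookup.
        return lang or None
    return content_language


# e-content with lang="en" inside an h-entry with lang="sv"
print(effective_lang(["en", "sv"]))   # en
# e-content without lang inside the same h-entry
print(effective_lang([None, "sv"]))   # sv
```

Under the proposal above, a parser would attach the returned value as "lang" on each object and omit the key when the result is None.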

Pronouns in different languages

Language is also useful context when defining pronouns, discussed a bit here[2].

<div class="h-card" lang="en">
  <span class="p-x-pronoun-nominative">he</span> /
  <span class="p-x-pronoun-possessive">his</span> /
  <span class="p-x-pronoun-oblique">him</span>
</div>

would parse as

{
  "type": ["h-card"],
  "lang": "en",
  "properties": {
    "x-pronoun-nominative": ["he"],
    "x-pronoun-possessive": ["his"],
    "x-pronoun-oblique": ["him"]
  }
}

It could also be useful to specify multiple languages within a single h-card (pardon me if I butcher Swedish pronouns)

<div class="h-card">
  <span lang="en" class="p-x-pronoun-nominative">he</span> /
  <span lang="en" class="p-x-pronoun-possessive">his</span> /
  <span lang="en" class="p-x-pronoun-oblique">him</span>
  <span lang="sv" class="p-x-pronoun-nominative">han</span> /
  <span lang="sv" class="p-x-pronoun-possessive">hans</span> /
  <span lang="sv" class="p-x-pronoun-oblique">honom</span>
</div>

which might parse as

{
  "type": ["h-card"],
  "properties": {
    "x-pronoun-nominative": [{"lang": "en", "value": "he"}, {"lang": "sv", "value": "han"}],
    "x-pronoun-possessive": [{"lang": "en", "value": "his"}, {"lang": "sv", "value": "hans"}],
    "x-pronoun-oblique": [{"lang": "en", "value": "him"}, {"lang": "sv", "value": "honom"}]
  }
}

or alternatively, we could introduce a new microformat h-x-pronoun to wrap a set of pronouns

<div class="h-card">
  <div class="p-x-pronoun h-x-pronoun" lang="en">
    <span class="p-nominative">he</span> /
    <span class="p-possessive">his</span> /
    <span class="p-oblique">him</span>
  </div>
  <div class="p-x-pronoun h-x-pronoun" lang="sv">
    <span class="p-nominative">han</span> /
    <span class="p-possessive">hans</span> /
    <span class="p-oblique">honom</span>
  </div>
</div>


parsed as

{
  "type": ["h-card"],
  "properties": {
    "x-pronoun": [{
      "type": ["h-x-pronoun"],
      "lang": "en",
      "properties": {
        "nominative": ["he"],
        "possessive": ["his"],
        "oblique": ["him"]
      }
    }, {
      "type": ["h-x-pronoun"],
      "lang": "sv",
      "properties": {
        "nominative": ["han"],
        "possessive": ["hans"],
        "oblique": ["honom"]
      }
    }]
  }
}


Discussion:

  • Kylewm Including the "lang" attribute in h- and e- properties makes a ton of sense to me.
  • Kylewm I like the idea of introducing an h-x-pronoun container that can define all the different pronoun forms for a particular language
  • Martijn Turns out that the neat summary of different p-x-pronoun-* per language from the second example is never going to happen. Objective case (here oblique) exists in English and then suddenly doesn’t exist at all in e.g. German.
  • Martijn The container is still a viable option because it gives a clear language split. Within the container, completely different case names would be used though. German would get properties for nominative, accusative, genitive, dative, and possessive cases. Every language will require its own documentation for properties, and some like Finnish would require up to 13 properties.
  • Martijn I propose an entirely different way of marking up pronouns. See h-card-brainstorming.
  • ...

Canonicalization of datetime output

Status: resolved, awaiting implementation attempt/experience.

It would be useful to choose a (more) uniform output format for datetimes to make it easier for users of the parser to consume datetimes. Microformats2 parsers already do sophisticated pattern matching to recognize date vs. time vs. datetimes, so converting this to any specific format should not add overhead.

Specifically:

  • Choose either 'T' or space as the date/time separator.
    • Prefer space as it is more human friendly/readable, which matters even for syntaxes/formats, as humans still develop and debug them. Tantek 19:31, 6 January 2015 (UTC)
  • Choose either +XXYY or +XX:YY as the timezone specification (and convert 'Z' to +0000).
    • Would appreciate some study / input here as to which timezone offset syntax is more human friendly. I lean slightly toward +/-NNNN (without the colon) because in the context of seeing a time, leaving out the colon makes it less likely the offset will be confused for a time. E.g. "07:00-08:00" looks like 7-8am, even if it meant 07:00 in PST. Tantek 19:31, 6 January 2015 (UTC)
    • Space is fine - consensus 2015-01-20 meetup.
  • Parsers should not attempt to make datetimes more exact than specified. They should not add time, seconds, or timezone if omitted in the original. Kylewm 04:02, 14 May 2014 (UTC)
    • Agreed. Tantek 19:31, 6 January 2015 (UTC)
    • or month, day per Tom Morris
    • consensus 2015-01-20 meetup
  • Counterpoint: PHP's builtin date parsing does not require strict formatting. And the equivalent functionality for Python is provided by the widely used python-dateutil library. Kylewm 19:02, 14 May 2014 (UTC)
    • However we cannot (must not) depend on either PHP or Python's "smart" "fixing" or Postelian "liberal handling", or any other language/framework's for that matter, as they all differ in how "intelligent" they are. Tantek 19:31, 6 January 2015 (UTC)

Perhaps just provide a guideline for these based on the above consensus.
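
As a starting point for such a guideline, here is a small Python sketch of the normalisation discussed above. It is an assumption-laden illustration, not any parser's implementation: the name <code>normalize_datetime</code> is hypothetical, only full YYYY-MM-DD dates are recognised, and the colon-less +XXYY offset is the preference voiced above rather than a recorded consensus.

```python
import re

# Recognises YYYY-MM-DD, optionally followed by a time and a timezone.
_DATETIME = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2})"
    r"(?:[T ](?P<time>\d{2}:\d{2}(?::\d{2})?)"
    r"(?P<tz>Z|[+-]\d{2}:?\d{2})?)?$"
)


def normalize_datetime(text):
    """Canonicalise separator and offset without adding any precision."""
    match = _DATETIME.match(text.strip())
    if match is None:
        return text  # unrecognised pattern: pass it through untouched
    out = match.group("date")
    if match.group("time"):
        # Space as the date/time separator; seconds are kept only if present.
        out += " " + match.group("time")
        tz = match.group("tz")
        if tz:
            # 'Z' becomes +0000; other offsets lose their colon.
            out += "+0000" if tz == "Z" else tz.replace(":", "")
    return out


print(normalize_datetime("2015-01-06T19:31:00Z"))     # 2015-01-06 19:31:00+0000
print(normalize_datetime("2015-01-06T07:00-08:00"))   # 2015-01-06 07:00-0800
```

A value without a time or timezone, such as "2015-01-06", comes back unchanged, in line with the point that parsers should not make datetimes more exact than specified.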

Add meta http-equiv to microformats2 parsing model

Status: disagreement, awaiting implementation attempt/experience.

Similar to document level parsing of rel attributes, it makes sense simultaneously to parse <meta http-equiv> elements, perhaps treating "Status" in a special way (only using first number (sequence of digits) for its "value").

Use case: IndieWeb "deleted" indication inline in content for static file services that don't support HTTP return codes.

HTTP Header example:

  • Content-Type: text/html; charset=utf-8

HTML equivalent:

  • <meta http-equiv="Content-Type" content="text/html; charset=utf-8">

Related:

  • Interesting thought. Are you suggesting a top level "http-equivs:" collection similar to "rels:" in the parsed output? Should we consider "metas:" instead or in addition? Tantek 19:31, 6 January 2015 (UTC)
  • What's the use case for this? Also, http-equiv on its own is useless. http-equiv is only a supplement to the data stored in headers. And headers aren't always there: what happens in the context of someone debugging a page who pastes the source into the textarea of an mf2 parser? Without a compelling use case for including headers (and then over-riding some of them with http-equivs), I'm not sure why an implementor would want to do this. —Tom Morris 00:25, 8 May 2015 (UTC)

E.g. from https://gist.github.com/aaronpk/10297489

<meta http-equiv="Status" content="410 GONE"/>
{
 "items": [],
 "rels": {},
 "http": {
 "status": 410
 }
}
  • Maybe make this an optional pass in the parser? - Tom Morris 2015-01-20
  • For now, don't bother with metas until someone provides a use-case. Tom Morris
  • Agreed on both counts. Tantek 06:56, 21 January 2015 (UTC)
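
For anyone who does want to experiment with this as an optional pass, a sketch using Python's standard html.parser follows. The class name <code>HttpEquivCollector</code>, the lowercased keys, and the "first occurrence wins" rule are assumptions made for illustration; only the special handling of "Status" (first sequence of digits) comes from the proposal above.

```python
import re
from html.parser import HTMLParser


class HttpEquivCollector(HTMLParser):
    """Collect <meta http-equiv> values into a dict like the "http" key above."""

    def __init__(self):
        super().__init__()
        self.http = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("http-equiv")
        content = attributes.get("content")
        if not name or content is None:
            return
        key = name.lower()
        if key in self.http:
            return  # assumption: the first occurrence wins
        if key == "status":
            # Only the first sequence of digits is used for "Status".
            digits = re.search(r"\d+", content)
            if digits:
                self.http[key] = int(digits.group())
        else:
            self.http[key] = content


collector = HttpEquivCollector()
collector.feed('<meta http-equiv="Status" content="410 GONE"/>')
print(collector.http)  # {'status': 410}
```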


MIME type

See microformats2-mime-type


Other Interpretation Parsing Notes

Note: most of these need to be written up as separate microformats2-parsing-issues

Author: Ben Ward

Microformats 2 proposes a new, all-encompassing syntax modification of prefixes that will allow microformats to be parsed from pages by processors without prior knowledge of a vocabulary. The core components of this model are quite simple, and simple to implement, but a number of conflicts emerge with the functionality of existing microformats parsers that need to be handled. This page documents a proposed model to separate these concerns clearly in a way that can be applied to the documentation of generic microformats parsing rules, and the documentation of individual vocabularies.

Collection of other unresolved parsing issues in a generic model:

This is good material for documenting as microformats-2-issues, microformats-2-faq, and perhaps some of the more technical details in microformats-2-parsing-faq.

  • The include pattern references other elements from elsewhere in a document. A generic parser needs to track IDs and fill them in after walking the DOM. (also, itemref if adopted.)
  • Will itemref always map to an item property name?
    • No, itemref maps to one or more elements by ids, and their children. Those referenced elements may have property class names themselves, or they may contain elements that do. Tantek
  • hAtom implies author from an hCard in a page that uses an address element. This requires format knowledge, but a generic parser does not currently track the element type of a property node. Should it?
    • It should not. element-specific handling (e.g. using "alt" from img, and "title" from abbr) is completely done at parse time. The JSON data model does not reflect which element type or attribute the value came from. Additionally, hAtom is an example where we created far too many vocabulary-specific rules, in practice they're not necessary, and only complicate the microformat for both publisher understanding and parser implementation. Tantek
  • hAtom defines that the highest level heading within an entry implies entry-title. This particular optimisation might be better off dead.
    • Agreed, this is gone in microformats 2. Tantek
  • hAtom defines that permalinks be parsed from rel attributes, not class
    • In practice this has been one of the more problematic/error prone aspects of hAtom implementations, and it's also inconsistent with other microformats (although hReview tried to use both rel permalinks and "url"). The dependence upon rel-bookmark for permalinks is dropped in h-atom in preference to re-using "u-url" and "u-uid". Tantek
  • XFN is entirely built on rel (although, has various other differences from structural microformats, as do vote-links, so perhaps are excluded from this discussion and will always be handled by dedicated parsers/queries regardless?)
    • The best (easiest and most reliable) use of 'rel' microformats in practice is when they are orthogonal to 'class' microformats. This is true both with XFN and some newer rel values like rel-author. In addition, it was very clear at the recent schema.org workshop's syntax session that RDFa's decision to apparently arbitrarily mix use of 'rel' and 'property' attributes for specifying different types of properties (it wasn't clear to people in the room when you use which for what) has caused a high degree of confusion among publishers and thus high error-rates. Thus if anything we should learn from both the mistakes of RDFa and our own experiences with even very deliberate/specific mixing of rel microformats in class microformats, and keep them defined as separate orthogonal building blocks that work together, but don't depend on each other. Tantek
      • Relatedly to this: rel-tag in hAtom. --BenWard 06:50, 5 October 2011 (UTC)
        • Yes, and two related things here. First, despite my (and others') objection and (past) interoperable post/entry-specific treatment by Technorati and Ice Rocket, Hixie has redefined rel-tag in HTML5 to mean applying to the whole page, not a single post. Second, I've explicitly added 'p-category' to the draft 'h-atom' vocabulary in microformats-2. Tantek 07:12, 5 October 2011 (UTC)
  • HTML's time element includes an optional pubdate attribute. Simply: We should parse this as dt-published. --BenWard 06:12, 10 October 2011 (UTC)
    • *If* there is even some reasonable data on actual use of the "pubdate" attribute (I don't think there is, frankly, especially with the removal of the algorithm to produce Atom from HTML5), then we could consider parsing "pubdate" as backwards compatible option for "dt-published". As a general rule, however, it is bad (demonstrably/experienced) design to depend on additional attributes (c.f. RDFa confusion over "property" vs. "rel"), especially for an instance where no additional attribute is necessary. I would leave this out for now until there is non-trivial (more than just test pages or folks who've written HTML5 books, ahem) use in the wild. When there is such use in the wild, it should be documented on a wiki page. We don't want to encourage more complex (additional attribute) publishing as a result of supporting it. Tantek 12:12, 10 October 2011 (UTC)
  • value-class-pattern: In microformats-2, since there are no sub-properties, there will presumably no longer be a 'value' property in any parsed model. Properties such as 'tel > type' in hCard are, as I recall, deprecated due to underuse anyway, so 'tel > value' becomes redundant. (There's also potentially some clarification needed around 'price > value' in hListing, whereby value was used in a pattern.) So, what does this mean for value class parsing, with regard to value-title patterns and date separation patterns? Are we looking for 'p-value' and 'p-value-title' class names, but treating them specially (excluding them from regular property parsing)? Or are we giving them a special prefix (v-text, v-title)? That seems confusing, but could be a concept. I'm fine with p- for both, and just having the parser ignore them since they're special, but need clarification and naming confirmation. --BenWard 09:35, 10 October 2011 (UTC)
    • A few things:
      • 1. Yes, no more subproperties. 'tel' becomes just 'p-tel'. If there is demand for a structured 'tel' value, then we can use that demand (and research into publishing in practice) to brainstorm and create an 'h-tel' structured telephone number (with perhaps fields like 'type', 'extension', some indication of it being local dialing (an extra 0 in some countries) or international dialing, etc.) Or, we address the different 'tel' types as their own flat properties (again as justified by research), e.g. perhaps 'p-tel-fax', or 'p-tel-mobile'. Something for hcard-2-brainstorming.
      • 2. For prices, e.g. hListing, either we're going to need to encode how to parse monetary amounts including monetary symbols, or consider creating an 'h-price' structured price. Not sure what the right answer is here, again, will need to be informed by analysis of documented actual price publication practices.
      • 3. We should avoid introducing a new prefix 'v-' just for value-class-pattern. As we've noted elsewhere, each new prefix adds complexity and should be avoided without substantial advantage.
      • 4. Using 'p-value-title' is strange, as it would be an exception to 'p-' parsing, since it would get the value from the 'title' attribute whereas 'p-' properties don't normally do that (exception: abbr).
      • 5. Using 'p-value' is also strange, as it wouldn't generate a 'value' property in the JSON data model.
      • 6. Class name 'value-title' is already sufficiently prefixed - we've found or even heard of no collisions in practice.
      • 7. Class name 'value' can, by its simpler naming nature, be expected to potentially collide with other web designer class name usage though we have no documentation/mention thereof. We could consider a renaming, or providing of alternative, such as 'value-string', or 'value-content', etc. However, let's keep that as a backup plan to use only if/when evidence is presented that we need to.
    • Conclusions: for now, in microformats-2, keep using 'value' and 'value-title' as defined in the value-class-pattern, and add the additional (obvious) interpretation that value class pattern: date and time parsing applies to all 'dt-' properties. - Tantek 12:12, 10 October 2011 (UTC)

incorporated 2015-05-28

The following brainstorms were incorporated 2015-05-28.

more information for alternates

Raised 2015-04-24 by Kevin Marks

The existing alternate parsing is omitting title - that should be added. The text would make sense to add here too.

Use-case: labels for presenting alternates

  • +1 Makes sense. Tantek 03:41, 25 April 2015 (UTC)

more information for rel-based formats

Raised 2015-04-18 by Kevin Marks

Related github test suite issue: https://github.com/microformats/tests/issues/16

Several rel-based formats have additional information that is useful beyond the link itself, which is all we capture at the moment. As I am trying to update the Universal feedparser to support mf2, I will show examples from the testcases there.

The main change is to add a rel-urls entry for more information about the attributes and text of the urls pointed to by rel's in the document

A fork of mf2py that implements these changes is at https://github.com/kevinmarks/mf2py

rel-tag

<a rel="tag" href="http://del.icio.us/tag/tech">Technology</a> 

currently parses to:

{"rels": {"tag": ["http://del.icio.us/tag/tech"]}, "items": []} 

This loses the link text, which is useful as a label.

We add a rel-urls element to the parsed output with this extra data that can be looked up from the rels, which doesn't break backward compatibility and works better with xfn (see below)

{
    "rels": {
        "tag": [
            "http://del.icio.us/tag/tech"
        ]
    }, 
    "items": [], 
    "rel-urls": {
        "http://del.icio.us/tag/tech": {
            "rels": [
                "tag"
            ], 
            "text": "Technology"
        }
    }
}

xfn

<a rel="coworker" href="http://example.com/johndoe">John Doe</a>

currently parses to:

{"rels": {"coworker": ["http://example.com/johndoe"]}, "items": []}

This loses the link text, which is the person's name. Suggested output using the urls object:

{
    "rels": {
        "coworker": [
            "http://example.com/johndoe"
        ]
    }, 
    "items": [], 
    "rel-urls": {
        "http://example.com/johndoe": {
            "rels": [
                "coworker"
            ], 
            "text": "John Doe"
        }
    }
}

with multiple xfn values

<a rel="coworker friend" href="http://example.com/johndoe">John Doe</a>

we get this:

{
    "rels": {
        "coworker": [
            "http://example.com/johndoe"
        ], 
        "friend": [
            "http://example.com/johndoe"
        ]
    }, 
    "items": [], 
    "rel-urls": {
        "http://example.com/johndoe": {
            "rels": [
                "coworker", 
                "friend"
            ], 
            "text": "John Doe"
        }
    }
}

rel-enclosure

<a rel="enclosure" href="http://example.com/movie.mp4" type="video/mpeg" title="real title">my movie</a>

currently parses to:

{"rels": {"enclosure": ["http://example.com/movie.mp4"]}, "items": []}

This loses the link text, which is the title and the attributes which give type. Suggested output:

{
    "rels": {
        "enclosure": [
            "http://example.com/movie.mp4"
        ]
    }, 
    "items": [], 
    "rel-urls": {
        "http://example.com/movie.mp4": {
            "rels": [
                "enclosure"
            ], 
            "text": "my movie", 
            "type": "video/mpeg", 
            "title": "real title"
        }
    }
}

This generalises to other rel's too, such as rel-feed and rel-alternate that have type, lang etc attributes.

(updated to include changes from feedback below) Kevin Marks 22:13, 26 April 2015 (UTC)
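
To make the relationship between "rels" and "rel-urls" concrete, here is a sketch built on Python's standard html.parser. It is not the code from the mf2py fork mentioned above: the class name <code>RelCollector</code> is hypothetical, only a elements are handled, href values are assumed to be absolute already, and keeping the first text/attribute seen for a repeated URL is an assumption.

```python
from html.parser import HTMLParser


class RelCollector(HTMLParser):
    """Build the "rels" and "rel-urls" mappings from <a rel> elements."""

    EXTRA_ATTRS = ("hreflang", "media", "title", "type")

    def __init__(self):
        super().__init__()
        self.rels = {}
        self.rel_urls = {}
        self._open_url = None  # href of the <a> currently being read
        self._text = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag != "a" or not attributes.get("rel") or not attributes.get("href"):
            return
        url = attributes["href"]
        entry = self.rel_urls.setdefault(url, {"rels": []})
        for rel in attributes["rel"].split():
            urls = self.rels.setdefault(rel, [])
            if url not in urls:
                urls.append(url)
            if rel not in entry["rels"]:
                entry["rels"].append(rel)
        for name in self.EXTRA_ATTRS:
            if attributes.get(name) and name not in entry:
                entry[name] = attributes[name]
        self._open_url = url
        self._text = []

    def handle_data(self, data):
        if self._open_url is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._open_url is not None:
            # "text" is a single string, not a list.
            self.rel_urls[self._open_url].setdefault("text", "".join(self._text))
            self._open_url = None


collector = RelCollector()
collector.feed('<a rel="coworker friend" href="http://example.com/johndoe">John Doe</a>')
print(collector.rels)
print(collector.rel_urls)
```

Looking up a rel value in <code>rels</code> gives a list of URLs, and each of those URLs can then be looked up in <code>rel_urls</code> for its text and attributes.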

attributes parsed

Attributes currently parsed are:

  • hreflang for alternate and enclosure
  • media for alternate and enclosure
  • title for alternate and enclosure
  • type for alternate and enclosure

Attributes we may consider parsing if we have a use case are

  • sizes for icon - need use-case documentation
  • coords for area - possibly for people tagging - no examples yet, and unnecessary as people-tagging requires using h-* microformats
  • shape for area - possibly for people tagging - no examples yet, and unnecessary as people-tagging requires using h-* microformats

In addition there is a special attribute name, "text", which is the text content of the link. It is useful in rel-tag, rel-enclosure and xfn, and in alternate when used for feeds. It's also clarifying for rel-me links.

Tantek suggests we use textContent for this instead, and make it a single string, not a list as name is elsewhere in mf2 parsing

  • Update: "text" is good enough, and "textContent" is ugly camelCase. Tantek 04:39, 29 May 2015 (UTC)

feedback on more rel info

  1. "name" is bad because it misleadingly conflates with use of "name" elsewhere in microformats2.
    • Suggested alternative: textContent - since that's literally what is being returned there. Tantek 02:35, 25 April 2015 (UTC)
      • as all other mf2 keys are lowercase-with-hyphens, Tantek suggests 'text' as that isn't going to be an html Kevin Marks 07:28, 25 April 2015 (UTC)
  2. no need for array for "name"/textContent - since there is always only one at most
    • E.g. should be "textContent": "my movie" Tantek 02:35, 25 April 2015 (UTC)
    • Update: "text": "my movie" Tantek 04:39, 29 May 2015 (UTC)
  3. "urls" key is misleading - implies all URLs in the document, which is neither true, nor desired (takes much more parsing time and work and code)
    • Suggested alternative: "rel-urls". And open to better alternatives too. Tantek 02:35, 25 April 2015 (UTC)
      • If we are trying to extend the number of properties returned from a rel without breaking the old structure, why don't we call the new structure something like "rels-extended"? Glenn Jones 12:29, 1 June 2015 (UTC)
        • Extension is not the point; rather, the two structures are meant to complement each other. One structure is for look-up of any rel value, hence "rels", which returns you a list of URLs. Then you can look up those URLs in the new mapping, by URL, hence it is called "rel-urls". The point is to use them in conjunction, and that's why rel-urls is named what it is. Tantek 22:03, 1 June 2015 (UTC)
  1. Why is the structure of "rel-urls" different to the "alternates" structure? Should the "url" not just be added as a property rather than as a key? Creating two data structures for one type of object seems inconsistent. It adds cognitive load to anyone trying to understand the JSON structure. Glenn Jones 12:29, 1 June 2015 (UTC)
    • I was trying to avoid breaking the existing rels structure and use of it - I did implement a variant that put the structure inside rels, and it became cumbersome and repetitive where there were multiple rels on a url (xfn cases). Denormalising as properties of the URL made more sense. It also dedupes if there is repetitive linking to the same URL, eg a series of posts with rel-author on each. Kevin Marks 20:05, 1 June 2015 (UTC)
  2. If the rel is a "tag", should the main value we return be the last path component of the URL rather than the link text? Should we add another output property, i.e. "tag"? Glenn Jones 12:29, 1 June 2015 (UTC)
    • No need to return last path segment of the URL, because the URL is already there - and that's just a library/framework utility function to get the last path segment of a URL. Tantek 22:03, 1 June 2015 (UTC)
  3. As currently described, the URL from alternates is repeated in the rel-urls structure. If we are doing this, surely alternate should be in rels too? I assumed a mapping between them. Kevin Marks 20:05, 1 June 2015 (UTC)
    1. edit showing this variant: http://microformats.org/wiki/index.php?title=microformats2-parsing&oldid=65021#parse_a_hyperlink_element_for_rel_microformats
    • Yes, it makes sense to drop "alternates" assuming the backcompat impact is low, put alternates in "rels" along with everything else, and direct people to use rels and rel-urls for alternates functionality. There is evidence this is an acceptable, even preferable, approach.[3] Will add an issue accordingly. Tantek 22:03, 1 June 2015 (UTC)
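
The way "rels" and "rel-urls" are meant to be used together, as discussed in the feedback above, might look like the following Python sketch. The parsed dictionary is hand-written for illustration with made-up URLs, and the helper names links_for_rel and last_path_segment are hypothetical; they are not part of any parser's API.

```python
from urllib.parse import unquote, urlsplit

# Hand-written parser output for illustration only; the URLs are made up.
parsed = {
    "rels": {
        "enclosure": ["http://example.com/movie.mp4"],
        "tag": ["http://example.com/tags/video"],
    },
    "rel-urls": {
        "http://example.com/movie.mp4": {
            "rels": ["enclosure"],
            "text": "my movie",
            "type": "video/mpeg",
            "title": "real title",
        },
        "http://example.com/tags/video": {
            "rels": ["tag"],
            "text": "video",
        },
    },
}


def links_for_rel(parsed, rel):
    """Look up a rel value in "rels", then each URL in "rel-urls"."""
    rel_urls = parsed.get("rel-urls", {})
    urls = parsed.get("rels", {}).get(rel, [])
    return [(url, rel_urls.get(url, {})) for url in urls]


def last_path_segment(url):
    """Consumer-side utility for rel-tag; the parser does not need to return it."""
    path = urlsplit(url).path.rstrip("/")
    return unquote(path.rsplit("/", 1)[-1])


for url, details in links_for_rel(parsed, "enclosure"):
    print(url, details.get("type"), details.get("text"))

tags = [last_path_segment(url) for url, _ in links_for_rel(parsed, "tag")]
```

Keying "rel-urls" by URL means a URL that is linked repeatedly, or with several rel values, has its details stored once.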

Incorporated 2015-06-06

Nested h-* objects' "value" property

Status: resolved, resolution iterated, implementability proven by one real-world implementation, incorporated

Raised 2015-01-06 by User:Kylewm.

If a child element has a microformat (h-*) and is a property element (p-*, u-*, dt-*, e-*), the parser will add a "value" property to the resulting object. The value should attempt to be a useful representation of the object for consumers that do not have semantic knowledge of the particular h-* type. Ref: microformats2-parsing#parse_an_element_for_class_microformats.

To determine the "value", we parse the property element simply (as if it did not have a h-* class), which works well for simple h-* objects, e.g. <a class="u-like-of h-cite" href="...">...</a>

  • To handle more complex microformats, I propose that "value" for a p-* property element take on the first explicit "name" property of the nested microformat, and for a u-* property, the first explicit "url" property. Parsing will fall back on the current rules if an explicit property does not exist.
    • This makes sense to me, and fits with the use-cases and examples I've seen. Tantek 19:31, 6 January 2015 (UTC)
    • A similar (possibly simpler?) formulation would use the implied name and url rules to determine the "value" for p-* and u-* properties respectively
      • I don't think that's needed, as there are already implied rules on a property that should handle that. I'd start with just the "first explicit" scoping to be more conservative, and then see if we find any use-cases that it (together with the existing implied rules) doesn't catch. Tantek 19:31, 6 January 2015 (UTC)

For example:

<div class="h-entry">
  <div class="u-in-reply-to h-cite">
    <a class="p-author h-card" href="http://example.com">Example Author</a>
    <a class="p-name u-url" href="http://example.com/post">Example Post</a>
  </div>
</div>

The nested u-in-reply-to object would parse as

...
"in-reply-to": [{ 
  "type": ["h-cite"],
  "properties": {
    "name": ["Example Post"],
    "url": ["http://example.com/post"],
    "author": [{
      "type":["h-card"],
      "properties": {
        "url": ["http://example.com"], 
        "name": ["Example Author"]
      },
      "value": "Example Author"
    }]
  },
  "value": "http://example.com/post"
}]
...

where the outer "value" gets the in-reply-to h-cite's u-url property, and the inner "value" gets the author's p-name property.

  • Because there are no implied properties of the dt-* and e-* types, and no obvious defaults, the value rules for these types would not change.
    • A possibility for dt-* h-*: The dt-* could take either the first dt-* of the h-*, or (perhaps if no dt-* in the h-*,) the first <time> element inside. Tantek 19:31, 6 January 2015 (UTC)
    • First dt-* seems reasonable, predictable, and usable. Consensus at 2015-01-20 meetup.
    • Update 2015-05-29: no known use-cases for first dt-* or first e-*, and implementing that "would require some refactoring" (in mf2py at least, per kylewm). Thus, until there's a use-case for first dt-*/e-* inside, let's treat "dt-* h-*" and "e-* h-*" as before. Tantek. In particular:
      • p-* h-* - value from first "name" as proposed above
      • u-* h-* - value from first "url" as proposed above
      • e-* h-* - value is already defined for e-* parsing, nothing special here
      • dt-* h-* - value from normal dt-* parsing - nothing special.
      • +1 totally agree, let's wait for use cases of e-* dt-* Kylewm 19:44, 29 May 2015 (UTC)
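
The resolution above could be sketched in Python roughly as follows. The function name nested_value and its parameters are hypothetical. The sketch assumes that the caller passes only the explicit (not implied) properties of the nested microformat, along with the value obtained by parsing the element as a plain property. Falling back when the first value is not a string is a further assumption of this sketch.

```python
def nested_value(prefix, explicit_properties, simple_value):
    """Pick the "value" for a property element that is also an h-* root.

    prefix              -- "p", "u", "dt" or "e"
    explicit_properties -- explicit properties of the nested microformat
    simple_value        -- result of parsing the element as if it had no h-* class
    """
    # p-* h-* takes the first "name"; u-* h-* takes the first "url".
    preferred = {"p": "name", "u": "url"}.get(prefix)
    if preferred:
        values = explicit_properties.get(preferred, [])
        if values and isinstance(values[0], str):
            return values[0]
    # dt-* h-*, e-* h-*, and anything without an explicit property: unchanged.
    return simple_value


# Properties of the h-cite from the example above.
cite = {
    "name": ["Example Post"],
    "url": ["http://example.com/post"],
}
value = nested_value("u", cite, "fallback from plain u-* parsing")
```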

see also