microformats2 parsing brainstorming


Revision as of 23:32, 1 June 2015


This page is for brainstorming, discussion, and other questions and explorations about microformats2 parsing.

For the microformats2 parsing algorithm, see:

For filing issues / problems with microformats2-parsing, see:


Nested h-* objects' "value" property

Status: resolved, awaiting implementation attempt/experience.

Raised 2015-01-06 by User:Kylewm;

If a child element has a microformat (h-*) and is a property element (p-*, u-*, dt-*, e-*), the parser will add a "value" property to the resulting object. The value should attempt to be a useful representation of the object for consumers that do not have semantic knowledge of the particular h-* type. Ref: microformats2-parsing#parse_an_element_for_class_microformats.

To determine the "value", we parse the property element simply (as if it did not have an h-* class), which works well for simple h-* objects, e.g. <a class="u-like-of h-cite" href="...">...</a>.

  • To handle more complex microformats, I propose that "value" for a p-* property element take on the first explicit "name" property of the nested microformat, and for a u-* property, the first explicit "url" property. Parsing will fall back on the current rules if an explicit property does not exist.
    • This makes sense to me, and fits with the use-cases and examples I've seen. Tantek 19:31, 6 January 2015 (UTC)
    • A similar (possibly simpler?) formulation would use the implied name and url rules to determine the "value" for p-* and u-* properties respectively
      • I don't think that's needed, as there are already implied rules on a property that should handle that. I'd start with just the "first explicit" scoping to be more conservative, and then see if we find any use-cases that (and existing implied rules) don't/doesn't catch. Tantek 19:31, 6 January 2015 (UTC)

For example:

<div class="h-entry">
  <div class="u-in-reply-to h-cite">
    <a class="p-author h-card" href="http://example.com">Example Author</a>
    <a class="p-name u-url" href="http://example.com/post">Example Post</a>
  </div>
</div>

The nested u-in-reply-to object would parse as

...
"in-reply-to": [{ 
  "type": ["h-cite"],
  "properties": {
    "name": ["Example Post"],
    "url": ["http://example.com/post"],
    "author": [{
      "type":["h-card"],
      "properties": {
        "url": ["http://example.com"], 
        "name": ["Example Author"]
      },
      "value": "Example Author"
    }]
  },
  "value": "http://example.com/post"
}]
...

where the outer "value" gets the in-reply-to h-cite's u-url property, and the inner "value" gets the author's p-name property.

  • Because there are no implied properties of the dt-* and e-* types, and no obvious defaults, the value rules for these types would not change.
    • A possibility for dt-* h-*: The dt-* could take either the first dt-* of the h-*, or (perhaps if no dt-* in the h-*,) the first <time> element inside. Tantek 19:31, 6 January 2015 (UTC)
    • First dt-* seems reasonable, predictable, and usable. Consensus at 2015-01-20 meetup.
    • Update 2015-05-29: no known use-cases for first dt-* or first e-*, and implementing that "would require some refactoring" (in mf2py at least, per kylewm). Thus, until there's a use-case for first dt-*/e-* inside, let's treat "dt-* h-*" and "e-* h-*" as before. Tantek. In particular:
      • p-* h-* - value from first "name" as proposed above
      • u-* h-* - value from first "url" as proposed above
      • e-* h-* - value is already defined for e-* parsing, nothing special here
      • dt-* h-* - value from normal dt-* parsing - nothing special.
      • +1 totally agree, let's wait for use cases of e-* dt-* Kylewm 19:44, 29 May 2015 (UTC)
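The resolved rules above can be sketched in a few lines of Python. This is only an illustration, not code from any existing parser: the function name nested_value and its arguments are hypothetical, and it assumes the caller passes only the explicitly parsed (not implied) properties of the nested microformat.

```python
def nested_value(prefix, nested_properties, fallback):
    """Pick the "value" for a property element that is also an h-* root.

    prefix: property prefix of the element: "p", "u", "e" or "dt".
    nested_properties: explicit properties of the nested microformat
        (assumption: implied properties have been left out by the caller).
    fallback: result of parsing the element as a plain property.
    """
    # p-* looks at the first explicit "name", u-* at the first explicit "url".
    key = {"p": "name", "u": "url"}.get(prefix)
    if key:
        values = nested_properties.get(key, [])
        # Assumption: only a plain string is usable as "value"; if the first
        # entry is itself a nested object we fall back to the current rules.
        if values and isinstance(values[0], str):
            return values[0]
    # e-* and dt-* keep their normal parsing result, as do p-*/u-* elements
    # whose nested microformat has no explicit name/url.
    return fallback


cite = {"name": ["Example Post"], "url": ["http://example.com/post"]}
outer = nested_value("u", cite, "fallback text")
```

With the h-cite example above, outer is the u-url of the cite, while a p-author h-card without an explicit name would keep whatever the plain p-* parsing produced.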

Canonicalization of datetime output

Status: resolved, awaiting implementation attempt/experience.

It would be useful to choose a (more) uniform output format for datetimes to make it easier for users of the parser to consume datetimes. Microformats2 parsers already do sophisticated pattern matching to recognize date vs. time vs. datetimes, so converting this to any specific format should not add overhead.

Specifically:

  • Choose either 'T' or space as the date/time separator.
    • Prefer space as it is more human friendly/readable, which matters even for syntaxes/formats, as humans still develop and debug them. Tantek 19:31, 6 January 2015 (UTC)
  • Choose either +XXYY or +XX:YY as the timezone specification (and convert 'Z' to +0000).
    • Would appreciate some study / input here as to which timezone offset syntax is more human friendly. I lean slightly toward +/-NNNN (without the colon) because in the context of seeing a time, leaving out the colon makes it less likely the offset will be confused for a time. E.g. "07:00-08:00" looks like 7-8am, even if it meant 07:00 in PST. Tantek 19:31, 6 January 2015 (UTC)
    • Space is fine - consensus 2015-01-20 meetup.
  • Parsers should not attempt to make datetimes more exact than specified. They should not add time, seconds, or timezone if omitted in the original. Kylewm 04:02, 14 May 2014 (UTC)
    • Agreed. Tantek 19:31, 6 January 2015 (UTC)
    • or month, day per Tom Morris
    • consensus 2015-01-20 meetup
  • Counterpoint: PHP's builtin date parsing does not require strict formatting. And the equivalent functionality for Python is provided by the widely used python-dateutil library. Kylewm 19:02, 14 May 2014 (UTC)
    • However we cannot (must not) depend on either PHP or Python's "smart" "fixing" or Postelian "liberal handling", or any other language/framework's for that matter, as they all differ in how "intelligent" they are. Tantek 19:31, 6 January 2015 (UTC)

Perhaps just provide a guideline for these based on the above consensus.
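As a starting point for such a guideline, here is a rough Python sketch of the normalization discussed above. It assumes a space separator (the meetup consensus) and the colon-free +XXYY offset that Tantek leans toward, which was not confirmed as consensus. The name normalize_datetime and the regular expression are illustrative only, and the sketch deliberately covers just YYYY-MM-DD dates with an optional time and offset.

```python
import re

# Date, optional time, optional timezone. Each part is emitted only if it
# was present in the input, so no precision is ever added.
_DATETIME = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2})"
    r"(?:[T ](?P<time>\d{2}:\d{2}(?::\d{2})?)"
    r"(?P<tz>Z|[+-]\d{2}:?\d{2})?)?$"
)


def normalize_datetime(value):
    """Normalize separator and offset syntax without adding precision."""
    match = _DATETIME.match(value.strip())
    if not match:
        # Anything unrecognized is passed through untouched rather than
        # "fixed" by language-specific liberal date parsing.
        return value
    result = match.group("date")
    if match.group("time"):
        result += " " + match.group("time")
    tz = match.group("tz")
    if tz:
        # 'Z' becomes +0000; a colon inside the offset is dropped.
        result += "+0000" if tz == "Z" else tz.replace(":", "")
    return result
```

For example, "2015-01-06T19:31:00Z" becomes "2015-01-06 19:31:00+0000", while a bare "2015-01-06" is returned unchanged.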

Add meta http-equiv to microformats2 parsing model

Status: disagreement, awaiting implementation attempt/experience.

Similar to document level parsing of rel attributes, it makes sense simultaneously to parse <meta http-equiv> elements, perhaps treating "Status" in a special way (only using first number (sequence of digits) for its "value").

Use case: IndieWeb "deleted" indication inline in content for static file services that don't support HTTP return codes.

HTTP Header example:

HTML equivalent:

Related:

  • Interesting thought. Are you suggesting a top level "http-equivs:" collection similar to "rels:" in the parsed output? Should we consider "metas:" instead or in addition? Tantek 19:31, 6 January 2015 (UTC)
  • What's the use case for this? Also, http-equiv on its own is useless. http-equiv is only a supplement to the data stored in headers. And headers aren't always there: what happens in the context of someone debugging a page who pastes the source into the textarea of an mf2 parser? Without a compelling use case for including headers (and then over-riding some of them with http-equivs), I'm not sure why an implementor would want to do this. - Tom Morris 00:25, 8 May 2015 (UTC)

E.g. from https://gist.github.com/aaronpk/10297489

<meta http-equiv="Status" content="410 GONE"/>
{
  "items": [],
  "rels": {},
  "http": {
    "status": 410
  }
}
  • Maybe make this an optional pass in the parser? - Tom Morris 2015-01-20
  • For now, don't bother with metas until someone provides a use-case. Tom Morris
  • Agreed on both counts. Tantek 06:56, 21 January 2015 (UTC)
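If someone does want to experiment with this as an optional pass, the "Status" handling proposed above (use only the first sequence of digits) could look roughly like the following Python sketch. It uses only the standard library html.parser module; the class and function names are made up for illustration, and taking the first matching meta element is an assumption.

```python
import re
from html.parser import HTMLParser


class StatusMetaParser(HTMLParser):
    """Collects the status code from the first <meta http-equiv="Status">."""

    def __init__(self):
        super().__init__()
        self.status = None

    def handle_starttag(self, tag, attrs):
        # Self-closing <meta .../> also ends up here via handle_startendtag.
        if tag != "meta" or self.status is not None:
            return
        attributes = dict(attrs)
        if (attributes.get("http-equiv") or "").lower() != "status":
            return
        # Only the first sequence of digits in the content is used.
        digits = re.search(r"\d+", attributes.get("content") or "")
        if digits:
            self.status = int(digits.group())


def parse_http_equiv_status(html):
    """Return the dict that would sit under a top-level "http" key."""
    parser = StatusMetaParser()
    parser.feed(html)
    parser.close()
    return {"status": parser.status} if parser.status is not None else {}
```

Feeding it the meta element from the gist example yields {"status": 410}; a document without such an element yields an empty dict.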


MIME type

See microformats2-mime-type


Other Interpretation Parsing Notes

Note: most of these need to be written up as separate microformats2-parsing-issues

Author: Ben Ward

Microformats 2 proposes a new, all-encompassing syntax modification of prefixes that will allow microformats to be parsed from pages by processors without prior knowledge of a vocabulary. The core components of this model are quite simple and easy to implement, but a number of conflicts emerge with the functionality of existing microformats parsers and need to be handled. This page documents a proposed model to separate these concerns clearly in a way that can be applied to the documentation of generic microformats parsing rules, and the documentation of individual vocabularies.

Collection of other unresolved parsing issues in a generic model:

This is good material for documenting as microformats-2-issues, microformats-2-faq, and perhaps some of the more technical details in microformats-2-parsing-faq.

incorporated 2015-05-28

The following brainstorms were incorporated 2015-05-28.

more information for alternates

Raised 2015-04-24 by Kevin Marks

The existing alternate parsing is omitting title - that should be added. The text would make sense to add here too.

Use-case: labels for presenting alternates

  • +1 Makes sense. Tantek 03:41, 25 April 2015 (UTC)

more information for rel-based formats

Raised 2015-04-18 by Kevin Marks

Related github test suite issue: https://github.com/microformats/tests/issues/16

Several rel-based formats carry additional information that is useful beyond the link itself, which is all we capture at the moment. As I am trying to update the Universal feedparser to support mf2, I will show examples from the test cases there.

The main change is to add a rel-urls entry with more information about the attributes and text of the URLs pointed to by rels in the document.

A fork of mf2py that implements these changes is at https://github.com/kevinmarks/mf2py

rel-tag

<a rel="tag" href="http://del.icio.us/tag/tech">Technology</a> 

currently parses to:

{"rels": {"tag": ["http://del.icio.us/tag/tech"]}, "items": []} 

This loses the link text, which is useful as a label.

We add a rel-urls element to the parsed output with this extra data that can be looked up from the rels, which doesn't break backward compatibility and works better with xfn (see below)

{
    "rels": {
        "tag": [
            "http://del.icio.us/tag/tech"
        ]
    }, 
    "items": [], 
    "rel-urls": {
        "http://del.icio.us/tag/tech": {
            "rels": [
                "tag"
            ], 
            "text": "Technology"
        }
    }
}
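From the consumer side, the two structures are meant to be used together: look up a rel value in rels, then look up each resulting URL in rel-urls. A small Python sketch of that lookup follows; the helper name labelled_links is hypothetical, and returning an empty label when no text is available is an assumption.

```python
def labelled_links(parsed, rel):
    """Return (url, text) pairs for one rel value of a parsed document."""
    rel_urls = parsed.get("rel-urls", {})
    return [
        # Older output without "rel-urls" still works: the label is empty.
        (url, rel_urls.get(url, {}).get("text", ""))
        for url in parsed.get("rels", {}).get(rel, [])
    ]


parsed = {
    "rels": {"tag": ["http://del.icio.us/tag/tech"]},
    "items": [],
    "rel-urls": {
        "http://del.icio.us/tag/tech": {"rels": ["tag"], "text": "Technology"}
    },
}
tags = labelled_links(parsed, "tag")
# tags == [("http://del.icio.us/tag/tech", "Technology")]
```

Because rels itself is unchanged, consumers that only read rels keep working as before.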

xfn

<a rel="coworker" href="http://example.com/johndoe">John Doe</a>

currently parses to:

{"rels": {"coworker": ["http://example.com/johndoe"]}, "items": []}

This loses the link text, which is the person's name. Suggested output using the rel-urls object:

{
    "rels": {
        "coworker": [
            "http://example.com/johndoe"
        ]
    }, 
    "items": [], 
    "rel-urls": {
        "http://example.com/johndoe": {
            "rels": [
                "coworker"
            ], 
            "text": "John Doe"
        }
    }
}

with multiple xfn values

<a rel="coworker friend" href="http://example.com/johndoe">John Doe</a>

we get this:

{
    "rels": {
        "coworker": [
            "http://example.com/johndoe"
        ], 
        "friend": [
            "http://example.com/johndoe"
        ]
    }, 
    "items": [], 
    "rel-urls": {
        "http://example.com/johndoe": {
            "rels": [
                "coworker", 
                "friend"
            ], 
            "text": "John Doe"
        }
    }
}
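To make the shape of this output concrete, here is a minimal Python sketch that builds both rels and rel-urls from markup using the standard library html.parser. It is not the mf2py implementation: the class name, the parse_rels helper, keeping the text of the first link to a URL, and the restriction to a, link and area elements are all assumptions made for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class RelParser(HTMLParser):
    def __init__(self, base_url=""):
        super().__init__()
        self.base_url = base_url
        self.rels = {}
        self.rel_urls = {}
        self._open_entry = None  # rel-urls entry currently collecting text

    def handle_starttag(self, tag, attrs):
        if tag not in ("a", "link", "area"):
            return
        attributes = dict(attrs)
        rel_values = (attributes.get("rel") or "").split()
        href = attributes.get("href")
        if not rel_values or href is None:
            return
        url = urljoin(self.base_url, href)
        # One entry per URL, so several rel values on a link (or several
        # links to the same URL) are merged rather than repeated.
        entry = self.rel_urls.setdefault(url, {"rels": []})
        for rel in rel_values:
            urls = self.rels.setdefault(rel, [])
            if url not in urls:
                urls.append(url)
            if rel not in entry["rels"]:
                entry["rels"].append(rel)
        if tag == "a" and "text" not in entry:
            entry["text"] = ""
            self._open_entry = entry

    def handle_data(self, data):
        if self._open_entry is not None:
            self._open_entry["text"] += data

    def handle_endtag(self, tag):
        if tag == "a" and self._open_entry is not None:
            self._open_entry["text"] = self._open_entry["text"].strip()
            self._open_entry = None


def parse_rels(html, base_url=""):
    parser = RelParser(base_url)
    parser.feed(html)
    parser.close()
    return {"rels": parser.rels, "rel-urls": parser.rel_urls}
```

Running parse_rels on the coworker friend link above gives the same rels and rel-urls as shown, with a single rel-urls entry listing both rel values.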

rel-enclosure

<a rel="enclosure" href="http://example.com/movie.mp4" type="video/mpeg" title="real title">my movie</a>

currently parses to:

{"rels": {"enclosure": ["http://example.com/movie.mp4"]}, "items": []}

This loses the link text, which is the title, and the attributes that give the type. Suggested output:

{
    "rels": {
        "enclosure": [
            "http://example.com/movie.mp4"
        ]
    }, 
    "items": [], 
    "rel-urls": {
        "http://example.com/movie.mp4": {
            "rels": [
                "enclosure"
            ], 
            "text": "my movie", 
            "type": "video/mpeg", 
            "title": "real title"
        }
    }
}

This generalises to other rels too, such as rel-feed and rel-alternate, which have type, lang, etc. attributes.

(updated to include changes from feedback below) Kevin Marks 22:13, 26 April 2015 (UTC)
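A sketch of how one rel-urls entry might be assembled from a link's attributes and text, in Python. The function name rel_url_entry is hypothetical. The tuple of copied attribute names is an assumption: only type and title appear in the enclosure example above, so only those are listed here. Letting the first occurrence of an attribute win when a URL is linked more than once is also an assumption.

```python
# Assumption: only the attributes used in the enclosure example are copied.
COPIED_ATTRIBUTES = ("type", "title")


def rel_url_entry(rel_values, attributes, text=None, entry=None):
    """Create or update the rel-urls entry for one hyperlink element.

    rel_values: list of rel values found on the element.
    attributes: dict of the element's attributes.
    text: text content of the element, if it has any.
    entry: existing entry for the same URL, if the URL was seen before.
    """
    if entry is None:
        entry = {"rels": []}
    for rel in rel_values:
        if rel not in entry["rels"]:
            entry["rels"].append(rel)
    for name in COPIED_ATTRIBUTES:
        value = attributes.get(name)
        # First occurrence wins; later links to the same URL add nothing.
        if value is not None and name not in entry:
            entry[name] = value
    if text is not None and "text" not in entry:
        entry["text"] = text
    return entry


movie = rel_url_entry(
    ["enclosure"],
    {"href": "http://example.com/movie.mp4", "type": "video/mpeg", "title": "real title"},
    text="my movie",
)
```

Here movie holds the same rels, text, type and title as the suggested output for the enclosure link.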

attributes parsed

Attributes currently parsed are:

Attributes we may consider parsing if we have a use case are

In addition there is a special key named text, which holds the text content of the link. This is useful in rel-tag, rel-enclosure and xfn, and in alternate when used for feeds. It also clarifies rel-me links.

Tantek suggests we use textContent for this instead, and make it a single string, not a list as name is elsewhere in mf2 parsing.

feedback on more rel info

  1. "name" is bad because it misleadingly conflates with use of "name" elsewhere in microformats2.
    • Suggested alternative: textContent - since that's literally what is being returned there. Tantek 02:35, 25 April 2015 (UTC)
      • As all other mf2 keys are lowercase-with-hyphens, Tantek suggests 'text', as that isn't going to be an HTML attribute. Kevin Marks 07:28, 25 April 2015 (UTC)
  2. no need for array for "name"/textContent - since there is always only one at most
    • E.g. should be "textContent": "my movie" Tantek 02:35, 25 April 2015 (UTC)
    • Update: "text": "my movie" Tantek 04:39, 29 May 2015 (UTC)
  3. "urls" key is misleading - implies all URLs in the document, which is neither true, nor desired (takes much more parsing time and work and code)
    • Suggested alternative: "rel-urls". And open to better alternatives too. Tantek 02:35, 25 April 2015 (UTC)
      • If we are trying to extend the number of properties returned from a rel without breaking the old structure, why don't we call the new structure something like "rels-extended"? Glenn Jones 12:29, 1 June 2015 (UTC)
        • Extension is not the point; rather, the two structures are meant to be used in a complementary way. One structure is for look-up of any rel value, hence "rels", which returns a list of URLs. You can then look up those URLs in the new mapping, by URL, hence "rel-urls". Using them in conjunction is the point, and that is why rel-urls is named what it is. Tantek 22:03, 1 June 2015 (UTC)
  4. Why is the structure of "rel-urls" different from the "alternates" structure? Should the "url" not just be added as a property rather than as a key? Creating two data structures for one type of object seems inconsistent. It adds cognitive load for anyone trying to understand the JSON structure. Glenn Jones 12:29, 1 June 2015 (UTC)
    • I was trying to avoid breaking the existing rels structure and use of it - I did implement a variant that put the structure inside rels, and it became cumbersome and repetitive where there were multiple rels on a url (xfn cases). Denormalising as properties of the URL made more sense. It also dedupes if there is repetitive linking to the same URL, e.g. a series of posts with rel-author on each. Kevin Marks 20:05, 1 June 2015 (UTC)
  5. If the rel is a "tag" then the main value we need to return should be the last path component of the URL, not the link text? Should we add another output property, i.e. "tag"? Glenn Jones 12:29, 1 June 2015 (UTC)
    • No need to return last path segment of the URL, because the URL is already there - and that's just a library/framework utility function to get the last path segment of a URL. Tantek 22:03, 1 June 2015 (UTC)
  6. As currently described, the URL from alternates is repeated in the rel-urls structure. If we are doing this, surely alternate should be in rels too? I assumed a mapping between them. Kevin Marks 20:05, 1 June 2015 (UTC)
    • edit showing this variant: http://microformats.org/wiki/index.php?title=microformats2-parsing&oldid=65021#parse_a_hyperlink_element_for_rel_microformats
    • Yes it makes sense to drop "alternates" assuming the backcompat impact is low, put alternates in "rels" along with everything else, and direct people to use rels and rel-urls for alternates functionality. Evidence this is an acceptable, even preferable, approach: http://indiewebcamp.com/irc/2015-06-01/line/1433195247005. Will add an issue accordingly. Tantek 22:03, 1 June 2015 (UTC)
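On the rel-tag point in the feedback above: getting the tag from the last path segment really is a small utility on the consumer side. A possible Python version using urllib.parse is sketched below; the function name tag_from_url is made up, and both ignoring a trailing slash and percent-decoding the segment are assumptions rather than anything decided here.

```python
from urllib.parse import unquote, urlsplit


def tag_from_url(url):
    """Return the last non-empty path segment of a rel-tag URL."""
    path = urlsplit(url).path
    # Assumption: empty segments (such as a trailing slash) are ignored.
    segments = [segment for segment in path.split("/") if segment]
    # Assumption: percent-encoding is decoded for display purposes.
    return unquote(segments[-1]) if segments else ""


tag = tag_from_url("http://del.icio.us/tag/tech")
# tag == "tech"
```

Since the URL is already available in both rels and rel-urls, a consumer can apply this to any tag URL without the parser adding a separate "tag" property.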

see also
