microformats2-parsing-brainstorming: Difference between revisions

Revision as of 01:09, 25 April 2015

<entry-title>microformats2 parsing brainstorming</entry-title>

This page is for brainstorming, discussion, and other questions and explorations about microformats2 parsing.

For the microformats2 parsing algorithm, see:

microformats2-parsing

more information for rel-based formats

Raised 2015-04-18 by Kevin Marks

Several rel-based formats carry useful information beyond the link URL itself, which is all we capture at the moment. As I am trying to update the Universal Feed Parser to support mf2-based parsing, I will show examples from its test cases.

rel-tag

<a rel="tag" href="http://del.icio.us/tag/tech">Technology</a>

currently parses to:

{"rels": {"tag": ["http://del.icio.us/tag/tech"]}, "items": []}

This loses the link text, which is useful as a label. Suggested output:

{
"rels": 
  {
  "tag": [{
    "url":"http://del.icio.us/tag/tech",
    "name":"Technology",
    "attrs":{}
    }]
  }, 
"items": []
}

An alternative is to add a urls element to the parsed output that holds this extra data and can be looked up from the rels. This doesn't break backward compatibility and works better with XFN (see below):

{
    "rels": {
        "tag": [
            "http://del.icio.us/tag/tech"
        ]
    }, 
    "items": [], 
    "urls": {
        "http://del.icio.us/tag/tech": {
            "rels": [
                "tag"
            ], 
            "name": [
                "Technology"
            ]
        }
    }
}
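A minimal sketch of how a parser might build the urls lookup alongside the existing rels output, using only Python's standard html.parser module. The names RelUrlCollector and parse_rels are hypothetical, and the sketch does not resolve relative URLs or handle anything other than <a> elements:

```python
from html.parser import HTMLParser


class RelUrlCollector(HTMLParser):
    """Collect rel values and link text for every <a rel=... href=...>."""

    def __init__(self):
        super().__init__()
        self.rels = {}     # rel value -> list of URLs (current output shape)
        self.urls = {}     # URL -> {"rels": [...], "name": [...]} (proposed)
        self._href = None  # href of the <a> currently open, if any
        self._text = []    # text chunks seen inside that <a>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag != "a" or not attrs.get("rel") or not attrs.get("href"):
            return
        self._href = attrs["href"]
        self._text = []
        entry = self.urls.setdefault(self._href, {"rels": [], "name": []})
        for rel in attrs["rel"].split():
            url_list = self.rels.setdefault(rel, [])
            if self._href not in url_list:
                url_list.append(self._href)
            if rel not in entry["rels"]:
                entry["rels"].append(rel)

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            name = "".join(self._text).strip()
            entry = self.urls[self._href]
            if name and name not in entry["name"]:
                entry["name"].append(name)
            self._href = None


def parse_rels(html):
    """Return the parsed output with the extra top-level "urls" object."""
    collector = RelUrlCollector()
    collector.feed(html)
    collector.close()
    return {"rels": collector.rels, "items": [], "urls": collector.urls}


result = parse_rels('<a rel="tag" href="http://del.icio.us/tag/tech">Technology</a>')
```

Because "rels" keeps its current shape, existing consumers are unaffected; a consumer that wants the label looks it up with result["urls"][url]["name"].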

xfn

<a rel="coworker" href="http://example.com/johndoe">John Doe</a>

currently parses to:

{"rels": {"coworker": ["http://example.com/johndoe"]}, "items": []}

This loses the link text, which is the person's name. Suggested output using the urls object:

{
    "rels": {
        "coworker": [
            "http://example.com/johndoe"
        ]
    }, 
    "items": [], 
    "urls": {
        "http://example.com/johndoe": {
            "rels": [
                "coworker"
            ], 
            "name": [
                "John Doe"
            ]
        }
    }
}

rel-enclosure

<a rel="enclosure" href="http://example.com/movie.mp4" type="video/mpeg" title="real title">my movie</a>

currently parses to:

{"rels": {"enclosure": ["http://example.com/movie.mp4"]}, "items": []}

This loses the link text, which is the title, and the attributes, which give the type. Suggested output:

{
"rels":
  {
  "enclosure":[{
    "url":"http://example.com/movie.mp4",
    "name":"my movie",
    "attrs":{
      "type":"video/mpeg",
      "title":"real title"}
    }]
  },
  "items": []
}

I think this generalises to other rels too, such as rel-feed and rel-alternate, which have attributes such as type and lang.
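The richer rels shape suggested for rel-tag and rel-enclosure could be produced along these lines. This is only a sketch: RichRelCollector, parse_rich_rels and the EXTRA_ATTRS list are assumptions made for illustration, since the proposal does not fix which attributes belong in "attrs".

```python
from html.parser import HTMLParser

# Attributes copied into "attrs" when present. This list is an assumption;
# the proposal does not define a definitive set.
EXTRA_ATTRS = ("type", "title", "hreflang", "media")


class RichRelCollector(HTMLParser):
    """Build {"url", "name", "attrs"} entries for each rel on an <a>."""

    def __init__(self):
        super().__init__()
        self.rels = {}
        self._open = None  # entries created for the <a> currently open
        self._text = []    # text chunks seen inside that <a>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag != "a" or not attrs.get("rel") or not attrs.get("href"):
            return
        extra = {k: attrs[k] for k in EXTRA_ATTRS if attrs.get(k) is not None}
        self._open = []
        self._text = []
        for rel in attrs["rel"].split():
            entry = {"url": attrs["href"], "name": "", "attrs": dict(extra)}
            self.rels.setdefault(rel, []).append(entry)
            self._open.append(entry)

    def handle_data(self, data):
        if self._open is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._open is not None:
            # The link text becomes the "name" of every entry for this link.
            name = "".join(self._text).strip()
            for entry in self._open:
                entry["name"] = name
            self._open = None


def parse_rich_rels(html):
    collector = RichRelCollector()
    collector.feed(html)
    collector.close()
    return {"rels": collector.rels, "items": []}


enclosure = parse_rich_rels(
    '<a rel="enclosure" href="http://example.com/movie.mp4" '
    'type="video/mpeg" title="real title">my movie</a>'
)
```

Unlike the urls lookup, this changes the type of each rels entry from a string to an object, so existing consumers of "rels" would need updating.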

Nested h-* objects' "value" property

Raised 2015-01-06 by User:Kylewm

If a child element has a microformat (h-*) and is a property element (p-*, u-*, dt-*, e-*), the parser will add a "value" property to the resulting object. The value should attempt to be a useful representation of the object for consumers that do not have semantic knowledge of the particular h-* type. Ref: microformats2-parsing#parse_an_element_for_class_microformats.

To determine the "value", we parse the property element simply (as if it did not have a h-* class), which works well for simple h-* objects, e.g. <a class="u-like-of h-cite" href="...">...</a>

To handle more complex microformats, I propose that "value" for a p-* property element take on the first explicit "name" property of the nested microformat, and for a u-* property, the first explicit "url" property. Parsing will fall back on the current rules if an explicit property does not exist.
- This makes sense to me, and fits with the use-cases and examples I've seen. Tantek 19:31, 6 January 2015 (UTC)
- A similar (possibly simpler?) formulation would use the implied name and url rules to determine the "value" for p-* and u-* properties respectively
  - I don't think that's needed, as there are already implied rules on a property that should handle that. I'd start with just the "first explicit" scoping to be more conservative, and then see if we find any use-cases that (and existing implied rules) don't/doesn't catch. Tantek 19:31, 6 January 2015 (UTC)
    - Agreement at 2015-01-20 meetup.

For example:

<div class="h-entry">
  <div class="u-in-reply-to h-cite">
    <a class="p-author h-card" href="http://example.com">Example Author</a>
    <a class="p-name u-url" href="http://example.com/post">Example Post</a>
  </div>
</div>

The nested u-in-reply-to object would parse as

...
"in-reply-to": [{ 
  "type": ["h-cite"],
  "properties": {
    "name": ["Example Post"],
    "url": ["http://example.com/post"],
    "author": [{
      "type": ["h-card"],
      "properties": {
        "url": ["http://example.com"],
        "name": ["Example Author"]
      },
      "value": "Example Author"
    }]
  },
  "value": "http://example.com/post"
}]
...

where the outer "value" gets the in-reply-to h-cite's u-url property, and the inner "value" gets the author's p-name property.
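The proposed rule for p-* and u-* can be sketched as a small selection function. The names nested_value and VALUE_SOURCE, and the assumption that the parser tracks explicitly declared properties separately from implied ones, are illustrative only:

```python
# Which explicit property of the nested microformat supplies "value",
# keyed by the prefix of the property class on the same element.
VALUE_SOURCE = {"p": "name", "u": "url"}


def nested_value(prefix, explicit_properties, simple_value):
    """Return the "value" for a nested h-* object.

    prefix: "p", "u", "dt" or "e" (from p-*, u-*, dt-*, e-*).
    explicit_properties: properties the nested item declared explicitly
        (implied ones excluded), mapping name to a list of values.
    simple_value: result of parsing the element as a plain property,
        as if it had no h-* class (the current rule).
    """
    source = VALUE_SOURCE.get(prefix)
    if source and explicit_properties.get(source):
        # The first explicit name (for p-*) or url (for u-*) wins.
        return explicit_properties[source][0]
    # No explicit property, or a dt-*/e-* prefix: keep the current rule.
    return simple_value


# u-in-reply-to h-cite with an explicit p-name and u-url
cite_props = {"name": ["Example Post"], "url": ["http://example.com/post"]}
outer = nested_value("u", cite_props, "simple parse result")

# p-author h-card on a bare <a>: no explicit properties, so fall back
inner = nested_value("p", {}, "Example Author")
```

In the example, the h-card declares no explicit p-name, so its "value" comes from the fallback to the current simple parsing, which yields the same text.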

Because there are no implied properties of the dt-* and e-* types, and no obvious defaults, the value rules for these types would not change.
- A possibility for dt-* h-*: The dt-* could take either the first dt-* of the h-*, or (perhaps if no dt-* in the h-*,) the first <time> element inside. Tantek 19:31, 6 January 2015 (UTC)
- First dt-* seems reasonable, predictable, and usable. Consensus at 2015-01-20 meetup.

Canonicalization of datetime output

It would be useful to choose a (more) uniform output format for datetimes to make it easier for users of the parser to consume datetimes. Microformats2 parsers already do sophisticated pattern matching to recognize date vs. time vs. datetimes, so converting this to any specific format should not add overhead.

Specifically:

Choose either 'T' or space as the date/time separator.
- Prefer space as it is more human friendly/readable, which matters even for syntaxes/formats, as humans still develop and debug them. Tantek 19:31, 6 January 2015 (UTC)
Choose either +XXYY or +XX:YY as the timezone specification (and convert 'Z' to +0000).
- Would appreciate some study / input here as to which timezone offset syntax is more human friendly. I lean slightly toward +/-NNNN (without the colon) because in the context of seeing a time, leaving out the colon makes it less likely the offset will be confused for a time. E.g. "07:00-08:00" looks like 7-8am, even if it meant 07:00 in PST. Tantek 19:31, 6 January 2015 (UTC)
- Space is fine - consensus 2015-01-20 meetup.
Parsers should not attempt to make datetimes more exact than specified. They should not add time, seconds, or timezone if omitted in the original. Kylewm 04:02, 14 May 2014 (UTC)
- Agreed. Tantek 19:31, 6 January 2015 (UTC)
- or month, day per Tom Morris
- consensus 2015-01-20 meetup

Counterpoint: PHP's builtin date parsing does not require strict formatting. And the equivalent functionality for Python is provided by the widely used python-dateutil library. Kylewm 19:02, 14 May 2014 (UTC)
- However we cannot (must not) depend on either PHP or Python's "smart" "fixing" or Postelian "liberal handling", or any other language/framework's for that matter, as they all differ in how "intelligent" they are. Tantek 19:31, 6 January 2015 (UTC)

Perhaps just provide a guideline for these based on the above consensus.
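As a rough illustration of such a guideline, the sketch below rewrites only the separators and never adds precision. The function name canonicalize_datetime is hypothetical, and the choice of +XXYY (no colon) for offsets is an assumption: the discussion above settles on space as the separator but leaves the offset syntax open.

```python
import re

# Full date, optional time, optional timezone. Values that do not match
# (for example a bare year-month) are passed through unchanged.
DATETIME_RE = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2})"
    r"(?:[T ](?P<time>\d{2}:\d{2}(?::\d{2})?)"
    r"(?P<tz>Z|[+-]\d{2}:?\d{2})?)?$"
)


def canonicalize_datetime(value):
    """Normalize separators without making the value more exact."""
    match = DATETIME_RE.match(value.strip())
    if match is None:
        return value  # not recognized: leave as published
    out = match.group("date")
    if match.group("time"):
        # Space as the date/time separator.
        out += " " + match.group("time")
        tz = match.group("tz")
        if tz:
            # 'Z' becomes +0000; +XX:YY becomes +XXYY (assumed choice).
            out += "+0000" if tz == "Z" else tz.replace(":", "")
    return out


examples = [
    canonicalize_datetime("2015-01-20T19:31:00Z"),
    canonicalize_datetime("2015-01-20 19:31-08:00"),
    canonicalize_datetime("2015-01-20"),
]
```

A date without a time stays a date, and a time without seconds or timezone stays that way, in line with the consensus that parsers should not add missing parts.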

Add meta http-equiv to microformats2 parsing model

Similar to document-level parsing of rel attributes, it makes sense to parse <meta http-equiv> elements at the same time, perhaps treating "Status" specially (using only the first number, i.e. the first sequence of digits, as its "value").

Use case: IndieWeb "deleted" indication inline in content for static file services that don't support HTTP return codes.

http://indiewebcamp.com/deleted#HTML_meta_http-equiv_for_status

HTTP Header example:

Content-Type: text/html; charset=utf-8

HTML equivalent:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

https://www.w3.org/International/O-HTTP-charset

Interesting thought. Are you suggesting a top level "http-equivs:" collection similar to "rels:" in the parsed output? Should we consider "metas:" instead or in addition? Tantek 19:31, 6 January 2015 (UTC)

E.g. from https://gist.github.com/aaronpk/10297489

<meta http-equiv="Status" content="410 GONE"/>

{
 "items": [],
 "rels": {},
 "http": {
 "status": 410
 }
}

Maybe make this an optional pass in the parser? - Tom Morris 2015-01-20
For now, don't bother with metas until someone provides a use-case. Tom Morris
Agreed on both counts. Tantek 06:56, 21 January 2015 (UTC)
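If this were implemented as an optional pass, it might look roughly like the following sketch. The class and function names are hypothetical, the lowercased "http" keys follow the gist example above, and keeping only the first occurrence of each header name is an assumption rather than something agreed here.

```python
import re
from html.parser import HTMLParser


class HttpEquivCollector(HTMLParser):
    """Collect <meta http-equiv=... content=...> pairs from a document."""

    def __init__(self):
        super().__init__()
        self.http = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        name = attrs.get("http-equiv")
        content = attrs.get("content")
        if tag != "meta" or not name or content is None:
            return
        key = name.lower()
        if key in self.http:
            return  # assumption: the first occurrence wins
        if key == "status":
            # Use only the first sequence of digits, e.g. "410 GONE" -> 410.
            digits = re.search(r"\d+", content)
            if digits:
                self.http[key] = int(digits.group())
        else:
            self.http[key] = content


def parse_http_equivs(html):
    collector = HttpEquivCollector()
    collector.feed(html)
    collector.close()
    return {"items": [], "rels": {}, "http": collector.http}


deleted = parse_http_equivs('<meta http-equiv="Status" content="410 GONE"/>')
```

Keeping this in a separate function makes it easy for a parser to offer the pass as opt-in.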

Other Interpretation Parsing Notes

Note: most of these need to be written up as separate microformats2-parsing-issues

Author: Ben Ward

Microformats 2 proposes a new, all-encompassing syntax modification of prefixes that will allow microformats to be parsed from pages by processors without prior knowledge of a vocabulary. The core components of this model are quite simple, and quite simple to implement, but a number of conflicts with the functionality of existing microformats parsers emerge and need to be handled. This page documents a proposed model to separate these concerns clearly in a way that can be applied to the documentation of generic microformats parsing rules, and the documentation of individual vocabularies.

Collection of other unresolved parsing issues in a generic model:

This is good material for documenting as microformats-2-issues, microformats-2-faq, and perhaps some of the more technical details in microformats-2-parsing-faq.

The include pattern references other elements from elsewhere in a document. A generic parser needs to track IDs and fill them in after walking the DOM. (also, itemref if adopted.)
- The current thinking per microformats-2-brainstorming is to adopt itemref and drop the include-pattern. Tantek
Will itemref always map to an item property name?
- No, itemref maps to one or more elements by ids, and their children. Those referenced elements may have property class names themselves, or they may contain elements that do. Tantek
hAtom implies author from an hCard in a page that uses an address element. This requires format knowledge, but a generic parser does not currently track the element type of a property node. Should it?
- It should not. element-specific handling (e.g. using "alt" from img, and "title" from abbr) is completely done at parse time. The JSON data model does not reflect which element type or attribute the value came from. Additionally, hAtom is an example where we created far too many vocabulary-specific rules, in practice they're not necessary, and only complicate the microformat for both publisher understanding and parser implementation. Tantek
hAtom defines that the highest level heading within an entry implies entry-title. This particular optimisation might be better off dead.
- Agreed, this is gone in microformats 2. Tantek
hAtom defines that permalinks be parsed from rel attributes, not class
- In practice this has been one of the more problematic/error prone aspects of hAtom implementations, and it's also inconsistent with other microformats (although hReview tried to use both rel permalinks and "url"). The dependence upon rel-bookmark for permalinks is dropped in h-atom in preference to re-using "u-url" and "u-uid". Tantek
XFN is entirely built on rel (although, has various other differences from structural microformats, as do vote-links, so perhaps are excluded from this discussion and will always be handled by dedicated parsers/queries regardless?)
- The best (easiest and most reliable) use of 'rel' microformats in practice is when they are orthogonal to 'class' microformats. This is true both with XFN and some newer rel values like rel-author. In addition, it was very clear at the recent schema.org workshop's syntax session that RDFa's decision to apparently arbitrarily mix use of 'rel' and 'property' attributes for specifying different types of properties (it wasn't clear to people in the room when you use which for what) has caused a high degree of confusion among publishers and thus high error-rates. Thus if anything we should learn from both the mistakes of RDFa and our own experiences with even very deliberate/specific mixing of rel microformats in class microformats, and keep them defined as separate orthogonal building blocks that work together, but don't depend on each other. Tantek
  - Relatedly to this: rel-tag in hAtom. --BenWard 06:50, 5 October 2011 (UTC)
    - Yes, and two related things here. First, despite my (and others') objection and (past) interoperable post/entry-specific treatment by Technorati and Ice Rocket, Hixie has redefined rel-tag in HTML5 to mean applying to the whole page, not a single post. Second, I've explicitly added 'p-category' to the draft 'h-atom' vocabulary in microformats-2. Tantek 07:12, 5 October 2011 (UTC)
HTML's time element includes an optional pubdate attribute. Simply: We should parse this as dt-published. --BenWard 06:12, 10 October 2011 (UTC)
- *If* there is even some reasonable data on actual use of the "pubdate" attribute (I don't think there is, frankly, especially with the removal of the algorithm to produce Atom from HTML5), then we could consider parsing "pubdate" as a backwards-compatible option for "dt-published". As a general rule, however, it is bad (demonstrably/experienced) design to depend on additional attributes (cf. RDFa confusion over "property" vs. "rel"), especially for an instance where no additional attribute is necessary. I would leave this out for now until there is non-trivial (more than just test pages or folks who've written HTML5 books, ahem) use in the wild. When there is such use in the wild, it should be documented on a wiki page. We don't want to encourage more complex (additional attribute) publishing as a result of supporting it. Tantek 12:12, 10 October 2011 (UTC)
value-class-pattern: In microformats-2, since there are no sub-properties, there will presumably no longer be a 'value' property in any parsed model. Properties such as 'tel > type' in hCard are, as I recall, deprecated due to underuse anyway, so 'tel > value' becomes redundant. (There's also potentially some clarification around 'price > value' in hListing, whereby value was used in a pattern.) So, what does this mean for value class parsing, with regard to value-title patterns and date separation patterns? Are we looking for a 'p-value' and 'p-value-title' classname, but treating them specially (excluding them from regular property parsing)? Or are we giving them a special prefix (v-text, v-title)? That seems confusing, but could be a concept. I'm fine with p- for both, and just having the parser ignore them since they're special, but need clarification and naming confirmation. --BenWard 09:35, 10 October 2011 (UTC)
- A few things:
  - 1. Yes, no more subproperties. 'tel' becomes just 'p-tel'. If there is demand for a structured 'tel' value, then we can use that demand (and research into publishing in practice) to brainstorm and create an 'h-tel' structured telephone number (with perhaps fields like 'type', 'extension', some indication of it being local dialing (an extra 0 in some countries) or international dialing, etc.) Or, we address the different 'tel' types as their own flat properties (again as justified by research), e.g. perhaps 'p-tel-fax', or 'p-tel-mobile'. Something for hcard-2-brainstorming.
  - 2. For prices, e.g. hListing, either we're going to need to encode how to parse monetary amounts including monetary symbols, or consider creating an 'h-price' structured price. Not sure what the right answer is here, again, will need to be informed by analysis of documented actual price publication practices.
  - 3. We should avoid introducing a new prefix 'v-' just for value-class-pattern. As we've noted elsewhere, each new prefix adds complexity and should be avoided without substantial advantage.
  - 4. Using 'p-value-title' is strange, as it would be an exception to 'p-' parsing, since it would get the value from the 'title' attribute whereas 'p-' properties don't normally do that (exception: abbr).
  - 5. Using 'p-value' is also strange, as it wouldn't generate a 'value' property in the JSON data model.
  - 6. Class name 'value-title' is already sufficiently prefixed - we've neither found nor even heard of any collisions in practice.
  - 7. Class name 'value' can, by its simpler naming nature, be expected to potentially collide with other web designer class name usage though we have no documentation/mention thereof. We could consider a renaming, or providing of alternative, such as 'value-string', or 'value-content', etc. However, let's keep that as a backup plan to use only if/when evidence is presented that we need to.
- Conclusions: for now, in microformats-2, keep using 'value' and 'value-title' as defined in the value-class-pattern, and add the additional (obvious) interpretation that value class pattern: date and time parsing applies to all 'dt-' properties. - Tantek 12:12, 10 October 2011 (UTC)
