textcontent-parsing: Difference between revisions

Revision as of 20:34, 27 August 2018

This is a draft specification for parsing textContent based on Martijn van der Ven's algorithm. [1]

Status

This is a draft specification.

Algorithm

Plain text of element

To get the plain text for an Element input:

Let output be the result of running Element to string on input
Remove any sequence of one or more consecutive U+0020 SPACE code points directly before and after a U+000A LF code point from output
Strip leading and trailing ASCII whitespace from output
Replace any sequence of one or more consecutive U+0020 SPACE code points in output with a single U+0020 SPACE code point
Return output
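
The whitespace clean-up steps above can be sketched in Python as follows. This is an illustrative sketch only, not a reference implementation: the function name normalize_plain_text is hypothetical, and it takes the string that "Element to string" would return as its input.

```python
import re

# ASCII whitespace as defined by the Infra Standard: TAB, LF, FF, CR, SPACE.
ASCII_WHITESPACE = "\t\n\x0c\r "


def normalize_plain_text(raw: str) -> str:
    """Clean up the string produced by "Element to string" (hypothetical helper)."""
    # Remove runs of spaces directly before and after each LF.
    output = re.sub(r" *\n *", "\n", raw)
    # Strip leading and trailing ASCII whitespace.
    output = output.strip(ASCII_WHITESPACE)
    # Collapse every remaining run of spaces into a single space.
    output = re.sub(r" +", " ", output)
    return output


print(repr(normalize_plain_text("  Hello   world  \n   second   line  ")))
# prints 'Hello world\nsecond line'
```

Removing spaces around line feeds before collapsing matters: doing it in the opposite order would leave a single stray space next to each LF.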

Element to string

To get the string value for an Element input:

Let output be an empty list
Let children be the children of input in tree order
For each child in children:
- If child is a Text node:
  1. Let value be the textContent of child
  2. Replace any U+0009 TAB, U+000A LF, and U+000D CR code points in value with a single U+0020 SPACE code point
  3. Append value to output
- If child is an Element, switch on its tagName:
  - SCRIPT
  - STYLE
    - Continue
  - IMG
    1. If child has an alt attribute, then:
      1. Let value be the contents of the alt attribute
      2. Strip leading and trailing ASCII whitespace from value
    2. Else if child has a src attribute, then:
      1. Let value be the contents of the src attribute
      2. Strip leading and trailing ASCII whitespace from value
      3. Set value to the absolute URL created by resolving value following the containing document’s language’s rules
    3. Else continue
    4. Append and prepend a single U+0020 SPACE code point to value
    5. Append value to output
  - BR
    - Append a string containing a single U+000A LF code point to output
  - P
    1. Let value be the result of running this algorithm on child
    2. Prepend a single U+000A LF code point to value
    3. Append value to output
  - Any other value
    1. Let value be the result of running this algorithm on child
    2. Append value to output
- Else continue
Return the concatenation of output
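
A minimal Python sketch of the traversal above, for illustration only. It makes several assumptions that are not part of this specification: the input is well-formed XML parsed with the standard library's xml.dom.minidom (a real parser would work on an HTML DOM), tag names are upper-cased before comparison because minidom keeps them as written, URL resolution is approximated with urllib.parse.urljoin against a caller-supplied base_url, each TAB, LF, or CR code point is replaced individually, and the P branch is assumed to recurse into child. The name element_to_string and the example.com URL are hypothetical.

```python
from urllib.parse import urljoin
from xml.dom import Node
from xml.dom.minidom import parseString

# ASCII whitespace as defined by the Infra Standard: TAB, LF, FF, CR, SPACE.
ASCII_WHITESPACE = "\t\n\x0c\r "


def element_to_string(element, base_url=""):
    """Return the string value of a minidom element (illustrative sketch)."""
    output = []
    for child in element.childNodes:  # children in tree order
        if child.nodeType == Node.TEXT_NODE:
            value = child.data
            # Replace each TAB, LF, and CR code point with a space.
            for code_point in "\t\n\r":
                value = value.replace(code_point, " ")
            output.append(value)
        elif child.nodeType == Node.ELEMENT_NODE:
            tag = child.tagName.upper()
            if tag in ("SCRIPT", "STYLE"):
                continue
            if tag == "IMG":
                if child.hasAttribute("alt"):
                    value = child.getAttribute("alt").strip(ASCII_WHITESPACE)
                elif child.hasAttribute("src"):
                    src = child.getAttribute("src").strip(ASCII_WHITESPACE)
                    # Assumption: urljoin stands in for the document's URL rules.
                    value = urljoin(base_url, src)
                else:
                    continue
                output.append(" " + value + " ")
            elif tag == "BR":
                output.append("\n")
            elif tag == "P":
                # Assumption: recurse into child, then prepend a LF.
                output.append("\n" + element_to_string(child, base_url))
            else:
                output.append(element_to_string(child, base_url))
        # Any other node type (comments, etc.) is skipped.
    return "".join(output)


doc = parseString(
    '<div>Hello <img alt="big" src="a.png"/>world<br/>next'
    "<p>para</p><script>x()</script></div>"
)
print(repr(element_to_string(doc.documentElement, "https://example.com/")))
# prints 'Hello  big world\nnext\npara'
```

The doubled space after "Hello" is expected at this stage; the "Plain text of element" steps collapse it afterwards.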

Implementations

List parsers that have implemented this algorithm. Note any differences as this specification evolves.

php-mf2 implements the initial version of this algorithm as of v0.4.4
...

Brainstorming

Discuss issues and improvements to the algorithm here.

whitespace in pre elements

This algorithm doesn't currently preserve whitespace in pre elements. There's some agreement in this issue (https://github.com/microformats/microformats2-parsing/issues/15#issuecomment-407707386) that it should be preserved, and mf2py currently does that.
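
To illustrate the effect, the following Python sketch applies the text-node and space-collapsing rules of the current algorithm to the text of a hypothetical pre element. The function name and sample text are made up for this example, and the sketch only reproduces the two rules that matter here, not the full algorithm.

```python
import re


def flatten_like_current_algorithm(text: str) -> str:
    """Show what the current rules do to preformatted text (hypothetical helper)."""
    # Text node rule: each TAB, LF, and CR code point becomes a space.
    flattened = re.sub(r"[\t\n\r]", " ", text)
    # Plain text rules: strip the ends, then collapse runs of spaces.
    return re.sub(r" +", " ", flattened.strip("\t\n\x0c\r "))


pre_text = "def f():\n    return 1"
print(repr(flatten_like_current_algorithm(pre_text)))
# prints 'def f(): return 1'
```

The line break and the indentation are both lost, which is the behaviour this topic proposes to change.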

add a new topic

References

Element: https://dom.spec.whatwg.org/#interface-element
Code point: https://infra.spec.whatwg.org/#code-point
Strip leading and trailing ASCII whitespace: https://infra.spec.whatwg.org/#strip-leading-and-trailing-ascii-whitespace
Children: https://dom.spec.whatwg.org/#concept-tree-child
Tree order: https://dom.spec.whatwg.org/#concept-tree-order
For each: https://infra.spec.whatwg.org/#list-iterate
Text node: https://dom.spec.whatwg.org/#interface-text
textContent: https://dom.spec.whatwg.org/#dom-node-textcontent
Append: https://infra.spec.whatwg.org/#list-append
tagName: https://dom.spec.whatwg.org/#dom-element-tagname
Continue: https://infra.spec.whatwg.org/#iteration-continue
Concatenation: https://infra.spec.whatwg.org/#string-concatenate

@@ Line 61: / Line 61: @@
 === whitespace in pre elements ===
-This algorithm doesn't currently preserve newlines in <code>pre</code> elements. There's some agreement in [https://github.com/microformats/microformats2-parsing/issues/15#issuecomment-407707386 this issue] that it ''should'' be preserved and mf2py currently does that.
+This algorithm doesn't currently preserve whitespace in <code>pre</code> elements. There's some agreement in [https://github.com/microformats/microformats2-parsing/issues/15#issuecomment-407707386 this issue] that it ''should'' be preserved and mf2py currently does that.
 === add a new topic ===
