textcontent-parsing
This is a draft specification for parsing textContent
based on Martijn van der Ven's algorithm. [1]
Status
This is a draft specification.
Algorithm
Plain text of element
To get the plain text for an Element input:
- Let output be the result of running Element to string on input
- Remove any sequence of one or more consecutive U+0020 SPACE code points directly before and after a U+000A LF code point from output
- Strip leading and trailing ASCII whitespace from output
- Replace any sequence of one or more consecutive U+0020 SPACE code points in output with a single U+0020 SPACE code point
- Return output
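The whitespace clean-up steps above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the function name `clean_plain_text` and the constant `ASCII_WHITESPACE` are hypothetical, and the input is assumed to be the string already produced by Element to string.

```python
import re

# ASCII whitespace as defined by the WHATWG Infra Standard:
# TAB, LF, FF, CR, and SPACE.
ASCII_WHITESPACE = "\t\n\f\r "


def clean_plain_text(output: str) -> str:
    """Apply the clean-up steps to an Element to string result (hypothetical helper)."""
    # Remove runs of U+0020 SPACE directly before and after each U+000A LF.
    output = re.sub(r" *\n *", "\n", output)
    # Strip leading and trailing ASCII whitespace.
    output = output.strip(ASCII_WHITESPACE)
    # Collapse each remaining run of U+0020 SPACE into a single SPACE.
    output = re.sub(r" +", " ", output)
    return output


print(repr(clean_plain_text("  Hello   world  \n   next  line  ")))
# prints 'Hello world\nnext line'
```

Line breaks survive the clean-up; only the spaces around them and repeated spaces elsewhere are removed.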
Element to string
To get the string value for an Element input:
- Let output be an empty list
- Let children be the children of input in tree order
- For each child in children:
  - If child is a Text node:
    - Let value be the textContent of child
    - Replace any U+0009 TAB, U+000A LF, and U+000D CR code points in value with a single U+0020 SPACE code point
    - Append value to output
  - If child is an Element, switch on its tagName:
    - SCRIPT, STYLE
      - Continue
    - IMG
      - If child has an alt attribute, then:
        - Let value be the contents of the alt attribute
        - Strip leading and trailing ASCII whitespace from value
      - Else if child has a src attribute, then:
        - Let value be the contents of the src attribute
        - Strip leading and trailing ASCII whitespace from value
        - Set value to the absolute URL created by resolving value following the containing document’s language’s rules
      - Else continue
      - Append and prepend a single U+0020 SPACE code point to value
      - Append value to output
    - BR
      - Append a string containing a single U+000A LF code point to output
    - P
      - Let value be the result of running this algorithm on child
      - Prepend a single U+000A LF code point to value
      - Append value to output
    - Any other value
      - Let value be the result of running this algorithm on child
      - Append value to output
  - Else continue
- Return the concatenation of output
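The Element to string steps above could be sketched in Python using the standard library's `xml.dom.minidom`. This is an illustration under stated assumptions, not a reference implementation: the input must be well-formed XML, tag names are upper-cased by hand because minidom does not normalise them, SCRIPT and STYLE are treated as contributing nothing, and URL resolution is approximated with `urllib.parse.urljoin` against a hypothetical `base_url` parameter.

```python
from urllib.parse import urljoin
from xml.dom import Node, minidom

# ASCII whitespace as defined by the WHATWG Infra Standard.
ASCII_WHITESPACE = "\t\n\f\r "


def element_to_string(element, base_url: str = "") -> str:
    """Sketch of Element to string; base_url is an assumed parameter."""
    output = []
    for child in element.childNodes:  # children in tree order
        if child.nodeType == Node.TEXT_NODE:
            value = child.data
            # Replace each TAB, LF, and CR code point with a SPACE.
            for code_point in "\t\n\r":
                value = value.replace(code_point, " ")
            output.append(value)
        elif child.nodeType == Node.ELEMENT_NODE:
            tag = child.tagName.upper()
            if tag in ("SCRIPT", "STYLE"):
                continue  # assumed: these elements are skipped
            if tag == "IMG":
                if child.hasAttribute("alt"):
                    value = child.getAttribute("alt").strip(ASCII_WHITESPACE)
                elif child.hasAttribute("src"):
                    value = child.getAttribute("src").strip(ASCII_WHITESPACE)
                    # Stand-in for the document language's URL resolution rules.
                    value = urljoin(base_url, value)
                else:
                    continue
                output.append(" " + value + " ")
            elif tag == "BR":
                output.append("\n")
            elif tag == "P":
                output.append("\n" + element_to_string(child, base_url))
            else:
                output.append(element_to_string(child, base_url))
        # Any other node type (comments, for example) is skipped.
    return "".join(output)


doc = minidom.parseString(
    '<div>Hello<br/>world <img alt=" logo "/><script>x()</script>'
    "<p>para\tone</p></div>"
)
print(repr(element_to_string(doc.documentElement)))
# prints 'Hello\nworld  logo \npara one'
```

The doubled space after "world" in the result is expected at this stage; collapsing it is the job of the plain text clean-up steps, which run on the returned string.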
Implementations
List parsers that have implemented this algorithm. Note any differences as this specification evolves.
- php-mf2 implements the initial version of this algorithm as of v0.4.4
- ...
Brainstorming
Discuss issues and improvements to the algorithm here.
- ...
References
- Element: https://dom.spec.whatwg.org/#interface-element
- Code point: https://infra.spec.whatwg.org/#code-point
- Strip leading and trailing ASCII whitespace: https://infra.spec.whatwg.org/#strip-leading-and-trailing-ascii-whitespace
- Children: https://dom.spec.whatwg.org/#concept-tree-child
- Tree order: https://dom.spec.whatwg.org/#concept-tree-order
- For each: https://infra.spec.whatwg.org/#list-iterate
- Text node: https://dom.spec.whatwg.org/#interface-text
- textContent: https://dom.spec.whatwg.org/#dom-node-textcontent
- Append: https://infra.spec.whatwg.org/#list-append
- tagName: https://dom.spec.whatwg.org/#dom-element-tagname
- Continue: https://infra.spec.whatwg.org/#iteration-continue
- Concatenation: https://infra.spec.whatwg.org/#string-concatenate