textcontent-parsing
This is a draft specification for parsing textContent, based on Martijn van der Ven's algorithm. [1]
Status
This is a draft specification.
Algorithm
Plain text of element
To get the plain text for an Element input:
- Let output be the result of running Element to string on input
- Remove any sequence of one or more consecutive U+0020 SPACE code points directly before and after a U+000A LF code point from output
- Strip leading and trailing ASCII whitespace from output
- Replace any sequence of one or more consecutive U+0020 SPACE code points in output with a single U+0020 SPACE code point
- Return output
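The whitespace clean-up steps above can be sketched in Python. This is a minimal illustration, not a reference implementation: the function name normalize_plain_text is hypothetical, and it assumes the caller has already produced the raw string by running Element to string.

```python
import re

# ASCII whitespace as defined by the Infra Standard: TAB, LF, FF, CR, SPACE.
ASCII_WHITESPACE = "\t\n\x0c\r "


def normalize_plain_text(raw: str) -> str:
    """Apply the clean-up steps to the output of Element to string.

    Hypothetical helper: `raw` is assumed to be the already-built string.
    """
    # Remove runs of spaces directly before and after each LF.
    output = re.sub(r" *\n *", "\n", raw)
    # Strip leading and trailing ASCII whitespace.
    output = output.strip(ASCII_WHITESPACE)
    # Collapse each remaining run of spaces into a single space.
    output = re.sub(r" +", " ", output)
    return output
```

Note that line feeds inside the string survive this step, so the breaks introduced for BR and P elements are kept while stray spaces around them are dropped.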
Element to string
To get the string value for an Element input:
- Let output be an empty list
- Let children be the children of input in tree order
- For each child in children:
  - If child is a Text node:
    - Let value be the textContent of child
    - Replace any U+0009 TAB, U+000A LF, and U+000D CR code points in value with a single U+0020 SPACE code point
    - Append value to output
  - If child is an Element, switch on its tagName:
    SCRIPT
    STYLE
      - Continue
    IMG
      - If child has an alt attribute, then:
        - Let value be the contents of the alt attribute
        - Strip leading and trailing ASCII whitespace from value
      - Else if child has a src attribute, then:
        - Let value be the contents of the src attribute
        - Strip leading and trailing ASCII whitespace from value
        - Set value to the absolute URL created by resolving value following the containing document’s language’s rules
      - Else continue
      - Append and prepend a single U+0020 SPACE code point to value
      - Append value to output
    BR
      - Append a string containing a single U+000A LF code point to output
    P
      - Let value be the result of running this algorithm on child
      - Prepend a single U+000A LF code point to value
      - Append value to output
    Any other value
      - Let value be the result of running this algorithm on child
      - Append value to output
  - Else continue
- Return the concatenation of output
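The recursive walk above can be sketched in Python as follows. This is a rough illustration under stated assumptions: the Text and Element classes are hypothetical stand-ins for DOM nodes, the base_url parameter and the use of urllib.parse.urljoin approximate "the containing document's language's rules" for URL resolution, and SCRIPT and STYLE elements are assumed to be skipped.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union
from urllib.parse import urljoin

# ASCII whitespace as defined by the Infra Standard: TAB, LF, FF, CR, SPACE.
ASCII_WHITESPACE = "\t\n\x0c\r "


@dataclass
class Text:
    """Hypothetical stand-in for a DOM Text node."""
    data: str


@dataclass
class Element:
    """Hypothetical stand-in for a DOM Element; tag_name is upper-case."""
    tag_name: str
    attributes: Dict[str, str] = field(default_factory=dict)
    children: List[Union["Text", "Element"]] = field(default_factory=list)


def element_to_string(element: Element, base_url: str = "") -> str:
    output: List[str] = []
    for child in element.children:
        if isinstance(child, Text):
            value = child.data
            # Replace each TAB, LF, and CR with a single SPACE.
            for code_point in "\t\n\r":
                value = value.replace(code_point, " ")
            output.append(value)
        elif isinstance(child, Element):
            tag = child.tag_name
            if tag in ("SCRIPT", "STYLE"):
                continue  # assumption: these elements contribute no text
            if tag == "IMG":
                if "alt" in child.attributes:
                    value = child.attributes["alt"].strip(ASCII_WHITESPACE)
                elif "src" in child.attributes:
                    value = child.attributes["src"].strip(ASCII_WHITESPACE)
                    # Assumption: urljoin stands in for document URL resolution.
                    value = urljoin(base_url, value)
                else:
                    continue
                output.append(" " + value + " ")
            elif tag == "BR":
                output.append("\n")
            elif tag == "P":
                output.append("\n" + element_to_string(child, base_url))
            else:
                output.append(element_to_string(child, base_url))
        # Any other node type (for example a comment) is skipped.
    return "".join(output)
```

A real parser would walk the nodes of whichever HTML library it already uses instead of these stand-in classes; the branching per tagName is the part this sketch is meant to show.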
Implementations
List parsers that have implemented this algorithm. Note any differences as this specification evolves.
- php-mf2 implements the initial version of this algorithm as of v0.4.4
- ...
Brainstorming
Discuss issues and improvements to the algorithm here.
whitespace in pre elements
This algorithm doesn't currently preserve whitespace in pre elements. There's some agreement in this issue that it should be preserved, and mf2py currently does that.
References
- Element: https://dom.spec.whatwg.org/#interface-element
- Code point: https://infra.spec.whatwg.org/#code-point
- Strip leading and trailing ASCII whitespace: https://infra.spec.whatwg.org/#strip-leading-and-trailing-ascii-whitespace
- Children: https://dom.spec.whatwg.org/#concept-tree-child
- Tree order: https://dom.spec.whatwg.org/#concept-tree-order
- For each: https://infra.spec.whatwg.org/#list-iterate
- Text node: https://dom.spec.whatwg.org/#interface-text
- textContent: https://dom.spec.whatwg.org/#dom-node-textcontent
- Append: https://infra.spec.whatwg.org/#list-append
- tagName: https://dom.spec.whatwg.org/#dom-element-tagname
- Continue: https://infra.spec.whatwg.org/#iteration-continue
- Concatenation: https://infra.spec.whatwg.org/#string-concatenate