textcontent-parsing
GRegorLove (New page: This is a draft specification for parsing textContent based on Martijn van der Ven's algorithm. [https://wiki.zegnat.net/media/textparsing.html] ...)
GRegorLove (→Implementations: +php-mf2)
Revision as of 20:21, 27 August 2018
This is a draft specification for parsing textContent based on Martijn van der Ven's algorithm (https://wiki.zegnat.net/media/textparsing.html).
Status
This is a draft specification.
Algorithm
Plain text of element
To get the plain text for an Element input:
- Let output be the result of running Element to string on input
- Remove any sequence of one or more consecutive U+0020 SPACE code points directly before and after a U+000A LF code point from output
- Strip leading and trailing ASCII whitespace from output
- Replace any sequence of one or more consecutive U+0020 SPACE code points in output with a single U+0020 SPACE code point
- Return output
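The whitespace clean-up steps above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the function name normalize_plain_text is hypothetical, and it assumes the caller has already produced the raw string by running Element to string on the input element.

```python
import re

# ASCII whitespace as defined by the WHATWG Infra Standard:
# TAB, LF, FF, CR and SPACE.
ASCII_WHITESPACE = "\t\n\x0c\r "


def normalize_plain_text(raw: str) -> str:
    """Apply the clean-up steps to an 'Element to string' result.

    `raw` is assumed to be the output of Element to string (hypothetical
    helper name; the specification does not name this function).
    """
    # Remove runs of U+0020 SPACE directly before and after each U+000A LF.
    output = re.sub(r" *\n *", "\n", raw)
    # Strip leading and trailing ASCII whitespace.
    output = output.strip(ASCII_WHITESPACE)
    # Collapse every remaining run of spaces into a single space.
    output = re.sub(r" +", " ", output)
    return output
```

For example, passing "  Hello   world \n  next  line  " through this sketch yields "Hello world" and "next line" separated by a single LF.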
Element to string
To get the string value for an Element input:
- Let output be an empty list
- Let children be the children of input in tree order
- For each child in children:
  - If child is a Text node:
    - Let value be the textContent of child
    - Replace any U+0009 TAB, U+000A LF, and U+000D CR code points in value with a single U+0020 SPACE code point
    - Append value to output
  - If child is an Element, switch on its tagName:
    - SCRIPT
    - STYLE
    - IMG
      - If child has an alt attribute, then:
        - Let value be the contents of the alt attribute
        - Strip leading and trailing ASCII whitespace from value
      - Else if child has a src attribute, then:
        - Let value be the contents of the src attribute
        - Strip leading and trailing ASCII whitespace from value
        - Set value to the absolute URL created by resolving value following the containing document’s language’s rules
      - Else continue
      - Append and prepend a single U+0020 SPACE code point to value
      - Append value to output
    - BR
      - Append a string containing a single U+000A LF code point to output
    - P
      - Let value be the result of running this algorithm on child
      - Prepend a single U+000A LF code point to value
      - Append value to output
    - Any other value
      - Let value be the result of running this algorithm on child
      - Append value to output
  - Else continue
- Return the concatenation of output
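The recursive walk above might look like the following Python sketch. It rests on several assumptions that are not part of the specification: the input is well-formed XHTML parsed with the standard library's xml.dom.minidom, tag names are upper-cased before comparison because minidom preserves source case, SCRIPT and STYLE elements are skipped (the text above lists them without stating an action), and urllib.parse.urljoin with a caller-supplied base_url stands in for "the containing document's language's rules" for URL resolution. The function name element_to_string and the base_url parameter are hypothetical.

```python
from urllib.parse import urljoin
from xml.dom import Node
from xml.dom.minidom import parseString

# ASCII whitespace as defined by the WHATWG Infra Standard.
ASCII_WHITESPACE = "\t\n\x0c\r "
# Maps TAB, LF and CR to a single SPACE each.
CONTROL_TO_SPACE = str.maketrans("\t\n\r", "   ")


def element_to_string(element, base_url=""):
    """Sketch of 'Element to string' for a minidom Element."""
    output = []
    for child in element.childNodes:
        if child.nodeType == Node.TEXT_NODE:
            output.append(child.data.translate(CONTROL_TO_SPACE))
        elif child.nodeType == Node.ELEMENT_NODE:
            tag = child.tagName.upper()
            if tag in ("SCRIPT", "STYLE"):
                # Assumption: these elements contribute no text.
                continue
            if tag == "IMG":
                if child.hasAttribute("alt"):
                    value = child.getAttribute("alt").strip(ASCII_WHITESPACE)
                elif child.hasAttribute("src"):
                    value = child.getAttribute("src").strip(ASCII_WHITESPACE)
                    # Assumption: urljoin approximates the document's
                    # URL resolution rules.
                    value = urljoin(base_url, value)
                else:
                    continue
                output.append(" " + value + " ")
            elif tag == "BR":
                output.append("\n")
            elif tag == "P":
                output.append("\n" + element_to_string(child, base_url))
            else:
                output.append(element_to_string(child, base_url))
        # Any other node type (comments, etc.) is skipped.
    return "".join(output)
```

A caller could obtain an element with parseString(markup).documentElement and pass it in together with the page URL as base_url. The result still contains the padding spaces and line feeds that the plain text algorithm is meant to clean up afterwards.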
Implementations
List parsers that have implemented this algorithm. Note any differences as this specification evolves.
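- php-mf2 implements the initial version (http://microformats.org/wiki/index.php?title=textcontent-parsing&oldid=66933) of this algorithm as of v0.4.4 (https://github.com/microformats/php-mf2/tree/v0.4.4)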
- ...
Brainstorming
Discuss issues and improvements to the algorithm here.
- ...
References
- Element: https://dom.spec.whatwg.org/#interface-element
- Code point: https://infra.spec.whatwg.org/#code-point
- Strip leading and trailing ASCII whitespace: https://infra.spec.whatwg.org/#strip-leading-and-trailing-ascii-whitespace
- Children: https://dom.spec.whatwg.org/#concept-tree-child
- Tree order: https://dom.spec.whatwg.org/#concept-tree-order
- For each: https://infra.spec.whatwg.org/#list-iterate
- Text node: https://dom.spec.whatwg.org/#interface-text
- textContent: https://dom.spec.whatwg.org/#dom-node-textcontent
- Append: https://infra.spec.whatwg.org/#list-append
- tagName: https://dom.spec.whatwg.org/#dom-element-tagname
- Continue: https://infra.spec.whatwg.org/#iteration-continue
- Concatenation: https://infra.spec.whatwg.org/#string-concatenate