[uf-discuss] a question about concatenation and hAtom entry content

Sat Jun 2 06:23:59 PDT 2007

On 6/2/07, Ben Wiley Sittler <bsittler at gmail.com> wrote:
> On 6/1/07, David Janes <davidjanes at blogmatrix.com> wrote:
> > On 6/1/07, Ryan King <ryan at technorati.com> wrote:
> > > On May 31, 2007, at 11:29 AM, David Janes wrote:
> > >
> > > > On 5/31/07, Ryan King <ryan at technorati.com> wrote:
> > > >
> > > >> Another option is that entry content is:
> > > >>
> > > >> <p class="entry-content">Content</p>
> > > >> <p class="entry-content">More Content</p>
> > > >>
> > > >>
> > > >> Is there a reason why hAtom as currently spec'ed only does text, not
> > > >> markup?
> > > >
> > > > I thought it did markup! I totally see what you are saying here
> > > > though; the question here is whether we include the DOM nodes that
> > > > specify entry-content. This isn't in the spec, and you wouldn't want
> > > > to do it everywhere (entry-title, for example) but it would make sense
> > > > if it did.
> > >
> > > You're right, I'm suggesting that only for entry-content (and maybe
> > > entry-summary) that we take the nodes that have the class name on
> > > them. The reason? I've seen this several times:
> > >
> > > <... class="hentry">
> > >   ...
> > >   <p class="entry-content">...</p>
> > >
> > >   <p class="entry-content">...</p>
> > >
> > > </>
> > >
> > > It makes sense, to me, to put the paragraph nodes, intact, in the
> > > content.
> >
> > I concur. Time to start ramping up for hAtom 0.2, if I can get some
> > blocks of free time.
> >
> > Regards, etc...
>
> why not do this for the entry title, too? according to the atom spec,
> this can contain markup too (and in my experience, often does.)
>
> and yes, having some well-defined rules for xhtml → text flattening
> would be good (not just for microformats, but for xhtml apps
> generally.) here are the ones i use:
>
> 1. ignore content of the following elements: script, style, textarea, title
>
> 2. use the alt text as the text for img elements
>
> 4. normalize all runs of one or more whitespace to a single space in
> all elements that do not have an ancestral pre, xmp, plaintext, or
> listing element

> 3. insert breaks before and after the following elements: br, p, div,
> hr, h1, h2, h3, h4, h5, blockquote, address, table, tr, td, form, pre,
> xmp, listing, ol, ul, menu, dir, li, dl, dt and dd

forgot this: 3.1. remove all non-linebreak space adjacent to linebreaks

also forgot this: 3.2. consolidate runs of two or more line breaks
into a single line break
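
For anyone who wants to try the rules above, here is a rough sketch in Python using only the standard library's html.parser. It is not the code I actually use. The names Flattener and flatten are made up for illustration, "breaks" are represented as "\n", and two things are my own assumptions rather than part of the rules: rules 3.1 and 3.2 are applied to the whole output (including text that came from pre and friends), and leading/trailing whitespace is trimmed at the end.

```python
import re
from html.parser import HTMLParser

# Element lists copied from the rules in this message.
IGNORED = {"script", "style", "textarea", "title"}      # rule 1
PRESERVED = {"pre", "xmp", "plaintext", "listing"}      # exempt from whitespace normalization
BREAKING = {"br", "p", "div", "hr", "h1", "h2", "h3", "h4", "h5",
            "blockquote", "address", "table", "tr", "td", "form", "pre",
            "xmp", "listing", "ol", "ul", "menu", "dir", "li", "dl",
            "dt", "dd"}                                 # rule 3


class Flattener(HTMLParser):
    """Collects flattened text (hypothetical helper for this sketch)."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self.ignored_depth = 0      # > 0 while inside script/style/textarea/title
        self.preserved_depth = 0    # > 0 while inside pre/xmp/plaintext/listing

    def handle_starttag(self, tag, attrs):
        if tag in IGNORED:
            self.ignored_depth += 1
        if self.ignored_depth:
            return
        if tag in PRESERVED:
            self.preserved_depth += 1
        if tag in BREAKING:
            self.parts.append("\n")                     # break before the element
        if tag == "img":
            alt = dict(attrs).get("alt")
            if alt:
                self.parts.append(alt)                  # rule 2: alt text stands in for the image

    def handle_endtag(self, tag):
        if tag in IGNORED:
            if self.ignored_depth:
                self.ignored_depth -= 1
            return
        if self.ignored_depth:
            return
        if tag in PRESERVED and self.preserved_depth:
            self.preserved_depth -= 1
        if tag in BREAKING:
            self.parts.append("\n")                     # break after the element

    def handle_data(self, data):
        if self.ignored_depth:
            return                                      # rule 1: drop the content entirely
        if not self.preserved_depth:
            data = re.sub(r"\s+", " ", data)            # whitespace runs become one space
            # Avoid a doubled space where a run is split across inline tags.
            if data.startswith(" ") and self.parts and self.parts[-1].endswith(" "):
                data = data[1:]
        if data:
            self.parts.append(data)


def flatten(html):
    """Return the text of an (X)HTML fragment, flattened per the rules above."""
    parser = Flattener()
    parser.feed(html)
    parser.close()
    text = "".join(parser.parts)
    text = re.sub(r"[^\S\n]*\n[^\S\n]*", "\n", text)    # rule 3.1
    text = re.sub(r"\n{2,}", "\n", text)                # rule 3.2
    return text.strip()                                 # assumption: trim both ends
```

Calling flatten() on two sibling entry-content paragraphs gives the two paragraph texts separated by a single line break, which is the concatenation behaviour discussed earlier in the thread. A parser that builds a real DOM would be a better base if the input is not well formed.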

> still to do:
>
> 4. table layout algorithm
>
> 5. conversion of content inside sup or sub to corresponding unicode
> characters where possible, but only when the entire non-whitespace sub
> or sup content can be converted. this would include e.g. <sup>TM</sup>
> → ™ and <sup>2</sup> → ²
>
> -ben
>
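
A possible starting point for the sup/sub item in the "still to do" list, again as a hedged Python sketch and not existing code. The function name to_unicode_script and the mapping tables are hypothetical, the tables cover only a handful of characters I am sure have Unicode equivalents, and dropping whitespace before the check is my reading of "entire non-whitespace content". The function returns None when any character cannot be converted, so the caller can fall back to the plain text.

```python
# Whole-string special cases, e.g. <sup>TM</sup> becomes the trade mark sign.
SUP_WHOLE = {"TM": "\u2122", "SM": "\u2120"}

# Per-character tables (deliberately incomplete).
SUP_CHARS = dict(zip("0123456789+-=()ni",
                     "\u2070\u00b9\u00b2\u00b3\u2074\u2075\u2076\u2077"
                     "\u2078\u2079\u207a\u207b\u207c\u207d\u207e\u207f\u2071"))
SUB_CHARS = dict(zip("0123456789+-=()",
                     "\u2080\u2081\u2082\u2083\u2084\u2085\u2086\u2087"
                     "\u2088\u2089\u208a\u208b\u208c\u208d\u208e"))


def to_unicode_script(content, kind):
    """Convert sup/sub text to Unicode, or return None if not fully convertible.

    kind is "sup" or "sub"; content is the element's text content.
    """
    text = "".join(content.split())         # assumption: whitespace is ignored
    if not text:
        return None
    if kind == "sup" and text in SUP_WHOLE:
        return SUP_WHOLE[text]
    table = SUP_CHARS if kind == "sup" else SUB_CHARS
    if all(ch in table for ch in text):     # all-or-nothing, as described above
        return "".join(table[ch] for ch in text)
    return None
```

The all-or-nothing check is the important part: converting only some of the characters would produce a mix of raised and normal glyphs that reads worse than leaving the text alone.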