[uf-new] New proposal: Elemental microformat for content boundaries

Indus Khaitan khaitan at gmail.com
Thu Apr 23 07:25:48 PDT 2009


ufrs,

I'm interested in finding a way to identify content boundaries on a
composite page that aggregates or publishes several pieces of content.
This is from the perspective of a search engine (and/or a data
aggregator), which treats the page as a single unit of content when in
reality it may be a container page of aggregated data (I have explained
this visually here
http://www.khaitan.org/blog/2009/03/the-micro-content-problem-in-search-result-pollution/).
Examples of such pages are comment threads, multiple posts on a single page,
twitter's public_timeline, message boards/groups with
discussions/threads, flickr photo sets, Question & Answer pages,
composite FAQ pages, a monthly calendar, a task list, activity streams
and so on.

To expand on a simple example: twitter's public timeline page consists
of 20 individual content units (can I say micro-content?), i.e. twitter
statuses. On an aggregate page, these status messages are uniquely
identifiable units of content, but there is no deterministic way of
discerning the boundaries of the individual statuses and successfully
parsing them without knowing the visual arrangement of the markup. The
same problem statement applies to the other example situations.

The current mechanisms do not provide any semantic cues to a search
bot. Nor is there any easy way to detect duplicate content across
multiple sites, even though this could be done with a simple annotation
in an extended use-case of the proposed solution. I was hoping to see
something around better content identification and grouping in the
upcoming efforts, but the HTML5 spec for grouping content (see Sec.
4.5 http://www.whatwg.org/specs/web-apps/current-work/#grouping-content)
only proposes markup for visually grouping the content.

Possible Solution:
I'm thinking of using something like "rel=cboundary" (I have not been
able to come up with a better name) with a link tag. Similar to
"rel=nofollow", which provides meta information for un-endorsed links,
"rel=cboundary" can provide the meta information for content
demarcation/chunking. By adding this, the page would indicate that the
markup (or content) following the link is semantically demarcated from
the markup preceding the link. This solution can be extended to work
with "rel=bookmark" to identify duplicate content when the same
micro-content is present elsewhere.
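
To make the idea concrete, here is a rough sketch of how a consumer
(search bot or aggregator) might use such a marker. It is only an
illustration under assumptions: "cboundary" is the proposed value and
not an existing rel value, accepting it on both <a> and <link> elements
is my guess at what "link tag" would mean in practice, and the names
BoundarySplitter and split_units are made up for this example. The
sketch starts a new content unit at every marker and, if a
"rel=bookmark" link is present in a unit, records its href as a key
that could be compared across sites to spot duplicates.

```python
from html.parser import HTMLParser


class BoundarySplitter(HTMLParser):
    """Collect text chunks separated by rel="cboundary" markers.

    "cboundary" is the hypothetical rel value from this proposal.
    """

    def __init__(self):
        super().__init__()
        # Content before the first marker goes into an initial chunk.
        self.chunks = [{"text": [], "bookmark": None}]

    def handle_starttag(self, tag, attrs):
        # Assumption: the marker may sit on either <a> or <link>.
        if tag not in ("a", "link"):
            return
        attr_map = dict(attrs)
        rels = (attr_map.get("rel") or "").lower().split()
        if "cboundary" in rels:
            # Everything after this marker belongs to a new unit.
            self.chunks.append({"text": [], "bookmark": None})
        if "bookmark" in rels and self.chunks[-1]["bookmark"] is None:
            # Keep the first permalink seen in the current unit.
            self.chunks[-1]["bookmark"] = attr_map.get("href")

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks[-1]["text"].append(text)


def split_units(html):
    """Return one dict per non-empty content unit found in html."""
    parser = BoundarySplitter()
    parser.feed(html)
    parser.close()
    return [
        {"text": " ".join(c["text"]), "bookmark": c["bookmark"]}
        for c in parser.chunks
        if c["text"] or c["bookmark"]
    ]


if __name__ == "__main__":
    sample = """
    <div>
      <a rel="cboundary bookmark" href="http://example.com/status/1"></a>
      <p>First status</p>
      <a rel="cboundary bookmark" href="http://example.com/status/2"></a>
      <p>Second status</p>
    </div>
    """
    for unit in split_units(sample):
        print(unit["bookmark"], "->", unit["text"])
```

The point of the sketch is that the parser needs no knowledge of the
page's visual layout or class names; the marker alone tells it where
one unit ends and the next begins.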

Another benefit I see is that this can become an elemental microformat
and can be used in hReview, hCalendar and several other microformats
like activity-streams, comments, etc. which are under active discussion.

Would love to hear some comments before proceeding further.

Indus

-- 
http://khaitan.org
+1 408 689 9587

