[uf-dev] Preventing false positives

Fri May 2 01:25:10 PDT 2008

Zachary Carter wrote:

> So I have two questions: 1) is
> there a way to ignore an entire element and its descendants from being
> parsed?

Not that I know of. I suppose that putting the content into an IFRAME  
instead of on the main page ought to do it, but it's an ugly  
solution; and because it's not an officially sanctioned method for  
hiding content from parsers, you have no guarantee that future  
parsers will not start parsing within IFRAMEs.

> 2) Is there a way to have the parser ignore all class names on
> an element? (as if the class names were removed from the element prior
> to parsing)

The MFO effort <http://microformats.org/wiki/mfo> is an attempt to do  
something like this. The list of parsers that actually support MFO is  
pretty short though.

Cognition <http://buzzword.org.uk/cognition/> does support MFO. I  
mention this because the technique it uses is close to what you  
describe. When it parses a microformat, it takes a *clone* of the  
element and its children (so as not to damage the original DOM tree),  
then tries to parse embedded microformats -- e.g. "adr", "geo" and  
"agent vcard" within a "vcard".

I'll break off the parsing procedure here for a little terminology. I
make a distinction between "embedded microformats", which are those
that imply a special meaning by being nested within each other, and
"nested microformats", which are those that are nested within each
other by mere coincidence, or perhaps to convey some kind of
undefined relationship between the objects (e.g. an hCard could be
nested within a geo -- perhaps the author meant to convey that the
person represented by the hCard lives at that location, but this type
of nesting is not defined in the specs).
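
To make the distinction concrete, here is a rough Python sketch (not
Cognition's actual code). The EMBEDDED table, the ROOTS set and the
classify() function are all hypothetical names, and the table only
lists the hCard examples mentioned above; a real parser would need a
table per microformat.

```python
# Hypothetical table: class combinations that count as "embedded" when
# found inside a given parent root. Only the hCard examples are listed.
EMBEDDED = {
    "vcard": [{"adr"}, {"geo"}, {"agent", "vcard"}],
}

# Abbreviated, illustrative list of known root class names.
ROOTS = {"vcard", "adr", "geo", "vcalendar", "hatom", "hresume",
         "hlisting", "mfo"}


def classify(parent_root, child_class_attr):
    """Classify a child element relative to the root that contains it."""
    child = set(child_class_attr.split())
    # Embedded: the child carries every class of a combination the
    # parent format gives a special meaning to.
    for required in EMBEDDED.get(parent_root, []):
        if required <= child:
            return "embedded"
    # Nested: the child is some root, but the nesting has no defined meaning.
    if child & ROOTS:
        return "nested"
    return "neither"


print(classify("vcard", "agent vcard"))  # embedded
print(classify("geo", "vcard"))          # nested
```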

Anyway, after parsing *embedded* microformats, Cognition searches for  
*nested* microformats. It uses a list of all known root element  
classes (e.g. "hatom", "hresume", "hlisting", "vcalendar") --  
including the class names for microformats which Cognition does not  
yet support. It also includes the class name "mfo".

Now, if it finds any of these nested microformats, it reaches within
them and tampers with every descendant element, setting the "rel",
"rev" and "class" attributes to the empty string. Remember that this
is on a clone of the DOM. Thus these elements will be excluded from
supplying any unintentional semantics to the outer microformat.
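
The masking step could be sketched roughly as follows. This is a
minimal illustration using Python's standard library, not Cognition's
implementation: ROOT_CLASSES is an abbreviated, assumed list,
mask_nested() is a hypothetical name, and the handling of embedded
microformats that happens beforehand is left out. It also only blanks
attributes that are already present rather than adding empty ones.

```python
import copy
import xml.etree.ElementTree as ET

# Abbreviated, assumed list of root class names, plus "mfo".
ROOT_CLASSES = {"vcard", "vcalendar", "hatom", "hresume", "hlisting", "mfo"}
MASKED_ATTRS = ("class", "rel", "rev")


def mask_nested(root):
    """Return a masked deep copy of root; the original tree is untouched."""
    clone = copy.deepcopy(root)
    for element in clone.iter():
        if element is clone:
            continue  # the outer microformat itself is not a nested root
        if set(element.get("class", "").split()) & ROOT_CLASSES:
            # Blank the attributes on every descendant of the nested root.
            for descendant in element.iter():
                if descendant is element:
                    continue
                for attr in MASKED_ATTRS:
                    if attr in descendant.attrib:
                        descendant.set(attr, "")
    return clone


SAMPLE = """
<div class="vcard">
  <h1 class="fn n"><span class="given-name">Marvin</span></h1>
  <p class="note">
    <span class="mfo">
      Worked for <b class="vcard"><span class="fn org">Hanzo</span></b>.
    </span>
  </p>
</div>
""".strip()

original = ET.fromstring(SAMPLE)
masked = mask_nested(original)
print(ET.tostring(masked, encoding="unicode"))
```

Because the work is done on a deep copy, the class names in the
original tree are still there for any later parsing step.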

Let's look at an example:

	<div class="vcard">
	  <h1 class="fn n">
	    <span class="honorific-prefix">Dr.</span>
	    <span class="given-name">Marvin</span>
	    <span class="family-name">Candle</span>
	  </h1>
	  <p class="note">
	    <span class="mfo">
	      Worked for a company called
	      <b class="vcard">
	        <span class="fn org">The Hanzo Foundation</span>
	      </b>.
	    </span>
	  </p>
	</div>

Now, when we come to parse the outer hCard, the clone is reduced to  
the following using MFO:

	<div class="vcard">
	  <h1 class="fn n">
	    <span class="honorific-prefix">Dr.</span>
	    <span class="given-name">Marvin</span>
	    <span class="family-name">Candle</span>
	  </h1>
	  <p class="note">
	    <span class="mfo">
	      Worked for a company called
	      <b>
	        <span>The Hanzo Foundation</span>
	      </b>.
	    </span>
	  </p>
	</div>

And the following vCard may be produced:

BEGIN:VCARD
FN:Dr. Marvin Candle
N:Candle;Marvin;;Dr.
NOTE:Worked for a company called The Hanzo Foundation.
END:VCARD

Note that the full text of the note is included, but there is no  
"ORG" property in the vCard.

As it happens, because "vcard" is included in that big list of known  
microformats (remember? "hatom", "hresume", "hlisting",  
"vcalendar"...), the same effect would have happened even if we  
hadn't included <span class="mfo"> -- but the MFO class is still  
useful because new microformats could arise at some point in the  
future which are not on that list.

It is also worth noting that while this MFO step masks the properties  
of the inner hCard from the outer hCard, the inner hCard will still  
be parsed as a later step, resulting in a second vCard:

BEGIN:VCARD
FN:The Hanzo Foundation
ORG:The Hanzo Foundation
END:VCARD
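
The two passes can be illustrated with another rough Python sketch.
Again, this is an assumption-laden toy rather than Cognition's code:
parse_all(), masked_clone() and prop() are hypothetical helpers, the
root list is abbreviated, and only "fn" and "org" are extracted. Each
hCard root found in the original tree gets its own masked clone, so
the inner card is hidden from the outer one but still parsed in its
own right.

```python
import copy
import xml.etree.ElementTree as ET

# Abbreviated, assumed list of root class names, plus "mfo".
ROOT_CLASSES = {"vcard", "vcalendar", "hatom", "hresume", "hlisting", "mfo"}


def classes(element):
    return set(element.get("class", "").split())


def masked_clone(root):
    """Deep-copy root and blank class/rel/rev inside any nested root."""
    clone = copy.deepcopy(root)
    for element in clone.iter():
        if element is not clone and classes(element) & ROOT_CLASSES:
            for descendant in element.iter():
                if descendant is not element:
                    for attr in ("class", "rel", "rev"):
                        if attr in descendant.attrib:
                            descendant.set(attr, "")
    return clone


def prop(root, name):
    """Whitespace-normalised text of the first descendant with class name."""
    for element in root.iter():
        if element is not root and name in classes(element):
            return " ".join("".join(element.itertext()).split())
    return None


def parse_all(document):
    """Parse every hCard root in the original tree, each on its own clone."""
    cards = []
    for element in document.iter():
        if "vcard" in classes(element):
            clone = masked_clone(element)
            cards.append({"fn": prop(clone, "fn"), "org": prop(clone, "org")})
    return cards


SAMPLE = """
<div class="vcard">
  <h1 class="fn n">
    <span class="honorific-prefix">Dr.</span>
    <span class="given-name">Marvin</span>
    <span class="family-name">Candle</span>
  </h1>
  <p class="note">
    <span class="mfo">
      Worked for a company called
      <b class="vcard">
        <span class="fn org">The Hanzo Foundation</span>
      </b>.
    </span>
  </p>
</div>
""".strip()

for card in parse_all(ET.fromstring(SAMPLE)):
    print(card)
```

Without the masking step, the outer card would pick up "The Hanzo
Foundation" as its "org", which is exactly the false positive being
avoided.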

-- 
Toby A Inkster
<mailto:mail at tobyinkster.co.uk>
<http://tobyinkster.co.uk>
