[uf-dev] Preventing false positives
Toby A Inkster
mail at tobyinkster.co.uk
Fri May 2 01:25:10 PDT 2008
Zachary Carter wrote:
> So I have two questions: 1) is
> there a way to ignore an entire element and its descendants from being
> parsed?
Not that I know of. I suppose that putting the content into an IFRAME
instead of on the main page ought to do it, but it's an ugly
solution; and because it's not an officially sanctioned method for
hiding content from parsers, you have no guarantee that future
parsers will not start parsing within IFRAMEs.
> 2) Is there a way to have the parser ignore all class names on
> an element? (as if the class names were removed from the element prior
> to parsing)
The MFO effort <http://microformats.org/wiki/mfo> is an attempt to do
something like this. The list of parsers that actually support MFO is
pretty short though.
Cognition <http://buzzword.org.uk/cognition/> does support MFO. I
mention this because the technique it uses is close to what you
describe. When it parses a microformat, it takes a *clone* of the
element and its children (so as not to damage the original DOM tree),
then tries to parse embedded microformats -- e.g. "adr", "geo" and
"agent vcard" within a "vcard".
I'll break off the parsing procedure here for a little terminology. I
make a distinction between "embedded microformats", which are those
that imply a special meaning by being nested within each other, and
"nested microformats", which are those that are nested within each
other by mere coincidence, or perhaps to convey some kind of
undefined relationship between the objects (e.g. an hCard could be
nested within a geo -- perhaps the author meant to convey that the
person represented by the hCard lives at that location, but this type
of nesting is not defined in the specs).
Anyway, after parsing *embedded* microformats, Cognition searches for
*nested* microformats. It uses a list of all known root element
classes (e.g. "hatom", "hresume", "hlisting", "vcalendar") --
including the class names for microformats which Cognition does not
yet support. It also includes the class name "mfo".
Now, if it finds any of these nested microformats, it reaches within
them and tampers with every descendant element, setting the "rel",
"rev" and "class" attributes to the empty string. Remember that this
is done on a clone of the DOM, so the original tree is untouched. Thus
these elements are excluded from supplying any unintentional semantics
to the outer microformat.
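The masking step just described can be sketched roughly as follows.
This is a minimal illustration in Python, not Cognition's actual code.
The function names and the abbreviated ROOT_CLASSES set are
assumptions made for the example, the input is assumed to be
well-formed XML, and the earlier pass that handles embedded
microformats is left out entirely.

```python
import copy
import xml.etree.ElementTree as ET

# Assumed, abbreviated list of root class names; the real list is longer.
ROOT_CLASSES = {"vcard", "vcalendar", "hatom", "hresume", "hlisting", "mfo"}


def has_root_class(element):
    """Return True if the element carries any known root class name."""
    return bool(ROOT_CLASSES & set(element.get("class", "").split()))


def mask_nested(root):
    """Return a clone of root in which nested microformats are masked."""
    clone = copy.deepcopy(root)  # work on a copy, never on the original tree

    def walk(element):
        for child in element:
            if has_root_class(child):
                # Blank rel, rev and class on every descendant of the nested
                # root. The nested root itself keeps its attributes.
                for descendant in child.iter():
                    if descendant is child:
                        continue
                    for name in ("rel", "rev", "class"):
                        descendant.set(name, "")
            else:
                walk(child)

    walk(clone)
    return clone
```

Calling mask_nested() on the element for an outer hCard returns a
copy whose nested roots no longer contribute class names, while the
tree that was passed in stays exactly as it was.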
Let's look at an example:
<div class="vcard">
  <h1 class="fn n">
    <span class="honorific-prefix">Dr.</span>
    <span class="given-name">Marvin</span>
    <span class="family-name">Candle</span>
  </h1>
  <p class="note">
    <span class="mfo">
      Worked for a company called
      <b class="vcard">
        <span class="fn org">The Hanzo Foundation</span>
      </b>.
    </span>
  </p>
</div>
Now, when we come to parse the outer hCard, the clone is reduced to
the following using MFO:
<div class="vcard">
  <h1 class="fn n">
    <span class="honorific-prefix">Dr.</span>
    <span class="given-name">Marvin</span>
    <span class="family-name">Candle</span>
  </h1>
  <p class="note">
    <span class="mfo">
      Worked for a company called
      <b>
        <span>The Hanzo Foundation</span>
      </b>.
    </span>
  </p>
</div>
And the following vCard may be produced:
BEGIN:VCARD
FN:Dr. Marvin Candle
N:Candle;Marvin;;Dr.
NOTE:Worked for a company called The Hanzo Foundation.
END:VCARD
Note that the full text of the note is included, but there is no
"ORG" property in the vCard.
As it happens, because "vcard" is included in that big list of known
microformats (remember? "hatom", "hresume", "hlisting",
"vcalendar"...), the same effect would have happened even if we
hadn't included <span class="mfo"> -- but the MFO class is still
useful because new microformats could arise at some point in the
future which are not on that list.
It is also worth noting that while this MFO step masks the properties
of the inner hCard from the outer hCard, the inner hCard will still
be parsed in its own right in a later step, resulting in a second
vCard:
BEGIN:VCARD
FN:The Hanzo Foundation
ORG:The Hanzo Foundation
END:VCARD
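The way each hCard ends up being parsed against its own masked clone
can be sketched as below. Again this is only an illustrative Python
sketch under stated assumptions, not Cognition's real implementation:
parse_hcards(), masked_clone(), the abbreviated ROOTS set and the
three-property extraction (fn, org, note) are all hypothetical
simplifications, and the input is assumed to be well-formed XML.

```python
import copy
import xml.etree.ElementTree as ET

# Assumed, abbreviated list of root class names.
ROOTS = {"vcard", "vcalendar", "hatom", "hresume", "hlisting", "mfo"}


def classes(element):
    """Return the set of class names on an element."""
    return set(element.get("class", "").split())


def masked_clone(root):
    """Clone root and blank rel/rev/class inside any nested root element."""
    clone = copy.deepcopy(root)
    stack = list(clone)
    while stack:
        element = stack.pop()
        if classes(element) & ROOTS:
            for descendant in element.iter():
                if descendant is not element:
                    for name in ("rel", "rev", "class"):
                        descendant.set(name, "")
        else:
            stack.extend(element)
    return clone


def text_of(element):
    """Return the element's text content with whitespace collapsed."""
    return " ".join("".join(element.itertext()).split())


def parse_hcards(document):
    """Parse every hCard in document order, each against its own clone."""
    cards = []
    for element in document.iter():
        if "vcard" in classes(element):
            clone = masked_clone(element)
            card = {}
            for node in clone.iter():
                # Only three simple properties are handled in this sketch.
                for name in classes(node) & {"fn", "org", "note"}:
                    card.setdefault(name, text_of(node))
            cards.append(card)
    return cards
```

Because each root is cloned from the untouched original tree, the
inner hCard still has its "fn" and "org" classes when its own turn
comes, even though they were blanked in the clone used for the outer
hCard.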
--
Toby A Inkster
<mailto:mail at tobyinkster.co.uk>
<http://tobyinkster.co.uk>