From paul.kinlan at gmail.com  Fri Sep  7 13:58:27 2007
From: paul.kinlan at gmail.com (Paul Kinlan)
Date: Fri Sep  7 13:58:30 2007
Subject: [uf-dev] Microformat Parser for .Net
Message-ID: <1f8270600709071358r70c12ad9i521c641b7fb02bcf@mail.gmail.com>

Hi all,

I am new to the development list, but I have been on the uf-discuss list for
a while now.  I thought that this list was the best place to announce that I
have created a usable [although beta] release of a generic microformat
parser for .Net.

The project can be found on codeplex at http://www.codeplex.com/microformat.
The current release is Iteration 3.

The parser is stream based and uses an application configuration (see below
for an example) to define how the parser should parse the HTML/XML
stream.  This flexible configuration means that if the spec for a
microformat changes, or a new microformat is introduced, no code needs to
change in the framework for its users to see the changed data.

  <configSections>
    <section name="MicroformatsSection"
             type="Microformats.ConfigurationSections.MicroformatConfigSection, Microformat.net"/>
  </configSections>
  <MicroformatsSection>
    <Microformats>
      <Microformat type="rel-tag" rootType="rel" root="tag" dataType="System.Uri" />
      <Microformat type="hCard" rootType="class" root="vcard" dataType="System.String">
        <Fields>
          <Field name="fn" dataType="System.String" plurality="Singular"/>
          <Field name="url" dataType="System.Uri" plurality="Singular"/>
          <Field name="email" dataType="System.Uri" plurality="Singular"/>
          <Field name="adr" dataType="Microformat" plurality="Singular"/>
        </Fields>
      </Microformat>
      <Microformat type="adr" rootType="class" root="adr" dataType="System.String">
        <Fields>
          <Field name="post-office-box" dataType="System.String" plurality="Singular"/>
          <Field name="extended-address" dataType="System.String" plurality="Singular"/>
          <Field name="street-address" dataType="System.String" plurality="Singular"/>
          <Field name="locality" dataType="System.String" plurality="Singular"/>
          <Field name="region" dataType="System.String" plurality="Singular"/>
          <Field name="postal-code" dataType="System.String" plurality="Singular"/>
          <Field name="country-name" dataType="System.String" plurality="Singular"/>
        </Fields>
      </Microformat>
    </Microformats>
  </MicroformatsSection>

The above configuration says that the following microformats are to be
searched for: rel-tag, hCard and adr.  Microformat configurations can
also be nested (for example, the hCard spec allows an adr to be nested
inside an hCard), which saves duplicating configuration information.
(Unfortunately, a circular reference can currently be defined in the
configuration, and plurality of elements is not implemented.  This will be
fixed soon.)  This configuration does not define all of the hCard spec (I
kept it short to show how the config works).  Any parts of a microformat
that are left out of the configuration will not appear in the output of
the framework.
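
As a rough illustration of the nesting and of the circular-reference
problem mentioned above (this is not the .Net framework's code), here is a
Python sketch. It assumes that a Field with dataType="Microformat" refers
to the Microformat whose root attribute equals the field's name. That
lookup rule, the names nesting_graph and find_cycle, and the back-reference
from adr to vcard are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the configuration above (only a few fields kept).
CONFIG = """
<MicroformatsSection>
  <Microformats>
    <Microformat type="rel-tag" rootType="rel" root="tag" dataType="System.Uri" />
    <Microformat type="hCard" rootType="class" root="vcard" dataType="System.String">
      <Fields>
        <Field name="fn" dataType="System.String" plurality="Singular"/>
        <Field name="adr" dataType="Microformat" plurality="Singular"/>
      </Fields>
    </Microformat>
    <Microformat type="adr" rootType="class" root="adr" dataType="System.String">
      <Fields>
        <Field name="locality" dataType="System.String" plurality="Singular"/>
      </Fields>
    </Microformat>
  </Microformats>
</MicroformatsSection>
"""

# Hypothetical broken variant: adr nests a vcard, which nests an adr again.
CYCLIC = CONFIG.replace('name="locality" dataType="System.String"',
                        'name="vcard" dataType="Microformat"')


def nesting_graph(config_xml):
    """Map each microformat root to the roots of the microformats nested in it."""
    section = ET.fromstring(config_xml)
    graph = {}
    for mf in section.iter("Microformat"):
        # Assumption: a nested field's name is the root of the nested microformat.
        graph[mf.get("root")] = [f.get("name") for f in mf.iter("Field")
                                 if f.get("dataType") == "Microformat"]
    return graph


def find_cycle(graph):
    """Return one cycle as a list of roots, or None if the nesting is acyclic."""
    visiting, done = [], set()

    def visit(node):
        if node in done:
            return None
        if node in visiting:
            return visiting[visiting.index(node):] + [node]
        visiting.append(node)
        for child in graph.get(node, []):
            cycle = visit(child)
            if cycle:
                return cycle
        visiting.pop()
        done.add(node)
        return None

    for root in graph:
        cycle = visit(root)
        if cycle:
            return cycle
    return None


print(find_cycle(nesting_graph(CONFIG)))
print(find_cycle(nesting_graph(CYCLIC)))
```

A check like this could run once when the configuration is loaded, so a
circular definition is rejected before any parsing starts.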

I still have a lot of work to do, however it appears (to me at least) to be
quite flexible.  I would greatly appreciate any comments and feedback and if
you use the framework I would love to hear about it.  If anyone is
interested in joining the project let me know.

Kind Regards,
Paul Kinlan

Nb.  The code is released under the Microsoft permissive licence; this
licence fits best with the sgml reader code by Chris Lovett that is
included in the project.
From tantek at cs.stanford.edu  Sat Sep  8 03:29:07 2007
From: tantek at cs.stanford.edu (Tantek Çelik)
Date: Sat Sep  8 03:28:28 2007
Subject: [uf-dev] Re: [uf-discuss] Microformats and title attribute parsing
In-Reply-To: <e06e0e0b0709071936s3b82d978u43ecb16bf64806b3@mail.gmail.com>
Message-ID: <C307C710.94554%tantek@cs.stanford.edu>

Re-routing parsing issue to microformats-dev.

On 9/7/07 7:36 PM, "Mike Kaply" <microformats@kaply.com> wrote:

> See:
> 
> http://kidachi.kazuhi.to/blog/archives/002343.html
> 
>> I was disappointed with his comment - he means that Operator won't catch
>> title attribute of
>> span element in hCalendar as far as the community doesn't get one concrete
>> conclusion.
>> (BTW Tails and Tails Export can find my hCalendar as I expected.)
> 
> Is  this true? Tails and Tails export find title on non abbr elements?

This is a longstanding bug in Tails.

The title attribute has *never* semantically meant anything on anything but
the <abbr> element in microformats parsing (e.g. see hCard parsing)[1].

Last time I checked, Tails incorrectly looks at the title attribute on all
elements instead of just on <abbr> - I think at this point it is due more
to lack of maintenance than anything else.  An innocent mistake that was
just never corrected.
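
To make the rule concrete, here is a minimal Python sketch (not taken from
Tails, X2V or Operator) in which the title attribute supplies a property
value only on <abbr>, and every other element falls back to its text
content. The class and function names, the VOID set and the dtstart sample
markup are all made up for the example.

```python
from html.parser import HTMLParser

VOID = {"br", "hr", "img", "input", "link", "meta"}  # elements with no end tag


class PropertyValue(HTMLParser):
    """Collect the value of the first element carrying a given class name."""

    def __init__(self, prop):
        super().__init__()
        self.prop = prop
        self.depth = 0      # > 0 while inside the matched element
        self.text = []
        self.value = None

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        attrs = dict(attrs)
        if self.depth:
            self.depth += 1
        elif self.value is None and self.prop in (attrs.get("class") or "").split():
            if tag == "abbr" and attrs.get("title") is not None:
                self.value = attrs["title"]   # title counts only on <abbr>
            else:
                self.depth = 1                # otherwise use the text content

    def handle_endtag(self, tag):
        if tag in VOID or not self.depth:
            return
        self.depth -= 1
        if self.depth == 0:
            self.value = "".join(self.text).strip()

    def handle_data(self, data):
        if self.depth:
            self.text.append(data)


def property_value(html, prop):
    parser = PropertyValue(prop)
    parser.feed(html)
    parser.close()
    return parser.value


print(property_value('<abbr class="dtstart" title="2007-09-08">Sept 8</abbr>', "dtstart"))
print(property_value('<span class="dtstart" title="2007-09-08">Sept 8</span>', "dtstart"))
```

With this rule the <span> case yields the visible text rather than the
title, which is the behaviour described above for everything except <abbr>.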

> If this is the case, it would mean an old version of X2V does this
> since that's what they use....

X2V has properly supported <abbr title> (and not title attribute in general)
for quite some time.  Tails must use its own implementation or perhaps it
uses a *really* old version of X2V?

Tantek

[1] http://microformats.org/wiki/hcard-parsing

From matt at daisyinteractive.com  Mon Sep 10 12:59:29 2007
From: matt at daisyinteractive.com (Matt Warnock)
Date: Mon Sep 10 13:06:04 2007
Subject: [uf-dev] Multiple hCards on one page
Message-ID: <64db63234ef480ac8881f576514439c6@daisyinteractive.com>

Hello - This is my first time posting to a list like this so please 
excuse the errors if there are any.

I have just gotten the microformats book and I am having a real problem 
getting this to work across several platforms.  I am trying to get the 
"add to address book" link working for users to add the particular 
address to their address book .app, outlook or entourage.

I am using the:

http://feeds.technorati.com/contacts/(my absolute path here)

link because hopefully we will be able to get this working in 
production and might get a couple links a day so I don't want to 
overload the Suda server.

Is it possible to have 2 hCards on one page?  I searched and found a 
posting where someone used a target of #santa-monica vs. #boston and 
they said that worked when you escaped the # with %23, but I haven't 
been able to get it going.

If someone has an answer, it would get the project back up and moving 
forward again after 3 days of being off track.

My implementation is here: http://exhale.daisyinteractive.com/locations/
.......................................................
Matt Warnock
Daisy Interactive Inc.
830 S. Hill St. #850
Los Angeles, CA 90014
Tel.: 213-627-4990
Fax: 213-627-4080
matt@daisyinteractive.com
www.daisyinteractive.com

From ryan at technorati.com  Tue Sep 11 10:57:10 2007
From: ryan at technorati.com (Ryan King)
Date: Tue Sep 11 10:57:17 2007
Subject: [uf-dev] Multiple hCards on one page
In-Reply-To: <64db63234ef480ac8881f576514439c6@daisyinteractive.com>
References: <64db63234ef480ac8881f576514439c6@daisyinteractive.com>
Message-ID: <ABF5A987-375A-4D6E-83B8-15FE7DC8EEB4@technorati.com>

On Sep 10, 2007, at 12:59 PM, Matt Warnock wrote:

> Hello - This is my first time posting to a list like this so please  
> excuse the errors if there are any.
>
> I have just gotten the microformats book and I am having a real  
> problem getting this to work across several platforms.  I am trying  
> to get the "add to address book" link working for users to add the  
> particular address to their address book .app, outlook or entourage.
>
> I am using the:
>
> http://feeds.technorati.com/contacts/(my absolute path here)
>
> link because hopefully we will be able to get this working in  
> production and might get a couple links a day so I don't want to  
> overload the Suda server.
>
> I's it possible to have 2 hCards on one page?  I searched and found  
> a posting where someone used a target of #santa-monica vs. #boston  
> and they said that worked when you escaped the # with %23, but I  
> haven't been able to get it going.
>
> If someone has an answer it would get 3 days of being off track on  
> the project back up and moving forward again.
>
> My implementation is here: http://exhale.daisyinteractive.com/ 
> locations/

You're using this:

  <a id="new-york" name="new-york"></a>

Which means that when the service goes to extract
http://exhale.daisyinteractive.com/locations/#new-york, it will only look
at that node of the document (the empty anchor), which contains no hCard.
To reference a specific hCard, put the id on the root element of that
hCard.
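
A small Python sketch of why the placement of the id matters. It is only
an illustration of fragment scoping, not how the Technorati service is
implemented; the function name, the VOID set and the sample markup
("Example Spa") are made up.

```python
from html.parser import HTMLParser

VOID = {"br", "hr", "img", "input", "link", "meta"}  # elements with no end tag


class FragmentText(HTMLParser):
    """Collect the text inside the first element whose id matches the fragment."""

    def __init__(self, fragment_id):
        super().__init__()
        self.fragment_id = fragment_id
        self.depth = 0        # > 0 while inside the matched element
        self.found = False
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        if self.depth:
            self.depth += 1
        elif not self.found and dict(attrs).get("id") == self.fragment_id:
            self.found = True
            self.depth = 1

    def handle_endtag(self, tag):
        if tag not in VOID and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.text.append(data)


def subtree_text(html, fragment_id):
    parser = FragmentText(fragment_id)
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.text).split())


# id on an empty anchor placed before the hCard: the fragment selects nothing.
BEFORE = ('<a id="new-york" name="new-york"></a>'
          '<div class="vcard"><span class="fn org">Example Spa</span></div>')
# id on the root element of the hCard: the fragment selects the whole card.
AFTER = ('<div class="vcard" id="new-york">'
         '<span class="fn org">Example Spa</span></div>')

print(repr(subtree_text(BEFORE, "new-york")))
print(repr(subtree_text(AFTER, "new-york")))
```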

-ryan
From brian.suda at gmail.com  Thu Sep 13 06:54:15 2007
From: brian.suda at gmail.com (Brian Suda)
Date: Thu Sep 13 06:54:18 2007
Subject: [uf-dev] Parsing By Element
Message-ID: <21e770780709130654g1d34dab1k36cc6ff984a2d009@mail.gmail.com>

Tantek and I were talking about how we should construct a list of
elements and explain how they are being parsed. For example,

<a href="http://example.org" class="fn url">John Doe</a>

We know that FN becomes "John Doe" and URL becomes http://example.org
but there is very little documentation (at least in one place) about
how and when these rules are invoked.
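
For what it's worth, the rule in that example can be written down as a few
lines of Python. This is only a sketch of the one case shown (class names
on an <a> element); the URL_PROPERTIES set and the function name are
assumptions for the example, not something taken from a spec.

```python
from html.parser import HTMLParser

URL_PROPERTIES = {"url"}  # assumption: properties that read href on <a>


class AnchorProperties(HTMLParser):
    """Map each class name on an <a> element to href or to the link text."""

    def __init__(self):
        super().__init__()
        self.values = {}
        self.pending = []     # text-valued properties of the <a> being read
        self.text = []
        self.in_anchor = False

    def handle_starttag(self, tag, attrs):
        if tag != "a" or self.in_anchor:
            return
        attrs = dict(attrs)
        self.in_anchor = True
        self.text = []
        self.pending = []
        for name in (attrs.get("class") or "").split():
            if name in URL_PROPERTIES and attrs.get("href"):
                self.values[name] = attrs["href"]   # URL comes from href
            else:
                self.pending.append(name)           # FN etc. come from the text

    def handle_data(self, data):
        if self.in_anchor:
            self.text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self.in_anchor:
            value = "".join(self.text).strip()
            for name in self.pending:
                self.values[name] = value
            self.in_anchor = False


def anchor_properties(html):
    parser = AnchorProperties()
    parser.feed(html)
    parser.close()
    return parser.values


print(anchor_properties('<a href="http://example.org" class="fn url">John Doe</a>'))
```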

I created a parsing page on the wiki (feel free to move it as needed)
http://microformats.org/wiki/parsing

It takes the W3C list of HTML elements and begins to map how and where
to extract the values for various microformats properties.

This is an early braindump, so feel free to re-work major portions of
it. Because several of the elements only exist in the HTML head
element, i have noted them as (valid?). I'm not sure how, when or
why you would add a microformats property to a STYLE element, or even
whether some of these elements can take a @class attribute - maybe they
should be kept in the list, but noted in some fashion as
not valid for MF usage.

-brian

-- 
brian suda
http://suda.co.uk

From tantek at cs.stanford.edu  Thu Sep 13 11:57:58 2007
From: tantek at cs.stanford.edu (Tantek Çelik)
Date: Thu Sep 13 11:57:21 2007
Subject: [uf-dev] Parsing By Element
In-Reply-To: <21e770780709130654g1d34dab1k36cc6ff984a2d009@mail.gmail.com>
Message-ID: <C30ED63E.94877%tantek@cs.stanford.edu>

On 9/13/07 6:54 AM, "Brian Suda" <brian.suda@gmail.com> wrote:

> Tantek and I were talking about how we should construct a list of
> elements and explain how they are being parsed. For example,
> 
> <a href="http://example.org" class="fn url">John Doe</a>
> 
> We know that FN becomes "John Doe" and URL becomes http://example.org
> but there is very little documentation (atleast in one place) about
> how and when these rules are invoked.

Actually, there is quite a bit of documentation about this, and it is in
*only* one place currently:

 http://microformats.org/wiki/hcard-parsing

> I created a parsing page on the wiki (feel free to move it as needed)
> http://microformats.org/wiki/parsing
>
> It takes the W3C list of HTML elements and begins to map how and where
> to extract the values for various microformats properties.

I saw that, I'm not sure that a raw element list is the right way to start
that.

I've been trying to complete the *semantic* element and attribute lists as
well as group them into logical sets for mnemonic purposes here:

 http://microformats.org/wiki/semantic-xhtml

From that, the next step is an audit of hcard-parsing to see if I'm missing
any special element handling (like <input> for example) and derive parsing
rules per element semantics accordingly and finish writing them up here:

<http://microformats.org/wiki/hcard-brainstorming#Additional_Semantic_HTML_handling>

Then let's take a look at the open source implementations (X2V, hKit,
Operator) and determine if it is fairly straightforward to add any
additional element-specific semantic handling - I expect that implementation
updates should be fairly trivial for a few special cases.

Simultaneously, test cases which exercise the new parsing cases will help as
well.

Once we've gotten all that working for hCard, then I'll draft up
/wiki/hcalendar-parsing accordingly as well.

With the practical experience of hCard and hCalendar parsing, I'll
extract/abstract common bits and draft /wiki/compound-parsing as general
rules for parsing compound microformats.

How does that sound?

Tantek

From connolly at w3.org  Thu Sep 13 13:17:35 2007
From: connolly at w3.org (Dan Connolly)
Date: Thu Sep 13 13:17:57 2007
Subject: [uf-dev] Parsing By Element
In-Reply-To: <C30ED63E.94877%tantek@cs.stanford.edu>
References: <C30ED63E.94877%tantek@cs.stanford.edu>
Message-ID: <1189714655.1829.628.camel@pav>

On Thu, 2007-09-13 at 11:57 -0700, Tantek Çelik wrote:
[...]
> Actually, there is quite a bit of documentation about this, and it is
> in
> *only* one place currently:
> 
>  http://microformats.org/wiki/hcard-parsing

Yeah, that's where I thought this stuff lived.

[...]
> How does that sound?

Good, at a glance.

-- 
Dan Connolly, W3C http://www.w3.org/People/Connolly/


From ryan at technorati.com  Thu Sep 13 15:15:18 2007
From: ryan at technorati.com (Ryan King)
Date: Thu Sep 13 15:15:25 2007
Subject: [uf-dev] Parsing By Element
In-Reply-To: <C30ED63E.94877%tantek@cs.stanford.edu>
References: <C30ED63E.94877%tantek@cs.stanford.edu>
Message-ID: <9BD1C3AC-D241-46DF-A7E9-9F01A1822DCB@technorati.com>


On Sep 13, 2007, at 11:57 AM, Tantek Çelik wrote:

> On 9/13/07 6:54 AM, "Brian Suda" <brian.suda@gmail.com> wrote:
>
>> Tantek and I were talking about how we should construct a list of
>> elements and explain how they are being parsed. For example,
>>
>> <a href="http://example.org" class="fn url">John Doe</a>
>>
>> We know that FN becomes "John Doe" and URL becomes http://example.org
>> but there is very little documentation (atleast in one place) about
>> how and when these rules are invoked.
>
> Actually, there is quite a bit of documentation about this, and it  
> is in
> *only* one place currently:
>
>  http://microformats.org/wiki/hcard-parsing

Note that I have a good deal of stuff in a presentation I gave earlier  
this year:

http://theryanking.com/presentations/2007/www2007-microformats-parsing/

I should write it up more formally.

-ryan
From brian.suda at gmail.com  Fri Sep 14 03:29:34 2007
From: brian.suda at gmail.com (Brian Suda)
Date: Fri Sep 14 03:29:38 2007
Subject: [uf-dev] Parsing By Element
In-Reply-To: <C30ED63E.94877%tantek@cs.stanford.edu>
References: <21e770780709130654g1d34dab1k36cc6ff984a2d009@mail.gmail.com>
	<C30ED63E.94877%tantek@cs.stanford.edu>
Message-ID: <21e770780709140329n7b716790n376ef7b7e980e31f@mail.gmail.com>

On 9/13/07, Tantek Çelik <tantek@cs.stanford.edu> wrote:
> > It takes the W3C list of HTML elements and begins to map how and where
> > to extract the values for various microformats properties.
>
> I saw that, I'm not sure that a raw element list is the right way to start
> that.
--- my understanding was that we wanted a full list of elements and
their parsing rules. If we want to re-order or group them by semantics
that is fine. It is a wiki, so i'll let someone else take the existing
data and re-order/remove/tweak it as needed.

> I've been trying to complete the *semantic* element and attribute lists as
> well as group them into logical sets for mnemonic purposes here:
>
>  http://microformats.org/wiki/semantic-xhtml

--- last time i looked at that page it was a list of a handful of
elements. It seems much better and grouped now. How would you suggest
adding parsing information to that list (or do you?)

> From that, the next step is an audit of hcard-parsing to see if I'm missing
> any special element handling (like <input> for example) and derive parsing
> rules per element semantics accordingly and finish writing them up here:

--- when i created the list of elements, there were several that do not
seem to be covered: FRAME, SCRIPT, APPLET, CITE, INS/DEL and Q (they
both have a cite attribute). Some of those are easy answers. We only
have a brief description of TABLE semantics (tables also have a
SUMMARY attribute and the whole AXIS/HEADER/ID stuff). There were also
rules discussed about what it means if class="category" is on an
OL/UL: would that be one category per LI, or a single string of
the combined LI values?

<ol class="category">
  <li>foo</li>
  <li>bar</li>
</ol>

is that:
CATEGORIES:foobar
or
CATEGORIES:foo,bar

would the same apply to other properties such as TEL? or only plural
properties we singularized? or none? I couldn't find a reference but i
thought we did decide on something, so we should document our decision.
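
To make the two readings concrete, here is a short Python sketch that
produces both from the markup above. It does not say which reading is
correct; the helper names are made up for the example.

```python
from html.parser import HTMLParser


class ListItems(HTMLParser):
    """Collect the text of each <li> in the document, in order."""

    def __init__(self):
        super().__init__()
        self.items = []
        self.in_li = False

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self.in_li = True
            self.items.append("")

    def handle_endtag(self, tag):
        if tag == "li":
            self.in_li = False

    def handle_data(self, data):
        if self.in_li:
            self.items[-1] += data


def list_items(html):
    parser = ListItems()
    parser.feed(html)
    parser.close()
    return [item.strip() for item in parser.items]


HTML = '<ol class="category">\n  <li>foo</li>\n  <li>bar</li>\n</ol>'
items = list_items(HTML)

combined = "".join(items)    # reading 1: one value built from all the LI text
per_item = ",".join(items)   # reading 2: one category per LI

print("CATEGORIES:" + combined)
print("CATEGORIES:" + per_item)
```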

hCard page also mentions this:
http://microformats.org/wiki/hcard#Tags_as_Categories
When using rel-tag with categories, the parsing is different; this
isn't mentioned on the hcard-parsing page.


> With the practical experience of hCard and hCalendar parsing, I'll
> extract/abstract common bits and draft /wiki/compound-parsing as general
> rules for parsing compound microformats.
>
> How does that sound?

--- I agree it would be better to migrate this information to a
general "parsing" page than to continue to have a *-parsing page for each
format. I'm not sure an hCalendar-parsing page is needed. There is
plenty of common overlap, so i would prefer this approach of a generic
page, with any specific rules added to the *-parsing pages on a
format-by-format, as-needed basis.

-brian

-- 
brian suda
http://suda.co.uk

From scott at randomchaos.com  Mon Sep 17 15:11:19 2007
From: scott at randomchaos.com (Scott Reynen)
Date: Mon Sep 17 15:11:53 2007
Subject: [uf-dev] Re: [uf-discuss] Storing Microformats
In-Reply-To: <1f8270600709171244t59efe0bbk3e92c341ae8429a@mail.gmail.com>
References: <1f8270600709170837k266af6a8x9d8fde7629af4386@mail.gmail.com>
	<1f8270600709171244t59efe0bbk3e92c341ae8429a@mail.gmail.com>
Message-ID: <4669E3BD-BC58-498F-9DAB-291A486ABFBF@randomchaos.com>

On Sep 17, 2007, at 1:44 PM, Paul Kinlan wrote:

>  My question is: How are people storing
> the data present in microformats so that they can be searched and
> maintained and consumed in applications etc?

(Moved from the -discuss list.)

Back when I did a spider, my database schema looked like this:

CREATE TABLE IF NOT EXISTS `url` (
   `id` int(11) NOT NULL auto_increment,
   `url` text NOT NULL,
   `last_checked` datetime NOT NULL,
   PRIMARY KEY  (`id`)
);

CREATE TABLE IF NOT EXISTS `node` (
   `id` int(11) NOT NULL auto_increment,
   `parent_id` int(11) NOT NULL,
   `url_id` int(11) NOT NULL,
   `html` text NOT NULL,
   PRIMARY KEY  (`id`),
   KEY `parent_id` (`parent_id`),
   KEY `url_id` (`url_id`)
);

CREATE TABLE IF NOT EXISTS `node_property` (
   `id` int(11) NOT NULL auto_increment,
   `node_id` int(11) NOT NULL,
   `name` varchar(255) NOT NULL,
   `value` text NOT NULL,
   PRIMARY KEY  (`id`),
   KEY `node_id` (`node_id`)
);

So the "node" table was basically a DOM tree I used for parsing and  
the "node_property" table was where I put the parsed data for quick  
searching.  I don't know that this was necessarily the best way to do  
it, but I didn't run into any problems during the brief period it was  
running.
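
For anyone who wants to try the idea out, here is a rough Python sketch
using the standard sqlite3 module. The schema is an adaptation of the one
above to SQLite syntax (the original is MySQL), and store_node, find_urls,
the index names, the use of parent_id 0 for "no parent" and the sample row
are all assumptions for the example, not part of the original spider.

```python
import sqlite3

# SQLite adaptation of the three tables above; KEY clauses become indexes.
SCHEMA = """
CREATE TABLE IF NOT EXISTS url (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    url TEXT NOT NULL,
    last_checked TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS node (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    parent_id INTEGER NOT NULL,
    url_id INTEGER NOT NULL,
    html TEXT NOT NULL
);
CREATE INDEX IF NOT EXISTS node_parent_id ON node (parent_id);
CREATE INDEX IF NOT EXISTS node_url_id ON node (url_id);
CREATE TABLE IF NOT EXISTS node_property (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    node_id INTEGER NOT NULL,
    name TEXT NOT NULL,
    value TEXT NOT NULL
);
CREATE INDEX IF NOT EXISTS node_property_node_id ON node_property (node_id);
"""


def store_node(db, url, html, properties, parent_id=0):
    """Store one parsed node and its properties; returns the node id."""
    cur = db.execute(
        "INSERT INTO url (url, last_checked) VALUES (?, datetime('now'))", (url,))
    url_id = cur.lastrowid
    cur = db.execute(
        "INSERT INTO node (parent_id, url_id, html) VALUES (?, ?, ?)",
        (parent_id, url_id, html))
    node_id = cur.lastrowid
    db.executemany(
        "INSERT INTO node_property (node_id, name, value) VALUES (?, ?, ?)",
        [(node_id, name, value) for name, value in properties.items()])
    return node_id


def find_urls(db, name, value):
    """Return the URLs of pages holding a node with the given property value."""
    rows = db.execute(
        "SELECT url.url FROM node_property "
        "JOIN node ON node.id = node_property.node_id "
        "JOIN url ON url.id = node.url_id "
        "WHERE node_property.name = ? AND node_property.value = ?",
        (name, value))
    return [row[0] for row in rows]


db = sqlite3.connect(":memory:")
db.executescript(SCHEMA)
store_node(db, "http://example.org/",
           '<a href="http://example.org" class="fn url">John Doe</a>',
           {"fn": "John Doe", "url": "http://example.org"})
print(find_urls(db, "fn", "John Doe"))
```

Searching goes through node_property only, which matches the split
described above: the node table keeps the tree, the property table keeps
the parsed values.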

Peace,
Scott