From paul.kinlan at gmail.com Fri Sep 7 13:58:27 2007
From: paul.kinlan at gmail.com (Paul Kinlan)
Date: Fri Sep 7 13:58:30 2007
Subject: [uf-dev] Microformat Parser for .Net
Message-ID: <1f8270600709071358r70c12ad9i521c641b7fb02bcf@mail.gmail.com>

Hi all,

I am new to the development list, but I have been on the uf-discuss list for a while now. I thought that this list was the best place to announce that I have created a usable (although beta) release of a generic microformat parser for .Net.

The project can be found on CodePlex at http://www.codeplex.com/microformat. The current release is Iteration 3.

The parser is stream based and uses an application configuration (see below for an example) to define how the parser should parse the HTML/XML stream. This flexible configuration means that if the spec for a microformat changes, or a new microformat is introduced, no code needs to be changed in the framework for users of the framework to see the changed data.
The above configuration says that the following microformats are to be searched for: rel-tag, hCard and adr. Microformat configurations can also be nested (see the hCard spec, which allows an adr to be nested inside an hCard). This saves on duplicating configuration information. (Unfortunately, a circular reference can currently be defined in the configuration, and plurality of elements is not implemented. This will be fixed soon.)

Not all of the hCard spec is defined in this configuration; I kept it short to show how the config works. It also means that any parts of a microformat you are not interested in can be left out of the configuration, and you won't see them in the output of the framework.

I still have a lot of work to do; however, it appears (to me at least) to be quite flexible.

I would greatly appreciate any comments and feedback, and if you use the framework I would love to hear about it. If anyone is interested in joining the project, let me know.

Kind Regards,
Paul Kinlan

NB: The code is released under the Microsoft permissive licence; this licence fits best with the sgml reader code by Chris Lovett that is included in the project.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://microformats.org/discuss/mail/microformats-dev/attachments/20070907/d4cfd69f/attachment.html

From tantek at cs.stanford.edu Sat Sep 8 03:29:07 2007
From: tantek at cs.stanford.edu (Tantek Çelik)
Date: Sat Sep 8 03:28:28 2007
Subject: [uf-dev] Re: [uf-discuss] Microformats and title attribute parsing
In-Reply-To:
Message-ID:

Re-routing parsing issue to microformats-dev.

On 9/7/07 7:36 PM, "Mike Kaply" wrote:

> See:
>
> http://kidachi.kazuhi.to/blog/archives/002343.html
>
>> I was disappointed with his comment - he means that Operator won't catch
>> title attribute of
>> span element in hCalendar as far as the community doesn't get one concrete
>> conclusion.
>> (BTW Tails and Tails Export can find my hCalendar as I expected.)
>
> Is this true? Tails and Tails export find title on non abbr elements?

This is a longstanding bug in Tails. The title attribute has *never* semantically meant anything on anything but the <abbr> element in microformats parsing (e.g. see hCard parsing)[1]. Last time I checked, Tails incorrectly looks at the title attribute on all elements instead of just on <abbr> - I think at this point it is due more to lack of maintenance than anything else. An innocent mistake that was just never corrected.

> If this is the case, it would mean an old version of X2V does this
> since that's what they use....

X2V has properly supported <abbr title> (and not the title attribute in general) for quite some time. Tails must use its own implementation, or perhaps it uses a *really* old version of X2V?

Tantek

[1] http://microformats.org/wiki/hcard-parsing

From matt at daisyinteractive.com Mon Sep 10 12:59:29 2007
From: matt at daisyinteractive.com (Matt Warnock)
Date: Mon Sep 10 13:06:04 2007
Subject: [uf-dev] Multiple hCards on one page
Message-ID: <64db63234ef480ac8881f576514439c6@daisyinteractive.com>

Hello - This is my first time posting to a list like this, so please excuse the errors if there are any.

I have just gotten the microformats book and I am having a real problem getting this to work across several platforms. I am trying to get the "add to address book" link working so that users can add a particular address to their address book (Address Book.app, Outlook or Entourage).

I am using the:

http://feeds.technorati.com/contacts/(my absolute path here)

link because hopefully we will be able to get this working in production and might get a couple links a day, so I don't want to overload the Suda server.

Is it possible to have 2 hCards on one page? I searched and found a posting where someone used a target of #santa-monica vs. #boston, and they said that worked when you escaped the # with %23, but I haven't been able to get it going.
If someone has an answer, it would get the project moving forward again after 3 days of being off track.

My implementation is here: http://exhale.daisyinteractive.com/locations/

.......................................................
Matt Warnock
Daisy Interactive Inc.
830 S. Hill St. #850
Los Angeles, CA 90014
Tel.: 213-627-4990
Fax: 213-627-4080
matt@daisyinteractive.com
www.daisyinteractive.com

From ryan at technorati.com Tue Sep 11 10:57:10 2007
From: ryan at technorati.com (Ryan King)
Date: Tue Sep 11 10:57:17 2007
Subject: [uf-dev] Multiple hCards on one page
In-Reply-To: <64db63234ef480ac8881f576514439c6@daisyinteractive.com>
References: <64db63234ef480ac8881f576514439c6@daisyinteractive.com>
Message-ID:

On Sep 10, 2007, at 12:59 PM, Matt Warnock wrote:

> Hello - This is my first time posting to a list like this so please
> excuse the errors if there are any.
>
> I have just gotten the microformats book and I am having a real
> problem getting this to work across several platforms. I am trying
> to get the "add to address book" link working for users to add the
> particular address to their address book .app, outlook or entourage.
>
> I am using the:
>
> http://feeds.technorati.com/contacts/(my absolute path here)
>
> link because hopefully we will be able to get this working in
> production and might get a couple links a day so I don't want to
> overload the Suda server.
>
> Is it possible to have 2 hCards on one page? I searched and found
> a posting where someone used a target of #santa-monica vs. #boston
> and they said that worked when you escaped the # with %23, but I
> haven't been able to get it going.
>
> If someone has an answer it would get 3 days of being off track on
> the project back up and moving forward again.
>
> My implementation is here: http://exhale.daisyinteractive.com/
> locations/

You're using this:

Which means that when the service goes to extract http://exhale.daisyinteractive.com/locations/#new-york, it will only look at that node of the document. To reference a specific hCard, put the id on the root element of that hCard.

-ryan

From brian.suda at gmail.com Thu Sep 13 06:54:15 2007
From: brian.suda at gmail.com (Brian Suda)
Date: Thu Sep 13 06:54:18 2007
Subject: [uf-dev] Parsing By Element
Message-ID: <21e770780709130654g1d34dab1k36cc6ff984a2d009@mail.gmail.com>

Tantek and I were talking about how we should construct a list of elements and explain how they are being parsed. For example,

John Doe

We know that FN becomes "John Doe" and URL becomes http://example.org, but there is very little documentation (at least in one place) about how and when these rules are invoked.

I created a parsing page on the wiki (feel free to move it as needed):
http://microformats.org/wiki/parsing

It takes the W3C list of HTML elements and begins to map how and where to extract the values for various microformats properties. This is an early braindump, so feel free to re-work major portions of it.

Because several of the elements only exist in the HTML head element, I have noted them as (valid?). I'm not sure how, when or why you would add a microformats property to a STYLE element, or even whether some of these elements can take a @class attribute. Maybe they should be kept in the list, but noted in some fashion as not valid for MF usage.
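
To make "parsing by element" concrete, here is a rough Python sketch of the idea. It assumes the example markup was something like `<a class="fn url" href="http://example.org">John Doe</a>`, and it covers only two simplified rules: an <abbr> gives its title attribute, and a URL property on an <a> gives its href; everything else falls back to the element's text. The names `attribute_for`, `PropertyExtractor` and `extract` are made up for illustration and do not come from X2V, hKit, Operator or Tails.

```python
from html.parser import HTMLParser


def attribute_for(tag, prop):
    """Hypothetical rule table: which attribute (if any) carries the value."""
    if tag == "abbr":
        return "title"
    if tag == "a" and prop == "url":
        return "href"
    return None  # fall back to the element's text content


class PropertyExtractor(HTMLParser):
    """Collects property values from class names using per-element rules."""

    def __init__(self, properties):
        super().__init__()
        self.properties = set(properties)
        self.values = {}   # property name -> list of extracted values
        self._open = []    # stack of (tag, text-valued property names, text chunks)

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        text_props = []
        for name in (attrs.get("class") or "").split():
            if name not in self.properties:
                continue
            attr = attribute_for(tag, name)
            if attr and attrs.get(attr):
                self.values.setdefault(name, []).append(attrs[attr])
            else:
                text_props.append(name)
        self._open.append((tag, text_props, []))

    def handle_data(self, data):
        # Text belongs to every element that is currently open.
        for _, _, chunks in self._open:
            chunks.append(data)

    def handle_endtag(self, tag):
        if tag not in [open_tag for open_tag, _, _ in self._open]:
            return  # stray end tag, ignore it
        # Pop back to the matching start tag; elements left unclosed
        # (e.g. <br>) are dropped along the way.
        while self._open:
            open_tag, text_props, chunks = self._open.pop()
            if open_tag == tag:
                text = " ".join("".join(chunks).split())
                for name in text_props:
                    self.values.setdefault(name, []).append(text)
                break


def extract(html, properties):
    parser = PropertyExtractor(properties)
    parser.feed(html)
    parser.close()
    return parser.values


card = extract('<a class="fn url" href="http://example.org">John Doe</a>',
               ("fn", "url"))
print(card["fn"], card["url"])
```

With these rules a title attribute on a <span> is ignored and the element's text is used instead, which is the behaviour described for <abbr> earlier in this archive.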
-brian

--
brian suda
http://suda.co.uk

From tantek at cs.stanford.edu Thu Sep 13 11:57:58 2007
From: tantek at cs.stanford.edu (Tantek Çelik)
Date: Thu Sep 13 11:57:21 2007
Subject: [uf-dev] Parsing By Element
In-Reply-To: <21e770780709130654g1d34dab1k36cc6ff984a2d009@mail.gmail.com>
Message-ID:

On 9/13/07 6:54 AM, "Brian Suda" wrote:

> Tantek and I were talking about how we should construct a list of
> elements and explain how they are being parsed. For example,
>
> John Doe
>
> We know that FN becomes "John Doe" and URL becomes http://example.org
> but there is very little documentation (at least in one place) about
> how and when these rules are invoked.

Actually, there is quite a bit of documentation about this, and it is in *only* one place currently:

http://microformats.org/wiki/hcard-parsing

> I created a parsing page on the wiki (feel free to move it as needed)
> http://microformats.org/wiki/parsing
>
> It takes the W3C list of HTML elements and begins to map how and where
> to extract the values for various microformats properties.

I saw that; I'm not sure that a raw element list is the right way to start that.

I've been trying to complete the *semantic* element and attribute lists, as well as group them into logical sets for mnemonic purposes, here:

http://microformats.org/wiki/semantic-xhtml

From that, the next step is an audit of hcard-parsing to see if I'm missing any special element handling (like for example) and derive parsing rules per element semantics accordingly and finish writing them up here:

Then let's take a look at the open source implementations (X2V, hKit, Operator) and determine if it is fairly straightforward to add any additional element-specific semantic handling - I expect that implementation updates should be fairly trivial for a few special cases.

Simultaneously, test cases which exercise the new parsing cases will help as well.
Once we've gotten all that working for hCard, then I'll draft up /wiki/hcalendar-parsing accordingly as well.

With the practical experience of hCard and hCalendar parsing, I'll extract/abstract common bits and draft /wiki/compound-parsing as general rules for parsing compound microformats.

How does that sound?

Tantek

From connolly at w3.org Thu Sep 13 13:17:35 2007
From: connolly at w3.org (Dan Connolly)
Date: Thu Sep 13 13:17:57 2007
Subject: [uf-dev] Parsing By Element
In-Reply-To:
References:
Message-ID: <1189714655.1829.628.camel@pav>

On Thu, 2007-09-13 at 11:57 -0700, Tantek Çelik wrote:
[...]
> Actually, there is quite a bit of documentation about this, and it is
> in
> *only* one place currently:
>
> http://microformats.org/wiki/hcard-parsing

Yeah, that's where I thought this stuff lived.

[...]
> How does that sound?

Good, at a glance.

--
Dan Connolly, W3C http://www.w3.org/People/Connolly/

From ryan at technorati.com Thu Sep 13 15:15:18 2007
From: ryan at technorati.com (Ryan King)
Date: Thu Sep 13 15:15:25 2007
Subject: [uf-dev] Parsing By Element
In-Reply-To:
References:
Message-ID: <9BD1C3AC-D241-46DF-A7E9-9F01A1822DCB@technorati.com>

On Sep 13, 2007, at 11:57 AM, Tantek Çelik wrote:

> On 9/13/07 6:54 AM, "Brian Suda" wrote:
>
>> Tantek and I were talking about how we should construct a list of
>> elements and explain how they are being parsed. For example,
>>
>> John Doe
>>
>> We know that FN becomes "John Doe" and URL becomes http://example.org
>> but there is very little documentation (at least in one place) about
>> how and when these rules are invoked.
>
> Actually, there is quite a bit of documentation about this, and it
> is in
> *only* one place currently:
>
> http://microformats.org/wiki/hcard-parsing

Note that I have a good deal of stuff in a presentation I gave earlier this year:

http://theryanking.com/presentations/2007/www2007-microformats-parsing/

I should write it up more formally.
-ryan

From brian.suda at gmail.com Fri Sep 14 03:29:34 2007
From: brian.suda at gmail.com (Brian Suda)
Date: Fri Sep 14 03:29:38 2007
Subject: [uf-dev] Parsing By Element
In-Reply-To:
References: <21e770780709130654g1d34dab1k36cc6ff984a2d009@mail.gmail.com>
Message-ID: <21e770780709140329n7b716790n376ef7b7e980e31f@mail.gmail.com>

On 9/13/07, Tantek Çelik wrote:
> > It takes the W3C list of HTML elements and begins to map how and where
> > to extract the values for various microformats properties.
>
> I saw that, I'm not sure that a raw element list is the right way to start
> that.

--- My understanding was that we wanted a full list of elements and their parsing rules. If we want to re-order or group them by semantics, that is fine. It is a wiki, so I'll let someone else take the existing data and re-order/remove/tweak it as needed.

> I've been trying to complete the *semantic* element and attribute lists as
> well as group them into logical sets for mnemonic purposes here:
>
> http://microformats.org/wiki/semantic-xhtml

--- Last time I looked at that page, it was a list of a handful of elements. It seems much better and grouped now. How would you suggest adding parsing information to that list (or do you)?

> From that, the next step is an audit of hcard-parsing to see if I'm missing
> any special element handling (like for example) and derive parsing
> rules per element semantics accordingly and finish writing them up here:

--- When I created the list of elements, I found several that do not seem to be covered: FRAME, SCRIPT, APPLET, CITE, INS/DEL and Q (both have a cite attribute). Some of those are easy answers. We also have only a brief description of TABLE semantics (tables also have a SUMMARY attribute and the whole AXIS/HEADERS/ID stuff).

There were also rules discussed about what it means if class="category" is on an OL/UL: would that be one category per LI, or a single string of the combined LI values?
  1. foo
  2. bar
Is that:

CATEGORIES:foobar

or

CATEGORIES:foo,bar

Would the same apply to other properties such as TEL? Or only to plural properties that we singularized? Or to none? I couldn't find a reference, but I thought we did decide on something, so we should document our decision.

The hCard page also mentions this:
http://microformats.org/wiki/hcard#Tags_as_Categories

When using rel-tag with categories, the parsing is different; this isn't mentioned on the hcard-parsing page.

> With the practical experience of hCard and hCalendar parsing, I'll
> extract/abstract common bits and draft /wiki/compound-parsing as general
> rules for parsing compound microformats.
>
> How does that sound?

--- I agree it would be better to migrate this information to a general "parsing" page than to continue to have a *-parsing page for each format. I'm not sure an hCalendar-parsing page is needed. There is plenty of overlap, so I would prefer a generic page, with any format-specific rules added to the *-parsing pages only where a format needs them.

-brian

--
brian suda
http://suda.co.uk

From scott at randomchaos.com Mon Sep 17 15:11:19 2007
From: scott at randomchaos.com (Scott Reynen)
Date: Mon Sep 17 15:11:53 2007
Subject: [uf-dev] Re: [uf-discuss] Storing Microformats
In-Reply-To: <1f8270600709171244t59efe0bbk3e92c341ae8429a@mail.gmail.com>
References: <1f8270600709170837k266af6a8x9d8fde7629af4386@mail.gmail.com> <1f8270600709171244t59efe0bbk3e92c341ae8429a@mail.gmail.com>
Message-ID: <4669E3BD-BC58-498F-9DAB-291A486ABFBF@randomchaos.com>

On Sep 17, 2007, at 1:44 PM, Paul Kinlan wrote:

> My question is: How are people storing
> the data present in microformats so that they can be searched and
> maintained and consumed in applications etc?

(Moved from the -discuss list.)
Back when I did a spider, my database schema looked like this:

CREATE TABLE IF NOT EXISTS `url` (
  `id` int(11) NOT NULL auto_increment,
  `url` text NOT NULL,
  `last_checked` datetime NOT NULL,
  PRIMARY KEY (`id`)
);

CREATE TABLE IF NOT EXISTS `node` (
  `id` int(11) NOT NULL auto_increment,
  `parent_id` int(11) NOT NULL,
  `url_id` int(11) NOT NULL,
  `html` text NOT NULL,
  PRIMARY KEY (`id`),
  KEY `parent_id` (`parent_id`),
  KEY `url_id` (`url_id`)
);

CREATE TABLE IF NOT EXISTS `node_property` (
  `id` int(11) NOT NULL auto_increment,
  `node_id` int(11) NOT NULL,
  `name` varchar(255) NOT NULL,
  `value` text NOT NULL,
  PRIMARY KEY (`id`),
  KEY `node_id` (`node_id`)
);

So the "node" table was basically a DOM tree I used for parsing, and the "node_property" table was where I put the parsed data for quick searching. I don't know that this was necessarily the best way to do it, but I didn't run into any problems during the brief period it was running.

Peace,
Scott
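
For anyone who wants to try the idea, here is a minimal Python sketch of how tables like these might be filled and searched. It is only an illustration under several assumptions: it uses SQLite from the standard library instead of MySQL (so the column types and auto-increment syntax differ from the schema above), it treats parent_id 0 as "no parent", and the helper names `open_store`, `add_url`, `store_node` and `find_pages` as well as the sample rows are invented here.

```python
import sqlite3

# SQLite translation of the three tables; the original schema is MySQL.
SCHEMA = """
CREATE TABLE IF NOT EXISTS url (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    url TEXT NOT NULL,
    last_checked TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS node (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    parent_id INTEGER NOT NULL,
    url_id INTEGER NOT NULL,
    html TEXT NOT NULL
);
CREATE INDEX IF NOT EXISTS node_parent_id ON node (parent_id);
CREATE INDEX IF NOT EXISTS node_url_id ON node (url_id);
CREATE TABLE IF NOT EXISTS node_property (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    node_id INTEGER NOT NULL,
    name TEXT NOT NULL,
    value TEXT NOT NULL
);
CREATE INDEX IF NOT EXISTS node_property_node_id ON node_property (node_id);
"""


def open_store(path=":memory:"):
    """Open (or create) the database and make sure the tables exist."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def add_url(conn, page_url):
    """Record a spidered page and return its id."""
    cur = conn.execute(
        "INSERT INTO url (url, last_checked) VALUES (?, datetime('now'))",
        (page_url,),
    )
    conn.commit()
    return cur.lastrowid


def store_node(conn, url_id, html, properties, parent_id=0):
    """Store one DOM node plus its parsed (name, value) pairs; return the node id."""
    cur = conn.execute(
        "INSERT INTO node (parent_id, url_id, html) VALUES (?, ?, ?)",
        (parent_id, url_id, html),
    )
    node_id = cur.lastrowid
    conn.executemany(
        "INSERT INTO node_property (node_id, name, value) VALUES (?, ?, ?)",
        [(node_id, name, value) for name, value in properties],
    )
    conn.commit()
    return node_id


def find_pages(conn, name, value):
    """Return the URLs of pages that contain a node with the given property."""
    rows = conn.execute(
        "SELECT DISTINCT url.url FROM node_property "
        "JOIN node ON node.id = node_property.node_id "
        "JOIN url ON url.id = node.url_id "
        "WHERE node_property.name = ? AND node_property.value = ? "
        "ORDER BY url.url",
        (name, value),
    )
    return [row[0] for row in rows]


conn = open_store()
page = add_url(conn, "http://example.org/contact")
store_node(conn, page, '<div class="vcard">...</div>',
           [("fn", "John Doe"), ("url", "http://example.org")])
print(find_pages(conn, "fn", "John Doe"))
```

Keeping the parsed values in a flat name/value table means a search never has to touch the stored HTML; the cost is that nested structure (an adr inside an hCard, say) has to be recovered through parent_id.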