Microformalyze (was: Playlists and Albums (was: Re: [uf-new] item property))

Martin McEvoy martin at weborganics.co.uk
Fri Oct 19 14:15:30 PDT 2007


On Fri, 2007-10-19 at 10:57 -0400, Manu Sporny wrote:
> Martin McEvoy wrote:
> >>> I really don't think that we can have a clear idea of what hAudio is
> >>> until our examples are re-studied without the use of a program.
> > 
> > Because it is my opinion that the data output of your application is not
> > to be relied upon
> 
> I don't want this to become a nasty discussion, 

Now you are confusing me. Is this a nasty discussion because I ask
questions?

> Martin. I realize that
> you have questions about Microformalyze and I am attempting to answer them.
> 
> I believe the tone of this discussion is a bit off... right now, it
> sounds like you're alluding to the notion that there has been some sort
> of "nefarious behavior" when gathering data for hAudio, 

I am not saying that there is any sort of sinister behavior going on at
all. I am pointing out that the data that Microformalyze outputs (in the
terminal) is not to be trusted.

> or that the data
> we have is not dependable. I realize that my responses could have been
> less inflammatory and more explanatory.
> 
> I am going to attempt to explain how Microformalyze works in more
> detail.
> 
> >> Why do you think this approach is going to help us?
> > 
> > Why do you think that the Microformalyze approach is going to help us?
> > Do you not think that Hand and Eye is a better approach?
> 
> Microformalyze is a "Hand and Eye" approach... there is no automation to
> the "analyzing a web page" part of the tool.
> 

...

> It saves us from having to tally statistics by hand. It is also
> far more accurate to have a machine tally the results and statistics.
> 

...

> Before we were using Microformalyze there were several errors when
> calculating the statistics that I made. It is difficult to go through 48
> examples and over 1,000 properties by hand, calculate statistics, and
> not expect some human error.
> 
> Here's how we used to gather examples for hAudio:
> 
> 1. Open up the hAudio Wiki.
> 2. Copy/Paste one example URL into a different tab in the web browser.
> 3. Copy/Paste the hAudio example template that had all of the properties
>    into the correct part of the wiki page.
> 4. Flip between the hAudio Wiki tab and the example URL page, adding or
>    deleting properties from hAudio.
> 5. Repeat this process 84 times (each page took around 20 minutes to
>    analyze).
> 
> Here's how it works with Microformalyze:
> 
> 1. Open up Microformalyze
> 2. Click "Add URL" to add URLs that need to be analyzed.
> 3. Click "Add property" to add properties that you expect to see (this
>    can also be done while you're analyzing the pages)
> 4. Once all of the URLs that need to be analyzed have been added, you
>    click the "Next URL" button.
> 5. Microformalyze displays the URL in a web browser and you click
>    checkboxes to specify what properties exist on the example URL page.

So I tell the application what properties exist on a given page, and it
confirms whether this is true or not?

>    This small change to the process cuts down the time to analyze a
>    page greatly... mainly because you're not editing wiki text, you're
>    just clicking a checkbox.
> 6. Repeat this process 84 times (each page took around 5 minutes to
>    analyze).
> 
> The old way of doing things took around 20 minutes per website. The
> Microformalyze way of doing things takes around 5 minutes per website.
> 
> Now let's examine how we calculated statistics before:
> 
> Here's how we did it via the Wiki:
> 
> Every time a new property was created, I would have to go through and
> tally the results by hand. This was error prone and on more than one
> occasion, I had to wipe everything and start over. It also required me
> to triple-check my work to make sure I was reporting the correct
> statistics to the list. I spent hours doing this - just calculating
> statistics. There is a reason not many people help out with gathering
> examples and calculating statistics - it is tedious and excruciatingly
> time consuming.
> 
> Here's how it was done using Microformalyze:
> 
> You click a button and the statistics are automatically calculated for
> you. You click another button and it dumps the wiki formatted text for
> displaying the properties. It is no longer time consuming or error prone
> to do this!
> 
> However, the most important aspect of Microformalyze is that ANYBODY can
> go back and validate our findings easily. The data files are there,
> there is a common namespace across all properties/websites, in other
> words: there is a verifiable paper trail.
> 
> It is important to point out that this does not exist for any other
> Microformat that I know about. Verifiability of analysis results is very
> important! Reducing human error in statistics calculations is very
> important! Microformalyze builds this into the examples gathering and
> statistics calculation process.
> 
> >> that helps the user track the properties on each page. It can
> >> automatically calculate statistics and helped the process of analysis
> >> immensely.
> > 
> > This is my concern: *HOW* does Microformalyze do this?
> > 
> > Microformalyze has all the power of a high-profile search engine that
> > can output the relevance of a given keyword in order and frequency of
> > occurrence, correct?
> 
> No, absolutely not. This is the core of your misunderstanding of what
> Microformalyze does. There is no "search engine" or "keyword matching"
> technology in Microformalyze. That would be a horrible way to go about
> gathering examples.
> 
> All Microformalyze does is automate the tedious and error-prone parts of
> the examples and statistics gathering portion of the Microformats
> process. It also adds verifiability - which is really its most
> important contribution to the process.

Sorry, my friend, I don't think I was being very clear.

*HOW* does Microformalyze do this?

What is a property?
How is a property determined?
Does Microformalyze analyze the raw HTML to determine the existence of
these properties? Does it look for actual output on a web page?

How does it gather statistics?
How are they compared? Are they compared against other URLs loaded
into Microformalyze, or does it calculate the occurrence of a "property"
on a page, or some other way?

> 
> If you would like to see a detailed tutorial on how it works, the
> tutorial is available here:
> 
> http://wiki.digitalbazaar.com/en/Microformalyze#Tutorial

Thanks for the tutorial, but "How do I use Microformalyze?" was not the
question.

> 
> I'd be happy to answer any other questions or concerns that you have
> about Microformalyze. Like I said before, all of the data files, source
> code (which I placed under the GPL), and documentation are available via
> the website listed above. You don't have to take my word for it... you
> could read the code, look at the data and see for yourself.

I have had a look at the code, but Python is not my strong point. Perhaps
you might like to explain?

I did a test. The "properties" I was looking for were Baba and Flumps
(because there is a good chance that these properties will NOT exist in
any of the pages I'm likely to test).

Here is the test file (copy and paste if you like):

property	Baba	The Elephant
property	Flumps	A sweetie
url	Bazaar	http://blog.digitalbazaar.com/
properties	Baba	Flumps

Sorry to use your URL, but it was the first thing that sprang to mind :)

I ticked both boxes in the GUI, Baba and Flumps, and output the data in
the terminal:

Baba                               : 100.00%
Flumps                             : 100.00%


I looked at your page thinking, "Huh, how can that be correct?"

In the web page text there is no mention of the words Baba or Flumps.

I looked at the source code. No mention there either.

Does Microformalyze determine the existence of these properties in
another way?

I added another URL to examine:

property	Baba	The Elephant
property	Flumps	A sweetie
url	Bazaar	http://blog.digitalbazaar.com/
properties	Baba	Flumps
url	no foo in this	http://weborganics.co.uk/
properties	Baba


The output data after adding the second URL:

Baba                               : 100.00%
Flumps                             : 50.00%

I KNOW these two properties do not exist in any way at WebOrganics.

Can you see now WHY I am concerned and moderately confused?
Microformalyze does not seem to be calculating the existence of these
properties on a page; it seems to be just calculating whether I have
ticked a box or not.
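
For illustration, here is a minimal sketch of a tally that would produce
exactly these numbers. It is not taken from the Microformalyze source; the
function name `tally` and the parsing rules are assumptions based only on
the tab-separated test file above. It assumes the percentage for a property
is simply the share of URLs whose "properties" line lists it, without ever
fetching or inspecting the pages themselves.

```python
from collections import Counter

# The second test file from above, tab-separated, reproduced verbatim.
DATA = (
    "property\tBaba\tThe Elephant\n"
    "property\tFlumps\tA sweetie\n"
    "url\tBazaar\thttp://blog.digitalbazaar.com/\n"
    "properties\tBaba\tFlumps\n"
    "url\tno foo in this\thttp://weborganics.co.uk/\n"
    "properties\tBaba\n"
)


def tally(text):
    """Return {property: percent of URLs on which it was ticked}.

    Hypothetical helper: it counts check marks recorded in the file
    and never looks at the content of any web page.
    """
    declared = []        # property names, in declaration order
    ticked = Counter()   # property name -> number of URLs it was ticked on
    url_count = 0
    for line in text.splitlines():
        fields = line.split("\t")
        kind, values = fields[0], fields[1:]
        if kind == "property" and values:
            declared.append(values[0])
        elif kind == "url":
            url_count += 1
        elif kind == "properties":
            # set() guards against a name listed twice for one URL
            ticked.update(set(values))
    if url_count == 0:
        return {name: 0.0 for name in declared}
    return {name: 100.0 * ticked[name] / url_count for name in declared}


for name, percent in tally(DATA).items():
    print(f"{name:<35}: {percent:.2f}%")
```

Under that assumption, Baba comes out at 100.00% (ticked on 2 of 2 URLs)
and Flumps at 50.00% (ticked on 1 of 2), matching the terminal output
regardless of what the pages actually contain.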


Am I missing something?


Thanks

Martin
 
> 
> -- manu