[uf-dev] Parsing XFN in PHP

Mark Ng mark at markng.me.uk
Fri Apr 11 04:36:03 PDT 2008


$html = tidy_repair_string($html, array('output-xhtml' => true,
'numeric-entities' => true));

was what I was using. Does it work for you?
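
For reference, a rough sketch of how those options might slot into the PHP 5 route, feeding Tidy's output to DOMDocument::loadXML() as a stand-in for PHP 4's domxml_open_mem(). It assumes the tidy extension is loaded. The 'doctype' => 'omit' setting is an extra assumption aimed at the Public Identifier / SystemLiteral errors, not something confirmed in this thread, and html_to_dom() is a made-up name.

```php
<?php
// Hypothetical helper: clean arbitrary HTML with Tidy, then parse it as XML.
// Returns a DOMDocument, or null if Tidy is missing or parsing fails.
function html_to_dom($html)
{
    if (!function_exists('tidy_repair_string')) {
        return null; // tidy extension not available
    }
    $config = array(
        'output-xhtml'     => true,   // emit well-formed XHTML
        'numeric-entities' => true,   // &nbsp; becomes &#160;
        'doctype'          => 'omit', // assumption: drop the DOCTYPE entirely
    );
    $xhtml = tidy_repair_string($html, $config, 'utf8');
    $dom = new DOMDocument();
    return ($xhtml !== false && @$dom->loadXML($xhtml)) ? $dom : null;
}

// Deliberately sloppy sample markup, for illustration only.
$dom = html_to_dom('<p>a&nbsp;b <a rel=me href="http://example.com/">me</a>');
if ($dom !== null) {
    foreach ($dom->getElementsByTagName('a') as $node) {
        echo $node->getAttribute('href'), "\n";
    }
}
```

Whether this is enough for arbitrary pages is untested; it only targets the specific errors quoted below.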

Mark

On 11/04/2008, Julian Bond <julian_bond at voidstar.com> wrote:
> Continuing a thread that started on the Discuss list.
>
>  My experiments have led me to two approaches, depending on the PHP release.
>  First, PHP 5, with error handling left as an exercise for the reader:
>
>
>  $url = 'http://ciaranmcnulty.com/';
>  if($html = @file_get_contents($url)){
>   $dom = new DomDocument();
>   if(@$dom->loadHtml($html)){
>
>     if ($nodes = $dom->getElementsByTagName('a')) {
>       foreach($nodes as $node){
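>         // rel is a space-separated list, so match 'me' as one of its tokens
>         if (in_array('me', preg_split('/\s+/', trim($node->getAttribute('rel'))))) {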
>           echo $node->getAttribute('href');
>         }
>       }
>     }
>   }
>  }
>
>  Pretty easy, huh? Clearly the same approach could be used for other
>  values of rel=, and it's probably not too hard to extend it to
>  find hCard and other uFs.
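
One way to generalise the PHP 5 loop to other rel values is sketched below. It assumes rel is treated as a space-separated, case-insensitive token list, so rel="me nofollow" still matches and rel="met" does not match 'me'. The function name links_with_rel() and the inline sample HTML are made up for illustration.

```php
<?php
// Hypothetical helper (not from the thread): return the href of every
// <a> element whose rel attribute contains the token $rel.
function links_with_rel($html, $rel)
{
    $dom = new DOMDocument();
    // @ suppresses the warnings libxml emits for sloppy real-world HTML.
    if (!@$dom->loadHTML($html)) {
        return array();
    }
    $hrefs = array();
    foreach ($dom->getElementsByTagName('a') as $node) {
        // rel holds space-separated values, so split before comparing.
        $tokens = preg_split('/\s+/', strtolower(trim($node->getAttribute('rel'))));
        if (in_array(strtolower($rel), $tokens, true)) {
            $hrefs[] = $node->getAttribute('href');
        }
    }
    return $hrefs;
}

// Sample markup, assumed for illustration only.
$html = '<p><a rel="me" href="http://example.com/a">A</a>'
      . '<a rel="friend met" href="http://example.com/b">B</a>'
      . '<a rel="me nofollow" href="http://example.com/c">C</a></p>';

print_r(links_with_rel($html, 'me'));     // the "a" and "c" links
print_r(links_with_rel($html, 'friend')); // the "b" link
```

Splitting into tokens rather than using a substring test is what keeps 'me' from matching 'met'.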
>
>  loadHtml() doesn't exist in php4 dom-xml. In theory it should be
>  possible to use HTML-Tidy tidy_repair_string to clean the html first and
>  then feed it to domxml_open_mem. In practice, I'm having real trouble
>  getting the right collection of tidy_repair_string configuration
>  parameters to generate clean enough XML for dom to accept it. If that
>  can be done, then this should work.
>
>
>  $url = 'http://ciaranmcnulty.com/';
>  if($html = @file_get_contents($url)){
>
>   $html = @tidy_repair_string($html);
>   if ($dom = @domxml_open_mem($html)) {
>     if ($nodes = $dom->get_elements_by_tagname('a')) {
>       foreach($nodes as $node){
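>         // rel is a space-separated list, so match 'me' as one of its tokens
>         if (in_array('me', preg_split('/\s+/', trim($node->get_attribute('rel'))))) {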
>           echo $node->get_attribute('href');
>         }
>       }
>     }
>   }
>  }
>
>  Typical errors are things like:-
>  - Space required after the Public Identifier
>  - SystemLiteral " or ' expected
>  - xmlParseExternalID: PUBLIC, no URI in
>  - invalid entity nbsp
>  It may be possible to configure Tidy so that its output avoids all of
>  these, but I haven't managed it yet. I had a look at hkit, but that makes
>  no attempt to configure the Tidy module, so I'd expect lots of problems
>  when trying to parse arbitrary web pages.
>
>
>  --
>  Julian Bond  E&MSN: julian_bond at voidstar.com  M: +44 (0)77 5907 2173
>  Webmaster:          http://www.ecademy.com/      T: +44 (0)192 0412 433
>  Personal WebLog:    http://www.voidstar.com/     skype:julian.bond?chat
>                            Tastes Like Milk
>  _______________________________________________
>
> microformats-dev mailing list
>  microformats-dev at microformats.org
>  http://microformats.org/mailman/listinfo/microformats-dev
>

