Posted 01/12/2014, 06:01
lauser
Moderator Unix/Linux
 
Join date: July 2013
Location: Odessa (Ukraine)
Posts: 3,278
Member for: 11 years, 4 months
Points: 401
Reply: Creating a "robot"

You should know that spiders are not well regarded by webmasters, since their job is to crawl, capture, and even scrape documents, text, images, and so on. But since you are so keen, here is a simple example that fetches the documentation pages from php.net.
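As a side note, if you want your spider to be better behaved toward webmasters, phpcrawl can honor a site's robots.txt rules via obeyRobotsTxt(). A minimal sketch, assuming the same library layout as the example below (the PoliteCrawler class name is just for illustration):

```php
<?php
// Sketch: assumes phpcrawl is available under libs/ as in the example below.
include("libs/PHPCrawler.class.php");

class PoliteCrawler extends PHPCrawler
{
  function handleDocumentInfo($DocInfo)
  {
    // Just log what was requested; real processing would go here.
    echo "Requested: ".$DocInfo->url."\n";
  }
}

$crawler = new PoliteCrawler();
$crawler->setURL("www.php.net");

// Respect the site's robots.txt before requesting anything
$crawler->obeyRobotsTxt(true);

$crawler->go();
?>
```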

PHP code:
<?php

// It may take a while to crawl a site ...

// Include the phpcrawl main class
include("libs/PHPCrawler.class.php");

// Extend the class and override the handleDocumentInfo() method
class MyCrawler extends PHPCrawler
{
  function handleDocumentInfo($DocInfo)
  {
    // Just detect the linebreak for output ("\n" in CLI mode, otherwise "<br />")
    if (PHP_SAPI == "cli") $lb = "\n";
    else $lb = "<br />";

    // Print the URL and the HTTP status code
    echo "Page requested: ".$DocInfo->url." (".$DocInfo->http_status_code.")".$lb;

    // Print the referring URL
    echo "Referer-page: ".$DocInfo->referer_url.$lb;

    // Print whether the content of the document was received or not
    if ($DocInfo->received == true)
      echo "Content received: ".$DocInfo->bytes_received." bytes".$lb;
    else
      echo "Content not received".$lb;

    // Here you would normally do something with the content of the
    // received page or file ($DocInfo->source); we skip that in this example

    echo $lb;

    flush();
  }
}

// Now create an instance of your class, define the behaviour
// of the crawler (see the class reference for more options and details)
// and start the crawling process.

$crawler = new MyCrawler();

// URL to crawl
$crawler->setURL("www.php.net");

// Only receive content of files with content-type "text/html"
$crawler->addContentTypeReceiveRule("#text/html#");

// Ignore links to pictures, don't even request pictures
$crawler->addURLFilterRule("#\.(jpg|jpeg|gif|png)$#i");

// Store and send cookie data like a browser does
$crawler->enableCookieHandling(true);

// Set the traffic limit to 1 MB (in bytes;
// for testing we don't want to "suck" in the whole site)
$crawler->setTrafficLimit(1000 * 1024);

// That's enough, here we go
$crawler->go();

// At the end, after the process has finished, print a short
// report (see the getProcessReport() method for more information)
$report = $crawler->getProcessReport();

if (PHP_SAPI == "cli") $lb = "\n";
else $lb = "<br />";

echo "Summary:".$lb;
echo "Links followed: ".$report->links_followed.$lb;
echo "Documents received: ".$report->files_received.$lb;
echo "Bytes received: ".$report->bytes_received." bytes".$lb;
echo "Process runtime: ".$report->process_runtime." sec".$lb;
?>
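If the URL-filter pattern looks cryptic: it is an ordinary PCRE (the # characters are the delimiters, and the i modifier, which must follow the closing delimiter directly, makes it case-insensitive). You can check what it matches with plain preg_match, no crawler library needed:

```php
<?php
// Stand-alone check of the image-filter pattern (pure PHP).
$pattern = "#\.(jpg|jpeg|gif|png)$#i";

var_dump(preg_match($pattern, "http://example.com/logo.PNG"));      // int(1): matches, would be filtered out
var_dump(preg_match($pattern, "http://example.com/docs/faq.html")); // int(0): no match, would be crawled
?>
```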
__________________
The users who answer you do so altruistically and with no profit motive, for the sole purpose of helping you. Be patient and grateful.
-SOLOLINUX-