PHP code:
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="utf-8" />
    <title>Smashing HTML5!</title>
    <link rel="stylesheet" href="assets/main.css" type="text/css" />
</head>
<body>
<div id="container">
    <header></header>
    <section id="body">
    <?php
    require('class.bot.php');
    $var = new Bot();
    // Open the wrapper, print the scraped paragraph, then close the wrapper.
    echo '<div class="master">';
    echo $var->Scrape('Steve_Jobs');
    echo '</div>';
    ?>
    </section>
    <footer></footer>
</div>
</body>
</html>
PHP code:
<?php
require('phpQuery.php'); // phpQuery does the actual DOM querying
require('class.cleanTags.php'); // CleanTags does the cleaning :P

class Bot {
    public $input;
    public $output;

    // Intended to return the first paragraph of the Wikipedia article named $string.
    public function Scrape($string){
        $this->input = $string;
        $url = 'http://en.wikipedia.org/wiki/'.$this->input;
        phpQuery::newDocumentFileHTML($url);
        $myHTML = pq('div#bodyContent');
        $this->output = $myHTML->html();
        // Remove every element except <p>, <b> and <a>, then strip the remaining tags.
        $newObj = new CleanTags();
        $this->output = $newObj->cleanInsideTags($this->output, '<p>,<b>,<a>');
        $this->output = strip_tags($this->output);
        $this->output = trim($this->output);
        $this->output = str_replace(';', "", $this->output);
        // Turn three consecutive line breaks into a </div> marker and keep only
        // the text that comes before the first marker.
        $this->output = nl2br($this->output);
        $this->output = str_replace("<br />\n<br />\n<br />", "</div>\n<br />", $this->output);
        $this->output = strip_tags($this->output, '<div>');
        $this->output = explode("</div>", $this->output);
        return $this->output[0];
    }
}
?>
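As an alternative to the chain of string replacements in Scrape, the first paragraph could also be picked out with PHP's built-in DOM extension. This is only a rough sketch, assuming the page HTML is already available as a string; the function name `firstParagraph` and the sample markup are made up for illustration and are not part of the original code.

```php
<?php
// Hypothetical helper (not part of the original classes): returns the text of the
// first non-empty <p> inside the <div> with the given id, or '' if none is found.
function firstParagraph(string $html, string $containerId = 'bodyContent'): string {
    $doc = new DOMDocument();
    // Collect libxml warnings about imperfect real-world markup instead of printing them.
    libxml_use_internal_errors(true);
    // The XML prolog is a common trick to make loadHTML treat the input as UTF-8.
    $doc->loadHTML('<?xml encoding="utf-8" ?>' . $html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $paragraphs = $xpath->query('//div[@id="' . $containerId . '"]//p');
    foreach ($paragraphs as $p) {
        $text = trim($p->textContent);
        if ($text !== '') {
            return $text; // first paragraph that actually contains text
        }
    }
    return '';
}

// Made-up sample: a note, a table and an empty paragraph precede the real text.
$sample = '<div id="bodyContent"><div class="hatnote">See also X</div>'
        . '<table><tr><td>infobox</td></tr></table>'
        . '<p> </p><p><b>Food</b> is any substance consumed for <a href="#">nutrition</a>.</p>'
        . '<p>Second paragraph.</p></div>';
echo firstParagraph($sample), "\n";
```

Because the paragraphs are selected by element rather than by counting line breaks, anything that sits before the first real paragraph (notes, tables, empty paragraphs) is skipped instead of being treated as the first chunk.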
PHP code:
<?php
class CleanTags {
    public $text;
    public $tags;

    // Removes whole elements (tags plus their content) from $text.
    // $tags lists tag names, e.g. '<p>,<b>,<a>'. With $invert FALSE every element
    // except the listed ones is removed; with $invert TRUE only the listed ones are.
    public function cleanInsideTags($text, $tags = '', $invert = FALSE){
        $this->text = $text;
        $this->tags = $tags;
        // Extract the bare tag names from the list.
        preg_match_all('/<(.+?)[\s]*\/?[\s]*>/si', trim($this->tags), $this->tags);
        $this->tags = array_unique($this->tags[1]);
        if(is_array($this->tags) && count($this->tags) > 0) {
            if($invert == FALSE) {
                return preg_replace('@<(?!(?:'. implode('|', $this->tags) .')\b)(\w+)\b.*?>.*?</\1>@si', '', $this->text);
            } else {
                return preg_replace('@<('. implode('|', $this->tags) .')\b.*?>.*?</\1>@si', '', $this->text);
            }
        } elseif($invert == FALSE) {
            return preg_replace('@<(\w+)\b.*?>.*?</\1>@si', '', $this->text);
        }
        // Nothing to remove. Note: the output may still contain lots of € % $
        // characters that need cleaning up.
        return $this->text;
    }
}
?>
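Since `preg_replace` returns NULL when PCRE reports an error (for example when a backtracking limit is reached on a very long input), the calls above can fail without any visible message. Below is a minimal sketch of a guard that makes such a failure visible; the wrapper name `safePregReplace` is hypothetical and not part of the original class.

```php
<?php
// Hypothetical wrapper (not part of CleanTags): runs preg_replace and throws
// instead of silently returning NULL when PCRE reports an error.
function safePregReplace(string $pattern, string $replacement, string $subject): string {
    $result = preg_replace($pattern, $replacement, $subject);
    if ($result === null) {
        throw new RuntimeException('preg_replace failed, PCRE error code ' . preg_last_error());
    }
    return $result;
}

// Same "remove every element" pattern as in cleanInsideTags, on a small made-up input.
echo safePregReplace('@<(\w+)\b.*?>.*?</\1>@si', '', 'keep <span>drop</span> this'), "\n";
```

Swapping the three `preg_replace` calls in cleanInsideTags for a wrapper like this would turn an empty result caused by a PCRE error into an exception that names the error code.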
My question: if I pass values like "Steve_Jobs" or "Bill_gates", my code returns the first paragraph (as it should), but if I pass "Food", "Energy" or other single words, it returns a blank page. I would love to be able to solve this problem.
Any ideas? Thanks a lot for reading.
PS: If anyone is interested in this code, wait until I manage to fix it; once it works 100% it goes straight to GitHub ;)