Error generating keywords from a web page

xarmagedonx · #1 · 01/12/2012, 16:20

Hi everyone! I found some code online that supposedly "worked" for generating keywords from a website. I made several corrections to get it running, but it still doesn't work properly.

It extracts the wrong keywords. Here is an example:

Quote:

testfalse, lang, {mw, loader, window, function, true, vector, user, gadget, mediawiki, legacy, options, usebetatoolbar, implement, resourceloader, default

From the article: http://es.wikipedia.org/wiki/Caf%C3%A9

This is how I call the extractor:

PHP code:

include("KeyPer.php");

[...]

if (empty($keywords)) {
    $ekeys = new KeyPer;
    $keywords = $ekeys->Keys($url);
}

And the code for "KeyPer" is:

PHP code:

<?php

class KeyPer {

    function Keys($url) {
        $html = file_get_contents($url);
        $html = $this->clean($html);

        $blacklist = 'a, ante, bajo, con, contra, de, desde, mediante, durante, hasta, hacia, para, por, que, qué, cuán, cuan, los, las, una, unos, unas, donde, dónde, como, cómo, cuando, porque, por, para, según, sin, tras, con, mas, más, pero, del';
        $sticklist = 'test';
        $minlength = 3;
        $count = 17;

        $html = preg_replace('/[\.;:|\'|\"|\`|\,|\(|\)|\-]/', ' ', $html);
        $html = preg_replace('/¡/', '', $html);
        $html = preg_replace('/¿/', '', $html);

        $keysArray = explode(" ", $html);
        $keysArray = array_count_values(array_map('strtolower', $keysArray));
        $blackArray = explode(",", $blacklist);

        foreach ($blackArray as $blackWord) {
            if (isset($keysArray[trim($blackWord)]))
                unset($keysArray[trim($blackWord)]);
        }

        arsort($keysArray);
        $i = 1;
        $keywords = "";

        foreach ($keysArray as $word => $instances) {
            if ($i > $count) break;
            if (strlen(trim($word)) >= $minlength && is_string($word)) {
                $keywords .= $word . ", ";
                $i++;
            }
        }

        $keywords = rtrim($keywords, ", ");

        return $keywords = $sticklist . '' . $keywords;
    }

 
    function clean($html) {
        $regex = '/(([_A-Za-z0-9-]+)(\\.[_A-Za-z0-9-]+)*@([A-Za-z0-9-]+)(\\.[A-Za-z0-9-]+)*)/iex';
        $desc = preg_replace($regex, '', $html);
        $html = preg_replace( "''si", '', $html );
        $html = preg_replace( '/]*>([^<]+)<\/a>/is', '\2 (\1)', $html );
        $html = preg_replace( '//', '', $html );
        $html = preg_replace( '/{.+?}/', '', $html );
        $html = preg_replace( '/&nbsp;/', ' ', $html );
        $html = preg_replace( '/&amp;/', ' ', $html );
        $html = preg_replace( '/&quot;/', ' ', $html );
        $html = strip_tags( $html );
        $html = htmlspecialchars($html);
        $html = str_replace(array("\r\n", "\r", "\n", "\t"), " ", $html);

        // Collapse runs of spaces into a single space.
        while (strchr($html, "  ")) {
            $html = str_replace("  ", " ", $html);
        }

        // Make sure every '.' and ',' is followed by a space.
        for ($cnt = 1; $cnt < strlen($html) - 1; $cnt++) {
            if (($html[$cnt] == '.') || ($html[$cnt] == ',')) {
                if ($html[$cnt + 1] != ' ') {
                    $html = substr_replace($html, ' ', $cnt + 1, 0);
                }
            }
        }

        return $html;
    }

}

?>

How can I stop it from extracting this kind of keyword and get it to extract keywords from the article text instead? The ones it returns look like words from the page's HTML source rather than from its visible text.
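For reference, words like mw, loader and resourceloader are the kind of thing that lives inside script blocks, and strip_tags() removes tags but keeps the text between them. A minimal sketch of dropping script, style and comment bodies before stripping tags might look like this. The helper name visibleText() and the sample markup are made up for illustration and are not part of the KeyPer class; regex-based cleanup like this is only an approximation of real HTML parsing.

```php
<?php
// Sketch only: visibleText() is a hypothetical helper, not part of KeyPer.
function visibleText(string $html): string
{
    // strip_tags() keeps the text between tags, so the bodies of
    // script, style and comment blocks have to be removed first.
    $patterns = [
        '~<script\b[^>]*>.*?</script>~is',
        '~<style\b[^>]*>.*?</style>~is',
        '~<!--.*?-->~s',
    ];
    foreach ($patterns as $pattern) {
        // preg_replace() returns null on a PCRE error; keep the input then.
        $html = preg_replace($pattern, ' ', $html) ?? $html;
    }

    $text = strip_tags($html);
    // Turn entities such as &nbsp; or &eacute; into real characters.
    $text = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');
    // Collapse whitespace runs (including non-breaking spaces) to one space.
    $text = preg_replace('/[\s\x{00A0}]+/u', ' ', $text) ?? $text;

    return trim($text);
}

// Made-up sample markup, loosely imitating a page with inline JavaScript.
$sample = '<html><head><script>if (window.mw) { mw.loader.implement("user.options"); }</script>'
        . '<style>body { color: red; }</style></head>'
        . '<body><!-- nav --><p>El caf&eacute; es una bebida.</p></body></html>';

echo visibleText($sample), "\n";
```

The returned string could then be passed through the same explode / array_count_values steps that Keys() already uses.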

Elsewhere I was advised to use tf-idf, but it is quite complex, with logarithmic equations, so I will try to improve the code that way later on.
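For what it is worth, the logarithm in tf-idf appears in only one place. In one common variant, the score of a term in a document is (occurrences of the term / total words in the document) multiplied by log(number of documents / number of documents containing the term). A rough sketch under that assumption follows; tfIdf() and the two tiny token lists are invented for illustration, and the method needs a collection of several documents, which a single page does not provide by itself.

```php
<?php
// Sketch only: tfIdf() is a hypothetical helper using the plain
// tf * log(N / df) variant, with no smoothing.
function tfIdf(array $docs): array
{
    $n = count($docs);

    // Document frequency: in how many documents each term occurs.
    $df = [];
    foreach ($docs as $tokens) {
        foreach (array_unique($tokens) as $term) {
            $df[$term] = ($df[$term] ?? 0) + 1;
        }
    }

    $scores = [];
    foreach ($docs as $id => $tokens) {
        $scores[$id] = [];
        $total = count($tokens);
        foreach (array_count_values($tokens) as $term => $occurrences) {
            $tf = $occurrences / $total;       // term frequency
            $idf = log($n / $df[$term]);       // inverse document frequency
            $scores[$id][$term] = $tf * $idf;
        }
        arsort($scores[$id]);                  // highest score first
    }

    return $scores;
}

// Invented, already-tokenized documents.
$docs = [
    'cafe' => ['el', 'cafe', 'es', 'una', 'bebida', 'el', 'cafe'],
    'te'   => ['el', 'te', 'es', 'una', 'infusion'],
];

$scores = tfIdf($docs);
print_r(array_slice($scores['cafe'], 0, 3));
```

Words that occur in every document, such as "el" here, get an idf of log(1) = 0, so they drop to the bottom without needing a blacklist.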

I hope you can help. Regards!