Foros del Web - View Single Post

razpeitia · #2 · 02/11/2010, 13:39

Don't use regular expressions to parse HTML or XML; use a parser instead,
such as lxml or BeautifulSoup.
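
To show what "use a parser" looks like in practice, here is a minimal sketch that pulls the links out of a page. I'm assuming Python 3 and using the standard library's html.parser rather than lxml or BeautifulSoup, only so that it runs with nothing installed; the LinkCollector class, the extract_links function and the sample markup are made up for the example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this,
        # so <A HREF=...> is handled the same way as <a href=...>.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def extract_links(html):
    """Return the href values found in the given HTML string, in order."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links


# Hypothetical sample markup with mixed case and mixed quoting,
# the kind of variation that tends to trip up a regular expression.
SAMPLE = (
    '<p>See <a href="/video/1">one</a> and '
    "<A HREF='/video/2' class=\"x\">two</a>.</p>"
)

if __name__ == "__main__":
    print(extract_links(SAMPLE))
```

With lxml or BeautifulSoup the same job is shorter, but the idea is the same: let the parser deal with quoting, case and attribute order instead of encoding all of that in a pattern.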

Apparently writing a spider isn't that easy. I tried it with Python 2.6 and got this back:

HTML code:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
    <title>Vimeo / 403 Forbidden</title>
        
    <style type="text/css">
        body {
            background:red;
            font-family:arial,san-serif;
            font-size:18px;
            font-weight:normal;
            color:white;
            margin:75px;
        }
    </style>
</head>
<body>
    <p><h1>You are blocked from Vimeo</h1></p>
    <p>The connection you are using has been blocked from communicating with Vimeo's servers. This ban will never be lifted.</p>
    <div style="display:none">1288726562</div>
    <p>If you are human and think this is an error, please <a href="mailto:[email protected]?body=I have been banned. My IP is x.x.x.x and my browser is Python-urllib/1.17">click here</a>.</p>
    <br />
    <p><em>"It's too bad she won't live. But then again, who does?"</em></p>
</body>
</html>

Apparently you also need to add some headers to the request you're making. The block page reports the browser as Python-urllib/1.17, so the default User-Agent is a likely culprit.
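
A rough sketch of setting headers on a request, assuming Python 3's urllib.request (on Python 2.6 the equivalent class is urllib2.Request). The build_request helper, the User-Agent string and the example.com URL are placeholders I made up, and I can't promise that any particular set of headers will get past a given site's block.

```python
import urllib.request

# Hypothetical User-Agent string; replace it with one that honestly
# describes your client.
DEFAULT_USER_AGENT = "Mozilla/5.0 (compatible; example-spider/0.1)"


def build_request(url, user_agent=DEFAULT_USER_AGENT):
    """Build a Request that sends explicit headers instead of urllib's defaults."""
    headers = {
        "User-Agent": user_agent,
        "Accept": "text/html",
    }
    return urllib.request.Request(url, headers=headers)


if __name__ == "__main__":
    # Placeholder URL; nothing is sent over the network here.
    request = build_request("https://example.com/videos")

    # Request stores header names capitalized as "User-agent".
    print(request.get_header("User-agent"))

    # To actually send it:
    # with urllib.request.urlopen(request) as response:
    #     html = response.read()
```

Before pointing a spider at a real site, check its robots.txt and terms of use as well; changing the User-Agent does nothing about a ban on the IP address itself.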