Don't use regular expressions to parse HTML or XML; use a parser instead, such as lxml or BeautifulSoup.
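As a rough illustration of the parser approach, here is a minimal sketch that collects link targets from a page. It uses the standard library's `html.parser` as a stand-in so it runs without installing anything; lxml or BeautifulSoup would give you a friendlier API for the same job. The `LinkCollector` class, the `extract_links` helper, and the sample markup are made up for this example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag it sees (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    """Parse an HTML string and return the list of link targets, in order."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


# Made-up sample markup; note the mixed-case second tag, which a naive regex could miss.
sample = '<p>See <a href="/videos/1">one</a> and <A HREF="/videos/2">two</A>.</p>'
print(extract_links(sample))  # prints ['/videos/1', '/videos/2']
```

The parser handles case differences and attribute order for you, which is exactly where hand-written regexes tend to break.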
Apparently writing a spider isn't that easy. I tried it with Python 2.6 and got this back.
HTML code:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
body {
background:red;
font-family:arial,san-serif;
font-size:18px;
font-weight:normal;
color:white;
margin:75px;
}
<p><h1>You are blocked from Vimeo
</h1></p> <p>The connection you are using has been blocked from communicating with Vimeo's servers. This ban will never be lifted.
</p> <div style="display:none">1288726562
</div> <p>If you are human and think this is an error, please
<a href="mailto:[email protected]?body=I have been banned. My IP is x.x.x.x and my browser is Python-urllib/1.17">click here
</a>.
</p> <p><em>"It's too bad she won't live. But then again, who does?"
</em></p>
Apparently you also need to add some headers to the request you're making.
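A minimal sketch of what adding headers could look like, assuming Python 3's `urllib.request` (on Python 2.6 the equivalent class is `urllib2.Request`). The URL and header values are placeholders made up for this example, and there is no guarantee any particular site will accept them; the point is only that the request no longer announces itself as `Python-urllib`, which is the browser string shown in the block page above.

```python
import urllib.request

# Placeholder values for illustration only; they do not come from the thread.
URL = "http://example.com/"
HEADERS = {
    "User-Agent": "Mozilla/5.0 (compatible; example-spider/0.1)",
    "Accept": "text/html",
}


def build_request(url, headers=HEADERS):
    """Return a Request that carries explicit headers instead of the urllib defaults."""
    return urllib.request.Request(url, headers=headers)


def fetch(url):
    """Send the request and return the decoded body (this performs network I/O)."""
    with urllib.request.urlopen(build_request(url), timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")


# Inspect the request without sending it. urllib stores header names
# capitalized as "User-agent", so that is the key to look up.
req = build_request(URL)
print(req.get_header("User-agent"))
```

Call `fetch(URL)` when you actually want the page; building and inspecting the request first makes it easy to check what you are about to send.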