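Python web crawler hangs without raising any error: what is wrong with the design?

The image-scraping code is too long to post in full, so here is the key part. The problem is most likely in the crawler section below: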

for i in soup1.find_all("input", type="image"):
    imgnow = time.strftime("%Y%m%d%H%M%S")
    imgurl = i['data-src']
    rname = imgurl.split('/')[-1]
    opener = urllib.request.build_opener()
    opener.addheaders = self.UA
    urllib.request.install_opener(opener)
    try:
        # It hangs right here: no output at all, the program just freezes
        urllib.request.urlretrieve(
            imgurl, "./Pic/%s" % imgnow + "_" + str(rname))
    except Exception as e:
        print("Error: %s, continuing..." % e)
        continue

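caililin #1

You need to add timeout handling. Otherwise the program just sits there unresponsive when the remote site throttles you or closes the connection.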

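ionicwang #2

I have run into this problem. A crawler that hangs without raising an error is usually blocked on a network request. The most common causes are:

No timeout set - a stalled request waits forever
Blocking synchronous requests - one stalled request stops the whole program
Incomplete exception handling - some exceptions are silently swallowed

Here is a fixed code example: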

import requests
import concurrent.futures
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def create_session():
    session = requests.Session()
    # Retry policy: up to 3 retries with backoff on these 5xx responses
    retry = Retry(
        total=3,
        backoff_factor=0.5,
        status_forcelist=[500, 502, 503, 504]
    )
    adapter = HTTPAdapter(
        max_retries=retry,
        pool_connections=100,
        pool_maxsize=100
    )
    session.mount('http://', adapter)
    session.mount('https://', adapter)
    return session

def fetch_url(url, timeout=10):
    try:
        # Note: a new session per call means connections are not reused between calls
        session = create_session()
        # This timeout is what stops a stalled request from hanging forever
        response = session.get(url, timeout=timeout)
        response.raise_for_status()
        return response.text
    except Exception as e:
        print(f"Request failed {url}: {e}")
        return None

def main():
    urls = ['http://example.com/page1', 'http://example.com/page2']

    # Run the requests concurrently in a thread pool
    with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
        future_to_url = {executor.submit(fetch_url, url): url for url in urls}

        for future in concurrent.futures.as_completed(future_to_url):
            url = future_to_url[future]
            try:
                # The future is already finished here, so this timeout is only
                # a safeguard; the real limit is the request timeout in fetch_url
                data = future.result(timeout=15)
                if data:
                    print(f"Fetched {url}")
            except concurrent.futures.TimeoutError:
                print(f"Task timed out: {url}")
            except Exception as e:
                print(f"Processing failed {url}: {e}")

if __name__ == "__main__":
    main()

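Key improvements:

Use a Session object with a configured connection pool
Set a reasonable timeout (in requests, a single number applies to both the connect and the read timeout)
Add retries to handle transient failures
Use a thread pool so that one stalled request does not block the whole program
Bound each asynchronous task with its own timeout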
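
One caveat on the last two points: as_completed only yields futures that have already finished, so calling future.result(timeout=...) on them returns immediately. The limit that actually unblocks a stalled request is the timeout passed to the request itself. To also bound the whole batch, as_completed accepts its own timeout. Below is a minimal sketch of that idea using only the standard library. The names fake_fetch and fetch_all are hypothetical, and fake_fetch sleeps instead of touching the network so the behaviour is easy to see.

```python
import concurrent.futures
import time


def fake_fetch(url, delay):
    """Hypothetical stand-in for a real request: sleeps instead of using the network."""
    time.sleep(delay)
    return "body of %s" % url


def fetch_all(jobs, overall_timeout, max_workers=5):
    """Run fake_fetch for each (url, delay) pair; return (results, unfinished_urls)."""
    results = {}
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=max_workers)
    future_to_url = {executor.submit(fake_fetch, url, delay): url
                     for url, delay in jobs}
    try:
        # as_completed raises TimeoutError if the batch is not done in time
        for future in concurrent.futures.as_completed(
                future_to_url, timeout=overall_timeout):
            results[future_to_url[future]] = future.result()
    except concurrent.futures.TimeoutError:
        pass
    unfinished = [url for f, url in future_to_url.items() if not f.done()]
    # Do not block on stragglers. Running threads cannot be killed, so each
    # worker still needs its own request timeout in order to end eventually.
    executor.shutdown(wait=False)
    return results, unfinished
```

For example, fetch_all([("fast", 0.01), ("slow", 1.0)], overall_timeout=0.3) gives up on "slow" after about 0.3 seconds and reports it as unfinished, while the result for "fast" is kept.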

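If it still hangs, consider replacing requests with an asynchronous library such as aiohttp.

Recommendation: adding timeouts and concurrency control solves most hangs of this kind.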

caililin #3

How do I add that? Any pointers would be appreciated.

nodeper #4 (original poster)

You can read this.
ref: https://stackoverflow.com/questions/8763451/how-to-handle-urllibs-timeout-in-python-3.

Or you can try requests.
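
On the question of how to add it with urllib: urlretrieve takes no timeout argument, so one option is to switch to urlopen, which does. A minimal sketch under a few assumptions: the function name download_image, the default user agent string and the 10-second value are all made up for illustration, not taken from the original code.

```python
import urllib.request


def download_image(url, dest_path, user_agent="Mozilla/5.0", timeout=10):
    """Download url to dest_path; return True on success, False on failure.

    The timeout applies to connecting and to each blocking read, so a stalled
    server raises an exception instead of hanging forever.
    """
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            data = response.read()
    except OSError as e:
        # URLError, HTTPError, socket.timeout and ConnectionResetError
        # are all subclasses of OSError
        print("Download failed for %s: %s" % (url, e))
        return False
    with open(dest_path, "wb") as f:
        f.write(data)
    return True
```

Inside the loop from the question this would be called as download_image(imgurl, "./Pic/%s_%s" % (imgnow, rname)). If you would rather keep urlretrieve, calling socket.setdefaulttimeout(10) once at startup sets a process-wide default for new sockets, which should cover urlretrieve as well.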

sinazl #5

I recommend requests. Of course, whichever library you use, you still have to handle timeouts.

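htzhanglong #6

I added a timeout and exceptions are now raised, but I keep getting the error below. The call is inside a try block, and about 5 out of 10 requests fail this way; when it happens I just skip it and continue:

[WinError 10054] An existing connection was forcibly closed by the remote host, continuing...

Or else the program simply crashes with an error and terminates:
requests.exceptions.ConnectionError: ('Connection aborted.', ConnectionResetError(10054, 'An existing connection was forcibly closed by the remote host', None, 10054, None))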
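
WinError 10054 means the remote side reset the connection. A frequent cause, though this is an assumption about this particular site, is the server dropping clients that request too fast or look automated. One common way to cope is to wait and retry with a growing delay instead of skipping the image right away. A minimal sketch: call_with_retry and its parameters are hypothetical names, not part of any library.

```python
import time


def call_with_retry(func, retry_on=(ConnectionError,), attempts=4,
                    base_delay=1.0, sleep=time.sleep):
    """Call func(); on a retry_on exception, wait and try again.

    The wait doubles each time: base_delay, 2*base_delay, 4*base_delay, ...
    The last failure is re-raised so the caller can log it and move on.
    """
    for attempt in range(attempts):
        try:
            return func()
        except retry_on:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * (2 ** attempt))
```

The built-in ConnectionError covers ConnectionResetError from urllib and sockets. requests.exceptions.ConnectionError is a different class, so with requests it has to be passed explicitly, for example call_with_retry(lambda: requests.get(url, timeout=10), retry_on=(requests.exceptions.ConnectionError,)). As for the crash: an exception only terminates the program when it is raised outside a try block that catches it, so it is worth checking that every request call, not just the download, sits inside one.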