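For a research project, I am reading basic Weibo user information by uid.
The requests imitate Chrome emulating H5 (mobile web) access.
Some users, such as uid1, can be fetched normally.
Others, such as uid2, cannot, even though they open normally in Chrome.
Network instability has been ruled out.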

Questions

Why does the request for uid2 fail?
Is there a way to distinguish {request failed} from {user banned or account deleted}?

import requests

def get_user_info(uid):
    # Query the m.weibo.cn mobile (H5) API for one user and print the parsed JSON.
    url = 'https://m.weibo.cn/api/container/getIndex'
    headers = {
        'Accept': 'application/json, text/plain, */*',
        'DNT': '1',
        'MWeibo-Pwa': '1',
        'Referer': f'https://m.weibo.cn/u/{uid}',
        'User-Agent': 'Mozilla/5.0 (iPhone; CPU iPhone OS 11_0 like Mac OS X) AppleWebKit/604.1.38 (KHTML, like Gecko) Version/11.0 Mobile/15A372 Safari/604.1',
        'X-Requested-With': 'XMLHttpRequest',
    }
    params = {'type': 'uid',
              'value': uid,
              'containerid': f'100505{uid}'}
    # The original called an undefined request(); requests.get(...).json() is assumed here.
    body = requests.get(url, headers=headers, params=params, timeout=10).json()
    print(body)
    return body

uid1 = 1692016845
uid2 = 1996669711
get_user_info(uid1)
get_user_info(uid2)

Python crawler question: for pages the browser can open, the crawler can fetch some ids but not others

htzhanglong #1

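I tested this with an Android emulator behind Fiddler. Opening uid2 in the browser reports a certificate problem (the Fiddler certificate is installed). Could this be an HTTPS issue on Weibo's side?
I don't really understand it, though. Could someone take the trouble to explain?

songsunli #2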

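I have run into this before; it is a typical anti-crawler mechanism. When the browser can open a page but the crawler cannot, it is usually because the target site has detected the crawler's requests.

The most common cause is incomplete request headers. Many sites check fields such as User-Agent, Referer, and Cookie. Some ids may work because their pages have looser restrictions on crawlers, or because you happened to send a valid cookie.

Here is a complete solution: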

import requests
from bs4 import BeautifulSoup
import time

def get_page_with_proper_headers(url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
        'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8',
        # 'br' is left out: requests only decodes Brotli when a brotli package is installed
        'Accept-Encoding': 'gzip, deflate',
        'Connection': 'keep-alive',
        'Upgrade-Insecure-Requests': '1',
        'Cache-Control': 'max-age=0',
        'Referer': 'https://www.google.com/'  # adjust to your actual case
    }
    
    # If cookies are required
    cookies = {
        'session_id': 'your_session_id_here'  # copy from the browser developer tools
    }
    
    try:
        response = requests.get(url, headers=headers, cookies=cookies, timeout=10)
        response.raise_for_status()
        return response.text
    except requests.exceptions.RequestException as e:
        print(f"Request failed: {e}")
        return None

# Usage example
url = "https://example.com/page/123"
html_content = get_page_with_proper_headers(url)

if html_content:
    soup = BeautifulSoup(html_content, 'html.parser')
    # Parse the page content here
    print("Page fetched successfully")
else:
    print("Failed to fetch page")

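Key points:

Complete headers that imitate a real browser
The necessary cookies (copied from the Network tab of the browser developer tools)
A reasonable delay between requests to avoid an IP ban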
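
The delay in the last point could look roughly like the sketch below. The name fetch_all and its parameters are made up for illustration, and the base and jitter values are guesses rather than limits published by any site.

```python
import random
import time

def fetch_all(urls, fetch, sleep=time.sleep, base=1.0, jitter=0.5):
    """Call fetch(url) for every URL, pausing between consecutive requests.

    fetch is any callable taking a URL (hypothetical; e.g. a wrapper
    around requests.get). The pause is base seconds plus a random
    amount in [0, jitter] so the requests are not evenly spaced.
    """
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            # No pause before the first request, only between requests.
            sleep(base + random.uniform(0, jitter))
        results.append(fetch(url))
    return results
```

Passing sleep as a parameter is only there so the pause can be swapped out when trying the function without waiting.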

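If that still does not work, the content may be loaded dynamically, in which case you need Selenium or Playwright. First inspect the network requests in the browser developer tools to see which requests actually fetch the data.

Summary: complete the request headers first, and only consider dynamic rendering if that fails.

gougou168 #3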

>>> get_user_info(uid2)
{'ok': 0, 'msg': '这里还没有内容', 'data': {'cards': []}}

(The msg reads "There is no content here yet".) Does uid2 really exist?
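
On the original question of telling a failed request from a missing account, one option is to label each response by its HTTP status and JSON body. This is only a sketch: classify_response is a made-up helper, it assumes ok == 1 marks a successful lookup (the thread only shows the ok == 0 case), and the status codes treated as blocking are a guess.

```python
def classify_response(status_code, body):
    """Return a coarse label for one API response.

    status_code: HTTP status of the response.
    body: parsed JSON (a dict), or None if parsing failed.
    """
    # Assumption: these statuses mean the server refused or throttled us.
    if status_code in (403, 429):
        return "blocked"
    if status_code != 200 or not isinstance(body, dict):
        return "request_failed"
    # Assumption: ok == 1 means the lookup succeeded.
    if body.get("ok") == 1:
        return "ok"
    cards = (body.get("data") or {}).get("cards")
    if body.get("ok") == 0 and not cards:
        # Ambiguous: could be a deleted, banned, or login-gated account.
        return "empty"
    return "unknown"
```

The "empty" label alone does not settle whether the account is gone; it only separates that case from network errors and blocking, so the uid can be rechecked another way.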

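vueper #4

From my testing, it looks like this user's profile can only be viewed after logging in. I don't know what mechanism is behind that.
OP, try opening these two links in incognito mode: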
http://weibo.com/u/1996669711
https://m.weibo.cn/u/1996669711

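yuanlaile #5

In the end I added a cookie and re-fetched every uid that had failed. I don't know how long this cookie from the Weibo web version stays valid, though. The client API, by contrast, keeps working for a long time, but it runs into a 403 limit after too many requests. I'll look at that later.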
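
For anyone wanting to automate the same workaround, a rough sketch follows: try without a cookie, retry with one if the result is empty, and back off when a 403 appears. The function name, the fetch callable, and the retry numbers are all hypothetical, and treating ok == 1 as success is an assumption.

```python
import time

def fetch_with_cookie_fallback(fetch, uid, cookie=None, max_retries=3, sleep=time.sleep):
    """Fetch a user anonymously first, then with a cookie if that fails.

    fetch(uid, cookie) must return (status_code, body); it is a
    caller-supplied function, for example a wrapper around requests.get.
    Returns the parsed body on success, or None.
    """
    use_cookie = None
    delay = 1.0
    for _ in range(max_retries):
        status, body = fetch(uid, use_cookie)
        if status == 403:
            # Rate limited: wait, double the wait, and try again.
            sleep(delay)
            delay *= 2
            continue
        if status == 200 and isinstance(body, dict) and body.get("ok") == 1:
            return body
        if use_cookie is None and cookie is not None:
            # Anonymous attempt came back empty; retry logged in.
            use_cookie = cookie
            continue
        return None
    return None
```

A uid that still returns nothing with a valid cookie is a better candidate for "banned or deleted" than one that only fails anonymously, though that is an inference and not something the API states.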