A question about using urllib.request.urlopen in Python

import urllib.request

def download(url):
    print("DOWNLOADING:", url)
    try:
        html = urllib.request.urlopen(url).read()
        print(html)
    except Exception as e:
        print("DOWNLOAD ERROR:", e)
        html = None
    return html

download("http://www.ccb.com/")
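The code above prints the following:
DOWNLOADING: http://www.ccb.com/
b'<SCRIPT LANGUAGE="JavaScript">\n  window.location="/cn/home/indexv3.html";\n</SCRIPT>\n\n\n'
My question: this html output is nowhere to be found in the page source I see in the browser, so why does html contain this content? Thanks!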

ionicwang, reply #1

➜ ~ curl http://www.ccb.com/
<SCRIPT LANGUAGE="JavaScript">
window.location="/cn/home/indexv3.html";
</SCRIPT>

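Nothing wrong here: curl gets the same response, so this is exactly what the server returns for that URL.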
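To fetch the page the browser ends up showing, the script's target has to be followed by hand, because urlopen follows HTTP redirects but does not execute JavaScript. Below is a minimal sketch, assuming the redirect is a plain quoted assignment to window.location like the one in the response above. find_js_redirect and its regular expression are hypothetical helpers for illustration, not part of urllib.

```python
import re
from urllib.parse import urljoin

# Matches simple assignments such as: window.location="/cn/home/indexv3.html";
# Assumption: the redirect target is a quoted string literal.
_JS_REDIRECT = re.compile(
    r"""(?:window\.)?location(?:\.href)?\s*=\s*["']([^"']+)["']""",
    re.IGNORECASE,
)


def find_js_redirect(base_url, body):
    """Return the absolute URL of a simple JavaScript redirect, or None."""
    text = body.decode("utf-8", errors="replace")
    match = _JS_REDIRECT.search(text)
    if match is None:
        return None
    # Resolve a relative target against the URL that was requested.
    return urljoin(base_url, match.group(1))


# The bytes below are the response quoted in the question.
body = b'<SCRIPT LANGUAGE="JavaScript">\n  window.location="/cn/home/indexv3.html";\n</SCRIPT>\n\n\n'
print(find_js_redirect("http://www.ccb.com/", body))
# → http://www.ccb.com/cn/home/indexv3.html
```

The returned URL can then be passed to download() again. A redirect that is built dynamically in JavaScript would not match this pattern and would need a real browser engine instead.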

yuanlaile, reply #2

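Core issue: urllib.request.urlopen is the basic function in the Python standard library for making HTTP/HTTPS requests, but people often run into trouble with encoding, exception handling, and setting request headers.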

Main usage and common problems:

Basic GET request:

import urllib.request
import urllib.error

try:
    response = urllib.request.urlopen('https://httpbin.org/get')
    data = response.read()  # response body as bytes
    print(data.decode('utf-8'))  # usually needs decoding to str
except urllib.error.URLError as e:
    print(f"Request failed: {e.reason}")

Sending a POST request (requires passing the data argument, as bytes):

import urllib.parse
import urllib.request

post_data = urllib.parse.urlencode({'key1': 'value1', 'key2': 'value2'}).encode()
req = urllib.request.Request('https://httpbin.org/post', data=post_data)
response = urllib.request.urlopen(req)

Setting request headers:

import urllib.request

headers = {'User-Agent': 'Mozilla/5.0'}
req = urllib.request.Request('https://httpbin.org/headers', headers=headers)
response = urllib.request.urlopen(req)
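A Request object can be inspected before it is sent, which is a cheap way to check that the data and headers are what you intended. The sketch below combines the POST body and the header from the snippets above and makes no network call.

```python
import urllib.parse
import urllib.request

# Form fields are URL-encoded and then converted to bytes, as urlopen requires.
post_data = urllib.parse.urlencode({"key1": "value1", "key2": "value2"}).encode("ascii")

req = urllib.request.Request(
    "https://httpbin.org/post",
    data=post_data,
    headers={"User-Agent": "Mozilla/5.0"},
)

print(req.get_method())              # → POST (the default once data is supplied)
print(req.data)                      # → b'key1=value1&key2=value2'
print(req.get_header("User-agent"))  # → Mozilla/5.0
```

Note that Request stores header names capitalized as "User-agent", so get_header has to be called with that spelling.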

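Key points to keep in mind:
- Return value: for HTTP and HTTPS URLs, urlopen() returns an http.client.HTTPResponse object; call .read() on it to get the response body (as bytes).
- Encoding: the body must be decoded with its actual encoding (for example utf-8 or gbk). response.headers.get_content_charset() returns the charset declared in the Content-Type header, or None if the server declares none.
- Exception handling: always catch urllib.error.HTTPError and urllib.error.URLError. HTTPError is a subclass of URLError, so when you handle both, list HTTPError first.
- Best for simple cases: urllib.request is fine for simple requests; for more complex needs (such as persistent sessions or connection pooling), the third-party library requests is recommended.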
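The encoding and exception-handling points above can be combined into one helper. This is a minimal sketch, not a definitive implementation: decode_body and fetch_text are hypothetical names, and the fallback list (utf-8, then gbk), the 10-second timeout, and the User-Agent value are assumptions chosen for illustration.

```python
import urllib.error
import urllib.request


def decode_body(raw, declared_charset, fallbacks=("utf-8", "gbk")):
    """Decode raw bytes, trying the declared charset first, then the fallbacks."""
    candidates = ([declared_charset] if declared_charset else []) + list(fallbacks)
    for name in candidates:
        try:
            return raw.decode(name)
        except (UnicodeDecodeError, LookupError):
            continue  # wrong or unknown encoding; try the next candidate
    # Last resort: never raise, replace undecodable bytes instead.
    return raw.decode("utf-8", errors="replace")


def fetch_text(url, timeout=10):
    """Return the decoded response body, or None if the request fails."""
    req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    try:
        with urllib.request.urlopen(req, timeout=timeout) as response:
            charset = response.headers.get_content_charset()  # may be None
            return decode_body(response.read(), charset)
    except urllib.error.HTTPError as e:
        # HTTPError is a subclass of URLError, so it has to be caught first.
        print(f"HTTP error {e.code}: {e.reason}")
    except urllib.error.URLError as e:
        print(f"Request failed: {e.reason}")
    return None
```

Calling fetch_text("https://httpbin.org/get") would return the body as a str. The order of the fallbacks matters: utf-8 rejects most byte sequences that are not valid UTF-8, while gbk accepts many arbitrary byte sequences, so the stricter codec is tried first.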