How to fix web page string encoding problems in Python 3: handling byte strings that look like \xb6\xd4\xb6\xcc\xd0\xc5

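As shown in the screenshot, my crawler gets back a dict, and I need to recover the Chinese text in it. In py2 I did this mainly with the string's decode method, but str in py3 has no such method. The dict comes from response.headers in requests, so its values are already str, not bytes. I tried bytes(s, "gbk") and str(s, "gbk"), and neither worked. I have been stuck on this for days without getting anywhere. Please don't tell me to put a b in front of the string to turn it into bytes: you can't put a b prefix on a variable.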

gougou168 #1

Haha, I ran into this a while ago too, and ended up giving up...

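bupafengyu #2

This is a typical encoding problem when scraping web pages in Python 3. Something like \xb6\xd4\xb6\xcc\xd0\xc5 is how a byte string (bytes) is displayed in the console or in logs: these are the raw bytes of Chinese text encoded with some encoding (GBK, for example).

The root cause: Python 3 strictly separates str from bytes. The raw data a web server returns is bytes, and you have to decode it into str with the correct encoding.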

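Solution:

First, determine the correct encoding. For Chinese pages the common ones are 'utf-8', 'gbk' and 'gb2312'. You can find or guess it as follows:
- Check Content-Type in the HTTP response headers, for example charset=utf-8.
- Check the <meta charset="..."> tag in the page's HTML source.
- Use the chardet library to detect it automatically (not in the standard library; install with pip install chardet).
Then, decode. Use the .decode() method of the bytes object.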
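For the first bullet, here is a minimal sketch of pulling the charset out of a Content-Type value using only the standard library. The helper name charset_from_content_type is made up for illustration, and the sketch assumes you already have the header value as a str.

```python
from email.message import Message


def charset_from_content_type(header_value):
    """Return the lowercase charset named in a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() returns None when no charset parameter is present
    return msg.get_content_charset()


print(charset_from_content_type("text/html; charset=GBK"))
print(charset_from_content_type("text/html"))
```

When this returns None, the header gives no hint and you have to fall back on the meta tag or on detection.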

Code example:

Suppose you fetched a page with the requests library and its body is the byte string content_bytes.

import requests

# Example byte string: the GBK encoding of "对短信"
content_bytes = b'\xb6\xd4\xb6\xcc\xd0\xc5'

# Method 1: the encoding is known (GBK here)
decoded_str = content_bytes.decode('gbk')
print(decoded_str)  # prints: 对短信

# Method 2: let requests decode based on the HTTP headers
response = requests.get('your URL')  # placeholder, replace with a real URL
# response.encoding is the encoding requests inferred; response.text is the decoded string
print(response.text)

# Method 3: detect the encoding with chardet (when it is unknown)
try:
    import chardet
    encoding_detected = chardet.detect(content_bytes)['encoding']
    if encoding_detected:
        decoded_str = content_bytes.decode(encoding_detected)
        print(decoded_str)
    else:
        print("Could not detect the encoding")
except ImportError:
    print("Install chardet first: pip install chardet")

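Key point: if what you have is really a bytes object, b'\xb6\xd4\xb6\xcc\xd0\xc5', printing it shows exactly those \x escapes. Find the right encoding (for example 'gbk') and call .decode('gbk') to get the correct Chinese string. If it is a str instead, as the original poster says of response.headers, turn it back into bytes first, for example with .encode('latin-1'), and then decode.

Summary: decode the byte string with the correct encoding via .decode().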
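When the encoding is unknown and chardet is not available, one rough option is to try a short list of candidates in order. This is only a sketch: the function name decode_with_fallback and the candidate list are assumptions, and the order matters because GBK will accept many byte sequences that are not really GBK.

```python
def decode_with_fallback(data, candidates=("utf-8", "gbk")):
    """Try each candidate encoding in turn; return (text, encoding_used)."""
    for enc in candidates:
        try:
            return data.decode(enc), enc
        except UnicodeDecodeError:
            continue
    # latin-1 maps every byte to a character, so this last resort never fails
    return data.decode("latin-1"), "latin-1"


text, used = decode_with_fallback(b"\xb6\xd4\xb6\xcc\xd0\xc5")
print(text, used)
```

UTF-8 goes first because it is strict: bytes that are not valid UTF-8 fail quickly and fall through to the next candidate.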

itying888 #3

# Only works when s holds literal backslash escapes; never eval untrusted input
ans = eval("b'{}'".format(s))

yuanlaile #4

data.encode("raw_unicode_escape")

itying888 #5

Just encode first, then decode:

In [3]: s.encode('latin')
Out[3]: b'\xb6\xd4\xb6\xcc\xd0\xc5'

In [4]: s.encode('latin').decode('gbk')
Out[4]: '对短信'
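The same round trip as a small reusable function, in case it helps. It is a sketch that assumes the str was produced by decoding GBK bytes as latin-1; fix_mojibake is just an illustrative name.

```python
def fix_mojibake(text, real_encoding="gbk"):
    """Undo a wrong latin-1 decode: str -> original bytes -> correct str."""
    # latin-1 maps code points 0-255 one-to-one onto bytes 0-255,
    # so encoding with it recovers the original bytes exactly
    return text.encode("latin-1").decode(real_encoding)


garbled = "\xb6\xd4\xb6\xcc\xd0\xc5"  # a str, not bytes
print(fix_mojibake(garbled))
```

If the text contains characters above U+00FF, the encode step raises UnicodeEncodeError, which is a sign the string was not garbled in this particular way.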

sinazl #6

In [5]: "\xd3\xf1\xc1\xfa".encode("raw_unicode_escape").decode("gbk")
Out[5]: '玉龙'
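For strings whose characters all fall in the 0-255 range, raw_unicode_escape and latin-1 produce the same bytes; they differ only on characters above U+00FF. A small sketch of that difference, with hypothetical helper names:

```python
def to_bytes_lenient(text):
    """raw_unicode_escape: code points above 0xFF become literal \\uXXXX text."""
    return text.encode("raw_unicode_escape")


def to_bytes_strict(text):
    """latin-1: code points above 0xFF raise UnicodeEncodeError."""
    return text.encode("latin-1")


s = "\xd3\xf1\xc1\xfa"
print(to_bytes_lenient(s) == to_bytes_strict(s))  # same bytes for this input
print(to_bytes_lenient("玉"))  # the character is written out as an escape

try:
    to_bytes_strict("玉")
except UnicodeEncodeError as exc:
    print("latin-1 refused:", exc.reason)
```

So latin-1 fails loudly on unexpected input, while raw_unicode_escape quietly produces escape text that a later GBK decode will pass through as-is.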

itying888 #7

First check the type of text. If type(text) is bytes, use text.decode('unicode_escape').

If type(text) is str, use text.encode('latin-1').decode('unicode_escape').
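Putting the type check together with the GBK step from the earlier replies, a rough sketch might look like this. The name to_chinese and the check for a literal backslash-x are assumptions for illustration; the check is only a heuristic, since genuine GBK data can in rare cases contain that byte pair.

```python
def to_chinese(text, real_encoding="gbk"):
    """Accept bytes, a mis-decoded str, or a str of literal \\x escapes."""
    # Step 1: get bytes. latin-1 turns code points 0-255 back into bytes.
    raw = text if isinstance(text, bytes) else text.encode("latin-1")
    # Step 2: if the data is the literal text "\xb6\xd4..." (backslash, x, hex),
    # interpret the escapes first, then convert back to real bytes.
    if b"\\x" in raw:
        raw = raw.decode("unicode_escape").encode("latin-1")
    # Step 3: decode the real bytes with the actual encoding.
    return raw.decode(real_encoding)


print(to_chinese(b"\xb6\xd4\xb6\xcc\xd0\xc5"))       # bytes
print(to_chinese("\xb6\xd4\xb6\xcc\xd0\xc5"))        # mis-decoded str
print(to_chinese("\\xb6\\xd4\\xb6\\xcc\\xd0\\xc5"))  # literal escapes
```

Note that unicode_escape on its own only interprets the escapes; it does not know about GBK, so the final decode is still needed.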

Link: https://www.zhihu.com/question/26921730/answer/49625649

itying888 #8

About latin-1:

https://baike.baidu.com/item/latin1

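bupafengyu #9

Uh, I don't know Python very well.
This looks more like the OP has a dict and doesn't know how to turn it into a str.

A quick search does turn up a solution. I haven't tested it with Chinese file names, so the OP should test that.
import json
import requests

r = requests.get('http://httpbin.org/image/jpeg')
headers = str.encode(json.dumps(dict(r.headers)))
print(headers.decode('gbk'))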
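For the dict case specifically, an alternative is to re-decode each header value with the latin-1 round trip suggested earlier in the thread. This sketch assumes the header values arrived as str decoded as latin-1 and that the real encoding is GBK; redecode_headers and the X-Title header are made-up names, and a plain dict stands in for response.headers.

```python
def redecode_headers(headers, real_encoding="gbk"):
    """Return a new dict with every value re-decoded from latin-1 to real_encoding."""
    fixed = {}
    for name, value in headers.items():
        try:
            fixed[name] = value.encode("latin-1").decode(real_encoding)
        except UnicodeError:
            # Leave values alone if they do not survive the round trip
            fixed[name] = value
    return fixed


# Stand-in for dict(response.headers); "X-Title" is a hypothetical header
sample = {"Content-Type": "text/html", "X-Title": "\xb6\xd4\xb6\xcc\xd0\xc5"}
print(redecode_headers(sample))
```

Pure ASCII values pass through unchanged, so it should be safe to run over the whole dict.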

nodeper #10

Thanks, everyone, for clearing this up for me.

ionicwang #11

Since this is a crawler anyway,
response = requests.get(url)
response.encoding = "utf-8"
and response.text is readable Chinese right away.
No need to make it this complicated...

caililin #12

print(str(b'\xb6\xd4\xb6\xcc\xd0\xc5', encoding="GBK"))

sinazl #13

Implement your own version of http.client.parse_headers:

https://www.dust8.com/2017/12/04/header-parsing-error/

eggper #14

A common encoding problem in network programming.