How do I fix encoding problems when scraping a website with a Python crawler? (Help request)

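According to the page source, the site's encoding is utf-8, but decoding with utf-8 raises UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte. Setting the ignore parameter made the decode "succeed", but all I got was a pile of garbage characters. The site is here: http://www.bw30.com/tszt/huodong/09/wpsj/index.htm. Any help from the experts would be much appreciated.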

gougou168 #1

Read up on the BOM (byte order mark).
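
To spell the hint out a little: the byte 0xff at position 0 in the error is consistent with a UTF-16 little-endian byte order mark (FF FE), which would explain why utf-8 decoding fails on the very first byte. Whether this particular page really is UTF-16 is an assumption, not something confirmed in the thread. A minimal sketch of checking for a BOM before decoding, using only the standard library (the names `sniff_bom` and `decode_page` are made up for illustration):

```python
import codecs

# UTF-32 marks are tested before UTF-16 because the UTF-32 LE BOM
# (FF FE 00 00) begins with the same two bytes as the UTF-16 LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF8, 'utf-8-sig'),
    (codecs.BOM_UTF32_LE, 'utf-32'),
    (codecs.BOM_UTF32_BE, 'utf-32'),
    (codecs.BOM_UTF16_LE, 'utf-16'),
    (codecs.BOM_UTF16_BE, 'utf-16'),
]


def sniff_bom(raw):
    """Return a codec name if raw starts with a known BOM, else None."""
    for bom, name in _BOMS:
        if raw.startswith(bom):
            return name
    return None


def decode_page(raw, fallback='utf-8'):
    """Decode raw bytes, preferring the encoding announced by a BOM."""
    encoding = sniff_bom(raw) or fallback
    # The 'utf-8-sig', 'utf-16' and 'utf-32' codecs consume the BOM themselves.
    return raw.decode(encoding, errors='replace')
```

With requests, the raw bytes would be `response.content`, so the call would look like `decode_page(response.content)`. A BOM takes precedence here because it describes the bytes actually sent, whereas a `<meta charset>` tag in the source can simply be wrong.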

nodeper #2

Encoding problems really are annoying. This is how I usually handle them:

import requests
from bs4 import BeautifulSoup
import chardet

# Method 1: detect the encoding automatically with chardet
def get_html_with_encoding(url):
    response = requests.get(url)
    # Detect the encoding from the raw bytes
    encoding = chardet.detect(response.content)['encoding']
    # Detection can fail on some sites, so fall back to a default
    if encoding is None:
        encoding = 'utf-8'
    # Decode with the detected encoding
    html = response.content.decode(encoding, errors='ignore')
    return html

# Method 2: take the encoding from the response headers
def get_html_from_headers(url):
    response = requests.get(url)
    # First see whether requests found an encoding in the headers
    encoding = response.encoding
    if not encoding:
        # If not, parse it out of the Content-Type header
        content_type = response.headers.get('content-type', '').lower()
        if 'charset=' in content_type:
            encoding = content_type.split('charset=')[-1].strip()
        else:
            encoding = 'utf-8'  # default to utf-8
    html = response.content.decode(encoding, errors='ignore')
    return html

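# Method 3: let BeautifulSoup detect the encoding (simplest)
def get_html_simple(url):
    response = requests.get(url)
    # Pass the raw bytes without from_encoding: forcing 'utf-8' here would
    # switch off the detection this method relies on
    soup = BeautifulSoup(response.content, 'html.parser')
    # BeautifulSoup checks for a BOM and for a <meta charset> tag on its own
    html = str(soup)
    return html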

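# Usage example
if __name__ == '__main__':
    url = 'http://example.com'

    # Method 1: the most reliable
    html1 = get_html_with_encoding(url)

    # Method 2: suited to sites that declare their encoding explicitly
    html2 = get_html_from_headers(url)

    # Method 3: the least effort, but may be inaccurate
    html3 = get_html_simple(url)

    print(f"Method 1 character count: {len(html1)}")
    print(f"Method 2 character count: {len(html2)}")
    print(f"Method 3 character count: {len(html3)}")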
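
One caveat on Method 2: it pulls the charset out of the Content-Type header by hand, and a value that is quoted or followed by further parameters could trip that up. A rough alternative sketch that lets the standard library's `email.message.Message` do the parsing is shown here; the helper names `charset_from_content_type` and `pick_encoding` are hypothetical, and the utf-8 fallback simply mirrors the default used above.

```python
import codecs
from email.message import Message


def charset_from_content_type(value):
    """Return the lowercased charset from a Content-Type value, or None."""
    if not value:
        return None
    msg = Message()
    msg['Content-Type'] = value
    # get_content_charset() handles quoting and extra parameters
    return msg.get_content_charset()


def pick_encoding(content_type, fallback='utf-8'):
    """Use the declared charset only if Python has a codec for it."""
    name = charset_from_content_type(content_type)
    if name is None:
        return fallback
    try:
        codecs.lookup(name)
    except LookupError:
        # Unknown or misspelled charset in the header
        return fallback
    return name
```

In Method 2 this would replace the manual split, along the lines of `encoding = pick_encoding(response.headers.get('content-type', ''))`. Checking the name with `codecs.lookup` means a bogus charset falls back to the default rather than raising at decode time.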

Key points: