A Guide to Using Beautiful Soup for Python Web Scraping

I wrote a guide to using Beautiful Soup for Python web scraping. Please go easy on me:

https://mp.weixin.qq.com/s?__biz=MjM5NjMyMjUzNg==&mid=2448130592&idx=1&sn=18ac8a5db08a0367378f02150af7cbea&chksm=b2f42fa78583a6b109bd27b14073eb5340cef4da90c82e85d64198655fdd869833a6ab727903&mpshare=1&scene=1&srcid=0621r50K15iba84frrRSfMcz#rd

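h691938207 #1

Is that your justification for copying excerpts?
I don't really understand the thinking behind technical articles that lean entirely on the official docs. It's also why I can't bring myself to write a blog: I can't produce anything of real value, I'm too lazy to copy, and translations are the only thing I feel comfortable filing away in OneNote.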

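caililin #2

Reply:

For scraping, BeautifulSoup is a solid choice; it makes parsing HTML/XML convenient. It comes down to two steps: get the document, then find what you want in it.

1. Installation and basic parsing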

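# Install first: pip install requests beautifulsoup4
import requests
from bs4 import BeautifulSoup

# Fetch the page
url = 'https://example.com'
resp = requests.get(url, timeout=10)  # a timeout keeps the call from hanging forever
resp.raise_for_status()  # fail early on 4xx/5xx responses
html_content = resp.text

# Parse into a soup object. 'html.parser' ships with Python;
# 'lxml' is usually faster but has to be installed separately
soup = BeautifulSoup(html_content, 'html.parser')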
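
If you want to try the parsing step without hitting the network, here is a minimal sketch that feeds BeautifulSoup an inline string instead of a response body. SAMPLE_HTML and its contents are made up for illustration; everything else is the same constructor call as above.

```python
from bs4 import BeautifulSoup

# Hypothetical inline document, standing in for resp.text
SAMPLE_HTML = """
<html>
  <head><title>Demo page</title></head>
  <body><p class="intro">Hello, <b>soup</b>!</p></body>
</html>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

page_title = soup.title.string   # text inside the <title> tag
intro_text = soup.p.get_text()   # text of the first <p>, nested tags included

print(page_title)
print(intro_text)
```

Attribute access such as soup.title and soup.p is shorthand for finding the first tag with that name.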

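2. Core search methods

# Find the first tag with this name
title_tag = soup.find('title')
print(title_tag.text)

# Find every <a> tag
all_links = soup.find_all('a')
for link in all_links:
    print(link.get('href'))

# Find by CSS class name (class_ has the underscore because class is a Python keyword)
special_divs = soup.find_all('div', class_='special-class')

# Find by an exact attribute match
target = soup.find('meta', attrs={'name': 'description'})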
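
To see what those calls return, here is a small sketch run against a made-up document (SAMPLE_HTML is an assumption, not a real page). It also shows select() and select_one(), which take CSS selectors and are an alternative way to express the same lookups.

```python
from bs4 import BeautifulSoup

# Hypothetical document for illustration only
SAMPLE_HTML = """
<html>
  <head>
    <title>Demo</title>
    <meta name="description" content="A tiny test page">
  </head>
  <body>
    <div class="special-class">first</div>
    <div class="special-class other">second</div>
    <div class="plain">third</div>
    <a href="/a">A</a>
    <a>no href</a>
  </body>
</html>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# .get() returns None for the <a> that has no href
hrefs = [link.get("href") for link in soup.find_all("a")]

# class_ matches any tag that carries this class, even alongside other classes
special = [div.get_text() for div in soup.find_all("div", class_="special-class")]

# attrs= matches on arbitrary attributes
description = soup.find("meta", attrs={"name": "description"})["content"]

# The same class lookup written as a CSS selector
special_css = [div.get_text() for div in soup.select("div.special-class")]

# First <a> that actually has an href attribute
first_link = soup.select_one("a[href]")

print(hrefs, special, description, first_link["href"])
```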

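3. Common ways to extract data

tag = soup.find('p')

# Get the text
text_content = tag.get_text(strip=True)  # strip=True trims whitespace around each piece of text

# Get an attribute value; .get() returns None when the attribute is missing
href_value = tag.get('href')

# Keep searching inside a tag you already found
container = soup.find('div', id='main')
# find() returns None if there is no match, so guard before searching further
inner_items = container.find_all('li') if container else []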
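
A quick sketch of how these differ in practice, again on a made-up snippet (SAMPLE_HTML is hypothetical). It compares .text with get_text() using a separator, and shows that searching from a container only sees that container's descendants.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet for illustration only
SAMPLE_HTML = """
<div id="main">
  <p>  Price: <span>10</span> <span>USD</span>  </p>
  <ul><li>one</li><li>two</li></ul>
</div>
<ul><li>outside</li></ul>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
tag = soup.find("p")

# .text concatenates every piece of text as-is, whitespace included
raw = tag.text

# get_text() can join the pieces with a separator and strip each one
joined = tag.get_text("|", strip=True)

# Searching from the container is scoped to its descendants
container = soup.find("div", id="main")
inner_items = [li.get_text() for li in container.find_all("li")]
all_items = [li.get_text() for li in soup.find_all("li")]

print(repr(raw))
print(joined)
print(inner_items, all_items)
```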

4. A simple real-world example

# Suppose we are scraping a book listing page
import requests
from bs4 import BeautifulSoup

url = 'http://books.toscrape.com'
resp = requests.get(url)
soup = BeautifulSoup(resp.content, 'html.parser')

books = []
for article in soup.find_all('article', class_='product_pod'):
    title = article.h3.a['title']
    price = article.find('p', class_='price_color').text
    books.append({'title': title, 'price': price})

print(f"Scraped {len(books)} books")
for book in books[:3]:  # print the first 3
    print(f"Title: {book['title']}, Price: {book['price']}")
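
One possible refinement is to split parsing from fetching so the parsing logic can be checked offline. The sketch below is just that, a sketch: parse_books and SAMPLE_LISTING are hypothetical names, and the sample markup only imitates the structure the loop above expects (article.product_pod, an h3 link with a title attribute, a p.price_color). It also converts the price to a number, which the example above does not do.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical markup imitating the structure used in the loop above
SAMPLE_LISTING = """
<article class="product_pod">
  <h3><a href="book-1.html" title="First Book">First ...</a></h3>
  <p class="price_color">£51.77</p>
</article>
<article class="product_pod">
  <h3><a href="book-2.html" title="Second Book">Second ...</a></h3>
  <p class="price_color">£9.50</p>
</article>
<article class="product_pod">
  <h3>Broken entry without link or price</h3>
</article>
"""


def parse_books(html):
    """Return a list of {'title', 'price'} dicts, skipping incomplete entries."""
    soup = BeautifulSoup(html, "html.parser")
    books = []
    for article in soup.find_all("article", class_="product_pod"):
        link = article.select_one("h3 a")
        price_tag = article.find("p", class_="price_color")
        if link is None or price_tag is None:
            continue  # markup did not match what we expected
        # Pull the numeric part out of text such as "£51.77"
        match = re.search(r"\d+(?:\.\d+)?", price_tag.get_text())
        if match is None:
            continue
        title = link.get("title") or link.get_text(strip=True)
        books.append({"title": title, "price": float(match.group())})
    return books


books = parse_books(SAMPLE_LISTING)
for book in books:
    print(book["title"], book["price"])
```

With a live page you would call parse_books(resp.content) instead of passing the sample string.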

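Key points:

Use find() for a single match and find_all() for every match
get_text() gives more control over formatting than plain .text, since it accepts separator and strip arguments
find() returns None when nothing matches, so check that the tag exists before reading its text or attributes; otherwise you get an AttributeError (or a TypeError from subscripting None). Separately, tag['attr'] raises KeyError when the attribute is missing, while tag.get('attr') returns None

Summary: locate with find first, then pull the content with get_text or ['attr'].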
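
To make the failure modes concrete, here is a small sketch on a made-up snippet. text_or_default is a hypothetical helper, not part of BeautifulSoup; it is one way to wrap the "check before you read" habit.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: a <p> with text and an <a> without an href
soup = BeautifulSoup('<p class="note">hi</p><a>no link</a>', "html.parser")

# find() returns None when nothing matches
missing = soup.find("table")
try:
    missing.get_text()
except AttributeError as exc:
    error_for_none = type(exc).__name__

# Subscripting a tag for an attribute it lacks raises KeyError
link = soup.find("a")
try:
    link["href"]
except KeyError as exc:
    error_for_attr = type(exc).__name__

# .get() is the forgiving alternative
safe_href = link.get("href")


def text_or_default(parent, name, default=""):
    """Hypothetical helper: text of the first matching tag, or a default."""
    tag = parent.find(name)
    return tag.get_text(strip=True) if tag is not None else default


print(error_for_none, error_for_attr, safe_href)
print(text_or_default(soup, "p"), text_or_default(soup, "table", "n/a"))
```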

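eggper #3

Nice layout. What tool did you use?

yuanlaile #4

For parsing the DOM, I strongly recommend pyquery.

gougou168 #5

Why not write the scraper in JS?

zlyuanteng #6

Because I'm learning Python.

sinazl #7 (OP)

OK, thanks for the recommendation.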