How do I scrape links with the BeautifulSoup package in Python?

The code below searches the Wikipedia article on Kevin Bacon for all links pointing to other articles.
The links that point to other article pages all share these characteristics:
• They live inside the div tag whose id is bodyContent
• The URLs do not contain colons
• The URLs all begin with /wiki/

Here is the code:

from urllib.request import urlopen
from bs4 import BeautifulSoup
import datetime
import random
import re

random.seed(datetime.datetime.now())

def getLinks(articleUrl):
    html = urlopen(" https://en.wikipedia.org "+articleUrl)
    bsObj = BeautifulSoup(html)
    return bsObj.find("div", {"id":"bodyContent"}).findAll("a", href=re.compile("^(/wiki/)((?!:).)*$"))

links = getLinks("/wiki/Kevin_Bacon")
print(links)

Running it produces the error below. Why is this happening? Thanks for any pointers!

Traceback (most recent call last):
  File "c:\Users\A\AppData\Roaming\Code\User\test\2.py", line 11, in <module>
    links = getLinks("/wiki/Kevin_Bacon")
  File "c:\Users\A\AppData\Roaming\Code\User\test\2.py", line 8, in getLinks
    html = urlopen(" https://en.wikipedia.org "+articleUrl)
  File "D:\Python\Python3\lib\urllib\request.py", line 223, in urlopen
    return opener.open(url, data, timeout)
  File "D:\Python\Python3\lib\urllib\request.py", line 526, in open
    response = self._open(req, data)
  File "D:\Python\Python3\lib\urllib\request.py", line 544, in _open
    '_open', req)
  File "D:\Python\Python3\lib\urllib\request.py", line 504, in _call_chain
    result = func(*args)
  File "D:\Python\Python3\lib\urllib\request.py", line 1361, in https_open
    context=self._context, check_hostname=self._check_hostname)
  File "D:\Python\Python3\lib\urllib\request.py", line 1320, in do_open
    raise URLError(err)
urllib.error.URLError: <urlopen error [Errno 11004] getaddrinfo failed>



7 replies

html = urlopen("https://en.wikipedia.org"+articleUrl)
Don't put spaces inside the string.
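
For reference, a corrected getLinks with the stray spaces removed might look like the sketch below; passing "html.parser" explicitly is my own choice here (any installed parser such as "lxml" also works), just to avoid the parser warning that comes up later in this thread.

from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

def getLinks(articleUrl):
    # No spaces inside the URL string, otherwise getaddrinfo cannot resolve the host
    html = urlopen("https://en.wikipedia.org" + articleUrl)
    # Naming a parser explicitly is optional but avoids a UserWarning
    bsObj = BeautifulSoup(html, "html.parser")
    return bsObj.find("div", {"id": "bodyContent"}).findAll(
        "a", href=re.compile("^(/wiki/)((?!:).)*$"))

links = getLinks("/wiki/Kevin_Bacon")
print(len(links), "article links found")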


import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

# 1. Fetch the page content
url = 'https://example.com'
response = requests.get(url)
html_content = response.text

# 2. Parse the HTML
soup = BeautifulSoup(html_content, 'html.parser')

# 3. Extract all links
# Method 1: take the href attribute of every <a> tag
links = []
for a_tag in soup.find_all('a', href=True):
    href = a_tag['href']
    # Convert to an absolute URL
    absolute_url = urljoin(url, href)
    links.append(absolute_url)

# Method 2: the same thing as a list comprehension
links = [urljoin(url, a['href']) 
         for a in soup.find_all('a', href=True)]

# 4. Print the results
print(f"Found {len(links)} links:")
for link in links:
    print(link)

# 5. Filter for specific kinds of links (optional)
# Keep only http/https links
http_links = [link for link in links 
              if link.startswith(('http://', 'https://'))]

# Find links containing a particular keyword
keyword_links = [link for link in links 
                 if 'blog' in link.lower()]

Key points:

  1. requests fetches the page, BeautifulSoup parses the HTML
  2. soup.find_all('a', href=True) finds every tag that has an href
  3. urljoin() resolves relative URLs so you always end up with complete links
  4. List comprehensions keep the code concise

Quick suggestion: remember to handle exceptions and set request headers.
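
To make that suggestion concrete, here is a minimal sketch of the same scrape with a timeout, an explicit User-Agent header, and basic error handling; the header string and the 10-second timeout are placeholder assumptions, not anything the reply above specified.

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

url = 'https://example.com'
# Placeholder User-Agent; some sites reject requests' default one
headers = {'User-Agent': 'Mozilla/5.0 (compatible; link-scraper/0.1)'}

try:
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()  # turn 4xx/5xx responses into exceptions
except requests.RequestException as exc:
    print(f"Request failed: {exc}")
else:
    soup = BeautifulSoup(response.text, 'html.parser')
    links = [urljoin(url, a['href']) for a in soup.find_all('a', href=True)]
    print(f"Found {len(links)} links")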

The traceback already tells you where the problem is: inside the getLinks function, the line
html = urlopen(" https://en.wikipedia.org "+articleUrl)
is what fails.

It's a DNS resolution error; you need to go through a proxy.
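
If the failure really is network/DNS related rather than the stray spaces, one way to send urllib traffic through a local proxy looks roughly like this; the proxy address 127.0.0.1:7890 is purely a made-up example, substitute whatever your proxy actually listens on.

import urllib.request

# Hypothetical local proxy address; replace with your real one
proxy = urllib.request.ProxyHandler({
    'http': 'http://127.0.0.1:7890',
    'https': 'http://127.0.0.1:7890',
})
opener = urllib.request.build_opener(proxy)
urllib.request.install_opener(opener)  # later urlopen() calls go through the proxy

html = urllib.request.urlopen("https://en.wikipedia.org/wiki/Kevin_Bacon")
print(html.status)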

The script is run from VS Code. After removing the spaces from the URL, it now reports the message below.

If I change BeautifulSoup(html) to BeautifulSoup(html, 'lxml') the message goes away, but judging from it the cause seems to be that no parser was explicitly specified.

What exactly is a parser? And what do I need to configure in VS Code so that the plain BeautifulSoup(html) call runs without this warning? Thanks for any pointers!


D:\Python\Python3\lib\site-packages\bs4\__init__.py:181: UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("lxml"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.

The code that caused this warning is on line 91 of the file C:\Users\A\.vscode\extensions\donjayamanne.python-0.7.0\pythonFiles\PythonTools\visualstudio_py_launcher.py. To get rid of this warning, change code that looks like this:
 BeautifulSoup(YOUR_MARKUP})
to this:
 BeautifulSoup(YOUR_MARKUP, "lxml")
  markup_type=markup_type))
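
For context: the "parser" is the backend BeautifulSoup uses to turn raw HTML into a tree, e.g. the standard library's html.parser or the third-party lxml and html5lib packages. The choice has nothing to do with VS Code; it is made in the BeautifulSoup() call itself, roughly like this minimal sketch:

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("https://en.wikipedia.org/wiki/Kevin_Bacon")

# Name the parser explicitly to silence the warning:
# 'html.parser' ships with Python; 'lxml' is faster but needs `pip install lxml`
bsObj = BeautifulSoup(html, "html.parser")
# bsObj = BeautifulSoup(html, "lxml")   # equivalent, if lxml is installed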

Reading "Web Scraping with Python"?
lmgtfy

Yes, I've just started learning. Have you read this book too, and did you run into this issue? The BeautifulSoup(html) calls in the book only run cleanly after being changed to BeautifulSoup(html, 'lxml'), and I don't understand why.
