How do I use BeautifulSoup in Python to extract the URL from an a tag?

from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

def getLinks(articleUrl):
    html = urlopen("http://www.ccb.com/cn/home" + articleUrl)
    bsObj = BeautifulSoup(html, 'lxml')
    print('bsObj.find=', bsObj.find("div", {"class": "Language_select"}).findAll("a"))
    return bsObj.find("div", {"class": "Language_select"}).findAll("a", re.compile('^(<a href=")(.*(?!">繁体))$'))

links = getLinks("/indexv3.html")
print('links=', links)
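Output:
bsObj.find= [<a href="http://fjt.ccb.com">繁体</a>, <a href="http://en.ccb.com/en/home/indexv3.html">ENGLISH</a>]
links= []
The code above uses BeautifulSoup to scrape "http://www.ccb.com/cn/home/indexv3.html". In the a tags it prints, the URL to the left of the link text "繁体" (Traditional Chinese) is the one I want to extract, so the second line of output should be links= ['http://fjt.ccb.com']. It looks like the findAll call in the return statement is wrong, which is why the result is empty. How should it be written?
Thanks!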

sinazl, reply 1

Try re.search or re.findall:
<a.*?href="(.*?)".*?>繁体</a>

htzhanglong, reply 2 (author)

from bs4 import BeautifulSoup
import requests

# Sample HTML content
html_content = """
<html>
<body>
    <a href="https://www.example.com">Example link 1</a>
    <a href="/relative/path">Relative link</a>
    <a href="#section">Anchor link</a>
    <a>Tag without href</a>
</body>
</html>
"""

# Parse the HTML
soup = BeautifulSoup(html_content, 'html.parser')

# Method 1: collect the href of every a tag
all_links = []
for a_tag in soup.find_all('a'):
    href = a_tag.get('href')
    if href:  # skip tags that have no href attribute
        all_links.append(href)

print("All links:", all_links)

# Method 2: the same thing as a list comprehension
links = [a.get('href') for a in soup.find_all('a') if a.get('href')]
print("Concise version:", links)

# Example of scraping a real page
def extract_links_from_url(url):
    try:
        response = requests.get(url, timeout=5)
        soup = BeautifulSoup(response.content, 'html.parser')
        return [a.get('href') for a in soup.find_all('a') if a.get('href')]
    except Exception as e:
        print(f"Fetch failed: {e}")
        return []

# Usage example
# urls = extract_links_from_url("https://www.python.org")

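Key points:

Use soup.find_all('a') to find every a tag.
Use .get('href') to read the value of the href attribute.
Check that href exists so you don't collect None values.

The list comprehension is the most concise way to write it.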
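
To make the third key point concrete, here is a small sketch of what happens when a tag has no href. The two-link markup is made up for illustration; only `Tag.get` and subscript access on a tag are being shown.

```python
from bs4 import BeautifulSoup

# Hypothetical two-link snippet: one a tag with an href, one without.
soup = BeautifulSoup(
    '<a href="/relative/path">with href</a><a>no href</a>', "html.parser"
)
with_href, without_href = soup.find_all("a")

# .get() returns the attribute value, or None when the attribute is missing.
present = with_href.get("href")
missing = without_href.get("href")

# Subscript access raises KeyError for a missing attribute instead.
try:
    without_href["href"]
    raised_key_error = False
except KeyError:
    raised_key_error = True

print(present, missing, raised_key_error)
```

So .get('href') plus an `if href` check is the safe choice when some a tags may lack the attribute, while ['href'] is fine once you have already filtered for tags that carry it.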

caililin, reply 3

a.get('href')

h691938207, reply 4

How exactly should the statement be written? I changed the return statement to: return re.findall('<a.*?href="(.*?)".*?>繁体</a>', bsObj.find("div", {"class": "Language_select"}).findAll("a"))

But running it raises this error:

File "D:\Python\Python3\lib\re.py", line 222, in findall
return _compile(pattern, flags).findall(string)
TypeError: expected string or bytes-like object
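
The TypeError comes from handing re.findall the list of tags that findAll returns, when it expects text. A minimal sketch of one way around it, assuming markup shaped like the output printed in the question (the inline `html` string is a stand-in, not the live page): serialize the div with str() first. The pattern here uses `[^>]*` rather than `.*?` so that a match cannot spill from one tag into the next; that is my substitution, not the pattern posted in the thread.

```python
import re
from bs4 import BeautifulSoup

# Stand-in markup modeled on the output shown in the question (assumption).
html = (
    '<div class="Language_select">'
    '<a href="http://fjt.ccb.com">繁体</a>'
    '<a href="http://en.ccb.com/en/home/indexv3.html">ENGLISH</a>'
    '</div>'
)
soup = BeautifulSoup(html, "html.parser")
div = soup.find("div", {"class": "Language_select"})

# re.findall needs a string, so convert the Tag to its markup with str().
pattern = r'<a[^>]*href="([^"]*)"[^>]*>繁体</a>'
regex_links = re.findall(pattern, str(div))
print("links=", regex_links)
```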

vueper, reply 5

Why doesn't re.compile work for this?
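
A likely answer, as far as I understand BeautifulSoup's filters: a regex given to findAll is tested against tag names, attribute values, or text, never against serialized markup such as `<a href="...">`, so a pattern that begins with `<a href=` has nothing it can match. The sketch below shows two filters that do apply here, one on the href value and one on the link text. The inline `html` is a stand-in modeled on the question's output, not the live page.

```python
import re
from bs4 import BeautifulSoup

# Stand-in markup modeled on the output shown in the question (assumption).
html = (
    '<div class="Language_select">'
    '<a href="http://fjt.ccb.com">繁体</a>'
    '<a href="http://en.ccb.com/en/home/indexv3.html">ENGLISH</a>'
    '</div>'
)
soup = BeautifulSoup(html, "html.parser")
div = soup.find("div", class_="Language_select")

# Option 1: the regex filters on the value of the href attribute.
by_href = [a["href"] for a in div.find_all("a", href=re.compile(r"^http://fjt\."))]

# Option 2: filter on the link text, then read the attribute.
by_text = [a["href"] for a in div.find_all("a", string="繁体")]

print(by_href, by_text)
```

Filtering on the link text is probably closer to what the question asks for, since it does not depend on knowing the target URL in advance.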

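ionicwang, reply 6

return re.findall(r'<a.*?href="(.*?)".*?>繁体</a>', html)[0]
Try it like this. [0] is the first match, which is the text captured by the parentheses in the pattern. Look up the details of how it works; I'm not great at explaining...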

htzhanglong, reply 7 (author)

return bsObj.find("div", {"class": "Language_select"}).find("a", href=True)['href']

sinazl, reply 8

Regex feels a bit complicated for this. In cases like this I usually just use a CSS selector:

import requests
from bs4 import BeautifulSoup

def getLinks(articleUrl):
    html = requests.get("http://www.ccb.com/cn/home" + articleUrl).text
    bsObj = BeautifulSoup(html, 'lxml')
    # The first a tag inside the element with class Language_select
    fanti_link = bsObj.select(".Language_select a")[0]['href']
    return fanti_link

links = getLinks("/indexv3.html")
print('links=', links)
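
For anyone who wants to try the CSS selector approach without hitting the network, here is an offline sketch. The inline `html` is a stand-in modeled on the output in the question, and select_one is simply a shortcut for taking the first result of select.

```python
from bs4 import BeautifulSoup

# Stand-in markup modeled on the output shown in the question (assumption).
html = (
    '<div class="Language_select">'
    '<a href="http://fjt.ccb.com">繁体</a>'
    '<a href="http://en.ccb.com/en/home/indexv3.html">ENGLISH</a>'
    '</div>'
)
soup = BeautifulSoup(html, "html.parser")

# First matching a tag under .Language_select.
first_link = soup.select_one(".Language_select a")["href"]

# Every a tag under .Language_select that actually has an href attribute.
selector_links = [a["href"] for a in soup.select(".Language_select a[href]")]

print(first_link)
print(selector_links)
```

Note that the selector picks links by position, so it returns the 繁体 link only as long as that link stays first inside the div.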

gougou168, reply 9

https://gist.github.com/JerryLi-X/d7dc24ffbeef82a0ffc88ac564f4de07

gougou168, reply 10

Thanks. What does href=True mean?

htzhanglong, reply 11 (author)

What does href=True mean, and why does indexing with ['href'] return the first link? Also, if I wanted the second link, what index should I use? Thanks.

eggper, reply 12

for a in bsObj.find("div", {"class": "Language_select"}).findAll("a"):
    print(a.get("href"))

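h691938207, reply 13

find(name, attrs, recursive, string, **kwargs): href=True means "find objects under bsObj that have an href attribute". You get the first link not because of ['href'] but because find() returns only the first matching object, whereas findAll() returns every object that has an href. To get the second link you can use find_next(), or findAll(limit=2) and take the element at index 1. See the official documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#find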
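
A short sketch of the points in this reply, run against stand-in markup. The leading a tag without an href is a hypothetical addition so that href=True has something to filter out; the other two links follow the output in the question.

```python
from bs4 import BeautifulSoup

# Stand-in markup; the first a tag (no href) is added for illustration only.
html = (
    '<div class="Language_select">'
    '<a>no link</a>'
    '<a href="http://fjt.ccb.com">繁体</a>'
    '<a href="http://en.ccb.com/en/home/indexv3.html">ENGLISH</a>'
    '</div>'
)
soup = BeautifulSoup(html, "html.parser")
div = soup.find("div", class_="Language_select")

# find() returns only the first a tag that has an href attribute.
first_tag = div.find("a", href=True)
first_url = first_tag["href"]

# find_all() returns every match, so the second link is at index 1.
tags_with_href = div.find_all("a", href=True)
second_url = tags_with_href[1]["href"]

# find_next() continues searching forward from the first match.
next_url = first_tag.find_next("a", href=True)["href"]

# limit=2 stops collecting after two matches.
first_two = div.find_all("a", href=True, limit=2)

print(first_url, second_url, next_url, len(first_two))
```

In other words, ['href'] only reads an attribute from whichever tag you already have; which tag that is depends on find(), find_all() plus an index, or find_next().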

wuwangju, reply 14

Thanks!