Is there a ready-made Python library that extracts a page's link text + URLs directly?

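The pages I need to scrape change their layout every few days. I used to locate elements with lxml + etree + XPath and pull out each href and its text(). Given how often things change now, that approach may be hard to maintain.
Could any of you experts point me to an existing library or an approach that extracts all hyperlinks on a page into something like [('urltext_1', url1), ('urltext_2', url2)]?
Then I could probably just match urltext against the targets I want. It suddenly struck me that there's no need to overcomplicate this...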
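
For reference, here is a minimal sketch of the output shape described above, using only the Python standard library. The names LinkCollector, extract_links, and pick, and the sample HTML, are made up for illustration; it assumes the page source has already been downloaded as a string.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect (text, absolute_url) pairs for every <a href=...> in a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []      # finished (text, url) tuples
        self._href = None    # href of the <a> currently open, if any
        self._chunks = []    # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative hrefs against the page URL
                self._href = urljoin(self.base_url, href)
                self._chunks = []

    def handle_data(self, data):
        if self._href is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse runs of whitespace in the visible text
            text = " ".join("".join(self._chunks).split())
            self.links.append((text, self._href))
            self._href = None


def extract_links(html, base_url):
    parser = LinkCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


def pick(links, keyword):
    """Keep only the links whose visible text contains keyword."""
    return [(text, url) for text, url in links if keyword in text]


# Hypothetical sample page, just to show the shape of the result
sample = '<p><a href="/t/1">Python <b>news</b></a> <a href="https://example.com/x">Other</a></p>'
links = extract_links(sample, "https://example.com/")
print(links)
print(pick(links, "Python"))
```

Because the match is on link text rather than on the element's position in the page, a layout change that keeps the link text would not break this kind of filter.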

yuanlaile #1

You said yourself it changes every few days... Be grateful it isn't changing every hour.

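h691938207 #2

Yes: requests + BeautifulSoup is the go-to combination for exactly this. requests fetches the page, and BeautifulSoup parses the HTML and extracts the text and links you want.

The example below runs as-is. It pulls out the text and href attribute of every link (<a> tag) on the page that has text: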

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def extract_links_from_url(url):
    """
    Extract the text and URL of every link on the page at the given URL.
    """
    try:
        # 1. Fetch the page (the timeout keeps a dead server from hanging the script)
        response = requests.get(url, timeout=10)
        response.raise_for_status()  # raise on 4xx/5xx responses
        response.encoding = response.apparent_encoding  # guess the encoding from the body

        # 2. Parse the HTML with BeautifulSoup
        soup = BeautifulSoup(response.text, 'html.parser')

        # 3. Collect every <a> tag
        links = []
        for a_tag in soup.find_all('a', href=True):  # only tags that have an href attribute
            link_text = a_tag.get_text(strip=True)  # link text with whitespace stripped

            # Resolve relative URLs against the page URL;
            # absolute URLs pass through urljoin unchanged
            link_url = urljoin(url, a_tag['href'])

            # Keep only links that have text (skips empty and image-only links)
            if link_text:
                links.append({
                    'text': link_text,
                    'url': link_url
                })

        return links

    except requests.exceptions.RequestException as e:
        print(f"Request failed: {e}")
        return []
    except Exception as e:
        print(f"Parsing failed: {e}")
        return []

# Usage example
if __name__ == "__main__":
    target_url = "https://httpbin.org/html"  # a test page; replace with your target URL
    extracted_links = extract_links_from_url(target_url)

    print(f"Extracted {len(extracted_links)} links from {target_url}:\n")
    for idx, link in enumerate(extracted_links, 1):
        print(f"{idx}. Text: {link['text']}")
        print(f"   URL: {link['url']}\n")

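A quick explanation:

requests.get(url) fetches the page source.
BeautifulSoup(html, 'html.parser') turns messy HTML into a structured tree that is easy to search.
soup.find_all('a', href=True) finds every <a> tag that has an href attribute.
Inside the loop, get_text() gets the text the link displays and ['href'] gets the actual link address. urljoin turns relative paths such as /about into full URLs.
Finally, it returns a list of dicts, each with a text and a url key.

Install the libraries before running:

pip install requests beautifulsoup4

In short: writing your own extractor with requests and BeautifulSoup is flexible and simple.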

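yibo5220 #3

There are plenty of ready-made tools; just search GitHub. But they all tend to be fairly complex, and in the time it takes to learn one you could write something with Python's Scrapy library that covers a simple need like yours.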

caililin #4

Jsoup makes parsing HTML easy: you can use a selector to pull out all the a tags in one go.

htzhanglong #5

If all you need is the hyperlinks and their text, re would work too, but it's hard to tell which links you need and which you don't.
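
A rough sketch of the re approach mentioned above, using only the standard library. The pattern and the function name links_by_regex are illustrative assumptions; the pattern is deliberately simple, handles only quoted href values, and will misfire on unusual markup, which is why an HTML parser is usually the safer choice.

```python
import re
from html import unescape

# Matches <a ... href="..." ...>inner</a>; simple on purpose, and therefore fragile
A_TAG = re.compile(
    r'<a\b[^>]*?\bhref\s*=\s*["\']([^"\']*)["\'][^>]*>(.*?)</a>',
    re.IGNORECASE | re.DOTALL,
)
# Matches any tag, used to strip markup nested inside the link
TAG = re.compile(r"<[^>]+>")


def links_by_regex(html):
    """Return (text, href) pairs for each quoted-href <a> element found."""
    pairs = []
    for href, inner in A_TAG.findall(html):
        # Drop nested tags, decode entities, collapse whitespace
        text = " ".join(unescape(TAG.sub("", inner)).split())
        pairs.append((text, unescape(href)))
    return pairs


# Hypothetical sample markup
sample = (
    '<a class="t" href="/t/575511">How to <em>start</em> coding?</a>'
    '<a href=\'/member/x\'><img src="a.png"></a>'
)
print(links_by_regex(sample))
```

The second link in the sample comes back with empty text because it only wraps an image, which illustrates the point about still having to decide which links matter.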

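bupafengyu #6

Scraping is a war of attrition: the question is whether their front-end developers or the scraper engineers collapse first, and until then the fight goes on.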

zlyuanteng #7

Just grab the a tags directly with XPath?
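
One way that suggestion might look with lxml, which the original post already uses. This is a sketch, not a definitive implementation: the function name links_by_xpath and the sample markup are made up, and it assumes lxml is installed. The expression //a[@href] selects links wherever they sit in the tree, so it does not depend on the page layout.

```python
from lxml import html as lxml_html


def links_by_xpath(page_source, base_url):
    """Return (text, absolute_url) pairs for every <a href=...> in the page."""
    doc = lxml_html.fromstring(page_source)
    # Rewrite relative hrefs in place so they become absolute URLs
    doc.make_links_absolute(base_url)
    pairs = []
    for a in doc.xpath("//a[@href]"):
        # text_content() includes text from nested elements
        text = " ".join(a.text_content().split())
        pairs.append((text, a.get("href")))
    return pairs


# Hypothetical sample markup
sample = (
    '<div><ul><li><a href="/t/1">First <span>topic</span></a></li></ul>'
    '<a href="/about">About</a></div>'
)
print(links_by_xpath(sample, "https://example.com"))
```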

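songsunli #8

You'd be better off asking whether there's an existing tool that does the whole job for you and then writes the data to a database.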

h691938207 #9

Does Python have a library like that too?

yuanlaile #10

htzhanglong #11

diffbot
Worth a look; it's not cheap.

vueper #12

I wrote one in PHP.

links.php

<?php
require __DIR__ . '/vendor/autoload.php';

global $base_uri, $wait_replace_imgs;
$base_uri = 'https://www.v2ex.com';
$t1 = microtime(true);
try {
    set_time_limit(1800);
    ini_set("max_execution_time", 1800);
    ini_set('memory_limit', '512M');
    $html_node = file_get_contents($base_uri);
    $crawler = new \Symfony\Component\DomCrawler\Crawler($html_node, $base_uri);
    $links = $crawler->filter('a')->links();
    $temp_links = []; // start empty so a page without links still yields valid JSON
    foreach ($links as $link) {
        $temp_links[] = ['url' => $link->getUri(), 'text' => $link->getNode()->textContent];
    }
    file_put_contents('links.txt', json_encode($temp_links, JSON_UNESCAPED_UNICODE));
    echo 'success ';
    $t2 = microtime(true);
    echo 'time consuming ' . round($t2 - $t1, 3) . ' s' . PHP_EOL;
} catch (Exception $exception) {
    echo $exception->getCode() . ', message:' . $exception->getMessage();
}

Part of the output

[
    { "url": "https:\/\/www.v2ex.com\/t\/575511", "text": "写代码的时候没有思路不知道如何写起，请教如何培养训练编程思路谢谢！" },
    { "url": "https:\/\/www.v2ex.com\/member\/fanmouji", "text": "" },
    { "url": "https:\/\/www.v2ex.com\/t\/575397", "text": "JD 的 618 是不是走走过场？" }
]

Project URL
https://github.com/MasterCloner/Cornerstone

sinazl #13 (OP)

Distill Web Monitor

nodeper #14

A selector that supports this kind of thing should be the bare minimum.

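zlyuanteng #15

I recommend https://github.com/MontFerret/ferret
The part you write yourself is a DSL; a few simple lines are enough, so changes are naturally simpler and quicker.

The framework itself is written in Go, so runtime performance is not a concern.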

wuwangju #16

Search Baidu; keywords: 爬虫采集工具 (web-scraping and data-collection tools).

h691938207 #17

You could try Crawlab's automatic field extraction; the success rate is roughly 50-70%.

https://github.com/tikazyq/crawlab

Article: https://juejin.im/post/5cf4a7fa5188254c5879facd

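yibo5220 #18

Maybe the XPath wasn't written well? Surely they don't switch themes every day? Otherwise, how would you cope with a major CSS overhaul?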