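How do you scrape product information from JD.com, Tmall, and Amazon in Python?

Does anyone have good open-source code I could study and learn from?

Right now I'm using Python with selenium to scrape the list pages.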

yuanlaile #1

How does 购物党 do its scraping?

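wuwangju #2

The basic approach to scraping e-commerce data with requests and BeautifulSoup

Straight to the code. Here is an example that scrapes a JD.com product page: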

import re

import requests
from bs4 import BeautifulSoup

def fetch_jd_product(url):
    # Send a browser-like User-Agent instead of the default requests one
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
    }

    try:
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()

        soup = BeautifulSoup(response.text, 'html.parser')

        # The JD.com product title is usually in the <title> tag or in a specific class
        title_tag = soup.find('title')
        title = title_tag.text.strip() if title_tag else 'title not found'

        # The price may live in script data, so extract it with a regex
        price_pattern = r'"price":"(\d+\.?\d*)"'
        price_match = re.search(price_pattern, response.text)
        price = price_match.group(1) if price_match else 'price not found'

        return {
            # Strip the boilerplate suffix that JD.com appends to page titles
            'title': title.replace('【行情 报价 价格 评测】-京东', ''),
            'price': price
        }

    except requests.RequestException as e:
        return {'error': f'request failed: {e}'}

# Usage example
if __name__ == '__main__':
    jd_url = 'https://item.jd.com/100000000001.html'  # example product ID
    result = fetch_jd_product(jd_url)
    if 'error' in result:
        # On failure the dict has no title or price, so report the error instead
        print(result['error'])
    else:
        print(f"Product title: {result['title']}")
        print(f"Price: {result['price']}")

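A few key points:

Anti-scraping: always set a User-Agent, or the request is likely to be rejected outright. JD.com and Tmall both have strict anti-scraping mechanisms.
Where the data lives:
- JD.com: the real price is often hidden in JSON data inside the page's scripts, so parsing the HTML directly may not find it
- Tmall: much of the data is loaded via Ajax, so you need to analyze the network requests
- Amazon: the strictest toward crawlers; you need to simulate browser behavior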
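
For the point about prices hiding in script JSON, here is a minimal sketch of pulling a value out of an embedded JSON object rather than out of HTML tags. The sample page, the variable name `pageConfig`, and the JSON layout are all made up for illustration; a real page has to be inspected to find the actual variable and keys.

```python
import json
import re
from typing import Optional

# Hypothetical sample page: `pageConfig` and its fields are invented for this
# sketch and do not reflect any real site's markup.
SAMPLE_HTML = """
<html><head><title>Example product</title>
<script>var pageConfig = {"sku": "100000000001", "price": "199.00"};</script>
</head><body></body></html>
"""


def extract_embedded_price(html: str) -> Optional[str]:
    """Return the "price" field of the JSON object assigned to pageConfig."""
    # Non-greedy match up to the first "};". This is enough for a flat object
    # like the sample, but would cut off early if a nested value contained "};".
    match = re.search(r'var\s+pageConfig\s*=\s*(\{.*?\});', html, re.DOTALL)
    if match is None:
        return None
    try:
        data = json.loads(match.group(1))
    except json.JSONDecodeError:
        return None
    price = data.get("price")
    return str(price) if price is not None else None


if __name__ == "__main__":
    print(extract_embedded_price(SAMPLE_HTML))
```

Parsing the object with `json.loads` instead of matching the price with a regex alone keeps the other fields available and fails cleanly when the script content is not valid JSON.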

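A more practical approach:

# For dynamically loaded sites, consider selenium
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

url = 'https://item.jd.com/100000000001.html'  # example product ID, as above
driver = webdriver.Chrome()
try:
    driver.get(url)
    # Wait for dynamic content to load (here: until <body> is present)
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.TAG_NAME, 'body'))
    )
    # Extract data...
    html = driver.page_source
finally:
    driver.quit()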

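Legal risk: frequent, large-scale scraping may violate a site's terms of service; small-scale scraping for personal study is generally not a problem.

Simple advice: analyze the page structure first, test at small scale, and only then think about anti-scraping countermeasures.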
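
As a rough illustration of "test at small scale" without hammering a site, the sketch below fetches a short list of URLs one at a time with a randomized pause between requests. The function name `crawl_politely`, its parameters, and the delay values are assumptions made for this example, and the fetcher is a stand-in so the code runs without touching any real site.

```python
import random
import time
from typing import Callable, Dict, Iterable


def crawl_politely(
    urls: Iterable[str],
    fetch: Callable[[str], str],
    min_delay: float = 1.0,
    max_delay: float = 3.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, str]:
    """Fetch each URL in turn, pausing a random interval between requests."""
    results: Dict[str, str] = {}
    for index, url in enumerate(urls):
        if index > 0:
            # Randomized pause so requests are not fired back to back
            sleep(random.uniform(min_delay, max_delay))
        results[url] = fetch(url)
    return results


if __name__ == "__main__":
    # Stand-in fetcher so the sketch runs without any network access
    def fake_fetch(url: str) -> str:
        return f"<html>{url}</html>"

    pages = crawl_politely(
        ["https://example.com/a", "https://example.com/b"],
        fake_fetch,
        min_delay=0.1,
        max_delay=0.2,
    )
    print(len(pages), "pages fetched")
```

In practice `fetch` could be a function like `fetch_jd_product` above; passing it in as a parameter keeps the throttling logic separate from the site-specific parsing.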

yibo5220 #3

The 张大妈 price-comparison site has scraping code here: https://github.com/hizdm/dynamic_ip

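h691938207 #4

Anything that can handle all three of these sites is bound to be production code; I don't think it would ever be open-sourced.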

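zlyuanteng #5

The language doesn't matter. I learned .NET on 腾讯课堂 (Tencent Classroom), and one lesson used JD.com as its case study. I tried crawling JD.com myself: with 10 Mbps of bandwidth and a 4-core, 8 GB machine, it took a little over 3 hours to crawl the product listings and prices.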

eggper #6

Don't you run into CAPTCHAs?

gougou168 #7

So dynamically changing your IP just means restarting the router. That's pretty brute-force.