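How do I scrape product data from Amazon Japan with Python?

My boss gave me a task: scrape prices and reviews for a category of products on Amazon Japan. I have only scraped sites inside China before, and Amazon Japan is currently blocked here. What do I need to do to get the data? I connected through Lantern and ran export https_proxy=localhost:port in the terminal, and I also tried overseas proxy IPs, but every attempt failed with this error:
requests.exceptions.ProxyError: HTTPSConnectionPool(host='www.amazon.co.jp', port=443): Max retries exceeded with url: /dp/B000UTKMDQ (Caused by ProxyError('Cannot connect to proxy.', ConnectionResetError(104, 'Connection reset by peer')))
How do you all do it? Through an API, or some other way? Thanks a lot.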

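ionicwang (reply #1)

If it's blocked, just deploy the scraper on a server outside China.
Surely you're not expected to pay for that out of your own pocket?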
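
If an overseas server is not available and the scraper has to go through a local proxy, it may help to pass the proxy to requests explicitly, with a URL scheme, rather than rely on a scheme-less environment variable. The sketch below only builds the mapping; the helper names, the host 127.0.0.1, and the port 8080 are hypothetical placeholders, not values from the original post.

```python
def has_scheme(proxy_value):
    """Return True if a proxy setting (e.g. $https_proxy) includes a URL scheme."""
    return "://" in proxy_value


def build_proxies(host, port, scheme="http"):
    """Build a proxies mapping in the shape that requests accepts.

    The dictionary keys name the scheme of the *target* URL; the value is the
    proxy's own URL. An HTTP proxy can carry both plain and TLS traffic.
    """
    proxy_url = f"{scheme}://{host}:{port}"
    return {"http": proxy_url, "https": proxy_url}


# The value exported in the question has no scheme, which leaves the
# interpretation up to the HTTP library.
print(has_scheme("localhost:8080"))        # hypothetical port
print(build_proxies("127.0.0.1", 8080))    # hypothetical host and port
```

The result can then be passed as requests.get(url, proxies=build_proxies("127.0.0.1", 8080), timeout=10). The "Cannot connect to proxy" part of the error suggests the failure happened between the script and the proxy, so it is worth confirming the proxy's listening port and protocol (HTTP vs. SOCKS) before looking at Amazon itself.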

ionicwang (reply #2)

import requests
from bs4 import BeautifulSoup
import time
import random

def get_amazon_jp_product_data(url):
    """
    Scrape basic information from an Amazon Japan product page.
    """
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
        'Accept-Language': 'ja-JP,ja;q=0.9',
        'Accept-Encoding': 'gzip, deflate',
        'Accept': 'text/html,application/xhtml+xml',
        'Referer': 'https://www.amazon.co.jp/'
    }

    try:
        # Random delay so requests are not sent in a tight loop
        time.sleep(random.uniform(1, 3))

        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()

        # Check whether we were redirected to a verification page
        if 'robot-check' in response.url:
            print("Hit a verification page; a CAPTCHA may need to be handled")
            return None

        soup = BeautifulSoup(response.content, 'html.parser')

        # Extract product fields (selectors follow Amazon Japan's HTML structure)
        product_data = {}

        # Product title
        title_elem = soup.find('span', {'id': 'productTitle'})
        product_data['title'] = title_elem.get_text(strip=True) if title_elem else None

        # Price (Amazon Japan may render prices in several different ways)
        price_elem = soup.find('span', {'class': 'a-price-whole'})
        product_data['price'] = price_elem.get_text(strip=True) if price_elem else None

        # ASIN (Amazon's product identifier)
        asin_elem = soup.find('input', {'id': 'ASIN'})
        product_data['asin'] = asin_elem['value'] if asin_elem else None

        # Star rating
        rating_elem = soup.find('span', {'id': 'acrPopover'})
        product_data['rating'] = rating_elem.get('title') if rating_elem else None

        # Number of reviews
        review_elem = soup.find('span', {'id': 'acrCustomerReviewText'})
        product_data['review_count'] = review_elem.get_text(strip=True) if review_elem else None

        return product_data

    except requests.exceptions.RequestException as e:
        print(f"Request error: {e}")
        return None
    except Exception as e:
        print(f"Parse error: {e}")
        return None

# Usage example
if __name__ == "__main__":
    # Example Amazon Japan product URL (replace with a real product URL)
    test_url = "https://www.amazon.co.jp/dp/B08N5WRWNW"

    product_info = get_amazon_jp_product_data(test_url)

    if product_info:
        print("Product info:")
        for key, value in product_info.items():
            print(f"{key}: {value}")
    else:
        print("Failed to fetch product info")
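
The function above returns price and review_count as raw page text. If numbers are needed for comparison or storage, a small helper along these lines could normalize them. This is only a sketch: parse_int is a hypothetical name, and the sample strings are assumptions about what the page text might look like, not captured output.

```python
import re


def parse_int(text):
    """Extract the first integer from text, ignoring thousands separators.

    Returns None when the text is missing or contains no digits.
    """
    if not text:
        return None
    match = re.search(r"\d[\d,]*", text)
    if not match:
        return None
    return int(match.group().replace(",", ""))


# Assumed sample strings, for illustration only
print(parse_int("1,980"))           # price text
print(parse_int("1,234個の評価"))    # review-count text
print(parse_int(None))              # field was not found on the page
```

It could be applied to product_data['price'] and product_data['review_count'] after the scrape, leaving the original strings untouched if the raw text also needs to be kept.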

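Key points:

Request headers: set a realistic User-Agent and a language header; for Amazon Japan, send Accept-Language: ja-JP.
Anti-scraping measures:
- Add a random delay (1-3 seconds) before each request
- Check whether the response was redirected to a verification page
- Reuse a requests.Session so cookies persist between requests (the code above calls requests.get directly and would need this change)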
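
One way to combine the delay and the verification-page check is to retry with a growing, randomized wait. The sketch below is an assumption about how that could look, not part of the original reply: fetch_with_backoff and is_blocked are hypothetical names, and fetch is any callable returning (final_url, html), so the logic can be tried without network access.

```python
import random
import time


def is_blocked(final_url):
    """Same check as in the code above: were we redirected to a robot-check page?"""
    return "robot-check" in final_url


def fetch_with_backoff(fetch, url, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url) until the page is not blocked or attempts run out.

    Waits roughly 1-2 s, 2-4 s, 4-8 s, ... between attempts (exponential
    backoff with random jitter). Returns the HTML, or None if every attempt
    landed on a verification page.
    """
    for attempt in range(max_attempts):
        final_url, html = fetch(url)
        if not is_blocked(final_url):
            return html
        if attempt < max_attempts - 1:
            delay = base_delay * (2 ** attempt)
            sleep(random.uniform(delay, 2 * delay))
    return None
```

With requests, fetch could wrap session.get on a shared requests.Session(), returning (response.url, response.text), which also covers the cookie point in the list above.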
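HTML parsing: Amazon Japan's page structure differs slightly from the English-language sites, so adjust the selectors to match the actual page.
Important reminders:
- Amazon has strict anti-scraping measures; a large volume of requests can get your IP banned
- Price, stock, and similar fields may be loaded dynamically by JavaScript, in which case they will not appear in the raw HTML
- Respect robots.txt and the site's terms of use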
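
For the robots.txt point, Python's standard library can do the check before a URL is requested. The sketch below parses a small inline rule set so it runs offline; the rules are hypothetical and are NOT Amazon's actual robots.txt, and my-crawler is a placeholder user agent name.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only; not Amazon's real robots.txt.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /gp/cart
Allow: /dp/
"""


def make_checker(robots_txt):
    """Build a RobotFileParser from robots.txt content given as a string."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


checker = make_checker(SAMPLE_ROBOTS_TXT)
for path in ("/dp/B000UTKMDQ", "/gp/cart/view.html"):
    url = "https://www.amazon.co.jp" + path
    print(path, checker.can_fetch("my-crawler", url))
```

Against the live site, the parser would instead be pointed at the real file with parser.set_url("https://www.amazon.co.jp/robots.txt") followed by parser.read().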