How to analyze and scrape page data from cn.investing.com in Python

https://cn.investing.com/stock-screener/?sp=country::37|sector::a|industry::a|equityType::a%3Ceq_market_cap;1

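The page above seems to rely heavily on JS. Fetching it directly with a GET returns 0 stock records.
Sending a POST with {'sp': 'country::37|sector::a|industry::a|equityType::a%3Ceq_market_cap;1'} as the data doesn't work either; the result is still 0 records.

How should I go about parsing this page? What is the general approach?

Thanks!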

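vueper #1

Just render the page.

zlyuanteng #2

To scrape a dynamically loaded site like cn.investing.com, a plain requests.get of the page won't return the full data; you need Selenium or Playwright to drive a real browser. Below is a complete Selenium example that scrapes the major currency pairs from the forex page.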

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.options import Options
import pandas as pd
import time

def scrape_investing_forex():
    # Configure Chrome options
    chrome_options = Options()
    chrome_options.add_argument('--headless')  # headless mode: no visible browser window
    chrome_options.add_argument('--disable-gpu')
    chrome_options.add_argument('--no-sandbox')
    
    # Start the browser
    driver = webdriver.Chrome(options=chrome_options)
    
    try:
        # Open the forex page
        url = "https://cn.investing.com/currencies/"
        driver.get(url)
        
        # Wait up to 10 seconds for the table to appear
        wait = WebDriverWait(driver, 10)
        table = wait.until(EC.presence_of_element_located((By.CLASS_NAME, "genTbl")))
        
        # Extract the data
        data = []
        rows = table.find_elements(By.TAG_NAME, "tr")[1:]  # skip the header row
        
        for row in rows:
            cols = row.find_elements(By.TAG_NAME, "td")
            if len(cols) >= 9:
                row_data = {
                    'Name': cols[1].text.strip(),
                    'Last': cols[2].text.strip(),
                    'Change %': cols[3].text.strip(),
                    'High': cols[4].text.strip(),
                    'Low': cols[5].text.strip(),
                    'Change': cols[6].text.strip(),
                    'Time': cols[8].text.strip()
                }
                data.append(row_data)
        
        # Convert to a DataFrame
        df = pd.DataFrame(data)
        return df
        
    finally:
        # Always close the browser, even if scraping fails
        driver.quit()

# Run the scraper
if __name__ == "__main__":
    df = scrape_investing_forex()
    print(df.head())
    # Save to CSV
    df.to_csv('investing_forex_data.csv', index=False, encoding='utf-8-sig')

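Code walkthrough:

It uses Selenium's Chrome driver in headless mode, so no window pops up.
After opening the target page, it waits for an element with the table class genTbl to load.
It iterates over the table rows and extracts columns 2 through 7 and column 9 (name, prices, changes, time).
It converts the data into a pandas DataFrame, which is convenient for saving to CSV or further processing.

A few key points:

Anti-scraping measures on this site are fairly strict; you may need to set a user-agent or deal with CAPTCHAs.
The data loads dynamically, so wait for the element with WebDriverWait; that is more reliable than a fixed sleep.
Go easy when scraping: add a time.sleep() between requests so you don't overload their server.

Summary: scraping dynamic pages calls for a browser automation tool; watch out for anti-scraping measures and request frequency.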

ionicwang #3

Isn't it just this?
https://cn.investing.com/stock-screener/Service/SearchStocks

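itying888 #4

What does "render the page" mean?

yibo5220 #5

Forgive me, I'm just getting started and couldn't see what the key point is... What should I do next?

h691938207 #6

Simulate the POST request with Python.

zlyuanteng #7

Type $('#resultsTable tr') in the console and see whether that is the result you want.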

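sinazl #8

#1 and #2 have already given you the answer, and both are right. #1 means: use an automation package such as selenium to drive a browser to the target link; the browser runs the JS and renders the page, which then contains the target data. For the implementation, search for selenium tutorials.

#2 means: analyze the HTTP traffic. You will find that the target data is actually fetched by an XHR, a POST with parameters to https://cn.investing.com/stock-screener/Service/SearchStocks, which returns the data directly. To see the details, open F12 and check the Network tab, or capture the traffic with a proxy.

nodeper #9 (OP)

The data you want is not in the page you fetched. It is loaded through an API call after the page loads. The API address is the one I posted above; construct an identical request and you will get the data.
I'd suggest finding a book on web scraping rather than learning from video tutorials.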

ionicwang #10

data = {'sp': 'country::37|sector::a|industry::a|equityType::a<eq_market_cap;1'}
header = func.randHeader()
s = requests.post('https://cn.investing.com/stock-screener/Service/SearchStocks', params=data, headers=header)
This is what I wrote. Is there something wrong with it? Still no data...

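songsunli #11

Right now I'm just copying what others do; I haven't studied this systematically yet. My plan was to learn as I go.

bupafengyu #12

That's why I suggest getting a book and skimming through it quickly. It actually saves time.

h691938207 #13

Also, the data isn't what you wrote. Open F12 and take a look.

The request looks like this: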

POST /stock-screener/Service/SearchStocks HTTP/1.1
Host: cn.investing.com
Connection: keep-alive
Content-Length: 909
Accept: application/json, text/javascript, */*; q=0.01
Origin: https://cn.investing.com
X-Requested-With: XMLHttpRequest
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36
Content-Type: application/x-www-form-urlencoded
Referer: https://cn.investing.com/stock-screener/?sp=country::37|sector::a|industry::a|equityType::a|exchange::a%3Ceq_market_cap;1
Accept-Encoding: gzip, deflate, br
Accept-Language: zh-CN,zh;q=0.9,en;q=0.8,zh-TW;q=0.7,ja;q=0.6
Cookie: PHPSESSID=tt0b1qp47ancp40619ftigb2t1; geoC=CN; adBlockerNewUserDomains=1511230139; StickySession=id.70178265937.000.cn.investing.com; adbBLk=6; billboardCounter_6=2; nyxDorf=Y2AxYmYuP2JkNWtgZCkxMjZnYj0%2BJzAzMDRlZw%3D%3D
DNT: 1

And the form data is:

country[]:37
sector:2,11,7,10,1,4,9,5,8,3,6,12
industry:63,85,82,21,10,86,7,78,36,25,4,28,67,5,71,27,61,90,23,68,34,89,43,50,81,41,56,59,69,9,83,29,52,100,58,95,102,94,60,53,38,87,31,6,16,48,55,74,66,35,65,40,99,42,92,98,39,70,32,45,77,20,54,33,24,72,51,30,64,2,96,8,14,22,26,80,15,37,93,13,46,1,79,44,75,91,49,62,88,12,47,84,57,76,17,97,18,19,3,11,101,73
equityType:ORD,DRC,Preferred,Unit,ClosedEnd,REIT,ELKS,OpenEnd,Right,ParticipationShare,CapitalSecurity,PerpetualCapitalSecurity,GuaranteeCertificate,IGC,Warrant,SeniorNote,Debenture,ETF,ADR,ETC,ETN
exchange[]:54
exchange[]:103
pn:1
order[col]:eq_market_cap
order[dir]:d
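
One detail in the form data above is easy to miss: exchange[] appears twice. A Python dict cannot hold the same key twice, so a payload built as a dict silently keeps only the last value. The sketch below uses only the standard library to show the difference; the sector, industry, and equityType lists are shortened for readability, and the names FORM_FIELDS and encode_form are made up for illustration.

```python
from urllib.parse import urlencode, parse_qs

# Form fields as (key, value) pairs so that repeated keys survive.
# The comma-separated lists are shortened here; the real request sends the full lists.
FORM_FIELDS = [
    ("country[]", "37"),
    ("sector", "2,11,7"),
    ("industry", "63,85,82"),
    ("equityType", "ORD,DRC,Preferred"),
    ("exchange[]", "54"),
    ("exchange[]", "103"),
    ("pn", "1"),
    ("order[col]", "eq_market_cap"),
    ("order[dir]", "d"),
]


def encode_form(fields):
    """Return an application/x-www-form-urlencoded body for the given fields."""
    return urlencode(fields)


body = encode_form(FORM_FIELDS)
# Converting to a dict first drops the earlier exchange[] entry.
as_dict_body = encode_form(dict(FORM_FIELDS))

print(parse_qs(body)["exchange[]"])          # ['54', '103']
print(parse_qs(as_dict_body)["exchange[]"])  # ['103']
```

requests also accepts a list of tuples for its data argument, so the same list can be passed there directly.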

yibo5220 #14

header = {
    'Accept': 'application/json, text/javascript, */*; q=0.01',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'zh-CN,zh;q=0.9',
    'Connection': 'keep-alive',
    'Content-Length': '909',
    'Content-Type': 'application/x-www-form-urlencoded',
    'Host': 'cn.investing.com',
    'Origin': 'https://cn.investing.com',
    'Referer': 'https://cn.investing.com/stock-screener/?sp=country::37|sector::a|industry::a|equityType::a|exchange::a%3Ceq_market_cap;1',
    'User-Agent': 'Opera/8.0 (Macintosh; PPC Mac OS X; U; en)',
    'X-Requested-With': 'XMLHttpRequest'
}
data = {
    'country[]': '37',
    'sector': '2,11,7,10,1,4,9,5,8,3,6,12',
    'industry': '63,85,82,21,10,86,7,78,36,25,4,28,67,5,71,27,61,90,23,68,34,89,43,50,81,41,56,59,69,9,83,29,52,100,58,95,102,94,60,53,38,87,31,6,16,48,55,74,66,35,65,40,99,42,92,98,39,70,32,45,77,20,54,33,24,72,51,30,64,2,96,8,14,22,26,80,15,37,93,13,46,1,79,44,75,91,49,62,88,12,47,84,57,76,17,97,18,19,3,11,101,73',
    'equityType': 'ORD,DRC,Preferred,Unit,ClosedEnd,REIT,ELKS,OpenEnd,Right,ParticipationShare,CapitalSecurity,PerpetualCapitalSecurity,GuaranteeCertificate,IGC,Warrant,SeniorNote,Debenture,ETF,ADR,ETC,ETN',
    'exchange[]': '54',
    'exchange[]': '103',
    'pn': '1',
    'order[col]': 'eq_market_cap',
    'order[dir]': 'd'
}
session = requests.Session()
s = session.post('https://cn.investing.com/stock-screener/Service/SearchStocks', params=data, headers=header)
html = etree.HTML(s.text)

My understanding is that it should be written like this? But I still don't get the result I want...
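
A side note on the header dict above: it hard-codes Content-Length as 909, the size of the body captured in the browser. HTTP clients normally compute that header from the body they actually send, so it is usually safer not to copy it. The helper below is a small sketch (headers_from_devtools is a made-up name, not part of any library) that turns "Name: value" lines copied from the F12 Network tab into a dict and leaves Content-Length out. Whether the other headers, or the Cookie, are required by this site is not established in this thread.

```python
def headers_from_devtools(raw, skip=("content-length",)):
    """Turn 'Name: value' lines copied from the Network tab into a dict.

    Header names listed in `skip` (compared case-insensitively) are left out.
    Lines without a colon, such as the request line, are ignored.
    """
    headers = {}
    for line in raw.splitlines():
        line = line.strip()
        if not line or ":" not in line:
            continue
        # Split on the first colon only, so URLs in values stay intact.
        name, value = line.split(":", 1)
        if name.strip().lower() in skip:
            continue
        headers[name.strip()] = value.strip()
    return headers


RAW = """
POST /stock-screener/Service/SearchStocks HTTP/1.1
Host: cn.investing.com
Content-Length: 909
Accept: application/json, text/javascript, */*; q=0.01
Origin: https://cn.investing.com
X-Requested-With: XMLHttpRequest
Content-Type: application/x-www-form-urlencoded
"""

headers = headers_from_devtools(RAW)
print(headers)
```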

ionicwang #15

Change
s = session.post('https://cn.investing.com/stock-screener/Service/SearchStocks', params=data, headers=header)
to
s = session.post('https://cn.investing.com/stock-screener/Service/SearchStocks', data, headers=header)
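
The reason this change matters: in requests, params= is appended to the URL as a query string, while data= (the second positional argument of post) is form-encoded into the request body, which is where the browser's XHR puts these fields. The sketch below illustrates the two placements with the standard library only; nothing is sent over the network, the field list is shortened, and the function names are made up for illustration.

```python
from urllib.parse import urlencode
from urllib.request import Request

URL = "https://cn.investing.com/stock-screener/Service/SearchStocks"
FIELDS = [("country[]", "37"), ("pn", "1")]  # shortened for illustration


def as_query_string(url, fields):
    """Roughly what params= does: fields go into the URL, the body stays empty."""
    return Request(url + "?" + urlencode(fields), data=b"", method="POST")


def as_form_body(url, fields):
    """Roughly what data= does: the URL is unchanged, fields go into the body."""
    body = urlencode(fields).encode("ascii")
    req = Request(url, data=body, method="POST")
    req.add_header("Content-Type", "application/x-www-form-urlencoded")
    return req


wrong = as_query_string(URL, FIELDS)
right = as_form_body(URL, FIELDS)

# Requests are only constructed here, never sent.
print(wrong.full_url, wrong.data)
print(right.full_url, right.data)
```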

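vueper #16

Thank you all so much, it finally works.
One small remaining question: with pn:1 in the POST parameters I seem to get only the first page of 50 records. From what I tried, it looks like I can only fetch one page at a time, which means more than 70 requests.

Is there a simpler way to handle this?

ionicwang #17

Fetching 70-odd pages isn't much of a burden for a program. I just wouldn't recommend selenium, because each visit is equivalent to opening a web page, and it waits for the whole page to render before moving to the next step, which is inefficient.
Simulating the JS request and fetching the JSON data directly is the better approach.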
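
Fetching page by page can be wrapped in a small loop. The sketch below assumes a caller-supplied fetch_page(pn) function that posts the form with the given pn and returns that page's rows as a list; how the rows are pulled out of the JSON response is not shown in this thread, so that part is left to the caller. The stand-in fake_fetch_page only simulates 120 records at 50 per page so the loop can be run without touching the site.

```python
import time


def fetch_all_pages(fetch_page, max_pages=100, delay=1.0):
    """Call fetch_page(pn) for pn = 1, 2, ... and collect all rows.

    Stops at the first empty page or after max_pages pages.
    `delay` is the pause in seconds after each non-empty page.
    """
    rows = []
    for pn in range(1, max_pages + 1):
        page = fetch_page(pn)
        if not page:
            break
        rows.extend(page)
        time.sleep(delay)  # keep the request rate low
    return rows


def fake_fetch_page(pn, total=120, per_page=50):
    """Stand-in for the real POST: returns record numbers for page pn."""
    start = (pn - 1) * per_page
    return list(range(start, min(start + per_page, total)))


all_rows = fetch_all_pages(fake_fetch_page, delay=0)
print(len(all_rows))  # 120
```

With the real endpoint, fetch_page would set 'pn' in the form data before each session.post call, and a non-zero delay keeps the load on the server modest.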