Common Python Web-Scraping Problems for Beginners and Their Solutions

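First, I am scraping the company name of every service listed at http://www.zbj.com/appdingzhikaifa/sq10054601.html, and then searching for each company name on www.qichacha.com and scraping the results. The problem is that Qichacha asks for verification after a certain number of requests, even when I am logged in. Is there a way around this? For example, how do I set up several proxy IPs and scrape through them? Scraping the way I do now is far too slow. My code is below. It is quite verbose, so I would also appreciate help simplifying it:

import os
import selenium.webdriver as webdriver
driver=webdriver.Chrome()
import xlrd
data = xlrd.open_workbook("C://Python27//2.xlsx")
table = data.sheets()[0]
nrows = table.nrows
ncols = table.ncols
rowValues=[]
for i in xrange(0,nrows):
    rowValues.append(table.row_values(i))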

#import sys,urllib
a=[]
for r in rowValues:
    s = ('').join(r)
    base_url = 'http://www.qichacha.com/search?key=' + s
    a.append(base_url)

res=[]
for r in a:
    driver.get(r)
    results=driver.find_elements_by_xpath("//tr[1]//td[2]/p[1][@class='m-t-xs']/a")
    for result in results:
        res.append(result.text)

from xlutils.copy import copy
from xlrd import open_workbook
from xlwt import easyxf
excel=r'C://Python27//2.xlsx'
rb=xlrd.open_workbook(excel)
wb=copy(rb)
sheet=wb.get_sheet(0)
x=0
y=5
for tag in res:
    sheet.write(x,y,tag)
    x+=1

# note: xlwt always writes the legacy .xls format, even under an .xlsx file name
wb.save(excel)

rex=[]
for r in a:
    driver.get(r)
    results=driver.find_elements_by_xpath("//tr[1]//td[2]/p[1][@class='m-t-xs']/span[1]")
    for result in results:
        rex.append(result.text)

excel=r'C://Python27//2.xlsx'
rb=xlrd.open_workbook(excel)
wb=copy(rb)
sheet=wb.get_sheet(0)
x=0
y=10
for tag in rex:
    sheet.write(x,y,tag)
    x+=1

wb.save(excel)

rey=[]
for r in a:
    driver.get(r)
    results=driver.find_elements_by_xpath("//tr[1]//td[2]/p[1][@class='m-t-xs']/span[2]")
    for result in results:
        rey.append(result.text)

excel=r'C://Python27//2.xlsx'
rb=xlrd.open_workbook(excel)
wb=copy(rb)
sheet=wb.get_sheet(0)
x=0
y=15
for tag in rey:
    sheet.write(x,y,tag)
    x+=1

wb.save(excel)

rez=[]
for r in a:
    driver.get(r)
    results=driver.find_elements_by_xpath("//tr[1]//td[2]/p[2][@class='m-t-xs']")
    for result in results:
        rez.append(result.text)

excel=r'C://Python27//2.xlsx'
rb=xlrd.open_workbook(excel)
wb=copy(rb)
sheet=wb.get_sheet(0)
x=0
y=20
for tag in rez:
    sheet.write(x,y,tag)
    x+=1

wb.save(excel)

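reo=[]
for r in a:
    driver.get(r)
    results=driver.find_elements_by_xpath("//tr[1]//td[2]/p[3][@class='m-t-xs']")
    for result in results:
        reo.append(result.text)

excel=r'C://Python27//2.xlsx'
rb=xlrd.open_workbook(excel)
wb=copy(rb)
sheet=wb.get_sheet(0)
x=0
y=25
for tag in reo:
    sheet.write(x,y,tag)
    x+=1

wb.save(excel)

The file in that folder contains the different company names. Ideally, double-clicking the .py file would start the scraping automatically. My current script does run automatically, but it is too slow, and it can only scrape a few companies before verification is required.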

I'm new to this, so my way of asking may be a bit odd. Please bear with me; any help is appreciated.

vueper #1

Nobody?

songsunli #2

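Beginners run into the same handful of scraping problems, so I'll get straight to the point:

Blocked IP: the simplest fix is to add time.sleep(random.uniform(1, 3)) between requests so you don't scrape too fast. If you really need to avoid blocks, use a proxy IP pool; with requests, pass proxies={'http': 'http://ip:port'} (add an 'https' entry for HTTPS URLs).
Can't parse the data: don't fight with regular expressions; use BeautifulSoup or lxml. Install: pip install beautifulsoup4 lxml. Example: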

from bs4 import BeautifulSoup
import requests

html = requests.get('http://example.com').text
soup = BeautifulSoup(html, 'lxml')
title = soup.find('h1').text  # text of the first <h1> tag

Dynamic pages return nothing: if the page content is loaded by JavaScript, use selenium. Example:

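from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get('http://example.com')
# find_element(By...) works in Selenium 3 and 4; find_element_by_class_name was removed in recent Selenium 4 releases
content = driver.find_element(By.CLASS_NAME, 'content').text
driver.quit()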

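Saving data: for small data sets use CSV (import csv and write a file). To store data in a database, use sqlite3, which ships with Python.
Failed requests: wrap the request in try-except with retries, and use requests.Session() to keep the session.
Encoding problems: after getting the response, set response.encoding = 'utf-8' before reading response.text, or use response.content.decode('utf-8').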
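
The throttling, proxy, and retry advice above can be combined into one small helper. The following is a minimal sketch, not a definitive implementation: `proxy_cycle`, `polite_get`, `PROXY_POOL`, and the proxy addresses are hypothetical names and placeholders, and the function accepts any object with a requests-style `get(url, proxies=..., timeout=...)` method, such as `requests.Session()`.

```python
import itertools
import random
import time

# Placeholder proxy addresses (assumptions); replace with proxies you are allowed to use.
PROXY_POOL = ["http://10.0.0.1:8080", "http://10.0.0.2:8080"]


def proxy_cycle(pool):
    """Yield requests-style proxies dicts, rotating through the pool forever."""
    for address in itertools.cycle(pool):
        yield {"http": address, "https": address}


def polite_get(session, url, proxies_iter, retries=3, delay=(1.0, 3.0), sleep=time.sleep):
    """Fetch url with a random pause, the next proxy on each attempt, and retries.

    session is any object with a requests-style get(); sleep is injectable
    so the pause can be skipped in tests.
    """
    last_error = None
    for _ in range(retries):
        sleep(random.uniform(*delay))        # throttle before every attempt
        try:
            response = session.get(url, proxies=next(proxies_iter), timeout=10)
            response.raise_for_status()      # treat 4xx/5xx responses as failures
            return response
        except Exception as error:           # with requests, narrow to requests.RequestException
            last_error = error
    raise RuntimeError("all %d attempts failed" % retries) from last_error
```

With requests installed, a call might look like `polite_get(requests.Session(), url, proxy_cycle(PROXY_POOL))`. Rotating the proxy on every attempt means a blocked address is not reused for the immediate retry.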

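Summary: scraping is sending requests, getting the data, parsing, and storing. When something goes wrong, check the documentation and the status code.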
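
For the storage step, here is a minimal sketch using only the standard library. The sample rows, the `companies` table, and the function names `write_csv` and `write_sqlite` are made up for illustration.

```python
import csv
import io
import sqlite3

# Made-up sample rows standing in for scraped results.
ROWS = [("Example Co", "Alice"), ("Sample Ltd", "Bob")]


def write_csv(rows, stream):
    """Write a header line plus the rows to any text stream (file or StringIO)."""
    writer = csv.writer(stream)
    writer.writerow(["company", "contact"])
    writer.writerows(rows)


def write_sqlite(rows, connection):
    """Insert the rows into a 'companies' table, creating it if needed."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS companies (company TEXT, contact TEXT)"
    )
    connection.executemany("INSERT INTO companies VALUES (?, ?)", rows)
    connection.commit()


buffer = io.StringIO()             # in-memory stand-in for an open file
write_csv(ROWS, buffer)

db = sqlite3.connect(":memory:")   # pass a file path instead to persist the data
write_sqlite(ROWS, db)
count = db.execute("SELECT COUNT(*) FROM companies").fetchone()[0]
```

To write a real CSV file in Python 3, open it with `open('out.csv', 'w', newline='', encoding='utf-8')` and pass the file object as `stream`.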

nodeper #3

Post the code as a gist so the indentation is clear.

zlyuanteng #4

import os
import selenium.webdriver as webdriver
driver=webdriver.Chrome()
import xlrd
data = xlrd.open_workbook("C://Python27//2.xlsx")
table = data.sheets()[0]
nrows = table.nrows
ncols = table.ncols
rowValues=[]
for i in xrange(0,nrows):
    rowValues.append(table.row_values(i))
a=[]
for r in rowValues:
    s = ('').join(r)
    base_url = 'http://www.qichacha.com/search?key=' + s
    a.append(base_url)

res=[]
for r in a:
    driver.get(r)
    results=driver.find_elements_by_xpath("//tr[1]//td[2]/p[1][@class='m-t-xs']/a")
    for result in results:
        res.append(result.text)

from xlutils.copy import copy
from xlrd import open_workbook
from xlwt import easyxf
excel=r'C://Python27//2.xlsx'
rb=xlrd.open_workbook(excel)
wb=copy(rb)
sheet=wb.get_sheet(0)
x=0
y=5
for tag in res:
    sheet.write(x,y,tag)
    x+=1

wb.save(excel)

That is roughly the process. How do I use proxy IPs to avoid the page verification?
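
One way to route a Selenium session through a proxy is Chrome's `--proxy-server` command-line switch, set through `ChromeOptions`. The sketch below is an illustration under stated assumptions, not a tested fix for this site: the proxy addresses are placeholders, `proxy_flags` and `new_driver` are hypothetical helper names, and a proxy alone may not stop the verification page.

```python
import itertools

# Placeholder proxy addresses (assumptions); substitute proxies you are allowed to use.
PROXIES = ["http://10.0.0.1:8080", "http://10.0.0.2:8080"]


def proxy_flags(proxies):
    """Yield one Chrome --proxy-server flag per proxy, cycling forever."""
    for address in itertools.cycle(proxies):
        yield "--proxy-server=" + address


def new_driver(flag):
    """Start a Chrome session that sends its traffic through the given proxy flag."""
    # Imported here so proxy_flags() can be used and tested without Selenium installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument(flag)
    return webdriver.Chrome(options=options)


flags = proxy_flags(PROXIES)
first_flag = next(flags)
# driver = new_driver(first_flag)   # uncomment to launch Chrome through the first proxy
```

Chrome reads the switch at startup, so changing proxies means calling `driver.quit()` and creating a new driver with `new_driver(next(flags))`, for example after every few companies.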