How to implement a Tumblr crawler in Python

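I wrote this several months ago, and it is pretty rough.
It does not crawl a blog's entire content. It was originally written for use on a website, and crawling everything would keep users waiting too long.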

# -*- coding=utf-8 -*-
# Python 2 script
from threading import Thread
import Queue
import requests
import re
import os
import sys
import time

# Tumblr read API: 50 posts per request, paged through the "start" offset
api_url = 'http://%s.tumblr.com/api/read?&num=50&start='
UQueue = Queue.Queue()


def getpost(uid, queue):
    url = 'http://%s.tumblr.com/api/read?&num=50' % uid
    page = requests.get(url).content
    # total number of posts, read from the <posts start="0" total="..."> tag
    total = re.findall('<posts start="0" total="(.*?)">', page)[0]
    total = int(total)
    # start offsets 0, 50, 100, ... that are below total
    a = [i * 50 for i in range(1000) if i * 50 - total < 0]
    ul = api_url % uid
    for i in a:
        queue.put(ul + str(i))


# search for url of maximum size of a picture, which starts with
# '<photo-url max-width="1280">' and ends with '</photo-url>'
extractpicre = re.compile(r'(?<=<photo-url max-width="1280">).+?(?=</photo-url>)', flags=re.S)
extractvideore = re.compile('/tumblr_(.*?)" type="video/mp4"')
video_links = []
pic_links = []
vhead = 'https://vt.tumblr.com/tumblr_%s.mp4'


class Consumer(Thread):
    def __init__(self, l_queue):
        super(Consumer, self).__init__()
        self.queue = l_queue

    def run(self):
        session = requests.Session()
        while 1:
            try:
                # non-blocking get: a blocking get() can hang forever if another
                # thread takes the last item right after an empty() check
                link = self.queue.get_nowait()
            except Queue.Empty:
                break
            print 'start parse post: ' + link
            try:
                content = session.get(link).content
                videos = extractvideore.findall(content)
                video_links.extend([vhead % v for v in videos])
                pic_links.extend(extractpicre.findall(content))
            except Exception:
                print 'url: %s parse failed\n' % link


def main():
    task = []
    for i in range(min(10, UQueue.qsize())):
        t = Consumer(UQueue)
        task.append(t)
    for t in task:
        t.start()
    # join() has to be called; once every thread is joined no polling loop is needed
    for t in task:
        t.join()


def write():
    videos = [i.replace('/480', '') for i in video_links]
    pictures = pic_links
    with open('pictures.txt', 'w') as f:
        for i in pictures:
            f.write('%s\n' % i)
    with open('videos.txt', 'w') as f:
        for i in videos:
            f.write('%s\n' % i)


if __name__ == '__main__':
    # name = sys.argv[1]
    # name = name.strip()
    name = 'mzcyx2011'
    getpost(name, UQueue)
    main()
    write()

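ionicwang #1 (OP)

Mark

yuanlaile #2

To write a Tumblr crawler, the most direct approach is to use their official API. If you would rather scrape the public pages directly, requests and BeautifulSoup also work, but you have to watch out for anti-scraping measures and changes in page structure.

Here is an API-based example, which is the most stable and recommended approach. First, register an OAuth application with Tumblr to get an API key (also called the Consumer Key).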

import requests
import json

# Replace with your own API key
API_KEY = 'YOUR_API_KEY_HERE'
BLOG_NAME = 'staff'  # Name of the blog to crawl, e.g. the official blog 'staff'

# Build the API request URL for the blog's posts
url = f'https://api.tumblr.com/v2/blog/{BLOG_NAME}.tumblr.com/posts?api_key={API_KEY}'

try:
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # Check that the request succeeded
    data = response.json()

    # Parse the returned JSON data
    posts = data.get('response', {}).get('posts', [])

    for post in posts:
        # Handle the content according to the post type here
        post_id = post.get('id')
        post_type = post.get('type')
        summary = post.get('summary', 'No summary')

        print(f"Post ID: {post_id}")
        print(f"Type: {post_type}")
        print(f"Summary: {summary[:100]}...")  # Print the first 100 characters
        print("-" * 40)

except json.JSONDecodeError:
    # Listed first: in recent requests versions the JSON error is also a RequestException
    print("Failed to parse the JSON response")
except requests.exceptions.RequestException as e:
    print(f"Request failed: {e}")

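This script fetches the latest posts of the specified blog. The Tumblr API returns JSON with a clear structure. The post['type'] field tells you whether a post is text (text), a photo (photo), a quote (quote), or another type, and you need separate handling logic for each.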
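
One way to organize the per-type handling is a small dispatch function. The sketch below is for Python 3. It assumes that photo posts carry a `photos` list whose entries have an `original_size` with a `url`, that text posts carry `title` and `body`, and that quote posts carry `text`. These field names are assumptions and should be checked against a real response. `extract_media` is a hypothetical helper name.

```python
def extract_media(post):
    """Return a list of (kind, value) tuples for one post dict."""
    post_type = post.get('type')
    if post_type == 'photo':
        # Assumed layout: post['photos'][i]['original_size']['url']
        found = []
        for photo in post.get('photos', []):
            url = photo.get('original_size', {}).get('url')
            if url:
                found.append(('photo', url))
        return found
    if post_type == 'text':
        # Prefer the title, fall back to the body
        return [('text', post.get('title') or post.get('body', ''))]
    if post_type == 'quote':
        return [('quote', post.get('text', ''))]
    # Any other type is reported instead of silently dropped
    return [('unhandled', post_type)]
```

Calling `extract_media(post)` inside the `for post in posts:` loop would replace the three print statements with type-specific results.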

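To crawl more posts, the API response has a _links field containing the link to the next page, which you can use for paginated requests. Remember to add a delay between requests so you do not hammer their server.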
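
As a rough Python 3 sketch of paging with a delay: the version below does not follow `_links`. It swaps in the `offset` and `limit` query parameters instead, which is an assumption about how you want to page. `iter_posts` and `make_fetcher` are hypothetical names, and the fetch function is passed in so the paging loop can be tried without network access.

```python
import time


def iter_posts(fetch_page, page_size=20, delay=1.0, max_posts=None):
    """Yield posts page by page until an empty page comes back.

    fetch_page(offset, limit) must return a list of post dicts.
    """
    offset = 0
    yielded = 0
    while True:
        posts = fetch_page(offset, page_size)
        if not posts:
            return
        for post in posts:
            yield post
            yielded += 1
            if max_posts is not None and yielded >= max_posts:
                return
        offset += len(posts)
        time.sleep(delay)  # pause between requests


def make_fetcher(blog_name, api_key):
    """Build a fetch_page function backed by the posts endpoint (assumed parameters)."""
    import requests

    def fetch_page(offset, limit):
        resp = requests.get(
            'https://api.tumblr.com/v2/blog/%s.tumblr.com/posts' % blog_name,
            params={'api_key': api_key, 'offset': offset, 'limit': limit},
            timeout=10,
        )
        resp.raise_for_status()
        return resp.json().get('response', {}).get('posts', [])

    return fetch_page
```

Usage would look like `for post in iter_posts(make_fetcher('staff', API_KEY), max_posts=100): ...`.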

Summary: the official API is the least hassle.

itying888 #3

I forgot to deduplicate! In the write function:
videos = list(set(videos))
pictures = list(set(pictures))
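
Note that `list(set(...))` does not keep the links in their original order. If the order matters, a small helper along these lines keeps the first occurrence of each link. `dedupe_keep_order` is a hypothetical name, not part of the original script.

```python
def dedupe_keep_order(links):
    """Remove duplicate links while keeping the first occurrence of each."""
    seen = set()
    unique = []
    for link in links:
        if link not in seen:
            seen.add(link)
            unique.append(link)
    return unique
```

In write(), `videos = dedupe_keep_order(videos)` would then stand in for `list(set(videos))`.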

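zlyuanteng #4

Mark, I will take a look after I get up tomorrow.

zlyuanteng #5

mark

wuwangju #6

But I don't know how to use it.

phonegap100 #7

Added a download feature:
https://gist.github.com/zhiyue/f7121aefc00640cb13bb0eded10c5312.js

gougou168 #8

Downloading with Python does not make much sense, since it is slow. That is why I write the links to a file, which can then be downloaded with Xunlei.

sinazl #9

A real necessity! Nutri-Express for sale!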

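ionicwang #10 (OP)

Just change name and run it directly.

nodeper #11

My personal website currently has more than 5,000 parsed blogs 😝

zlyuanteng #12

Exactly. Parsing out the URLs and letting a download tool fetch them is the most efficient way.

eggper #13

Where is it?

yibo5220 #14

Thanks, OP.

songsunli #15

At the very bottom.

bupafengyu #16

There are plenty of online parsers like this (⊙o⊙)! 😂

caililin #17

olddrivertaketakeme

itying888 #18

Isn't it blocked by the firewall? Do you download on a VPS?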

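wuwangju #19

I ran the download with 8 processes and it does not feel slow. What makes it slow?

ionicwang #20 (OP)

This is nice, but I think crawling the names of the tumblr blogs that provide the resources matters more.

eggper #21

My website is hosted on an overseas VPS and also does online parsing.

bupafengyu #22

Nothing can be done about the names.

bupafengyu #23

Mark

yuanlaile #24

So then you can just wget them?

sinazl #25

Could you briefly describe what the crawler produces...

yuanlaile #26

Bookmarked.

wuwangju #27

What should name be changed to? Could you share a list? :)

bupafengyu #28

Requesting a name_list.

wuwangju #29

mark

ionicwang #30 (OP)

This happens halfway through the download: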

Traceback (most recent call last):
  File "turmla.py", line 150, in <module>
    for square in tqdm(pool.imap_unordered(download_base_dir, urls), total=len(urls)):
  File "/home/leetom/.pyenv/versions/2.7.10/lib/python2.7/site-packages/tqdm/_tqdm.py", line 713, in __iter__
    for obj in iterable:
  File "/home/leetom/.pyenv/versions/2.7.10/lib/python2.7/multiprocessing/pool.py", line 668, in next
    raise value
Exception: Unexpected response.

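itying888 #31

Mark. Ah, the old drivers hit the road at the drop of a hat.

yuanlaile #32

mark

yibo5220 #33

Does nobody know about www.tumblrget.com?

gougou168 #34

It doesn't work.

nodeper #35

Use a VPN =.=

itying888 #36

Strategic Mark

sinazl #37

Isn't there a ready-made API?

wuwangju #38

Isn't this already using the API?

eggper #39

I also crawled it with golang... then it got blocked and I stopped.

caililin #40

Silently giving a like :)

ionicwang #41 (OP)

Stirring up trouble all day long.

yuanlaile #42

Trouble, trouble.

vueper #43

Don't you all stir up trouble.

yibo5220 #44

OP, I want to visit your website, I want to be your fan 😄

h691938207 #45

Not suitable for children, hahaha.

nodeper #46

The download script: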
Traceback (most recent call last):
  File "./1.py", line 138, in <module>
    getpost(name, UQueue)
  File "./1.py", line 27, in getpost
    total = re.findall('<posts start="0" total="(.*?)">', page)[0]
IndexError: list index out of range
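
This IndexError means the regular expression found no `<posts ... total="...">` tag in the page, so `[0]` was applied to an empty list. The response may not have been the expected XML, although the traceback alone does not say why. A defensive Python 3 sketch that returns None instead of raising is shown below. `parse_total` is a hypothetical helper, and the looser pattern that tolerates extra attributes is an assumption, not the original script's regex. In Python 3 the page should be passed as text (for example `response.text`).

```python
import re

# Looser than the original pattern: allows other attributes around total
TOTAL_RE = re.compile(r'<posts[^>]*\btotal="(\d+)"')


def parse_total(page):
    """Return the post count from an API XML page, or None if it is absent."""
    match = TOTAL_RE.search(page)
    return int(match.group(1)) if match else None
```

The caller can then test `if total is None:` and print the start of the page to see what actually came back.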

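sinazl #47

import urllib

i = 0  # counter used in the file names
# Python 2; the getpic directory must already exist
with open('pictures.txt', 'r') as fobj:
    for eachline in fobj:
        pngurl = eachline.strip()
        filename = './/getpic//test-{0}.jpg'.format(i)
        print '[-]parsing:{0}'.format(filename)
        urllib.urlretrieve(pngurl, filename)
        i += 1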

sinazl #48

for i in range(0, total, 50): queue.put(ul+str(i))
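
A quick way to see what this loop queues: the sketch below wraps the stepped range in a hypothetical `page_offsets` helper. With a step of 50 it produces one start offset per page and has no fixed upper limit on the number of pages.

```python
def page_offsets(total, page_size=50):
    """Return the start offsets needed to cover `total` posts."""
    return list(range(0, total, page_size))


# For comparison: a comprehension that filters a fixed range of 1000 pages
def page_offsets_capped(total):
    return [i * 50 for i in range(1000) if i * 50 - total < 0]
```

For any total up to 50,000 posts both functions return the same offsets. Beyond that, only the stepped range keeps going.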

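phonegap100 #49

After reading this, I feel I learned python for nothing...
Other people's crawlers use multithreading, queues, and classes.
Mine are all... while if for ...

nodeper #50

Multithreading is about speed. A 3-hour job gets done in 1 hour, how great is that!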
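
For comparison with the Thread and Queue version in the original post, a thread pool from the Python 3 standard library can express the same idea in a few lines. This is only a sketch: `fetch_all` is a hypothetical name, `fetch` is whatever function downloads one URL, and the actual speedup depends on how much of the time is spent waiting on the network.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(urls, fetch, max_workers=10):
    """Apply fetch to every URL on a pool of threads.

    Results come back in the same order as the input URLs.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, urls))
```

With requests, `fetch` could be as simple as `lambda u: requests.get(u, timeout=10).text`.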

sinazl #51

Asking for the OP's website.

itying888 #52

#50 That cannot be disclosed.

sinazl #53

May I ask what your blog is? I would like to study crawlers in more depth.