Meituan movie crawler / cracking Meituan's movie price image obfuscation

https://github.com/HiddenStrawberry/meituan-movie-price-crawler

Project challenges:

Let's start by opening any Meituan movie page.

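Looks great: the price is printed right there on the page! Just scrape it and we're done, right?

Then you take a closer look at the source and... what on earth is this?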

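Open the image URL and it all makes sense: the page loads one large image containing a bunch of digits, and CSS positioning selects which digit is displayed. Meituan, you really went to great lengths to block crawlers...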
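To make the trick concrete, here is a minimal sketch of how a CSS-positioned digit maps back to a region of the big image. The inline style format, the default 10x16 px glyph size, and the names css_to_crop_box and crop_digit are all assumptions for illustration; they are not taken from Meituan's actual markup or from the project's code.

```python
import re

# Matches e.g. "background-position: -20px -4px" (assumed style format).
_POS_RE = re.compile(
    r"background-position\s*:\s*(-?\d+(?:\.\d+)?)(?:px)?\s+(-?\d+(?:\.\d+)?)(?:px)?"
)


def css_to_crop_box(style, width, height):
    """Translate a CSS background-position into a (left, top, right, bottom) box.

    A negative offset shifts the sprite left/up, so the visible region starts
    at the negated offset inside the sprite image.
    """
    match = _POS_RE.search(style)
    if match is None:
        raise ValueError("no background-position found in: %r" % style)
    left = int(round(-float(match.group(1))))
    top = int(round(-float(match.group(2))))
    return (left, top, left + width, top + height)


def crop_digit(sprite_path, style, width=10, height=16):
    """Cut one glyph out of the sprite image (requires Pillow)."""
    from PIL import Image  # imported lazily so the parser works without Pillow

    with Image.open(sprite_path) as sprite:
        region = sprite.crop(css_to_crop_box(style, width, height))
        region.load()  # make sure pixel data is read before the file closes
        return region
```

Each cropped glyph (or the glyphs of one price stitched side by side) is then a plain image of digits that an OCR engine can read.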

Cracked

Requirements:

bs4 requests Pillow/PIL

tesseract-ocr must be installed separately.

Usage:

Install tesseract-ocr.
Copy num.traineddata into tesseract's tessdata directory.
Edit the TESSERACT_PATH variable in meituan_price_img.py so that it points to tesseract.exe (absolute path).
Open meituan.py and enjoy!
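Before running the crawler, the setup steps above can be sanity-checked with something like the following. This is only a sketch: check_tesseract_setup is a hypothetical helper that is not part of the project, and it assumes the tessdata directory sits next to the tesseract executable, which may not match your install.

```python
import os


def check_tesseract_setup(tesseract_path, lang="num"):
    """Return a list of problems found; an empty list means the setup looks fine."""
    problems = []
    if not os.path.isfile(tesseract_path):
        problems.append("tesseract executable not found: " + tesseract_path)
    # Assumption: tessdata lives in the same directory as the executable.
    tessdata_dir = os.path.join(os.path.dirname(tesseract_path), "tessdata")
    trained = os.path.join(tessdata_dir, lang + ".traineddata")
    if not os.path.isfile(trained):
        problems.append("missing trained data: " + trained)
    return problems
```

Call it with the same value you put into TESSERACT_PATH and print whatever it returns.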

Example:

print(get_city_url('上海'))  # get the URL for a city ('上海' is Shanghai)
print(get_all_cinema('sh.meituan.com'))  # get all cinemas in the city
print(get_cinema_movie('http://sh.meituan.com/shop/58174'))  # get all showtimes for the given cinema

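How it works:

You've already seen tesseract-ocr, so do I really need to spell it out? Tesseract was trained on samples of every digit (accurate to 1px), so it recognizes the digits automatically and outputs them. PS: if a showing has a mobile-only price, the mobile-only price is output automatically!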
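For the curious, the recognition step could look roughly like this. It is a sketch under assumptions, not the project's actual code: the function names are made up, and it relies on the tesseract command-line form `tesseract <image> stdout -l <lang>`, where `num` refers to the num.traineddata file mentioned above.

```python
import subprocess


def build_tesseract_command(tesseract_path, image_path, lang="num"):
    """Build the CLI call: tesseract <image> stdout -l <lang>."""
    return [tesseract_path, image_path, "stdout", "-l", lang]


def clean_price(raw_text):
    """Keep only digits and the decimal point from the OCR output."""
    return "".join(ch for ch in raw_text if ch in "0123456789.")


def recognize_price(tesseract_path, image_path, lang="num"):
    """Run Tesseract on a price image and return the cleaned-up digits."""
    command = build_tesseract_command(tesseract_path, image_path, lang)
    raw = subprocess.check_output(command).decode("utf-8", "ignore")
    return clean_price(raw)
```

Filtering the output down to digits and the decimal point is a cheap guard against stray whitespace or noise characters in the OCR result.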

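Implementing a Meituan (Maoyan) movie ticket price and info crawler in Python

bupafengyu #1

Here is a Meituan (Maoyan) movie ticket crawler. It can fetch the movie list, ticket prices, showtimes, and related information.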

import requests
from datetime import datetime
from typing import List, Dict
import pandas as pd

class MeituanMovieSpider:
    def __init__(self):
        self.session = requests.Session()
        self.headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
            'Referer': 'https://maoyan.com/'
        }
        self.base_url = "https://m.maoyan.com/ajax/movieOnInfoList"
        
    def get_movie_list(self, city_id: int = 1) -> List[Dict]:
        """Fetch the list of movies currently showing."""
        params = {
            'cityId': city_id,
            'token': '',
            'optimus_uuid': self._generate_uuid(),
            'optimus_risk_level': 71,
            'optimus_code': 10
        }

        try:
            response = self.session.get(
                self.base_url,
                headers=self.headers,
                params=params,
                timeout=10
            )
            response.raise_for_status()
            data = response.json()
            
            movies = []
            for movie in data.get('movieList', []):
                movie_info = {
                    'id': movie.get('id'),
                    'name': movie.get('nm'),
                    'english_name': movie.get('enm'),
                    'score': movie.get('sc'),
                    'wish_count': movie.get('wish'),
                    'director': movie.get('dir'),
                    'actors': movie.get('star'),
                    'category': movie.get('cat'),
                    'duration': movie.get('dur'),
                    'release_date': movie.get('rt'),
                    'poster': movie.get('img')
                }
                movies.append(movie_info)
                
            return movies
            
        except Exception as e:
            print(f"Failed to fetch movie list: {e}")
            return []
    
    def get_movie_detail(self, movie_id: int, city_id: int = 1) -> Dict:
        """Fetch details for one movie, including cinemas and showtimes."""
        detail_url = "https://m.maoyan.com/ajax/movie"
        params = {
            'movieId': movie_id,
            'cityId': city_id
        }
        
        try:
            response = self.session.get(
                detail_url,
                headers=self.headers,
                params=params,
                timeout=10
            )
            response.raise_for_status()
            return response.json()
            
        except Exception as e:
            print(f"Failed to fetch movie details: {e}")
            return {}
    
    def get_cinema_schedule(self, movie_id: int, cinema_id: int, date: str) -> List[Dict]:
        """Fetch the showtimes for a movie at one cinema on a given date."""
        schedule_url = "https://m.maoyan.com/ajax/cinemaDetail"
        params = {
            'movieId': movie_id,
            'cinemaId': cinema_id,
            'date': date
        }
        
        try:
            response = self.session.get(
                schedule_url,
                headers=self.headers,
                params=params,
                timeout=10
            )
            response.raise_for_status()
            data = response.json()
            
            schedules = []
            for schedule in data.get('showDates', []):
                for show in schedule.get('plist', []):
                    schedule_info = {
                        'cinema_id': cinema_id,
                        'movie_id': movie_id,
                        'show_date': date,
                        'show_time': show.get('tm'),
                        'language': show.get('lang'),
                        'hall': show.get('th'),
                        'price': show.get('sellPr'),
                        'vip_price': show.get('vipPrice'),
                        'seat_status': show.get('seatStatus')
                    }
                    schedules.append(schedule_info)
                    
            return schedules
            
        except Exception as e:
            print(f"Failed to fetch showtimes: {e}")
            return []
    
    def _generate_uuid(self) -> str:
        """Generate a random UUID as 32 hex characters (no dashes)."""
        import uuid
        return uuid.uuid4().hex
    
    def save_to_csv(self, data: List[Dict], filename: str):
        """Save the data to a CSV file."""
        if not data:
            print("No data to save")
            return

        df = pd.DataFrame(data)
        df.to_csv(filename, index=False, encoding='utf-8-sig')
        print(f"Data saved to {filename}")

# Usage example
if __name__ == "__main__":
    spider = MeituanMovieSpider()

    # 1. Fetch the movie list (Beijing by default)
    print("Fetching movie list...")
    movies = spider.get_movie_list(city_id=1)  # 1 is Beijing

    if movies:
        print(f"Found {len(movies)} movies")

        # Save the movie list
        spider.save_to_csv(movies, "movies_list.csv")

        # 2. Fetch details for the first movie
        movie_id = movies[0]['id']
        print(f"\nFetching details for movie ID {movie_id}...")
        detail = spider.get_movie_detail(movie_id)

        # 3. Fetch showtimes (example: the first cinema, today)
        if detail.get('cinemaList'):
            cinema_id = detail['cinemaList'][0]['id']
            today = datetime.now().strftime('%Y-%m-%d')
            print(f"\nFetching showtimes for cinema ID {cinema_id}...")
            schedules = spider.get_cinema_schedule(movie_id, cinema_id, today)

            if schedules:
                spider.save_to_csv(schedules, "movie_schedules.csv")
                print(f"Found {len(schedules)} showtimes")

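Main features of this crawler:

Movie list: fetches the movies currently showing in the given city
Movie details: director, cast, rating, and so on
Showtimes: fetches specific cinemas, showings, and ticket prices
Data export: results can be saved as CSV

A few things to keep in mind:

Maoyan has anti-crawling measures, so suitable request headers and parameters are required
City IDs have to be looked up yourself (e.g. Beijing = 1, Shanghai = 10, Guangzhou = 20)
In real use you may need to handle CAPTCHAs or login state
Adding a reasonable delay between requests is recommended to avoid getting your IP banned

To run the crawler:

pip install requests pandas
python maoyan_spider.py

The crawler writes the movie list and showtimes to CSV files. You can change the city ID, date, and other parameters as needed.

Summary: this crawler can fetch basic movie information and ticket price data from Meituan.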
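As one way to act on the last point, here is a small sketch of a request throttle. The Throttle class, its parameter names, and the default delays are assumptions for illustration; tune them to whatever the site tolerates.

```python
import random
import time


class Throttle:
    """Enforce a minimum, slightly randomized gap between consecutive requests."""

    def __init__(self, min_delay=1.0, jitter=0.5,
                 clock=time.monotonic, sleep=time.sleep):
        self.min_delay = min_delay
        self.jitter = jitter
        self._clock = clock    # injectable so the class can be tested
        self._sleep = sleep
        self._last = None      # time of the previous request, if any

    def wait(self):
        """Sleep just long enough to respect the delay, then record the time."""
        if self._last is not None:
            target = self.min_delay + random.uniform(0, self.jitter)
            remaining = target - (self._clock() - self._last)
            if remaining > 0:
                self._sleep(remaining)
        self._last = self._clock()
```

Create one Throttle and call its wait() method right before each spider.get_movie_list, get_movie_detail, or get_cinema_schedule call; the first call returns immediately and later calls pause as needed.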

yibo5220 #2

Could this library be used to tackle CAPTCHAs as well?