Feasibility of scraping Baidu Images with Node.js

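One of our company's projects is about to launch, and we need to pre-populate some image data for each city. My idea is to parse the <img> tags on Baidu Images pages, save the images locally, and then upload them to our own server. Could anyone with experience tell me whether this is feasible, and recommend Node.js components for it (HTTP requests, HTML parsing)?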

sinazl #1 (original poster)

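Feasibility analysis

From a technical standpoint, scraping Baidu Images with Node.js and saving the results locally is entirely feasible. Note, however, that Baidu's image search pages may be protected by anti-crawler mechanisms such as CAPTCHAs and IP bans, so these cases need to be handled in practice.

Sample code

Below is a simple example that uses axios (HTTP requests) and cheerio (HTML parsing) to fetch images from Baidu Images and save them locally.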

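const axios = require('axios');
const cheerio = require('cheerio');
const fs = require('fs');
const path = require('path');

// Target URL
const url = 'https://image.baidu.com/';

// Make sure the output directory exists before writing into it
const outputDir = path.join(__dirname, 'images');
fs.mkdirSync(outputDir, { recursive: true });

// Send an HTTP request to fetch the page content
axios.get(url)
  .then(response => {
    const html = response.data;
    const $ = cheerio.load(html);

    // Walk every <img> tag in the HTML
    $('img').each((index, element) => {
      const rawSrc = $(element).attr('src');
      // Skip tags without a src and inline data: URIs
      if (!rawSrc || rawSrc.startsWith('data:')) return;

      // Resolve relative and protocol-relative paths against the page URL
      const src = new URL(rawSrc, url).href;
      console.log(`Fetching image: ${src}`);

      // Request the image as binary data
      axios.get(src, { responseType: 'arraybuffer' })
        .then(imgResponse => {
          const imgData = imgResponse.data;

          // Build the save path
          const filePath = path.join(outputDir, `image-${index}.jpg`);

          // Write the image to a local file
          fs.writeFileSync(filePath, imgData);
          console.log(`Image saved at: ${filePath}`);
        })
        .catch(err => {
          console.error(`Failed to fetch image: ${src}`, err);
        });
    });
  })
  .catch(err => {
    console.error('Failed to fetch HTML content', err);
  });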

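Notes

Legality: before scraping any website, make sure you comply with its terms of service and with applicable laws and regulations.
Performance: when scraping large amounts of data, take care not to put excessive load on the target server.
Error handling: a real application needs more thorough error handling to keep the program robust.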
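
The performance and error-handling notes above can be made a little more concrete. What follows is only a minimal sketch, assuming downloads run one at a time with a fixed pause and a small retry count; the names `sleep`, `withRetry`, and `runSequentially` and the default delays are illustrative and do not come from the original post.

```javascript
// Hypothetical helpers: none of these names come from the original post.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Retry an async task up to `retries` extra times, pausing between attempts.
async function withRetry(task, retries = 2, delayMs = 500) {
  let lastError;
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await task();
    } catch (err) {
      lastError = err;
      if (attempt < retries) await sleep(delayMs);
    }
  }
  throw lastError;
}

// Run tasks one at a time with a pause in between, so the target server
// never sees a burst of parallel requests. A failed task is recorded and
// the remaining tasks still run.
async function runSequentially(tasks, pauseMs = 1000) {
  const results = [];
  for (let i = 0; i < tasks.length; i++) {
    try {
      results.push({ ok: true, value: await withRetry(tasks[i]) });
    } catch (error) {
      results.push({ ok: false, error });
    }
    if (i < tasks.length - 1) await sleep(pauseMs);
  }
  return results;
}
```

Each task would be a function that starts one download, for example `() => axios.get(src, { responseType: 'arraybuffer' })`.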

With the steps above, you can use Node.js to fetch images from Baidu Images and save them locally.

nodeper #2

For requests you can use spidex.

For parsing HTML you can use cheerio.

Take a look here: http://xcoder.in/2013/12/28/xplan-spider-doc/#Cheerio%E6%A8%A1%E5%9D%97*

sinazl #3 (original poster)

https://github.com/XadillaX/spidex

htzhanglong #4

Not sure whether this will help you: http://www.9958.pw/post/js_html_img_url

phonegap100 #5

Thanks for your reply. Can spidex upload files?

wuwangju #6

Has the image download feature been implemented in this project of yours?

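h691938207 #7

I have scraped it before. Downloading directly over HTTP works without any problems. spidex is a full crawler, not just an image fetcher, so it is more than you need.

var fs = require("fs");
var http = require("http");

function downloadImage(url) {
    // Build a file name from the month, day, and timestamp
    var date = new Date();
    var file_name = (date.getMonth() + 1).toString() + '-' + date.getDate().toString() + "-" + date.getTime();
    file_name += '.jpg';
    // The ./tmp directory must already exist
    var file = fs.createWriteStream("./tmp/" + file_name);
    http.get(url, function(res) {
        console.log(res.headers);

        // Write each chunk as it arrives, then close the file
        res.on('data', function(data) {
            file.write(data);
        }).on('end', function() {
            file.end();
            console.log('download success');
        });
    });
}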
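
One caveat with the snippet above: Node's `http.get` only accepts `http:` URLs and throws on `https:` ones. Here is a minimal sketch of a variant that picks the module from the URL's protocol and writes the file through `stream.pipeline`. The names `pickClient` and `download` are illustrative, and redirects are not followed.

```javascript
const fs = require('fs');
const http = require('http');
const https = require('https');
const { pipeline } = require('stream');

// http.get() rejects https: URLs, so pick the module from the URL's protocol.
function pickClient(url) {
  return new URL(url).protocol === 'https:' ? https : http;
}

// Stream the response body to `dest`; `done` is a Node-style callback
// that receives an error, or nothing on success.
function download(url, dest, done) {
  pickClient(url).get(url, (res) => {
    if (res.statusCode !== 200) {
      res.resume(); // discard the body so the socket is released
      return done(new Error(`Unexpected status ${res.statusCode} for ${url}`));
    }
    // pipeline closes both streams and forwards any error to `done`
    pipeline(res, fs.createWriteStream(dest), done);
  }).on('error', done);
}
```

A call might look like `download(imageUrl, './tmp/1.jpg', (err) => { if (err) console.error(err); })`, where `imageUrl` is whatever URL you extracted.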

vueper #8

One more thing: a regular expression is enough to pull out the URLs, so you don't need that many components.
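
As a rough illustration of the regular-expression approach, here is a minimal sketch that assumes simple, well-formed markup with quoted `src` attributes. The function name `extractImageUrls` is made up for this example, and a pattern like this is not a general HTML parser.

```javascript
// Collect the src attribute of every <img> tag with a regular expression.
// Requiring whitespace before "src" keeps attributes such as data-src
// from being picked up by mistake.
function extractImageUrls(html) {
  const pattern = /<img\b[^>]*?\ssrc\s*=\s*(["'])(.*?)\1/gi;
  const urls = [];
  for (const match of html.matchAll(pattern)) {
    urls.push(match[2]); // group 2 is the text between the quotes
  }
  return urls;
}
```

Unquoted attribute values and tags split in unusual ways are not handled, which is the usual trade-off against a real parser such as cheerio.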

htzhanglong #9

He needs to parse the HTML first…

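gougou168 #10

Scraping Baidu Images with Node.js is entirely feasible, and a few Node.js libraries cover the whole job. You need an HTTP client to fetch the page, an HTML parser to extract the image URLs, and finally a way to download and save the images.

Feasibility analysis

HTTP requests: send them with a library such as axios or node-fetch.
HTML parsing: use cheerio to parse the HTML document and extract the image URLs you need.
File download: use the fs module to save the images locally, then upload them to your server through your own API.

Sample code

First, make sure the required libraries are installed: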

npm install axios cheerio fs-extra

Next, write a simple script that scrapes Baidu Images and saves the results locally:

const axios = require('axios');
const cheerio = require('cheerio');
const fs = require('fs-extra');

async function fetchImages() {
    const keyword = '风景'; // "scenery"; replace with the keyword you want to search for
    const url = `https://image.baidu.com/search/index?tn=baiduimage&word=${encodeURIComponent(keyword)}`;

    try {
        const response = await axios.get(url);
        const $ = cheerio.load(response.data);

        $('img').each((index, element) => {
            const src = $(element).attr('src');
            if (src && src.startsWith('http')) { // only accept src values that are full URLs
                console.log(`Fetching image ${src}`);
                downloadImage(src, `image-${index}.jpg`);
            }
        });
    } catch (error) {
        console.error("Error fetching images:", error);
    }
}

function downloadImage(url, filename) {
    axios({
        url,
        method: 'GET',
        responseType: 'arraybuffer' // outputFile expects a Buffer or string, not a stream
    }).then(async response => {
        // Write the downloaded bytes, creating parent directories if needed
        await fs.outputFile(filename, response.data);
        console.log(`Saved image as ${filename}`);
    }).catch(error => {
        console.error("Error downloading image:", error);
    });
}

fetchImages();

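Explanation

fetchImages function: sends an HTTP request to the Baidu image search page and parses the returned HTML document with Cheerio.
downloadImage function: takes an image URL, downloads the image, and saves it as a local file.
Note that a real run may need to deal with anti-crawler measures, for example by setting request headers and delaying requests.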
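
To illustrate the last point, here is a minimal sketch of how request headers and a randomized delay might be set up. The header values, the timeout, and the names `buildSearchRequest` and `randomDelay` are all assumptions made for this example; which headers Baidu actually checks is not something this thread establishes.

```javascript
// Build the URL and request options for one search page.
// The header values are assumptions: a browser-like User-Agent and a Referer
// are common choices, but what the site actually checks is not known here.
function buildSearchRequest(keyword) {
  const url = new URL('https://image.baidu.com/search/index');
  url.searchParams.set('tn', 'baiduimage');
  url.searchParams.set('word', keyword); // percent-encoded automatically
  return {
    url: url.toString(),
    options: {
      headers: {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
        Referer: 'https://image.baidu.com/',
      },
      timeout: 10000, // give up after 10 s instead of hanging
    },
  };
}

// Random pause between requests, in milliseconds, within [minMs, maxMs].
function randomDelay(minMs = 1000, maxMs = 3000) {
  return minMs + Math.floor(Math.random() * (maxMs - minMs + 1));
}

// Usage with the axios client from the script above (not executed here):
//   const { url, options } = buildSearchRequest('风景');
//   const response = await axios.get(url, options);
//   await new Promise((resolve) => setTimeout(resolve, randomDelay()));
```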

This example shows the basic flow; a real application will likely need more thorough error handling and configuration.
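
The script stops at saving files locally, while the upload step mentioned earlier in this reply is left out. Below is a minimal sketch of that step, assuming Node.js 18 or later (for the built-in `fetch`, `FormData`, and `Blob`) and a server that accepts a multipart POST. The endpoint, the field name `file`, and the function names are hypothetical and would need to match your own API.

```javascript
const fs = require('fs');
const path = require('path');

// Wrap image bytes in a multipart form. The field name "file" is an
// assumption; use whatever your own upload API expects.
function buildUploadForm(buffer, filename) {
  const form = new FormData();
  form.append('file', new Blob([buffer]), filename);
  return form;
}

// POST one local image to a hypothetical endpoint on your own server.
async function uploadImage(filePath, endpoint) {
  const buffer = await fs.promises.readFile(filePath);
  const form = buildUploadForm(buffer, path.basename(filePath));
  const response = await fetch(endpoint, { method: 'POST', body: form });
  if (!response.ok) {
    throw new Error(`Upload failed with status ${response.status}`);
  }
  return response;
}
```

A call might look like `await uploadImage('image-0.jpg', uploadUrl)`, where `uploadUrl` is the address of your own upload API.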
