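Looking for the Node.js developer who is crawling the taobao npm registry

We see a big wave of traffic every afternoon. The logs show someone pulling data as fast as they can, and from multiple servers. Does anyone know who this is?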

211.151.229.245 - 10.159.63.236 [30/Oct/2014:18:56:42 +0800] "GET /uupaa.hmac.js HTTP/1.0" 200 43697 "-" "Ruby"
211.151.229.247 - 10.159.63.122 [30/Oct/2014:18:56:42 +0800] "GET /chakra/download/chakra-0.0.3.tgz HTTP/1.0" 302 5 "-" "Ruby"
211.151.229.227 - 10.159.63.59 [30/Oct/2014:18:56:42 +0800] "GET /chakra/download/chakra-0.0.2.tgz HTTP/1.0" 302 5 "-" "Ruby"
211.151.229.117 - 10.159.63.173 [30/Oct/2014:18:56:42 +0800] "GET /chainy-plugin-swap/download/chainy-plugin-swap-0.1.0.tgz HTTP/1.0" 302 5 "-" "Ruby"
211.151.229.120 - 10.159.63.117 [30/Oct/2014:18:56:42 +0800] "GET /uupaa.eventlistener.js HTTP/1.0" 200 37223 "-" "Ruby"
211.151.229.111 - 10.159.63.102 [30/Oct/2014:18:56:42 +0800] "GET /chale/download/chale-0.1.0.tgz HTTP/1.0" 302 5 "-" "Ruby"
211.151.229.118 - 10.159.63.35 [30/Oct/2014:18:56:42 +0800] "GET /chairs/download/chairs-0.0.0.tgz HTTP/1.0" 302 5 "-" "Ruby"
211.151.229.245 - 10.159.63.100 [30/Oct/2014:18:56:42 +0800] "GET /uupaa.help.js HTTP/1.0" 200 70786 "-" "Ruby"
211.151.229.110 - 10.159.63.134 [30/Oct/2014:18:56:42 +0800] "GET /chakela/download/chakela-0.0.3.tgz HTTP/1.0" 302 5 "-" "Ruby"
211.151.229.227 - 10.159.63.50 [30/Oct/2014:18:56:42 +0800] "GET /chalk-log/download/chalk-log-1.0.5.tgz HTTP/1.0" 302 5 "-" "Ruby"
211.151.229.120 - 10.159.63.3 [30/Oct/2014:18:56:42 +0800] "GET /chalk-log/download/chalk-log-1.0.4.tgz HTTP/1.0" 302 5 "-" "Ruby"
211.151.229.248 - 10.159.63.162 [30/Oct/2014:18:56:42 +0800] "GET /chalk-log/download/chalk-log-1.0.3.tgz HTTP/1.0" 302 5 "-" "Ruby"
211.151.229.245 - 10.159.63.152 [30/Oct/2014:18:56:42 +0800] "GET /chalk-log/download/chalk-log-1.0.2.tgz HTTP/1.0" 302 5 "-" "Ruby"
211.151.229.247 - 10.159.63.124 [30/Oct/2014:18:56:42 +0800] "GET /chalk-log/download/chalk-log-0.0.0.tgz HTTP/1.0" 302 5 "-" "Ruby"
211.151.229.111 - 10.159.63.107 [30/Oct/2014:18:56:42 +0800] "GET /chalk-twig-filters/download/chalk-twig-filters-0.0.3.tgz HTTP/1.0" 302 5 "-" "Ruby"
211.151.229.118 - 10.159.63.39 [30/Oct/2014:18:56:42 +0800] "GET /chalk-twig-filters/download/chalk-twig-filters-0.0.2.tgz HTTP/1.0" 302 5 "-" "Ruby"
211.151.229.117 - 10.159.63.172 [30/Oct/2014:18:56:42 +0800] "GET /chamandir/download/chamandir-0.0.11.tgz HTTP/1.0" 302 5 "-" "Ruby"
211.151.229.227 - 10.159.63.53 [30/Oct/2014:18:56:42 +0800] "GET /chamandir/download/chamandir-0.0.6.tgz HTTP/1.0" 302 5 "-" "Ruby"
211.151.229.110 - 10.159.63.130 [30/Oct/2014:18:56:42 +0800] "GET /chamandir/download/chamandir-0.0.3.tgz HTTP/1.0" 302 5 "-" "Ruby"
211.151.229.229 - 10.159.63.186 [30/Oct/2014:18:56:42 +0800] "GET /chamber/download/chamber-0.0.2.tgz HTTP/1.0" 302 5 "-" "Ruby"
211.151.229.225 - 10.159.63.234 [30/Oct/2014:18:56:42 +0800] "GET /chamber/download/chamber-0.0.1.tgz HTTP/1.0" 302 5 "-" "Ruby"
211.151.229.247 - 10.159.63.120 [30/Oct/2014:18:56:42 +0800] "GET /chameleon/download/chameleon-0.1.0.tgz HTTP/1.0" 302 5 "-" "Ruby"
211.151.229.245 - 10.159.63.62 [30/Oct/2014:18:56:43 +0800] "GET /chan/download/chan-0.2.0.tgz HTTP/1.0" 302 5 "-" "Ruby"
211.151.229.245 - 10.159.63.144 [30/Oct/2014:18:56:43 +0800] "GET /chan/download/chan-0.1.0.tgz HTTP/1.0" 302 5 "-" "Ruby"
211.151.229.111 - 10.159.63.108 [30/Oct/2014:18:56:43 +0800] "GET /chan/download/chan-0.0.0.tgz HTTP/1.0" 302 5 "-" "Ruby"
211.151.229.118 - 10.159.63.42 [30/Oct/2014:18:56:43 +0800] "GET /change/download/change-0.0.0.tgz HTTP/1.0" 302 5 "-" "Ruby"
211.151.229.110 - 10.159.63.10 [30/Oct/2014:18:56:43 +0800] "GET /chalk256/download/chalk256-2.0.0.tgz HTTP/1.0" 302 5 "-" "Ruby"
211.151.229.110 - 10.159.63.33 [30/Oct/2014:18:56:43 +0800] "GET /chance.js/download/chance.js-0.0.1.tgz HTTP/1.0" 302 5 "-" "Ruby"
211.151.229.117 - 10.159.63.169 [30/Oct/2014:18:56:43 +0800] "GET /chancejs/download/chancejs-0.0.7.tgz HTTP/1.0" 302 5 "-" "Ruby"
211.151.229.247 - 10.159.63.120 [30/Oct/2014:18:56:43 +0800] "GET /chancejs/download/chancejs-0.0.8.tgz HTTP/1.0" 302 5 "-" "Ruby"
211.151.229.225 - 10.159.63.229 [30/Oct/2014:18:56:43 +0800] "GET /chancejs/download/chancejs-0.0.6.tgz HTTP/1.0" 302 5 "-" "Ruby"
- - 10.159.34.0 [30/Oct/2014:18:56:43 +0800] "-" 400 0 "-" "-"

htzhanglong #1 (author)

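We noticed a large amount of traffic arriving every afternoon. The logs show that it comes mainly from a small set of IP addresses and that the requests are all fetching npm packages. It looks like some script or program, possibly written in Node.js although the logged User-Agent is "Ruby", is bulk-downloading npm packages from our server.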

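Example code

Suppose we wanted to write a simple Node.js script that mimics this behavior. We could send the HTTP requests with a library such as axios or request. Here is an example using axios: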

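const axios = require('axios');
const fs = require('fs');
const { pipeline } = require('stream/promises');

// List of npm package files to download
const packages = [
    'uupaa.hmac.js',
    'chakra-0.0.3.tgz',
    'chakra-0.0.2.tgz',
    'chainy-plugin-swap-0.1.0.tgz',
    // add more packages here
];

// Download one file and stream it to disk under the same name.
// The URL pattern is kept from the original post; adjust it to the registry's
// real download paths (the log shows e.g. /chakra/download/chakra-0.0.3.tgz).
async function download(pkg) {
    const response = await axios({
        method: 'GET',
        url: `https://npm.taobao.org/mirrors/${pkg}`,
        responseType: 'stream'
    });
    // pipeline() resolves once the file is fully written and rejects on any stream error
    await pipeline(response.data, fs.createWriteStream(pkg));
}

// Download one file at a time so a single failure is reported and skipped.
// (`package` is a reserved word in strict mode, hence the name `pkg`.)
async function main() {
    for (const pkg of packages) {
        try {
            await download(pkg);
        } catch (error) {
            console.error(`Failed to download ${pkg}:`, error.message);
        }
    }
}

main();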

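Explanation

Import the libraries: axios sends the HTTP requests, and fs handles writing the files.
Define the package list: the packages array holds the names of the npm package files to download.
Download loop: the script iterates over the packages, sends a GET request for each with axios, and streams the response body to a local file.
Error handling: any download failure is caught and printed.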

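Notes

Make sure you are authorized to access these npm packages.
In a real production environment, limit concurrency and handle errors properly so that you do not put excessive load on the server.
If you need to download from several different servers, keep in mind that downloads are I/O-bound: Node.js asynchronous I/O with a bounded number of concurrent requests is usually enough, and multiple threads or processes are rarely needed.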
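
As a rough illustration of the concurrency limit mentioned in the notes above, the sketch below runs an async task over a list with at most `limit` tasks in flight. The helper name `mapWithLimit` and the stand-in task `fakeDownload` are hypothetical and not part of any library; in a real script the task would be the actual download call.

```javascript
// Hypothetical helper: run `task` over `items` with at most `limit` tasks in flight.
async function mapWithLimit(items, limit, task) {
    const results = new Array(items.length);
    let next = 0; // index of the next item that no worker has claimed yet

    // Each worker keeps claiming the next unclaimed item until none are left.
    async function worker() {
        while (next < items.length) {
            const index = next++;
            results[index] = await task(items[index], index);
        }
    }

    const workerCount = Math.max(1, Math.min(limit, items.length));
    await Promise.all(Array.from({ length: workerCount }, worker));
    return results;
}

// Stand-in for a real download: waits briefly, then resolves.
function fakeDownload(name) {
    return new Promise((resolve) => setTimeout(() => resolve(`done: ${name}`), 10));
}

// Usage: at most 2 simulated downloads run at the same time.
mapWithLimit(['chakra-0.0.3.tgz', 'chakra-0.0.2.tgz', 'chan-0.2.0.tgz'], 2, fakeDownload)
    .then((results) => console.log(results));
```

Results come back in the same order as the input list. If any task rejects, the whole call rejects, so wrap the task in try/catch if one failed download should not stop the rest.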

If you are the person doing this, please get in touch so that we can understand what you need and help.

sinazl #2

Has the Ruby community come over to start a fire?

ionicwang #3

Haha, time for a "water meter check" (expect a knock on the door).

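bupafengyu #4

https://github.com/cnpm/cnpmjs.org/issues/486 Does anyone know whose IP 115.231.100.67 is? Its sync configuration is wrong.

zlyuanteng #5

Haha. Not me, anyway :)

htzhanglong #6 (author)

It's Beijing Telecom. Could someone be running their own npm registry?

phonegap100 #7

Most likely someone set up a private npm registry, doesn't know how to get around the firewall, and so pointed its upstream at you. Probably not malicious.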

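caililin #8

The logs show a large number of requests from a single IP range (211.151.229.*), all of them HTTP GET requests for npm packages. This suggests that a developer or an automated tool is crawling npm packages on a regular schedule.

If you want to track down this crawler, you can try the following steps: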

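Check IP allowlists and blocklists: review your server's firewall rules or the access restrictions in your Nginx/Apache configuration to see whether requests from this IP range can be identified and blocked.
Analyze the request pattern: these requests all appear to be npm package downloads; analyzing the request paths and the User-Agent can confirm this.
Use anti-crawler techniques: apply simple measures such as request rate limiting, User-Agent validation, or CAPTCHAs.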
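
The "analyze the request pattern" step above could be sketched roughly as follows: parse each access-log line and count requests per client IP and per User-Agent. The field order is an assumption inferred from the sample log lines in the first post (in particular, that the first field is the client IP), and the names `parseLogLine` and `countBy` are made up for this example.

```javascript
// Assumed field order, inferred from the sample log lines:
// client IP, "-", internal IP, [timestamp], "request", status, bytes, "referer", "user agent"
const LOG_LINE = /^(\S+) \S+ (\S+) \[([^\]]+)\] "([^"]*)" (\d{3}) (\d+) "([^"]*)" "([^"]*)"$/;

// Parse one access-log line; returns null if it does not match the assumed format.
function parseLogLine(line) {
    const match = LOG_LINE.exec(line.trim());
    if (!match) return null;
    // The request field looks like "GET /path HTTP/1.0", or just "-" for bad requests.
    const [method = '-', path = '-'] = match[4].split(' ');
    return {
        clientIp: match[1],
        method,
        path,
        status: Number(match[5]),
        userAgent: match[8],
    };
}

// Count entries per key (for example per client IP), busiest first.
function countBy(entries, keyOf) {
    const counts = new Map();
    for (const entry of entries) {
        const key = keyOf(entry);
        counts.set(key, (counts.get(key) || 0) + 1);
    }
    return [...counts.entries()].sort((a, b) => b[1] - a[1]);
}

// Three lines copied from the log in the first post.
const sampleLines = [
    '211.151.229.245 - 10.159.63.236 [30/Oct/2014:18:56:42 +0800] "GET /uupaa.hmac.js HTTP/1.0" 200 43697 "-" "Ruby"',
    '211.151.229.247 - 10.159.63.122 [30/Oct/2014:18:56:42 +0800] "GET /chakra/download/chakra-0.0.3.tgz HTTP/1.0" 302 5 "-" "Ruby"',
    '211.151.229.245 - 10.159.63.100 [30/Oct/2014:18:56:42 +0800] "GET /uupaa.help.js HTTP/1.0" 200 70786 "-" "Ruby"',
];

const entries = sampleLines.map(parseLogLine).filter(Boolean);
console.log(countBy(entries, (e) => e.clientIp));
console.log(countBy(entries, (e) => e.userAgent));
```

For a real log file you would read the lines with `fs` or `readline` instead of the inline sample and look at the top few entries of each count.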

Example code

Here is a simple Node.js script that detects a specific request pattern and logs it:

const http = require('http');
const url = require('url');
const fs = require('fs'); // needed for fs.appendFile below

const server = http.createServer((req, res) => {
    const parsedUrl = url.parse(req.url);
    
    // Check whether the request path looks like an npm package download
    if (parsedUrl.pathname.includes('/download')) {
        console.log(`Detected download request: ${req.url}`);
        
        // Append the request details to the log file
        const logMessage = `${new Date().toISOString()} - ${req.method} ${req.url}`;
        fs.appendFile('requests.log', logMessage + '\n', err => {
            if (err) {
                console.error('Failed to write to log file.');
            }
        });
    }
    
    res.writeHead(200, {'Content-Type': 'text/plain'});
    res.end('Hello World\n');
});

server.listen(3000, () => {
    console.log('Server running at http://localhost:3000/');
});

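Explanation

Create the HTTP server: http.createServer creates an HTTP server that listens on port 3000.
Parse the URL: url.parse parses the request URL.
Check the request path: if the path contains /download, the request is treated as a download request and its details are printed.
Log the request: the request details are appended to a log file.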
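
The request rate limit mentioned among the anti-crawler techniques earlier could be sketched as a fixed-window counter per client IP. The class name `FixedWindowLimiter`, the handler name, and the limit of 100 requests per minute are all illustrative assumptions, not something taken from the registry itself.

```javascript
// Hypothetical fixed-window rate limiter keyed by client IP.
class FixedWindowLimiter {
    constructor(maxRequests, windowMs) {
        this.maxRequests = maxRequests;
        this.windowMs = windowMs;
        this.windows = new Map(); // ip -> { start, count }
    }

    // Returns true if the request is allowed, false once the IP exceeds its quota.
    allow(ip, now = Date.now()) {
        const current = this.windows.get(ip);
        if (!current || now - current.start >= this.windowMs) {
            // First request from this IP, or the previous window has expired.
            this.windows.set(ip, { start: now, count: 1 });
            return true;
        }
        current.count += 1;
        return current.count <= this.maxRequests;
    }
}

// Example value only: 100 requests per IP per minute.
const limiter = new FixedWindowLimiter(100, 60 * 1000);

// Request handler that answers 429 once an IP is over its quota.
// Behind a reverse proxy, the client address would have to come from a
// forwarded header instead of the socket address.
function rateLimitedHandler(req, res) {
    if (!limiter.allow(req.socket.remoteAddress)) {
        res.writeHead(429, { 'Content-Type': 'text/plain' });
        res.end('Too Many Requests\n');
        return;
    }
    res.writeHead(200, { 'Content-Type': 'text/plain' });
    res.end('OK\n');
}
```

To try it, pass the handler to `require('http').createServer(rateLimitedHandler).listen(3000)`. Note that this sketch never evicts old entries from the map, so a long-running server would need some cleanup.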

I hope this helps you track down the crawler.