Fetching Google's homepage with the Node.js request module works fine, but fetching a Google search results page returns garbled text. Why?

The code is as follows:


var request = require('request');
var cheerio = require('cheerio');
var http = require('http');


http.createServer(function (req, res) {
	request("http://www.google.com",function(error,response,body){
		if(!error && response.statusCode == 200){
			res.writeHead(200, {'Content-Type': 'text/html'}); 
			res.end(body);
		}
	})


}).listen(8080, "127.0.0.1");

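yuanlaile, reply #1

The garbled output when fetching a Google search results page with Node.js request most likely means the page is served in a character encoding other than UTF-8. By default the request module decodes the response body as UTF-8, so a page sent in another encoding (such as GBK or ISO-8859-1) comes out garbled.

Example code and explanation

First, make sure the response's character encoding is handled correctly. Check the Content-Type field in the response headers to find the right encoding, then decode the response body with it.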

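var request = require('request');
var cheerio = require('cheerio');
var http = require('http');

http.createServer(function (req, res) {
    // Fetch the Google search results page. encoding: null makes request
    // hand back the body as a raw Buffer instead of a UTF-8 decoded string.
    request({
        url: "https://www.google.com/search?q=nodejs",
        encoding: null
    }, function (error, response, body) {
        if (error || response.statusCode != 200) {
            // Always answer the client, otherwise the connection hangs
            res.writeHead(502, {'Content-Type': 'text/plain'});
            return res.end('Upstream request failed');
        }

        // Read the Content-Type header
        var contentType = response.headers['content-type'] || '';

        // Extract the charset from Content-Type, defaulting to UTF-8
        var match = /charset="?([^";\s]+)/i.exec(contentType);
        var charset = match ? match[1] : 'utf-8';

        // Decode the raw bytes with that charset. TextDecoder throws a
        // RangeError for labels it does not know, and needs a Node build
        // with full ICU for encodings such as GBK.
        var decodedBody = new TextDecoder(charset).decode(body);

        // Send the decoded page to the client, which Node writes out as UTF-8
        res.writeHead(200, {'Content-Type': 'text/html; charset=utf-8'});
        res.end(decodedBody);
    });
}).listen(8080, "127.0.0.1");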

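Key points

Read the Content-Type header: take the Content-Type header from the response object; it usually carries the character encoding.
Parse the character encoding: if Content-Type contains a charset= parameter, extract the encoding from it; otherwise default to utf-8.
Decode the response body: convert the binary data to a string using the extracted encoding.
Return the decoded data: send the decoded HTML back to the client in the HTTP response.

This way the page content displays correctly even if the Google search results page is served in an encoding other than UTF-8.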
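
The charset-parsing step described above can be isolated into a small function, which makes it easy to check against different header values. This is only a sketch: charsetFromContentType is a hypothetical helper name, not part of request, and it assumes the header follows the usual "type; charset=value" form.

```javascript
// Hypothetical helper: pull the charset label out of a Content-Type value.
// Handles optional quotes and trailing parameters; returns the label in lower case.
function charsetFromContentType(contentType, fallback) {
    var match = /charset\s*=\s*"?([^";\s]+)"?/i.exec(contentType || '');
    return match ? match[1].toLowerCase() : (fallback || 'utf-8');
}

console.log(charsetFromContentType('text/html; charset=GB2312')); // gb2312
console.log(charsetFromContentType('text/html'));                 // utf-8
```

Lower-casing the label is a design choice here so that later comparisons do not depend on how the server spells it.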

nodeper, reply #2 (original poster)

This isn't really a problem at all. Is your browser using the same encoding as the source?

yibo5220, reply #3

Watch out for gzip. I'm not sure whether request handles that step.
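
On the gzip point: request does have a gzip option (when set to true it adds an Accept-Encoding header and decompresses the response). A body that is still compressed looks like binary garbage rather than mis-decoded text, so it is worth ruling out. The sketch below uses only Node's zlib to recognise a gzip body by its magic bytes; maybeGunzip is a hypothetical helper name, and it assumes the whole body is already in a Buffer.

```javascript
var zlib = require('zlib');

// Hypothetical helper: gunzip a Buffer only if it starts with the
// gzip magic bytes 0x1f 0x8b; otherwise return it unchanged.
function maybeGunzip(buf) {
    var isGzip = buf.length > 2 && buf[0] === 0x1f && buf[1] === 0x8b;
    return isGzip ? zlib.gunzipSync(buf) : buf;
}

var html = '<title>中文</title>';
var compressed = zlib.gzipSync(Buffer.from(html, 'utf8'));

console.log(maybeGunzip(compressed).toString('utf8') === html); // true
```

If the first two bytes of the fetched body are not 0x1f 0x8b, gzip is probably not the cause and the charset is the next thing to check.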

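h691938207, reply #4

The browser is using UTF-8, and the fetched HTML also declares charset=utf8. But Chinese text is garbled on the search results page only; Chinese on the homepage is fine.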

eggper, reply #5

I've already added encoding: null to the options.

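sinazl, reply #6

Garbled output when fetching a Google search results page is usually an encoding problem. The request library decodes the body as UTF-8 by default, which goes wrong when the response is actually sent in a different charset. You can set encoding: null to get the raw binary data and then decode it yourself with the correct encoding.

Here is a modified example: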

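var request = require('request');
var cheerio = require('cheerio');
var http = require('http');

http.createServer(function (req, res) {
    request({
        url: "https://www.google.com/search?q=example",
        encoding: null // return the body as a raw Buffer
    }, function(error, response, body) {
        if (!error && response.statusCode == 200) {
            var contentType = response.headers['content-type'] || '';
            // Keep only the charset value, dropping any trailing parameters
            var charset = (contentType.split('charset=')[1] || '').split(';')[0].trim();

            // Buffer.toString only knows Node's built-in encodings
            // (utf8, latin1, ...), so fall back to UTF-8 for anything else
            body = body.toString(Buffer.isEncoding(charset) ? charset : 'utf-8');

            res.writeHead(200, {'Content-Type': 'text/html; charset=utf-8'});
            res.end(body);
        } else {
            // Always answer the client, otherwise the connection hangs
            res.writeHead(502, {'Content-Type': 'text/plain'});
            res.end('Upstream request failed');
        }
    });
}).listen(8080, "127.0.0.1");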

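Explanation:

encoding: null: makes request return binary data (a Buffer) instead of a string.
response.headers['content-type']: reads Content-Type from the response headers to find the charset.
toString(charset): converts the binary data to a string using the extracted encoding. Note that Buffer's toString only supports Node's built-in encodings (utf8, latin1 and so on); charsets such as GBK need a separate decoder.

This way the data is decoded with the correct encoding, which avoids the garbled text.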
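
To see why the raw Buffer matters, here is a small offline sketch with no network involved. It assumes the page bytes are GBK, which is only an illustration and not a claim about what Google actually sends, and it uses Node's global TextDecoder, which needs a Node build with full ICU to know GBK.

```javascript
// The bytes of "中文" encoded as GBK. They are not valid UTF-8.
var gbkBytes = Buffer.from([0xd6, 0xd0, 0xce, 0xc4]);

// Decoding as UTF-8 turns each invalid byte into U+FFFD,
// so the original bytes cannot be recovered from the string.
var wrong = gbkBytes.toString('utf8');
console.log(wrong.indexOf('\ufffd') !== -1);               // true
console.log(Buffer.from(wrong, 'utf8').equals(gbkBytes));  // false

// Decoding the untouched bytes with the right charset works.
var right = new TextDecoder('gbk').decode(gbkBytes);
console.log(right === '中文');                              // true
```

This is the reason for asking request for a Buffer first: once the body has been decoded as UTF-8, re-decoding the resulting string cannot bring the text back.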