A Node.js crawler question: how to turn scraped relative paths and file names into complete URLs

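Processing flow: main program (input parameters) --> crawl the site --> request the page (get the returned HTML source) --> pull out every href attribute value (this is the step I have reached) --> turn each one into a full URL and put them all into a URL array (de-duplicated) --> loop over the array and hand every URL back to the crawl module to crawl again.

The problem is that some hrefs are relative paths. An href such as /test.php may appear on a page under www.a.com/a/b/c. What I mean is: even when I have requested www.a.com/a/b/c/index.php and got back an href like /test.php, I want it turned into a full link such as http://www.a.com/a/b/c/test.php, so that I can push it into the URL array, de-duplicate, and keep the loop going.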

The hrefs I currently get look like this:

/favicon.ico http://www.xxx.org/feeds/x http://www.xxx.org/feeds/s http://www.xxx.org/feeds/b /css/style.css javascript:void(0) /user.php?action=login /user.php?action=register /index.php /corps/ /whitehats/ /teams/ /bugs/ /bug/submit /corp_actions /job/ /notice/ /index.php

/notice.php?action=view&id=29 /notice.php?action=view&id=28 /notice.php?action=view&id=27 /notice.php?action=view&id=26

sinazl, reply #1


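Processing flow:

Main program: takes the input parameters (such as the start URL).
Crawl the site: request the page and get its HTML source.
Extract links: pull the href attribute out of every <a> tag in the HTML.
Build full URLs: convert each extracted relative path into a complete URL.
De-duplicate and repeat: put the generated URLs into an array without duplicates, then walk the array and hand each URL back to the crawl module.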
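
The "build full URLs" step on its own can be sketched with the WHATWG URL class that Node.js exposes globally. This is a minimal sketch, not the asker's code: resolveHref is a hypothetical helper name, and the page URL is the one from the question. Note that under standard URL resolution an href with a leading slash is resolved from the site root, so /test.php found on www.a.com/a/b/c/index.php becomes http://www.a.com/test.php; only an href without the leading slash stays inside /a/b/c/.

```javascript
// Resolve an href against the URL of the page it was found on.
// resolveHref is a hypothetical helper name, not part of any library.
function resolveHref(href, pageUrl) {
    return new URL(href, pageUrl).href;
}

const page = 'http://www.a.com/a/b/c/index.php';

console.log(resolveHref('/test.php', page));   // http://www.a.com/test.php
console.log(resolveHref('test.php', page));    // http://www.a.com/a/b/c/test.php
console.log(resolveHref('../test.php', page)); // http://www.a.com/a/b/test.php
console.log(resolveHref('http://www.xxx.org/feeds/x', page)); // http://www.xxx.org/feeds/x
```

An href that is already absolute passes through unchanged, so every href can go through the same call.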

Example code

Below is a simple Node.js crawler example that shows how to resolve relative paths into complete URLs:

const axios = require('axios');
const cheerio = require('cheerio');
const urlLib = require('url');

async function fetchPage(url) {
    try {
        const response = await axios.get(url);
        return response.data;
    } catch (error) {
        console.error(`Error fetching ${url}:`, error.message);
        return null;
    }
}

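function extractLinks(html, baseUrl) {
    const $ = cheerio.load(html);
    const links = [];

    $('a').each((index, element) => {
        const href = $(element).attr('href');
        if (href) {
            // Resolve against the page URL. url.resolve is Node's legacy API;
            // new URL(href, baseUrl).href is the WHATWG equivalent.
            const fullUrl = urlLib.resolve(baseUrl, href);
            // Keep only http(s) links; this skips javascript:, mailto:, etc.
            if (/^https?:\/\//i.test(fullUrl)) {
                links.push(fullUrl);
            }
        }
    });

    return links;
}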

async function crawl(startUrl) {
    // Every URL that has ever been queued, so nothing is queued twice
    const seenUrls = new Set([startUrl]);
    const urlsToVisit = [startUrl];

    while (urlsToVisit.length > 0) {
        const currentUrl = urlsToVisit.shift();

        console.log(`Fetching: ${currentUrl}`);
        const html = await fetchPage(currentUrl);
        if (!html) continue;

        // Queue links not seen before. Links to other hosts are followed too.
        for (const link of extractLinks(html, currentUrl)) {
            if (!seenUrls.has(link)) {
                seenUrls.add(link);
                urlsToVisit.push(link);
            }
        }
    }
}

// Start crawling from a given URL
crawl('http://example.com');

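Explanation

fetchPage: uses the axios library to send the HTTP request and return the HTML content.
extractLinks: uses the cheerio library to parse the HTML, pulls the href attribute out of every <a> tag, and uses urlLib.resolve to convert relative paths into complete URLs.
crawl: the main function, which manages the URL queue and de-duplication. It keeps taking a URL off the queue, fetching its content, and extracting new links.

Because each href is resolved against the URL of the page it was found on, relative paths produce correct complete URLs at every step of the crawl.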
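
The de-duplication step can be tightened by normalizing each link before it is compared. The sketch below assumes the crawl should stay on the host of the start URL and that URLs differing only in their #fragment are the same page; neither assumption is stated in the question, and normalizeLink is a hypothetical helper name.

```javascript
// Hypothetical helper: returns a canonical form of an absolute link for
// de-duplication, or null when the link should not be crawled.
function normalizeLink(link, startUrl) {
    let parsed;
    try {
        parsed = new URL(link);
    } catch (err) {
        return null; // not a valid absolute URL
    }
    // Only http(s) pages can be fetched by the crawler
    if (parsed.protocol !== 'http:' && parsed.protocol !== 'https:') return null;
    // Assumption: stay on the host of the start URL
    if (parsed.host !== new URL(startUrl).host) return null;
    // "#section" points into the same document, so drop it
    parsed.hash = '';
    return parsed.href;
}

console.log(normalizeLink('http://example.com/a#top', 'http://example.com'));
console.log(normalizeLink('http://other.com/', 'http://example.com'));
```

A crawl loop would call this on each extracted link and queue only the non-null results that are not yet in its set of seen URLs.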

bupafengyu, reply #2

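To solve this, the relative paths need to be converted into absolute URLs. The URL class from the url module can do that. Here is a simple example showing how to handle this case: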

const url = require('url');
// Base URL: ideally the URL of the page the hrefs were extracted from
const baseUrl = 'http://www.example.com';

// Resolve an href against the base and return an absolute URL string
function resolveRelativeUrls(relativeUrl, base) {
    return new url.URL(relativeUrl, base).toString();
}

// Example usage
let hrefs = [
    '/favicon.ico',
    'http://www.xxx.org/feeds/x',
    '/css/style.css',
    'javascript:void(0)',
    '/index.php'
];

let resolvedHrefs = hrefs.map(href => {
    if (/^https?:\/\//i.test(href)) {
        return href; // already an absolute http(s) URL
    } else if (!href.startsWith('javascript:')) {
        return resolveRelativeUrls(href, baseUrl);
    }
    return null; // javascript: pseudo-links are dropped
}).filter(Boolean);

console.log(resolvedHrefs);

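Explanation:

url module: used to parse and format URLs.
baseUrl: the base URL that hrefs are resolved against. The site root works for root-relative hrefs such as /index.php; for hrefs without a leading slash, use the URL of the page they were found on.
resolveRelativeUrls function: converts a relative path into an absolute URL.
hrefs array: holds the href attribute values extracted from the page.
resolvedHrefs array: built by mapping over hrefs and passing each relative entry through resolveRelativeUrls, so it ends up holding only complete URLs.

This code handles several cases: URLs that are already absolute, relative paths, and javascript:void(0), which is skipped.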
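
As a sketch of how the same idea could be applied to a few of the hrefs listed in the question: the version below resolves everything through one call, skips anything that is not http(s) or cannot be parsed, and de-duplicates with a Set. The helper name toAbsolute and the page URL are assumptions; the question does not say which page the hrefs came from.

```javascript
// Hypothetical helper: absolute http(s) URL for an href, or null if it should be skipped.
function toAbsolute(href, pageUrl) {
    try {
        const resolved = new URL(href.trim(), pageUrl);
        return /^https?:$/.test(resolved.protocol) ? resolved.href : null;
    } catch (err) {
        return null; // malformed href
    }
}

// Assumed page URL; the question does not state it.
const pageUrl = 'http://www.xxx.org/index.php';
const hrefs = [
    '/favicon.ico',
    'http://www.xxx.org/feeds/x',
    'javascript:void(0)',
    '/user.php?action=login',
    '/corps/',
    '/notice.php?action=view&id=29',
    '/corps/' // repeated on purpose to show de-duplication
];

// Resolve, drop skipped entries, and remove duplicates
const unique = [...new Set(hrefs.map(h => toAbsolute(h, pageUrl)).filter(Boolean))];
console.log(unique);
```

Resolving first and filtering on the resulting protocol avoids string checks on the raw href, so mailto: or tel: links are dropped the same way as javascript: ones.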