
Crawlee is a web scraping and crawling library that lets you put a crawler together very quickly. For sites with lots of similar pages (product pages on Rakuten, for example), opening every page in an E2E test is impractical, so I tried to see whether Crawlee could be used as a broken-link checker instead.
Installation
See: https://crawlee.dev/docs/quick-start
Create a project with the following command.
npx crawlee create my-crawler
Running the command above lets you choose between CheerioCrawler, PlaywrightCrawler, and PuppeteerCrawler, and then between JavaScript and TypeScript. This time I went with the modern choice of Playwright + TypeScript (the "PlaywrightCrawler template project [TypeScript]"). With Playwright, the things you would want to do in E2E tests, such as taking screenshots, are easy.
The key files in the generated code are src/main.ts and src/routes.ts.
// main.ts
// For more information, see https://crawlee.dev/
import { PlaywrightCrawler, ProxyConfiguration } from 'crawlee';
import { router } from './routes.js';

const startUrls = ['https://crawlee.dev'];

const crawler = new PlaywrightCrawler({
    // proxyConfiguration: new ProxyConfiguration({ proxyUrls: ['...'] }),
    requestHandler: router,
    // Comment this option to scrape the full website.
    maxRequestsPerCrawl: 20,
});

await crawler.run(startUrls);
// routes.ts
import { createPlaywrightRouter } from 'crawlee';

export const router = createPlaywrightRouter();

router.addDefaultHandler(async ({ enqueueLinks, log }) => {
    log.info(`enqueueing new URLs`);
    await enqueueLinks({
        globs: ['https://crawlee.dev/**'],
        label: 'detail',
    });
});

router.addHandler('detail', async ({ request, page, log, pushData }) => {
    const title = await page.title();
    log.info(`${title}`, { url: request.loadedUrl });
    await pushData({
        url: request.loadedUrl,
        title,
    });
});
Change startUrls and globs to the URLs you want to crawl.
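As a minimal sketch, assuming a hypothetical target site https://example.com (a placeholder, not a site from this article), the two edits would look like this:

// main.ts
const startUrls = ['https://example.com/'];

// routes.ts
await enqueueLinks({
    globs: ['https://example.com/**'],
    label: 'detail',
});

With those in place, run the following command and the crawl starts.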
cd my-crawler && npm start
> my-crawler@0.0.1 start
> npm run start:dev
> my-crawler@0.0.1 start:dev
> tsx src/main.ts
(node:54173) [DEP0040] DeprecationWarning: The `punycode` module is deprecated. Please use a userland alternative instead.
(Use `node --trace-deprecation ...` to show where the warning was created)
INFO PlaywrightCrawler: Starting the crawler.
INFO PlaywrightCrawler: enqueueing new URLs
INFO PlaywrightCrawler: Crawlee {"url":"https://crawlee.dev/docs/examples"}
INFO PlaywrightCrawler: Crawlee {"url":"https://crawlee.dev/api/core"}
INFO PlaywrightCrawler: Quick Start | Crawlee {"url":"https://crawlee.dev/docs/quick-start"}
INFO PlaywrightCrawler: Changelog | API | Crawlee {"url":"https://crawlee.dev/api/core/changelog"}
INFO PlaywrightCrawler: Crawlee Blog - learn how to build better scrapers | Crawlee {"url":"https://crawlee.dev/blog"}
INFO PlaywrightCrawler: Quick Start | Crawlee {"url":"https://crawlee.dev/docs/next/quick-start"}
INFO PlaywrightCrawler: Crawlee {"url":"https://crawlee.dev/docs/3.9/quick-start"}
INFO PlaywrightCrawler: Crawlee {"url":"https://crawlee.dev/docs/3.8/quick-start"}
INFO PlaywrightCrawler: Quick Start | Crawlee {"url":"https://crawlee.dev/docs/3.7/quick-start"}
INFO PlaywrightCrawler: Quick Start | Crawlee {"url":"https://crawlee.dev/docs/3.5/quick-start"}
INFO PlaywrightCrawler: Quick Start | Crawlee {"url":"https://crawlee.dev/docs/3.6/quick-start"}
INFO PlaywrightCrawler: Quick Start | Crawlee {"url":"https://crawlee.dev/docs/3.3/quick-start"}
INFO PlaywrightCrawler: Crawlee {"url":"https://crawlee.dev/docs/3.4/quick-start"}
INFO PlaywrightCrawler: Quick Start | Crawlee {"url":"https://crawlee.dev/docs/3.2/quick-start"}
INFO PlaywrightCrawler: Quick Start | Crawlee {"url":"https://crawlee.dev/docs/3.1/quick-start"}
INFO PlaywrightCrawler: Introduction | Crawlee {"url":"https://crawlee.dev/docs/introduction"}
INFO PlaywrightCrawler: Crawlee {"url":"https://crawlee.dev/docs/guides/javascript-rendering"}
INFO PlaywrightCrawler: Quick Start | Crawlee {"url":"https://crawlee.dev/docs/3.0/quick-start"}
INFO PlaywrightCrawler: Crawlee {"url":"https://crawlee.dev/docs/guides/avoid-blocking"}
INFO PlaywrightCrawler: Crawler reached the maxRequestsPerCrawl limit of 20 requests and will shut down soon. Requests that are in progress will be allowed to finish.
INFO PlaywrightCrawler: TypeScript Projects | Crawlee {"url":"https://crawlee.dev/docs/guides/typescript-project"}
INFO PlaywrightCrawler: CheerioCrawler guide | Crawlee {"url":"https://crawlee.dev/docs/guides/cheerio-crawler-guide"}
INFO PlaywrightCrawler: Earlier, the crawler reached the maxRequestsPerCrawl limit of 20 requests and all requests that were in progress at that time have now finished. In total, the crawler processed 22 requests and will shut down.
INFO PlaywrightCrawler: Final request statistics: {"requestsFinished":22,"requestsFailed":0,"retryHistogram":[22],"requestAvgFailedDurationMillis":null,"requestAvgFinishedDurationMillis":1018,"requestsFinishedPerMinute":103,"requestsFailedPerMinute":0,"requestTotalDurationMillis":22403,"requestsTotal":22,"crawlerRuntimeMillis":12834}
INFO PlaywrightCrawler: Finished! Total 22 requests: 22 succeeded, 0 failed. {"terminal":true}
The scraped data is written to the datasets directory (storage/datasets/default by default), and you can post-process it yourself.
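As one sketch of such post-processing, the dataset can be read back with Crawlee's Dataset API; the file name process.ts below is just my own choice, not part of the template:

// process.ts -- read back what the crawler stored via pushData()
import { Dataset } from 'crawlee';

// Open the default dataset (storage/datasets/default).
const dataset = await Dataset.open();
const { items } = await dataset.getData();

for (const item of items) {
    console.log(`${item.title} -> ${item.url}`);
}

Running it with npx tsx process.ts (the template already uses tsx) prints the title and URL of every crawled page.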

There are plenty of configuration options. Here are a few that caught my eye.
// main.ts
// Running the crawler in headless mode
const crawler = new PlaywrightCrawler({
    ...
    launchContext: {
        // Here you can set options that are passed to the playwright .launch() function.
        launchOptions: {
            headless: true,
        },
    },
    ...
});
// routes.ts
// Use strategy to control which domains/subdomains the enqueued links may belong to.
// See: https://crawlee.dev/docs/examples/crawl-relative-links
import { EnqueueStrategy } from 'crawlee';

router.addDefaultHandler(async ({ enqueueLinks, log }) => {
    ...
    await enqueueLinks({
        // domain settings
        strategy: EnqueueStrategy.All,
        ...
    });
});
I looked into whether the crawler can be told to follow links only one level deep from a given page, but there does not seem to be a built-in option.
Option to perform a depth first crawl #1595
That said, when I ran the sample it only followed links one level from the start URL, so it did not turn into unbounded recursive crawling. This makes sense: in the generated template only the default handler calls enqueueLinks(), while the 'detail' handler just pushes data, so crawling effectively stops after one hop. Judging by the "Crawl all links on a website" example, enqueueLinks() by itself only enqueues the links found on the page currently being handled.
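If you do want to follow links deeper while still capping the depth, one workaround (my own sketch, not a built-in Crawlee option) is to carry a depth counter in userData and stop calling enqueueLinks() once the limit is reached:

// routes.ts -- depth-limited crawling sketch; MAX_DEPTH is a made-up constant
import { createPlaywrightRouter } from 'crawlee';

const MAX_DEPTH = 1;
export const router = createPlaywrightRouter();

router.addDefaultHandler(async ({ request, enqueueLinks, page, log, pushData }) => {
    // Start URLs have no userData, so treat a missing depth as 0.
    const depth = (request.userData.depth as number) ?? 0;
    const title = await page.title();
    log.info(`[depth ${depth}] ${title}`, { url: request.loadedUrl });
    await pushData({ url: request.loadedUrl, title, depth });

    // Only enqueue further links while we are below the depth limit.
    if (depth < MAX_DEPTH) {
        await enqueueLinks({
            globs: ['https://crawlee.dev/**'],
            userData: { depth: depth + 1 },
        });
    }
});

Here a single default handler replaces the template's 'detail' route, and enqueueLinks() attaches the given userData to every request it enqueues.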
There are plenty of other samples on the Examples page, and the documentation is solid.
See: https://crawlee.dev/docs/examples
What's left is the broken-link check itself, but I couldn't find a way to get the status code of the linked pages. Since a broken link typically lands on a page whose title is something like "404 Not Found", extracting that from the title may be the only option.
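One avenue worth trying, though I haven't verified it against a real broken-link run: the PlaywrightCrawler handler context also seems to expose the navigation response as response, so its status code could be recorded alongside the title. A rough sketch, with the >= 400 threshold as my own assumption:

// routes.ts -- broken-link check sketch (unverified; assumes the handler
// context exposes the Playwright navigation response as `response`)
router.addHandler('detail', async ({ request, response, page, log, pushData }) => {
    const status = response?.status();
    const title = await page.title();

    if (status !== undefined && status >= 400) {
        log.warning(`Broken link? ${status} ${request.loadedUrl}`);
    }

    await pushData({
        url: request.loadedUrl,
        title,
        status,
    });
});

For requests that fail outright (DNS errors, exhausted retries), a failedRequestHandler on the crawler should let you record those cases as well.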