
Web Crawler


Technologies: Rust, Tokio, reqwest, clap

A small, dependency-minimal Rust web crawler that fetches a seed URL, extracts same-host links from the homepage, and saves HTML responses to disk.

Built for learning and small local crawl tasks, not as a production spider.

TL;DR

What this is: A CLI and library that crawls a seed URL, follows same-host links found on the homepage, writes each page to disk, and logs structured events to stdout.

What this isn’t: A recursive or politeness-aware crawler. By default it does not read robots.txt, rate-limit requests, or limit crawl depth.

Run: cargo run --release -- "https://example.com" --out-dir crawl_out

What it does

  • Accepts a seed URL (or hostname) as CLI input
  • Fetches the homepage once
  • Extracts <a href="..."> links on the first page
  • Normalizes each link to an absolute URL
  • Follows only same-host links
  • Fetches each same-host page once
  • Saves each response body in out_dir using a deterministic URL hash filename
  • Logs crawl events to stdout with status and byte counts
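The link-extraction step in the list above can be sketched with the standard library alone. This is a minimal illustration, not the code in src/links.rs: the function name extract_hrefs is hypothetical, and the scanner is deliberately naive (case-sensitive, double-quoted attributes only, no entity decoding).

```rust
/// Collect the values of double-quoted `href` attributes on `<a ...>` tags.
/// Hypothetical helper for illustration; a real crawler would normally use
/// an HTML parser instead of string scanning.
fn extract_hrefs(html: &str) -> Vec<String> {
    const NEEDLE: &str = " href=\"";
    let mut links = Vec::new();
    let mut rest = html;
    // Walk the input one `<a ...>` opening tag at a time.
    while let Some(start) = rest.find("<a ") {
        let tag = &rest[start..];
        // The opening tag ends at the next '>' (or at end of input).
        let end = tag.find('>').unwrap_or(tag.len());
        let attrs = &tag[..end];
        if let Some(pos) = attrs.find(NEEDLE) {
            let value = &attrs[pos + NEEDLE.len()..];
            // Keep the text up to the closing quote, if there is one.
            if let Some(close) = value.find('"') {
                links.push(value[..close].to_string());
            }
        }
        // Continue scanning after this tag.
        rest = &tag[end..];
    }
    links
}

fn main() {
    let html = r#"<a href="/about">About</a> <a href="https://example.com/b">B</a>"#;
    for link in extract_hrefs(html) {
        println!("{link}");
    }
}
```

The extracted values may be relative (such as /about), so they still need to be resolved against the page URL before the same-host check; with the url crate listed under Dependencies, Url::join is the usual way to do that.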

Project structure

Module             Role
src/main.rs        CLI entrypoint using clap + tokio
src/lib.rs         Reusable crawler API
src/engine.rs      Crawl orchestration
src/fetch.rs       HTTP fetch wrapper with reqwest
src/links.rs       HTML link extraction
src/storage.rs     File path generation, save HTML
src/url_util.rs    URL normalization and same-host checks
src/log.rs         Logging abstraction (stdout + pluggable)
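To make the same-host check in src/url_util.rs concrete, here is a rough std-only sketch. It is an assumption-heavy stand-in: the names host_of and same_host are hypothetical, the project itself presumably relies on the url crate, and this version ignores ports, handles only lowercase http:// and https:// schemes, and does not support IPv6 literals.

```rust
/// Return the lowercase host of an absolute http(s) URL, without userinfo
/// or port. Hypothetical helper; returns None for anything it cannot parse.
fn host_of(url: &str) -> Option<String> {
    let rest = url
        .strip_prefix("https://")
        .or_else(|| url.strip_prefix("http://"))?;
    // The authority component ends at the first '/', '?' or '#'.
    let authority = rest
        .split(|c: char| c == '/' || c == '?' || c == '#')
        .next()?;
    // Drop optional userinfo ("user:pass@") and then the port (":8080").
    let host_port = authority.rsplit('@').next()?;
    let host = host_port.split(':').next()?;
    if host.is_empty() {
        None
    } else {
        Some(host.to_ascii_lowercase())
    }
}

/// Two URLs are "same host" here when both parse and their hosts match
/// case-insensitively. Treating different ports as the same host is an
/// assumption of this sketch, not a documented behavior of the crawler.
fn same_host(a: &str, b: &str) -> bool {
    match (host_of(a), host_of(b)) {
        (Some(x), Some(y)) => x == y,
        _ => false,
    }
}

fn main() {
    let seed = "https://example.com/";
    for link in ["https://example.com/about", "https://other.org/"] {
        println!("{link} same_host={}", same_host(seed, link));
    }
}
```

Note that subdomains such as blog.example.com count as a different host under this definition.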

Usage

Build and run from the project root:

cargo run --release -- "https://example.com" --out-dir crawl_out

Short form:

cargo run --release -- example.com -o crawl_out

Defaults:

  • out_dir: crawl_out

Output

  • crawl_out/<url_hash>.html, where url_hash is derived from the normalized final URL
  • Stdout log events include: seed, response, fetch, save, skip_links, link_skip, fetch_err, save_err
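One way to derive a deterministic file name from a URL is sketched below. The hash function the project actually uses is not stated here, so this example swaps in 64-bit FNV-1a purely for illustration; the names fnv1a64 and output_path are hypothetical.

```rust
use std::path::{Path, PathBuf};

/// 64-bit FNV-1a over raw bytes. Chosen for this sketch because it gives
/// the same value on every run and platform; not necessarily the hash the
/// crawler uses.
fn fnv1a64(bytes: &[u8]) -> u64 {
    const OFFSET_BASIS: u64 = 0xcbf29ce484222325;
    const PRIME: u64 = 0x100000001b3;
    let mut hash = OFFSET_BASIS;
    for &b in bytes {
        hash ^= u64::from(b);
        hash = hash.wrapping_mul(PRIME);
    }
    hash
}

/// Build `<out_dir>/<16 hex digits>.html` from an already-normalized URL.
fn output_path(out_dir: &Path, normalized_url: &str) -> PathBuf {
    let name = format!("{:016x}.html", fnv1a64(normalized_url.as_bytes()));
    out_dir.join(name)
}

fn main() {
    let path = output_path(Path::new("crawl_out"), "https://example.com/");
    println!("{}", path.display());
}
```

Hashing the normalized URL means the same page always maps to the same file, so a repeated crawl overwrites earlier output instead of accumulating copies.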

Configuration

No configuration file. CLI args only.

Tests

No test files are currently included. The library is unit-test-friendly via Crawler::with_logger and CrawlConfig.
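As a rough idea of how a pluggable logger helps with unit tests, the sketch below captures events in memory instead of printing them. Everything here is an assumption made for illustration: the Logger trait, MemoryLogger, and record_response are hypothetical and do not reflect the actual signatures of Crawler::with_logger or CrawlConfig. Mapping a non-2xx status to fetch_err is likewise a guess.

```rust
use std::cell::RefCell;

/// Hypothetical logging abstraction; the real one in src/log.rs may differ.
trait Logger {
    fn event(&self, kind: &str, detail: &str);
}

/// Prints one line per event to stdout.
struct StdoutLogger;

impl Logger for StdoutLogger {
    fn event(&self, kind: &str, detail: &str) {
        println!("{kind} {detail}");
    }
}

/// Records events in memory so a test can inspect them afterwards.
#[derive(Default)]
struct MemoryLogger {
    events: RefCell<Vec<(String, String)>>,
}

impl Logger for MemoryLogger {
    fn event(&self, kind: &str, detail: &str) {
        self.events
            .borrow_mut()
            .push((kind.to_string(), detail.to_string()));
    }
}

/// Stand-in for the post-fetch step: log the response, then log either a
/// save or an error. Returns true when the status is 2xx.
fn record_response(log: &dyn Logger, url: &str, status: u16, bytes: usize) -> bool {
    log.event("response", &format!("url={url} status={status} bytes={bytes}"));
    let ok = (200..300).contains(&status);
    if ok {
        log.event("save", url);
    } else {
        log.event("fetch_err", url);
    }
    ok
}

fn main() {
    record_response(&StdoutLogger, "https://example.com/", 200, 1256);
    let memory = MemoryLogger::default();
    record_response(&memory, "https://example.com/missing", 404, 0);
    println!("captured {} events", memory.events.borrow().len());
}
```

A test can then assert on the recorded event kinds without touching stdout or the network.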

Dependencies

  • reqwest — HTTP client
  • tokio — async runtime
  • anyhow — error handling
  • clap — CLI
  • url — URL parsing

Extending

  • Add depth control (breadth-first / recursive crawl)
  • Add robots.txt + rate limiting
  • Add concurrency queue and dedupe URL set
  • Add filter rules (patterns, content types)
  • Instrument with structured logging / metrics

Notes

The crawler is intentionally simple. It does not enforce politeness controls by default.