Web Crawler

TL;DR
#

What this is: A CLI and library that crawls a seed URL, follows same-host links found on the homepage, and writes each page to disk with structured stdout logging.

What this isn’t: A recursive or politeness-aware crawler. No robots.txt, rate limiting, or depth control by default.

Run: cargo run --release -- "https://example.com" --out-dir crawl_out

What it does
#

Accepts a seed URL (or hostname) as CLI input
Fetches the homepage once
Extracts <a href="..."> links on the first page
Normalizes each link to an absolute URL
Follows only same-host links
Fetches each same-host page once
Saves each response body in out_dir using a deterministic URL hash filename
Logs crawl events to stdout with status and byte counts

Project structure
#

Module	Role
`src/main.rs`	CLI entrypoint using `clap` + `tokio`
`src/lib.rs`	Reusable crawler API
`src/engine.rs`	Crawl orchestration
`src/fetch.rs`	HTTP fetch wrapper with `reqwest`
`src/links.rs`	HTML link extraction
`src/storage.rs`	File path generation, save HTML
`src/url_util.rs`	URL normalization and same-host checks
`src/log.rs`	Logging abstraction (stdout + pluggable)

Usage
#

Build and run from the project root:

cargo run --release -- "https://example.com" --out-dir crawl_out

Short form:

cargo run --release -- example.com -o crawl_out

Defaults:

out_dir: crawl_out

Output
#

crawl_out/<url_hash>.html — url_hash is derived from the normalized final URL
Stdout log events include: seed, response, fetch, save, skip_links, link_skip, fetch_err, save_err

Configuration
#

No configuration file. CLI args only.

Tests
#

No test files are currently included. The library is unit-test-friendly via Crawler::with_logger and CrawlConfig.

Dependencies
#

reqwest — HTTP client
tokio — async runtime
anyhow — error handling
clap — CLI
url — URL parsing

Extending
#

Add depth control (breadth-first / recursive crawl)
Add robots.txt + rate limiting
Add concurrency queue and dedupe URL set
Add filter rules (patterns, content types)
Instrument with structured logging / metrics

Notes
#

The crawler is intentionally simple. It does not enforce politeness controls by default.

TL;DR#

What it does#

Project structure#

Usage#

Output#

Configuration#

Tests#

Dependencies#

Extending#

Notes#