twars-url2md is a fast and robust command-line tool and Rust library that fetches web pages, intelligently cleans up their HTML content, and converts them into clean, readable Markdown files. It's designed for high-performance batch processing, making it ideal for archiving, research, content conversion, or any task requiring structured text from web sources.
twars-url2md takes one or more URLs (or local HTML files) as input. For each URL, it:
This tool is valuable for:
twars-url2md stands out due to its combination of speed, reliability, and the quality of its output:
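Local HTML files can be given as a plain path or a file URL (e.g., /path/to/file.html or file:///path/to/file.html). Output is written as Markdown, for example to a .md file.
You can install twars-url2md using pre-compiled binaries (recommended for most users) or by building it from source using Cargo.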
Linux and macOS:
curl -fsSL https://raw.githubusercontent.com/twardoch/twars-url2md/main/install.sh | bash
Or with custom install directory:
curl -fsSL https://raw.githubusercontent.com/twardoch/twars-url2md/main/install.sh | bash -s -- --install-dir ~/.local/bin
Download the latest release for your platform from the GitHub Releases page.
macOS:
# Intel x86_64
curl -L https://github.com/twardoch/twars-url2md/releases/latest/download/twars-url2md-macos-x86_64.tar.gz | tar xz
# Apple Silicon (M1/M2)
curl -L https://github.com/twardoch/twars-url2md/releases/latest/download/twars-url2md-macos-aarch64.tar.gz | tar xz
# Move to a directory in your PATH
sudo mv twars-url2md /usr/local/bin/
Linux:
# x86_64
curl -L https://github.com/twardoch/twars-url2md/releases/latest/download/twars-url2md-linux-x86_64.tar.gz | tar xz
# ARM64 (aarch64)
curl -L https://github.com/twardoch/twars-url2md/releases/latest/download/twars-url2md-linux-aarch64.tar.gz | tar xz
# Static binary (musl)
curl -L https://github.com/twardoch/twars-url2md/releases/latest/download/twars-url2md-linux-x86_64-musl.tar.gz | tar xz
# Move to a directory in your PATH
sudo mv twars-url2md /usr/local/bin/
Windows:
# Download
Invoke-WebRequest -Uri https://github.com/twardoch/twars-url2md/releases/latest/download/twars-url2md-windows-x86_64.zip -OutFile twars-url2md.zip
# Extract
Expand-Archive twars-url2md.zip -DestinationPath .
# Move twars-url2md.exe to a directory in your system's PATH
# For example, move it to C:\Windows\System32 or add its current directory to PATH
Note: For Windows, ensure the directory where you place twars-url2md.exe is included in your PATH environment variable to run it from any command prompt.
If you have Rust installed (version 1.70.0 or later), you can install twars-url2md directly from Crates.io:
cargo install twars-url2md
To build twars-url2md from the source code:
git clone https://github.com/twardoch/twars-url2md.git
cd twars-url2md
cargo build --release
The executable will be located at target/release/twars-url2md.
You can also install it into your Cargo bin directory (~/.cargo/bin/) by running:
cargo install --path .
After installation, verify it by running:
twars-url2md --version
twars-url2md can be used as a command-line tool or as a library in your Rust projects.
The basic syntax for the CLI is:
twars-url2md [OPTIONS] [URLS...]
If URLs are provided directly as arguments, they will be processed. Otherwise, use --input or --stdin.
CLI Options:
You can view all options by running twars-url2md --help. Here are the main ones:
-i, --input <FILE>: Input file containing URLs (one per line, or text with extractable URLs).
-o, --output <PATH>: Output directory for Markdown files. If <PATH> ends with .md, all content will be saved into this single file instead of a directory structure (unless --pack is also used).
--stdin: Read URLs from standard input.
--base-url <URL>: Base URL for resolving relative links found in the input content (e.g., if parsing URLs from an HTML page).
-p, --pack <FILE.md>: Pack all converted Markdown content into a single specified .md file. Each URL's content will be headed by its original URL.
-v, --verbose: Enable verbose output with detailed logging (INFO and DEBUG levels).
-h, --help: Print help information.
-V, --version: Print version information.
Input Formats:
The tool can extract URLs from various input sources when using -i or --stdin:
Plain URLs, one per line:
https://example.com
https://another-site.com/page
URLs in HTML (e.g., <a href="https://example.com">Example</a>).
URLs in Markdown (e.g., [Example](https://example.com)).
Local file paths:
/path/to/your/file.html
file:///absolute/path/to/another/file.html
Note: For local files, the content is read and converted to Markdown directly.
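To make the input formats above concrete, here is a minimal sketch of pulling absolute URLs out of mixed plain text, HTML, and Markdown. It is an illustration only: the tool itself relies on the linkify crate, the function names `extract_http_urls` and `find_scheme` are hypothetical, and the sketch ignores relative links (which is what --base-url is for) and local file paths.

```rust
/// Collect every `http://` or `https://` URL found in free-form text.
/// Assumption: a URL ends at whitespace or at a character that usually
/// closes a surrounding HTML attribute or Markdown link.
fn extract_http_urls(text: &str) -> Vec<String> {
    let mut urls: Vec<String> = Vec::new();
    let mut rest = text;
    while let Some(start) = find_scheme(rest) {
        let candidate = &rest[start..];
        let end = candidate
            .find(|c: char| c.is_whitespace() || matches!(c, '"' | '\'' | '<' | '>' | ')' | ']'))
            .unwrap_or(candidate.len());
        // Drop sentence punctuation that trails the URL.
        let url = candidate[..end].trim_end_matches(|c: char| matches!(c, '.' | ',' | ';'));
        // Keep first-seen order and skip duplicates.
        if !urls.iter().any(|u| u.as_str() == url) {
            urls.push(url.to_string());
        }
        rest = &candidate[end..];
    }
    urls
}

/// Byte offset of the earliest `http://` or `https://` in `s`, if any.
fn find_scheme(s: &str) -> Option<usize> {
    match (s.find("http://"), s.find("https://")) {
        (Some(a), Some(b)) => Some(a.min(b)),
        (a, b) => a.or(b),
    }
}
```

Calling `extract_http_urls` on a line that mixes a bare URL, an `<a href="...">` tag, and a Markdown link returns the three URLs in the order they appear.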
CLI Examples:
Process a single URL, output to console (default if no -o or --pack):
twars-url2md https://www.rust-lang.org
(Note: This will print Markdown to stdout. For saving, use -o or --pack)
Process multiple URLs, save to default directory structure (./output/<domain>/...):
twars-url2md https://www.rust-lang.org https://crates.io -o ./output
Process URLs from a file, save to a custom directory:
# urls.txt contains one URL per line
twars-url2md -i urls.txt -o ./markdown_files
Process URLs from stdin, with verbose logging:
echo "https://example.com" | twars-url2md --stdin -o ./output -v
Extract URLs from a webpage and process them (using curl as an example source):
curl -s https://news.ycombinator.com | \
  twars-url2md --stdin --base-url https://news.ycombinator.com -o ./hn_articles
Process local HTML files (using find to supply file paths):
find . -name "*.html" | twars-url2md --stdin -o ./local_markdown
Create a single combined Markdown file from multiple URLs (--pack):
twars-url2md -i urls.txt --pack combined_report.md
Each URL's content in combined_report.md will be preceded by a header like # https://example.com/some/page.
Output to a single .md file (alternative to directory structure):
# This is useful if you have one primary URL or want a simpler output than --pack
twars-url2md https://example.com/main_article -o article.md
If multiple URLs are processed and output is a single file (not using --pack), their content will be concatenated.
Use both individual file output and packed output:
twars-url2md -i urls.txt -o ./individual_files --pack all_content.md
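The output layouts used in the examples above can be sketched as follows. This is a simplified model, not the crate's code: the function names are hypothetical, and the handling of empty paths (index.md), query strings (dropped), and existing file extensions (kept, with .md appended) are assumptions made for illustration.

```rust
use std::path::{Path, PathBuf};

/// Single-file mode applies when the -o path ends with .md and is not a directory.
fn is_single_file_output(path: &Path) -> bool {
    path.extension().map_or(false, |ext| ext == "md") && !path.is_dir()
}

/// Directory mode: map a URL to `<base>/<host>/<path>.md`.
fn output_path_for(base: &Path, url: &str) -> PathBuf {
    // Drop the scheme, then any query string or fragment (assumption).
    let without_scheme = url.split_once("://").map_or(url, |(_, rest)| rest);
    let clean = without_scheme
        .split(|c: char| c == '?' || c == '#')
        .next()
        .unwrap_or("");
    let (host, path) = clean.split_once('/').unwrap_or((clean, ""));

    let mut out = base.join(host);
    // Ignore empty and dot segments so the result stays under `base`.
    let segments: Vec<&str> = path
        .split('/')
        .filter(|s| !s.is_empty() && *s != "." && *s != "..")
        .collect();
    match segments.split_last() {
        // Assumption: a URL with no path is saved as index.md.
        None => out.push("index.md"),
        Some((last, dirs)) => {
            for dir in dirs {
                out.push(dir);
            }
            out.push(format!("{last}.md"));
        }
    }
    out
}

/// Pack mode: concatenate documents in input order, each under a `# <URL>` heading.
fn pack(docs: &[(&str, &str)]) -> String {
    let mut packed = String::new();
    for (url, markdown) in docs {
        packed.push_str(&format!("# {url}\n\n{}\n\n", markdown.trim_end()));
    }
    packed
}
```

With these assumptions, https://example.com/path/to/page under ./output maps to output/example.com/path/to/page.md, while `pack` yields one document in which each page follows its own `# <URL>` heading.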
twars-url2md can also be used as a Rust library to integrate its functionality into your own projects.
Add it to your Cargo.toml:
[dependencies]
twars-url2md = "0.3.0" # Replace with the latest version from crates.io
tokio = { version = "1", features = ["full"] }
anyhow = "1"
(Check Crates.io for the most current version number.)
Example:
use twars_url2md::{process_urls, Config, url::extract_urls_from_text};
use std::path::PathBuf;
use anyhow::Result;
#[tokio::main]
async fn main() -> Result<()> {
    // Example text from which to extract URLs
    let text_with_urls = "Check out https://www.rust-lang.org and also see https://crates.io for packages.";
    // Extract URLs. A base_url (Option<&str>) can be provided if needed.
    let urls_to_process = extract_urls_from_text(text_with_urls, None);
    if urls_to_process.is_empty() {
        println!("No URLs found to process.");
        return Ok(());
    }
    // Configure the processing task
    let config = Config {
        verbose: true, // Enable detailed internal logging (uses `tracing` crate)
        max_retries: 3, // Max *additional* retries after the initial attempt (so, 3 means up to 4 total attempts)
        output_base: PathBuf::from("./my_markdown_output"), // Base path for output
        single_file: false, // False: create directory structure; True: if output_base is a file.md, save all to it
        has_output: true, // True if output_base is a path to save to, false if just processing (e.g. for pack_file only)
        pack_file: Some(PathBuf::from("./packed_documentation.md")), // Optional: combine all into one .md file
    };
    // Ensure the output directory exists if not using pack_file primarily or single_file mode
    if config.has_output && !config.single_file && config.output_base.extension().is_none() {
        if !config.output_base.exists() {
            tokio::fs::create_dir_all(&config.output_base).await?;
        }
    } else if config.has_output && config.single_file {
        // Output is a single file: ensure its parent directory exists
        if let Some(parent) = config.output_base.parent() {
            if !parent.exists() {
                tokio::fs::create_dir_all(parent).await?;
            }
        }
    }
    // Process the URLs
    // `process_urls` returns a Result containing a list of (String, anyhow::Error) for failed URLs.
    match process_urls(urls_to_process, config).await {
        Ok(errors) => {
            if errors.is_empty() {
                println!("All URLs processed successfully!");
            } else {
                eprintln!("Some URLs failed to process:");
                for (url, error) in errors {
                    eprintln!("- {}: {}", url, error);
                }
            }
        }
        Err(e) => {
            eprintln!("A critical error occurred during processing: {}", e);
        }
    }
    Ok(())
}
For more detailed information on the library API, please refer to the official documentation on docs.rs.
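The max_retries semantics used in the Config above (additional attempts after the first one), together with the exponential backoff this README mentions, can be sketched with the standard library alone. The helper name `retry_with_backoff` and the base delay parameter are hypothetical; the crate's actual retry code and delays are not shown in this document. The sketch is synchronous for brevity, and async code would await `tokio::time::sleep` instead of blocking the thread.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Run `op` once, then up to `max_retries` more times while it keeps failing.
/// The wait doubles after each failed attempt (exponential backoff).
/// `op` receives the zero-based attempt number.
fn retry_with_backoff<T, E>(
    max_retries: u32,
    base_delay: Duration,
    mut op: impl FnMut(u32) -> Result<T, E>,
) -> Result<T, E> {
    let mut attempt = 0;
    loop {
        match op(attempt) {
            Ok(value) => return Ok(value),
            // Out of retries: hand the last error back to the caller.
            Err(err) if attempt >= max_retries => return Err(err),
            Err(_) => {
                // Wait 1x, 2x, 4x, ... the base delay before the next attempt.
                sleep(base_delay * 2u32.pow(attempt));
                attempt += 1;
            }
        }
    }
}
```

With max_retries set to 3, an operation that always fails is invoked four times in total, matching the comment on that field.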
This section provides a deeper dive into the architecture and inner workings of twars-url2md.
twars-url2md is built with a modular design in Rust, emphasizing performance, concurrency, and resilience.
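As a rough illustration of the concurrency and resilience aspects, the sketch below shows the concurrency cap this README gives under Concurrent Processing, min(CPU_COUNT * 2, 16), and a wrapper that recovers when a cleaning step panics. All function names are hypothetical, and `fragile_clean` and `basic_clean` are stand-ins rather than the real Monolith integration or its fallback.

```rust
use std::panic;
use std::thread;

/// Concurrency cap as stated in this README: min(CPU_COUNT * 2, 16).
fn concurrency_limit(cpu_count: usize) -> usize {
    (cpu_count * 2).min(16)
}

/// Cap for the current machine; assumes one CPU if the count is unavailable.
fn default_concurrency_limit() -> usize {
    let cpus = thread::available_parallelism().map(|n| n.get()).unwrap_or(1);
    concurrency_limit(cpus)
}

/// Stand-in for a cleaner that can panic on unusual input (hypothetical).
fn fragile_clean(html: &str) -> String {
    if html.contains("<weird>") {
        panic!("cleaner could not handle this page");
    }
    html.replace("<script></script>", "")
}

/// Stand-in for the "basic HTML processing" fallback (hypothetical).
fn basic_clean(html: &str) -> String {
    html.trim().to_string()
}

/// Try the primary cleaner; if it panics, recover and use the fallback,
/// so one bad page does not abort the whole batch.
fn clean_with_fallback(html: &str) -> String {
    match panic::catch_unwind(|| fragile_clean(html)) {
        Ok(cleaned) => cleaned,
        Err(_) => basic_clean(html),
    }
}
```

Note that `catch_unwind` only helps when the binary is built with the default unwinding panic strategy; with panic = "abort" the process would still terminate.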
Core Components:
┌──────────────────┐     ┌───────────────────┐     ┌──────────────────┐
│ CLI / Library    │────▶│ URL Extractor     │────▶│ HTTP Client      │
│ (Input Handler)  │     │ (src/url.rs)      │     │ (src/html.rs)    │
└──────────────────┘     └───────────────────┘     └─────────┬────────┘
                                                             │
                                                             ▼
┌──────────────────┐     ┌───────────────────┐     ┌─────────┴────────┐
│ Output Writer    │◀────│ Markdown Converter│◀────│ HTML Cleaner     │
│ (File System)    │     │ (src/markdown.rs) │     │ (Monolith Lib)   │
└──────────────────┘     └───────────────────┘     └──────────────────┘
CLI / Library Interface (src/cli.rs, src/lib.rs):
Accepts input from command-line arguments (parsed with clap) or library function calls.
URL Extractor & Validator (src/url.rs):
Extracts URLs from the input text using linkify.
HTTP Client (src/html.rs):
Fetches pages with a curl-based HTTP client (curl_rust crate).
HTML Cleaner (Monolith Integration):
If Monolith fails on a page, twars-url2md attempts a fallback to basic HTML processing or skips the URL, logging an error and allowing the batch job to continue.
Markdown Converter (src/markdown.rs):
Converts the cleaned HTML to Markdown using the htmd library.
Output Writer (src/lib.rs):
Writes the resulting Markdown to the file system.
Concurrent Processing:
Uses an adaptive concurrency limit (min(CPU_COUNT * 2, 16)), optimizing throughput without overloading the system.
Uses futures::stream::StreamExt::buffer_unordered for managing concurrent tasks.
Robust Error Handling:
Uses the anyhow crate for flexible and context-rich error reporting.
When a request fails, twars-url2md automatically retries. The Config.max_retries field (default 2 for CLI, configurable for library use) specifies the number of additional attempts after the first one, so max_retries: 2 means up to 3 total attempts. Retries use exponential backoff.
The built-in HTTP client is carefully configured to maximize compatibility with various web servers and Content Delivery Networks (CDNs):
Uses curl via the curl_rust crate, known for its robustness and wide protocol support.
Sends a browser-like User-Agent string (Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36) to appear like a standard browser. This is defined in src/lib.rs as USER_AGENT_STRING.
Sends common browser headers (Accept, Accept-Language, Sec-Ch-Ua, Sec-Fetch-Site, etc.) to further mimic legitimate browser traffic, reducing the likelihood of being blocked by bot detection systems.
Applies timeouts (see curl_http_client) to prevent indefinite hangs.
Output modes:
Directory structure (-o <directory>): Creates a hierarchy like output_dir/example.com/path/to/page.md.
Single file (-o <filename>.md): Concatenates all Markdown content into the specified file. If multiple URLs are processed, their content is simply joined. This mode is active if the path given to -o ends with .md and is not a directory.
Packed file (--pack <filename>.md): Combines Markdown from all URLs into a single file. Each URL's content is clearly demarcated by a heading like # <URL>. This mode also preserves the original input order of URLs in the output file.
Logging:
Uses the tracing crate for structured and configurable logging.
Supported levels: ERROR, WARN, INFO, DEBUG, TRACE.
Controlled by the -v, --verbose flag (enables INFO and DEBUG for twars_url2md modules by setting RUST_LOG=info,twars_url2md=debug).
Can also be controlled through the RUST_LOG environment variable (e.g., RUST_LOG=twars_url2md=trace for maximum detail from this application, or RUST_LOG=info for general info level). See src/main.rs for initialization logic.
Build metadata:
The build script (build.rs) embeds build time, target architecture, and profile (debug/release) into the binary. This information is accessible via the twars-url2md --version command (see src/lib.rs version() function).
This section provides guidance for setting up a development environment, building the project, and running tests.
Python 3 (optional, needed only for the issue verification script, issues/issuetest.py).
git clone https://github.com/twardoch/twars-url2md.git
cd twars-url2md
Debug build:
cargo build
The executable will be in target/debug/twars-url2md.
Release build (optimized):
cargo build --release
The executable will be in target/release/twars-url2md.
Run directly with arguments (debug mode):
cargo run -- -i urls.txt -o ./output
Consistent code style is maintained using rustfmt, and clippy is used for linting.
Format code:
cargo fmt
Run linter (Clippy):
cargo clippy --all-targets --all-features -- -D warnings
(The -D warnings flag promotes warnings to errors, ensuring high code quality.)
The project has a suite of unit and integration tests.
Run all tests:
cargo test --all-features
Run a specific test function:
# Example: cargo test test_url_extraction --all-features
cargo test <TEST_FUNCTION_NAME> --all-features
(Replace <TEST_FUNCTION_NAME> with the actual test function name.)
Run tests with output (e.g., for debugging print statements):
cargo test --all-features -- --nocapture
Run only integration tests:
Integration tests are typically located in the tests/ directory (e.g., tests/integration/e2e_tests.rs).
# Example: cargo test --test e2e_tests --all-features
cargo test --test <INTEGRATION_TEST_FILENAME_WITHOUT_RS> --all-features
Issue Verification Suite: The project includes an issue verification script to test various CLI functionalities and confirm fixes for reported issues.
python3 issues/issuetest.py
(Ensure you have Python 3 installed, along with any dependencies that script requires.)
cargo doc --no-deps --open
cargo audit
(You may need to install cargo-audit first: cargo install cargo-audit)
cargo package
cargo publish
Contributions are welcome and greatly appreciated! Whether it's bug reports, feature suggestions, documentation improvements, or code contributions, your help makes twars-url2md better.
Please see the CONTRIBUTING.md file for detailed guidelines on how to contribute to the project, including information on reporting issues, submitting pull requests, and the code of conduct.
If you're looking for ways to contribute, here are some areas where help would be valuable:
Before starting significant work, it's a good idea to open an issue to discuss your proposed changes.
If you encounter issues while using twars-url2md, this section may help.
SSL/TLS Certificate Errors:
twars-url2md uses curl, which typically relies on the system's certificate store. Ensure your system's CA certificates are up to date. For specific problematic sites, this can be complex. The tool aims for secure defaults.
Run with verbose logging (-v) for more details on the TLS handshake. If you are in a corporate environment with a custom CA, ensure that CA is trusted by your system.
CDN-Protected Sites (e.g., Cloudflare, Akamai, Adobe):
Recent versions of twars-url2md have significantly improved CDN compatibility by:
Using curl as the underlying HTTP client, which has a network stack more aligned with browsers.
Make sure you are running the latest version of twars-url2md. If you still encounter issues, verbose logging (-v or RUST_LOG) can provide clues.
Timeouts on Large or Slow Pages:
Monolith Panics or Poor Conversion on Specific Pages:
Monolith (for HTML cleaning) or the htmd library (for Markdown conversion) might struggle with extremely complex, malformed, or unusual HTML structures.
twars-url2md includes panic recovery for Monolith. If Monolith panics, the tool logs an error and attempts to fall back to a more basic HTML processing step or skips the URL. This prevents the entire batch from failing.
For more detailed insight into what the tool is doing, especially when troubleshooting, use verbose logging.
Using the -v flag:
The simplest way to get more logs is to add the -v or --verbose flag to your command. This typically sets the log level to show INFO messages from all crates and DEBUG messages from twars_url2md itself (RUST_LOG=info,twars_url2md=debug).
twars-url2md -i urls.txt -o output -v
Using the RUST_LOG Environment Variable:
For more fine-grained control, you can use the RUST_LOG environment variable. twars-url2md uses the tracing library.
Syntax: RUST_LOG="target[span{field=value}]=level" (simplified: RUST_LOG="crate_name=level,another_crate=level")
Examples:
DEBUG level for twars_url2md and INFO for everything else (similar to -v):
RUST_LOG=info,twars_url2md=debug twars-url2md -i urls.txt -o output
TRACE level for twars_url2md (very detailed, for deep debugging):
RUST_LOG=twars_url2md=trace twars-url2md -i urls.txt -o output
DEBUG for the HTML processing module specifically:
RUST_LOG=twars_url2md::html=debug twars-url2md -i urls.txt -o output
DEBUG level messages from all crates:
RUST_LOG=debug twars-url2md -i urls.txt -o output
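To show how the comma-separated directives in these examples combine, here is a simplified model of filter resolution. It handles only the bare `level` and `target=level` forms, treats the longest matching module prefix as the winner, and is not the tracing ecosystem's actual matching code; `effective_level` is a hypothetical name.

```rust
/// Resolve the level that a RUST_LOG-style filter assigns to `target`.
/// Returns None when no directive applies to the target.
fn effective_level<'a>(filter: &'a str, target: &str) -> Option<&'a str> {
    let mut best: Option<(usize, &'a str)> = None;
    for directive in filter.split(',').map(str::trim).filter(|d| !d.is_empty()) {
        // A bare level such as `info` acts as the default for every target.
        let (prefix, level) = match directive.split_once('=') {
            Some((prefix, level)) => (prefix, level),
            None => ("", directive),
        };
        let applies = prefix.is_empty()
            || target == prefix
            || target.starts_with(&format!("{prefix}::"));
        // Prefer the most specific (longest) matching prefix.
        if applies && best.map_or(true, |(len, _)| prefix.len() >= len) {
            best = Some((prefix.len(), level));
        }
    }
    best.map(|(_, level)| level)
}
```

Under this model, the filter info,twars_url2md=debug gives the twars_url2md::html module the debug level while an unrelated crate stays at info.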
Logging Levels (most to least verbose):
TRACE: Extremely detailed information, typically for fine-grained debugging.
DEBUG: Detailed information useful for debugging.
INFO: Informational messages about the progress of the application.
WARN: Warnings about potential issues that don't stop execution.
ERROR: Errors that prevent a specific operation (e.g., processing one URL) but don't crash the application.
If you consistently encounter an issue not covered here, consider opening an issue on the GitHub repository with detailed information, including the command you ran, the output, and logs if possible.
This project is licensed under the MIT License. See the LICENSE file for details.
Adam Twardoch (@twardoch)
twars-url2md builds upon the excellent work of others:
For bug reports, feature requests, or questions, please open an issue on the GitHub repository.