Toru: OCaml Data Repository Manager#
Toru is an OCaml library for managing data file downloads and caching, fully compatible with Python Pooch registry files. Built on the Eio effects system for efficient concurrent operations.
🚀 Features#
✅ Complete Implementation#
- 🔒 Hash Verification: SHA256, SHA1, MD5 with automatic format detection
- 📋 Registry Management: Full Pooch compatibility for registry files
- 💾 Smart Caching: XDG-compliant cache with versioning and management APIs
- 🔗 Multiple Download Methods: Modular system (wget/curl/future cohttp-eio)
- 🔐 Authentication: Per-downloader HTTP Basic Auth support
- ⚡ Concurrent Downloads: Efficient parallel downloads using Eio
- 🛠️ CLI Tools: Cache management and inspection utilities
- 🔄 Cross-Validation: Python/Pooch compatibility verified
🎯 Production Ready#
- Type Safe: Leverages OCaml's type system for robust error handling
- Well Tested: Comprehensive test suite with cross-validation against Python Pooch
- Performance: Concurrent operations and efficient file handling
- Modular: Clean interfaces with easy extensibility
📦 Installation#
# Clone the repository
git clone https://github.com/yourusername/toru
cd toru
# Build with dune
dune build
# Run tests
dune exec test/test_hash.exe
dune exec test/test_registry.exe
dune exec test/test_cache.exe
dune exec test/test_python_cross_validation.exe
Dependencies#
- eio (>= 1.0): Effects-based I/O
- digestif (>= 1.0): Cryptographic hashes
- yojson: JSON parsing
- cmdliner: CLI argument parsing
- ptime: Time handling
- fmt: Formatted output with colors and styling
🔧 Usage#
Basic Library Usage#
open Eio.Std

let main ~env ~sw =
  (* Create a Toru instance *)
  let toru = Toru.create ~sw ~env
    ~base_url:"https://github.com/myorg/data/raw/main/"
    ~cache_path:"~/.myapp/data"
    ~version:"v1.0"
    ~registry_file:"registry.txt"
    () in

  (* Fetch a single file. The parentheses keep the code that follows
     from being parsed as part of the last match arm. *)
  (match Toru.fetch toru ~filename:"data.csv" () with
   | Ok path ->
     traceln "File available at: %s" (Eio.Path.native_exn path)
   | Error msg ->
     traceln "Failed to fetch: %s" msg);

  (* Download all files concurrently *)
  match Toru.fetch_all toru ~concurrency:4 () with
  | Ok () -> traceln "Downloaded all files successfully"
  | Error msg -> traceln "Download failed: %s" msg
Hash Module#
open Toru.Hash

(* Parse hash with automatic format detection *)
let hash1 = of_string "sha1:abc123def456789"
let hash2 = of_string "d1f947c87017eebc8b98d6c3944eaea813dd..." (* SHA256 by length *)

(* Compute and verify file hashes
   (file_path and expected_hash are assumed to be defined elsewhere) *)
let file_hash = compute SHA256 file_path
let is_valid = verify file_path expected_hash
Registry Management#
open Toru.Registry

(* Load Pooch-compatible registry *)
let registry = load registry_path in

(* Query registry. The parentheses end the match before the next statement. *)
(match find "data.csv" registry with
 | Some entry ->
   let hash = hash entry in
   let filename = filename entry in
   Printf.printf "Found %s with hash %s\n" filename (Hash.to_string hash)
 | None -> Printf.printf "File not found\n");

(* Create and save registry *)
let entry = create_entry ~filename:"data.csv" ~hash:computed_hash () in
let updated_registry = add entry registry in
save output_path updated_registry
Cache Management#
open Toru.Cache

(* Create cache with XDG compliance *)
let cache = create ~sw ~env ~version:"v1.0" "/path/to/cache" in

(* Check file existence and get paths *)
let file_path = file_path cache "data.csv" in
let exists = exists cache "data.csv" in

(* Management operations *)
let stats = usage_stats cache in
Printf.printf "Cache size: %Ld bytes, %d files\n" stats.total_size stats.file_count;

(* Clean up cache. Int64 values need Int64.mul; the ( * ) operator only works on int. *)
trim_to_size cache (Int64.mul 1024L (Int64.mul 1024L 1024L)); (* 1 GiB limit *)
vacuum cache (* Remove empty directories *)
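The README describes the cache as XDG-compliant but does not show how the default location is chosen. The sketch below is one plausible reading of the XDG Base Directory rules, using only the standard library: prefer an absolute, non-empty XDG_CACHE_HOME, otherwise fall back to $HOME/.cache. The function name default_cache_dir, its app_name parameter, and the injectable getenv argument are assumptions for illustration, not part of Toru's API.

```ocaml
(* Hypothetical helper, not Toru's API: resolve an XDG-style cache
   directory for [app_name]. [getenv] is injectable so the logic can
   be tested without touching the real environment. *)
let default_cache_dir ?(getenv = Sys.getenv_opt) ~app_name () : string option =
  (* Empty or relative values are treated as unset. *)
  let lookup var =
    match getenv var with
    | Some v when v <> "" && not (Filename.is_relative v) -> Some v
    | _ -> None
  in
  match lookup "XDG_CACHE_HOME" with
  | Some base -> Some (Filename.concat base app_name)
  | None ->
    lookup "HOME"
    |> Option.map (fun home ->
         Filename.concat (Filename.concat home ".cache") app_name)
```

Returning an option instead of raising leaves the caller free to decide what to do when neither variable is usable, for example requiring an explicit cache path.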
🖥️ CLI Tools#
Cache Management#
# Show cache information
toru-cache info
# List cached files
toru-cache list --sort=size --limit=10
# Show size breakdown
toru-cache size --breakdown --human-readable
# Clean cache (dry run)
toru-cache clean --max-size=1GB --dry-run
# Remove files older than 30 days
toru-cache clean --max-age=30
# Clean up empty directories
toru-cache vacuum
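The README does not say which files a size-limited clean removes first. As a sketch of one common policy, the function below evicts the oldest files until the remaining total fits under the limit. The cached_file record and files_to_evict are hypothetical names, and oldest-first eviction is an assumption rather than Toru's documented behaviour.

```ocaml
(* Hypothetical record describing one cached file. *)
type cached_file = { name : string; size : int64; mtime : float }

(* Return the files to delete, oldest first, so that the files that
   remain total at most [max_size] bytes. Nothing is deleted here;
   the caller decides what to do with the list (e.g. a dry run). *)
let files_to_evict ~(max_size : int64) (files : cached_file list)
    : cached_file list =
  let total = List.fold_left (fun acc f -> Int64.add acc f.size) 0L files in
  let oldest_first =
    List.sort (fun a b -> Float.compare a.mtime b.mtime) files
  in
  let rec go remaining evicted = function
    | f :: rest when Int64.compare remaining max_size > 0 ->
      go (Int64.sub remaining f.size) (f :: evicted) rest
    | _ -> List.rev evicted
  in
  go total [] oldest_first
```

Because the function only computes a list, the same code can back both a dry run, which prints the list, and a real clean, which deletes the listed files.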
🔬 Python Compatibility#
Toru is fully compatible with Python Pooch registries. We provide comprehensive cross-validation tests:
# Generate Python test data (requires uv)
cd test/python && uv run generate_pooch_registry.py
# Run cross-validation tests
dune exec test/test_python_cross_validation.exe
Registry Format Support#
Standard Pooch Format:
# Comments supported
data/file1.csv d1f947c87017eebc8b98d6c3944eaea813ddcfb6ceafa96db0bb70675abd4f28
data/file2.txt sha1:0a0a9f2a6772942557ab5355d76af442f8f65e01
archive.zip md5:65a8e27d8879283831b664bd8b7f0ad4
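To make the format above concrete, here is a minimal parsing sketch using only the standard library. It skips blank lines and # comments and splits each remaining line into a filename and a hash field. The names parse_line and parse_registry are hypothetical and do not reflect Toru's actual parser, and the sketch assumes filenames contain no whitespace.

```ocaml
(* Hypothetical helper, not Toru's API: parse one registry line into
   (filename, hash). Returns None for blank lines, comments and lines
   with fewer than two fields. *)
let parse_line (line : string) : (string * string) option =
  let line = String.trim line in
  if line = "" || line.[0] = '#' then None
  else
    (* Split on spaces and tabs, dropping empty fields produced by
       repeated separators. Extra fields after the hash are ignored. *)
    match
      String.split_on_char ' ' line
      |> List.concat_map (String.split_on_char '\t')
      |> List.filter (fun field -> field <> "")
    with
    | filename :: hash :: _ -> Some (filename, hash)
    | _ -> None

(* Parse a whole registry text into a list of (filename, hash) pairs. *)
let parse_registry (text : string) : (string * string) list =
  String.split_on_char '\n' text |> List.filter_map parse_line
```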
Mixed Format Support:
- SHA256: filename hash or filename sha256:hash
- SHA1: filename sha1:hash
- MD5: filename md5:hash
- Automatic detection by hash length for unprefixed formats
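The detection rules listed above can be sketched as follows. The length mapping (64 hex digits for SHA256, 40 for SHA1, 32 for MD5) follows from the digest sizes of those algorithms. The type algorithm and the function detect are illustrative names, not Toru.Hash's real interface, and rejecting non-hex input is an assumption about how strict the parser is.

```ocaml
type algorithm = SHA256 | SHA1 | MD5

(* True when [s] is a non-empty string of hexadecimal digits. *)
let is_hex (s : string) : bool =
  s <> ""
  && String.for_all
       (function '0' .. '9' | 'a' .. 'f' | 'A' .. 'F' -> true | _ -> false)
       s

(* Hypothetical sketch: split an "alg:hex" or bare "hex" spec into an
   algorithm and its hex digest. Bare digests are classified by length;
   prefixed digests are not length-checked here. *)
let detect (spec : string) : (algorithm * string, string) result =
  let by_length hex =
    match String.length hex with
    | 64 -> Ok (SHA256, hex)
    | 40 -> Ok (SHA1, hex)
    | 32 -> Ok (MD5, hex)
    | n -> Error (Printf.sprintf "cannot infer algorithm from %d hex digits" n)
  in
  match String.index_opt spec ':' with
  | None -> if is_hex spec then by_length spec else Error "not a hex digest"
  | Some i ->
    let prefix = String.lowercase_ascii (String.sub spec 0 i) in
    let hex = String.sub spec (i + 1) (String.length spec - i - 1) in
    if not (is_hex hex) then Error "not a hex digest"
    else
      match prefix with
      | "sha256" -> Ok (SHA256, hex)
      | "sha1" -> Ok (SHA1, hex)
      | "md5" -> Ok (MD5, hex)
      | other -> Error ("unknown algorithm: " ^ other)
```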
🏗️ Architecture#
Modular Design#
- Hash Module: Multi-algorithm support with verification
- Registry Module: Pooch-compatible parsing and management
- Cache Module: XDG-compliant storage with management APIs
- Downloader Modules: Pluggable download implementations
- Main Toru Module: High-level interface combining all components
Download Strategy#
- Phase 1 (Current): External tools (wget/curl) for immediate functionality
- Phase 2 (Future): Pure OCaml implementation (cohttp-eio)
- Benefits: Battle-tested tools now, migration path to pure OCaml later
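One way to picture Phase 1 is a pure function that builds the command line for each external tool, leaving process spawning to the caller. The sketch below uses only long options that curl and wget document (--fail, --silent, --show-error, --location, --user, --output for curl; --quiet, --user, --password, --output-document for wget). The types tool and auth and the function download_argv are hypothetical, and the exact flags Toru passes are not stated in this README.

```ocaml
type tool = Wget | Curl

(* Hypothetical credentials record for HTTP Basic Auth. *)
type auth = { username : string; password : string }

(* Build the argv for downloading [url] to [dest] with the given tool.
   This only constructs the argument list; it does not run anything. *)
let download_argv ?auth tool ~url ~dest : string list =
  match tool with
  | Curl ->
    let auth_args =
      match auth with
      | Some { username; password } -> [ "--user"; username ^ ":" ^ password ]
      | None -> []
    in
    [ "curl"; "--fail"; "--silent"; "--show-error"; "--location" ]
    @ auth_args
    @ [ "--output"; dest; url ]
  | Wget ->
    let auth_args =
      match auth with
      | Some { username; password } ->
        [ "--user=" ^ username; "--password=" ^ password ]
      | None -> []
    in
    [ "wget"; "--quiet" ] @ auth_args @ [ "--output-document=" ^ dest; url ]
```

Keeping the argument list as a string list, rather than a single shell string, avoids quoting problems with unusual URLs or paths. Note that credentials passed as arguments are visible in the process list, so a real implementation may prefer a config file or standard input.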
Authentication Support#
Per-downloader authentication configuration:
- Environment variables: TORU_WGET_USERNAME, TORU_CURL_USERNAME, etc.
- CLI arguments: --wget-username, --curl-password, etc.
- Programmatic API: Auth configuration per downloader type
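As an illustration of the environment-variable route, the sketch below looks up a username and password for a given downloader name. Only the TORU_WGET_USERNAME and TORU_CURL_USERNAME variables are named in this README; the matching _PASSWORD variables, the auth record, and the function auth_from_env are assumptions made for the example.

```ocaml
(* Hypothetical credentials record. *)
type auth = { username : string; password : string }

(* Look up TORU_<TOOL>_USERNAME and TORU_<TOOL>_PASSWORD (the latter is
   an assumed name). Returns None unless both are set. [getenv] is
   injectable so the lookup can be tested without real environment state. *)
let auth_from_env ?(getenv = Sys.getenv_opt) (tool : string) : auth option =
  let var suffix =
    Printf.sprintf "TORU_%s_%s" (String.uppercase_ascii tool) suffix
  in
  match getenv (var "USERNAME"), getenv (var "PASSWORD") with
  | Some username, Some password -> Some { username; password }
  | _ -> None
```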
🧪 Testing#
Comprehensive Test Suite#
# Core module tests
dune exec test/test_hash.exe
dune exec test/test_registry.exe
dune exec test/test_cache.exe
dune exec test/test_downloader.exe
# Integration tests
dune exec test/test_python_cross_validation.exe
dune exec test/test_cache_xdg.exe
# All tests
dune runtest
Cross-Validation#
- Python Generator: Creates test data using Python Pooch
- OCaml Validation: Verifies compatibility with generated data
- Format Testing: All hash formats and registry variations
- Round-trip Testing: Parse → serialize → parse consistency
📈 Performance#
- Concurrent Downloads: Configurable parallelism using Eio
- Efficient Hashing: Streaming for large files, optimized algorithms
- Smart Caching: Only downloads when needed, hash verification
- Memory Efficient: Streaming I/O, minimal memory footprint
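To illustrate the streaming idea with nothing but the standard library, the sketch below hashes a file through a channel instead of reading it into a string first. It uses the standard Digest module, which computes MD5 only; Toru itself depends on digestif for SHA256 and SHA1, so this is a stand-in for the technique rather than Toru's implementation. The names md5_hex_of_file and verify_md5 are hypothetical.

```ocaml
(* Hash a file by reading it through a channel. Passing -1 as the
   length tells Digest.channel to read until end of file. *)
let md5_hex_of_file (path : string) : string =
  let ic = open_in_bin path in
  Fun.protect
    ~finally:(fun () -> close_in_noerr ic)
    (fun () -> Digest.to_hex (Digest.channel ic (-1)))

(* Compare against an expected hex digest, ignoring letter case. *)
let verify_md5 ~(expected : string) (path : string) : bool =
  String.lowercase_ascii expected = md5_hex_of_file path
```

A cache lookup can then skip the download whenever the file exists and the verification succeeds, which is the behaviour the "Smart Caching" item describes.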
🛣️ Roadmap#
Completed ✅#
- Core hash verification (SHA256/SHA1/MD5)
- Pooch-compatible registry parsing
- XDG-compliant caching with management APIs
- External tool download system (wget/curl)
- Per-downloader authentication
- Comprehensive CLI tools
- Python cross-validation testing
In Progress 🚧#
- Main Toru interface implementation
- make_registry utility for directory scanning
- Retry mechanisms with exponential backoff
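Since retry support is still in progress, the following is only a sketch of what exponential backoff could look like, not the planned design. The delay doubles after each failed attempt and is capped at a maximum. The names backoff_delay and retry, the 0.5 s base, and the 30 s cap are all assumptions; the sleep function is passed in so the caller can supply a real sleep or a test stub.

```ocaml
(* Delay before the next try: base * 2^attempt, capped at [max_delay].
   [attempt] is 0 for the first failure. *)
let backoff_delay ~(base : float) ~(max_delay : float) (attempt : int) : float =
  Float.min max_delay (base *. (2.0 ** float_of_int attempt))

(* Call [f] until it returns Ok or [max_attempts] tries have been made,
   sleeping between tries. Returns the last Error if every try fails. *)
let retry ~(max_attempts : int) ~(sleep : float -> unit)
    (f : unit -> ('a, 'e) result) : ('a, 'e) result =
  let rec loop attempt =
    match f () with
    | Ok _ as ok -> ok
    | Error _ as err when attempt + 1 >= max_attempts -> err
    | Error _ ->
      sleep (backoff_delay ~base:0.5 ~max_delay:30.0 attempt);
      loop (attempt + 1)
  in
  loop 0
```

A production version would usually add random jitter to the delay and retry only on errors that are likely to be transient, such as timeouts.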
Planned 📋#
- Pure OCaml HTTP client (cohttp-eio)
- SFTP protocol support
- DOI resolution (Zenodo/Figshare)
- Advanced archive processing
- Progress reporting enhancements
🤝 Contributing#
- Fork the repository
- Create a feature branch
- Add tests for new functionality
- Ensure all tests pass: dune runtest
- Submit a pull request
📄 License#
MIT License - see LICENSE file for details.
🙏 Acknowledgments#
- Python Pooch - Inspiration and compatibility target
- Eio - Modern effects-based I/O
- digestif - Cryptographic hashing
- OCaml community for excellent libraries and tools
Toru: Your OCaml companion for data repository management! 🦀⚡