My agentic slop goes here. Not intended for anyone else!
Toru Implementation TODO#
This document outlines the implementation plan for Toru, an OCaml data repository manager compatible with Python Pooch registry files.
Phase 1: Core Modules#
1.1 Hash Module ✨#
- Define abstract Hash.t type with algorithm variants (SHA256, SHA1, MD5)
- Implement create, of_string, to_string functions
- Add algorithm parsing with prefix support ("sha1:", "md5:", plain)
- Implement file verification using digestif library
- Add hash computation for files
- Create comprehensive test suite with known hash values
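The prefix parsing described above could look roughly like the following. This is a minimal sketch, not Toru's actual Hash module: the record layout, the length check, and the rule that a plain (unprefixed) digest means SHA256 are assumptions, and the digest computation itself (planned via digestif) is left out.

```ocaml
(* Hypothetical sketch of hash-string parsing; names are illustrative only. *)
type algorithm = SHA256 | SHA1 | MD5

type t = { algorithm : algorithm; value : string }

let algorithm_to_string = function
  | SHA256 -> "sha256"
  | SHA1 -> "sha1"
  | MD5 -> "md5"

(* Number of hex characters expected for each digest. *)
let hex_length = function SHA256 -> 64 | SHA1 -> 40 | MD5 -> 32

let is_hex s =
  let ok = ref (s <> "") in
  String.iter (function '0' .. '9' | 'a' .. 'f' -> () | _ -> ok := false) s;
  !ok

(* Accepts "sha1:<hex>", "md5:<hex>", "sha256:<hex>" or a plain digest.
   Assumption: a plain digest is treated as SHA256. *)
let of_string s =
  let s = String.trim s in
  let algorithm, value =
    match String.index_opt s ':' with
    | None -> (Ok SHA256, s)
    | Some i ->
      let prefix = String.lowercase_ascii (String.sub s 0 i) in
      let rest = String.sub s (i + 1) (String.length s - i - 1) in
      let algorithm =
        match prefix with
        | "sha256" -> Ok SHA256
        | "sha1" -> Ok SHA1
        | "md5" -> Ok MD5
        | other -> Error ("unknown hash algorithm: " ^ other)
      in
      (algorithm, rest)
  in
  match algorithm with
  | Error msg -> Error msg
  | Ok algorithm ->
    let value = String.lowercase_ascii value in
    if is_hex value && String.length value = hex_length algorithm then
      Ok { algorithm; value }
    else Error "malformed digest"

(* SHA256 digests are written without a prefix, the others with one. *)
let to_string { algorithm; value } =
  match algorithm with
  | SHA256 -> value
  | SHA1 | MD5 -> algorithm_to_string algorithm ^ ":" ^ value
```

Returning a result rather than raising keeps malformed registry entries an explicit case for the caller to handle.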
1.2 Registry Module 📋#
- Design abstract Registry.t and Registry.entry types
- Implement Pooch-compatible file parser (comments, blank lines)
- Add entry creation with filename, hash, optional custom URL
- Implement registry operations (find, exists, add, remove)
- Support loading from files and URLs
- Add registry serialization (to_string, save)
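As an illustration of the parser item above, here is a small sketch of reading a Pooch-style registry, where each line holds a filename, a hash, and an optional URL. The entry record and function names are hypothetical, and splitting on whitespace is an assumption: filenames containing spaces are not handled.

```ocaml
(* Hypothetical sketch of a Pooch-style registry parser. *)
type entry = { filename : string; hash : string; url : string option }

(* Split on runs of spaces or tabs, dropping empty fields. *)
let fields line =
  String.split_on_char ' ' (String.map (function '\t' -> ' ' | c -> c) line)
  |> List.filter (fun f -> f <> "")

(* Blank lines and lines starting with '#' yield [Ok None]. *)
let parse_line line =
  let line = String.trim line in
  if line = "" || line.[0] = '#' then Ok None
  else
    match fields line with
    | [ filename; hash ] -> Ok (Some { filename; hash; url = None })
    | [ filename; hash; url ] -> Ok (Some { filename; hash; url = Some url })
    | _ -> Error ("malformed registry line: " ^ line)

(* Parse a whole registry, stopping at the first malformed line. *)
let parse text =
  String.split_on_char '\n' text
  |> List.fold_left
       (fun acc line ->
         match (acc, parse_line line) with
         | Error _, _ -> acc
         | Ok entries, Ok None -> Ok entries
         | Ok entries, Ok (Some e) -> Ok (e :: entries)
         | Ok _, Error msg -> Error msg)
       (Ok [])
  |> Result.map List.rev
```

The hash field is kept as a raw string here; a fuller version would presumably pass it through the Hash module's parser.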
1.3 Cache Module 💾#
- Create abstract Cache.t type with base path management
- Implement XDG Base Directory specification
- Add version subdirectory support
- Implement cache operations (exists, clear, size, list)
- Add lazy directory creation
- Support environment variable overrides (TORU_CACHE_DIR)
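The lookup order implied by these items might be sketched as below. This is an assumption-laden illustration using only the standard library, not the planned xdg-eio integration: TORU_CACHE_DIR is taken verbatim when set, otherwise XDG_CACHE_HOME (or $HOME/.cache when that is unset or empty) is joined with the application name. The getenv parameter is a hypothetical hook that makes the function testable.

```ocaml
(* Hypothetical sketch of cache directory resolution. *)
let cache_dir ?version ~getenv ~app_name () =
  (* Treat unset and empty variables the same way, as the XDG spec does. *)
  let non_empty name =
    match getenv name with Some "" | None -> None | some -> some
  in
  let base =
    match non_empty "TORU_CACHE_DIR" with
    | Some dir -> dir
    | None ->
      let root =
        match non_empty "XDG_CACHE_HOME" with
        | Some dir -> dir
        | None ->
          let home = Option.value (non_empty "HOME") ~default:"." in
          Filename.concat home ".cache"
      in
      Filename.concat root app_name
  in
  (* Optional version subdirectory. *)
  match version with
  | None -> base
  | Some v -> Filename.concat base v
```

In a real program the call would likely be `cache_dir ~getenv:Sys.getenv_opt ~app_name:"toru" ()`. No directory is created here, which fits the lazy directory creation item.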
Phase 2: External Tool Integration#
2.1 Modular Downloader Interface 🔌#
- Define DOWNLOADER module signature
- Create abstract Downloader.t type with module wrapping
- Implement tool detection and availability checking
- Add downloader selection (wget, curl, auto-detect)
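Tool detection and auto-selection could be approximated as follows. The sketch searches a PATH-style string for the executables and prefers wget over curl in auto mode; that preference order, the function names, and the injectable exists predicate are assumptions. Sys.file_exists does not check the executable bit, so a real implementation would need a stricter test.

```ocaml
(* Hypothetical sketch of downloader detection and selection. *)

(* Look for [tool] in each directory of a colon-separated PATH string. *)
let find_in_path ?(exists = Sys.file_exists) ~path tool =
  String.split_on_char ':' path
  |> List.filter (fun dir -> dir <> "")
  |> List.map (fun dir -> Filename.concat dir tool)
  |> List.find_opt exists

(* Resolve a requested downloader to one that is actually available. *)
let select ?exists ~path choice =
  let available tool = find_in_path ?exists ~path tool <> None in
  match choice with
  | `Wget when available "wget" -> Ok `Wget
  | `Curl when available "curl" -> Ok `Curl
  | `Auto when available "wget" -> Ok `Wget
  | `Auto when available "curl" -> Ok `Curl
  | `Wget -> Error "wget not found on PATH"
  | `Curl -> Error "curl not found on PATH"
  | `Auto -> Error "neither wget nor curl found on PATH"
```

A caller might pass `~path:(Option.value (Sys.getenv_opt "PATH") ~default:"")`.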
2.2 Wget Downloader Implementation 📥#
- Implement Wget_downloader module with DOWNLOADER interface
- Add resume support with --continue flag
- Handle timeout, retry, and quiet options
- Implement hash verification after download
- Add comprehensive error handling with exit codes
2.3 Curl Downloader Implementation 📦#
- Implement Curl_downloader module with DOWNLOADER interface
- Add resume support with --continue-at - flag
- Configure timeout, retry, and progress options
- Implement hash verification after download
- Handle various curl error conditions
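One way to picture the curl options above is as a pure function that builds the argument vector, leaving process spawning to the downloader module. The function name, defaults, and the choice to add --fail and --location are assumptions; the flags themselves are standard curl options.

```ocaml
(* Hypothetical sketch: assemble a curl command line as a string list. *)
let curl_args ?(resume = true) ?(timeout = 300.0) ?(retries = 3)
    ?(quiet = true) ~url ~dest () =
  List.concat
    [ (* --fail turns HTTP errors into a non-zero exit code. *)
      [ "curl"; "--fail"; "--location" ];
      (* "--continue-at -" asks curl to work out the resume offset itself. *)
      (if resume then [ "--continue-at"; "-" ] else []);
      (if quiet then [ "--silent"; "--show-error" ] else [ "--progress-bar" ]);
      [ "--max-time"; Printf.sprintf "%.0f" timeout;
        "--retry"; string_of_int retries;
        "--output"; dest;
        url ] ]
```

Keeping this step free of side effects makes it straightforward to unit-test the flag combinations without touching the network.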
Phase 3: Main Interface#
3.1 Toru Module Core 🎯#
- Design abstract Toru.t type with accessor functions
- Implement constructor with registry loading
- Add base_url, cache, and registry accessors
- Create single file fetch functionality
- Implement processor pipeline for post-download transformations
3.2 Concurrent Operations ⚡#
- Implement fetch_all with configurable concurrency
- Add Eio fiber-based parallel downloads
- Implement progress reporting integration
- Add error aggregation for batch operations
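The error aggregation item can be illustrated independently of the concurrency machinery. The sketch below assumes each download has already produced a (name, result) pair, for example from something like Eio's Fiber.List.map with a max_fibers limit, and folds them into a single outcome. The function name and the shape of the result are hypothetical.

```ocaml
(* Hypothetical sketch: collapse per-file results into one batch result.
   Returns [Ok] with every (name, path) pair when all downloads succeeded,
   otherwise [Error] with every (name, message) pair that failed. *)
let aggregate results =
  let succeeded, failed =
    List.partition_map
      (function
        | name, Ok path -> Either.Left (name, path)
        | name, Error msg -> Either.Right (name, msg))
      results
  in
  match failed with
  | [] -> Ok succeeded
  | _ :: _ -> Error failed
```

Reporting all failures at once, rather than stopping at the first, lets a batch fetch tell the user exactly which files need retrying.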
3.3 Static Utilities 🛠️#
- Implement standalone retrieve function
- Add registry manipulation functions
- Support base URL updates
- Create convenience functions for common use cases
Phase 4: Testing & Validation#
4.1 Tessera-Manifests Integration 🧪#
- Set up test fixtures with tessera-manifests URLs
- Test embeddings registry parsing (2024 data)
- Validate landmasks registry parsing
- Test geographic coordinate extraction
- Performance test with large manifests (>100 entries)
4.2 Hash Algorithm Tests 🔐#
- Test SHA256, SHA1, MD5 verification with known files
- Validate prefix parsing ("sha1:abc123", "md5:def456")
- Test hash computation accuracy
- Error handling for invalid hash formats
4.3 Download Integration Tests 📡#
- Test wget downloader with real tessera data
- Test curl downloader with resume functionality
- Validate hash verification after downloads
- Test error handling for network failures
Phase 5: CLI & User Experience#
5.1 Command Line Interface 💻#
- Integrate cmdliner for argument parsing
- Add downloader selection (--downloader wget|curl|auto)
- Implement cache path configuration
- Add verbose/quiet mode options
5.2 Progress Reporting 📊#
- Integrate OCaml progress library
- Show download speed and ETA
- Support multiple concurrent progress bars
- Add file name and size information
5.3 Archive Processing 📁#
- Implement untar_gz processor using system tar
- Add unzip processor using system unzip
- Support untar_xz with tar -xJf
- Create custom processor interface
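A rough sketch of how the system-tool processors could map to command lines is shown below. The variant names, the suffix-based detection, and the extra unzip flags are assumptions; only the tar and unzip invocations themselves reflect the items above.

```ocaml
(* Hypothetical sketch: archive processors as command-line builders. *)
type processor = Untar_gz | Untar_xz | Unzip

(* Build the argument vector that extracts [archive] into [dest]. *)
let command processor ~archive ~dest =
  match processor with
  | Untar_gz -> [ "tar"; "-xzf"; archive; "-C"; dest ]
  | Untar_xz -> [ "tar"; "-xJf"; archive; "-C"; dest ]
  | Unzip -> [ "unzip"; "-o"; "-q"; archive; "-d"; dest ]

(* Guess a processor from the file name, if any applies. *)
let of_filename name =
  let has suffix = Filename.check_suffix name suffix in
  if has ".tar.gz" || has ".tgz" then Some Untar_gz
  else if has ".tar.xz" then Some Untar_xz
  else if has ".zip" then Some Unzip
  else None
```

A custom processor interface could then be any function from a downloaded path to a result, with these three as built-in instances.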
Phase 6: Future Extensions#
6.1 Pure OCaml Implementation 🐪#
- Implement Cohttp_downloader module
- Add streaming download support
- Implement HTTP Range requests for resume
- TLS support with tls-eio
- Migrate from external tools gradually
6.2 DOI Resolution (Toru-DOI) 📚#
- Create separate toru-doi library
- Implement Zenodo API integration
- Add Figshare API support
- DOI to registry conversion
- Metadata caching and rate limiting
6.3 Advanced Features 🚀#
- FTP protocol support (since moved to "Features We Won't Implement")
- Authentication mechanisms (API keys, tokens)
- Checksum verification during download
- Partial download recovery
- Registry merging and diff operations
Dependencies#
Core Dependencies#
- eio (>= 1.0) - Effects-based I/O
- digestif (>= 1.0) - Cryptographic hashes
- uri - URL parsing
- cmdliner - CLI parsing
System Dependencies#
- wget or curl - Download tools (one required)
Optional Dependencies#
- progress - Progress bars
- yojson - JSON configuration
- tar, unzip - Archive processing
Success Criteria#
Phase 1 Success ✅#
- All core modules pass unit tests
- Hash verification works with digestif
- Registry parsing handles tessera-manifests correctly
- Cache follows XDG directory specification
Phase 2 Success ✅#
- Both wget and curl downloaders work
- Resume functionality tested with interrupted downloads
- Automatic tool detection and fallback
- Hash verification after external tool downloads
Phase 3 Success ✅#
- Full tessera-manifests integration test passes
- Concurrent downloads work without conflicts
- Single-file fetch and batch fetch both functional
- Processor pipeline handles archives correctly
Final Success ✅#
- Complete tessera geospatial data download workflow
- CLI tool usable for real data management
- Documentation and examples complete
- Performance acceptable for large datasets (GB scale)
XDG Integration Notes#
The current TODO includes XDG Base Directory specification support through the xdg-eio library. This provides:
- Automatic XDG cache directory detection
- Cross-platform path handling (Unix, macOS, Windows)
- Environment variable overrides (XDG_CACHE_HOME, etc.)
- Pretty-printing for user-friendly directory display
This TODO represents approximately 6-8 weeks of development work, focusing on robust external tool integration before migrating to pure OCaml implementation.
Additional Features for Pooch Compatibility#
1. Authentication Support#
- HTTP Authentication: Basic auth via username/password
- FTP Authentication: Username/password credentials (moved to "Won't Implement")
- SFTP Authentication: SSH-based secure file transfer
- Environment Variable Patterns: Standardized env var support for credentials
2. Additional Download Protocols#
- SFTP: Secure file transfer (requires SSH libraries)
- DataVerse DOI Support: Beyond Zenodo/Figshare (added in Pooch v1.7.0)
3. Registry Management Utilities#
- make_registry equivalent: Auto-generate registry files from directories
- DOI-based Registry Loading: load_registry_from_doi() functionality
- Recursive Directory Scanning: For registry generation
4. Advanced Download Features#
- Retry Mechanisms: Exponential backoff (1s → 10s max)
- Temporary File Handling: Atomic file replacement during downloads
- Hash Algorithm Flexibility: MD5/SHA1 support (implemented with automatic detection)
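The temporary file handling item can be sketched with the standard library alone: write to a temporary file in the destination directory, then rename it over the target, so a failed download never leaves a truncated file behind. The function name and the temporary file naming are hypothetical, and the atomicity of the rename relies on both paths being on the same filesystem.

```ocaml
(* Hypothetical sketch: atomic file replacement via temp file + rename. *)
let write_atomically ~dest write =
  let dir = Filename.dirname dest in
  (* Same directory as [dest], so the rename stays on one filesystem. *)
  let tmp = Filename.temp_file ~temp_dir:dir ".toru-" ".part" in
  let run () =
    let oc = open_out_bin tmp in
    Fun.protect
      ~finally:(fun () -> close_out_noerr oc)
      (fun () ->
        write oc;
        flush oc)
  in
  match run () with
  | () -> Sys.rename tmp dest
  | exception e ->
    (* Clean up the partial file and leave [dest] untouched. *)
    (try Sys.remove tmp with Sys_error _ -> ());
    raise e
```

Hash verification would naturally sit between the write and the rename, so that only verified content ever appears under the final name.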
5. Processing/Archive Support#
- Built-in Processors:
  - Unzip processor for ZIP files
  - Untar processor for TAR archives
  - Decompress processor for compressed files
- Processor Chaining: Sequential processing pipeline
- Archive-Specific Handling: Beyond basic shell tool integration
6. Progress Reporting Integration#
- Multiple Progress Libraries: tqdm integration vs OCaml progress
- Progress Bar Customization: Custom progress objects
- Stderr Output Control: Configurable progress display
7. Utilities and Helper Functions#
- Version Compatibility Checking: check_version() utility
- Logging Integration: Built-in logging support with levels
- Test Runner: pooch.test() functionality
8. Environment Variable Standards#
- XDG Compliance: Full XDG Base Directory specification
- Platform-Specific Defaults: Windows %LOCALAPPDATA% patterns
- Standardized Override Patterns: Consistent env var naming
9. Registry Format Features#
- Comment Support: Lines starting with # (already planned)
- Multiple Hash Formats: SHA256, SHA1, MD5 with automatic detection
- Registry Validation: Built-in format checking
10. API Design Differences#
- Static Methods: pooch.retrieve() for one-off downloads (already planned)
- Factory Functions: pooch.create() vs constructor patterns
- Callable Downloaders: Function-based custom downloaders vs module system
Features Where Toru Has Advantages#
1. Concurrency#
- Parallel Downloads: fetch_all with configurable concurrency
- Eio-based Async: Modern effects-based concurrency
2. Type Safety#
- OCaml Type System: Compile-time error prevention
- Result Types: Explicit error handling vs exceptions
3. Modular Architecture#
- Downloader Modules: Clean module interface vs callable objects
- External Tool Integration: wget/curl with migration path to pure OCaml
4. Performance Path#
- Migration Strategy: External tools → pure OCaml implementation
- Resume Support: Via wget/curl initially, then native implementation
Implementation Priorities#
Priority 1 (Core Compatibility)#
- Add Authentication Support: HTTP Basic, environment variables (FTP removed)
- Implement make_registry: Directory scanning utility
- Add More Hash Algorithms: SHA1, MD5 support (completed)
- Enhance Progress Reporting: Better integration with OCaml ecosystem
- Unified Configuration: Cmdliner + environment variable integration
Priority 2 (Advanced Features)#
- SFTP Protocol: Using OCaml SSH libraries
- Retry Mechanisms: Exponential backoff implementation
- Built-in Processors: Native OCaml archive handling
- DataVerse DOI Support: Extend DOI resolver
Priority 3 (Ecosystem Integration)#
- Command Line Interface: Unlike Pooch, add CLI support
- Comprehensive Logging: Structured logging with levels
- Test Framework Integration: Native OCaml test support
Implementation Notes#
Authentication Implementation#
- Per-downloader auth configs: Each downloader gets its own auth settings
- External tools: wget/curl handle auth via command-line args (wget takes --user and --password; curl takes --user user:password)
- Pure OCaml: cohttp-eio uses Basic Auth headers
- SSH libraries: For SFTP (consider ssh or libssh bindings)
- Configuration: Cmdliner + environment variables per downloader type
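For the pure OCaml path, a Basic Auth header is just the Base64 encoding of "username:password" behind the word Basic. The sketch below builds that header as a name/value pair using a hand-written encoder so it stays dependency-free; in practice an existing Base64 library would likely be used instead, and the function names here are hypothetical.

```ocaml
(* Hypothetical sketch: build an HTTP Basic Auth header. *)
let alphabet =
  "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"

(* Standard Base64 with '=' padding. *)
let base64 s =
  let n = String.length s in
  let buf = Buffer.create ((n + 2) / 3 * 4) in
  (* Bytes past the end of the input count as zero. *)
  let byte i = if i < n then Char.code s.[i] else 0 in
  let rec go i =
    if i < n then begin
      let triple =
        (byte i lsl 16) lor (byte (i + 1) lsl 8) lor byte (i + 2)
      in
      let sextet k = alphabet.[(triple lsr (18 - (6 * k))) land 63] in
      Buffer.add_char buf (sextet 0);
      Buffer.add_char buf (sextet 1);
      Buffer.add_char buf (if i + 1 < n then sextet 2 else '=');
      Buffer.add_char buf (if i + 2 < n then sextet 3 else '=');
      go (i + 3)
    end
  in
  go 0;
  Buffer.contents buf

(* Header as a (name, value) pair, ready to add to a request. *)
let basic_auth_header ~username ~password =
  ("Authorization", "Basic " ^ base64 (username ^ ":" ^ password))
```

Unlike command-line flags, a header built in-process does not expose the password in the process list, which is one argument for the pure OCaml downloader.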
Registry Utilities#
- make_registry: Use Eio.Path for directory traversal
- Implement recursive hash computation
- Output in Pooch-compatible format
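A dependency-free approximation of make_registry is sketched below. It departs from the plan in two labeled ways: it walks the directory with the standard library's Sys module instead of Eio.Path, and it emits "md5:"-prefixed digests because MD5 is the only hash the standard library provides (a real version would presumably use digestif for SHA256). Function names are hypothetical.

```ocaml
(* Hypothetical sketch: generate Pooch-style registry lines for a directory. *)

(* List regular files under [root] as '/'-separated relative paths, sorted. *)
let rec files_under root rel =
  let dir = if rel = "" then root else Filename.concat root rel in
  Sys.readdir dir |> Array.to_list |> List.sort compare
  |> List.concat_map (fun name ->
         let rel' = if rel = "" then name else rel ^ "/" ^ name in
         if Sys.is_directory (Filename.concat root rel') then
           files_under root rel'
         else [ rel' ])

(* One "path md5:<hex>" line per file. *)
let make_registry root =
  files_under root ""
  |> List.map (fun rel ->
         let digest = Digest.to_hex (Digest.file (Filename.concat root rel)) in
         Printf.sprintf "%s md5:%s" rel digest)
  |> String.concat "\n"
```

Sorting the entries keeps the generated registry stable across runs, which makes diffs of registry files meaningful.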
Retry Mechanisms#
- Implement exponential backoff with jitter
- Configurable retry counts and timeouts
- Log retry attempts
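The retry policy above might be sketched as follows. The doubling schedule with a 1 s base and 10 s cap, the jitter range (between half and all of the capped delay), and the function names are assumptions. The sleep function is passed in so that callers can supply an Eio or Unix sleep, and so the logic can be tested without waiting.

```ocaml
(* Hypothetical sketch: exponential backoff with jitter and a retry loop. *)

(* Delay before retry number [attempt] (0-based). [jitter] is in [0, 1]
   and scales the capped delay to between 50% and 100% of its value. *)
let backoff_delay ?(base = 1.0) ?(cap = 10.0) ~jitter attempt =
  let exponential = base *. (2.0 ** float_of_int attempt) in
  let capped = Float.min cap exponential in
  capped *. (0.5 +. (0.5 *. jitter))

(* Run [f] until it succeeds or [max_attempts] have been made. *)
let retry ?(max_attempts = 3) ?(log = prerr_endline) ~sleep f =
  let rec loop attempt =
    match f () with
    | Ok _ as ok -> ok
    | Error msg when attempt + 1 >= max_attempts -> Error msg
    | Error msg ->
      let delay = backoff_delay ~jitter:(Random.float 1.0) attempt in
      log
        (Printf.sprintf "attempt %d failed (%s); retrying in %.1fs"
           (attempt + 1) msg delay);
      sleep delay;
      loop (attempt + 1)
  in
  loop 0
```

Jitter spreads out retries from concurrent downloads so they do not all hit the server again at the same instant.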
Archive Processing#
- Native OCaml implementations preferred over shell tools
- Consider camlzip, tar, decompress libraries
- Chain processors for complex formats
Progress Reporting#
- Enhance OCaml progress library integration
- Support custom progress callbacks
- Configurable output streams
Current Implementation Status#
Completed (as per CLAUDE.md)#
- Hash module design (SHA256, SHA1, MD5)
- Hash module implementation with all algorithms
- Registry module design with Pooch compatibility
- Cache module with XDG support
- Modular downloader interface
- Concurrent download design (fetch_all)
- External tool integration (wget/curl)
- DOI resolution library design
In Progress#
- Hash module implementation (completed with tests)
- Registry module implementation
- Cache module implementation
- External tool downloader implementations
- Test suite with tessera-manifests
Not Started#
- Authentication systems
- SFTP protocol support
- Advanced registry utilities
- Built-in archive processors
- CLI interface
- Pure OCaml HTTP downloader (cohttp-eio)
Features We Won't Implement#
Deliberately Excluded Features#
- FTP Protocol Support:
- Rare in modern usage, HTTPS/SFTP preferred
- Adds complexity without significant benefit
- Can still use via external tools if needed
- Windows-Specific Path Handling:
- Focus on Unix/macOS primarily
- Basic Windows support via Eio, not optimized
- Legacy Hash Algorithms (initially):
- MD4, SHA-0 - cryptographically broken
- Focus on SHA256/SHA1/MD5 for compatibility
- Complex Authentication Flows:
- OAuth2, JWT tokens, API keys with refresh
- Keep to Basic Auth for simplicity
- GUI Progress Bars:
- Terminal/CLI focused library
- Text-based progress reporting only
Unified Configuration Design#
Cmdliner + Environment Variable Integration#
module Config = struct
  (* Per-downloader authentication *)
  type auth = {
    username: string option;
    password: string option;
  }

  (* Global application configuration *)
  type t = {
    base_url: string;
    cache_dir: string;
    downloader: [`Wget | `Curl | `Cohttp | `Auto];
    retry_count: int;
    timeout: float;
    (* Auth configurations per downloader type *)
    wget_auth: auth;
    curl_auth: auth;
    cohttp_auth: auth;
  }

  (* Cmdliner terms with environment fallbacks *)
  let base_url_term =
    let doc = "Base URL for downloads" in
    let env = Cmdliner.Arg.env_var "TORU_BASE_URL" in
    Cmdliner.Arg.(required & opt (some string) None &
                  info ["base-url"; "u"] ~env ~doc)

  let cache_dir_term =
    let doc = "Cache directory path" in
    let env = Cmdliner.Arg.env_var "TORU_CACHE_DIR" in
    let default = Cache.default_path ~app_name:"toru" () in
    Cmdliner.Arg.(value & opt string default &
                  info ["cache-dir"; "c"] ~env ~doc)

  (* Per-downloader auth terms *)
  let wget_auth_terms =
    let username_term =
      let doc = "Wget authentication username" in
      let env = Cmdliner.Arg.env_var "TORU_WGET_USERNAME" in
      Cmdliner.Arg.(value & opt (some string) None &
                    info ["wget-username"] ~env ~doc) in
    let password_term =
      let doc = "Wget authentication password" in
      let env = Cmdliner.Arg.env_var "TORU_WGET_PASSWORD" in
      Cmdliner.Arg.(value & opt (some string) None &
                    info ["wget-password"] ~env ~doc) in
    (username_term, password_term)

  let curl_auth_terms =
    let username_term =
      let doc = "Curl authentication username" in
      let env = Cmdliner.Arg.env_var "TORU_CURL_USERNAME" in
      Cmdliner.Arg.(value & opt (some string) None &
                    info ["curl-username"] ~env ~doc) in
    let password_term =
      let doc = "Curl authentication password" in
      let env = Cmdliner.Arg.env_var "TORU_CURL_PASSWORD" in
      Cmdliner.Arg.(value & opt (some string) None &
                    info ["curl-password"] ~env ~doc) in
    (username_term, password_term)

  let cohttp_auth_terms =
    let username_term =
      let doc = "Cohttp authentication username" in
      let env = Cmdliner.Arg.env_var "TORU_COHTTP_USERNAME" in
      Cmdliner.Arg.(value & opt (some string) None &
                    info ["cohttp-username"] ~env ~doc) in
    let password_term =
      let doc = "Cohttp authentication password" in
      let env = Cmdliner.Arg.env_var "TORU_COHTTP_PASSWORD" in
      Cmdliner.Arg.(value & opt (some string) None &
                    info ["cohttp-password"] ~env ~doc) in
    (username_term, password_term)

  let downloader_term =
    let doc = "Download tool: wget, curl, cohttp, auto" in
    let env = Cmdliner.Arg.env_var "TORU_DOWNLOADER" in
    Cmdliner.Arg.(value & opt (enum [
        ("wget", `Wget); ("curl", `Curl);
        ("cohttp", `Cohttp); ("auto", `Auto)
      ]) `Auto & info ["downloader"; "d"] ~env ~doc)

  let retry_count_term =
    let doc = "Number of download retries" in
    let env = Cmdliner.Arg.env_var "TORU_RETRY_COUNT" in
    Cmdliner.Arg.(value & opt int 3 &
                  info ["retries"; "r"] ~env ~doc)

  let timeout_term =
    let doc = "Download timeout in seconds" in
    let env = Cmdliner.Arg.env_var "TORU_TIMEOUT" in
    Cmdliner.Arg.(value & opt float 300.0 &
                  info ["timeout"; "t"] ~env ~doc)

  let config_term =
    let combine base_url cache_dir downloader retries timeout
        wget_user wget_pass curl_user curl_pass cohttp_user cohttp_pass =
      let wget_auth = { username = wget_user; password = wget_pass } in
      let curl_auth = { username = curl_user; password = curl_pass } in
      let cohttp_auth = { username = cohttp_user; password = cohttp_pass } in
      { base_url; cache_dir; downloader; retry_count = retries; timeout;
        wget_auth; curl_auth; cohttp_auth }
    in
    let (wget_user_term, wget_pass_term) = wget_auth_terms in
    let (curl_user_term, curl_pass_term) = curl_auth_terms in
    let (cohttp_user_term, cohttp_pass_term) = cohttp_auth_terms in
    Cmdliner.Term.(const combine $ base_url_term $ cache_dir_term $
                   downloader_term $ retry_count_term $ timeout_term $
                   wget_user_term $ wget_pass_term $
                   curl_user_term $ curl_pass_term $
                   cohttp_user_term $ cohttp_pass_term)

  (* Helper to get auth for specific downloader *)
  let get_auth config = function
    | `Wget -> config.wget_auth
    | `Curl -> config.curl_auth
    | `Cohttp -> config.cohttp_auth
    | `Auto -> { username = None; password = None } (* No auth for auto-detect *)
end
Updated Downloader Interface#
module type DOWNLOADER = sig
  type t

  val create : sw:Eio.Switch.t -> env:Eio_unix.Stdenv.base ->
    ?auth:Config.auth -> unit -> t

  val download : t ->
    url:string ->
    dest:Eio.Fs.dir_ty Eio.Path.t ->
    ?hash:Hash.t ->
    ?progress:Progress_reporter.t ->
    ?resume:bool ->
    unit -> (unit, string) result

  val supports_resume : t -> bool
  val name : t -> string
end

module Wget_downloader : DOWNLOADER = struct
  type t = {
    sw : Eio.Switch.t;
    env : Eio_unix.Stdenv.base;
    auth : Config.auth option;
    timeout : float;
  }

  let create ~sw ~env ?auth () =
    { sw; env; auth; timeout = 300.0 }

  let download t ~url ~dest ?hash ?progress ?(resume=true) () =
    (* Record labels are qualified because auth is defined in Config *)
    let auth_args = match t.auth with
      | Some { Config.username = Some u; password = Some p } ->
        ["--user=" ^ u; "--password=" ^ p]
      | Some { Config.username = Some u; password = None } ->
        ["--user=" ^ u]
      | _ -> [] in
    (* Use the configured timeout rather than a hard-coded 300 *)
    let args = ["--quiet"; "--show-progress";
                Printf.sprintf "--timeout=%.0f" t.timeout] @
               auth_args @ ["--output-document=" ^ (Eio.Path.native_exn dest)] in
    (* ... rest of implementation *)
end
Environment Variable Standards#
| Variable | Purpose | Example |
|---|---|---|
| TORU_BASE_URL | Default download base URL | https://data.example.com/ |
| TORU_CACHE_DIR | Override cache location | /custom/cache/path |
| TORU_WGET_USERNAME | Wget HTTP Basic Auth username | myuser |
| TORU_WGET_PASSWORD | Wget HTTP Basic Auth password | secret123 |
| TORU_CURL_USERNAME | Curl HTTP Basic Auth username | myuser |
| TORU_CURL_PASSWORD | Curl HTTP Basic Auth password | secret123 |
| TORU_COHTTP_USERNAME | Cohttp HTTP Basic Auth username | myuser |
| TORU_COHTTP_PASSWORD | Cohttp HTTP Basic Auth password | secret123 |
| TORU_DOWNLOADER | Preferred download tool | wget, curl, cohttp, auto |
| TORU_RETRY_COUNT | Download retry attempts | 5 |
| TORU_TIMEOUT | Download timeout (seconds) | 600.0 |
| TORU_REGISTRY_URL | Registry file URL override | https://example.com/registry.txt |
Benefits of Per-Downloader Auth#
- Flexibility: Different auth for different downloaders
- Tool-Specific: Wget vs Curl may need different credentials
- Migration Path: Smooth transition from external tools to pure OCaml
- Security: Auth only passed to specific downloader implementations
- Testing: Easy to test each downloader with different auth configurations
Usage Examples#
# Different auth per downloader
export TORU_WGET_USERNAME=wget_user
export TORU_WGET_PASSWORD=wget_pass
export TORU_CURL_USERNAME=curl_user
export TORU_CURL_PASSWORD=curl_pass
toru fetch --downloader wget data.csv # uses wget auth
toru fetch --downloader curl data.csv # uses curl auth
# CLI override for specific downloader
toru fetch --downloader cohttp --cohttp-username myuser data.csv
# Auto-detection with fallback auth
export TORU_WGET_USERNAME=backup_user
toru fetch --downloader auto data.csv # will use wget if available, with auth