My agentic slop goes here. Not intended for anyone else!

Toru Implementation TODO#

This document outlines the implementation plan for Toru, an OCaml data repository manager compatible with Python Pooch registry files.

Phase 1: Core Modules#

1.1 Hash Module ✨#

  • Define abstract Hash.t type with algorithm variants (SHA256, SHA1, MD5)
  • Implement create, of_string, to_string functions
  • Add algorithm parsing with prefix support ("sha1:", "md5:", plain)
  • Implement file verification using digestif library
  • Add hash computation for files
  • Create comprehensive test suite with known hash values
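
A minimal sketch of the shape this could take (type layout and function names are illustrative, not final; verification is shown over in-memory contents rather than a file handle):

(* Sketch only: illustrative shape of the Hash module. *)
type algorithm = SHA256 | SHA1 | MD5
type t = { algorithm : algorithm; hex : string }

(* Parse "sha256:...", "sha1:...", "md5:..." or a bare digest
   (assumed SHA256, as Pooch does). *)
let of_string s =
  match String.index_opt s ':' with
  | None -> Ok { algorithm = SHA256; hex = String.lowercase_ascii s }
  | Some i ->
    let prefix = String.lowercase_ascii (String.sub s 0 i) in
    let hex = String.sub s (i + 1) (String.length s - i - 1) in
    (match prefix with
     | "sha256" -> Ok { algorithm = SHA256; hex }
     | "sha1" -> Ok { algorithm = SHA1; hex }
     | "md5" -> Ok { algorithm = MD5; hex }
     | other -> Error ("unknown hash algorithm: " ^ other))

(* Compute the digest with digestif and compare against the expected hex. *)
let verify t ~contents =
  let computed =
    match t.algorithm with
    | SHA256 -> Digestif.SHA256.(to_hex (digest_string contents))
    | SHA1 -> Digestif.SHA1.(to_hex (digest_string contents))
    | MD5 -> Digestif.MD5.(to_hex (digest_string contents))
  in
  if String.equal computed t.hex then Ok () else Error "hash mismatch"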

1.2 Registry Module 📋#

  • Design abstract Registry.t and Registry.entry types
  • Implement Pooch-compatible file parser (comments, blank lines)
  • Add entry creation with filename, hash, optional custom URL
  • Implement registry operations (find, exists, add, remove)
  • Support loading from files and URLs
  • Add registry serialization (to_string, save)
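
A rough sketch of the line-oriented parser, assuming the Pooch convention of one "filename hash [url]" triple per line with # comments (the entry record is illustrative, not the final abstract type):

(* Sketch: parsing a Pooch-style registry file. *)
type entry = { filename : string; hash : string; url : string option }

let parse_line line =
  let line = String.trim line in
  if line = "" || line.[0] = '#' then None
  else
    match String.split_on_char ' ' line |> List.filter (fun s -> s <> "") with
    | [ filename; hash ] -> Some { filename; hash; url = None }
    | [ filename; hash; url ] -> Some { filename; hash; url = Some url }
    | _ -> None  (* malformed line; the real parser should report an error *)

let parse contents =
  String.split_on_char '\n' contents |> List.filter_map parse_line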

1.3 Cache Module 💾#

  • Create abstract Cache.t type with base path management
  • Implement XDG Base Directory specification
  • Add version subdirectory support
  • Implement cache operations (exists, clear, size, list)
  • Add lazy directory creation
  • Support environment variable overrides (TORU_CACHE_DIR)
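
A hand-rolled sketch of the path resolution order (TORU_CACHE_DIR, then XDG_CACHE_HOME, then ~/.cache); in the plan this would be delegated to xdg-eio rather than written by hand:

(* Sketch: resolve the cache directory with env overrides and an optional
   version subdirectory. *)
let default_cache_dir ?version ~app_name () =
  let base =
    match Sys.getenv_opt "TORU_CACHE_DIR" with
    | Some dir when dir <> "" -> dir
    | _ ->
      let xdg =
        match Sys.getenv_opt "XDG_CACHE_HOME" with
        | Some dir when dir <> "" -> dir
        | _ ->
          let home = Option.value (Sys.getenv_opt "HOME") ~default:"." in
          Filename.concat home ".cache"
      in
      Filename.concat xdg app_name
  in
  match version with Some v -> Filename.concat base v | None -> base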

Phase 2: External Tool Integration#

2.1 Modular Downloader Interface 🔌#

  • Define DOWNLOADER module signature
  • Create abstract Downloader.t type with module wrapping
  • Implement tool detection and availability checking
  • Add downloader selection (wget, curl, auto-detect)
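
A possible shape for the auto-detection step, probing PATH with command -v via Sys.command for brevity (a real implementation would go through Eio.Process; tool_available and select_downloader are illustrative names):

(* Sketch: detect available external tools and resolve `Auto. *)
let tool_available name =
  Sys.command (Printf.sprintf "command -v %s > /dev/null 2>&1" name) = 0

let select_downloader = function
  | (`Wget | `Curl | `Cohttp) as explicit -> Ok explicit
  | `Auto ->
    if tool_available "wget" then Ok `Wget
    else if tool_available "curl" then Ok `Curl
    else Error "neither wget nor curl found on PATH"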

2.2 Wget Downloader Implementation 📥#

  • Implement Wget_downloader module with DOWNLOADER interface
  • Add resume support with --continue flag
  • Handle timeout, retry, and quiet options
  • Implement hash verification after download
  • Add comprehensive error handling with exit codes

2.3 Curl Downloader Implementation 📦#

  • Implement Curl_downloader module with DOWNLOADER interface
  • Add resume support with --continue-at - flag
  • Configure timeout, retry, and progress options
  • Implement hash verification after download
  • Handle various curl error conditions

Phase 3: Main Interface#

3.1 Toru Module Core 🎯#

  • Design abstract Toru.t type with accessor functions
  • Implement constructor with registry loading
  • Add base_url, cache, and registry accessors
  • Create single file fetch functionality
  • Implement processor pipeline for post-download transformations

3.2 Concurrent Operations ⚡#

  • Implement fetch_all with configurable concurrency
  • Add Eio fiber-based parallel downloads
  • Implement progress reporting integration
  • Add error aggregation for batch operations
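
A sketch of what fetch_all could look like, assuming a single-file fetch : t -> string -> (string, string) result from 3.1 and the max_fibers bound on Eio.Fiber.List.map; failures are collected per file rather than aborting the batch:

(* Sketch: bounded-parallelism batch fetch with per-file error aggregation. *)
let fetch_all ?(max_fibers = 4) t filenames =
  Eio.Fiber.List.map ~max_fibers
    (fun name ->
      match fetch t name with
      | Ok path -> Ok (name, path)
      | Error e -> Error (name, e))
    filenames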

3.3 Static Utilities 🛠️#

  • Implement standalone retrieve function
  • Add registry manipulation functions
  • Support base URL updates
  • Create convenience functions for common use cases
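
A possible signature for the standalone retrieve, loosely mirroring pooch.retrieve (parameter names are illustrative, not final):

(* Sketch: one-off download into the cache without constructing a Toru.t. *)
val retrieve :
  url:string ->
  ?known_hash:Hash.t ->
  ?fname:string ->
  ?path:string ->
  ?downloader:[ `Wget | `Curl | `Auto ] ->
  unit ->
  (string, string) result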

Phase 4: Testing & Validation#

4.1 Tessera-Manifests Integration 🧪#

  • Set up test fixtures with tessera-manifests URLs
  • Test embeddings registry parsing (2024 data)
  • Validate landmasks registry parsing
  • Test geographic coordinate extraction
  • Performance test with large manifests (>100 entries)

4.2 Hash Algorithm Tests 🔐#

  • Test SHA256, SHA1, MD5 verification with known files
  • Validate prefix parsing ("sha1:abc123", "md5:def456")
  • Test hash computation accuracy
  • Error handling for invalid hash formats
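
For example, a known-value test could look like the following, assuming Alcotest as the test framework (not in the dependency list) and the standard "abc" / empty-string test vectors:

(* Sketch: known-value digest tests against digestif. *)
let test_sha256_abc () =
  Alcotest.(check string) "sha256 of abc"
    "ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad"
    Digestif.SHA256.(to_hex (digest_string "abc"))

let test_md5_empty () =
  Alcotest.(check string) "md5 of empty string"
    "d41d8cd98f00b204e9800998ecf8427e"
    Digestif.MD5.(to_hex (digest_string ""))

let () =
  Alcotest.run "toru-hash"
    [ ("known-values",
       [ Alcotest.test_case "sha256 abc" `Quick test_sha256_abc;
         Alcotest.test_case "md5 empty" `Quick test_md5_empty ]) ]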

4.3 Download Integration Tests 📡#

  • Test wget downloader with real tessera data
  • Test curl downloader with resume functionality
  • Validate hash verification after downloads
  • Test error handling for network failures

Phase 5: CLI & User Experience#

5.1 Command Line Interface 💻#

  • Integrate cmdliner for argument parsing
  • Add downloader selection (--downloader wget|curl|auto)
  • Implement cache path configuration
  • Add verbose/quiet mode options

5.2 Progress Reporting 📊#

  • Integrate OCaml progress library
  • Show download speed and ETA
  • Support multiple concurrent progress bars
  • Add file name and size information

5.3 Archive Processing 📁#

  • Implement untar_gz processor using system tar
  • Add unzip processor using system unzip
  • Support untar_xz with tar -xJf
  • Create custom processor interface
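
A sketch of the processor shape, shelling out via Sys.command for brevity (Eio.Process would be used in practice); the processor type and untar_gz are illustrative names:

(* Sketch: a processor maps the downloaded path to the path callers use. *)
type processor = string -> (string, string) result

let untar_gz : processor = fun archive ->
  (* "data.tar.gz" -> extraction directory "data" *)
  let dest = Filename.remove_extension (Filename.remove_extension archive) in
  let cmd =
    Printf.sprintf "mkdir -p %s && tar -xzf %s -C %s"
      (Filename.quote dest) (Filename.quote archive) (Filename.quote dest)
  in
  if Sys.command cmd = 0 then Ok dest
  else Error ("tar failed for " ^ archive)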

Phase 6: Future Extensions#

6.1 Pure OCaml Implementation 🐪#

  • Implement Cohttp_downloader module
  • Add streaming download support
  • Implement HTTP Range requests for resume
  • TLS support with tls-eio
  • Migrate from external tools gradually

6.2 DOI Resolution (Toru-DOI) 📚#

  • Create separate toru-doi library
  • Implement Zenodo API integration
  • Add Figshare API support
  • DOI to registry conversion
  • Metadata caching and rate limiting

6.3 Advanced Features 🚀#

  • FTP protocol support
  • Authentication mechanisms (API keys, tokens)
  • Checksum verification during download
  • Partial download recovery
  • Registry merging and diff operations

Dependencies#

Core Dependencies#

  • eio (>= 1.0) - Effects-based I/O
  • digestif (>= 1.0) - Cryptographic hashes
  • uri - URL parsing
  • cmdliner - CLI parsing

System Dependencies#

  • wget or curl - Download tools (one required)

Optional Dependencies#

  • progress - Progress bars
  • yojson - JSON configuration
  • tar, unzip - Archive processing

Success Criteria#

Phase 1 Success ✅#

  • All core modules pass unit tests
  • Hash verification works with digestif
  • Registry parsing handles tessera-manifests correctly
  • Cache follows XDG directory specification

Phase 2 Success ✅#

  • Both wget and curl downloaders work
  • Resume functionality tested with interrupted downloads
  • Automatic tool detection and fallback
  • Hash verification after external tool downloads

Phase 3 Success ✅#

  • Full tessera-manifests integration test passes
  • Concurrent downloads work without conflicts
  • Single-file fetch and batch fetch both functional
  • Processor pipeline handles archives correctly

Final Success ✅#

  • Complete tessera geospatial data download workflow
  • CLI tool usable for real data management
  • Documentation and examples complete
  • Performance acceptable for large datasets (GB scale)

XDG Integration Notes#

The current TODO includes XDG Base Directory specification support through the xdg-eio library. This provides:

  • Automatic XDG cache directory detection
  • Cross-platform path handling (Unix, macOS, Windows)
  • Environment variable overrides (XDG_CACHE_HOME, etc.)
  • Pretty-printing for user-friendly directory display

This TODO represents approximately 6-8 weeks of development work, focusing on robust external tool integration before migrating to a pure OCaml implementation.

Additional Features for Pooch Compatibility#

1. Authentication Support#

  • HTTP Authentication: Basic auth via username/password
  • FTP Authentication: Username/password credentials (moved to "Won't Implement")
  • SFTP Authentication: SSH-based secure file transfer
  • Environment Variable Patterns: Standardized env var support for credentials

2. Additional Download Protocols#

  • SFTP: Secure file transfer (requires SSH libraries)
  • DataVerse DOI Support: Beyond Zenodo/Figshare (added in Pooch v1.7.0)

3. Registry Management Utilities#

  • make_registry equivalent: Auto-generate registry files from directories
  • DOI-based Registry Loading: load_registry_from_doi() functionality
  • Recursive Directory Scanning: For registry generation

4. Advanced Download Features#

  • Retry Mechanisms: Exponential backoff (1s → 10s max)
  • Temporary File Handling: Atomic file replacement during downloads
  • Hash Algorithm Flexibility: MD5/SHA1 support (implemented with automatic detection)
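
A sketch of the atomic-replacement idea: download into a temporary file in the destination directory, then rename into place on success so a partial file never appears under the final name (with_atomic_replace is an illustrative name):

(* Sketch: run [f] against a temp path, then atomically rename on success. *)
let with_atomic_replace ~dest f =
  let dir = Filename.dirname dest in
  let tmp = Filename.temp_file ~temp_dir:dir "toru-" ".part" in
  match f tmp with
  | Ok () -> Sys.rename tmp dest; Ok ()
  | Error _ as e ->
    (try Sys.remove tmp with Sys_error _ -> ());
    e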

5. Processing/Archive Support#

  • Built-in Processors:
    • Unzip processor for ZIP files
    • Untar processor for TAR archives
    • Decompress processor for compressed files
  • Processor Chaining: Sequential processing pipeline
  • Archive-Specific Handling: Beyond basic shell tool integration

6. Progress Reporting Integration#

  • Multiple Progress Libraries: tqdm integration vs OCaml progress
  • Progress Bar Customization: Custom progress objects
  • Stderr Output Control: Configurable progress display

7. Utilities and Helper Functions#

  • Version Compatibility Checking: check_version() utility
  • Logging Integration: Built-in logging support with levels
  • Test Runner: pooch.test() functionality

8. Environment Variable Standards#

  • XDG Compliance: Full XDG Base Directory specification
  • Platform-Specific Defaults: Windows %LOCALAPPDATA% patterns
  • Standardized Override Patterns: Consistent env var naming

9. Registry Format Features#

  • Comment Support: Lines starting with # (already planned)
  • Multiple Hash Formats: SHA256, SHA1, MD5 with automatic detection
  • Registry Validation: Built-in format checking

10. API Design Differences#

  • Static Methods: pooch.retrieve() for one-off downloads (already planned)
  • Factory Functions: pooch.create() vs constructor patterns
  • Callable Downloaders: Function-based custom downloaders vs module system

Features Where Toru Has Advantages#

1. Concurrency#

  • Parallel Downloads: fetch_all with configurable concurrency
  • Eio-based Async: Modern effects-based concurrency

2. Type Safety#

  • OCaml Type System: Compile-time error prevention
  • Result Types: Explicit error handling vs exceptions

3. Modular Architecture#

  • Downloader Modules: Clean module interface vs callable objects
  • External Tool Integration: wget/curl with migration path to pure OCaml

4. Performance Path#

  • Migration Strategy: External tools → pure OCaml implementation
  • Resume Support: Via wget/curl initially, then native implementation

Implementation Priorities#

Priority 1 (Core Compatibility)#

  1. Add Authentication Support: HTTP Basic, environment variables (FTP removed)
  2. Implement make_registry: Directory scanning utility
  3. Add More Hash Algorithms: SHA1, MD5 support (completed)
  4. Enhance Progress Reporting: Better integration with OCaml ecosystem
  5. Unified Configuration: Cmdliner + environment variable integration

Priority 2 (Advanced Features)#

  1. SFTP Protocol: Using OCaml SSH libraries
  2. Retry Mechanisms: Exponential backoff implementation
  3. Built-in Processors: Native OCaml archive handling
  4. DataVerse DOI Support: Extend DOI resolver

Priority 3 (Ecosystem Integration)#

  1. Command Line Interface: Add CLI support (which Pooch does not provide)
  2. Comprehensive Logging: Structured logging with levels
  3. Test Framework Integration: Native OCaml test support

Implementation Notes#

Authentication Implementation#

  • Per-downloader auth configs: Each downloader gets its own auth settings
  • External tools: wget/curl handle auth via command-line args (--user, --password)
  • Pure OCaml: cohttp-eio uses Basic Auth headers
  • SSH libraries: For SFTP (consider ssh or libssh bindings)
  • Configuration: Cmdliner + environment variables per downloader type

Registry Utilities#

  • make_registry: Use Eio.Path for directory traversal
  • Implement recursive hash computation
  • Output in Pooch-compatible format
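
A sketch of the idea using plain Sys/In_channel for brevity (the plan above calls for Eio.Path traversal instead); paths are emitted as given rather than relativised against the root:

(* Sketch: recursive scan, hash each file, emit "path sha256-hex" lines. *)
let rec walk dir =
  Sys.readdir dir
  |> Array.to_list
  |> List.concat_map (fun name ->
       let path = Filename.concat dir name in
       if Sys.is_directory path then walk path else [ path ])

let make_registry root =
  walk root
  |> List.sort String.compare
  |> List.map (fun path ->
       let contents = In_channel.with_open_bin path In_channel.input_all in
       let digest = Digestif.SHA256.(to_hex (digest_string contents)) in
       Printf.sprintf "%s %s" path digest)
  |> String.concat "\n"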

Retry Mechanisms#

  • Implement exponential backoff with jitter
  • Configurable retry counts and timeouts
  • Log retry attempts
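
A sketch of the backoff loop matching the 1s → 10s range mentioned earlier, using Unix.sleepf for brevity where the Eio clock would be used in practice:

(* Sketch: retry [f] with exponential backoff and jitter. *)
let rec with_retries ?(attempt = 1) ~max_attempts f =
  match f () with
  | Ok _ as ok -> ok
  | Error _ as err when attempt >= max_attempts -> err
  | Error _ ->
    (* delays 1s, 2s, 4s, ... capped at 10s, plus up to 50% random jitter *)
    let base = Float.min 10.0 (Float.pow 2.0 (float_of_int (attempt - 1))) in
    Unix.sleepf (base +. Random.float (base /. 2.0));
    with_retries ~attempt:(attempt + 1) ~max_attempts f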

Archive Processing#

  • Native OCaml implementations preferred over shell tools
  • Consider camlzip, tar, decompress libraries
  • Chain processors for complex formats

Progress Reporting#

  • Enhance OCaml progress library integration
  • Support custom progress callbacks
  • Configurable output streams

Current Implementation Status#

Completed (as per CLAUDE.md)#

  • Hash module design (SHA256, SHA1, MD5)
  • Hash module implementation with all algorithms and tests
  • Registry module design with Pooch compatibility
  • Cache module with XDG support
  • Modular downloader interface
  • Concurrent download design (fetch_all)
  • External tool integration (wget/curl)
  • DOI resolution library design

In Progress#

  • Registry module implementation
  • Cache module implementation
  • External tool downloader implementations
  • Test suite with tessera-manifests

Not Started#

  • Authentication systems
  • SFTP protocol support
  • Advanced registry utilities
  • Built-in archive processors
  • CLI interface
  • Pure OCaml HTTP downloader (cohttp-eio)

Features We Won't Implement#

Deliberately Excluded Features#

  • FTP Protocol Support:
    • Rare in modern usage, HTTPS/SFTP preferred
    • Adds complexity without significant benefit
    • Can still use via external tools if needed
  • Windows-Specific Path Handling:
    • Focus on Unix/macOS primarily
    • Basic Windows support via Eio, not optimized
  • Legacy Hash Algorithms (initially):
    • MD4, SHA-0 - cryptographically broken
    • Focus on SHA256/SHA1/MD5 for compatibility
  • Complex Authentication Flows:
    • OAuth2, JWT tokens, API keys with refresh
    • Keep to Basic Auth for simplicity
  • GUI Progress Bars:
    • Terminal/CLI focused library
    • Text-based progress reporting only

Unified Configuration Design#

Cmdliner + Environment Variable Integration#

module Config = struct
  (* Per-downloader authentication *)
  type auth = {
    username: string option;
    password: string option;
  }
  
  (* Global application configuration *)
  type t = {
    base_url: string;
    cache_dir: string;
    downloader: [`Wget | `Curl | `Cohttp | `Auto];
    retry_count: int;
    timeout: float;
    (* Auth configurations per downloader type *)
    wget_auth: auth;
    curl_auth: auth;
    cohttp_auth: auth;
  }
  
  (* Cmdliner terms with environment fallbacks *)
  let base_url_term =
    let doc = "Base URL for downloads" in
    let env = Cmdliner.Arg.env_var "TORU_BASE_URL" in
    Cmdliner.Arg.(required & opt (some string) None & 
                  info ["base-url"; "u"] ~env ~doc)
  
  let cache_dir_term = 
    let doc = "Cache directory path" in
    let env = Cmdliner.Arg.env_var "TORU_CACHE_DIR" in
    let default = Cache.default_path ~app_name:"toru" () in
    Cmdliner.Arg.(value & opt string default &
                  info ["cache-dir"; "c"] ~env ~doc)
  
  (* Per-downloader auth terms *)
  let wget_auth_terms =
    let username_term =
      let doc = "Wget authentication username" in
      let env = Cmdliner.Arg.env_var "TORU_WGET_USERNAME" in
      Cmdliner.Arg.(value & opt (some string) None &
                    info ["wget-username"] ~env ~doc) in
    let password_term =
      let doc = "Wget authentication password" in  
      let env = Cmdliner.Arg.env_var "TORU_WGET_PASSWORD" in
      Cmdliner.Arg.(value & opt (some string) None &
                    info ["wget-password"] ~env ~doc) in
    (username_term, password_term)
                    
  let curl_auth_terms =
    let username_term =
      let doc = "Curl authentication username" in
      let env = Cmdliner.Arg.env_var "TORU_CURL_USERNAME" in
      Cmdliner.Arg.(value & opt (some string) None &
                    info ["curl-username"] ~env ~doc) in
    let password_term =
      let doc = "Curl authentication password" in
      let env = Cmdliner.Arg.env_var "TORU_CURL_PASSWORD" in
      Cmdliner.Arg.(value & opt (some string) None &
                    info ["curl-password"] ~env ~doc) in
    (username_term, password_term)
                    
  let cohttp_auth_terms =
    let username_term =
      let doc = "Cohttp authentication username" in
      let env = Cmdliner.Arg.env_var "TORU_COHTTP_USERNAME" in
      Cmdliner.Arg.(value & opt (some string) None &
                    info ["cohttp-username"] ~env ~doc) in
    let password_term =
      let doc = "Cohttp authentication password" in
      let env = Cmdliner.Arg.env_var "TORU_COHTTP_PASSWORD" in
      Cmdliner.Arg.(value & opt (some string) None &
                    info ["cohttp-password"] ~env ~doc) in
    (username_term, password_term)
  
  let downloader_term =
    let doc = "Download tool: wget, curl, cohttp, auto" in
    let env = Cmdliner.Arg.env_var "TORU_DOWNLOADER" in
    Cmdliner.Arg.(value & opt (enum [
      ("wget", `Wget); ("curl", `Curl); 
      ("cohttp", `Cohttp); ("auto", `Auto)
    ]) `Auto & info ["downloader"; "d"] ~env ~doc)
  
  let retry_count_term =
    let doc = "Number of download retries" in
    let env = Cmdliner.Arg.env_var "TORU_RETRY_COUNT" in
    Cmdliner.Arg.(value & opt int 3 &
                  info ["retries"; "r"] ~env ~doc)
  
  let timeout_term =
    let doc = "Download timeout in seconds" in
    let env = Cmdliner.Arg.env_var "TORU_TIMEOUT" in
    Cmdliner.Arg.(value & opt float 300.0 &
                  info ["timeout"; "t"] ~env ~doc)
  
  let config_term =
    let combine base_url cache_dir downloader retries timeout 
                wget_user wget_pass curl_user curl_pass cohttp_user cohttp_pass =
      let wget_auth = { username = wget_user; password = wget_pass } in
      let curl_auth = { username = curl_user; password = curl_pass } in
      let cohttp_auth = { username = cohttp_user; password = cohttp_pass } in
      { base_url; cache_dir; downloader; retry_count = retries; timeout;
        wget_auth; curl_auth; cohttp_auth }
    in
    let (wget_user_term, wget_pass_term) = wget_auth_terms in
    let (curl_user_term, curl_pass_term) = curl_auth_terms in
    let (cohttp_user_term, cohttp_pass_term) = cohttp_auth_terms in
    Cmdliner.Term.(const combine $ base_url_term $ cache_dir_term $ 
                   downloader_term $ retry_count_term $ timeout_term $
                   wget_user_term $ wget_pass_term $
                   curl_user_term $ curl_pass_term $
                   cohttp_user_term $ cohttp_pass_term)
                   
  (* Helper to get auth for specific downloader *)
  let get_auth config = function
    | `Wget -> config.wget_auth
    | `Curl -> config.curl_auth
    | `Cohttp -> config.cohttp_auth
    | `Auto -> { username = None; password = None } (* No auth for auto-detect *)
end

Updated Downloader Interface#

module type DOWNLOADER = sig
  type t
  
  val create : sw:Eio.Switch.t -> env:Eio_unix.Stdenv.base -> 
               ?auth:Config.auth -> unit -> t
  
  val download : t ->
    url:string ->
    dest:Eio.Fs.dir_ty Eio.Path.t ->
    ?hash:Hash.t ->
    ?progress:Progress_reporter.t ->
    ?resume:bool ->
    unit -> (unit, string) result
    
  val supports_resume : t -> bool
  val name : t -> string
end

module Wget_downloader : DOWNLOADER = struct
  type t = {
    sw : Eio.Switch.t;
    env : Eio_unix.Stdenv.base;
    auth : Config.auth option;
    timeout : float;
  }
  
  let create ~sw ~env ?auth () = 
    { sw; env; auth; timeout = 300.0 }
  
  let download t ~url ~dest ?hash ?progress ?(resume = true) () =
    let auth_args = match t.auth with
      | Some { username = Some u; password = Some p } ->
          ["--user=" ^ u; "--password=" ^ p]
      | Some { username = Some u; password = None } ->
          ["--user=" ^ u]
      | _ -> [] in
    let resume_args = if resume then ["--continue"] else [] in
    let args = ["--quiet"; "--show-progress";
                Printf.sprintf "--timeout=%.0f" t.timeout] @
               resume_args @ auth_args @
               ["--output-document=" ^ Eio.Path.native_exn dest; url] in
    ignore (args, hash, progress);
    (* ... rest of implementation: spawn wget with [args] via Eio.Process,
       then verify [hash] against the downloaded file *)
    Error "not implemented"

  let supports_resume _ = true
  let name _ = "wget"
end

Environment Variable Standards#

Variable                Purpose                             Example
TORU_BASE_URL           Default download base URL           https://data.example.com/
TORU_CACHE_DIR          Override cache location             /custom/cache/path
TORU_WGET_USERNAME      Wget HTTP Basic Auth username       myuser
TORU_WGET_PASSWORD      Wget HTTP Basic Auth password       secret123
TORU_CURL_USERNAME      Curl HTTP Basic Auth username       myuser
TORU_CURL_PASSWORD      Curl HTTP Basic Auth password       secret123
TORU_COHTTP_USERNAME    Cohttp HTTP Basic Auth username     myuser
TORU_COHTTP_PASSWORD    Cohttp HTTP Basic Auth password     secret123
TORU_DOWNLOADER         Preferred download tool             wget, curl, cohttp, auto
TORU_RETRY_COUNT        Download retry attempts             5
TORU_TIMEOUT            Download timeout (seconds)          600.0
TORU_REGISTRY_URL       Registry file URL override          https://example.com/registry.txt

Benefits of Per-Downloader Auth#

  1. Flexibility: Different auth for different downloaders
  2. Tool-Specific: Wget vs Curl may need different credentials
  3. Migration Path: Smooth transition from external tools to pure OCaml
  4. Security: Auth only passed to specific downloader implementations
  5. Testing: Easy to test each downloader with different auth configurations

Usage Examples#

# Different auth per downloader
export TORU_WGET_USERNAME=wget_user
export TORU_WGET_PASSWORD=wget_pass
export TORU_CURL_USERNAME=curl_user  
export TORU_CURL_PASSWORD=curl_pass
toru fetch --downloader wget data.csv    # uses wget auth
toru fetch --downloader curl data.csv    # uses curl auth

# CLI override for specific downloader
toru fetch --downloader cohttp --cohttp-username myuser data.csv

# Auto-detection with fallback auth
export TORU_WGET_USERNAME=backup_user
toru fetch --downloader auto data.csv    # will use wget if available, with auth