# Toru Implementation TODO

This document outlines the implementation plan for Toru, an OCaml data repository manager compatible with Python Pooch registry files.

## Phase 1: Core Modules

### 1.1 Hash Module ✨
- [ ] Define abstract `Hash.t` type with algorithm variants (SHA256, SHA1, MD5)
- [ ] Implement `create`, `of_string`, `to_string` functions
- [ ] Add algorithm parsing with prefix support ("sha1:", "md5:", plain)
- [ ] Implement file verification using digestif library
- [ ] Add hash computation for files
- [ ] Create comprehensive test suite with known hash values

### 1.2 Registry Module 📋
- [ ] Design abstract `Registry.t` and `Registry.entry` types
- [ ] Implement Pooch-compatible file parser (comments, blank lines)
- [ ] Add entry creation with filename, hash, optional custom URL
- [ ] Implement registry operations (find, exists, add, remove)
- [ ] Support loading from files and URLs
- [ ] Add registry serialization (to_string, save)

### 1.3 Cache Module 💾
- [ ] Create abstract `Cache.t` type with base path management
- [ ] Implement XDG Base Directory specification
- [ ] Add version subdirectory support
- [ ] Implement cache operations (exists, clear, size, list)
- [ ] Add lazy directory creation
- [ ] Support environment variable overrides (TORU_CACHE_DIR)

## Phase 2: External Tool Integration

### 2.1 Modular Downloader Interface 🔌
- [ ] Define DOWNLOADER module signature
- [ ] Create abstract Downloader.t type with module wrapping
- [ ] Implement tool detection and availability checking
- [ ] Add downloader selection (wget, curl, auto-detect)

### 2.2 Wget Downloader Implementation 📥
- [ ] Implement Wget_downloader module with DOWNLOADER interface
- [ ] Add resume support with `--continue` flag
- [ ] Handle timeout, retry, and quiet options
- [ ] Implement hash verification after download
- [ ] Add comprehensive error handling with exit codes

### 2.3 Curl Downloader Implementation 📦
- [ ] Implement Curl_downloader module with DOWNLOADER interface
- [ ] Add resume support with `--continue-at -` flag
- [ ] Configure timeout, retry, and progress options
- [ ] Implement hash verification after download
- [ ] Handle various curl error conditions

## Phase 3: Main Interface

### 3.1 Toru Module Core 🎯
- [ ] Design abstract Toru.t type with accessor functions
- [ ] Implement constructor with registry loading
- [ ] Add base_url, cache, and registry accessors
- [ ] Create single file fetch functionality
- [ ] Implement processor pipeline for post-download transformations

### 3.2 Concurrent Operations ⚡
- [ ] Implement fetch_all with configurable concurrency
- [ ] Add Eio fiber-based parallel downloads
- [ ] Implement progress reporting integration
- [ ] Add error aggregation for batch operations

### 3.3 Static Utilities 🛠️
- [ ] Implement standalone retrieve function
- [ ] Add registry manipulation functions
- [ ] Support base URL updates
- [ ] Create convenience functions for common use cases

## Phase 4: Testing & Validation

### 4.1 Tessera-Manifests Integration 🧪
- [ ] Set up test fixtures with tessera-manifests URLs
- [ ] Test embeddings registry parsing (2024 data)
- [ ] Validate landmasks registry parsing
- [ ] Test geographic coordinate extraction
- [ ] Performance test with large manifests (>100 entries)

### 4.2 Hash Algorithm Tests 🔐
- [ ] Test SHA256, SHA1, MD5 verification with known files
- [ ] Validate prefix parsing ("sha1:abc123", "md5:def456")
- [ ] Test hash computation accuracy
- [ ] Error handling for invalid hash formats

### 4.3 Download Integration Tests 📡
- [ ] Test wget downloader with real tessera data
- [ ] Test curl downloader with resume functionality
- [ ] Validate hash verification after downloads
- [ ] Test error handling for network failures

## Phase 5: CLI & User Experience

### 5.1 Command Line Interface 💻
- [ ] Integrate cmdliner for argument parsing
- [ ] Add downloader selection (--downloader wget|curl|auto)
- [ ] Implement cache path configuration
- [ ] Add verbose/quiet mode options

### 5.2 Progress Reporting 📊
- [ ] Integrate OCaml progress library
- [ ] Show download speed and ETA
- [ ] Support multiple concurrent progress bars
- [ ] Add file name and size information

### 5.3 Archive Processing 📁
- [ ] Implement untar_gz processor using system tar
- [ ] Add unzip processor using system unzip
- [ ] Support untar_xz with tar -xJf
- [ ] Create custom processor interface

## Phase 6: Future Extensions

### 6.1 Pure OCaml Implementation 🐪
- [ ] Implement Cohttp_downloader module
- [ ] Add streaming download support
- [ ] Implement HTTP Range requests for resume
- [ ] TLS support with tls-eio
- [ ] Migrate from external tools gradually

### 6.2 DOI Resolution (Toru-DOI) 📚
- [ ] Create separate toru-doi library
- [ ] Implement Zenodo API integration
- [ ] Add Figshare API support
- [ ] DOI to registry conversion
- [ ] Metadata caching and rate limiting

### 6.3 Advanced Features 🚀
- [ ] FTP protocol support
- [ ] Authentication mechanisms (API keys, tokens)
- [ ] Checksum verification during download
- [ ] Partial download recovery
- [ ] Registry merging and diff operations

## Dependencies

### Core Dependencies
- `eio` (>= 1.0) - Effects-based I/O
- `digestif` (>= 1.0) - Cryptographic hashes
- `uri` - URL parsing
- `cmdliner` - CLI parsing

### System Dependencies
- `wget` or `curl` - Download tools (one required)

### Optional Dependencies
- `progress` - Progress bars
- `yojson` - JSON configuration
- `tar`, `unzip` - Archive processing

## Success Criteria

### Phase 1 Success ✅
- [ ] All core modules pass unit tests
- [ ] Hash verification works with digestif
- [ ] Registry parsing handles tessera-manifests correctly
- [ ] Cache follows XDG directory specification

### Phase 2 Success ✅
- [ ] Both wget and curl downloaders work
- [ ] Resume functionality tested with interrupted downloads
- [ ] Automatic tool detection and fallback
- [ ] Hash verification after external tool downloads

### Phase 3 Success ✅
- [ ] Full tessera-manifests integration test passes
- [ ] Concurrent downloads work without conflicts
- [ ] Single-file fetch and batch fetch both functional
- [ ] Processor pipeline handles archives correctly

### Final Success ✅
- [ ] Complete tessera geospatial data download workflow
- [ ] CLI tool usable for real data management
- [ ] Documentation and examples complete
- [ ] Performance acceptable for large datasets (GB scale)

## XDG Integration Notes

The current TODO includes XDG Base Directory specification support through the xdg-eio library. This provides:

- Automatic XDG cache directory detection
- Cross-platform path handling (Unix, macOS, Windows)
- Environment variable overrides (XDG_CACHE_HOME, etc.)
- Pretty-printing for user-friendly directory display

---

*This TODO represents approximately 6-8 weeks of development work, focusing on robust external tool integration before migrating to a pure OCaml implementation.*

## Additional Features for Pooch Compatibility

### **1. Authentication Support**
- [ ] **HTTP Authentication**: Basic auth via username/password
- [x] **FTP Authentication**: Username/password credentials (moved to "Won't Implement")
- [ ] **SFTP Authentication**: SSH-based secure file transfer
- [ ] **Environment Variable Patterns**: Standardized env var support for credentials

### **2. Additional Download Protocols**
- [ ] **SFTP**: Secure file transfer (requires SSH libraries)
- [ ] **DataVerse DOI Support**: Beyond Zenodo/Figshare (added in Pooch v1.7.0)

### **3. Registry Management Utilities**
- [ ] **`make_registry` equivalent**: Auto-generate registry files from directories
- [ ] **DOI-based Registry Loading**: `load_registry_from_doi()` functionality
- [ ] **Recursive Directory Scanning**: For registry generation

### **4. Advanced Download Features**
- [ ] **Retry Mechanisms**: Exponential backoff (1s → 10s max)
- [ ] **Temporary File Handling**: Atomic file replacement during downloads
- [x] **Hash Algorithm Flexibility**: MD5/SHA1 support (implemented with automatic detection)

### **5. Processing/Archive Support**
- [ ] **Built-in Processors**:
  - [ ] `Unzip` processor for ZIP files
  - [ ] `Untar` processor for TAR archives
  - [ ] `Decompress` processor for compressed files
- [ ] **Processor Chaining**: Sequential processing pipeline
- [ ] **Archive-Specific Handling**: Beyond basic shell tool integration

### **6. Progress Reporting Integration**
- [ ] **Multiple Progress Libraries**: `tqdm` integration vs OCaml `progress`
- [ ] **Progress Bar Customization**: Custom progress objects
- [ ] **Stderr Output Control**: Configurable progress display

### **7. Utilities and Helper Functions**
- [ ] **Version Compatibility Checking**: `check_version()` utility
- [ ] **Logging Integration**: Built-in logging support with levels
- [ ] **Test Runner**: `pooch.test()` functionality

### **8. Environment Variable Standards**
- [ ] **XDG Compliance**: Full XDG Base Directory specification
- [ ] **Platform-Specific Defaults**: Windows `%LOCALAPPDATA%` patterns
- [ ] **Standardized Override Patterns**: Consistent env var naming

### **9. Registry Format Features**
- [x] **Comment Support**: Lines starting with `#` (already planned)
- [x] **Multiple Hash Formats**: SHA256, SHA1, MD5 with automatic detection
- [ ] **Registry Validation**: Built-in format checking

### **10. API Design Differences**
- [ ] **Static Methods**: `pooch.retrieve()` for one-off downloads (already planned)
- [ ] **Factory Functions**: `pooch.create()` vs constructor patterns
- [ ] **Callable Downloaders**: Function-based custom downloaders vs module system

## Features Where Toru Has Advantages

### **1. Concurrency**
- [x] **Parallel Downloads**: `fetch_all` with configurable concurrency
- [x] **Eio-based Async**: Modern effects-based concurrency

### **2. Type Safety**
- [x] **OCaml Type System**: Compile-time error prevention
- [x] **Result Types**: Explicit error handling vs exceptions

### **3. Modular Architecture**
- [x] **Downloader Modules**: Clean module interface vs callable objects
- [x] **External Tool Integration**: wget/curl with migration path to pure OCaml

### **4. Performance Path**
- [x] **Migration Strategy**: External tools → pure OCaml implementation
- [x] **Resume Support**: Via wget/curl initially, then native implementation

## Implementation Priorities

### **Priority 1 (Core Compatibility)**
1. [ ] **Add Authentication Support**: HTTP Basic, environment variables (FTP removed)
2. [ ] **Implement `make_registry`**: Directory scanning utility
3. [x] **Add More Hash Algorithms**: SHA1, MD5 support (completed)
4. [ ] **Enhance Progress Reporting**: Better integration with OCaml ecosystem
5. [ ] **Unified Configuration**: Cmdliner + environment variable integration

### **Priority 2 (Advanced Features)**
1. [ ] **SFTP Protocol**: Using OCaml SSH libraries
2. [ ] **Retry Mechanisms**: Exponential backoff implementation
3. [ ] **Built-in Processors**: Native OCaml archive handling
4. [ ] **DataVerse DOI Support**: Extend DOI resolver

### **Priority 3 (Ecosystem Integration)**
1. [ ] **Command Line Interface**: Unlike Pooch, add CLI support
2. [ ] **Comprehensive Logging**: Structured logging with levels
3. [ ] **Test Framework Integration**: Native OCaml test support

## Implementation Notes

### Authentication Implementation
- **Per-downloader auth configs**: Each downloader gets its own auth settings
- **External tools**: `wget`/`curl` handle auth via command-line args (`--user`, `--password`)
- **Pure OCaml**: `cohttp-eio` uses Basic Auth headers
- **SSH libraries**: For SFTP (consider `ssh` or `libssh` bindings)
- **Configuration**: Cmdliner + environment variables per downloader type

### Registry Utilities
- `make_registry`: Use `Eio.Path` for directory traversal
- Implement recursive hash computation
- Output in Pooch-compatible format

### Retry Mechanisms
- Implement exponential backoff with jitter
- Configurable retry counts and timeouts
- Log retry attempts

### Archive Processing
- Native OCaml implementations preferred over shell tools
- Consider `camlzip`, `tar`, `decompress` libraries
- Chain processors for complex formats

### Progress Reporting
- Enhance OCaml `progress` library integration
- Support custom progress callbacks
- Configurable output streams

## Current Implementation Status

### Completed (as per CLAUDE.md)
- [x] Hash module design (SHA256, SHA1, MD5)
- [x] Hash module implementation with all algorithms
- [x] Registry module design with Pooch compatibility
- [x] Cache module with XDG support
- [x] Modular downloader interface
- [x] Concurrent download design (`fetch_all`)
- [x] External tool integration (wget/curl)
- [x] DOI resolution library design

### In Progress
- [x] Hash module implementation (completed with tests)
- [ ] Registry module implementation
- [ ] Cache module implementation
- [ ] External tool downloader implementations
- [ ] Test suite with tessera-manifests

### Not Started
- [ ] Authentication systems
- [ ] SFTP protocol support
- [ ] Advanced registry utilities
- [ ] Built-in archive processors
- [ ] CLI interface
- [ ] Pure OCaml HTTP downloader (cohttp-eio)

---

## Features We Won't Implement

### **Deliberately Excluded Features**
- **FTP Protocol Support**:
  - Rare in modern usage, HTTPS/SFTP preferred
  - Adds complexity without significant benefit
  - Can still use via external tools if needed
- **Windows-Specific Path Handling**:
  - Focus on Unix/macOS primarily
  - Basic Windows support via Eio, not optimized
- **Legacy Hash Algorithms** (initially):
  - MD4, SHA-0 - cryptographically broken
  - Focus on SHA256/SHA1/MD5 for compatibility
- **Complex Authentication Flows**:
  - OAuth2, JWT tokens, API keys with refresh
  - Keep to Basic Auth for simplicity
- **GUI Progress Bars**:
  - Terminal/CLI focused library
  - Text-based progress reporting only

## Unified Configuration Design

### **Cmdliner + Environment Variable Integration**

```ocaml
module Config = struct
  (* Per-downloader authentication *)
  type auth = {
    username: string option;
    password: string option;
  }

  (* Global application configuration *)
  type t = {
    base_url: string;
    cache_dir: string;
    downloader: [`Wget | `Curl | `Cohttp | `Auto];
    retry_count: int;
    timeout: float;
    (* Auth configurations per downloader type *)
    wget_auth: auth;
    curl_auth: auth;
    cohttp_auth: auth;
  }

  (* Cmdliner terms with environment fallbacks *)
  let base_url_term =
    let doc = "Base URL for downloads" in
    let env = Cmdliner.Arg.env_var "TORU_BASE_URL" in
    Cmdliner.Arg.(required & opt (some string) None &
                  info ["base-url"; "u"] ~env ~doc)

  let cache_dir_term =
    let doc = "Cache directory path" in
    let env = Cmdliner.Arg.env_var "TORU_CACHE_DIR" in
    let default = Cache.default_path ~app_name:"toru" () in
    Cmdliner.Arg.(value & opt string default &
                  info ["cache-dir"; "c"] ~env ~doc)

  (* Per-downloader auth terms *)
  let wget_auth_terms =
    let username_term =
      let doc = "Wget authentication username" in
      let env = Cmdliner.Arg.env_var "TORU_WGET_USERNAME" in
      Cmdliner.Arg.(value & opt (some string) None &
                    info ["wget-username"] ~env ~doc) in
    let password_term =
      let doc = "Wget authentication password" in
      let env = Cmdliner.Arg.env_var "TORU_WGET_PASSWORD" in
      Cmdliner.Arg.(value & opt (some string) None &
                    info ["wget-password"] ~env ~doc) in
    (username_term, password_term)

  let curl_auth_terms =
    let username_term =
      let doc = "Curl authentication username" in
      let env = Cmdliner.Arg.env_var "TORU_CURL_USERNAME" in
      Cmdliner.Arg.(value & opt (some string) None &
                    info ["curl-username"] ~env ~doc) in
    let password_term =
      let doc = "Curl authentication password" in
      let env = Cmdliner.Arg.env_var "TORU_CURL_PASSWORD" in
      Cmdliner.Arg.(value & opt (some string) None &
                    info ["curl-password"] ~env ~doc) in
    (username_term, password_term)

  let cohttp_auth_terms =
    let username_term =
      let doc = "Cohttp authentication username" in
      let env = Cmdliner.Arg.env_var "TORU_COHTTP_USERNAME" in
      Cmdliner.Arg.(value & opt (some string) None &
                    info ["cohttp-username"] ~env ~doc) in
    let password_term =
      let doc = "Cohttp authentication password" in
      let env = Cmdliner.Arg.env_var "TORU_COHTTP_PASSWORD" in
      Cmdliner.Arg.(value & opt (some string) None &
                    info ["cohttp-password"] ~env ~doc) in
    (username_term, password_term)

  let downloader_term =
    let doc = "Download tool: wget, curl, cohttp, auto" in
    let env = Cmdliner.Arg.env_var "TORU_DOWNLOADER" in
    Cmdliner.Arg.(value & opt (enum [
        ("wget", `Wget); ("curl", `Curl);
        ("cohttp", `Cohttp); ("auto", `Auto)
      ]) `Auto & info ["downloader"; "d"] ~env ~doc)

  let retry_count_term =
    let doc = "Number of download retries" in
    let env = Cmdliner.Arg.env_var "TORU_RETRY_COUNT" in
    Cmdliner.Arg.(value & opt int 3 &
                  info ["retries"; "r"] ~env ~doc)

  let timeout_term =
    let doc = "Download timeout in seconds" in
    let env = Cmdliner.Arg.env_var "TORU_TIMEOUT" in
    Cmdliner.Arg.(value & opt float 300.0 &
                  info ["timeout"; "t"] ~env ~doc)

  let config_term =
    let combine base_url cache_dir downloader retries timeout
        wget_user wget_pass curl_user curl_pass cohttp_user cohttp_pass =
      let wget_auth = { username = wget_user; password = wget_pass } in
      let curl_auth = { username = curl_user; password = curl_pass } in
      let cohttp_auth = { username = cohttp_user; password = cohttp_pass } in
      { base_url; cache_dir; downloader; retry_count = retries; timeout;
        wget_auth; curl_auth; cohttp_auth }
    in
    let (wget_user_term, wget_pass_term) = wget_auth_terms in
    let (curl_user_term, curl_pass_term) = curl_auth_terms in
    let (cohttp_user_term, cohttp_pass_term) = cohttp_auth_terms in
    Cmdliner.Term.(const combine $ base_url_term $ cache_dir_term $
                   downloader_term $ retry_count_term $ timeout_term $
                   wget_user_term $ wget_pass_term $
                   curl_user_term $ curl_pass_term $
                   cohttp_user_term $ cohttp_pass_term)

  (* Helper to get auth for specific downloader *)
  let get_auth config = function
    | `Wget -> config.wget_auth
    | `Curl -> config.curl_auth
    | `Cohttp -> config.cohttp_auth
    (* `Auto has no credentials of its own: resolve it to a concrete
       downloader first, then look up that downloader's auth *)
    | `Auto -> { username = None; password = None }
end
```

### **Updated Downloader Interface**

```ocaml
module type DOWNLOADER = sig
  type t

  val create : sw:Eio.Switch.t -> env:Eio_unix.Stdenv.base ->
    ?auth:Config.auth -> unit -> t

  val download : t ->
    url:string ->
    dest:Eio.Fs.dir_ty Eio.Path.t ->
    ?hash:Hash.t ->
    ?progress:Progress_reporter.t ->
    ?resume:bool ->
    unit -> (unit, string) result

  val supports_resume : t -> bool
  val name : t -> string
end

module Wget_downloader : DOWNLOADER = struct
  type t = {
    sw : Eio.Switch.t;
    env : Eio_unix.Stdenv.base;
    auth : Config.auth option;
    timeout : float;
  }

  let create ~sw ~env ?auth () =
    { sw; env; auth; timeout = 300.0 }

  let download t ~url ~dest ?hash ?progress ?(resume=true) () =
    (* Record fields are qualified because Config.auth lives in another module *)
    let auth_args = match t.auth with
      | Some { Config.username = Some u; password = Some p } ->
        ["--user=" ^ u; "--password=" ^ p]
      | Some { Config.username = Some u; password = None } ->
        ["--user=" ^ u]
      | _ -> [] in
    (* Use the configured timeout instead of a hard-coded 300 *)
    let args = ["--quiet"; "--show-progress";
                Printf.sprintf "--timeout=%.0f" t.timeout] @
               auth_args @ ["--output-document=" ^ (Eio.Path.native_exn dest)] in
    (* ... rest of implementation *)
end
```

### **Environment Variable Standards**

| Variable | Purpose | Example |
|----------|---------|----------|
| `TORU_BASE_URL` | Default download base URL | `https://data.example.com/` |
| `TORU_CACHE_DIR` | Override cache location | `/custom/cache/path` |
| `TORU_WGET_USERNAME` | Wget HTTP Basic Auth username | `myuser` |
| `TORU_WGET_PASSWORD` | Wget HTTP Basic Auth password | `secret123` |
| `TORU_CURL_USERNAME` | Curl HTTP Basic Auth username | `myuser` |
| `TORU_CURL_PASSWORD` | Curl HTTP Basic Auth password | `secret123` |
| `TORU_COHTTP_USERNAME` | Cohttp HTTP Basic Auth username | `myuser` |
| `TORU_COHTTP_PASSWORD` | Cohttp HTTP Basic Auth password | `secret123` |
| `TORU_DOWNLOADER` | Preferred download tool | `wget`, `curl`, `cohttp`, `auto` |
| `TORU_RETRY_COUNT` | Download retry attempts | `5` |
| `TORU_TIMEOUT` | Download timeout (seconds) | `600.0` |
| `TORU_REGISTRY_URL` | Registry file URL override | `https://example.com/registry.txt` |

### **Benefits of Per-Downloader Auth**

1. **Flexibility**: Different auth for different downloaders
2. **Tool-Specific**: Wget vs Curl may need different credentials
3. **Migration Path**: Smooth transition from external tools to pure OCaml
4. **Security**: Auth only passed to specific downloader implementations
5. **Testing**: Easy to test each downloader with different auth configurations

### **Usage Examples**

```bash
# Different auth per downloader
export TORU_WGET_USERNAME=wget_user
export TORU_WGET_PASSWORD=wget_pass
export TORU_CURL_USERNAME=curl_user
export TORU_CURL_PASSWORD=curl_pass
toru fetch --downloader wget data.csv  # uses wget auth
toru fetch --downloader curl data.csv  # uses curl auth

# CLI override for specific downloader
toru fetch --downloader cohttp --cohttp-username myuser data.csv

# Auto-detection with fallback auth
export TORU_WGET_USERNAME=backup_user
toru fetch --downloader auto data.csv  # will use wget if available, with auth
```
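
The auto-detection example above implies that `auto` is resolved to a concrete tool before credentials are looked up. Below is a minimal sketch of that step, not the planned implementation. It assumes a caller-supplied `available` predicate (hypothetical; for instance a `PATH` lookup) and a wget, curl, cohttp preference order, which this plan does not specify. The types are redeclared locally so the block compiles on its own, and `resolve`, `auth_for`, and `wget_auth_args` are illustrative names rather than part of the design.

```ocaml
(* Local stand-ins for Config.auth and the per-downloader auth fields. *)
type auth = { username : string option; password : string option }

type concrete = [ `Wget | `Curl | `Cohttp ]
type choice = [ `Wget | `Curl | `Cohttp | `Auto ]

type config = { wget_auth : auth; curl_auth : auth; cohttp_auth : auth }

let no_auth = { username = None; password = None }

(* Resolve the user's choice to a concrete downloader.
   [available] is an assumed predicate supplied by the caller;
   the wget -> curl -> cohttp order is an assumption of this sketch. *)
let resolve ~(available : concrete -> bool) (choice : choice) : concrete option =
  match choice with
  | `Wget -> Some `Wget
  | `Curl -> Some `Curl
  | `Cohttp -> Some `Cohttp
  | `Auto -> List.find_opt available [ `Wget; `Curl; `Cohttp ]

(* Look up credentials only once the downloader is concrete. *)
let auth_for (config : config) : concrete -> auth = function
  | `Wget -> config.wget_auth
  | `Curl -> config.curl_auth
  | `Cohttp -> config.cohttp_auth

(* Translate credentials into wget's --user / --password flags. *)
let wget_auth_args = function
  | { username = Some u; password = Some p } ->
    [ "--user=" ^ u; "--password=" ^ p ]
  | { username = Some u; password = None } -> [ "--user=" ^ u ]
  | _ -> []
```

With `TORU_WGET_USERNAME=backup_user` set and only wget installed, `resolve` would pick wget and `auth_for` would return the wget credentials, which matches the comment in the last usage example. If no tool is available, `resolve` returns `None` and the caller can report the error instead of attempting an unauthenticated download.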