# Toru Implementation TODO

This document outlines the implementation plan for Toru, an OCaml data repository manager compatible with Python Pooch registry files.

## Phase 1: Core Modules

### 1.1 Hash Module ✨
- [ ] Define abstract `Hash.t` type with algorithm variants (SHA256, SHA1, MD5)
- [ ] Implement `create`, `of_string`, `to_string` functions
- [ ] Add algorithm parsing with prefix support ("sha1:", "md5:", plain)
- [ ] Implement file verification using digestif library
- [ ] Add hash computation for files
- [ ] Create comprehensive test suite with known hash values

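The prefix handling listed above could look roughly like the sketch below. It is only an illustration: the record layout, the `is_hex` helper, and the choice to treat an unprefixed digest as SHA256 are assumptions, and the real `Hash.t` is meant to be abstract.

```ocaml
(* Hypothetical representation; the planned Hash.t would hide this behind an interface. *)
type algorithm = SHA256 | SHA1 | MD5

type hash = { algorithm : algorithm; hex : string }

let is_hex s =
  s <> ""
  && String.for_all
       (function '0' .. '9' | 'a' .. 'f' | 'A' .. 'F' -> true | _ -> false)
       s

(* Accepts "sha256:<hex>", "sha1:<hex>", "md5:<hex>", or a plain digest.
   Assumption: a plain digest is SHA256. *)
let of_string s =
  let prefix, hex =
    match String.index_opt s ':' with
    | None -> ("sha256", s)
    | Some i ->
      ( String.lowercase_ascii (String.sub s 0 i),
        String.sub s (i + 1) (String.length s - i - 1) )
  in
  let algorithm =
    match prefix with
    | "sha256" -> Ok SHA256
    | "sha1" -> Ok SHA1
    | "md5" -> Ok MD5
    | other -> Error ("unknown hash algorithm: " ^ other)
  in
  match algorithm with
  | Error msg -> Error msg
  | Ok _ when not (is_hex hex) -> Error ("invalid hex digest: " ^ hex)
  | Ok algorithm -> Ok { algorithm; hex = String.lowercase_ascii hex }

(* SHA256 is printed without a prefix, the other algorithms with one. *)
let to_string { algorithm; hex } =
  match algorithm with
  | SHA256 -> hex
  | SHA1 -> "sha1:" ^ hex
  | MD5 -> "md5:" ^ hex
```

Returning a `result` rather than raising keeps invalid registry entries recoverable by the caller.
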
### 1.2 Registry Module 📋
- [ ] Design abstract `Registry.t` and `Registry.entry` types
- [ ] Implement Pooch-compatible file parser (comments, blank lines)
- [ ] Add entry creation with filename, hash, optional custom URL
- [ ] Implement registry operations (find, exists, add, remove)
- [ ] Support loading from files and URLs
- [ ] Add registry serialization (to_string, save)

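As a rough sketch of the parser item above, the following treats each non-comment line as `filename hash [url]` separated by whitespace. The `entry` record and function names are hypothetical, and filenames containing spaces are not handled.

```ocaml
(* Hypothetical entry shape; the hash is kept as a raw string here. *)
type entry = { filename : string; hash : string; url : string option }

(* Returns Ok None for blank lines and '#' comments. *)
let parse_line line =
  let line = String.trim (String.map (function '\t' -> ' ' | c -> c) line) in
  if line = "" || line.[0] = '#' then Ok None
  else
    match String.split_on_char ' ' line |> List.filter (fun s -> s <> "") with
    | [ filename; hash ] -> Ok (Some { filename; hash; url = None })
    | [ filename; hash; url ] -> Ok (Some { filename; hash; url = Some url })
    | _ -> Error ("malformed registry line: " ^ line)

(* Parses a whole registry file, stopping at the first malformed line. *)
let parse contents =
  String.split_on_char '\n' contents
  |> List.fold_left
       (fun acc line ->
         match (acc, parse_line line) with
         | Error _, _ -> acc
         | Ok _, Error msg -> Error msg
         | Ok entries, Ok None -> Ok entries
         | Ok entries, Ok (Some e) -> Ok (e :: entries))
       (Ok [])
  |> Result.map List.rev
```
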
### 1.3 Cache Module 💾
- [ ] Create abstract `Cache.t` type with base path management
- [ ] Implement XDG Base Directory specification
- [ ] Add version subdirectory support
- [ ] Implement cache operations (exists, clear, size, list)
- [ ] Add lazy directory creation
- [ ] Support environment variable overrides (TORU_CACHE_DIR)

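The lookup order implied by the items above might be sketched as follows. This assumes `TORU_CACHE_DIR` names the complete cache directory (no application name appended) and falls back to `$XDG_CACHE_HOME/<app>` and then `$HOME/.cache/<app>`; the function name and the injected `getenv` parameter are illustrative only.

```ocaml
(* [getenv] is injectable so the lookup order can be tested without
   touching the real environment. Empty variables count as unset. *)
let cache_dir ?(getenv = Sys.getenv_opt) ?version ~app_name () =
  let non_empty name =
    match getenv name with Some "" | None -> None | some -> some
  in
  let base =
    match non_empty "TORU_CACHE_DIR" with
    | Some dir -> dir
    | None ->
      let xdg =
        match non_empty "XDG_CACHE_HOME" with
        | Some dir -> dir
        | None ->
          let home = Option.value (non_empty "HOME") ~default:"." in
          Filename.concat home ".cache"
      in
      Filename.concat xdg app_name
  in
  (* Optional version subdirectory, e.g. <cache>/v1 *)
  match version with
  | None -> base
  | Some v -> Filename.concat base v
```

The function only computes a path; creating the directory lazily on first write would be a separate step.
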
## Phase 2: External Tool Integration

### 2.1 Modular Downloader Interface 🔌
- [ ] Define DOWNLOADER module signature
- [ ] Create abstract Downloader.t type with module wrapping
- [ ] Implement tool detection and availability checking
- [ ] Add downloader selection (wget, curl, auto-detect)

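One possible shape for the tool detection step is a plain `PATH` search, sketched below. The names `find_in_path` and `detect` are hypothetical, preferring wget over curl is an assumption, and `Sys.file_exists` does not check the executable bit, so a real implementation may want a stricter test.

```ocaml
(* Returns the first "<dir>/<name>" on the search path that exists.
   [path] and [exists] are injectable for testing. *)
let find_in_path ?(path = Option.value (Sys.getenv_opt "PATH") ~default:"")
    ?(exists = Sys.file_exists) name =
  String.split_on_char ':' path
  |> List.filter (fun dir -> dir <> "")
  |> List.map (fun dir -> Filename.concat dir name)
  |> List.find_opt exists

type tool = Wget | Curl

(* Auto-detection: wget first, then curl (assumed preference order). *)
let detect ?path ?exists () =
  match (find_in_path ?path ?exists "wget", find_in_path ?path ?exists "curl") with
  | Some _, _ -> Some Wget
  | None, Some _ -> Some Curl
  | None, None -> None
```
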
### 2.2 Wget Downloader Implementation 📥
- [ ] Implement Wget_downloader module with DOWNLOADER interface
- [ ] Add resume support with `--continue` flag
- [ ] Handle timeout, retry, and quiet options
- [ ] Implement hash verification after download
- [ ] Add comprehensive error handling with exit codes

### 2.3 Curl Downloader Implementation 📦
- [ ] Implement Curl_downloader module with DOWNLOADER interface
- [ ] Add resume support with `--continue-at -` flag
- [ ] Configure timeout, retry, and progress options
- [ ] Implement hash verification after download
- [ ] Handle various curl error conditions

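To make the flag mapping concrete, here is a minimal sketch that builds argument lists for both tools from one options record. The `options` type and function names are assumptions; the flags themselves are standard wget and curl options. Note that wget's `--tries` counts total attempts while curl's `--retry` counts retries after the first attempt, so a real implementation may need to adjust one of them.

```ocaml
(* Hypothetical shared options; timeout is in whole seconds. *)
type options = { resume : bool; timeout : int; retries : int; quiet : bool }

let wget_args opts ~url ~dest =
  (if opts.resume then [ "--continue" ] else [])
  @ (if opts.quiet then [ "--quiet" ] else [])
  @ [ Printf.sprintf "--timeout=%d" opts.timeout;
      Printf.sprintf "--tries=%d" opts.retries;
      "--output-document=" ^ dest;
      url ]

let curl_args opts ~url ~dest =
  (* "--continue-at -" lets curl work out the resume offset itself *)
  (if opts.resume then [ "--continue-at"; "-" ] else [])
  @ (if opts.quiet then [ "--silent"; "--show-error" ] else [])
  @ [ "--fail"; "--location";
      "--max-time"; string_of_int opts.timeout;
      "--retry"; string_of_int opts.retries;
      "--output"; dest;
      url ]
```
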
## Phase 3: Main Interface

### 3.1 Toru Module Core 🎯
- [ ] Design abstract Toru.t type with accessor functions
- [ ] Implement constructor with registry loading
- [ ] Add base_url, cache, and registry accessors
- [ ] Create single file fetch functionality
- [ ] Implement processor pipeline for post-download transformations

### 3.2 Concurrent Operations ⚡
- [ ] Implement fetch_all with configurable concurrency
- [ ] Add Eio fiber-based parallel downloads
- [ ] Implement progress reporting integration
- [ ] Add error aggregation for batch operations

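Error aggregation for batch operations could be sketched with the standard library alone, as below. The `aggregate` and `fetch_all` signatures are assumptions; the `map` parameter defaults to the sequential `List.map`, and in Toru it should be possible to pass something like `Eio.Fiber.List.map ~max_fibers:n` instead to get bounded concurrency.

```ocaml
(* Collects per-file failures; succeeds with the file count only if
   every download succeeded. *)
let aggregate (results : (string * (unit, string) result) list) =
  let failures =
    List.filter_map
      (fun (name, r) ->
        match r with Ok () -> None | Error msg -> Some (name, msg))
      results
  in
  match failures with
  | [] -> Ok (List.length results)
  | _ -> Error failures

(* [fetch] downloads one file; [map] decides how the calls are scheduled. *)
let fetch_all ?(map = List.map) fetch names =
  map (fun name -> (name, fetch name)) names |> aggregate
```

Reporting every failure, rather than only the first, lets a batch run continue past a single broken file.
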
### 3.3 Static Utilities 🛠️
- [ ] Implement standalone retrieve function
- [ ] Add registry manipulation functions
- [ ] Support base URL updates
- [ ] Create convenience functions for common use cases

## Phase 4: Testing & Validation

### 4.1 Tessera-Manifests Integration 🧪
- [ ] Set up test fixtures with tessera-manifests URLs
- [ ] Test embeddings registry parsing (2024 data)
- [ ] Validate landmasks registry parsing
- [ ] Test geographic coordinate extraction
- [ ] Performance test with large manifests (>100 entries)

### 4.2 Hash Algorithm Tests 🔐
- [ ] Test SHA256, SHA1, MD5 verification with known files
- [ ] Validate prefix parsing ("sha1:abc123", "md5:def456")
- [ ] Test hash computation accuracy
- [ ] Error handling for invalid hash formats

### 4.3 Download Integration Tests 📡
- [ ] Test wget downloader with real tessera data
- [ ] Test curl downloader with resume functionality
- [ ] Validate hash verification after downloads
- [ ] Test error handling for network failures

## Phase 5: CLI & User Experience

### 5.1 Command Line Interface 💻
- [ ] Integrate cmdliner for argument parsing
- [ ] Add downloader selection (--downloader wget|curl|auto)
- [ ] Implement cache path configuration
- [ ] Add verbose/quiet mode options

### 5.2 Progress Reporting 📊
- [ ] Integrate OCaml progress library
- [ ] Show download speed and ETA
- [ ] Support multiple concurrent progress bars
- [ ] Add file name and size information

### 5.3 Archive Processing 📁
- [ ] Implement untar_gz processor using system tar
- [ ] Add unzip processor using system unzip
- [ ] Support untar_xz with tar -xJf
- [ ] Create custom processor interface

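A custom processor interface could be as small as a function from an input path to an output path. The sketch below assumes that shape, shows how processors might be chained, and maps archive suffixes to the system commands named above; `processor`, `chain`, and `extract_command` are hypothetical names, and the commands are only built, not executed.

```ocaml
(* A processor takes the path of a downloaded file and returns the path
   of whatever it produced. *)
type processor = string -> (string, string) result

(* Runs processors left to right, stopping at the first error. *)
let chain (processors : processor list) : processor =
 fun path -> List.fold_left Result.bind (Ok path) processors

(* Picks the system command for an archive based on its suffix. *)
let extract_command ~archive ~dest =
  let ends_with suffix = Filename.check_suffix archive suffix in
  if ends_with ".tar.gz" || ends_with ".tgz" then
    Ok [ "tar"; "-xzf"; archive; "-C"; dest ]
  else if ends_with ".tar.xz" then Ok [ "tar"; "-xJf"; archive; "-C"; dest ]
  else if ends_with ".zip" then Ok [ "unzip"; "-q"; archive; "-d"; dest ]
  else Error ("unsupported archive type: " ^ archive)
```
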
## Phase 6: Future Extensions

### 6.1 Pure OCaml Implementation 🐪
- [ ] Implement Cohttp_downloader module
- [ ] Add streaming download support
- [ ] Implement HTTP Range requests for resume
- [ ] TLS support with tls-eio
- [ ] Migrate from external tools gradually

### 6.2 DOI Resolution (Toru-DOI) 📚
- [ ] Create separate toru-doi library
- [ ] Implement Zenodo API integration
- [ ] Add Figshare API support
- [ ] DOI to registry conversion
- [ ] Metadata caching and rate limiting

### 6.3 Advanced Features 🚀
- [ ] FTP protocol support
- [ ] Authentication mechanisms (API keys, tokens)
- [ ] Checksum verification during download
- [ ] Partial download recovery
- [ ] Registry merging and diff operations

## Dependencies

### Core Dependencies
- `eio` (>= 1.0) - Effects-based I/O
- `digestif` (>= 1.0) - Cryptographic hashes
- `uri` - URL parsing
- `cmdliner` - CLI parsing

### System Dependencies
- `wget` or `curl` - Download tools (one required)

### Optional Dependencies
- `progress` - Progress bars
- `yojson` - JSON configuration
- `tar`, `unzip` - Archive processing

## Success Criteria

### Phase 1 Success ✅
- [ ] All core modules pass unit tests
- [ ] Hash verification works with digestif
- [ ] Registry parsing handles tessera-manifests correctly
- [ ] Cache follows XDG directory specification

### Phase 2 Success ✅
- [ ] Both wget and curl downloaders work
- [ ] Resume functionality tested with interrupted downloads
- [ ] Automatic tool detection and fallback
- [ ] Hash verification after external tool downloads

### Phase 3 Success ✅
- [ ] Full tessera-manifests integration test passes
- [ ] Concurrent downloads work without conflicts
- [ ] Single-file fetch and batch fetch both functional
- [ ] Processor pipeline handles archives correctly

### Final Success ✅
- [ ] Complete tessera geospatial data download workflow
- [ ] CLI tool usable for real data management
- [ ] Documentation and examples complete
- [ ] Performance acceptable for large datasets (GB scale)

## XDG Integration Notes

The current TODO includes XDG Base Directory specification support through the xdg-eio library. This provides:

- Automatic XDG cache directory detection
- Cross-platform path handling (Unix, macOS, Windows)
- Environment variable overrides (XDG_CACHE_HOME, etc.)
- Pretty-printing for user-friendly directory display

---

*This TODO represents approximately 6-8 weeks of development work, focusing on robust external tool integration before migrating to pure OCaml implementation.*

## Additional Features for Pooch Compatibility

### **1. Authentication Support**
- [ ] **HTTP Authentication**: Basic auth via username/password
- [x] **FTP Authentication**: Username/password credentials (moved to "Won't Implement")
- [ ] **SFTP Authentication**: SSH-based secure file transfer
- [ ] **Environment Variable Patterns**: Standardized env var support for credentials

### **2. Additional Download Protocols**
- [ ] **SFTP**: Secure file transfer (requires SSH libraries)
- [ ] **DataVerse DOI Support**: Beyond Zenodo/Figshare (added in Pooch v1.7.0)

### **3. Registry Management Utilities**
- [ ] **`make_registry` equivalent**: Auto-generate registry files from directories
- [ ] **DOI-based Registry Loading**: `load_registry_from_doi()` functionality
- [ ] **Recursive Directory Scanning**: For registry generation

### **4. Advanced Download Features**
- [ ] **Retry Mechanisms**: Exponential backoff (1s → 10s max)
- [ ] **Temporary File Handling**: Atomic file replacement during downloads
- [x] **Hash Algorithm Flexibility**: MD5/SHA1 support (implemented with automatic detection)

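Atomic file replacement, as listed above, is commonly done by writing to a temporary file in the destination directory and renaming it into place. The following is a minimal standard-library sketch under that assumption; `write_atomically` and the `.toru-`/`.part` naming are hypothetical, and a real downloader would verify the hash before the rename.

```ocaml
(* Writes through [write] into a temp file next to [dest], then renames it
   over [dest]. On any failure the temp file is removed and [dest] is left
   untouched. The temp file lives in the same directory so the rename stays
   on one filesystem. *)
let write_atomically ~dest write =
  let tmp =
    Filename.temp_file ~temp_dir:(Filename.dirname dest) ".toru-" ".part"
  in
  try
    let oc = open_out_bin tmp in
    Fun.protect
      ~finally:(fun () -> close_out_noerr oc)
      (fun () ->
        write oc;
        (* close explicitly so flush errors are not swallowed *)
        close_out oc);
    Sys.rename tmp dest
  with e ->
    (try Sys.remove tmp with Sys_error _ -> ());
    raise e
```
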
### **5. Processing/Archive Support**
- [ ] **Built-in Processors**:
  - [ ] `Unzip` processor for ZIP files
  - [ ] `Untar` processor for TAR archives
  - [ ] `Decompress` processor for compressed files
- [ ] **Processor Chaining**: Sequential processing pipeline
- [ ] **Archive-Specific Handling**: Beyond basic shell tool integration

### **6. Progress Reporting Integration**
- [ ] **Multiple Progress Libraries**: `tqdm` integration vs OCaml `progress`
- [ ] **Progress Bar Customization**: Custom progress objects
- [ ] **Stderr Output Control**: Configurable progress display

### **7. Utilities and Helper Functions**
- [ ] **Version Compatibility Checking**: `check_version()` utility
- [ ] **Logging Integration**: Built-in logging support with levels
- [ ] **Test Runner**: `pooch.test()` functionality

### **8. Environment Variable Standards**
- [ ] **XDG Compliance**: Full XDG Base Directory specification
- [ ] **Platform-Specific Defaults**: Windows `%LOCALAPPDATA%` patterns
- [ ] **Standardized Override Patterns**: Consistent env var naming

### **9. Registry Format Features**
- [x] **Comment Support**: Lines starting with `#` (already planned)
- [x] **Multiple Hash Formats**: SHA256, SHA1, MD5 with automatic detection
- [ ] **Registry Validation**: Built-in format checking

### **10. API Design Differences**
- [ ] **Static Methods**: `pooch.retrieve()` for one-off downloads (already planned)
- [ ] **Factory Functions**: `pooch.create()` vs constructor patterns
- [ ] **Callable Downloaders**: Function-based custom downloaders vs module system

## Features Where Toru Has Advantages

### **1. Concurrency**
- [x] **Parallel Downloads**: `fetch_all` with configurable concurrency
- [x] **Eio-based Async**: Modern effects-based concurrency

### **2. Type Safety**
- [x] **OCaml Type System**: Compile-time error prevention
- [x] **Result Types**: Explicit error handling vs exceptions

### **3. Modular Architecture**
- [x] **Downloader Modules**: Clean module interface vs callable objects
- [x] **External Tool Integration**: wget/curl with migration path to pure OCaml

### **4. Performance Path**
- [x] **Migration Strategy**: External tools → pure OCaml implementation
- [x] **Resume Support**: Via wget/curl initially, then native implementation

## Implementation Priorities

### **Priority 1 (Core Compatibility)**
1. [ ] **Add Authentication Support**: HTTP Basic, environment variables (FTP removed)
2. [ ] **Implement `make_registry`**: Directory scanning utility
3. [x] **Add More Hash Algorithms**: SHA1, MD5 support (completed)
4. [ ] **Enhance Progress Reporting**: Better integration with OCaml ecosystem
5. [ ] **Unified Configuration**: Cmdliner + environment variable integration

### **Priority 2 (Advanced Features)**
1. [ ] **SFTP Protocol**: Using OCaml SSH libraries
2. [ ] **Retry Mechanisms**: Exponential backoff implementation
3. [ ] **Built-in Processors**: Native OCaml archive handling
4. [ ] **DataVerse DOI Support**: Extend DOI resolver

### **Priority 3 (Ecosystem Integration)**
1. [ ] **Command Line Interface**: Unlike Pooch, add CLI support
2. [ ] **Comprehensive Logging**: Structured logging with levels
3. [ ] **Test Framework Integration**: Native OCaml test support

## Implementation Notes

### Authentication Implementation
- **Per-downloader auth configs**: Each downloader gets its own auth settings
- **External tools**: `wget` takes `--user` and `--password`; `curl` takes a single `--user user:password` argument
- **Pure OCaml**: `cohttp-eio` uses Basic Auth headers
- **SSH libraries**: For SFTP (consider `ssh` or `libssh` bindings)
- **Configuration**: Cmdliner + environment variables per downloader type

### Registry Utilities
- `make_registry`: Use `Eio.Path` for directory traversal
- Implement recursive hash computation
- Output in Pooch-compatible format

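The traversal and output format can be illustrated without any dependencies. The sketch below deliberately swaps in the standard library's `Sys` and `Digest` (MD5) in place of `Eio.Path` and digestif, so it is a stand-in for the planned implementation rather than a draft of it; entries are written as `relative/path md5:<hex>` and assume Unix-style separators.

```ocaml
(* Walks [root] recursively and collects (relative path, MD5 hex) pairs. *)
let rec scan root rel acc =
  let dir = if rel = "" then root else Filename.concat root rel in
  Array.fold_left
    (fun acc name ->
      let rel' = if rel = "" then name else rel ^ "/" ^ name in
      let full = Filename.concat root rel' in
      if Sys.is_directory full then scan root rel' acc
      else (rel', Digest.to_hex (Digest.file full)) :: acc)
    acc (Sys.readdir dir)

(* Renders the registry with one "name md5:<hex>" line per file,
   sorted by name so the output is stable. *)
let make_registry root =
  scan root "" []
  |> List.sort compare
  |> List.map (fun (name, md5) -> Printf.sprintf "%s md5:%s" name md5)
  |> String.concat "\n"
```
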
### Retry Mechanisms
- Implement exponential backoff with jitter
- Configurable retry counts and timeouts
- Log retry attempts

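A minimal sketch of the backoff described above, assuming a 1 s base delay doubled per attempt and capped at 10 s, with "equal jitter" (half the delay fixed, half random). The function names, the jitter variant, and the injected `sleep` are all assumptions; logging is left out.

```ocaml
(* Delay before retry number [attempt] (0-based), capped at [max_delay]. *)
let backoff_delay ?(base = 1.0) ?(max_delay = 10.0) attempt =
  Float.min max_delay (base *. (2.0 ** float_of_int attempt))

(* Equal jitter: keep half of the delay and randomise the other half. *)
let with_jitter ?(rand = Random.float) delay =
  let half = delay /. 2.0 in
  half +. rand half

(* Calls [f] until it succeeds or [max_attempts] calls have been made.
   [sleep] is passed in so callers can use their runtime's own sleep. *)
let rec retry ?(attempt = 0) ~max_attempts ~sleep f =
  match f () with
  | Ok _ as ok -> ok
  | Error _ as err when attempt + 1 >= max_attempts -> err
  | Error _ ->
    sleep (with_jitter (backoff_delay attempt));
    retry ~attempt:(attempt + 1) ~max_attempts ~sleep f
```
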
### Archive Processing
- Native OCaml implementations preferred over shell tools
- Consider `camlzip`, `tar`, `decompress` libraries
- Chain processors for complex formats

### Progress Reporting
- Enhance OCaml `progress` library integration
- Support custom progress callbacks
- Configurable output streams

## Current Implementation Status

### Completed (as per CLAUDE.md)
- [x] Hash module design (SHA256, SHA1, MD5)
- [x] Hash module implementation with all algorithms
- [x] Registry module design with Pooch compatibility
- [x] Cache module with XDG support
- [x] Modular downloader interface
- [x] Concurrent download design (`fetch_all`)
- [x] External tool integration (wget/curl)
- [x] DOI resolution library design

### In Progress
- [x] Hash module implementation (completed with tests)
- [ ] Registry module implementation
- [ ] Cache module implementation
- [ ] External tool downloader implementations
- [ ] Test suite with tessera-manifests

### Not Started
- [ ] Authentication systems
- [ ] SFTP protocol support
- [ ] Advanced registry utilities
- [ ] Built-in archive processors
- [ ] CLI interface
- [ ] Pure OCaml HTTP downloader (cohttp-eio)

---

## Features We Won't Implement

### **Deliberately Excluded Features**
- **FTP Protocol Support**:
  - Rare in modern usage, HTTPS/SFTP preferred
  - Adds complexity without significant benefit
  - Can still use via external tools if needed
- **Windows-Specific Path Handling**:
  - Focus on Unix/macOS primarily
  - Basic Windows support via Eio, not optimized
- **Legacy Hash Algorithms** (initially):
  - MD4, SHA-0 - cryptographically broken
  - Focus on SHA256/SHA1/MD5 for compatibility
- **Complex Authentication Flows**:
  - OAuth2, JWT tokens, API keys with refresh
  - Keep to Basic Auth for simplicity
- **GUI Progress Bars**:
  - Terminal/CLI focused library
  - Text-based progress reporting only

## Unified Configuration Design

### **Cmdliner + Environment Variable Integration**

```ocaml
module Config = struct
  (* Per-downloader authentication *)
  type auth = {
    username : string option;
    password : string option;
  }

  (* Global application configuration *)
  type t = {
    base_url : string;
    cache_dir : string;
    downloader : [ `Wget | `Curl | `Cohttp | `Auto ];
    retry_count : int;
    timeout : float;
    (* Auth configurations per downloader type *)
    wget_auth : auth;
    curl_auth : auth;
    cohttp_auth : auth;
  }

  (* Cmdliner terms with environment fallbacks.
     Cmd.Env.info requires cmdliner >= 1.1; older releases used Arg.env_var. *)
  let base_url_term =
    let doc = "Base URL for downloads" in
    let env = Cmdliner.Cmd.Env.info "TORU_BASE_URL" in
    Cmdliner.Arg.(required & opt (some string) None &
                  info ["base-url"; "u"] ~env ~doc)

  let cache_dir_term =
    let doc = "Cache directory path" in
    let env = Cmdliner.Cmd.Env.info "TORU_CACHE_DIR" in
    let default = Cache.default_path ~app_name:"toru" () in
    Cmdliner.Arg.(value & opt string default &
                  info ["cache-dir"; "c"] ~env ~doc)

  (* Per-downloader auth terms: --<tool>-username / --<tool>-password,
     falling back to TORU_<TOOL>_USERNAME / TORU_<TOOL>_PASSWORD *)
  let auth_terms tool =
    let term field =
      let doc =
        Printf.sprintf "%s authentication %s"
          (String.capitalize_ascii tool) field in
      let env =
        Cmdliner.Cmd.Env.info
          (Printf.sprintf "TORU_%s_%s"
             (String.uppercase_ascii tool) (String.uppercase_ascii field)) in
      Cmdliner.Arg.(value & opt (some string) None &
                    info [tool ^ "-" ^ field] ~env ~doc)
    in
    (term "username", term "password")

  let wget_auth_terms = auth_terms "wget"
  let curl_auth_terms = auth_terms "curl"
  let cohttp_auth_terms = auth_terms "cohttp"

  let downloader_term =
    let doc = "Download tool: wget, curl, cohttp, auto" in
    let env = Cmdliner.Cmd.Env.info "TORU_DOWNLOADER" in
    Cmdliner.Arg.(value & opt (enum [
        ("wget", `Wget); ("curl", `Curl);
        ("cohttp", `Cohttp); ("auto", `Auto)
      ]) `Auto & info ["downloader"; "d"] ~env ~doc)

  let retry_count_term =
    let doc = "Number of download retries" in
    let env = Cmdliner.Cmd.Env.info "TORU_RETRY_COUNT" in
    Cmdliner.Arg.(value & opt int 3 &
                  info ["retries"; "r"] ~env ~doc)

  let timeout_term =
    let doc = "Download timeout in seconds" in
    let env = Cmdliner.Cmd.Env.info "TORU_TIMEOUT" in
    Cmdliner.Arg.(value & opt float 300.0 &
                  info ["timeout"; "t"] ~env ~doc)

  let config_term =
    let combine base_url cache_dir downloader retries timeout
        wget_user wget_pass curl_user curl_pass cohttp_user cohttp_pass =
      let wget_auth = { username = wget_user; password = wget_pass } in
      let curl_auth = { username = curl_user; password = curl_pass } in
      let cohttp_auth = { username = cohttp_user; password = cohttp_pass } in
      { base_url; cache_dir; downloader; retry_count = retries; timeout;
        wget_auth; curl_auth; cohttp_auth }
    in
    let (wget_user_term, wget_pass_term) = wget_auth_terms in
    let (curl_user_term, curl_pass_term) = curl_auth_terms in
    let (cohttp_user_term, cohttp_pass_term) = cohttp_auth_terms in
    Cmdliner.Term.(const combine $ base_url_term $ cache_dir_term $
                   downloader_term $ retry_count_term $ timeout_term $
                   wget_user_term $ wget_pass_term $
                   curl_user_term $ curl_pass_term $
                   cohttp_user_term $ cohttp_pass_term)

  (* Helper to get auth for a specific downloader. Resolve `Auto to the
     detected tool before calling this; `Auto itself carries no auth. *)
  let get_auth config = function
    | `Wget -> config.wget_auth
    | `Curl -> config.curl_auth
    | `Cohttp -> config.cohttp_auth
    | `Auto -> { username = None; password = None }
end
```

### **Updated Downloader Interface**

```ocaml
module type DOWNLOADER = sig
  type t

  val create : sw:Eio.Switch.t -> env:Eio_unix.Stdenv.base ->
    ?auth:Config.auth -> unit -> t

  val download : t ->
    url:string ->
    dest:Eio.Fs.dir_ty Eio.Path.t ->
    ?hash:Hash.t ->
    ?progress:Progress_reporter.t ->
    ?resume:bool ->
    unit -> (unit, string) result

  val supports_resume : t -> bool
  val name : t -> string
end

module Wget_downloader : DOWNLOADER = struct
  type t = {
    sw : Eio.Switch.t;
    env : Eio_unix.Stdenv.base;
    auth : Config.auth option;
    timeout : float;
  }

  let create ~sw ~env ?auth () =
    { sw; env; auth; timeout = 300.0 }

  let supports_resume _ = true
  let name _ = "wget"

  let download t ~url ~dest ?hash ?progress ?(resume = true) () =
    (* Hash verification and progress reporting are still TODO here *)
    ignore (hash, progress);
    let auth_args = match t.auth with
      | Some { Config.username = Some u; password = Some p } ->
        ["--user=" ^ u; "--password=" ^ p]
      | Some { Config.username = Some u; password = None } ->
        ["--user=" ^ u]
      | _ -> [] in
    let resume_args = if resume then ["--continue"] else [] in
    let args =
      ["--quiet"; "--show-progress"; Printf.sprintf "--timeout=%.0f" t.timeout]
      @ resume_args @ auth_args
      @ ["--output-document=" ^ Eio.Path.native_exn dest; url] in
    (* Minimal completion, assuming Eio >= 1.0: run wget and turn a
       non-zero exit (raised as Eio.Io) into Error *)
    try
      Eio.Process.run (Eio.Stdenv.process_mgr t.env) ("wget" :: args);
      Ok ()
    with Eio.Io _ as ex -> Error (Printexc.to_string ex)
end
```

### **Environment Variable Standards**

| Variable | Purpose | Example |
|----------|---------|----------|
| `TORU_BASE_URL` | Default download base URL | `https://data.example.com/` |
| `TORU_CACHE_DIR` | Override cache location | `/custom/cache/path` |
| `TORU_WGET_USERNAME` | Wget HTTP Basic Auth username | `myuser` |
| `TORU_WGET_PASSWORD` | Wget HTTP Basic Auth password | `secret123` |
| `TORU_CURL_USERNAME` | Curl HTTP Basic Auth username | `myuser` |
| `TORU_CURL_PASSWORD` | Curl HTTP Basic Auth password | `secret123` |
| `TORU_COHTTP_USERNAME` | Cohttp HTTP Basic Auth username | `myuser` |
| `TORU_COHTTP_PASSWORD` | Cohttp HTTP Basic Auth password | `secret123` |
| `TORU_DOWNLOADER` | Preferred download tool | `wget`, `curl`, `cohttp`, `auto` |
| `TORU_RETRY_COUNT` | Download retry attempts | `5` |
| `TORU_TIMEOUT` | Download timeout (seconds) | `600.0` |
| `TORU_REGISTRY_URL` | Registry file URL override | `https://example.com/registry.txt` |

### **Benefits of Per-Downloader Auth**

1. **Flexibility**: Different auth for different downloaders
2. **Tool-Specific**: Wget vs Curl may need different credentials
3. **Migration Path**: Smooth transition from external tools to pure OCaml
4. **Security**: Auth only passed to specific downloader implementations
5. **Testing**: Easy to test each downloader with different auth configurations

### **Usage Examples**

```bash
# Different auth per downloader
export TORU_WGET_USERNAME=wget_user
export TORU_WGET_PASSWORD=wget_pass
export TORU_CURL_USERNAME=curl_user
export TORU_CURL_PASSWORD=curl_pass
toru fetch --downloader wget data.csv  # uses wget auth
toru fetch --downloader curl data.csv  # uses curl auth

# CLI override for specific downloader
toru fetch --downloader cohttp --cohttp-username myuser data.csv

# Auto-detection with fallback auth
export TORU_WGET_USERNAME=backup_user
toru fetch --downloader auto data.csv  # will use wget if available, with auth
```
