My agentic slop goes here. Not intended for anyone else!
I want to design an OCaml library that builds in support for modifying the
program linked to it using Claude Code, and restarting itself with the fixes
automatically. The idea is for long-running services to regularly consult with
Claude (either on a fixed timetable, or urgently if something really unexpected
happens) and improve their own functionality. Claude should be used to analyse
patterns in the logs and determine whether to write code to handle a particular
case. Claude should not be used directly in the application datapath itself;
its job is to write code.

To make this work, the program needs to emit sufficient tracing data to be
useful to Claude when it does an inspection, but not so much that it overwhelms
the context window. Therefore, the first thing the library needs is some
mechanism to intercept the logging output of the program suitably. The OCaml
"logs" library is a good thing to standardise on here. It's also fine to use
the OCaml direct-style Eio library for all interactions.

Assume the code is running in a Linux environment with root level access. There
will also be a Zulip server available with an API key that can be used to post
messages and otherwise interact with it.

This is an ambitious project, so before embarking on it, I need to think really
carefully about the design and tradeoffs, including seeking clarification where
necessary about what sorts of MCP servers or other support infrastructure will
be useful to making the library successful. I'm ok taking risks and trying
unusual approaches. The library will be called "Dancer" after the Hunter S.
Thompson quote "We're raising a generation of dancers, afraid to take one step
out of line."

## Architecture Design (v1)

### Core Components

1. **Log Interceptor & Buffer**
   - Hook into OCaml Logs library at the reporter level
   - Maintain a persistent buffer on disk, perhaps in SQLite, for analysis
   - Group consecutive identical errors with count
   - Tag logs with timestamp, module, and error type

2. **Pattern Detector**
   - Track log messages and their frequency
   - Use string matching to identify recurring patterns
   - Maintain a simple SQLite database of seen patterns
   - Trigger Claude consultation when:
     - New error pattern appears frequently (>10 times in 5 min)
     - Error rate spikes above baseline
     - Scheduled review (e.g., every 6 hours)

3. **Claude Consultation Manager**
   - Prepare context: recent logs + relevant source files
   - Ask Claude to:
     - Analyze the error pattern
     - Generate OCaml code to handle the case
     - Suggest where to integrate the fix
     - Test the fixes and trial a deployment
   - Store Claude's response and proposed changes

4. **Version Control Integration**
   - Each Claude fix creates a new git branch: `dancer/fix-<timestamp>-<error-hash>`
   - Use git worktrees for isolated changes:
     ```bash
     git worktree add ../dancer-fix-<id> -b dancer/fix-<id>
     ```
   - Apply Claude's changes in the worktree and include a changelog in the commits
   - Compile and test in isolation
   - If successful, merge to main and restart application
   - Have a script that can search for all the fix branches and update a central changelog ordered by time, suitable for a human to review regularly

5. **Restart Orchestration**
   - Library has a supervisor for process management of the application itself
   - Graceful shutdown: finish current requests with a timeout
   - State persistence before restart (if needed)
   - Automatic rollback to the previous successful binary if the restart fails
   - Health check after restart

6. **Zulip Integration**
   - Post proposed changes for human review
   - Emergency stop command
   - Status updates on consultations
   - Performance metrics before/after changes
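
The frequency trigger in the Pattern Detector (">10 times in 5 min") could be sketched as a per-pattern sliding window. This is only an illustration using the standard library: the names `window`, `create`, `record` and `observe` are hypothetical, and timestamps are passed in explicitly so the logic does not depend on any particular clock.

```ocaml
(* Sliding-window trigger: fires when a pattern is seen more than
   [threshold] times within [span] seconds. All names are hypothetical. *)
type window = {
  threshold : int;        (* e.g. 10 occurrences *)
  span : float;           (* e.g. 300.0 seconds *)
  seen : float Queue.t;   (* timestamps of recent occurrences, oldest first *)
}

let create ~threshold ~span = { threshold; span; seen = Queue.create () }

(* Record one occurrence at time [now]; return true if the trigger fires. *)
let record w ~now =
  Queue.push now w.seen;
  (* Drop timestamps that have fallen out of the window. *)
  while (not (Queue.is_empty w.seen)) && now -. Queue.peek w.seen > w.span do
    ignore (Queue.pop w.seen)
  done;
  Queue.length w.seen > w.threshold

(* One window per normalised pattern string. *)
let windows : (string, window) Hashtbl.t = Hashtbl.create 16

let observe ~pattern ~now =
  let w =
    match Hashtbl.find_opt windows pattern with
    | Some w -> w
    | None ->
        let w = create ~threshold:10 ~span:300.0 in
        Hashtbl.add windows pattern w;
        w
  in
  record w ~now
```

The in-memory table is an assumption made for brevity; the design above keeps seen patterns in SQLite, so a real version would persist the counts there.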

### Git Workflow Design

1. **Branch Strategy**
   ```
   main (production code)
   ├── dancer/fix-2024-01-15-1200-auth-error
   ├── dancer/fix-2024-01-15-1800-timeout-handler
   └── dancer/rollback-2024-01-15-1900 (if needed)
   ```

2. **Worktree Management**
   - Base directory: `/var/dancer/worktrees/`
   - Each fix gets its own worktree
   - Clean up old worktrees after successful merge
   - Keep failed attempts for analysis

3. **Change Process**
   ```ocaml
   type fix_status =
     | Proposed
     | Testing
     | Approved
     | Deployed
     | Rolled_back

   type fix_record = {
     id: string;
     branch: string;
     worktree: string;
     error_pattern: string;
     claude_solution: string;
     test_results: string option;
     status: fix_status;
     created_at: float;
   }
   ```
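
As a rough sketch of how the `branch` and `worktree` fields might be derived, the helpers below build a `dancer/fix-<timestamp>-<error-hash>` name and a directory under the worktree base. The function names are hypothetical, the timestamp is rendered as epoch seconds rather than the dated form shown in the branch tree, and the hash is simply the first 8 hex characters of an MD5 digest; all of these are assumptions.

```ocaml
(* Hypothetical helpers; only the standard library is used. *)

(* Short, stable identifier for an error pattern (first 8 hex chars of MD5). *)
let short_hash pattern =
  String.sub (Digest.to_hex (Digest.string pattern)) 0 8

(* Branch name in the dancer/fix-<timestamp>-<error-hash> scheme,
   with the timestamp given as epoch seconds. *)
let branch_name ~created_at ~error_pattern =
  Printf.sprintf "dancer/fix-%.0f-%s" created_at (short_hash error_pattern)

(* Map a branch to a single directory under the worktree base by
   replacing '/' with '-'. *)
let worktree_path ~base ~branch =
  Filename.concat base
    (String.map (fun c -> if c = '/' then '-' else c) branch)
```

Hashing the pattern means repeated consultations about the same error land on recognisably related branch names, which should make the central changelog easier to scan.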

### Simplified Log Management

1. **Log Format**
   ```ocaml
   type log_entry = {
     timestamp: float;
     level: Logs.level;
     source: string;  (* module name *)
     message: string;
     error_type: string option;
     stack_trace: string option;
   }
   ```

2. **Context Preparation for Claude**
   - Last 500 lines of logs
   - Error frequency summary
   - Relevant source file (where error originated)
   - Previous fix attempts for similar errors
   - System metrics (CPU, memory, request rate)
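
One way to keep the log excerpt compact is to collapse runs of identical consecutive messages into a single line with a count, as the Log Interceptor is meant to do. The sketch below works on plain message strings; `group_consecutive` and `render` are hypothetical names, and the `(xN)` suffix is just one possible rendering.

```ocaml
(* Collapse runs of identical consecutive messages into (message, count). *)
let group_consecutive (messages : string list) : (string * int) list =
  let step acc msg =
    match acc with
    | (prev, n) :: rest when String.equal prev msg -> (prev, n + 1) :: rest
    | _ -> (msg, 1) :: acc
  in
  List.rev (List.fold_left step [] messages)

(* Render grouped messages one per line, marking repeats with a count. *)
let render groups =
  groups
  |> List.map (fun (msg, n) ->
         if n = 1 then msg else Printf.sprintf "%s (x%d)" msg n)
  |> String.concat "\n"
```

For example, `render (group_consecutive ["a"; "a"; "b"])` yields two lines, `a (x2)` and `b`.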

### Restart Safety Mechanisms

1. **Pre-Restart Checks**
   - Compile the modified code
   - Run unit tests if available
   - Check syntax with `ocamlc -i`
   - Verify no obvious issues (missing semicolons, etc.)

2. **Restart Process**
   ```bash
   # Save current version under a tag we can roll back to
   TAG="dancer-before-$(date +%s)"
   git tag "$TAG"

   # Merge fix
   git merge --no-ff dancer/fix-<id>

   # Rebuild
   dune build

   # Graceful restart
   systemctl reload dancer-service || systemctl restart dancer-service

   # Health check; on failure, return to the tagged version, rebuild and restart
   ./health_check.sh || {
     git reset --hard "$TAG"
     dune build
     systemctl restart dancer-service
   }
   ```

3. **Rollback Triggers**
   - Service fails to start
   - Health check fails after restart
   - Error rate increases by >50%
   - Memory usage spikes
   - Manual intervention via Zulip
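
The rollback triggers could be folded into one decision function, sketched below. The `health` record, the `decision` type and `decide` are hypothetical names. The 50% error-rate threshold comes from the list above, while the memory spike threshold (`memory_factor`, defaulting to 2x baseline) is an assumption, since the notes do not define what counts as a spike.

```ocaml
(* Post-restart observations; field names are hypothetical. *)
type health = {
  started : bool;               (* did the service come up? *)
  health_check_ok : bool;       (* result of the health check *)
  error_rate : float;           (* errors per minute after restart *)
  baseline_error_rate : float;  (* errors per minute before the change *)
  memory_mb : float;
  baseline_memory_mb : float;
  manual_stop : bool;           (* emergency stop requested via Zulip *)
}

type decision = Keep | Rollback of string

(* [memory_factor] is an assumed threshold: roll back when memory use
   exceeds this multiple of the baseline. *)
let decide ?(memory_factor = 2.0) h =
  if not h.started then Rollback "service failed to start"
  else if not h.health_check_ok then Rollback "health check failed"
  else if h.manual_stop then Rollback "manual intervention"
  else if h.error_rate > 1.5 *. h.baseline_error_rate then
    Rollback "error rate up by more than 50%"
  else if h.memory_mb > memory_factor *. h.baseline_memory_mb then
    Rollback "memory spike"
  else Keep
```

Note that with a baseline error rate of zero, any error after the restart triggers a rollback; whether that is too strict is a design choice left open here.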

### MCP Server Requirements (Simplified)

1. **Git Server**
   - Local git repository with remote backup
   - Web interface for viewing changes
   - Webhook support for CI integration

2. **Monitoring Server**
   - Simple metrics collection (Prometheus/Grafana)
   - Log aggregation (just file-based initially)
   - Alert routing to Zulip

3. **Claude API Gateway**
   - Rate limiting
   - Cost tracking
   - Request/response logging
   - Fallback to manual mode if quota exceeded

### Implementation Phases (Simplified)

**Phase 1: Core Infrastructure (Week 1-2)**
- Log interception and buffering
- Basic error pattern detection
- Git worktree management
- Manual Claude consultation

**Phase 2: Automation (Week 3-4)**
- Automatic Claude triggers
- Code generation and application
- Restart orchestration
- Basic safety checks

**Phase 3: Monitoring & Safety (Week 5-6)**
- Zulip integration
- Rollback mechanisms
- Performance tracking
- Cost management

### Example Usage Flow

1. **Error Detection**
   ```ocaml
   (* Application code *)
   Logs.err (fun m -> m "Database connection failed: %s" error_msg);
   (* This error happens 20 times in 2 minutes *)
   ```

2. **Claude Consultation**
   ```
   Context: Database connection errors occurring frequently
   Pattern: "Database connection failed: Connection refused"

   Claude generates:
   - Exponential backoff retry logic
   - Connection pool management
   - Fallback to cached data
   ```

3. **Version Control**
   ```bash
   git worktree add ../dancer-fix-db-conn -b dancer/fix-db-conn
   cd ../dancer-fix-db-conn
   # Apply Claude's changes
   dune build
   # If successful, merge and restart
   ```

4. **Deployment**
   ```bash
   git checkout main
   git merge dancer/fix-db-conn
   # Rebuild so the restarted service runs the merged code
   dune build
   systemctl restart dancer-service
   # Monitor for 5 minutes
   # If stable, cleanup worktree
   ```
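
To make step 2 more concrete, here is a sketch of the kind of exponential backoff retry logic Claude might propose for the connection error. It is purely illustrative: `retry_with_backoff` and its default parameters are hypothetical, and the `sleep` function is passed in so the sketch stays independent of Unix or Eio.

```ocaml
(* Retry [f] up to [max_attempts] times, doubling the delay after each
   failure and capping it at [max_delay]. [sleep] is injected by the caller
   (for instance a wrapper around the application's own sleep primitive). *)
let retry_with_backoff ?(max_attempts = 5) ?(base_delay = 0.1)
    ?(max_delay = 5.0) ~sleep f =
  let rec go attempt delay =
    match f () with
    | Ok v -> Ok v
    | Error e when attempt >= max_attempts -> Error e
    | Error _ ->
        sleep delay;
        go (attempt + 1) (Float.min max_delay (delay *. 2.0))
  in
  go 1 base_delay
```

A caller would wrap the connection attempt as a function returning `Ok conn` or `Error reason`, and the final `Error` is returned unchanged once the attempts are exhausted.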

### Data Structures

```ocaml
module Dancer = struct
  type config = {
    claude_api_key: string;
    zulip_api_key: string;
    zulip_stream: string;
    max_context_size: int;  (* chars to send to Claude *)
    consultation_cooldown: float;  (* seconds between consultations *)
    error_threshold: int;  (* errors before triggering *)
    restart_timeout: float;  (* max seconds for restart *)
    worktree_base: string;  (* base directory for git worktrees *)
  }

  type consultation_request = {
    pattern: string;
    occurrences: int;
    timespan: float;
    recent_logs: string;
    source_context: string option;
  }

  type consultation_response = {
    analysis: string;
    proposed_fix: string;
    target_file: string;
    confidence: float;
  }
end
```
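
The `max_context_size` limit suggests a step that trims `recent_logs` to a character budget before a consultation. A minimal sketch, assuming the logs are available as a list of lines in chronological order and that the most recent lines matter most; `fit_context` is a hypothetical name.

```ocaml
(* Keep the most recent log lines that fit within [max_chars], counting one
   extra character per line for the newline. Input and output are both in
   chronological order. *)
let fit_context ~max_chars (lines : string list) : string list =
  let rec take acc used = function
    | [] -> acc
    | line :: older ->
        let cost = String.length line + 1 in
        if used + cost > max_chars then acc
        else take (line :: acc) (used + cost) older
  in
  (* Walk from newest to oldest; consing onto [acc] restores the order. *)
  take [] 0 (List.rev lines)
```

Cutting on line boundaries avoids sending Claude a half-truncated log entry, at the cost of sometimes leaving part of the budget unused.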

### Key Simplifications from Original Design

1. **No Dynamic Linking** - Just restart the process
2. **Simple Pattern Matching** - String comparison, no bloom filters
3. **Basic Git Workflow** - Branches and worktrees, no complex versioning
4. **Minimal Infrastructure** - SQLite instead of complex databases
5. **Simple Rollback** - Git reset instead of sophisticated mechanisms
6. **Direct Process Restart** - Using systemd/supervisor instead of hot-reload
7. **File-Based Logs** - No complex log aggregation initially
8. **Manual Approval Option** - Human can review via Zulip before deploy

## Library Decomposition Plan

### Core Libraries

1. **dancer-logs** - Log interception and buffering
   - Hook into OCaml Logs reporter
   - SQLite-backed circular buffer
   - Pattern normalization
   - Standalone testable

2. **dancer-patterns** - Pattern detection and tracking
   - Error pattern recognition
   - Frequency/acceleration tracking
   - Pattern database management
   - Trigger decision logic

3. **dancer-claude** - Claude CLI integration
   - Prompt construction
   - Response parsing
   - Context preparation
   - Token cost tracking

4. **dancer-git** - Git worktree management
   - Worktree creation/cleanup
   - Branch management
   - Safe merging operations
   - Rollback capabilities

5. **dancer-test** - Alcotest generation
   - Test template generation
   - Test execution in worktrees
   - Result parsing
   - Coverage tracking

6. **dancer-process** - Process management
   - Tmux orchestration
   - Service restart logic
   - Health checking
   - Graceful shutdown

7. **dancer-observe** - Observability
   - Metrics collection
   - SQLite time-series storage
   - Anomaly detection
   - Audit trail management

8. **dancer-spec** - Service specification
   - YAML spec parsing
   - Constraint validation
   - Fix validation against spec
   - Schema enforcement

9. **dancer-deploy** - Deployment pipeline
   - Staging environment setup
   - Promotion criteria evaluation
   - Production deployment
   - Rollback orchestration

10. **dancer-ui** - Human oversight interfaces
    - Web dashboard (Dream)
    - Terminal UI (Nottui)
    - WebSocket live updates
    - Audit log viewer

### Implementation Order

**Phase 1: Foundation** (Week 1)
1. `dancer-logs` - Need log data first
2. `dancer-patterns` - Pattern detection on logs
3. `dancer-observe` - Basic metrics/storage

**Phase 2: Claude Integration** (Week 2)
4. `dancer-claude` - Claude consultation
5. `dancer-spec` - Service constraints
6. `dancer-test` - Test generation

**Phase 3: Deployment** (Week 3)
7. `dancer-git` - Worktree management
8. `dancer-process` - Process control
9. `dancer-deploy` - Staging/production

**Phase 4: Oversight** (Week 4)
10. `dancer-ui` - Dashboard and monitoring

## dancer-logs Implementation Details

### Design Decisions

1. **Comprehensive Schema**: The SQLite schema captures everything that might be useful for Claude's analysis:
   - Core fields (timestamp, level, source, message)
   - Error classification (type, code, hash, stack trace)
   - Execution context (PIDs, threads, fibers, domains)
   - Request/session tracking (IDs for correlation)
   - Source location (file, line, function, module)
   - Performance metrics (duration, memory, CPU)
   - Network context (IPs, ports, methods, status codes)
   - User context (user ID, tenant ID)
   - System context (OS, versions, environment, containers)
   - Flexible metadata (JSON fields for tags, labels, custom data)

2. **Full-Text Search**: Using SQLite's FTS5 with porter stemming and unicode support for comprehensive search across all text fields.

3. **Pattern Detection**:
   - Normalizes messages by replacing numbers with "N", hex with "0xHEX", UUIDs with "UUID"
   - Tracks patterns with occurrence counts, severity scores, and consultation history
   - Time-series bucketing for trend analysis

4. **Logs Reporter Integration**:
   - Implements OCaml Logs reporter interface
   - Can chain with existing reporters
   - Thread-safe with Eio mutexes
   - Captures Eio fiber IDs and domain IDs

5. **CLI Tool**: Colorful terminal interface using Fmt with:
   - `list` - Browse logs with filters
   - `search` - Full-text search
   - `patterns` - View detected patterns
   - `errors` - Recent error summary
   - `stats` - Database statistics
   - `export` - Export for analysis (JSON, CSV, Claude format)
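
The message normalization described under Pattern Detection could look roughly like the sketch below. It is a standard-library-only illustration, not the actual dancer-logs code: `normalize` and its helpers are hypothetical names, a UUID is assumed to be the usual 8-4-4-4-12 hex form, and any run of digits counts as a number (so digits embedded in identifiers are replaced too).

```ocaml
let is_digit c = c >= '0' && c <= '9'
let is_hex c = is_digit c || (c >= 'a' && c <= 'f') || (c >= 'A' && c <= 'F')

(* True if [s] has a UUID (8-4-4-4-12 hex digits) starting at index [i]. *)
let uuid_at s i =
  i + 36 <= String.length s
  && (let ok = ref true in
      for k = 0 to 35 do
        let c = s.[i + k] in
        let dash = k = 8 || k = 13 || k = 18 || k = 23 in
        if (dash && c <> '-') || ((not dash) && not (is_hex c)) then ok := false
      done;
      !ok)

(* Index of the first character at or after [i] that fails [p]. *)
let rec skip_while p s i =
  if i < String.length s && p s.[i] then skip_while p s (i + 1) else i

(* Replace UUIDs with "UUID", 0x-prefixed hex with "0xHEX" and digit runs
   with "N". UUIDs are checked first because they may start with a digit. *)
let normalize s =
  let n = String.length s in
  let buf = Buffer.create n in
  let rec go i =
    if i >= n then ()
    else if uuid_at s i then (
      Buffer.add_string buf "UUID";
      go (i + 36))
    else if
      i + 2 < n && s.[i] = '0'
      && (s.[i + 1] = 'x' || s.[i + 1] = 'X')
      && is_hex s.[i + 2]
    then (
      Buffer.add_string buf "0xHEX";
      go (skip_while is_hex s (i + 2)))
    else if is_digit s.[i] then (
      Buffer.add_string buf "N";
      go (skip_while is_digit s i))
    else (
      Buffer.add_char buf s.[i];
      go (i + 1))
  in
  go 0;
  Buffer.contents buf
```

Two messages that differ only in ports, addresses or request IDs then normalize to the same string, which is what lets occurrence counts accumulate per pattern.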

### Key Learnings

1. **No Truncation**: Keep all data intact for Claude to analyze. Storage is cheap, context is valuable.

2. **Structured Everything**: The more structure we capture, the better Claude can understand patterns and correlations.

3. **Synchronous is Fine**: For now, synchronous SQLite operations are acceptable. Performance optimization can come later.

4. **Reporter Chaining**: Following Logs library conventions by supporting reporter chaining allows integration with existing logging infrastructure.

5. **Export for Claude**: Special export format that prepares context within token limits, focusing on recent errors and patterns.

### Usage Example

```ocaml
(* Initialize the database *)
let db = Dancer_logs.init ~path:"app_logs.db" () in

(* Create and set the reporter. Use [Dancer_logs.reporter db] on its own,
   or chain it with an existing reporter as done here. *)
let reporter = Dancer_logs.chain_reporter db (Logs_fmt.reporter ()) in
Logs.set_reporter reporter;

(* Use with context *)
let ctx = Dancer_logs.Context.(
  empty_context
  |> with_session "session-123"
  |> with_request "req-456"
  |> with_user "user-789"
  |> add_tag "payment"
  |> add_label "environment" "production"
) in

(* Log with full context; [hash_of_error] is computed by the application *)
Dancer_logs.log db
  ~level:Logs.Error
  ~source:"Payment.Gateway"
  ~message:"Payment failed: timeout"
  ~context:ctx
  ~error:{
    error_type = Some "TimeoutException";
    error_code = Some "GATEWAY_TIMEOUT";
    error_hash = Some hash_of_error;
    stack_trace = Some (Printexc.get_backtrace ());
  }
  ~performance:{
    duration_ms = Some 30000.0;
    memory_before = Some 1000000;
    memory_after = Some 1200000;
    cpu_time_ms = Some 50.0;
  }
  ();
```

### CLI Usage

```bash
# View recent errors
dancer-logs errors

# Search for specific patterns
dancer-logs search "connection refused"

# View error patterns
dancer-logs patterns --min-count 10

# Export context for Claude
dancer-logs export -f claude -o context.txt --since 24

# Follow logs in real-time (to be implemented)
dancer-logs list -f
```

### Next Steps

1. Implement pattern detection algorithm that runs periodically
2. Add metrics collection from /proc and system calls
3. Implement log following mode for real-time monitoring
4. Add session correlation views
5. Create integration tests with sample applications