I want to design an OCaml library that builds in support for modifying the
program linked to it using Claude Code, and restarting itself with the fixes
automatically. The idea is for long-running services to regularly consult with
Claude (either on a fixed timetable, or urgently if something really unexpected
happens) and improve their own functionality. Claude should be used to analyse
patterns in the logs and determine whether to write code to handle a particular
case. Claude should not be used directly in the application datapath itself;
its role is to write code.

To make this work, the program needs to emit sufficient tracing data to be
useful to Claude when it does an inspection, but not so much that it overwhelms
the context window. Therefore, the first thing the library needs is some
mechanism to intercept the logging output of the program suitably. The OCaml
"logs" library is a good thing to standardise on here. It's also fine to use
the OCaml direct-style Eio library for all interactions.

Assume the code is running in a Linux environment with root level access. There
will also be a Zulip server available, with an API key that can be used to post
messages and interact with it.

This is an ambitious project, so before embarking on it, I need to think really
carefully about the design and tradeoffs, including seeking clarification where
necessary about what sorts of MCP servers or other support infrastructure will
be useful to making the library successful. I'm ok taking risks and trying
unusual approaches. The library will be called "Dancer" after the Hunter S
Thompson quote "We're raising a generation of dancers, afraid to take one step
out of
## Architecture Design (v1)

1. **Log Interceptor & Buffer**
   - Hook into the OCaml Logs library at the reporter level
   - Maintain a persistent buffer on disk, perhaps in SQLite, for analysis
   - Group consecutive identical errors with a count
   - Tag logs with timestamp, module, and error type

2. **Pattern Detector**
   - Track log messages and their frequency
   - Use string matching to identify recurring patterns
   - Maintain a simple SQLite database of seen patterns
   - Trigger Claude consultation when:
     - New error pattern appears frequently (>10 times in 5 min)
     - Error rate spikes above baseline
     - Scheduled review (e.g., every 6 hours)

3. **Claude Consultation Manager**
   - Prepare context: recent logs + relevant source files
   - Analyze the error pattern
   - Generate OCaml code to handle the case
   - Suggest where to integrate the fix
   - Test the fixes and trial a deployment
   - Store Claude's response and proposed changes

4. **Version Control Integration**
   - Each Claude fix creates a new git branch: `dancer/fix-<timestamp>-<error-hash>`
   - Use git worktrees for isolated changes:

         git worktree add ../dancer-fix-<id> -b dancer/fix-<id>

   - Apply Claude's changes in the worktree and include a changelog in the commits
   - Compile and test in isolation
   - If successful, merge to main and restart the application
   - Have a script that searches all the fix branches and updates a central changelog ordered by time, suitable for a human to review regularly

5. **Restart Orchestration**
   - The library has a supervisor for process management of the application itself
   - Graceful shutdown: finish current requests with a timeout
   - State persistence before restart (if needed)
   - Automatic rollback to the previous successful binary if the restart fails
   - Health check after restart

6. **Zulip Integration**
   - Post proposed changes for human review
   - Emergency stop command
   - Status updates on consultations
   - Performance metrics before/after changes
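The frequency trigger in the Pattern Detector (a pattern seen more than 10 times in 5 minutes) can be sketched as a per-pattern sliding window. This is only an illustration under stated assumptions: the names `detector`, `create` and `record` are hypothetical, timestamps are plain floats in seconds supplied by the caller, and the state is kept in memory rather than in SQLite.

```ocaml
(* Sliding-window trigger. All names are hypothetical, not part of any
   existing Dancer API. State is in-memory only. *)

type detector = {
  window : float;    (* window length in seconds, e.g. 300. for 5 min *)
  threshold : int;   (* fire when the count exceeds this, e.g. 10 *)
  events : (string, float Queue.t) Hashtbl.t;  (* pattern -> timestamps *)
}

let create ~window ~threshold =
  { window; threshold; events = Hashtbl.create 16 }

(* Record one occurrence of [pattern] at time [now]. Returns [true] when
   the pattern has occurred more than [threshold] times in the window. *)
let record t ~now pattern =
  let q =
    match Hashtbl.find_opt t.events pattern with
    | Some q -> q
    | None ->
      let q = Queue.create () in
      Hashtbl.add t.events pattern q;
      q
  in
  Queue.push now q;
  (* Drop timestamps that have fallen out of the window. *)
  while (not (Queue.is_empty q)) && now -. Queue.peek q > t.window do
    ignore (Queue.pop q)
  done;
  Queue.length q > t.threshold
```

A caller would invoke `record` from the log reporter for each normalized error message and start a consultation when it returns `true`. The scheduled review and baseline-spike triggers would need separate logic.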
### Git Workflow Design

1. **Branch Strategy**

       main (production code)
       ├── dancer/fix-2024-01-15-1200-auth-error
       ├── dancer/fix-2024-01-15-1800-timeout-handler
       └── dancer/rollback-2024-01-15-1900 (if needed)

2. **Worktree Management**
   - Base directory: `/var/dancer/worktrees/`
   - Each fix gets its own worktree
   - Clean up old worktrees after successful merge
   - Keep failed attempts for analysis

3. **Change Process**

       type fix_record = {
         error_pattern: string;
         claude_solution: string;
         test_results: string option;
         status: fix_status;
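The branch and worktree naming described above could be derived from a timestamp and the error pattern roughly as follows. This is a sketch, not the design's actual implementation: the helper names are hypothetical, the timestamp is passed in as a preformatted string, and using the first 8 hex characters of an MD5 digest (stdlib `Digest`) as the error hash is an assumption.

```ocaml
(* Hypothetical helpers for naming fix branches and worktrees. *)

(* Short, stable hash of an error pattern. MD5 via the stdlib Digest
   module is an assumption; any stable hash would do. *)
let error_hash pattern =
  String.sub (Digest.to_hex (Digest.string pattern)) 0 8

(* Branch name in the form dancer/fix-<timestamp>-<error-hash>. *)
let branch_name ~timestamp pattern =
  Printf.sprintf "dancer/fix-%s-%s" timestamp (error_hash pattern)

(* One worktree directory per fix, under the base directory. *)
let worktree_path ?(base = "/var/dancer/worktrees") ~timestamp pattern =
  Filename.concat base
    (Printf.sprintf "fix-%s-%s" timestamp (error_hash pattern))

(* Argument vector for: git worktree add <path> -b <branch> *)
let worktree_add_argv ~timestamp pattern =
  [ "git"; "worktree"; "add";
    worktree_path ~timestamp pattern;
    "-b"; branch_name ~timestamp pattern ]
```

Building an argument vector rather than a shell string avoids quoting problems if a pattern-derived name ever contains unusual characters. Actually spawning the process is left to the caller.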
### Simplified Log Management

       source: string; (* module name *)
       error_type: string option;
       stack_trace: string option;

2. **Context Preparation for Claude**
   - Last 500 lines of logs
   - Error frequency summary
   - Relevant source file (where error originated)
   - Previous fix attempts for similar errors
   - System metrics (CPU, memory, request rate)
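As a rough illustration of the "last 500 lines" rule combined with a character budget for the context window, the log portion of the context could be trimmed like this. The function names and the `max_chars` parameter are hypothetical; the sketch assumes log lines are already available as a list of strings, oldest first.

```ocaml
(* Keep the last [n] elements of a list. *)
let last_n n xs =
  let len = List.length xs in
  List.filteri (fun i _ -> i >= len - n) xs

(* Join the last [max_lines] log lines, then trim from the front so the
   result fits within [max_chars]. Hypothetical helper. *)
let prepare_logs ?(max_lines = 500) ~max_chars lines =
  let text = String.concat "\n" (last_n max_lines lines) in
  let len = String.length text in
  if len <= max_chars then text
  else
    (* Keep the most recent characters: newer lines matter more. *)
    String.sub text (len - max_chars) max_chars
```

Note that trimming by characters can cut the oldest retained line in half. A fuller version might drop that partial line, or budget by tokens instead of characters.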
### Restart Safety Mechanisms

1. **Pre-Restart Checks**
   - Compile the modified code
   - Run unit tests if available
   - Check syntax with `ocamlc -i`
   - Verify no obvious issues (missing semicolons, etc.)

2. **Restart Process**

       # Save current version
       git tag dancer-before-$(date +%s)

       git merge --no-ff dancer/fix-<id>

       systemctl reload dancer-service || systemctl restart dancer-service

       ./health_check.sh || git reset --hard dancer-before-<timestamp>

3. **Rollback Triggers**
   - Service fails to start
   - Health check fails after restart
   - Error rate increases by >50%
   - Memory usage spikes
   - Manual intervention via Zulip
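The rollback triggers above amount to a small decision function over post-restart observations. The following is a minimal sketch: the record and function names are hypothetical, and since the text does not quantify a memory "spike", the 2x default for `memory_factor` is purely an assumption.

```ocaml
(* Hypothetical post-restart observations. *)
type health = {
  started : bool;          (* did the service come up at all? *)
  health_check_ok : bool;  (* result of the health check *)
  error_rate : float;      (* errors per minute after restart *)
  memory_mb : float;       (* resident memory after restart *)
}

(* Measurements taken before the restart. *)
type baseline = { base_error_rate : float; base_memory_mb : float }

type decision = Keep | Rollback of string

(* Apply the rollback triggers in order of severity. [memory_factor] is an
   assumed threshold for what counts as a memory spike. *)
let decide ?(manual_stop = false) ?(memory_factor = 2.0) baseline h =
  if manual_stop then Rollback "manual intervention via Zulip"
  else if not h.started then Rollback "service failed to start"
  else if not h.health_check_ok then Rollback "health check failed"
  else if h.error_rate > baseline.base_error_rate *. 1.5 then
    Rollback "error rate increased by more than 50%"
  else if h.memory_mb > baseline.base_memory_mb *. memory_factor then
    Rollback "memory usage spiked"
  else Keep
```

With a baseline error rate of zero, any error at all after the restart triggers a rollback, so a real implementation would probably want a minimum absolute threshold as well.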
### MCP Server Requirements (Simplified)

   - Local git repository with remote backup
   - Web interface for viewing changes
   - Webhook support for CI integration

2. **Monitoring Server**
   - Simple metrics collection (Prometheus/Grafana)
   - Log aggregation (just file-based initially)
   - Alert routing to Zulip

3. **Claude API Gateway**
   - Request/response logging
   - Fallback to manual mode if quota exceeded
### Implementation Phases (Simplified)

**Phase 1: Core Infrastructure (Week 1-2)**
- Log interception and buffering
- Basic error pattern detection
- Git worktree management
- Manual Claude consultation

**Phase 2: Automation (Week 3-4)**
- Automatic Claude triggers
- Code generation and application
- Restart orchestration
- Basic safety checks

**Phase 3: Monitoring & Safety (Week 5-6)**
- Zulip integration
- Rollback mechanisms
- Performance tracking
### Example Usage Flow

1. **Error Detection**

       (* Application code *)
       Logs.err (fun m -> m "Database connection failed: %s" error_msg);
       (* This error happens 20 times in 2 minutes *)

2. **Claude Consultation**

       Context: Database connection errors occurring frequently
       Pattern: "Database connection failed: Connection refused"

       - Exponential backoff retry logic
       - Connection pool management
       - Fallback to cached data

3. **Version Control**

       git worktree add ../dancer-fix-db-conn -b dancer/fix-db-conn
       cd ../dancer-fix-db-conn
       # Apply Claude's changes

       # If successful, merge and restart

       git merge dancer/fix-db-conn
       systemctl restart dancer-service
       # Monitor for 5 minutes
       # If stable, cleanup worktree
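To make the flow concrete, here is a rough idea of what the "exponential backoff retry logic" proposed in the consultation step might look like once written. It is purely illustrative: the function names, the default delays, and the injected `sleep` callback are assumptions, not output from an actual consultation.

```ocaml
(* Delay before retrying after 0-based [attempt]: base * 2^attempt, capped.
   The defaults are arbitrary illustrative values. *)
let backoff_delay ?(base = 0.5) ?(cap = 30.0) attempt =
  Float.min cap (base *. (2.0 ** float_of_int attempt))

(* Call [f] until it returns [Ok _] or [max_attempts] calls have been made.
   [sleep] is injected so the caller can pass an Eio clock sleep, a
   blocking sleep, or a stub in tests. *)
let with_retry ?(max_attempts = 5) ~sleep f =
  let rec go attempt =
    match f () with
    | Ok _ as ok -> ok
    | Error _ as err when attempt + 1 >= max_attempts -> err
    | Error _ ->
      sleep (backoff_delay attempt);
      go (attempt + 1)
  in
  go 0
```

The application would wrap its database connection call in `with_retry`, so a brief outage produces a few delayed retries instead of a burst of identical error logs.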
### Data Structures

    module Dancer = struct
        claude_api_key: string;
        zulip_api_key: string;
        zulip_stream: string;
        max_context_size: int; (* chars to send to Claude *)
        consultation_cooldown: float; (* seconds between consultations *)
        error_threshold: int; (* errors before triggering *)
        restart_timeout: float; (* max seconds for restart *)
        worktree_base: string; (* base directory for git worktrees *)

      type consultation_request = {
        recent_logs: string;
        source_context: string option;

      type consultation_response = {
        proposed_fix: string;
        target_file: string;
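To show how the `consultation_cooldown` and `error_threshold` settings might interact, here is a minimal sketch of a gate that decides whether a consultation may start. The `config` record below contains only those two fields, and `should_consult` is a hypothetical function name; times are floats in seconds.

```ocaml
(* Only the two configuration fields this sketch needs. *)
type config = {
  consultation_cooldown : float;  (* seconds between consultations *)
  error_threshold : int;          (* errors before triggering *)
}

(* A consultation is allowed when enough errors have accumulated and the
   cooldown since the previous consultation (if any) has elapsed. *)
let should_consult config ~now ~last_consultation ~error_count =
  let cooled_down =
    match last_consultation with
    | None -> true
    | Some t -> now -. t >= config.consultation_cooldown
  in
  cooled_down && error_count >= config.error_threshold
```

The cooldown keeps an error storm from starting a new consultation for every burst, which also bounds the cost of calls to Claude.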
### Key Simplifications from Original Design

1. **No Dynamic Linking** - Just restart the process
2. **Simple Pattern Matching** - String comparison, no bloom filters
3. **Basic Git Workflow** - Branches and worktrees, no complex versioning
4. **Minimal Infrastructure** - SQLite instead of complex databases
5. **Simple Rollback** - Git reset instead of sophisticated mechanisms
6. **Direct Process Restart** - Using systemd/supervisor instead of hot-reload
7. **File-Based Logs** - No complex log aggregation initially
8. **Manual Approval Option** - Human can review via Zulip before deploy
## Library Decomposition Plan

1. **dancer-logs** - Log interception and buffering
   - Hook into OCaml Logs reporter
   - SQLite-backed circular buffer
   - Pattern normalization
   - Standalone testable

2. **dancer-patterns** - Pattern detection and tracking
   - Error pattern recognition
   - Frequency/acceleration tracking
   - Pattern database management
   - Trigger decision logic

3. **dancer-claude** - Claude CLI integration
   - Prompt construction
   - Context preparation
   - Token cost tracking

4. **dancer-git** - Git worktree management
   - Worktree creation/cleanup
   - Branch management
   - Safe merging operations
   - Rollback capabilities

5. **dancer-test** - Alcotest generation
   - Test template generation
   - Test execution in worktrees
   - Coverage tracking

6. **dancer-process** - Process management
   - Tmux orchestration
   - Service restart logic
   - Graceful shutdown

7. **dancer-observe** - Observability
   - Metrics collection
   - SQLite time-series storage
   - Anomaly detection
   - Audit trail management

8. **dancer-spec** - Service specification
   - YAML spec parsing
   - Constraint validation
   - Fix validation against spec
   - Schema enforcement

9. **dancer-deploy** - Deployment pipeline
   - Staging environment setup
   - Promotion criteria evaluation
   - Production deployment
   - Rollback orchestration

10. **dancer-ui** - Human oversight interfaces
    - Web dashboard (Dream)
    - Terminal UI (Nottui)
    - WebSocket live updates
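As an illustration of what "pattern normalization" in `dancer-logs` could mean, together with grouping consecutive identical entries with a count, here is a small stdlib-only sketch. Both function names are hypothetical, and collapsing digit runs to `<N>` is just one possible normalization rule, chosen here as an assumption.

```ocaml
(* Replace every run of digits with "<N>" so that messages differing only
   in numbers (ports, ids, durations) collapse to one pattern. *)
let normalize msg =
  let buf = Buffer.create (String.length msg) in
  let in_digits = ref false in
  String.iter
    (fun c ->
      match c with
      | '0' .. '9' ->
        if not !in_digits then Buffer.add_string buf "<N>";
        in_digits := true
      | _ ->
        in_digits := false;
        Buffer.add_char buf c)
    msg;
  Buffer.contents buf

(* Collapse consecutive identical entries into (entry, count) pairs,
   preserving order. *)
let group_consecutive xs =
  let rec go acc = function
    | [] -> List.rev acc
    | x :: rest -> (
      match acc with
      | (y, n) :: tl when y = x -> go ((y, n + 1) :: tl) rest
      | _ -> go ((x, 1) :: acc) rest)
  in
  go [] xs
```

Normalizing before grouping means a burst such as "timeout after 30s" followed by "timeout after 31s" is stored as one pattern with a count, which keeps the buffer and the context sent to Claude small.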
### Implementation Order

**Phase 1: Foundation** (Week 1)
1. `dancer-logs` - Need log data first
2. `dancer-patterns` - Pattern detection on logs
3. `dancer-observe` - Basic metrics/storage

**Phase 2: Claude Integration** (Week 2)
4. `dancer-claude` - Claude consultation
5. `dancer-spec` - Service constraints
6. `dancer-test` - Test generation

**Phase 3: Deployment** (Week 3)
7. `dancer-git` - Worktree management
8. `dancer-process` - Process control
9. `dancer-deploy` - Staging/production

**Phase 4: Oversight** (Week 4)
10. `dancer-ui` - Dashboard and monitoring