commit 870ae671b09aeae2e81fd9ececac6f61821ea2eb · anil.recoil.org/slop

+373

dancer/CLAUDE.md


       1
       +
       I want to design an OCaml library that builds in support for modifying the

     

       2
       +
       program linked to it using Claude Code, and restarting itself with the fixes

     

       3
       +
       automatically. The idea is for long-running services to regularly consult with

     

       4
       +
       Claude (either on a fixed timetable, or urgently if something really unexpected

     

       5
       +
       happens) and improve their own functionality. Claude should be used to analyse

     

       6
       +
       patterns in the logs and determine whether to write code to handle a particular

     

       7
       +
       case. Claude should not be directly used in the application datapath itself, as

     

       8
       +
       its job is to write code instead.

     

       9
       +
       

     

       10
       +
       To make this work, the program needs to emit sufficient tracing data to be

     

       11
       +
       useful to Claude when it does an inspection, but not so much that it overwhelms

     

       12
       +
       the context window.  Therefore, the first thing the library needs is some

     

       13
       +
       mechanism to intercept the logging output of the program suitably. The OCaml

     

       14
       +
       "logs" library is a good thing to standardise on here. It's also fine to use

     

       15
       +
       the OCaml direct-style Eio library for all interactions.

     

       16
       +
       

     

       17
       +
       Assume the code is running in a Linux environment with root level access. There

     

       18
       +
       will also be a Zulip server available with an API key that can be used to post

     

       19
       +
       messages and interact with it.

     

       20
       +
       

     

       21
       +
       This is an ambitious project, so before embarking on it, I need to think really

     

       22
       +
       carefully about the design and tradeoffs, including seeking clarification where

     

       23
       +
       necessary about what sorts of MCP servers or other support infrastructure will

     

       24
       +
       be useful in making the library successful. I'm OK taking risks and trying unusual

     

       25
       +
       approaches. The library will be called "Dancer" after the Hunter S. Thompson

     

       26
       +
       quote "We're raising a generation of dancers, afraid to take one step out of

     

       27
       +
       line."

     

       28
       +
       

     

       29
       +
       ## Architecture Design (v1)

     

       30
       +
       

     

       31
       +
       ### Core Components

     

       32
       +
       

     

       33
       +
       1. **Log Interceptor & Buffer**

     

       34
       +
          - Hook into the OCaml Logs library at the reporter level

     

       35
       +
          - Maintain a persistent buffer on disk, perhaps in SQLite, for analysis

     

       36
       +
          - Group consecutive identical errors with count

     

       37
       +
          - Tag logs with timestamp, module, and error type

     

       38
       +
       

     

       39
       +
       2. **Pattern Detector**

     

       40
       +
          - Track log messages and their frequency

     

       41
       +
          - Use string matching to identify recurring patterns

     

       42
       +
          - Maintain a simple SQLite database of seen patterns

     

       43
       +
          - Trigger Claude consultation when:

     

       44
       +
            - New error pattern appears frequently (>10 times in 5 min)

     

       45
       +
            - Error rate spikes above baseline

     

       46
       +
            - Scheduled review (e.g., every 6 hours)

     

       47
       +
       

     

       48
       +
       3. **Claude Consultation Manager**

     

       49
       +
          - Prepare context: recent logs + relevant source files

     

       50
       +
          - Ask Claude to:

     

       51
       +
            - Analyze the error pattern

     

       52
       +
            - Generate OCaml code to handle the case

     

       53
       +
            - Suggest where to integrate the fix

     

       54
       +
            - Test the fixes and trial a deployment

     

       55
       +
          - Store Claude's response and proposed changes

     

       56
       +
       

     

       57
       +
       4. **Version Control Integration**

     

       58
       +
          - Each Claude fix creates a new git branch: `dancer/fix-<timestamp>-<error-hash>`

     

       59
       +
          - Use git worktrees for isolated changes:

     

       60
       +
            ```bash

     

       61
       +
            git worktree add ../dancer-fix-<id> -b dancer/fix-<id>

     

       62
       +
            ```

     

       63
       +
          - Apply Claude's changes in the worktree, and include a changelog in the commits

     

       64
       +
          - Compile and test in isolation

     

       65
       +
          - If successful, merge to main and restart application

     

       66
       +
          - Have a script that can search for all the fix branches and update a central changelog ordered by time, suitable for a human to review regularly

     

       67
       +
       

     

       68
       +
       5. **Restart Orchestration**

     

       69
       +
          - Library has a supervisor for process management of the application itself

     

       70
       +
          - Graceful shutdown: finish current requests with a timeout

     

       71
       +
          - State persistence before restart (if needed)

     

       72
       +
          - Automatic rollback to the previous successful binary if the restart fails

     

       73
       +
          - Health check after restart

     

       74
       +
       

     

       75
       +
       6. **Zulip Integration**

     

       76
       +
          - Post proposed changes for human review

     

       77
       +
          - Emergency stop command

     

       78
       +
          - Status updates on consultations

     

       79
       +
          - Performance metrics before/after changes
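The Pattern Detector's first trigger (a pattern appearing more than 10 times in 5 minutes) could be implemented as a per-pattern sliding window. The following is a minimal sketch in plain OCaml, not the library's actual implementation: the names `detector`, `create`, and `record` are hypothetical, timestamps are supplied by the caller, and an in-memory `Hashtbl` stands in for the SQLite pattern database.

```ocaml
(* Sliding-window counter per pattern. Timestamps are seconds (floats) and
   are passed in by the caller, which keeps the logic easy to test. *)
type detector = {
  window : float;     (* window length in seconds, e.g. 300.0 for 5 min *)
  threshold : int;    (* trigger when the count exceeds this value *)
  seen : (string, float Queue.t) Hashtbl.t;
}

let create ~window ~threshold = { window; threshold; seen = Hashtbl.create 16 }

(* Record one occurrence of [pattern] at time [now]. Returns [true] when the
   pattern has occurred more than [threshold] times within the window. *)
let record t ~now pattern =
  let q =
    match Hashtbl.find_opt t.seen pattern with
    | Some q -> q
    | None ->
      let q = Queue.create () in
      Hashtbl.add t.seen pattern q;
      q
  in
  Queue.push now q;
  (* Drop occurrences that have fallen out of the window. *)
  while (not (Queue.is_empty q)) && now -. Queue.peek q > t.window do
    ignore (Queue.pop q)
  done;
  Queue.length q > t.threshold
```

With `create ~window:300.0 ~threshold:10`, the eleventh occurrence inside five minutes returns `true`, which is the point at which a consultation would be requested.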

     

       80
       +
       

     

       81
       +
       ### Git Workflow Design

     

       82
       +
       

     

       83
       +
       1. **Branch Strategy**

     

       84
       +
          ```

     

       85
       +
          main (production code)

     

       86
       +
          ├── dancer/fix-2024-01-15-1200-auth-error

     

       87
       +
          ├── dancer/fix-2024-01-15-1800-timeout-handler

     

       88
       +
          └── dancer/rollback-2024-01-15-1900 (if needed)

     

       89
       +
          ```

     

       90
       +
       

     

       91
       +
       2. **Worktree Management**

     

       92
       +
          - Base directory: `/var/dancer/worktrees/`

     

       93
       +
          - Each fix gets its own worktree

     

       94
       +
          - Clean up old worktrees after successful merge

     

       95
       +
          - Keep failed attempts for analysis

     

       96
       +
       

     

       97
       +
       3. **Change Process**

     

       98
       +
          ```ocaml

     

       99
       +
          type fix_status =

     

       100
       +
            | Proposed

     

       101
       +
            | Testing

     

       102
       +
            | Approved

     

       103
       +
            | Deployed

     

       104
       +
            | Rolled_back

     

       105
       +
          

     

       106
       +
          type fix_record = {

     

       107
       +
            id: string;

     

       108
       +
            branch: string;

     

       109
       +
            worktree: string;

     

       110
       +
            error_pattern: string;

     

       111
       +
            claude_solution: string;

     

       112
       +
            test_results: string option;

     

       113
       +
            status: fix_status;

     

       114
       +
            created_at: float;

     

       115
       +
          }

     

       116
       +
          ```
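To illustrate how the branch naming scheme and the `fix_status` type might fit together, here is a hedged sketch that creates a `fix_record` and moves it through its states. The helpers `error_hash`, `new_fix`, `can_transition`, and `advance` are hypothetical, the MD5-prefix hash is just one possible choice of error hash, and the allowed transition order is an assumption that the design above does not spell out.

```ocaml
type fix_status = Proposed | Testing | Approved | Deployed | Rolled_back

type fix_record = {
  id : string;
  branch : string;
  worktree : string;
  error_pattern : string;
  claude_solution : string;
  test_results : string option;
  status : fix_status;
  created_at : float;
}

(* Short, stable hash of an error pattern: first 8 hex chars of its MD5. *)
let error_hash pattern = String.sub (Digest.to_hex (Digest.string pattern)) 0 8

(* Create a record in the Proposed state. [created_at] is epoch seconds. *)
let new_fix ~worktree_base ~created_at ~error_pattern ~claude_solution =
  let id = Printf.sprintf "%.0f-%s" created_at (error_hash error_pattern) in
  { id;
    branch = "dancer/fix-" ^ id;
    worktree = Filename.concat worktree_base ("fix-" ^ id);
    error_pattern;
    claude_solution;
    test_results = None;
    status = Proposed;
    created_at }

(* Assumed ordering: one stage at a time; only a deployed fix rolls back. *)
let can_transition ~from ~to_ =
  match from, to_ with
  | Proposed, Testing
  | Testing, Approved
  | Approved, Deployed
  | Deployed, Rolled_back -> true
  | _ -> false

(* Return the updated record, or an error for a disallowed transition. *)
let advance fix next =
  if can_transition ~from:fix.status ~to_:next then Ok { fix with status = next }
  else Error "invalid fix status transition"
```

Keeping the transition table in one function makes it straightforward to reject, for example, a jump from `Proposed` straight to `Deployed` that would skip testing and approval.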

     

       117
       +
       

     

       118
       +
       ### Simplified Log Management

     

       119
       +
       

     

       120
       +
       1. **Log Format**

     

       121
       +
          ```ocaml

     

       122
       +
          type log_entry = {

     

       123
       +
            timestamp: float;

     

       124
       +
            level: Logs.level;

     

       125
       +
            source: string; (* module name *)

     

       126
       +
            message: string;

     

       127
       +
            error_type: string option;

     

       128
       +
            stack_trace: string option;

     

       129
       +
          }

     

       130
       +
          ```

     

       131
       +
       

     

       132
       +
       2. **Context Preparation for Claude**

     

       133
       +
          - Last 500 lines of logs

     

       134
       +
          - Error frequency summary

     

       135
       +
          - Relevant source file (where error originated)

     

       136
       +
          - Previous fix attempts for similar errors

     

       137
       +
          - System metrics (CPU, memory, request rate)
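The log portion of this context (the last 500 lines) still has to fit within whatever size budget the prompt allows. A minimal sketch of that trimming in plain OCaml follows. The names `last_n`, `fit_budget`, and `recent_logs`, and the `max_chars` parameter, are hypothetical, and counting characters rather than tokens is an assumption.

```ocaml
(* Keep at most the last [n] elements of a list. *)
let last_n n xs =
  let len = List.length xs in
  List.filteri (fun i _ -> i >= len - n) xs

(* Keep the newest lines whose combined size (one extra char per line for
   the newline) fits in [max_chars]; older lines are dropped first. *)
let fit_budget ~max_chars lines =
  let rec take acc used = function
    | [] -> acc
    | line :: older ->
      let cost = String.length line + 1 in
      if used + cost > max_chars then acc
      else take (line :: acc) (used + cost) older
  in
  (* Walk from newest to oldest; [acc] ends up in chronological order. *)
  take [] 0 (List.rev lines)

(* Log excerpt to send: the last [max_lines] lines, trimmed to the budget. *)
let recent_logs ~max_lines ~max_chars lines =
  String.concat "\n" (fit_budget ~max_chars (last_n max_lines lines))
```

Dropping the oldest lines first reflects the assumption that the most recent log entries are the most relevant to the error under analysis.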

     

       138
       +
       

     

       139
       +
       ### Restart Safety Mechanisms

     

       140
       +
       

     

       141
       +
       1. **Pre-Restart Checks**

     

       142
       +
          - Compile the modified code

     

       143
       +
          - Run unit tests if available

     

       144
       +
          - Type-check with `ocamlc -i` (prints the inferred interface without producing compiled files)

     

       145
       +
          - Verify no obvious issues (missing semicolons, etc.)

     

       146
       +
       

     

       147
       +
       2. **Restart Process**

     

       148
       +
          ```bash

     

       149
       +
          # Save current version

     

       150
       +
          git tag dancer-before-$(date +%s)

     

       151
       +
          

     

       152
       +
          # Merge fix

     

       153
       +
          git merge --no-ff dancer/fix-<id>

     

       154
       +
          

     

       155
       +
          # Rebuild

     

       156
       +
          dune build

     

       157
       +
          

     

       158
       +
          # Graceful restart

     

       159
       +
          systemctl reload dancer-service || systemctl restart dancer-service

     

       160
       +
          

     

       161
       +
          # Health check

     

       162
       +
          ./health_check.sh || { git reset --hard dancer-before-<timestamp> && dune build && systemctl restart dancer-service; }

     

       163
       +
          ```

     

       164
       +
       

     

       165
       +
       3. **Rollback Triggers**

     

       166
       +
          - Service fails to start

     

       167
       +
          - Health check fails after restart

     

       168
       +
          - Error rate increases by >50%

     

       169
       +
          - Memory usage spikes

     

       170
       +
          - Manual intervention via Zulip
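These triggers can be expressed as a single decision function over before and after measurements. The sketch below is illustrative only: the `health` record, the `decision` type, and `decide` are hypothetical names, and `max_memory_growth` is an assumed ratio, because the list above says only that memory usage "spikes".

```ocaml
type health = {
  started : bool;           (* the service process came up *)
  health_check_ok : bool;   (* the post-restart health check passed *)
  errors_per_min : float;
  memory_mb : float;
}

type decision = Keep | Rollback of string

(* Compare measurements taken before and after a restart and decide whether
   to keep the new version. [max_memory_growth] is a hypothetical ratio:
   2.0 means "memory more than doubled". *)
let decide ?(max_memory_growth = 2.0) ~manual_stop ~(before : health)
    ~(after : health) () =
  if manual_stop then Rollback "manual intervention via Zulip"
  else if not after.started then Rollback "service failed to start"
  else if not after.health_check_ok then Rollback "health check failed"
  else if after.errors_per_min > before.errors_per_min *. 1.5 then
    Rollback "error rate increased by more than 50%"
  else if after.memory_mb > before.memory_mb *. max_memory_growth then
    Rollback "memory usage spiked"
  else Keep
```

Note that with a baseline of zero errors per minute, any error after the restart counts as an increase of more than 50%. A real implementation would probably want a minimum absolute threshold as well.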

     

       171
       +
       

     

       172
       +
       ### MCP Server Requirements (Simplified)

     

       173
       +
       

     

       174
       +
       1. **Git Server**

     

       175
       +
          - Local git repository with remote backup

     

       176
       +
          - Web interface for viewing changes

     

       177
       +
          - Webhook support for CI integration

     

       178
       +
       

     

       179
       +
       2. **Monitoring Server**

     

       180
       +
          - Simple metrics collection (Prometheus/Grafana)

     

       181
       +
          - Log aggregation (just file-based initially)

     

       182
       +
          - Alert routing to Zulip

     

       183
       +
       

     

       184
       +
       3. **Claude API Gateway**

     

       185
       +
          - Rate limiting

     

       186
       +
          - Cost tracking

     

       187
       +
          - Request/response logging

     

       188
       +
          - Fallback to manual mode if quota exceeded
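As a rough illustration of the gateway's rate limiting, cost tracking, and quota fallback, the following sketch keeps all state in one record. Everything here is hypothetical (the `gateway` record, the `verdict` type, a fixed minimum interval as the rate limit, and a dollar-denominated quota), and no real Claude API call is made.

```ocaml
type mode = Automatic | Manual

type gateway = {
  min_interval : float;                 (* seconds between requests *)
  quota_usd : float;                    (* spending cap for the period *)
  mutable spent_usd : float;
  mutable last_request : float option;  (* time of last allowed request *)
}

type verdict = Allowed | Rate_limited | Quota_exceeded

let create ~min_interval ~quota_usd =
  { min_interval; quota_usd; spent_usd = 0.0; last_request = None }

(* Fall back to manual mode once the quota has been used up. *)
let mode g = if g.spent_usd >= g.quota_usd then Manual else Automatic

(* Decide whether a request may be sent at time [now]; when it is allowed,
   the request time is recorded for the next rate-limit check. *)
let request g ~now =
  match mode g, g.last_request with
  | Manual, _ -> Quota_exceeded
  | Automatic, Some t when now -. t < g.min_interval -> Rate_limited
  | Automatic, _ ->
    g.last_request <- Some now;
    Allowed

(* Add the cost of a completed request to the running total. *)
let record_cost g cost_usd = g.spent_usd <- g.spent_usd +. cost_usd
```

A caller would check `request` before each consultation and call `record_cost` afterwards. Once `mode` returns `Manual`, fixes would have to be requested by a human rather than triggered automatically.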

     

       189
       +
       

     

       190
       +
       ### Implementation Phases (Simplified)

     

       191
       +
       

     

       192
       +
       **Phase 1: Core Infrastructure (Week 1-2)**

     

       193
       +
       - Log interception and buffering

     

       194
       +
       - Basic error pattern detection

     

       195
       +
       - Git worktree management

     

       196
       +
       - Manual Claude consultation

     

       197
       +
       

     

       198
       +
       **Phase 2: Automation (Week 3-4)**

     

       199
       +
       - Automatic Claude triggers

     

       200
       +
       - Code generation and application

     

       201
       +
       - Restart orchestration

     

       202
       +
       - Basic safety checks

     

       203
       +
       

     

       204
       +
       **Phase 3: Monitoring & Safety (Week 5-6)**

     

       205
       +
       - Zulip integration

     

       206
       +
       - Rollback mechanisms

     

       207
       +
       - Performance tracking

     

       208
       +
       - Cost management

     

       209
       +
       

     

       210
       +
       ### Example Usage Flow

     

       211
       +
       

     

       212
       +
       1. **Error Detection**

     

       213
       +
          ```ocaml

     

       214
       +
          (* Application code *)

     

       215
       +
          Logs.err (fun m -> m "Database connection failed: %s" error_msg);

     

       216
       +
          (* This error happens 20 times in 2 minutes *)

     

       217
       +
          ```

     

       218
       +
       

     

       219
       +
       2. **Claude Consultation**

     

       220
       +
          ```

     

       221
       +
          Context: Database connection errors occurring frequently

     

       222
       +
          Pattern: "Database connection failed: Connection refused"

     

       223
       +
          

     

       224
       +
          Claude generates:

     

       225
       +
          - Exponential backoff retry logic

     

       226
       +
          - Connection pool management

     

       227
       +
          - Fallback to cached data

     

       228
       +
          ```

     

       229
       +
       

     

       230
       +
       3. **Version Control**

     

       231
       +
          ```bash

     

       232
       +
          git worktree add ../dancer-fix-db-conn -b dancer/fix-db-conn

     

       233
       +
          cd ../dancer-fix-db-conn

     

       234
       +
          # Apply Claude's changes

     

       235
       +
          dune build

     

       236
       +
          # If successful, merge and restart

     

       237
       +
          ```

     

       238
       +
       

     

       239
       +
       4. **Deployment**

     

       240
       +
          ```bash

     

       241
       +
          git checkout main

     

       242
       +
          git merge dancer/fix-db-conn

     

       243
       +
          systemctl restart dancer-service

     

       244
       +
          # Monitor for 5 minutes

     

       245
       +
          # If stable, cleanup worktree

     

       246
       +
          ```

     

       247
       +
       

     

       248
       +
       ### Data Structures

     

       249
       +
       

     

       250
       +
       ```ocaml

     

       251
       +
       module Dancer = struct

     

       252
       +
         type config = {

     

       253
       +
           claude_api_key: string;

     

       254
       +
           zulip_api_key: string;

     

       255
       +
           zulip_stream: string;

     

       256
       +
           max_context_size: int; (* chars to send to Claude *)

     

       257
       +
           consultation_cooldown: float; (* seconds between consultations *)

     

       258
       +
           error_threshold: int; (* errors before triggering *)

     

       259
       +
           restart_timeout: float; (* max seconds for restart *)

     

       260
       +
           worktree_base: string; (* base directory for git worktrees *)

     

       261
       +
         }

     

       262
       +
         

     

       263
       +
         type consultation_request = {

     

       264
       +
           pattern: string;

     

       265
       +
           occurrences: int;

     

       266
       +
           timespan: float;

     

       267
       +
           recent_logs: string;

     

       268
       +
           source_context: string option;

     

       269
       +
         }

     

       270
       +
         

     

       271
       +
         type consultation_response = {

     

       272
       +
           analysis: string;

     

       273
       +
           proposed_fix: string;

     

       274
       +
           target_file: string;

     

       275
       +
           confidence: float;

     

       276
       +
         }

     

       277
       +
       end

     

       278
       +
       ```
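The `consultation_cooldown` and `error_threshold` fields suggest a simple gate in front of every consultation. A minimal sketch under that reading follows. The `gate` record and `should_consult` function are hypothetical, and treating the threshold as inclusive is an assumption.

```ocaml
(* The two config fields that gate a consultation, copied from [config]. *)
type gate = {
  consultation_cooldown : float;  (* seconds between consultations *)
  error_threshold : int;          (* errors before triggering *)
}

(* Consult only when enough errors have accumulated and the cooldown since
   the previous consultation (if any) has elapsed. Times are in seconds. *)
let should_consult gate ~now ~last_consultation ~occurrences =
  let cooled_down =
    match last_consultation with
    | None -> true
    | Some t -> now -. t >= gate.consultation_cooldown
  in
  cooled_down && occurrences >= gate.error_threshold
```

Passing `now` explicitly instead of reading the clock inside the function keeps the gate deterministic, so it can be unit-tested without sleeping.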

     

       279
       +
       

     

       280
       +
       ### Key Simplifications from Original Design

     

       281
       +
       

     

       282
       +
       1. **No Dynamic Linking** - Just restart the process

     

       283
       +
       2. **Simple Pattern Matching** - String comparison, no bloom filters

     

       284
       +
       3. **Basic Git Workflow** - Branches and worktrees, no complex versioning

     

       285
       +
       4. **Minimal Infrastructure** - SQLite instead of complex databases

     

       286
       +
       5. **Simple Rollback** - Git reset instead of sophisticated mechanisms

     

       287
       +
       6. **Direct Process Restart** - Using systemd/supervisor instead of hot-reload

     

       288
       +
       7. **File-Based Logs** - No complex log aggregation initially

     

       289
       +
       8. **Manual Approval Option** - Human can review via Zulip before deploy

     

       290
       +
       

     

       291
       +
       ## Library Decomposition Plan

     

       292
       +
       

     

       293
       +
       ### Core Libraries

     

       294
       +
       

     

       295
       +
       1. **dancer-logs** - Log interception and buffering

     

       296
       +
          - Hook into OCaml Logs reporter

     

       297
       +
          - SQLite-backed circular buffer

     

       298
       +
          - Pattern normalization

     

       299
       +
          - Standalone testable

     

       300
       +
       

     

       301
       +
       2. **dancer-patterns** - Pattern detection and tracking

     

       302
       +
          - Error pattern recognition

     

       303
       +
          - Frequency/acceleration tracking

     

       304
       +
          - Pattern database management

     

       305
       +
          - Trigger decision logic

     

       306
       +
       

     

       307
       +
       3. **dancer-claude** - Claude CLI integration

     

       308
       +
          - Prompt construction

     

       309
       +
          - Response parsing

     

       310
       +
          - Context preparation

     

       311
       +
          - Token cost tracking

     

       312
       +
       

     

       313
       +
       4. **dancer-git** - Git worktree management

     

       314
       +
          - Worktree creation/cleanup

     

       315
       +
          - Branch management

     

       316
       +
          - Safe merging operations

     

       317
       +
          - Rollback capabilities

     

       318
       +
       

     

       319
       +
       5. **dancer-test** - Alcotest generation

     

       320
       +
          - Test template generation

     

       321
       +
          - Test execution in worktrees

     

       322
       +
          - Result parsing

     

       323
       +
          - Coverage tracking

     

       324
       +
       

     

       325
       +
       6. **dancer-process** - Process management

     

       326
       +
          - Tmux orchestration

     

       327
       +
          - Service restart logic

     

       328
       +
          - Health checking

     

       329
       +
          - Graceful shutdown

     

       330
       +
       

     

       331
       +
       7. **dancer-observe** - Observability

     

       332
       +
          - Metrics collection

     

       333
       +
          - SQLite time-series storage

     

       334
       +
          - Anomaly detection

     

       335
       +
          - Audit trail management

     

       336
       +
       

     

       337
       +
       8. **dancer-spec** - Service specification

     

       338
       +
          - YAML spec parsing

     

       339
       +
          - Constraint validation

     

       340
       +
          - Fix validation against spec

     

       341
       +
          - Schema enforcement

     

       342
       +
       

     

       343
       +
       9. **dancer-deploy** - Deployment pipeline

     

       344
       +
          - Staging environment setup

     

       345
       +
          - Promotion criteria evaluation

     

       346
       +
          - Production deployment

     

       347
       +
          - Rollback orchestration

     

       348
       +
       

     

       349
       +
       10. **dancer-ui** - Human oversight interfaces

     

       350
       +
           - Web dashboard (Dream)

     

       351
       +
           - Terminal UI (Nottui)

     

       352
       +
           - WebSocket live updates

     

       353
       +
           - Audit log viewer
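For `dancer-logs`, pattern normalization and the grouping of consecutive identical errors could be as simple as the sketch below. This is one possible approach and not a settled design: `normalise` and `group_consecutive` are hypothetical names, and replacing digit runs with `<N>` is an assumed normalization rule.

```ocaml
(* Replace every run of ASCII digits with "<N>" so that messages differing
   only in ids, ports, or durations normalise to the same pattern. *)
let normalise (msg : string) : string =
  let buf = Buffer.create (String.length msg) in
  let in_digits = ref false in
  String.iter
    (fun c ->
      match c with
      | '0' .. '9' ->
        if not !in_digits then Buffer.add_string buf "<N>";
        in_digits := true
      | _ ->
        in_digits := false;
        Buffer.add_char buf c)
    msg;
  Buffer.contents buf

(* Collapse runs of consecutive identical messages into (message, count)
   pairs, preserving the order in which the runs appeared. *)
let group_consecutive (messages : string list) : (string * int) list =
  let step acc msg =
    match acc with
    | (prev, n) :: rest when String.equal prev msg -> (prev, n + 1) :: rest
    | _ -> (msg, 1) :: acc
  in
  List.rev (List.fold_left step [] messages)

(* Example: two timeouts with different durations collapse into one entry. *)
let () =
  [ "timeout after 30s"; "timeout after 45s"; "auth failed for user 7" ]
  |> List.map normalise
  |> group_consecutive
  |> List.iter (fun (pattern, n) -> Printf.printf "%s (x%d)\n" pattern n)
```

Normalising before grouping means the buffer stores one row with a count instead of many near-identical rows, which also keeps the context sent to Claude smaller.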

     

       354
       +
       

     

       355
       +
       ### Implementation Order

     

       356
       +
       

     

       357
       +
       **Phase 1: Foundation** (Week 1)

     

       358
       +
       1. `dancer-logs` - Need log data first

     

       359
       +
       2. `dancer-patterns` - Pattern detection on logs

     

       360
       +
       3. `dancer-observe` - Basic metrics/storage

     

       361
       +
       

     

       362
       +
       **Phase 2: Claude Integration** (Week 2)

     

       363
       +
       4. `dancer-claude` - Claude consultation

     

       364
       +
       5. `dancer-spec` - Service constraints

     

       365
       +
       6. `dancer-test` - Test generation

     

       366
       +
       

     

       367
       +
       **Phase 3: Deployment** (Week 3)

     

       368
       +
       7. `dancer-git` - Worktree management

     

       369
       +
       8. `dancer-process` - Process control

     

       370
       +
       9. `dancer-deploy` - Staging/production

     

       371
       +
       

     

       372
       +
       **Phase 4: Oversight** (Week 4)

     

       373
       +
       10. `dancer-ui` - Dashboard and monitoring