My agentic slop goes here. Not intended for anyone else!

I want to design an OCaml library that builds in support for modifying the program linked to it using Claude Code, and restarting itself with the fixes automatically. The idea is for long-running services to regularly consult Claude (either on a fixed timetable, or urgently if something really unexpected happens) and improve their own functionality. Claude should be used to analyse patterns in the logs and determine whether to write code to handle a particular case. Claude should not be used directly in the application datapath itself; its role is to write code, not to serve requests.

To make this work, the program needs to emit sufficient tracing data to be useful to Claude when it does an inspection, but not so much that it overwhelms the context window. Therefore, the first thing the library needs is a mechanism to intercept the program's logging output at a suitable granularity. The OCaml "logs" library is a good thing to standardise on here. It's also fine to use the OCaml direct-style Eio library for all interactions.

Assume the code is running in a Linux environment with root-level access. There will also be a Zulip server available, with an API key that can be used to post messages and otherwise interact with it.

This is an ambitious project, so before embarking on it, I need to think really carefully about the design and tradeoffs, including seeking clarification where necessary about what sorts of MCP servers or other support infrastructure will be useful to making the library successful. I'm ok taking risks and trying unusual approaches. The library will be called "Dancer" after the Hunter S Thompson quote "We're raising a generation of dancers, afraid to take one step out of line."

Architecture Design (v1)#

Core Components#

  1. Log Interceptor & Buffer

    • Hook into OCaml Logs library at the reporter level
    • Maintain a persistent buffer on disk, perhaps in SQLite, for analysis
    • Group consecutive identical errors with count
    • Tag logs with timestamp, module, and error type
  2. Pattern Detector

    • Track log messages and their frequency
    • Use string matching to identify recurring patterns
    • Maintain a simple SQLite database of seen patterns
    • Trigger Claude consultation when (see the decision sketch after this list):
      • New error pattern appears frequently (>10 times in 5 min)
      • Error rate spikes above baseline
      • Scheduled review (e.g., every 6 hours)
  3. Claude Consultation Manager

    • Prepare context: recent logs + relevant source files
    • Ask Claude to:
      • Analyze the error pattern
      • Generate OCaml code to handle the case
      • Suggest where to integrate the fix
      • Test the fixes and trial a deployment
    • Store Claude's response and proposed changes
  4. Version Control Integration

    • Each Claude fix creates a new git branch: dancer/fix-<timestamp>-<error-hash>
    • Use git worktrees for isolated changes:
      git worktree add ../dancer-fix-<id> -b dancer/fix-<id>
      
    • Apply Claude's changes in the worktree and include a changelog in the commits
    • Compile and test in isolation
    • If successful, merge to main and restart application
    • Provide a script that finds all fix branches and updates a central changelog, ordered by time and suitable for regular human review
  5. Restart Orchestration

    • The library includes a supervisor that manages the application process itself
    • Graceful shutdown: finish current requests with a timeout
    • State persistence before restart (if needed)
    • Automatic rollback to the previous successful binary if the restart fails
    • Health check after restart
  6. Zulip Integration

    • Post proposed changes for human review
    • Emergency stop command
    • Status updates on consultations
    • Performance metrics before/after changes
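
Component 6 can start as a thin wrapper around Zulip's standard /api/v1/messages endpoint. A minimal sketch, shelling out to curl via Eio's process API (Eio >= 0.12); the server URL, bot email, stream and topic are placeholder values, and post_to_zulip is a hypothetical helper rather than an existing API:

    (* Post a message to a Zulip stream by invoking curl with basic auth. *)
    let post_to_zulip ~proc_mgr ~api_key ~content =
      Eio.Process.run proc_mgr
        [ "curl"; "-sS"; "https://zulip.example.com/api/v1/messages";
          "-u"; "dancer-bot@example.com:" ^ api_key;
          "--data-urlencode"; "type=stream";
          "--data-urlencode"; "to=dancer";
          "--data-urlencode"; "topic=proposed-fixes";
          "--data-urlencode"; "content=" ^ content ]

A real HTTP client (e.g. cohttp-eio) can replace curl later without changing the endpoint or form fields.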
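
For component 2, the consultation trigger can be a pure predicate over per-pattern statistics, which keeps it easy to unit-test. A minimal sketch mirroring the thresholds above; the pattern_stats fields are assumptions about what the pattern database would store:

    type pattern_stats = {
      count_in_window : int;             (* occurrences in the last 5 minutes *)
      baseline_rate : float;             (* long-run errors per minute *)
      last_consultation : float option;  (* unix time of the last review *)
    }

    let should_consult ~now ~scheduled_interval p =
      let window_rate = float_of_int p.count_in_window /. 5.0 in
      let frequent = p.count_in_window > 10 in             (* >10 in 5 min *)
      let spike = window_rate > 2.0 *. p.baseline_rate in  (* rate spike *)
      let scheduled =
        match p.last_consultation with
        | None -> true
        | Some t -> now -. t >= scheduled_interval         (* e.g. 6 hours *)
      in
      frequent || spike || scheduled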

Git Workflow Design#

  1. Branch Strategy

    main (production code)
    ├── dancer/fix-2024-01-15-1200-auth-error
    ├── dancer/fix-2024-01-15-1800-timeout-handler
    └── dancer/rollback-2024-01-15-1900 (if needed)
    
  2. Worktree Management

    • Base directory: /var/dancer/worktrees/
    • Each fix gets its own worktree
    • Clean up old worktrees after successful merge
    • Keep failed attempts for analysis
  3. Change Process

    type fix_status =
      | Proposed
      | Testing
      | Approved
      | Deployed
      | Rolled_back
    
    type fix_record = {
      id: string;
      branch: string;
      worktree: string;
      error_pattern: string;
      claude_solution: string;
      test_results: string option;
      status: fix_status;
      created_at: float;
    }
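
A sketch of how dancer-git might tie the worktree commands above to the fix_record type, again shelling out via Eio's process API; create_fix_worktree is a hypothetical helper and the naming/hashing scheme is just one plausible choice (timestamps come from the unix library):

    (* Create an isolated worktree and branch for a proposed fix. *)
    let create_fix_worktree ~proc_mgr ~worktree_base ~error_pattern =
      let hash = String.sub (Digest.to_hex (Digest.string error_pattern)) 0 8 in
      let id = Printf.sprintf "%.0f-%s" (Unix.time ()) hash in
      let branch = "dancer/fix-" ^ id in
      let worktree = Filename.concat worktree_base ("dancer-fix-" ^ id) in
      Eio.Process.run proc_mgr
        [ "git"; "worktree"; "add"; worktree; "-b"; branch ];
      { id; branch; worktree; error_pattern;
        claude_solution = "";            (* filled in after consultation *)
        test_results = None;
        status = Proposed;
        created_at = Unix.time () }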
    

Simplified Log Management#

  1. Log Format

    type log_entry = {
      timestamp: float;
      level: Logs.level;
      source: string; (* module name *)
      message: string;
      error_type: string option;
      stack_trace: string option;
    }
    
  2. Context Preparation for Claude

    • Last 500 lines of logs
    • Error frequency summary
    • Relevant source file (where error originated)
    • Previous fix attempts for similar errors
    • System metrics (CPU, memory, request rate)
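
A sketch of the context assembly step: it concatenates the pieces listed above and trims from the front when the character budget (the max_context_size setting) is exceeded, so the later sections survive. build_context is a hypothetical helper; the section ordering and trimming policy are arbitrary choices:

    let build_context ~max_context_size ~frequency_summary ~recent_logs
        ~source_excerpt ~previous_fixes ~metrics =
      let section title body = Printf.sprintf "## %s\n%s\n" title body in
      let full =
        String.concat "\n"
          [ section "Error frequency summary" frequency_summary;
            section "Recent logs (most recent last)" recent_logs;
            section "Relevant source" source_excerpt;
            section "Previous fix attempts" previous_fixes;
            section "System metrics" metrics ]
      in
      if String.length full <= max_context_size then full
      else
        (* drop leading content to stay within the character budget *)
        String.sub full (String.length full - max_context_size) max_context_size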

Restart Safety Mechanisms#

  1. Pre-Restart Checks

    • Compile the modified code
    • Run unit tests if available
    • Check syntax with ocamlc -i
    • Verify no obvious issues (type errors, unused bindings, etc.)
  2. Restart Process

    # Save current version
    git tag dancer-before-$(date +%s)
    
    # Merge fix
    git merge --no-ff dancer/fix-<id>
    
    # Rebuild
    dune build
    
    # Graceful restart
    systemctl reload dancer-service || systemctl restart dancer-service
    
    # Health check
    ./health_check.sh || git reset --hard dancer-before-<timestamp>
    
  3. Rollback Triggers

    • Service fails to start
    • Health check fails after restart
    • Error rate increases by >50%
    • Memory usage spikes
    • Manual intervention via Zulip
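
These triggers reduce to a simple predicate over measurements taken during a post-restart observation window. A sketch; the post_restart_report fields are assumptions about what the observability layer would collect:

    type post_restart_report = {
      service_started : bool;
      health_check_ok : bool;
      error_rate_before : float;   (* errors/min before the fix *)
      error_rate_after : float;    (* errors/min after the restart *)
      rss_bytes : int;             (* resident memory after restart *)
      rss_limit_bytes : int;       (* configured memory ceiling *)
      manual_stop : bool;          (* emergency stop issued via Zulip *)
    }

    let should_roll_back r =
      (not r.service_started)
      || (not r.health_check_ok)
      || r.error_rate_after > 1.5 *. r.error_rate_before   (* >50% increase *)
      || r.rss_bytes > r.rss_limit_bytes
      || r.manual_stop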

MCP Server Requirements (Simplified)#

  1. Git Server

    • Local git repository with remote backup
    • Web interface for viewing changes
    • Webhook support for CI integration
  2. Monitoring Server

    • Simple metrics collection (Prometheus/Grafana)
    • Log aggregation (just file-based initially)
    • Alert routing to Zulip
  3. Claude API Gateway

    • Rate limiting
    • Cost tracking
    • Request/response logging
    • Fallback to manual mode if quota exceeded

Implementation Phases (Simplified)#

Phase 1: Core Infrastructure (Week 1-2)

  • Log interception and buffering
  • Basic error pattern detection
  • Git worktree management
  • Manual Claude consultation

Phase 2: Automation (Week 3-4)

  • Automatic Claude triggers
  • Code generation and application
  • Restart orchestration
  • Basic safety checks

Phase 3: Monitoring & Safety (Week 5-6)

  • Zulip integration
  • Rollback mechanisms
  • Performance tracking
  • Cost management

Example Usage Flow#

  1. Error Detection

    (* Application code *)
    Logs.err (fun m -> m "Database connection failed: %s" error_msg);
    (* This error happens 20 times in 2 minutes *)
    
  2. Claude Consultation

    Context: Database connection errors occurring frequently
    Pattern: "Database connection failed: Connection refused"
    
    Claude generates:
    - Exponential backoff retry logic
    - Connection pool management
    - Fallback to cached data
    
  3. Version Control

    git worktree add ../dancer-fix-db-conn -b dancer/fix-db-conn
    cd ../dancer-fix-db-conn
    # Apply Claude's changes
    dune build
    # If successful, merge and restart
    
  4. Deployment

    git checkout main
    git merge dancer/fix-db-conn
    systemctl restart dancer-service
    # Monitor for 5 minutes
    # If stable, cleanup worktree
    

Data Structures#

module Dancer = struct
  type config = {
    claude_api_key: string;
    zulip_api_key: string;
    zulip_stream: string;
    max_context_size: int; (* chars to send to Claude *)
    consultation_cooldown: float; (* seconds between consultations *)
    error_threshold: int; (* errors before triggering *)
    restart_timeout: float; (* max seconds for restart *)
    worktree_base: string; (* base directory for git worktrees *)
  }
  
  type consultation_request = {
    pattern: string;
    occurrences: int;
    timespan: float;
    recent_logs: string;
    source_context: string option;
  }
  
  type consultation_response = {
    analysis: string;
    proposed_fix: string;
    target_file: string;
    confidence: float;
  }
end
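
A sketch of how these types might drive the top-level loop. The detect_pattern, consult_claude, apply_and_deploy and sleep arguments are stand-ins for the components described earlier, and the 0.8 confidence cut-off is an arbitrary placeholder:

    let rec supervise (cfg : Dancer.config)
        ~detect_pattern ~consult_claude ~apply_and_deploy ~sleep =
      (match detect_pattern ~threshold:cfg.error_threshold with
       | None -> ()
       | Some (req : Dancer.consultation_request) ->
         let (resp : Dancer.consultation_response) = consult_claude cfg req in
         if resp.confidence >= 0.8 then apply_and_deploy cfg resp);
      sleep cfg.consultation_cooldown;
      supervise cfg ~detect_pattern ~consult_claude ~apply_and_deploy ~sleep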

Key Simplifications from Original Design#

  1. No Dynamic Linking - Just restart the process
  2. Simple Pattern Matching - String comparison, no bloom filters
  3. Basic Git Workflow - Branches and worktrees, no complex versioning
  4. Minimal Infrastructure - SQLite instead of complex databases
  5. Simple Rollback - Git reset instead of sophisticated mechanisms
  6. Direct Process Restart - Using systemd/supervisor instead of hot-reload
  7. File-Based Logs - No complex log aggregation initially
  8. Manual Approval Option - Human can review via Zulip before deploy

Library Decomposition Plan#

Core Libraries#

  1. dancer-logs - Log interception and buffering

    • Hook into OCaml Logs reporter
    • SQLite-backed circular buffer
    • Pattern normalization
    • Standalone testable
  2. dancer-patterns - Pattern detection and tracking

    • Error pattern recognition
    • Frequency/acceleration tracking
    • Pattern database management
    • Trigger decision logic
  3. dancer-claude - Claude CLI integration (see the invocation sketch after this list)

    • Prompt construction
    • Response parsing
    • Context preparation
    • Token cost tracking
  4. dancer-git - Git worktree management

    • Worktree creation/cleanup
    • Branch management
    • Safe merging operations
    • Rollback capabilities
  5. dancer-test - Alcotest generation

    • Test template generation
    • Test execution in worktrees
    • Result parsing
    • Coverage tracking
  6. dancer-process - Process management

    • Tmux orchestration
    • Service restart logic
    • Health checking
    • Graceful shutdown
  7. dancer-observe - Observability

    • Metrics collection
    • SQLite time-series storage
    • Anomaly detection
    • Audit trail management
  8. dancer-spec - Service specification

    • YAML spec parsing
    • Constraint validation
    • Fix validation against spec
    • Schema enforcement
  9. dancer-deploy - Deployment pipeline

    • Staging environment setup
    • Promotion criteria evaluation
    • Production deployment
    • Rollback orchestration
  10. dancer-ui - Human oversight interfaces

    • Web dashboard (Dream)
    • Terminal UI (Nottui)
    • WebSocket live updates
    • Audit log viewer
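
For dancer-claude, a minimal sketch of the CLI round-trip using Eio's process API; it assumes the Claude Code binary is on PATH, uses its non-interactive print mode (claude -p), and relies on Eio.Process.parse_out (Eio >= 0.12):

    (* Run one prompt through the Claude Code CLI and capture stdout. *)
    let consult ~proc_mgr ~prompt =
      Eio.Process.parse_out proc_mgr Eio.Buf_read.take_all
        [ "claude"; "-p"; prompt ]

    let () =
      Eio_main.run @@ fun env ->
      let proc_mgr = Eio.Stdenv.process_mgr env in
      print_string (consult ~proc_mgr ~prompt:"Summarise these errors: ...")

Prompt construction and response parsing sit on either side of this call; asking for structured output (for example a unified diff) keeps the parsing side simple.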

Implementation Order#

Phase 1: Foundation (Week 1)

  1. dancer-logs - Need log data first
  2. dancer-patterns - Pattern detection on logs
  3. dancer-observe - Basic metrics/storage

Phase 2: Claude Integration (Week 2)

  4. dancer-claude - Claude consultation
  5. dancer-spec - Service constraints
  6. dancer-test - Test generation

Phase 3: Deployment (Week 3)

  7. dancer-git - Worktree management
  8. dancer-process - Process control
  9. dancer-deploy - Staging/production

Phase 4: Oversight (Week 4)

  10. dancer-ui - Dashboard and monitoring

dancer-logs Implementation Details#

Design Decisions#

  1. Comprehensive Schema: The SQLite schema captures everything that might be useful for Claude's analysis:

    • Core fields (timestamp, level, source, message)
    • Error classification (type, code, hash, stack trace)
    • Execution context (PIDs, threads, fibers, domains)
    • Request/session tracking (IDs for correlation)
    • Source location (file, line, function, module)
    • Performance metrics (duration, memory, CPU)
    • Network context (IPs, ports, methods, status codes)
    • User context (user ID, tenant ID)
    • System context (OS, versions, environment, containers)
    • Flexible metadata (JSON fields for tags, labels, custom data)
  2. Full-Text Search: Using SQLite's FTS5 with porter stemming and unicode support for comprehensive search across all text fields.

  3. Pattern Detection:

    • Normalizes messages by replacing numbers with "N", hex with "0xHEX", UUIDs with "UUID" (see the sketch after this list)
    • Tracks patterns with occurrence counts, severity scores, and consultation history
    • Time-series bucketing for trend analysis
  4. Logs Reporter Integration:

    • Implements OCaml Logs reporter interface
    • Can chain with existing reporters
    • Thread-safe with Eio mutexes
    • Captures Eio fiber IDs and domain IDs
  5. CLI Tool: Colorful terminal interface using Fmt with:

    • list - Browse logs with filters
    • search - Full-text search
    • patterns - View detected patterns
    • errors - Recent error summary
    • stats - Database statistics
    • export - Export for analysis (JSON, CSV, Claude format)
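
A sketch of the normalization step from the Pattern Detection item above, using the stdlib Str library (Re would work equally well); the regexes are rough approximations, and the order matters so hex values and UUIDs are rewritten before bare digits:

    (* Collapse the variable parts of a message so identical failures
       share a single pattern key. *)
    let normalize msg =
      msg
      |> Str.global_replace (Str.regexp "0x[0-9a-fA-F]+") "0xHEX"
      |> Str.global_replace
           (Str.regexp
              "[0-9a-fA-F]+-[0-9a-fA-F]+-[0-9a-fA-F]+-[0-9a-fA-F]+-[0-9a-fA-F]+")
           "UUID"
      |> Str.global_replace (Str.regexp "[0-9]+") "N"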

Key Learnings#

  1. No Truncation: Keep all data intact for Claude to analyze. Storage is cheap, context is valuable.

  2. Structured Everything: The more structure we capture, the better Claude can understand patterns and correlations.

  3. Synchronous is Fine: For now, synchronous SQLite operations are acceptable. Performance optimization can come later.

  4. Reporter Chaining: Following Logs library conventions by supporting reporter chaining allows integration with existing logging infrastructure.

  5. Export for Claude: Special export format that prepares context within token limits, focusing on recent errors and patterns.

Usage Example#

(* Initialize the database *)
let db = Dancer_logs.init ~path:"app_logs.db" () in

(* Create and set the reporter; chain_reporter wraps an existing reporter
   so normal console output is preserved alongside the database *)
let reporter = Dancer_logs.chain_reporter db (Logs_fmt.reporter ()) in
Logs.set_reporter reporter;

(* Use with context *)
let ctx = Dancer_logs.Context.(
  empty_context
  |> with_session "session-123"
  |> with_request "req-456"
  |> with_user "user-789"
  |> add_tag "payment"
  |> add_label "environment" "production"
) in

(* Log with full context *)
Dancer_logs.log db
  ~level:Logs.Error
  ~source:"Payment.Gateway"
  ~message:"Payment failed: timeout"
  ~context:ctx
  ~error:{
    error_type = Some "TimeoutException";
    error_code = Some "GATEWAY_TIMEOUT";
    error_hash = Some (hash_of_error);
    stack_trace = Some (Printexc.get_backtrace ());
  }
  ~performance:{
    duration_ms = Some 30000.0;
    memory_before = Some 1000000;
    memory_after = Some 1200000;
    cpu_time_ms = Some 50.0;
  }
  ();

CLI Usage#

# View recent errors
dancer-logs errors

# Search for specific patterns
dancer-logs search "connection refused"

# View error patterns
dancer-logs patterns --min-count 10

# Export context for Claude
dancer-logs export -f claude -o context.txt --since 24

# Follow logs in real-time (to be implemented)
dancer-logs list -f

Next Steps#

  1. Implement pattern detection algorithm that runs periodically
  2. Add metrics collection from /proc and system calls
  3. Implement log following mode for real-time monitoring
  4. Add session correlation views
  5. Create integration tests with sample applications