My agentic slop goes here. Not intended for anyone else!

I want to design an OCaml library that builds in support for modifying the program linked to it using Claude Code, and restarting itself with the fixes automatically. The idea is for long-running services to regularly consult Claude (either on a fixed timetable, or urgently if something really unexpected happens) and improve their own functionality. Claude should be used to analyse patterns in the logs and determine whether to write code to handle a particular case. Claude should not be used directly in the application datapath itself; its role is to write code, not to serve requests.

To make this work, the program needs to emit sufficient tracing data to be useful to Claude when it does an inspection, but not so much that it overwhelms the context window. Therefore, the first thing the library needs is a mechanism to intercept the program's logging output at a suitable granularity. The OCaml "logs" library is a good thing to standardise on here. It's also fine to use the OCaml direct-style Eio library for all interactions.

Assume the code is running in a Linux environment with root-level access. There will also be a Zulip server available, with an API key that can be used to post messages and otherwise interact with it.

This is an ambitious project, so before embarking on it, I need to think really carefully about the design and tradeoffs, including seeking clarification where necessary about what sorts of MCP servers or other support infrastructure will be useful to making the library successful. I'm ok taking risks and trying unusual approaches. The library will be called "Dancer" after the Hunter S Thompson quote "We're raising a generation of dancers, afraid to take one step out of line."

Architecture Design (v1)#

Core Components#

  1. Log Interceptor & Buffer

    • Hook into OCaml Logs library at the reporter level
    • Maintain a persistent buffer on disk, perhaps in SQLite, for analysis
    • Group consecutive identical errors with count
    • Tag logs with timestamp, module, and error type
  2. Pattern Detector

    • Track log messages and their frequency
    • Use string matching to identify recurring patterns
    • Maintain a simple SQLite database of seen patterns
    • Trigger Claude consultation when (see the decision sketch after this list):
      • New error pattern appears frequently (>10 times in 5 min)
      • Error rate spikes above baseline
      • Scheduled review (e.g., every 6 hours)
  3. Claude Consultation Manager

    • Prepare context: recent logs + relevant source files
    • Ask Claude to:
      • Analyze the error pattern
      • Generate OCaml code to handle the case
      • Suggest where to integrate the fix
      • Test the fixes and trial a deployment
    • Store Claude's response and proposed changes
  4. Version Control Integration

    • Each Claude fix creates a new git branch: dancer/fix-<timestamp>-<error-hash>
    • Use git worktrees for isolated changes:
      git worktree add ../dancer-fix-<id> -b dancer/fix-<id>
      
    • Apply Claude's changes in the worktree and include a changelog in the commits
    • Compile and test in isolation
    • If successful, merge to main and restart application
    • Provide a script that finds all fix branches and updates a central changelog, ordered by time and suitable for regular human review
  5. Restart Orchestration

    • The library includes a supervisor that manages the application process itself
    • Graceful shutdown: finish current requests with a timeout
    • State persistence before restart (if needed)
    • Automatic rollback to the previous successful binary if the restart fails
    • Health check after restart
  6. Zulip Integration

    • Post proposed changes for human review
    • Emergency stop command
    • Status updates on consultations
    • Performance metrics before/after changes
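
Component 6 can start as a thin wrapper around Zulip's standard /api/v1/messages endpoint. A minimal sketch, shelling out to curl via Eio's process API (Eio >= 0.12); the server URL, bot email, stream and topic are placeholder values, and post_to_zulip is a hypothetical helper rather than an existing API:

    (* Post a message to a Zulip stream by invoking curl with basic auth. *)
    let post_to_zulip ~proc_mgr ~api_key ~content =
      Eio.Process.run proc_mgr
        [ "curl"; "-sS"; "https://zulip.example.com/api/v1/messages";
          "-u"; "dancer-bot@example.com:" ^ api_key;
          "--data-urlencode"; "type=stream";
          "--data-urlencode"; "to=dancer";
          "--data-urlencode"; "topic=proposed-fixes";
          "--data-urlencode"; "content=" ^ content ]

A real HTTP client (e.g. cohttp-eio) can replace curl later without changing the endpoint or form fields.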
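
For component 2, the consultation trigger can be a pure predicate over per-pattern statistics, which keeps it easy to unit-test. A minimal sketch mirroring the thresholds above; the pattern_stats fields are assumptions about what the pattern database would store:

    type pattern_stats = {
      count_in_window : int;             (* occurrences in the last 5 minutes *)
      baseline_rate : float;             (* long-run errors per minute *)
      last_consultation : float option;  (* unix time of the last review *)
    }

    let should_consult ~now ~scheduled_interval p =
      let window_rate = float_of_int p.count_in_window /. 5.0 in
      let frequent = p.count_in_window > 10 in             (* >10 in 5 min *)
      let spike = window_rate > 2.0 *. p.baseline_rate in  (* rate spike *)
      let scheduled =
        match p.last_consultation with
        | None -> true
        | Some t -> now -. t >= scheduled_interval         (* e.g. 6 hours *)
      in
      frequent || spike || scheduled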

Git Workflow Design#

  1. Branch Strategy

    main (production code)
    ├── dancer/fix-2024-01-15-1200-auth-error
    ├── dancer/fix-2024-01-15-1800-timeout-handler
    └── dancer/rollback-2024-01-15-1900 (if needed)
    
  2. Worktree Management

    • Base directory: /var/dancer/worktrees/
    • Each fix gets its own worktree
    • Clean up old worktrees after successful merge
    • Keep failed attempts for analysis
  3. Change Process

    type fix_status =
      | Proposed
      | Testing
      | Approved
      | Deployed
      | Rolled_back
    
    type fix_record = {
      id: string;
      branch: string;
      worktree: string;
      error_pattern: string;
      claude_solution: string;
      test_results: string option;
      status: fix_status;
      created_at: float;
    }
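
A sketch of how dancer-git might tie the worktree commands above to the fix_record type, again shelling out via Eio's process API; create_fix_worktree is a hypothetical helper and the naming/hashing scheme is just one plausible choice (timestamps come from the unix library):

    (* Create an isolated worktree and branch for a proposed fix. *)
    let create_fix_worktree ~proc_mgr ~worktree_base ~error_pattern =
      let hash = String.sub (Digest.to_hex (Digest.string error_pattern)) 0 8 in
      let id = Printf.sprintf "%.0f-%s" (Unix.time ()) hash in
      let branch = "dancer/fix-" ^ id in
      let worktree = Filename.concat worktree_base ("dancer-fix-" ^ id) in
      Eio.Process.run proc_mgr
        [ "git"; "worktree"; "add"; worktree; "-b"; branch ];
      { id; branch; worktree; error_pattern;
        claude_solution = "";            (* filled in after consultation *)
        test_results = None;
        status = Proposed;
        created_at = Unix.time () }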
    

Simplified Log Management#

  1. Log Format

    type log_entry = {
      timestamp: float;
      level: Logs.level;
      source: string; (* module name *)
      message: string;
      error_type: string option;
      stack_trace: string option;
    }
    
  2. Context Preparation for Claude

    • Last 500 lines of logs
    • Error frequency summary
    • Relevant source file (where error originated)
    • Previous fix attempts for similar errors
    • System metrics (CPU, memory, request rate)
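
A sketch of the context assembly step: it concatenates the pieces listed above and trims from the front when the character budget (the max_context_size setting) is exceeded, so the later sections survive. build_context is a hypothetical helper; the section ordering and trimming policy are arbitrary choices:

    let build_context ~max_context_size ~frequency_summary ~recent_logs
        ~source_excerpt ~previous_fixes ~metrics =
      let section title body = Printf.sprintf "## %s\n%s\n" title body in
      let full =
        String.concat "\n"
          [ section "Error frequency summary" frequency_summary;
            section "Recent logs (most recent last)" recent_logs;
            section "Relevant source" source_excerpt;
            section "Previous fix attempts" previous_fixes;
            section "System metrics" metrics ]
      in
      if String.length full <= max_context_size then full
      else
        (* drop leading content to stay within the character budget *)
        String.sub full (String.length full - max_context_size) max_context_size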

Restart Safety Mechanisms#

  1. Pre-Restart Checks

    • Compile the modified code
    • Run unit tests if available
    • Check syntax with ocamlc -i
    • Verify no obvious issues (type errors, unused bindings, etc.)
  2. Restart Process

    # Save current version
    git tag dancer-before-$(date +%s)
    
    # Merge fix
    git merge --no-ff dancer/fix-<id>
    
    # Rebuild
    dune build
    
    # Graceful restart
    systemctl reload dancer-service || systemctl restart dancer-service
    
    # Health check
    ./health_check.sh || git reset --hard dancer-before-<timestamp>
    
  3. Rollback Triggers

    • Service fails to start
    • Health check fails after restart
    • Error rate increases by >50%
    • Memory usage spikes
    • Manual intervention via Zulip
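
These triggers reduce to a simple predicate over measurements taken during a post-restart observation window. A sketch; the post_restart_report fields are assumptions about what the observability layer would collect:

    type post_restart_report = {
      service_started : bool;
      health_check_ok : bool;
      error_rate_before : float;   (* errors/min before the fix *)
      error_rate_after : float;    (* errors/min after the restart *)
      rss_bytes : int;             (* resident memory after restart *)
      rss_limit_bytes : int;       (* configured memory ceiling *)
      manual_stop : bool;          (* emergency stop issued via Zulip *)
    }

    let should_roll_back r =
      (not r.service_started)
      || (not r.health_check_ok)
      || r.error_rate_after > 1.5 *. r.error_rate_before   (* >50% increase *)
      || r.rss_bytes > r.rss_limit_bytes
      || r.manual_stop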

MCP Server Requirements (Simplified)#

  1. Git Server

    • Local git repository with remote backup
    • Web interface for viewing changes
    • Webhook support for CI integration
  2. Monitoring Server

    • Simple metrics collection (Prometheus/Grafana)
    • Log aggregation (just file-based initially)
    • Alert routing to Zulip
  3. Claude API Gateway

    • Rate limiting
    • Cost tracking
    • Request/response logging
    • Fallback to manual mode if quota exceeded

Implementation Phases (Simplified)#

Phase 1: Core Infrastructure (Week 1-2)

  • Log interception and buffering
  • Basic error pattern detection
  • Git worktree management
  • Manual Claude consultation

Phase 2: Automation (Week 3-4)

  • Automatic Claude triggers
  • Code generation and application
  • Restart orchestration
  • Basic safety checks

Phase 3: Monitoring & Safety (Week 5-6)

  • Zulip integration
  • Rollback mechanisms
  • Performance tracking
  • Cost management

Example Usage Flow#

  1. Error Detection

    (* Application code *)
    Logs.err (fun m -> m "Database connection failed: %s" error_msg);
    (* This error happens 20 times in 2 minutes *)
    
  2. Claude Consultation

    Context: Database connection errors occurring frequently
    Pattern: "Database connection failed: Connection refused"
    
    Claude generates:
    - Exponential backoff retry logic
    - Connection pool management
    - Fallback to cached data
    
  3. Version Control

    git worktree add ../dancer-fix-db-conn -b dancer/fix-db-conn
    cd ../dancer-fix-db-conn
    # Apply Claude's changes
    dune build
    # If successful, merge and restart
    
  4. Deployment

    git checkout main
    git merge dancer/fix-db-conn
    systemctl restart dancer-service
    # Monitor for 5 minutes
    # If stable, cleanup worktree
    

Data Structures#

module Dancer = struct
  type config = {
    claude_api_key: string;
    zulip_api_key: string;
    zulip_stream: string;
    max_context_size: int; (* chars to send to Claude *)
    consultation_cooldown: float; (* seconds between consultations *)
    error_threshold: int; (* errors before triggering *)
    restart_timeout: float; (* max seconds for restart *)
    worktree_base: string; (* base directory for git worktrees *)
  }
  
  type consultation_request = {
    pattern: string;
    occurrences: int;
    timespan: float;
    recent_logs: string;
    source_context: string option;
  }
  
  type consultation_response = {
    analysis: string;
    proposed_fix: string;
    target_file: string;
    confidence: float;
  }
end
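
A sketch of how these types might drive the top-level loop. The detect_pattern, consult_claude, apply_and_deploy and sleep arguments are stand-ins for the components described earlier, and the 0.8 confidence cut-off is an arbitrary placeholder:

    let rec supervise (cfg : Dancer.config)
        ~detect_pattern ~consult_claude ~apply_and_deploy ~sleep =
      (match detect_pattern ~threshold:cfg.error_threshold with
       | None -> ()
       | Some (req : Dancer.consultation_request) ->
         let (resp : Dancer.consultation_response) = consult_claude cfg req in
         if resp.confidence >= 0.8 then apply_and_deploy cfg resp);
      sleep cfg.consultation_cooldown;
      supervise cfg ~detect_pattern ~consult_claude ~apply_and_deploy ~sleep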

Key Simplifications from Original Design#

  1. No Dynamic Linking - Just restart the process
  2. Simple Pattern Matching - String comparison, no bloom filters
  3. Basic Git Workflow - Branches and worktrees, no complex versioning
  4. Minimal Infrastructure - SQLite instead of complex databases
  5. Simple Rollback - Git reset instead of sophisticated mechanisms
  6. Direct Process Restart - Using systemd/supervisor instead of hot-reload
  7. File-Based Logs - No complex log aggregation initially
  8. Manual Approval Option - Human can review via Zulip before deploy

Library Decomposition Plan#

Core Libraries#

  1. dancer-logs - Log interception and buffering

    • Hook into OCaml Logs reporter
    • SQLite-backed circular buffer
    • Pattern normalization
    • Standalone testable
  2. dancer-patterns - Pattern detection and tracking

    • Error pattern recognition
    • Frequency/acceleration tracking
    • Pattern database management
    • Trigger decision logic
  3. dancer-claude - Claude CLI integration (see the invocation sketch after this list)

    • Prompt construction
    • Response parsing
    • Context preparation
    • Token cost tracking
  4. dancer-git - Git worktree management

    • Worktree creation/cleanup
    • Branch management
    • Safe merging operations
    • Rollback capabilities
  5. dancer-test - Alcotest generation

    • Test template generation
    • Test execution in worktrees
    • Result parsing
    • Coverage tracking
  6. dancer-process - Process management

    • Tmux orchestration
    • Service restart logic
    • Health checking
    • Graceful shutdown
  7. dancer-observe - Observability

    • Metrics collection
    • SQLite time-series storage
    • Anomaly detection
    • Audit trail management
  8. dancer-spec - Service specification

    • YAML spec parsing
    • Constraint validation
    • Fix validation against spec
    • Schema enforcement
  9. dancer-deploy - Deployment pipeline

    • Staging environment setup
    • Promotion criteria evaluation
    • Production deployment
    • Rollback orchestration
  10. dancer-ui - Human oversight interfaces

    • Web dashboard (Dream)
    • Terminal UI (Nottui)
    • WebSocket live updates
    • Audit log viewer
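
For dancer-claude, a minimal sketch of the CLI round-trip using Eio's process API; it assumes the Claude Code binary is on PATH, uses its non-interactive print mode (claude -p), and relies on Eio.Process.parse_out (Eio >= 0.12):

    (* Run one prompt through the Claude Code CLI and capture stdout. *)
    let consult ~proc_mgr ~prompt =
      Eio.Process.parse_out proc_mgr Eio.Buf_read.take_all
        [ "claude"; "-p"; prompt ]

    let () =
      Eio_main.run @@ fun env ->
      let proc_mgr = Eio.Stdenv.process_mgr env in
      print_string (consult ~proc_mgr ~prompt:"Summarise these errors: ...")

Prompt construction and response parsing sit on either side of this call; asking for structured output (for example a unified diff) keeps the parsing side simple.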

Implementation Order#

Phase 1: Foundation (Week 1)

  1. dancer-logs - Need log data first
  2. dancer-patterns - Pattern detection on logs
  3. dancer-observe - Basic metrics/storage

Phase 2: Claude Integration (Week 2)

  4. dancer-claude - Claude consultation
  5. dancer-spec - Service constraints
  6. dancer-test - Test generation

Phase 3: Deployment (Week 3)

  7. dancer-git - Worktree management
  8. dancer-process - Process control
  9. dancer-deploy - Staging/production

Phase 4: Oversight (Week 4)

  10. dancer-ui - Dashboard and monitoring

dancer-logs Implementation Details#

Design Decisions#

  1. Comprehensive Schema: The SQLite schema captures everything that might be useful for Claude's analysis:

    • Core fields (timestamp, level, source, message)
    • Error classification (type, code, hash, stack trace)
    • Execution context (PIDs, threads, fibers, domains)
    • Request/session tracking (IDs for correlation)
    • Source location (file, line, function, module)
    • Performance metrics (duration, memory, CPU)
    • Network context (IPs, ports, methods, status codes)
    • User context (user ID, tenant ID)
    • System context (OS, versions, environment, containers)
    • Flexible metadata (JSON fields for tags, labels, custom data)
  2. Full-Text Search: Using SQLite's FTS5 with porter stemming and unicode support for comprehensive search across all text fields.

  3. Pattern Detection:

    • Normalizes messages by replacing numbers with "N", hex with "0xHEX", UUIDs with "UUID" (see the sketch after this list)
    • Tracks patterns with occurrence counts, severity scores, and consultation history
    • Time-series bucketing for trend analysis
  4. Logs Reporter Integration:

    • Implements OCaml Logs reporter interface
    • Can chain with existing reporters
    • Thread-safe with Eio mutexes
    • Captures Eio fiber IDs and domain IDs
  5. CLI Tool: Colorful terminal interface using Fmt with:

    • list - Browse logs with filters
    • search - Full-text search
    • patterns - View detected patterns
    • errors - Recent error summary
    • stats - Database statistics
    • export - Export for analysis (JSON, CSV, Claude format)
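
A sketch of the normalization step from the Pattern Detection item above, using the stdlib Str library (Re would work equally well); the regexes are rough approximations, and the order matters so hex values and UUIDs are rewritten before bare digits:

    (* Collapse the variable parts of a message so identical failures
       share a single pattern key. *)
    let normalize msg =
      msg
      |> Str.global_replace (Str.regexp "0x[0-9a-fA-F]+") "0xHEX"
      |> Str.global_replace
           (Str.regexp
              "[0-9a-fA-F]+-[0-9a-fA-F]+-[0-9a-fA-F]+-[0-9a-fA-F]+-[0-9a-fA-F]+")
           "UUID"
      |> Str.global_replace (Str.regexp "[0-9]+") "N"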

Key Learnings#

  1. No Truncation: Keep all data intact for Claude to analyze. Storage is cheap, context is valuable.

  2. Structured Everything: The more structure we capture, the better Claude can understand patterns and correlations.

  3. Synchronous is Fine: For now, synchronous SQLite operations are acceptable. Performance optimization can come later.

  4. Reporter Chaining: Following Logs library conventions by supporting reporter chaining allows integration with existing logging infrastructure.

  5. Export for Claude: Special export format that prepares context within token limits, focusing on recent errors and patterns.

Usage Example#

(* Initialize the database *)
let db = Dancer_logs.init ~path:"app_logs.db" () in

(* Create and set the reporter; chain_reporter wraps an existing reporter
   so normal console output is preserved alongside the database *)
let reporter = Dancer_logs.chain_reporter db (Logs_fmt.reporter ()) in
Logs.set_reporter reporter;

(* Use with context *)
let ctx = Dancer_logs.Context.(
  empty_context
  |> with_session "session-123"
  |> with_request "req-456"
  |> with_user "user-789"
  |> add_tag "payment"
  |> add_label "environment" "production"
) in

(* Log with full context *)
Dancer_logs.log db
  ~level:Logs.Error
  ~source:"Payment.Gateway"
  ~message:"Payment failed: timeout"
  ~context:ctx
  ~error:{
    error_type = Some "TimeoutException";
    error_code = Some "GATEWAY_TIMEOUT";
    error_hash = Some (hash_of_error);
    stack_trace = Some (Printexc.get_backtrace ());
  }
  ~performance:{
    duration_ms = Some 30000.0;
    memory_before = Some 1000000;
    memory_after = Some 1200000;
    cpu_time_ms = Some 50.0;
  }
  ();

CLI Usage#

# View recent errors
dancer-logs errors

# Search for specific patterns
dancer-logs search "connection refused"

# View error patterns
dancer-logs patterns --min-count 10

# Export context for Claude
dancer-logs export -f claude -o context.txt --since 24

# Follow logs in real-time (to be implemented)
dancer-logs list -f

Next Steps#

  1. Implement pattern detection algorithm that runs periodically
  2. Add metrics collection from /proc and system calls
  3. Implement log following mode for real-time monitoring
  4. Add session correlation views
  5. Create integration tests with sample applications