A community-based topic aggregation platform built on atproto

Alpha Go-Live Readiness PRD

Status: Pre-Alpha → E2E Testing Complete 🎉
Target: Alpha launch with real users
Last Updated: 2025-11-16

🎯 Major Progress Update

✅ ALL E2E TESTS COMPLETE! (Completed 2025-11-16)
✅ BIDIRECTIONAL DID VERIFICATION COMPLETE! (Completed 2025-11-16)

All 6 critical E2E test suites have been implemented and are passing:

  • ✅ Full User Journey (signup → community → post → comment → vote)
  • ✅ Blob Upload (image uploads, PDS integration, validation)
  • ✅ Multi-Community Timeline (feed aggregation, sorting, pagination)
  • ✅ Concurrent Scenarios (race condition testing with database verification)
  • ✅ Rate Limiting (100 req/min general, 20 req/min comments, 10 posts/hour aggregators)
  • ✅ Error Recovery (Jetstream retry, PDS unavailability, malformed events)

Time Saved: ~7-12 hours through parallel agent implementation
Test Quality: Enhanced with comprehensive database record verification to catch race conditions

Production Deployment Requirements

Architecture:

  • AppView Domain: coves.social (instance identity, API, frontend)
  • PDS Domain: coves.me (separate domain required - cannot be the same as the AppView domain)
  • Community Handles: Use @coves.social (AppView domain)
  • Jetstream: Connects to Bluesky's production firehose (wss://jetstream2.us-east.bsky.network)

Required: .well-known/did.json at coves.social:

{
  "id": "did:web:coves.social",
  "alsoKnownAs": ["at://coves.social"],
  "verificationMethod": [
    {
      "id": "did:web:coves.social#atproto",
      "type": "Multikey",
      "controller": "did:web:coves.social",
      "publicKeyMultibase": "z..."
    }
  ],
  "service": [
    {
      "id": "#atproto_pds",
      "type": "AtprotoPersonalDataServer",
      "serviceEndpoint": "https://coves.me"
    }
  ]
}

Environment Variables:

  • AppView:
    • INSTANCE_DID=did:web:coves.social
    • INSTANCE_DOMAIN=coves.social
    • PDS_URL=https://coves.me (separate domain)
    • SKIP_DID_WEB_VERIFICATION=false (production)
    • JETSTREAM_URL=wss://jetstream2.us-east.bsky.network/subscribe
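A pre-deployment check of these values could be sketched as follows. This is a minimal illustration, not the project's code: the function name `validateProductionEnv` is hypothetical, and the `wss://` requirement for `JETSTREAM_URL` is an assumption drawn from the example value above.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// validateProductionEnv (hypothetical helper) returns a list of problems found
// in the AppView settings. It takes a map instead of reading os.Getenv so the
// check can run without touching the real environment.
func validateProductionEnv(env map[string]string) []string {
	var problems []string

	domain := env["INSTANCE_DOMAIN"]
	if domain == "" {
		problems = append(problems, "INSTANCE_DOMAIN is not set")
	}
	if want := "did:web:" + domain; env["INSTANCE_DID"] != want {
		problems = append(problems, fmt.Sprintf("INSTANCE_DID should be %q", want))
	}

	// The PDS must live on a different domain than the AppView.
	pds, err := url.Parse(env["PDS_URL"])
	switch {
	case err != nil || pds.Hostname() == "":
		problems = append(problems, "PDS_URL is not a valid URL")
	case pds.Scheme != "https":
		problems = append(problems, "PDS_URL must use https")
	case strings.EqualFold(pds.Hostname(), domain):
		problems = append(problems, "PDS_URL must be on a separate domain from the AppView")
	}

	if env["SKIP_DID_WEB_VERIFICATION"] != "false" {
		problems = append(problems, "SKIP_DID_WEB_VERIFICATION must be false in production")
	}
	// Assumption: Jetstream is always reached over a TLS WebSocket.
	if !strings.HasPrefix(env["JETSTREAM_URL"], "wss://") {
		problems = append(problems, "JETSTREAM_URL must be a wss:// URL")
	}
	return problems
}

func main() {
	env := map[string]string{
		"INSTANCE_DID":              "did:web:coves.social",
		"INSTANCE_DOMAIN":           "coves.social",
		"PDS_URL":                   "https://coves.social", // deliberately wrong: same domain as the AppView
		"SKIP_DID_WEB_VERIFICATION": "false",
		"JETSTREAM_URL":             "wss://jetstream2.us-east.bsky.network/subscribe",
	}
	for _, p := range validateProductionEnv(env) {
		fmt.Println("config problem:", p)
	}
}
```

Running a check like this at startup would turn a misconfigured PDS domain into an immediate failure rather than a confusing runtime error.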

Verification:

  • curl https://coves.social/.well-known/did.json (should return DID document)
  • curl https://coves.me/xrpc/_health (PDS health check)

Overview

This document tracks the remaining work required to launch Coves alpha with real users. Focus is on critical functionality, security, and operational readiness.


P0: Critical Blockers (Must Complete Before Alpha)

1. Authentication & Security

Production PDS Deployment

CRITICAL: PDS must be on a separate domain from AppView (coves.me, not coves.social)

  • Deploy PDS to coves.me domain
    • Set up DNS: A record for coves.me → server IP
    • Configure SSL certificate for coves.me
    • Deploy PDS container/service on port 2583
    • Configure nginx/Caddy reverse proxy for coves.me → localhost:2583
    • Set PDS_HOSTNAME=coves.me in PDS environment
    • Mount persistent volume for PDS data (/pds/data)
  • Verify PDS connectivity
    • Test: curl https://coves.me/xrpc/_health
    • Create test community account on PDS
    • Verify JWKS endpoint: curl https://coves.me/.well-known/jwks.json
    • Test community account token provisioning
  • Configure AppView to use production PDS
    • Set PDS_URL=https://coves.me in AppView .env
    • Test community creation flow (provisions account on coves.me)
    • Verify account provisioning works end-to-end

Important: Jetstream connects to Bluesky's production firehose, which automatically includes events from all production PDS instances (including coves.me once it's live).

Estimated Effort: 4-6 hours
Risk: Medium (infrastructure setup, DNS propagation)

JWT Signature Verification (Production Mode)

  • Test with production PDS at coves.me
    • Verify JWKS endpoint is accessible: https://coves.me/.well-known/jwks.json
    • Run TestJWTSignatureVerification against production PDS
    • Confirm signature verification succeeds
    • Test token refresh flow for community accounts
  • Set AUTH_SKIP_VERIFY=false in production environment
  • Verify all auth middleware tests pass with verification enabled

Estimated Effort: 2-3 hours
Risk: Low (depends on PDS deployment)

did:web Verification ✅ COMPLETE

  • Complete did:web domain verification implementation (2025-11-16)
  • Implement Bluesky-compatible bidirectional verification
  • Add alsoKnownAs field verification in DID documents
  • Add security logging for verification failures
  • Update cache TTL to 24h (matches Bluesky recommendations)
  • Comprehensive test coverage with mock HTTP servers
  • Set SKIP_DID_WEB_VERIFICATION=false for production (dev default: true)
  • Deploy .well-known/did.json to production domain

Implementation Details:

  • Location: internal/atproto/jetstream/community_consumer.go
  • Verification Flow: Domain matching + DID document fetch + alsoKnownAs validation
  • Security Model: Matches Bluesky (DNS/HTTPS authority + bidirectional binding)
  • Performance: Bounded LRU cache (1000 entries), rate limiting (10 req/s), 24h TTL
  • Impact: AppView indexing and federation trust (not community creation API)
  • Tests: tests/integration/community_hostedby_security_test.go
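The verification flow listed above (domain matching, DID document fetch, alsoKnownAs validation) can be illustrated with a small sketch. It is not the code in community_consumer.go: the function name `verifyHostedBy`, the rule that a community handle must sit on or under the instance domain, and the example handle are assumptions, and the HTTP fetch, cache, and rate limiting are left out so the logic stays visible.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"strings"
)

// didDocument holds only the fields this sketch inspects.
type didDocument struct {
	ID          string   `json:"id"`
	AlsoKnownAs []string `json:"alsoKnownAs"`
}

// verifyHostedBy (hypothetical name) checks that a did:web instance really
// vouches for the domain a community handle claims. rawDoc is the body that
// would be fetched from https://<domain>/.well-known/did.json.
func verifyHostedBy(hostedByDID, handle string, rawDoc []byte) error {
	const prefix = "did:web:"
	if !strings.HasPrefix(hostedByDID, prefix) || len(hostedByDID) == len(prefix) {
		return errors.New("hostedBy is not a did:web identifier")
	}
	domain := strings.TrimPrefix(hostedByDID, prefix)

	// Step 1: domain matching (assumed rule: handle equals or is under the domain).
	if handle != domain && !strings.HasSuffix(handle, "."+domain) {
		return fmt.Errorf("handle %q is not under domain %q", handle, domain)
	}

	// Step 2: the fetched document must describe the same DID.
	var doc didDocument
	if err := json.Unmarshal(rawDoc, &doc); err != nil {
		return fmt.Errorf("parse DID document: %w", err)
	}
	if doc.ID != hostedByDID {
		return fmt.Errorf("DID document id %q does not match %q", doc.ID, hostedByDID)
	}

	// Step 3: bidirectional binding, the document must point back at the domain.
	for _, aka := range doc.AlsoKnownAs {
		if aka == "at://"+domain {
			return nil
		}
	}
	return fmt.Errorf("alsoKnownAs does not contain at://%s", domain)
}

func main() {
	doc := []byte(`{"id":"did:web:coves.social","alsoKnownAs":["at://coves.social"]}`)
	// "gaming.coves.social" is a made-up handle used only for illustration.
	fmt.Println(verifyHostedBy("did:web:coves.social", "gaming.coves.social", doc))
	fmt.Println(verifyHostedBy("did:web:coves.social", "gaming.example.com", doc))
}
```

The third step is what makes the check bidirectional: the domain serves the document, and the document names the domain back.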

Actual Effort: 3 hours (implementation + testing)
Risk: ✅ Low (complete and tested)

2. DPoP Token Architecture Fix

Problem: The backend attempts to write subscriptions/blocks to the user's PDS using DPoP-bound tokens, which fails with "Malformed token".

Remove Write-Forward Code

  • Remove write-forward from SubscribeToCommunity handler
  • Remove write-forward from UnsubscribeFromCommunity handler
  • Remove write-forward from BlockCommunity handler
  • Remove write-forward from UnblockCommunity handler
  • Update handlers to return helpful error: "Write directly to your PDS"
  • Update API documentation to reflect client-write pattern
  • Verify Jetstream consumers still index correctly

Files:

  • internal/core/communities/service.go:564-816
  • internal/api/handlers/community/subscribe.go
  • internal/api/handlers/community/block.go

Estimated Effort: 3-4 hours
Risk: Low (similar to votes pattern)
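Once write-forwarding is removed, the four handlers only need to tell clients where to write. A minimal sketch of such a response is shown below, assuming an XRPC-style JSON error body. The handler name `clientWriteRequired`, the error code, the 400 status, and the collection name are all hypothetical choices for illustration, not the project's API.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// xrpcError follows the common XRPC error shape (error code plus message).
type xrpcError struct {
	Error   string `json:"error"`
	Message string `json:"message"`
}

// clientWriteRequired (hypothetical) replaces a write-forward handler with a
// response that points the client at its own PDS.
func clientWriteRequired(collection string) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "application/json")
		w.WriteHeader(http.StatusBadRequest) // status code is an assumption
		_ = json.NewEncoder(w).Encode(xrpcError{
			Error:   "ClientWriteRequired", // hypothetical error code
			Message: "Write directly to your PDS: create or delete the " + collection + " record from the client",
		})
	}
}

// callHandler exercises a handler in memory and decodes its response.
func callHandler(h http.Handler) (int, xrpcError, error) {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodPost, "/", nil))
	var body xrpcError
	err := json.NewDecoder(rec.Body).Decode(&body)
	return rec.Code, body, err
}

func main() {
	// The collection name is illustrative only.
	code, body, err := callHandler(clientWriteRequired("social.coves.community.subscription"))
	fmt.Println(code, body.Error, body.Message, err)
}
```

The AppView still learns about the subscription or block, because the record written by the client reaches it through Jetstream.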

P1: Important (Should Complete Before Alpha)

5. Post Read Operations

  • Implement getPost endpoint (single post retrieval)
  • Implement listPosts endpoint (with pagination)
  • Add post permalink support
  • Integration tests for post retrieval
  • Error handling for missing/deleted posts

Estimated Effort: 6-8 hours
Risk: Low
Note: Can defer if direct post linking is not needed initially

6. Production Infrastructure

Monitoring Setup

  • Add Prometheus metrics endpoints
    • HTTP request metrics (duration, status codes, paths)
    • Database query metrics (slow queries, connection pool)
    • Jetstream consumer metrics (events processed, lag, errors)
    • Auth metrics (token validations, failures)
  • Set up Grafana dashboards
    • Request rate and latency
    • Error rates by endpoint
    • Database performance
    • Jetstream consumer health
  • Configure alerting rules
    • High error rate (>5% 5xx responses)
    • Slow response time (p99 >1s)
    • Database connection pool exhaustion
    • Jetstream consumer lag >1 minute
    • PDS health check failures

Estimated Effort: 8-10 hours

Structured Logging

  • Replace log package with structured logger (zerolog or zap)
  • Add log levels (debug, info, warn, error)
  • JSON output format for production
  • Add request ID tracking
  • Add correlation IDs for async operations
  • Sanitize sensitive data from logs (passwords, tokens, emails)
  • Configure log rotation
  • Ship logs to aggregation service (optional: Loki, CloudWatch)

Estimated Effort: 6-8 hours
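As a rough illustration of the logging items above, the sketch below uses Go's standard `log/slog` package rather than zerolog or zap, purely to stay dependency-free; the same ideas carry over to either library. The set of sensitive keys, the `[REDACTED]` placeholder, and the helper names are assumptions.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"log/slog"
)

// sensitiveKeys lists attribute names whose values must never reach the logs.
var sensitiveKeys = map[string]bool{"password": true, "token": true, "email": true}

// newLogger builds a JSON logger with a minimum level and value redaction.
func newLogger(w io.Writer, level slog.Level) *slog.Logger {
	handler := slog.NewJSONHandler(w, &slog.HandlerOptions{
		Level: level,
		ReplaceAttr: func(groups []string, a slog.Attr) slog.Attr {
			if sensitiveKeys[a.Key] {
				return slog.String(a.Key, "[REDACTED]")
			}
			return a
		},
	})
	return slog.New(handler)
}

// captureSample logs one event into a buffer and returns the decoded fields,
// which makes the resulting JSON easy to inspect.
func captureSample() (map[string]any, error) {
	var buf bytes.Buffer
	// Attaching request_id once means every later line for this request carries it.
	logger := newLogger(&buf, slog.LevelInfo).With("request_id", "req-123")
	logger.Debug("dropped because the minimum level is info")
	logger.Info("login attempt", "handle", "alice.example", "token", "secret-jwt")

	var fields map[string]any
	err := json.Unmarshal(buf.Bytes(), &fields)
	return fields, err
}

func main() {
	fields, err := captureSample()
	if err != nil {
		fmt.Println("decode error:", err)
		return
	}
	fmt.Println(fields["level"], fields["msg"], fields["request_id"], fields["token"])
}
```

Redacting inside the handler, rather than at each call site, means a forgotten call site cannot leak a token.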

Database Backups

  • Automated PostgreSQL backups (daily minimum)
  • Backup retention policy (30 days)
  • Test restore procedure
  • Document backup/restore runbook
  • Off-site backup storage
  • Monitor backup success/failure
  • Point-in-time recovery (PITR) setup (optional)

Estimated Effort: 4-6 hours

Load Testing

  • Define load test scenarios
    • User signup and authentication
    • Community creation
    • Post creation and viewing
    • Feed retrieval (timeline, discover, community)
    • Comment creation and threading
    • Voting
  • Set target metrics
    • Concurrent users target (e.g., 100 concurrent)
    • Requests per second target
    • P95 latency target (<500ms)
    • Error rate target (<1%)
  • Run load tests with k6/Artillery/JMeter
  • Identify bottlenecks (database, CPU, memory)
  • Optimize slow queries
  • Add database indexes if needed
  • Test graceful degradation under load

Estimated Effort: 10-12 hours
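Checking load test results against the targets above comes down to a percentile and a ratio. The sketch below assumes latencies are collected as whole milliseconds and uses the nearest-rank percentile definition; tools such as k6 may interpolate differently, so their figures can vary slightly. The function names are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// percentile returns the p-th percentile (1-100) of the samples using the
// nearest-rank method: the value at position ceil(p/100 * n) in sorted order.
func percentile(samplesMs []int, p int) int {
	if len(samplesMs) == 0 {
		return 0
	}
	sorted := append([]int(nil), samplesMs...) // copy so the caller's slice is untouched
	sort.Ints(sorted)
	rank := (p*len(sorted) + 99) / 100 // integer ceiling of p*n/100
	if rank < 1 {
		rank = 1
	}
	return sorted[rank-1]
}

// meetsTargets applies the targets from the list above:
// P95 latency below 500 ms and an error rate below 1%.
func meetsTargets(samplesMs []int, failed, total int) bool {
	return total > 0 && percentile(samplesMs, 95) < 500 && failed*100 < total
}

func main() {
	latencies := []int{120, 80, 95, 450, 300, 210, 640, 150, 175, 90}
	fmt.Println("p95 (ms):", percentile(latencies, 95))
	fmt.Println("meets targets:", meetsTargets(latencies, 0, len(latencies)))
}
```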

Deployment Runbook

  • Document deployment procedure
    • Pre-deployment checklist
    • Database migration steps
    • Environment variable validation
    • Health check verification
    • Rollback procedure
  • Document operational procedures
    • How to check system health
    • How to read logs
    • How to check Jetstream consumer status
    • How to manually trigger community token refresh
    • How to clear caches
  • Document incident response
    • Who to contact
    • Escalation path
    • Common issues and fixes
    • Emergency procedures (PDS down, database down, etc.)
  • Create production environment checklist
    • Domain Setup
      • AppView domain (coves.social) DNS configured
      • PDS domain (coves.me) DNS configured - MUST be separate domain
      • SSL certificates for both domains
      • Nginx/Caddy reverse proxy configured for both domains
    • AppView Environment Variables
      • INSTANCE_DID=did:web:coves.social
      • INSTANCE_DOMAIN=coves.social
      • PDS_URL=https://coves.me (separate domain)
      • AUTH_SKIP_VERIFY=false
      • SKIP_DID_WEB_VERIFICATION=false
      • JETSTREAM_URL=wss://jetstream2.us-east.bsky.network/subscribe
    • PDS Environment Variables
      • PDS_HOSTNAME=coves.me
      • PDS_PORT=2583
      • Persistent storage mounted
    • Deployment Verification
      • Deploy .well-known/did.json to coves.social with serviceEndpoint: https://coves.me
      • Verify: curl https://coves.social/.well-known/did.json
      • Verify: curl https://coves.me/xrpc/_health
      • Database migrations applied
      • PDS connectivity verified from AppView
      • JWKS caching working
      • Jetstream consumer connected to Bluesky production firehose
      • Test community creation end-to-end
    • Monitoring and alerting active

Estimated Effort: 6-8 hours


P2: Nice to Have (Can Defer to Post-Alpha)

7. Post Update/Delete

  • Implement post update endpoint
  • Implement post delete endpoint
  • Jetstream consumer for UPDATE/DELETE events
  • Soft delete support

Estimated Effort: 4-6 hours

8. Community Delete

  • Implement community delete endpoint
  • Cascade delete considerations
  • Archive vs hard delete decision

Estimated Effort: 2-3 hours

9. Content Rules Validation

  • Implement text-only community enforcement
  • Implement allowed embed types validation
  • Content length limits

Estimated Effort: 6-8 hours

10. Search Functionality

  • Community search improvements
  • Post search
  • User search
  • Full-text search with PostgreSQL or external service

Estimated Effort: 8-10 hours


Testing Gaps

E2E Testing Recommendations

1. Full User Journey Test (CRITICAL) ✅ COMPLETE

What: Test complete user flow from signup to interaction
Why: No single test validates the entire happy path

  • Create test: Signup → Authenticate → Create Community → Create Post → Add Comment → Vote
  • Verify all data flows through Jetstream correctly
  • Verify counts update (vote counts, comment counts, subscriber counts)
  • Verify timeline feed shows posts from subscribed communities
  • Test with 2+ users interacting (user A posts, user B comments)
  • Real E2E with Docker infrastructure (PDS, Jetstream, PostgreSQL)
  • Graceful fallback for CI/CD environments

Actual Time: ~3 hours (agent-implemented)
Test Location: tests/integration/user_journey_e2e_test.go

2. Blob Upload E2E Test ✅ COMPLETE

What: Test image upload and display in posts
Why: No test validates the full blob upload → post → feed display flow

  • Create post with embedded image
  • Verify blob uploaded to PDS
  • Verify blob URL transformation in feed responses
  • Test multiple images in single post
  • Test image in comment
  • PDS health check (skips gracefully if PDS unavailable)
  • Mock server test (runs in all environments)
  • Comprehensive validation tests (empty data, MIME types, size limits)
  • Actual JPEG format testing (not just PNG with different MIME types)

Actual Time: ~2-3 hours (agent-implemented)
Test Location: tests/integration/blob_upload_e2e_test.go

3. Multi-Community Timeline Test ✅ COMPLETE

What: Test timeline feed with multiple community subscriptions
Why: Timeline logic may have edge cases with multiple sources

  • Create 3+ communities
  • Subscribe user to all communities
  • Create posts in each community
  • Verify timeline shows posts from all subscribed communities
  • Verify hot/top/new sorting across communities
  • Test pagination across multiple communities
  • Verify security (unsubscribed communities excluded)
  • Verify record schema compliance across communities

Actual Time: ~2 hours
Test Location: /tests/integration/timeline_test.go::TestGetTimeline_MultiCommunity_E2E

4. Concurrent User Scenarios ✅ COMPLETE

What: Test system behavior with simultaneous users
Why: Race conditions and locking issues only appear under concurrency

  • Multiple users voting on same post simultaneously (20-25 concurrent)
  • Multiple users commenting on same post simultaneously (25 concurrent)
  • Community creation with same handle (should fail) - verified UNIQUE constraint
  • Subscription race conditions (30 concurrent subscribers)
  • Enhanced with database record verification (detects duplicates/lost records)
  • Concurrent upvotes and downvotes (15 up + 10 down)
  • Concurrent replies to same comment (15 concurrent)
  • Concurrent subscribe/unsubscribe (20 users)

Actual Time: ~3 hours (agent-implemented) + 1 hour (race condition verification added)
Test Location: tests/integration/concurrent_scenarios_test.go
Finding: NO RACE CONDITIONS DETECTED - all tests pass with full database verification
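The record-verification idea behind these tests can be shown in miniature: fire duplicate writes concurrently, then count what was actually stored. The sketch below is not taken from concurrent_scenarios_test.go. It replaces PostgreSQL with a mutex-guarded map keyed by (voter, post), standing in for a UNIQUE constraint, and all names in it are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

// voteKey mirrors a UNIQUE (voter, post) constraint.
type voteKey struct{ voter, post string }

// voteStore is an in-memory stand-in for the votes table.
type voteStore struct {
	mu    sync.Mutex
	votes map[voteKey]string // value is "up" or "down"
}

// cast upserts a vote; repeating the same vote must not create a second record.
func (s *voteStore) cast(voter, post, direction string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.votes[voteKey{voter, post}] = direction
}

// count returns how many stored records match the post and direction.
func (s *voteStore) count(post, direction string) int {
	s.mu.Lock()
	defer s.mu.Unlock()
	n := 0
	for k, d := range s.votes {
		if k.post == post && d == direction {
			n++
		}
	}
	return n
}

// runConcurrentVotes has every voter submit the same vote twice at once,
// then reports the stored totals so duplicates or lost records show up.
func runConcurrentVotes(ups, downs int) (upCount, downCount int) {
	store := &voteStore{votes: make(map[voteKey]string)}
	var wg sync.WaitGroup
	vote := func(voter, direction string) {
		defer wg.Done()
		store.cast(voter, "post-1", direction)
	}
	for i := 0; i < ups+downs; i++ {
		direction := "up"
		if i >= ups {
			direction = "down"
		}
		voter := fmt.Sprintf("user-%d", i)
		wg.Add(2)
		go vote(voter, direction)
		go vote(voter, direction) // simulated double submission
	}
	wg.Wait()
	return store.count("post-1", "up"), store.count("post-1", "down")
}

func main() {
	up, down := runConcurrentVotes(15, 10)
	fmt.Printf("stored votes: %d up, %d down\n", up, down)
}
```

Counting stored records, rather than trusting the responses, is what exposes lost updates and duplicate rows.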

5. Rate Limiting Tests ✅ COMPLETE

What: Verify rate limits work correctly
Why: Protection against abuse

  • Test aggregator rate limits (10 posts/hour) - existing test verified
  • Test general endpoint rate limits (100 req/min)
  • Test comment rate limits (20 req/min)
  • Verify 429 responses
  • Verify rate limit headers (documented as not implemented - acceptable for Alpha)
  • Verify per-client isolation (IP-based rate limiting)
  • Verify X-Forwarded-For and X-Real-IP header support
  • Test rate limit reset behavior
  • Test thread-safety with concurrent requests
  • Test rate limiting across different HTTP methods

Actual Time: ~2 hours (agent-implemented)
Test Location: tests/e2e/ratelimit_e2e_test.go
Configuration Documented: All rate limits documented in comments (removed fake summary "test")
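For readers unfamiliar with the behavior under test, here is a minimal sketch of per-client, IP-based limiting with X-Forwarded-For and X-Real-IP support. It assumes a fixed one-minute window, which may not match the algorithm Coves uses, and every identifier in it is hypothetical.

```go
package main

import (
	"fmt"
	"net"
	"net/http"
	"strings"
	"sync"
	"time"
)

// window tracks one client's request count within the current period.
type window struct {
	start time.Time
	count int
}

// limiter is a fixed-window limiter keyed by client IP.
type limiter struct {
	mu      sync.Mutex
	limit   int
	period  time.Duration
	clients map[string]*window
}

func newPerMinuteLimiter(limit int) *limiter {
	return &limiter{limit: limit, period: time.Minute, clients: make(map[string]*window)}
}

// allow reports whether the client may make another request at time now.
func (l *limiter) allow(client string, now time.Time) bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	w, ok := l.clients[client]
	if !ok || now.Sub(w.start) >= l.period {
		l.clients[client] = &window{start: now, count: 1} // first request of a new window
		return true
	}
	if w.count >= l.limit {
		return false
	}
	w.count++
	return true
}

// clientIP prefers proxy headers, then falls back to the socket address.
// Trusting these headers is only safe behind a proxy that sets them.
func clientIP(r *http.Request) string {
	if xff := r.Header.Get("X-Forwarded-For"); xff != "" {
		first, _, _ := strings.Cut(xff, ",")
		return strings.TrimSpace(first)
	}
	if realIP := r.Header.Get("X-Real-IP"); realIP != "" {
		return realIP
	}
	host, _, err := net.SplitHostPort(r.RemoteAddr)
	if err != nil {
		return r.RemoteAddr
	}
	return host
}

// middleware rejects over-limit requests with 429 Too Many Requests.
func (l *limiter) middleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if !l.allow(clientIP(r), time.Now()) {
			http.Error(w, "rate limit exceeded", http.StatusTooManyRequests)
			return
		}
		next.ServeHTTP(w, r)
	})
}

// allowedInBurst counts how many of n back-to-back requests from one client pass.
func allowedInBurst(l *limiter, client string, n int) int {
	now, passed := time.Now(), 0
	for i := 0; i < n; i++ {
		if l.allow(client, now) {
			passed++
		}
	}
	return passed
}

func main() {
	general := newPerMinuteLimiter(100) // general endpoint limit from this document
	fmt.Println("client A allowed:", allowedInBurst(general, "203.0.113.7", 105))
	fmt.Println("client B allowed:", allowedInBurst(general, "203.0.113.8", 1))
}
```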

6. Error Recovery Tests ✅ COMPLETE

What: Test system recovery from failures
Why: Production will have failures

  • Jetstream connection retry on failure (renamed from "reconnection" for accuracy)
  • PDS temporarily unavailable during post creation (AppView continues indexing)
  • Database connection loss and recovery (connection pool auto-recovery)
  • Malformed Jetstream events (gracefully skipped, no crashes)
  • Out-of-order event handling (last-write-wins strategy)
  • Events processed correctly after connection established

Actual Time: ~2 hours (agent-implemented) + 30 min (test accuracy improvements)
Test Location: tests/e2e/error_recovery_test.go
Findings:

  • ✅ Automatic reconnection with 5s backoff
  • ✅ Circuit breaker pattern for external services
  • ✅ AppView can index without PDS availability
  • ⚠️ Note: Tests verify connection retry, not full reconnect-after-disconnect (requires mock WebSocket server)
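Connection retry with a fixed backoff, as described in the first finding, might look like the sketch below. It is an illustration under assumptions, not the Jetstream consumer's code: `connectWithRetry` and `runDemo` are hypothetical names, the dial function is a stub, and the demo uses a one-millisecond backoff where production would use 5 seconds.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// connectWithRetry calls dial until it succeeds, the context is cancelled,
// or maxAttempts is reached. It waits a fixed backoff between attempts and
// returns the number of attempts made.
func connectWithRetry(ctx context.Context, dial func() error, backoff time.Duration, maxAttempts int) (int, error) {
	var lastErr error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if lastErr = dial(); lastErr == nil {
			return attempt, nil
		}
		if attempt == maxAttempts {
			break // no point sleeping after the final failure
		}
		select {
		case <-ctx.Done():
			return attempt, ctx.Err()
		case <-time.After(backoff):
		}
	}
	return maxAttempts, fmt.Errorf("giving up after %d attempts: %w", maxAttempts, lastErr)
}

// runDemo simulates a server that refuses the first `failures` connections.
func runDemo(failures, maxAttempts int) (int, error) {
	remaining := failures
	dial := func() error {
		if remaining > 0 {
			remaining--
			return errors.New("connection refused")
		}
		return nil
	}
	// A 1 ms backoff keeps the demo fast; the finding above describes 5 s.
	return connectWithRetry(context.Background(), dial, time.Millisecond, maxAttempts)
}

func main() {
	attempts, err := runDemo(2, 5)
	fmt.Println("attempts:", attempts, "error:", err)

	attempts, err = runDemo(10, 3)
	fmt.Println("attempts:", attempts, "error:", err)
}
```

This covers only the initial connection loop. Reconnecting after an established connection drops, which the note above says is not yet tested, would wrap the same loop around the read loop.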

7. Federation Readiness (Optional)

What: Test cross-PDS interactions
Why: Future-proofing for federation

  • User on different PDS subscribing to Coves community
  • User on different PDS commenting on Coves post
  • User on different PDS voting on Coves content
  • Handle resolution across PDSs

Note: Defer to Beta unless federation is an alpha requirement


Timeline Estimate

Week 1: Critical Blockers (P0)

  • Days 1-2: Authentication (JWT + did:web verification) - did:web COMPLETED
  • Day 1: Production PDS deployment (coves.me domain setup)
  • Day 2: JWT signature verification with production PDS
  • Day 3: DPoP token architecture fix
  • Day 4: Handle resolution + comment count reconciliation - COMPLETED
  • Day 4-5: Testing and bug fixes

Total: 16-23 hours (added 4-6 hours for PDS deployment, reduced from original due to did:web completion)

Week 2: Production Infrastructure (P1)

  • Days 6-7: Monitoring + structured logging
  • Day 8: Database backups + load testing
  • Days 9-10: Deployment runbook + final testing

Total: 30-35 hours

Week 3: E2E Testing + Polish ✅ E2E TESTS COMPLETE

  • Days 11-12: Critical E2E tests (user journey, blob upload) - COMPLETED (agent-implemented in ~6 hours)
  • Day 13: Additional E2E tests - COMPLETED (concurrent, rate limiting, error recovery in ~7 hours)
  • Days 14-15: Load testing, bug fixes, polish

Total: 20-25 hours estimated → 13 hours actual (E2E tests) + 7-12 hours remaining (load testing, polish)

Grand Total: 65-80 hours → 51-68 hours remaining (approximately 1.5-2 weeks full-time)
(Originally 70-85 hours. Adjusted for: +4-6 hours PDS deployment, -3 hours did:web completion, -13 hours E2E tests completion, -4 hours handle resolution and comment reconciliation)

✅ Progress Update: E2E testing section COMPLETE ahead of schedule - saved ~7-12 hours through parallel agent implementation


Success Criteria

Alpha is ready when:

  • All P0 blockers resolved
    • ✅ Handle resolution (COMPLETE)
    • ✅ Comment count reconciliation (COMPLETE)
    • ✅ did:web verification (COMPLETE - needs production deployment)
    • Production PDS deployed to coves.me (separate domain)
    • JWT signature verification working with production PDS
    • DPoP architecture fix implemented
  • Subscriptions/blocking work via client-write pattern
  • All integration tests passing
  • E2E user journey test passing
  • E2E blob upload tests passing
  • E2E concurrent scenarios tests passing
  • E2E rate limiting tests passing
  • E2E error recovery tests passing
  • Load testing shows acceptable performance (100+ concurrent users)
  • Monitoring and alerting active
  • Database backups configured and tested
  • Deployment runbook complete and validated
  • Security audit completed (basic)
  • No known critical bugs

Go/No-Go Decision Points

Can we launch without it?

| Feature | Alpha Requirement | Status | Rationale |
|---|---|---|---|
| JWT signature verification | ✅ YES | 🟡 Needs testing | Security critical |
| DPoP architecture fix | ✅ YES | 🔴 Not started | Subscriptions broken without it |
| Handle resolution | ✅ YES | COMPLETE | Core UX requirement |
| Comment count reconciliation | ✅ YES | COMPLETE | Data accuracy |
| Post read endpoints | ⚠️ MAYBE | 🔴 Not implemented | Can use feeds initially |
| Post update/delete | ❌ NO | 🔴 Not implemented | Can add post-launch |
| Moderation system | ❌ NO | 🔴 Not implemented | Deferred to Beta per PRD_GOVERNANCE |
| Full-text search | ❌ NO | 🔴 Not implemented | Browse works without it |
| Federation testing | ❌ NO | 🔴 Not implemented | Single-instance alpha |
| Mobile app | ⚠️ MAYBE | 🔴 Not implemented | Web-first acceptable |

Risk Assessment

High Risk

  1. JWT verification with production PDS - Never tested with real JWKS
  2. Load under real traffic - Current tests are single-user
  3. Operational knowledge - No one has run this in production yet

Medium Risk

  1. Database performance - Queries optimized but not load tested
  2. Jetstream consumer lag - May fall behind under high write volume
  3. Token refresh stability - Community tokens refresh every 2 hours (tested but not long-running)

Low Risk

  1. DPoP architecture fix - Similar pattern already works (votes)
  2. Handle resolution - ✅ Already implemented
  3. Comment reconciliation - ✅ Already implemented

Open Questions

  1. What's the target alpha user count? (affects infrastructure sizing)
  2. What's the alpha duration? (affects monitoring retention, backup retention)
  3. Is mobile app required for alpha? (affects DPoP testing priority)
  4. What's the rollback strategy? (database migrations may not be reversible)
  5. Who's on-call during alpha? (affects runbook detail level)
  6. What's the acceptable downtime? (affects HA requirements)
  7. Budget for infrastructure? (affects monitoring/backup solutions)

Next Steps

  1. ✅ Create this PRD
  2. ✅ Validate handle resolution (COMPLETE)
  3. ✅ Validate comment count reconciliation (COMPLETE)
  4. ✅ Write critical E2E tests (COMPLETE - all 6 test suites implemented)
  5. Review and prioritize with team
  6. Test JWT verification with pds.bretton.dev (requires invite code or existing account)
  7. Begin P0 blockers (DPoP fix first - highest user impact)
  8. Set up monitoring infrastructure
  9. Conduct load testing (infrastructure ready, tests written, needs execution)
  10. Security review
  11. Go/no-go decision
  12. Launch! 🚀

🎉 Major Milestones:

  • All E2E tests complete! Test coverage now includes full user journey, blob uploads, concurrent operations, rate limiting, and error recovery.
  • Bidirectional DID verification complete! Bluesky-compatible security model with alsoKnownAs validation, 24h cache TTL, and comprehensive test coverage.