A community-based topic aggregation platform built on atproto
# Alpha Go-Live Readiness PRD

**Status**: Pre-Alpha
**Target**: Alpha launch with real users
**Last Updated**: 2025-11-16

## Overview

This document tracks the remaining work required to launch Coves alpha with real users. Focus is on critical functionality, security, and operational readiness.

---

## P0: Critical Blockers (Must Complete Before Alpha)

### 1. Authentication & Security

#### JWT Signature Verification (Production Mode)
- [ ] Test with production PDS at `pds.bretton.dev`
  - [ ] Create test account on production PDS
  - [ ] Verify JWKS endpoint is accessible
  - [ ] Run `TestJWTSignatureVerification` against production PDS
  - [ ] Confirm signature verification succeeds
  - [ ] Test token refresh flow
- [ ] Set `AUTH_SKIP_VERIFY=false` in production environment
- [ ] Verify all auth middleware tests pass with verification enabled
- [ ] Document production PDS requirements for communities

**Estimated Effort**: 2-3 hours
**Risk**: Medium (code implemented, needs validation)

#### did:web Verification
- [ ] Complete did:web domain verification implementation
- [ ] Test with real did:web identities
- [ ] Add security logging for verification failures
- [ ] Set `SKIP_DID_WEB_VERIFICATION=false` for production

**Estimated Effort**: 2-3 hours
**Risk**: Medium

### 2. DPoP Token Architecture Fix

**Problem**: Backend attempts to write subscriptions/blocks to user PDS using DPoP-bound tokens (fails with "Malformed token").
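For context on why the backend cannot simply reuse these tokens: under RFC 9449, a DPoP-bound JWT access token carries a `cnf` claim whose `jkt` member is the thumbprint of the client's public key, and each request must be accompanied by a proof signed with the matching private key, which the backend does not hold. The sketch below is a minimal illustration, assuming the access token is a compact JWT (a PDS is free to issue opaque tokens instead). `isDPoPBound` and `buildUnsignedToken` are hypothetical helpers, not part of the Coves codebase, and no signature verification is performed.

```go
package main

import (
	"encoding/base64"
	"encoding/json"
	"errors"
	"fmt"
	"strings"
)

// confirmation mirrors the "cnf" claim described in RFC 9449.
type confirmation struct {
	JKT string `json:"jkt"` // thumbprint of the key the token is bound to
}

// claims holds only the part of the payload this sketch inspects.
type claims struct {
	Cnf *confirmation `json:"cnf"`
}

// isDPoPBound (hypothetical helper) reports whether a compact JWT access
// token carries a non-empty cnf.jkt claim. It does NOT verify the signature.
func isDPoPBound(token string) (bool, error) {
	parts := strings.Split(token, ".")
	if len(parts) != 3 {
		return false, errors.New("token is not a compact JWT")
	}
	payload, err := base64.RawURLEncoding.DecodeString(parts[1])
	if err != nil {
		return false, fmt.Errorf("decode payload: %w", err)
	}
	var c claims
	if err := json.Unmarshal(payload, &c); err != nil {
		return false, fmt.Errorf("parse claims: %w", err)
	}
	return c.Cnf != nil && c.Cnf.JKT != "", nil
}

// buildUnsignedToken (demonstration only) assembles a JWT-shaped string with
// a placeholder signature so the check above can be exercised without a PDS.
func buildUnsignedToken(payloadJSON string) string {
	enc := base64.RawURLEncoding.EncodeToString
	return enc([]byte(`{"alg":"none"}`)) + "." + enc([]byte(payloadJSON)) + ".sig"
}
```

A handler could use a check like this to reject forwarded writes early and return the "Write directly to your PDS" error, rather than attempting a request that the PDS will refuse.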

#### Remove Write-Forward Code
- [ ] Remove write-forward from `SubscribeToCommunity` handler
- [ ] Remove write-forward from `UnsubscribeFromCommunity` handler
- [ ] Remove write-forward from `BlockCommunity` handler
- [ ] Remove write-forward from `UnblockCommunity` handler
- [ ] Update handlers to return helpful error: "Write directly to your PDS"
- [ ] Update API documentation to reflect client-write pattern
- [ ] Verify Jetstream consumers still index correctly

**Files**:
- `internal/core/communities/service.go:564-816`
- `internal/api/handlers/community/subscribe.go`
- `internal/api/handlers/community/block.go`

**Estimated Effort**: 3-4 hours
**Risk**: Low (similar to votes pattern)

## P1: Important (Should Complete Before Alpha)

### 5. Post Read Operations

- [ ] Implement `getPost` endpoint (single post retrieval)
- [ ] Implement `listPosts` endpoint (with pagination)
- [ ] Add post permalink support
- [ ] Integration tests for post retrieval
- [ ] Error handling for missing/deleted posts

**Estimated Effort**: 6-8 hours
**Risk**: Low
**Note**: Can defer if direct post linking not needed initially

### 6. Production Infrastructure

#### Monitoring Setup
- [ ] Add Prometheus metrics endpoints
  - [ ] HTTP request metrics (duration, status codes, paths)
  - [ ] Database query metrics (slow queries, connection pool)
  - [ ] Jetstream consumer metrics (events processed, lag, errors)
  - [ ] Auth metrics (token validations, failures)
- [ ] Set up Grafana dashboards
  - [ ] Request rate and latency
  - [ ] Error rates by endpoint
  - [ ] Database performance
  - [ ] Jetstream consumer health
- [ ] Configure alerting rules
  - [ ] High error rate (>5% 5xx responses)
  - [ ] Slow response time (p99 >1s)
  - [ ] Database connection pool exhaustion
  - [ ] Jetstream consumer lag >1 minute
  - [ ] PDS health check failures

**Estimated Effort**: 8-10 hours

#### Structured Logging
- [ ] Replace `log` package with structured logger (zerolog or zap)
- [ ] Add log levels (debug, info, warn, error)
- [ ] JSON output format for production
- [ ] Add request ID tracking
- [ ] Add correlation IDs for async operations
- [ ] Sanitize sensitive data from logs (passwords, tokens, emails)
- [ ] Configure log rotation
- [ ] Ship logs to aggregation service (optional: Loki, CloudWatch)

**Estimated Effort**: 6-8 hours

#### Database Backups
- [ ] Automated PostgreSQL backups (daily minimum)
- [ ] Backup retention policy (30 days)
- [ ] Test restore procedure
- [ ] Document backup/restore runbook
- [ ] Off-site backup storage
- [ ] Monitor backup success/failure
- [ ] Point-in-time recovery (PITR) setup (optional)

**Estimated Effort**: 4-6 hours

#### Load Testing
- [ ] Define load test scenarios
  - [ ] User signup and authentication
  - [ ] Community creation
  - [ ] Post creation and viewing
  - [ ] Feed retrieval (timeline, discover, community)
  - [ ] Comment creation and threading
  - [ ] Voting
- [ ] Set target metrics
  - [ ] Concurrent users target (e.g., 100 concurrent)
  - [ ] Requests per second target
  - [ ] P95 latency target (<500ms)
  - [ ] Error rate target (<1%)
- [ ] Run load tests with k6/Artillery/JMeter
- [ ] Identify bottlenecks (database, CPU, memory)
- [ ] Optimize slow queries
- [ ] Add database indexes if needed
- [ ] Test graceful degradation under load

**Estimated Effort**: 10-12 hours

#### Deployment Runbook
- [ ] Document deployment procedure
  - [ ] Pre-deployment checklist
  - [ ] Database migration steps
  - [ ] Environment variable validation
  - [ ] Health check verification
  - [ ] Rollback procedure
- [ ] Document operational procedures
  - [ ] How to check system health
  - [ ] How to read logs
  - [ ] How to check Jetstream consumer status
  - [ ] How to manually trigger community token refresh
  - [ ] How to clear caches
- [ ] Document incident response
  - [ ] Who to contact
  - [ ] Escalation path
  - [ ] Common issues and fixes
  - [ ] Emergency procedures (PDS down, database down, etc.)
- [ ] Create production environment checklist
  - [ ] All environment variables set
  - [ ] `AUTH_SKIP_VERIFY=false`
  - [ ] `SKIP_DID_WEB_VERIFICATION=false`
  - [ ] Database migrations applied
  - [ ] PDS connectivity verified
  - [ ] JWKS caching working
  - [ ] Jetstream consumers running
  - [ ] Monitoring and alerting active

**Estimated Effort**: 6-8 hours

---

## P2: Nice to Have (Can Defer to Post-Alpha)

### 7. Post Update/Delete
- [ ] Implement post update endpoint
- [ ] Implement post delete endpoint
- [ ] Jetstream consumer for UPDATE/DELETE events
- [ ] Soft delete support

**Estimated Effort**: 4-6 hours

### 8. Community Delete
- [ ] Implement community delete endpoint
- [ ] Cascade delete considerations
- [ ] Archive vs hard delete decision

**Estimated Effort**: 2-3 hours

### 9. Content Rules Validation
- [ ] Implement text-only community enforcement
- [ ] Implement allowed embed types validation
- [ ] Content length limits

**Estimated Effort**: 6-8 hours

### 10. Search Functionality
- [ ] Community search improvements
- [ ] Post search
- [ ] User search
- [ ] Full-text search with PostgreSQL or external service

**Estimated Effort**: 8-10 hours

---

## Testing Gaps

### E2E Testing Recommendations

#### 1. Full User Journey Test (CRITICAL)
**What**: Test complete user flow from signup to interaction
**Why**: No single test validates the entire happy path

- [ ] Create test: Signup → Authenticate → Create Community → Create Post → Add Comment → Vote
- [ ] Verify all data flows through Jetstream correctly
- [ ] Verify counts update (vote counts, comment counts, subscriber counts)
- [ ] Verify timeline feed shows posts from subscribed communities
- [ ] Test with 2+ users interacting (user A posts, user B comments)

**File**: Create `tests/integration/user_journey_e2e_test.go`
**Estimated Effort**: 4-6 hours

#### 2. Blob Upload E2E Test
**What**: Test image upload and display in posts
**Why**: No test validates the full blob upload → post → feed display flow

- [ ] Create post with embedded image
- [ ] Verify blob uploaded to PDS
- [ ] Verify blob URL transformation in feed responses
- [ ] Test multiple images in single post
- [ ] Test image in comment

**Estimated Effort**: 3-4 hours

#### 3. Multi-Community Timeline Test
**What**: Test timeline feed with multiple community subscriptions
**Why**: Timeline logic may have edge cases with multiple sources

- [ ] Create 3+ communities
- [ ] Subscribe user to all communities
- [ ] Create posts in each community
- [ ] Verify timeline shows posts from all subscribed communities
- [ ] Verify hot/top/new sorting across communities

**Estimated Effort**: 2-3 hours

#### 4. Concurrent User Scenarios
**What**: Test system behavior with simultaneous users
**Why**: Race conditions and locking issues only appear under concurrency

- [ ] Multiple users voting on same post simultaneously
- [ ] Multiple users commenting on same post simultaneously
- [ ] Community creation with same handle (should fail)
- [ ] Subscription race conditions

**Estimated Effort**: 4-5 hours

#### 5. Rate Limiting Tests
**What**: Verify rate limits work correctly
**Why**: Protection against abuse

- [ ] Test aggregator rate limits (already exists)
- [ ] Test general endpoint rate limits (100 req/min)
- [ ] Test comment rate limits (20 req/min)
- [ ] Verify 429 responses
- [ ] Verify rate limit headers

**Estimated Effort**: 2-3 hours

#### 6. Error Recovery Tests
**What**: Test system recovery from failures
**Why**: Production will have failures

- [ ] Jetstream reconnection after disconnect
- [ ] PDS temporarily unavailable during post creation
- [ ] Database connection loss and recovery
- [ ] Malformed Jetstream events (should skip, not crash)
- [ ] Out-of-order event handling (already partially covered)

**Estimated Effort**: 4-5 hours

#### 7. Federation Readiness (Optional)
**What**: Test cross-PDS interactions
**Why**: Future-proofing for federation

- [ ] User on different PDS subscribing to Coves community
- [ ] User on different PDS commenting on Coves post
- [ ] User on different PDS voting on Coves content
- [ ] Handle resolution across PDSs

**Note**: Defer to Beta unless federation is alpha requirement

---

## Timeline Estimate

### Week 1: Critical Blockers (P0)
- **Days 1-2**: Authentication (JWT + did:web verification)
- **Day 3**: DPoP token architecture fix
- ~~**Day 4**: Handle resolution + comment count reconciliation~~ **COMPLETED**
- **Day 4-5**: Testing and bug fixes

**Total**: 15-20 hours (reduced from 20-25 due to completed items)

### Week 2: Production Infrastructure (P1)
- **Days 6-7**: Monitoring + structured logging
- **Day 8**: Database backups + load testing
- **Days 9-10**: Deployment runbook + final testing

**Total**: 30-35 hours

### Week 3: E2E Testing + Polish
- **Days 11-12**: Critical E2E tests (user journey, blob upload)
- **Day 13**: Additional E2E tests
- **Days 14-15**: Load testing, bug fixes, polish

**Total**: 20-25 hours

**Grand Total: 65-80 hours (approximately 2-3 weeks full-time)**
*(Reduced from original 70-85 hours estimate due to completed handle resolution and comment count reconciliation)*

---

## Success Criteria

Alpha is ready when:

- [ ] All P0 blockers resolved
  - Handle resolution (COMPLETE)
  - Comment count reconciliation (COMPLETE)
  - [ ] JWT signature verification working with production PDS
  - [ ] DPoP architecture fix implemented
  - [ ] did:web verification complete
- [ ] Subscriptions/blocking work via client-write pattern
- [ ] All integration tests passing
- [ ] E2E user journey test passing
- [ ] Load testing shows acceptable performance (100+ concurrent users)
- [ ] Monitoring and alerting active
- [ ] Database backups configured and tested
- [ ] Deployment runbook complete and validated
- [ ] Security audit completed (basic)
- [ ] No known critical bugs

---

## Go/No-Go Decision Points

### Can we launch without it?

| Feature | Alpha Requirement | Status | Rationale |
|---------|------------------|--------|-----------|
| JWT signature verification | YES | 🟡 Needs testing | Security critical |
| DPoP architecture fix | YES | 🔴 Not started | Subscriptions broken without it |
| ~~Handle resolution~~ | ~~✅ YES~~ | **COMPLETE** | Core UX requirement |
| ~~Comment count reconciliation~~ | ~~✅ YES~~ | **COMPLETE** | Data accuracy |
| Post read endpoints | MAYBE | 🔴 Not implemented | Can use feeds initially |
| Post update/delete | NO | 🔴 Not implemented | Can add post-launch |
| Moderation system | NO | 🔴 Not implemented | Deferred to Beta per PRD_GOVERNANCE |
| Full-text search | NO | 🔴 Not implemented | Browse works without it |
| Federation testing | NO | 🔴 Not implemented | Single-instance alpha |
| Mobile app | MAYBE | 🔴 Not implemented | Web-first acceptable |

---

## Risk Assessment

### High Risk
1. **JWT verification with production PDS** - Never tested with real JWKS
2. **Load under real traffic** - Current tests are single-user
3. **Operational knowledge** - No one has run this in production yet

### Medium Risk
1. **Database performance** - Queries optimized but not load tested
2. **Jetstream consumer lag** - May fall behind under high write volume
3. **Token refresh stability** - Community tokens refresh every 2 hours (tested but not long-running)

### Low Risk
1. **DPoP architecture fix** - Similar pattern already works (votes)
2. ~~**Handle resolution**~~ - Already implemented
3. ~~**Comment reconciliation**~~ - Already implemented

---

## Open Questions

1. **What's the target alpha user count?** (affects infrastructure sizing)
2. **What's the alpha duration?** (affects monitoring retention, backup retention)
3. **Is mobile app required for alpha?** (affects DPoP testing priority)
4. **What's the rollback strategy?** (database migrations may not be reversible)
5. **Who's on-call during alpha?** (affects runbook detail level)
6. **What's the acceptable downtime?** (affects HA requirements)
7. **Budget for infrastructure?** (affects monitoring/backup solutions)

---

## Next Steps

1. Create this PRD
2. Validate handle resolution (COMPLETE)
3. Validate comment count reconciliation (COMPLETE)
4. [ ] Review and prioritize with team
5. [ ] Test JWT verification with `pds.bretton.dev` (requires invite code or existing account)
6. [ ] Begin P0 blockers (DPoP fix first - highest user impact)
7. [ ] Set up monitoring infrastructure
8. [ ] Write critical E2E tests (especially full user journey)
9. [ ] Conduct load testing
10. [ ] Security review
11. [ ] Go/no-go decision
12. [ ] Launch! 🚀
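
The two verification flags in the production environment checklist (`AUTH_SKIP_VERIFY=false` and `SKIP_DID_WEB_VERIFICATION=false`) could also be enforced at process startup so that a misconfigured deploy fails fast. The following is a minimal sketch, not existing Coves code: `validateProductionEnv` is a hypothetical function, and treating an unset variable as unsafe is an assumption about the desired policy.

```go
package main

import (
	"fmt"
	"strings"
)

// requiredFalse lists the flags the production checklist requires to be "false".
var requiredFalse = []string{"AUTH_SKIP_VERIFY", "SKIP_DID_WEB_VERIFICATION"}

// validateProductionEnv (hypothetical) returns an error naming every flag that
// is not explicitly set to "false". The lookup function is injected so the
// check can be tested without touching the real process environment.
func validateProductionEnv(getenv func(string) string) error {
	var bad []string
	for _, name := range requiredFalse {
		value := strings.ToLower(strings.TrimSpace(getenv(name)))
		if value != "false" {
			bad = append(bad, name)
		}
	}
	if len(bad) > 0 {
		return fmt.Errorf("unsafe production config: %s must be set to false",
			strings.Join(bad, ", "))
	}
	return nil
}
```

In a real binary this might be called as `validateProductionEnv(os.Getenv)` early in `main`, exiting with a non-zero status if it returns an error.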