# Alpha Go-Live Readiness PRD

**Status**: Pre-Alpha
**Target**: Alpha launch with real users
**Last Updated**: 2025-11-16

This document tracks the remaining work required to launch Coves alpha with real users. Focus is on critical functionality, security, and operational readiness.

## P0: Critical Blockers (Must Complete Before Alpha)

### 1. Authentication & Security

#### JWT Signature Verification (Production Mode)
- [ ] Test with production PDS at `pds.bretton.dev`
  - [ ] Create test account on production PDS
  - [ ] Verify JWKS endpoint is accessible
  - [ ] Run `TestJWTSignatureVerification` against production PDS
  - [ ] Confirm signature verification succeeds
  - [ ] Test token refresh flow
- [ ] Set `AUTH_SKIP_VERIFY=false` in production environment
- [ ] Verify all auth middleware tests pass with verification enabled
- [ ] Document production PDS requirements for communities

**Estimated Effort**: 2-3 hours
**Risk**: Medium (code implemented, needs validation)

#### did:web Verification
- [ ] Complete did:web domain verification implementation
- [ ] Test with real did:web identities
- [ ] Add security logging for verification failures
- [ ] Set `SKIP_DID_WEB_VERIFICATION=false` for production

**Estimated Effort**: 2-3 hours
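
As a starting point for the did:web work, the sketch below maps a did:web identifier to the URL of its DID document, following the did:web method rules (colons separate path segments, and a bare domain resolves under `/.well-known/`). The function name `didWebDocumentURL` is hypothetical, and how Coves compares the resolved document against the claimed domain is not covered here.

```go
package main

import (
	"errors"
	"net/url"
	"strings"
)

// didWebDocumentURL is a hypothetical helper that returns the HTTPS URL
// where the DID document for a did:web identifier is expected to live.
func didWebDocumentURL(did string) (string, error) {
	const prefix = "did:web:"
	if !strings.HasPrefix(did, prefix) {
		return "", errors.New("not a did:web identifier")
	}
	parts := strings.Split(strings.TrimPrefix(did, prefix), ":")

	// The first segment is the host; an optional port is percent-encoded (%3A).
	host, err := url.PathUnescape(parts[0])
	if err != nil || host == "" || strings.ContainsAny(host, "/?#@") {
		return "", errors.New("invalid did:web host")
	}
	if len(parts) == 1 {
		return "https://" + host + "/.well-known/did.json", nil
	}

	// Remaining segments become the path to did.json.
	for _, seg := range parts[1:] {
		if seg == "" || strings.Contains(seg, "/") {
			return "", errors.New("invalid did:web path segment")
		}
	}
	return "https://" + host + "/" + strings.Join(parts[1:], "/") + "/did.json", nil
}
```

A verification step would fetch this URL and check that the `id` in the returned document matches the DID being claimed; failures are the events the security logging item above should capture.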

### 2. DPoP Token Architecture Fix

**Problem**: The backend attempts to write subscriptions and blocks to the user's PDS using DPoP-bound tokens, which fails with "Malformed token".

#### Remove Write-Forward Code
- [ ] Remove write-forward from `SubscribeToCommunity` handler
- [ ] Remove write-forward from `UnsubscribeFromCommunity` handler
- [ ] Remove write-forward from `BlockCommunity` handler
- [ ] Remove write-forward from `UnblockCommunity` handler
- [ ] Update handlers to return helpful error: "Write directly to your PDS"
- [ ] Update API documentation to reflect client-write pattern
- [ ] Verify Jetstream consumers still index correctly

**Files**:
- `internal/core/communities/service.go:564-816`
- `internal/api/handlers/community/subscribe.go`
- `internal/api/handlers/community/block.go`

**Estimated Effort**: 3-4 hours
**Risk**: Low (similar to votes pattern)
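
Once write-forwarding is removed, the four handlers could share a single response along these lines. This is a minimal sketch, not the existing Coves handler code: the error name `WriteForwardingUnsupported`, the 400 status, the `/subscribe` path, and the function names are all assumptions chosen for illustration.

```go
package main

import (
	"encoding/json"
	"net/http"
	"net/http/httptest"
)

// xrpcError is an assumed JSON error shape with a machine-readable name
// and a human-readable message.
type xrpcError struct {
	Error   string `json:"error"`
	Message string `json:"message"`
}

// writeDirectlyToPDS replaces the former write-forward logic: it performs
// no PDS call and tells the client to write the record itself.
func writeDirectlyToPDS(w http.ResponseWriter, r *http.Request) {
	w.Header().Set("Content-Type", "application/json")
	w.WriteHeader(http.StatusBadRequest) // status code is an assumption
	_ = json.NewEncoder(w).Encode(xrpcError{
		Error:   "WriteForwardingUnsupported", // hypothetical error name
		Message: "Write directly to your PDS",
	})
}

// callSubscribe exercises the handler in-process and returns the status
// code and decoded body, the way an integration test might.
func callSubscribe() (int, xrpcError) {
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodPost, "/subscribe", nil) // path is illustrative
	writeDirectlyToPDS(rec, req)
	var body xrpcError
	_ = json.NewDecoder(rec.Body).Decode(&body)
	return rec.Code, body
}
```

Because the handler no longer touches the PDS, the Jetstream consumers remain the only path by which subscriptions and blocks reach the index, which is why the final checklist item above still matters.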

## P1: Important (Should Complete Before Alpha)

### 5. Post Read Operations

- [ ] Implement `getPost` endpoint (single post retrieval)
- [ ] Implement `listPosts` endpoint (with pagination)
- [ ] Add post permalink support
- [ ] Integration tests for post retrieval
- [ ] Error handling for missing/deleted posts

**Estimated Effort**: 6-8 hours

**Note**: Can defer if direct post linking not needed initially

### 6. Production Infrastructure

#### Monitoring Setup
- [ ] Add Prometheus metrics endpoints
  - [ ] HTTP request metrics (duration, status codes, paths)
  - [ ] Database query metrics (slow queries, connection pool)
  - [ ] Jetstream consumer metrics (events processed, lag, errors)
  - [ ] Auth metrics (token validations, failures)
- [ ] Set up Grafana dashboards
  - [ ] Request rate and latency
  - [ ] Error rates by endpoint
  - [ ] Database performance
  - [ ] Jetstream consumer health
- [ ] Configure alerting rules
  - [ ] High error rate (>5% 5xx responses)
  - [ ] Slow response time (p99 >1s)
  - [ ] Database connection pool exhaustion
  - [ ] Jetstream consumer lag >1 minute
  - [ ] PDS health check failures

**Estimated Effort**: 8-10 hours
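
To make the ">5% 5xx responses" alert concrete, here is a standard-library-only sketch of the underlying measurement: a middleware that records response status codes, plus a threshold check. A real setup would export these counters through a Prometheus client library and evaluate the threshold in an alerting rule; the type and function names here are hypothetical, and request duration and path labels are left out for brevity.

```go
package main

import (
	"net/http"
	"net/http/httptest"
	"sync"
)

// statusRecorder wraps a ResponseWriter to capture the status code written.
type statusRecorder struct {
	http.ResponseWriter
	status int
}

func (r *statusRecorder) WriteHeader(code int) {
	r.status = code
	r.ResponseWriter.WriteHeader(code)
}

// requestStats counts total requests and 5xx responses.
type requestStats struct {
	mu     sync.Mutex
	total  int
	errors int
}

// middleware records the outcome of every request passing through it.
func (s *requestStats) middleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		rec := &statusRecorder{ResponseWriter: w, status: http.StatusOK}
		next.ServeHTTP(rec, r)
		s.mu.Lock()
		defer s.mu.Unlock()
		s.total++
		if rec.status >= 500 {
			s.errors++
		}
	})
}

// errorRateExceeds reports whether the 5xx share is above the threshold
// (0.05 corresponds to the 5% alert in the checklist).
func (s *requestStats) errorRateExceeds(threshold float64) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.total == 0 {
		return false
	}
	return float64(s.errors)/float64(s.total) > threshold
}

// simulate sends n requests through the middleware; the first `failures`
// of them return 500. It exists only to demonstrate the counters.
func simulate(n, failures int) *requestStats {
	stats := &requestStats{}
	seen := 0
	h := stats.middleware(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		seen++
		if seen <= failures {
			w.WriteHeader(http.StatusInternalServerError)
			return
		}
		w.WriteHeader(http.StatusOK)
	}))
	for i := 0; i < n; i++ {
		h.ServeHTTP(httptest.NewRecorder(), httptest.NewRequest(http.MethodGet, "/", nil))
	}
	return stats
}
```

With this sketch, `simulate(100, 6).errorRateExceeds(0.05)` is true while `simulate(100, 4).errorRateExceeds(0.05)` is false.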

#### Structured Logging
- [ ] Replace `log` package with structured logger (zerolog or zap)
- [ ] Add log levels (debug, info, warn, error)
- [ ] JSON output format for production
- [ ] Add request ID tracking
- [ ] Add correlation IDs for async operations
- [ ] Sanitize sensitive data from logs (passwords, tokens, emails)
- [ ] Configure log rotation
- [ ] Ship logs to aggregation service (optional: Loki, CloudWatch)

**Estimated Effort**: 6-8 hours

#### Database Backups
- [ ] Automated PostgreSQL backups (daily minimum)
- [ ] Backup retention policy (30 days)
- [ ] Test restore procedure
- [ ] Document backup/restore runbook
- [ ] Off-site backup storage
- [ ] Monitor backup success/failure
- [ ] Point-in-time recovery (PITR) setup (optional)

**Estimated Effort**: 4-6 hours

#### Load Testing
- [ ] Define load test scenarios
  - [ ] User signup and authentication
  - [ ] Community creation
  - [ ] Post creation and viewing
  - [ ] Feed retrieval (timeline, discover, community)
  - [ ] Comment creation and threading
- [ ] Set target metrics
  - [ ] Concurrent users target (e.g., 100 concurrent)
  - [ ] Requests per second target
  - [ ] P95 latency target (<500ms)
  - [ ] Error rate target (<1%)
- [ ] Run load tests with k6/Artillery/JMeter
- [ ] Identify bottlenecks (database, CPU, memory)
- [ ] Optimize slow queries
- [ ] Add database indexes if needed
- [ ] Test graceful degradation under load

**Estimated Effort**: 10-12 hours
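
The latency and error-rate targets above can be turned into a pass/fail check on raw load-test results. The following is a sketch that assumes one latency sample (in milliseconds) per request and uses the nearest-rank percentile definition; load tools such as k6 may interpolate percentiles slightly differently, and the function names are hypothetical.

```go
package main

import (
	"math"
	"sort"
)

// percentile returns the p-th percentile (0 < p <= 100) of the samples
// using the nearest-rank method. It returns 0 for an empty slice.
func percentile(samplesMs []float64, p float64) float64 {
	if len(samplesMs) == 0 {
		return 0
	}
	sorted := append([]float64(nil), samplesMs...) // do not mutate the caller's slice
	sort.Float64s(sorted)
	rank := int(math.Ceil(p * float64(len(sorted)) / 100))
	if rank < 1 {
		rank = 1
	}
	return sorted[rank-1]
}

// meetsTargets checks the two numeric targets from the checklist:
// P95 latency below 500 ms and error rate below 1%.
func meetsTargets(latenciesMs []float64, failedRequests int) bool {
	total := len(latenciesMs)
	if total == 0 {
		return false
	}
	errorRate := float64(failedRequests) / float64(total)
	return percentile(latenciesMs, 95) < 500 && errorRate < 0.01
}
```

For example, 100 samples of 4, 8, ..., 400 ms have a nearest-rank P95 of 380 ms and pass with zero failures, but the same run with one failed request has an error rate of exactly 1% and does not meet the strict "<1%" target.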

#### Deployment Runbook
- [ ] Document deployment procedure
  - [ ] Pre-deployment checklist
  - [ ] Database migration steps
  - [ ] Environment variable validation
  - [ ] Health check verification
  - [ ] Rollback procedure
- [ ] Document operational procedures
  - [ ] How to check system health
  - [ ] How to read logs
  - [ ] How to check Jetstream consumer status
  - [ ] How to manually trigger community token refresh
  - [ ] How to clear caches
- [ ] Document incident response
  - [ ] Who to contact
  - [ ] Escalation path
  - [ ] Common issues and fixes
  - [ ] Emergency procedures (PDS down, database down, etc.)
- [ ] Create production environment checklist
  - [ ] All environment variables set
    - [ ] `AUTH_SKIP_VERIFY=false`
    - [ ] `SKIP_DID_WEB_VERIFICATION=false`
  - [ ] Database migrations applied
  - [ ] PDS connectivity verified
  - [ ] JWKS caching working
  - [ ] Jetstream consumers running
  - [ ] Monitoring and alerting active

**Estimated Effort**: 6-8 hours

## P2: Nice to Have (Can Defer to Post-Alpha)

### 7. Post Update/Delete
- [ ] Implement post update endpoint
- [ ] Implement post delete endpoint
- [ ] Jetstream consumer for UPDATE/DELETE events
- [ ] Soft delete support

**Estimated Effort**: 4-6 hours

### 8. Community Delete
- [ ] Implement community delete endpoint
- [ ] Cascade delete considerations
- [ ] Archive vs hard delete decision

**Estimated Effort**: 2-3 hours

### 9. Content Rules Validation
- [ ] Implement text-only community enforcement
- [ ] Implement allowed embed types validation
- [ ] Content length limits

**Estimated Effort**: 6-8 hours

### 10. Search Functionality
- [ ] Community search improvements
- [ ] Full-text search with PostgreSQL or external service

**Estimated Effort**: 8-10 hours

### E2E Testing Recommendations

#### 1. Full User Journey Test (CRITICAL)
**What**: Test complete user flow from signup to interaction
**Why**: No single test validates the entire happy path

- [ ] Create test: Signup → Authenticate → Create Community → Create Post → Add Comment → Vote
- [ ] Verify all data flows through Jetstream correctly
- [ ] Verify counts update (vote counts, comment counts, subscriber counts)
- [ ] Verify timeline feed shows posts from subscribed communities
- [ ] Test with 2+ users interacting (user A posts, user B comments)

**File**: Create `tests/integration/user_journey_e2e_test.go`
**Estimated Effort**: 4-6 hours

#### 2. Blob Upload E2E Test
**What**: Test image upload and display in posts
**Why**: No test validates the full blob upload → post → feed display flow

- [ ] Create post with embedded image
- [ ] Verify blob uploaded to PDS
- [ ] Verify blob URL transformation in feed responses
- [ ] Test multiple images in single post
- [ ] Test image in comment

**Estimated Effort**: 3-4 hours

#### 3. Multi-Community Timeline Test
**What**: Test timeline feed with multiple community subscriptions
**Why**: Timeline logic may have edge cases with multiple sources

- [ ] Create 3+ communities
- [ ] Subscribe user to all communities
- [ ] Create posts in each community
- [ ] Verify timeline shows posts from all subscribed communities
- [ ] Verify hot/top/new sorting across communities

**Estimated Effort**: 2-3 hours

#### 4. Concurrent User Scenarios
**What**: Test system behavior with simultaneous users
**Why**: Race conditions and locking issues only appear under concurrency

- [ ] Multiple users voting on same post simultaneously
- [ ] Multiple users commenting on same post simultaneously
- [ ] Community creation with same handle (should fail)
- [ ] Subscription race conditions

**Estimated Effort**: 4-5 hours
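
The "same handle (should fail)" scenario has a simple shape: many contenders race for one handle and exactly one may win. The sketch below shows that shape with an in-memory registry standing in for the real service and its database uniqueness guarantee; every name in it is hypothetical.

```go
package main

import (
	"errors"
	"sync"
)

var errHandleTaken = errors.New("handle already taken")

// handleRegistry is an in-memory stand-in for whatever enforces handle
// uniqueness in the real system (for example a unique database index).
type handleRegistry struct {
	mu      sync.Mutex
	handles map[string]bool
}

// claim reserves the handle or reports that it is already taken.
func (r *handleRegistry) claim(handle string) error {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.handles[handle] {
		return errHandleTaken
	}
	r.handles[handle] = true
	return nil
}

// raceForHandle starts `contenders` goroutines that all try to claim the
// same handle at once and returns how many of them succeeded.
func raceForHandle(handle string, contenders int) int {
	reg := &handleRegistry{handles: make(map[string]bool)}
	results := make(chan error, contenders)
	var wg sync.WaitGroup
	for i := 0; i < contenders; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			results <- reg.claim(handle)
		}()
	}
	wg.Wait()
	close(results)

	successes := 0
	for err := range results {
		if err == nil {
			successes++
		}
	}
	return successes
}
```

A correct implementation yields exactly one success regardless of the number of contenders. Running such tests with `go test -race` also surfaces unsynchronized access that a passing assertion alone would miss.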

#### 5. Rate Limiting Tests
**What**: Verify rate limits work correctly
**Why**: Protection against abuse

- [ ] Test aggregator rate limits (already exists)
- [ ] Test general endpoint rate limits (100 req/min)
- [ ] Test comment rate limits (20 req/min)
- [ ] Verify 429 responses
- [ ] Verify rate limit headers

**Estimated Effort**: 2-3 hours
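
A rate limit test usually sends more requests than the limit allows and counts the 429 responses. The sketch below pairs such a test loop with a deliberately simple fixed-window limiter so that it is self-contained. The limiter is not the one Coves uses, it is global rather than per-client, and the hard-coded `Retry-After` value stands in for whichever rate limit headers the real middleware emits.

```go
package main

import (
	"net/http"
	"net/http/httptest"
	"sync"
	"time"
)

// fixedWindowLimiter allows at most `limit` requests per `window`.
type fixedWindowLimiter struct {
	mu          sync.Mutex
	limit       int
	window      time.Duration
	windowStart time.Time
	count       int
}

func (l *fixedWindowLimiter) allow() bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	now := time.Now()
	if now.Sub(l.windowStart) >= l.window {
		l.windowStart, l.count = now, 0 // start a new window
	}
	if l.count >= l.limit {
		return false
	}
	l.count++
	return true
}

// middleware rejects requests over the limit with 429 Too Many Requests.
func (l *fixedWindowLimiter) middleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if !l.allow() {
			w.Header().Set("Retry-After", "60") // simplification: assumes a one-minute window
			http.Error(w, "rate limit exceeded", http.StatusTooManyRequests)
			return
		}
		next.ServeHTTP(w, r)
	})
}

// countStatuses sends n requests through a limiter with the given
// per-minute limit and counts 200 and 429 responses.
func countStatuses(limit, n int) (accepted, limited int) {
	l := &fixedWindowLimiter{limit: limit, window: time.Minute}
	h := l.middleware(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	}))
	for i := 0; i < n; i++ {
		rec := httptest.NewRecorder()
		h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/", nil))
		switch rec.Code {
		case http.StatusOK:
			accepted++
		case http.StatusTooManyRequests:
			limited++
		}
	}
	return accepted, limited
}
```

With the general limit of 100 req/min, `countStatuses(100, 105)` should report 100 accepted and 5 limited requests; the comment limit can be checked the same way with `countStatuses(20, 25)`.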

#### 6. Error Recovery Tests
**What**: Test system recovery from failures
**Why**: Production will have failures

- [ ] Jetstream reconnection after disconnect
- [ ] PDS temporarily unavailable during post creation
- [ ] Database connection loss and recovery
- [ ] Malformed Jetstream events (should skip, not crash)
- [ ] Out-of-order event handling (already partially covered)

**Estimated Effort**: 4-5 hours
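
The "should skip, not crash" requirement for malformed events can be expressed as a consumer loop that counts and skips bad input instead of returning or panicking. The sketch below uses a heavily simplified event type with only `did` and `kind` fields; the actual consumer's types and validation rules are not shown in this document, so treat the names as assumptions.

```go
package main

import (
	"encoding/json"
)

// event is a simplified, assumed subset of a Jetstream event.
type event struct {
	DID  string `json:"did"`
	Kind string `json:"kind"`
}

// processEvents decodes each raw message and passes valid events to handle.
// Messages that fail to parse, or that lack a DID, are skipped and counted
// rather than aborting the loop.
func processEvents(raw [][]byte, handle func(event)) (processed, skipped int) {
	for _, msg := range raw {
		var ev event
		if err := json.Unmarshal(msg, &ev); err != nil || ev.DID == "" {
			skipped++ // a real consumer would also log and increment an error metric here
			continue
		}
		handle(ev)
		processed++
	}
	return processed, skipped
}
```

Feeding this loop one valid event, one truncated JSON payload, and one event without a `did` should yield one processed and two skipped, with later events still handled after the bad ones.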

#### 7. Federation Readiness (Optional)
**What**: Test cross-PDS interactions
**Why**: Future-proofing for federation

- [ ] User on different PDS subscribing to Coves community
- [ ] User on different PDS commenting on Coves post
- [ ] User on different PDS voting on Coves content
- [ ] Handle resolution across PDSs

**Note**: Defer to Beta unless federation is alpha requirement

## Timeline Estimate

### Week 1: Critical Blockers (P0)
- **Days 1-2**: Authentication (JWT + did:web verification)
- **Day 3**: DPoP token architecture fix
- ~~**Day 4**: Handle resolution + comment count reconciliation~~ ✅ **COMPLETED**
- **Days 4-5**: Testing and bug fixes

**Total**: 15-20 hours (reduced from 20-25 due to completed items)

### Week 2: Production Infrastructure (P1)
- **Days 6-7**: Monitoring + structured logging
- **Day 8**: Database backups + load testing
- **Days 9-10**: Deployment runbook + final testing

**Total**: 30-35 hours

### Week 3: E2E Testing + Polish
- **Days 11-12**: Critical E2E tests (user journey, blob upload)
- **Day 13**: Additional E2E tests
- **Days 14-15**: Load testing, bug fixes, polish

**Total**: 20-25 hours

**Grand Total: 65-80 hours (approximately 2-3 weeks full-time)**
*(Reduced from original 70-85 hours estimate due to completed handle resolution and comment count reconciliation)*

## Success Criteria

Alpha is ready when:

- [ ] All P0 blockers resolved
  - ✅ Handle resolution (COMPLETE)
  - ✅ Comment count reconciliation (COMPLETE)
  - [ ] JWT signature verification working with production PDS
  - [ ] DPoP architecture fix implemented
  - [ ] did:web verification complete
- [ ] Subscriptions/blocking work via client-write pattern
- [ ] All integration tests passing
- [ ] E2E user journey test passing
- [ ] Load testing shows acceptable performance (100+ concurrent users)
- [ ] Monitoring and alerting active
- [ ] Database backups configured and tested
- [ ] Deployment runbook complete and validated
- [ ] Security audit completed (basic)
- [ ] No known critical bugs

## Go/No-Go Decision Points

### Can we launch without it?

| Feature | Alpha Requirement | Status | Rationale |
|---------|------------------|--------|-----------|
| JWT signature verification | ✅ YES | 🟡 Needs testing | Security critical |
| DPoP architecture fix | ✅ YES | 🔴 Not started | Subscriptions broken without it |
| ~~Handle resolution~~ | ~~✅ YES~~ | ✅ **COMPLETE** | Core UX requirement |
| ~~Comment count reconciliation~~ | ~~✅ YES~~ | ✅ **COMPLETE** | Data accuracy |
| Post read endpoints | ⚠️ MAYBE | 🔴 Not implemented | Can use feeds initially |
| Post update/delete | ❌ NO | 🔴 Not implemented | Can add post-launch |
| Moderation system | ❌ NO | 🔴 Not implemented | Deferred to Beta per PRD_GOVERNANCE |
| Full-text search | ❌ NO | 🔴 Not implemented | Browse works without it |
| Federation testing | ❌ NO | 🔴 Not implemented | Single-instance alpha |
| Mobile app | ⚠️ MAYBE | 🔴 Not implemented | Web-first acceptable |

## Risk Assessment

### High Risk
1. **JWT verification with production PDS** - Never tested with real JWKS
2. **Load under real traffic** - Current tests are single-user
3. **Operational knowledge** - No one has run this in production yet

### Medium Risk
1. **Database performance** - Queries optimized but not load tested
2. **Jetstream consumer lag** - May fall behind under high write volume
3. **Token refresh stability** - Community tokens refresh every 2 hours (tested but not long-running)

### Low Risk
1. **DPoP architecture fix** - Similar pattern already works (votes)
2. ~~**Handle resolution**~~ - ✅ Already implemented
3. ~~**Comment reconciliation**~~ - ✅ Already implemented

## Open Questions

1. **What's the target alpha user count?** (affects infrastructure sizing)
2. **What's the alpha duration?** (affects monitoring retention, backup retention)
3. **Is mobile app required for alpha?** (affects DPoP testing priority)
4. **What's the rollback strategy?** (database migrations may not be reversible)
5. **Who's on-call during alpha?** (affects runbook detail level)
6. **What's the acceptable downtime?** (affects HA requirements)
7. **Budget for infrastructure?** (affects monitoring/backup solutions)

## Next Steps

1. ✅ Create this PRD
2. ✅ Validate handle resolution (COMPLETE)
3. ✅ Validate comment count reconciliation (COMPLETE)
4. [ ] Review and prioritize with team
5. [ ] Test JWT verification with `pds.bretton.dev` (requires invite code or existing account)
6. [ ] Begin P0 blockers (DPoP fix first - highest user impact)
7. [ ] Set up monitoring infrastructure
8. [ ] Write critical E2E tests (especially full user journey)
9. [ ] Conduct load testing
10. [ ] Security review
11. [ ] Go/no-go decision