# Alpha Go-Live Readiness PRD

**Status**: Pre-Alpha
**Target**: Alpha launch with real users
**Last Updated**: 2025-11-16

## Overview

This document tracks the remaining work required to launch Coves alpha with real users. Focus is on critical functionality, security, and operational readiness.

---

## P0: Critical Blockers (Must Complete Before Alpha)

### 1. Authentication & Security

#### JWT Signature Verification (Production Mode)
- [ ] Test with production PDS at `pds.bretton.dev`
  - [ ] Create test account on production PDS
  - [ ] Verify JWKS endpoint is accessible
  - [ ] Run `TestJWTSignatureVerification` against production PDS
  - [ ] Confirm signature verification succeeds
  - [ ] Test token refresh flow
- [ ] Set `AUTH_SKIP_VERIFY=false` in production environment
- [ ] Verify all auth middleware tests pass with verification enabled
- [ ] Document production PDS requirements for communities

**Estimated Effort**: 2-3 hours
**Risk**: Medium (code implemented, needs validation)

#### did:web Verification
- [ ] Complete did:web domain verification implementation
- [ ] Test with real did:web identities
- [ ] Add security logging for verification failures
- [ ] Set `SKIP_DID_WEB_VERIFICATION=false` for production

**Estimated Effort**: 2-3 hours
**Risk**: Medium

### 2. DPoP Token Architecture Fix

**Problem**: Backend attempts to write subscriptions/blocks to user PDS using DPoP-bound tokens (fails with "Malformed token").

#### Remove Write-Forward Code
- [ ] Remove write-forward from `SubscribeToCommunity` handler
- [ ] Remove write-forward from `UnsubscribeFromCommunity` handler
- [ ] Remove write-forward from `BlockCommunity` handler
- [ ] Remove write-forward from `UnblockCommunity` handler
- [ ] Update handlers to return helpful error: "Write directly to your PDS"
- [ ] Update API documentation to reflect client-write pattern
- [ ] Verify Jetstream consumers still index correctly

**Files**:
- `internal/core/communities/service.go:564-816`
- `internal/api/handlers/community/subscribe.go`
- `internal/api/handlers/community/block.go`

**Estimated Effort**: 3-4 hours
**Risk**: Low (similar to votes pattern)

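To make the "return helpful error" item concrete, here is a minimal sketch of a handler that refuses to proxy the write and points the client at its own PDS. It is not the existing Coves handler: `rejectWriteForward`, `clientWriteError`, the `ClientWriteRequired` error name, the 400 status code, and the `example.collection.subscription` collection name are all assumptions made for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// clientWriteError is a hypothetical error body; the field names are assumptions.
type clientWriteError struct {
	Error   string `json:"error"`
	Message string `json:"message"`
}

// rejectWriteForward returns a handler that never writes to the user's PDS.
// It tells the client which record collection to write with its own session.
func rejectWriteForward(collection string) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "application/json")
		w.WriteHeader(http.StatusBadRequest) // assumed status code
		_ = json.NewEncoder(w).Encode(clientWriteError{
			Error:   "ClientWriteRequired", // hypothetical error name
			Message: "Write directly to your PDS: create the " + collection + " record with your own session",
		})
	}
}

// probe runs one request against h and returns the status code and body.
func probe(h http.Handler, method, path string) (int, string) {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(method, path, nil))
	return rec.Code, rec.Body.String()
}

func main() {
	// The collection name is a placeholder, not a confirmed lexicon ID.
	code, body := probe(rejectWriteForward("example.collection.subscription"), http.MethodPost, "/subscribe")
	fmt.Println(code, body)
}
```

With this shape, the Jetstream consumers remain the only path by which subscriptions and blocks reach the index, which is what the last checklist item verifies.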
## P1: Important (Should Complete Before Alpha)

### 5. Post Read Operations

- [ ] Implement `getPost` endpoint (single post retrieval)
- [ ] Implement `listPosts` endpoint (with pagination)
- [ ] Add post permalink support
- [ ] Integration tests for post retrieval
- [ ] Error handling for missing/deleted posts

**Estimated Effort**: 6-8 hours
**Risk**: Low
**Note**: Can defer if direct post linking not needed initially

### 6. Production Infrastructure

#### Monitoring Setup
- [ ] Add Prometheus metrics endpoints
  - [ ] HTTP request metrics (duration, status codes, paths)
  - [ ] Database query metrics (slow queries, connection pool)
  - [ ] Jetstream consumer metrics (events processed, lag, errors)
  - [ ] Auth metrics (token validations, failures)
- [ ] Set up Grafana dashboards
  - [ ] Request rate and latency
  - [ ] Error rates by endpoint
  - [ ] Database performance
  - [ ] Jetstream consumer health
- [ ] Configure alerting rules
  - [ ] High error rate (>5% 5xx responses)
  - [ ] Slow response time (p99 >1s)
  - [ ] Database connection pool exhaustion
  - [ ] Jetstream consumer lag >1 minute
  - [ ] PDS health check failures

**Estimated Effort**: 8-10 hours

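The HTTP request metrics and the 5xx alert threshold can be illustrated with a small middleware. This sketch uses only the standard library and an in-memory counter in place of a Prometheus registry, so it shows the shape of the instrumentation rather than the real metrics endpoint. `requestStats`, `statusRecorder`, `instrument`, and `simulate` are hypothetical names.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync"
	"time"
)

// requestStats is an in-memory stand-in for a real metrics registry.
type requestStats struct {
	mu       sync.Mutex
	byStatus map[int]int   // responses seen per HTTP status code
	elapsed  time.Duration // total time spent in handlers
}

// statusRecorder remembers the status code the wrapped handler wrote.
type statusRecorder struct {
	http.ResponseWriter
	status int
}

func (r *statusRecorder) WriteHeader(code int) {
	r.status = code
	r.ResponseWriter.WriteHeader(code)
}

// instrument records the status code and duration of every request to next.
func (s *requestStats) instrument(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, req *http.Request) {
		rec := &statusRecorder{ResponseWriter: w, status: http.StatusOK}
		start := time.Now()
		next.ServeHTTP(rec, req)
		s.mu.Lock()
		defer s.mu.Unlock()
		s.byStatus[rec.status]++
		s.elapsed += time.Since(start)
	})
}

// serverErrorRate returns the fraction of recorded responses with a 5xx status.
func (s *requestStats) serverErrorRate() float64 {
	s.mu.Lock()
	defer s.mu.Unlock()
	total, failed := 0, 0
	for status, n := range s.byStatus {
		total += n
		if status >= 500 {
			failed += n
		}
	}
	if total == 0 {
		return 0
	}
	return float64(failed) / float64(total)
}

// simulate sends ok successful and failed failing requests through the
// middleware and returns the resulting 5xx rate.
func simulate(ok, failed int) float64 {
	stats := &requestStats{byStatus: map[int]int{}}
	handler := stats.instrument(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.URL.Path == "/fail" {
			w.WriteHeader(http.StatusInternalServerError)
			return
		}
		w.WriteHeader(http.StatusOK)
	}))
	send := func(path string, n int) {
		for i := 0; i < n; i++ {
			handler.ServeHTTP(httptest.NewRecorder(), httptest.NewRequest(http.MethodGet, path, nil))
		}
	}
	send("/ok", ok)
	send("/fail", failed)
	return stats.serverErrorRate()
}

func main() {
	rate := simulate(18, 2)
	// The alert condition mirrors the ">5% 5xx responses" rule above.
	fmt.Printf("5xx rate: %.2f, alert: %v\n", rate, rate > 0.05)
}
```

In production the counters would be exported through a metrics library and the threshold evaluated by the alerting system, not in application code.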
#### Structured Logging
- [ ] Replace `log` package with structured logger (zerolog or zap)
- [ ] Add log levels (debug, info, warn, error)
- [ ] JSON output format for production
- [ ] Add request ID tracking
- [ ] Add correlation IDs for async operations
- [ ] Sanitize sensitive data from logs (passwords, tokens, emails)
- [ ] Configure log rotation
- [ ] Ship logs to aggregation service (optional: Loki, CloudWatch)

**Estimated Effort**: 6-8 hours

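As an illustration of JSON output, log levels, request IDs, and redaction working together, the sketch below uses the standard library's `log/slog` (Go 1.21+) instead of zerolog or zap, purely to stay dependency-free. The same ideas carry over to either library. The `sensitiveKeys` set, the `request_id` attribute name, and the helper names are assumptions.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"log/slog"
)

// sensitiveKeys lists attribute names whose values are masked before output.
// The exact key set is an assumption for this sketch.
var sensitiveKeys = map[string]bool{"password": true, "token": true, "email": true}

// newLogger builds a JSON logger at the given level that redacts sensitive attributes.
func newLogger(w io.Writer, level slog.Level) *slog.Logger {
	opts := &slog.HandlerOptions{
		Level: level,
		ReplaceAttr: func(groups []string, a slog.Attr) slog.Attr {
			if sensitiveKeys[a.Key] {
				return slog.String(a.Key, "[REDACTED]")
			}
			return a
		},
	}
	return slog.New(slog.NewJSONHandler(w, opts))
}

// renderSample writes two log calls into a buffer and returns what was emitted.
func renderSample() string {
	var buf bytes.Buffer
	// With attaches the request ID to every line logged through this logger.
	logger := newLogger(&buf, slog.LevelInfo).With("request_id", "req-123")
	logger.Info("login attempt", "email", "user@example.com", "token", "secret")
	logger.Debug("dropped because level is info")
	return buf.String()
}

func main() {
	fmt.Print(renderSample())
}
```

Key-based redaction only catches values logged under known attribute names; secrets interpolated into the message string would still need separate handling.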
#### Database Backups
- [ ] Automated PostgreSQL backups (daily minimum)
- [ ] Backup retention policy (30 days)
- [ ] Test restore procedure
- [ ] Document backup/restore runbook
- [ ] Off-site backup storage
- [ ] Monitor backup success/failure
- [ ] Point-in-time recovery (PITR) setup (optional)

**Estimated Effort**: 4-6 hours

#### Load Testing
- [ ] Define load test scenarios
  - [ ] User signup and authentication
  - [ ] Community creation
  - [ ] Post creation and viewing
  - [ ] Feed retrieval (timeline, discover, community)
  - [ ] Comment creation and threading
  - [ ] Voting
- [ ] Set target metrics
  - [ ] Concurrent users target (e.g., 100 concurrent)
  - [ ] Requests per second target
  - [ ] P95 latency target (<500ms)
  - [ ] Error rate target (<1%)
- [ ] Run load tests with k6/Artillery/JMeter
- [ ] Identify bottlenecks (database, CPU, memory)
- [ ] Optimize slow queries
- [ ] Add database indexes if needed
- [ ] Test graceful degradation under load

**Estimated Effort**: 10-12 hours

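The latency and error-rate targets above can be checked mechanically once a load test produces raw samples. The following sketch assumes latencies are collected as whole milliseconds and uses the nearest-rank definition of a percentile. Load tools may use a different interpolation, so treat the numbers as comparable only within one method. `percentile` and `meetsTargets` are hypothetical helpers.

```go
package main

import (
	"fmt"
	"sort"
)

// percentile returns the nearest-rank p-th percentile (1 <= p <= 100) of the
// samples, in the same unit as the input. It returns 0 for an empty slice.
func percentile(samplesMs []int, p int) int {
	if len(samplesMs) == 0 {
		return 0
	}
	sorted := append([]int(nil), samplesMs...) // copy so the caller's slice is untouched
	sort.Ints(sorted)
	rank := (p*len(sorted) + 99) / 100 // integer form of ceil(p/100 * n)
	if rank < 1 {
		rank = 1
	}
	return sorted[rank-1]
}

// meetsTargets checks a run against the targets listed above:
// P95 latency under 500 ms and error rate under 1%.
func meetsTargets(samplesMs []int, errors, total int) bool {
	if total == 0 {
		return false
	}
	return percentile(samplesMs, 95) < 500 && errors*100 < total
}

func main() {
	// Synthetic latencies of 1..100 ms, one per request.
	samples := make([]int, 0, 100)
	for ms := 1; ms <= 100; ms++ {
		samples = append(samples, ms)
	}
	fmt.Println("p95 (ms):", percentile(samples, 95))
	fmt.Println("targets met:", meetsTargets(samples, 0, len(samples)))
}
```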
#### Deployment Runbook
- [ ] Document deployment procedure
  - [ ] Pre-deployment checklist
  - [ ] Database migration steps
  - [ ] Environment variable validation
  - [ ] Health check verification
  - [ ] Rollback procedure
- [ ] Document operational procedures
  - [ ] How to check system health
  - [ ] How to read logs
  - [ ] How to check Jetstream consumer status
  - [ ] How to manually trigger community token refresh
  - [ ] How to clear caches
- [ ] Document incident response
  - [ ] Who to contact
  - [ ] Escalation path
  - [ ] Common issues and fixes
  - [ ] Emergency procedures (PDS down, database down, etc.)
- [ ] Create production environment checklist
  - [ ] All environment variables set
  - [ ] `AUTH_SKIP_VERIFY=false`
  - [ ] `SKIP_DID_WEB_VERIFICATION=false`
  - [ ] Database migrations applied
  - [ ] PDS connectivity verified
  - [ ] JWKS caching working
  - [ ] Jetstream consumers running
  - [ ] Monitoring and alerting active

**Estimated Effort**: 6-8 hours

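The "environment variable validation" step could be automated with a startup check along these lines. This is a sketch, not existing Coves code: `validateProductionEnv` is a hypothetical helper, and requiring the literal string `false` is an assumption about how the application parses these flags. A real deployment would typically exit non-zero when the returned list is non-empty.

```go
package main

import (
	"fmt"
	"os"
)

// mustBeFalse lists the verification bypass flags named in the checklist above.
var mustBeFalse = []string{"AUTH_SKIP_VERIFY", "SKIP_DID_WEB_VERIFICATION"}

// validateProductionEnv returns one message per flag that is not exactly "false".
// lookup is injected so the check can run against a fake environment in tests.
func validateProductionEnv(lookup func(string) (string, bool)) []string {
	var problems []string
	for _, name := range mustBeFalse {
		value, ok := lookup(name)
		switch {
		case !ok:
			problems = append(problems, name+" is not set (expected \"false\")")
		case value != "false":
			problems = append(problems, name+" is \""+value+"\" (expected \"false\")")
		}
	}
	return problems
}

func main() {
	// Against the real environment; prints nothing when both flags are "false".
	for _, problem := range validateProductionEnv(os.LookupEnv) {
		fmt.Println("config error:", problem)
	}
}
```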
---

## P2: Nice to Have (Can Defer to Post-Alpha)

### 7. Post Update/Delete
- [ ] Implement post update endpoint
- [ ] Implement post delete endpoint
- [ ] Jetstream consumer for UPDATE/DELETE events
- [ ] Soft delete support

**Estimated Effort**: 4-6 hours

### 8. Community Delete
- [ ] Implement community delete endpoint
- [ ] Cascade delete considerations
- [ ] Archive vs hard delete decision

**Estimated Effort**: 2-3 hours

### 9. Content Rules Validation
- [ ] Implement text-only community enforcement
- [ ] Implement allowed embed types validation
- [ ] Content length limits

**Estimated Effort**: 6-8 hours

### 10. Search Functionality
- [ ] Community search improvements
- [ ] Post search
- [ ] User search
- [ ] Full-text search with PostgreSQL or external service

**Estimated Effort**: 8-10 hours

---

## Testing Gaps

### E2E Testing Recommendations

#### 1. Full User Journey Test (CRITICAL)
**What**: Test complete user flow from signup to interaction
**Why**: No single test validates the entire happy path

- [ ] Create test: Signup → Authenticate → Create Community → Create Post → Add Comment → Vote
- [ ] Verify all data flows through Jetstream correctly
- [ ] Verify counts update (vote counts, comment counts, subscriber counts)
- [ ] Verify timeline feed shows posts from subscribed communities
- [ ] Test with 2+ users interacting (user A posts, user B comments)

**File**: Create `tests/integration/user_journey_e2e_test.go`
**Estimated Effort**: 4-6 hours

#### 2. Blob Upload E2E Test
**What**: Test image upload and display in posts
**Why**: No test validates the full blob upload → post → feed display flow

- [ ] Create post with embedded image
- [ ] Verify blob uploaded to PDS
- [ ] Verify blob URL transformation in feed responses
- [ ] Test multiple images in single post
- [ ] Test image in comment

**Estimated Effort**: 3-4 hours

#### 3. Multi-Community Timeline Test
**What**: Test timeline feed with multiple community subscriptions
**Why**: Timeline logic may have edge cases with multiple sources

- [ ] Create 3+ communities
- [ ] Subscribe user to all communities
- [ ] Create posts in each community
- [ ] Verify timeline shows posts from all subscribed communities
- [ ] Verify hot/top/new sorting across communities

**Estimated Effort**: 2-3 hours

#### 4. Concurrent User Scenarios
**What**: Test system behavior with simultaneous users
**Why**: Race conditions and locking issues only appear under concurrency

- [ ] Multiple users voting on same post simultaneously
- [ ] Multiple users commenting on same post simultaneously
- [ ] Community creation with same handle (should fail)
- [ ] Subscription race conditions

**Estimated Effort**: 4-5 hours

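The simultaneous-voting scenario might be structured as below. This is a sketch of the test shape only: `voteStore` is an in-memory stand-in for the real vote path, which runs through the PDS, Jetstream, and PostgreSQL. The DIDs and the post URI are placeholder values.

```go
package main

import (
	"fmt"
	"sync"
)

// voteStore is an in-memory stand-in that keeps at most one vote per voter per post.
type voteStore struct {
	mu    sync.Mutex
	votes map[string]map[string]bool // post URI -> voter DID -> voted
}

// upvote records a vote; repeating it for the same voter and post has no further effect.
func (s *voteStore) upvote(postURI, voterDID string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.votes[postURI] == nil {
		s.votes[postURI] = map[string]bool{}
	}
	s.votes[postURI][voterDID] = true
}

// count returns the number of distinct voters on a post.
func (s *voteStore) count(postURI string) int {
	s.mu.Lock()
	defer s.mu.Unlock()
	return len(s.votes[postURI])
}

// hammer starts one goroutine per voter, each voting repeats times on the same
// post, waits for all of them, and returns the final vote count.
func hammer(voters, repeats int) int {
	const post = "at://did:example:community/post/1" // placeholder URI
	store := &voteStore{votes: map[string]map[string]bool{}}
	var wg sync.WaitGroup
	for v := 0; v < voters; v++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			did := fmt.Sprintf("did:example:user%d", id)
			for i := 0; i < repeats; i++ {
				store.upvote(post, did)
			}
		}(v)
	}
	wg.Wait()
	return store.count(post)
}

func main() {
	// Each voter should be counted once, however often the vote is repeated.
	fmt.Println("final count:", hammer(50, 10))
}
```

Running tests of this shape with `go test -race` lets the race detector flag unsynchronized access that a passing count alone would miss.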
#### 5. Rate Limiting Tests
**What**: Verify rate limits work correctly
**Why**: Protection against abuse

- [ ] Test aggregator rate limits (already exists)
- [ ] Test general endpoint rate limits (100 req/min)
- [ ] Test comment rate limits (20 req/min)
- [ ] Verify 429 responses
- [ ] Verify rate limit headers

**Estimated Effort**: 2-3 hours

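A rate-limit test can send a burst just over the limit and count 200 and 429 responses. The sketch below pairs a simple fixed-window limiter with such a burst. It does not reproduce Coves's actual limiter: the window algorithm, keying by remote address, and the use of `Retry-After` as the verified header are assumptions.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"strconv"
	"sync"
	"time"
)

// fixedWindowLimiter allows up to limit requests per key in each window.
type fixedWindowLimiter struct {
	mu     sync.Mutex
	limit  int
	window time.Duration
	now    func() time.Time // injectable clock for tests
	start  map[string]time.Time
	count  map[string]int
}

func newLimiter(limit int, window time.Duration, now func() time.Time) *fixedWindowLimiter {
	return &fixedWindowLimiter{
		limit: limit, window: window, now: now,
		start: map[string]time.Time{}, count: map[string]int{},
	}
}

// allow reports whether key may make another request in the current window.
func (l *fixedWindowLimiter) allow(key string) bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	t := l.now()
	if s, seen := l.start[key]; !seen || t.Sub(s) >= l.window {
		l.start[key], l.count[key] = t, 0 // open a fresh window
	}
	if l.count[key] >= l.limit {
		return false
	}
	l.count[key]++
	return true
}

// middleware answers 429 with a Retry-After header once the limit is reached.
func (l *fixedWindowLimiter) middleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if !l.allow(r.RemoteAddr) { // keying by remote address is an assumption
			// The full window length is a conservative upper bound on the wait.
			w.Header().Set("Retry-After", strconv.Itoa(int(l.window.Seconds())))
			http.Error(w, "rate limit exceeded", http.StatusTooManyRequests)
			return
		}
		next.ServeHTTP(w, r)
	})
}

// burst sends n requests from one client and counts 200 responses and
// 429 responses that carry a Retry-After header.
func burst(limit, n int) (ok, limited int) {
	frozen := time.Unix(0, 0) // frozen clock keeps every request in one window
	limiter := newLimiter(limit, time.Minute, func() time.Time { return frozen })
	handler := limiter.middleware(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	}))
	for i := 0; i < n; i++ {
		rec := httptest.NewRecorder()
		handler.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/", nil))
		switch rec.Code {
		case http.StatusOK:
			ok++
		case http.StatusTooManyRequests:
			if rec.Header().Get("Retry-After") != "" {
				limited++
			}
		}
	}
	return ok, limited
}

func main() {
	ok, limited := burst(100, 105) // general endpoint limit from the list above
	fmt.Println("general:", ok, "allowed,", limited, "limited")
	ok, limited = burst(20, 25) // comment limit from the list above
	fmt.Println("comments:", ok, "allowed,", limited, "limited")
}
```

Injecting the clock keeps the test deterministic and lets a follow-up case advance time past the window to confirm the limit resets.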
#### 6. Error Recovery Tests
**What**: Test system recovery from failures
**Why**: Production will have failures

- [ ] Jetstream reconnection after disconnect
- [ ] PDS temporarily unavailable during post creation
- [ ] Database connection loss and recovery
- [ ] Malformed Jetstream events (should skip, not crash)
- [ ] Out-of-order event handling (already partially covered)

**Estimated Effort**: 4-5 hours

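The "should skip, not crash" requirement for malformed events can be expressed as a consumer loop that treats a decode failure as a counted skip. In this sketch the `event` struct is a simplified, hypothetical subset of a Jetstream message, and `consume` is not the existing consumer.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// event is a simplified, hypothetical subset of a Jetstream message.
type event struct {
	DID  string `json:"did"`
	Kind string `json:"kind"`
}

// consume decodes each raw message and passes valid events to handle.
// Messages that fail to parse or lack a DID are skipped, and the loop continues.
func consume(raw [][]byte, handle func(event)) (handled, skipped int) {
	for _, msg := range raw {
		var ev event
		if err := json.Unmarshal(msg, &ev); err != nil || ev.DID == "" {
			skipped++ // real code would also log the failure and bump an error metric
			continue
		}
		handle(ev)
		handled++
	}
	return handled, skipped
}

func main() {
	raw := [][]byte{
		[]byte(`{"did":"did:example:alice","kind":"commit"}`),
		[]byte(`{not json`),          // malformed JSON
		[]byte(`{"kind":"commit"}`),  // missing DID
		[]byte(`{"did":42}`),         // wrong field type
		[]byte(`{"did":"did:example:bob","kind":"commit"}`),
	}
	handled, skipped := consume(raw, func(ev event) {
		fmt.Println("indexed event from", ev.DID)
	})
	fmt.Println("handled:", handled, "skipped:", skipped)
}
```

A test built on this shape asserts both counts, so a regression that drops valid events is caught along with one that crashes on bad input.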
#### 7. Federation Readiness (Optional)
**What**: Test cross-PDS interactions
**Why**: Future-proofing for federation

- [ ] User on different PDS subscribing to Coves community
- [ ] User on different PDS commenting on Coves post
- [ ] User on different PDS voting on Coves content
- [ ] Handle resolution across PDSs

**Note**: Defer to Beta unless federation is alpha requirement

---

## Timeline Estimate

### Week 1: Critical Blockers (P0)
- **Days 1-2**: Authentication (JWT + did:web verification)
- **Day 3**: DPoP token architecture fix
- ~~**Day 4**: Handle resolution + comment count reconciliation~~ ✅ **COMPLETED**
- **Days 4-5**: Testing and bug fixes

**Total**: 15-20 hours (reduced from 20-25 due to completed items)

### Week 2: Production Infrastructure (P1)
- **Days 6-7**: Monitoring + structured logging
- **Day 8**: Database backups + load testing
- **Days 9-10**: Deployment runbook + final testing

**Total**: 30-35 hours

### Week 3: E2E Testing + Polish
- **Days 11-12**: Critical E2E tests (user journey, blob upload)
- **Day 13**: Additional E2E tests
- **Days 14-15**: Load testing, bug fixes, polish

**Total**: 20-25 hours

**Grand Total: 65-80 hours (approximately 2-3 weeks full-time)**
*(Reduced from the original 70-85 hour estimate due to completed handle resolution and comment count reconciliation)*

---

## Success Criteria

Alpha is ready when:

- [ ] All P0 blockers resolved
  - ✅ Handle resolution (COMPLETE)
  - ✅ Comment count reconciliation (COMPLETE)
  - [ ] JWT signature verification working with production PDS
  - [ ] DPoP architecture fix implemented
  - [ ] did:web verification complete
- [ ] Subscriptions/blocking work via client-write pattern
- [ ] All integration tests passing
- [ ] E2E user journey test passing
- [ ] Load testing shows acceptable performance (100+ concurrent users)
- [ ] Monitoring and alerting active
- [ ] Database backups configured and tested
- [ ] Deployment runbook complete and validated
- [ ] Security audit completed (basic)
- [ ] No known critical bugs

---

## Go/No-Go Decision Points

### Can we launch without it?

| Feature | Alpha Requirement | Status | Rationale |
|---------|------------------|--------|-----------|
| JWT signature verification | ✅ YES | 🟡 Needs testing | Security critical |
| DPoP architecture fix | ✅ YES | 🔴 Not started | Subscriptions broken without it |
| ~~Handle resolution~~ | ~~✅ YES~~ | ✅ **COMPLETE** | Core UX requirement |
| ~~Comment count reconciliation~~ | ~~✅ YES~~ | ✅ **COMPLETE** | Data accuracy |
| Post read endpoints | ⚠️ MAYBE | 🔴 Not implemented | Can use feeds initially |
| Post update/delete | ❌ NO | 🔴 Not implemented | Can add post-launch |
| Moderation system | ❌ NO | 🔴 Not implemented | Deferred to Beta per PRD_GOVERNANCE |
| Full-text search | ❌ NO | 🔴 Not implemented | Browse works without it |
| Federation testing | ❌ NO | 🔴 Not implemented | Single-instance alpha |
| Mobile app | ⚠️ MAYBE | 🔴 Not implemented | Web-first acceptable |

---

## Risk Assessment

### High Risk
1. **JWT verification with production PDS** - Never tested with real JWKS
2. **Load under real traffic** - Current tests are single-user
3. **Operational knowledge** - No one has run this in production yet

### Medium Risk
1. **Database performance** - Queries optimized but not load tested
2. **Jetstream consumer lag** - May fall behind under high write volume
3. **Token refresh stability** - Community tokens refresh every 2 hours (tested but not long-running)

### Low Risk
1. **DPoP architecture fix** - Similar pattern already works (votes)
2. ~~**Handle resolution**~~ - ✅ Already implemented
3. ~~**Comment reconciliation**~~ - ✅ Already implemented

---

## Open Questions

1. **What's the target alpha user count?** (affects infrastructure sizing)
2. **What's the alpha duration?** (affects monitoring retention, backup retention)
3. **Is mobile app required for alpha?** (affects DPoP testing priority)
4. **What's the rollback strategy?** (database migrations may not be reversible)
5. **Who's on-call during alpha?** (affects runbook detail level)
6. **What's the acceptable downtime?** (affects HA requirements)
7. **Budget for infrastructure?** (affects monitoring/backup solutions)

---

## Next Steps

1. ✅ Create this PRD
2. ✅ Validate handle resolution (COMPLETE)
3. ✅ Validate comment count reconciliation (COMPLETE)
4. [ ] Review and prioritize with team
5. [ ] Test JWT verification with `pds.bretton.dev` (requires invite code or existing account)
6. [ ] Begin P0 blockers (DPoP fix first - highest user impact)
7. [ ] Set up monitoring infrastructure
8. [ ] Write critical E2E tests (especially full user journey)
9. [ ] Conduct load testing
10. [ ] Security review
11. [ ] Go/no-go decision
12. [ ] Launch! 🚀