Alpha Go-Live Readiness PRD#
Status: Pre-Alpha Target: Alpha launch with real users Last Updated: 2025-11-16
Overview#
This document tracks the remaining work required to launch Coves alpha with real users. Focus is on critical functionality, security, and operational readiness.
P0: Critical Blockers (Must Complete Before Alpha)#
1. Authentication & Security#
JWT Signature Verification (Production Mode)#
- Test with production PDS at
pds.bretton.dev- Create test account on production PDS
- Verify JWKS endpoint is accessible
- Run
TestJWTSignatureVerificationagainst production PDS - Confirm signature verification succeeds
- Test token refresh flow
- Set
AUTH_SKIP_VERIFY=falsein production environment - Verify all auth middleware tests pass with verification enabled
- Document production PDS requirements for communities
Estimated Effort: 2-3 hours Risk: Medium (code implemented, needs validation)
did:web Verification#
- Complete did:web domain verification implementation
- Test with real did:web identities
- Add security logging for verification failures
- Set
SKIP_DID_WEB_VERIFICATION=falsefor production
Estimated Effort: 2-3 hours Risk: Medium
2. DPoP Token Architecture Fix#
Problem: Backend attempts to write subscriptions/blocks to user PDS using DPoP-bound tokens (fails with "Malformed token").
Remove Write-Forward Code#
- Remove write-forward from
SubscribeToCommunityhandler - Remove write-forward from
UnsubscribeFromCommunityhandler - Remove write-forward from
BlockCommunityhandler - Remove write-forward from
UnblockCommunityhandler - Update handlers to return helpful error: "Write directly to your PDS"
- Update API documentation to reflect client-write pattern
- Verify Jetstream consumers still index correctly
Files:
internal/core/communities/service.go:564-816internal/api/handlers/community/subscribe.gointernal/api/handlers/community/block.go
Estimated Effort: 3-4 hours Risk: Low (similar to votes pattern)
P1: Important (Should Complete Before Alpha)#
5. Post Read Operations#
- Implement
getPostendpoint (single post retrieval) - Implement
listPostsendpoint (with pagination) - Add post permalink support
- Integration tests for post retrieval
- Error handling for missing/deleted posts
Estimated Effort: 6-8 hours Risk: Low Note: Can defer if direct post linking not needed initially
6. Production Infrastructure#
Monitoring Setup#
- Add Prometheus metrics endpoints
- HTTP request metrics (duration, status codes, paths)
- Database query metrics (slow queries, connection pool)
- Jetstream consumer metrics (events processed, lag, errors)
- Auth metrics (token validations, failures)
- Set up Grafana dashboards
- Request rate and latency
- Error rates by endpoint
- Database performance
- Jetstream consumer health
- Configure alerting rules
- High error rate (>5% 5xx responses)
- Slow response time (p99 >1s)
- Database connection pool exhaustion
- Jetstream consumer lag >1 minute
- PDS health check failures
Estimated Effort: 8-10 hours
Structured Logging#
- Replace
logpackage with structured logger (zerolog or zap) - Add log levels (debug, info, warn, error)
- JSON output format for production
- Add request ID tracking
- Add correlation IDs for async operations
- Sanitize sensitive data from logs (passwords, tokens, emails)
- Configure log rotation
- Ship logs to aggregation service (optional: Loki, CloudWatch)
Estimated Effort: 6-8 hours
Database Backups#
- Automated PostgreSQL backups (daily minimum)
- Backup retention policy (30 days)
- Test restore procedure
- Document backup/restore runbook
- Off-site backup storage
- Monitor backup success/failure
- Point-in-time recovery (PITR) setup (optional)
Estimated Effort: 4-6 hours
Load Testing#
- Define load test scenarios
- User signup and authentication
- Community creation
- Post creation and viewing
- Feed retrieval (timeline, discover, community)
- Comment creation and threading
- Voting
- Set target metrics
- Concurrent users target (e.g., 100 concurrent)
- Requests per second target
- P95 latency target (<500ms)
- Error rate target (<1%)
- Run load tests with k6/Artillery/JMeter
- Identify bottlenecks (database, CPU, memory)
- Optimize slow queries
- Add database indexes if needed
- Test graceful degradation under load
Estimated Effort: 10-12 hours
Deployment Runbook#
- Document deployment procedure
- Pre-deployment checklist
- Database migration steps
- Environment variable validation
- Health check verification
- Rollback procedure
- Document operational procedures
- How to check system health
- How to read logs
- How to check Jetstream consumer status
- How to manually trigger community token refresh
- How to clear caches
- Document incident response
- Who to contact
- Escalation path
- Common issues and fixes
- Emergency procedures (PDS down, database down, etc.)
- Create production environment checklist
- All environment variables set
-
AUTH_SKIP_VERIFY=false -
SKIP_DID_WEB_VERIFICATION=false - Database migrations applied
- PDS connectivity verified
- JWKS caching working
- Jetstream consumers running
- Monitoring and alerting active
Estimated Effort: 6-8 hours
P2: Nice to Have (Can Defer to Post-Alpha)#
7. Post Update/Delete#
- Implement post update endpoint
- Implement post delete endpoint
- Jetstream consumer for UPDATE/DELETE events
- Soft delete support
Estimated Effort: 4-6 hours
8. Community Delete#
- Implement community delete endpoint
- Cascade delete considerations
- Archive vs hard delete decision
Estimated Effort: 2-3 hours
9. Content Rules Validation#
- Implement text-only community enforcement
- Implement allowed embed types validation
- Content length limits
Estimated Effort: 6-8 hours
10. Search Functionality#
- Community search improvements
- Post search
- User search
- Full-text search with PostgreSQL or external service
Estimated Effort: 8-10 hours
Testing Gaps#
E2E Testing Recommendations#
1. Full User Journey Test (CRITICAL)#
What: Test complete user flow from signup to interaction Why: No single test validates the entire happy path
- Create test: Signup → Authenticate → Create Community → Create Post → Add Comment → Vote
- Verify all data flows through Jetstream correctly
- Verify counts update (vote counts, comment counts, subscriber counts)
- Verify timeline feed shows posts from subscribed communities
- Test with 2+ users interacting (user A posts, user B comments)
File: Create tests/integration/user_journey_e2e_test.go
Estimated Effort: 4-6 hours
2. Blob Upload E2E Test#
What: Test image upload and display in posts Why: No test validates the full blob upload → post → feed display flow
- Create post with embedded image
- Verify blob uploaded to PDS
- Verify blob URL transformation in feed responses
- Test multiple images in single post
- Test image in comment
Estimated Effort: 3-4 hours
3. Multi-Community Timeline Test#
What: Test timeline feed with multiple community subscriptions Why: Timeline logic may have edge cases with multiple sources
- Create 3+ communities
- Subscribe user to all communities
- Create posts in each community
- Verify timeline shows posts from all subscribed communities
- Verify hot/top/new sorting across communities
Estimated Effort: 2-3 hours
4. Concurrent User Scenarios#
What: Test system behavior with simultaneous users Why: Race conditions and locking issues only appear under concurrency
- Multiple users voting on same post simultaneously
- Multiple users commenting on same post simultaneously
- Community creation with same handle (should fail)
- Subscription race conditions
Estimated Effort: 4-5 hours
5. Rate Limiting Tests#
What: Verify rate limits work correctly Why: Protection against abuse
- Test aggregator rate limits (already exists)
- Test general endpoint rate limits (100 req/min)
- Test comment rate limits (20 req/min)
- Verify 429 responses
- Verify rate limit headers
Estimated Effort: 2-3 hours
6. Error Recovery Tests#
What: Test system recovery from failures Why: Production will have failures
- Jetstream reconnection after disconnect
- PDS temporarily unavailable during post creation
- Database connection loss and recovery
- Malformed Jetstream events (should skip, not crash)
- Out-of-order event handling (already partially covered)
Estimated Effort: 4-5 hours
7. Federation Readiness (Optional)#
What: Test cross-PDS interactions Why: Future-proofing for federation
- User on different PDS subscribing to Coves community
- User on different PDS commenting on Coves post
- User on different PDS voting on Coves content
- Handle resolution across PDSs
Note: Defer to Beta unless federation is alpha requirement
Timeline Estimate#
Week 1: Critical Blockers (P0)#
- Days 1-2: Authentication (JWT + did:web verification)
- Day 3: DPoP token architecture fix
Day 4: Handle resolution + comment count reconciliation✅ COMPLETED- Day 4-5: Testing and bug fixes
Total: 15-20 hours (reduced from 20-25 due to completed items)
Week 2: Production Infrastructure (P1)#
- Days 6-7: Monitoring + structured logging
- Day 8: Database backups + load testing
- Days 9-10: Deployment runbook + final testing
Total: 30-35 hours
Week 3: E2E Testing + Polish#
- Days 11-12: Critical E2E tests (user journey, blob upload)
- Day 13: Additional E2E tests
- Days 14-15: Load testing, bug fixes, polish
Total: 20-25 hours
Grand Total: 65-80 hours (approximately 2-3 weeks full-time) (Reduced from original 70-85 hours estimate due to completed handle resolution and comment count reconciliation)
Success Criteria#
Alpha is ready when:
- All P0 blockers resolved
- ✅ Handle resolution (COMPLETE)
- ✅ Comment count reconciliation (COMPLETE)
- JWT signature verification working with production PDS
- DPoP architecture fix implemented
- did:web verification complete
- Subscriptions/blocking work via client-write pattern
- All integration tests passing
- E2E user journey test passing
- Load testing shows acceptable performance (100+ concurrent users)
- Monitoring and alerting active
- Database backups configured and tested
- Deployment runbook complete and validated
- Security audit completed (basic)
- No known critical bugs
Go/No-Go Decision Points#
Can we launch without it?#
| Feature | Alpha Requirement | Status | Rationale |
|---|---|---|---|
| JWT signature verification | ✅ YES | 🟡 Needs testing | Security critical |
| DPoP architecture fix | ✅ YES | 🔴 Not started | Subscriptions broken without it |
| ✅ COMPLETE | Core UX requirement | ||
| ✅ COMPLETE | Data accuracy | ||
| Post read endpoints | ⚠️ MAYBE | 🔴 Not implemented | Can use feeds initially |
| Post update/delete | ❌ NO | 🔴 Not implemented | Can add post-launch |
| Moderation system | ❌ NO | 🔴 Not implemented | Deferred to Beta per PRD_GOVERNANCE |
| Full-text search | ❌ NO | 🔴 Not implemented | Browse works without it |
| Federation testing | ❌ NO | 🔴 Not implemented | Single-instance alpha |
| Mobile app | ⚠️ MAYBE | 🔴 Not implemented | Web-first acceptable |
Risk Assessment#
High Risk#
- JWT verification with production PDS - Never tested with real JWKS
- Load under real traffic - Current tests are single-user
- Operational knowledge - No one has run this in production yet
Medium Risk#
- Database performance - Queries optimized but not load tested
- Jetstream consumer lag - May fall behind under high write volume
- Token refresh stability - Community tokens refresh every 2 hours (tested but not long-running)
Low Risk#
- DPoP architecture fix - Similar pattern already works (votes)
Handle resolution- ✅ Already implementedComment reconciliation- ✅ Already implemented
Open Questions#
- What's the target alpha user count? (affects infrastructure sizing)
- What's the alpha duration? (affects monitoring retention, backup retention)
- Is mobile app required for alpha? (affects DPoP testing priority)
- What's the rollback strategy? (database migrations may not be reversible)
- Who's on-call during alpha? (affects runbook detail level)
- What's the acceptable downtime? (affects HA requirements)
- Budget for infrastructure? (affects monitoring/backup solutions)
Next Steps#
- ✅ Create this PRD
- ✅ Validate handle resolution (COMPLETE)
- ✅ Validate comment count reconciliation (COMPLETE)
- Review and prioritize with team
- Test JWT verification with
pds.bretton.dev(requires invite code or existing account) - Begin P0 blockers (DPoP fix first - highest user impact)
- Set up monitoring infrastructure
- Write critical E2E tests (especially full user journey)
- Conduct load testing
- Security review
- Go/no-go decision
- Launch! 🚀