A community based topic aggregation platform built on atproto
# Backlog PRD: Platform Improvements & Technical Debt

**Status:** Ongoing
**Owner:** Platform Team
**Last Updated:** 2025-10-17

## Overview

Miscellaneous platform improvements, bug fixes, and technical debt that don't fit into feature-specific PRDs.

---

## 🟡 P1: Important (Alpha Blockers)

### did:web Domain Verification & hostedByDID Auto-Population
**Added:** 2025-10-11 | **Updated:** 2025-10-16 | **Effort:** 2-3 days | **Priority:** ALPHA BLOCKER

**Problem:**
1. **Domain Impersonation**: Self-hosters can set `INSTANCE_DID=did:web:nintendo.com` without owning the domain, enabling attacks where communities appear hosted by trusted domains
2. **hostedByDID Spoofing**: Malicious instance operators can modify source code to claim communities are hosted by domains they don't own, enabling reputation hijacking and phishing

**Attack Scenarios:**
- Malicious instance sets `instanceDID="did:web:coves.social"` → communities show as hosted by official Coves
- Federation partners can't verify instance authenticity
- AppView pollution with fake hosting claims

**Solution:**
1. **Basic Validation (Phase 1)**: Verify `did:web:` domain matches configured `instanceDomain`
2. **Cryptographic Verification (Phase 2)**: Fetch `https://domain/.well-known/did.json` and verify:
   - DID document exists and is valid
   - Domain ownership proven via HTTPS hosting
   - DID document matches claimed `instanceDID`
3. **Auto-populate hostedByDID**: Remove from client API, derive from instance configuration in service layer

**Current Status:**
- ✅ Default changed from `coves.local` → `coves.social` (fixes `.local` TLD bug)
- ✅ TODO comment in [cmd/server/main.go:126-131](../cmd/server/main.go#L126-L131)
- ✅ hostedByDID removed from client requests (2025-10-16)
- ✅ Service layer auto-populates `hostedByDID` from `instanceDID` (2025-10-16)
- ✅ Handler rejects client-provided `hostedByDID` (2025-10-16)
- ✅ Basic validation: Logs warning if `did:web:` domain ≠ `instanceDomain` (2025-10-16)
- ⚠️ **REMAINING**: Full DID document verification (cryptographic proof of ownership)

**Implementation Notes:**
- Phase 1 complete: Basic validation catches config errors, logs warnings
- Phase 2 needed: Fetch `https://domain/.well-known/did.json` and verify ownership
- Add `SKIP_DID_WEB_VERIFICATION=true` for dev mode
- Full verification blocks startup if domain ownership cannot be proven

---

### ✅ Token Refresh Logic for Community Credentials - COMPLETE
**Added:** 2025-10-11 | **Completed:** 2025-10-17 | **Effort:** 1.5 days | **Status:** ✅ DONE

**Problem:** Community PDS access tokens expire (~2hrs). Updates fail until manual intervention.

**Solution Implemented:**
- ✅ Automatic token refresh before PDS operations (5-minute buffer before expiration)
- ✅ JWT expiration parsing without signature verification (`parseJWTExpiration`, `needsRefresh`)
- ✅ Token refresh using Indigo SDK (`atproto.ServerRefreshSession`)
- ✅ Password fallback when refresh tokens expire (~2 months) via `atproto.ServerCreateSession`
- ✅ Atomic credential updates (`UpdateCredentials` repository method)
- ✅ Concurrency-safe with per-community mutex locking
- ✅ Structured logging for monitoring (`[TOKEN-REFRESH]` events)
- ✅ Integration tests for token expiration detection and credential updates

**Files Created:**
- [internal/core/communities/token_utils.go](../internal/core/communities/token_utils.go) - JWT parsing utilities
- [internal/core/communities/token_refresh.go](../internal/core/communities/token_refresh.go) - Refresh and re-auth logic
- [tests/integration/token_refresh_test.go](../tests/integration/token_refresh_test.go) - Integration tests

**Files Modified:**
- [internal/core/communities/service.go](../internal/core/communities/service.go) - Added `ensureFreshToken` + concurrency control
- [internal/core/communities/interfaces.go](../internal/core/communities/interfaces.go) - Added `UpdateCredentials` interface
- [internal/db/postgres/community_repo.go](../internal/db/postgres/community_repo.go) - Implemented `UpdateCredentials`

**Documentation:** See [IMPLEMENTATION_TOKEN_REFRESH.md](../docs/IMPLEMENTATION_TOKEN_REFRESH.md) for full details

**Impact:** ✅ Communities can now be updated 24+ hours after creation without manual intervention

---

### ✅ Subscription Visibility Level (Feed Slider 1-5 Scale) - COMPLETE
**Added:** 2025-10-15 | **Completed:** 2025-10-16 | **Effort:** 1 day | **Status:** ✅ DONE

**Problem:** Users couldn't control how much content they see from each community. Lexicon had `contentVisibility` (1-5 scale) but code didn't use it.

**Solution Implemented:**
- ✅ Updated subscribe handler to accept `contentVisibility` parameter (1-5, default 3)
- ✅ Store in subscription record on PDS (`social.coves.community.subscription`)
- ✅ Migration 008 adds `content_visibility` column to database with CHECK constraint
- ✅ Clamping at all layers (handler, service, consumer) for defense in depth
- ✅ Atomic subscriber count updates (SubscribeWithCount/UnsubscribeWithCount)
- ✅ Idempotent operations (safe for Jetstream event replays)
- ✅ Fixed critical collection name bug (was using wrong namespace)
- ✅ Production Jetstream consumer now running
- ✅ 13 comprehensive integration tests - all passing

**Files Modified:**
- Lexicon: [subscription.json](../internal/atproto/lexicon/social/coves/community/subscription.json) ✅ Updated to atProto conventions
- Handler: [community/subscribe.go](../internal/api/handlers/community/subscribe.go) ✅ Accepts contentVisibility
- Service: [communities/service.go](../internal/core/communities/service.go) ✅ Clamps and passes to PDS
- Consumer: [community_consumer.go](../internal/atproto/jetstream/community_consumer.go) ✅ Extracts and indexes
- Repository: [community_repo_subscriptions.go](../internal/db/postgres/community_repo_subscriptions.go) ✅ All queries updated
- Migration: [008_add_content_visibility_to_subscriptions.sql](../internal/db/migrations/008_add_content_visibility_to_subscriptions.sql) ✅ Schema changes
- Tests: [subscription_indexing_test.go](../tests/integration/subscription_indexing_test.go) ✅ Comprehensive coverage

**Documentation:** See [IMPLEMENTATION_SUBSCRIPTION_INDEXING.md](../docs/IMPLEMENTATION_SUBSCRIPTION_INDEXING.md) for full details

**Impact:** ✅ Users can now adjust feed volume per community (key feature from DOMAIN_KNOWLEDGE.md enabled)

---

### Community Blocking
**Added:** 2025-10-15 | **Effort:** 1 day | **Priority:** ALPHA BLOCKER

**Problem:** Users have no way to block unwanted communities from their feeds.

**Solution:**
1. **Lexicon:** Extend `social.coves.actor.block` to support community DIDs (currently user-only)
2. **Service:** Implement `BlockCommunity(userDID, communityDID)` and `UnblockCommunity()`
3. **Handlers:** Add XRPC endpoints `social.coves.community.block` and `unblock`
4. **Repository:** Add methods to track blocked communities
5. **Feed:** Filter blocked communities from feed queries (beta work)

**Code:**
- Lexicon: [actor/block.json](../internal/atproto/lexicon/social/coves/actor/block.json) - Currently only supports user DIDs
- Service: New methods needed
- Handlers: New files needed

**Impact:** Users can't avoid unwanted content without blocking

---

## 🔴 P1.5: Federation Blockers (Beta Launch)

### Cross-PDS Write-Forward Support
**Added:** 2025-10-17 | **Effort:** 3-4 hours | **Priority:** FEDERATION BLOCKER (Beta)

**Problem:** Current write-forward implementation assumes all users are on the same PDS as the Coves instance. This breaks federation when users from external PDSs try to interact with communities.

**Current Behavior:**
- User on `pds.bsky.social` subscribes to community on `coves.social`
- Coves calls `s.pdsURL` (instance default: `http://localhost:3001`)
- Write goes to WRONG PDS → fails with 401/403

**Impact:**
- **Alpha**: Works fine (single PDS deployment)
- **Beta**: Breaks federation (users on different PDSs can't subscribe/interact)

**Root Cause:**
- [service.go:736](../internal/core/communities/service.go#L736): `createRecordOnPDSAs` hardcodes `s.pdsURL`
- [service.go:753](../internal/core/communities/service.go#L753): `putRecordOnPDSAs` hardcodes `s.pdsURL`
- [service.go:767](../internal/core/communities/service.go#L767): `deleteRecordOnPDSAs` hardcodes `s.pdsURL`

**Solution:**
1. Add identity resolver dependency to `CommunityService`
2. Before write-forward, resolve user's DID → extract PDS URL
3. Call user's actual PDS instead of `s.pdsURL`

**Implementation:**
```go
// Before write-forward to user's repo:
userIdentity, err := s.identityResolver.ResolveDID(ctx, userDID)
if err != nil {
	return fmt.Errorf("failed to resolve user PDS: %w", err)
}

// Use user's actual PDS URL
endpoint := fmt.Sprintf("%s/xrpc/com.atproto.repo.createRecord", userIdentity.PDSURL)
```

**Files to Modify:**
- `internal/core/communities/service.go` - Add resolver, modify write-forward methods
- `cmd/server/main.go` - Pass identity resolver to community service constructor
- Tests - Add cross-PDS scenarios

**Testing:**
- User on external PDS subscribes to community
- User on external PDS blocks community
- Community updates still work (communities ARE on instance PDS)

---

## 🟢 P2: Nice-to-Have

### Remove Categories from Community Lexicon
**Added:** 2025-10-15 | **Effort:** 30 minutes | **Priority:** Cleanup

**Problem:** Categories field exists in create/update lexicon but not in profile record. Adds complexity without clear value.

**Solution:**
- Remove `categories` from [create.json](../internal/atproto/lexicon/social/coves/community/create.json#L46-L54)
- Remove `categories` from [update.json](../internal/atproto/lexicon/social/coves/community/update.json#L51-L59)
- Remove from [community.go:91](../internal/core/communities/community.go#L91)
- Remove from service layer ([service.go:109-110](../internal/core/communities/service.go#L109-L110))

**Impact:** Simplifies lexicon, removes unused feature

---

### Improve .local TLD Error Messages
**Added:** 2025-10-11 | **Effort:** 1 hour

**Problem:** Generic error "TLD .local is not allowed" confuses developers.

**Solution:** Enhance `InvalidHandleError` to explain root cause and suggest fixing `INSTANCE_DID`.

---

### Self-Hosting Security Guide
**Added:** 2025-10-11 | **Effort:** 1 day

**Needed:** Document did:web setup, DNS config, secrets management, rate limiting, PostgreSQL hardening, monitoring.

---

### OAuth Session Cleanup Race Condition
**Added:** 2025-10-11 | **Effort:** 2 hours

**Problem:** Cleanup goroutine doesn't handle graceful shutdown, may orphan DB connections.

**Solution:** Pass cancellable context, handle SIGTERM, add cleanup timeout.

---

### Jetstream Consumer Race Condition
**Added:** 2025-10-11 | **Effort:** 1 hour

**Problem:** Multiple goroutines can call `close(done)` concurrently in consumer shutdown.

**Solution:** Use `sync.Once` for channel close or atomic flag for shutdown state.

**Code:** TODO in [jetstream/user_consumer.go:114](../internal/atproto/jetstream/user_consumer.go#L114)

---

## 🔵 P3: Technical Debt

### Consolidate Environment Variable Validation
**Added:** 2025-10-11 | **Effort:** 2-3 hours

Create `internal/config` package with structured config validation. Fail fast with clear errors.

---

### Add Connection Pooling for PDS HTTP Clients
**Added:** 2025-10-11 | **Effort:** 2 hours

Create shared `http.Client` with connection pooling instead of new client per request.

---

### Architecture Decision Records (ADRs)
**Added:** 2025-10-11 | **Effort:** Ongoing

Document: did:plc choice, pgcrypto encryption, Jetstream vs firehose, write-forward pattern, single handle field.

---

### Replace log Package with Structured Logger
**Added:** 2025-10-11 | **Effort:** 1 day

**Problem:** Using standard `log` package. Need structured logging (JSON) with levels.

**Solution:** Switch to `slog`, `zap`, or `zerolog`. Add request IDs, context fields.

**Code:** TODO in [community/errors.go:46](../internal/api/handlers/community/errors.go#L46)

---

### PDS URL Resolution from DID
**Added:** 2025-10-11 | **Effort:** 2-3 hours

**Problem:** User consumer doesn't resolve PDS URL from DID document when missing.

**Solution:** Query PLC directory for DID document, extract `serviceEndpoint`.

**Code:** TODO in [jetstream/user_consumer.go:203](../internal/atproto/jetstream/user_consumer.go#L203)

---

### PLC Directory Registration (Production)
**Added:** 2025-10-11 | **Effort:** 1 day

**Problem:** DID generator creates did:plc but doesn't register in prod mode.

**Solution:** Implement PLC registration API call when `isDevEnv=false`.

**Code:** TODO in [did/generator.go:46](../internal/atproto/did/generator.go#L46)

---

## Recent Completions

### ✅ Token Refresh for Community Credentials (2025-10-17)
**Completed:** Automatic token refresh prevents communities from breaking after 2 hours

**Implementation:**
- ✅ JWT expiration parsing and refresh detection (5-minute buffer)
- ✅ Token refresh using Indigo SDK (`atproto.ServerRefreshSession`)
- ✅ Password fallback when refresh tokens expire (`atproto.ServerCreateSession`)
- ✅ Atomic credential updates in database (`UpdateCredentials`)
- ✅ Concurrency-safe with per-community mutex locking
- ✅ Structured logging for monitoring (`[TOKEN-REFRESH]` events)
- ✅ Integration tests for expiration detection and credential updates

**Files Created:**
- [internal/core/communities/token_utils.go](../internal/core/communities/token_utils.go)
- [internal/core/communities/token_refresh.go](../internal/core/communities/token_refresh.go)
- [tests/integration/token_refresh_test.go](../tests/integration/token_refresh_test.go)

**Files Modified:**
- [internal/core/communities/service.go](../internal/core/communities/service.go) - Added `ensureFreshToken` method
- [internal/core/communities/interfaces.go](../internal/core/communities/interfaces.go) - Added `UpdateCredentials` interface
- [internal/db/postgres/community_repo.go](../internal/db/postgres/community_repo.go) - Implemented `UpdateCredentials`

**Documentation:** [IMPLEMENTATION_TOKEN_REFRESH.md](../docs/IMPLEMENTATION_TOKEN_REFRESH.md)

**Impact:** Communities now work indefinitely without manual token management

---

### ✅ OAuth Authentication for Community Actions (2025-10-16)
**Completed:** Full OAuth JWT authentication flow for protected endpoints

**Implementation:**
- ✅ JWT parser compatible with atProto PDS tokens (aud/iss handling)
- ✅ Auth middleware protecting create/update/subscribe/unsubscribe endpoints
- ✅ Handler-level DID extraction from JWT tokens via `middleware.GetUserDID(r)`
- ✅ Removed all X-User-DID header placeholders
- ✅ E2E tests validate complete OAuth flow with real PDS tokens
- ✅ Security: Issuer validation supports both HTTPS URLs and DIDs

**Files Modified:**
- [internal/atproto/auth/jwt.go](../internal/atproto/auth/jwt.go) - JWT parsing with atProto compatibility
- [internal/api/middleware/auth.go](../internal/api/middleware/auth.go) - Auth middleware
- [internal/api/handlers/community/](../internal/api/handlers/community/) - All handlers updated
- [tests/integration/community_e2e_test.go](../tests/integration/community_e2e_test.go) - OAuth E2E tests

**Related:** Also implemented `hostedByDID` auto-population for security (see P1 item above)

---

### ✅ Fix .local TLD Bug (2025-10-11)
Changed default `INSTANCE_DID` from `did:web:coves.local` → `did:web:coves.social`. Fixed community creation failure due to disallowed `.local` TLD.

---

## Prioritization

- **P0:** Security vulns, data loss, prod blockers
- **P1:** Major UX/reliability issues
- **P2:** QOL improvements, minor bugs, docs
- **P3:** Refactoring, code quality