A community based topic aggregation platform built on atproto
# Backlog PRD: Platform Improvements & Technical Debt

**Status:** Ongoing
**Owner:** Platform Team
**Last Updated:** 2025-10-17

## Overview

Miscellaneous platform improvements, bug fixes, and technical debt that don't fit into feature-specific PRDs.

---

## 🟡 P1: Important (Alpha Blockers)

### at-identifier Handle Resolution in Endpoints
**Added:** 2025-10-18 | **Effort:** 2-3 hours | **Priority:** ALPHA BLOCKER

**Problem:**
Current implementation rejects handles in endpoints that declare `"format": "at-identifier"` in their lexicon schemas, violating atProto best practices and breaking legitimate client usage.

**Impact:**
- ❌ Post creation fails when client sends community handle (e.g., `!gardening.communities.coves.social`)
- ❌ Subscribe/unsubscribe endpoints reject handles despite lexicon declaring `at-identifier`
- ❌ Block endpoints use `"format": "did"` but should use `at-identifier` for consistency
- 🔴 **P0 Issue:** API contract violation - clients following the schema are rejected

**Root Cause:**
Handlers and services validate `strings.HasPrefix(req.Community, "did:")` instead of calling `ResolveCommunityIdentifier()`.

**Affected Endpoints:**
1. **Post Creation** - [create.go:54](../internal/api/handlers/post/create.go#L54), [service.go:51](../internal/core/posts/service.go#L51)
   - Lexicon declares `at-identifier`: [post/create.json:16](../internal/atproto/lexicon/social/coves/post/create.json#L16)

2. **Subscribe** - [subscribe.go:52](../internal/api/handlers/community/subscribe.go#L52)
   - Lexicon declares `at-identifier`: [subscribe.json:16](../internal/atproto/lexicon/social/coves/community/subscribe.json#L16)

3. **Unsubscribe** - [subscribe.go:120](../internal/api/handlers/community/subscribe.go#L120)
   - Lexicon declares `at-identifier`: [unsubscribe.json:16](../internal/atproto/lexicon/social/coves/community/unsubscribe.json#L16)

4. **Block/Unblock** - [block.go:58](../internal/api/handlers/community/block.go#L58), [block.go:132](../internal/api/handlers/community/block.go#L132)
   - Lexicon declares `"format": "did"`: [block.json:15](../internal/atproto/lexicon/social/coves/community/block.json#L15)
   - Should be changed to `at-identifier` for consistency and best practice

**atProto Best Practice (from docs):**
- ✅ API endpoints should accept both DIDs and handles via `at-identifier` format
- ✅ Resolve handles to DIDs immediately at API boundary
- ✅ Use DIDs internally for all business logic and storage
- ✅ Handles are weak refs (changeable), DIDs are strong refs (permanent)
- ⚠️ Bidirectional verification required (already handled by `identity.CachingResolver`)

**Solution:**
Replace direct DID validation with handle resolution using existing `ResolveCommunityIdentifier()`:

```go
// BEFORE (wrong) ❌
if !strings.HasPrefix(req.Community, "did:") {
    return error
}

// AFTER (correct) ✅
communityDID, err := h.communityService.ResolveCommunityIdentifier(ctx, req.Community)
if err != nil {
    if communities.IsNotFound(err) {
        writeError(w, http.StatusNotFound, "CommunityNotFound", "Community not found")
        return
    }
    writeError(w, http.StatusBadRequest, "InvalidRequest", err.Error())
    return
}
// Now use communityDID (guaranteed to be a DID)
```

**Implementation Plan:**
1. **Phase 1 (Alpha Blocker):** Fix post creation endpoint
   - Update handler validation in `internal/api/handlers/post/create.go`
   - Update service validation in `internal/core/posts/service.go`
   - Add integration tests for handle resolution in post creation

2. 📋 **Phase 2 (Beta):** Fix subscription endpoints
   - Update subscribe/unsubscribe handlers
   - Add tests for handle resolution in subscriptions

3. 📋 **Phase 3 (Beta):** Fix block endpoints
   - Update lexicon from `"format": "did"` → `"format": "at-identifier"`
   - Update block/unblock handlers
   - Add tests for handle resolution in blocking

**Files to Modify (Phase 1 - Post Creation):**
- `internal/api/handlers/post/create.go` - Remove DID validation, add handle resolution
- `internal/core/posts/service.go` - Remove DID validation, add handle resolution
- `internal/core/posts/interfaces.go` - Add `CommunityService` dependency
- `cmd/server/main.go` - Pass community service to post service constructor
- `tests/integration/post_creation_test.go` - Add handle resolution test cases

**Existing Infrastructure:**
- ✅ `ResolveCommunityIdentifier()` already implemented at [service.go:843](../internal/core/communities/service.go#L843)
- ✅ `identity.CachingResolver` handles bidirectional verification and caching
- ✅ Supports both handle (`!name.communities.instance.com`) and DID formats

**Current Status:**
- ⚠️ **BLOCKING POST CREATION PR**: Identified as P0 issue in code review
- 📋 Phase 1 (post creation) - To be implemented immediately
- 📋 Phase 2-3 (other endpoints) - Deferred to Beta

---

### did:web Domain Verification & hostedByDID Auto-Population
**Added:** 2025-10-11 | **Updated:** 2025-10-16 | **Effort:** 2-3 days | **Priority:** ALPHA BLOCKER

**Problem:**
1. **Domain Impersonation**: Self-hosters can set `INSTANCE_DID=did:web:nintendo.com` without owning the domain, enabling attacks where communities appear hosted by trusted domains
2. **hostedByDID Spoofing**: Malicious instance operators can modify source code to claim communities are hosted by domains they don't own, enabling reputation hijacking and phishing

**Attack Scenarios:**
- Malicious instance sets `instanceDID="did:web:coves.social"` → communities show as hosted by official Coves
- Federation partners can't verify instance authenticity
- AppView pollution with fake hosting claims

**Solution:**
1. **Basic Validation (Phase 1)**: Verify `did:web:` domain matches configured `instanceDomain`
2. **Cryptographic Verification (Phase 2)**: Fetch `https://domain/.well-known/did.json` and verify:
   - DID document exists and is valid
   - Domain ownership proven via HTTPS hosting
   - DID document matches claimed `instanceDID`
3. **Auto-populate hostedByDID**: Remove from client API, derive from instance configuration in service layer

**Current Status:**
- ✅ Default changed from `coves.local` → `coves.social` (fixes `.local` TLD bug)
- ✅ TODO comment in [cmd/server/main.go:126-131](../cmd/server/main.go#L126-L131)
- ✅ hostedByDID removed from client requests (2025-10-16)
- ✅ Service layer auto-populates `hostedByDID` from `instanceDID` (2025-10-16)
- ✅ Handler rejects client-provided `hostedByDID` (2025-10-16)
- ✅ Basic validation: Logs warning if `did:web:` domain ≠ `instanceDomain` (2025-10-16)
- ⚠️ **REMAINING**: Full DID document verification (cryptographic proof of ownership)

**Implementation Notes:**
- Phase 1 complete: Basic validation catches config errors, logs warnings
- Phase 2 needed: Fetch `https://domain/.well-known/did.json` and verify ownership
- Add `SKIP_DID_WEB_VERIFICATION=true` for dev mode
- Full verification blocks startup if domain ownership cannot be proven

---

### ✅ Token Refresh Logic for Community Credentials - COMPLETE
**Added:** 2025-10-11 | **Completed:** 2025-10-17 | **Effort:** 1.5 days | **Status:** ✅ DONE


**Problem:** Community PDS access tokens expire (~2hrs). Updates fail until manual intervention.

**Solution Implemented:**
- ✅ Automatic token refresh before PDS operations (5-minute buffer before expiration)
- ✅ JWT expiration parsing without signature verification (`parseJWTExpiration`, `needsRefresh`)
- ✅ Token refresh using Indigo SDK (`atproto.ServerRefreshSession`)
- ✅ Password fallback when refresh tokens expire (~2 months) via `atproto.ServerCreateSession`
- ✅ Atomic credential updates (`UpdateCredentials` repository method)
- ✅ Concurrency-safe with per-community mutex locking
- ✅ Structured logging for monitoring (`[TOKEN-REFRESH]` events)
- ✅ Integration tests for token expiration detection and credential updates

**Files Created:**
- [internal/core/communities/token_utils.go](../internal/core/communities/token_utils.go) - JWT parsing utilities
- [internal/core/communities/token_refresh.go](../internal/core/communities/token_refresh.go) - Refresh and re-auth logic
- [tests/integration/token_refresh_test.go](../tests/integration/token_refresh_test.go) - Integration tests

**Files Modified:**
- [internal/core/communities/service.go](../internal/core/communities/service.go) - Added `ensureFreshToken` + concurrency control
- [internal/core/communities/interfaces.go](../internal/core/communities/interfaces.go) - Added `UpdateCredentials` interface
- [internal/db/postgres/community_repo.go](../internal/db/postgres/community_repo.go) - Implemented `UpdateCredentials`

**Documentation:** See [IMPLEMENTATION_TOKEN_REFRESH.md](../docs/IMPLEMENTATION_TOKEN_REFRESH.md) for full details

**Impact:** ✅ Communities can now be updated 24+ hours after creation without manual intervention

---

### ✅ Subscription Visibility Level (Feed Slider 1-5 Scale) - COMPLETE
**Added:** 2025-10-15 | **Completed:** 2025-10-16 | **Effort:** 1 day | **Status:** ✅ DONE

**Problem:** Users couldn't control how much content they see from each community. Lexicon had `contentVisibility` (1-5 scale) but code didn't use it.

**Solution Implemented:**
- ✅ Updated subscribe handler to accept `contentVisibility` parameter (1-5, default 3)
- ✅ Store in subscription record on PDS (`social.coves.community.subscription`)
- ✅ Migration 008 adds `content_visibility` column to database with CHECK constraint
- ✅ Clamping at all layers (handler, service, consumer) for defense in depth
- ✅ Atomic subscriber count updates (SubscribeWithCount/UnsubscribeWithCount)
- ✅ Idempotent operations (safe for Jetstream event replays)
- ✅ Fixed critical collection name bug (was using wrong namespace)
- ✅ Production Jetstream consumer now running
- ✅ 13 comprehensive integration tests - all passing

**Files Modified:**
- Lexicon: [subscription.json](../internal/atproto/lexicon/social/coves/community/subscription.json) ✅ Updated to atProto conventions
- Handler: [community/subscribe.go](../internal/api/handlers/community/subscribe.go) ✅ Accepts contentVisibility
- Service: [communities/service.go](../internal/core/communities/service.go) ✅ Clamps and passes to PDS
- Consumer: [community_consumer.go](../internal/atproto/jetstream/community_consumer.go) ✅ Extracts and indexes
- Repository: [community_repo_subscriptions.go](../internal/db/postgres/community_repo_subscriptions.go) ✅ All queries updated
- Migration: [008_add_content_visibility_to_subscriptions.sql](../internal/db/migrations/008_add_content_visibility_to_subscriptions.sql) ✅ Schema changes
- Tests: [subscription_indexing_test.go](../tests/integration/subscription_indexing_test.go) ✅ Comprehensive coverage

**Documentation:** See [IMPLEMENTATION_SUBSCRIPTION_INDEXING.md](../docs/IMPLEMENTATION_SUBSCRIPTION_INDEXING.md) for full details

**Impact:** ✅ Users can now adjust feed volume per community (key feature from DOMAIN_KNOWLEDGE.md enabled)

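**Illustrative sketch (clamping):** The `contentVisibility` rule listed above (values limited to 1-5, default 3) could look roughly like the helper below. This is a minimal sketch, not the code in `service.go`: the name `clampContentVisibility`, the constants, and the choice of a nil pointer to represent an omitted parameter are all assumptions made for illustration.

```go
package main

import "fmt"

// Hypothetical constants mirroring the 1-5 scale and default of 3 from this PRD.
const (
	minContentVisibility     = 1
	maxContentVisibility     = 5
	defaultContentVisibility = 3
)

// clampContentVisibility is an illustrative helper (assumed name).
// A nil input models an omitted parameter and yields the default;
// out-of-range values are pulled back to the nearest bound.
func clampContentVisibility(v *int) int {
	if v == nil {
		return defaultContentVisibility
	}
	if *v < minContentVisibility {
		return minContentVisibility
	}
	if *v > maxContentVisibility {
		return maxContentVisibility
	}
	return *v
}

func main() {
	tooHigh := 9
	fmt.Println(clampContentVisibility(nil))      // omitted: falls back to the default
	fmt.Println(clampContentVisibility(&tooHigh)) // out of range: clamped to the maximum
}
```

Calling the same helper in the handler, service, and consumer is one way to get the "defense in depth" behavior, since each layer then tolerates unclamped input from the one before it.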
---

### Community Blocking
**Added:** 2025-10-15 | **Effort:** 1 day | **Priority:** ALPHA BLOCKER

**Problem:** Users have no way to block unwanted communities from their feeds.

**Solution:**
1. **Lexicon:** Extend `social.coves.actor.block` to support community DIDs (currently user-only)
2. **Service:** Implement `BlockCommunity(userDID, communityDID)` and `UnblockCommunity()`
3. **Handlers:** Add XRPC endpoints `social.coves.community.block` and `unblock`
4. **Repository:** Add methods to track blocked communities
5. **Feed:** Filter blocked communities from feed queries (beta work)

**Code:**
- Lexicon: [actor/block.json](../internal/atproto/lexicon/social/coves/actor/block.json) - Currently only supports user DIDs
- Service: New methods needed
- Handlers: New files needed

**Impact:** Users can't avoid unwanted content without blocking

---

## 🔴 P1.5: Federation Blockers (Beta Launch)

### Cross-PDS Write-Forward Support
**Added:** 2025-10-17 | **Effort:** 3-4 hours | **Priority:** FEDERATION BLOCKER (Beta)

**Problem:** Current write-forward implementation assumes all users are on the same PDS as the Coves instance. This breaks federation when users from external PDSs try to interact with communities.

**Current Behavior:**
- User on `pds.bsky.social` subscribes to community on `coves.social`
- Coves calls `s.pdsURL` (instance default: `http://localhost:3001`)
- Write goes to WRONG PDS → fails with 401/403

**Impact:**
- **Alpha**: Works fine (single PDS deployment)
- **Beta**: Breaks federation (users on different PDSs can't subscribe/interact)

**Root Cause:**
- [service.go:736](../internal/core/communities/service.go#L736): `createRecordOnPDSAs` hardcodes `s.pdsURL`
- [service.go:753](../internal/core/communities/service.go#L753): `putRecordOnPDSAs` hardcodes `s.pdsURL`
- [service.go:767](../internal/core/communities/service.go#L767): `deleteRecordOnPDSAs` hardcodes `s.pdsURL`

**Solution:**
1. Add identity resolver dependency to `CommunityService`
2. Before write-forward, resolve user's DID → extract PDS URL
3. Call user's actual PDS instead of `s.pdsURL`

**Implementation:**
```go
// Before write-forward to user's repo:
userIdentity, err := s.identityResolver.ResolveDID(ctx, userDID)
if err != nil {
    return fmt.Errorf("failed to resolve user PDS: %w", err)
}

// Use user's actual PDS URL
endpoint := fmt.Sprintf("%s/xrpc/com.atproto.repo.createRecord", userIdentity.PDSURL)
```

**Files to Modify:**
- `internal/core/communities/service.go` - Add resolver, modify write-forward methods
- `cmd/server/main.go` - Pass identity resolver to community service constructor
- Tests - Add cross-PDS scenarios

**Testing:**
- User on external PDS subscribes to community
- User on external PDS blocks community
- Community updates still work (communities ARE on instance PDS)

---

## 🟢 P2: Nice-to-Have

### Remove Categories from Community Lexicon
**Added:** 2025-10-15 | **Effort:** 30 minutes | **Priority:** Cleanup

**Problem:** Categories field exists in create/update lexicon but not in profile record. Adds complexity without clear value.

**Solution:**
- Remove `categories` from [create.json](../internal/atproto/lexicon/social/coves/community/create.json#L46-L54)
- Remove `categories` from [update.json](../internal/atproto/lexicon/social/coves/community/update.json#L51-L59)
- Remove from [community.go:91](../internal/core/communities/community.go#L91)
- Remove from service layer ([service.go:109-110](../internal/core/communities/service.go#L109-L110))

**Impact:** Simplifies lexicon, removes unused feature

---

### Improve .local TLD Error Messages
**Added:** 2025-10-11 | **Effort:** 1 hour

**Problem:** Generic error "TLD .local is not allowed" confuses developers.

**Solution:** Enhance `InvalidHandleError` to explain root cause and suggest fixing `INSTANCE_DID`.

---

### Self-Hosting Security Guide
**Added:** 2025-10-11 | **Effort:** 1 day

**Needed:** Document did:web setup, DNS config, secrets management, rate limiting, PostgreSQL hardening, monitoring.

---

### OAuth Session Cleanup Race Condition
**Added:** 2025-10-11 | **Effort:** 2 hours

**Problem:** Cleanup goroutine doesn't handle graceful shutdown, may orphan DB connections.

**Solution:** Pass cancellable context, handle SIGTERM, add cleanup timeout.

---

### Jetstream Consumer Race Condition
**Added:** 2025-10-11 | **Effort:** 1 hour

**Problem:** Multiple goroutines can call `close(done)` concurrently in consumer shutdown.

**Solution:** Use `sync.Once` for channel close or atomic flag for shutdown state.

**Code:** TODO in [jetstream/user_consumer.go:114](../internal/atproto/jetstream/user_consumer.go#L114)

---

## 🔵 P3: Technical Debt

### Consolidate Environment Variable Validation
**Added:** 2025-10-11 | **Effort:** 2-3 hours

Create `internal/config` package with structured config validation. Fail fast with clear errors.


---

### Add Connection Pooling for PDS HTTP Clients
**Added:** 2025-10-11 | **Effort:** 2 hours

Create shared `http.Client` with connection pooling instead of new client per request.

---

### Architecture Decision Records (ADRs)
**Added:** 2025-10-11 | **Effort:** Ongoing

Document: did:plc choice, pgcrypto encryption, Jetstream vs firehose, write-forward pattern, single handle field.

---

### Replace log Package with Structured Logger
**Added:** 2025-10-11 | **Effort:** 1 day

**Problem:** Using standard `log` package. Need structured logging (JSON) with levels.

**Solution:** Switch to `slog`, `zap`, or `zerolog`. Add request IDs, context fields.

**Code:** TODO in [community/errors.go:46](../internal/api/handlers/community/errors.go#L46)

---

### PDS URL Resolution from DID
**Added:** 2025-10-11 | **Effort:** 2-3 hours

**Problem:** User consumer doesn't resolve PDS URL from DID document when missing.

**Solution:** Query PLC directory for DID document, extract `serviceEndpoint`.

**Code:** TODO in [jetstream/user_consumer.go:203](../internal/atproto/jetstream/user_consumer.go#L203)

---

### PLC Directory Registration (Production)
**Added:** 2025-10-11 | **Effort:** 1 day

**Problem:** DID generator creates did:plc but doesn't register in prod mode.

**Solution:** Implement PLC registration API call when `isDevEnv=false`.

**Code:** TODO in [did/generator.go:46](../internal/atproto/did/generator.go#L46)

---

## Recent Completions

### ✅ Token Refresh for Community Credentials (2025-10-17)
**Completed:** Automatic token refresh prevents communities from breaking after 2 hours

**Implementation:**
- ✅ JWT expiration parsing and refresh detection (5-minute buffer)
- ✅ Token refresh using Indigo SDK (`atproto.ServerRefreshSession`)
- ✅ Password fallback when refresh tokens expire (`atproto.ServerCreateSession`)
- ✅ Atomic credential updates in database (`UpdateCredentials`)
- ✅ Concurrency-safe with per-community mutex locking
- ✅ Structured logging for monitoring (`[TOKEN-REFRESH]` events)
- ✅ Integration tests for expiration detection and credential updates

**Files Created:**
- [internal/core/communities/token_utils.go](../internal/core/communities/token_utils.go)
- [internal/core/communities/token_refresh.go](../internal/core/communities/token_refresh.go)
- [tests/integration/token_refresh_test.go](../tests/integration/token_refresh_test.go)

**Files Modified:**
- [internal/core/communities/service.go](../internal/core/communities/service.go) - Added `ensureFreshToken` method
- [internal/core/communities/interfaces.go](../internal/core/communities/interfaces.go) - Added `UpdateCredentials` interface
- [internal/db/postgres/community_repo.go](../internal/db/postgres/community_repo.go) - Implemented `UpdateCredentials`

**Documentation:** [IMPLEMENTATION_TOKEN_REFRESH.md](../docs/IMPLEMENTATION_TOKEN_REFRESH.md)

**Impact:** Communities now work indefinitely without manual token management

---

### ✅ OAuth Authentication for Community Actions (2025-10-16)
**Completed:** Full OAuth JWT authentication flow for protected endpoints

**Implementation:**
- ✅ JWT parser compatible with atProto PDS tokens (aud/iss handling)
- ✅ Auth middleware protecting create/update/subscribe/unsubscribe endpoints
- ✅ Handler-level DID extraction from JWT tokens via `middleware.GetUserDID(r)`
- ✅ Removed all X-User-DID header placeholders
- ✅ E2E tests validate complete OAuth flow with real PDS tokens
- ✅ Security: Issuer validation supports both HTTPS URLs and DIDs

**Files Modified:**
- [internal/atproto/auth/jwt.go](../internal/atproto/auth/jwt.go) - JWT parsing with atProto compatibility
- [internal/api/middleware/auth.go](../internal/api/middleware/auth.go) - Auth middleware
- [internal/api/handlers/community/](../internal/api/handlers/community/) - All handlers updated
- [tests/integration/community_e2e_test.go](../tests/integration/community_e2e_test.go) - OAuth E2E tests

**Related:** Also implemented `hostedByDID` auto-population for security (see P1 item above)

---

### ✅ Fix .local TLD Bug (2025-10-11)
Changed default `INSTANCE_DID` from `did:web:coves.local` → `did:web:coves.social`. Fixed community creation failure due to disallowed `.local` TLD.

---

## Prioritization

- **P0:** Security vulns, data loss, prod blockers
- **P1:** Major UX/reliability issues
- **P2:** QOL improvements, minor bugs, docs
- **P3:** Refactoring, code quality