# Backlog PRD: Platform Improvements & Technical Debt

**Status:** Ongoing
**Owner:** Platform Team
**Last Updated:** 2025-10-17

## Overview

Miscellaneous platform improvements, bug fixes, and technical debt that don't fit into feature-specific PRDs.

---

## 🔴 P0: Critical (Alpha Blockers)

### OAuth DPoP Token Architecture - Voting Write-Forward

**Added:** 2025-11-02 | **Completed:** 2025-11-02 | **Effort:** 2 hours | **Priority:** ALPHA BLOCKER

**Status:** ✅ COMPLETE

**Problem:** Our backend is attempting to use DPoP-bound OAuth tokens to write votes to users' PDSs, causing "Malformed token" errors. This violates atProto architecture patterns.

**Current (Incorrect) Flow:**

```
Mobile Client (OAuth + DPoP) → Coves Backend → User's PDS ❌
                                                    ↓
                                        "Malformed token" error
```

**Root Cause:**

- Mobile app uses OAuth with DPoP (Demonstrating Proof of Possession)
- DPoP tokens are cryptographically bound to the client's key pair via the `cnf.jkt` claim (a thumbprint of the client's public key)
- Each PDS request requires **both**:
  - `Authorization: DPoP <access-token>` (DPoP-bound tokens use the `DPoP` scheme, not `Bearer`)
  - `DPoP: <proof-jwt>` (signature proves client has private key)
- Backend cannot create DPoP proofs (doesn't have client's private key)
- **DPoP tokens are intentionally non-transferable** (security feature to prevent token theft)

**Evidence:**

```json
// Token decoded from mobile app session
{
  "sub": "did:plc:txrork7rurdueix27ulzi7ke",
  "cnf": {
    "jkt": "LSWROJhTkPn4yT18xUjiIz2Z7z7l_gozKfjjQTYgW9o" // ← DPoP binding
  },
  "client_id": "https://lingering-darkness-50a6.brettmay0212.workers.dev/client-metadata.json",
  "iss": "http://localhost:3001"
}
```

**atProto Best Practice (from Bluesky social-app analysis):**

- ✅ Clients write **directly to their own PDS** (no backend proxy)
- ✅ AppView **only indexes** from Jetstream (eventual consistency)
- ✅ PDS = User's personal data store (user controls writes)
- ✅ AppView = Read-only aggregator/indexer
- ❌ Backend should NOT proxy user write operations

**Correct Architecture:**

```
Mobile Client → User's PDS (direct write with DPoP proof) ✓
                     ↓
             Jetstream (firehose)
                     ↓
   Coves AppView (indexes votes from firehose)
```

**Affected Endpoints:**

1. **Vote Creation** - [create_vote.go:76](../internal/api/handlers/vote/create_vote.go#L76)
   - Currently: Backend writes to PDS using user's token
   - Should: Return error directing client to write directly
2. **Vote Service** - [service.go:126](../internal/core/votes/service.go#L126)
   - Currently: `createRecordOnPDSAs()` attempts write-forward
   - Should: Remove write-forward, rely on Jetstream indexing only

**Solution Options:**

**Option A: Client Direct Write (RECOMMENDED - Follows Bluesky)**

```typescript
// Mobile client writes directly (like Bluesky social-app)
const agent = new Agent(oauthSession)
await agent.call('com.atproto.repo.createRecord', {
  repo: userDid,
  collection: 'social.coves.interaction.vote',
  record: {
    $type: 'social.coves.interaction.vote',
    subject: { uri: postUri, cid: postCid },
    direction: 'up',
    createdAt: new Date().toISOString()
  }
})
```

Backend changes:

- Remove write-forward code from vote service
- Return error from XRPC endpoint: "Votes must be created directly at your PDS"
- Index votes from Jetstream consumer (already implemented)

**Option B: Backend App Passwords (NOT RECOMMENDED)**

- User creates app-specific password
- Backend uses password auth (gets regular JWTs, not DPoP)
- Security downgrade, poor UX

**Option C: Service Auth Token (Complex)**

- Backend gets its own service credentials
- Requires PDS to trust our AppView as delegated writer
- Non-standard atProto pattern

**Recommendation:** Option A (Client Direct Write)

- Matches atProto architecture
- Follows Bluesky social-app pattern
- Best security (user controls their data)
- Simplest implementation

**Implementation Tasks:**

1. Update Flutter OAuth package to expose `agent.call()` for custom lexicons
2. Update mobile vote UI to write directly to PDS
3. Remove write-forward code from backend vote service
4. Update vote XRPC handler to return helpful error message
5. Verify Jetstream consumer correctly indexes votes
6. Update integration tests to match new flow

**References:**

- Bluesky social-app: Direct PDS writes via agent
- atProto OAuth spec: DPoP binding prevents token reuse
- atProto architecture: AppView = read-only indexer

---

### OAuth DPoP Token Architecture - Community Subscriptions

**Added:** 2025-11-02 | **Effort:** 1-2 hours | **Priority:** ALPHA BLOCKER

**Status:** 📋 TODO (Waiting for frontend implementation)

**Problem:** Same DPoP token issue as voting - backend cannot use user's DPoP-bound OAuth tokens to write subscription records to user's PDS.

**Affected Operations:**

- `SubscribeToCommunity()` - [service.go:564-624](../internal/core/communities/service.go#L564-L624)
- `UnsubscribeFromCommunity()` - [service.go:626-660](../internal/core/communities/service.go#L626-L660)

**Collection:** `social.coves.community.subscription`

**Solution:** Client writes directly using `com.atproto.repo.createRecord`:

```typescript
await agent.call('com.atproto.repo.createRecord', {
  repo: userDid,
  collection: 'social.coves.community.subscription',
  record: {
    $type: 'social.coves.community.subscription',
    subject: communityDid,
    contentVisibility: 3,
    createdAt: new Date().toISOString()
  }
})
```

**Backend Changes Needed:**

1. Remove write-forward from `SubscribeToCommunity()` and `UnsubscribeFromCommunity()`
2. Update handlers to return errors directing to client-direct pattern
3. Verify Jetstream consumer indexes subscriptions (already working)

**Files to Modify:**

- `internal/core/communities/service.go`
- `internal/api/handlers/community/subscribe.go`

---

### OAuth DPoP Token Architecture - Community Blocking

**Added:** 2025-11-02 | **Effort:** 1-2 hours | **Priority:** ALPHA BLOCKER

**Status:** 📋 TODO (Waiting for frontend implementation)

**Problem:** Same DPoP token issue - backend cannot use user's DPoP-bound OAuth tokens to write block records to user's PDS.

**Affected Operations:**

- `BlockCommunity()` - [service.go:709-781](../internal/core/communities/service.go#L709-L781)
- `UnblockCommunity()` - [service.go:783-816](../internal/core/communities/service.go#L783-L816)

**Collection:** `social.coves.community.block`

**Solution:** Client writes directly using `com.atproto.repo.createRecord`:

```typescript
await agent.call('com.atproto.repo.createRecord', {
  repo: userDid,
  collection: 'social.coves.community.block',
  record: {
    $type: 'social.coves.community.block',
    subject: communityDid,
    createdAt: new Date().toISOString()
  }
})
```

**Backend Changes Needed:**

1. Remove write-forward from `BlockCommunity()` and `UnblockCommunity()`
2. Update handlers to return errors directing to client-direct pattern
3. Verify Jetstream consumer indexes blocks (already working)

**Files to Modify:**

- `internal/core/communities/service.go`
- `internal/api/handlers/community/block.go`

---

## 🟡 P1: Important (Alpha Blockers)

### at-identifier Handle Resolution in Endpoints

**Added:** 2025-10-18 | **Effort:** 2-3 hours | **Priority:** ALPHA BLOCKER

**Problem:** Current implementation rejects handles in endpoints that declare `"format": "at-identifier"` in their lexicon schemas, violating atProto best practices and breaking legitimate client usage.

**Impact:**

- ❌ Post creation fails when client sends community handle (e.g., `!gardening.communities.coves.social`)
- ❌ Subscribe/unsubscribe endpoints reject handles despite lexicon declaring `at-identifier`
- ❌ Block endpoints use `"format": "did"` but should use `at-identifier` for consistency
- 🔴 **P0 Issue:** API contract violation - clients following the schema are rejected

**Root Cause:** Handlers and services validate `strings.HasPrefix(req.Community, "did:")` instead of calling `ResolveCommunityIdentifier()`.

**Affected Endpoints:**

1. **Post Creation** - [create.go:54](../internal/api/handlers/post/create.go#L54), [service.go:51](../internal/core/posts/service.go#L51)
   - Lexicon declares `at-identifier`: [post/create.json:16](../internal/atproto/lexicon/social/coves/post/create.json#L16)
2. **Subscribe** - [subscribe.go:52](../internal/api/handlers/community/subscribe.go#L52)
   - Lexicon declares `at-identifier`: [subscribe.json:16](../internal/atproto/lexicon/social/coves/community/subscribe.json#L16)
3. **Unsubscribe** - [subscribe.go:120](../internal/api/handlers/community/subscribe.go#L120)
   - Lexicon declares `at-identifier`: [unsubscribe.json:16](../internal/atproto/lexicon/social/coves/community/unsubscribe.json#L16)
4. **Block/Unblock** - [block.go:58](../internal/api/handlers/community/block.go#L58), [block.go:132](../internal/api/handlers/community/block.go#L132)
   - Lexicon declares `"format": "did"`: [block.json:15](../internal/atproto/lexicon/social/coves/community/block.json#L15)
   - Should be changed to `at-identifier` for consistency and best practice

**atProto Best Practice (from docs):**

- ✅ API endpoints should accept both DIDs and handles via `at-identifier` format
- ✅ Resolve handles to DIDs immediately at API boundary
- ✅ Use DIDs internally for all business logic and storage
- ✅ Handles are weak refs (changeable), DIDs are strong refs (permanent)
- ⚠️ Bidirectional verification required (already handled by `identity.CachingResolver`)

**Solution:** Replace direct DID validation with handle resolution using existing `ResolveCommunityIdentifier()`:

```go
// BEFORE (wrong) ❌
if !strings.HasPrefix(req.Community, "did:") {
    return error
}

// AFTER (correct) ✅
communityDID, err := h.communityService.ResolveCommunityIdentifier(ctx, req.Community)
if err != nil {
    if communities.IsNotFound(err) {
        writeError(w, http.StatusNotFound, "CommunityNotFound", "Community not found")
        return
    }
    writeError(w, http.StatusBadRequest, "InvalidRequest", err.Error())
    return
}
// Now use communityDID (guaranteed to be a DID)
```

**Implementation Plan:**

1. ✅ **Phase 1 (Alpha Blocker):** Fix post creation endpoint - COMPLETE (2025-10-18)
   - Post creation already uses `ResolveCommunityIdentifier()` at [service.go:100](../internal/core/posts/service.go#L100)
   - Supports handles, DIDs, and scoped formats
2. 📋 **Phase 2 (Beta):** Fix subscription endpoints
   - Update subscribe/unsubscribe handlers
   - Add tests for handle resolution in subscriptions
3. ✅ **Phase 3 (Beta):** Fix block endpoints - COMPLETE (2025-11-16)
   - Updated block/unblock handlers to use `ResolveCommunityIdentifier()`
   - Accepts handles (`@gaming.community.coves.social`), DIDs, and scoped format (`!gaming@coves.social`)
   - Added comprehensive tests: [block_handle_resolution_test.go](../tests/integration/block_handle_resolution_test.go)
   - All 7 test cases passing

**Files Modified (Phase 3 - Block Endpoints):**

- `internal/api/handlers/community/block.go` - Added `ResolveCommunityIdentifier()` calls
- `tests/integration/block_handle_resolution_test.go` - Comprehensive test coverage

**Existing Infrastructure:**

- ✅ `ResolveCommunityIdentifier()` already implemented at [service.go:852](../internal/core/communities/service.go#L852)
- ✅ `identity.CachingResolver` handles bidirectional verification and caching
- ✅ Supports both handle (`!name.communities.instance.com`) and DID formats

**Current Status:**

- ✅ Phase 1 (post creation) - Already implemented
- 📋 Phase 2 (subscriptions) - Deferred to Beta (lower priority)
- ✅ Phase 3 (block endpoints) - COMPLETE (2025-11-16)

---

### ✅ did:web Domain Verification & hostedByDID Auto-Population - COMPLETE

**Added:** 2025-10-11 | **Updated:** 2025-11-16 | **Completed:** 2025-11-16 | **Status:** ✅ DONE

**Problem:**

1. **Domain Impersonation**: Self-hosters can set `INSTANCE_DID=did:web:nintendo.com` without owning the domain, enabling attacks where communities appear hosted by trusted domains
2. **hostedByDID Spoofing**: Malicious instance operators can modify source code to claim communities are hosted by domains they don't own, enabling reputation hijacking and phishing

**Attack Scenarios:**

- Malicious instance sets `instanceDID="did:web:coves.social"` → communities show as hosted by official Coves
- Federation partners can't verify instance authenticity
- AppView pollution with fake hosting claims

**Solution Implemented (Bluesky-Compatible):**

1. ✅ **Domain Matching**: Verify `did:web:` domain matches configured `instanceDomain`
2. ✅ **Bidirectional Verification**: Fetch `https://domain/.well-known/did.json` and verify:
   - DID document exists and is valid
   - DID document ID matches claimed `instanceDID`
   - DID document claims handle domain in `alsoKnownAs` field (bidirectional binding)
   - Domain ownership proven via HTTPS hosting (matches Bluesky's trust model)
3. ✅ **Auto-populate hostedByDID**: Removed from client API, derived from instance configuration in service layer

**Current Status:**

- ✅ Default changed from `coves.local` → `coves.social` (fixes `.local` TLD bug)
- ✅ hostedByDID removed from client requests (2025-10-16)
- ✅ Service layer auto-populates `hostedByDID` from `instanceDID` (2025-10-16)
- ✅ Handler rejects client-provided `hostedByDID` (2025-10-16)
- ✅ Basic validation: Logs warning if `did:web:` domain ≠ `instanceDomain` (2025-10-16)
- ✅ **MANDATORY bidirectional DID verification** (2025-11-16)
- ✅ Cache TTL updated to 24h (matches Bluesky recommendations) (2025-11-16)

**Implementation Details:**

- **Security Model**: Matches Bluesky's approach - relies on DNS/HTTPS authority, not cryptographic proof
- **Enforcement**: MANDATORY hard-fail in production (rejects communities with verification failures)
- **Dev Mode**: Set `SKIP_DID_WEB_VERIFICATION=true` to bypass verification for local development
- **Performance**: Bounded LRU cache (1000 entries), rate limiting (10 req/s), 24h cache TTL
- **Bidirectional Check**: Prevents impersonation by requiring DID document to claim the handle
- **Location**: [internal/atproto/jetstream/community_consumer.go](../internal/atproto/jetstream/community_consumer.go)

---

### ✅ Token Refresh Logic for Community Credentials - COMPLETE

**Added:** 2025-10-11 | **Completed:** 2025-10-17 | **Effort:** 1.5 days | **Status:** ✅ DONE

**Problem:** Community PDS access tokens expire (~2hrs). Updates fail until manual intervention.

**Solution Implemented:**

- ✅ Automatic token refresh before PDS operations (5-minute buffer before expiration)
- ✅ JWT expiration parsing without signature verification (`parseJWTExpiration`, `needsRefresh`)
- ✅ Token refresh using Indigo SDK (`atproto.ServerRefreshSession`)
- ✅ Password fallback when refresh tokens expire (~2 months) via `atproto.ServerCreateSession`
- ✅ Atomic credential updates (`UpdateCredentials` repository method)
- ✅ Concurrency-safe with per-community mutex locking
- ✅ Structured logging for monitoring (`[TOKEN-REFRESH]` events)
- ✅ Integration tests for token expiration detection and credential updates

**Files Created:**

- [internal/core/communities/token_utils.go](../internal/core/communities/token_utils.go) - JWT parsing utilities
- [internal/core/communities/token_refresh.go](../internal/core/communities/token_refresh.go) - Refresh and re-auth logic
- [tests/integration/token_refresh_test.go](../tests/integration/token_refresh_test.go) - Integration tests

**Files Modified:**

- [internal/core/communities/service.go](../internal/core/communities/service.go) - Added `ensureFreshToken` + concurrency control
- [internal/core/communities/interfaces.go](../internal/core/communities/interfaces.go) - Added `UpdateCredentials` interface
- [internal/db/postgres/community_repo.go](../internal/db/postgres/community_repo.go) - Implemented `UpdateCredentials`

**Documentation:** See [IMPLEMENTATION_TOKEN_REFRESH.md](../docs/IMPLEMENTATION_TOKEN_REFRESH.md) for full details

**Impact:** ✅ Communities can now be updated 24+ hours after creation without manual intervention

---

### ✅ Subscription Visibility Level (Feed Slider 1-5 Scale) - COMPLETE

**Added:** 2025-10-15 | **Completed:** 2025-10-16 | **Effort:** 1 day | **Status:** ✅ DONE

**Problem:** Users couldn't control how much content they see from each community. Lexicon had `contentVisibility` (1-5 scale) but code didn't use it.

**Solution Implemented:**

- ✅ Updated subscribe handler to accept `contentVisibility` parameter (1-5, default 3)
- ✅ Store in subscription record on PDS (`social.coves.community.subscription`)
- ✅ Migration 008 adds `content_visibility` column to database with CHECK constraint
- ✅ Clamping at all layers (handler, service, consumer) for defense in depth
- ✅ Atomic subscriber count updates (SubscribeWithCount/UnsubscribeWithCount)
- ✅ Idempotent operations (safe for Jetstream event replays)
- ✅ Fixed critical collection name bug (was using wrong namespace)
- ✅ Production Jetstream consumer now running
- ✅ 13 comprehensive integration tests - all passing

**Files Modified:**

- Lexicon: [subscription.json](../internal/atproto/lexicon/social/coves/community/subscription.json) ✅ Updated to atProto conventions
- Handler: [community/subscribe.go](../internal/api/handlers/community/subscribe.go) ✅ Accepts contentVisibility
- Service: [communities/service.go](../internal/core/communities/service.go) ✅ Clamps and passes to PDS
- Consumer: [community_consumer.go](../internal/atproto/jetstream/community_consumer.go) ✅ Extracts and indexes
- Repository: [community_repo_subscriptions.go](../internal/db/postgres/community_repo_subscriptions.go) ✅ All queries updated
- Migration: [008_add_content_visibility_to_subscriptions.sql](../internal/db/migrations/008_add_content_visibility_to_subscriptions.sql) ✅ Schema changes
- Tests: [subscription_indexing_test.go](../tests/integration/subscription_indexing_test.go) ✅ Comprehensive coverage

**Documentation:** See [IMPLEMENTATION_SUBSCRIPTION_INDEXING.md](../docs/IMPLEMENTATION_SUBSCRIPTION_INDEXING.md) for full details

**Impact:** ✅ Users can now adjust feed volume per community (key feature from DOMAIN_KNOWLEDGE.md enabled)

---

### Community Blocking

**Added:** 2025-10-15 | **Effort:** 1 day | **Priority:** ALPHA BLOCKER

**Problem:** Users have no way to block unwanted communities from their feeds.

**Solution:**

1. **Lexicon:** Extend `social.coves.actor.block` to support community DIDs (currently user-only)
2. **Service:** Implement `BlockCommunity(userDID, communityDID)` and `UnblockCommunity()`
3. **Handlers:** Add XRPC endpoints `social.coves.community.block` and `unblock`
4. **Repository:** Add methods to track blocked communities
5. **Feed:** Filter blocked communities from feed queries (beta work)

**Code:**

- Lexicon: [actor/block.json](../internal/atproto/lexicon/social/coves/actor/block.json) - Currently only supports user DIDs
- Service: New methods needed
- Handlers: New files needed

**Impact:** Users can't avoid unwanted content without blocking

---

### ✅ Post comment_count Reconciliation - COMPLETE

**Added:** 2025-11-04 | **Completed:** 2025-11-16 | **Effort:** 2 hours | **Status:** ✅ DONE

**Problem:** When comments arrive before their parent post is indexed (common with cross-repo Jetstream ordering), the post's `comment_count` was never reconciled, causing posts to show permanently stale "0 comments" counters.

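The ordering problem is easiest to see in a toy model. The Go sketch below uses in-memory maps as a stand-in for the `posts` and `comments` tables; the type and method names (`index`, `indexComment`, `indexPost`) are hypothetical and only illustrate the reconcile-on-insert idea, not the actual consumer code.

```go
package main

import "fmt"

// index is a toy in-memory stand-in for the posts and comments tables.
type index struct {
	commentsByParent map[string]int // parent post URI -> live comments indexed so far
	postCounts       map[string]int // post URI -> comment_count column
}

func newIndex() *index {
	return &index{
		commentsByParent: map[string]int{},
		postCounts:       map[string]int{},
	}
}

// indexComment records a comment. The parent's counter can only be bumped
// if the post row already exists; otherwise the comment waits uncounted.
func (ix *index) indexComment(parentURI string) {
	ix.commentsByParent[parentURI]++
	if _, ok := ix.postCounts[parentURI]; ok {
		ix.postCounts[parentURI]++
	}
}

// indexPost inserts the post and reconciles comment_count from comments
// that arrived earlier. Re-running it is idempotent.
func (ix *index) indexPost(uri string) {
	ix.postCounts[uri] = ix.commentsByParent[uri]
}

func main() {
	ix := newIndex()
	post := "at://did:plc:example/social.coves.post/abc" // hypothetical URI

	ix.indexComment(post) // arrives before the post
	ix.indexComment(post) // arrives before the post
	ix.indexPost(post)    // reconciliation picks up both
	ix.indexComment(post) // arrives after the post

	fmt.Println("comment_count:", ix.postCounts[post])
}
```

Without the reconciliation step in `indexPost`, the two early comments would never be counted and the post would start from zero.
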
**Solution Implemented:**

- ✅ Post consumer reconciliation logic WAS already implemented at [post_consumer.go:210-226](../internal/atproto/jetstream/post_consumer.go#L210-L226)
- ✅ Reconciliation query counts pre-existing comments when indexing new posts
- ✅ Comprehensive test suite added: [post_consumer_test.go](../tests/integration/post_consumer_test.go)
  - Single comment before post
  - Multiple comments before post
  - Mixed before/after ordering
  - Idempotent indexing preserves counts
- ✅ Updated outdated FIXME comment at [comment_consumer.go:362](../internal/atproto/jetstream/comment_consumer.go#L362)
- ✅ All 4 test cases passing

**Implementation:**

```go
// Post consumer reconciliation (lines 210-226)
reconcileQuery := `
    UPDATE posts
    SET comment_count = (
        SELECT COUNT(*)
        FROM comments c
        WHERE c.parent_uri = $1 AND c.deleted_at IS NULL
    )
    WHERE id = $2
`
_, reconcileErr := tx.ExecContext(ctx, reconcileQuery, post.URI, postID)
```

**Files Modified:**

- `internal/atproto/jetstream/comment_consumer.go` - Updated documentation
- `tests/integration/post_consumer_test.go` - Added comprehensive test coverage

**Impact:** ✅ Post comment counters are now accurate regardless of Jetstream event ordering

---

## 🔴 P1.5: Federation Blockers (Beta Launch)

### Cross-PDS Write-Forward Support for Community Service

**Added:** 2025-10-17 | **Updated:** 2025-11-02 | **Effort:** 3-4 hours | **Priority:** FEDERATION BLOCKER (Beta)

**Problem:** Community service write-forward methods assume all users are on the same PDS as the Coves instance. This breaks federation when users from external PDSs try to subscribe/block communities.

**Current Behavior:**

- User on `pds.bsky.social` subscribes to community on `coves.social`
- Coves calls `s.pdsURL` (instance default: `http://localhost:3001`)
- Write goes to WRONG PDS → fails with `{"error":"InvalidToken","message":"Malformed token"}`

**Impact:**

- ✅ **Alpha**: Works fine (single PDS deployment, no federation)
- ❌ **Beta**: Breaks federation (users on different PDSs can't subscribe/block)

**Root Cause:**

- [service.go:1033](../internal/core/communities/service.go#L1033): `createRecordOnPDSAs` hardcodes `s.pdsURL`
- [service.go:1050](../internal/core/communities/service.go#L1050): `putRecordOnPDSAs` hardcodes `s.pdsURL`
- [service.go:1063](../internal/core/communities/service.go#L1063): `deleteRecordOnPDSAs` hardcodes `s.pdsURL`

**Affected Operations:**

- `SubscribeToCommunity` ([service.go:608](../internal/core/communities/service.go#L608))
- `UnsubscribeFromCommunity` (calls `deleteRecordOnPDSAs`)
- `BlockCommunity` ([service.go:739](../internal/core/communities/service.go#L739))
- `UnblockCommunity` (calls `deleteRecordOnPDSAs`)

**Solution:**

1. Add `identityResolver identity.Resolver` to `communityService` struct
2. Before write-forward, resolve user's DID → extract PDS URL
3. Call user's actual PDS instead of hardcoded `s.pdsURL`

**Implementation Pattern (from Vote Service):**

```go
// Add helper method to resolve user's PDS
func (s *communityService) resolveUserPDS(ctx context.Context, userDID string) (string, error) {
    // Named ident so the local variable does not shadow the identity package
    ident, err := s.identityResolver.Resolve(ctx, userDID)
    if err != nil {
        return "", fmt.Errorf("failed to resolve user PDS: %w", err)
    }
    if ident.PDSURL == "" {
        log.Printf("[COMMUNITY-PDS] WARNING: No PDS URL found for %s, using fallback: %s", userDID, s.pdsURL)
        return s.pdsURL, nil
    }
    return ident.PDSURL, nil
}

// Update write-forward methods:
func (s *communityService) createRecordOnPDSAs(ctx context.Context, repoDID, collection, rkey string, record map[string]interface{}, accessToken string) (string, string, error) {
    // Resolve user's actual PDS (critical for federation)
    pdsURL, err := s.resolveUserPDS(ctx, repoDID)
    if err != nil {
        return "", "", err // already wrapped by resolveUserPDS
    }
    endpoint := fmt.Sprintf("%s/xrpc/com.atproto.repo.createRecord", strings.TrimSuffix(pdsURL, "/"))
    // ... rest of method
}
```

**Files to Modify:**

- `internal/core/communities/service.go` - Add resolver field + `resolveUserPDS` helper
- `internal/core/communities/service.go` - Update `createRecordOnPDSAs`, `putRecordOnPDSAs`, `deleteRecordOnPDSAs`
- `cmd/server/main.go` - Pass identity resolver to community service constructor
- Tests - Add cross-PDS subscription/block scenarios

**Testing:**

- User on external PDS subscribes to community → writes to their PDS
- User on external PDS blocks community → writes to their PDS
- Community profile updates still work (writes to community's own PDS)

**Related:**

- ✅ **Vote Service**: Fixed in Alpha (2025-11-02) - users can vote from any PDS
- 🔴 **Community Service**: Deferred to Beta (no federation in Alpha)

---

## 🟢 P2: Nice-to-Have

### Remove Categories from Community Lexicon

**Added:** 2025-10-15 | **Effort:** 30 minutes | **Priority:** Cleanup

**Problem:** Categories field exists in create/update lexicon but not in profile record. Adds complexity without clear value.

**Solution:**

- Remove `categories` from [create.json](../internal/atproto/lexicon/social/coves/community/create.json#L46-L54)
- Remove `categories` from [update.json](../internal/atproto/lexicon/social/coves/community/update.json#L51-L59)
- Remove from [community.go:91](../internal/core/communities/community.go#L91)
- Remove from service layer ([service.go:109-110](../internal/core/communities/service.go#L109-L110))

**Impact:** Simplifies lexicon, removes unused feature

---

### Improve .local TLD Error Messages

**Added:** 2025-10-11 | **Effort:** 1 hour

**Problem:** Generic error "TLD .local is not allowed" confuses developers.

**Solution:** Enhance `InvalidHandleError` to explain root cause and suggest fixing `INSTANCE_DID`.

---

### Self-Hosting Security Guide

**Added:** 2025-10-11 | **Effort:** 1 day

**Needed:** Document did:web setup, DNS config, secrets management, rate limiting, PostgreSQL hardening, monitoring.

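For the did:web portion of such a guide, it may help to show where a DID document must be hosted for a given identifier. The Go sketch below follows the did:web method's mapping rules (a bare host maps to `/.well-known/did.json`, and a port is percent-encoded as `%3A`); `didWebDocumentURL` is a hypothetical helper written for illustration and is not taken from the Coves codebase.

```go
package main

import (
	"errors"
	"fmt"
	"net/url"
	"strings"
)

// didWebDocumentURL maps a did:web identifier to the HTTPS URL
// where its DID document is expected to be served.
func didWebDocumentURL(did string) (string, error) {
	rest, ok := strings.CutPrefix(did, "did:web:")
	if !ok || rest == "" {
		return "", errors.New("not a did:web identifier")
	}
	parts := strings.Split(rest, ":")

	// The first segment is the host; an optional port is percent-encoded (%3A).
	host, err := url.PathUnescape(parts[0])
	if err != nil || host == "" {
		return "", errors.New("invalid did:web host")
	}
	if len(parts) == 1 {
		return "https://" + host + "/.well-known/did.json", nil
	}
	// Remaining colon-separated segments become URL path segments.
	return "https://" + host + "/" + strings.Join(parts[1:], "/") + "/did.json", nil
}

func main() {
	for _, did := range []string{
		"did:web:coves.social",
		"did:web:localhost%3A3001",
		"did:plc:notweb",
	} {
		u, err := didWebDocumentURL(did)
		if err != nil {
			fmt.Printf("%s: %v\n", did, err)
			continue
		}
		fmt.Printf("%s -> %s\n", did, u)
	}
}
```

A self-hoster can use this mapping to confirm that the URL derived from their `INSTANCE_DID` actually serves a DID document over HTTPS before starting the instance.
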
---

### OAuth Session Cleanup Race Condition

**Added:** 2025-10-11 | **Effort:** 2 hours

**Problem:** Cleanup goroutine doesn't handle graceful shutdown, may orphan DB connections.

**Solution:** Pass cancellable context, handle SIGTERM, add cleanup timeout.

---

### Jetstream Consumer Race Condition

**Added:** 2025-10-11 | **Effort:** 1 hour

**Problem:** Multiple goroutines can call `close(done)` concurrently in consumer shutdown.

**Solution:** Use `sync.Once` for channel close or atomic flag for shutdown state.

**Code:** TODO in [jetstream/user_consumer.go:114](../internal/atproto/jetstream/user_consumer.go#L114)

---

### Unfurl Cache Cleanup Background Job

**Added:** 2025-11-07 | **Effort:** 2-3 hours | **Priority:** Performance/Maintenance

**Problem:** The `unfurl_cache` table will grow indefinitely as expired entries are not deleted. While the cache uses lazy expiration (checking `expires_at` on read), old records remain in the database consuming disk space.

**Impact:**

- 📊 ~1KB per cached URL
- 📈 At 10K cached URLs = ~10MB (negligible for alpha)
- ⚠️ At 1M cached URLs = ~1GB (potential issue at scale)
- 🐌 Table bloat can slow down queries over time

**Current Mitigation:**

- ✅ Lazy expiration: Cache hits check `expires_at` and refetch if expired
- ✅ Indexed on `expires_at` for efficient expiration queries
- ✅ Not critical for alpha (growth is gradual)

**Solution (Beta/Production):** Implement background cleanup job to delete expired entries:

```go
// Periodic cleanup (run daily or weekly)
func (r *unfurlRepository) CleanupExpired(ctx context.Context) (int64, error) {
    query := `DELETE FROM unfurl_cache WHERE expires_at < NOW()`
    result, err := r.db.ExecContext(ctx, query)
    if err != nil {
        return 0, err
    }
    return result.RowsAffected()
}
```

**Implementation Options:**

1. **Cron job**: Separate process runs cleanup on schedule
2. **Background goroutine**: Service-level background task with configurable interval
3. **PostgreSQL pg_cron extension**: Database-level scheduled cleanup

**Recommended Approach:**

- Phase 1 (Beta): Background goroutine running weekly cleanup
- Phase 2 (Production): Migrate to pg_cron or external cron for reliability

**Configuration:**

```bash
UNFURL_CACHE_CLEANUP_ENABLED=true
UNFURL_CACHE_CLEANUP_INTERVAL=168h # 7 days
```

**Monitoring:**

- Log cleanup operations: `[UNFURL-CACHE-CLEANUP] Deleted 1234 expired entries`
- Track table size growth over time
- Alert if table exceeds threshold (e.g., 100MB)

**Files to Create:**

- `internal/core/unfurl/cleanup.go` - Background cleanup service

**Related:**

- Implemented in oEmbed unfurling feature (2025-11-07)
- Cache table: [migration XXX_create_unfurl_cache.sql](../internal/db/migrations/)

---

## 🔵 P3: Technical Debt

### Implement PutRecord in PDS Client

**Added:** 2025-12-04 | **Effort:** 2-3 hours | **Priority:** Technical Debt

**Status:** 📋 TODO

**Problem:** The PDS client (`internal/atproto/pds/client.go`) only has `CreateRecord` but lacks `PutRecord`. This means updates use `CreateRecord` with an existing rkey, which:

1. Loses optimistic locking (no CID swap check)
2. Is semantically incorrect (creates vs updates)
3. Could cause race conditions on concurrent updates

**atProto Best Practice:**

- `com.atproto.repo.putRecord` should be used for updates
- Accepts `swapRecord` (expected CID) for optimistic locking
- Returns conflict error if CID doesn't match (concurrent modification detected)

**Solution:** Add `PutRecord` method to the PDS client interface:

```go
// Client interface addition
type Client interface {
    // ... existing methods ...

    // PutRecord creates or updates a record with optional optimistic locking.
    // If swapRecord is provided, the operation fails if the current CID doesn't match.
    PutRecord(ctx context.Context, collection string, rkey string, record any, swapRecord string) (uri string, cid string, err error)
}

// Implementation
func (c *client) PutRecord(ctx context.Context, collection string, rkey string, record any, swapRecord string) (string, string, error) {
    payload := map[string]any{
        "repo":       c.did,
        "collection": collection,
        "rkey":       rkey,
        "record":     record,
    }

    // Optional: optimistic locking via CID swap check
    if swapRecord != "" {
        payload["swapRecord"] = swapRecord
    }

    var result struct {
        URI string `json:"uri"`
        CID string `json:"cid"`
    }

    err := c.apiClient.Post(ctx, syntax.NSID("com.atproto.repo.putRecord"), payload, &result)
    if err != nil {
        return "", "", wrapAPIError(err, "putRecord")
    }

    return result.URI, result.CID, nil
}
```

**Error Handling:** Add new error type for conflict detection:

```go
var ErrConflict = errors.New("record was modified by another operation")
```

Map HTTP 409 in `wrapAPIError`:

```go
case 409:
    return fmt.Errorf("%s: %w: %s", operation, ErrConflict, apiErr.Message)
```

**Files to Modify:**

- `internal/atproto/pds/client.go` - Add `PutRecord` method and interface
- `internal/atproto/pds/errors.go` - Add `ErrConflict` error type

**Testing:**

- Unit test: Verify payload includes `swapRecord` when provided
- Integration test: Concurrent updates detect conflict
- Integration test: Update without `swapRecord` still works (backwards compatible)

**Blocked By:** Nothing

**Blocks:** "Migrate UpdateComment to use PutRecord"

---

### Migrate UpdateComment to Use PutRecord

**Added:** 2025-12-04 | **Effort:** 1 hour | **Priority:** Technical Debt

**Status:** 📋 TODO (Blocked)

**Blocked By:** "Implement PutRecord in PDS Client"

**Problem:** `UpdateComment` in `internal/core/comments/comment_service.go` uses `CreateRecord` for updates instead of `PutRecord`. This lacks optimistic locking and is semantically incorrect.

**Current Code (lines 687-690):**

```go
// TODO: Use PutRecord instead of CreateRecord for proper update semantics with optimistic locking.
// PutRecord should accept the existing CID (existingRecord.CID) to ensure concurrent updates are detected.
// However, PutRecord is not yet implemented in internal/atproto/pds/client.go.
uri, cid, err := pdsClient.CreateRecord(ctx, commentCollection, rkey, updatedRecord)
```

**Solution:** Once `PutRecord` is implemented in the PDS client, update to:

```go
// Use PutRecord with optimistic locking via existing CID
uri, cid, err := pdsClient.PutRecord(ctx, commentCollection, rkey, updatedRecord, existingRecord.CID)
if err != nil {
    if errors.Is(err, pds.ErrConflict) {
        // Record was modified by another operation - return appropriate error
        return nil, fmt.Errorf("comment was modified, please refresh and try again: %w", err)
    }
    // ... existing error handling
}
```

**Files to Modify:**

- `internal/core/comments/comment_service.go` - UpdateComment method
- `internal/core/comments/errors.go` - Add `ErrConcurrentModification` if needed

**Testing:**

- Unit test: Verify `PutRecord` is called with correct CID
- Integration test: Simulate concurrent update, verify conflict handling

**Impact:** Proper optimistic locking prevents lost updates from race conditions

---

### Consolidate Environment Variable Validation

**Added:** 2025-10-11 | **Effort:** 2-3 hours

Create `internal/config` package with structured config validation. Fail fast with clear errors.

---

### Add Connection Pooling for PDS HTTP Clients

**Added:** 2025-10-11 | **Effort:** 2 hours

Create shared `http.Client` with connection pooling instead of new client per request.

---

### Architecture Decision Records (ADRs)

**Added:** 2025-10-11 | **Effort:** Ongoing

Document: did:plc choice, pgcrypto encryption, Jetstream vs firehose, write-forward pattern, single handle field.


---

### Replace log Package with Structured Logger

**Added:** 2025-10-11 | **Effort:** 1 day

**Problem:** Using standard `log` package. Need structured logging (JSON) with levels.

**Solution:** Switch to `slog`, `zap`, or `zerolog`. Add request IDs, context fields.

**Code:** TODO in [community/errors.go:46](../internal/api/handlers/community/errors.go#L46)

---

### PDS URL Resolution from DID

**Added:** 2025-10-11 | **Effort:** 2-3 hours

**Problem:** User consumer doesn't resolve PDS URL from DID document when missing.

**Solution:** Query PLC directory for DID document, extract `serviceEndpoint`.

**Code:** TODO in [jetstream/user_consumer.go:203](../internal/atproto/jetstream/user_consumer.go#L203)

---

## Recent Completions

### ✅ Token Refresh for Community Credentials (2025-10-17)

**Completed:** Automatic token refresh prevents communities from breaking after 2 hours

**Implementation:**
- ✅ JWT expiration parsing and refresh detection (5-minute buffer)
- ✅ Token refresh using Indigo SDK (`atproto.ServerRefreshSession`)
- ✅ Password fallback when refresh tokens expire (`atproto.ServerCreateSession`)
- ✅ Atomic credential updates in database (`UpdateCredentials`)
- ✅ Concurrency-safe with per-community mutex locking
- ✅ Structured logging for monitoring (`[TOKEN-REFRESH]` events)
- ✅ Integration tests for expiration detection and credential updates

**Files Created:**
- [internal/core/communities/token_utils.go](../internal/core/communities/token_utils.go)
- [internal/core/communities/token_refresh.go](../internal/core/communities/token_refresh.go)
- [tests/integration/token_refresh_test.go](../tests/integration/token_refresh_test.go)

**Files Modified:**
- [internal/core/communities/service.go](../internal/core/communities/service.go) - Added `ensureFreshToken` method
- [internal/core/communities/interfaces.go](../internal/core/communities/interfaces.go) - Added `UpdateCredentials` interface
- [internal/db/postgres/community_repo.go](../internal/db/postgres/community_repo.go) - Implemented `UpdateCredentials`

**Documentation:** [IMPLEMENTATION_TOKEN_REFRESH.md](../docs/IMPLEMENTATION_TOKEN_REFRESH.md)

**Impact:** Communities now work indefinitely without manual token management

---

### ✅ OAuth Authentication for Community Actions (2025-10-16)

**Completed:** Full OAuth JWT authentication flow for protected endpoints

**Implementation:**
- ✅ JWT parser compatible with atProto PDS tokens (aud/iss handling)
- ✅ Auth middleware protecting create/update/subscribe/unsubscribe endpoints
- ✅ Handler-level DID extraction from JWT tokens via `middleware.GetUserDID(r)`
- ✅ Removed all X-User-DID header placeholders
- ✅ E2E tests validate complete OAuth flow with real PDS tokens
- ✅ Security: Issuer validation supports both HTTPS URLs and DIDs

**Files Modified:**
- [internal/atproto/auth/jwt.go](../internal/atproto/auth/jwt.go) - JWT parsing with atProto compatibility
- [internal/api/middleware/auth.go](../internal/api/middleware/auth.go) - Auth middleware
- [internal/api/handlers/community/](../internal/api/handlers/community/) - All handlers updated
- [tests/integration/community_e2e_test.go](../tests/integration/community_e2e_test.go) - OAuth E2E tests

**Related:** Also implemented `hostedByDID` auto-population for security (see P1 item above)

---

### ✅ Fix .local TLD Bug (2025-10-11)

Changed default `INSTANCE_DID` from `did:web:coves.local` → `did:web:coves.social`. Fixed community creation failure due to disallowed `.local` TLD.

---

## Prioritization

- **P0:** Security vulns, data loss, prod blockers
- **P1:** Major UX/reliability issues
- **P2:** QOL improvements, minor bugs, docs
- **P3:** Refactoring, code quality