# Backlog PRD: Platform Improvements & Technical Debt

**Status:** Ongoing
**Owner:** Platform Team
**Last Updated:** 2025-10-17

## Overview

Miscellaneous platform improvements, bug fixes, and technical debt that don't fit into feature-specific PRDs.

---

## 🟡 P1: Important (Alpha Blockers)

### at-identifier Handle Resolution in Endpoints
**Added:** 2025-10-18 | **Effort:** 2-3 hours | **Priority:** ALPHA BLOCKER

**Problem:**
The current implementation rejects handles on endpoints whose lexicon schemas declare `"format": "at-identifier"`, violating atProto best practices and breaking legitimate client usage.

**Impact:**
- ❌ Post creation fails when a client sends a community handle (e.g., `!gardening.communities.coves.social`)
- ❌ Subscribe/unsubscribe endpoints reject handles despite their lexicons declaring `at-identifier`
- ❌ Block endpoints use `"format": "did"` but should use `at-identifier` for consistency
- 🔴 **P0 Issue:** API contract violation - clients following the schema are rejected

**Root Cause:**
Handlers and services check `strings.HasPrefix(req.Community, "did:")` instead of calling `ResolveCommunityIdentifier()`.

**Affected Endpoints:**
1. **Post Creation** - [create.go:54](../internal/api/handlers/post/create.go#L54), [service.go:51](../internal/core/posts/service.go#L51)
   - Lexicon declares `at-identifier`: [post/create.json:16](../internal/atproto/lexicon/social/coves/post/create.json#L16)

2. **Subscribe** - [subscribe.go:52](../internal/api/handlers/community/subscribe.go#L52)
   - Lexicon declares `at-identifier`: [subscribe.json:16](../internal/atproto/lexicon/social/coves/community/subscribe.json#L16)

3. **Unsubscribe** - [subscribe.go:120](../internal/api/handlers/community/subscribe.go#L120)
   - Lexicon declares `at-identifier`: [unsubscribe.json:16](../internal/atproto/lexicon/social/coves/community/unsubscribe.json#L16)

4. **Block/Unblock** - [block.go:58](../internal/api/handlers/community/block.go#L58), [block.go:132](../internal/api/handlers/community/block.go#L132)
   - Lexicon declares `"format": "did"`: [block.json:15](../internal/atproto/lexicon/social/coves/community/block.json#L15)
   - Should be changed to `at-identifier` for consistency and best practice

**atProto Best Practice (from docs):**
- ✅ API endpoints should accept both DIDs and handles via `at-identifier` format
- ✅ Resolve handles to DIDs immediately at API boundary
- ✅ Use DIDs internally for all business logic and storage
- ✅ Handles are weak refs (changeable), DIDs are strong refs (permanent)
- ⚠️ Bidirectional verification required (already handled by `identity.CachingResolver`)
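
To make the "accept both, resolve at the boundary" rule concrete, here is a minimal sketch of the syntactic split between the two branches of `at-identifier`. The helper `classifyIdentifier` and its types are hypothetical names for illustration only. It is not the existing `ResolveCommunityIdentifier()`, and it does no network resolution or bidirectional verification.

```go
package main

import (
	"errors"
	"strings"
)

// identifierKind says which branch of the at-identifier format a value matched.
type identifierKind int

const (
	kindDID identifierKind = iota
	kindHandle
)

var errInvalidIdentifier = errors.New("identifier is neither a DID nor a handle")

// classifyIdentifier is a hypothetical helper: it only inspects syntax.
// Resolving a handle to a DID still has to go through the real resolver.
func classifyIdentifier(id string) (identifierKind, error) {
	id = strings.TrimSpace(id)
	switch {
	case strings.HasPrefix(id, "did:"):
		// A DID needs a method and a method-specific ID: did:<method>:<id>.
		parts := strings.SplitN(id, ":", 3)
		if len(parts) == 3 && parts[1] != "" && parts[2] != "" {
			return kindDID, nil
		}
	case strings.Contains(strings.TrimPrefix(id, "!"), "."):
		// Community handles in this document carry a leading "!".
		return kindHandle, nil
	}
	return 0, errInvalidIdentifier
}
```

A handler would branch on the result: pass DIDs through unchanged and send handles to the resolver, so everything past the API boundary only ever sees DIDs.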

**Solution:**
Replace direct DID validation with handle resolution using the existing `ResolveCommunityIdentifier()`:

```go
// BEFORE (wrong) ❌
// Rejects every handle outright; the error response here is illustrative.
if !strings.HasPrefix(req.Community, "did:") {
	writeError(w, http.StatusBadRequest, "InvalidRequest", "community must be a DID")
	return
}

// AFTER (correct) ✅
communityDID, err := h.communityService.ResolveCommunityIdentifier(ctx, req.Community)
if err != nil {
	if communities.IsNotFound(err) {
		writeError(w, http.StatusNotFound, "CommunityNotFound", "Community not found")
		return
	}
	writeError(w, http.StatusBadRequest, "InvalidRequest", err.Error())
	return
}
// Now use communityDID (guaranteed to be a DID)
```

**Implementation Plan:**
1. ✅ **Phase 1 (Alpha Blocker):** Fix post creation endpoint
   - Update handler validation in `internal/api/handlers/post/create.go`
   - Update service validation in `internal/core/posts/service.go`
   - Add integration tests for handle resolution in post creation

2. 📋 **Phase 2 (Beta):** Fix subscription endpoints
   - Update subscribe/unsubscribe handlers
   - Add tests for handle resolution in subscriptions

3. 📋 **Phase 3 (Beta):** Fix block endpoints
   - Update lexicon from `"format": "did"` → `"format": "at-identifier"`
   - Update block/unblock handlers
   - Add tests for handle resolution in blocking

**Files to Modify (Phase 1 - Post Creation):**
- `internal/api/handlers/post/create.go` - Remove DID validation, add handle resolution
- `internal/core/posts/service.go` - Remove DID validation, add handle resolution
- `internal/core/posts/interfaces.go` - Add `CommunityService` dependency
- `cmd/server/main.go` - Pass community service to post service constructor
- `tests/integration/post_creation_test.go` - Add handle resolution test cases

**Existing Infrastructure:**
- ✅ `ResolveCommunityIdentifier()` already implemented at [service.go:843](../internal/core/communities/service.go#L843)
- ✅ `identity.CachingResolver` handles bidirectional verification and caching
- ✅ Supports both handle (`!name.communities.instance.com`) and DID formats

**Current Status:**
- ⚠️ **BLOCKING POST CREATION PR**: Identified as P0 issue in code review
- 📋 Phase 1 (post creation) - To be implemented immediately
- 📋 Phase 2-3 (other endpoints) - Deferred to Beta

---

### did:web Domain Verification & hostedByDID Auto-Population
**Added:** 2025-10-11 | **Updated:** 2025-10-16 | **Effort:** 2-3 days | **Priority:** ALPHA BLOCKER

**Problem:**
1. **Domain Impersonation**: Self-hosters can set `INSTANCE_DID=did:web:nintendo.com` without owning the domain, enabling attacks where communities appear hosted by trusted domains
2. **hostedByDID Spoofing**: Malicious instance operators can modify source code to claim communities are hosted by domains they don't own, enabling reputation hijacking and phishing

**Attack Scenarios:**
- Malicious instance sets `instanceDID="did:web:coves.social"` → communities show as hosted by official Coves
- Federation partners can't verify instance authenticity
- AppView pollution with fake hosting claims

**Solution:**
1. **Basic Validation (Phase 1)**: Verify `did:web:` domain matches configured `instanceDomain`
2. **Cryptographic Verification (Phase 2)**: Fetch `https://domain/.well-known/did.json` and verify:
   - DID document exists and is valid
   - Domain ownership proven via HTTPS hosting
   - DID document matches claimed `instanceDID`
3. **Auto-populate hostedByDID**: Remove from client API, derive from instance configuration in service layer

**Current Status:**
- ✅ Default changed from `coves.local` → `coves.social` (fixes `.local` TLD bug)
- ✅ TODO comment in [cmd/server/main.go:126-131](../cmd/server/main.go#L126-L131)
- ✅ hostedByDID removed from client requests (2025-10-16)
- ✅ Service layer auto-populates `hostedByDID` from `instanceDID` (2025-10-16)
- ✅ Handler rejects client-provided `hostedByDID` (2025-10-16)
- ✅ Basic validation: Logs warning if `did:web:` domain ≠ `instanceDomain` (2025-10-16)
- ⚠️ **REMAINING**: Full DID document verification (cryptographic proof of ownership)

**Implementation Notes:**
- Phase 1 complete: Basic validation catches config errors, logs warnings
- Phase 2 needed: Fetch `https://domain/.well-known/did.json` and verify ownership
- Add `SKIP_DID_WEB_VERIFICATION=true` for dev mode
- Full verification blocks startup if domain ownership cannot be proven
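
The two phases could be sketched roughly as below. All three function names are hypothetical and are not taken from the codebase. The sketch only handles host-only `did:web` identifiers and leaves out the HTTPS fetch itself, so `verifyDIDDocument` assumes the caller has already downloaded the document body.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"net/url"
	"strings"
)

// checkInstanceDomain is the Phase 1 check: the did:web host must equal the
// configured instance domain. Non-did:web DIDs are ignored in this sketch.
func checkInstanceDomain(instanceDID, instanceDomain string) error {
	host, ok := strings.CutPrefix(instanceDID, "did:web:")
	if !ok {
		return nil
	}
	if !strings.EqualFold(host, instanceDomain) {
		return fmt.Errorf("did:web domain %q does not match instance domain %q", host, instanceDomain)
	}
	return nil
}

// didWebDocumentURL maps a host-only did:web identifier to the URL of its
// DID document. did:web percent-encodes an optional port as %3A.
func didWebDocumentURL(did string) (string, error) {
	rest, ok := strings.CutPrefix(did, "did:web:")
	if !ok || rest == "" {
		return "", errors.New("not a did:web identifier")
	}
	if strings.Contains(rest, ":") {
		return "", errors.New("path-based did:web identifiers are out of scope for this sketch")
	}
	host, err := url.PathUnescape(rest)
	if err != nil {
		return "", fmt.Errorf("invalid did:web host: %w", err)
	}
	return "https://" + host + "/.well-known/did.json", nil
}

// verifyDIDDocument is the core of Phase 2: the fetched document must
// declare the same DID the instance claims.
func verifyDIDDocument(body []byte, instanceDID string) error {
	var doc struct {
		ID string `json:"id"`
	}
	if err := json.Unmarshal(body, &doc); err != nil {
		return fmt.Errorf("DID document is not valid JSON: %w", err)
	}
	if doc.ID != instanceDID {
		return fmt.Errorf("DID document id %q does not match %q", doc.ID, instanceDID)
	}
	return nil
}
```

At startup, the server would run these in order and refuse to start on any error unless `SKIP_DID_WEB_VERIFICATION=true` is set.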

---

### ✅ Token Refresh Logic for Community Credentials - COMPLETE
**Added:** 2025-10-11 | **Completed:** 2025-10-17 | **Effort:** 1.5 days | **Status:** ✅ DONE

**Problem:** Community PDS access tokens expire after about 2 hours. Once they expired, updates failed until someone intervened manually.

**Solution Implemented:**
- ✅ Automatic token refresh before PDS operations (5-minute buffer before expiration)
- ✅ JWT expiration parsing without signature verification (`parseJWTExpiration`, `needsRefresh`)
- ✅ Token refresh using Indigo SDK (`atproto.ServerRefreshSession`)
- ✅ Password fallback when refresh tokens expire (~2 months) via `atproto.ServerCreateSession`
- ✅ Atomic credential updates (`UpdateCredentials` repository method)
- ✅ Concurrency-safe with per-community mutex locking
- ✅ Structured logging for monitoring (`[TOKEN-REFRESH]` events)
- ✅ Integration tests for token expiration detection and credential updates

**Files Created:**
- [internal/core/communities/token_utils.go](../internal/core/communities/token_utils.go) - JWT parsing utilities
- [internal/core/communities/token_refresh.go](../internal/core/communities/token_refresh.go) - Refresh and re-auth logic
- [tests/integration/token_refresh_test.go](../tests/integration/token_refresh_test.go) - Integration tests

**Files Modified:**
- [internal/core/communities/service.go](../internal/core/communities/service.go) - Added `ensureFreshToken` + concurrency control
- [internal/core/communities/interfaces.go](../internal/core/communities/interfaces.go) - Added `UpdateCredentials` interface
- [internal/db/postgres/community_repo.go](../internal/db/postgres/community_repo.go) - Implemented `UpdateCredentials`

**Documentation:** See [IMPLEMENTATION_TOKEN_REFRESH.md](../docs/IMPLEMENTATION_TOKEN_REFRESH.md) for full details

**Impact:** ✅ Communities can now be updated 24+ hours after creation without manual intervention

---

### ✅ Subscription Visibility Level (Feed Slider 1-5 Scale) - COMPLETE
**Added:** 2025-10-15 | **Completed:** 2025-10-16 | **Effort:** 1 day | **Status:** ✅ DONE

**Problem:** Users couldn't control how much content they see from each community. Lexicon had `contentVisibility` (1-5 scale) but code didn't use it.

**Solution Implemented:**
- ✅ Updated subscribe handler to accept `contentVisibility` parameter (1-5, default 3)
- ✅ Store in subscription record on PDS (`social.coves.community.subscription`)
- ✅ Migration 008 adds `content_visibility` column to database with CHECK constraint
- ✅ Clamping at all layers (handler, service, consumer) for defense in depth
- ✅ Atomic subscriber count updates (SubscribeWithCount/UnsubscribeWithCount)
- ✅ Idempotent operations (safe for Jetstream event replays)
- ✅ Fixed critical collection name bug (was using wrong namespace)
- ✅ Production Jetstream consumer now running
- ✅ 13 comprehensive integration tests - all passing
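
The clamping rule applied at each layer is small enough to show in full. This is a sketch of the behavior described above, not the shipped helper. The function name is hypothetical, and using a nil pointer to mean "client omitted the field" is an assumption.

```go
package main

const (
	minContentVisibility     = 1
	maxContentVisibility     = 5
	defaultContentVisibility = 3
)

// clampContentVisibility returns the default when the value is absent and
// otherwise forces it into the 1-5 range.
func clampContentVisibility(v *int) int {
	if v == nil {
		return defaultContentVisibility
	}
	switch {
	case *v < minContentVisibility:
		return minContentVisibility
	case *v > maxContentVisibility:
		return maxContentVisibility
	default:
		return *v
	}
}
```

Applying the same function in the handler, service, and consumer means an out-of-range value from any source can never trip the database CHECK constraint.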

**Files Modified:**
- Lexicon: [subscription.json](../internal/atproto/lexicon/social/coves/community/subscription.json) ✅ Updated to atProto conventions
- Handler: [community/subscribe.go](../internal/api/handlers/community/subscribe.go) ✅ Accepts contentVisibility
- Service: [communities/service.go](../internal/core/communities/service.go) ✅ Clamps and passes to PDS
- Consumer: [community_consumer.go](../internal/atproto/jetstream/community_consumer.go) ✅ Extracts and indexes
- Repository: [community_repo_subscriptions.go](../internal/db/postgres/community_repo_subscriptions.go) ✅ All queries updated
- Migration: [008_add_content_visibility_to_subscriptions.sql](../internal/db/migrations/008_add_content_visibility_to_subscriptions.sql) ✅ Schema changes
- Tests: [subscription_indexing_test.go](../tests/integration/subscription_indexing_test.go) ✅ Comprehensive coverage

**Documentation:** See [IMPLEMENTATION_SUBSCRIPTION_INDEXING.md](../docs/IMPLEMENTATION_SUBSCRIPTION_INDEXING.md) for full details

**Impact:** ✅ Users can now adjust feed volume per community (key feature from DOMAIN_KNOWLEDGE.md enabled)

---

### Community Blocking
**Added:** 2025-10-15 | **Effort:** 1 day | **Priority:** ALPHA BLOCKER

**Problem:** Users have no way to block unwanted communities from their feeds.

**Solution:**
1. **Lexicon:** Extend `social.coves.actor.block` to support community DIDs (currently user-only)
2. **Service:** Implement `BlockCommunity(userDID, communityDID)` and `UnblockCommunity()`
3. **Handlers:** Add XRPC endpoints `social.coves.community.block` and `unblock`
4. **Repository:** Add methods to track blocked communities
5. **Feed:** Filter blocked communities from feed queries (beta work)

**Code:**
- Lexicon: [actor/block.json](../internal/atproto/lexicon/social/coves/actor/block.json) - Currently only supports user DIDs
- Service: New methods needed
- Handlers: New files needed

**Impact:** Without blocking, users cannot keep unwanted communities out of their feeds.

---

## 🔴 P1.5: Federation Blockers (Beta Launch)

### Cross-PDS Write-Forward Support
**Added:** 2025-10-17 | **Effort:** 3-4 hours | **Priority:** FEDERATION BLOCKER (Beta)

**Problem:** Current write-forward implementation assumes all users are on the same PDS as the Coves instance. This breaks federation when users from external PDSs try to interact with communities.

**Current Behavior:**
- User on `pds.bsky.social` subscribes to community on `coves.social`
- Coves calls `s.pdsURL` (instance default: `http://localhost:3001`)
- Write goes to WRONG PDS → fails with 401/403

**Impact:**
- ✅ **Alpha**: Works fine (single PDS deployment)
- ❌ **Beta**: Breaks federation (users on different PDSs can't subscribe/interact)

**Root Cause:**
- [service.go:736](../internal/core/communities/service.go#L736): `createRecordOnPDSAs` hardcodes `s.pdsURL`
- [service.go:753](../internal/core/communities/service.go#L753): `putRecordOnPDSAs` hardcodes `s.pdsURL`
- [service.go:767](../internal/core/communities/service.go#L767): `deleteRecordOnPDSAs` hardcodes `s.pdsURL`

**Solution:**
1. Add identity resolver dependency to `CommunityService`
2. Before write-forward, resolve user's DID → extract PDS URL
3. Call user's actual PDS instead of `s.pdsURL`

**Implementation:**
```go
// Before write-forward to user's repo:
userIdentity, err := s.identityResolver.ResolveDID(ctx, userDID)
if err != nil {
	return fmt.Errorf("failed to resolve user PDS: %w", err)
}

// Use user's actual PDS URL
endpoint := fmt.Sprintf("%s/xrpc/com.atproto.repo.createRecord", userIdentity.PDSURL)
```

**Files to Modify:**
- `internal/core/communities/service.go` - Add resolver, modify write-forward methods
- `cmd/server/main.go` - Pass identity resolver to community service constructor
- Tests - Add cross-PDS scenarios

**Testing:**
- User on external PDS subscribes to community
- User on external PDS blocks community
- Community updates still work (communities ARE on instance PDS)

---

## 🟢 P2: Nice-to-Have

### Remove Categories from Community Lexicon
**Added:** 2025-10-15 | **Effort:** 30 minutes | **Priority:** Cleanup

**Problem:** Categories field exists in create/update lexicon but not in profile record. Adds complexity without clear value.

**Solution:**
- Remove `categories` from [create.json](../internal/atproto/lexicon/social/coves/community/create.json#L46-L54)
- Remove `categories` from [update.json](../internal/atproto/lexicon/social/coves/community/update.json#L51-L59)
- Remove from [community.go:91](../internal/core/communities/community.go#L91)
- Remove from service layer ([service.go:109-110](../internal/core/communities/service.go#L109-L110))

**Impact:** Simplifies lexicon, removes unused feature

---

### Improve .local TLD Error Messages
**Added:** 2025-10-11 | **Effort:** 1 hour

**Problem:** Generic error "TLD .local is not allowed" confuses developers.

**Solution:** Enhance `InvalidHandleError` to explain root cause and suggest fixing `INSTANCE_DID`.

---

### Self-Hosting Security Guide
**Added:** 2025-10-11 | **Effort:** 1 day

**Needed:** Document did:web setup, DNS config, secrets management, rate limiting, PostgreSQL hardening, monitoring.

---

### OAuth Session Cleanup Race Condition
**Added:** 2025-10-11 | **Effort:** 2 hours

**Problem:** The cleanup goroutine doesn't handle graceful shutdown and may orphan DB connections.

**Solution:** Pass a cancellable context, handle SIGTERM, and add a cleanup timeout.

---

### Jetstream Consumer Race Condition
**Added:** 2025-10-11 | **Effort:** 1 hour

**Problem:** Multiple goroutines can call `close(done)` concurrently in consumer shutdown.

**Solution:** Use `sync.Once` for channel close or atomic flag for shutdown state.

**Code:** TODO in [jetstream/user_consumer.go:114](../internal/atproto/jetstream/user_consumer.go#L114)

---

## 🔵 P3: Technical Debt

### Consolidate Environment Variable Validation
**Added:** 2025-10-11 | **Effort:** 2-3 hours

Create `internal/config` package with structured config validation. Fail fast with clear errors.
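
As a starting point, the package might look something like the sketch below. `INSTANCE_DID` appears elsewhere in this document. The `PDS_URL` variable name, the `Config` fields, and `loadConfig` itself are assumptions. Taking `getenv` as a parameter is a design choice that lets tests supply a map instead of touching the process environment.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// Config holds validated settings. The fields are illustrative only.
type Config struct {
	InstanceDID string
	PDSURL      string
}

// loadConfig reads and validates every variable, collecting all problems so
// the operator sees them in one startup error rather than one at a time.
func loadConfig(getenv func(string) string) (Config, error) {
	cfg := Config{
		InstanceDID: getenv("INSTANCE_DID"),
		PDSURL:      getenv("PDS_URL"),
	}
	var errs []error
	if !strings.HasPrefix(cfg.InstanceDID, "did:") {
		errs = append(errs, fmt.Errorf("INSTANCE_DID must be a DID, got %q", cfg.InstanceDID))
	}
	if !strings.HasPrefix(cfg.PDSURL, "http://") && !strings.HasPrefix(cfg.PDSURL, "https://") {
		errs = append(errs, fmt.Errorf("PDS_URL must be an http(s) URL, got %q", cfg.PDSURL))
	}
	// errors.Join returns nil when errs is empty.
	return cfg, errors.Join(errs...)
}
```

In `cmd/server/main.go` this would be called once as `loadConfig(os.Getenv)`, with the process exiting on a non-nil error.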

---

### Add Connection Pooling for PDS HTTP Clients
**Added:** 2025-10-11 | **Effort:** 2 hours

Create shared `http.Client` with connection pooling instead of new client per request.

---

### Architecture Decision Records (ADRs)
**Added:** 2025-10-11 | **Effort:** Ongoing

Document: did:plc choice, pgcrypto encryption, Jetstream vs firehose, write-forward pattern, single handle field.

---

### Replace log Package with Structured Logger
**Added:** 2025-10-11 | **Effort:** 1 day

**Problem:** The code uses the standard `log` package. It needs structured (JSON) logging with levels.

**Solution:** Switch to `slog`, `zap`, or `zerolog`. Add request IDs, context fields.

**Code:** TODO in [community/errors.go:46](../internal/api/handlers/community/errors.go#L46)

---

### PDS URL Resolution from DID
**Added:** 2025-10-11 | **Effort:** 2-3 hours

**Problem:** User consumer doesn't resolve PDS URL from DID document when missing.

**Solution:** Query PLC directory for DID document, extract `serviceEndpoint`.

**Code:** TODO in [jetstream/user_consumer.go:203](../internal/atproto/jetstream/user_consumer.go#L203)
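
The extraction step can be sketched independently of the HTTP call. The function below assumes the DID document has already been fetched (for `did:plc`, the public directory is understood to serve it at `https://plc.directory/<did>`, which should be confirmed against the configured PLC URL). It looks for the atproto PDS service entry, identified by an `id` ending in `#atproto_pds` and type `AtprotoPersonalDataServer`. The function and type names are hypothetical.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"strings"
)

// didDocument models only the parts of a DID document this sketch needs.
type didDocument struct {
	ID      string `json:"id"`
	Service []struct {
		ID              string `json:"id"`
		Type            string `json:"type"`
		ServiceEndpoint string `json:"serviceEndpoint"`
	} `json:"service"`
}

// pdsEndpointFromDIDDocument returns the serviceEndpoint of the atproto PDS
// entry in a raw DID document.
func pdsEndpointFromDIDDocument(raw []byte) (string, error) {
	var doc didDocument
	if err := json.Unmarshal(raw, &doc); err != nil {
		return "", fmt.Errorf("parse DID document: %w", err)
	}
	for _, svc := range doc.Service {
		if strings.HasSuffix(svc.ID, "#atproto_pds") &&
			svc.Type == "AtprotoPersonalDataServer" &&
			svc.ServiceEndpoint != "" {
			return svc.ServiceEndpoint, nil
		}
	}
	return "", errors.New("DID document has no atproto PDS service entry")
}
```

The same helper would serve the cross-PDS write-forward work in P1.5, since both need the user's PDS URL from their DID.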

---

### PLC Directory Registration (Production)
**Added:** 2025-10-11 | **Effort:** 1 day

**Problem:** The DID generator creates a `did:plc` identifier but doesn't register it with the PLC directory in production mode.

**Solution:** Implement PLC registration API call when `isDevEnv=false`.

**Code:** TODO in [did/generator.go:46](../internal/atproto/did/generator.go#L46)

---

## Recent Completions

### ✅ Token Refresh for Community Credentials (2025-10-17)
**Completed:** Automatic token refresh prevents communities from breaking after 2 hours

**Implementation:**
- ✅ JWT expiration parsing and refresh detection (5-minute buffer)
- ✅ Token refresh using Indigo SDK (`atproto.ServerRefreshSession`)
- ✅ Password fallback when refresh tokens expire (`atproto.ServerCreateSession`)
- ✅ Atomic credential updates in database (`UpdateCredentials`)
- ✅ Concurrency-safe with per-community mutex locking
- ✅ Structured logging for monitoring (`[TOKEN-REFRESH]` events)
- ✅ Integration tests for expiration detection and credential updates

**Files Created:**
- [internal/core/communities/token_utils.go](../internal/core/communities/token_utils.go)
- [internal/core/communities/token_refresh.go](../internal/core/communities/token_refresh.go)
- [tests/integration/token_refresh_test.go](../tests/integration/token_refresh_test.go)

**Files Modified:**
- [internal/core/communities/service.go](../internal/core/communities/service.go) - Added `ensureFreshToken` method
- [internal/core/communities/interfaces.go](../internal/core/communities/interfaces.go) - Added `UpdateCredentials` interface
- [internal/db/postgres/community_repo.go](../internal/db/postgres/community_repo.go) - Implemented `UpdateCredentials`

**Documentation:** [IMPLEMENTATION_TOKEN_REFRESH.md](../docs/IMPLEMENTATION_TOKEN_REFRESH.md)

**Impact:** Communities now work indefinitely without manual token management

---

### ✅ OAuth Authentication for Community Actions (2025-10-16)
**Completed:** Full OAuth JWT authentication flow for protected endpoints

**Implementation:**
- ✅ JWT parser compatible with atProto PDS tokens (aud/iss handling)
- ✅ Auth middleware protecting create/update/subscribe/unsubscribe endpoints
- ✅ Handler-level DID extraction from JWT tokens via `middleware.GetUserDID(r)`
- ✅ Removed all X-User-DID header placeholders
- ✅ E2E tests validate complete OAuth flow with real PDS tokens
- ✅ Security: Issuer validation supports both HTTPS URLs and DIDs

**Files Modified:**
- [internal/atproto/auth/jwt.go](../internal/atproto/auth/jwt.go) - JWT parsing with atProto compatibility
- [internal/api/middleware/auth.go](../internal/api/middleware/auth.go) - Auth middleware
- [internal/api/handlers/community/](../internal/api/handlers/community/) - All handlers updated
- [tests/integration/community_e2e_test.go](../tests/integration/community_e2e_test.go) - OAuth E2E tests

**Related:** Also implemented `hostedByDID` auto-population for security (see P1 item above)

---

### ✅ Fix .local TLD Bug (2025-10-11)
Changed default `INSTANCE_DID` from `did:web:coves.local` → `did:web:coves.social`. Fixed community creation failure due to disallowed `.local` TLD.

---

## Prioritization

- **P0:** Security vulns, data loss, prod blockers
- **P1:** Major UX/reliability issues
- **P2:** QOL improvements, minor bugs, docs
- **P3:** Refactoring, code quality