# Kagi News RSS Aggregator PRD

**Status:** ✅ Phase 1 Complete - Ready for Deployment
**Owner:** Platform Team
**Last Updated:** 2025-10-24
**Parent PRD:** [PRD_AGGREGATORS.md](PRD_AGGREGATORS.md)
**Implementation:** Python + Docker Compose

## 🎉 Implementation Complete

All core components have been implemented and tested:

- ✅ **RSS Fetcher** - Fetches feeds with retry logic and error handling
- ✅ **HTML Parser** - Extracts all structured data (summary, highlights, perspectives, quote, sources)
- ✅ **Rich Text Formatter** - Formats content with proper facets for Coves
- ✅ **State Manager** - Tracks posted stories to prevent duplicates
- ✅ **Config Manager** - Loads and validates YAML configuration
- ✅ **Coves Client** - Handles authentication and post creation
- ✅ **Main Orchestrator** - Coordinates all components
- ✅ **Comprehensive Tests** - 57 tests with 83% code coverage
- ✅ **Documentation** - README with setup and deployment instructions
- ✅ **Example Configs** - config.example.yaml and .env.example

**Test Results:**
```
57 passed, 6 skipped, 1 warning in 8.76s
Coverage: 83%
```

**Ready for:**
- Integration testing with live Coves API
- Aggregator DID creation and authorization
- Production deployment

## Overview

The Kagi News RSS Aggregator is a reference implementation of the Coves aggregator system that automatically posts high-quality, multi-source news summaries to communities. It leverages Kagi News's free RSS feeds to provide pre-aggregated, deduplicated news content with multiple perspectives and source citations.

**Key Value Propositions:**
- **Multi-source aggregation**: Kagi News already aggregates multiple sources per story
- **Balanced perspectives**: Built-in perspective tracking from different outlets
- **Rich metadata**: Categories, highlights, source links included
- **Legal & free**: CC BY-NC licensed for non-commercial use
- **Low complexity**: No LLM deduplication needed (Kagi does it)
- **Simple deployment**: Python + Docker Compose, runs alongside Coves on same instance

## Data Source: Kagi News RSS Feeds

### Licensing & Legal

**License:** CC BY-NC (Creative Commons Attribution-NonCommercial)

**Terms:**
- ✅ **Free for non-commercial use** (Coves qualifies)
- ✅ **Attribution required** (must credit Kagi News)
- ❌ **Cannot use commercially** (must contact support@kagi.com for commercial license)
- ✅ **Data can be shared** (with same attribution + NC requirements)

**Source:** https://news.kagi.com/about

**Quote from Kagi:**
> Note that kite.json and files referenced by it are licensed under CC BY-NC license. This means that this data can be used free of charge (with attribution and for non-commercial use). If you would like to license this data for commercial use let us know through support@kagi.com.

**Compliance Requirements:**
- Visible attribution to Kagi News on every post
- Link back to original Kagi story page
- Non-commercial operation (met: Coves is non-commercial)

---

### RSS Feed Structure

**Base URL Pattern:** `https://news.kagi.com/{category}.xml`

**Known Categories:**
- `world.xml` - World news
- `tech.xml` - Technology
- `business.xml` - Business
- `sports.xml` - Sports (likely)
- Additional categories TBD (need to scrape homepage)

**Feed Format:** RSS 2.0 (standard XML)

**Update Frequency:** One daily update (~noon UTC)

**Important Note on Domain Migration (October 2025):**
Kagi migrated their RSS feeds from `kite.kagi.com` to `news.kagi.com`. The old domain now redirects (302) to the new domain, but for reliability, always use `news.kagi.com` directly in your feed URLs. Story links within the RSS feed still reference `kite.kagi.com` as permalinks.

---

### RSS Item Schema

Each `<item>` in the feed contains:

```xml
<item>
  <title>Story headline</title>
  <link>https://kite.kagi.com/{uuid}/{category}/{id}</link>
  <description>Full HTML content (see below)</description>
  <guid isPermaLink="true">https://kite.kagi.com/{uuid}/{category}/{id}</guid>
  <category>Primary category (e.g., "World")</category>
  <category>Subcategory (e.g., "World/Conflict & Security")</category>
  <category>Tag (e.g., "Conflict & Security")</category>
  <pubDate>Mon, 20 Oct 2025 01:46:31 +0000</pubDate>
</item>
```
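
As a rough illustration of the schema above, the sketch below reads a single item with the standard library's `xml.etree.ElementTree`. The aggregator itself uses `feedparser`; the swap here only keeps the example dependency-free. The sample values and the `parse_item` helper are hypothetical.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime

# Hypothetical sample item; in real XML the ampersands are escaped as &amp;.
SAMPLE_ITEM = """
<item>
  <title>Story headline</title>
  <link>https://kite.kagi.com/abc/world/1</link>
  <description>&lt;p&gt;Summary&lt;/p&gt;</description>
  <guid isPermaLink="true">https://kite.kagi.com/abc/world/1</guid>
  <category>World</category>
  <category>World/Conflict &amp; Security</category>
  <category>Conflict &amp; Security</category>
  <pubDate>Mon, 20 Oct 2025 01:46:31 +0000</pubDate>
</item>
"""


def parse_item(item_xml: str) -> dict:
    """Extract the fields listed in the schema from one <item> element."""
    item = ET.fromstring(item_xml)
    return {
        "title": item.findtext("title"),
        "link": item.findtext("link"),
        "guid": item.findtext("guid"),
        "description_html": item.findtext("description"),
        # The first <category> is the primary one; the rest are subcategories/tags.
        "categories": [c.text for c in item.findall("category")],
        # RSS 2.0 dates use the RFC 822 format, which email.utils can parse.
        "pub_date": parsedate_to_datetime(item.findtext("pubDate")),
    }


story = parse_item(SAMPLE_ITEM)
print(story["title"], story["categories"], story["pub_date"].isoformat())
```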

**Description HTML Structure:**
```html
<p>Main summary paragraph with inline source citations [source1.com#1][source2.com#1]</p>

<img src='https://kagiproxy.com/img/...' alt='Image caption' />

<h3>Highlights:</h3>
<ul>
  <li>Key point 1 with [source.com#1] citations</li>
  <li>Key point 2...</li>
</ul>

<blockquote>Notable quote - Person Name</blockquote>

<h3>Perspectives:</h3>
<ul>
  <li>Viewpoint holder: Their perspective. (<a href='...'>Source</a>)</li>
</ul>

<h3>Sources:</h3>
<ul>
  <li><a href='https://...'>Article title</a> - domain.com</li>
</ul>
```

**✅ Verified Feed Structure:**
Analysis of live Kagi News feeds confirms the following structure:
- **Only 3 H3 sections:** Highlights, Perspectives, Sources (no other sections like Timeline or Historical Background)
- **Historical context** is woven into the summary paragraph and highlights (not a separate section)
- **Not all stories have all sections** - Quote (blockquote) and image are optional
- **Feed contains everything shown on website** except for Timeline (which is a frontend-only feature)

**Key Features:**
- Multiple source citations inline
- Balanced perspectives from different actors
- Highlights extract key points with historical context
- Direct quotes preserved (when available)
- All sources linked with attribution
- Images from Kagi's proxy CDN

---

## Architecture

### High-Level Flow

```
┌────────────────────────────────────────────────────────────────────────┐
│  Kagi News RSS Feeds (External)                                        │
│  - https://news.kagi.com/world.xml                                     │
│  - https://news.kagi.com/tech.xml                                      │
│  - etc.                                                                │
└────────────────────────────────────────────────────────────────────────┘
                                    │
                                    │ HTTP GET (one daily job, after Kagi's update)
                                    ▼
┌────────────────────────────────────────────────────────────────────────┐
│  Kagi News Aggregator Service (Python + Docker Compose)                │
│  DID: did:plc:[generated-on-creation]                                  │
│  Location: aggregators/kagi-news/                                      │
│                                                                        │
│  Components:                                                           │
│  1. RSS Fetcher: Fetches RSS feeds on schedule (feedparser)            │
│  2. Item Parser: Extracts structured data from HTML (bs4)              │
│  3. Deduplication: Tracks posted items via JSON state file             │
│  4. Feed Mapper: Maps feed URLs to community handles                   │
│  5. Post Formatter: Converts to Coves post format                      │
│  6. Post Publisher: Calls social.coves.community.post.create via XRPC  │
│  7. Blob Uploader: Handles image upload to ATProto (post-MVP)          │
└────────────────────────────────────────────────────────────────────────┘
                                    │
                                    │ Authenticated XRPC calls
                                    ▼
┌────────────────────────────────────────────────────────────────────────┐
│  Coves AppView (social.coves.community.post.create)                    │
│  - Validates aggregator authorization                                  │
│  - Creates post with author = did:plc:[aggregator-did]                 │
│  - Indexes to community feeds                                          │
└────────────────────────────────────────────────────────────────────────┘
```

---

### Aggregator Service Declaration

```json
{
  "$type": "social.coves.aggregator.service",
  "did": "did:plc:[generated-on-creation]",
  "displayName": "Kagi News Aggregator",
  "description": "Automatically posts breaking news from Kagi News RSS feeds. Kagi News aggregates multiple sources per story with balanced perspectives and comprehensive source citations.",
  "aggregatorType": "social.coves.aggregator.types#rss",
  "avatar": "<blob reference to Kagi logo>",
  "configSchema": {
    "type": "object",
    "properties": {
      "feedUrl": {
        "type": "string",
        "format": "uri",
        "description": "Kagi News RSS feed URL (e.g., https://news.kagi.com/world.xml)"
      }
    },
    "required": ["feedUrl"]
  },
  "sourceUrl": "https://github.com/coves-social/kagi-news-aggregator",
  "maintainer": "did:plc:coves-platform",
  "createdAt": "2025-10-23T00:00:00Z"
}
```

**Note:** The MVP implementation uses a simpler configuration model. Feed-to-community mappings are defined in the aggregator's own config file rather than per-community configuration. This allows one aggregator instance to post to multiple communities.

---

## Aggregator Configuration (MVP)

The MVP uses a simplified configuration model where the aggregator service defines feed-to-community mappings in its own config file.

### Configuration File: `config.yaml`

```yaml
# Aggregator credentials (from environment variables)
# AGGREGATOR_DID=did:plc:xyz...
# AGGREGATOR_PRIVATE_KEY=base64-encoded-key...

# Coves API endpoint
coves_api_url: "https://api.coves.social"

# Feed-to-community mappings
feeds:
  - name: "World News"
    url: "https://news.kagi.com/world.xml"
    community_handle: "world-news.coves.social"
    enabled: true

  - name: "Tech News"
    url: "https://news.kagi.com/tech.xml"
    community_handle: "tech.coves.social"
    enabled: true

  - name: "Science News"
    url: "https://news.kagi.com/science.xml"
    community_handle: "science.coves.social"
    enabled: false  # Can be disabled without removing

# Scheduling
check_interval: "24h"  # Run once daily

# Logging
log_level: "info"
```

**Key Decisions:**
- Uses **community handles** (not DIDs) for easier configuration - resolved at runtime
- One aggregator can post to multiple communities
- Feed mappings managed in aggregator config (not per-community config)
- No complex filtering logic in MVP - one feed = one community

---

## Post Format Specification

### Post Record Structure

```json
{
  "$type": "social.coves.community.post.record",
  "author": "did:plc:[aggregator-did]",
  "community": "world-news.coves.social",
  "title": "{Kagi story title}",
  "content": "{formatted content - full format for MVP}",
  "embed": {
    "$type": "social.coves.embed.external",
    "external": {
      "uri": "{Kagi story URL}",
      "title": "{story title}",
      "description": "{summary excerpt - first 200 chars}",
      "thumb": "{Kagi proxy image URL from HTML}"
    }
  },
  "federatedFrom": {
    "platform": "kagi-news-rss",
    "uri": "https://kite.kagi.com/{uuid}/{category}/{id}",
    "id": "{guid}",
    "originalCreatedAt": "{pubDate from RSS}"
  },
  "contentLabels": [
    "{primary category}",
    "{subcategories}"
  ],
  "createdAt": "{current timestamp}"
}
```

**MVP Notes:**
- Uses `social.coves.embed.external` for hot-linked images (no blob upload)
- Community specified as handle (resolved to DID by post creation endpoint)
- Images referenced via original Kagi proxy URLs
- "Full" format only for MVP (no format variations)
- Content uses Coves rich text with facets (not markdown)
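
The record above could be assembled roughly as follows. This is a minimal sketch, not the shipped code: `build_post_record` and the keys of the `story` dict are hypothetical names, and leaving out `thumb` when a story has no image is an assumption (the source only says the image is optional).

```python
from datetime import datetime, timezone


def build_post_record(story: dict, community_handle: str, aggregator_did: str) -> dict:
    """Assemble a post record shaped like the JSON above from a parsed story."""
    external = {
        "uri": story["link"],
        "title": story["title"],
        "description": story["summary"][:200],  # excerpt: first 200 characters
    }
    if story.get("image_url"):
        # Hot-linked Kagi proxy image; left out when the story has no image.
        external["thumb"] = story["image_url"]
    return {
        "$type": "social.coves.community.post.record",
        "author": aggregator_did,
        "community": community_handle,  # handle; resolved to a DID server-side
        "title": story["title"],
        "content": story["content"],
        "embed": {"$type": "social.coves.embed.external", "external": external},
        "federatedFrom": {
            "platform": "kagi-news-rss",
            "uri": story["link"],
            "id": story["guid"],
            "originalCreatedAt": story["pub_date"],
        },
        "contentLabels": list(story["categories"]),
        "createdAt": datetime.now(timezone.utc).isoformat(),
    }
```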

---

### Content Formatting (MVP: "Full" Format Only)

The MVP implements a single "full" format using Coves rich text with facets:

**Plain Text Structure:**
```
{Main summary paragraph with source citations}

Highlights:
• {Bullet point 1}
• {Bullet point 2}
• ...

Perspectives:
• {Actor}: {Their perspective} (Source)
• ...

"{Notable quote}" — {Attribution}

Sources:
• {Title} - {domain}
• ...

---
📰 Story aggregated by Kagi News
```

**Rich Text Facets Applied:**
- **Bold** (`social.coves.richtext.facet#bold`) on section headers: "Highlights:", "Perspectives:", "Sources:"
- **Bold** on perspective actors
- **Italic** (`social.coves.richtext.facet#italic`) on quotes
- **Link** (`social.coves.richtext.facet#link`) on all URLs (source links, Kagi story link, perspective sources)
- Byte ranges calculated using UTF-8 byte positions (`byteStart` inclusive, `byteEnd` exclusive)

**Example with Facets:**

In this example the link facet covers the citation marker `[source.com#1]` (bytes 13-27) and the bold facet covers the `Highlights:` header (bytes 29-40).

```json
{
  "content": "Main summary [source.com#1]\n\nHighlights:\n• Key point 1...",
  "facets": [
    {
      "index": {"byteStart": 29, "byteEnd": 40},
      "features": [{"$type": "social.coves.richtext.facet#bold"}]
    },
    {
      "index": {"byteStart": 13, "byteEnd": 27},
      "features": [{"$type": "social.coves.richtext.facet#link", "uri": "https://source.com"}]
    }
  ]
}
```

**Rationale:**
- Uses native Coves rich text format (not markdown)
- Preserves Kagi's rich multi-source analysis
- Provides maximum value to communities
- Meets CC BY-NC attribution requirements
- Additional formats ("summary", "minimal") can be added post-MVP

---

## Implementation Details (Python MVP)

### Technology Stack

**Language:** Python 3.11+

**Key Libraries:**
- `feedparser` - RSS/Atom parsing
- `beautifulsoup4` - HTML parsing for RSS item descriptions
- `requests` - HTTP client for fetching feeds
- `atproto` - ATProto Python SDK for authentication (the MVP Coves client uses direct HTTP calls instead; see Component 5)
- `pyyaml` - Configuration file parsing
- `pytest` - Testing framework

### Project Structure

```
aggregators/kagi-news/
├── Dockerfile
├── docker-compose.yml
├── requirements.txt
├── config.example.yaml
├── crontab                    # CRON schedule configuration
├── .env.example               # Environment variables template
├── scripts/
│   └── generate_did.py        # Helper to generate aggregator DID
├── src/
│   ├── main.py                # Entry point (single run, called by CRON)
│   ├── config.py              # Configuration loading and validation
│   ├── rss_fetcher.py         # RSS feed fetching with retry logic
│   ├── html_parser.py         # Parse Kagi HTML to structured data
│   ├── richtext_formatter.py  # Format content with rich text facets
│   ├── coves_client.py        # Coves API authentication and post creation
│   ├── state_manager.py       # Deduplication state tracking (JSON)
│   └── models.py              # Data models (KagiStory, etc.)
├── tests/
│   ├── test_parser.py
│   ├── test_richtext_formatter.py
│   ├── test_state_manager.py
│   └── fixtures/              # Sample RSS feeds for testing
└── README.md
```

---

### Component 1: RSS Fetcher (`rss_fetcher.py`) ✅ COMPLETE

**Responsibility:** Fetch RSS feeds with retry logic and error handling

**Key Functions:**
- `fetch_feed(url: str) -> feedparser.FeedParserDict`
  - Uses `requests` with timeout (30s)
  - Retry logic: 3 attempts with exponential backoff
  - Returns parsed RSS feed or raises exception
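
The retry behaviour described above can be sketched independently of the HTTP layer. The helper name `retry_with_backoff`, the 1-second base delay, and the injectable `sleep` parameter are assumptions made for illustration; the source only specifies three attempts with exponential backoff.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    operation: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call operation(); after each failure wait base_delay * 2**n seconds and retry."""
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error to the caller
            sleep(base_delay * (2 ** attempt))
    raise ValueError("attempts must be at least 1")


# Usage: a fake fetch that times out twice, then succeeds on the third attempt.
calls = []
delays = []


def flaky_fetch() -> str:
    calls.append(1)
    if len(calls) < 3:
        raise TimeoutError("simulated timeout")
    return "feed"


result = retry_with_backoff(flaky_fetch, sleep=delays.append)
```

Passing `sleep` as a parameter keeps tests fast, since a test can record the delays instead of actually waiting.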

**Error Handling:**
- Network timeouts
- Invalid XML
- HTTP errors (404, 500, etc.)

**Implementation Status:**
- ✅ Implemented with comprehensive error handling
- ✅ Tests passing (5 tests)
- ✅ Handles retries with exponential backoff

---

### Component 2: HTML Parser (`html_parser.py`) ✅ COMPLETE

**Responsibility:** Extract structured data from Kagi's HTML description field

**Key Class:** `KagiHTMLParser`

**Data Model (`models.py`):**
```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class Perspective:
    actor: str
    description: str
    source_url: str


@dataclass
class Quote:
    text: str
    attribution: str


@dataclass
class Source:
    title: str
    url: str
    domain: str


# Defined last so the types it references already exist at class creation.
@dataclass
class KagiStory:
    title: str
    link: str
    guid: str
    pub_date: datetime
    categories: List[str]

    # Parsed from HTML
    summary: str
    highlights: List[str]
    perspectives: List[Perspective]
    quote: Optional[Quote]
    sources: List[Source]
    image_url: Optional[str]
    image_alt: Optional[str]
```

**Parsing Strategy:**
- Use BeautifulSoup to parse HTML description
- Extract sections by finding `<h3>` tags (Highlights, Perspectives, Sources)
- Handle missing sections gracefully (not all stories have all sections)
- Clean and normalize text

**Implementation Status:**
- ✅ Extracts all 3 H3 sections (Highlights, Perspectives, Sources)
- ✅ Handles optional elements (quote, image)
- ✅ Tests passing (8 tests)
- ✅ Validates against real feed data

---

### Component 3: State Manager (`state_manager.py`) ✅ COMPLETE

**Responsibility:** Track processed stories to prevent duplicates

**Implementation:** Simple JSON file persistence

**State File Format:**
```json
{
  "feeds": {
    "https://news.kagi.com/world.xml": {
      "last_successful_run": "2025-10-23T12:00:00Z",
      "posted_guids": [
        "https://kite.kagi.com/uuid1/world/123",
        "https://kite.kagi.com/uuid2/world/124"
      ]
    }
  }
}
```

**Key Functions:**
- `is_posted(feed_url: str, guid: str) -> bool`
- `mark_posted(feed_url: str, guid: str, post_uri: str)`
- `get_last_run(feed_url: str) -> Optional[datetime]`
- `update_last_run(feed_url: str, timestamp: datetime)`

**Deduplication Strategy:**
- Keep last 100 GUIDs per feed (rolling window)
- Stories older than 30 days are automatically removed
- Simple, no database needed

**Implementation Status:**
- ✅ JSON-based persistence with atomic writes
- ✅ GUID tracking with rolling window
- ✅ Tests passing (12 tests)
- ✅ Thread-safe operations

---

### Component 4: Rich Text Formatter (`richtext_formatter.py`) ✅ COMPLETE

**Responsibility:** Format parsed Kagi stories into Coves rich text with facets

**Key Function:**
- `format_full(story: KagiStory) -> dict`
  - Returns: `{"content": str, "facets": List[dict]}`
  - Builds plain text content with all sections
  - Calculates UTF-8 byte positions for facets
  - Applies bold, italic, and link facets
  - Includes all sections: summary, highlights, perspectives, quote, sources
  - Adds Kagi News attribution footer with link

**Facet Types Applied:**
- `social.coves.richtext.facet#bold` - Section headers, perspective actors
- `social.coves.richtext.facet#italic` - Quotes
- `social.coves.richtext.facet#link` - All URLs (sources, Kagi story link)

**Key Challenge:** UTF-8 byte position calculation
- Must handle multi-byte characters correctly (emoji, non-ASCII)
- Use `str.encode('utf-8')` to get byte positions
- Test with complex characters

**Implementation Status:**
- ✅ Full rich text formatting with facets
- ✅ UTF-8 byte position calculation working correctly
- ✅ Tests passing (10 tests)
- ✅ Handles all sections: summary, highlights, perspectives, quote, sources

---

### Component 5: Coves Client (`coves_client.py`) ✅ COMPLETE

**Responsibility:** Handle authentication and post creation via Coves API

**Implementation Note:** Uses direct HTTP client instead of ATProto SDK for simplicity in MVP.

**Key Functions:**
- `authenticate() -> dict`
  - Authenticates aggregator using credentials
  - Returns auth token for subsequent API calls

- `create_post(community_handle: str, title: str, content: str, facets: List[dict], ...) -> dict`
  - Calls Coves post creation endpoint
  - Includes aggregator authentication
  - Returns post URI and metadata

**Authentication Flow:**
- Load aggregator credentials from environment
- Authenticate with Coves API
- Store and use auth token for requests
- Handle token refresh if needed

**Implementation Status:**
- ✅ HTTP-based client implementation
- ✅ Authentication and token management
- ✅ Post creation with all required fields
- ✅ Error handling and retries

---

### Component 6: Config Manager (`config.py`) ✅ COMPLETE

**Responsibility:** Load and validate configuration from YAML and environment

**Key Functions:**
- `load_config(config_path: str) -> AggregatorConfig`
  - Loads YAML configuration
  - Validates structure and required fields
  - Merges with environment variables
  - Returns validated config object
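
The validation step might look roughly like the sketch below. It operates on an already-parsed mapping (the YAML loading itself is omitted), and `FeedConfig` and `validate_config` are hypothetical names rather than the real `AggregatorConfig` API.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FeedConfig:
    name: str
    url: str
    community_handle: str
    enabled: bool = True


def validate_config(raw: dict) -> List[FeedConfig]:
    """Check an already-parsed config mapping and return its feed entries."""
    if not isinstance(raw.get("coves_api_url"), str):
        raise ValueError("coves_api_url is required and must be a string")
    feeds = raw.get("feeds")
    if not isinstance(feeds, list) or not feeds:
        raise ValueError("feeds must be a non-empty list")
    result = []
    for index, entry in enumerate(feeds):
        for key in ("name", "url", "community_handle"):
            if not isinstance(entry.get(key), str) or not entry[key]:
                # Name the exact field so config mistakes are easy to locate.
                raise ValueError(f"feeds[{index}].{key} is required")
        result.append(
            FeedConfig(
                name=entry["name"],
                url=entry["url"],
                community_handle=entry["community_handle"],
                enabled=bool(entry.get("enabled", True)),
            )
        )
    return result
```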

**Implementation Status:**
- ✅ YAML parsing with validation
- ✅ Environment variable support
- ✅ Tests passing (3 tests)
- ✅ Clear error messages for config issues

---

### Main Orchestration (`main.py`) ✅ COMPLETE

**Responsibility:** Coordinate all components in a single execution (called by CRON)

**Flow (Single Run):**
1. Load configuration from `config.yaml`
2. Load environment variables (AGGREGATOR_DID, AGGREGATOR_PRIVATE_KEY)
3. Initialize all components (fetcher, parser, formatter, client, state)
4. For each enabled feed in config:
   a. Fetch RSS feed
   b. Parse all items
   c. Filter out already-posted items (check state)
   d. For each new item:
      - Parse HTML to structured KagiStory
      - Format post content with rich text facets
      - Build post record (with hot-linked image if present)
      - Create post via XRPC
      - Mark as posted in state
   e. Update last run timestamp
5. Save state to disk
6. Log summary (posts created, errors encountered)
7. Exit (CRON will call again on schedule)

**Error Isolation:**
- Feed-level: One feed failing doesn't stop others
- Item-level: One item failing doesn't stop feed processing
- Continue on non-fatal errors, log all failures
- Exit code 0 even with partial failures (CRON won't alert)
- Exit code 1 only on catastrophic failure (config missing, auth failure)
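
The two isolation levels can be sketched as nested `try` blocks. `run_once` and its callable parameters are hypothetical; the real orchestrator wires in the fetcher, parser, formatter, client, and state manager described above.

```python
from typing import Callable, Iterable


def run_once(
    feed_urls: Iterable[str],
    fetch_items: Callable[[str], list],
    post_item: Callable[[str, dict], None],
) -> dict:
    """Process every feed, isolating failures at feed level and item level."""
    summary = {"posted": 0, "item_errors": 0, "feed_errors": 0}
    for url in feed_urls:
        try:
            items = fetch_items(url)
        except Exception:
            summary["feed_errors"] += 1  # feed-level: skip this feed, keep going
            continue
        for item in items:
            try:
                post_item(url, item)
                summary["posted"] += 1
            except Exception:
                summary["item_errors"] += 1  # item-level: skip this item only
    return summary
```

Under the exit-code policy above, a caller would exit 0 whatever this summary contains and reserve exit code 1 for failures that happen before the loop starts, such as a missing config or failed authentication.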

**Implementation Status:**
- ✅ Complete orchestration logic implemented
- ✅ Feed-level and item-level error isolation
- ✅ Structured logging throughout
- ✅ Tests passing (9 tests covering various scenarios)
- ✅ Dry-run mode for testing

---

## Deployment (Docker Compose with CRON)

### Dockerfile

```dockerfile
FROM python:3.11-slim

WORKDIR /app

# Install cron
RUN apt-get update && apt-get install -y cron && rm -rf /var/lib/apt/lists/*

# Install dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy source code and scripts
COPY src/ ./src/
COPY scripts/ ./scripts/
COPY crontab /tmp/kagi-news-cron

# Create non-root user that the scheduled job runs as
RUN useradd --create-home appuser && \
    mkdir -p /app/data && \
    touch /var/log/cron.log && \
    chown -R appuser:appuser /app /var/log/cron.log

# Install the schedule as appuser's crontab so the job itself runs unprivileged
RUN crontab -u appuser /tmp/kagi-news-cron && rm /tmp/kagi-news-cron

# The cron daemon must start as root (it writes its PID file and switches users),
# so there is no USER directive; only the job runs as appuser.
# Note: cron does not pass the container's environment variables to jobs,
# so AGGREGATOR_DID and AGGREGATOR_PRIVATE_KEY must be made available to the job.
CMD ["cron", "-f"]
```

### Crontab Configuration (`crontab`)

```bash
# Run Kagi News aggregator daily at 1 PM UTC (after Kagi updates around noon)
0 13 * * * cd /app && /usr/local/bin/python -m src.main >> /var/log/cron.log 2>&1

# Blank line required at end of crontab
```

---

### docker-compose.yml

```yaml
version: '3.8'

services:
  kagi-news-aggregator:
    build: .
    container_name: kagi-news-aggregator
    restart: unless-stopped

    environment:
      # Aggregator identity (from aggregator creation)
      - AGGREGATOR_DID=${AGGREGATOR_DID}
      - AGGREGATOR_PRIVATE_KEY=${AGGREGATOR_PRIVATE_KEY}

    volumes:
      # Config file (read-only)
      - ./config.yaml:/app/config.yaml:ro
      # State file (read-write for deduplication)
      - ./data/state.json:/app/data/state.json

    logging:
      driver: "json-file"
      options:
        max-size: "10m"
        max-file: "3"
```

**Environment Variables:**
- `AGGREGATOR_DID`: PLC DID created for this aggregator instance
- `AGGREGATOR_PRIVATE_KEY`: Base64-encoded private key for signing

**Volumes:**
- `config.yaml`: Feed-to-community mappings (user-editable)
- `data/state.json`: Deduplication state (managed by aggregator)

**Deployment:**
```bash
# On same host as Coves
cd aggregators/kagi-news
cp config.example.yaml config.yaml
# Edit config.yaml with your feed mappings

# Set environment variables
export AGGREGATOR_DID="did:plc:xyz..."
export AGGREGATOR_PRIVATE_KEY="base64-key..."

# Start aggregator
docker-compose up -d

# View logs
docker-compose logs -f
```

---

## Image Handling Strategy (MVP)

### Approach: Hot-Linked Images via External Embed

The MVP uses hot-linked images from Kagi's proxy:

**Flow:**
1. Extract image URL from HTML description (`https://kagiproxy.com/img/...`)
2. Include in post using `social.coves.embed.external`:
   ```json
   {
     "$type": "social.coves.embed.external",
     "external": {
       "uri": "{Kagi story URL}",
       "title": "{Story title}",
       "description": "{Summary excerpt}",
       "thumb": "{Kagi proxy image URL}"
     }
   }
   ```
3. Frontend renders image from Kagi proxy URL

**Rationale:**
- Simpler MVP implementation (no blob upload complexity)
- No storage requirements on our end
- Kagi proxy is reliable and CDN-backed
- Faster posting (no download/upload step)
- Images already properly sized and optimized

**Future Consideration:** If Kagi proxy becomes unreliable, migrate to blob storage in Phase 2.

---

## Rate Limiting & Performance (MVP)

### Simplified Rate Strategy

**RSS Fetching:**
- Poll each feed once per day (1 PM UTC, after Kagi's ~noon UTC update)
- No aggressive polling needed (Kagi updates daily)
- ~3-5 feeds = minimal load

**Post Creation:**
- One run per day = 5-15 posts per feed
- Total: ~15-75 posts/day across all communities (3-5 feeds × 5-15 posts)
- Well within any reasonable rate limits

**Performance:**
- RSS fetch + parse: < 5 seconds per feed
- Image handling: no download/upload step in the MVP (images are hot-linked from Kagi's proxy)
- Post creation: < 1 second per post
- Total runtime per day: < 5 minutes

No complex rate limiting needed for MVP.

---

## Logging & Observability (MVP)

### Structured Logging

**Python logging module** with JSON formatter:

```python
import logging
import json

logging.basicConfig(
    level=logging.INFO,
    format='%(message)s'
)

logger = logging.getLogger(__name__)

# Example structured log
logger.info(json.dumps({
    "event": "post_created",
    "feed": "world.xml",
    "story_title": "Breaking News...",
    "community": "world-news.coves.social",
    "post_uri": "at://...",
    "timestamp": "2025-10-23T12:00:00Z"
}))
```

**Key Events to Log:**
- `feed_fetched`: RSS feed successfully fetched
- `story_parsed`: Story successfully parsed from HTML
- `post_created`: Post successfully created
- `error`: Any failures (with context)
- `run_completed`: Summary of entire run

**Log Levels:**
- INFO: Successful operations
- WARNING: Retryable errors, skipped items
- ERROR: Fatal errors, failed posts
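
An alternative to calling `json.dumps` at every log site is a small `logging.Formatter` subclass, sketched below. `JsonFormatter` and the `fields` key passed through `extra` are hypothetical conventions, not something the current implementation is known to use.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "event": record.getMessage(),
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%S%z"),
        }
        # Structured context is passed via logger.info(..., extra={"fields": {...}}).
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("kagi_news")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("post_created", extra={"fields": {"feed": "world.xml"}})
```

This keeps the event name as the log message and lets the level (INFO, WARNING, ERROR) travel with the record instead of being repeated in the payload by hand.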

### Simple Monitoring

**Health Check:** Check last successful run timestamp
- If > 48 hours: alert (should run daily)
- If errors > 50% of items: investigate
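
Both checks are simple enough to script against the state file and run summary. The function names below are hypothetical, and the thresholds are the 48-hour and 50% values stated above.

```python
from datetime import datetime, timedelta, timezone


def is_stale(last_successful_run: str, now: datetime, max_age_hours: int = 48) -> bool:
    """True when the last successful run is older than max_age_hours."""
    # State timestamps look like "2025-10-23T12:00:00Z"; replacing the trailing
    # "Z" keeps fromisoformat() working on Python versions before 3.11 as well.
    last_run = datetime.fromisoformat(last_successful_run.replace("Z", "+00:00"))
    return now - last_run > timedelta(hours=max_age_hours)


def error_rate_too_high(failed: int, total: int, threshold: float = 0.5) -> bool:
    """True when more than `threshold` of the items in a run failed."""
    return total > 0 and failed / total > threshold


now = datetime(2025, 10, 25, 13, 0, tzinfo=timezone.utc)
print(is_stale("2025-10-23T12:00:00Z", now))  # 49 hours old -> True
```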

**Metrics to Track (manually via logs):**
- Posts created per run
- Parse failures per run
- Post creation failures per run
- Total runtime

No complex metrics infrastructure needed for MVP - Docker logs are sufficient.

---

## Testing Strategy ✅ COMPLETE

### Unit Tests - 57 Tests Passing (83% Coverage)

**Test Coverage by Component:**
- ✅ **RSS Fetcher** (5 tests)
  - Successful feed fetch
  - Timeout handling
  - Retry logic with exponential backoff
  - Invalid XML handling
  - Empty URL validation

- ✅ **HTML Parser** (8 tests)
  - Summary extraction
  - Image URL and alt text extraction
  - Highlights list parsing
  - Quote extraction with attribution
  - Perspectives parsing with actors and sources
  - Sources list extraction
  - Missing sections handling
  - Full story object creation

- ✅ **Rich Text Formatter** (10 tests)
  - Full format generation
  - Bold facets on headers and actors
  - Italic facets on quotes
  - Link facets on URLs
  - UTF-8 byte position calculation
  - Multi-byte character handling (emoji, special chars)
  - All sections formatted correctly

- ✅ **State Manager** (12 tests)
  - GUID tracking
  - Duplicate detection
  - Rolling window (100 GUID limit)
  - Age-based cleanup (30 days)
  - Last run timestamp tracking
  - JSON persistence
  - Atomic file writes
  - Concurrent access safety

- ✅ **Config Manager** (3 tests)
  - YAML loading and validation
  - Environment variable merging
  - Error handling for missing/invalid config

- ✅ **Main Orchestrator** (9 tests)
  - End-to-end flow
  - Feed-level error isolation
  - Item-level error isolation
  - Dry-run mode
  - State persistence across runs
  - Multiple feed handling

- ✅ **E2E Tests** (6 skipped - require live API)
  - Integration with Coves API (manual testing required)
  - Authentication flow
  - Post creation

**Test Results:**
```
57 passed, 6 skipped, 1 warning in 8.76s
Coverage: 83%
```

**Test Fixtures:**
- Real Kagi News RSS item with all sections
- Sample HTML descriptions
- Mock HTTP responses

### Integration Tests

**Manual Integration Testing Required:**
- [ ] Can authenticate with live Coves API
- [ ] Can create post via Coves API
- [ ] Can fetch real Kagi RSS feed
- [ ] Images display correctly from Kagi proxy
- [ ] State persistence works in production
- [ ] CRON scheduling works correctly

**Pre-deployment Checklist:**
- [x] All unit tests passing
- [x] Can parse real Kagi HTML
- [x] State persistence works
- [x] Config validation works
- [x] Error handling comprehensive
- [ ] Aggregator DID created
- [ ] Can authenticate with Coves API
- [ ] Docker container builds and runs

---

## Success Metrics

### ✅ Phase 1: Implementation - COMPLETE

- [x] All core components implemented
- [x] 57 tests passing with 83% coverage
- [x] RSS fetching and parsing working
- [x] Rich text formatting with facets
- [x] State management and deduplication
- [x] Configuration management
- [x] Comprehensive error handling
- [x] Documentation complete

### 🔄 Phase 2: Integration Testing - IN PROGRESS

- [ ] Aggregator DID created (PLC)
- [ ] Aggregator authorized in 1+ test communities
- [ ] Can authenticate with Coves API
- [ ] First post created end-to-end
- [ ] Attribution visible ("Story aggregated by Kagi News" footer)
- [ ] No duplicate posts on repeated runs
- [ ] Images display correctly

### 📋 Phase 3: Alpha Deployment (First Week)

- [ ] Docker Compose runs successfully in production
- [ ] 2-3 communities receiving posts
- [ ] 20+ posts created successfully
- [ ] Zero duplicates
- [ ] < 10% errors (parse or post creation)
- [ ] CRON scheduling reliable

### 🎯 Phase 4: Beta (First Month)

- [ ] 5+ communities using aggregator
- [ ] 200+ posts created
- [ ] Positive community feedback
- [ ] No rate limit issues
- [ ] < 5% error rate
- [ ] Performance metrics tracked

---

## What's Next: Integration & Deployment

### Immediate Next Steps

1. **Create Aggregator Identity**
   - Generate DID for aggregator
   - Store credentials securely
   - Test authentication with Coves API

2. **Integration Testing**
   - Test with live Coves API
   - Verify post creation works
   - Validate rich text rendering
   - Check image display from Kagi proxy

3. **Docker Deployment**
   - Build Docker image
   - Test docker-compose setup
   - Verify CRON scheduling
   - Set up monitoring/logging

4. **Community Authorization**
   - Get aggregator authorized in test community
   - Verify authorization flow works
   - Test posting to real community

5. **Production Deployment**
   - Deploy to production server
   - Configure feeds for real communities
   - Monitor first batch of posts
   - Gather community feedback

### Open Questions to Resolve

1. **Aggregator DID Creation:**
   - Need helper script or manual process?
   - Where to store credentials securely?

2. **Authorization Flow:**
   - How does community admin authorize aggregator?
   - UI flow or XRPC endpoint?

3. **Image Strategy:**
   - Confirm Kagi proxy images work reliably
   - Fallback plan if proxy becomes unreliable?

4. **Monitoring:**
   - What metrics to track initially?
   - Alerting strategy for failures?

---

## Future Enhancements (Post-MVP)

### Phase 2
- Multiple post formats (summary, minimal)
- Per-community filtering (subcategories, min sources)
- More sophisticated deduplication
- Metrics dashboard

### Phase 3
- Interactive features (bot responds to comments)
- Cross-posting prevention
- Federation support

---

## References

- Kagi News About: https://news.kagi.com/about
- Kagi News RSS: https://news.kagi.com/world.xml
- CC BY-NC License: https://creativecommons.org/licenses/by-nc/4.0/
- Parent PRD: [PRD_AGGREGATORS.md](PRD_AGGREGATORS.md)
- ATProto Python SDK: https://github.com/MarshalX/atproto
- Implementation: [aggregators/kagi-news/](/aggregators/kagi-news/)

---

## Implementation Summary

**Phase 1 Status:** ✅ **COMPLETE**

The Kagi News RSS Aggregator implementation is complete and ready for integration testing and deployment. All 7 core components have been implemented with comprehensive test coverage (57 tests, 83% coverage).

**What Was Built:**
- Complete RSS feed fetching and parsing pipeline
- HTML parser that extracts all structured data from Kagi News feeds (summary, highlights, perspectives, quote, sources)
- Rich text formatter with proper facets for Coves
- State management system for deduplication
- Configuration management with YAML and environment variables
- HTTP client for Coves API authentication and post creation
- Main orchestrator with robust error handling
- Comprehensive test suite with real feed fixtures
- Documentation and example configurations

**Key Findings:**
- Kagi News RSS feeds contain only 3 structured sections (Highlights, Perspectives, Sources)
- Historical context is woven into the summary and highlights, not a separate section
- Timeline feature visible on Kagi website is not in the RSS feed
- All essential data for rich posts is available in the feed
- Feed structure is stable and well-formed

**Next Phase:**
Integration testing with live Coves API, followed by alpha deployment to test communities.

---

**End of PRD - Phase 1 Implementation Complete**