# Kagi News RSS Aggregator PRD

**Status:** ✅ Phase 1 Complete - Ready for Deployment
**Owner:** Platform Team
**Last Updated:** 2025-10-24
**Parent PRD:** [PRD_AGGREGATORS.md](PRD_AGGREGATORS.md)
**Implementation:** Python + Docker Compose

## 🎉 Implementation Complete

All core components have been implemented and tested:

- **RSS Fetcher** - Fetches feeds with retry logic and error handling
- **HTML Parser** - Extracts all structured data (summary, highlights, perspectives, quote, sources)
- **Rich Text Formatter** - Formats content with proper facets for Coves
- **State Manager** - Tracks posted stories to prevent duplicates
- **Config Manager** - Loads and validates YAML configuration
- **Coves Client** - Handles authentication and post creation
- **Main Orchestrator** - Coordinates all components
- **Comprehensive Tests** - 57 tests with 83% code coverage
- **Documentation** - README with setup and deployment instructions
- **Example Configs** - config.example.yaml and .env.example

**Test Results:**
```
57 passed, 6 skipped, 1 warning in 8.76s
Coverage: 83%
```

**Ready for:**
- Integration testing with live Coves API
- Aggregator DID creation and authorization
- Production deployment

## Overview

The Kagi News RSS Aggregator is a reference implementation of the Coves aggregator system that automatically posts high-quality, multi-source news summaries to communities. It leverages Kagi News's free RSS feeds to provide pre-aggregated, deduped news content with multiple perspectives and source citations.

**Key Value Propositions:**
- **Multi-source aggregation**: Kagi News already aggregates multiple sources per story
- **Balanced perspectives**: Built-in perspective tracking from different outlets
- **Rich metadata**: Categories, highlights, source links included
- **Legal & free**: CC BY-NC licensed for non-commercial use
- **Low complexity**: No LLM deduplication needed (Kagi does it)
- **Simple deployment**: Python + Docker Compose, runs alongside Coves on same instance

## Data Source: Kagi News RSS Feeds

### Licensing & Legal

**License:** CC BY-NC (Creative Commons Attribution-NonCommercial)

**Terms:**
- **Free for non-commercial use** (Coves qualifies)
- **Attribution required** (must credit Kagi News)
- **Cannot use commercially** (must contact support@kagi.com for commercial license)
- **Data can be shared** (with same attribution + NC requirements)

**Source:** https://news.kagi.com/about

**Quote from Kagi:**
> Note that kite.json and files referenced by it are licensed under CC BY-NC license. This means that this data can be used free of charge (with attribution and for non-commercial use). If you would like to license this data for commercial use let us know through support@kagi.com.

**Compliance Requirements:**
- Visible attribution to Kagi News on every post
- Link back to original Kagi story page
- Non-commercial operation (met: Coves is non-commercial)

---

### RSS Feed Structure

**Base URL Pattern:** `https://news.kagi.com/{category}.xml`

**Known Categories:**
- `world.xml` - World news
- `tech.xml` - Technology
- `business.xml` - Business
- `sports.xml` - Sports (likely)
- Additional categories TBD (need to scrape homepage)

**Feed Format:** RSS 2.0 (standard XML)

**Update Frequency:** One daily update (~noon UTC)

**Important Note on Domain Migration (October 2025):**
Kagi migrated their RSS feeds from `kite.kagi.com` to `news.kagi.com`.
The old domain now redirects (302) to the new domain, but for reliability, always use `news.kagi.com` directly in your feed URLs. Story links within the RSS feed still reference `kite.kagi.com` as permalinks.

---

### RSS Item Schema

Each `<item>` in the feed contains:

```xml
<item>
  <title>Story headline</title>
  <link>https://kite.kagi.com/{uuid}/{category}/{id}</link>
  <description>Full HTML content (see below)</description>
  <guid isPermaLink="true">https://kite.kagi.com/{uuid}/{category}/{id}</guid>
  <category>Primary category (e.g., "World")</category>
  <category>Subcategory (e.g., "World/Conflict & Security")</category>
  <category>Tag (e.g., "Conflict & Security")</category>
  <pubDate>Mon, 20 Oct 2025 01:46:31 +0000</pubDate>
</item>
```

**Description HTML Structure:**
```html
<p>Main summary paragraph with inline source citations [source1.com#1][source2.com#1]</p>

<img src='https://kagiproxy.com/img/...' alt='Image caption' />

<h3>Highlights:</h3>
<ul>
  <li>Key point 1 with [source.com#1] citations</li>
  <li>Key point 2...</li>
</ul>

<blockquote>Notable quote - Person Name</blockquote>

<h3>Perspectives:</h3>
<ul>
  <li>Viewpoint holder: Their perspective. (<a href='...'>Source</a>)</li>
</ul>

<h3>Sources:</h3>
<ul>
  <li><a href='https://...'>Article title</a> - domain.com</li>
</ul>
```

**✅ Verified Feed Structure:**
Analysis of live Kagi News feeds confirms the following structure:
- **Only 3 H3 sections:** Highlights, Perspectives, Sources (no other sections like Timeline or Historical Background)
- **Historical context** is woven into the summary paragraph and highlights (not a separate section)
- **Not all stories have all sections** - Quote (blockquote) and image are optional
- **Feed contains everything shown on website** except for Timeline (which is a frontend-only feature)

**Key Features:**
- Multiple source citations inline
- Balanced perspectives from different actors
- Highlights extract key points with historical context
- Direct quotes preserved (when available)
- All sources linked with attribution
- Images from Kagi's proxy CDN

---

## Architecture

### High-Level Flow

```
┌─────────────────────────────────────────────────────────────┐
│  Kagi News RSS Feeds (External)                             │
│  - https://news.kagi.com/world.xml                          │
│  - https://news.kagi.com/tech.xml                           │
│  - etc.                                                     │
└─────────────────────────────────────────────────────────────┘

                              │ HTTP GET (one scheduled job per day, after the feed updates)

┌─────────────────────────────────────────────────────────────┐
│  Kagi News Aggregator Service (Python + Docker Compose)     │
│  DID: did:plc:[generated-on-creation]                       │
│  Location: aggregators/kagi-news/                           │
│                                                             │
│  Components:                                                │
│  1. RSS Fetcher: Fetches RSS feeds on schedule (feedparser) │
│  2. Item Parser: Extracts structured data from HTML (bs4)   │
│  3. Deduplication: Tracks posted items via JSON state file  │
│  4. Feed Mapper: Maps feed URLs to community handles        │
│  5. Post Formatter: Converts to Coves post format           │
│  6. Post Publisher: Calls social.coves.community.post.create via XRPC │
│  7. Blob Uploader: Handles image upload to ATProto          │
└─────────────────────────────────────────────────────────────┘

                              │ Authenticated XRPC calls

┌─────────────────────────────────────────────────────────────┐
│  Coves AppView (social.coves.community.post.create)         │
│  - Validates aggregator authorization                       │
│  - Creates post with author = did:plc:[aggregator-did]      │
│  - Indexes to community feeds                               │
└─────────────────────────────────────────────────────────────┘
```

---

### Aggregator Service Declaration

```json
{
  "$type": "social.coves.aggregator.service",
  "did": "did:plc:[generated-on-creation]",
  "displayName": "Kagi News Aggregator",
  "description": "Automatically posts breaking news from Kagi News RSS feeds. Kagi News aggregates multiple sources per story with balanced perspectives and comprehensive source citations.",
  "aggregatorType": "social.coves.aggregator.types#rss",
  "avatar": "<blob reference to Kagi logo>",
  "configSchema": {
    "type": "object",
    "properties": {
      "feedUrl": {
        "type": "string",
        "format": "uri",
        "description": "Kagi News RSS feed URL (e.g., https://news.kagi.com/world.xml)"
      }
    },
    "required": ["feedUrl"]
  },
  "sourceUrl": "https://github.com/coves-social/kagi-news-aggregator",
  "maintainer": "did:plc:coves-platform",
  "createdAt": "2025-10-23T00:00:00Z"
}
```

**Note:** The MVP implementation uses a simpler configuration model. Feed-to-community mappings are defined in the aggregator's own config file rather than per-community configuration. This allows one aggregator instance to post to multiple communities.

---

## Aggregator Configuration (MVP)

The MVP uses a simplified configuration model where the aggregator service defines feed-to-community mappings in its own config file.
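
As a rough illustration of this mapping model, the sketch below turns an already-loaded config dictionary into typed feed mappings and selects the enabled ones. The field names (`name`, `url`, `community_handle`, `enabled`) follow the `config.yaml` format in this PRD; the `FeedMapping` class and both function names are hypothetical and are not taken from the actual `config.py`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FeedMapping:
    """One feed-to-community mapping (hypothetical model, not the real config.py)."""
    name: str
    url: str
    community_handle: str
    enabled: bool = True


def parse_feed_mappings(raw_config: dict) -> list[FeedMapping]:
    """Validate the `feeds` list of an already-loaded config and return typed mappings."""
    entries = raw_config.get("feeds")
    if not isinstance(entries, list) or not entries:
        raise ValueError("config must contain a non-empty 'feeds' list")

    mappings = []
    for i, entry in enumerate(entries):
        # Assumption: name, url and community_handle are all required per feed.
        missing = [k for k in ("name", "url", "community_handle") if not entry.get(k)]
        if missing:
            raise ValueError(f"feeds[{i}] is missing required keys: {missing}")
        mappings.append(FeedMapping(
            name=entry["name"],
            url=entry["url"],
            community_handle=entry["community_handle"],
            enabled=bool(entry.get("enabled", True)),
        ))
    return mappings


def enabled_feeds(mappings: list[FeedMapping]) -> list[FeedMapping]:
    """Keep only the feeds that should be processed on this run."""
    return [m for m in mappings if m.enabled]
```

Because each mapping carries its own community handle, a single aggregator instance can serve several communities by iterating over `enabled_feeds(...)`.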

### Configuration File: `config.yaml`

```yaml
# Aggregator credentials (from environment variables)
# AGGREGATOR_DID=did:plc:xyz...
# AGGREGATOR_PRIVATE_KEY=base64-encoded-key...

# Coves API endpoint
coves_api_url: "https://api.coves.social"

# Feed-to-community mappings
feeds:
  - name: "World News"
    url: "https://news.kagi.com/world.xml"
    community_handle: "world-news.coves.social"
    enabled: true

  - name: "Tech News"
    url: "https://news.kagi.com/tech.xml"
    community_handle: "tech.coves.social"
    enabled: true

  - name: "Science News"
    url: "https://news.kagi.com/science.xml"
    community_handle: "science.coves.social"
    enabled: false  # Can be disabled without removing

# Scheduling
check_interval: "24h"  # Run once daily

# Logging
log_level: "info"
```

**Key Decisions:**
- Uses **community handles** (not DIDs) for easier configuration - resolved at runtime
- One aggregator can post to multiple communities
- Feed mappings managed in aggregator config (not per-community config)
- No complex filtering logic in MVP - one feed = one community

---

## Post Format Specification

### Post Record Structure

```json
{
  "$type": "social.coves.community.post.record",
  "author": "did:plc:[aggregator-did]",
  "community": "world-news.coves.social",
  "title": "{Kagi story title}",
  "content": "{formatted content - full format for MVP}",
  "embed": {
    "$type": "social.coves.embed.external",
    "external": {
      "uri": "{Kagi story URL}",
      "title": "{story title}",
      "description": "{summary excerpt - first 200 chars}",
      "thumb": "{Kagi proxy image URL from HTML}"
    }
  },
  "federatedFrom": {
    "platform": "kagi-news-rss",
    "uri": "https://kite.kagi.com/{uuid}/{category}/{id}",
    "id": "{guid}",
    "originalCreatedAt": "{pubDate from RSS}"
  },
  "contentLabels": [
    "{primary category}",
    "{subcategories}"
  ],
  "createdAt": "{current timestamp}"
}
```

**MVP Notes:**
- Uses `social.coves.embed.external` for hot-linked images (no blob upload)
- Community specified as handle (resolved to DID by post creation endpoint)
- Images referenced via original Kagi proxy URLs
- "Full" format only for MVP (no format variations)
- Content uses Coves rich text with facets (not markdown)

---

### Content Formatting (MVP: "Full" Format Only)

The MVP implements a single "full" format using Coves rich text with facets:

**Plain Text Structure:**
```
{Main summary paragraph with source citations}

Highlights:
• {Bullet point 1}
• {Bullet point 2}
• ...

Perspectives:
• {Actor}: {Their perspective} (Source)
• ...

"{Notable quote}" — {Attribution}

Sources:
• {Title} - {domain}
• ...

---
📰 Story aggregated by Kagi News
```

**Rich Text Facets Applied:**
- **Bold** (`social.coves.richtext.facet#bold`) on section headers: "Highlights:", "Perspectives:", "Sources:"
- **Bold** on perspective actors
- **Italic** (`social.coves.richtext.facet#italic`) on quotes
- **Link** (`social.coves.richtext.facet#link`) on all URLs (source links, Kagi story link, perspective sources)
- Byte ranges calculated using UTF-8 byte positions

**Example with Facets:**

In this example the bold facet covers `Highlights:` (bytes 29-40) and the link facet covers the citation `[source.com#1]` (bytes 13-27); `byteEnd` is exclusive.

```json
{
  "content": "Main summary [source.com#1]\n\nHighlights:\n• Key point 1...",
  "facets": [
    {
      "index": {"byteStart": 29, "byteEnd": 40},
      "features": [{"$type": "social.coves.richtext.facet#bold"}]
    },
    {
      "index": {"byteStart": 13, "byteEnd": 27},
      "features": [{"$type": "social.coves.richtext.facet#link", "uri": "https://source.com"}]
    }
  ]
}
```

**Rationale:**
- Uses native Coves rich text format (not markdown)
- Preserves Kagi's rich multi-source analysis
- Provides maximum value to communities
- Meets CC BY-NC attribution requirements
- Additional formats ("summary", "minimal") can be added post-MVP

---

## Implementation Details (Python MVP)

### Technology Stack

**Language:** Python 3.11+

**Key Libraries:**
- `feedparser` - RSS/Atom parsing
- `beautifulsoup4` - HTML parsing for RSS item descriptions
- `requests` - HTTP client for fetching feeds
- `atproto` - ATProto Python SDK for authentication (Component 5 notes that the MVP client uses direct HTTP instead)
- `pyyaml` - Configuration file parsing
- `pytest` - Testing framework

### Project Structure

```
aggregators/kagi-news/
├── Dockerfile
├── docker-compose.yml
├── requirements.txt
├── config.example.yaml
├── crontab                   # CRON schedule configuration
├── .env.example              # Environment variables template
├── scripts/
│   └── generate_did.py       # Helper to generate aggregator DID
├── src/
│   ├── main.py               # Entry point (single run, called by CRON)
│   ├── config.py             # Configuration loading and validation
│   ├── rss_fetcher.py        # RSS feed fetching with retry logic
│   ├── html_parser.py        # Parse Kagi HTML to structured data
│   ├── richtext_formatter.py # Format content with rich text facets
│   ├── atproto_client.py     # ATProto authentication and operations
│   ├── state_manager.py      # Deduplication state tracking (JSON)
│   └── models.py             # Data models (KagiStory, etc.)
├── tests/
│   ├── test_parser.py
│   ├── test_richtext_formatter.py
│   ├── test_state_manager.py
│   └── fixtures/             # Sample RSS feeds for testing
└── README.md
```

---

### Component 1: RSS Fetcher (`rss_fetcher.py`) ✅ COMPLETE

**Responsibility:** Fetch RSS feeds with retry logic and error handling

**Key Functions:**
- `fetch_feed(url: str) -> feedparser.FeedParserDict`
  - Uses `requests` with timeout (30s)
  - Retry logic: 3 attempts with exponential backoff
  - Returns parsed RSS feed or raises exception

**Error Handling:**
- Network timeouts
- Invalid XML
- HTTP errors (404, 500, etc.)

**Implementation Status:**
- ✅ Implemented with comprehensive error handling
- ✅ Tests passing (5 tests)
- ✅ Handles retries with exponential backoff

---

### Component 2: HTML Parser (`html_parser.py`) ✅ COMPLETE

**Responsibility:** Extract structured data from Kagi's HTML description field

**Key Class:** `KagiHTMLParser`

**Data Model (`models.py`):**
```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


# Helper types are defined first so KagiStory's annotations resolve at class creation.
@dataclass
class Perspective:
    actor: str
    description: str
    source_url: str


@dataclass
class Quote:
    text: str
    attribution: str


@dataclass
class Source:
    title: str
    url: str
    domain: str


@dataclass
class KagiStory:
    title: str
    link: str
    guid: str
    pub_date: datetime
    categories: List[str]

    # Parsed from HTML
    summary: str
    highlights: List[str]
    perspectives: List[Perspective]
    quote: Optional[Quote]
    sources: List[Source]
    image_url: Optional[str]
    image_alt: Optional[str]
```

**Parsing Strategy:**
- Use BeautifulSoup to parse HTML description
- Extract sections by finding `<h3>` tags (Highlights, Perspectives, Sources)
- Handle missing sections gracefully (not all stories have all sections)
- Clean and normalize text

**Implementation Status:**
- ✅ Extracts all 3 H3 sections (Highlights, Perspectives, Sources)
- ✅ Handles optional elements (quote, image)
- ✅ Tests passing (8 tests)
- ✅ Validates against real feed data

---

### Component 3: State Manager (`state_manager.py`) ✅ COMPLETE

**Responsibility:** Track processed stories to prevent duplicates

**Implementation:** Simple JSON file persistence

**State File Format:**
```json
{
  "feeds": {
    "https://news.kagi.com/world.xml": {
      "last_successful_run": "2025-10-23T12:00:00Z",
      "posted_guids": [
        "https://kite.kagi.com/uuid1/world/123",
        "https://kite.kagi.com/uuid2/world/124"
      ]
    }
  }
}
```

**Key Functions:**
- `is_posted(feed_url: str, guid: str) -> bool`
- `mark_posted(feed_url: str, guid: str, post_uri: str)`
- `get_last_run(feed_url: str) -> Optional[datetime]`
- `update_last_run(feed_url: str, timestamp: datetime)`

**Deduplication Strategy:**
- Keep last 100 GUIDs per feed (rolling window)
- Stories older than 30 days are automatically removed
- Simple, no database needed

**Implementation Status:**
- ✅ JSON-based persistence with atomic writes
- ✅ GUID tracking with rolling window
- ✅ Tests passing (12 tests)
- ✅ Thread-safe operations

---

### Component 4: Rich Text Formatter (`richtext_formatter.py`) ✅ COMPLETE

**Responsibility:** Format parsed Kagi stories into Coves rich text with facets

**Key Function:**
- `format_full(story: KagiStory) -> dict`
  - Returns: `{"content": str, "facets": List[dict]}`
  - Builds plain text content with all sections
  - Calculates UTF-8 byte positions for facets
  - Applies bold, italic, and link facets
  - Includes all sections: summary, highlights, perspectives, quote, sources
  - Adds Kagi News attribution footer with link

**Facet Types Applied:**
- `social.coves.richtext.facet#bold` - Section headers, perspective actors
- `social.coves.richtext.facet#italic` - Quotes
- `social.coves.richtext.facet#link` - All URLs (sources, Kagi story link)

**Key Challenge:** UTF-8 byte position calculation
- Must handle multi-byte characters correctly (emoji, non-ASCII)
- Use `str.encode('utf-8')` to get byte positions
- Test with complex characters

**Implementation Status:**
- ✅ Full rich text formatting with facets
- ✅ UTF-8 byte position calculation working correctly
- ✅ Tests passing (10 tests)
- ✅ Handles all sections: summary, highlights, perspectives, quote, sources

---

### Component 5: Coves Client (`coves_client.py`) ✅ COMPLETE

**Responsibility:** Handle authentication and post creation via Coves API

**Implementation Note:** Uses direct HTTP client instead of ATProto SDK for simplicity in MVP.

**Key Functions:**
- `authenticate() -> dict`
  - Authenticates aggregator using credentials
  - Returns auth token for subsequent API calls

- `create_post(community_handle: str, title: str, content: str, facets: List[dict], ...) -> dict`
  - Calls Coves post creation endpoint
  - Includes aggregator authentication
  - Returns post URI and metadata

**Authentication Flow:**
- Load aggregator credentials from environment
- Authenticate with Coves API
- Store and use auth token for requests
- Handle token refresh if needed

**Implementation Status:**
- ✅ HTTP-based client implementation
- ✅ Authentication and token management
- ✅ Post creation with all required fields
- ✅ Error handling and retries

---

### Component 6: Config Manager (`config.py`) ✅ COMPLETE

**Responsibility:** Load and validate configuration from YAML and environment

**Key Functions:**
- `load_config(config_path: str) -> AggregatorConfig`
  - Loads YAML configuration
  - Validates structure and required fields
  - Merges with environment variables
  - Returns validated config object

**Implementation Status:**
- ✅ YAML parsing with validation
- ✅ Environment variable support
- ✅ Tests passing (3 tests)
- ✅ Clear error messages for config issues

---

### Main Orchestration (`main.py`) ✅ COMPLETE

**Responsibility:** Coordinate all components in a single execution (called by CRON)

**Flow (Single Run):**
1. Load configuration from `config.yaml`
2. Load environment variables (AGGREGATOR_DID, AGGREGATOR_PRIVATE_KEY)
3. Initialize all components (fetcher, parser, formatter, client, state)
4. For each enabled feed in config:
   a. Fetch RSS feed
   b. Parse all items
   c. Filter out already-posted items (check state)
   d. For each new item:
      - Parse HTML to structured KagiStory
      - Format post content with rich text facets
      - Build post record (with hot-linked image if present)
      - Create post via XRPC
      - Mark as posted in state
   e. Update last run timestamp
5. Save state to disk
6. Log summary (posts created, errors encountered)
7. Exit (CRON will call again on schedule)

**Error Isolation:**
- Feed-level: One feed failing doesn't stop others
- Item-level: One item failing doesn't stop feed processing
- Continue on non-fatal errors, log all failures
- Exit code 0 even with partial failures (CRON won't alert)
- Exit code 1 only on catastrophic failure (config missing, auth failure)

**Implementation Status:**
- ✅ Complete orchestration logic implemented
- ✅ Feed-level and item-level error isolation
- ✅ Structured logging throughout
- ✅ Tests passing (9 tests covering various scenarios)
- ✅ Dry-run mode for testing

---

## Deployment (Docker Compose with CRON)

### Dockerfile

```dockerfile
FROM python:3.11-slim

WORKDIR /app

# Install cron
RUN apt-get update && apt-get install -y cron && rm -rf /var/lib/apt/lists/*

# Install dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy source code and scripts
COPY src/ ./src/
COPY scripts/ ./scripts/
COPY crontab /etc/cron.d/kagi-news-cron

# Set up cron
RUN chmod 0644 /etc/cron.d/kagi-news-cron && \
    crontab /etc/cron.d/kagi-news-cron && \
    touch /var/log/cron.log

# Create non-root user for security
RUN useradd --create-home appuser && \
    chown -R appuser:appuser /app && \
    chown appuser:appuser /var/log/cron.log

# Caution: Debian's cron daemon normally needs root to start, and the crontab
# above is installed for root. Verify that the container runs as appuser.
USER appuser

# Run cron in foreground
CMD ["cron", "-f"]
```

### Crontab Configuration (`crontab`)

```bash
# Run Kagi News aggregator daily at 1 PM UTC (after Kagi updates around noon)
# Note: cron jobs do not inherit the container environment by default, so
# AGGREGATOR_DID and AGGREGATOR_PRIVATE_KEY must be made available to the job.
0 13 * * * cd /app && /usr/local/bin/python -m src.main >> /var/log/cron.log 2>&1

# Blank line required at end of crontab
```

---

### docker-compose.yml

```yaml
version: '3.8'

services:
  kagi-news-aggregator:
    build: .
    container_name: kagi-news-aggregator
    restart: unless-stopped

    environment:
      # Aggregator identity (from aggregator creation)
      - AGGREGATOR_DID=${AGGREGATOR_DID}
      - AGGREGATOR_PRIVATE_KEY=${AGGREGATOR_PRIVATE_KEY}

    volumes:
      # Config file (read-only)
      - ./config.yaml:/app/config.yaml:ro
      # State file (read-write for deduplication)
      - ./data/state.json:/app/data/state.json

    logging:
      driver: "json-file"
      options:
        max-size: "10m"
        max-file: "3"
```

**Environment Variables:**
- `AGGREGATOR_DID`: PLC DID created for this aggregator instance
- `AGGREGATOR_PRIVATE_KEY`: Base64-encoded private key for signing

**Volumes:**
- `config.yaml`: Feed-to-community mappings (user-editable)
- `data/state.json`: Deduplication state (managed by aggregator)

**Deployment:**
```bash
# On same host as Coves
cd aggregators/kagi-news
cp config.example.yaml config.yaml
# Edit config.yaml with your feed mappings

# Set environment variables
export AGGREGATOR_DID="did:plc:xyz..."
export AGGREGATOR_PRIVATE_KEY="base64-key..."

# Start aggregator
docker-compose up -d

# View logs
docker-compose logs -f
```

---

## Image Handling Strategy (MVP)

### Approach: Hot-Linked Images via External Embed

The MVP uses hot-linked images from Kagi's proxy:

**Flow:**
1. Extract image URL from HTML description (`https://kagiproxy.com/img/...`)
2. Include in post using `social.coves.embed.external`:
   ```json
   {
     "$type": "social.coves.embed.external",
     "external": {
       "uri": "{Kagi story URL}",
       "title": "{Story title}",
       "description": "{Summary excerpt}",
       "thumb": "{Kagi proxy image URL}"
     }
   }
   ```
3. Frontend renders image from Kagi proxy URL

**Rationale:**
- Simpler MVP implementation (no blob upload complexity)
- No storage requirements on our end
- Kagi proxy is reliable and CDN-backed
- Faster posting (no download/upload step)
- Images already properly sized and optimized

**Future Consideration:** If Kagi proxy becomes unreliable, migrate to blob storage in Phase 2.

---

## Rate Limiting & Performance (MVP)

### Simplified Rate Strategy

**RSS Fetching:**
- Poll each feed once per day (13:00 UTC, after Kagi's ~noon UTC update)
- No aggressive polling needed (Kagi updates daily)
- ~3-5 feeds = minimal load

**Post Creation:**
- One run per day = 5-15 posts per feed
- Total: ~15-75 posts/day across all communities (3-5 feeds × 5-15 posts)
- Well within any reasonable rate limits

**Performance:**
- RSS fetch + parse: < 5 seconds per feed
- Image download + upload: < 3 seconds per image (applies only if blob upload is added; the MVP hot-links images)
- Post creation: < 1 second per post
- Total runtime per day: < 5 minutes

No complex rate limiting needed for MVP.

---

## Logging & Observability (MVP)

### Structured Logging

**Python logging module** with JSON formatter:

```python
import logging
import json

logging.basicConfig(
    level=logging.INFO,
    format='%(message)s'
)

logger = logging.getLogger(__name__)

# Example structured log
logger.info(json.dumps({
    "event": "post_created",
    "feed": "world.xml",
    "story_title": "Breaking News...",
    "community": "world-news.coves.social",
    "post_uri": "at://...",
    "timestamp": "2025-10-23T12:00:00Z"
}))
```

**Key Events to Log:**
- `feed_fetched`: RSS feed successfully fetched
- `story_parsed`: Story successfully parsed from HTML
- `post_created`: Post successfully created
- `error`: Any failures (with context)
- `run_completed`: Summary of entire run

**Log Levels:**
- INFO: Successful operations
- WARNING: Retryable errors, skipped items
- ERROR: Fatal errors, failed posts

### Simple Monitoring

**Health Check:** Check last successful run timestamp
- If > 48 hours: alert (should run daily)
- If errors > 50% of items: investigate

**Metrics to Track (manually via logs):**
- Posts created per run
- Parse failures per run
- Post creation failures per run
- Total runtime

No complex metrics infrastructure needed for MVP - Docker logs are sufficient.
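
The 48-hour health check described above could be sketched as follows. This is a minimal illustration that assumes the state file layout shown under Component 3 (`feeds` → feed URL → `last_successful_run`); the function names `load_state` and `stale_feeds` are hypothetical and not part of the implemented `state_manager.py`.

```python
import json
from datetime import datetime, timedelta, timezone
from typing import Optional


def load_state(path: str) -> dict:
    """Read the aggregator state file (assumed to be the JSON layout from Component 3)."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def stale_feeds(
    state: dict,
    now: Optional[datetime] = None,
    max_age: timedelta = timedelta(hours=48),
) -> list[str]:
    """Return feed URLs whose last successful run is missing or older than max_age."""
    now = now or datetime.now(timezone.utc)
    stale = []
    for feed_url, feed_state in state.get("feeds", {}).items():
        raw = feed_state.get("last_successful_run")
        if raw is None:
            # A feed that has never completed a run is treated as stale.
            stale.append(feed_url)
            continue
        # Timestamps look like "2025-10-23T12:00:00Z"; replace the trailing "Z"
        # with an explicit offset so fromisoformat accepts it on older Pythons.
        last_run = datetime.fromisoformat(raw.replace("Z", "+00:00"))
        if now - last_run > max_age:
            stale.append(feed_url)
    return stale
```

A daily check could call `stale_feeds(load_state("data/state.json"))` and alert when the returned list is non-empty; the path here matches the volume mount in `docker-compose.yml`.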

---

## Testing Strategy ✅ COMPLETE

### Unit Tests - 57 Tests Passing (83% Coverage)

**Test Coverage by Component:**
- ✅ **RSS Fetcher** (5 tests)
  - Successful feed fetch
  - Timeout handling
  - Retry logic with exponential backoff
  - Invalid XML handling
  - Empty URL validation

- ✅ **HTML Parser** (8 tests)
  - Summary extraction
  - Image URL and alt text extraction
  - Highlights list parsing
  - Quote extraction with attribution
  - Perspectives parsing with actors and sources
  - Sources list extraction
  - Missing sections handling
  - Full story object creation

- ✅ **Rich Text Formatter** (10 tests)
  - Full format generation
  - Bold facets on headers and actors
  - Italic facets on quotes
  - Link facets on URLs
  - UTF-8 byte position calculation
  - Multi-byte character handling (emoji, special chars)
  - All sections formatted correctly

- ✅ **State Manager** (12 tests)
  - GUID tracking
  - Duplicate detection
  - Rolling window (100 GUID limit)
  - Age-based cleanup (30 days)
  - Last run timestamp tracking
  - JSON persistence
  - Atomic file writes
  - Concurrent access safety

- ✅ **Config Manager** (3 tests)
  - YAML loading and validation
  - Environment variable merging
  - Error handling for missing/invalid config

- ✅ **Main Orchestrator** (9 tests)
  - End-to-end flow
  - Feed-level error isolation
  - Item-level error isolation
  - Dry-run mode
  - State persistence across runs
  - Multiple feed handling

- ✅ **E2E Tests** (6 skipped - require live API)
  - Integration with Coves API (manual testing required)
  - Authentication flow
  - Post creation

**Test Results:**
```
57 passed, 6 skipped, 1 warning in 8.76s
Coverage: 83%
```

**Test Fixtures:**
- Real Kagi News RSS item with all sections
- Sample HTML descriptions
- Mock HTTP responses

### Integration Tests

**Manual Integration Testing Required:**
- [ ] Can authenticate with live Coves API
- [ ] Can create post via Coves API
- [ ] Can fetch real Kagi RSS feed
- [ ] Images display correctly from Kagi proxy
- [ ] State persistence works in production
- [ ] CRON scheduling works correctly

**Pre-deployment Checklist:**
- [x] All unit tests passing
- [x] Can parse real Kagi HTML
- [x] State persistence works
- [x] Config validation works
- [x] Error handling comprehensive
- [ ] Aggregator DID created
- [ ] Can authenticate with Coves API
- [ ] Docker container builds and runs

---

## Success Metrics

### ✅ Phase 1: Implementation - COMPLETE

- [x] All core components implemented
- [x] 57 tests passing with 83% coverage
- [x] RSS fetching and parsing working
- [x] Rich text formatting with facets
- [x] State management and deduplication
- [x] Configuration management
- [x] Comprehensive error handling
- [x] Documentation complete

### 🔄 Phase 2: Integration Testing - IN PROGRESS

- [ ] Aggregator DID created (PLC)
- [ ] Aggregator authorized in 1+ test communities
- [ ] Can authenticate with Coves API
- [ ] First post created end-to-end
- [ ] Attribution visible ("Via Kagi News")
- [ ] No duplicate posts on repeated runs
- [ ] Images display correctly

### 📋 Phase 3: Alpha Deployment (First Week)

- [ ] Docker Compose runs successfully in production
- [ ] 2-3 communities receiving posts
- [ ] 20+ posts created successfully
- [ ] Zero duplicates
- [ ] < 10% errors (parse or post creation)
- [ ] CRON scheduling reliable

### 🎯 Phase 4: Beta (First Month)

- [ ] 5+ communities using aggregator
- [ ] 200+ posts created
- [ ] Positive community feedback
- [ ] No rate limit issues
- [ ] < 5% error rate
- [ ] Performance metrics tracked

---

## What's Next: Integration & Deployment

### Immediate Next Steps

1. **Create Aggregator Identity**
   - Generate DID for aggregator
   - Store credentials securely
   - Test authentication with Coves API

2. **Integration Testing**
   - Test with live Coves API
   - Verify post creation works
   - Validate rich text rendering
   - Check image display from Kagi proxy

3. **Docker Deployment**
   - Build Docker image
   - Test docker-compose setup
   - Verify CRON scheduling
   - Set up monitoring/logging

4. **Community Authorization**
   - Get aggregator authorized in test community
   - Verify authorization flow works
   - Test posting to real community

5. **Production Deployment**
   - Deploy to production server
   - Configure feeds for real communities
   - Monitor first batch of posts
   - Gather community feedback

### Open Questions to Resolve

1. **Aggregator DID Creation:**
   - Need helper script or manual process?
   - Where to store credentials securely?

2. **Authorization Flow:**
   - How does community admin authorize aggregator?
   - UI flow or XRPC endpoint?

3. **Image Strategy:**
   - Confirm Kagi proxy images work reliably
   - Fallback plan if proxy becomes unreliable?

4. **Monitoring:**
   - What metrics to track initially?
   - Alerting strategy for failures?

---

## Future Enhancements (Post-MVP)

### Phase 2
- Multiple post formats (summary, minimal)
- Per-community filtering (subcategories, min sources)
- More sophisticated deduplication
- Metrics dashboard

### Phase 3
- Interactive features (bot responds to comments)
- Cross-posting prevention
- Federation support

---

## References

- Kagi News About: https://news.kagi.com/about
- Kagi News RSS: https://news.kagi.com/world.xml
- CC BY-NC License: https://creativecommons.org/licenses/by-nc/4.0/
- Parent PRD: [PRD_AGGREGATORS.md](PRD_AGGREGATORS.md)
- ATProto Python SDK: https://github.com/MarshalX/atproto
- Implementation: [aggregators/kagi-news/](/aggregators/kagi-news/)

---

## Implementation Summary

**Phase 1 Status:** **COMPLETE**

The Kagi News RSS Aggregator implementation is complete and ready for integration testing and deployment. All 7 core components have been implemented with comprehensive test coverage (57 tests, 83% coverage).

**What Was Built:**
- Complete RSS feed fetching and parsing pipeline
- HTML parser that extracts all structured data from Kagi News feeds (summary, highlights, perspectives, quote, sources)
- Rich text formatter with proper facets for Coves
- State management system for deduplication
- Configuration management with YAML and environment variables
- HTTP client for Coves API authentication and post creation
- Main orchestrator with robust error handling
- Comprehensive test suite with real feed fixtures
- Documentation and example configurations

**Key Findings:**
- Kagi News RSS feeds contain only 3 structured sections (Highlights, Perspectives, Sources)
- Historical context is woven into the summary and highlights, not a separate section
- Timeline feature visible on Kagi website is not in the RSS feed
- All essential data for rich posts is available in the feed
- Feed structure is stable and well-formed

**Next Phase:**
Integration testing with live Coves API, followed by alpha deployment to test communities.

---

**End of PRD - Phase 1 Implementation Complete**