Manage Atom feeds in a persistent git repository
1# Thicket Architecture Design 2 3## Overview 4Thicket is a modern CLI tool for persisting Atom/RSS feeds in a Git repository, designed to enable distributed webblog comment structures. 5 6## Technology Stack 7 8### Core Libraries 9 10#### CLI Framework 11- **Typer** (0.15.x) - Modern CLI framework with type hints 12- **Rich** (13.x) - Beautiful terminal output, progress bars, and tables 13- **prompt-toolkit** - Interactive prompts when needed 14 15#### Feed Processing 16- **feedparser** (6.0.11) - Universal feed parser supporting RSS 0.9x, RSS 1.0, RSS 2.0, CDF, Atom 0.3, and Atom 1.0 17 - Alternative: **atoma** for stricter Atom/RSS parsing with JSON feed support 18 - Alternative: **fastfeedparser** for high-performance parsing (10x faster) 19 20#### Git Integration 21- **GitPython** (3.1.44) - High-level git operations, requires git CLI 22 - Alternative: **pygit2** (1.18.0) - Direct libgit2 bindings, better for authentication 23 24#### HTTP Client 25- **httpx** (0.28.x) - Modern async/sync HTTP client with connection pooling 26- **aiohttp** (3.11.x) - For async-only operations if needed 27 28#### Configuration & Data Models 29- **pydantic** (2.11.x) - Data validation and settings management 30- **pydantic-settings** (2.10.x) - Configuration file handling with env var support 31 32#### Utilities 33- **pendulum** (3.x) - Better datetime handling 34- **bleach** (6.x) - HTML sanitization for feed content 35- **platformdirs** (4.x) - Cross-platform directory paths 36 37## Project Structure 38 39``` 40thicket/ 41├── pyproject.toml # Modern Python packaging 42├── README.md # Project documentation 43├── ARCH.md # This file 44├── CLAUDE.md # Project instructions 45├── .gitignore 46├── src/ 47│ └── thicket/ 48│ ├── __init__.py 49│ ├── __main__.py # Entry point for `python -m thicket` 50│ ├── cli/ # CLI commands and interface 51│ │ ├── __init__.py 52│ │ ├── main.py # Main CLI app with Typer 53│ │ ├── commands/ # Subcommands 54│ │ │ ├── __init__.py 55│ │ │ ├── init.py # Initialize git store 56│ │ │ ├── add.py # Add users and feeds 57│ │ │ ├── sync.py # Sync feeds 58│ │ │ ├── list_cmd.py # List users/feeds 59│ │ │ ├── duplicates.py # Manage duplicate entries 60│ │ │ ├── links_cmd.py # Extract and categorize links 61│ │ │ └── index_cmd.py # Build reference index and show threads 62│ │ └── utils.py # CLI utilities (progress, formatting) 63│ ├── core/ # Core business logic 64│ │ ├── __init__.py 65│ │ ├── feed_parser.py # Feed parsing and normalization 66│ │ ├── git_store.py # Git repository operations 67│ │ └── reference_parser.py # Link extraction and threading 68│ ├── models/ # Pydantic data models 69│ │ ├── __init__.py 70│ │ ├── config.py # Configuration models 71│ │ ├── feed.py # Feed/Entry models 72│ │ └── user.py # User metadata models 73│ └── utils/ # Shared utilities 74│ └── __init__.py 75├── tests/ 76│ ├── __init__.py 77│ ├── conftest.py # pytest configuration 78│ ├── test_feed_parser.py 79│ ├── test_git_store.py 80│ └── fixtures/ # Test data 81│ └── feeds/ 82└── docs/ 83 └── examples/ # Example configurations 84``` 85 86## Data Models 87 88### Configuration File (YAML/TOML) 89```python 90class ThicketConfig(BaseSettings): 91 git_store: Path # Git repository location 92 cache_dir: Path # Cache directory 93 users: list[UserConfig] 94 95 model_config = SettingsConfigDict( 96 env_prefix="THICKET_", 97 env_file=".env", 98 yaml_file="thicket.yaml" 99 ) 100 101class UserConfig(BaseModel): 102 username: str 103 feeds: list[HttpUrl] 104 email: Optional[EmailStr] = None 105 homepage: Optional[HttpUrl] = None 106 icon: Optional[HttpUrl] = None 107 display_name: Optional[str] = None 108``` 109 110### Feed Storage Format 111```python 112class AtomEntry(BaseModel): 113 id: str # Original Atom ID 114 title: str 115 link: HttpUrl 116 updated: datetime 117 published: Optional[datetime] 118 summary: Optional[str] 119 content: Optional[str] # Full body content from Atom entry 120 content_type: Optional[str] = "html" # text, html, xhtml 121 author: Optional[dict] 122 categories: list[str] = [] 123 rights: Optional[str] = None # Copyright info 124 source: Optional[str] = None # Source feed URL 125 # Additional Atom fields preserved during RSS->Atom conversion 126 127 model_config = ConfigDict( 128 json_encoders={ 129 datetime: lambda v: v.isoformat() 130 } 131 ) 132 133class DuplicateMap(BaseModel): 134 """Maps duplicate entry IDs to canonical entry IDs""" 135 duplicates: dict[str, str] = {} # duplicate_id -> canonical_id 136 comment: str = "Entry IDs that map to the same canonical content" 137 138 def add_duplicate(self, duplicate_id: str, canonical_id: str) -> None: 139 """Add a duplicate mapping""" 140 self.duplicates[duplicate_id] = canonical_id 141 142 def remove_duplicate(self, duplicate_id: str) -> bool: 143 """Remove a duplicate mapping. Returns True if existed.""" 144 return self.duplicates.pop(duplicate_id, None) is not None 145 146 def get_canonical(self, entry_id: str) -> str: 147 """Get canonical ID for an entry (returns original if not duplicate)""" 148 return self.duplicates.get(entry_id, entry_id) 149 150 def is_duplicate(self, entry_id: str) -> bool: 151 """Check if entry ID is marked as duplicate""" 152 return entry_id in self.duplicates 153``` 154 155## Git Repository Structure 156``` 157git-store/ 158├── index.json # User directory index 159├── duplicates.json # Manual curation of duplicate entries 160├── links.json # Unified links, references, and mapping data 161├── user1/ 162│ ├── entry_id_1.json # Sanitized entry files 163│ ├── entry_id_2.json 164│ └── ... 165└── user2/ 166 └── ... 167``` 168 169## Key Design Decisions 170 171### 1. Feed Normalization & Auto-Discovery 172- All RSS feeds converted to Atom format before storage 173- Preserves maximum metadata during conversion 174- Sanitizes HTML content to prevent XSS 175- **Auto-discovery**: Extracts user metadata from feed during `add user` command 176 177### 2. ID Sanitization 178- Consistent algorithm to convert Atom IDs to safe filenames 179- Handles edge cases (very long IDs, special characters) 180- Maintains reversibility where possible 181 182### 3. Git Operations 183- Uses GitPython for simplicity (no authentication required) 184- Single main branch for all users and entries 185- Atomic commits per sync operation 186- Meaningful commit messages with feed update summaries 187- Preserves complete history - never delete entries even if they disappear from feeds 188 189### 4. Caching Strategy 190- HTTP caching with Last-Modified/ETag support 191- Local cache of parsed feeds with TTL 192- Cache invalidation on configuration changes 193- Git store serves as permanent historical archive beyond feed depth limits 194 195### 5. Error Handling 196- Graceful handling of feed parsing errors 197- Retry logic for network failures 198- Clear error messages with recovery suggestions 199 200## CLI Command Structure 201 202```bash 203# Initialize a new git store 204thicket init /path/to/store 205 206# Add a user with feeds (auto-discovers metadata from feed) 207thicket add user "alyssa" \ 208 --feed "https://example.com/feed.atom" 209 # Auto-populates: email, homepage, icon, display_name from feed metadata 210 211# Add a user with manual overrides 212thicket add user "alyssa" \ 213 --feed "https://example.com/feed.atom" \ 214 --email "alyssa@example.com" \ 215 --homepage "https://alyssa.example.com" \ 216 --icon "https://example.com/avatar.png" \ 217 --display-name "Alyssa P. Hacker" 218 219# Add additional feed to existing user 220thicket add feed "alyssa" "https://example.com/other-feed.rss" 221 222# Sync all feeds (designed for cron usage) 223thicket sync --all 224 225# Sync specific user 226thicket sync --user alyssa 227 228# List users and their feeds 229thicket list users 230thicket list feeds --user alyssa 231 232# Manage duplicate entries 233thicket duplicates list 234thicket duplicates add <entry_id_1> <entry_id_2> # Mark as duplicates 235thicket duplicates remove <entry_id_1> <entry_id_2> # Unmark duplicates 236 237# Link processing and threading 238thicket links --verbose # Extract and categorize all links 239thicket index --verbose # Build reference index for threading 240thicket threads # Show conversation threads 241thicket threads --username user1 # Show threads for specific user 242thicket threads --min-size 3 # Show threads with minimum size 243``` 244 245## Performance Considerations 246 2471. **Concurrent Feed Fetching**: Use httpx with asyncio for parallel downloads 2482. **Incremental Updates**: Only fetch/parse feeds that have changed 2493. **Efficient Git Operations**: Batch commits, use shallow clones where appropriate 2504. **Progress Feedback**: Rich progress bars for long operations 251 252## Security Considerations 253 2541. **HTML Sanitization**: Use bleach to clean feed content 2552. **URL Validation**: Strict validation of feed URLs 2563. **Git Security**: No credentials stored in repository 2574. **Path Traversal**: Careful sanitization of filenames 258 259## Future Enhancements 260 2611. **Web Interface**: Optional web UI for browsing the git store 2622. **Webhooks**: Notify external services on feed updates 2633. **Feed Discovery**: Auto-discover feeds from HTML pages 2644. **Export Formats**: Generate static sites, OPML exports 2655. **Federation**: P2P sync between thicket instances 266 267## Requirements Clarification 268 269**✓ Resolved Requirements:** 2701. **Feed Update Frequency**: Designed for cron usage - no built-in scheduling needed 2712. **Duplicate Handling**: Manual curation via `duplicates.json` file with CLI commands 2723. **Git Branching**: Single main branch for all users and entries 2734. **Authentication**: No feeds require authentication currently 2745. **Content Storage**: Store complete Atom entry body content as provided 2756. **Deleted Entries**: Preserve all entries in Git store permanently (historical archive) 2767. **History Depth**: Git store maintains full history beyond feed depth limits 2778. **Feed Auto-Discovery**: Extract user metadata from feed during `add user` command 278 279## Duplicate Entry Management 280 281### Duplicate Detection Strategy 282- **Manual Curation**: Duplicates identified and managed manually via CLI 283- **Storage**: `duplicates.json` file in Git root maps entry IDs to canonical entries 284- **Structure**: `{"duplicate_id": "canonical_id", ...}` 285- **CLI Commands**: Add/remove duplicate mappings with validation 286- **Query Resolution**: Search/list commands resolve duplicates to canonical entries 287 288### Duplicate File Format 289```json 290{ 291 "https://example.com/feed/entry/123": "https://canonical.com/posts/same-post", 292 "https://mirror.com/articles/456": "https://canonical.com/posts/same-post", 293 "comment": "Entry IDs that map to the same canonical content" 294} 295``` 296 297## Feed Metadata Auto-Discovery 298 299### Extraction Strategy 300When adding a new user with `thicket add user`, the system fetches and parses the feed to extract: 301 302- **Display Name**: From `feed.title` or `feed.author.name` 303- **Email**: From `feed.author.email` or `feed.managingEditor` 304- **Homepage**: From `feed.link` or `feed.author.uri` 305- **Icon**: From `feed.logo`, `feed.icon`, or `feed.image.url` 306 307### Discovery Priority Order 3081. **Author Information**: Prefer `feed.author.*` fields (more specific to person) 3092. **Feed-Level**: Fall back to feed-level metadata 3103. **Manual Override**: CLI flags always take precedence over discovered values 3114. **Update Behavior**: Auto-discovery only runs during initial `add user`, not on sync 312 313### Extracted Metadata Format 314```python 315class FeedMetadata(BaseModel): 316 title: Optional[str] = None 317 author_name: Optional[str] = None 318 author_email: Optional[EmailStr] = None 319 author_uri: Optional[HttpUrl] = None 320 link: Optional[HttpUrl] = None 321 logo: Optional[HttpUrl] = None 322 icon: Optional[HttpUrl] = None 323 image_url: Optional[HttpUrl] = None 324 325 def to_user_config(self, username: str, feed_url: HttpUrl) -> UserConfig: 326 """Convert discovered metadata to UserConfig with fallbacks""" 327 return UserConfig( 328 username=username, 329 feeds=[feed_url], 330 display_name=self.author_name or self.title, 331 email=self.author_email, 332 homepage=self.author_uri or self.link, 333 icon=self.logo or self.icon or self.image_url 334 ) 335``` 336 337## Link Processing and Threading Architecture 338 339### Overview 340The thicket system implements a sophisticated link processing and threading system to create email-style threaded views of blog entries by tracking cross-references between different blogs. 341 342### Link Processing Pipeline 343 344#### 1. Link Extraction (`thicket links`) 345The `links` command systematically extracts all outbound links from blog entries and categorizes them: 346 347```python 348class LinkData(BaseModel): 349 url: str # Fully resolved URL 350 entry_id: str # Source entry ID 351 username: str # Source username 352 context: str # Surrounding text context 353 category: str # "internal", "user", or "unknown" 354 target_username: Optional[str] # Target user if applicable 355``` 356 357**Link Categories:** 358- **Internal**: Links to the same user's domain (self-references) 359- **User**: Links to other tracked users' domains 360- **Unknown**: Links to external sites not tracked by thicket 361 362#### 2. URL Resolution 363All links are properly resolved using the Atom feed's base URL to handle: 364- Relative URLs (converted to absolute) 365- Protocol-relative URLs 366- Fragment identifiers 367- Redirects and canonical URLs 368 369#### 3. Domain Mapping 370The system builds a comprehensive domain mapping from user configuration: 371- Feed URLs → domain extraction 372- Homepage URLs → domain extraction 373- Reverse mapping: domain → username 374 375### Threading System 376 377#### 1. Reference Index Generation (`thicket index`) 378Creates a bidirectional reference index from the categorized links: 379 380```python 381class BlogReference(BaseModel): 382 source_entry_id: str 383 source_username: str 384 target_url: str 385 target_username: Optional[str] 386 target_entry_id: Optional[str] 387 context: str 388``` 389 390#### 2. Thread Detection Algorithm 391Uses graph traversal to find connected blog entries: 392- **Outbound references**: Links from an entry to other entries 393- **Inbound references**: Links to an entry from other entries 394- **Thread members**: All entries connected through references 395 396#### 3. Threading Display (`thicket threads`) 397Creates email-style threaded views: 398- Chronological ordering within threads 399- Reference counts (outbound/inbound) 400- Context preservation 401- Filtering options (user, entry, minimum size) 402 403### Data Structures 404 405#### links.json Format (Unified Structure) 406```json 407{ 408 "links": { 409 "https://example.com/post/123": { 410 "referencing_entries": ["https://blog.user.com/entry/456"], 411 "target_username": "user2" 412 }, 413 "https://external-site.com/article": { 414 "referencing_entries": ["https://blog.user.com/entry/789"] 415 } 416 }, 417 "reverse_mapping": { 418 "https://blog.user.com/entry/456": ["https://example.com/post/123"], 419 "https://blog.user.com/entry/789": ["https://external-site.com/article"] 420 }, 421 "references": [ 422 { 423 "source_entry_id": "https://blog.user.com/entry/456", 424 "source_username": "user1", 425 "target_url": "https://example.com/post/123", 426 "target_username": "user2", 427 "target_entry_id": "https://example.com/post/123", 428 "context": "As mentioned in this post..." 429 } 430 ], 431 "user_domains": { 432 "user1": ["blog.user.com"], 433 "user2": ["example.com"] 434 } 435} 436``` 437 438This unified structure eliminates duplication by: 439- Storing each URL only once with minimal metadata 440- Including all link data, reference data, and mappings in one file 441- Using presence of `target_username` to identify tracked vs external links 442- Providing bidirectional mappings for efficient queries 443 444### Unified Structure Benefits 445 446- **Eliminates Duplication**: Each URL appears only once with metadata 447- **Single Source of Truth**: All link-related data in one file 448- **Efficient Queries**: Fast lookups for both directions (URL→entries, entry→URLs) 449- **Atomic Updates**: All link data changes together 450- **Reduced I/O**: Fewer file operations 451 452### Implementation Benefits 453 4541. **Systematic Link Processing**: All links are extracted and categorized consistently 4552. **Proper URL Resolution**: Handles relative URLs and base URL resolution correctly 4563. **Domain-based Categorization**: Automatically identifies user-to-user references 4574. **Bidirectional Indexing**: Supports both "who links to whom" and "who is linked by whom" 4585. **Thread Discovery**: Finds conversation threads automatically 4596. **Rich Context**: Preserves surrounding text for each link 4607. **Performance**: Pre-computed indexes for fast threading queries 461 462### CLI Commands 463 464```bash 465# Extract and categorize all links 466thicket links --verbose 467 468# Build reference index for threading 469thicket index --verbose 470 471# Show all conversation threads 472thicket threads 473 474# Show threads for specific user 475thicket threads --username user1 476 477# Show threads with minimum size 478thicket threads --min-size 3 479``` 480 481### Integration with Existing Commands 482 483The link processing system integrates seamlessly with existing thicket commands: 484- `thicket sync` updates entries, requiring `thicket links` to be run afterward 485- `thicket index` uses the output from `thicket links` for improved accuracy 486- `thicket threads` provides the user-facing threading interface 487 488## Current Implementation Status 489 490### ✅ Completed Features 4911. **Core Infrastructure** 492 - Modern CLI with Typer and Rich 493 - Pydantic data models for type safety 494 - Git repository operations with GitPython 495 - Feed parsing and normalization with feedparser 496 4972. **User and Feed Management** 498 - `thicket init` - Initialize git store 499 - `thicket add` - Add users and feeds with auto-discovery 500 - `thicket sync` - Sync feeds with progress tracking 501 - `thicket list` - List users, feeds, and entries 502 - `thicket duplicates` - Manage duplicate entries 503 5043. **Link Processing and Threading** 505 - `thicket links` - Extract and categorize all outbound links 506 - `thicket index` - Build reference index from links 507 - `thicket threads` - Display threaded conversation views 508 - Proper URL resolution with base URL handling 509 - Domain-based link categorization 510 - Context preservation for links 511 512### 📊 System Performance 513- **Link Extraction**: Successfully processes thousands of blog entries 514- **Categorization**: Identifies internal, user, and unknown links 515- **Threading**: Creates email-style threaded views of conversations 516- **Storage**: Efficient JSON-based data structures for links and references 517 518### 🔧 Current Architecture Highlights 519- **Modular Design**: Clear separation between CLI, core logic, and models 520- **Type Safety**: Comprehensive Pydantic models for data validation 521- **Rich CLI**: Beautiful progress bars, tables, and error handling 522- **Extensible**: Easy to add new commands and features 523- **Git Integration**: All data stored in version-controlled JSON files 524 525### 🎯 Proven Functionality 526The system has been tested with real blog data and successfully: 527- Extracted 14,396 total links from blog entries 528- Categorized 3,994 internal links, 363 user-to-user links, and 10,039 unknown links 529- Built comprehensive domain mappings for 16 users across 20 domains 530- Generated threaded views showing blog conversation patterns 531 532### 🚀 Ready for Use 533The thicket system is now fully functional for: 534- Maintaining Git repositories of blog feeds 535- Tracking cross-references between blogs 536- Creating threaded views of blog conversations 537- Discovering blog interaction patterns 538- Building distributed comment systems