Manage Atom feeds in a persistent git repository
# Thicket Architecture Design

## Overview
Thicket is a modern CLI tool for persisting Atom/RSS feeds in a Git repository, designed to enable distributed weblog comment structures.

## Technology Stack

### Core Libraries

#### CLI Framework
- **Typer** (0.15.x) - Modern CLI framework with type hints
- **Rich** (13.x) - Beautiful terminal output, progress bars, and tables
- **prompt-toolkit** - Interactive prompts when needed

#### Feed Processing
- **feedparser** (6.0.11) - Universal feed parser supporting RSS 0.9x, RSS 1.0, RSS 2.0, CDF, Atom 0.3, and Atom 1.0
  - Alternative: **atoma** for stricter Atom/RSS parsing with JSON feed support
  - Alternative: **fastfeedparser** for high-performance parsing (advertised as about 10x faster than feedparser)

#### Git Integration
- **GitPython** (3.1.44) - High-level git operations, requires git CLI
  - Alternative: **pygit2** (1.18.0) - Direct libgit2 bindings, better for authentication

#### HTTP Client
- **httpx** (0.28.x) - Modern async/sync HTTP client with connection pooling
- **aiohttp** (3.11.x) - For async-only operations if needed

#### Configuration & Data Models
- **pydantic** (2.11.x) - Data validation and settings management
- **pydantic-settings** (2.10.x) - Configuration file handling with env var support

#### Utilities
- **pendulum** (3.x) - Better datetime handling
- **bleach** (6.x) - HTML sanitization for feed content
- **platformdirs** (4.x) - Cross-platform directory paths

## Project Structure

```
thicket/
├── pyproject.toml                  # Modern Python packaging
├── README.md                       # Project documentation
├── ARCH.md                         # This file
├── CLAUDE.md                       # Project instructions
├── .gitignore
├── src/
│   └── thicket/
│       ├── __init__.py
│       ├── __main__.py             # Entry point for `python -m thicket`
│       ├── cli/                    # CLI commands and interface
│       │   ├── __init__.py
│       │   ├── main.py             # Main CLI app with Typer
│       │   ├── commands/           # Subcommands
│       │   │   ├── __init__.py
│       │   │   ├── init.py         # Initialize git store
│       │   │   ├── add.py          # Add users and feeds
│       │   │   ├── sync.py         # Sync feeds
│       │   │   ├── list_cmd.py     # List users/feeds
│       │   │   ├── duplicates.py   # Manage duplicate entries
│       │   │   ├── links_cmd.py    # Extract and categorize links
│       │   │   └── index_cmd.py    # Build reference index and show threads
│       │   └── utils.py            # CLI utilities (progress, formatting)
│       ├── core/                   # Core business logic
│       │   ├── __init__.py
│       │   ├── feed_parser.py      # Feed parsing and normalization
│       │   ├── git_store.py        # Git repository operations
│       │   └── reference_parser.py # Link extraction and threading
│       ├── models/                 # Pydantic data models
│       │   ├── __init__.py
│       │   ├── config.py           # Configuration models
│       │   ├── feed.py             # Feed/Entry models
│       │   └── user.py             # User metadata models
│       └── utils/                  # Shared utilities
│           └── __init__.py
├── tests/
│   ├── __init__.py
│   ├── conftest.py                 # pytest configuration
│   ├── test_feed_parser.py
│   ├── test_git_store.py
│   └── fixtures/                   # Test data
│       └── feeds/
└── docs/
    └── examples/                   # Example configurations
```

## Data Models

### Configuration File (YAML/TOML)
```python
from pathlib import Path
from typing import Optional

from pydantic import BaseModel, EmailStr, HttpUrl
from pydantic_settings import BaseSettings, SettingsConfigDict


# Defined before ThicketConfig, which references it in a field annotation.
class UserConfig(BaseModel):
    username: str
    feeds: list[HttpUrl]
    email: Optional[EmailStr] = None
    homepage: Optional[HttpUrl] = None
    icon: Optional[HttpUrl] = None
    display_name: Optional[str] = None


class ThicketConfig(BaseSettings):
    git_store: Path  # Git repository location
    cache_dir: Path  # Cache directory
    users: list[UserConfig]

    # Note: pydantic-settings reads yaml_file only when a
    # YamlConfigSettingsSource is enabled via settings_customise_sources.
    model_config = SettingsConfigDict(
        env_prefix="THICKET_",
        env_file=".env",
        yaml_file="thicket.yaml",
    )
```

### Feed Storage Format
```python
from datetime import datetime
from typing import Optional

from pydantic import BaseModel, HttpUrl


class AtomEntry(BaseModel):
    id: str  # Original Atom ID
    title: str
    link: HttpUrl
    updated: datetime
    # In Pydantic v2, Optional fields need an explicit default to be omittable.
    published: Optional[datetime] = None
    summary: Optional[str] = None
    content: Optional[str] = None  # Full body content from Atom entry
    content_type: Optional[str] = "html"  # text, html, xhtml
    author: Optional[dict] = None
    categories: list[str] = []
    rights: Optional[str] = None  # Copyright info
    source: Optional[str] = None  # Source feed URL
    # Additional Atom fields preserved during RSS->Atom conversion

    # Pydantic v2 serializes datetime fields as ISO 8601 strings in JSON mode,
    # so no custom encoder is needed (json_encoders is deprecated in v2).


class DuplicateMap(BaseModel):
    """Maps duplicate entry IDs to canonical entry IDs"""
    duplicates: dict[str, str] = {}  # duplicate_id -> canonical_id
    comment: str = "Entry IDs that map to the same canonical content"

    def add_duplicate(self, duplicate_id: str, canonical_id: str) -> None:
        """Add a duplicate mapping"""
        self.duplicates[duplicate_id] = canonical_id

    def remove_duplicate(self, duplicate_id: str) -> bool:
        """Remove a duplicate mapping. Returns True if existed."""
        return self.duplicates.pop(duplicate_id, None) is not None

    def get_canonical(self, entry_id: str) -> str:
        """Get canonical ID for an entry (returns original if not duplicate)"""
        return self.duplicates.get(entry_id, entry_id)

    def is_duplicate(self, entry_id: str) -> bool:
        """Check if entry ID is marked as duplicate"""
        return entry_id in self.duplicates
```

## Git Repository Structure
```
git-store/
├── index.json           # User directory index
├── duplicates.json      # Manual curation of duplicate entries
├── links.json           # Unified links, references, and mapping data
├── user1/
│   ├── entry_id_1.json  # Sanitized entry files
│   ├── entry_id_2.json
│   └── ...
└── user2/
    └── ...
```

## Key Design Decisions

### 1. Feed Normalization & Auto-Discovery
- All RSS feeds converted to Atom format before storage
- Preserves maximum metadata during conversion
- Sanitizes HTML content to prevent XSS
- **Auto-discovery**: Extracts user metadata from feed during `add user` command

### 2. ID Sanitization
- Consistent algorithm to convert Atom IDs to safe filenames
- Handles edge cases (very long IDs, special characters)
- Maintains reversibility where possible
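
One way such a conversion could look is sketched below. The function name `sanitize_entry_id`, the 200-character limit, and the hash suffix are assumptions made for illustration; this is not necessarily the algorithm Thicket uses.

```python
import hashlib
import re

MAX_LEN = 200  # assumed limit, kept below common 255-byte filename caps


def sanitize_entry_id(entry_id: str) -> str:
    """Map an Atom ID to a filesystem-safe name (hypothetical helper)."""
    # Replace every run of characters outside [A-Za-z0-9._-] with "_",
    # then drop leading/trailing dots and underscores (no "../" survives).
    safe = re.sub(r"[^A-Za-z0-9._-]+", "_", entry_id).strip("._")
    if not safe:
        safe = "entry"
    if len(safe) > MAX_LEN:
        # Very long IDs: truncate and append a short digest of the full ID
        # so distinct IDs sharing a long prefix do not collide.
        digest = hashlib.sha256(entry_id.encode("utf-8")).hexdigest()[:16]
        safe = f"{safe[:MAX_LEN - 17]}_{digest}"
    return safe
```

This mapping is lossy, so under this sketch the original ID would be recovered from the `id` field stored inside the entry's JSON file rather than from the filename.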

### 3. Git Operations
- Uses GitPython for simplicity (no authentication required)
- Single main branch for all users and entries
- Atomic commits per sync operation
- Meaningful commit messages with feed update summaries
- Preserves complete history - never delete entries even if they disappear from feeds

### 4. Caching Strategy
- HTTP caching with Last-Modified/ETag support
- Local cache of parsed feeds with TTL
- Cache invalidation on configuration changes
- Git store serves as permanent historical archive beyond feed depth limits
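
The Last-Modified/ETag part of this strategy is a standard HTTP conditional GET. A minimal sketch follows; `CacheValidators`, `conditional_headers`, and `needs_parse` are hypothetical names, and where the validators are persisted between runs is left open.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CacheValidators:
    """Validators remembered from the previous fetch (hypothetical type)."""
    etag: Optional[str] = None
    last_modified: Optional[str] = None


def conditional_headers(v: CacheValidators) -> dict[str, str]:
    """Build request headers for a conditional GET."""
    headers: dict[str, str] = {}
    if v.etag:
        headers["If-None-Match"] = v.etag
    if v.last_modified:
        headers["If-Modified-Since"] = v.last_modified
    return headers


def needs_parse(status_code: int) -> bool:
    """304 Not Modified means the cached copy is still current."""
    return status_code != 304
```

With httpx, these headers could be passed as `httpx.get(url, headers=conditional_headers(v))`, and the new validators read back from the response's `ETag` and `Last-Modified` headers.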

### 5. Error Handling
- Graceful handling of feed parsing errors
- Retry logic for network failures
- Clear error messages with recovery suggestions
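
Retry logic for network failures is often written as exponential backoff. The helper below is a sketch under that assumption; `with_retries` and its parameters are illustrative names, not part of Thicket.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(
    fetch: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 1.0,
    retry_on: tuple[type[BaseException], ...] = (OSError,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fetch(), retrying with exponential backoff on listed errors."""
    for attempt in range(attempts):
        try:
            return fetch()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)
    raise ValueError("attempts must be at least 1")
```

When fetching with httpx, `retry_on=(httpx.TransportError,)` would be a reasonable choice, since parse errors are unlikely to go away on retry.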

## CLI Command Structure

```bash
# Initialize a new git store
thicket init /path/to/store

# Add a user with feeds (auto-discovers metadata from feed)
thicket add user "alyssa" \
  --feed "https://example.com/feed.atom"
  # Auto-populates: email, homepage, icon, display_name from feed metadata

# Add a user with manual overrides
thicket add user "alyssa" \
  --feed "https://example.com/feed.atom" \
  --email "alyssa@example.com" \
  --homepage "https://alyssa.example.com" \
  --icon "https://example.com/avatar.png" \
  --display-name "Alyssa P. Hacker"

# Add additional feed to existing user
thicket add feed "alyssa" "https://example.com/other-feed.rss"

# Sync all feeds (designed for cron usage)
thicket sync --all

# Sync specific user
thicket sync --user alyssa

# List users and their feeds
thicket list users
thicket list feeds --user alyssa

# Manage duplicate entries
thicket duplicates list
thicket duplicates add <entry_id_1> <entry_id_2>     # Mark as duplicates
thicket duplicates remove <entry_id_1> <entry_id_2>  # Unmark duplicates

# Link processing and threading
thicket links --verbose           # Extract and categorize all links
thicket index --verbose           # Build reference index for threading
thicket threads                   # Show conversation threads
thicket threads --username user1  # Show threads for specific user
thicket threads --min-size 3      # Show threads with minimum size
```

## Performance Considerations

1. **Concurrent Feed Fetching**: Use httpx with asyncio for parallel downloads
2. **Incremental Updates**: Only fetch/parse feeds that have changed
3. **Efficient Git Operations**: Batch commits, use shallow clones where appropriate
4. **Progress Feedback**: Rich progress bars for long operations
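
Concurrent fetching with asyncio could be structured as below. This is a sketch: `fetch_all` and the injected `fetch_one` coroutine are hypothetical, and the concurrency cap of 10 is an arbitrary assumption. In practice `fetch_one` would wrap an `httpx.AsyncClient` request.

```python
import asyncio
from typing import Awaitable, Callable, Union


async def fetch_all(
    urls: list[str],
    fetch_one: Callable[[str], Awaitable[bytes]],
    max_concurrency: int = 10,
) -> dict[str, Union[bytes, BaseException]]:
    """Fetch every URL concurrently, capping in-flight requests."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(url: str) -> bytes:
        async with semaphore:
            return await fetch_one(url)

    # return_exceptions=True keeps one failing feed from aborting the batch.
    results = await asyncio.gather(
        *(guarded(u) for u in urls), return_exceptions=True
    )
    return dict(zip(urls, results))
```

Returning exceptions alongside results lets the caller report per-feed failures while still committing the feeds that did update.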

## Security Considerations

1. **HTML Sanitization**: Use bleach to clean feed content
2. **URL Validation**: Strict validation of feed URLs
3. **Git Security**: No credentials stored in repository
4. **Path Traversal**: Careful sanitization of filenames

## Future Enhancements

1. **Web Interface**: Optional web UI for browsing the git store
2. **Webhooks**: Notify external services on feed updates
3. **Feed Discovery**: Auto-discover feeds from HTML pages
4. **Export Formats**: Generate static sites, OPML exports
5. **Federation**: P2P sync between thicket instances

## Requirements Clarification

**✓ Resolved Requirements:**
1. **Feed Update Frequency**: Designed for cron usage - no built-in scheduling needed
2. **Duplicate Handling**: Manual curation via `duplicates.json` file with CLI commands
3. **Git Branching**: Single main branch for all users and entries
4. **Authentication**: No feeds require authentication currently
5. **Content Storage**: Store complete Atom entry body content as provided
6. **Deleted Entries**: Preserve all entries in Git store permanently (historical archive)
7. **History Depth**: Git store maintains full history beyond feed depth limits
8. **Feed Auto-Discovery**: Extract user metadata from feed during `add user` command

## Duplicate Entry Management

### Duplicate Detection Strategy
- **Manual Curation**: Duplicates identified and managed manually via CLI
- **Storage**: `duplicates.json` file in Git root maps entry IDs to canonical entries
- **Structure**: `{"duplicate_id": "canonical_id", ...}`
- **CLI Commands**: Add/remove duplicate mappings with validation
- **Query Resolution**: Search/list commands resolve duplicates to canonical entries

### Duplicate File Format
```json
{
  "https://example.com/feed/entry/123": "https://canonical.com/posts/same-post",
  "https://mirror.com/articles/456": "https://canonical.com/posts/same-post",
  "comment": "Entry IDs that map to the same canonical content"
}
```
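
Reading this flat file and resolving an ID to its canonical entry could look like the sketch below. It assumes the `comment` key is documentation rather than an entry ID; `load_duplicates` and `resolve` are hypothetical helper names.

```python
import json


def load_duplicates(text: str) -> dict[str, str]:
    """Parse duplicates.json text into a duplicate_id -> canonical_id map."""
    raw = json.loads(text)
    # Assumption: "comment" is documentation, not an entry ID.
    return {k: v for k, v in raw.items() if k != "comment"}


def resolve(entry_id: str, duplicates: dict[str, str]) -> str:
    """Return the canonical ID, or the ID itself if it is not a duplicate."""
    return duplicates.get(entry_id, entry_id)
```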

## Feed Metadata Auto-Discovery

### Extraction Strategy
When adding a new user with `thicket add user`, the system fetches and parses the feed to extract:

- **Display Name**: From `feed.title` or `feed.author.name`
- **Email**: From `feed.author.email` or `feed.managingEditor`
- **Homepage**: From `feed.link` or `feed.author.uri`
- **Icon**: From `feed.logo`, `feed.icon`, or `feed.image.url`

### Discovery Priority Order
1. **Author Information**: Prefer `feed.author.*` fields (more specific to person)
2. **Feed-Level**: Fall back to feed-level metadata
3. **Manual Override**: CLI flags always take precedence over discovered values
4. **Update Behavior**: Auto-discovery only runs during initial `add user`, not on sync
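
The priority order amounts to "first value that is set wins". A small sketch for the homepage field, with `pick` and `resolve_homepage` as hypothetical helpers:

```python
from typing import Optional


def pick(*candidates: Optional[str]) -> Optional[str]:
    """Return the first candidate that is set, or None."""
    return next((c for c in candidates if c), None)


def resolve_homepage(
    cli_value: Optional[str],
    author_uri: Optional[str],
    feed_link: Optional[str],
) -> Optional[str]:
    # Order encodes the priority: CLI flag, then author field, then feed level.
    return pick(cli_value, author_uri, feed_link)
```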

### Extracted Metadata Format
```python
from typing import Optional

from pydantic import BaseModel, EmailStr, HttpUrl

# UserConfig is the configuration model defined under "Data Models".


class FeedMetadata(BaseModel):
    title: Optional[str] = None
    author_name: Optional[str] = None
    author_email: Optional[EmailStr] = None
    author_uri: Optional[HttpUrl] = None
    link: Optional[HttpUrl] = None
    logo: Optional[HttpUrl] = None
    icon: Optional[HttpUrl] = None
    image_url: Optional[HttpUrl] = None

    def to_user_config(self, username: str, feed_url: HttpUrl) -> UserConfig:
        """Convert discovered metadata to UserConfig with fallbacks"""
        return UserConfig(
            username=username,
            feeds=[feed_url],
            display_name=self.author_name or self.title,
            email=self.author_email,
            homepage=self.author_uri or self.link,
            icon=self.logo or self.icon or self.image_url,
        )
```

## Link Processing and Threading Architecture

### Overview
Thicket processes links and builds threads to create email-style threaded views of blog entries by tracking cross-references between different blogs.

### Link Processing Pipeline

#### 1. Link Extraction (`thicket links`)
The `links` command extracts all outbound links from blog entries and categorizes them:

```python
from typing import Optional

from pydantic import BaseModel


class LinkData(BaseModel):
    url: str  # Fully resolved URL
    entry_id: str  # Source entry ID
    username: str  # Source username
    context: str  # Surrounding text context
    category: str  # "internal", "user", or "unknown"
    target_username: Optional[str] = None  # Target user if applicable
```

**Link Categories:**
- **Internal**: Links to the same user's domain (self-references)
- **User**: Links to other tracked users' domains
- **Unknown**: Links to external sites not tracked by thicket
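
The three categories can be decided from a domain-to-username lookup. The sketch below assumes exact hostname matching with no subdomain handling; `categorize` and `domain_to_user` are illustrative names.

```python
from typing import Optional
from urllib.parse import urlsplit


def categorize(
    url: str, source_username: str, domain_to_user: dict[str, str]
) -> tuple[str, Optional[str]]:
    """Return (category, target_username) for one outbound link."""
    host = (urlsplit(url).hostname or "").lower()
    owner = domain_to_user.get(host)
    if owner is None:
        return "unknown", None
    if owner == source_username:
        return "internal", owner
    return "user", owner
```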

#### 2. URL Resolution
All links are resolved against the Atom feed's base URL to handle:
- Relative URLs (converted to absolute)
- Protocol-relative URLs
- Fragment identifiers
- Redirects and canonical URLs
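
The first three cases can be handled with the standard library alone, as in the sketch below (`resolve_link` is a hypothetical helper, and dropping fragments by default is an assumption). Following redirects and reading canonical URLs would need network requests and is left out.

```python
from urllib.parse import urldefrag, urljoin


def resolve_link(base_url: str, href: str, keep_fragment: bool = False) -> str:
    """Resolve href against base_url into an absolute URL."""
    absolute = urljoin(base_url, href)
    if keep_fragment:
        return absolute
    # urldefrag returns (url_without_fragment, fragment).
    return urldefrag(absolute).url
```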

#### 3. Domain Mapping
The system builds a comprehensive domain mapping from user configuration:
- Feed URLs → domain extraction
- Homepage URLs → domain extraction
- Reverse mapping: domain → username
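
A sketch of building the reverse mapping follows. The input shape (username to a list of feed and homepage URLs) and the rule that the first user to claim a domain keeps it are both assumptions for illustration.

```python
from urllib.parse import urlsplit


def build_domain_map(users: dict[str, list[str]]) -> dict[str, str]:
    """Map each domain to a username, given username -> list of URLs."""
    domain_to_user: dict[str, str] = {}
    for username, urls in users.items():
        for url in urls:
            host = urlsplit(url).hostname
            if host:
                # First user to claim a domain keeps it (an assumption).
                domain_to_user.setdefault(host, username)
    return domain_to_user
```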

### Threading System

#### 1. Reference Index Generation (`thicket index`)
Creates a bidirectional reference index from the categorized links:

```python
from typing import Optional

from pydantic import BaseModel


class BlogReference(BaseModel):
    source_entry_id: str
    source_username: str
    target_url: str
    target_username: Optional[str] = None
    target_entry_id: Optional[str] = None
    context: str
```

#### 2. Thread Detection Algorithm
Uses graph traversal to find connected blog entries:
- **Outbound references**: Links from an entry to other entries
- **Inbound references**: Links to an entry from other entries
- **Thread members**: All entries connected through references
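
Read as a graph problem, threads are the connected components of the reference graph. The breadth-first sketch below takes `(source_entry_id, target_entry_id)` pairs; `find_threads` is a hypothetical name and the real traversal may differ.

```python
from collections import defaultdict, deque


def find_threads(references: list[tuple[str, str]]) -> list[set[str]]:
    """Group entry IDs into threads: connected components of the link graph."""
    neighbours: dict[str, set[str]] = defaultdict(set)
    for source, target in references:
        # Treat each reference as undirected so inbound and outbound count.
        neighbours[source].add(target)
        neighbours[target].add(source)

    seen: set[str] = set()
    threads: list[set[str]] = []
    for start in list(neighbours):
        if start in seen:
            continue
        component, queue = {start}, deque([start])
        seen.add(start)
        while queue:
            for nxt in neighbours[queue.popleft()]:
                if nxt not in seen:
                    seen.add(nxt)
                    component.add(nxt)
                    queue.append(nxt)
        threads.append(component)
    return threads
```

A minimum-size filter like `--min-size 3` then reduces to `[t for t in threads if len(t) >= 3]`.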

#### 3. Threading Display (`thicket threads`)
Creates email-style threaded views:
- Chronological ordering within threads
- Reference counts (outbound/inbound)
- Context preservation
- Filtering options (user, entry, minimum size)

### Data Structures

#### links.json Format (Unified Structure)
```json
{
  "links": {
    "https://example.com/post/123": {
      "referencing_entries": ["https://blog.user.com/entry/456"],
      "target_username": "user2"
    },
    "https://external-site.com/article": {
      "referencing_entries": ["https://blog.user.com/entry/789"]
    }
  },
  "reverse_mapping": {
    "https://blog.user.com/entry/456": ["https://example.com/post/123"],
    "https://blog.user.com/entry/789": ["https://external-site.com/article"]
  },
  "references": [
    {
      "source_entry_id": "https://blog.user.com/entry/456",
      "source_username": "user1",
      "target_url": "https://example.com/post/123",
      "target_username": "user2",
      "target_entry_id": "https://example.com/post/123",
      "context": "As mentioned in this post..."
    }
  ],
  "user_domains": {
    "user1": ["blog.user.com"],
    "user2": ["example.com"]
  }
}
```

This unified structure eliminates duplication by:
- Storing each URL only once with minimal metadata
- Including all link data, reference data, and mappings in one file
- Using presence of `target_username` to identify tracked vs external links
- Providing bidirectional mappings for efficient queries
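
Given the `links` object, the `reverse_mapping` section can be derived by inverting it, and the tracked/external distinction is a key check. Both helpers below are hypothetical sketches over the JSON shape shown above.

```python
def build_reverse_mapping(links: dict[str, dict]) -> dict[str, list[str]]:
    """Invert links (URL -> referencing entries) into entry -> URLs."""
    reverse: dict[str, list[str]] = {}
    for url, info in links.items():
        for entry_id in info.get("referencing_entries", []):
            reverse.setdefault(entry_id, []).append(url)
    return reverse


def is_tracked(info: dict) -> bool:
    """A link targets a tracked user when target_username is present."""
    return "target_username" in info
```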

### Unified Structure Benefits

- **Eliminates Duplication**: Each URL appears only once with metadata
- **Single Source of Truth**: All link-related data in one file
- **Efficient Queries**: Fast lookups for both directions (URL→entries, entry→URLs)
- **Atomic Updates**: All link data changes together
- **Reduced I/O**: Fewer file operations

### Implementation Benefits

1. **Systematic Link Processing**: All links are extracted and categorized consistently
2. **Proper URL Resolution**: Handles relative URLs and base URL resolution correctly
3. **Domain-based Categorization**: Automatically identifies user-to-user references
4. **Bidirectional Indexing**: Supports both "who links to whom" and "who is linked by whom"
5. **Thread Discovery**: Finds conversation threads automatically
6. **Rich Context**: Preserves surrounding text for each link
7. **Performance**: Pre-computed indexes for fast threading queries

### CLI Commands

```bash
# Extract and categorize all links
thicket links --verbose

# Build reference index for threading
thicket index --verbose

# Show all conversation threads
thicket threads

# Show threads for specific user
thicket threads --username user1

# Show threads with minimum size
thicket threads --min-size 3
```

### Integration with Existing Commands

The link processing system integrates seamlessly with existing thicket commands:
- `thicket sync` updates entries, requiring `thicket links` to be run afterward
- `thicket index` uses the output from `thicket links` for improved accuracy
- `thicket threads` provides the user-facing threading interface

## Current Implementation Status

### ✅ Completed Features
1. **Core Infrastructure**
   - Modern CLI with Typer and Rich
   - Pydantic data models for type safety
   - Git repository operations with GitPython
   - Feed parsing and normalization with feedparser

2. **User and Feed Management**
   - `thicket init` - Initialize git store
   - `thicket add` - Add users and feeds with auto-discovery
   - `thicket sync` - Sync feeds with progress tracking
   - `thicket list` - List users, feeds, and entries
   - `thicket duplicates` - Manage duplicate entries

3. **Link Processing and Threading**
   - `thicket links` - Extract and categorize all outbound links
   - `thicket index` - Build reference index from links
   - `thicket threads` - Display threaded conversation views
   - Proper URL resolution with base URL handling
   - Domain-based link categorization
   - Context preservation for links

### 📊 System Performance
- **Link Extraction**: Successfully processes thousands of blog entries
- **Categorization**: Identifies internal, user, and unknown links
- **Threading**: Creates email-style threaded views of conversations
- **Storage**: Efficient JSON-based data structures for links and references

### 🔧 Current Architecture Highlights
- **Modular Design**: Clear separation between CLI, core logic, and models
- **Type Safety**: Comprehensive Pydantic models for data validation
- **Rich CLI**: Beautiful progress bars, tables, and error handling
- **Extensible**: Easy to add new commands and features
- **Git Integration**: All data stored in version-controlled JSON files

### 🎯 Proven Functionality
The system has been tested with real blog data and successfully:
- Extracted 14,396 total links from blog entries
- Categorized 3,994 internal links, 363 user-to-user links, and 10,039 unknown links
- Built comprehensive domain mappings for 16 users across 20 domains
- Generated threaded views showing blog conversation patterns

### 🚀 Ready for Use
The thicket system is now fully functional for:
- Maintaining Git repositories of blog feeds
- Tracking cross-references between blogs
- Creating threaded views of blog conversations
- Discovering blog interaction patterns
- Building distributed comment systems