# Thicket Architecture Design

## Overview
Thicket is a modern CLI tool for persisting Atom/RSS feeds in a Git repository, designed to enable distributed weblog comment structures.

## Technology Stack

### Core Libraries

#### CLI Framework
- **Typer** (0.15.x) - Modern CLI framework with type hints
- **Rich** (13.x) - Beautiful terminal output, progress bars, and tables
- **prompt-toolkit** - Interactive prompts when needed

#### Feed Processing
- **feedparser** (6.0.11) - Universal feed parser supporting RSS 0.9x, RSS 1.0, RSS 2.0, CDF, Atom 0.3, and Atom 1.0
  - Alternative: **atoma** for stricter Atom/RSS parsing with JSON feed support
  - Alternative: **fastfeedparser** for high-performance parsing (10x faster)

#### Git Integration
- **GitPython** (3.1.44) - High-level git operations, requires git CLI
  - Alternative: **pygit2** (1.18.0) - Direct libgit2 bindings, better for authentication

#### HTTP Client
- **httpx** (0.28.x) - Modern async/sync HTTP client with connection pooling
- **aiohttp** (3.11.x) - For async-only operations if needed

#### Configuration & Data Models
- **pydantic** (2.11.x) - Data validation and settings management
- **pydantic-settings** (2.10.x) - Configuration file handling with env var support

#### Utilities
- **pendulum** (3.x) - Better datetime handling
- **bleach** (6.x) - HTML sanitization for feed content
- **platformdirs** (4.x) - Cross-platform directory paths

## Project Structure

```
thicket/
├── pyproject.toml          # Modern Python packaging
├── README.md               # Project documentation
├── ARCH.md                 # This file
├── CLAUDE.md               # Project instructions
├── .gitignore
├── src/
│   └── thicket/
│       ├── __init__.py
│       ├── __main__.py     # Entry point for `python -m thicket`
│       ├── cli/            # CLI commands and interface
│       │   ├── __init__.py
│       │   ├── main.py     # Main CLI app with Typer
│       │   ├── commands/   # Subcommands
│       │   │   ├── __init__.py
│       │   │   ├── init.py      # Initialize git store
│       │   │   ├── add.py       # Add users and feeds
│       │   │   ├── sync.py      # Sync feeds
│       │   │   ├── list_cmd.py  # List users/feeds
│       │   │   ├── duplicates.py # Manage duplicate entries
│       │   │   ├── links_cmd.py  # Extract and categorize links
│       │   │   └── index_cmd.py  # Build reference index and show threads
│       │   └── utils.py    # CLI utilities (progress, formatting)
│       ├── core/           # Core business logic
│       │   ├── __init__.py
│       │   ├── feed_parser.py   # Feed parsing and normalization
│       │   ├── git_store.py     # Git repository operations
│       │   └── reference_parser.py # Link extraction and threading
│       ├── models/         # Pydantic data models
│       │   ├── __init__.py
│       │   ├── config.py        # Configuration models
│       │   ├── feed.py          # Feed/Entry models
│       │   └── user.py          # User metadata models
│       └── utils/          # Shared utilities
│           └── __init__.py
├── tests/
│   ├── __init__.py
│   ├── conftest.py         # pytest configuration
│   ├── test_feed_parser.py
│   ├── test_git_store.py
│   └── fixtures/           # Test data
│       └── feeds/
└── docs/
    └── examples/           # Example configurations
```

## Data Models

### Configuration File (YAML/TOML)
```python
from pathlib import Path
from typing import Optional

from pydantic import BaseModel, EmailStr, HttpUrl
from pydantic_settings import BaseSettings, SettingsConfigDict


# Defined before ThicketConfig, which references it in an annotation.
class UserConfig(BaseModel):
    username: str
    feeds: list[HttpUrl]
    email: Optional[EmailStr] = None
    homepage: Optional[HttpUrl] = None
    icon: Optional[HttpUrl] = None
    display_name: Optional[str] = None


class ThicketConfig(BaseSettings):
    git_store: Path  # Git repository location
    cache_dir: Path  # Cache directory
    users: list[UserConfig]

    # Note: pydantic-settings only reads yaml_file once a
    # YamlConfigSettingsSource is enabled via settings_customise_sources.
    model_config = SettingsConfigDict(
        env_prefix="THICKET_",
        env_file=".env",
        yaml_file="thicket.yaml",
    )
```

### Feed Storage Format
```python
from datetime import datetime
from typing import Optional

from pydantic import BaseModel, ConfigDict, HttpUrl


class AtomEntry(BaseModel):
    id: str  # Original Atom ID
    title: str
    link: HttpUrl
    updated: datetime
    # In Pydantic v2 an Optional field without a default is still required,
    # so the optional Atom fields default to None explicitly.
    published: Optional[datetime] = None
    summary: Optional[str] = None
    content: Optional[str] = None  # Full body content from Atom entry
    content_type: Optional[str] = "html"  # text, html, xhtml
    author: Optional[dict] = None
    categories: list[str] = []
    rights: Optional[str] = None  # Copyright info
    source: Optional[str] = None  # Source feed URL
    # Additional Atom fields preserved during RSS->Atom conversion

    # json_encoders is deprecated in Pydantic v2, which already serializes
    # datetime values as ISO 8601 strings by default.
    model_config = ConfigDict(
        json_encoders={
            datetime: lambda v: v.isoformat()
        }
    )


class DuplicateMap(BaseModel):
    """Maps duplicate entry IDs to canonical entry IDs"""
    duplicates: dict[str, str] = {}  # duplicate_id -> canonical_id
    comment: str = "Entry IDs that map to the same canonical content"

    def add_duplicate(self, duplicate_id: str, canonical_id: str) -> None:
        """Add a duplicate mapping"""
        self.duplicates[duplicate_id] = canonical_id

    def remove_duplicate(self, duplicate_id: str) -> bool:
        """Remove a duplicate mapping. Returns True if existed."""
        return self.duplicates.pop(duplicate_id, None) is not None

    def get_canonical(self, entry_id: str) -> str:
        """Get canonical ID for an entry (returns original if not duplicate)"""
        return self.duplicates.get(entry_id, entry_id)

    def is_duplicate(self, entry_id: str) -> bool:
        """Check if entry ID is marked as duplicate"""
        return entry_id in self.duplicates
```

## Git Repository Structure
```
git-store/
├── index.json              # User directory index
├── duplicates.json         # Manual curation of duplicate entries
├── links.json              # Unified links, references, and mapping data
├── user1/
│   ├── entry_id_1.json     # Sanitized entry files
│   ├── entry_id_2.json
│   └── ...
└── user2/
    └── ...
```

## Key Design Decisions

### 1. Feed Normalization & Auto-Discovery
- All RSS feeds converted to Atom format before storage
- Preserves maximum metadata during conversion
- Sanitizes HTML content to prevent XSS
- **Auto-discovery**: Extracts user metadata from feed during `add user` command

### 2. ID Sanitization
- Consistent algorithm to convert Atom IDs to safe filenames
- Handles edge cases (very long IDs, special characters)
- Maintains reversibility where possible
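
One way such an algorithm could look is sketched below. The function name `sanitize_entry_id`, the 200-character limit, and the hash-suffix scheme are assumptions made for illustration, not thicket's actual implementation.

```python
import hashlib
import re

MAX_LEN = 200  # assumed limit, chosen to stay below common 255-byte filename caps


def sanitize_entry_id(entry_id: str) -> str:
    """Map an Atom ID to a filesystem-safe name (hypothetical helper)."""
    # Replace every character outside a conservative whitelist.
    safe = re.sub(r"[^A-Za-z0-9._-]", "_", entry_id)
    # Avoid names made only of dots/underscores, such as "." or "..".
    safe = safe.strip("._") or "entry"
    if len(safe) > MAX_LEN:
        # Truncate long IDs and append a short digest of the full ID so
        # that two long IDs sharing a prefix still get distinct names.
        digest = hashlib.sha256(entry_id.encode("utf-8")).hexdigest()[:12]
        safe = f"{safe[:MAX_LEN - 13]}-{digest}"
    return safe
```

A mapping like this is lossy, so the original ID would have to be recovered from the `id` field stored inside each entry file rather than from the filename.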

### 3. Git Operations
- Uses GitPython for simplicity (no authentication required)
- Single main branch for all users and entries
- Atomic commits per sync operation
- Meaningful commit messages with feed update summaries
- Preserves complete history - never delete entries even if they disappear from feeds

### 4. Caching Strategy
- HTTP caching with Last-Modified/ETag support
- Local cache of parsed feeds with TTL
- Cache invalidation on configuration changes
- Git store serves as permanent historical archive beyond feed depth limits
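
The TTL check and the conditional-request headers could be combined roughly as follows. The `CachedFeed` record and both helper names are hypothetical; only the `If-None-Match` and `If-Modified-Since` header names come from HTTP itself.

```python
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class CachedFeed:
    """Hypothetical cache record for one feed URL."""
    etag: Optional[str]
    last_modified: Optional[str]
    fetched_at: float  # UNIX timestamp of the last successful fetch


def is_fresh(entry: CachedFeed, ttl_seconds: float, now: Optional[float] = None) -> bool:
    """True while the cached copy is younger than the TTL (no request needed)."""
    now = time.time() if now is None else now
    return (now - entry.fetched_at) < ttl_seconds


def conditional_headers(entry: CachedFeed) -> dict[str, str]:
    """Build validators so the server can answer 304 Not Modified."""
    headers: dict[str, str] = {}
    if entry.etag:
        headers["If-None-Match"] = entry.etag
    if entry.last_modified:
        headers["If-Modified-Since"] = entry.last_modified
    return headers
```

When the TTL has expired, the headers would be passed to the HTTP client's GET call; a 304 response means the previously parsed feed can be reused without re-parsing.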

### 5. Error Handling
- Graceful handling of feed parsing errors
- Retry logic for network failures
- Clear error messages with recovery suggestions
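
The document does not specify the retry policy. A minimal sketch, assuming exponential backoff and a hypothetical `retry` helper, might look like this:

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry(
    operation: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 1.0,
    retry_on: tuple[type[Exception], ...] = (OSError,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call operation(), retrying transient failures with exponential backoff."""
    for attempt in range(attempts):
        try:
            return operation()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error to the caller
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
    raise ValueError("attempts must be at least 1")
```

The `sleep` parameter is injectable so that tests can record delays instead of waiting. Which exception types count as transient is an assumption here and would depend on the HTTP client in use.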

## CLI Command Structure

```bash
# Initialize a new git store
thicket init /path/to/store

# Add a user with feeds (auto-discovers metadata from feed)
thicket add user "alyssa" \
  --feed "https://example.com/feed.atom"
  # Auto-populates: email, homepage, icon, display_name from feed metadata

# Add a user with manual overrides
thicket add user "alyssa" \
  --feed "https://example.com/feed.atom" \
  --email "alyssa@example.com" \
  --homepage "https://alyssa.example.com" \
  --icon "https://example.com/avatar.png" \
  --display-name "Alyssa P. Hacker"

# Add additional feed to existing user
thicket add feed "alyssa" "https://example.com/other-feed.rss"

# Sync all feeds (designed for cron usage)
thicket sync --all

# Sync specific user
thicket sync --user alyssa

# List users and their feeds
thicket list users
thicket list feeds --user alyssa

# Manage duplicate entries
thicket duplicates list
thicket duplicates add <entry_id_1> <entry_id_2>  # Mark as duplicates
thicket duplicates remove <entry_id_1> <entry_id_2>  # Unmark duplicates

# Link processing and threading
thicket links --verbose                 # Extract and categorize all links
thicket index --verbose                 # Build reference index for threading
thicket threads                         # Show conversation threads
thicket threads --username user1        # Show threads for specific user
thicket threads --min-size 3            # Show threads with minimum size
```

## Performance Considerations

1. **Concurrent Feed Fetching**: Use httpx with asyncio for parallel downloads
2. **Incremental Updates**: Only fetch/parse feeds that have changed
3. **Efficient Git Operations**: Batch commits, use shallow clones where appropriate
4. **Progress Feedback**: Rich progress bars for long operations
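
The concurrent fetching pattern can be sketched with the standard library alone. Here `fetch_all` and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for a real asynchronous HTTP call so the example runs without network access.

```python
import asyncio
from typing import Awaitable, Callable, Union


async def fetch_all(
    urls: list[str],
    fetch: Callable[[str], Awaitable[bytes]],
    max_concurrency: int = 8,
) -> dict[str, Union[bytes, BaseException]]:
    """Fetch every URL concurrently; one failing feed does not abort the rest."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(url: str) -> bytes:
        async with semaphore:  # cap the number of in-flight requests
            return await fetch(url)

    # return_exceptions=True stores failures in the result list
    # instead of cancelling the remaining downloads.
    results = await asyncio.gather(
        *(guarded(url) for url in urls), return_exceptions=True
    )
    return dict(zip(urls, results))


async def fake_fetch(url: str) -> bytes:
    """Stand-in for a real HTTP request (assumption for this sketch)."""
    await asyncio.sleep(0)
    if "broken" in url:
        raise ConnectionError(url)
    return url.encode()
```

Running `asyncio.run(fetch_all(urls, fake_fetch))` yields a dict in which each URL maps either to its body or to the exception it raised, which lets a sync run report per-feed failures at the end.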

## Security Considerations

1. **HTML Sanitization**: Use bleach to clean feed content
2. **URL Validation**: Strict validation of feed URLs
3. **Git Security**: No credentials stored in repository
4. **Path Traversal**: Careful sanitization of filenames
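
As a second line of defense behind filename sanitization, a path containment check could look like the sketch below. The helper name `entry_path` and its signature are assumptions for illustration.

```python
from pathlib import Path


def entry_path(store_root: Path, username: str, filename: str) -> Path:
    """Return the entry's path, refusing anything that escapes the user's directory."""
    root = store_root.resolve()
    user_dir = (root / username).resolve()
    candidate = (user_dir / filename).resolve()
    # resolve() collapses ".." segments and symlinks, so a hostile name such
    # as "../../etc/passwd" ends up outside the expected directory.
    if root not in user_dir.parents or user_dir not in candidate.parents:
        raise ValueError(f"path escapes git store: {candidate}")
    return candidate
```

Checking against the user directory, not just the store root, also rejects a filename that would land in another user's directory.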

## Future Enhancements

1. **Web Interface**: Optional web UI for browsing the git store
2. **Webhooks**: Notify external services on feed updates
3. **Feed Discovery**: Auto-discover feeds from HTML pages
4. **Export Formats**: Generate static sites, OPML exports
5. **Federation**: P2P sync between thicket instances

## Requirements Clarification

**✓ Resolved Requirements:**
1. **Feed Update Frequency**: Designed for cron usage - no built-in scheduling needed
2. **Duplicate Handling**: Manual curation via `duplicates.json` file with CLI commands
3. **Git Branching**: Single main branch for all users and entries
4. **Authentication**: No feeds require authentication currently
5. **Content Storage**: Store complete Atom entry body content as provided
6. **Deleted Entries**: Preserve all entries in Git store permanently (historical archive)
7. **History Depth**: Git store maintains full history beyond feed depth limits
8. **Feed Auto-Discovery**: Extract user metadata from feed during `add user` command

## Duplicate Entry Management

### Duplicate Detection Strategy
- **Manual Curation**: Duplicates identified and managed manually via CLI
- **Storage**: `duplicates.json` file in Git root maps entry IDs to canonical entries
- **Structure**: `{"duplicate_id": "canonical_id", ...}`
- **CLI Commands**: Add/remove duplicate mappings with validation
- **Query Resolution**: Search/list commands resolve duplicates to canonical entries

### Duplicate File Format
```json
{
  "https://example.com/feed/entry/123": "https://canonical.com/posts/same-post",
  "https://mirror.com/articles/456": "https://canonical.com/posts/same-post",
  "comment": "Entry IDs that map to the same canonical content"
}
```
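
Query resolution against such a mapping might be sketched as follows. The function name `resolve_canonical` is hypothetical, and `duplicates` is assumed to hold only the ID-to-ID pairs, with the `comment` key dropped on load.

```python
def resolve_canonical(entry_ids: list[str], duplicates: dict[str, str]) -> list[str]:
    """Replace duplicate IDs by their canonical ID, keeping first-seen order."""
    seen: set[str] = set()
    resolved: list[str] = []
    for entry_id in entry_ids:
        # IDs without a mapping are their own canonical entry.
        canonical = duplicates.get(entry_id, entry_id)
        if canonical not in seen:
            seen.add(canonical)
            resolved.append(canonical)
    return resolved
```

With the example file above, a listing that contains both the `example.com` and `mirror.com` IDs would collapse to the single `canonical.com` entry.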

## Feed Metadata Auto-Discovery

### Extraction Strategy
When adding a new user with `thicket add user`, the system fetches and parses the feed to extract:

- **Display Name**: From `feed.author.name`, falling back to `feed.title`
- **Email**: From `feed.author.email` or `feed.managingEditor`
- **Homepage**: From `feed.author.uri`, falling back to `feed.link`
- **Icon**: From `feed.logo`, `feed.icon`, or `feed.image.url`

### Discovery Priority Order
1. **Author Information**: Prefer `feed.author.*` fields (more specific to person)
2. **Feed-Level**: Fall back to feed-level metadata
3. **Manual Override**: CLI flags always take precedence over discovered values
4. **Update Behavior**: Auto-discovery only runs during initial `add user`, not on sync

### Extracted Metadata Format
```python
from typing import Optional

from pydantic import BaseModel, EmailStr, HttpUrl

# UserConfig is the configuration model defined under "Data Models".


class FeedMetadata(BaseModel):
    title: Optional[str] = None
    author_name: Optional[str] = None
    author_email: Optional[EmailStr] = None
    author_uri: Optional[HttpUrl] = None
    link: Optional[HttpUrl] = None
    logo: Optional[HttpUrl] = None
    icon: Optional[HttpUrl] = None
    image_url: Optional[HttpUrl] = None

    def to_user_config(self, username: str, feed_url: HttpUrl) -> UserConfig:
        """Convert discovered metadata to UserConfig with fallbacks"""
        return UserConfig(
            username=username,
            feeds=[feed_url],
            display_name=self.author_name or self.title,
            email=self.author_email,
            homepage=self.author_uri or self.link,
            icon=self.logo or self.icon or self.image_url
        )
```
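
The rule that CLI flags take precedence over discovered values could be applied as in this sketch. It uses plain dictionaries and a hypothetical `merge_metadata` helper to stay self-contained, and it assumes an unset flag arrives as `None`.

```python
from typing import Optional


def merge_metadata(
    discovered: dict[str, Optional[str]],
    overrides: dict[str, Optional[str]],
) -> dict[str, Optional[str]]:
    """Overlay CLI flags on discovered values; unset flags (None) are ignored."""
    merged = dict(discovered)
    merged.update(
        {key: value for key, value in overrides.items() if value is not None}
    )
    return merged
```

For example, passing only `--email` on the command line would replace the discovered email while leaving the discovered homepage and display name untouched.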

## Link Processing and Threading Architecture

### Overview
Thicket tracks cross-references between blogs and uses them to build email-style threaded views of blog entries.

### Link Processing Pipeline

#### 1. Link Extraction (`thicket links`)
The `links` command extracts all outbound links from blog entries and categorizes them:

```python
from typing import Optional

from pydantic import BaseModel


class LinkData(BaseModel):
    url: str                    # Fully resolved URL
    entry_id: str               # Source entry ID
    username: str               # Source username
    context: str                # Surrounding text context
    category: str               # "internal", "user", or "unknown"
    target_username: Optional[str] = None  # Target user if applicable
```

**Link Categories:**
- **Internal**: Links to the same user's domain (self-references)
- **User**: Links to other tracked users' domains
- **Unknown**: Links to external sites not tracked by thicket

#### 2. URL Resolution
All links are resolved against the Atom feed's base URL. Resolution handles:
- Relative URLs (converted to absolute)
- Protocol-relative URLs
- Fragment identifiers
- Redirects and canonical URLs
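
The purely syntactic part of this step can be sketched with `urllib.parse`. The helper name `resolve_link` is hypothetical, and dropping the fragment is one possible policy, assumed here for illustration; following redirects needs network access and is not covered.

```python
from urllib.parse import urldefrag, urljoin


def resolve_link(base_url: str, href: str) -> str:
    """Resolve href against base_url and drop any #fragment."""
    # urljoin handles relative paths and protocol-relative ("//host/...") links.
    absolute = urljoin(base_url, href.strip())
    url, _fragment = urldefrag(absolute)
    return url
```

For instance, `resolve_link("https://blog.example.com/posts/2024/feed.xml", "../about#me")` returns `https://blog.example.com/posts/about`.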

#### 3. Domain Mapping
The system builds a comprehensive domain mapping from user configuration:
- Feed URLs → domain extraction
- Homepage URLs → domain extraction
- Reverse mapping: domain → username
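
A sketch of how the reverse mapping could drive categorization is shown below. The function names and the exact-hostname matching rule are assumptions; the real matching may be looser (for example, treating subdomains as the same site).

```python
from typing import Optional
from urllib.parse import urlparse


def build_domain_map(user_urls: dict[str, list[str]]) -> dict[str, str]:
    """Invert {username: [feed/homepage URLs]} into {domain: username}."""
    domain_to_user: dict[str, str] = {}
    for username, urls in user_urls.items():
        for url in urls:
            host = urlparse(url).hostname
            if host:
                domain_to_user[host] = username
    return domain_to_user


def categorize(
    url: str, source_user: str, domain_to_user: dict[str, str]
) -> tuple[str, Optional[str]]:
    """Return (category, target_username) for a link in source_user's entry."""
    host = urlparse(url).hostname or ""
    target = domain_to_user.get(host)
    if target is None:
        return "unknown", None  # domain not owned by any tracked user
    if target == source_user:
        return "internal", target  # self-reference
    return "user", target  # link to another tracked user
```

The three return values correspond to the Internal, User, and Unknown categories described above.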

### Threading System

#### 1. Reference Index Generation (`thicket index`)
Creates a bidirectional reference index from the categorized links:

```python
from typing import Optional

from pydantic import BaseModel


class BlogReference(BaseModel):
    source_entry_id: str
    source_username: str
    target_url: str
    target_username: Optional[str] = None
    target_entry_id: Optional[str] = None
    context: str
```

#### 2. Thread Detection Algorithm
Uses graph traversal to find connected blog entries:
- **Outbound references**: Links from an entry to other entries
- **Inbound references**: Links to an entry from other entries
- **Thread members**: All entries connected through references
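
The document does not name the traversal used. One plausible reading, sketched here with a hypothetical `find_threads` function, treats each thread as a connected component of the reference graph and finds it by breadth-first search.

```python
from collections import defaultdict, deque


def find_threads(references: list[tuple[str, str]]) -> list[set[str]]:
    """Group entry IDs into threads: connected components of the reference graph."""
    neighbours: dict[str, set[str]] = defaultdict(set)
    for source, target in references:
        # Treat each reference as undirected so that inbound and outbound
        # links both pull entries into the same thread.
        neighbours[source].add(target)
        neighbours[target].add(source)

    threads: list[set[str]] = []
    visited: set[str] = set()
    for start in neighbours:
        if start in visited:
            continue
        component: set[str] = set()
        queue = deque([start])
        visited.add(start)
        while queue:
            entry = queue.popleft()
            component.add(entry)
            for other in neighbours[entry] - visited:
                visited.add(other)
                queue.append(other)
        threads.append(component)
    return threads
```

Each `(source, target)` pair would come from a reference whose target entry is known. A filter such as `[t for t in threads if len(t) >= 3]` mirrors what `--min-size 3` expresses on the command line.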

#### 3. Threading Display (`thicket threads`)
Creates email-style threaded views:
- Chronological ordering within threads
- Reference counts (outbound/inbound)
- Context preservation
- Filtering options (user, entry, minimum size)

### Data Structures

#### links.json Format (Unified Structure)
```json
{
  "links": {
    "https://example.com/post/123": {
      "referencing_entries": ["https://blog.user.com/entry/456"],
      "target_username": "user2"
    },
    "https://external-site.com/article": {
      "referencing_entries": ["https://blog.user.com/entry/789"]
    }
  },
  "reverse_mapping": {
    "https://blog.user.com/entry/456": ["https://example.com/post/123"],
    "https://blog.user.com/entry/789": ["https://external-site.com/article"]
  },
  "references": [
    {
      "source_entry_id": "https://blog.user.com/entry/456",
      "source_username": "user1",
      "target_url": "https://example.com/post/123",
      "target_username": "user2",
      "target_entry_id": "https://example.com/post/123",
      "context": "As mentioned in this post..."
    }
  ],
  "user_domains": {
    "user1": ["blog.user.com"],
    "user2": ["example.com"]
  }
}
```

This unified structure eliminates duplication by:
- Storing each URL only once with minimal metadata
- Including all link data, reference data, and mappings in one file
- Using presence of `target_username` to identify tracked vs external links
- Providing bidirectional mappings for efficient queries
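
To make the relationship between the `links` and `reverse_mapping` sections concrete, the sketch below derives one from the other and uses the presence of `target_username` to separate tracked from external links. Both function names are hypothetical.

```python
from collections import defaultdict


def build_reverse_mapping(links: dict[str, dict]) -> dict[str, list[str]]:
    """Derive entry -> [URLs] from the url -> referencing_entries table."""
    reverse: dict[str, list[str]] = defaultdict(list)
    for url, info in links.items():
        for entry_id in info.get("referencing_entries", []):
            reverse[entry_id].append(url)
    return dict(reverse)


def tracked_urls(links: dict[str, dict]) -> set[str]:
    """URLs pointing at a tracked user, identified by a target_username key."""
    return {url for url, info in links.items() if "target_username" in info}
```

Applied to the `links` object in the example file, `build_reverse_mapping` reproduces its `reverse_mapping` section, and `tracked_urls` returns only `https://example.com/post/123`.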

### Unified Structure Benefits

- **Eliminates Duplication**: Each URL appears only once with metadata
- **Single Source of Truth**: All link-related data in one file
- **Efficient Queries**: Fast lookups for both directions (URL→entries, entry→URLs)
- **Atomic Updates**: All link data changes together
- **Reduced I/O**: Fewer file operations

### Implementation Benefits

1. **Systematic Link Processing**: All links are extracted and categorized consistently
2. **Proper URL Resolution**: Handles relative URLs and base URL resolution correctly
3. **Domain-based Categorization**: Automatically identifies user-to-user references
4. **Bidirectional Indexing**: Supports both "who links to whom" and "who is linked by whom"
5. **Thread Discovery**: Finds conversation threads automatically
6. **Rich Context**: Preserves surrounding text for each link
7. **Performance**: Pre-computed indexes for fast threading queries

### CLI Commands

```bash
# Extract and categorize all links
thicket links --verbose

# Build reference index for threading
thicket index --verbose

# Show all conversation threads
thicket threads

# Show threads for specific user
thicket threads --username user1

# Show threads with minimum size
thicket threads --min-size 3
```

### Integration with Existing Commands

The link processing system integrates seamlessly with existing thicket commands:
- `thicket sync` updates entries, requiring `thicket links` to be run afterward
- `thicket index` uses the output from `thicket links` for improved accuracy
- `thicket threads` provides the user-facing threading interface

## Current Implementation Status

### ✅ Completed Features
1. **Core Infrastructure**
   - Modern CLI with Typer and Rich
   - Pydantic data models for type safety
   - Git repository operations with GitPython
   - Feed parsing and normalization with feedparser

2. **User and Feed Management**
   - `thicket init` - Initialize git store
   - `thicket add` - Add users and feeds with auto-discovery
   - `thicket sync` - Sync feeds with progress tracking
   - `thicket list` - List users, feeds, and entries
   - `thicket duplicates` - Manage duplicate entries

3. **Link Processing and Threading**
   - `thicket links` - Extract and categorize all outbound links
   - `thicket index` - Build reference index from links
   - `thicket threads` - Display threaded conversation views
   - Proper URL resolution with base URL handling
   - Domain-based link categorization
   - Context preservation for links

### 📊 System Performance
- **Link Extraction**: Successfully processes thousands of blog entries
- **Categorization**: Identifies internal, user, and unknown links
- **Threading**: Creates email-style threaded views of conversations
- **Storage**: Efficient JSON-based data structures for links and references

### 🔧 Current Architecture Highlights
- **Modular Design**: Clear separation between CLI, core logic, and models
- **Type Safety**: Comprehensive Pydantic models for data validation
- **Rich CLI**: Beautiful progress bars, tables, and error handling
- **Extensible**: Easy to add new commands and features
- **Git Integration**: All data stored in version-controlled JSON files

### 🎯 Proven Functionality
The system has been tested with real blog data and successfully:
- Extracted 14,396 total links from blog entries
- Categorized 3,994 internal links, 363 user-to-user links, and 10,039 unknown links
- Built comprehensive domain mappings for 16 users across 20 domains
- Generated threaded views showing blog conversation patterns

### 🚀 Ready for Use
The thicket system is now fully functional for:
- Maintaining Git repositories of blog feeds
- Tracking cross-references between blogs
- Creating threaded views of blog conversations
- Discovering blog interaction patterns
- Building distributed comment systems