Manage Atom feeds in a persistent git repository

Add comprehensive architecture design for thicket CLI

- Modern Python stack: Typer + Rich + GitPython + feedparser + pydantic
- Feed auto-discovery: extracts user metadata from Atom/RSS feeds
- Duplicate management: manual curation via duplicates.json
- Git store: single branch, permanent history, sanitized filenames
- Cron-friendly: designed for scheduled execution
- Complete data models and CLI command structure

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>

Changed files: ARCH.md (+332)
+
# Thicket Architecture Design
+
+
## Overview
+
Thicket is a modern CLI tool for persisting Atom/RSS feeds in a Git repository, designed to enable distributed weblog comment structures.
+
+
## Technology Stack
+
+
### Core Libraries
+
+
#### CLI Framework
+
- **Typer** (0.15.x) - Modern CLI framework with type hints
+
- **Rich** (13.x) - Beautiful terminal output, progress bars, and tables
+
- **prompt-toolkit** - Interactive prompts when needed
+
+
#### Feed Processing
+
- **feedparser** (6.0.11) - Universal feed parser supporting RSS 0.9x, RSS 1.0, RSS 2.0, CDF, Atom 0.3, and Atom 1.0
+
- Alternative: **atoma** for stricter Atom/RSS parsing with JSON feed support
+
- Alternative: **fastfeedparser** for high-performance parsing (advertised as roughly 10x faster than feedparser)
+
+
#### Git Integration
+
- **GitPython** (3.1.44) - High-level git operations, requires git CLI
+
- Alternative: **pygit2** (1.18.0) - Direct libgit2 bindings, better for authentication
+
+
#### HTTP Client
+
- **httpx** (0.28.x) - Modern async/sync HTTP client with connection pooling
+
- **aiohttp** (3.11.x) - For async-only operations if needed
+
+
#### Configuration & Data Models
+
- **pydantic** (2.11.x) - Data validation and settings management
+
- **pydantic-settings** (2.10.x) - Configuration file handling with env var support
+
+
#### Utilities
+
- **pendulum** (3.x) - Better datetime handling
+
- **bleach** (6.x) - HTML sanitization for feed content
+
- **platformdirs** (4.x) - Cross-platform directory paths
+
+
## Project Structure
+
+
```
thicket/
├── pyproject.toml          # Modern Python packaging
├── README.md               # Project documentation
├── ARCH.md                 # This file
├── CLAUDE.md               # Project instructions
├── .gitignore
├── src/
│   └── thicket/
│       ├── __init__.py
│       ├── __main__.py     # Entry point for `python -m thicket`
│       ├── cli/            # CLI commands and interface
│       │   ├── __init__.py
│       │   ├── main.py     # Main CLI app with Typer
│       │   ├── commands/   # Subcommands
│       │   │   ├── __init__.py
│       │   │   ├── init.py     # Initialize git store
│       │   │   ├── add.py      # Add feed to config
│       │   │   ├── sync.py     # Sync feeds
│       │   │   ├── list.py     # List users/feeds
│       │   │   └── search.py   # Search entries
│       │   └── utils.py    # CLI utilities (progress, formatting)
│       ├── core/           # Core business logic
│       │   ├── __init__.py
│       │   ├── feed_parser.py  # Feed parsing and normalization
│       │   ├── git_store.py    # Git repository operations
│       │   ├── cache.py        # Cache management
│       │   └── sanitizer.py    # Filename and HTML sanitization
│       ├── models/         # Pydantic data models
│       │   ├── __init__.py
│       │   ├── config.py   # Configuration models
│       │   ├── feed.py     # Feed/Entry models
│       │   └── user.py     # User metadata models
│       └── utils/          # Shared utilities
│           ├── __init__.py
│           ├── paths.py    # Path handling
│           └── network.py  # HTTP client wrapper
├── tests/
│   ├── __init__.py
│   ├── conftest.py         # pytest configuration
│   ├── test_feed_parser.py
│   ├── test_git_store.py
│   └── fixtures/           # Test data
│       └── feeds/
└── docs/
    └── examples/           # Example configurations
```
+
+
## Data Models
+
+
### Configuration File (YAML/TOML)
+
```python
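from pathlib import Path
from typing import Optional

from pydantic import BaseModel, EmailStr, HttpUrl
from pydantic_settings import BaseSettings, SettingsConfigDict


# Defined before ThicketConfig so the `users` annotation resolves.
class UserConfig(BaseModel):
    username: str
    feeds: list[HttpUrl]
    email: Optional[EmailStr] = None
    homepage: Optional[HttpUrl] = None
    icon: Optional[HttpUrl] = None
    display_name: Optional[str] = None


class ThicketConfig(BaseSettings):
    git_store: Path  # Git repository location
    cache_dir: Path  # Cache directory
    users: list[UserConfig]

    # Note: pydantic-settings only reads yaml_file when a
    # YamlConfigSettingsSource is registered via settings_customise_sources.
    model_config = SettingsConfigDict(
        env_prefix="THICKET_",
        env_file=".env",
        yaml_file="thicket.yaml",
    )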
```
+
+
### Feed Storage Format
+
```python
from datetime import datetime
from typing import Optional

from pydantic import BaseModel, ConfigDict, HttpUrl


class AtomEntry(BaseModel):
    id: str  # Original Atom ID
    title: str
    link: HttpUrl
    updated: datetime
    # Optional fields get an explicit None default; without one,
    # pydantic v2 treats Optional[...] fields as required.
    published: Optional[datetime] = None
    summary: Optional[str] = None
    content: Optional[str] = None  # Full body content from Atom entry
    content_type: Optional[str] = "html"  # text, html, xhtml
    author: Optional[dict] = None
    categories: list[str] = []
    rights: Optional[str] = None  # Copyright info
    source: Optional[str] = None  # Source feed URL
    # Additional Atom fields preserved during RSS->Atom conversion

    # json_encoders is deprecated in pydantic v2, which already serializes
    # datetime values as ISO 8601; kept here to make the format explicit.
    model_config = ConfigDict(
        json_encoders={
            datetime: lambda v: v.isoformat()
        }
    )


class DuplicateMap(BaseModel):
    """Maps duplicate entry IDs to canonical entry IDs"""

    duplicates: dict[str, str] = {}  # duplicate_id -> canonical_id
    comment: str = "Entry IDs that map to the same canonical content"

    def add_duplicate(self, duplicate_id: str, canonical_id: str) -> None:
        """Add a duplicate mapping"""
        self.duplicates[duplicate_id] = canonical_id

    def remove_duplicate(self, duplicate_id: str) -> bool:
        """Remove a duplicate mapping. Returns True if existed."""
        return self.duplicates.pop(duplicate_id, None) is not None

    def get_canonical(self, entry_id: str) -> str:
        """Get canonical ID for an entry (returns original if not duplicate)"""
        return self.duplicates.get(entry_id, entry_id)

    def is_duplicate(self, entry_id: str) -> bool:
        """Check if entry ID is marked as duplicate"""
        return entry_id in self.duplicates
```
+
+
## Git Repository Structure
+
```
git-store/
├── index.json              # User directory index
├── duplicates.json         # Manual curation of duplicate entries
├── user1/
│   ├── metadata.json       # User metadata
│   ├── entry_id_1.json     # Sanitized entry files
│   ├── entry_id_2.json
│   └── ...
└── user2/
    └── ...
```
+
+
## Key Design Decisions
+
+
### 1. Feed Normalization & Auto-Discovery
+
- All RSS feeds converted to Atom format before storage
+
- Preserves maximum metadata during conversion
+
- Sanitizes HTML content to prevent XSS
+
- **Auto-discovery**: Extracts user metadata from feed during `add user` command
+
+
### 2. ID Sanitization
+
- Consistent algorithm to convert Atom IDs to safe filenames
+
- Handles edge cases (very long IDs, special characters)
+
- Maintains reversibility where possible
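
One way to satisfy the three points above is percent-encoding, with a hashed fallback for very long IDs. This is only a sketch: the function names, the 200-character limit, and the choice of percent-encoding are assumptions, not something the design fixes.

```python
import hashlib
from urllib.parse import quote, unquote

MAX_STEM = 200  # assumed limit, kept below the common 255-byte filename cap


def sanitize_entry_id(entry_id: str) -> str:
    """Turn an Atom ID into a filesystem-safe filename stem."""
    # Percent-encode everything except unreserved characters, so "/" and ":"
    # cannot form path components and the mapping stays reversible.
    stem = quote(entry_id, safe="")
    # A leading dot would create a hidden file or a "." / ".." component.
    if stem.startswith("."):
        stem = "%2E" + stem[1:]
    if len(stem) > MAX_STEM:
        # Too long: truncate and append a digest. This branch is not reversible.
        digest = hashlib.sha256(entry_id.encode("utf-8")).hexdigest()[:16]
        stem = f"{stem[:MAX_STEM - 17]}-{digest}"
    return stem


def restore_entry_id(stem: str) -> str:
    """Invert sanitize_entry_id for stems that were not truncated."""
    return unquote(stem)
```

Short IDs round-trip through `restore_entry_id`; truncated ones do not, which is one reading of "reversibility where possible".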
+
+
### 3. Git Operations
+
- Uses GitPython for simplicity; pygit2's stronger authentication support is not needed because no authentication is required
+
- Single main branch for all users and entries
+
- Atomic commits per sync operation
+
- Meaningful commit messages with feed update summaries
+
- Preserves complete history: entries are never deleted, even if they disappear from feeds
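
A commit message carrying a feed update summary might be assembled as below. The message layout and the helper name `build_commit_message` are hypothetical; the design only asks for one meaningful commit per sync.

```python
def build_commit_message(updates: dict[str, int]) -> str:
    """Summarize one sync run.

    `updates` maps a username to the number of new or changed entries
    written for that user during this sync (an assumed input shape).
    """
    # Skip users with nothing new so the body lists only real changes.
    changed = {user: n for user, n in sorted(updates.items()) if n > 0}
    total = sum(changed.values())
    subject = f"Sync feeds: {total} entries across {len(changed)} users"
    body = "\n".join(f"- {user}: {count}" for user, count in changed.items())
    return f"{subject}\n\n{body}" if body else subject
```

With GitPython, a string like this would typically be passed to `repo.index.commit(message)` after staging every file touched by the sync, so the whole run lands as a single commit.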
+
+
### 4. Caching Strategy
+
- HTTP caching with Last-Modified/ETag support
+
- Local cache of parsed feeds with TTL
+
- Cache invalidation on configuration changes
+
- Git store serves as permanent historical archive beyond feed depth limits
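
The TTL check and the Last-Modified/ETag validators can be sketched without committing to an HTTP client. `CacheEntry` and both helpers are hypothetical names; only the header names come from HTTP itself.

```python
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class CacheEntry:
    """Hypothetical cache record for one feed URL."""
    etag: Optional[str] = None
    last_modified: Optional[str] = None
    fetched_at: float = 0.0  # Unix timestamp of the last successful fetch


def is_fresh(entry: CacheEntry, ttl_seconds: float, now: Optional[float] = None) -> bool:
    """True while the cached parse is younger than the TTL (no request needed)."""
    now = time.time() if now is None else now
    return (now - entry.fetched_at) < ttl_seconds


def conditional_headers(entry: CacheEntry) -> dict[str, str]:
    """Build validators for a conditional GET; a 304 reply means 'unchanged'."""
    headers: dict[str, str] = {}
    if entry.etag:
        headers["If-None-Match"] = entry.etag
    if entry.last_modified:
        headers["If-Modified-Since"] = entry.last_modified
    return headers
```

A sync would first consult `is_fresh`, then send `conditional_headers` with the request, and skip parsing entirely when the server answers 304.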
+
+
### 5. Error Handling
+
- Graceful handling of feed parsing errors
+
- Retry logic for network failures
+
- Clear error messages with recovery suggestions
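
Retry logic for network failures is often written as exponential backoff. The sketch below assumes that policy and a hypothetical `with_retries` helper; the attempt count, delays, and the exception types treated as transient are all placeholders.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(
    fetch: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 1.0,
    retry_on: tuple = (ConnectionError, TimeoutError),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fetch(), retrying transient failures with exponential backoff."""
    for attempt in range(attempts):
        try:
            return fetch()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the original error
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
    raise ValueError("attempts must be at least 1")
```

Passing `sleep` as a parameter keeps the helper testable without real delays.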
+
+
## CLI Command Structure
+
+
```bash
# Initialize a new git store
thicket init /path/to/store

# Add a user with feeds (auto-discovers metadata from feed)
thicket add user "alyssa" \
  --feed "https://example.com/feed.atom"
# Auto-populates: email, homepage, icon, display_name from feed metadata

# Add a user with manual overrides
thicket add user "alyssa" \
  --feed "https://example.com/feed.atom" \
  --email "alyssa@example.com" \
  --homepage "https://alyssa.example.com" \
  --icon "https://example.com/avatar.png" \
  --display-name "Alyssa P. Hacker"

# Add additional feed to existing user
thicket add feed "alyssa" "https://example.com/other-feed.rss"

# Sync all feeds (designed for cron usage)
thicket sync --all

# Sync specific user
thicket sync --user alyssa

# List users and their feeds
thicket list users
thicket list feeds --user alyssa

# Search entries
thicket search "keyword" --user alyssa --since 2025-01-01

# Manage duplicate entries
thicket duplicates list
thicket duplicates add <entry_id_1> <entry_id_2>     # Mark as duplicates
thicket duplicates remove <entry_id_1> <entry_id_2>  # Unmark duplicates
```
+
+
## Performance Considerations
+
+
1. **Concurrent Feed Fetching**: Use httpx with asyncio for parallel downloads
+
2. **Incremental Updates**: Only fetch/parse feeds that have changed
+
3. **Efficient Git Operations**: Batch commits, use shallow clones where appropriate
+
4. **Progress Feedback**: Rich progress bars for long operations
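
For item 1 in the list above, parallel downloads can be bounded with a semaphore. The sketch leaves the HTTP call abstract: `fetch` is a hypothetical coroutine (in practice it might wrap an httpx async client), and `max_concurrent` is an assumed tuning knob.

```python
import asyncio
from typing import Awaitable, Callable, Union


async def fetch_all(
    urls: list[str],
    fetch: Callable[[str], Awaitable[bytes]],
    max_concurrent: int = 8,
) -> dict[str, Union[bytes, BaseException]]:
    """Fetch every URL concurrently, at most max_concurrent at a time."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def guarded(url: str) -> bytes:
        async with semaphore:
            return await fetch(url)

    # return_exceptions=True keeps one failing feed from cancelling the rest;
    # failures come back as exception objects in place of the payload.
    results = await asyncio.gather(
        *(guarded(url) for url in urls), return_exceptions=True
    )
    return dict(zip(urls, results))
```

Callers can then separate successes from failures with an `isinstance(value, BaseException)` check and report the failures after the sync completes.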
+
+
## Security Considerations
+
+
1. **HTML Sanitization**: Use bleach to clean feed content
+
2. **URL Validation**: Strict validation of feed URLs
+
3. **Git Security**: No credentials stored in repository
+
4. **Path Traversal**: Careful sanitization of filenames
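
For item 4 in the list above, a containment check on the resolved path is a common complement to filename sanitization. The helper name and the rule that entries sit exactly one directory below the store root are assumptions drawn from the repository layout shown earlier.

```python
from pathlib import Path


def safe_entry_path(store_root: Path, username: str, filename: str) -> Path:
    """Return store_root/username/filename, refusing anything that escapes it."""
    root = store_root.resolve()
    user_dir = (root / username).resolve()
    candidate = (user_dir / filename).resolve()
    # After resolving "..", symlinks, and absolute components, the user
    # directory must sit directly under the root and the file directly
    # under the user directory.
    if user_dir.parent != root or candidate.parent != user_dir:
        raise ValueError(f"unsafe path: {username!r}/{filename!r}")
    return candidate
```

Inputs such as `"../x.json"` or an absolute filename raise `ValueError` instead of writing outside the user's directory.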
+
+
## Future Enhancements
+
+
1. **Web Interface**: Optional web UI for browsing the git store
+
2. **Webhooks**: Notify external services on feed updates
+
3. **Feed Discovery**: Auto-discover feeds from HTML pages
+
4. **Export Formats**: Generate static sites, OPML exports
+
5. **Federation**: P2P sync between thicket instances
+
+
## Requirements Clarification
+
+
**✓ Resolved Requirements:**
+
1. **Feed Update Frequency**: Designed for cron usage - no built-in scheduling needed
+
2. **Duplicate Handling**: Manual curation via `duplicates.json` file with CLI commands
+
3. **Git Branching**: Single main branch for all users and entries
+
4. **Authentication**: No feeds require authentication currently
+
5. **Content Storage**: Store complete Atom entry body content as provided
+
6. **Deleted Entries**: Preserve all entries in Git store permanently (historical archive)
+
7. **History Depth**: Git store maintains full history beyond feed depth limits
+
8. **Feed Auto-Discovery**: Extract user metadata from feed during `add user` command
+
+
## Duplicate Entry Management
+
+
### Duplicate Detection Strategy
+
- **Manual Curation**: Duplicates identified and managed manually via CLI
+
- **Storage**: `duplicates.json` file in Git root maps entry IDs to canonical entries
+
- **Structure**: `{"duplicate_id": "canonical_id", ...}`
+
- **CLI Commands**: Add/remove duplicate mappings with validation
+
- **Query Resolution**: Search/list commands resolve duplicates to canonical entries
+
+
### Duplicate File Format
+
```json
{
  "https://example.com/feed/entry/123": "https://canonical.com/posts/same-post",
  "https://mirror.com/articles/456": "https://canonical.com/posts/same-post",
  "comment": "Entry IDs that map to the same canonical content"
}
```
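
Reading this flat layout needs one extra step, because the `comment` key sits beside the mappings. A minimal sketch, with hypothetical helper names `load_duplicates` and `resolve`:

```python
import json


def load_duplicates(text: str) -> dict[str, str]:
    """Parse duplicates.json text into a duplicate_id -> canonical_id mapping."""
    raw = json.loads(text)
    # The explanatory "comment" entry is not a mapping, so drop it
    # before the dict is used for lookups.
    return {key: value for key, value in raw.items() if key != "comment"}


def resolve(entry_id: str, duplicates: dict[str, str]) -> str:
    """Return the canonical ID, or the ID itself when it is not a duplicate."""
    return duplicates.get(entry_id, entry_id)
```

Search and list commands could call `resolve` on each entry ID before grouping results, which matches the query resolution behavior described above.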
+
+
## Feed Metadata Auto-Discovery
+
+
### Extraction Strategy
+
When adding a new user with `thicket add user`, the system fetches and parses the feed to extract:
+
+
- **Display Name**: From `feed.author.name`, falling back to `feed.title`
+
- **Email**: From `feed.author.email` or `feed.managingEditor`
+
- **Homepage**: From `feed.author.uri`, falling back to `feed.link`
+
- **Icon**: From `feed.logo`, `feed.icon`, or `feed.image.url`
+
+
### Discovery Priority Order
+
1. **Author Information**: Prefer `feed.author.*` fields (more specific to person)
+
2. **Feed-Level**: Fall back to feed-level metadata
+
3. **Manual Override**: CLI flags always take precedence over discovered values
+
4. **Update Behavior**: Auto-discovery only runs during initial `add user`, not on sync
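
The precedence in items 1 to 3 amounts to "first non-empty value wins". A small sketch, assuming each source has already been flattened into a plain dict keyed by the `UserConfig` field names (the helper names and dict shapes are hypothetical):

```python
from typing import Optional


def first_set(*candidates: Optional[str]) -> Optional[str]:
    """Return the first candidate that is neither None nor empty."""
    return next((value for value in candidates if value), None)


def resolve_metadata(cli: dict, author: dict, feed: dict) -> dict:
    """Merge metadata sources: CLI flags, then author fields, then feed-level."""
    fields = ("display_name", "email", "homepage", "icon")
    return {
        name: first_set(cli.get(name), author.get(name), feed.get(name))
        for name in fields
    }
```

Fields that no source provides stay `None`, which lines up with the optional fields on `UserConfig`.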
+
+
### Extracted Metadata Format
+
```python
from typing import Optional

from pydantic import BaseModel, EmailStr, HttpUrl

# UserConfig is the configuration model defined earlier in this document.


class FeedMetadata(BaseModel):
    title: Optional[str] = None
    author_name: Optional[str] = None
    author_email: Optional[EmailStr] = None
    author_uri: Optional[HttpUrl] = None
    link: Optional[HttpUrl] = None
    logo: Optional[HttpUrl] = None
    icon: Optional[HttpUrl] = None
    image_url: Optional[HttpUrl] = None

    def to_user_config(self, username: str, feed_url: HttpUrl) -> UserConfig:
        """Convert discovered metadata to UserConfig with fallbacks"""
        # Author-level fields win over feed-level ones, per the priority order.
        return UserConfig(
            username=username,
            feeds=[feed_url],
            display_name=self.author_name or self.title,
            email=self.author_email,
            homepage=self.author_uri or self.link,
            icon=self.logo or self.icon or self.image_url,
        )
```