Manage Atom feeds in a persistent git repository

Thicket Architecture Design#

Overview#

Thicket is a modern CLI tool for persisting Atom/RSS feeds in a Git repository, designed to enable distributed weblog comment structures.

Technology Stack#

Core Libraries#

CLI Framework#

  • Typer (0.15.x) - Modern CLI framework with type hints
  • Rich (13.x) - Beautiful terminal output, progress bars, and tables
  • prompt-toolkit - Interactive prompts when needed

Feed Processing#

  • feedparser (6.0.11) - Universal feed parser supporting RSS 0.9x, RSS 1.0, RSS 2.0, CDF, Atom 0.3, and Atom 1.0
    • Alternative: atoma for stricter Atom/RSS parsing with JSON feed support
    • Alternative: fastfeedparser for high-performance parsing (claimed by the project to be roughly 10x faster than feedparser)

Git Integration#

  • GitPython (3.1.44) - High-level git operations, requires git CLI
    • Alternative: pygit2 (1.18.0) - Direct libgit2 bindings, better for authentication

HTTP Client#

  • httpx (0.28.x) - Modern async/sync HTTP client with connection pooling
  • aiohttp (3.11.x) - For async-only operations if needed

Configuration & Data Models#

  • pydantic (2.11.x) - Data validation and settings management
  • pydantic-settings (2.10.x) - Configuration file handling with env var support

Utilities#

  • pendulum (3.x) - Better datetime handling
  • bleach (6.x) - HTML sanitization for feed content
  • platformdirs (4.x) - Cross-platform directory paths

Project Structure#

thicket/
├── pyproject.toml          # Modern Python packaging
├── README.md               # Project documentation
├── ARCH.md                 # This file
├── CLAUDE.md               # Project instructions
├── .gitignore
├── src/
│   └── thicket/
│       ├── __init__.py
│       ├── __main__.py     # Entry point for `python -m thicket`
│       ├── cli/            # CLI commands and interface
│       │   ├── __init__.py
│       │   ├── main.py     # Main CLI app with Typer
│       │   ├── commands/   # Subcommands
│       │   │   ├── __init__.py
│       │   │   ├── init.py      # Initialize git store
│       │   │   ├── add.py       # Add feed to config
│       │   │   ├── sync.py      # Sync feeds
│       │   │   ├── list.py      # List users/feeds
│       │   │   └── search.py    # Search entries
│       │   └── utils.py    # CLI utilities (progress, formatting)
│       ├── core/           # Core business logic
│       │   ├── __init__.py
│       │   ├── feed_parser.py   # Feed parsing and normalization
│       │   ├── git_store.py     # Git repository operations
│       │   ├── cache.py         # Cache management
│       │   └── sanitizer.py     # Filename and HTML sanitization
│       ├── models/         # Pydantic data models
│       │   ├── __init__.py
│       │   ├── config.py        # Configuration models
│       │   ├── feed.py          # Feed/Entry models
│       │   └── user.py          # User metadata models
│       └── utils/          # Shared utilities
│           ├── __init__.py
│           ├── paths.py         # Path handling
│           └── network.py       # HTTP client wrapper
├── tests/
│   ├── __init__.py
│   ├── conftest.py         # pytest configuration
│   ├── test_feed_parser.py
│   ├── test_git_store.py
│   └── fixtures/           # Test data
│       └── feeds/
└── docs/
    └── examples/           # Example configurations

Data Models#

Configuration File (YAML/TOML)#

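from pathlib import Path
from typing import Optional

from pydantic import BaseModel, EmailStr, HttpUrl
from pydantic_settings import BaseSettings, SettingsConfigDict

# Defined before ThicketConfig, which references it in an annotation
class UserConfig(BaseModel):
    username: str
    feeds: list[HttpUrl]
    email: Optional[EmailStr] = None  # EmailStr requires the email-validator package
    homepage: Optional[HttpUrl] = None
    icon: Optional[HttpUrl] = None
    display_name: Optional[str] = None

class ThicketConfig(BaseSettings):
    git_store: Path  # Git repository location
    cache_dir: Path  # Cache directory
    users: list[UserConfig]

    # Note: pydantic-settings reads yaml_file only when a
    # YamlConfigSettingsSource is enabled via settings_customise_sources
    model_config = SettingsConfigDict(
        env_prefix="THICKET_",
        env_file=".env",
        yaml_file="thicket.yaml"
    )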

Feed Storage Format#

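from datetime import datetime
from typing import Optional

from pydantic import BaseModel, ConfigDict, HttpUrl

class AtomEntry(BaseModel):
    id: str  # Original Atom ID
    title: str
    link: HttpUrl
    updated: datetime
    # Optional fields need an explicit default; without one Pydantic v2 treats them as required
    published: Optional[datetime] = None
    summary: Optional[str] = None
    content: Optional[str] = None  # Full body content from Atom entry
    content_type: Optional[str] = "html"  # text, html, xhtml
    author: Optional[dict] = None
    categories: list[str] = []
    rights: Optional[str] = None  # Copyright info
    source: Optional[str] = None  # Source feed URL
    # Additional Atom fields preserved during RSS->Atom conversion

    # json_encoders is deprecated in Pydantic v2, which already
    # serializes datetime values as ISO 8601 strings by default
    model_config = ConfigDict(
        json_encoders={
            datetime: lambda v: v.isoformat()
        }
    )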

class DuplicateMap(BaseModel):
    """Maps duplicate entry IDs to canonical entry IDs"""
    duplicates: dict[str, str] = {}  # duplicate_id -> canonical_id
    comment: str = "Entry IDs that map to the same canonical content"
    
    def add_duplicate(self, duplicate_id: str, canonical_id: str) -> None:
        """Add a duplicate mapping"""
        self.duplicates[duplicate_id] = canonical_id
    
    def remove_duplicate(self, duplicate_id: str) -> bool:
        """Remove a duplicate mapping. Returns True if existed."""
        return self.duplicates.pop(duplicate_id, None) is not None
    
    def get_canonical(self, entry_id: str) -> str:
        """Get canonical ID for an entry (returns original if not duplicate)"""
        return self.duplicates.get(entry_id, entry_id)
    
    def is_duplicate(self, entry_id: str) -> bool:
        """Check if entry ID is marked as duplicate"""
        return entry_id in self.duplicates

Git Repository Structure#

git-store/
├── index.json              # User directory index
├── duplicates.json         # Manual curation of duplicate entries
├── user1/
│   ├── metadata.json       # User metadata
│   ├── entry_id_1.json     # Sanitized entry files
│   ├── entry_id_2.json
│   └── ...
└── user2/
    └── ...
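
A minimal sketch of how an entry file might be written into this layout, using only the standard library. The function name `write_entry` and the choice of sorted keys with fixed indentation are assumptions made for illustration; the layout itself (`<store>/<user>/<entry>.json`) comes from the tree above.

```python
import json
import tempfile
from pathlib import Path


def write_entry(store: Path, username: str, filename: str, entry: dict) -> bool:
    """Write one entry under <store>/<username>/; return True if the file changed."""
    user_dir = store / username
    user_dir.mkdir(parents=True, exist_ok=True)
    path = user_dir / filename
    # Sorted keys and fixed indentation keep Git diffs small and stable.
    text = json.dumps(entry, indent=2, sort_keys=True, ensure_ascii=False) + "\n"
    if path.exists() and path.read_text(encoding="utf-8") == text:
        return False  # unchanged: nothing new to commit
    path.write_text(text, encoding="utf-8")
    return True


# Demonstration in a throwaway directory standing in for the git store.
with tempfile.TemporaryDirectory() as tmp:
    store = Path(tmp)
    entry = {"id": "entry_id_1", "title": "Hello"}
    first = write_entry(store, "user1", "entry_id_1.json", entry)
    second = write_entry(store, "user1", "entry_id_1.json", entry)
    stored = json.loads((store / "user1" / "entry_id_1.json").read_text(encoding="utf-8"))
```

Returning whether the file changed lets a sync run skip the commit entirely when no feed produced new content.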

Key Design Decisions#

1. Feed Normalization & Auto-Discovery#

  • All RSS feeds converted to Atom format before storage
  • Preserves maximum metadata during conversion
  • Sanitizes HTML content to prevent XSS
  • Auto-discovery: Extracts user metadata from the feed during the add user command
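
The RSS-to-Atom step can be illustrated with a small standard-library sketch. This is not how feedparser works internally; it only shows one plausible field mapping. The function name `rss_item_to_atom` and the fallback rules (use `<guid>` or else the link as the ID, reuse `pubDate` for `updated`) are assumptions.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime


def rss_item_to_atom(item: ET.Element, feed_url: str) -> dict:
    """Map one RSS 2.0 <item> onto Atom-style fields (hypothetical mapping)."""
    link = item.findtext("link")
    pub_date = item.findtext("pubDate")
    # RSS dates use the RFC 822 style; isoformat() yields an ISO 8601 timestamp.
    published = parsedate_to_datetime(pub_date).isoformat() if pub_date else None
    return {
        # RSS has no mandatory ID: prefer <guid>, fall back to the link.
        "id": item.findtext("guid") or link,
        "title": item.findtext("title") or "",
        "link": link,
        "published": published,
        # RSS has no separate "updated" date, so reuse the publication date.
        "updated": published,
        "summary": item.findtext("description"),
        "categories": [c.text for c in item.findall("category") if c.text],
        "source": feed_url,
    }


RSS = """<rss version="2.0"><channel><title>Example</title>
<item>
  <title>Hello</title>
  <link>https://example.com/posts/1</link>
  <guid>https://example.com/posts/1</guid>
  <pubDate>Wed, 01 Jan 2025 12:00:00 GMT</pubDate>
  <description>First post</description>
  <category>news</category>
</item>
</channel></rss>"""

channel = ET.fromstring(RSS).find("channel")
entries = [rss_item_to_atom(i, "https://example.com/feed.rss") for i in channel.findall("item")]
```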

2. ID Sanitization#

  • Consistent algorithm to convert Atom IDs to safe filenames
  • Handles edge cases (very long IDs, special characters)
  • Maintains reversibility where possible
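
One possible sanitization scheme, sketched under assumptions: percent-encoding keeps short IDs reversible, and IDs that would exceed an assumed length cap are truncated and suffixed with a hash (at which point reversibility is lost). The names `id_to_filename`, `filename_to_id`, and `MAX_LEN` are hypothetical.

```python
import hashlib
from urllib.parse import quote, unquote

MAX_LEN = 200  # assumed cap, below the common 255-byte filename limit


def id_to_filename(entry_id: str) -> str:
    """Percent-encode an Atom ID so it forms a single safe path component."""
    # safe="" also encodes "/", so no ID can introduce a directory separator.
    name = quote(entry_id, safe="")
    # quote() leaves "." untouched; encode a leading dot to avoid ".", ".."
    # and hidden files.
    if name.startswith("."):
        name = "%2E" + name[1:]
    if len(name) > MAX_LEN:
        # Too long: keep a readable prefix plus a short hash of the full ID.
        digest = hashlib.sha256(entry_id.encode("utf-8")).hexdigest()[:16]
        name = name[: MAX_LEN - 17] + "-" + digest
    return name + ".json"


def filename_to_id(filename: str) -> str:
    """Invert id_to_filename for names that were not truncated."""
    return unquote(filename.removesuffix(".json"))
```

For example, `id_to_filename("https://example.com/posts/1")` yields `https%3A%2F%2Fexample.com%2Fposts%2F1.json`, and `filename_to_id` maps it back to the original ID.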

3. Git Operations#

  • Uses GitPython for simplicity; its weaker authentication support is acceptable because the store needs no authentication
  • Single main branch for all users and entries
  • Atomic commits per sync operation
  • Meaningful commit messages with feed update summaries
  • Preserves complete history - never delete entries even if they disappear from feeds
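
As an illustration of "one commit per sync with a meaningful message", the sketch below builds a commit message from per-user update counts. The `FeedUpdate` record and the message wording are assumptions; with GitPython the resulting string could then be passed to a single `repo.index.commit(message)` call.

```python
from dataclasses import dataclass


@dataclass
class FeedUpdate:
    """Hypothetical per-user result of one sync run."""
    username: str
    new: int
    updated: int


def commit_message(updates: list[FeedUpdate]) -> str:
    """Summarize a whole sync run as one commit message."""
    # Users without changes are left out of the summary.
    changed = [u for u in updates if u.new or u.updated]
    total_new = sum(u.new for u in changed)
    total_updated = sum(u.updated for u in changed)
    subject = f"Sync: {total_new} new, {total_updated} updated across {len(changed)} user(s)"
    body = [f"- {u.username}: {u.new} new, {u.updated} updated" for u in changed]
    # Subject line, blank line, then one bullet per user with changes.
    return "\n".join([subject, "", *body])


message = commit_message([
    FeedUpdate("alyssa", new=2, updated=1),
    FeedUpdate("ben", new=0, updated=0),
])
```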

4. Caching Strategy#

  • HTTP caching with Last-Modified/ETag support
  • Local cache of parsed feeds with TTL
  • Cache invalidation on configuration changes
  • Git store serves as permanent historical archive beyond feed depth limits
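
The HTTP caching points above can be sketched as follows. The `CacheRecord` structure and function names are hypothetical; the header names `If-None-Match` and `If-Modified-Since` are the standard request headers for conditional GETs, and a 304 response means the cached copy is still valid.

```python
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class CacheRecord:
    """Validators remembered from the previous response for one feed URL."""
    etag: Optional[str] = None
    last_modified: Optional[str] = None
    fetched_at: float = 0.0  # Unix timestamp of the last successful fetch


def is_fresh(record: CacheRecord, ttl_seconds: float, now: Optional[float] = None) -> bool:
    """True while the local copy is younger than the TTL (skip the request)."""
    current = time.time() if now is None else now
    return current - record.fetched_at < ttl_seconds


def conditional_headers(record: CacheRecord) -> dict[str, str]:
    """Build request headers so the server can answer 304 Not Modified."""
    headers: dict[str, str] = {}
    if record.etag:
        headers["If-None-Match"] = record.etag
    if record.last_modified:
        headers["If-Modified-Since"] = record.last_modified
    return headers


def needs_parse(status_code: int) -> bool:
    """A 304 response carries no body, so the cached parse result is reused."""
    return status_code != 304


record = CacheRecord(
    etag='"abc123"',
    last_modified="Wed, 01 Jan 2025 12:00:00 GMT",
    fetched_at=1000.0,
)
headers = conditional_headers(record)
```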

5. Error Handling#

  • Graceful handling of feed parsing errors
  • Retry logic for network failures
  • Clear error messages with recovery suggestions
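
A minimal retry helper with exponential backoff, as one way to handle transient network failures. The name `with_retries`, its defaults, and the choice to retry on `OSError` are assumptions; the `sleep` parameter exists only so the delays can be observed without waiting.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(
    operation: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 1.0,
    retry_on: tuple = (OSError,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run operation, retrying on the given exceptions with doubling delays."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the last error to the caller
            sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")


# Demonstration: fail twice, then succeed; delays are recorded, not slept.
calls = {"n": 0}
delays: list[float] = []


def flaky() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("temporary failure")
    return "feed body"


result = with_retries(flaky, attempts=4, base_delay=1.0, sleep=delays.append)
```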

CLI Command Structure#

# Initialize a new git store
thicket init /path/to/store

# Add a user with feeds (auto-discovers metadata from feed)
thicket add user "alyssa" \
  --feed "https://example.com/feed.atom"
  # Auto-populates: email, homepage, icon, display_name from feed metadata

# Add a user with manual overrides
thicket add user "alyssa" \
  --feed "https://example.com/feed.atom" \
  --email "alyssa@example.com" \
  --homepage "https://alyssa.example.com" \
  --icon "https://example.com/avatar.png" \
  --display-name "Alyssa P. Hacker"

# Add additional feed to existing user
thicket add feed "alyssa" "https://example.com/other-feed.rss"

# Sync all feeds (designed for cron usage)
thicket sync --all

# Sync specific user
thicket sync --user alyssa

# List users and their feeds
thicket list users
thicket list feeds --user alyssa

# Search entries
thicket search "keyword" --user alyssa --since 2025-01-01

# Manage duplicate entries
thicket duplicates list
thicket duplicates add <entry_id_1> <entry_id_2>  # Mark as duplicates
thicket duplicates remove <entry_id_1> <entry_id_2>  # Unmark duplicates

Performance Considerations#

  1. Concurrent Feed Fetching: Use httpx with asyncio for parallel downloads
  2. Incremental Updates: Only fetch/parse feeds that have changed
  3. Efficient Git Operations: Batch commits, use shallow clones where appropriate
  4. Progress Feedback: Rich progress bars for long operations
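
Concurrent fetching with a bounded number of in-flight requests might look like the sketch below. To stay self-contained it takes the fetch coroutine as a parameter rather than calling httpx directly; `fetch_all`, `fake_fetch`, and the default limit are hypothetical. One failing feed does not abort the others because exceptions are returned as values.

```python
import asyncio
from typing import Awaitable, Callable, Union


async def fetch_all(
    urls: list[str],
    fetch: Callable[[str], Awaitable[bytes]],
    limit: int = 5,
) -> dict[str, Union[bytes, BaseException]]:
    """Fetch all URLs concurrently, with at most `limit` requests in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def bounded(url: str) -> bytes:
        async with semaphore:
            return await fetch(url)

    # return_exceptions=True keeps one failing feed from cancelling the rest.
    results = await asyncio.gather(*(bounded(u) for u in urls), return_exceptions=True)
    return dict(zip(urls, results))


async def fake_fetch(url: str) -> bytes:
    """Stand-in for a real HTTP call (an async httpx request in practice)."""
    if "broken" in url:
        raise ConnectionError(f"cannot reach {url}")
    await asyncio.sleep(0)
    return url.encode()


results = asyncio.run(
    fetch_all(
        ["https://example.com/feed.atom", "https://broken.example/feed"],
        fake_fetch,
        limit=2,
    )
)
```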

Security Considerations#

  1. HTML Sanitization: Use bleach to clean feed content
  2. URL Validation: Strict validation of feed URLs
  3. Git Security: No credentials stored in repository
  4. Path Traversal: Careful sanitization of filenames
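
As a defense-in-depth check for the path traversal point, the sketch below verifies that a resolved entry path sits exactly one directory below the store root. The function name `safe_entry_path` is hypothetical, and this check is meant to complement filename sanitization rather than replace it.

```python
from pathlib import Path


def safe_entry_path(store: Path, username: str, filename: str) -> Path:
    """Return <store>/<username>/<filename>, rejecting anything that escapes it."""
    root = store.resolve()
    user_dir = (root / username).resolve()
    candidate = (user_dir / filename).resolve()
    # The user directory must be a direct child of the store root ...
    if user_dir.parent != root:
        raise ValueError(f"invalid username: {username!r}")
    # ... and the entry file a direct child of the user directory.
    if candidate.parent != user_dir:
        raise ValueError(f"invalid filename: {filename!r}")
    return candidate


store = Path("git-store")
ok = safe_entry_path(store, "alyssa", "entry_id_1.json")
try:
    safe_entry_path(store, "alyssa", "../../etc/passwd")
    rejected = False
except ValueError:
    rejected = True
```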

Future Enhancements#

  1. Web Interface: Optional web UI for browsing the git store
  2. Webhooks: Notify external services on feed updates
  3. Feed Discovery: Auto-discover feeds from HTML pages
  4. Export Formats: Generate static sites, OPML exports
  5. Federation: P2P sync between thicket instances

Requirements Clarification#

✓ Resolved Requirements:

  1. Feed Update Frequency: Designed for cron usage - no built-in scheduling needed
  2. Duplicate Handling: Manual curation via duplicates.json file with CLI commands
  3. Git Branching: Single main branch for all users and entries
  4. Authentication: No feeds require authentication currently
  5. Content Storage: Store complete Atom entry body content as provided
  6. Deleted Entries: Preserve all entries in Git store permanently (historical archive)
  7. History Depth: Git store maintains full history beyond feed depth limits
  8. Feed Auto-Discovery: Extract user metadata from the feed during the add user command

Duplicate Entry Management#

Duplicate Detection Strategy#

  • Manual Curation: Duplicates identified and managed manually via CLI
  • Storage: duplicates.json file in Git root maps entry IDs to canonical entries
  • Structure: {"duplicate_id": "canonical_id", ...}
  • CLI Commands: Add/remove duplicate mappings with validation
  • Query Resolution: Search/list commands resolve duplicates to canonical entries

Duplicate File Format#

{
  "duplicates": {
    "https://example.com/feed/entry/123": "https://canonical.com/posts/same-post",
    "https://mirror.com/articles/456": "https://canonical.com/posts/same-post"
  },
  "comment": "Entry IDs that map to the same canonical content"
}
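
Query resolution against the duplicate mapping could be sketched as below. It operates on the plain `duplicate_id -> canonical_id` dictionary. Following chains of mappings and rejecting cycles are assumptions that go beyond the single-step lookup in `DuplicateMap.get_canonical`; the function names are hypothetical.

```python
def resolve_canonical(entry_id: str, duplicates: dict[str, str]) -> str:
    """Follow duplicate -> canonical links until a canonical ID is reached."""
    seen = {entry_id}
    while entry_id in duplicates:
        entry_id = duplicates[entry_id]
        if entry_id in seen:
            raise ValueError(f"cycle in duplicate map at {entry_id!r}")
        seen.add(entry_id)
    return entry_id


def collapse_duplicates(entry_ids: list[str], duplicates: dict[str, str]) -> list[str]:
    """Replace each ID by its canonical ID, keeping first-seen order."""
    result: list[str] = []
    for entry_id in entry_ids:
        canonical = resolve_canonical(entry_id, duplicates)
        if canonical not in result:
            result.append(canonical)
    return result


duplicates = {
    "https://example.com/feed/entry/123": "https://canonical.com/posts/same-post",
    "https://mirror.com/articles/456": "https://canonical.com/posts/same-post",
}
hits = collapse_duplicates(
    [
        "https://mirror.com/articles/456",
        "https://example.com/feed/entry/123",
        "https://example.com/feed/entry/999",
    ],
    duplicates,
)
```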

Feed Metadata Auto-Discovery#

Extraction Strategy#

When adding a new user with thicket add user, the system fetches and parses the feed to extract:

  • Display Name: From feed.author.name, falling back to feed.title
  • Email: From feed.author.email, falling back to feed.managingEditor
  • Homepage: From feed.author.uri, falling back to feed.link
  • Icon: From feed.logo, feed.icon, or feed.image.url

Discovery Priority Order#

  1. Author Information: Prefer feed.author.* fields (more specific to person)
  2. Feed-Level: Fall back to feed-level metadata
  3. Manual Override: CLI flags always take precedence over discovered values
  4. Update Behavior: Auto-discovery runs only during the initial add user command, not on sync
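
The priority order can be expressed as a small merge function: CLI overrides first, then author-level fields, then feed-level fields. Plain dictionaries stand in for the models here, and `pick` and `resolve_user_fields` are hypothetical helper names; the key names follow the FeedMetadata fields.

```python
from typing import Optional


def pick(*candidates: Optional[str]) -> Optional[str]:
    """Return the first non-empty candidate, or None if there is none."""
    return next((c for c in candidates if c), None)


def resolve_user_fields(discovered: dict, overrides: dict) -> dict:
    """Merge CLI overrides with discovered metadata in priority order."""
    return {
        "display_name": pick(
            overrides.get("display_name"),
            discovered.get("author_name"),
            discovered.get("title"),
        ),
        "email": pick(overrides.get("email"), discovered.get("author_email")),
        "homepage": pick(
            overrides.get("homepage"),
            discovered.get("author_uri"),
            discovered.get("link"),
        ),
        "icon": pick(
            overrides.get("icon"),
            discovered.get("logo"),
            discovered.get("icon"),
            discovered.get("image_url"),
        ),
    }


discovered = {
    "title": "Alyssa's Blog",
    "author_name": "Alyssa P. Hacker",
    "link": "https://example.com",
    "image_url": "https://example.com/avatar.png",
}
fields = resolve_user_fields(discovered, {"homepage": "https://alyssa.example.com"})
```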

Extracted Metadata Format#

from typing import Optional

from pydantic import BaseModel, EmailStr, HttpUrl

# UserConfig is the configuration model defined under Data Models
class FeedMetadata(BaseModel):
    title: Optional[str] = None
    author_name: Optional[str] = None
    author_email: Optional[EmailStr] = None
    author_uri: Optional[HttpUrl] = None
    link: Optional[HttpUrl] = None
    logo: Optional[HttpUrl] = None
    icon: Optional[HttpUrl] = None
    image_url: Optional[HttpUrl] = None

    def to_user_config(self, username: str, feed_url: HttpUrl) -> UserConfig:
        """Convert discovered metadata to UserConfig with fallbacks"""
        # Author-level fields win over feed-level fields
        return UserConfig(
            username=username,
            feeds=[feed_url],
            display_name=self.author_name or self.title,
            email=self.author_email,
            homepage=self.author_uri or self.link,
            icon=self.logo or self.icon or self.image_url
        )