# Thicket Architecture Design

## Overview

Thicket is a modern CLI tool for persisting Atom/RSS feeds in a Git repository, designed to enable distributed weblog comment structures.

## Technology Stack

### Core Libraries

#### CLI Framework
- **Typer** (0.15.x) - Modern CLI framework with type hints
- **Rich** (13.x) - Beautiful terminal output, progress bars, and tables
- **prompt-toolkit** - Interactive prompts when needed

#### Feed Processing
- **feedparser** (6.0.11) - Universal feed parser supporting RSS 0.9x, RSS 1.0, RSS 2.0, CDF, Atom 0.3, and Atom 1.0
- Alternative: **atoma** for stricter Atom/RSS parsing with JSON Feed support
- Alternative: **fastfeedparser** for high-performance parsing (10x faster)

#### Git Integration
- **GitPython** (3.1.44) - High-level git operations; requires the git CLI
- Alternative: **pygit2** (1.18.0) - Direct libgit2 bindings, better for authentication

#### HTTP Client
- **httpx** (0.28.x) - Modern async/sync HTTP client with connection pooling
- **aiohttp** (3.11.x) - For async-only operations if needed

#### Configuration & Data Models
- **pydantic** (2.11.x) - Data validation and settings management
- **pydantic-settings** (2.10.x) - Configuration file handling with env var support

#### Utilities
- **pendulum** (3.x) - Better datetime handling
- **bleach** (6.x) - HTML sanitization for feed content
- **platformdirs** (4.x) - Cross-platform directory paths

## Project Structure

```
thicket/
├── pyproject.toml          # Modern Python packaging
├── README.md               # Project documentation
├── ARCH.md                 # This file
├── CLAUDE.md               # Project instructions
├── .gitignore
├── src/
│   └── thicket/
│       ├── __init__.py
│       ├── __main__.py     # Entry point for `python -m thicket`
│       ├── cli/            # CLI commands and interface
│       │   ├── __init__.py
│       │   ├── main.py     # Main CLI app with Typer
│       │   ├── commands/   # Subcommands
│       │   │   ├── __init__.py
│       │   │   ├── init.py       # Initialize git store
│       │   │   ├── add.py        # Add users and feeds
│       │   │   ├── sync.py       # Sync feeds
│       │   │   ├── list_cmd.py   # List users/feeds
│       │   │   ├── duplicates.py # Manage duplicate entries
│       │   │   ├── links_cmd.py  # Extract and categorize links
│       │   │   └── index_cmd.py  # Build reference index and show threads
│       │   └── utils.py          # CLI utilities (progress, formatting)
│       ├── core/                 # Core business logic
│       │   ├── __init__.py
│       │   ├── feed_parser.py    # Feed parsing and normalization
│       │   ├── git_store.py      # Git repository operations
│       │   └── reference_parser.py  # Link extraction and threading
│       ├── models/               # Pydantic data models
│       │   ├── __init__.py
│       │   ├── config.py         # Configuration models
│       │   ├── feed.py           # Feed/Entry models
│       │   └── user.py           # User metadata models
│       └── utils/                # Shared utilities
│           └── __init__.py
├── tests/
│   ├── __init__.py
│   ├── conftest.py               # pytest configuration
│   ├── test_feed_parser.py
│   ├── test_git_store.py
│   └── fixtures/                 # Test data
│       └── feeds/
└── docs/
    └── examples/                 # Example configurations
```

## Data Models

### Configuration File (YAML/TOML)

```python
class ThicketConfig(BaseSettings):
    git_store: Path          # Git repository location
    cache_dir: Path          # Cache directory
    users: list[UserConfig]

    model_config = SettingsConfigDict(
        env_prefix="THICKET_",
        env_file=".env",
        yaml_file="thicket.yaml",
    )


class UserConfig(BaseModel):
    username: str
    feeds: list[HttpUrl]
    email: Optional[EmailStr] = None
    homepage: Optional[HttpUrl] = None
    icon: Optional[HttpUrl] = None
    display_name: Optional[str] = None
```

### Feed Storage Format

```python
class AtomEntry(BaseModel):
    id: str                               # Original Atom ID
    title: str
    link: HttpUrl
    updated: datetime
    published: Optional[datetime]
    summary: Optional[str]
    content: Optional[str]                # Full body content from Atom entry
    content_type: Optional[str] = "html"  # text, html, xhtml
    author: Optional[dict]
    categories: list[str] = []
    rights: Optional[str] = None          # Copyright info
    source: Optional[str] = None          # Source feed URL
    # Additional Atom fields preserved during RSS->Atom conversion

    model_config = ConfigDict(
        json_encoders={datetime: lambda v: v.isoformat()}
    )

class DuplicateMap(BaseModel):
    """Maps duplicate entry IDs to canonical entry IDs."""

    duplicates: dict[str, str] = {}  # duplicate_id -> canonical_id
    comment: str = "Entry IDs that map to the same canonical content"

    def add_duplicate(self, duplicate_id: str, canonical_id: str) -> None:
        """Add a duplicate mapping."""
        self.duplicates[duplicate_id] = canonical_id

    def remove_duplicate(self, duplicate_id: str) -> bool:
        """Remove a duplicate mapping. Returns True if it existed."""
        return self.duplicates.pop(duplicate_id, None) is not None

    def get_canonical(self, entry_id: str) -> str:
        """Get the canonical ID for an entry (returns the original if not a duplicate)."""
        return self.duplicates.get(entry_id, entry_id)

    def is_duplicate(self, entry_id: str) -> bool:
        """Check whether an entry ID is marked as a duplicate."""
        return entry_id in self.duplicates
```

## Git Repository Structure

```
git-store/
├── index.json          # User directory index
├── duplicates.json     # Manual curation of duplicate entries
├── links.json          # Unified links, references, and mapping data
├── user1/
│   ├── entry_id_1.json # Sanitized entry files
│   ├── entry_id_2.json
│   └── ...
└── user2/
    └── ...
```

## Key Design Decisions

### 1. Feed Normalization & Auto-Discovery
- All RSS feeds are converted to Atom format before storage
- Preserves maximum metadata during conversion
- Sanitizes HTML content to prevent XSS
- **Auto-discovery**: Extracts user metadata from the feed during the `add user` command

### 2. ID Sanitization
- Consistent algorithm to convert Atom IDs to safe filenames
- Handles edge cases (very long IDs, special characters)
- Maintains reversibility where possible

### 3. Git Operations
- Uses GitPython for simplicity (no authentication required)
- Single main branch for all users and entries
- Atomic commits per sync operation
- Meaningful commit messages with feed update summaries
- Preserves complete history - never delete entries, even if they disappear from feeds

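The ID sanitization decision above (decision 2) can be sketched as follows. This is a minimal illustration, not the project's actual algorithm: `sanitize_entry_id` and the 200-character limit are hypothetical, using percent-encoding for reversibility and a hash suffix for over-long IDs.

```python
import hashlib
from urllib.parse import quote, unquote

MAX_FILENAME_LEN = 200  # assumed limit, below common 255-byte filesystem caps


def sanitize_entry_id(entry_id: str) -> str:
    """Convert an Atom entry ID into a safe, mostly reversible filename."""
    # Percent-encode everything outside a conservative safe set.
    safe = quote(entry_id, safe="")
    if len(safe) <= MAX_FILENAME_LEN:
        return safe
    # Too long: keep a recognizable prefix plus a stable hash suffix.
    digest = hashlib.sha256(entry_id.encode("utf-8")).hexdigest()[:16]
    return f"{safe[:MAX_FILENAME_LEN - 17]}-{digest}"


def unsanitize_entry_id(filename: str) -> str:
    """Reverse the encoding (exact only for IDs under the length limit)."""
    return unquote(filename)
```

Percent-encoding keeps short IDs fully reversible, while the hash suffix keeps long IDs unique at the cost of reversibility, matching the "where possible" caveat above.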
### 4. Caching Strategy
- HTTP caching with Last-Modified/ETag support
- Local cache of parsed feeds with TTL
- Cache invalidation on configuration changes
- Git store serves as a permanent historical archive beyond feed depth limits

### 5. Error Handling
- Graceful handling of feed parsing errors
- Retry logic for network failures
- Clear error messages with recovery suggestions

## CLI Command Structure

```bash
# Initialize a new git store
thicket init /path/to/store

# Add a user with feeds (auto-discovers metadata from the feed)
thicket add user "alyssa" \
    --feed "https://example.com/feed.atom"
# Auto-populates: email, homepage, icon, display_name from feed metadata

# Add a user with manual overrides
thicket add user "alyssa" \
    --feed "https://example.com/feed.atom" \
    --email "alyssa@example.com" \
    --homepage "https://alyssa.example.com" \
    --icon "https://example.com/avatar.png" \
    --display-name "Alyssa P. Hacker"

# Add an additional feed to an existing user
thicket add feed "alyssa" "https://example.com/other-feed.rss"

# Sync all feeds (designed for cron usage)
thicket sync --all

# Sync a specific user
thicket sync --user alyssa

# List users and their feeds
thicket list users
thicket list feeds --user alyssa

# Manage duplicate entries
thicket duplicates list
thicket duplicates add     # Mark as duplicates
thicket duplicates remove  # Unmark duplicates

# Link processing and threading
thicket links --verbose            # Extract and categorize all links
thicket index --verbose            # Build reference index for threading
thicket threads                    # Show conversation threads
thicket threads --username user1   # Show threads for a specific user
thicket threads --min-size 3       # Show threads with a minimum size
```

## Performance Considerations

1. **Concurrent Feed Fetching**: Use httpx with asyncio for parallel downloads
2. **Incremental Updates**: Only fetch/parse feeds that have changed
3. **Efficient Git Operations**: Batch commits, use shallow clones where appropriate
4. **Progress Feedback**: Rich progress bars for long operations
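Consideration 1 can be sketched with a semaphore-bounded `asyncio.gather`. To keep the sketch self-contained, `fetch_feed` is a stub standing in for an `httpx.AsyncClient` request; the concurrency limit of 10 is an assumption.

```python
import asyncio


async def fetch_feed(url: str) -> str:
    """Stub fetcher; in thicket this would be an httpx.AsyncClient request."""
    await asyncio.sleep(0)  # stand-in for network I/O
    return f"<feed from {url}>"


async def fetch_all(urls: list[str], limit: int = 10) -> dict[str, str]:
    """Fetch feeds concurrently, bounding parallelism with a semaphore."""
    sem = asyncio.Semaphore(limit)

    async def bounded(url: str) -> tuple[str, str]:
        async with sem:
            return url, await fetch_feed(url)

    results = await asyncio.gather(*(bounded(u) for u in urls))
    return dict(results)


feeds = asyncio.run(fetch_all(["https://a.example/feed", "https://b.example/feed"]))
```

The semaphore keeps a large user list from opening hundreds of simultaneous connections while still overlapping network waits.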
## Security Considerations

1. **HTML Sanitization**: Use bleach to clean feed content
2. **URL Validation**: Strict validation of feed URLs
3. **Git Security**: No credentials stored in the repository
4. **Path Traversal**: Careful sanitization of filenames

## Future Enhancements

1. **Web Interface**: Optional web UI for browsing the git store
2. **Webhooks**: Notify external services on feed updates
3. **Feed Discovery**: Auto-discover feeds from HTML pages
4. **Export Formats**: Generate static sites, OPML exports
5. **Federation**: P2P sync between thicket instances

## Requirements Clarification

**✓ Resolved Requirements:**

1. **Feed Update Frequency**: Designed for cron usage - no built-in scheduling needed
2. **Duplicate Handling**: Manual curation via the `duplicates.json` file with CLI commands
3. **Git Branching**: Single main branch for all users and entries
4. **Authentication**: No feeds require authentication currently
5. **Content Storage**: Store the complete Atom entry body content as provided
6. **Deleted Entries**: Preserve all entries in the Git store permanently (historical archive)
7. **History Depth**: Git store maintains full history beyond feed depth limits
8. **Feed Auto-Discovery**: Extract user metadata from the feed during the `add user` command
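Requirement 2 amounts to a lookup through the flat duplicate→canonical JSON map (with its reserved `"comment"` key). A minimal sketch of how a query command might resolve an entry ID; `resolve` is a hypothetical helper and the URLs are illustrative:

```python
import json

# Illustrative duplicates.json content.
duplicates_json = """
{
  "https://mirror.example/articles/456": "https://canonical.example/posts/same-post",
  "comment": "Entry IDs that map to the same canonical content"
}
"""

# Skip the reserved "comment" key when loading the mapping.
mapping = {k: v for k, v in json.loads(duplicates_json).items() if k != "comment"}


def resolve(entry_id: str) -> str:
    """Return the canonical ID, or the entry's own ID if it is not a duplicate."""
    return mapping.get(entry_id, entry_id)
```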
## Duplicate Entry Management

### Duplicate Detection Strategy

- **Manual Curation**: Duplicates are identified and managed manually via the CLI
- **Storage**: The `duplicates.json` file in the Git root maps entry IDs to canonical entries
- **Structure**: `{"duplicate_id": "canonical_id", ...}`
- **CLI Commands**: Add/remove duplicate mappings with validation
- **Query Resolution**: Search/list commands resolve duplicates to canonical entries

### Duplicate File Format

```json
{
  "https://example.com/feed/entry/123": "https://canonical.com/posts/same-post",
  "https://mirror.com/articles/456": "https://canonical.com/posts/same-post",
  "comment": "Entry IDs that map to the same canonical content"
}
```

## Feed Metadata Auto-Discovery

### Extraction Strategy

When adding a new user with `thicket add user`, the system fetches and parses the feed to extract:

- **Display Name**: From `feed.title` or `feed.author.name`
- **Email**: From `feed.author.email` or `feed.managingEditor`
- **Homepage**: From `feed.link` or `feed.author.uri`
- **Icon**: From `feed.logo`, `feed.icon`, or `feed.image.url`

### Discovery Priority Order

1. **Author Information**: Prefer `feed.author.*` fields (more specific to the person)
2. **Feed-Level**: Fall back to feed-level metadata
3. **Manual Override**: CLI flags always take precedence over discovered values
4. **Update Behavior**: Auto-discovery only runs during the initial `add user`, not on sync
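The priority order above might look like this in code for the display-name field. `discover_display_name` is a hypothetical helper, and the feedparser-style `author_detail` dict shape is an assumption:

```python
from typing import Optional


def discover_display_name(feed: dict, override: Optional[str] = None) -> Optional[str]:
    """Apply the discovery priority: manual override > author fields > feed-level."""
    if override:                              # 3. a CLI flag always wins
        return override
    author = feed.get("author_detail") or {}  # feedparser-style author dict (assumed shape)
    if author.get("name"):                    # 1. prefer author information
        return author["name"]
    return feed.get("title")                  # 2. fall back to feed-level metadata
```

The same override-then-author-then-feed cascade would apply to email, homepage, and icon.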
### Extracted Metadata Format

```python
class FeedMetadata(BaseModel):
    title: Optional[str] = None
    author_name: Optional[str] = None
    author_email: Optional[EmailStr] = None
    author_uri: Optional[HttpUrl] = None
    link: Optional[HttpUrl] = None
    logo: Optional[HttpUrl] = None
    icon: Optional[HttpUrl] = None
    image_url: Optional[HttpUrl] = None

    def to_user_config(self, username: str, feed_url: HttpUrl) -> UserConfig:
        """Convert discovered metadata to a UserConfig with fallbacks."""
        return UserConfig(
            username=username,
            feeds=[feed_url],
            display_name=self.author_name or self.title,
            email=self.author_email,
            homepage=self.author_uri or self.link,
            icon=self.logo or self.icon or self.image_url,
        )
```

## Link Processing and Threading Architecture

### Overview

Thicket implements a link processing and threading system that creates email-style threaded views of blog entries by tracking cross-references between different blogs.

### Link Processing Pipeline

#### 1. Link Extraction (`thicket links`)

The `links` command systematically extracts all outbound links from blog entries and categorizes them:

```python
class LinkData(BaseModel):
    url: str                        # Fully resolved URL
    entry_id: str                   # Source entry ID
    username: str                   # Source username
    context: str                    # Surrounding text context
    category: str                   # "internal", "user", or "unknown"
    target_username: Optional[str]  # Target user if applicable
```

**Link Categories:**

- **Internal**: Links to the same user's domain (self-references)
- **User**: Links to other tracked users' domains
- **Unknown**: Links to external sites not tracked by thicket

#### 2. URL Resolution

All links are resolved against the Atom feed's base URL to handle:

- Relative URLs (converted to absolute)
- Protocol-relative URLs
- Fragment identifiers
- Redirects and canonical URLs
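The first three cases of the resolution step above can be sketched with the standard library. Stripping fragments during normalization is an assumption (redirect following would happen at fetch time, not here):

```python
from urllib.parse import urljoin, urldefrag


def resolve_link(base_url: str, href: str) -> str:
    """Resolve a link found in entry content against the feed's base URL."""
    absolute = urljoin(base_url, href)    # handles relative and protocol-relative URLs
    url, _fragment = urldefrag(absolute)  # drop fragment identifiers
    return url
```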
#### 3. Domain Mapping

The system builds a comprehensive domain mapping from the user configuration:

- Feed URLs → domain extraction
- Homepage URLs → domain extraction
- Reverse mapping: domain → username

### Threading System

#### 1. Reference Index Generation (`thicket index`)

Creates a bidirectional reference index from the categorized links:

```python
class BlogReference(BaseModel):
    source_entry_id: str
    source_username: str
    target_url: str
    target_username: Optional[str]
    target_entry_id: Optional[str]
    context: str
```

#### 2. Thread Detection Algorithm

Uses graph traversal to find connected blog entries:

- **Outbound references**: Links from an entry to other entries
- **Inbound references**: Links to an entry from other entries
- **Thread members**: All entries connected through references

#### 3. Threading Display (`thicket threads`)

Creates email-style threaded views:

- Chronological ordering within threads
- Reference counts (outbound/inbound)
- Context preservation
- Filtering options (user, entry, minimum size)

### Data Structures

#### links.json Format (Unified Structure)

```json
{
  "links": {
    "https://example.com/post/123": {
      "referencing_entries": ["https://blog.user.com/entry/456"],
      "target_username": "user2"
    },
    "https://external-site.com/article": {
      "referencing_entries": ["https://blog.user.com/entry/789"]
    }
  },
  "reverse_mapping": {
    "https://blog.user.com/entry/456": ["https://example.com/post/123"],
    "https://blog.user.com/entry/789": ["https://external-site.com/article"]
  },
  "references": [
    {
      "source_entry_id": "https://blog.user.com/entry/456",
      "source_username": "user1",
      "target_url": "https://example.com/post/123",
      "target_username": "user2",
      "target_entry_id": "https://example.com/post/123",
      "context": "As mentioned in this post..."
    }
  ],
  "user_domains": {
    "user1": ["blog.user.com"],
    "user2": ["example.com"]
  }
}
```

This unified structure eliminates duplication by:

- Storing each URL only once with minimal metadata
- Including all link data, reference data, and mappings in one file
- Using the presence of `target_username` to distinguish tracked from external links
- Providing bidirectional mappings for efficient queries

### Unified Structure Benefits

- **Eliminates Duplication**: Each URL appears only once with its metadata
- **Single Source of Truth**: All link-related data in one file
- **Efficient Queries**: Fast lookups in both directions (URL→entries, entry→URLs)
- **Atomic Updates**: All link data changes together
- **Reduced I/O**: Fewer file operations

### Implementation Benefits

1. **Systematic Link Processing**: All links are extracted and categorized consistently
2. **Proper URL Resolution**: Handles relative URLs and base URL resolution correctly
3. **Domain-based Categorization**: Automatically identifies user-to-user references
4. **Bidirectional Indexing**: Supports both "who links to whom" and "who is linked by whom"
5. **Thread Discovery**: Finds conversation threads automatically
6. **Rich Context**: Preserves the surrounding text for each link
7. **Performance**: Pre-computed indexes for fast threading queries
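The thread-discovery benefit rests on the graph traversal described in the Threading System section: a thread is a connected component of the reference graph, treating outbound and inbound references symmetrically. A stdlib-only sketch with illustrative reference pairs:

```python
from collections import defaultdict

# Illustrative (source_entry_id, target_entry_id) pairs from the reference index.
references = [
    ("blog.a/entry/1", "blog.b/entry/7"),
    ("blog.b/entry/7", "blog.c/entry/3"),
    ("blog.d/entry/9", "blog.e/entry/2"),
]


def find_threads(refs: list) -> list:
    """Group entries into threads: connected components of the reference graph."""
    graph = defaultdict(set)
    for src, dst in refs:
        graph[src].add(dst)  # outbound reference
        graph[dst].add(src)  # inbound reference (undirected for threading)

    seen, threads = set(), []
    for node in graph:
        if node in seen:
            continue
        stack, component = [node], set()
        while stack:  # depth-first walk of one component
            cur = stack.pop()
            if cur in component:
                continue
            component.add(cur)
            stack.extend(graph[cur] - component)
        seen |= component
        threads.append(component)
    return threads


threads = find_threads(references)
```

With the sample data this yields two threads: the chain a→b→c and the pair d→e. Chronological ordering within each thread would then use the entries' `updated` timestamps.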
### CLI Commands

```bash
# Extract and categorize all links
thicket links --verbose

# Build reference index for threading
thicket index --verbose

# Show all conversation threads
thicket threads

# Show threads for a specific user
thicket threads --username user1

# Show threads with a minimum size
thicket threads --min-size 3
```

### Integration with Existing Commands

The link processing system integrates with the existing thicket commands:

- `thicket sync` updates entries, so `thicket links` must be run afterward
- `thicket index` uses the output of `thicket links` for improved accuracy
- `thicket threads` provides the user-facing threading interface

## Current Implementation Status

### ✅ Completed Features

1. **Core Infrastructure**
   - Modern CLI with Typer and Rich
   - Pydantic data models for type safety
   - Git repository operations with GitPython
   - Feed parsing and normalization with feedparser

2. **User and Feed Management**
   - `thicket init` - Initialize git store
   - `thicket add` - Add users and feeds with auto-discovery
   - `thicket sync` - Sync feeds with progress tracking
   - `thicket list` - List users, feeds, and entries
   - `thicket duplicates` - Manage duplicate entries

3. **Link Processing and Threading**
   - `thicket links` - Extract and categorize all outbound links
   - `thicket index` - Build reference index from links
   - `thicket threads` - Display threaded conversation views
   - Proper URL resolution with base URL handling
   - Domain-based link categorization
   - Context preservation for links

### 📊 System Performance

- **Link Extraction**: Successfully processes thousands of blog entries
- **Categorization**: Identifies internal, user, and unknown links
- **Threading**: Creates email-style threaded views of conversations
- **Storage**: Efficient JSON-based data structures for links and references

### 🔧 Current Architecture Highlights

- **Modular Design**: Clear separation between CLI, core logic, and models
- **Type Safety**: Comprehensive Pydantic models for data validation
- **Rich CLI**: Beautiful progress bars, tables, and error handling
- **Extensible**: Easy to add new commands and features
- **Git Integration**: All data stored in version-controlled JSON files

### 🎯 Proven Functionality

The system has been tested with real blog data and successfully:

- Extracted 14,396 total links from blog entries
- Categorized 3,994 internal links, 363 user-to-user links, and 10,039 unknown links
- Built comprehensive domain mappings for 16 users across 20 domains
- Generated threaded views showing blog conversation patterns

### 🚀 Ready for Use

The thicket system is now fully functional for:

- Maintaining Git repositories of blog feeds
- Tracking cross-references between blogs
- Creating threaded views of blog conversations
- Discovering blog interaction patterns
- Building distributed comment systems