commit e51bc85ec5fe2798bcfb740f48c55a6f1b4cc143 · anil.recoil.org/thicket

+332
ARCH.md

# Thicket Architecture Design

## Overview
Thicket is a modern CLI tool for persisting Atom/RSS feeds in a Git repository, designed to enable distributed weblog comment structures.

     

## Technology Stack

### Core Libraries

#### CLI Framework
- **Typer** (0.15.x) - Modern CLI framework with type hints
- **Rich** (13.x) - Beautiful terminal output, progress bars, and tables
- **prompt-toolkit** - Interactive prompts when needed

#### Feed Processing
- **feedparser** (6.0.11) - Universal feed parser supporting RSS 0.9x, RSS 1.0, RSS 2.0, CDF, Atom 0.3, and Atom 1.0
  - Alternative: **atoma** for stricter Atom/RSS parsing with JSON feed support
  - Alternative: **fastfeedparser** for high-performance parsing (advertised as roughly 10x faster than feedparser)

#### Git Integration
- **GitPython** (3.1.44) - High-level git operations, requires git CLI
  - Alternative: **pygit2** (1.18.0) - Direct libgit2 bindings, better for authentication

#### HTTP Client
- **httpx** (0.28.x) - Modern async/sync HTTP client with connection pooling
- **aiohttp** (3.11.x) - For async-only operations if needed

#### Configuration & Data Models
- **pydantic** (2.11.x) - Data validation and settings management
- **pydantic-settings** (2.10.x) - Configuration file handling with env var support

#### Utilities
- **pendulum** (3.x) - Better datetime handling
- **bleach** (6.x) - HTML sanitization for feed content
- **platformdirs** (4.x) - Cross-platform directory paths

     

## Project Structure

```
thicket/
├── pyproject.toml          # Modern Python packaging
├── README.md               # Project documentation
├── ARCH.md                 # This file
├── CLAUDE.md               # Project instructions
├── .gitignore
├── src/
│   └── thicket/
│       ├── __init__.py
│       ├── __main__.py     # Entry point for `python -m thicket`
│       ├── cli/            # CLI commands and interface
│       │   ├── __init__.py
│       │   ├── main.py     # Main CLI app with Typer
│       │   ├── commands/   # Subcommands
│       │   │   ├── __init__.py
│       │   │   ├── init.py      # Initialize git store
│       │   │   ├── add.py       # Add feed to config
│       │   │   ├── sync.py      # Sync feeds
│       │   │   ├── list.py      # List users/feeds
│       │   │   └── search.py    # Search entries
│       │   └── utils.py    # CLI utilities (progress, formatting)
│       ├── core/           # Core business logic
│       │   ├── __init__.py
│       │   ├── feed_parser.py   # Feed parsing and normalization
│       │   ├── git_store.py     # Git repository operations
│       │   ├── cache.py         # Cache management
│       │   └── sanitizer.py     # Filename and HTML sanitization
│       ├── models/         # Pydantic data models
│       │   ├── __init__.py
│       │   ├── config.py        # Configuration models
│       │   ├── feed.py          # Feed/Entry models
│       │   └── user.py          # User metadata models
│       └── utils/          # Shared utilities
│           ├── __init__.py
│           ├── paths.py         # Path handling
│           └── network.py       # HTTP client wrapper
├── tests/
│   ├── __init__.py
│   ├── conftest.py         # pytest configuration
│   ├── test_feed_parser.py
│   ├── test_git_store.py
│   └── fixtures/           # Test data
│       └── feeds/
└── docs/
    └── examples/           # Example configurations
```

     

## Data Models

### Configuration File (YAML/TOML)
```python
from pathlib import Path
from typing import Optional

from pydantic import BaseModel, EmailStr, HttpUrl
from pydantic_settings import BaseSettings, SettingsConfigDict, YamlConfigSettingsSource


class UserConfig(BaseModel):  # defined first so ThicketConfig can reference it
    username: str
    feeds: list[HttpUrl]
    email: Optional[EmailStr] = None
    homepage: Optional[HttpUrl] = None
    icon: Optional[HttpUrl] = None
    display_name: Optional[str] = None


class ThicketConfig(BaseSettings):
    git_store: Path  # Git repository location
    cache_dir: Path  # Cache directory
    users: list[UserConfig]

    model_config = SettingsConfigDict(
        env_prefix="THICKET_",
        env_file=".env",
        yaml_file="thicket.yaml",
    )

    @classmethod
    def settings_customise_sources(
        cls, settings_cls, init_settings, env_settings, dotenv_settings, file_secret_settings
    ):
        # pydantic-settings reads yaml_file only when a YAML source is registered
        return (init_settings, env_settings, dotenv_settings, YamlConfigSettingsSource(settings_cls))
```

     

### Feed Storage Format
```python
from datetime import datetime
from typing import Optional

from pydantic import BaseModel, ConfigDict, HttpUrl


class AtomEntry(BaseModel):
    id: str  # Original Atom ID
    title: str
    link: HttpUrl
    updated: datetime
    # Pydantic v2 needs an explicit default for a field to be omittable
    published: Optional[datetime] = None
    summary: Optional[str] = None
    content: Optional[str] = None  # Full body content from Atom entry
    content_type: Optional[str] = "html"  # text, html, xhtml
    author: Optional[dict] = None
    categories: list[str] = []
    rights: Optional[str] = None  # Copyright info
    source: Optional[str] = None  # Source feed URL
    # Additional Atom fields preserved during RSS->Atom conversion

    model_config = ConfigDict(
        # json_encoders is deprecated in Pydantic v2, which already emits ISO 8601 datetimes
        json_encoders={
            datetime: lambda v: v.isoformat()
        }
    )


class DuplicateMap(BaseModel):
    """Maps duplicate entry IDs to canonical entry IDs"""
    duplicates: dict[str, str] = {}  # duplicate_id -> canonical_id
    comment: str = "Entry IDs that map to the same canonical content"

    def add_duplicate(self, duplicate_id: str, canonical_id: str) -> None:
        """Add a duplicate mapping"""
        self.duplicates[duplicate_id] = canonical_id

    def remove_duplicate(self, duplicate_id: str) -> bool:
        """Remove a duplicate mapping. Returns True if existed."""
        return self.duplicates.pop(duplicate_id, None) is not None

    def get_canonical(self, entry_id: str) -> str:
        """Get canonical ID for an entry (returns original if not duplicate)"""
        return self.duplicates.get(entry_id, entry_id)

    def is_duplicate(self, entry_id: str) -> bool:
        """Check if entry ID is marked as duplicate"""
        return entry_id in self.duplicates
```

     

## Git Repository Structure
```
git-store/
├── index.json              # User directory index
├── duplicates.json         # Manual curation of duplicate entries
├── user1/
│   ├── metadata.json       # User metadata
│   ├── entry_id_1.json     # Sanitized entry files
│   ├── entry_id_2.json
│   └── ...
└── user2/
    └── ...
```

     

## Key Design Decisions

### 1. Feed Normalization & Auto-Discovery
- All RSS feeds converted to Atom format before storage
- Preserves maximum metadata during conversion
- Sanitizes HTML content to prevent XSS
- **Auto-discovery**: Extracts user metadata from feed during `add user` command
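
The RSS-to-Atom conversion above might look roughly like the following. This is a minimal sketch, not the project's implementation: `rss_item_to_atom` is a hypothetical name, and the input is assumed to be a plain dict keyed by RSS 2.0 element names (`guid`, `pubDate`, `description`, `category`) rather than a feedparser result. The output keys follow the `AtomEntry` fields defined under Feed Storage Format.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Any


def rss_item_to_atom(item: dict[str, Any], feed_url: str) -> dict[str, Any]:
    """Map a parsed RSS 2.0 <item> (as a plain dict) onto Atom-style fields."""
    published = None
    if item.get("pubDate"):
        # RSS uses RFC 822 dates; Atom expects RFC 3339 / ISO 8601 timestamps.
        published = parsedate_to_datetime(item["pubDate"]).astimezone(timezone.utc)

    link = item.get("link", "")
    return {
        # Atom requires an id; fall back to the link when RSS has no <guid>.
        "id": item.get("guid") or link,
        "title": item.get("title", ""),
        "link": link,
        # RSS has no separate "updated" timestamp, so reuse pubDate if present.
        "updated": (published or datetime.now(timezone.utc)).isoformat(),
        "published": published.isoformat() if published else None,
        "summary": item.get("description"),
        "categories": list(item.get("category", [])),
        "source": feed_url,
    }
```

Falling back to the current time for `updated` is an assumption made here so that the required Atom field is always populated; a real implementation may prefer the feed-level date or the HTTP `Last-Modified` value.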

     

### 2. ID Sanitization
- Consistent algorithm to convert Atom IDs to safe filenames
- Handles edge cases (very long IDs, special characters)
- Maintains reversibility where possible
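
One possible shape for such an algorithm is sketched below. The function name, the character whitelist, and the 200-character limit are all assumptions for illustration; the document does not specify them. IDs that are already safe map to themselves (so they stay reversible), while anything that had to be altered or truncated gets a short hash suffix so that distinct IDs do not collide.

```python
import hashlib
import re

MAX_STEM_LENGTH = 200  # assumed limit; many filesystems cap names at 255 bytes


def sanitize_entry_id(entry_id: str) -> str:
    """Turn an Atom ID into a filesystem-safe file stem."""
    # Replace every character outside a conservative whitelist with "_".
    stem = re.sub(r"[^A-Za-z0-9._-]", "_", entry_id)
    # Strip leading dots so the result is never hidden and never "." or "..".
    stem = stem.lstrip(".") or "_"
    if stem != entry_id or len(stem) > MAX_STEM_LENGTH:
        # Lossy mapping: append a short hash of the original ID so that two
        # different IDs are very unlikely to share a filename.
        digest = hashlib.sha256(entry_id.encode("utf-8")).hexdigest()[:12]
        stem = f"{stem[:MAX_STEM_LENGTH - 13]}-{digest}"
    return stem
```

Because the original ID is also stored inside each entry's JSON (`AtomEntry.id`), the mapping can be recovered from the file contents even when the filename itself is not reversible.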

     

### 3. Git Operations
- Uses GitPython for simplicity (no authentication required)
- Single main branch for all users and entries
- Atomic commits per sync operation
- Meaningful commit messages with feed update summaries
- Preserves complete history - never delete entries even if they disappear from feeds
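
To illustrate what a commit message with a feed update summary could look like, here is a small sketch. `FeedUpdate` and `build_commit_message` are hypothetical names, and the message layout is an assumption rather than a format defined by this document.

```python
from dataclasses import dataclass


@dataclass
class FeedUpdate:
    """Hypothetical per-user result of one sync run."""
    username: str
    new_entries: int
    updated_entries: int


def build_commit_message(updates: list[FeedUpdate]) -> str:
    """Summarize one sync run as a single commit message."""
    # Users with no changes are left out of the summary.
    changed = [u for u in updates if u.new_entries or u.updated_entries]
    total_new = sum(u.new_entries for u in changed)
    total_updated = sum(u.updated_entries for u in changed)
    subject = f"Sync: {total_new} new, {total_updated} updated across {len(changed)} user(s)"
    body = [f"- {u.username}: {u.new_entries} new, {u.updated_entries} updated" for u in changed]
    # Git convention: subject line, blank line, then the body.
    return "\n".join([subject, "", *body])
```

With GitPython, a sync would then stage all written files with `repo.index.add(...)` and create exactly one commit with `repo.index.commit(message)`, which is what keeps each sync atomic.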

     

### 4. Caching Strategy
- HTTP caching with Last-Modified/ETag support
- Local cache of parsed feeds with TTL
- Cache invalidation on configuration changes
- Git store serves as permanent historical archive beyond feed depth limits
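
The first two points can be sketched as a TTL check followed by a conditional GET. `CacheRecord`, `is_fresh`, and `conditional_headers` are hypothetical names, and how the record is persisted is left open; only the `If-None-Match` and `If-Modified-Since` header names come from the HTTP specification.

```python
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class CacheRecord:
    """Hypothetical cache metadata stored per feed URL."""
    etag: Optional[str] = None
    last_modified: Optional[str] = None
    fetched_at: float = 0.0  # Unix timestamp of the last successful fetch


def is_fresh(record: CacheRecord, ttl_seconds: float, now: Optional[float] = None) -> bool:
    """True while the cached copy is younger than the TTL (no request needed)."""
    now = time.time() if now is None else now
    return (now - record.fetched_at) < ttl_seconds


def conditional_headers(record: CacheRecord) -> dict[str, str]:
    """Build validators for a conditional GET; the server may answer 304."""
    headers: dict[str, str] = {}
    if record.etag:
        headers["If-None-Match"] = record.etag
    if record.last_modified:
        headers["If-Modified-Since"] = record.last_modified
    return headers
```

A sync would skip feeds for which `is_fresh` is true, send the conditional headers otherwise, and treat a `304 Not Modified` response as "nothing to parse".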

     

### 5. Error Handling
- Graceful handling of feed parsing errors
- Retry logic for network failures
- Clear error messages with recovery suggestions
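
Retry logic for network failures is commonly implemented as exponential backoff. The helper below is a sketch under that assumption; the name `retry`, the default of three attempts, and the doubling delay are illustrative choices, not requirements stated in this document.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry(
    operation: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 1.0,
    retry_on: tuple[type[BaseException], ...] = (ConnectionError, TimeoutError),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `operation`, retrying transient failures with exponential backoff."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: let the caller report the error
            sleep(base_delay * 2 ** (attempt - 1))  # 1s, 2s, 4s, ...
    raise ValueError("attempts must be at least 1")
```

Only exception types listed in `retry_on` are retried, so parsing errors surface immediately. With httpx, `retry_on` would likely include `httpx.TransportError`. The `sleep` parameter exists so tests can run without waiting.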

     

## CLI Command Structure

```bash
# Initialize a new git store
thicket init /path/to/store

# Add a user with feeds (auto-discovers metadata from feed)
thicket add user "alyssa" \
  --feed "https://example.com/feed.atom"
  # Auto-populates: email, homepage, icon, display_name from feed metadata

# Add a user with manual overrides
thicket add user "alyssa" \
  --feed "https://example.com/feed.atom" \
  --email "alyssa@example.com" \
  --homepage "https://alyssa.example.com" \
  --icon "https://example.com/avatar.png" \
  --display-name "Alyssa P. Hacker"

# Add additional feed to existing user
thicket add feed "alyssa" "https://example.com/other-feed.rss"

# Sync all feeds (designed for cron usage)
thicket sync --all

# Sync specific user
thicket sync --user alyssa

# List users and their feeds
thicket list users
thicket list feeds --user alyssa

# Search entries
thicket search "keyword" --user alyssa --since 2025-01-01

# Manage duplicate entries
thicket duplicates list
thicket duplicates add <entry_id_1> <entry_id_2>  # Mark as duplicates
thicket duplicates remove <entry_id_1> <entry_id_2>  # Unmark duplicates
```

     

## Performance Considerations

1. **Concurrent Feed Fetching**: Use httpx with asyncio for parallel downloads
2. **Incremental Updates**: Only fetch/parse feeds that have changed
3. **Efficient Git Operations**: Batch commits, use shallow clones where appropriate
4. **Progress Feedback**: Rich progress bars for long operations
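
Concurrent fetching with asyncio could be structured as below. This is a sketch only: `fetch_all` is a hypothetical helper, the concurrency limit of 10 is an arbitrary assumption, and the actual HTTP call is injected as `fetch` so that the example runs without network access. In practice `fetch` would wrap a shared `httpx.AsyncClient`.

```python
import asyncio
from typing import Awaitable, Callable, Union


async def fetch_all(
    urls: list[str],
    fetch: Callable[[str], Awaitable[bytes]],
    max_concurrency: int = 10,
) -> dict[str, Union[bytes, BaseException]]:
    """Fetch every URL in parallel, keeping at most `max_concurrency` in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(url: str) -> bytes:
        async with semaphore:
            return await fetch(url)

    # return_exceptions=True keeps one failing feed from aborting the whole sync;
    # failures come back as exception objects in the result mapping.
    results = await asyncio.gather(*(guarded(u) for u in urls), return_exceptions=True)
    return dict(zip(urls, results))
```

The caller can then iterate over the mapping, store successful bodies, and report the exceptions per feed.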

     

## Security Considerations

1. **HTML Sanitization**: Use bleach to clean feed content
2. **URL Validation**: Strict validation of feed URLs
3. **Git Security**: No credentials stored in repository
4. **Path Traversal**: Careful sanitization of filenames
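
Points 2 and 4 can be sketched with the standard library alone. Both function names are hypothetical, and the rule "only `http` and `https` are allowed" is an assumption about what strict validation means here. The path check is a second line of defense on top of filename sanitization: it resolves the final path and refuses anything outside the store root.

```python
from pathlib import Path
from urllib.parse import urlparse


def is_allowed_feed_url(url: str) -> bool:
    """Accept only absolute http(s) URLs with a host component."""
    parsed = urlparse(url)
    return parsed.scheme in {"http", "https"} and bool(parsed.netloc)


def safe_entry_path(store_root: Path, username: str, filename: str) -> Path:
    """Resolve <store_root>/<username>/<filename>, refusing to leave the store."""
    root = store_root.resolve()
    candidate = (root / username / filename).resolve()
    # After resolving ".." segments and symlinks, the store root must still be
    # an ancestor of the target path.
    if root not in candidate.parents:
        raise ValueError(f"path escapes the git store: {candidate}")
    return candidate
```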

     

## Future Enhancements

1. **Web Interface**: Optional web UI for browsing the git store
2. **Webhooks**: Notify external services on feed updates
3. **Feed Discovery**: Auto-discover feeds from HTML pages
4. **Export Formats**: Generate static sites, OPML exports
5. **Federation**: P2P sync between thicket instances

     

## Requirements Clarification

**✓ Resolved Requirements:**
1. **Feed Update Frequency**: Designed for cron usage - no built-in scheduling needed
2. **Duplicate Handling**: Manual curation via `duplicates.json` file with CLI commands
3. **Git Branching**: Single main branch for all users and entries
4. **Authentication**: No feeds require authentication currently
5. **Content Storage**: Store complete Atom entry body content as provided
6. **Deleted Entries**: Preserve all entries in Git store permanently (historical archive)
7. **History Depth**: Git store maintains full history beyond feed depth limits
8. **Feed Auto-Discovery**: Extract user metadata from feed during `add user` command

     

## Duplicate Entry Management

### Duplicate Detection Strategy
- **Manual Curation**: Duplicates identified and managed manually via CLI
- **Storage**: `duplicates.json` file in Git root maps entry IDs to canonical entries
- **Structure**: `{"duplicates": {"duplicate_id": "canonical_id", ...}, "comment": "..."}`, matching the `DuplicateMap` model
- **CLI Commands**: Add/remove duplicate mappings with validation
- **Query Resolution**: Search/list commands resolve duplicates to canonical entries

### Duplicate File Format
```json
{
  "duplicates": {
    "https://example.com/feed/entry/123": "https://canonical.com/posts/same-post",
    "https://mirror.com/articles/456": "https://canonical.com/posts/same-post"
  },
  "comment": "Entry IDs that map to the same canonical content"
}
```
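
Query resolution and mapping validation could work along the following lines. This sketch operates on an in-memory `duplicate_id -> canonical_id` mapping and does not depend on how the file is laid out. Both function names are hypothetical, and the validation rules (no self-mapping, no chains of duplicates) are assumptions about what "with validation" should cover.

```python
def resolve_duplicates(entry_ids: list[str], duplicates: dict[str, str]) -> list[str]:
    """Replace duplicate IDs with their canonical ID, preserving first-seen order."""
    seen: set[str] = set()
    resolved: list[str] = []
    for entry_id in entry_ids:
        canonical = duplicates.get(entry_id, entry_id)
        if canonical not in seen:
            seen.add(canonical)
            resolved.append(canonical)
    return resolved


def validate_mapping(duplicate_id: str, canonical_id: str, duplicates: dict[str, str]) -> None:
    """Reject mappings that would make resolution ambiguous (assumed rules)."""
    if duplicate_id == canonical_id:
        raise ValueError("an entry cannot be a duplicate of itself")
    if canonical_id in duplicates:
        raise ValueError("canonical entry is itself marked as a duplicate")
    if duplicate_id in duplicates.values():
        raise ValueError("entry is already canonical for other duplicates")
```

Forbidding chains means a single dictionary lookup is always enough to reach the canonical entry, which matches the one-step lookup in `DuplicateMap.get_canonical`.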

     

## Feed Metadata Auto-Discovery

### Extraction Strategy
When adding a new user with `thicket add user`, the system fetches and parses the feed to extract:

- **Display Name**: From `feed.author.name`, falling back to `feed.title`
- **Email**: From `feed.author.email`, falling back to `feed.managingEditor`
- **Homepage**: From `feed.author.uri`, falling back to `feed.link`
- **Icon**: From `feed.logo`, then `feed.icon`, then `feed.image.url`

### Discovery Priority Order
1. **Author Information**: Prefer `feed.author.*` fields (more specific to person)
2. **Feed-Level**: Fall back to feed-level metadata
3. **Manual Override**: CLI flags always take precedence over discovered values
4. **Update Behavior**: Auto-discovery only runs during initial `add user`, not on sync
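
The precedence rules above amount to "first non-empty value wins" over an ordered list of candidates. A minimal sketch, assuming discovered values arrive as a dict keyed by the `FeedMetadata` field names and CLI overrides as a dict keyed by the `UserConfig` field names (`first_set` and `resolve_user_fields` are hypothetical helpers):

```python
from typing import Optional


def first_set(*candidates: Optional[str]) -> Optional[str]:
    """Return the first candidate that is neither None nor empty."""
    return next((c for c in candidates if c), None)


def resolve_user_fields(
    discovered: dict[str, Optional[str]],
    overrides: dict[str, Optional[str]],
) -> dict[str, Optional[str]]:
    """Apply the priority order: CLI override, then author fields, then feed-level fields."""
    d, o = discovered.get, overrides.get
    return {
        "display_name": first_set(o("display_name"), d("author_name"), d("title")),
        "email": first_set(o("email"), d("author_email")),
        "homepage": first_set(o("homepage"), d("author_uri"), d("link")),
        "icon": first_set(o("icon"), d("logo"), d("icon"), d("image_url")),
    }
```

Since auto-discovery only runs during the initial `add user`, this resolution would happen once, and the result would be written to the configuration.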

     

### Extracted Metadata Format
```python
from typing import Optional

from pydantic import BaseModel, EmailStr, HttpUrl

# Assumed module path, following the project structure (models/config.py)
from thicket.models.config import UserConfig


class FeedMetadata(BaseModel):
    title: Optional[str] = None
    author_name: Optional[str] = None
    author_email: Optional[EmailStr] = None
    author_uri: Optional[HttpUrl] = None
    link: Optional[HttpUrl] = None
    logo: Optional[HttpUrl] = None
    icon: Optional[HttpUrl] = None
    image_url: Optional[HttpUrl] = None

    def to_user_config(self, username: str, feed_url: HttpUrl) -> UserConfig:
        """Convert discovered metadata to UserConfig with fallbacks"""
        return UserConfig(
            username=username,
            feeds=[feed_url],
            # Author-level fields win over feed-level fields
            display_name=self.author_name or self.title,
            email=self.author_email,
            homepage=self.author_uri or self.link,
            icon=self.logo or self.icon or self.image_url
        )
```