Manage Atom feeds in a persistent git repository
# Thicket Git Store Specification

This document comprehensively defines the JSON format and structure of the Thicket Git repository, enabling third-party clients to read and write to the store while leveraging Thicket's existing Python classes for data validation and business logic.

## Overview

The Thicket Git store is a structured repository that persists Atom/RSS feed entries in JSON format. The store is designed to be both human-readable and machine-parseable, with a clear directory structure and standardized JSON schemas.

## Repository Structure

```
<git_store>/
├── index.json           # Main index of all users and metadata
├── duplicates.json      # Maps duplicate entry IDs to canonical IDs
├── index.opml           # OPML export of all feeds (generated)
├── <username1>/         # User directory (sanitized username)
│   ├── <entry_id1>.json # Individual feed entry
│   ├── <entry_id2>.json # Individual feed entry
│   └── ...
├── <username2>/
│   ├── <entry_id3>.json
│   └── ...
└── ...
```

## JSON Schemas

### 1. Index File (`index.json`)

The main index tracks all users, their metadata, and repository statistics.

**Schema:**
```json
{
  "users": {
    "<username>": {
      "username": "string",
      "display_name": "string | null",
      "email": "string | null",
      "homepage": "string (URL) | null",
      "icon": "string (URL) | null",
      "feeds": ["string (URL)", ...],
      "zulip_associations": [
        {
          "server": "string",
          "user_id": "string"
        },
        ...
      ],
      "directory": "string",
      "created": "string (ISO 8601 datetime)",
      "last_updated": "string (ISO 8601 datetime)",
      "entry_count": "integer"
    }
  },
  "created": "string (ISO 8601 datetime)",
  "last_updated": "string (ISO 8601 datetime)",
  "total_entries": "integer"
}
```

**Example:**
```json
{
  "users": {
    "johndoe": {
      "username": "johndoe",
      "display_name": "John Doe",
      "email": "john@example.com",
      "homepage": "https://johndoe.blog",
      "icon": "https://johndoe.blog/avatar.png",
      "feeds": [
        "https://johndoe.blog/feed.xml",
        "https://johndoe.blog/categories/tech/feed.xml"
      ],
      "zulip_associations": [
        {
          "server": "myorg.zulipchat.com",
          "user_id": "john.doe"
        },
        {
          "server": "community.zulipchat.com",
          "user_id": "johndoe@example.com"
        }
      ],
      "directory": "johndoe",
      "created": "2024-01-15T10:30:00",
      "last_updated": "2024-01-20T14:22:00",
      "entry_count": 42
    }
  },
  "created": "2024-01-15T10:30:00",
  "last_updated": "2024-01-20T14:22:00",
  "total_entries": 42
}
```

### 2. Duplicates File (`duplicates.json`)

Maps duplicate entry IDs to their canonical representations to handle feed entries that appear with different IDs but identical content.

**Schema:**
```json
{
  "duplicates": {
    "<duplicate_id>": "<canonical_id>"
  },
  "comment": "Entry IDs that map to the same canonical content"
}
```

**Example:**
```json
{
  "duplicates": {
    "https://example.com/posts/123?utm_source=rss": "https://example.com/posts/123",
    "https://example.com/feed/item-duplicate": "https://example.com/feed/item-original"
  },
  "comment": "Entry IDs that map to the same canonical content"
}
```

### 3. Feed Entry Files (`<username>/<entry_id>.json`)

Individual feed entries are stored as normalized Atom entries, regardless of their original format (RSS/Atom).

**Schema:**
```json
{
  "id": "string",
  "title": "string",
  "link": "string (URL)",
  "updated": "string (ISO 8601 datetime)",
  "published": "string (ISO 8601 datetime) | null",
  "summary": "string | null",
  "content": "string | null",
  "content_type": "html | text | xhtml",
  "author": {
    "name": "string | null",
    "email": "string | null",
    "uri": "string (URL) | null"
  } | null,
  "categories": ["string", ...],
  "rights": "string | null",
  "source": "string (URL) | null"
}
```

**Example:**
```json
{
  "id": "https://johndoe.blog/posts/my-first-post",
  "title": "My First Blog Post",
  "link": "https://johndoe.blog/posts/my-first-post",
  "updated": "2024-01-20T14:22:00",
  "published": "2024-01-20T09:00:00",
  "summary": "This is a summary of my first blog post.",
  "content": "<p>This is the full content of my <strong>first</strong> blog post with HTML formatting.</p>",
  "content_type": "html",
  "author": {
    "name": "John Doe",
    "email": "john@example.com",
    "uri": "https://johndoe.blog"
  },
  "categories": ["blogging", "personal"],
  "rights": "Copyright 2024 John Doe",
  "source": "https://johndoe.blog/feed.xml"
}
```

## Python Class Integration

To leverage Thicket's existing validation and business logic, third-party clients should use the following Python classes from the `thicket.models` package:

### Core Data Models

```python
from thicket.models import (
    AtomEntry,         # Feed entry representation
    GitStoreIndex,     # Repository index
    UserMetadata,      # User information
    DuplicateMap,      # Duplicate ID mappings
    FeedMetadata,      # Feed-level metadata
    ThicketConfig,     # Configuration
    UserConfig,        # User configuration
    ZulipAssociation,  # Zulip server/user_id pairs
)
```

### Repository Operations

```python
from pathlib import Path

from thicket.core.git_store import GitStore
from thicket.core.feed_parser import FeedParser

# Initialize git store
store = GitStore(Path("/path/to/git/store"))

# Read data
index = store._load_index()             # Load index.json
user = store.get_user("username")       # Get user metadata
entries = store.list_entries("username", limit=10)
entry = store.get_entry("username", "entry_id")
duplicates = store.get_duplicates()     # Load duplicates.json

# Write data (atom_entry is an AtomEntry instance)
store.add_user("username", display_name="Display Name")
store.store_entry("username", atom_entry)
store.add_duplicate("duplicate_id", "canonical_id")
store.commit_changes("Commit message")

# Zulip associations
store.add_zulip_association("username", "myorg.zulipchat.com", "user@example.com")
store.remove_zulip_association("username", "myorg.zulipchat.com", "user@example.com")
associations = store.get_zulip_associations("username")

# Search and statistics
results = store.search_entries("query", username="optional")
stats = store.get_stats()
```

### Feed Processing

```python
import asyncio

from pydantic import HttpUrl
from thicket.core.feed_parser import FeedParser


async def main() -> None:
    parser = FeedParser()
    source_url = HttpUrl("https://example.com/feed.xml")

    # Fetch and parse feeds (fetch_feed is a coroutine, so it must be awaited)
    content = await parser.fetch_feed(source_url)
    feed_metadata, entries = parser.parse_feed(content, source_url)

    # Entry ID sanitization for filenames
    for entry in entries:
        safe_filename = parser.sanitize_entry_id(entry.id)
        print(safe_filename)


asyncio.run(main())
```

## File Naming and ID Sanitization

Entry IDs from feeds are sanitized to create safe filenames using `FeedParser.sanitize_entry_id()`:

- URLs are parsed and the path component is used as the base
- Characters are limited to alphanumeric, hyphens, underscores, and periods
- Other characters are replaced with underscores
- Maximum length is 200 characters
- Empty results default to "entry"

**Examples:**
- `https://example.com/posts/my-post` → `posts_my-post.json`
- `https://blog.com/2024/01/title?utm=source` → `2024_01_title.json`

## Data Validation

All JSON data should be validated using Pydantic models before writing to the store:

```python
from thicket.models import AtomEntry
from pydantic import ValidationError

# json_data (a dict), store (a GitStore), and username are supplied by the caller
try:
    entry = AtomEntry(**json_data)
    # Data is valid, safe to store
    store.store_entry(username, entry)
except ValidationError as e:
    # Handle validation errors
    print(f"Invalid entry data: {e}")
```

## Timestamps

All timestamps use ISO 8601 format in UTC:
- `created`: When the record was first created
- `last_updated`: When the record was last modified
- `updated`: When the feed entry was last updated (from feed)
- `published`: When the feed entry was originally published (from feed)

## Content Sanitization

HTML content in entries is sanitized using the `FeedParser._sanitize_html()` method to prevent XSS attacks. Allowed tags and attributes are strictly controlled.

**Allowed HTML tags:**
`a`, `abbr`, `acronym`, `b`, `blockquote`, `br`, `code`, `em`, `i`, `li`, `ol`, `p`, `pre`, `strong`, `ul`, `h1`-`h6`, `img`, `div`, `span`

**Allowed attributes:**
- `a`: `href`, `title`
- `img`: `src`, `alt`, `title`, `width`, `height`
- `blockquote`: `cite`
- `abbr`/`acronym`: `title`

## Error Handling and Robustness

The store is designed to be fault-tolerant:

- Invalid entries are skipped during processing with error logging
- Malformed JSON files are ignored in listings
- Missing files return `None` rather than raising exceptions
- Git operations are atomic where possible

## Example Usage

### Reading the Store

```python
from pathlib import Path
from thicket.core.git_store import GitStore

# Initialize
store = GitStore(Path("/path/to/thicket/store"))

# Get all users
index = store._load_index()
for username, user_metadata in index.users.items():
    print(f"User: {user_metadata.display_name} ({username})")
    print(f"  Feeds: {user_metadata.feeds}")
    print(f"  Entries: {user_metadata.entry_count}")

# Get recent entries for a user
entries = store.list_entries("johndoe", limit=5)
for entry in entries:
    print(f"  - {entry.title} ({entry.updated})")
```

### Adding Data

```python
from thicket.models import AtomEntry
from datetime import datetime
from pydantic import HttpUrl

# Create entry
entry = AtomEntry(
    id="https://example.com/new-post",
    title="New Post",
    link=HttpUrl("https://example.com/new-post"),
    updated=datetime.now(),
    content="<p>Post content</p>",
    content_type="html"
)

# Store entry (reuses the `store` instance created in "Reading the Store")
store.store_entry("johndoe", entry)
store.commit_changes("Add new blog post")
```

## Zulip Integration

The Thicket Git store supports Zulip bot integration for automatic feed posting with user mentions.

### Zulip Associations

Users can be associated with their Zulip identities to enable @mentions:

```python
from thicket.models import ZulipAssociation

# `user` is a UserMetadata instance, e.g. user = store.get_user("alice")
# UserMetadata includes zulip_associations field
user.zulip_associations = [
    ZulipAssociation(server="myorg.zulipchat.com", user_id="alice"),
    ZulipAssociation(server="other.zulipchat.com", user_id="alice@example.com")
]

# Methods for managing associations
user.add_zulip_association("myorg.zulipchat.com", "alice")
user.get_zulip_mention("myorg.zulipchat.com")  # Returns "alice"
user.remove_zulip_association("myorg.zulipchat.com", "alice")
```

### CLI Management

```bash
# Add association
thicket zulip-add alice myorg.zulipchat.com alice@example.com

# Remove association
thicket zulip-remove alice myorg.zulipchat.com alice@example.com

# List associations
thicket zulip-list            # All users
thicket zulip-list alice      # Specific user

# Bulk import from CSV
thicket zulip-import associations.csv
```

### Bot Behavior

When the Thicket Zulip bot posts articles:

1. It checks for Zulip associations matching the current server
2. If found, adds @mention to the post: `@**alice** posted:`
3. The mentioned user receives a notification in Zulip

This enables automatic notifications when someone's blog post is shared.

## Versioning and Compatibility

This specification describes version 1.1 of the Thicket Git store format. Changes from 1.0:
- Added `zulip_associations` field to UserMetadata (backwards compatible - defaults to empty list)

Future versions will maintain backward compatibility where possible, with migration tools provided for breaking changes.

To check the store format version, examine the repository structure and JSON schemas. Stores created by Thicket 0.1.0+ follow this specification.
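
The schemas in this document carry no explicit version field, so a client that wants to examine a store programmatically has to infer the format from its contents. The following is a minimal sketch of one such heuristic, using only the standard library. `looks_like_v1_1` is a hypothetical helper, not part of Thicket, and it assumes that a store written in the 1.1 format includes the `zulip_associations` key on its user records.

```python
import json
from pathlib import Path


def looks_like_v1_1(store_path: Path) -> bool:
    """Heuristic check (hypothetical helper, not a Thicket API).

    Returns True if any user record in index.json carries the
    `zulip_associations` key, which this specification lists as the
    only change between format 1.0 and 1.1.
    """
    index_file = store_path / "index.json"
    index = json.loads(index_file.read_text(encoding="utf-8"))
    users = index.get("users", {})
    return any("zulip_associations" in record for record in users.values())
```

A store with no users returns `False`, so the result cannot distinguish a 1.0 store from an empty one. Because the field defaults to an empty list, a reader built on Thicket's own models should be able to load either kind of store.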