# Thicket Git Store Specification

This document comprehensively defines the JSON format and structure of the Thicket Git repository, enabling third-party clients to read and write to the store while leveraging Thicket's existing Python classes for data validation and business logic.

## Overview

The Thicket Git store is a structured repository that persists Atom/RSS feed entries in JSON format. The store is designed to be both human-readable and machine-parseable, with a clear directory structure and standardized JSON schemas.

## Repository Structure

```
/
├── index.json              # Main index of all users and metadata
├── duplicates.json         # Maps duplicate entry IDs to canonical IDs
├── index.opml              # OPML export of all feeds (generated)
├── <username>/             # User directory (sanitized username)
│   ├── <entry_id>.json     # Individual feed entry
│   ├── <entry_id>.json     # Individual feed entry
│   └── ...
├── <username>/
│   ├── <entry_id>.json
│   └── ...
└── ...
```

## JSON Schemas

### 1. Index File (`index.json`)

The main index tracks all users, their metadata, and repository statistics.

**Schema:**

```json
{
  "users": {
    "<username>": {
      "username": "string",
      "display_name": "string | null",
      "email": "string | null",
      "homepage": "string (URL) | null",
      "icon": "string (URL) | null",
      "feeds": ["string (URL)", ...],
      "zulip_associations": [
        {
          "server": "string",
          "user_id": "string"
        },
        ...
      ],
      "directory": "string",
      "created": "string (ISO 8601 datetime)",
      "last_updated": "string (ISO 8601 datetime)",
      "entry_count": "integer"
    }
  },
  "created": "string (ISO 8601 datetime)",
  "last_updated": "string (ISO 8601 datetime)",
  "total_entries": "integer"
}
```

**Example:**

```json
{
  "users": {
    "johndoe": {
      "username": "johndoe",
      "display_name": "John Doe",
      "email": "john@example.com",
      "homepage": "https://johndoe.blog",
      "icon": "https://johndoe.blog/avatar.png",
      "feeds": [
        "https://johndoe.blog/feed.xml",
        "https://johndoe.blog/categories/tech/feed.xml"
      ],
      "zulip_associations": [
        {
          "server": "myorg.zulipchat.com",
          "user_id": "john.doe"
        },
        {
          "server": "community.zulipchat.com",
          "user_id": "johndoe@example.com"
        }
      ],
      "directory": "johndoe",
      "created": "2024-01-15T10:30:00",
      "last_updated": "2024-01-20T14:22:00",
      "entry_count": 42
    }
  },
  "created": "2024-01-15T10:30:00",
  "last_updated": "2024-01-20T14:22:00",
  "total_entries": 42
}
```

### 2. Duplicates File (`duplicates.json`)

Maps duplicate entry IDs to their canonical representations to handle feed entries that appear with different IDs but identical content.

**Schema:**

```json
{
  "duplicates": {
    "<duplicate_id>": "<canonical_id>"
  },
  "comment": "Entry IDs that map to the same canonical content"
}
```

**Example:**

```json
{
  "duplicates": {
    "https://example.com/posts/123?utm_source=rss": "https://example.com/posts/123",
    "https://example.com/feed/item-duplicate": "https://example.com/feed/item-original"
  },
  "comment": "Entry IDs that map to the same canonical content"
}
```

### 3. Feed Entry Files (`<username>/<entry_id>.json`)

Individual feed entries are stored as normalized Atom entries, regardless of their original format (RSS/Atom).

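Because each entry is a plain JSON file inside a per-user directory, a client that does not depend on Thicket can read the store with a standard JSON library. The following is a minimal sketch under that assumption, not part of Thicket itself: the function name `read_entries` is hypothetical, the user directory is taken from the `directory` field of `index.json`, and files that fail to parse are skipped rather than raising.

```python
import json
from pathlib import Path


def read_entries(store_root: Path) -> dict[str, list[dict]]:
    """Return {username: [entry dicts]} by walking the store with the stdlib only."""
    index = json.loads((store_root / "index.json").read_text(encoding="utf-8"))
    entries_by_user: dict[str, list[dict]] = {}
    for username, user in index.get("users", {}).items():
        # Fall back to the username if the record has no "directory" field.
        user_dir = store_root / user.get("directory", username)
        entries: list[dict] = []
        if user_dir.is_dir():
            for path in sorted(user_dir.glob("*.json")):
                try:
                    entries.append(json.loads(path.read_text(encoding="utf-8")))
                except (OSError, json.JSONDecodeError):
                    # Skip unreadable or malformed entry files.
                    continue
        entries_by_user[username] = entries
    return entries_by_user
```

Calling `read_entries(Path("/path/to/thicket/store"))` yields raw dictionaries; validating them against Thicket's models remains a separate step.
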
**Schema:**

```json
{
  "id": "string",
  "title": "string",
  "link": "string (URL)",
  "updated": "string (ISO 8601 datetime)",
  "published": "string (ISO 8601 datetime) | null",
  "summary": "string | null",
  "content": "string | null",
  "content_type": "html | text | xhtml",
  "author": {
    "name": "string | null",
    "email": "string | null",
    "uri": "string (URL) | null"
  } | null,
  "categories": ["string", ...],
  "rights": "string | null",
  "source": "string (URL) | null"
}
```

**Example:**

```json
{
  "id": "https://johndoe.blog/posts/my-first-post",
  "title": "My First Blog Post",
  "link": "https://johndoe.blog/posts/my-first-post",
  "updated": "2024-01-20T14:22:00",
  "published": "2024-01-20T09:00:00",
  "summary": "This is a summary of my first blog post.",
  "content": "<p>This is the full content of my first blog post with HTML formatting.</p>",
  "content_type": "html",
  "author": {
    "name": "John Doe",
    "email": "john@example.com",
    "uri": "https://johndoe.blog"
  },
  "categories": ["blogging", "personal"],
  "rights": "Copyright 2024 John Doe",
  "source": "https://johndoe.blog/feed.xml"
}
```

## Python Class Integration

To leverage Thicket's existing validation and business logic, third-party clients should use the following Python classes from the `thicket.models` package:

### Core Data Models

```python
from thicket.models import (
    AtomEntry,         # Feed entry representation
    GitStoreIndex,     # Repository index
    UserMetadata,      # User information
    DuplicateMap,      # Duplicate ID mappings
    FeedMetadata,      # Feed-level metadata
    ThicketConfig,     # Configuration
    UserConfig,        # User configuration
    ZulipAssociation,  # Zulip server/user_id pairs
)
```

### Repository Operations

```python
from pathlib import Path

from thicket.core.git_store import GitStore
from thicket.core.feed_parser import FeedParser

# Initialize git store
store = GitStore(Path("/path/to/git/store"))

# Read data
index = store._load_index()                 # Load index.json
user = store.get_user("username")           # Get user metadata
entries = store.list_entries("username", limit=10)
entry = store.get_entry("username", "entry_id")
duplicates = store.get_duplicates()         # Load duplicates.json

# Write data
# `atom_entry` is an AtomEntry instance built or parsed elsewhere
store.add_user("username", display_name="Display Name")
store.store_entry("username", atom_entry)
store.add_duplicate("duplicate_id", "canonical_id")
store.commit_changes("Commit message")

# Zulip associations
store.add_zulip_association("username", "myorg.zulipchat.com", "user@example.com")
store.remove_zulip_association("username", "myorg.zulipchat.com", "user@example.com")
associations = store.get_zulip_associations("username")

# Search and statistics
results = store.search_entries("query", username="optional")
stats = store.get_stats()
```

### Feed Processing

```python
import asyncio

from pydantic import HttpUrl
from thicket.core.feed_parser import FeedParser


async def main() -> None:
    parser = FeedParser()
    source_url = HttpUrl("https://example.com/feed.xml")

    # Fetch and parse feeds (fetch_feed is a coroutine, so it must be awaited)
    content = await parser.fetch_feed(source_url)
    feed_metadata, entries = parser.parse_feed(content, source_url)

    # Entry ID sanitization for filenames
    safe_filenames = [parser.sanitize_entry_id(entry.id) for entry in entries]


asyncio.run(main())
```

## File Naming and ID Sanitization

Entry IDs from feeds are sanitized to create safe filenames using `FeedParser.sanitize_entry_id()`:

- URLs are parsed and the path component is used as the base
- Characters are limited to alphanumeric, hyphens, underscores, and periods
- Other characters are replaced with underscores
- Maximum length is 200 characters
- Empty results default to "entry"

**Examples:**

- `https://example.com/posts/my-post` → `posts_my-post.json`
- `https://blog.com/2024/01/title?utm=source` → `2024_01_title.json`

## Data Validation

All JSON data should be validated using Pydantic models before writing to the store:

```python
from thicket.models import AtomEntry
from pydantic import ValidationError

# `json_data`, `store`, and `username` are assumed to be defined by the caller
try:
    entry = AtomEntry(**json_data)
    # Data is valid, safe to store
    store.store_entry(username, entry)
except ValidationError as e:
    # Handle validation errors
    print(f"Invalid entry data: {e}")
```

## Timestamps

All timestamps use ISO 8601 format in UTC (the examples in this document omit an explicit UTC offset):

- `created`: When the record was first created
- `last_updated`: When the record was last modified
- `updated`: When the feed entry was last updated (from feed)
- `published`: When the feed entry was originally published (from feed)

## Content Sanitization

HTML content in entries is sanitized using the `FeedParser._sanitize_html()` method to prevent XSS attacks. Allowed tags and attributes are strictly controlled.

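To illustrate what allowlist-based sanitization involves, here is a minimal sketch built on the standard library's `html.parser`. It is not Thicket's `FeedParser._sanitize_html()` implementation, and it is not a hardened sanitizer. The names `AllowlistSanitizer` and `sanitize_html` are hypothetical, and the URL scheme check and the dropping of `script`/`style` content are added assumptions that this specification does not state. The tag and attribute sets are copied from the allowlist given in this section.

```python
from html import escape
from html.parser import HTMLParser
from urllib.parse import urlsplit

ALLOWED_TAGS = {
    "a", "abbr", "acronym", "b", "blockquote", "br", "code", "em", "i", "li",
    "ol", "p", "pre", "strong", "ul", "h1", "h2", "h3", "h4", "h5", "h6",
    "img", "div", "span",
}
ALLOWED_ATTRS = {
    "a": {"href", "title"},
    "img": {"src", "alt", "title", "width", "height"},
    "blockquote": {"cite"},
    "abbr": {"title"},
    "acronym": {"title"},
}
URL_ATTRS = {"href", "src", "cite"}
ALLOWED_SCHEMES = {"", "http", "https", "mailto"}  # assumption, not from the spec
VOID_TAGS = {"br", "img"}                           # tags with no closing tag
DROP_CONTENT_TAGS = {"script", "style"}             # assumption: drop their text too


def _safe_url(value: str) -> bool:
    """Accept relative URLs and a small set of schemes; reject unparsable values."""
    try:
        return urlsplit(value.strip()).scheme.lower() in ALLOWED_SCHEMES
    except ValueError:
        return False


class AllowlistSanitizer(HTMLParser):
    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.parts: list[str] = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in DROP_CONTENT_TAGS:
            self._skip_depth += 1
            return
        if tag not in ALLOWED_TAGS:
            return  # drop the tag itself; its text content is kept
        allowed = ALLOWED_ATTRS.get(tag, set())
        kept = []
        for name, value in attrs:
            value = value or ""
            if name not in allowed:
                continue
            if name in URL_ATTRS and not _safe_url(value):
                continue
            kept.append(f' {name}="{escape(value, quote=True)}"')
        self.parts.append(f"<{tag}{''.join(kept)}>")

    def handle_endtag(self, tag):
        if tag in DROP_CONTENT_TAGS:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag in ALLOWED_TAGS and tag not in VOID_TAGS:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(escape(data, quote=False))


def sanitize_html(html: str) -> str:
    parser = AllowlistSanitizer()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)
```

Disallowed tags are removed while their text is kept, and disallowed attributes such as `onclick` are dropped from otherwise permitted tags. Production code should rely on a maintained sanitization library rather than a hand-written parser like this one.
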
**Allowed HTML tags:** `a`, `abbr`, `acronym`, `b`, `blockquote`, `br`, `code`, `em`, `i`, `li`, `ol`, `p`, `pre`, `strong`, `ul`, `h1`-`h6`, `img`, `div`, `span`

**Allowed attributes:**

- `a`: `href`, `title`
- `img`: `src`, `alt`, `title`, `width`, `height`
- `blockquote`: `cite`
- `abbr`/`acronym`: `title`

## Error Handling and Robustness

The store is designed to be fault-tolerant:

- Invalid entries are skipped during processing with error logging
- Malformed JSON files are ignored in listings
- Missing files return `None` rather than raising exceptions
- Git operations are atomic where possible

## Example Usage

### Reading the Store

```python
from pathlib import Path
from thicket.core.git_store import GitStore

# Initialize
store = GitStore(Path("/path/to/thicket/store"))

# Get all users
index = store._load_index()
for username, user_metadata in index.users.items():
    print(f"User: {user_metadata.display_name} ({username})")
    print(f"  Feeds: {user_metadata.feeds}")
    print(f"  Entries: {user_metadata.entry_count}")

# Get recent entries for a user
entries = store.list_entries("johndoe", limit=5)
for entry in entries:
    print(f"  - {entry.title} ({entry.updated})")
```

### Adding Data

```python
from thicket.models import AtomEntry
from datetime import datetime
from pydantic import HttpUrl

# Create entry
entry = AtomEntry(
    id="https://example.com/new-post",
    title="New Post",
    link=HttpUrl("https://example.com/new-post"),
    updated=datetime.now(),
    content="<p>Post content</p>",
    content_type="html"
)

# Store entry (reuses `store` from the previous example)
store.store_entry("johndoe", entry)
store.commit_changes("Add new blog post")
```

## Zulip Integration

The Thicket Git store supports Zulip bot integration for automatic feed posting with user mentions.

### Zulip Associations

Users can be associated with their Zulip identities to enable @mentions:

```python
from thicket.models import ZulipAssociation

# `user` is a UserMetadata instance; it includes a zulip_associations field
user.zulip_associations = [
    ZulipAssociation(server="myorg.zulipchat.com", user_id="alice"),
    ZulipAssociation(server="other.zulipchat.com", user_id="alice@example.com")
]

# Methods for managing associations
user.add_zulip_association("myorg.zulipchat.com", "alice")
user.get_zulip_mention("myorg.zulipchat.com")  # Returns "alice"
user.remove_zulip_association("myorg.zulipchat.com", "alice")
```

### CLI Management

```bash
# Add association
thicket zulip-add alice myorg.zulipchat.com alice@example.com

# Remove association
thicket zulip-remove alice myorg.zulipchat.com alice@example.com

# List associations
thicket zulip-list          # All users
thicket zulip-list alice    # Specific user

# Bulk import from CSV
thicket zulip-import associations.csv
```

### Bot Behavior

When the Thicket Zulip bot posts articles:

1. It checks for Zulip associations matching the current server
2. If one is found, it adds an @mention to the post: `@**alice** posted:`
3. The mentioned user receives a notification in Zulip

This enables automatic notifications when someone's blog post is shared.

## Versioning and Compatibility

This specification describes version 1.1 of the Thicket Git store format.

Changes from 1.0:

- Added `zulip_associations` field to UserMetadata (backwards compatible - defaults to empty list)

Future versions will maintain backward compatibility where possible, with migration tools provided for breaking changes.

The schemas above do not include an explicit version field, so the store format version has to be inferred from the repository structure and the fields present in the JSON files. Stores created by Thicket 0.1.0+ follow this specification.

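Since the only documented difference between 1.0 and 1.1 is the `zulip_associations` field, one way to inspect a store is to check whether any user record in `index.json` carries that key. The sketch below is a heuristic, not an official version check: the function name `looks_like_v1_1` is hypothetical, and it assumes that a 1.1 writer serializes the field even when the list is empty, which this specification does not confirm.

```python
import json
from pathlib import Path


def looks_like_v1_1(store_root: Path) -> bool:
    """Heuristic: True if any user record in index.json has a zulip_associations key."""
    index = json.loads((store_root / "index.json").read_text(encoding="utf-8"))
    users = index.get("users", {})
    return any("zulip_associations" in user for user in users.values())
```

A `False` result does not prove the store is 1.0: a store with no users, or one whose records were written without the field, gives the same answer.
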