Thicket Git Store Specification#
This document defines the JSON format and structure of the Thicket Git repository so that third-party clients can read from and write to the store while reusing Thicket's existing Python classes for data validation and business logic.
Overview#
The Thicket Git store is a structured repository that persists Atom/RSS feed entries in JSON format. The store is designed to be both human-readable and machine-parseable, with a clear directory structure and standardized JSON schemas.
Repository Structure#
<git_store>/
├── index.json # Main index of all users and metadata
├── duplicates.json # Maps duplicate entry IDs to canonical IDs
├── index.opml # OPML export of all feeds (generated)
├── <username1>/ # User directory (sanitized username)
│ ├── <entry_id1>.json # Individual feed entry
│ ├── <entry_id2>.json # Individual feed entry
│ └── ...
├── <username2>/
│ ├── <entry_id3>.json
│ └── ...
└── ...
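A client that does not import Thicket can still discover users and entries by walking this layout. The sketch below uses only the standard library; the function name list_store_layout is hypothetical, and treating every non-hidden directory as a user directory is an assumption based on the tree above.

```python
from pathlib import Path


def list_store_layout(store_root: Path) -> dict[str, list[str]]:
    """Map each user directory name to its sorted entry file names."""
    layout: dict[str, list[str]] = {}
    for child in sorted(store_root.iterdir()):
        # Skip top-level files (index.json, duplicates.json, index.opml)
        # and hidden directories such as .git.
        if not child.is_dir() or child.name.startswith("."):
            continue
        layout[child.name] = sorted(p.name for p in child.glob("*.json"))
    return layout
```

Calling list_store_layout(Path("/path/to/git/store")) returns a dictionary keyed by directory name, which can then be cross-checked against the "directory" values in index.json.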
JSON Schemas#
1. Index File (index.json)#
The main index tracks all users, their metadata, and repository statistics.
Schema:
{
"users": {
"<username>": {
"username": "string",
"display_name": "string | null",
"email": "string | null",
"homepage": "string (URL) | null",
"icon": "string (URL) | null",
"feeds": ["string (URL)", ...],
"zulip_associations": [
{
"server": "string",
"user_id": "string"
},
...
],
"directory": "string",
"created": "string (ISO 8601 datetime)",
"last_updated": "string (ISO 8601 datetime)",
"entry_count": "integer"
}
},
"created": "string (ISO 8601 datetime)",
"last_updated": "string (ISO 8601 datetime)",
"total_entries": "integer"
}
Example:
{
"users": {
"johndoe": {
"username": "johndoe",
"display_name": "John Doe",
"email": "john@example.com",
"homepage": "https://johndoe.blog",
"icon": "https://johndoe.blog/avatar.png",
"feeds": [
"https://johndoe.blog/feed.xml",
"https://johndoe.blog/categories/tech/feed.xml"
],
"zulip_associations": [
{
"server": "myorg.zulipchat.com",
"user_id": "john.doe"
},
{
"server": "community.zulipchat.com",
"user_id": "johndoe@example.com"
}
],
"directory": "johndoe",
"created": "2024-01-15T10:30:00",
"last_updated": "2024-01-20T14:22:00",
"entry_count": 42
}
},
"created": "2024-01-15T10:30:00",
"last_updated": "2024-01-20T14:22:00",
"total_entries": 42
}
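For clients that read index.json directly, the following is a minimal sketch using only the json module. The helper names load_index and entry_count_mismatch are hypothetical. That total_entries should equal the sum of the per-user entry_count values is an assumption drawn from the example above, not a rule stated by this specification.

```python
import json
from pathlib import Path


def load_index(store_root: Path) -> dict:
    """Read index.json as a plain dictionary, without Thicket installed."""
    with (store_root / "index.json").open(encoding="utf-8") as fh:
        return json.load(fh)


def entry_count_mismatch(index: dict) -> int:
    """Return total_entries minus the sum of per-user entry_count values.

    A non-zero result suggests the index is out of date (assumption: the
    two figures are meant to agree, as they do in the example).
    """
    users = index.get("users", {})
    per_user = sum(user.get("entry_count", 0) for user in users.values())
    return index.get("total_entries", 0) - per_user
```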
2. Duplicates File (duplicates.json)#
Maps duplicate entry IDs to their canonical representations to handle feed entries that appear with different IDs but identical content.
Schema:
{
"duplicates": {
"<duplicate_id>": "<canonical_id>"
},
"comment": "Entry IDs that map to the same canonical content"
}
Example:
{
"duplicates": {
"https://example.com/posts/123?utm_source=rss": "https://example.com/posts/123",
"https://example.com/feed/item-duplicate": "https://example.com/feed/item-original"
},
"comment": "Entry IDs that map to the same canonical content"
}
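Resolving an entry ID against this map is a dictionary lookup. The sketch below shows one way to do it; resolve_canonical is a hypothetical helper, and following chains of mappings (a duplicate that points to another duplicate) is an assumption of this sketch rather than documented behavior.

```python
def resolve_canonical(entry_id: str, duplicates: dict[str, str]) -> str:
    """Return the canonical ID for entry_id, or entry_id itself if unmapped."""
    seen = {entry_id}
    current = entry_id
    # Follow a -> b -> c chains; the seen-set stops the loop if the map
    # accidentally contains a cycle.
    while current in duplicates:
        current = duplicates[current]
        if current in seen:
            break
        seen.add(current)
    return current
```

Here duplicates is the value of the "duplicates" key in duplicates.json, not the whole file.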
3. Feed Entry Files (<username>/<entry_id>.json)#
Individual feed entries are stored as normalized Atom entries, regardless of their original format (RSS/Atom).
Schema:
{
"id": "string",
"title": "string",
"link": "string (URL)",
"updated": "string (ISO 8601 datetime)",
"published": "string (ISO 8601 datetime) | null",
"summary": "string | null",
"content": "string | null",
"content_type": "html | text | xhtml",
"author": {
"name": "string | null",
"email": "string | null",
"uri": "string (URL) | null"
} | null,
"categories": ["string", ...],
"rights": "string | null",
"source": "string (URL) | null"
}
Example:
{
"id": "https://johndoe.blog/posts/my-first-post",
"title": "My First Blog Post",
"link": "https://johndoe.blog/posts/my-first-post",
"updated": "2024-01-20T14:22:00",
"published": "2024-01-20T09:00:00",
"summary": "This is a summary of my first blog post.",
"content": "<p>This is the full content of my <strong>first</strong> blog post with HTML formatting.</p>",
"content_type": "html",
"author": {
"name": "John Doe",
"email": "john@example.com",
"uri": "https://johndoe.blog"
},
"categories": ["blogging", "personal"],
"rights": "Copyright 2024 John Doe",
"source": "https://johndoe.blog/feed.xml"
}
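Clients that cannot use Thicket's Pydantic models may still want a quick structural check before writing an entry file. The following is a rough sketch based only on the schema above; entry_problems is a hypothetical helper, it treats the fields listed without a "| null" alternative as required strings, and it assumes content_type is always present in stored files. It is not a substitute for AtomEntry validation.

```python
ALLOWED_CONTENT_TYPES = {"html", "text", "xhtml"}
# Fields the schema lists without a "| null" alternative.
REQUIRED_STRING_FIELDS = ("id", "title", "link", "updated")


def entry_problems(entry: dict) -> list[str]:
    """Return human-readable problems found in an entry dict (empty if none)."""
    problems = []
    for field in REQUIRED_STRING_FIELDS:
        if not isinstance(entry.get(field), str):
            problems.append(f"{field}: expected a string")
    if entry.get("content_type") not in ALLOWED_CONTENT_TYPES:
        problems.append("content_type: expected html, text, or xhtml")
    categories = entry.get("categories", [])
    if not isinstance(categories, list) or not all(isinstance(c, str) for c in categories):
        problems.append("categories: expected a list of strings")
    return problems
```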
Python Class Integration#
To leverage Thicket's existing validation and business logic, third-party clients should use the following Python classes from the thicket.models package:
Core Data Models#
from thicket.models import (
AtomEntry, # Feed entry representation
GitStoreIndex, # Repository index
UserMetadata, # User information
DuplicateMap, # Duplicate ID mappings
FeedMetadata, # Feed-level metadata
ThicketConfig, # Configuration
UserConfig, # User configuration
ZulipAssociation # Zulip server/user_id pairs
)
Repository Operations#
from pathlib import Path
from thicket.core.git_store import GitStore
from thicket.core.feed_parser import FeedParser
# Initialize git store
store = GitStore(Path("/path/to/git/store"))
# Read data
index = store._load_index() # Load index.json
user = store.get_user("username") # Get user metadata
entries = store.list_entries("username", limit=10)
entry = store.get_entry("username", "entry_id")
duplicates = store.get_duplicates() # Load duplicates.json
# Write data
store.add_user("username", display_name="Display Name")
store.store_entry("username", atom_entry)
store.add_duplicate("duplicate_id", "canonical_id")
store.commit_changes("Commit message")
# Zulip associations
store.add_zulip_association("username", "myorg.zulipchat.com", "user@example.com")
store.remove_zulip_association("username", "myorg.zulipchat.com", "user@example.com")
associations = store.get_zulip_associations("username")
# Search and statistics
results = store.search_entries("query", username="optional")
stats = store.get_stats()
Feed Processing#
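import asyncio
from thicket.core.feed_parser import FeedParser
from pydantic import HttpUrl
async def main() -> None:
    parser = FeedParser()
    source_url = HttpUrl("https://example.com/feed.xml")
    # Fetch and parse feeds (fetch_feed is a coroutine, so it must be awaited)
    content = await parser.fetch_feed(source_url)
    feed_metadata, entries = parser.parse_feed(content, source_url)
    # Entry ID sanitization for filenames
    for entry in entries:
        safe_filename = parser.sanitize_entry_id(entry.id)
        print(safe_filename)
asyncio.run(main())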
File Naming and ID Sanitization#
Entry IDs from feeds are sanitized to create safe filenames using FeedParser.sanitize_entry_id():
- URLs are parsed and the path component is used as the base
- Characters are limited to alphanumeric, hyphens, underscores, and periods
- Other characters are replaced with underscores
- Maximum length is 200 characters
- Empty results default to "entry"
Examples:
- https://example.com/posts/my-post → posts_my-post.json
- https://blog.com/2024/01/title?utm=source → 2024_01_title.json
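The rules above can be approximated outside Thicket as follows. This is a sketch, not the code of FeedParser.sanitize_entry_id(): the name sanitize_entry_id_sketch is hypothetical, stripping leading and trailing slashes is inferred from the examples, and "alphanumeric" is read as ASCII letters and digits. Clients that need byte-for-byte identical filenames should call the real method.

```python
import re
from urllib.parse import urlparse

MAX_LENGTH = 200


def sanitize_entry_id_sketch(entry_id: str) -> str:
    """Approximate the documented rules for turning an entry ID into a filename stem."""
    # Use the URL path as the base; the query string and fragment are dropped.
    base = urlparse(entry_id).path.strip("/")
    # Keep ASCII alphanumerics, hyphens, underscores, and periods.
    safe = re.sub(r"[^A-Za-z0-9._-]", "_", base)[:MAX_LENGTH]
    # Empty results default to "entry".
    return safe or "entry"
```

The file on disk is then the returned stem with ".json" appended.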
Data Validation#
All JSON data should be validated using Pydantic models before writing to the store:
from thicket.models import AtomEntry
from pydantic import ValidationError
try:
    entry = AtomEntry(**json_data)
    # Data is valid, safe to store
    store.store_entry(username, entry)
except ValidationError as e:
    # Handle validation errors
    print(f"Invalid entry data: {e}")
Timestamps#
All timestamps use ISO 8601 format in UTC:
- created: When the record was first created
- last_updated: When the record was last modified
- updated: When the feed entry was last updated (from feed)
- published: When the feed entry was originally published (from feed)
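The example timestamps in this document carry no UTC offset, so a client has to decide how to interpret them. The sketch below parses a stored value with the standard library and treats a missing offset as UTC, which is an assumption based on the statement that all timestamps are in UTC. The name parse_store_timestamp is hypothetical.

```python
from datetime import datetime, timezone


def parse_store_timestamp(value: str) -> datetime:
    """Parse an ISO 8601 string from the store into an aware UTC datetime."""
    parsed = datetime.fromisoformat(value)
    if parsed.tzinfo is None:
        # Assumption: a value without an offset is already in UTC.
        return parsed.replace(tzinfo=timezone.utc)
    # Values that do carry an offset are converted to UTC.
    return parsed.astimezone(timezone.utc)
```

Note that datetime.fromisoformat() only accepts a trailing "Z" from Python 3.11 onward; on older versions such values need the "Z" replaced with "+00:00" first.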
Content Sanitization#
HTML content in entries is sanitized using the FeedParser._sanitize_html() method to prevent XSS attacks. Allowed tags and attributes are strictly controlled.
Allowed HTML tags:
a, abbr, acronym, b, blockquote, br, code, em, i, li, ol, p, pre, strong, ul, h1-h6, img, div, span
Allowed attributes:
- a: href, title
- img: src, alt, title, width, height
- blockquote: cite
- abbr/acronym: title
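A third-party client writing entries directly may want to check its HTML against these lists before storing it. The sketch below encodes the allowlist as data and reports violations using the standard library's html.parser. It is an audit aid under stated assumptions, not a sanitizer and not the code of FeedParser._sanitize_html(): it does not rewrite the markup and does not inspect attribute values such as URL schemes. The names AllowlistAuditor and audit_html are hypothetical.

```python
from html.parser import HTMLParser

ALLOWED_TAGS = {
    "a", "abbr", "acronym", "b", "blockquote", "br", "code", "em", "i", "li",
    "ol", "p", "pre", "strong", "ul", "img", "div", "span",
    "h1", "h2", "h3", "h4", "h5", "h6",
}
ALLOWED_ATTRIBUTES = {
    "a": {"href", "title"},
    "img": {"src", "alt", "title", "width", "height"},
    "blockquote": {"cite"},
    "abbr": {"title"},
    "acronym": {"title"},
}


class AllowlistAuditor(HTMLParser):
    """Collect tags and attributes that fall outside the allowlist."""

    def __init__(self) -> None:
        super().__init__()
        self.violations: list[str] = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before this call.
        if tag not in ALLOWED_TAGS:
            self.violations.append(f"<{tag}>")
            return
        for name, _value in attrs:
            if name not in ALLOWED_ATTRIBUTES.get(tag, set()):
                self.violations.append(f"<{tag} {name}>")


def audit_html(html: str) -> list[str]:
    """Return the list of allowlist violations found in html, in document order."""
    auditor = AllowlistAuditor()
    auditor.feed(html)
    auditor.close()
    return auditor.violations
```

An empty result means every tag and attribute name is on the lists above; it does not by itself prove the content is safe to render.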
Error Handling and Robustness#
The store is designed to be fault-tolerant:
- Invalid entries are skipped during processing with error logging
- Malformed JSON files are ignored in listings
- Missing files return None rather than raising exceptions
- Git operations are atomic where possible
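A client that reads entry files directly can follow the same conventions. The sketch below returns None for a missing or malformed file instead of raising; read_entry_file is a hypothetical helper, and treating a non-object JSON document as malformed is an assumption of this sketch.

```python
import json
from pathlib import Path
from typing import Optional


def read_entry_file(path: Path) -> Optional[dict]:
    """Return the parsed entry, or None if the file is missing or malformed."""
    try:
        with path.open(encoding="utf-8") as fh:
            data = json.load(fh)
    except FileNotFoundError:
        return None
    except (json.JSONDecodeError, UnicodeDecodeError):
        # Malformed files are skipped rather than aborting a listing.
        return None
    # An entry file is expected to hold a JSON object.
    return data if isinstance(data, dict) else None
```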
Example Usage#
Reading the Store#
from pathlib import Path
from thicket.core.git_store import GitStore
# Initialize
store = GitStore(Path("/path/to/thicket/store"))
# Get all users
index = store._load_index()
for username, user_metadata in index.users.items():
    print(f"User: {user_metadata.display_name} ({username})")
    print(f" Feeds: {user_metadata.feeds}")
    print(f" Entries: {user_metadata.entry_count}")
# Get recent entries for a user
entries = store.list_entries("johndoe", limit=5)
for entry in entries:
    print(f" - {entry.title} ({entry.updated})")
Adding Data#
from thicket.models import AtomEntry
from datetime import datetime, timezone
from pydantic import HttpUrl
# Create entry
entry = AtomEntry(
    id="https://example.com/new-post",
    title="New Post",
    link=HttpUrl("https://example.com/new-post"),
    # Current time in UTC, stored without an offset like the examples in this document
    updated=datetime.now(timezone.utc).replace(tzinfo=None),
    content="<p>Post content</p>",
    content_type="html"
)
# Store entry
store.store_entry("johndoe", entry)
store.commit_changes("Add new blog post")
Zulip Integration#
The Thicket Git store supports Zulip bot integration for automatic feed posting with user mentions.
Zulip Associations#
Users can be associated with their Zulip identities to enable @mentions:
# UserMetadata includes zulip_associations field
user.zulip_associations = [
ZulipAssociation(server="myorg.zulipchat.com", user_id="alice"),
ZulipAssociation(server="other.zulipchat.com", user_id="alice@example.com")
]
# Methods for managing associations
user.add_zulip_association("myorg.zulipchat.com", "alice")
user.get_zulip_mention("myorg.zulipchat.com") # Returns "alice"
user.remove_zulip_association("myorg.zulipchat.com", "alice")
CLI Management#
# Add association
thicket zulip-add alice myorg.zulipchat.com alice@example.com
# Remove association
thicket zulip-remove alice myorg.zulipchat.com alice@example.com
# List associations
thicket zulip-list # All users
thicket zulip-list alice # Specific user
# Bulk import from CSV
thicket zulip-import associations.csv
Bot Behavior#
When the Thicket Zulip bot posts articles:
- It checks for Zulip associations matching the current server
- If found, adds @mention to the post: @**alice** posted:
- The mentioned user receives a notification in Zulip
This enables automatic notifications when someone's blog post is shared.
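The lookup the bot performs can be sketched against the raw zulip_associations list from index.json. The function name mention_prefix is hypothetical, and returning None when no association matches is an assumption; the output format follows the "@**alice** posted:" example above.

```python
from typing import Optional


def mention_prefix(associations: list[dict], server: str) -> Optional[str]:
    """Return the mention prefix for the first association on server, if any."""
    for association in associations:
        if association.get("server") == server:
            return f"@**{association['user_id']}** posted:"
    # No association for this server: the post is made without a mention.
    return None
```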
Versioning and Compatibility#
This specification describes version 1.1 of the Thicket Git store format. Changes from 1.0:
- Added zulip_associations field to UserMetadata (backwards compatible; defaults to an empty list)
Future versions will maintain backward compatibility where possible, with migration tools provided for breaking changes.
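The schemas above define no explicit version field, so the format version has to be inferred from the repository structure and JSON contents, for example from whether user records in index.json carry zulip_associations. Stores created by Thicket 0.1.0+ follow this specification.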
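Clients that read index.json directly should tolerate user records written before version 1.1. A minimal sketch, assuming such records simply omit the field; zulip_associations_for is a hypothetical helper.

```python
def zulip_associations_for(user_record: dict) -> list[dict]:
    """Return a user's Zulip associations, treating a missing field as empty."""
    # A 1.0 record has no zulip_associations key; default to an empty list.
    return list(user_record.get("zulip_associations") or [])
```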