Thicket Git Store Specification#

This document defines the directory layout and JSON formats of the Thicket Git store so that third-party clients can read and write it, optionally reusing Thicket's existing Python classes for data validation and business logic.

Overview#

The Thicket Git store is a structured repository that persists Atom/RSS feed entries in JSON format. The store is designed to be both human-readable and machine-parseable, with a clear directory structure and standardized JSON schemas.

Repository Structure#

<git_store>/
├── index.json              # Main index of all users and metadata
├── duplicates.json         # Maps duplicate entry IDs to canonical IDs
├── index.opml             # OPML export of all feeds (generated)
├── <username1>/           # User directory (sanitized username)
│   ├── <entry_id1>.json   # Individual feed entry
│   ├── <entry_id2>.json   # Individual feed entry
│   └── ...
├── <username2>/
│   ├── <entry_id3>.json
│   └── ...
└── ...
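
For a client that does not use Thicket's classes, the layout above can be walked with the standard library alone. The following is a minimal sketch, not Thicket's own code: the function name list_entry_files is hypothetical, and it assumes that every non-hidden top-level directory is a user directory.

```python
import tempfile
from pathlib import Path


def list_entry_files(store_root: Path) -> dict:
    """Map each user directory name to its sorted entry file names."""
    layout = {}
    for child in sorted(store_root.iterdir()):
        # Skip top-level files (index.json, ...) and hidden directories such as .git
        if not child.is_dir() or child.name.startswith("."):
            continue
        layout[child.name] = sorted(p.name for p in child.glob("*.json"))
    return layout


# Build a throwaway store that mirrors the tree above, then walk it.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "index.json").write_text("{}")
    (root / ".git").mkdir()
    (root / "johndoe").mkdir()
    (root / "johndoe" / "posts_my-post.json").write_text("{}")
    layout = list_entry_files(root)

print(layout)
```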

JSON Schemas#

1. Index File (`index.json`)#

The main index tracks all users, their metadata, and repository statistics.

Schema:

{
  "users": {
    "<username>": {
      "username": "string",
      "display_name": "string | null",
      "email": "string | null", 
      "homepage": "string (URL) | null",
      "icon": "string (URL) | null",
      "feeds": ["string (URL)", ...],
      "zulip_associations": [
        {
          "server": "string",
          "user_id": "string"
        },
        ...
      ],
      "directory": "string",
      "created": "string (ISO 8601 datetime)",
      "last_updated": "string (ISO 8601 datetime)",
      "entry_count": "integer"
    }
  },
  "created": "string (ISO 8601 datetime)",
  "last_updated": "string (ISO 8601 datetime)", 
  "total_entries": "integer"
}

Example:

{
  "users": {
    "johndoe": {
      "username": "johndoe",
      "display_name": "John Doe",
      "email": "john@example.com",
      "homepage": "https://johndoe.blog",
      "icon": "https://johndoe.blog/avatar.png",
      "feeds": [
        "https://johndoe.blog/feed.xml",
        "https://johndoe.blog/categories/tech/feed.xml"
      ],
      "zulip_associations": [
        {
          "server": "myorg.zulipchat.com",
          "user_id": "john.doe"
        },
        {
          "server": "community.zulipchat.com",
          "user_id": "johndoe@example.com"
        }
      ],
      "directory": "johndoe",
      "created": "2024-01-15T10:30:00",
      "last_updated": "2024-01-20T14:22:00",
      "entry_count": 42
    }
  },
  "created": "2024-01-15T10:30:00",
  "last_updated": "2024-01-20T14:22:00",
  "total_entries": 42
}
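
A client that only needs to read the index can parse it without Thicket installed. The sketch below is illustrative only: UserSummary and summarize_index are hypothetical names, and missing keys fall back to defaults chosen for this example rather than taken from Thicket.

```python
import json
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class UserSummary:
    """A small read-only view of one record under "users"."""
    username: str
    display_name: Optional[str] = None
    feeds: list = field(default_factory=list)
    entry_count: int = 0


def summarize_index(raw_json: str) -> list:
    """Parse index.json text and return one UserSummary per user."""
    index = json.loads(raw_json)
    summaries = []
    for username, user in index.get("users", {}).items():
        summaries.append(
            UserSummary(
                username=username,
                display_name=user.get("display_name"),
                feeds=list(user.get("feeds", [])),
                entry_count=int(user.get("entry_count", 0)),
            )
        )
    return summaries


# A trimmed copy of the example above
SAMPLE_INDEX = """
{
  "users": {
    "johndoe": {
      "username": "johndoe",
      "display_name": "John Doe",
      "feeds": ["https://johndoe.blog/feed.xml"],
      "directory": "johndoe",
      "entry_count": 42
    }
  },
  "total_entries": 42
}
"""

for summary in summarize_index(SAMPLE_INDEX):
    label = summary.display_name or summary.username
    print(f"{label}: {summary.entry_count} entries, {len(summary.feeds)} feed(s)")
```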

2. Duplicates File (`duplicates.json`)#

Maps duplicate entry IDs to their canonical representations to handle feed entries that appear with different IDs but identical content.

Schema:

{
  "duplicates": {
    "<duplicate_id>": "<canonical_id>"
  },
  "comment": "Entry IDs that map to the same canonical content"
}

Example:

{
  "duplicates": {
    "https://example.com/posts/123?utm_source=rss": "https://example.com/posts/123",
    "https://example.com/feed/item-duplicate": "https://example.com/feed/item-original"
  },
  "comment": "Entry IDs that map to the same canonical content"
}
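
Resolving an entry ID against this file is a dictionary lookup. The sketch below uses a hypothetical helper, resolve_canonical. It also follows chains of mappings and guards against cycles, which is an assumption: the examples above only show direct, one-step mappings.

```python
import json


def resolve_canonical(duplicates: dict, entry_id: str) -> str:
    """Follow duplicate -> canonical links until an unmapped ID is reached."""
    seen = set()
    current = entry_id
    # The seen set stops the loop if the file ever contains a cycle
    while current in duplicates and current not in seen:
        seen.add(current)
        current = duplicates[current]
    return current


SAMPLE_DUPLICATES = """
{
  "duplicates": {
    "https://example.com/posts/123?utm_source=rss": "https://example.com/posts/123"
  },
  "comment": "Entry IDs that map to the same canonical content"
}
"""

mapping = json.loads(SAMPLE_DUPLICATES)["duplicates"]
print(resolve_canonical(mapping, "https://example.com/posts/123?utm_source=rss"))
print(resolve_canonical(mapping, "https://example.com/posts/999"))
```

IDs with no mapping are returned unchanged, so the helper can be applied to every entry ID before comparison.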

3. Feed Entry Files (`<username>/<entry_id>.json`)#

Individual feed entries are stored as normalized Atom entries, regardless of their original format (RSS/Atom).

Schema:

{
  "id": "string",
  "title": "string", 
  "link": "string (URL)",
  "updated": "string (ISO 8601 datetime)",
  "published": "string (ISO 8601 datetime) | null",
  "summary": "string | null",
  "content": "string | null",
  "content_type": "html | text | xhtml",
  "author": {
    "name": "string | null",
    "email": "string | null", 
    "uri": "string (URL) | null"
  } | null,
  "categories": ["string", ...],
  "rights": "string | null",
  "source": "string (URL) | null"
}

Example:

{
  "id": "https://johndoe.blog/posts/my-first-post",
  "title": "My First Blog Post",
  "link": "https://johndoe.blog/posts/my-first-post",
  "updated": "2024-01-20T14:22:00",
  "published": "2024-01-20T09:00:00", 
  "summary": "This is a summary of my first blog post.",
  "content": "<p>This is the full content of my <strong>first</strong> blog post with HTML formatting.</p>",
  "content_type": "html",
  "author": {
    "name": "John Doe",
    "email": "john@example.com",
    "uri": "https://johndoe.blog"
  },
  "categories": ["blogging", "personal"],
  "rights": "Copyright 2024 John Doe",
  "source": "https://johndoe.blog/feed.xml"
}
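
Clients written in other languages, or without Pydantic available, may want a rough shape check before writing an entry file. The following sketch is not a substitute for the AtomEntry model: check_entry is a hypothetical helper, it treats the four fields shown without "| null" as required strings, and it checks content_type and categories only when they are present.

```python
REQUIRED_STRING_FIELDS = ("id", "title", "link", "updated")
ALLOWED_CONTENT_TYPES = {"html", "text", "xhtml"}


def check_entry(entry: dict) -> list:
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    for key in REQUIRED_STRING_FIELDS:
        if not isinstance(entry.get(key), str):
            problems.append(f"{key}: expected a string")
    # Checked only when present, since omitted keys may have defaults
    if "content_type" in entry and entry["content_type"] not in ALLOWED_CONTENT_TYPES:
        problems.append("content_type: expected html, text, or xhtml")
    categories = entry.get("categories", [])
    if not isinstance(categories, list) or not all(isinstance(c, str) for c in categories):
        problems.append("categories: expected a list of strings")
    author = entry.get("author")
    if author is not None and not isinstance(author, dict):
        problems.append("author: expected an object or null")
    return problems


good = {
    "id": "https://johndoe.blog/posts/my-first-post",
    "title": "My First Blog Post",
    "link": "https://johndoe.blog/posts/my-first-post",
    "updated": "2024-01-20T14:22:00",
    "content_type": "html",
    "categories": ["blogging", "personal"],
    "author": None,
}
bad = {
    "id": "x",
    "link": "https://example.com/x",
    "updated": "2024-01-20T14:22:00",
    "content_type": "markdown",
}

print(check_entry(good))
print(check_entry(bad))
```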

Python Class Integration#

To leverage Thicket's existing validation and business logic, third-party clients should use the following Python classes from the thicket.models package:

Core Data Models#

from thicket.models import (
    AtomEntry,           # Feed entry representation
    GitStoreIndex,       # Repository index
    UserMetadata,        # User information  
    DuplicateMap,        # Duplicate ID mappings
    FeedMetadata,        # Feed-level metadata
    ThicketConfig,       # Configuration
    UserConfig,          # User configuration
    ZulipAssociation     # Zulip server/user_id pairs
)

Repository Operations#

from pathlib import Path

from thicket.core.git_store import GitStore

# Initialize git store
store = GitStore(Path("/path/to/git/store"))

# Read data
index = store._load_index()          # Load index.json
user = store.get_user("username")    # Get user metadata
entries = store.list_entries("username", limit=10)
entry = store.get_entry("username", "entry_id")
duplicates = store.get_duplicates()  # Load duplicates.json

# Write data  
store.add_user("username", display_name="Display Name")
store.store_entry("username", atom_entry)
store.add_duplicate("duplicate_id", "canonical_id") 
store.commit_changes("Commit message")

# Zulip associations
store.add_zulip_association("username", "myorg.zulipchat.com", "user@example.com")
store.remove_zulip_association("username", "myorg.zulipchat.com", "user@example.com")
associations = store.get_zulip_associations("username")

# Search and statistics
results = store.search_entries("query", username="optional")
stats = store.get_stats()

Feed Processing#

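import asyncio

from pydantic import HttpUrl
from thicket.core.feed_parser import FeedParser


# await is only valid inside a coroutine, so the calls are wrapped in one
async def fetch_entries(source_url: HttpUrl):
    parser = FeedParser()

    # Fetch and parse feeds
    content = await parser.fetch_feed(source_url)
    feed_metadata, entries = parser.parse_feed(content, source_url)

    # Entry ID sanitization for filenames
    for entry in entries:
        safe_filename = parser.sanitize_entry_id(entry.id)
        print(safe_filename)


asyncio.run(fetch_entries(HttpUrl("https://example.com/feed.xml")))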

File Naming and ID Sanitization#

Entry IDs from feeds are sanitized to create safe filenames using FeedParser.sanitize_entry_id():

- If the ID is a URL, it is parsed and its path component is used as the base, so the query string is dropped.
- Only alphanumeric characters, hyphens, underscores, and periods are kept; every other character is replaced with an underscore.
- Maximum length is 200 characters.
- Empty results default to "entry".

Examples:

https://example.com/posts/my-post → posts_my-post.json
https://blog.com/2024/01/title?utm=source → 2024_01_title.json
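
The rules and examples above can be approximated with the standard library. This is a sketch that reproduces the two documented examples, not the source of FeedParser.sanitize_entry_id(). The function name is hypothetical, and several details are assumptions: stripping leading and trailing slashes, restricting "alphanumeric" to ASCII, and enforcing the maximum length by truncation.

```python
import re
from urllib.parse import urlparse

MAX_LENGTH = 200


def sanitize_entry_id_sketch(entry_id: str) -> str:
    """Approximate the documented ID-to-filename rules (without the .json suffix)."""
    parsed = urlparse(entry_id)
    # For URL-style IDs use the path only, which drops the query string
    base = parsed.path if parsed.scheme and parsed.netloc else entry_id
    base = base.strip("/")
    # Anything outside letters, digits, period, underscore, hyphen becomes "_"
    safe = re.sub(r"[^A-Za-z0-9._-]", "_", base)
    safe = safe[:MAX_LENGTH]
    return safe or "entry"


for entry_id in (
    "https://example.com/posts/my-post",
    "https://blog.com/2024/01/title?utm=source",
    "",
):
    print(repr(entry_id), "->", sanitize_entry_id_sketch(entry_id) + ".json")
```

Clients that need filenames identical to Thicket's should call the real method instead, since edge cases such as URLs with an empty path are not specified here.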

Data Validation#

All JSON data should be validated using Pydantic models before writing to the store:

from thicket.models import AtomEntry
from pydantic import ValidationError

try:
    entry = AtomEntry(**json_data)
    # Data is valid, safe to store
    store.store_entry(username, entry)
except ValidationError as e:
    # Handle validation errors
    print(f"Invalid entry data: {e}")

Timestamps#

All timestamps are ISO 8601 strings in UTC. The examples in this document omit the timezone designator.

- created: When the record was first created
- last_updated: When the record was last modified
- updated: When the feed entry was last updated (from feed)
- published: When the feed entry was originally published (from feed)
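
When reading these fields without Thicket's models, the string can be parsed with the standard library. The sketch below uses a hypothetical helper, parse_store_timestamp, and assumes that a value without a timezone offset is meant as UTC, as the statement above implies.

```python
from datetime import datetime, timezone


def parse_store_timestamp(value: str) -> datetime:
    """Parse an ISO 8601 string from the store into an aware UTC datetime."""
    parsed = datetime.fromisoformat(value)
    if parsed.tzinfo is None:
        # No offset in the string: interpret it as UTC (assumption)
        return parsed.replace(tzinfo=timezone.utc)
    # An explicit offset is converted to UTC
    return parsed.astimezone(timezone.utc)


print(parse_store_timestamp("2024-01-20T14:22:00").isoformat())
print(parse_store_timestamp("2024-01-20T15:22:00+01:00").isoformat())
```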

Content Sanitization#

HTML content in entries is sanitized by the FeedParser._sanitize_html() method to prevent XSS attacks. Only the tags and attributes listed below are allowed.

Allowed HTML tags: a, abbr, acronym, b, blockquote, br, code, em, i, li, ol, p, pre, strong, ul, h1-h6, img, div, span

Allowed attributes:

- a: href, title
- img: src, alt, title, width, height
- blockquote: cite
- abbr/acronym: title
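
For clients that want to check content against the same allowlist, the tags and attributes above can be written down as data. The sketch below is only a lookup table with a hypothetical helper, is_allowed. It is not an HTML sanitizer and does not reproduce FeedParser._sanitize_html(); actual sanitization should be left to a dedicated library.

```python
# Tags from the list above, with h1-h6 expanded
ALLOWED_TAGS = {
    "a", "abbr", "acronym", "b", "blockquote", "br", "code", "em", "i",
    "li", "ol", "p", "pre", "strong", "ul", "img", "div", "span",
} | {f"h{level}" for level in range(1, 7)}

# Attributes allowed per tag; tags not listed here allow no attributes
ALLOWED_ATTRIBUTES = {
    "a": {"href", "title"},
    "img": {"src", "alt", "title", "width", "height"},
    "blockquote": {"cite"},
    "abbr": {"title"},
    "acronym": {"title"},
}


def is_allowed(tag: str, attribute: str = None) -> bool:
    """Check a tag, or a tag/attribute pair, against the allowlist."""
    tag = tag.lower()
    if tag not in ALLOWED_TAGS:
        return False
    if attribute is None:
        return True
    return attribute.lower() in ALLOWED_ATTRIBUTES.get(tag, set())


print(is_allowed("a", "href"))
print(is_allowed("a", "onclick"))
print(is_allowed("script"))
```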

Error Handling and Robustness#

The store is designed to be fault-tolerant:

- Invalid entries are skipped during processing, and the error is logged.
- Malformed JSON files are ignored in listings.
- Missing files return None rather than raising exceptions.
- Git operations are atomic where possible.
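
A third-party reader can mirror the same tolerant behaviour. The following sketch uses hypothetical helpers, read_entry and list_valid_entries, and works on plain dictionaries rather than Thicket's models. It covers only the "missing file" and "malformed JSON" cases from the list above.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional


def read_entry(path: Path) -> Optional[dict]:
    """Return the parsed entry, or None when the file does not exist."""
    try:
        return json.loads(path.read_text(encoding="utf-8"))
    except FileNotFoundError:
        return None


def list_valid_entries(user_dir: Path) -> list:
    """Return every entry that parses; malformed files are left out."""
    entries = []
    for path in sorted(user_dir.glob("*.json")):
        try:
            entries.append(json.loads(path.read_text(encoding="utf-8")))
        except (json.JSONDecodeError, UnicodeDecodeError):
            continue  # malformed file: skip it instead of failing the listing
    return entries


# Demonstrate with one valid file, one malformed file, and one missing file
with tempfile.TemporaryDirectory() as tmp:
    user_dir = Path(tmp)
    (user_dir / "good.json").write_text('{"id": "a"}', encoding="utf-8")
    (user_dir / "bad.json").write_text("{not json", encoding="utf-8")
    listed = list_valid_entries(user_dir)
    missing = read_entry(user_dir / "absent.json")

print(listed, missing)
```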

Example Usage#

Reading the Store#

from pathlib import Path
from thicket.core.git_store import GitStore

# Initialize
store = GitStore(Path("/path/to/thicket/store"))

# Get all users
index = store._load_index()
for username, user_metadata in index.users.items():
    print(f"User: {user_metadata.display_name} ({username})")
    print(f"  Feeds: {user_metadata.feeds}")
    print(f"  Entries: {user_metadata.entry_count}")

# Get recent entries for a user
entries = store.list_entries("johndoe", limit=5)
for entry in entries:
    print(f"  - {entry.title} ({entry.updated})")

Adding Data#

from datetime import datetime, timezone

from pydantic import HttpUrl
from thicket.models import AtomEntry

# Create entry (timestamps are UTC, see Timestamps)
entry = AtomEntry(
    id="https://example.com/new-post",
    title="New Post",
    link=HttpUrl("https://example.com/new-post"),
    updated=datetime.now(timezone.utc),
    content="<p>Post content</p>",
    content_type="html"
)

# Store entry (store is the GitStore opened in "Reading the Store")
store.store_entry("johndoe", entry)
store.commit_changes("Add new blog post")

Zulip Integration#

The Thicket Git store supports Zulip bot integration for automatic feed posting with user mentions.

Zulip Associations#

Users can be associated with their Zulip identities to enable @mentions:

# UserMetadata includes zulip_associations field
user.zulip_associations = [
    ZulipAssociation(server="myorg.zulipchat.com", user_id="alice"),
    ZulipAssociation(server="other.zulipchat.com", user_id="alice@example.com")
]

# Methods for managing associations
user.add_zulip_association("myorg.zulipchat.com", "alice")
user.get_zulip_mention("myorg.zulipchat.com")  # Returns "alice"
user.remove_zulip_association("myorg.zulipchat.com", "alice")

CLI Management#

# Add association
thicket zulip-add alice myorg.zulipchat.com alice@example.com

# Remove association  
thicket zulip-remove alice myorg.zulipchat.com alice@example.com

# List associations
thicket zulip-list           # All users
thicket zulip-list alice     # Specific user

# Bulk import from CSV
thicket zulip-import associations.csv

Bot Behavior#

When the Thicket Zulip bot posts articles:

1. It checks the author's Zulip associations for one matching the current server.
2. If one is found, it adds an @mention to the post, for example: @**alice** posted:
3. The mentioned user receives a notification in Zulip.

This enables automatic notifications when someone's blog post is shared.
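
The lookup described above can be sketched against the JSON form of a user record from index.json. This is an illustration, not the bot's code: mention_prefix is a hypothetical helper, and it assumes that the stored user_id is the text placed between the asterisks, as the @**alice** example suggests.

```python
from typing import Optional


def mention_prefix(user: dict, server: str) -> Optional[str]:
    """Return the mention line for this server, or None without an association."""
    for association in user.get("zulip_associations", []):
        if association.get("server") == server:
            return f"@**{association['user_id']}** posted:"
    return None


# The zulip_associations of the johndoe record from the index.json example
johndoe = {
    "username": "johndoe",
    "zulip_associations": [
        {"server": "myorg.zulipchat.com", "user_id": "john.doe"},
        {"server": "community.zulipchat.com", "user_id": "johndoe@example.com"},
    ],
}

print(mention_prefix(johndoe, "myorg.zulipchat.com"))
print(mention_prefix(johndoe, "elsewhere.zulipchat.com"))
```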

Versioning and Compatibility#

This specification describes version 1.1 of the Thicket Git store format. Changes from 1.0:

- Added zulip_associations field to UserMetadata (backward compatible; defaults to an empty list)
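
A reader that parses index.json directly can apply the same default when it meets a user record without the field. The helper name upgrade_user_record in this sketch is hypothetical.

```python
def upgrade_user_record(user: dict) -> dict:
    """Return a copy of a user record with the 1.1 default filled in."""
    upgraded = dict(user)
    # Records written before 1.1 have no zulip_associations key
    upgraded.setdefault("zulip_associations", [])
    return upgraded


old_record = {"username": "johndoe", "directory": "johndoe", "entry_count": 42}
new_record = upgrade_user_record(old_record)
print(new_record["zulip_associations"])
```

The input record is left untouched, so the function is safe to apply to every user on load.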

Future versions will maintain backward compatibility where possible, with migration tools provided for breaking changes.

The schemas above define no explicit version field, so the store format version has to be inferred from the repository structure and the JSON contents. Stores created by Thicket 0.1.0+ follow this specification.
