# Thicket Git Store Specification

This document defines the directory layout and JSON formats of the Thicket Git store so that third-party clients can read from and write to it, optionally reusing Thicket's existing Python classes for data validation and business logic.

## Overview

The Thicket Git store is a structured repository that persists Atom/RSS feed entries in JSON format. The store is designed to be both human-readable and machine-parseable, with a clear directory structure and standardized JSON schemas.

## Repository Structure

```
<git_store>/
├── index.json             # Main index of all users and metadata
├── duplicates.json        # Maps duplicate entry IDs to canonical IDs
├── index.opml             # OPML export of all feeds (generated)
├── <username1>/           # User directory (sanitized username)
│   ├── <entry_id1>.json   # Individual feed entry
│   ├── <entry_id2>.json   # Individual feed entry
│   └── ...
├── <username2>/
│   ├── <entry_id3>.json
│   └── ...
└── ...
```
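For clients that only need to inspect the layout, a minimal sketch using the standard library alone might look like the following. The function name `list_store_layout` is hypothetical, and treating every non-hidden top-level directory as a user directory is an assumption; the `directory` field in `index.json` remains the authoritative mapping.

```python
from pathlib import Path


def list_store_layout(store_root: Path) -> dict[str, list[str]]:
    """Map each user directory name to its sorted entry filenames."""
    layout: dict[str, list[str]] = {}
    for child in sorted(store_root.iterdir()):
        # Skip top-level files (index.json, duplicates.json, index.opml)
        # and hidden directories such as .git.
        if not child.is_dir() or child.name.startswith("."):
            continue
        layout[child.name] = sorted(p.name for p in child.glob("*.json"))
    return layout
```

Calling `list_store_layout(Path("/path/to/git/store"))` returns a plain dictionary, which keeps the sketch independent of Thicket's own classes.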

## JSON Schemas

### 1. Index File (`index.json`)

The main index tracks all users, their metadata, and repository statistics.

**Schema:**
```json
{
  "users": {
    "<username>": {
      "username": "string",
      "display_name": "string | null",
      "email": "string | null", 
      "homepage": "string (URL) | null",
      "icon": "string (URL) | null",
      "feeds": ["string (URL)", ...],
      "zulip_associations": [
        {
          "server": "string",
          "user_id": "string"
        },
        ...
      ],
      "directory": "string",
      "created": "string (ISO 8601 datetime)",
      "last_updated": "string (ISO 8601 datetime)",
      "entry_count": "integer"
    }
  },
  "created": "string (ISO 8601 datetime)",
  "last_updated": "string (ISO 8601 datetime)", 
  "total_entries": "integer"
}
```

**Example:**
```json
{
  "users": {
    "johndoe": {
      "username": "johndoe",
      "display_name": "John Doe",
      "email": "john@example.com",
      "homepage": "https://johndoe.blog",
      "icon": "https://johndoe.blog/avatar.png",
      "feeds": [
        "https://johndoe.blog/feed.xml",
        "https://johndoe.blog/categories/tech/feed.xml"
      ],
      "zulip_associations": [
        {
          "server": "myorg.zulipchat.com",
          "user_id": "john.doe"
        },
        {
          "server": "community.zulipchat.com",
          "user_id": "johndoe@example.com"
        }
      ],
      "directory": "johndoe",
      "created": "2024-01-15T10:30:00",
      "last_updated": "2024-01-20T14:22:00",
      "entry_count": 42
    }
  },
  "created": "2024-01-15T10:30:00",
  "last_updated": "2024-01-20T14:22:00",
  "total_entries": 42
}
```
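As a rough illustration of how a client might consume the index without Thicket installed, the sketch below summarizes per-user counts from a parsed `index.json`. The helper `summarize_index` is hypothetical, and the check that `total_entries` equals the sum of the per-user `entry_count` values is an assumption suggested by the example, not a rule stated by this specification.

```python
import json


def summarize_index(index: dict) -> dict:
    """Collect per-user entry counts and compare them with total_entries."""
    per_user = {
        name: meta.get("entry_count", 0)
        for name, meta in index.get("users", {}).items()
    }
    return {
        "per_user": per_user,
        # Assumption: total_entries is meant to equal the sum of entry_count.
        "totals_match": sum(per_user.values()) == index.get("total_entries", 0),
    }


# A trimmed-down version of the example index above.
example = json.loads("""
{
  "users": {"johndoe": {"username": "johndoe", "directory": "johndoe", "entry_count": 42}},
  "total_entries": 42
}
""")
summary = summarize_index(example)
```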

### 2. Duplicates File (`duplicates.json`)

Maps duplicate entry IDs to their canonical representations to handle feed entries that appear with different IDs but identical content.

**Schema:**
```json
{
  "duplicates": {
    "<duplicate_id>": "<canonical_id>"
  },
  "comment": "Entry IDs that map to the same canonical content"
}
```

**Example:**
```json
{
  "duplicates": {
    "https://example.com/posts/123?utm_source=rss": "https://example.com/posts/123",
    "https://example.com/feed/item-duplicate": "https://example.com/feed/item-original"
  },
  "comment": "Entry IDs that map to the same canonical content"
}
```
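Resolving an ID against this map could be sketched as below. The function `resolve_canonical` is hypothetical; whether Thicket ever stores chained mappings (a duplicate pointing at another duplicate) is not stated here, so the loop and the cycle guard are defensive assumptions.

```python
def resolve_canonical(entry_id: str, duplicates: dict[str, str]) -> str:
    """Follow duplicate -> canonical links until an unmapped ID is reached."""
    seen = {entry_id}
    current = entry_id
    while current in duplicates:
        current = duplicates[current]
        if current in seen:
            # Defensive guard: a cycle would otherwise loop forever.
            raise ValueError(f"cycle in duplicate map at {current!r}")
        seen.add(current)
    return current
```

An ID that does not appear as a key is returned unchanged, so callers can pass every entry ID through the function without checking membership first.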

### 3. Feed Entry Files (`<username>/<entry_id>.json`)

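Individual feed entries are stored as normalized Atom entries, regardless of their original format (RSS or Atom). The filename is the sanitized entry ID (see File Naming and ID Sanitization), not the raw `id` value.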

**Schema:**
```json
{
  "id": "string",
  "title": "string", 
  "link": "string (URL)",
  "updated": "string (ISO 8601 datetime)",
  "published": "string (ISO 8601 datetime) | null",
  "summary": "string | null",
  "content": "string | null",
  "content_type": "html | text | xhtml",
  "author": {
    "name": "string | null",
    "email": "string | null", 
    "uri": "string (URL) | null"
  } | null,
  "categories": ["string", ...],
  "rights": "string | null",
  "source": "string (URL) | null"
}
```

**Example:**
```json
{
  "id": "https://johndoe.blog/posts/my-first-post",
  "title": "My First Blog Post",
  "link": "https://johndoe.blog/posts/my-first-post",
  "updated": "2024-01-20T14:22:00",
  "published": "2024-01-20T09:00:00", 
  "summary": "This is a summary of my first blog post.",
  "content": "<p>This is the full content of my <strong>first</strong> blog post with HTML formatting.</p>",
  "content_type": "html",
  "author": {
    "name": "John Doe",
    "email": "john@example.com",
    "uri": "https://johndoe.blog"
  },
  "categories": ["blogging", "personal"],
  "rights": "Copyright 2024 John Doe",
  "source": "https://johndoe.blog/feed.xml"
}
```
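Where Pydantic is not available, a lightweight structural check can catch the most obvious problems before a file is written. The sketch below is not Thicket's validation; `check_entry` is a hypothetical helper, and treating `id`, `title`, `link`, and `updated` as the required fields is an assumption read off the schema above (they are the string fields without `| null`).

```python
from datetime import datetime

REQUIRED_FIELDS = ("id", "title", "link", "updated")
CONTENT_TYPES = {"html", "text", "xhtml"}


def check_entry(data: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means no issue found."""
    problems = []
    for field in REQUIRED_FIELDS:
        if not isinstance(data.get(field), str):
            problems.append(f"missing or non-string field: {field}")
    content_type = data.get("content_type")
    if content_type is not None and content_type not in CONTENT_TYPES:
        problems.append(f"unexpected content_type: {content_type!r}")
    for field in ("updated", "published"):
        value = data.get(field)
        if isinstance(value, str):
            try:
                # Note: before Python 3.11 this rejects a trailing "Z".
                datetime.fromisoformat(value)
            except ValueError:
                problems.append(f"{field} is not an ISO 8601 datetime: {value!r}")
    return problems
```

Returning a list rather than raising lets a client report every problem in a file at once.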

## Python Class Integration

To leverage Thicket's existing validation and business logic, third-party clients should use the following Python classes from the `thicket.models` package:

### Core Data Models

```python
from thicket.models import (
    AtomEntry,           # Feed entry representation
    GitStoreIndex,       # Repository index
    UserMetadata,        # User information  
    DuplicateMap,        # Duplicate ID mappings
    FeedMetadata,        # Feed-level metadata
    ThicketConfig,       # Configuration
    UserConfig,          # User configuration
    ZulipAssociation     # Zulip server/user_id pairs
)
```

### Repository Operations

```python
from pathlib import Path

from thicket.core.git_store import GitStore

# Initialize git store
store = GitStore(Path("/path/to/git/store"))

# Read data
index = store._load_index()          # Load index.json (underscore-prefixed: treat as internal)
user = store.get_user("username")    # Get user metadata
entries = store.list_entries("username", limit=10)
entry = store.get_entry("username", "entry_id")
duplicates = store.get_duplicates()  # Load duplicates.json

# Write data  
store.add_user("username", display_name="Display Name")
store.store_entry("username", atom_entry)
store.add_duplicate("duplicate_id", "canonical_id") 
store.commit_changes("Commit message")

# Zulip associations
store.add_zulip_association("username", "myorg.zulipchat.com", "user@example.com")
store.remove_zulip_association("username", "myorg.zulipchat.com", "user@example.com")
associations = store.get_zulip_associations("username")

# Search and statistics
results = store.search_entries("query", username="optional")
stats = store.get_stats()
```

### Feed Processing

```python
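import asyncio

from pydantic import HttpUrl
from thicket.core.feed_parser import FeedParser


async def fetch_and_parse(source_url: HttpUrl):
    parser = FeedParser()

    # Fetch and parse the feed (fetch_feed is awaited, so this runs inside a coroutine)
    content = await parser.fetch_feed(source_url)
    feed_metadata, entries = parser.parse_feed(content, source_url)

    # Entry ID sanitization for filenames
    safe_filenames = [parser.sanitize_entry_id(entry.id) for entry in entries]
    return feed_metadata, entries, safe_filenames


feed_metadata, entries, safe_filenames = asyncio.run(
    fetch_and_parse(HttpUrl("https://example.com/feed.xml"))
)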
```

## File Naming and ID Sanitization

Entry IDs from feeds are sanitized to create safe filenames using `FeedParser.sanitize_entry_id()`:

- URLs are parsed and the path component is used as the base
- Characters are limited to alphanumeric, hyphens, underscores, and periods
- Other characters are replaced with underscores
- Maximum length is 200 characters
- Empty results default to "entry"

**Examples:**
- `https://example.com/posts/my-post` → `posts_my-post.json`
- `https://blog.com/2024/01/title?utm=source` → `2024_01_title.json`
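For clients that cannot call `FeedParser.sanitize_entry_id()` directly, the rules above can be approximated as in the sketch below. This is a reimplementation from the listed rules, not Thicket's code: the name `sanitize_entry_id_sketch` is hypothetical, stripping the surrounding slashes is inferred from the examples, and restricting "alphanumeric" to ASCII is an assumption. Results should be compared against the real method before relying on them.

```python
import re
from urllib.parse import urlparse

MAX_LENGTH = 200


def sanitize_entry_id_sketch(entry_id: str) -> str:
    """Approximate the filename stem (without .json) for an entry ID."""
    parsed = urlparse(entry_id)
    # Use the path for URLs with a host; otherwise keep the whole ID.
    base = parsed.path if parsed.scheme and parsed.netloc else entry_id
    # Assumption inferred from the examples: surrounding slashes are dropped.
    base = base.strip("/")
    # Replace everything outside the allowed set with underscores.
    safe = re.sub(r"[^A-Za-z0-9._-]", "_", base)
    safe = safe[:MAX_LENGTH]
    return safe or "entry"
```

With these assumptions the two examples above yield `posts_my-post` and `2024_01_title`, and a bare `https://example.com` falls back to `entry`.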

## Data Validation

All JSON data should be validated using Pydantic models before writing to the store:

```python
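from thicket.models import AtomEntry
from pydantic import ValidationError

# Assumes json_data (a dict parsed from JSON), store (a GitStore), and username are already defined
try:
    entry = AtomEntry(**json_data)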
    # Data is valid, safe to store
    store.store_entry(username, entry)
except ValidationError as e:
    # Handle validation errors
    print(f"Invalid entry data: {e}")
```

## Timestamps

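All timestamps use ISO 8601 format and are in UTC. The examples in this document carry no UTC offset, so offset-less values should be read as UTC: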
- `created`: When the record was first created
- `last_updated`: When the record was last modified  
- `updated`: When the feed entry was last updated (from feed)
- `published`: When the feed entry was originally published (from feed)
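A reader that wants timezone-aware values could parse these fields as sketched below. The helper name `parse_store_timestamp` is hypothetical, and interpreting offset-less values as UTC follows the statement at the top of this section rather than any behavior verified against Thicket.

```python
from datetime import datetime, timezone


def parse_store_timestamp(value: str) -> datetime:
    """Parse an ISO 8601 string from the store into an aware UTC datetime."""
    # datetime.fromisoformat only accepts a trailing "Z" from Python 3.11 on.
    if value.endswith("Z"):
        value = value[:-1] + "+00:00"
    parsed = datetime.fromisoformat(value)
    if parsed.tzinfo is None:
        # Assumption: naive timestamps are UTC, as this section states.
        return parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc)
```

Normalizing to aware datetimes avoids the `TypeError` Python raises when naive and aware values are compared or sorted together.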

## Content Sanitization

HTML content in entries is sanitized using the `FeedParser._sanitize_html()` method to prevent XSS attacks. Only the tags and attributes listed below are allowed.

**Allowed HTML tags:**
`a`, `abbr`, `acronym`, `b`, `blockquote`, `br`, `code`, `em`, `i`, `li`, `ol`, `p`, `pre`, `strong`, `ul`, `h1`-`h6`, `img`, `div`, `span`

**Allowed attributes:**
- `a`: `href`, `title`
- `img`: `src`, `alt`, `title`, `width`, `height` 
- `blockquote`: `cite`
- `abbr`/`acronym`: `title`
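The allowlist above can be written down as data, which is convenient for clients that want to check content before submitting it. The sketch below only answers whether a tag or attribute is on the list; it does not sanitize HTML, does not inspect URL schemes in `href` or `src`, and the names `ALLOWED_TAGS`, `ALLOWED_ATTRIBUTES`, and `is_allowed` are hypothetical rather than taken from Thicket.

```python
from typing import Optional

ALLOWED_TAGS = frozenset(
    ["a", "abbr", "acronym", "b", "blockquote", "br", "code", "em", "i",
     "li", "ol", "p", "pre", "strong", "ul", "img", "div", "span"]
    + [f"h{level}" for level in range(1, 7)]  # h1 through h6
)

ALLOWED_ATTRIBUTES = {
    "a": {"href", "title"},
    "img": {"src", "alt", "title", "width", "height"},
    "blockquote": {"cite"},
    "abbr": {"title"},
    "acronym": {"title"},
}


def is_allowed(tag: str, attribute: Optional[str] = None) -> bool:
    """Check a tag, or a tag/attribute pair, against the allowlist."""
    tag = tag.lower()
    if tag not in ALLOWED_TAGS:
        return False
    if attribute is None:
        return True
    return attribute.lower() in ALLOWED_ATTRIBUTES.get(tag, set())
```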

## Error Handling and Robustness

The store is designed to be fault-tolerant:

- Invalid entries are skipped during processing with error logging
- Malformed JSON files are ignored in listings
- Lookups for missing files return `None` rather than raising exceptions
- Git operations are atomic where possible
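A client reading entry files directly can follow the same conventions. The sketch below illustrates the skip-and-log pattern with the standard library; `read_entry`, `list_valid_entries`, and the logger name are hypothetical and do not mirror Thicket's internal implementation.

```python
import json
import logging
from pathlib import Path
from typing import Optional

logger = logging.getLogger("thicket_store_sketch")


def read_entry(path: Path) -> Optional[dict]:
    """Return the parsed entry, or None if the file is missing or malformed."""
    try:
        return json.loads(path.read_text(encoding="utf-8"))
    except FileNotFoundError:
        return None
    except (json.JSONDecodeError, UnicodeDecodeError) as exc:
        logger.warning("Skipping malformed entry %s: %s", path, exc)
        return None


def list_valid_entries(user_dir: Path) -> list[dict]:
    """Load every readable entry in a user directory, skipping bad files."""
    entries = (read_entry(p) for p in sorted(user_dir.glob("*.json")))
    return [e for e in entries if e is not None]
```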

## Example Usage

### Reading the Store

```python
from pathlib import Path
from thicket.core.git_store import GitStore

# Initialize
store = GitStore(Path("/path/to/thicket/store"))

# Get all users (_load_index is underscore-prefixed: treat as internal)
index = store._load_index()
for username, user_metadata in index.users.items():
    print(f"User: {user_metadata.display_name} ({username})")
    print(f"  Feeds: {user_metadata.feeds}")
    print(f"  Entries: {user_metadata.entry_count}")

# Get recent entries for a user
entries = store.list_entries("johndoe", limit=5)
for entry in entries:
    print(f"  - {entry.title} ({entry.updated})")
```

### Adding Data

```python
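from datetime import datetime, timezone
from pathlib import Path

from pydantic import HttpUrl
from thicket.core.git_store import GitStore
from thicket.models import AtomEntry

store = GitStore(Path("/path/to/thicket/store"))

# Create entry
entry = AtomEntry(
    id="https://example.com/new-post",
    title="New Post",
    link=HttpUrl("https://example.com/new-post"),
    updated=datetime.now(timezone.utc),  # timestamps are stored in UTC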
    content="<p>Post content</p>",
    content_type="html"
)

# Store entry
store.store_entry("johndoe", entry)
store.commit_changes("Add new blog post")
```

## Zulip Integration

The Thicket Git store supports Zulip bot integration for automatic feed posting with user mentions.

### Zulip Associations

Users can be associated with their Zulip identities to enable @mentions:

```python
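from thicket.models import ZulipAssociation

# user is a UserMetadata instance, e.g. from store.get_user("alice")
# UserMetadata includes zulip_associations field
user.zulip_associations = [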
    ZulipAssociation(server="myorg.zulipchat.com", user_id="alice"),
    ZulipAssociation(server="other.zulipchat.com", user_id="alice@example.com")
]

# Methods for managing associations
user.add_zulip_association("myorg.zulipchat.com", "alice")
user.get_zulip_mention("myorg.zulipchat.com")  # Returns "alice"
user.remove_zulip_association("myorg.zulipchat.com", "alice")
```

### CLI Management

```bash
# Add association
thicket zulip-add alice myorg.zulipchat.com alice@example.com

# Remove association  
thicket zulip-remove alice myorg.zulipchat.com alice@example.com

# List associations
thicket zulip-list           # All users
thicket zulip-list alice     # Specific user

# Bulk import from CSV
thicket zulip-import associations.csv
```

### Bot Behavior

When the Thicket Zulip bot posts articles:

1. It checks for Zulip associations matching the current server
2. If one is found, it adds an @mention to the post: `@**alice** posted:`
3. The mentioned user receives a notification in Zulip

This enables automatic notifications when someone's blog post is shared.
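The lookup in steps 1 and 2 can be sketched over the raw `zulip_associations` list from `index.json`. The function `mention_prefix` is hypothetical, and using `user_id` verbatim inside the mention is inferred from the `@**alice**` example above rather than from the bot's source.

```python
from typing import Optional


def mention_prefix(associations: list[dict], server: str) -> Optional[str]:
    """Return the mention line for the first association on this server, if any."""
    for association in associations:
        if association.get("server") == server:
            return f"@**{association['user_id']}** posted:"
    # No association for this server: the caller decides how to post.
    return None
```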

## Versioning and Compatibility

This specification describes version 1.1 of the Thicket Git store format. Changes from 1.0:
- Added `zulip_associations` field to UserMetadata (backwards compatible - defaults to empty list)

Future versions will maintain backward compatibility where possible, with migration tools provided for breaking changes.

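The schemas above define no explicit format version field, so the version has to be inferred from the repository structure and JSON contents (for example, a `zulip_associations` field on a user record indicates version 1.1). Stores created by Thicket 0.1.0+ follow this specification.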
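A client that reads raw JSON can apply the same backward-compatible default itself when it encounters a version 1.0 user record. The helper `upgrade_user_record` below is hypothetical and covers only the single change listed for version 1.1.

```python
def upgrade_user_record(user: dict) -> dict:
    """Return a copy of a user record with version 1.1 defaults filled in."""
    upgraded = dict(user)
    # Version 1.0 records lack this field; 1.1 defaults it to an empty list.
    upgraded.setdefault("zulip_associations", [])
    return upgraded
```

The function returns a shallow copy, so the dictionary parsed from `index.json` is left unmodified until the client decides to write it back.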