# Thicket Git Store Specification

This document comprehensively defines the JSON format and structure of the Thicket Git repository, enabling third-party clients to read and write to the store while leveraging Thicket's existing Python classes for data validation and business logic.

## Overview

The Thicket Git store is a structured repository that persists Atom/RSS feed entries in JSON format. The store is designed to be both human-readable and machine-parseable, with a clear directory structure and standardized JSON schemas.

## Repository Structure

```
<git_store>/
├── index.json              # Main index of all users and metadata
├── duplicates.json         # Maps duplicate entry IDs to canonical IDs
├── index.opml              # OPML export of all feeds (generated)
├── <username1>/            # User directory (sanitized username)
│   ├── <entry_id1>.json    # Individual feed entry
│   ├── <entry_id2>.json    # Individual feed entry
│   └── ...
├── <username2>/
│   ├── <entry_id3>.json
│   └── ...
└── ...
```

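The layout above can be traversed with the standard library alone. The following is a minimal sketch, not part of Thicket: `list_store` is a hypothetical helper, and it assumes that every non-hidden top-level directory is a user directory (hidden directories such as `.git` are skipped).

```python
from pathlib import Path


def list_store(root: Path) -> dict[str, list[str]]:
    """Map each user directory name to the sorted entry filenames it contains."""
    layout: dict[str, list[str]] = {}
    for child in sorted(root.iterdir()):
        # Skip top-level files (index.json, duplicates.json, index.opml)
        # and hidden directories such as .git.
        if not child.is_dir() or child.name.startswith("."):
            continue
        layout[child.name] = sorted(p.name for p in child.glob("*.json"))
    return layout
```

Calling `list_store(Path("/path/to/git/store"))` returns a dictionary keyed by user directory, which is usually enough for a read-only client to discover what the store contains before opening `index.json`.
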
## JSON Schemas

### 1. Index File (`index.json`)

The main index tracks all users, their metadata, and repository statistics.

**Schema:**
```json
{
  "users": {
    "<username>": {
      "username": "string",
      "display_name": "string | null",
      "email": "string | null",
      "homepage": "string (URL) | null",
      "icon": "string (URL) | null",
      "feeds": ["string (URL)", ...],
      "zulip_associations": [
        {
          "server": "string",
          "user_id": "string"
        },
        ...
      ],
      "directory": "string",
      "created": "string (ISO 8601 datetime)",
      "last_updated": "string (ISO 8601 datetime)",
      "entry_count": "integer"
    }
  },
  "created": "string (ISO 8601 datetime)",
  "last_updated": "string (ISO 8601 datetime)",
  "total_entries": "integer"
}
```

**Example:**
```json
{
  "users": {
    "johndoe": {
      "username": "johndoe",
      "display_name": "John Doe",
      "email": "john@example.com",
      "homepage": "https://johndoe.blog",
      "icon": "https://johndoe.blog/avatar.png",
      "feeds": [
        "https://johndoe.blog/feed.xml",
        "https://johndoe.blog/categories/tech/feed.xml"
      ],
      "zulip_associations": [
        {
          "server": "myorg.zulipchat.com",
          "user_id": "john.doe"
        },
        {
          "server": "community.zulipchat.com",
          "user_id": "johndoe@example.com"
        }
      ],
      "directory": "johndoe",
      "created": "2024-01-15T10:30:00",
      "last_updated": "2024-01-20T14:22:00",
      "entry_count": 42
    }
  },
  "created": "2024-01-15T10:30:00",
  "last_updated": "2024-01-20T14:22:00",
  "total_entries": 42
}
```

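A client that reads `index.json` directly may want to sanity-check it before use. The sketch below is illustrative only: `check_index` is a hypothetical helper, and both rules it applies (each user object repeats its key in `username`, and `total_entries` equals the sum of the per-user `entry_count` values) are assumptions suggested by the example above rather than guarantees stated in this specification.

```python
import json


def check_index(index: dict) -> list[str]:
    """Return a list of human-readable problems found in a parsed index.json."""
    problems = []
    users = index.get("users", {})
    for key, user in users.items():
        # Assumption: the "username" field repeats the key of the user object.
        if user.get("username") != key:
            problems.append(f"{key}: username field does not match key")
    # Assumption: total_entries is the sum of the per-user entry counts.
    counted = sum(user.get("entry_count", 0) for user in users.values())
    if counted != index.get("total_entries"):
        problems.append(
            f"total_entries is {index.get('total_entries')}, users sum to {counted}"
        )
    return problems


sample = json.loads(
    '{"users": {"johndoe": {"username": "johndoe", "entry_count": 42}},'
    ' "total_entries": 42}'
)
print(check_index(sample))  # prints []
```
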
### 2. Duplicates File (`duplicates.json`)

Maps duplicate entry IDs to their canonical representations to handle feed entries that appear with different IDs but identical content.

**Schema:**
```json
{
  "duplicates": {
    "<duplicate_id>": "<canonical_id>"
  },
  "comment": "Entry IDs that map to the same canonical content"
}
```

**Example:**
```json
{
  "duplicates": {
    "https://example.com/posts/123?utm_source=rss": "https://example.com/posts/123",
    "https://example.com/feed/item-duplicate": "https://example.com/feed/item-original"
  },
  "comment": "Entry IDs that map to the same canonical content"
}
```

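Resolving an entry ID against this map is a dictionary lookup. The sketch below is a hypothetical helper, not Thicket's own code; it additionally assumes that a canonical ID could itself appear as a duplicate key (a chain), which the specification neither requires nor rules out, and it guards against cycles in a hand-edited file.

```python
def resolve_canonical(entry_id: str, duplicates: dict[str, str]) -> str:
    """Follow duplicate -> canonical links until an ID with no mapping is reached."""
    seen = {entry_id}
    while entry_id in duplicates:
        entry_id = duplicates[entry_id]
        if entry_id in seen:
            # A cycle means the map is inconsistent; stop rather than loop forever.
            break
        seen.add(entry_id)
    return entry_id
```

Here `duplicates` is the object stored under the `"duplicates"` key of `duplicates.json`. An ID that has no mapping is returned unchanged, so the function can be applied to every entry ID without checking first.
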
### 3. Feed Entry Files (`<username>/<entry_id>.json`)

Individual feed entries are stored as normalized Atom entries, regardless of their original format (RSS/Atom).

**Schema:**
```json
{
  "id": "string",
  "title": "string",
  "link": "string (URL)",
  "updated": "string (ISO 8601 datetime)",
  "published": "string (ISO 8601 datetime) | null",
  "summary": "string | null",
  "content": "string | null",
  "content_type": "html | text | xhtml",
  "author": {
    "name": "string | null",
    "email": "string | null",
    "uri": "string (URL) | null"
  } | null,
  "categories": ["string", ...],
  "rights": "string | null",
  "source": "string (URL) | null"
}
```

**Example:**
```json
{
  "id": "https://johndoe.blog/posts/my-first-post",
  "title": "My First Blog Post",
  "link": "https://johndoe.blog/posts/my-first-post",
  "updated": "2024-01-20T14:22:00",
  "published": "2024-01-20T09:00:00",
  "summary": "This is a summary of my first blog post.",
  "content": "<p>This is the full content of my <strong>first</strong> blog post with HTML formatting.</p>",
  "content_type": "html",
  "author": {
    "name": "John Doe",
    "email": "john@example.com",
    "uri": "https://johndoe.blog"
  },
  "categories": ["blogging", "personal"],
  "rights": "Copyright 2024 John Doe",
  "source": "https://johndoe.blog/feed.xml"
}
```

## Python Class Integration

To leverage Thicket's existing validation and business logic, third-party clients should use the following Python classes from the `thicket.models` package:

### Core Data Models

```python
from thicket.models import (
    AtomEntry,           # Feed entry representation
    GitStoreIndex,       # Repository index
    UserMetadata,        # User information
    DuplicateMap,        # Duplicate ID mappings
    FeedMetadata,        # Feed-level metadata
    ThicketConfig,       # Configuration
    UserConfig,          # User configuration
    ZulipAssociation     # Zulip server/user_id pairs
)
```

### Repository Operations

```python
from pathlib import Path

from thicket.core.git_store import GitStore
from thicket.core.feed_parser import FeedParser

# Initialize git store
store = GitStore(Path("/path/to/git/store"))

# Read data
index = store._load_index()          # Load index.json
user = store.get_user("username")    # Get user metadata
entries = store.list_entries("username", limit=10)
entry = store.get_entry("username", "entry_id")
duplicates = store.get_duplicates()  # Load duplicates.json

# Write data
store.add_user("username", display_name="Display Name")
store.store_entry("username", atom_entry)
store.add_duplicate("duplicate_id", "canonical_id")
store.commit_changes("Commit message")

# Zulip associations
store.add_zulip_association("username", "myorg.zulipchat.com", "user@example.com")
store.remove_zulip_association("username", "myorg.zulipchat.com", "user@example.com")
associations = store.get_zulip_associations("username")

# Search and statistics
results = store.search_entries("query", username="optional")
stats = store.get_stats()
```

### Feed Processing

```python
import asyncio

from pydantic import HttpUrl
from thicket.core.feed_parser import FeedParser


async def fetch_and_parse(url: str) -> list[str]:
    parser = FeedParser()
    source_url = HttpUrl(url)

    # Fetch and parse the feed (fetch_feed is a coroutine, so it must be awaited)
    content = await parser.fetch_feed(source_url)
    feed_metadata, entries = parser.parse_feed(content, source_url)

    # Entry ID sanitization for filenames
    return [parser.sanitize_entry_id(entry.id) for entry in entries]


safe_filenames = asyncio.run(fetch_and_parse("https://example.com/feed.xml"))
```

## File Naming and ID Sanitization

Entry IDs from feeds are sanitized to create safe filenames using `FeedParser.sanitize_entry_id()`:

- URLs are parsed and the path component is used as the base
- Characters are limited to alphanumeric, hyphens, underscores, and periods
- Other characters are replaced with underscores
- Maximum length is 200 characters
- Empty results default to "entry"

**Examples:**
- `https://example.com/posts/my-post` → `posts_my-post.json`
- `https://blog.com/2024/01/title?utm=source` → `2024_01_title.json`

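The rules above can be approximated with the standard library. This is a sketch for clients that cannot import Thicket, not the actual `FeedParser.sanitize_entry_id()` implementation: the name `sketch_sanitize_entry_id` is hypothetical, and the sketch assumes that leading and trailing slashes are stripped (inferred from the examples), that "alphanumeric" means ASCII letters and digits, and that non-URL IDs are sanitized as a whole.

```python
import re
from urllib.parse import urlparse


def sketch_sanitize_entry_id(entry_id: str, max_length: int = 200) -> str:
    """Approximate the documented sanitization rules for entry filenames."""
    parsed = urlparse(entry_id)
    # For URL-style IDs use only the path; otherwise fall back to the raw ID.
    base = parsed.path if parsed.scheme and parsed.netloc else entry_id
    # Assumption: surrounding slashes are dropped before replacement.
    base = base.strip("/")
    # Keep ASCII letters, digits, periods, underscores, and hyphens.
    safe = re.sub(r"[^A-Za-z0-9._-]", "_", base)
    # Enforce the documented maximum length, then the "entry" fallback.
    safe = safe[:max_length]
    return safe or "entry"
```

The entry file is then written as `<username>/<sanitized>.json`. Because the real method may differ in edge cases, clients that write to the store should prefer `FeedParser.sanitize_entry_id()` whenever Thicket is importable.
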
## Data Validation

All JSON data should be validated using Pydantic models before writing to the store:

```python
from thicket.models import AtomEntry
from pydantic import ValidationError

# json_data is a dict parsed from an entry file; store and username
# are the GitStore instance and target user supplied by the caller.
try:
    entry = AtomEntry(**json_data)
    # Data is valid, safe to store
    store.store_entry(username, entry)
except ValidationError as e:
    # Handle validation errors
    print(f"Invalid entry data: {e}")
```

## Timestamps

All timestamps are ISO 8601 strings in UTC. The examples in this document carry no UTC offset and are to be read as UTC.

- `created`: When the record was first created
- `last_updated`: When the record was last modified
- `updated`: When the feed entry was last updated (from feed)
- `published`: When the feed entry was originally published (from feed)

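A reader that handles these strings outside Thicket has to decide what a missing offset means. The sketch below uses a hypothetical helper, `parse_store_timestamp`, and assumes that offset-free values such as `2024-01-15T10:30:00` are already in UTC, in line with the rule above. Note that `datetime.fromisoformat` accepts a trailing `Z` only on Python 3.11 and later.

```python
from datetime import datetime, timezone


def parse_store_timestamp(value: str) -> datetime:
    """Parse an ISO 8601 string, treating a missing offset as UTC."""
    parsed = datetime.fromisoformat(value)
    if parsed.tzinfo is None:
        # Assumption: naive timestamps in the store are already UTC.
        return parsed.replace(tzinfo=timezone.utc)
    # Convert any explicit offset to UTC so comparisons are consistent.
    return parsed.astimezone(timezone.utc)
```

Normalizing to timezone-aware UTC values on read avoids the `TypeError` Python raises when a naive and an aware `datetime` are compared.
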
## Content Sanitization

HTML content in entries is sanitized using the `FeedParser._sanitize_html()` method to prevent XSS attacks. Allowed tags and attributes are strictly controlled.

**Allowed HTML tags:**
`a`, `abbr`, `acronym`, `b`, `blockquote`, `br`, `code`, `em`, `i`, `li`, `ol`, `p`, `pre`, `strong`, `ul`, `h1`-`h6`, `img`, `div`, `span`

**Allowed attributes:**
- `a`: `href`, `title`
- `img`: `src`, `alt`, `title`, `width`, `height`
- `blockquote`: `cite`
- `abbr`/`acronym`: `title`

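A third-party writer can check content against these allow-lists before storing it. The sketch below is an audit, not a sanitizer, and it does not reproduce `FeedParser._sanitize_html()`: `AllowListAuditor` and `audit_html` are hypothetical names, the check covers tag and attribute names only, and it does not inspect attribute values such as URL schemes in `href`.

```python
from html.parser import HTMLParser

ALLOWED_TAGS = {
    "a", "abbr", "acronym", "b", "blockquote", "br", "code", "em", "i", "li",
    "ol", "p", "pre", "strong", "ul", "h1", "h2", "h3", "h4", "h5", "h6",
    "img", "div", "span",
}
ALLOWED_ATTRS = {
    "a": {"href", "title"},
    "img": {"src", "alt", "title", "width", "height"},
    "blockquote": {"cite"},
    "abbr": {"title"},
    "acronym": {"title"},
}


class AllowListAuditor(HTMLParser):
    """Collect tags and attributes that fall outside the allow-lists."""

    def __init__(self) -> None:
        super().__init__()
        self.violations: list[str] = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag not in ALLOWED_TAGS:
            self.violations.append(f"<{tag}>")
            return
        for name, _value in attrs:
            if name not in ALLOWED_ATTRS.get(tag, set()):
                self.violations.append(f"<{tag} {name}>")


def audit_html(content: str) -> list[str]:
    """Return the disallowed tags and attributes found in `content`."""
    auditor = AllowListAuditor()
    auditor.feed(content)
    auditor.close()
    return auditor.violations
```

An empty result means the content uses only the listed tags and attributes. Content that produces violations should still go through Thicket's own sanitizer rather than be patched by hand.
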
## Error Handling and Robustness

The store is designed to be fault-tolerant:

- Invalid entries are skipped during processing with error logging
- Malformed JSON files are ignored in listings
- Missing files return `None` rather than raising exceptions
- Git operations are atomic where possible

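Clients that read entry files directly can follow the same conventions. The sketch below shows one way to do so with the standard library; `read_entry` and `list_user_entries` are hypothetical helpers that mirror the behavior described above, not Thicket's own functions.

```python
import json
import logging
from pathlib import Path
from typing import Optional

logger = logging.getLogger(__name__)


def read_entry(path: Path) -> Optional[dict]:
    """Return the parsed entry, or None if the file is missing or malformed."""
    try:
        return json.loads(path.read_text(encoding="utf-8"))
    except FileNotFoundError:
        # Missing files yield None rather than an exception.
        return None
    except (json.JSONDecodeError, UnicodeDecodeError) as exc:
        # Malformed files are logged and skipped.
        logger.warning("Skipping malformed entry %s: %s", path, exc)
        return None


def list_user_entries(user_dir: Path) -> list[dict]:
    """Load every readable entry in a user directory, skipping bad files."""
    entries = (read_entry(p) for p in sorted(user_dir.glob("*.json")))
    return [entry for entry in entries if entry is not None]
```
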
## Example Usage

### Reading the Store

```python
from pathlib import Path
from thicket.core.git_store import GitStore

# Initialize
store = GitStore(Path("/path/to/thicket/store"))

# Get all users
index = store._load_index()
for username, user_metadata in index.users.items():
    print(f"User: {user_metadata.display_name} ({username})")
    print(f"  Feeds: {user_metadata.feeds}")
    print(f"  Entries: {user_metadata.entry_count}")

# Get recent entries for a user
entries = store.list_entries("johndoe", limit=5)
for entry in entries:
    print(f"  - {entry.title} ({entry.updated})")
```

### Adding Data

```python
from datetime import datetime, timezone

from pydantic import HttpUrl
from thicket.models import AtomEntry

# `store` is the GitStore instance created in "Reading the Store"

# Create entry (timestamps are stored in UTC)
entry = AtomEntry(
    id="https://example.com/new-post",
    title="New Post",
    link=HttpUrl("https://example.com/new-post"),
    updated=datetime.now(timezone.utc),
    content="<p>Post content</p>",
    content_type="html"
)

# Store entry
store.store_entry("johndoe", entry)
store.commit_changes("Add new blog post")
```

## Zulip Integration

The Thicket Git store supports Zulip bot integration for automatic feed posting with user mentions.

### Zulip Associations

Users can be associated with their Zulip identities to enable @mentions:

```python
from thicket.models import ZulipAssociation

# `user` is a UserMetadata instance, e.g. from store.get_user("alice")

# UserMetadata includes zulip_associations field
user.zulip_associations = [
    ZulipAssociation(server="myorg.zulipchat.com", user_id="alice"),
    ZulipAssociation(server="other.zulipchat.com", user_id="alice@example.com")
]

# Methods for managing associations
user.add_zulip_association("myorg.zulipchat.com", "alice")
user.get_zulip_mention("myorg.zulipchat.com")  # Returns "alice"
user.remove_zulip_association("myorg.zulipchat.com", "alice")
```

### CLI Management

```bash
# Add association
thicket zulip-add alice myorg.zulipchat.com alice@example.com

# Remove association
thicket zulip-remove alice myorg.zulipchat.com alice@example.com

# List associations
thicket zulip-list           # All users
thicket zulip-list alice     # Specific user

# Bulk import from CSV
thicket zulip-import associations.csv
```

### Bot Behavior

When the Thicket Zulip bot posts articles:

1. It checks for Zulip associations matching the current server
2. If found, adds @mention to the post: `@**alice** posted:`
3. The mentioned user receives a notification in Zulip

This enables automatic notifications when someone's blog post is shared.

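Steps 1 and 2 amount to a lookup over the `zulip_associations` list from `index.json`. The sketch below is not the bot's actual code: `mention_prefix` is a hypothetical helper, and it assumes that the first association matching the server is the one used.

```python
from typing import Optional


def mention_prefix(associations: list[dict], server: str) -> Optional[str]:
    """Return '@**user_id** posted:' for the first association on `server`."""
    for assoc in associations:
        if assoc.get("server") == server:
            return f"@**{assoc['user_id']}** posted:"
    # No association for this server: the post carries no mention.
    return None


associations = [
    {"server": "myorg.zulipchat.com", "user_id": "john.doe"},
    {"server": "community.zulipchat.com", "user_id": "johndoe@example.com"},
]
print(mention_prefix(associations, "myorg.zulipchat.com"))  # prints @**john.doe** posted:
```
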
## Versioning and Compatibility

This specification describes version 1.1 of the Thicket Git store format. Changes from 1.0:
- Added `zulip_associations` field to UserMetadata (backwards compatible: defaults to an empty list)

Future versions will maintain backward compatibility where possible, with migration tools provided for breaking changes.

The schemas above define no explicit version field, so the store format version has to be inferred from the repository structure and the JSON contents. Stores created by Thicket 0.1.0+ follow this specification.