Lexicon Codegen Plan#
Goal#
Generate idiomatic Rust types from AT Protocol lexicon schemas with minimal nesting/indirection.
Existing Infrastructure#
Already Implemented#
- lexicon.rs: Complete lexicon parsing types (`LexiconDoc`, `LexUserType`, `LexObject`, etc)
- fs.rs: Directory walking for finding `.json` lexicon files
- schema.rs: `find_ref_unions()` collects union fields from a single lexicon
- output.rs: Partial; has string type mapping and doc comment generation
Attribute Macros#
- `#[lexicon]`: adds an `extra_data` field to structs
- `#[open_union]`: adds an `Unknown(Data<'s>)` variant to enums
Design Decisions#
Module/File Structure#
- NSID `app.bsky.feed.post` → `app_bsky/feed/post.rs`
- Flat module names (no `app::bsky`, just `app_bsky`)
- Parent modules: `app_bsky/feed.rs` with `pub mod post;`
Type Naming#
- Main def: use the last segment of the NSID (`app.bsky.feed.post#main` → `Post`)
- Other defs: Pascal-case the def name (`replyRef` → `ReplyRef`)
- Union variants: use the last segment of the ref NSID (`app.bsky.embed.images` → `Images`)
- Collisions resolved by module path, not type name
- No proliferation of `Main` types like atrium has
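Under these rules, name derivation is a small pure function. A minimal sketch (the helper names here are hypothetical, not part of the existing code):

```rust
/// Derive a Rust type name from an NSID and def name.
/// `main` defs take the last NSID segment; other defs are Pascal-cased.
fn type_name(nsid: &str, def: &str) -> String {
    let base = if def == "main" {
        nsid.rsplit('.').next().unwrap_or(nsid)
    } else {
        def
    };
    pascal_case(base)
}

/// Convert camelCase/snake_case/kebab-case to PascalCase.
fn pascal_case(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    let mut upper_next = true;
    for c in s.chars() {
        if c == '_' || c == '-' {
            upper_next = true;
        } else if upper_next {
            out.extend(c.to_uppercase());
            upper_next = false;
        } else {
            out.push(c);
        }
    }
    out
}
```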
Type Generation#
Records (lexRecord)#
```rust
#[lexicon]
#[derive(Serialize, Deserialize, Debug, Clone, PartialEq, Eq)]
#[serde(rename_all = "camelCase")]
pub struct Post<'s> {
    /// Client-declared timestamp...
    pub created_at: Datetime,
    #[serde(skip_serializing_if = "Option::is_none")]
    pub embed: Option<RecordEmbed<'s>>,
    pub text: CowStr<'s>,
}
```
Objects (lexObject)#
Same as records, but without `#[lexicon]` when the object is inline rather than a top-level def.
Unions (lexRefUnion)#
```rust
#[open_union]
#[derive(Serialize, Deserialize, Debug, Clone, PartialEq, Eq)]
#[serde(tag = "$type")]
pub enum RecordEmbed<'s> {
    #[serde(rename = "app.bsky.embed.images")]
    Images(Box<jacquard_api::app_bsky::embed::Images<'s>>),
    #[serde(rename = "app.bsky.embed.video")]
    Video(Box<jacquard_api::app_bsky::embed::Video<'s>>),
}
```
- Use `Box<T>` for all variants (handles circular refs)
- `#[open_union]` adds an `Unknown(Data<'s>)` catch-all
Queries (lexXrpcQuery)#
```rust
pub struct GetAuthorFeedParams<'s> {
    pub actor: AtIdentifier<'s>,
    pub limit: Option<i64>,
    pub cursor: Option<CowStr<'s>>,
}

pub struct GetAuthorFeedOutput<'s> {
    pub cursor: Option<CowStr<'s>>,
    pub feed: Vec<FeedViewPost<'s>>,
}
```
- Flat params/output structs
- No nesting like `Input { params: {...} }`
Procedures (lexXrpcProcedure)#
Same as queries but with both Input and Output structs.
Field Handling#
Optional Fields#
- Fields not in `required: []` → `Option<T>`
- Add `#[serde(skip_serializing_if = "Option::is_none")]`
Lifetimes#
- All types carry a lifetime parameter for borrowing from input (written `'s` or `'a` in the examples)
- `#[serde(borrow)]` where needed for zero-copy
Type Mapping#
- `LexString` with format → specific types (`Datetime`, `Did`, etc)
- `LexString` without format → `CowStr<'a>`
- `LexInteger` → `i64`
- `LexBoolean` → `bool`
- `LexBytes` → `Bytes`
- `LexCidLink` → `CidLink<'a>`
- `LexBlob` → `Blob<'a>`
- `LexRef` → resolve to actual type path
- `LexRefUnion` → generate enum
- `LexArray` → `Vec<T>`
- `LexUnknown` → `Data<'a>`
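The primitive branch of this mapping could be a single match. A hypothetical sketch: the string keys stand in for the parsed `Lex*` variants, and names like `CowStr` and `Data` are the plan's runtime types emitted as path strings:

```rust
/// Map a lexicon primitive (plus optional string format) to a Rust type
/// rendered as a string. Refs, unions, and arrays recurse through their
/// own generators and are not handled here.
fn rust_type(lex_type: &str, format: Option<&str>) -> String {
    match lex_type {
        "string" => match format {
            Some("datetime") => "Datetime".to_string(),
            Some("did") => "Did".to_string(),
            Some("at-identifier") => "AtIdentifier<'a>".to_string(),
            _ => "CowStr<'a>".to_string(),
        },
        "integer" => "i64".to_string(),
        "boolean" => "bool".to_string(),
        "bytes" => "Bytes".to_string(),
        "cid-link" => "CidLink<'a>".to_string(),
        "blob" => "Blob<'a>".to_string(),
        "unknown" => "Data<'a>".to_string(),
        other => format!("/* unhandled: {other} */"),
    }
}
```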
Reference Resolution#
Known Refs#
- Check corpus for ref existence
- `ref: "app.bsky.embed.images"` → `jacquard_api::app_bsky::embed::Images<'a>`
- Handle fragments: `ref: "com.example.foo#bar"` → `jacquard_api::com_example::foo::Bar<'a>`
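The two examples imply that main defs are addressed at the parent module (presumably via re-export) while fragment defs live in the leaf module. A sketch encoding that reading; this is an interpretation of the examples, not settled design:

```rust
/// Resolve a lexicon ref (optionally with a `#fragment`) to a fully
/// qualified Rust type path under the `jacquard_api` crate.
fn ref_to_path(r: &str) -> String {
    let (nsid, frag) = match r.split_once('#') {
        Some((n, f)) => (n, Some(f)),
        None => (r, None),
    };
    let segs: Vec<&str> = nsid.split('.').collect();
    // Flat top-level module: first two segments joined with '_'.
    let mut mods = vec![format!("{}_{}", segs[0], segs[1])];
    let ty = match frag {
        Some(f) => {
            // Fragment def: every remaining segment is a module.
            mods.extend(segs[2..].iter().map(|s| s.to_string()));
            pascal(f)
        }
        None => {
            // Main def: the leaf segment names the type itself,
            // assumed re-exported from the parent module.
            mods.extend(segs[2..segs.len() - 1].iter().map(|s| s.to_string()));
            pascal(segs[segs.len() - 1])
        }
    };
    format!("jacquard_api::{}::{}", mods.join("::"), ty)
}

/// Uppercase the first character (sufficient for single-word segments).
fn pascal(s: &str) -> String {
    let mut chars = s.chars();
    match chars.next() {
        Some(f) => f.to_uppercase().collect::<String>() + chars.as_str(),
        None => String::new(),
    }
}
```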
Unknown Refs#
- In struct fields: use `Data<'a>` as the fallback type
- In union variants: handled by the `Unknown(Data<'a>)` variant from `#[open_union]`
- Optional: log warnings for missing refs
Implementation Phases#
Phase 1: Corpus Loading & Registry#
Goal: Load all lexicons into memory for ref resolution
Tasks:
- Create `LexiconCorpus` struct
  - `BTreeMap<SmolStr, LexiconDoc<'static>>`: NSID → doc
  - Methods: `load_from_dir()`, `get()`, `resolve_ref()`
- Load all `.json` files from the lexicon directory
- Parse into `LexiconDoc` and insert into the registry
- Handle fragments in refs (`nsid#def`)
Output: Corpus registry that can resolve any ref
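A minimal sketch of the registry shape, with `Doc` standing in for the real `LexiconDoc<'static>` from lexicon.rs:

```rust
use std::collections::BTreeMap;

/// Placeholder for the real LexiconDoc<'static>.
pub struct Doc;

/// Phase 1 registry: NSID → parsed lexicon.
#[derive(Default)]
pub struct LexiconCorpus {
    pub docs: BTreeMap<String, Doc>,
}

impl LexiconCorpus {
    /// Resolve "nsid" or "nsid#def" to (doc, def name).
    /// A bare NSID refers to the `main` def.
    pub fn resolve_ref<'a>(&self, r: &'a str) -> Option<(&Doc, &'a str)> {
        let (nsid, def) = match r.split_once('#') {
            Some((n, d)) => (n, d),
            None => (r, "main"),
        };
        self.docs.get(nsid).map(|doc| (doc, def))
    }
}
```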
Phase 2: Ref Analysis & Union Collection#
Goal: Build complete picture of what refs exist and what unions need
Tasks:
- Extend `find_ref_unions()` to work across the entire corpus
- For each union, collect all refs and check existence
- Build `UnionRegistry`: union name → list of (known refs, unknown refs)
- Detect circular refs (optional, or just Box everything)
Output: Complete list of unions to generate with their variants
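The registry entry could be as simple as two lists split by a corpus lookup. A sketch (the names and key type are placeholders):

```rust
use std::collections::BTreeMap;

/// A union's refs, split by whether the corpus can resolve them.
pub struct UnionVariants {
    pub known: Vec<String>,
    pub unknown: Vec<String>,
}

/// Union name → variants.
pub type UnionRegistry = BTreeMap<String, UnionVariants>;

/// Partition a union's refs with a corpus-lookup predicate.
pub fn partition_refs(refs: &[&str], corpus_has: impl Fn(&str) -> bool) -> UnionVariants {
    let mut v = UnionVariants { known: Vec::new(), unknown: Vec::new() };
    for r in refs {
        if corpus_has(r) {
            v.known.push((*r).to_string());
        } else {
            v.unknown.push((*r).to_string());
        }
    }
    v
}
```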
Phase 3: Code Generation - Core Types#
Goal: Generate Rust code for individual types
Tasks:
- Implement type generators:
  - `generate_struct()` for records/objects
  - `generate_enum()` for unions
  - `generate_field()` for object properties
  - `generate_type()` for primitives/refs
- Handle optional fields (`required` list)
- Add doc comments from `description`
- Apply `#[lexicon]`/`#[open_union]` macros
- Add serde attributes
Output: TokenStream for each type
Phase 4: Module Organization#
Goal: Organize generated types into module hierarchy
Tasks:
- Parse NSID into components: `["app", "bsky", "feed", "post"]`
- Determine file paths: `app_bsky/feed/post.rs`
- Generate module files: `app_bsky/feed.rs` with `pub mod post;`
- Generate root module: `app_bsky.rs`
- Handle re-exports if needed
Output: File path → generated code mapping
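The path mapping is mechanical. A sketch, assuming every NSID has at least two segments (the helper name is hypothetical):

```rust
/// Map an NSID to its output file path:
/// app.bsky.feed.post → app_bsky/feed/post.rs
/// The first two segments join into the flat top-level module.
fn nsid_to_file(nsid: &str) -> String {
    let segs: Vec<&str> = nsid.split('.').collect();
    let mut parts = vec![format!("{}_{}", segs[0], segs[1])];
    parts.extend(segs[2..].iter().map(|s| s.to_string()));
    format!("{}.rs", parts.join("/"))
}
```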
Phase 5: File Writing#
Goal: Write generated code to filesystem
Tasks:
- Format code with `prettyplease`
- Create directory structure
- Write module files
- Write type files
- Optional: run `rustfmt`
Output: Generated code on disk
Phase 6: Testing & Validation#
Goal: Ensure generated code compiles and works
Tasks:
- Generate code for test lexicons
- Compile generated code
- Test serialization/deserialization
- Test union variant matching
- Test extra_data capture
Edge Cases & Considerations#
Circular References#
- Simple approach: union variants always use `Box<T>`, which handles all circular refs
- Alternative: DFS cycle detection to only Box when needed
  - Track visited refs and a recursion stack
  - If a ref appears in the recursion stack, a cycle is detected (back edge)
  - Algorithm (pseudocode):

```rust
fn has_cycle(corpus, start_ref, visited, rec_stack) -> bool {
    visited.insert(start_ref);
    rec_stack.insert(start_ref);
    for child_ref in collect_refs_from_def(resolve(start_ref)) {
        if !visited.contains(child_ref) {
            if has_cycle(corpus, child_ref, visited, rec_stack) {
                return true;
            }
        } else if rec_stack.contains(child_ref) {
            return true; // back edge = cycle
        }
    }
    rec_stack.remove(start_ref);
    false
}
```

  - Only box variants that participate in cycles
- Recommendation: start with the simple approach (always Box), optimize later if needed
Name Collisions#
- Multiple types with same name in different lexicons
- Module path disambiguates:
app_bsky::feed::Postvscom_example::feed::Post
Unknown Refs#
- Fall back to `Data<'s>` in struct fields
- Caught by the `Unknown` variant in unions
- Warn during generation
Inline Defs#
- Nested objects/unions in the same lexicon
- Generate as separate types in the same file
- Keep names scoped to the parent (e.g., `PostReplyRef`)
Arrays#
- `Vec<T>` for arrays
- Handle nested unions in arrays
Tokens#
- Simple marker types
- Generate as unit structs or type aliases?
Traits for Generated Types#
Collection Trait (Records)#
Records implement the existing Collection trait from jacquard-common:
```rust
pub struct Post<'p> {
    // ... fields
}

impl<'p> Collection for Post<'p> {
    const NSID: &'static str = "app.bsky.feed.post";
    type Record = Post<'p>;
}
```
XrpcRequest Trait (Queries/Procedures)#
New trait for XRPC endpoints:
```rust
pub trait XrpcRequest<'x> {
    /// The NSID for this XRPC method
    const NSID: &'static str;
    /// XRPC method (query/GET, procedure/POST)
    const METHOD: XrpcMethod;
    /// Input encoding (MIME type, e.g., "application/json")
    /// None for queries (no body)
    const INPUT_ENCODING: Option<&'static str>;
    /// Output encoding (MIME type)
    const OUTPUT_ENCODING: &'static str;
    /// Request parameters type (query params or body)
    type Params: Serialize;
    /// Response output type
    type Output: Deserialize<'x>;
    type Err: Error;
}

pub enum XrpcMethod {
    Query,     // GET
    Procedure, // POST
}
```
Generated implementation:
```rust
pub struct GetAuthorFeedParams<'a> {
    pub actor: AtIdentifier<'a>,
    pub limit: Option<i64>,
    pub cursor: Option<CowStr<'a>>,
}

pub struct GetAuthorFeedOutput<'a> {
    pub cursor: Option<CowStr<'a>>,
    pub feed: Vec<FeedViewPost<'a>>,
}

impl<'x> XrpcRequest<'x> for GetAuthorFeedParams<'_> {
    const NSID: &'static str = "app.bsky.feed.getAuthorFeed";
    const METHOD: XrpcMethod = XrpcMethod::Query;
    const INPUT_ENCODING: Option<&'static str> = None; // queries have no body
    const OUTPUT_ENCODING: &'static str = "application/json";
    type Params = Self;
    type Output = GetAuthorFeedOutput<'x>;
    type Err = GetAuthorFeedError;
}
```
Encoding variations:
- Most procedures: `"application/json"` for input/output
- Blob uploads: `"*/*"` or a specific MIME type for input
- CAR files: `"application/vnd.ipld.car"` for repo operations
- Read from the lexicon's `input.encoding` and `output.encoding` fields
Trait benefits:
- Allows monomorphization (static dispatch) for performance
- Also supports `dyn XrpcRequest` for dynamic dispatch if needed
- Client code can be generic over `impl XrpcRequest`
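The static-dispatch pattern can be shown with a deliberately simplified trait (no lifetimes or associated types; the real trait is richer). The `/xrpc/{nsid}` path is the standard XRPC convention; the marker type below is hypothetical:

```rust
pub enum XrpcMethod {
    Query,     // GET
    Procedure, // POST
}

/// Simplified stand-in for the generated-request trait.
pub trait XrpcEndpoint {
    const NSID: &'static str;
    const METHOD: XrpcMethod;
}

/// Generic over any endpoint: build the request URL from the NSID.
/// Monomorphizes per endpoint type; no dynamic dispatch needed.
fn xrpc_url<E: XrpcEndpoint>(host: &str) -> String {
    format!("https://{host}/xrpc/{}", E::NSID)
}

/// Hypothetical marker standing in for a generated params struct.
struct GetAuthorFeed;

impl XrpcEndpoint for GetAuthorFeed {
    const NSID: &'static str = "app.bsky.feed.getAuthorFeed";
    const METHOD: XrpcMethod = XrpcMethod::Query;
}
```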
XRPC Errors#
Lexicons declare the kinds of errors an endpoint can return. The trait carries an associated error type; codegen emits an error enum deriving `thiserror::Error` and `miette::Diagnostic`, with variants generated from the lexicon's error list.
Subscriptions#
WebSocket streams - defer for now. Will need separate trait with message types.
Open Questions#
- Validation: Generate runtime validation (min/max length, regex, etc)?
- Tokens: How to represent token types?
- Errors: How to handle codegen errors (missing refs, invalid schemas)?
- Incremental: Support incremental codegen (only changed lexicons)?
- Formatting: Always run rustfmt or rely on prettyplease?
- XrpcRequest location: Should trait live in jacquard-common or separate jacquard-xrpc crate?
- Import shortening: Track imports and shorten ref paths in generated code
  - Instead of `jacquard_api::app_bsky::richtext::Facet<'a>`, emit `use jacquard_api::app_bsky::richtext::Facet;` and use just `Facet<'a>`
  - Would require threading an `ImportTracker` through all generate functions, or post-processing the token stream
  - Long paths are ugly but explicit; revisit once basic codegen is confirmed working
- Web-based lexicon resolution: fetch lexicons from the web instead of requiring local files
  - Implement the lexicon publication and resolution spec
  - `LexiconCorpus::fetch_from_web(nsids: &[&str])`: fetch specific NSIDs
  - `LexiconCorpus::fetch_from_authority(authority: &str)`: fetch all from a DID/domain
  - Resolution: `https://{authority}/.well-known/atproto/lexicon/{nsid}.json`
  - Recursively fetch refs; handle redirects/errors
  - Use `reqwest` for HTTP; this still fits in jacquard-lexicon since it's corpus loading
Success Criteria#
- Generates code for all official AT Protocol lexicons
- Generated code compiles without errors
- No `Main` proliferation
- Union variants have readable names
- Unknown refs handled gracefully
- `#[lexicon]` and `#[open_union]` applied correctly
- Serialization round-trips correctly