problem#
currently, we only embed the image/gif content when generating embeddings with voyage-multimodal-3. we're missing semantic information from filenames like "bufo-jumping-on-bed" that could improve search relevance.
example: searching for "bufo jumping on bed" should better match a gif that both visually shows jumping AND has that filename, but currently we only use the visual content.
research findings#
voyage-multimodal-3 supports early fusion - sending text + image in a single API request to create a unified embedding. this is the optimal approach:
early fusion (recommended)#
- send filename text + image together to voyage-multimodal-3 - model creates a single 1024-dim embedding capturing both modalities (see the request sketch after this list)
- 15-20% accuracy improvement over image-only (per the multimodal fusion literature cited in references)
- no additional cost or latency
- better semantic alignment between text and visual content
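to make the request shape concrete, here's a minimal sketch of one early-fusion call against voyage's multimodal embeddings REST endpoint. the payload mirrors the content format already used in scripts/ingest_bufos.py; the endpoint path and response shape are from voyage's public docs, so verify against the current API before relying on this:

import os
import requests

def embed_fused(filename_text: str, img_base64: str) -> list[float]:
    # one request carrying both modalities: the model fuses them
    # into a single 1024-dim embedding
    resp = requests.post(
        "https://api.voyageai.com/v1/multimodalembeddings",
        headers={"Authorization": f"Bearer {os.environ['VOYAGE_API_KEY']}"},
        json={
            "model": "voyage-multimodal-3",
            "inputs": [{
                "content": [
                    {"type": "text", "text": filename_text},
                    {
                        "type": "image_base64",
                        "image_base64": f"data:image/webp;base64,{img_base64}",
                    },
                ],
            }],
        },
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["data"][0]["embedding"]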
why not late fusion?#
- would require 2 separate embeddings per bufo (text + image)
- 2x embedding costs
- more complex search logic
- research shows early fusion outperforms late fusion when you have a unified multimodal model (contrast sketch below)
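for contrast, late fusion would look roughly like this - embed_text, embed_image, and cosine are hypothetical single-modality helpers, shown only to make the extra cost and merge logic concrete:

# late fusion (not recommended): two API calls per bufo instead of one
text_vec = embed_text(filename_text)
image_vec = embed_image(img_base64)

# query time: score against both vectors, then blend manually
score = 0.5 * cosine(query_vec, text_vec) + 0.5 * cosine(query_vec, image_vec)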
implementation approach#
1. modify ingestion script (scripts/ingest_bufos.py)#
current (lines 151-154):
content = [{
"type": "image_base64",
"image_base64": f"data:image/webp;base64,{img_base64}",
}]
proposed:
# extract semantic meaning from filename
filename_text = image_path.stem.replace("-", " ") # "bufo-jumping-on-bed" -> "bufo jumping on bed"
content = [
{
"type": "text",
"text": filename_text
}
]
# add frame(s)
if is_animated:
for frame_idx in frame_indices:
# ... existing frame extraction ...
content.append({
"type": "image_base64",
"image_base64": f"data:image/webp;base64,{img_base64}",
})
else:
# ... existing static image code ...
content.append({
"type": "image_base64",
"image_base64": f"data:image/webp;base64,{img_base64}",
})
2. re-run ingestion#
after modifying the script, re-run ingestion to update all embeddings:
uv run scripts/ingest_bufos.py
3. optional: weighted RRF tuning#
if testing shows benefit, consider weighting vector search more heavily in src/search.rs:
let vector_weight = 0.7; // prioritize semantic similarity
let bm25_weight = 0.3; // still use keyword matching
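for reference, weighted RRF is just a weighted sum of reciprocal ranks. a python sketch of the formula the rust snippet would implement (k=60 is the conventional RRF constant; the rank dicts are assumed to hold 1-based ranks from each retriever):

def weighted_rrf(
    vector_ranks: dict[str, int],
    bm25_ranks: dict[str, int],
    vector_weight: float = 0.7,
    bm25_weight: float = 0.3,
    k: int = 60,
) -> dict[str, float]:
    # each retriever contributes weight / (k + rank) per document
    scores: dict[str, float] = {}
    for doc_id, rank in vector_ranks.items():
        scores[doc_id] = scores.get(doc_id, 0.0) + vector_weight / (k + rank)
    for doc_id, rank in bm25_ranks.items():
        scores[doc_id] = scores.get(doc_id, 0.0) + bm25_weight / (k + rank)
    return scores  # sort descending for the fused ranking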
expected benefits#
- queries like "bufo jumping on bed" match both visual content AND filename semantics
- queries with specific terms (colors, actions, objects) better aligned
- single unified embedding maintains simple architecture
- 15-20% accuracy improvement based on multimodal fusion literature
references#
- voyage AI docs: multimodal-3 interleaved content
- research: "FuseLIP: Multimodal Embeddings via Early Fusion" (arXiv 2506.03096)
- best practices from turbopuffer's multimodal search documentation
testing#
create a test set of ~20 queries to validate improvements (a scoring harness sketch follows the examples):
- "bufo jumping" → should match jumping-related filenames + visual content
- "yellow toad" → should match color in filename + visual yellow
- "bufo sleeping peacefully" → semantic filename matching
this was ultimately addressed by https://tangled.org/@zzstoatzz.io/find-bufo/commit/ed622b4b7ab31adbab3069102e5cba100ef88304