problem#
currently, we only embed the image/gif content when generating embeddings with voyage-multimodal-3. we're missing semantic information from filenames like "bufo-jumping-on-bed" that could improve search relevance.
example: searching for "bufo jumping on bed" should better match a gif that both visually shows jumping AND has that filename, but currently we only use the visual content.
research findings#
voyage-multimodal-3 supports early fusion - sending text + image in a single API request to create a unified embedding. this is the optimal approach:
early fusion (recommended)#
- send filename text + image together to voyage-multimodal-3 - model creates a single 1024-dim embedding capturing both modalities (see the request sketch after this list)
- 15-20% accuracy improvement over image-only (per the multimodal fusion literature cited in references)
- no additional cost or latency
- better semantic alignment between text and visual content
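to make the request shape concrete, here's a minimal sketch of one early-fusion call against voyage's multimodal embeddings REST endpoint. the payload mirrors the content format already used in scripts/ingest_bufos.py; the endpoint path and response shape are from voyage's public docs, so verify against the current API before relying on this:

import os
import requests

def embed_fused(filename_text: str, img_base64: str) -> list[float]:
    # one request carrying both modalities: the model fuses them
    # into a single 1024-dim embedding
    resp = requests.post(
        "https://api.voyageai.com/v1/multimodalembeddings",
        headers={"Authorization": f"Bearer {os.environ['VOYAGE_API_KEY']}"},
        json={
            "model": "voyage-multimodal-3",
            "inputs": [{
                "content": [
                    {"type": "text", "text": filename_text},
                    {
                        "type": "image_base64",
                        "image_base64": f"data:image/webp;base64,{img_base64}",
                    },
                ],
            }],
        },
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["data"][0]["embedding"]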
why not late fusion?#
- would require 2 separate embeddings per bufo (text + image)
- 2x embedding costs
- more complex search logic
- research shows early fusion outperforms late fusion when you have a unified multimodal model (contrast sketch below)
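for contrast, late fusion would look roughly like this - embed_text, embed_image, and cosine are hypothetical single-modality helpers, shown only to make the extra cost and merge logic concrete:

# late fusion (not recommended): two API calls per bufo instead of one
text_vec = embed_text(filename_text)
image_vec = embed_image(img_base64)

# query time: score against both vectors, then blend manually
score = 0.5 * cosine(query_vec, text_vec) + 0.5 * cosine(query_vec, image_vec)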
implementation approach#
1. modify ingestion script (scripts/ingest_bufos.py)#
current (lines 151-154):
content = [{
"type": "image_base64",
"image_base64": f"data:image/webp;base64,{img_base64}",
}]
proposed:
# extract semantic meaning from filename
filename_text = image_path.stem.replace("-", " ") # "bufo-jumping-on-bed" -> "bufo jumping on bed"
content = [
{
"type": "text",
"text": filename_text
}
]
# add frame(s)
if is_animated:
for frame_idx in frame_indices:
# ... existing frame extraction ...
content.append({
"type": "image_base64",
"image_base64": f"data:image/webp;base64,{img_base64}",
})
else:
# ... existing static image code ...
content.append({
"type": "image_base64",
"image_base64": f"data:image/webp;base64,{img_base64}",
})
2. re-run ingestion#
after modifying the script, re-run ingestion to update all embeddings:
uv run scripts/ingest_bufos.py
3. optional: weighted RRF tuning#
if testing shows benefit, consider weighting vector search more heavily in src/search.rs:
let vector_weight = 0.7; // prioritize semantic similarity
let bm25_weight = 0.3; // still use keyword matching
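for reference, weighted RRF is just a weighted sum of reciprocal ranks. a python sketch of the formula the rust snippet would implement (k=60 is the conventional RRF constant; the rank dicts are assumed to hold 1-based ranks from each retriever):

def weighted_rrf(
    vector_ranks: dict[str, int],
    bm25_ranks: dict[str, int],
    vector_weight: float = 0.7,
    bm25_weight: float = 0.3,
    k: int = 60,
) -> dict[str, float]:
    # each retriever contributes weight / (k + rank) per document
    scores: dict[str, float] = {}
    for doc_id, rank in vector_ranks.items():
        scores[doc_id] = scores.get(doc_id, 0.0) + vector_weight / (k + rank)
    for doc_id, rank in bm25_ranks.items():
        scores[doc_id] = scores.get(doc_id, 0.0) + bm25_weight / (k + rank)
    return scores  # sort descending for the fused ranking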
expected benefits#
- queries like "bufo jumping on bed" match both visual content AND filename semantics
- queries with specific terms (colors, actions, objects) better aligned
- single unified embedding maintains simple architecture
- 15-20% accuracy improvement based on multimodal fusion literature
references#
- voyage AI docs: multimodal-3 interleaved content
- research: "FuseLIP: Multimodal Embeddings via Early Fusion" (arXiv 2506.03096)
- best practices from turbopuffer's multimodal search documentation
testing#
create a test set of ~20 queries to validate improvements (a scoring harness sketch follows the examples):
- "bufo jumping" → should match jumping-related filenames + visual content
- "yellow toad" → should match color in filename + visual yellow
- "bufo sleeping peacefully" → semantic filename matching
this was ultimately addressed by https://tangled.org/@zzstoatzz.io/find-bufo/commit/ed622b4b7ab31adbab3069102e5cba100ef88304