Multimodal Memory

Pensyve doesn't just remember text. The multimodal engine maintains three separate vector spaces — one each for text, images, and code — with specialized embedding models for each modality.

How It Works

When you store a memory, Pensyve detects the content type and routes it to the appropriate embedding model and vector space:

Modality | Model | Dimensions | What it captures
---------|-------|------------|-----------------
Text | GTE-Small (ONNX) | 384d | Natural language — conversations, facts, preferences
Image | Florence-2 (ONNX) | 768d | Visual content — screenshots, diagrams, UI mockups
Code | UniXcoder (ONNX) | 768d | Source code — functions, patterns, implementations
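The routing step can be sketched as a dispatch on detected content type. This is a toy stand-in — the detection heuristic and function names below are illustrative, not Pensyve's actual classifier; only the model names and dimensions come from the table above:

```python
import re

# Routing table mirroring the modality table above.
MODALITY_CONFIG = {
    "text":  {"model": "GTE-Small",  "dims": 384},
    "image": {"model": "Florence-2", "dims": 768},
    "code":  {"model": "UniXcoder",  "dims": 768},
}

IMAGE_EXTENSIONS = (".png", ".jpg", ".jpeg", ".gif", ".webp")

def detect_modality(content: str) -> str:
    """Toy content-type detection: image paths, then code-ish syntax, else text."""
    if content.lower().endswith(IMAGE_EXTENSIONS):
        return "image"
    if re.search(r"\bdef \w+\(|\bfunction \w+\(|\bfn \w+\(|[{};]\s*$", content, re.M):
        return "code"
    return "text"

def route(content: str) -> dict:
    """Pick the embedding model and vector space for a piece of content."""
    modality = detect_modality(content)
    return {"modality": modality, **MODALITY_CONFIG[modality]}
```

For example, `route("def add(a, b): ...")` lands in the 768d code space, while a plain sentence routes to the 384d text space.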

Each modality gets its own HNSW vector index. During recall, Pensyve searches across all spaces where the query embedding dimensions match, normalizes scores, and merges results before feeding into the 8-signal RRF pipeline.
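The merge step can be sketched as follows, assuming each per-modality search returns (id, raw score) pairs. The min-max normalization shown here is one plausible scheme — the source doesn't specify the exact normalization Pensyve uses:

```python
def normalize(hits):
    """Min-max normalize raw scores to [0, 1] within one vector space."""
    if not hits:
        return []
    scores = [s for _, s in hits]
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [(mem_id, 1.0) for mem_id, _ in hits]
    return [(mem_id, (s - lo) / (hi - lo)) for mem_id, s in hits]

def merge_spaces(per_space_hits):
    """Normalize each space's scores, then merge into one ranked candidate list."""
    best = {}
    for hits in per_space_hits:
        for mem_id, score in normalize(hits):
            # Keep the best normalized score per memory id.
            best[mem_id] = max(score, best.get(mem_id, 0.0))
    return sorted(best.items(), key=lambda kv: kv[1], reverse=True)
```

Normalizing within each space first matters because raw similarity scores from different embedding models aren't directly comparable; the merged list is what feeds into downstream fusion.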

Image Memory

Powered by Microsoft's Florence-2 vision-language model (~230 MB ONNX), image memory lets agents remember visual content.

Use cases:

  • UI state tracking — remember what a dashboard looked like before a change
  • Diagram comprehension — store architecture diagrams and retrieve them by description
  • Screenshot analysis — remember error screens, deployment states, visual diffs

Images are embedded into a 768-dimensional space. The model runs locally via ONNX Runtime — no external API calls.

Code Memory

Powered by Microsoft's UniXcoder (~125 MB ONNX), code memory understands programming language structure at a deeper level than generic text embeddings.

Use cases:

  • Pattern recall — "How did we implement the rate limiter?"
  • Implementation memory — remember code patterns across sessions
  • Cross-language understanding — UniXcoder works across Python, TypeScript, Rust, Go, and more

Code embeddings capture semantic structure (function signatures, control flow patterns) rather than just token overlap, so a search for "retry with backoff" will find relevant implementations even if they don't contain those exact words.
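A toy illustration of the lexical-vs-semantic difference, using hand-assigned 3-d vectors as stand-ins for real UniXcoder embeddings (the vectors and snippets below are fabricated for demonstration):

```python
import math

def jaccard(a: str, b: str) -> float:
    """Lexical overlap between two token sets."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

def cosine(u, v) -> float:
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))

query = "retry with backoff"
snippet = "for attempt in range(5): sleep(2 ** attempt); call()"

# Hand-made toy vectors standing in for learned code embeddings.
query_vec   = [0.9, 0.1, 0.2]
snippet_vec = [0.8, 0.2, 0.1]   # semantically close: an exponential-retry loop
other_vec   = [0.1, 0.9, 0.8]   # unrelated code

print(jaccard(query, snippet))          # 0.0 — no tokens in common
print(cosine(query_vec, snippet_vec))   # high despite zero lexical overlap
```

The query and snippet share no tokens, so any token-overlap search scores them zero, while embeddings placed near each other in vector space still match.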

Multimodal Recall

When you call recall(), Pensyve's query intent classifier detects whether the query is visual ("show me the diagram"), code-related ("how did we implement auth"), or general text. Intent classification influences which vector spaces are searched and how results are weighted.
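The shape of that decision can be sketched with a keyword heuristic. Pensyve's actual classifier is not documented here — the cue lists and function below are hypothetical:

```python
VISUAL_CUES = ("diagram", "screenshot", "mockup", "show me", "looked like")
CODE_CUES = ("implement", "function", "code", "pattern", "auth")

def classify_intent(query: str) -> list:
    """Return the vector spaces to search, most relevant first (toy heuristic)."""
    q = query.lower()
    spaces = []
    if any(cue in q for cue in VISUAL_CUES):
        spaces.append("image")
    if any(cue in q for cue in CODE_CUES):
        spaces.append("code")
    spaces.append("text")  # the text space always participates
    return spaces
```

So "show me the diagram" prioritizes the image space, "how did we implement auth" prioritizes the code space, and a general query searches text only.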

All modalities participate in the same 8-signal fusion pipeline — a code memory can outrank a text memory if it is more relevant, more recent, and more strongly activated.
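Reciprocal Rank Fusion itself is simple to sketch. This two-signal toy is not the full 8-signal pipeline; k=60 is the conventional RRF constant, and the signal names and ids are made up:

```python
def rrf_fuse(ranked_lists, k=60):
    """Fuse ranked lists of memory ids with Reciprocal Rank Fusion."""
    scores = {}
    for ranking in ranked_lists:
        for rank, mem_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) for every id it ranks.
            scores[mem_id] = scores.get(mem_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Toy rankings from two signals: vector similarity and recency.
similarity = ["code_42", "text_7", "img_3"]
recency    = ["code_42", "img_3", "text_7"]
print(rrf_fuse([similarity, recency]))
```

Because RRF operates on ranks rather than raw scores, memories from different modalities can be fused without their similarity scales needing to agree — which is exactly what lets a code memory outrank a text memory.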

All models run locally via ONNX Runtime. No data leaves your machine (or your Pensyve Cloud region). The total model footprint is approximately 750 MB across all three modalities plus the reranker.

Pricing

On Pensyve Cloud, multimodal operations are billed at $0.005/op (5× the standard text rate) to reflect the additional compute required for specialized model inference. Your first 50 multimodal operations each month are free.
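The monthly charge works out as follows — a sketch of the billing formula as described above, assuming a flat per-op rate beyond the free allowance:

```python
MULTIMODAL_RATE = 0.005   # dollars per multimodal op (5x the standard text rate)
FREE_OPS = 50             # free multimodal ops per month

def multimodal_cost(ops: int) -> float:
    """Flat per-op billing after the monthly free allowance."""
    return max(ops - FREE_OPS, 0) * MULTIMODAL_RATE

print(multimodal_cost(40))    # within the free tier: $0.00
print(multimodal_cost(1050))  # 1000 billable ops: $5.00
```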

See Pricing for the full breakdown.