OpenLLaVA



Inject Vision Into Any Language Model.

Open-source framework for adding multimodal vision capabilities to any HuggingFace LLM.
Low-level. Fast. Free. Built by OpceanAI.



What is OpenLLaVA?

OpenLLaVA is an open-source framework that injects vision capabilities into any language model — no architecture restrictions, no hardcoded backends, no compromises. Built on the LLaVA-style projection architecture and extended with custom CUDA kernels, a C++ core, and a clean Python API.

The framework is developed and maintained by OpceanAI as infrastructure for their vision model pipeline. Every model OpceanAI releases through OpenLLaVA feeds improvements back into the framework.

The central design goal: when a new language model drops, you should have a vision version in 48 hours.




Quickstart


pip install openllava

from openllava import patch_model
from transformers import AutoModelForCausalLM, AutoTokenizer

# Any HuggingFace model. Any vision encoder.
model = AutoModelForCausalLM.from_pretrained("your-org/your-llm")
tokenizer = AutoTokenizer.from_pretrained("your-org/your-llm")

model = patch_model(
    model,
    vision_encoder="google/siglip2-so400m-patch14-384",
    projector_layers=3,
)

That's it. OpenLLaVA auto-detects hidden dimensions, builds the projector, and patches the tokenizer. No boilerplate. No config files.
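
Once patched, the model is used like any other causal LM. A minimal inference sketch follows: the image preprocessing uses the vision encoder's standard HuggingFace processor, but passing pixel_values into generate() is an assumption about the patched model's interface, not an API documented in the quickstart.

from PIL import Image
from transformers import AutoImageProcessor

# Preprocess with the vision encoder's own processor (standard HF call).
processor = AutoImageProcessor.from_pretrained("google/siglip2-so400m-patch14-384")
pixel_values = processor(images=Image.open("photo.jpg"), return_tensors="pt").pixel_values

# Assumed interface: the patched model accepts pixel_values alongside input_ids.
prompt = "<image>\nDescribe this image."
inputs = tokenizer(prompt, return_tensors="pt")
output_ids = model.generate(**inputs, pixel_values=pixel_values, max_new_tokens=128)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))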




Architecture


Vision Encoder

Any encoder from HuggingFace — SigLIP 2, CLIP, EVA-CLIP, InternViT. OpenLLaVA auto-reads the output dimension and handles tokenization regardless of encoder architecture.
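
As an illustration of what "auto-reads the output dimension" means in practice, here is a minimal sketch using only standard HuggingFace config loading. This is not OpenLLaVA's internal code; the helper name is hypothetical.

from transformers import AutoConfig

def infer_projector_dims(encoder_name: str, llm_name: str) -> tuple[int, int]:
    # Projector input width comes from the vision encoder config, output
    # width from the LLM config. CLIP/SigLIP-style checkpoints nest the
    # encoder width under vision_config; plain vision models expose it directly.
    enc_cfg = AutoConfig.from_pretrained(encoder_name)
    llm_cfg = AutoConfig.from_pretrained(llm_name)
    vision_cfg = getattr(enc_cfg, "vision_config", enc_cfg)
    return vision_cfg.hidden_size, llm_cfg.hidden_size

# infer_projector_dims("google/siglip2-so400m-patch14-384", "your-org/your-llm")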


Projector Engine

3-layer MLP with GELU activation. Ships as a plain PyTorch module first, with a fused CUDA kernel planned for higher throughput than the naive implementation. The hidden dimension is auto-computed from the encoder output → LLM input.
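
For reference, a plain-PyTorch sketch of the projector described above. The layer layout follows the description (three linear layers with GELU, encoder width in, LLM width out); the choice of intermediate width is an assumption, not OpenLLaVA's exact implementation.

import torch
import torch.nn as nn

class MLPProjector(nn.Module):
    """Maps vision encoder features into the LLM embedding space."""

    def __init__(self, encoder_dim: int, llm_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(encoder_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vision_features: torch.Tensor) -> torch.Tensor:
        # vision_features: (batch, num_patches, encoder_dim)
        return self.net(vision_features)  # (batch, num_patches, llm_dim)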

Model Patcher

Patches any HuggingFace causal LM to accept vision tokens. Adds <image> special token, extends the embedding layer, and wires the projector output into the LLM input stream. Supports LoRA-patched models.
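
The token and embedding steps below are standard HuggingFace calls; the splice function shows one common LLaVA-style way to wire projected vision tokens into the input stream. OpenLLaVA's actual wiring is not documented here, so treat the splice as an assumption.

import torch

# Register the <image> token and grow the embedding table to match.
num_added = tokenizer.add_special_tokens({"additional_special_tokens": ["<image>"]})
if num_added > 0:
    model.resize_token_embeddings(len(tokenizer))
image_token_id = tokenizer.convert_tokens_to_ids("<image>")

def splice_vision_tokens(input_ids, vision_embeds):
    # input_ids: (seq_len,) containing exactly one <image> placeholder
    # vision_embeds: (num_patches, llm_dim), the projector output
    pos = (input_ids == image_token_id).nonzero(as_tuple=True)[0].item()
    text_embeds = model.get_input_embeddings()(input_ids)
    return torch.cat([text_embeds[:pos], vision_embeds, text_embeds[pos + 1:]], dim=0)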


Training Engine

Two-phase training built in. Phase 1: projector warmup with frozen LLM. Phase 2: joint fine-tuning with LoRA. Gradient checkpointing, Flash Attention 2, and bfloat16 enabled by default.
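
A sketch of what the two phases amount to in terms of trainable parameters, using peft for the LoRA phase. The `model.projector` attribute name is hypothetical; the LoRA values mirror the training example below.

from peft import LoraConfig, get_peft_model

# Phase 1: only the projector trains; the LLM and vision encoder stay frozen.
for p in model.parameters():
    p.requires_grad = False
for p in model.projector.parameters():  # hypothetical attribute for the patched-in MLP
    p.requires_grad = True

# Phase 2: add LoRA adapters to the attention projections and fine-tune
# jointly with the projector.
lora_cfg = LoraConfig(
    r=64,
    lora_alpha=128,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)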




Stack


Layer         Technology    Purpose
CUDA Kernels  C/CUDA        Fused projector ops, vision token attention
Core          C++           Memory management, tensor routing
Bindings      pybind11      C++ → Python bridge
API           Python        Public interface
Export        HuggingFace   Standard model format + GGUF



Training Pipeline


from openllava import OpenLLaVATrainer

trainer = OpenLLaVATrainer(
    model=model,
    vision_encoder="google/siglip2-so400m-patch14-384",
    pretrain_dataset="liuhaotian/LLaVA-Pretrain",   # Phase 1
    instruct_dataset="liuhaotian/LLaVA-Instruct-150K",  # Phase 2
    lora_r=64,
    lora_alpha=128,
    lora_target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

trainer.train()  # Handles both phases automatically

OpenLLaVA manages phase transitions, learning rate schedules, and checkpoint saving. You run one command.
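
Checkpoints are written in the standard HuggingFace layout (see the Stack table), so the usual save calls apply; the trainer attribute and directory names here are illustrative assumptions.

# Assumed: the trainer exposes the trained model; exact attribute may differ.
trainer.model.save_pretrained("openllava-vl-checkpoint")
tokenizer.save_pretrained("openllava-vl-checkpoint")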




OpceanAI Vision Models


OpceanAI uses OpenLLaVA to publish vision versions of new language models as they are released. These are the models built with the framework:


Yaki YuuKi+ Vision (in development)

Vision-language model built on Yuuki RxG 8B (DeepSeek-R1-Qwen2.5-8B fine-tune). Complex visual reasoning, bilingual (ES/EN), preserves the Yuuki <think> chain-of-thought behavior for multimodal tasks.

Vision encoder: SigLIP 2 SO400M · LoRA r=64


Yuuki NxG VL

7B vision-language model fine-tuned from Qwen2.5-VL-7B-Instruct. Extends the NxG model family to multimodal tasks. The first OpceanAI vision model and the validation case for the OpenLLaVA pipeline.





Philosophy


Model-Agnostic by Design

Every major framework for multimodal training — LLaVA, LLaVA-Next, InstructBLIP — is hardcoded to specific model families. OpenLLaVA is not. The projector adapts to any hidden dimension. The patcher works on any causal LM. The training engine handles any tokenizer.

Speed Over Ceremony

When a new language model drops, the window to publish a vision version is 48–72 hours before the ecosystem moves on. OpenLLaVA is designed for that constraint — minimal configuration, automated phase management, one-command training.

Low Level Where It Matters

The projector is the critical path. Everything else can be Python. The fused CUDA kernel for the MLP op and the C++ memory manager are on the roadmap because training throughput on a single A100 is the binding constraint for a zero-budget lab.

Fully Open

Apache 2.0. No gating. No commercial restrictions. The framework exists so that any researcher — with any model, any hardware, any budget — can build a competitive vision-language model.




Roadmap


Framework

Feature                               Status
Python API + model patcher            In development
MLP projector (PyTorch)               In development
Two-phase training engine             In development
Fused CUDA projector kernel           Planned
C++ memory core                       Planned
GGUF vision export                    Planned
Multi-encoder support (BRAVE-style)   Planned

Vision Models

Model                      Status
Yuuki NxG VL               Released
Yaki YuuKi+ Vision (8B)    In development
Community model pipeline   Planned



Contributing


OpenLLaVA is built to be extended. If you patch a model family that isn't supported yet, the contribution belongs in the framework. If you find a faster kernel implementation, open a PR.

The project is maintained by OpceanAI but owned by the community.

git clone https://github.com/OpceanAI/openllava
cd openllava
pip install -e ".[dev]"



Built by OpceanAI


OpenLLaVA is the vision infrastructure layer of OpceanAI — an independent AI research organization operating with no institutional funding, no cloud compute budget, and no team. Every model in the OpceanAI vision pipeline is trained on Google Colab Pro and validated on consumer hardware.


OpceanAI   HuggingFace   Sponsor




Open framework. Open models. Zero budget. Measurable results.


OpenLLaVA


The fastest path from any language model to a vision-language model.