OpenLLaVA



Inject Vision Into Any Language Model.

Open-source framework for adding multimodal vision capabilities to any HuggingFace LLM.
Low-level. Fast. Free. Built by OpceanAI.



What is OpenLLaVA?

OpenLLaVA is an open-source framework that injects vision capabilities into any language model — no architecture restrictions, no hardcoded backends, no compromises. Built on the LLaVA-style projection architecture and extended with custom CUDA kernels, a C++ core, and a clean Python API.

The framework is developed and maintained by OpceanAI as infrastructure for their vision model pipeline. Every model OpceanAI releases through OpenLLaVA feeds improvements back into the framework.

The central design goal: when a new language model drops, you should have a vision version in 48 hours.




Quickstart


pip install openllava

from openllava import patch_model
from transformers import AutoModelForCausalLM, AutoTokenizer

# Any HuggingFace model. Any vision encoder.
model = AutoModelForCausalLM.from_pretrained("your-org/your-llm")
tokenizer = AutoTokenizer.from_pretrained("your-org/your-llm")

model = patch_model(
    model,
    vision_encoder="google/siglip2-so400m-patch14-384",
    projector_layers=3,
)

That's it. OpenLLaVA auto-detects hidden dimensions, builds the projector, and patches the tokenizer. No boilerplate. No config files.
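
Once patched, the model is used like any other causal LM. A minimal inference sketch follows: the image preprocessing uses the vision encoder's standard HuggingFace processor, but passing pixel_values into generate() is an assumption about the patched model's interface, not an API documented in the quickstart.

from PIL import Image
from transformers import AutoImageProcessor

# Preprocess with the vision encoder's own processor (standard HF call).
processor = AutoImageProcessor.from_pretrained("google/siglip2-so400m-patch14-384")
pixel_values = processor(images=Image.open("photo.jpg"), return_tensors="pt").pixel_values

# Assumed interface: the patched model accepts pixel_values alongside input_ids.
prompt = "<image>\nDescribe this image."
inputs = tokenizer(prompt, return_tensors="pt")
output_ids = model.generate(**inputs, pixel_values=pixel_values, max_new_tokens=128)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))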




Architecture


Vision Encoder

Any encoder from HuggingFace — SigLIP 2, CLIP, EVA-CLIP, InternViT. OpenLLaVA auto-reads the output dimension and handles tokenization regardless of encoder architecture.
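
As an illustration of what "auto-reads the output dimension" means in practice, here is a minimal sketch using only standard HuggingFace config loading. This is not OpenLLaVA's internal code; the helper name is hypothetical.

from transformers import AutoConfig

def infer_projector_dims(encoder_name: str, llm_name: str) -> tuple[int, int]:
    # Projector input width comes from the vision encoder config, output
    # width from the LLM config. CLIP/SigLIP-style checkpoints nest the
    # encoder width under vision_config; plain vision models expose it directly.
    enc_cfg = AutoConfig.from_pretrained(encoder_name)
    llm_cfg = AutoConfig.from_pretrained(llm_name)
    vision_cfg = getattr(enc_cfg, "vision_config", enc_cfg)
    return vision_cfg.hidden_size, llm_cfg.hidden_size

# infer_projector_dims("google/siglip2-so400m-patch14-384", "your-org/your-llm")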


Projector Engine

3-layer MLP with GELU activation. Ships as a plain PyTorch module first, with a fused CUDA kernel planned for higher throughput than the naive implementation. The hidden dimension is auto-computed from the encoder output → LLM input.
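
For reference, a plain-PyTorch sketch of the projector described above. The layer layout follows the description (three linear layers with GELU, encoder width in, LLM width out); the choice of intermediate width is an assumption, not OpenLLaVA's exact implementation.

import torch
import torch.nn as nn

class MLPProjector(nn.Module):
    """Maps vision encoder features into the LLM embedding space."""

    def __init__(self, encoder_dim: int, llm_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(encoder_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vision_features: torch.Tensor) -> torch.Tensor:
        # vision_features: (batch, num_patches, encoder_dim)
        return self.net(vision_features)  # (batch, num_patches, llm_dim)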

Model Patcher

Patches any HuggingFace causal LM to accept vision tokens. Adds <image> special token, extends the embedding layer, and wires the projector output into the LLM input stream. Supports LoRA-patched models.
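
The token and embedding steps below are standard HuggingFace calls; the splice function shows one common LLaVA-style way to wire projected vision tokens into the input stream. OpenLLaVA's actual wiring is not documented here, so treat the splice as an assumption.

import torch

# Register the <image> token and grow the embedding table to match.
num_added = tokenizer.add_special_tokens({"additional_special_tokens": ["<image>"]})
if num_added > 0:
    model.resize_token_embeddings(len(tokenizer))
image_token_id = tokenizer.convert_tokens_to_ids("<image>")

def splice_vision_tokens(input_ids, vision_embeds):
    # input_ids: (seq_len,) containing exactly one <image> placeholder
    # vision_embeds: (num_patches, llm_dim), the projector output
    pos = (input_ids == image_token_id).nonzero(as_tuple=True)[0].item()
    text_embeds = model.get_input_embeddings()(input_ids)
    return torch.cat([text_embeds[:pos], vision_embeds, text_embeds[pos + 1:]], dim=0)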


Training Engine

Two-phase training built in. Phase 1: projector warmup with frozen LLM. Phase 2: joint fine-tuning with LoRA. Gradient checkpointing, Flash Attention 2, and bfloat16 enabled by default.
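
A sketch of what the two phases amount to in terms of trainable parameters, using peft for the LoRA phase. The `model.projector` attribute name is hypothetical; the LoRA values mirror the training example below.

from peft import LoraConfig, get_peft_model

# Phase 1: only the projector trains; the LLM and vision encoder stay frozen.
for p in model.parameters():
    p.requires_grad = False
for p in model.projector.parameters():  # hypothetical attribute for the patched-in MLP
    p.requires_grad = True

# Phase 2: add LoRA adapters to the attention projections and fine-tune
# jointly with the projector.
lora_cfg = LoraConfig(
    r=64,
    lora_alpha=128,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)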




Stack


Layer         Technology    Purpose
CUDA Kernels  C/CUDA        Fused projector ops, vision token attention
Core          C++           Memory management, tensor routing
Bindings      pybind11      C++ → Python bridge
API           Python        Public interface
Export        HuggingFace   Standard model format + GGUF



Training Pipeline


from openllava import OpenLLaVATrainer

trainer = OpenLLaVATrainer(
    model=model,
    vision_encoder="google/siglip2-so400m-patch14-384",
    pretrain_dataset="liuhaotian/LLaVA-Pretrain",   # Phase 1
    instruct_dataset="liuhaotian/LLaVA-Instruct-150K",  # Phase 2
    lora_r=64,
    lora_alpha=128,
    lora_target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

trainer.train()  # Handles both phases automatically

OpenLLaVA manages phase transitions, learning rate schedules, and checkpoint saving. You run one command.
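
Checkpoints are written in the standard HuggingFace layout (see the Stack table), so the usual save calls apply; the trainer attribute and directory names here are illustrative assumptions.

# Assumed: the trainer exposes the trained model; exact attribute may differ.
trainer.model.save_pretrained("openllava-vl-checkpoint")
tokenizer.save_pretrained("openllava-vl-checkpoint")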




OpceanAI Vision Models


OpceanAI uses OpenLLaVA to publish vision versions of new language models as they are released. These are the models built with the framework:


Yaki YuuKi+ Vision (in development)

Vision-language model built on Yuuki RxG 8B (DeepSeek-R1-Qwen2.5-8B fine-tune). Complex visual reasoning, bilingual (ES/EN), preserves the Yuuki <think> chain-of-thought behavior for multimodal tasks.

Vision encoder: SigLIP 2 SO400M · LoRA r=64


Yuuki NxG VL

7B vision-language model fine-tuned from Qwen2.5-VL-7B-Instruct. Extends the NxG model family to multimodal tasks. The first OpceanAI vision model and the validation case for the OpenLLaVA pipeline.





Philosophy


Model-Agnostic by Design

Every major framework for multimodal training — LLaVA, LLaVA-Next, InstructBLIP — is hardcoded to specific model families. OpenLLaVA is not. The projector adapts to any hidden dimension. The patcher works on any causal LM. The training engine handles any tokenizer.

Speed Over Ceremony

When a new language model drops, the window to publish a vision version is 48–72 hours before the ecosystem moves on. OpenLLaVA is designed for that constraint — minimal configuration, automated phase management, one-command training.

Low Level Where It Matters

The projector is the critical path. Everything else can be Python. The fused CUDA kernel for the MLP op and the C++ memory manager are on the roadmap because training throughput on a single A100 is the binding constraint for a zero-budget lab.

Fully Open

Apache 2.0. No gating. No commercial restrictions. The framework exists so that any researcher — with any model, any hardware, any budget — can build a competitive vision-language model.




Roadmap


Framework

Feature                               Status
Python API + model patcher            In development
MLP projector (PyTorch)               In development
Two-phase training engine             In development
Fused CUDA projector kernel           Planned
C++ memory core                       Planned
GGUF vision export                    Planned
Multi-encoder support (BRAVE-style)   Planned

Vision Models

Model                      Status
Yuuki NxG VL               Released
Yaki YuuKi+ Vision (8B)    In development
Community model pipeline   Planned



Contributing


OpenLLaVA is built to be extended. If you patch a model family that isn't supported yet, the contribution belongs in the framework. If you find a faster kernel implementation, open a PR.

The project is maintained by OpceanAI but owned by the community.

git clone https://github.com/OpceanAI/openllava
cd openllava
pip install -e ".[dev]"



Built by OpceanAI


OpenLLaVA is the vision infrastructure layer of OpceanAI — an independent AI research organization operating with no institutional funding, no cloud compute budget, and no team. Every model in the OpceanAI vision pipeline is trained on Google Colab Pro and validated on consumer hardware.


OpceanAI   HuggingFace   Sponsor




Open framework. Open models. Zero budget. Measurable results.


OpenLLaVA


The fastest path from any language model to a vision-language model.