Last checked: April 19, 2026. Model libraries move quickly, and Ollama tags can be repointed or added after this snapshot. Treat this as a practical map of the families, not as a permanent ranking.
The short version:
- Newest practical local default: gemma4:e4b, gemma4:26b, qwen3.6, or deepseek-r1, depending on hardware.
- Best cloud-scale open model choices on Ollama: deepseek-v3.2:cloud, mistral-large-3:675b-cloud, qwen3.5:397b-cloud, gpt-oss:120b-cloud, and glm-5.1:cloud.
- Best small local reasoning: deepseek-r1, phi4-mini-reasoning, gemma4:e4b, or qwen3.5:9b.
- Best local coding/agent choices: qwen3.6, qwen3-coder-next, gemma4:26b, gpt-oss:20b, or cloud models if latency and privacy allow it.
- Best OCR/document extraction: deepseek-ocr, glm-ocr, granite3.2-vision, or a modern multimodal general model.
How to read Ollama names
Ollama model names usually follow this pattern:
family-version-specialization:size-or-variant
Examples:
- qwen3.6:35b means the Qwen 3.6 family, 35B-class variant.
- deepseek-r1:32b means DeepSeek R1 reasoning, 32B distilled model.
- mistral-large-3:675b-cloud means Mistral Large 3, 675B cloud-hosted variant.
- gemma4:e4b means Gemma 4 edge model with about 4B effective parameters.
- gpt-oss:120b means OpenAI's 120B open-weight reasoning model.
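The naming pattern can be split mechanically at the colon. A minimal sketch (the `parse_model_name` helper is hypothetical, not part of any Ollama tooling):

```python
def parse_model_name(name: str) -> dict:
    """Split an Ollama model name into family and tag,
    following the family[:size-or-variant] pattern described above."""
    family, _, tag = name.partition(":")
    return {
        "family": family,
        # "latest" is the implicit default tag when none is given
        "tag": tag or "latest",
        "is_cloud": tag.endswith("cloud"),
    }

print(parse_model_name("qwen3.6:35b"))
print(parse_model_name("mistral-large-3:675b-cloud"))
print(parse_model_name("gemma4"))  # no tag -> defaults to "latest"
```

This only mirrors the convention as described here; it does not validate names against the actual library.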
Important tags:
- latest is just Ollama's default tag for a family. It does not always mean biggest or best.
- cloud means Ollama Cloud, not local inference.
- thinking means the model can expose or internally use a reasoning mode.
- tools means the model and Ollama template support tool/function calling.
- vision means image input is supported.
- embedding means the model is for vector search, not chat.
- q4_K_M, q8_0, MXFP4, and similar tags are quantization/format hints. Smaller formats use less memory and may lose some quality.
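The memory impact of those quantization tags can be approximated with simple arithmetic. The bytes-per-parameter figures below are rough rules of thumb, not exact GGUF accounting:

```python
# Approximate bytes per parameter for common formats (rules of thumb;
# real GGUF files add metadata, scales, and mixed-precision layers)
BYTES_PER_PARAM = {
    "f16": 2.0,
    "q8_0": 1.0625,    # 8-bit values plus a scale per 32-element block
    "q4_K_M": 0.5625,  # roughly 4.5 bits per weight on average
}

def approx_size_gb(params_billion: float, fmt: str) -> float:
    """Rough weight-file size in GB for a parameter count and format."""
    return params_billion * BYTES_PER_PARAM[fmt]

for fmt in ("f16", "q8_0", "q4_K_M"):
    print(f"7B model at {fmt}: ~{approx_size_gb(7, fmt):.1f} GB")
```

This is why a 7B model that needs ~14GB at f16 fits comfortably on an 8GB machine at q4_K_M; actual downloads will differ by a gigabyte or so.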
Quick chooser
| Use case | Start here |
|---|---|
| Good local chat on ordinary hardware | gemma4:e4b, qwen3.5:9b, llama3.2:3b, phi4-mini |
| Strong local reasoning | deepseek-r1, phi4-reasoning, gemma4:26b, qwen3.6 |
| Large local workstation | gemma4:31b, qwen3.6, deepseek-r1:32b, llama3.3:70b |
| Best cloud reasoning/agents | deepseek-v3.2:cloud, mistral-large-3:675b-cloud, glm-5.1:cloud |
| Coding agents | qwen3.6, qwen3-coder-next, gpt-oss:20b, deepseek-v3.2:cloud |
| Vision and screenshots | gemma4, qwen3.5, llama4, mistral-small3.2, ministral-3 |
| OCR and documents | deepseek-ocr, glm-ocr, granite3.2-vision |
| RAG embeddings | qwen3-embedding, embeddinggemma, granite-embedding |
Qwen
Qwen is Alibaba's open model family. It has become one of the strongest all-around families for local users because it covers small laptops, coding agents, long-context MoE models, vision, embeddings, and cloud-scale frontier models.
Main generations
qwen is legacy Qwen 1.5. It is mostly superseded.
qwen2 improved multilingual coverage, coding, math, and long context over Qwen 1.5.
qwen2.5 was the strong 2024/2025 default. It improved knowledge, coding, math, JSON/structured outputs, long-text
generation, and instruction following. Ollama lists sizes from 0.5B to 72B.
qwen3 introduced hybrid thinking/non-thinking behavior, stronger reasoning, better tool/agent use, dense models, and
MoE models. Qwen's own Qwen3 blog says Qwen3 was trained on about 36T tokens and supports 119 languages and dialects.
qwen3.5 is newer and more unified: multimodal, long context, thinking, tools, and cloud variants. Ollama lists sizes
from 0.8B to 122B plus larger cloud models.
qwen3.6 is currently the newest Qwen model surfaced on Ollama's search page. Ollama lists it as a 35B model with about
24GB local size, 256K context, text+image input, tools, and thinking. Upstream model cards describe it as a 35B total /
3B active MoE model focused on agentic coding, multimodal reasoning, and "thinking preservation."
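The 35B-total / 3B-active split matters because a MoE model's memory cost scales with total parameters while per-token compute scales with active parameters. A back-of-envelope comparison, assuming ~4-bit weights (Ollama's ~24GB package is larger than this raw estimate because of KV cache, vision components, and mixed-precision layers):

```python
def moe_costs(total_b: float, active_b: float, bytes_per_param: float = 0.5) -> dict:
    """Rough MoE cost model: all expert weights must sit in memory,
    but each token only runs through the active experts."""
    return {
        "weight_memory_gb": total_b * bytes_per_param,
        # relative per-token FLOPs vs. a dense model of the same total size
        "compute_fraction": active_b / total_b,
    }

c = moe_costs(total_b=35, active_b=3)
print(c)  # all 35B of weights in memory, but ~9% of dense compute per token
```

The same arithmetic explains why the cloud-scale MoE models below (671B or 675B total, ~40B active) are usable at all: they compute like a ~40B model while storing like a ~670B one.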
Special variants
qwen3-coder and qwen3-coder-next are coding-agent models. qwen3-coder-next is based on the Qwen3-Next
architecture, with 80B total / 3B active parameters on Ollama, 256K context, and non-thinking mode only.
qwen3-next is an efficiency branch using hybrid attention and high-sparsity MoE ideas.
qwen3-vl and qwen2.5vl are vision-language branches. Use them for screenshots, charts, visual grounding, GUI tasks,
OCR-ish extraction, and image reasoning.
qwen3-embedding is for embeddings and retrieval, not chat.
qwq was the older Qwen reasoning line. Qwen3 mostly absorbed that role.
Practical picks
Use qwen3.6 if you can run the 24GB local model and want the newest local Qwen for coding, agents, vision, and general
chat.
Use qwen3.5:9b or qwen3.5:4b if you want a newer, easier local multimodal model.
Use qwen3-coder-next or qwen3-coder:30b for repo-scale coding agents.
Use qwen3-embedding for RAG.
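Embedding models like qwen3-embedding return vectors, and retrieval is then just nearest-neighbor search over them. A minimal cosine-similarity sketch, with made-up 4-dimensional vectors standing in for real embeddings:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy stand-ins: a real pipeline would get these from the embedding model
docs = {
    "cats": [0.9, 0.1, 0.0, 0.1],
    "dogs": [0.8, 0.2, 0.1, 0.0],
    "tax law": [0.0, 0.1, 0.9, 0.4],
}
query = [0.85, 0.15, 0.05, 0.05]

best = max(docs, key=lambda name: cosine(query, docs[name]))
print(best)  # the pet documents score far above "tax law"
```

Real embedding vectors have hundreds or thousands of dimensions, and at scale you would use an approximate-nearest-neighbor index rather than a linear scan, but the scoring step is exactly this.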
DeepSeek
DeepSeek is best understood as two parallel lines: V-series general MoE models and R-series reasoning models.
Main generations
deepseek-llm is the old first-generation bilingual general family. Ollama lists 7B and 67B variants with 4K context.
deepseek-coder is the older code family, trained heavily on code and natural language.
deepseek-v2 moved the family into economical MoE models. Ollama's practical local variant is deepseek-v2:16b with
160K context; the 236B variant is much larger.
deepseek-coder-v2 is the coding-specialized V2 branch.
deepseek-v2.5 merged the general ability of V2-Chat with Coder-V2-Instruct. It was useful in its day but is not the
first thing to reach for now.
deepseek-v3 is the 671B total / 37B active MoE flagship generation. It is huge locally on Ollama.
deepseek-r1 is the reasoning family. Ollama's current default deepseek-r1 points to the updated
DeepSeek-R1-0528-Qwen3-8B distilled model, with 128K context and a much smaller local footprint. The full
deepseek-r1:671b model is enormous.
deepseek-v3.1 is a hybrid model: one model can run in thinking or non-thinking mode. DeepSeek's model card lists 671B
total / 37B active parameters and 128K context upstream. Ollama exposes a 160K-context local package and a cloud tag.
deepseek-v3.2 is the newest DeepSeek general model on Ollama and is cloud-only there. DeepSeek's model card describes
it as a 685B-parameter model using DeepSeek Sparse Attention, scalable RL, and agentic task synthesis.
deepseek-ocr is a 3B image-text OCR/document model, not a normal chat model.
Special variants
r1-1776 is a Perplexity post-trained R1 variant aimed at reducing censorship/refusal behavior. It is not the canonical
DeepSeek release.
deepscaler is a fine-tuned DeepSeek-R1-Distill-Qwen-1.5B model focused on efficient math reasoning.
Practical picks
Use deepseek-v3.2:cloud for the newest and strongest DeepSeek model through Ollama Cloud.
Use deepseek-r1 for practical local reasoning on normal hardware.
Use deepseek-r1:32b if you have enough memory and want stronger local reasoning.
Use deepseek-ocr for document OCR and image-to-markdown extraction.
Mistral
Mistral models are attractive when you care about permissive licensing, strong multilingual behavior, function calling, and efficient models that are not tied to a single giant deployment shape.
Main generations
mistral is the original Mistral 7B line. Ollama's page says it has been updated to v0.3 and supports tools. It is old
but still common.
mixtral is the older MoE family, including 8x7B and 8x22B. It matters historically because it popularized strong
open-weight MoE models.
mistral-nemo is a 12B model from Mistral and NVIDIA with 128K context. It remains a useful middle-size local model.
mistral-small, mistral-small3.1, and mistral-small3.2 are the small-but-capable 22B/24B family. On Ollama,
mistral-small3.2 is a 24B text+image model with 128K context, vision, and tools. It specifically improves instruction
following, repetition errors, and function calling over Small 3.1.
ministral-3 is the newer edge-focused Mistral 3 family. Ollama lists 3B, 8B, and 14B variants, all text+image with
256K context, tools, cloud options, and Apache 2.0 licensing. Mistral's docs describe the 14B as optimized for local
deployment with performance comparable to Mistral Small 3.2.
mistral-large-3 is Mistral's current frontier-scale model on Ollama. It is cloud-only there as
mistral-large-3:675b-cloud, with 256K context, text+image input, native function calling, JSON output, and Apache 2.0
licensing. Mistral's docs list it as 675B total / 41B active parameters.
codestral is Mistral's older code-specialized line.
mathstral is the older math/science-specialized Mistral line.
Practical picks
Use mistral-large-3:675b-cloud for best Mistral quality through Ollama Cloud.
Use ministral-3:8b or ministral-3:14b for modern local Mistral with vision, 256K context, and tool use.
Use mistral-small3.2 if you want the 24B local model and can afford the 15GB footprint.
Use mistral-nemo if you want a mature 12B text model with long context.
Llama
Llama is Meta's open-weight model family. It is widely supported and widely fine-tuned. The main catch is that the Llama license is not the same as Apache/MIT; check the license for commercial and scale restrictions.
Main generations
llama2 is old now but still appears in many fine-tunes.
llama3 improved general quality substantially and came in 8B and 70B sizes.
llama3.1 added 8B, 70B, and 405B variants and 128K context. It remains important because many third-party fine-tunes
and tool-use models are based on it.
llama3.2 split into small text models and separate vision models. Ollama's llama3.2 page covers 1B and 3B text
models; llama3.2-vision covers 11B and 90B image-reasoning models.
llama3.3 is a 70B text model that Meta positioned as offering performance similar to Llama 3.1 405B. Ollama lists 43GB
local size, 128K context, and tools.
llama4 is the newest Llama family on Ollama. Ollama lists two MoE multimodal variants: Scout and Maverick. Scout is
109B total / 17B active with a 10M context window on Ollama; Maverick is 400B total / 17B active with a 1M context
window. Both are text+image models.
Special variants
llama-guard3 is a safety classifier, not a chat model.
llama3-chatqa is NVIDIA's Llama 3-based conversational QA/RAG model.
llava-llama3 is a LLaVA vision model based on Llama 3.
Many Dolphin, Hermes, Vicuna, and other assistant models are fine-tuned from Llama bases.
Practical picks
Use llama4 if you want Meta's newest multimodal model and can run the large local package.
Use llama3.3 if you want a mature 70B text model with broad support.
Use llama3.2:3b for a small, fast, common local model.
Use llama3.2-vision if you specifically need the older Llama vision branch and do not want Llama 4.
Gemma
Gemma is Google's open model family. It is strong for local deployment because Google provides several sizes and keeps the family focused on small-to-medium efficient models.
Main generations
gemma is the original Gemma 1 family.
gemma2 improved performance and efficiency in 2B, 9B, and 27B sizes.
gemma3 added multimodal models and was, for a while, Ollama's "current, most capable model that runs on a single GPU."
Ollama lists 270M, 1B, 4B, 12B, and 27B variants.
gemma3n is the efficient on-device branch for laptops, tablets, and phones.
gemma4 is the newest Gemma family on Ollama. Google's Gemma page positions Gemma 4 as its most intelligent open model
family, aimed at advanced reasoning, multimodal reasoning, coding, and agentic workflows. Ollama lists E2B, E4B, 26B,
and 31B variants, with text+image input, tools, thinking, audio support for the edge models, and 128K or 256K context
depending on size.
Special variants
codegemma is the older coding-specialized Gemma line.
embeddinggemma is Google's small embedding model for retrieval.
functiongemma is a tiny function-calling specialist.
shieldgemma is a safety classification family.
medgemma and medgemma1.5 are medical text/image comprehension models. Do not treat them as a substitute for clinical
review.
translategemma is a translation-focused Gemma 3 branch.
Practical picks
Use gemma4:e4b as a strong small local default.
Use gemma4:26b or gemma4:31b for workstation local reasoning, coding, and multimodal work.
Use embeddinggemma for lightweight RAG embeddings.
Use gemma3 only when you need compatibility with older Gemma 3 tooling or a smaller variant not covered by Gemma 4.
Phi
Phi is Microsoft's small-language-model family. The point of Phi is not to win every frontier benchmark; it is to be useful in memory- and latency-constrained settings.
Main generations
phi is Phi-2, a 2.7B model.
phi3 and phi3.5 are older small models with surprisingly good reasoning for their size.
phi4 is a 14B general model with 16K context on Ollama. Microsoft positions it for memory- and compute-constrained,
latency-bound use cases.
phi4-mini is a 3.8B model with tool support on Ollama. It is the better small general Phi pick than older Phi 3
models.
phi4-reasoning is a 14B reasoning model. Ollama includes both the base reasoning model and the :plus variant. The
Microsoft model card says the plus model uses RL, gets higher accuracy, and tends to generate about 50% more tokens.
phi4-mini-reasoning is a 3.8B reasoning model with 128K context on Ollama, focused on mathematical and structured
reasoning under tight memory and latency limits.
Practical picks
Use phi4-mini for small local general work with tool calling.
Use phi4-mini-reasoning for small local math/reasoning.
Use phi4-reasoning:plus for the strongest Phi reasoning behavior, accepting slower/longer outputs.
Do not use Phi as a first choice for broad multilingual or vision tasks; that is not where the family is strongest.
Granite
Granite is IBM's open enterprise-oriented family. It is less fashionable than Qwen, DeepSeek, or Llama, but useful for RAG, tool use, code, enterprise deployment, and small efficient models.
Main generations
granite3-dense, granite3.1-dense, granite3.2, and granite3.3 are 2B/8B-class text models with tool and
long-context support.
granite3-moe and granite3.1-moe are low-latency MoE variants.
granite3.2-vision is a compact vision-language model for visual document understanding.
granite-code is the older code family.
granite-embedding is for vector search.
granite4 is the newest main Granite family on Ollama. IBM's Granite 4 documentation describes hybrid Mamba-2 /
Transformer architecture, MoE in select models, Apache 2.0 licensing, and a focus on RAG, agents, edge deployment, and
tool calling. Ollama lists 350M, 1B, and 3B variants.
Practical picks
Use granite4 when you want a small enterprise-friendly model with permissive licensing and tool-calling emphasis.
Use granite3.2-vision for lightweight document/image understanding.
Use granite-embedding for IBM-flavored RAG pipelines.
GLM / Z.ai
GLM is the Z.ai family. It is increasingly relevant for reasoning, coding, and agentic workflows, especially through cloud-scale models.
Main generations
glm4 is the older local multilingual model on Ollama.
glm-4.6, glm-4.7, and glm-4.7-flash are newer agentic/coding-oriented models.
glm-5 is a 744B total / 40B active MoE model on Ollama Cloud, built for reasoning, coding, and long-horizon agents.
glm-5.1 is currently the newest GLM listing surfaced by Ollama search. Ollama describes it as the next-generation
flagship for agentic engineering, with stronger coding capabilities than GLM-5.
glm-ocr is the OCR/document understanding branch.
Practical picks
Use glm-5.1:cloud for GLM's newest cloud model on Ollama.
Use glm-4.7-flash if you want a lighter GLM option.
Use glm-ocr for OCR/document work.
OpenAI gpt-oss
gpt-oss is OpenAI's open-weight family on Ollama. It is different from OpenAI's hosted GPT models: these are
downloadable open-weight models released under Apache 2.0.
Main variants
gpt-oss:20b is the practical local model. Ollama lists it as about 14GB with 128K context.
gpt-oss:120b is the larger local model. Ollama lists it as about 65GB with 128K context.
Both variants support thinking and tools on Ollama. OpenAI's launch post describes them as open-weight reasoning models for real-world performance at low cost, with strong tool-use capabilities.
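For the tools side, Ollama's chat API accepts OpenAI-style function schemas in the request body. A hedged sketch of what such a request looks like (the get_weather tool is hypothetical, and in practice the payload would be POSTed to a local Ollama server's /api/chat endpoint, which is not done here):

```python
import json

payload = {
    "model": "gpt-oss:20b",
    "messages": [{"role": "user", "content": "What is the weather in Oslo?"}],
    "tools": [{
        "type": "function",
        "function": {
            "name": "get_weather",  # hypothetical tool, not a real API
            "description": "Look up current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }],
    "stream": False,
}

print(json.dumps(payload, indent=2))
```

When the model decides to call the tool, the response contains a tool_calls entry with arguments matching this schema; your code runs the function and sends the result back as a tool-role message.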
Ollama's page notes that the models use MXFP4 quantization for MoE weights. That is why gpt-oss:120b can fit in a
single high-memory GPU class rather than requiring dense-120B memory.
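The MXFP4 point can be sanity-checked with arithmetic: MXFP4 stores 4-bit values with a shared 8-bit scale per 32-element block, about 4.25 bits per parameter, which puts a ~120B-parameter model's weights near the 65GB Ollama lists (the small gap is plausibly the non-MoE layers kept at higher precision):

```python
# MXFP4: 4 bits per value plus one 8-bit scale shared by each 32-value block
bits_per_param = 4 + 8 / 32  # 4.25 bits effective
params = 120e9

weight_bytes = params * bits_per_param / 8
print(f"~{weight_bytes / 1e9:.0f} GB")  # ~64 GB, close to the listed 65GB
```

A dense bf16 layout of the same parameter count would need ~240GB, which is the gap between "one high-memory GPU" and "a multi-GPU server."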
Practical picks
Use gpt-oss:20b for local agentic/reasoning work if you have around 16GB+ available memory.
Use gpt-oss:120b or gpt-oss:120b-cloud when you want the stronger OpenAI open-weight model.
What I would install first
For a normal laptop:
ollama run gemma4:e4b
ollama run deepseek-r1
ollama run phi4-mini
For a workstation:
ollama run qwen3.6
ollama run gemma4:26b
ollama run deepseek-r1:32b
ollama run gpt-oss:20b
For cloud-scale experiments:
ollama run deepseek-v3.2:cloud
ollama run mistral-large-3:675b-cloud
ollama run glm-5.1:cloud
ollama run gpt-oss:120b-cloud
For documents and retrieval:
ollama run deepseek-ocr
ollama pull qwen3-embedding
ollama pull embeddinggemma
Sources checked
- Ollama search pages for Qwen, DeepSeek, Mistral, Llama, Gemma, Phi, GLM, and Granite.
- Ollama model pages for qwen3.6, qwen3.5, qwen3, qwen3-coder-next, deepseek-v3.2, deepseek-v3.1, deepseek-r1, deepseek-ocr, mistral-large-3, ministral-3, mistral-small3.2, llama4, llama3.3, gemma4, phi4-reasoning, granite4, and gpt-oss.
- Upstream sources from Qwen, DeepSeek V3.2, DeepSeek R1 0528, Mistral 3, Mistral Large 3 docs, Meta Llama 3.3 model card, Google DeepMind Gemma, Google DeepMind Gemma 4, Microsoft Phi reasoning, Phi-4-reasoning-plus model card, IBM Granite 4, and OpenAI gpt-oss.
