Open Model Family Guide
A practical guide to Qwen, DeepSeek, Mistral, Llama, Gemma, Phi, Granite, GLM, and gpt-oss model families on Ollama.

Last checked: April 19, 2026. Model libraries move quickly, and Ollama tags can be repointed or added after this snapshot. Treat this as a practical map of the families, not as a permanent ranking.

The short version:

  • Newest practical local default: gemma4:e4b, gemma4:26b, qwen3.6, or deepseek-r1, depending on hardware.
  • Best cloud-scale open model choices on Ollama: deepseek-v3.2:cloud, mistral-large-3:675b-cloud, qwen3.5:397b-cloud, gpt-oss:120b-cloud, and glm-5.1:cloud.
  • Best small local reasoning: deepseek-r1, phi4-mini-reasoning, gemma4:e4b, or qwen3.5:9b.
  • Best local coding/agent choices: qwen3.6, qwen3-coder-next, gemma4:26b, gpt-oss:20b, or cloud models if latency and privacy allow it.
  • Best OCR/document extraction: deepseek-ocr, glm-ocr, granite3.2-vision, or a modern multimodal general model.

How to read Ollama names

Ollama model names usually follow this pattern:

family-version-specialization:size-or-variant

Examples:

  • qwen3.6:35b means the Qwen 3.6 family, 35B-class variant.
  • deepseek-r1:32b means DeepSeek R1 reasoning, 32B distilled model.
  • mistral-large-3:675b-cloud means Mistral Large 3, 675B cloud-hosted variant.
  • gemma4:e4b means Gemma 4 edge model with about 4B effective parameters.
  • gpt-oss:120b means OpenAI's 120B open-weight reasoning model.

Important tags:

  • latest is just Ollama's default tag for a family. It does not always mean biggest or best.
  • cloud means Ollama Cloud, not local inference.
  • thinking means the model can expose or internally use a reasoning mode.
  • tools means the model and Ollama template support tool/function calling.
  • vision means image input is supported.
  • embedding means the model is for vector search, not chat.
  • q4_K_M, q8_0, MXFP4, and similar tags are quantization/format hints. Smaller formats use less memory and may lose some quality.
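
Whatever tag you end up choosing, the full family:tag string is exactly what you pass as the model name in the CLI and in the local API. Here is a minimal sketch against the default Ollama REST endpoint (localhost:11434), assuming you have already pulled qwen3.5:9b; swap in any tag you actually have:

import requests

# The "model" field takes the same family:tag string used by `ollama run` and `ollama pull`.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "qwen3.5:9b",   # any pulled tag works here
        "prompt": "Explain mixture-of-experts models in two sentences.",
        "stream": False,         # return a single JSON object instead of a token stream
    },
)
print(resp.json()["response"])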

Quick chooser

  • Good local chat on ordinary hardware: gemma4:e4b, qwen3.5:9b, llama3.2:3b, phi4-mini
  • Strong local reasoning: deepseek-r1, phi4-reasoning, gemma4:26b, qwen3.6
  • Large local workstation: gemma4:31b, qwen3.6, deepseek-r1:32b, llama3.3:70b
  • Best cloud reasoning/agents: deepseek-v3.2:cloud, mistral-large-3:675b-cloud, glm-5.1:cloud
  • Coding agents: qwen3.6, qwen3-coder-next, gpt-oss:20b, deepseek-v3.2:cloud
  • Vision and screenshots: gemma4, qwen3.5, llama4, mistral-small3.2, ministral-3
  • OCR and documents: deepseek-ocr, glm-ocr, granite3.2-vision
  • RAG embeddings: qwen3-embedding, embeddinggemma, granite-embedding

Qwen

Qwen is Alibaba's open model family. It has become one of the strongest all-around families for local users because it covers everything from small laptop-friendly models to coding agents, long-context MoE models, vision, embeddings, and cloud-scale frontier models.

Main generations

qwen is legacy Qwen 1.5. It is mostly superseded.

qwen2 improved multilingual coverage, coding, math, and long context over Qwen 1.5.

qwen2.5 was the strong 2024/2025 default. It improved knowledge, coding, math, JSON/structured outputs, long-text generation, and instruction following. Ollama lists sizes from 0.5B to 72B.

qwen3 introduced hybrid thinking/non-thinking behavior, stronger reasoning, better tool/agent use, dense models, and MoE models. Qwen's own Qwen3 blog says Qwen3 was trained on about 36T tokens and supports 119 languages and dialects.

qwen3.5 is newer and more unified: multimodal, long context, thinking, tools, and cloud variants. Ollama lists sizes from 0.8B to 122B plus larger cloud models.

qwen3.6 is currently the newest Qwen model surfaced on Ollama's search page. Ollama lists it as a 35B model with about 24GB local size, 256K context, text+image input, tools, and thinking. Upstream model cards describe it as a 35B total / 3B active MoE model focused on agentic coding, multimodal reasoning, and "thinking preservation."

Special variants

qwen3-coder and qwen3-coder-next are coding-agent models. qwen3-coder-next is based on the Qwen3-Next architecture, with 80B total / 3B active parameters on Ollama, 256K context, and non-thinking mode only.

qwen3-next is an efficiency branch using hybrid attention and high-sparsity MoE ideas.

qwen3-vl and qwen2.5vl are vision-language branches. Use them for screenshots, charts, visual grounding, GUI tasks, OCR-ish extraction, and image reasoning.

qwen3-embedding is for embeddings and retrieval, not chat.

qwq was the older Qwen reasoning line. Qwen3 mostly absorbed that role.

Practical picks

Use qwen3.6 if you can run the 24GB local model and want the newest local Qwen for coding, agents, vision, and general chat.

Use qwen3.5:9b or qwen3.5:4b if you want a newer, easier local multimodal model.

Use qwen3-coder-next or qwen3-coder:30b for repo-scale coding agents.

Use qwen3-embedding for RAG.
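
Embedding models are called through the embeddings endpoint rather than chat. A rough sketch, assuming qwen3-embedding is pulled and the default local API (newer Ollama versions also expose a batched embed endpoint; check your version's docs):

import requests

# /api/embeddings returns one vector for the given prompt.
r = requests.post(
    "http://localhost:11434/api/embeddings",
    json={"model": "qwen3-embedding", "prompt": "How do I set a longer context window in Ollama?"},
)
vector = r.json()["embedding"]
print(len(vector))  # dimensionality depends on the embedding model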

DeepSeek

DeepSeek is best understood as two parallel lines: V-series general MoE models and R-series reasoning models.

Main generations

deepseek-llm is the old first-generation bilingual general family. Ollama lists 7B and 67B variants with 4K context.

deepseek-coder is the older code family, trained heavily on code and natural language.

deepseek-v2 moved the family into economical MoE models. Ollama's practical local variant is deepseek-v2:16b with 160K context; the 236B variant is much larger.

deepseek-coder-v2 is the coding-specialized V2 branch.

deepseek-v2.5 merged the general ability of V2-Chat with the coding strength of Coder-V2-Instruct. It was useful in its release window but is not the first thing to reach for now.

deepseek-v3 is the 671B total / 37B active MoE flagship generation. It is huge locally on Ollama.

deepseek-r1 is the reasoning family. Ollama's current default deepseek-r1 points to the updated DeepSeek-R1-0528-Qwen3-8B distilled model, with 128K context and a much smaller local footprint. The full deepseek-r1:671b model is enormous.

deepseek-v3.1 is a hybrid model: one model can run in thinking or non-thinking mode. DeepSeek's model card lists 671B total / 37B active parameters and 128K context upstream. Ollama exposes a 160K-context local package and a cloud tag.

deepseek-v3.2 is the newest DeepSeek general model on Ollama and is cloud-only there. DeepSeek's model card describes it as a 685B-parameter model using DeepSeek Sparse Attention, scalable RL, and agentic task synthesis.

deepseek-ocr is a 3B image-text OCR/document model, not a normal chat model.
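
Image models like deepseek-ocr take base64-encoded images through the same local API as text models. A hedged sketch; the file name and prompt wording here are placeholders, so check the model card for the prompt format it actually expects:

import base64
import requests

# Ollama expects base64-encoded image data, not file paths.
with open("invoice.png", "rb") as f:   # placeholder file
    image_b64 = base64.b64encode(f.read()).decode()

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "deepseek-ocr",
        "prompt": "Extract this document as markdown.",
        "images": [image_b64],
        "stream": False,
    },
)
print(resp.json()["response"])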

Special variants

r1-1776 is a Perplexity post-trained R1 variant aimed at reducing censorship/refusal behavior. It is not the canonical DeepSeek release.

deepscaler is a fine-tuned DeepSeek-R1-Distill-Qwen-1.5B model focused on efficient math reasoning.

Practical picks

Use deepseek-v3.2:cloud for the newest and strongest DeepSeek model through Ollama Cloud.

Use deepseek-r1 for practical local reasoning on normal hardware.
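
A quick way to poke at it is the chat endpoint. Depending on your Ollama version and settings, R1's reasoning may show up inline between <think> tags in the reply or be separated out by the server, so don't hard-code either assumption. A minimal sketch:

import requests

resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "deepseek-r1",
        "messages": [{"role": "user", "content": "Is 2^31 - 1 prime? Answer briefly."}],
        "stream": False,
    },
)
# The assistant reply; reasoning text may or may not be embedded in it.
print(resp.json()["message"]["content"])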

Use deepseek-r1:32b if you have enough memory and want stronger local reasoning.

Use deepseek-ocr for document OCR and image-to-markdown extraction.

Mistral

Mistral models are attractive when you care about permissive licensing, strong multilingual behavior, function calling, and efficient models that are not tied to a single giant deployment shape.

Main generations

mistral is the original Mistral 7B line. Ollama's page says it has been updated to v0.3 and supports tools. It is old but still common.

mixtral is the older MoE family, including 8x7B and 8x22B. It matters historically because it popularized strong open-weight MoE models.

mistral-nemo is a 12B model from Mistral and NVIDIA with 128K context. It remains a useful middle-size local model.

mistral-small, mistral-small3.1, and mistral-small3.2 are the small-but-capable 22B/24B family. On Ollama, mistral-small3.2 is a 24B text+image model with 128K context, vision, and tools. Compared with Small 3.1, it specifically improves instruction following and function calling and reduces repetition errors.

ministral-3 is the newer edge-focused Mistral 3 family. Ollama lists 3B, 8B, and 14B variants, all text+image with 256K context, tools, cloud options, and Apache 2.0 licensing. Mistral's docs describe the 14B as optimized for local deployment with performance comparable to Mistral Small 3.2.

mistral-large-3 is Mistral's current frontier-scale model on Ollama. It is cloud-only there as mistral-large-3:675b-cloud, with 256K context, text+image input, native function calling, JSON output, and Apache 2.0 licensing. Mistral's docs list it as 675B total / 41B active parameters.
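
Tool calling works the same way across the Mistral tags that list tools. A hedged sketch using the local mistral-small3.2 (swap in the cloud tag if you are signed in to Ollama Cloud); the get_weather function is purely hypothetical:

import requests

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical tool, for illustration only
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "mistral-small3.2",
        "messages": [{"role": "user", "content": "What's the weather in Bergen right now?"}],
        "tools": tools,
        "stream": False,
    },
)
# If the model decides to call a tool, the call shows up here instead of plain text.
print(resp.json()["message"].get("tool_calls"))

For strict JSON answers without tools, recent Ollama versions also accept a format field on the chat and generate endpoints.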

codestral is Mistral's older code-specialized line.

mathstral is the older math/science-specialized Mistral line.

Practical picks

Use mistral-large-3:675b-cloud for best Mistral quality through Ollama Cloud.

Use ministral-3:8b or ministral-3:14b for modern local Mistral with vision, 256K context, and tool use.

Use mistral-small3.2 if you want the 24B local model and can afford the 15GB footprint.

Use mistral-nemo if you want a mature 12B text model with long context.

Llama

Llama is Meta's open-weight model family. It is widely supported and widely fine-tuned. The main catch is that the Llama license is not the same as Apache/MIT; check the license for commercial and scale restrictions.

Main generations

llama2 is old now but still appears in many fine-tunes.

llama3 improved general quality substantially and came in 8B and 70B sizes.

llama3.1 added 8B, 70B, and 405B variants and 128K context. It remains important because many third-party fine-tunes and tool-use models are based on it.

llama3.2 split into small text models and separate vision models. Ollama's llama3.2 page covers 1B and 3B text models; llama3.2-vision covers 11B and 90B image-reasoning models.

llama3.3 is a 70B text model that Meta positioned as offering performance similar to Llama 3.1 405B. Ollama lists 43GB local size, 128K context, and tools.

llama4 is the newest Llama family on Ollama. Ollama lists two MoE multimodal variants: Scout and Maverick. Scout is 109B total / 17B active with a 10M context window on Ollama; Maverick is 400B total / 17B active with a 1M context window. Both are text+image models.

Special variants

llama-guard3 is a safety classifier, not a chat model.

llama3-chatqa is NVIDIA's Llama 3-based conversational QA/RAG model.

llava-llama3 is a LLaVA vision model based on Llama 3.

Many Dolphin, Hermes, Vicuna, and other assistant models are fine-tuned from Llama bases.

Practical picks

Use llama4 if you want Meta's newest multimodal model and can run the large local package.

Use llama3.3 if you want a mature 70B text model with broad support.

Use llama3.2:3b for a small, fast, common local model.

Use llama3.2-vision if you specifically need the older Llama vision branch and do not want Llama 4.

Gemma

Gemma is Google's open model family. It is strong for local deployment because Google provides several sizes and keeps the family focused on small-to-medium efficient models.

Main generations

gemma is the original Gemma 1 family.

gemma2 improved performance and efficiency in 2B, 9B, and 27B sizes.

gemma3 added multimodal models and was, for a while, Ollama's "current, most capable model that runs on a single GPU." Ollama lists 270M, 1B, 4B, 12B, and 27B variants.

gemma3n is the efficient on-device branch for laptops, tablets, and phones.

gemma4 is the newest Gemma family on Ollama. Google's Gemma page positions Gemma 4 as its most intelligent open model family, aimed at advanced reasoning, multimodal reasoning, coding, and agentic workflows. Ollama lists E2B, E4B, 26B, and 31B variants, with text+image input, tools, thinking, audio support for the edge models, and 128K or 256K context depending on size.

Special variants

codegemma is the older coding-specialized Gemma line.

embeddinggemma is Google's small embedding model for retrieval.

functiongemma is a tiny function-calling specialist.

shieldgemma is a safety classification family.

medgemma and medgemma1.5 are medical text/image comprehension models. Do not treat them as a substitute for clinical review.

translategemma is a translation-focused Gemma 3 branch.

Practical picks

Use gemma4:e4b as a strong small local default.

Use gemma4:26b or gemma4:31b for workstation local reasoning, coding, and multimodal work.

Use embeddinggemma for lightweight RAG embeddings.

Use gemma3 only when you need compatibility with older Gemma 3 tooling or a smaller variant not covered by Gemma 4.

Phi

Phi is Microsoft's small-language-model family. The point of Phi is not to win every frontier benchmark; it is to be useful in memory- and latency-constrained settings.

Main generations

phi is Phi-2, a 2.7B model.

phi3 and phi3.5 are older small models with surprisingly good reasoning for their size.

phi4 is a 14B general model with 16K context on Ollama. Microsoft positions it for memory- and compute-constrained, latency-bound use cases.

phi4-mini is a 3.8B model with tool support on Ollama. It is the better small general Phi pick than older Phi 3 models.

phi4-reasoning is a 14B reasoning model. Ollama includes both the base reasoning model and the :plus variant. The Microsoft model card says the plus variant is further trained with RL, achieves higher accuracy, and tends to generate about 50% more tokens.

phi4-mini-reasoning is a 3.8B reasoning model with 128K context on Ollama, focused on mathematical and structured reasoning under tight memory and latency limits.

Practical picks

Use phi4-mini for small local general work with tool calling.

Use phi4-mini-reasoning for small local math/reasoning.

Use phi4-reasoning:plus for the strongest Phi reasoning behavior, accepting slower/longer outputs.

Do not use Phi as a first choice for broad multilingual or vision tasks; that is not where the family is strongest.

Granite

Granite is IBM's open enterprise-oriented family. It is less fashionable than Qwen, DeepSeek, or Llama, but useful for RAG, tool use, code, enterprise deployment, and small efficient models.

Main generations

granite3-dense, granite3.1-dense, granite3.2, and granite3.3 are 2B/8B-class text models with tool and long-context support.

granite3-moe and granite3.1-moe are low-latency MoE variants.

granite3.2-vision is a compact vision-language model for visual document understanding.

granite-code is the older code family.

granite-embedding is for vector search.

granite4 is the newest main Granite family on Ollama. IBM's Granite 4 documentation describes hybrid Mamba-2 / Transformer architecture, MoE in select models, Apache 2.0 licensing, and a focus on RAG, agents, edge deployment, and tool calling. Ollama lists 350M, 1B, and 3B variants.

Practical picks

Use granite4 when you want a small enterprise-friendly model with permissive licensing and tool-calling emphasis.

Use granite3.2-vision for lightweight document/image understanding.

Use granite-embedding for IBM-flavored RAG pipelines.

GLM / Z.ai

GLM is Z.ai's open model family. It is increasingly relevant for reasoning, coding, and agentic workflows, especially through cloud-scale models.

Main generations

glm4 is the older local multilingual model on Ollama.

glm-4.6, glm-4.7, and glm-4.7-flash are newer agentic/coding-oriented models.

glm-5 is a 744B total / 40B active MoE model on Ollama Cloud, built for reasoning, coding, and long-horizon agents.

glm-5.1 is currently the newest GLM listing surfaced by Ollama search. Ollama describes it as the next-generation flagship for agentic engineering, with stronger coding capabilities than GLM-5.

glm-ocr is the OCR/document understanding branch.

Practical picks

Use glm-5.1:cloud for GLM's newest cloud model on Ollama.

Use glm-4.7-flash if you want a lighter GLM option.

Use glm-ocr for OCR/document work.

OpenAI gpt-oss

gpt-oss is OpenAI's open-weight family on Ollama. It is different from OpenAI's hosted GPT models: these are downloadable open-weight models released under Apache 2.0.

Main variants

gpt-oss:20b is the practical local model. Ollama lists it as about 14GB with 128K context.

gpt-oss:120b is the larger local model. Ollama lists it as about 65GB with 128K context.

Both variants support thinking and tools on Ollama. OpenAI's launch post describes them as open-weight reasoning models for real-world performance at low cost, with strong tool-use capabilities.

Ollama's page notes that the models use MXFP4 quantization for MoE weights. At roughly 4 bits per MoE weight, 120B-class parameters come out around the 65GB Ollama lists, instead of the 240GB or so a 16-bit model of that size would need, which is why gpt-oss:120b can fit in a single high-memory GPU class.

Practical picks

Use gpt-oss:20b for local agentic/reasoning work if you have around 16GB+ available memory.

Use gpt-oss:120b or gpt-oss:120b-cloud when you want the stronger OpenAI open-weight model.

What I would install first

For a normal laptop:

ollama run gemma4:e4b
ollama run deepseek-r1
ollama run phi4-mini

For a workstation:

ollama run qwen3.6
ollama run gemma4:26b
ollama run deepseek-r1:32b
ollama run gpt-oss:20b

For cloud-scale experiments:

ollama run deepseek-v3.2:cloud
ollama run mistral-large-3:675b-cloud
ollama run glm-5.1:cloud
ollama run gpt-oss:120b-cloud

For documents and retrieval:

ollama run deepseek-ocr
ollama pull qwen3-embedding
ollama pull embeddinggemma
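
Once an embedding model is pulled, a bare-bones retrieval step is just embedding the documents and the query and ranking by cosine similarity. A rough sketch assuming embeddinggemma and the default local API; a real pipeline would add chunking and a proper vector store:

import requests

def embed(text: str) -> list[float]:
    # One vector per call via the embeddings endpoint.
    r = requests.post(
        "http://localhost:11434/api/embeddings",
        json={"model": "embeddinggemma", "prompt": text},
    )
    return r.json()["embedding"]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    return dot / (norm_a * norm_b)

docs = [
    "Gemma 4 edge models accept audio input.",
    "MXFP4 stores MoE weights at roughly 4 bits each.",
    "deepseek-ocr turns document images into markdown.",
]

query_vec = embed("Which model family handles audio on edge devices?")
ranked = sorted(docs, key=lambda d: cosine(embed(d), query_vec), reverse=True)
print(ranked[0])  # best-matching snippet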
