Kimi K2.6 on Hugging Face: Model Card, Deployment, and Recommended Inference Engines

Moonshot AI publishes the official Kimi K2.6 weights on Hugging Face at moonshotai/Kimi-K2.6 under a Modified MIT license. This is the definitive place to get the real model — not a reupload, not a quantized fork, not a cloud proxy. If you are planning to self-host K2.6, evaluate its capabilities from primary sources, or just want to read the spec before committing, the Hugging Face repo is the right starting point.

This guide walks through what the model card actually contains, what the architecture numbers mean for your deployment, which inference engines Moonshot recommends, and how to decide between self-hosting and just using the official API.

[Figure: Kimi K2.6 Hugging Face deployment illustration showing model shards, GPU servers, and inference-engine logos]

Quick answer

  • Official repo: huggingface.co/moonshotai/Kimi-K2.6.
  • Architecture: Mixture-of-Experts, ~1T total parameters, ~32B activated per token.
  • Context window: 256K (262,144 tokens on the official API pricing page).
  • Modalities: text, image, and video input, via the MoonViT 400M-parameter vision encoder.
  • Recommended inference engines: vLLM, SGLang, and KTransformers.
  • License: Modified MIT — permissive for most use, with a visible-attribution clause for very large deployments.
  • Thinking mode is on by default. Deployments need the --reasoning-parser kimi_k2 flag for correct behavior.

What the official Hugging Face page includes

The moonshotai/Kimi-K2.6 repo is structured like Moonshot's prior K2-series releases. You get:

  • A model card with the canonical description of what K2.6 is, the key capability claims, and the architecture summary.
  • Evaluation results — the same benchmark tables Moonshot publishes in their blog, rendered inline on the model card.
  • A deployment guide under docs/deploy_guidance.md with vLLM, SGLang, and KTransformers examples.
  • Usage examples in Python covering thinking vs instant mode, image input, video input, tool calling, and reasoning_content preservation across agent turns.
  • The safetensors weight shards plus tokenizer and config files.
  • A figures/ directory with the assets referenced from the model card (including a demo video used in the multimodal example).

If you've worked with K2.5 on Hugging Face, the layout will be immediately familiar. Moonshot deliberately keeps things consistent — same deployment commands, same environment assumptions, same tool-call parser — so that infra built for K2.5 carries over to K2.6 with a weight swap.

Kimi K2.6 model summary

The architecture numbers that matter most when you are sizing hardware or reasoning about latency:

| Spec | Value |
| --- | --- |
| Architecture | Mixture-of-Experts (MoE) |
| Total parameters | ~1 trillion |
| Activated parameters per token | ~32 billion |
| Experts | 384 routed, 8 active + 1 shared per token |
| Layers | 61 |
| Context window | 256K tokens |
| Vision encoder | MoonViT, 400M parameters |
| Attention | Multi-head Latent Attention (MLA) |
| Activation | SwiGLU |

A few things worth noticing about this configuration:

Total vs active parameters are different numbers and both matter. The 1T figure determines how much GPU memory you need to hold the model. The 32B figure determines how much compute each token costs. If you hear "1T parameter model" and picture a thousand H100s, you are thinking of the storage footprint — the per-token inference cost is closer to a 32B dense model.
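The storage-vs-compute split can be made concrete with some back-of-envelope arithmetic. The numbers below are rough estimates for intuition only; the FP8 (1 byte per parameter) storage assumption is ours, not a figure from the model card:

```python
# Back-of-envelope sizing for a ~1T-total / ~32B-active MoE.
# FP8 storage (1 byte/param) is an assumption; adjust for your precision.

TOTAL_PARAMS = 1.0e12    # ~1T total: sets the memory footprint
ACTIVE_PARAMS = 32e9     # ~32B active per token: sets the per-token compute

def weight_memory_gb(params: float, bytes_per_param: float) -> float:
    """GB needed just to hold the weights (ignores KV cache and activations)."""
    return params * bytes_per_param / 1e9

def flops_per_token(active_params: float) -> float:
    """Rough forward-pass cost: ~2 FLOPs per active parameter per token."""
    return 2.0 * active_params

print(weight_memory_gb(TOTAL_PARAMS, 1.0))   # 1000.0 GB at FP8 -> 8x H200 (141 GB each) territory
print(flops_per_token(ACTIVE_PARAMS) / 1e9)  # 64.0 GFLOPs/token, i.e. roughly a 32B dense model
```

This is why the per-token latency of a well-served 1T MoE can look nothing like a 1T dense model.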

MLA attention is a deliberate KV-cache choice. It compresses keys and values into a lower-dimensional latent space, cutting KV-cache memory substantially at long context. This is a big part of why 256K is actually usable in practice rather than nominal.
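A sketch of why this matters at 256K context. The per-token cache widths below are hypothetical stand-ins (Moonshot does not publish them on the card); only the 61 layers and 262,144-token context come from the spec table above:

```python
# KV-cache sizing sketch: plain multi-head caching vs a compressed latent.
# mha_dims and mla_dims are ASSUMED widths for illustration, not K2.6 specs.

CONTEXT = 262_144   # 256K context from the spec table
LAYERS = 61         # layer count from the spec table
BYTES = 2           # BF16 cache values

def kv_cache_gb(tokens: int, layers: int, dims_per_token: int, bytes_per_val: int) -> float:
    """Total KV-cache size for one sequence, in GB."""
    return tokens * layers * dims_per_token * bytes_per_val / 1e9

mha_dims = 2 * 64 * 128   # plain MHA: K and V for 64 heads of dim 128 (assumed)
mla_dims = 576            # MLA: one compressed latent vector per token (assumed)

print(kv_cache_gb(CONTEXT, LAYERS, mha_dims, BYTES))  # ~524 GB per sequence: impractical
print(kv_cache_gb(CONTEXT, LAYERS, mla_dims, BYTES))  # ~18 GB per sequence: feasible
```

Even with made-up widths, the order-of-magnitude gap is the point: latent compression is what turns a nominal 256K window into one you can actually batch.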

384 experts with 8+1 active per token is sparse routing. Inference engines need to support the specific routing pattern — which is why Moonshot recommends engines that have explicit K2 integration rather than generic MoE support.
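The gating mechanism itself is simple to sketch. This is a generic top-k softmax router over 384 experts, not Moonshot's actual routing code, which may normalize or bias gates differently:

```python
# Generic top-k expert routing sketch (illustrative, not K2.6's exact router).
import math
import random

N_EXPERTS, TOP_K = 384, 8  # routed experts and active count from the spec table

def route(logits: list[float]) -> list[tuple[int, float]]:
    """Pick the TOP_K highest-scoring experts, softmax over just those scores."""
    topk = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:TOP_K]
    m = max(logits[i] for i in topk)                 # subtract max for stability
    exps = [math.exp(logits[i] - m) for i in topk]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(topk, exps)]

random.seed(0)
token_logits = [random.gauss(0.0, 1.0) for _ in range(N_EXPERTS)]
selected = route(token_logits)
# 8 (expert_index, gate_weight) pairs; the shared expert runs unconditionally on top.
```

An engine has to dispatch each token's hidden state to exactly these experts and combine the outputs by gate weight, which is the part generic MoE support often gets wrong for a specific checkpoint.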

The MoonViT encoder is native. K2.6 was not bolted onto a vision model after the fact. Vision and language were trained together, which is why screenshot-to-code and vision-guided tool use work without a separate preprocessing pipeline.

What the benchmark section says

The model card includes Moonshot's full evaluation tables. The highlights, in rough categories:

Coding: SWE-Bench Pro 58.6, SWE-Bench Verified 80.2, SWE-Bench Multilingual 76.7, LiveCodeBench v6 89.6, Terminal-Bench 2.0 (Terminus-2 harness) 66.7.

Agentic / tool use: Humanity's Last Exam with tools 54.0, BrowseComp 83.2, DeepSearchQA F1 92.5, Toolathlon 50.0.

Vision: Charxiv with Python 86.7, Math Vision with Python 93.2, V* 96.9.

Two important caveats the model card itself flags:

  1. These are self-reported, evaluated on Moonshot's chosen harnesses. For SWE-Bench in particular, Moonshot notes they used an internally developed evaluation framework with a minimal set of tools (bash, createfile, insert, view, strreplace, submit) and tailored system prompts. Different harnesses will produce different numbers.
  2. Terminal-Bench 2.0 was evaluated in non-thinking mode, because Moonshot's current context management strategy for thinking mode is incompatible with the Terminus-2 framework. This is the kind of detail that only appears in the model card, not in the blog post — and it matters if you plan to reproduce the number.

Recommended deployment engines

Moonshot's deploy guide explicitly recommends three engines. They are not the only ones that can run K2.6, but they are the ones with first-class K2 support and the flags Moonshot has verified.

vLLM

vLLM is the most widely adopted LLM serving engine, with PagedAttention, continuous batching, and an OpenAI-compatible API out of the box.

A verified single-node, 8× H200 serving command from Moonshot's deploy guide looks like this:

vllm serve $MODEL_PATH -tp 8 \
  --mm-encoder-tp-mode data \
  --trust-remote-code \
  --tool-call-parser kimi_k2 \
  --reasoning-parser kimi_k2

Key flags to understand:

  • --tool-call-parser kimi_k2 — required to enable tool calling correctly.
  • --reasoning-parser kimi_k2 — required because K2.6 has thinking mode on by default; without this flag, the reasoning content is not parsed and tool calls can misbehave.
  • --mm-encoder-tp-mode data — data-parallel placement of the vision encoder for better throughput.
  • -tp 8 — tensor parallelism across 8 GPUs.
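Once the server is up, it speaks the OpenAI-compatible chat API. A minimal stdlib client sketch, assuming vLLM's default port and that the served model name matches the repo id (both depend on your actual `vllm serve` invocation):

```python
# Minimal client against a local vLLM/SGLang OpenAI-compatible endpoint.
# BASE_URL and MODEL are assumptions -- match them to your serve command.
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # vLLM's default port (assumption)
MODEL = "moonshotai/Kimi-K2.6"         # typically whatever you passed as $MODEL_PATH

def build_chat_request(prompt: str) -> dict:
    """OpenAI-compatible chat body with the model card's thinking-mode sampling."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 1.0,  # model-card recommendation for thinking mode
        "top_p": 0.95,
    }

def chat(prompt: str) -> dict:
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(build_chat_request(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

With the `kimi_k2` parsers enabled, responses carry the reasoning in a separate field rather than inline in `content`.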

Moonshot's notes call out vLLM 0.19.1 as the manually verified stable version for K2-series models. Nightly wheels work for the newest features but are explicitly flagged as experimental.

SGLang

SGLang is the engine to reach for when your workload is structured generation, JSON output, tool-calling chains, or multi-turn conversation with prefix reuse. RadixAttention caches KV states across turns, which is a real win for agent workflows.

A single-node TP8 serving command:

python -m sglang.launch_server \
  --model-path $MODEL_PATH \
  --tp 8 \
  --trust-remote-code \
  --tool-call-parser kimi_k2 \
  --reasoning-parser kimi_k2

The same kimi_k2 parser flags apply. For very new features, Moonshot suggests installing SGLang from source:

pip install "sglang @ git+https://github.com/sgl-project/sglang.git#subdirectory=python"
pip install nvidia-cudnn-cu12==9.16.0.29

KTransformers

KTransformers is Moonshot's own inference engine, purpose-built for the K2 model family. Unlike vLLM and SGLang, which are general-purpose, KTransformers is optimized specifically for K2's MoE routing, MLA attention, and expert placement — including CPU offloading that can fit K2 onto more modest hardware.

For teams running the K2 family in production at scale, KTransformers typically gives the best throughput per dollar — at the cost of being less general-purpose and less well-documented for non-K2 models.

Why not just pick any MoE engine?

K2.6 uses a specific expert routing scheme (384 routed, 8+1 active), a custom tool-call format, a thinking-mode reasoning parser, and a vision encoder attached to the text model. Engines without K2-specific support will either refuse to load the model, load it but produce garbage for tool calls, or produce correct text but drop reasoning content. The three engines Moonshot lists are the ones where all of these pieces are wired up correctly.

Official API vs self-deployment

Hugging Face weights give you control. The official API gives you speed to production. Choose based on workload.

Use the official Moonshot API when:

  • You are in validation or early production and want zero infrastructure work.
  • Your monthly token volume is below the break-even point for dedicated GPUs.
  • You need video input on day one — video is flagged experimental on third-party engines and fully supported only on Moonshot's own API.
  • You want first-party behavior guarantees and direct vendor support.

Self-host from Hugging Face when:

  • You need air-gapped or on-premises deployment for compliance reasons.
  • Your monthly token volume is large enough that amortizing dedicated H200s or similar hardware beats the API bill (as a rule of thumb, once you pass tens of billions of tokens per month).
  • You want to customize the inference engine — quantization, batching policy, expert placement, multi-model routing.
  • You want a predictable fixed cost instead of variable per-token billing.
  • You are building a research artifact or an open-source project that must not depend on a third-party API.

For most teams doing an initial evaluation, the right move is: prototype on the API, measure your real token mix and latency needs, then decide whether self-hosting is worth the infrastructure investment.
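The break-even calculation is simple enough to sketch. Every number below is a placeholder, not a real Moonshot price or GPU quote; substitute your actual blended API rate and amortized node cost:

```python
# Break-even sketch with PLACEHOLDER prices -- substitute your real API rate
# and your real all-in GPU cost before drawing any conclusions.

API_PRICE_PER_M_TOKENS = 2.00     # hypothetical blended $/1M tokens
GPU_NODE_MONTHLY_COST = 60_000.0  # hypothetical 8x H200 node, $/month all-in

def breakeven_tokens_per_month(api_price_per_m: float, node_cost: float) -> float:
    """Monthly token volume at which a dedicated node matches the API bill."""
    return node_cost / api_price_per_m * 1e6

# 60_000 / 2.00 * 1e6 = 30e9 -> ~30B tokens/month at these placeholder rates,
# consistent with the "tens of billions of tokens" rule of thumb above.
```

Remember that self-hosting also carries costs this arithmetic ignores: on-call, engine upgrades, and underutilized capacity off-peak.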

What to check before deploying

A short pre-flight checklist that will save you time:

Versions. vLLM 0.19.1 is Moonshot's manually verified stable version for the K2 series; keep SGLang on a recent release. transformers>=4.57.1 is commonly required. Pin your versions rather than letting them float.

Hardware. A single-node deployment typically assumes 8× H200 or equivalent for the full-precision weights. INT4 quantization can bring this down to 4× H100 class hardware. CPU offloading via KTransformers is possible but much slower.

Thinking mode. It is on by default. If your application does not want reasoning tokens in every response, explicitly disable it with extra_body={"thinking": {"type": "disabled"}} on the official API, or the equivalent chat_template_kwargs for vLLM/SGLang.
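A small helper sketch for toggling this per request. The `extra_body` payload shape follows the example above; the `chat_template_kwargs` spelling for self-hosted vLLM/SGLang is an assumption, so verify it against the repo's usage examples:

```python
def with_thinking_disabled(body: dict, self_hosted: bool = False) -> dict:
    """Return a copy of an OpenAI-style chat body with thinking mode off."""
    out = dict(body)
    if self_hosted:
        # vLLM/SGLang: template options go through chat_template_kwargs.
        # Exact key shape is an ASSUMPTION -- check the repo's usage examples.
        out["chat_template_kwargs"] = {"thinking": False}
    else:
        # Official Moonshot API: the documented extra_body payload.
        out["thinking"] = {"type": "disabled"}
    # Model card: instant mode prefers temperature 0.6 (top_p stays 0.95).
    out["temperature"] = 0.6
    return out
```

Keeping the toggle in one place like this also makes it easy to flip modes during evaluation without touching call sites.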

Tool calling + thinking interaction. When thinking is enabled and you are using tools, tool_choice must be auto or none. And across multi-turn tool calls, you must preserve the assistant message's reasoning_content in the conversation history — otherwise you will get errors.
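In code, the easy mistake is rebuilding the assistant message from `content` alone when appending it back to history. A minimal sketch of a history helper that keeps the field intact (message shapes follow the OpenAI chat format; the exact error behavior is per the note above):

```python
def append_assistant_turn(history: list, message: dict) -> list:
    """Append an assistant message WITHOUT dropping reasoning_content --
    omitting it across multi-turn tool calls triggers request errors."""
    turn = {"role": "assistant", "content": message.get("content", "")}
    if "reasoning_content" in message:
        turn["reasoning_content"] = message["reasoning_content"]
    if message.get("tool_calls"):
        turn["tool_calls"] = message["tool_calls"]
    history.append(turn)
    return history

# Typical loop shape: model replies with tool_calls -> run the tool -> append
# the assistant turn (reasoning included) plus a tool-result message -> call again.
```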

Multimodal limits. Recommended max image resolution is ~4K (4096×2160). Recommended max video resolution is ~2K (2048×1080). Higher resolutions only increase processing time without improving understanding. Very large videos should use the file-upload flow rather than inline base64.
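Since upscaling never helps, it is worth downscaling client-side before encoding. A small helper that computes the scale factor against the recommended caps (the caps come from the checklist above; the helper itself is ours):

```python
MAX_IMAGE = (4096, 2160)  # recommended image ceiling (~4K) from the checklist
MAX_VIDEO = (2048, 1080)  # recommended video ceiling (~2K)

def downscale_factor(width: int, height: int, limit: tuple = MAX_IMAGE) -> float:
    """Scale factor to apply to both dimensions so media fits the recommended cap.
    Returns 1.0 when the media already fits; never upscales."""
    return min(1.0, limit[0] / width, limit[1] / height)
```

For example, an 8192x4320 screenshot gets factor 0.5 (down to exactly 4096x2160), while a 1080p frame passes through untouched.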

Built-in web search + thinking. The official $web_search tool is currently incompatible with thinking mode on K2.6 and K2.5. Disable thinking if you want to use the builtin search tool.

Temperature and top_p. The model card recommends temperature 1.0 for thinking mode, 0.6 for instant mode, and top_p 0.95 in both.

Final recommendation

The Hugging Face model card is the single best technical document on Kimi K2.6 — everything that actually determines whether your deployment works lives in the deploy guide and usage examples, not in the marketing blog. For developers doing a serious evaluation, the read-order is: model card (understand capabilities), docs/deploy_guidance.md (get a serving command that works), and the usage examples (wire up thinking-mode and tool-calling correctly).

If you are planning to self-host, expect to pin vLLM or SGLang to a specific version, run with the K2-specific tool and reasoning parsers, and budget for 8× H200-class GPUs at full precision. If you are not ready for that infrastructure commitment, start with the official Moonshot API (see our API and pricing guide) and move to self-hosting only once your token volume justifies it.

FAQ

Is Kimi K2.6 on Hugging Face official? Yes — moonshotai/Kimi-K2.6 is the official Moonshot AI organization and the canonical source of K2.6 weights. Any other reupload or community fork (including GGUF quantizations) is derivative.

How many parameters does Kimi K2.6 have? Approximately 1 trillion total parameters with 32 billion activated per token, via a Mixture-of-Experts architecture with 384 routed experts.

What is the context length of Kimi K2.6? 256K tokens per the model card (262,144 tokens exactly, as listed on the Moonshot API pricing page).

Which inference engines are recommended for Kimi K2.6? Moonshot's official deploy guide recommends vLLM, SGLang, and KTransformers. Each has K2-specific tool-call and reasoning parsers that are required for correct behavior.

Does Kimi K2.6 support video input when self-hosted? Yes, the weights support video input — but Moonshot flags chat with video content as an experimental feature that is fully supported only on their official API for now. Test your specific pipeline if video is critical.

Should you use the Kimi API or self-host from Hugging Face? Use the official Moonshot API for validation, small workloads, and production where video input matters. Self-host from Hugging Face when you need air-gapped deployment, have large sustained token volume, or require full control over the inference engine. For most teams, prototype on the API and only move to self-hosting once token volume justifies the infrastructure cost.

What license is Kimi K2.6 released under? A Modified MIT license. It permits most commercial use, with a visible-attribution clause that applies to very large deployments (roughly above 100M monthly active users or $20M monthly revenue). For the vast majority of teams, the license is effectively permissive.
