Kimi K2.6 API Key and Pricing: Official Costs, Rate Limits, and Web Search Fees

If you are about to sign up for a Kimi API key to run K2.6, the token price is only part of the picture. Caching, rate limit tiers, web-search fees, and agent-style retries all quietly shape your monthly bill. This guide walks through each one, using the numbers currently published on Moonshot's own platform pages.

Quick answer
- Kimi K2.6 uses the Moonshot OpenAI-compatible API at https://api.moonshot.ai/v1; any OpenAI SDK works as a drop-in client.
- Official K2.6 pricing on Moonshot's platform page:
- Cached input: ¥1.10 / 1M tokens
- Uncached input: ¥6.50 / 1M tokens
- Output: ¥27.00 / 1M tokens
- Context window: 262,144 tokens
- You get an API key by signing up at platform.moonshot.ai and creating one in the console.
- Built-in web search is billed at ¥0.03 per call, plus whatever tokens the search results consume on the next /chat/completions request.
- The free tier (Tier 0) allows 3 RPM, 1 concurrent request, and has a daily token ceiling. Heavier usage needs a paid top-up to move up tiers.
Everything below unpacks these numbers and the landmines around them.
How to create a Kimi API key
The flow is the same as most LLM providers:
- Go to platform.moonshot.ai and sign in (or sign up).
- Verify your account if prompted.
- Open the API keys section of the console and click Create API key.
- Copy the key immediately — it is shown once.
- Optional but recommended: set a budget cap and a balance-low alert on your account before running any workload.
Treat the key like a password: store it in an environment variable or secret manager, not in source files. If you leak it, rotate it from the same console page.
One thing worth flagging for new accounts: Moonshot operates tier-based rate limits that scale with cumulative top-up. A brand new account starts at Tier 0 with very tight limits — fine for a handful of test requests, not fine for an always-on coding agent. See the rate-limits section below before you start benchmarking.
Kimi K2.6 official pricing
The numbers currently published on Moonshot's K2.6 pricing page:
| Item | Price | Unit |
|---|---|---|
| Cached input | ¥1.10 | per 1M tokens |
| Uncached input | ¥6.50 | per 1M tokens |
| Output | ¥27.00 | per 1M tokens |
| Context window | 262,144 | tokens |
Two things to notice. First, the token prices are in RMB (¥), not USD — if you are comparing with Anthropic or OpenAI pricing, do the currency conversion yourself; do not eyeball "¥6.50" as "$6.50." Second, cached input is roughly 6× cheaper than uncached input. That single line item dominates the economics of long-context and agent workloads.
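To make the table concrete, here is a minimal cost estimator using the RMB prices above. The function name and the example token counts are illustrative, not part of any SDK:

```python
# Rough per-request cost estimator from the published K2.6 price table.
# Prices are CNY per 1M tokens.
PRICE_CNY_PER_M = {
    "cached_input": 1.10,
    "uncached_input": 6.50,
    "output": 27.00,
}

def estimate_cost_cny(cached_in: int, uncached_in: int, out: int) -> float:
    """Return the estimated cost in CNY for the given token counts."""
    return (
        cached_in * PRICE_CNY_PER_M["cached_input"]
        + uncached_in * PRICE_CNY_PER_M["uncached_input"]
        + out * PRICE_CNY_PER_M["output"]
    ) / 1_000_000

# Example: a 90K-token cached prefix, 10K tokens of fresh prompt,
# and 2K generated tokens costs about ¥0.218.
print(f"¥{estimate_cost_cny(90_000, 10_000, 2_000):.4f}")
```

Run your own expected token mix through a calculation like this before committing; the output rate dominates as soon as generation gets long.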
What "cached input" vs "uncached input" means
Moonshot, like most frontier providers, implements context caching: when parts of your prompt have been seen recently, the server skips recomputing the prefix and charges a much lower rate for those tokens.
Concretely:
- Cache hit (cached input) — a prefix you already sent (system prompt, prior turns of the conversation, large document context) matches what is cached server-side. You pay the cached rate.
- Cache miss (uncached input) — new prompt content, a different ordering, or a prefix that has aged out of cache. You pay the full uncached rate.
Why this matters for real workflows:
- Long-context RAG — if you stuff a 100K-token knowledge base into the system prompt and reuse it across requests, caching turns a painful bill into a cheap one.
- Agent loops — each step in a tool-using agent typically re-sends the system prompt, tool schemas, and the running conversation. Without caching, every step pays uncached rates. With caching, only the newly appended tool result and assistant turn cost full price.
- Identical prompts, different users — if two users hit your service with the same system prompt, the second one benefits from caching.
The practical implication: design your prompts so the stable, reusable parts (instructions, long documents, tool definitions) come first, and the user-specific, changing parts come last. That maximizes cache hit rate and can cut input costs by a factor of five or more.
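The stable-first layout can be sketched as a small message builder. `build_messages` and its arguments are illustrative names, not part of any SDK:

```python
def build_messages(system_prompt: str, tools_text: str, document: str, user_query: str) -> list:
    """Order messages so the stable prefix comes first and the changing
    suffix comes last, maximizing prefix-cache hits across requests."""
    return [
        # Stable across all requests: eligible for the cheap cached-input rate
        {
            "role": "system",
            "content": system_prompt + "\n\n" + tools_text + "\n\n" + document,
        },
        # Changes every request: this suffix pays the full uncached rate
        {"role": "user", "content": user_query},
    ]
```

The key property is that the byte-for-byte identical prefix is reused verbatim; even swapping the order of two stable sections between requests turns cache hits into cache misses.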
OpenAI-compatible request format
Moonshot's API is OpenAI-compatible, which means any OpenAI SDK works with a new base URL and API key.
curl

```bash
curl https://api.moonshot.ai/v1/chat/completions \
  -H "Authorization: Bearer $MOONSHOT_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "kimi-k2.6",
    "messages": [
      {"role": "user", "content": "Explain caching in one paragraph."}
    ]
  }'
```
Python (OpenAI SDK)

```python
import os

from openai import OpenAI

client = OpenAI(
    api_key=os.environ["MOONSHOT_API_KEY"],  # keep the key out of source files
    base_url="https://api.moonshot.ai/v1",
)

response = client.chat.completions.create(
    model="kimi-k2.6",
    messages=[
        {"role": "user", "content": "Write a Python function to debounce calls."}
    ],
)
print(response.choices[0].message.content)
```
Thinking vs Instant mode
K2.6 defaults to Thinking mode. To force Instant (no reasoning tokens), pass:
```python
response = client.chat.completions.create(
    model="kimi-k2.6",
    messages=[...],  # your conversation
    extra_body={"thinking": {"type": "disabled"}},
)
```
Thinking mode generates reasoning tokens you pay for as output. If you do not need it, disabling it is a cheap win.
Multimodal input
K2.6 is natively multimodal — text, image, and video input. Images are straightforward via the standard OpenAI image_url content part. Video input is supported on the official API (Moonshot flags it as experimental for third-party deployments), so test it end-to-end if your product depends on it.
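For image input, the standard OpenAI `image_url` content part applies. A minimal payload sketch, with a placeholder image URL; pass this list as `messages` to `client.chat.completions.create`:

```python
# Multimodal request payload in the standard OpenAI vision format.
# The image URL below is a placeholder; substitute your own.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this screenshot."},
            {
                "type": "image_url",
                "image_url": {"url": "https://example.com/screenshot.png"},
            },
        ],
    }
]
```

Note that every image you attach is tokenized and billed as input, so the same cache-layout advice applies if you resend the same image across turns.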
Rate limits and account tiers
Moonshot applies per-account tier rate limits. The progression is based on cumulative top-up amount — not your current balance, the total you have ever added.
Representative shape of the tier ladder currently published:
| Tier | Cumulative top-up | Concurrency | RPM | TPM | TPD |
|---|---|---|---|---|---|
| Tier 0 | ¥0 | 1 | 3 | 500,000 | 1,500,000 |
| Tier 1 | ¥50 | higher | higher | higher | higher |
| … | … | … | … | … | … |
Exact numbers for Tier 1 and above change over time — check the limits page on the platform before sizing a workload. A few guidelines:
- Tier 0 is fine for validation. You can write the integration, run a handful of test calls, confirm the OpenAI SDK works — all inside the free tier.
- Tier 0 is not fine for coding agents. Three requests per minute and a single concurrent request will bottleneck any real agent loop. You will spend more time being rate-limited than getting work done.
- Commit early to get throughput. The cheapest way to unblock a real workload is usually a small top-up to reach Tier 1, not trying to optimize around Tier 0 limits.
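Until you clear Tier 0, your client should expect 429 responses. A minimal backoff sketch; `make_request` is any zero-argument callable, and the `RuntimeError` stand-in should be swapped for your SDK's rate-limit exception (for the OpenAI SDK, `openai.RateLimitError`):

```python
import random
import time

def call_with_backoff(make_request, max_retries: int = 5):
    """Retry a rate-limited call with exponential backoff plus jitter.

    RuntimeError here is a stand-in for a 429 / rate-limit error;
    replace it with the real exception class from your SDK.
    """
    for attempt in range(max_retries):
        try:
            return make_request()
        except RuntimeError:
            if attempt == max_retries - 1:
                raise  # out of retries; surface the error
            # Exponential backoff capped at 30s, with jitter to avoid thundering herd
            time.sleep(min(2 ** attempt, 30) + random.random())
```

At Tier 0's 3 RPM, though, no amount of retry logic makes an agent loop viable; backoff keeps your tests clean, not your throughput high.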
Extra costs people miss
The per-token table is not the whole story. Three categories of cost quietly show up in production.
Built-in web search. Moonshot offers a $web_search tool that the model can call during a generation. Each invocation is billed at ¥0.03 per call. That sounds trivial, but search result content then gets inserted into the next /chat/completions request as additional input tokens — and those tokens are billed at the normal input rate. A chatty agent that searches ten times per user turn is paying ten search fees and ten chunks of input tokens.
Reasoning tokens. In Thinking mode, the model generates internal reasoning tokens that count as output. On simple questions this is fine. On an agent that calls tools in a loop, the accumulated reasoning across 50 tool calls can easily be your largest line item. If the task does not need it, turn Thinking off.
Agent retries and long-horizon loops. Moonshot's own materials highlight K2.6 executing 4,000+ tool calls over 12 hours. That is impressive capability — and a very real bill. Long-horizon agent demos are genuinely useful, but they are also the fastest way to burn ¥10,000 without noticing. Always cap max steps and max tokens when running agent workflows.
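A minimal way to enforce both caps, assuming a step-based agent loop; `run_agent` and `step` are illustrative names, not an SDK API:

```python
import time

def run_agent(step, max_steps: int = 50, max_seconds: float = 600.0) -> list:
    """Run an agent loop with hard caps on step count and wall-clock time.

    `step` is a zero-argument callable that returns a result, or None when
    the agent decides it is finished. Both cap values are assumptions;
    tune them to your workload and budget.
    """
    deadline = time.monotonic() + max_seconds
    results = []
    for _ in range(max_steps):
        if time.monotonic() > deadline:
            break  # wall-clock budget exhausted
        result = step()
        if result is None:
            break  # agent signalled completion
        results.append(result)
    return results
```

Pair this with a `max_tokens` limit on each individual request so a single runaway generation cannot blow the budget either.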
Cache-miss patterns. Reordering your prompt, changing your system message frequently, or serving many unique users with unique context all hurt cache hit rates. If you see your "input" line item looking bigger than expected, caching is usually the reason.
Is Kimi K2.6 free?
There are three different "free" questions, and they have three different answers:
Using Kimi in the browser at kimi.com. Moonshot's consumer products typically include a free tier with daily usage caps. That is not the API — conversations there do not spend API credits.
Using the Kimi K2.6 API without paying. The Tier 0 free limits let you make a small number of calls without topping up. That is enough for integration testing, not for any sustained workload. Beyond Tier 0, API usage is paid.
Using Kimi K2.6 via Ollama cloud, OpenRouter, or similar. Those are separate billing systems with their own free credits and pricing. They are not "the Kimi API" even though they route to the same model.
So: there is a free way to try it, but there is no free way to run a production workload on K2.6 through the official API.
How to control Kimi API cost
A short checklist before you scale up:
- Set a hard budget cap in the console. Your future self will thank you.
- Enable balance-low alerts so you find out about unexpected spend before the credit card does.
- Always pass max_tokens on output, especially in agent loops where the model could otherwise talk forever.
- Put stable context first, user-specific content last, to maximize cache hits.
- Disable Thinking mode for tasks that don't need it.
- Gate $web_search behind explicit intent; do not let every prompt trigger it.
- Bound agent loops with a max-step counter and a wall-clock timeout.
- Log per-request input vs output vs cached-input tokens so you can see where cost is actually going.
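For the last item, a sketch of per-request usage logging. `prompt_tokens` and `completion_tokens` are standard OpenAI usage fields; the cached-token field name varies by provider, so `cached_tokens` below is an assumed name to verify against Moonshot's actual response schema:

```python
def log_usage(response) -> dict:
    """Extract token accounting from an OpenAI-style response (object or dict).

    The `cached_tokens` field name is an assumption; check the provider's
    response schema for the real name and fall back to 0 if absent.
    """
    usage = response["usage"] if isinstance(response, dict) else response.usage
    get = usage.get if isinstance(usage, dict) else (
        lambda key, default=None: getattr(usage, key, default)
    )
    record = {
        "input_tokens": get("prompt_tokens", 0),
        "output_tokens": get("completion_tokens", 0),
        "cached_tokens": get("cached_tokens", 0) or 0,
    }
    print(record)  # replace with your metrics pipeline
    return record
```

Once this is wired in, a day of production traffic tells you your real cache hit rate, which is the number that decides whether K2.6's pricing works for you.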
Final recommendation
If you are evaluating Kimi K2.6 for a coding agent or long-context workflow, the cost structure is workable but not automatic. The headline token prices are competitive, and the cached-input rate is excellent — but only if you structure your prompts to hit the cache. For short, stateless calls without caching, K2.6 is not the cheapest option, and the output rate in particular (¥27.00 / 1M) will dominate any cost model that involves lots of generated code.
For most teams, the right starting point is: top up just enough to clear Tier 0, build your integration, measure your actual cache hit rate and token distribution in production, and only then decide whether K2.6 is the right ongoing choice — or whether something with a different pricing shape fits your workflow better.
FAQ
How do you get a Kimi API key?
Sign in at platform.moonshot.ai, open the API keys section, and create a new key. Copy it immediately; it is only shown once. Set a budget cap at the same time.
How much does Kimi K2.6 cost?
On the official pricing page, cached input is ¥1.10 per 1M tokens, uncached input is ¥6.50 per 1M tokens, output is ¥27.00 per 1M tokens, and the context window is 262,144 tokens. Prices are in RMB.
Is Kimi K2.6 free to use?
The Tier 0 free tier allows a small number of calls (3 RPM, 1 concurrent) with a daily token ceiling: enough for testing, not for production. The consumer product at kimi.com has its own free tier separate from API billing.
Does Kimi API support OpenAI SDKs?
Yes. The Kimi API is OpenAI-compatible. Point any OpenAI SDK at https://api.moonshot.ai/v1 with your Moonshot key and set model to kimi-k2.6.
What are the Kimi API rate limits?
Limits are tier-based and scale with cumulative top-up. Tier 0 (¥0) is 3 RPM and 1 concurrent request with a daily token cap. Tier 1 starts at ¥50 cumulative top-up with substantially higher limits. Higher tiers require larger cumulative top-ups.
How much does Kimi web search cost?
The built-in $web_search tool is billed at ¥0.03 per call. Search result content is then added to the next chat completion request and billed at the normal input token rate.
Can I use Kimi K2.6 with tools and function calling?
Yes. K2.6 supports tool use and function calling in the same style as OpenAI. Note one constraint from Moonshot's docs: when Thinking mode is enabled, tool_choice should be auto or none, and you must preserve the assistant's reasoning_content across tool-calling turns.
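A sketch of the turn-assembly side of that constraint. The helper name and field layout follow the standard OpenAI chat format; the `reasoning_content` requirement is the one Moonshot documents, so verify the exact schema against their docs:

```python
def append_tool_turn(messages: list, assistant_message: dict, tool_outputs: dict) -> list:
    """Append an assistant tool-call turn and its tool results to `messages`.

    Keeps the assistant message intact, including `reasoning_content`
    (which Moonshot requires you to preserve in Thinking mode), then adds
    one `tool` message per tool-call id. Field names follow the OpenAI
    chat format; confirm against Moonshot's docs before relying on them.
    """
    messages.append(assistant_message)  # do NOT strip reasoning_content
    for tool_call_id, output in tool_outputs.items():
        messages.append({
            "role": "tool",
            "tool_call_id": tool_call_id,
            "content": output,
        })
    return messages
```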
Related guides
Continue through the guide cluster with the next page that matches your current decision.

Kimi K2.6 Review: Benchmarks, Pricing, API, and Whether It Is Worth Using
Kimi K2.6 arrived on April 20, 2026 as an open-weight agentic coding model with 256K context, native vision and video input, and an aggressive agent-swarm story. This review breaks down what's real, what's marketing, and who should actually switch.

Kimi K2.6 on Hugging Face: Model Card, Deployment, and Recommended Inference Engines
Everything developers need from the moonshotai/Kimi-K2.6 model card: what the weights actually include, how to deploy with vLLM or SGLang, and how to decide between self-hosting and the official API.

Kimi K2.6 vs GLM-5.1: Benchmarks, Context Window, Pricing, and Which Model Fits Better
Two of 2026's strongest open-weight models from China, released two weeks apart, aimed at similar long-horizon coding workloads — but with real differences in modality, context, and pricing shape. Here is how to pick between them.
