Kimi K2.6 API Key and Pricing: Official Costs, Rate Limits, and Web Search Fees

8 min read · kimi k2.6 · kimi api · api pricing · llm pricing · moonshot ai

If you are about to sign up for a Kimi API key to run K2.6, the token price is only part of the picture. Caching, rate limit tiers, web-search fees, and agent-style retries all quietly shape your monthly bill. This guide walks through each one, using the numbers currently published on Moonshot's own platform pages.


Quick answer

  • Kimi K2.6 uses the Moonshot OpenAI-compatible API at https://api.moonshot.ai/v1 — any OpenAI SDK works as a drop-in client.
  • Official K2.6 pricing on Moonshot's platform page:
    • Cached input: ¥1.10 / 1M tokens
    • Uncached input: ¥6.50 / 1M tokens
    • Output: ¥27.00 / 1M tokens
    • Context window: 262,144 tokens
  • You get an API key by signing up at platform.moonshot.ai and creating one in the console.
  • Built-in web search is billed at ¥0.03 per call, plus whatever tokens the search results consume on the next /chat/completions request.
  • The free tier (Tier 0) allows 3 RPM, 1 concurrent request, and has a daily token ceiling. Heavier usage needs a paid top-up to move up tiers.

Everything below unpacks these numbers and the landmines around them.

How to create a Kimi API key

The flow is the same as most LLM providers:

  1. Go to platform.moonshot.ai and sign in (or sign up).
  2. Verify your account if prompted.
  3. Open the API keys section of the console and click Create API key.
  4. Copy the key immediately — it is shown once.
  5. Optional but recommended: set a budget cap and a balance-low alert on your account before running any workload.

Treat the key like a password: store it in an environment variable or secret manager, not in source files. If you leak it, rotate it from the same console page.
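A minimal sketch of that pattern in Python — the variable name `MOONSHOT_API_KEY` matches the curl example later in this guide, but any name works:

```python
import os

def load_api_key(var_name: str = "MOONSHOT_API_KEY") -> str:
    """Read the API key from the environment instead of source files."""
    key = os.environ.get(var_name)
    if not key:
        raise RuntimeError(
            f"{var_name} is not set. Export it from your shell or secret "
            "manager rather than committing it to the repository."
        )
    return key
```

If the key ever lands in a commit, rotate it from the console — deleting it from the repository history is not enough.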

One thing worth flagging for new accounts: Moonshot operates tier-based rate limits that scale with cumulative top-up. A brand new account starts at Tier 0 with very tight limits — fine for a handful of test requests, not fine for an always-on coding agent. See the rate-limits section below before you start benchmarking.

Kimi K2.6 official pricing

The numbers currently published on Moonshot's K2.6 pricing page:

Item           | Price   | Unit
Cached input   | ¥1.10   | per 1M tokens
Uncached input | ¥6.50   | per 1M tokens
Output         | ¥27.00  | per 1M tokens
Context window | 262,144 | tokens

Two things to notice. First, the token prices are in RMB (¥), not USD — if you are comparing with Anthropic or OpenAI pricing, do the currency conversion yourself; do not eyeball "¥6.50" as "$6.50." Second, cached input is roughly 6× cheaper than uncached input. That single line item dominates the economics of long-context and agent workloads.
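To make the gap concrete, here is a back-of-envelope estimator using the rates above — prices in RMB, and the example token counts are illustrative:

```python
# Published K2.6 rates, converted to ¥ per token.
CACHED_INPUT_RATE = 1.10 / 1_000_000    # cached input
UNCACHED_INPUT_RATE = 6.50 / 1_000_000  # uncached input
OUTPUT_RATE = 27.00 / 1_000_000         # output

def estimate_cost_rmb(cached_in: int, uncached_in: int, output: int) -> float:
    """Cost in RMB for one request, given its token breakdown."""
    return (cached_in * CACHED_INPUT_RATE
            + uncached_in * UNCACHED_INPUT_RATE
            + output * OUTPUT_RATE)

# A 100K-token prompt with 2K output: fully cached vs fully uncached.
print(round(estimate_cost_rmb(100_000, 0, 2_000), 3))   # → 0.164
print(round(estimate_cost_rmb(0, 100_000, 2_000), 3))   # → 0.704
```

Same request, more than 4× the cost — that is the entire case for prompt-prefix discipline.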

What "cached input" vs "uncached input" means

Moonshot, like most frontier providers, implements context caching: when parts of your prompt have been seen recently, the server skips recomputing the prefix and charges a much lower rate for those tokens.

Concretely:

  • Cache hit (cached input) — a prefix you already sent (system prompt, prior turns of the conversation, large document context) matches what is cached server-side. You pay the cached rate.
  • Cache miss (uncached input) — new prompt content, a different ordering, or a prefix that has aged out of cache. You pay the full uncached rate.

Why this matters for real workflows:

  • Long-context RAG — if you stuff a 100K-token knowledge base into the system prompt and reuse it across requests, caching turns a painful bill into a cheap one.
  • Agent loops — each step in a tool-using agent typically re-sends the system prompt, tool schemas, and the running conversation. Without caching, every step pays uncached rates. With caching, only the newly appended tool result and assistant turn cost full price.
  • Identical prompts, different users — if two users hit your service with the same system prompt, the second one benefits from caching.

The practical implication: design your prompts so the stable, reusable parts (instructions, long documents, tool definitions) come first, and the user-specific, changing parts come last. That maximizes cache hit rate and can cut input costs by a factor of five or more.
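A sketch of what that ordering looks like in code — the prompt text and function name are illustrative; only the ordering principle comes from the guide:

```python
# Stable content that is byte-identical across requests goes first.
STABLE_PREFIX = (
    "You are a support assistant.\n\n"
    "=== Reference document (identical across requests) ===\n"
    "...long knowledge-base content...\n"
)

def build_messages(user_question: str, history: list) -> list:
    # Stable content first  -> large cached prefix on repeat requests.
    # Changing content last -> only this part pays the uncached rate.
    return [
        {"role": "system", "content": STABLE_PREFIX},
        *history,
        {"role": "user", "content": user_question},
    ]
```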

OpenAI-compatible request format

Moonshot's API is OpenAI-compatible, which means any OpenAI SDK works with a new base URL and API key.

curl

curl https://api.moonshot.ai/v1/chat/completions \
  -H "Authorization: Bearer $MOONSHOT_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "kimi-k2.6",
    "messages": [
      {"role": "user", "content": "Explain caching in one paragraph."}
    ]
  }'

Python (OpenAI SDK)

import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["MOONSHOT_API_KEY"],  # keep the key out of source files
    base_url="https://api.moonshot.ai/v1",
)

response = client.chat.completions.create(
    model="kimi-k2.6",
    messages=[
        {"role": "user", "content": "Write a Python function to debounce calls."}
    ],
)
print(response.choices[0].message.content)

Thinking vs Instant mode

K2.6 defaults to Thinking mode. To force Instant (no reasoning tokens), pass:

response = client.chat.completions.create(
    model="kimi-k2.6",
    messages=[...],
    extra_body={"thinking": {"type": "disabled"}},
)

Thinking mode generates reasoning tokens you pay for as output. If you do not need it, disabling it is a cheap win.

Multimodal input

K2.6 is natively multimodal — text, image, and video input. Images are straightforward via the standard OpenAI image_url content part. Video input is supported on the official API (Moonshot flags it as experimental for third-party deployments), so test it end-to-end if your product depends on it.
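A minimal sketch of the image message shape — the URL is a placeholder; check Moonshot's docs for supported formats and size limits:

```python
# Image input via the standard OpenAI content-part format.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image in one sentence."},
            {
                "type": "image_url",
                "image_url": {"url": "https://example.com/diagram.png"},
            },
        ],
    }
]
```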

Rate limits and account tiers

Moonshot applies per-account tier rate limits. The progression is based on cumulative top-up amount — not your current balance, the total you have ever added.

Representative shape of the tier ladder currently published:

Tier   | Cumulative top-up | Concurrency | RPM    | TPM     | TPD
Tier 0 | ¥0                | 1           | 3      | 500,000 | 1,500,000
Tier 1 | ¥50               | higher      | higher | higher  | higher

Exact numbers for Tier 1 and above change over time — check the limits page on the platform before sizing a workload. A few guidelines:

  • Tier 0 is fine for validation. You can write the integration, run a handful of test calls, confirm the OpenAI SDK works — all inside the free tier.
  • Tier 0 is not fine for coding agents. Three requests per minute and a single concurrent request will bottleneck any real agent loop. You will spend more time being rate-limited than getting work done.
  • Commit early to get throughput. The cheapest way to unblock a real workload is usually a small top-up to reach Tier 1, not trying to optimize around Tier 0 limits.

Extra costs people miss

The per-token table is not the whole story. Three categories of cost quietly show up in production.

Built-in web search. Moonshot offers a $web_search tool that the model can call during a generation. Each invocation is billed at ¥0.03 per call. That sounds trivial, but search result content then gets inserted into the next /chat/completions request as additional input tokens — and those tokens are billed at the normal input rate. A chatty agent that searches ten times per user turn is paying ten search fees and ten chunks of input tokens.

Reasoning tokens. In Thinking mode, the model generates internal reasoning tokens that count as output. On simple questions this is fine. On an agent that calls tools in a loop, the accumulated reasoning across 50 tool calls can easily be your largest line item. If the task does not need it, turn Thinking off.

Agent retries and long-horizon loops. Moonshot's own materials highlight K2.6 executing 4,000+ tool calls over 12 hours. That is impressive capability — and a very real bill. Long-horizon agent demos are genuinely useful, but they are also the fastest way to burn ¥10,000 without noticing. Always cap max steps and max tokens when running agent workflows.
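A sketch of what those caps look like in practice — `run_tool`, the step cap, and the timeout values are illustrative, and the tool-call handling follows the standard OpenAI response shape:

```python
import time

MAX_STEPS = 25      # illustrative; tune per workflow
MAX_SECONDS = 300   # wall-clock budget for the whole run

def run_agent(client, messages, tools, run_tool):
    start = time.monotonic()
    for _ in range(MAX_STEPS):
        if time.monotonic() - start > MAX_SECONDS:
            raise TimeoutError("agent exceeded wall-clock budget")
        response = client.chat.completions.create(
            model="kimi-k2.6",
            messages=messages,
            tools=tools,
            max_tokens=2048,            # cap output per step
        )
        message = response.choices[0].message
        messages.append(message)
        if not message.tool_calls:      # no tool call -> model is finished
            return message.content
        for call in message.tool_calls:
            messages.append({
                "role": "tool",
                "tool_call_id": call.id,
                "content": run_tool(call),
            })
    raise RuntimeError("agent hit MAX_STEPS without finishing")
```

Both exits raise loudly rather than silently continuing — for a billed loop, failing fast is the cheap failure mode.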

Cache-miss patterns. Reordering your prompt, changing your system message frequently, or serving many unique users with unique context all hurt cache hit rates. If you see your "input" line item looking bigger than expected, caching is usually the reason.

Is Kimi K2.6 free?

There are three different "free" questions, and they have three different answers:

Using Kimi in the browser at kimi.com. Moonshot's consumer products typically include a free tier with daily usage caps. That is not the API — conversations there do not spend API credits.

Using the Kimi K2.6 API without paying. The Tier 0 free limits let you make a small number of calls without topping up. That is enough for integration testing, not for any sustained workload. Beyond Tier 0, API usage is paid.

Using Kimi K2.6 via Ollama cloud, OpenRouter, or similar. Those are separate billing systems with their own free credits and pricing. They are not "the Kimi API" even though they route to the same model.

So: there is a free way to try it, but there is no free way to run a production workload on K2.6 through the official API.

How to control Kimi API cost

A short checklist before you scale up:

  • Set a hard budget cap in the console. Your future self will thank you.
  • Enable balance-low alerts so you find out about unexpected spend before the credit card does.
  • Always pass max_tokens on output, especially in agent loops where the model could otherwise talk forever.
  • Put stable context first, user-specific content last — maximize cache hits.
  • Disable Thinking mode for tasks that don't need it.
  • Gate $web_search behind explicit intent; do not let every prompt trigger it.
  • Bound agent loops with a max-step counter and a wall-clock timeout.
  • Log per-request input vs output vs cached-input tokens so you can see where cost is actually going.
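For that last item, the usage object on each response already carries the split. A minimal sketch — the `prompt_tokens_details.cached_tokens` field follows the OpenAI usage shape, so verify the exact field name Moonshot returns before relying on it:

```python
def log_usage(response) -> dict:
    """Record the input/cached/output token split for one request."""
    usage = response.usage
    details = getattr(usage, "prompt_tokens_details", None)
    cached = getattr(details, "cached_tokens", 0) or 0
    record = {
        "input_tokens": usage.prompt_tokens,
        "cached_input_tokens": cached,
        "uncached_input_tokens": usage.prompt_tokens - cached,
        "output_tokens": usage.completion_tokens,
    }
    print(record)  # swap for your structured logger
    return record
```

A falling cached-input share over time is the earliest warning that a prompt change broke your cache prefix.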

Final recommendation

If you are evaluating Kimi K2.6 for a coding agent or long-context workflow, the cost structure is workable but not automatic. The headline token prices are competitive, and the cached-input rate is excellent — but only if you structure your prompts to hit the cache. For short, stateless calls without caching, K2.6 is not the cheapest option, and the output rate in particular (¥27.00 / 1M) will dominate any cost model that involves lots of generated code.

For most teams, the right starting point is: top up just enough to clear Tier 0, build your integration, measure your actual cache hit rate and token distribution in production, and only then decide whether K2.6 is the right ongoing choice — or whether something with a different pricing shape fits your workflow better.

FAQ

How do you get a Kimi API key? Sign in at platform.moonshot.ai, open the API keys section, and create a new key. Copy it immediately; it is only shown once. Set a budget cap at the same time.

How much does Kimi K2.6 cost? On the official pricing page, cached input is ¥1.10 per 1M tokens, uncached input is ¥6.50 per 1M tokens, output is ¥27.00 per 1M tokens, and the context window is 262,144 tokens. Prices are in RMB.

Is Kimi K2.6 free to use? The Tier 0 free tier allows a small number of calls (3 RPM, 1 concurrent) with a daily token ceiling — enough for testing, not for production. The consumer product at kimi.com has its own free tier separate from API billing.

Does Kimi API support OpenAI SDKs? Yes. The Kimi API is OpenAI-compatible. Point any OpenAI SDK at https://api.moonshot.ai/v1 with your Moonshot key and set model to kimi-k2.6.

What are the Kimi API rate limits? Limits are tier-based and scale with cumulative top-up. Tier 0 (¥0) is 3 RPM and 1 concurrent request with a daily token cap. Tier 1 starts at ¥50 cumulative top-up with substantially higher limits. Higher tiers require larger cumulative top-ups.

How much does Kimi web search cost? The built-in $web_search tool is billed at ¥0.03 per call. Search result content is then added to the next chat completion request and billed at the normal input token rate.

Can I use Kimi K2.6 with tools and function calling? Yes. K2.6 supports tool use and function calling in the same style as OpenAI. Note one constraint from Moonshot's docs: when Thinking mode is enabled, tool_choice should be auto or none, and you must preserve the assistant's reasoning_content across tool-calling turns.
