
countTokens() should support mediaResolution in config for accurate multimodal token estimation #1134

@arun279

Description

Is your feature related to a problem? Please describe.

When building applications that process video/images with the Gemini API, I need to implement rate limiting to stay within TPM (Tokens Per Minute) quotas. To do this effectively, I need accurate token estimates before making generateContent() calls.

The countTokens() API is ideal for this, but it doesn't support the mediaResolution config option. Since mediaResolution significantly affects token count (64 tokens/frame for LOW vs 256 tokens/frame for MEDIUM/HIGH on Gemini 2.5), the token estimate from countTokens() doesn't reflect the actual tokens that will be used when generateContent() is called with a specific mediaResolution.
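
For a rough sense of the gap, a worked example (the frame count and the roughly 1 sampled frame per second of video are assumptions for illustration, not taken from the SDK):

// Illustrative arithmetic only; the sampling rate is an assumption.
const sampledFrames = 60;                  // e.g. a 60-second clip at ~1 fps
const lowEstimate = sampledFrames * 64;    // LOW: 3,840 video tokens
const highEstimate = sampledFrames * 256;  // MEDIUM/HIGH: 15,360 video tokens
// A resolution-unaware estimate can be off by about 4x for the video portion alone.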

Describe the solution you'd like

Add mediaResolution?: MediaResolution to the CountTokensConfig interface, matching its availability in GenerateContentConfig.

Currently in @google/[email protected]:

// CountTokensConfig - does NOT have mediaResolution
export declare interface CountTokensConfig {
    httpOptions?: HttpOptions;
    abortSignal?: AbortSignal;
    systemInstruction?: ContentUnion;
    tools?: Tool[];
    generationConfig?: GenerationConfig;
}

// GenerateContentConfig - DOES have mediaResolution
export declare interface GenerateContentConfig {
    // ... other options ...
    mediaResolution?: MediaResolution;  // ✓ Supported here
}
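
The requested addition would look roughly like this (a sketch only; the field simply mirrors GenerateContentConfig):

// Proposed shape (not in the SDK today):
export declare interface CountTokensConfig {
    httpOptions?: HttpOptions;
    abortSignal?: AbortSignal;
    systemInstruction?: ContentUnion;
    tools?: Tool[];
    generationConfig?: GenerationConfig;
    mediaResolution?: MediaResolution;  // requested addition
}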

Expected behavior:

import { GoogleGenAI, MediaResolution } from '@google/genai';

const ai = new GoogleGenAI({ apiKey: 'xxx' });

// Should be able to get accurate token count with mediaResolution
const tokenCount = await ai.models.countTokens({
  model: 'gemini-2.5-flash',
  contents: [{ role: 'user', parts: [videoPart, textPart] }],
  config: {
    mediaResolution: MediaResolution.MEDIA_RESOLUTION_LOW  // Currently not supported
  }
});

// Then use that estimate for rate limiting before calling generateContent
await ai.models.generateContent({
  model: 'gemini-2.5-flash',
  contents: [{ role: 'user', parts: [videoPart, textPart] }],
  config: {
    mediaResolution: MediaResolution.MEDIA_RESOLUTION_LOW  // This IS supported
  }
});
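
For context, this is the kind of TPM gate that would sit between those two calls. The TokenBucket class below is a hypothetical illustration, not part of @google/genai; only totalTokens on the countTokens() response is SDK API.

// Hypothetical TPM gate built on the pre-call estimate.
class TokenBucket {
  private used = 0;
  constructor(private readonly tokensPerMinute: number) {
    setInterval(() => { this.used = 0; }, 60_000);  // reset the window every minute
  }
  async waitFor(tokens: number): Promise<void> {
    while (this.used + tokens > this.tokensPerMinute) {
      await new Promise((resolve) => setTimeout(resolve, 1_000));  // back off until the window resets
    }
    this.used += tokens;
  }
}

const bucket = new TokenBucket(1_000_000);          // e.g. a 1M TPM quota (assumption)
await bucket.waitFor(tokenCount.totalTokens ?? 0);  // gate on the estimate before generateContent()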

Describe alternatives you've considered

  1. Formula-based estimation: Using the documented token rates (64/256 tokens per frame + 32 audio tokens/second) to estimate manually (a sketch follows this list). This works but duplicates logic that the API already has, and may drift from actual API behavior.

  2. Post-hoc adjustment: Using usageMetadata.promptTokenCount from responses to retroactively update rate limiting state. This helps adapt over time but doesn't prevent the initial rate limit violations.

  3. Conservative over-estimation: Always estimate using the highest resolution rate (256 tokens/frame). This wastes rate limit budget when using LOW resolution.
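
A minimal sketch of that first workaround, assuming the documented rates and a default sampling rate of 1 frame per second (both assumptions that can drift from actual API behavior):

import { MediaResolution } from '@google/genai';

// Manual estimate from the documented rates; the rates and 1 fps sampling are assumptions.
function estimateVideoTokens(
  durationSeconds: number,
  resolution: MediaResolution,
  framesPerSecond = 1,
): number {
  const tokensPerFrame =
    resolution === MediaResolution.MEDIA_RESOLUTION_LOW ? 64 : 256;
  const videoTokens = durationSeconds * framesPerSecond * tokensPerFrame;
  const audioTokens = durationSeconds * 32;  // documented audio rate
  return videoTokens + audioTokens;
}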

Additional context

  • SDK Version: @google/[email protected]

  • Documentation reference: Media Resolution docs show significant token differences:

    MediaResolution   Video (tokens/frame)
    LOW               64
    MEDIUM            256
    HIGH              256
  • Use case: Video processing applications that chunk long videos and need to respect TPM quotas require accurate pre-call token estimation to implement proper rate limiting.

  • Related: The GenerationConfig type inside CountTokensConfig also doesn't appear to surface mediaResolution, though the documentation suggests token counts depend on it.

Labels

api:gemini-api
priority: p3 (Desirable enhancement or fix. May not be included in next release.)
status: awaiting user response (issues requiring a response from the user)
type: feature request ('Nice-to-have' improvement, new feature or different behavior or design.)
