
# Full Pipeline — Gradio API Reference

Gradio automatically exposes every event handler as an HTTP endpoint. This document covers the endpoints useful for programmatic integration, with examples in Python, JavaScript, and curl.


## Base URL

```
https://garchenarchive-archiveai.hf.space
```

When running locally, the base URL is `http://localhost:7860`.

## Auto-generated schema

Every running Gradio app publishes its full schema at:

```
GET {base_url}/info
```

You can also browse interactive API docs in the Gradio UI by clicking "Use via API" in the footer, or by visiting `{base_url}/docs`.
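As a quick sanity check, you can fetch that schema and list the named endpoints. The sketch below assumes the `/info` payload contains a `named_endpoints` object keyed by endpoint name, which is true for recent Gradio versions but may vary:

```python
import json
import urllib.request

def endpoint_names(info: dict) -> list[str]:
    """Return the named endpoint paths from a Gradio /info payload, sorted."""
    return sorted(info.get("named_endpoints", {}).keys())

# Live usage (requires network):
#   with urllib.request.urlopen("https://garchenarchive-archiveai.hf.space/info") as resp:
#       print(endpoint_names(json.load(resp)))
```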


## Client libraries (recommended)

Prefer the official Gradio clients over raw HTTP — they handle file uploads, streaming, and session state automatically.

### Python

```bash
pip install gradio_client
```

### JavaScript / Node

```bash
npm install @gradio/client
```

## Authentication

This app does not use HTTP-level authentication. The Gemini API key is passed as a plain parameter (`gemini_api_key`) on each call that requires it. Keep it out of client-side code — proxy calls through your backend.


## Endpoints

### 1. Run the full pipeline

Chains Speech-to-Text → Translation → Text-to-Speech in one call.

**API name:** `/run_pipeline`

#### Inputs

| # | Parameter | Type | Description |
|---|---|---|---|
| 0 | `file_input` | filepath \| null | Audio, video, `.srt`, `.txt`, or `.json` file |
| 1 | `drive_url` | string | Google Drive share URL (alternative to file upload) |
| 2 | `do_stt` | boolean | Run Speech-to-Text stage |
| 3 | `do_translation` | boolean | Run Translation stage |
| 4 | `do_tts` | boolean | Run Text-to-Speech stage |
| 5 | `do_summary` | boolean | Generate a Gemini summary |
| 6 | `language` | string | STT language — `"English"`, `"Tibetan"`, `"Tibetan (Base)"`, or `"Both"` |
| 7 | `selected_speakers` | string[] | Speaker names to keep (empty = all speakers) |
| 8 | `speaker_threshold` | float | Speaker similarity threshold, 0.0–1.0 (default 0.5) |
| 9 | `use_gemini_post_edit` | boolean | Correct transcription via Gemini — works on STT output and uploaded transcripts |
| 10 | `gemini_model` | string | Gemini model name (see Models) |
| 11 | `min_clip_duration` | float | Minimum segment duration in seconds (default 3); shorter segments are not split further |
| 12 | `max_clip_duration` | float | Maximum segment duration in seconds (default 30); longer segments are split into chunks |
| 13 | `target_language` | string | Translation target language, e.g. `"English"`, `"French"` |
| 14 | `gemini_api_key` | string | Gemini API key |
| 15 | `voice_label` | string | TTS voice (see Voices) |
| 16 | `prose_speed` | float | Prose playback speed, 0.5–1.0 (default 1.0) |
| 17 | `mantra_speed` | float | Mantra playback speed, 0.5–1.0 (default 0.75) |
| 18 | `state` | object | Pass `null` to start fresh |

#### Outputs

The pipeline is a streaming generator — it yields intermediate status updates before the final result. The last yielded value contains:

| Field | Type | Description |
|---|---|---|
| `state` | object | Updated app state including all segments |
| `status` | string | Human-readable status message |
| `srt_download` | filepath \| null | Path to generated SRT file |
| `json_download` | filepath \| null | Path to generated JSON file |
| `summary` | string | Summary text (if requested) |
| `audio` | [sample_rate, array] \| null | Synthesized audio (if TTS enabled) |

#### Python example

```python
from gradio_client import Client, handle_file

client = Client("https://garchenarchive-archiveai.hf.space")

result = client.predict(
    file_input=handle_file("/path/to/audio.mp3"),
    drive_url="",
    do_stt=True,
    do_translation=True,
    do_tts=False,
    do_summary=False,
    language="Both",
    selected_speakers=[],
    speaker_threshold=0.5,
    use_gemini_post_edit=False,
    gemini_model="gemini-2.5-flash",
    min_clip_duration=3,
    max_clip_duration=30,
    target_language="English",
    gemini_api_key="AIza...",
    voice_label="Female: Sarah",
    prose_speed=1.0,
    mantra_speed=0.75,
    state=None,
    api_name="/run_pipeline",
)
```

#### JavaScript example

```js
import { Client } from "@gradio/client";

const client = await Client.connect("https://garchenarchive-archiveai.hf.space");

const result = await client.predict("/run_pipeline", {
  file_input: new Blob([audioBuffer], { type: "audio/mpeg" }),
  drive_url: "",
  do_stt: true,
  do_translation: true,
  do_tts: false,
  do_summary: false,
  language: "Both",
  selected_speakers: [],
  speaker_threshold: 0.5,
  use_gemini_post_edit: false,
  gemini_model: "gemini-2.5-flash",
  min_clip_duration: 3,
  max_clip_duration: 30,
  target_language: "English",
  gemini_api_key: "AIza...",
  voice_label: "Female: Sarah",
  prose_speed: 1.0,
  mantra_speed: 0.75,
  state: null,
});
```
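Because `/run_pipeline` is a streaming generator, `client.predict()` only returns the final yield. To observe the intermediate status updates from Python, you can use the client's `submit()` API, which returns a `Job` you can iterate. A sketch, with argument values mirroring the Python example above:

```python
# Arguments for /run_pipeline, mirroring the Python example above.
run_pipeline_args = dict(
    file_input=None,          # or handle_file("/path/to/audio.mp3")
    drive_url="",
    do_stt=True,
    do_translation=True,
    do_tts=False,
    do_summary=False,
    language="Both",
    selected_speakers=[],
    speaker_threshold=0.5,
    use_gemini_post_edit=False,
    gemini_model="gemini-2.5-flash",
    min_clip_duration=3,
    max_clip_duration=30,
    target_language="English",
    gemini_api_key="AIza...",
    voice_label="Female: Sarah",
    prose_speed=1.0,
    mantra_speed=0.75,
    state=None,
)

# Live usage (requires network):
#   from gradio_client import Client
#   client = Client("https://garchenarchive-archiveai.hf.space")
#   job = client.submit(**run_pipeline_args, api_name="/run_pipeline")
#   for update in job:          # each intermediate yield from the generator
#       print(update)
#   final = job.result()        # the last yielded value
```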

### 2. Translate a single segment

Translates one piece of text and returns the translation. The simplest endpoint for one-off translation calls.

**API name:** `/translate_one_segment`

#### Inputs

| # | Parameter | Type | Description |
|---|---|---|---|
| 0 | `source` | string | Source text to translate |
| 1 | `target_lang` | string | Target language, e.g. `"English"` |
| 2 | `api_key` | string | Gemini API key |
| 3 | `gemini_model` | string | Gemini model name |

#### Output

`string` — the translated text, or `[Translation error: ...]` on failure.

#### Python example

```python
translation = client.predict(
    source="རང་གི་སེམས་ལ་བལྟ་བར་གྱིས།",
    target_lang="English",
    api_key="AIza...",
    gemini_model="gemini-2.5-flash",
    api_name="/translate_one_segment",
)
```

#### curl example

```bash
curl -X POST https://garchenarchive-archiveai.hf.space/run/predict \
  -H "Content-Type: application/json" \
  -d '{
    "fn_index": <fn_index>,
    "data": [
      "རང་གི་སེམས་ལ་བལྟ་བར་གྱིས།",
      "English",
      "AIza...",
      "gemini-2.5-flash"
    ]
  }'
```

> **Note:** For raw HTTP calls, look up the correct `fn_index` from `GET /info` — the index is assigned at startup and depends on event registration order.


### 3. Translate all segments

Translates all segments currently in the app state and refreshes the segment editor.

**API name:** `/translate_all_segments`

This is primarily used by the UI. For programmatic use, calling `/translate_one_segment` in a loop gives you finer error handling per segment.
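The per-segment loop can be sketched as follows. The `segments` list and the helper name are illustrative; the only contract assumed from the endpoint is that failures come back as a string starting with `[Translation error:`:

```python
def is_translation_error(result: str) -> bool:
    """The endpoint signals failure by returning a bracketed error string."""
    return result.startswith("[Translation error")

# Live usage (requires network and a gradio_client.Client instance):
#   for seg in segments:                         # list of segment dicts
#       out = client.predict(
#           source=seg["source"],
#           target_lang="English",
#           api_key="AIza...",
#           gemini_model="gemini-2.5-flash",
#           api_name="/translate_one_segment",
#       )
#       if is_translation_error(out):
#           print("translation failed for:", seg["source"][:40])
#       else:
#           seg["target"] = out
```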


### 4. Synthesize audio

Runs TTS on the segments currently in state.

**API name:** `/handle_synthesize`

#### Inputs

| # | Parameter | Type | Description |
|---|---|---|---|
| 0 | `state` | object | App state containing segments |
| 1 | `voice_label` | string | Voice name (see Voices) |
| 2 | `prose_speed` | float | 0.5–1.0 |
| 3 | `mantra_speed` | float | 0.5–1.0 |
| 4–28 | `slot_sources[0..24]` | string | Source text for each visible slot |
| 29–53 | `slot_targets[0..24]` | string | Target text for each visible slot |

#### Output

`[sample_rate: int, samples: float[]]` — a NumPy-style audio array, as returned by Gradio's `gr.Audio(type="numpy")`.
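To save that `[sample_rate, samples]` pair locally, here is a minimal sketch using only the standard library. It assumes mono float samples in [-1.0, 1.0]; the helper name is my own:

```python
import struct
import wave

def save_wav(path: str, sample_rate: int, samples: list[float]) -> None:
    """Write mono float samples in [-1.0, 1.0] as a 16-bit PCM WAV file."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)          # 16-bit PCM
        wav.setframerate(sample_rate)
        clipped = (max(-1.0, min(1.0, s)) for s in samples)
        frames = b"".join(struct.pack("<h", int(s * 32767)) for s in clipped)
        wav.writeframes(frames)
```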


### 5. Pronunciation editor

Three endpoints for managing the TTS pronunciation glossary.

#### Look up a word

**API name:** `/lookup_word`

Input: `word: string`
Output: `pronunciation: string` (empty if not found)

#### Save a pronunciation

**API name:** `/save_pronunciation`

Inputs: `word: string`, `pronunciation: string`
Output: `status: string`

#### Remove a pronunciation

**API name:** `/remove_pronunciation`

Input: `word: string`
Output: `[pronunciation: string, status: string]`


## Reference

### Models

| Value | Notes |
|---|---|
| `gemini-2.5-flash` | Default — fast, good quality |
| `gemini-2.5-pro` | Higher quality, slower |
| `gemini-3-flash-preview` | Preview |
| `gemini-3.1-pro-preview` | Preview |

The app automatically falls back through gemini-2.5-flash → gemini-2.5-pro if the requested model fails.
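If you call Gemini directly rather than through the app, the same fallback order can be reproduced client-side. A sketch; `call_model` is a hypothetical stand-in for whatever Gemini call you make:

```python
FALLBACK_ORDER = ["gemini-2.5-flash", "gemini-2.5-pro"]

def call_with_fallback(call_model, requested: str):
    """Try the requested model, then each fallback in order; re-raise the last error."""
    tried = [requested] + [m for m in FALLBACK_ORDER if m != requested]
    last_exc = None
    for model in tried:
        try:
            return call_model(model)
        except Exception as exc:   # in practice, catch the specific API error type
            last_exc = exc
    raise last_exc
```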

### Voices

| Value | Voice |
|---|---|
| `Female: Sarah` | `af_sarah` |
| `Female: Heart` | `af_heart` |
| `Female: Alice` | `bf_alice` |
| `Female: Emma` | `bf_emma` |
| `Male: Adam` | `am_adam` |
| `Male: Onyx` | `am_onyx` |
| `Male: Daniel` | `bm_daniel` |
| `Male: George` | `bm_george` |

### Segment object schema

Segments are the core data structure passed through the pipeline:

```json
{
  "source": "Original transcription text",
  "target": "Translated text (empty string if not yet translated)",
  "timestamp": "00:00:01,000 --> 00:00:05,000"
}
```

`timestamp` follows SRT format and may be an empty string for plain-text inputs.
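As an example of working with this schema, here is a minimal sketch that serializes a segment list back into SRT text. Segments without a timestamp are skipped, and the helper name is my own, not part of the app:

```python
def segments_to_srt(segments: list[dict]) -> str:
    """Render timestamped segments as an SRT document, numbering from 1."""
    blocks = []
    n = 1
    for seg in segments:
        if not seg.get("timestamp"):
            continue  # plain-text inputs may carry empty timestamps
        text = seg.get("target") or seg.get("source", "")
        blocks.append(f"{n}\n{seg['timestamp']}\n{text}")
        n += 1
    return "\n\n".join(blocks) + ("\n" if blocks else "")
```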

### Supported input file types

| Extension | Handled as |
|---|---|
| `.mp3`, `.wav`, `.m4a`, `.mp4`, `.mov`, etc. | Audio/video — passed to STT |
| `.srt` | Subtitle file — parsed into segments |
| `.txt` | Plain text — each line becomes a segment |
| `.json` | Segment array — must match the segment object schema above |
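For client-side validation before uploading, the dispatch table above could be mirrored in a small helper. The category names are my own, and anything outside the three text-like extensions is assumed to be routed to STT as audio/video:

```python
from pathlib import Path

# Text-like extensions the pipeline parses directly instead of sending to STT.
TEXT_LIKE = {".srt": "subtitle", ".txt": "plain-text", ".json": "segment-json"}

def classify_input(filename: str) -> str:
    """Guess how the pipeline will treat a file, based on its extension."""
    ext = Path(filename).suffix.lower()
    return TEXT_LIKE.get(ext, "audio/video")
```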