Kaka Subtitle Assistant
An LLM-powered video subtitle processing assistant, supporting speech recognition, subtitle segmentation, optimization, and translation.
Kaka Subtitle Assistant (VideoCaptioner) is easy to operate and doesn't require high-end hardware. It supports both online API calls and local offline processing (with GPU support) for speech recognition. It leverages Large Language Models (LLMs) for intelligent subtitle segmentation, correction, and translation. It offers a one-click solution for the entire video subtitle workflow! Add stunning subtitles to your videos.
The latest version now supports VAD, vocal separation, word-level timestamps, batch subtitle processing, and other practical features.

Processing a 14-minute 1080P English TED video from Bilibili end-to-end, using the local Whisper model for speech recognition and the gpt-4o-mini model for optimization and translation into Chinese, took approximately 3 minutes.
Based on backend calculations, the cost for model optimization and translation was less than ¥0.01 (calculated using OpenAI's official pricing).
For detailed results of subtitle and video synthesis, please refer to the TED Video Test.
The software is lightweight, with a package size under 60 MB, and bundles all required environments, so you can download and run it directly.
Download the latest executable from the Release page, or from Lanzou Cloud.
Run the installer to install.
(Optional) Configure the LLM API and choose whether to enable subtitle optimization and/or subtitle translation.
Drag and drop the video file into the software window for fully automatic processing.
Note: Each step supports independent processing and file drag-and-drop.
Since I don't have a Mac, I cannot test and package for macOS, so macOS executables are temporarily unavailable.
Mac users: please download the source code and install the Python dependencies to run the app. (Local Whisper is currently not supported on macOS.)
brew install ffmpeg
brew install aria2
brew install python@3.**
git clone https://github.com/WEIFENG2333/VideoCaptioner.git
cd VideoCaptioner
python3.** -m venv venv
source venv/bin/activate
pip install -r requirements.txt
python main.py
The current application is relatively basic. We welcome PR contributions.
git clone https://github.com/WEIFENG2333/VideoCaptioner.git
cd VideoCaptioner
docker build -t video-captioner .
Run with custom API configuration:
docker run -d \
-p 8501:8501 \
-v $(pwd)/temp:/app/temp \
-e OPENAI_BASE_URL="Your API address" \
-e OPENAI_API_KEY="Your API key" \
--name video-captioner \
video-captioner
Open your browser and go to: http://localhost:8501
The software fully utilizes the advantages of Large Language Models (LLMs) in understanding context to further process subtitles generated by speech recognition. It effectively corrects typos, unifies terminology, and makes the subtitle content more accurate and coherent, providing users with an excellent viewing experience!
| Configuration Item | Description |
|---|---|
| Built-in Model | The software includes a basic large language model (gpt-4o-mini), which can be used without configuration (public service is unstable). |
| API Support | Supports standard OpenAI API format. Compatible with SiliconCloud, DeepSeek, Ollama, etc. For configuration methods, please refer to the Configuration Documentation. |
Recommended models: For higher quality, choose Claude-3.5-sonnet or gpt-4o.
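As a concrete illustration of the OpenAI-compatible format mentioned above, here is a minimal sketch using the official openai Python package; the base URL, API key, and prompt are placeholders for your own provider's values, not the app's built-in configuration.

```python
# Minimal sketch of an OpenAI-compatible chat request
# (base_url, api_key, and the prompt are placeholders).
from openai import OpenAI

client = OpenAI(
    base_url="https://your-provider.example/v1",  # e.g. SiliconCloud, DeepSeek, Ollama
    api_key="sk-...",
)

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "Correct typos in the subtitle text; keep the meaning unchanged."},
        {"role": "user", "content": "their going too the store"},
    ],
)
print(resp.choices[0].message.content)
```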
There are two Whisper variants: WhisperCpp and fasterWhisper (recommended). The latter performs better; both require downloading models within the software.
| Model | Disk Space | RAM Usage | Description |
|---|---|---|---|
| Tiny | 75 MiB | ~273 MB | Transcription is mediocre, for testing only. |
| Small | 466 MiB | ~852 MB | English recognition is already good. |
| Medium | 1.5 GiB | ~2.1 GB | This version is recommended as the minimum for Chinese recognition. |
| Large-v1/v2 👍 | 2.9 GiB | ~3.9 GB | Good performance, recommended if your configuration allows. |
| Large-v3 | 2.9 GiB | ~3.9 GB | Community feedback suggests potential hallucination/subtitle repetition issues. |
Recommended model: Large-v1 is stable and of good quality.
Note: The above models can be downloaded directly within the software, even over a domestic (mainland China) network connection; both dedicated GPUs and integrated graphics are supported.
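For reference, the faster-whisper library behind the recommended engine can be driven directly from Python roughly as follows. This is a sketch, not how the app invokes its bundled program; the audio file name and model choice are illustrative.

```python
# Sketch: transcribing with the faster-whisper library directly
# ("audio.mp3" and the model size are illustrative choices).
from faster_whisper import WhisperModel

# device="cuda" targets an NVIDIA GPU; use device="cpu" otherwise.
model = WhisperModel("large-v1", device="cuda", compute_type="float16")

segments, info = model.transcribe(
    "audio.mp3",
    vad_filter=True,       # VAD, as mentioned in the feature list
    word_timestamps=True,  # word-level timestamps
)
for seg in segments:
    print(f"[{seg.start:.2f} -> {seg.end:.2f}] {seg.text}")
```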
| Type | Description | Example |
|---|---|---|
| Glossary | Correction table for terminology, names, and specific words. | 机器学习->Machine Learning<br>马斯克->Elon Musk<br>打call->Cheer on<br>Turing patterns<br>Bus paradox |
| Original Subtitle Text | The original manuscript or related content of the video. | Complete speech scripts, lecture notes, etc. |
| Correction Requirements | Specific correction requirements related to the content. | Unify personal pronouns, standardize terminology, and other requirements relevant to the content (see the example for reference). |
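To make the glossary row concrete: a correction table boils down to "wrong -> right" pairs handed to the model as context. Below is a hypothetical sketch of rendering such a table into a prompt hint; the names and structure are my own, not the app's actual code.

```python
# Hypothetical sketch: turning a glossary into a prompt hint
# (GLOSSARY contents mirror the table's example column).
GLOSSARY = {
    "机器学习": "Machine Learning",
    "马斯克": "Elon Musk",
    "打call": "Cheer on",
}

def build_glossary_hint(glossary: dict[str, str]) -> str:
    # Render "wrong -> right" pairs for the model to apply when correcting.
    pairs = [f"{src} -> {dst}" for src, dst in glossary.items()]
    return "Apply these corrections where relevant:\n" + "\n".join(pairs)

print(build_glossary_hint(GLOSSARY))
```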
| Interface Name | Supported Languages | Running Mode | Description |
|---|---|---|---|
| Interface B | Chinese, English | Online | Free, fast |
| Interface J | Chinese, English | Online | Free, fast |
| WhisperCpp | Chinese, Japanese, Korean, English, and 99 other languages. Good performance for foreign languages. | Local | (Actual use is unstable) Requires downloading transcription models. Chinese: Medium or larger model recommended. English, etc.: Smaller models can achieve good results. |
| fasterWhisper 👍 | Chinese, English, and 99 other languages. Excellent performance for foreign languages, more accurate timeline. | Local | (🌟Highly Recommended🌟) Requires downloading the program and transcription models. Supports CUDA, faster, accurate transcription. Super accurate timestamp subtitles. Prioritize using this. |
If you run into problems with the URL download function (for example, failed downloads or only low-quality streams being available), place a cookies.txt file in the AppData directory under the software installation directory; high-quality videos can then be downloaded normally. (A sketch of how such a cookie file is typically consumed appears after the directory structure below.)
The simple processing flow of the program is as follows:
Speech Recognition -> Subtitle Segmentation (optional) -> Subtitle Optimization & Translation (optional) -> Subtitle & Video Synthesis
The main directory structure after installing the software is as follows:
```
VideoCaptioner/
├── runtime/               # Runtime environment directory (do not modify)
├── resources/             # Software resource files (binaries, icons, and the downloaded faster-whisper program)
├── work-dir/              # Working directory where processed videos and subtitle files are saved
├── AppData/               # Application data directory
│   ├── cache/             # Cache directory for transcription and LLM request data
│   ├── models/            # Whisper model files
│   ├── logs/              # Log directory recording software running status
│   ├── settings.json      # User settings
│   └── cookies.txt        # Cookie information for video platforms (required for downloading high-definition videos)
└── VideoCaptioner.exe     # Main program executable
```
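Regarding the cookies.txt entry above: cookie files in this format are typically consumed by a yt-dlp-style downloader. A minimal sketch assuming the yt_dlp Python package (the URL, format string, and paths are placeholders, not the app's internals):

```python
# Sketch: downloading with a cookies.txt file via yt_dlp
# (URL and paths are placeholders).
import yt_dlp

ydl_opts = {
    "cookiefile": "AppData/cookies.txt",   # cookies exported from your browser
    "format": "bestvideo+bestaudio/best",  # prefer the highest-quality streams
    "outtmpl": "work-dir/%(title)s.%(ext)s",
}

with yt_dlp.YoutubeDL(ydl_opts) as ydl:
    ydl.download(["https://example.com/watch?v=..."])
```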
The quality of subtitle segmentation is crucial for the viewing experience. For this, I developed SubtitleSpliter, which intelligently reorganizes word-by-word subtitles into paragraphs that follow natural language habits and stay perfectly in sync with the video.
During processing, only the text content is sent to the large language model, without timeline information, which greatly reduces processing overhead.
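A sketch of that idea: keep the timeline locally, send only the plain text to the model, and map the returned sentence boundaries back onto the original cues. The data shapes and the imagined LLM response here are hypothetical, not SubtitleSpliter's actual API.

```python
# Sketch: timestamps never leave the machine; only text goes to the LLM.
cues = [
    (0.00, 0.40, "machine"),
    (0.40, 0.90, "learning"),
    (0.90, 1.30, "is"),
    (1.30, 1.80, "fun"),
]

text_only = " ".join(word for _, _, word in cues)

# Imagine the LLM returns segment boundaries as word counts, e.g. [2, 2].
segment_lengths = [2, 2]  # placeholder for a real LLM call

i = 0
for length in segment_lengths:
    chunk = cues[i : i + length]
    start, end = chunk[0][0], chunk[-1][1]  # timeline recovered locally
    print(f"[{start:.2f} -> {end:.2f}]", " ".join(w for _, _, w in chunk))
    i += length
```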
In the translation stage, we adopt the "translate-reflect-translate" methodology proposed by Andrew Ng. This iterative approach not only ensures the accuracy of the translation but also makes the wording more natural and fluent.
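A minimal sketch of that translate-reflect-translate loop, assuming the same OpenAI-compatible client as earlier; the prompts are heavily abbreviated and are not the app's actual prompts.

```python
# Sketch: translate -> reflect -> re-translate (prompts simplified).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY / OPENAI_BASE_URL from the environment

def ask(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

source = "The bus paradox: you tend to wait longer than the average interval."
draft = ask(f"Translate into Chinese:\n{source}")
critique = ask(f"Source:\n{source}\nTranslation:\n{draft}\nList concrete problems.")
final = ask(f"Source:\n{source}\nDraft:\n{draft}\nCritique:\n{critique}\nRewrite the translation.")
print(final)
```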
The author is a junior undergraduate student. Both my abilities and this project still have many shortcomings, and the project is continually being improved. If you encounter any bugs during use, please feel free to submit Issues and Pull Requests to help improve it.
If you find this project helpful, please give it a Star. This will be the greatest encouragement and support for me!