August 12th, 2024

Show HN: LLM Aided Transcription Improvement

The LLM-Aided Transcription Improvement Project on GitHub enhances audio transcription quality with a multi-stage LLM pipeline, supports both local and cloud-based models, and requires Python 3.12 or higher.

The LLM-Aided Transcription Improvement Project on GitHub focuses on improving the quality of audio transcriptions generated by models such as OpenAI's Whisper. It runs transcription output through a multi-stage processing pipeline of large language model (LLM) prompts that improve the text's structure, readability, and formatting. Key features include a multi-stage processing approach that corrects errors and formats the text as markdown, parallel processing for efficiency, support for both local and cloud-based LLMs (with OpenAI's GPT-4o-mini as the default), and a quality assessment step that compares the final output against the original transcription.

To use the project, users need Python 3.12 or higher and follow installation steps that include cloning the repository, creating a virtual environment, and configuring environment variables for API keys and model selection. They can then run the script on a transcription JSON file to generate a formatted markdown file. Example outputs illustrate the transformation from raw JSON to structured markdown.
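
The core idea of such a pipeline, feeding raw transcription text through an LLM prompt for cleanup, can be sketched roughly as follows. This is a minimal illustration rather than the project's actual code: the prompt wording, the `transcription.json` filename, and the assumption that the JSON follows Whisper's segments format are all placeholders.

```python
import json

from openai import OpenAI  # assumes the official OpenAI Python client is installed

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical prompt; the project's real prompts live in its repository.
CLEANUP_PROMPT = (
    "Correct transcription errors in the following text and format it as "
    "markdown. Preserve the speaker's wording wherever possible:\n\n{chunk}"
)


def clean_chunk(chunk: str, model: str = "gpt-4o-mini") -> str:
    """Send one transcription chunk through a single LLM cleanup pass."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": CLEANUP_PROMPT.format(chunk=chunk)}],
    )
    return response.choices[0].message.content


# Whisper-style JSON typically carries a list of segments with a "text" field.
with open("transcription.json") as f:
    segments = json.load(f)["segments"]

raw_text = " ".join(seg["text"] for seg in segments)
print(clean_chunk(raw_text))
```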

- The project enhances audio transcription quality using a multi-stage processing pipeline.

- It supports both local and cloud-based language models.

- Users can process transcription chunks concurrently for improved efficiency (see the sketch after this list).

- Installation requires Python 3.12 or higher plus the project's library dependencies.

- Example outputs demonstrate the project's effectiveness in formatting transcriptions.
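
The concurrency mentioned above is straightforward to picture: split the transcription into chunks and dispatch each through the cleanup pass in parallel. Since LLM API calls are I/O-bound, threads are a natural fit. A rough sketch, reusing the hypothetical `clean_chunk` function from the earlier snippet (the project's own concurrency mechanism is not stated here):

```python
from concurrent.futures import ThreadPoolExecutor


def clean_parallel(chunks: list[str], max_workers: int = 4) -> str:
    """Run the LLM cleanup pass over several chunks concurrently.

    executor.map preserves input order, so the cleaned pieces can
    simply be joined back together in sequence.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        cleaned = list(executor.map(clean_chunk, chunks))
    return "\n\n".join(cleaned)
```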

2 comments
By @gavmor - 6 months
I record long, rambling voice memos in noisy environments which Whisper struggles to parse. Perhaps this can rescue me from the tedium of hand-stitching the fragmented results together. GIGO, of course, but there's an equilibrium here that might be struck.
By @ramonverse - 6 months
I'm curious about the chunk splitting approach you mentioned. How do you determine the optimal chunk size for processing? There seems to be a tradeoff between context preservation and processing efficiency. Have you experimented with different chunk sizes and their impact on the quality of the final output? This could be really important for handling things like long-range dependencies in the text.
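
For readers wondering what that tradeoff looks like in practice, here is one plausible splitting strategy (not necessarily the project's): break on sentence boundaries up to a size cap, and repeat a small number of trailing sentences at the start of the next chunk so the model keeps some local context across each seam. The `max_chars` and `overlap_sentences` parameters are illustrative defaults.

```python
import re


def split_into_chunks(
    text: str, max_chars: int = 2000, overlap_sentences: int = 1
) -> list[str]:
    """Split text on sentence boundaries into chunks of at most max_chars,
    carrying the last `overlap_sentences` sentences into the next chunk
    to preserve local context across chunk boundaries."""
    sentences = re.split(r"(?<=[.!?])\s+", text)
    chunks: list[str] = []
    current: list[str] = []
    length = 0
    for sentence in sentences:
        if current and length + len(sentence) > max_chars:
            chunks.append(" ".join(current))
            current = current[-overlap_sentences:]  # carry context forward
            length = sum(len(s) for s in current)
        current.append(sentence)
        length += len(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Larger chunks preserve more long-range context per call but are slower and risk degrading model attention; smaller chunks parallelize well but lose cross-chunk dependencies, which is exactly the tension the comment raises.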