It’s a bit technical; I haven’t found any pre-packaged software that does what I’m doing yet.
First I installed https://github.com/openai/whisper , the speech-to-text model that OpenAI released back when they were less blinded by dollar signs. I wrote a Python script that uses it to go through all of the audio files in the directory tree where I’m storing this stuff and produce a transcript, which gets stored in a .json file alongside each recording.
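A minimal sketch of that transcription loop (the directory layout, the audio extensions, and the sidecar naming here are illustrative guesses, not the exact script):

```python
# Walk a directory tree, transcribe each audio file with Whisper,
# and save the result to a .json file next to the audio.
import json
from pathlib import Path

AUDIO_EXTS = {".mp3", ".wav", ".m4a", ".flac", ".ogg"}

def transcript_path(audio: Path) -> Path:
    """The .json sidecar file that holds a recording's transcript."""
    return audio.parent / (audio.name + ".json")

def transcribe_tree(root: str, model_name: str = "base") -> None:
    import whisper  # heavy import kept local so the helpers stay importable
    model = whisper.load_model(model_name)
    for audio in sorted(Path(root).rglob("*")):
        if audio.suffix.lower() not in AUDIO_EXTS:
            continue
        out = transcript_path(audio)
        if out.exists():  # skip recordings already transcribed
            continue
        result = model.transcribe(str(audio))
        out.write_text(json.dumps(
            {"text": result["text"], "segments": result["segments"]},
            indent=2))

# transcribe_tree("recordings")  # point this at your archive root
```

Keeping the transcript next to the audio (rather than in one big database) makes it trivial to move or back up the whole tree as a unit.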
For the LLM, I installed https://github.com/LostRuins/koboldcpp/releases/ and used the https://huggingface.co/unsloth/Qwen3-30B-A3B-128K-GGUF model, which is just barely small enough to run smoothly on my RTX 4090. I wrote another Python script that methodically goes through those .json files that Whisper produced, takes the raw text of the transcript, and feeds it to the LLM with a couple of prompts explaining what the transcript is and what I’d like the LLM to do with it (write a summary, or write a bullet-point list of subject tags). Those get saved in the .json file too.
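The summarize-and-tag pass could look roughly like this, assuming koboldcpp is serving its KoboldAI-compatible HTTP API on the default port (5001); the prompt wording and the "summary"/"tags" field names are my own choices, not necessarily the original script's:

```python
# Feed each Whisper transcript to the local LLM via koboldcpp's
# /api/v1/generate endpoint, then save the results back into the .json.
import json
import urllib.request
from pathlib import Path

API = "http://localhost:5001/api/v1/generate"

def generate(prompt: str, max_length: int = 512) -> str:
    payload = json.dumps({"prompt": prompt, "max_length": max_length}).encode()
    req = urllib.request.Request(
        API, data=payload, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["results"][0]["text"]

def build_prompt(task: str, transcript: str) -> str:
    """Explain what the text is and what the LLM should do with it."""
    return ("The following is a transcript of a family audio recording.\n"
            f"{task}\n\nTranscript:\n{transcript}\n\nAnswer:")

def annotate(json_file: Path) -> None:
    data = json.loads(json_file.read_text())
    text = data["text"]
    data["summary"] = generate(build_prompt("Write a short summary.", text))
    data["tags"] = generate(
        build_prompt("Write a bullet-point list of subject tags.", text))
    json_file.write_text(json.dumps(data, indent=2))

# for f in sorted(Path("recordings").rglob("*.json")):
#     annotate(f)
```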
Most recently I’ve been experimenting with creating an index of the transcripts using those LLM results and the Whoosh library in Python, so that I can do local searches of the transcripts based on topics. I’m building towards something where I can literally tell it “Tell me about Uncle Pete” and it’ll first search for the relevant transcripts and then feed those into the LLM with a prompt to extract the relevant information from them.
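A sketch of the Whoosh indexing step; the schema and field names are my own, and the tag-parsing assumes the LLM produced a bullet list:

```python
# Index transcripts plus their LLM summaries/tags with Whoosh so that
# topic searches stay entirely local.
import json
from pathlib import Path

def parse_tags(raw: str) -> list[str]:
    """Turn the LLM's bullet-point tag list into plain strings."""
    tags = []
    for line in raw.splitlines():
        line = line.strip().lstrip("-*• ").strip()
        if line:
            tags.append(line)
    return tags

def build_index(json_root: str, index_dir: str = "transcript_index"):
    from whoosh import index
    from whoosh.fields import ID, KEYWORD, TEXT, Schema

    schema = Schema(
        path=ID(stored=True, unique=True),
        text=TEXT,
        summary=TEXT(stored=True),
        tags=KEYWORD(stored=True, commas=True, lowercase=True))
    Path(index_dir).mkdir(exist_ok=True)
    ix = index.create_in(index_dir, schema)
    writer = ix.writer()
    for f in sorted(Path(json_root).rglob("*.json")):
        data = json.loads(f.read_text())
        writer.add_document(
            path=str(f),
            text=data.get("text", ""),
            summary=data.get("summary", ""),
            tags=",".join(parse_tags(data.get("tags", ""))))
    writer.commit()
    return ix

def search(ix, query: str, limit: int = 5) -> list[str]:
    from whoosh.qparser import MultifieldParser
    parser = MultifieldParser(["text", "summary", "tags"], schema=ix.schema)
    with ix.searcher() as s:
        return [hit["path"] for hit in s.search(parser.parse(query), limit=limit)]
```

The “Tell me about Uncle Pete” step would then be: `search(ix, "Uncle Pete")`, read each hit’s transcript back from its .json file, and feed those into the LLM with an extraction prompt.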
If you don’t find the idea of writing scripts for that sort of thing literally fun (like me) then you may need to wait a bit for someone more capable and more focused than I am to create a user-friendly application to do all this. In the meantime, though, hoard that data. Storage is cheap.
If you don’t find the idea of writing scripts for that sort of thing literally fun…
I absolutely do. What I see as a potential showstopper for me right now is that I don’t have a discrete GPU, which makes complex LLMs hard to run. Basically, with everything stuck on the CPU, I’m looking at around 2–5 seconds per token; it’s rough. But I like your workflow a lot, and I’m going to try to get something similar going with my incredibly old hardware and see if CPU-only processing of this would be feasible (though I’m not super hopeful there).
And, yes, I, too, am aware of the hallucinations and such that come from the technology. But, honestly, for this non-critical use case, I don’t really care.
I only recently discovered that my installation of Whisper was completely unaware that I had a GPU and was running entirely on my CPU. So even if you can’t get a good LLM running locally, you might still be able to get everything turned into text transcripts for eventual future processing. :)
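A quick sanity check for that, assuming Whisper’s usual PyTorch backend (torch may not even be present in some CPU-only installs):

```python
# Confirm whether PyTorch (and therefore Whisper) can see a CUDA device.
import torch

if torch.cuda.is_available():
    print("GPU found:", torch.cuda.get_device_name(0))
else:
    print("No CUDA device visible: Whisper will run on the CPU")

# You can also pin the device explicitly when loading the model:
# import whisper
# model = whisper.load_model("base", device="cuda")
```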
It sounds like something similar to RAG (retrieval-augmented generation) or a database lookup. Are you storing the transcripts in a SQL database or a NoSQL DB, or doing semantic similarity on any of it?
I was thinking of a similar project and building a knowledge graph for each person.
If you’re interested in “chatting” with your writing, there are a couple of out-of-the-box solutions right now, like Kortex or Reflect Notes. They’re AI-first note-taking apps. I don’t use them out of privacy concerns, but if you don’t care that much, they might let you do what you want. They claim to be E2E encrypted and that the AI can’t phone home, but these are companies that sprang up out of nowhere, so I don’t trust that they’ve necessarily done all their homework to actually provide full privacy.
Alternatively, there’s an Obsidian plugin that I believe lets you do the same thing with local LLMs, which is the privacy-first way to do this. I’ve just moved to Obsidian from Capacities, so I have yet to try it out as I’m still setting up my vault.
Privacy first is my only path. There are a lot of privacyless solutions for this, and they’re all dead to me. The Obsidian route is pretty cool. Personally, I don’t care to chat with it, but I like the auto-tags and auto-summaries.
You said you released it on your writing. How did you go about doing that? It’s a cool use case, and I’m intrigued.
That’s awesome! Thank you!
Nicceeeee! Thank you!
Looked through the plugins and found this: https://github.com/niehu2018/obsidian-ai-tagger-universe