{
  "id": "urn:uuid:5ce0118b-ad98-441d-b041-896a4287b46c",
  "title": "Week 2_2",
  "link": "https://dakpro.github.io/project_feeds/low_power_speech_recognition/week2_2",
  "updated": "2025-07-29T10:37:00",
  "published": "2025-08-10T19:12:43.305291",
7 "summary": "<h2>Week 2 part 2</h2>\n<p>After evaluation of previous models' performances, we decided to try to fit Voxtral - another transformer model.</p>\n<p>The mini version of the model is 3b parameters, weights 8 gb, which took quite a long time to download even for Mac. As kyutai with 1b parameters was way too slow on rPi, I decided that there's no point in trying to run Voxtral on rPi.</p>\n<p>At this point it became obvious that most models are made for powerful devices with GPU. Thus, a decision was made to rather look for a model definitely smaller than 1b params rather than trying out every model we pass by.</p>\n<p>Of course the exact speed of the model depends on the pipeline itself but the constant\nfactor caused by this cannot outweight the fact that kuytai took about 10s to transcribe 1s\nof audio on 4/4 threads.</p>\n<h3>Hugging face</h3>\n<p>Hugging face is an open-source platform for AI models. Similar to github, not only it provides most (if not all) models with their "model cards", but also has leaderboards for the models. This is what I'll be working with next.</p>\n<p><a href=\"https://huggingface.co/spaces/hf-audio/open_asr_leaderboard\">Here</a> one can find\nthe leaderboard of the speech-recognition models. We are interested in two criteria: WER (word error rate) and RTFx (time of the audio being transcribed/transcription time).</p>\n<p>The tiny.en model without quantization has RTFx of 348, base.en has 320.</p>\n<p>Interesting model:</p>\n<p>UsefulSensors/moonshine-tiny - 9.99 / 565.97</p>\n<p>The following seem extremely fast too, but later turned out that they require Nvidia GPU architecture</p>\n<p>nvidia/parakeet-tdt_ctc-110m - 7.49 /\n5345.14</p>\n<p>nvidia/parakeet-tdt-0.6b-v2 - 6.05\n3386.02</p>\n<p>nvidia/canary-180m-flash - 7.12 / 1213.58</p>\n<p>nvidia/parakeet-rnnt-0.6b - 7.5 / 2815.72 (no punctuation/capitalization)</p>\n<p>nvidia/parakeet-ctc-0.6b - 7.69 / 4281.53 (no punctuation/capitalization)</p>",
8 "content": "<h2>Week 2 part 2</h2>\n<p>After evaluation of previous models' performances, we decided to try to fit Voxtral - another transformer model.</p>\n<p>The mini version of the model is 3b parameters, weights 8 gb, which took quite a long time to download even for Mac. As kyutai with 1b parameters was way too slow on rPi, I decided that there's no point in trying to run Voxtral on rPi.</p>\n<p>At this point it became obvious that most models are made for powerful devices with GPU. Thus, a decision was made to rather look for a model definitely smaller than 1b params rather than trying out every model we pass by.</p>\n<p>Of course the exact speed of the model depends on the pipeline itself but the constant\nfactor caused by this cannot outweight the fact that kuytai took about 10s to transcribe 1s\nof audio on 4/4 threads.</p>\n<h3>Hugging face</h3>\n<p>Hugging face is an open-source platform for AI models. Similar to github, not only it provides most (if not all) models with their "model cards", but also has leaderboards for the models. This is what I'll be working with next.</p>\n<p><a href=\"https://huggingface.co/spaces/hf-audio/open_asr_leaderboard\">Here</a> one can find\nthe leaderboard of the speech-recognition models. We are interested in two criteria: WER (word error rate) and RTFx (time of the audio being transcribed/transcription time).</p>\n<p>The tiny.en model without quantization has RTFx of 348, base.en has 320.</p>\n<p>Interesting model:</p>\n<p>UsefulSensors/moonshine-tiny - 9.99 / 565.97</p>\n<p>The following seem extremely fast too, but later turned out that they require Nvidia GPU architecture</p>\n<p>nvidia/parakeet-tdt_ctc-110m - 7.49 /\n5345.14</p>\n<p>nvidia/parakeet-tdt-0.6b-v2 - 6.05\n3386.02</p>\n<p>nvidia/canary-180m-flash - 7.12 / 1213.58</p>\n<p>nvidia/parakeet-rnnt-0.6b - 7.5 / 2815.72 (no punctuation/capitalization)</p>\n<p>nvidia/parakeet-ctc-0.6b - 7.69 / 4281.53 (no punctuation/capitalization)</p>",
9 "content_type": "html",
10 "categories": [],
11 "source": "https://dakpro.github.io/project_feeds/low_power_speech_recognition/feed.xml"
12}