Thicket data repository for the EEG

Sync feeds: 5 new entries, 6 updated

+12
dkvit/uuid_062a1210-a952-48be-9d8d-f02c5c276682.json
···
{
"id": "urn:uuid:062a1210-a952-48be-9d8d-f02c5c276682",
"title": "Week 3",
"link": "https://dakpro.github.io/project_feeds/low_power_speech_recognition/week3",
"updated": "2025-07-30T10:59:00",
"published": "2025-08-10T19:12:43.303312",
"summary": "<h2>Week 3</h2>\n<p>(Note: this blog will be updated throughout the week)</p>\n<p><code>nvidia/parakeet-tdt_ctc-110m</code> - cannot be run on rPi:\nwhen trying to run the program, it just exits after a while at the point of importing the\n<code>nemo.collections.asr</code>.</p>\n<p>As discovered later on, all nvidia models require nvidia GPU to run. Thus we are left with\n<code>moonhsine</code>.</p>\n<p>Also came across <code>vosk</code> and <code>faster-whisper</code> which are interesting to try.</p>\n<h3>Results and Comparison:</h3>\n<h4>Moonshine tiny</h4>\n<p>And so my fellow Americans ask not what your country can do for you, ask what you can do for your country.</p>\n<h4>Moonshine/base</h4>\n<p>And so my fellow Americans ask not what your country can do for you ask what you can do for your country</p>\n\n\n \n Model\n 11s transcription time \n Word Error Rate \n \n \n whisper.cpp/base \n 21 s \n 10.32 \n \n \n whisper.cpp/base-Q4_K \n 12.6 s \n -- \n \n \n Moonshine/base \n 2.76 s \n 9.99 \n \n \n whisper.cpp/tiny \n 8.3 s \n 12.81 \n \n \n Moonshine/tiny \n 1.48 s \n 12.65 \n \n\n\n<h3>Connecting microphone to rPi</h3>\n<p>Just connect it via USB.\nRun <code>arecord -l</code> to see information about connected audio devices, say card X and device Y.</p>\n<p>To make it a default audio input device (strongly recommended), add this into ~/.asoundrc:</p>\n<pre><code>\npcm.!default{\n type hw\n card X\n}\n\nctl.!default{\n type hw\n card X\n}\n</code></pre>\n\n<p>You can test it with</p>\n<pre><code>\n# record\narecord -D plughw:X,Y -f cd -t wav -d 5 test.wav\n# play\naplay test.wav\n</code></pre>\n\n<h3>Moonshine in streaming mode</h3>\n<p>Simple demo:</p>\n<pre><code>\ngit clone https://github.com/moonshine-ai/moonshine\nuv pip install numba\nuv pip install -r moonshine/demo/moonshine-onnx/requirements.txt\nsudo apt update\nsudo apt upgrade -y\nsudo apt install -y portaudio19-dev\n# run:\npython3 moonshine/demo/moonshine-onnx/live_captions.py\n</code></pre>\n\n\n\n\n<h3>Testing on realisticly long audios</h3>\nDatasets used for the <a href=\"https://huggingface.co/spaces/hf-audio/open_asr_leaderboard\">model leaderboard</a>\n![Models](week3.png)\n\n<p>From the listed above, I chose SPGISpeech, Earnings-22, and AMI for evalutaion of a model, as the model will be mostly used during meetings.</p>\n<p>The raw datasets are can be included</p>",
"content": "<h2>Week 3</h2>\n<p>(Note: this blog will be updated throughout the week)</p>\n<p><code>nvidia/parakeet-tdt_ctc-110m</code> - cannot be run on rPi:\nwhen trying to run the program, it just exits after a while at the point of importing the\n<code>nemo.collections.asr</code>.</p>\n<p>As discovered later on, all nvidia models require nvidia GPU to run. Thus we are left with\n<code>moonhsine</code>.</p>\n<p>Also came across <code>vosk</code> and <code>faster-whisper</code> which are interesting to try.</p>\n<h3>Results and Comparison:</h3>\n<h4>Moonshine tiny</h4>\n<p>And so my fellow Americans ask not what your country can do for you, ask what you can do for your country.</p>\n<h4>Moonshine/base</h4>\n<p>And so my fellow Americans ask not what your country can do for you ask what you can do for your country</p>\n\n\n \n Model\n 11s transcription time \n Word Error Rate \n \n \n whisper.cpp/base \n 21 s \n 10.32 \n \n \n whisper.cpp/base-Q4_K \n 12.6 s \n -- \n \n \n Moonshine/base \n 2.76 s \n 9.99 \n \n \n whisper.cpp/tiny \n 8.3 s \n 12.81 \n \n \n Moonshine/tiny \n 1.48 s \n 12.65 \n \n\n\n<h3>Connecting microphone to rPi</h3>\n<p>Just connect it via USB.\nRun <code>arecord -l</code> to see information about connected audio devices, say card X and device Y.</p>\n<p>To make it a default audio input device (strongly recommended), add this into ~/.asoundrc:</p>\n<pre><code>\npcm.!default{\n type hw\n card X\n}\n\nctl.!default{\n type hw\n card X\n}\n</code></pre>\n\n<p>You can test it with</p>\n<pre><code>\n# record\narecord -D plughw:X,Y -f cd -t wav -d 5 test.wav\n# play\naplay test.wav\n</code></pre>\n\n<h3>Moonshine in streaming mode</h3>\n<p>Simple demo:</p>\n<pre><code>\ngit clone https://github.com/moonshine-ai/moonshine\nuv pip install numba\nuv pip install -r moonshine/demo/moonshine-onnx/requirements.txt\nsudo apt update\nsudo apt upgrade -y\nsudo apt install -y portaudio19-dev\n# run:\npython3 moonshine/demo/moonshine-onnx/live_captions.py\n</code></pre>\n\n\n\n\n<h3>Testing on realisticly long audios</h3>\nDatasets used for the <a href=\"https://huggingface.co/spaces/hf-audio/open_asr_leaderboard\">model leaderboard</a>\n![Models](week3.png)\n\n<p>From the listed above, I chose SPGISpeech, Earnings-22, and AMI for evalutaion of a model, as the model will be mostly used during meetings.</p>\n<p>The raw datasets are can be included</p>",
"content_type": "html",
"categories": [],
"source": "https://dakpro.github.io/project_feeds/low_power_speech_recognition/feed.xml"
}
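The Week 3 entry above compares Moonshine and whisper.cpp by word error rate (WER). WER is the word-level edit distance between a reference transcript and the model output, divided by the number of reference words; the sketch below computes exactly that in plain Python. The sample strings are illustrative only, and the leaderboard additionally normalizes text before scoring.

```python
# Minimal word-error-rate (WER) sketch: Levenshtein distance over words,
# divided by the number of reference words.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # substitution / match
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

# Illustrative strings only (based on the JFK sample used in the blog):
ref = "ask not what your country can do for you ask what you can do for your country"
hyp = "ask not what your country can do for you ask what you could do for your country"
print(f"WER: {wer(ref.lower(), hyp.lower()):.2%}")
```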
+12
dkvit/uuid_1225c695-cfb8-4ebb-aaaa-80da344efa6a.json
···
{
"id": "urn:uuid:1225c695-cfb8-4ebb-aaaa-80da344efa6a",
"title": "Week 1",
"link": "https://dakpro.github.io/project_feeds/low_power_speech_recognition/week1",
"updated": "2025-07-18T14:39:00",
"published": "2025-08-10T19:12:43.309993",
"summary": "<h2>Whisper</h2>\n<p>Went through the paper on Whisper - speech recognition model from OpenAI.</p>\n<p>It's open source and available on GitHub.</p>\n<p>Many models are available to choose from:</p>\n<p><img alt=\"Models\" src=\"week1.1.png\">\nChoice of model:</p>\n<ol>\n<li>By taking into account other processes running on the device -- better for deployment</li>\n<li>Customizable by user?</li>\n</ol>\n\n<p><i>There can be some custom vocabulary/promting added to the model -- interesting what it can be achieved with it.</i></p>\n<p>Training dataset is 2/3 english and 1/3 uneven mix, but model's &quot;knowledge&quot; is transferable across the languages (for instance slavic languages parts enhance each other).</p>\n<p>Installed both whisper and whisper.cpp on Mac</p>\n<p>Ran transcription with whisper</p>\n<p>Ran transcription with whisper.cpp</p>\n<p><code>sox -d &lt;filename&gt;</code>\nnice tool to record audio\n-d stands for default input device</p>\n<h2>rPi</h2>\n<p>Tried to set up the rPI. The system didn't boot. Turns out it's the problem with the rPi itself - it didn't read from the SD card (indication of no reading: no green LED blinking, only red).</p>\n<p>Got new board - gives green light</p>\n<h2>new rPi</h2>\n<p>Booting rPi with 64-bit standart (not headless) OS.\n<i>for production and further testing - headless (Lite) version should be tested as it's smaller and faster than the standart OS.</i></p>\n<h3>Connecting Mac to the rPi ssh via ethernet via switch</h3>\n<p>! don't forget about setting host when writing OS to the SD-card</p>\n<p><i>just figured out you can update bootloader with the same sd - just different stuff needs to be loaded on it. Could I fix the &quot;broken&quot; rPi by updating the boot? (to be done)</i></p>\n<ol>\n<li>connect both rPi and Mac to an ethernet switch (NetGear GS108 in my case)</li>\n\n<p><i>Had problem with detecting connection from rPi to the switch.</i></p>\n<li>When using ethernet on Mac, one should add the ethernet as service. 
(Done in *Settings/Network*)</li>\n\n<li>To make the connection work, one should make static IP addresses on the connection for both Mac and rPi</li>\n</ol>\n\n<p>For Mac:</p>\n<ol>\n<li>goto Settings/Network/Apple Adapter(or how else you named the service) -&gt; Details -&gt; TCP/IP tab</li>\n<li>change configure ipv4 to manual</li>\n<li>Input the static address (I chose 192.168.5.1)</li>\n<li>Subnet mask is left 255.255.0.0, other empty fields are left empty</li>\n</ol>\n\n<p>For standart rPi setup:</p>\n<ol>\n<li>Click on the double-arrow network symbol in the top right corner</li>\n<li>Advanced Options/Edit Connections/Wired Connection X/IPv4 Settings/</li>\n<i>note: previously set Link negotiation on Wired Connection X/Ethernet to Automatic - what has it fixed??</i>\n<i>also set cloned MAC address to Permanent - not sure I completely understand what it does</i>\n<li>Set *Method* to *Manual*</li>\n<li>*Add*</li>\n<li>Set parameters (192.168.5.2, 24, 192.168.5.1 for me (not sure what 24 does))</li>\n<li>Save</li>\n<li>Reboot the rPi</li>\n</ol>\n\n<p>For headless rPi setup:<strong>TODO</strong></p>\n<p>Finally, we got the working rPi-Mac connection</p>\n<p>To verify: turn off wifi and try\n<code>ping raspberrypi.local</code>\nOr even try to login (on my rPi I made user = &quot;user&quot;):\n<code>ssh <a href=\"mailto:&#x75;&#x73;&#x65;&#x72;&#64;&#114;&#x61;&#x73;&#x70;&#x62;&#x65;&#x72;&#114;&#121;&#x70;&#x69;&#x2e;&#x6c;&#111;&#99;&#97;&#108;\">&#x75;&#x73;&#x65;&#x72;&#64;&#114;&#x61;&#x73;&#x70;&#x62;&#x65;&#x72;&#114;&#121;&#x70;&#x69;&#x2e;&#x6c;&#111;&#99;&#97;&#108;</a></code>\nAlso ensure in .ssh/known_hosts there's no entry for raspberrypi.local, as there exists a with such URL, thus when you try to connect to ssh for the first time the website is accessed.</p>\n<h3>Connecting rPi to eduroam via wlan</h3>\n<p>needs to be done via loading configuration as /etc/wpa_supplicant/wpa_supplicant.conf:</p>\n<pre><code>\nnetwork={\n ssid=\"eduroam\"\n key_mgmt=WPA-EAP\n eap=PEAP\n identity=\"\"\n password=\"\"\n phase1=\"peaplabel=0\"\n phase2=\"auth=MSCHAPV2\"\n ca_cert=\"\"\n priority=1\n}\n</code></pre>\n\n<p>restarting the service:</p>\n<pre><code>\nsudo killall wpa_supplicant\nsudo wpa_supplicant -B -i wlan0 -c /etc/wpa_supplicant/wpa_supplicant.conf\nsudo dhclient wlan0\n</code></pre>\n\n<p>check by</p>\n<pre><code>\niwgetid\nping 1.1.1.1\n</code></pre>\n\n<h3>Ran whisper.cpp on rPi</h3>\n<p>Took ~18s to transcribe 11s audio.\nLite OS optimization wouldn't be that effective + other processes are to be run in the background.</p>\n<p>Before thinking on optimization decided to run kyutai, as if kyutai is 5 times faster, optimization efforts are wasted.</p>\n<h2>Kyutai</h2>\n<p>Alternative model: kyutai</p>\n<ul>\n<li>Smaller, better performance than whisper</li>\n<li>Inputs stream instead of recording, thus much better for live transcription</li>\n<li>Only English and French</li>\n</ul>\n<p>Trying to run kyutai model on rPi</p>\n<ol>\n<li>Clone repo from git</li>\n<li>Install rust</li>\n<li>cd stt-rs</li>\n<li>sudo apt install libssl-dev</li>\n<li>export PKG_CONFIG_PATH=/usr/lib/aarch64-linux-gnu/pkgconfig</li>\n<li>cargo run -r ../audio/bria.mp3</li>\n</ol>\n<i>takes a long to build - haven't tried with <code>uv</code> though</i>\n\n<p><i>github guide also includes &quot;--features cuda&quot; in the last stage, but as there's no gpu on rPi, it's been removed</i></p>\n<p>Problem: kyutai is too big and thus cannot fit into 3.3 RAM -&gt; the process gets killed</p>\n<p>sudo install 
python-msgpack</p>",
"content": "<h2>Whisper</h2>\n<p>Went through the paper on Whisper - speech recognition model from OpenAI.</p>\n<p>It's open source and available on GitHub.</p>\n<p>Many models are available to choose from:</p>\n<p><img alt=\"Models\" src=\"week1.1.png\">\nChoice of model:</p>\n<ol>\n<li>By taking into account other processes running on the device -- better for deployment</li>\n<li>Customizable by user?</li>\n</ol>\n\n<p><i>There can be some custom vocabulary/promting added to the model -- interesting what it can be achieved with it.</i></p>\n<p>Training dataset is 2/3 english and 1/3 uneven mix, but model's &quot;knowledge&quot; is transferable across the languages (for instance slavic languages parts enhance each other).</p>\n<p>Installed both whisper and whisper.cpp on Mac</p>\n<p>Ran transcription with whisper</p>\n<p>Ran transcription with whisper.cpp</p>\n<p><code>sox -d &lt;filename&gt;</code>\nnice tool to record audio\n-d stands for default input device</p>\n<h2>rPi</h2>\n<p>Tried to set up the rPI. The system didn't boot. Turns out it's the problem with the rPi itself - it didn't read from the SD card (indication of no reading: no green LED blinking, only red).</p>\n<p>Got new board - gives green light</p>\n<h2>new rPi</h2>\n<p>Booting rPi with 64-bit standart (not headless) OS.\n<i>for production and further testing - headless (Lite) version should be tested as it's smaller and faster than the standart OS.</i></p>\n<h3>Connecting Mac to the rPi ssh via ethernet via switch</h3>\n<p>! don't forget about setting host when writing OS to the SD-card</p>\n<p><i>just figured out you can update bootloader with the same sd - just different stuff needs to be loaded on it. Could I fix the &quot;broken&quot; rPi by updating the boot? (to be done)</i></p>\n<ol>\n<li>connect both rPi and Mac to an ethernet switch (NetGear GS108 in my case)</li>\n\n<p><i>Had problem with detecting connection from rPi to the switch.</i></p>\n<li>When using ethernet on Mac, one should add the ethernet as service. 
(Done in *Settings/Network*)</li>\n\n<li>To make the connection work, one should make static IP addresses on the connection for both Mac and rPi</li>\n</ol>\n\n<p>For Mac:</p>\n<ol>\n<li>goto Settings/Network/Apple Adapter(or how else you named the service) -&gt; Details -&gt; TCP/IP tab</li>\n<li>change configure ipv4 to manual</li>\n<li>Input the static address (I chose 192.168.5.1)</li>\n<li>Subnet mask is left 255.255.0.0, other empty fields are left empty</li>\n</ol>\n\n<p>For standart rPi setup:</p>\n<ol>\n<li>Click on the double-arrow network symbol in the top right corner</li>\n<li>Advanced Options/Edit Connections/Wired Connection X/IPv4 Settings/</li>\n<i>note: previously set Link negotiation on Wired Connection X/Ethernet to Automatic - what has it fixed??</i>\n<i>also set cloned MAC address to Permanent - not sure I completely understand what it does</i>\n<li>Set *Method* to *Manual*</li>\n<li>*Add*</li>\n<li>Set parameters (192.168.5.2, 24, 192.168.5.1 for me (not sure what 24 does))</li>\n<li>Save</li>\n<li>Reboot the rPi</li>\n</ol>\n\n<p>For headless rPi setup:<strong>TODO</strong></p>\n<p>Finally, we got the working rPi-Mac connection</p>\n<p>To verify: turn off wifi and try\n<code>ping raspberrypi.local</code>\nOr even try to login (on my rPi I made user = &quot;user&quot;):\n<code>ssh <a href=\"mailto:&#x75;&#x73;&#x65;&#x72;&#64;&#114;&#x61;&#x73;&#x70;&#x62;&#x65;&#x72;&#114;&#121;&#x70;&#x69;&#x2e;&#x6c;&#111;&#99;&#97;&#108;\">&#x75;&#x73;&#x65;&#x72;&#64;&#114;&#x61;&#x73;&#x70;&#x62;&#x65;&#x72;&#114;&#121;&#x70;&#x69;&#x2e;&#x6c;&#111;&#99;&#97;&#108;</a></code>\nAlso ensure in .ssh/known_hosts there's no entry for raspberrypi.local, as there exists a with such URL, thus when you try to connect to ssh for the first time the website is accessed.</p>\n<h3>Connecting rPi to eduroam via wlan</h3>\n<p>needs to be done via loading configuration as /etc/wpa_supplicant/wpa_supplicant.conf:</p>\n<pre><code>\nnetwork={\n ssid=\"eduroam\"\n key_mgmt=WPA-EAP\n eap=PEAP\n identity=\"\"\n password=\"\"\n phase1=\"peaplabel=0\"\n phase2=\"auth=MSCHAPV2\"\n ca_cert=\"\"\n priority=1\n}\n</code></pre>\n\n<p>restarting the service:</p>\n<pre><code>\nsudo killall wpa_supplicant\nsudo wpa_supplicant -B -i wlan0 -c /etc/wpa_supplicant/wpa_supplicant.conf\nsudo dhclient wlan0\n</code></pre>\n\n<p>check by</p>\n<pre><code>\niwgetid\nping 1.1.1.1\n</code></pre>\n\n<h3>Ran whisper.cpp on rPi</h3>\n<p>Took ~18s to transcribe 11s audio.\nLite OS optimization wouldn't be that effective + other processes are to be run in the background.</p>\n<p>Before thinking on optimization decided to run kyutai, as if kyutai is 5 times faster, optimization efforts are wasted.</p>\n<h2>Kyutai</h2>\n<p>Alternative model: kyutai</p>\n<ul>\n<li>Smaller, better performance than whisper</li>\n<li>Inputs stream instead of recording, thus much better for live transcription</li>\n<li>Only English and French</li>\n</ul>\n<p>Trying to run kyutai model on rPi</p>\n<ol>\n<li>Clone repo from git</li>\n<li>Install rust</li>\n<li>cd stt-rs</li>\n<li>sudo apt install libssl-dev</li>\n<li>export PKG_CONFIG_PATH=/usr/lib/aarch64-linux-gnu/pkgconfig</li>\n<li>cargo run -r ../audio/bria.mp3</li>\n</ol>\n<i>takes a long to build - haven't tried with <code>uv</code> though</i>\n\n<p><i>github guide also includes &quot;--features cuda&quot; in the last stage, but as there's no gpu on rPi, it's been removed</i></p>\n<p>Problem: kyutai is too big and thus cannot fit into 3.3 RAM -&gt; the process gets killed</p>\n<p>sudo install 
python-msgpack</p>",
"content_type": "html",
"categories": [],
"source": "https://dakpro.github.io/project_feeds/low_power_speech_recognition/feed.xml"
}
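The Week 1 entry above times whisper.cpp by hand (~18 s for the 11 s JFK clip). Below is a minimal sketch of taking the same measurement from Python; it assumes whisper.cpp was built with CMake as in the entry and reuses the binary, model, and sample paths from the eval.sh script in the Week 2 entry, so adjust them to your own layout.

```python
# Time a whisper.cpp transcription of the 11 s JFK sample and report the
# ratio to the audio duration. Paths are assumptions matching the blog's eval.sh.
import subprocess
import time

WHISPER_CLI = "./build/bin/whisper-cli"  # built with: cmake -B build && cmake --build build -j
MODEL = "models/base.bin"                # ggml "base" model, renamed as in eval.sh
SAMPLE = "samples/jfk.wav"               # 11 s sample shipped with whisper.cpp
AUDIO_SECONDS = 11.0

start = time.monotonic()
result = subprocess.run([WHISPER_CLI, "-m", MODEL, "-f", SAMPLE],
                        capture_output=True, text=True, check=True)
elapsed = time.monotonic() - start

transcript = result.stdout.strip()
if transcript:
    print(transcript)
print(f"wall time: {elapsed:.1f} s ({elapsed / AUDIO_SECONDS:.2f}x the audio duration)")
```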
+12
dkvit/uuid_26a31438-e93b-469e-97df-f5543150a1f6.json
···
{
"id": "urn:uuid:26a31438-e93b-469e-97df-f5543150a1f6",
"title": "Week 2_1",
"link": "https://dakpro.github.io/project_feeds/low_power_speech_recognition/week2_1",
"updated": "2025-07-23T13:50:00",
"published": "2025-08-10T19:12:43.306219",
"summary": "<h2>Week 2 part 1</h2>\n<p>From last week, problem of memory shortage exists: track of memory usage shows that the process tries to use more and more memory, resulting in a crash and thus the process being killed by the OS.</p>\n<p>Solution 1:\nUsing microSD partially as RAM:</p>\n<pre><code>\n# Enabling usage of 8GB for swapping\nsudo fallocate -l 8G /swapfile\nsudo chmod 600 /swapfile\nsudo mkswap /swapfile\nsudo swapon /swapfile\n\n# Making it permanent\necho '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab\n\n...\n\n# Disabling swapping\nsudo swapoff /swapfile\n\n# Permanent disabling\nsudo rm /swapfile\n... (remove line from fstab)\nsudo reboot\n</code></pre>\n\n<p>This showed that the model needs only 1.6GB more memory. As microSD memory is too slow, the model running took enormous time to complete and thus was terminated.</p>\n<p>One could 1) use ssd instead - too costly and crosses idea of small-power; 2) use rPi with bigger RAM (currenty 4 gb).</p>\n<h2>Returning back to whisper.cpp</h2>\n<h3>Evaluation of the three models</h3>\n<p>Decided to do evaluation of speed of transcription using different models.</p>\n<p>Here is time and memory usage for transcribing an 11s JFK speech using 4/4 threads and standart OS:</p>\n\n\n \n Model\n Time\n Memory\n \n \n tiny\n 8.3 s\n 77 MB\n \n \n tiny.en\n 8.5 s\n 77 MB\n \n \n base\n 18 s\n 147 MB\n \n \n base.en\n 21 s\n 256MB\n \n \n small\n 64 s\n 487 MB\n \n \n small.en\n 65 s\n 487 MB\n \n\n\n<p>The performance test was performed once and only on one recording.</p>\n<p>Optimization of loading time and other inter-sample could be considered for real-time transcription.</p>\n<p><i>Same evaluation on rPi 5 (possibly with 8gb RAM) could be reasonable due to CPU difference, but despite being 2x times faster, it requires fan/active cooling.</i></p>\n<p>After iterational refinement, the following script is used as <code>~/eval.sh</code> for evaluation:</p>\n<pre><code>\n#!/bin/bash\n\nmodels=()\nwhile [ $# -gt 0 ]; do\n models+=( \"$1\" )\n shift\ndone\n\necho \"models: ${models[@]}\"\ntouch report.log\necho \"Report on model evaluation. The duration of sample recording is 11s (JFK speech)\" &gt; report.log\ncd whisper.cpp\necho -n \"Building whisper-cli... \"\ncmake -B build &gt; /dev/null\ncmake --build build -j --config Release &gt; /dev/null\necho \"whisper-cli build\"\nbase_models=(\"tiny\" \"tiny.en\" \"base\" \"base.en\" \"small\" \"small.en\" \"medium\" \"medium.en\")\necho \"-----------------------------\"\necho \"-----------------------------\" &gt;&gt; ../report.log\n\nis_base_model(){\n for bm in \"${base_models[@]}\"; do\n if [[ \"$1\" =~ ^\"${bm}\"$ ]]; then\n echo \"$1 IS base model\"\n return 0\n fi\n done\n echo \"$1 is not a base model\"\n return 1\n}\n\n\nfor model in \"${models[@]}\"; do\n echo \"Model $model\" &gt;&gt; ../report.log\n if is_base_model $model; then\n echo \"Starting model $model evaluation\"\n if [ ! -f models/$model.bin ]; then\n echo -n \"Model not found... Downloading $model... \"\n sh ./models/download-ggml-model.sh $model &gt; /dev/null\n mv models/ggml-$model.bin models/$model.bin\n echo \"Downloaded\"\n fi\n path=\"models/$model.bin\"\n else\n echo -n \"Looking for quantized model $model... \"\n if [ ! -f quantized_models/$model.bin ]; then\n echo \"Quantized model not found. Skipping...\"\n continue\n fi\n path=\"quantized_models/$model.bin\"\n echo \"Quantized model found\"\n fi\n echo -n \"Runtime: \" &gt;&gt; ../report.log\n echo -n \"Running $model... 
\"\n ./build/bin/whisper-cli -m $path -f samples/jfk.wav &gt; tmp.out 2&gt;&amp;1\n\n # for debugging\n # cat tmp.out\n\n grep -i -E \"total memory|total time\" tmp.out &gt;&gt; ../report.log\n echo \"run\"\n echo \"----------------------------------\" &gt;&gt; ../report.log\n echo \"----------------------------------\"\ndone\n</code></pre>\n\n<h3>Quantization of whisper</h3>\n<p>Unlike kyutai, whisper supports built-in quantization.</p>\n<p>Notes on choosing quantizations:</p>\n<p>Qx_y - x bits per weight, y - legacy flag, deprecated in favour of Qx_K</p>\n<p>Qx_K - K-quants, better than standard, have mixed bit-widths</p>\n<p>TQx - ternary quantization (ters instead of bits), extreme compression and quality drops too much</p>\n<p>IQx_s - importance-aware quantization, much better quality for the same bit rates. s - size (S/M/L)</p>\n<p>Based on this, will try with IQ4_M first.</p>\n<p>After iterational refinement, this script was used as <code>~/qt.sh</code> for quantization:</p>\n<pre><code>\n\n#!/bin/bash\n\necho \"args: $@\"\n\ncd whisper.cpp\nif [ $# -eq 0 ]; then\n echo \"Error: quantization method is not provided.\"\n echo \"Usage: $0 ... [-r ] \"\n exit 1\nfi\nqms=()\nmodel=\"base\"\nwhile [ $# -gt 0 ]; do\n echo \"curr arg: $1\"\n if [[ \"$1\" == \"-m\" ]]; then\n echo \"equals to -m\"\n shift\n model=\"$1\"\n break\n fi\n qms+=(\"$1\")\n shift\ndone\necho \"qms: ${sqm[@]}\"\n\nif [ ! -d \"quantized_models\" ]; then\n mkdir quantized_models\nfi\nfor qm in \"${qms[@]}\"; do\n ./build/bin/quantize models/$model.bin quantized_models/$model-$qm.bin $qm\ndone\n\n</code></pre>\n\n\n\n<p>After spending some time figuring why the model doesn't want to be quantized to IQ4_M, it turns out that models possible for quantization are listed in lines 50-80 of file common-ggml.cpp.</p>\n<p>After small experimenting with <code>base</code> model:</p>\n<p>q5_0 - improvement from 18.1 to 14.3 (encoding time: 14.5 to 11.4 )</p>\n<p>q2_k - model starts outputing &quot;you you you&quot; -&gt; not enough quality</p>\n<p>q5_k - improvement from 18.1 to 13.2 (encoding time: 14.7 to 10.6)</p>\n<p>Further evaluations:</p>\n<h3>Model Evaluation on 11s sample</h3>\n Model Evaluation Report (11s JFK Speech Sample)\n\n\n \n \n Model\n Runtime (s)\n \n \n \n \n Small Models\n \n \n small-q2_k\n 38.4\n \n \n small-q3_k\n 46.2\n \n \n small-q4_0\n 39.8\n \n \n small-q4_1\n 39.1\n \n \n small-q4_k\n 37.3\n \n \n small-q5_0\n 47\n \n \n small-q5_1\n 49.7\n \n \n small-q5_k\n 44.7\n \n \n small-q6_k\n 46.6\n \n \n small-q8_0\n 40.5\n \n \n small\n 76.3\n \n \n Base Models\n \n \n base-q2_k\n 75.9\n \n \n base-q3_k\n 13.7\n \n \n base-q4_0\n 12.6\n \n \n base-q4_1\n 12.3\n \n \n base-q4_k\n 11.9\n \n \n base-q5_0\n 14.4\n \n \n base-q5_1\n 14.4\n \n \n base-q5_k\n 13.3\n \n \n base-q6_k\n 13.6\n \n \n base-q8_0\n 12.8\n \n \n base\n 18.2\n \n \n \n\n<p>Issue: q2_k should be smaller and faster, while it's not. Small-q2_k doesn't get stuck and actually produces the correct transcription, so performance decrease is somewhere else.</p>\n<p>Turns out q2_k/q3_k are optimized for AVX2/AVX512 (Single Instruction, Multiple Data commands extensions) in x86 architecture. For rPi running on ARM CPU, those are absent and quantization overhead becomes cosmic, thus slowing down in performance. 
Model getting stuck on &quot;you you you&quot; is likely result of poor resulting precision of the model.</p>\n<p>In theory, base-q4_k run on a headless setup should be sufficient at least for with additional bit of time for transcription (for instance, additional 5-10 mins after an hour-long meeting). But if we want to achieve real-time\ntranscription, one should seek for alternatives.</p>",
"content": "<h2>Week 2 part 1</h2>\n<p>From last week, problem of memory shortage exists: track of memory usage shows that the process tries to use more and more memory, resulting in a crash and thus the process being killed by the OS.</p>\n<p>Solution 1:\nUsing microSD partially as RAM:</p>\n<pre><code>\n# Enabling usage of 8GB for swapping\nsudo fallocate -l 8G /swapfile\nsudo chmod 600 /swapfile\nsudo mkswap /swapfile\nsudo swapon /swapfile\n\n# Making it permanent\necho '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab\n\n...\n\n# Disabling swapping\nsudo swapoff /swapfile\n\n# Permanent disabling\nsudo rm /swapfile\n... (remove line from fstab)\nsudo reboot\n</code></pre>\n\n<p>This showed that the model needs only 1.6GB more memory. As microSD memory is too slow, the model running took enormous time to complete and thus was terminated.</p>\n<p>One could 1) use ssd instead - too costly and crosses idea of small-power; 2) use rPi with bigger RAM (currenty 4 gb).</p>\n<h2>Returning back to whisper.cpp</h2>\n<h3>Evaluation of the three models</h3>\n<p>Decided to do evaluation of speed of transcription using different models.</p>\n<p>Here is time and memory usage for transcribing an 11s JFK speech using 4/4 threads and standart OS:</p>\n\n\n \n Model\n Time\n Memory\n \n \n tiny\n 8.3 s\n 77 MB\n \n \n tiny.en\n 8.5 s\n 77 MB\n \n \n base\n 18 s\n 147 MB\n \n \n base.en\n 21 s\n 256MB\n \n \n small\n 64 s\n 487 MB\n \n \n small.en\n 65 s\n 487 MB\n \n\n\n<p>The performance test was performed once and only on one recording.</p>\n<p>Optimization of loading time and other inter-sample could be considered for real-time transcription.</p>\n<p><i>Same evaluation on rPi 5 (possibly with 8gb RAM) could be reasonable due to CPU difference, but despite being 2x times faster, it requires fan/active cooling.</i></p>\n<p>After iterational refinement, the following script is used as <code>~/eval.sh</code> for evaluation:</p>\n<pre><code>\n#!/bin/bash\n\nmodels=()\nwhile [ $# -gt 0 ]; do\n models+=( \"$1\" )\n shift\ndone\n\necho \"models: ${models[@]}\"\ntouch report.log\necho \"Report on model evaluation. The duration of sample recording is 11s (JFK speech)\" &gt; report.log\ncd whisper.cpp\necho -n \"Building whisper-cli... \"\ncmake -B build &gt; /dev/null\ncmake --build build -j --config Release &gt; /dev/null\necho \"whisper-cli build\"\nbase_models=(\"tiny\" \"tiny.en\" \"base\" \"base.en\" \"small\" \"small.en\" \"medium\" \"medium.en\")\necho \"-----------------------------\"\necho \"-----------------------------\" &gt;&gt; ../report.log\n\nis_base_model(){\n for bm in \"${base_models[@]}\"; do\n if [[ \"$1\" =~ ^\"${bm}\"$ ]]; then\n echo \"$1 IS base model\"\n return 0\n fi\n done\n echo \"$1 is not a base model\"\n return 1\n}\n\n\nfor model in \"${models[@]}\"; do\n echo \"Model $model\" &gt;&gt; ../report.log\n if is_base_model $model; then\n echo \"Starting model $model evaluation\"\n if [ ! -f models/$model.bin ]; then\n echo -n \"Model not found... Downloading $model... \"\n sh ./models/download-ggml-model.sh $model &gt; /dev/null\n mv models/ggml-$model.bin models/$model.bin\n echo \"Downloaded\"\n fi\n path=\"models/$model.bin\"\n else\n echo -n \"Looking for quantized model $model... \"\n if [ ! -f quantized_models/$model.bin ]; then\n echo \"Quantized model not found. Skipping...\"\n continue\n fi\n path=\"quantized_models/$model.bin\"\n echo \"Quantized model found\"\n fi\n echo -n \"Runtime: \" &gt;&gt; ../report.log\n echo -n \"Running $model... 
\"\n ./build/bin/whisper-cli -m $path -f samples/jfk.wav &gt; tmp.out 2&gt;&amp;1\n\n # for debugging\n # cat tmp.out\n\n grep -i -E \"total memory|total time\" tmp.out &gt;&gt; ../report.log\n echo \"run\"\n echo \"----------------------------------\" &gt;&gt; ../report.log\n echo \"----------------------------------\"\ndone\n</code></pre>\n\n<h3>Quantization of whisper</h3>\n<p>Unlike kyutai, whisper supports built-in quantization.</p>\n<p>Notes on choosing quantizations:</p>\n<p>Qx_y - x bits per weight, y - legacy flag, deprecated in favour of Qx_K</p>\n<p>Qx_K - K-quants, better than standard, have mixed bit-widths</p>\n<p>TQx - ternary quantization (ters instead of bits), extreme compression and quality drops too much</p>\n<p>IQx_s - importance-aware quantization, much better quality for the same bit rates. s - size (S/M/L)</p>\n<p>Based on this, will try with IQ4_M first.</p>\n<p>After iterational refinement, this script was used as <code>~/qt.sh</code> for quantization:</p>\n<pre><code>\n\n#!/bin/bash\n\necho \"args: $@\"\n\ncd whisper.cpp\nif [ $# -eq 0 ]; then\n echo \"Error: quantization method is not provided.\"\n echo \"Usage: $0 ... [-r ] \"\n exit 1\nfi\nqms=()\nmodel=\"base\"\nwhile [ $# -gt 0 ]; do\n echo \"curr arg: $1\"\n if [[ \"$1\" == \"-m\" ]]; then\n echo \"equals to -m\"\n shift\n model=\"$1\"\n break\n fi\n qms+=(\"$1\")\n shift\ndone\necho \"qms: ${sqm[@]}\"\n\nif [ ! -d \"quantized_models\" ]; then\n mkdir quantized_models\nfi\nfor qm in \"${qms[@]}\"; do\n ./build/bin/quantize models/$model.bin quantized_models/$model-$qm.bin $qm\ndone\n\n</code></pre>\n\n\n\n<p>After spending some time figuring why the model doesn't want to be quantized to IQ4_M, it turns out that models possible for quantization are listed in lines 50-80 of file common-ggml.cpp.</p>\n<p>After small experimenting with <code>base</code> model:</p>\n<p>q5_0 - improvement from 18.1 to 14.3 (encoding time: 14.5 to 11.4 )</p>\n<p>q2_k - model starts outputing &quot;you you you&quot; -&gt; not enough quality</p>\n<p>q5_k - improvement from 18.1 to 13.2 (encoding time: 14.7 to 10.6)</p>\n<p>Further evaluations:</p>\n<h3>Model Evaluation on 11s sample</h3>\n Model Evaluation Report (11s JFK Speech Sample)\n\n\n \n \n Model\n Runtime (s)\n \n \n \n \n Small Models\n \n \n small-q2_k\n 38.4\n \n \n small-q3_k\n 46.2\n \n \n small-q4_0\n 39.8\n \n \n small-q4_1\n 39.1\n \n \n small-q4_k\n 37.3\n \n \n small-q5_0\n 47\n \n \n small-q5_1\n 49.7\n \n \n small-q5_k\n 44.7\n \n \n small-q6_k\n 46.6\n \n \n small-q8_0\n 40.5\n \n \n small\n 76.3\n \n \n Base Models\n \n \n base-q2_k\n 75.9\n \n \n base-q3_k\n 13.7\n \n \n base-q4_0\n 12.6\n \n \n base-q4_1\n 12.3\n \n \n base-q4_k\n 11.9\n \n \n base-q5_0\n 14.4\n \n \n base-q5_1\n 14.4\n \n \n base-q5_k\n 13.3\n \n \n base-q6_k\n 13.6\n \n \n base-q8_0\n 12.8\n \n \n base\n 18.2\n \n \n \n\n<p>Issue: q2_k should be smaller and faster, while it's not. Small-q2_k doesn't get stuck and actually produces the correct transcription, so performance decrease is somewhere else.</p>\n<p>Turns out q2_k/q3_k are optimized for AVX2/AVX512 (Single Instruction, Multiple Data commands extensions) in x86 architecture. For rPi running on ARM CPU, those are absent and quantization overhead becomes cosmic, thus slowing down in performance. 
Model getting stuck on &quot;you you you&quot; is likely result of poor resulting precision of the model.</p>\n<p>In theory, base-q4_k run on a headless setup should be sufficient at least for with additional bit of time for transcription (for instance, additional 5-10 mins after an hour-long meeting). But if we want to achieve real-time\ntranscription, one should seek for alternatives.</p>",
"content_type": "html",
"categories": [],
"source": "https://dakpro.github.io/project_feeds/low_power_speech_recognition/feed.xml"
}
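The Week 2 entry above reports per-model runtimes on the 11 s sample. The short sketch below restates a few of those reported numbers as real-time factors (runtime divided by audio duration, so anything below 1.0x would keep up with live audio); the runtimes are copied from the entry's evaluation table.

```python
# Convert the reported whisper.cpp runtimes (11 s JFK sample on the rPi) into
# real-time factors. Values are taken from the Week 2 evaluation table.
AUDIO_SECONDS = 11.0
runtimes_s = {
    "base": 18.2, "base-q4_k": 11.9, "base-q5_k": 13.3, "base-q8_0": 12.8,
    "small": 76.3, "small-q4_k": 37.3, "small-q2_k": 38.4,
}

for model, runtime in sorted(runtimes_s.items(), key=lambda kv: kv[1]):
    rtf = runtime / AUDIO_SECONDS
    print(f"{model:12s} {runtime:6.1f} s  {rtf:5.2f}x real time")
```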
+12
dkvit/uuid_5ce0118b-ad98-441d-b041-896a4287b46c.json
···
{
"id": "urn:uuid:5ce0118b-ad98-441d-b041-896a4287b46c",
"title": "Week 2_2",
"link": "https://dakpro.github.io/project_feeds/low_power_speech_recognition/week2_2",
"updated": "2025-07-29T10:37:00",
"published": "2025-08-10T19:12:43.305291",
"summary": "<h2>Week 2 part 2</h2>\n<p>After evaluation of previous models' performances, we decided to try to fit Voxtral - another transformer model.</p>\n<p>The mini version of the model is 3b parameters, weights 8 gb, which took quite a long time to download even for Mac. As kyutai with 1b parameters was way too slow on rPi, I decided that there's no point in trying to run Voxtral on rPi.</p>\n<p>At this point it became obvious that most models are made for powerful devices with GPU. Thus, a decision was made to rather look for a model definitely smaller than 1b params rather than trying out every model we pass by.</p>\n<p>Of course the exact speed of the model depends on the pipeline itself but the constant\nfactor caused by this cannot outweight the fact that kuytai took about 10s to transcribe 1s\nof audio on 4/4 threads.</p>\n<h3>Hugging face</h3>\n<p>Hugging face is an open-source platform for AI models. Similar to github, not only it provides most (if not all) models with their &quot;model cards&quot;, but also has leaderboards for the models. This is what I'll be working with next.</p>\n<p><a href=\"https://huggingface.co/spaces/hf-audio/open_asr_leaderboard\">Here</a> one can find\nthe leaderboard of the speech-recognition models. We are interested in two criteria: WER (word error rate) and RTFx (time of the audio being transcribed/transcription time).</p>\n<p>The tiny.en model without quantization has RTFx of 348, base.en has 320.</p>\n<p>Interesting model:</p>\n<p>UsefulSensors/moonshine-tiny - 9.99 / 565.97</p>\n<p>The following seem extremely fast too, but later turned out that they require Nvidia GPU architecture</p>\n<p>nvidia/parakeet-tdt_ctc-110m - 7.49 /\n5345.14</p>\n<p>nvidia/parakeet-tdt-0.6b-v2 - 6.05\n3386.02</p>\n<p>nvidia/canary-180m-flash - 7.12 / 1213.58</p>\n<p>nvidia/parakeet-rnnt-0.6b - 7.5 / 2815.72 (no punctuation/capitalization)</p>\n<p>nvidia/parakeet-ctc-0.6b - 7.69 / 4281.53 (no punctuation/capitalization)</p>",
"content": "<h2>Week 2 part 2</h2>\n<p>After evaluation of previous models' performances, we decided to try to fit Voxtral - another transformer model.</p>\n<p>The mini version of the model is 3b parameters, weights 8 gb, which took quite a long time to download even for Mac. As kyutai with 1b parameters was way too slow on rPi, I decided that there's no point in trying to run Voxtral on rPi.</p>\n<p>At this point it became obvious that most models are made for powerful devices with GPU. Thus, a decision was made to rather look for a model definitely smaller than 1b params rather than trying out every model we pass by.</p>\n<p>Of course the exact speed of the model depends on the pipeline itself but the constant\nfactor caused by this cannot outweight the fact that kuytai took about 10s to transcribe 1s\nof audio on 4/4 threads.</p>\n<h3>Hugging face</h3>\n<p>Hugging face is an open-source platform for AI models. Similar to github, not only it provides most (if not all) models with their &quot;model cards&quot;, but also has leaderboards for the models. This is what I'll be working with next.</p>\n<p><a href=\"https://huggingface.co/spaces/hf-audio/open_asr_leaderboard\">Here</a> one can find\nthe leaderboard of the speech-recognition models. We are interested in two criteria: WER (word error rate) and RTFx (time of the audio being transcribed/transcription time).</p>\n<p>The tiny.en model without quantization has RTFx of 348, base.en has 320.</p>\n<p>Interesting model:</p>\n<p>UsefulSensors/moonshine-tiny - 9.99 / 565.97</p>\n<p>The following seem extremely fast too, but later turned out that they require Nvidia GPU architecture</p>\n<p>nvidia/parakeet-tdt_ctc-110m - 7.49 /\n5345.14</p>\n<p>nvidia/parakeet-tdt-0.6b-v2 - 6.05\n3386.02</p>\n<p>nvidia/canary-180m-flash - 7.12 / 1213.58</p>\n<p>nvidia/parakeet-rnnt-0.6b - 7.5 / 2815.72 (no punctuation/capitalization)</p>\n<p>nvidia/parakeet-ctc-0.6b - 7.69 / 4281.53 (no punctuation/capitalization)</p>",
"content_type": "html",
"categories": [],
"source": "https://dakpro.github.io/project_feeds/low_power_speech_recognition/feed.xml"
}
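The Week 2 part 2 entry above leans on the leaderboard's RTFx metric (audio duration divided by transcription time). As a rough illustration, the sketch below turns the RTFx values quoted in the entry into the time needed to transcribe an hour-long meeting; note that these figures come from the leaderboard's hardware, not the rPi.

```python
# Estimate transcription time for one hour of audio from the RTFx values
# quoted in the entry (RTFx = audio duration / transcription time).
MEETING_SECONDS = 60 * 60

rtfx = {
    "whisper tiny.en": 348.0,
    "whisper base.en": 320.0,
    "UsefulSensors/moonshine-tiny": 565.97,
}

for model, factor in rtfx.items():
    seconds = MEETING_SECONDS / factor
    print(f"{model:30s} RTFx {factor:7.2f} -> ~{seconds:5.1f} s per hour of audio")
```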
+12
dkvit/uuid_fd52ec3b-5a92-480a-ab72-ab8ddc426352.json
···
{
"id": "urn:uuid:fd52ec3b-5a92-480a-ab72-ab8ddc426352",
"title": "Week 4",
"link": "https://dakpro.github.io/project_feeds/low_power_speech_recognition/week4",
"updated": "2025-08-08T15:07:00",
"published": "2025-08-10T19:12:43.300725",
"summary": "<h2>Week 4</h2>\n<p><a href=\"https://github.com/DakPro/low_power_speech_recognition\">Repo</a> for the project code.\nMany of the used services use huggingface client, so setting up huggingface access token is recommended. </p>\n<h4>Setting up access token</h4>\n<ol>\n<li>Login in <a href=\"https://huggingface.co\">huggingface</a></li>\n<li>Goto <a href=\"https://huggingface.co/settings/profile\">Settings</a></li>\n<li>Goto Access tokens</li>\n<li>Create a new token (read-only recommended)</li>\n</ol>\n<h4>Using access token</h4>\n<ol>\n<li><code>brew install huggingface-cli</code></li>\n<li><code> hf auth login </code></li>\n<li>Input the access token</li>\n</ol>\n<p>When making requests to huggingface client, programs will automatically use the token.</p>\n<h3>Planned structure of the repo</h3>\n<ul>\n<li>Outer file <code> transcription_from_mic.py</code>: given a model name runs \na runtime transcription demo.</li>\n<li>Outer file <code> transcription_from_file.py</code>: given a model name and file \ntranscribes the file. </li>\n<li>Separate directory for each model, includes<ul>\n<li>The irreplaceable part of model pipeline (usually copied from the model source)</li>\n<li>Some stuff used before (like reports, scripts)?</li>\n<li>Interface to use the model, both for demo (with printing captions) and production</li>\n</ul>\n</li>\n<li>Directory for testing - for interaction with datasets</li>\n</ul>",
"content": "<h2>Week 4</h2>\n<p><a href=\"https://github.com/DakPro/low_power_speech_recognition\">Repo</a> for the project code.\nMany of the used services use huggingface client, so setting up huggingface access token is recommended. </p>\n<h4>Setting up access token</h4>\n<ol>\n<li>Login in <a href=\"https://huggingface.co\">huggingface</a></li>\n<li>Goto <a href=\"https://huggingface.co/settings/profile\">Settings</a></li>\n<li>Goto Access tokens</li>\n<li>Create a new token (read-only recommended)</li>\n</ol>\n<h4>Using access token</h4>\n<ol>\n<li><code>brew install huggingface-cli</code></li>\n<li><code> hf auth login </code></li>\n<li>Input the access token</li>\n</ol>\n<p>When making requests to huggingface client, programs will automatically use the token.</p>\n<h3>Planned structure of the repo</h3>\n<ul>\n<li>Outer file <code> transcription_from_mic.py</code>: given a model name runs \na runtime transcription demo.</li>\n<li>Outer file <code> transcription_from_file.py</code>: given a model name and file \ntranscribes the file. </li>\n<li>Separate directory for each model, includes<ul>\n<li>The irreplaceable part of model pipeline (usually copied from the model source)</li>\n<li>Some stuff used before (like reports, scripts)?</li>\n<li>Interface to use the model, both for demo (with printing captions) and production</li>\n</ul>\n</li>\n<li>Directory for testing - for interaction with datasets</li>\n</ul>",
"content_type": "html",
"categories": [],
"source": "https://dakpro.github.io/project_feeds/low_power_speech_recognition/feed.xml"
}
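The Week 4 entry above says that, after hf auth login, programs pick up the stored token automatically. Below is a minimal sketch of checking that this worked, using the huggingface_hub package (the entry only mentions the "huggingface client", so this particular package choice is an assumption).

```python
# Verify that the token saved by `hf auth login` is picked up automatically.
# Assumes the huggingface_hub package is installed.
from huggingface_hub import HfApi

api = HfApi()        # no token argument: falls back to the locally stored token
info = api.whoami()  # raises an error if the token is missing or invalid
print(f"Authenticated to Hugging Face as: {info['name']}")
```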
+4 -4
index.json
···
],
"directory": "dkvit",
"created": "2025-08-10T19:12:30.090698",
- "last_updated": "2025-08-10T19:12:30.090700",
- "entry_count": 0
+ "last_updated": "2025-08-10T19:12:43.316257",
+ "entry_count": 5
}
},
"created": "2025-07-15T16:04:07.657530",
- "last_updated": "2025-08-10T19:12:30.090706",
- "total_entries": 266
+ "last_updated": "2025-08-10T19:12:43.316258",
+ "total_entries": 271
}