{
    "id": "urn:uuid:26a31438-e93b-469e-97df-f5543150a1f6",
    "title": "Week 2_1",
    "link": "https://dakpro.github.io/project_feeds/low_power_speech_recognition/week2_1",
    "updated": "2025-07-23T13:50:00",
    "published": "2025-08-10T19:12:43.306219",
    "summary": "<h2>Week 2 part 1</h2>\n<p>From last week, the memory-shortage problem remains: tracking memory usage shows that the process keeps allocating more and more memory until it crashes and is killed by the OS.</p>\n<p>Solution 1: use part of the microSD card as swap space (i.e. as slow RAM):</p>\n<pre><code>\n# Enable an 8 GB swap file\nsudo fallocate -l 8G /swapfile\nsudo chmod 600 /swapfile\nsudo mkswap /swapfile\nsudo swapon /swapfile\n\n# Make it permanent\necho '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab\n\n...\n\n# Disable swapping\nsudo swapoff /swapfile\n\n# Disable it permanently\nsudo rm /swapfile\n... (remove the line from fstab)\nsudo reboot\n</code></pre>\n\n<p>This showed that the model needs only about 1.6 GB of additional memory. However, microSD storage is far too slow for swap: the run took an enormous amount of time and was terminated.</p>\n<p>One could 1) use an SSD instead - too costly and against the low-power idea; or 2) use an rPi with more RAM (currently 4 GB).</p>\n<h2>Returning to whisper.cpp</h2>\n<h3>Evaluation of the three models</h3>\n<p>Decided to evaluate the transcription speed of the different models.</p>\n<p>Here are the time and memory usage for transcribing the 11 s JFK speech sample using 4/4 threads on the standard OS:</p>\n<table>\n<tr><th>Model</th><th>Time</th><th>Memory</th></tr>\n<tr><td>tiny</td><td>8.3 s</td><td>77 MB</td></tr>\n<tr><td>tiny.en</td><td>8.5 s</td><td>77 MB</td></tr>\n<tr><td>base</td><td>18 s</td><td>147 MB</td></tr>\n<tr><td>base.en</td><td>21 s</td><td>256 MB</td></tr>\n<tr><td>small</td><td>64 s</td><td>487 MB</td></tr>\n<tr><td>small.en</td><td>65 s</td><td>487 MB</td></tr>\n</table>\n<p>The performance test was run only once and on a single recording.</p>\n<p>Optimizing model loading time and other inter-sample overhead could be considered for real-time transcription.</p>\n<p><i>The same evaluation on an rPi 5 (possibly with 8 GB RAM) could be reasonable given the CPU difference, but despite being roughly 2x faster, it requires a fan/active cooling.</i></p>\n<p>After iterative refinement, the following script is used as <code>~/eval.sh</code> for evaluation:</p>\n<pre><code>\n#!/bin/bash\n\n# Collect the models to evaluate from the command line\nmodels=()\nwhile [ $# -gt 0 ]; do\n    models+=( \"$1\" )\n    shift\ndone\n\necho \"models: ${models[@]}\"\ntouch report.log\necho \"Report on model evaluation. The duration of sample recording is 11s (JFK speech)\" > report.log\ncd whisper.cpp\necho -n \"Building whisper-cli... \"\ncmake -B build > /dev/null\ncmake --build build -j --config Release > /dev/null\necho \"whisper-cli built\"\nbase_models=(\"tiny\" \"tiny.en\" \"base\" \"base.en\" \"small\" \"small.en\" \"medium\" \"medium.en\")\necho \"-----------------------------\"\necho \"-----------------------------\" >> ../report.log\n\n# Returns 0 if the argument is one of the stock (non-quantized) models\nis_base_model(){\n    for bm in \"${base_models[@]}\"; do\n        if [[ \"$1\" == \"${bm}\" ]]; then\n            echo \"$1 IS a base model\"\n            return 0\n        fi\n    done\n    echo \"$1 is not a base model\"\n    return 1\n}\n\nfor model in \"${models[@]}\"; do\n    echo \"Model $model\" >> ../report.log\n    if is_base_model \"$model\"; then\n        echo \"Starting model $model evaluation\"\n        if [ ! -f models/$model.bin ]; then\n            echo -n \"Model not found... Downloading $model... \"\n            sh ./models/download-ggml-model.sh $model > /dev/null\n            mv models/ggml-$model.bin models/$model.bin\n            echo \"Downloaded\"\n        fi\n        path=\"models/$model.bin\"\n    else\n        echo -n \"Looking for quantized model $model... \"\n        if [ ! -f quantized_models/$model.bin ]; then\n            echo \"Quantized model not found. Skipping...\"\n            continue\n        fi\n        path=\"quantized_models/$model.bin\"\n        echo \"Quantized model found\"\n    fi\n    echo -n \"Runtime: \" >> ../report.log\n    echo -n \"Running $model... \"
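\n    # Note (hypothetical alternative, not part of the original run): if the GNU\n    # \"time\" package is installed, peak memory can also be captured directly, e.g.\n    #   /usr/bin/time -v ./build/bin/whisper-cli -m $path -f samples/jfk.wav > tmp.out 2>&1\n    # and \"Maximum resident set size\" read from its output.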
\n    ./build/bin/whisper-cli -m $path -f samples/jfk.wav > tmp.out 2>&1\n\n    # for debugging\n    # cat tmp.out\n\n    grep -i -E \"total memory|total time\" tmp.out >> ../report.log\n    echo \"run\"\n    echo \"----------------------------------\" >> ../report.log\n    echo \"----------------------------------\"\ndone\n</code></pre>\n\n<h3>Quantization of whisper</h3>\n<p>Unlike kyutai, whisper supports built-in quantization.</p>\n<p>Notes on choosing a quantization:</p>\n<ul>\n<li>Qx_y - x bits per weight; the y variants are the legacy formats, deprecated in favour of Qx_K</li>\n<li>Qx_K - K-quants; better than the legacy formats, use mixed bit-widths</li>\n<li>TQx - ternary quantization (trits instead of bits); extreme compression, but the quality drops too much</li>\n<li>IQx_s - importance-aware quantization; much better quality at the same bit rate. s is the size (S/M/L)</li>\n</ul>\n<p>Based on this, will try IQ4_M first.</p>\n<p>After iterative refinement, this script was used as <code>~/qt.sh</code> for quantization:</p>\n<pre><code>\n#!/bin/bash\n\necho \"args: $@\"\n\ncd whisper.cpp\nif [ $# -eq 0 ]; then\n    echo \"Error: quantization method is not provided.\"\n    echo \"Usage: $0 QM [QM ...] [-m MODEL]\"\n    exit 1\nfi\n\n# Collect quantization methods; an optional -m MODEL selects the source model (default: base)\nqms=()\nmodel=\"base\"\nwhile [ $# -gt 0 ]; do\n    echo \"curr arg: $1\"\n    if [[ \"$1\" == \"-m\" ]]; then\n        shift\n        model=\"$1\"\n        break\n    fi\n    qms+=(\"$1\")\n    shift\ndone\necho \"qms: ${qms[@]}\"\n\nif [ ! -d \"quantized_models\" ]; then\n    mkdir quantized_models\nfi\nfor qm in \"${qms[@]}\"; do\n    ./build/bin/quantize models/$model.bin quantized_models/$model-$qm.bin $qm\ndone\n</code></pre>\n\n<p>After spending some time figuring out why the model refuses to be quantized to IQ4_M, it turns out that the quantization types actually supported are listed in lines 50-80 of <code>common-ggml.cpp</code>.</p>\n<p>After some quick experiments with the <code>base</code> model:</p>\n<ul>\n<li>q5_0 - total time improved from 18.1 s to 14.3 s (encode time: 14.5 s to 11.4 s)</li>\n<li>q2_k - the model starts outputting \"you you you\" -> not enough quality</li>\n<li>q5_k - total time improved from 18.1 s to 13.2 s (encode time: 14.7 s to 10.6 s)</li>\n</ul>\n<p>Further evaluations:</p>\n<h3>Model Evaluation on the 11 s sample</h3>\n<table>\n<caption>Model Evaluation Report (11 s JFK Speech Sample)</caption>\n<tr><th>Model</th><th>Runtime (s)</th></tr>\n<tr><th colspan=\"2\">Small Models</th></tr>\n<tr><td>small-q2_k</td><td>38.4</td></tr>\n<tr><td>small-q3_k</td><td>46.2</td></tr>\n<tr><td>small-q4_0</td><td>39.8</td></tr>\n<tr><td>small-q4_1</td><td>39.1</td></tr>\n<tr><td>small-q4_k</td><td>37.3</td></tr>\n<tr><td>small-q5_0</td><td>47</td></tr>\n<tr><td>small-q5_1</td><td>49.7</td></tr>\n<tr><td>small-q5_k</td><td>44.7</td></tr>\n<tr><td>small-q6_k</td><td>46.6</td></tr>\n<tr><td>small-q8_0</td><td>40.5</td></tr>\n<tr><td>small</td><td>76.3</td></tr>\n<tr><th colspan=\"2\">Base Models</th></tr>\n<tr><td>base-q2_k</td><td>75.9</td></tr>\n<tr><td>base-q3_k</td><td>13.7</td></tr>\n<tr><td>base-q4_0</td><td>12.6</td></tr>\n<tr><td>base-q4_1</td><td>12.3</td></tr>\n<tr><td>base-q4_k</td><td>11.9</td></tr>\n<tr><td>base-q5_0</td><td>14.4</td></tr>\n<tr><td>base-q5_1</td><td>14.4</td></tr>\n<tr><td>base-q5_k</td><td>13.3</td></tr>\n<tr><td>base-q6_k</td><td>13.6</td></tr>\n<tr><td>base-q8_0</td><td>12.8</td></tr>\n<tr><td>base</td><td>18.2</td></tr>\n</table>\n<p>Issue: q2_k should be smaller and faster, but it is not. small-q2_k does not get stuck and actually produces the correct transcription, so the performance drop comes from somewhere else.</p>\n<p>It turns out q2_k/q3_k are optimized for AVX2/AVX-512 (SIMD instruction-set extensions) on x86. On the rPi's ARM CPU these are absent, so the dequantization overhead becomes enormous, slowing everything down. 
The model getting stuck on \"you you you\" is likely a result of the poor precision of the quantized model.</p>\n<p>In theory, base-q4_k on a headless setup should be sufficient, at least if a bit of extra transcription time is acceptable (for instance, an additional 5-10 minutes after an hour-long meeting). But if we want to achieve real-time transcription, one should look for alternatives.</p>
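\n<p>To put that conclusion into numbers, here is a back-of-the-envelope real-time-factor check (an illustration based on the base-q4_k row of the table above, not a separate measurement):</p>\n<pre><code>\n# Real-time factor (RTF) = transcription time / audio duration.\n# base-q4_k: 11.9 s for the 11 s sample -> RTF of about 1.08,\n# i.e. roughly 65 minutes of processing for a 60-minute recording.\nawk 'BEGIN { rtf = 11.9 / 11; printf \"RTF = %.2f, 60 min of audio -> about %.0f min\\n\", rtf, rtf * 60 }'\n</code></pre>",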
    "content_type": "html",
    "categories": [],
    "source": "https://dakpro.github.io/project_feeds/low_power_speech_recognition/feed.xml"
}