I've been working on an offline voice timer that uses an ESP32 for voice recognition and an RP2040 for everything else (a little display, sounds, ...). I've hit a bit of a wall with the speech recognition portion, and I was wondering if anyone has suggestions or knows of a better-performing method for speech detection.

Knowing very little about machine learning, I followed along with the Google Colab documentation for training 'tflite' models for microcontrollers, but adjusted it to detect the words marvin (as a wake word), stop, and the digits 0-9. I've tooled around with the settings and implemented the methods from a few papers (e.g. this one) to try to improve performance, but I'm typically limited to ~85% accuracy, and that is on a reduced set of words (marvin, stop, 0-3 and 5).

When I modify the example code for an ESP32 (from the espressif repo) to use my model, I can see that it is now trying to detect the correct words, but with very low confidence (typically ~130 on the scale it uses), and I am fairly certain it would never "work" in any real sense. Does anyone have any suggestions on how to improve accuracy?

I also tried using ESP-Skainet, but I couldn't get it to recognize any commands (though I am not sure it was receiving audio correctly).
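To put the ~130 figure in context, here is a minimal Python sketch of how a score like that could be interpreted. It assumes (unconfirmed) that the scale is an unsigned 8-bit range from 0 to 255 and that a detection only fires when the score, averaged over a few recent inference windows, clears a cutoff. `SCORE_MAX`, `DETECTION_THRESHOLD`, `score_to_probability`, and `SmoothedDetector` are hypothetical names and values for illustration, not taken from the espressif example.

```python
from collections import deque

SCORE_MAX = 255            # assumption: scores are uint8, 0..255
DETECTION_THRESHOLD = 200  # assumption: hypothetical cutoff for illustration


def score_to_probability(score, score_max=SCORE_MAX):
    """Map a raw quantized score onto a 0.0..1.0 confidence."""
    return score / score_max


class SmoothedDetector:
    """Average the last `window` scores for one word before thresholding."""

    def __init__(self, window=5, threshold=DETECTION_THRESHOLD):
        self.scores = deque(maxlen=window)  # oldest scores drop off automatically
        self.threshold = threshold

    def update(self, score):
        """Add a new score; return (running average, whether it counts as detected)."""
        self.scores.append(score)
        average = sum(self.scores) / len(self.scores)
        return average, average >= self.threshold


if __name__ == "__main__":
    detector = SmoothedDetector(window=3)
    for raw in (120, 130, 140):  # made-up scores near the ~130 seen on the device
        average, detected = detector.update(raw)
        print(f"raw={raw} p={score_to_probability(raw):.2f} "
              f"avg={average:.1f} detected={detected}")
```

Under these assumptions, a score of 130 is only about 51% confidence, so it would sit well below any cutoff strict enough to avoid false triggers.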