Wrong Assumptions & Hard Lessons Learned
This document catalogs all the incorrect assumptions we made during development and what the reality was. Hopefully this saves you from the same painful debugging.
Audio Processing Assumptions
Assumption 1: Simple Downsampling is Fine
What we thought:
# Simple downsampling from 48kHz to 16kHz
audio_data = audio_data[::3] # Keep every 3rd sample
Reality: This destroys audio quality through aliasing and loses critical frequency information. The wake word model expects properly resampled audio.
What actually works:
from scipy import signal
audio_data = signal.resample(audio_data, target_samples).astype(np.int16)
Impact: Wake word scores went from ~0.000 to 0.5-0.95
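A minimal, runnable sketch of the fix (assuming numpy and scipy are available; the 3840/1280 sizes follow the 80 ms chunking used later in this document):

```python
import numpy as np
from scipy import signal

RATE_IN, RATE_OUT = 48000, 16000
chunk_48k = np.random.randint(-2000, 2000, 3840).astype(np.int16)  # one 80 ms chunk

# signal.resample is FFT-based and applies an implicit anti-aliasing filter;
# naive decimation (audio_data[::3]) folds high frequencies back into the
# band the wake word model listens to.
target_samples = len(chunk_48k) * RATE_OUT // RATE_IN  # 1280
chunk_16k = signal.resample(chunk_48k, target_samples).astype(np.int16)

assert chunk_16k.shape == (1280,) and chunk_16k.dtype == np.int16
```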
Assumption 2: Audio Format Doesn’t Matter Much
What we thought:
# float32 should be fine, it's more precise
audio_data = np.frombuffer(indata, dtype=np.float32)
Reality: The wake word model was trained on int16 audio. Float32 changes the amplitude scale and confuses the model’s feature extraction.
What actually works:
audio_data = np.frombuffer(indata, dtype=np.int16).flatten()
Impact: This was the #1 reason wake word detection failed
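If your capture path hands you float32 anyway, here is a hedged conversion sketch; the [-1.0, 1.0] normalization and 32767 scale factor are the usual conventions, not something verified against this specific driver:

```python
import numpy as np

# float32 capture is typically normalized to [-1.0, 1.0]; int16 spans
# +-32767, so feeding the float buffer straight to the model looks like
# near-silence to its feature extractor. Rescale before predicting.
float_chunk = np.array([0.0, 0.5, -0.5, 1.0], dtype=np.float32)
int_chunk = (float_chunk * 32767).astype(np.int16)

assert int_chunk.dtype == np.int16
assert int_chunk[-1] == 32767  # full-scale float maps to full-scale int16
```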
Assumption 3: We Can Use Any Sample Rate
What we thought:
# Just set sounddevice to 16000Hz
sd.InputStream(samplerate=16000, ...)
Reality: The AIY Voice HAT only supports 48000Hz via PortAudio/SoundDevice. Attempting 16000Hz causes errors or silent failures.
What actually works:
# Hardware at 48000Hz, resample to 16000Hz for model
sd.InputStream(samplerate=48000, ...)
audio_data = signal.resample(audio_data, target_chunk_size)  # e.g. 1280 samples per 80 ms chunk
Impact: Without this, the audio stream wouldn’t even open
Assumption 4: Check the Immediate Prediction Result
What we thought:
prediction = oww_model.predict(audio_data)
if prediction > threshold:
    # Wake word detected!
Reality:
OpenWakeWord uses a sliding window of predictions (prediction_buffer), not instantaneous results. The immediate return value is meaningless.
What actually works:
oww_model.predict(audio_data) # Just updates the buffer
for mdl in oww_model.prediction_buffer.keys():
    score = list(oww_model.prediction_buffer[mdl])[-1]
    if score > threshold:
        # Actually detected!
Impact: This was the #2 reason detection failed
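A pure-Python toy that mimics the buffer-reading pattern above — a deque standing in for OpenWakeWord's per-model prediction_buffer, not the library's actual implementation:

```python
from collections import deque

# Toy stand-in: each predict() call appends a score to a bounded buffer;
# detection reads the newest buffered entry, not the call's return value.
prediction_buffer = {"hey_jarvis": deque(maxlen=30)}

def predict(score):
    prediction_buffer["hey_jarvis"].append(score)

for s in (0.01, 0.02, 0.71):  # scores rise as the phrase completes
    predict(s)

latest = list(prediction_buffer["hey_jarvis"])[-1]
assert latest > 0.5  # detection fires on the buffered score
```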
Assumption 5: ALSA Default Device Works
What we thought:
arecord -D default -r 16000 -c 1 -f S16_LE test.wav
Reality: Batocera uses PipeWire, which conflicts with direct ALSA access. The “default” device routes through PipeWire and causes “Host is down” errors.
What actually works:
# Bypass PipeWire entirely
arecord -D plughw:0,0 -r 16000 -c 1 -f S16_LE test.wav
Impact: Recording was completely broken until we found this
Model & Binary Assumptions
Assumption 6: Pre-built Binaries Work on Pi 5
What we thought: Download whisper.cpp binaries from GitHub releases
Reality: Pre-built binaries are compiled for older ARM architectures and crash with SIGILL (illegal instruction) on Pi 5’s ARMv8.2-A.
What actually works: Compile whisper.cpp specifically for Pi 5 using Docker or cross-compilation:
docker run --rm -v $(pwd):/work arm64v8/debian:latest \
  bash -c "apt-get update && apt-get install -y cmake build-essential && \
    cd /work && cmake -B build -DWHISPER_BUILD_EXAMPLES=ON && \
    cmake --build build --config Release"
Impact: SIGILL crashes until we compiled ourselves
Assumption 7: Just Include the Main Binary
What we thought:
Copy only whisper-cli to the Pi
Reality: whisper-cli depends on multiple .so libraries (libwhisper.so.1, libggml*.so*) that must be in the same directory or LD_LIBRARY_PATH.
What actually works: Copy the entire build output:
whisper-cli
libwhisper.so.1
libggml.so.0
libggml-base.so.0
libggml-cpu.so.0
Impact: “Library not found” errors
Assumption 8: Any ONNX Model Works
What we thought: Any “Hey Jarvis” ONNX model from the internet would work
Reality: OpenWakeWord models are specifically trained with MFCC preprocessing and expect exact input dimensions [1, 16, 96]. Random ONNX models won’t work.
What actually works: Use models specifically trained for OpenWakeWord from their repository.
Impact: Model would load but produce garbage predictions
Assumption 9: Model Works on Raw Audio
What we thought: The model takes raw audio samples and does the feature extraction
Reality:
The model expects pre-computed MFCC (Mel-Frequency Cepstral Coefficients) features, not raw audio. OpenWakeWord’s predict() method handles this internally.
What actually works: Use OpenWakeWord’s high-level API - it handles MFCC extraction internally.
Impact: Tried to manually compute features (waste of time)
Hardware & System Assumptions
Assumption 10: GPIO Button is Active High
What we thought:
if GPIO.input(BUTTON_PIN) == GPIO.HIGH:
    # Button pressed
Reality: The AIY HAT button is wired active-low (connected to ground when pressed).
What actually works:
if GPIO.input(BUTTON_PIN) == GPIO.LOW:
    # Button pressed
Impact: Button detection was inverted
Assumption 11: Audio Chunk Size Doesn’t Matter
What we thought: Any chunk size would work - just process whatever we get
Reality: OpenWakeWord expects specific chunk sizes (1280 samples = 80ms at 16kHz) for its internal buffering and MFCC computation.
What actually works:
CHUNK_SIZE = 1280 # 80ms at 16000Hz
input_chunk_size = int(CHUNK_SIZE * (input_rate / OWW_SAMPLE_RATE))
Impact: Wrong chunk sizes caused prediction delays and inaccuracies
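The arithmetic, spelled out with the values from this section (pure Python):

```python
OWW_SAMPLE_RATE = 16000
CHUNK_SIZE = 1280                # 80 ms at 16 kHz, what the model expects
input_rate = 48000               # what the AIY HAT actually delivers

# Read this many samples from the 48 kHz stream so that, after
# resampling, exactly one 1280-sample model chunk comes out.
input_chunk_size = int(CHUNK_SIZE * (input_rate / OWW_SAMPLE_RATE))
assert input_chunk_size == 3840  # still 80 ms, just at 48 kHz
```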
Assumption 12: We Can Use System Python Packages
What we thought: Install scipy, numpy, etc. via pip on Batocera
Reality:
Batocera uses a read-only root filesystem, so pip cannot install into the system site-packages. Packages must live under the writable /userdata directory, with PYTHONPATH (or sys.path) pointing at them.
What actually works:
sys.path.insert(0, '/userdata/voice-assistant/lib')
Impact: Couldn’t install packages the normal way
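A self-contained sketch of the workaround; the /userdata/voice-assistant/lib path is the one used above:

```python
import sys

# Batocera's root filesystem is read-only, so packages are unpacked
# under the writable /userdata tree and put on the import path by hand.
LIB_DIR = '/userdata/voice-assistant/lib'
if LIB_DIR not in sys.path:
    sys.path.insert(0, LIB_DIR)

# Subsequent imports (scipy, numpy, ...) will search this directory first.
assert sys.path[0] == LIB_DIR
```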
Process & Debugging Assumptions
Assumption 13: Wake Word Should Work Immediately
What we thought: If the wake word doesn’t detect on the first try, it’s broken
Reality: Wake word detection requires:
- Proper audio levels (not too quiet, not clipping)
- Clear pronunciation
- Appropriate distance from microphone
- Some models need a few seconds to “warm up”
What actually works: Test with consistent, clear speech at 6-12 inches from mic. Check audio levels first.
Impact: Thought it was broken when it just needed better test conditions
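A quick level check you can run on any recorded int16 chunk before blaming the model (pure Python; the 1%-of-full-scale RMS floor and the clipping test are illustrative guesses — tune them for your mic):

```python
import math

def audio_health(samples, full_scale=32767):
    """Check that an int16 chunk is neither near-silent nor clipping."""
    peak = max(abs(s) for s in samples)
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    too_quiet = rms < 0.01 * full_scale  # illustrative threshold
    clipping = peak >= full_scale
    return rms, peak, too_quiet, clipping

# A healthy test signal: mid-level 440 Hz tone at 16 kHz, no clipping.
tone = [int(10000 * math.sin(2 * math.pi * 440 * n / 16000)) for n in range(1600)]
rms, peak, too_quiet, clipping = audio_health(tone)
assert not too_quiet and not clipping
```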
Assumption 14: High Score = Good Detection
What we thought: Scores near 1.0 are required for reliable detection
Reality: The “Hey Jarvis” model typically scores 0.5-0.95 when working correctly. Scores of 0.999 are suspicious and might indicate overfitting or the wrong model.
What actually works: Threshold of 0.5 works well for this model.
Impact: Set threshold too high (0.8) and missed valid detections
Assumption 15: One Detection Per Wake Word
What we thought: Say “Hey Jarvis” once → one detection
Reality: Depending on chunk boundaries and audio processing, you might get multiple detections from a single utterance if you don’t reset the buffer.
What actually works:
if score > threshold:
    oww_model.reset()  # Clear the prediction buffer
    # Process command...
Impact: Multiple activations from single wake word
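A sketch that pairs the buffer reset with a simple cooldown, so chunk-boundary re-triggers are suppressed as well (the 2-second cooldown is an illustrative value; oww_model.reset() would be called where the comment indicates):

```python
import time

COOLDOWN_S = 2.0   # ignore re-triggers for this long (illustrative value)
last_fired = 0.0

def on_score(score, threshold=0.5, now=None):
    """Fire at most once per utterance: enforce a cooldown between triggers."""
    global last_fired
    now = time.monotonic() if now is None else now
    if score > threshold and (now - last_fired) >= COOLDOWN_S:
        last_fired = now
        return True   # caller would also invoke oww_model.reset() here
    return False

# Three consecutive chunks of the same utterance score high,
# but only the first one triggers.
fires = [on_score(0.9, now=t) for t in (10.0, 10.08, 10.16)]
assert fires == [True, False, False]
```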
Architecture Assumptions
Assumption 16: Use Same Audio Path for Everything
What we thought: Use SoundDevice for both wake word detection AND recording
Reality: SoundDevice holds the audio device open, blocking arecord from accessing it. Also, SoundDevice doesn’t work well with ALSA direct mode.
What actually works:
- SoundDevice for wake word detection (PortAudio)
- arecord for command recording (direct ALSA)
- Close SoundDevice stream before calling arecord
Impact: Recording failed with “Device busy” errors
Assumption 17: Synchronous Processing is Fine
What we thought: Process everything in the audio callback
Reality: Audio callbacks must be fast (<10ms) or you get dropouts. LLM inference takes seconds.
What actually works:
def audio_callback(indata, frames, time_info, status):
    # Only do fast operations here
    wake_detected = check_wake_word(indata)
    if wake_detected:
        trigger_processing_thread()  # Do slow work elsewhere
Impact: Audio dropouts, missed wake words
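A minimal producer/consumer sketch of that split using only the standard library; the callback stub and queue names are illustrative, not from the actual implementation:

```python
import queue
import threading

results = []
work = queue.Queue()

def worker():
    # The slow pipeline (record -> whisper -> LLM) runs here, off the
    # audio thread, so the callback never blocks for seconds.
    while True:
        item = work.get()
        if item is None:
            break
        results.append(item)  # stand-in for seconds of real processing

t = threading.Thread(target=worker, daemon=True)
t.start()

def audio_callback_stub(wake_detected):
    # The real callback must return in well under 10 ms: enqueue and return.
    if wake_detected:
        work.put("process command")

audio_callback_stub(True)
work.put(None)  # demo-only shutdown signal
t.join()

assert results == ["process command"]
```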
The Big Picture Mistakes
Mistake 1: Not Reading be-more-agent Code First
We spent hours debugging when be-more-agent had already solved these problems. Lesson: Look for working reference implementations first.
Mistake 2: Assuming Errors Mean Broken Hardware
Multiple “Host is down” and SIGILL errors made us think the hardware was faulty. Lesson: Software/configuration issues are more likely than hardware failure.
Mistake 3: Changing Too Many Things at Once
We tried different sample rates, formats, and models simultaneously. Lesson: Change one variable at a time and test.
Mistake 4: Not Checking Audio Quality First
We assumed audio was good because the stream opened. Lesson: Always verify audio quality with test recordings before processing.
Checklist for Future Voice Projects
Before you start debugging:
- Record test audio: arecord -D plughw:0,0 -r 16000 -f S16_LE test.wav
- Verify audio quality by playing it back: aplay test.wav
- Check audio format matches model expectations
- Find a working reference implementation
- Test with simplest possible setup first
- Verify binary compatibility (ARM64 vs ARM32)
- Check all library dependencies
- Confirm chunk sizes match model requirements
Summary Table
| Assumption | Reality | Time Wasted |
|---|---|---|
| Simple downsampling [::3] | Use scipy.signal.resample | 2 hours |
| float32 audio format | Must use int16 | 4 hours |
| Check immediate prediction | Check prediction_buffer | 3 hours |
| ALSA default device | Must use plughw:0,0 | 1 hour |
| Pre-built binaries work | Must compile for Pi 5 | 2 hours |
| Include only main binary | Need all .so libraries | 30 minutes |
| Any ONNX model works | Need OpenWakeWord specific models | 1 hour |
| Model takes raw audio | Needs MFCC features | 2 hours |
| GPIO button active high | Actually active low | 30 minutes |
| Audio chunk size flexible | Must be 1280 samples | 1 hour |
| System Python packages | Must use /userdata/lib | 1 hour |
| High score = good | 0.5-0.95 is normal | 30 minutes |
| One detection per utterance | Need to reset buffer | 1 hour |
| Same audio path for all | Close stream before recording | 2 hours |
| Synchronous processing | Must use threads | 2 hours |
Total time wasted on wrong assumptions: ~23 hours
Final Advice
When something doesn’t work:
- Don’t assume - Test every assumption
- Look for working examples - Someone has solved this before
- Read the source - Documentation lies, code doesn’t
- Check the basics - Audio quality, format, levels
- Change one thing at a time - Isolate variables
- Log everything - You can’t debug what you can’t see
The working implementation is the result of correcting ALL of these assumptions. Miss even one, and things break mysteriously.