Wrong Assumptions & Hard Lessons Learned

This document catalogs the incorrect assumptions we made during development and the reality behind each one. Hopefully it saves you from the same painful debugging.

Audio Processing Assumptions

Assumption 1: Simple Downsampling is Fine

What we thought:

# Simple downsampling from 48kHz to 16kHz
audio_data = audio_data[::3]  # Keep every 3rd sample

Reality: This destroys audio quality through aliasing and loses critical frequency information. The wake word model expects properly resampled audio.

What actually works:

from scipy import signal
audio_data = signal.resample(audio_data, target_samples).astype(np.int16)

Impact: Wake word scores went from ~0.000 to 0.5-0.95
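As a minimal sketch of the difference (assuming numpy and scipy are available; the tone and chunk size are our own illustration), naive decimation versus proper resampling of one 80ms chunk:

```python
import numpy as np
from scipy import signal

# Synthetic 1 kHz tone captured at 48 kHz: one 80 ms chunk (3840 samples).
rate_in, rate_out = 48000, 16000
t = np.arange(3840) / rate_in
chunk = (10000 * np.sin(2 * np.pi * 1000 * t)).astype(np.int16)

# Naive decimation: keep every 3rd sample. No anti-alias filter, so any
# content above 8 kHz folds back into the band the model actually sees.
naive = chunk[::3]

# Proper resampling: scipy low-pass filters during rate conversion.
target_samples = len(chunk) * rate_out // rate_in  # 1280
resampled = signal.resample(chunk, target_samples).astype(np.int16)

print(len(naive), len(resampled))  # both 1280 samples, very different spectra
```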


Assumption 2: Audio Format Doesn’t Matter Much

What we thought:

# float32 should be fine, it's more precise
audio_data = np.frombuffer(indata, dtype=np.float32)

Reality: The wake word model was trained on int16 audio. Float32 changes the amplitude scale and confuses the model’s feature extraction.

What actually works:

audio_data = np.frombuffer(indata, dtype=np.int16).flatten()

Impact: This was the #1 reason wake word detection failed
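If you are stuck with a float32 stream, you can convert it rather than feed floats to the model. A hedged sketch, assuming the floats are normalized to [-1.0, 1.0] (PortAudio's convention for float samples):

```python
import numpy as np

# Scale normalized float32 samples back to the int16 range the
# wake word model was trained on, clipping to avoid overflow.
float_chunk = np.array([0.0, 0.5, -0.5, 1.0], dtype=np.float32)
int16_chunk = np.clip(float_chunk * 32767, -32768, 32767).astype(np.int16)
# int16_chunk is now 0, 16383, -16383, 32767
```

Capturing int16 directly, as above, is still the safer path; the conversion only helps when the capture format is out of your control.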


Assumption 3: We Can Use Any Sample Rate

What we thought:

# Just set sounddevice to 16000Hz
sd.InputStream(samplerate=16000, ...)

Reality: The AIY Voice HAT only supports 48000Hz via PortAudio/SoundDevice. Attempting 16000Hz causes errors or silent failures.

What actually works:

# Hardware at 48000Hz, resample to 16000Hz for model
sd.InputStream(samplerate=48000, ...)
audio_data = signal.resample(audio_data, target_chunk_size)  # samples per chunk at 16kHz

Impact: Without this, the audio stream wouldn’t even open


Assumption 4: Check the Immediate Prediction Result

What we thought:

prediction = oww_model.predict(audio_data)
if prediction > threshold:
    # Wake word detected!

Reality: OpenWakeWord uses a sliding window of predictions (prediction_buffer), not instantaneous results. The immediate return value is meaningless.

What actually works:

oww_model.predict(audio_data)  # Just updates the buffer
for mdl in oww_model.prediction_buffer.keys():
    score = list(oww_model.prediction_buffer[mdl])[-1]
    if score > threshold:
        # Actually detected!

Impact: This was the #2 reason detection failed


Assumption 5: ALSA Default Device Works

What we thought:

arecord -D default -r 16000 -c 1 -f S16_LE test.wav

Reality: Batocera uses PipeWire, which conflicts with direct ALSA access. The “default” device routes through PipeWire and causes “Host is down” errors.

What actually works:

# Bypass PipeWire entirely
arecord -D plughw:0,0 -r 16000 -c 1 -f S16_LE test.wav

Impact: Recording was completely broken until we found this


Model & Binary Assumptions

Assumption 6: Pre-built Binaries Work on Pi 5

What we thought: Download whisper.cpp binaries from GitHub releases

Reality: Pre-built binaries are compiled for older ARM architectures and crash with SIGILL (illegal instruction) on Pi 5’s ARMv8.2-A.

What actually works: Compile whisper.cpp specifically for Pi 5 using Docker or cross-compilation:

docker run --rm -v $(pwd):/work arm64v8/debian:latest \
  bash -c "apt-get update && apt-get install -y cmake build-essential && \
  cd /work && cmake -B build -DWHISPER_BUILD_EXAMPLES=ON && \
  cmake --build build --config Release"

Impact: SIGILL crashes until we compiled ourselves


Assumption 7: Just Include the Main Binary

What we thought: Copy only whisper-cli to the Pi

Reality: whisper-cli depends on multiple .so libraries (libwhisper.so.1, libggml*.so*) that must be in the same directory or LD_LIBRARY_PATH.

What actually works: Copy the entire build output:

whisper-cli
libwhisper.so.1
libggml.so.0
libggml-base.so.0
libggml-cpu.so.0

Impact: “Library not found” errors
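A sketch of a launcher that fails with a readable message instead of the loader's error. WHISPER_DIR is a hypothetical path; point it at wherever you copied the build output:

```shell
# Run whisper-cli with its bundled shared libraries on the library path.
WHISPER_DIR="${WHISPER_DIR:-/userdata/voice-assistant/whisper}"
export LD_LIBRARY_PATH="$WHISPER_DIR${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"

# Check for every required .so before launching.
missing=""
for lib in libwhisper.so.1 libggml.so.0 libggml-base.so.0 libggml-cpu.so.0; do
    [ -e "$WHISPER_DIR/$lib" ] || missing="$missing $lib"
done

if [ -n "$missing" ]; then
    echo "missing libraries:$missing"
else
    "$WHISPER_DIR/whisper-cli" "$@"
fi
```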


Assumption 8: Any ONNX Model Works

What we thought: Any “Hey Jarvis” ONNX model from the internet would work

Reality: OpenWakeWord models are specifically trained with MFCC preprocessing and expect exact input dimensions [1, 16, 96]. Random ONNX models won’t work.

What actually works: Use models specifically trained for OpenWakeWord from their repository.

Impact: Model would load but produce garbage predictions


Assumption 9: Model Works on Raw Audio

What we thought: The model takes raw audio samples and does the feature extraction

Reality: The model expects pre-computed MFCC (Mel-Frequency Cepstral Coefficients) features, not raw audio. OpenWakeWord’s predict() method handles this internally.

What actually works: Use OpenWakeWord’s high-level API - it handles MFCC extraction internally.

Impact: Tried to manually compute features (waste of time)


Hardware & System Assumptions

Assumption 10: GPIO Button is Active High

What we thought:

if GPIO.input(BUTTON_PIN) == GPIO.HIGH:
    # Button pressed

Reality: The AIY HAT button is wired active-low (connected to ground when pressed).

What actually works:

if GPIO.input(BUTTON_PIN) == GPIO.LOW:
    # Button pressed

Impact: Button detection was inverted


Assumption 11: Audio Chunk Size Doesn’t Matter

What we thought: Any chunk size would work - just process whatever we get

Reality: OpenWakeWord expects specific chunk sizes (1280 samples = 80ms at 16kHz) for its internal buffering and MFCC computation.

What actually works:

CHUNK_SIZE = 1280  # 80ms at 16000Hz
input_chunk_size = int(CHUNK_SIZE * (input_rate / OWW_SAMPLE_RATE))

Impact: Wrong chunk sizes caused prediction delays and inaccuracies
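The two constraints above (hardware locked at 48kHz, model wanting 1280-sample chunks at 16kHz) combine into simple bookkeeping. A sketch with our own names, not the project's:

```python
import numpy as np
from scipy import signal

HW_RATE = 48000      # the only rate the AIY Voice HAT accepts
MODEL_RATE = 16000   # what the wake word model expects
MODEL_CHUNK = 1280   # 80 ms at 16 kHz

# Read 3x as many samples from the hardware so each resampled
# chunk comes out at exactly MODEL_CHUNK samples.
hw_chunk = MODEL_CHUNK * HW_RATE // MODEL_RATE  # 3840

def to_model_rate(raw_48k: np.ndarray) -> np.ndarray:
    """Resample one hardware chunk (int16 at 48 kHz) down to 16 kHz int16."""
    return signal.resample(raw_48k, MODEL_CHUNK).astype(np.int16)

silence = np.zeros(hw_chunk, dtype=np.int16)
print(to_model_rate(silence).shape)  # (1280,)
```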


Assumption 12: We Can Use System Python Packages

What we thought: Install scipy, numpy, etc. via pip on Batocera

Reality: Batocera is a read-only root filesystem. We must use /userdata directory and set PYTHONPATH.

What actually works:

sys.path.insert(0, '/userdata/voice-assistant/lib')

Impact: Couldn’t install packages the normal way
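A sketch of the path setup, assuming the packages were staged into the writable area beforehand (for example with `pip install --target=/userdata/voice-assistant/lib scipy numpy` run with the Pi's architecture in mind):

```python
import sys

# /userdata is the only writable area on Batocera's read-only rootfs,
# so third-party packages live there; prepend it so imports resolve first.
LIB_DIR = '/userdata/voice-assistant/lib'
if LIB_DIR not in sys.path:
    sys.path.insert(0, LIB_DIR)
```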


Process & Debugging Assumptions

Assumption 13: Wake Word Should Work Immediately

What we thought: If the wake word doesn’t detect on the first try, it’s broken

Reality: Wake word detection requires:

  1. Proper audio levels (not too quiet, not clipping)
  2. Clear pronunciation
  3. Appropriate distance from microphone
  4. Some models need a few seconds to “warm up”

What actually works: Test with consistent, clear speech at 6-12 inches from mic. Check audio levels first.

Impact: Thought it was broken when it just needed better test conditions


Assumption 14: High Score = Good Detection

What we thought: Scores near 1.0 are required for reliable detection

Reality: The “Hey Jarvis” model typically scores 0.5-0.95 when working correctly. Scores of 0.999 are suspicious and might indicate overfitting or wrong model.

What actually works: Threshold of 0.5 works well for this model.

Impact: Set threshold too high (0.8) and missed valid detections


Assumption 15: One Detection Per Wake Word

What we thought: Say “Hey Jarvis” once → one detection

Reality: Depending on chunk boundaries and audio processing, you might get multiple detections from a single utterance if you don’t reset the buffer.

What actually works:

if score > threshold:
    oww_model.reset()  # Clear the prediction buffer
    # Process command...

Impact: Multiple activations from single wake word


Architecture Assumptions

Assumption 16: Use Same Audio Path for Everything

What we thought: Use SoundDevice for both wake word detection AND recording

Reality: SoundDevice holds the audio device open, blocking arecord from accessing it. Also, SoundDevice doesn’t work well with ALSA direct mode.

What actually works:

  • SoundDevice for wake word detection (PortAudio)
  • arecord for command recording (direct ALSA)
  • Close SoundDevice stream before calling arecord

Impact: Recording failed with “Device busy” errors


Assumption 17: Synchronous Processing is Fine

What we thought: Process everything in the audio callback

Reality: Audio callbacks must be fast (<10ms) or you get dropouts. LLM inference takes seconds.

What actually works:

def audio_callback(indata, frames, time_info, status):
    # Only do fast operations here
    wake_detected = check_wake_word(indata)
    if wake_detected:
        trigger_processing_thread()  # Do slow work elsewhere

Impact: Audio dropouts, missed wake words
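The handoff can be sketched with a queue and a worker thread. The wake word check and command handler below are placeholders (the real ones call OpenWakeWord and whisper/LLM code); only the threading pattern is the point:

```python
import queue
import threading

def check_wake_word(chunk: bytes) -> bool:
    # Placeholder for the OpenWakeWord check; must stay fast (<10 ms).
    return chunk == b"hey jarvis"

def handle_command(chunk: bytes) -> None:
    # Placeholder for the slow path: transcription, LLM call, TTS.
    results.append(chunk)

results = []
work_queue: "queue.Queue[bytes | None]" = queue.Queue()

def worker():
    while True:
        chunk = work_queue.get()
        if chunk is None:        # sentinel: shut down cleanly
            break
        handle_command(chunk)    # seconds of work, but off the audio thread

def audio_callback(indata: bytes, frames, time_info, status):
    # Fast path only: a quick check and a non-blocking enqueue.
    if check_wake_word(indata):
        work_queue.put(indata)

t = threading.Thread(target=worker)
t.start()
audio_callback(b"hey jarvis", 1280, None, None)
audio_callback(b"background noise", 1280, None, None)
work_queue.put(None)
t.join()
print(results)  # [b'hey jarvis']
```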


The Big Picture Mistakes

Mistake 1: Not Reading be-more-agent Code First

We spent hours debugging when be-more-agent had already solved these problems. Lesson: Look for working reference implementations first.

Mistake 2: Assuming Errors Mean Broken Hardware

Multiple “Host is down” and SIGILL errors made us think the hardware was faulty. Lesson: Software/configuration issues are more likely than hardware failure.

Mistake 3: Changing Too Many Things at Once

We tried different sample rates, formats, and models simultaneously. Lesson: Change one variable at a time and test.

Mistake 4: Not Checking Audio Quality First

We assumed audio was good because the stream opened. Lesson: Always verify audio quality with test recordings before processing.


Checklist for Future Voice Projects

Before you start debugging:

  1. Record test audio: arecord -D plughw:0,0 -r 16000 -f S16_LE test.wav
  2. Verify audio quality by playing it back: aplay test.wav
  3. Check audio format matches model expectations
  4. Find a working reference implementation
  5. Test with simplest possible setup first
  6. Verify binary compatibility (ARM64 vs ARM32)
  7. Check all library dependencies
  8. Confirm chunk sizes match model requirements

Summary Table

Assumption                    | Reality                           | Time Wasted
Simple downsampling [::3]     | Use scipy.signal.resample         | 2 hours
float32 audio format          | Must use int16                    | 4 hours
Check immediate prediction    | Check prediction_buffer           | 3 hours
ALSA default device           | Must use plughw:0,0               | 1 hour
Pre-built binaries work       | Must compile for Pi 5             | 2 hours
Include only main binary      | Need all .so libraries            | 30 minutes
Any ONNX model works          | Need OpenWakeWord-specific models | 1 hour
Model takes raw audio         | Needs MFCC features               | 2 hours
GPIO button active high       | Actually active low               | 30 minutes
Audio chunk size flexible     | Must be 1280 samples              | 1 hour
System Python packages        | Must use /userdata/lib            | 1 hour
High score = good             | 0.5-0.95 is normal                | 30 minutes
One detection per utterance   | Need to reset buffer              | 1 hour
Same audio path for all       | Close stream before recording     | 2 hours
Synchronous processing        | Must use threads                  | 2 hours

Total time wasted on wrong assumptions: ~23 hours


Final Advice

When something doesn’t work:

  1. Don’t assume - Test every assumption
  2. Look for working examples - Someone has solved this before
  3. Read the source - Documentation lies, code doesn’t
  4. Check the basics - Audio quality, format, levels
  5. Change one thing at a time - Isolate variables
  6. Log everything - You can’t debug what you can’t see

The working implementation is the result of correcting ALL of these assumptions. Miss even one, and things break mysteriously.