Wrong Assumptions & Hard Lessons Learned

This document catalogs the incorrect assumptions we made during development and the reality behind each one. Hopefully it saves you from the same painful debugging.

Audio Processing Assumptions

Assumption 1: Simple Downsampling is Fine

What we thought:

# Simple downsampling from 48kHz to 16kHz
audio_data = audio_data[::3]  # Keep every 3rd sample

Reality: This destroys audio quality through aliasing and loses critical frequency information. The wake word model expects properly resampled audio.

What actually works:

from scipy import signal
audio_data = signal.resample(audio_data, target_samples).astype(np.int16)

Impact: Wake word scores went from ~0.000 to 0.5-0.95
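As a minimal sketch of the difference (assuming numpy and scipy are available; the tone and chunk size are our own illustration), naive decimation versus proper resampling of one 80ms chunk:

```python
import numpy as np
from scipy import signal

# Synthetic 1 kHz tone captured at 48 kHz: one 80 ms chunk (3840 samples).
rate_in, rate_out = 48000, 16000
t = np.arange(3840) / rate_in
chunk = (10000 * np.sin(2 * np.pi * 1000 * t)).astype(np.int16)

# Naive decimation: keep every 3rd sample. No anti-alias filter, so any
# content above 8 kHz folds back into the band the model actually sees.
naive = chunk[::3]

# Proper resampling: scipy low-pass filters during rate conversion.
target_samples = len(chunk) * rate_out // rate_in  # 1280
resampled = signal.resample(chunk, target_samples).astype(np.int16)

print(len(naive), len(resampled))  # both 1280 samples, very different spectra
```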


Assumption 2: Audio Format Doesn’t Matter Much

What we thought:

# float32 should be fine, it's more precise
audio_data = np.frombuffer(indata, dtype=np.float32)

Reality: The wake word model was trained on int16 audio. Float32 changes the amplitude scale and confuses the model’s feature extraction.

What actually works:

audio_data = np.frombuffer(indata, dtype=np.int16).flatten()

Impact: This was the #1 reason wake word detection failed
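If you are stuck with a float32 stream, you can convert it rather than feed floats to the model. A hedged sketch, assuming the floats are normalized to [-1.0, 1.0] (PortAudio's convention for float samples):

```python
import numpy as np

# Scale normalized float32 samples back to the int16 range the
# wake word model was trained on, clipping to avoid overflow.
float_chunk = np.array([0.0, 0.5, -0.5, 1.0], dtype=np.float32)
int16_chunk = np.clip(float_chunk * 32767, -32768, 32767).astype(np.int16)
# int16_chunk is now 0, 16383, -16383, 32767
```

Capturing int16 directly, as above, is still the safer path; the conversion only helps when the capture format is out of your control.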


Assumption 3: We Can Use Any Sample Rate

What we thought:

# Just set sounddevice to 16000Hz
sd.InputStream(samplerate=16000, ...)

Reality: The AIY Voice HAT only supports 48000Hz via PortAudio/SoundDevice. Attempting 16000Hz causes errors or silent failures.

What actually works:

# Hardware at 48000Hz, resample to 16000Hz for model
sd.InputStream(samplerate=48000, ...)
audio_data = signal.resample(audio_data, target_chunk_size)  # samples per chunk at 16kHz

Impact: Without this, the audio stream wouldn’t even open


Assumption 4: Check the Immediate Prediction Result

What we thought:

prediction = oww_model.predict(audio_data)
if prediction > threshold:
    # Wake word detected!

Reality: OpenWakeWord uses a sliding window of predictions (prediction_buffer), not instantaneous results. The immediate return value is meaningless.

What actually works:

oww_model.predict(audio_data)  # Just updates the buffer
for mdl in oww_model.prediction_buffer.keys():
    score = list(oww_model.prediction_buffer[mdl])[-1]
    if score > threshold:
        # Actually detected!

Impact: This was the #2 reason detection failed


Assumption 5: ALSA Default Device Works

What we thought:

arecord -D default -r 16000 -c 1 -f S16_LE test.wav

Reality: Batocera uses PipeWire, which conflicts with direct ALSA access. The “default” device routes through PipeWire and causes “Host is down” errors.

What actually works:

# Bypass PipeWire entirely
arecord -D plughw:0,0 -r 16000 -c 1 -f S16_LE test.wav

Impact: Recording was completely broken until we found this


Model & Binary Assumptions

Assumption 6: Pre-built Binaries Work on Pi 5

What we thought: Download whisper.cpp binaries from GitHub releases

Reality: Pre-built binaries are compiled for older ARM architectures and crash with SIGILL (illegal instruction) on Pi 5’s ARMv8.2-A.

What actually works: Compile whisper.cpp specifically for Pi 5 using Docker or cross-compilation:

docker run --rm -v $(pwd):/work arm64v8/debian:latest \
  bash -c "apt-get update && apt-get install -y cmake build-essential && \
  cd /work && cmake -B build -DWHISPER_BUILD_EXAMPLES=ON && \
  cmake --build build --config Release"

Impact: SIGILL crashes until we compiled ourselves


Assumption 7: Just Include the Main Binary

What we thought: Copy only whisper-cli to the Pi

Reality: whisper-cli depends on multiple .so libraries (libwhisper.so.1, libggml*.so*) that must be in the same directory or LD_LIBRARY_PATH.

What actually works: Copy the entire build output:

whisper-cli
libwhisper.so.1
libggml.so.0
libggml-base.so.0
libggml-cpu.so.0

Impact: “Library not found” errors
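A sketch of a launcher that fails with a readable message instead of the loader's error. WHISPER_DIR is a hypothetical path; point it at wherever you copied the build output:

```shell
# Run whisper-cli with its bundled shared libraries on the library path.
WHISPER_DIR="${WHISPER_DIR:-/userdata/voice-assistant/whisper}"
export LD_LIBRARY_PATH="$WHISPER_DIR${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"

# Check for every required .so before launching.
missing=""
for lib in libwhisper.so.1 libggml.so.0 libggml-base.so.0 libggml-cpu.so.0; do
    [ -e "$WHISPER_DIR/$lib" ] || missing="$missing $lib"
done

if [ -n "$missing" ]; then
    echo "missing libraries:$missing"
else
    "$WHISPER_DIR/whisper-cli" "$@"
fi
```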


Assumption 8: Any ONNX Model Works

What we thought: Any “Hey Jarvis” ONNX model from the internet would work

Reality: OpenWakeWord models are specifically trained with MFCC preprocessing and expect exact input dimensions [1, 16, 96]. Random ONNX models won’t work.

What actually works: Use models specifically trained for OpenWakeWord from their repository.

Impact: Model would load but produce garbage predictions


Assumption 9: Model Works on Raw Audio

What we thought: The model takes raw audio samples and does the feature extraction

Reality: The model expects pre-computed MFCC (Mel-Frequency Cepstral Coefficients) features, not raw audio. OpenWakeWord’s predict() method handles this internally.

What actually works: Use OpenWakeWord’s high-level API - it handles MFCC extraction internally.

Impact: Tried to manually compute features (waste of time)


Hardware & System Assumptions

Assumption 10: GPIO Button is Active High

What we thought:

if GPIO.input(BUTTON_PIN) == GPIO.HIGH:
    # Button pressed

Reality: The AIY HAT button is wired active-low (connected to ground when pressed).

What actually works:

if GPIO.input(BUTTON_PIN) == GPIO.LOW:
    # Button pressed

Impact: Button detection was inverted


Assumption 11: Audio Chunk Size Doesn’t Matter

What we thought: Any chunk size would work - just process whatever we get

Reality: OpenWakeWord expects specific chunk sizes (1280 samples = 80ms at 16kHz) for its internal buffering and MFCC computation.

What actually works:

CHUNK_SIZE = 1280  # 80ms at 16000Hz
input_chunk_size = int(CHUNK_SIZE * (input_rate / OWW_SAMPLE_RATE))

Impact: Wrong chunk sizes caused prediction delays and inaccuracies
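The two constraints above (hardware locked at 48kHz, model wanting 1280-sample chunks at 16kHz) combine into simple bookkeeping. A sketch with our own names, not the project's:

```python
import numpy as np
from scipy import signal

HW_RATE = 48000      # the only rate the AIY Voice HAT accepts
MODEL_RATE = 16000   # what the wake word model expects
MODEL_CHUNK = 1280   # 80 ms at 16 kHz

# Read 3x as many samples from the hardware so each resampled
# chunk comes out at exactly MODEL_CHUNK samples.
hw_chunk = MODEL_CHUNK * HW_RATE // MODEL_RATE  # 3840

def to_model_rate(raw_48k: np.ndarray) -> np.ndarray:
    """Resample one hardware chunk (int16 at 48 kHz) down to 16 kHz int16."""
    return signal.resample(raw_48k, MODEL_CHUNK).astype(np.int16)

silence = np.zeros(hw_chunk, dtype=np.int16)
print(to_model_rate(silence).shape)  # (1280,)
```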


Assumption 12: We Can Use System Python Packages

What we thought: Install scipy, numpy, etc. via pip on Batocera

Reality: Batocera is a read-only root filesystem. We must use /userdata directory and set PYTHONPATH.

What actually works:

sys.path.insert(0, '/userdata/voice-assistant/lib')

Impact: Couldn’t install packages the normal way
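A sketch of the path setup, assuming the packages were staged into the writable area beforehand (for example with `pip install --target=/userdata/voice-assistant/lib scipy numpy` run with the Pi's architecture in mind):

```python
import sys

# /userdata is the only writable area on Batocera's read-only rootfs,
# so third-party packages live there; prepend it so imports resolve first.
LIB_DIR = '/userdata/voice-assistant/lib'
if LIB_DIR not in sys.path:
    sys.path.insert(0, LIB_DIR)
```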


Process & Debugging Assumptions

Assumption 13: Wake Word Should Work Immediately

What we thought: If the wake word doesn’t detect on the first try, it’s broken

Reality: Wake word detection requires:

  1. Proper audio levels (not too quiet, not clipping)
  2. Clear pronunciation
  3. Appropriate distance from microphone
  4. Some models need a few seconds to “warm up”

What actually works: Test with consistent, clear speech at 6-12 inches from mic. Check audio levels first.

Impact: Thought it was broken when it just needed better test conditions


Assumption 14: High Score = Good Detection

What we thought: Scores near 1.0 are required for reliable detection

Reality: The “Hey Jarvis” model typically scores 0.5-0.95 when working correctly. Scores of 0.999 are suspicious and might indicate overfitting or wrong model.

What actually works: Threshold of 0.5 works well for this model.

Impact: Set threshold too high (0.8) and missed valid detections


Assumption 15: One Detection Per Wake Word

What we thought: Say “Hey Jarvis” once → one detection

Reality: Depending on chunk boundaries and audio processing, you might get multiple detections from a single utterance if you don’t reset the buffer.

What actually works:

if score > threshold:
    oww_model.reset()  # Clear the prediction buffer
    # Process command...

Impact: Multiple activations from single wake word


Architecture Assumptions

Assumption 16: Use Same Audio Path for Everything

What we thought: Use SoundDevice for both wake word detection AND recording

Reality: SoundDevice holds the audio device open, blocking arecord from accessing it. Also, SoundDevice doesn’t work well with ALSA direct mode.

What actually works:

  • SoundDevice for wake word detection (PortAudio)
  • arecord for command recording (direct ALSA)
  • Close SoundDevice stream before calling arecord

Impact: Recording failed with “Device busy” errors


Assumption 17: Synchronous Processing is Fine

What we thought: Process everything in the audio callback

Reality: Audio callbacks must be fast (<10ms) or you get dropouts. LLM inference takes seconds.

What actually works:

def audio_callback(indata, frames, time_info, status):
    # Only do fast operations here
    wake_detected = check_wake_word(indata)
    if wake_detected:
        trigger_processing_thread()  # Do slow work elsewhere

Impact: Audio dropouts, missed wake words
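The handoff can be sketched with a queue and a worker thread. The wake word check and command handler below are placeholders (the real ones call OpenWakeWord and whisper/LLM code); only the threading pattern is the point:

```python
import queue
import threading

def check_wake_word(chunk: bytes) -> bool:
    # Placeholder for the OpenWakeWord check; must stay fast (<10 ms).
    return chunk == b"hey jarvis"

def handle_command(chunk: bytes) -> None:
    # Placeholder for the slow path: transcription, LLM call, TTS.
    results.append(chunk)

results = []
work_queue: "queue.Queue[bytes | None]" = queue.Queue()

def worker():
    while True:
        chunk = work_queue.get()
        if chunk is None:        # sentinel: shut down cleanly
            break
        handle_command(chunk)    # seconds of work, but off the audio thread

def audio_callback(indata: bytes, frames, time_info, status):
    # Fast path only: a quick check and a non-blocking enqueue.
    if check_wake_word(indata):
        work_queue.put(indata)

t = threading.Thread(target=worker)
t.start()
audio_callback(b"hey jarvis", 1280, None, None)
audio_callback(b"background noise", 1280, None, None)
work_queue.put(None)
t.join()
print(results)  # [b'hey jarvis']
```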


The Big Picture Mistakes

Mistake 1: Not Reading be-more-agent Code First

We spent hours debugging when be-more-agent had already solved these problems. Lesson: Look for working reference implementations first.

Mistake 2: Assuming Errors Mean Broken Hardware

Multiple “Host is down” and SIGILL errors made us think the hardware was faulty. Lesson: Software/configuration issues are more likely than hardware failure.

Mistake 3: Changing Too Many Things at Once

We tried different sample rates, formats, and models simultaneously. Lesson: Change one variable at a time and test.

Mistake 4: Not Checking Audio Quality First

We assumed audio was good because the stream opened. Lesson: Always verify audio quality with test recordings before processing.


Checklist for Future Voice Projects

Before you start debugging:

  1. Record test audio: arecord -D plughw:0,0 -r 16000 -f S16_LE test.wav
  2. Verify audio quality by playing it back: aplay test.wav
  3. Check audio format matches model expectations
  4. Find a working reference implementation
  5. Test with simplest possible setup first
  6. Verify binary compatibility (ARM64 vs ARM32)
  7. Check all library dependencies
  8. Confirm chunk sizes match model requirements

Summary Table

Assumption                    | Reality                           | Time Wasted
Simple downsampling [::3]     | Use scipy.signal.resample         | 2 hours
float32 audio format          | Must use int16                    | 4 hours
Check immediate prediction    | Check prediction_buffer           | 3 hours
ALSA default device           | Must use plughw:0,0               | 1 hour
Pre-built binaries work       | Must compile for Pi 5             | 2 hours
Include only main binary      | Need all .so libraries            | 30 minutes
Any ONNX model works          | Need OpenWakeWord-specific models | 1 hour
Model takes raw audio         | Needs MFCC features               | 2 hours
GPIO button active high       | Actually active low               | 30 minutes
Audio chunk size flexible     | Must be 1280 samples              | 1 hour
System Python packages        | Must use /userdata/lib            | 1 hour
High score = good             | 0.5-0.95 is normal                | 30 minutes
One detection per utterance   | Need to reset buffer              | 1 hour
Same audio path for all       | Close stream before recording     | 2 hours
Synchronous processing        | Must use threads                  | 2 hours

Total time wasted on wrong assumptions: ~23 hours


Final Advice

When something doesn’t work:

  1. Don’t assume - Test every assumption
  2. Look for working examples - Someone has solved this before
  3. Read the source - Documentation lies, code doesn’t
  4. Check the basics - Audio quality, format, levels
  5. Change one thing at a time - Isolate variables
  6. Log everything - You can’t debug what you can’t see

The working implementation is the result of correcting ALL of these assumptions. Miss even one, and things break mysteriously.