Hermes Voice Gemini — Kodama Vault

Voice-first Hermes frontend built on Gemini Live, with an ask_hermes tool that delegates to Claude + MCPs. Integrated into hermes-gateway: it runs in the same bot and process and does NOT require a second Discord bot.

Architecture

Discord voice (DAVE E2EE) → VoiceReceiver (hermes-gateway)
  → 48kHz stereo PCM
  → audioop downmix + resample to 16kHz mono
  → GeminiVoiceBridge.send_user_pcm
  → Gemini Live WebSocket (audio=Blob, mime=audio/pcm;rate=16000)
  → Gemini responds with 24kHz mono PCM
  → upsample + stereo → GeminiStreamSource → vc.play()
  → Discord voice channel
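
The two sample-rate conversions in the pipeline above can be sketched as follows. The gateway uses audioop for this; the version below swaps in plain stdlib array arithmetic to make the steps visible, and is an assumption about the math rather than the gateway's code. The function names are hypothetical, samples are assumed to be 16-bit in native byte order, and no low-pass filter is applied before decimation.

```python
from array import array


def to_gemini_input(pcm_48k_stereo: bytes) -> bytes:
    """48 kHz stereo -> 16 kHz mono, 16-bit samples in native byte order."""
    samples = array("h")
    samples.frombytes(pcm_48k_stereo)
    # Samples are interleaved L, R, L, R, ...; average each pair to get mono.
    mono = array(
        "h",
        ((samples[i] + samples[i + 1]) // 2 for i in range(0, len(samples) - 1, 2)),
    )
    # 48000 / 16000 = 3, so keep every third frame (naive decimation).
    return mono[::3].tobytes()


def to_discord_output(pcm_24k_mono: bytes) -> bytes:
    """24 kHz mono -> 48 kHz stereo by repeating each sample."""
    mono = array("h")
    mono.frombytes(pcm_24k_mono)
    out = array("h")
    for s in mono:
        # Two copies double the rate; each copy is written to both channels.
        out.extend((s, s, s, s))
    return out.tobytes()
```

A 20 ms Discord frame (960 stereo frames, 3840 bytes) comes out of to_gemini_input as 320 mono frames (640 bytes). Sample repetition is the crudest possible upsampler; a real resampler would interpolate.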

Tool call path:
  Gemini decides ask_hermes is needed → bridge calls runner._handle_gemini_ask(query, uid)
  → builds synthetic MessageEvent with chat_id="voice-ask-<uid>"
  → runner._handle_message(event) runs full Claude+MCP agent
  → returns text → bridge sends FunctionResponse back to Gemini
  → Gemini speaks result

Key files

/app/tools/gemini_voice.py — GeminiVoiceBridge class (~250 LOC)
/app/gateway/platforms/discord.py — VoiceReceiver forwards PCM to self._gemini_bridge when attached
/app/gateway/run.py — _handle_gemini_ask method + startup wiring that attaches adapter._gemini_bridge_factory when GOOGLE_API_KEY is set
/home/hermes/hermes-home/.env — stores GOOGLE_API_KEY and GEMINI_MODEL=gemini-2.5-flash-native-audio-latest
/app/gateway/run.py — also hosts aiohttp /ask endpoint on :7171 (not used by voice bridge since it's in-process, but handy for external clients)

Current model + voice

Model: gemini-2.5-flash-native-audio-latest (only v1beta bidi-capable model with audio input on this key)
Voice: Charon (PT-BR)
Auto VAD (not manual — manual VAD tried and reverted)
Session resumption enabled via SessionResumptionConfig(handle=...) — captures session_resumption_update.new_handle from each recv and passes to next reconnect
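
The handle capture described above can be sketched as a small helper that the receive loop calls on every message. The function name is hypothetical; only the session_resumption_update and new_handle attribute names come from these notes. An update without a new_handle leaves the stored handle unchanged.

```python
from types import SimpleNamespace
from typing import Optional


def next_resumption_handle(current: Optional[str], message) -> Optional[str]:
    """Return the handle to pass on the next reconnect."""
    update = getattr(message, "session_resumption_update", None)
    new_handle = getattr(update, "new_handle", None)
    # Keep the previous handle when the server sends no update or an empty one.
    return new_handle or current


if __name__ == "__main__":
    msg = SimpleNamespace(
        session_resumption_update=SimpleNamespace(new_handle="handle-1")
    )
    print(next_resumption_handle(None, msg))
```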

Known quirks (Gemini Live)

50-second session lifetime: server sends go_away then closes. Bridge catches this and reconnects with resumption handle so conversation continues.
Keepalive ping timeouts (1011) happen transiently; supervisor auto-reconnects after 2s backoff.
Tool response + speak-back may need user silence: if user keeps speaking while Gemini is generating, its response is interrupted.
Response audio may be in response.data OR nested in server_content.model_turn.parts[*].inline_data.data — bridge handles both.
_handle_gemini_ask triggers a real Hermes agent run, which then tries to send its reply to chat_id="voice-ask-<uid>". The Discord adapter fails with invalid literal for int() because this chat_id is not numeric. Cosmetic only: the response text still returns to the bridge and goes back to Gemini. Could be suppressed by marking event.source.chat_type differently or by intercepting the send.
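
The two audio locations noted in the quirks above can be handled with one extractor. This is a sketch, not the bridge's code: the function name is hypothetical, and it assumes response.data, when present, already covers the nested parts, so it returns early to avoid queueing the same bytes twice.

```python
from types import SimpleNamespace
from typing import List


def extract_audio_chunks(response) -> List[bytes]:
    """Collect PCM bytes from either location a response may carry them in."""
    data = getattr(response, "data", None)
    if data:
        return [data]
    chunks: List[bytes] = []
    server_content = getattr(response, "server_content", None)
    model_turn = getattr(server_content, "model_turn", None)
    for part in getattr(model_turn, "parts", None) or []:
        inline = getattr(part, "inline_data", None)
        payload = getattr(inline, "data", None)
        if payload:
            chunks.append(payload)
    return chunks


if __name__ == "__main__":
    nested = SimpleNamespace(
        data=None,
        server_content=SimpleNamespace(
            model_turn=SimpleNamespace(
                parts=[SimpleNamespace(inline_data=SimpleNamespace(data=b"\x00\x01"))]
            )
        ),
    )
    print(extract_audio_chunks(nested))
```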

Confirmed working

Gemini Live session opens + stays connected (with resumption)
PCM reaches Gemini (send_user_pcm #1..#N logs)
Gemini transcribes + responds with audio (gemini audio out #N bytes=...)
ask_hermes tool call fires → Claude + MCPs run → returns text
response ready for chat=voice-ask-<uid> shows Claude executed properly

Current state (2026-04-23 after debug session)

Works:

Multi-turn conversation within a single ~60s session
ask_hermes tool call → Claude+MCPs → audio response back
Fast reconnect (FIRST_COMPLETED wait + _SessionClosed sentinel for go_away)
Manual VAD via ActivityStart/ActivityEnd with 1.2s silence threshold (optional; auto VAD also works)

Does NOT work:

Session resumption is not supported by gemini-2.5-flash-native-audio-latest — server sends empty session_resumption_update: {} (no new_handle, no resumable). Confirmed via raw response dump. Model limitation, not code bug. transparent=True raises ValueError: transparent parameter is not supported in Gemini API.
Consequence: every reconnect (every ~60s) = fresh conversation. User context lost. For longer conversations, need to manually re-inject turn history.
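
One way to re-inject turn history is a small rolling buffer that is rendered into a text preamble on each reconnect. This is only a sketch: TurnHistory is a hypothetical class, it assumes text transcripts of both sides are available to the bridge, and how the preamble is fed to the new session (system instruction or first client turn) is left open.

```python
from collections import deque
from typing import Deque, Tuple


class TurnHistory:
    """Keeps the last few turns so a fresh session can be primed with them."""

    def __init__(self, max_turns: int = 12) -> None:
        self._turns: Deque[Tuple[str, str]] = deque(maxlen=max_turns)

    def add(self, role: str, text: str) -> None:
        text = text.strip()
        if text:
            self._turns.append((role, text))

    def as_preamble(self) -> str:
        """Render the buffered turns as plain text, oldest first."""
        if not self._turns:
            return ""
        lines = [f"{role}: {text}" for role, text in self._turns]
        return "Conversation so far:\n" + "\n".join(lines)


if __name__ == "__main__":
    history = TurnHistory(max_turns=4)
    history.add("user", "What is on my Linear board?")
    history.add("model", "You have three open tickets.")
    print(history.as_preamble())
```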

Known issues (older)

Session lifetime ~60s: Gemini sends go_away after ~50s then keepalive ping timeout closes it. Supervisor reconnect path has a bug where reconnection sometimes doesn't fire after graceful go_away return from recv_loop. Observed: go_away logged, no Gemini Live session connected after.
Session resumption doesn't work: session_resumption_update arrives with resumable=None. Adding transparent=True to SessionResumptionConfig raises ValueError('transparent parameter is not supported in Gemini API'), even though the docs describe this field. Without transparent/resumable, each reconnect starts a fresh conversation (loses context).
Response latency is unpredictable: confirmed working responses took anywhere from 3s (good) to 90s (useless). Gemini's auto VAD behaves erratically with Discord's intermittent packet stream.
After ask_hermes tool result returned to Gemini, never observed Gemini vocalizing the result in this session.
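
For the supervisor reconnect bug listed above, one possible shape for the outer loop is to route a clean return and an error through the same path, so a graceful go_away cannot fall out of the loop. This is a sketch under that assumption, not a diagnosis of the actual bug; supervise, run_session and should_stop are hypothetical names, and the 2s default mirrors the backoff mentioned earlier in these notes.

```python
import asyncio


async def supervise(run_session, should_stop, backoff_s: float = 2.0) -> int:
    """Keep opening sessions until should_stop() is true; returns sessions run."""
    sessions = 0
    while not should_stop():
        sessions += 1
        try:
            await run_session()
        except asyncio.CancelledError:
            raise  # let shutdown propagate
        except Exception as exc:
            print(f"session ended with error: {exc!r}")
        # Clean return and error both land here, then reconnect after backoff.
        if not should_stop():
            await asyncio.sleep(backoff_s)
    return sessions
```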

Current state (2026-04-23 end-of-day)

GOOGLE_API_KEY commented out in /home/hermes/hermes-home/.env → bridge factory doesn't attach → VoiceReceiver falls back to Whisper+Claude flow.
Gemini integration code stays in /app/tools/gemini_voice.py and gateway wiring intact — just dormant until key is restored.
Text Hermes fully working (Sonnet 4.6, MCPs Linear/Excalidraw).

Next attempt should consider

Switch to OpenAI Realtime (gpt-4o-realtime): ~30x more expensive (~$0.30/min vs ~$0.01/min) but docs are mature, latency consistently <800ms, no 50s session limit. Same architecture: swap WebSocket provider + tool format, keep ask_hermes path.
If staying with Gemini: fix supervisor reconnect after graceful go_away return. Study why transparent=True was rejected (maybe SDK vs API version mismatch). Consider manual session-keepalive ping.

Debugging commands

Check Gemini recv: grep 'gemini recv\|audio out\|Gemini Live session' /home/hermes/hermes-home/logs/agent.log | tail -20
Check VAD flow: grep 'send_user_pcm\|Voice state\|ask_hermes' ... | tail -20
Check tool call end-to-end: grep 'ask_hermes\|response ready.*voice-ask' ...

Reverting to Whisper+Claude pipeline

If the Gemini path is broken, unset (or comment out) GOOGLE_API_KEY in /home/hermes/hermes-home/.env and restart. adapter._gemini_bridge_factory won't attach → _gemini_bridge stays None → VoiceReceiver falls back to the buffer/silence/Whisper flow (less natural, but proven).
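
The gating described above can be sketched as a single check at startup. attach_bridge_factory is a hypothetical helper and the adapter is modeled as a bare object; only the _gemini_bridge_factory attribute and the GOOGLE_API_KEY variable come from these notes.

```python
from types import SimpleNamespace
from typing import Callable, Mapping


def attach_bridge_factory(
    adapter, env: Mapping[str, str], factory: Callable[[], object]
) -> bool:
    """Attach the Gemini bridge factory only when an API key is configured."""
    api_key = env.get("GOOGLE_API_KEY", "").strip()
    if not api_key:
        # No key: leave the factory unset so the Whisper flow is used.
        adapter._gemini_bridge_factory = None
        return False
    adapter._gemini_bridge_factory = factory
    return True


if __name__ == "__main__":
    adapter = SimpleNamespace()
    print(attach_bridge_factory(adapter, {}, lambda: object()))
```

Passing os.environ as env would give the same behavior at real startup, provided the .env file has been loaded into the process environment first.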