Hermes Voice Gemini
GPT-4o-Realtime-style voice pipeline for Hermes using Gemini Live + Claude tool calls. Architecture, files, quirks, and known issues.
Voice-first Hermes frontend built on Gemini Live, with an ask_hermes tool that delegates to Claude + MCPs. Integrated into hermes-gateway (same bot/process; does NOT require a second Discord bot).
Architecture
Discord voice (DAVE E2EE) → VoiceReceiver (hermes-gateway)
→ 48kHz stereo PCM
→ audioop downmix + resample to 16kHz mono
→ GeminiVoiceBridge.send_user_pcm
→ Gemini Live WebSocket (audio=Blob, mime=audio/pcm;rate=16000)
→ Gemini responds with 24kHz mono PCM
→ upsample + stereo → GeminiStreamSource → vc.play()
→ Discord voice channel
Tool call path:
Gemini decides ask_hermes is needed → bridge calls runner._handle_gemini_ask(query, uid)
→ builds synthetic MessageEvent with chat_id="voice-ask-<uid>"
→ runner._handle_message(event) runs full Claude+MCP agent
→ returns text → bridge sends FunctionResponse back to Gemini
→ Gemini speaks result
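The two sample-rate conversions in the audio path can be sketched as follows. This is a minimal illustration, not the bridge's code: the source uses audioop, while this sketch uses plain array arithmetic so that it also runs on Python 3.13+, where audioop was removed from the standard library. The function names are hypothetical, and the naive averaging and sample repetition stand in for whatever filtering the real resampler applies.

```python
from array import array


def downmix_48k_stereo_to_16k_mono(pcm: bytes) -> bytes:
    """Convert 16-bit 48 kHz stereo PCM to 16 kHz mono.

    Each output sample is the average of three consecutive stereo frames
    (six input samples), which downmixes and decimates by 3 in one pass.
    Assumes a little-endian host, matching Discord's PCM byte order.
    """
    usable = len(pcm) - len(pcm) % 12  # keep whole groups of 3 stereo frames
    samples = array("h")
    samples.frombytes(pcm[:usable])
    out = array("h", (sum(samples[i:i + 6]) // 6 for i in range(0, len(samples), 6)))
    return out.tobytes()


def upsample_24k_mono_to_48k_stereo(pcm: bytes) -> bytes:
    """Convert 16-bit 24 kHz mono PCM to 48 kHz stereo by sample repetition.

    Each input sample is written four times: twice to double the rate,
    and once per channel (left, right) for each of those two frames.
    """
    usable = len(pcm) - len(pcm) % 2  # drop a trailing half sample, if any
    samples = array("h")
    samples.frombytes(pcm[:usable])
    out = array("h")
    for s in samples:
        out.extend((s, s, s, s))
    return out.tobytes()
```

A 20 ms Discord frame (3840 bytes at 48 kHz stereo) comes out as 640 bytes at 16 kHz mono, which is the size send_user_pcm would receive under these assumptions.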
Key files
- /app/tools/gemini_voice.py: GeminiVoiceBridge class (~250 LOC)
- /app/gateway/platforms/discord.py: VoiceReceiver forwards PCM to self._gemini_bridge when attached
- /app/gateway/run.py: _handle_gemini_ask method + startup wiring that attaches adapter._gemini_bridge_factory when GOOGLE_API_KEY is set
- /home/hermes/hermes-home/.env: stores GOOGLE_API_KEY and GEMINI_MODEL=gemini-2.5-flash-native-audio-latest
- /app/gateway/run.py: also hosts an aiohttp /ask endpoint on :7171 (not used by the voice bridge since it's in-process, but handy for external clients)
Current model + voice
- Model: gemini-2.5-flash-native-audio-latest (only v1beta bidi-capable model with audio input on this key)
- Voice: Charon (PT-BR)
- Auto VAD (not manual; manual VAD was tried and reverted)
- Session resumption enabled via SessionResumptionConfig(handle=...): captures session_resumption_update.new_handle from each recv and passes it to the next reconnect
Known quirks (Gemini Live)
- 50-second session lifetime: server sends go_away then closes. Bridge catches this and reconnects with the resumption handle so the conversation continues.
- Keepalive ping timeouts (1011) happen transiently; supervisor auto-reconnects after 2s backoff.
- Tool response + speak-back may need user silence: if the user keeps speaking while Gemini is generating, its response is interrupted.
- Response audio may be in response.data OR nested in server_content.model_turn.parts[*].inline_data.data; the bridge handles both.
- _handle_gemini_ask triggers a real Hermes agent run which tries to send the reply to chat_id="voice-ask-<uid>"; the Discord adapter fails with invalid literal for int() on this chat_id. Cosmetic only: the response text still returns to the bridge and goes back to Gemini. Could be suppressed by marking event.source.chat_type differently or by intercepting the send.
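The "bridge handles both" behavior for response audio could look roughly like the sketch below. It is duck-typed on purpose so it does not depend on any SDK class; the attribute names come from the quirk described above, while extract_audio_chunks and the SimpleNamespace stand-ins are hypothetical and only for illustration.

```python
from types import SimpleNamespace
from typing import Any, List


def extract_audio_chunks(response: Any) -> List[bytes]:
    """Return PCM chunks from a Live response, checking both known locations."""
    # Location 1: audio exposed directly on the response.
    direct = getattr(response, "data", None)
    if direct:
        return [direct]

    # Location 2: audio nested under server_content.model_turn.parts[*].
    server_content = getattr(response, "server_content", None)
    model_turn = getattr(server_content, "model_turn", None)
    chunks: List[bytes] = []
    for part in getattr(model_turn, "parts", None) or []:
        inline = getattr(part, "inline_data", None)
        data = getattr(inline, "data", None)
        if data:
            chunks.append(data)
    return chunks


# Stand-in for a response that only carries nested audio (hypothetical shape).
fake_response = SimpleNamespace(
    data=None,
    server_content=SimpleNamespace(
        model_turn=SimpleNamespace(
            parts=[SimpleNamespace(inline_data=SimpleNamespace(data=b"\x01\x02"))]
        )
    ),
)
nested_chunks = extract_audio_chunks(fake_response)  # [b"\x01\x02"]
```

Returning early when the direct field is populated avoids playing the same audio twice if a response happens to expose it in both places.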
Confirmed working
- Gemini Live session opens + stays connected (with resumption)
- PCM reaches Gemini (send_user_pcm #1..#N logs)
- Gemini transcribes + responds with audio (gemini audio out #N bytes=...)
- ask_hermes tool call fires → Claude + MCPs run → returns text
- response ready for chat=voice-ask-<uid> shows Claude executed properly
Current state (2026-04-23 after debug session)
Works:
- Multi-turn conversation within a single ~60s session
- ask_hermes tool call → Claude+MCPs → audio response back
- Fast reconnect (FIRST_COMPLETED wait + _SessionClosed sentinel for go_away)
- Manual VAD via ActivityStart/ActivityEnd with 1.2s silence threshold (optional; auto VAD also works)
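The manual VAD with a 1.2s silence threshold can be sketched as a small state machine. This is an assumption-laden illustration, not the bridge's implementation: SilenceGate, the RMS threshold of 500, and the event strings are all hypothetical, and the caller is assumed to translate "activity_start" / "activity_end" into the ActivityStart / ActivityEnd signals. Timestamps are passed in explicitly because Discord stops sending packets during silence, so the end of speech has to be detectable without a new frame arriving.

```python
from array import array
from typing import Optional


def rms(pcm16_mono: bytes) -> float:
    """Root-mean-square level of 16-bit mono PCM (0.0 for empty input)."""
    samples = array("h")
    samples.frombytes(pcm16_mono[: len(pcm16_mono) - len(pcm16_mono) % 2])
    if not samples:
        return 0.0
    return (sum(s * s for s in samples) / len(samples)) ** 0.5


class SilenceGate:
    """Emit start/end events around speech using a silence timeout."""

    def __init__(self, silence_s: float = 1.2, rms_threshold: float = 500.0):
        self.silence_s = silence_s
        self.rms_threshold = rms_threshold  # hypothetical value, tune per mic
        self.speaking = False
        self._last_voice = 0.0

    def feed(self, pcm16_mono: bytes, now: float) -> Optional[str]:
        """Process one audio frame received at time `now` (seconds)."""
        if rms(pcm16_mono) >= self.rms_threshold:
            self._last_voice = now
            if not self.speaking:
                self.speaking = True
                return "activity_start"
            return None
        return self.tick(now)

    def tick(self, now: float) -> Optional[str]:
        """Call periodically, even when no packets arrive, to detect the end."""
        if self.speaking and now - self._last_voice >= self.silence_s:
            self.speaking = False
            return "activity_end"
        return None
```

Calling tick from a timer (for example every 100 ms) is what makes the gate robust to an intermittent packet stream; feed alone would never fire the end event if the user simply stops transmitting.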
Does NOT work:
- Session resumption is not supported by gemini-2.5-flash-native-audio-latest: server sends an empty session_resumption_update: {} (no new_handle, no resumable). Confirmed via raw response dump. Model limitation, not a code bug. transparent=True raises ValueError: transparent parameter is not supported in Gemini API.
- Consequence: every reconnect (every ~60s) = fresh conversation. User context is lost. For longer conversations, turn history needs to be re-injected manually.
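One possible shape for the manual re-injection mentioned above is a bounded turn buffer that renders a text preamble for the next session. Everything here is hypothetical: the TurnHistory class, the 12-turn cap, and the assumption that the bridge has text for each turn (for example ask_hermes queries and their results) are not from the source, and how the preamble is handed to Gemini on reconnect is left open.

```python
from collections import deque
from typing import Deque, Tuple


class TurnHistory:
    """Keep the most recent turns so a fresh session can be primed with them."""

    def __init__(self, max_turns: int = 12):
        # deque(maxlen=...) silently drops the oldest turn once full.
        self._turns: Deque[Tuple[str, str]] = deque(maxlen=max_turns)

    def add(self, role: str, text: str) -> None:
        """Record one turn; blank text is ignored."""
        text = text.strip()
        if text:
            self._turns.append((role, text))

    def as_preamble(self) -> str:
        """Render the stored turns as plain text, or "" if there are none."""
        if not self._turns:
            return ""
        lines = [f"{role}: {text}" for role, text in self._turns]
        return "Conversation so far (restored after reconnect):\n" + "\n".join(lines)
```

Capping the buffer keeps the preamble short enough to send on every ~60s reconnect without growing unbounded over a long call.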
Known issues (older)
- Session lifetime ~60s: Gemini sends go_away after ~50s, then a keepalive ping timeout closes it. The supervisor reconnect path has a bug where reconnection sometimes doesn't fire after a graceful go_away return from recv_loop. Observed: go_away logged, no Gemini Live session connected after.
- Session resumption doesn't work: session_resumption_update arrives with resumable=None. Adding transparent=True to SessionResumptionConfig raises ValueError('transparent parameter is not supported in Gemini API'); docs lie about this field. Without transparent/resumable, each reconnect starts a fresh conversation (loses context).
- Response latency unpredictable: confirmed working responses took from 3s (good) up to 90s (useless). Gemini's auto VAD behaves erratically with Discord's intermittent packet stream.
- After the ask_hermes tool result was returned to Gemini, Gemini was never observed vocalizing the result in this session.
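For the reconnect bug above, a supervisor loop that treats a graceful return and an exception identically might look like this sketch. It is not the existing supervisor: supervise, run_session, and stop are hypothetical names, and run_session is assumed to return normally on go_away and raise on errors such as a 1011 ping timeout. The 2s default mirrors the backoff mentioned earlier.

```python
import asyncio
from typing import Awaitable, Callable


async def supervise(
    run_session: Callable[[], Awaitable[None]],
    stop: asyncio.Event,
    backoff_s: float = 2.0,
) -> int:
    """Run sessions back to back until `stop` is set; return how many started."""
    started = 0
    while not stop.is_set():
        started += 1
        try:
            await run_session()  # assumed: returns on go_away, raises on errors
        except asyncio.CancelledError:
            raise  # let task cancellation propagate
        except Exception as exc:
            print(f"session ended with error: {exc!r}")
        # Reached after BOTH a graceful return and an error, so a go_away
        # can never leave the loop without a reconnect attempt.
        try:
            await asyncio.wait_for(stop.wait(), timeout=backoff_s)
        except asyncio.TimeoutError:
            pass  # backoff elapsed, loop around and reconnect
    return started
```

Waiting on the stop event instead of a bare sleep keeps shutdown prompt: setting stop during the backoff ends the loop immediately rather than after the full delay.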
Current state (2026-04-23 end-of-day)
- GOOGLE_API_KEY commented out in /home/hermes/hermes-home/.env → bridge factory doesn't attach → VoiceReceiver falls back to the Whisper+Claude flow.
- Gemini integration code stays in /app/tools/gemini_voice.py and the gateway wiring is intact; it is just dormant until the key is restored.
- Text Hermes fully working (Sonnet 4.6, MCPs Linear/Excalidraw).
Next attempt should consider
- Switch to OpenAI Realtime (gpt-4o-realtime): 30x more expensive ($0.30/min vs ~$0.01/min) but docs are mature, latency is consistently <800ms, and there is no 50s session limit. Same architecture: swap the WebSocket provider + tool format, keep the ask_hermes path.
- If staying with Gemini: fix the supervisor reconnect after a graceful go_away return. Study why transparent=True was rejected (maybe an SDK vs API version mismatch). Consider a manual session-keepalive ping.
Debugging commands
- Check Gemini recv: grep 'gemini recv\|audio out\|Gemini Live session' /home/hermes/hermes-home/logs/agent.log | tail -20
- Check VAD flow: grep 'send_user_pcm\|Voice state\|ask_hermes' ... | tail -20
- Check tool call end-to-end: grep 'ask_hermes\|response ready.*voice-ask' ...
Reverting to Whisper+Claude pipeline
If the Gemini path is broken, unset GOOGLE_API_KEY in /home/hermes/hermes-home/.env and restart. adapter._gemini_bridge_factory won't attach → _gemini_bridge stays None → VoiceReceiver falls back to the buffer/silence/Whisper flow (which is less natural but proven).
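The key-gated attach that makes this revert work can be illustrated with a small sketch. It only mirrors the behavior described above; pick_bridge_factory and its parameters are hypothetical and do not reflect the actual wiring in run.py.

```python
from typing import Callable, Mapping, Optional


def pick_bridge_factory(
    env: Mapping[str, str],
    factory: Callable[[], object],
) -> Optional[Callable[[], object]]:
    """Return the Gemini bridge factory only when GOOGLE_API_KEY is present.

    A missing, commented-out, or blank key yields None, which is the state
    in which the voice receiver is expected to use the Whisper fallback.
    """
    key = env.get("GOOGLE_API_KEY", "").strip()
    return factory if key else None
```

In practice the env mapping would be os.environ after the .env file is loaded; a commented-out line never reaches the environment, so it behaves the same as an unset key.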