The outbound call connects and TTS says "您好" ("hello") with both the XFYUN standard voice and the clone voice.
After the greeting finishes, the callee speaks, but the robot does not respond.
Expected
After the greeting ends, ASR should receive the callee's speech and trigger the next interaction round.
Hypotheses
H1: ASR stays paused after TTS playback and never resumes, so the callee's speech is never delivered to the robot loop.
H2: recvPlayBackEndEvent or ttsChannelClosed is not updated correctly, so the main loop keeps waiting and drops ASR events.
H3: The first greeting still goes through the direct streaming path (sendToTts()/closeTts()) and bypasses the new file-playback flow, causing a TTS/ASR timing mismatch.
H4: ASR does receive speech, but allowInterrupt / VAD gating drops the result before interactWithRobot() can continue.
H5: The outbound greeting completes, but the robot never re-enters the next wait cycle because the playback-finished signal fires too early or too late.
Existing Evidence
Prior logs showed that the XFYUN clone voice could buffer a large amount of audio and keep copying after the final frame.
Prior logs also showed Doubao duplicate playback caused by a Java-side double feed, which has been narrowed down separately.
The current symptom affects both the XFYUN standard and clone voices, so shared post-TTS state handling is the prime suspect.
The first 100 log lines of standard-voice call 2606022315190110001 confirm normal outbound setup:
call answered successfully
record_session attached
llm_wait.wav played
xf_tts_mode variable was set
The provided slice stops before the key evidence region. It does not yet include:
first speak(...)
Speech-Open / Speech-Closed
PLAYBACK_STOP
ASR middle/vad/result logs
Java-side robot state logs (waitForCustomerSpeak, resumeAsr, dropped ASR decisions)
Additional evidence from standard voice call 2606022315190110001:
Speech-Closed is emitted at 23:15:25.928
ASR continues reading and sending audio after that
ASR produces valid middle and vad results:
text=对 ("right")
text=哎你好 ("oh, hello")
text=喂,你好。 ("hello? hello.")
text=你好 ("hello")
text=你好。 ("hello.")
Despite valid ASR results, the next robot reply at 23:15:33.448 is the no-hear retry prompt: 不好意思我刚刚没听清,您能再说一遍吗? ("Sorry, I didn't catch that just now; could you say it again?")
The user confirmed that Doubao TTS works normally in the same outbound conversation flow and sustains multi-turn dialogue.
Hypothesis Status
H1 rejected: ASR is not paused indefinitely. The logs show continuous audio reads and sends, and valid results being generated.
H2 confirmed as root cause in Java state handling: Speech-Closed does arrive, but recvPlayBackEndEvent was not set to true in the TtsEvent -> Speech-Closed branch. As a result, XFYUN ASR results were dropped by the playback gate (!getAllowInterrupt() && !recvPlayBackEndEvent).
H3 confirmed as contributing factor: the greeting still goes through direct streaming TTS path.
H4 strongly supported: the callee's speech reaches ASR, but the robot logic does not consume it as valid next-turn input and falls back to the no-hear retry prompt.
H5 still possible: application-side playback-finished / wake-up signaling may be delayed or mismatched with ASR result timing.
New narrowing: because Doubao works, the failure most likely lies in XFYUN-specific event semantics, XFYUN resume/close handling, or XFYUN path differences in the first-turn streaming flow, rather than in a fully generic outbound/ASR loop failure.
Fix Applied
In RobotChat, when receiving CUSTOM TtsEvent -> Speech-Closed, also set:
recvPlayBackEndEvent = true
playbackEndTime = System.currentTimeMillis()
This aligns XFYUN Speech-Closed semantics with the gating logic used later in waitForCustomerSpeakEx() and ASR event consumption.
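The interaction between the Speech-Closed branch and the playback gate can be sketched roughly as follows. This is a simplified model, not the actual RobotChat code: the class PlaybackGateSketch, the methods onPlaybackStart(), onTtsEvent() and shouldDropAsrResult(), and the injected clock are hypothetical. Only the two field names, the Speech-Closed subtype, and the gate expression are taken from these notes.

```java
import java.util.function.LongSupplier;

/** Hypothetical, simplified stand-in for the playback state kept in RobotChat. */
final class PlaybackGateSketch {
    private final boolean allowInterrupt;
    private final LongSupplier clock; // assumption: injected so the sketch is testable
    private volatile boolean recvPlayBackEndEvent = false;
    private volatile long playbackEndTime = 0L;

    PlaybackGateSketch(boolean allowInterrupt, LongSupplier clock) {
        this.allowInterrupt = allowInterrupt;
        this.clock = clock;
    }

    boolean getAllowInterrupt() {
        return allowInterrupt;
    }

    /** Assumed hook: a new TTS playback starts, so ASR results are gated again. */
    void onPlaybackStart() {
        recvPlayBackEndEvent = false;
    }

    /** Handles the subtype of a CUSTOM TtsEvent; other subtypes are ignored here. */
    void onTtsEvent(String subtype) {
        if ("Speech-Closed".equals(subtype)) {
            // The two assignments described in the fix.
            recvPlayBackEndEvent = true;
            playbackEndTime = clock.getAsLong();
        }
    }

    /** The playback gate quoted in the hypothesis status: true means the ASR result is dropped. */
    boolean shouldDropAsrResult() {
        return !getAllowInterrupt() && !recvPlayBackEndEvent;
    }

    long getPlaybackEndTime() {
        return playbackEndTime;
    }
}
```

In production the clock would presumably be System::currentTimeMillis. Without the assignment in the Speech-Closed branch, shouldDropAsrResult() stays true for the whole turn whenever interruption is disabled, which matches the dropped XFYUN results seen in the logs.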
Added an XFYUN-specific ASR resume guard delay before resumeAsr():
config key: xfyun-asr-resume-delay-ms
default: 600
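A minimal sketch of how such a guard delay could be read and applied is shown here. The config key and the default of 600 come from the notes; everything else is an assumption made for illustration: the class and method names are hypothetical, the config is assumed to be available as java.util.Properties, and the delay is assumed to be measured from playbackEndTime rather than from the moment resumeAsr() is requested.

```java
import java.util.Properties;

/** Hypothetical helper around the xfyun-asr-resume-delay-ms setting. */
final class XfyunAsrResumeDelaySketch {
    static final String KEY = "xfyun-asr-resume-delay-ms";
    static final long DEFAULT_DELAY_MS = 600L;

    /** Reads the delay, falling back to the default on missing, malformed, or negative values. */
    static long configuredDelayMs(Properties props) {
        String raw = props.getProperty(KEY);
        if (raw == null) {
            return DEFAULT_DELAY_MS;
        }
        try {
            long value = Long.parseLong(raw.trim());
            return value < 0 ? DEFAULT_DELAY_MS : value;
        } catch (NumberFormatException e) {
            return DEFAULT_DELAY_MS;
        }
    }

    /** Assumption: the guard counts from playback end, so time already elapsed is credited. */
    static long remainingDelayMs(long nowMs, long playbackEndTimeMs, long delayMs) {
        long elapsed = Math.max(0L, nowMs - playbackEndTimeMs);
        return Math.max(0L, delayMs - elapsed);
    }

    /** Waits out the remaining guard time, then invokes the supplied resume action. */
    static void resumeAfterGuard(long playbackEndTimeMs, long delayMs, Runnable resumeAsr)
            throws InterruptedException {
        long waitMs = remainingDelayMs(System.currentTimeMillis(), playbackEndTimeMs, delayMs);
        if (waitMs > 0) {
            Thread.sleep(waitMs);
        }
        resumeAsr.run();
    }
}
```

Blocking with Thread.sleep keeps the sketch short; if the real call site runs on an event thread, a scheduled task would be the safer choice.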
Added an XFYUN-specific short-window greeting echo filter:
config key: xfyun-greeting-filter-window-ms
default: 1800
filters immediate post-playback short greetings like 你好/您好/喂你好/嗯你好 (all variants of "hello")
goal: keep the robot's own playback bleed from entering the next LLM turn
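The filter could look roughly like the sketch below. The default window of 1800 ms and the four greeting strings come from the notes. The class and method names, the punctuation-stripping normalization, and the exact-match rule are assumptions made for illustration and may differ from the real implementation.

```java
import java.util.Set;

/** Hypothetical short-window greeting echo filter for XFYUN ASR results. */
final class GreetingEchoFilterSketch {
    static final long DEFAULT_WINDOW_MS = 1800L;

    // Greeting list taken from the notes above.
    private static final Set<String> SHORT_GREETINGS =
            Set.of("你好", "您好", "喂你好", "嗯你好");

    /** Assumption: punctuation and whitespace are stripped, so "喂,你好。" becomes "喂你好". */
    static String normalize(String text) {
        return text == null ? "" : text.replaceAll("[\\p{P}\\p{Z}\\s]", "");
    }

    /** True when the result is a listed greeting that arrived within the window after playback end. */
    static boolean isGreetingEcho(String asrText, long resultTimeMs,
                                  long playbackEndTimeMs, long windowMs) {
        long sincePlaybackEnd = resultTimeMs - playbackEndTimeMs;
        boolean insideWindow = sincePlaybackEnd >= 0 && sincePlaybackEnd <= windowMs;
        return insideWindow && SHORT_GREETINGS.contains(normalize(asrText));
    }
}
```

A filter of this shape cannot tell a real greeting from the callee apart from playback bleed inside the window, so the window length trades echo suppression against the risk of discarding a genuine first utterance.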