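WebSocket Real-Time Speech Recognition

A streaming speech-to-text service built on the WebSocket protocol. It supports real-time "transcribe while recording" interaction.

Core Features

  • Full-duplex real-time feedback: audio upload and text delivery run at the same time, with low latency
  • Streaming speech input: pauses in speech are detected automatically and the audio is split into sentences
  • Multilingual support: 19 languages, including Chinese, English, and Japanese
  • Translation: recognized content is translated automatically
  • Intelligent instruction translation: when the speech contains a translation request, it is executed automatically. For example, "今天天气很好,翻译成英文" ("The weather is nice today, translate it into English") → outputs "The weather is nice today"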

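Use Cases

  • Human-computer interaction: underlying speech recognition for voice assistants, in-vehicle systems, and smart robots
  • Real-time captions: live text display for meetings, online classes, and live streams
  • Cross-language translation: spoken translation instructions in cross-border calls and simultaneous interpretation
  • Voice dictation: streaming speech input for office work, interviews, and shorthand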

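API

1. Overview

Upload audio as a stream and receive recognition results.

2. Connection Information

  • Endpoint: wss://api.senseaudio.cn/ws/v1/audio/transcriptions
  • Protocol: WebSocket
  • Authentication: Bearer Token (passed in a header during the HTTP handshake)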
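
As a minimal sketch of the connection details above, the helper below assembles the endpoint and the handshake header. The `Authorization` header name follows the usage examples in section 5; the function name `build_handshake` is hypothetical and only serves the illustration.

```python
WS_URL = "wss://api.senseaudio.cn/ws/v1/audio/transcriptions"


def build_handshake(api_key: str) -> tuple[str, dict[str, str]]:
    """Return the endpoint URL and the headers for the HTTP upgrade request."""
    if not api_key:
        raise ValueError("api_key must not be empty")
    # The Bearer token travels in a handshake header, not in a WebSocket message.
    return WS_URL, {"Authorization": f"Bearer {api_key}"}


url, headers = build_handshake("YOUR_API_KEY")
print(url)
print(headers)
```

Keeping the token in one place like this makes it easier to pass the same headers to whichever WebSocket client library is in use.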

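3. Protocol Details

Communication uses two message types:

  1. Control messages (Text/JSON): send commands (such as start and finish) and receive server status and results.
  2. Audio messages (Binary): upload raw audio data.

3.1 Control Message Structure

All control messages are JSON.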

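Client request message structure (WSSttClientMessage)

| Field | Type | Required | Description |
|---|---|---|---|
| event | string | Yes | Event type. Allowed values: task_start, task_finish |
| model | string | Yes (task_start) | Model name. Currently only sense-asr-deepthink is supported |
| audio_setting | object | Yes (task_start) | Audio parameters; see details below |
| vad_setting | object | No | VAD (voice activity detection) settings; see details below |
| transcription_setting | object | No | Recognition settings; see details below |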
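
The table above can be read as two message shapes: a `task_start` that carries the model and settings, and a `task_finish` that carries only the event. The following is a sketch under that reading; the helper names `make_task_start` and `make_task_finish` are hypothetical.

```python
import json

SUPPORTED_MODEL = "sense-asr-deepthink"


def make_task_start(audio_setting, vad_setting=None, transcription_setting=None):
    """Serialize a task_start control message; optional objects are omitted when unset."""
    message = {
        "event": "task_start",
        "model": SUPPORTED_MODEL,
        "audio_setting": audio_setting,
    }
    if vad_setting is not None:
        message["vad_setting"] = vad_setting
    if transcription_setting is not None:
        message["transcription_setting"] = transcription_setting
    return json.dumps(message)


def make_task_finish():
    """Serialize a task_finish control message, which needs only the event field."""
    return json.dumps({"event": "task_finish"})


start = make_task_start({"sample_rate": 16000, "channel": 1, "format": "pcm"})
print(start)
print(make_task_finish())
```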

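Parameter Details

1. audio_setting

| Field | Type | Required | Description |
|---|---|---|---|
| sample_rate | int | Yes | Sample rate. Currently only 16000 is supported |
| channel | int | Yes | Number of channels. Currently only 1 (mono) is supported |
| format | string | Yes | Audio format. Currently only pcm is supported |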
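
Because the audio parameters are fixed, the size of an audio chunk follows directly from its duration. The arithmetic below assumes 16-bit samples, as stated for the pcm format elsewhere in this document; the function names are hypothetical.

```python
SAMPLE_RATE = 16000    # samples per second (the only supported rate)
CHANNELS = 1           # mono
BYTES_PER_SAMPLE = 2   # 16-bit PCM

BYTES_PER_SECOND = SAMPLE_RATE * CHANNELS * BYTES_PER_SAMPLE


def chunk_bytes(duration_ms: int) -> int:
    """Number of raw PCM bytes that cover duration_ms of audio."""
    return BYTES_PER_SECOND * duration_ms // 1000


def audio_duration_ms(num_bytes: int) -> float:
    """Playback duration in milliseconds of num_bytes of raw PCM."""
    return num_bytes * 1000 / BYTES_PER_SECOND


print(chunk_bytes(100))          # 3200
print(audio_duration_ms(32000))  # 1000.0
```

This is where the 3200-byte chunk size in the usage examples comes from: 16000 samples/s × 2 bytes × 0.1 s.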

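2. vad_setting (optional)

| Field | Type | Description | Default |
|---|---|---|---|
| silence_duration | int | Silence threshold for splitting (ms) | 500 |
| min_speech_duration | int | Minimum speech duration (ms) | 300 |
| soft_max_duration | int | Soft timeout duration (ms) | 15000 |
| hard_max_duration | int | Hard timeout duration (ms) | 30000 |
| soft_silence_duration | int | Silence threshold after the soft timeout (ms) | 300 |
| threshold | float | VAD energy threshold (0.0 - 1.0) | 0.5 |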
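
One way to work with these fields on the client side is to validate overrides against the documented names and range before sending them. The sketch below assumes the server applies the listed defaults to any omitted field, so it only emits values that differ; `make_vad_setting` is a hypothetical helper, not part of the API.

```python
VAD_DEFAULTS = {
    "silence_duration": 500,
    "min_speech_duration": 300,
    "soft_max_duration": 15000,
    "hard_max_duration": 30000,
    "soft_silence_duration": 300,
    "threshold": 0.5,
}


def make_vad_setting(**overrides):
    """Return a vad_setting object holding only the fields that differ from the defaults."""
    unknown = set(overrides) - set(VAD_DEFAULTS)
    if unknown:
        raise ValueError(f"unknown vad_setting fields: {sorted(unknown)}")
    threshold = overrides.get("threshold", VAD_DEFAULTS["threshold"])
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be within 0.0 - 1.0")
    # Assumption: omitted fields fall back to the documented defaults on the server.
    return {key: value for key, value in overrides.items() if value != VAD_DEFAULTS[key]}


print(make_vad_setting(silence_duration=800, threshold=0.5))
```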

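3. transcription_setting (optional)

| Field | Type | Description | Example |
|---|---|---|---|
| target_language | string | Target language code (see the table below) | "en", "zh" |
| recognize_mode | string | Recognition mode | "auto" (default), "record_only" (recognize only, do not execute instructions) |

Supported languages (target_language):

| Code | Language | Code | Language | Code | Language |
|---|---|---|---|---|---|
| ar | Arabic | yue | Cantonese | zh | Chinese |
| nl | Dutch | en | English | fr | French |
| de | German | id | Indonesian | it | Italian |
| ja | Japanese | ko | Korean | ms | Malay |
| pt | Portuguese | ru | Russian | es | Spanish |
| th | Thai | tr | Turkish | ur | Urdu |
| vi | Vietnamese | | | | |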
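
Since the language codes and recognition modes form closed sets, a client can reject invalid values before opening a connection. This is a sketch of such a check; `make_transcription_setting` is a hypothetical helper name.

```python
SUPPORTED_LANGUAGES = {
    "ar": "Arabic", "yue": "Cantonese", "zh": "Chinese",
    "nl": "Dutch", "en": "English", "fr": "French",
    "de": "German", "id": "Indonesian", "it": "Italian",
    "ja": "Japanese", "ko": "Korean", "ms": "Malay",
    "pt": "Portuguese", "ru": "Russian", "es": "Spanish",
    "th": "Thai", "tr": "Turkish", "ur": "Urdu",
    "vi": "Vietnamese",
}
RECOGNIZE_MODES = {"auto", "record_only"}


def make_transcription_setting(target_language=None, recognize_mode=None):
    """Build a transcription_setting object from validated, explicitly given values."""
    setting = {}
    if target_language is not None:
        if target_language not in SUPPORTED_LANGUAGES:
            raise ValueError(f"unsupported target_language: {target_language!r}")
        setting["target_language"] = target_language
    if recognize_mode is not None:
        if recognize_mode not in RECOGNIZE_MODES:
            raise ValueError(f"unsupported recognize_mode: {recognize_mode!r}")
        setting["recognize_mode"] = recognize_mode
    return setting


print(len(SUPPORTED_LANGUAGES))  # 19
print(make_transcription_setting("en", "record_only"))
```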

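Base response structure (base_resp)

| Field | Type | Description |
|---|---|---|
| status_code | int | Status code. 0 means success; any non-zero value is an error |
| status_msg | string | Status description |

Server response message structure (WSSttResponse)

| Field | Type | Description |
|---|---|---|
| event | string | Event type: connected_success, task_started, result_final, task_finished, task_failed |
| session_id | string | Session ID |
| trace_id | string | Trace ID |
| data | object | Recognition result data (present only in result_final) |
| base_resp | object | Base response information (status code and message) |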
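
To make the response structure concrete, the sketch below parses a server message into a small typed record. The class and function names are hypothetical; fields other than `event` are treated as optional because the examples in this document do not always include them.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class SttResponse:
    event: str
    session_id: Optional[str]
    trace_id: Optional[str]
    data: Optional[dict]
    status_code: Optional[int]
    status_msg: Optional[str]

    @property
    def ok(self) -> bool:
        # Only an explicit status_code of 0 counts as success.
        return self.status_code == 0


def parse_response(raw: str) -> SttResponse:
    """Decode one JSON control message sent by the server."""
    msg = json.loads(raw)
    base = msg.get("base_resp") or {}
    return SttResponse(
        event=msg["event"],
        session_id=msg.get("session_id"),
        trace_id=msg.get("trace_id"),
        data=msg.get("data"),
        status_code=base.get("status_code"),
        status_msg=base.get("status_msg"),
    )


resp = parse_response(
    '{"event": "task_failed", "base_resp": {"status_code": 2013, "status_msg": "model is required"}}'
)
print(resp.event, resp.ok, resp.status_msg)
```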

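3.2 Interaction Stages

Step 1: Establish the connection

The client opens a WebSocket connection. Once authentication succeeds, the server returns connected_success.

Server response example:

```json
{
  "event": "connected_success",
  "session_id": "trace-id-xxx",
  "base_resp": {
    "status_code": 0,
    "status_msg": "success"
  }
}
```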

Step 2: Start the task (task_start)

The client sends a task_start event to configure the session.

Client request example:

```json
{
  "event": "task_start",
  "model": "sense-asr-deepthink",
  "audio_setting": {
    "sample_rate": 16000,
    "format": "pcm",
    "channel": 1
  },
  "vad_setting": {
    "silence_duration": 500,
    "min_speech_duration": 300
  }
}
```

Parameters:

  • audio_setting:
      • sample_rate: sample rate; must be 16000.
      • channel: number of channels; must be 1 (mono).
      • format: audio format; currently only pcm (16-bit little-endian) is supported.
  • vad_setting (optional):
      • silence_duration: silence threshold for splitting (ms).
      • threshold: energy threshold (0.0 - 1.0).

Server response example:

```json
{
  "event": "task_started",
  "session_id": "trace-id-xxx",
  "base_resp": {
    "status_code": 0,
    "status_msg": "success"
  }
}
```

Step 3: Stream the audio

The client continuously sends binary audio data (Binary Message).

  • Format: PCM signed 16-bit little-endian, 16 kHz, mono.
  • The server uses VAD to split sentences automatically and returns one result_final for each recognized sentence.

Server response example (result_final):

```json
{
  "event": "result_final",
  "session_id": "trace-id-xxx",
  "data": {
    "text": "你好,今天天气真不错。",
    "is_final": true,
    "segment_id": 1,
    "timestamp_end": 1773027072669
  },
  "base_resp": {
    "status_code": 0,
    "status_msg": "success"
  }
}
```
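
A client usually needs to stitch the per-sentence results back into one transcript. The sketch below does this under two assumptions that the source does not confirm: that `segment_id` increases in utterance order, and that `timestamp_end` is Unix epoch time in milliseconds. The class name and the sample payloads are hypothetical.

```python
from datetime import datetime, timezone


class Transcript:
    """Collects result_final payloads and joins them in segment order."""

    def __init__(self, separator: str = ""):
        self.separator = separator
        self._segments = {}

    def add(self, data: dict) -> None:
        # A repeated segment_id overwrites the earlier text for that segment.
        self._segments[data["segment_id"]] = data["text"]

    def text(self) -> str:
        return self.separator.join(self._segments[i] for i in sorted(self._segments))


def end_time(data: dict) -> datetime:
    # Assumption: timestamp_end is Unix epoch time in milliseconds.
    return datetime.fromtimestamp(data["timestamp_end"] / 1000, tz=timezone.utc)


transcript = Transcript()
transcript.add({"text": "今天天气真不错。", "segment_id": 2, "timestamp_end": 1773027075000})
transcript.add({"text": "你好,", "segment_id": 1, "timestamp_end": 1773027072669})
print(transcript.text())
print(end_time({"timestamp_end": 1773027072669}).isoformat())
```

For languages written with spaces between words, passing `separator=" "` would be the natural choice.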

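Step 4: Finish the task (task_finish)

The client sends a task_finish event to signal that all audio has been sent. The server processes the remaining audio and then returns task_finished.

Client request example:

```json
{
  "event": "task_finish"
}
```

Server response example:

```json
{
  "event": "task_finished",
  "session_id": "trace-id-xxx",
  "base_resp": {
    "status_code": 0,
    "status_msg": "success"
  }
}
```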

Error handling (task_failed)

If an error occurs, the server returns task_failed and may close the connection.

```json
{
  "event": "task_failed",
  "base_resp": {
    "status_code": 2013,
    "status_msg": "model is required"
  }
}
```
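
On the client, a failure message like the one above can be turned into an exception so that it is not silently ignored. This is one possible sketch; `TaskFailed` and `raise_if_failed` are hypothetical names, and treating any non-zero `status_code` as a failure follows the base_resp description.

```python
import json


class TaskFailed(Exception):
    """Raised when the server reports task_failed or a non-zero status code."""

    def __init__(self, status_code, status_msg):
        super().__init__(f"task_failed ({status_code}): {status_msg}")
        self.status_code = status_code
        self.status_msg = status_msg


def raise_if_failed(raw: str) -> dict:
    """Decode a server message and raise TaskFailed if it signals an error."""
    msg = json.loads(raw)
    base = msg.get("base_resp") or {}
    if msg.get("event") == "task_failed" or base.get("status_code", 0) != 0:
        raise TaskFailed(base.get("status_code"), base.get("status_msg"))
    return msg


failed = '{"event": "task_failed", "base_resp": {"status_code": 2013, "status_msg": "model is required"}}'
try:
    raise_if_failed(failed)
except TaskFailed as exc:
    error = exc
    print(exc)
```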

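4. Interaction Flow

The complete interaction between client and server proceeds as follows: the client connects and receives connected_success, sends task_start and receives task_started, streams binary audio while the server returns a result_final for each sentence, and finally sends task_finish and receives task_finished.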
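
The order of server events in this flow can be expressed as a small transition table, which is handy for checking logs or writing tests. The table below is inferred from the four steps in section 3.2 and is only a sketch; `check_sequence` is a hypothetical helper.

```python
# Allowed next server events, keyed by the previous one (None = nothing received yet).
TRANSITIONS = {
    None: {"connected_success"},
    "connected_success": {"task_started"},
    "task_started": {"result_final", "task_finished"},
    "result_final": {"result_final", "task_finished"},
    "task_finished": set(),
}


def check_sequence(events) -> bool:
    """Return True if events form a complete, successful session."""
    previous = None
    for event in events:
        # task_failed and any out-of-order event make the session invalid.
        if event not in TRANSITIONS[previous]:
            return False
        previous = event
    return previous == "task_finished"


session = ["connected_success", "task_started", "result_final", "result_final", "task_finished"]
print(check_sequence(session))
print(check_sequence(["connected_success", "task_failed"]))
```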


5. Usage Examples

Replace YOUR_API_KEY and the AUDIO_FILE path in the sample code with your actual values.

Python

Dependency: pip install websockets

```python
import asyncio
import json

import websockets

API_KEY = "YOUR_API_KEY"
WS_URL = "wss://api.senseaudio.cn/ws/v1/audio/transcriptions"
AUDIO_FILE = "test_audio_16k.pcm"  # 16 kHz, 16-bit, mono PCM


async def receive_messages(websocket):
    """Print server events until the task finishes or fails."""
    async for message in websocket:
        msg_json = json.loads(message)
        event = msg_json.get("event")
        print(f"< Received Event: {event}")
        if event == "result_final":
            print(f"  Result: {msg_json['data']['text']}")
        elif event == "task_finished":
            print("  Task finished")
            break
        elif event == "task_failed":
            print(f"  Task failed: {msg_json.get('base_resp')}")
            break


async def speech_recognition():
    headers = {"Authorization": f"Bearer {API_KEY}"}
    # Older websockets releases name this parameter extra_headers.
    async with websockets.connect(WS_URL, additional_headers=headers) as websocket:
        # 1. Receive the connected_success message
        resp = await websocket.recv()
        print(f"< Received: {resp}")

        # 2. Send task_start
        start_payload = {
            "event": "task_start",
            "model": "sense-asr-deepthink",
            "audio_setting": {"sample_rate": 16000, "format": "pcm", "channel": 1},
        }
        await websocket.send(json.dumps(start_payload))
        print("> Sent task_start")

        # Receive task_started
        resp = await websocket.recv()
        print(f"< Received: {resp}")

        # 3. Receive results concurrently while sending audio
        receive_task = asyncio.create_task(receive_messages(websocket))
        with open(AUDIO_FILE, "rb") as f:
            while True:
                data = f.read(3200)  # 3200 bytes per chunk (about 100 ms)
                if not data:
                    break
                await websocket.send(data)
                await asyncio.sleep(0.1)  # simulate a real-time stream

        # 4. Send task_finish
        await websocket.send(json.dumps({"event": "task_finish"}))
        print("> Sent task_finish")
        await receive_task


if __name__ == "__main__":
    asyncio.run(speech_recognition())
```

Golang

Dependency: go get github.com/gorilla/websocket

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"log"
	"net/http"
	"os"
	"time"

	"github.com/gorilla/websocket"
)

const (
	API_KEY    = "YOUR_API_KEY"
	WS_URL     = "wss://api.senseaudio.cn/ws/v1/audio/transcriptions"
	AUDIO_FILE = "test_audio_16k.pcm"
)

func main() {
	header := http.Header{}
	header.Add("Authorization", "Bearer "+API_KEY)

	log.Printf("Connecting to %s", WS_URL)
	c, _, err := websocket.DefaultDialer.Dial(WS_URL, header)
	if err != nil {
		log.Fatal("dial error:", err)
	}
	defer c.Close()

	// 1. Read the connected_success message
	readMessage(c)

	// 2. Send task_start
	startMsg := map[string]interface{}{
		"event": "task_start",
		"model": "sense-asr-deepthink",
		"audio_setting": map[string]interface{}{
			"sample_rate": 16000,
			"format":      "pcm",
			"channel":     1,
		},
	}
	if err := c.WriteJSON(startMsg); err != nil {
		log.Fatal("write startMsg error:", err)
	}
	fmt.Println("> Sent task_start")

	// Read task_started
	readMessage(c)

	// 3. Read results concurrently
	done := make(chan struct{})
	go func() {
		defer close(done)
		for {
			_, message, err := c.ReadMessage()
			if err != nil {
				if websocket.IsCloseError(err, websocket.CloseNormalClosure, websocket.CloseGoingAway) {
					fmt.Println("< Connection closed normally")
					return
				}
				log.Println("read error:", err)
				return
			}
			fmt.Printf("< Received: %s\n", message)

			var msgMap map[string]interface{}
			if err := json.Unmarshal(message, &msgMap); err != nil {
				log.Println("json unmarshal error:", err)
				continue
			}
			// Stop reading once the task has finished or failed.
			if event, ok := msgMap["event"].(string); ok && (event == "task_finished" || event == "task_failed") {
				return
			}
		}
	}()

	// 4. Send audio in 100 ms chunks
	file, err := os.Open(AUDIO_FILE)
	if err != nil {
		log.Fatal("open audio file error:", err)
	}
	defer file.Close()

	const chunkSize = 3200
	buf := make([]byte, chunkSize)
	for {
		n, err := file.Read(buf)
		if err == io.EOF {
			break
		}
		if err != nil {
			log.Fatal("read file error:", err)
		}
		if err := c.WriteMessage(websocket.BinaryMessage, buf[:n]); err != nil {
			log.Fatal("write message error:", err)
		}
		fmt.Printf("> Sent %d bytes audio data\n", n)
		// Simulate real-time sending: wait 100 ms after each 100 ms chunk
		if n == chunkSize {
			time.Sleep(100 * time.Millisecond)
		}
	}

	// 5. Send task_finish as soon as all audio has been sent
	finishMsg := map[string]string{"event": "task_finish"}
	if err := c.WriteJSON(finishMsg); err != nil {
		log.Fatal("write task_finish error:", err)
	}
	fmt.Println("> Sent task_finish")

	// 6. Wait for the server to finish processing and return task_finished
	<-done
	fmt.Println("All done!")
}

// readMessage reads and prints a single message from the connection.
func readMessage(c *websocket.Conn) {
	_, message, err := c.ReadMessage()
	if err != nil {
		if websocket.IsCloseError(err, websocket.CloseNormalClosure, websocket.CloseGoingAway) {
			fmt.Println("< Connection closed")
			return
		}
		log.Fatal("read message error:", err)
	}
	fmt.Printf("< Received: %s\n", message)
}
```

Node.js

Dependency: npm install ws

```javascript
const WebSocket = require('ws');
const fs = require('fs');

const API_KEY = "YOUR_API_KEY";
const WS_URL = "wss://api.senseaudio.cn/ws/v1/audio/transcriptions";
const AUDIO_FILE = "test_audio_16k.pcm";

const ws = new WebSocket(WS_URL, {
  headers: { "Authorization": `Bearer ${API_KEY}` }
});

ws.on('open', function open() {
  console.log('Connected');
});

ws.on('error', function onError(err) {
  console.error('WebSocket error:', err.message);
});

ws.on('message', function incoming(data, isBinary) {
  if (isBinary) return;
  const msg = JSON.parse(data.toString());
  console.log('< Received:', msg.event);

  if (msg.event === 'connected_success') {
    // Configure the session once the connection is confirmed
    const startMsg = {
      event: "task_start",
      model: "sense-asr-deepthink",
      audio_setting: { sample_rate: 16000, format: "pcm", channel: 1 }
    };
    ws.send(JSON.stringify(startMsg));
    console.log('> Sent task_start');
  } else if (msg.event === 'task_started') {
    streamAudio();
  } else if (msg.event === 'result_final') {
    console.log('  Result:', msg.data.text);
  } else if (msg.event === 'task_finished') {
    process.exit(0);
  } else if (msg.event === 'task_failed') {
    console.error('  Task failed:', msg.base_resp);
    process.exit(1);
  }
});

// Send the file in 3200-byte chunks (about 100 ms each), paced at real-time speed.
function streamAudio() {
  const stream = fs.createReadStream(AUDIO_FILE, { highWaterMark: 3200 });
  let processing = false;

  stream.on('data', function (chunk) {
    stream.pause();
    processing = true;
    ws.send(chunk);
    setTimeout(() => {
      processing = false;
      stream.resume();
    }, 100);
  });

  stream.on('end', function () {
    // Wait for the last chunk's pacing delay before sending task_finish
    const checkAndFinish = () => {
      if (!processing) {
        ws.send(JSON.stringify({ event: "task_finish" }));
        console.log('> Sent task_finish');
      } else {
        setTimeout(checkAndFinish, 50);
      }
    };
    checkAndFinish();
  });
}
```