Improved accuracy in Smart Turn v3.1

Smart Turn v3.1 brings improved turn detection accuracy, thanks to new human audio training data. Drop-in replacement for v3.0.

We’re pleased to announce the availability of Smart Turn v3.1, now with improved accuracy thanks to a larger dataset of human audio samples, and improvements to how the model is quantized.

The model uses the same architecture as v3.0, and so v3.1 is a drop-in replacement — simply use your existing inference code with the new ONNX file.

As with v3.0, this new model will be integrated directly into the next Pipecat release, allowing you to integrate Smart Turn into your voice agent with minimal code changes. You can also use v3.1 with your existing version of Pipecat by manually specifying the path to the model weights.
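
As a rough illustration, here is a minimal sketch of pointing Pipecat at a locally downloaded copy of the v3.1 weights. The module, class, and parameter names (LocalSmartTurnAnalyzerV3, turn_analyzer, smart_turn_model_path) and the file path are assumptions for illustration only; check the documentation for the Pipecat version you're running for the exact API.

# Hypothetical sketch: module, class, and parameter names are assumptions;
# consult your Pipecat version's documentation for the exact API.
from pipecat.audio.turn.smart_turn.local_smart_turn_v3 import LocalSmartTurnAnalyzerV3
from pipecat.transports.base_transport import TransportParams

transport_params = TransportParams(
    audio_in_enabled=True,
    # Point the analyzer at the v3.1 ONNX file you downloaded from HuggingFace,
    # rather than the weights bundled with your installed Pipecat release.
    turn_analyzer=LocalSmartTurnAnalyzerV3(
        smart_turn_model_path="/path/to/smart-turn-v3.1.onnx",  # placeholder path
    ),
)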

New training data

We’d like to thank our partners, Liva AI, Midcentury, and MundoAI, who have provided new human audio samples in English and Spanish.

These new samples have been added to our datasets on HuggingFace: smart-turn-data-v3.1-train and smart-turn-data-v3.1-test contain all the audio data used to train and test Smart Turn v3.1. As with previous versions, these datasets are released openly.

Smart Turn has historically been heavily reliant on synthetic data generated by TTS models. Synthetic data often lacks the natural variability and subtle cues of actual human speech, so this new human data has had a significant and measurable effect on accuracy. We're grateful to be able to include it in the new version.

We've included some statistics in the "accuracy" section below showing the effect this new data has had on the model.

Liva AI

Liva AI provides real human voice data to improve speech models. We identify gaps in current research and collect targeted data to fill them, helping speech models better understand and express different languages, accents, and emotions. The company was founded by Ashley Mo, who has published audio research with MIT (IEEE), and Aoi Otani, who has published ML research in ICML and Nature.

We were honoured to partner with the Pipecat team on Smart Turn v3.1, contributing targeted training data that helped it better recognize subtle audio cues for natural turn-taking in conversations. We're entering an era where models not only sound expressive, but can understand speech and respond in conversation with the same dynamics as a human. Smart Turn is an exciting contribution to this field, and by open sourcing their models and datasets, they're enabling others to build better conversational AI for a wide variety of specific use cases. We're thrilled to have contributed to this milestone.

-- Ashley Mo, Co-Founder

Midcentury

Midcentury is a multimodal-native research company advancing AI with real-world datasets. We generate and license voice, video, and interaction data that trains models to perform in the environments where they actually matter. We’ve built large-scale conversational audio datasets across 12+ languages (including Japanese and Korean) and we license proprietary long-form video directly from creators. Leading AI labs rely on us to deliver the proprietary audio and voice data that power their models.

We worked on the Pipecat Smart Turn model because we’re huge fans of how Pipecat is pushing the OSS voice community forward. We genuinely believe that responsibly publishing high-quality data is one of the strongest levers for accelerating AI. Getting the right data to build better models is still way too hard, and we hope projects like this make it easier.

-- David Guo, Co-Founder

MundoAI

MundoAI is advancing AI by building the world's largest and highest quality multimodal datasets, starting with voice. Our bespoke data, spanning 16+ languages and sourced from a global network of contributors, already powers frontier models at leading research labs.

With our focus on multilingual and multimodal systems, we were thrilled to contribute to the Pipecat Smart Turn model and support its open-source mission. Increasing diversity in training data is essential for building models that work for everyone, so we're proud to play a role in enabling better and more inclusive AI.

-- Garreth Lee, Co-Founder

Model variants

In this release, you can choose between the following two variants of Smart Turn, depending on whether you're running inference on CPU or GPU. A short loading sketch follows the list.

  • CPU model (8MB, int8 quantized): this model is small and fast (the same size as v3.0), with CPU inference in as little as 12ms. As with v3.0, this model will be integrated directly into Pipecat in the next release, and inference completes in around 70ms on Pipecat Cloud.
  • GPU model (32MB, unquantized): if you’re running the model on a GPU, this larger variant provides better inference time compared to the 8MB version, and accuracy is improved by around 1%. It's also possible to run this model on CPU but with a longer inference time – see the "performance" section below.
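
As a quick illustration, here is a minimal sketch of loading either variant with ONNX Runtime in Python. The file names are placeholders for the ONNX files you download from our HuggingFace repo, and the GPU path assumes the onnxruntime-gpu package is installed.

import onnxruntime as ort

def load_smart_turn(use_gpu: bool) -> ort.InferenceSession:
    if use_gpu:
        # 32MB unquantized variant: lower latency and ~1% higher accuracy on GPU.
        return ort.InferenceSession(
            "smart-turn-v3.1-gpu.onnx",  # placeholder filename
            providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
        )
    # 8MB int8-quantized variant: the better choice for CPU inference.
    return ort.InferenceSession(
        "smart-turn-v3.1-cpu.onnx",  # placeholder filename
        providers=["CPUExecutionProvider"],
    )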

Accuracy

Evaluating the model on our new smart-turn-data-v3.1-test dataset, we see a dramatic accuracy improvement for English and Spanish compared to the previous v3.0 model, thanks to the new human datasets:

Language   v3.0    v3.1 (8MB)   v3.1 (32MB)
English    88.3%   94.7%        95.6%
Spanish    86.7%   90.1%        91.0%

We still support 23 languages in total, and the remaining 21 languages have similar performance in v3.1 compared to v3.0. For a full list, see the benchmark results in our HuggingFace repo.

Performance

The 8MB variant of Smart Turn v3.1 maintains the same great inference speed as v3.0. We also now offer the unquantized 32MB variant for GPUs.

The preprocessing (feature extractor) cost is the same for both models and is listed separately.

Note: these results are for a single inference. The model supports batching, which can significantly improve performance.

Device                  v3.1 (8MB)   v3.1 (32MB)   Preprocessing
GPU (NVIDIA L40S)       2 ms         1 ms          1 ms
GPU (NVIDIA T4)         5 ms         4 ms          2 ms
CPU (AWS c7a.2xlarge)   9 ms         13 ms         7 ms
CPU (AWS c8g.2xlarge)   20 ms        32 ms         9 ms
CPU (AWS c7a.medium)    37 ms        73 ms         7 ms
CPU (AWS c8g.medium)    57 ms        159 ms        9 ms

We’ve found that setting the following environment variables has a dramatic effect on performance and consistency, even when the ONNX runtime itself runs with multiple threads, apparently because of a high level of contention and inter-dependency between the OpenMP threads. Relying on the ONNX runtime for higher-level parallelism on multi-CPU machines seems to give better results.

OMP_NUM_THREADS=1
OMP_WAIT_POLICY="PASSIVE"
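
For example, in a Python process these settings can be applied before ONNX Runtime (and its OpenMP runtime) is loaded, while letting the runtime itself fan out across cores. This is a sketch rather than a tuned configuration: the model path is a placeholder, and the best thread count depends on your workload and machine.

import os

# Apply the OpenMP settings before onnxruntime is imported, so they take
# effect when the OpenMP runtime initializes.
os.environ["OMP_NUM_THREADS"] = "1"
os.environ["OMP_WAIT_POLICY"] = "PASSIVE"

import onnxruntime as ort

session_options = ort.SessionOptions()
# Let ONNX Runtime handle parallelism across cores instead of OpenMP.
session_options.intra_op_num_threads = os.cpu_count() or 1

session = ort.InferenceSession(
    "smart-turn-v3.1-cpu.onnx",  # placeholder path to the 8MB CPU model
    sess_options=session_options,
    providers=["CPUExecutionProvider"],
)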

What's next

Smart Turn v3.1 is another step forward in turn-taking accuracy, and we couldn't have done it without the contributions from our partners at Liva AI, Midcentury, and MundoAI. Their high-quality human audio data has been instrumental in closing the gap between synthetic and real-world performance.

We're already hard at work on Smart Turn v3.2, where we'll continue our focus on accuracy improvements across all supported languages. As always, we'll be releasing our models, datasets, and benchmarks openly so the community can build on this work.

If you have feedback or questions, reach out to us on GitHub or join the conversation in our Discord.
