Making AI conversational: video call training is the next step for ChatGPT-style AI

Conversational AI, such as ChatGPT, has seen significant advancements in recent years, making it increasingly capable of simulating human-like conversation. The use cases for this technology are vast and varied, from customer service interactions to language learning and entertainment.

Furthermore, chatbot technology is becoming increasingly sophisticated: many chatbots can now understand context and even use machine learning to improve their responses over time. This allows them to provide more tailored and relevant answers to users' questions, which in turn leads to better user experiences and greater satisfaction with the service.

The next step for conversational AI is to actually be trained in conversation. Currently, these models are trained on enormous numbers of documents downloaded from the web and elsewhere. Those documents establish a vast knowledge base.

But this also leads to a question-and-answer style of interaction that isn't especially natural. Type any question into ChatGPT and you will get what looks like a well-thought-out answer, complete with some “facts” and possible caveats.

To move on from Q&A-style interaction, AI needs to be trained on actual conversations.

To address these challenges, the way forward for conversational AI is to build a corpus of video calls and use that corpus to train the AI. Training on video calls would lead to a much more natural form of conversation.

A still from "2001: A Space Odyssey"
“I’ve still got the greatest enthusiasm and confidence in the mission.” HAL 9000 in the movie “2001: A Space Odyssey” engages in natural-language conversation with humans, even while being dismantled.

Video calls contain a wealth of information, including audio and visual cues, that can provide valuable input for training AI systems. The audio component of video calls could be used to train speech recognition and natural language processing (NLP) algorithms. The AI would learn the nuances of human speech, such as accent, intonation, and pronunciation, which would improve its ability to understand and respond to spoken language.
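To make that first step concrete, here is a minimal sketch that pulls a timestamped transcript out of a call recording, using the open-source Whisper model as one example of a speech recognition system. The model choice and file name are illustrative assumptions, not a prescribed stack:

```python
# A minimal sketch: turn the audio track of a call recording into
# timestamped transcript text that could feed a conversational corpus.
# Assumes the open-source `openai-whisper` package and ffmpeg are
# installed; "call_recording.mp4" is a placeholder file name.

import whisper

model = whisper.load_model("base")  # small, fast model for illustration

# Whisper extracts the audio track itself (via ffmpeg) and returns
# the full transcript along with timestamped segments.
result = model.transcribe("call_recording.mp4")

for segment in result["segments"]:
    start, end, text = segment["start"], segment["end"], segment["text"]
    print(f"[{start:7.2f}s - {end:7.2f}s] {text.strip()}")
```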

The visual component of video calls, including facial expressions and body language, could be used to train computer vision algorithms. The AI would learn to recognize and understand the emotional state of the conversation partner, as well as subtle cues such as nodding and pointing. This would improve its ability to understand the context of the conversation and respond appropriately.
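As an illustrative sketch of that pipeline's front end, the snippet below samples roughly one frame per second from a recording and detects face landmarks with MediaPipe's face-mesh model, the kind of low-level signal that expression and nod recognition can build on. The library and file name are assumptions for illustration:

```python
# A minimal sketch: sample frames from a call recording and detect face
# landmarks as a first step toward reading expressions and gestures.
# Assumes `opencv-python` and `mediapipe` (legacy solutions API) are
# installed; "call_recording.mp4" is a placeholder file name.

import cv2
import mediapipe as mp

face_mesh = mp.solutions.face_mesh.FaceMesh(static_image_mode=False,
                                            max_num_faces=2)
video = cv2.VideoCapture("call_recording.mp4")
fps = video.get(cv2.CAP_PROP_FPS) or 30.0

frame_index = 0
while True:
    ok, frame = video.read()
    if not ok:
        break
    if frame_index % int(fps) == 0:  # analyze roughly one frame per second
        # MediaPipe expects RGB input; OpenCV decodes frames as BGR.
        results = face_mesh.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        n_faces = len(results.multi_face_landmarks or [])
        print(f"t={frame_index / fps:6.1f}s  faces detected: {n_faces}")
    frame_index += 1

video.release()
face_mesh.close()
```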

Such information could be integrated with the audio data to give a more complete understanding of the conversation. Taken together, this would unlock a new data source for training large language models, addressing the shortage of training data that Forbes has identified as one important challenge with the current state of AI.[1]
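One simple way to picture that integration is to align the two streams on their shared timeline, pairing each spoken utterance with the visual cues that occur during it. The sketch below does this with made-up data shaped like the outputs of the two steps above:

```python
# A minimal sketch of fusing audio and visual streams by timestamp:
# each transcript segment is paired with whatever visual cues overlap
# it, producing one multimodal training example per utterance. The
# input shapes are assumptions modeled on the sketches above.

segments = [  # from speech recognition: (start_s, end_s, text)
    (0.0, 3.2, "Thanks for joining, can you hear me okay?"),
    (3.5, 5.1, "Yes, loud and clear."),
]
visual_cues = [  # from frame analysis: (time_s, cue)
    (1.0, "smile"),
    (4.0, "nod"),
]

examples = []
for start, end, text in segments:
    cues = [cue for t, cue in visual_cues if start <= t <= end]
    examples.append({"text": text, "cues": cues, "start": start, "end": end})

for example in examples:
    print(example)
# {'text': 'Thanks for joining, ...', 'cues': ['smile'], ...}
# {'text': 'Yes, loud and clear.', 'cues': ['nod'], ...}
```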

The focus of this training should be to give AI the ability to communicate effectively in natural language, in a natural two-way conversation. Using a corpus of video calls will be the most efficient way to achieve this, as this data is already available and consists of naturally occurring conversations.

Furthermore, in video calls the speakers' faces are visible most of the time, so facial expressions and body language can be analyzed and become part of the conversation without extensive normalization of the input data.

A corpus of video calls can also train AI in conversation for specific use cases. Want an AI doctor? Train the AI on a data set of telehealth visits. Want AI to handle customer service for your business? Train it on recordings of real customer service calls. A corpus of video calls leads naturally to the creation of artificial intelligence agents for these functions. Companies running such services can and should start building these corpora today.

At Daily, a developer platform for video, we're focused on building the communication technologies needed to support this development. Our customers embed video experiences, such as video calls, interactive live streams, and cloud recordings, into their websites and applications. Across use cases, companies are now implementing transcription and exploring other AI-based analysis.
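As a minimal sketch, capturing calls that could later feed such a corpus can start with a single REST request that creates a Daily room with cloud recording enabled. See Daily's API reference for the full set of room properties; the snippet assumes the `requests` package and a `DAILY_API_KEY` environment variable:

```python
# A sketch: create a Daily room with cloud recording enabled via the
# REST API, so calls in that room can be captured for later analysis.
# Verify property names against the current Daily API reference.

import os
import requests

response = requests.post(
    "https://api.daily.co/v1/rooms",
    headers={"Authorization": f"Bearer {os.environ['DAILY_API_KEY']}"},
    json={"properties": {"enable_recording": "cloud"}},
)
response.raise_for_status()
print(response.json()["url"])  # the room URL that participants join
```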

We are excited to support corpus building for AI purposes. As long-time veterans of the video space, we see this as yet another way that video is central to business, technology, and society. Let us know how video can accelerate your AI exploration; we're excited to help.

1. Rob Toews, "10 AI Predictions For 2023", Forbes, Dec 20, 2022, https://www.forbes.com/sites/robtoews/2022/12/20/10-ai-predictions-for-2023/
