Conversational AI has been around for a while now, and one of the main reasons for its success is that it creates more natural, ‘human-like’ conversations with users. The conversational AI market includes chatbots and intelligent virtual assistants. It is estimated to grow from roughly $7 Bn (as of FY 2021) to $47.57 Bn by 2028, at a CAGR of about 32%.
Accelerated by advances in Automated Speech Recognition (ASR) in the 2010s and the launch of Apple’s Siri voice assistant in 2011, conversational AI has come a long way. Today’s most popular assistants include Alexa (Amazon), Google Now (Google), Siri (Apple) and Cortana (Microsoft). According to research done by Mindfield, below are a few stats.
While simple bots may function as little more than searchable databases, complex bots are being infused with artificial intelligence to deliver more conversational, meaningful interactions. These are also commonly referred to as intelligent agents. When combined with personalized user input, intelligent agents further transform into digital assistants that deliver more tailored responses, specific to the individual.
Now, let’s understand the Conversational AI landscape. The ecosystem spans three layers: Applications, Development and Hardware.
In the Applications layer, apart from the self-explanatory Enterprise Software and Consumer categories, I have included a “Native voice-first skills” category: voice “applications” that aren’t merely a replication of mobile or desktop functionality translated to a voice UI (e.g. ordering a pizza), but skills that uniquely take advantage of the new capabilities of the voice platform. These apps are the most exciting, but admittedly the most difficult to describe and predict.
The analogy here: just as the mobile platform enabled a service like Uber by providing a portable, connected computing layer between driver and passenger that wasn’t structurally available on the desktop web, which applications could be fundamentally built on a voice user interface that just weren’t possible before? These would be more intuitive, letting people use the most natural and fastest means of communication.
The Development layer comprises Platforms, Analytics and Services. Platforms provide toolkits for developers to create chatbots.
IDE (Integrated Development Environment) & Framework - Lets a developer deploy a bot with only a few lines of code, e.g. on platforms such as Facebook Messenger and Telegram (see the sketch after this list).
NUI (Natural User Interface) - Provides technology for Automated Speech Recognition (ASR) and Natural Language Understanding (NLU), e.g. developers of sound-recognition-based mobile apps.
DIY (Do It Yourself) Platforms - Platforms for creating AI-based task bots. For example, Motion is creating drag-and-drop AI-based agents that can be integrated into an app; it claims to have agents that can accept orders, take payments, handle customer-service chats and diagnose patients.
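To make the “few lines of code” point concrete, here is a minimal sketch of an echo bot built directly against Telegram’s public Bot API with plain HTTP long polling. The bot token is a placeholder you would obtain from Telegram’s @BotFather; the endpoints used (getUpdates, sendMessage) are documented Bot API methods.

```python
# Minimal Telegram echo bot sketch using long polling.
import requests

BOT_TOKEN = "YOUR_BOT_TOKEN"  # placeholder, not a real token
API = f"https://api.telegram.org/bot{BOT_TOKEN}"

offset = None
while True:
    # Long-poll Telegram for new updates since the last one we handled.
    updates = requests.get(f"{API}/getUpdates",
                           params={"offset": offset, "timeout": 30}).json()
    for update in updates.get("result", []):
        offset = update["update_id"] + 1
        message = update.get("message")
        if message and "text" in message:
            # Echo the user's text back: the simplest possible "bot".
            requests.get(f"{API}/sendMessage",
                         params={"chat_id": message["chat"]["id"],
                                 "text": f"You said: {message['text']}"})
```

Real frameworks wrap this loop in handlers and decorators, but the underlying contract is just this: poll for messages, reply with messages.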
In the Analytics & Services layer, I have included companies that provide software for analysing chatbot performance, as well as companies that offer chatbot development services using proprietary development platforms.
Finally, there are companies building the hardware products that embed these voice assistants.
What does the voice technology stack look like?
Speech is technically defined as a sequence of basic units called phonemes. Automated Speech Recognition (ASR) converts analog speech signals to digital signals, which are segmented to retrieve phonemes. Using this phoneme sequence, the ASR system refers to its linguistic model (vocabulary and grammar rules) to decipher words or phrases.
Processing speech signals this way needs significant computing power and storage: the system has to handle phonemes with all their noise and errors, and maintain a large library of vocabulary, words and phrases to refer to. Microsoft owns the largest patent portfolio in linguistics technology.
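To see what this looks like from a developer’s point of view, here is a minimal sketch using the open-source SpeechRecognition Python library. The phoneme segmentation and lexicon lookup described above happen inside the recognition engine; “command.wav” is a placeholder file name.

```python
# Minimal ASR sketch: audio in, text out.
# pip install SpeechRecognition
import speech_recognition as sr

recognizer = sr.Recognizer()
with sr.AudioFile("command.wav") as source:
    audio = recognizer.record(source)  # the digitized speech signal

try:
    # Hands the signal to Google's free web ASR engine, which maps the
    # phoneme sequence to the most likely word sequence.
    text = recognizer.recognize_google(audio)
    print("Recognized:", text)
except sr.UnknownValueError:
    print("Speech was unintelligible")
```

Note how the heavy lifting (acoustic modelling, the vocabulary library, error handling for noise) is all hidden behind one call; that is exactly why it demands so much compute and storage on the engine side.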
With natural language as the new UI layer, text-to-speech and speech-to-text converters will form the core of the UI, as they enable human-to-machine interactions. In the coming Intelligent Digital Assistant era, there will be a greater need for online platforms to harness data captured at different touchpoints, such as user-journey tracking and vendor-management quality. This would be part of the Cognitive layer.
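The text-to-speech half of that UI layer can be sketched just as briefly, here with the offline pyttsx3 library; the spoken sentence and speech rate are illustrative values.

```python
# Minimal text-to-speech sketch.
# pip install pyttsx3
import pyttsx3

engine = pyttsx3.init()            # picks the platform's TTS backend
engine.setProperty("rate", 160)    # words per minute (assumed value)
engine.say("Your order has been placed.")
engine.runAndWait()                # blocks until speech finishes
```

Together with the ASR snippet above, this closes the listen-speak loop that a voice UI is built on.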
API management and frameworks will need a fundamental change: as digital assistants take over interactions across the user journey, online platforms must expose more of their offerings and services through a set of standardized APIs. Building the technology infrastructure around reusable, standardized APIs has multiple benefits. It promotes information transparency, making it easier for a digital assistant (DA) to get the information it needs without manual intervention, and it makes introducing a new feature faster and more reliable.
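As a purely hypothetical illustration of such standardized APIs, here is a sketch of two endpoints, written with FastAPI, that a digital assistant could call to place and then track an order. The endpoint paths and fields are invented for this example, not taken from any real platform.

```python
# Hypothetical standardized API surface for a digital assistant.
# pip install fastapi
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class OrderRequest(BaseModel):
    item_id: str
    quantity: int

@app.post("/v1/orders")
def create_order(order: OrderRequest):
    # A machine-readable contract lets a DA place an order with no
    # manual intervention.
    return {"status": "confirmed", "item_id": order.item_id,
            "quantity": order.quantity}

@app.get("/v1/orders/{order_id}")
def get_order(order_id: str):
    # The same standardized contract lets the DA track the order
    # later in the user journey.
    return {"order_id": order_id, "status": "in_transit"}
```

Because the contract is reusable, adding a new offering means adding another endpoint to the same scheme rather than a bespoke integration per assistant.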
Large platform players like Amazon are taking up many layers in the stack: the core services platform, the powering layer, and so on.
How does voice technology actually work? And ChatGPT, the latest innovation in chatbots, how does that work?
The image below shows the current state of the art for voice technology. I will cover this, as well as ChatGPT, in detail in my next blog. Stay tuned!