Demystifying Chatbots

Chatbots are hot these days. Everybody and their grandma has a bot now. Want to book a flight? There’s a bot for that. Customer support? There’s a bot. (Not to be confused with the Elite Executive Customer Support Engineer™, trained and employed by certain ISPs only to have scripted conversations.) Need some psychological help? That’s right; bot.

To a software engineer, this whole bot thing seems exciting and overwhelming at the same time. When I started working on my first Google Home Action, I didn’t know where to begin. Over the course of the past couple of years, I have developed a few such bot apps for different platforms, including Google Home and Amazon Alexa. While each platform comes with its own nuances, there is a common architectural pattern. In this post, I am going to draw a high-level picture of this pattern.

What Is a Chatbot?

For the scope of this post, I am going to assume that any program capable of having some sort of conversation on a topic is a chatbot. So it can be anything from a feedback-collection bot that asks the user to rate a bunch of questions from 1 to 5, all the way up to JARVIS.

Building Blocks

Chatbot frameworks across different platforms follow a similar structure. Google and Amazon both provide infrastructure for end-to-end bot development. However, the similarity in architecture allows developers to swap out the default components and plug in new ones.

Let’s go through these building blocks — starting at the user interface.

User Interface

The main advantage of bots over traditional apps is the diversity of user interfaces. Chatbots can communicate over a wide variety of interfaces, such as text and voice. This enables developers to deliver their chat apps across a wide variety of platforms with minimal effort. For example, the same backend can power bots on Facebook Messenger, Google Home, Alexa, and WeChat.

This is great for the users since they can just open a new chat on a platform of their choice instead of downloading another app.

Overall, the interfaces can be categorized into the following groups:

  • Point and Click: This is the simplest. The bot asks a question and provides a set of options to choose from. Depending on the user’s choice, the next question is asked. Domino’s has one such bot.
  • Free Text: This type of bot allows the user to type free-text messages, which adds the difficulty of understanding the intent behind the user’s message. Plenty of websites serve their FAQs as a chatbot.
  • Voice: These are the bots that you actually talk to. Siri, Alexa, Cortana, Google Assistant — you’ve probably met one of these. This is essentially a free-text interface with speech-to-text layered on top.

Language Processor

NLP is the magic sauce that powers a chatbot. The natural language processor is responsible for converting unstructured human text into a structure that can be consumed programmatically. NLP has two major responsibilities: intent classification and entity extraction.

In the case of Alexa, NLP is provided by the built-in Alexa Skills Kit, while Google recommends Dialogflow. These services do a little more than just NLP, but more on that later.

Let’s take an example question:

What was my balance in savings account last month?

Intent Classification: Each message from the user needs to be classified into one of the predefined intents. The intent tells us whether the user said “hello”, asked for account balance, or thanked the bot.

The example above should be classified under the awesomely named intent “ask_account_balance”.

Entity Extraction: Entities are the parameters embedded in the user’s utterance. The example above has two entities.

  • Account type: savings
  • Time: last month

So, a natural language processor will take our question and produce structured output. Of course, the exact structure will vary with the NLP system used.
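As a sketch, the structured output for our example question might look like the dictionary below. The field names here are purely illustrative; Dialogflow and the Alexa Skills Kit each define their own schemas.

```python
# Hypothetical structured output for:
#   "What was my balance in savings account last month?"
# Field names are illustrative, not from any real NLP service.
nlp_output = {
    "intent": "ask_account_balance",   # result of intent classification
    "entities": {                      # result of entity extraction
        "account_type": "savings",
        "time": "last month",
    },
}
```

Whatever the exact shape, the key point is that the free-form question has been reduced to an intent name plus a handful of named parameters.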

Now that we have a structure, it is time to give some answers.

Fulfillment Service

This is probably the simplest part. A fulfillment service takes the structure above and returns an answer that the bot can utter.

In our example above, maybe we query the database and form an utterance: “Your account balance for savings account was ₹345 on 30th of June. You’re broke!” (That last part is not strictly necessary.)

For most household chatbots, the fulfillment service is just an HTTP REST service. Google Cloud Functions, AWS Lambda, Azure Functions, etc. are great candidates for hosting such a service.
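Stripped of the HTTP plumbing, a fulfillment handler is just a function from intent and entities to an utterance. The sketch below assumes the hypothetical intent and entity names used above, and stubs out the database lookup with a hard-coded balance.

```python
# Minimal fulfillment sketch: map an intent + entities to an utterance.
# Intent/entity names are the hypothetical ones from the example above.
def lookup_balance(account_type: str, time: str) -> int:
    # Stub: a real service would query a database here.
    return 345

def fulfill(intent: str, entities: dict) -> str:
    if intent == "ask_account_balance":
        balance = lookup_balance(entities["account_type"], entities["time"])
        return (f"Your balance for {entities['account_type']} account "
                f"was ₹{balance}.")
    return "Sorry, I didn't understand that."

print(fulfill("ask_account_balance",
              {"account_type": "savings", "time": "last month"}))
```

In a deployed bot, this function would sit behind an HTTP endpoint and the NLP service would call it with the structured request.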

Multi-Turn Conversations

A full-blown conversation requires much more information than the current message alone can provide. Multiple sources of information are used to construct the conversation context.

Persistent Storage

The information that remains static across conversations comes from sources like databases. User details and user preferences are good examples, since they need to be persisted across sessions.

Session Storage

Some information is transient and needs to be remembered only in the current session. For example, when the user says, “Let’s talk about my savings account”, the bot needs to remember that the account type is “savings” for the rest of the conversation.

Such information is maintained within the session using a “bag of slots”. A bag of slots is just a dictionary of slot names and values. While the structure may differ between frameworks, they all provide a way to store and retrieve values from session level storage.
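As a minimal sketch, a bag of slots is nothing more than a dictionary attached to the session. The structure below is illustrative; each framework exposes its own session API.

```python
# Illustrative session-level "bag of slots": just a dictionary
# of slot names and values, scoped to one conversation.
session = {"slots": {}}

# User: "Let's talk about my savings account"
session["slots"]["account_type"] = "savings"

# Later turns in the same session can read the slot back:
account = session["slots"].get("account_type")
print(account)
```

When the session ends, the bag is discarded; anything worth keeping longer belongs in persistent storage instead.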

That was a bird’s-eye view of how a bot works. I have deliberately left a lot of details out of this post in order to keep it simple. We will look at those in the coming posts.
