What Amazon Echo tells us about the future of enterprise IT

It will be a rare IT organization that remains untouched by the speech computing revolution, so it's time to prepare for the new world of voice-driven applications.

I recently bought an Amazon Echo to “experiment with developments on the frontier of technology to see what the future of IT holds for us.” Or at least that’s how I rationalized the purchase. Playing around with the device—a hands-free speaker you control with your voice—got me thinking about the role that vocal computing might play in enterprise IT going forward. 

Vocal computing is not appropriate for every interaction, but it is seductively convenient for certain ones. For example, the combination of Echo and the Alexa Voice Service works beautifully for playing music or for providing information such as news and weather. And I was pleasantly surprised by how well Alexa interprets speech.

Ultimately, I predict vocal interfaces will become an important element in certain enterprise applications. In fact, speech will soon take equal place with web and mobile interfaces, so you have no time to lose in getting up to speed.

How soon will vocal interfaces become commonplace, and what will be the most relevant enterprise IT use cases? The answers to those key questions will impact IT staffing and technology investments in short order. 

The power of skills

To give you some idea of how quickly one becomes accustomed to an always-waiting smart assistant driven by voice input, let me offer a couple of examples from my early experimentation.

One of the most widely used services Alexa offers is music. The Echo is designed to offer high-quality sound despite its compact form factor (a cylinder roughly 9.25 inches tall and 3.3 inches in diameter). I like to listen to jazz and particularly favor a jazz station located in Toronto. In fact, when I set the alarm on my awesome Internet radio clock (I can't believe it's been discontinued!), I wake to that station.

I thought I’d see how smart Alexa is, so I said, “Echo, play me Toronto jazz station.” A moment later the device responded something like, “Sorry, I don’t know that information.” I considered this for a moment and then did a Google search for the station information. Its call sign turned out to be CJRT-FM. I then said, “Echo, play CJRT.” It responded with, “Playing Jazz FM 91.1 on TuneIn,” and a second later music came forth.

Wow. That was magical. Clearly Amazon has put a lot of work into making listening to music on Echo a seamless experience. But there's far more to Alexa. Amazon offers a way for external parties to integrate their API-enabled services with Alexa. It refers to the resulting integration as a "skill," and the number of skills is growing rapidly. Many of these integrations can be controlled through an online service called IFTTT (If This Then That). One creates an IFTTT applet that is called up by saying, "Echo, trigger [name of applet]."

Here's an example of how useful a skill can be. My Honeywell home thermostat is "smart," meaning it can be controlled via a web interface. Honeywell integrated its thermostats with IFTTT, so I created an applet that changes the temperature when I call it through the Echo. It's vastly more convenient to speak a single trigger phrase than to bring up the Honeywell website, log in, and adjust the temperature manually to the desired level.
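For the technically curious, here is roughly what sits behind such a skill. The sketch below is a hypothetical AWS Lambda handler in Python; the intent name, slot name, and set_thermostat helper are invented for illustration, though the request and response shapes follow the Alexa Skills Kit's JSON interface for custom skills.

    # Hypothetical Lambda handler for a custom "SetTemperatureIntent".
    # Alexa sends the parsed utterance as JSON; we return JSON telling
    # Alexa what to say back.

    def lambda_handler(event, context):
        request = event["request"]

        if (request["type"] == "IntentRequest"
                and request["intent"]["name"] == "SetTemperatureIntent"):
            # The spoken number arrives in a slot, e.g. from
            # "set the temperature to 68 degrees."
            degrees = request["intent"]["slots"]["Degrees"]["value"]
            set_thermostat(degrees)
            speech = "Setting the temperature to {} degrees.".format(degrees)
        else:
            speech = "Sorry, I didn't understand that."

        return {
            "version": "1.0",
            "response": {
                "outputSpeech": {"type": "PlainText", "text": speech},
                "shouldEndSession": True,
            },
        }

    def set_thermostat(degrees):
        # Placeholder: a real skill would call the thermostat vendor's
        # REST API here.
        pass

Not much code for a great deal of convenience.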

This kind of ease explains why the total number of Alexa skills grew from 1,000 in June 2016 to 4,000 in November 2016. Clearly we have reached a new frontier of human-computer interaction (HCI), although in Alexa's case it might more accurately be called human-computer service interaction.

Speech is much smarter

What’s driving the growth in skills and the craze for voice interfaces? It's speech recognition quality.

I’ve used Google Voice for years, and in the early days, the text renderings of a voice mail were vastly amusing because they typically included words and phrases that had no relevance to the message at all. Over the past couple of years, however, the text rendering has improved greatly. I find that Google Voice always communicates the essence of the message and often renders it perfectly.

In fact, one can understand why voice interfaces are emerging by looking at Microsoft's recent announcement that its speech recognition technology has reached human parity, matching the error rate of professional human transcribers on a standard conversational speech benchmark.

That's a remarkable improvement, but a voice interface doesn't suit every use case. Kayak offers an Alexa skill to access its service and provides examples of how one can use it:

  • Discover places you can go within your budget: "Alexa, ask Kayak: Where can I go for $300?"

  • Search flights, hotels, and rental cars: "Alexa, ask Kayak to search for hotels in Barcelona."

  • Access Kayak's Flight Tracker to stay up to date on expected arrivals and departures: "Alexa, ask Kayak to track a flight."

It strikes me that asking for a simple temperature change is easy. I either want the temperature changed or not; it’s a binary decision. However, choosing a hotel in Barcelona is a far more complicated process because I need to evaluate such criteria as prices, location, availability on given dates, and hotel and local amenities. Trying to work through those permutations one voice command at a time is likely to be time-consuming and unlikely to be very satisfying. I’d probably want to use a browser interface for these complex tasks.  

The Alexa team addressed voice command complexity in a recent Fast Company article published on the occasion of Alexa's second anniversary. In the piece, an Alexa representative noted that music and audio book playing are natural complements to Echo but that shopping—which one would think would be a natural service offered by the e-commerce giant—has actually proved challenging. Why? Because it imposes a decision tree and complex interactions. Ordering more washing powder is easy, but buying a blue shirt is more difficult. What shade of blue do you need, and what fabric, fit, and size are you looking for?
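To see why, consider what a shirt-buying skill would have to do. The sketch below is hypothetical (the slot names and the place_order stub are mine, not Amazon's), but the Dialog.ElicitSlot directive is a real Alexa Skills Kit mechanism: every attribute the user hasn't yet specified becomes another spoken round trip.

    # Each unfilled attribute of the purchase forces another question.
    REQUIRED_SLOTS = ["Shade", "Fabric", "Fit", "Size"]

    def handle_buy_shirt(intent):
        slots = intent["slots"]
        for name in REQUIRED_SLOTS:
            if not slots.get(name, {}).get("value"):
                # Ask for the next missing attribute: one question per
                # turn, so four attributes can mean four exchanges.
                return {
                    "version": "1.0",
                    "response": {
                        "outputSpeech": {
                            "type": "PlainText",
                            "text": "What {} would you like?".format(name.lower()),
                        },
                        "directives": [
                            {"type": "Dialog.ElicitSlot", "slotToElicit": name}
                        ],
                        "shouldEndSession": False,
                    },
                }
        return place_order(slots)  # every attribute finally gathered

    def place_order(slots):
        # Placeholder for handing the completed order to a commerce back end.
        return {
            "version": "1.0",
            "response": {
                "outputSpeech": {"type": "PlainText", "text": "Order placed."},
                "shouldEndSession": True,
            },
        }

    # A user who has specified only the shade still faces three more turns.
    intent = {"slots": {"Shade": {"value": "navy"}}}
    print(handle_buy_shirt(intent)["response"]["outputSpeech"]["text"])

Compare that with scanning a page of shirt photos in a browser, and the friction is obvious.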

In fact, there are many categories where it's tough for Alexa to make a selection on behalf of a customer, especially for items that involve an aesthetic aspect, such as clothing.

So how should IT organizations view vocal computing? There are four takeaways that should factor into application road maps:

1. Embrace the new interface in town

You are probably thinking, “Great. I’ve barely gotten up to speed on mobile, and now there’s a new interface?”

Yep. There sure is. And for voice-appropriate applications, a voice interface is the most convenient (and magical!) way to work with them. Don't make the mistake of trivializing Alexa by characterizing it as perfect for playing music and ordering pizza but not for "real" enterprise offerings. Capital One, for example, has created an Alexa skill that lets customers ask for account balances and hear recent transactions across its range of accounts.

2. Develop skills

Just as many mobile apps at first crudely transplanted web interfaces and only became mobile-native once IT organizations got up to speed with the new form factor, your organization will need to move beyond ported GUIs and develop speech-native skills.

The Fast Company article quotes Alexa speech VP Rohit Prasad on this subject: "When you talk to developers, you will hear them say, 'It was a needed challenge for me to think about how I could transition my app that used a GUI [graphical user interface] to a voice experience.'"

It’s early days for speech recognition development, and there are no established standards or training for developers. But don’t let that hold you back. There are Alexa meetups, and Amazon has a very active evangelism effort you can draw on.

3. Think about service design

As I said, using Alexa for a task that works well as a speech-driven application is magic. But wrestling with a speech-driven application that is poorly suited to speech is, well, not very magical.

Speech is a natural way to drive an application. After all, humans have been talking for hundreds of thousands of years. However, you need to think about what your company offers and design a speech interface that allows efficient, elegant use of your service.

This is a good use case for design thinking, which I've written about before. Figuring out what Clayton Christensen calls "the job to be done," and capturing it in a form granular enough to be invoked simply and deliver the desired outcome, is central to creating a successful speech-driven application. So spend time thinking through the user's experience and what the user wants to accomplish; that will direct you toward good service design. Of course, there may be knock-on effects: a granular service may require creating or aggregating back-end services so that they can be called via speech.
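As a hypothetical sketch of that knock-on effect, imagine a spoken question ("Where is my order?") that maps to one granular entry point, which in turn aggregates existing order and shipping services. All the names below are invented; the point is the shape of the aggregation.

    from collections import namedtuple

    # Stand-ins for existing back-end services that each answer part of
    # the question.
    Order = namedtuple("Order", "summary shipment_id")
    Shipment = namedtuple("Shipment", "status eta")

    def latest_order(customer_id):
        return Order("a blue shirt", "SHIP-123")      # hypothetical order service

    def track(shipment_id):
        return Shipment("in transit", "on Thursday")  # hypothetical shipping service

    def handle_order_status_intent(customer_id):
        # One spoken question becomes one aggregate call, rather than a
        # tour through several screens' worth of intermediate lookups.
        order = latest_order(customer_id)
        shipment = track(order.shipment_id)
        return "Your order of {} is {} and should arrive {}.".format(
            order.summary, shipment.status, shipment.eta)

    print(handle_order_status_intent("C42"))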

4. Choose your ecosystem

Alexa isn't the only speech recognition service out there. Google has just released Home, its hardware device that incorporates speech recognition. Google also offers a speech recognition service that can be integrated into applications unrelated to Home, such as a mobile app that accepts speech input. Not to be left out, Microsoft offers similar functionality, although it has not yet packaged it in a stand-alone hardware device.
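To give a flavor of what embedding a provider's speech service looks like, here is a sketch that transcribes a short audio clip with the google-cloud-speech Python client. (Exact class names vary across versions of the library, so treat this as illustrative rather than definitive.)

    from google.cloud import speech

    client = speech.SpeechClient()

    # Read a short WAV clip recorded by the application.
    with open("command.wav", "rb") as f:
        audio = speech.RecognitionAudio(content=f.read())

    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=16000,
        language_code="en-US",
    )

    response = client.recognize(config=config, audio=audio)
    for result in response.results:
        # Each result carries one or more candidate transcriptions, best first.
        print(result.alternatives[0].transcript)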

This means you can choose among the three big cloud providers for your speech recognition capability. It also means you should consider which of the three you're willing to commit to, as using a speech service makes it convenient to use other services from the same provider. Whenever I talk to a company about cloud usage, I emphasize that each of the big three offers a rich ecosystem of services, its own and its partners', and that starting with one service is likely to entail committing to complementary services.

One has to feel the pain of IT organizations. The pace of technological change continues to accelerate, and the demands of customers continue to escalate. The recent improvement in speech recognition means that a new application interface is now viable and will soon be a must-have for users. It's truly remarkable how quickly speech recognition has gone from being a joke (from the user's perspective, at least; serious work has been going on in the artificial intelligence community for years) to being truly useful. It will be a rare IT organization that remains untouched by the speech revolution, so it's time to prepare for the new world of voice-driven applications.

The future of voice computing: Lessons for leaders

  • Speech will soon take equal place with web and mobile interfaces. You have no time to lose in getting up to speed.
  • Think about what your company offers, and design a speech interface that allows efficient, elegant use of your service.
  • Consider which of the three main cloud providers you’re most willing to engage with, as using a speech service makes it convenient to use other services from the same provider.