MultiTrack Call Transcription with Split Recording

Back in April we announced that split recording was available as part of the Nexmo Voice API. Split recording allows you to record a conversation in stereo, with one participant in each channel. This makes common use cases such as transcription much easier to handle.

However, there was one downside to split recording: if you have more than two participants, the first participant is in channel 0 and everyone else is in channel 1, which means we lose the ability to transcribe what each person said individually.

What if I told you that Nexmo now supports not one, not two, but three(!) separate channels in a recording? Would that make you happy? How about if I told you we could support four? Five? We’re pleased to announce that, available immediately, Nexmo supports up to 32 channels in a single recording. You read that correctly: we can provide 32 separate channels of audio, each containing a single participant, for you to process however you like.

Just like last time, we’re going to walk through a simple use case together. In this scenario, Alice needs to discuss a work project with Bob and Charlie, and they’ve all agreed that it would be a good idea to record the call. To achieve this, Alice has created a small Node.js application that uses Nexmo to connect her to Bob and Charlie and record the conversation.

All the code in this post is available on GitHub.

Bootstrapping Your Application

The first thing we need to do is create a new application and install all of our dependencies. To do that, we’ll use npm to initialise a project, then install express and body-parser to handle our HTTP requests and dotenv to handle our application configuration. We’ll also need to install the nexmo client so that we can access our call recording once we receive a notification that it’s available, and @google-cloud/speech to transcribe the audio.
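In full, the setup looks something like this (using `npm init -y` to accept the default project settings):

```shell
# Initialise a new Node.js project with default settings
npm init -y
# Install the HTTP, configuration, Nexmo and Google Speech dependencies
npm install express body-parser dotenv nexmo @google-cloud/speech
```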

Once you’ve installed all your dependencies, you’ll need to create an application and rent a number that people can call. If you don’t already have a Nexmo account, you can create one via our dashboard.

Nexmo sends HTTP requests to your application whenever an event occurs within your application. This could be when a call starts ringing, when it’s answered, or when a call recording is available, to name a few. For this to work, Nexmo needs to be able to reach your application, which is difficult when the application is running locally on your laptop. To expose your local application to the internet, you can use a tool called ngrok. For more information, you can read our introduction to ngrok blog post.

Expose your server now by running ngrok http 3000. You should see a forwarding line that ends in -> localhost:3000. The first part of that line is your ngrok URL, and it’s what you’ll need for the rest of this post.

We’re finally ready to create a Nexmo application and link a number to it. You’ll need to replace the URL shown with your own ngrok URL in these examples:
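The original commands aren’t shown here, but with the Nexmo CLI installed the setup typically looks something like the following sketch. The application name is arbitrary, and `example.ngrok.io`, `YOUR_NEXMO_NUMBER` and `YOUR_APPLICATION_ID` are placeholders for your own values:

```shell
# Create the application, pointing the answer and event webhooks at ngrok,
# and save the generated private key locally
nexmo app:create "multichannel-recording" https://example.ngrok.io/webhooks/answer https://example.ngrok.io/webhooks/event --keyfile private.key

# Link the number you rented to the new application
nexmo link:app YOUR_NEXMO_NUMBER YOUR_APPLICATION_ID
```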

If you now make a call to the number that you purchased, Nexmo will make a request to http://[id] to find out how to handle the call. As we haven’t built that application yet, the call will fail.

Handling an Inbound Call

Let’s build our application to handle inbound calls.

Create the index.js file and enter the code shown below. This requires all of our dependencies, configures the Nexmo client and creates a new express instance:
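The original listing isn’t included above, so here’s a minimal sketch of what that bootstrap code could look like. The environment variable names are assumptions that match the .env file we create later:

```javascript
// index.js — load configuration from .env before anything else
require('dotenv').config();

const express = require('express');
const bodyParser = require('body-parser');
const Nexmo = require('nexmo');

const app = express();
app.use(bodyParser.json());

// The client needs the application ID and private key so that it can
// download call recordings on our behalf later
const nexmo = new Nexmo({
  apiKey: process.env.NEXMO_API_KEY,
  apiSecret: process.env.NEXMO_API_SECRET,
  applicationId: process.env.NEXMO_APPLICATION_ID,
  privateKey: process.env.NEXMO_PRIVATE_KEY
});

app.listen(3000, () => console.log('Listening on port 3000'));
```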

Next, we need to create our /webhooks/answer endpoint, which will return an NCCO and tell Nexmo how to handle our call. In our NCCO we use two connect actions to automatically dial Bob’s and Charlie’s phone numbers and add them to the conversation, and a record action that tells Nexmo to record the call.

The record action is where the magic happens. We tell Nexmo to split the audio into separate channels by setting split: conversation, and that the audio should be split into three channels by setting channels: 3.
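The NCCO itself isn’t shown above; a sketch of the answer webhook could look like the following. It’s written as a plain handler function so the NCCO is easy to inspect, the environment variable names are assumptions, and you’d register it with `app.get('/webhooks/answer', answerWebhook)`:

```javascript
// Sketch of the answer webhook. The record action asks for one channel per
// participant, and the two connect actions dial Bob and Charlie into the call.
function answerWebhook(req, res) {
  const ncco = [
    {
      action: 'record',
      eventUrl: [`${req.protocol}://${req.get('host')}/webhooks/recording`],
      split: 'conversation', // one channel per participant
      channels: 3            // Alice, Bob and Charlie
    },
    {
      action: 'connect',
      from: process.env.NEXMO_NUMBER,
      endpoint: [{ type: 'phone', number: process.env.BOB_NUMBER }]
    },
    {
      action: 'connect',
      from: process.env.NEXMO_NUMBER,
      endpoint: [{ type: 'phone', number: process.env.CHARLIE_NUMBER }]
    }
  ];
  res.json(ncco);
}
```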

We’ll need to provide Bob’s and Charlie’s phone numbers for this to work, but let’s finish creating our endpoints before we configure our application.

Create the /webhooks/event endpoint, which will be notified whenever events are triggered on the call. For now, all we’re going to do is log the parameters we received and acknowledge receipt by sending back a 204 response.
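As a sketch, again written as a plain handler function (register it with `app.post('/webhooks/event', eventWebhook)`):

```javascript
// Log every call event Nexmo sends us and acknowledge it with a 204
function eventWebhook(req, res) {
  console.log(req.body);
  res.status(204).send();
}
```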

Lastly, we need to implement our /webhooks/recording endpoint. This URL was defined in our NCCO in the record action and will be notified when a recording is available. We’ll automatically transcribe the audio later, but for now let’s log the request parameters so that we can see what’s available.
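A first, log-only version of that handler could look like this (register it with `app.post('/webhooks/recording', recordingWebhook)`; we’ll extend it in the Google section):

```javascript
// Log the recording notification from Nexmo — the body includes the
// recording_url we'll need when we transcribe the audio later
function recordingWebhook(req, res) {
  console.log(req.body);
  res.status(204).send();
}
```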

At this point there’s just one thing left to do: populate our .env file to provide the information that our application needs. You’ll need your application_id, as well as some phone numbers for Bob and Charlie to help you test. Create a .env file and add the following, making sure to replace the application ID and phone numbers with your own values:
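The original file isn’t shown, but based on the configuration the application reads, a .env along these lines would work. The variable names are assumptions, and every value is a placeholder to replace with your own:

```
NEXMO_API_KEY=your_api_key
NEXMO_API_SECRET=your_api_secret
NEXMO_APPLICATION_ID=your_application_id
NEXMO_PRIVATE_KEY=./private.key
NEXMO_NUMBER=447700900000
BOB_NUMBER=447700900001
CHARLIE_NUMBER=447700900002
GOOGLE_APPLICATION_CREDENTIALS=./google_creds.json
```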

Make sure to add the Google credentials line, even though we haven’t created that file yet. We’ll create it in the next section.

Once you’ve done this, you can test your application by running node index.js and then calling the Nexmo number you purchased. It should automatically call the two numbers that you added to the .env file, and once everyone hangs up, your /webhooks/recording endpoint should receive a recording URL.

Connecting to Google

We’re going to use Google’s Speech-to-Text service to transcribe our call recording. To get started, you’ll need to generate Google Cloud credentials in the Google console. Download the JSON file they provide, rename it to google_creds.json and place it alongside index.js. This is the file that the Google SDK will try to read to fetch your authentication credentials.

To transcribe our audio data, we’re going to connect to the Google Speech API, passing in the expected language, the number of channels in the recording and the audio itself. Update your transcribeRecording method to include the following code:
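The listing isn’t included above, but a sketch of that method could look like this. It assumes Google’s LINEAR16 encoding for Nexmo’s WAV recordings, and `buildRecognizeRequest` is a hypothetical helper split out so the request shape is easy to see:

```javascript
// Build the request object for Google's Speech-to-Text recognize() call.
// buildRecognizeRequest is a hypothetical helper, pulled out for clarity.
function buildRecognizeRequest({ audio, channels, languageCode }) {
  return {
    config: {
      encoding: 'LINEAR16',                      // Nexmo recordings are WAV (linear PCM)
      languageCode,                              // e.g. 'en-US'
      audioChannelCount: channels,               // three channels: Alice, Bob and Charlie
      enableSeparateRecognitionPerChannel: true  // transcribe each channel on its own
    },
    audio: { content: audio.toString('base64') }
  };
}

function transcribeRecording({ audio, channels, languageCode }) {
  // In the full application this require lives at the top of index.js
  const speech = require('@google-cloud/speech');
  const client = new speech.SpeechClient();
  // recognize() returns a promise that resolves once the transcription is ready
  return client.recognize(buildRecognizeRequest({ audio, channels, languageCode }));
}
```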

This returns a promise from the Google Cloud Speech SDK, which will resolve when the transcription is available. Before we can use the Speech SDK, we need to require it, so add the following to your require section at the top of your file:
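Assuming the package name from the install step, that require is:

```javascript
const speech = require('@google-cloud/speech');
```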

At this point we’re ready to transcribe our audio. The final step is to fetch the recording from Nexmo once we receive a request to /webhooks/recording and feed the audio in to Google’s transcription service. To do that, we use the nexmo.files.get method and pass the audio returned in to our transcribeRecording method:
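Pulling it together, the finished recording webhook could look something like this sketch, replacing the log-only version from earlier. It assumes the `app`, `nexmo` and `transcribeRecording` definitions from the rest of index.js, and that each result Google returns carries a channelTag identifying which speaker it came from:

```javascript
app.post('/webhooks/recording', (req, res) => {
  // Download the audio from Nexmo, then hand it to Google
  nexmo.files.get(req.body.recording_url, (err, audio) => {
    if (err) { return console.error(err); }
    transcribeRecording({ audio, channels: 3, languageCode: 'en-US' })
      .then(([response]) => {
        response.results.forEach((result) => {
          const transcript = result.alternatives[0].transcript;
          console.log(`Channel ${result.channelTag}: ${transcript}`);
        });
      })
      .catch(console.error);
  });
  // Acknowledge the webhook straight away; transcription continues in the background
  res.status(204).send();
});
```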

As well as passing the audio as a parameter, we tell our transcribeRecording method that there are three channels in the audio and that we want to transcribe using the en-US language model. Once the promise is resolved by the Google Speech SDK, we read the results and output a transcription of the conversation, together with which channel each piece of audio came from.

If you call your Nexmo number now, you’ll be connected to both Bob and Charlie. Once the conversation is finished, hang up and wait for Nexmo to send you a recording webhook. Once it arrives, we send that audio off to Google and the transcription will appear in the console. This is how it looks for me:


We’ve just built a conference calling system with automated transcription in just 76 lines of code. Not only does it automatically transcribe the call, it transcribes each channel separately, allowing you to know who said what on the call. For more information about the options available when recording a call, you can view the record action in our NCCO reference, or see an example implementation of the NCCO required in multiple languages.