1. 12.0 Streaming Voice and Video with Unified Communications
Not only do our networks do a great job of getting data from point A to point B on the network, but these super fast, super reliable networks also do a great job of carrying voice and video. And that’s going to be the focus of this module. And what we’re talking about is sometimes called unified communications and that encompasses sending both voice and video in addition to data over an IP network.
And we’re also going to be supporting features such as messaging and presence. And we’ll get into all those different things in this module with a special focus on the quality of service that’s going to make sure we give preferential treatment to latency-sensitive traffic such as voice. Now, let’s begin our discussion in the next video with a look at Voice over IP.
2. 12.1 Voice over IP
In this video, we want to take a look at VoIP, or Voice over IP. For years, companies had their own telephone system called a PBX, a private branch exchange. And they would interconnect those PBXes between a couple of their sites using something called a tie line. It would tie together those phone systems, and it would go over the PSTN, the public switched telephone network. And simultaneously, they may have had a different network for data. They might have a router at each site. And those routers were interconnected over the IP WAN. And I didn’t picture it here on screen, but they might have even a third network for video. But as the speed and reliability of the IP WAN improved over the years, it became obvious that we really don’t need to have separate networks.
What if we just combined these voice and data networks and simply sent voice and data over that WAN, where we don’t have to pay for a tie line? Those routers that we connect into our PBXes, they can stream voice between one another using a layer four protocol called RTP. That’s the Real-time Transport Protocol. And the routers can connect into the PBX using a port such as an FXS port, an FXO port, an E&M port, or a digital port like a T1 or an E1. And this is the way that we started to migrate away from a traditional telephony network to a Voice over IP network. But the question is, how do those routers take an analog waveform like the spoken voice and convert that into digital ones and zeros? Well, it’s a lot like when you go to the movies and you’re watching a movie on screen.
A lot of movies are filmed at 24 frames per second. You’re actually not watching smooth motion, you’re watching 24 still images played back in rapid succession. But when that happens, it looks like smooth motion. Same thing with voice. If we take enough samples of our voice waveform, represented by the dots here, we can represent each sample with a number and send those numbers across a data network. The other side can take those samples, represented by numbers, and connect the dots to rebuild the original waveform. The question is, how many samples should we take every second? Now, here, you see I’m taking samples fairly frequently. In fact, I’m taking more samples than would be necessary to reproduce this waveform. And if I take more samples, yes, it’s going to help the audio quality a bit, but it’s going to take up more bandwidth.
It’s going to be more expensive. But if I take too few samples, when I connect the dots, it’s not going to sound anything like the original waveform. Here we’ve connected the dots, and you’ll notice that the resulting waveform, called the alias signal, looks nothing like the original signal. So how many samples should we take per second? And we get the answer from a gentleman who used to be a college professor back in the 1920s. His theory was developed with others over time, but it began in the 1920s. His name was Harry Nyquist, and his work developed into what is now referred to as the Nyquist Theorem. And the Nyquist Theorem says if you want to be able to take samples of a waveform and then recreate the original waveform based on those samples, then the number of samples you should take per second equals two times the highest frequency that you’re sampling.
Now, in the voice world, over 90% of human speech intelligence occurs at less than 4000 Hz. So for decades in the telephone industry, the goal has been to represent frequencies of 4000 Hz and below. And according to Mr. Nyquist, if the highest frequency that we want to sample is 4000 Hz, then two times 4000 is 8000, so we should take 8000 samples per second. And then if we play those samples back, it would recreate the original waveform, approximately anyway. And that’s called pulse amplitude modulation, or PAM. But we want to take it a step further than pulse amplitude modulation.
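As a quick sketch of that arithmetic, here is a tiny Python example; the function name is just a placeholder for illustration:

```python
def nyquist_sampling_rate(highest_frequency_hz: float) -> float:
    """Nyquist Theorem: sample at least twice the highest
    frequency you want to be able to reconstruct."""
    return 2 * highest_frequency_hz

# Telephony targets roughly the 0-4000 Hz band, where most
# speech intelligence lives, so the required sampling rate is:
print(nyquist_sampling_rate(4000))   # 8000 samples per second
```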
We want to assign a value to each of those samples so we can digitally send that value across our data network. That’s called pulse code modulation, or PCM. And we can take each of those samples, take a look at its volume, its amplitude, and we can say, if you fall in this bottom range, we’re going to assign you, as an example, a number of one. If you fall in this next range, we’re going to give you a number of two. In the next range, you get a number of three, and so on. But as you look at the amplitudes of these different samples, you’ll notice that none of them perfectly lines up with a one or a two or a three. There’s a little bit of error there. Does that cause a problem?
Actually, yes, it can cause a problem. It’s going to cause some background noise called quantization noise because of that quantization error, as it’s called. A way to improve this without having to take more samples and use more bandwidth is to use not a linear scale on the y axis, as we have here, but a logarithmic scale much like this. You might remember the old log graph paper you used in high school. Here, instead of a linear y axis, we’re going by powers of ten. So if you take a look at segment zero, you’ll notice all those dashes there. Those dashes are called steps, and they’re more tightly packed than the steps in segment one. And the steps in segment one, they’re more tightly packed than we would have in segment number two, which is not pictured on screen.
In other words, we’re going to be more accurate at lower volumes. That’s good for a couple of reasons. Number one, most samples statistically occur at lower volumes. So by using this logarithmic scale instead of a linear scale, we are being more accurate for more samples. Another reason it’s better to be more accurate at the lower volumes is that the louder volumes are so loud they’re going to tend to drown out that background quantization noise anyway. So we use a logarithmic scale, and we can represent each of those samples, superimposed on this scale, using just eight bits. One bit is the polarity bit: is it above the zero line or is it below the line? Then we’re going to have three bits to represent which segment it falls in. And remember those little dashes, the steps? We’ll have another four bits to say, here’s the step within that segment. And that’s going to give us an eight-bit sample. And remember, Mr. Nyquist said that we should take 8000 voice samples per second.
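To make that bit layout a little more concrete, here is a minimal Python sketch that packs the one polarity bit, three segment bits, and four step bits into a single byte. This is just an illustration of the structure described above, not a full G.711 mu-law or A-law encoder, and the sample values are made up:

```python
def pack_pcm_sample(polarity: int, segment: int, step: int) -> int:
    """Pack one 8-bit PCM sample:
       1 polarity bit - above or below the zero line
       3 segment bits - which logarithmic segment (0-7)
       4 step bits    - which step within that segment (0-15)"""
    assert polarity in (0, 1) and 0 <= segment <= 7 and 0 <= step <= 15
    return (polarity << 7) | (segment << 4) | step

sample = pack_pcm_sample(polarity=1, segment=2, step=9)
print(f"{sample:08b}")   # 10101001 - one 8-bit sample ready to send
```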
In other words, 4000 Hz times two is 8000. If we take 8000 eight-bit samples per second, we get 64 Kbps. That’s a number that has been used in the telephony industry, again, for decades. If you remember the old T1 circuit, it had 24 separate channels. Each channel could carry a voice call. How much bandwidth was available in each channel? That’s right, 64K, because that will give us uncompressed voice. Now, in the Voice over IP world, we’re going to encode voice using different types of codecs. A codec is short for coder-decoder. And one of the codecs that does uncompressed voice is G.711. It uses 64 Kbps of bandwidth. Now, you have to add on some header information for layer two and layer three and layer four, which makes it 87.2 Kbps. But payload only, it takes up 64 Kbps: 8000 samples times eight bits.
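Here is a short Python sketch showing where both the 64 Kbps and the 87.2 Kbps figures can come from. It assumes a common 20 ms packetization interval and Ethernet as the Layer 2 encapsulation; different packet sizes or Layer 2 types would change the total:

```python
# G.711 payload: 8,000 samples per second * 8 bits per sample
payload_bps = 8000 * 8                  # 64,000 bps = 64 Kbps

# Assume 20 ms of audio per packet -> 50 packets per second
packets_per_second = 1000 // 20

# Assumed per-packet overhead:
#   IP (20 bytes) + UDP (8 bytes) + RTP (12 bytes) = 40 bytes
#   Ethernet Layer 2 header/trailer = 18 bytes
overhead_bits_per_packet = (20 + 8 + 12 + 18) * 8

total_bps = payload_bps + packets_per_second * overhead_bits_per_packet
print(payload_bps / 1000)               # 64.0 Kbps payload only
print(total_bps / 1000)                 # 87.2 Kbps including headers
```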
But if you’re going over the wide area network, you may be willing to sacrifice a little bit of voice quality to save some extra bandwidth. For example, G.729, payload only, takes up just 8 Kbps of bandwidth. Yes, the voice quality is slightly less than G.711, but it’s a trade-off. Are you willing to sacrifice a little bit of voice quality to save some bandwidth and therefore save some money? And both G.711 and G.729 have been very popular in the industry over the years. More recently, people have started using iLBC, the Internet Low Bitrate Codec, to communicate over the wide area network. And it can operate at a couple of bandwidths: 13.3 Kbps or 15.2 Kbps. It gives us a little better voice quality than G.729, and it still gives us less bandwidth consumption than G.711.
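As a rough sketch of that trade-off, we can compare the payload-only rates mentioned above and see how many calls might fit on a link. The 512 Kbps WAN link is just a made-up example value, and this ignores header overhead entirely:

```python
# Nominal payload-only bit rates (in Kbps) for the codecs discussed
codec_payload_kbps = {
    "G.711": 64.0,   # uncompressed, best voice quality
    "iLBC":  15.2,   # also runs at 13.3 Kbps with a larger frame size
    "G.729":  8.0,   # most compressed of the three
}

wan_link_kbps = 512  # hypothetical WAN link, payload bandwidth only
for codec, kbps in codec_payload_kbps.items():
    calls = int(wan_link_kbps // kbps)
    print(f"{codec}: roughly {calls} concurrent calls")
```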
And that’s a look at Voice over IP, or VoIP as some people call it, where we’re using the existing data network to carry not just data but also voice. And we could even do video as well over this previously data-only network. And the way we send voice across that digital network is using a layer four protocol called RTP, the Real-time Transport Protocol. And we took a look in this video at how we take the spoken voice, that continuously varying analog waveform, and convert it into binary ones and zeros. We take 8000 samples per second, and we represent each sample with eight bits. And by sending 8000 eight-bit samples per second, we get 64 Kbps of bandwidth required, not including overhead, for an uncompressed voice conversation.
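For a feel for what RTP actually puts in front of each voice payload, here is a minimal Python sketch that builds the fixed 12-byte RTP header. The specific values here, payload type 0 for G.711 mu-law, a made-up SSRC, and a 160-sample timestamp step for a 20 ms packet, are illustrative assumptions rather than anything from this lesson:

```python
import struct

def build_rtp_header(seq: int, timestamp: int, ssrc: int,
                     payload_type: int = 0) -> bytes:
    """Build the fixed 12-byte RTP header: version 2, no padding,
    no extensions, no CSRCs, marker bit clear, then the payload
    type, sequence number, timestamp, and SSRC."""
    first_byte = 2 << 6                # version 2 in the top two bits
    second_byte = payload_type & 0x7F  # marker bit left at 0
    return struct.pack("!BBHII", first_byte, second_byte,
                       seq & 0xFFFF, timestamp & 0xFFFFFFFF,
                       ssrc & 0xFFFFFFFF)

# One header per 20 ms packet; the timestamp advances by 160 samples
header = build_rtp_header(seq=1, timestamp=160, ssrc=0x12345678)
print(len(header), header.hex())   # 12 bytes that precede the voice payload
```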
3. 12.2 Video over IP
Just like we can have audio or voice going across an IP network, we can do the same thing with video. Yes, it’s going to take up more bandwidth typically than audio will, but if we have enough bandwidth, we can stream video across a data network, and we see that really commonly today. Maybe you have used Zoom to have some sort of a video conference call with people you work with, and there are different applications besides just Zoom to do that; we might be using Microsoft Skype as an example. Or here on screen, we see what’s called an immersive telepresence room, where you have these high definition, like 4K, large video screens mounted to a wall and high definition cameras pointed at you. And I’ve experienced this before, and it’s uncanny how realistic it feels.
Another application for video on our network might be for a call center. Instead of calling customer service and talking with someone using audio only, what if you were able to see them and maybe show them the piece of equipment that broke and that you’re calling in for support on? That’s starting to become more and more popular. We have people that are going to their doctor using these video streams. Telemedicine, that’s a big one these days. Maybe you have business partners and you set up some sort of a call with them over video. That’s very common. But besides just using this to meet with other people, maybe we’re using it for security, like video surveillance.
Even in my home, we have these video cameras mounted around the exterior of our home, where we’re able to detect motion in front of our garage or the front gate, as a couple of examples. And we could, over our IP-based network, see who’s at the front door or the front gate or at the garage. And as bandwidth has become more and more available, we have seen a surge in video applications. Now, when we talk about video devices on a network, there are some terms I would like you to know. So let’s take a look at some of these video terms. One term you hear, and this might be when you go out to buy an HDTV as an example, is frames per second. And we talk about frames per second in the context of movies.
Oftentimes if you go to a movie theater, you watch what appears to be smooth motion on screen. A lot of movies are shot in 24 frames per second. So you’re not actually watching smooth motion. It looks like you’re watching smooth motion, but actually you’re watching 24 separate images shown to you in rapid succession such that it creates the illusion of smooth motion. And the more frames we take per second, the smoother that motion is going to be. But let’s take that 24 frames per second as an example. Again, when you go to the movie theater and they’re projecting that movie up on the screen, that projector has a shutter that will come down over the lamp to block the light while the film advances to the next frame. But I said we were running at 24 frames a second.
Does that mean that we show one frame, and then 1/24th of a second later, the shutter comes down, blocks the light so we can advance to the next frame, the shutter gets out of the way, and we see that second frame? No, at 24 frames per second, that would become very noticeable. There would be a lot of flicker. So what’s happening at the movies instead is that the refresh rate represents the number of times that we’re showing an image. So even though we’re showing 24 frames per second, we might have a refresh rate of 48, meaning that for each of those 24 frames, we’re going to project that frame. The shutter is going to come down and block the frame.
The shutter is going to come out of the way, and we’re still showing the same frame. So if the refresh rate is 48, but the frames per second is 24, each frame is shown twice. And by having that shutter come down and then come back up more frequently, it’s going to help reduce that flicker we would otherwise see. And you hear about this a lot in televisions and in computer monitors for gaming as an example, because when you’re watching something on your television, it’s common to have 30 frames per second. But your refresh rate might be 60 or 120 or maybe even 240.
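The relationship between those two numbers is just division, as this tiny sketch shows, using the example values from this discussion:

```python
def times_each_frame_is_shown(refresh_rate_hz: int, frames_per_second: int) -> float:
    """Refresh rate divided by frame rate: how many times each frame is displayed."""
    return refresh_rate_hz / frames_per_second

print(times_each_frame_is_shown(48, 24))    # 2.0 - the movie projector example
print(times_each_frame_is_shown(120, 30))   # 4.0 - a 120 Hz TV showing 30 fps video
```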
The higher the refresh rate, the smoother the motion. And the way that video image is painted on your screen could be using interlaced video or progressive video. The very first high definition TV that I bought said that it had a resolution of 1080i. Not 1080p, but 1080i. The i meant interlaced. Remember, our refresh rate might be 60 and the frames per second might be 30 on a television. So what would happen with interlaced video? During one of those refresh rate periods, which lasted 1/60th of a second, it would paint half of the lines of the picture. And in the next refresh rate period, where we’re still showing the same image, it would paint the other portion of that image. So it took two refresh rate cycles to completely show that image, and we had 30 images per second.
Now, it looked good, but it’s not as good as progressive video. With progressive video, the entire image gets painted at once. So when you see that your television has a resolution of 1080p, that means that it has 1080 rows that it’s going to paint. But the p means it’s progressive. They all get painted at the same time. And how sharp the image is is determined by the number of pixels, the number of picture elements that we have in an image. And I took a picture of the music icon on a couple of iPads that I have. Now, this is an older iPad, it doesn’t have Apple’s Retina Display. And you’ll notice that there’s a little bit of jaggedness around that image. It’s kind of blocky. Compare that and I’ve zoomed in on these, but compare that to the much smoother image you see on the right.
That’s taken with an iPad that does have a Retina display. So the more pixels, the sharper the image. Another measurement we hear a lot about in the video world is the aspect ratio. And for years, the common aspect ratio for television sets was four by three. The ratio of the length versus the height was a ratio of four to three. Most televisions today, though, have aspect ratios of 16 to nine. So it’s a wider display. In fact, there was a fairly famous experiment tried recently. You might have heard that the Justice League movie was reintroduced. People are calling it the Snyder Cut because of the director that did the cutting.
He actually used a four by three ratio that was very uncommon for movies, but it met with a lot of praise for using that aspect ratio. But let’s take that 16 by nine aspect ratio as an example. What exactly does that mean? Well, if we have a 1080p television with a 16 by nine ratio, that means we have 1080 horizontal rows and we have 1920 vertical columns. And if you multiply 1080 by 1920, you get a little over 2 million pixels on a 1080p video display. More popular today, though, are 4K displays, where we have approximately 4000 vertical columns, hence the name 4K for 4000.
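To work through that pixel arithmetic, here is a short sketch. The 3840 by 2160 figure is the common consumer 4K (UHD) resolution, used here as an assumed stand-in for a display with approximately 4000 columns:

```python
def pixel_count(columns: int, rows: int) -> int:
    """Total pixels = horizontal columns * vertical rows."""
    return columns * rows

print(pixel_count(1920, 1080))   # 2,073,600 - a little over 2 million (1080p)
print(pixel_count(3840, 2160))   # 8,294,400 - roughly 8 million (consumer 4K / UHD)
print(round(1920 / 1080, 2))     # 1.78, the 16 by 9 aspect ratio
```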
Another term I want you to know about, in fact, some standards I want you to know about, are how we do compression of video images. And there have been several different MPEG standards over the years. MPEG stands for the Moving Picture Experts Group. We’ve had MPEG-1, MPEG-2, and MPEG-4. For broadcast television, MPEG-2 and MPEG-4 are very common. Now there’s a variant of MPEG-4 that we use on our data networks a lot. It’s called H.264 encoding. That’s a standard that’s used on, for example, a Blu-ray player, and it might be used when you’re streaming your video across the internet to have a meeting with your co-workers. But that’s a look at a few examples of how we can have video over IP, some of the applications, and some of the terms associated with video.