Realtime communication in web browsers

Video sessions in the browser open up a lot of new applications. While this has been possible with various plugins, new opportunities will appear when it becomes part of your standard web browser. HTML5 introduced native video in the browser. New standardization efforts are looking into peer-to-peer multimedia sessions in the browser, not just broadcast. This opens up a large set of applications, from “click2call” without plugins to video chats like those in Facebook and Google+. This is exciting. And all the old SIP architects seem to have jumped onto this project, trying to get things right from the start in this second-generation effort.

IETF

The new and growing project in the IETF and the World Wide Web Consortium that covers this new way of communicating with multimedia is called WebRTC for the JavaScript API part and RTCweb for the network protocol part. This will complement or replace soft phones, which to me have always been a strange application on my desktop.


It is very interesting to see the discussion flow back and forth. Web development moves much faster than IETF protocol standardization, so there is a sense of urgency and a lot of pressure from various stakeholders, as there is everywhere standardization happens.

While trying to catch up with the work in this area, I have found a few issues that feel important to me:

  • Security
  • Protocol standardization
  • PSTN integration
  • IPv6

Security: Make RTCweb secure by default

In an earlier blog post about Voice 3.0 I expressed the need for a focus on security. It seems that many people in the RTCweb project have been listening. The drafts propose making it secure by default, always using encrypted media streams. The problem, as always, is how we exchange keys to set up an encrypted channel peer to peer. Sending the keys inside the TLS-wrapped HTTP stream to the web server means that the web server will be able to see them in clear text. Building and managing a PKI and distributing key pairs and certificates is something we all want to avoid. I feel we’re back to square one, unless we use Phil Zimmermann’s ZRTP, which removes the need for any key exchange in the signaling path. Of course, Alan Johnston and Phil have suggested that as a possible solution.
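
The idea behind ZRTP is that the two endpoints agree on keys with a Diffie-Hellman exchange over the media path itself, then defeat a man in the middle by having the users read a short authentication string (SAS) to each other. Below is a minimal Python sketch of that idea. The toy group, the hash construction and all function names are assumptions made for illustration; this is not the ZRTP key schedule from RFC 6189 and is not something to deploy.

```python
import hashlib
import secrets
from typing import Optional, Tuple

# Toy Diffie-Hellman group (a Mersenne prime): far too small and unvetted for
# real use. ZRTP itself uses standardized DH/ECDH groups. Illustration only.
P = 2**127 - 1
G = 3
# z-base-32 alphabet, which ZRTP uses to render a 20-bit SAS as 4 characters.
ZBASE32 = "ybndrfg8ejkmcpqxot1uwisza345h769"


def keypair(private: Optional[int] = None) -> Tuple[int, int]:
    """Return (private, public). Each endpoint does this locally, so nothing
    secret ever passes through the web server."""
    priv = private if private is not None else secrets.randbelow(P - 3) + 2
    return priv, pow(G, priv, P)


def shared_secret(my_private: int, their_public: int) -> int:
    # Both sides arrive at G**(a*b) mod P without ever transmitting it.
    return pow(their_public, my_private, P)


def short_auth_string(secret: int, initiator_pub: int, responder_pub: int) -> str:
    """Hash the secret with the public values this endpoint saw and render the
    top 20 bits as 4 z-base-32 characters (simplified, not RFC 6189)."""
    h = hashlib.sha256()
    for value in (secret, initiator_pub, responder_pub):
        h.update(value.to_bytes(16, "big"))  # every value is below 2**127
    bits = int.from_bytes(h.digest()[:4], "big") >> 12  # keep the top 20 bits
    return "".join(ZBASE32[(bits >> shift) & 0x1F] for shift in (15, 10, 5, 0))


if __name__ == "__main__":
    alice_priv, alice_pub = keypair()
    bob_priv, bob_pub = keypair()
    alice_sas = short_auth_string(shared_secret(alice_priv, bob_pub), alice_pub, bob_pub)
    bob_sas = short_auth_string(shared_secret(bob_priv, alice_pub), alice_pub, bob_pub)
    print(alice_sas, bob_sas)  # the users read these aloud and compare
```

If someone in the path substitutes their own public value, Alice and Bob end up with different secrets, and their strings match only with a probability of about 2^-20. That spoken comparison is what replaces the PKI.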

I’ve heard other voices saying that security is too complicated and not required by the users. “Why make this more secure than any other protocol we use?” My answer would be – “because we failed with all the other protocols”. Our users are wide open to abuse, monitoring of their phone calls and invasions of their privacy. It’s almost impossible to build a solution with SIP and SRTP today because of a lack of interoperability. And there are no real solutions for end-to-end security. We really, really need to fix this. I envision that RTCweb is going to be used on all kinds of mobile devices connected to all kinds of public and private networks. Letting the user decide whether a particular network is secure for a specific type of connection doesn’t make sense to me. Following what Harald Alvestrand proposes seems like the best way – consider all servers in the media and signaling path insecure, consider all networks insecure. Build a solution for everyone and every need.

Why do we need a protocol at all? Interoperability!

The RTCweb architecture relies on the web browser and the HTTP protocol. Using HTTP, we download a web page that activates the session. The session uses RTP for media between the browser and the remote end – another user, a conference server, a call center, a gateway to existing services like the telephony network. Everyone seems to agree that we need a standardized JavaScript API so that it looks the same in all browsers. There’s a lot of disagreement about what to do between the web page and the start of the RTP stream. We need some kind of negotiation – I have audio and no video, you have an HD webcam and a stereo microphone and speakers – we need to agree. All this has been handled by the SIP protocol for many years. Now, SIP might be too heavyweight to force into each and every browser. And bringing in SIP might force us to bring in the whole PSTN legacy, including fax and DTMF. That will be very complex.
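
The negotiation described above is, at its core, an offer/answer exchange: one side lists what it can send and receive, and the other answers with the subset it accepts. The sketch below shows that intersection step in Python. `MediaCaps` and `make_answer` are hypothetical names, only codec names are compared (resolution and channel counts are left out), and keeping the offerer’s preference order is a simplifying assumption; real SIP and RTCweb carry this in SDP.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class MediaCaps:
    """Hypothetical, simplified capability set (not SDP)."""
    audio: List[str] = field(default_factory=list)  # codec names, preferred first
    video: List[str] = field(default_factory=list)


def _common(offered: List[str], supported: List[str]) -> List[str]:
    # Keep the offerer's preference order and drop what we cannot handle.
    ours = {c.lower() for c in supported}
    return [c for c in offered if c.lower() in ours]


def make_answer(offer: MediaCaps, local: MediaCaps) -> MediaCaps:
    """An empty list rejects that media type, much like port 0 in an SDP answer."""
    return MediaCaps(audio=_common(offer.audio, local.audio),
                     video=_common(offer.video, local.video))


if __name__ == "__main__":
    # The caller has a camera; we only have a microphone.
    offer = MediaCaps(audio=["opus", "PCMU"], video=["VP8", "H264"])
    local = MediaCaps(audio=["PCMU", "opus"], video=[])
    print(make_answer(offer, local))  # MediaCaps(audio=['opus', 'PCMU'], video=[])
```

Whatever protocol ends up carrying it, both browsers have to run the same rules for this step, which is where the interoperability question comes from.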

Some people say that we’ll just keep HTTP and then let the web server decide whether it wants to use SIP for gateways to the PSTN, XMPP to find other people, or another protocol. My feeling is that if we don’t standardize this, we won’t get enough interoperability. It will be hard to find each other. But I might be too influenced by the legacy in SIP and XMPP, two protocols that I use daily.
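
To make the “just keep HTTP” model concrete, here is a small Python sketch of the server side: the browser posts its session description over HTTP, and the web server picks a signaling backend based on the destination. The backend names and the scheme-to-backend table are hypothetical; nothing here is a defined RTCweb interface.

```python
from typing import Dict, Tuple

# Hypothetical table: which signaling protocol this web server uses to reach
# each kind of destination. The browser itself only ever speaks HTTP.
BACKENDS: Dict[str, str] = {
    "tel": "sip-pstn-gateway",  # phone numbers go out through a SIP trunk
    "sip": "sip-proxy",         # existing SIP users
    "xmpp": "xmpp-jingle",      # XMPP contacts, media set up with Jingle
    "https": "http-relay",      # another browser on the same site
}


def route(destination: str, session_description: str) -> Tuple[str, str]:
    """Pick a backend from the URI scheme and hand over the browser's session
    description unchanged; translating it into SIP or Jingle is the backend's job."""
    scheme, sep, _ = destination.partition(":")
    if not sep or scheme.lower() not in BACKENDS:
        raise ValueError(f"no signaling backend for {destination!r}")
    return BACKENDS[scheme.lower()], session_description


if __name__ == "__main__":
    print(route("tel:+15550100", "<session description from the browser>"))
```

Nothing forces two sites to choose the same table, or even the same schemes, which is exactly the interoperability worry in the paragraph above.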

PSTN integration – something that brings DTMF, early media and fax

The SIP core protocol is a rather simple protocol. Over the years, a lot of legacy PSTN functionality has been brought in, which has given us complex signaling and functions that are focused on the need for billing – like early media. If we don’t bill media per second, early media doesn’t matter. We might as well answer the call and listen to whatever the other end wants to send. It will also bring in DTMF, faxing and a lot of other complex features. Personally I don’t see PSTN connectivity as an important or driving application in RTCweb. Better to integrate with SIP, Skype, Apple FaceTime and XMPP Jingle. Making DTMF a requirement is not a good thing.
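
To show what a DTMF requirement drags into every endpoint: in SIP/RTP systems key presses are usually sent as RFC 4733 “telephone-event” RTP payloads next to the audio. The Python sketch below packs one such 4-byte payload (event code, end bit, reserved bit, 6-bit volume, 16-bit duration). The function name is hypothetical, and the 8000 Hz timestamp clock in the comments is the typical case, not a fixed rule.

```python
import struct

# RFC 4733 event codes for DTMF: digits 0-9 are events 0-9, then *, #, A-D.
DTMF_EVENTS = {**{str(d): d for d in range(10)},
               "*": 10, "#": 11, "A": 12, "B": 13, "C": 14, "D": 15}


def telephone_event(key: str, duration: int, volume: int = 10, end: bool = False) -> bytes:
    """Pack one telephone-event payload.

    duration: length so far in RTP timestamp units (typically an 8000 Hz clock).
    volume:   power level in -dBm0, 0..63.
    end:      set on the final packet(s) of the key press.
    """
    if not 0 <= volume <= 63:
        raise ValueError("volume must fit in 6 bits")
    if not 0 <= duration <= 0xFFFF:
        raise ValueError("duration must fit in 16 bits")
    flags = (0x80 if end else 0) | volume  # E bit, R bit left at 0, volume
    return struct.pack("!BBH", DTMF_EVENTS[key.upper()], flags, duration)


if __name__ == "__main__":
    # End packet for a 100 ms press of "5" (800 units at 8000 Hz).
    print(telephone_event("5", duration=800, end=True).hex())  # 058a0320
```

Generating these, plus the repeated end packets and the matching SDP, is small on its own, but it is one more PSTN-shaped feature every browser would have to carry.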

IPv6 and dual stacks

There is no mention of dual-stack issues anywhere. I think it’s naive not to bring this to the table early in a protocol being designed now. The current standard proposal (draft) requires the use of the ICE framework for NAT traversal. This might solve the IPv6 traversal issues too, with IPv6 clients using TURN servers to get an IPv4 address. I do however think it has to be put on the table explicitly so it’s not forgotten in the implementations. Maybe it’s time for me to step in and get active in this discussion.
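
Why TURN can rescue a dual-stack mismatch becomes clearer from how ICE forms candidate pairs: a local candidate is only paired with a remote one of the same component and the same IP version. The Python sketch below uses the candidate and pair priority formulas from the ICE specification (RFC 8445, originally RFC 5245). Giving IPv6 a slightly higher local preference is an assumption of this sketch, as is a TURN server that hands an IPv6-only client an IPv4 relayed address; the class and function names are hypothetical.

```python
import ipaddress
from dataclasses import dataclass
from typing import List, Tuple

# Recommended ICE type preferences.
TYPE_PREF = {"host": 126, "prflx": 110, "srflx": 100, "relay": 0}


@dataclass(frozen=True)
class Candidate:
    address: str
    kind: str           # "host", "prflx", "srflx" or "relay"
    component: int = 1  # 1 = RTP

    @property
    def family(self) -> int:
        return ipaddress.ip_address(self.address).version  # 4 or 6

    def priority(self) -> int:
        # ICE candidate priority formula; preferring IPv6 is our assumption.
        local_pref = 65535 if self.family == 6 else 65534
        return (TYPE_PREF[self.kind] << 24) + (local_pref << 8) + (256 - self.component)


def pair_priority(g: int, d: int) -> int:
    # g: controlling agent's candidate priority, d: controlled agent's.
    return (1 << 32) * min(g, d) + 2 * max(g, d) + (1 if g > d else 0)


def form_pairs(local: List[Candidate], remote: List[Candidate],
               controlling: bool = True) -> List[Tuple[int, Candidate, Candidate]]:
    pairs = []
    for lc in local:
        for rc in remote:
            # Only pair candidates with the same component and IP version.
            if lc.component != rc.component or lc.family != rc.family:
                continue
            lp, rp = lc.priority(), rc.priority()
            g, d = (lp, rp) if controlling else (rp, lp)
            pairs.append((pair_priority(g, d), lc, rc))
    pairs.sort(key=lambda p: p[0], reverse=True)  # highest priority first
    return pairs


if __name__ == "__main__":
    remote = [Candidate("192.0.2.10", "host")]        # IPv4-only peer
    local = [Candidate("2001:db8::1", "host")]        # IPv6-only client
    print(len(form_pairs(local, remote)))             # 0: nothing to try
    local.append(Candidate("198.51.100.7", "relay"))  # IPv4 relay from TURN
    print(len(form_pairs(local, remote)))             # 1: the relayed pair
```

Without that relayed candidate the call simply has no path, which is the kind of dual-stack case implementations need to plan for.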

 Links to more information: