How to Live Stream Video as You Shoot It in iOS

In our definition, live streaming is a feature of an application that allows you to transmit video over the network as you shoot your video, without having to wait until recording is complete. In this post, we are going to discuss how we have tackled this problem while developing the Together video camera app for iOS.

Together video camera is an iPhone application that lets you shoot videos and easily manage your video collection, sorting it both automatically (by date, location, and tags) and manually (by albums). You can also synchronize your collection between devices and share your videos and streams with your friends.

At the very outset of the development lifecycle, we conceived Together as a teleconferencing application; then we shifted our focus to one-way live streaming to a large audience, as in Livestream or Ustream. Finally, as we approached our first public release, we converted the application into a personal video library manager.

At first glance, supporting live streaming is not a big deal. Lots of well-known applications and services on a variety of platforms use this feature. It is the main feature for some of them, e.g. Skype, Chatroulette, Livestream, FaceTime, and many others. However, the standard iOS development tools and libraries prevent an optimal implementation of this function, since they do not offer direct access to hardware-based video encoding features.

From the implementation viewpoint, live streaming can be broken down into the following sub-tasks:

Get the stream data while shooting.

Parse the stream data.

Convert the stream into the format supported by the server.

Deliver the data to the server.

Here are the basic requirements for an app that implements live streaming:

Ensure a minimum delay between video shooting and its display to the consumer.

Ensure the minimum amount of data sent over the network while maintaining acceptable quality of picture and sound.

Enable optimal utilization of CPU, memory, and storage capacity of the shooting device.

Minimize the battery drain.

Depending on the purpose of live streaming in the app, certain requirements may dominate. For instance, the requirement to minimize battery drain conflicts with minimizing delay, as sending large chunks of data across the network may be more energy efficient than maintaining a connection that constantly exchanges data while shooting. Our Together app is not focused on getting feedback from the viewers in real time; at the same time, we want to let you share the videos you shoot as soon as possible. Therefore, minimizing battery drain has become a top priority for us.

The iOS SDK includes a fairly rich set of features to interact with the camera and handle video and audio data. In our previous posts, we have already told you about the CoreMedia and AVFoundation frameworks. Classes such as AVCaptureAudioDataOutput and AVCaptureVideoDataOutput, combined with AVCaptureSession, can retrieve the camera picture frame by frame in the form of uncompressed video and audio buffers (CMSampleBuffer). The SDK is supplied with sample apps (AVCamDemo, RosyWriter) illustrating how to work with these classes. Before transmitting buffers obtained from the camera to the server, we have to compress them with a codec. Video encoding at a decent compression ratio usually involves substantial CPU time and, accordingly, battery drain. iOS devices have special hardware for fast video compression with the H.264 codec. To make this hardware available to developers, the SDK provides only two classes, which differ in ease of use and features:

AVCaptureMovieFileOutput outputs the camera image directly to a MOV or MP4 file.

AVAssetWriter also saves video to a file, but in addition it can take its source picture from frames provided by the developer; such frames can be either obtained from the camera as CMSampleBuffer objects or generated programmatically.

Let's analyze the CPU time spent on H.264 video encoding using the standard SDK libraries and using FFmpeg in the FFmpeg-iOS-Encoder project. We ran our benchmarks with the Time Profiler tool, shooting and simultaneously recording a video file at a resolution of 1920×1080.

In the first graph, you can see video recording with the AVAssetWriter class. CPU utilization before and after the recording starts is almost constant. Part of the encoding load is transferred to the mediaserverd system process, but even taking this into account the total CPU utilization is rather small, and the video is shot smoothly, without interruption, at a frame rate of 30 fps.

In the second graph, you can see video recorded with the FFmpeg libraries using the MPEG4 codec, with the video resolution reduced to 320×240. In this case, the CPU utilization is nearly 100%, yet the video is recorded at a rate of only 10-15 fps.

When sending video over a network, the ideal option would be to skip saving data to the disk and to feed compressed data directly into the application's memory. Unfortunately, the SDK does not provide for this, so the only thing left is to read compressed data from the file the system writes to.

An attentive reader would probably recall that iOS is a UNIX-like system under the hood, and such systems usually have a special type of file called a named pipe, which allows you to read and write to a file without involving any disk resources. However, AVAssetWriter does not support this type of file, because it does not write the data strictly sequentially: sometimes it has to go back and fill in missing data, for instance, the length field of the mdat atom in a MOV file.
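The limitation can be reproduced with plain POSIX calls. The sketch below is only an illustration and has nothing to do with the internals of AVAssetWriter: the helper names fd_is_seekable and backpatch_length are made up for this example. It imitates a writer that returns to an earlier offset to fill in a length field, which fails with ESPIPE on a pipe or FIFO.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/* True if the descriptor supports repositioning. For pipes and FIFOs
 * lseek() fails and sets errno to ESPIPE. */
bool fd_is_seekable(int fd)
{
    assert(fd >= 0);
    return lseek(fd, 0, SEEK_CUR) != (off_t)-1;
}

/* Imitates a container writer that goes back to fill in a 32-bit
 * big-endian length field it could not know in advance.
 * Returns 0 on success, -1 on failure (errno is ESPIPE when the
 * descriptor cannot seek). */
int backpatch_length(int fd, off_t field_offset, uint32_t length)
{
    const uint8_t be[4] = {
        (uint8_t)(length >> 24), (uint8_t)(length >> 16),
        (uint8_t)(length >> 8),  (uint8_t)length
    };

    assert(fd >= 0);
    if (lseek(fd, field_offset, SEEK_SET) == (off_t)-1)
        return -1;                      /* ESPIPE on a pipe or FIFO */
    if (write(fd, be, sizeof be) != (ssize_t)sizeof be)
        return -1;
    /* Return to the end of the file so that appending can continue. */
    return lseek(fd, 0, SEEK_END) == (off_t)-1 ? -1 : 0;
}
```

On a regular file both calls succeed; on either end of a pipe, fd_is_seekable returns false and backpatch_length fails before writing anything.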

The easiest way to read data from a file that is still being appended to is to run a loop that reads newly appended data and ignores the end-of-file condition until writing to the output file is complete.
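A minimal sketch of such a loop in C is shown here. It is not the code used in Together: the names open_for_tailing, tail_read and chunk_handler, the 20 ms polling interval, and the atomic flag that signals the end of recording are all assumptions made for this illustration.

```c
#define _POSIX_C_SOURCE 200809L   /* for nanosleep() */
#include <assert.h>
#include <fcntl.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical consumer of freshly read bytes, e.g. a demuxer input queue. */
typedef void (*chunk_handler)(const unsigned char *data, size_t len, void *ctx);

/* The reader opens its own descriptor, independent of the writer's. */
int open_for_tailing(const char *path)
{
    assert(path != NULL);
    return open(path, O_RDONLY);
}

/* Reads from fd until *writer_done is set AND no unread bytes remain.
 * A read() result of 0 (end of file) is not treated as the end of the
 * stream while the writer is still active: the loop pauses and retries.
 * on_chunk may be NULL, in which case the data is only counted.
 * Returns the number of bytes delivered, or -1 on a read error. */
long long tail_read(int fd, atomic_bool *writer_done,
                    chunk_handler on_chunk, void *ctx)
{
    unsigned char buf[4096];
    const struct timespec poll_delay = { 0, 20 * 1000 * 1000 };  /* 20 ms */
    long long total = 0;

    assert(fd >= 0 && writer_done != NULL);
    for (;;) {
        /* Sample the flag BEFORE reading, so that data appended just
         * before the writer finished is still picked up. */
        bool done = atomic_load(writer_done);
        ssize_t n = read(fd, buf, sizeof buf);

        if (n < 0)
            return -1;
        if (n > 0) {
            if (on_chunk)
                on_chunk(buf, (size_t)n, ctx);
            total += n;
            continue;
        }
        if (done)                 /* end of file and nothing more to come */
            break;
        nanosleep(&poll_delay, NULL);
    }
    return total;
}
```

The recording code would set the flag once the writer has finished the file; until then the loop keeps waking up every 20 ms whether or not anything was written.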

Such an approach is not very efficient in terms of resource utilization, as the application does not know when new data has been appended to the file and has to access the file constantly, regardless of whether new data has appeared or not. This downside can be overcome by using dispatch sources (the dispatch_source API) from GCD, Apple's Grand Central Dispatch library, which can notify the application asynchronously when the file has new data.

The resulting data arrives in the same format in which iOS saves it, i.e. in the MOV or MP4 container format.

The MOV video container format (also known as the QuickTime File Format) is used in most of Apple’s products. Its specification was first published in 2001; based on it, the ISO has standardized its MP4 format, which is more widely used today. MOV and MP4 are very similar and differ only in some small details. The video file is logically divided into separate hierarchically nested parts called atoms. The MOV specification describes about fifty different types of atoms, each of which is responsible for storing specific information. Sometimes, to correctly interpret the contents of an atom you first have to read another atom. For example, a video stream encoded with the H.264 codec is stored in the mdat atom; to correctly display it in the player, you first need to read the compression parameters from the avcC atom, the frame timestamps from the ctts atom, the boundaries of individual frames within mdat from the stbl atom, etc.
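To make the atom structure more concrete, here is a small sketch in C that walks the top-level atoms of a buffer. It relies only on the basic header layout: a 4-byte big-endian size, a 4-byte type code, and a 64-bit extended size when the size field equals 1. The function and type names are invented for this example, and a real parser would also descend into nested atoms such as moov.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef struct {
    char     type[5];     /* four-character code, NUL-terminated */
    uint64_t size;        /* total atom size, header included */
    size_t   header_len;  /* 8, or 16 when the 64-bit size is used */
} atom_header;

static uint32_t read_be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

static uint64_t read_be64(const uint8_t *p)
{
    return ((uint64_t)read_be32(p) << 32) | read_be32(p + 4);
}

/* Parses the atom header at the start of buf.
 * Returns 0 on success, -1 if the buffer is too short or the size is invalid. */
int parse_atom_header(const uint8_t *buf, size_t len, atom_header *out)
{
    uint64_t size;

    assert(buf != NULL && out != NULL);
    if (len < 8)
        return -1;
    size = read_be32(buf);
    memcpy(out->type, buf + 4, 4);
    out->type[4] = '\0';
    out->header_len = 8;

    if (size == 1) {              /* 64-bit extended size follows the type */
        if (len < 16)
            return -1;
        size = read_be64(buf + 8);
        out->header_len = 16;
    } else if (size == 0) {       /* atom runs to the end of the data */
        size = len;
    }
    if (size < out->header_len)
        return -1;
    out->size = size;
    return 0;
}

/* Counts the top-level atoms that are completely present in buf and
 * stores the number of bytes they occupy in *consumed. */
size_t count_complete_atoms(const uint8_t *buf, size_t len, size_t *consumed)
{
    size_t off = 0, n = 0;
    atom_header h;

    while (parse_atom_header(buf + off, len - off, &h) == 0 &&
           h.size <= len - off) {
        off += (size_t)h.size;
        n++;
    }
    if (consumed)
        *consumed = off;
    return n;
}
```

When the buffer holds the beginning of a file that is still being written, the walk stops at the first atom whose declared size exceeds the bytes available so far, which is typically the growing mdat.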

Most players cannot play back a video stream as it arrives from an incomplete MOV file. Moreover, to optimize the transmission of video to the server and its storage, the video stream should be converted into another format or at least split into individual packets. The task of video transcoding can be delegated either to the server or to the client. In the Together camera, we have chosen the second option, as transcoding is not likely to require much computing power, but can, at the same time, simplify and offload the server architecture. The application immediately transcodes the stream into an MPEG TS container and cuts it into 8-second segments, which can be easily transmitted to the server in a simple HTTP POST request with a multipart/form-data body. Such segments can be used immediately, without any further processing, to build a broadcast playlist for HTTP Live Streaming.
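To illustrate the last point, here is a rough sketch in C of how a playlist for such segments could be generated. The segment naming scheme (a prefix followed by a sequence number and the .ts extension) and the function name build_playlist are assumptions made for this example; the post does not describe how the Together server actually builds its playlists.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Writes an HTTP Live Streaming media playlist into out.
 * Segments are assumed to be named <prefix><sequence>.ts and to have
 * the same duration. Pass ended = 0 while the broadcast is still live.
 * Returns the playlist length, or -1 if the buffer is too small. */
int build_playlist(char *out, size_t cap, const char *prefix,
                   unsigned first_seq, unsigned count,
                   unsigned seg_seconds, int ended)
{
    size_t len;
    int n;

    assert(out != NULL && prefix != NULL && strlen(prefix) > 0);

    n = snprintf(out, cap,
                 "#EXTM3U\n"
                 "#EXT-X-VERSION:3\n"
                 "#EXT-X-TARGETDURATION:%u\n"
                 "#EXT-X-MEDIA-SEQUENCE:%u\n",
                 seg_seconds, first_seq);
    if (n < 0 || (size_t)n >= cap)
        return -1;
    len = (size_t)n;

    for (unsigned i = 0; i < count; i++) {
        n = snprintf(out + len, cap - len, "#EXTINF:%u.000,\n%s%u.ts\n",
                     seg_seconds, prefix, first_seq + i);
        if (n < 0 || (size_t)n >= cap - len)
            return -1;
        len += (size_t)n;
    }

    if (ended) {                  /* no more segments will be added */
        n = snprintf(out + len, cap - len, "#EXT-X-ENDLIST\n");
        if (n < 0 || (size_t)n >= cap - len)
            return -1;
        len += (size_t)n;
    }
    return (int)len;
}
```

For a live broadcast the playlist would be regenerated each time a new segment arrives, and the end marker would be added only after the last segment.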

The structure of atoms inside an ordinary MOV file makes it impossible to use this format for streaming. To decode any part of a MOV file, the complete file must be available, because decoding-critical information is stored at the end of the file. As a workaround, an extension to the MP4 format was proposed that allows recording a MOV file consisting of multiple fragments, each of which contains its own block of video stream metadata. To make AVAssetWriter write a fragmented MOV, it is sufficient to set the movieFragmentInterval property, which specifies the fragment length.

Fragmented MP4 (fMP4) is used in streaming protocols, such as Microsoft Smooth Streaming, Adobe HTTP Dynamic Streaming, and MPEG-DASH. In Apple’s HTTP Live Streaming this purpose is served by the MPEG TS stream broken into separate files called segments.
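An MPEG TS segment is simply a sequence of 188-byte packets, each starting with the sync byte 0x47 and carrying a 13-bit packet identifier (PID) in its header. As a rough illustration, the following C sketch checks that a buffer looks like a well-aligned segment before it is uploaded; the function names are made up for this example and are not part of FFmpeg or of the Together code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum { TS_PACKET_SIZE = 188, TS_SYNC_BYTE = 0x47 };

/* True if buf consists of whole 188-byte packets that all start
 * with the sync byte. */
bool ts_segment_is_aligned(const uint8_t *buf, size_t len)
{
    assert(buf != NULL);
    if (len == 0 || len % TS_PACKET_SIZE != 0)
        return false;
    for (size_t off = 0; off < len; off += TS_PACKET_SIZE) {
        if (buf[off] != TS_SYNC_BYTE)
            return false;
    }
    return true;
}

/* Extracts the 13-bit PID from the header of one packet:
 * the low 5 bits of byte 1 followed by all of byte 2. */
uint16_t ts_packet_pid(const uint8_t *pkt)
{
    assert(pkt != NULL);
    return (uint16_t)(((pkt[1] & 0x1F) << 8) | pkt[2]);
}
```

Because every packet is self-delimiting in this way, segments can be cut on packet boundaries and joined again by simple concatenation.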

Before iOS 7, there was another, more sophisticated way to read an incomplete MOV file that did not require MOV fragmentation. You could parse the contents of mdat, identifying individual NALUs (H.264 codec data blocks) and AAC buffers. Each NALU and AAC buffer in the output file corresponded to exactly one input sample buffer and strictly followed the recording order. Because of this, you could easily establish the correspondence between a NALU, a frame, and a frame timestamp. This information was sufficient to decode the video stream. In iOS 7, this clear correspondence was lost. Now each input sample buffer can produce one or more NALUs, and it is impossible to tell how many of them belong to each particular frame.
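For illustration, here is a sketch in C of how NALUs can be picked out of such data. It assumes that every NALU is preceded by a 4-byte big-endian length, which is a common layout in MOV/MP4 files; the actual prefix size is declared in the avcC atom and should be read from there. The names nalu and split_length_prefixed are hypothetical, and this is not the parser used in Together.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    const uint8_t *data;  /* first byte of the NALU (its header byte) */
    size_t         len;   /* NALU length without the length prefix */
    uint8_t        type;  /* nal_unit_type: 1 = non-IDR slice, 5 = IDR slice,
                             7 = SPS, 8 = PPS */
} nalu;

/* Splits a buffer of length-prefixed NALUs (4-byte prefix assumed).
 * Fills at most max entries of out and returns how many were stored,
 * or -1 if the data is truncated or malformed. */
int split_length_prefixed(const uint8_t *buf, size_t len, nalu *out, int max)
{
    size_t off = 0;
    int n = 0;

    assert(buf != NULL && out != NULL && max > 0);
    while (off < len && n < max) {
        uint32_t sz;

        if (len - off < 4)
            return -1;
        sz = ((uint32_t)buf[off] << 24) | ((uint32_t)buf[off + 1] << 16) |
             ((uint32_t)buf[off + 2] << 8) | (uint32_t)buf[off + 3];
        off += 4;
        if (sz == 0 || sz > len - off)
            return -1;

        out[n].data = buf + off;
        out[n].len  = sz;
        out[n].type = buf[off] & 0x1F;   /* low 5 bits of the header byte */
        n++;
        off += sz;
    }
    return n;
}
```

The nal_unit_type alone says what kind of data a NALU carries, but not which frame it belongs to, which is why the one-to-one mapping described above mattered.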

For video stream transcoding we used the most obvious solution: the open-source FFmpeg libraries. With the FFmpeg libraries, we solved the problem of parsing a fragmented MOV file and repackaging its packets into an MPEG TS container. FFmpeg can relatively easily parse a file in any format it supports, convert it to another format, and even transmit it over the network via any of the supported protocols. For example, for the purposes of live streaming you can use FLV as the output format, and RTMP or RTP as the protocol.

To connect FFmpeg to an iOS application, you need to compile static FFmpeg libraries for each target architecture, add them to your project in Xcode, then open the build settings and add the path to the FFmpeg header files to the Header Search Paths option. To cross-compile the libraries for iOS, run the configure script before building FFmpeg, specifying the path to the iOS SDK and the set of functionality to be included in the build. The easiest way is to download one of the many ready-made build scripts available on the Internet and customize it to your needs.

However, FFmpeg does not fully support the model we had to implement in Together. To keep reading from the recorded file from being interrupted at the end of the file, we wrote a protocol module for FFmpeg called pipelike. To slice segments for HTTP LS, we took one of the earliest versions of the hlsenc module from libav as a basis and revised it, fixing bugs and adding the ability to pass the output data and the main module’s events directly to other parts of the application through callbacks.

The resulting solution has the following advantages and disadvantages:

Advantages:

Optimal battery consumption, in no way inferior to that of Apple’s standard applications. This has been attained by utilizing the platform’s hardware resources.

We have made maximum use of the standard iOS SDK, with no private APIs, which makes the solution fully App Store compatible.

Total: 60 seconds. Other delays in the Together live streaming are caused by the specifics of implementation:

The reason behind such a delay is that the system starts reading the next segment only after the previous segment has been fully sent.

Delay due to the time elapsed from starting to write the file to streaming launch by the user (the file is always streamed from its beginning).

Most of the factors affecting the delay are due to the HTTP LS protocol. Most live streaming applications use protocols such as RTMP or RTP, which are better suited to streaming with a minimum delay. We also looked into a few other applications with a live streaming function, such as Skype and Ustream. Skype has a minimum delay of about one second while utilizing 50% of the CPU. This brings us to the conclusion that it uses a proprietary protocol and algorithm to compress video data. Ustream uses RTMP, has a delay of 10 seconds or less, and generates minimal CPU utilization, just as with hardware video encoding.

All in all, we have achieved acceptable live streaming that satisfies the requirements of the Together Video Camera.

Here are some other developments that use a similar approach to live streaming:

Livu is an application that can stream video from an iOS device to an RTP server. As it is not a full-scale streaming service, you’ll have to specify your own video server. On GitHub you can find a pretty old version of the app’s streaming component. Livu’s developer Steve McFarlin often answers questions on Stack Overflow about video app development.

12 thoughts on “How to Live Stream Video as You Shoot It in iOS”

Another approach we use at Dailymotion is to create small (3-seconds long) regular MP4 files (switching output files from the AVAssetWriter within a transaction), and posting them to our standard upload HTTP servers (with extra HTTP headers signaling to denote fragments positions within the series of POSTs).

At the server side, a small piece of C software will demux each fragment and re-create/maintain an RTMP session to our streaming servers (as if the initial stream was pushed from an FMLE instance).

With this approach, we don’t have to craft complicated code on the device itself (MP4->fMP4 or MP4->MPEG-TS remux), and implement it for each and every new platform (Android in Java, Windows Phone in C#, …). All mobile platforms are capable of generating regular MP4 files on the local filesystem.

In addition on iPhone, we use 2 local AVAssetWriter instances with different encoding profiles (one with a 3G-compliant bandwidth, one with a WiFi-compliant bandwidth) and keep all the encoded fragments locally for both versions. We analyze the time spent to send each fragment and choose to send one or the other quality based on available instantaneous upstream bandwidth (which will probably fluctuate if the device switches from 3G to WiFi or vice-versa during the live session).

We also keep all the received fragments at the server side, and at the end of the live session, a special HTTP request is used to “find” which fragment is missing (or which fragment was sent with a “low” quality and could be replaced with a better quality one). This part is optional, but if opted-in by the user, will lead to a perfect recorded version of the live session, even if transmission issues occurred during the live event (it’s almost always the case when you are in a mobility scenario with a smartphone), at the expense of re-sending some fragments.

I think this is quite similar to what Bambuser does (from a feature PoV, I didn’t check their implementation in details).

Thank you for sharing your approach! I think it could be very interesting for others.

We decided to do it on the client side in order to make the server side as simple as possible (and avoid video sessions on the server side). Also, we had some issues with losing frames at the edges of several-second chunks.

Hi pyke, I also tried a similar approach, but when merging the mp4 videos back at the server, I see that the merged video is jittery [when merged with ffmpeg]. Can you provide some pointers on how you merged the videos? Thanks

That might be a specific mp4 issue. I think we have even experienced something similar. ffmpeg and many other encoders put something in the beginning of MP4 file, and the first frame timestamp isn’t 0 usually. So, when you re-transcode them into one mp4 file the switches become jittery. That’s why in case of TS files (which could be spliced by a simple concatenation, no transcoding required) it is very important to have a keyframe as the first frame of every TS chunk.

Hi Eugene, thanks for the response. My files are recorded on iPhone. So I should record in TS, then merge them, and then transcode them to mp4 format? I also need a local mp4 copy for playback and sharing. Thank you for your help.

Hi pb, mp4 and ts are containers, so each is basically a wrapper around chunks of compressed streams (video/audio). So transcoding TS into MP4 is not reasonable – you can always create mp4 instead of TS in the first place.

The modern approach differs slightly from the one described in this article. Nowadays there is no need to parse files, since iOS is able to output raw compressed h264 chunks (as well as aac for audio), so everything can be performed on the fly.

To combine files (TS/MP4/etc) from h264 chunks I can recommend the bento4 library – https://www.bento4.com/ – which was created specifically for manipulating mp4.

ffmpeg can also be used here (without the need to create files before constructing the final file).

Hi IPv6. Not sure if I need another library to merge the mp4 files. AVFoundation allows me to do that using compositions. My problem is that the merged mp4 videos have glitches at the point of merge. I know I’m not missing any frames while recording, as the DidDrop delegate never gets called. At this moment I’m not even streaming. Just recording incoming frames from the camera into short files (1MB each), then merging them back to create one single file. Thank you for your help.

Hi nfsio, IPv6 is writing a new article on the topic where this question would be addressed (with a lot of demos). It should be published by the end of this week. So, stay tuned! And thank you for reading our blog! Would you like to be notified by an email when it is ready?