Real-time Transport Protocol
The Real-time Transport Protocol (RTP) is an IETF-standardized network protocol that provides end-to-end transport functions suitable for applications transmitting real-time data, such as audio, video, or simulation data, over multicast or unicast IP network services.[1] RTP does not guarantee quality of service or address resource reservation but focuses on efficient, low-latency delivery of time-sensitive payloads, typically layered over UDP for minimal overhead.[1] It augments data transport with the companion RTP Control Protocol (RTCP), which enables scalable monitoring of delivery quality, minimal control functions, and participant identification in large networks.[1]

RTP was first specified in RFC 1889 in January 1996 as a foundational protocol for real-time multimedia over the Internet. This initial version addressed the growing need for standardized transport in emerging applications like audio and video conferencing amid the expansion of IP networks. RFC 1889 was obsoleted by RFC 3550 in July 2003, which incorporated refinements for broader applicability, including better support for secure extensions and interoperability across diverse systems.[1] These updates maintained RTP's core design independence from underlying transport layers while enhancing its robustness for modern multicast environments.[1]

At its core, RTP includes several key features to handle real-time constraints: sequence numbers to detect lost packets and restore the order of delayed ones, timestamps for synchronizing media playback across variable network conditions, and payload type fields to identify the media format carried in each packet.[1] RTCP complements these by periodically sending control packets that report statistics like packet loss, jitter, and round-trip delay, allowing senders to adapt transmission rates and receivers to assess session health.[1] Profiles and payload formats, defined in companion RFCs, further customize RTP for specific media types, ensuring flexibility without altering the base protocol.[1]

RTP underpins a wide array of real-time communication systems, including Voice over IP (VoIP) telephony, video teleconferencing, IPTV broadcasting, and interactive streaming services.[1] It forms a critical component of standards like WebRTC for browser-based peer-to-peer media exchange, enabling low-latency audio and video in web applications.[2] Extensions such as Secure RTP (SRTP) have also emerged to add encryption and authentication, addressing security needs in sensitive deployments like secure video calls.

Introduction
Purpose and Design Goals
The Real-time Transport Protocol (RTP) is designed to facilitate the end-to-end delivery of real-time data, such as audio, video, or simulation data, over IP networks using either multicast or unicast services.[1] Its primary objective is to support applications where timeliness and low latency are paramount, prioritizing the prompt transmission of packets over guaranteed delivery or error correction, as delays or jitter can severely degrade the quality of interactive media streams.[3] This approach accepts a degree of packet loss, which is tolerable for real-time applications, rather than incurring retransmission delays that could disrupt the flow of continuous data.[4]

Key design principles of RTP include the use of timestamps to enable synchronization across media streams, sequence numbers to restore packet order and detect losses, and payload type identification to allow switching between codecs during a session without interrupting the flow.[4] These mechanisms provide essential metadata for receivers to reconstruct and play back media correctly, while keeping the protocol lightweight to minimize overhead.
RTP operates primarily over the User Datagram Protocol (UDP), leveraging its multiplexing and checksum capabilities while avoiding the head-of-line blocking inherent in TCP, which could introduce unacceptable delays in variable network conditions.[3] In some scenarios RTP is carried over TCP instead, or is secured with keys negotiated through Datagram Transport Layer Security (DTLS), but UDP remains the standard choice because it favors speed over reliability.[1]

In contrast to non-real-time protocols like TCP, which offer robust reliability through acknowledgments and retransmissions, RTP incorporates only minimal reliability features, delegating quality-of-service (QoS) enhancements to underlying networks or higher-layer applications.[3] A core architectural concept is the separation of the data plane (handled by RTP for media payload transfer) from the control plane (managed by the RTP Control Protocol, or RTCP, for feedback and monitoring), which enhances scalability in large multicast sessions by allowing RTCP messages to be transmitted at lower rates without impacting data throughput.[1] This design enables RTP to scale from small conferences to thousands of participants, supporting efficient resource use in diverse network environments.[5]

Core Protocol Mechanics
RTP Data Transfer
The Real-time Transport Protocol (RTP) encapsulates media data, such as audio or video streams, into discrete packets for real-time transmission over IP networks. This packetization process involves dividing the continuous media stream into units, typically aligned with the media's encoding frame or sample boundaries, and prepending a standardized RTP header to each unit. The header includes a 32-bit synchronization source (SSRC) identifier, which is a randomly chosen value unique to each stream source within a session, enabling receivers to distinguish and synchronize multiple concurrent streams, such as in multicast scenarios.[6]

A key component of RTP data transfer is the 16-bit sequence number field in the header, which increments by one for each successive RTP packet sent from a given SSRC, regardless of payload type or content. This numbering allows receivers to detect packet loss, duplication, or reordering caused by network variability, facilitating reconstruction of the original stream order. At the receiver, a jitter buffer uses the sequence numbers to smooth out arrival delays, holding packets briefly to reorder them and minimize playout disruptions, which is essential for maintaining real-time media quality.[7]

Timestamps in RTP provide precise synchronization for media rendering, using a 32-bit field that indicates the sampling instant of the first octet of the payload data. Unlike sequence numbers, timestamps advance based on the media's clock rate rather than packet count; for instance, in audio sampled at 8000 Hz (as with G.711), a 20 ms packet containing 160 samples increments the timestamp by 160. This scaling ensures accurate playout timing across varying network paths, compensating for jitter without assuming constant packet intervals.[7]

The 7-bit payload type field identifies the media format and codec within the RTP header, allowing flexible negotiation between sender and receiver.
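
The sequence-number and timestamp rules described above can be illustrated with a short Python sketch. The function names are hypothetical, and the choice to treat a forward jump of 32768 or more as a reordered packet is an assumption made here for illustration, not something the protocol text mandates. The first call reproduces the G.711 figure from the text (8000 Hz clock, 20 ms packets).

```python
def timestamp_increment(clock_rate_hz: int, packet_ms: int) -> int:
    """RTP timestamp ticks spanned by one packet of `packet_ms` milliseconds."""
    return clock_rate_hz * packet_ms // 1000


def packets_lost_between(prev_seq: int, seq: int) -> int:
    """Count packets missing between two received 16-bit sequence numbers.

    The subtraction is done modulo 65536 so that 65535 -> 0 counts as
    consecutive. Assumption: a forward jump of 32768 or more is treated as a
    late (reordered) packet rather than as a very large loss burst.
    """
    delta = (seq - prev_seq) & 0xFFFF
    if delta == 0 or delta >= 0x8000:
        return 0  # duplicate or out-of-order arrival, nothing newly missing
    return delta - 1


if __name__ == "__main__":
    print(timestamp_increment(8000, 20))   # 160 ticks, as in the G.711 example
    print(packets_lost_between(65535, 0))  # 0: wraparound, no loss
    print(packets_lost_between(10, 13))    # 2: packets 11 and 12 are missing
```
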
Static assignments cover common types, such as 0 for G.711 mu-law audio, while dynamic values from 96 to 127 accommodate newer codecs like H.264 video, whose payload formats are specified separately. This field enables seamless switching between formats during a session without altering the underlying transport.

RTP packets are transmitted over User Datagram Protocol (UDP) datagrams, supporting unicast for point-to-point delivery, multicast for efficient group communication, or broadcast for network-wide distribution. By convention, RTP uses even-numbered UDP ports (e.g., 5004), with the associated control protocol on the next odd port (e.g., 5005), simplifying port pairing in implementations. This UDP-based approach prioritizes low latency over reliability, as RTP relies on application-layer mechanisms for any necessary retransmission.[8]

RTCP Feedback Mechanism
The RTP Control Protocol (RTCP) serves as an out-of-band companion to RTP, delivering periodic control information that enables participants in a multimedia session to monitor the quality of service (QoS) for transmitted data streams.[3] Specifically, RTCP facilitates the exchange of sender and receiver reports containing key QoS metrics, such as packet loss fraction, interarrival jitter, and round-trip time (RTT) estimates, which help applications adapt to network conditions and diagnose issues like congestion or faults.[3] These reports are essential for real-time applications, as they provide insights into transmission quality without interfering with the primary RTP data flow.[3]

RTCP employs several core packet types to convey this feedback. The Sender Report (SR) is transmitted by active senders and includes statistics on packets sent and octets sent, along with an NTP timestamp for clock synchronization, allowing receivers to correlate RTP timestamps with absolute time.[3] In contrast, the Receiver Report (RR) is sent by participants that are not active senders; the same reception report blocks also appear inside SR packets. Each block reports reception statistics such as the fraction of packets lost, cumulative packets lost, highest sequence number received, and an interarrival jitter estimate.[3] Additionally, the Source Description (SDES) packet provides identification and descriptive information about session participants, including mandatory canonical names (CNAME) that uniquely identify sources across sessions, along with optional items like name, email, or location.[3]

To ensure efficient use of network resources, RTCP transmission is scheduled under bandwidth constraints. The protocol recommends allocating no more than 5% of the total session bandwidth to RTCP traffic, with approximately one-quarter of that reserved for senders and the remainder for receivers, preventing control packets from overwhelming the media data.[3] Intervals between RTCP packets are calculated dynamically based on session size, participant roles, and recent reporting activity, incorporating randomization to desynchronize transmissions and avoid bursty network load during simultaneous sends.[3] This approach scales gracefully for large multicast sessions, where the interval grows with the number of participants to maintain the bandwidth limit.[3]

Scalability is further enhanced through compound RTCP packets, which bundle multiple RTCP packet types (such as an SR or RR followed by SDES) into a single underlying protocol datagram, reducing header overhead and ensuring that related feedback is delivered together.[3] For custom needs, the Application-defined (APP) packet type allows extensions specific to particular applications, carrying subtype and name fields to define proprietary feedback while adhering to RTCP's overall structure.[3]

A critical component of RTCP feedback is the estimation of interarrival jitter, which quantifies variations in packet arrival times due to network congestion or routing differences. The jitter value J is updated on each packet arrival using the formula:

$$J_i = J_{i-1} + \frac{|D_i| - J_{i-1}}{16}$$

where $D_i = (R_i - R_{i-1}) - (S_i - S_{i-1})$ is the difference between the spacing of two consecutive packets at the receiver and their spacing at the sender, with R denoting the arrival time expressed in RTP timestamp units and S the RTP timestamp carried in the packet.[3] This smoothed estimate, reported in the RR or SR, aids in assessing stream stability and is standardized across receivers for consistent comparison.[3]

Packet Formats
RTP Header Structure
The RTP header is a fixed 12-byte structure that precedes the payload in RTP data packets, enabling synchronization, identification, and multiplexing of media streams. It begins with a 1-byte field containing the version number (2 bits, currently set to 2 for compatibility with the RTP specification), a padding flag (P bit, 1 bit, indicating optional padding bytes at the end of the packet to align the payload), an extension flag (X bit, 1 bit, signaling the presence of an optional extension header), and a CSRC count (CC field, 4 bits, specifying the number of contributing sources listed in the packet). The second byte includes the marker bit (M, 1 bit, used to indicate frame boundaries or other significant events in the media stream, as defined by the profile) followed by the payload type (PT, 7 bits, identifying the format of the media data, such as audio or video codec). This is followed by a 16-bit sequence number for detecting packet loss and reordering, a 32-bit timestamp providing monotonic clock progression for synchronization (often derived from the sampling rate of the media), and a 32-bit synchronization source identifier (SSRC) uniquely identifying the source of the stream within the RTP session. If the CC field is greater than zero, the header is extended by a list of up to 15 CSRC identifiers (each 32 bits), which identify the contributing sources in scenarios involving mixers or translators that combine multiple streams into one. When the X bit is set, an optional extension header follows the CSRC list (or fixed header if no CSRCs), consisting of a 16-bit profile-specific identifier, a 16-bit length field indicating the extension length in 32-bit words, and the extension data itself, allowing for profile-defined additional information without altering the base header. 
The version field has been 2 since RFC 1889 and was retained by RFC 3550 in 2003, ensuring backward compatibility with earlier RTP implementations while supporting the protocol's core functionality for real-time applications.

| Field | Size (bits) | Description |
|---|---|---|
| Version (V) | 2 | Protocol version (2). |
| Padding (P) | 1 | Indicates padding at packet end. |
| Extension (X) | 1 | Signals optional extension header. |
| CSRC Count (CC) | 4 | Number of CSRC identifiers (0-15). |
| Marker (M) | 1 | Marks significant events (profile-specific). |
| Payload Type (PT) | 7 | Identifies media format. |
| Sequence Number | 16 | Packet ordering and loss detection. |
| Timestamp | 32 | Synchronization clock value. |
| SSRC | 32 | Source stream identifier. |
| CSRC List (optional) | 32 each (up to 15) | Contributing source IDs. |
| Extension Header (optional) | Variable (min 32) | Profile-specific extensions. |
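
As a rough illustration of the layout in the table above, the following Python sketch decodes the fixed header, the optional CSRC list, and the optional extension header from a raw packet. The class and function names are hypothetical, and padding removal is deliberately left out; it is a minimal sketch of the field layout rather than a complete RTP parser.

```python
import struct
from dataclasses import dataclass
from typing import List


@dataclass
class RtpHeader:
    version: int
    padding: bool
    extension: bool
    marker: bool
    payload_type: int
    sequence_number: int
    timestamp: int
    ssrc: int
    csrcs: List[int]
    header_length: int  # bytes before the payload, incl. CSRCs and extension


def parse_rtp_header(packet: bytes) -> RtpHeader:
    """Decode the RTP header fields listed in the table (network byte order)."""
    if len(packet) < 12:
        raise ValueError("packet shorter than the 12-byte fixed header")
    b0, b1, seq, ts, ssrc = struct.unpack_from("!BBHII", packet, 0)
    version = b0 >> 6
    if version != 2:
        raise ValueError(f"unsupported RTP version {version}")
    cc = b0 & 0x0F                      # CSRC count, 0-15
    offset = 12 + 4 * cc
    if len(packet) < offset:
        raise ValueError("truncated CSRC list")
    csrcs = [struct.unpack_from("!I", packet, 12 + 4 * i)[0] for i in range(cc)]
    extension = bool(b0 & 0x10)
    if extension:
        if len(packet) < offset + 4:
            raise ValueError("truncated extension header")
        # 16-bit profile-defined id, then extension length in 32-bit words
        _profile_id, ext_words = struct.unpack_from("!HH", packet, offset)
        offset += 4 + 4 * ext_words
        if len(packet) < offset:
            raise ValueError("truncated extension data")
    return RtpHeader(
        version=version,
        padding=bool(b0 & 0x20),
        extension=extension,
        marker=bool(b1 & 0x80),
        payload_type=b1 & 0x7F,
        sequence_number=seq,
        timestamp=ts,
        ssrc=ssrc,
        csrcs=csrcs,
        header_length=offset,
    )
```
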
RTCP Packet Types
RTCP packets share a common fixed header that precedes type-specific data, enabling receivers to identify the packet type and length. This header is 4 bytes long and consists of the version field (2 bits, set to 2 for RTP/RTCP), a padding bit (1 bit, indicating if the packet contains padding to align to a 32-bit boundary), a 5-bit count or subtype field whose meaning varies by packet type (reception report count for SR/RR, source count for SDES and BYE, subtype for APP), a packet type field (PT, 8 bits: 200 for SR, 201 for RR, 202 for SDES, 203 for BYE, 204 for APP), and a length field (16 bits, representing the number of 32-bit words in the packet minus one).[9]

The Sender Report (SR) packet provides transmission and reception statistics from a sender, starting after the common header with the sender's SSRC identifier (32 bits), followed by an NTP timestamp (64 bits for wall-clock time), an RTP timestamp (32 bits, corresponding to the same instant as the NTP timestamp but expressed in the units of the RTP media clock), the sender's packet count (32 bits, total RTP data packets sent), and the sender's octet count (32 bits, total RTP payload octets sent). It then includes zero or more reception report blocks (each 24 bytes: SSRC of the reported source, fraction lost, cumulative packets lost, extended highest sequence number, interarrival jitter, last SR timestamp, and delay since last SR). SR packets are sent periodically by active senders to synchronize streams and report quality.[10]

Receiver Report (RR) packets convey reception quality feedback from participants that are not active senders, following the common header with the reporter's SSRC (32 bits) and up to 31 report blocks identical to those in SR packets.
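
The interarrival jitter field carried in each reception report block is a running estimate maintained by the receiver. A minimal Python sketch of that estimator follows, using the RFC 3550 form J = J + (|D| - J)/16, where D is the change in relative transit time between consecutive packets. The class name is hypothetical, arrival times are assumed to be already converted to RTP timestamp units, and 32-bit timestamp wraparound is ignored for brevity.

```python
from typing import Optional


class JitterEstimator:
    """Running interarrival jitter estimate, in RTP timestamp units."""

    def __init__(self) -> None:
        self.jitter = 0.0
        self._prev_transit: Optional[int] = None

    def update(self, arrival_ts: int, rtp_ts: int) -> float:
        """Feed one packet: its arrival time and its RTP timestamp.

        D = (R_i - R_{i-1}) - (S_i - S_{i-1}) equals the difference between
        the two packets' transit times (R - S), which is what is tracked here.
        """
        transit = arrival_ts - rtp_ts
        if self._prev_transit is not None:
            d = abs(transit - self._prev_transit)
            self.jitter += (d - self.jitter) / 16.0
        self._prev_transit = transit
        return self.jitter


if __name__ == "__main__":
    est = JitterEstimator()
    # (arrival, rtp timestamp) pairs for 20 ms packets on an 8000 Hz clock
    for arrival, ts in [(1000, 0), (1160, 160), (1340, 320), (1480, 480)]:
        print(est.update(arrival, ts))
```
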
An empty RR packet (RC=0) must head a compound packet when there is no data transmission or reception to report, ensuring minimal overhead while keeping the compound packet structure consistent.[11]

Source Description (SDES) packets carry textual information about participants for identification and statistics display, beginning after the common header with one or more chunks: each chunk starts with an SSRC or CSRC identifier (32 bits), followed by zero or more items (each with a 1-byte type, 1-byte length, and variable-length text value). The Canonical Name (CNAME) item is mandatory in each compound RTCP packet (except when a compound packet is split for partial encryption), providing a unique, persistent identifier in "user@host" form or a bare hostname to bind SSRCs across sessions without disclosing sensitive user details. Optional items include NAME (user's display name), EMAIL, PHONE, LOC (location), TOOL (software name/version), NOTE (free-form note), and PRIV (private extensions). The text of each SDES item is limited to 255 bytes; the source count in the common header gives the number of chunks, and each chunk's item list is terminated by one or more null octets that pad it to a 32-bit boundary.[12]

BYE packets signal the departure of one or more sources from the session, consisting of the common header followed by one or more SSRC/CSRC identifiers (32 bits each, up to 31 as indicated by the count field) and an optional reason-for-leaving string (preceded by its length in bytes). Multiple SSRCs allow a single packet to announce several departures, and the reason field aids in debugging or logging.
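
To make the common header and the length field concrete, the sketch below builds a BYE packet and then decodes its 4-byte common header. The function names are hypothetical; the layout follows the field descriptions above, and the reason string is null-padded to a 32-bit boundary, which is how RFC 3550 handles a reason that does not end on a word boundary.

```python
import struct
from typing import Dict, Sequence

RTCP_BYE = 203  # packet type value for BYE


def build_bye(ssrcs: Sequence[int], reason: bytes = b"") -> bytes:
    """Build an RTCP BYE packet for 1-31 sources with an optional reason."""
    if not 1 <= len(ssrcs) <= 31:
        raise ValueError("BYE carries between 1 and 31 SSRC/CSRC identifiers here")
    if len(reason) > 255:
        raise ValueError("reason length must fit in one byte")
    body = b"".join(struct.pack("!I", s) for s in ssrcs)
    if reason:
        body += bytes([len(reason)]) + reason
        body += b"\x00" * (-len(body) % 4)  # null-pad to a 32-bit boundary
    first = (2 << 6) | len(ssrcs)           # V=2, P=0, source count
    length_words = (4 + len(body)) // 4 - 1  # 32-bit words minus one
    return struct.pack("!BBH", first, RTCP_BYE, length_words) + body


def parse_common_header(packet: bytes) -> Dict[str, int]:
    """Decode the 4-byte RTCP common header."""
    first, packet_type, length_words = struct.unpack_from("!BBH", packet, 0)
    return {
        "version": first >> 6,
        "padding": (first >> 5) & 1,
        "count": first & 0x1F,
        "packet_type": packet_type,
        "length_bytes": (length_words + 1) * 4,
    }


if __name__ == "__main__":
    pkt = build_bye([0x11223344], b"bye")
    print(len(pkt), parse_common_header(pkt))
```
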
BYE packets may be sent immediately upon leaving, outside the regular RTCP schedule, but follow backoff rules to avoid congestion.[13]

Application-defined (APP) packets enable custom control information beyond standard RTCP functions, with the common header's count field specifying a subtype (0-31), followed by an SSRC/CSRC (32 bits), a 4-character ASCII name (identifying the application), and application-dependent data (variable length, up to the packet's total size). APP packets support extensibility for specific applications, such as synchronization or control signals, while maintaining compatibility with the RTCP framework.[14]

To optimize bandwidth and reduce header overhead, RTCP packets are typically transmitted as compound packets within a single underlying protocol datagram (e.g., UDP), concatenating multiple individual RTCP packets without additional headers between them. A compound packet starts with an SR or RR, followed by SDES (with CNAME), and optionally BYE or APP. Requiring an SR or RR first simplifies header validation at the receiver. Encryption, if used, applies to the entire compound or splits it into encrypted and unencrypted portions, with SDES CNAME required in only one of them. This structure ensures efficient delivery of diverse feedback in real-time sessions.[9]

The length field in the common header is calculated as the total packet length in 32-bit words minus one, accommodating variable-sized contents like report blocks or SDES items while allowing padding (if the P bit is set) to reach the next 32-bit boundary; when padding is present, the packet's last byte stores the number of padding bytes.[9]

Profiles and Extensions
RTP Profiles
RTP profiles standardize the application of the Real-time Transport Protocol (RTP) and its control protocol (RTCP) for particular media types and network environments, specifying parameters such as payload types, clock rates, and default behaviors to ensure interoperability.[15] These profiles extend the core RTP specification by defining mappings for common audio and video formats while maintaining minimal control overhead for real-time applications.[15]

The Audio/Video Profile (AVP), defined in RFC 3551, serves as the foundational RTP profile for non-secure audio and video conferencing over both IPv4 and IPv6 networks.[15] It designates RTP to use even-numbered UDP ports, with the associated RTCP traffic on the subsequent odd-numbered port, and registers ports 5004 for RTP and 5005 for RTCP as conventional defaults when dynamic port assignment is unavailable.[15] This profile reserves payload types 0 through 95 for static assignments with fixed encodings and clock rates (most assigned values fall between 0 and 34), such as 8000 Hz for G.711 audio and 90000 Hz for video formats, to avoid negotiation overhead in basic sessions.[15] Payload types 96 through 127 are reserved as dynamic, requiring negotiation via protocols like the Session Description Protocol (SDP) to assign specific media formats.[15] Additionally, AVP includes provisions for registering MIME types with the Internet Assigned Numbers Authority (IANA) to associate payload types with standardized media descriptions.[15]

For secure communications, the Secure Audio/Video Profile (SAVP), specified in RFC 3711, extends AVP by integrating the Secure RTP (SRTP) mechanism to provide confidentiality and authentication without altering the core RTP packet structure.[16] It employs the same port conventions and payload type assignments as AVP but protects RTP and RTCP packets with SRTP, making it suitable for environments requiring data protection.[16]

To support more responsive error correction in real-time sessions, the AVP Feedback Profile (AVPF), outlined in RFC 4585, builds on AVP by enabling earlier transmission of RTCP feedback messages, such as Negative Acknowledgments (NACK) for packet loss recovery and Picture Loss Indication (PLI) to report the loss of video picture data.[17] This profile reduces feedback latency while adhering to the same bandwidth constraints and scalability rules as AVP, allowing immediate responses within RTCP intervals rather than waiting for periodic reports.[17] The Secure AVP Feedback Profile (SAVPF), defined in RFC 5124, combines SAVP with AVPF to deliver secure, timely feedback in encrypted sessions.[18]

Profile-specific extensions further tailor RTP usage, including standardized clock rates for timestamp generation (typically 8000 Hz for audio and 90000 Hz for video in AVP and its variants) and IANA registration of MIME types to ensure consistent payload identification across implementations.[15] These elements collectively enable profiles to adapt RTP for diverse applications while preserving its real-time efficiency.

Payload Formats and Codecs
Payload formats in the Real-time Transport Protocol (RTP) define how encoded media data from specific codecs is structured and encapsulated within RTP packets, ensuring compatibility and efficient transmission over IP networks. These formats specify the mapping of codec output to the RTP payload, including octet-level details, timestamping, and handling of codec-specific parameters such as frame boundaries and synchronization. Standardized by the Internet Engineering Task Force (IETF), payload formats are registered to avoid conflicts and enable interoperability, with static payload types (PT) assigned for well-known codecs and dynamic PTs for others negotiated during session setup.[19]

For audio codecs, common payload formats include those for G.711, which uses PT=0 for mu-law pulse-code modulation (PCM) and PT=8 for A-law at 64 kbps with an 8 kHz sampling rate, packaging 160 samples per 20 ms packet in the RTP payload. G.729 employs PT=18 for compressed speech at 8 kbps, also at 8 kHz, where each 10 ms frame is directly mapped to 10 octets in the payload, optionally including voice activity detection via Annex B. The Opus codec, whose RTP payload format is defined in RFC 7587, uses dynamic PTs (typically 96-127) for scalable audio from 6 to 510 kbps across audio bandwidths of 4-20 kHz, allowing variable frame sizes (2.5-60 ms) and multiple channels; each Opus packet begins with a table-of-contents octet that indicates the coding configuration, mono or stereo operation, and the number of frames in the packet.[20]

Video payload formats similarly map compressed frames into RTP packets.
H.261, an early video codec, uses PT=31 with a format originally specified in RFC 2032 (since obsoleted by RFC 4587), where each packet carries a whole number of macroblocks prefixed by a 4-octet header detailing start and end bit offsets, group-of-blocks number, and motion vector predictors, using a 90 kHz timestamp for synchronization.[21] For H.264/AVC, RFC 6184 outlines aggregation of multiple Network Abstraction Layer (NAL) units into single packets via Single-Time Aggregation Packets (STAP) for efficiency, while large NAL units exceeding the MTU are fragmented using Fragmentation Units (FU) with headers indicating start, end, and type.[22] VP8, detailed in RFC 7741, encapsulates frame partitions (up to 9) in payloads with a descriptor header for picture ID and temporal layer indexing, supporting scalability layers and setting the RTP marker bit on the last packet of each frame to signal frame completion.[23]

Payload formats incorporate rules tailored to media characteristics for reliable transport. Large video frames, such as those from H.264, undergo fragmentation to fit network MTU limits (typically 1200-1500 bytes), with each fragment carrying codec headers to enable reassembly without full decoding.[22] Conversely, small audio frames or multiple small NAL units may be aggregated into one RTP packet to reduce header overhead, as in STAP-A for H.264 or multi-frame Opus packets.[19][20] The RTP marker bit (M) is used in video formats to denote the final packet of a frame, whether or not it is a key frame (I-frame), aiding receivers in buffering and rendering; for example, in VP8, it is set to 1 on the last packet of each frame.[23][19]

Registration of payload formats occurs through the Internet Assigned Numbers Authority (IANA), which maintains the RTP Parameters registry for static PTs (0-95) and media types, ensuring unique identifiers and parameters like clock rates are documented in RFCs.
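
The fragmentation rule for large H.264 NAL units described above can be sketched as follows. This is an illustrative Python sketch of the FU-A layout from RFC 6184 (a one-byte FU indicator with type 28, then a one-byte FU header carrying the start and end bits and the original NAL unit type); the function name and the `max_payload` parameter are hypothetical, and RTP header construction and marker-bit handling are left out.

```python
from typing import List

FU_A_TYPE = 28  # NAL unit type value used by FU-A fragments


def fragment_nal_fu_a(nal: bytes, max_payload: int) -> List[bytes]:
    """Split one NAL unit into RTP payloads no larger than `max_payload`.

    A NAL unit that already fits is returned unchanged as a single payload.
    """
    if not nal:
        raise ValueError("empty NAL unit")
    if len(nal) <= max_payload:
        return [nal]
    if max_payload < 3:
        raise ValueError("max_payload must leave room for 2 header bytes + data")
    nal_header = nal[0]
    fu_indicator = (nal_header & 0xE0) | FU_A_TYPE  # keep F and NRI bits
    nal_type = nal_header & 0x1F
    body = nal[1:]                   # the original NAL header byte is not repeated
    chunk = max_payload - 2          # room left after FU indicator + FU header
    payloads = []
    for start in range(0, len(body), chunk):
        fu_header = nal_type
        if start == 0:
            fu_header |= 0x80        # S bit: first fragment
        if start + chunk >= len(body):
            fu_header |= 0x40        # E bit: last fragment
        payloads.append(bytes([fu_indicator, fu_header]) + body[start:start + chunk])
    return payloads


def reassemble_fu_a(payloads: List[bytes]) -> bytes:
    """Rebuild the NAL unit from in-order FU-A payloads (no loss handling)."""
    header = (payloads[0][0] & 0xE0) | (payloads[0][1] & 0x1F)
    return bytes([header]) + b"".join(p[2:] for p in payloads)
```
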
Payload types are bound to media formats during session setup via the Session Description Protocol (SDP). For example, the media line m=audio 5004 RTP/AVP 0 offers G.711 mu-law (static PT=0) on UDP port 5004 using the Audio/Video Profile (AVP); a dynamic PT would additionally need an a=rtpmap attribute naming its encoding and clock rate. The answerer selects or maps PTs to match its capabilities. This process, often within RTP profiles, allows flexible codec selection while adhering to format specifications.[20]
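
As a small illustration of how the SDP media line from the text breaks down, the Python sketch below splits it into media type, port, transport profile, and the list of offered payload types. The class and function names are hypothetical, and only the `m=` line is handled; attributes such as `a=rtpmap` are out of scope for this sketch.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MediaLine:
    media: str          # e.g. "audio" or "video"
    port: int           # transport port offered for RTP
    proto: str          # e.g. "RTP/AVP" or "RTP/SAVPF"
    formats: List[str]  # payload type numbers, in preference order


def parse_media_line(line: str) -> MediaLine:
    """Parse an SDP media description line of the form m=<media> <port> <proto> <fmt>..."""
    line = line.strip()
    if not line.startswith("m="):
        raise ValueError("not an SDP media line")
    parts = line[2:].split()
    if len(parts) < 4:
        raise ValueError("media line needs media, port, proto and at least one format")
    media, port_field, proto, *formats = parts
    port = int(port_field.split("/")[0])  # tolerate the "<port>/<count>" form
    return MediaLine(media, port, proto, formats)


def is_dynamic_payload_type(pt: int) -> bool:
    """True for the dynamic payload type range 96-127."""
    return 96 <= pt <= 127


if __name__ == "__main__":
    m = parse_media_line("m=audio 5004 RTP/AVP 0")
    print(m)
    print([is_dynamic_payload_type(int(f)) for f in m.formats])
```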