
Real-time Transport Protocol

The Real-time Transport Protocol (RTP) is an IETF-standardized network protocol that provides end-to-end transport functions for applications transmitting real-time data, such as audio, video, or simulation data, over multicast or unicast network services. RTP does not guarantee quality of service or address resource reservation; it focuses on efficient, low-latency delivery of time-sensitive payloads and is typically layered over UDP for minimal overhead. It augments data transport with the companion RTP Control Protocol (RTCP), which enables scalable monitoring of delivery quality, minimal control functions, and participant identification in large networks.

RTP was first specified in RFC 1889 in January 1996 as a foundational standard for real-time media transport over IP networks. This initial version addressed the growing need for standardized transport in emerging applications like audio and video conferencing. RFC 1889 was obsoleted by RFC 3550 in July 2003, which incorporated refinements for broader applicability, including better support for secure extensions and interoperability across diverse systems. These updates maintained RTP's core design independence from underlying transport layers while enhancing its robustness for modern environments.

At its core, RTP includes several key features to handle real-time constraints: sequence numbers to detect lost packets and restore the order of delayed ones, timestamps to synchronize media playback across variable network conditions, and payload type fields to identify the media format carried in each packet. RTCP complements these by periodically sending control packets that report statistics like packet loss, jitter, and round-trip time, allowing senders to adapt transmission rates and receivers to assess session health. Profiles and payload formats, defined in companion RFCs, further customize RTP for specific media types, ensuring flexibility without altering the base protocol.

RTP underpins a wide array of real-time communication systems, including Voice over IP (VoIP) telephony, video teleconferencing, IPTV broadcasting, and interactive streaming services. It forms a critical component of standards like WebRTC for browser-based media exchange, enabling low-latency audio and video in web applications. Extensions such as Secure RTP (SRTP) add encryption and authentication, addressing security needs in sensitive deployments like secure video calls.

Introduction

Purpose and Design Goals

The Real-time Transport Protocol (RTP) is designed to facilitate the end-to-end delivery of real-time data, such as audio, video, or simulation data, over networks using either multicast or unicast services. Its primary objective is to support applications where timeliness and low latency are paramount, prioritizing the prompt transmission of packets over guaranteed delivery or error correction, because delays or jitter can severely degrade the quality of media streams. This approach accepts a degree of packet loss, which is tolerable for real-time applications, rather than incurring retransmission delays that could disrupt the flow of continuous data.

Key design principles of RTP include timestamps to enable synchronization across media streams, sequence numbers to assist in packet ordering and to detect losses, and payload type identification to allow dynamic switching between codecs during a session without interrupting the flow. These mechanisms provide essential metadata for receivers to reconstruct and play back media correctly, while keeping the protocol lightweight to minimize overhead.

RTP operates primarily over the User Datagram Protocol (UDP), leveraging its multiplexing and checksum capabilities while avoiding the retransmission and ordering delays inherent in TCP, which could be unacceptable in variable network conditions. In some scenarios RTP may run over TCP or Datagram Transport Layer Security (DTLS) instead, but UDP remains the standard choice because it emphasizes speed over reliability. In contrast to non-real-time protocols like TCP, which offer robust reliability through acknowledgments and retransmissions, RTP incorporates only minimal reliability features, delegating quality-of-service (QoS) enhancements to underlying networks or higher-layer applications.

A core architectural concept is the separation of the data plane (handled by RTP for media payload transfer) from the control plane (managed by the RTP Control Protocol, or RTCP, for feedback and monitoring), which enhances scalability in large sessions by allowing RTCP messages to be transmitted at lower rates without impacting data throughput.
This design enables RTP to automatically scale from small conferences to thousands of participants, supporting efficient resource use in diverse network environments.

Core Protocol Mechanics

RTP Data Transfer

The Real-time Transport Protocol (RTP) encapsulates media data, such as audio or video frames, into discrete packets for real-time transmission over networks. This packetization process involves dividing the continuous stream into units, typically aligned with the media's encoding frame or sample boundaries, and prepending a standardized RTP header to each unit. The header includes a 32-bit synchronization source (SSRC) identifier, a randomly chosen value unique to each stream source within a session, which enables receivers to distinguish and synchronize multiple concurrent streams, such as in multiparty scenarios.

A key component of RTP data transfer is the 16-bit sequence number field in the header, which increments by one for each successive RTP packet sent from a given SSRC, regardless of payload type or content. This numbering allows receivers to detect packet loss, duplication, or reordering caused by network variability, facilitating reconstruction of the original stream order. At the receiver, a jitter buffer uses the sequence numbers to smooth out arrival delays, holding packets briefly to reorder them and minimize disruptions, which is essential for smooth, continuous playback.

Timestamps in RTP provide precise timing information for rendering, using a 32-bit field that indicates the sampling instant of the first octet of the payload data. Unlike sequence numbers, timestamps advance according to the media's sampling clock rather than the packet count; for instance, in audio sampled at 8000 Hz (as with G.711), a 20 ms packet containing 160 samples increments the timestamp by 160. This scaling ensures accurate playout timing across varying network paths, compensating for jitter without assuming constant packet intervals.

The 7-bit payload type field identifies the media format and codec within the RTP header, allowing flexible negotiation between sender and receiver. Static assignments cover common types, such as 0 for G.711 mu-law audio, while dynamic values from 96 to 127 accommodate newer codecs like H.264 video, whose payload formats are specified separately. This field enables switching between formats during a session without altering the underlying transport.

RTP packets are transmitted in User Datagram Protocol (UDP) datagrams, supporting unicast for point-to-point delivery, multicast for efficient group communication, or broadcast for network-wide distribution. By convention, RTP uses even-numbered UDP ports (e.g., 5004), with the associated control protocol on the next odd port (e.g., 5005), simplifying port pairing in implementations. This UDP-based approach prioritizes low latency over reliability, as RTP relies on application-layer mechanisms for any necessary retransmission.
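As a minimal sketch of how the two counters advance, the following Python fragment generates the sequence number and timestamp for consecutive G.711 packets, assuming a 20 ms packetization interval. The function name `packet_fields` and the starting values are illustrative only and are not part of any RTP library.

```python
CLOCK_RATE_HZ = 8000   # G.711 sampling clock
PACKET_MS = 20         # packetization interval assumed for this sketch
SAMPLES_PER_PACKET = CLOCK_RATE_HZ * PACKET_MS // 1000  # 160 samples


def packet_fields(first_seq, first_ts, count):
    """Yield (sequence_number, timestamp) for `count` consecutive packets.

    The sequence number advances by one per packet and wraps at 16 bits;
    the timestamp advances by the number of samples carried and wraps at
    32 bits.
    """
    for n in range(count):
        seq = (first_seq + n) & 0xFFFF
        ts = (first_ts + n * SAMPLES_PER_PACKET) & 0xFFFFFFFF
        yield seq, ts


# Start near the 16-bit limit to show the sequence number wrapping.
fields = list(packet_fields(first_seq=65534, first_ts=1000, count=4))
```

A receiver has to treat the wrap from 65535 to 0 as a normal step rather than a gap, which is why real implementations compare sequence numbers with modular arithmetic.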

RTCP Feedback Mechanism

The RTP Control Protocol (RTCP) serves as an out-of-band companion to RTP, delivering periodic control information that enables participants in a session to monitor the quality of service (QoS) for transmitted data streams. Specifically, RTCP facilitates the exchange of sender and receiver reports containing key QoS metrics, such as packet loss fraction, interarrival jitter, and the timing fields from which round-trip time (RTT) is estimated, which help applications adapt to network conditions and diagnose issues like congestion or faults. These reports are essential for real-time applications, as they provide insight into transmission quality without interfering with the primary RTP data flow.

RTCP employs several core packet types to convey this feedback. The Sender Report (SR) is transmitted by active senders and includes statistics on packets sent, octets sent, and an NTP timestamp for clock synchronization, allowing receivers to correlate RTP timestamps with absolute time. The Receiver Report (RR) is sent by participants that are not active senders; the same reception report blocks are also carried inside SR packets by senders. Each block reports reception statistics such as the fraction of packets lost, cumulative packets lost, highest sequence number received, and an interarrival jitter estimate. Additionally, the Source Description (SDES) packet provides identification and descriptive information about session participants, including the mandatory canonical name (CNAME) that identifies a source consistently across its streams, along with optional items like name, email, or location.

To ensure efficient use of network resources, RTCP transmission is scheduled under bandwidth constraints. The protocol recommends allocating no more than 5% of the total session bandwidth to RTCP traffic, with approximately one-quarter of that reserved for senders and the remainder for receivers, preventing control packets from overwhelming the data.
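The bandwidth split described above is simple arithmetic. The sketch below works it through for a hypothetical 1 Mbit/s session using integer math; the function name `rtcp_bandwidth_split` and its parameters are illustrative, not taken from any library.

```python
def rtcp_bandwidth_split(session_bw_bps, rtcp_percent=5, sender_quarter=4):
    """Return (rtcp_bw, sender_bw, receiver_bw) in bits per second.

    rtcp_percent: share of session bandwidth given to RTCP (5% recommended).
    sender_quarter: senders receive 1/sender_quarter of the RTCP bandwidth.
    """
    rtcp_bw = session_bw_bps * rtcp_percent // 100
    sender_bw = rtcp_bw // sender_quarter
    receiver_bw = rtcp_bw - sender_bw
    return rtcp_bw, sender_bw, receiver_bw


# Hypothetical 1 Mbit/s session.
rtcp_bw, sender_bw, receiver_bw = rtcp_bandwidth_split(1_000_000)
```

With these inputs RTCP gets 50 kbit/s in total, of which senders share 12.5 kbit/s and receivers share 37.5 kbit/s, however many participants there are.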
Intervals between RTCP packets are calculated dynamically based on session size, participant roles, and recent reporting activity, incorporating randomization to desynchronize transmissions and avoid bursts of simultaneous reports. This approach scales gracefully for large sessions, where the interval grows with the number of participants to stay within the bandwidth limit.

Scalability is further enhanced through compound RTCP packets, which bundle multiple RTCP packet types, such as an SR or RR followed by SDES, into a single underlying protocol datagram, reducing header overhead and ensuring delivery of related information together. For custom needs, the Application-defined (APP) packet type allows extensions specific to particular applications, carrying subtype and name fields to define proprietary data while adhering to RTCP's overall structure.

A critical component of RTCP feedback is the estimation of interarrival jitter, which quantifies variations in packet arrival times caused by congestion or routing differences. The jitter value J is computed iteratively using the recurrence

$$J_i = J_{i-1} + \frac{|D_i| - J_{i-1}}{16}$$

where D_i is the difference between the spacing of two packets at the receiver and their spacing at the sender,

$$D_i = (R_i - R_{i-1}) - (S_i - S_{i-1}),$$

with R denoting the arrival time at the receiver, expressed in RTP timestamp units, and S the RTP timestamp set by the sender. This smoothed estimate, reported in the RR or SR, aids in assessing stream stability and is standardized across receivers for consistent comparison.
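The jitter estimator can be sketched in a few lines of Python. This follows the RFC 3550 form, in which each update moves J one-sixteenth of the way toward |D|. The helper name `update_jitter` and the sample arrival times are assumptions made for illustration; arrival times are taken to be already converted to RTP timestamp units.

```python
def update_jitter(jitter, prev_arrival, prev_rtp_ts, arrival, rtp_ts):
    """Return the new smoothed jitter after one packet.

    d is the change in transit time between two consecutive packets:
    receiver spacing minus sender spacing, both in RTP timestamp units.
    """
    d = (arrival - prev_arrival) - (rtp_ts - prev_rtp_ts)
    return jitter + (abs(d) - jitter) / 16.0


# Four packets sent every 160 units; the third arrives 20 units late.
rtp_timestamps = [0, 160, 320, 480]
arrival_times = [0, 160, 340, 480]

jitter = 0.0
for i in range(1, len(rtp_timestamps)):
    jitter = update_jitter(jitter,
                           arrival_times[i - 1], rtp_timestamps[i - 1],
                           arrival_times[i], rtp_timestamps[i])
```

The 1/16 gain acts as a noise filter: a single late packet moves the estimate only slightly, while sustained variation raises it steadily.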

Packet Formats

RTP Header Structure

The RTP header is a fixed 12-byte structure that precedes the payload in RTP data packets, enabling synchronization, identification, and multiplexing of media streams. It begins with a 1-byte field containing the version number (2 bits, currently set to 2), a padding flag (P bit, 1 bit, indicating that padding bytes follow the payload at the end of the packet), an extension flag (X bit, 1 bit, signaling the presence of an optional extension header), and a CSRC count (CC field, 4 bits, specifying the number of contributing sources listed in the packet). The second byte includes the marker bit (M, 1 bit, used to indicate frame boundaries or other significant events in the media stream, as defined by the profile) followed by the payload type (PT, 7 bits, identifying the format of the media data, such as audio or video). This is followed by a 16-bit sequence number for detecting packet loss and reordering, a 32-bit timestamp that advances with the media clock for synchronization (often derived from the sampling rate of the media), and a 32-bit synchronization source identifier (SSRC) uniquely identifying the source of the stream within the RTP session.

If the CC field is greater than zero, the header is extended by a list of up to 15 CSRC identifiers (each 32 bits), which identify the contributing sources in scenarios where a mixer combines multiple streams into one. When the X bit is set, an optional extension header follows the CSRC list (or the fixed header if there are no CSRCs), consisting of a 16-bit profile-specific identifier, a 16-bit length field giving the length of the extension data in 32-bit words, and the extension data itself, allowing profile-defined additional information without altering the base header.

The version field has been 2 since RFC 1889 in 1996 and was retained by RFC 3550 in 2003, ensuring backward compatibility with earlier RTP implementations while supporting the protocol's core functionality for real-time applications.
| Field | Size (bits) | Description |
|---|---|---|
| Version (V) | 2 | Protocol version (2). |
| Padding (P) | 1 | Indicates padding at packet end. |
| Extension (X) | 1 | Signals optional extension header. |
| CSRC Count (CC) | 4 | Number of CSRC identifiers (0-15). |
| Marker (M) | 1 | Marks significant events (profile-specific). |
| Payload Type (PT) | 7 | Identifies media format. |
| Sequence Number | 16 | Packet ordering and loss detection. |
| Timestamp | 32 | Synchronization clock value. |
| SSRC | 32 | Source stream identifier. |
| CSRC List (optional) | 32 each (up to 15) | Contributing source IDs. |
| Extension Header (optional) | Variable (min 4 bytes) | Profile-specific extensions. |
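The field layout above maps directly onto a few bit operations. The following is a minimal sketch using Python's standard `struct` module; the function names `pack_rtp_header` and `parse_rtp_header` are hypothetical, and header extensions are not handled.

```python
import struct


def pack_rtp_header(payload_type, seq, timestamp, ssrc, marker=False, csrcs=()):
    """Build the fixed 12-byte header plus any CSRC entries (version 2)."""
    if len(csrcs) > 15:
        raise ValueError("CC is a 4-bit field: at most 15 CSRCs")
    byte0 = (2 << 6) | len(csrcs)                  # V=2, P=0, X=0, CC
    byte1 = (int(marker) << 7) | (payload_type & 0x7F)
    header = struct.pack("!BBHII", byte0, byte1, seq & 0xFFFF,
                         timestamp & 0xFFFFFFFF, ssrc & 0xFFFFFFFF)
    return header + b"".join(struct.pack("!I", c) for c in csrcs)


def parse_rtp_header(data):
    """Decode the fixed header and CSRC list into a dictionary."""
    if len(data) < 12:
        raise ValueError("RTP header needs at least 12 bytes")
    byte0, byte1, seq, timestamp, ssrc = struct.unpack("!BBHII", data[:12])
    cc = byte0 & 0x0F
    if len(data) < 12 + 4 * cc:
        raise ValueError("truncated CSRC list")
    csrcs = list(struct.unpack("!%dI" % cc, data[12:12 + 4 * cc]))
    return {
        "version": byte0 >> 6,
        "padding": bool(byte0 & 0x20),
        "extension": bool(byte0 & 0x10),
        "marker": bool(byte1 & 0x80),
        "payload_type": byte1 & 0x7F,
        "sequence_number": seq,
        "timestamp": timestamp,
        "ssrc": ssrc,
        "csrcs": csrcs,
    }


header = pack_rtp_header(payload_type=0, seq=1, timestamp=160, ssrc=0x12345678)
fields = parse_rtp_header(header)
```

The `!` prefix selects network (big-endian) byte order, which is how all multi-byte RTP header fields are transmitted.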

RTCP Packet Types

RTCP packets share a common fixed header that precedes type-specific data, enabling receivers to identify the packet type and length. This header is 4 bytes long and consists of the version field (2 bits, set to 2 for RTP/RTCP), a padding bit (1 bit, indicating whether the packet contains padding to align it to a 32-bit boundary), a count or subtype field (5 bits, varying by packet type: reception report count for SR/RR, source count for SDES and BYE, subtype for APP), a packet type field (PT, 8 bits: 200 for SR, 201 for RR, 202 for SDES, 203 for BYE, 204 for APP), and a length field (16 bits, giving the number of 32-bit words in the packet minus one).

The Sender Report (SR) packet provides transmission and reception statistics from a sender. After the common header it carries the sender's SSRC identifier (32 bits), an NTP timestamp (64 bits of wall-clock time), an RTP timestamp (32 bits, representing the same instant as the NTP timestamp but in the units of the RTP media clock), the sender's packet count (32 bits, total RTP data packets sent), and the sender's octet count (32 bits, total RTP payload octets sent). It then includes zero or more reception report blocks (each 24 bytes: SSRC of the reported source, fraction lost, cumulative packets lost, extended highest sequence number, interarrival jitter, last SR timestamp, and delay since last SR). SR packets are sent periodically by active senders to synchronize streams and report quality.

Receiver Report (RR) packets convey reception quality feedback from participants that are not active senders. After the common header they carry the SSRC of the packet sender (32 bits) and up to 31 report blocks identical to those in SR packets. An empty RR packet (RC=0) may head a compound packet when there is nothing to report, keeping overhead minimal; otherwise the blocks provide feedback on metrics like packet loss and jitter.

Source Description (SDES) packets carry textual information about participants for identification and display. After the common header come one or more chunks: each chunk starts with an SSRC or CSRC identifier (32 bits), followed by zero or more items (each with a 1-byte type, 1-byte length, and variable-length text value). The Canonical Name (CNAME) item is mandatory in each compound RTCP packet (except when a compound packet is split for partial encryption), providing a persistent identifier of the form "user@host", or an opaque generated value, that binds a participant's SSRCs together without disclosing sensitive user details. Optional items include NAME (user's display name), EMAIL, PHONE, LOC (location), TOOL (software name and version), NOTE (free-form note), and PRIV (private extensions). SDES item values are limited to 255 bytes each, and each chunk's item list ends with one or more null octets that pad the chunk to a 32-bit boundary; the source count in the header gives the number of chunks.

BYE packets signal the departure of one or more sources from the session, consisting of the common header followed by one or more SSRC/CSRC identifiers (32 bits each, up to 31 as indicated by the count field) and an optional reason-for-leaving string (preceded by its length in bytes). Multiple SSRCs allow a single packet to announce several departures, and the reason field aids diagnostics. BYE packets may be sent immediately upon leaving, outside the regular RTCP schedule, but follow backoff rules to avoid congestion.

Application-defined (APP) packets enable custom control information beyond standard RTCP functions, with the common header's count field specifying a subtype (0-31), followed by an SSRC/CSRC (32 bits), a 4-character ASCII name (identifying the application), and application-dependent data (variable length, a multiple of 32 bits). APP packets support extensibility for specific applications, such as synchronization or control signals, while maintaining compatibility with the RTCP framework.

To save bandwidth and reduce header overhead, RTCP packets are typically transmitted as compound packets within a single underlying protocol datagram (e.g., UDP), concatenating multiple individual RTCP packets, starting with an SR or RR, followed by SDES (with CNAME), and optionally BYE or APP, without additional headers between them. Requiring an SR or RR first also helps receivers validate incoming RTCP data. Encryption (if used) applies to the entire compound or splits it into encrypted and unencrypted portions, with SDES CNAME appearing in only one to avoid duplication. The length field in the common header is the total packet length in 32-bit words minus one, accommodating variable-sized contents like report blocks or SDES items while allowing padding (if the P bit is set) to reach the next 32-bit boundary; when padding is present, the last byte of the packet holds the number of padding bytes.
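Because every RTCP packet states its own length, a compound packet can be split by walking the common headers. The sketch below does that with `struct`. The function name `split_compound` is hypothetical, and the example datagram (an empty RR, an SDES chunk with CNAME "a@b", and a BYE) is hand-built for illustration.

```python
import struct

RTCP_TYPES = {200: "SR", 201: "RR", 202: "SDES", 203: "BYE", 204: "APP"}


def split_compound(datagram):
    """Split a compound RTCP datagram into its individual packets."""
    packets = []
    offset = 0
    while offset + 4 <= len(datagram):
        byte0, pt, length_words = struct.unpack_from("!BBH", datagram, offset)
        if byte0 >> 6 != 2:
            raise ValueError("unexpected RTCP version")
        size = (length_words + 1) * 4        # length field is words minus one
        if offset + size > len(datagram):
            raise ValueError("packet length exceeds datagram")
        packets.append({
            "type": RTCP_TYPES.get(pt, pt),
            "count": byte0 & 0x1F,           # RC, SC, or subtype
            "padding": bool(byte0 & 0x20),
            "body": datagram[offset + 4:offset + size],
        })
        offset += size
    if offset != len(datagram):
        raise ValueError("trailing bytes after last RTCP packet")
    return packets


ssrc = 0x11223344
empty_rr = struct.pack("!BBHI", 0x80, 201, 1, ssrc)              # RC=0
sdes = (struct.pack("!BBHI", 0x81, 202, 3, ssrc)                 # SC=1
        + bytes([1, 3]) + b"a@b" + b"\x00" * 3)                  # CNAME + pad
bye = struct.pack("!BBHI", 0x81, 203, 1, ssrc)                   # SC=1
packets = split_compound(empty_rr + sdes + bye)
```

The SDES chunk here is 12 bytes: 4 for the SSRC, 5 for the CNAME item, and 3 null octets that both terminate the item list and pad to a 32-bit boundary.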

Profiles and Extensions

RTP Profiles

RTP profiles standardize the application of the Real-time Transport Protocol (RTP) and its control protocol (RTCP) for particular media types and network environments, specifying parameters such as payload types, clock rates, and default behaviors to ensure interoperability. These profiles extend the core RTP specification by defining mappings for common audio and video formats while maintaining minimal control overhead for real-time applications.

The Audio/Video Profile (AVP), defined in RFC 3551, serves as the foundational RTP profile for non-secure audio and video conferencing over both IPv4 and IPv6 networks. It designates RTP to use even-numbered ports, with the associated RTCP traffic on the subsequent odd-numbered port, and registers ports 5004 for RTP and 5005 for RTCP as conventional defaults when dynamic port assignment is unavailable. This profile reserves payload types 0 through 95 for static assignments with fixed encodings and clock rates, such as 8000 Hz for many audio formats and 90000 Hz for video formats, to avoid negotiation overhead in basic sessions. Payload types 96 through 127 are reserved as dynamic, requiring negotiation via protocols like the Session Description Protocol (SDP) to assign specific media formats. Additionally, AVP includes provisions for registering media types with the Internet Assigned Numbers Authority (IANA) to associate payload types with standardized media descriptions.

For secure communications, the Secure Audio/Video Profile (SAVP), specified in RFC 3711, extends AVP by integrating the Secure RTP (SRTP) mechanism to provide confidentiality and authentication without altering the core RTP packet structure. It employs the same port conventions and payload type assignments as AVP but applies SRTP protection to RTP and RTCP packets, making it suitable for environments requiring data protection.

To support more responsive error correction and synchronization in real-time sessions, the AVP Feedback Profile (AVPF), outlined in RFC 4585, builds on AVP by enabling earlier transmission of RTCP messages, such as Negative Acknowledgments (NACK) for loss recovery and Picture Loss Indication (PLI) for video resynchronization. This profile reduces feedback latency while adhering to the same bandwidth constraints as AVP, allowing immediate responses within RTCP intervals rather than waiting for periodic reports. The Secure AVP Feedback Profile (SAVPF), defined in RFC 5124, combines SAVP with AVPF to deliver secure, timely feedback in encrypted sessions.

Profile-specific conventions further tailor RTP usage, including standardized clock rates for timestamp generation (typically 8000 Hz for audio and 90000 Hz for video in AVP and its variants) and IANA registration of media types to ensure consistent identification across implementations. These elements collectively enable profiles to adapt RTP for diverse applications while preserving its efficiency.

Payload Formats and Codecs

Payload formats in the Real-time Transport Protocol (RTP) define how encoded media data from specific codecs is structured and encapsulated within RTP packets, ensuring compatibility and efficient transmission over networks. These formats specify the mapping of codec output to the RTP payload, including octet-level details, packetization rules, and handling of codec-specific parameters such as frame boundaries and synchronization. Standardized by the Internet Engineering Task Force (IETF), payload formats are registered to avoid conflicts and enable interoperability, with static payload types (PT) assigned for well-known codecs and dynamic PTs for others negotiated during session setup.

For audio codecs, common payload formats include G.711 mu-law (PCMU), which uses PT=0 for pulse-code modulation (PCM) at 64 kbps with an 8 kHz sampling rate, typically packaging 160 samples per 20 ms packet in the RTP payload. G.729 uses PT=18 for compressed speech at 8 kbps, also at 8 kHz, where each 10 ms frame is mapped directly to 10 octets in the payload, optionally including comfort noise frames via Annex B. The Opus codec, whose payload format is defined in RFC 7587, uses dynamic PTs (96-127) for scalable audio from 6 to 510 kbps across audio bandwidths of 4 to 20 kHz, allowing variable frame sizes (2.5-60 ms) and multiple channels; each Opus packet begins with a table-of-contents (TOC) octet that indicates the packet structure, such as mono or stereo and the frame configuration.

Video payload formats similarly map compressed frames into RTP packets. H.261, an early video codec, uses PT=31 with a format specified in RFC 2032, in which the picture is split at macroblock boundaries and each packet is prefixed by a 4-octet header carrying start and end bit counts, the group-of-blocks number, and motion vector predictors, with a 90 kHz clock for timestamps. For H.264/AVC, RFC 6184 outlines aggregation of multiple Network Abstraction Layer (NAL) units into single packets via Single-Time Aggregation Packets (STAP) for efficiency, while large NAL units exceeding the MTU are fragmented using Fragmentation Units (FU) with headers indicating start, end, and type. VP8, detailed in RFC 7741, encapsulates frame partitions (up to 9) in payloads with a descriptor header for picture ID and temporal layer indexing, supporting scalability layers and setting the RTP marker bit on the last packet of a frame to signal frame completion.

Payload formats incorporate packetization rules tailored to media characteristics. Large video frames, such as those from H.264, undergo fragmentation to fit network MTU limits (typically 1200-1500 bytes), with each fragment carrying codec headers to enable reassembly without full decoding. Conversely, multiple small units, such as low-overhead NAL units, may be aggregated into one RTP packet to reduce header overhead, as in STAP-A for H.264. The RTP marker bit (M) is used in video formats to denote the final packet of a frame, aiding receivers in buffering and rendering; for example, in VP8 it is set to 1 on the last packet of a frame.

Registration of payload formats occurs through the Internet Assigned Numbers Authority (IANA), which maintains the RTP Parameters registry for static PTs (0-95) and media types, ensuring unique identifiers and parameters like clock rates are documented in RFCs. Payload types are signaled via the Session Description Protocol (SDP), as in the example m=audio 5004 RTP/AVP 0, which offers PCMU (PT=0) on port 5004 using the Audio/Video Profile (AVP); for dynamic PTs, the answerer selects or maps PTs to match its capabilities. This process, often within RTP profiles, allows flexible codec selection while adhering to format specifications.
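To make the fragmentation step concrete, here is a minimal sketch of splitting one H.264 NAL unit into FU-A payloads in the manner of RFC 6184. The function name `fragment_fu_a` and the tiny `max_payload` value are illustrative assumptions; a real packetizer would also build RTP headers and set the marker bit on the last packet of the frame.

```python
FU_A_TYPE = 28  # NAL unit type that marks an FU-A payload


def fragment_fu_a(nal_unit, max_payload):
    """Split one NAL unit into FU-A payloads of at most max_payload bytes.

    Each fragment starts with two bytes:
      FU indicator: F and NRI bits copied from the NAL header, type = 28
      FU header:    S (start) bit, E (end) bit, original NAL unit type
    The original one-byte NAL header is not sent; it is rebuilt on receipt.
    """
    if len(nal_unit) <= max_payload:
        return [nal_unit]                    # fits in a single packet
    if max_payload < 3:
        raise ValueError("need room for two FU bytes plus data")
    nal_header = nal_unit[0]
    fu_indicator = (nal_header & 0xE0) | FU_A_TYPE
    nal_type = nal_header & 0x1F
    body = nal_unit[1:]
    chunk = max_payload - 2
    fragments = []
    for start in range(0, len(body), chunk):
        s_bit = 0x80 if start == 0 else 0
        e_bit = 0x40 if start + chunk >= len(body) else 0
        fu_header = s_bit | e_bit | nal_type
        fragments.append(bytes([fu_indicator, fu_header]) + body[start:start + chunk])
    return fragments


def reassemble_fu_a(fragments):
    """Rebuild the NAL unit from in-order FU-A fragments."""
    nal_header = (fragments[0][0] & 0xE0) | (fragments[0][1] & 0x1F)
    return bytes([nal_header]) + b"".join(f[2:] for f in fragments)


nal = bytes([0x65]) + bytes(range(10))       # IDR slice header byte + 10 bytes
fragments = fragment_fu_a(nal, max_payload=6)
```

Only the first fragment carries the S bit and only the last carries the E bit, so a receiver that sees a gap in sequence numbers between them knows the NAL unit is incomplete.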

Security and Reliability

Secure RTP (SRTP)

The Secure Real-time Transport Protocol (SRTP) is a profile of the Real-time Transport Protocol (RTP) designed to add security to RTP and RTP Control Protocol (RTCP) streams while preserving the core RTP structure and functionality. Defined in RFC 3711, SRTP provides three primary services: confidentiality through encryption of RTP payloads to prevent eavesdropping; message authentication to ensure integrity and origin authenticity, protecting against tampering; and replay protection to detect and discard packets that have already been received, based on the packet index derived from the sequence number. These services are applied to RTP packets (the header is authenticated but not encrypted) and to RTCP packets without requiring changes to RTP session establishment or data flow mechanics.

SRTP employs specific cryptographic transforms to achieve these protections. Encryption uses the Advanced Encryption Standard (AES) in Counter Mode (AES-CM), which allows efficient, parallelizable processing suitable for real-time media. For authentication and integrity, it applies the Hash-based Message Authentication Code (HMAC) with the Secure Hash Algorithm 1 (SHA-1), typically truncating the tag to 80 bits for reduced overhead. Key derivation is performed using a pseudorandom function based on AES in Counter Mode, starting from a master key and a 112-bit master salt to generate separate session encryption keys, session salts, and authentication keys; deriving distinct keys for each purpose limits the damage if any single session key is compromised.

Key management in SRTP is decoupled from the protocol itself to allow flexibility in secure key exchange. Common methods include Session Description Protocol (SDP) Security Descriptions (SDES) per RFC 4568, which embeds master keys and parameters as SDP attributes during signaling for unicast streams, and Datagram Transport Layer Security (DTLS)-SRTP per RFC 5764, which runs a DTLS handshake over the media path to negotiate and derive keys using TLS exporters, supporting mutual authentication via certificates.

SRTP extends these protections to RTCP via Secure RTCP (SRTCP), which uses the same master keys and derivation mechanisms as SRTP but is adapted to RTCP's lower frequency and compound packet structure. SRTCP applies its authentication tag once at the end of each compound RTCP packet, rather than to each individual packet, to minimize bandwidth overhead while ensuring integrity for sensitive RTCP fields like sender reports. Replay protection in SRTCP relies on a 31-bit SRTCP index carried explicitly in each packet alongside a 1-bit encryption flag.

To ensure interoperability, SRTP mandates support of AES-128-CM for encryption paired with HMAC-SHA1-80 for authentication, providing a baseline of 128-bit keys with an 80-bit authentication tag that balances protection against brute-force attacks and real-time performance constraints. Additional transforms, such as AES-GCM defined in RFC 7714, offer alternatives that combine encryption and authentication in a single operation.
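Replay protection is commonly implemented as a sliding window over packet indices. The class below is a generic sketch of that technique, not a transcription of any SRTP library: the name `ReplayWindow`, the 64-packet window size, and the single `check_and_update` method are assumptions made for illustration. A real SRTP receiver checks the window first but records a packet as seen only after its authentication tag has verified.

```python
class ReplayWindow:
    """Sliding bitmap of recently seen packet indices."""

    def __init__(self, size=64):
        self.size = size
        self.highest = -1      # highest index accepted so far
        self.bitmap = 0        # bit k set => index (highest - k) was seen

    def check_and_update(self, index):
        """Return True if index is new and in range, recording it as seen."""
        if index > self.highest:
            shift = index - self.highest
            mask = (1 << self.size) - 1
            self.bitmap = ((self.bitmap << shift) | 1) & mask
            self.highest = index
            return True
        offset = self.highest - index
        if offset >= self.size:
            return False       # too old to judge: reject
        if self.bitmap & (1 << offset):
            return False       # already seen: replay
        self.bitmap |= 1 << offset
        return True            # late but new: accept


window = ReplayWindow()
results = [window.check_and_update(i) for i in (10, 10, 9, 9, 100, 10, 99)]
```

In this run the second 10 and the second 9 are rejected as replays, and the final 10 is rejected because it has fallen more than 64 packets behind index 100.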

Congestion Control and Error Handling

RTP employs sequence numbers in its header to enable receivers to detect packet loss and reordering. The 16-bit sequence number field increments by one for each successive RTP data packet sent from a source, allowing the receiver to identify gaps in the sequence that indicate lost or out-of-order packets. This mechanism facilitates loss detection without providing built-in retransmission; RTP itself does not perform automatic recovery, leaving such functions to the application.

For selective retransmission, applications may use RTCP feedback messages, such as Negative Acknowledgments (NACKs), to request retransmission of specific lost packets. Defined in the extended RTP profile for RTCP-based feedback, NACKs allow receivers to signal missing sequence numbers, enabling senders to retransmit only the affected packets, typically in a separate RTP stream using a dedicated retransmission payload format. This approach supports efficient loss recovery for real-time applications while minimizing overhead.

RTP does not define its own congestion control: it typically operates over UDP and does not incorporate mechanisms to adjust transmission rates in response to network congestion. Instead, congestion management is delegated to external algorithms or protocols layered atop RTP. Examples include TCP-Friendly Rate Control (TFRC), which estimates the available bandwidth from loss rates and round-trip times to provide smooth rate adjustments suitable for streaming media, and Google Congestion Control (GCC), which uses delay and loss signals for adaptive bitrate control in interactive scenarios.

To mitigate the effects of variable network delays, RTP includes a 32-bit timestamp field in its header, which reflects the sampling instant of the media data. Receivers employ jitter buffers to smooth out arrival time variations; these buffers reorder packets using sequence numbers and hold them until their timestamps indicate they are ready for playback, thereby compensating for jitter without altering the media timing.

Packet loss concealment is handled at the application level, as RTP provides no inherent mechanisms for it. Common techniques include interpolation, where missing audio samples are estimated from adjacent frames, or waveform substitution for brief losses, maintaining continuity in real-time playback despite unrecoverable packets.

Bandwidth estimation in RTP sessions often draws on RTCP reports. Receiver reports convey the fraction of packets lost, the cumulative loss, the highest sequence number received, and interarrival jitter, while sender reports carry packet and octet counts. A sender can compare successive reports to approximate the rate actually delivered over a reporting interval, and can estimate round-trip time (RTT) from the last-SR timestamp and delay-since-last-SR fields of a receiver report; these values inform external congestion controllers about network capacity.
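The round-trip time calculation from RTCP reports can be sketched as follows. A sender notes when a receiver report arrives, then subtracts the last-SR timestamp (LSR) and the delay since last SR (DLSR) carried in the report block. All three values use units of 1/65536 second, with LSR being the middle 32 bits of the 64-bit NTP timestamp. The function names and sample values are illustrative assumptions.

```python
def ntp_middle_32(ntp_seconds, ntp_fraction):
    """Take the middle 32 bits of a 64-bit NTP timestamp.

    That is the low 16 bits of the seconds field followed by the high
    16 bits of the fraction field, giving 1/65536-second resolution.
    """
    return ((ntp_seconds & 0xFFFF) << 16) | ((ntp_fraction >> 16) & 0xFFFF)


def rtt_seconds(arrival_mid32, lsr, dlsr):
    """Round-trip time: arrival time minus LSR minus DLSR, in seconds.

    The subtraction is done modulo 2**32 so that wrap-around of the
    32-bit time value does not produce a negative result.
    """
    rtt_units = (arrival_mid32 - lsr - dlsr) & 0xFFFFFFFF
    return rtt_units / 65536.0


# Report arrives at 46864.500 s; the receiver last saw an SR stamped
# 46853.125 s and held it for 5.250 s before sending this report.
arrival = ntp_middle_32(0x1234B710, 0x80000000)
rtt = rtt_seconds(arrival, lsr=0xB7052000, dlsr=0x00054000)
```

Because DLSR removes the time the receiver spent holding the report, the two clocks never need to be synchronized: every timestamp involved is read from the sender's own clock.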

Applications and Implementations

Traditional Uses in VoIP and Streaming

The Real-time Transport Protocol (RTP) serves as the foundational transport mechanism for real-time media in Voice over IP (VoIP) systems, enabling the delivery of time-sensitive audio and video packets with timestamps for synchronization and sequencing. In integration with the Session Initiation Protocol (SIP), as defined in RFC 3261, RTP handles the media streams following SIP's role in call establishment, where Session Description Protocol (SDP) negotiations specify RTP parameters such as payload types and ports. This separation allows SIP to focus on signaling while RTP provides low-latency, sequenced packet delivery over UDP, supporting bidirectional communication in VoIP sessions.

Enterprise VoIP deployments, such as the open-source Asterisk PBX, configure RTP channels to manage media flows, with settings in rtp.conf defining port ranges and symmetric RTP for direct endpoint-to-endpoint transmission after signaling. Similarly, the ITU-T H.323 suite employs RTP for media paths in gateways and terminals, where H.225 handles initial signaling and H.245 negotiates capabilities, routing RTP packets directly between participants for efficient multimedia exchange in traditional telephony-to-IP interworking.

In media streaming, RTP enables timely delivery of live broadcasts and on-demand content, particularly in Internet Protocol television (IPTV), where its multicast capabilities distribute streams to multiple receivers with minimal overhead, as supported by the RTP/AVP profile. Video conferencing platforms like Polycom systems use RTP to carry audio and video streams, with configurable media ports allowing firewall traversal for multi-party sessions in enterprise environments. The Real Time Streaming Protocol (RTSP), per RFC 7826, complements RTP by providing application-level control for initiating, pausing, and tearing down streams, typically over TCP while RTP carries the actual media data.

Traditional RTP applications often encounter Network Address Translation (NAT) barriers that hinder direct peer connectivity. These are addressed through protocols such as STUN (RFC 8489), which enables endpoints to learn their public address mappings and adjust the addresses they advertise for RTP accordingly. In symmetric NAT scenarios, TURN (RFC 5766) acts as a relay server to forward RTP packets, maintaining session connectivity in firewall-protected enterprise networks without relying on later frameworks.

RTP's dominance in enterprise Private Branch Exchange (PBX) systems emerged in the early 2000s, as organizations transitioned from circuit-switched to IP-based telephony, with implementations like Cisco's CallManager using RTP for converged voice-data networks and achieving widespread adoption by mid-decade.

Modern Integrations like WebRTC

In modern web-based real-time communication, the Real-time Transport Protocol (RTP) serves as the core media transport mechanism within , enabling the delivery of audio, video, and other data streams between browsers and devices. implementations mandate support for RTP over , with mandatory encryption via DTLS-SRTP to secure media flows, and of multiple RTP streams into a single transport using the BUNDLE extension to optimize and reduce overhead. This integration allows peer-to-peer connections for applications like video conferencing directly in web browsers without plugins, leveraging RTP's timestamping and sequence numbering for synchronization and ordering. WebRTC addresses network congestion through RTP Control Protocol (RTCP) feedback mechanisms, including Receiver Estimated Maximum Bitrate (REMB) messages that signal the estimated available to the sender, allowing dynamic adjustment of transmission rates. Negative Acknowledgment (NACK) packets enable selective retransmission of lost RTP packets, improving reliability over unreliable transports without full fallback. Experimental integrations have explored tunneling RTP over to leverage its multipath and congestion control features, as outlined in IETF drafts such as draft-ietf-avtcore-rtp-over-quic (version 14, March 2025). In mobile and over-the-top (OTT) applications, RTP over combined with () facilitates and peer discovery, ensuring reliable media transport in diverse network environments. For instance, employs RTP for voice and video calls. Similarly, relies on RTP over ports for media streams, integrating to handle and challenges in enterprise and consumer settings. Recent adaptations of RTP include framing over to traverse firewalls that block , as specified in 4571, which encapsulates RTP and RTCP packets within TCP streams while preserving timing information. 
In low-latency streaming protocols like HTTP Live Streaming (HLS) and Dynamic Adaptive Streaming over HTTP (DASH), RTP streams from live sources are often ingested and transcoded into shorter chunks for delivery, enabling end-to-end latencies under 5 seconds in low-latency modes (LL-HLS and LL-DASH). WebRTC further extends RTP with mechanisms like Retransmission (RTX), which uses a dedicated payload format to resend lost packets in a separate stream, enhancing error recovery without disrupting the primary media flow.
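To make the RTX idea concrete, the sketch below follows the retransmission payload format of RFC 4588, in which the retransmitted payload begins with the 16-bit original sequence number (OSN) followed by the original RTP payload. The function names are hypothetical, and the RTP header of the retransmission stream (its own SSRC, sequence numbers, and payload type) is assumed to be handled elsewhere.

```python
import struct


def build_rtx_payload(original_seq: int, original_payload: bytes) -> bytes:
    """Wrap a lost packet's payload for the retransmission stream."""
    # The OSN lets the receiver place the data back into the original stream.
    return struct.pack("!H", original_seq & 0xFFFF) + original_payload


def parse_rtx_payload(rtx_payload: bytes):
    """Return (original_seq, original_payload) from an RTX payload."""
    if len(rtx_payload) < 2:
        raise ValueError("RTX payload must contain a 2-byte OSN")
    (original_seq,) = struct.unpack_from("!H", rtx_payload)
    return original_seq, rtx_payload[2:]
```

Because the retransmission travels in its own stream, the primary stream's sequence numbers stay contiguous and its loss statistics are not distorted by repairs.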

Standards Evolution

Foundational RFCs

The foundational standardization of the Real-time Transport Protocol (RTP) began with RFC 1889, published in January 1996 by authors Henning Schulzrinne, Stephen L. Casner, Ron Frederick, and Van Jacobson. This document introduced RTP as an end-to-end network transport protocol designed for applications transmitting real-time data, such as audio, video, or simulation data, over both multicast and unicast network services. It defined the core RTP packet format, including a fixed 12-byte header with fields for version (set to 2), padding, extension, CSRC count, marker, payload type, sequence number, timestamp, and SSRC identifier, followed by an optional CSRC list, enabling functions like payload type identification, sequencing, timing reconstruction, and source identification. Complementing RTP, the specification also outlined the RTP Control Protocol (RTCP), which provides out-of-band control information for quality-of-service feedback, participant identification, and session management through packet types such as sender reports, receiver reports, source description, and goodbye. RTP was typically run atop UDP, as defined in RFC 768, for its low-overhead, connectionless delivery.

RFC 3550, published in July 2003 by the same authors (Schulzrinne, Casner, Frederick, and Jacobson), obsoleted RFC 1889 and established the current standard for RTP and RTCP without altering the wire format. This update refined rules and algorithms, most notably the RTCP transmission interval calculation, to improve scalability in large sessions (with RTCP traffic held to a recommended 5% of session bandwidth). It formalized RTP's header structure and mechanisms, including the use of profiles for customization and extensions for additional fields, while emphasizing RTP's role in delivering real-time data over networks supporting both unicast and multicast.
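The fixed header layout described above can be sketched as a small parser. The `RtpHeader` type and `parse_rtp_header` function are hypothetical names chosen for illustration; header extensions and padding removal are deliberately left out.

```python
import struct
from typing import NamedTuple, Tuple


class RtpHeader(NamedTuple):
    version: int
    padding: bool
    extension: bool
    marker: bool
    payload_type: int
    sequence_number: int
    timestamp: int
    ssrc: int
    csrc: Tuple[int, ...]


def parse_rtp_header(packet: bytes) -> RtpHeader:
    """Parse the 12-byte fixed RTP header plus any CSRC identifiers."""
    if len(packet) < 12:
        raise ValueError("RTP packet shorter than the fixed header")
    b0, b1, seq, ts, ssrc = struct.unpack_from("!BBHII", packet)
    version = b0 >> 6
    if version != 2:
        raise ValueError("unsupported RTP version")
    csrc_count = b0 & 0x0F
    if len(packet) < 12 + 4 * csrc_count:
        raise ValueError("packet truncated inside the CSRC list")
    csrc = struct.unpack_from("!%dI" % csrc_count, packet, 12)
    return RtpHeader(
        version=version,
        padding=bool(b0 & 0x20),
        extension=bool(b0 & 0x10),
        marker=bool(b1 & 0x80),
        payload_type=b1 & 0x7F,
        sequence_number=seq,
        timestamp=ts,
        ssrc=ssrc,
        csrc=csrc,
    )
```

The media payload starts right after the CSRC list when the extension bit is clear; a fuller parser would also skip the header extension and strip padding.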
Like its predecessor, RFC 3550 positions RTP over UDP (per RFC 768) for multiplexing and checksums, and it leverages the IP multicast extensions from RFC 1112 to enable efficient delivery to multiple recipients in group communications. RTCP sender and receiver reports carry metrics like packets sent, octets sent, cumulative packets lost, and interarrival jitter, aiding in quality monitoring.

Building directly on RFC 3550, RFC 3551, also from July 2003 and authored by Schulzrinne and Casner, specifies the RTP Profile for Audio and Video Conferences with Minimal Control, known as the Audio/Video Profile (AVP). This profile interprets generic RTP fields for low-latency audio and video applications, defining static payload types for common encodings (such as payload type 0 for G.711 μ-law PCM, type 8 for G.711 A-law PCM, and type 18 for G.729) while reserving dynamic types 96–127 for negotiation. It mandates RTCP usage for feedback, allocates default RTCP bandwidth (5% of session bandwidth in total, split as 1.25% for senders and 3.75% for receivers), and sets a 90 kHz clock rate for video timestamps to support synchronization across diverse media. The AVP profile minimizes control overhead, making it suitable for conferences, and obsoletes the earlier RFC 1890.

RFC 4566, published in July 2006 by Mark Handley, Van Jacobson, and Colin Perkins, defines the Session Description Protocol (SDP) as a format for describing multimedia sessions, including those using RTP for media transport. SDP enables session announcement, invitation, and initiation by specifying attributes like media types (e.g., audio or video), transport protocols (e.g., RTP/AVP over UDP), port numbers, payload formats, and session timing, facilitating RTP session setup without prescribing the underlying signaling protocol. Key RTP-related elements include the "m=" line for media streams (e.g., m=audio 49172 RTP/AVP 0) and "a=rtpmap" attributes to map payload types to codecs and parameters, such as a=rtpmap:0 PCMU/8000 for μ-law at 8 kHz. This protocol integrates RTP by allowing declarative descriptions of media flows, supporting both unicast and multicast configurations as per earlier RTP foundations.
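The "m=" and "a=rtpmap" elements can be read with a few lines of string handling. The sketch below uses a hypothetical `parse_sdp_media` helper and assumes RTP-style media lines with a plain integer port and numeric payload types; it ignores every other SDP field and does no validation beyond that.

```python
def parse_sdp_media(sdp: str):
    """Extract media lines and their rtpmap entries from an SDP description."""
    media = []
    for raw in sdp.splitlines():
        line = raw.strip()
        if line.startswith("m="):
            # m=<media> <port> <proto> <fmt> ...
            kind, port, proto, *formats = line[2:].split()
            media.append({
                "kind": kind,
                "port": int(port),
                "proto": proto,
                "formats": [int(f) for f in formats],
                "rtpmap": {},
            })
        elif line.startswith("a=rtpmap:") and media:
            # a=rtpmap:<payload type> <encoding name>/<clock rate>[/<params>]
            pt, encoding = line[len("a=rtpmap:"):].split(None, 1)
            name, clock, *_ = encoding.split("/")
            media[-1]["rtpmap"][int(pt)] = (name, int(clock))
    return media
```

Applied to the two example lines quoted above, this yields one audio stream on port 49172 whose payload type 0 maps to PCMU at 8000 Hz.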

Recent Updates and Extensions

Following the establishment of core RTP standards, several key extensions have addressed evolving needs in security, feedback mechanisms, and transport efficiency. In 2004, RFC 3711 defined the Secure Real-time Transport Protocol (SRTP), providing confidentiality, message authentication, and replay protection for RTP and RTCP packets through cryptographic transforms like AES-CM. This framework was extended in 2015 by RFC 7714, which defined the use of AES-GCM authenticated encryption in SRTP, providing confidentiality and data authentication in a single algorithm. Security further advanced with RFC 8723 in 2020, introducing double encryption procedures for SRTP to enable end-to-end confidentiality in multiparty scenarios, such as conferencing systems where intermediate nodes handle media routing without accessing content. Complementing this, updates to DTLS-SRTP support 0-RTT resumption as per DTLS 1.3, allowing immediate data transmission on connection resumption with minimal latency overhead while preserving anti-replay properties.

For improved feedback in real-time sessions, RFC 4585 in 2006 specified the RTP/AVPF profile, an extension to the audiovisual profile that enables more frequent and immediate RTCP reports, including feedback messages like Picture Loss Indication (PLI) and Slice Loss Indication (SLI) to facilitate rapid video error recovery and adaptation. Reporting mechanisms have also seen enhancements, with RTCP Extended Reports (XR) providing detailed metrics such as burst/gap loss and delay variation for better network diagnostics, as outlined in RFC 3611; related reliability features include generic forward error correction (FEC) payloads per RFC 5109 for proactive loss mitigation.

Payload format evolutions include RFC 6184 from 2011, which updates the RTP encapsulation for H.264 video, supporting aggregation and fragmentation of Network Abstraction Layer (NAL) units so that encoder output can be fitted to network packet sizes. Scalable video coding, with its layered bitstreams for varying bandwidth conditions, is covered by the companion payload format in RFC 6190.
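The aggregation and fragmentation modes of the H.264 payload format are distinguished by the type field in the first payload byte. The sketch below, with a hypothetical `classify_h264_payload` helper, decodes that byte and the FU-A fragment header following the RFC 6184 layout; it only recognizes single NAL units, STAP-A, and FU-A, and does not reassemble fragments.

```python
def classify_h264_payload(payload: bytes) -> dict:
    """Inspect the first bytes of an H.264 RTP payload (RFC 6184 layout)."""
    if not payload:
        raise ValueError("empty payload")
    header = payload[0]
    nal_type = header & 0x1F
    info = {
        "forbidden_zero_bit": header >> 7,
        "nri": (header >> 5) & 0x03,   # importance of the unit for reference
        "type": nal_type,
    }
    if 1 <= nal_type <= 23:
        info["kind"] = "single NAL unit"
    elif nal_type == 24:
        info["kind"] = "STAP-A"        # several NAL units aggregated in one packet
    elif nal_type == 28:
        if len(payload) < 2:
            raise ValueError("FU-A payload is missing its FU header")
        fu_header = payload[1]
        info["kind"] = "FU-A"          # one NAL unit fragmented across packets
        info["start"] = bool(fu_header & 0x80)
        info["end"] = bool(fu_header & 0x40)
        info["original_type"] = fu_header & 0x1F
    else:
        info["kind"] = "other"         # modes not handled in this sketch
    return info
```

A depacketizer would buffer FU-A fragments from the one marked start to the one marked end, then rebuild the original NAL header from the NRI bits and the original type.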
In contemporary applications, RTP integrates with WebRTC as detailed in RFC 8827 (2021), where it serves as the primary media transport under a security architecture combining DTLS for key exchange and SRTP for media encryption, ensuring browser-based real-time communication meets its confidentiality and integrity requirements. Emerging transport options include RTP over QUIC, explored in draft-ietf-avtcore-rtp-over-quic (ongoing as of 2025), which maps RTP/RTCP into QUIC streams to exploit its built-in congestion control, multipath support, and low-latency handshakes for enhanced performance in unreliable networks. A niche extension appears in medical imaging, where the 2025 DICOM PS3.22 standard leverages RTP for real-time video transport over IP networks, encapsulating SMPTE ST 2110-compliant essence streams with RTP headers to enable low-latency, synchronized delivery of diagnostic images in clinical environments.

References

  1. [1]
    RFC 3550: RTP: A Transport Protocol for Real-Time Applications
    RTP provides end-to-end network transport functions suitable for applications transmitting real-time data, such as audio, video or simulation data.
  2. [2]
    Introduction to the Real-time Transport Protocol (RTP) - Web APIs
    Jul 26, 2024 · The Real-time Transport Protocol (RTP), defined in RFC 3550, is an IETF standard protocol to enable real-time connectivity for exchanging data ...
  3. [3]
    RFC 3550 - RTP: A Transport Protocol for Real-Time Applications
    RTP provides end-to-end network transport functions suitable for applications transmitting real-time data, such as audio, video or simulation data.
  4. [4]
    RFC 3550 - RTP - Tech-invite
    This memorandum describes RTP, the real-time transport protocol. RTP provides end-to-end network transport functions suitable for applications transmitting ...
  5. [5]
    RFC 3550 - RTP - Tech-invite
    RTP is designed to allow an application to scale automatically over session sizes ranging from a few participants to thousands.
  6. [6]
  7. [7]
  8. [8]
  9. [9]
  10. [10]
  11. [11]
  12. [12]
  13. [13]
  14. [14]
  15. [15]
    RFC 3551 - RTP Profile for Audio and Video Conferences with ...
    This document describes a profile called "RTP/AVP" for the use of the real-time transport protocol (RTP), version 2, and the associated control protocol, RTCP.
  16. [16]
    RFC 3711 - The Secure Real-time Transport Protocol (SRTP)
    This document describes the Secure Real-time Transport Protocol (SRTP), a profile of the Real-time Transport Protocol (RTP), which can provide confidentiality, ...
  17. [17]
    RFC 4585 - Extended RTP Profile for Real-time Transport Control ...
    This document defines an extension to the Audio-visual Profile (AVP) that enables receivers to provide, statistically, more immediate feedback to the senders.
  18. [18]
    RFC 5124 - Extended Secure RTP Profile for Real-time Transport Control Protocol (RTCP)-Based Feedback (RTP/SAVPF) - IETF Datatracker
    This memo specifies the combination of both profiles to enable secure RTP communications with feedback.
  19. [19]
    RFC 8088 - How to Write an RTP Payload Format - IETF Datatracker
    This document contains information on how best to write an RTP payload format specification. It provides reading tips, design practices, and practical tips.
  20. [20]
    RFC 7587 - RTP Payload Format for the Opus Speech and Audio ...
    Below are some examples of SDP session descriptions for Opus: Example 1: Standard mono session with 48000 Hz clock rate m=audio 54312 RTP/AVP 101 a=rtpmap ...
  21. [21]
    RFC 2032: RTP Payload Format for H.261 Video Streams
  22. [22]
    RFC 6184 - RTP Payload Format for H.264 Video - IETF Datatracker
    The RTP payload format allows for packetization of one or more Network Abstraction Layer Units (NALUs), produced by an H.264 video encoder, in each RTP payload.
  23. [23]
    RFC 7741: RTP Payload Format for VP8 Video
  24. [24]
    RFC 4568 - Session Description Protocol (SDP) Security ...
    This document defines a Session Description Protocol (SDP) cryptographic attribute for unicast media streams.
  25. [25]
    RFC 5764 - Datagram Transport Layer Security (DTLS) Extension to ...
    This document describes a Datagram Transport Layer Security (DTLS) extension to establish keys for Secure RTP (SRTP) and Secure RTP Control Protocol (SRTCP) ...
  26. [26]
  27. [27]
    RFC 4588 - RTP Retransmission Payload Format - IETF Datatracker
    This document describes an RTP payload format for performing retransmissions. Retransmitted RTP packets are sent in a separate stream from the original RTP ...
  28. [28]
  29. [29]
    RFC 5348 - TCP Friendly Rate Control (TFRC): Protocol Specification
    This document specifies TCP Friendly Rate Control (TFRC). TFRC is a congestion control mechanism for unicast flows operating in a best-effort Internet ...
  30. [30]
    RFC 9392 - Sending RTP Control Protocol (RTCP) Feedback for ...
    RFC 9392. Sending RTP Control Protocol (RTCP) Feedback for Congestion Control in Interactive Multimedia Conferences.
  31. [31]
    RFC 3261 - SIP: Session Initiation Protocol - IETF Datatracker
    This document describes Session Initiation Protocol (SIP), an application-layer control (signaling) protocol for creating, modifying, and terminating sessions ...
  32. [32]
    Asterisk config rtp.conf - VoIP-Info
    May 15, 2004 · Asterisk config rtp.conf: Configuration of Asterisk Real Time Protocol, RTP, media channels. RTP is used for SIP communication.
  33. [33]
    Cisco IP Phone 7905G for H.323 Overview
    The H.323 standard includes support for call signaling and control, multimedia transport and control, and bandwidth control for both point-to-point and point-to ...
  34. [34]
    Configure RTP media ports - Poly Documentation Library
    As specified in RFC 1889, RFC 3550, and RFC 3551, the next-highest odd-numbered port sends and receives RTP. Configure SIP RTP for FECC. Configure the SIP ...
  35. [35]
    RFC 2326: Real Time Streaming Protocol (RTSP)
    The Real Time Streaming Protocol, or RTSP, is an application-level protocol for control over the delivery of data with real-time properties.
  36. [36]
    [PDF] IP Telephony Deployment - in Industry History - Cisco
    “Cisco has always made a practice of using its own technology and in 2000, we began migrating our existing PBX systems to a converged voice and data network ...
  37. [37]
    RFC 8827 - WebRTC Security Architecture - IETF Datatracker
    This document defines the security architecture for WebRTC, a protocol suite intended for use with real-time applications that can be deployed in browsers.
  38. [38]
    RFC 8445 - Interactive Connectivity Establishment (ICE)
    This document describes a protocol for Network Address Translator (NAT) traversal for UDP-based communication. This protocol is called Interactive Connectivity ...
  39. [39]
    [PDF] WhatsApp Exposed - Investigative Report - webrtcHacks
    Apr 14, 2015 · It is again followed by an undecodable packet in #147. The RTP data is flowing on this udp port pair for about three seconds until packet.
  40. [40]
    Zoom network firewall or proxy server settings
    To configure your network firewall, please see the following table. The following rules should be applied to outbound traffic.
  41. [41]
    RFC 4571 - Framing Real-time Transport Protocol (RTP) and RTP ...
    This memo defines a method for framing Real-time Transport Protocol (RTP) and RTP Control Protocol (RTCP) packets onto connection-oriented transport (such as ...
  42. [42]
    RFC 1889: RTP: A Transport Protocol for Real-Time Applications
    RTP provides end-to-end network transport functions suitable for applications transmitting real-time data, such as audio, video or simulation data.
  43. [43]
    RFC 768: User Datagram Protocol
  44. [44]
    RFC 1112: Host extensions for IP multicasting
  45. [45]
  46. [46]
    RFC 4566: SDP: Session Description Protocol
    SDP is a protocol for describing multimedia sessions, used for session announcement, invitation, and initiation, providing a standard format for session ...
  47. [47]
    RFC 7714 - AES-GCM Authenticated Encryption in the Secure Real ...
    This document defines how the AES-GCM Authenticated Encryption with Associated Data family of algorithms can be used to provide confidentiality and data ...
  48. [48]
    RFC 8723 - Double Encryption Procedures for the Secure Real ...
    This document defines a cryptographic transform for the Secure Real-time Transport Protocol (SRTP) that uses two separate but related cryptographic operations.
  49. [49]
    RFC 3611 - RTP Control Protocol Extended Reports (RTCP XR)
    RFC 3611 defines the Extended Report (XR) packet for RTCP, conveying information beyond standard RTCP reports, and is signaled via SDP.
  50. [50]
    RFC 5109 - RTP Payload Format for Generic Forward Error Correction
    This document specifies a payload format for generic Forward Error Correction (FEC) for media data encapsulated in RTP.
  51. [51]
    DICOM PS3.22 2025d - Real-Time Communication - NEMA
    The byte stream of the Data Set is placed into the RTP Payload after the DICOM-RTV Meta Information. Each RTP session corresponds to a single SOP Instance.