学好 MP4，让直播更给力 · 音视频开发之路

MP4 实际代表的含义是 `MPEG-4 Part 14`。它只是 `MPEG` 标准中的 14 部分。它主要参考 `ISO/IEC` 标准来制定的。MP4 主要作用是可以实现快进快放，边下载边播放的效果。他是基于 `MOV`，然后发展成自己相关的格式内容。然后和 MP4 相关的文件还有：`3GP`，`M4V` 这两种格式。 MP4 的格式稍微比 FLV 复杂一些，它是通过嵌的方式来实现整个数据的携带。换句话说，它的每一段内容，都可以变成一个对象，如果需要播放的话，只要得到相应的对象即可。 MP4 中最基本的单元就是 Box，它内部是通过一个一个独立的 box 拼接而成的。所以，这里，我们先从 Box 的讲解开始。 > PS：作为一个前端开发，在大部分场合了解 MP4 非但没用，而且有点浪费时间。本文推荐阅读是针对音视频开发感兴趣的同学，特别是从事直播，或者，视频播放器业务相关的开发者。 ## MP4 box MP4 box 可以分为 basic box 和 full box。 * basic box: 主要针对的是相关的基础 box。比如 ftyp,moov 等。 * full box: 主要针对视频源的 media box。这里，再次强调一下，MP4 box 是 MP4 box 的核心。在 decode/encode 过程中，最好把它的基本格式背下来，这样，你写起来会开心很多（经验之谈）。 OK，我们来看一下，Box 的具体结构。 ### basic box 首先来看一下 basic box 的结构： ![](https://img.kancloud.cn/c1/dc/c1dcad280432a15d20ddcb351752f6e3_323x257.png) 如果用代码来表示就是： ~~~javascript aligned(8) class Box (unsigned int(32) boxtype, optional unsigned int(8)[16] extended_type) { unsigned int(32) size; unsigned int(32) type = boxtype; if (size==1) { unsigned int(64) largesize; } else if (size==0) { // box extends to end of file } // 这里针对的是 MP4 extension 的盒子类型。一般不会发生 if (boxtype==‘uuid’) { unsigned int(8)[16] usertype = extended_type; } } ~~~ 上面代码其实已经说的很清楚了。这里，我在简单的阐述一下。 * size\[4B\]: 用来代指该 box 的大小，包括 header 和 body。由于其大小有限制，有可能不满足超大的 box。所以，这里有一个判断逻辑，当 `size===1` 时，会出现一个 8B 的 `largesize` 字段来存放大小。当 `size===0` 时，表示文件的结束。 * type\[4B\]: 用来标识该 box 的类型，其实内容很简单，就是直接取指定盒子的英文字母的 ASCII 码。因为 boxname 的长度只有 4 个字母，所以，只需要通过 `charCodeAt` API 获取 4 次即可。 ~~~javascript // 获得指定 box 的 type 字段内容 val.charCodeAt(0) val.charCodeAt(1) val.charCodeAt(2) val.charCodeAt(3) ~~~ 实际整个盒子的结构可以用下图来表示： ![](https://img.kancloud.cn/7f/f5/7ff5c5ff13fd054c98eef18fad122579_1026x158.png) 这里需要强调的一点就是，在 MP4 中，默认写入字节序都是 Big-Endian 。所以，在上面，涉及到 4B 8B 等字段内容时，都是以 BE 来写入的。上面不是说了，box 有两种基本格式吗？还有一种为 fullBox ### full box full box 和 box 的主要区别是增加了 version 和 flag 字段。它的应用场景不高，主要是在 `trak` box 中使用。它的基本格式为： ~~~javascript aligned(8) class FullBox(unsigned int(32) boxtype, unsigned int(8) v, bit(24) f) extends Box(boxtype) { unsigned int(8) version = v; bit(24) flags = f; } ~~~ 在实操中，如果你的没有针对 version 和 flags 的业务场景，那么基本上就可以直接设为默认值，比如 0x00。它的基本结构图为： ![](https://img.kancloud.cn/44/39/44390ecccb96fa577a14888fa795a880_1036x270.png) 在实际 remux 中，会以 `box` 为最小组合单位，来完成相关的 remux 过程。比如，这里以 JS 来完成最小基本 box 的构造： ~~~javascript MP4.box = function (type) { let boxLength = 8; // include the total 8 byte length of size and type let buffers = Array.prototype.slice.call(arguments, 1); buffers.forEach(val => { boxLength += val.byteLength; }); let boxBuffer = new Uint8Array(boxLength); // the first four byte stands for boxLength boxBuffer[0] = (boxLength >> 24) & 0xff; boxBuffer[1] = (boxLength >> 16) & 0xff; boxBuffer[2] = (boxLength >> 8) & 0xff; boxBuffer[3] = boxLength & 0xff; // the second four byte is box's type boxBuffer.set(type, 4); let offset = 8; // the byteLength of type and size buffers.forEach(val => { boxBuffer.set(val, offset); offset += val.byteLength; }) return boxBuffer;} ~~~ 上述，通常的调用方法为： ~~~javascript // MP4.symbolValue.FTYP 为某一个具体的 BufferMP4.box(MP4.types.ftyp, MP4.symbolValue.FTYP); ~~~ 接下来，我们就要正式的来看一下，MP4 中真正用到的一些 Box 了。这里，我们按照 MP4 box 的划分来进行相关的阐述。先看一张 MP4 给出的结构图： ![](https://img.kancloud.cn/87/77/87774ab04be6b996d45944ae24777edf_603x901.png) 说明一下，我们只讲带星号的 box。其他的因为不是必须 box，我们就选择性的忽略了。不过，里面带星号的 Box 还是挺多的。因为，我们的主要目的是为了生成一个 MP4 文件。一个正常的 MP4 文件的结构并不是所有带星号的 Box 都必须有。正常播放的 MP4 文件其实还可以分为 unfragmented MP4（简写为 MP4）和 fragmented MP4（简写为 FMP4)。那这两者具体有什么区别呢？可以说，完全不同。因为他们本身确定 media stream 播放的方式都是完全不同的模式。 . **MP4 格式** 基本 box 为： ![](https://img.kancloud.cn/63/ae/63ae5b26fc7feaa1ef187c1006fd62cf_410x322.png) 上面这是最基本的 MP4 Box 内容。较完整的为： ![](https://img.kancloud.cn/4d/04/4d04dec2e7a7ad241c745a09cddbc911_198x535.png) MP4 box 根据 trak 中的 stbl 下的 stts stsc 等基本 box 来完成在 mdat box 中的索引。那 FMP4 是啥呢？ * 非标：非标常用于生成单一 trak 的文件。 * ftyp * moov * moof * mdat * 标准：用来生成含有多个 trak 的文件。 * ftyp * moov * mdat 看起来非标还多一个 box。但在具体编解码的时候，标准解码需要更多关注在如何编码 stbl 下的几个子 box--stts,stco,ctts 等盒子。而非标不需要关注 stbl，只需要将本来处于 stbl 的数据直接抽到 moof 中。并且在转换过程中，moof 里面的格式相比 stbl 来说，是非常简单的。所以，这里，我们主要围绕上面两种的标准，来讲解对应的 Box。 ## 标准 MP4 盒子 ### ftyp ftyp 盒子相当于就是该 mp4 的纲领性说明。即，告诉解码器它的基本解码版本，兼容格式。简而言之，就是用来告诉客户端，该 MP4 的使用的解码标准。通常，ftyp 都是放在 MP4 的开头。它的格式为： ~~~javascript aligned(8) class FileTypeBox extends Box(‘ftyp’) { unsigned int(32) major_brand; unsigned int(32) minor_version; unsigned int(32) compatible_brands[]; } ~~~ 上面的字段一律都是放在 `data` 字段中（参考，box 的描述）。 * major\_brand: 因为兼容性一般可以分为推荐兼容性和默认兼容性。这里 major\_brand 就相当于是推荐兼容性。通常，在 Web 中解码，一般而言都是使用 `isom` 这个万金油即可。如果是需要特定的格式，可以自行定义。 * minor\_version: 指最低兼容版本。 * compatible\_brands: 和 major\_brand 类似，通常是针对 MP4 中包含的额外格式，比如，AVC，AAC 等相当于的音视频解码格式。说这么多概念，还不如给代码实在。这里，我们可以来看一下，对于通用 ftyp box 的创建。 ~~~javascript FTYP: new Uint8Array([ 0x69, 0x73, 0x6F, 0x6D, // major_brand: isom 0x0, 0x0, 0x0, 0x1, // minor_version: 0x01 0x69, 0x73, 0x6F, 0x6D, // isom 0x61, 0x76, 0x63, 0x31 // avc1 ]) ~~~ ### moov moov box 主要是作为一个很重要的容器盒子存在的，它本身的实际内容并不重要。moov 主要是存放相关的 trak 。其基本格式为： ~~~javascript aligned(8) class MovieExtendsBox extends Box(‘mvex’){ } ~~~ ### mvhd mvhd 是 moov 下的第一个 box，用来描述 media 的相关信息。其基本内容为： ~~~javascript aligned(8) class MovieHeaderBox extends FullBox(‘mvhd’, version, 0) { if (version==1) { unsigned int(64) creation_time; unsigned int(64) modification_time; unsigned int(32) timescale; unsigned int(64) duration; } else { // version==0 unsigned int(32) creation_time; unsigned int(32) modification_time; unsigned int(32) timescale; unsigned int(32) duration; } template int(32) rate = 0x00010000; // typically 1.0template int(16) volume = 0x0100; // typically, full volumeconst bit(16) reserved = 0; const unsigned int(32)[2] reserved = 0; template int(32)[9] matrix = { 0x00010000,0,0,0,0x00010000,0,0,0,0x40000000 }; // Unity matrix bit(32)[6] pre_defined = 0; unsigned int(32) next_track_ID; } ~~~ * version: 一般默认为 0。 * creation\_time: 创建的时间。从 1904 年开始算起，用秒来表示。 * timescale: 时间比例。通过该值和 duration 来算出实际时间 * duration: 持续时间，单位是根据 timescale 来决定的。实际时间为：duration/timescale = xx 秒。 * rate: 播放比例。 * volume: 音量大小。0x0100 为最大值。 * matrix: 不解释。我也不懂 * next\_track\_ID: 需要比当前 trak\_id 最大值还大才行。一般随便填个很大的值即可。实际上，mvhd 大部分的值，都可以设为固定值： ~~~javascript new Uint8Array([ 0x00, 0x00, 0x00, 0x00, // version(0) + flags 0x00, 0x00, 0x00, 0x00, // creation_time 0x00, 0x00, 0x00, 0x00, // modification_time (timescale >>> 24) & 0xFF, // timescale: 4 bytes (timescale >>> 16) & 0xFF, (timescale >>> 8) & 0xFF, (timescale) & 0xFF, (duration >>> 24) & 0xFF, // duration: 4 bytes (duration >>> 16) & 0xFF, (duration >>> 8) & 0xFF, (duration) & 0xFF, 0x00, 0x01, 0x00, 0x00, // Preferred rate: 1.0 0x01, 0x00, 0x00, 0x00, // PreferredVolume(1.0, 2bytes) + reserved(2bytes) 0x00, 0x00, 0x00, 0x00, // reserved: 4 + 4 bytes 0x00, 0x00, 0x00, 0x00, 0x00, 0x01, 0x00, 0x00, // ----begin composition matrix---- 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x01, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x40, 0x00, 0x00, 0x00, // ----end composition matrix---- 0x00, 0x00, 0x00, 0x00, // ----begin pre_defined 6 * 4 bytes---- 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, // ----end pre_defined 6 * 4 bytes---- 0xFF, 0xFF, 0xFF, 0xFF // next_track_ID ]); ~~~ ***** ## trak trak box 就是主要存放相关 media stream 的内容。其基本格式很简单就是简单的 box： ~~~javascript aligned(8) class TrackBox extends Box(‘trak’) { } ~~~ 不过，有时候里面也可以带上该 media stream 的相关描述： ![](https://img.kancloud.cn/10/dd/10dde2b5be1986a4c7b527cc366ec39e_554x282.png) ### tkhd tkhd 是 trak box 的子一级 box 的内容。主要是用来描述该特定 trak 的相关内容信息。其主要内容为： ~~~javascript aligned(8) class TrackHeaderBox extends FullBox(‘tkhd’, version, flags){ if (version==1) { unsigned int(64) creation_time; unsigned int(64) modification_time; unsigned int(32) track_ID; const unsigned int(32) reserved = 0; unsigned int(64) duration; } else { // version==0 unsigned int(32) creation_time; unsigned int(32) modification_time; unsigned int(32) track_ID; const unsigned int(32) reserved = 0; unsigned int(32) duration; }const unsigned int(32)[2] reserved = 0;template int(16) layer = 0;template int(16) alternate_group = 0;template int(16) volume = {if track_is_audio 0x0100 else 0}; const unsigned int(16) reserved = 0;template int(32)[9] matrix= { 0x00010000,0,0,0,0x00010000,0,0,0,0x40000000 }; // unity matrix unsigned int(32) width; unsigned int(32) height; } ~~~ 上面内容确实挺多的，但是，有些并不是一定需要填一些合法值。这里简单说明一下： * creation\_time: 创建时间，非必须 * modification\_time: 修改时间，非必须 * track\_ID: 指明当前描述的 track ID。 * duration: 当前 track 内容持续的时间。通常结合 timescale 进行相关计算。 * layer: 没啥用。通常用来作为分成 video trak 的使用。 * alternate\_group: 可替换 track 源。如果为 0 表示当前 track 没有指定的 track 源替代。非 0 的话，则表示存在多个源的 group。 * volume: 用来确定音量大小。满音量为 1(0x0100)。 * width and height：确定视频的宽高 ### mdia mdia 主要用来包裹相关的 media 信息。本身没啥说的，格式为： ~~~javascript aligned(8) class MediaBox extends Box(‘mdia’) { } ~~~ ### mdhd mdhd 和 tkhd 来说，内容大致都是一样的。不过，tkhd 通常是对指定的 track 设定相关属性和内容。而 mdhd 是针对于独立的 media 来设置的。不过事实上，两者一般都是一样的。具体格式为： ~~~javascript aligned(8) class MediaHeaderBox extends FullBox(‘mdhd’, version, 0) { if (version==1) { unsigned int(64) creation_time; unsigned int(64) modification_time; unsigned int(32) timescale; unsigned int(64) duration; } else { // version==0 unsigned int(32) creation_time; unsigned int(32) modification_time; unsigned int(32) timescale; unsigned int(32) duration; }bit(1) pad = 0;unsigned int(5)[3] language; // ISO-639-2/T language code unsigned int(16) pre_defined = 0;} ~~~ 里面就有 3 个额外的字段：pad，language，pre\_defined。根据字面意思很好理解： * pad: 占位符，通常为 0 * language: 表明当前 trak 的语言。因为该字段总长为 15bit，通常是和 pad 组合成为 2B 的长度。 * pre\_defined: 默认为 0. 实际代码的计算方式为： ~~~javascript new Uint8Array([ 0x00, 0x00, 0x00, 0x00, // version(0) + flags 0x00, 0x00, 0x00, 0x00, // creation_time 0x00, 0x00, 0x00, 0x00, // modification_time (timescale >>> 24) & 0xFF, // timescale: 4 bytes (timescale >>> 16) & 0xFF, (timescale >>> 8) & 0xFF, (timescale) & 0xFF, (duration >>> 24) & 0xFF, // duration: 4 bytes (duration >>> 16) & 0xFF, (duration >>> 8) & 0xFF, (duration) & 0xFF, 0x55, 0xC4, // language: und (undetermined) 0x00, 0x00 // pre_defined = 0 ]) ~~~ ### hdlr hdlr 是用来设置不同 trak 的处理方式的。常用处理方式如下： * vide : Video track * soun : Audio track * hint : Hint track * meta : Timed Metadata track * auxv : Auxiliary Video track 这个，其实就和我们在得到和接收到资源时，设置的 `Content-Type` 类型字段是一致的，例如 `application/javascript`。其基本格式为： ~~~javascript aligned(8) class HandlerBox extends FullBox(‘hdlr’, version = 0, 0) { unsigned int(32) pre_defined = 0; unsigned int(32) handler_type; const unsigned int(32)[3] reserved = 0; string name; } ~~~ 其中有两字段需要额外说明一下： * handler\_type：是代指具体 trak 的处理类型。也就是我们上面列写的 vide,soun,hint 字段。 * name: 是用来写名字的。其主要不是给机器读的，而是给人读，所以，这里你只要觉得能表述清楚，填啥其实都行。 handler\_type 填的值其实就是 string 转换为 hex 之后得到的值。比如： * vide 为 `0x76, 0x69, 0x64, 0x65` * soun 为 `0x73, 0x6F, 0x75, 0x6E` ### minf minf 是子属内容中，重要的容器 box，用来存放当前 track 的基本描述信息。本身没啥说的，基本格式为： ~~~javascript aligned(8) class MediaInformationBox extends Box(‘minf’) { } ~~~ ### v/smhd v/smhd 是对当前 trak 的描述 box。vmhd 针对的是 video，smhd 针对的是 audio。这两个盒子在解码中，非不可或缺的（有时候得看播放器），缺了的话，有可能会被认为格式不正确。我们先来看一下 vmhd 的基本格式： ~~~javascript aligned(8) class VideoMediaHeaderBox extends FullBox(‘vmhd’, version = 0, 1) { template unsigned int(16) graphicsmode = 0; // copy, see below template unsigned int(16)[3] opcolor = {0, 0, 0}; } ~~~ 这很简单都是一些默认值，我这里就不多说了。 smhd 的格式同样也很简单： ~~~javascript aligned(8) class SoundMediaHeaderBox extends FullBox(‘smhd’, version = 0, 0) { template int(16) balance = 0; const unsigned int(16) reserved = 0; } ~~~ 其中，balance 这个字段相当于和我们通常设置的左声道，右声道有关。 * balance: 该值是一个浮点值，0 为 center，1.0 为 right，-1.0 为 left。 ### dinf dinf 是用来说明在 trak 中，media 描述信息的位置。其实本身就是一个容器，没啥内容： ~~~javascript aligned(8) class DataInformationBox extends Box(‘dinf’) { } ~~~ ### dref dref 是用来设置当前 Box 描述信息的 data\_entry。基本格式为： ~~~javascript aligned(8) class DataReferenceBox extends FullBox(‘dref’, version = 0, 0) { unsigned int(32) entry_count; for (i=1; i <= entry_count; i++) { DataEntryBox(entry_version, entry_flags) data_entry; } } ~~~ 其中的 DataEntryBox 就是 DataEntryUrlBox/DataEntryUrnBox 中的一个。简单来说，就是 dref 下的子 box -- `url` 或者 `urn` 这两个 box。其中，entry\_version 和 entry\_flags 需要额外说明一下。 * entry\_version: 用来指明当前 entry 的格式 * entry\_flags: 其值不是固定的，但是有一个特殊的值, 0x000001 用来表示当前 media 的数据和 moov 包含的数据一致。不过，就通常来说，我真的没有用到过有实际数据的 dref 。所以，这里就不衍生来讲了。 ### url url box 是由 dref 包裹的子一级 box，里面是对不同的 sample 的描述信息。不过，一般都是附带在其它 box 里。其基本格式为： ~~~javascript aligned(8) class DataEntryUrlBox (bit(24) flags) extends FullBox(‘url ’, version = 0, flags) { string location; } ~~~ 实际并没有用到过 `location` 这个字段，所以，一般也就不需要了。 ### stts stts 主要是用来存储 refSampleDelta。即，相邻两帧间隔的时间。它基本格式为： ~~~javascript aligned(8) class TimeToSampleBox extends FullBox(’stts’, version = 0, 0) { unsigned int(32) entry_count; int i; for (i=0; i < entry_count; i++) { unsigned int(32) sample_count; unsigned int(32) sample_delta; } } ~~~ 看代码其实看不出什么，我们结合实际抓包结果，来讲解。现有如下的帧： ![](https://img.kancloud.cn/28/77/287774b92a6758b90cb946d90e5ac6bf_1128x328.png) 可以看到，上面的 Decode delta 值都是 10。这就对应着 sample\_delta 的值。而 sample\_count 就对应出现几次的 sample\_delta。比如，上面 10 的 delta 出现了 14 次，那么 sample\_count 就是 14。如果对应于 RTMP 中的 Video Msg，那么 sample\_delta 就是当前 RTMP Header 中，后面一个的 timeStamp delta。 ### stco stco 是 stbl 包里面一个非常关键的 Box。它用来定义每一个 sample 在 mdat 具体的位置。基本格式为： ~~~javascript aligned(8) class ChunkOffsetBox extends FullBox(‘stco’, version = 0, 0) { unsigned int(32) entry_count;for (i=1; i u entry_count; i++) { unsigned int(32) chunk_offset; } } ~~~ 具体可以参考： ![](https://img.kancloud.cn/6f/05/6f05a865858eec227f01ed1f4616fde0_772x208.png) stco 有两种形式，如果你的视频过大的话，就有可能造成 chunkoffset 超过 32bit 的限制。所以，这里针对大 Video 额外创建了一个 co64 的 Box。它的功效等价于 stco，也是用来表示 sample 在 `mdat` box 中的位置。只是，里面 chunk\_offset 是 64bit 的。 ~~~javascript aligned(8) class ChunkLargeOffsetBox extends FullBox(‘co64’, version = 0, 0) { unsigned int(32) entry_count;for (i=1; i u entry_count; i++) { unsigned int(64) chunk_offset; } } ~~~ ### stsc stsc 这个 Box 有点绕，并不是它的字段多，而是它的字段意思有点奇怪。其基本格式为： ~~~javascript aligned(8) class SampleToChunkBox extends FullBox(‘stsc’, version = 0, 0) { unsigned int(32) entry_count; for (i=1; i u entry_count; i++) { unsigned int(32) first_chunk; unsigned int(32) samples_per_chunk; unsigned int(32) sample_description_index; } } ~~~ 关键点在于他们里面的三个字段: `first_chunk`,`samples_per_chunk`,`sample_description_index`。 * first\_chunk: 每一个 entry 开始的 chunk 位置。 * samples\_per\_chunk: 每一个 chunk 里面包含多少的 sample * sample\_description\_index: 每一个 sample 的描述。一般可以默认设置为 1。这 3 个字段实际上决定了一个 MP4 中有多少个 chunks，每个 chunks 有多少个 samples。这里顺便普及一下 chunk 和 sample 的相关概念。在 MP4 文件中，最小的基本单位是 `Chunk` 而不是 `Sample`。 * sample: 包含最小单元数据的 slice。里面有实际的 NAL 数据。 * chunk: 里面包含的是一个一个的 sample。为了是优化数据的读取，让 I/O 更有效率。看了上面字段就懂得，感觉你要么是大牛，要么就是在装逼。官方文档和上面一样的描述，但是，看了一遍后，懵逼，再看一遍后，懵逼。所以，这里为了大家更好的理解，这里额外再补充一下。前面说了，在 MP4 中最小的单位是 chunks，那么通过 stco 中定义的 chunk\_offsets 字段，它描述的就是 chunks 在 mdat 中的位置。每一个 stco chunk\_offset 就对应于某一个 index 的 chunks。那么，`first_chunk` 就是用来定义该 chunk entry 开始的位置。那这样的话，stsc 需要对每一个 chunk 进行定义吗？不需要，因为 stsc 是定义一整个 entry，即，如果他们的 `samples_per_chunk`，`sample_description_index` 不变的话，那么后续的 chunks 都是用一样的模式。即，如果你的 stsc 只有： * first\_chunk: 1 * samples\_per\_chunk: 4 * sample\_description\_index: 1 也就是说，从第一个 chunk 开始，每通过切分 4 个 sample 划分为一个 chunk，并且每个 sample 的表述信息都是 1。它会按照这样划分方法一直持续到最后。当然，如果你的 sample 最后不能被 4 整除，最后的几段 sample 就会当做特例进行处理。通常情况下，stsc 的值是不一样的： ![](https://img.kancloud.cn/3f/fd/3ffde58fd6aea83bbc99865838708b7d_648x156.png) 按照上面的情况就是，第 1 个 chunk 包含 2 个 samples。第 2-4 个 chunk 包含 1 个 sample，第 5 个 chunk 包含两个 chunk，第 6 个到最后一个 chunk 包含一个 sample。 ### ctts ctts 主要针对 Video 中的 B 帧来确定的。也就是说，如果你视频里面没有 B 帧，那么，ctts 的结构就很简单了。它主要的作用，是用来记录每一个 sample 里面的 cts。格式为： ~~~javascript aligned(8) class CompositionOffsetBox extends FullBox(‘ctts’, version = 0, 0) { unsigned int(32) entry_count; int i; for (i=0; i < entry_count; i++) { unsigned int(32) sample_count; unsigned int(32) sample_offset; } } ~~~ 还是看实例吧，假如你视频中帧的排列如下： ![](https://img.kancloud.cn/28/77/287774b92a6758b90cb946d90e5ac6bf_1128x328.png) 其中，sample\_offset 就是 Composition offset。通过合并一致的 Composition offset，可以得到对应的 sample\_count。最终 ctts 的结果为： ![](https://img.kancloud.cn/6d/92/6d92e4e3f3bb00891cd365bf25765f5a_248x316.png) 看实例抓包的结果为： ![](https://img.kancloud.cn/41/df/41df91360a58756787663a719d77f47b_670x344.png) 如果，你是针对 RTMP 的 video，由于，其没有 B 帧，那么 ctts 的整个结果，就只有一个 sample\_count 和 sample\_offset。比如： ~~~javascript sample_count: 100sample_offset: 0 ~~~ 通常只有 video track 才需要 ctts。 ### stsz stsz 是用来存放每一个 sample 的 size 信息的。基本格式为： ~~~javascript aligned(8) class SampleSizeBox extends FullBox(‘stsz’, version = 0, 0) { unsigned int(32) sample_size; unsigned int(32) sample_count;if (sample_size==0) { for (i=1; i <= sample_count; i++) { unsigned int(32) entry_size; } } } ~~~ 这个没啥说的，就是所有 sample 的 size 大小，以及相应的描述信息。 ![](https://img.kancloud.cn/76/a9/76a99e161d2ae64e63a69fce6ace0772_476x386.png) ## fragmented MP4 前面部分是标准 box 的所有内容。当然，fMP4 里面大部分内容和 MP4 标准格式有很多重复的地方，剩下的就不过多赘述，只把不同的单独挑出来讲解。 ### mvex mvex 是 fMP4 的标准盒子。它的作用是告诉解码器这是一个 fMP4 的文件，具体的 samples 信息内容不再放到 trak 里面，而是在每一个 moof 中。基本格式为： ~~~javascript aligned(8) class MovieExtendsBox extends Box(‘mvex’){ } ~~~ ### trex trex 是 mvex 的子一级 box 用来给 fMP4 的 sample 设置默认值。基本内容为： ~~~javascript aligned(8) class TrackExtendsBox extends FullBox(‘trex’, 0, 0){ unsigned int(32) track_ID; unsigned int(32) default_sample_description_index; unsigned int(32) default_sample_duration; unsigned int(32) default_sample_size; unsigned int(32) default_sample_flags } ~~~ 具体设哪一个值，这得看你业务里面具体的要求才行。如果实在不知道，那就可以直接设置为 0： ~~~javascript new Uint8Array([ 0x00, 0x00, 0x00, 0x00, // version(0) + flags (trackId >>> 24) & 0xFF, // track_ID (trackId >>> 16) & 0xFF, (trackId >>> 8) & 0xFF, (trackId) & 0xFF, 0x00, 0x00, 0x00, 0x01, // default_sample_description_index 0x00, 0x00, 0x00, 0x00, // default_sample_duration 0x00, 0x00, 0x00, 0x00, // default_sample_size 0x00, 0x01, 0x00, 0x01 // default_sample_flags ]) ~~~ ### moof moof 主要是用来存放 FMP4 的相关内容。它本身没啥太多的内容： ~~~javascript aligned(8) class TrackFragmentBox extends Box(‘traf’){ } ~~~ ### tfhd tfhd 主要是对指定的 trak 进行相关的默认设置。例如：sample 的时长，大小，偏移量等。不过，这些都可以忽略不设，只要你在其它 box 里面设置完整即可： ~~~javascript aligned(8) class TrackFragmentHeaderBox extends FullBox(‘tfhd’, 0, tf_flags){ unsigned int(32) track_ID;// all the following are optional fields unsigned int(64) base_data_offset; unsigned int(32) sample_description_index; unsigned int(32) default_sample_duration; unsigned int(32) default_sample_size; unsigned int(32) default_sample_flags } ~~~ base\_data\_offset 是用来计算后面数据偏移量用到的。如果存在则会用上，否则直接是相关开头的偏移。 ### tfdt tfdt 主要是用来存放相关 sample 编码的绝对时间的。因为 FMP4 是流式的格式，所以，不像 MP4 一样可以直接根据 sample 直接 seek 到具体位置。这里就需要一个标准时间参考，来快速定位都某个具体的 fragment。它的基本格式为： ~~~javascript aligned(8) class TrackFragmentBaseMediaDecodeTimeBox extends FullBox(‘tfdt’, version, 0) {if (version==1) { unsigned int(64) baseMediaDecodeTime; } else { // version==0 unsigned int(32) baseMediaDecodeTime; } } ~~~ baseMediaDecodeTime 基本值是前面所有指定 trak\_id 中 samples 持续时长的总和，相当于就是当前 traf 里面第一个 sample 的 dts 值。 ### trun trun 存储该 moof 里面相关的 sample 内容。例如，每个 sample 的 size，duration，offset 等。基本内容为： ~~~javascript aligned(8) class TrackRunBox extends FullBox(‘trun’, version, tr_flags) {unsigned int(32) sample_count;// the following are optional fieldssigned int(32) data_offset;unsigned int(32) first_sample_flags;// all fields in the following array are optional { unsigned int(32) sample_duration; unsigned int(32) sample_size; unsigned int(32) sample_flags if (version == 0) { unsigned int(32) sample_composition_time_offset else { signed int(32) sample_composition_time_offset }[ sample_count ] } ~~~ 可以说，trun 上面的字段是 traf 里面最重要的标识字段： tr\_flags 是用来表示下列 sample 相关的标识符是否应用到每个字段中： * 0x000001: data-offset-present，只应用 data-offset * 0x000004: 只对第一个 sample 应用对应的 flags。剩余 sample flags 就不管了。 * 0x000100: 这个比较重要，表示每个 sample 都有自己的 duration，否则使用默认的 * 0x000200: 每个 sample 有自己的 sample\_size，否则使用默认的。 * 0x000400: 对每个 sample 使用自己的 flags。否则，使用默认的。 * 0x000800: 每个 sample 都有自己的 cts 值后面字段，我们这简单介绍一下。 * data\_offset: 用来表示和该 moof 配套的 mdat 中实际数据内容距 moof 开头有多少 byte。相当于就是 moof.byteLength + mdat.headerSize。 * sample\_count: 一共有多少个 sample * first\_sample\_flags: 主要针对第一个 sample。一般来说，都可以默认设为 0。后面的几个字段，我就不赘述了，对了，里面的 sample\_flags 是一个非常重要的东西，常常用它来表示，到底哪一个 sampel 是对应的 keyFrame。基本计算方法为： ~~~javascript (flags.isLeading << 2) | flags.dependsOn, // sample_flags(flags.isDepended << 6) | (flags.hasRedundancy << 4) | flags.isNonSync ~~~ ### sdtp sdtp 主要是用来描述具体某个 sample 是否是 I 帧，是否是 leading frame 等相关属性值，主要用来作为当进行点播回放时的同步参考信息。其内容一共有 4 个： * is\_leading：是否是开头部分。 * 0: 当前 sample 的 leading 属性未知（经常用到） * 1: 当前 sample 是 leading sample，并且不能被 decoded * 2: 当前 sample 并不是 leading sample。 * 3: 当前 sample 是 leading sample，并且能被 decoded * sample\_depends\_on：是否是 I 帧。 * 0: 该 sample 不知道是否依赖其他帧 * 1: 该 sample 是 B/P 帧 * 2: 该 sample 是 I 帧。 * 3: 保留字 * sample\_is\_depended\_on: 该帧是否被依赖 * 0: 不知道是否被依赖，特指（B/P） * 1: 被依赖，特指 I 帧 * 3: 保留字 * sample\_has\_redundancy: 是否有冗余编码 * 0: 不知道是否有冗余 * 1: 有冗余编码 * 2: 没有冗余编码 * 3: 保留字整个基本格式为： ~~~javascript aligned(8) class SampleDependencyTypeBox extends FullBox(‘sdtp’, version = 0, 0) { for (i=0; i < sample_count; i++){ unsigned int(2) is_leading; unsigned int(2) sample_depends_on; unsigned int(2) sample_is_depended_on; unsigned int(2) sample_has_redundancy; } } ~~~ sdtp 对于 video 来说很重要，因为，其内容字段主要就是给 video 相关的帧设计的。而 audio，一般直接采用默认值： ~~~javascript isLeading: 0,dependsOn: 1, isDepended: 0,hasRedundancy: 0 ~~~ 到这里，整个 MP4 和 fMP4 的内容就已经介绍完了。更详细的内容可以参考 MP4 & FMP4 doc。当然，这里只是非常皮毛的一部分，仅仅知道 box 的内容，并不足够来做一些音视频处理。更多的是关于音视频的基础知识，比如，dts/pts、音视频同步、视频盒子的封装等等