| Snappy framing format description |
| Last revised: 2013-10-25 |
| |
| This format decribes a framing format for Snappy, allowing compressing to |
| files or streams that can then more easily be decompressed without having |
| to hold the entire stream in memory. It also provides data checksums to |
| help verify integrity. It does not provide metadata checksums, so it does |
| not protect against e.g. all forms of truncations. |
| |
| Implementation of the framing format is optional for Snappy compressors and |
| decompressor; it is not part of the Snappy core specification. |
| |
| |
| 1. General structure |
| |
| The file consists solely of chunks, lying back-to-back with no padding |
| in between. Each chunk consists first a single byte of chunk identifier, |
| then a three-byte little-endian length of the chunk in bytes (from 0 to |
| 16777215, inclusive), and then the data if any. The four bytes of chunk |
| header is not counted in the data length. |
| |
| The different chunk types are listed below. The first chunk must always |
| be the stream identifier chunk (see section 4.1, below). The stream |
| ends when the file ends -- there is no explicit end-of-file marker. |
| |
| |
| 2. File type identification |
| |
| The following identifiers for this format are recommended where appropriate. |
| However, note that none have been registered officially, so this is only to |
| be taken as a guideline. We use "Snappy framed" to distinguish between this |
| format and raw Snappy data. |
| |
| File extension: .sz |
| MIME type: application/x-snappy-framed |
| HTTP Content-Encoding: x-snappy-framed |
| |
| |
| 3. Checksum format |
| |
| Some chunks have data protected by a checksum (the ones that do will say so |
| explicitly). The checksums are always masked CRC-32Cs. |
| |
| A description of CRC-32C can be found in RFC 3720, section 12.1, with |
| examples in section B.4. |
| |
| Checksums are not stored directly, but masked, as checksumming data and |
| then its own checksum can be problematic. The masking is the same as used |
| in Apache Hadoop: Rotate the checksum by 15 bits, then add the constant |
| 0xa282ead8 (using wraparound as normal for unsigned integers). This is |
| equivalent to the following C code: |
| |
| uint32_t mask_checksum(uint32_t x) { |
| return ((x >> 15) | (x << 17)) + 0xa282ead8; |
| } |
| |
| Note that the masking is reversible. |
| |
| The checksum is always stored as a four bytes long integer, in little-endian. |
| |
| |
| 4. Chunk types |
| |
| The currently supported chunk types are described below. The list may |
| be extended in the future. |
| |
| |
| 4.1. Stream identifier (chunk type 0xff) |
| |
| The stream identifier is always the first element in the stream. |
| It is exactly six bytes long and contains "sNaPpY" in ASCII. This means that |
| a valid Snappy framed stream always starts with the bytes |
| |
| 0xff 0x06 0x00 0x00 0x73 0x4e 0x61 0x50 0x70 0x59 |
| |
| The stream identifier chunk can come multiple times in the stream besides |
| the first; if such a chunk shows up, it should simply be ignored, assuming |
| it has the right length and contents. This allows for easy concatenation of |
| compressed files without the need for re-framing. |
| |
| |
| 4.2. Compressed data (chunk type 0x00) |
| |
| Compressed data chunks contain a normal Snappy compressed bitstream; |
| see the compressed format specification. The compressed data is preceded by |
| the CRC-32C (see section 3) of the _uncompressed_ data. |
| |
| Note that the data portion of the chunk, i.e., the compressed contents, |
| can be at most 16777211 bytes (2^24 - 1, minus the checksum). |
| However, we place an additional restriction that the uncompressed data |
| in a chunk must be no longer than 65536 bytes. This allows consumers to |
| easily use small fixed-size buffers. |
| |
| |
| 4.3. Uncompressed data (chunk type 0x01) |
| |
| Uncompressed data chunks allow a compressor to send uncompressed, |
| raw data; this is useful if, for instance, uncompressible or |
| near-incompressible data is detected, and faster decompression is desired. |
| |
| As in the compressed chunks, the data is preceded by its own masked |
| CRC-32C (see section 3). |
| |
| An uncompressed data chunk, like compressed data chunks, should contain |
| no more than 65536 data bytes, so the maximum legal chunk length with the |
| checksum is 65540. |
| |
| |
| 4.4. Padding (chunk type 0xfe) |
| |
| Padding chunks allow a compressor to increase the size of the data stream |
| so that it complies with external demands, e.g. that the total number of |
| bytes is a multiple of some value. |
| |
| All bytes of the padding chunk, except the chunk byte itself and the length, |
| should be zero, but decompressors must not try to interpret or verify the |
| padding data in any way. |
| |
| |
| 4.5. Reserved unskippable chunks (chunk types 0x02-0x7f) |
| |
| These are reserved for future expansion. A decoder that sees such a chunk |
| should immediately return an error, as it must assume it cannot decode the |
| stream correctly. |
| |
| Future versions of this specification may define meanings for these chunks. |
| |
| |
| 4.6. Reserved skippable chunks (chunk types 0x80-0xfd) |
| |
| These are also reserved for future expansion, but unlike the chunks |
| described in 4.5, a decoder seeing these must skip them and continue |
| decoding. |
| |
| Future versions of this specification may define meanings for these chunks. |