# JPEG XL Format Overview
This document gives an overview of the JPEG XL file format and codestream,
its features, and the underlying design rationale.
The aim of this document is to provide general insight into the
format capabilities and design, thus helping developers
better understand how to use the `libjxl` API.
## Codestream and File Format
The JPEG XL format is defined in ISO/IEC 18181. This standard consists of
four parts:
* 18181-1: Core codestream
* 18181-2: File format
* 18181-3: Conformance testing
* 18181-4: Reference implementation
### Core codestream
The core codestream contains all the data necessary to decode and display
still images or animations: the pixel data itself, plus basic information
such as the image dimensions, colorspace, orientation, and upsampling.
### File format
The JPEG XL file format can take two forms:
* A 'naked' codestream. In this case, only the image/animation data itself is
stored, and no additional metadata can be included. Such a file starts with the
bytes `0xFF0A` (the JPEG marker for "start of JPEG XL codestream").
* An ISOBMFF-based container. This is a box-based container that includes a
JPEG XL codestream box (`jxlc`), and can optionally include other boxes with
additional information, such as Exif metadata. In this case, the file starts with
the bytes `0x0000000C 4A584C20 0D0A870A`.
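
As a rough illustration of telling these two forms apart, the sketch below only
inspects the first bytes of a file (the `libjxl` decoding API also offers a
signature check, which a real application should prefer):

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Distinguishes a 'naked' JPEG XL codestream from an ISOBMFF container by
// looking at the first bytes of the file. Illustrative sketch only.
enum class JxlFileKind { kUnknown, kNakedCodestream, kContainer };

JxlFileKind DetectJxlKind(const std::vector<uint8_t>& bytes) {
  // Naked codestream: starts with the two bytes 0xFF 0x0A.
  if (bytes.size() >= 2 && bytes[0] == 0xFF && bytes[1] == 0x0A) {
    return JxlFileKind::kNakedCodestream;
  }
  // Container: starts with the 12-byte signature box
  // 0x0000000C 'JXL ' 0x0D0A870A.
  static const uint8_t kContainerSig[12] = {0x00, 0x00, 0x00, 0x0C, 'J',  'X',
                                            'L',  ' ',  0x0D, 0x0A, 0x87, 0x0A};
  if (bytes.size() >= 12 &&
      std::equal(std::begin(kContainerSig), std::end(kContainerSig),
                 bytes.begin())) {
    return JxlFileKind::kContainer;
  }
  return JxlFileKind::kUnknown;
}
```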
### Conformance testing
This part of the standard defines precision bounds and test cases for conforming
decoders, to verify that they implement all coding tools correctly and accurately.
### Reference implementation
The `libjxl` software is the reference implementation of JPEG XL.
## Metadata versus Image Data
JPEG XL makes a clear separation between metadata and image data.
Everything that is needed to correctly display an image is
considered to be image data, and is part of the core codestream. This includes
elements that have traditionally been considered 'metadata', such as ICC profiles
and Exif orientation. The goal is to reduce the ambiguity and potential for
incorrect implementations that can be caused by having a 'black box' codestream
that only contains numerical pixel data, requiring applications to figure out how
to correctly interpret the data (i.e. apply color transforms, upsampling,
orientation, blending, cropping, etc.). By including this functionality in the
codestream itself, the decoder can provide output in a normalized way
(e.g. in RGBA, orientation already applied, frames blended and coalesced),
which simplifies applications and makes them less error-prone.
The remaining metadata, e.g. Exif or XMP, can be stored in the container format,
but it does not influence image rendering. In the case of Exif orientation,
this field has to be ignored by applications, since the orientation in the
codestream always takes precedence (and will already have been applied
transparently by the decoder). This means that stripping metadata can be done
without affecting the displayed image.
## Codestream Features
### Color Management
In JPEG XL, images always have a fully defined colorspace, i.e. it is always
unambiguous how to interpret the pixel values. There are two options:
* Pixel data is in a specified (non-XYB) colorspace, and the decoder will produce
a pixel buffer in this colorspace plus an ICC profile that describes that
colorspace. Mathematically lossless encoding can only use this option.
* Pixel data is in the XYB colorspace, which is an absolute colorspace.
In this case, the decoder can produce a pixel buffer directly in a desired
display space like sRGB, Display-P3 or Rec.2100 PQ.
The image header always contains a colorspace; however, its meaning depends on
which of the above two options was used:
* In the first case (non-XYB), the signaled colorspace defines the
interpretation of the pixel data.
* In the second case (XYB), the signaled colorspace is merely a _suggestion_
of a target colorspace to represent the image in, i.e. it is the colorspace
the original image was in, which is assumed to have a sufficiently wide gamut
and a suitable transfer curve to represent the image data with high fidelity
at a limited bit depth.
Colorspaces can be signaled in two ways in JPEG XL:
* CICP-style enum values: This is a very compact representation that
covers the most common colorspaces. The decoder can convert
XYB to any of these colorspaces without requiring an external color management
library.
* ICC profiles: Arbitrary ICC profiles can also be used, including
CMYK ones. The ICC profile data gets compressed. In this case, external
color management software (e.g. lcms2 or skcms) has to be used for color
conversions.
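
In practice this means a decoding application has two paths. The sketch below
only shows that branch; the types are placeholders standing in for whatever the
decoder's color API and an external CMS would provide, and none of the names
are actual `libjxl` identifiers (the real C API exposes a similar, richer
struct for enum-signaled colorspaces):

```cpp
#include <cstdint>
#include <cstdio>
#include <optional>
#include <vector>

// Placeholder for a CICP-style enum description of a colorspace.
struct EnumColorEncoding {
  int color_space, white_point, primaries, transfer_function;
};

// Placeholder for what a decoder reports about the image's colorspace.
struct ImageColorInfo {
  std::optional<EnumColorEncoding> enum_encoding;  // present for enum signaling
  std::vector<uint8_t> icc_profile;                // always available as a fallback
};

void DescribeColorHandling(const ImageColorInfo& info) {
  if (info.enum_encoding) {
    // Enum-signaled colorspace: the decoder can convert XYB to it directly,
    // without an external color management library.
    std::printf("enum-signaled colorspace; decoder converts directly\n");
  } else {
    // Arbitrary (possibly CMYK) ICC profile: hand it to a CMS such as
    // lcms2 or skcms to perform conversions.
    std::printf("ICC-signaled colorspace; use a CMS (%zu-byte profile)\n",
                info.icc_profile.size());
  }
}
```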
### Frames
A JPEG XL codestream contains one or more frames. In the case of animation,
these frames have a duration and can be looped (infinitely or a number of times).
Zero-duration frames are possible and represent different layers of the image.
Frames can have a blendmode (Replace, Add, Alpha-blend, Multiply, etc.) and
they can use any previous frame as a base.
They can be smaller than the image canvas, in which case the pixels outside the
crop are copied from the base frame. They can be positioned at an arbitrary
offset from the image canvas; this offset can be negative, and frames can
be larger than the image canvas, in which case parts of the frame are
invisible and only the intersection with the image canvas is shown.
By default, the decoder will blend and coalesce frames, producing only a single
output frame when there are consecutive zero-duration frames, and all output frames
are of the same size (the size of the image canvas) and have either no duration
(in case of a still image) or a non-zero duration (in case of animation).
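
As a conceptual sketch of what such blending involves, here is a simple
non-premultiplied alpha blend of a frame onto its base at an offset, restricted
to the intersection with the canvas. This is only an illustration, not libjxl's
implementation, and it ignores the other blend modes:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Minimal RGBA image: float samples in [0, 1], row-major, 4 samples per pixel.
struct Image {
  int width = 0, height = 0;
  std::vector<float> rgba;
  float* Pixel(int x, int y) {
    return &rgba[4 * (static_cast<size_t>(y) * width + x)];
  }
  const float* Pixel(int x, int y) const {
    return &rgba[4 * (static_cast<size_t>(y) * width + x)];
  }
};

// Alpha-blends `frame` onto `canvas` at offset (x0, y0); the offset may be
// negative and the frame may extend past the canvas, so only the
// intersection is touched.
void AlphaBlendFrame(Image& canvas, const Image& frame, int x0, int y0) {
  const int xbegin = std::max(0, x0), ybegin = std::max(0, y0);
  const int xend = std::min(canvas.width, x0 + frame.width);
  const int yend = std::min(canvas.height, y0 + frame.height);
  for (int y = ybegin; y < yend; ++y) {
    for (int x = xbegin; x < xend; ++x) {
      const float* src = frame.Pixel(x - x0, y - y0);
      float* dst = canvas.Pixel(x, y);
      const float a = src[3];
      for (int c = 0; c < 3; ++c) dst[c] = a * src[c] + (1.0f - a) * dst[c];
      dst[3] = a + (1.0f - a) * dst[3];
    }
  }
}
```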
### Pixel Data
Every frame contains pixel data encoded in one of two modes:
* VarDCT mode: In this mode, variable-sized DCT transforms are applied
and the image data is encoded in the form of DCT coefficients. This mode is
always lossy, but it can also be used to losslessly represent an existing
(already lossy) JPEG image, in which case only the 8x8 DCT is used.
* Modular mode: In this mode, only integer arithmetic is used, which
enables lossless compression. However, this mode can also be used for lossy
compression. Multiple transformations can be used to improve compression or to
obtain other desirable effects: reversible color transforms (RCTs),
(delta) palette transforms, and a modified non-linear Haar transform
called Squeeze, which facilitates (but does not require) lossy compression
and enables progressive decoding.
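
The reversible color transforms rely on integer lifting steps that are exactly
invertible. As a hedged illustration of the idea, here is a YCoCg-R-style
forward/inverse pair; it shows the principle only, and is not necessarily one
of the exact RCT variants defined by the codestream:

```cpp
#include <cstdint>

// A YCoCg-R style reversible color transform: integer lifting steps make the
// transform exactly invertible, which is what Modular-mode RCTs rely on.
struct YCoCg { int32_t y, co, cg; };

YCoCg ForwardRct(int32_t r, int32_t g, int32_t b) {
  const int32_t co = r - b;
  const int32_t tmp = b + (co >> 1);
  const int32_t cg = g - tmp;
  const int32_t y = tmp + (cg >> 1);
  return {y, co, cg};
}

void InverseRct(const YCoCg& c, int32_t& r, int32_t& g, int32_t& b) {
  const int32_t tmp = c.y - (c.cg >> 1);
  g = c.cg + tmp;
  b = tmp - (c.co >> 1);
  r = b + c.co;
}
```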
Internally, the VarDCT mode uses Modular sub-bitstreams to encode
various auxiliary images, such as the "LF image" (a 1:8 downscaled version
of the image, containing the DC coefficients of the 8x8 DCTs and the
low-frequency coefficients of the larger DCT transforms), extra channels besides the
three color channels (e.g. alpha), and weights for adaptive quantization.
In addition, both modes can separately encode additional 'image features' that
are rendered on top of the decoded image:
* Patches: rectangles from a previously decoded frame (which can be a
'hidden' frame that is not displayed but only stored to be referenced later)
can be blended using one of the blendmodes on top of the current frame.
This allows the encoder to identify repeating patterns (such as letters of
text) and encode them only once, using patches to insert the pattern in
multiple spots. These patterns are encoded in a previous frame, making
it possible to add Modular-encoded pixels to a VarDCT-encoded frame or
vice versa.
* Splines: centripetal Catmull-Rom splines can be encoded, with a color
and a thickness that can vary along the arc length of the curve (see the
sketch after this list).
Although the current encoder does not use this bitstream feature yet, we
anticipate that it can be useful to complement DCT-encoded data, since
thin lines are hard to represent faithfully using the DCT.
* Noise: luma-modulated synthetic noise can be added to an image, e.g.
to emulate photon noise, in a way that avoids poor compression due to
high frequency DCT coefficients.
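
As a sketch of how a centripetal Catmull-Rom curve is evaluated, here is the
standard Barry-Goldman formulation with alpha = 0.5; the color and thickness
signaling of the actual codestream feature is not modeled here:

```cpp
#include <cmath>

struct Point { double x, y; };

// Evaluates a centripetal Catmull-Rom segment between p1 and p2 (t in [0, 1])
// using the Barry-Goldman pyramid with alpha = 0.5. Assumes distinct control
// points. Sketch only.
Point CatmullRom(Point p0, Point p1, Point p2, Point p3, double t) {
  auto knot = [](double prev, Point a, Point b) {
    const double dx = b.x - a.x, dy = b.y - a.y;
    return prev + std::sqrt(std::sqrt(dx * dx + dy * dy));  // |b - a|^0.5
  };
  const double t0 = 0.0;
  const double t1 = knot(t0, p0, p1);
  const double t2 = knot(t1, p1, p2);
  const double t3 = knot(t2, p2, p3);
  const double u = t1 + t * (t2 - t1);
  auto lerp = [](Point a, Point b, double ta, double tb, double uu) {
    const double w = (uu - ta) / (tb - ta);
    return Point{a.x + w * (b.x - a.x), a.y + w * (b.y - a.y)};
  };
  const Point a1 = lerp(p0, p1, t0, t1, u);
  const Point a2 = lerp(p1, p2, t1, t2, u);
  const Point a3 = lerp(p2, p3, t2, t3, u);
  const Point b1 = lerp(a1, a2, t0, t2, u);
  const Point b2 = lerp(a2, a3, t1, t3, u);
  return lerp(b1, b2, t1, t2, u);
}
```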
Finally, both modes can also optionally apply two filtering methods to
the decoded image, which both have the goal of reducing block artifacts
and ringing:
* Gabor-like transform ('Gaborish'): a small (3x3) blur that gets
applied across block and group boundaries, reducing blockiness. The
encoder applies the inverse sharpening transform before encoding,
effectively getting the benefits of lapped transforms without the
disadvantages.
* Edge-preserving filter ('EPF'): similar to a bilateral filter,
this smoothing filter avoids blurring edges while reducing ringing.
The strength of this filter is signaled and can locally be adapted.
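
As a rough sketch of the shape of such a filtering step, here is a generic
symmetric 3x3 blur. The weights are placeholders rather than the kernel libjxl
actually uses, and the real filters additionally handle block/group boundaries
and the signaled, locally adapted strength:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Generic symmetric 3x3 blur over a single channel (row-major, w*h samples).
// `w_side` and `w_diag` weight the 4-neighbours and diagonal neighbours;
// these values are illustrative placeholders, not libjxl's kernel.
std::vector<float> Blur3x3(const std::vector<float>& in, int w, int h,
                           float w_side, float w_diag) {
  std::vector<float> out(in.size());
  const float norm = 1.0f / (1.0f + 4.0f * w_side + 4.0f * w_diag);
  auto at = [&](int x, int y) {
    x = std::clamp(x, 0, w - 1);  // clamp to the border
    y = std::clamp(y, 0, h - 1);
    return in[static_cast<size_t>(y) * w + x];
  };
  for (int y = 0; y < h; ++y) {
    for (int x = 0; x < w; ++x) {
      const float side = at(x - 1, y) + at(x + 1, y) + at(x, y - 1) + at(x, y + 1);
      const float diag = at(x - 1, y - 1) + at(x + 1, y - 1) +
                         at(x - 1, y + 1) + at(x + 1, y + 1);
      out[static_cast<size_t>(y) * w + x] =
          norm * (at(x, y) + w_side * side + w_diag * diag);
    }
  }
  return out;
}
```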
### Groups
In both modes (Modular and VarDCT), the frame data is signaled as
a sequence of groups. These groups can be decoded independently,
and the frame header contains a table of contents (TOC) with bitstream
offsets for the start of each group. This enables parallel decoding,
and also partial decoding of a region of interest or a progressive preview.
In VarDCT mode, all groups have dimensions 256x256 (or smaller at the
right and bottom borders). First the LF image is encoded, also in
256x256 groups (each corresponding to 2048x2048 image pixels, since the
LF image is at 1:8 scale). This means there is always a basic
progressive preview available in VarDCT mode.
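
For concreteness, here is a small worked example of this group bookkeeping
(the image size is arbitrary):

```cpp
#include <cstdio>

int DivCeil(int a, int b) { return (a + b - 1) / b; }

int main() {
  const int w = 3840, h = 2160;  // example image size
  // HF groups cover 256x256 pixels; LF groups cover 256x256 LF samples,
  // i.e. 2048x2048 image pixels, since the LF image is at 1:8 scale.
  const int hf_groups = DivCeil(w, 256) * DivCeil(h, 256);    // 15 * 9 = 135
  const int lf_groups = DivCeil(w, 2048) * DivCeil(h, 2048);  // 2 * 2  = 4
  std::printf("LF image: %dx%d, LF groups: %d, HF groups: %d\n",
              DivCeil(w, 8), DivCeil(h, 8), lf_groups, hf_groups);
  return 0;
}
```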
Optionally, the LF image can be encoded separately in a (hidden)
LF frame, which can itself recursively be encoded in VarDCT mode
and have its own LF frame. This makes it possible to represent huge
images while still having an overall preview that can be efficiently
decoded.
Then the HF groups are encoded, corresponding to the remaining AC
coefficients. The HF groups can be encoded in multiple passes for
more progressive refinement steps; the coefficients of all passes
are added. Unlike JPEG progressive scan scripts, JPEG XL allows
signaling any amount of detail in any part of the image in any pass.
In Modular mode, groups can have dimensions 128x128, 256x256, 512x512
or 1024x1024. If the Squeeze transform was used, the data will
be split into three parts: the Global groups (the top of the Laplacian
pyramid that fits in a single group), the LF groups (the middle part
of the Laplacian pyramid that corresponds to the data needed to
reconstruct the 1:8 image) and the HF groups (the base of the Laplacian
pyramid), where the HF groups are again possibly encoded in multiple
passes (up to three: one for the 1:4 image, one for the 1:2 image,
and one for the 1:1 image).
In the case of a VarDCT image with extra channels (e.g. alpha), the
VarDCT groups and the Modular groups are interleaved in order to
allow progressive previews of all the channels.
The default group order is to encode the LF and HF groups in
scanline order (top to bottom, left to right), but this order
can be permuted arbitrarily. This allows, for example, a center-first
ordering or a saliency-based ordering, causing the bitstream
to prioritize progressive refinements in a different way.
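
As an illustration of what such a permutation can look like, the sketch below
orders the groups of a gx-by-gy grid center-first. This is only one conceivable
ordering; the codestream simply signals an arbitrary permutation:

```cpp
#include <algorithm>
#include <numeric>
#include <vector>

// Produces a center-first permutation of the indices of a gx-by-gy group
// grid: groups closer to the image center come first. Sketch only.
std::vector<int> CenterFirstOrder(int gx, int gy) {
  std::vector<int> order(static_cast<size_t>(gx) * gy);
  std::iota(order.begin(), order.end(), 0);
  const double cx = (gx - 1) / 2.0, cy = (gy - 1) / 2.0;
  auto dist2 = [&](int i) {
    const double dx = i % gx - cx, dy = i / gx - cy;
    return dx * dx + dy * dy;
  };
  std::stable_sort(order.begin(), order.end(),
                   [&](int a, int b) { return dist2(a) < dist2(b); });
  return order;
}
```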
## File Format Features
Besides the image data itself (stored in the `jxlc` codestream box),
the optional container format allows storing additional information.
### Metadata
Three types of metadata can be included in a JPEG XL container:
* Exif (`Exif`)
* XMP (`xml `)
* JUMBF (`jumb`)
This metadata can contain information about the image, such as copyright
notices, GPS coordinates, camera settings, etc.
If it contains rendering-impacting information (such as Exif orientation),
the information in the codestream takes precedence.
### Compressed Metadata
The container allows the above metadata to be stored either uncompressed
(e.g. plaintext XML in the case of XMP) or Brotli-compressed.
In the latter case, the box type is `brob` (Brotli-compressed Box) and
the first four bytes of the box contents define the actual box type
(e.g. `xml `) it represents.
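
As a hedged sketch of how the box layer looks to a parser, the snippet below
walks top-level boxes and reports the real type hidden inside a `brob` box. It
deliberately skips the ISOBMFF corner cases (64-bit extended sizes, boxes that
run to the end of the file):

```cpp
#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

// Walks top-level ISOBMFF boxes (4-byte big-endian size including the header,
// then a 4-byte type). For a 'brob' box, the first 4 payload bytes give the
// actual box type of the Brotli-compressed contents. Sketch only.
void ListBoxes(const std::vector<uint8_t>& file) {
  size_t pos = 0;
  while (pos + 8 <= file.size()) {
    const uint32_t size = (static_cast<uint32_t>(file[pos]) << 24) |
                          (static_cast<uint32_t>(file[pos + 1]) << 16) |
                          (static_cast<uint32_t>(file[pos + 2]) << 8) |
                          static_cast<uint32_t>(file[pos + 3]);
    const std::string type(reinterpret_cast<const char*>(&file[pos + 4]), 4);
    if (size < 8 || pos + size > file.size()) break;  // unsupported/truncated
    if (type == "brob" && size >= 12) {
      const std::string real(reinterpret_cast<const char*>(&file[pos + 8]), 4);
      std::printf("brob box holding Brotli-compressed '%s'\n", real.c_str());
    } else {
      std::printf("'%s' box, %u bytes\n", type.c_str(), size);
    }
    pos += size;
  }
}
```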
### JPEG Bitstream Reconstruction Data
JPEG XL can losslessly recompress existing JPEG files.
The general design philosophy still applies in this case:
all the image data is stored in the codestream box, including the DCT
coefficients of the original JPEG image and possibly an ICC profile or
Exif orientation.
In order to allow bit-identical reconstruction of the original JPEG file
(not just the image but the actual file), additional information is needed,
since the same image data can be encoded in multiple ways as a JPEG file.
The `jbrd` box (JPEG Bitstream Reconstruction Data) contains this information.
Typically it is relatively small. Using the image data from the codestream,
the JPEG bitstream reconstruction data, and possibly other metadata boxes
that were present in the JPEG file (Exif/XMP/JUMBF), the exact original
JPEG file can be reconstructed.
This box is not needed to display a recompressed JPEG image; it is only
needed to reconstruct the original JPEG file.
### Frame Index
The container can optionally store a `jxli` box, which contains an index
of offsets to keyframes of a JPEG XL animation. It is not needed to display
the animation, but it does facilitate efficient seeking.
### Partial Codestream
The codestream can optionally be split into multiple `jxlp` boxes;
conceptually, this is equivalent to a single `jxlc` box that contains the
concatenation of all partial codestream boxes.
This makes it possible to create a file that starts with
the data needed for a progressive preview of the image, followed by
metadata, followed by the remaining image data.
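
A minimal sketch of that reassembly is shown below. It assumes (check ISO/IEC
18181-2 for the authoritative layout) that each `jxlp` payload starts with a
4-byte part-index field, whose most significant bit marks the final part, and
that the parts appear in file order:

```cpp
#include <cstdint>
#include <vector>

// Reassembles a codestream from the payloads of the jxlp boxes, in file
// order. ASSUMPTION: each payload is believed to begin with a 4-byte
// part-index field (high bit set on the final part), which is skipped here;
// verify this against ISO/IEC 18181-2 before relying on it.
std::vector<uint8_t> ConcatenateJxlp(
    const std::vector<std::vector<uint8_t>>& jxlp_payloads) {
  std::vector<uint8_t> codestream;
  for (const std::vector<uint8_t>& payload : jxlp_payloads) {
    if (payload.size() < 4) continue;  // malformed part; skip
    codestream.insert(codestream.end(), payload.begin() + 4, payload.end());
  }
  return codestream;
}
```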