wiki:

Hardware

/

VAAPI


Version 4 (modified by jkqxz, 3 weeks ago) (diff)

--

Platform Support

Intel / i965

See QuickSync.

AMD / Mesa

The Mesa VAAPI driver uses the UVD (Unified Video Decoder) and VCE (Video Coding Engine) hardware found in all recent AMD graphics cards and APUs.

H.264, MPEG-2, MPEG-4 part 2 and VC-1 decode are supported by all GCN GPUs (since Southern Islands). H.265 support was added with GCN 3 (Volcanic Islands) and H.265 10-bit with GCN 4 (Arctic Islands). Older GPUs may or may not be supported.

MPEG-4 part 2 is disabled by default due to VAAPI limitations (the main Intel driver never implemented it, so it didn't get much testing). Set the environment variable VAAPI_MPEG4_ENABLED to 1 to try to use it anyway.

H.264 encode is working on GCN GPUs, but is still incomplete. No other codecs are supported by Mesa for encode yet.

Encoding and interlacing support in Mesa are incompatible because of the data layout in GPU memory. By default, frames are separated into fields and interlaced video is supported but encoding is not. Set the environment variable VAAPI_DISABLE_INTERLACE to 1 to be able to use the encoder (but without any interlaced video support).

Device Selection

The libva driver needs to be attached to a DRM device to work. This can be connected either directly or via a running X server. When working standlone, it is generally best to use a DRM render node (/dev/dri/render*) - only use a connection via X if you actually want to deal with surfaces inside X (with DRI2, for example).

In ffmpeg, a named global device can be created using the -init_hw_device option:

ffmpeg -init_hw_device vaapi=foo:/dev/dri/renderD128

With the decode hwaccel, each input stream can then be given a previously initialised device with the -hwaccel_device option:

ffmpeg -init_hw_device vaapi=foo:/dev/dri/renderD128 -hwaccel vaapi -hwaccel_device foo -i ...

If only one stream is being used, -hwaccel_device can also accept a device path directly:

ffmpeg -hwaccel vaapi -hwaccel_device /dev/dri/renderD128 -i ...

Where filters require a device (for example, the hwupload filter), the device used in a filter graph can be specified with the -filter_hw_device option:

ffmpeg -init_hw_device vaapi=foo:/dev/dri/renderD128 -i ... -filter_hw_device foo -filter_complex ...hwupload... ...

If you have multiple usable devices in the same machine (for example, an Intel integrated GPU and an AMD discrete graphics card), they can be used simultaneously to decode different streams:

ffmpeg -init_hw_device vaapi=intel:/dev/dri/renderD128 -init_hw_device vaapi=amd:/dev/dri/renderD129 -hwaccel vaapi -hwaccel_device intel -i ... -hwaccel vaapi -hwaccel_device amd -i ...

(See <http://www.ffmpeg.org/ffmpeg.html#toc-Advanced-Video-options> for more detail about these options.)

Finally, the -vaapi_device option may be more convenient in single-device cases with filters.

ffmpeg -vaapi_device /dev/dri/renderD128

acts equivalently to:

ffmpeg -init_hw_device vaapi=vaapi0:/dev/dri/renderD128 -filter_hw_device vaapi0

Surface Formats

The hardware codecs used by VAAPI are not able to access frame data in arbitrary memory. Therefore, all frame data needs to be uploaded to hardware surfaces connected to the appropriate device before being used. All VAAPI hardware surfaces in ffmpeg are represented by the vaapi pixfmt (the internal layout is not visible here, though).

The hwaccel decoders normally output frames in the associated hardware format, but by default the ffmpeg utility download the output frames to normal memory before passing them to the next component. This allows the decoder to work standlone to make decoding faster without any additional options:

ffmpeg -hwaccel vaapi ... -i input.mp4 -c:v libx264 ... output.mp4

For other outputs, the option -hwaccel_output_format can be used to specify the format to be used. This can be a software format (which formats are usable depends on the driver), or it can be the vaapi hardware format to indicate that the surface should not be downloaded.

For example, to decode only and do nothing with the result:

ffmpeg -hwaccel vaapi -hwaccel_output_format vaapi ... -i input.mp4 -f null -

This can be used to test the speed / CPU use of the decoder only (the download operation typically adds a large amount of additional overhead).

When decoder output is in hardware surfaces, the frames will be given to following filters or encoders in that form. The scale_vaapi and deinterlace_vaapi filters act on vaapi format frames to scale and deinterlace them respecitvely. There are also some generic filters - hwdownload, hwupload and hwmap - which support all hardware formats, including VAAPI (see <http://www.ffmpeg.org/ffmpeg-filters.html#hwdownload>).

For example, take an interlaced input, decode, deinterlace, scale to 720p, download to normal memory and encode with libx264:

ffmpeg -hwaccel vaapi -hwaccel_output_format vaapi ... -i interlaced_input.mp4 -vf 'deinterlace_vaapi,scale_vaapi=w=1280:h=720,hwdownload,format=nv12' -c:v libx264 ... progressive_output.mp4

Encoding

The encoders only accept input as VAAPI surfaces. If the input is in normal memory, it will need to be uploaded before giving the frames to the encoder - in the ffmpeg utility, the hwupload filter can be used for this. It will upload to a surface with the same layout as the software frame, so it may be necessary to add a format filter immediately before to get the input into the right format (hardware generally wants the nv12 layout, but most software functions use the yuv420p layout). The hwupload filter also requires a device to upload to, which needs to be defined before the filter graph is created.

So, to use the default decoder for some input, then upload frames to VAAPI and encode with H.264 and default settings:

ffmpeg -vaapi_device /dev/dri/renderD128 -i input.mp4 -vf 'format=nv12,hwupload' -c:v h264_vaapi output.mp4

If the input is known to be hardware-decodable, then we can use the hwaccel:

ffmpeg -hwaccel vaapi -hwaccel_output_format vaapi -hwaccel_device /dev/dri/renderD128 -i input.mp4 -c:v h264_vaapi output.mp4

Finally, when the input may or may not be hardware decodable we can do:

ffmpeg -init_hw_device vaapi=foo:/dev/dri/renderD128 -hwaccel vaapi -hwaccel_output_format vaapi -hwaccel_device foo -i input.mp4 -filter_hw_device foo -vf 'format=nv12|vaapi,hwupload' -c:v h264_vaapi output.mp4

This works because the decoder will output either vaapi surfaces (if the hwaccel is usable) or software frames (if it isn't). In the first case, it matches the vaapi format and hwupload does nothing (it passes through hardware frames unchanged). Im the second case, it matches the nv12 format and converts whatever the input is to that, then uploads. Performance will likely vary by a large amount depending which path is chosen, though.

The supported encoders are:

H.262 / MPEG-2 part 2 mpeg2_vaapi
H.264 / MPEG-4 part 10 (AVC) h264_vaapi
H.265 / MPEG-H part 2 (HEVC) hevc_vaapi
MJPEG / JPEG mjpeg_vaapi
VP8 vp8_vaapi
VP9 vp9_vaapi

For an explanation of codec options, see <http://www.ffmpeg.org/ffmpeg-codecs.html#VAAPI-encoders>.

Mapping options from libx264

No CRF-like mode is currently supported. The only constant-quality mode is CQP (constant quantisation parameter), which has no adaptivity to scene content. It does, however, allow different quality settings for different frame types, to improve compression by spending fewer bits on unreferenced B-frames - see the (i|b)_q(factor|offset) options. CQP mode cannot be combined with a maximum bitrate or buffer size.

CBR and VBR modes are supported, though the output of them varies significantly by driver and device (default is VBR, set -maxrate equal to -b:v for CBR). HRD buffering options (rc_max_rate, rc_buffer_size) are functional, and the encoder will generate buffering_period and pic_timing SEI when appropriate.

There is no complete analogue of the -preset option. The -compression_level option controls the local speed/quality tradeoff in the encoder (that is, the amount of effort expended on trying to get the best results from local choices like motion estimation and mode decision), using a nebulous per-device scale. The argument is a small integer, from 1 up to some limit dependent on the device (not more than 7) - higher values are faster / lower stream quality. Separately, some hardware (Intel gen9) supports a low-power mode with more restricted features. It is accessible via the -low_power option.

Neither two-pass encoding nor lookahead are supported at all - only local rate control is possible. VBR mode should do a reasonably good job at getting close to an overall bitrate target, but quality will vary significantly through a stream if the complexity varies.

Full Examples

All of these examples assume the input and output files will contain one video stream (audio will need to be considered separately). It is assumed that VAAPI is usable via the DRM device node /dev/dri/renderD128.

Decode-only

Decode an input with hardware if possible, output in normal memory to encode with libx264:

ffmpeg -hwaccel vaapi -hwaccel_device /dev/dri/renderD128 -i input.mp4 -c:v libx264 -crf 20 output.mp4

Decode an input with hardware, deinterlace it if it was interlaced, downscale, then download to normal memory to encode with libx264 (will fail if the input is not supported by the hardware decoder):

ffmpeg -hwaccel vaapi -hwaccel_device /dev/dri/renderD128 -hwaccel_output_format vaapi -i input.mp4 -vf 'deinterlace_vaapi=rate=field:auto=1,scale_vaapi=w=640:h=360,hwdownload,format=nv12' -c:v libx264 -crf 20 output.mp4

Decode an input and discard the output (this can be used as a crude benchmark of the decoder):

ffmpeg -hwaccel vaapi -hwaccel_device /dev/dri/renderD128 -hwaccel_output_format vaapi -i input.mp4 -f null -

Encode-only

Encode an input with H.264 at 5Mbps VBR:

ffmpeg -vaapi_device /dev/dri/renderD128 -i input.mp4 -vf 'format=nv12,hwupload' -c:v h264_vaapi -b:v 5M output.mp4

As previous, but use constrained baseline profile only for compatibility with old devices:

ffmpeg -vaapi_device /dev/dri/renderD128 -i input.mp4 -vf 'format=nv12,hwupload' -c:v h264_vaapi -b:v 5M -profile 578 -bf 0 output.mp4

Encode with H.264 at good constant quality:

ffmpeg -vaapi_device /dev/dri/renderD128 -i input.mp4 -vf 'format=nv12,hwupload' -c:v h264_vaapi -qp 18 output.mp4

Encode with 10-bit H.265 at 15Mbps VBR (recent hardware required - Kaby Lake or later Intel):

ffmpeg -vaapi_device /dev/dri/renderD128 -i input.mp4 -vf 'format=p010,hwupload' -c:v hevc_vaapi -b:v 15M -profile 2 output.mp4

Scale to 720p and encode with H.264 at 5Mbps CBR:

ffmpeg -vaapi_device /dev/dri/renderD128 -i input.mp4 -vf 'hwupload,scale_vaapi=w=1280:h=720:format=nv12' -c:v h264_vaapi -b:v 5M -maxrate 5M output.mp4

Encode with VP9 at 5Mbps VBR:

ffmpeg -vaapi_device /dev/dri/renderD128 -i input.mp4 -vf 'format=nv12,hwupload' -c:v vp9_vaapi -b:v 5M output.webm

Encode with VP9 at good constant quality, using pseudo-B-frames to improve compression:

ffmpeg -vaapi_device /dev/dri/renderD128 -i input.mp4 -vf 'format=nv12,hwupload' -c:v vp9_vaapi -global_quality 50 -bf 1 -bsf:v vp9_raw_reorder,vp9_superframe output.webm

Capture the screen from X and encode with H.264 at reasonable constant-quality:

ffmpeg -vaapi_device /dev/dri/renderD128 -f x11grab -video_size 1920x1080 -i :0 -vf 'hwupload,scale_vaapi=format=nv12' -c:v h264_vaapi -qp 24 output.mp4

Note that it is also possible to do the format conversion (RGB to YUV) on the CPU - this is slower, but might be desirable if other filters are going to be applied:

ffmpeg -vaapi_device /dev/dri/renderD128 -f x11grab -video_size 1920x1080 -i :0 -vf 'format=nv12,hwupload' -c:v h264_vaapi -qp 24 output.mp4

Capture the screen from the first active KMS plane:

ffmpeg_g -device /dev/dri/card0 -f kmsgrab -i - -vf 'hwmap=derive_device=vaapi,scale_vaapi=w=1920:h=1080:format=nv12' -c:v h264_vaapi -qp 24 output.mp4

Compared to capturing through X as in the previous examples, this should use much less CPU (all surfaces stay on the GPU side) and can work outside X (on VTs or in Wayland), but can only capture whole planes and requires DRM master or CAP_SYS_ADMIN to run.

Transcode

Hardware-only transcode to H.264 at 2Mbps CBR:

ffmpeg -hwaccel vaapi -hwaccel_device /dev/dri/renderD128 -hwaccel_output_format vaapi -i input.mp4 -c:v h264_vaapi -b:v 2M -maxrate 2M output.mp4

Decode, deinterlace if interlaced, scale to 720p, encode with H.265 at 5Mbps VBR:

ffmpeg -hwaccel vaapi -hwaccel_device /dev/dri/renderD128 -hwaccel_output_format vaapi -i input.mp4 -vf 'deinterlace_vaapi=rate=field:auto=1,scale_vaapi=w=1280:h=720' -c:v hevc_vaapi -b:v 5M output.mp4

Transcode to 10-bit H.265 at 15Mbps VBR (the input can be 10-bit, but need not be):

ffmpeg -hwaccel vaapi -hwaccel_device /dev/dri/renderD128 -hwaccel_output_format vaapi -i input.mp4 -vf 'scale_vaapi=format=p010' -c:v hevc_vaapi -profile 2 -b:v 15M output.mp4

Transcode to H.264 in constrained baseline profile at level 3 and 1Mbps CBR for compatibility with old devices:

ffmpeg -hwaccel vaapi -hwaccel_device /dev/dri/renderD128 -hwaccel_output_format vaapi -i input.mp4 -vf 'fps=30,scale_vaapi=w=640:h=-2:format=nv12' -c:v h264_vaapi -profile 578 -level 30 -bf 0 -b:v 1M -maxrate 1M output.mp4

Decode the input, then pick a frame from it every 10 seconds to make a sequence of JPEG screenshots at high quality:

ffmpeg -hwaccel vaapi -hwaccel_device /dev/dri/renderD128 -hwaccel_output_format vaapi -i input.mp4 -r 1/10 -c:v mjpeg_vaapi -global_quality 90 -f image2 output%03d.jpeg

Burn subtitles into the video while transcoding:

ffmpeg -hwaccel vaapi -hwaccel_device /dev/dri/renderD128 -hwaccel_output_format vaapi -i input.mp4 -vf 'scale_vaapi,hwmap=mode=read+write+direct,format=nv12,ass=subtitles.ass,hwmap' -c:v h264_vaapi -b:v 2M -maxrate 2M output.mp4

(Note that the scale_vaapi filter is required here to copy the frames - without it, the subtitles would be drawn directly on the reference frames being used by the decoder at the same time.)

Transcode to two different outputs (one at constant-quality and one at constant-bitrate) from the same input:

ffmpeg -hwaccel vaapi -hwaccel_device /dev/dri/renderD128 -hwaccel_output_format vaapi -i input.mp4 -filter_complex 'split[cq][cb]' -map '[cq]' -c:v h264_vaapi -qp 18 output-cq.mp4 -map '[cb]' -c:v h264_vaapi -b:v 5M -maxrate 5M output-cb.mp4

Transcode for multiple streaming formats (one H.264 and one VP9, with the same parameters) from the same input:

ffmpeg -hwaccel vaapi -hwaccel_device /dev/dri/renderD128 -hwaccel_output_format vaapi -i input.mp4 -filter_complex 'split[h264][vp9]' -map '[h264]' -c:v h264_vaapi -b:v 5M output-h264.mp4 -map '[vp9]' -c:v vp9_vaapi -b:v 5M output-vp9.webm

Decode on one device, download, upload to a second device, and encode:

ffmpeg -init_hw_device vaapi=decdev:/dev/dri/renderD128 -init_hw_device vaapi=encdev:/dev/dri/renderD129 -hwaccel vaapi -hwaccel_device decdev -hwaccel_output_format vaapi -i input.mp4 -filter_hw_device encdev -vf 'hwdownload,format=nv12,hwupload' -c:v h264_vaapi -b:v 5M output.mp4

Other

Use the VAAPI deinterlacer standalone to attempt to make a software transcode run faster (this may actually make things slower - the additional copying to the GPU and back is quite a large overhead):

ffmpeg -vaapi_device /dev/dri/renderD128 -i input.mp4 -vf 'format=nv12,hwupload,deinterlace_vaapi=rate=field,hwdownload,format=nv12' -c:v libx264 -crf 24 output.mp4