The drm/i915 driver supports all (with the exception of some very early
models) integrated GFX chipsets with both Intel display and rendering
blocks. This excludes a set of SoC platforms with an SGX rendering unit;
those have basic support through the gma500 drm driver.

The i915 driver supports dynamic enabling and disabling of entire hardware
blocks at runtime. This is especially important on the display side where
software is supposed to control many power gates manually on recent hardware,
since on the GT side a lot of the power management is done by the hardware.
But even there some manual control at the device level is required.

Since i915 supports a diverse set of platforms with a unified codebase and
hardware engineers just love to shuffle functionality around between power
domains there’s a sizeable amount of indirection required. This file provides
generic functions to the driver for grabbing and releasing references for
abstract power domains. It then maps those to the actual power wells
present for a given platform.

This function can be used to check the hw power domain state. It is mostly
used in hardware state readout functions. Everywhere else code should rely
upon explicit power domain reference counting to ensure that the hardware
block is powered up before accessing it.

Callers must hold the relevant modesetting locks to ensure that concurrent
threads can’t disable the power well while the caller tries to read a few
registers.

Signal to DMC firmware/HW the target DC power state passed in state.
DMC/HW can turn off individual display clocks and power rails when entering
a deeper DC power state (higher in number) and turns these back on when
exiting that state to a shallower power state (lower in number). The HW
will decide when to actually enter a given state on an on-demand basis,
for instance depending on the active state of display pipes. The state of
display registers backed by affected power rails is saved/restored as needed.

Based on the above, enabling a deeper DC power state is asynchronous wrt.
the HW entering it. Disabling a deeper power state is synchronous: for instance
setting DC_STATE_DISABLE won’t complete until all HW resources are turned
back on and register state is restored. This is guaranteed by the MMIO write
to DC_STATE_EN blocking until the state is restored.

This function grabs a power domain reference for domain and ensures that the
power domain and all its parents are powered up. Therefore users should only
grab a reference to the innermost power domain they need.

Any power domain reference obtained by this function must have a symmetric
call to intel_display_power_put() to release the reference again.

This function grabs a power domain reference for domain and ensures that the
power domain and all its parents are powered up. Therefore users should only
grab a reference to the innermost power domain they need.

Any power domain reference obtained by this function must have a symmetric
call to intel_display_power_put() to release the reference again.
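As a rough sketch of this reference-counting pattern, assuming the older
void-returning variant of the API (newer kernels return an intel_wakeref_t
cookie that must be passed back to the put call), with POWER_DOMAIN_PIPE_A
and the register read purely illustrative:

/* Hold a reference on the pipe A power domain around a register access. */
static u32 read_pipe_a_reg(struct drm_i915_private *dev_priv, i915_reg_t reg)
{
	u32 val;

	/* Power up the domain and all its parent power wells. */
	intel_display_power_get(dev_priv, POWER_DOMAIN_PIPE_A);

	val = I915_READ(reg);

	/* Symmetric release once the hardware access is done. */
	intel_display_power_put(dev_priv, POWER_DOMAIN_PIPE_A);

	return val;
}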

This function initializes the hardware power domain state and enables all
power wells belonging to the INIT power domain. Power wells in other
domains (and not in the INIT domain) are referenced or disabled by
intel_modeset_readout_hw_state(). After that the reference count of each
power well must match its HW enabled state, see
intel_power_domains_verify_state().

Enable the on-demand enabling/disabling of the display power wells. Note that
power wells not belonging to POWER_DOMAIN_INIT are allowed to be toggled
only at specific points of the display modeset sequence, thus they are not
affected by the intel_power_domains_enable()/disable() calls. The purpose
of these functions is to keep the rest of the power wells enabled until the
end of display HW readout (which will acquire the power references reflecting
the current HW state).

Verify if the reference count of each power well matches its HW enabled
state and the total refcount of the domains it belongs to. This must be
called after modeset HW state sanitization, which is responsible for
acquiring reference counts for any power wells in use and disabling the
ones left on by BIOS but not required by any active output.

This function grabs a device-level runtime pm reference if the device is
already in use and ensures that it is powered up. It is illegal to try
and access the HW should intel_runtime_pm_get_if_in_use() report failure.

Any runtime pm reference obtained by this function must have a symmetric
call to intel_runtime_pm_put() to release the reference again.

Return

the wakeref cookie to pass to intel_runtime_pm_put(), evaluates
as True if the wakeref was acquired, or False otherwise.
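A hedged usage sketch; the exact parameter type has varied between kernel
versions (some take a struct drm_i915_private pointer, newer ones a
struct intel_runtime_pm pointer), so treat this as illustrative:

intel_wakeref_t wakeref;

wakeref = intel_runtime_pm_get_if_in_use(dev_priv);
if (!wakeref)
	return; /* device suspended: illegal to touch the HW */

/* ... safe to access hardware registers here ... */

intel_runtime_pm_put(dev_priv, wakeref);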

This function grabs a device-level runtime pm reference (mostly used for GEM
code to ensure the GTT or GT is on).

It will _not_ power up the device but instead only check that it’s powered
on. Therefore it is only valid to call this function from contexts where
the device is known to be powered up and where trying to power it up would
result in hilarity and deadlocks. That pretty much means only the system
suspend/resume code where this is used to grab runtime pm references for
delayed setup down in work items.

Any runtime pm reference obtained by this function must have a symmetric
call to intel_runtime_pm_put() to release the reference again.

This function can be used to get GT’s forcewake domain references.
Normal register access will handle the forcewake domains automatically.
However if some sequence requires the GT to not power down a particular
forcewake domain this function should be called at the beginning of the
sequence, and subsequently the reference should be dropped by a symmetric
call to intel_uncore_forcewake_put(). Usually the caller wants all the
domains to be kept awake, so fw_domains would then be FORCEWAKE_ALL.
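A minimal sketch of that pattern (in older kernels these helpers took a
struct drm_i915_private pointer rather than a struct intel_uncore pointer):

/* Keep all forcewake domains awake across a raw MMIO sequence. */
intel_uncore_forcewake_get(uncore, FORCEWAKE_ALL);

/* ... register sequence that must not race with GT power-down ... */

intel_uncore_forcewake_put(uncore, FORCEWAKE_ALL);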

This routine waits until the target register reg contains the expected
value after applying the mask, i.e. it waits until

(I915_READ_FW(reg) & mask) == value

Otherwise, the wait will time out after slow_timeout_ms milliseconds.
For atomic context slow_timeout_ms must be zero and fast_timeout_us
must not be larger than 20,000 microseconds.

Note that this routine assumes the caller holds forcewake asserted, it is
not suitable for very long waits. See intel_wait_for_register() if you
wish to wait without holding forcewake for the duration (i.e. you expect
the wait to be slow).

Returns 0 if the register matches the desired condition, or -ETIMEDOUT.
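A hedged usage sketch; the prototype shown below and the register/bit names
are illustrative, so check the matching kernel version for the exact
signature:

int err;

/* Caller already holds forcewake; spin up to 100us, atomic-safe. */
err = __intel_wait_for_register_fw(uncore,
				   STATUS_REG, /* hypothetical register */
				   READY_BIT,  /* mask: hypothetical bit */
				   READY_BIT,  /* value: wait for it to be set */
				   100,        /* fast_timeout_us */
				   0,          /* slow_timeout_ms: zero in atomic context */
				   NULL);
if (err == -ETIMEDOUT)
	/* handle the timeout */;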

Returns a set of forcewake domains required to be taken with, for example,
intel_uncore_forcewake_get() for the specified register to be accessible in
the specified mode (read, write or read/write) with raw mmio accessors.

NOTE

On Gen6 and Gen7 the write forcewake domain (FORCEWAKE_RENDER) requires the
caller to do FIFO management on their own or risk losing writes.

These functions provide the basic support for enabling and disabling the
interrupt handling support. There’s a lot more functionality in i915_irq.c
and related files, but that will be described in separate chapters.

Intel GVT-g is a graphics virtualization technology which shares the
GPU among multiple virtual machines on a time-sharing basis. Each
virtual machine is presented a virtual GPU (vGPU), which has features
equivalent to the underlying physical GPU (pGPU), so the i915 driver can run
seamlessly in a virtual machine. This file provides vGPU specific
optimizations when running in a virtual machine, to reduce the complexity
of vGPU emulation and to improve the overall performance.

A primary feature introduced here is the so-called “address space ballooning”
technique. Intel GVT-g partitions global graphics memory among multiple VMs,
so each VM can directly access a portion of the memory without hypervisor’s
intervention, e.g. filling textures or queuing commands. However, with the
partitioning an unmodified i915 driver would assume a smaller graphics
memory starting from address ZERO, which would then require the vGPU
emulation module to translate the graphics address between ‘guest view’ and
‘host view’, for all registers and command opcodes which contain a graphics
memory address. To reduce the complexity, Intel GVT-g introduces “address
space ballooning”: the exact partitioning knowledge is told to each guest
i915 driver, which then reserves the portions not allocated to it and
prevents them from being allocated. Thus the vGPU emulation module only
needs to scan and validate graphics addresses without the complexity of
address translation.

This function is called at the initialization stage to balloon out the
graphics address space allocated to other vGPUs, by marking these spaces as
reserved. The ballooning related knowledge (starting address and size of
the mappable/unmappable graphics memory) is described in the vgt_if structure
in a reserved mmio range.
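A minimal sketch of the reservation step, loosely following what the
ballooning code does with the drm_mm allocator; the function name and
surrounding plumbing are illustrative:

#include <drm/drm_mm.h>

/* Mark a host-assigned hole as reserved so the guest never allocates it. */
static int balloon_out_range(struct drm_mm *ggtt_mm, struct drm_mm_node *node,
			     u64 start, u64 size)
{
	memset(node, 0, sizeof(*node));
	node->start = start;
	node->size = size;

	/* The reserved node acts as a permanent hole in the address space. */
	return drm_mm_reserve_node(ggtt_mm, node);
}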

To give an example, the drawing below depicts one typical scenario after
ballooning. Here vGPU1 has two pieces of graphics address space ballooned
out, one each for the mappable and the non-mappable part. From the vGPU1
point of view, the total size is the same as the physical one, with the
start address of its graphics space being zero. Yet there are some portions
ballooned out (the shadowed parts, which are marked as reserved by the drm
allocator). From the host point of view, the graphics address space is
partitioned by multiple vGPUs in different VMs.

Intel GVT-g is a graphics virtualization technology which shares the
GPU among multiple virtual machines on a time-sharing basis. Each
virtual machine is presented a virtual GPU (vGPU), which has features
equivalent to the underlying physical GPU (pGPU), so the i915 driver can run
seamlessly in a virtual machine.

This file is intended as a central place to implement most [1] of the
required workarounds for hardware to work as originally intended. They fall
into five basic categories depending on how/when they are applied:

Workarounds that touch registers that are saved/restored to/from the HW
context image. The list is emitted (via Load Register Immediate commands)
every time a new context is created.

GT workarounds. The list of these WAs is applied whenever these registers
revert to default values (on GPU reset, suspend/resume [2], etc..).

Display workarounds. The list is applied during display clock-gating
initialization.

Workarounds that whitelist a privileged register, so that UMDs can manage
them directly. This is just a special case of an MMIO workaround (as we
write the list of these to-be-whitelisted registers to some special HW
registers).

Workaround batchbuffers, that get executed automatically by the hardware
on every HW context restore.

[2] Technically, some registers are power-context saved & restored, so they
survive a suspend/resume. In practice, writing them again is not too
costly and simplifies things. We can revisit this in the future.
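For a flavour of what a register workaround looks like, here is a hedged
sketch of the masked-register idiom many of them use: the upper 16 bits of
the write select which of the lower 16 bits take effect. The register and
bit names here are illustrative:

#define _MASKED_BIT_ENABLE(a)	({ typeof(a) _a = (a); ((_a) << 16) | (_a); })

/* Set a chicken bit without disturbing the other bits in the register. */
I915_WRITE(GEN8_ROW_CHICKEN, _MASKED_BIT_ENABLE(STALL_DOP_GATING_DISABLE));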

The i915 driver is thus far the only DRM driver which doesn’t use the
common DRM helper code to implement mode setting sequences. Thus it has
its own tailor-made infrastructure for executing a display configuration
change.

Many features require us to track changes to the currently active
frontbuffer, especially rendering targeted at the frontbuffer.

To be able to do so GEM tracks frontbuffers using a bitmask for all possible
frontbuffer slots through i915_gem_track_fb(). The functions in this file are
then called when the contents of the frontbuffer are invalidated, when
frontbuffer rendering has stopped again to flush out all the changes and when
the frontbuffer is exchanged with a flip. Subsystems interested in
frontbuffer changes (e.g. PSR, FBC, DRRS) should directly put their callbacks
into the relevant places and filter for the frontbuffer slots that they are
interested in.

On a high level there are two types of powersaving features. The first type
works like a special cache (FBC and PSR) and is interested in when caching
should stop and when it can restart. This is done by placing callbacks
into the invalidate and the flush functions: at invalidate the caching must
be stopped and at flush time it can be restarted. Such features may also
need to know when the frontbuffer changes (e.g. when the hw doesn’t initiate
an invalidate and flush on its own), which can be achieved by placing
callbacks into the flip functions.

The other type of display power saving feature only cares about busyness
(e.g. DRRS). In that case all three (invalidate, flush and flip) indicate
busyness. There is no direct way to detect idleness. Instead, a delayed work
item serving as an idle timer should be started from the flush and flip
functions and cancelled as soon as busyness is detected.
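A minimal sketch of that busyness pattern using a kernel delayed work item;
the names are illustrative, not the actual DRRS code:

#include <linux/workqueue.h>

static void idle_worker(struct work_struct *work)
{
	/* No activity for a second: downclock, e.g. switch to low RR. */
}

static DECLARE_DELAYED_WORK(idle_work, idle_worker);

static void frontbuffer_activity(void)
{
	/* Invalidate/flush/flip all mean "busy": upclock and (re)arm the timer. */
	mod_delayed_work(system_wq, &idle_work, msecs_to_jiffies(1000));
}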

Note that there’s also an older frontbuffer activity tracking scheme which
just tracks general activity. This is done by the various mark_busy and
mark_idle functions. For display power management features using these
functions is deprecated and should be avoided.

This function gets called every time rendering on the given object starts and
frontbuffer caching (fbc, low refresh rate for DRRS, panel self refresh) must
be invalidated. For ORIGIN_CS any subsequent invalidation will be delayed
until the rendering completes or a flip on this frontbuffer plane is
scheduled.

This function gets called every time rendering on the given planes has
completed and frontbuffer caching can be started again. Flushes will get
delayed if they’re blocked by some outstanding asynchronous rendering.

This function gets called after scheduling a flip on obj. The actual
frontbuffer flushing will be delayed until completion is signalled with
intel_frontbuffer_flip_complete. If an invalidate happens in between this
flush will be cancelled.

The i915 driver checks for display fifo underruns using the interrupt signals
provided by the hardware. This is enabled by default and fairly useful to
debug display issues, especially watermark settings.

If an underrun is detected this is logged into dmesg. To avoid flooding the
logs and occupying the cpu, underrun interrupts are disabled after the first
occurrence until the next modeset on a given pipe.

Note that underrun detection on gmch platforms is a bit more ugly since there
is no interrupt (even though the signalling bit is in the PIPESTAT pipe
interrupt register). Also on some other platforms underrun interrupts are
shared, which means that if we detect an underrun we need to disable underrun
reporting on all pipes.

This function makes us disable or enable PCH fifo underruns for a specific
PCH transcoder. Notice that on some PCHs (e.g. CPT/PPT), disabling FIFO
underrun reporting for one transcoder may also disable all the other PCH
error interrupts for the other transcoders, due to the fact that there’s just
one interrupt mask/enable bit for all the transcoders.

Check for CPU fifo underruns immediately. Useful on IVB/HSW where the shared
error interrupt may have been disabled, and so CPU fifo underruns won’t
necessarily raise an interrupt, and on GMCH platforms where underruns never
raise an interrupt.

This section covers plane configuration and composition with the primary
plane, sprites, cursors and overlays. This includes the infrastructure
to do atomic vsync’ed updates of all this state and also tightly coupled
topics like watermark setup and computation, framebuffer compression and
panel self refresh.

The functions here are used by the atomic plane helper functions to
implement legacy plane updates (i.e., drm_plane->update_plane() and
drm_plane->disable_plane()). This allows plane updates to use the
atomic state infrastructure and perform plane updates as separate
prepare/check/commit/cleanup steps.

This section covers output probing and related infrastructure like the
hotplug interrupt storm detection and mitigation code. Note that the
i915 driver still uses most of the common DRM helper code for output
probing, so those sections fully apply.

Simply put, hotplug occurs when a display is connected to or disconnected
from the system. However, there may be adapters and docking stations and
Display Port short pulses and MST devices involved, complicating matters.

Hotplug in i915 is handled in many different levels of abstraction.

The platform dependent interrupt handling code in i915_irq.c enables,
disables, and does preliminary handling of the interrupts. The interrupt
handlers gather the hotplug detect (HPD) information from relevant registers
into a platform independent mask of hotplug pins that have fired.

The Display Port work function i915_digport_work_func() calls into
intel_dp_hpd_pulse() via hooks, which handles DP short pulses and DP MST long
pulses, with failures and non-MST long pulses triggering regular hotplug
processing on the connector.

Finally, the userspace is responsible for triggering a modeset upon receiving
the hotplug uevent, disabling or enabling the crtc as needed.

The hotplug interrupt storm detection and mitigation code keeps track of the
number of interrupts per hotplug pin per a period of time, and if the number
of interrupts exceeds a certain threshold, the interrupt is disabled for a
while before being re-enabled. The intention is to mitigate issues arising
from broken hardware triggering massive amounts of interrupts and grinding
the system to a halt.

The current implementation expects that a hotplug interrupt storm will not
be seen when a display port sink is connected, hence on platforms whose DP
callback is handled by i915_digport_work_func() re-enabling of hpd is not
performed (it was never expected to be disabled in the first place ;) ).
This is specific to DP sinks handled by this routine; any other display,
such as HDMI or DVI, enabled on the same port will have proper logic since
it will use i915_hotplug_work_func() where this logic is handled.

Gather stats about HPD IRQs from the specified pin, and detect IRQ
storms. Only the pin specific stats and state are changed, the caller is
responsible for further action.

The number of IRQs that are allowed within HPD_STORM_DETECT_PERIOD is
stored in dev_priv->hotplug.hpd_storm_threshold which defaults to
HPD_STORM_DEFAULT_THRESHOLD. Long IRQs count as +10 to this threshold, and
short IRQs count as +1. If this threshold is exceeded, it’s considered an
IRQ storm and the IRQ state is set to HPD_MARK_DISABLED.
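A hedged sketch of that accounting; the struct layout and helper are
illustrative, only the weighting and threshold logic follow the text above:

#include <linux/jiffies.h>

/* Hypothetical per-pin state; the real fields live in dev_priv->hotplug. */
struct hpd_pin_stats {
	unsigned long last_jiffies;
	int count;
	int storm_threshold;
};

static bool hpd_storm_detect(struct hpd_pin_stats *pin, bool long_hpd)
{
	/* Start a new accounting period once the previous one expires. */
	if (time_after(jiffies, pin->last_jiffies +
		       msecs_to_jiffies(HPD_STORM_DETECT_PERIOD))) {
		pin->last_jiffies = jiffies;
		pin->count = 0;
	}

	/* Long IRQs weigh 10, short IRQs weigh 1. */
	pin->count += long_hpd ? 10 : 1;

	return pin->count > pin->storm_threshold; /* true: HPD_MARK_DISABLED */
}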

By default, most systems will only count long IRQs towards
dev_priv->hotplug.hpd_storm_threshold. However, some older systems also
suffer from short IRQ storms and must also track these. Because short IRQ
storms are naturally caused by sideband interactions with DP MST devices,
short IRQ detection is only enabled for systems without DP MST support.
Systems which are new enough to support DP MST are far less likely to
suffer from IRQ storms at all, so this is fine.

The HPD threshold can be controlled through i915_hpd_storm_ctl in debugfs,
and should only be adjusted for automated hotplug testing.

This is the main hotplug irq handler for all platforms. The platform specific
irq handlers call the platform specific hotplug irq handlers, which read and
decode the appropriate registers into bitmasks about hpd pins that have
triggered (pin_mask), and which of those pins may be long pulses
(long_mask). The long_mask is ignored if the port corresponding to the pin
is not a digital port.

Here, we do hotplug irq storm detection and mitigation, and pass further
processing to appropriate bottom halves.

This function enables the hotplug support. It requires that interrupts have
already been enabled with intel_irq_init_hw(). From this point on hotplug and
poll requests can run concurrently with other code, so locking rules must be
obeyed.

This is a separate step from interrupt enabling to simplify the locking rules
in the driver load and resume code.

This function enables polling for all connectors, regardless of whether or
not they support hotplug detection. Under certain conditions HPD may not be
functional. On most Intel GPUs, this happens when we enter runtime suspend.
On Valleyview and Cherryview systems, this also happens when we shut off all
of the powerwells.

Since this function can get called in contexts where we’re already holding
dev->mode_config.mutex, we do the actual hotplug enabling in a separate
worker.

The graphics and audio drivers together support High Definition Audio over
HDMI and Display Port. The audio programming sequences are divided into audio
codec and controller enable and disable sequences. The graphics driver
handles the audio codec sequences, while the audio driver handles the audio
controller sequences.

The disable sequences must be performed before disabling the transcoder or
port. The enable sequences may only be performed after enabling the
transcoder and port, and after link training has completed. Therefore the
audio enable/disable sequences are part of the modeset sequence.

The codec and controller sequences could be done either in parallel or
serially, but generally the ELDV/PD change in the codec sequence indicates
to the audio driver that the controller sequence should start. Indeed, most
of the co-operation between the graphics and audio drivers is handled via
audio related registers. (The notable exception is the power management, not
covered here.)

The struct i915_audio_component is used to interact between the graphics
and audio drivers. The struct i915_audio_component_ops ops in it is
defined in the graphics driver and called in the audio driver. The
struct i915_audio_component_audio_ops audio_ops is called from the i915
driver.

This will register with the component framework a child component which
will bind dynamically to the snd_hda_intel driver’s corresponding master
component when the latter is registered. During binding the child
initializes an instance of struct i915_audio_component which it receives
from the master. The master can then start to use the interface defined by
this struct. Each side can break the binding at any point by deregistering
its own component after which each side’s component unbind callback is
called.

We ignore any error during registration and continue with reduced
functionality (i.e. without HDMI audio).
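A hedged sketch of that registration using the generic kernel component
framework; the ops bodies are illustrative stand-ins for what the real
binding code does:

#include <linux/component.h>

static int i915_audio_component_bind(struct device *i915_kdev,
				     struct device *hda_kdev, void *data)
{
	/* "data" is the shared struct i915_audio_component instance. */
	return 0;
}

static void i915_audio_component_unbind(struct device *i915_kdev,
					struct device *hda_kdev, void *data)
{
}

static const struct component_ops i915_audio_component_bind_ops = {
	.bind	= i915_audio_component_bind,
	.unbind	= i915_audio_component_unbind,
};

/* Register the child component; any error is ignored (no HDMI audio). */
component_add(dev_priv->drm.dev, &i915_audio_component_bind_ops);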

Motivation:
Atom platforms (e.g. Valleyview and CherryTrail) integrate a DMA-based
interface as an alternative to the traditional HDaudio path. While this
mode is unrelated to the LPE aka SST audio engine, the documentation refers
to this mode as LPE so we keep this notation for the sake of consistency.

The interface is handled by a separate standalone driver maintained in the
ALSA subsystem for simplicity. To minimize the interaction between the two
subsystems, a bridge is set up between the hdmi-lpe-audio and i915:
1. Create a platform device to share MMIO/IRQ resources
2. Make the platform device child of i915 device for runtime PM.
3. Create IRQ chip to forward the LPE audio irqs.
The hdmi-lpe-audio driver then probes the LPE audio device and creates a new
sound card.

Threats:
Due to the restriction in the Linux platform device model, users need to
manually uninstall the hdmi-lpe-audio driver before uninstalling the i915
module, otherwise we might run into use-after-free issues after i915 removes
the platform device: even though the hdmi-lpe-audio driver is released, the
module is still in “installed” status.

Implementation:
The MMIO/REG platform resources are created according to the registers
specification.
When forwarding LPE audio irqs, the flow control handler selection depends
on the platform; for example, on Valleyview handle_simple_irq() is enough.

Since Haswell the display controller supports Panel Self-Refresh on display
panels which have a remote frame buffer (RFB) implemented according to the
PSR spec in eDP 1.3. PSR allows the display to go to lower standby states
when the system is idle but the display is on, as it eliminates display
refresh requests to DDR memory completely, as long as the frame buffer for
that display is unchanged.

Panel Self Refresh must be supported by both Hardware (source) and
Panel (sink).

PSR saves power by caching the framebuffer in the panel RFB, which allows us
to power down the link and memory controller. For DSI panels the same idea
is called “manual mode”.

The implementation uses the hardware-based PSR support which automatically
enters/exits self-refresh mode. The hardware takes care of sending the
required DP aux message and could even retrain the link (that part isn’t
enabled yet though). The hardware also keeps track of any frontbuffer
changes to know when to exit self-refresh mode again. Unfortunately that
part doesn’t work too well, hence why the i915 PSR support uses the
software frontbuffer tracking to make sure it doesn’t miss a screen
update. For this integration intel_psr_invalidate() and intel_psr_flush()
get called by the frontbuffer tracking code. Note that because of locking
issues the self-refresh re-enable code is done from a work queue, which
must be correctly synchronized/cancelled when shutting down the pipe.

Since the hardware frontbuffer tracking has gaps we need to integrate
with the software frontbuffer tracking. This function gets called every
time frontbuffer rendering starts and a buffer gets dirtied. PSR must be
disabled if the frontbuffer mask contains a buffer relevant to PSR.

Dirty frontbuffers relevant to PSR are tracked in busy_frontbuffer_bits.

Since the hardware frontbuffer tracking has gaps we need to integrate
with the software frontbuffer tracking. This function gets called every
time frontbuffer rendering has completed and flushed out to memory. PSR
can be enabled again if no other frontbuffer relevant to PSR is dirty.

Dirty frontbuffers relevant to PSR are tracked in busy_frontbuffer_bits.

FBC tries to save memory bandwidth (and so power consumption) by
compressing the amount of memory used by the display. It is totally
transparent to user space and completely handled in the kernel.

The benefits of FBC are mostly visible with solid backgrounds and
variation-less patterns. It comes from keeping the memory footprint small
and having fewer memory pages opened and accessed for refreshing the display.

i915 is responsible for reserving stolen memory for FBC and configuring its
offset in the proper registers. The hardware takes care of all the
compression/decompression. However there are many known cases where we have
to forcibly disable it to allow proper screen updates.

This function checks if the given CRTC was chosen for FBC, then enables it if
possible. Notice that it doesn’t activate FBC. It is valid to call
intel_fbc_enable multiple times for the same pipe without an
intel_fbc_disable in the middle, as long as it is deactivated.

Without FBC, most underruns are harmless and don’t really cause too many
problems, except for an annoying message on dmesg. With FBC, underruns can
become black screens or even worse, especially when paired with bad
watermarks. So in order for us to be on the safe side, completely disable FBC
in case we ever detect a FIFO underrun on any pipe. An underrun on any pipe
already suggests that watermarks may be bad, so try to be as safe as
possible.

The FBC code needs to track CRTC visibility since the older platforms can’t
have FBC enabled while multiple pipes are used. This function does the
initial setup at driver load to make sure FBC is matching the real hardware.

Display Refresh Rate Switching (DRRS) is a power conservation feature
which enables switching between low and high refresh rates,
dynamically, based on the usage scenario. This feature is applicable
for internal panels.

Indication that the panel supports DRRS is given by the panel EDID, which
would list multiple refresh rates for one resolution.

DRRS is of 2 types - static and seamless.
Static DRRS involves changing refresh rate (RR) by doing a full modeset
(may appear as a blink on screen) and is used in dock-undock scenario.
Seamless DRRS involves changing RR without any visual effect to the user
and can be used during normal system usage. This is done by programming
certain registers.

Support for static/seamless DRRS may be indicated in the VBT based on
inputs from the panel spec.

DRRS saves power by switching to low RR based on usage scenarios.

The implementation is based on the frontbuffer tracking implementation. When
there is a disturbance on the screen triggered by user activity or a periodic
system activity, DRRS is disabled (RR is changed to high RR). When there is
no movement on screen, after a timeout of 1 second, a switch to low RR is
made.

This function gets called when refresh rate (RR) has to be changed from
one frequency to another. Switches can be between high and low RR
supported by the panel or to any other RR based on media playback (in
this case, RR value needs to be passed from user space).

This function gets called every time rendering on the given planes has
completed or a flip on a crtc is completed. So DRRS should be upclocked
(LOW_RR -> HIGH_RR), and idleness detection should be started again,
if no other planes are dirty.

Dirty frontbuffers relevant to DRRS are tracked in busy_frontbuffer_bits.

VLV, CHV and BXT have slightly peculiar display PHYs for driving DP/HDMI
ports. DPIO is the name given to such a display PHY. These PHYs
don’t follow the standard programming model using direct MMIO
registers, and instead their registers must be accessed through IOSF
sideband. VLV has one such PHY for driving ports B and C, and CHV
adds another PHY for driving port D. Each PHY responds to a specific
IOSF-SB port.

Each display PHY is made up of one or two channels. Each channel
houses a common lane part which contains the PLL and other common
logic. The CH0 common lane also contains the IOSF-SB logic for the
Common Register Interface (CRI), i.e. the DPIO registers. The CRI clock
must be running when any DPIO registers are accessed.

In addition to having their own registers, the PHYs are also
controlled through some dedicated signals from the display
controller. These include PLL reference clock enable, PLL enable,
and CRI clock selection, for example.

Each channel also has two splines (also called data lanes), and
each spline is made up of one Physical Access Coding Sub-Layer
(PCS) block and two TX lanes. So each channel has two PCS blocks
and four TX lanes. The TX lanes are used as DP lanes or TMDS
data/clock pairs depending on the output type.

Additionally the PHY also contains an AUX lane with AUX blocks
for each channel. This is used for DP AUX communication, but
this fact isn’t really relevant for the driver since AUX is
controlled from the display controller side. No DPIO registers
need to be accessed during AUX communication.

Generally on VLV/CHV the common lane corresponds to the pipe and
the spline (PCS/TX) corresponds to the port.

For dual channel PHY (VLV/CHV):

pipe A == CMN/PLL/REF CH0

pipe B == CMN/PLL/REF CH1

port B == PCS/TX CH0

port C == PCS/TX CH1

This is especially important when we cross the streams,
i.e. drive port B with pipe B, or port C with pipe A.

For single channel PHY (CHV):

pipe C == CMN/PLL/REF CH0

port D == PCS/TX CH0

On BXT the entire PHY channel corresponds to the port. That means
the PLL is also now associated with the port rather than the pipe,
and so the clock needs to be routed to the appropriate transcoder.
Port A PLL is directly connected to transcoder EDP and port B/C
PLLs can be routed to any transcoder A/B/C.

Display Context Save and Restore (CSR) firmware support was added from gen9
onwards to drive the newly added DMC (Display Microcontroller) in the display
engine, which saves and restores the state of the display engine when it
enters a low-power state and comes back to normal.

CSR firmware is read from a .bin file and kept in internal memory one time.
Every time the display comes back from a low power state this function is
called to copy the firmware from internal memory to registers.

The Video BIOS Table, or VBT, provides platform and board specific
configuration information to the driver that is not discoverable or available
through other means. The configuration is mostly related to display
hardware. The VBT is available via the ACPI OpRegion or, on older systems, in
the PCI ROM.

The VBT consists of a VBT Header (defined as struct vbt_header), a BDB
Header (struct bdb_header), and a number of BIOS Data Blocks (BDB) that
contain the actual configuration information. The VBT Header, and thus the
VBT, begins with the “$VBT” signature. The VBT Header contains the offset of
the BDB Header. The data blocks are concatenated after the BDB Header. The
data blocks have a 1-byte Block ID, 2-byte Block Size, and Block Size bytes
of data. (Block 53, the MIPI Sequence Block, is an exception.)
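A hedged sketch of walking the data blocks given that layout; bounds checks
are simplified, Block 53’s special sizing is ignored, and the function name
is illustrative:

#include <asm/unaligned.h>

/* Find a BDB block by ID. "bdb" points at the BDB header, "header_size"
 * is taken from that header, and "bdb_size" bounds the whole BDB region. */
static const u8 *find_bdb_block(const u8 *bdb, size_t bdb_size,
				size_t header_size, u8 wanted_id)
{
	size_t offset = header_size; /* blocks are concatenated after the header */

	while (offset + 3 <= bdb_size) {
		u8 id = bdb[offset];
		u16 size = get_unaligned_le16(bdb + offset + 1);

		if (offset + 3 + size > bdb_size)
			break;
		if (id == wanted_id)
			return bdb + offset + 3; /* Block Size bytes of data */
		offset += 3 + size;
	}
	return NULL;
}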

The driver parses the VBT during load. The relevant information is stored in
driver private data for ease of use, and the actual VBT is not read after
that.

Parse and initialize settings from the Video BIOS Tables (VBT). If the VBT
was not found in ACPI OpRegion, try to find it in PCI ROM first. Also
initialize some defaults if the VBT is not present at all.

The display engine uses several different clocks to do its work. There
are two main clocks involved that aren’t directly related to the actual
pixel clock or any symbol/bit clock of the actual output port. These
are the core display clock (CDCLK) and RAWCLK.

CDCLK clocks most of the display pipe logic, and thus its frequency
must be high enough to support the rate at which pixels are flowing
through the pipes. Downscaling must also be accounted for, as that
increases the effective pixel rate.

On several platforms the CDCLK frequency can be changed dynamically
to minimize power consumption for a given display configuration.
Typically changes to the CDCLK frequency require all the display pipes
to be shut down while the frequency is being changed.

On SKL+ the DMC will toggle the CDCLK off/on during DC5/6 entry/exit.
DMC will not change the active CDCLK frequency however, so that part
will still be performed by the driver directly.

RAWCLK is a fixed frequency clock, often used by various auxiliary
blocks such as AUX CH or backlight PWM. Hence the only thing we
really need to know about RAWCLK is its frequency so that various
dividers can be programmed correctly.

Initialize CDCLK. This consists mainly of initializing dev_priv->cdclk.hw and
sanitizing the state of the hardware if needed. This is generally done only
during the display core initialization sequence, after which the DMC will
take care of turning CDCLK off/on as needed.

Similarly to the atomic helpers this function does a complete swap,
i.e. it also puts the old state into state. This is used by the commit
code to determine how CDCLK has changed (for instance did it increase or
decrease).

Display PLLs used for driving outputs vary by platform. While some have
per-pipe or per-encoder dedicated PLLs, others allow the use of any PLL
from a pool. In the latter scenario, it is possible that multiple pipes
share a PLL if their configurations match.

This file provides an abstraction over display PLLs. The function
intel_shared_dpll_init() initializes the PLLs for the given platform. The
users of a PLL are tracked and that tracking is integrated with the atomic
modeset interface. During an atomic operation, a PLL can be requested for a
given CRTC and encoder configuration by calling intel_get_shared_dpll() and
a previously used PLL can be released with intel_release_shared_dpll().
Changes to the users are first staged in the atomic state, and then made
effective by calling intel_shared_dpll_swap_state() during the atomic
commit phase.

Find an appropriate DPLL for the given CRTC and encoder combination. A
reference from the crtc_state to the returned pll is registered in the
atomic state. That configuration is made effective by calling
intel_shared_dpll_swap_state(). The reference should be released by calling
intel_release_shared_dpll().

hw_state: hardware configuration for the DPLL stored in
struct intel_dpll_hw_state.

Description

This structure holds an atomic state for the DPLL, that can represent
either its current state (in struct intel_shared_dpll) or a desired
future state which would be applied by an atomic mode set (stored in
a struct intel_atomic_state).

RCS engine is for rendering 3D and performing compute, this is named
I915_EXEC_RENDER in user space.

BCS is a blitting (copy) engine, this is named I915_EXEC_BLT in user
space.

VCS is a video encode and decode engine, this is named I915_EXEC_BSD
in user space.

VECS is video enhancement engine, this is named I915_EXEC_VEBOX in user
space.

The enumeration I915_EXEC_DEFAULT does not refer to a specific engine;
instead it is to be used by user space to specify a default rendering
engine (for 3D) that may or may not be the same as RCS.

The Intel GPU family is a family of integrated GPUs using Unified
Memory Access. For having the GPU “do work”, user space will feed the
GPU batch buffers via one of the ioctls DRM_IOCTL_I915_GEM_EXECBUFFER2
or DRM_IOCTL_I915_GEM_EXECBUFFER2_WR. Most such batchbuffers will
instruct the GPU to perform work (for example rendering) and that work
needs memory from which to read and memory to which to write. All memory
is encapsulated within GEM buffer objects (usually created with the ioctl
DRM_IOCTL_I915_GEM_CREATE). An ioctl providing a batchbuffer for the GPU
to execute will also list all GEM buffer objects that the batchbuffer reads
and/or writes. For implementation details of memory management see
GEM BO Management Implementation Details.

The i915 driver allows user space to create a context via the ioctl
DRM_IOCTL_I915_GEM_CONTEXT_CREATE which is identified by a 32-bit
integer. Such a context should be viewed by user-space as -loosely-
analogous to the idea of a CPU process of an operating system. The i915
driver guarantees that commands issued to a fixed context are to be
executed so that writes of a previously issued command are seen by
reads of following commands. Actions issued between different contexts
(even if from the same file descriptor) are NOT given that guarantee
and the only way to synchronize across contexts (even from the same
file descriptor) is through the use of fences. At least as far back as
Gen4, a context also carries with it a GPU HW context; the HW context is
essentially (most of it, at least) the state of a GPU. In addition to the
ordering guarantees, the kernel will restore GPU state via the HW context
when commands are issued to a context; this saves user space the need to
restore (most of it, at least) the GPU state at the start of each
batchbuffer. The non-deprecated ioctls to submit batchbuffer
work can pass that ID (in the lower bits of drm_i915_gem_execbuffer2::rsvd1)
to identify what context to use with the command.
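A hedged userspace sketch of that flow using the uapi in <drm/i915_drm.h>;
the function name is illustrative, “fd” is an already-open DRM device file
descriptor, and all buffer setup is elided:

#include <sys/ioctl.h>
#include <drm/i915_drm.h>

static void submit_on_new_context(int fd, struct drm_i915_gem_execbuffer2 *execbuf)
{
	struct drm_i915_gem_context_create create = {0};

	/* Create a context; the 32-bit ID comes back in create.ctx_id. */
	ioctl(fd, DRM_IOCTL_I915_GEM_CONTEXT_CREATE, &create);

	/* The context ID travels in the low bits of rsvd1. */
	i915_execbuffer2_set_context_id(*execbuf, create.ctx_id);
	ioctl(fd, DRM_IOCTL_I915_GEM_EXECBUFFER2, execbuf);
}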

The GPU has its own memory management and address space. The kernel
driver maintains the memory translation table for the GPU. For older
GPUs (i.e. those before Gen8), there is a single global such translation
table, a global Graphics Translation Table (GTT). For newer generation
GPUs each context has its own translation table, called Per-Process
Graphics Translation Table (PPGTT). Of important note is that although
PPGTT is named per-process it is actually per context. When user space
submits a batchbuffer, the kernel walks the list of GEM buffer objects
used by the batchbuffer and guarantees that not only is the memory of
each such GEM buffer object resident but it is also present in the
(PP)GTT. If the GEM buffer object is not yet placed in the (PP)GTT,
then it is given an address. Two consequences of this are: the kernel
needs to edit the batchbuffer submitted to write the correct value of
the GPU address when a GEM BO is assigned a GPU address and the kernel
might evict a different GEM BO from the (PP)GTT to make address room
for another GEM BO. Consequently, the ioctls submitting a batchbuffer
for execution also include a list of all locations within buffers that
refer to GPU-addresses so that the kernel can edit the buffer correctly.
This process is dubbed relocation.

This section documents the interface functions for evicting buffer
objects to make space available in the virtual gpu address spaces. Note
that this is mostly orthogonal to shrinking buffer objects caches, which
has the goal to make main memory (shared with the gpu through the
unified memory architecture) available.

This function will try to evict vmas until a free space satisfying the
requirements is found. Callers must check first whether any such hole exists
already before calling this function.

This function is used by the object/vma binding code.

Since this function is only used to free up virtual address space it only
ignores pinned vmas, and not objects where the backing storage itself is
pinned. Hence obj->pages_pin_count does not protect against eviction.

To clarify: This is for freeing up virtual address space, not for freeing
memory in e.g. the shrinker.

This section documents the interface function for shrinking memory usage
of buffer object caches. Shrinking is used to make main memory
available. Note that this is mostly orthogonal to evicting buffer
objects, which has the goal to make space in gpu virtual address spaces.

This function is the main interface to the shrinker. It will try to release
up to target pages of main memory backing storage from buffer objects.
Selection of the specific caches can be done with flags. This is e.g. useful
when purgeable objects should be removed from caches preferentially.

Note that it’s not guaranteed that the released amount is actually available
as free system memory - the pages might still be in use due to other reasons
(like cpu mmaps) or the mm core may have reused them before we could grab them.
Therefore code that needs to explicitly shrink buffer objects caches (e.g. to
avoid deadlocks in memory reclaim) must fall back to i915_gem_shrink_all().

Also note that any kind of pinning (both per-vma address space pins and
backing storage pins at the buffer object level) result in the shrinker code
having to skip the object.

This is a simple wrapper around i915_gem_shrink() to aggressively shrink all
caches completely. It also first waits for and retires all outstanding
requests to also be able to release backing storage for active objects.

This should only be used in code to intentionally quiesce the gpu or as a
last-ditch effort when memory seems to have run out.

Motivation:
Certain OpenGL features (e.g. transform feedback, performance monitoring)
require userspace code to submit batches containing commands such as
MI_LOAD_REGISTER_IMM to access various registers. Unfortunately, some
generations of the hardware will noop these commands in “unsecure” batches
(which includes all userspace batches submitted via i915) even though the
commands may be safe and represent the intended programming model of the
device.

The software command parser is similar in operation to the command parsing
done in hardware for unsecure batches. However, the software parser allows
some operations that would be noop’d by hardware, if the parser determines
the operation is safe, and submits the batch as “secure” to prevent hardware
parsing.

Threats:
At a high level, the hardware (and software) checks attempt to prevent
granting userspace undue privileges. There are three categories of privilege.

First, commands which are explicitly defined as privileged or which should
only be used by the kernel driver. The parser generally rejects such
commands, though it may allow some from the drm master process.

Second, commands which access registers. To support correct/enhanced
userspace functionality, particularly certain OpenGL extensions, the parser
provides a whitelist of registers which userspace may safely access (for both
normal and drm master processes).

The majority of the problematic commands fall in the MI_* range, with only a
few specific commands on each engine (e.g. PIPE_CONTROL and MI_FLUSH_DW).

Implementation:
Each engine maintains tables of commands and registers which the parser
uses in scanning batch buffers submitted to that engine.

Since the set of commands that the parser must check for is significantly
smaller than the number of commands supported, the parser tables contain only
those commands required by the parser. This generally works because command
opcode ranges have standard command length encodings. So for commands that
the parser does not need to check, it can easily skip them. This is
implemented via a per-engine length decoding vfunc.

Unfortunately, there are a number of commands that do not follow the standard
length encoding for their opcode range, primarily amongst the MI_* commands.
To handle this, the parser provides a way to define explicit “skip” entries
in the per-engine command tables.

Other command table entries map fairly directly to high level categories
mentioned above: rejected, master-only, register whitelist. The parser
implements a number of checks, including the privileged memory checks, via a
general bitmasking mechanism.

In order to submit batch buffers as ‘secure’, the software command parser
must ensure that a batch buffer cannot be modified after parsing. It does
this by copying the user provided batch buffer contents to a kernel owned
buffer from which the hardware will actually execute, and by carefully
managing the address space bindings for such buffers.

The batch pool framework provides a mechanism for the driver to manage a
set of scratch buffers to use for this purpose. The framework can be
extended to support other use cases should they arise.

Userspace submits commands to be executed on the GPU as an instruction
stream within a GEM object we call a batchbuffer. These instructions may
refer to other GEM objects containing auxiliary state such as kernels,
samplers, render targets and even secondary batchbuffers. Userspace does
not know where in the GPU memory these objects reside and so before the
batchbuffer is passed to the GPU for execution, those addresses in the
batchbuffer and auxiliary objects are updated. This is known as relocation,
or patching. To try and avoid having to relocate each object on the next
execution, userspace is told the location of those objects in this pass,
but this remains just a hint as the kernel may choose a new location for
any object in the future.

At the level of talking to the hardware, submitting a batchbuffer for the
GPU to execute is to add content to a buffer from which the HW
command streamer is reading.

Add a command to load the HW context. For Logical Ring Contexts, i.e.
Execlists, this command is not placed on the same buffer as the
remaining items.

Add a command to invalidate caches to the buffer.

Add a batchbuffer start command to the buffer; the start command is
essentially a token together with the GPU address of the batchbuffer
to be executed.

Add a pipeline flush to the buffer.

Add a memory write command to the buffer to record when the GPU
is done executing the batchbuffer. The memory write writes the
global sequence number of the request, i915_request::global_seqno;
the i915 driver uses the current value in the register to determine
if the GPU has completed the batchbuffer.

Add a user interrupt command to the buffer. This command instructs
the GPU to issue an interrupt when the command, pipeline flush and
memory write are completed.

Inform the hardware of the additional commands added to the buffer
(by updating the tail pointer).

Processing an execbuf ioctl is conceptually split up into a few phases.

Validation - Ensure all the pointers, handles and flags are valid.

Reservation - Assign GPU address space for every object

Relocation - Update any addresses to point to the final locations

Serialisation - Order the request with respect to its dependencies

Construction - Construct a request to execute the batchbuffer

Submission (at some point in the future execution)

Reserving resources for the execbuf is the most complicated phase. We
neither want to have to migrate the object in the address space, nor do
we want to have to update any relocations pointing to this object. Ideally,
we want to leave the object where it is and for all the existing relocations
to match. If the object is given a new address, or if userspace thinks the
object is elsewhere, we have to parse all the relocation entries and update
the addresses. Userspace can set the I915_EXEC_NO_RELOC flag to hint that
all the target addresses in all of its objects match the value in the
relocation entries and that they all match the presumed offsets given by the
list of execbuffer objects. Using this knowledge, we know that if we haven’t
moved any buffers, all the relocation entries are valid and we can skip
the update. (If userspace is wrong, the likely outcome is an impromptu GPU
hang.) The requirements for using I915_EXEC_NO_RELOC are (see the sketch
after this list):

The addresses written in the objects must match the corresponding
reloc.presumed_offset which in turn must match the corresponding
execobject.offset.

Any render targets written to in the batch must be flagged with
EXEC_OBJECT_WRITE.

To avoid stalling, execobject.offset should match the current
address of that object within the active context.
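A hedged userspace sketch of those requirements; handles and addresses are
placeholders, and the uapi structs come from <drm/i915_drm.h>:

__u32 target_handle = 1;          /* placeholder BO handle */
__u64 presumed_addr = 0x100000;   /* placeholder GPU address */

struct drm_i915_gem_relocation_entry reloc = {
	.target_handle   = target_handle,
	.offset          = 128,           /* where the pointer lives in the batch */
	.presumed_offset = presumed_addr, /* what the batch already points at */
	.read_domains    = I915_GEM_DOMAIN_RENDER,
	.write_domain    = 0,
};

struct drm_i915_gem_exec_object2 target = {
	.handle = target_handle,
	.offset = presumed_addr,          /* must match presumed_offset above */
	.flags  = EXEC_OBJECT_WRITE,      /* required if the batch writes to it */
};

execbuf.flags |= I915_EXEC_NO_RELOC;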

The reservation is done in multiple phases. First we try to keep any
object already bound in its current location - so long as it meets the
constraints imposed by the new execbuffer. Any object left unbound after the
first pass is then fitted into any available idle space. If an object does
not fit, all objects are removed from the reservation and the process rerun
after sorting the objects into a priority order (more difficult to fit
objects are tried first). Failing that, the entire VM is cleared and we try
to fit the execbuf one last time before concluding that it simply will not
fit.

A small complication to all of this is that we allow userspace not only to
specify an alignment and a size for the object in the address space, but
we also allow userspace to specify the exact offset. Such objects are
simpler to place (the location is known a priori); all we have to do is make
sure the space is available.

Once all the objects are in place, patching up the buried pointers to point
to the final locations is a fairly simple job of walking over the relocation
entry arrays, looking up the right address and rewriting the value into
the object. Simple! ... The relocation entries are stored in user memory
and so to access them we have to copy them into a local buffer. That copy
has to avoid taking any pagefaults as they may lead back to a GEM object
requiring the struct_mutex (i.e. recursive deadlock). So once again we split
the relocation into multiple passes. First we try to do everything within an
atomic context (avoid the pagefaults) which requires that we never wait. If
we detect that we may wait, or if we need to fault, then we have to fall
back to a slower path. The slowpath has to drop the mutex. (Can you hear alarm
bells yet?) Dropping the mutex means that we lose all the state we have
built up so far for the execbuf and we must reset any global data. However,
we do leave the objects pinned in their final locations - which is a
potential issue for concurrent execbufs. Once we have left the mutex, we can
allocate and copy all the relocation entries into a large array at our
leisure, reacquire the mutex, reclaim all the objects and other state and
then proceed to update any incorrect addresses with the objects.

As we process the relocation entries, we maintain a record of whether the
object is being written to. Using NORELOC, we expect userspace to provide
this information instead. We also check whether we can skip the relocation
by comparing the expected value inside the relocation entry with the target’s
final address. If they differ, we have to map the current object and rewrite
the 4 or 8 byte pointer within.

Serialising an execbuf is quite simple according to the rules of the GEM
ABI. Execution within each context is ordered by the order of submission.
Writes to any GEM object are in order of submission and are exclusive. Reads
from a GEM object are unordered with respect to other reads, but ordered by
writes. A write submitted after a read cannot occur before the read, and
similarly any read submitted after a write cannot occur before the write.
Writes are ordered between engines such that only one write occurs at any
time (completing any reads beforehand) - using semaphores where available
and CPU serialisation otherwise. Other GEM accesses obey the same rules: any
write (either via mmaps using set-domain, or via pwrite) must flush all GPU
reads before starting, and any read (either using set-domain or pread) must
flush all GPU writes before starting. (Note we only employ a barrier before,
we currently rely on userspace not concurrently starting a new execution
whilst reading or writing to an object. This may be an advantage or not
depending on how much you trust userspace not to shoot themselves in the
foot.) Serialisation may just result in the request being inserted into
a DAG awaiting its turn, but most simple is to wait on the CPU until
all dependencies are resolved.

After all of that, it is just a matter of closing the request and handing it
to the hardware (well, leaving it in a queue to be executed). However, we
also offer the ability for batchbuffers to be run with elevated privileges so
that they can access otherwise hidden registers. (Used to adjust L3 cache etc.)
Before any batch is given extra privileges we first must check that it
contains no nefarious instructions, we check that each instruction is from
our whitelist and all registers are also from an allowed list. We first
copy the user’s batchbuffer to a shadow (so that the user doesn’t have
access to it, either by the CPU or GPU as we scan it) and then parse each
instruction. If everything is ok, we set a flag telling the hardware to run
the batchbuffer in trusted mode, otherwise the ioctl is rejected.

Motivation:
GEN8 brings an expansion of the HW contexts: “Logical Ring Contexts”.
These expanded contexts enable a number of new abilities, especially
“Execlists” (also implemented in this file).

One of the main differences with the legacy HW contexts is that logical
ring contexts incorporate many more things in the context’s state, like
PDPs or ringbuffer control registers:

The reason why PDPs are included in the context is straightforward: as
PPGTTs (per-process GTTs) are actually per-context, having the PDPs
contained there means you don’t need to do a ppgtt->switch_mm yourself;
instead, the GPU will do it for you on the context switch.

But, what about the ringbuffer control registers (head, tail, etc..)?
shouldn’t we just need a set of those per engine command streamer? This is
where the name “Logical Rings” starts to make sense: by virtualizing the
rings, the engine cs shifts to a new “ring buffer” with every context
switch. When you want to submit a workload to the GPU you: A) choose your
context, B) find its appropriate virtualized ring, C) write commands to it
and then, finally, D) tell the GPU to switch to that context.

Instead of the legacy MI_SET_CONTEXT, the way you tell the GPU to switch
to a context is via a context execution list, ergo “Execlists”.

LRC implementation:
Regarding the creation of contexts, we have:

One global default context.

One local default context for each opened fd.

One local extra context for each context create ioctl call.

Now that ringbuffers belong per-context (and not per-engine, like before)
and that contexts are uniquely tied to a given engine (and not reusable,
like before) we need:

One ringbuffer per-engine inside each context.

One backing object per-engine inside each context.

The global default context starts its life with these new objects fully
allocated and populated. The local default context for each opened fd is
more complex, because we don’t know at creation time which engine is going
to use them. To handle this, we have implemented a deferred creation of LR
contexts:

The local context starts its life as a hollow or blank holder, that only
gets populated for a given engine once we receive an execbuffer. If later
on we receive another execbuffer ioctl for the same context but a different
engine, we allocate/populate a new ringbuffer and context backing object and
so on.

Finally, regarding local contexts created using the ioctl call: as they are
only allowed with the render ring, we can allocate & populate them right
away (no need to defer anything, at least for now).

Execlists implementation:
Execlists are the new method by which, on gen8+ hardware, workloads are
submitted for execution (as opposed to the legacy, ringbuffer-based, method).
This method works as follows:

When a request is committed, its commands (the BB start and any leading or
trailing commands, like the seqno breadcrumbs) are placed in the ringbuffer
for the appropriate context. The tail pointer in the hardware context is not
updated at this time, but instead, kept by the driver in the ringbuffer
structure. A structure representing this request is added to a request queue
for the appropriate engine: this structure contains a copy of the context’s
tail after the request was written to the ring buffer and a pointer to the
context itself.

If the engine’s request queue was empty before the request was added, the
queue is processed immediately. Otherwise the queue will be processed during
a context switch interrupt. In any case, elements on the queue will get sent
(in pairs) to the GPU’s ExecLists Submit Port (ELSP, for short) with a
globally unique 20-bit submission ID.

When execution of a request completes, the GPU updates the context status
buffer with a context complete event and generates a context switch interrupt.
During the interrupt handling, the driver examines the events in the buffer:
for each context complete event, if the announced ID matches that on the head
of the request queue, then that request is retired and removed from the queue.

After processing, if any requests were retired and the queue is not empty
then a new execution list can be submitted. The two requests at the front of
the queue are next to be submitted but since a context may not occur twice in
an execution list, if subsequent requests have the same ID as the first then
the two requests must be combined. This is done simply by discarding requests
at the head of the queue until either only one request is left (in which case
we use a NULL second context) or the first two requests have unique IDs.

By always executing the first two requests in the queue the driver ensures
that the GPU is kept as busy as possible. In the case where a single context
completes but a second context is still executing, the request for this second
context will be at the head of the queue when we remove the first one. This
request will then be resubmitted along with a new request for a different context,
which will cause the hardware to continue executing the second request and queue
the new request (the GPU detects the condition of a context getting preempted
with the same context and optimizes the context switch flow by not doing
preemption, but just sampling the new tail pointer).
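
The pairing rule above can be condensed into a sketch like the following
(names invented; the real dequeue logic also handles the lite-restore case
just described):

  struct engine;
  struct request { u32 ctx_id; /* ... */ };

  /* queue helpers and elsp_submit() are invented for the sketch */
  static struct request *queue_first(struct engine *e);
  static struct request *queue_second(struct engine *e);
  static void queue_pop(struct engine *e);
  static void elsp_submit(struct engine *e, struct request *p0,
                          struct request *p1);

  static void dequeue_to_elsp(struct engine *engine)
  {
          struct request *port0 = queue_first(engine), *port1;

          if (!port0)
                  return;

          /* A context may not appear twice in an execution list: discard
           * older head requests with the same ID, since the newer tail
           * already covers their work. */
          while ((port1 = queue_second(engine)) &&
                 port1->ctx_id == port0->ctx_id) {
                  queue_pop(engine);      /* drop the older request */
                  port0 = port1;
          }

          elsp_submit(engine, port0, port1); /* port1 may be NULL */
  }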

Historically objects could exist (be bound) in global GTT space only as
singular instances with a view representing all of the object’s backing pages
in a linear fashion. This view will be called a normal view.

To support multiple views of the same object, where the number of mapped
pages is not equal to the backing store, or where the layout of the pages
is not linear, the concept of a GGTT view was added.

One example of an alternative view is a stereo display driven by a single
image. In this case we would have a framebuffer looking like this
(2x2 pages):

12
34

Above would represent a normal GGTT view as normally mapped for GPU or CPU
rendering. In contrast, fed to the display engine would be an alternative
view which could look something like this:

1212
3434

In this example both the size and the layout of pages in the alternative view
are different from the normal view.
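
In terms of the underlying backing-store page indices, the two views of the
2x2 page framebuffer above would map out as follows (illustrative only):

  /* Normal view: each of the four backing pages mapped once, linearly. */
  static const unsigned int normal_order[]    = { 0, 1, 2, 3 };

  /* Alternative view: eight GTT pages referencing the same four backing
   * pages, with each row of the image repeated side by side. */
  static const unsigned int alternate_order[] = { 0, 1, 0, 1,
                                                  2, 3, 2, 3 };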

Implementation and usage

GGTT views are implemented using VMAs and are distinguished via enum
i915_ggtt_view_type and struct i915_ggtt_view.

A new flavour of core GEM functions which work with GGTT bound objects was
added with the _ggtt_ infix, and sometimes with a _view postfix, to avoid
renaming large amounts of code. They take the struct i915_ggtt_view
parameter encapsulating all metadata required to implement a view.

As a helper for callers which are only interested in the normal view, a
globally const i915_ggtt_view_normal singleton instance exists. All old core
GEM API functions, the ones not taking the view parameter, operate on, or
with, the normal GGTT view.
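
A hedged usage sketch (exact signatures have varied between kernel versions;
assume a pin helper shaped roughly like the one below, with obj in scope):

  struct i915_vma *vma;

  /* Pin the object in the GGTT using the normal view; a view-specific
   * struct i915_ggtt_view would be passed here instead for e.g. a
   * rotated or partial view. */
  vma = i915_gem_object_ggtt_pin(obj, &i915_ggtt_view_normal,
                                 0 /* size: whole object */,
                                 0 /* alignment */,
                                 PIN_MAPPABLE);
  if (IS_ERR(vma))
          return PTR_ERR(vma);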

Code wanting to add or use a new GGTT view needs to:

Add a new enum with a suitable name.

Extend the metadata in the i915_ggtt_view structure if required.

Add support to i915_get_vma_pages().

New views are required to build a scatter-gather table from within the
i915_get_vma_pages function. This table is stored in the vma.ggtt_view and
exists for the lifetime of a VMA.

Core API is designed to have copy semantics which means that passed in
struct i915_ggtt_view does not need to be persistent (left around after
calling the core API functions).

The function searches for an existing PPAT entry that matches the required
value. If perfectly matched, the existing PPAT entry is used. If only
partially matched, it checks whether there is any available PPAT index. If
yes, it allocates a new PPAT index for the required entry and updates the
HW. If not, the partially matched entry is used.

Put back the PPAT entry obtained from intel_ppat_get(). If the PPAT index of
the entry is dynamically allocated, its reference count will be decreased.
Once the reference count drops to zero, the PPAT index becomes free again.
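
Pseudocode for that lookup order (struct layout and helper names are
invented, for illustration only):

  struct ppat_entry { u8 value; /* refcount etc. elided */ };
  struct ppat {
          struct ppat_entry entries[8];
          bool used[8];
          int max_entries;
  };

  /* invented helpers for refcounting, matching and HW programming */
  static struct ppat_entry *grab(struct ppat_entry *entry);
  static bool partial_match(u8 have, u8 want);
  static int first_free_index(struct ppat *ppat);
  static struct ppat_entry *alloc_and_update_hw(struct ppat *p, int i, u8 v);

  static struct ppat_entry *ppat_get(struct ppat *ppat, u8 value)
  {
          struct ppat_entry *partial = NULL;
          int i;

          for (i = 0; i < ppat->max_entries; i++) {
                  if (!ppat->used[i])
                          continue;
                  if (ppat->entries[i].value == value)
                          return grab(&ppat->entries[i]); /* perfect match */
                  if (partial_match(ppat->entries[i].value, value))
                          partial = &ppat->entries[i];
          }

          i = first_free_index(ppat);
          if (i >= 0)     /* program a fresh index into the HW */
                  return alloc_and_update_hw(ppat, i, value);

          return grab(partial);   /* settle for the partial match */
  }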

how much space to allocate inside the GTT,
must be #I915_GTT_PAGE_SIZE aligned

u64 offset

where to insert inside the GTT,
must be #I915_GTT_MIN_ALIGNMENT aligned, and the node
(offset + size) must fit within the address space

unsigned long color

color to apply to node, if this node is not from a VMA,
color must be #I915_COLOR_UNEVICTABLE

unsigned int flags

control search and eviction behaviour

Description

i915_gem_gtt_reserve() tries to insert the node at the exact offset inside
the address space (using size and color). If the node does not fit, it
tries to evict any overlapping nodes from the GTT, including any
neighbouring nodes if the colors do not match (to ensure guard pages between
differing domains). See i915_gem_evict_for_node() for the gory details
on the eviction algorithm. #PIN_NONBLOCK may be used to prevent waiting on
evicting active overlapping objects, and any overlapping node that is pinned
or marked as unevictable will also result in failure.

Return

0 on success, -ENOSPC if no suitable hole is found, -EINTR if
asked to wait for eviction and interrupted.
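
A hedged usage sketch (the address-space argument and the surrounding
locking are elided from the parameter list above, and field names may
differ between kernel versions; size and offset are assumed in scope):

  struct drm_mm_node node = {};
  int err;

  err = i915_gem_gtt_reserve(&ggtt->vm, &node,
                             size,    /* #I915_GTT_PAGE_SIZE aligned */
                             offset,  /* exact placement requested */
                             I915_COLOR_UNEVICTABLE, /* not from a VMA */
                             PIN_NONBLOCK); /* fail rather than wait on
                                             * active overlapping nodes */
  if (err) /* typically -ENOSPC: hole occupied and not evictable */
          return err;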

how much space to allocate inside the GTT,
must be #I915_GTT_PAGE_SIZE aligned

u64 alignment

required alignment of starting offset, may be 0 but
if specified, this must be a power-of-two and at least
#I915_GTT_MIN_ALIGNMENT

unsigned long color

color to apply to node

u64 start

start of any range restriction inside GTT (0 for all),
must be #I915_GTT_PAGE_SIZE aligned

u64 end

end of any range restriction inside GTT (U64_MAX for all),
must be #I915_GTT_PAGE_SIZE aligned if not U64_MAX

unsigned int flags

control search and eviction behaviour

Description

i915_gem_gtt_insert() first searches for an available hole into which
it can insert the node. The hole address is aligned to alignment and
its size must then fit entirely within the [start, end] bounds. The
nodes on either side of the hole must match color, or else a guard page
will be inserted between the two nodes (or the node evicted). If no
suitable hole is found, first a randomly selected victim is tested for
eviction, and then the LRU list of objects within the GTT is scanned to
find the first set of replacement nodes to create the hole. Those old
overlapping nodes are evicted from the GTT (and so must be rebound
before any future use). Any node that is currently pinned cannot be
evicted (see i915_vma_pin()). Similarly, if the node’s VMA is currently
active and #PIN_NONBLOCK is specified, that node is also skipped when
searching for an eviction candidate. See i915_gem_evict_something() for
the gory details on the eviction algorithm.

Return

0 on success, -ENOSPC if no suitable hole is found, -EINTR if
asked to wait for eviction and interrupted.
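
And a matching sketch for the search-based variant (same caveats as the
i915_gem_gtt_reserve() example above):

  err = i915_gem_gtt_insert(&ggtt->vm, &node,
                            size,                  /* page aligned */
                            I915_GTT_MIN_ALIGNMENT,
                            0,                     /* default color */
                            0, U64_MAX,            /* no range restriction */
                            PIN_NONBLOCK);
  if (err == -ENOSPC) /* no hole found and eviction failed or skipped */
          return err;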

When mapping objects through the GTT, userspace wants to be able to write
to them without having to worry about swizzling if the object is tiled.
This function walks the fence regs looking for a free one for obj,
stealing one if it can’t find any.

It then sets up the reg based on the object’s properties: address, pitch
and tiling format.

Restore the hw fence state to match the software tracking again, to be called
after a gpu reset and on resume. Note that on runtime suspend we only cancel
the fences, to be reacquired by the user later.

This function saves the bit 17 of each page frame number so that swizzling
can be fixed up later on with i915_gem_object_do_bit_17_swizzle(). This must
be called before the backing storage can be unpinned.

Important to avoid confusion: “fences” in the i915 driver are not execution
fences used to track command completion but hardware detiler objects which
wrap a given range of the global GTT. Each platform has only a fairly limited
set of these objects.

Fences are used to detile GTT memory mappings. They’re also connected to the
hardware frontbuffer render tracking and hence interact with frontbuffer
compression. Furthermore on older platforms fences are required for tiled
objects used by the display engine. They can also be used by the render
engine - they’re required for blitter commands and are optional for render
commands. But on gen4+ both display (with the exception of fbc) and rendering
have their own tiling state bits and don’t need fences.

Also note that fences only support X and Y tiling and hence can’t be used for
the fancier new tiling formats like W, Ys and Yf.

Finally note that because fences are such a restricted resource they’re
dynamically associated with objects. Furthermore fence state is committed to
the hardware lazily to avoid unnecessary stalls on gen2/3. Therefore code must
explicitly call i915_gem_object_get_fence() to synchronize fencing status
for cpu access. Also note that some code wants an unfenced view, for those
cases the fence can be removed forcefully with i915_gem_object_put_fence().

Internally these functions will synchronize with userspace access by removing
CPU ptes into GTT mmaps (not the GTT ptes themselves) as needed.

The idea behind tiling is to increase cache hit rates by rearranging
pixel data so that a group of pixel accesses are in the same cacheline.
The performance improvement from doing this on the back/depth buffer is on
the order of 30%.

Intel architectures make this somewhat more complicated, though, by
adjustments made to addressing of data when the memory is in interleaved
mode (matched pairs of DIMMS) to improve memory bandwidth.
For interleaved memory, the CPU sends every sequential 64 bytes
to an alternate memory channel so it can get the bandwidth from both.

The GPU also rearranges its accesses for increased bandwidth to interleaved
memory, and it matches what the CPU does for non-tiled. However, when tiled
it does it a little differently, since one walks addresses not just in the
X direction but also Y. So, along with alternating channels when bit
6 of the address flips, it also alternates when other bits flip – Bits 9
(every 512 bytes, an X tile scanline) and 10 (every two X tile scanlines)
are common to both the 915 and 965-class hardware.

The CPU also sometimes XORs in higher bits as well, to improve
bandwidth doing strided access like we do so frequently in graphics. This
is called “Channel XOR Randomization” in the MCH documentation. The result
is that the CPU is XORing in either bit 11 or bit 17 to bit 6 of its address
decode.

All of this bit 6 XORing has an effect on our memory management,
as we need to make sure that the 3d driver can correctly address object
contents.

If we don’t have interleaved memory, all tiling is safe and no swizzling is
required.

When bit 17 is XORed in, we simply refuse to tile at all. Bit
17 is not just a page offset, so as we page an object out and back in,
individual pages in it will have different bit 17 addresses, resulting in
each 64 bytes being swapped with its neighbor!

Otherwise, if interleaved, we have to tell the 3d driver what the address
swizzling it needs to do is, since it’s writing with the CPU to the pages
(bit 6 and potentially bit 11 XORed in), and the GPU is reading from the
pages (bit 6, 9, and 10 XORed in), resulting in a cumulative bit swizzling
required by the CPU of XORing in bit 6, 9, 10, and potentially 11, in order
to match what the GPU expects.
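
As a purely illustrative helper, the net bit-6 swizzle the 3d driver has to
apply to a CPU linear address would look like this (the driver itself
expresses this via its bit_6_swizzle modes rather than a function like
the one below):

  /* XOR bits 9 and 10 (what the tiled GPU access does), and bit 11 when
   * channel XOR randomization uses it, into bit 6 of the address. */
  static u32 swizzle_addr_bit6(u32 addr, bool cpu_xors_bit11)
  {
          u32 bit6 = (addr >> 9) ^ (addr >> 10);

          if (cpu_xors_bit11)
                  bit6 ^= addr >> 11;

          return addr ^ ((bit6 & 1) << 6);
  }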

In principle GEM doesn’t care at all about the internal data layout of an
object, and hence it also doesn’t care about tiling or swizzling. There are
two exceptions:

For X and Y tiling the hardware provides detilers for CPU access, so called
fences. Since there’s only a limited amount of them the kernel must manage
these, and therefore userspace must tell the kernel the object tiling if it
wants to use fences for detiling.

Gen3 and gen4 platforms have a swizzling pattern for tiled objects which
depends upon the physical page frame number. When swapping such objects the
page frame number might change and the kernel must be able to fix this up,
and hence needs to know the tiling. Note that on a subset of platforms with
asymmetric memory channel population the swizzling pattern changes in an
unknown way, and for those the kernel simply forbids swapping completely.

Since neither of these applies for the new tiling layouts on modern platforms,
like W, Ys and Yf tiling, GEM only allows object tiling to be set to X or Y
tiled. Anything else can be handled in userspace entirely without the
kernel’s involvement.

The layout of the WOPCM is fixed after writing to the GuC WOPCM size and
offset registers, whose values are calculated and determined by the HuC/GuC
firmware size and a set of hardware requirements/restrictions.

GuC client:
An intel_guc_client refers to a submission path through GuC. Currently, there
are two clients. One of them (the execbuf_client) is charged with all
submissions to the GuC, the other one (preempt_client) is responsible for
preempting the execbuf_client. This struct is the owner of a doorbell, a
process descriptor and a workqueue (all of them inside a single gem object
that contains all required pages for these elements).

GuC stage descriptor:
During initialization, the driver allocates a static pool of 1024 such
descriptors, and shares them with the GuC.
Currently, there exists a 1:1 mapping between an intel_guc_client and a
guc_stage_desc (via the client’s stage_id), so effectively only one
gets used. This stage descriptor lets the GuC know about the doorbell,
workqueue and process descriptor. Theoretically, it also lets the GuC
know about our HW contexts (context ID, etc...), but we actually
employ a kind of submission where the GuC uses the LRCA sent via the work
item instead (the single guc_stage_desc associated with the execbuf client
contains information about the default kernel context only, but this is
essentially unused). This is called a “proxy” submission.

The Scratch registers:
There are 16 MMIO-based registers starting from 0xC180. The kernel driver writes
a value to the action register (SOFT_SCRATCH_0) along with any data. It then
triggers an interrupt on the GuC via another register write (0xC4C8).
Firmware writes a success/fail code back to the action register after
processing the request. The kernel driver polls waiting for this update and
then proceeds.
See intel_guc_send()
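
Condensed into a sketch, that handshake looks like the following (the
register offsets come from the text above; the helpers and macro names are
invented for illustration):

  #define SOFT_SCRATCH(n)     (0xc180 + (n) * 4) /* 16 regs from 0xC180 */
  #define GUC_IRQ_TRIGGER_REG 0xc4c8             /* per the text above */

  /* invented MMIO and polling helpers */
  static void mmio_write(u32 reg, u32 val);
  static int wait_for_response(u32 reg, u32 *status);
  static int status_to_errno(u32 status);

  static int guc_send_action(const u32 *action, u32 len)
  {
          u32 status, i;

          /* Request in SOFT_SCRATCH_0, any payload in the following regs. */
          for (i = 0; i < len; i++)
                  mmio_write(SOFT_SCRATCH(i), action[i]);

          /* Trigger the interrupt on the GuC. */
          mmio_write(GUC_IRQ_TRIGGER_REG, 1);

          /* Poll the action register for the firmware's success/fail code. */
          if (wait_for_response(SOFT_SCRATCH(0), &status))
                  return -ETIMEDOUT;

          return status_to_errno(status);
  }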

Doorbells:
Doorbells are interrupts to uKernel. A doorbell is a single cache line (QW)
mapped into process space.

Work Items:
There are several types of work items that the host may place into a
workqueue, each with its own requirements and limitations. Currently only
WQ_TYPE_INORDER is needed to support legacy submission via GuC, which
represents an in-order queue. The kernel driver packs the ring tail pointer
and an ELSP context descriptor dword into a Work Item.
See guc_add_request()

There are four levels of priority: _CRITICAL, _HIGH, _NORMAL and _LOW.
The kernel client to replace ExecList submission is created with
NORMAL priority. Priority of a client for the scheduler can be HIGH,
while a preemption context can use CRITICAL.

The firmware may or may not have modulus key and exponent data. The header,
uCode and RSA signature are must-have components that will be used by the
driver. The length of each component, in dwords, can be found in the header.
If the modulus and exponent are not present in the firmware, a.k.a. a
truncated image, the length values still appear in the header.

The driver does some basic firmware size validation based on the following
rules (a sketch follows the list):

Header, uCode and RSA are must-have components.

All firmware components, if present, are in the sequence illustrated
in the layout table above.

The length of each component can be found in the header, in dwords.

The modulus and exponent keys are not required by the driver; they may be
absent from the firmware, in which case the driver loads the truncated
image.
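
For instance, the size checks implied by those rules might be sketched as
follows (struct and field names are illustrative; real CSS header parsing
is more involved):

  struct css_header {
          u32 header_dwords;      /* illustrative fields; lengths are */
          u32 ucode_dwords;       /* all given in dwords per the rules */
          u32 rsa_dwords;
  };

  static int validate_fw_size(const struct css_header *css, size_t fw_size)
  {
          /* Header, uCode and RSA are must-have components. */
          size_t required = (css->header_dwords + css->ucode_dwords +
                             css->rsa_dwords) * sizeof(u32);

          if (fw_size < required)
                  return -ENODATA;

          /* A truncated image simply omits modulus and exponent; their
           * length fields still appear in the header, so nothing further
           * is required here. */
          return 0;
  }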

The HuC firmware layout is the same as the GuC firmware layout.

The HuC firmware CSS header is different; however, the only difference is
where the version information is saved. The uc_css_header is unified to
support both: the driver should read the HuC version from
uc_css_header.huc_sw_version, and the GuC version from
uc_css_header.guc_sw_version.

The lower part of the GuC Address Space [0, ggtt_pin_bias) is mapped to GuC
WOPCM while the upper part of the GuC Address Space [ggtt_pin_bias,
GUC_GGTT_TOP) is mapped to DRAM. The value of the GuC ggtt_pin_bias is the
GuC WOPCM size.

With full ppgtt enabled each process using drm will allocate at least one
translation table. With these traces it is possible to keep track of the
allocation and of the lifetime of the tables; this can be used during
testing/debug to verify that we are not leaking ppgtts.
These traces identify the ppgtt through the vm pointer, which is also printed
by the i915_vma_bind and i915_vma_unbind tracepoints.

Gen graphics supports a large number of performance counters that can help
driver and application developers understand and optimize their use of the
GPU.

This i915 perf interface enables userspace to configure and open a file
descriptor representing a stream of GPU metrics which can then be read() as
a stream of sample records.

The interface is particularly suited to exposing buffered metrics that are
captured by DMA from the GPU, unsynchronized with and unrelated to the CPU.

Streams representing a single context are accessible to applications with a
corresponding drm file descriptor, such that OpenGL can use the interface
without special privileges. Access to system-wide metrics requires root
privileges by default, unless changed via the dev.i915.perf_event_paranoid
sysctl option.

The interface was initially inspired by the core Perf infrastructure but
some notable differences are:

i915 perf file descriptors represent a “stream” instead of an “event”; where
a perf event primarily corresponds to a single 64bit value, while a stream
might sample sets of tightly-coupled counters, depending on the
configuration. For example the Gen OA unit isn’t designed to support
orthogonal configurations of individual counters; it’s configured for a set
of related counters. Samples for an i915 perf stream capturing OA metrics
will include a set of counter values packed in a compact HW specific format.
The OA unit supports a number of different packing formats which can be
selected by the user opening the stream. Perf has support for grouping
events, but each event in the group is configured, validated and
authenticated individually with separate system calls.

i915 perf doesn’t support exposing metrics via an mmap’d circular buffer.
The supported metrics are being written to memory by the GPU unsynchronized
with the CPU, using HW specific packing formats for counter sets. Sometimes
the constraints on HW configuration require reports to be filtered before it
would be acceptable to expose them to unprivileged applications - to hide
the metrics of other processes/contexts. For these use cases a read() based
interface is a good fit, and provides an opportunity to filter data as it
gets copied from the GPU mapped buffers to userspace buffers.

The first prototype of this driver was based on the core perf
infrastructure, and while we did make that mostly work, with some changes to
perf, we found we were breaking or working around too many assumptions baked
into perf’s currently cpu centric design.

In the end we didn’t see a clear benefit to making perf’s implementation and
interface more complex by changing design assumptions while we knew we still
wouldn’t be able to use any existing perf based userspace tools.

Also considering the Gen specific nature of the Observability hardware and
how userspace will sometimes need to combine i915 perf OA metrics with
side-band OA data captured via MI_REPORT_PERF_COUNT commands; we’re
expecting the interface to be used by a platform specific userspace such as
OpenGL or tools. This is to say; we aren’t inherently missing out on having
a standard vendor/architecture agnostic interface by not using perf.

For posterity, in case we might re-visit trying to adapt core perf to be
better suited to exposing i915 metrics these were the main pain points we
hit:

Existing perf pmus are used for profiling work on a cpu and we were
introducing the idea of _IS_DEVICE pmus with different security
implications, the need to fake cpu-related data (such as user/kernel
registers) to fit with perf’s current design, and adding _DEVICE records
as a way to forward device-specific status records.

The OA unit writes reports of counters into a circular buffer, without
involvement from the CPU, making our PMU driver the first of a kind.

Given the way we were periodically forwarding data from the GPU-mapped OA
buffer to perf’s buffer, those bursts of sample writes looked to perf like
we were sampling too fast and so we had to subvert its throttling checks.

Perf supports groups of counters and allows those to be read via
transactions internally but transactions currently seem designed to be
explicitly initiated from the cpu (say in response to a userspace read())
and while we could pull a report out of the OA buffer we can’t
trigger a report from the cpu on demand.

Related to being report based; the OA counters are configured in HW as a
set while perf generally expects counter configurations to be orthogonal.
Although counters can be associated with a group leader as they are
opened, there’s no clear precedent for being able to provide group-wide
configuration attributes (for example we want to let userspace choose the
OA unit report format used to capture all counters in a set, or specify a
GPU context to filter metrics on). We avoided using perf’s grouping
feature and forwarded OA reports to userspace via perf’s ‘raw’ sample
field. This suited our userspace well considering how coupled the counters
are when dealing with normalizing. It would be inconvenient to split
counters up into separate events, only to require userspace to recombine
them. For Mesa it’s also convenient to be forwarded raw, periodic reports
for combining with the side-band raw reports it captures using
MI_REPORT_PERF_COUNT commands.

As a side note on perf’s grouping feature; there was also some concern
that using PERF_FORMAT_GROUP as a way to pack together counter values
would quite drastically inflate our sample sizes, which would likely
lower the effective sampling resolutions we could use when the available
memory bandwidth is limited.

With the OA unit’s report formats, counters are packed together as 32
or 40bit values, with the largest report size being 256 bytes.

PERF_FORMAT_GROUP values are 64bit, but there doesn’t appear to be a
documented ordering to the values, implying PERF_FORMAT_ID must also be
used to add a 64bit ID before each value; giving 16 bytes per counter.

Related to counter orthogonality; we can’t time share the OA unit, while
event scheduling is a central design idea within perf for allowing
userspace to open + enable more events than can be configured in HW at any
one time. The OA unit is not designed to allow re-configuration while in
use. We can’t reconfigure the OA unit without losing internal OA unit
state which we can’t access explicitly to save and restore. Reconfiguring
the OA unit is also relatively slow, involving ~100 register writes. From
userspace Mesa also depends on a stable OA configuration when emitting
MI_REPORT_PERF_COUNT commands and importantly the OA unit can’t be
disabled while there are outstanding MI_RPC commands lest we hang the
command streamer.

The contents of sample records aren’t extensible by device drivers (i.e.
the sample_type bits). As an example; Sourab Gupta had been looking to
attach GPU timestamps to our OA samples. We were shoehorning OA reports
into sample records by using the ‘raw’ field, but it’s tricky to pack more
than one thing into this field because events/core.c currently only lets a
pmu give a single raw data pointer plus len which will be copied into the
ring buffer. To include more than the OA report we’d have to copy the
report into an intermediate larger buffer. I’d been considering allowing a
vector of data+len values to be specified for copying the raw data, but
it felt like a kludge to be using the raw field for this purpose.

It felt like our perf based PMU was making some technical compromises
just for the sake of using perf:

perf_event_open() requires events to either relate to a pid or a specific
cpu core, while our device pmu related to neither. Events opened with a
pid will be automatically enabled/disabled according to the scheduling of
that process - so not appropriate for us. When an event is related to a
cpu id, perf ensures pmu methods will be invoked via an inter-processor
interrupt on that core. To avoid invasive changes our userspace opened OA
perf events for a specific cpu. This was workable but it meant the
majority of the OA driver ran in atomic context, including all OA report
forwarding, which wasn’t really necessary in our case and seems to make
our locking requirements somewhat complex as we handled the interaction
with the rest of the i915 driver.

i915-perf state cleanup is split up into an ‘unregister’ and
‘deinit’ phase where the interface is first hidden from
userspace by i915_perf_unregister() before cleaning up
remaining state in i915_perf_fini().

Validates the stream open parameters given by userspace including flags
and an array of u64 key, value pair properties.

Very little is assumed up front about the nature of the stream being
opened (for instance we don’t assume it’s for periodic OA unit metrics). An
i915-perf stream is expected to be a suitable interface for other forms of
buffered data written by the GPU besides periodic OA metrics.

Note we copy the properties from userspace outside of the i915 perf
mutex to avoid an awkward lockdep with mmap_sem.

Most of the implementation details are handled by
i915_perf_open_ioctl_locked() after taking the drm_i915_private->perf.lock
mutex for serializing with any non-file-operation driver hooks.

Enables the collection of HW samples, either in response to
I915_PERF_IOCTL_ENABLE or implicitly called when stream is opened
without I915_PERF_FLAG_DISABLED.

disable

Disables the collection of HW samples, either in response
to I915_PERF_IOCTL_DISABLE or implicitly called before destroying
the stream.

poll_wait

Call poll_wait, passing a wait queue that will be woken
once there is something ready to read() for the stream

wait_unlocked

For handling a blocking read, wait until there is
something ready to read() for the stream. E.g. wait on the same
wait queue that would be passed to poll_wait().

read

Copy buffered metrics as records to userspace
buf: the userspace destination buffer
count: the number of bytes to copy, requested by userspace
offset: zero at the start of the read, updated as the read
proceeds, it represents how many bytes have been copied so far and
the buffer offset for copying the next record.

Copy as many buffered i915 perf samples and records for this stream
to userspace as will fit in the given buffer.

Only write complete records; returning -ENOSPC if there isn’t room
for a complete record.

Return any error condition that results in a short read such as
-ENOSPC or -EFAULT, even though these may be squashed before
returning to userspace.

Note this function only validates properties in isolation; it doesn’t
validate that the combination of properties makes sense or that all
properties necessary for a particular kind of stream have been set.

Note that there currently aren’t any ordering requirements for properties so
we shouldn’t validate or assume anything about ordering here. This doesn’t
rule out defining new properties with ordering requirements in the future.

Implements further stream config validation and stream initialization on
behalf of i915_perf_open_ioctl() with the drm_i915_private->perf.lock mutex
taken to serialize with any non-file-operation driver hooks.

Note

at this point the props have only been validated in isolation and
it’s still necessary to validate that the combination of properties makes
sense.

In the case where userspace is interested in OA unit metrics then further
config validation and stream initialization details will be handled by
i915_oa_stream_init(). The code here should only validate config state that
will be relevant to all stream types / backends.

The entry point for handling a read() on a stream file descriptor from
userspace. Most of the work is left to the i915_perf_read_locked() and
i915_perf_stream_ops->read but to save having stream implementations (of
which we might have multiple later) we handle blocking read here.

We can also consistently treat trying to read from a disabled stream
as an IO error so implementations can assume the stream is enabled
while reading.

The intention is that disabling and re-enabling a stream will ideally be
cheaper than destroying and re-opening a stream with the same configuration,
though there are no formal guarantees about what state or buffered data
must be retained between disabling and re-enabling a stream.

Note

while a stream is disabled it’s considered an error for userspace
to attempt to read from the stream (-EIO).

Selects and applies any MUX configuration to set
up the Boolean and Custom (B/C) counters that are part of the
counter reports being sampled. May apply system constraints such as
disabling EU clock gating as required.

disable_metric_set

Remove system constraints associated with using
the OA unit.

oa_enable

Enable periodic sampling

oa_disable

Disable periodic sampling

read

Copy data from the circular OA buffer into a given userspace
buffer.

oa_hw_tail_read

read the OA tail pointer register

In particular this enables us to share all the fiddly code for
handling the OA unit tail pointer race that affects multiple
generations.

[Re]enables hardware periodic sampling according to the period configured
when opening the stream. This also starts a hrtimer that will periodically
check for data in the circular OA buffer for notifying userspace (e.g.
during a read() or poll()).

Stops the OA unit from periodically writing counter reports into the
circular OA buffer. This also stops the hrtimer that periodically checks for
data in the circular OA buffer, for notifying userspace.

For handling userspace polling on an i915 perf stream opened for OA metrics,
this starts a poll_wait with the wait queue that our hrtimer callback wakes
when it sees data ready to read in the circular OA buffer.

This section simply includes all currently documented i915 perf internals, in
no particular order, but may include some more minor utilities or platform
specific details than found in the more high-level sections.

This is either called via fops (for blocking reads in user ctx) or the poll
check hrtimer (atomic ctx) to check the OA buffer tail pointer and check
if there is data available for userspace to read.

This function is central to providing a workaround for the OA unit tail
pointer having a race with respect to what data is visible to the CPU.
It is responsible for reading tail pointers from the hardware and giving
the pointers time to ‘age’ before they are made available for reading.
(See description of OA_TAIL_MARGIN_NSEC above for further details.)

Besides returning true when there is data available to read() this function
also has the side effect of updating the oa_buffer.tails[], .aging_timestamp
and .aged_tail_idx state used for reading.

Note

It’s safe to read OA config state here unlocked, assuming that this is
only called while the stream is enabled, while the global OA configuration
can’t be modified.

The contents of a sample are configured through DRM_I915_PERF_PROP_SAMPLE_*
properties when opening a stream, tracked as stream->sample_flags. This
function copies the requested components of a single sample to the given
read() buf.

Notably any error condition resulting in a short read (-ENOSPC or
-EFAULT) will be returned even though one or more records may
have been successfully copied. In this case it’s up to the caller
to decide if the error should be squashed before returning to
userspace.

Note

reports are consumed from the head, and appended to the
tail, so the tail chases the head?... If you think that’s mad
and back-to-front you’re not alone, but this follows the
Gen PRM naming convention.

Besides wrapping i915_perf_stream_ops->read this provides a common place to
ensure that if we’ve successfully copied any data then reporting that takes
precedence over any internal error status, so the data isn’t lost.

For example ret will be -ENOSPC whenever there is more buffered data than
can be copied to userspace, but that’s only interesting if we weren’t able
to copy some data because it implies the userspace buffer is too small to
receive a single record (and we never split records).

Another case with ret == -EFAULT is more of a grey area since it would seem
like bad form for userspace to ask us to overrun its buffer, but the user
knows best:
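
In miniature, with offset counting the bytes copied so far, the rule is
just:

  /* Copied data takes precedence over the internal error status, so
   * -ENOSPC/-EFAULT only reach userspace when nothing was copied. */
  return offset ?: ret;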

In particular OA metric sets are advertised under a sysfs metrics/
directory allowing userspace to enumerate valid IDs that can be
used to open an i915-perf stream.

void i915_perf_unregister(struct drm_i915_private *dev_priv)

hide i915-perf from userspace

Parameters

struct drm_i915_private *dev_priv

i915 device instance

Description

i915-perf state cleanup is split up into an ‘unregister’ and
‘deinit’ phase where the interface is first hidden from
userspace by i915_perf_unregister() before cleaning up
remaining state in i915_perf_fini().

Prefix macros that generally should not be used outside of this file with
underscore ‘_’. For example, _PIPE() and friends, single instances of
registers that are defined solely for use by function-like macros.

Avoid using the underscore prefixed macros outside of this file. There are
exceptions, but keep them to a minimum.

There are two basic types of register definitions: Single registers and
register groups. Register groups are registers which have two or more
instances, for example one per pipe, port, transcoder, etc. Register groups
should be defined using function-like macros.

For single registers, define the register offset first, followed by register
contents.

For register groups, define the register instance offsets first, prefixed
with underscore, followed by a function-like macro choosing the right
instance based on the parameter, followed by register contents.
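
For example, a per-pipe register group would be laid out as follows (FOO and
its offsets are made up; _MMIO_PIPE is the existing helper):

  #define _FOO_A                          0x62140
  #define _FOO_B                          0x62940
  #define FOO(pipe)                       _MMIO_PIPE(pipe, _FOO_A, _FOO_B)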

Define the register contents (i.e. bit and bit field macros) from most
significant to least significant bit. Indent the register content macros
using two extra spaces between #define and the macro name.

Define bit fields using REG_GENMASK(h,l). Define bit field contents
using REG_FIELD_PREP(mask,value). This will define the values already
shifted in place, so they can be directly OR’d together. For convenience,
function-like macros may be used to define bit fields, but do note that the
macros may be needed to read as well as write the register contents.

Define bits using REG_BIT(N). Do not add _BIT suffix to the name.
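
Continuing the made-up FOO(pipe) group from the previous example, its
register contents would then be defined from most significant to least
significant bit:

  #define   FOO_ENABLE                    REG_BIT(31)
  #define   FOO_MODE_MASK                 REG_GENMASK(19, 16)
  #define   FOO_MODE_BAR                  REG_FIELD_PREP(FOO_MODE_MASK, 0)
  #define   FOO_MODE_BAZ                  REG_FIELD_PREP(FOO_MODE_MASK, 1)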

Group the register and its contents together without blank lines, separate
from other registers and their contents with one blank line.

Indent macro values from macro names using TABs. Align values vertically. Use
braces in macro values as needed to avoid unintended precedence after macro
substitution. Use spaces in macro values according to kernel coding
style. Use lower case in hexadecimal values.

Try to name registers according to the specs. If the register name changes in
the specs from one platform to another, stick to the original name.

Try to re-use existing register macro definitions. Only add new macros for
new register offsets, or when the register contents have changed enough to
warrant a full redefinition.

When a register macro changes for a new platform, prefix the new macro using
the platform acronym or generation. For example, SKL_ or GEN8_. The
prefix signifies the start platform/generation using the register.

When a bit (field) macro changes or gets added for a new platform, while
retaining the existing register macro, add a platform acronym or generation
suffix to the name. For example, _SKL or _GEN8.