I forgot to mention the best part — a trick that modern GPUs can only pull off because of how they feed their DPEs:
Take a modern computer with a modern GPU, and plug the GPU into one 75Hz monitor and one 60Hz monitor. In your OS, arrange the resulting displays to be aligned horizontally. Start an (HW-accelerated) video playing. Now, place that video so it straddles the boundary between the two screens.
When you do this, each DPE will be demanding frames from the video draw-context on a different schedule; so each DPE will actually receive a snapshot copy of (its slice of) that buffer computed with a different time-weighted interpolation of two frames. And yet the result will look perfectly smooth and synchronized!
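To make the "different time-weighted interpolation" concrete, here is a minimal sketch (in Python, with illustrative names; none of this is a real GPU API) of how a DPE's raster-clock time would pick the two bracketing video frames and the blend weight between them:

```python
def interpolation_weight(raster_time, frame_interval):
    """Given a DPE's raster-clock time (seconds) and the video's frame
    interval (e.g. 1/24 s), return (earlier_frame_index, weight), where
    weight is how far raster_time sits between the two bracketing frames."""
    idx = int(raster_time // frame_interval)
    w = (raster_time % frame_interval) / frame_interval
    return idx, w

def blend(frame_a, frame_b, w):
    # Per-pixel linear blend; frames are flat lists of floats for brevity.
    return [(1.0 - w) * a + w * b for a, b in zip(frame_a, frame_b)]
```

Two DPEs sampling the same 24fps video at different moments (because one refreshes at 75Hz and the other at 60Hz) will get different `(idx, w)` pairs, yet each snapshot is correct for *its* scan-out instant, which is why the straddled video still looks coherent.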
This synchronization would literally not be possible if the HW draw-context weren't being snapshotted using each DPE's current rasterization/wire-encoding clock as an input.
The input required for this could — in theory, at least — be synthesized in a pure-functional manner internal to the GPU's compute, by having a final VRAM-to-VRAM copy that is done by a compute-shader with some extra logic to map each pixel position to a simulated DPE's raster-clock time.
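As a sketch of that pure-functional alternative (illustrative names only; this is what each compute-shader thread would evaluate, not a real driver interface), the key piece is mapping a pixel's position to the time a simulated left-to-right, top-to-bottom raster scan would reach it:

```python
def simulated_raster_time(x, y, width, height, refresh_hz, frame_start):
    """Time at which a simulated raster scan (left-to-right, top-to-bottom)
    would read pixel (x, y), for a display with the given refresh rate.
    Ignores blanking intervals for simplicity."""
    frame_period = 1.0 / refresh_hz
    scan_fraction = (y * width + x) / (width * height)
    return frame_start + scan_fraction * frame_period
```

The final VRAM-to-VRAM pass would feed this per-pixel time into the frame-interpolation logic, giving each pixel the same temporal treatment a real demand-driven DPE would give it.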
But just having a DPE that really does things that way (keeping a physical raster-cycle clock; a "narrow", much-smaller-than-screen-sized ring buffer; and a demand circuit that issues fill orders for the GPU to write lines or tiles into the ring buffer according to the clock) is cheaper even than not synchronizing at all, let alone than doing synchronization through extra VRAM (for the mastering buffer, and for a per-pixel draw-context centroid UV texture) plus an extra compute-shader pass with a virtual raster clock.
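The demand circuit described above can be sketched as follows. This is a software model under assumed semantics (slot count, lookahead, line-granular fills are all illustrative choices, not how any particular vendor's hardware works): the raster clock advances line by line, and the circuit issues fill orders only for the few lines the scan-out is about to need.

```python
class LineRingBuffer:
    """Narrow ring buffer holding only a few scanlines, filled on demand."""

    def __init__(self, num_slots, screen_lines):
        self.slots = [None] * num_slots   # far fewer slots than screen lines
        self.num_slots = num_slots
        self.screen_lines = screen_lines
        self.next_fill = 0                # next line index to request

    def fill_orders(self, raster_line, lookahead):
        """Issue fill orders for lines the raster clock will need soon."""
        orders = []
        while self.next_fill < min(raster_line + lookahead, self.screen_lines):
            orders.append(self.next_fill)
            self.next_fill += 1
        return orders

    def write_line(self, line_index, pixels):
        # The GPU services a fill order by writing into the matching slot.
        self.slots[line_index % self.num_slots] = pixels

    def read_line(self, line_index):
        # Scan-out/wire-encoding reads the slot as the raster clock passes it.
        return self.slots[line_index % self.num_slots]
```

Because the buffer never holds more than a handful of lines, there is no full-screen mastering buffer at all, which is where the claimed cost advantage comes from.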