This operation is not currently possible, but that's a very good idea!
A slight generalization of this would be "alignment by coordinate systems" or "by calibration", as the approach could also work with the line-based calibration / coordinate system (just origin + scale, axes stay aligned with image axes). Even if it is less accurate, it's often the only calibration available.
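To make the line-based case concrete, here is a minimal sketch in Python of how two such calibrations could be combined into a pixel-to-pixel mapping. It is only an illustration: the `(origin_x, origin_y, pixels_per_unit)` tuple and the function names are hypothetical, and it assumes both calibrations use the same real-world unit with axes aligned to the image axes (no rotation, no axis flip).

```python
# Hypothetical representation of a line-based calibration:
# (origin_x, origin_y, pixels_per_unit), origin given in pixels.

def to_world(cal, px, py):
    """Convert a pixel position to world coordinates."""
    ox, oy, ppu = cal
    return ((px - ox) / ppu, (py - oy) / ppu)


def to_pixels(cal, wx, wy):
    """Convert world coordinates back to a pixel position."""
    ox, oy, ppu = cal
    return (ox + wx * ppu, oy + wy * ppu)


def align_transform(cal_a, cal_b):
    """Return (scale, tx, ty) such that pixel_b = scale * pixel_a + (tx, ty).

    This is the composition "pixels of A -> world -> pixels of B".
    """
    scale = cal_b[2] / cal_a[2]
    tx = cal_b[0] - scale * cal_a[0]
    ty = cal_b[1] - scale * cal_a[1]
    return scale, tx, ty


# Usage: video A has its origin at (100, 200) with 50 px per unit,
# video B has its origin at (300, 400) with 100 px per unit.
cal_a = (100.0, 200.0, 50.0)
cal_b = (300.0, 400.0, 100.0)
scale, tx, ty = align_transform(cal_a, cal_b)
```

With only origin and scale, the result is a uniform scale plus a translation, which is why this variant cannot compensate for a difference in camera angle.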
Lens distortion correction might also come into play.
The original goal with superposition was actually to compute this transform matrix automatically, refining it using the video sequence to ignore the foreground layer. I very much like the idea of being able to do something manually before implementing an automation of it.
We need the full homography matrix by the way, not just an affine one, as it has to map an arbitrary quad to an arbitrary quad. I've been thinking about how to finally build a platform to experiment with these ideas more easily. I also need to revisit and homogenize the matrix maths in some places. No ETA.
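For reference, the quad-to-quad mapping can be sketched as follows. This is a generic textbook construction in plain Python, not code from the program: it fixes the bottom-right matrix entry to 1 and solves the resulting 8x8 linear system from the four corner correspondences. All function names are hypothetical.

```python
def solve_linear(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix, copied so the inputs are not modified.
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: the quad is probably degenerate")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def homography_from_quads(src, dst):
    """Return a 3x3 matrix H (with H[2][2] = 1) mapping src corners to dst corners.

    src and dst are lists of four (x, y) points in matching order.
    """
    a, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        # u = (h0*x + h1*y + h2) / (h6*x + h7*y + 1), same pattern for v.
        a.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        b.append(u)
        a.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        b.append(v)
    h = solve_linear(a, b)
    return [h[0:3], h[3:6], [h[6], h[7], 1.0]]


def apply_homography(h, point):
    """Map one (x, y) point through H, including the perspective divide."""
    x, y = point
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    return ((h[0][0] * x + h[0][1] * y + h[0][2]) / w,
            (h[1][0] * x + h[1][1] * y + h[1][2]) / w)


# Usage: map the unit square onto an arbitrary convex quad.
src = [(0, 0), (1, 0), (1, 1), (0, 1)]
dst = [(10, 20), (110, 30), (90, 140), (5, 100)]
H = homography_from_quads(src, dst)
```

An affine matrix has only six free parameters, so it can match three point pairs at most. The two extra terms in the last row are what allow the fourth corner to land anywhere.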