This is a summary of a previous email thread on the QEMU list; see the full discussion here (but everything relevant should be included below, so that is only for reference):
I think we are at the point now where the Linux DRM driver (and other OS drivers too) tries to set up shared memory buffers that are used for sending command packets to the GPU, which the emulated chip will need to parse and convert into register accesses. We need to implement this in some way, otherwise nothing further will work: communication between the driver and the card goes through this facility, so this should be the next step before adding more registers and emulation. The facility is called the microengine, command processor, CCE or PM4 (Programming Mode 4); the docs use these terms variably. Searching for "ati microengine" may turn up some documentation where the buffers and command packets are described, but the actual microengine and its microcode appear to be undocumented.
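For reference, the buffer mechanism works roughly like this: the driver writes packets into a ring buffer in memory and bumps a write pointer register, and the CP fetches dwords at its read pointer. Below is a minimal sketch of how the fetch side could look in the emulated device; the register offsets mentioned in the comments come from Linux's radeon_reg.h, but the struct and helper names are made up and would need to be fitted to the existing device code:

/* Minimal sketch of the fetch side of the CP ring buffer.
 * Register offsets follow Linux's radeon_reg.h; the struct and
 * helper names here are invented for illustration. */
#include "qemu/osdep.h"
#include "hw/pci/pci.h"

typedef struct ATICPState {
    PCIDevice dev;
    uint32_t rb_base;   /* RADEON_CP_RB_BASE (0x0700), ring address */
    uint32_t rb_rptr;   /* RADEON_CP_RB_RPTR (0x0710), read ptr in dwords */
    uint32_t rb_wptr;   /* RADEON_CP_RB_WPTR (0x0714), write ptr in dwords */
    uint32_t rb_size;   /* ring size in dwords, derived from CP_RB_CNTL */
} ATICPState;

/* True while the driver has queued dwords we have not consumed yet. */
static bool cp_ring_pending(ATICPState *s)
{
    return s->rb_rptr != s->rb_wptr;
}

/* Fetch the next dword from guest memory and advance the read pointer. */
static uint32_t cp_ring_fetch(ATICPState *s)
{
    uint32_t dword;
    dma_addr_t addr = (dma_addr_t)s->rb_base + s->rb_rptr * 4;

    pci_dma_read(&s->dev, addr, &dword, sizeof(dword));
    s->rb_rptr = (s->rb_rptr + 1) % s->rb_size; /* the ring wraps around */
    return le32_to_cpu(dword);
}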
We have the microcode that the driver uploads to the card, but we don't know how to run it. This is what the ME is supposed to execute, and it presumably implements the PM4 way of programming the GPU via packets read from memory:
This one is for the R100 and is, I think, also used for the RV100; the R300 has a different one, but it is probably very similar:
and I've seen that even the MacOS driver for the Rage128Pro uses a similar microcode. The Linux DRM driver for that card may not be as advanced, but the corresponding microcode is here:
If the microcode of the microengine/CCE could be reversed, or turns out to be documented somewhere, it may be easier to emulate the processor itself than to hand-code the parsing of all possible command packets, of which there are quite a lot; in the real GPU all of those are handled by this small microcode. The ME/CCE is probably some VLIW processor with 40-bit instructions, which likely include ops to read and write memory and access registers, and maybe some branching and synchronisation, but I have no idea how to find out which opcode means what. The microengine is probably very similar across the early Radeons, with only the microcode changing, so if we could implement it, the same emulation might work for several cards (possibly even the Rage128Pro).
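Even without knowing the instruction set, the emulated device can at least capture the microcode as the driver uploads it, which would be the starting point for interpreting it later. A sketch follows; the offsets are from Linux's radeon_reg.h, and the auto-increment after each DATAL write is an assumption based on how r100_cp_load_microcode() streams all 256 words without touching CP_ME_RAM_ADDR again:

#include <stdint.h>

#define RADEON_CP_ME_RAM_ADDR   0x07d4  /* offsets from Linux radeon_reg.h */
#define RADEON_CP_ME_RAM_DATAH  0x07dc
#define RADEON_CP_ME_RAM_DATAL  0x07e0
#define CP_ME_RAM_WORDS         256

static uint64_t me_ram[CP_ME_RAM_WORDS]; /* the 40-bit words, zero-extended */
static unsigned me_ram_addr;
static uint32_t me_ram_datah;            /* latched high bits until DATAL arrives */

/* Called from the MMIO write handler for the CP_ME_RAM_* registers. */
static void ati_cp_me_ram_write(unsigned addr, uint32_t val)
{
    switch (addr) {
    case RADEON_CP_ME_RAM_ADDR:
        me_ram_addr = val % CP_ME_RAM_WORDS;
        break;
    case RADEON_CP_ME_RAM_DATAH:
        me_ram_datah = val & 0xff;       /* top 8 bits of the 40-bit word */
        break;
    case RADEON_CP_ME_RAM_DATAL:
        me_ram[me_ram_addr] = ((uint64_t)me_ram_datah << 32) | val;
        /* assumed: address auto-increments after the low half is written */
        me_ram_addr = (me_ram_addr + 1) % CP_ME_RAM_WORDS;
        break;
    }
}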
Later Radeons (after the R600) have a different microengine with different microcode, but some Adreno GPUs seem to be based on it. Those microcodes were reversed for the Linux Adreno driver, and the freedreno project has some results, but they do not match up with these older Radeons, so I'm not sure whether that is any use for understanding these microcodes.
I've also found that the Xenia emulator has code to parse command packets of the Xbox 360 GPU, which is similar to some late R5xx GPUs, so parts of it might be useful for emulating earlier Radeons as well. If we can't figure out how the microengine works, another approach would be to do what Xenia does and parse the packets directly, without the microcode. It might be possible to copy code from Xenia for that, but we would need to convert it from C++ to C.
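For the direct-parsing approach, the packet header format at least is documented (in AMD's acceleration guides and the Linux radeon DRM headers), so a parser skeleton is easy to start: bits 31:30 of the header dword give the packet type, bits 29:16 a payload count minus one, and for type-0 packets bits 12:0 the base register index. Here is a sketch; cp_read_dword(), ati_reg_write() and cp_type3_op() stand in for whatever the real device code provides:

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helpers provided by the device emulation: */
uint32_t cp_read_dword(void);                    /* next dword from the ring */
void ati_reg_write(uint32_t byte_off, uint32_t val);
void cp_type3_op(uint8_t opcode, uint16_t count);

/* Parse one PM4 packet starting at the current ring position. */
static void cp_parse_packet(void)
{
    uint32_t header = cp_read_dword();
    uint16_t count = (header >> 16) & 0x3fff;    /* payload dwords minus one */

    switch (header >> 30) {                      /* packet type, bits 31:30 */
    case 0: {                                    /* type 0: register writes */
        uint32_t reg = (header & 0x1fff) << 2;   /* base register byte offset */
        bool one_reg = header & (1 << 15);       /* RADEON_ONE_REG_WR in DRM hdrs */
        for (unsigned i = 0; i <= count; i++) {
            ati_reg_write(reg, cp_read_dword());
            if (!one_reg) {
                reg += 4;                        /* consecutive registers */
            }
        }
        break;
    }
    case 2:                                      /* type 2: filler, no payload */
        break;
    case 3:                                      /* type 3: opcode + payload */
        cp_type3_op((header >> 8) & 0xff, count);
        break;
    default:                                     /* type 1: reserved/unused */
        break;
    }
}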
The main difficulty here is finding out how the card works; once that is understood, the next question is how to implement it in QEMU, e.g. how to run the microengine emulation in a separate thread. I think it can probably be done with the existing QEMU thread and locking functions. Synchronisation between the CPU and the GPU should already be solved by the protocol between them (after all, in real hardware they also run in parallel, so this must have been solved there). If so, the microengine thread may only have to read packets from memory and generate register accesses. For this to work, the register access FIFO may also need to be added, which is currently missing: in the current version register writes are executed immediately, but once both the CPU and the microengine can access registers, we either need to implement the FIFO or add some other synchronisation between the two. This needs some thought. (If we have the register write FIFO, the actual drawing could also be done in a separate thread, which is closer to what the real GPU does, but to keep it simple I did not try that in the first version.)
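As a sketch of the threading part, using only primitives that do exist in qemu/thread.h (QemuThread, QemuMutex, QemuCond); the FIFO layout and ati_cp_execute() are invented for illustration, and real code would need shutdown handling and overflow checks:

#include "qemu/osdep.h"
#include "qemu/thread.h"

#define CP_FIFO_SIZE 256        /* arbitrary power of two for this sketch */

void ati_cp_execute(uint32_t dword);  /* hypothetical: run one FIFO entry */

typedef struct CPFifo {
    uint32_t data[CP_FIFO_SIZE];
    unsigned head, tail;        /* free-running counters; masked on access */
    bool running;
    QemuMutex lock;
    QemuCond nonempty;
    QemuThread thread;
} CPFifo;

/* Producer side: the MMIO handler pushes a register write / packet word. */
static void cp_fifo_push(CPFifo *f, uint32_t dword)
{
    qemu_mutex_lock(&f->lock);
    f->data[f->head++ % CP_FIFO_SIZE] = dword;  /* no overflow handling here */
    qemu_cond_signal(&f->nonempty);
    qemu_mutex_unlock(&f->lock);
}

/* Consumer side: the microengine thread drains the FIFO. */
static void *cp_thread_fn(void *opaque)
{
    CPFifo *f = opaque;

    qemu_mutex_lock(&f->lock);
    while (f->running) {
        while (f->head == f->tail && f->running) {
            qemu_cond_wait(&f->nonempty, &f->lock);
        }
        while (f->tail != f->head) {
            uint32_t dword = f->data[f->tail++ % CP_FIFO_SIZE];
            /* execute outside the lock so the CPU side is not blocked */
            qemu_mutex_unlock(&f->lock);
            ati_cp_execute(dword);
            qemu_mutex_lock(&f->lock);
        }
    }
    qemu_mutex_unlock(&f->lock);
    return NULL;
}

static void cp_start(CPFifo *f)
{
    qemu_mutex_init(&f->lock);
    qemu_cond_init(&f->nonempty);
    f->running = true;
    qemu_thread_create(&f->thread, "ati-cp", cp_thread_fn, f,
                       QEMU_THREAD_JOINABLE);
}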
Does this make sense to anyone? Feel free to add any knowledge here that could help get this working.