SH2 dynamic recompiler
This describes the dynamic recomplier for SH2 emulation which is in the yabause-dynarec branch, and in trunk since version 0.9.11. The design is based in part on code originally written for mupen64plus.
The PSP port uses a different recompiler, not described here, which has different design goals and in particular is optimized for very low memory usage.
The recompiler generates optimized code, and to avoid the expense of recompiling code multiple times, a large cache (typically 16 to 32 MB) is used.
The memory for the recompiled code is allocated in a single large block, using mmap with the PROT_EXEC bit set, to ensure operation on CPUs with no-execute (NX) page permissions.
Saturn games frequently load different code into memory. To facilitate replacement of code, only contiguous blocks of SH2 instructions are compiled. No attempt is made to follow branches or 'hot-paths' to distant memory areas.
The recompiler consists of eight stages, plus a linker, and a memory manager, as described below.
Currently the dynarec generates x86, x86-64, and ARMv7 (little-endian). Most of the code is shared between the architectures, but a different code generator is included at compile time, using different #include statements depending on the CPU type.
Pass 1: Disassembly
When an instruction address is encountered which has not been compiled, the function sh2_recompile_block is called, with the address of the target instruction as its sole parameter.
Instructions are decoded until an unconditional jump is encountered, except that disassembly is usually continued past a JSR (subroutine call) instruction. Up to 4096 instructions (8K of SH2 code) may be disassembled at once.
During disassembly, PC-relative load instructions are provisionally treated as loading constant values, and these values are used to determine possible target addresses of branch instructions and memory writes. If it appears that an instruction will write to one of these nearby memory locations, then this constant-propagation strategy is abandoned.
Pass 2: Constant propagation and liveness analysis
Following disassembly, a revised constant propagation pass is performed. Using information about branch targets found in the first pass, this will avoid propagating constants to any instructions following the target of a branch instruction.
After completing constant propagation and identifying the target address of any branch instruction that can be definitively determined, liveness analysis is performed on the registers. This determines when a particular register will no longer be used, and thus can be removed from the register cache.
Pass 3: Register allocation
The 22 SH2 registers must be mapped onto seven available registers on x86, or twelve available registers on ARM.
Registers are made available using a least-soon-needed replacement policy. When the register cache is full, and no registers can be eliminated using the liveness analysis, a ten-instruction lookahead is used to determine which registers will not be needed soon and these registers are evicted from the cache.
Pass 4: Free unused registers
After the initial register allocation, the allocations are reviewed to determine if any registers remain allocated longer than necessary. These are then removed from the register cache. This avoids having to write back a large number of registers just before a branch, and makes more registers available for the next pass.
Pass 5: Pre-allocate registers
If a register will be used soon and needs to be loaded, try to load it early if the register is available. This improves execution on CPUs with a load-use penalty. Additionally, if a loop is encountered, attempt to move the load outside of the loop.
Pass 6: Optimize clean/dirty state
If a cached register is 'dirty' and needs to be written out, try to do so as soon as the register will no longer be modified. This avoids having to write the same register on multiple code paths due to conditional branches. Additionally, try to avoid writing out dirty registers inside of loops.
Pass 7: Identify interrupt return targets
Checking for interrupts occurs at branch instructions, and these locations may be jumped to upon a return from interrupt. This determines which such locations in the recompiled code are safe to jump to without potentially making any invalid assumptions about values in registers. If a register was assumed to contain a specific value, then an alternative version which does not do constant propagation will have to be compiled if that location is ever jumped to.
Pass 8: Assembly
This generates and outputs the recompiled code.
Following the main code block, handlers for certain exceptions as well as alternate entry points are added.
If a recompiled instruction relies on a certain SH2 register being cached in a certain native register, then a short 'stub' of code is generated to load the necessary registers. When an instruction outside of this block needs to jump to that location, it will instead jump to the stub. The necessary registers will be loaded, and then it will jump into the main code sequence.
The linker fills in all unresolved branches. Branches within the block are linked to their target address. Branches which jump outside of the block are linked to their target if that address has been compiled already. These inter-block branches are recorded in the jump_out array. This information will be used to remove the links in the event that the target of the branch is invalidated.
Unresolved branches point to a stub which loads the address of the branch instruction and the virtual address of its target into registers, and calls the dynamic linker. When this code is executed, the dynamic linker will compile the target if necessary, and then patch the branch instruction with the new address.
The last step in sh2_recompile_block is to ensure that there will be sufficient memory available to compile the next block. If there is not, then the oldest blocks are purged.
The dynarec cache can be described as a 32MB circular buffer divided into eight segments. Memory is allocated in order from beginning to end.
When there are less than 2 empty segments, the next segment in sequence is cleared, wrapping around to the beginning of the buffer. This continues as memory is needed, wrapping around from end to beginning.
It is unusual for the cache to become full. Most Saturn games do not use more than about 10 megabytes. Thus, code which has been recompiled once is retained in the cache and usually will not need to be recompiled again.
Invalidation and restoration
Memory writes are checked to determine if they modify any areas that have been compiled. This is needed to detect self-modifying code, as well as writes to any locations that were assumed to contain constant data. When a compiled block has been modified, the block is invalidated and no longer used.
The upper 20 bits of the memory address of each write is used as an index to test bits in memory_map and cached_code to determine whether code in that memory area was potentially cached. This has a resolution of 4KB, however some Saturn games place data very close to code. To ensure that code blocks are not invalidated unnecessarily, a second check is performed to ensure that the write actually modified instructions or data that were compiled. If not, the cache is not invalidated. This secondary check is somewhat slow, so if this occurs more than 500 times in rapid succession, the area is invalidated anyway and the check is no longer performed. The counter is reset with each vertical blanking interval. This heuristic appears to be suitable for the majority of Saturn software.
References to compiled blocks are hashed based on the upper bits of the address and stored in one of 2048 linked lists in the jump_in array. Each list covers a 4K block of memory.
When a write hits a cached area, all entries in the corresponding list are invalidated. If any code is found to cross a 4K boundary, the adjacent lists are invalidated also.
When code blocks are invalidated by memory writes, a list of locations of recent writes is kept. This list is checked when blocks are recompiled, and code in these memory areas will be compiled without any constant-propagation based on PC-relative loads.
Sometimes blocks may be invalidated even when none of the code is actually modified. This can happen if code is reloaded without actually modifying it. If blocks which were previously invalidated are subsequently found to be unmodified, those blocks are marked in the restore_candidate array. If the block remains unmodified, it will be restored as a valid block, to avoid recompiling blocks which do not need to be recompiled. This is performed by the clean_blocks function which is called periodically.
Branches with unresolved addresses jump to the dynamic linker. This will search the jump_in list for a matching address. If found, the branch instruction will be patched with the address, and then it will jump to this address.
If not found, the jump_dirty list will be searched for blocks which were previously compiled but may have been modified. If a potential match is found, the code will be compared against a cached copy to determine if any changes have been made. If not, then it will jump to the block. Because the memory could be modified again, branch instructions referencing these blocks are not altered, and continue to point to the dynamic linker. These blocks will continue to be verified each time they are called, until restored to the jump_in list by the clean_blocks function described above.
If no compiled block is found, or the existing block was modified, the target is recompiled.
When a jump instruction with an adddress in a register is encountered, and the target address can not be predetermined via constant propagation, the address of the recompiled code must be looked up using the address of the SH2 code.
A special case of this is the RTS instruction, which returns from a subroutine. When a call instruction (BSR/JSR) is executed, the address of the following instruction is inserted into a small 32-entry hash table, which is checked when a RTS instruction is executed. This allows for a quick return from subroutine calls.
If the small hash table lookup fails to find a match, or the jump instruction is not RTS, a larger 131072-entry hash table is checked. This table contains 65536 bins with up to 2 addresses per bin. If this also fails to find a match (which occurs less than 1% of the time) an exhaustive search of all compiled addresses within that 4K memory page is performed, using the list from the jump_in array.
If no match is found by any of these methods, the target address is compiled, and the new address is inserted into the hash table.
Cycles are counted before each branch by adding the cycles from the preceding instructions to a specific register. The cycle count is in R10 on ARM and ESI on x86. The value in this register is normally a negative number. When this number exceeds zero, a conditional branch is taken which jumps to an interrupt handler.
For example, the following x86 code adds eight cycles:
add $8,%esi jns cyclecount_handler
The conditional branch jumps to a small bit of code located after the main compiled block, which saves the cached registers to the register file, sets the instruction pointer which will be used upon return, and then calls one of two functions, depending on whether the master or slave SH2 is executing. This is used both for interrupt handling and to switch between the two processors. Code on each processor is executed for one-tenth of a scanline before switching to the other processor. This is approximately 170 to 180 cycles.
SH2 has 'delay slots', where the instruction after the branch is executed before the branch is taken. Instructions in delay slots are issued out-of-order in the recompiled code, so that the branch is taken after the delay slot.
In rare cases, an instruction that sets the T bit occurs in the delay slot of a conditional branch. These are not reordered; the test occurs first.
Registers containing constants are identified via the isconst bits in the regstat structure. Additionally, the isdoingcp and wasdoingcp bits identify registers for which the instructions which write to those registers will be skipped and replaced with an immediate load. The wasdoingcp bit is set for a register if the register contained a known constant before the instruction, and the isdoingcp bit is set for a register if the register will contain a known constant after the instruction executes.
This information is used to determine the target address of branch instructions and memory accesses, where possible. This results in more efficient code because it avoids a hash table lookup of the address at runtime.
These optimizations are not performed where a branch target intervenes, because such locations could be branched to with different values in the registers.
The Saturn has a relatively complex memory map.
Memory addresses are mapped from locations in the Saturn's memory to locations in the emulator's memory by using the upper 20 bits of the address as an index to the memory_map table. This array contains bits which identify whether there is a valid mapping for this memory range, and if so, a relative offset to be added to the address. If the memory region does not have a direct mapping, an appropriate handler is called via the MappedMemoryRead or MappedMemoryWrite functions.
If this option is defined, it enables use of ARMv6 instructions. If this option is not defined, sign extension is performed using a pair of shift instructions instead of the sxtb or sxth instructions. Additionally, if this is not defined, the byte swap instructions are not used, and an xor-swap is used instead. Avoiding use of the ARMv6 instructions is less efficient and increases code size, but is necessary for compatibility with older processors.
If this option is defined, it enables use of ARMv7 instructions. If this option is not defined, values will be loaded from constant pools instead of using movw/movt for immediate value loads. Avoiding use of the ARMv7 instructions is necessary for compatibility with older processors, but is less efficient because the constant loads result in pipeline stalls and increase the L1 cache miss rate.
If this is defined, the dynamic recompiler will avoid generating consecutive branch instructions without another instruction in between. This avoids a possible branch misprediction on the Cortex-A8 due to this processor having dual instruction decoders, but only one branch-prediction unit.
If this is defined, attempt to look up return addresses in a small hash table before checking the larger hash table. Usually improves performance.
Setting this to a value greater than 1 underclocks the emulated SH2 CPU.
Debugging information can be obtained by defining the assem_debug macro as printf. This will cause the dynamic recompiler to print debugging information to stdout. For each disassembled SH2 instruction, an entry similar to the following will be printed:
U: r0 pre: eax=16 ecx=1 edx=17 ebx=0 ebp=-1 esi=23 edi=-1 needs: eax ecx edx esi entry: eax=16 ecx=1 edx=17 ebx=-1 ebp=-1 esi=23 edi=-1 dirty: ebx esi ccadj=3 20000490: XOR r0,r0 eax=16 ecx=1 edx=17 ebx=0 ebp=-1 esi=23 edi=-1 dirty: ebx esi
U: A list of SH2 registers which will not be used before they are overwritten (liveness analysis)
pre: The state of the register mapping prior to execution of this instruction. (-1 = no mapping; 23 = cycle count; The complete list of values with special meanings can be found in the source code)
needs: a list of register mappings that were considered necessary and which could not be eliminated to make room for other mappings
entry: The minimum set of register mappings required to jump to this point
dirty: Cached registers that have been modified and will need to be written back
ccadj: Cycle count adjustment, the number of cycles which will be added to account for the preceeding instructions.
address: instruction - The decoded opcode, followed by the register mapping in effect after this instruction executes
An asterisk (*) designates locations which are the target of a branch instruction. Constant propagation will not be performed across these points.
After the complete disassembly, the recompiled native code is shown.
Note that the debugging output can be quite voluminous; 20-30 MB is typical.
On little-endian systems, Yabause stores the Saturn memory with the upper and lower half of every 32-bit word swapped. Currently these are swapped after every load and before every store. It would be possible to reduce unnecessary swapping by keeping track of when words actually need to be swapped.
Currently the x86-64 backend does not take advantage of the additional registers available on x86-64. It is unclear whether using additional registers is advantageous, as the REX prefixes increase the code size.
Apple ships a very outdated assembler, which creates some problems building the assembly code. The 32-bit version requires some minor syntax changes to compile. The 64-bit version needs to have all absolute addresses changed to rip-relative addresses.
The tendency of OSX to locate code (and .bss) above the first 4GB of memory creates a problem for indexing the hash tables used for return address lookup. It would be possible to compile without this (#undef USE_MINI_HT) resulting in a small performance penalty. Alternatively the hash table could be allocated in low memory using mmap() with a fixed address.
If memory_map is located above the first 4GB of RAM then it will be necessary to allocate a pointer to it. There is existing code to do so, as pointers are required on ARM (search for MMREG to find this code).
The mmap() calls need to be replaced with VirtualAlloc/VirtualProtect.
The assembly files will need to be compiled with GCC, or converted to a format that MSVC understands.