Current gem5-gpu Software Architecture

Here is a diagram of the current gem5-gpu software architecture.

CudaCore (src/gpu/gpgpu-sim/cuda_core.*, src/gpu/gpgpu-sim/

  • Wrapper for GPGPU-Sim shader_core_ctx (gpgpu-sim/gpgpu-sim/shader.h)
  • Sends instruction, global and const memory requests to the Ruby cache hierarchy
  • Data memory accesses:
    • Receives global and const data requests from GPGPU-Sim ldst_unit when a warp instruction is issued to it and the ldst_unit cycles (i.e. after ld_exec for the warp instruction)
    • Issues per-thread data requests to ShaderLSQ, which coalesces and sends reads and writes to Ruby
    • Reads and writes shader_core_ctx registers for memory read/write functionality as necessary
    • Signals warp load instruction completion for timing
  • Instruction memory accesses (this is hacky):
    • l1icache_gem5 is descended from read_only_cache (gpgpu-sim/gpgpu-sim/gpu-cache.h), but overrides the access() function to call into gem5-gpu CudaCore for instruction request timing
    • Signals instruction fetch completion when the request is returned through the Ruby caches

CudaGPU (src/gpu/gpgpu-sim/cuda_gpu.*, src/gpu/gpgpu-sim/

  • Acts as gem5 structure for organizing the GPU hardware:
    • Contains CudaCores
    • Handles kernel begin/end
    • Cycles GPGPU-Sim cores
  • Contains a logical copy engine to move data from CPU to GPU address spaces
  • Handles CPU thread context sleep and signal
  • Handles PTX code, variables and checkpointing
  • Manages GPU memory space page table

NOTE: The last 4 items will eventually be moved to a CudaContext as described here.

Shader TLB (src/gpu/shader_tlb.*, src/gpu/

  • In order to avoid confronting virtual address translation and timing on the GPU, the shader TLB is designed as a placeholder component that translates addresses in a single cycle. It contains 2 separate execution paths:
    • If instantiated with access_host_pagetable = False, the CudaGPU manages memory mapping in a functional pagetable that is accessed by the shader TLB
    • If instantiated with access_host_pagetable = True, the shader TLB uses the gem5 X86TLB to do translation
  • There are a couple of things to be aware of with address translation:
    • If running in gem5 SE mode, pagetable walk timing is NOT modeled. In this mode, address translation takes a fixed, single cycle
    • If running in gem5 FS mode, pagetable walks ARE modeled. In this mode, we recommend NOT using access_host_pagetable = True, because the walk overhead will be assessed during kernel runtime in the case of a TLB miss
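
The two execution paths above can be pictured with a small toy model. This is only a sketch, not the gem5-gpu implementation: the class ToyShaderTLB, its constructor arguments other than access_host_pagetable, the host_translate callback, and the 4 KiB page size are all assumptions made for illustration, and no timing is modeled.

```python
PAGE_SIZE = 4096  # assumed page size for this sketch


class ToyShaderTLB:
    """Toy stand-in for the shader TLB (hypothetical, not the gem5-gpu class)."""

    def __init__(self, access_host_pagetable, gpu_pagetable=None,
                 host_translate=None):
        self.access_host_pagetable = access_host_pagetable
        # Functional page table owned by the GPU model: {virtual page: physical page}
        self.gpu_pagetable = gpu_pagetable or {}
        # Callback standing in for the host (CPU) TLB lookup
        self.host_translate = host_translate

    def translate(self, vaddr):
        """Return the physical address for vaddr using the selected path."""
        vpn, offset = divmod(vaddr, PAGE_SIZE)
        if self.access_host_pagetable:
            ppn = self.host_translate(vpn)   # path 2: defer to the host TLB
        else:
            ppn = self.gpu_pagetable[vpn]    # path 1: GPU-managed page table
        return ppn * PAGE_SIZE + offset


# Path 1: the GPU model maps virtual page 1 to physical page 7.
gpu_path = ToyShaderTLB(False, gpu_pagetable={1: 7})
# Path 2: a made-up host mapping that offsets every page number by 100.
host_path = ToyShaderTLB(True, host_translate=lambda vpn: vpn + 100)
```

In both cases the page offset is carried through unchanged; only the source of the page mapping differs.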

Shader LSQ (src/gpu/shader_lsq.*, src/gpu/

  • The LSQ models load and store address input on a per-lane basis from the issue stage of the GPGPU-Sim core model. This unit handles coalescing of the requests to be sent into the gem5 Ruby memory system, and decoalescing of memory responses from Ruby to update registers for the threads that issued loads
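
As a rough illustration of coalescing and decoalescing, the sketch below groups per-lane addresses by cache line and then hands each lane its bytes from a returned line. It is not the ShaderLSQ code: the function names, the 128-byte line size, and the 4-byte access width are assumptions, and accesses that straddle a line boundary are ignored.

```python
from collections import defaultdict

LINE_SIZE = 128  # assumed cache-line size in bytes


def coalesce(lane_addrs):
    """Group per-lane byte addresses into one request per cache line.

    lane_addrs maps lane id -> byte address. The result maps each
    line base address -> list of (lane id, offset within the line).
    """
    lines = defaultdict(list)
    for lane, addr in sorted(lane_addrs.items()):
        line_base = addr - (addr % LINE_SIZE)
        lines[line_base].append((lane, addr - line_base))
    return dict(lines)


def decoalesce(line_data, lanes, width=4):
    """Split one returned line into the bytes each requesting lane asked for."""
    return {lane: line_data[off:off + width] for lane, off in lanes}


# Lanes 0 and 1 touch the same line; lane 2 touches the next one.
requests = coalesce({0: 0, 1: 4, 2: 128})
line0 = bytes(range(LINE_SIZE))           # pretend response for line 0
per_lane = decoalesce(line0, requests[0])
```

Here three lane accesses become two line requests, and the response for the first line is fanned back out to lanes 0 and 1.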

CUDA syscalls (src/api/cuda_syscalls.*)

  • gem5 supports decoding of non-ISA-specified instructions (pseudo-instructions) within a simulated benchmark. We introduced a pseudo-instruction called m5_gpu, which currently allows the CPU to trap and hand control over to CUDA syscalls. These instructions are built into libcuda in the simulated benchmark binary
  • Once trapped into CUDA syscalls, the appropriate CUDA call is executed, which may interface with the GPU for managing memory or kernel handling, PTX code handling or the copy engine
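
The trap-and-dispatch flow can be sketched as a lookup table from a call identifier to a handler. Everything named here is hypothetical: the call IDs, the handler functions, the dictionary used as GPU state, and the m5_gpu_trap function are illustrative stand-ins, not the interface in src/api/cuda_syscalls.*.

```python
def toy_cuda_malloc(gpu, size):
    """Hand out the next free GPU address (bump allocator, illustration only)."""
    addr = gpu["next_free"]
    gpu["next_free"] += size
    return addr


def toy_cuda_memcpy(gpu, dst, src, nbytes):
    """Queue a transfer on the toy copy engine and report the bytes queued."""
    gpu["copy_queue"].append((dst, src, nbytes))
    return nbytes


# Hypothetical call IDs; the real numbering is not given in this page.
TOY_CUDA_SYSCALLS = {
    0: toy_cuda_malloc,
    1: toy_cuda_memcpy,
}


def m5_gpu_trap(gpu, call_id, *args):
    """Stand-in for the pseudo-instruction handler: look up and run the call."""
    handler = TOY_CUDA_SYSCALLS.get(call_id)
    if handler is None:
        raise NotImplementedError(f"unsupported CUDA call id {call_id}")
    return handler(gpu, *args)


gpu_state = {"next_free": 0x1000, "copy_queue": []}
buf = m5_gpu_trap(gpu_state, 0, 256)             # allocate 256 bytes
copied = m5_gpu_trap(gpu_state, 1, buf, 0x8000, 256)  # queue a copy into it
```

An unsupported call ID raises an error in this sketch, which mirrors the idea that only a listed subset of CUDA calls is handled.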

Currently supported CUDA calls are listed here.

Current gem5-gpu limitations

gem5-gpu models memory accesses to GPU global, const, and local memory through the gem5/Ruby memory hierarchy. Shared (scratch) memory accesses are modeled in GPGPU-Sim code. Atomic memory operations are supported for the global and shared memory spaces.

However, the GPU simulation capability in gem5-gpu currently has a few noteworthy limitations:

  • gem5-gpu does not model the CUDA texture memory space. This would require aligning GPGPU-Sim's address space identifiers for texture memory with memory allocations and accesses in gem5-gpu.
  • Though GPGPU-Sim provides some support for asynchronous copy engine activity and multiple GPU kernel streams, this functionality has not yet been pulled into gem5-gpu.
cur-arch.txt · Last modified: 2016/02/19 08:53 by jthestness