All the images in this talk were rendered from real-time demos.

The main difference is that GPUs use multi-threading to tolerate latency: each time you wait for a read, just start another thread. This works if there are lots of threads.

#### 100,000 times faster for current Pixar results, more needed next year

In the entertainment-related computer graphics business, the time it takes to compute one frame stays constant over the years. The reason is that audience expectations increase at the same rate as computer power.

```
thread:
    // load
    r1 = load (index)
    // series of adds
    r1 = r1 + r1
    r1 = r1 + r1
    ...

Run lots of threads
Can you get peak performance/multi-core/cluster?
Peak performance = do float ops every cycle
```

This simple program is supposed to show a case where the GPU is much better than the CPU.

The gap between the fetch and the ALU is the latency. The big bar at the top shows when the float units are running. It is 100% active if there are enough threads.

Wavefronts are units of 64 threads; they are also called warps. All resources are allocated at start, so no deadlock is possible. A thread is one PC and one group of registers; a wavefront is 64 threads.

## **Implications**

CPU: Loads determine performance

- Compiler works hard to
  - Minimize ALU code
  - Reduce memory overhead
  - Try to use prefetch and other magic to reduce the amount of time waiting for memory

GPU: Threads determine performance

- Compiler works hard to
  - Minimize ALU code
  - Maximize threads
  - Try to reorder instructions to reduce synchronization and other magic to reduce the amount of time waiting for threads

```
CPU part

for each frame (sequential) {
    build vertex buffer
    set uniform inputs
    draw
}

This is producer consumer parallelism
Internal queue of pending draw commands (often hundreds)
```

```
Programming Model – GPU Part

foreach vertex in buffer (parallel) {
    call vertex kernel/shader
}
foreach set of 3 vertex outputs (a triangle) (seq) {
    fixed function rasterize
    foreach pixel (parallel) {
        call pixel kernel/shader
    }
}

Nothing about number of cores
Nothing about sync
Developer just writes kernels (in RED)
```

Each box in the grid gets its own thread. The thread count is determined by the hardware, not by the app; a bigger screen means more threads. The whole system scales with a bigger screen or more processors, without change.

Pixel Shader

```
float4 ambient;
float4 diffuse;
float4 specular;
float Ka, Ks, Kd, N;

float4 main( float4 Diff   : COLOR0,
             float3 Normal : TEXCOORD0,
             float3 Light  : TEXCOORD1,
             float3 View   : TEXCOORD2 ) : COLOR
{
    // Compute the reflection vector:
    float3 vReflect = normalize(2*dot(Normal, Light)*Normal - Light);

    // Final color is composed of ambient, diffuse and specular
    // contributions:
    float4 FinalColor = Ka * ambient +
                        Kd * diffuse * dot( Normal, Light ) +
                        Ks * specular * pow( max( dot( vReflect, View), 0), N );
    return FinalColor;
}
```

20 statements in byte code

AMD, CGO 2008

# **Programming model**

- Vertex and pixel kernels (shaders)
- Parallel loops are implicit
- Performance-aware code does not know how many cores or how many threads
- All sorts of queues maintained under the covers
- All kinds of sync done implicitly
- Programs are very small

## Parallelism Model

- All parallel operations are hidden via domain-specific API calls
- Developers write sequential code + kernels
- Kernels operate on one vertex or pixel
- Developers never deal with parallelism directly
- No need for auto-parallelizing compilers

All four demos have specific names; for a graphics talk I'd use the actual names.

Either the demo or movie goes here.

Box and whisker plot of shader length. The box is ½ standard deviation around the mean, the line is 1½, and outliers come after that. The size of the max shader is growing, with more control flow. Inside the graph is the max triangle count in units of 1000 triangles, so the highest value is 2 million. Number of shaders is the count. Time or chip version is going up; d1, d2, d3, d4 are the demo numbers. We see a double exponential in growth (triangles and shader size); the count goes down because of more control flow.
The horizontal scale is in asm lines, so an 800 asm line shader is a big one.

### **Shader Compiler (SC)**

- Developers ship games in byte code
  - Each time a game starts the shader is compiled
- Compiler is hidden in the driver
- No user gets to set options or flags
- Compiler updates with new driver (once a month)
- Compile done each time the game is run
- Like a JIT, but we care about performance
- SC runs on consoles/phones/laptops/desktops etc.

# Relations to Std CPU Compiler

- About ½ of the code is traditional compiler, all the usual stuff
  - SSA form
  - Graph coloring register allocator
  - Instruction scheduler
- But there is a lot of special stuff!

#### **Some Odd Features**

- HLSL compiler is written by Microsoft and has its own idea of how to optimize a program
- Each compiler fights the other, so SC undoes the MS optimizations
- Hardware updates frequently so
  - SC supports a large number of targets, internally (picture gcc done in C++ classes)
  - One version of the compiler supports all chips

Shows the delta in registers. HLSL thinks the machine has vector registers, and it does, so HLSL does an OK job. I split the 3000 shaders into 6 groups by length. The smallest are lower left, the biggest are upper right (lots of small ones, not so many large ones). A shader on the diagonal line means HLSL and SC used the same number of registers. A dot can be a lot of shaders if they overlap.

```
Open Problem

Path aware allocation

if (run-time-const) {
    call s1;
} else {
    call s2;
}

Can we allocate high numbered regs to s2?
```

The problem is to allocate registers and then, at run time, if we know that s1 will always be called, just say the shader needs fewer registers so that it gets more threads. Handle this without recompiling.

Here we have the same 3k shaders. HLSL thinks it is a vector machine, but it is actually 5-way VLIW, so the vector assignment does not work well.

These are 5 current DX games: number of pixel shaders and average packing in 5-way VLIW issue.

This is an actual graph generated by SC for a single basic block in a shader computing Perlin noise. The greedy list scheduler should have left some holes in the schedule for this case. I think this is clearly a hand-unrolled or HLSL-unrolled loop; can we do some fast graph analysis to figure out the structure?

This was a real bug report; I listed the two names for the error.
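Going back to the path-aware allocation problem, here is a rough sketch of why the register count matters for thread count. It assumes a simple model in which each thread reserves its registers up front and the number of resident wavefronts is the register file size divided by the registers one wavefront needs. The register file size (16384), the register numbers for s1 and s2, and all function names are hypothetical values picked for illustration; only the 64-thread wavefront comes from the talk.

```python
WAVEFRONT_SIZE = 64  # threads per wavefront, as stated in the talk


def max_wavefronts(register_file_size: int, regs_per_thread: int) -> int:
    """Wavefronts that fit if every thread reserves regs_per_thread registers.

    Hypothetical model: registers are the only limiting resource.
    """
    if regs_per_thread <= 0:
        raise ValueError("regs_per_thread must be positive")
    return register_file_size // (regs_per_thread * WAVEFRONT_SIZE)


def declared_registers(s1_highest: int, s2_highest: int,
                       s1_always_called: bool) -> int:
    """Register count the shader must declare.

    Assumes registers are numbered from 0 and that s2 was given the
    high-numbered registers, so dropping s2 lowers the highest index used.
    """
    if s1_always_called:
        return s1_highest + 1
    return max(s1_highest, s2_highest) + 1


if __name__ == "__main__":
    REGISTER_FILE = 16384  # hypothetical register file size
    for known in (False, True):
        regs = declared_registers(s1_highest=7, s2_highest=31,
                                  s1_always_called=known)
        print(known, regs, max_wavefronts(REGISTER_FILE, regs))
```

In this toy model, knowing at run time that only s1 is called drops the declared count from 32 to 8 registers and raises the resident wavefronts from 8 to 32, with no recompilation, only a different declared register count.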