Understanding the CUDA Programming Model for GPU Application Acceleration (Slides)

An overview of the CUDA programming model, its benefits, and the specifications of NVIDIA GPUs such as the Tesla C2050 and S2050. It also compares the performance of x86 CPUs and the S2050 and discusses the concepts of grids, blocks, threads, and kernels in CUDA. The document aims to help readers understand how to leverage GPUs for application acceleration.


Using The CUDA Programming Model

Leveraging GPUs for Application Acceleration

Let’s Make a … Socket!

  • Your goal is to speed up your code as much as possible, BUT …
  • … you have a budget for Power.
  • Do you choose:
    1. 6 Processors, each providing N performance, and using P Watts
    2. 450 Processors, each providing N/10 performance, and collectively using 2P Watts
    3. It depends! (see the rough numbers below)
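Taking the numbers in the question at face value (a rough back-of-the-envelope comparison, not from the slides themselves): option 1 delivers 6 × N = 6N aggregate performance for P Watts, or 6N/P per Watt, while option 2 delivers 450 × N/10 = 45N for 2P Watts, or 22.5N/P per Watt. On paper the many-core option offers roughly 3.75x the performance per Watt, but only if the workload can keep 450 slow processors busy, which is why "it depends" is the honest answer.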

Intel P4 Northwood

Modern Architecture (Intel)

[Figure A.2.2: Contemporary PCs with Intel and AMD CPUs. See Chapter 6 for an explanation of the components and interconnects in this figure. Copyright © 2009 Elsevier]

Why GPGPU Processing?

  • A quiet revolution
    • Calculation: TFLOPS vs. 100 GFLOPS
    • Memory Bandwidth: ~10x
    • GPU in every PC – massive volume

[Figure 1.1: Enlarging performance gap between GPUs (many-core) and CPUs (multi-core). Courtesy: John Owens]

NVIDIA Tesla S2050 Server Specs

  • 4 C2050 cards inside a 1U server (looks like any other server node)
  • 1.15 GHz
  • Single Precision (SP) floating point performance: 4121.6 GFLOPs
  • Double Precision (DP) floating point performance: 2060.8 GFLOPs
  • Internal RAM: 12 GB total (3 GB per GPU card)
  • Internal RAM speed: 576 GB/sec aggregate
  • Has to be plugged into two PCIe slots (at most 16 GB/sec)

Compare x86 vs. S2050

• Here are some interesting measures:

Metric               Dual-socket AMD 2.3 GHz 12-core   NVIDIA Tesla S2050
DP GFLOPs/Watt       ~0.5 GFLOPs/Watt                  ~1.6 GFLOPs/Watt (~3x)
SP GFLOPs/Watt       ~1 GFLOPs/Watt                    ~3.2 GFLOPs/Watt (~3x)
DP GFLOPs/sq ft      ~590 GFLOPs/sq ft                 ~2750 GFLOPs/sq ft (4.7x)
SP GFLOPs/sq ft      ~1180 GFLOPs/sq ft                ~5500 GFLOPs/sq ft (4.7x)
Racks per PFLOP DP   142 racks/PFLOP DP                32 racks/PFLOP DP (23%)
Racks per PFLOP SP   71 racks/PFLOP SP                 16 racks/PFLOP SP (23%)

OU’s Sooner is 34.5 TFLOPs DP, which is just over 1 rack of S2050.

Previous GPGPU Constraints

• Dealing with graphics API

  • To get general purpose code working, you had to use the corner cases of the graphics API
  • Essentially – re-write the entire program as a collection of shaders and polygons

[Figure: pre-CUDA fragment-shader programming model – input registers, fragment program, temp registers, texture, constants (per shader / per thread / per context), output registers, FB memory]

CUDA


  • “Compute Unified Device Architecture”
  • General purpose programming model
    • User kicks off batches of threads on the GPU
    • GPU = dedicated super-threaded, massively data parallel co-processor
  • Targeted software stack
    • Compute oriented drivers, language, and tools
  • Driver for loading computation programs onto the GPU

CUDA – C with a Co-processor

• One program, two devices

  • Serial or modestly parallel parts in host C code
  • Highly parallel parts in device kernel C code

[Figure: program execution alternates between serial host code and parallel device kernels]

Serial Code (host)
    ...
Parallel Kernel (device):  KernelA<<< nBlk, nTid >>>(args);

Serial Code (host)
    ...
Parallel Kernel (device):  KernelB<<< nBlk, nTid >>>(args);
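To make the host/device split concrete, here is a minimal sketch of a complete program in this style. The kernel name (scaleArray), the data size, and the launch configuration are illustrative assumptions, not taken from the slides.

// Minimal sketch: serial host code, a kernel launch, then serial host code again.
// scaleArray, n, and the launch configuration are illustrative, not from the slides.
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

__global__ void scaleArray(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (i < n)                                       // guard the tail block
        data[i] *= factor;
}

int main(void)
{
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    // Serial code (host): set up input data
    float *h_data = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) h_data[i] = 1.0f;

    float *d_data;
    cudaMalloc(&d_data, bytes);
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);

    // Parallel kernel (device): one thread per element
    int nTid = 256;                                  // threads per block
    int nBlk = (n + nTid - 1) / nTid;                // enough blocks to cover n elements
    scaleArray<<<nBlk, nTid>>>(d_data, 2.0f, n);

    // Serial code (host): copy back and use the result
    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);
    printf("h_data[0] = %f\n", h_data[0]);

    cudaFree(d_data);
    free(h_data);
    return 0;
}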

CUDA Devices and Threads

  • A CUDA compute device
    • Is a coprocessor to the CPU or host
    • Has its own DRAM (device memory)
    • Runs many threads in parallel
    • Is typically a GPU but can also be another type of parallel processing device
  • Data-parallel portions of an application are expressed as device kernels which run on many threads
  • Differences between GPU and CPU threads
    • GPU threads are extremely lightweight
      • Very little creation overhead
    • GPU needs 1000s of threads for full efficiency
      • Multi-core CPU needs only a few (and is hurt by having too many)

Buzzword: Thread

  • In CUDA, a thread is an execution of a kernel with a given index.
  • Each thread uses its index to access a specific subset of the data, such that the collection of all threads cooperatively processes the entire data set.
  • Think: Process ID
  • These operate very much like threads in OpenMP
    • they even have shared and private variables.
  • So what’s the difference with CUDA?
  • Threads are free

[Figure: eight threads with threadID 0–7, each executing the same kernel body on its own element]

    float x = input[threadID];
    float y = func(x);
    output[threadID] = y;
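The snippet above is only the body of a kernel; a sketch of the full kernel it might live in follows. The names applyFunc, input, output, n, and the body of func are assumptions for illustration.

// Each thread computes its own threadID and processes exactly one element.
__device__ float func(float x)
{
    return x * x;                    // stand-in for whatever per-element work is needed
}

__global__ void applyFunc(const float *input, float *output, int n)
{
    int threadID = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadID < n) {
        float x = input[threadID];   // each thread reads its own element
        float y = func(x);
        output[threadID] = y;        // and writes its own result
    }
}

// Launched with enough threads to cover all n elements, e.g.:
// applyFunc<<<(n + 255) / 256, 256>>>(d_input, d_output, n);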

Buzzword: Block

  • In CUDA, a block is a group of threads.
  • Blocks are used to organize threads into manageable (and schedulable) chunks.
  • Can organize threads in 1D, 2D, or 3D arrangements
  • What best matches your data?
  • Some restrictions, based on hardware
  • Threads within a block can do a bit of synchronization, if necessary (see the sketch below).
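As a sketch of how blocks are declared and used (the 16x16 block shape, grid size, and kernel body here are illustrative assumptions, not from the slides):

// Threads are organized into 2D blocks; blocks are organized into a 2D grid.
__global__ void blockDemo(const float *in, float *out, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;   // column handled by this thread
    int y = blockIdx.y * blockDim.y + threadIdx.y;   // row handled by this thread

    __shared__ float tile[16][16];                   // visible to all threads in this block

    if (x < width && y < height)
        tile[threadIdx.y][threadIdx.x] = in[y * width + x];

    __syncthreads();   // the "bit of synchronization": a barrier for the whole block

    if (x < width && y < height)
        out[y * width + x] = 2.0f * tile[threadIdx.y][threadIdx.x];
}

// Launch with a 2D arrangement that matches the 2D data:
// dim3 block(16, 16);                                   // 256 threads per block
// dim3 grid((width + 15) / 16, (height + 15) / 16);     // enough blocks to cover the data
// blockDemo<<<grid, block>>>(d_in, d_out, width, height);

The hardware restrictions mentioned above include a cap on the number of threads per block, which is why data sets are covered by a grid of many blocks rather than one huge block.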