Understanding the CUDA Programming Model for GPU Application Acceleration (Slides)

An overview of the CUDA programming model, its benefits, and the specifications of NVIDIA GPUs such as the Tesla C2050 and S2050. It also compares the performance of x86 CPUs and the S2050 and discusses the concepts of grids, blocks, threads, and kernels in CUDA. The document aims to help readers understand how to leverage GPUs for application acceleration.


Using The CUDA Programming Model

Leveraging GPUs for Application Acceleration

Let’s Make a … Socket!

  • Your goal is to speed up your code as much as possible, BUT …
  • … you have a budget for Power.
  • Do you choose:
    1. 6 Processors, each providing N performance, and using P Watts
    2. 450 Processors, each providing N/10 performance, and collectively using 2P Watts
    3. It depends! (see the rough numbers below)
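Taking the numbers in the question at face value (a rough back-of-the-envelope comparison, not from the slides themselves): option 1 delivers 6 × N = 6N aggregate performance for P Watts, or 6N/P per Watt, while option 2 delivers 450 × N/10 = 45N for 2P Watts, or 22.5N/P per Watt. On paper the many-core option offers roughly 3.75x the performance per Watt, but only if the workload can keep 450 slow processors busy, which is why "it depends" is the honest answer.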

Intel P4 Northwood

Modern Architecture (Intel)

[Figure A.2.2: Contemporary PCs with Intel and AMD CPUs. See Chapter 6 for an explanation of the components and interconnects in this figure. Copyright © 2009 Elsevier]

Why GPGPU Processing?

  • A quiet revolution
    • Calculation: TFLOPS vs. 100 GFLOPS
    • Memory Bandwidth: ~10x
    • GPU in every PC – massive volume

[Figure 1.1: Enlarging performance gap between GPUs (many-core) and CPUs (multi-core). Courtesy: John Owens]

NVIDIA Tesla S2050 Server Specs

  • 4 C2050 cards inside a 1U server (looks like any other server node)
  • 1.15 GHz
  • Single Precision (SP) floating point performance: 4121.6 GFLOPs
  • Double Precision (DP) floating point performance: 2060.8 GFLOPs
  • Internal RAM: 12 GB total (3 GB per GPU card)
  • Internal RAM speed: 576 GB/sec aggregate
  • Has to be plugged into two PCIe slots (at most 16 GB/sec)

Compare x86 vs. S2050

• Here are some interesting measures:

Metric               Dual-socket AMD 2.3 GHz 12-core   NVIDIA Tesla S2050
DP GFLOPs/Watt       ~0.5 GFLOPs/Watt                  ~1.6 GFLOPs/Watt (~3x)
SP GFLOPs/Watt       ~1 GFLOPs/Watt                    ~3.2 GFLOPs/Watt (~3x)
DP GFLOPs/sq ft      ~590 GFLOPs/sq ft                 ~2750 GFLOPs/sq ft (4.7x)
SP GFLOPs/sq ft      ~1180 GFLOPs/sq ft                ~5500 GFLOPs/sq ft (4.7x)
Racks per PFLOP DP   142 racks/PFLOP DP                32 racks/PFLOP DP (23%)
Racks per PFLOP SP   71 racks/PFLOP SP                 16 racks/PFLOP SP (23%)

OU’s Sooner is 34.5 TFLOPs DP, which is just over 1 rack of S2050.

Previous GPGPU Constraints

• Dealing with graphics API

  • To get general purpose code working, you had to use the corner cases of the graphics API
  • Essentially – re-write the entire program as a collection of shaders and polygons

[Figure: pre-CUDA fragment-shader programming model – input registers, fragment program, temp registers, texture, constants (per shader / per thread / per context), output registers, FB memory]

CUDA


  • “Compute Unified Device Architecture”
  • General purpose programming model
    • User kicks off batches of threads on the GPU
    • GPU = dedicated super-threaded, massively data parallel co-processor
  • Targeted software stack
    • Compute oriented drivers, language, and tools
  • Driver for loading computation programs onto the GPU

CUDA – C with a Co-processor

• One program, two devices

  • Serial or modestly parallel parts in host C code
  • Highly parallel parts in device kernel C code

[Figure: program execution alternates between serial host code and parallel device kernels]

Serial Code (host)
    ...
Parallel Kernel (device):  KernelA<<< nBlk, nTid >>>(args);

Serial Code (host)
    ...
Parallel Kernel (device):  KernelB<<< nBlk, nTid >>>(args);
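To make the host/device split concrete, here is a minimal sketch of a complete program in this style. The kernel name (scaleArray), the data size, and the launch configuration are illustrative assumptions, not taken from the slides.

// Minimal sketch: serial host code, a kernel launch, then serial host code again.
// scaleArray, n, and the launch configuration are illustrative, not from the slides.
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

__global__ void scaleArray(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (i < n)                                       // guard the tail block
        data[i] *= factor;
}

int main(void)
{
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    // Serial code (host): set up input data
    float *h_data = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) h_data[i] = 1.0f;

    float *d_data;
    cudaMalloc(&d_data, bytes);
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);

    // Parallel kernel (device): one thread per element
    int nTid = 256;                                  // threads per block
    int nBlk = (n + nTid - 1) / nTid;                // enough blocks to cover n elements
    scaleArray<<<nBlk, nTid>>>(d_data, 2.0f, n);

    // Serial code (host): copy back and use the result
    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);
    printf("h_data[0] = %f\n", h_data[0]);

    cudaFree(d_data);
    free(h_data);
    return 0;
}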

CUDA Devices and Threads

  • A CUDA compute device
    • Is a coprocessor to the CPU or host
    • Has its own DRAM (device memory)
    • Runs many threads in parallel
    • Is typically a GPU but can also be another type of parallel processing device
  • Data-parallel portions of an application are expressed as device kernels which run on many threads
  • Differences between GPU and CPU threads
    • GPU threads are extremely lightweight
      • Very little creation overhead
    • GPU needs 1000s of threads for full efficiency
      • Multi-core CPU needs only a few (and is hurt by having too many)

Buzzword: Thread

  • In CUDA, a thread is an execution of a kernel with a given index.
  • Each thread uses its index to access a specific subset of the data, such that the collection of all threads cooperatively processes the entire data set.
  • Think: Process ID
  • These operate very much like threads in OpenMP
    • they even have shared and private variables.
  • So what’s the difference with CUDA?
  • Threads are free

[Figure: eight threads with threadID 0–7, each executing the same kernel body on its own element]

    float x = input[threadID];
    float y = func(x);
    output[threadID] = y;
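The snippet above is only the body of a kernel; a sketch of the full kernel it might live in follows. The names applyFunc, input, output, n, and the body of func are assumptions for illustration.

// Each thread computes its own threadID and processes exactly one element.
__device__ float func(float x)
{
    return x * x;                    // stand-in for whatever per-element work is needed
}

__global__ void applyFunc(const float *input, float *output, int n)
{
    int threadID = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadID < n) {
        float x = input[threadID];   // each thread reads its own element
        float y = func(x);
        output[threadID] = y;        // and writes its own result
    }
}

// Launched with enough threads to cover all n elements, e.g.:
// applyFunc<<<(n + 255) / 256, 256>>>(d_input, d_output, n);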

Buzzword: Block

  • In CUDA, a block is a group of threads.
  • Blocks are used to organize threads into manageable (and schedulable) chunks.
  • Can organize threads in 1D, 2D, or 3D arrangements
  • What best matches your data?
  • Some restrictions, based on hardware
  • Threads within a block can do a bit of synchronization, if necessary (see the sketch below).
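As a sketch of how blocks are declared and used (the 16x16 block shape, grid size, and kernel body here are illustrative assumptions, not from the slides):

// Threads are organized into 2D blocks; blocks are organized into a 2D grid.
__global__ void blockDemo(const float *in, float *out, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;   // column handled by this thread
    int y = blockIdx.y * blockDim.y + threadIdx.y;   // row handled by this thread

    __shared__ float tile[16][16];                   // visible to all threads in this block

    if (x < width && y < height)
        tile[threadIdx.y][threadIdx.x] = in[y * width + x];

    __syncthreads();   // the "bit of synchronization": a barrier for the whole block

    if (x < width && y < height)
        out[y * width + x] = 2.0f * tile[threadIdx.y][threadIdx.x];
}

// Launch with a 2D arrangement that matches the 2D data:
// dim3 block(16, 16);                                   // 256 threads per block
// dim3 grid((width + 15) / 16, (height + 15) / 16);     // enough blocks to cover the data
// blockDemo<<<grid, block>>>(d_in, d_out, width, height);

The hardware restrictions mentioned above include a cap on the number of threads per block, which is why data sets are covered by a grid of many blocks rather than one huge block.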