These are lecture slides from a course on Program Optimization for Multi-Core Architectures, which covers topics such as triangular lower limits, multiple loop limits, dependence system solvers, single-equation tests, the simple test, and the extreme value test. The key points of this set are: cache coherence protocols, shared memory multiprocessors, consistency and coherence, stores, invalidation, protocol, and state transitions.
Ideally we would like to hold everything in a fast cache and never go to main memory. But access time increases with size, so a single large cache would slow down every access. The solution is to put increasingly bigger and slower caches between the processor and the memory, keeping the most recently used data closest to the processor:
- Register file (RF): the nearest and fastest level
- Level 1 (L1) cache: the same speed as or slightly slower than the RF, but much bigger
- Level 2 (L2) cache: far bigger than L1 and much slower

Example: Intel Pentium 4 (NetBurst)
- 128 registers, accessible in 2 cycles
- L1 data cache: 8 KB, 4-way set associative, 64-byte line size, accessible in 2 cycles for integer loads
- L2 cache: 256 KB, 8-way set associative, 128-byte line size, accessible in 7 cycles
Example: Intel Itanium 2 (code name Madison)
- 128 registers, accessible in 1 cycle
- L1 instruction and data caches: 16 KB each, 4-way set associative, 64-byte line size, accessible in 1 cycle
- Unified L2 cache: 256 KB, 8-way set associative, 128-byte line size, accessible in 5 cycles
- Unified L3 cache: 6 MB, 24-way set associative, 128-byte line size, accessible in 14 cycles
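The effect of this hierarchy is easy to observe from software. The sketch below (my own illustration, not from the slides) chases a randomly shuffled pointer chain through buffers of growing size; because every load depends on the previous one and the random order defeats hardware prefetching, the measured time per access jumps roughly from L1 to L2 to L3 to main-memory latency as the working set outgrows each level. The buffer sizes, the iteration count, and the use of POSIX clock_gettime are illustrative assumptions.

/*
 * Working-set latency sketch: chase a shuffled pointer chain through buffers
 * of increasing size and report the average time per dependent load.
 * Compile with -O2 on a POSIX system.
 */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static double now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1e9 + ts.tv_nsec;
}

int main(void) {
    const size_t sizes_kb[] = {4, 16, 64, 256, 1024, 4096, 16384, 65536};
    const size_t nsizes = sizeof(sizes_kb) / sizeof(sizes_kb[0]);
    srand(1);

    for (size_t s = 0; s < nsizes; s++) {
        size_t n = sizes_kb[s] * 1024 / sizeof(size_t);
        size_t *chain = malloc(n * sizeof(size_t));
        size_t *perm  = malloc(n * sizeof(size_t));

        /* Build a random cyclic permutation: chain[i] holds the next index. */
        for (size_t i = 0; i < n; i++) perm[i] = i;
        for (size_t i = n - 1; i > 0; i--) {          /* Fisher-Yates shuffle */
            size_t j = (size_t)rand() % (i + 1);
            size_t t = perm[i]; perm[i] = perm[j]; perm[j] = t;
        }
        for (size_t i = 0; i < n; i++)
            chain[perm[i]] = perm[(i + 1) % n];
        free(perm);

        /* Chase the chain: each load depends on the previous one. */
        const size_t accesses = 10 * 1000 * 1000;
        size_t idx = 0;
        double t0 = now_ns();
        for (size_t i = 0; i < accesses; i++)
            idx = chain[idx];
        double t1 = now_ns();

        printf("%8zu KB: %.1f ns/access (idx=%zu)\n",
               sizes_kb[s], (t1 - t0) / accesses, idx);
        free(chain);
    }
    return 0;
}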
Accessing the first instruction:
- Take the starting PC
- Access the iTLB with the VPN extracted from the PC: iTLB miss
- Invoke the iTLB miss handler
- Calculate the PTE address
- If PTEs are cached in the L1 data and L2 caches, look them up with the PTE address: you will miss there also
- Access the page table in main memory: the PTE is invalid: page fault
- Invoke the page fault handler
- Allocate a page frame, read the page from disk, update the PTE, load the PTE into the iTLB, restart the fetch
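As a toy illustration of the "extract the VPN, calculate the PTE address" steps (assuming, purely for illustration, a single-level page table, 4 KB pages, and 8-byte PTEs; real page tables are multi-level and their formats are architecture specific):

#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT 12u          /* 4 KB pages (illustrative)      */
#define PTE_SIZE   8u           /* bytes per page table entry     */

int main(void) {
    uint64_t pc      = 0x00400a38ull;    /* starting PC (virtual)      */
    uint64_t pt_base = 0x80000000ull;    /* page table base (physical) */

    uint64_t vpn      = pc >> PAGE_SHIFT;             /* virtual page number */
    uint64_t page_off = pc & ((1ull << PAGE_SHIFT) - 1);
    uint64_t pte_addr = pt_base + vpn * PTE_SIZE;     /* where the PTE lives */

    printf("VPN = 0x%llx, page offset = 0x%llx, PTE address = 0x%llx\n",
           (unsigned long long)vpn, (unsigned long long)page_off,
           (unsigned long long)pte_addr);

    /* The miss handler would now load the PTE from pte_addr (possibly
     * hitting in the L1 data or L2 cache), check its valid bit, take a page
     * fault if it is clear, and otherwise install the translation in the
     * iTLB and restart the fetch. */
    return 0;
}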
Now you have the physical address:
- Access the Icache: miss
- Send a refill request to the higher levels: you miss everywhere
- Send the request to the memory controller (north bridge)
- Access main memory and read the cache line
- Refill all levels of cache as the cache line returns to the processor
- Extract the appropriate instruction from the cache line with the block offset
This is the longest possible latency in an instruction/data access.
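The sketch below is a toy software model of this walk, not the processor's actual logic: each level is reduced to a tiny direct-mapped tag array, and a line that misses everywhere is installed in every level on its way back to the processor. The cache geometry and addresses are illustrative assumptions.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define LINE_SHIFT 6            /* 64-byte lines      */
#define SETS       64           /* tiny toy geometry  */

typedef struct {
    uint64_t tag[SETS];
    bool     valid[SETS];
    const char *name;
} Cache;

static bool lookup(Cache *c, uint64_t line_addr) {
    unsigned set = line_addr % SETS;
    return c->valid[set] && c->tag[set] == line_addr;
}

static void install(Cache *c, uint64_t line_addr) {
    unsigned set = line_addr % SETS;
    c->valid[set] = true;       /* evicts whatever was there (no write-back modeled) */
    c->tag[set]   = line_addr;
}

/* Probe L1 -> L2 -> L3 -> memory; refill every level that missed on the way back. */
static void access_line(Cache *levels[], int nlevels, uint64_t paddr) {
    uint64_t line_addr = paddr >> LINE_SHIFT;
    int hit_level = nlevels;                 /* nlevels means "went to memory" */
    for (int i = 0; i < nlevels; i++)
        if (lookup(levels[i], line_addr)) { hit_level = i; break; }

    for (int i = 0; i < hit_level; i++)
        install(levels[i], line_addr);       /* refill the levels that missed */

    if (hit_level == nlevels)
        printf("addr 0x%llx: missed everywhere, filled from memory\n",
               (unsigned long long)paddr);
    else
        printf("addr 0x%llx: hit in %s\n",
               (unsigned long long)paddr, levels[hit_level]->name);
}

int main(void) {
    Cache l1 = {.name = "L1"}, l2 = {.name = "L2"}, l3 = {.name = "L3"};
    Cache *levels[] = {&l1, &l2, &l3};
    access_line(levels, 3, 0x400a38);   /* cold: misses everywhere */
    access_line(levels, 3, 0x400a38);   /* now hits in L1          */
    return 0;
}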
For every cache access (instruction or data) you need to access the TLB first, which puts the TLB on the critical path. We want to start indexing into the cache and reading the tags while the TLB lookup takes place. The solution is a virtually indexed, physically tagged (VIPT) cache:
- Extract the index from the VA and start reading the tags while looking up the TLB
- Once the PA is available, do the tag comparison
- This overlaps the TLB read with the tag read
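A small sketch of why this overlap works, under an illustrative geometry (4 KB pages, 64-byte lines, 64 sets, so the set index sits entirely within the untranslated page-offset bits, which are identical in the VA and the PA):

#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT 12u   /* 4 KB pages: bits [11:0] are untranslated        */
#define LINE_SHIFT 6u    /* 64-byte lines: bits [5:0] are the block offset  */
#define INDEX_BITS 6u    /* 64 sets: bits [11:6] are the set index          */

int main(void) {
    uint64_t va = 0x7f3a12345a8cull;   /* virtual address of the access */

    /* These two fields use only untranslated bits, so the cache can read the
     * indexed set's tags while the TLB lookup is still in flight.           */
    uint64_t block_off = va & ((1ull << LINE_SHIFT) - 1);
    uint64_t set_index = (va >> LINE_SHIFT) & ((1ull << INDEX_BITS) - 1);

    /* Later, the TLB returns the physical page number; the tag is formed
     * from the PA and compared against the tags that were read early.       */
    uint64_t ppn = 0x1c0deull;                       /* pretend TLB output   */
    uint64_t pa  = (ppn << PAGE_SHIFT) | (va & ((1ull << PAGE_SHIFT) - 1));
    uint64_t tag = pa >> (LINE_SHIFT + INDEX_BITS);  /* physical tag to match */

    printf("set index = %llu, block offset = %llu, physical tag = 0x%llx\n",
           (unsigned long long)set_index, (unsigned long long)block_off,
           (unsigned long long)tag);
    return 0;
}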
Typical access latencies:
- L1 hit: ~1 ns
- L2 hit: ~5 ns
- L3 hit: ~10-15 ns
- Main memory: ~70 ns DRAM access time + bus transfer etc. = ~110-120 ns total
If a load misses in all caches it will eventually come to the head of the ROB and block instruction retirement (in-order retirement is a must). Gradually the pipeline backs up, and the processor runs out of resources such as ROB entries and physical registers. Ultimately the fetcher stalls, which severely limits ILP.
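As a rough worked example with illustrative numbers (not from the slides): on a 3 GHz, 4-wide-issue core, a ~110 ns memory access corresponds to roughly 330 cycles, i.e. over 1,300 potential issue slots; a reorder buffer of a few hundred entries fills long before the load returns, which is why retirement, and eventually fetch, stalls.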
Out-of-order load issue relies on speculative memory disambiguation: the hardware assumes there will be no conflicting store. If the speculation is correct, the load has issued much earlier and its dependents have also been allowed to execute much earlier. If there is a conflicting store, the load and all the dependents that consumed its value must be squashed and systematically re-executed. It turns out that the speculation is correct most of the time. To further reduce load squashes, microprocessors use simple memory dependence predictors, which predict whether a load is going to conflict with a pending store based on the past behavior of that load or of load/store pairs.
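The toy model below (an illustration, not the actual load/store queue hardware) captures the essential check: a younger load issues while an older store's address is still unknown, and when the store address resolves it is compared against the speculatively issued load; a match forces a squash and re-execution.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef struct {
    uint64_t addr;
    bool     addr_known;            /* has the store's address resolved yet? */
} Store;

typedef struct {
    uint64_t addr;
    bool     issued_speculatively;  /* issued past an unresolved older store */
} Load;

/* Called when an older store finally resolves its address: returns true if a
 * speculatively issued younger load touched the same address and must be
 * squashed, together with everything that consumed its value. */
static bool store_resolves(Store *st, uint64_t addr, const Load *ld) {
    st->addr = addr;
    st->addr_known = true;
    return ld->issued_speculatively && ld->addr == st->addr;
}

int main(void) {
    Store older_store  = { .addr_known = false };
    Load  younger_load = { .addr = 0x1000, .issued_speculatively = true };

    /* Case 1: the store writes a different address; the early issue was fine. */
    printf("store to 0x2000: %s\n",
           store_resolves(&older_store, 0x2000, &younger_load)
               ? "squash load + dependents" : "speculation correct");

    /* Case 2: the store writes the load's address; squash and re-execute.    */
    printf("store to 0x1000: %s\n",
           store_resolves(&older_store, 0x1000, &younger_load)
               ? "squash load + dependents" : "speculation correct");
    return 0;
}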
Today's microprocessors try to hide cache misses by initiating prefetches early:
- Hardware prefetchers try to predict the next several load addresses and initiate cache line prefetches if the lines are not already in the cache
- All processors today also support prefetch instructions, so you can specify in your program when to prefetch what; this gives much better control than a hardware prefetcher (see the sketch below)
- Researchers are also working on load value prediction
Even after doing all of this, memory latency remains the biggest bottleneck. Today's microprocessors are trying to overcome one single wall: the memory wall.
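As a concrete example of a software prefetch, the sketch below uses the GCC/Clang __builtin_prefetch intrinsic; the prefetch distance of 16 elements and the array size are illustrative assumptions, and the right distance depends on the loop body and the latency being hidden.

#include <stddef.h>
#include <stdio.h>

/* Sum an array, prefetching 16 elements ahead of the current access. */
static double sum_with_prefetch(const double *a, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + 16 < n)
            __builtin_prefetch(&a[i + 16], 0, 3);  /* 0 = read, 3 = keep in cache */
        s += a[i];
    }
    return s;
}

int main(void) {
    static double a[1 << 20];                 /* ~8 MB of doubles */
    for (size_t i = 0; i < sizeof(a) / sizeof(a[0]); i++)
        a[i] = 1.0;
    printf("sum = %.1f\n", sum_with_prefetch(a, sizeof(a) / sizeof(a[0])));
    return 0;
}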