
Module 2: Virtual Memory and Caches

Lecture 4: Cache Hierarchy and Memory-level Parallelism

The Lecture Contains:

Cache Hierarchy

States of a Cache Line

Inclusion Policy

The First Instruction

TLB Access

Memory op Latency

MLP

Out-of-order Loads

Load/store Ordering

MLP and Memory Wall

Cache Hierarchy

- Ideally we would want to hold everything in a fast cache and never go to memory
- But with increasing size the access time increases: a large cache will slow down every access
- So, put increasingly bigger and slower caches between the processor and the memory
- Keep the most recently used data in the nearest cache: the register file (RF)
- Next level of cache: level 1 or L1 (same speed as, or slightly slower than, the RF, but much bigger)
- Then L2: way bigger than L1 and much slower
- Example: Intel Pentium 4 (NetBurst)
  - 128 registers, accessible in 2 cycles
  - L1 data cache: 8 KB, 4-way set associative, 64-byte line size, accessible in 2 cycles for integer loads
  - L2 cache: 256 KB, 8-way set associative, 128-byte line size, accessible in 7 cycles

- Example: Intel Itanium 2 (code name Madison)
  - 128 registers, accessible in 1 cycle
  - L1 instruction and data caches: 16 KB each, 4-way set associative, 64-byte line size, accessible in 1 cycle
  - Unified L2 cache: 256 KB, 8-way set associative, 128-byte line size, accessible in 5 cycles
  - Unified L3 cache: 6 MB, 24-way set associative, 128-byte line size, accessible in 14 cycles
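
One way to see these levels from software is a pointer-chasing microbenchmark: the average access time jumps each time the working set outgrows a cache level. The sketch below is illustrative only; the buffer sizes, iteration count, and timing method are my own choices, not from the slides.

```c
#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Chase a random cyclic permutation of n pointers so the next address is
   unpredictable; every load depends on the previous one. */
static double chase_ns(size_t n, size_t iters)
{
    void **buf = malloc(n * sizeof(void *));
    size_t *perm = malloc(n * sizeof(size_t));
    for (size_t i = 0; i < n; i++) perm[i] = i;
    for (size_t i = n - 1; i > 0; i--) {              /* Fisher-Yates shuffle */
        size_t j = (size_t)rand() % (i + 1);
        size_t t = perm[i]; perm[i] = perm[j]; perm[j] = t;
    }
    for (size_t i = 0; i < n; i++)                    /* link into one big cycle */
        buf[perm[i]] = &buf[perm[(i + 1) % n]];

    void **p = &buf[perm[0]];
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < iters; i++)
        p = (void **)*p;                              /* serialized dependent loads */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    double per_access = (p != NULL) ? ns / iters : 0; /* use p so the loop is kept */
    free(buf); free(perm);
    return per_access;
}

int main(void)
{
    /* Working sets from 4 KB to 64 MB: expect steps near each cache capacity. */
    for (size_t kb = 4; kb <= 64 * 1024; kb *= 2)
        printf("%8zu KB: %6.1f ns/access\n", kb,
               chase_ns(kb * 1024 / sizeof(void *), 20u * 1000 * 1000));
    return 0;
}
```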

The First Instruction

Accessing the first instruction:

- Take the starting PC
- Access the iTLB with the VPN extracted from the PC: iTLB miss
- Invoke the iTLB miss handler
- Calculate the PTE address (sketched below)
- If PTEs are cached in the L1 data and L2 caches, look them up with the PTE address: you will miss there too
- Access the page table in main memory: the PTE is invalid: page fault
- Invoke the page fault handler
- Allocate a page frame, read the page from disk, update the PTE, load the PTE into the iTLB, restart the fetch
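
As a concrete illustration of the PTE address calculation step, the toy program below assumes 4 KB pages and a single-level page table with 8-byte entries; real page tables (x86, Itanium) are multi-level, so this only shows the flat-table idea from the slide.

```c
#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT 12u                        /* assumption: 4 KB pages */
#define PTE_SIZE   8u                         /* assumption: 8-byte page table entries */

int main(void)
{
    uint64_t pc = 0x00401A38;                 /* example starting PC (virtual address) */
    uint64_t page_table_base = 0x80000000;    /* hypothetical page table base address */

    uint64_t vpn      = pc >> PAGE_SHIFT;               /* virtual page number */
    uint64_t offset   = pc & ((1u << PAGE_SHIFT) - 1);  /* offset within the page */
    uint64_t pte_addr = page_table_base + vpn * PTE_SIZE;

    printf("VPN = 0x%llx, page offset = 0x%llx\n",
           (unsigned long long)vpn, (unsigned long long)offset);
    printf("PTE address = 0x%llx\n", (unsigned long long)pte_addr);
    return 0;
}
```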

- Now you have the physical address
- Access the I-cache: miss
- Send a refill request to the higher levels: you miss everywhere
- Send the request to the memory controller (north bridge)
- Access main memory and read the cache line
- Refill all levels of the cache as the cache line returns to the processor
- Extract the appropriate instruction from the cache line with the block offset
- This is the longest possible latency in an instruction/data access

TLB Access

- For every cache access (instruction or data) you need to access the TLB first
- This puts the TLB on the critical path
- We want to start indexing into the cache and reading the tags while the TLB lookup takes place
- Solution: a virtually indexed, physically tagged (VIPT) cache
  - Extract the index from the VA and start reading the tags while looking up the TLB
  - Once the PA is available, do the tag comparison
  - This overlaps the TLB read with the tag read (a worked check follows below)
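
A quick check of why virtual indexing is safe for the L1 quoted earlier: with 64-byte lines, an 8 KB 4-way cache has 8192 / (4 x 64) = 32 sets, so the 6 offset bits plus 5 index bits all lie inside the page offset (12 bits, assuming 4 KB pages) and are therefore identical in the VA and the PA. The snippet below just does this arithmetic; the cache parameters come from the Pentium 4 example, while the 4 KB page size is an assumption.

```c
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    /* Pentium 4 L1 data cache from the earlier slide; 4 KB page is assumed. */
    unsigned cache_bytes = 8 * 1024, ways = 4, line_bytes = 64, page_bytes = 4096;

    unsigned sets        = cache_bytes / (ways * line_bytes);   /* 32 sets */
    unsigned offset_bits = __builtin_ctz(line_bytes);           /* 6 bits  */
    unsigned index_bits  = __builtin_ctz(sets);                 /* 5 bits  */
    unsigned page_bits   = __builtin_ctz(page_bytes);           /* 12 bits */

    printf("index + offset = %u bits, page offset = %u bits\n",
           index_bits + offset_bits, page_bits);
    printf("virtual indexing is alias-free: %s\n",
           index_bits + offset_bits <= page_bits ? "yes" : "no");

    /* The set index can therefore be taken straight from the VA while
       the TLB translates the VPN in parallel. */
    uint64_t va = 0x7ffe1234;
    printf("set index for VA 0x%llx = %llu\n", (unsigned long long)va,
           (unsigned long long)((va >> offset_bits) & (sets - 1)));
    return 0;
}
```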

Memory op Latency

- L1 hit: ~1 ns
- L2 hit: ~5 ns
- L3 hit: ~10-15 ns
- Main memory: ~70 ns DRAM access time + bus transfer etc. = ~110-120 ns
- If a load misses in all caches it will eventually come to the head of the ROB and block instruction retirement (in-order retirement is a must)
- Gradually the pipeline backs up; the processor runs out of resources such as ROB entries and physical registers
- Ultimately the fetcher stalls: this severely limits ILP (a back-of-the-envelope calculation follows below)
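
To see why the window fills up, here is a back-of-the-envelope calculation; the clock frequency, issue width, and ROB size are assumed values for illustration, not numbers from the slides.

```c
#include <stdio.h>

int main(void)
{
    /* Assumed machine parameters: 3 GHz clock, 4-wide issue, 128-entry ROB. */
    double freq_ghz    = 3.0;
    double miss_ns     = 110.0;   /* end-to-end miss latency from the slide */
    int    issue_width = 4;
    int    rob_entries = 128;

    double miss_cycles = miss_ns * freq_ghz;          /* ~330 cycles per miss */
    double lost_slots  = miss_cycles * issue_width;   /* issue slots covered by the miss */

    printf("one miss ~= %.0f cycles ~= %.0f issue slots\n", miss_cycles, lost_slots);
    printf("but the ROB holds only %d instructions, so it fills after ~%.0f cycles\n",
           rob_entries, (double)rob_entries / issue_width);
    printf("and the fetcher stalls for the remaining ~%.0f cycles of the miss\n",
           miss_cycles - (double)rob_entries / issue_width);
    return 0;
}
```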

Load/store Ordering

- Out-of-order load issue relies on speculative memory disambiguation: assume that there will be no conflicting store
- If the speculation is correct, the load has issued much earlier and its dependents have also been allowed to execute much earlier
- If there is a conflicting store, the load and all the dependents that consumed the load value have to be squashed and re-executed
- It turns out that the speculation is correct most of the time
- To further minimize load squashes, microprocessors use simple memory dependence predictors: predict whether a load is going to conflict with a pending store based on that load's, or the load/store pair's, past behavior (a toy sketch follows below)
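
The slides do not name a specific predictor design, so the toy sketch below is only one way such a predictor could work: a table of 2-bit saturating counters indexed by a hash of the load PC, trained on whether that load has conflicted with a pending store in the past.

```c
#include <stdbool.h>
#include <stdint.h>

#define TABLE_SIZE 1024                      /* illustrative table size */

/* 2-bit saturating counters, one per entry: 0-1 = issue early, 2-3 = wait. */
static uint8_t conflict_ctr[TABLE_SIZE];

static unsigned hash_pc(uint64_t load_pc)
{
    return (unsigned)((load_pc >> 2) & (TABLE_SIZE - 1));
}

/* Issue stage: speculate (issue the load before older stores resolve)
   only when the counter says conflicts have been rare. */
bool predict_no_conflict(uint64_t load_pc)
{
    return conflict_ctr[hash_pc(load_pc)] < 2;
}

/* Training: called once the load's true outcome is known. */
void train(uint64_t load_pc, bool conflicted)
{
    uint8_t *c = &conflict_ctr[hash_pc(load_pc)];
    if (conflicted) { if (*c < 3) (*c)++; }  /* a squash pushes toward "wait" */
    else            { if (*c > 0) (*c)--; }  /* correct speculation pushes toward "issue early" */
}
```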

MLP and Memory Wall

- Today's microprocessors try to hide cache misses by initiating prefetches early
- Hardware prefetchers try to predict the next several load addresses and prefetch those cache lines if they are not already in the cache
- All processors today also support prefetch instructions, so you can specify in your program when to prefetch what: this gives much better control than a hardware prefetcher (a sketch follows this list)
- Researchers are also working on load value prediction
- Even after doing all of this, memory latency remains the biggest bottleneck
- Today's microprocessors are trying to overcome one single wall: the memory wall
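
For example, GCC and Clang expose prefetch instructions through the __builtin_prefetch builtin. The prefetch distance below (8 iterations ahead) is an illustrative value that would need tuning per machine; for a simple sequential scan like this a hardware prefetcher would usually suffice, so software prefetch pays off mainly for irregular access patterns.

```c
#include <stddef.h>

/* Sum an array while software-prefetching ahead of the loads.
   __builtin_prefetch(addr, rw, locality): rw = 0 means prefetch for read,
   locality = 3 means keep the line in all cache levels. */
double sum_with_prefetch(const double *a, size_t n)
{
    const size_t dist = 8;                     /* prefetch distance: tune per machine */
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + dist < n)
            __builtin_prefetch(&a[i + dist], 0, 3);
        sum += a[i];
    }
    return sum;
}
```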