These are lecture slides from a course on Program Optimization for Multi-Core Architectures, which covers topics such as triangular lower limits, multiple loop limits, dependence system solvers, single-equation tests, the simple test, and the extreme value test. The key points of this set are: cache coherence protocols, shared memory multiprocessors, consistency and coherence, stores, invalidation, protocol, and state transitions.
Ideally we would like to hold everything in a fast cache and never go to main memory. But access time increases with size, so a single large cache would slow down every access. The solution is to put increasingly bigger and slower caches between the processor and the memory, keeping the most recently used data closest to the processor:
- Register file (RF): the nearest and fastest level
- Level 1 (L1) cache: the same speed as or slightly slower than the RF, but much bigger
- Level 2 (L2) cache: far bigger than L1 and much slower

Example: Intel Pentium 4 (NetBurst)
- 128 registers, accessible in 2 cycles
- L1 data cache: 8 KB, 4-way set associative, 64-byte line size, accessible in 2 cycles for integer loads
- L2 cache: 256 KB, 8-way set associative, 128-byte line size, accessible in 7 cycles
Example: Intel Itanium 2 (code name Madison)
- 128 registers, accessible in 1 cycle
- L1 instruction and data caches: 16 KB each, 4-way set associative, 64-byte line size, accessible in 1 cycle
- Unified L2 cache: 256 KB, 8-way set associative, 128-byte line size, accessible in 5 cycles
- Unified L3 cache: 6 MB, 24-way set associative, 128-byte line size, accessible in 14 cycles
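The effect of this hierarchy is easy to observe from software. The sketch below (my own illustration, not from the slides) chases a randomly shuffled pointer chain through buffers of growing size; because every load depends on the previous one and the random order defeats hardware prefetching, the measured time per access jumps roughly from L1 to L2 to L3 to main-memory latency as the working set outgrows each level. The buffer sizes, the iteration count, and the use of POSIX clock_gettime are illustrative assumptions.

/*
 * Working-set latency sketch: chase a shuffled pointer chain through buffers
 * of increasing size and report the average time per dependent load.
 * Compile with -O2 on a POSIX system.
 */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static double now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1e9 + ts.tv_nsec;
}

int main(void) {
    const size_t sizes_kb[] = {4, 16, 64, 256, 1024, 4096, 16384, 65536};
    const size_t nsizes = sizeof(sizes_kb) / sizeof(sizes_kb[0]);
    srand(1);

    for (size_t s = 0; s < nsizes; s++) {
        size_t n = sizes_kb[s] * 1024 / sizeof(size_t);
        size_t *chain = malloc(n * sizeof(size_t));
        size_t *perm  = malloc(n * sizeof(size_t));

        /* Build a random cyclic permutation: chain[i] holds the next index. */
        for (size_t i = 0; i < n; i++) perm[i] = i;
        for (size_t i = n - 1; i > 0; i--) {          /* Fisher-Yates shuffle */
            size_t j = (size_t)rand() % (i + 1);
            size_t t = perm[i]; perm[i] = perm[j]; perm[j] = t;
        }
        for (size_t i = 0; i < n; i++)
            chain[perm[i]] = perm[(i + 1) % n];
        free(perm);

        /* Chase the chain: each load depends on the previous one. */
        const size_t accesses = 10 * 1000 * 1000;
        size_t idx = 0;
        double t0 = now_ns();
        for (size_t i = 0; i < accesses; i++)
            idx = chain[idx];
        double t1 = now_ns();

        printf("%8zu KB: %.1f ns/access (idx=%zu)\n",
               sizes_kb[s], (t1 - t0) / accesses, idx);
        free(chain);
    }
    return 0;
}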
Accessing the first instruction:
- Take the starting PC
- Access the iTLB with the VPN extracted from the PC: iTLB miss
- Invoke the iTLB miss handler
- Calculate the PTE address
- If PTEs are cached in the L1 data and L2 caches, look them up with the PTE address: you will miss there also
- Access the page table in main memory: the PTE is invalid: page fault
- Invoke the page fault handler
- Allocate a page frame, read the page from disk, update the PTE, load the PTE into the iTLB, restart the fetch
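As a toy illustration of the "extract the VPN, calculate the PTE address" steps (assuming, purely for illustration, a single-level page table, 4 KB pages, and 8-byte PTEs; real page tables are multi-level and their formats are architecture specific):

#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT 12u          /* 4 KB pages (illustrative)      */
#define PTE_SIZE   8u           /* bytes per page table entry     */

int main(void) {
    uint64_t pc      = 0x00400a38ull;    /* starting PC (virtual)      */
    uint64_t pt_base = 0x80000000ull;    /* page table base (physical) */

    uint64_t vpn      = pc >> PAGE_SHIFT;             /* virtual page number */
    uint64_t page_off = pc & ((1ull << PAGE_SHIFT) - 1);
    uint64_t pte_addr = pt_base + vpn * PTE_SIZE;     /* where the PTE lives */

    printf("VPN = 0x%llx, page offset = 0x%llx, PTE address = 0x%llx\n",
           (unsigned long long)vpn, (unsigned long long)page_off,
           (unsigned long long)pte_addr);

    /* The miss handler would now load the PTE from pte_addr (possibly
     * hitting in the L1 data or L2 cache), check its valid bit, take a page
     * fault if it is clear, and otherwise install the translation in the
     * iTLB and restart the fetch. */
    return 0;
}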
Now you have the physical address:
- Access the Icache: miss
- Send a refill request to the higher levels: you miss everywhere
- Send the request to the memory controller (north bridge)
- Access main memory and read the cache line
- Refill all levels of cache as the cache line returns to the processor
- Extract the appropriate instruction from the cache line with the block offset
This is the longest possible latency in an instruction/data access.
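The sketch below is a toy software model of this walk, not the processor's actual logic: each level is reduced to a tiny direct-mapped tag array, and a line that misses everywhere is installed in every level on its way back to the processor. The cache geometry and addresses are illustrative assumptions.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define LINE_SHIFT 6            /* 64-byte lines      */
#define SETS       64           /* tiny toy geometry  */

typedef struct {
    uint64_t tag[SETS];
    bool     valid[SETS];
    const char *name;
} Cache;

static bool lookup(Cache *c, uint64_t line_addr) {
    unsigned set = line_addr % SETS;
    return c->valid[set] && c->tag[set] == line_addr;
}

static void install(Cache *c, uint64_t line_addr) {
    unsigned set = line_addr % SETS;
    c->valid[set] = true;       /* evicts whatever was there (no write-back modeled) */
    c->tag[set]   = line_addr;
}

/* Probe L1 -> L2 -> L3 -> memory; refill every level that missed on the way back. */
static void access_line(Cache *levels[], int nlevels, uint64_t paddr) {
    uint64_t line_addr = paddr >> LINE_SHIFT;
    int hit_level = nlevels;                 /* nlevels means "went to memory" */
    for (int i = 0; i < nlevels; i++)
        if (lookup(levels[i], line_addr)) { hit_level = i; break; }

    for (int i = 0; i < hit_level; i++)
        install(levels[i], line_addr);       /* refill the levels that missed */

    if (hit_level == nlevels)
        printf("addr 0x%llx: missed everywhere, filled from memory\n",
               (unsigned long long)paddr);
    else
        printf("addr 0x%llx: hit in %s\n",
               (unsigned long long)paddr, levels[hit_level]->name);
}

int main(void) {
    Cache l1 = {.name = "L1"}, l2 = {.name = "L2"}, l3 = {.name = "L3"};
    Cache *levels[] = {&l1, &l2, &l3};
    access_line(levels, 3, 0x400a38);   /* cold: misses everywhere */
    access_line(levels, 3, 0x400a38);   /* now hits in L1          */
    return 0;
}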
For every cache access (instruction or data) you need to access the TLB first, which puts the TLB on the critical path. We want to start indexing into the cache and reading the tags while the TLB lookup takes place. The solution is a virtually indexed, physically tagged (VIPT) cache:
- Extract the index from the VA and start reading the tags while looking up the TLB
- Once the PA is available, do the tag comparison
- This overlaps the TLB read with the tag read
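A small sketch of why this overlap works, under an illustrative geometry (4 KB pages, 64-byte lines, 64 sets, so the set index sits entirely within the untranslated page-offset bits, which are identical in the VA and the PA):

#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT 12u   /* 4 KB pages: bits [11:0] are untranslated        */
#define LINE_SHIFT 6u    /* 64-byte lines: bits [5:0] are the block offset  */
#define INDEX_BITS 6u    /* 64 sets: bits [11:6] are the set index          */

int main(void) {
    uint64_t va = 0x7f3a12345a8cull;   /* virtual address of the access */

    /* These two fields use only untranslated bits, so the cache can read the
     * indexed set's tags while the TLB lookup is still in flight.           */
    uint64_t block_off = va & ((1ull << LINE_SHIFT) - 1);
    uint64_t set_index = (va >> LINE_SHIFT) & ((1ull << INDEX_BITS) - 1);

    /* Later, the TLB returns the physical page number; the tag is formed
     * from the PA and compared against the tags that were read early.       */
    uint64_t ppn = 0x1c0deull;                       /* pretend TLB output   */
    uint64_t pa  = (ppn << PAGE_SHIFT) | (va & ((1ull << PAGE_SHIFT) - 1));
    uint64_t tag = pa >> (LINE_SHIFT + INDEX_BITS);  /* physical tag to match */

    printf("set index = %llu, block offset = %llu, physical tag = 0x%llx\n",
           (unsigned long long)set_index, (unsigned long long)block_off,
           (unsigned long long)tag);
    return 0;
}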
Typical access latencies:
- L1 hit: ~1 ns
- L2 hit: ~5 ns
- L3 hit: ~10-15 ns
- Main memory: ~70 ns DRAM access time + bus transfer etc. = ~110-120 ns total
If a load misses in all caches it will eventually come to the head of the ROB and block instruction retirement (in-order retirement is a must). Gradually the pipeline backs up, and the processor runs out of resources such as ROB entries and physical registers. Ultimately the fetcher stalls, which severely limits ILP.
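As a rough worked example with illustrative numbers (not from the slides): on a 3 GHz, 4-wide-issue core, a ~110 ns memory access corresponds to roughly 330 cycles, i.e. over 1,300 potential issue slots; a reorder buffer of a few hundred entries fills long before the load returns, which is why retirement, and eventually fetch, stalls.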
Out-of-order load issue relies on speculative memory disambiguation: the hardware assumes there will be no conflicting store. If the speculation is correct, the load has issued much earlier and its dependents have also been allowed to execute much earlier. If there is a conflicting store, the load and all the dependents that consumed its value must be squashed and systematically re-executed. It turns out that the speculation is correct most of the time. To further reduce load squashes, microprocessors use simple memory dependence predictors, which predict whether a load is going to conflict with a pending store based on the past behavior of that load or of load/store pairs.
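The toy model below (an illustration, not the actual load/store queue hardware) captures the essential check: a younger load issues while an older store's address is still unknown, and when the store address resolves it is compared against the speculatively issued load; a match forces a squash and re-execution.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef struct {
    uint64_t addr;
    bool     addr_known;            /* has the store's address resolved yet? */
} Store;

typedef struct {
    uint64_t addr;
    bool     issued_speculatively;  /* issued past an unresolved older store */
} Load;

/* Called when an older store finally resolves its address: returns true if a
 * speculatively issued younger load touched the same address and must be
 * squashed, together with everything that consumed its value. */
static bool store_resolves(Store *st, uint64_t addr, const Load *ld) {
    st->addr = addr;
    st->addr_known = true;
    return ld->issued_speculatively && ld->addr == st->addr;
}

int main(void) {
    Store older_store  = { .addr_known = false };
    Load  younger_load = { .addr = 0x1000, .issued_speculatively = true };

    /* Case 1: the store writes a different address; the early issue was fine. */
    printf("store to 0x2000: %s\n",
           store_resolves(&older_store, 0x2000, &younger_load)
               ? "squash load + dependents" : "speculation correct");

    /* Case 2: the store writes the load's address; squash and re-execute.    */
    printf("store to 0x1000: %s\n",
           store_resolves(&older_store, 0x1000, &younger_load)
               ? "squash load + dependents" : "speculation correct");
    return 0;
}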
Today's microprocessors try to hide cache misses by initiating prefetches early:
- Hardware prefetchers try to predict the next several load addresses and initiate cache line prefetches if the lines are not already in the cache
- All processors today also support prefetch instructions, so you can specify in your program when to prefetch what; this gives much better control than a hardware prefetcher (see the sketch below)
- Researchers are also working on load value prediction
Even after doing all of this, memory latency remains the biggest bottleneck. Today's microprocessors are trying to overcome one single wall: the memory wall.
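As a concrete example of a software prefetch, the sketch below uses the GCC/Clang __builtin_prefetch intrinsic; the prefetch distance of 16 elements and the array size are illustrative assumptions, and the right distance depends on the loop body and the latency being hidden.

#include <stddef.h>
#include <stdio.h>

/* Sum an array, prefetching 16 elements ahead of the current access. */
static double sum_with_prefetch(const double *a, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + 16 < n)
            __builtin_prefetch(&a[i + 16], 0, 3);  /* 0 = read, 3 = keep in cache */
        s += a[i];
    }
    return s;
}

int main(void) {
    static double a[1 << 20];                 /* ~8 MB of doubles */
    for (size_t i = 0; i < sizeof(a) / sizeof(a[0]); i++)
        a[i] = 1.0;
    printf("sum = %.1f\n", sum_with_prefetch(a, sizeof(a) / sizeof(a[0])));
    return 0;
}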