Acknowledgments: Lecture slides are from the Operating Systems course taught by John Kubiatowicz at Berkeley, with a few minor updates/changes. When slides are obtained from other sources, a reference is noted at the bottom of that slide, and a full list of references is provided on the last slide.
Recall: Fix for sparse address space: The two-level page table

- Tree of Page Tables
  - “Magic” 10b-10b-12b pattern!
- Tables fixed size (1024 entries)
  - On context-switch: save single PageTablePtr register (i.e. CR3)
- Valid bits on Page Table Entries
  - Don’t need every 2nd-level table
  - Even when they exist, 2nd-level tables can reside on disk if not in use
Recall: Making it real:
X86 Memory model with segmentation (16/32-bit)

Segment Selector from instruction: **mov eax, gs(0x0)**

2-level page table in 10-10-12 bit address

Combined address is 32-bit “linear” Virtual address

First level called “directory”

Second level called “table”
Recall: In Machine Structures (eg. 61C) …

- Caching is the key to memory system performance

Average Memory Access Time (AMAT)

\[ \text{AMAT} = (\text{Hit Rate} \times \text{HitTime}) + (\text{Miss Rate} \times \text{MissTime}) \]

Where HitRate + MissRate = 1

HitRate = 90% => AMAT = \((0.9 \times 1) + (0.1 \times 101)\) = 11.0 ns

HitRate = 99% => AMAT = \((0.99 \times 1) + (0.01 \times 101)\) = 2.00 ns
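
The AMAT formula can be checked with a few lines of Python. This is a minimal sketch: the function name `amat` is made up for illustration, and the 1 ns hit time and 101 ns miss time are simply the values used in the numeric example.

```python
def amat(hit_rate, hit_time_ns, miss_time_ns):
    """Average memory access time, using MissRate = 1 - HitRate."""
    miss_rate = 1.0 - hit_rate
    return hit_rate * hit_time_ns + miss_rate * miss_time_ns


if __name__ == "__main__":
    for hit_rate in (0.90, 0.99):
        print(f"HitRate = {hit_rate:.0%}: AMAT = {amat(hit_rate, 1, 101):.2f} ns")
```

Raising the hit rate from 90% to 99% cuts the miss term by a factor of ten, which is why the average drops so sharply.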
Recall: The Memory Hierarchy

- **Registers:** speed 0.3 ns, size 100Bs
- **L1 Cache:** speed 1 ns, size 10kBs
- **L2 Cache:** speed 3 ns, size 100kBs
- **L3 Cache (shared):** speed 10-30 ns, size MBs
- **Main Memory (DRAM):** speed 100 ns, size GBs
- **Secondary Storage (SSD):** speed 100,000 ns (0.1 ms), size 100GBs
- **Secondary Storage (Disk):** speed 10,000,000 ns (10 ms), size TBs

The caches are managed in hardware, and everything down to main memory is accessed in hardware. The TLB sits next to the processor; the page table (PT) lives in main memory.
Recall: How to make Address Translation Fast?

- Cache results of recent translations!
  - Different from a traditional cache
  - Cache Page Table Entries using Virtual Page # as the key

![Diagram of address translation process]

- The processor (core) sends a virtual address to the MMU, and the MMU sends the resulting physical address to the cache(s).

- Cached Page Table entries, keyed by virtual page number:
  - V-Page M_1 : <Phys Frame #1, V, ...>
  - V-Page M_2 : <Phys Frame #2, V, ...>
  - V-Page M_k : <Phys Frame #k, V, ...>
Translation Look-Aside Buffer

• TLB is a cache of translations:
  – Record recent Virtual Page # to Physical Frame # translation
• If present, get the physical address from TLB without reading any of the page tables !!!
  – Even if the translation involved multiple levels
  – Caches the end-to-end result
• Was invented by Sir Maurice Wilkes – prior to caches
  – People realized “if it’s good for page tables, why not the rest of the data in memory?”
• On a TLB miss, the page tables may be cached, so only go to memory when both miss
  – Ultimately invokes page table walk
• Question is one of page locality: does it exist?
  – Instruction accesses spend a lot of time on the same page (since accesses are sequential)
  – Stack accesses have definite locality of reference
  – Data accesses have less page locality, but still some…

• Can we have a TLB hierarchy?
  – Sure: multiple levels at different sizes/speeds
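
The idea of caching end-to-end translations can be sketched in Python. This is an illustration under stated assumptions, not how any real MMU is built: the class name `TLB`, the 4 KiB page size, the LRU policy, and the use of a plain dict as a stand-in for the page-table walk are all choices made for this example.

```python
from collections import OrderedDict

PAGE_BITS = 12  # assumed 4 KiB pages


class TLB:
    """Tiny fully associative TLB with LRU replacement (illustrative only)."""

    def __init__(self, capacity, page_table):
        self.capacity = capacity
        self.page_table = page_table   # dict standing in for the page-table walk
        self.entries = OrderedDict()   # virtual page # -> physical frame #
        self.hits = 0
        self.misses = 0

    def translate(self, vaddr):
        vpn = vaddr >> PAGE_BITS
        offset = vaddr & ((1 << PAGE_BITS) - 1)
        if vpn in self.entries:
            self.hits += 1
            self.entries.move_to_end(vpn)             # mark most recently used
        else:
            self.misses += 1
            self.entries[vpn] = self.page_table[vpn]  # "walk" the page table
            if len(self.entries) > self.capacity:
                self.entries.popitem(last=False)      # evict least recently used
        return (self.entries[vpn] << PAGE_BITS) | offset


if __name__ == "__main__":
    tlb = TLB(capacity=2, page_table={0: 7, 1: 3, 2: 9})
    for addr in (0x0000, 0x0004, 0x0008, 0x1000, 0x000C):
        tlb.translate(addr)
    print("hits:", tlb.hits, "misses:", tlb.misses)
```

Sequential accesses within one page hit after the first miss, which is the page locality the slide asks about. An unmapped page raises `KeyError` here; a real system would raise a page fault instead.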
What kind of Cache for TLB?

- Remember all those cache design parameters and trade-offs?
  - Amount of Data = N * L * K
  - Tag is portion of address that identifies line (w/o line offset)
  - Write Policy (write-thru, write-back), Eviction Policy (LRU, …)
How might organization of TLB differ from that of a conventional instruction or data cache?

• Let’s do some review …
A Summary on Sources of Cache Misses

• **Compulsory** (cold start or process migration, first reference): first access to a block
  – “Cold” fact of life: not a whole lot you can do about it
  – Note: If you are going to run “billions” of instructions, Compulsory Misses are insignificant

• **Capacity**:  
  – Cache cannot contain all blocks accessed by the program  
  – Solution: increase cache size

• **Conflict** (collision):
  – Multiple memory locations mapped to the same cache location  
  – Solution 1: increase cache size  
  – Solution 2: increase associativity

• **Coherence** (Invalidation): other process (e.g., I/O) updates memory
• **Block** is minimum quantum of caching
  – Data select field used to select data (byte) within block
  – Many caching applications don’t have data select field
• **Index** Used to Lookup Candidates in Cache
  – Index identifies the set
• **Tag** used to identify actual copy
  – If no candidates match, then declare cache miss
Review: Direct Mapped Cache

- Direct Mapped $2^N$ byte cache:
  - The uppermost $(32 - N)$ bits are always the Cache Tag
  - The lowest $M$ bits are the Byte Select (Block Size = $2^M$)

- Example: 1 KB Direct Mapped Cache with 32 B Blocks
  - Index chooses potential block
  - Tag checked to verify block
  - Byte select chooses byte within block

Example address fields: Cache Tag = $0x50$, Cache Index = $0x01$, Byte Select = $0x00$
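
The tag / index / byte-select split for the 1 KB, 32 B-block example can be written out directly. This is a minimal sketch: the function name `split_address` is hypothetical, and the test address is built on the assumption that the example values 0x50, 0x01, and 0x00 are the tag, index, and byte select respectively.

```python
CACHE_BITS = 10                        # 1 KB cache  -> N = 10
BLOCK_BITS = 5                         # 32 B blocks -> M = 5
INDEX_BITS = CACHE_BITS - BLOCK_BITS   # 32 blocks   -> 5 index bits


def split_address(addr):
    """Return (tag, index, byte_select) for a 32-bit address."""
    byte_select = addr & ((1 << BLOCK_BITS) - 1)
    index = (addr >> BLOCK_BITS) & ((1 << INDEX_BITS) - 1)
    tag = addr >> CACHE_BITS           # the uppermost 32 - N bits
    return tag, index, byte_select


if __name__ == "__main__":
    addr = (0x50 << 10) | (0x01 << 5) | 0x00
    tag, index, byte_select = split_address(addr)
    print(hex(addr), hex(tag), hex(index), hex(byte_select))
```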
Review: Set Associative Cache

• **N-way set associative**: N entries per Cache Index
  – N direct mapped caches operate in parallel
• Example: Two-way set associative cache
  – Cache Index selects a “set” from the cache
  – Two tags in the set are compared to input in parallel
  – Data is selected based on the tag result

![Diagram of a set associative cache]
Review: Fully Associative Cache

• **Fully Associative**: Every block can hold any line
  – Address does not include a cache index
  – Compare Cache Tags of all Cache Entries in Parallel

• Example: Block Size = 32 B
  – We need $N$ 27-bit comparators
  – Still have byte select to choose from within block

[Figure: the address is split into a 27-bit Cache Tag and a Byte Select field (Ex: 0x01). Each cache entry holds a Cache Tag, a Valid Bit, and 32 bytes of Cache Data (Byte 0 … Byte 31 in one entry, Byte 32 … Byte 63 in the next, and so on). The tag of every entry is compared against the address tag in parallel.]

3/12/20
Kubiatowicz CS162 ©UCB Spring 2020
Where does a Block Get Placed in a Cache?

- Example: Block 12 placed in 8 block cache

32-Block Address Space:

<table>
<thead>
<tr>
<th>Block no.</th>
<th>0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1</th>
</tr>
</thead>
</table>

**Direct mapped:**
block 12 can go only into block 4 (12 mod 8)

**Set associative:**
block 12 can go anywhere in set 0 (12 mod 4)

**Fully associative:**
block 12 can go anywhere
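
The three placement rules are the same computation with a different number of ways. The sketch below assumes a hypothetical helper `candidate_slots` and assumes the slots of one set are numbered contiguously, which is a convention chosen for this example.

```python
def candidate_slots(block_no, num_blocks, ways):
    """Cache slots where memory block `block_no` may be placed.

    ways=1 is direct mapped; ways=num_blocks is fully associative.
    """
    num_sets = num_blocks // ways
    set_index = block_no % num_sets
    first = set_index * ways           # assumed layout: sets are contiguous
    return list(range(first, first + ways))


if __name__ == "__main__":
    print("direct mapped:     ", candidate_slots(12, 8, ways=1))
    print("2-way set assoc.:  ", candidate_slots(12, 8, ways=2))
    print("fully associative: ", candidate_slots(12, 8, ways=8))
```

For block 12 in an 8-block cache this gives slot 4 for direct mapped (12 mod 8), the two slots of set 0 for 2-way set associative (12 mod 4), and all eight slots for fully associative.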
Which block should be replaced on a miss?

- Easy for Direct Mapped: Only one possibility
- Set Associative or Fully Associative:
  - Random
  - LRU (Least Recently Used)

Miss rates for a workload:

<table>
<thead>
<tr>
<th>Size</th>
<th>2-way LRU</th>
<th>2-way Random</th>
<th>4-way LRU</th>
<th>4-way Random</th>
<th>8-way LRU</th>
<th>8-way Random</th>
</tr>
</thead>
<tbody>
<tr>
<td>16 KB</td>
<td>5.2%</td>
<td>5.7%</td>
<td>4.7%</td>
<td>5.3%</td>
<td>4.4%</td>
<td>5.0%</td>
</tr>
<tr>
<td>64 KB</td>
<td>1.9%</td>
<td>2.0%</td>
<td>1.5%</td>
<td>1.7%</td>
<td>1.4%</td>
<td>1.5%</td>
</tr>
<tr>
<td>256 KB</td>
<td>1.15%</td>
<td>1.17%</td>
<td>1.13%</td>
<td>1.13%</td>
<td>1.12%</td>
<td>1.12%</td>
</tr>
</tbody>
</table>
Review: What happens on a write?

- **Write through**: The information is written to both the block in the cache and to the block in the lower-level memory
- **Write back**: The information is written only to the block in the cache
  - Modified cache block is written to main memory only when it is replaced
  - Question is block clean or dirty?

**Pros and Cons of each?**

- **WT:**
  - PRO: read misses cannot result in writes
  - CON: processor held up on writes unless writes are buffered
- **WB:**
  - PRO: repeated writes not sent to DRAM; processor not held up on writes
  - CON: more complex; a read miss may require writeback of dirty data
Impact of caches on Operating Systems

• Dealing with cache effects
  – Maintaining the correctness of various caches
  – E.g., TLB consistency:
    » With PT across context switches?
    » Across updates to the PT?

• Process scheduling
  – Which and how many processes are active? Priorities?
  – Large memory footprints versus small ones?
  – Shared pages mapped into VAS of multiple processes?

• Impact of thread scheduling on cache performance
  – Rapid interleaving of threads (small quantum) may degrade cache performance
    » Increase average memory access time (AMAT)!!!

• Designing operating system data structures for cache performance
What TLB Organization Makes Sense?

- Needs to be really fast
  - Critical path of memory access
    » In simplest view: before the cache
    » Thus, this adds to access time (reducing cache speed)
  - Seems to argue for Direct Mapped or Low Associativity

- However, needs to have very few conflicts!
  - With TLB, the Miss Time extremely high! (PT traversal)
  - Cost of Conflict (Miss Time) is high
  - Hit Time – dictated by clock cycle

- Thrashing: continuous conflicts between accesses
  - What if use low order bits of page as index into TLB?
    » First page of code, data, stack may map to same entry
    » Need 3-way associativity at least?
  - What if use high order bits as index?
    » TLB mostly unused for small programs
TLB organization: include protection

- How big does TLB actually have to be?
  - Usually small: 128-512 entries (larger now)
  - Not very big, can support higher associativity
- Small TLBs usually organized as fully-associative cache
  - Lookup is by Virtual Address
  - Returns Physical Address + other info
- What happens when fully-associative is too slow?
  - Put a small (4-16 entry) direct-mapped cache in front
  - Called a “TLB Slice”
- Example for MIPS R3000:

<table>
<thead>
<tr>
<th>Virtual Address</th>
<th>Physical Address</th>
<th>Dirty</th>
<th>Ref</th>
<th>Valid</th>
<th>Access</th>
<th>ASID</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xFA00</td>
<td>0x0003</td>
<td>Y</td>
<td>N</td>
<td>Y</td>
<td>R/W</td>
<td>34</td>
</tr>
<tr>
<td>0x0040</td>
<td>0x0010</td>
<td>N</td>
<td>Y</td>
<td>Y</td>
<td>R</td>
<td>0</td>
</tr>
<tr>
<td>0x0041</td>
<td>0x0011</td>
<td>N</td>
<td>Y</td>
<td>Y</td>
<td>R</td>
<td>0</td>
</tr>
</tbody>
</table>
Reducing translation time further

- As described, TLB lookup is in serial with cache lookup:

Machines with TLBs go one step further: they overlap TLB lookup with cache access.

- Works because offset available early
Overlapping TLB & Cache Access (1/2)

• Main idea:
  – Offset in virtual address exactly covers the “cache index” and “byte select”
  – Thus the cached byte(s) can be selected in parallel with performing the address translation

virtual address

physical address

<table>
<thead>
<tr>
<th>Virtual Page #</th>
<th>Offset</th>
</tr>
</thead>
<tbody>
<tr>
<td>tag / page #</td>
<td>index</td>
</tr>
</tbody>
</table>
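
Whether the overlap is possible reduces to a size check: the bytes covered by the cache index plus byte select (cache size divided by associativity) must not exceed the page size. The helper name `index_fits_in_offset` below is hypothetical; this is a sketch of the condition, not a description of any particular processor.

```python
def index_fits_in_offset(cache_bytes, ways, page_bytes):
    """True if cache index + byte select come entirely from the page offset."""
    bytes_per_way = cache_bytes // ways   # span addressed by index + byte select
    return bytes_per_way <= page_bytes


if __name__ == "__main__":
    KIB = 1024
    print(index_fits_in_offset(4 * KIB, 1, 4 * KIB))    # 4 KiB direct mapped
    print(index_fits_in_offset(8 * KIB, 1, 4 * KIB))    # 8 KiB direct mapped
    print(index_fits_in_offset(32 * KIB, 8, 4 * KIB))   # 32 KiB, 8-way
```

When the check fails, some index bits would come from the virtual page number, so the cache lookup can no longer start before translation finishes unless associativity is increased.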
Overlapping TLB & Cache Access

- Here is how this might work with a 4K, direct-mapped cache:

- Four different TLBs
  - Instruction TLB for 4K pages
    - 128 entries, 4-way set associative
  - Instruction TLB for large pages
    - 2 entries, fully associative
  - Data TLB for 4K pages
    - 128 entries, 4-way set associative
  - Data TLB for large pages
    - 8 entries, 4-way set associative

- All TLBs use LRU replacement policy
- Why different TLBs for instruction, data, and page sizes?
Intel Nehalem (2008)

- L1 DTLB
  - 64 entries for 4 K pages and
  - 32 entries for 2/4 M pages,

- L1 ITLB
  - 128 entries for 4 K pages using 4-way associativity and
  - 14 fully associative entries for 2/4 MiB pages

- unified 512-entry L2 TLB for 4 KiB pages, 4-way associative.
Current Intel x86 (Skylake, Cascade Lake)

The diagram illustrates the architecture of the Current Intel x86 (Skylake, Cascade Lake) processor. It includes the Front End, Execution Engine, Memory Subsystem, and the Instruction Fetch & PreDecode, 5-Way Decode, Micro-Decoder, and Instruction Queue components. The diagram also highlights the Instruction TLB, Stack Engine (SE), and various buffers and queues such as the Decoded Stream Buffer (DSB), Instruction Queue, and the Store Buffer & Forwarding. The processor includes an L1 Instruction Cache, L1 Data Cache, and L2 Cache, along with the EU (Execution Units) for integer and vector operations. The diagram provides a detailed view of the processor's internal architecture and how instructions are processed.
Current Example: Memory Hierarchy

- Caches (all 64 B line size)
  - L1 I-Cache: 32 KiB/core, 8-way set assoc.
  - L1 D Cache: 32 KiB/core, 8-way set assoc., 4-5 cycles load-to-use, Write-back policy
  - L2 Cache: 1 MiB/core, 16-way set assoc., Inclusive, Write-back policy, 14 cycles latency
  - L3 Cache: 1.375 MiB/core, 11-way set assoc., shared across cores, Non-inclusive victim cache, Write-back policy, 50-70 cycles latency
- TLB
  - L1 ITLB: 128 entries, 8-way set assoc., for 4 KiB pages
    - 8 entries per thread, fully associative, for 2 MiB / 4 MiB pages
  - L1 DTLB: 64 entries, 4-way set assoc., for 4 KiB pages
    - 32 entries, 4-way set assoc., for 2 MiB / 4 MiB page translations
    - 4 entries, 4-way set assoc., for 1 GiB page translations
  - L2 STLB: 1536 entries, 12-way set assoc., for 4 KiB + 2 MiB pages
    - 16 entries, 4-way set assoc., for 1 GiB page translations
What happens to TLB on Context Switch?

• Need to do something, since TLBs map virtual addresses to physical addresses
  – Address Space just changed, so TLB entries no longer valid!

• Options?
  – Invalidate TLB: simple but might be expensive
    » What if switching frequently between processes?
  – Include ProcessID in TLB
    » This is an architectural solution: needs hardware

• What if translation tables change?
  – For example, to move page from memory to disk or vice versa…
  – Must invalidate TLB entry!
    » Otherwise, might think that page is still in memory!
  – Called “TLB Consistency”
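
The two context-switch options and the consistency requirement can be contrasted in a small sketch. The class `TaggedTLB` and its method names are hypothetical; a dict keyed by (ASID, virtual page #) stands in for the hardware tag match.

```python
class TaggedTLB:
    """TLB whose entries are keyed by (ASID, virtual page #)."""

    def __init__(self):
        self.entries = {}   # (asid, vpn) -> physical frame #

    def insert(self, asid, vpn, frame):
        self.entries[(asid, vpn)] = frame

    def lookup(self, asid, vpn):
        """Return the frame #, or None on a TLB miss."""
        return self.entries.get((asid, vpn))

    def invalidate_page(self, asid, vpn):
        """TLB consistency: drop one entry after its PTE changes."""
        self.entries.pop((asid, vpn), None)

    def flush(self):
        """What a TLB without process IDs must do on every context switch."""
        self.entries.clear()


if __name__ == "__main__":
    tlb = TaggedTLB()
    tlb.insert(asid=1, vpn=0x40, frame=0x10)
    tlb.insert(asid=2, vpn=0x40, frame=0x99)   # same virtual page, other process
    print(tlb.lookup(1, 0x40), tlb.lookup(2, 0x40))
    tlb.invalidate_page(1, 0x40)               # e.g. page moved to disk
    print(tlb.lookup(1, 0x40))
```

With the process ID in the key, entries from different address spaces coexist, so a context switch only changes the current ASID instead of discarding every translation.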
Putting Everything Together: Address Translation

[Figure, built up over several slides: the Virtual Address is split into Virtual P1 index, Virtual P2 index, and Offset. PageTablePtr locates the Page Table (1st level); the P1 index selects the entry that points to a Page Table (2nd level); the P2 index selects the entry that holds the Physical Page #. The Physical Address is the Physical Page # followed by the unchanged Offset, and it selects a location in Physical Memory.]
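
The walk through the two levels can be sketched in Python for the 10-10-12 split. This is an illustration under assumptions: nested dicts stand in for the in-memory tables, and the names `translate`, `PageFault`, and `PRESENT` are made up for this example.

```python
PRESENT = 0x1   # valid bit, assumed to be bit 0 of each entry


class PageFault(Exception):
    """Raised when an entry at either level is not present."""


def translate(vaddr, page_directory):
    """Walk a 10-10-12 two-level page table held in nested dicts.

    page_directory:     P1 index -> (flags, second-level table)
    second-level table: P2 index -> (flags, physical page #)
    """
    p1 = (vaddr >> 22) & 0x3FF
    p2 = (vaddr >> 12) & 0x3FF
    offset = vaddr & 0xFFF

    flags, table = page_directory.get(p1, (0, None))
    if not flags & PRESENT:
        raise PageFault(f"no 2nd-level table for P1 index {p1}")

    flags, frame = table.get(p2, (0, None))
    if not flags & PRESENT:
        raise PageFault(f"page not present: P1={p1}, P2={p2}")

    return (frame << 12) | offset   # offset passes through unchanged


if __name__ == "__main__":
    directory = {1: (PRESENT, {2: (PRESENT, 0xABC)})}
    print(hex(translate(0x402345, directory)))
```

Only the second-level tables that are actually used need to exist, which is the point of the tree structure for sparse address spaces.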
Putting Everything Together: TLB

[Figure, built up over several slides: the two-level translation picture (Virtual P1 index, Virtual P2 index, Offset; PageTablePtr; Page Table 1st and 2nd level; Physical Page # and Offset; Physical Memory) with a TLB added. The virtual page number is looked up in the TLB; on a hit the Physical Page # comes straight from the TLB, and the Offset is copied unchanged from the Virtual Address to the Physical Address.]
Putting Everything Together: Cache

[Figure, built up over several slides: the translation picture with TLB (Virtual P1 index, Virtual P2 index, Offset; PageTablePtr; Page Table 1st and 2nd level; TLB; Physical Page # and Offset) with a cache added. The Physical Address is reinterpreted as tag, index, and byte; the index selects a cache entry, the stored tag is compared with the address tag, and the byte field selects data within the block. On a cache miss the access goes to Physical Memory.]
Recall: Two Critical Issues in Address Translation

- How to translate addresses fast enough?
  - Every instruction fetch
  - Plus every load / store
  - EVERY MEMORY REFERENCE!
  - More than one translation for EVERY instruction

- Next: What to do if the translation fails?
  - Page fault! This is a synchronous exception!
Recall: User→Kernel: (Exceptions: Traps & Interrupts)

• A system call instruction causes a synchronous exception (or “trap”)
  – In fact, often called a software “trap” instruction
• Other sources of Synchronous Exceptions (“Trap”):
  – Divide by zero, Illegal instruction, Bus error (bad address, e.g. unaligned access)
  – Segmentation Fault (address out of range)
  – Page Fault (for illusion of infinite-sized memory)
• Interrupts are Asynchronous Exceptions:
  – Examples: timer, disk ready, network, etc….
  – Interrupts can be disabled, traps cannot!
• On system call, exception, or interrupt:
  – Hardware enters kernel mode with interrupts disabled
  – Saves PC, then jumps to appropriate handler in kernel
  – Some processors (e.g. x86) also save registers and change stacks
• Handler does any required state preservation not done by CPU:
  – Might save registers and other CPU state, and switch to kernel stack
Precise Exceptions

• Precise $\Rightarrow$ state of the machine is preserved as if program executed up to the offending instruction
  – All previous instructions completed
  – Offending instruction and all following instructions act as if they have not even started
  – Same system code will work on different implementations
  – Difficult in the presence of pipelining, out-of-order execution, ...

• Imprecise $\Rightarrow$ system software has to figure out what is where and put it all back together

• Performance goals may lead designers to forsake precise interrupts
  – system software developers, user, markets etc. usually wish they had not done this

• Modern techniques for out-of-order execution and branch prediction help implement precise interrupts
Page Fault is a Synchronous Exception

• The Virtual-to-Physical Translation fails
  – PTE marked invalid (at whatever level of page table), Privilege-Level Violation, Access violation

• Causes a Fault / Trap
  – Not an interrupt because synchronous to instruction execution!
  – May occur on instruction fetch or data access
  – Protection violations typically terminate the instruction in a way that is restartable (more later)

• Page Faults engage operating system to fix the situation and retry the instruction
  – Allocate an additional stack page, or
  – Make the page accessible - Copy on Write,
  – Bring page in from secondary storage – demand paging

• Protection violations that cannot be resolved ⇒ terminate process (possibly “dumping core” image for debugging)

• Fundamental inversion of the hardware / software boundary
Next Up: What happens when …

[Figure, built up over several slides: a Process issues an instruction with a virtual address. The MMU looks the address up in the PT and, if the translation is valid, produces the physical address (frame# plus offset). If the lookup fails, the MMU raises a page fault exception; the Operating System's Page Fault Handler loads the page from disk and updates the PT entry, after which the translation succeeds.]
Inversion of the Hardware / Software Boundary

• In order for an instruction to complete …
• It requires the intervention of operating system software
• Receive the page fault, remedy the situation
  – Load the page, create the page, copy-on-write
  – Update the PTE so the translation will succeed
• Restart (or resume) the instruction
  – This is one of the huge simplifications in RISC instruction sets
  – Can be very complex when instructions modify state (x86)
Demand Paging as Caching, ...

- What “block size”? - 1 page (e.g., 4 KB)
- What “organization” i.e. direct-mapped, set-associ., fully-associative?
  - Any page in any frame of memory, i.e., fully associative: arbitrary virtual → physical mapping
- How do we locate a page?
  - First check TLB, then page-table traversal
- What is page replacement policy? (i.e. LRU, Random...) 
  - This requires more explanation... (kinda LRU)
- What happens on a miss?
  - Go to lower level to fill miss (i.e. disk)
- What happens on a write? (write-through, write back)
  - Definitely write-back – need dirty bit!
Demand Paging

- Modern programs require a lot of physical memory
  - Memory per system growing faster than 25%-30%/year
- But they don’t use all their memory all of the time
  - 90-10 rule: programs spend 90% of their time in 10% of their code
  - Wasteful to require all of user's code to be in memory
- Solution: use main memory as “cache” for disk
Disk is larger than physical memory ⇒
- In-use virtual memory can be bigger than physical memory
- Combined memory of running processes much larger than physical memory
  » More programs fit into memory, allowing more concurrency

Principle: Transparent Level of Indirection (page table)
- Supports flexible placement of physical data
  » Data could be on disk or somewhere across network
- Variable location of data transparent to user program
  » Performance issue, not correctness issue
Review: What is in a PTE?

- What is in a Page Table Entry (or PTE)?
  - Pointer to next-level page table or to actual page
  - Permission bits: valid, read-only, read-write, write-only
- Example: Intel x86 architecture PTE:
  - 2-level page table (10, 10, 12-bit offset)
  - Intermediate page tables called “Directories”

<table>
<thead>
<tr>
<th>Page Frame Number (Physical Page Number)</th>
<th>Free (OS)</th>
<th>P</th>
<th>S</th>
<th>D</th>
<th>A</th>
<th>PCD</th>
<th>PWT</th>
<th>U</th>
<th>W</th>
<th>P</th>
</tr>
</thead>
<tbody>
<tr>
<td>31-12</td>
<td>11-9</td>
<td>8</td>
<td>7</td>
<td>6</td>
<td>5</td>
<td>4</td>
<td>3</td>
<td>2</td>
<td>1</td>
<td>0</td>
</tr>
</tbody>
</table>

P: Present (same as “valid” bit in other architectures)
W: Writeable
U: User accessible
PWT: Page write transparent: external cache write-through
PCD: Page cache disabled (page cannot be cached)
A: Accessed: page has been accessed recently
D: Dirty (PTE only): page has been modified recently
PS: Page Size: PS=1 ⇒ 4MB page (directory only).
Bottom 22 bits of virtual address serve as offset
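
The field layout can be exercised with a small decoder. This is a sketch assuming the standard 32-bit x86 positions (P at bit 0 up through D at bit 6, PS at bit 7, frame number in bits 31-12); the name `decode_pte` and the output dictionary shape are choices made for this example.

```python
FLAGS = {
    "P": 0,     # present
    "W": 1,     # writeable
    "U": 2,     # user accessible
    "PWT": 3,   # write-through
    "PCD": 4,   # cache disabled
    "A": 5,     # accessed
    "D": 6,     # dirty
    "PS": 7,    # page size (directory entries only)
}


def decode_pte(pte):
    """Split a 32-bit x86 PTE into its frame number and flag bits."""
    fields = {name: bool((pte >> bit) & 1) for name, bit in FLAGS.items()}
    fields["frame"] = pte >> 12            # bits 31-12
    fields["os_bits"] = (pte >> 9) & 0x7   # bits 11-9, free for OS use
    return fields


if __name__ == "__main__":
    entry = (0x12345 << 12) | 0x67   # present, writeable, user, accessed, dirty
    print(decode_pte(entry))
```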
Origins of Paging

Keep most of the address space on disk.

Disks provide most of the storage.

Actively swap pages to/from.

Relatively small memory, for many processes.

Keep memory full of the frequently accessed pages.

Many clients on dumb terminals running different programs.
Recall: The Memory Hierarchy

- **Registers:** speed 0.3 ns, size 100Bs
- **L1 Cache:** speed 1 ns, size 10kBs
- **L2 Cache:** speed 3 ns, size 100kBs
- **L3 Cache (shared):** speed 10-30 ns, size MBs
- **Main Memory (DRAM):** speed 100 ns, size GBs
- **Secondary Storage (SSD):** speed 100,000 ns (0.1 ms), size 100GBs
- **Secondary Storage (Disk):** speed 10,000,000 ns (10 ms), size TBs
Very Different Situation Today

Powerful system
Huge memory
Huge disk
Single user
A Picture on one machine

- Memory stays about 75% used, 25% for dynamics
- A lot of it is shared 1.9 GB
Many Uses of "Demand Paging" …

- Extend the stack
  - Allocate a page and zero it

- Extend the heap (sbrk of old, today mmap)

- Process Fork
  - Create a copy of the page table
  - Entries refer to parent pages – NO-WRITE
  - Shared read-only pages remain shared
  - Copy page on write

- Exec
  - Only bring in parts of the binary in active use
  - Do this on demand

- MMAP to explicitly share region (or to access a file as RAM)
Classic: Loading an executable into memory

- `.exe`
  - lives on disk in the file system
  - contains contents of code & data segments, relocation entries and symbols
  - OS loads it into memory, initializes registers (and initial stack pointer)
  - program sets up stack and heap upon initialization:
    - `crt0` (C runtime init)
Utilized pages in the VAS are backed by a page block on disk
- Called the backing store or swap file
- Typically in an optimized block store, but can think of it like a file
Create Virtual Address Space of the Process

- User Page table maps entire VAS
- All the utilized regions are backed on disk
  - swapped into and out of memory as needed
- For every process
Create Virtual Address Space of the Process

- User Page table maps entire VAS
  - Resident pages to the frame in memory they occupy
  - The portion of it that the HW needs to access must be resident in memory
Provide Backing Store for VAS

- User Page table maps entire VAS
- Resident pages mapped to memory frames
- For all other pages, OS must record where to find them on disk
What Data Structure Maps Non-Resident Pages to Disk?

- **FindBlock**(PID, page#) → disk_block
  - Some OSs utilize spare space in PTE for paged blocks
  - Like the PT, but purely software

- Where to store it?
  - In memory – can be compact representation if swap storage is contiguous on disk
  - Could use hash table (like Inverted PT)

- Usually want backing store for resident pages too

- May map code segment directly to on-disk image
  - Avoids keeping a second copy of the code in the swap file

- May share code segment with multiple instances of the program
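
A hash-table version of FindBlock might look like the sketch below. Everything beyond the `FindBlock(PID, page#) → disk_block` signature is an assumption for illustration: the class name `SwapMap`, the `assign` method, and the in-order block allocator are not taken from any particular OS.

```python
class SwapMap:
    """Software map from (PID, page #) to a block in the backing store."""

    def __init__(self):
        self.blocks = {}     # (pid, page_no) -> disk block number
        self.next_free = 0   # naive allocator: blocks handed out in order

    def assign(self, pid, page_no):
        """Reserve a disk block for a page the first time it needs one."""
        key = (pid, page_no)
        if key not in self.blocks:
            self.blocks[key] = self.next_free
            self.next_free += 1
        return self.blocks[key]

    def find_block(self, pid, page_no):
        """FindBlock(PID, page#) -> disk_block, or None if never assigned."""
        return self.blocks.get((pid, page_no))


if __name__ == "__main__":
    swap = SwapMap()
    swap.assign(pid=7, page_no=0x40)
    swap.assign(pid=8, page_no=0x40)   # same page # in another process
    print(swap.find_block(7, 0x40), swap.find_block(8, 0x40))
```

Like the page table, the map is indexed by virtual page, but it is consulted only by the OS on a page fault, so it can use whatever representation is most compact.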
Provide Backing Store for VAS

[Figure: a disk (huge, TB) holds the stack, heap, data, and code regions of VAS 1 and VAS 2. Memory holds user page frames, the user page tables (PT 1, PT 2), and kernel code & data.]
On page Fault …

[Figure sequence over the same picture: disk (huge, TB) holding stack, heap, data, and code for VAS 1 and VAS 2; memory holding user page frames, user page table, and kernel code & data; the active process and its PT. The slide titles give the steps in order:]

- On page Fault … find & start load
- On page Fault … schedule other P or T
- On page Fault … update PTE
- Eventually reschedule faulting thread
Summary: Steps in Handling a Page Fault

1. Reference (`load M`) to a page whose PTE is invalid
2. Trap to the operating system
3. Page is on backing store
4. Bring in missing page (into a free frame)
5. Reset page table
6. Restart instruction
Demand Paging Mechanisms

• PTE makes demand paging implementable
  – Valid ⇒ Page in memory, PTE points at physical page
  – Not Valid ⇒ Page not in memory; use info in PTE to find it on disk when necessary

• Suppose user references page with invalid PTE?
  – Memory Management Unit (MMU) traps to OS
    » Resulting trap is a “Page Fault”
  – What does OS do on a Page Fault?:
    » Choose an old page to replace
    » If old page modified (“D=1”), write contents back to disk
    » Change its PTE and any cached TLB to be invalid
    » Load new page into memory from disk
    » Update page table entry, invalidate TLB for new entry
    » Continue thread from original faulting location
  – TLB for new page will be loaded when thread continued!
  – While pulling pages off disk for one process, OS runs another process from ready queue
    » Suspended process sits on wait queue
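
The OS steps listed above can be traced in a toy simulation. This is a sketch only: the class `Pager`, the FIFO victim choice, and the dict-based PTE and TLB are assumptions made for this example, and disk I/O is reduced to a counter.

```python
from collections import deque


class Pager:
    """Toy demand pager for one process; FIFO victim choice is an assumption."""

    def __init__(self, num_frames):
        self.free_frames = list(range(num_frames))
        self.resident = deque()   # resident virtual pages, oldest first
        self.pte = {}             # page -> {"frame", "valid", "dirty"}
        self.tlb = {}             # page -> frame
        self.faults = 0
        self.writebacks = 0       # dirty pages written back to disk

    def access(self, page, write=False):
        entry = self.pte.get(page)
        if entry is None or not entry["valid"]:
            self.page_fault(page)
            entry = self.pte[page]
        self.tlb[page] = entry["frame"]   # TLB filled when the access is retried
        if write:
            entry["dirty"] = True
        return entry["frame"]

    def page_fault(self, page):
        self.faults += 1
        if self.free_frames:
            frame = self.free_frames.pop()
        else:
            victim = self.resident.popleft()   # choose an old page to replace
            old = self.pte[victim]
            if old["dirty"]:
                self.writebacks += 1           # write contents back to disk
            old["valid"] = False               # invalidate its PTE ...
            old["dirty"] = False
            self.tlb.pop(victim, None)         # ... and any cached TLB entry
            frame = old["frame"]
        # "Load" the new page from disk into the frame, then update its PTE.
        self.pte[page] = {"frame": frame, "valid": True, "dirty": False}
        self.resident.append(page)


if __name__ == "__main__":
    pager = Pager(num_frames=2)
    pager.access(0, write=True)
    pager.access(1)
    pager.access(2)               # evicts dirty page 0
    print("faults:", pager.faults, "writebacks:", pager.writebacks)
```

A real kernel would block the faulting thread during the disk read and run another process, which this single-threaded sketch leaves out.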
Some questions we need to answer!

• During a page fault, where does the OS get a free frame?
  – Keeps a free list
  – Unix runs a “reaper” if memory gets too full
    » Schedule dirty pages to be written back on disk
    » Zero (clean) pages which haven’t been accessed in a while
  – As a last resort, evict a dirty page first

• How can we organize these mechanisms?
  – Work on the replacement policy

• How many page frames/process?
  – Like thread scheduling, need to “schedule” memory resources:
    » Utilization? fairness? priority?
  – Allocation of disk paging bandwidth
Cache Behavior under WS model

- Amortized by fraction of time the Working Set is active
- Transitions from one WS to the next
- Capacity, Conflict, Compulsory misses
- Applicable to memory caches and pages. Others?
Another model of Locality: Zipf

- Likelihood of accessing item of rank $r$ is $\propto \frac{1}{r^a}$
- Although rare to access items below the top few, there are so many that it yields a “heavy tailed” distribution
- Substantial value from even a tiny cache
- Substantial misses from even a very large cache
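The last two bullets can be checked numerically. The sketch below assumes an idealized cache that always holds the most popular items, a catalog of one million items, and exponent $a = 1$; the function name `zipf_hit_rate` and those parameter values are illustrative choices, not taken from the slides.

```python
def zipf_hit_rate(num_items, cache_size, a=1.0):
    """Hit rate of an ideal cache holding the `cache_size` most popular items
    when the item of rank r is accessed with probability proportional to 1/r**a."""
    weights = [1.0 / (r ** a) for r in range(1, num_items + 1)]
    return sum(weights[:cache_size]) / sum(weights)


if __name__ == "__main__":
    n = 1_000_000
    # Cache holding 0.1% of the items versus cache holding 50% of the items.
    print("tiny cache hit rate: ", zipf_hit_rate(n, n // 1000))
    print("large cache hit rate:", zipf_hit_rate(n, n // 2))
```

Under these assumptions the tiny cache already captures roughly half of all accesses, while the cache holding half of the items still misses a few percent of the time, which is the heavy tail at work.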
Demand Paging Cost Model

• Since Demand Paging is like caching, we can compute the average access time! ("Effective Access Time")
  – EAT = Hit Rate $\times$ Hit Time + Miss Rate $\times$ Miss Time
  – EAT = Hit Time + Miss Rate $\times$ Miss Penalty

• Example:
  – Memory access time = 200 nanoseconds
  – Average page-fault service time = 8 milliseconds
  – Suppose $p =$ Probability of miss, $1-p =$ Probability of hit
  – Then, we can compute EAT as follows:
    
    \[
    \begin{aligned}
    \text{EAT} &= 200\,\text{ns} + p \times 8\,\text{ms} \\
               &= 200\,\text{ns} + p \times 8{,}000{,}000\,\text{ns}
    \end{aligned}
    \]

• If one access out of 1,000 causes a page fault, then
  EAT = 8.2 $\mu$s:
  – This is a slowdown by a factor of about 40!

• What if we want slowdown by less than 10%?
  – $\text{EAT} < 200\text{ns} \times 1.1 \iff 200\text{ns} + p \times 8{,}000{,}000\text{ns} < 220\text{ns} \iff p < 2.5 \times 10^{-6}$
  – This is about 1 page fault in 400,000!
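The arithmetic in this example can be reproduced with a few lines of code. This is only a sketch using the slide's numbers (200 ns memory access, 8 ms fault service time); the function names are made up for illustration.

```python
MEM_ACCESS_NS = 200            # memory access time from the example
FAULT_SERVICE_NS = 8_000_000   # 8 ms average page-fault service time


def effective_access_time_ns(p):
    """EAT = hit time + miss rate * miss penalty, in nanoseconds."""
    return MEM_ACCESS_NS + p * FAULT_SERVICE_NS


def max_fault_rate(slowdown):
    """Largest p that keeps EAT below `slowdown` times the memory access time."""
    return (slowdown - 1.0) * MEM_ACCESS_NS / FAULT_SERVICE_NS


if __name__ == "__main__":
    eat = effective_access_time_ns(1 / 1000)
    print(f"EAT at p = 1/1000: {eat / 1000:.1f} us, "
          f"slowdown {eat / MEM_ACCESS_NS:.0f}x")
    p = max_fault_rate(1.1)
    print(f"p for < 10% slowdown: {p:.2e} (about 1 fault in {1 / p:,.0f})")
```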
What Factors Lead to Misses in Page Cache?

• **Compulsory Misses:**
  – Pages that have never been paged into memory before
  – How might we remove these misses?
    » Prefetching: loading them into memory before needed
    » Need to predict future somehow! More later

• **Capacity Misses:**
  – Not enough memory. Must somehow increase available memory size.
  – Can we do this?
    » One option: Increase amount of DRAM (not quick fix!)
    » Another option: If multiple processes in memory: adjust percentage of memory allocated to each one!

• **Conflict Misses:**
  – Technically, conflict misses don’t exist in virtual memory, since it is a “fully-associative” cache

• **Policy Misses:**
  – Caused when pages were in memory, but kicked out prematurely because of the replacement policy
  – How to fix? Better replacement policy
Page Replacement Policies

• Why do we care about Replacement Policy?
  – Replacement is an issue with any cache
  – Particularly important with pages
    » The cost of being wrong is high: must go to disk
    » Must keep important pages in memory, not toss them out

• FIFO (First In, First Out)
  – Throw out oldest page. Be fair – let every page live in memory for same amount of time.
  – Bad – throws out heavily used pages instead of infrequently used

• RANDOM:
  – Pick random page for every replacement
  – Typical solution for TLBs. Simple hardware
  – Pretty unpredictable – makes it hard to make real-time guarantees

• MIN (Minimum):
  – Replace page that won’t be used for the longest time
  – Great (provably optimal), but can’t really know future…
  – But past is a good predictor of the future …
Replacement Policies (Con’t)

• **LRU (Least Recently Used):**
  – Replace page that hasn’t been used for the longest time
  – Programs have locality, so if something not used for a while, unlikely to be used in the near future.
  – Seems like LRU should be a good approximation to MIN.

• How to implement LRU? Use a list!
  – On each use, remove page from list and place at head
  – LRU page is at tail

• Problems with this scheme for paging?
  – Need to know immediately when each page is used, so that we can change its position in the list...
  – Many instructions for each hardware access

• In practice, people **approximate** LRU (more later)
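The list-based scheme described above can be sketched with an ordered dictionary standing in for the linked list. The class name `LRUList` and its `touch` method are hypothetical; the point is that every single use has to update the structure, which is exactly why this exact form is too expensive to run on each memory reference.

```python
from collections import OrderedDict


class LRUList:
    """Exact LRU over a fixed number of frames (illustrative sketch only)."""

    def __init__(self, capacity):
        self.capacity = capacity
        # First entry is the least recently used page (tail of the list),
        # last entry is the most recently used page (head of the list).
        self.pages = OrderedDict()

    def touch(self, page):
        """Record one use of `page`; return the evicted page, or None."""
        if page in self.pages:
            self.pages.move_to_end(page)   # remove from list, place at head
            return None
        victim = None
        if len(self.pages) >= self.capacity:
            victim, _ = self.pages.popitem(last=False)  # LRU page is at tail
        self.pages[page] = True
        return victim


if __name__ == "__main__":
    lru = LRUList(3)
    for ref in "ABCA":
        lru.touch(ref)
    print("evicted on D:", lru.touch("D"))
```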
Example: FIFO (strawman)

- Suppose we have 3 page frames, 4 virtual pages, and following reference stream:
  - A B C A B D A D B C B
- Consider FIFO Page replacement:

\[
\begin{array}{cccccccccccc}
\text{Ref:} & A & B & C & A & B & D & A & D & B & C & B \\
\text{Page:} & & & & & & & & & & & \\
1 & A & & & & & D & & & & C & \\
2 & & B & & & & & A & & & & \\
3 & & & C & & & & & & B & & \\
\end{array}
\]

- FIFO: 7 faults
- When referencing D, replacing A is a bad choice, since we need A again right away
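The fault count can be checked with a short simulation. This is a minimal sketch of FIFO replacement; the function name `fifo_faults` is made up for illustration.

```python
from collections import deque


def fifo_faults(refs, num_frames):
    """Count page faults for FIFO replacement over the reference stream."""
    queue = deque()    # resident pages in arrival order, oldest on the left
    resident = set()
    faults = 0
    for page in refs:
        if page in resident:
            continue   # hit: FIFO order does not change on a use
        faults += 1
        if len(queue) == num_frames:
            resident.discard(queue.popleft())  # throw out the oldest page
        queue.append(page)
        resident.add(page)
    return faults


if __name__ == "__main__":
    print(fifo_faults("ABCABDADBCB", 3))
```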
Example: MIN / LRU

- Suppose we have the same reference stream:
  - A B C A B D A D B C B
- Consider MIN Page replacement:

<table>
<thead>
<tr>
<th>Ref:</th>
<th>A</th>
<th>B</th>
<th>C</th>
<th>A</th>
<th>B</th>
<th>D</th>
<th>A</th>
<th>D</th>
<th>B</th>
<th>C</th>
<th>B</th>
</tr>
</thead>
<tbody>
<tr>
<td>Page:</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>1</td>
<td>A</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>C</td>
<td></td>
</tr>
<tr>
<td>2</td>
<td></td>
<td>B</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>3</td>
<td></td>
<td></td>
<td>C</td>
<td></td>
<td></td>
<td>D</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

- MIN: 5 faults
  - Where will D be brought in? Look for the page whose next reference is farthest in the future
- What will LRU do?
  - Same decisions as MIN here, but won't always be true!
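Both counts for this reference stream can be reproduced with a small simulation. This is a sketch under the usual textbook definitions of MIN and LRU; the function names are illustrative, and the MIN version scans the future of the reference stream, which is only possible offline.

```python
def min_faults(refs, num_frames):
    """Count faults for MIN: evict the page whose next use is farthest away."""
    frames = []
    faults = 0
    for i, page in enumerate(refs):
        if page in frames:
            continue
        faults += 1
        if len(frames) < num_frames:
            frames.append(page)
            continue

        def next_use(p):
            # Index of the next reference to p, or infinity if never used again.
            for j in range(i + 1, len(refs)):
                if refs[j] == p:
                    return j
            return float("inf")

        victim = max(frames, key=next_use)
        frames[frames.index(victim)] = page
    return faults


def lru_faults(refs, num_frames):
    """Count faults for LRU: evict the page unused for the longest time."""
    frames = []  # least recently used first, most recently used last
    faults = 0
    for page in refs:
        if page in frames:
            frames.remove(page)
        else:
            faults += 1
            if len(frames) == num_frames:
                frames.pop(0)
        frames.append(page)
    return faults


if __name__ == "__main__":
    refs = "ABCABDADBCB"
    print("MIN:", min_faults(refs, 3), "LRU:", lru_faults(refs, 3))
```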
Is LRU guaranteed to perform well?

- Consider the following: A B C D A B C D A B C D
- LRU Performs as follows (same as FIFO here):

<table>
<thead>
<tr>
<th>Ref:</th>
<th>A</th>
<th>B</th>
<th>C</th>
<th>D</th>
<th>A</th>
<th>B</th>
<th>C</th>
<th>D</th>
<th>A</th>
<th>B</th>
<th>C</th>
<th>D</th>
</tr>
</thead>
<tbody>
<tr>
<td>Page:</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>1</td>
<td>A</td>
<td></td>
<td></td>
<td>D</td>
<td></td>
<td></td>
<td>C</td>
<td></td>
<td></td>
<td>B</td>
<td></td>
<td></td>
</tr>
<tr>
<td>2</td>
<td></td>
<td>B</td>
<td></td>
<td></td>
<td>A</td>
<td></td>
<td></td>
<td>D</td>
<td></td>
<td></td>
<td>C</td>
<td></td>
</tr>
<tr>
<td>3</td>
<td></td>
<td></td>
<td>C</td>
<td></td>
<td></td>
<td>B</td>
<td></td>
<td></td>
<td>A</td>
<td></td>
<td></td>
<td>D</td>
</tr>
</tbody>
</table>

- Every reference is a page fault!
- Fairly contrived example of working set of \( N+1 \) on \( N \) frames
When will LRU perform badly?

- Consider the following: A B C D A B C D A B C D
- LRU Performs as follows (same as FIFO here):

<table>
<thead>
<tr>
<th>Ref:</th>
<th>A</th>
<th>B</th>
<th>C</th>
<th>D</th>
<th>A</th>
<th>B</th>
<th>C</th>
<th>D</th>
<th>A</th>
<th>B</th>
<th>C</th>
<th>D</th>
</tr>
</thead>
<tbody>
<tr>
<td>Page:</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>1</td>
<td>A</td>
<td></td>
<td></td>
<td>D</td>
<td></td>
<td></td>
<td>C</td>
<td></td>
<td></td>
<td>B</td>
<td></td>
<td></td>
</tr>
<tr>
<td>2</td>
<td></td>
<td>B</td>
<td></td>
<td></td>
<td>A</td>
<td></td>
<td></td>
<td>D</td>
<td></td>
<td></td>
<td>C</td>
<td></td>
</tr>
<tr>
<td>3</td>
<td></td>
<td></td>
<td>C</td>
<td></td>
<td></td>
<td>B</td>
<td></td>
<td></td>
<td>A</td>
<td></td>
<td></td>
<td>D</td>
</tr>
</tbody>
</table>

- Every reference is a page fault!

- MIN Does much better:

<table>
<thead>
<tr>
<th>Ref:</th>
<th>A</th>
<th>B</th>
<th>C</th>
<th>D</th>
<th>A</th>
<th>B</th>
<th>C</th>
<th>D</th>
<th>A</th>
<th>B</th>
<th>C</th>
<th>D</th>
</tr>
</thead>
<tbody>
<tr>
<td>Page:</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>1</td>
<td>A</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>B</td>
<td></td>
<td></td>
</tr>
<tr>
<td>2</td>
<td></td>
<td>B</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>C</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>3</td>
<td></td>
<td></td>
<td>C</td>
<td>D</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>
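The gap between LRU and MIN on this cyclic stream can be reproduced with a small simulator that takes the victim-selection rule as a parameter. This is a sketch; `count_faults`, `lru_victim`, and `min_victim` are illustrative names, and MIN looks ahead in the reference stream, which a real system cannot do.

```python
def count_faults(refs, num_frames, choose_victim):
    """Count page faults, delegating the eviction choice to `choose_victim`."""
    frames = []
    last_use = {}   # page -> index of its most recent reference
    faults = 0
    for i, page in enumerate(refs):
        if page not in frames:
            faults += 1
            if len(frames) == num_frames:
                frames.remove(choose_victim(frames, last_use, refs, i))
            frames.append(page)
        last_use[page] = i
    return faults


def lru_victim(frames, last_use, refs, i):
    # Evict the page whose most recent use is oldest.
    return min(frames, key=lambda p: last_use[p])


def min_victim(frames, last_use, refs, i):
    # Evict the page whose next use is farthest in the future (or never).
    future = refs[i + 1:]
    return max(frames, key=lambda p: future.index(p) if p in future else len(future))


if __name__ == "__main__":
    refs = list("ABCD" * 3)
    print("LRU faults:", count_faults(refs, 3, lru_victim))
    print("MIN faults:", count_faults(refs, 3, min_victim))
```

With three frames, LRU faults on all 12 references, while MIN faults only 6 times.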
Why it works in Practice: Working Set Model

• As a program executes it transitions through a sequence of “working sets” consisting of varying sized subsets of the address space.
• One desirable property: When you add memory the miss rate drops (stack property)
  – Does this always happen?
  – Seems like it should, right?
• No: Bélády’s anomaly
  – Certain replacement algorithms (FIFO) don’t have this obvious property!
Adding Memory Doesn’t Always Help Fault Rate

• Does adding memory reduce number of page faults?
  – Yes for LRU and MIN
  – Not necessarily for FIFO! (Called Bélády’s anomaly)

• After adding memory:
  – With FIFO, contents can be completely different
  – In contrast, with LRU or MIN, contents of memory with \( X \) pages
    are a subset of contents with \( X + 1 \) pages
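Bélády's anomaly is easy to reproduce. The sketch below uses the reference string 1 2 3 4 1 2 5 1 2 3 4 5, a commonly used textbook example that does not appear in these slides, and a hypothetical helper `simulate` that supports FIFO and LRU.

```python
def simulate(refs, num_frames, policy):
    """Return the page-fault count; `policy` is "fifo" or "lru"."""
    frames = []   # index 0 is always the next victim
    faults = 0
    for page in refs:
        if page in frames:
            if policy == "lru":
                # A use moves the page to the most-recently-used end.
                frames.remove(page)
                frames.append(page)
            continue
        faults += 1
        if len(frames) == num_frames:
            frames.pop(0)
        frames.append(page)
    return faults


if __name__ == "__main__":
    refs = [1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5]
    for policy in ("fifo", "lru"):
        print(policy, [simulate(refs, n, policy) for n in (3, 4)])
```

On this string FIFO goes from 9 faults with 3 frames to 10 faults with 4 frames, while LRU goes from 10 to 8, consistent with the subset property stated above.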
Summary (1/2)

• The Principle of Locality:
  – Program likely to access a relatively small portion of the address space at any instant of time.
    » Temporal Locality: Locality in Time
    » Spatial Locality: Locality in Space

• Three (+1) Major Categories of Cache Misses:
  – Compulsory Misses: sad facts of life. Example: cold start misses.
  – Conflict Misses: increase cache size and/or associativity
  – Capacity Misses: increase cache size
  – Coherence Misses: Caused by external processors or I/O devices

• Cache Organizations:
  – Direct Mapped: single block per set
  – Set associative: more than one block per set
  – Fully associative: all entries equivalent
Summary (2/2)

• “Translation Lookaside Buffer” (TLB)
  – Small number of PTEs and optional process IDs (< 512)
  – Fully Associative (Since conflict misses expensive)
  – On TLB miss, page table must be traversed and if located PTE is invalid, cause Page Fault
  – On change in page table, TLB entries must be invalidated
  – TLB is logically in front of cache (need to overlap with cache access)

• Precise Exception specifies a single instruction for which:
  – All previous instructions have completed (committed state)
  – No following instructions nor actual instruction have started

• Can manage caches in hardware or software or both
  – Goal is highest hit rate, even if it means more complex cache management