Code Commentary On The Linux Virtual Memory Manager
Mel Gorman
14th July 2003

Contents
1 Boot Memory Allocator 1.1 Representing the Boot Map . . . . . . . . . 1.2 Initialising the Boot Memory Allocator . . . Function: setup_memory . . . . . . Function: init_bootmem . . . . . . . Function: init_bootmem_node . . . Function: init_bootmem_core . . . 1.3 Allocating Memory . . . . . . . . . . . . . . Function: reserve_bootmem . . . . . Function: reserve_bootmem_node . Function: reserve_bootmem_core . . Function: alloc_bootmem . . . . . . Function: __alloc_bootmem . . . . Function: alloc_bootmem_node . . Function: __alloc_bootmem_node Function: __alloc_bootmem_core . 1.4 Freeing Memory . . . . . . . . . . . . . . . . Function: free_bootmem . . . . . . . Function: free_bootmem_core . . . 1.5 Retiring the Boot Memory Allocator . . . . Function: mem_init . . . . . . . . . Function: free_pages_init . . . . . . Function: free_all_bootmem . . . . Function: free_all_bootmem_core . 2 Physical Page Management 2.1 Allocating Pages . . . . . . . . . . . Function: alloc_pages . . . . Function: _alloc_pages . . . Function: __alloc_pages . . Function: rmqueue . . . . . . Function: expand . . . . . . . 2.2 Free Pages . . . . . . . . . . . . . . . Function: __free_pages . . . Function: __free_pages_ok . 2.3 Page Allocate Helper Functions . . . 2 8 8 8 8 11 12 12 13 13 14 14 15 16 16 17 17 22 22 23 24 24 26 26 27 29 29 29 30 31 35 37 38 39 39 43

. . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . .

. . . . . . . . . .

. . . . . . . . . .

. . . . . . . . . .

. . . . . . . . . .

. . . . . . . . . .

. . . . . . . . . .

. . . . . . . . . .

. . . . . . . . . .

. . . . . . . . . .

. . . . . . . . . .

. . . . . . . . . .

. . . . . . . . . .

. . . . . . . . . .

. . . . . . . . . .

. . . . . . . . . .

. . . . . . . . . .

. . . . . . . . . .

. . . . . . . . . .

. . . . . . . . . .

. . . . . . . . . .

. . . . . . . . . .

2.4

Function: alloc_page . . . . . Function: __get_free_page . Function: __get_free_pages Function: __get_dma_pages Function: get_zeroed_page . Page Free Helper Functions . . . . . Function: free_pages . . . . . Function: __free_page . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

43 43 44 44 44 45 45 45 46 46 46 47 48 50 51 52 53 53 55 56 57 59 59 59 68 68 70 70 71 72 73 73 75 76 80 81 81 82 83 83 83 87 87 88

3 Non-Contiguous Memory Allocation 3.1 Allocating A Non-Contiguous Area . . Function: vmalloc . . . . . . . . Function: __vmalloc . . . . . . Function: get_vm_area . . . . Function: vmalloc_area_pages Function: alloc_area_pmd . . Function: alloc_area_pte . . . 3.2 Freeing A Non-Contiguous Area . . . . Function: vfree . . . . . . . . . Function: vmfree_area_pages . Function: free_area_pmd . . . Function: free_area_pte . . . .

. . . . . . . . . . . .

. . . . . . . . . . . .

. . . . . . . . . . . .

. . . . . . . . . . . .

. . . . . . . . . . . .

. . . . . . . . . . . .

. . . . . . . . . . . .

. . . . . . . . . . . .

. . . . . . . . . . . .

. . . . . . . . . . . .

. . . . . . . . . . . .

. . . . . . . . . . . .

. . . . . . . . . . . .

. . . . . . . . . . . .

. . . . . . . . . . . .

. . . . . . . . . . . .

. . . . . . . . . . . .

. . . . . . . . . . . .

. . . . . . . . . . . .

. . . . . . . . . . . .

. . . . . . . . . . . .

4 Slab Allocator 4.0.1 Cache Creation . . . . . . . . . . . . . . . . . Function: kmem_cache_create . . . . . . . . 4.0.2 Calculating the Number of Objects on a Slab Function: kmem_cache_estimate . . . . . . . 4.0.3 Cache Shrinking . . . . . . . . . . . . . . . . . Function: kmem_cache_shrink . . . . . . . . Function: __kmem_cache_shrink . . . . . . Function: __kmem_cache_shrink_locked . . 4.0.4 Cache Destroying . . . . . . . . . . . . . . . . Function: kmem_cache_destroy . . . . . . . 4.0.5 Cache Reaping . . . . . . . . . . . . . . . . . Function: kmem_cache_reap . . . . . . . . . 4.1 Slabs . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.1.1 Storing the Slab Descriptor . . . . . . . . . . Function: kmem_cache_slabmgmt . . . . . . Function: kmem_ﬁnd_general_cachep . . . . 4.1.2 Slab Structure . . . . . . . . . . . . . . . . . . 4.1.3 Slab Creation . . . . . . . . . . . . . . . . . . Function: kmem_cache_grow . . . . . . . . . 4.1.4 Slab Destroying . . . . . . . . . . . . . . . . . Function: kmem_slab_destroy . . . . . . . . 4.2 Objects . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . .

4.2.1

4.3

4.4

4.5 4.6

Initialising Objects in a Slab . . . . . . . . . . . . . . Function: kmem_cache_init_objs . . . . . . . . . . 4.2.2 Object Allocation . . . . . . . . . . . . . . . . . . . . Function: kmem_cache_alloc . . . . . . . . . . . . . Function: __kmem_cache_alloc (UP Case) . . . . . Function: __kmem_cache_alloc (SMP Case) . . . . Function: kmem_cache_alloc_head . . . . . . . . . Function: kmem_cache_alloc_one . . . . . . . . . . Function: kmem_cache_alloc_one_tail . . . . . . . Function: kmem_cache_alloc_batch . . . . . . . . . 4.2.3 Object Freeing . . . . . . . . . . . . . . . . . . . . . Function: kmem_cache_free . . . . . . . . . . . . . . Function: __kmem_cache_free . . . . . . . . . . . . Function: __kmem_cache_free . . . . . . . . . . . . Function: kmem_cache_free_one . . . . . . . . . . . Function: free_block . . . . . . . . . . . . . . . . . . Function: __free_block . . . . . . . . . . . . . . . . Sizes Cache . . . . . . . . . . . . . . . . . . . . . . . . . . . Function: kmem_cache_sizes_init . . . . . . . . . . 4.3.1 kmalloc . . . . . . . . . . . . . . . . . . . . . . . . . Function: kmalloc . . . . . . . . . . . . . . . . . . . . 4.3.2 kfree . . . . . . . . . . . . . . . . . . . . . . . . . . . Function: kfree . . . . . . . . . . . . . . . . . . . . . Per-CPU Object Cache . . . . . . . . . . . . . . . . . . . . . 4.4.1 Describing the Per-CPU Object Cache . . . . . . . . 4.4.2 Adding/Removing Objects from the Per-CPU Cache 4.4.3 Enabling Per-CPU Caches . . . . . . . . . . . . . . . Function: enable_all_cpucaches . . . . . . . . . . . . Function: enable_cpucache . . . . . . . . . . . . . . Function: kmem_tune_cpucache . . . . . . . . . . . 4.4.4 Updating Per-CPU Information . . . . . . . . . . . . Function: smp_function_all_cpus . . . . . . . . . . Function: do_ccupdate_local . . . . . . . . . . . . . 4.4.5 Draining a Per-CPU Cache . . . . . . . . . . . . . . Function: drain_cpu_caches . . . . . . . . . . . . . Slab Allocator Initialisation . . . . . . . . . . . . . . . . . . Function: kmem_cache_init . . . . . . . . . . . . . . Interfacing with the Buddy Allocator . . . . . . . . . . . . . Function: kmem_getpages . . . . . . . . . . . . . . . Function: kmem_freepages . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

88 88 90 90 91 92 93 94 95 96 98 98 99 99 100 102 102 102 102 104 104 105 105 106 106 107 107 107 108 109 111 111 112 112 113 114 114 115 115 116 117 117 117 119 119

5 Process Address Space 5.1 Managing the Address Space . . 5.2 Process Memory Descriptors . . 5.2.1 Allocating a Descriptor . Function: allocate_mm

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

5.3

Function: mm_alloc . . . . . . . . . . . Initalising a Descriptor . . . . . . . . . . Function: copy_mm . . . . . . . . . . . Function: mm_init . . . . . . . . . . . . 5.2.3 Destroying a Descriptor . . . . . . . . . Function: mmput . . . . . . . . . . . . . Function: mmdrop . . . . . . . . . . . . Function: __mmdrop . . . . . . . . . . Memory Regions . . . . . . . . . . . . . . . . . 5.3.1 Creating A Memory Region . . . . . . . Function: do_mmap_pgoﬀ . . . . . . . 5.3.2 Finding a Mapped Memory Region . . . Function: ﬁnd_vma . . . . . . . . . . . Function: ﬁnd_vma_prev . . . . . . . . Function: ﬁnd_vma_intersection . . . . 5.3.3 Finding a Free Memory Region . . . . . Function: get_unmapped_area . . . . . Function: arch_get_unmapped_area . . 5.3.4 Inserting a memory region . . . . . . . . Function: __insert_vm_struct . . . . . Function: ﬁnd_vma_prepare . . . . . . Function: vma_link . . . . . . . . . . . Function: __vma_link . . . . . . . . . Function: __vma_link_list . . . . . . . Function: __vma_link_rb . . . . . . . Function: __vma_link_ﬁle . . . . . . . 5.3.5 Merging contiguous region . . . . . . . . Function: vma_merge . . . . . . . . . . Function: can_vma_merge . . . . . . . 5.3.6 Remapping and moving a memory region Function: sys_mremap . . . . . . . . . . Function: do_mremap . . . . . . . . . . Function: move_vma . . . . . . . . . . . Function: move_page_tables . . . . . . Function: move_one_page . . . . . . . . Function: get_one_pte . . . . . . . . . . Function: alloc_one_pte . . . . . . . . . Function: copy_one_pte . . . . . . . . . 5.3.7 Locking a Memory Region . . . . . . . . Function: sys_mlock . . . . . . . . . . . Function: sys_mlockall . . . . . . . . . . Function: do_mlock . . . . . . . . . . . 5.3.8 Unlocking the region . . . . . . . . . . . Function: sys_munlock . . . . . . . . . . Function: sys_munlockall . . . . . . . . 5.2.2

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

120 120 120 122 123 123 124 124 125 128 128 139 139 141 142 142 142 144 145 145 146 147 148 149 149 150 150 150 153 153 153 154 159 164 165 166 167 167 168 168 169 170 172 172 173

5.3.9

5.4

Fixing up regions after locking/unlocking Function: mlock_ﬁxup . . . . . . . . . . Function: mlock_ﬁxup_all . . . . . . . Function: mlock_ﬁxup_start . . . . . . Function: mlock_ﬁxup_end . . . . . . . Function: mlock_ﬁxup_middle . . . . . 5.3.10 Deleting a memory region . . . . . . . . Function: do_munmap . . . . . . . . . . Function: unmap_ﬁxup . . . . . . . . . 5.3.11 Deleting all memory regions . . . . . . . Function: exit_mmap . . . . . . . . . . Page Fault Handler . . . . . . . . . . . . . . . . 5.4.1 Handling the Page Fault . . . . . . . . . Function: handle_mm_fault . . . . . . . Function: handle_pte_fault . . . . . . . 5.4.2 Demand Allocation . . . . . . . . . . . . Function: do_no_page . . . . . . . . . . Function: do_anonymous_page . . . . . 5.4.3 Demand Paging . . . . . . . . . . . . . . Function: do_swap_page . . . . . . . . 5.4.4 Copy On Write (COW) Pages . . . . . .

. . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . .

173 173 174 175 176 177 179 179 183 186 186 188 198 198 199 200 200 204 206 206 209 212 212 212 213 213 216 217 217 217 218 218 219 220 220 222 223 225 225 225 225 226 226 228 228

6 High Memory Management 6.1 Mapping High Memory Pages . . . . . . . Function: kmap . . . . . . . . . . . Function: kmap_high . . . . . . . Function: map_new_virtual . . . . Function: ﬂush_all_zero_pkmaps 6.1.1 Unmapping Pages . . . . . . . . . . Function: kunmap . . . . . . . . . Function: kunmap_high . . . . . . 6.2 Mapping High Memory Pages Atomically . Function: kmap_atomic . . . . . . Function: kunmap_atomic . . . . . 6.3 Bounce Buﬀers . . . . . . . . . . . . . . . Function: create_buﬀers . . . . . . Function: alloc_bounce_bh . . . . Function: alloc_bounce_page . . . 6.3.1 Copying via Bounce Buﬀers . . . . Function: bounce_end_io_write . Function: bounce_end_io_read . . Function: copy_from_high_bh . . Function: copy_to_high_bh_irq . Function: bounce_end_io . . . . . 6.4 Emergency Pools . . . . . . . . . . . . . . Function: init_emergency_pool . .

. . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . .

7 Page Frame Reclamation 7.1 Page Swap Daemon . . . . . . . . . . . . . . . . Function: kswapd_init . . . . . . . . . . Function: kswapd . . . . . . . . . . . . . Function: kswapd_can_sleep . . . . . . Function: kswapd_can_sleep_pgdat . . Function: kswapd_balance . . . . . . . . Function: kswapd_balance_pgdat . . . 7.2 Page Cache . . . . . . . . . . . . . . . . . . . . Function: lru_cache_add . . . . . . . . Function: add_page_to_active_list . . Function: add_page_to_inactive_list . Function: lru_cache_del . . . . . . . . . Function: __lru_cache_del . . . . . . . Function: del_page_from_active_list . Function: del_page_from_inactive_list Function: mark_page_accessed . . . . . Function: activate_lock . . . . . . . . . Function: activate_page_nolock . . . . Function: page_cache_get . . . . . . . . Function: page_cache_release . . . . . . Function: add_to_page_cache . . . . . Function: __add_to_page_cache . . . 7.3 Shrinking all caches . . . . . . . . . . . . . . . . Function: shrink_caches . . . . . . . . . Function: try_to_free_pages . . . . . . Function: try_to_free_pages_zone . . . 7.4 Reﬁlling inactive_list . . . . . . . . . . . . . . . Function: reﬁll_inactive . . . . . . . . . 7.5 Reclaiming pages from the page cache . . . . . . Function: shrink_cache . . . . . . . . . 7.6 Swapping Out Process Pages . . . . . . . . . . . Function: swap_out . . . . . . . . . . . Function: swap_out_mm . . . . . . . . Function: swap_out_vma . . . . . . . . Function: swap_out_pgd . . . . . . . . Function: swap_out_pmd . . . . . . . . Function: try_to_swap_out . . . . . . . 8 Swap Management 8.1 Describing the Swap Area . . . . . . . 8.2 Scanning for free entries . . . . . . . . Function: get_swap_page . . . Function: scan_swap_map . . 8.3 Swap Cache . . . . . . . . . . . . . . . Function: add_to_swap_cache

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

230 230 230 230 231 232 232 233 234 234 234 235 235 236 236 236 237 237 237 238 238 238 238 239 239 241 242 243 243 244 244 251 251 253 254 255 256 258 262 262 262 262 264 267 267

. . . . . .

. . . . . .

. . . . . .

. . . . . .

. . . . . .

. . . . . .

. . . . . .

. . . . . .

. . . . . .

. . . . . .

. . . . . .

. . . . . .

. . . . . .

. . . . . .

. . . . . .

. . . . . .

. . . . . .

. . . . . .

. . . . . .

. . . . . .

. . . . . .

8.4 8.5

Function: swap_duplicate . . . Function: swap_free . . . . . . Function: swap_entry_free . . Function: swap_info_get . . . Function: swap_info_put . . . Function: lookup_swap_cache Function: ﬁnd_get_page . . . . Function: __ﬁnd_get_page . . Function: __ﬁnd_page_nolock Activating a Swap Area . . . . . . . . Function: sys_swapon . . . . . Deactivating a Swap Area . . . . . . . Function: sys_swapoﬀ . . . . . Function: try_to_unuse . . . .

. . . . . . . . . . . . . .

. . . . . . . . . . . . . .

. . . . . . . . . . . . . .

. . . . . . . . . . . . . .

. . . . . . . . . . . . . .

. . . . . . . . . . . . . .

. . . . . . . . . . . . . .

. . . . . . . . . . . . . .

. . . . . . . . . . . . . .

. . . . . . . . . . . . . .

. . . . . . . . . . . . . .

. . . . . . . . . . . . . .

. . . . . . . . . . . . . .

. . . . . . . . . . . . . .

. . . . . . . . . . . . . .

. . . . . . . . . . . . . .

. . . . . . . . . . . . . .

. . . . . . . . . . . . . .

. . . . . . . . . . . . . .

. . . . . . . . . . . . . .

. . . . . . . . . . . . . .

268 269 269 270 272 272 272 273 273 274 274 285 285 289

List of Figures
1.1 2.1 2.2 3.1 3.2 4.1 4.2 4.3 4.4 4.5 4.6 4.7 4.8 4.9 4.10 5.1 5.2 5.3 5.4 5.5 5.6 5.7 5.8 5.9 5.10 5.11 6.1 6.2 7.1 7.2 8.1 Call Graph: setup_memory . . . . . . . . . . . . . . . . . . . . . . . . . . . Call Graph: alloc_pages() . . . . . . . . . . . . . . . . . . . . . . . . . . . Call Graph: __free_pages() . . . . . . . . . . . . . . . . . . . . . . . . . . Call Graph: vmalloc() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Call Graph: vfree() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Call Graph: kmem_cache_create() . Call Graph: kmem_cache_shrink() . Call Graph: kmem_cache_destroy() Call Graph: kmem_cache_reap() . . Call Graph: kmem_cache_grow() . . Call Graph: kmem_slab_destroy() . Call Graph: kmem_cache_alloc() . . Call Graph: kmem_cache_free() . . kmalloc . . . . . . . . . . . . . . . . kfree . . . . . . . . . . . . . . . . . . Call Graph: sys_mmap2() . . . . . Call Graph: get_unmapped_area() Call Graph: insert_vm_struct() . Call Graph: sys_mremap . . . . . Call Graph: move_vma . . . . . . Call Graph: move_page_tables() . do_munmap . . . . . . . . . . . . . Call Graph: do_page_fault() . . . Call Graph: do_no_page() . . . . . do_swap_page . . . . . . . . . . . do_wp_page . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 30 39 47 54

. 60 . 70 . 73 . 75 . 83 . 87 . 91 . 98 . 104 . 105 . . . . . . . . . . . 129 143 145 153 159 164 179 189 201 210 211

Call Graph: kmap() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 212 Call Graph: create_bounce . . . . . . . . . . . . . . . . . . . . . . . . . . . 220 shrink_cache . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 240 Call Graph: swap_out . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 251 Call Graph: get_swap_page() . . . . . . . . . . . . . . . . . . . . . . . . . . 262

9

List of Tables
2.1 2.2 3.1 3.2 4.1 Physical Pages Allocation API . . . . . . . . . . . . . . . . . . . . . . . . . . Physical Pages Free API . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Non-Contiguous Memory Allocation API . . . . . . . . . . . . . . . . . . . . Non-Contiguous Memory Free API . . . . . . . . . . . . . . . . . . . . . . . Slab Allocator API for caches . . . . . . . . . . . . . . . . . . . . . . . . . . 29 38 46 53 60

7

Chapter 1 Boot Memory Allocator
1.1 Representing the Boot Map

A bootmem_data struct exists for each node of memory in the system. It contains the information needed for the boot memory allocator to allocate memory for a node such as the bitmap representing allocated pages and where the memory is located. It is declared as follows in <linux/bootmem.h>; 25 typedef struct bootmem_data { 26 unsigned long node_boot_start; 27 unsigned long node_low_pfn; 28 void *node_bootmem_map; 29 unsigned long last_offset; 30 unsigned long last_pos; 31 } bootmem_data_t;

node_boot_start is the starting physical address of the represented block node_low_pfn is the end physical address, in other words, the end of the ZONE_NORMAL this node represents node_bootmem_map is the location of the bitmap representing allocated or free pages with each bit last_offset is the oﬀset within the the page of the end of the last allocation. If 0, the page used is full last_pos is the the PFN of the page used with the last allocation. Using this with the last_offset ﬁeld, a test can be made to see if allocations can be merged with the page used for the last allocation rather than using up a full new page

1.2

Initialising the Boot Memory Allocator

Function: setup_memory (arch/i386/kernel/setup.c) 8

1.2. Initialising the Boot Memory Allocator

9

setup_memory

find_max_pfn

find_max_low_pfn

init_bootmem

register_bootmem_low_pages

find_smp_config

init_bootmem_core

free_bootmem

find_intel_smp

free_bootmem_core

smp_scan_config

reserve_bootmem

mpf_checksum

reserve_bootmem_core

Figure 1.1: Call Graph: setup_memory This function gets the necessary information to give to the boot memory allocator to initialise itself. It is broken up into a number of diﬀerent tasks. • Find the start and ending Page Frame Number (PFN) for low memory (min_low_pfn, max_low_pfn), the start and end PFN for high memory (highstart_pfn, highend_pfn) and the PFN for the last page in the system (max_pfn). • Initialise the bootmem_data structure and declare which pages may be used by the boot memory allocator • Mark all pages usable by the system as “free” and then reserve the pages used by the bitmap representing the pages • Reserve pages used by the SMP conﬁg or the initrd image if one exists 949 static unsigned long __init setup_memory(void) 950 { 951 unsigned long bootmap_size, start_pfn, max_low_pfn; 952 953 /* 954 * partially used pages are not usable - thus 955 * we are rounding upwards: 956 */ 957 start_pfn = PFN_UP(__pa(&_end)); 958 959 find_max_pfn(); 960

1.2. Initialising the Boot Memory Allocator 961 max_low_pfn = find_max_low_pfn(); 962 963 #ifdef CONFIG_HIGHMEM 964 highstart_pfn = highend_pfn = max_pfn; 965 if (max_pfn > max_low_pfn) { 966 highstart_pfn = max_low_pfn; 967 } 968 printk(KERN_NOTICE "%ldMB HIGHMEM available.\n", 969 pages_to_mb(highend_pfn - highstart_pfn)); 970 #endif 971 printk(KERN_NOTICE "%ldMB LOWMEM available.\n", 972 pages_to_mb(max_low_pfn));

10

957 PFN_UP() takes a physical address, rounds it up to the next page and returns the page frame number. _end is the address of the end of the loaded kernel image so start_pfn is now the oﬀset of the ﬁrst physical page frame that may be used 959 find_max_pfn() loops through the e820 map searching for the highest available pfn 961 find_max_low_pfn() ﬁnds the highest page frame addressable in ZONE_NORMAL 964-969 If high memory is enabled, start with a high memory region of 0. If it turns out there is memory after max_low_pfn, put the start of high memory (highstart_pfn) there and the end of high memory at max_pfn. Print out an informational message on the availability of high memory 971-972 Print out an informational message on the amount of low memory 976 bootmap_size = init_bootmem(start_pfn, max_low_pfn); 977 978 register_bootmem_low_pages(max_low_pfn); 979 986 reserve_bootmem(HIGH_MEMORY, (PFN_PHYS(start_pfn) + 987 bootmap_size + PAGE_SIZE-1) - (HIGH_MEMORY)); 988 993 reserve_bootmem(0, PAGE_SIZE); 994 995 #ifdef CONFIG_SMP 1001 reserve_bootmem(PAGE_SIZE, PAGE_SIZE); 1002 #endif 976 init_bootmem() initialises the bootmem_data struct for the config_page_data node. It sets where physical memory begins and ends for the node, allocates a bitmap representing the pages and sets all pages as reserved 978 registed_bootmem_low_pages() reads the e820 map and calls free_bootmem() for all usable pages in the running system

1.2. Initialising the Boot Memory Allocator 986-987 Reserve the pages that are being used by the bitmap representing the pages 993 Reserve page 0 as it is often a special page used by the bios

11

1001 Reserve an extra page which is required by the trampoline code. The trampoline code deals with how userspace enters kernel space 1003 1004 #ifdef CONFIG_X86_LOCAL_APIC 1005 /* 1006 * Find and reserve possible boot-time SMP configuration: 1007 */ 1008 find_smp_config(); 1009 #endif 1010 #ifdef CONFIG_BLK_DEV_INITRD 1011 if (LOADER_TYPE && INITRD_START) { 1012 if (INITRD_START + INITRD_SIZE <= (max_low_pfn << PAGE_SHIFT)) { 1013 reserve_bootmem(INITRD_START, INITRD_SIZE); 1014 initrd_start = 1015 INITRD_START ? INITRD_START + PAGE_OFFSET : 0; 1016 initrd_end = initrd_start+INITRD_SIZE; 1017 } 1018 else { 1019 printk(KERN_ERR "initrd extends beyond end of memory " 1020 "(0x%08lx > 0x%08lx)\ndisabling initrd\n", 1021 INITRD_START + INITRD_SIZE, 1022 max_low_pfn << PAGE_SHIFT); 1023 initrd_start = 0; 1024 } 1025 } 1026 #endif 1027 1028 return max_low_pfn; 1029 } 1008 This function reserves memory that stores conﬁg information about the SMP setup 1010-1026 If initrd is enabled, the memory containing its image will be reserved. initrd provides a tiny ﬁlesystem image which is used to boot the system 1028 Return the upper limit of addressable memory in ZONE_NORMAL Function: init_bootmem (mm/bootmem.c) Called by UMA architectures to initialise their bootmem data.

1.2. Initialising the Boot Memory Allocator 304 unsigned long __init init_bootmem (unsigned long start, unsigned long pages) 305 { 306 max_low_pfn = pages; 307 min_low_pfn = start; 308 return(init_bootmem_core(&contig_page_data, start, 0, pages)); 309 }

12

304 Confusingly, the pages parameter is actually the end PFN of the memory addressable by this node, not the number of pages as the name impies 306 Set the max PFN addressable by this node in case the architecture dependent code did not 307 Set the min PFN addressable by this node in case the architecture dependent code did not 308 Call init_bootmem_core() which does the real work of initialising the bootmem_data Function: init_bootmem_node (mm/bootmem.c) Used by NUMA architectures to initialise bootmem data for a given node 284 unsigned long __init init_bootmem_node (pg_data_t *pgdat, unsigned long freepfn, unsigned long startpfn, unsigned long endpfn) 285 { 286 return(init_bootmem_core(pgdat, freepfn, startpfn, endpfn)); 287 } 286 Just call init_bootmem_core() directly Function: init_bootmem_core (mm/bootmem.c) Initialises the appropriate struct bootmem_data_t and inserts the node into the linked list of nodes pgdat_list. 46 static unsigned long __init init_bootmem_core (pg_data_t *pgdat, 47 unsigned long mapstart, unsigned long start, unsigned long end) 48 { 49 bootmem_data_t *bdata = pgdat->bdata; 50 unsigned long mapsize = ((end - start)+7)/8; 51 52 pgdat->node_next = pgdat_list; 53 pgdat_list = pgdat; 54 55 mapsize = (mapsize + (sizeof(long) - 1UL)) & ~(sizeof(long) - 1UL); 56 bdata->node_bootmem_map = phys_to_virt(mapstart << PAGE_SHIFT);

1.3. Allocating Memory 57 58 59 60 61 62 63 64 65 66 67 } bdata->node_boot_start = (start << PAGE_SHIFT); bdata->node_low_pfn = end; /* * Initially all pages are reserved - setup_arch() has to * register free RAM areas explicitly. */ memset(bdata->node_bootmem_map, 0xff, mapsize); return mapsize;

13

46 The parameters are; pgdat is the node descriptor been initialised mapstart is the beginning of the memory that will be usable start is the beginning PFN of the node end is the end PFN of the node 50 Each page requires one bit to represent it so the size of the map required is the number of pages in this node rounded up to the nearest multiple of 8 and then divided by 8 to give the number of bytes required 52-53 As the node will be shortly considered initialised, insert it into the global pgdat_list 55 Round the mapsize up to the closest word boundary 56 Convert the mapstart to a virtual address and store it in bdata→node_bootmem_map 57 Convert the starting PFN to a physical address and store it on node_boot_start 58 Store the end PFN of ZONE_NORMAL in node_low_pfn 64 Fill the full map with 1’s marking all pages as allocated. It is up to the architecture dependent code to mark the usable pages

1.3

Allocating Memory

Function: reserve_bootmem (mm/bootmem.c) 311 void __init reserve_bootmem (unsigned long addr, unsigned long size) 312 { 313 reserve_bootmem_core(contig_page_data.bdata, addr, size); 314 } 313 Just call reserve_bootmem_core() passing the bootmem data from contig_page_data as the node to reserve memory from

1.3. Allocating Memory Function: reserve_bootmem_node (mm/bootmem.c) 289 void __init reserve_bootmem_node (pg_data_t *pgdat, unsigned long physaddr, unsigned long size) 290 { 291 reserve_bootmem_core(pgdat->bdata, physaddr, size); 292 }

14

291 Just call reserve_bootmem_core() passing it the bootmem data of the requested node Function: reserve_bootmem_core (mm/bootmem.c) 74 static void __init reserve_bootmem_core(bootmem_data_t *bdata, unsigned long addr, unsigned long size) 75 { 76 unsigned long i; 77 /* 78 * round up, partially reserved pages are considered 79 * fully reserved. 80 */ 81 unsigned long sidx = (addr - bdata->node_boot_start)/PAGE_SIZE; 82 unsigned long eidx = (addr + size - bdata->node_boot_start + 83 PAGE_SIZE-1)/PAGE_SIZE; 84 unsigned long end = (addr + size + PAGE_SIZE-1)/PAGE_SIZE; 85 86 if (!size) BUG(); 87 88 if (sidx < 0) 89 BUG(); 90 if (eidx < 0) 91 BUG(); 92 if (sidx >= eidx) 93 BUG(); 94 if ((addr >> PAGE_SHIFT) >= bdata->node_low_pfn) 95 BUG(); 96 if (end > bdata->node_low_pfn) 97 BUG(); 98 for (i = sidx; i < eidx; i++) 99 if (test_and_set_bit(i, bdata->node_bootmem_map)) 100 printk("hm, page %08lx reserved twice.\n", i*PAGE_SIZE); 101 } 81 The sidx is the starting index to serve pages from. The value is obtained by subtracting the starting address from the requested address and dividing by the size of a page

1.3. Allocating Memory

15

82 A similar calculation is made for the ending index eidx except that the allocation is rounded up to the nearest page. This means that requests to partially reserve a page will result in the full page being reserved 84 end is the last PFN that is aﬀected by this reservation 86 Check that a non-zero value has been given 88-89 Check the starting index is not before the start of the node 90-91 Check the end index is not before the start of the node 92-93 Check the starting index is not after the end index 94-95 Check the starting address is not beyond the memory this bootmem node represents 96-97 Check the ending address is not beyond the memory this bootmem node represents 88-100 Starting with sidx and ﬁnishing with eidx, test and set the bit in the bootmem map that represents the page marking it as allocated. If the bit was already set to 1, print out a message saying it was reserved twice Function: alloc_bootmem (mm/bootmem.c) 38 39 40 41 42 43 44 45 #define alloc_bootmem(x) \ __alloc_bootmem((x), SMP_CACHE_BYTES, __pa(MAX_DMA_ADDRESS)) #define alloc_bootmem_low(x) \ __alloc_bootmem((x), SMP_CACHE_BYTES, 0) #define alloc_bootmem_pages(x) \ __alloc_bootmem((x), PAGE_SIZE, __pa(MAX_DMA_ADDRESS)) #define alloc_bootmem_low_pages(x) \ __alloc_bootmem((x), PAGE_SIZE, 0)

39 alloc_bootmem() will align to the L1 hardware cache and start searching for a page after the maximum address usable for DMA 40 alloc_bootmem_low() will align to the L1 hardware cache and start searching from page 0 42 alloc_bootmem_pages() will align the allocation to a page size so that full pages will be allocated starting from the maximum address usable for DMA 44 alloc_bootmem_pages() will align the allocation to a page size so that full pages will be allocated starting from physical address 0

1.3. Allocating Memory Function: __alloc_bootmem (mm/bootmem.c) 326 void * __init __alloc_bootmem (unsigned long size, unsigned long align, unsigned long goal) 327 { 328 pg_data_t *pgdat; 329 void *ptr; 330 331 for_each_pgdat(pgdat) 332 if ((ptr = __alloc_bootmem_core(pgdat->bdata, size, 333 align, goal))) 334 return(ptr); 335 336 /* 337 * Whoops, we cannot satisfy the allocation request. 338 */ 339 printk(KERN_ALERT "bootmem alloc of %lu bytes failed!\n", size); 340 panic("Out of memory"); 341 return NULL; 342 } 326 The parameters are; size is the size of the requested allocation align is the desired alignment and must be a power of 2. SMP_CACHE_BYTES or PAGE_SIZE goal is the starting address to begin searching from

16

Currently either

331-334 Cycle through all available nodes and try allocating from each in turn. In the UMA case, this will just allocate from the contig_page_data node 349-340 If the allocation fails, the system is not going to be able to boot so the kernel panics Function: alloc_bootmem_node (mm/bootmem.c) 53 #define alloc_bootmem_node(pgdat, x) \ 54 __alloc_bootmem_node((pgdat), (x), SMP_CACHE_BYTES, __pa(MAX_DMA_ADDRESS)) 55 #define alloc_bootmem_pages_node(pgdat, x) \ 56 __alloc_bootmem_node((pgdat), (x), PAGE_SIZE, __pa(MAX_DMA_ADDRESS)) 57 #define alloc_bootmem_low_pages_node(pgdat, x) \ 58 __alloc_bootmem_node((pgdat), (x), PAGE_SIZE, 0) 53-54 alloc_bootmem_node() will allocate from the requested node and align to the L1 hardware cache and start searching for a page after the maximum address usable for DMA

1.3. Allocating Memory

17

55-56 alloc_bootmem_pages() will allocate from the requested node and align the allocation to a page size so that full pages will be allocated starting from the maximum address usable for DMA 57-58 alloc_bootmem_pages() will allocate from the requested node and align the allocation to a page size so that full pages will be allocated starting from physical address 0 Function: __alloc_bootmem_node (mm/bootmem.c) 344 void * __init __alloc_bootmem_node (pg_data_t *pgdat, unsigned long size, unsigned long align, unsigned long goal) 345 { 346 void *ptr; 347 348 ptr = __alloc_bootmem_core(pgdat->bdata, size, align, goal); 349 if (ptr) 350 return (ptr); 351 352 /* 353 * Whoops, we cannot satisfy the allocation request. 354 */ 355 printk(KERN_ALERT "bootmem alloc of %lu bytes failed!\n", size); 356 panic("Out of memory"); 357 return NULL; 358 } 344 The parameters are the same as for __alloc_bootmem_node() except the node to allocate from is speciﬁed 348 Call the core function __alloc_bootmem_core() to perform the allocation 349-350 Return a pointer if it was successful 355-356 Otherwise print out a message and panic the kernel as the system will not boot if memory can not be allocated even now Function: __alloc_bootmem_core (mm/bootmem.c) This is the core function for allocating memory from a speciﬁed node with the boot memory allocator. It is quite large and broken up into the following tasks; • Function preamble. Make sure the parameters are sane • Calculate the starting address to scan from based on the goal parameter • Check to see if this allocation may be merged with the page used for the previous allocation to save memory.

1.3. Allocating Memory

18

• Mark the pages allocated as 1 in the bitmap and zero out the contents of the pages 144 static void * __init __alloc_bootmem_core (bootmem_data_t *bdata, 145 unsigned long size, unsigned long align, unsigned long goal) 146 { 147 unsigned long i, start = 0; 148 void *ret; 149 unsigned long offset, remaining_size; 150 unsigned long areasize, preferred, incr; 151 unsigned long eidx = bdata->node_low_pfn 152 (bdata->node_boot_start >> PAGE_SHIFT); 153 154 if (!size) BUG(); 155 156 if (align & (align-1)) 157 BUG(); 158 159 offset = 0; 160 if (align && 161 (bdata->node_boot_start & (align - 1UL)) != 0) 162 offset = (align - (bdata->node_boot_start & (align - 1UL))); 163 offset >>= PAGE_SHIFT; Function preamble, make sure the parameters are sane 144 The parameters are; bdata is the bootmem for the struct being allocated from size is the size of the requested allocation align is the desired alignment for the allocation. Must be a power of 2 goal is the preferred address to allocate above if possible 151 Calculate the ending bit index eidx which returns the highest page index that may be used for the allocation 154 Call BUG() if a request size of 0 is speciﬁed 156-156 If the alignment is not a power of 2, call BUG() 159 The default oﬀset for alignments is 0 160 If an alignment has been speciﬁed and... 161 And the requested alignment is the same alignment as the start of the node then calculate the oﬀset to use

1.3. Allocating Memory

19

162 The oﬀset to use is the requested alignment masked against the lower bits of the starting address. In reality, this offset will likely be identical to align for the prevalent values of align 169 170 171 172 173 174 175 176 177 178 if (goal && (goal >= bdata->node_boot_start) && ((goal >> PAGE_SHIFT) < bdata->node_low_pfn)) { preferred = goal - bdata->node_boot_start; } else preferred = 0; preferred = ((preferred + align - 1) & ~(align - 1)) >> PAGE_SHIFT; preferred += offset; areasize = (size+PAGE_SIZE-1)/PAGE_SIZE; incr = align >> PAGE_SHIFT ? : 1;

Calculate the starting PFN to start scanning from based on the goal parameter. 169 If a goal has been speciﬁed and the goal is after the starting address for this node and the PFN of the goal is less than the last PFN adressable by this node then .... 170 The preferred oﬀset to start from is the goal minus the beginning of the memory addressable by this node 173 Else the preferred oﬀset is 0 175-176 Adjust the preferred address to take the oﬀset into account so that the address will be correctly aligned 177 The number of pages that will be aﬀected by this allocation is stored in areasize 178 incr is the number of pages that have to be skipped to satisify alignment requirements if they are over one page 179 180 restart_scan: 181 for (i = preferred; i < eidx; i += incr) { 182 unsigned long j; 183 if (test_bit(i, bdata->node_bootmem_map)) 184 continue; 185 for (j = i + 1; j < i + areasize; ++j) { 186 if (j >= eidx) 187 goto fail_block; 188 if (test_bit (j, bdata->node_bootmem_map)) 189 goto fail_block; 190 } 191 start = i; 192 goto found;

1.3. Allocating Memory 193 194 195 196 197 198 199 fail_block:; } if (preferred) { preferred = offset; goto restart_scan; } return NULL;

20

Scan through memory looking for a block large enough to satisfy this request 180 If the allocation could not be satisifed starting from goal, this label is jumped back to for rescanning 181-194 Starting from preferred, scan lineraly searching for a free block large enough to satisfy the request. Walk the address space in incr steps to satisfy alignments greater than one page. If the alignment is less than a page, incr will just be 1 183-184 Test the bit, if it is already 1, it is not free so move to the next page 185-190 Scan the next areasize number of pages and see if they are also free. It fails if the end of the addressable space is reached (eidx) or one of the pages is already in use 191-192 A free block is found so record the start and jump to the found block 195-198 The allocation failed so start again from the beginning 199 If that also failed, return NULL which will result in a kernel panic 200 found: 201 if (start >= eidx) 202 BUG(); 203 209 if (align <= PAGE_SIZE 210 && bdata->last_offset && bdata->last_pos+1 == start) { 211 offset = (bdata->last_offset+align-1) & ~(align-1); 212 if (offset > PAGE_SIZE) 213 BUG(); 214 remaining_size = PAGE_SIZE-offset; 215 if (size < remaining_size) { 216 areasize = 0; 217 // last_pos unchanged 218 bdata->last_offset = offset+size; 219 ret = phys_to_virt(bdata->last_pos*PAGE_SIZE + offset + 220 bdata->node_boot_start); 221 } else { 222 remaining_size = size - remaining_size; 223 areasize = (remaining_size+PAGE_SIZE-1)/PAGE_SIZE;

1.3. Allocating Memory 224 225 226 227 228 229 230 231 232 233 234 ret = phys_to_virt(bdata->last_pos*PAGE_SIZE + offset + bdata->node_boot_start); bdata->last_pos = start+areasize-1; bdata->last_offset = remaining_size; } bdata->last_offset &= ~PAGE_MASK; } else { bdata->last_pos = start + areasize - 1; bdata->last_offset = size & ~PAGE_MASK; ret = phys_to_virt(start * PAGE_SIZE + bdata->node_boot_start); }

21

Test to see if this allocation may be merged with the previous allocation. 201-202 Check that the start of the allocation is not after the addressable memory. This check was just made so it is redundent 209-230 Try and merge with the previous allocation if the alignment is less than a PAGE_SIZE, the previously page has space in it (last_offset != 0) and that the previously used page is adjactent to the page found for this allocation 231-234 Else record the pages and oﬀset used for this allocation to be used for merging with the next allocation 211 Update the oﬀset to use to be aligned correctly for the requested align 212-213 If the oﬀset now goes over the edge of a page, BUG() is called. This condition would require a very poor choice of alignment to be used. As the only alignment commonly used is a factor of PAGE_SIZE, it is impossible for normal usage 214 remaining_size is the remaining free space in the previously used page 215-221 If there is enough space left in the old page then use the old page totally and update the bootmem_data struct to reﬂect it 221-228 Else calculate how many pages in addition to this one will be required and update the bootmem_data 216 The number of pages used by this allocation is now 0 218 Update the last_offset to be the end of this allocation 219 Calculate the virtual address to return for the successful allocation 222 remaining_size is how space will be used in the last page used to satisfy the allocation 223 Calculate how many more pages are needed to satisfy the allocation

1.4. Freeing Memory 224 Record the address the allocation starts from

22

226 The last page used is the start page plus the number of additional pages required to satisfy this allocation areasize 227 The end of the allocation has already been calculated 229 If the oﬀset is at the end of the page, make it 0 231 No merging took place so record the last page used to satisfy this allocation 232 Record how much of the last page was used 233 Record the starting virtual address of the allocation 238 239 240 241 242 243 } for (i = start; i < start+areasize; i++) if (test_and_set_bit(i, bdata->node_bootmem_map)) BUG(); memset(ret, 0, size); return ret;

Mark the pages allocated as 1 in the bitmap and zero out the contents of the pages 238-240 Cycle through all pages used for this allocation and set the bit to 1 in the bitmap. If any of them are already 1, then a double allocation took place so call BUG() 241 Zero ﬁll the pages 242 Return the address of the allocation

1.4

Freeing Memory

Function: free_bootmem (mm/bootmem.c) 294 void __init free_bootmem_node (pg_data_t *pgdat, unsigned long physaddr, unsigned long size) 295 { 296 return(free_bootmem_core(pgdat->bdata, physaddr, size)); 297 } 316 void __init free_bootmem (unsigned long addr, unsigned long size) 317 { 318 return(free_bootmem_core(contig_page_data.bdata, addr, size)); 319 } 296 Call the core function with the corresponding bootmem data for the requested node 318 Call the core function with the bootmem data for contig_page_data

1.4. Freeing Memory Function: free_bootmem_core (mm/bootmem.c) 103 static void __init free_bootmem_core(bootmem_data_t *bdata, unsigned long addr, unsigned long size) 104 { 105 unsigned long i; 106 unsigned long start; 111 unsigned long sidx; 112 unsigned long eidx = (addr + size bdata->node_boot_start)/PAGE_SIZE; 113 unsigned long end = (addr + size)/PAGE_SIZE; 114 115 if (!size) BUG(); 116 if (end > bdata->node_low_pfn) 117 BUG(); 118 119 /* 120 * Round up the beginning of the address. 121 */ 122 start = (addr + PAGE_SIZE-1) / PAGE_SIZE; 123 sidx = start - (bdata->node_boot_start/PAGE_SIZE); 124 125 for (i = sidx; i < eidx; i++) { 126 if (!test_and_clear_bit(i, bdata->node_bootmem_map)) 127 BUG(); 128 } 129 } 112 Calculate the end index aﬀected as eidx

23

113 The end address is the end of the aﬀected area rounded down to the nearest page if it is not already page aligned 115 If a size of 0 is freed, call BUG 116-117 If the end PFN is after the memory addressable by this node, call BUG 122 Round the starting address up to the nearest page if it is not already page aligned 123 Calculate the starting index to free 125-127 For all full pages that are freed by this action, clear the bit in the boot bitmap. If it is already 0, it is a double free or is memory that was never used so call BUG

1.5. Retiring the Boot Memory Allocator

24

1.5

Retiring the Boot Memory Allocator

Function: mem_init (arch/i386/mm/init.d) The important part of this function for the boot memory allocator is that it calls free_pages_init(). The function is broken up into the following tasks • Function preamble, set the PFN within the global mem_map for the location of high memory and zero out the system wide zero page • Call free_pages_init() • Print out an informational message on the availability of memory in the system • Check the CPU supports PAE if the conﬁg option is enabled and test the WP bit on the CPU. This is important as without the WP bit, the function verify_write() has to be called for every write to userspace from the kernel. This only applies to old processors like the 386 • Fill in entries for the userspace portion of the PGD for swapper_pg_dir, the kernel page tables. The zero page is mapped for all entries 507 void __init mem_init(void) 508 { 509 int codesize, reservedpages, datasize, initsize; 510 511 if (!mem_map) 512 BUG(); 513 514 set_max_mapnr_init(); 515 516 high_memory = (void *) __va(max_low_pfn * PAGE_SIZE); 517 518 /* clear the zero-page */ 519 memset(empty_zero_page, 0, PAGE_SIZE); 514 This function records the PFN high memory starts in mem_map (highmem_start_page), the maximum number of pages in the system (max_mapnr and num_physpages) and ﬁnally the maximum number of pages that may be mapped by the kernel (num_mappedpages) 516 high_memory is the virtual address where high memory begins 519 Zero out the system wide zero page 520 521 522

reservedpages = free_pages_init();

1.5. Retiring the Boot Memory Allocator

25

512 Call free_pages_init() which tells the boot memory allocator to retire itself as well as initialising all pages in high memory for use with the buddy allocator 523 524 525 526 527 528 529 530 531 532 533 534 535 codesize = datasize = initsize = (unsigned long) &_etext - (unsigned long) &_text; (unsigned long) &_edata - (unsigned long) &_etext; (unsigned long) &__init_end - (unsigned long) &__init_begin;

printk(KERN_INFO "Memory: %luk/%luk available (%dk kernel code, %dk reserved, %dk data, %dk init, %ldk highmem)\n", (unsigned long) nr_free_pages() << (PAGE_SHIFT-10), max_mapnr << (PAGE_SHIFT-10), codesize >> 10, reservedpages << (PAGE_SHIFT-10), datasize >> 10, initsize >> 10, (unsigned long) (totalhigh_pages << (PAGE_SHIFT-10)) );

Print out an informational message 523 Calculate the size of the code segment, data segment and memory used by initialisation code and data (all functions marked __init will be in this section) 527-535 Print out a nice message on how the availability of memory and the amount of memory consumed by the kernel 536 537 #if CONFIG_X86_PAE 538 if (!cpu_has_pae) 539 panic("cannot execute a PAE-enabled kernel on a PAE-less CPU!"); 540 #endif 541 if (boot_cpu_data.wp_works_ok < 0) 542 test_wp_bit(); 543 538-539 If PAE is enabled but the processor does not support it, panic 541-542 Test for the availability of the WP bit 550 #ifndef CONFIG_SMP 551 zap_low_mappings(); 552 #endif 553 554 } 551 Cycle through each PGD used by the userspace portion of swapper_pg_dir and map the zero page to it

1.5. Retiring the Boot Memory Allocator

26

Function: free_pages_init (arch/i386/mm/init.c) This function has two important functions, to call free_all_bootmem() to retire the boot memory allocator and to free all high memory pages to the buddy allocator. 481 static int __init free_pages_init(void) 482 { 483 extern int ppro_with_ram_bug(void); 484 int bad_ppro, reservedpages, pfn; 485 486 bad_ppro = ppro_with_ram_bug(); 487 488 /* this will put all low memory onto the freelists */ 489 totalram_pages += free_all_bootmem(); 490 491 reservedpages = 0; 492 for (pfn = 0; pfn < max_low_pfn; pfn++) { 493 /* 494 * Only count reserved RAM pages 495 */ 496 if (page_is_ram(pfn) && PageReserved(mem_map+pfn)) 497 reservedpages++; 498 } 499 #ifdef CONFIG_HIGHMEM 500 for (pfn = highend_pfn-1; pfn >= highstart_pfn; pfn--) 501 one_highpage_init((struct page *) (mem_map + pfn), pfn, bad_ppro); 502 totalram_pages += totalhigh_pages; 503 #endif 504 return reservedpages; 505 } 486 There is a bug in the Pentium Pros that prevent certain pages in high memory being used. The function ppro_with_ram_bug() checks for its existance 489 Call free_all_bootmem() to retire the boot memory allocator 491-498 Cycle through all of memory and count the number of reserved pages that were left over by the boot memory allocator 500-501 For each page in high memory, call one_highpage_init(). This function clears the PG_reserved bit, sets the PG_high bit, sets the count to 1, calls __free_pages() to give the page to the buddy allocator and increments the totalhigh_pages count. Pages which kill buggy Pentium Pro’s are skipped Function: free_all_bootmem (mm/bootmem.c) 299 unsigned long __init free_all_bootmem_node (pg_data_t *pgdat)

1.5. Retiring the Boot Memory Allocator 300 { 301 302 }

27

return(free_all_bootmem_core(pgdat));

321 unsigned long __init free_all_bootmem (void) 322 { 323 return(free_all_bootmem_core(&contig_page_data)); 324 } 299-302 For NUMA, simply call the core function with the speciﬁed pgdat 321-324 For UMA, call the core function with the only node contig_page_data Function: free_all_bootmem_core (mm/bootmem.c) This is the core function which “retires” the boot memory allocator. It is divided into two major tasks • For all unallocated pages known to the allocator for this node; – Clear the PG_reserved ﬂag in its struct page – Set the count to 1 – Call __free_pages() so that the buddy allocator (discussed next chapter) can build its free lists • Free all pages used for the bitmap and free to them to the buddy allocator 245 static unsigned long __init free_all_bootmem_core(pg_data_t *pgdat) 246 { 247 struct page *page = pgdat->node_mem_map; 248 bootmem_data_t *bdata = pgdat->bdata; 249 unsigned long i, count, total = 0; 250 unsigned long idx; 251 252 if (!bdata->node_bootmem_map) BUG(); 253 254 count = 0; 255 idx = bdata->node_low_pfn - (bdata->node_boot_start >>PAGE_SHIFT); 256 for (i = 0; i < idx; i++, page++) { 257 if (!test_bit(i, bdata->node_bootmem_map)) { 258 count++; 259 ClearPageReserved(page); 260 set_page_count(page, 1); 261 __free_page(page); 262 } 263 } 264 total += count;

1.5. Retiring the Boot Memory Allocator

28

252 If no map is available, it means that this node has already been freed and something woeful is wrong with the architecture dependent code so call BUG() 254 A running count of the number of pages given to the buddy allocator 255 idx is the last index that is addressable by this node 256-263 Cycle through all pages addressable by this node 257 If the page is marked free then... 258 Increase the running count of pages given to the buddy allocator 259 Clear the PG_reserved ﬂag 260 Set the count to 1 so that the buddy allocator will think this is the last user of the page and place it in its free lists 261 Call the buddy allocator free function 264 total will come the total number of pages given over by this function 270 271 272 page = virt_to_page(bdata->node_bootmem_map); count = 0; for (i = 0; i < ((bdata->node_low_pfn - (bdata->node_boot_start >> PAGE_SHIFT) )/8 + PAGE_SIZE-1)/PAGE_SIZE; i++,page++) { count++; ClearPageReserved(page); set_page_count(page, 1); __free_page(page); } total += count; bdata->node_bootmem_map = NULL; return total;

273 274 275 276 277 278 279 280 281 282 }

Free the allocator bitmap and return 270 Get the struct page that is at the beginning of the bootmem map 271 Count of pages freed by the bitmap 272-277 For all pages used by the bitmap, free them to the buddy allocator the same way the previous block of code did 279 Set the bootmem map to NULL to prevent it been freed a second time by accident 281 Return the total number of pages freed by this function

Chapter 2 Physical Page Management
alloc_pages(unsigned int gfp_mask, unsigned int order) Allocate 2order number of pages and returns a struct page __get_dma_pages(unsigned int gfp_mask, unsigned int order) Allocate 2order number of pages from the DMA zone and return a struct page __get_free_pages(unsigned int gfp_mask, unsigned int order) Allocate 2order number of pages and return a virtual address alloc_page(unsigned int gfp_mask) Allocate a single page and return a struct address __get_free_page(unsigned int gfp_mask) Allocate a single page and return a virtual address get_free_page(unsigned int gfp_mask) Allocate a single page, zero it and return a virtual address Table 2.1: Physical Pages Allocation API

2.1

Allocating Pages

Function: alloc_pages (include/linux/mm.h) The toplevel alloc_pages() function is declared as

29

2.1. Allocating Pages

30

alloc_pages

_alloc_pages

__alloc_pages

balance_classzone

rmqueue

try_to_free_pages_zone

__free_pages_ok

expand

Figure 2.1: Call Graph: alloc_pages()

428 static inline struct page * alloc_pages(unsigned int gfp_mask, unsigned int order) 429 { 433 if (order >= MAX_ORDER) 434 return NULL; 435 return _alloc_pages(gfp_mask, order); 436 }

428 The gfp_mask (Get Free Pages) ﬂags tells the allocator how it may behave. For example GFP_WAIT is not set, the allocator will not block and instead return NULL if memory is tight. The order is the power of two number of pages to allocate 433-434 A simple debugging check optimized away at compile time 435 This function is described next Function: _alloc_pages (mm/page_alloc.c) The function _alloc_pages() comes in two varieties. The ﬁrst is in mm/page_alloc.c is designed to only work with UMA architectures such as the x86. It only refers to the static node contig_page_data. The second is in mm/numa.c and is a simple extension. It uses a node-local allocation policy which means that memory will be allocated from the bank closest to the processor. For the purposes of this document, only the mm/page_alloc.c version will be examined but for completeness the reader should glance at the functions _alloc_pages() and _alloc_pages_pgdat() in mm/numa.c

2.1. Allocating Pages 244 245 246 247 248 249 250

31

#ifndef CONFIG_DISCONTIGMEM struct page *_alloc_pages(unsigned int gfp_mask, unsigned int order) { return __alloc_pages(gfp_mask, order, contig_page_data.node_zonelists+(gfp_mask & GFP_ZONEMASK)); } #endif

244 The ifndef is for UMA architectures like the x86. NUMA architectures used the _alloc_pages() function in mm/numa.c which employs a node local policy for allocations 245 The gfp_mask ﬂags tell the allocator how it may behave. The order is the power of two number of pages to allocate 247 node_zonelists is an array of preferred fallback zones to allocate from. It is initialised in build_zonelists() The lower 16 bits of gfp_mask indicate what zone is preferable to allocate from. gfp_mask & GFP_ZONEMASK will give the index in node_zonelists we prefer to allocate from. Function: __alloc_pages (mm/page_alloc.c) At this stage, we’ve reached what is described as the "heart of the zoned buddy allocator", the __alloc_pages() function. It is responsible for cycling through the fallback zones and selecting one suitable for the allocation. If memory is tight, it will take some steps to address the problem. It will wake kswapd and if necessary it will do the work of kswapd manually. 327 struct page * __alloc_pages(unsigned int gfp_mask, unsigned int order, zonelist_t *zonelist) 328 { 329 unsigned long min; 330 zone_t **zone, * classzone; 331 struct page * page; 332 int freed; 333 334 zone = zonelist->zones; 335 classzone = *zone; 336 if (classzone == NULL) 337 return NULL; 338 min = 1UL << order; 339 for (;;) { 340 zone_t *z = *(zone++); 341 if (!z) 342 break; 343 344 min += z->pages_low; 345 if (z->free_pages > min) { 346 page = rmqueue(z, order);

2.1. Allocating Pages 347 if (page) 348 return page; 349 } 350 } 351 352 classzone->need_balance = 1; 353 mb(); 354 if (waitqueue_active(&kswapd_wait)) 355 wake_up_interruptible(&kswapd_wait); 356 357 zone = zonelist->zones; 358 min = 1UL << order; 359 for (;;) { 360 unsigned long local_min; 361 zone_t *z = *(zone++); 362 if (!z) 363 break; 364 365 local_min = z->pages_min; 366 if (!(gfp_mask & __GFP_WAIT)) 367 local_min >>= 2; 368 min += local_min; 369 if (z->free_pages > min) { 370 page = rmqueue(z, order); 371 if (page) 372 return page; 373 } 374 } 375 376 /* here we’re in the low on memory slow path */ 377 378 rebalance: 379 if (current->flags & (PF_MEMALLOC | PF_MEMDIE)) { 380 zone = zonelist->zones; 381 for (;;) { 382 zone_t *z = *(zone++); 383 if (!z) 384 break; 385 386 page = rmqueue(z, order); 387 if (page) 388 return page; 389 } 390 return NULL; 391 }

32

2.1. Allocating Pages 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 }

33

/* Atomic allocations - we can’t balance anything */ if (!(gfp_mask & __GFP_WAIT)) return NULL; page = balance_classzone(classzone, gfp_mask, order, &freed); if (page) return page; zone = zonelist->zones; min = 1UL << order; for (;;) { zone_t *z = *(zone++); if (!z) break; min += z->pages_min; if (z->free_pages > min) { page = rmqueue(z, order); if (page) return page; } } /* Don’t let big-order allocations loop */ if (order > 3) return NULL; /* Yield for kswapd, and try again */ yield(); goto rebalance;

334 Set zone to be the preferred zone to allocate from 335 The preferred zone is recorded as the classzone. If one of the pages low watermarks is reached later, the classzone is marked as needing balance 336-337 An unnecessary sanity check. build_zonelists() would need to be seriously broken for this to happen 338-350 This style of block appears a number of times in this function. It reads as "cycle through all zones in this fallback list and see can the allocation be satisﬁed without violating watermarks. Note that the pages_low for each fallback zone is added together. This is deliberate to reduce the probability a fallback zone will be used. 340 z is the zone currently been examined. zone is moved to the next fallback zone

2.1. Allocating Pages 341-342 If this is the last zone in the fallback list, break

34

344 Increment the number of pages to be allocated by the watermark for easy comparisons. This happens for each zone in the fallback zones. While it would appear to be a bug, it is assumed that this behavior is intended to reduce the probability a fallback zone is used. 345-349 Allocate the page block if it can be assigned without reaching the pages_min watermark. rmqueue() is responsible from removing the block of pages from the zone 347-348 If the pages could be allocated, return a pointer to them 352 Mark the preferred zone as needing balance. This ﬂag will be read later by kswapd 353 This is a memory barrier. It ensures that all CPU’s will see any changes made to variables before this line of code. This is important because kswapd could be running on a diﬀerent processor to the memory allocator. 354-355 Wake up kswapdif it is asleep 357-358 Begin again with the ﬁrst preferred zone and min value 360-374 Cycle through all the zones. This time, allocate the pages if they can be allocated without hitting the pages_min watermark 365 local_min how low a number of free pages this zone can have 366-367 If the process can not wait or reschedule (__GFP_WAIT is clear), then allow the zone to be put in further memory pressure than the watermark normally allows 378 This label is returned to after an attempt is made to synchronusly free pages. From this line on, the low on memory path has been reached. It is likely the process will sleep 379-391 These two ﬂags are only set by the OOM killer. As the process is trying to kill itself cleanly, allocate the pages if at all possible as it is known they will be freed very soon 394-395 If the calling process can not sleep, return NULL as the only way to allocate the pages from here involves sleeping 397 This function does the work of kswapd in a synchronous fashion. The principle diﬀerence is that instead of freeing the memory into a global pool, it is kept for the process using the current→local_pages ﬁeld 398-399 If a page block of the right order has been freed, return it. Just because this is NULL does not mean an allocation will fail as it could be a higher order of pages that was released 403-414 This is identical to the block above. Allocate the page blocks if it can be done without hitting the pages_min watermark

2.1. Allocating Pages

35

417-418 Satisiﬁng a large allocation like 24 number of pages is diﬃcult. If it has not been satisﬁed by now, it is better to simply return NULL 421 Yield the processor to give kswapd a chance to work 422 Attempt to balance the zones again and allocate Function: rmqueue (mm/page_alloc.c) This function is called from __alloc_pages(). It is responsible for ﬁnding a block of memory large enough to be used for the allocation. If a block of memory of the requested size is not available, it will look for a larger order that may be split into two buddies. The actual splitting is performed by the expand() function. 198 static FASTCALL(struct page *rmqueue(zone_t *zone, unsigned int order)); 199 static struct page * rmqueue(zone_t *zone, unsigned int order) 200 { 201 free_area_t * area = zone->free_area + order; 202 unsigned int curr_order = order; 203 struct list_head *head, *curr; 204 unsigned long flags; 205 struct page *page; 206 207 spin_lock_irqsave(&zone->lock, flags); 208 do { 209 head = &area->free_list; 210 curr = head->next; 211 212 if (curr != head) { 213 unsigned int index; 214 215 page = list_entry(curr, struct page, list); 216 if (BAD_RANGE(zone,page)) 217 BUG(); 218 list_del(curr); 219 index = page - zone->zone_mem_map; 220 if (curr_order != MAX_ORDER-1) 221 MARK_USED(index, curr_order, area); 222 zone->free_pages -= 1UL << order; 223 224 page = expand(zone, page, index, order, curr_order, area); 225 spin_unlock_irqrestore(&zone->lock, flags); 226 227 set_page_count(page, 1); 228 if (BAD_RANGE(zone,page)) 229 BUG(); 230 if (PageLRU(page))

2.1. Allocating Pages 231 232 233 234 235 236 237 238 239 240 241 242 } BUG(); if (PageActive(page)) BUG(); return page; } curr_order++; area++; } while (curr_order < MAX_ORDER); spin_unlock_irqrestore(&zone->lock, flags); return NULL;

36

199 The parameters are the zone to allocate from and what order of pages are required 201 Because the free_area is an array of linked lists, the order may be used an an index within the array 207 Acquire the zone lock 208-238 This while block is responsible for ﬁnding what order of pages we will need to allocate from. If there isn’t a free block at the order we are interested in, check the higher blocks until a suitable one is found 209 head is the list of free page blocks for this order 210 curr is the ﬁrst block of pages 212-235 If there is a free page block at this order, then allocate it 215 page is set to be a pointer to the ﬁrst page in the free block 216-217 Sanity check that checks to make sure the page this page belongs to this zone and is within the zone_mem_map. It is unclear how this could possibly happen without severe bugs in the allocator itself that would place blocks in the wrong zones 218 As the block is going to be allocated, remove it from the free list 219 index treats the zone_mem_map as an array of pages so that index will be the oﬀset within the array 220-221 Toggle the bit that represents this pair of buddies. MARK_USED() is a macro which calculates which bit to toggle 222 Update the statistics for this zone. 1UL < < order is the number of pages been allocated 224 expand() is the function responsible for splitting page blocks of higher orders 225 No other updates to the zone need to take place so release the lock

2.1. Allocating Pages 227 Show that the page is in use 228-233 Sanity checks 234 Page block has been successfully allocated so return it

37

236-237 If a page block was not free of the correct order, move to a higher order of page blocks and see what can be found there 239 No other updates to the zone need to take place so release the lock 241 No page blocks of the requested or higher order are available so return failure Function: expand (mm/page_alloc.c) This function splits page blocks of higher orders until a page block of the needed order is available. 177 static inline struct page * expand (zone_t *zone, struct page *page, unsigned long index, int low, int high, free_area_t * area) 179 { 180 unsigned long size = 1 << high; 181 182 while (high > low) { 183 if (BAD_RANGE(zone,page)) 184 BUG(); 185 area--; 186 high--; 187 size >>= 1; 188 list_add(&(page)->list, &(area)->free_list); 189 MARK_USED(index, high, area); 190 index += size; 191 page += size; 192 } 193 if (BAD_RANGE(zone,page)) 194 BUG(); 195 return page; 196 } 177 The parameters are zone is where the allocation is coming from page is the ﬁrst page of the block been split index is the index of page within mem_map

2.2. Free Pages low is the order of pages needed for the allocation high is the order of pages that is been split for the allocation area is the free_area_t representing the high order block of pages 180 size is the number of pages in the block that is to be split 182-192 Keep splitting until a block of the needed page order is found

38

183-184 Sanity check that checks to make sure the page this page belongs to this zone and is within the zone_mem_map 185 area is now the next free_area_t representing the lower order of page blocks 186 high is the next order of page blocks to be split 187 The size of the block been split is now half as big 188 Of the pair of buddies, the one lower in the mem_map is added to the free list for the lower order 189 Toggle the bit representing the pair of buddies 190 index now the index of the second buddy of the newly created pair 191 page now points to the second buddy of the newly created paid 193-194 Sanity check 195 The blocks have been successfully split so return the page

2.2

Free Pages
__free_pages(struct page *page, unsigned int order) Free an order number of pages from the given page __free_page(struct page *page) Free a single page free_page(void *addr) Free a page from the given virtual address Table 2.2: Physical Pages Free API

2.2. Free Pages

39

__free_pages

__free_pages_ok

lru_cache_del

__lru_cache_del

Figure 2.2: Call Graph: __free_pages() Function: __free_pages (mm/page_alloc.c) Confusingly, the opposite to alloc_pages() is not free_pages(), it is __free_pages(). free_pages() is a helper function which takes an address as a parameter, it will be discussed in a later section. 451 void __free_pages(struct page *page, unsigned int order) 452 { 453 if (!PageReserved(page) && put_page_testzero(page)) 454 __free_pages_ok(page, order); 455 } 451 The parameters are the page we wish to free and what order block it is 453 Sanity checked. PageReserved() indicates that the page is reserved by the boot memory allocator. put_page_testzero() decrements the usage count and makes sure it is zero 454 Call the function that does all the hard work Function: __free_pages_ok (mm/page_alloc.c) This function will do the actual freeing of the page and coalesce the buddies if possible. 81 static void FASTCALL(__free_pages_ok (struct page *page, unsigned int order)); 82 static void __free_pages_ok (struct page *page, unsigned int order) 83 { 84 unsigned long index, page_idx, mask, flags; 85 free_area_t *area; 86 struct page *base; 87 zone_t *zone; 88

2.2. Free Pages 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 if (PageLRU(page)) { if (unlikely(in_interrupt())) BUG(); lru_cache_del(page); } if (page->buffers) BUG(); if (page->mapping) BUG(); if (!VALID_PAGE(page)) BUG(); if (PageLocked(page)) BUG(); if (PageActive(page)) BUG(); page->flags &= ~((1<<PG_referenced) | (1<<PG_dirty)); if (current->flags & PF_FREE_PAGES) goto local_freelist; back_local_freelist: zone = page_zone(page); mask = (~0UL) << order; base = zone->zone_mem_map; page_idx = page - base; if (page_idx & ~mask) BUG(); index = page_idx >> (1 + order); area = zone->free_area + order; spin_lock_irqsave(&zone->lock, flags); zone->free_pages -= mask; while (mask + (1 << (MAX_ORDER-1))) { struct page *buddy1, *buddy2; if (area >= zone->free_area + MAX_ORDER) BUG(); if (!__test_and_change_bit(index, area->map)) /* * the buddy page is still allocated.

40

2.2. Free Pages 138 */ 139 break; 140 /* 141 * Move the buddy up one level. 142 * This code is taking advantage of the identity: 143 * -mask = 1+~mask 144 */ 145 buddy1 = base + (page_idx ^ -mask); 146 buddy2 = base + page_idx; 147 if (BAD_RANGE(zone,buddy1)) 148 BUG(); 149 if (BAD_RANGE(zone,buddy2)) 150 BUG(); 151 152 list_del(&buddy1->list); 153 mask <<= 1; 154 area++; 155 index >>= 1; 156 page_idx &= mask; 157 } 158 list_add(&(base + page_idx)->list, &area->free_list); 159 160 spin_unlock_irqrestore(&zone->lock, flags); 161 return; 162 163 local_freelist: 164 if (current->nr_local_pages) 165 goto back_local_freelist; 166 if (in_interrupt()) 167 goto back_local_freelist; 168 169 list_add(&page->list, &current->local_pages); 170 page->index = order; 171 current->nr_local_pages++; 172 }

41

82 The parameters are the beginning of the page block to free and what order number of pages are to be freed. 32 A dirty page on the LRU will still have the LRU bit set when pinned for IO. It is just freed directly when the IO is complete so it just has to be removed from the LRU list 99-108 Sanity checks 109 The ﬂags showing a page has being referenced and is dirty have to be cleared because the page is now free and not in use

2.2. Free Pages

42

111-112 If this ﬂag is set, the pages freed are to be kept for the process doing the freeing. This is set during page allocation if the caller is freeing the pages itself rather than waiting for kswapd to do the work 115 The zone the page belongs to is encoded within the page ﬂags. The page_zone() macro returns the zone 117 The calculation of mask is discussed in companion document. It is basically related to the address calculation of the buddy 118 base is the beginning of this zone_mem_map. For the buddy calculation to work, it was to be relative to an address 0 so that the addresses will be a power of two 119 page_idx treats the zone_mem_map as an array of pages. This is the index page within the map 120-121 If the index is not the proper power of two, things are severely broken and calculation of the buddy will not work 122 This index is the bit index within free_area→map 124 area is the area storing the free lists and map for the order block the pages are been freed from. 126 The zone is about to be altered so take out the lock 128 Another side eﬀect of the calculation of mask is that -mask is the number of pages that are to be freed 130-157 The allocator will keep trying to coalesce blocks together until it either cannot merge or reaches the highest order that can be merged. mask will be adjusted for each order block that is merged. When the highest order that can be merged is reached, this while loop will evaluate to 0 and exit. 133-134 If by some miracle, mask is corrupt, this check will make sure the free_area array will not not be read beyond the end 135 Toggle the bit representing this pair of buddies. If the bit was previously zero, both buddies were in use. As this buddy is been freed, one is still in use and cannot be merged 145-146 The calculation of the two addresses is discussed in the companion document 147-150 Sanity check to make sure the pages are within the correct zone_mem_map and actually belong to this zone 152 The buddy has been freed so remove it from any list it was part of 153-156 Prepare to examine the higher order buddy for merging 153 Move the mask one bit to the left for order 2k+1

2.3. Page Allocate Helper Functions 154 area is a pointer within an array so area++ moves to the next index 155 The index in the bitmap of the higher order 156 The page index within the zone_mem_map for the buddy to merge

43

158 As much merging as possible as completed and a new page block is free so add it to the free_list for this order 160-161 Changes to the zone is complete so free the lock and return 163 This is the code path taken when the pages are not freed to the main pool but instaed are reserved for the process doing the freeing. 164-165 If the process already has reserved pages, it is not allowed to reserve any more so return back 166-167 An interrupt does not have process context so it has to free in the normal fashion. It is unclear how an interrupt could end up here at all. This check is likely to be bogus and impossible to be true 169 Add the page block to the list for the processes local_pages 170 Record what order allocation it was for freeing later 171 Increase the use count for nr_local_pages

2.3

Page Allocate Helper Functions

This section will cover miscellaneous helper functions and macros the Buddy Allocator uses to allocate pages. Very few of them do "real" work and are available just for the convenience of the programmer. Function: alloc_page (include/linux/mm.h) This trivial macro just calls alloc_pages() with an order of 0 to return 1 page. It is declared as follows 438 #define alloc_page(gfp_mask) alloc_pages(gfp_mask, 0) Function: __get_free_page (include/linux/mm.h) This trivial function calls __get_free_pages() with an order of 0 to return 1 page. It is declared as follows 443 #define __get_free_page(gfp_mask) \ 444 __get_free_pages((gfp_mask),0)

2.3. Page Allocate Helper Functions

44

Function: __get_free_pages (mm/page_alloc.c) This function is for callers who do not want to worry about pages and only get back an address it can use. It is declared as follows 428 unsigned long __get_free_pages(unsigned int gfp_mask, unsigned int order) 428 { 430 struct page * page; 431 432 page = alloc_pages(gfp_mask, order); 433 if (!page) 434 return 0; 435 return (unsigned long) page_address(page); 436 } 428 gfp_mask are the ﬂags which aﬀect allocator behaviour. Order is the power of 2 number of pages required. 431 alloc_pages() does the work of allocating the page block. See Section 2.1 433-434 Make sure the page is valid 435 page_address() returns the physical address of the page Function: __get_dma_pages (include/linux/mm.h) This is of principle interest to device drivers. It will return memory from ZONE_DMA suitable for use with DMA devices. It is declared as follows 446 #define __get_dma_pages(gfp_mask, order) \ 447 __get_free_pages((gfp_mask) | GFP_DMA,(order)) 447 The gfp_mask is or-ed with GFP_DMA to tell the allocator to allocate from ZONE_DMA

Function: get_zeroed_page (mm/page_alloc.c) This function will allocate one page and then zero out the contents of it. It is declared as follows 438 unsigned long get_zeroed_page(unsigned int gfp_mask) 439 { 440 struct page * page; 441 442 page = alloc_pages(gfp_mask, 0); 443 if (page) { 444 void *address = page_address(page); 445 clear_page(address); 446 return (unsigned long) address; 447 } 448 return 0; 449 }

2.4. Page Free Helper Functions 438 gfp_mask are the ﬂags which aﬀect allocator behaviour. 442 alloc_pages() does the work of allocating the page block. See Section 2.1 444 page_address() returns the physical address of the page 445 clear_page() will ﬁll the contents of a page with zero 446 Return the address of the zeroed page

45

2.4

Page Free Helper Functions

This section will cover miscellaneous helper functions and macros the Buddy Allocator uses to free pages. Very few of them do "real" work and are available just for the convenience of the programmer. There is only one core function for the freeing of pages and it is discussed in Section 2.2. The only functions then for freeing are ones that supply an address and for freeing a single page. Function: free_pages (mm/page_alloc.c) This function takes an address instead of a page as a parameter to free. It is declared as follows 457 void free_pages(unsigned long addr, unsigned int order) 458 { 459 if (addr != 0) 460 __free_pages(virt_to_page(addr), order); 461 } 460 The function is discussed in Section 2.2. The macro virt_to_page() returns the struct page for the addr Function: __free_page (include/linux/mm.h) This trivial macro just calls the function __free_pages() (See Section 2.2 with an order 0 for 1 page. It is declared as follows 460 #define __free_page(page) __free_pages((page), 0)

Chapter 3 Non-Contiguous Memory Allocation
3.1 Allocating A Non-Contiguous Area
vmalloc(unsigned long size) Allocate a number of pages in vmalloc space that satisfy the requested size vmalloc_dma(unsigned long size) Allocate a number of pages from ZONE_DMA vmalloc_32(unsigned long size) Allocate memory that is suitable for 32 bit addressing. This ensures it is in ZONE_NORMAL at least which some PCI devices require Table 3.1: Non-Contiguous Memory Allocation API

Function: vmalloc (include/linux/vmalloc.h) They only diﬀerence between these macros is the GFP_ ﬂags (See the companion document for an explanation of GFP ﬂags). The size parameter is page aligned by __vmalloc() 33 34 35 36 37 41 42 43 44 45 46 static inline void * vmalloc (unsigned long size) { return __vmalloc(size, GFP_KERNEL | __GFP_HIGHMEM, PAGE_KERNEL); }

static inline void * vmalloc_dma (unsigned long size) { return __vmalloc(size, GFP_KERNEL|GFP_DMA, PAGE_KERNEL); }

46

3.1. Allocating A Non-Contiguous Area

47

vmalloc

__vmalloc

get_vm_area

vmalloc_area_pages

pmd_alloc

alloc_area_pmd

pte_alloc

alloc_area_pte

Figure 3.1: Call Graph: vmalloc() 50 51 static inline void * vmalloc_32(unsigned long size) 52 { 53 return __vmalloc(size, GFP_KERNEL, PAGE_KERNEL); 54 } 33 The ﬂags indicate that to use either ZONE_NORMAL or ZONE_HIGHMEM as necessary 42 The ﬂag indicates to only allocate from ZONE_DMA 51 Only physical pages from ZONE_NORMAL will be allocated Function: __vmalloc (mm/vmalloc.c) This function has three tasks. It page aligns the size request, asks get_vm_area() to ﬁnd an area for the request and uses vmalloc_area_pages() to allocate the PTEs for the pages. 231 void * __vmalloc (unsigned long size, int gfp_mask, pgprot_t prot) 232 { 233 void * addr; 234 struct vm_struct *area; 235 236 size = PAGE_ALIGN(size); 237 if (!size || (size >> PAGE_SHIFT) > num_physpages) { 238 BUG(); 239 return NULL; 240 } 241 area = get_vm_area(size, VM_ALLOC);

3.1. Allocating A Non-Contiguous Area 242 243 245 246 247 248 249 250 251 } if (!area) return NULL; addr = area->addr; if (vmalloc_area_pages(VMALLOC_VMADDR(addr), size, gfp_mask, prot)) { vfree(addr); return NULL; } return addr;

48

231 The parameters are the size to allocate, the GFP_ ﬂags to use for allocation and what protection to give the PTE 236 Align the size to a page size 237 Sanity check. Make sure the size is not 0 and that the size requested is not larger than the number of physical pages has been requested 241 Find an area of virtual address space to store the allocation (See Section 3.1) 245 The addr ﬁeld has been ﬁlled by get_vm_area() 246 Allocate the PTE entries needed for the allocation with vmalloc_area_pages(). If it fails, a non-zero value -ENOMEM is returned 247-248 If the allocation fails, free any PTEs, pages and descriptions of the area 250 Return the address of the allocated area Function: get_vm_area (mm/vmalloc.c) To allocate an area for the vm_struct, the slab allocator is asked to provide the necessary memory via kmalloc(). It then searches the vm_struct list lineraly looking for a region large enough to satisfy a request, including a page pad at the end of the area. 171 struct vm_struct * get_vm_area(unsigned long size, unsigned long flags) 172 { 173 unsigned long addr; 174 struct vm_struct **p, *tmp, *area; 175 176 area = (struct vm_struct *) kmalloc(sizeof(*area), GFP_KERNEL); 177 if (!area) 178 return NULL; 179 size += PAGE_SIZE; 180 if(!size) 181 return NULL; 182 addr = VMALLOC_START; 183 write_lock(&vmlist_lock);

3.1. Allocating A Non-Contiguous Area 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 out: 202 203 204 205 } for (p = &vmlist; (tmp = *p) ; p = &tmp->next) { if ((size + addr) < addr) goto out; if (size + addr <= (unsigned long) tmp->addr) break; addr = tmp->size + (unsigned long) tmp->addr; if (addr > VMALLOC_END-size) goto out; } area->flags = flags; area->addr = (void *)addr; area->size = size; area->next = *p; *p = area; write_unlock(&vmlist_lock); return area;

49

write_unlock(&vmlist_lock); kfree(area); return NULL;

171 The parameters is the size of the requested region which should be a multiple of the page size and the area ﬂags, either VM_ALLOC or VM_IOREMAP 176-178 Allocate space for the vm_struct description struct 179 Pad the request so there is a page gap between areas. This is to help against overwrites 180-181 This is to ensure the size is not 0 after the padding 182 Start the search at the beginning of the vmalloc address space 183 Lock the list 184-192 Walk through the list searching for an area large enough for the request 185-186 Check to make sure the end of the addressable range has not been reached 187-188 If the requested area would ﬁt between the current address and the next area, the search is complete 189 Make sure the address would not go over the end of the vmalloc address space 193-195 Copy in the area information 196-197 Link the new area into the list 198-199 Unlock the list and return

3.1. Allocating A Non-Contiguous Area 201 This label is reached if the request could not be satisﬁed 202 Unlock the list 203-204 Free the memory used for the area descriptor and return

50

Function: vmalloc_area_pages (mm/vmalloc.c) This is the beginning of a standard page table walk function. This top level function will step through all PGDs within an address range. For each PGD, it will call pmd_alloc() to allocate a PMD directory and call alloc_area_pmd() for the directory. 140 inline int vmalloc_area_pages (unsigned long address, unsigned long size, 141 int gfp_mask, pgprot_t prot) 142 { 143 pgd_t * dir; 144 unsigned long end = address + size; 145 int ret; 146 147 dir = pgd_offset_k(address); 148 spin_lock(&init_mm.page_table_lock); 149 do { 150 pmd_t *pmd; 151 152 pmd = pmd_alloc(&init_mm, dir, address); 153 ret = -ENOMEM; 154 if (!pmd) 155 break; 156 157 ret = -ENOMEM; 158 if (alloc_area_pmd(pmd, address, end - address, gfp_mask, prot)) 159 break; 160 161 address = (address + PGDIR_SIZE) & PGDIR_MASK; 162 dir++; 163 164 ret = 0; 165 } while (address && (address < end)); 166 spin_unlock(&init_mm.page_table_lock); 167 flush_cache_all(); 168 return ret; 169 } 140 address is the starting address to allocate PMDs for. size is the size of the region, gfp_mask is the GFP_ ﬂags for alloc_pages() and prot is the protection to give the PTE entry 144 The end address is the starting address plus the size

3.1. Allocating A Non-Contiguous Area 147 Get the PGD entry for the starting address 148 Lock the kernel page table

51

149-165 For every PGD within this address range, allocate a PMD directory and call alloc_area_pmd() 152 Allocate a PMD directory 158 Call alloc_area_pmd() which will allocate a PTE for each PTE slot in the PMD 161 address becomes the base address of the next PGD entry 162 Move dir to the next PGD entry 166 Release the lock to the kernel page table 167 flush_cache_all() will ﬂush all CPU caches. This is necessary because the kernel page tables have changed 168 Return success Function: alloc_area_pmd (mm/vmalloc.c) This is the second stage of the standard page table walk to allocate PTE entries for an address range. For every PMD within a given address range on a PGD, pte_alloc() will creates a PTE directory and then alloc_area_pte() will be called to allocate the physical pages 120 static inline int alloc_area_pmd(pmd_t * pmd, unsigned long address, unsigned long size, int gfp_mask, pgprot_t prot) 121 { 122 unsigned long end; 123 124 address &= ~PGDIR_MASK; 125 end = address + size; 126 if (end > PGDIR_SIZE) 127 end = PGDIR_SIZE; 128 do { 129 pte_t * pte = pte_alloc(&init_mm, pmd, address); 130 if (!pte) 131 return -ENOMEM; 132 if (alloc_area_pte(pte, address, end - address, gfp_mask, prot)) 133 return -ENOMEM; 134 address = (address + PMD_SIZE) & PMD_MASK; 135 pmd++; 136 } while (address < end); 137 return 0; 138 }

3.1. Allocating A Non-Contiguous Area

52

120 address is the starting address to allocate PMDs for. size is the size of the region, gfp_mask is the GFP_ ﬂags for alloc_pages() and prot is the protection to give the PTE entry 124 Align the starting address to the PGD 125-127 Calculate end to be the end of the allocation or the end of the PGD, whichever occurs ﬁrst 128-136 For every PMD within the given address range, allocate a PTE directory and call alloc_area_pte() 129 Allocate the PTE directory 132 Call alloc_area_pte() which will allocate the physical pages 134 address becomes the base address of the next PMD entry 135 Move pmd to the next PMD entry 137 Return success Function: alloc_area_pte (mm/vmalloc.c) This is the last stage of the page table walk. For every PTE in the given PTE directory and address range, a page will be allocated and associated with the PTE. 95 static inline int alloc_area_pte (pte_t * pte, unsigned long address, 96 unsigned long size, int gfp_mask, pgprot_t prot) 97 { 98 unsigned long end; 99 100 address &= ~PMD_MASK; 101 end = address + size; 102 if (end > PMD_SIZE) 103 end = PMD_SIZE; 104 do { 105 struct page * page; 106 spin_unlock(&init_mm.page_table_lock); 107 page = alloc_page(gfp_mask); 108 spin_lock(&init_mm.page_table_lock); 109 if (!pte_none(*pte)) 110 printk(KERN_ERR "alloc_area_pte: page already exists\n"); 111 if (!page) 112 return -ENOMEM; 113 set_pte(pte, mk_pte(page, prot)); 114 address += PAGE_SIZE; 115 pte++; 116 } while (address < end);

3.2. Freeing A Non-Contiguous Area 117 118 } return 0;

53

100 Align the address to a PMD directory 101-103 The end address is the end of the request or the end of the directory, whichever occurs ﬁrst 104-116 For every PTE in the range, allocate a physical page and set it to the PTE 106 Unlock the kernel page table before calling alloc_page(). alloc_page() may sleep and a spinlock must not be held 108 Re-acquire the page table lock 109-110 If the page already exists it means that areas must be overlapping somehow 112-113 Return failure if physical pages are not available 113 Assign the struct page to the PTE 114 address becomes the address of the next PTE 115 Move to the next PTE 117 Return success

3.2

Freeing A Non-Contiguous Area
vfree(void *addr) Free a region of memory allocated with vmalloc, vmalloc_dma or vmalloc_32 Table 3.2: Non-Contiguous Memory Free API

Function: vfree (mm/vmalloc.c) This is the top level function responsible for freeing a non-contiguous area of memory. It performs basic sanity checks before ﬁnding the vm_struct for the requested addr. Once found, it calls vmfree_area_pages() 207 void vfree(void * addr) 208 { 209 struct vm_struct **p, *tmp; 210 211 if (!addr) 212 return;

3.2. Freeing A Non-Contiguous Area

54

vfree

vmfree_area_pages

flush_tlb_all

free_area_pmd

free_area_pte

__free_pages

Figure 3.2: Call Graph: vfree() 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 } 207 The parameter is the address returned by get_vm_area() returns for ioremaps and vmalloc returns for allocations 211-213 Ignore NULL addresses if ((PAGE_SIZE-1) & (unsigned long) addr) { printk(KERN_ERR "Trying to vfree() bad address (%p)\n", addr); return; } write_lock(&vmlist_lock); for (p = &vmlist ; (tmp = *p) ; p = &tmp->next) { if (tmp->addr == addr) { *p = tmp->next; vmfree_area_pages(VMALLOC_VMADDR(tmp->addr), tmp->size); write_unlock(&vmlist_lock); kfree(tmp); return; } } write_unlock(&vmlist_lock); printk(KERN_ERR "Trying to vfree() nonexistent vm area (%p)\n", addr);

3.2. Freeing A Non-Contiguous Area

55

213-216 This checks the address is page aligned and is a reasonable quick guess to see if the area is valid or not 217 Acquire a write lock to the vmlist 218 Cycle through the vmlist looking for the correct vm_struct for addr 219 If this it the correct address then ... 220 Remove this area from the vmlist linked list 221 Free all pages associated with the address range 222 Release the vmlist lock 223 Free the memory used for the vm_struct and return 227-228 The vm_struct() was not found. Release the lock and print a message about the failed free Function: vmfree_area_pages (mm/vmalloc.c) This is the ﬁrst stage of the page table walk to free all pages and PTEs associated with an address range. It is responsible for stepping through the relevant PGDs and for ﬂushing the TLB.

80 void vmfree_area_pages(unsigned long address, unsigned long size) 81 { 82 pgd_t * dir; 83 unsigned long end = address + size; 84 85 dir = pgd_offset_k(address); 86 flush_cache_all(); 87 do { 88 free_area_pmd(dir, address, end - address); 89 address = (address + PGDIR_SIZE) & PGDIR_MASK; 90 dir++; 91 } while (address && (address < end)); 92 flush_tlb_all(); 93 } 80 The parameters are the starting address and the size of the region 82 The address space end is the starting address plus its size 85 Get the ﬁrst PGD for the address range 86 Flush the cache CPU so cache hits will not occur on pages that are to be deleted. This is a null operation on many architectures including the x86

3.2. Freeing A Non-Contiguous Area 87 Call free_area_pmd() to perform the second stage of the page table walk 89 address becomes the starting address of the next PGD 90 Move to the next PGD 92 Flush the TLB as the page tables have now changed

56

Function: free_area_pmd (mm/vmalloc.c) This is the second stage of the page table walk. For every PMD in this directory, call free_area_pte to free up the pages and PTEs. 56 static inline void free_area_pmd(pgd_t * dir, unsigned long address, unsigned long size) 57 { 58 pmd_t * pmd; 59 unsigned long end; 60 61 if (pgd_none(*dir)) 62 return; 63 if (pgd_bad(*dir)) { 64 pgd_ERROR(*dir); 65 pgd_clear(dir); 66 return; 67 } 68 pmd = pmd_offset(dir, address); 69 address &= ~PGDIR_MASK; 70 end = address + size; 71 if (end > PGDIR_SIZE) 72 end = PGDIR_SIZE; 73 do { 74 free_area_pte(pmd, address, end - address); 75 address = (address + PMD_SIZE) & PMD_MASK; 76 pmd++; 77 } while (address < end); 78 } 56 The parameters are the PGD been stepped through, the starting address and the length of the region 61-62 If there is no PGD, return. This can occur after vfree is called during a failed allocation 63-67 A PGD can be bad if the entry is not present, it is marked read-only or it is marked accessed or dirty 68 Get the ﬁrst PMD for the address range 69 Make the address PGD aligned

3.2. Freeing A Non-Contiguous Area

57

70-72 end is either the end of the space to free or the end of this PGD, whichever is ﬁrst 73-77 For every PMD, call free_area_pte() to free the PTE entries 75 address is the base address of the next PMD 76 Move to the next PMD Function: free_area_pte (mm/vmalloc.c) This is the ﬁnal stage of the page table walk. For every PTE in the given PMD within the address range, it will free the PTE and the associated page 22 static inline void free_area_pte(pmd_t * pmd, unsigned long address, unsigned long size) 23 { 24 pte_t * pte; 25 unsigned long end; 26 27 if (pmd_none(*pmd)) 28 return; 29 if (pmd_bad(*pmd)) { 30 pmd_ERROR(*pmd); 31 pmd_clear(pmd); 32 return; 33 } 34 pte = pte_offset(pmd, address); 35 address &= ~PMD_MASK; 36 end = address + size; 37 if (end > PMD_SIZE) 38 end = PMD_SIZE; 39 do { 40 pte_t page; 41 page = ptep_get_and_clear(pte); 42 address += PAGE_SIZE; 43 pte++; 44 if (pte_none(page)) 45 continue; 46 if (pte_present(page)) { 47 struct page *ptpage = pte_page(page); 48 if (VALID_PAGE(ptpage) && (!PageReserved(ptpage))) 49 __free_page(ptpage); 50 continue; 51 } 52 printk(KERN_CRIT "Whee.. Swapped out page in kernel page table\n"); 53 } while (address < end); 54 }

3.2. Freeing A Non-Contiguous Area

58

22 The parameters are the PMD that PTEs are been freed from, the starting address and the size of the region to free 27-28 The PMD could be absent if this region is from a failed vmalloc() 29-33 A PMD can be bad if it’s not in main memory, it’s read only or it’s marked dirty or accessed 34 pte is the ﬁrst PTE in the address range 35 Align the address to the PMD 36-38 The end is either the end of the requested region or the end of the PMD, whichever occurs ﬁrst 38-53 Step through all PTEs, perform checks and free the PTE with its associated page 41 ptep_get_and_clear() will remove a PTE from a page table and return it to the caller 42 address will be the base address of the next PTE 43 Move to the next PTE 44 If there was no PTE, simply continue 46-51 If the page is present, perform basic checks and then free it 47 pte_page() uses the global mem_map to ﬁnd the struct page for the PTE 48-49 Make sure the page is a valid page and it is not reserved before calling __free_page() to free the physical page 50 Continue to the next PTE 52 If this line is reached, a PTE within the kernel address space was somehow swapped out. Kernel memory is not swappable and so is a critical error

Chapter 4 Slab Allocator
4.0.1 Cache Creation

This section covers the creation of a cache. The tasks that are taken to create a cache are • Perform basic sanity checks for bad usage • Perform debugging checks if CONFIG_SLAB_DEBUG is set • Allocate a kmem_cache_t from the cache_cache slab cache • Align the object size to the word size • Calculate how many objects will ﬁt on a slab • Align the slab size to the hardware cache • Calculate colour oﬀsets • Initialise remaining ﬁelds in cache descriptor • Add the new cache to the cache chain See Figure 4.1 to see the call graph relevant to the creation of a cache. The depth of it is shallow as the depths will be discussed in other sections. Function: kmem_cache_create (mm/slab.c) Because of the size of this function, it will be dealt with in chunks. Each chunk is one of the items described in the previous section 621 kmem_cache_t * 622 kmem_cache_create (const char *name, size_t size, 623 size_t offset, unsigned long flags, void (*ctor)(void*, kmem_cache_t *, unsigned long), 624 void (*dtor)(void*, kmem_cache_t *, unsigned long)) 625 { 626 const char *func_nm = KERN_ERR "kmem_create: "; 627 size_t left_over, align, slab_size; 59

4.0.1. Cache Creation kmem_cache_create(const char *name, size_t size, size_t offset, unsigned long flags, void (*ctor)(void*, kmem_cache_t *, unsigned long), void (*dtor)(void*, kmem_cache_t *, unsigned long)) Creates a new cache and adds it to the cache chain kmem_cache_reap(int gfp_mask) Scans at most REAP_SCANLEN caches and selects one for reaping all per-cpu objects and free slabs from. Called when memory is tight kmem_cache_shrink(kmem_cache_t *cachep) This function will delete all per-cpu objects associated with a cache and delete all slabs in the slabs_free list. It returns the number of pages freed. kmem_cache_alloc(kmem_cache_t *cachep, int flags) Allocate a single object from the cache and return it to the caller kmem_cache_free(kmem_cache_t *cachep, void *objp) Free an object and return it to the cache kmalloc(size_t size, int flags) Allocate a block of memory from one of the sizes cache kfree(const void *objp) Free a block of memory allocated with kmalloc kmem_cache_destroy(kmem_cache_t * cachep) Destroys all objects in all slabs and frees up all associated memory before removing the cache from the chain Table 4.1: Slab Allocator API for caches

60

kmem_cache_create

kmem_cache_alloc

kmem_cache_estimate

kmem_find_general_cachep

enable_cpucache

__kmem_cache_alloc

kmem_tune_cpucache

Figure 4.1: Call Graph: kmem_cache_create()

4.0.1. Cache Creation 628 629 633 634 635 636 637 638 639 640 641 kmem_cache_t *cachep = NULL; if ((!name) || ((strlen(name) >= CACHE_NAMELEN - 1)) || in_interrupt() || (size < BYTES_PER_WORD) || (size > (1<<MAX_OBJ_ORDER)*PAGE_SIZE) || (dtor && !ctor) || (offset < 0 || offset > size)) BUG();

61

Perform basic sanity checks for bad usage 622 The parameters of the function are name The human readable name of the cache size The size of an object offset This is used to specify a speciﬁc alignment for objects in the cache but it usually left as 0 flags Static cache ﬂags ctor A constructor function to call for each object during slab creation dtor The corresponding destructor function. It is expected the destructor function leaves an object in an initialised state 633-640 These are all serious usage bugs that prevent the cache even attempting to create 634 If the human readable name is greater than the maximum size for a cache name (CACHE_NAMELEN) 635 An interrupt handler cannot create a cache as access to spinlocks and semaphores is needed 636 The object size must be at least a word in size. Slab is not suitable for objects that are measured in bits 637 The largest possible slab that can be created is 2M AX_OBJ_ORDER number of pages which provides 32 pages. 638 A destructor cannot be used if no constructor is available 639 The oﬀset cannot be before the slab or beyond the boundary of the ﬁrst page 640 Call BUG() to exit

4.0.1. Cache Creation

62

642 #if DEBUG 643 if ((flags & SLAB_DEBUG_INITIAL) && !ctor) { 645 printk("%sNo con, but init state check requested - %s\n", func_nm, name); 646 flags &= ~SLAB_DEBUG_INITIAL; 647 } 648 649 if ((flags & SLAB_POISON) && ctor) { 651 printk("%sPoisoning requested, but con given - %s\n", func_nm, name); 652 flags &= ~SLAB_POISON; 653 } 654 #if FORCED_DEBUG 655 if ((size < (PAGE_SIZE>>3)) && !(flags & SLAB_MUST_HWCACHE_ALIGN)) 660 flags |= SLAB_RED_ZONE; 661 if (!ctor) 662 flags |= SLAB_POISON; 663 #endif 664 #endif 670 BUG_ON(flags & ~CREATE_MASK); This block performs debugging checks if CONFIG_SLAB_DEBUG is set 643-646 The ﬂag SLAB_DEBUG_INITIAL requests that the constructor check the objects to make sure they are in an initialised state. For this, a constructor must obviously exist. If it doesn’t, the ﬂag is cleared 649-653 A slab can be poisoned with a known pattern to make sure an object wasn’t used before it was allocated but a constructor would ruin this pattern falsely reporting a bug. If a constructor exists, remove the SLAB_POISON ﬂag if set 655-660 Only small objects will be red zoned for debugging. Red zoning large objects would cause severe fragmentation 661-662 If there is no constructor, set the poison bit 670 The CREATE_MASK is set with all the allowable ﬂags kmem_cache_create() can be called with. This prevents callers using debugging ﬂags when they are not available and BUG()s it instead 673 cachep = (kmem_cache_t *) kmem_cache_alloc(&cache_cache, SLAB_KERNEL); if (!cachep) goto opps; memset(cachep, 0, sizeof(kmem_cache_t)); Allocate a kmem_cache_t from the cache_cache slab cache.

674 675 676

4.0.1. Cache Creation 673 Allocate a cache descriptor object from the cache_cache(see Section 4.2.2) 674-675 If out of memory goto opps which handles the oom situation 676 Zero ﬁll the object to prevent surprises with uninitialised data 682 683 684 685 if (size & (BYTES_PER_WORD-1)) { size += (BYTES_PER_WORD-1); size &= ~(BYTES_PER_WORD-1); printk("%sForcing size word alignment - %s\n", func_nm, name); }

63

686 687 688 #if DEBUG 689 if (flags & SLAB_RED_ZONE) { 694 flags &= ~SLAB_HWCACHE_ALIGN; 695 size += 2*BYTES_PER_WORD; 696 } 697 #endif 698 align = BYTES_PER_WORD; 699 if (flags & SLAB_HWCACHE_ALIGN) 700 align = L1_CACHE_BYTES; 701 703 if (size >= (PAGE_SIZE>>3)) 708 flags |= CFLGS_OFF_SLAB; 709 710 if (flags & SLAB_HWCACHE_ALIGN) { 714 while (size < align/2) 715 align /= 2; 716 size = (size+align-1)&(~(align-1)); 717 } Align the object size to the word size 682 If the size is not aligned to the size of a word then... 683 Increase the object by the size of a word

684 Mask out the lower bits, this will eﬀectively round the object size up to the next word boundary 685 Print out an informational message for debugging purposes 688-697 If debugging is enabled then the alignments have to change slightly 694 Don’t bother trying to align things to the hardware cache. The red zoning of the object is going to oﬀset it by moving the object one word away from the cache boundary

4.0.1. Cache Creation

64

695 The size of the object increases by two BYTES_PER_WORD to store the red zone mark at either end of the object 698 Align the object on a word size 699-700 If requested, align the objects to the L1 CPU cache 703 If the objects are large, store the slab descriptors oﬀ-slab. This will allow better packing of objects into the slab 710 If hardware cache alignment is requested, the size of the objects must be adjusted to align themselves to the hardware cache 714-715 This is important to arches (e.g. Alpha or Pentium 4) with large L1 cache bytes. align will be adjusted to be the smallest that will give hardware cache alignment. For machines with large L1 cache lines, two or more small objects may ﬁt into each line. For example, two objects from the size-32 cache will ﬁt on one cache line from a Pentium 4 716 Round the cache size up to the hardware cache alignment 724 do { 725 726 cal_wastage: 727 728 729 730 731 732 733 734 735 737 738 739 740 741 746 747 748 749 750 751 next: 752

unsigned int break_flag = 0; kmem_cache_estimate(cachep->gfporder, size, flags, &left_over, &cachep->num); if (break_flag) break; if (cachep->gfporder >= MAX_GFP_ORDER) break; if (!cachep->num) goto next; if (flags & CFLGS_OFF_SLAB && cachep->num > offslab_limit) { cachep->gfporder--; break_flag++; goto cal_wastage; } if (cachep->gfporder >= slab_break_gfp_order) break; if ((left_over*8) <= (PAGE_SIZE<<cachep->gfporder)) break; cachep->gfporder++;

4.0.1. Cache Creation 753 754 755 756 757 758 759 760 } while (1); if (!cachep->num) { printk("kmem_cache_create: couldn’t create cache %s.\n", name); kmem_cache_free(&cache_cache, cachep); cachep = NULL; goto opps; } Calculate how many objects will ﬁt on a slab and adjust the slab size as necessary

65

727-728 kmem_cache_estimate() (see Section 4.0.2) calculates the number of objects that can ﬁt on a slab at the current gfp order and what the amount of leftover bytes will be 729-730 The break_flag is set if the number of objects ﬁtting on the slab exceeds the number that can be kept when oﬀslab slab descriptors are used 731-732 The order number of pages used must not exceed MAX_GFP_ORDER (5) 733-734 If even one object didn’t ﬁll, goto next: which will increase the gfporder used for the cache 735 If the slab descriptor is kept oﬀ-cache but the number of objects exceeds the number that can be tracked with bufctl’s oﬀ-slab then .... 737 Reduce the order number of pages used 738 Set the break_flag so the loop will exit 739 Calculate the new wastage ﬁgures 746-747 The slab_break_gfp_order is the order to not exceed unless 0 objects ﬁt on the slab. This check ensures the order is not exceeded 749-759 This is a rough check for internal fragmentation. If the wastage as a fraction of the total size of the cache is less than one eight, it is acceptable 752 If the fragmentation is too high, increase the gfp order and recalculate the number of objects that can be stored and the wastage 755 If after adjustments, objects still do not ﬁt in the cache, it cannot be created 757-758 Free the cache descriptor and set the pointer to NULL 758 Goto opps which simply returns the NULL pointer

4.0.1. Cache Creation 761 slab_size = L1_CACHE_ALIGN( cachep->num*sizeof(kmem_bufctl_t) + sizeof(slab_t)); if (flags & CFLGS_OFF_SLAB && left_over >= slab_size) { flags &= ~CFLGS_OFF_SLAB; left_over -= slab_size; } Align the slab size to the hardware cache

66

762 767 768 769 770

761 slab_size is the total size of the slab descriptor not the size of the slab itself. It is the size slab_t struct and the number of objects * size of the bufctl 767-769 If there is enough left over space for the slab descriptor and it was speciﬁed to place the descriptor oﬀ-slab, remove the ﬂag and update the amount of left_over bytes there is. This will impact the cache colouring but with the large objects associated with oﬀ-slab descriptors, this is not a problem 773 774 775 776 777 778 offset += (align-1); offset &= ~(align-1); if (!offset) offset = L1_CACHE_BYTES; cachep->colour_off = offset; cachep->colour = left_over/offset; Calculate colour oﬀsets. 773-774 offset is the oﬀset within the page the caller requested. This will make sure the oﬀset requested is at the correct alignment for cache usage 775-776 If somehow the oﬀset is 0, then set it to be aligned for the CPU cache 777 This is the oﬀset to use to keep objects on diﬀerent cache lines. Each slab created will be given a diﬀerent colour oﬀset 778 This is the number of diﬀerent oﬀsets that can be used 781 782 783 784 785 786 787 788 789 790 if (!cachep->gfporder && !(flags & CFLGS_OFF_SLAB)) flags |= CFLGS_OPTIMIZE; cachep->flags = flags; cachep->gfpflags = 0; if (flags & SLAB_CACHE_DMA) cachep->gfpflags |= GFP_DMA; spin_lock_init(&cachep->spinlock); cachep->objsize = size; INIT_LIST_HEAD(&cachep->slabs_full);

4.0.1. Cache Creation 791 792 793 794 795 INIT_LIST_HEAD(&cachep->slabs_partial); INIT_LIST_HEAD(&cachep->slabs_free); if (flags & CFLGS_OFF_SLAB) cachep->slabp_cache = kmem_find_general_cachep(slab_size,0); cachep->ctor = ctor; cachep->dtor = dtor; strcpy(cachep->name, name);

67

796 797 799 800 801 #ifdef CONFIG_SMP 802 if (g_cpucache_up) 803 enable_cpucache(cachep); 804 #endif Initialise remaining ﬁelds in cache descriptor

781-782 For caches with slabs of only 1 page, the CFLGS_OPTIMIZE ﬂag is set. In reality it makes no diﬀerence as the ﬂag is unused 784 Set the cache static ﬂags 785 Zero out the gfpﬂags. Defunct operation as memset after the cache descriptor was allocated would do this 786-787 If the slab is for DMA use, set the GFP_DMA ﬂag so the buddy allocator will use ZONE_DMA 788 Initialise the spinlock for access the cache 789 Copy in the object size, which now takes hardware cache alignment if necessary 790-792 Initialise the slab lists 794-795 If the descriptor is kept oﬀ-slab, allocate a slab manager and place it for use in slabp_cache. See Section 4.1.1 796-797 Set the pointers to the constructor and destructor functions 799 Copy in the human readable name 802-803 If per-cpu caches are enabled, create a set for this cache. See Section 4.4 806 807 808 809 810 811 down(&cache_chain_sem); { struct list_head *p; list_for_each(p, &cache_chain) { kmem_cache_t *pc = list_entry(p,

4.0.2. Calculating the Number of Objects on a Slab kmem_cache_t, next); 812 814 815 816 817 818 822 823 824 opps: 825 826 } if (!strcmp(pc->name, name)) BUG(); } } list_add(&cachep->next, &cache_chain); up(&cache_chain_sem); return cachep;

68

Add the new cache to teh cache chain 806 Acquire the semaphore used to synchronize access to the cache chain 810-816 Check every cache on the cache chain and make sure there isn’t a cache there with the same name. If there is, it means two caches of the same type are been created which is a serious bug 811 Get the cache from the list 814-815 Compare the names and if they match bug. It is worth noting that the new cache is not deleted, but this error is the result of sloppy programming during development and not a normal scenario 822 Link the cache into the chain. 823 Release the cache chain semaphore. 825 Return the new cache pointer

4.0.2

Calculating the Number of Objects on a Slab

Function: kmem_cache_estimate (mm/slab.c) During cache creation, it is determined how many objects can be stored in a slab and how much waste-age there will be. The following function calculates how many objects may be stored, taking into account if the slab and bufctl’s must be stored on-slab. 388 static void kmem_cache_estimate (unsigned long gfporder, size_t size, 389 int flags, size_t *left_over, unsigned int *num) 390 { 391 int i; 392 size_t wastage = PAGE_SIZE<<gfporder; 393 size_t extra = 0; 394 size_t base = 0; 395

4.0.2. Calculating the Number of Objects on a Slab 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 } if (!(flags & CFLGS_OFF_SLAB)) { base = sizeof(slab_t); extra = sizeof(kmem_bufctl_t); } i = 0; while (i*size + L1_CACHE_ALIGN(base+i*extra) <= wastage) i++; if (i > 0) i--; if (i > SLAB_LIMIT) i = SLAB_LIMIT; *num = i; wastage -= i*size; wastage -= L1_CACHE_ALIGN(base+i*extra); *left_over = wastage;

69

388 The parameters of the function are as follows gfporder The 2gf porder number of pages to allocate for each slab size The size of each object flags The cache ﬂags left_over The number of bytes left over in the slab. Returned to caller num The number of objects that will ﬁt in a slab. Returned to caller 392 wastage is decremented through the function. It starts with the maximum possible amount of wast-age. 393 extra is the number of bytes needed to store kmem_bufctl_t 394 base is where usable memory in the slab starts 396 If the slab descriptor is kept on cache, the base begins at the end of the slab_t struct and the number of bytes needed to store the bufctl is the size of kmem_bufctl_t 400 i becomes the number of objects the slab can hold 401-402 This counts up the number of objects that the cache can store. i*size is the amount of memory needed to store the object itself. L1_CACHE_ALIGN(base+i*extra) is slightly trickier. This is calculating the amount of memory needed to store the kmem_bufctl_t of which one exists for every object in the slab. As it is at the beginning of the slab, it is L1 cache aligned so that the ﬁrst object in the slab will be aligned to hardware cache. i*extra will calculate the amount of space needed to hold a kmem_bufctl_t for this object. As wast-age starts out as the size of the slab, its use is overloaded here.

4.0.3. Cache Shrinking

70

403-404 Because the previous loop counts until the slab overﬂows, the number of objects that can be stored is i-1. 406-407 SLAB_LIMIT is the absolute largest number of objects a slab can store. Is is deﬁned as 0xﬀﬀFFFE as this the largest number kmem_bufctl_t(), which is an unsigned int, can hold 409 num is now the number of objects a slab can hold 410 Take away the space taken up by all the objects from wast-age 411 Take away the space taken up by the kmem_bufctl_t 412 Wast-age has now been calculated as the left over space in the slab

4.0.3

Cache Shrinking

kmem_cache_shrink

__kmem_cache_shrink_locked

kmem_slab_destroy

Figure 4.2: Call Graph: kmem_cache_shrink() Two varieties of shrink functions are provided. kmem_cache_shrink() removes all slabs from slabs_free and returns the number of pages freed as a result. __kmem_cache_shrink() frees all slabs from slabs_free and then veriﬁes that slabs_partial and slabs_full are empty. This is important during cache destruction when it doesn’t matter how many pages are freed, just that the cache is empty. Function: kmem_cache_shrink (mm/slab.c) This function performs basic debugging checks and then acquires the cache descriptor lock before freeing slabs. At one time, it also used to call drain_cpu_caches() to free up objects on the per-cpu cache. It is curious that this was removed as it is possible slabs could not be freed due to an object been allocation on a per-cpu cache but not in use. 966 int kmem_cache_shrink(kmem_cache_t *cachep) 967 { 968 int ret; 969

4.0.3. Cache Shrinking 970 971 972 973 974 975 976 977 978 }

71

if (!cachep || in_interrupt() || !is_chained_kmem_cache(cachep)) BUG(); spin_lock_irq(&cachep->spinlock); ret = __kmem_cache_shrink_locked(cachep); spin_unlock_irq(&cachep->spinlock); return ret << cachep->gfporder;

966 The parameter is the cache been shrunk 970 Check that • The cache pointer is not null • That an interrupt isn’t trying to do this • That the cache is on the cache chain and not a bad pointer 973 Acquire the cache descriptor lock and disable interrupts 974 Shrink the cache 975 Release the cache lock and enable interrupts 976 This returns the number of pages freed but does not take into account the objects freed by draining the CPU. Function: __kmem_cache_shrink (mm/slab.c) This function is identical to kmem_cache_shrink() except it returns if the cache is empty or not. This is important during cache destruction when it is not important how much memory was freed, just that it is safe to delete the cache and not leak memory. 945 static int __kmem_cache_shrink(kmem_cache_t *cachep) 946 { 947 int ret; 948 949 drain_cpu_caches(cachep); 950 951 spin_lock_irq(&cachep->spinlock); 952 __kmem_cache_shrink_locked(cachep); 953 ret = !list_empty(&cachep->slabs_full) || 954 !list_empty(&cachep->slabs_partial); 955 spin_unlock_irq(&cachep->spinlock); 956 return ret; 957 } 949 Remove all objects from the per-CPU objects cache

4.0.3. Cache Shrinking 951 Acquire the cache descriptor lock and disable interrupts 952 Free all slabs in the slabs_free list 954-954 Check the slabs_partial and slabs_full lists are empty 955 Release the cache descriptor lock and re-enable interrupts 956 Return if the cache has all its slabs free or not

72

Function: __kmem_cache_shrink_locked (mm/slab.c) This does the dirty work of freeing slabs. It will keep destroying them until the growing ﬂag gets set, indicating the cache is in use or until there is no more slabs in slabs_free. 917 918 919 920 921 923 924 925 926 927 928 929 930 931 932 933 934 935 936 937 938 939 940 941 942 943 static int __kmem_cache_shrink_locked(kmem_cache_t *cachep) { slab_t *slabp; int ret = 0; while (!cachep->growing) { struct list_head *p; p = cachep->slabs_free.prev; if (p == &cachep->slabs_free) break; slabp = list_entry(cachep->slabs_free.prev, slab_t, list); #if DEBUG if (slabp->inuse) BUG(); #endif list_del(&slabp->list); spin_unlock_irq(&cachep->spinlock); kmem_slab_destroy(cachep, slabp); ret++; spin_lock_irq(&cachep->spinlock); } return ret; }

923 While the cache is not growing, free slabs 926-930 Get the last slab on the slabs_free list 932-933 If debugging is available, make sure it is not in use. If it is not in use, it should not be on the slabs_free list in the ﬁrst place 935 Remove the slab from the list

4.0.4. Cache Destroying

73

937 Re-enable interrupts. This function is called with interrupts disabled and this is to free the interrupt as quickly as possible. 938 Delete the slab (see Section 4.1.4) 939 Record the number of slabs freed 940 Acquire the cache descriptor lock and disable interrupts

4.0.4

Cache Destroying

When a module is unloaded, it is responsible for destroying any cache is has created as during module loading, it is ensured there is not two caches of the same name. Core kernel code often does not destroy its caches as their existence persists for the life of the system. The steps taken to destroy a cache are • Delete the cache from the cache chain • Shrink the cache to delete all slabs (see Section 4.0.3) • Free any per CPU caches (kfree()) • Delete the cache descriptor from the cache_cache (see Section: 4.2.3) Figure 4.3 Shows the call graph for this task.

kmem_cache_destroy

__kmem_cache_shrink

kfree

kmem_cache_free

Figure 4.3: Call Graph: kmem_cache_destroy() Function: kmem_cache_destroy (mm/slab.c) 995 int kmem_cache_destroy (kmem_cache_t * cachep) 996 { 997 if (!cachep || in_interrupt() || cachep->growing) 998 BUG(); 999 1000 /* Find the cache in the chain of caches. */ 1001 down(&cache_chain_sem); 1002 /* the chain is never empty, cache_cache is never destroyed */ 1003 if (clock_searchp == cachep) 1004 clock_searchp = list_entry(cachep->next.next, 1005 kmem_cache_t, next);

4.0.4. Cache Destroying 1006 1007 1008 1009 1010 list_del(&cachep->next); up(&cache_chain_sem);

74

if (__kmem_cache_shrink(cachep)) { printk(KERN_ERR "kmem_cache_destroy: Can’t free all objects %p\n", 1011 cachep); 1012 down(&cache_chain_sem); 1013 list_add(&cachep->next,&cache_chain); 1014 up(&cache_chain_sem); 1015 return 1; 1016 } 1017 #ifdef CONFIG_SMP 1018 { 1019 int i; 1020 for (i = 0; i < NR_CPUS; i++) 1021 kfree(cachep->cpudata[i]); 1022 } 1023 #endif 1024 kmem_cache_free(&cache_cache, cachep); 1025 1026 return 0; 1027 } 997-998 Sanity check. Make sure the cachep is not null, that an interrupt isn’t trying to do this and that the cache hasn’t been marked growing, indicating it is in use 1001 Acquire the semaphore for accessing the cache chain 1003-1005 Acquire the list entry from the cache chain 1006 Delete this cache from the cache chain 1007 Release the cache chain semaphore 1009 Shrink the cache to free all slabs (see Section 4.0.3) 1010-1015 The shrink function returns true if there is still slabs in the cache. If there is, the cache cannot be destroyed so it is added back into the cache chain and the error reported 1020-1021 If SMP is enabled, the per-cpu data structures are deleted with kfree kfree() 1024 Delete the cache descriptor from the cache_cache

4.0.5. Cache Reaping

75

4.0.5

Cache Reaping

When the page allocator notices that memory is getting tight, it wakes kswapd to begin freeing up pages (see Section 2.1). One of the ﬁrst ways it accomplishes this task is telling the slab allocator to reap caches. It has to be the slab allocator that selects the caches as other subsystems should not know anything about the cache internals.

kmem_cache_reap

__free_block

kmem_slab_destroy

kmem_cache_free_one

kmem_freepages

kmem_cache_free

Figure 4.4: Call Graph: kmem_cache_reap() The call graph in Figure 4.4 is deceptively simple. The task of selecting the proper cache to reap is quite long. In case there is many caches in the system, only REAP_SCANLEN caches are examined in each call. The last cache to be scanned is stored in the variable clock_searchp so as not to examine the same caches over and over again. For each scanned cache, the reaper does the following • Check ﬂags for SLAB_NO_REAP and skip if set • If the cache is growing, skip it • if the cache has grown recently (DFLGS_GROWN is set in dﬂags), skip it but clear the ﬂag so it will be reaped the next time • Count the number of free slabs in slabs_free and calculate how many pages that would free in the variable pages • If the cache has constructors or large slabs, adjust pages to make it less likely for the cache to be selected. • If the number of pages that would be freed exceeds REAP_PERFECT, free half of the slabs in slabs_free • Otherwise scan the rest of the caches and select the one that would free the most pages for freeing half of its slabs in slabs_free

4.0.5. Cache Reaping

76

Function: kmem_cache_reap (mm/slab.c) Because of the size of this function, it will be broken up into three separate sections. The ﬁrst is simple function preamble. The second is the selection of a cache to reap and the third is the freeing of the slabs 1736 int kmem_cache_reap (int gfp_mask) 1737 { 1738 slab_t *slabp; 1739 kmem_cache_t *searchp; 1740 kmem_cache_t *best_cachep; 1741 unsigned int best_pages; 1742 unsigned int best_len; 1743 unsigned int scan; 1744 int ret = 0; 1745 1746 if (gfp_mask & __GFP_WAIT) 1747 down(&cache_chain_sem); 1748 else 1749 if (down_trylock(&cache_chain_sem)) 1750 return 0; 1751 1752 scan = REAP_SCANLEN; 1753 best_len = 0; 1754 best_pages = 0; 1755 best_cachep = NULL; 1756 searchp = clock_searchp; 1736 The only parameter is the GFP ﬂag. The only check made is against the __GFP_WAIT ﬂag. As the only caller, kswapd, can sleep, this parameter is virtually worthless 1746-1747 Can the caller sleep? If yes, then acquire the semaphore 1749-1750 Else, try and acquire the semaphore and if not available, return 1752 REAP_SCANLEN (10) is the number of caches to examine. 1756 Set searchp to be the last cache that was examined at the last reap 1757 1758 1759 1760 1761 1763 1764 1765 1766 1767 do { unsigned int pages; struct list_head* p; unsigned int full_free; if (searchp->flags & SLAB_NO_REAP) goto next; spin_lock_irq(&searchp->spinlock); if (searchp->growing) goto next_unlock;

4.0.5. Cache Reaping 1768 if (searchp->dflags & DFLGS_GROWN) { 1769 searchp->dflags &= ~DFLGS_GROWN; 1770 goto next_unlock; 1771 } 1772 #ifdef CONFIG_SMP 1773 { 1774 cpucache_t *cc = cc_data(searchp); 1775 if (cc && cc->avail) { 1776 __free_block(searchp, cc_entry(cc), cc->avail); 1777 cc->avail = 0; 1778 } 1779 } 1780 #endif 1781 1782 full_free = 0; 1783 p = searchp->slabs_free.next; 1784 while (p != &searchp->slabs_free) { 1785 slabp = list_entry(p, slab_t, list); 1786 #if DEBUG 1787 if (slabp->inuse) 1788 BUG(); 1789 #endif 1790 full_free++; 1791 p = p->next; 1792 } 1793 1799 pages = full_free * (1<<searchp->gfporder); 1800 if (searchp->ctor) 1801 pages = (pages*4+1)/5; 1802 if (searchp->gfporder) 1803 pages = (pages*4+1)/5; 1804 if (pages > best_pages) { 1805 best_cachep = searchp; 1806 best_len = full_free; 1807 best_pages = pages; 1808 if (pages >= REAP_PERFECT) { 1809 clock_searchp = list_entry(searchp->next.next, 1810 kmem_cache_t,next); 1811 goto perfect; 1812 } 1813 } 1814 next_unlock: 1815 spin_unlock_irq(&searchp->spinlock);

77

4.0.5. Cache Reaping 1816 next: 1817 1818

78

searchp = list_entry(searchp->next.next,kmem_cache_t,next); } while (--scan && searchp != clock_searchp);

This block examines REAP_SCANLEN number of caches to select one to free 1765 Acquire an interrupt safe lock to the cache descriptor 1766-1767 If the cache is growing, skip it 1768-1771 If the cache has grown recently, skip it and clear the ﬂag 1773-1779 Free any per CPU objects to the global pool 1784-1792 Count the number of slabs in the slabs_free list 1799 Calculate the number of pages all the slabs hold 1800-1801 If the objects have constructors, reduce the page count by one ﬁfth to make it less likely to be selected for reaping 1802-1803 If the slabs consist of more than one page, reduce the page count by one ﬁfth. This is because high order pages are hard to acquire 1804 If this is the best candidate found for reaping so far, check if it is perfect for reaping 1805-1807 Record the new maximums 1806 best_len is recorded so that it is easy to know how many slabs is half of the slabs in the free list 1808 If this cache is perfect for reaping then .... 1809 Update clock_searchp 1810 Goto perfect where half the slabs will be freed 1814 This label is reached if it was found the cache was growing after acquiring the lock 1815 Release the cache descriptor lock 1816 Move to the next entry in the cache chain 1818 Scan while REAP_SCANLEN has not been reached and we have not cycled around the whole cache chain 1820 1821 1822 1824 1825 clock_searchp = searchp; if (!best_cachep) goto out;

4.0.5. Cache Reaping 1826 spin_lock_irq(&best_cachep->spinlock); 1827 perfect: 1828 /* free only 50% of the free slabs */ 1829 best_len = (best_len + 1)/2; 1830 for (scan = 0; scan < best_len; scan++) { 1831 struct list_head *p; 1832 1833 if (best_cachep->growing) 1834 break; 1835 p = best_cachep->slabs_free.prev; 1836 if (p == &best_cachep->slabs_free) 1837 break; 1838 slabp = list_entry(p,slab_t,list); 1839 #if DEBUG 1840 if (slabp->inuse) 1841 BUG(); 1842 #endif 1843 list_del(&slabp->list); 1844 STATS_INC_REAPED(best_cachep); 1845 1846 /* Safe to drop the lock. The slab is no longer * linked to the 1847 * cache. 1848 */ 1849 spin_unlock_irq(&best_cachep->spinlock); 1850 kmem_slab_destroy(best_cachep, slabp); 1851 spin_lock_irq(&best_cachep->spinlock); 1852 } 1853 spin_unlock_irq(&best_cachep->spinlock); 1854 ret = scan * (1 << best_cachep->gfporder); 1855 out: 1856 up(&cache_chain_sem); 1857 return ret; 1858 }

79

This block will free half of the slabs from the selected cache 1820 Update clock_searchp for the next cache reap 1822-1824 If a cache was not found, goto out to free the cache chain and exit 1826 Acquire the cache chain spinlock and disable interrupts. The cachep descriptor has to be held by an interrupt safe lock as some caches may be used from interrupt context. The slab allocator has no way to diﬀerentiate between interrupt safe and unsafe caches 1829 Adjust best_len to be the number of slabs to free

4.1. Slabs 1830-1852 Free best_len number of slabs 1833-1845 If the cache is growing, exit 1835 Get a slab from the list 1836-1837 If there is no slabs left in the list, exit 1838 Get the slab pointer 1840-1841 If debugging is enabled, make sure there isn’t active objects in the slab 1843 Remove the slab from the slabs_free list 1844 Update statistics if enabled 1849 Free the cache descriptor and enable interrupts 1850 Destroy the slab. See Section 4.1.4 1851 Re-acquire the cache descriptor spinlock and disable interrupts 1853 Free the cache descriptor and enable interrupts 1854 ret is the number of pages that was freed 1856-1857 Free the cache semaphore and return the number of pages freed

80

4.1

Slabs

This section will describe how a slab is structured and managed. The struct which describes it is much simpler than the cache descriptor, but how the slab is arranged is slightly more complex. We begin with the descriptor. 155 typedef struct slab_s { 156 struct list_head 157 unsigned long 158 void 159 unsigned int 160 kmem_bufctl_t 161 } slab_t; list; colouroff; *s_mem; inuse; free;

list The list the slab belongs to. One of slab_full, slab_partial and slab_free colouroff The colour oﬀset is the oﬀset of the ﬁrst object within the slab. The address of the ﬁrst object is s_mem + colouroff . See Section 4.1.1 s_mem The starting address of the ﬁrst object within the slab inuse Number of active objects in the slab free This is an array of bufctls used for storing locations of free objects. See the companion document for seeing how to track free objects.

4.1.1. Storing the Slab Descriptor

81

4.1.1

Storing the Slab Descriptor

Function: kmem_cache_slabmgmt (mm/slab.c) This function will either allocate allocate space to keep the slab descriptor oﬀ cache or reserve enough space at the beginning of the slab for the descriptor and the bufctl’s. 1030 static inline slab_t * kmem_cache_slabmgmt ( kmem_cache_t *cachep, 1031 void *objp, int colour_off, int local_flags) 1032 { 1033 slab_t *slabp; 1034 1035 if (OFF_SLAB(cachep)) { 1037 slabp = kmem_cache_alloc(cachep->slabp_cache, local_flags); 1038 if (!slabp) 1039 return NULL; 1040 } else { 1045 slabp = objp+colour_off; 1046 colour_off += L1_CACHE_ALIGN(cachep->num * 1047 sizeof(kmem_bufctl_t) + sizeof(slab_t)); 1048 } 1049 slabp->inuse = 0; 1050 slabp->colouroff = colour_off; 1051 slabp->s_mem = objp+colour_off; 1052 1053 return slabp; 1054 } 1030 The parameters of the function are cachep The cache the slab is to be allocated to objp When the function is called, this points to the beginning of the slab colour_off The colour oﬀset for this slab local_flags These are the ﬂags for the cache. They are described in the companion document 1035-1040 If the slab descriptor is kept oﬀ cache.... 1037 Allocate memory from the sizes cache. During cache creation, slabp_cache is set to the appropriate size cache to allocate from. See Section 4.0.1 1038 If the allocation failed, return

4.1.1. Storing the Slab Descriptor 1040-1048 Reserve space at the beginning of the slab

82

1045 The address of the slab will be the beginning of the slab (objp) plus the colour oﬀset 1046 colour_off is calculated to be the oﬀset where the ﬁrst object will be placed. The address is L1 cache aligned. cachep->num * sizeof(kmem_bufctl_t) is the amount of space needed to hold the bufctls for each object in the slab and sizeof(slab_t) is the size of the slab descriptor. This eﬀectively has reserved the space at the beginning of the slab 1049 The number of objects in use on the slab is 0 1050 The colouroﬀ is updated for placement of the new object 1051 The address of the ﬁrst object is calculated as the address of the beginning of the slab plus the oﬀset Function: kmem_ﬁnd_general_cachep (mm/slab.c) If the slab descriptor is to be kept oﬀ-slab, this function, called during cache creation (see Section 4.0.1) will ﬁnd the appropriate sizes cache to use and will be stored within the cache descriptor in the ﬁeld slabp_cache. 1618 kmem_cache_t * kmem_find_general_cachep (size_t size, int gfpflags) 1619 { 1620 cache_sizes_t *csizep = cache_sizes; 1621 1626 for ( ; csizep->cs_size; csizep++) { 1627 if (size > csizep->cs_size) 1628 continue; 1629 break; 1630 } 1631 return (gfpflags & GFP_DMA) ? csizep->cs_dmacachep : csizep->cs_cachep; 1632 } 1618 size is the size of the slab descriptor. gfpflags is always 0 as DMA memory is not needed for a slab descriptor 1626-1630 Starting with the smallest size, keep increasing the size until a cache is found with buﬀers large enough to store the slab descriptor 1631 Return either a normal or DMA sized cache depending on the gfpflags passed in. In reality, only the cs_cachep is ever passed back

4.1.2. Slab Structure

83

4.1.2 4.1.3

Slab Structure Slab Creation

This section will show how a cache is grown when no objects are left in the slabs_partial list and there is no slabs in slabs_free. The principle function for this is kmem_cache_grow(). The tasks it fulﬁlls are • Perform basic sanity checks to guard against bad usage • Calculate colour oﬀset for objects in this slab • Allocate memory for slab and acquire a slab descriptor • Link the pages used for the slab to the slab and cache descriptors (see Section 4.1) • Initialise objects in the slab • Add the slab to the cache Function: kmem_cache_grow (mm/slab.c)

kmem_cache_grow

kmem_getpages

kmem_cache_slabmgmt

kmem_cache_init_objs

__get_free_pages

kmem_cache_alloc

__kmem_cache_alloc

Figure 4.5: Call Graph: kmem_cache_grow() Figure 4.5 shows the call graph to grow a cache. This function will be dealt with in blocks. Each block corresponds to one of the tasks described in the previous section 1103 static int kmem_cache_grow (kmem_cache_t * cachep, int flags) 1104 { 1105 slab_t *slabp; 1106 struct page *page; 1107 void *objp; 1108 size_t offset;

4.1.3. Slab Creation 1109 1110 1111 unsigned int unsigned long unsigned long i, local_flags; ctor_flags; save_flags;

84

Basic declarations. The parameters of the function are cachep The cache to allocate a new slab to flags The ﬂags for a slab creation 1116 1117 1118 1119 1120 1127 1128 1129 1130 1131 1132 1137 if (flags & ~(SLAB_DMA|SLAB_LEVEL_MASK|SLAB_NO_GROW)) BUG(); if (flags & SLAB_NO_GROW) return 0; if (in_interrupt() && (flags & SLAB_LEVEL_MASK) != SLAB_ATOMIC) BUG(); ctor_flags = SLAB_CTOR_CONSTRUCTOR; local_flags = (flags & SLAB_LEVEL_MASK); if (local_flags == SLAB_ATOMIC) ctor_flags |= SLAB_CTOR_ATOMIC;

Perform basic sanity checks to guard against bad usage. The checks are made here rather than kmem_cache_alloc() to protect the critical path. There is no point checking the ﬂags every time an object needs to be allocated. 1116-1117 Make sure only allowable ﬂags are used for allocation 1118-1119 Do not grow the cache if this is set. In reality, it is never set 1127-1128 If this called within interrupt context, make sure the ATOMIC ﬂag is set 1130 This ﬂag tells the constructor it is to init the object 1131 The local_ﬂags are just those relevant to the page allocator 1132-1137 If the SLAB_ATOMIC ﬂag is set, the constructor needs to know about it in case it wants to make new allocations 1140 1141 1143 1144 1145 1146 1147 1148 1149 1150 1151 spin_lock_irqsave(&cachep->spinlock, save_flags); offset = cachep->colour_next; cachep->colour_next++; if (cachep->colour_next >= cachep->colour) cachep->colour_next = 0; offset *= cachep->colour_off; cachep->dflags |= DFLGS_GROWN; cachep->growing++; spin_unlock_irqrestore(&cachep->spinlock, save_flags);

4.1.3. Slab Creation Calculate colour oﬀset for objects in this slab 1140 Acquire an interrupt safe lock for accessing the cache descriptor 1143 Get the oﬀset for objects in this slab 1144 Move to the next colour oﬀset

85

1145-1146 If colour has been reached, there is no more oﬀsets available, so reset colour_next to 0 1147 colour_off is the size of each oﬀset, so offset * colour_off will give how many bytes to oﬀset the objects to 1148 Mark the cache that it is growing so that kmem_cache_reap() will ignore this cache 1150 Increase the count for callers growing this cache 1151 Free the spinlock and re-enable interrupts 1163 1164 1165 1167 if (!(objp = kmem_getpages(cachep, flags))) goto failed; if (!(slabp = kmem_cache_slabmgmt(cachep, objp, offset, local_flags))) goto opps1; Allocate memory for slab and acquire a slab descriptor 1163-1164 Allocate pages from the page allocator for the slab. See Section 4.6 1167 Acquire a slab descriptor. See Section 4.1.1 1171 1172 1173 1174 1175 1176 1177 1178 i = 1 << cachep->gfporder; page = virt_to_page(objp); do { SET_PAGE_CACHE(page, cachep); SET_PAGE_SLAB(page, slabp); PageSetSlab(page); page++; } while (--i); Link the pages for the slab used to the slab and cache descriptors 1171 i is the number of pages used for the slab. Each page has to be linked to the slab and cache descriptors. 1172 objp is a pointer to the beginning of the slab. The macro virt_to_page() will give the struct page for that address

1158

4.1.3. Slab Creation 1173-1178 Link each pages list ﬁeld to the slab and cache descriptors

86

1174 SET_PAGE_CACHE() links the page to the cache descriptor. See the companion document for details 1176 SET_PAGE_SLAB() links the page to the slab descriptor. See the companion document for details 1176 Set the PG_slab page ﬂag. See the companion document for a full list of page ﬂags 1177 Move to the next page for this slab to be linked 1180 kmem_cache_init_objs(cachep, slabp, ctor_flags);

1180 Initialise all objects. See Section 4.2.1 1182 1183 1184 1186 1187 1188 1189 1190 1191 spin_lock_irqsave(&cachep->spinlock, save_flags); cachep->growing--; list_add_tail(&slabp->list, &cachep->slabs_free); STATS_INC_GROWN(cachep); cachep->failures = 0; spin_unlock_irqrestore(&cachep->spinlock, save_flags); return 1; Add the slab to the cache 1182 Acquire the cache descriptor spinlock in an interrupt safe fashion 1183 Decrease the growing count 1186 Add the slab to the end of the slabs_free list 1187 If STATS is set, increase the cachep→grown ﬁeld 1188 Set failures to 0. This ﬁeld is never used elsewhere 1190 Unlock the spinlock in an interrupt safe fashion 1191 Return success 1192 opps1: 1193 1194 failed: 1195 1196 1197 1298 1299 } 1300 kmem_freepages(cachep, objp); spin_lock_irqsave(&cachep->spinlock, save_flags); cachep->growing--; spin_unlock_irqrestore(&cachep->spinlock, save_flags); return 0;

4.1.4. Slab Destroying Error handling

87

1192-1193 opps1 is reached if the pages for the slab were allocated. They must be freed 1195 Acquire the spinlock for accessing the cache descriptor 1196 Reduce the growing count 1197 Release the spinlock 1298 Return failure

4.1.4

Slab Destroying

When a cache is been shrunk or destroyed, the slabs will be deleted. As the objects may have destructors, they must be called so the tasks of this function are • If available, call the destructor for every object in the slab • If debugging is enabled, check the red marking and poison pattern • Free the pages the slab uses The call graph at Figure 4.6 is very simple.

kmem_slab_destroy

kmem_freepages

kmem_cache_free

Figure 4.6: Call Graph: kmem_slab_destroy() Function: kmem_slab_destroy (mm/slab.c) The debugging section has been omitted from this function but are almost identical to the debugging section during object allocation. See Section 4.2.1 for how the markers and poison pattern are checked. 555 static void kmem_slab_destroy (kmem_cache_t *cachep, slab_t *slabp) 556 { 557 if (cachep->dtor 561 ){ 562 int i; 563 for (i = 0; i < cachep->num; i++) { 564 void* objp = slabp->s_mem+cachep->objsize*i; 565-574 DEBUG: Check red zone markers

4.2. Objects

88

575 576

if (cachep->dtor) (cachep->dtor)(objp, cachep, 0);

577-584 DEBUG: Check poison pattern 585 586 587 588 589 590 591 } } } kmem_freepages(cachep, slabp->s_mem-slabp->colouroff); if (OFF_SLAB(cachep)) kmem_cache_free(cachep->slabp_cache, slabp);

557-586 If a destructor is available, call it for each object in the slab 563-585 Cycle through each object in the slab 564 Calculate the address of the object to destroy 575-576 Call the destructor 588 Free the pages been used for the slab 589 If the slab descriptor is been kept oﬀ-slab, then free the memory been used for it

4.2

Objects

This section will cover how objects are managed. At this point, most of the real hard work has been completed by either the cache or slab managers.

4.2.1

Initialising Objects in a Slab

When a slab is created, all the objects in it put in an initialised state. If a constructor is available, it is called for each object and it is expected when an object is freed, it is left in its initialised state. Conceptually this is very simple, cycle through all objects and call the constructor and initialise the kmem_bufctl for it. The function kmem_cache_init_objs() is responsible for initialising the objects. Function: kmem_cache_init_objs (mm/slab.c) The vast part of this function is involved with debugging so we will start with the function without the debugging and explain that in detail before handling the debugging part. The two sections that are debugging are marked in the code excerpt below as Part 1 and Part 2. 1056 static inline void kmem_cache_init_objs (kmem_cache_t * cachep, 1057 slab_t * slabp, unsigned long ctor_flags) 1058 {

4.2.1. Initialising Objects in a Slab 1059 1060 1061 1062 1063-1070 1077 1078 1079-1092 1093 1094 1095 1096 1097 } int i; for (i = 0; i < cachep->num; i++) { void* objp = slabp->s_mem+cachep->objsize*i; /* Debugging Part 1 */ if (cachep->ctor) cachep->ctor(objp, cachep, ctor_flags); /* Debugging Part 2 */ slab_bufctl(slabp)[i] = i+1; } slab_bufctl(slabp)[i-1] = BUFCTL_END; slabp->free = 0;

89

1056 The parameters of the function are cachep The cache the objects are been initialised for slabp The slab the objects are in ctor_flags Flags the constructor needs whether this is an atomic allocation or not 1061 Initialise cache→num number of objects 1062 The base address for objects in the slab is s_mem. The address of the object to allocate is then i * (size of a single object) 1077-1078 If a constructor is available, call it 1093 The macro slab_bufctl() casts slabp to a slab_t slab descriptor and adds one to it. This brings the pointer to the end of the slab descriptor and then casts it back to a kmem_bufctl_t eﬀectively giving the beginning of the bufctl array. 1096 The index of the ﬁrst free object is 0 in the bufctl array That covers the core of initialising objects. Next the ﬁrst debugging part will be covered 1063 #if DEBUG 1064 1065 1066 1067 1068 1069 1070 #endif

if (cachep->flags & SLAB_RED_ZONE) { *((unsigned long*)(objp)) = RED_MAGIC1; *((unsigned long*)(objp + cachep->objsize BYTES_PER_WORD)) = RED_MAGIC1; objp += BYTES_PER_WORD; }

4.2.2. Object Allocation 1064 If the cache is to be red zones then place a marker at either end of the object 1065 Place the marker at the beginning of the object

90

1066 Place the marker at the end of the object. Remember that the size of the object takes into account the size of the red markers when red zoning is enabled 1068 Increase the objp pointer by the size of the marker for the beneﬁt of the constructor which is called after this debugging block 1079 #if DEBUG 1080 1081 1082 1084 1085 1086 1087 1088 1089 1090 1091 1092 #endif

if (cachep->flags & SLAB_RED_ZONE) objp -= BYTES_PER_WORD; if (cachep->flags & SLAB_POISON) kmem_poison_obj(cachep, objp); if (cachep->flags & SLAB_RED_ZONE) { if (*((unsigned long*)(objp)) != RED_MAGIC1) BUG(); if (*((unsigned long*)(objp + cachep->objsize BYTES_PER_WORD)) != RED_MAGIC1) BUG(); }

This is the debugging block that takes place after the constructor, if it exists, has been called. 1080-1081 The objp was increased by the size of the red marker in the previous debugging block so move it back again 1082-1084 If there was no constructor, poison the object with a known pattern that can be examined later to trap uninitialised writes 1086 Check to make sure the red marker at the beginning of the object was preserved to trap writes before the object 1088-1089 Check to make sure writes didn’t take place past the end of the object

4.2.2

Object Allocation

Function: kmem_cache_alloc (mm/slab.c) This trivial function simply calls __kmem_cache_alloc(). 1527 void * kmem_cache_alloc (kmem_cache_t *cachep, int flags) 1529 { 1530 return __kmem_cache_alloc(cachep, flags); 1531 }

4.2.2. Object Allocation

91

kmem_cache_alloc

__kmem_cache_alloc

kmem_cache_alloc_head

kmem_cache_alloc_one

kmem_cache_alloc_one_tail

kmem_cache_grow

Figure 4.7: Call Graph: kmem_cache_alloc() Function: __kmem_cache_alloc (UP Case) (mm/slab.c) This will take the parts of the function speciﬁc to the UP case. The SMP case will be dealt with in the next section. 1336 static inline void * __kmem_cache_alloc (kmem_cache_t *cachep, int flags) 1337 { 1338 unsigned long save_flags; 1339 void* objp; 1340 1341 kmem_cache_alloc_head(cachep, flags); 1342 try_again: 1343 local_irq_save(save_flags); 1365 objp = kmem_cache_alloc_one(cachep);

1367 local_irq_restore(save_flags); 1368 return objp; 1369 alloc_new_slab: 1374 1375 1379 1380 1381 } local_irq_restore(save_flags); if (kmem_cache_grow(cachep, flags)) goto try_again; return NULL;

1336 The parameters are the cache to allocate from and allocation speciﬁc ﬂags. 1341 This function makes sure the appropriate combination of DMA ﬂags are in use 1343 Disable interrupts and save the ﬂags. This function is used by interrupts so this is the only way to provide synchronisation in the UP case 1365 This macro (see Section 4.2.2) allocates an object from one of the lists and returns it. If no objects are free, it calls goto alloc_new_slab at the end of this function

4.2.2. Object Allocation 1367-1368 Restore interrupts and return

92

1374 At this label, no objects were free in slabs_partial and slabs_free is empty so a new slab is needed 1375 Allocate a new slab (see Section 4.1.3) 1379 A new slab is available so try again 1380 No slabs could be allocated so return failure Function: __kmem_cache_alloc (SMP Case) (mm/slab.c) This is what the function looks like in the SMP case 1336 static inline void * __kmem_cache_alloc (kmem_cache_t *cachep, int flags) 1337 { 1338 unsigned long save_flags; 1339 void* objp; 1340 1341 kmem_cache_alloc_head(cachep, flags); 1342 try_again: 1343 local_irq_save(save_flags); 1345 { 1346 cpucache_t *cc = cc_data(cachep); 1347 1348 if (cc) { 1349 if (cc->avail) { 1350 STATS_INC_ALLOCHIT(cachep); 1351 objp = cc_entry(cc)[--cc->avail]; 1352 } else { 1353 STATS_INC_ALLOCMISS(cachep); 1354 objp = kmem_cache_alloc_batch(cachep,cc,flags); 1355 if (!objp) 1356 goto alloc_new_slab_nolock; 1357 } 1358 } else { 1359 spin_lock(&cachep->spinlock); 1360 objp = kmem_cache_alloc_one(cachep); 1361 spin_unlock(&cachep->spinlock); 1362 } 1363 } 1364 local_irq_restore(save_flags); 1368 return objp; 1369 alloc_new_slab: 1371 spin_unlock(&cachep->spinlock); 1372 alloc_new_slab_nolock:

4.2.2. Object Allocation 1373 1375 1379 1380 1381 } local_irq_restore(save_flags); if (kmem_cache_grow(cachep, flags)) goto try_again; return NULL;

93

1336-1345 Same as UP case 1347 Obtain the per CPU data for this cpu 1348-1358 If a per CPU cache is available then .... 1349 If there is an object available then .... 1350 Update statistics for this cache if enabled 1351 Get an object and update the avail ﬁgure 1352 Else an object is not available so .... 1353 Update statistics for this cache if enabled 1354 Allocate batchcount number of objects, place all but one of them in the per CPU cache and return the last one to objp 1355-1356 The allocation failed, so goto alloc_new_slab_nolock to grow the cache and allocate a new slab 1358-1362 If a per CPU cache is not available, take out the cache spinlock and allocate one object in the same way the UP case does. This is the case during the initialisation for the cache_cache for example 1361 Object was successfully assigned, release cache spinlock 1364-1368 Re-enable interrupts and return the allocated object 1369-1370 If kmem_cache_alloc_one() failed to allocate an object, it will goto here with the spinlock still held so it must be released 1373-1381 Same as the UP case Function: kmem_cache_alloc_head (mm/slab.c) This simple function ensures the right combination of slab and GFP ﬂags are used for allocation from a slab. If a cache is for DMA use, this function will make sure the caller does not accidently request normal memory and vice versa 1229 static inline void kmem_cache_alloc_head(kmem_cache_t *cachep, int flags) 1230 { 1231 if (flags & SLAB_DMA) { 1232 if (!(cachep->gfpflags & GFP_DMA)) 1233 BUG();

4.2.2. Object Allocation 1234 1235 1236 1237 1238 } } else { if (cachep->gfpflags & GFP_DMA) BUG(); }

94

1229 The parameters are the cache we are allocating from and the ﬂags requested for the allocation 1231 If the caller has requested memory for DMA use and .... 1232 The cache is not using DMA memory then BUG() 1235 Else if the caller has not requested DMA memory and this cache is for DMA use, BUG() Function: kmem_cache_alloc_one (mm/slab.c) This is a preprocessor macro. It may seem strange to not make this an inline function but it is a preprocessor macro for for a goto optimisation in __kmem_cache_alloc() (see Section 4.2.2) 1281 #define kmem_cache_alloc_one(cachep) 1282 ({ 1283 struct list_head * slabs_partial, * entry; 1284 slab_t *slabp; 1285 1286 slabs_partial = &(cachep)->slabs_partial; 1287 entry = slabs_partial->next; 1288 if (unlikely(entry == slabs_partial)) { 1289 struct list_head * slabs_free; 1290 slabs_free = &(cachep)->slabs_free; 1291 entry = slabs_free->next; 1292 if (unlikely(entry == slabs_free)) 1293 goto alloc_new_slab; 1294 list_del(entry); 1295 list_add(entry, slabs_partial); 1296 } 1297 1298 slabp = list_entry(entry, slab_t, list); 1299 kmem_cache_alloc_one_tail(cachep, slabp); 1300 }) 1286-1287 Get the ﬁrst slab from the slabs_partial list 1288-1296 If a slab is not available from this list, execute this block 1289-1291 Get the ﬁrst slab from the slabs_free list \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \

4.2.2. Object Allocation

95

1292-1293 If there is no slabs on slabs_free, then goto alloc_new_slab(). This goto label is in __kmem_cache_alloc() and it is will grow the cache by one slab 1294-1295 Else remove the slab from the free list and place it on the slabs_partial list because an object is about to be removed from it 1298 Obtain the slab from the list 1299 Allocate one object from the slab Function: kmem_cache_alloc_one_tail (mm/slab.c) This function is responsible for the allocation of one object from a slab. Much of it is debugging code. 1240 1241 1242 1243 1244 1245 1246 1247 1248 1250 1251 1252 1253 1254 1255 1256 1257 1258 1259 1260 1261 1262 1264 1265 1266 1267 1268 1269 1270 1271 1272 1273 1274 static inline void * kmem_cache_alloc_one_tail (kmem_cache_t *cachep, slab_t *slabp) { void *objp; STATS_INC_ALLOCED(cachep); STATS_INC_ACTIVE(cachep); STATS_SET_HIGH(cachep); slabp->inuse++; objp = slabp->s_mem + slabp->free*cachep->objsize; slabp->free=slab_bufctl(slabp)[slabp->free]; if (unlikely(slabp->free == BUFCTL_END)) { list_del(&slabp->list); list_add(&slabp->list, &cachep->slabs_full); } #if DEBUG if (cachep->flags & SLAB_POISON) if (kmem_check_poison_obj(cachep, objp)) BUG(); if (cachep->flags & SLAB_RED_ZONE) { if (xchg((unsigned long *)objp, RED_MAGIC2) != RED_MAGIC1) BUG(); if (xchg((unsigned long *)(objp+cachep->objsize BYTES_PER_WORD), RED_MAGIC2) != RED_MAGIC1) BUG(); objp += BYTES_PER_WORD; } #endif return objp; }

4.2.2. Object Allocation 1230 The parameters are the cache and slab been allocated from

96

1245-1247 If stats are enabled, this will set three statistics. ALLOCED is the total number of objects that have been allocated. ACTIVE is the number of active objects in the cache. HIGH is the maximum number of objects that were active as a single time 1250 inuse is the number of objects active on this slab 1251 Get a pointer to a free object. s_mem is a pointer to the ﬁrst object on the slab. free is an index of a free object in the slab. index * object size gives an oﬀset within the slab 1252 This updates the free pointer to be an index of the next free object. See the companion document for seeing how to track free objects. 1254-1257 If the slab is full, remove it from the slabs_partial list and place it on the slabs_full. 1258-1272 Debugging code 1273 Without debugging, the object is returned to the caller 1259-1261 If the object was poisoned with a known pattern, check it to guard against uninitialised access 1264-1265 If red zoning was enabled, check the marker at the beginning of the object and conﬁrm it is safe. Change the red marker to check for writes before the object later 1267-1269 Check the marker at the end of the object and change it to check for writes after the object later 1270 Update the object pointer to point to after the red marker 1273 Return the object Function: kmem_cache_alloc_batch (mm/slab.c) This function allocate a batch of objects to a CPU cache of objects. It is only used in the SMP case. In many ways it is very similar kmem_cache_alloc_one() (see Section 4.2.2). 1303 void* kmem_cache_alloc_batch(kmem_cache_t* cachep, cpucache_t* cc, int flags) 1304 { 1305 int batchcount = cachep->batchcount; 1306 1307 spin_lock(&cachep->spinlock); 1308 while (batchcount--) { 1309 struct list_head * slabs_partial, * entry; 1310 slab_t *slabp; 1311 /* Get slab alloc is to come from. */

4.2.3. Object Freeing 1312 1313 1314 1315 1316 1317 1318 1319 1320 1321 1322 1323 1324 1325 1326 1327 1328 1329 1330 1331 1332 1333 } slabs_partial = &(cachep)->slabs_partial; entry = slabs_partial->next; if (unlikely(entry == slabs_partial)) { struct list_head * slabs_free; slabs_free = &(cachep)->slabs_free; entry = slabs_free->next; if (unlikely(entry == slabs_free)) break; list_del(entry); list_add(entry, slabs_partial); } slabp = list_entry(entry, slab_t, list); cc_entry(cc)[cc->avail++] = kmem_cache_alloc_one_tail(cachep, slabp); } spin_unlock(&cachep->spinlock); if (cc->avail) return cc_entry(cc)[--cc->avail]; return NULL;

97

1303 The parameters are the cache to allocate from, the per CPU cache to ﬁll and allocation ﬂags 1305 batchcount is the number of objects to allocate 1307 Obtain the spinlock for access to the cache descriptor 1308-1327 Loop batchcount times 1309-1322 This is example the same as kmem_cache_alloc_one() (See Section 4.2.2) . It selects a slab from either slabs_partial or slabs_free to allocate from. If none are available, break out of the loop 1324-1325 Call kmem_cache_alloc_one_tail() (See Section 4.2.2) and place it in the per CPU cache. 1328 Release the cache descriptor lock 1330-1331 Take one of the objects allocated in this batch and return it 1332 If no object was allocated, return. __kmem_cache_alloc() will grow the cache by one slab and try again

4.2.3. Object Freeing

98

kmem_cache_free

__kmem_cache_free

kmem_cache_free_one

Figure 4.8: Call Graph: kmem_cache_free()

4.2.3

Object Freeing

Function: kmem_cache_free (mm/slab.c) 1574 1575 1576 1577 1578 1579 1580 1581 1582 1583 1584 1585 1586 void kmem_cache_free (kmem_cache_t *cachep, void *objp) { unsigned long flags; #if DEBUG CHECK_PAGE(virt_to_page(objp)); if (cachep != GET_PAGE_CACHE(virt_to_page(objp))) BUG(); #endif local_irq_save(flags); __kmem_cache_free(cachep, objp); local_irq_restore(flags); }

1574 The parameter is the cache the object is been freed from and the object itself 1577-1581 If debugging is enabled, the page will ﬁrst be checked with CHECK_PAGE() to make sure it is a slab page. Secondly the page list will be examined to make sure it belongs to this cache (See Section 4.1.2) 1583 Interrupts are disabled to protect the path 1584 __kmem_cache_free() will free the object to the per CPU cache for the SMP case and to the global pool in the normal case 1585 Re-enable interrupts

4.2.3. Object Freeing

99

Function: __kmem_cache_free (mm/slab.c) This covers what the function looks like in the UP case. Clearly, it simply releases the object to the slab. 1491 static inline void __kmem_cache_free (kmem_cache_t *cachep, void* objp) 1492 { 1515 kmem_cache_free_one(cachep, objp); 1517 } Function: __kmem_cache_free (mm/slab.c) This case is slightly more interesting. In this case, the object is released to the per-cpu cache if it is available. 1491 static inline void __kmem_cache_free (kmem_cache_t *cachep, void* objp) 1492 { 1494 cpucache_t *cc = cc_data(cachep); 1495 1496 CHECK_PAGE(virt_to_page(objp)); 1497 if (cc) { 1498 int batchcount; 1499 if (cc->avail < cc->limit) { 1500 STATS_INC_FREEHIT(cachep); 1501 cc_entry(cc)[cc->avail++] = objp; 1502 return; 1503 } 1504 STATS_INC_FREEMISS(cachep); 1505 batchcount = cachep->batchcount; 1506 cc->avail -= batchcount; 1507 free_block(cachep, 1508 &cc_entry(cc)[cc->avail],batchcount); 1509 cc_entry(cc)[cc->avail++] = objp; 1510 return; 1511 } else { 1512 free_block(cachep, &objp, 1); 1513 } 1517 } 1494 Get the data for this per CPU cache (See Section 4.4) 1496 Make sure the page is a slab page 1497-1511 If a per CPU cache is available, try to use it. This is not always available. During cache destruction for instance, the per CPU caches are already gone 1499-1503 If the number of available in the per CPU cache is below limit, then add the object to the free list and return 1504 Update Statistics if enabled

4.2.3. Object Freeing

100

1505 The pool has overﬂowed so batchcount number of objects is going to be freed to the global pool 1506 Update the number of available (avail) objects 1507-1508 Free a block of objects to the global cache 1509 Free the requested object and place it on the per CPU pool 1511 If the per CPU cache is not available, then free this object to the global pool Function: kmem_cache_free_one (mm/slab.c) 1412 static inline void kmem_cache_free_one(kmem_cache_t *cachep, void *objp) 1413 { 1414 slab_t* slabp; 1415 1416 CHECK_PAGE(virt_to_page(objp)); 1423 slabp = GET_PAGE_SLAB(virt_to_page(objp)); 1424 1425 #if DEBUG 1426 if (cachep->flags & SLAB_DEBUG_INITIAL) 1431 cachep->ctor(objp, cachep, SLAB_CTOR_CONSTRUCTOR|SLAB_CTOR_VERIFY); 1432 1433 if (cachep->flags & SLAB_RED_ZONE) { 1434 objp -= BYTES_PER_WORD; 1435 if (xchg((unsigned long *)objp, RED_MAGIC1) != RED_MAGIC2) 1436 BUG(); 1438 if (xchg((unsigned long *)(objp+cachep->objsize 1439 BYTES_PER_WORD), RED_MAGIC1) != RED_MAGIC2) 1441 BUG(); 1442 } 1443 if (cachep->flags & SLAB_POISON) 1444 kmem_poison_obj(cachep, objp); 1445 if (kmem_extra_free_checks(cachep, slabp, objp)) 1446 return; 1447 #endif 1448 { 1449 unsigned int objnr = (objp-slabp->s_mem)/cachep->objsize; 1450 1451 slab_bufctl(slabp)[objnr] = slabp->free; 1452 slabp->free = objnr; 1453 } 1454 STATS_DEC_ACTIVE(cachep);

4.2.3. Object Freeing 1455 1457 1458 1459 1460 1461 1462 1463 1464 1465 1466 1467 1468 1469 }

101

{ int inuse = slabp->inuse; if (unlikely(!--slabp->inuse)) { /* Was partial or full, now empty. */ list_del(&slabp->list); list_add(&slabp->list, &cachep->slabs_free); } else if (unlikely(inuse == cachep->num)) { /* Was full. */ list_del(&slabp->list); list_add(&slabp->list, &cachep->slabs_partial); } }

1416 Make sure the page is a slab page 1423 Get the slab descriptor for the page 1425-1447 Debugging material. Discussed at end of section 1449 Calculate the index for the object been freed 1452 As this object is now free, update the bufctl to reﬂect that. See the companion document for seeing how to track free objects. 1454 If statistics are enabled, disable the number of active objects in the slab 1459-1462 If inuse reaches 0, the slab is free and is moved to the slabs_free list 1463-1466 If the number in use equals the number of objects in a slab, it is full so move it to the slabs_full list 1469 Return 1426-1431 If SLAB_DEBUG_INITIAL is set, the constructor is called to verify the object is in an initialised state 1433-1442 Verify the red marks at either end of the object are still there. This will check for writes beyond the boundaries of the object and for double frees 1443-1444 Poison the freed object with a known pattern 1445-1446 This function will conﬁrm the object is a part of this slab and cache. It will then check the free list (bufctl) to make sure this is not a double free

4.3. Sizes Cache

102

Function: free_block (mm/slab.c) This function is only used in the SMP case when the per CPU cache gets too full. It is used to free a batch of objects in bulk 1479 static void free_block (kmem_cache_t* cachep, void** objpp, int len) 1480 { 1481 spin_lock(&cachep->spinlock); 1482 __free_block(cachep, objpp, len); 1483 spin_unlock(&cachep->spinlock); 1484 } 1479 The parameters are cachep The cache that objects are been freed from objpp Pointer to the ﬁrst object to free len The number of objects to free 1483 Acquire a lock to the cache descriptor 1484 Discussed in next section 1485 Release the lock Function: __free_block (mm/slab.c) This function is trivial. Starting with objpp, it will free len number of objects. 1472 static inline void __free_block (kmem_cache_t* cachep, 1473 void** objpp, int len) 1474 { 1475 for ( ; len > 0; len--, objpp++) 1476 kmem_cache_free_one(cachep, *objpp); 1477 }

4.3

Sizes Cache

Function: kmem_cache_sizes_init (mm/slab.c) This function is responsible for creating pairs of caches for small memory buﬀers suitable for either normal or DMA memory. 436 void __init kmem_cache_sizes_init(void) 437 { 438 cache_sizes_t *sizes = cache_sizes; 439 char name[20]; 440 444 if (num_physpages > (32 << 20) >> PAGE_SHIFT) 445 slab_break_gfp_order = BREAK_GFP_ORDER_HI;

4.3. Sizes Cache 446 452 453 454 455 456 457 458 460 461 462 463 464 465 466 467 468 469 470 471 } do { snprintf(name, sizeof(name), "size-%Zd", sizes->cs_size); if (!(sizes->cs_cachep = kmem_cache_create(name, sizes->cs_size, 0, SLAB_HWCACHE_ALIGN, NULL, NULL))) { BUG(); }

103

if (!(OFF_SLAB(sizes->cs_cachep))) { offslab_limit = sizes->cs_size-sizeof(slab_t); offslab_limit /= 2; } snprintf(name, sizeof(name), "size-%Zd(DMA)", sizes->cs_size); sizes->cs_dmacachep = kmem_cache_create(name, sizes->cs_size, 0, SLAB_CACHE_DMA|SLAB_HWCACHE_ALIGN, NULL, NULL); if (!sizes->cs_dmacachep) BUG(); sizes++; } while (sizes->cs_size);

438 Get a pointer to the cache_sizes array. See Section 4.3 439 The human readable name of the cache . Should be sized CACHE_NAMELEN which is deﬁned to be 20 long 444-445 slab_break_gfp_order determines how many pages a slab may use unless 0 objects ﬁt into the slab. It is statically initialised to BREAK_GFP_ORDER_LO (1). This check sees if more than 32MiB of memory is available and if it is, allow BREAK_GFP_ORDER_HI number of pages to be used because internal fragmentation is more acceptable when more memory is available. 446-470 Create two caches for each size of memory allocation needed 452 Store the human readable cache name in name 453-454 Create the cache, aligned to the L1 cache. See Section 4.0.1 460-463 Calculate the oﬀ-slab bufctl limit which determines the number of objects that can be stored in a cache when the slab descriptor is kept oﬀ-cache. 464 The human readable name for the cache for DMA use

4.3.1. kmalloc

104

465-466 Create the cache, aligned to the L1 cache and suitable for DMA user. See Section 4.0.1 467 if the cache failed to allocate, it is a bug. If memory is unavailable this early, the machine will not boot 469 Move to the next element in the cache_sizes array 470 The array is terminated with a 0 as the last element

4.3.1

kmalloc

With the existence of the sizes cache, the slab allocator is able to oﬀer a new allocator function, kmalloc() for use when small memory buﬀers are required. When a request is received, the appropriate sizes cache is selected and an object assigned from it. The call graph on Figure 4.9 is therefore very simple as all the hard work is in cache allocation (See Section 4.2.2)

kmalloc

__kmem_cache_alloc

Figure 4.9: kmalloc Function: kmalloc (mm/slab.c) 1553 void * kmalloc (size_t size, int flags) 1554 { 1555 cache_sizes_t *csizep = cache_sizes; 1556 1557 for (; csizep->cs_size; csizep++) { 1558 if (size > csizep->cs_size) 1559 continue; 1560 return __kmem_cache_alloc(flags & GFP_DMA ? 1561 csizep->cs_dmacachep : csizep->cs_cachep, flags); 1562 } 1563 return NULL; 1564 } 1555 cache_sizes is the array of caches for each size (See Section 4.3)

4.3.2. kfree

105

1557-1562 Starting with the smallest cache, examine the size of each cache until one large enough to satisfy the request is found 1560 If the allocation is for use with DMA, allocate an object from cs_dmacachep else use the cs_cachep 1563 If a sizes cache of suﬃcient size was not available or an object could not be allocated, return failure

4.3.2

kfree

Just as there is a kmalloc() function to allocate small memory objects for use, there is a kfree() for freeing it. As with kmalloc, the real work takes place during object freeing (See Section 4.2.3) so the call graph in Figure 4.9 is very simple.

kfree

__kmem_cache_free

Figure 4.10: kfree Function: kfree (mm/slab.c) It is worth noting that the work this function does is almost identical to the function kmem_cache_free() with debugging enabled (See Section 4.2.3). 1595 void kfree (const void *objp) 1596 { 1597 kmem_cache_t *c; 1598 unsigned long flags; 1599 1600 if (!objp) 1601 return; 1602 local_irq_save(flags); 1603 CHECK_PAGE(virt_to_page(objp)); 1604 c = GET_PAGE_CACHE(virt_to_page(objp)); 1605 __kmem_cache_free(c, (void*)objp); 1606 local_irq_restore(flags); 1607 } 1600 Return if the pointer is NULL. This is possible if a caller used kmalloc() and had a catch-all failure routine which called kfree() immediately

4.4. Per-CPU Object Cache 1602 Disable interrupts 1603 Make sure the page this object is in is a slab page 1604 Get the cache this pointer belongs to (See Section 4.1) 1605 Free the memory object 1606 Re-enable interrupts

106

4.4

Per-CPU Object Cache

One of the tasks the slab allocator is dedicated to is improved hardware cache utilization. An aim of high performance computing in general is to use data on the same CPU for as long as possible. Linux achieves this by trying to keep objects in the same CPU cache with a Per-CPU object cache, called a cpucache for each CPU in the system. When allocating or freeing objects, they are placed in the cpucache. When there is no objects free, a batch of objects is placed into the pool. When the pool gets too large, half of them are removed and placed in the global cache. This way the hardware cache will be used for as long as possible on the same CPU.

4.4.1

Describing the Per-CPU Object Cache

Each cache descriptor has a pointer to an array of cpucaches, described in the cache descriptor as 231 cpucache_t This structure is very simple 173 typedef struct cpucache_s { 174 unsigned int avail; 175 unsigned int limit; 176 } cpucache_t; avail is the number of free objects available on this cpucache limit is the total number of free objects that can exist A helper macro cc_data() is provided to give the cpucache for a given cache and processor. It is deﬁned as 180 #define cc_data(cachep) \ 181 ((cachep)->cpudata[smp_processor_id()]) This will take a given cache descriptor (cachep) and return a pointer from the cpucache array (cpudata). The index needed is the ID of the current processor, smp_processor_id(). Pointers to objects on the cpucache are placed immediately after the cpucache_t struct. This is very similar to how objects are stored after a slab descriptor illustrated in Section 4.1.2. *cpudata[NR_CPUS];

4.4.2. Adding/Removing Objects from the Per-CPU Cache

107

4.4.2

Adding/Removing Objects from the Per-CPU Cache

To prevent fragmentation, objects are always added or removed from the end of the array. To add an object (obj) to the CPU cache (cc), the following block of code is used cc_entry(cc)[cc->avail++] = obj; To remove an object obj = cc_entry(cc)[--cc->avail]; cc_entry() is a helper macro which gives a pointer to the ﬁrst object in the cpucache. It is deﬁned as 178 #define cc_entry(cpucache) \ 179 ((void **)(((cpucache_t*)(cpucache))+1)) This takes a pointer to a cpucache, increments the value by the size of the cpucache_t descriptor giving the ﬁrst object in the cache.

4.4.3

Enabling Per-CPU Caches

When a cache is created, its CPU cache has to be enabled and memory allocated for it using kmalloc. The function enable_cpucache() is responsible for deciding what size to make the cache and calling kmem_tune_cpucache() to allocate memory for it. Obviously a CPU cache cannot exist until after the various sizes caches have been enabled so a global variable g_cpucache_up() is used to prevent cpucache’s been enabled before it is possible. The function enable_all_cpucaches() cycles through all caches in the cache chain and enables their cpucache. Once the CPU cache has been setup, it can be accessed without locking as a CPU will never access the wrong cpucache so it is guaranteed safe access to it. Function: enable_all_cpucaches (mm/slab.c) This function locks the cache chain and enables the cpucache for every cache. This is important after the cache_cache and sizes cache have been enabled. 1712 static void enable_all_cpucaches (void) 1713 { 1714 struct list_head* p; 1715 1716 down(&cache_chain_sem); 1717 1718 p = &cache_cache.next; 1719 do { 1720 kmem_cache_t* cachep = list_entry(p, kmem_cache_t, next); 1721 1722 enable_cpucache(cachep); 1723 p = cachep->next.next;

4.4.3. Enabling Per-CPU Caches 1724 1725 1726 1727 } } while (p != &cache_cache.next); up(&cache_chain_sem);

108

1716 Obtain the semaphore to the cache chain 1717 Get the ﬁrst cache on the chain 1719-1724 Cycle through the whole chain 1720 Get a cache from the chain. This code will skip the ﬁrst cache on the chain but cache_cache doesn’t need a cpucache as it is so rarely used 1722 Enable the cpucache 1723 Move to the next cache on the chain 1724 Release the cache chain semaphore Function: enable_cpucache (mm/slab.c) This function calculates what the size of a cpucache should be based on the size of the objects the cache contains before calling kmem_tune_cpucache() which does the actual allocation. 1691 static void enable_cpucache (kmem_cache_t *cachep) 1692 { 1693 int err; 1694 int limit; 1695 1697 if (cachep->objsize > PAGE_SIZE) 1698 return; 1699 if (cachep->objsize > 1024) 1700 limit = 60; 1701 else if (cachep->objsize > 256) 1702 limit = 124; 1703 else 1704 limit = 252; 1705 1706 err = kmem_tune_cpucache(cachep, limit, limit/2); 1707 if (err) 1708 printk(KERN_ERR "enable_cpucache failed for %s, error %d.\n", 1709 cachep->name, -err); 1710 } 1697-1698 If an object is larger than a page, don’t have a Per CPU cache. They are too expensive

4.4.3. Enabling Per-CPU Caches

109

1699-1700 If an object is larger than 1KiB, keep the cpu cache below 3MiB in size. The limit is set to 124 objects to take the size of the cpucache descriptors into account 1701-1702 For smaller objects, just make sure the cache doesn’t go above 3MiB in size 1706 Allocate the memory for the cpucache 1708-1709 Print out an error message if the allocation failed Function: kmem_tune_cpucache (mm/slab.c) This function is responsible for allocating memory for the cpucaches. For each CPU on the system, kmalloc gives a block of memory large enough for one cpu cache and ﬁlls a cpupdate_struct_t struct. The function smp_call_function_all_cpus() then calls do_ccupdate_local() which swaps the new information with the old information in the cache descriptor. 1637 static int kmem_tune_cpucache (kmem_cache_t* cachep, int limit, int batchcount) 1638 { 1639 ccupdate_struct_t new; 1640 int i; 1641 1642 /* 1643 * These are admin-provided, so we are more graceful. 1644 */ 1645 if (limit < 0) 1646 return -EINVAL; 1647 if (batchcount < 0) 1648 return -EINVAL; 1649 if (batchcount > limit) 1650 return -EINVAL; 1651 if (limit != 0 && !batchcount) 1652 return -EINVAL; 1653 1654 memset(&new.new,0,sizeof(new.new)); 1655 if (limit) { 1656 for (i = 0; i< smp_num_cpus; i++) { 1657 cpucache_t* ccnew; 1658 1659 ccnew = kmalloc(sizeof(void*)*limit+ 1660 sizeof(cpucache_t), GFP_KERNEL); 1661 if (!ccnew) 1662 goto oom; 1663 ccnew->limit = limit; 1664 ccnew->avail = 0; 1665 new.new[cpu_logical_map(i)] = ccnew;

4.4.3. Enabling Per-CPU Caches 1666 1667 1668 1669 1670 1671 1672 1673 1674 1675 1676 1677 1678 1679 1680 1681 1682 1683 1684 1685 oom: 1686 1687 1688 1689 } } } new.cachep = cachep; spin_lock_irq(&cachep->spinlock); cachep->batchcount = batchcount; spin_unlock_irq(&cachep->spinlock); smp_call_function_all_cpus(do_ccupdate_local, (void *)&new); for (i = 0; i < smp_num_cpus; i++) { cpucache_t* ccold = new.new[cpu_logical_map(i)]; if (!ccold) continue; local_irq_disable(); free_block(cachep, cc_entry(ccold), ccold->avail); local_irq_enable(); kfree(ccold); } return 0; for (i--; i >= 0; i--) kfree(new.new[cpu_logical_map(i)]); return -ENOMEM;

110

1637 The parameters of the function are cachep The cache this cpucache is been allocated for limit The total number of objects that can exist in the cpucache batchcount The number of objects to allocate in one batch when the cpucache is empty 1645 The number of objects in the cache cannot be negative 1647 A negative number of objects cannot be allocated in batch 1649 A batch of objects greater than the limit cannot be allocated 1651 A batchcount must be provided if the limit is positive 1654 Zero ﬁll the update struct 1655 If a limit is provided, allocate memory for the cpucache 1656-1666 For every CPU, allocate a cpucache 1659 The amount of memory needed is limit number of pointers and the size of the cpucache descriptor

4.4.4. Updating Per-CPU Information 1661 If out of memory, clean up and exit 1663-1664 Fill in the ﬁelds for the cpucache descriptor 1665 Fill in the information for ccupdate_update_t struct 1668 Tell the ccupdate_update_t struct what cache is been updated

111

1669-1671 Acquire an interrupt safe lock to the cache descriptor and set its batchcount 1673 Get each CPU to update its cpucache information for itself. This swaps the old cpucaches in the cache descriptor with the new ones in new 1675-1683 After smp_call_function_all_cpus(), the old cpucaches are in new. This block of code cycles through them all, frees any objects in them and deletes the old cpucache 1684 Return success 1686 In the event there is no memory, delete all cpucaches that have been allocated up until this point and return failure

4.4.4

Updating Per-CPU Information

When the per-cpu caches have been created or changed, each CPU has to be told about it. It is not suﬃcient to change all the values in the cache descriptor as that would lead to cache coherency issues and spinlocks would have to used to protect the cpucache’s. Instead a ccupdate_t struct is populated with all the information each CPU needs and each CPU swaps the new data with the old information in the cache descriptor. The struct for storing the new cpucache information is deﬁned as follows 868 typedef struct ccupdate_struct_s 869 { 870 kmem_cache_t *cachep; 871 cpucache_t *new[NR_CPUS]; 872 } ccupdate_struct_t; The cachep is the cache been updated and the array new is of the cpucache descriptors for each CPU on the system. The function smp_function_all_cpus() is used to get each CPU to call the do_ccupdate_local() function which swaps the information from ccupdate_struct_t with the information in the cache descriptor. Once the information has been swapped, the old data can be deleted. Function: smp_function_all_cpus (mm/slab.c) This calls the function func() for all CPU’s. In the context of the slab allocator, the function is do_ccupdate_local() and the argument is ccupdate_struct_t.

4.4.5. Draining a Per-CPU Cache 859 static void smp_call_function_all_cpus(void (*func) (void *arg), void *arg) 860 { 861 local_irq_disable(); 862 func(arg); 863 local_irq_enable(); 864 865 if (smp_call_function(func, arg, 1, 1)) 866 BUG(); 867 } 861-863 Disable interrupts locally and call the function for this CPU

112

865 For all other CPU’s, call the function. smp_call_function() is an architecture speciﬁc function and will not be discussed further here Function: do_ccupdate_local (mm/slab.c) This function swaps the cpucache information in the cache descriptor with the information in info for this CPU. 874 static void do_ccupdate_local(void *info) 875 { 876 ccupdate_struct_t *new = (ccupdate_struct_t *)info; 877 cpucache_t *old = cc_data(new->cachep); 878 879 cc_data(new->cachep) = new->new[smp_processor_id()]; 880 new->new[smp_processor_id()] = old; 881 } 876 info is a pointer to the ccupdate_struct_t to pass to smp_call_function_all_cpus() 877 Part of the ccupdate_struct_t is a pointer to the cache this cpucache belongs to. cc_data() returns the cpucache_t for this processor 879 Place the new cpucache in cache descriptor. cc_data() returns the pointer to the cpucache for this CPU. 880 Replace the pointer in new with the old cpucache so it can be deleted later by the caller of smp_call_function_call_cpus(), kmem_tune_cpucache() for example

4.4.5

Draining a Per-CPU Cache

When a cache is been shrunk, its ﬁrst step is to drain the cpucaches of any objects they might have. This is so the slab allocator will have a clearer view of what slabs can be freed or not. This is important because if just one object in a slab is placed in a Per-CPU cache, that whole slab cannot be freed. If the system is tight on memory, saving a few milliseconds on allocations is the least of its trouble.

4.4.5. Draining a Per-CPU Cache Function: drain_cpu_caches (mm/slab.c) 885 static void drain_cpu_caches(kmem_cache_t *cachep) 886 { 887 ccupdate_struct_t new; 888 int i; 889 890 memset(&new.new,0,sizeof(new.new)); 891 892 new.cachep = cachep; 893 894 down(&cache_chain_sem); 895 smp_call_function_all_cpus(do_ccupdate_local, (void *)&new); 896 897 for (i = 0; i < smp_num_cpus; i++) { 898 cpucache_t* ccold = new.new[cpu_logical_map(i)]; 899 if (!ccold || (ccold->avail == 0)) 900 continue; 901 local_irq_disable(); 902 free_block(cachep, cc_entry(ccold), ccold->avail); 903 local_irq_enable(); 904 ccold->avail = 0; 905 } 906 smp_call_function_all_cpus(do_ccupdate_local, (void *)&new); 907 up(&cache_chain_sem); 908 } 890 Blank the update structure as it is going to be clearing all data

113

892 Set new.cachep to cachep so that smp_call_function_all_cpus() knows what cache it is aﬀecting 894 Acquire the cache descriptor semaphore 895 do_ccupdate_local() swaps the cpucache_t information in the cache descriptor with the ones in new so they can be altered here 897-905 For each CPU in the system .... 898 Get the cpucache descriptor for this CPU 899 If the structure does not exist for some reason or there is no objects available in it, move to the next CPU 901 Disable interrupts on this processor. It is possible an allocation from an interrupt handler elsewhere would try to access the per CPU cache 902 Free the block of objects (See Section 4.2.3)

4.5. Slab Allocator Initialisation 903 Re-enable interrupts 904 Show that no objects are available

114

906 The information for each CPU has been updated so call do_ccupdate_local() for each CPU to put the information back into the cache descriptor 907 Release the semaphore for the cache chain

4.5

Slab Allocator Initialisation

Here we will describe the slab allocator initialises itself. When the slab allocator creates a new cache, it allocates the kmem_cache_t from the cache_cache or kmem_cache cache. This is an obvious chicken and egg problem so the cache_cache has to be statically initialised as 357 static kmem_cache_t cache_cache = { 358 slabs_full: LIST_HEAD_INIT(cache_cache.slabs_full), 359 slabs_partial: LIST_HEAD_INIT(cache_cache.slabs_partial), 360 slabs_free: LIST_HEAD_INIT(cache_cache.slabs_free), 361 objsize: sizeof(kmem_cache_t), 362 flags: SLAB_NO_REAP, 363 spinlock: SPIN_LOCK_UNLOCKED, 364 colour_off: L1_CACHE_BYTES, 365 name: "kmem_cache", 366 }; 358-360 Initialise the three lists as empty lists 361 The size of each object is the size of a cache descriptor 362 The creation and deleting of caches is extremely rare so do not consider it for reaping ever 363 Initialise the spinlock unlocked 364 Align the objects to the L1 cache 365 The human readable name That statically deﬁnes all the ﬁelds that can be calculated at compile time. To initialise the rest of the struct, kmem_cache_init() is called from start_kernel(). Function: kmem_cache_init (mm/slab.c) This function will • Initialise the cache chain linked list • Initialise a mutex for accessing the cache chain • Calculate the cache_cache colour

4.6. Interfacing with the Buddy Allocator 416 void __init kmem_cache_init(void) 417 { 418 size_t left_over; 419 420 init_MUTEX(&cache_chain_sem); 421 INIT_LIST_HEAD(&cache_chain); 422 423 kmem_cache_estimate(0, cache_cache.objsize, 0, 424 &left_over, &cache_cache.num); 425 if (!cache_cache.num) 426 BUG(); 427 428 cache_cache.colour = left_over/cache_cache.colour_off; 429 cache_cache.colour_next = 0; 430 } 420 Initialise the semaphore for access the cache chain 421 Initialise the cache chain linked list

115

423 This estimates the number of objects and amount of bytes wasted. See Section 4.0.2 425 If even one kmem_cache_t cannot be stored in a page, there is something seriously wrong 428 colour is the number of diﬀerent cache lines that can be used while still keeping L1 cache alignment 429 colour_next indicates which line to use next. Start at 0

4.6

Interfacing with the Buddy Allocator

Function: kmem_getpages (mm/slab.c) This allocates pages for the slab allocator 486 static inline void * kmem_getpages (kmem_cache_t *cachep, unsigned long flags) 487 { 488 void *addr; 495 flags |= cachep->gfpflags; 496 addr = (void*) __get_free_pages(flags, cachep->gfporder); 503 return addr; 504 } 495 Whatever ﬂags were requested for the allocation, append the cache ﬂags to it. The only ﬂag it may append is GFP_DMA if the cache requires DMA memory 496 Call the buddy allocator (See Section 2.3) 503 Return the pages or NULL if it failed

4.6. Interfacing with the Buddy Allocator

116

Function: kmem_freepages (mm/slab.c) This frees pages for the slab allocator. Before it calls the buddy allocator API, it will remove the PG_slab bit from the page ﬂags 507 static inline void kmem_freepages (kmem_cache_t *cachep, void *addr) 508 { 509 unsigned long i = (1<<cachep->gfporder); 510 struct page *page = virt_to_page(addr); 511 517 while (i--) { 518 PageClearSlab(page); 519 page++; 520 } 521 free_pages((unsigned long)addr, cachep->gfporder); 522 } 509 Retrieve the order used for the original allocation 510 Get the struct page for the address 517-520 Clear the PG_slab bit on each page 521 Call the buddy allocator (See Section 2.4)

Chapter 5 Process Address Space
5.1 5.2 Managing the Address Space Process Memory Descriptors

The process address space is described by the mm_struct deﬁned in <linux/sched.h>

117

5.2. Process Memory Descriptors

118

210 struct mm_struct { 211 struct vm_area_struct * mmap; 212 rb_root_t mm_rb; 213 struct vm_area_struct * mmap_cache; 214 pgd_t * pgd; 215 atomic_t mm_users; 216 atomic_t mm_count; 217 int map_count; 218 struct rw_semaphore mmap_sem; 219 spinlock_t page_table_lock; 220 221 struct list_head mmlist; 222 226 unsigned long start_code, end_code, start_data, end_data; 227 unsigned long start_brk, brk, start_stack; 228 unsigned long arg_start, arg_end, env_start, env_end; 229 unsigned long rss, total_vm, locked_vm; 230 unsigned long def_flags; 231 unsigned long cpu_vm_mask; 232 unsigned long swap_address; 233 234 unsigned dumpable:1; 235 236 /* Architecture-specific MM context */ 237 mm_context_t context; 238 }; 239 mmap The head of a linked list of all VMA regions in the address space mm_rb The VMAs are arranged in a linked list and in a red-black tree. This is the root of the tree pgd The Page Global Directory for this process mm_users Count of the number of threads accessing an mm. A cloned thread will up this count to make sure an mm_struct is not destroyed early. The swap_out() code will increment this count when swapping out portions of the mm mm_count A reference count to the mm. This is important for lazy TLB switches where a task may be using one mm_struct temporarily map_count Number of VMAs in use mmap_sem This is a long lived lock which protects the vma list for readers and writers. As the taker could run for so long, a spinlock is inappropriate. A reader of the list

5.2.1. Allocating a Descriptor

119

takes this semaphore with down_read(). If they need to write, it must be taken with down_write() and the page_table_lock must be taken as well page_table_lock This protects a number of things. It protects the page tables, the rss count and the vma from modiﬁcation mmlist All mms are linked together via this ﬁeld start_code, end_code The start and end address of the code section start_data, end_data The start and end address of the data section start_brk, brk The start and end address of the heap start_stack Predictably enough, the start of the stack region arg_start, arg_end The start and end address of command line arguments env_start, env_end The start and end address of environment variables rss Resident Set Size, the number of resident pages for this process total_vm The total memory space occupied by all vma regions in the process locked_vm The amount of memory locked with mlock by the process def_flags It has only one possible value, VM_LOCKED. It is used to determine if all future mappings are locked by default or not cpu_vm_mask A bitmask representing all possible CPUs in an SMP system. The mask is used with IPI to determine if a processor should execute a particular function or not. This is important during TLB ﬂush for each CPU for example swap_address Used by the vmscan code to record the last address that was swapped from dumpable Set by prctl(), this ﬂag is important only to ptrace context Architecture speciﬁc MMU context

5.2.1

Allocating a Descriptor

Two functions are provided to allocate. To be slightly confusing, they are essentially the name. allocate_mm() will allocate a mm_struct from the slab allocator. mm_alloc() will allocate and call the function mm_init() to initialise it. Function: allocate_mm (kernel/fork.c) 226 #define allocate_mm() (kmem_cache_alloc(mm_cachep, SLAB_KERNEL))

226 Allocate a mm_struct from the slab allocator

5.2.2. Initalising a Descriptor Function: mm_alloc (kernel/fork.c) 247 struct mm_struct * mm_alloc(void) 248 { 249 struct mm_struct * mm; 250 251 mm = allocate_mm(); 252 if (mm) { 253 memset(mm, 0, sizeof(*mm)); 254 return mm_init(mm); 255 } 256 return NULL; 257 } 251 Allocate a mm_struct from the slab allocator 253 Zero out all contents of the struct 254 Perform basic initialisation

120

5.2.2

Initalising a Descriptor

The initial mm_struct in the system is called init_mm and is statically initialised at compile time using the macro INIT_MM(). 242 #define INIT_MM(name) \ 243 { \ 244 mm_rb: RB_ROOT, \ 245 pgd: swapper_pg_dir, \ 246 mm_users: ATOMIC_INIT(2), \ 247 mm_count: ATOMIC_INIT(1), \ 248 mmap_sem: __RWSEM_INITIALIZER(name.mmap_sem),\ 249 page_table_lock: SPIN_LOCK_UNLOCKED, \ 250 mmlist: LIST_HEAD_INIT(name.mmlist), \ 251 } Once it is established, new mm_struct’s are copies of their parent mm_struct copied using copy_mm() with the process speciﬁc ﬁelds initialised with init_mm(). Function: copy_mm (kernel/fork.c) This function makes a copy of the mm_struct for the given task. This is only called from do_fork() after a new process has been created and needs its own mm_struct. 314 static int copy_mm(unsigned long clone_flags, struct task_struct * tsk) 315 { 316 struct mm_struct * mm, *oldmm; 317 int retval; 318

5.2.2. Initalising a Descriptor 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 tsk->min_flt = tsk->maj_flt = 0; tsk->cmin_flt = tsk->cmaj_flt = 0; tsk->nswap = tsk->cnswap = 0; tsk->mm = NULL; tsk->active_mm = NULL; /* * Are we cloning a kernel thread? * * We need to steal a active VM for that.. */ oldmm = current->mm; if (!oldmm) return 0; if (clone_flags & CLONE_VM) { atomic_inc(&oldmm->mm_users); mm = oldmm; goto good_mm; } retval = -ENOMEM; mm = allocate_mm(); if (!mm) goto fail_nomem; /* Copy the current MM stuff.. */ memcpy(mm, oldmm, sizeof(*mm)); if (!mm_init(mm)) goto fail_nomem; if (init_new_context(tsk,mm)) goto free_pt; down_write(&oldmm->mmap_sem); retval = dup_mmap(mm); up_write(&oldmm->mmap_sem); if (retval) goto free_pt; /* * child gets a private LDT (if there was an LDT in the parent) */

121

5.2.2. Initalising a Descriptor 364 365 366 367 368 369 370 371 372 373 374 375 copy_segments(tsk, mm); good_mm: tsk->mm = mm; tsk->active_mm = mm; return 0; free_pt: mmput(mm); fail_nomem: return retval; }

122

314 The parameters are the ﬂags passed for clone and the task that is creating a copy of the mm_struct 319-324 Initialise the task_struct ﬁelds related to memory management 331 Borrow the mm of the current running process to copy from 332 A kernel thread has no mm so it can return immediately 335-340 If the CLONE_VM ﬂag is set, the child process is to share the mm with the parent process. This is required by users like pthreads. The mm_users ﬁeld is incremented so the mm is not destroyed prematurely later. The good_mm label sets the mm and active_mm and returns success 342 Allocate a new mm 347-349 Copy the parent mm and initialise the process speciﬁc mm ﬁelds with init_mm() 351-352 Initialise the MMU context for architectures that do not automatically manage their MMU 354-356 Call dup_mmap() which is responsible for copying all the VMAs regions in use by the parent process 358 dup_mmap() returns 0 on success. If it failed, the label free_pt will call mmput() which decrements the use count of the mm 364 This copies the LDT for the new process based on the parent process 367-369 Set the new mm, active_mm and return success Function: mm_init (kernel/fork.c) This function initialises process speciﬁc mm ﬁelds.

5.2.3. Destroying a Descriptor 229 static struct mm_struct * mm_init(struct mm_struct * mm) 230 { 231 atomic_set(&mm->mm_users, 1); 232 atomic_set(&mm->mm_count, 1); 233 init_rwsem(&mm->mmap_sem); 234 mm->page_table_lock = SPIN_LOCK_UNLOCKED; 235 mm->pgd = pgd_alloc(mm); 236 mm->def_flags = 0; 237 if (mm->pgd) 238 return mm; 239 free_mm(mm); 240 return NULL; 241 } 231 Set the number of users to 1 232 Set the reference count of the mm to 1 233 Initialise the semaphore protecting the VMA list 234 Initialise the spinlock protecting write access to it 235 Allocate a new PGD for the struct 236 By default, pages used by the process are not locked in memory 237 If a PGD exists, return the initialised struct 239 Initialisation failed, delete the mm_struct and return

123

5.2.3

Destroying a Descriptor

A new user to an mm increments the usage count with a simple call, atomic_inc(&mm->mm_users}; It is decremented with a call to mmput(). If the mm_users count reaches zero, all the mapped regions are deleted with exit_mmap() and the page tables destroyed as there is no longer any users of the userspace portions. The mm_count count is decremented with mmdrop() as all the users of the page tables and VMAs are counted as one mm_struct user. When mm_count reaches zero, the mm_struct will be destroyed. Function: mmput (kernel/fork.c) 275 void mmput(struct mm_struct *mm) 276 { 277 if (atomic_dec_and_lock(&mm->mm_users, &mmlist_lock)) { 278 extern struct mm_struct *swap_mm; 279 if (swap_mm == mm)

5.2.3. Destroying a Descriptor 280 281 282 283 284 285 286 287 } swap_mm = list_entry(mm->mmlist.next, struct mm_struct, mmlist); list_del(&mm->mmlist); mmlist_nr--; spin_unlock(&mmlist_lock); exit_mmap(mm); mmdrop(mm); }

124

277 Atomically decrement the mm_users ﬁeld while holding the mmlist_lock lock. Return with the lock held if the count reaches zero 278-285 If the usage count reaches zero, the mm and associated structures need to be removed 278-280 The swap_mm is the last mm that was swapped out by the vmscan code. If the current process was the last mm swapped, move to the next entry in the list 281 Remove this mm from the list 282-283 Reduce the count of mms in the list and release the mmlist lock 284 Remove all associated mappings 285 Delete the mm Function: mmdrop (include/linux/sched.h) 767 static inline void mmdrop(struct mm_struct * mm) 768 { 769 if (atomic_dec_and_test(&mm->mm_count)) 770 __mmdrop(mm); 771 } 769 Atomically decrement the reference count. The reference count could be higher if the mm was been used by lazy tlb switching tasks 770 If the reference count reaches zero, call __mmdrop() Function: __mmdrop (kernel/fork.c) 264 inline void __mmdrop(struct mm_struct *mm) 265 { 266 BUG_ON(mm == &init_mm); 267 pgd_free(mm->pgd); 268 destroy_context(mm); 269 free_mm(mm); 270 }

5.3. Memory Regions 266 Make sure the init_mm is not destroyed 267 Delete the PGD entry 268 Delete the LDT 269 Call kmem_cache_free() for the mm freeing it with the slab allocator

125

5.3

Memory Regions

44 struct vm_area_struct { 45 struct mm_struct * vm_mm; 46 unsigned long vm_start; 47 unsigned long vm_end; 49 50 /* linked list of VM areas per task, sorted by address */ 51 struct vm_area_struct *vm_next; 52 53 pgprot_t vm_page_prot; 54 unsigned long vm_flags; 55 56 rb_node_t vm_rb; 57 63 struct vm_area_struct *vm_next_share; 64 struct vm_area_struct **vm_pprev_share; 65 66 /* Function pointers to deal with this struct. */ 67 struct vm_operations_struct * vm_ops; 68 69 /* Information about our backing store: */ 70 unsigned long vm_pgoff; 72 struct file * vm_file; 73 unsigned long vm_raend; 74 void * vm_private_data; 75 };

vm_mm The mm_struct this VMA belongs to vm_start The starting address vm_end The end address vm_next All the VMAs in an address space are linked together in an address ordered linked list with this ﬁeld vm_page_prot The protection ﬂags for all pages in this VMA. See the companion document for a full list of ﬂags

5.3. Memory Regions

126

vm_rb As well as been in a linked list, all the VMAs are stored on a red-black tree for fast lookups vm_next_share Shared VMA regions such as shared library mappings are linked together with this ﬁeld vm_pprev_share The complement to vm_next_share vm_ops The vm_ops ﬁeld contains functions pointers for open(), close() and nopage(). These are needed for syncing with information from the disk vm_pgoff This is the page aligned oﬀset within a ﬁle that is mmap’ed vm_file The struct ﬁle pointer to the ﬁle been mapped vm_raend This is the end address of a readahead window. When a fault occurs, a readahead window will page in a number of pages after the fault address. This ﬁeld records how far to read ahead vm_private_data Used by some device drivers to store private information. Not of concern to the memory manager As mentioned, all the regions are linked together on a linked list ordered by address. When searching for a free area, it is a simple matter of traversing the list. A frequent operation is to search for the VMA for a particular address, during page faulting for example. In this case, the Red-Black tree is traversed as it has O(logN) search time on average. In the event the region is backed by a ﬁle, the vm_ﬁle leads to an associated address_space. The struct contains information of relevance to the ﬁlesystem such as the number of dirty pages which must be ﬂushed to disk. It is deﬁned as follows in <linux/fs.h> 400 struct address_space { 401 struct list_head clean_pages; 402 struct list_head dirty_pages; 403 struct list_head locked_pages; 404 unsigned long nrpages; 405 struct address_space_operations *a_ops; 406 struct inode *host; 407 struct vm_area_struct *i_mmap; 408 struct vm_area_struct *i_mmap_shared; 409 spinlock_t i_shared_lock; 410 int gfp_mask; 411 }; clean_pages A list of clean pages which do not have to be synchronized with the disk dirty_pages Pages that the process has touched and need to by sync-ed locked_pages The number of pages locked in memory

5.3. Memory Regions nrpages Number of resident pages in use by the address space a_ops A struct of function pointers within the ﬁlesystem host The host inode the ﬁle belongs to i_mmap A pointer to the vma the address space is part of i_mmap_shared A pointer to the next VMA which shares this address space i_shared_lock A spinlock to protect this structure gfp_mask The mask to use when calling __alloc_pages() for new pages

127

Periodically the memory manger will need to ﬂush information to disk. The memory manager doesn’t know and doesn’t care how information is written to disk, so the a_ops struct is used to call the relevant functions. It is deﬁned as follows in <linux/fs.h> 382 struct address_space_operations { 383 int (*writepage)(struct page *); 384 int (*readpage)(struct file *, struct page *); 385 int (*sync_page)(struct page *); 386 /* 387 * ext3 requires that a successful prepare_write() * call be followed 388 * by a commit_write() call - they must be balanced 389 */ 390 int (*prepare_write)(struct file *, struct page *, unsigned, unsigned); 391 int (*commit_write)(struct file *, struct page *, unsigned, unsigned); 392 /* Unfortunately this kludge is needed for FIBMAP. * Don’t use it */ 393 int (*bmap)(struct address_space *, long); 394 int (*flushpage) (struct page *, unsigned long); 395 int (*releasepage) (struct page *, int); 396 #define KERNEL_HAS_O_DIRECT 397 int (*direct_IO)(int, struct inode *, struct kiobuf *, unsigned long, int); 398 };

writepage Write a page to disk. The oﬀset within the ﬁle to write to is stored within the page struct. It is up to the ﬁlesystem speciﬁc code to ﬁnd the block. See buffer.c:block_write_full_page() readpage Read a page from disk. See buffer.c:block_read_full_page() sync_page Sync a dirty page with disk. See buffer.c:block_sync_page()

5.3.1. Creating A Memory Region

128

prepare_write This is called before data is copied from userspace into a page that will be written to disk. With a journaled ﬁlesystem, this ensures the ﬁlesystem log is up to date. With normal ﬁlesystems, it makes sure the needed buﬀer pages are allocated. See buffer.c:block_prepare_write() commit_write After the data has been copied from userspace, this function is called to commit the information to disk. See buffer.c:block_commit_write() bmap Maps a block so raw IO can be performed. Only of concern to the ﬁlesystem speciﬁc code. flushpage This makes sure there is no IO pending on a page before releasing it. See buffer.c:discard_bh_page() releasepage This tries to ﬂush all the buﬀers associated with a page before freeing the page itself. See try_to_free_buffers()

5.3.1

Creating A Memory Region

The system call mmap() is provided for creating new memory regions within a process. For the x86, the function calls sys_mmap2() which calls do_mmap2() directly with the same parameters. do_mmap2() is responsible for acquiring the parameters for do_mmap_pgoff() to use which is the principle function for creating new areas for all architectures. do_mmap2() ﬁrst clears the MAP_DENYWRITE and MAP_EXECUTABLE bits from the flags parameter as they are ignored by Linux, which is conﬁrmed by the mmap() manual page. If a ﬁle is being mapped, do_mmap2() will look up the struct file based on the ﬁle descriptor passed as a parameter and acquire the mm_struct→mmap_sem semaphore before calling do_mmap_pgoff(). do_mmap_pgoff() begins by performing some basic sanity checks. It ﬁrst checks the appropriate ﬁlesystem or device functions are available if a ﬁle or device is being mapped. It then ensures the size of the mapping is page aligned and that it does not attempt to create a mapping in the kernel portion of the address space. It then makes sure the size of the mapping does not overﬂow the range of pgoff and ﬁnally that the process does not have too many mapped regions already. Function: do_mmap_pgoﬀ (mm/mmap.c) This function is very large and so is broken up into a number of sections. Broadly speaking the sections are • Sanity check the parameters • Find a free linear address space large enough for the memory mapping. If a ﬁlesystem or device speciﬁc get_unmapped_area() function is provided, it will be used otherwise arch_get_unmapped_area() is called • Calculate the VM ﬂags and check them against the ﬁle access permissions • If an old area exists where the mapping is to take place, ﬁx it up so it is suitable for the new mapping

5.3.1. Creating A Memory Region

sys_mmap2

do_mmap2

do_mmap_pgoff

Figure 5.1: Call Graph: sys_mmap2()
vma_merge deny_write_access shmem_zero_setup find_vma_prepare vma_link make_pages_present

get_unmapped_area

zap_page_range

129

5.3.1. Creating A Memory Region • Allocate a vm_area_struct from the slab allocator and ﬁll in its entries • Link in the new VMA • Call the ﬁlesystem or device speciﬁc mmap() function • Update statistics and exit 393 unsigned long do_mmap_pgoff(struct file * file, unsigned long addr, unsigned long len, unsigned long prot, 394 unsigned long flags, unsigned long pgoff) 395 { 396 struct mm_struct * mm = current->mm; 397 struct vm_area_struct * vma, * prev; 398 unsigned int vm_flags; 399 int correct_wcount = 0; 400 int error; 401 rb_node_t ** rb_link, * rb_parent; 402 403 if (file && (!file->f_op || !file->f_op->mmap)) 404 return -ENODEV; 405 406 if ((len = PAGE_ALIGN(len)) == 0) 407 return addr; 408 409 if (len > TASK_SIZE) 410 return -EINVAL; 411 412 /* offset overflow? */ 413 if ((pgoff + (len >> PAGE_SHIFT)) < pgoff) 414 return -EINVAL; 415 416 /* Too many mappings? */ 417 if (mm->map_count > max_map_count) 418 return -ENOMEM; 419

130

393 The parameters which correspond directly to the parameters to the mmap system call are file the struct ﬁle to mmap if this is a ﬁle backed mapping addr the requested address to map len the length in bytes to mmap prot is the permissions on the area flags are the ﬂags for the mapping pgoff is the oﬀset within the ﬁle to begin the mmap at

5.3.1. Creating A Memory Region

131

403-404 If a ﬁle or device is been mapped, make sure a ﬁlesystem or device speciﬁc mmap function is provided. For most ﬁlesystems, this is generic_file_mmap() 406-407 Make sure a zero length mmap is not requested 409 Ensure that it is possible to map the requested area. PAGE_OFFSET or 3GiB The limit on the x86 is

413-414 Ensure the mapping will not overﬂow the end of the largest possible ﬁle size 417-488 Only max_map_count number of mappings are allowed. By default this value is DEFAULT_MAX_MAP_COUNT or 65536 mappings 420 421 422 423 424 425 426 /* Obtain the address to map to. we verify (or select) it and * ensure that it represents a valid section of the address space. */ addr = get_unmapped_area(file, addr, len, pgoff, flags); if (addr & ~PAGE_MASK) return addr;

423 After basic sanity checks, this function will call the device or ﬁle speciﬁc get_unmapped_area() function. If a device speciﬁc one is unavailable, arch_get_unmapped_area() is called. This function is discussed in Section 5.3.3 427 428 429 430 431 432 433 434 435 436 437 438 439 440 /* Do simple checking here so the lower-level routines won’t have * to. we assume access permissions have been handled by the open * of the memory object, so we don’t do any here. */ vm_flags = calc_vm_flags(prot,flags) | mm->def_flags | VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC; /* mlock MCL_FUTURE? */ if (vm_flags & VM_LOCKED) { unsigned long locked = mm->locked_vm << PAGE_SHIFT; locked += len; if (locked > current->rlim[RLIMIT_MEMLOCK].rlim_cur) return -EAGAIN; }

431 calc_vm_flags() translates the prot and flags from userspace and translates them to their VM_ equivalents

5.3.1. Creating A Memory Region

132

434-438 Check if it has been requested that all future mappings be locked in memory. If yes, make sure the process isn’t locking more memory than it is allowed to. If it is, return -EAGAIN

5.3.1. Creating A Memory Region

133

441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479

if (file) { switch (flags & MAP_TYPE) { case MAP_SHARED: if ((prot & PROT_WRITE) && !(file->f_mode & FMODE_WRITE)) return -EACCES; /* Make sure we don’t allow writing to an append-only file.. */ if (IS_APPEND(file->f_dentry->d_inode) && (file->f_mode & FMODE_WRITE)) return -EACCES; /* make sure there are no mandatory locks on the file. */ if (locks_verify_locked(file->f_dentry->d_inode)) return -EAGAIN; vm_flags |= VM_SHARED | VM_MAYSHARE; if (!(file->f_mode & FMODE_WRITE)) vm_flags &= ~(VM_MAYWRITE | VM_SHARED); /* fall through */ case MAP_PRIVATE: if (!(file->f_mode & FMODE_READ)) return -EACCES; break; default: return -EINVAL; } } else { vm_flags |= VM_SHARED | VM_MAYSHARE; switch (flags & MAP_TYPE) { default: return -EINVAL; case MAP_PRIVATE: vm_flags &= ~(VM_SHARED | VM_MAYSHARE); /* fall through */ case MAP_SHARED: break; } }

441-468 If a ﬁle is been memory mapped, check the ﬁles access permissions

5.3.1. Creating A Memory Region 444-445 If write access is requested, make sure the ﬁle is opened for write

134

448-449 Similarly, if the ﬁle is opened for append, make sure it cannot be written to. It is unclear why it is not the prot ﬁeld that is checked here 451 If the ﬁle is mandatory locked, return EAGAIN so the caller will try a second type 455-457 Fix up the ﬂags to be consistent with the ﬁle ﬂags 461-462 Make sure the ﬁle can be read before mmapping it 469-479 If the ﬁle is been mapped for anonymous use, ﬁx up the ﬂags if the requested mapping is MAP_PRIVATE to make sure the ﬂags are consistent 480 481 /* Clear old maps */ 482 munmap_back: 483 vma = find_vma_prepare(mm, addr, &prev, &rb_link, &rb_parent); 484 if (vma && vma->vm_start < addr + len) { 485 if (do_munmap(mm, addr, len)) 486 return -ENOMEM; 487 goto munmap_back; 488 } 489 490 /* Check against address space limit. */ 491 if ((mm->total_vm << PAGE_SHIFT) + len 492 > current->rlim[RLIMIT_AS].rlim_cur) 493 return -ENOMEM; 494 495 /* Private writable mapping? Check memory availability.. */ 496 if ((vm_flags & (VM_SHARED | VM_WRITE)) == VM_WRITE && 497 !(flags & MAP_NORESERVE) && 498 !vm_enough_memory(len >> PAGE_SHIFT)) 499 return -ENOMEM; 500 501 /* Can we just expand an old anonymous mapping? */ 502 if (!file && !(vm_flags & VM_SHARED) && rb_parent) 503 if (vma_merge(mm, prev, rb_parent, addr, addr + len, vm_flags)) 504 goto out; 505 483 This function steps through the RB tree for he vma corresponding to a given address 484-486 If a VMA was found and it is part of the new mmaping, remove the old mapping as the new one will cover both

5.3.1. Creating A Memory Region

135

491-493 Make sure the new mapping will not will not exceed the total VM a process is allowed to have. It is unclear why this check is not made earlier 496-499 If the caller does not speciﬁcally request that free space is not checked with MAP_NORESERVE and it is a private mapping, make sure enough memory is available to satisfy the mapping under current conditions 502-504 If two adjacent anonymous memory mappings can be treated as one, expand an old mapping rather than creating a new one

5.3.1. Creating A Memory Region

136

506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545

/* Determine the object being mapped and call the appropriate * specific mapper. the address has already been validated, but * not unmapped, but the maps are removed from the list. */ vma = kmem_cache_alloc(vm_area_cachep, SLAB_KERNEL); if (!vma) return -ENOMEM; vma->vm_mm = mm; vma->vm_start = addr; vma->vm_end = addr + len; vma->vm_flags = vm_flags; vma->vm_page_prot = protection_map[vm_flags & 0x0f]; vma->vm_ops = NULL; vma->vm_pgoff = pgoff; vma->vm_file = NULL; vma->vm_private_data = NULL; vma->vm_raend = 0; if (file) { error = -EINVAL; if (vm_flags & (VM_GROWSDOWN|VM_GROWSUP)) goto free_vma; if (vm_flags & VM_DENYWRITE) { error = deny_write_access(file); if (error) goto free_vma; correct_wcount = 1; } vma->vm_file = file; get_file(file); error = file->f_op->mmap(file, vma); if (error) goto unmap_and_free_vma; } else if (flags & MAP_SHARED) { error = shmem_zero_setup(vma); if (error) goto free_vma; }

510 Allocate a vm_area_struct from the slab allocator 514-523 Fill in the basic vm_area_struct ﬁelds

5.3.1. Creating A Memory Region 525-540 Fill in the ﬁle related ﬁelds if this is a ﬁle been mapped

137

527-528 These are both invalid ﬂags for a ﬁle mapping so free the vm_area_struct and return 529-534 This ﬂag is cleared by the system call mmap so it is unclear why the check is still made. Historically, an ETXTBUSY signal was sent to the calling process if the underlying ﬁle was been written to 535 Fill in the vm_file ﬁeld 536 This increments the ﬁle use count 537 Call the ﬁlesystem or device speciﬁc mmap function 538-539 If an error called, goto unmap_and_free_vma to clean up and return the error 541 If an anonymous shared mapping is required, call shmem_zero_setup() to do the hard work

5.3.1. Creating A Memory Region

138

546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579

/* Can addr have changed?? * * Answer: Yes, several device drivers can do it in their * f_op->mmap method. -DaveM */ if (addr != vma->vm_start) { /* * It is a bit too late to pretend changing the virtual * area of the mapping, we just corrupted userspace * in the do_munmap, so FIXME (not in 2.4 to avoid breaking * the driver API). */ struct vm_area_struct * stale_vma; /* Since addr changed, we rely on the mmap op to prevent * collisions with existing vmas and just use find_vma_prepare * to update the tree pointers. */ addr = vma->vm_start; stale_vma = find_vma_prepare(mm, addr, &prev, &rb_link, &rb_parent); /* * Make sure the lowlevel driver did its job right. */ if (unlikely(stale_vma && stale_vma->vm_start < vma->vm_end)) { printk(KERN_ERR "buggy mmap operation: [<%p>]\n", file ? file->f_op->mmap : NULL); BUG(); } } vma_link(mm, vma, prev, rb_link, rb_parent); if (correct_wcount) atomic_inc(&file->f_dentry->d_inode->i_writecount);

551-574 If the address has changed, it means the device speciﬁc mmap operation mapped the vma somewhere else. find_vma_prepare() is used to ﬁnd the new vma that was set up 576 Link in the new vm_area_struct 577-578 Update the ﬁle write count

5.3.2. Finding a Mapped Memory Region

139

580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599

out: mm->total_vm += len >> PAGE_SHIFT; if (vm_flags & VM_LOCKED) { mm->locked_vm += len >> PAGE_SHIFT; make_pages_present(addr, addr + len); } return addr; unmap_and_free_vma: if (correct_wcount) atomic_inc(&file->f_dentry->d_inode->i_writecount); vma->vm_file = NULL; fput(file); /* Undo any partial mapping done by a device driver. */ zap_page_range(mm, vma->vm_start, vma->vm_end - vma->vm_start); free_vma: kmem_cache_free(vm_area_cachep, vma); return error; }

581-586 Update statistics for the process mm_struct and return the new address 588-595 This is reached if the ﬁle has been partially mapped before failing. The write statistics are updated and then all user pages are removed with zap_page_range() 596-598 This goto is used if the mapping failed immediately after the vm_area_struct is created. It is freed back to the slab allocator before the error is returned

5.3.2

Finding a Mapped Memory Region

Function: ﬁnd_vma (mm/mmap.c) 659 struct vm_area_struct * find_vma(struct mm_struct * mm, unsigned long addr) 660 { 661 struct vm_area_struct *vma = NULL; 662 663 if (mm) { 664 /* Check the cache first. */ 665 /* (Cache hit rate is typically around 35%.) */ 666 vma = mm->mmap_cache; 667 if (!(vma && vma->vm_end > addr && vma->vm_start <= addr)) { 668 rb_node_t * rb_node;

5.3.2. Finding a Mapped Memory Region 669 670 671 672 673 674 675 676 677 678 679 680 681 682 683 684 685 686 687 688 689 690 691 }

140

rb_node = mm->mm_rb.rb_node; vma = NULL; while (rb_node) { struct vm_area_struct * vma_tmp; vma_tmp = rb_entry(rb_node, struct vm_area_struct, vm_rb); if (vma_tmp->vm_end > addr) { vma = vma_tmp; if (vma_tmp->vm_start <= addr) break; rb_node = rb_node->rb_left; } else rb_node = rb_node->rb_right; } if (vma) mm->mmap_cache = vma; } } return vma;

659 The two parameters are the top level mm_struct that is to be searched and the address the caller is interested in 661 Default to returning NULL for address not found 663 Make sure the caller does not try and search a bogus mm 666 mmap_cache has the result of the last call to find_vma(). This has a chance of not having to search at all through the red-black tree 667 If it is a valid VMA that is being examined, check to see if the address being searched is contained within it. If it is, the VMA was the mmap_cache one so it can be returned, otherwise the tree is searched 668-672 Start at the root of the tree 673-685 This block is the tree walk 676 The macro, as the name suggests, returns the VMA this tree node points to 678 Check if the next node traversed by the left or right leaf 680 If the current VMA is what is required, exit the while loop

5.3.2. Finding a Mapped Memory Region 687 If the VMA is valid, set the mmap_cache for the next call to find_vma()

141

690 Return the VMA that contains the address or as a side eﬀect of the tree walk, return the VMA that is closest to the requested address Function: ﬁnd_vma_prev (mm/mmap.c) 694 struct vm_area_struct * find_vma_prev(struct mm_struct * mm, unsigned long addr, 695 struct vm_area_struct **pprev) 696 { 697 if (mm) { 698 /* Go through the RB tree quickly. */ 699 struct vm_area_struct * vma; 700 rb_node_t * rb_node, * rb_last_right, * rb_prev; 701 702 rb_node = mm->mm_rb.rb_node; 703 rb_last_right = rb_prev = NULL; 704 vma = NULL; 705 706 while (rb_node) { 707 struct vm_area_struct * vma_tmp; 708 709 vma_tmp = rb_entry(rb_node, struct vm_area_struct, vm_rb); 710 711 if (vma_tmp->vm_end > addr) { 712 vma = vma_tmp; 713 rb_prev = rb_last_right; 714 if (vma_tmp->vm_start <= addr) 715 break; 716 rb_node = rb_node->rb_left; 717 } else { 718 rb_last_right = rb_node; 719 rb_node = rb_node->rb_right; 720 } 721 } 722 if (vma) { 723 if (vma->vm_rb.rb_left) { 724 rb_prev = vma->vm_rb.rb_left; 725 while (rb_prev->rb_right) 726 rb_prev = rb_prev->rb_right; 727 } 728 *pprev = NULL; 729 if (rb_prev) 730 *pprev = rb_entry(rb_prev, struct

5.3.3. Finding a Free Memory Region vm_area_struct, vm_rb); if ((rb_prev ? (*pprev)->vm_next : mm->mmap) != BUG(); return vma; } } *pprev = NULL; return NULL;

142

731 vma) 732 733 734 735 736 737 738 }

694-721 This is essentially the same as the find_vma() function already described. The only diﬀerence is that the last right node accesses is remembered as this will represent the vma previous to the requested vma. 723-727 If the returned VMA has a left node, it means that it has to be traversed. It ﬁrst takes the left leaf and then follows each right leaf until the bottom of the tree is found. 729-730 Extract the VMA from the red-black tree node 731-732 A debugging check, if this is the previous node, then its next ﬁeld should point to the VMA being returned. If it is not, it is a bug Function: ﬁnd_vma_intersection (include/linux/mm.h) 662 static inline struct vm_area_struct * find_vma_intersection(struct mm_struct * mm, unsigned long start_addr, unsigned long end_addr) 663 { 664 struct vm_area_struct * vma = find_vma(mm,start_addr); 665 666 if (vma && end_addr <= vma->vm_start) 667 vma = NULL; 668 return vma; 669 } 664 Return the VMA closest to the starting address 666 If a VMA is returned and the end address is still less than the beginning of the returned VMA, the VMA does not intersect 668 Return the VMA if it does intersect

5.3.3

Finding a Free Memory Region

Function: get_unmapped_area (mm/mmap.c)

5.3.3. Finding a Free Memory Region

143

get_unmapped_area

arch_get_unmapped_area

find_vma

Figure 5.2: Call Graph: get_unmapped_area() 642 unsigned long get_unmapped_area(struct file *file, unsigned long addr, unsigned long len, unsigned long pgoff, unsigned long flags) 643 { 644 if (flags & MAP_FIXED) { 645 if (addr > TASK_SIZE - len) 646 return -ENOMEM; 647 if (addr & ~PAGE_MASK) 648 return -EINVAL; 649 return addr; 650 } 651 652 if (file && file->f_op && file->f_op->get_unmapped_area) 653 return file->f_op->get_unmapped_area(file, addr, len, pgoff, flags); 654 655 return arch_get_unmapped_area(file, addr, len, pgoff, flags); 656 } 642 The parameters passed are file The ﬁle or device being mapped addr The requested address to map to len The length of the mapping pgoff The oﬀset within the ﬁle being mapped flags Protection ﬂags 644-650 Sanity checked. If it is required that the mapping be placed at the speciﬁed address, make sure it will not overﬂow the address space and that it is page aligned 652 If the struct ﬁle provides a get_unmapped_area() function, use it 655 Else use the architecture speciﬁc function

5.3.3. Finding a Free Memory Region Function: arch_get_unmapped_area (mm/mmap.c)

144

612 #ifndef HAVE_ARCH_UNMAPPED_AREA 613 static inline unsigned long arch_get_unmapped_area(struct file *filp, unsigned long addr, unsigned long len, unsigned long pgoff, unsigned long flags) 614 { 615 struct vm_area_struct *vma; 616 617 if (len > TASK_SIZE) 618 return -ENOMEM; 619 620 if (addr) { 621 addr = PAGE_ALIGN(addr); 622 vma = find_vma(current->mm, addr); 623 if (TASK_SIZE - len >= addr && 624 (!vma || addr + len <= vma->vm_start)) 625 return addr; 626 } 627 addr = PAGE_ALIGN(TASK_UNMAPPED_BASE); 628 629 for (vma = find_vma(current->mm, addr); ; vma = vma->vm_next) { 630 /* At this point: (!vma || addr < vma->vm_end). */ 631 if (TASK_SIZE - len < addr) 632 return -ENOMEM; 633 if (!vma || addr + len <= vma->vm_start) 634 return addr; 635 addr = vma->vm_end; 636 } 637 } 638 #else 639 extern unsigned long arch_get_unmapped_area(struct file *, unsigned long, unsigned long, unsigned long, unsigned long); 640 #endif 612 If this is not deﬁned, it means that the architecture does not provide its own arch_get_unmapped_area() so this one is used instead 613 The parameters are the same as those for get_unmapped_area() 617-618 Sanity check, make sure the required map length is not too long 620-626 If an address is provided, use it for the mapping 621 Make sure the address is page aligned 622 find_vma() will return the region closest to the requested address

5.3.4. Inserting a memory region

145

623-625 Make sure the mapping will not overlap with another region. If it does not, return it as it is safe to use. Otherwise it gets ignored 627 TASK_UNMAPPED_BASE is the starting point for searching for a free region to use 629-636 Starting from TASK_UNMAPPED_BASE, linearly search the VMAs until a large enough region between them is found to store the new mapping. This is essentially a ﬁrst ﬁt search 639 If an external function is provided, it still needs to be declared here

5.3.4

Inserting a memory region

insert_vm_struct

find_vma_prepare

vma_link

lock_vma_mappings

__vma_link

unlock_vma_mappings

__vma_link_list

__vma_link_rb

__vma_link_file

rb_insert_color

__rb_rotate_right

__rb_rotate_left

Figure 5.3: Call Graph: insert_vm_struct() Function: __insert_vm_struct (mm/mmap.c) This is the top level function for inserting a new vma into an address space. There is a second function like it called simply insert_vm_struct() that is not described in detail here as the only diﬀerence is the one line of code increasing the map_count. 1168 void __insert_vm_struct(struct mm_struct * mm, struct vm_area_struct * vma) 1169 { 1170 struct vm_area_struct * __vma, * prev;

5.3.4. Inserting a memory region 1171 1172 1173 1174 1175 1176 1177 1178 1179 } rb_node_t ** rb_link, * rb_parent; __vma = find_vma_prepare(mm, vma->vm_start, &prev, &rb_link, &rb_parent); if (__vma && __vma->vm_start < vma->vm_end) BUG(); __vma_link(mm, vma, prev, rb_link, rb_parent); mm->map_count++; validate_mm(mm);

146

1168 The arguments are the mm_struct mm that represents the linear address space and the vm_area_struct that is to be inserted 1173 find_vma_prepare() locates where the new vma can be inserted. It will be inserted between prev and __vma and the required nodes for the red-black tree are also returned 1174-1175 This is a check to make sure the returned vma is invalid. It is unclear how such a broken vma could exist 1176 This function does the actual work of linking the vma struct into the linear linked list and the red-black tree 1177 Increase the map_count to show a new mapping has been added 1178 validate_mm() is a debugging macro for red-black trees. If DEBUG_MM_RB is set, the linear list of VMAs and the tree will be traversed to make sure it is valid. The tree traversal is a recursive function so it is very important that that it is used only if really necessary as a large number of mappings could cause a stack overﬂow. If it is not set, validate_mm() does nothing at all Function: ﬁnd_vma_prepare (mm/mmap.c) This is responsible for ﬁnding the correct places to insert a VMA at the supplied address. It returns a number of pieces of information via the actual return and the function arguments. The forward VMA to link to is returned with return. pprev is the previous node which is required because the list is a singly linked list. rb_link and rb_parent are the parent and leaf node the new VMA will be inserted between. 246 static struct vm_area_struct * find_vma_prepare(struct mm_struct * mm, unsigned long addr, 247 struct vm_area_struct ** pprev, 248 rb_node_t *** rb_link, rb_node_t ** rb_parent) 249 { 250 struct vm_area_struct * vma; 251 rb_node_t ** __rb_link, * __rb_parent, * rb_prev; 252 253 __rb_link = &mm->mm_rb.rb_node;

5.3.4. Inserting a memory region 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 } rb_prev = __rb_parent = NULL; vma = NULL; while (*__rb_link) { struct vm_area_struct *vma_tmp; __rb_parent = *__rb_link; vma_tmp = rb_entry(__rb_parent, struct vm_area_struct, vm_rb); if (vma_tmp->vm_end > addr) { vma = vma_tmp; if (vma_tmp->vm_start <= addr) return vma; __rb_link = &__rb_parent->rb_left; } else { rb_prev = __rb_parent; __rb_link = &__rb_parent->rb_right; } } *pprev = NULL; if (rb_prev) *pprev = rb_entry(rb_prev, struct vm_area_struct, vm_rb); *rb_link = __rb_link; *rb_parent = __rb_parent; return vma;

147

246 The function arguments are described above 253-255 Initialise the search 267-272 This is a similar tree walk to what was described for find_vma(). The only real diﬀerence is the nodes last traversed are remembered with the __rb_link and __rb_parent variables 275-276 Get the back linking vma via the red-black tree 279 Return the forward linking VMA Function: vma_link (mm/mmap.c) This is the top-level function for linking a VMA into the proper lists. It is responsible for acquiring the necessary locks to make a safe insertion 337 static inline void vma_link(struct mm_struct * mm, struct vm_area_struct * vma, struct vm_area_struct * prev,

5.3.4. Inserting a memory region 338 339 { 340 341 342 343 344 345 346 347 348 } rb_node_t ** rb_link, rb_node_t * rb_parent) lock_vma_mappings(vma); spin_lock(&mm->page_table_lock); __vma_link(mm, vma, prev, rb_link, rb_parent); spin_unlock(&mm->page_table_lock); unlock_vma_mappings(vma); mm->map_count++; validate_mm(mm);

148

337 mm is the address space the vma is to be inserted into. prev is the backwards linked vma for the linear linked list of VMAs. rb_link and rb_parent are the nodes required to make the rb insertion 340 This function acquires the spinlock protecting the address_space representing the ﬁle that is been memory mapped. 341 Acquire the page table lock which protects the whole mm_struct 342 Insert the VMA 343 Free the lock protecting the mm_struct 345 Unlock the address_space for the ﬁle 346 Increase the number of mappings in this mm 347 If DEBUG_MM_RB is set, the RB trees and linked lists will be checked to make sure they are still valid Function: __vma_link (mm/mmap.c) This simply calls three helper functions which are responsible for linking the VMA into the three linked lists that link VMAs together. 329 static void __vma_link(struct mm_struct * mm, struct vm_area_struct * vma, struct vm_area_struct * prev, 330 rb_node_t ** rb_link, rb_node_t * rb_parent) 331 { 332 __vma_link_list(mm, vma, prev, rb_parent); 333 __vma_link_rb(mm, vma, rb_link, rb_parent); 334 __vma_link_file(vma); 335 } 332 This links the VMA into the linear linked lists of VMAs in this mm via the vm_next field

5.3.4. Inserting a memory region

149

333 This links the VMA into the red-black tree of VMAs in this mm whose root is stored in the vm_rb ﬁeld 334 This links the VMA into the shared mapping VMA links. Memory mapped ﬁles are linked together over potentially many mms by this function via the vm_next_share and vm_pprev_share ﬁelds Function: __vma_link_list (mm/mmap.c) 282 static inline void __vma_link_list(struct mm_struct * mm, struct vm_area_struct * vma, struct vm_area_struct * prev, 283 rb_node_t * rb_parent) 284 { 285 if (prev) { 286 vma->vm_next = prev->vm_next; 287 prev->vm_next = vma; 288 } else { 289 mm->mmap = vma; 290 if (rb_parent) 291 vma->vm_next = rb_entry(rb_parent, struct vm_area_struct, vm_rb); 292 else 293 vma->vm_next = NULL; 294 } 295 } 285 If prev is not null, the vma is simply inserted into the list 289 Else this is the ﬁrst mapping and the ﬁrst element of the list has to be stored in the mm_struct 290 The vma is stored as the parent node Function: __vma_link_rb (mm/mmap.c) The principle workings of this function are stored within <linux/rbtree.h> and will not be discussed in detail with this document. 297 static inline void __vma_link_rb(struct mm_struct * mm, struct vm_area_struct * vma, 298 rb_node_t ** rb_link, rb_node_t * rb_parent) 299 { 300 rb_link_node(&vma->vm_rb, rb_parent, rb_link); 301 rb_insert_color(&vma->vm_rb, &mm->mm_rb); 302 }

5.3.5. Merging contiguous region Function: __vma_link_ﬁle (mm/mmap.c) This function links the VMA into a linked list of shared ﬁle mappings. 304 static inline void __vma_link_file(struct vm_area_struct * vma) 305 { 306 struct file * file; 307 308 file = vma->vm_file; 309 if (file) { 310 struct inode * inode = file->f_dentry->d_inode; 311 struct address_space *mapping = inode->i_mapping; 312 struct vm_area_struct **head; 313 314 if (vma->vm_flags & VM_DENYWRITE) 315 atomic_dec(&inode->i_writecount); 316 317 head = &mapping->i_mmap; 318 if (vma->vm_flags & VM_SHARED) 319 head = &mapping->i_mmap_shared; 320 321 /* insert vma into inode’s share list */ 322 if((vma->vm_next_share = *head) != NULL) 323 (*head)->vm_pprev_share = &vma->vm_next_share; 324 *head = vma; 325 vma->vm_pprev_share = head; 326 } 327 }

150

309 Check to see if this VMA has a shared ﬁle mapping. If it does not, this function has nothing more to do 310-312 Extract the relevant information about the mapping from the VMA 314-315 If this mapping is not allowed to write even if the permissions are ok for writing, decrement the i_writecount ﬁeld. A negative value to this ﬁeld indicates that the ﬁle is memory mapped and may not be written to. Eﬀorts to open the ﬁle for writing will now fail 317-319 Check to make sure this is a shared mapping 322-325 Insert the VMA into the shared mapping linked list

5.3.5

Merging contiguous region

Function: vma_merge (mm/mmap.c) This function checks to see if a region pointed to be prev may be expanded forwards to cover the area from addr to end instead of allocating a new VMA. If it cannot, the VMA ahead is checked to see can it be expanded backwards instead.

5.3.5. Merging contiguous region

151

350 static int vma_merge(struct mm_struct * mm, struct vm_area_struct * prev, 351 rb_node_t * rb_parent, unsigned long addr, unsigned long end, unsigned long vm_flags) 352 { 353 spinlock_t * lock = &mm->page_table_lock; 354 if (!prev) { 355 prev = rb_entry(rb_parent, struct vm_area_struct, vm_rb); 356 goto merge_next; 357 } 358 if (prev->vm_end == addr && can_vma_merge(prev, vm_flags)) { 359 struct vm_area_struct * next; 360 361 spin_lock(lock); 362 prev->vm_end = end; 363 next = prev->vm_next; 364 if (next && prev->vm_end == next->vm_start && can_vma_merge(next, vm_flags)) { 365 prev->vm_end = next->vm_end; 366 __vma_unlink(mm, next, prev); 367 spin_unlock(lock); 368 369 mm->map_count--; 370 kmem_cache_free(vm_area_cachep, next); 371 return 1; 372 } 373 spin_unlock(lock); 374 return 1; 375 } 376 377 prev = prev->vm_next; 378 if (prev) { 379 merge_next: 380 if (!can_vma_merge(prev, vm_flags)) 381 return 0; 382 if (end == prev->vm_start) { 383 spin_lock(lock); 384 prev->vm_start = addr; 385 spin_unlock(lock); 386 return 1; 387 } 388 } 389 390 return 0; 391 }

5.3.5. Merging contiguous region 350 The parameters are as follows; mm The mm the VMAs belong to prev The VMA before the address we are interested in rb_parent The parent RB node as returned by find_vma_prepare() addr The starting address of the region to be merged end The end of the region to be merged vm_ﬂags The permission ﬂags of the region to be merged 353 This is the lock to the mm struct

152

354-357 If prev is not passed it, it is taken to mean that the VMA being tested for merging is in front of the region from addr to end. The entry for that VMA is extracted from the rb_parent 358-375 Check to see can the region pointed to by prev may be expanded to cover the current region 358 The function can_vma_merge() checks the permissions of prev with those in vm_flags and that the VMA has no ﬁle mappings. If it is true, the area at prev may be expanded 361 Lock the mm struct 362 Expand the end of the VMA region (vm_end) to the end of the new mapping (end) 363 next is now the VMA in front of the newly expanded VMA 364 Check if the expanded region can be merged with the VMA in front of it 365 If it can, continue to expand the region to cover the next VMA 366 As a VMA has been merged, one region is now defunct and may be unlinked 367 No further adjustments are made to the mm struct so the lock is released 369 There is one less mapped region to reduce the map_count 370 Delete the struct describing the merged VMA 371 Return success 377 If this line is reached it means the region pointed to by prev could not be expanded forward so a check is made to see if the region ahead can be merged backwards instead 382-388 Same idea as the above block except instead of adjusted vm_end to cover end, vm_start is expanded to cover addr

5.3.6. Remapping and moving a memory region

153

Function: can_vma_merge (include/linux/mm.h) This trivial function checks to see if the permissions of the supplied VMA match the permissions in vm_flags 571 static inline int can_vma_merge(struct vm_area_struct * vma, unsigned long vm_flags) 572 { 573 if (!vma->vm_file && vma->vm_flags == vm_flags) 574 return 1; 575 else 576 return 0; 577 } 573 Self explanatory, true if there is no ﬁle/device mapping and the ﬂags equal each other

5.3.6

Remapping and moving a memory region

Function: sys_mremap (mm/mremap.c)
sys_mremap

do_mremap

do_munmap

find_vma

vm_enough_memory

make_pages_present

get_unmapped_area

move_vma

nr_free_pages

Figure 5.4: Call Graph: sys_mremap This is the system service call to remap a memory region 342 asmlinkage unsigned long sys_mremap(unsigned long addr, 343 unsigned long old_len, unsigned long new_len, 344 unsigned long flags, unsigned long new_addr) 345 { 346 unsigned long ret; 347 348 down_write(&current->mm->mmap_sem); 349 ret = do_mremap(addr, old_len, new_len, flags, new_addr); 350 up_write(&current->mm->mmap_sem); 351 return ret; 352 } 353

5.3.6. Remapping and moving a memory region 342-344 The parameters are the same as those described in the mremap man page 348 Acquire the mm semaphore 349 do_mremap() is the top level function for remapping a region 350 Release the mm semaphore 351 Return the status of the remapping

154

Function: do_mremap (mm/mremap.c) This function does most of the actual “work” required to remap, resize and move a memory region. It is quite long but can be broken up into distinct parts which will be dealt with separately here. The tasks are broadly speaking • Check usage ﬂags and page align lengths • Handle the condition where MAP_FIXED is set and the region is been moved to a new location. • If a region is shrinking, allow it to happen unconditionally • If the region is growing or moving, perform a number of checks in advance to make sure the move is allowed and safe • Handle the case where the region is been expanded and cannot be moved • Finally handle the case where the region has to be resized and moved 214 unsigned long do_mremap(unsigned long addr, 215 unsigned long old_len, unsigned long new_len, 216 unsigned long flags, unsigned long new_addr) 217 { 218 struct vm_area_struct *vma; 219 unsigned long ret = -EINVAL; 220 221 if (flags & ~(MREMAP_FIXED | MREMAP_MAYMOVE)) 222 goto out; 223 224 if (addr & ~PAGE_MASK) 225 goto out; 226 227 old_len = PAGE_ALIGN(old_len); 228 new_len = PAGE_ALIGN(new_len); 229 214 The parameters of the function are addr is the old starting address

5.3.6. Remapping and moving a memory region old_len is the old region length new_len is the new region length

155

flags is the option ﬂags passed. If MREMAP_MAYMOVE is speciﬁed, it means that the region is allowed to move if there is not enough linear address space at the current space. If MREMAP_FIXED is speciﬁed, it means that the whole region is to move to the speciﬁed new_addr with the new length. The area from new_addr to new_addr+new_len will be unmapped with do_munmap(). new_addr is the address of the new region if it is moved 219 At this point, the default return is EINVAL for invalid arguments 221-222 Make sure ﬂags other than the two allowed ﬂags are not used 224-225 The address passed in must be page aligned 227-228 Page align the passed region lengths 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 if (flags & MREMAP_FIXED) { if (new_addr & ~PAGE_MASK) goto out; if (!(flags & MREMAP_MAYMOVE)) goto out; if (new_len > TASK_SIZE || new_addr > TASK_SIZE - new_len) goto out; /* Check if the location we’re moving into overlaps the * old location at all, and fail if it does. */ if ((new_addr <= addr) && (new_addr+new_len) > addr) goto out; if ((addr <= new_addr) && (addr+old_len) > new_addr) goto out; do_munmap(current->mm, new_addr, new_len); }

This block handles the condition where the region location is ﬁxed and must be fully moved. It ensures the area been moved to is safe and deﬁnitely unmapped. 231 MREMAP_FIXED is the ﬂag which indicates the location is ﬁxed 232-233 The new_addr requested has to be page aligned 234-235 If MREMAP_FIXED is speciﬁed, then the MAYMOVE ﬂag must be used as well 237-238 Make sure the resized region does not exceed TASK_SIZE

5.3.6. Remapping and moving a memory region

156

243-244 Just as the comments indicate, the two regions been used for the move may not overlap 249 Unmap the region that is about to be used. It is presumed the caller ensures that the region is not in use for anything important 256 257 258 259 260 261 ret = addr; if (old_len >= new_len) { do_munmap(current->mm, addr+new_len, old_len - new_len); if (!(flags & MREMAP_FIXED) || (new_addr == addr)) goto out; }

256 At this point, the address of the resized region is the return value 257 If the old length is larger than the new length, then the region is shrinking 258 Unmap the unused region 259-230 If the region is not to be moved, either because MREMAP_FIXED is not used or the new address matches the old address, goto out which will return the address 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 ret = -EFAULT; vma = find_vma(current->mm, addr); if (!vma || vma->vm_start > addr) goto out; /* We can’t remap across vm area boundaries */ if (old_len > vma->vm_end - addr) goto out; if (vma->vm_flags & VM_DONTEXPAND) { if (new_len > old_len) goto out; } if (vma->vm_flags & VM_LOCKED) { unsigned long locked = current->mm->locked_vm << PAGE_SHIFT; locked += new_len - old_len; ret = -EAGAIN; if (locked > current->rlim[RLIMIT_MEMLOCK].rlim_cur) goto out; } ret = -ENOMEM; if ((current->mm->total_vm << PAGE_SHIFT) + (new_len - old_len) > current->rlim[RLIMIT_AS].rlim_cur) goto out; /* Private writable mapping? Check memory availability.. */ if ((vma->vm_flags & (VM_SHARED | VM_WRITE)) == VM_WRITE && !(flags & MAP_NORESERVE) && !vm_enough_memory((new_len - old_len) >> PAGE_SHIFT)) goto out;

5.3.6. Remapping and moving a memory region Do a number of checks to make sure it is safe to grow or move the region

157

266 At this point, the default action is to return EFAULT causing a segmentation fault as the ranges of memory been used are invalid 267 Find the VMA responsible for the requested address 268 If the returned VMA is not responsible for this address, then an invalid address was used so return a fault 271-272 If the old_len passed in exceeds the length of the VMA, it means the user is trying to remap multiple regions which is not allowed 273-276 If the VMA has been explicitly marked as non-resizable, raise a fault 277-278 If the pages for this VMA must be locked in memory, recalculate the number of locked pages that will be kept in memory. If the number of pages exceed the ulimit set for this resource, return EAGAIN indicating to the caller that the region is locked and cannot be resized 284 The default return at this point is to indicate there is not enough memory 285-287 Ensure that the user will not exist their allowed allocation of memory 289-292 Ensure that there is enough memory to satisfy the request after the resizing 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 if (old_len == vma->vm_end - addr && !((flags & MREMAP_FIXED) && (addr != new_addr)) && (old_len != new_len || !(flags & MREMAP_MAYMOVE))) { unsigned long max_addr = TASK_SIZE; if (vma->vm_next) max_addr = vma->vm_next->vm_start; /* can we just expand the current mapping? */ if (max_addr - addr >= new_len) { int pages = (new_len - old_len) >> PAGE_SHIFT; spin_lock(&vma->vm_mm->page_table_lock); vma->vm_end = addr + new_len; spin_unlock(&vma->vm_mm->page_table_lock); current->mm->total_vm += pages; if (vma->vm_flags & VM_LOCKED) { current->mm->locked_vm += pages; make_pages_present(addr + old_len, addr + new_len); } ret = addr; goto out; } }

5.3.6. Remapping and moving a memory region Handle the case where the region is been expanded and cannot be moved 297 If it is the full region that is been remapped and ... 298 The region is deﬁnitely not been moved and ... 299 The region is been expanded and cannot be moved then ... 300 Set the maximum address that can be used to TASK_SIZE, 3GiB on an x86

158

301-302 If there is another region, set the max address to be the start of the next region 304-317 Only allow the expansion if the newly sized region does not overlap with the next VMA 305 Calculate the number of extra pages that will be required 306 Lock the mm spinlock 307 Expand the VMA 308 Free the mm spinlock 309 Update the statistics for the mm 310-314 If the pages for this region are locked in memory, make them present now 315-316 Return the address of the resized region can t 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 out: 339 340 } ret = -ENOMEM; if (flags & MREMAP_MAYMOVE) { if (!(flags & MREMAP_FIXED)) { unsigned long map_flags = 0; if (vma->vm_flags & VM_SHARED) map_flags |= MAP_SHARED; new_addr = get_unmapped_area(vma->vm_file, 0, new_len, vma->vm_pgoff, map_flags); ret = new_addr; if (new_addr & ~PAGE_MASK) goto out; } ret = move_vma(vma, addr, old_len, new_len, new_addr); } return ret;

To expand the region, a new one has to be allocated and the old one moved to it

5.3.6. Remapping and moving a memory region 324 The default action is to return saying no memory is available 325 Check to make sure the region is allowed to move

159

326 If MREMAP_FIXED is not speciﬁed, it means the new location was not supplied so one must be found 328-329 Preserve the MAP_SHARED option 331 Find an unmapped region of memory large enough for the expansion 332 The return value is the address of the new region 333-334 For the returned address to be not page aligned, get_unmapped_area() would need to be broken. This could possibly be the case with a buggy device driver implementing get_unmapped_area() incorrectly 336 Call move_vma to move the region 338-339 Return the address if successful and the error code otherwise Function: move_vma (mm/mremap.c)

move_vma

find_vma_prev

move_page_tables

insert_vm_struct

do_munmap

make_pages_present

Figure 5.5: Call Graph: move_vma This function is responsible for moving all the page table entries from one VMA to another region. If necessary a new VMA will be allocated for the region being moved to. Just like the function above, it is very long but may be broken up into the following distinct parts. • Function preamble, ﬁnd the VMA preceding the area about to be moved to and the VMA in front of the region to be mapped • Handle the case where the new location is between two existing VMAs. See if the preceding region can be expanded forward or the next region expanded backwards to cover the new mapped region • Handle the case where the new location is going to be the last VMA on the list. See if the preceding region can be expanded forward • If a region could not be expanded, allocate a new VMA from the slab allocator

5.3.6. Remapping and moving a memory region

160

• Call move_page_tables(), ﬁll in the new VMA details if a new one was allocated and update statistics before returning 125 static inline unsigned long move_vma(struct vm_area_struct * vma, 126 unsigned long addr, unsigned long old_len, unsigned long new_len, 127 unsigned long new_addr) 128 { 129 struct mm_struct * mm = vma->vm_mm; 130 struct vm_area_struct * new_vma, * next, * prev; 131 int allocated_vma; 132 133 new_vma = NULL; 134 next = find_vma_prev(mm, new_addr, &prev); 125-127 The parameters are vma The VMA that the address been moved belongs to addr The starting address of the moving region old_len The old length of the region to move new_len The new length of the region moved new_addr The new address to relocate to 134 Find the VMA preceding the address been moved to indicated by prev and return the region after the new mapping as next 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 if (next) { if (prev && prev->vm_end == new_addr && can_vma_merge(prev, vma->vm_flags) && !vma->vm_file && !(vma->vm_flags & VM_SHARED)) { spin_lock(&mm->page_table_lock); prev->vm_end = new_addr + new_len; spin_unlock(&mm->page_table_lock); new_vma = prev; if (next != prev->vm_next) BUG(); if (prev->vm_end == next->vm_start && can_vma_merge(next, prev->vm_flags)) { spin_lock(&mm->page_table_lock); prev->vm_end = next->vm_end; __vma_unlink(mm, next, prev); spin_unlock(&mm->page_table_lock); mm->map_count--; kmem_cache_free(vm_area_cachep, next); } } else if (next->vm_start == new_addr + new_len &&

5.3.6. Remapping and moving a memory region 154 155 156 157 158 159 160 can_vma_merge(next, vma->vm_flags) && !vma->vm_file && !(vma->vm_flags & VM_SHARED)) { spin_lock(&mm->page_table_lock); next->vm_start = new_addr; spin_unlock(&mm->page_table_lock); new_vma = next; } } else {

161

In this block, the new location is between two existing VMAs. Checks are made to see can be preceding region be expanded to cover the new mapping and then if it can be expanded to cover the next VMA as well. If it cannot be expanded, the next region is checked to see if it can be expanded backwards. 136-137 If the preceding region touches the address to be mapped to and may be merged then enter this block which will attempt to expand regions 138 Lock the mm 139 Expand the preceding region to cover the new location 140 Unlock the mm 141 The new vma is now the preceding VMA which was just expanded 142-143 Unnecessary check to make sure the VMA linked list is intact. It is unclear how this situation could possibly occur 144 Check if the region can be expanded forward to encompass the next region 145 If it can, then lock the mm 146 Expand the VMA further to cover the next VMA 147 There is now an extra VMA so unlink it 148 Unlock the mm 150 There is one less mapping now so update the map_count 151 Free the memory used by the memory mapping 153 Else the prev region could not be expanded forward so check if the region pointed to be next may be expanded backwards to cover the new mapping instead 155 If it can, lock the mm 156 Expand the mapping backwards 157 Unlock the mm 158 The VMA representing the new mapping is now next

5.3.6. Remapping and moving a memory region 161 162 163 164 165 166 167 168 169

162

prev = find_vma(mm, new_addr-1); if (prev && prev->vm_end == new_addr && can_vma_merge(prev, vma->vm_flags) && !vma->vm_file && !(vma->vm_flags & VM_SHARED)) { spin_lock(&mm->page_table_lock); prev->vm_end = new_addr + new_len; spin_unlock(&mm->page_table_lock); new_vma = prev; } }

This block is for the case where the newly mapped region is the last VMA (next is NULL) so a check is made to see can the preceding region be expanded. 161 Get the previously mapped region 162-163 Check if the regions may be mapped 164 Lock the mm 165 Expand the preceding region to cover the new mapping 166 Lock the mm 167 The VMA representing the new mapping is now prev 170 171 172 173 174 175 176 177 178

allocated_vma = 0; if (!new_vma) { new_vma = kmem_cache_alloc(vm_area_cachep, SLAB_KERNEL); if (!new_vma) goto out; allocated_vma = 1; }

171 Set a ﬂag indicating if a new VMA was not allocated 172 If a VMA has not been expanded to cover the new mapping then... 173 Allocate a new VMA from the slab allocator 174-175 If it could not be allocated, goto out to return failure 176 Set the ﬂag indicated a new VMA was allocated 179 180 181 if (!move_page_tables(current->mm, new_addr, addr, old_len)) { if (allocated_vma) { *new_vma = *vma;

5.3.6. Remapping and moving a memory region 182 183 184 new_vma->vm_start = new_addr; new_vma->vm_end = new_addr+new_len; new_vma->vm_pgoff += (addr - vma->vm_start) >> PAGE_SHIFT; new_vma->vm_raend = 0; if (new_vma->vm_file) get_file(new_vma->vm_file); if (new_vma->vm_ops && new_vma->vm_ops->open) new_vma->vm_ops->open(new_vma); insert_vm_struct(current->mm, new_vma);

163

185 186 187 188 189 190 191 } 192 do_munmap(current->mm, addr, old_len); 193 current->mm->total_vm += new_len >> PAGE_SHIFT; 194 if (new_vma->vm_flags & VM_LOCKED) { 195 current->mm->locked_vm += new_len >> PAGE_SHIFT; 196 make_pages_present(new_vma->vm_start, 197 new_vma->vm_end); 198 } 199 return new_addr; 200 } 201 if (allocated_vma) 202 kmem_cache_free(vm_area_cachep, new_vma); 203 out: 204 return -ENOMEM; 205 } 179 move_page_tables() is responsible for copying all the page table entries. It returns 0 on success 180-191 If a new VMA was allocated, ﬁll in all the relevant details, including the ﬁle/device entries and insert it into the various VMA linked lists with insert_vm_struct() 192 Unmap the old region as it is no longer required 193 Update the total_vm size for this process. The size of the old region is not important as it is handled within do_munmap() 194-198 If the VMA has the VM_LOCKED ﬂag, all the pages within the region are made present with mark_pages_present() 199 Return the address of the new region 201-202 This is the error path. If a VMA was allocated, delete it 204 Return an out of memory error

5.3.6. Remapping and moving a memory region

164

move_page_tables

move_one_page

zap_page_range

get_one_pte

alloc_one_pte

copy_one_pte

zap_pmd_range

pte_alloc

zap_pte_range

Figure 5.6: Call Graph: move_page_tables() Function: move_page_tables (mm/mremap.c) This function is responsible copying all the page table entries from the region pointed to be old_addr to new_addr. It works by literally copying page table entries one at a time. When it is ﬁnished, it deletes all the entries from the old area. This is not the most eﬃcient way to perform the operation, but it is very easy to error recover. 90 static int move_page_tables(struct mm_struct * mm, 91 unsigned long new_addr, unsigned long old_addr, unsigned long len) 92 { 93 unsigned long offset = len; 94 95 flush_cache_range(mm, old_addr, old_addr + len); 96 102 while (offset) { 103 offset -= PAGE_SIZE; 104 if (move_one_page(mm, old_addr + offset, new_addr + offset)) 105 goto oops_we_failed; 106 } 107 flush_tlb_range(mm, old_addr, old_addr + len); 108 return 0; 109 117 oops_we_failed: 118 flush_cache_range(mm, new_addr, new_addr + len); 119 while ((offset += PAGE_SIZE) < len) 120 move_one_page(mm, new_addr + offset, old_addr + offset); 121 zap_page_range(mm, new_addr, len); 122 return -1;

5.3.6. Remapping and moving a memory region 123 }

165

90 The parameters are the mm for the process, the new location, the old location and the length of the region to move entries for 95 flush_cache_range() will ﬂush all CPU caches for this range. It must be called ﬁrst as some architectures, notably Sparc’s require that a virtual to physical mapping exist before ﬂushing the TLB 102-106 This loops through each page in the region and calls move_one_page() to move the PTE. This translates to a lot of page table walking and could be performed much better but it is a rare operation 107 Flush the TLB for the old region 108 Return success 118-120 This block moves all the PTEs back. A flush_tlb_range() is not necessary as there is no way the region could have been used yet so no TLB entries should exist 121 Zap any pages that were allocated for the move 122 Return failure Function: move_one_page (mm/mremap.c) This function is responsible for acquiring the spinlock before ﬁnding the correct PTE with get_one_pte() and copying it with copy_one_pte() 77 static int move_one_page(struct mm_struct *mm, unsigned long old_addr, unsigned long new_addr) 78 { 79 int error = 0; 80 pte_t * src; 81 82 spin_lock(&mm->page_table_lock); 83 src = get_one_pte(mm, old_addr); 84 if (src) 85 error = copy_one_pte(mm, src, alloc_one_pte(mm, new_addr)); 86 spin_unlock(&mm->page_table_lock); 87 return error; 88 } 82 Acquire the mm lock 83 Call get_one_pte() which walks the page tables to get the correct PTE 84-85 If the PTE exists, allocate a PTE for the destination and call copy_one_pte() to copy the PTEs 86 Release the lock 87 Return whatever copy_one_pte() returned

5.3.6. Remapping and moving a memory region Function: get_one_pte (mm/mremap.c) This is a very simple page table walk. 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 static inline { pgd_t * pmd_t * pte_t *

166

pte_t *get_one_pte(struct mm_struct *mm, unsigned long addr) pgd; pmd; pte = NULL;

pgd = pgd_offset(mm, addr); if (pgd_none(*pgd)) goto end; if (pgd_bad(*pgd)) { pgd_ERROR(*pgd); pgd_clear(pgd); goto end; } pmd = pmd_offset(pgd, addr); if (pmd_none(*pmd)) goto end; if (pmd_bad(*pmd)) { pmd_ERROR(*pmd); pmd_clear(pmd); goto end; } pte = pte_offset(pmd, addr); if (pte_none(*pte)) pte = NULL; end: return pte; }

24 Get the PGD for this address 25-26 If no PGD exists, return NULL as no PTE will exist either 27-31 If the PGD is bad, mark that an error occurred in the region, clear its contents and return NULL 33-40 Acquire the correct PMD in the same fashion as for the PGD 42 Acquire the PTE so it may be returned if it exists

5.3.6. Remapping and moving a memory region Function: alloc_one_pte (mm/mremap.c) Trivial function to allocate what is necessary for one PTE in a region. 49 static inline pte_t *alloc_one_pte(struct mm_struct *mm, unsigned long addr) 50 { 51 pmd_t * pmd; 52 pte_t * pte = NULL; 53 54 pmd = pmd_alloc(mm, pgd_offset(mm, addr), addr); 55 if (pmd) 56 pte = pte_alloc(mm, pmd, addr); 57 return pte; 58 } 54 If a PMD entry does not exist, allocate it

167

55-56 If the PMD exists, allocate a PTE entry. The check to make sure it succeeded is performed later in the function copy_one_pte() Function: copy_one_pte (mm/mremap.c) Copies the contents of one PTE to another. 60 static inline int copy_one_pte(struct mm_struct *mm, pte_t * src, pte_t * dst) 61 { 62 int error = 0; 63 pte_t pte; 64 65 if (!pte_none(*src)) { 66 pte = ptep_get_and_clear(src); 67 if (!dst) { 68 /* No dest? We must put it back. */ 69 dst = src; 70 error++; 71 } 72 set_pte(dst, pte); 73 } 74 return error; 75 } 65 If the source PTE does not exist, just return 0 to say the copy was successful 66 Get the PTE and remove it from its old location 67-71 If the dst does not exist, it means the call to alloc_one_pte() failed and the copy operation has failed and must be aborted 72 Move the PTE to its new location 74 Return an error if one occurred

5.3.7. Locking a Memory Region

168

5.3.7

Locking a Memory Region

Function: sys_mlock (mm/mlock.c) This is the system call mlock() for locking a region of memory into physical memory. This function simply checks to make sure that process and user limits are not exceeeded and that the region to lock is page aligned. 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 asmlinkage long sys_mlock(unsigned long start, size_t len) { unsigned long locked; unsigned long lock_limit; int error = -ENOMEM; down_write(&current->mm->mmap_sem); len = PAGE_ALIGN(len + (start & ~PAGE_MASK)); start &= PAGE_MASK; locked = len >> PAGE_SHIFT; locked += current->mm->locked_vm; lock_limit = current->rlim[RLIMIT_MEMLOCK].rlim_cur; lock_limit >>= PAGE_SHIFT; /* check against resource limits */ if (locked > lock_limit) goto out; /* we may lock at most half of physical memory... */ /* (this check is pretty bogus, but doesn’t hurt) */ if (locked > num_physpages/2) goto out; error = do_mlock(start, len, 1); out: up_write(&current->mm->mmap_sem); return error; }

201 Take the semaphore, we are likely to sleep during this so a spinlock can not be used 202 Round the length up to the page boundary 203 Round the start address down to the page boundary 205 Calculate how many pages will be locked 206 Calculate how many pages will be locked in total by this process

5.3.7. Locking a Memory Region 208-209 Calculate what the limit is to the number of locked pages 212-213 Do not allow the process to lock more than it should 217-218 Do not allow the process to map more than half of physical memory

169

220 Call do_mlock() which starts the “real” work by ﬁnd the VMA clostest to the area to lock before calling mlock_fixup() 222 Free the semaphore 223 Return the error or success code from do_mmap() Function: sys_mlockall (mm/mlock.c) This is the system call mlockall() which attempts to lock all pages in the calling process in memory. If MCL_CURRENT is speciﬁed, all current pages will be locked. If MCL_FUTURE is speciﬁed, all future mappings will be locked. The ﬂags may be or-ed together. 238 static int do_mlockall(int flags) 239 { 240 int error; 241 unsigned int def_flags; 242 struct vm_area_struct * vma; 243 244 if (!capable(CAP_IPC_LOCK)) 245 return -EPERM; 246 247 def_flags = 0; 248 if (flags & MCL_FUTURE) 249 def_flags = VM_LOCKED; 250 current->mm->def_flags = def_flags; 251 252 error = 0; 253 for (vma = current->mm->mmap; vma ; vma = vma->vm_next) { 254 unsigned int newflags; 255 256 newflags = vma->vm_flags | VM_LOCKED; 257 if (!(flags & MCL_CURRENT)) 258 newflags &= ~VM_LOCKED; 259 error = mlock_fixup(vma, vma->vm_start, vma->vm_end, newflags); 260 if (error) 261 break; 262 } 263 return error; 264 } 244-245 The calling process must be either root or have CAP_IPC_LOCK capabilities

5.3.7. Locking a Memory Region

170

248-250 The MCL_FUTURE ﬂag says that all future pages should be locked so if set, the def_flags for VMAs should be VM_LOCKED 253-262 Cycle through all VMAs 256 Set the VM_LOCKED ﬂag in the current VMA ﬂags 257-258 If the MCL_CURRENT ﬂag has not been set requesting that all current pages be locked, then clear the VM_LOCKED ﬂag. The logic is arranged like this so that the unlock code can use this same function just with no ﬂags 259 Call mlock_fixup() which will adjust the regions as necessary 260-261 If a non-zero value is returned at any point, stop locking. It is interesting to note that VMAs already locked will not be unlocked 263 Return the success or error value Function: do_mlock (mm/mlock.c) This function is is responsible for starting the work needed to either lock or unlock a region depending on the value of the on parameter. It is broken up into two sections. The ﬁrst makes sure the region is page aligned (despite the fact the only two callers of this function do the same thing) before ﬁnding the VMA that is to be adjusted. The second part then sets the appropriate ﬂags before calling mlock_fixup() for each VMA that is aﬀected by this locking. 148 static int do_mlock(unsigned long start, size_t len, int on) 149 { 150 unsigned long nstart, end, tmp; 151 struct vm_area_struct * vma, * next; 152 int error; 153 154 if (on && !capable(CAP_IPC_LOCK)) 155 return -EPERM; 156 len = PAGE_ALIGN(len); 157 end = start + len; 158 if (end < start) 159 return -EINVAL; 160 if (end == start) 161 return 0; 162 vma = find_vma(current->mm, start); 163 if (!vma || vma->vm_start > start) 164 return -ENOMEM; Page align the request and ﬁnd the VMA 154 Only root processes can lock pages

5.3.7. Locking a Memory Region

171

156 Page align the length despite it being already done in the calling function. This is probably an oversight 157-159 Calculate the end of the locking and make sure it is a valid region. Return EINVAL if it is not 160-161 if locking a region of size 0, just return 162 Find the VMA that will be aﬀected by this locking 163-164 If the VMA for this address range does not exist, return -ENOMEM 165 166 167 168 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 }

for (nstart = start ; ; ) { unsigned int newflags;

newflags = vma->vm_flags | VM_LOCKED; if (!on) newflags &= ~VM_LOCKED; if (vma->vm_end >= end) { error = mlock_fixup(vma, nstart, end, newflags); break; } tmp = vma->vm_end; next = vma->vm_next; error = mlock_fixup(vma, nstart, tmp, newflags); if (error) break; nstart = tmp; vma = next; if (!vma || vma->vm_start != nstart) { error = -ENOMEM; break; } } return error;

Walk through the VMAs aﬀected by this locking and call mlock_fixup() for each of them. 166-192 Cycle through as many VMAs as necessary to lock the pages 171 Set the VM_LOCKED ﬂag on the VMA

5.3.8. Unlocking the region 172-173 Unless this is an unlock in which case, remove the ﬂag

172

175-177 If this VMA is the last VMA to be aﬀected by the unlocking, call mlock_fixup() with the end address for the locking and exit 180-190 Else this is whole VMA needs to be locked so call mlock_fixup() with the end of this VMA as a paramter rather than the end of the actual locking 180 tmp is the end of the mapping on this VMA 181 next is the next VMA that will be aﬀected by the locking 182 Call mlock_fixup() for this VMA 183-184 If an error occurs, back out. Note that the VMAs already locked are not ﬁxed up right 185 The next start address is the start of the next VMA 186 Move to the next VMA 187-190 If there is no VMA , return -ENOMEM. The next condition though would require the regions to be extremly broken as a result of mlock_fixup() or have overlapping VMAs 192 Return the error or success value

5.3.8

Unlocking the region

Function: sys_munlock (mm/mlock.c) Page align the request before calling do_mlock() which begins the real work of ﬁxing up the regions. 226 asmlinkage long sys_munlock(unsigned long start, size_t len) 227 { 228 int ret; 229 230 down_write(&current->mm->mmap_sem); 231 len = PAGE_ALIGN(len + (start & ~PAGE_MASK)); 232 start &= PAGE_MASK; 233 ret = do_mlock(start, len, 0); 234 up_write(&current->mm->mmap_sem); 235 return ret; 236 } 230 Acquire the semaphore protecting the mm_struct 231 Round the length of the region up to the nearest page boundary 232 Round the start of the region down to the nearest page boundary

5.3.9. Fixing up regions after locking/unlocking 233 Call do_mlock() to ﬁx up the regions 234 Release the semaphore 235 Return the success or failure code

173

Function: sys_munlockall (mm/mlock.c) Trivial function. If the ﬂags to mlockall are 0 it gets translated as none of the current pages must be present and no future mappings should be locked either which means the VM_LOCKED ﬂag will be removed on all VMAs. 293 asmlinkage long sys_munlockall(void) 294 { 295 int ret; 296 297 down_write(&current->mm->mmap_sem); 298 ret = do_mlockall(0); 299 up_write(&current->mm->mmap_sem); 300 return ret; 301 } 297 Acquire the semaphore protecting the mm_struct 298 Call do_mlockall() with 0 as ﬂags which will remove the VM_LOCKED from all VMAs 299 Release the semaphore 300 Return the error or success code

5.3.9

Fixing up regions after locking/unlocking

Function: mlock_ﬁxup (mm/mlock.c) This function identiﬁes four separate types of locking that must be addressed. There ﬁrst is where the full VMA is to be locked where it calls mlock_fixup_all(). The second is where only the beginning portion of the VMA is aﬀected, handled by mlock_fixup_start(). The third is the locking of a region at the end handled by mlock_fixup_end() and the last is locking a region in the middle of the VMA with mlock_fixup_middle(). 117 static int mlock_fixup(struct vm_area_struct * vma, 118 unsigned long start, unsigned long end, unsigned int newflags) 119 { 120 int pages, retval; 121 122 if (newflags == vma->vm_flags) 123 return 0; 124 125 if (start == vma->vm_start) { 126 if (end == vma->vm_end)

5.3.9. Fixing up regions after locking/unlocking 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 } retval = mlock_fixup_all(vma, newflags); else retval = mlock_fixup_start(vma, end, newflags); } else { if (end == vma->vm_end) retval = mlock_fixup_end(vma, start, newflags); else retval = mlock_fixup_middle(vma, start, end, newflags); } if (!retval) { /* keep track of amount of locked VM */ pages = (end - start) >> PAGE_SHIFT; if (newflags & VM_LOCKED) { pages = -pages; make_pages_present(start, end); } vma->vm_mm->locked_vm -= pages; } return retval;

174

122-123 If no change is to be made, just return 125 If the start of the locking is at the start of the VMA, it means that either the full region is to the locked or only a portion at the beginning 126-127 The full VMA is been locked, call mlock_fixup_all() 128-129 Only a portion is to be locked, call mlock_fixup_start() 130 Else either the a region at the end is to be locked or a region in the middle 131-132 The end of the locking match the end of the VMA, call mlock_fixup_end() 133-134 A region in the middle is to be locked, call mlock_fixup_middle() 136-144 The ﬁxup functions return 0 on success. If the ﬁxup of the regions succeed and the regions are now marked as locked, call make_pages_present() which makes some basic checks before calling get_user_pages() which faults in all the pages in the same way the page fault handler does Function: mlock_ﬁxup_all (mm/mlock.c) 15 static inline int mlock_fixup_all(struct vm_area_struct * vma, int newflags) 16 { 17 spin_lock(&vma->vm_mm->page_table_lock);

5.3.9. Fixing up regions after locking/unlocking 18 19 20 21 } vma->vm_flags = newflags; spin_unlock(&vma->vm_mm->page_table_lock); return 0;

175

17-19 Trivial, lock the VMA with the spinlock, set the new ﬂags, release the lock and return success Function: mlock_ﬁxup_start (mm/mlock.c) Slightly more compilcated. A new VMA is required to represent the aﬀected region. The start of the old VMA is moved forward 23 static inline int mlock_fixup_start(struct vm_area_struct * vma, 24 unsigned long end, int newflags) 25 { 26 struct vm_area_struct * n; 27 28 n = kmem_cache_alloc(vm_area_cachep, SLAB_KERNEL); 29 if (!n) 30 return -EAGAIN; 31 *n = *vma; 32 n->vm_end = end; 33 n->vm_flags = newflags; 34 n->vm_raend = 0; 35 if (n->vm_file) 36 get_file(n->vm_file); 37 if (n->vm_ops && n->vm_ops->open) 38 n->vm_ops->open(n); 39 vma->vm_pgoff += (end - vma->vm_start) >> PAGE_SHIFT; 40 lock_vma_mappings(vma); 41 spin_lock(&vma->vm_mm->page_table_lock); 42 vma->vm_start = end; 43 __insert_vm_struct(current->mm, n); 44 spin_unlock(&vma->vm_mm->page_table_lock); 45 unlock_vma_mappings(vma); 46 return 0; 47 } 28 Alloc a VMA from the slab allocator for the aﬀected region 31-34 Copy in the necessary information 35-36 If the VMA has a ﬁle or device mapping, get_file() will increment the reference count 37-38 If an open() function is provided, call it

5.3.9. Fixing up regions after locking/unlocking

176

39 Update the oﬀset within the ﬁle or device mapping for the old VMA to be the end of the locked region 40 lock_vma_mappings() will lock any ﬁles if this VMA is a shared region 41-44 Lock the parent mm_struct, update its start to be the end of the aﬀected region, insert the new VMA into the processes linked lists (See Section 5.3.4) and release the lock 45 Unlock the ﬁle mappings with unlock_vma_mappings() 46 Return success Function: mlock_ﬁxup_end (mm/mlock.c) Essentially the same as mlock_fixup_start() except the aﬀected region is at the end of the VMA. 49 static inline int mlock_fixup_end(struct vm_area_struct * vma, 50 unsigned long start, int newflags) 51 { 52 struct vm_area_struct * n; 53 54 n = kmem_cache_alloc(vm_area_cachep, SLAB_KERNEL); 55 if (!n) 56 return -EAGAIN; 57 *n = *vma; 58 n->vm_start = start; 59 n->vm_pgoff += (n->vm_start - vma->vm_start) >> PAGE_SHIFT; 60 n->vm_flags = newflags; 61 n->vm_raend = 0; 62 if (n->vm_file) 63 get_file(n->vm_file); 64 if (n->vm_ops && n->vm_ops->open) 65 n->vm_ops->open(n); 66 lock_vma_mappings(vma); 67 spin_lock(&vma->vm_mm->page_table_lock); 68 vma->vm_end = start; 69 __insert_vm_struct(current->mm, n); 70 spin_unlock(&vma->vm_mm->page_table_lock); 71 unlock_vma_mappings(vma); 72 return 0; 73 } 54 Alloc a VMA from the slab allocator for the aﬀected region 57-61 Copy in the necessary information and update the oﬀset within the ﬁle or device mapping

5.3.9. Fixing up regions after locking/unlocking

177

62-63 If the VMA has a ﬁle or device mapping, get_file() will increment the reference count 64-65 If an open() function is provided, call it 66 lock_vma_mappings() will lock any ﬁles if this VMA is a shared region 67-70 Lock the parent mm_struct, update its start to be the end of the aﬀected region, insert the new VMA into the processes linked lists (See Section 5.3.4) and release the lock 71 Unlock the ﬁle mappings with unlock_vma_mappings() 72 Return success Function: mlock_ﬁxup_middle (mm/mlock.c) Similar to the previous two ﬁxup functions except that 2 new regions are required to ﬁx up the mapping. 75 static inline int mlock_fixup_middle(struct vm_area_struct * vma, 76 unsigned long start, unsigned long end, int newflags) 77 { 78 struct vm_area_struct * left, * right; 79 80 left = kmem_cache_alloc(vm_area_cachep, SLAB_KERNEL); 81 if (!left) 82 return -EAGAIN; 83 right = kmem_cache_alloc(vm_area_cachep, SLAB_KERNEL); 84 if (!right) { 85 kmem_cache_free(vm_area_cachep, left); 86 return -EAGAIN; 87 } 88 *left = *vma; 89 *right = *vma; 90 left->vm_end = start; 91 right->vm_start = end; 92 right->vm_pgoff += (right->vm_start - left->vm_start) >> PAGE_SHIFT; 93 vma->vm_flags = newflags; 94 left->vm_raend = 0; 95 right->vm_raend = 0; 96 if (vma->vm_file) 97 atomic_add(2, &vma->vm_file->f_count); 98 99 if (vma->vm_ops && vma->vm_ops->open) { 100 vma->vm_ops->open(left); 101 vma->vm_ops->open(right); 102 }

5.3.10. Deleting a memory region 103 104 105 106 107 108 109 110 111 112 113 114 115 } vma->vm_raend = 0; vma->vm_pgoff += (start - vma->vm_start) >> PAGE_SHIFT; lock_vma_mappings(vma); spin_lock(&vma->vm_mm->page_table_lock); vma->vm_start = start; vma->vm_end = end; vma->vm_flags = newflags; __insert_vm_struct(current->mm, left); __insert_vm_struct(current->mm, right); spin_unlock(&vma->vm_mm->page_table_lock); unlock_vma_mappings(vma); return 0;

178

80-87 Allocate the two new VMAs from the slab allocator 88-89 Copy in the information from the old VMA into them 90 The end of the left region is the start of the region to be aﬀected 91 The start of the right region is the end of the aﬀected region 92 Update the ﬁle oﬀset 93 The old VMA is now the aﬀected region so update its ﬂags 94-95 Make the readahead window 0 to ensure pages not belonging to their regions are not accidently read ahead 96-97 Increment the reference count to the ﬁle/device mapping if there is one 99-102 Call the open() function for the two new mappings 103-104 Cancel the readahead window and update the oﬀset within the ﬁle to be the beginning of the locked region 105 Lock the shared ﬁle/device mappings 106-112 Lock the parent mm_struct, update the VMA and insert the two new regions into the process before releasing the lock again 113 Unlock the shared mappings 114 Return success

5.3.10. Deleting a memory region

179

do_munmap

unmap_fixup

remove_shared_vm_struct

zap_page_range

free_pgtables

__insert_vm_struct

lock_vma_mappings

unlock_vma_mappings

__remove_shared_vm_struct

Figure 5.7: do_munmap

5.3.10

Deleting a memory region

Function: do_munmap (mm/mmap.c) This function is responsible for unmapping a region. If necessary, the unmapping can span multiple VMAs and it can partially unmap one if necessary. Hence the full unmapping operation is divided into two major operations. This function is responsible for ﬁnding what VMAs are aﬀected and unmap_fixup() is responsible for ﬁxing up the remaining VMAs. This function is divided up in a number of small sections will be dealt with in turn. The are broadly speaking; • Function preamble and ﬁnd the VMA to start working from • Take all VMAs aﬀected by the unmapping out of the mm and place them on a linked list headed by the variable free • Cycle through the list headed by free, unmap all the pages in the region to be unmapped and call unmap_fixup() to ﬁx up the mappings • Validate the mm and free memory associated with the unmapping 919 int do_munmap(struct mm_struct *mm, unsigned long addr, size_t len) 920 { 921 struct vm_area_struct *mpnt, *prev, **npp, *free, *extra; 922 923 if ((addr & ~PAGE_MASK) || addr > TASK_SIZE || len > TASK_SIZE-addr) 924 return -EINVAL; 925 926 if ((len = PAGE_ALIGN(len)) == 0) 927 return -EINVAL; 928 934 mpnt = find_vma_prev(mm, addr, &prev); 935 if (!mpnt) 936 return 0; 937 /* we have addr < mpnt->vm_end */

5.3.10. Deleting a memory region 938 939 940 941 943 944 945 946 951 952 953

180

if (mpnt->vm_start >= addr+len) return 0; if ((mpnt->vm_start < addr && mpnt->vm_end > addr+len) && mm->map_count >= max_map_count) return -ENOMEM; extra = kmem_cache_alloc(vm_area_cachep, SLAB_KERNEL); if (!extra) return -ENOMEM;

919 The parameters are as follows; mmThe mm for the processes performing the unmap operation addrThe starting address of the region to unmap lenThe length of the region 923-924 Ensure the address is page aligned and that the area to be unmapped is not in the kernel virtual address space 926-927 Make sure the region size to unmap is page aligned 934 Find the VMA that contains the starting address and the preceding VMA so it can be easily unlinked later 935-936 If no mpnt was returned, it means the address must be past the last used VMA so the address space is unused, just return 939-940 If the returned VMA starts past the region we are trying to unmap, then the region in unused, just return 943-945 The ﬁrst part of the check sees if the VMA is just been partially unmapped, if it is, another VMA will be created later to deal with a region being broken into so to the map_count has to be checked to make sure it is not too large 951-953 In case a new mapping is required, it is allocated now as later it will be much more diﬃcult to back out in event of an error 955 956 957 958 959 960 961 962 963 npp = (prev ? &prev->vm_next : &mm->mmap); free = NULL; spin_lock(&mm->page_table_lock); for ( ; mpnt && mpnt->vm_start < addr+len; mpnt = *npp) { *npp = mpnt->vm_next; mpnt->vm_next = free; free = mpnt; rb_erase(&mpnt->vm_rb, &mm->mm_rb); }

5.3.10. Deleting a memory region 964 965 mm->mmap_cache = NULL; /* Kill the cache. */ spin_unlock(&mm->page_table_lock);

181

This section takes all the VMAs aﬀected by the unmapping and places them on a separate linked list headed by a variable called free. This makes the ﬁxup of the regions much easier. 955 npp becomes the next VMA in the list during the for loop following below. To initialise it, it is either the current VMA (mpnt) or else it becomes the ﬁrst VMA in the list 956 free is the head of a linked list of VMAs that are aﬀected by the unmapping 957 Lock the mm 958 Cycle through the list until the start of the current VMA is past the end of the region to be unmapped 959 npp becomes the next VMA in the list 960-961 Remove the current VMA from the linear linked list within the mm and place it on a linked list headed by free. The current mpnt becomes the head of the free linked list 962 Delete mpnt from the red-black tree 964 Remove the cached result in case the last looked up result is one of the regions to be unmapped 965 Free the mm 966 967 list, 968 969 970 971 972 973 974 975 976 977 978 979 980 981 982 983

/* Ok - we have the memory areas we should free on the ’free’ * so * If * it * In */ while release them, and unmap the page range.. the one of the segments is only being partially unmapped, will put new vm_area_struct(s) into the address space. that case we have to be careful with VM_DENYWRITE. ((mpnt = free) != NULL) { unsigned long st, end, size; struct file *file = NULL; free = free->vm_next; st = addr < mpnt->vm_start ? mpnt->vm_start : addr; end = addr+len; end = end > mpnt->vm_end ? mpnt->vm_end : end; size = end - st;

5.3.10. Deleting a memory region 984 985 986 987 988 989 990 991 992 993 994 995 reused. 996 997 998 999 1000 if (mpnt->vm_flags & VM_DENYWRITE && (st != mpnt->vm_start || end != mpnt->vm_end) && (file = mpnt->vm_file) != NULL) { atomic_dec(&file->f_dentry->d_inode->i_writecount); } remove_shared_vm_struct(mpnt); mm->map_count--; zap_page_range(mm, st, size); /* * Fix the mapping, and free the old area if it wasn’t */ extra = unmap_fixup(mm, mpnt, st, size, extra); if (file) atomic_inc(&file->f_dentry->d_inode->i_writecount); }

182

973 Keep stepping through the list until no VMAs are left 977 Move free to the next element in the list leaving mpnt as the head about to be removed 979 st is the start of the region to be unmapped. If the addr is before the start of the VMA, the starting point is mpnt→vm_start, otherwise it is the supplied address 980-981 Calculate the end of the region to map in a similar fashion 982 Calculate the size of the region to be unmapped in this pass 984-988 If the VM_DENYWRITE ﬂag is speciﬁed, a hole will be created by this unmapping and a ﬁle is mapped then the writecount is decremented. When this ﬁeld is negative, it counts how many users there is protecting this ﬁle from being opened for writing 989 Remove the ﬁle mapping. If the ﬁle is still partially mapped, it will be acquired again during unmap_fixup() 990 Reduce the map count 992 Remove all pages within this region 997 Call the ﬁxup routing 998-999 Increment the writecount to the ﬁle as the region has been unmapped. If it was just partially unmapped, this call will simply balance out the decrement at line 987

5.3.10. Deleting a memory region 1001 1002 1003 1004 1005 1006 1007 1008 1009 1010 } validate_mm(mm); /* Release the extra vma struct if it wasn’t used */ if (extra) kmem_cache_free(vm_area_cachep, extra); free_pgtables(mm, prev, addr, addr+len); return 0;

183

1001 A debugging function only. If enabled, it will ensure the VMA tree for this mm is still valid 1004-1005 If extra VMA was not required, delete it 1007 Free all the page tables that were used for the unmapped region 1009 Return success Function: unmap_ﬁxup (mm/mmap.c) This function ﬁxes up the regions after a block has been unmapped. It is passed a list of VMAs that are aﬀected by the unmapping, the region and length to be unmapped and a spare VMA that may be required to ﬁx up the region if a whole is created. There is four principle cases it handles; The unmapping of a region, partial unmapping from the start to somewhere in the middle, partial unmapping from somewhere in the middle to the end and the creation of a hole in the middle of the region. Each case will be taken in turn. 785 static struct vm_area_struct * unmap_fixup(struct mm_struct *mm, 786 struct vm_area_struct *area, unsigned long addr, size_t len, 787 struct vm_area_struct *extra) 788 { 789 struct vm_area_struct *mpnt; 790 unsigned long end = addr + len; 791 792 area->vm_mm->total_vm -= len >> PAGE_SHIFT; 793 if (area->vm_flags & VM_LOCKED) 794 area->vm_mm->locked_vm -= len >> PAGE_SHIFT; 795 Function preamble. 785 The parameters to the function are; mm is the mm the unmapped region belongs to area is the head of the linked list of VMAs aﬀected by the unmapping addr is the starting address of the unmapping

5.3.10. Deleting a memory region len is the length of the region to be unmapped extra is a spare VMA passed in for when a hole in the middle is created 790 Calculate the end address of the region being unmapped 792 Reduce the count of the number of pages used by the process 793-794 If the pages were locked in memory, reduce the locked page count 796 797 798 799 800 801 802 803 804 /* Unmapping the whole area. */ if (addr == area->vm_start && end == area->vm_end) { if (area->vm_ops && area->vm_ops->close) area->vm_ops->close(area); if (area->vm_file) fput(area->vm_file); kmem_cache_free(vm_area_cachep, area); return extra; } The ﬁrst, and easiest, case is where the full region is being unmapped

184

797 The full region is unmapped if the addr is the start of the VMA and the end is the end of the VMA. This is interesting because if the unmapping is spanning regions, it is possible the end is beyond the end of the VMA but the full of this VMA is still being unmapped 798-799 If a close operation is supplied by the VMA, call it 800-801 If a ﬁle or device is mapped, call fput() which decrements the usage count and releases it if the count falls to 0 802 Free the memory for the VMA back to the slab allocator 803 Return the extra VMA as it was unused 807 808 809 810 811 812 813 814 815 if (end == area->vm_end) { /* * here area isn’t visible to the semaphore-less readers * so we don’t need to update it under the spinlock. */ area->vm_end = addr; lock_vma_mappings(area); spin_lock(&mm->page_table_lock); } Handle the case where the middle of the region to the end is been unmapped 812 Truncate the VMA back to addr. At this point, the pages for the region have already freed and the page table entries will be freed later so no further work is required

5.3.10. Deleting a memory region

185

813 If a ﬁle/device is being mapped, the lock protecting shared access to it is taken in the function lock_vm_mappings() 814 Lock the mm. Later in the function, the remaining VMA will be reinserted into the mm 815 816 817 818 819 820 821 else if (addr == area->vm_start) { area->vm_pgoff += (end - area->vm_start) >> PAGE_SHIFT; /* same locking considerations of the above case */ area->vm_start = end; lock_vma_mappings(area); spin_lock(&mm->page_table_lock); }

Handle the case where the VMA is been unmapped from the start to some part in the middle 816 Increase the oﬀset within the ﬁle/device mapped by the number of pages this unmapping represents 818 Move the start of the VMA to the end of the region being unmapped 819-820 Lock the ﬁle/device and mm as above else { /* Unmapping a hole: area->vm_start < addr <= end < area->vm_end */ /* Add end mapping -- leave beginning for below */ mpnt = extra; extra = NULL; mpnt->vm_mm = area->vm_mm; mpnt->vm_start = end; mpnt->vm_end = area->vm_end; mpnt->vm_page_prot = area->vm_page_prot; mpnt->vm_flags = area->vm_flags; mpnt->vm_raend = 0; mpnt->vm_ops = area->vm_ops; mpnt->vm_pgoff = area->vm_pgoff + ((end - area->vm_start) >> PAGE_SHIFT); mpnt->vm_file = area->vm_file; mpnt->vm_private_data = area->vm_private_data; if (mpnt->vm_file) get_file(mpnt->vm_file); if (mpnt->vm_ops && mpnt->vm_ops->open) mpnt->vm_ops->open(mpnt); area->vm_end = addr; /* Truncate area */

822 823 824 825 826 827 828 829 830 831 832 833 834 835 836 837 838 839 840 841 842

5.3.11. Deleting all memory regions 843 844 845 846 847 848 849 /* Because mpnt->vm_file == area->vm_file this locks * things correctly. */ lock_vma_mappings(area); spin_lock(&mm->page_table_lock); __insert_vm_struct(mm, mpnt); }

186

Handle the case where a hole is being created by a partial unmapping. In this case, the extra VMA is required to create a new mapping from the end of the unmapped region to the end of the old VMA 824-825 Take the extra VMA and make VMA NULL so that the calling function will know it is in use and cannot be freed 826-836 Copy in all the VMA information 837 If a ﬁle/device is mapped, get a reference to it with get_file() 839-840 If an open function is provided, call it 841 Truncate the VMA so that it ends at the start of the region to be unmapped 846-847 Lock the ﬁles and mm as with the two previous cases 848 Insert the extra VMA into the mm 850 851 852 853 854 855 }

__insert_vm_struct(mm, area); spin_unlock(&mm->page_table_lock); unlock_vma_mappings(area); return extra;

851 Reinsert the VMA into the mm 852 Unlock the page tables 853 Unlock the spinlock to the shared mapping 854 Return the extra VMA if it was not used and NULL if it was

5.3.11

Deleting all memory regions

Function: exit_mmap (mm/mmap.c) This function simply steps through all VMAs associated with the supplied mm and unmaps them.

5.3.11. Deleting all memory regions 1122 void exit_mmap(struct mm_struct * mm) 1123 { 1124 struct vm_area_struct * mpnt; 1125 1126 release_segments(mm); 1127 spin_lock(&mm->page_table_lock); 1128 mpnt = mm->mmap; 1129 mm->mmap = mm->mmap_cache = NULL; 1130 mm->mm_rb = RB_ROOT; 1131 mm->rss = 0; 1132 spin_unlock(&mm->page_table_lock); 1133 mm->total_vm = 0; 1134 mm->locked_vm = 0; 1135 1136 flush_cache_mm(mm); 1137 while (mpnt) { 1138 struct vm_area_struct * next = mpnt->vm_next; 1139 unsigned long start = mpnt->vm_start; 1140 unsigned long end = mpnt->vm_end; 1141 unsigned long size = end - start; 1142 1143 if (mpnt->vm_ops) { 1144 if (mpnt->vm_ops->close) 1145 mpnt->vm_ops->close(mpnt); 1146 } 1147 mm->map_count--; 1148 remove_shared_vm_struct(mpnt); 1149 zap_page_range(mm, start, size); 1150 if (mpnt->vm_file) 1151 fput(mpnt->vm_file); 1152 kmem_cache_free(vm_area_cachep, mpnt); 1153 mpnt = next; 1154 } 1155 flush_tlb_mm(mm); 1156 1157 /* This is just debugging */ 1158 if (mm->map_count) 1159 BUG(); 1160 1161 clear_page_tables(mm, FIRST_USER_PGD_NR, USER_PTRS_PER_PGD); 1162 }

187

1126 release_segments() will release memory segments associated with the process on its Local Descriptor Table (LDT) if the architecture supports segments and the process was using them. Some applications, notably WINE use this feature 1127 Lock the mm

5.4. Page Fault Handler 1128 mpnt becomes the ﬁrst VMA on the list 1129 Clear VMA related information from the mm so it may be unlocked 1132 Unlock the mm 1133-1134 Clear the mm statistics 1136 Flush the CPU for the address range 1137-1154 Step through every VMA that was associated with the mm 1138 Record what the next VMA to clear will be so this one may be deleted 1139-1141 Record the start, end and size of the region to be deleted 1143-1146 If there is a close operation associated with this VMA, call it 1147 Reduce the map count 1148 Remove the ﬁle/device mapping from the shared mappings list 1149 Free all pages associated with this region 1150-1151 If a ﬁle/device was mapped in this region, free it 1152 Free the VMA struct 1153 Move to the next VMA 1155 Flush the TLB for this whole mm as it is about to be unmapped

188

1158-1159 If the map_count is positive, it means the map count was not accounted for properly so call BUG() to mark it 1161 Clear the page tables associated with this region

5.4

Page Fault Handler

This function is the x86 architecture dependent function for the handling of page fault exception handlers. Each architecture registers their own but all of them have similar responsibilities. 140 asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long error_code) 141 { 142 struct task_struct *tsk; 143 struct mm_struct *mm; 144 struct vm_area_struct * vma; 145 unsigned long address; 146 unsigned long page;

5.4. Page Fault Handler

189

do_page_fault

force_sig_info

find_vma

handle_mm_fault

search_exception_table

handle_pte_fault

pte_alloc

search_one_table

do_wp_page

do_swap_page

establish_pte

do_no_page

do_anonymous_page

lru_cache_add

Figure 5.8: Call Graph: do_page_fault() 147 148 149 150 151 152 153 154 155 156 157 158 159 unsigned long fixup; int write; siginfo_t info; /* get the address */ __asm__("movl %%cr2,%0":"=r" (address)); /* It’s safe to allow irq’s after cr2 has been saved */ if (regs->eflags & X86_EFLAGS_IF) local_irq_enable(); tsk = current;

Function preamble. Get the fault address and enable interrupts 140 The parameters are regs is a struct containing what all the registers at fault time error_code indicates what sort of fault occurred 152 As the comment indicates, the cr2 register is the fault addres 155-156 If the fault is from within an interrupt, enable them 158 Set the current task

5.4. Page Fault Handler 173 174 175 176 177 178 183 184 185 if (address >= TASK_SIZE && !(error_code & 5)) goto vmalloc_fault; mm = tsk->mm; info.si_code = SEGV_MAPERR; if (in_interrupt() || !mm) goto no_context;

190

Check for exceptional faults, kernel faults, fault in interrupt and fault with no memory context 173 If the fault address is over TASK_SIZE, it is within the kernel address space. If the error code is 5, then it means it happened while in kernel mode and is not a protection error so handle a vmalloc fault 176 Record the working mm 183 If this is an interrupt, or there is no memory context (such as with a kernel thread), there is no way to safely handle the fault so goto no_context 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 down_read(&mm->mmap_sem); vma = find_vma(mm, address); if (!vma) goto bad_area; if (vma->vm_start <= address) goto good_area; if (!(vma->vm_flags & VM_GROWSDOWN)) goto bad_area; if (error_code & 4) { /* * accessing the stack below %esp is always a bug. * The "+ 32" is there due to some instructions (like * pusha) doing post-decrement on the stack and that * doesn’t show up until later.. */ if (address + 32 < regs->esp) goto bad_area; } if (expand_stack(vma, address)) goto bad_area;

If a fault in userspace, ﬁnd the VMA for the faulting address and determine if it is a good area, a bad area or if the fault occurred near a region that can be expanded such as the stack

5.4. Page Fault Handler 186 Take the long lived mm semaphore 188 Find the VMA that is responsible or is closest to the faulting address 189-190 If a VMA does not exist at all, goto bad_area

191

191-192 If the start of the region is before the address, it means this VMA is the correct VMA for the fault so goto good_area which will check the permissions 193-194 For the region that is closest, check if it can gown down (VM_GROWSDOWN). If it does, it means the stack can probably be expanded. If not, goto bad_area 195-204 Check to make sure it isn’t an access below the stack. if the error_code is 4, it means it is running in userspace 205-206 expand the stack, if it fails, goto bad_area 211 good_area: 212 info.si_code = SEGV_ACCERR; 213 write = 0; 214 switch (error_code & 3) { 215 default: /* 3: write, present */ 216 #ifdef TEST_VERIFY_AREA 217 if (regs->cs == KERNEL_CS) 218 printk("WP fault at %08lx\n", regs->eip); 219 #endif 220 /* fall through */ 221 case 2: /* write, not present */ 222 if (!(vma->vm_flags & VM_WRITE)) 223 goto bad_area; 224 write++; 225 break; 226 case 1: /* read, present */ 227 goto bad_area; 228 case 0: /* read, not present */ 229 if (!(vma->vm_flags & (VM_READ | VM_EXEC))) 230 goto bad_area; 231 } There is the ﬁrst part of a good area is handled. The permissions need to be checked in case this is a protection fault. 212 By default return an error 214 Check the error code against bits 0 and 1 of the error code. Bit 0 at 0 means page was not present. At 1, it means a protection fault like a write to a read-only area. Bit 1 is 0 if it was a read fault and 1 if a write 215 If it is 3, both bits are 1 so it is a write protection fault

5.4. Page Fault Handler 221 Bit 1 is a 1 so it is a write fault

192

222-223 If the region can not be written to, it is a bad write to goto bad_area. If the region can be written to, this is a page that is marked Copy On Write (COW) 224 Flag that a write has occurred 226-227 This is a read and the page is present. There is no reason for the fault so must be some other type of exception like a divide by zero, goto bad_area where it is handled 228-230 A read occurred on a missing page. Make sure it is ok to read or exec this page. If not, goto bad_area. The check for exec is made because the x86 can not exec protect a page and instead uses the read protect ﬂag. This is why both have to be checked 233 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 survive: switch (handle_mm_fault(mm, vma, address, write)) { case 1: tsk->min_flt++; break; case 2: tsk->maj_flt++; break; case 0: goto do_sigbus; default: goto out_of_memory; } /* * Did it hit the DOS screen memory VA from vm86 mode? */ if (regs->eflags & VM_MASK) { unsigned long bit = (address - 0xA0000) >> PAGE_SHIFT; if (bit < 32) tsk->thread.screen_bitmap |= 1 << bit; } up_read(&mm->mmap_sem); return;

At this point, an attempt is going to be made to handle the fault gracefully with handle_mm_fault(). 239 Call handle_mm_fault() with the relevant information about the fault. This is the architecture independent part of the handler 240-242 A return of 1 means it was a minor fault. Update statistics 243-245 A return of 2 means it was a major fault. Update statistics

5.4. Page Fault Handler

193

246-247 A return of 0 means some IO error happened during the fault so go to the do_sigbus handler 248-249 Any other return means memory could not be allocated for the fault so we are out of memory. In reality this does not happen as another function out_of_memory() is invoked in mm/oom_kill.c before this could happen which is a lot more graceful about who it kills 255-259 Not sure 260 Release the lock to the mm 261 Return as the fault has been successfully handled 267 bad_area: 268 up_read(&mm->mmap_sem); 269 270 /* User mode accesses just cause a SIGSEGV */ 271 if (error_code & 4) { 272 tsk->thread.cr2 = address; 273 tsk->thread.error_code = error_code; 274 tsk->thread.trap_no = 14; 275 info.si_signo = SIGSEGV; 276 info.si_errno = 0; 277 /* info.si_code has been set above */ 278 info.si_addr = (void *)address; 279 force_sig_info(SIGSEGV, &info, tsk); 280 return; 281 } 282 283 /* 284 * Pentium F0 0F C7 C8 bug workaround. 285 */ 286 if (boot_cpu_data.f00f_bug) { 287 unsigned long nr; 288 289 nr = (address - idt) >> 3; 290 291 if (nr == 6) { 292 do_invalid_op(regs, 0); 293 return; 294 } 295 } This is the bad area handler such as using memory with no vm_area_struct managing it. If the fault is not by a user process or the f00f bug, the no_context label is fallen through to.

5.4. Page Fault Handler

194

271 An error code of 4 implies userspace so it is a simple case of sending a SIGSEGV to kill the process 272-274 Set thread information about what happened which can be read by a debugger later 275 Record that a SIGSEGV signal was sent 276 clear errno 278 Record the address 279 Send the SIGSEGV signal. The process will exit and dump all the relevant information 280 Return as the fault has been successfully handled 286-295 An bug in the ﬁrst Pentiums was called the f00f bug which caused the processor to constantly page fault. It was used as a local DoS attack on a running Linux system. This bug was trapped within a few hours and a patch released. Now it results in a harmless termination of the process rather than a locked system 296 297 no_context: 298 /* Are we prepared to handle this kernel fault? */ 299 if ((fixup = search_exception_table(regs->eip)) != 0) { 300 regs->eip = fixup; 301 return; 302 } 299-302 Check can this exception be handled and if so, call the proper exception handler after returning. This is really important during copy_from_user() and copy_to_user() when an exception handler is especially installed to trap reads and writes to invalid regions in userspace without having to make expensive checks. It means that a small ﬁxup block of code can be called rather than falling through to the next block which causes an oops 303 304 /* 305 * Oops. The kernel tried to access some bad page. We’ll have to 306 * terminate things with extreme prejudice. 307 */ 308 309 bust_spinlocks(1); 310 311 if (address < PAGE_SIZE) 312 printk(KERN_ALERT "Unable to handle kernel NULL pointer dereference"); 313 else

5.4. Page Fault Handler 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 printk(KERN_ALERT "Unable to handle kernel paging request"); printk(" at virtual address %08lx\n",address); printk(" printing eip:\n"); printk("%08lx\n", regs->eip); asm("movl %%cr3,%0":"=r" (page)); page = ((unsigned long *) __va(page))[address >> 22]; printk(KERN_ALERT "*pde = %08lx\n", page); if (page & 1) { page &= PAGE_MASK; address &= 0x003ff000; page = ((unsigned long *) __va(page))[address >> PAGE_SHIFT]; printk(KERN_ALERT "*pte = %08lx\n", page); } die("Oops", regs, error_code); bust_spinlocks(0); do_exit(SIGKILL);

195

This is the no_context handler. Some bad exception occurred which is going to end up in the process been terminated in all likeliness. Otherwise the kernel faulted when it deﬁnitely should have and an OOPS report is generated. 309-329 Otherwise the kernel faulted when it really shouldn’t have and it is a kernel bug. This block generates an oops report 309 Forcibly free spinlocks which might prevent a message getting to console 311-312 If the address is < PAGE_SIZE, it means that a null pointer was used. Linux deliberately has page 0 unassigned to trap this type of fault which is a common programming error 313-314 Otherwise it is just some bad kernel error such as a driver trying to access userspace incorrectly 315-320 Print out information about the fault 321-326 Print out information about the page been faulted 327 Die and generate an oops report which can be used later to get a stack trace so a developer can see more accurately where and how the fault occurred 329 Forcibly kill the faulting process 335 out_of_memory: 336 if (tsk->pid == 1) { 337 yield(); 338 goto survive;

5.4. Page Fault Handler 339 340 341 342 343 344 } up_read(&mm->mmap_sem); printk("VM: killing process %s\n", tsk->comm); if (error_code & 4) do_exit(SIGKILL); goto no_context;

196

The out of memory handler. Usually ends with the faulting process getting killed unless it is init 336-339 If the process is init, just yield and goto survive which will try to handle the fault gracefully. init should never be killed 340 Free the mm semaphore 341 Print out a helpful “You are Dead” message 342 If from userspace, just kill the process 344 If in kernel space, go to the no_context handler which in this case will probably result in a kernel oops 345 346 do_sigbus: 347 up_read(&mm->mmap_sem); 348 353 tsk->thread.cr2 = address; 354 tsk->thread.error_code = error_code; 355 tsk->thread.trap_no = 14; 356 info.si_signo = SIGBUS; 357 info.si_errno = 0; 358 info.si_code = BUS_ADRERR; 359 info.si_addr = (void *)address; 360 force_sig_info(SIGBUS, &info, tsk); 361 362 /* Kernel mode? Handle exceptions or die */ 363 if (!(error_code & 4)) 364 goto no_context; 365 return; 347 Free the mm lock 353-359 Fill in information to show a SIGBUS occurred at the faulting address so that a debugger can trap it later 360 Send the signal 363-364 If in kernel mode, try and handle the exception during no_context

5.4. Page Fault Handler 365 If in userspace, just return and the process will die in due course 367 vmalloc_fault: 368 { 376 int offset = __pgd_offset(address); 377 pgd_t *pgd, *pgd_k; 378 pmd_t *pmd, *pmd_k; 379 pte_t *pte_k; 380 381 asm("movl %%cr3,%0":"=r" (pgd)); 382 pgd = offset + (pgd_t *)__va(pgd); 383 pgd_k = init_mm.pgd + offset; 384 385 if (!pgd_present(*pgd_k)) 386 goto no_context; 387 set_pgd(pgd, *pgd_k); 388 389 pmd = pmd_offset(pgd, address); 390 pmd_k = pmd_offset(pgd_k, address); 391 if (!pmd_present(*pmd_k)) 392 goto no_context; 393 set_pmd(pmd, *pmd_k); 394 395 pte_k = pte_offset(pmd_k, address); 396 if (!pte_present(*pte_k)) 397 goto no_context; 398 return; 399 } 400 }

197

This is the vmalloc fault handler. In this case the process page table needs to be synchronized with the reference page table. This could occur if a global TLB ﬂush ﬂushed some kernel page tables as well and the page table information just needs to be copied back in. 376 Get the oﬀset within a PGD 381 Copy the address of the PGD for the process from the cr3 register to pgd 382 Calculate the pgd pointer from the process PGD 383 Calculate for the kernel reference PGD 385-386 If the pgd entry is invalid for the kernel page table, goto no_context 386 Set the page table entry in the process page table with a copy from the kernel reference page table

5.4.1. Handling the Page Fault

198

389-393 Same idea for the PMD. Copy the page table entry from the kernel reference page table to the process page tables 395 Check the PTE 396-397 If it is not present, it means the page was not valid even in the kernel reference page table so goto no_context to handle what is probably a kernel bug, probably a reference to a random part of unused kernel space 398 Otherwise return knowing the process page tables have been updated and are in sync with the kernel page tables

5.4.1

Handling the Page Fault

This is the top level pair of functions for the architecture independent page fault handler. Function: handle_mm_fault (mm/memory.c) This function allocates the PMD and PTE necessary for this new PTE hat is about to be allocated. It takes the necessary locks to protect the page tables before calling handle_pte_fault() to fault in the page itself. 1364 int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct * vma, 1365 unsigned long address, int write_access) 1366 { 1367 pgd_t *pgd; 1368 pmd_t *pmd; 1369 1370 current->state = TASK_RUNNING; 1371 pgd = pgd_offset(mm, address); 1372 1373 /* 1374 * We need the page table lock to synchronize with kswapd 1375 * and the SMP-safe atomic PTE updates. 1376 */ 1377 spin_lock(&mm->page_table_lock); 1378 pmd = pmd_alloc(mm, pgd, address); 1379 1380 if (pmd) { 1381 pte_t * pte = pte_alloc(mm, pmd, address); 1382 if (pte) 1383 return handle_pte_fault(mm, vma, address, write_access, pte); 1384 } 1385 spin_unlock(&mm->page_table_lock); 1386 return -1; 1387 } 1364 The parameters of the function are;

5.4.1. Handling the Page Fault mm is the mm_struct for the faulting process vma is the vm_area_struct managing the region the fault occurred in address is the faulting address write_access is 1 if the fault is a write fault 1370 Set the current state of the process 1371 Get the pgd entry from the top level page table 1377 Lock the mm_struct as the page tables will change 1378 pmd_alloc will allocate a pmd_t if one does not already exist 1380 If the pmd has been successfully allocated then... 1381 Allocate a PTE for this address if one does not already exist

199

1382-1383 Handle the page fault with handle_pte_fault() and return the status code 1385 Failure path, unlock the mm_struct 1386 Return -1 which will be interpreted as an out of memory condition which is correct as this line is only reached if a PMD or PTE could not be allocated Function: handle_pte_fault (mm/memory.c) This function decides what type of fault this is and which function should handle it. do_no_page() is called if this is the ﬁrst time a page is to be allocated. do_swap_page() handles the case where the page was swapped out to disk. do_wp_page() breaks COW pages. If none of them are appropriate, the PTE entry is simply updated. If it was written to, it is marked dirty and it is marked accessed to show it is a young page. 1331 static inline int handle_pte_fault(struct mm_struct *mm, 1332 struct vm_area_struct * vma, unsigned long address, 1333 int write_access, pte_t * pte) 1334 { 1335 pte_t entry; 1336 1337 entry = *pte; 1338 if (!pte_present(entry)) { 1339 /* 1340 * If it truly wasn’t present, we know that kswapd 1341 * and the PTE updates will not touch it later. So 1342 * drop the lock. 1343 */ 1344 if (pte_none(entry)) 1345 return do_no_page(mm, vma, address, write_access, pte); 1346 return do_swap_page(mm, vma, address, pte, entry,

5.4.2. Demand Allocation write_access); 1347 1348 1349 1350 1351 1352 1353 1354 1355 1356 1357 1358 1359 } } if (write_access) { if (!pte_write(entry)) return do_wp_page(mm, vma, address, pte, entry); entry = pte_mkdirty(entry); } entry = pte_mkyoung(entry); establish_pte(vma, address, pte, entry); spin_unlock(&mm->page_table_lock); return 1;

200

1331 The parameters of the function are the same as those for handle_mm_fault() except the PTE for the fault is included 1337 Record the PTE 1338 Handle the case where the PTE is not present 1344 If the PTE has never been ﬁlled, handle the allocation of the PTE with do_no_page() 1346 If the page has been swapped out to backing storage, handle it with do_swap_page() 1349-1354 Handle the case where the page is been written to 1350-1351 If the PTE is marked write-only, it is a COW page so handle it with do_wp_page() 1353 Otherwise just simply mark the page as dirty 1355 Mark the page as accessed 1356 establish_pte() copies the PTE and then updates the TLB and MMU cache. This does not copy in a new PTE but some architectures require the TLB and MMU update 1357 Unlock the mm_struct and return that a minor fault occurred

5.4.2

Demand Allocation

Function: do_no_page (mm/memory.c) This function is called the ﬁrst time a page is referenced so that it may be allocated and ﬁlled with data if necessary. If it is an anonymous page, determined by the lack of a vm_ops available to the VMA or the lack of a nopage() function, then do_anonymous_page() is called. Otherwise the supplied nopage() function is called to allocate a page and it is inserted into the page tables here. The function has the following tasks;

5.4.2. Demand Allocation

201

do_no_page

do_anonymous_page

lru_cache_add

mark_page_accessed

Figure 5.9: Call Graph: do_no_page() • Check if do_anonymous_page() should be used and if so, call it and return the page it allocates. If not, call the supplied nopage() function and ensure it allocates a page successfully. • Break COW early if appropriate • Add the page to the page table entries and call the appropriate architecture dependent hooks 1245 static int do_no_page(struct mm_struct * mm, struct vm_area_struct * vma, 1246 unsigned long address, int write_access, pte_t *page_table) 1247 { 1248 struct page * new_page; 1249 pte_t entry; 1250 1251 if (!vma->vm_ops || !vma->vm_ops->nopage) 1252 return do_anonymous_page(mm, vma, page_table, write_access, address); 1253 spin_unlock(&mm->page_table_lock); 1254 1255 new_page = vma->vm_ops->nopage(vma, address & PAGE_MASK, 0); 1256 1257 if (new_page == NULL) /* no page was available -- SIGBUS */ 1258 return 0; 1259 if (new_page == NOPAGE_OOM) 1260 return -1; 1245 The parameters supplied are the same as those for handle_pte_fault() 1251-1252 If no vm_ops is supplied or no nopage() function is supplied, then call do_anonymous_page() to allocate a page and return it

5.4.2. Demand Allocation

202

1253 Otherwise free the page table lock as the nopage() function can not be called with spinlocks held 1255 Call the supplied nopage function, in the case of ﬁlesystems, this is frequently filemap_nopage() but will be diﬀerent for each device driver 1257-1258 If NULL is returned, it means some error occurred in the nopage function such as an IO error while reading from disk. In this case, 0 is returned which results in a SIGBUS been sent to the faulting process 1259-1260 If NOPAGE_OOM is returned, the physical page allocator failed to allocate a page and -1 is returned which will forcibly kill the process 1265 1266 1267 1268 1269 1270 1271 1272 1273 1274 1275 if (write_access && !(vma->vm_flags & VM_SHARED)) { struct page * page = alloc_page(GFP_HIGHUSER); if (!page) { page_cache_release(new_page); return -1; } copy_user_highpage(page, new_page, address); page_cache_release(new_page); lru_cache_add(page); new_page = page; }

Break COW early in this block if appropriate. COW is broken if the fault is a write fault and the region is not shared with VM_SHARED. If COW was not broken in this case, a second fault would occur immediately upon return. 1265 Check if COW should be broken early 1266 If so, allocate a new page for the process 1267-1270 If the page could not be allocated, reduce the reference count to the page returned by the nopage() function and return -1 for out of memory 1271 Otherwise copy the contents 1272 Reduce the reference count to the returned page which may still be in use by another process 1273 Add the new page to the LRU lists so it may be reclaimed by kswapd later 1276 1277 1288 1289 1290 1291

spin_lock(&mm->page_table_lock); /* Only go through if we didn’t race with anybody else... */ if (pte_none(*page_table)) { ++mm->rss; flush_page_to_ram(new_page);

5.4.2. Demand Allocation 1292 1293 1294 1295 1296 1297 1298 1299 1300 1301 1302 1303 1304 */ 1305 1306 1307 1308 } flush_icache_page(vma, new_page); entry = mk_pte(new_page, vma->vm_page_prot); if (write_access) entry = pte_mkwrite(pte_mkdirty(entry)); set_pte(page_table, entry); } else { /* One of our sibling threads was faster, back out. */ page_cache_release(new_page); spin_unlock(&mm->page_table_lock); return 1; }

203

/* no need to invalidate: a not-present page shouldn’t be cached update_mmu_cache(vma, address, entry); spin_unlock(&mm->page_table_lock); return 2; /* Major fault */

1277 Lock the page tables again as the allocations have ﬁnished and the page tables are about to be updated 1289 Check if there is still no PTE in the entry we are about to use. If two faults hit here at the same time, it is possible another processor has already completed the page fault and this one should be backed out 1290-1297 If there is no PTE entered, complete the fault 1290 Increase the RSS count as the process is now using another page 1291 As the page is about to be mapped to the process space, it is possible for some architectures that writes to the page in kernel space will not be visible to the process. flush_page_to_ram() ensures the cache will be coherent 1292 flush_icache_page() is similar in principle except it ensures the icache and dcache’s are coherent 1293 Create a pte_t with the appropriate permissions 1294-1295 If this is a write, then make sure the PTE has write permissions 1296 Place the new PTE in the process page tables 1297-1302 If the PTE is already ﬁlled, the page acquired from the nopage() function must be released 1299 Decrement the reference count to the page. If it drops to 0, it will be freed

5.4.2. Demand Allocation

204

1300-1301 Release the mm_struct lock and return 1 to signal this is a minor page fault as no major work had to be done for this fault as it was all done by the winner of the race 1305 Update the MMU cache for architectures that require it 1306-1307 Release the mm_struct lock and return 2 to signal this is a major page fault Function: do_anonymous_page (mm/memory.c) This function allocates a new page for a process accessing a page for the ﬁrst time. If it is a read access, a system wide page containing only zeros is mapped into the process. If it is write, a zero ﬁlled page is allocated and placed within the page tables 1190 static int do_anonymous_page(struct mm_struct * mm, struct vm_area_struct * vma, pte_t *page_table, int write_access, unsigned long addr) 1191 { 1192 pte_t entry; 1193 1194 /* Read-only mapping of ZERO_PAGE. */ 1195 entry = pte_wrprotect(mk_pte(ZERO_PAGE(addr), vma->vm_page_prot)); 1196 1197 /* ..except if it’s a write access */ 1198 if (write_access) { 1199 struct page *page; 1200 1201 /* Allocate our own private page. */ 1202 spin_unlock(&mm->page_table_lock); 1203 1204 page = alloc_page(GFP_HIGHUSER); 1205 if (!page) 1206 goto no_mem; 1207 clear_user_highpage(page, addr); 1208 1209 spin_lock(&mm->page_table_lock); 1210 if (!pte_none(*page_table)) { 1211 page_cache_release(page); 1212 spin_unlock(&mm->page_table_lock); 1213 return 1; 1214 } 1215 mm->rss++; 1216 flush_page_to_ram(page); 1217 entry = pte_mkwrite( pte_mkdirty(mk_pte(page, vma->vm_page_prot))); 1218 lru_cache_add(page); 1219 mark_page_accessed(page);

5.4.3. Demand Paging 1220 } 1221 1222 set_pte(page_table, entry); 1223 1224 /* No need to invalidate - it was non-present before */ 1225 update_mmu_cache(vma, addr, entry); 1226 spin_unlock(&mm->page_table_lock); 1227 return 1; /* Minor fault */ 1228 1229 no_mem: 1230 return -1; 1231 } 1190 The parameters are the same as those passed to handle_pte_fault()

205

1195 For read accesses, simply map the system wide empty_zero_page which the ZERO_PAGE macro returns with the given permissions. The page is write protected so that a write to the page will result in a page fault 1198-1220 If this is a write fault, then allocate a new page and zero ﬁll it 1202 Unlock the mm_struct as the allocation of a new page could sleep 1204 Allocate a new page 1205 If a page could not be allocated, return -1 to handle the OOM situation 1207 Zero ﬁll the page 1209 Reacquire the lock as the page tables are to be updated 1216 Ensure the cache is coherent 1217 Mark the PTE writable and dirty as it has been written to 1218 Add the page to the LRU list so it may be reclaimed by the swapper later 1219 Mark the page accessed which ensures the page is marked hot and on the top of the active list 1222 Fix the PTE in the page tables for this process 1225 Update the MMU cache if the architecture needs it 1226 Free the page table lock 1227 Return as a minor fault as even though it is possible the page allocator spent time writing out pages, data did not have to be read from disk to ﬁll this page

5.4.3. Demand Paging

206

5.4.3

Demand Paging

Function: do_swap_page (mm/memory.c) This function handles the case where a page has been swapped out. A swapped out page may exist in the swap cache if it is shared between a number of processes or recently swapped in during readahead. This function is broken up into three parts • Search for the page in swap cache • If it does not exist, call swapin_readahead() to read in the page • Insert the page into the process page tables 1117 static int do_swap_page(struct mm_struct * mm, 1118 struct vm_area_struct * vma, unsigned long address, 1119 pte_t * page_table, pte_t orig_pte, int write_access) 1120 { 1121 struct page *page; 1122 swp_entry_t entry = pte_to_swp_entry(orig_pte); 1123 pte_t pte; 1124 int ret = 1; 1125 1126 spin_unlock(&mm->page_table_lock); 1127 page = lookup_swap_cache(entry); Function preamble, check for the page in the swap cache 1117-1119 The parameters are the same as those supplied to handle_pte_fault() 1122 Get the swap entry information from the PTE 1126 Free the mm_struct spinlock 1127 Lookup the page in the swap cache 1128 1129 1130 1131 1136 1137 1138 1; 1139 1140 1141 1142 1143 1144 1145 if (!page) { swapin_readahead(entry); page = read_swap_cache_async(entry); if (!page) { int retval; spin_lock(&mm->page_table_lock); retval = pte_same(*page_table, orig_pte) ? -1 : spin_unlock(&mm->page_table_lock); return retval; } /* Had to read the page from swap area: Major fault */ ret = 2; }

5.4.3. Demand Paging

207

If the page did not exist in the swap cache, then read it from backing storage with swapin_readhead() which reads in the requested pages and a number of pages after it. Once it completes, read_swap_cache_async() should be able to return the page. 1128-1145 This block is executed if the page was not in the swap cache 1129 swapin_readahead() reads in the requested page and a number of pages after it. The number of pages read in is determined by the page_cluster variable in mm/swap.c which is initialised to 2 on machines with less than 16MiB of memory and 3 otherwise. 2page_cluster pages are read in after the requested page unless a bad or empty page entry is encountered 1230 Look up the requested page 1131-1141 If the page does not exist, there was another fault which swapped in this page and removed it from the cache while spinlocks were dropped 1137 Lock the mm_struct 1138 Compare the two PTEs. If they do not match, -1 is returned to signal an IO error, else 1 is returned to mark a minor page fault as a disk access was not required for this particular page. 1139-1140 Free the mm_struct and return the status 1144 The disk had to be accessed to mark that this is a major page fault 1147 1148 1149 1150 1151 1152 1153 1154 1155 1156 1157 1158 1159 1160 1161 1162 1163 1164 1165 1166 1167 mark_page_accessed(page); lock_page(page); /* * Back out if somebody else faulted in this pte while we * released the page table lock. */ spin_lock(&mm->page_table_lock); if (!pte_same(*page_table, orig_pte)) { spin_unlock(&mm->page_table_lock); unlock_page(page); page_cache_release(page); return 1; } /* The page isn’t present yet, go ahead with the fault. */ swap_free(entry); if (vm_swap_full()) remove_exclusive_swap_page(page);

5.4.3. Demand Paging 1168 1169 1170 1171 1172 1173 1174 1175 1176 1177 1178 1179 1180 1181 1182 1183 }

208

mm->rss++; pte = mk_pte(page, vma->vm_page_prot); if (write_access && can_share_swap_page(page)) pte = pte_mkdirty(pte_mkwrite(pte)); unlock_page(page); flush_page_to_ram(page); flush_icache_page(vma, page); set_pte(page_table, pte); /* No need to invalidate - it was non-present before */ update_mmu_cache(vma, address, pte); spin_unlock(&mm->page_table_lock); return ret;

Place the page in the process page tables 1147 Mark the page as active so it will be moved to the top of the active LRU list 1149 Lock the page which has the side eﬀect of waiting for the IO swapping in the page to complete 1155-1161 If someone else faulted in the page before we could, the reference to the page is dropped, the lock freed and return that this was a minor fault 1165 The function swap_free() reduces the reference to a swap entry. If it drops to 0, it is actually freed 1166-1167 Page slots in swap space are reserved for pages once they have been swapped out once if possible. If the swap space is full though, the reservation is broken and the slot freed up for another page 1169 The page is now going to be used so increment the mm_struct’s RSS count 1170 Make a PTE for this page 1171 If the page is been written to and it is shared between more than one process, mark it dirty so that it will be kept in sync with the backing storage and swap cache for other processes 1173 Unlock the page 1175 As the page is about to be mapped to the process space, it is possible for some architectures that writes to the page in kernel space will not be visible to the process. flush_page_to_ram() ensures the cache will be coherent 1176 flush_icache_page() is similar in principle except it ensures the icache and dcache’s are coherent

5.4.4. Copy On Write (COW) Pages 1177 Set the PTE in the process page tables 1180 Update the MMU cache if the architecture requires it

209

1181-1182 Unlock the mm_struct and return whether it was a minor or major page fault

5.4.4

Copy On Write (COW) Pages

5.4.4. Copy On Write (COW) Pages

do_swap_page

lookup_swap_cache

swapin_readahead

mark_page_accessed

lock_page

swap_free

remove_exclusive_swap_page

can_share_swap_page

unlock_page

read_swap_cache_async

activate_page

exclusive_swap_page

page_waitqueue

Figure 5.10: do_swap_page
activate_page_nolock

210

5.4.4. Copy On Write (COW) Pages

211

do_wp_page

can_share_swap_page

unlock_page

copy_cow_page

break_cow

lru_cache_add

exclusive_swap_page

page_waitqueue

establish_pte

Figure 5.11: do_wp_page

Chapter 6 High Memory Management
6.1 Mapping High Memory Pages
kmap

__out_of_line_bug

kmap_high

map_new_virtual

flush_all_zero_pkmaps

add_wait_queue

remove_wait_queue

Figure 6.1: Call Graph: kmap() Function: kmap (include/asm-i386/highmem.h) 62 static inline void *kmap(struct page *page) 63 { 64 if (in_interrupt()) 65 out_of_line_bug(); 66 if (page < highmem_start_page) 67 return page_address(page); 68 return kmap_high(page); 69 } 64-65 This function may not be used from interrupt as it may sleep. out_of_line_bug() calls do_exit() and returns an error code. BUG() is not used because BUG() kills the 212

6.1. Mapping High Memory Pages

213

process with extreme prejudice which would result in the fabled “Aiee, killing interrupt handler!” kernel panic 66-67 If the page is already in low memory, return a direct mapping 68 Call kmap_high() for the beginning of the architecture independent work Function: kmap_high (mm/highmem.c) 129 void *kmap_high(struct page *page) 130 { 131 unsigned long vaddr; 132 139 spin_lock(&kmap_lock); 140 vaddr = (unsigned long) page->virtual; 141 if (!vaddr) 142 vaddr = map_new_virtual(page); 143 pkmap_count[PKMAP_NR(vaddr)]++; 144 if (pkmap_count[PKMAP_NR(vaddr)] < 2) 145 BUG(); 146 spin_unlock(&kmap_lock); 147 return (void*) vaddr; 148 } 139 The kmap_lock protects the virtual ﬁeld of a page and the pkmap_count array 140 Get the virtual address of the page 141-142 If it is not already mapped, call map_new_virtual() which will map the page and return the virtual address 143 Increase the reference count for this page mapping 144-145 If the count is currently less than 2, it is a serious bug. In reality, severe breakage would have to be introduced to cause this to happen 146 Free the kmap_lock Function: map_new_virtual (mm/highmem.c) This function is divided into three principle parts. The scanning for a free slot, waiting on a queue if none is avaialble and mapping the page. 80 static inline unsigned long map_new_virtual(struct page *page) 81 { 82 unsigned long vaddr; 83 int count; 84 85 start: 86 count = LAST_PKMAP;

6.1. Mapping High Memory Pages 87 88 89 90 91 92 93 94 95 96 97 98

214

/* Find an empty entry */ for (;;) { last_pkmap_nr = (last_pkmap_nr + 1) & LAST_PKMAP_MASK; if (!last_pkmap_nr) { flush_all_zero_pkmaps(); count = LAST_PKMAP; } if (!pkmap_count[last_pkmap_nr]) break; /* Found a usable entry */ if (--count) continue;

86 Start scanning at the last possible slot 88-119 This look keeps scanning and waiting until a slot becomes free. This allows the possibility of an inﬁnite loop for some processes if they were unlucky 89 last_pkmap_nr is the last pkmap that was scanned. To prevent searching over the same pages, this value is recorded so the list is searched circularly. When it reaches LAST_PKMAP, it wraps around to 0 90-93 When last_pkmap_nr wraps around, call flush_all_zero_pkmaps() which will set all entries from 1 to 0 in the pkmap_count array before ﬂushing the TLB. Count is set back to LAST_PKMAP to restart scanning 94-95 If this element is 0, a usable slot has been found for the page 96-96 Move to the next index to scan 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 { DECLARE_WAITQUEUE(wait, current); current->state = TASK_UNINTERRUPTIBLE; add_wait_queue(&pkmap_map_wait, &wait); spin_unlock(&kmap_lock); schedule(); remove_wait_queue(&pkmap_map_wait, &wait); spin_lock(&kmap_lock); /* Somebody else might have mapped it while we slept */ if (page->virtual) return (unsigned long) page->virtual; /* Re-start */ goto start; } }

6.1. Mapping High Memory Pages

215

If there is no available slot after scanning all the pages once, we sleep on the pkmap_map_wait queue until we are woken up after an unmap 103 Declare the wait queue 105 Set the task as interruptible because we are sleeping in kernel space 106 Add ourselves to the pkmap_map_wait queue 107 Free the kmap_lock spinlock 108 Call schedule() which will put us to sleep. We are woken up after a slot becomes free after an unmap 109 Remove ourselves from the ait queue 110 Re-acquire kmap_lock 113-114 If someone else mapped the page while we slept, just return the address and the reference count will be incremented by kmap_high() 117 Restart the scanning 120 121 122 123 124 125 126 127 } vaddr = PKMAP_ADDR(last_pkmap_nr); set_pte(&(pkmap_page_table[last_pkmap_nr]), mk_pte(page, kmap_prot)); pkmap_count[last_pkmap_nr] = 1; page->virtual = (void *) vaddr; return vaddr;

A slot has been found, map the page 120 Get the virtual address for the slot found 121 Make the PTE entry with the page and required protection and place it in the page tables at the found slot 123 Initialise the value in the pkmap_count array to 1. The count is incremented in the parent function and we are sure this is the ﬁrst mapping if we are in this function in the ﬁrst place 124 Set the virtual ﬁeld for the page 126 Return the virtual address

6.1. Mapping High Memory Pages

216

Function: ﬂush_all_zero_pkmaps (mm/highmem.c) This function cycles through the pkmap_count array and sets all entries from 1 to 0 before ﬂushing the TLB. 42 static void flush_all_zero_pkmaps(void) 43 { 44 int i; 45 46 flush_cache_all(); 47 48 for (i = 0; i < LAST_PKMAP; i++) { 49 struct page *page; 50 57 if (pkmap_count[i] != 1) 58 continue; 59 pkmap_count[i] = 0; 60 61 /* sanity check */ 62 if (pte_none(pkmap_page_table[i])) 63 BUG(); 64 72 page = pte_page(pkmap_page_table[i]); 73 pte_clear(&pkmap_page_table[i]); 74 75 page->virtual = NULL; 76 } 77 flush_tlb_all(); 78 } 46 As the global page tables are about to change, the CPU caches of all processors have to be ﬂushed 48-76 Cycle through the entire pkmap_count array 57-58 If the element is not 1, move to the next element 59 Set from 1 to 0 62-63 Make sure the PTE is not somehow mapped 72-73 Unmap the page from the PTE and clear the PTE 75 Update the virtual ﬁeld as the page is unmapped 77 Flush the TLB

6.1.1. Unmapping Pages

217

6.1.1

Unmapping Pages

Function: kunmap (include/asm-i386/highmem.h) 71 static inline void kunmap(struct page *page) 72 { 73 if (in_interrupt()) 74 out_of_line_bug(); 75 if (page < highmem_start_page) 76 return; 77 kunmap_high(page); 78 } 73-74 kunmap() cannot be called from interrupt so exit gracefully 75-76 If the page already is in low memory, there is no need to unmap 77 Call the architecture independent function kunmap_high() Function: kunmap_high (mm/highmem.c) 150 void kunmap_high(struct page *page) 151 { 152 unsigned long vaddr; 153 unsigned long nr; 154 int need_wakeup; 155 156 spin_lock(&kmap_lock); 157 vaddr = (unsigned long) page->virtual; 158 if (!vaddr) 159 BUG(); 160 nr = PKMAP_NR(vaddr); 161 166 need_wakeup = 0; 167 switch (--pkmap_count[nr]) { 168 case 0: 169 BUG(); 170 case 1: 181 need_wakeup = waitqueue_active(&pkmap_map_wait); 182 } 183 spin_unlock(&kmap_lock); 184 185 /* do wake-up, if needed, race-free outside of the spin lock */ 186 if (need_wakeup) 187 wake_up(&pkmap_map_wait); 188 } 156 Acquire kmap_lock protecting the virtual() ﬁeld and the pkmap_count array

6.2. Mapping High Memory Pages Atomically 157 Get the virtual page

218

158-159 If the virtual ﬁeld is not set, it is a double unmapping or unmapping of a nonmapped page so BUG() 160 Get the index within the pkmap_count array 166 By default, a wakeup call to processes calling kmap() is not needed 167 Check the value of the index after decrement 168-169 Falling to 0 is a bug as the TLB needs to be ﬂushed to make 0 a valid entry 170-181 If it has dropped to 1 (free entry but needs TLB ﬂush), check to see if there is anyone sleeping on the pkmap_map_wait queue. If necessary, the queue will be woken up after the spinlock is freed 183 Free kmap_lock 186-187 If there is waiters on the queue and a slot has been freed, wake them up

6.2

Mapping High Memory Pages Atomically

The following is an example km_type enumeration for the x86. It lists the diﬀerent uses interrupts have for atomically calling kmap. Note how KM_TYPE_NR is the last element so it doubles up as a count of the number of elements. 4 enum km_type { 5 KM_BOUNCE_READ, 6 KM_SKB_SUNRPC_DATA, 7 KM_SKB_DATA_SOFTIRQ, 8 KM_USER0, 9 KM_USER1, 10 KM_BH_IRQ, 11 KM_TYPE_NR 12 }; Function: kmap_atomic (include/asm-i386/highmem.h) This is the atomic version of kmap(). Note that at no point is a spinlock held or does it sleep. A spinlock is not required as every processor has its own reserved space. 86 static inline void *kmap_atomic(struct page *page, enum km_type type) 87 { 88 enum fixed_addresses idx; 89 unsigned long vaddr; 90 91 if (page < highmem_start_page) 92 return page_address(page);

6.2. Mapping High Memory Pages Atomically 93 94 idx = type + KM_TYPE_NR*smp_processor_id(); 95 vaddr = __fix_to_virt(FIX_KMAP_BEGIN + idx); 96 #if HIGHMEM_DEBUG 97 if (!pte_none(*(kmap_pte-idx))) 98 out_of_line_bug(); 99 #endif 100 set_pte(kmap_pte-idx, mk_pte(page, kmap_prot)); 101 __flush_tlb_one(vaddr); 102 103 return (void*) vaddr; 104 }

219

86 The parameters are the page to map and the type of usage required. One slot per usage per processor is maintained 91-92 If the page is in low memory, return a direct mapping 94 type gives which slot to use. KM_TYPE_NR * smp_processor_id() gives the set of slots reserved for this processor 95 Get the virtual address 97-98 Debugging code. In reality a PTE will always exist 100 Set the PTE into the reserved slot 101 Flush the TLB for this slot 103 Return the virtual address Function: kunmap_atomic (include/asm-i386/highmem.h) This entire function is debug code. The reason is that as pages are only mapped here atomically, they will only be used in a tiny place for a short time before being unmapped. It is safe to leave the page there as it will not be referenced after unmapping and another mapping to the same slot will simply replce it. 106 static inline void kunmap_atomic(void *kvaddr, enum km_type type) 107 { 108 #if HIGHMEM_DEBUG 109 unsigned long vaddr = (unsigned long) kvaddr; 110 enum fixed_addresses idx = type + KM_TYPE_NR*smp_processor_id(); 111 112 if (vaddr < FIXADDR_START) // FIXME 113 return; 114 115 if (vaddr != __fix_to_virt(FIX_KMAP_BEGIN+idx)) 116 out_of_line_bug();

6.3. Bounce Buﬀers 117 118 119 120 121 122 123 124 #endif 125 }

220

/* * force other mappings to Oops if they’ll try to access * this pte without first remap it */ pte_clear(kmap_pte-idx); __flush_tlb_one(vaddr);

109 Get the virtual address 112-113 If the address supplied is not in the ﬁxed area, return 115-116 If the address does not correspond to the reserved slot for this type of usage and processor, declare it 122-123 Unmap the page now so that if it is referenced again, it will cause an Oops

6.3

Bounce Buﬀers

Function: create_buﬀers (mm/highmem.c)

create_bounce

alloc_bounce_bh

alloc_bounce_page

set_bh_page

copy_from_high_bh

kmem_cache_alloc

yield

wakeup_bdflush

Figure 6.2: Call Graph: create_bounce High level function for the creation of bounce buﬀers. It is broken into two major parts, the allocation of the necessary resources, and the copying of data from the template. 398 struct buffer_head * create_bounce(int rw, struct buffer_head * bh_orig) 399 { 400 struct page *page; 401 struct buffer_head *bh; 402 403 if (!PageHighMem(bh_orig->b_page)) 404 return bh_orig;

6.3. Bounce Buﬀers 405 406 413 414 415 416

221

bh = alloc_bounce_bh(); page = alloc_bounce_page(); set_bh_page(bh, page, 0);

398 The parameters of the function are rw is set to 1 if this is a write buﬀer bh_orig is the template buﬀer head to copy from 403-404 If the template buﬀer head is already in low memory, simply return it 406 Allocate a buﬀer head from the slab allocator or from the emergency pool if it fails 413 Allocate a page from the buddy allocator or the emergency pool if it fails 415 Associate the allocated page with the allocated buffer_head 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 bh->b_next = NULL; bh->b_blocknr = bh_orig->b_blocknr; bh->b_size = bh_orig->b_size; bh->b_list = -1; bh->b_dev = bh_orig->b_dev; bh->b_count = bh_orig->b_count; bh->b_rdev = bh_orig->b_rdev; bh->b_state = bh_orig->b_state; #ifdef HIGHMEM_DEBUG bh->b_flushtime = jiffies; bh->b_next_free = NULL; bh->b_prev_free = NULL; /* bh->b_this_page */ bh->b_reqnext = NULL; bh->b_pprev = NULL; #endif /* bh->b_page */ if (rw == WRITE) { bh->b_end_io = bounce_end_io_write; copy_from_high_bh(bh, bh_orig); } else bh->b_end_io = bounce_end_io_read; bh->b_private = (void *)bh_orig; bh->b_rsector = bh_orig->b_rsector; #ifdef HIGHMEM_DEBUG memset(&bh->b_wait, -1, sizeof(bh->b_wait)); #endif

6.3. Bounce Buﬀers 444 445 446 }

222

return bh;

Populate the newly created buffer_head 424 Copy in information essentially verbatim except for the b_list ﬁeld as this buﬀer is not directly connected to the others on the list 426-431 Debugging only information 434-437 If this is a buﬀer that is to be written to then the callback function to end the IO is bounce_end_io_write() which is called when the device has received all the information. As the data exists in high memory, it is copied “down” with copy_from_high_bh() 437-438 If we are waiting for a device to write data into the buﬀer, then the callback function bounce_end_io_read() is used 439-440 Copy the remaining information from the template buffer_head 445 Return the new bounce buﬀer Function: alloc_bounce_bh (mm/highmem.c) This function ﬁrst tries to allocate a buffer_head from the slab allocator and if that fails, an emergency pool will be used. 362 struct buffer_head *alloc_bounce_bh (void) 363 { 364 struct list_head *tmp; 365 struct buffer_head *bh; 366 367 bh = kmem_cache_alloc(bh_cachep, SLAB_NOHIGHIO); 368 if (bh) 369 return bh; 373 374 wakeup_bdflush(); 367 Try to allocate a new buffer_head from the slab allocator. Note how the request is made to not use IO operations that involve high IO to avoid recursion 368-369 If the allocation was successful, return 374 If it was not, wake up bdﬂush to launder pages 375 376 repeat_alloc: 380 tmp = &emergency_bhs; 381 spin_lock_irq(&emergency_lock);

6.3. Bounce Buﬀers 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 } if (!list_empty(tmp)) { bh = list_entry(tmp->next, struct buffer_head, b_inode_buffers); list_del(tmp->next); nr_emergency_bhs--; } spin_unlock_irq(&emergency_lock); if (bh) return bh; /* we need to wait I/O completion */ run_task_queue(&tq_disk); yield(); goto repeat_alloc;

223

The allocation from the slab failed so allocate from the emergency pool. 380 Get the end of the emergency buﬀer head list 381 Acquire the lock protecting the pools 382-386 If the pool is not empty, take a buffer_head from the list and decrement the nr_emergency_bhs counter 387 Release the lock 388-389 If the allocation was successful, return it 392 If not, we are seriously short of memory and the only way the pool will replenish is if high memory IO completes. Therefore, requests on tq_disk are started so the data will be written to disk, probably freeing up pages in the process 394 Yield the processor 395 Attempt to allocate from the emergency pools again Function: alloc_bounce_page (mm/highmem.c) This function is essentially identical to alloc_bounce_bh() It ﬁrst tries to allocate a page from the buddy allocator and if that fails, an emergency pool will be used. 326 struct page *alloc_bounce_page (void) 327 { 328 struct list_head *tmp; 329 struct page *page; 330 331 page = alloc_page(GFP_NOHIGHIO); 332 if (page)

6.3. Bounce Buﬀers 333 337 338 return page; wakeup_bdflush();

224

331-333 Allocate from the buddy allocator and return the page if successful 338 Wake bdﬂush to launder pages 339 340 repeat_alloc: 344 tmp = &emergency_pages; 345 spin_lock_irq(&emergency_lock); 346 if (!list_empty(tmp)) { 347 page = list_entry(tmp->next, struct page, list); 348 list_del(tmp->next); 349 nr_emergency_pages--; 350 } 351 spin_unlock_irq(&emergency_lock); 352 if (page) 353 return page; 354 355 /* we need to wait I/O completion */ 356 run_task_queue(&tq_disk); 357 358 yield(); 359 goto repeat_alloc; 360 } 344 Get the end of the emergency buﬀer head list 334 Acquire the lock protecting the pools 346-350 If the pool is not empty, take a page from the list and decrement the number of available nr_emergency_pages 351 Release the lock 352-353 If the allocation was successful, return it 356 Run the IO task queue to try and replenish the emergency pool 394 Yield the processor 395 Attempt to allocate from the emergency pools again

6.3.1. Copying via Bounce Buﬀers

225

6.3.1

Copying via Bounce Buﬀers

Function: bounce_end_io_write (mm/highmem.c) This function is called when a bounce buﬀer used for writing to a device completes IO. As the buﬀer is copied from high memory and to the device, there is nothing left to do except reclaim the resources 312 static void bounce_end_io_write (struct buffer_head *bh, int uptodate) 313 { 314 bounce_end_io(bh, uptodate); 315 } Function: bounce_end_io_read (mm/highmem.c) This is called when data has been read from the device and needs to be copied to high memory. It is called from interrupt so has to be more careful 317 static void bounce_end_io_read (struct buffer_head *bh, int uptodate) 318 { 319 struct buffer_head *bh_orig = (struct buffer_head *)(bh->b_private); 320 321 if (uptodate) 322 copy_to_high_bh_irq(bh_orig, bh); 323 bounce_end_io(bh, uptodate); 324 } 321-322 The data is just copied to the bounce buﬀer to needs to be moved to high memory with copy_to_high_bh_irq() 323 Reclaim the resources Function: copy_from_high_bh (mm/highmem.c) This function copies data from a high memory buffer_head to a bounce buﬀer. 208 static inline void copy_from_high_bh (struct buffer_head *to, 209 struct buffer_head *from) 210 { 211 struct page *p_from; 212 char *vfrom; 213 214 p_from = from->b_page; 215 216 vfrom = kmap_atomic(p_from, KM_USER0); 217 memcpy(to->b_data, vfrom + bh_offset(from), to->b_size); 218 kunmap_atomic(vfrom, KM_USER0); 219 }

6.3.1. Copying via Bounce Buﬀers

226

216 Map the high memory page into low memory. This path is protected by the IRQ safe lock io_request_lock so it is safe to call kmap_atomic() 217 Copy the data 218 Unmap the page Function: copy_to_high_bh_irq (mm/highmem.c) Called from interrupt after the device has ﬁnished writing data to the bounce buﬀer. This function copies data to high memory 221 static inline void copy_to_high_bh_irq (struct buffer_head *to, 222 struct buffer_head *from) 223 { 224 struct page *p_to; 225 char *vto; 226 unsigned long flags; 227 228 p_to = to->b_page; 229 __save_flags(flags); 230 __cli(); 231 vto = kmap_atomic(p_to, KM_BOUNCE_READ); 232 memcpy(vto + bh_offset(to), from->b_data, to->b_size); 233 kunmap_atomic(vto, KM_BOUNCE_READ); 234 __restore_flags(flags); 235 } 229-230 Save the ﬂags and disable interrupts 231 Map the high memory page into low memory 232 Copy the data 233 Unmap the page 234 Restore the interrupt ﬂags Function: bounce_end_io (mm/highmem.c) Reclaims the resources used by the bounce buﬀers. If emergency pools are depleted, the resources are added to it. 237 static inline void bounce_end_io (struct buffer_head *bh, int uptodate) 238 { 239 struct page *page; 240 struct buffer_head *bh_orig = (struct buffer_head *)(bh->b_private); 241 unsigned long flags; 242

6.3.1. Copying via Bounce Buﬀers

227

243 bh_orig->b_end_io(bh_orig, uptodate); 244 245 page = bh->b_page; 246 247 spin_lock_irqsave(&emergency_lock, flags); 248 if (nr_emergency_pages >= POOL_SIZE) 249 __free_page(page); 250 else { 251 /* 252 * We are abusing page->list to manage 253 * the highmem emergency pool: 254 */ 255 list_add(&page->list, &emergency_pages); 256 nr_emergency_pages++; 257 } 258 259 if (nr_emergency_bhs >= POOL_SIZE) { 260 #ifdef HIGHMEM_DEBUG 261 /* Don’t clobber the constructed slab cache */ 262 init_waitqueue_head(&bh->b_wait); 263 #endif 264 kmem_cache_free(bh_cachep, bh); 265 } else { 266 /* 267 * Ditto in the bh case, here we abuse b_inode_buffers: 268 */ 269 list_add(&bh->b_inode_buffers, &emergency_bhs); 270 nr_emergency_bhs++; 271 } 272 spin_unlock_irqrestore(&emergency_lock, flags); 273 } 243 Call the IO completion callback for the original buffer_head 245 Get the pointer to the buﬀer page to free 247 Acquire the lock to the emergency pool 248-249 If the page pool is full, just return the page to the buddy allocator 250-257 Otherwise add this page to the emergency pool 259-265 If the buffer_head pool is full, just return it to the slab allocator 265-271 Otherwise add this buffer_head to the pool 272 Release the lock

6.4. Emergency Pools

228

6.4

Emergency Pools

There is only one function of relevance to the emergency pools and that is the init function. It is called during system startup and then the code is deleted as it is never needed again Function: init_emergency_pool (mm/highmem.c) Create a pool for emergency pages and for emergency buffer_heads 275 static __init int init_emergency_pool(void) 276 { 277 struct sysinfo i; 278 si_meminfo(&i); 279 si_swapinfo(&i); 280 281 if (!i.totalhigh) 282 return 0; 283 284 spin_lock_irq(&emergency_lock); 285 while (nr_emergency_pages < POOL_SIZE) { 286 struct page * page = alloc_page(GFP_ATOMIC); 287 if (!page) { 288 printk("couldn’t refill highmem emergency pages"); 289 break; 290 } 291 list_add(&page->list, &emergency_pages); 292 nr_emergency_pages++; 293 } 281-282 If there is no high memory available, do not bother 284 Acquire the lock protecting the pools 285-293 Allocate POOL_SIZE pages from the buddy allocator and add them to a linked list. Keep a count of the number of pages in the pool with nr_emergency_pages 294 295 296 297 298 299 300 301 302 303 304 while (nr_emergency_bhs < POOL_SIZE) { struct buffer_head * bh = kmem_cache_alloc(bh_cachep, SLAB_ATOMIC); if (!bh) { printk("couldn’t refill highmem emergency bhs"); break; } list_add(&bh->b_inode_buffers, &emergency_bhs); nr_emergency_bhs++; } spin_unlock_irq(&emergency_lock); printk("allocated %d pages and %d bhs reserved for the highmem bounces\n",

6.4. Emergency Pools 305 306 307 308 } nr_emergency_pages, nr_emergency_bhs); return 0;

229

294-302 Allocate POOL_SIZE buffer_heads from the slab allocator and add them to a linked list linked by b_inode_buffers. Keep track of how many heads are in the pool with nr_emergency_bhs 303 Release the lock protecting the pools 307 Return success

Chapter 7 Page Frame Reclamation
7.1 Page Swap Daemon

Function: kswapd_init (mm/vmscan.c) Start the kswapd kernel thread 767 static int __init kswapd_init(void) 768 { 769 printk("Starting kswapd\n"); 770 swap_setup(); 771 kernel_thread(kswapd, NULL, CLONE_FS | CLONE_FILES | CLONE_SIGNAL); 772 return 0; 773 } 770 swap_setup() setups up how many pages will be prefetched when reading from backing storage based on the amount of physical memory 771 Start the kswapd kernel thread Function: kswapd (mm/vmscan.c) The main function of the kswapd kernel thread. 720 int kswapd(void *unused) 721 { 722 struct task_struct *tsk = current; 723 DECLARE_WAITQUEUE(wait, tsk); 724 725 daemonize(); 726 strcpy(tsk->comm, "kswapd"); 727 sigfillset(&tsk->blocked); 728 741 tsk->flags |= PF_MEMALLOC; 742 746 for (;;) { 747 __set_current_state(TASK_INTERRUPTIBLE); 230

7.1. Page Swap Daemon 748 749 750 751 752 753 754 755 756 762 763 764 765 } add_wait_queue(&kswapd_wait, &wait); mb(); if (kswapd_can_sleep()) schedule(); __set_current_state(TASK_RUNNING); remove_wait_queue(&kswapd_wait, &wait); kswapd_balance(); run_task_queue(&tq_disk); }

231

725 Call daemonize() which will make this a kernel thread, remove the mm context, close all ﬁles and re-parent the process 726 Set the name of the process 727 Ignore all signals 741 By setting this ﬂag, the physical page allocator will always try to satisfy requests for pages. As this process will always be trying to free pages, it is worth satisfying requests 746-764 Endlessly loop 747-748 This adds kswapd to the wait queue in preparation to sleep 750 The Memory Block (mb) function ensures that all reads and writes that occurred before this line will be visible to all CPU’s 751 kswapd_can_sleep() cycles through all nodes and zones checking the need_balance ﬁeld. If any of them are set to 1, kswapd can not sleep 752 By calling schedule, kswapd will sleep until woken again by the physical page allocator 754-755 Once woken up, kswapd is removed from the wait queue as it is now running 762 kswapd_balance() cycles through all zones and calls try_to_free_pages_zone() for each zone that requires balance 763 Run the task queue for processes waiting to write to disk Function: kswapd_can_sleep (mm/vmscan.c) Simple function to cycle through all pgdats to call kswapd_can_sleep_pgdat() on each.

7.1. Page Swap Daemon 695 static int kswapd_can_sleep(void) 696 { 697 pg_data_t * pgdat; 698 699 for_each_pgdat(pgdat) { 700 if (!kswapd_can_sleep_pgdat(pgdat)) 701 return 0; 702 } 703 704 return 1; 705 }

232

699-702 for_each_pgdat() does exactly as the name implies. It cycles through all available pgdat’s. On the x86, there will only be one Function: kswapd_can_sleep_pgdat (mm/vmscan.c) Cycles through all zones to make sure none of them need balance. 680 static int kswapd_can_sleep_pgdat(pg_data_t * pgdat) 681 { 682 zone_t * zone; 683 int i; 684 685 for (i = pgdat->nr_zones-1; i >= 0; i--) { 686 zone = pgdat->node_zones + i; 687 if (!zone->need_balance) 688 continue; 689 return 0; 690 } 691 692 return 1; 693 } 685-689 Simple for loop to cycle through all zones 686 The node_zones ﬁeld is an array of all available zones so adding i gives the index 687-688 If the zone does not need balance, continue 689 0 is returned if any needs balance indicating kswapd can not sleep 692 Return indicating kswapd can sleep if the for loop completes Function: kswapd_balance (mm/vmscan.c) Continuously cycle through each pgdat until none require balancing

7.1. Page Swap Daemon 667 static void kswapd_balance(void) 668 { 669 int need_more_balance; 670 pg_data_t * pgdat; 671 672 do { 673 need_more_balance = 0; 674 675 for_each_pgdat(pgdat) 676 need_more_balance |= kswapd_balance_pgdat(pgdat); 677 } while (need_more_balance); 678 } 672-677 Continuously cycle through each pgdat

233

675 For each pgdat, call kswapd_balance_pgdat(). If any of them had required balancing, need_more_balance will be equal to 1 Function: kswapd_balance_pgdat (mm/vmscan.c) 641 static int kswapd_balance_pgdat(pg_data_t * pgdat) 642 { 643 int need_more_balance = 0, i; 644 zone_t * zone; 645 646 for (i = pgdat->nr_zones-1; i >= 0; i--) { 647 zone = pgdat->node_zones + i; 648 if (unlikely(current->need_resched)) 649 schedule(); 650 if (!zone->need_balance) 651 continue; 652 if (!try_to_free_pages_zone(zone, GFP_KSWAPD)) { 653 zone->need_balance = 0; 654 __set_current_state(TASK_INTERRUPTIBLE); 655 schedule_timeout(HZ); 656 continue; 657 } 658 if (check_classzone_need_balance(zone)) 659 need_more_balance = 1; 660 else 661 zone->need_balance = 0; 662 } 663 664 return need_more_balance; 665 } 646-662 Cycle through each zone and call try_to_free_pages_zone() if it needs rebalancing

7.2. Page Cache 647 node_zones is an array and i is an index within it

234

648-649 Call schedule() if the quanta is expired to prevent kswapd hogging the CPU 650-651 If the zone does not require balance, move to the next one 652-657 If the function returns 0, it means the out_of_memory() function was called because a suﬃcient number of pages could not be freed. kswapd sleeps for 1 second to give the system a chance to reclaim the killed processes pages 658-661 If is was successful, check_classzone_need_balance() is called to see if the zone requires further balancing or not 664 Return 1 if one zone requires further balancing

7.2

Page Cache

Function: lru_cache_add (mm/swap.c) Adds a page to the LRU inactive_list. 58 void lru_cache_add(struct page * page) 59 { 60 if (!PageLRU(page)) { 61 spin_lock(&pagemap_lru_lock); 62 if (!TestSetPageLRU(page)) 63 add_page_to_inactive_list(page); 64 spin_unlock(&pagemap_lru_lock); 65 } 66 } 60 If the page is not already part of the LRU lists, add it 61 Acquire the LRU lock 62-63 Test and set the LRU bit. If it was clear then call add_page_to_inactive_list() 64 Release the LRU lock Function: add_page_to_active_list (include/linux/swap.h) Adds the page to the active_list 179 #define add_page_to_active_list(page) 180 do { 181 DEBUG_LRU_PAGE(page); 182 SetPageActive(page); 183 list_add(&(page)->lru, &active_list); 184 nr_active_pages++; 185 } while (0) \ \ \ \ \ \

7.2. Page Cache

235

181 The DEBUG_LRU_PAGE() macro will call BUG() if the page is already on the LRU list or is marked been active 182 Update the ﬂags of the page to show it is active 183 Add the page to the active_list 184 Update the count of the number of pages in the active_list Function: add_page_to_inactive_list (include/linux/swap.h) Adds the page to the inactive_list 187 #define add_page_to_inactive_list(page) 188 do { 189 DEBUG_LRU_PAGE(page); 190 list_add(&(page)->lru, &inactive_list); 191 nr_inactive_pages++; 192 } while (0) \ \ \ \ \

189 The DEBUG_LRU_PAGE() macro will call BUG() if the page is already on the LRU list or is marked been active 190 Add the page to the inactive_list 191 Update the count of the number of inactive pages on the list Function: lru_cache_del (mm/swap.c) Acquire the lock protecting the LRU lists before calling __lru_cache_del(). 90 void lru_cache_del(struct page * page) 91 { 92 spin_lock(&pagemap_lru_lock); 93 __lru_cache_del(page); 94 spin_unlock(&pagemap_lru_lock); 95 } 92 Acquire the LRU lock 93 __lru_cache_del() does the “real” work of removing the page from the LRU lists 94 Release the LRU lock

7.2. Page Cache Function: __lru_cache_del (mm/swap.c) Select which function is needed to remove the page from the LRU list. 75 void __lru_cache_del(struct page * page) 76 { 77 if (TestClearPageLRU(page)) { 78 if (PageActive(page)) { 79 del_page_from_active_list(page); 80 } else { 81 del_page_from_inactive_list(page); 82 } 83 } 84 } 77 Test and clear the ﬂag indicating the page is in the LRU 78-82 If the page is on the LRU, select the appropriate removal function

236

78-79 If the page is active, then call del_page_from_active_list() else delete from the inactive list with del_page_from_inactive_list() Function: del_page_from_active_list (include/linux/swap.h) Remove the page from the active_list 194 #define del_page_from_active_list(page) 195 do { 196 list_del(&(page)->lru); 197 ClearPageActive(page); 198 nr_active_pages--; 199 } while (0) 196 Delete the page from the list 197 Clear the ﬂag indicating it is part of active_list. The ﬂag indicating it is part of the LRU list has already been cleared by __lru_cache_del() 198 Update the count of the number of pages in the active_list Function: del_page_from_inactive_list (include/linux/swap.h) 201 #define del_page_from_inactive_list(page) 202 do { 203 list_del(&(page)->lru); 204 nr_inactive_pages--; 205 } while (0) 203 Remove the page from the LRU list 204 Update the count of the number of pages in the inactive_list \ \ \ \ \ \ \ \ \

7.2. Page Cache

237

Function: mark_page_accessed (mm/ﬁlemap.c) This marks that a page has been referenced. If the page is already on the active_list or the referenced ﬂag is clear, the referenced ﬂag will be simply set. If it is in the inactive_list and the referenced ﬂag has been set, activate_page() will be called to move the page to the top of the active_list. 1316 void mark_page_accessed(struct page *page) 1317 { 1318 if (!PageActive(page) && PageReferenced(page)) { 1319 activate_page(page); 1320 ClearPageReferenced(page); 1321 } else 1322 SetPageReferenced(page); 1323 } 1318-1321 If the page is on the inactive_list (!PageActive) and has been referenced recently (PageReferenced), activate_page() is called to move it to the active_list 1322 Otherwise, mark the page as been referenced Function: activate_lock (mm/swap.c) Acquire the LRU lock before calling activate_page_nolock() which moves the page from the inactive_list to the active_list. 47 void activate_page(struct page * page) 48 { 49 spin_lock(&pagemap_lru_lock); 50 activate_page_nolock(page); 51 spin_unlock(&pagemap_lru_lock); 52 } 49 Acquire the LRU lock 50 Call the main work function 51 Release the LRU lock Function: activate_page_nolock (mm/swap.c) Move the page from the inactive_list to the active_list 39 static inline void activate_page_nolock(struct page * page) 40 { 41 if (PageLRU(page) && !PageActive(page)) { 42 del_page_from_inactive_list(page); 43 add_page_to_active_list(page); 44 } 45 } 41 Make sure the page is on the LRU and not already on the active_list 42-43 Delete the page from the inactive_list and add to the active_list

7.2. Page Cache Function: page_cache_get (include/linux/pagemap.h) 31 #define page_cache_get(x) get_page(x)

238

31 Simple call get_page() which simply uses atomic_inc() to increment the page reference count Function: page_cache_release (include/linux/pagemap.h) 32 #define page_cache_release(x) __free_page(x)

32 Call __free_page() which decrements the page count. If the count reaches 0, the page will be freed Function: add_to_page_cache (mm/ﬁlemap.c) Acquire the lock protecting the page cache before calling __add_to_page_cache() which will add the page to the page hash table and inode queue which allows the pages belonging to ﬁles to be found quickly. 665 void add_to_page_cache(struct page * page, struct address_space * mapping, unsigned long offset) 666 { 667 spin_lock(&pagecache_lock); 668 __add_to_page_cache(page, mapping, offset, page_hash(mapping, offset)); 669 spin_unlock(&pagecache_lock); 670 lru_cache_add(page); 671 } 667 Acquire the lock protecting the page hash and inode queues 668 Call the function which performs the “real” work 669 Release the lock protecting the hash and inode queue 670 Add the page to the page cache Function: __add_to_page_cache (mm/ﬁlemap.c) Clear all page ﬂags, lock it, take a reference and add it to the inode and hash queues. 651 static inline void __add_to_page_cache(struct page * page, 652 struct address_space *mapping, unsigned long offset, 653 struct page **hash) 654 { 655 unsigned long flags; 656 657 flags = page->flags & ~(1 << PG_uptodate |

7.3. Shrinking all caches 1 << PG_error | 1 << PG_dirty | 1 << PG_referenced | 1 << PG_arch_1 | 1 << PG_checked); page->flags = flags | (1 << PG_locked); page_cache_get(page); page->index = offset; add_page_to_inode_queue(mapping, page); add_page_to_hash_queue(page, hash);

239

658 659 660 661 662 663 }

657 Clear all page ﬂags 658 Lock the page 659 Take a reference to the page in case it gets freed prematurely 660 Update the index so it is known what ﬁle oﬀset this page represents 661 Add the page to the inode queue. This links the page via the page→list to the clean_pages list in the address_space and points the page→mapping to the same address_space 662 Add it to the page hash. Pages are hashed based on the address_space and the inode. It allows pages belonging to an address_space to be found without having to lineraly search the inode queue

7.3

Shrinking all caches

Function: shrink_caches (mm/vmscan.c) 560 static int shrink_caches(zone_t * classzone, int priority, unsigned int gfp_mask, int nr_pages) 561 { 562 int chunk_size = nr_pages; 563 unsigned long ratio; 564 565 nr_pages -= kmem_cache_reap(gfp_mask); 566 if (nr_pages <= 0) 567 return 0; 568 569 nr_pages = chunk_size; 570 /* try to keep the active list 2/3 of the size of the cache */ 571 ratio = (unsigned long) nr_pages * nr_active_pages / ((nr_inactive_pages + 1) * 2); 572 refill_inactive(ratio); 573 574 nr_pages = shrink_cache(nr_pages, classzone, gfp_mask, priority);

7.3. Shrinking all caches

240

shrink_caches

kmem_cache_reap

refill_inactive

shrink_cache

shrink_dcache_memory

shrink_icache_memory

try_to_release_page

swap_out

__remove_inode_page

__delete_from_swap_cache

swap_free

__free_pages

try_to_free_buffers

swap_out_mm

mmput

find_vma

swap_out_vma

swap_out_pgd

swap_out_pmd

try_to_swap_out

Figure 7.1: shrink_cache

7.3. Shrinking all caches 575 if (nr_pages <= 0) 576 return 0; 577 578 shrink_dcache_memory(priority, gfp_mask); 579 shrink_icache_memory(priority, gfp_mask); 580 #ifdef CONFIG_QUOTA 581 shrink_dqcache_memory(DEF_PRIORITY, gfp_mask); 582 #endif 583 584 return nr_pages; 585 } 560 The parameters are as follows; classzone is the zone that pages should be freed from priority determines how much work will be done to free pages gfp_mask determines what sort of actions may be taken nr_pages is the number of pages remaining to be freed

241

565-567 Ask the slab allocator to free up some pages. If enough are freed, the function returns otherwise nr_pages will be freed from other caches 571-572 Move pages from the active_list to the inactive_list with refill_inactive(). The number of pages moved depends on how many pages need to be freed and to have active_list about two thirds the size of the page cache 574-575 Shrink the page cache, if enough pages are freed, return 578-582 Shrink the dcache, icache and dqcache. These are small objects in themselves but the cascading eﬀect frees up a lot of disk buﬀers 584 Return the number of pages remaining to be freed Function: try_to_free_pages (mm/vmscan.c) This function cycles through all pgdats and tries to balance the preferred allocation zone (usually ZONE_NORMAL) for each of them. This function is only called from one place, buffer.c:free_more_memory() when the buﬀer manager fails to create new buﬀers or grow existing ones. It calls try_to_free_pages() with GFP_NOIO as the gfp_mask. This results in the ﬁrst zone in pg_data_t→node_zonelists having pages freed so that buﬀers can grow. This array is the preferred order of zones to allocate from and usually will begin with ZONE_NORMAL which is required by the buﬀer manager. On NUMA architectures, some nodes may have ZONE_DMA as the preferred zone if the memory bank is dedicated to IO devices and UML also uses only this zone. As the buﬀer manager is restricted in the zones is uses, there is no point balancing other zones. 607 int try_to_free_pages(unsigned int gfp_mask) 608 {

7.3. Shrinking all caches 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 } pg_data_t *pgdat; zonelist_t *zonelist; unsigned long pf_free_pages; int error = 0; pf_free_pages = current->flags & PF_FREE_PAGES; current->flags &= ~PF_FREE_PAGES; for_each_pgdat(pgdat) { zonelist = pgdat->node_zonelists + (gfp_mask & GFP_ZONEMASK); error |= try_to_free_pages_zone( zonelist->zones[0], gfp_mask); } current->flags |= pf_free_pages; return error;

242

614-615 This clears the PF_FREE_PAGES ﬂag if it is set so that pages freed by the process will be returned to the global pool rather than reserved for the process itself 617-620 Cycle through all nodes and call try_to_free_pages() for the preferred zone in each node 618 This function is only called with GFP_NOIO as a parameter. When ANDed with GFP_ZONEMASK, it will always result in 0 622-623 Restore the process ﬂags and return the result Function: try_to_free_pages_zone (mm/vmscan.c) Try to free SWAP_CLUSTER_MAX pages from the supplied zone. 587 int try_to_free_pages_zone(zone_t *classzone, unsigned int gfp_mask) 588 { 589 int priority = DEF_PRIORITY; 590 int nr_pages = SWAP_CLUSTER_MAX; 591 592 gfp_mask = pf_gfp_mask(gfp_mask); 593 do { 594 nr_pages = shrink_caches(classzone, priority, gfp_mask, nr_pages); 595 if (nr_pages <= 0) 596 return 1; 597 } while (--priority); 598 599 /*

7.4. Reﬁlling inactive_list 600 601 602 603 604 605 } * Hmm.. Cache shrink failed - time to kill something? * Mhwahahhaha! This is the part I really like. Giggle. */ out_of_memory(); return 0;

243

589 Start with the lowest priority. Statically deﬁned to be 6 590 Try and free SWAP_CLUSTER_MAX pages. Statically deﬁned to be 32 592 pf_gfp_mask() checks the PF_NOIO ﬂag in the current process ﬂags. If no IO can be performed, it ensures there is no incompatible ﬂags in the GFP mask 593-597 Starting with the lowest priority and increasing with each pass, call shrink_caches() until nr_pages has been freed 595-596 If enough pages were freed, return indicating that the work is complete 603 If enough pages could not be freed even at highest priority (where at worst the full inactive_list is scanned) then check to see if we are out of memory. If we are, then a process will be selected to be killed 604 Return indicating that we failed to free enough pages

7.4

Reﬁlling inactive_list

Function: reﬁll_inactive (mm/vmscan.c) Move nr_pages from the active_list to the inactive_list 533 static void refill_inactive(int nr_pages) 534 { 535 struct list_head * entry; 536 537 spin_lock(&pagemap_lru_lock); 538 entry = active_list.prev; 539 while (nr_pages && entry != &active_list) { 540 struct page * page; 541 542 page = list_entry(entry, struct page, lru); 543 entry = entry->prev; 544 if (PageTestandClearReferenced(page)) { 545 list_del(&page->lru); 546 list_add(&page->lru, &active_list); 547 continue; 548 } 549

7.5. Reclaiming pages from the page cache 550 551 552 553 554 555 556 557 } nr_pages--; del_page_from_active_list(page); add_page_to_inactive_list(page); SetPageReferenced(page); } spin_unlock(&pagemap_lru_lock);

244

537 Acquire the lock protecting the LRU list 538 Take the last entry in the active_list 539-555 Move nr_pages or until the active_list is empty 542 Get the struct page for this entry 544-548 Test and clear the referenced ﬂag. If it has been referenced, then it is moved back to the top of the active_list 550-553 Move one page from the active_list to the inactive_list 554 Mark it referenced so that if it is referenced again soon, it will be promoted back to the active_list without requiring a second reference 556 Release the lock protecting the LRU list

7.5

Reclaiming pages from the page cache

Function: shrink_cache (mm/vmscan.c) 338 static int shrink_cache(int nr_pages, zone_t * classzone, unsigned int gfp_mask, int priority) 339 { 340 struct list_head * entry; 341 int max_scan = nr_inactive_pages / priority; 342 int max_mapped = min((nr_pages << (10 - priority)), max_scan / 10); 343 344 spin_lock(&pagemap_lru_lock); 345 while (--max_scan >= 0 && (entry = inactive_list.prev) != &inactive_list) { 338 The parameters are as follows; nr_pages The number of pages to swap out classzone The zone we are interested in swapping pages out for. Pages not belonging to this zone are skipped

7.5. Reclaiming pages from the page cache gfp_mask The gfp mask determining what actions may be taken

245

priority The priority of the function, starts at DEF_PRIORITY (6) and decreases to the highest priority of 1 341 The maximum number of pages to scan is the number of pages in the active_list divided by the priority. At lowest priority, 1/6th of the list may scanned. At highest priority, the full list may be scanned 342 The maximum amount of process mapped pages allowed is either one tenth of the max_scan value or nrp ages ∗ 210−priority . If this number of pages are found, whole processes will be swapped out 344 Lock the LRU list 345 Keep scanning until max_scan pages have been scanned or the inactive_list is empty 346 347 348 349 350 351 352 353 354 355 struct page * page; if (unlikely(current->need_resched)) { spin_unlock(&pagemap_lru_lock); __set_current_state(TASK_RUNNING); schedule(); spin_lock(&pagemap_lru_lock); continue; }

348-354 Reschedule if the quanta has been used up 349 Free the LRU lock as we are about to sleep 350 Show we are still running 351 Call schedule() so another process can be context switched in 352 Re-acquire the LRU lock 353 Move to the next page, this has the curious side eﬀect of skipping over one page. It is unclear why this happens and is possibly a bug 356 357 358 359 360 361 362 363 364 page = list_entry(entry, struct page, lru); BUG_ON(!PageLRU(page)); BUG_ON(PageActive(page)); list_del(entry); list_add(entry, &inactive_list); /*

7.5. Reclaiming pages from the page cache 365 366 367 368 369 370 371 372 373 374 375 376 377 3

246

* Zero page counts can happen because we unlink the pages * _after_ decrementing the usage count.. */ if (unlikely(!page_count(page))) continue; if (!memclass(page_zone(page), classzone)) continue; /* Racy check to avoid trylocking when not worthwhile */ if (!page->buffers && (page_count(page) != 1 || !page->mapping)) goto page_mapped;

356 Get the struct page for this entry in the LRU 358-359 It is a bug if the page either belongs to the active_list or is currently marked as active 361-362 Move the page to the top of the inactive_list so that if the page is skipped, it will not be simply examined a second time 368-369 If the page count has already reached 0, skip over it. This is possible if another process has just unlinked the page and is waiting for something like IO to complete before removing it from the LRU 371-372 Skip over this page if it belongs to a zone we are not currently interested in 375-376 If the page is mapped by a process, then goto page_mapped where the max_mapped is decremented and next page examined. If max_mapped reaches 0, process pages will be swapped out 382 383 384 385 386 387 388 389 390 391 if (unlikely(TryLockPage(page))) { if (PageLaunder(page) && (gfp_mask & __GFP_FS)) { page_cache_get(page); spin_unlock(&pagemap_lru_lock); wait_on_page(page); page_cache_release(page); spin_lock(&pagemap_lru_lock); } continue; }

Page is locked and the launder bit is set. In this case, wait until the IO is complete and then try to free the page

7.5. Reclaiming pages from the page cache

247

382-383 If we could not lock the page, the PG_launder bit is set and the GFP ﬂags allow the caller to perform FS operations, then... 384 Take a reference to the page so it does not disappear while we sleep 385 Free the LRU lock 386 Wait until the IO is complete 387 Release the reference to the page. If it reaches 0, the page will be freed 388 Re-acquire the LRU lock 390 Move to the next page 392 393 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417

if (PageDirty(page) && is_page_cache_freeable(page) && page->mapping) { int (*writepage)(struct page *); writepage = page->mapping->a_ops->writepage; if ((gfp_mask & __GFP_FS) && writepage) { ClearPageDirty(page); SetPageLaunder(page); page_cache_get(page); spin_unlock(&pagemap_lru_lock); writepage(page); page_cache_release(page); spin_lock(&pagemap_lru_lock); continue; } }

This handles the case where a page is dirty, is not mapped by any process has no buﬀers and is backed by a ﬁle or device mapping. The page is cleaned and will be removed by the previous block of code during the next pass through the list. 393 PageDirty checks the PG_dirty bit, is_page_cache_freeable() will return true if it is not mapped by any process and has no buﬀers 404 Get a pointer to the necessary writepage() function for this mapping or device 405-416 This block of code can only be executed if a writepage() function is available and the GFP ﬂags allow ﬁle operations 406-407 Clear the dirty bit and mark that the page is being laundered 408 Take a reference to the page so it will not be freed unexpectedly

7.5. Reclaiming pages from the page cache 409 Unlock the LRU list 411 Call the writepage function 412 Release the reference to the page 414-415 Re-acquire the LRU list lock and move to the next page 424 425 426 427 428 429 430 431 438 439 440 441 443 444 445 446 447 448 454 455 456 457 458 460 461 462 463 464 465 466 if (page->buffers) { spin_unlock(&pagemap_lru_lock); /* avoid to free a locked page */ page_cache_get(page); if (try_to_release_page(page, gfp_mask)) { if (!page->mapping) { spin_lock(&pagemap_lru_lock); UnlockPage(page); __lru_cache_del(page); page_cache_release(page); if (--nr_pages) continue; break; } else { page_cache_release(page); spin_lock(&pagemap_lru_lock); } } else { UnlockPage(page); page_cache_release(page); spin_lock(&pagemap_lru_lock); continue; } } Page has buﬀers associated with it that must be freed. 425 Release the LRU lock as we may sleep 428 Take a reference to the page

248

430 Call try_to_release_page() which will attempt to release the buﬀers associated with the page. Returns 1 if it succeeds

7.5. Reclaiming pages from the page cache 431-447 Handle where the release of buﬀers succeeded

249

431-448 If the mapping is not ﬁlled, it is an anonymous page which must be removed from the page cache 438-440 Take the LRU list lock, unlock the page, delete it from the page cache and free it 445-446 Update nr_pages to show a page has been freed and move to the next page 447 If nr_pages drops to 0, then exit the loop as the work is completed 449-456 If the page does have an associated mapping then simply drop the reference to the page and re-acquire the LRU lock 459-464 If the buﬀers could not be freed, then unlock the page, drop the reference to it, re-acquire the LRU lock and move to the next page 467 468 spin_lock(&pagecache_lock); 469 473 if (!page->mapping || !is_page_cache_freeable(page)) { 474 spin_unlock(&pagecache_lock); 475 UnlockPage(page); 476 page_mapped: 477 if (--max_mapped >= 0) 478 continue; 479 484 spin_unlock(&pagemap_lru_lock); 485 swap_out(priority, gfp_mask, classzone); 486 return nr_pages; 487 } 468 From this point on, pages in the swap cache are likely to be examined which is protected by the pagecache_lock which must be now held 473-487 An anonymous page with no buﬀers is mapped by a process 474-475 Release the page cache lock and the page 477-478 Decrement max_mapped. If it has not reached 0, move to the next page 484-485 Too many mapped pages have been found in the page cache. The LRU lock is released and swap_out() is called to begin swapping out whole processes 493 494 495 496 497 if (PageDirty(page)) { spin_unlock(&pagecache_lock); UnlockPage(page); continue; }

7.5. Reclaiming pages from the page cache

250

493-497 The page has no references but could have been dirtied by the last process to free it if the dirty bit was set in the PTE. It is left in the page cache and will get laundered later. Once it has been cleaned, it can be safely deleted 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 /* point of no return */ if (likely(!PageSwapCache(page))) { __remove_inode_page(page); spin_unlock(&pagecache_lock); } else { swp_entry_t swap; swap.val = page->index; __delete_from_swap_cache(page); spin_unlock(&pagecache_lock); swap_free(swap); } __lru_cache_del(page); UnlockPage(page); /* effectively free the page here */ page_cache_release(page); if (--nr_pages) continue; break; }

500-503 If the page does not belong to the swap cache, it is part of the inode queue so it is removed 504-508 Remove it from the swap cache as there is no more references to it 511 Delete it from the page cache 512 Unlock the page 515 Free the page 517-518 Decrement nr_page and move to the next page if it is not 0 519 If it reaches 0, the work of the function is complete 521 522 523 524 } spin_unlock(&pagemap_lru_lock); return nr_pages;

521-524 Function exit. Free the LRU lock and return the number of pages left to free

7.6. Swapping Out Process Pages

251

7.6

Swapping Out Process Pages
swap_out

swap_out_mm

mmput

find_vma

swap_out_vma

swap_out_pgd

swap_out_pmd

try_to_swap_out

Figure 7.2: Call Graph: swap_out Function: swap_out (mm/vmscan.c) This function linearaly searches through every processes page tables trying to swap out SWAP_CLUSTER_MAX number of pages. The process it starts with is the swap_mm and the starting address is mm→swap_address 296 static int swap_out(unsigned int priority, unsigned int gfp_mask, zone_t * classzone) 297 { 298 int counter, nr_pages = SWAP_CLUSTER_MAX; 299 struct mm_struct *mm; 300 301 counter = mmlist_nr; 302 do { 303 if (unlikely(current->need_resched)) { 304 __set_current_state(TASK_RUNNING); 305 schedule(); 306 } 307 308 spin_lock(&mmlist_lock); 309 mm = swap_mm; 310 while (mm->swap_address == TASK_SIZE || mm == &init_mm) { 311 mm->swap_address = 0; 312 mm = list_entry(mm->mmlist.next,

7.6. Swapping Out Process Pages struct mm_struct, mmlist); if (mm == swap_mm) goto empty; swap_mm = mm; } /* Make sure the mm doesn’t disappear when we drop the lock.. */ atomic_inc(&mm->mm_users); spin_unlock(&mmlist_lock);

252

313 314 315 316 317 318

319 320 321 322 nr_pages = swap_out_mm(mm, nr_pages, &counter, classzone); 323 324 mmput(mm); 325 326 if (!nr_pages) 327 return 1; 328 } while (--counter >= 0); 329 330 return 0; 331 332 empty: 333 spin_unlock(&mmlist_lock); 334 return 0; 335 } 301 Set the counter so the process list is only scanned once 303-306 Reschedule if the quanta has been used up to prevent CPU hogging 308 Acquire the lock protecting the mm list 309 Start with the swap_mm. It is interesting this is never checked to make sure it is valid. It is possible, albeit unlikely that the mm has been freed since the last scan and the slab holding the mm_struct released making the pointer totally invalid. The lack of bug reports might be because the slab never managed to get freed up and would be diﬃcult to trigger 310-316 Move to the next process if the swap_address has reached the TASK_SIZE or if the mm is the init_mm 311 Start at the beginning of the process space 312 Get the mm for this process 313-314 If it is the same, there is no running processes that can be examined 315 Record the swap_mm for the next pass

7.6. Swapping Out Process Pages

253

319 Increase the reference count so that the mm does not get freed while we are scanning 320 Release the mm lock 322 Begin scanning the mm with swap_out_mm() 324 Drop the reference to the mm 326-327 If the required number of pages has been freed, return success 328 If we failed on this pass, increase the priority so more processes will be scanned 330 Return failure Function: swap_out_mm (mm/vmscan.c) Walk through each VMA and call swap_out_mm() for each one. 256 static inline int swap_out_mm(struct mm_struct * mm, int count, int * mmcounter, zone_t * classzone) 257 { 258 unsigned long address; 259 struct vm_area_struct* vma; 260 265 spin_lock(&mm->page_table_lock); 266 address = mm->swap_address; 267 if (address == TASK_SIZE || swap_mm != mm) { 268 /* We raced: don’t count this mm but try again */ 269 ++*mmcounter; 270 goto out_unlock; 271 } 272 vma = find_vma(mm, address); 273 if (vma) { 274 if (address < vma->vm_start) 275 address = vma->vm_start; 276 277 for (;;) { 278 count = swap_out_vma(mm, vma, address, count, classzone); 279 vma = vma->vm_next; 280 if (!vma) 281 break; 282 if (!count) 283 goto out_unlock; 284 address = vma->vm_start; 285 } 286 } 287 /* Indicate that we reached the end of address space */ 288 mm->swap_address = TASK_SIZE;

7.6. Swapping Out Process Pages 289 290 out_unlock: 291 spin_unlock(&mm->page_table_lock); 292 return count; 293 } 265 Acquire the page table lock for this mm 266 Start with the address contained in swap_address

254

267-271 If the address is TASK_SIZE, it means that a thread raced and scanned this process already. Increase mmcounter so that swap_out_mm() knows to go to another process 272 Find the VMA for this address 273 Presuming a VMA was found then .... 274-275 Start at the beginning of the VMA 277-285 Scan through this and each subsequent VMA calling swap_out_vma() for each one. If the requisite number of pages (count) is freed, then ﬁnish scanning and return 288 Once the last VMA has been scanned, set swap_address to TASK_SIZE so that this process will be skipped over by swap_out_mm() next time Function: swap_out_vma (mm/vmscan.c) Walk through this VMA and for each PGD in it, call swap_out_pgd(). 227 static inline int swap_out_vma(struct mm_struct * mm, struct vm_area_struct * vma, unsigned long address, int count, zone_t * classzone) 228 { 229 pgd_t *pgdir; 230 unsigned long end; 231 232 /* Don’t swap out areas which are reserved */ 233 if (vma->vm_flags & VM_RESERVED) 234 return count; 235 236 pgdir = pgd_offset(mm, address); 237 238 end = vma->vm_end; 239 BUG_ON(address >= end); 240 do { 241 count = swap_out_pgd(mm, vma, pgdir, address, end, count, classzone); 242 if (!count)

7.6. Swapping Out Process Pages 243 244 245 246 247 248 } break; address = (address + PGDIR_SIZE) & PGDIR_MASK; pgdir++; } while (address && (address < end)); return count;

255

233-234 Skip over this VMA if the VM_RESERVED ﬂag is set. This is used by some device drivers such as the SCSI generic driver 236 Get the starting PGD for the address 238 Mark where the end is and BUG() it if the starting address is somehow past the end 240 Cycle through PGDs until the end address is reached 241 Call swap_out_pgd() keeping count of how many more pages need to be freed 242-243 If enough pages have been freed, break and return 244-245 Move to the next PGD and move the address to the next PGD aligned address 247 Return the remaining number of pages to be freed Function: swap_out_pgd (mm/vmscan.c) Step through all PMD’s in the supplied PGD and call swap_out_pmd() 197 static inline int swap_out_pgd(struct mm_struct * mm, struct vm_area_struct * vma, pgd_t *dir, unsigned long address, unsigned long end, int count, zone_t * classzone) 198 { 199 pmd_t * pmd; 200 unsigned long pgd_end; 201 202 if (pgd_none(*dir)) 203 return count; 204 if (pgd_bad(*dir)) { 205 pgd_ERROR(*dir); 206 pgd_clear(dir); 207 return count; 208 } 209 210 pmd = pmd_offset(dir, address); 211 212 pgd_end = (address + PGDIR_SIZE) & PGDIR_MASK; 213 if (pgd_end && (end > pgd_end)) 214 end = pgd_end;

7.6. Swapping Out Process Pages 215 216 do { 217 count = swap_out_pmd(mm, vma, pmd, address, end, count, classzone); 218 if (!count) 219 break; 220 address = (address + PMD_SIZE) & PMD_MASK; 221 pmd++; 222 } while (address && (address < end)); 223 return count; 224 } 202-203 If there is no PGD, return 204-208 If the PGD is bad, ﬂag it as such and return 210 Get the starting PMD

256

212-214 Calculate the end to be the end of this PGD or the end of the VMA been scanned, whichever is closer 216-222 For each PMD in this PGD, call swap_out_pmd(). If enough pages get freed, break and return 223 Return the number of pages remaining to be freed Function: swap_out_pmd (mm/vmscan.c) For each PTE in this PMD, call try_to_swap_out(). On completion, mm→swap_address is updated to show where we ﬁnished to prevent the same page been examined soon after this scan. 158 static inline int swap_out_pmd(struct mm_struct * mm, struct vm_area_struct * vma, pmd_t *dir, unsigned long address, unsigned long end, int count, zone_t * classzone) 159 { 160 pte_t * pte; 161 unsigned long pmd_end; 162 163 if (pmd_none(*dir)) 164 return count; 165 if (pmd_bad(*dir)) { 166 pmd_ERROR(*dir); 167 pmd_clear(dir); 168 return count; 169 } 170 171 pte = pte_offset(dir, address);

7.6. Swapping Out Process Pages 172 173 174 175 176 177 178 179 180 181 182

257

pmd_end = (address + PMD_SIZE) & PMD_MASK; if (end > pmd_end) end = pmd_end; do { if (pte_present(*pte)) { struct page *page = pte_page(*pte); if (VALID_PAGE(page) && !PageReserved(page)) { count -= try_to_swap_out(mm, vma, address, pte, page, classzone); if (!count) { address += PAGE_SIZE; break; } } } address += PAGE_SIZE; pte++; } while (address && (address < end)); mm->swap_address = address; return count;

183 184 185 186 187 188 189 190 191 192 193 194 }

163-164 Return if there is no PMD 165-169 If the PMD is bad, ﬂag it as such and return 171 Get the starting PTE 173-175 Calculate the end to be the end of the PMD or the end of the VMA, whichever is closer 177-191 Cycle through each PTE 178 Make sure the PTE is marked present 179 Get the struct page for this PTE 181 If it is a valid page and it is not reserved then ... 182 Call try_to_swap_out() 183-186 If enough pages have been swapped out, move the address to the next page and break to return 189-190 Move to the next page and PTE

7.6. Swapping Out Process Pages 192 Update the swap_address to show where we last ﬁnished oﬀ 193 Return the number of pages remaining to be freed

258

Function: try_to_swap_out (mm/vmscan.c) This function tries to swap out a page from a process. It is quite a large function so will be dealt with in parts. Broadly speaking they are • Function preamble, ensure this is a page that should be swapped out • Remove the page and PTE from the page tables • Handle the case where the page is already in the swap cache • Handle the case where the page is dirty or has associated buﬀers • Handle the case where the page is been added to the swap cache 47 static inline int try_to_swap_out(struct mm_struct * mm, struct vm_area_struct* vma, unsigned long address, pte_t * page_table, struct page *page, zone_t * classzone) 48 { 49 pte_t pte; 50 swp_entry_t entry; 51 52 /* Don’t look at this pte if it’s been accessed recently. */ 53 if ((vma->vm_flags & VM_LOCKED) || ptep_test_and_clear_young(page_table)) { 54 mark_page_accessed(page); 55 return 0; 56 } 57 58 /* Don’t bother unmapping pages that are active */ 59 if (PageActive(page)) 60 return 0; 61 62 /* Don’t bother replenishing zones not under pressure.. */ 63 if (!memclass(page_zone(page), classzone)) 64 return 0; 65 66 if (TryLockPage(page)) 67 return 0; 53-56 If the page is locked (for tasks like IO) or the PTE shows the page has been accessed recently then clear the referenced bit and call mark_page_accessed() to make the struct page reﬂect the age. Return 0 to show it was not swapped out

7.6. Swapping Out Process Pages 59-60 If the page is on the active_list, do not swap it out 63-64 If the page belongs to a zone we are not interested in, do not swap it out 66-67 If the page could not be locked, do not swap it out 74 75 76 77 78 79 80 flush_cache_page(vma, address); pte = ptep_get_and_clear(page_table); flush_tlb_page(vma, address); if (pte_dirty(pte)) set_page_dirty(page);

259

74 Call the architecture hook to ﬂush this page from all CPU’s 75 Get the PTE from the page tables and clear it 76 Call the architecture hook to ﬂush the TLB 78-79 If the PTE was marked dirty, mark the struct page dirty so it will be laundered correctly 86 if (PageSwapCache(page)) { 87 entry.val = page->index; 88 swap_duplicate(entry); 89 set_swap_pte: 90 set_pte(page_table, swp_entry_to_pte(entry)); 91 drop_pte: 92 mm->rss--; 93 UnlockPage(page); 94 { 95 int freeable = page_count(page) - !!page->buffers <= 2; 96 page_cache_release(page); 97 return freeable; 98 } 99 } Handle the case where the page is already in the swap cache 87-88 Fill in the index value for the swap entry. swap_duplicate() veriﬁes the swap identiﬁer is valid and increases the counter in the swap_map if it is 90 Fill the PTE with information needed to get the page from swap 92 Update RSS to show there is one less page 93 Unlock the page

7.6. Swapping Out Process Pages 95 The page is free-able if the count is currently 2 or less and has no buﬀers 96 Decrement the reference count and free the page if it reaches 0 97 Return if the page was freed or not 115 116 117 118 124 125 if (page->mapping) goto drop_pte; if (!PageDirty(page)) goto drop_pte; if (page->buffers) goto preserve;

260

115-116 If the page has an associated mapping, simply drop it and it will be caught during another scan of the page cache later 117-118 If the page is clean, it is safe to simply drop it 124-125 If it has associated buﬀers due to a truncate followed by a page fault, then re-attach the page and PTE to the page tables as it can’t be handled yet 126 127 /* 128 * This is a dirty, swappable page. First of all, 129 * get a suitable swap entry for it, and make sure 130 * we have the swap cache set up to associate the 131 * page with that swap entry. 132 */ 133 for (;;) { 134 entry = get_swap_page(); 135 if (!entry.val) 136 break; 137 /* Add it to the swap cache and mark it dirty 138 * (adding to the page cache will clear the dirty 139 * and uptodate bits, so we need to do it again) 140 */ 141 if (add_to_swap_cache(page, entry) == 0) { 142 SetPageUptodate(page); 143 set_page_dirty(page); 144 goto set_swap_pte; 145 } 146 /* Raced with "speculative" read_swap_cache_async */ 147 swap_free(entry); 148 } 149 150 /* No swap space left */ 151 preserve: 152 set_pte(page_table, pte);

7.6. Swapping Out Process Pages 153 154 155 } UnlockPage(page); return 0;

261

134 Allocate a swap entry for this page 135-136 If one could not be allocated, break out where the PTE and page will be reattached to the process page tables 141 Add the page to the swap cache 142 Mark the page as up to date in memory 143 Mark the page dirty so that it will be written out to swap soon 144 Goto set_swap_pte which will update the PTE with information needed to get the page from swap later 147 If the add to swap cache failed, it means that the page was placed in the swap cache already by a readahead so drop the work done here 152 Reattach the PTE to the page tables 153 Unlock the page 154 Return that no page was freed

Chapter 8 Swap Management
8.1 8.2 Describing the Swap Area Scanning for free entries

get_swap_page

scan_swap_map

Figure 8.1: Call Graph: get_swap_page() Function: get_swap_page (mm/swapﬁle.c) This is the high level API function for getting a slot in swap space. 99 swp_entry_t get_swap_page(void) 100 { 101 struct swap_info_struct * p; 102 unsigned long offset; 103 swp_entry_t entry; 104 int type, wrapped = 0; 105 106 entry.val = 0; /* Out of memory */ 107 swap_list_lock(); 108 type = swap_list.next; 109 if (type < 0) 110 goto out; 111 if (nr_swap_pages <= 0) 262

8.2. Scanning for free entries 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 out: 143 144 145 } goto out; while (1) { p = &swap_info[type]; if ((p->flags & SWP_WRITEOK) == SWP_WRITEOK) { swap_device_lock(p); offset = scan_swap_map(p); swap_device_unlock(p); if (offset) { entry = SWP_ENTRY(type,offset); type = swap_info[type].next; if (type < 0 || p->prio != swap_info[type].prio) { swap_list.next = swap_list.head; } else { swap_list.next = type; } goto out; } } type = p->next; if (!wrapped) { if (type < 0 || p->prio != swap_info[type].prio) { type = swap_list.head; wrapped = 1; } } else if (type < 0) goto out; /* out of swap space */ } swap_list_unlock(); return entry;

263

107 Lock the list of swap pages 108 Get the next swap area that is to be used for allocating from 109-110 If there is no swap areas, return NULL 111-112 If the accounting says there is no swap pages, return NULL 114-141 Cycle through all swap areas 115 Get the swap info struct 116 If this swap area is available for writing to and is active...

8.2. Scanning for free entries 117 Lock the swap area 118 Call scan_swap_map() which searches for a free slot 119 Unlock the swap device 120-130 If a slot was free... 121 Encode an identiﬁer for the entry with SWP_ENTRY() 122 Record the next swap area to use

264

123-126 If the next area is the end of the list or the priority of the next swap area does not match the current one, move back to the head 126-128 Otherwise move to the next area 129 Goto out 132 Move to the next swap area 133-138 Check for wrapaound. Set wrapped to 1 if we get to the end of the list of swap areas 139-140 If there was no available swap areas, goto out 142 The exit to this function 143 Unlock the swap area list 144 Return the entry if one was found and NULL otherwise Function: scan_swap_map (mm/swapﬁle.c) This function tries to allocate SWAPFILE_CLUSTER number of pages sequentially in swap. When it has allocated that many, it searches for another block of free slots of size SWAPFILE_CLUSTER. If it fails to ﬁnd one, it resorts to allocating the ﬁrst free slot. 36 static inline int scan_swap_map(struct swap_info_struct *si) 37 { 38 unsigned long offset; 47 if (si->cluster_nr) { 48 while (si->cluster_next <= si->highest_bit) { 49 offset = si->cluster_next++; 50 if (si->swap_map[offset]) 51 continue; 52 si->cluster_nr--; 53 goto got_page; 54 } 55 }

8.2. Scanning for free entries

265

Allocate SWAPFILE_CLUSTER pages sequentially. cluster_nr is initialised to SWAPFILE_CLUTER and decrements with each allocation 47 If cluster_nr is still postive, allocate the next available sequential slot 48 While the current oﬀset to use (cluster_next) is less then the highest known free slot (highest_bit) then ... 49 Record the oﬀset and update cluster_next to the next free slot 50-51 If the slot is not actually free, move to the next one 52 Slot has been found, decrement the cluster_nr ﬁeld 53 Goto the out path 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 si->cluster_nr = SWAPFILE_CLUSTER; /* try to find an empty (even not aligned) cluster. */ offset = si->lowest_bit; check_next_cluster: if (offset+SWAPFILE_CLUSTER-1 <= si->highest_bit) { int nr; for (nr = offset; nr < offset+SWAPFILE_CLUSTER; nr++) if (si->swap_map[nr]) { offset = nr+1; goto check_next_cluster; } /* We found a completly empty cluster, so start * using it. */ goto got_page; }

At this stage, SWAPFILE_CLUSTER pages have been allocated sequentially so ﬁnd the next free block of SWAPFILE_CLUSTER pages. 56 Re-initialise the count of sequential pages to allocate to SWAPFILE_CLUSTER 59 Starting searching at the lowest known free slot 61 If the oﬀset plus the cluster size is less than the known last free slot, then examine all the pages to see if this is a large free block 64 Scan from offset to offset + SWAPFILE_CLUSTER 65-69 If this slot is used, then start searching again for a free slot beginning after this known alloated one

8.2. Scanning for free entries 73 A large cluster was found so use it 75 76 77 78 79 /* No luck, so now go finegrined as usual. -Andrea */ for (offset = si->lowest_bit; offset <= si->highest_bit ; offset++) { if (si->swap_map[offset]) continue; si->lowest_bit = offset+1;

266

This unusual for loop extract starts scanning for a free page starting from lowest_bit 77-78 If the slot is in use, move to the next one 79 Update the lowest_bit known probable free slot to the succeeding one 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 } got_page: if (offset == si->lowest_bit) si->lowest_bit++; if (offset == si->highest_bit) si->highest_bit--; if (si->lowest_bit > si->highest_bit) { si->lowest_bit = si->max; si->highest_bit = 0; } si->swap_map[offset] = 1; nr_swap_pages--; si->cluster_next = offset+1; return offset; } si->lowest_bit = si->max; si->highest_bit = 0; return 0;

A slot has been found, do some housekeeping and return it 81-82 If this oﬀset is the known lowest free slot(lowest_bit), increment it 83-84 If this oﬀset is the highest known likely free slot, decrement it 85-88 If the low and high mark meet, the swap area is not worth searching any more so set the low slot to be the highest possible slot and the high mark to 0 to cut down on search time later. This will be ﬁxed up by the next free 89 Set the reference count for the slot 90 Update the accounting for the number of available swap pages (nr_swap_pages) 91 Set cluster_next to the adjacent slot so the next search will start here 92 Return the free slot 94-96 No free slot available, mark the area unsearchable and return 0

8.3. Swap Cache

267

8.3

Swap Cache

Function: add_to_swap_cache (mm/swap_state.c) This function wraps around the normal page cache handler. It ﬁrst checks if the page is already in the swap cache with swap_duplicate() and if it does not, it calls add_to_page_cache_unique() instead. 70 int add_to_swap_cache(struct page *page, swp_entry_t entry) 71 { 72 if (page->mapping) 73 BUG(); 74 if (!swap_duplicate(entry)) { 75 INC_CACHE_INFO(noent_race); 76 return -ENOENT; 77 } 78 if (add_to_page_cache_unique(page, &swapper_space, entry.val, 79 page_hash(&swapper_space, entry.val)) != 0) { 80 swap_free(entry); 81 INC_CACHE_INFO(exist_race); 82 return -EEXIST; 83 } 84 if (!PageLocked(page)) 85 BUG(); 86 if (!PageSwapCache(page)) 87 BUG(); 88 INC_CACHE_INFO(add_total); 89 return 0; 90 } 72-73 A check is made with PageSwapCache() before this function is called which ensures the page has no existing mapping. If code is calling this function directly, it should have ensured no existing mapping existed 74-77 Try an increment the count for this entry with swap_duplicate(). If a slot already exists in the swap_map, increment the statistic recording the number of races involving adding pages to the swap cache and return ENOENT 78 Try and add the page to the page cache with add_to_page_cache_unique(). This function is similar to add_to_page_cache() except it searches the page cache for a duplicate entry with __find_page_nolock(). The managing address space is swapper_space. The “oﬀset within the ﬁle” in this case is the oﬀset within swap_map, hence entry.val and ﬁnally the page is hashed based on address_space and oﬀset within swap_map 80-83 If it already existed in the page cache, we raced so increment the statistic recording the number of races to insert an existing page into the swap cache and return EEXIST 84-85 If the page is locked for IO, it is a bug

8.3. Swap Cache 86-87 If it is not now in the swap cache, something went seriously wrong 88 Increment the statistic recording the total number of pages in the swap cache 89 Return success

268

Function: swap_duplicate (mm/swapﬁle.c) This function veriﬁes a swap entry is valid and if so, increments its swap map count. 1143 int swap_duplicate(swp_entry_t entry) 1144 { 1145 struct swap_info_struct * p; 1146 unsigned long offset, type; 1147 int result = 0; 1148 1149 type = SWP_TYPE(entry); 1150 if (type >= nr_swapfiles) 1151 goto bad_file; 1152 p = type + swap_info; 1153 offset = SWP_OFFSET(entry); 1154 1155 swap_device_lock(p); 1156 if (offset < p->max && p->swap_map[offset]) { 1157 if (p->swap_map[offset] < SWAP_MAP_MAX - 1) { 1158 p->swap_map[offset]++; 1159 result = 1; 1160 } else if (p->swap_map[offset] <= SWAP_MAP_MAX) { 1161 if (swap_overflow++ < 5) 1162 printk(KERN_WARNING "swap_dup: swap entry overflow\n"); 1163 p->swap_map[offset] = SWAP_MAP_MAX; 1164 result = 1; 1165 } 1166 } 1167 swap_device_unlock(p); 1168 out: 1169 return result; 1170 1171 bad_file: 1172 printk(KERN_ERR "swap_dup: %s%08lx\n", Bad_file, entry.val); 1173 goto out; 1174 } 1143 The parameter is the swap entry to increase the swap_map count for 1149-1151 Get the oﬀset within the swap_info for the swap_info_struct containing this entry. If it is greater than the number of swap areas, goto bad_file

8.3. Swap Cache 1152-1153 Get the relevant swap_info_struct and get the oﬀset within its swap_map 1155 Lock the swap device

269

1156 Make a quick sanity check to ensure the oﬀset is within the swap_map and that the slot indicated has a positive count. A 0 count would mean the slot is not free and this is a bogus swp_entry_t 1157-1159 If the count is not SWAP_MAP_MAX, simply increment it and return 1 for success 1160-1165 Else the count would overﬂow so set it to SWAP_MAP_MAX and reserve the slot permanently. In reality this condition is virtually impossible 1167-1169 Unlock the swap device and return 1172-1173 If a bad device was used, print out the error message and return failure Function: swap_free (mm/swapﬁle.c) Decrements the corresponding swap_map entry for the swp_entry_t 214 void swap_free(swp_entry_t entry) 215 { 216 struct swap_info_struct * p; 217 218 p = swap_info_get(entry); 219 if (p) { 220 swap_entry_free(p, SWP_OFFSET(entry)); 221 swap_info_put(p); 222 } 223 } 218 swap_info_get() fetches the correct swap_info_struct and performs a number of debugging checks to ensure it is a valid area and a valid swap_map entry. If all is sane, it will lock the swap device 219-222 If it is valid, the corresponding swap_map entry is decremented with swap_entry_free() and swap_info_put called to free the device Function: swap_entry_free (mm/swapﬁle.c) 192 static int swap_entry_free(struct swap_info_struct *p, unsigned long offset) 193 { 194 int count = p->swap_map[offset]; 195 196 if (count < SWAP_MAP_MAX) { 197 count--; 198 p->swap_map[offset] = count; 199 if (!count) {

8.3. Swap Cache 200 201 202 203 204 205 206 207 208 } if (offset < p->lowest_bit) p->lowest_bit = offset; if (offset > p->highest_bit) p->highest_bit = offset; nr_swap_pages++; } } return count;

270

194 Get the current count 196 If the count indicates the slot is not permanently reserved then.. 197-198 Decrement the count and store it in the swap_map 199 If the count reaches 0, the slot is free so update some information 200-201 If this freed slot is below lowest_bit, update lowest_bit which indicates the lowest known free slot 202-203 Similarly, update the highest_bit if this newly freed slot is above it 204 Increment the count indicating the number of free swap slots 207 Return the current count Function: swap_info_get (mm/swapﬁle.c) This function ﬁnds the swap_info_struct for the given entry, performs some basic checking and then locks the device. 147 static struct swap_info_struct * swap_info_get(swp_entry_t entry) 148 { 149 struct swap_info_struct * p; 150 unsigned long offset, type; 151 152 if (!entry.val) 153 goto out; 154 type = SWP_TYPE(entry); 155 if (type >= nr_swapfiles) 156 goto bad_nofile; 157 p = & swap_info[type]; 158 if (!(p->flags & SWP_USED)) 159 goto bad_device; 160 offset = SWP_OFFSET(entry); 161 if (offset >= p->max) 162 goto bad_offset; 163 if (!p->swap_map[offset])

8.3. Swap Cache 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 goto bad_free; swap_list_lock(); if (p->prio > swap_info[swap_list.next].prio) swap_list.next = type; swap_device_lock(p); return p; bad_free: printk(KERN_ERR goto out; bad_offset: printk(KERN_ERR goto out; bad_device: printk(KERN_ERR goto out; bad_nofile: printk(KERN_ERR out: return NULL; }

271

"swap_free: %s%08lx\n", Unused_offset, entry.val);

"swap_free: %s%08lx\n", Bad_offset, entry.val);

"swap_free: %s%08lx\n", Unused_file, entry.val);

"swap_free: %s%08lx\n", Bad_file, entry.val);

152-153 If the supplied entry is NULL, return 154 Get the oﬀset within the swap_info array 155-156 Ensure it is a valid area 157 Get the address of the area 158-159 If the area is not active yet, print a bad device error and return 160 Get the oﬀset within the swap_map 161-162 Make sure the oﬀset is not after the end of the map 163-164 Make sure the slot is currently in use 165 Lock the swap area list 166-167 If this area is of higher priority than the area that would be next, ensure the current area is used 168-169 Lock the swap device and return the swap area descriptor

8.3. Swap Cache Function: swap_info_put (mm/swapﬁle.c) This function simply unlocks the area and list 186 static void swap_info_put(struct swap_info_struct * p) 187 { 188 swap_device_unlock(p); 189 swap_list_unlock(); 190 } 188 Unlock the device 189 Unlock the swap area list Function: lookup_swap_cache (mm/swap_state.c) Top level function for ﬁnding a page in the swap cache 161 struct page * lookup_swap_cache(swp_entry_t entry) 162 { 163 struct page *found; 164 165 found = find_get_page(&swapper_space, entry.val); 166 /* 167 * Unsafe to assert PageSwapCache and mapping on page found: 168 * if SMP nothing prevents swapoff from deleting this page from 169 * the swap cache at this moment. find_lock_page would prevent 170 * that, but no need to change: we _have_ got the right page. 171 */ 172 INC_CACHE_INFO(find_total); 173 if (found) 174 INC_CACHE_INFO(find_success); 175 return found; 176 }

272

165 find_get_page() is the principle function for returning the struct page. It uses the normal page hashing and cache functions for quickly ﬁnding it 172 Increase the statistic recording the number of times a page was searched for in the cache 173-174 If one was found, increment the successful ﬁnd count 175 Return the struct page or NULL if it did not exist Function: ﬁnd_get_page (include/linux/pagemap.h) Top level macro for ﬁnding a page in the page cache. It simply looks up the page hash 75 #define find_get_page(mapping, index) \ 76 __find_get_page(mapping, index, page_hash(mapping, index)) 76 page_hash() locates an entry in the page_hash_table based on the address_space and oﬀset

8.3. Swap Cache

273

Function: __ﬁnd_get_page (mm/ﬁlemap.c) This function is responsible for ﬁnding a struct page given an entry in page_hash_table as a starting point. 915 struct page * __find_get_page(struct address_space *mapping, 916 unsigned long offset, struct page **hash) 917 { 918 struct page *page; 919 920 /* 921 * We scan the hash list read-only. Addition to and removal from 922 * the hash-list needs a held write-lock. 923 */ 924 spin_lock(&pagecache_lock); 925 page = __find_page_nolock(mapping, offset, *hash); 926 if (page) 927 page_cache_get(page); 928 spin_unlock(&pagecache_lock); 929 return page; 930 } 924 Acquire the read-only page cache lock 925 Call the page cache traversal function which presumes a lock is held 926-927 If the page was found, obtain a reference to it with page_cache_get() so it is not freed prematurely 928 Release the page cache lock 929 Return the page or NULL if not found Function: __ﬁnd_page_nolock (mm/ﬁlemap.c) This function traverses the hash collision list looking for the page speciﬁed by the address_space and offset. 441 static inline struct page * __find_page_nolock( struct address_space *mapping, unsigned long offset, struct page *page) 442 { 443 goto inside; 444 445 for (;;) { 446 page = page->next_hash; 447 inside: 448 if (!page) 449 goto not_found;

8.4. Activating a Swap Area 450 if (page->mapping != mapping) 451 continue; 452 if (page->index == offset) 453 break; 454 } 455 456 not_found: 457 return page; 458 } 443 Begin by examining the ﬁrst page in the list 448-449 If the page is NULL, the right one could not be found so return NULL 450 If the address_space does not match, move to the next page on the collision list 452 If the offset matchs, return it, else move on 446 Move to the next page on the hash list 457 Return the found page or NULL if not

274

8.4

Activating a Swap Area

Function: sys_swapon (mm/swapﬁle.c) This, quite large, function is responsible for the activating of swap space. Broadly speaking the tasks is takes are as follows; • Find a free swap_info_struct in the swap_info array an initialise it with default values • Call user_path_walk() which traverses the directory tree for the supplied specialfile and populates a namidata structure with the available data on the ﬁle, such as the dentry and the ﬁlesystem information for where it is stored (vfsmount) • Populate swap_info_struct ﬁelds pertaining to the dimensions of the swap area and how to ﬁnd it. If the swap area is a partition, the block size will be conﬁgured to the PAGE_SIZE before calculating the size. If it is a ﬁle, the information is obtained directly from the inode • Ensure the area is not already activated. If not, allocate a page from memory and read the ﬁrst page sized slot from the swap area. This page contains information such as the number of good slots and how to populate the swap_info_struct→swap_map with the bad entries • Allocate memory with vmalloc() for swap_info_struct→swap_map and initialise each entry with 0 for good slots and SWAP_MAP_BAD otherwise. Ideally the header

8.4. Activating a Swap Area

275

information will be a version 2 ﬁle format as version 1 was limited to swap areas of just under 128MiB for architectures with 4KiB page sizes like the x861 • After ensuring the information indicated in the header matches the actual swap area, ﬁll in the remaining information in the swap_info_struct such as the maximum number of pages and the available good pages. Update the global statistics for nr_swap_pages and total_swap_pages • The swap area is now fully active and initialised and so it is inserted into the swap list in the correct position based on priority of the newly activated area 855 asmlinkage long sys_swapon(const char * specialfile, int swap_flags) 856 { 857 struct swap_info_struct * p; 858 struct nameidata nd; 859 struct inode * swap_inode; 860 unsigned int type; 861 int i, j, prev; 862 int error; 863 static int least_priority = 0; 864 union swap_header *swap_header = 0; 865 int swap_header_version; 866 int nr_good_pages = 0; 867 unsigned long maxpages = 1; 868 int swapfilesize; 869 struct block_device *bdev = NULL; 870 unsigned short *swap_map; 871 872 if (!capable(CAP_SYS_ADMIN)) 873 return -EPERM; 874 lock_kernel(); 875 swap_list_lock(); 876 p = swap_info; 855 The two parameters are the path to the swap area and the ﬂags for activation 872-873 The activating process must have the CAP_SYS_ADMIN capability or be the superuser to activate a swap area 874 Acquire the Big Kernel Lock 875 Lock the list of swap areas 876 Get the ﬁrst swap area in the swap_info array
1

See the Code Commentary for the comprehensive reason for this

8.4. Activating a Swap Area 877 for (type = 0 ; type < nr_swapfiles ; type++,p++) 878 if (!(p->flags & SWP_USED)) 879 break; 880 error = -EPERM; 881 if (type >= MAX_SWAPFILES) { 882 swap_list_unlock(); 883 goto out; 884 } 885 if (type >= nr_swapfiles) 886 nr_swapfiles = type+1; 887 p->flags = SWP_USED; 888 p->swap_file = NULL; 889 p->swap_vfsmnt = NULL; 890 p->swap_device = 0; 891 p->swap_map = NULL; 892 p->lowest_bit = 0; 893 p->highest_bit = 0; 894 p->cluster_nr = 0; 895 p->sdev_lock = SPIN_LOCK_UNLOCKED; 896 p->next = -1; 897 if (swap_flags & SWAP_FLAG_PREFER) { 898 p->prio = 899 (swap_flags & SWAP_FLAG_PRIO_MASK)>>SWAP_FLAG_PRIO_SHIFT; 900 } else { 901 p->prio = --least_priority; 902 } 903 swap_list_unlock(); Find a free swap_info_struct and initialise it with default values 877-879 Cycle through the swap_info until a struct is found that is not in use

276

880 By default the error returned is Permission Denied which indicates the caller did not have the proper permissions or too many swap areas are already in use 881 If no struct was free, MAX_SWAPFILE areas have already been activated so unlock the swap list and return 885-886 If the selected swap area is after the last known active area (nr_swapfiles), then update nr_swapfiles 887 Set the ﬂag indicating the area is in use 888-896 Initialise ﬁelds to default values 897-902 If the caller has speciﬁed a priority, use it else set it to least_priority and decrement it. This way, the swap areas will be prioritised in order of activation

8.4. Activating a Swap Area 903 Release the swap list lock 904 905 906 907 908 909 910 911 912 error = user_path_walk(specialfile, &nd); if (error) goto bad_swap_2; p->swap_file = nd.dentry; p->swap_vfsmnt = nd.mnt; swap_inode = nd.dentry->d_inode; error = -EINVAL;

277

Traverse the VFS and get some information about the special ﬁle 904 user_path_walk() traverses the directory structure to obtain a nameidata structure describing the specialfile 905-906 If it failed, return failure 908 Fill in the swap_file ﬁeld with the returned dentry 909 Similarily, ﬁll in the swap_vfsmnt 910 Record the inode of the special ﬁle 911 Now the default error is EINVAL indicating that the special ﬁle was found but it was not a block device or a regular ﬁle 913 914 915 916 917 918 919 920 921 922 923 924 925 926 927 928 safe*/ 929 930 if (S_ISBLK(swap_inode->i_mode)) { kdev_t dev = swap_inode->i_rdev; struct block_device_operations *bdops; devfs_handle_t de; p->swap_device = dev; set_blocksize(dev, PAGE_SIZE); bd_acquire(swap_inode); bdev = swap_inode->i_bdev; de = devfs_get_handle_from_inode(swap_inode); bdops = devfs_get_ops(de); if (bdops) bdev->bd_op = bdops; error = blkdev_get(bdev, FMODE_READ|FMODE_WRITE, 0, BDEV_SWAP); devfs_put_ops(de);/*Decrement module use count now we’re if (error) goto bad_swap_2;

8.4. Activating a Swap Area 931 932 933 934 935 936 937 938 939 940 941 942 943 set_blocksize(dev, PAGE_SIZE); error = -ENODEV; if (!dev || (blk_size[MAJOR(dev)] && !blk_size[MAJOR(dev)][MINOR(dev)])) goto bad_swap; swapfilesize = 0; if (blk_size[MAJOR(dev)]) swapfilesize = blk_size[MAJOR(dev)][MINOR(dev)] >> (PAGE_SHIFT - 10); } else if (S_ISREG(swap_inode->i_mode)) swapfilesize = swap_inode->i_size >> PAGE_SHIFT; else goto bad_swap;

278

If a partition, conﬁgure the block device before calculating the size of the area, else obtain it from the inode for the ﬁle. 913 Check if the special ﬁle is a block device 914-939 This code segment handles the case where the swap area is a partition 914 Record a pointer to the device structure for the block device 918 Store a pointer to the device structure describing the special ﬁle which will be needed for block IO operations 919 Set the block size on the device to be PAGE_SIZE as it will be page sized chunks swap is interested in 921 The bd_acquire() function increments the usage count for this block device 922 Get a pointer to the block_device structure which is a descriptor for the device ﬁle which is needed to open it 923 Get a devfs handle if devfs is enabled. devfs is beyond the scope of this document 924-925 Increment the usage count of this device entry 927 Open the block device in read/write mode and set the BDEV_SWAP ﬂag which is an enumerated type but is ignored when do_open() is called 928 Decrement the use count of the devfs entry 929-930 If an error occured on open, return failure 931 Set the block size again 932 After this point, the default error is to indicate no device could be found 933-935 Ensure the returned device is ok

8.4. Activating a Swap Area

279

937-939 Calculate the size of the swap ﬁle as the number of page sized chunks that exist in the block device as indicated by blk_size. The size of the swap area is calculated to make sure the information in the swap area is sane 941 If the swap area is a regular ﬁle, obtain the size directly from the inode and calculate how many page sized chunks exist 943 If the ﬁle is not a block device or regular ﬁle, return error 945 946 947 948 949 950 951 952 953 954 955 956 957 958 959 960 961 962 963 964 965 966 967 968 969 970 971 972 error = -EBUSY; for (i = 0 ; i < nr_swapfiles ; i++) { struct swap_info_struct *q = &swap_info[i]; if (i == type || !q->swap_file) continue; if (swap_inode->i_mapping == q->swap_file->d_inode->i_mapping) goto bad_swap; } swap_header = (void *) __get_free_page(GFP_USER); if (!swap_header) { printk("Unable to start swapping: out of memory :-)\n"); error = -ENOMEM; goto bad_swap; } lock_page(virt_to_page(swap_header)); rw_swap_page_nolock(READ, SWP_ENTRY(type,0), (char *) swap_header); if (!memcmp("SWAP-SPACE",swap_header->magic.magic,10)) swap_header_version = 1; else if (!memcmp("SWAPSPACE2",swap_header->magic.magic,10)) swap_header_version = 2; else { printk("Unable to find swap-space signature\n"); error = -EINVAL; goto bad_swap; }

945 The next check makes sure the area is not already active. If it is, the error EBUSY will be returned 946-962 Read through the while swap_info struct and ensure the area to be activated is not already active 954-959 Allocate a page for reading the swap area information from disk

8.4. Activating a Swap Area

280

961 The function lock_page() locks a page and makes sure it is synced with disk if it is ﬁle backed. In this case, it’ll just mark the page as locked which is required for the rw_swap_page_nolock() function 962 Read the ﬁrst page slot in the swap area into swap_header 964-672 Decide which version the swap area information is and set the swap_header_version variable with it. If the swap area could not be identiﬁed, return EINVAL 974 975 976 977 978 979 980 981 982 983 984 985 986 987 988 989 990 991 992 993 994 995 996 997 998 999 1000 1001 1002 1. 976 Zero out the magic string identiﬁng the version of the swap area 978-979 Initialise ﬁelds in swap_info_struct to 0 980-988 A bitmap with 8*PAGE_SIZE entries is stored in the swap area. The full page, minus 10 bits for the magic string, is used to describe the swap map limiting swap switch (swap_header_version) { case 1: memset(((char *) swap_header)+PAGE_SIZE-10,0,10); j = 0; p->lowest_bit = 0; p->highest_bit = 0; for (i = 1 ; i < 8*PAGE_SIZE ; i++) { if (test_bit(i,(char *) swap_header)) { if (!p->lowest_bit) p->lowest_bit = i; p->highest_bit = i; maxpages = i+1; j++; } } nr_good_pages = j; p->swap_map = vmalloc(maxpages * sizeof(short)); if (!p->swap_map) { error = -ENOMEM; goto bad_swap; } for (i = 1 ; i < maxpages ; i++) { if (test_bit(i,(char *) swap_header)) p->swap_map[i] = 0; else p->swap_map[i] = SWAP_MAP_BAD; } break; Read in the information needed to populate the swap_map when the swap area is version

8.4. Activating a Swap Area

281

areas to just under 128MiB in size. If the bit is set to 1, there is a slot on disk available. This pass will calculate how many slots are available so a swap_map may be allocated 981 Test if the bit for this slot is set 982-983 If the lowest_bit ﬁeld is not yet set, set it to this slot. In most cases, lowest_bit will be initialised to 1 984 As long as new slots are found, keep updating the highest_bit 985 Count the number of pages 986 j is the count of good pages in the area 990 Allocate memory for the swap_map with vmalloc() 991-994 If memory could not be allocated, return ENOMEM 995-1000 For each slot, check if the slot is “good”. If yes, initialise the slot count to 0, else set it to SWAP_MAP_BAD so it will not be used 1001 Exit the switch statement 1003 1006 1007 1008 1009 1010 1011 1012 1013 1014 1015 1016 1017 1018 1019 1020 1021 1022 1023 1025 1026 1027 1028 1029 1030 case 2: if (swap_header->info.version != 1) { printk(KERN_WARNING "Unable to handle swap header version %d\n", swap_header->info.version); error = -EINVAL; goto bad_swap; } p->lowest_bit = 1; maxpages = SWP_OFFSET(SWP_ENTRY(0,~0UL)) - 1; if (maxpages > swap_header->info.last_page) maxpages = swap_header->info.last_page; p->highest_bit = maxpages - 1; error = -EINVAL; if (swap_header->info.nr_badpages > MAX_SWAP_BADPAGES) goto bad_swap; if (!(p->swap_map = vmalloc(maxpages * sizeof(short)))) { error = -ENOMEM; goto bad_swap; } error = 0;

8.4. Activating a Swap Area 1031 1032 1033 1034 1035 1036 1037 1038 1039 1040 1041 1042 1043 1044 memset(p->swap_map, 0, maxpages * sizeof(short)); for (i=0; i<swap_header->info.nr_badpages; i++) { int page = swap_header->info.badpages[i]; if (page <= 0 || page >= swap_header->info.last_page) error = -EINVAL; else p->swap_map[page] = SWAP_MAP_BAD; } nr_good_pages = swap_header->info.last_page swap_header->info.nr_badpages 1 /* header page */; if (error) goto bad_swap; } Read the header information when the ﬁle format is version 2

282

1006-1012 Make absolutly sure we can handle this swap ﬁle format and return EINVAL if we cannot. Remember that with this version, the swap_header struct is placed nicely on disk 1014 Initialise lowest_bit to the known lowest available slot 1015-1017 Calculate the maxpages initially as the maximum possible size of a swap_map and then set it to the size indicated by the information on disk. This ensures the swap_map array is not accidently overloaded 1018 Initialise highest_bit 1020-1022 Make sure the number of bad pages that exist does not exceed MAX_SWAP_BADPAGES 1025-1028 Allocate memory for the swap_map with vmalloc() 1031 Initialise the full swap_map to 0 indicating all slots are available 1032-1038 Using the information loaded from disk, set each slot that is unusuable to SWAP_MAP_BAD 1039-1041 Calculate the number of available good pages 1042-1043 Return if an error occured 1045 1046 1047 1048 1049 1050

if (swapfilesize && maxpages > swapfilesize) { printk(KERN_WARNING "Swap area shorter than signature indicates\n"); error = -EINVAL; goto bad_swap;

8.4. Activating a Swap Area 1051 1052 1053 1054 1055 1056 1057 1058 1059 1060 1061 1062 1063 1064 1065 1066

283

} if (!nr_good_pages) { printk(KERN_WARNING "Empty swap-file\n"); error = -EINVAL; goto bad_swap; } p->swap_map[0] = SWAP_MAP_BAD; swap_list_lock(); swap_device_lock(p); p->max = maxpages; p->flags = SWP_WRITEOK; p->pages = nr_good_pages; nr_swap_pages += nr_good_pages; total_swap_pages += nr_good_pages; printk(KERN_INFO "Adding Swap: %dk swap-space (priority %d)\n", nr_good_pages<<(PAGE_SHIFT-10), p->prio);

1046-1051 Ensure the information loaded from disk matches the actual dimensions of the swap area. If they do not match, print a warning and return an error 1052-1056 If no good pages were available, return an error 1057 Make sure the ﬁrst page in the map containing the swap header information is not used. If it was, the header information would be overwritten the ﬁrst time this area was used 1058-1059 Lock the swap list and the swap device 1060-1062 Fill in the remaining ﬁelds in the swap_info_struct 1063-1064 Update global statistics for the number of available swap pages (nr_swap_pages) and the total number of swap pages (total_swap_pages) 1065-1066 Print an informational message about the swap activation 1068 1069 1070 1071 1072 1073 1074 1075 1076 1077 1078 1079 1080 /* insert swap space into swap_list: */ prev = -1; for (i = swap_list.head; i >= 0; i = swap_info[i].next) { if (p->prio >= swap_info[i].prio) { break; } prev = i; } p->next = i; if (prev < 0) { swap_list.head = swap_list.next = p - swap_info; } else { swap_info[prev].next = p - swap_info;

8.4. Activating a Swap Area 1081 1082 1083 1084 1085 } swap_device_unlock(p); swap_list_unlock(); error = 0; goto out;

284

1070-1080 Insert the new swap area into the correct slot in the swap list based on priority 1082 Unlock the swap device 1083 Unlock the swap list 1084-1085 Return success 1086 1087 1088 1089 1090 1091 1092 1093 1094 1095 1096 1097 1098 1099 1100 1101 1102 1103 1104 1105 1106 1107 1108 1109 1110 bad_swap: if (bdev) blkdev_put(bdev, BDEV_SWAP); bad_swap_2: swap_list_lock(); swap_map = p->swap_map; nd.mnt = p->swap_vfsmnt; nd.dentry = p->swap_file; p->swap_device = 0; p->swap_file = NULL; p->swap_vfsmnt = NULL; p->swap_map = NULL; p->flags = 0; if (!(swap_flags & SWAP_FLAG_PREFER)) ++least_priority; swap_list_unlock(); if (swap_map) vfree(swap_map); path_release(&nd); out: if (swap_header) free_page((long) swap_header); unlock_kernel(); return error; }

1087-1088 Drop the reference to the block device 1090-1104 This is the error path where the swap list need to be unlocked, the slot in swap_info reset to being unused and the memory allocated for swap_map freed if it was assigned 1104 Drop the reference to the special ﬁle

8.5. Deactivating a Swap Area

285

1106-1107 Release the page containing the swap header information as it is no longer needed 1108 Drop the Big Kernel Lock 1109 Return the error or success value

8.5

Deactivating a Swap Area

Function: sys_swapoﬀ (mm/swapﬁle.c) This function is principally concerned with updating the swap_info_struct and the swap lists. The main task of paging in all pages in the area is the responsibility of try_to_unuse(). The function tasks are broadly • Call user_path_walk() to acquire the information about the special ﬁle to be deactivated and then take the BKL • Remove the swap_info_struct from the swap list and update the global statistics on the number of swap pages available (nr_swap_pages) and the total number of swap entries (total_swap_pages. Once this is acquired, the BKL can be released again • Call try_to_unuse() which will page in all pages from the swap area to be deactivated. • If there was not enough available memory to page in all the entries, the swap area is reinserted back into the running system as it cannot be simply dropped. If it succeeded, the swap_info_struct is placed into an uninitialised state and the swap_map memory freed with vfree() 707 asmlinkage long sys_swapoff(const char * specialfile) 708 { 709 struct swap_info_struct * p = NULL; 710 unsigned short *swap_map; 711 struct nameidata nd; 712 int i, type, prev; 713 int err; 714 715 if (!capable(CAP_SYS_ADMIN)) 716 return -EPERM; 717 718 err = user_path_walk(specialfile, &nd); 719 if (err) 720 goto out; 721 715-716 Only the superuser or a process with CAP_SYS_ADMIN capabilities may deactivate an area

8.5. Deactivating a Swap Area

286

718-719 Acquire information about the special ﬁle representing the swap area with user_path_walk(). Return on error 722 723 724 725 726 727 728 729 730 731 732 733 734 735 736 737 738 739 740 741 742 743 744 745 746 747 748 749 750 lock_kernel(); prev = -1; swap_list_lock(); for (type = swap_list.head; type >= 0; type = swap_info[type].next) { p = swap_info + type; if ((p->flags & SWP_WRITEOK) == SWP_WRITEOK) { if (p->swap_file == nd.dentry) break; } prev = type; } err = -EINVAL; if (type < 0) { swap_list_unlock(); goto out_dput; } if (prev < 0) { swap_list.head = p->next; } else { swap_info[prev].next = p->next; } if (type == swap_list.next) { /* just pick something that’s safe... */ swap_list.next = swap_list.head; } nr_swap_pages -= p->pages; total_swap_pages -= p->pages; p->flags = SWP_USED;

Acquire the BKL, ﬁnd the swap_info_struct for the area to be deactivated and remove it from the swap list. 722 Acquire the BKL 724 Lock the swap list 725-732 Traverse the swap list and ﬁnd the swap_info_struct for the requested area. Use the dentry to identify the area 734-737 If the struct could not be found, return 739-747 Remove from the swap list making sure that this is not the head

8.5. Deactivating a Swap Area 748 Update the total number of free swap slots 749 Update the total number of existing swap slots 750 Mark the area as active but may not be written to 751 752 753 swap_list_unlock(); unlock_kernel(); err = try_to_unuse(type);

287

751 Unlock the swap list 752 Release the BKL 753 Page in all pages from this swap area 754 lock_kernel(); 755 if (err) { 756 /* re-insert swap space back into swap_list */ 757 swap_list_lock(); 758 for (prev = -1, i = swap_list.head; i >= 0; prev = i, i = swap_info[i].next) 759 if (p->prio >= swap_info[i].prio) 760 break; 761 p->next = i; 762 if (prev < 0) 763 swap_list.head = swap_list.next = p - swap_info; 764 else 765 swap_info[prev].next = p - swap_info; 766 nr_swap_pages += p->pages; 767 total_swap_pages += p->pages; 768 p->flags = SWP_WRITEOK; 769 swap_list_unlock(); 770 goto out_dput; 771 } Acquire the BKL. If we failed to page in all pages, then reinsert the area into the swap list 754 Acquire the BKL 757 Lock the swap list 758-765 Reinsert the area into the swap list. The position it is inserted at depends on the swap area priority 766-767 Update the global statistics 768 Mark the area as safe to write to again

8.5. Deactivating a Swap Area 769-770 Unlock the swap list and return 772 if (p->swap_device) 773 blkdev_put(p->swap_file->d_inode->i_bdev, BDEV_SWAP); 774 path_release(&nd); 775 776 swap_list_lock(); 777 swap_device_lock(p); 778 nd.mnt = p->swap_vfsmnt; 779 nd.dentry = p->swap_file; 780 p->swap_vfsmnt = NULL; 781 p->swap_file = NULL; 782 p->swap_device = 0; 783 p->max = 0; 784 swap_map = p->swap_map; 785 p->swap_map = NULL; 786 p->flags = 0; 787 swap_device_unlock(p); 788 swap_list_unlock(); 789 vfree(swap_map); 790 err = 0; 791 792 out_dput: 793 unlock_kernel(); 794 path_release(&nd); 795 out: 796 return err; 797 }

288

Else the swap area was successfully deactivated to close the block device and mark the swap_info_struct free 772-773 Close the block device 774 Release the path information 776-777 Acquire the swap list and swap device lock 778-786 Reset the ﬁelds in swap_info_struct to default values 787-788 Release the swap list and swap device 788 Free the memory used for the swap_map 793 Release the BKL 794 Release the path information in the event we reached here via the error path 796 Return success or failure

8.5. Deactivating a Swap Area

289

Function: try_to_unuse (mm/swapﬁle.c) This function is heavily commented in the source code albeit it consists of speculation or is slightly inaccurate. The comments are omitted here. 513 static int try_to_unuse(unsigned int type) 514 { 515 struct swap_info_struct * si = &swap_info[type]; 516 struct mm_struct *start_mm; 517 unsigned short *swap_map; 518 unsigned short swcount; 519 struct page *page; 520 swp_entry_t entry; 521 int i = 0; 522 int retval = 0; 523 int reset_overflow = 0; 524 539 start_mm = &init_mm; 540 atomic_inc(&init_mm.mm_users); 541 539-540 The starting mm_struct to page in pages for is init_mm. The count is incremented even though this particular struct will not disappear to prevent having to write special cases in the remainder of the function 555 556 557 558 559 560 561 562 563 564 571 572 573 574 575 576 577 578 579 580 581 582 while ((i = find_next_to_unuse(si, i))) { /* * Get a page for the entry, using the existing swap * cache page if there is one. Otherwise, get a clean * page and read the swap into it. */ swap_map = &si->swap_map[i]; entry = SWP_ENTRY(type, i); page = read_swap_cache_async(entry); if (!page) { if (!*swap_map) continue; retval = -ENOMEM; break; } /* * Don’t hold on to start_mm if it looks like exiting. */ if (atomic_read(&start_mm->mm_users) == 1) { mmput(start_mm); start_mm = &init_mm;

8.5. Deactivating a Swap Area 583 584 atomic_inc(&init_mm.mm_users); }

290

555 This is the beginning of the major loop in this function. Starting from the beginning of the swap_map, it searches for the next entry to be freed with find_next_to_unuse() until all swap map entries have been paged in 561-563 Get the swp_entry_t and call read_swap_cache_async() to ﬁnd the page in the swap cache or have a new page allocated for reading in from the disk 564-575 If we failed to get the page, it means the slot has already been freed independently by another process or thread (process could be exiting elsewhere) or we are out of memory. If independently freed, we continue to the next map, else we return ENOMEM 580 Check to make sure this mm is not exiting. If it is, decrement its count and go back to init_mm 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 /* * Wait for and lock page. When do_swap_page races with * try_to_unuse, do_swap_page can handle the fault much * faster than try_to_unuse can locate the entry. This * apparently redundant "wait_on_page" lets try_to_unuse * defer to do_swap_page in such a case - in some tests, * do_swap_page and try_to_unuse repeatedly compete. */ wait_on_page(page); lock_page(page); /* * Remove all references to entry, without blocking. * Whenever we reach init_mm, there’s no address space * to search, but use it as a reminder to search shmem. */ swcount = *swap_map; if (swcount > 1) { flush_page_to_ram(page); if (start_mm == &init_mm) shmem_unuse(entry, page); else unuse_process(start_mm, entry, page); }

594 Wait on the page to complete IO. Once it returns, we know for a fact the page exists in memory with the same information as that on disk 595 Lock the page 602 Get the swap map reference count

8.5. Deactivating a Swap Area 603 If the count is positive then...

291

605 As the page is about to be inserted into proces page tables, it must be freed from the D-Cache or the process may not “see” changes made to the page by the kernel 605-606 If we are using the init_mm, call shmem_unuse() which will free the page from any shared memory regions that are in use 608 Else update the PTE in the current mm which references this page 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 if (*swap_map > 1) { int set_start_mm struct list_head struct mm_struct struct mm_struct

= (*swap_map >= swcount); *p = &start_mm->mmlist; *new_start_mm = start_mm; *mm;

spin_lock(&mmlist_lock); while (*swap_map > 1 && (p = p->next) != &start_mm->mmlist) { mm = list_entry(p, struct mm_struct, mmlist); swcount = *swap_map; if (mm == &init_mm) { set_start_mm = 1; shmem_unuse(entry, page); } else unuse_process(mm, entry, page); if (set_start_mm && *swap_map < swcount) { new_start_mm = mm; set_start_mm = 0; } } atomic_inc(&new_start_mm->mm_users); spin_unlock(&mmlist_lock); mmput(start_mm); start_mm = new_start_mm; }

610-635 If an entry still exists, begin traversing through all mm_structs ﬁnding references to this page and update the respective PTE 616 Lock the mm list 617-630 Keep searching until all mm_structs have been found. Do not traverse the full list more than once 619 Get the mm_struct for this list entry

8.5. Deactivating a Swap Area

292

621-625 Call shmem_unuse() if the mm is init_mm, else call unuse_process() to traverse the process page tables and update the PTE 626-627 Record if we need to start searching mm_structs starting from init_mm again 650 651 652 653 654 655 656 657 658 if (*swap_map == SWAP_MAP_MAX) { swap_list_lock(); swap_device_lock(si); nr_swap_pages++; *swap_map = 1; swap_device_unlock(si); swap_list_unlock(); reset_overflow = 1; }

650 If the swap map entry is permanently mapped, we have to hope that all processes have their PTEs updated to point to the page and in reality the swap map entry is free. In reality, it is highly unlikely a slot would be permanetly reserved in the ﬁrst place 641-657 Lock the list and swap device, set the swap map entry to 1, unlock them again and record that a reset overﬂow occured 674 675 676 677 678 679 if ((*swap_map > 1) && PageDirty(page) && PageSwapCache(page)) { rw_swap_page(WRITE, page); lock_page(page); } if (PageSwapCache(page)) delete_from_swap_cache(page);

674-677 In the very rare event a reference still exists to the page, write the page back to disk so at least if another process really has a reference to it, it’ll copy the page back in from disk correctly 678-679 Delete the page from the swap cache so the page swap daemon will not use the page under any circumstances 686 687 688 SetPageDirty(page); UnlockPage(page); page_cache_release(page);

686 Mark the page dirty so that the swap out code will preserve the page and if it needs to remove it again, it’ll write it correctly to a new swap area 687 Unlock the page 688 Release our reference to it in the page cache

8.5. Deactivating a Swap Area 695 696 697 698 699 700 701 702 703 704 705 } if (current->need_resched) schedule(); } mmput(start_mm); if (reset_overflow) { printk(KERN_WARNING "swapoff: cleared swap entry overflow\n"); swap_overflow = 0; } return retval;

293

695-696 Call schedule() if necessary so the deactivation of swap does not hog the entire CPU 699 Drop our reference to the mm 700-703 If a permanently mapped page had to be removed, then print out a warning so that in the very unlikely event an error occurs later, there will be a hint to what might have happend 704 Return success or failure

Index
activate_lock, 237 activate_page_nolock, 237 add_page_to_active_list, 234 add_page_to_inactive_list, 235 __add_to_page_cache, 238 add_to_page_cache, 238 add_to_swap_cache, 267 address_space, 126 alloc_area_pmd, 51 alloc_area_pte, 52 alloc_bootmem, 15 __alloc_bootmem, 16 __alloc_bootmem_core, 17 alloc_bootmem_node, 16 __alloc_bootmem_node, 17 alloc_bounce_bh, 222 alloc_bounce_page, 223 alloc_one_pte, 167 alloc_page, 43 alloc_pages, 29 _alloc_pages, 30 __alloc_pages, 31 allocate_mm, 119 arch_get_unmapped_area, 131, 144 bootmem_data, 8 bounce_end_io, 226 bounce_end_io_read, 225 bounce_end_io_write, 225 BREAK_GFP_ORDER_HI, 103 BREAK_GFP_ORDER_LO, 103 cache_cache, 114 CACHE_NAMELEN, 103 calc_vm_ﬂags, 131 can_vma_merge, 153 cc_data, 106 cc_entry, 107 ccupdate_t, 111 294 CHECK_PAGE, 98 clock_searchp, 75 contig_page_data, 30 copy_from_high_bh, 225 copy_mm, 120 copy_one_pte, 167 copy_to_high_bh_irq, 226 cpu_vm_mask, 119 cpucache, 106 create_buﬀers, 220 def_ﬂags, 119 DEFAULT_MAX_MAP_COUNT, 131 del_page_from_active_list, 236 del_page_from_inactive_list, 236 do_anonymous_page, 204 do_ccupdate_local, 111, 112 do_mlock, 170 do_mmap2, 128 do_mmap_pgoﬀ, 128 do_mremap, 154 do_munmap, 179 do_no_page, 200 do_swap_page, 206 drain_cpu_caches, 112 enable_all_cpucaches, 107 enable_cpucache, 107, 108 exit_mmap, 123, 186 expand, 35, 37 ﬁnd_get_page, 272 __ﬁnd_get_page, 273 __ﬁnd_page_nolock, 273 ﬁnd_vma, 139 ﬁnd_vma_intersection, 142 ﬁnd_vma_prepare, 146 ﬁnd_vma_prev, 141 ﬂush_all_zero_pkmaps, 216

INDEX free_all_bootmem, 26 free_all_bootmem_core, 27 free_area_pmd, 56 free_area_pte, 57 free_block, 101 __free_block, 102 free_bootmem, 22 free_bootmem_core, 23 __free_page, 45 __free_pages, 39 free_pages, 39, 45 free_pages_init, 26 __free_pages_ok, 39 g_cpucache_up, 107 __get_dma_pages, 44 __get_free_page, 43 __get_free_pages, 44 get_one_pte, 166 get_swap_page, 262 get_unmapped_area, 142 get_vm_area, 48 get_zeroed_page, 44 gfp_mask, 30 handle_mm_fault, 198 handle_pte_fault, 199 __init, 25 init_bootmem, 11 init_bootmem_core, 12 init_bootmem_node, 12 init_emergency_pool, 228 INIT_MM, 120 init_mm, 120 __insert_vm_struct, 145 kfree, 105 km_type, 218 kmalloc, 104 kmap, 212 kmap_atomic, 218 kmap_high, 213 kmem_cache, 114 __kmem_cache_alloc (SMP Case), 92 __kmem_cache_alloc (UP Case), 91 kmem_cache_alloc, 90 kmem_cache_alloc_batch, 96 kmem_cache_alloc_head, 93 kmem_cache_alloc_one, 94 kmem_cache_alloc_one_tail, 95 kmem_cache_create, 59 kmem_cache_destroy, 73 kmem_cache_estimate, 68 kmem_cache_free, 97 __kmem_cache_free, 98, 99 kmem_cache_free_one, 100 kmem_cache_grow, 83 kmem_cache_init, 114 kmem_cache_init_objs, 88 kmem_cache_reap, 76 kmem_cache_shrink, 70 __kmem_cache_shrink, 71 __kmem_cache_shrink_locked, 72 kmem_cache_sizes_init, 102 kmem_cache_slabmgmt, 81 kmem_ﬁnd_general_cachep, 82 kmem_freepages, 116 kmem_getpages, 115 kmem_slab_destroy, 87 kmem_tune_cpucache, 107, 109 kswapd, 230 kswapd_balance, 232 kswapd_balance_pgdat, 233 kswapd_can_sleep, 231 kswapd_can_sleep_pgdat, 232 kswapd_init, 230 kunmap, 217 kunmap_atomic, 219 kunmap_high, 217 locked_vm, 119 lookup_swap_cache, 272 lru_cache_add, 234 lru_cache_del, 235 __lru_cache_del, 236 map_new_virtual, 213 mark_page_accessed, 237 max_map_count, 131 mem_init, 24 mlock_ﬁxup, 173 mlock_ﬁxup_all, 174

295

INDEX mlock_ﬁxup_end, 176 mlock_ﬁxup_middle, 177 mlock_ﬁxup_start, 175 mm_alloc, 119, 120 mm_init, 119, 122 mmap_sem, 118 mmdrop, 123, 124 __mmdrop, 124 mmlist, 119 mmput, 123 move_one_page, 165 move_page_tables, 164 move_vma, 159 one_highpage_init, 26 Page Frame Number (PFN), 9 page_cache_get, 238 page_cache_release, 238 ptep_get_and_clear, 58 REAP_SCANLEN, 75 reﬁll_inactive, 243 reserve_bootmem, 13 reserve_bootmem_core, 14 reserve_bootmem_node, 14 rmqueue, 35 rss, 119 scan_swap_map, 264 SET_PAGE_CACHE, 86 SET_PAGE_SLAB, 86 setup_memory, 8 shrink_cache, 244 shrink_caches, 239 slab_break_gfp_order, 103 smp_function_all_cpus, 111 STATS_INC_GROWN, 86 swap_duplicate, 268 swap_entry_free, 269 swap_free, 269 swap_info_get, 270 swap_info_put, 272 swap_out, 251 swap_out_mm, 253 swap_out_pgd, 255 swap_out_pmd, 256 swap_out_vma, 254 SWP_ENTRY, 264 sys_mlock, 168 sys_mlockall, 169 sys_mmap2, 128 sys_mremap, 153 sys_munlock, 172 sys_munlockall, 173 sys_swapoﬀ, 285 sys_swapon, 274 total_vm, 119 try_to_free_pages, 241 try_to_free_pages_zone, 242 try_to_swap_out, 258 try_to_unuse, 289 unmap_ﬁxup, 183 vfree, 53 vma_link, 147 __vma_link, 148 __vma_link_ﬁle, 150 __vma_link_list, 149 __vma_link_rb, 149 vma_merge, 150 vmalloc, 46 __vmalloc, 47 vmalloc_area_pages, 50 vmfree_area_pages, 55

296