    mm/hugetlb_vmemmap: remap head page to newly allocated page · 11aad263
    Joao Martins authored
Today with `hugetlb_free_vmemmap=on` the struct page memory that is freed
back to the page allocator is as follows: for a 2M hugetlb page it will reuse
the first 4K vmemmap page to remap the remaining 7 vmemmap pages, and for a
1G hugetlb page it will remap the remaining 4095 vmemmap pages. Essentially,
this breaks the first 4K of a potentially contiguous chunk of memory of
32K (for 2M hugetlb pages) or 16M (for 1G hugetlb pages). For this reason,
the memory that is freed back to the page allocator cannot be used by
hugetlb to allocate huge pages of the same size, but only huge pages of a
smaller size:
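The page counts quoted above follow from the vmemmap geometry; a quick
back-of-the-envelope check, assuming a 4K base page and a 64-byte
struct page (as on x86-64):

```python
# Sanity-check the vmemmap page counts quoted above, assuming a 4K base
# page and a 64-byte struct page (x86-64). Not kernel code, just arithmetic.
PAGE_SIZE = 4096
STRUCT_PAGE_SIZE = 64

def vmemmap_pages(hugetlb_size):
    """Number of 4K vmemmap pages backing the struct pages of one hugetlb page."""
    nr_struct_pages = hugetlb_size // PAGE_SIZE
    return nr_struct_pages * STRUCT_PAGE_SIZE // PAGE_SIZE

print(vmemmap_pages(2 << 20))  # 8    -> 1 reused + 7 remapped (32K total)
print(vmemmap_pages(1 << 30))  # 4096 -> 1 reused + 4095 remapped (16M total)
```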
    
    Trying to assign a 64G node to hugetlb (on a 128G 2node guest, each node
    having 64G):
    
    * Before allocation:
    Free pages count per migrate type at order       0      1      2      3
    4      5      6      7      8      9     10
    ...
    Node    0, zone   Normal, type      Movable    340    100     32     15
    1      2      0      0      0      1  15558
    
    $ echo 32768 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
    $ cat /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
     31987
    
    * After:
    
    Node    0, zone   Normal, type      Movable  30893  32006  31515      7
    0      0      0      0      0      0      0
    
Notice how the memory freed back is put into the 4K / 8K / 16K page
pools, and that a total of 31987 hugepages (63974M) were allocated.
    
To fix this behaviour, rather than remapping the second vmemmap page (thus
breaking the contiguous block of memory backing the struct pages),
repopulate the first vmemmap page with a new one. We allocate and copy
from the currently mapped vmemmap page, and then remap it later on.
The same algorithm works if there's a pre-initialized walk::reuse_page:
the head page doesn't need to be skipped, and instead we remap it
when the @addr being changed is the @reuse_addr.
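The difference between the two strategies can be illustrated with a toy
model (this is not the kernel code; the pfn values and function names are
made up for illustration), for a 2M hugetlb page whose vmemmap spans 8
physically contiguous 4K pages:

```python
# Toy model of the two head-page strategies described above, for a 2M
# hugetlb page whose vmemmap spans 8 physically contiguous 4K pages.
# All names and pfn values here are illustrative, not kernel APIs.

def free_vmemmap_old(vmemmap_pfns):
    """Old behaviour: reuse the head page in place.

    Pages 1..7 are freed, page 0 stays mapped, so the contiguous run
    handed back to the page allocator is broken at its first page.
    """
    return vmemmap_pfns[1:]

def free_vmemmap_new(vmemmap_pfns, new_head_pfn):
    """New behaviour: copy the head into a freshly allocated page.

    (In the kernel, the copy is ordered before the PTE write with
    smp_wmb().)  The whole contiguous run is then freed back.  If the
    allocation failed, fall back to reusing the old head as before.
    """
    if new_head_pfn is None:
        return free_vmemmap_old(vmemmap_pfns)
    return vmemmap_pfns

run = list(range(100, 108))                # 8 contiguous vmemmap pfns
print(len(free_vmemmap_old(run)))          # 7 pages freed, run broken
print(len(free_vmemmap_new(run, 500)))     # 8 pages freed, run intact
```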
    
The new head page is allocated in vmemmap_remap_free(), given that on
restore there's no need for a functional change. Note that, because right
now one hugepage is remapped at a time, only one free 4K page at a
time is needed to remap the head page. Should the allocation of said
new page fail, it reuses the one that's already mapped, just like before. As
a result, for every 64G of contiguous hugepages it can give back 1G more
of contiguous memory, while needing in total 128M of new 4K pages
(for 2M hugetlb) or 256K (for 1G hugetlb).
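The cost figures above (one new 4K head page per hugetlb page) can be
verified with simple arithmetic:

```python
# Verify the per-64G cost figures above: one new 4K head page is needed
# per hugetlb page. Plain arithmetic, not kernel code.
K, M, G = 1 << 10, 1 << 20, 1 << 30

def new_head_page_bytes(total_hugetlb, hugetlb_size, base_page=4096):
    """Total bytes of newly allocated 4K head pages for a hugetlb pool."""
    return (total_hugetlb // hugetlb_size) * base_page

print(new_head_page_bytes(64 * G, 2 * M) // M)  # 128 -> 128M for 2M hugetlb
print(new_head_page_bytes(64 * G, 1 * G) // K)  # 256 -> 256K for 1G hugetlb
```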
    
    After the changes, try to assign a 64G node to hugetlb (on a 128G 2node
    guest, each node with 64G):
    
    * Before allocation
    Free pages count per migrate type at order       0      1      2      3
    4      5      6      7      8      9     10
    ...
    Node    0, zone   Normal, type      Movable      1      1      1      0
    0      1      0      0      1      1  15564
    
    $ echo 32768  > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
    $ cat /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
    32394
    
    * After:
    
    Node    0, zone   Normal, type      Movable      0     50     97    108
    96     81     70     46     18      0      0
    
In the example above, 407 more hugetlb 2M pages are allocated, i.e. 814M out
of the 32394 (64788M) allocated. So the memory freed back is indeed being
reused by hugetlb, and there is no massive accumulation of unused
order-0..order-2 pages.
    
    [joao.m.martins@oracle.com: v3]
      Link: https://lkml.kernel.org/r/20221109200623.96867-1-joao.m.martins@oracle.com
    [joao.m.martins@oracle.com: add smp_wmb() to ensure page contents are visible prior to PTE write]
      Link: https://lkml.kernel.org/r/20221110121214.6297-1-joao.m.martins@oracle.com
    Link: https://lkml.kernel.org/r/20221107153922.77094-1-joao.m.martins@oracle.com
    
    
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>