  1. Feb 21, 2023
    • sysctl: fix proc_dobool() usability · f1aa2eb5
      Ondrej Mosnacek authored
      
      Currently proc_dobool expects a (bool *) in table->data, but sizeof(int)
      in table->maxsize, because it uses do_proc_dointvec() directly.
      
      This is unsafe for at least two reasons:
      1. A sysctl table definition may use { .data = &variable, .maxsize =
         sizeof(variable) }, not realizing that this makes the sysctl unusable
         (see the Fixes: tag) and that they need to use the completely
         counterintuitive sizeof(int) instead.
      2. proc_dobool() will currently try to parse an array of values if given
         .maxsize >= 2*sizeof(int), but will try to write values of type bool
         by offsets of sizeof(int), so it will not work correctly with either
         an (int *) or a (bool *). There is no .maxsize validation to prevent
         this.
      
      Fix this by:
      1. Constraining proc_dobool() to allow only one value and .maxsize ==
         sizeof(bool).
      2. Wrapping the original struct ctl_table in a temporary one with .data
         pointing to a local int variable and .maxsize set to sizeof(int) and
         passing this one to proc_dointvec(), converting the value to/from
         bool as needed (using proc_dou8vec_minmax() as an example).
      3. Extending sysctl_check_table() to enforce proc_dobool() expectations.
      4. Fixing the proc_dobool() docstring (it was just copy-pasted from
         proc_douintvec, apparently...).
      5. Converting all existing proc_dobool() users to set .maxsize to
         sizeof(bool) instead of sizeof(int).
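The wrapping step in item 2 can be illustrated with a minimal userspace sketch. This is not the kernel code: `struct demo_table`, `demo_parse_int()` and `demo_dobool_write()` are hypothetical stand-ins for struct ctl_table, the int-based parser and proc_dobool(), and the field name `maxsize` simply follows the wording of this message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-in for struct ctl_table. */
struct demo_table {
	void *data;
	size_t maxsize;
};

/* Stand-in for the int-based parser: stores exactly one int. */
int demo_parse_int(const struct demo_table *t, const char *text)
{
	char *end;
	long v;

	assert(t != NULL && text != NULL);
	if (t->maxsize != sizeof(int))
		return -1;
	v = strtol(text, &end, 10);
	if (end == text)
		return -1;	/* no digits were consumed */
	*(int *)t->data = (int)v;
	return 0;
}

/*
 * Bool handler sketch: accept only maxsize == sizeof(bool), parse into a
 * local int through a temporary table, then convert the result to bool.
 */
int demo_dobool_write(const struct demo_table *t, const char *text)
{
	int val = 0;
	struct demo_table tmp = { .data = &val, .maxsize = sizeof(val) };

	assert(t != NULL);
	if (t->maxsize != sizeof(bool))
		return -1;
	if (demo_parse_int(&tmp, text))
		return -1;
	*(bool *)t->data = (val != 0);
	return 0;
}
```

With this shape a caller can describe its variable the intuitive way, { .data = &flag, .maxsize = sizeof(flag) }, and a table that still passes sizeof(int) is rejected instead of being written through with the wrong stride.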
      
      Fixes: 83efeeeb ("tty: Allow TIOCSTI to be disabled")
      Fixes: a2071573 ("sysctl: introduce new proc handler proc_dobool")
      Signed-off-by: Ondrej Mosnacek <omosnace@redhat.com>
      Acked-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
  2. Feb 18, 2023
  3. Feb 17, 2023
  4. Feb 13, 2023
  5. Feb 12, 2023
    • Fix page corruption caused by racy check in __free_pages · 462a8e08
      David Chen authored
      When we upgraded our kernel, we started seeing some page corruption like
      the following consistently:
      
        BUG: Bad page state in process ganesha.nfsd  pfn:1304ca
        page:0000000022261c55 refcount:0 mapcount:-128 mapping:0000000000000000 index:0x0 pfn:0x1304ca
        flags: 0x17ffffc0000000()
        raw: 0017ffffc0000000 ffff8a513ffd4c98 ffffeee24b35ec08 0000000000000000
        raw: 0000000000000000 0000000000000001 00000000ffffff7f 0000000000000000
        page dumped because: nonzero mapcount
        CPU: 0 PID: 15567 Comm: ganesha.nfsd Kdump: loaded Tainted: P    B      O      5.10.158-1.nutanix.20221209.el7.x86_64 #1
        Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/05/2016
        Call Trace:
         dump_stack+0x74/0x96
         bad_page.cold+0x63/0x94
         check_new_page_bad+0x6d/0x80
         rmqueue+0x46e/0x970
         get_page_from_freelist+0xcb/0x3f0
         ? _cond_resched+0x19/0x40
         __alloc_pages_nodemask+0x164/0x300
         alloc_pages_current+0x87/0xf0
         skb_page_frag_refill+0x84/0x110
         ...
      
      Sometimes, it would also show up as corruption in the free list pointer
      and cause crashes.
      
      After bisecting the issue, we found the issue started from commit
      e320d301 ("mm/page_alloc.c: fix freeing non-compound pages"):
      
      	if (put_page_testzero(page))
      		free_the_page(page, order);
      	else if (!PageHead(page))
      		while (order-- > 0)
      			free_the_page(page + (1 << order), order);
      
      The problem is that the PageHead check is racy, because at this point we
      have already dropped our reference to the page.  So even if we came in
      with a compound page, the page can already have been freed, PageHead can
      return false, and we end up freeing all the tail pages, causing a double
      free.
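One way to close this window is to sample the head flag before dropping the reference. The following is a userspace sketch of that idea only; `struct demo_page` and the `demo_*` helpers are hypothetical stand-ins for struct page, put_page_testzero() and free_the_page(), and the counter exists purely so the behaviour can be observed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for struct page. */
struct demo_page {
	int refcount;
	bool head;	/* true if this is the head of a compound page */
};

/* Counts how many free calls were issued. */
int demo_free_calls;

void demo_free_the_page(struct demo_page *page, unsigned int order)
{
	(void)page;
	(void)order;
	demo_free_calls++;
}

/* Drops one reference and reports whether it was the last one. */
bool demo_put_page_testzero(struct demo_page *page)
{
	assert(page != NULL && page->refcount > 0);
	return --page->refcount == 0;
}

void demo_free_pages(struct demo_page *page, unsigned int order)
{
	/* Read the flag while our reference still pins the page. */
	bool head = page->head;

	if (demo_put_page_testzero(page))
		demo_free_the_page(page, order);
	else if (!head)
		while (order-- > 0)
			demo_free_the_page(page + (1 << order), order);
}
```

In this model a compound page that still has another reference holder triggers no frees at all, while a non-compound order-2 block with another holder frees the order-1 and order-0 tails, as the original code intended.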
      
      Fixes: e320d301 ("mm/page_alloc.c: fix freeing non-compound pages")
      Link: https://lore.kernel.org/lkml/BYAPR02MB448855960A9656EEA81141FC94D99@BYAPR02MB4488.namprd02.prod.outlook.com/
      
      
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: stable@vger.kernel.org
      Signed-off-by: Chunwei Chen <david.chen@nutanix.com>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  6. Feb 10, 2023
    • mm: shrinkers: fix deadlock in shrinker debugfs · badc28d4
      Qi Zheng authored
      debugfs_remove_recursive() is invoked by unregister_shrinker(), which
      holds the write lock of shrinker_rwsem.  It then waits for the debugfs
      file handler to complete.  The handler also needs to take the read lock
      of shrinker_rwsem to do its work.  So it may cause the following
      deadlock:
      
       	CPU0				CPU1
      
      debugfs_file_get()
      shrinker_debugfs_count_show()/shrinker_debugfs_scan_write()
      
           				unregister_shrinker()
      				--> down_write(&shrinker_rwsem);
      				    debugfs_remove_recursive()
      					// wait for (A)
      				    --> wait_for_completion();
      
          // wait for (B)
      --> down_read_killable(&shrinker_rwsem)
      debugfs_file_put() -- (A)
      
      				    up_write() -- (B)
      
      The down_read_killable() can be killed, so that the above deadlock can be
      recovered.  But it still requires an extra kill action, otherwise it will
      block all subsequent shrinker-related operations, so it's better to fix
      it.
      
      [akpm@linux-foundation.org: fix CONFIG_SHRINKER_DEBUG=n stub]
      Link: https://lkml.kernel.org/r/20230202105612.64641-1-zhengqi.arch@bytedance.com
      
      
      Fixes: 5035ebc6 ("mm: shrinkers: introduce debugfs interface for memory shrinkers")
      Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
      Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Kent Overstreet <kent.overstreet@gmail.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: hwpoison: support recovery from ksm_might_need_to_copy() · 6b970599
      Kefeng Wang authored
      When the kernel copies a page from ksm_might_need_to_copy() but runs into
      an uncorrectable error, it will crash since the poisoned page is consumed
      by the kernel.  This is similar to the issue recently fixed by
      Copy-on-write poison recovery.
      
      When an error is detected during the page copy, return VM_FAULT_HWPOISON
      in do_swap_page(), and install a hwpoison entry in unuse_pte() during
      swapoff, which helps us avoid a system crash.  Note that memory failure
      on a KSM page will be skipped, but we still call memory_failure_queue()
      to be consistent with the general memory failure process, and we could
      support KSM page recovery in the future.
      
      [wangkefeng.wang@huawei.com: enhance unuse_pte(), fix issue found by lkp]
        Link: https://lkml.kernel.org/r/20221213120523.141588-1-wangkefeng.wang@huawei.com
      [wangkefeng.wang@huawei.com: update changelog, alter ksm_might_need_to_copy(), restore unlikely() in unuse_pte()]
        Link: https://lkml.kernel.org/r/20230201074433.96641-1-wangkefeng.wang@huawei.com
      Link: https://lkml.kernel.org/r/20221209072801.193221-1-wangkefeng.wang@huawei.com
      
      
      Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • kasan: fix Oops due to missing calls to kasan_arch_is_ready() · 55d77bae
      Christophe Leroy authored
      On powerpc64, you can build a kernel with KASAN as soon as you build it
      with RADIX MMU support.  However if the CPU doesn't have RADIX MMU, KASAN
      isn't enabled at init and the following Oops is encountered.
      
        [    0.000000][    T0] KASAN not enabled as it requires radix!
      
        [    4.484295][   T26] BUG: Unable to handle kernel data access at 0xc00e000000804a04
        [    4.485270][   T26] Faulting instruction address: 0xc00000000062ec6c
        [    4.485748][   T26] Oops: Kernel access of bad area, sig: 11 [#1]
        [    4.485920][   T26] BE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
        [    4.486259][   T26] Modules linked in:
        [    4.486637][   T26] CPU: 0 PID: 26 Comm: kworker/u2:2 Not tainted 6.2.0-rc3-02590-gf8a023b0a805 #249
        [    4.486907][   T26] Hardware name: IBM pSeries (emulated by qemu) POWER9 (raw) 0x4e1200 0xf000005 of:SLOF,HEAD pSeries
        [    4.487445][   T26] Workqueue: eval_map_wq .tracer_init_tracefs_work_func
        [    4.488744][   T26] NIP:  c00000000062ec6c LR: c00000000062bb84 CTR: c0000000002ebcd0
        [    4.488867][   T26] REGS: c0000000049175c0 TRAP: 0380   Not tainted  (6.2.0-rc3-02590-gf8a023b0a805)
        [    4.489028][   T26] MSR:  8000000002009032 <SF,VEC,EE,ME,IR,DR,RI>  CR: 44002808  XER: 00000000
        [    4.489584][   T26] CFAR: c00000000062bb80 IRQMASK: 0
        [    4.489584][   T26] GPR00: c0000000005624d4 c000000004917860 c000000001cfc000 1800000000804a04
        [    4.489584][   T26] GPR04: c0000000003a2650 0000000000000cc0 c00000000000d3d8 c00000000000d3d8
        [    4.489584][   T26] GPR08: c0000000049175b0 a80e000000000000 0000000000000000 0000000017d78400
        [    4.489584][   T26] GPR12: 0000000044002204 c000000003790000 c00000000435003c c0000000043f1c40
        [    4.489584][   T26] GPR16: c0000000043f1c68 c0000000043501a0 c000000002106138 c0000000043f1c08
        [    4.489584][   T26] GPR20: c0000000043f1c10 c0000000043f1c20 c000000004146c40 c000000002fdb7f8
        [    4.489584][   T26] GPR24: c000000002fdb834 c000000003685e00 c000000004025030 c000000003522e90
        [    4.489584][   T26] GPR28: 0000000000000cc0 c0000000003a2650 c000000004025020 c000000004025020
        [    4.491201][   T26] NIP [c00000000062ec6c] .kasan_byte_accessible+0xc/0x20
        [    4.491430][   T26] LR [c00000000062bb84] .__kasan_check_byte+0x24/0x90
        [    4.491767][   T26] Call Trace:
        [    4.491941][   T26] [c000000004917860] [c00000000062ae70] .__kasan_kmalloc+0xc0/0x110 (unreliable)
        [    4.492270][   T26] [c0000000049178f0] [c0000000005624d4] .krealloc+0x54/0x1c0
        [    4.492453][   T26] [c000000004917990] [c0000000003a2650] .create_trace_option_files+0x280/0x530
        [    4.492613][   T26] [c000000004917a90] [c000000002050d90] .tracer_init_tracefs_work_func+0x274/0x2c0
        [    4.492771][   T26] [c000000004917b40] [c0000000001f9948] .process_one_work+0x578/0x9f0
        [    4.492927][   T26] [c000000004917c30] [c0000000001f9ebc] .worker_thread+0xfc/0x950
        [    4.493084][   T26] [c000000004917d60] [c00000000020be84] .kthread+0x1a4/0x1b0
        [    4.493232][   T26] [c000000004917e10] [c00000000000d3d8] .ret_from_kernel_thread+0x58/0x60
        [    4.495642][   T26] Code: 60000000 7cc802a6 38a00000 4bfffc78 60000000 7cc802a6 38a00001 4bfffc68 60000000 3d20a80e 7863e8c2 792907c6 <7c6348ae> 20630007 78630fe0 68630001
        [    4.496704][   T26] ---[ end trace 0000000000000000 ]---
      
      The Oops is due to kasan_byte_accessible() not checking the readiness of
      KASAN.  Add the missing call to kasan_arch_is_ready() and bail out when
      not ready.  The same problem is observed with ____kasan_kfree_large(), so
      fix it the same way.
      
      Also, as KASAN is not available and no shadow area is allocated for linear
      memory mapping, there is no point in allocating shadow mem for vmalloc
      memory as shown below in /sys/kernel/debug/kernel_page_tables
      
        ---[ kasan shadow mem start ]---
        0xc00f000000000000-0xc00f00000006ffff  0x00000000040f0000       448K         r  w       pte  valid  present        dirty  accessed
        0xc00f000000860000-0xc00f00000086ffff  0x000000000ac10000        64K         r  w       pte  valid  present        dirty  accessed
        0xc00f3ffffffe0000-0xc00f3fffffffffff  0x0000000004d10000       128K         r  w       pte  valid  present        dirty  accessed
        ---[ kasan shadow mem end ]---
      
      So, also verify KASAN readiness before allocating and poisoning
      shadow mem for VMAs.
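The guard pattern described above can be sketched in userspace as follows. This is an illustrative model, not the kernel implementation: `demo_kasan_ready`, `demo_shadow` and `DEMO_GRANULE_SIZE` are hypothetical stand-ins for the architecture readiness state, the shadow memory and the KASAN granule size.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_GRANULE_SIZE 8	/* assumed granule size for this sketch */

/* Stand-in for the state reported by kasan_arch_is_ready(). */
bool demo_kasan_ready;

/* Stand-in for shadow memory; stays NULL when no shadow was set up. */
const int8_t *demo_shadow;

bool demo_kasan_arch_is_ready(void)
{
	return demo_kasan_ready;
}

bool demo_byte_accessible(size_t index)
{
	int8_t shadow_byte;

	/* Bail out first: without shadow memory there is nothing to read. */
	if (!demo_kasan_arch_is_ready())
		return true;

	assert(demo_shadow != NULL);
	shadow_byte = demo_shadow[index];
	return shadow_byte >= 0 && shadow_byte < DEMO_GRANULE_SIZE;
}
```

The point of the ordering is that the readiness check comes before any shadow access, so a system that never allocated shadow memory never dereferences it.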
      
      Link: https://lkml.kernel.org/r/150768c55722311699fdcf8f5379e8256749f47d.1674716617.git.christophe.leroy@csgroup.eu
      
      
      Fixes: 41b7a347 ("powerpc: Book3S 64-bit outline-only KASAN support")
      Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
      Reported-by: Nathan Lynch <nathanl@linux.ibm.com>
      Suggested-by: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Andrey Konovalov <andreyknvl@gmail.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
      Cc: <stable@vger.kernel.org>	[5.19+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  7. Feb 07, 2023
  8. Feb 04, 2023
  9. Feb 03, 2023
  10. Feb 01, 2023
    • mm/swapfile: add cond_resched() in get_swap_pages() · 7717fc1a
      Longlong Xia authored
      The softlockup still occurs in get_swap_pages() under memory pressure.
      The test system has 64 CPU cores, 64GB of memory, and 28 zram devices;
      each zram device has a disksize of 50MB and the same priority as si.
      The stress-ng tool is used to increase memory pressure, causing the
      system to OOM frequently.
      
      The plist_for_each_entry_safe() loop in get_swap_pages() can iterate tens
      of thousands of times to find available space (extreme case:
      cond_resched() is not called in scan_swap_map_slots()).  Let's add
      cond_resched() to get_swap_pages() when it fails to find available space,
      to avoid the softlockup.
      
      Link: https://lkml.kernel.org/r/20230128094757.1060525-1-xialonglong1@huawei.com
      
      
      Signed-off-by: Longlong Xia <xialonglong1@huawei.com>
      Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
      Cc: Chen Wandun <chenwandun@huawei.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Nanyong Sun <sunnanyong@huawei.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: use stack_depot_early_init for kmemleak · 993f57e0
      Zhaoyang Huang authored
      Mirsad reported the error below, which is caused by stack_depot_init()
      failing in kvcalloc.  Solve this by having stackdepot use
      stack_depot_early_init().
      
      On 1/4/23 17:08, Mirsad Goran Todorovac wrote:
      I hate to bring bad news again, but there seems to be a problem with the output of /sys/kernel/debug/kmemleak:
      
      [root@pc-mtodorov ~]# cat /sys/kernel/debug/kmemleak
      unreferenced object 0xffff951c118568b0 (size 16):
      comm "kworker/u12:2", pid 56, jiffies 4294893952 (age 4356.548s)
      hex dump (first 16 bytes):
          6d 65 6d 73 74 69 63 6b 30 00 00 00 00 00 00 00 memstick0.......
          backtrace:
      [root@pc-mtodorov ~]#
      
      Apparently, backtrace of called functions on the stack is no longer
      printed with the list of memory leaks.  This appeared on Lenovo desktop
      10TX000VCR, with AlmaLinux 8.7 and BIOS version M22KT49A (11/10/2022) and
      6.2-rc1 and 6.2-rc2 builds.  This worked on 6.1 with the same
      CONFIG_KMEMLEAK=y and MGLRU enabled on a vanilla mainstream kernel from
      Mr.  Torvalds' tree.  I don't know if this is deliberate feature for some
      reason or a bug.  Please find attached the config, lshw and kmemleak
      output.
      
      [vbabka@suse.cz: remove stack_depot_init() call]
      Link: https://lore.kernel.org/all/5272a819-ef74-65ff-be61-4d2d567337de@alu.unizg.hr/
      Link: https://lkml.kernel.org/r/1674091345-14799-2-git-send-email-zhaoyang.huang@unisoc.com
      
      
      Fixes: 56a61617 ("mm: use stack_depot for recording kmemleak's backtrace")
      Reported-by: Mirsad Todorovac <mirsad.todorovac@alu.unizg.hr>
      Suggested-by: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
      Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
      Acked-by: Catalin Marinas <catalin.marinas@arm.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Tested-by: Borislav Petkov (AMD) <bp@alien8.de>
      Cc: ke.wang <ke.wang@unisoc.com>
      Cc: Nathan Chancellor <nathan@kernel.org>
      Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • migrate: hugetlb: check for hugetlb shared PMD in node migration · 73bdf65e
      Mike Kravetz authored
      migrate_pages/mempolicy semantics state that CAP_SYS_NICE is required to
      move pages shared with another process to a different node.  page_mapcount
      > 1 is being used to determine if a hugetlb page is shared.  However, a
      hugetlb page will have a mapcount of 1 if mapped by multiple processes via
      a shared PMD.  As a result, hugetlb pages shared by multiple processes and
      mapped with a shared PMD can be moved by a process without CAP_SYS_NICE.
      
      To fix, check for a shared PMD if mapcount is 1.  If a shared PMD is found
      consider the page shared.
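The resulting sharing test can be modelled in a few lines. This is a hedged userspace sketch, not the kernel change itself: `struct demo_hugetlb_map`, `demo_hugetlb_page_shared()` and `demo_may_move()` are hypothetical names, and the permission rule is reduced to a single boolean for CAP_SYS_NICE.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical summary of how one hugetlb page is mapped. */
struct demo_hugetlb_map {
	int mapcount;		/* what page_mapcount() would report */
	bool pmd_shared;	/* the PMD mapping the page is shared */
};

/* A page counts as shared if mapcount > 1, or mapcount == 1 via a shared PMD. */
bool demo_hugetlb_page_shared(const struct demo_hugetlb_map *m)
{
	assert(m != NULL);
	if (m->mapcount > 1)
		return true;
	return m->mapcount == 1 && m->pmd_shared;
}

/* Shared pages may only be moved by a caller holding CAP_SYS_NICE. */
bool demo_may_move(const struct demo_hugetlb_map *m, bool has_cap_sys_nice)
{
	return has_cap_sys_nice || !demo_hugetlb_page_shared(m);
}
```

The interesting case is mapcount == 1 with a shared PMD: before the fix such a page looked private, and in this model it is now refused to an unprivileged caller.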
      
      Link: https://lkml.kernel.org/r/20230126222721.222195-3-mike.kravetz@oracle.com
      
      
      Fixes: e2d8cf40 ("migrate: add hugepage migration code to migrate_pages()")
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Acked-by: Peter Xu <peterx@redhat.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: James Houghton <jthoughton@google.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
      Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/MADV_COLLAPSE: catch !none !huge !bad pmd lookups · edb5d0cf
      Zach O'Keefe authored
      In commit 34488399 ("mm/madvise: add file and shmem support to
      MADV_COLLAPSE") we make the following change to find_pmd_or_thp_or_none():
      
      	-       if (!pmd_present(pmde))
      	-               return SCAN_PMD_NULL;
      	+       if (pmd_none(pmde))
      	+               return SCAN_PMD_NONE;
      
      This was for-use by MADV_COLLAPSE file/shmem codepaths, where
      MADV_COLLAPSE might identify a pte-mapped hugepage, only to have
      khugepaged race-in, free the pte table, and clear the pmd.  Such codepaths
      include:
      
      A) If we find a suitably-aligned compound page of order HPAGE_PMD_ORDER
         already in the pagecache.
      B) In retract_page_tables(), if we fail to grab mmap_lock for the target
         mm/address.
      
      In these cases, collapse_pte_mapped_thp() really does expect a none (not
      just !present) pmd, and we want to suitably identify that case separate
      from the case where no pmd is found, or it's a bad-pmd (of course, many
      things could happen once we drop mmap_lock, and the pmd could plausibly
      undergo multiple transitions due to intervening fault, split, etc). 
      Regardless, the code is prepared to install a huge-pmd only when the
      existing pmd entry is either a genuine pte-table-mapping-pmd, or the
      none-pmd.
      
      However, the commit introduces a logical hole; namely, that we've allowed
      !none- && !huge- && !bad-pmds to be classified as genuine
      pte-table-mapping-pmds.  One such example that could leak through are swap
      entries.  The pmd values aren't checked again before use in
      pte_offset_map_lock(), which is expecting nothing less than a genuine
      pte-table-mapping-pmd.
      
      We want to put back the !pmd_present() check (below the pmd_none() check),
      but need to be careful to deal with subtleties in pmd transitions and
      treatments by various arch.
      
      The issue is that __split_huge_pmd_locked() temporarily clears the present
      bit (or otherwise marks the entry as invalid), but pmd_present() and
      pmd_trans_huge() still need to return true while the pmd is in this
      transitory state.  For example, x86's pmd_present() also checks the
      _PAGE_PSE bit, riscv's version also checks the _PAGE_LEAF bit, and arm64
      also checks a PMD_PRESENT_INVALID bit.
      
      Covering all 4 cases for x86 (all checks done on the same pmd value):
      
      1) pmd_present() && pmd_trans_huge()
         All we actually know here is that the PSE bit is set. Either:
         a) We aren't racing with __split_huge_page(), and PRESENT or PROTNONE
            is set.
            => huge-pmd
         b) We are currently racing with __split_huge_page().  The danger here
            is that we proceed as-if we have a huge-pmd, but really we are
            looking at a pte-mapping-pmd.  So, what is the risk of this
            danger?
      
            The only relevant path is:
      
      	madvise_collapse() -> collapse_pte_mapped_thp()
      
            Where we might just incorrectly report back "success", when really
            the memory isn't pmd-backed.  This is fine, since split could
            happen immediately after (actually) successful madvise_collapse().
            So, it should be safe to just assume huge-pmd here.
      
      2) pmd_present() && !pmd_trans_huge()
         Either:
         a) PSE not set and either PRESENT or PROTNONE is.
            => pte-table-mapping pmd (or PROT_NONE)
         b) devmap.  This routine can be called immediately after
            unlocking/locking mmap_lock -- or called with no locks held (see
            khugepaged_scan_mm_slot()), so previous VMA checks have since been
            invalidated.
      
      3) !pmd_present() && pmd_trans_huge()
        Not possible.
      
      4) !pmd_present() && !pmd_trans_huge()
        Neither PRESENT nor PROTNONE set
        => not present
      
      I've checked all archs that implement pmd_trans_huge() (arm64, riscv,
      powerpc, loongarch, x86, mips, s390) and this logic roughly translates
      (though devmap treatment is unique to x86 and powerpc, and (3) doesn't
      necessarily hold in general -- but that doesn't matter since
      !pmd_present() always takes failure path).
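The ordering of checks that the four cases lead to can be sketched as a small classifier. This is a userspace model under stated assumptions: `struct demo_pmd` reduces a pmd value to the predicate results discussed above, and the `DEMO_SCAN_*` names only mirror the SCAN_* results quoted in this message; it is not the kernel function.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical result codes mirroring the SCAN_* names used above. */
enum demo_scan_result {
	DEMO_SCAN_SUCCEED,	/* genuine pte-table-mapping pmd */
	DEMO_SCAN_PMD_NULL,	/* not usable: not present, devmap or bad */
	DEMO_SCAN_PMD_NONE,	/* the none-pmd */
	DEMO_SCAN_PMD_MAPPED,	/* already a huge-pmd */
};

/* One pmd value, reduced to the results of the predicates on it. */
struct demo_pmd {
	bool none;
	bool present;
	bool trans_huge;
	bool devmap;
	bool bad;
};

enum demo_scan_result demo_classify_pmd(struct demo_pmd pmde)
{
	/* Model invariant: a none pmd is never also present. */
	assert(!(pmde.none && pmde.present));

	if (pmde.none)
		return DEMO_SCAN_PMD_NONE;
	/* Restored check: catches swap entries and other !present values. */
	if (!pmde.present)
		return DEMO_SCAN_PMD_NULL;
	if (pmde.trans_huge)
		return DEMO_SCAN_PMD_MAPPED;
	if (pmde.devmap)
		return DEMO_SCAN_PMD_NULL;
	if (pmde.bad)
		return DEMO_SCAN_PMD_NULL;
	return DEMO_SCAN_SUCCEED;
}
```

All predicates are evaluated on the same sampled value, which matches the "all checks done on the same pmd value" condition stated for the four cases.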
      
      Also, add a comment above find_pmd_or_thp_or_none() to help future
      travelers reason about the validity of the code; namely, the possible
      mutations that might happen out from under us, depending on how mmap_lock
      is held (if at all).
      
      Link: https://lkml.kernel.org/r/20230125225358.2576151-1-zokeefe@google.com
      
      
      Fixes: 34488399 ("mm/madvise: add file and shmem support to MADV_COLLAPSE")
      Signed-off-by: Zach O'Keefe <zokeefe@google.com>
      Reported-by: Hugh Dickins <hughd@google.com>
      Reviewed-by: Yang Shi <shy828301@gmail.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm, mremap: fix mremap() expanding for vma's with vm_ops->close() · d014cd7c
      Vlastimil Babka authored
      Fabian has reported another regression in 6.1 due to ca3d76b0 ("mm:
      add merging after mremap resize").  The problem is that vma_merge() can
      fail when vma has a vm_ops->close() method, causing is_mergeable_vma()
      test to be negative.  This was happening for vma mapping a file from
      fuse-overlayfs, which does have the method.  But when we are simply
      expanding the vma, we never remove it due to the "merge" with the added
      area, so the test should not prevent the expansion.
      
      As a quick fix, check for such vmas and expand them using vma_adjust()
      directly as was done before commit ca3d76b0.  For a more robust long
      term solution we should try to limit the check for vm_ops->close only to
      cases that actually result in vma removal, so that no merge would be
      prevented unnecessarily.
      
      [akpm@linux-foundation.org: fix indenting whitespace, reflow comment]
      Link: https://lkml.kernel.org/r/20230117101939.9753-1-vbabka@suse.cz
      
      
      Fixes: ca3d76b0 ("mm: add merging after mremap resize")
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Reported-by: Fabian Vogt <fvogt@suse.com>
        Link: https://bugzilla.suse.com/show_bug.cgi?id=1206359#c35
      
      
      Tested-by: Fabian Vogt <fvogt@suse.com>
      Cc: Jakub Matěna <matenajakub@gmail.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: multi-gen LRU: fix crash during cgroup migration · de08eaa6
      Yu Zhao authored
      lru_gen_migrate_mm() assumes lru_gen_add_mm() runs prior to itself.  This
      isn't true for the following scenario:
      
          CPU 1                         CPU 2
      
        clone()
          cgroup_can_fork()
                                      cgroup_procs_write()
          cgroup_post_fork()
                                        task_lock()
                                        lru_gen_migrate_mm()
                                        task_unlock()
          task_lock()
          lru_gen_add_mm()
          task_unlock()
      
      And when the above happens, kernel crashes because of linked list
      corruption (mm_struct->lru_gen.list).
      
      Link: https://lore.kernel.org/r/20230115134651.30028-1-msizanoen@qtmlabs.xyz/
      Link: https://lkml.kernel.org/r/20230116034405.2960276-1-yuzhao@google.com
      
      
      Fixes: bd74fdae ("mm: multi-gen LRU: support page table walks")
      Signed-off-by: Yu Zhao <yuzhao@google.com>
      Reported-by: msizanoen <msizanoen@qtmlabs.xyz>
      Tested-by: msizanoen <msizanoen@qtmlabs.xyz>
      Cc: <stable@vger.kernel.org>	[6.1+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • Revert "mm: add nodes= arg to memory.reclaim" · 55ab834a
      Michal Hocko authored
      This reverts commit 12a5d395.
      
      Although it is recognized that a finer grained pro-active reclaim is
      something we need and want, the semantic of this implementation is really
      ambiguous.
      
      In a follow up discussion it became clear that there are two essential
      usecases here.  One is to use memory.reclaim to pro-actively reclaim
      memory and expectation is that the requested and reported amount of memory
      is uncharged from the memcg.  Another usecase focuses on pro-active
      demotion when the memory is merely shuffled around to demotion targets
      while the overall charged memory stays unchanged.
      
      The current implementation considers demoted pages as reclaimed and that
      breaks both usecases.  [1] has tried to address the reporting part but
      there are more issues with that, summarized in [2] and follow up emails.
      
      Let's revert the nodemask based extension of the memcg pro-active
      reclaim for now until we settle with a more robust semantic.
      
      [1] http://lkml.kernel.org/r/20221206023406.3182800-1-almasrymina@google.com
      [2] http://lkml.kernel.org/r/Y5bsmpCyeryu3Zz1@dhcp22.suse.cz
      
      Link: https://lkml.kernel.org/r/Y5xASNe1x8cusiTx@dhcp22.suse.cz
      
      
      Fixes: 12a5d395 ("mm: add nodes= arg to memory.reclaim")
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Cc: Bagas Sanjaya <bagasdotme@gmail.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Wei Xu <weixugc@google.com>
      Cc: Yang Shi <yang.shi@linux.alibaba.com>
      Cc: Yosry Ahmed <yosryahmed@google.com>
      Cc: zefan li <lizefan.x@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • zsmalloc: fix a race with deferred_handles storing · 85b32581
      Nhat Pham authored
      Currently, there is a race between zs_free() and zs_reclaim_page():
      zs_reclaim_page() finds a handle to an allocated object, but before the
      eviction happens, an independent zs_free() call to the same handle could
      come in and overwrite the object value stored at the handle with the last
      deferred handle.  When zs_reclaim_page() finally gets to call the eviction
      handler, it will see an invalid object value (i.e. the previous deferred
      handle instead of the original object value).
      
      This race happens quite infrequently.  We only managed to produce it with
      out-of-tree developmental code that triggers zsmalloc writeback with a
      much higher frequency than usual.
      
      This patch fixes this race by storing the deferred handle in the object
      header instead.  We differentiate the deferred handle from the other two
      cases (handle for allocated object, and linkage for free object) with a
      new tag.  If zspage reclamation succeeds, we will free these deferred
      handles by walking through the zspage objects.  On the other hand, if
      zspage reclamation fails, we reconstruct the zspage freelist (with the
      deferred handle tag and allocated tag) before trying again with the
      reclamation.
      
      [arnd@arndb.de: avoid unused-function warning]
        Link: https://lkml.kernel.org/r/20230117170507.2651972-1-arnd@kernel.org
      Link: https://lkml.kernel.org/r/20230110231701.326724-1-nphamcs@gmail.com
      
      
      Fixes: 9997bc01 ("zsmalloc: implement writeback mechanism for zsmalloc")
      Signed-off-by: Nhat Pham <nphamcs@gmail.com>
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Dan Streetman <ddstreet@ieee.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
      Cc: Seth Jennings <sjenning@redhat.com>
      Cc: Vitaly Wool <vitaly.wool@konsulko.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      85b32581
    • mm/khugepaged: fix ->anon_vma race · 023f47a8
      Jann Horn authored
      If an ->anon_vma is attached to the VMA, collapse_and_free_pmd() requires
      it to be locked.
      
      Page table traversal is allowed under any one of the mmap lock, the
      anon_vma lock (if the VMA is associated with an anon_vma), and the
      mapping lock (if the VMA is associated with a mapping); and so to be
      able to remove page tables, we must hold all three of them. 
      retract_page_tables() bails out if an ->anon_vma is attached, but does
      this check before holding the mmap lock (as the comment above the check
      explains).
      
      If we racily merged an existing ->anon_vma (shared with a child
      process) from a neighboring VMA, subsequent rmap traversals on pages
      belonging to the child will be able to see the page tables that we are
      concurrently removing while assuming that nothing else can access them.
      
      Repeat the ->anon_vma check once we hold the mmap lock to ensure that
      there really is no concurrent page table access.
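      The "check again once the lock is held" pattern can be sketched as a
      single-threaded model.  Everything here is hypothetical (struct
      vma_model, try_retract(), the before_lock hook) and it does not use the
      real kernel locking API; the hook merely stands in for a concurrent
      merge that attaches an anon_vma between the early check and taking the
      lock.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a VMA; not the kernel's struct vm_area_struct. */
struct vma_model {
	bool mmap_locked;          /* models holding the mmap lock */
	void *anon_vma;            /* non-NULL once an anon_vma is attached */
	bool page_tables_removed;
};

/* Hook run in the window between the unlocked check and taking the lock. */
typedef void (*race_hook_t)(struct vma_model *vma);

static inline void model_lock(struct vma_model *vma)
{
	assert(!vma->mmap_locked);
	vma->mmap_locked = true;
}

static inline void model_unlock(struct vma_model *vma)
{
	assert(vma->mmap_locked);
	vma->mmap_locked = false;
}

/* Simulates a racing merge that attaches an anon_vma to the VMA. */
static inline void attach_anon_vma(struct vma_model *vma)
{
	static int dummy_anon_vma;

	vma->anon_vma = &dummy_anon_vma;
}

static inline bool try_retract(struct vma_model *vma, race_hook_t before_lock)
{
	/* Early check without the lock: cheap, but the answer can go stale. */
	if (vma->anon_vma)
		return false;

	if (before_lock)
		before_lock(vma);

	model_lock(vma);
	/* Repeat the check now that the lock is held. */
	if (vma->anon_vma) {
		model_unlock(vma);
		return false;
	}
	vma->page_tables_removed = true;
	model_unlock(vma);
	return true;
}
```

      Without the second check, the raced case would go on to remove page
      tables even though an anon_vma is now attached.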
      
      Hitting this bug causes a lockdep warning in collapse_and_free_pmd(),
      in the line "lockdep_assert_held_write(&vma->anon_vma->root->rwsem)". 
      It can also lead to use-after-free access.
      
      Link: https://lore.kernel.org/linux-mm/CAG48ez3434wZBKFFbdx4M9j6eUwSUVPd4dxhzW_k_POneSDF+A@mail.gmail.com/
      Link: https://lkml.kernel.org/r/20230111133351.807024-1-jannh@google.com
      
      
      Fixes: f3f0e1d2 ("khugepaged: add support of collapse for tmpfs/shmem pages")
      Signed-off-by: Jann Horn <jannh@google.com>
      Reported-by: Zach O'Keefe <zokeefe@google.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@intel.linux.com>
      Reviewed-by: Yang Shi <shy828301@gmail.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  11. Jan 29, 2023
  12. Jan 20, 2023
  13. Jan 19, 2023
    • fs: port inode_owner_or_capable() to mnt_idmap · 01beba79
      Christian Brauner authored
      
      Convert to struct mnt_idmap.
      
      Last cycle we merged the necessary infrastructure in
      256c8aed ("fs: introduce dedicated idmap type for mounts").
      This is just the conversion to struct mnt_idmap.
      
      Currently we still pass around the plain namespace that was attached to a
      mount. This is in general pretty convenient but it makes it easy to
      conflate namespaces that are relevant on the filesystem with namespaces
      that are relevant on the mount level. Especially for non-vfs developers
      without detailed knowledge in this area this can be a potential source for
      bugs.
      
      Once the conversion to struct mnt_idmap is done all helpers down to the
      really low-level helpers will take a struct mnt_idmap argument instead of
      two namespace arguments. This way it becomes impossible to conflate the two,
      eliminating this class of bugs. All of the vfs and all filesystems
      only operate on struct mnt_idmap.
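      Why a dedicated type helps can be shown with a toy model.  The structs
      and helpers below (user_ns_model, mnt_idmap_model, map_id_two_ns(),
      map_id_idmap()) and the simple offset arithmetic are assumptions made up
      for illustration; they are not the kernel's definitions or its mapping
      logic.

```c
#include <assert.h>

/* Hypothetical models; the real kernel structures are far richer. */
struct user_ns_model {
	unsigned int id_base;   /* toy offset standing in for an ID mapping */
};

struct mnt_idmap_model {
	unsigned int id_base;   /* toy offset applied by the mount */
};

/*
 * Old style: both arguments have the same type, so swapping them compiles
 * without complaint and silently yields a wrong ID.
 */
static inline unsigned int map_id_two_ns(const struct user_ns_model *mnt_userns,
					 const struct user_ns_model *fs_userns,
					 unsigned int raw_id)
{
	return raw_id - fs_userns->id_base + mnt_userns->id_base;
}

/*
 * New style: the mount-level mapping has its own type, so passing a
 * filesystem namespace in its place is diagnosed by the compiler as an
 * incompatible pointer type.
 */
static inline unsigned int map_id_idmap(const struct mnt_idmap_model *idmap,
					const struct user_ns_model *fs_userns,
					unsigned int raw_id)
{
	return raw_id - fs_userns->id_base + idmap->id_base;
}
```

      With fs_ns at base 0 and the mount at base 100000, the first helper maps
      raw ID 1000 to 101000 when called correctly and to a different value
      when its two arguments are swapped; the second helper leaves no way to
      swap them unnoticed.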
      
      Acked-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
    • fs: port inode_init_owner() to mnt_idmap · f2d40141
      Christian Brauner authored
      
      Convert to struct mnt_idmap.
      
      Last cycle we merged the necessary infrastructure in
      256c8aed ("fs: introduce dedicated idmap type for mounts").
      This is just the conversion to struct mnt_idmap.
      
      Currently we still pass around the plain namespace that was attached to a
      mount. This is in general pretty convenient but it makes it easy to
      conflate namespaces that are relevant on the filesystem with namespaces
      that are relevant on the mount level. Especially for non-vfs developers
      without detailed knowledge in this area this can be a potential source for
      bugs.
      
      Once the conversion to struct mnt_idmap is done all helpers down to the
      really low-level helpers will take a struct mnt_idmap argument instead of
      two namespace arguments. This way it becomes impossible to conflate the two,
      eliminating this class of bugs. All of the vfs and all filesystems
      only operate on struct mnt_idmap.
      
      Acked-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
    • fs: port xattr to mnt_idmap · 39f60c1c
      Christian Brauner authored
      
      Convert to struct mnt_idmap.
      
      Last cycle we merged the necessary infrastructure in
      256c8aed ("fs: introduce dedicated idmap type for mounts").
      This is just the conversion to struct mnt_idmap.
      
      Currently we still pass around the plain namespace that was attached to a
      mount. This is in general pretty convenient but it makes it easy to
      conflate namespaces that are relevant on the filesystem with namespaces
      that are relevant on the mount level. Especially for non-vfs developers
      without detailed knowledge in this area this can be a potential source for
      bugs.
      
      Once the conversion to struct mnt_idmap is done all helpers down to the
      really low-level helpers will take a struct mnt_idmap argument instead of
      two namespace arguments. This way it becomes impossible to conflate the two,
      eliminating this class of bugs. All of the vfs and all filesystems
      only operate on struct mnt_idmap.
      
      Acked-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
    • fs: port ->fileattr_set() to pass mnt_idmap · 8782a9ae
      Christian Brauner authored
      
      Convert to struct mnt_idmap.
      
      Last cycle we merged the necessary infrastructure in
      256c8aed ("fs: introduce dedicated idmap type for mounts").
      This is just the conversion to struct mnt_idmap.
      
      Currently we still pass around the plain namespace that was attached to a
      mount. This is in general pretty convenient but it makes it easy to
      conflate namespaces that are relevant on the filesystem with namespaces
      that are relevant on the mount level. Especially for non-vfs developers
      without detailed knowledge in this area this can be a potential source for
      bugs.
      
      Once the conversion to struct mnt_idmap is done all helpers down to the
      really low-level helpers will take a struct mnt_idmap argument instead of
      two namespace arguments. This way it becomes impossible to conflate the two,
      eliminating this class of bugs. All of the vfs and all filesystems
      only operate on struct mnt_idmap.
      
      Acked-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
    • fs: port ->set_acl() to pass mnt_idmap · 13e83a49
      Christian Brauner authored
      
      Convert to struct mnt_idmap.
      
      Last cycle we merged the necessary infrastructure in
      256c8aed ("fs: introduce dedicated idmap type for mounts").
      This is just the conversion to struct mnt_idmap.
      
      Currently we still pass around the plain namespace that was attached to a
      mount. This is in general pretty convenient but it makes it easy to
      conflate namespaces that are relevant on the filesystem with namespaces
      that are relevant on the mount level. Especially for non-vfs developers
      without detailed knowledge in this area this can be a potential source for
      bugs.
      
      Once the conversion to struct mnt_idmap is done all helpers down to the
      really low-level helpers will take a struct mnt_idmap argument instead of
      two namespace arguments. This way it becomes impossible to conflate the two,
      eliminating this class of bugs. All of the vfs and all filesystems
      only operate on struct mnt_idmap.
      
      Acked-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
    • fs: port ->tmpfile() to pass mnt_idmap · 011e2b71
      Christian Brauner authored
      
      Convert to struct mnt_idmap.
      
      Last cycle we merged the necessary infrastructure in
      256c8aed ("fs: introduce dedicated idmap type for mounts").
      This is just the conversion to struct mnt_idmap.
      
      Currently we still pass around the plain namespace that was attached to a
      mount. This is in general pretty convenient but it makes it easy to
      conflate namespaces that are relevant on the filesystem with namespaces
      that are relevant on the mount level. Especially for non-vfs developers
      without detailed knowledge in this area this can be a potential source for
      bugs.
      
      Once the conversion to struct mnt_idmap is done all helpers down to the
      really low-level helpers will take a struct mnt_idmap argument instead of
      two namespace arguments. This way it becomes impossible to conflate the two,
      eliminating this class of bugs. All of the vfs and all filesystems
      only operate on struct mnt_idmap.
      
      Acked-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
    • fs: port ->rename() to pass mnt_idmap · e18275ae
      Christian Brauner authored
      
      Convert to struct mnt_idmap.
      
      Last cycle we merged the necessary infrastructure in
      256c8aed ("fs: introduce dedicated idmap type for mounts").
      This is just the conversion to struct mnt_idmap.
      
      Currently we still pass around the plain namespace that was attached to a
      mount. This is in general pretty convenient but it makes it easy to
      conflate namespaces that are relevant on the filesystem with namespaces
      that are relevant on the mount level. Especially for non-vfs developers
      without detailed knowledge in this area this can be a potential source for
      bugs.
      
      Once the conversion to struct mnt_idmap is done all helpers down to the
      really low-level helpers will take a struct mnt_idmap argument instead of
      two namespace arguments. This way it becomes impossible to conflate the two,
      eliminating this class of bugs. All of the vfs and all filesystems
      only operate on struct mnt_idmap.
      
      Acked-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
    • fs: port ->mknod() to pass mnt_idmap · 5ebb29be
      Christian Brauner authored
      
      Convert to struct mnt_idmap.
      
      Last cycle we merged the necessary infrastructure in
      256c8aed ("fs: introduce dedicated idmap type for mounts").
      This is just the conversion to struct mnt_idmap.
      
      Currently we still pass around the plain namespace that was attached to a
      mount. This is in general pretty convenient but it makes it easy to
      conflate namespaces that are relevant on the filesystem with namespaces
      that are relevant on the mount level. Especially for non-vfs developers
      without detailed knowledge in this area this can be a potential source for
      bugs.
      
      Once the conversion to struct mnt_idmap is done all helpers down to the
      really low-level helpers will take a struct mnt_idmap argument instead of
      two namespace arguments. This way it becomes impossible to conflate the two,
      eliminating this class of bugs. All of the vfs and all filesystems
      only operate on struct mnt_idmap.
      
      Acked-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
    • fs: port ->mkdir() to pass mnt_idmap · c54bd91e
      Christian Brauner authored
      
      Convert to struct mnt_idmap.
      
      Last cycle we merged the necessary infrastructure in
      256c8aed ("fs: introduce dedicated idmap type for mounts").
      This is just the conversion to struct mnt_idmap.
      
      Currently we still pass around the plain namespace that was attached to a
      mount. This is in general pretty convenient but it makes it easy to
      conflate namespaces that are relevant on the filesystem with namespaces
      that are relevant on the mount level. Especially for non-vfs developers
      without detailed knowledge in this area this can be a potential source for
      bugs.
      
      Once the conversion to struct mnt_idmap is done all helpers down to the
      really low-level helpers will take a struct mnt_idmap argument instead of
      two namespace arguments. This way it becomes impossible to conflate the two,
      eliminating this class of bugs. All of the vfs and all filesystems
      only operate on struct mnt_idmap.
      
      Acked-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
    • fs: port ->symlink() to pass mnt_idmap · 7a77db95
      Christian Brauner authored
      
      Convert to struct mnt_idmap.
      
      Last cycle we merged the necessary infrastructure in
      256c8aed ("fs: introduce dedicated idmap type for mounts").
      This is just the conversion to struct mnt_idmap.
      
      Currently we still pass around the plain namespace that was attached to a
      mount. This is in general pretty convenient but it makes it easy to
      conflate namespaces that are relevant on the filesystem with namespaces
      that are relevant on the mount level. Especially for non-vfs developers
      without detailed knowledge in this area this can be a potential source for
      bugs.
      
      Once the conversion to struct mnt_idmap is done all helpers down to the
      really low-level helpers will take a struct mnt_idmap argument instead of
      two namespace arguments. This way it becomes impossible to conflate the two,
      eliminating this class of bugs. All of the vfs and all filesystems
      only operate on struct mnt_idmap.
      
      Acked-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
    • fs: port ->create() to pass mnt_idmap · 6c960e68
      Christian Brauner authored
      
      Convert to struct mnt_idmap.
      
      Last cycle we merged the necessary infrastructure in
      256c8aed ("fs: introduce dedicated idmap type for mounts").
      This is just the conversion to struct mnt_idmap.
      
      Currently we still pass around the plain namespace that was attached to a
      mount. This is in general pretty convenient but it makes it easy to
      conflate namespaces that are relevant on the filesystem with namespaces
      that are relevant on the mount level. Especially for non-vfs developers
      without detailed knowledge in this area this can be a potential source for
      bugs.
      
      Once the conversion to struct mnt_idmap is done all helpers down to the
      really low-level helpers will take a struct mnt_idmap argument instead of
      two namespace arguments. This way it becomes impossible to conflate the two,
      eliminating this class of bugs. All of the vfs and all filesystems
      only operate on struct mnt_idmap.
      
      Acked-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
    • fs: port ->getattr() to pass mnt_idmap · b74d24f7
      Christian Brauner authored
      
      Convert to struct mnt_idmap.
      
      Last cycle we merged the necessary infrastructure in
      256c8aed ("fs: introduce dedicated idmap type for mounts").
      This is just the conversion to struct mnt_idmap.
      
      Currently we still pass around the plain namespace that was attached to a
      mount. This is in general pretty convenient but it makes it easy to
      conflate namespaces that are relevant on the filesystem with namespaces
      that are relevant on the mount level. Especially for non-vfs developers
      without detailed knowledge in this area this can be a potential source for
      bugs.
      
      Once the conversion to struct mnt_idmap is done all helpers down to the
      really low-level helpers will take a struct mnt_idmap argument instead of
      two namespace arguments. This way it becomes impossible to conflate the two,
      eliminating this class of bugs. All of the vfs and all filesystems
      only operate on struct mnt_idmap.
      
      Acked-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
    • fs: port ->setattr() to pass mnt_idmap · c1632a0f
      Christian Brauner authored
      
      Convert to struct mnt_idmap.
      
      Last cycle we merged the necessary infrastructure in
      256c8aed ("fs: introduce dedicated idmap type for mounts").
      This is just the conversion to struct mnt_idmap.
      
      Currently we still pass around the plain namespace that was attached to a
      mount. This is in general pretty convenient but it makes it easy to
      conflate namespaces that are relevant on the filesystem with namespaces
      that are relevant on the mount level. Especially for non-vfs developers
      without detailed knowledge in this area this can be a potential source for
      bugs.
      
      Once the conversion to struct mnt_idmap is done all helpers down to the
      really low-level helpers will take a struct mnt_idmap argument instead of
      two namespace arguments. This way it becomes impossible to conflate the two,
      eliminating this class of bugs. All of the vfs and all filesystems
      only operate on struct mnt_idmap.
      
      Acked-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
    • mm: fix a few rare cases of using swapin error pte marker · 7e3ce3f8
      Peter Xu authored
      This patch should harden commit 15520a3f ("mm: use pte markers for
      swap errors"), which uses pte markers for swapin errors, in a few corner
      cases.
      
      1. Propagate swapin errors across fork()s: if there are swapin errors in
         the parent mm, after fork() the child should also get SIGBUS when an
         error page is accessed.
      
      2. Fix a rare race condition in pte_marker_clear() where a uffd-wp pte
         marker can be quickly switched to a swapin error.
      
      3. Explicitly ignore swapin error pte markers in change_protection().
      
      I'm mostly not worried about (2) or (3), but we should still have them.
      Case (1) is special because it can potentially cause silent data
      corruption in the child when the parent has a swapin error triggered by
      swapoff, but since swapin errors are already rare, it's probably not easy
      to trigger either.
      
      Currently there is a priority difference between the uffd-wp bit and the
      swapin error entry, in which the swapin error always has higher priority
      (e.g.  we don't need to wr-protect a swapin error pte marker).
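      The priority rule and the fork() behaviour from case (1) can be modelled
      in a few lines.  This is a sketch under assumptions: the enum values and
      the helpers marker_for_child() and marker_skips_wrprotect() are
      hypothetical names for illustration and are not the kernel's pte marker
      API.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical marker bits; the values are illustrative only. */
enum marker_model {
	MARKER_NONE         = 0,
	MARKER_UFFD_WP      = 1u << 0,
	MARKER_SWAPIN_ERROR = 1u << 1,
};

/*
 * Decide which marker the child's pte gets at fork().  A swapin error is
 * always propagated; a uffd-wp marker is inherited only if the child's VMA
 * still tracks uffd-wp; otherwise the child's pte stays empty.
 */
static inline unsigned int marker_for_child(unsigned int parent_marker,
					    bool child_tracks_uffd_wp)
{
	if (parent_marker & MARKER_SWAPIN_ERROR)
		return MARKER_SWAPIN_ERROR;
	if ((parent_marker & MARKER_UFFD_WP) && child_tracks_uffd_wp)
		return MARKER_UFFD_WP;
	return MARKER_NONE;
}

/* A swapin error outranks uffd-wp, so there is nothing to write-protect. */
static inline bool marker_skips_wrprotect(unsigned int marker)
{
	return (marker & MARKER_SWAPIN_ERROR) != 0;
}
```

      Checking the swapin error bit first in both helpers is what encodes the
      "swapin error always has higher priority" rule.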
      
      If a third bit is introduced, we'll probably need to consider a more
      involved approach, so we may need to start operating on the bits.  Let's
      leave that for later.
      
      This patch was tested with case (1) explicitly: before the patch, the
      child gets corrupted data if there are existing swapin error pte markers;
      with the patch applied, the child is rightfully killed.
      
      We don't need to copy stable for this one since 15520a3f just landed
      as part of v6.2-rc1, so only the "Fixes" tag is applied.
      
      Link: https://lkml.kernel.org/r/20221214200453.1772655-3-peterx@redhat.com
      
      
      Fixes: 15520a3f ("mm: use pte markers for swap errors")
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Pengfei Xu <pengfei.xu@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/uffd: fix pte marker when fork() without fork event · 49d6d7fb
      Peter Xu authored
      Patch series "mm: Fixes on pte markers".
      
      Patch 1 resolves the syzkaller report from Pengfei.
      
      Patch 2 further hardens pte markers when used with the recent swapin error
      markers.  The major case is that we should persist a swapin error marker
      after fork(), so the child doesn't read a corrupted page.
      
      
      This patch (of 2):
      
      During fork(), dst_vma is not guaranteed to have VM_UFFD_WP even if src may
      have it and has a pte marker installed.  The warning is improper, along with
      the comment.  The right thing is to inherit the pte marker when needed, or
      keep the dst pte empty.
      
      A vague guess is that this happened by accident when the prior patch
      introducing src/dst vma into this helper was developed as part of the
      uffd-wp feature, and I probably messed up in the rebase, since the warning
      and comment both make sense if we replace dst_vma with src_vma.
      
      Hugetlb does exactly the right thing here (copy_hugetlb_page_range()).  Fix
      the general path.
      
      Reproducer:
      
      https://github.com/xupengfe/syzkaller_logs/blob/main/221208_115556_copy_page_range/repro.c
      
      Bugzilla report: https://bugzilla.kernel.org/show_bug.cgi?id=216808
      
      Link: https://lkml.kernel.org/r/20221214200453.1772655-1-peterx@redhat.com
      Link: https://lkml.kernel.org/r/20221214200453.1772655-2-peterx@redhat.com
      
      
      Fixes: c56d1b62 ("mm/shmem: handle uffd-wp during fork()")
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Reported-by: Pengfei Xu <pengfei.xu@intel.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: <stable@vger.kernel.org> # 5.19+
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  14. Jan 13, 2023