  1. Mar 04, 2023
    • umh: simplify the capability pointer logic · e7783615
      Linus Torvalds authored
      
      The usermodehelper code uses two fake pointers for the two capability
      cases: CAP_BSET for reading and writing 'usermodehelper_bset', and
      CAP_PI to read and write 'usermodehelper_inheritable'.
      
      This seems to be a completely unnecessary indirection, since we could
      instead just use the pointers themselves, and never have to do any "if
      this then that" kind of logic.
      
      So just get rid of the fake pointer values, and use the real pointer
      values instead.
      
      Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
      Cc: Eric Biederman <ebiederm@xmission.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Iurii Zaikin <yzaikin@google.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
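The idea in the umh commit above can be illustrated with a small user-space sketch that contrasts a sentinel ("fake pointer") lookup with storing the real pointer directly. Every name here (`struct entry`, `FAKE_BSET`, `FAKE_PI`, `resolve_fake`, `resolve_real`, `bset`, `inheritable`) is a hypothetical stand-in chosen for illustration, not the kernel's actual definitions.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the two capability sets. */
static uint64_t bset = 0xffULL;
static uint64_t inheritable = 0x0fULL;

/* Old style (sketch): sentinel values that are never dereferenced. */
#define FAKE_BSET ((void *)1)
#define FAKE_PI   ((void *)2)

struct entry {
	void *data; /* either a sentinel or a real pointer */
};

/* Old style: translate the sentinel into the real object with if/else. */
static uint64_t *resolve_fake(const struct entry *e)
{
	if (e->data == FAKE_BSET)
		return &bset;
	if (e->data == FAKE_PI)
		return &inheritable;
	return NULL; /* unknown tag */
}

/* New style: the entry already holds the real pointer, so no branching. */
static uint64_t *resolve_real(const struct entry *e)
{
	return e->data;
}
```

With the real pointer stored in the entry, the "if this then that" translation step disappears and the reader and writer paths can share one dereference.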
  2. Mar 03, 2023
    • panic: fix the panic_print NMI backtrace setting · b905039e
      Guilherme G. Piccoli authored
      Commit 8d470a45 ("panic: add option to dump all CPUs backtraces in
      panic_print") introduced a setting for the "panic_print" kernel parameter
      to allow users to request an NMI backtrace on panic.  The problem is that
      the panic_print handling happens after the secondary CPUs are already
      disabled, so this option ended up being effectively a no-op: the kernel
      skips the NMI trace on idling CPUs, which is the case for offline CPUs.
      
      Fix it by checking the NMI backtrace bit in the panic_print prior to the
      CPU disabling function.
      
      Link: https://lkml.kernel.org/r/20230226160838.414257-1-gpiccoli@igalia.com

      Fixes: 8d470a45 ("panic: add option to dump all CPUs backtraces in panic_print")
      Signed-off-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
      Cc: <stable@vger.kernel.org>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Dave Young <dyoung@redhat.com>
      Cc: Feng Tang <feng.tang@intel.com>
      Cc: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
      Cc: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Michael Kelley <mikelley@microsoft.com>
      Cc: Petr Mladek <pmladek@suse.com>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
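The ordering problem described in the panic_print commit above can be modelled with a toy simulation. This is only a sketch under stated assumptions: `struct toy_system`, `TOY_ALL_CPU_BT`, `nmi_backtrace_all`, `stop_other_cpus`, `panic_old` and `panic_new` are hypothetical names, and the "backtrace" is reduced to counting the CPUs that would still respond.

```c
#include <stdbool.h>

#define NCPUS 4
#define TOY_ALL_CPU_BT 0x1u /* hypothetical flag bit, not the kernel's value */

struct toy_system {
	bool online[NCPUS];
};

/* Count the CPUs that would still produce a backtrace. */
static int nmi_backtrace_all(const struct toy_system *s)
{
	int traced = 0;

	for (int i = 0; i < NCPUS; i++)
		if (s->online[i])
			traced++;
	return traced;
}

/* Disable every CPU except the panicking one (CPU 0 in this model). */
static void stop_other_cpus(struct toy_system *s)
{
	for (int i = 1; i < NCPUS; i++)
		s->online[i] = false;
}

/* Old ordering: the flag is checked only after the other CPUs are stopped. */
static int panic_old(struct toy_system *s, unsigned int flags)
{
	stop_other_cpus(s);
	return (flags & TOY_ALL_CPU_BT) ? nmi_backtrace_all(s) : 0;
}

/* New ordering: the flag is checked before the other CPUs are stopped. */
static int panic_new(struct toy_system *s, unsigned int flags)
{
	int traced = (flags & TOY_ALL_CPU_BT) ? nmi_backtrace_all(s) : 0;

	stop_other_cpus(s);
	return traced;
}
```

In this model the old ordering only ever reports the panicking CPU, while the new ordering reports every CPU that was online when the panic began.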
  3. Mar 02, 2023
  4. Mar 01, 2023
    • capability: just use a 'u64' instead of a 'u32[2]' array · f122a08b
      Linus Torvalds authored
      
      Back in 2008 we extended the capability bits from 32 to 64, and we did
      it by extending the single 32-bit capability word from one word to an
      array of two words.  It was then obfuscated by hiding the "2" behind two
      macro expansions, with the reasoning being that maybe it gets extended
      further some day.
      
      That reasoning may have been valid at the time, but the last thing we
      want to do is to extend the capability set any more.  And the array of
      values not only causes source code oddities (with loops to deal with
      it), but also results in worse code generation.  It's a lose-lose
      situation.
      
      So just change the 'u32[2]' into a 'u64' and be done with it.
      
      We still have to deal with the fact that the user space interface is
      designed around an array of these 32-bit values, but that was the case
      before too, since the array layouts were different (i.e. user space
      doesn't use an array of 32-bit values for individual capability masks,
      but an array of 32-bit slices of multiple masks).
      
      So that marshalling of data is actually simplified too, even if it does
      remain somewhat obscure and odd.
      
      This was all triggered by my reaction to the new "cap_isidentical()"
      introduced recently.  By just using a saner data structure, it went from
      
      	unsigned __capi;
      	CAP_FOR_EACH_U32(__capi) {
      		if (a.cap[__capi] != b.cap[__capi])
      			return false;
      	}
      	return true;
      
      to just being
      
      	return a.val == b.val;
      
      instead.  Which is rather more obvious both to humans and to compilers.
      
      Cc: Mateusz Guzik <mjguzik@gmail.com>
      Cc: Casey Schaufler <casey@schaufler-ca.com>
      Cc: Serge Hallyn <serge@hallyn.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Paul Moore <paul@paul-moore.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
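The marshalling mentioned in the capability commit above, between 32-bit user-space slices and a single 64-bit mask, might look roughly like the sketch below. The type and function names (`struct user_cap_slice`, `struct kcap`, `kcap_from_slices`, `effective_to_user`, `kcap_isidentical`) are illustrative assumptions and do not reproduce the kernel's own definitions.

```c
#include <stdint.h>

/* One 32-bit slice of each mask, as user space lays it out (illustrative). */
struct user_cap_slice {
	uint32_t effective;
	uint32_t permitted;
	uint32_t inheritable;
};

/* Kernel-side mask after the change: a single 64-bit value. */
struct kcap {
	uint64_t val;
};

/* Combine the low slice (index 0) and the high slice (index 1) of a mask. */
static struct kcap kcap_from_slices(uint32_t low, uint32_t high)
{
	struct kcap c = { .val = ((uint64_t)high << 32) | low };

	return c;
}

/* Split the effective mask back into the two user-space slices. */
static void effective_to_user(struct kcap eff, struct user_cap_slice out[2])
{
	out[0].effective = (uint32_t)eff.val;
	out[1].effective = (uint32_t)(eff.val >> 32);
}

/* The comparison from the commit message becomes a single test. */
static int kcap_isidentical(struct kcap a, struct kcap b)
{
	return a.val == b.val;
}
```

The awkward part stays at the boundary: each user-space array element carries one slice of three different masks, so the conversion picks matching fields out of both elements.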
  5. Feb 24, 2023
  6. Feb 23, 2023
  7. Feb 22, 2023
  8. Feb 21, 2023
  9. Feb 20, 2023
  10. Feb 18, 2023
  11. Feb 17, 2023
    • dax/kmem: Fix leak of memory-hotplug resources · e686c325
      Dan Williams authored
      
      While experimenting with CXL region removal the following corruption of
      /proc/iomem appeared.
      
      Before:
      f010000000-f04fffffff : CXL Window 0
        f010000000-f02fffffff : region4
          f010000000-f02fffffff : dax4.0
            f010000000-f02fffffff : System RAM (kmem)
      
      After (modprobe -r cxl_test):
      f010000000-f02fffffff : **redacted binary garbage**
        f010000000-f02fffffff : System RAM (kmem)
      
      ...and testing further the same is visible with persistent memory
      assigned to kmem:
      
      Before:
      480000000-243fffffff : Persistent Memory
        480000000-57e1fffff : namespace3.0
        580000000-243fffffff : dax3.0
          580000000-243fffffff : System RAM (kmem)
      
      After (ndctl disable-region all):
      480000000-243fffffff : Persistent Memory
        580000000-243fffffff : ***redacted binary garbage***
          580000000-243fffffff : System RAM (kmem)
      
      The corrupted data is from a use-after-free of the "dax4.0" and "dax3.0"
      resources, and it also shows that the "System RAM (kmem)" resource is
      not being removed. The bug does not appear after "modprobe -r kmem", it
      requires the parent of "dax4.0" and "dax3.0" to be removed which
      re-parents the leaked "System RAM (kmem)" instances. Those in turn
      reference the freed resource as a parent.
      
      The first part of the fix is that release_mem_region_adjustable() needs
      to reliably delete the resource inserted by add_memory_driver_managed().
      That is thwarted by a check for IORESOURCE_SYSRAM that predates the
      dax/kmem driver, from commit:
      
      65c78784 ("kernel, resource: check for IORESOURCE_SYSRAM in release_mem_region_adjustable")
      
      That appears to be working around the behavior of HMM's
      "MEMORY_DEVICE_PUBLIC" facility that has since been deleted. With that
      check removed the "System RAM (kmem)" resource gets removed, but
      corruption still occurs occasionally because the "dax" resource is not
      reliably removed.
      
      The dax range information is freed before the device is unregistered, so
      the driver cannot reliably recall (another use-after-free) what it is
      meant to release. Lastly, if that use-after-free got lucky, the driver
      was covering up the leak of "System RAM (kmem)" due to its use of
      release_resource(), which detaches, but does not free, child resources.
      The switch to remove_resource() forces remove_memory() to be responsible
      for the deletion of the resource added by add_memory_driver_managed().
      
      Fixes: c2f3011e ("device-dax: add an allocation interface for device-dax instances")
      Cc: <stable@vger.kernel.org>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Reviewed-by: Vishal Verma <vishal.l.verma@intel.com>
      Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com>
      Reviewed-by: Dave Jiang <dave.jiang@intel.com>
      Link: https://lore.kernel.org/r/167653656244.3147810.5705900882794040229.stgit@dwillia2-xfh.jf.intel.com

      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
    • bpf: Fix global subprog context argument resolution logic · d384dce2
      Andrii Nakryiko authored
      
      A KPROBE program's user-facing context type is defined as the typedef
      bpf_user_pt_regs_t. This leads to a problem when trying to pass a
      kprobe/uprobe/usdt context argument into a global subprog, as the kernel
      always strips away mods and typedefs of the user-supplied type, but takes
      the expected type from bpf_ctx_convert as is, which causes a mismatch.
      
      Current way to work around this is to define a fake struct with the same
      name as expected typedef:
      
        struct bpf_user_pt_regs_t {};
      
        __noinline int my_global_subprog(struct bpf_user_pt_regs_t *ctx) { ... }
      
      This patch fixes the issue by resolving expected type, if it's not
      a struct. It still leaves the above work-around working for backwards
      compatibility.
      
      Fixes: 91cc1a99 ("bpf: Annotate context types")
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Stanislav Fomichev <sdf@google.com>
      Link: https://lore.kernel.org/bpf/20230216045954.3002473-2-andrii@kernel.org
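The name-matching behaviour described in the global subprog commit above can be approximated with a toy type table. This is a simplified model, not the kernel's BTF code: `struct toy_type`, `toy_btf`, `resolve_supplied`, `ctx_match_old` and `ctx_match_new` are hypothetical, and the table assumes for illustration that the typedef bpf_user_pt_regs_t refers to a struct named pt_regs.

```c
#include <string.h>

enum kind { K_STRUCT, K_TYPEDEF, K_CONST };

/* Toy type record: typedefs and modifiers refer to another type by index. */
struct toy_type {
	enum kind kind;
	const char *name; /* NULL for modifiers */
	int ref;          /* index of the referenced type, -1 for structs */
};

static const struct toy_type toy_btf[] = {
	{ K_STRUCT,  "pt_regs",            -1 }, /* 0 */
	{ K_TYPEDEF, "bpf_user_pt_regs_t",  0 }, /* 1: expected context type */
	{ K_CONST,   NULL,                  1 }, /* 2: const bpf_user_pt_regs_t */
	{ K_STRUCT,  "bpf_user_pt_regs_t", -1 }, /* 3: the work-around struct */
	{ K_STRUCT,  "unrelated",          -1 }, /* 4 */
};

/* The user-supplied type always has mods and typedefs stripped. */
static const struct toy_type *resolve_supplied(int id)
{
	while (toy_btf[id].kind != K_STRUCT)
		id = toy_btf[id].ref;
	return &toy_btf[id];
}

/* Old behaviour (sketch): the expected type is compared as is. */
static int ctx_match_old(int expected, int supplied)
{
	const struct toy_type *s = resolve_supplied(supplied);
	const struct toy_type *e = &toy_btf[expected];

	return e->name && strcmp(e->name, s->name) == 0;
}

/* New behaviour (sketch): on mismatch, resolve the expected type one
 * level at a time until a struct is reached, retrying the comparison. */
static int ctx_match_new(int expected, int supplied)
{
	const struct toy_type *s = resolve_supplied(supplied);

	for (;;) {
		const struct toy_type *e = &toy_btf[expected];

		if (e->name && strcmp(e->name, s->name) == 0)
			return 1;
		if (e->kind == K_STRUCT)
			return 0;
		expected = e->ref;
	}
}
```

Because the direct name comparison is tried first, the same-named fake struct still matches in this model, which mirrors the backwards-compatibility note in the commit message.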
  12. Feb 16, 2023
  13. Feb 15, 2023
    • bpf: BPF_ST with variable offset should preserve STACK_ZERO marks · 31ff2135
      Eduard Zingerman authored
      
      The BPF_STX instruction preserves STACK_ZERO marks for variable offset
      writes in situations like the one below:
      
        *(u64*)(r10 - 8) = 0   ; STACK_ZERO marks for fp[-8]
        r0 = random(-7, -1)    ; some random number in range of [-7, -1]
        r0 += r10              ; r0 is now a variable offset pointer to stack
        r1 = 0
        *(u8*)(r0) = r1        ; BPF_STX writing zero, STACK_ZERO mark for
                               ; fp[-8] is preserved
      
      This commit updates verifier.c:check_stack_write_var_off() to process
      BPF_ST in a similar manner, as in the following example:
      
        *(u64*)(r10 - 8) = 0   ; STACK_ZERO marks for fp[-8]
        r0 = random(-7, -1)    ; some random number in range of [-7, -1]
        r0 += r10              ; r0 is now a variable offset pointer to stack
        *(u8*)(r0) = 0         ; BPF_ST writing zero, STACK_ZERO mark for
                               ; fp[-8] is preserved
      
      Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
      Link: https://lore.kernel.org/r/20230214232030.1502829-4-eddyz87@gmail.com

      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
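The mark-preservation rule from the variable-offset commit above can be modelled byte by byte. The sketch below is a deliberately reduced model and not the verifier's implementation: `struct toy_stack`, `write_var_off` and the three-value `enum slot_type` are assumptions for illustration, and the real check handles more slot types and error cases.

```c
enum slot_type { STACK_INVALID, STACK_MISC, STACK_ZERO };

#define STACK_BYTES 16

/* Toy model: one mark per stack byte; index 0 is fp[-1], index 7 is fp[-8]. */
struct toy_stack {
	enum slot_type slot[STACK_BYTES];
};

/*
 * Variable-offset write of `size` bytes starting somewhere in
 * [min_off, max_off] (negative offsets from the frame pointer).
 * Every byte that might be touched is visited: a zero write keeps an
 * existing STACK_ZERO mark, anything else degrades the byte to STACK_MISC.
 */
static void write_var_off(struct toy_stack *st, int min_off, int max_off,
			  int size, int writing_zero)
{
	for (int off = min_off; off < max_off + size; off++) {
		enum slot_type *t = &st->slot[-off - 1];

		if (writing_zero && *t == STACK_ZERO)
			continue; /* mark preserved */
		*t = STACK_MISC;
	}
}
```

For the example in the commit message this means a one-byte zero write at an offset in [-7, -1] leaves all of fp[-8] marked STACK_ZERO, whether the zero comes from a register (BPF_STX) or an immediate (BPF_ST).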
    • bpf: track immediate values written to stack by BPF_ST instruction · ecdf985d
      Eduard Zingerman authored
      
      For aligned stack writes using the BPF_ST instruction, track stored
      values in the same way as BPF_STX is handled, e.g. make sure that the
      following commands produce similar verifier knowledge:

        fp[-8] = 42;             r1 = 42;
                                 fp[-8] = r1;
      
      This covers two cases:
       - non-null values written to stack are stored as spill of fake
         registers;
       - null values written to stack are stored as STACK_ZERO marks.
      
      Previously both cases above used STACK_MISC marks instead.
      
      Some verifier test cases relied on the old logic to obtain STACK_MISC
      marks for some stack values. These test cases are updated in the same
      commit to avoid failures during bisect.
      
      Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
      Link: https://lore.kernel.org/r/20230214232030.1502829-2-eddyz87@gmail.com

      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
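The two cases listed in the BPF_ST tracking commit above can be summarised in a small model of what is known about one aligned 8-byte stack slot. The names (`struct toy_slot`, `enum slot_kind`, `store_imm_old`, `store_imm_new`) are hypothetical and the model ignores everything except the slot's mark and constant value.

```c
#include <stdint.h>

enum slot_kind { SLOT_MISC, SLOT_ZERO, SLOT_SPILL };

/* Toy model of the knowledge kept about one aligned 8-byte stack slot. */
struct toy_slot {
	enum slot_kind kind;
	int64_t known_value; /* meaningful only for SLOT_SPILL */
};

/* Old behaviour (sketch): an immediate store forgot the stored value. */
static struct toy_slot store_imm_old(int64_t imm)
{
	(void)imm;
	return (struct toy_slot){ .kind = SLOT_MISC, .known_value = 0 };
}

/* New behaviour (sketch): zero becomes a ZERO mark, anything else is
 * recorded as if a register holding that constant had been spilled. */
static struct toy_slot store_imm_new(int64_t imm)
{
	if (imm == 0)
		return (struct toy_slot){ .kind = SLOT_ZERO, .known_value = 0 };
	return (struct toy_slot){ .kind = SLOT_SPILL, .known_value = imm };
}
```

In this model, a later load from the slot can recover the constant 42 after `store_imm_new(42)`, whereas the old path would only know that some initialized data is present.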
    • ring-buffer: Handle race between rb_move_tail and rb_check_pages · 8843e06f
      Mukesh Ojha authored
      There appears to be a data race between ring_buffer writing and the
      integrity check. That is, the RB_FLAG of head_page is being updated
      while, at the same time, RB_FLAG is cleared during the integrity check
      in rb_check_pages():
      
        rb_check_pages()            rb_handle_head_page():
        --------                    --------
        rb_head_page_deactivate()
                                    rb_head_page_set_normal()
        rb_head_page_activate()
      
      We do an integrity test of the list to check whether the list is
      corrupted, and it is still worth doing. So, let's refactor
      rb_check_pages() so that it no longer clears and sets the flag during
      the list sanity check.

      [1] and [2] are the reproducer test and the crash report, respectively.
      
      1:
      ``` read_trace.sh
        while true;
        do
          # the "trace" file is closed after read
          head -1 /sys/kernel/tracing/trace > /dev/null
        done
      ```
      ``` repro.sh
        sysctl -w kernel.panic_on_warn=1
        # the function tracer will write enough data into ring_buffer
        echo function > /sys/kernel/tracing/current_tracer
        ./read_trace.sh &
        ./read_trace.sh &
        ./read_trace.sh &
        ./read_trace.sh &
        ./read_trace.sh &
        ./read_trace.sh &
        ./read_trace.sh &
        ./read_trace.sh &
      ```
      
      2:
      ------------[ cut here ]------------
      WARNING: CPU: 9 PID: 62 at kernel/trace/ring_buffer.c:2653
      rb_move_tail+0x450/0x470
      Modules linked in:
      CPU: 9 PID: 62 Comm: ksoftirqd/9 Tainted: G        W          6.2.0-rc6+
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
      rel-1.15.0-0-g2dd4b9b3f840-prebuilt.qemu.org 04/01/2014
      RIP: 0010:rb_move_tail+0x450/0x470
      Code: ff ff 4c 89 c8 f0 4d 0f b1 02 48 89 c2 48 83 e2 fc 49 39 d0 75 24
      83 e0 03 83 f8 02 0f 84 e1 fb ff ff 48 8b 57 10 f0 ff 42 08 <0f> 0b 83
      f8 02 0f 84 ce fb ff ff e9 db
      RSP: 0018:ffffb5564089bd00 EFLAGS: 00000203
      RAX: 0000000000000000 RBX: ffff9db385a2bf81 RCX: ffffb5564089bd18
      RDX: ffff9db281110100 RSI: 0000000000000fe4 RDI: ffff9db380145400
      RBP: ffff9db385a2bf80 R08: ffff9db385a2bfc0 R09: ffff9db385a2bfc2
      R10: ffff9db385a6c000 R11: ffff9db385a2bf80 R12: 0000000000000000
      R13: 00000000000003e8 R14: ffff9db281110100 R15: ffffffffbb006108
      FS:  0000000000000000(0000) GS:ffff9db3bdcc0000(0000)
      knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00005602323024c8 CR3: 0000000022e0c000 CR4: 00000000000006e0
      Call Trace:
       <TASK>
       ring_buffer_lock_reserve+0x136/0x360
       ? __do_softirq+0x287/0x2df
       ? __pfx_rcu_softirq_qs+0x10/0x10
       trace_function+0x21/0x110
       ? __pfx_rcu_softirq_qs+0x10/0x10
       ? __do_softirq+0x287/0x2df
       function_trace_call+0xf6/0x120
       0xffffffffc038f097
       ? rcu_softirq_qs+0x5/0x140
       rcu_softirq_qs+0x5/0x140
       __do_softirq+0x287/0x2df
       run_ksoftirqd+0x2a/0x30
       smpboot_thread_fn+0x188/0x220
       ? __pfx_smpboot_thread_fn+0x10/0x10
       kthread+0xe7/0x110
       ? __pfx_kthread+0x10/0x10
       ret_from_fork+0x2c/0x50
       </TASK>
      ---[ end trace 0000000000000000 ]---
      
      [ crash report and test reproducer credit goes to Zheng Yejian]
      
      Link: https://lore.kernel.org/linux-trace-kernel/1676376403-16462-1-git-send-email-quic_mojha@quicinc.com

      Cc: <mhiramat@kernel.org>
      Cc: stable@vger.kernel.org
      Fixes: 1039221c ("ring-buffer: Do not disable recording when there is an iterator")
      Reported-by: Zheng Yejian <zhengyejian1@huawei.com>
      Signed-off-by: Mukesh Ojha <quic_mojha@quicinc.com>
      Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
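The refactoring direction described in the ring-buffer commit above, checking the list without clearing and re-setting flag bits, can be sketched in user space as follows. This is an assumption-laden model rather than the kernel's code: `struct node`, `FLAG_MASK`, `untag`, `tag` and `check_list` are hypothetical names, and the flags are simply stored in the two low bits of the `next` pointer.

```c
#include <stdint.h>

#define FLAG_MASK ((uintptr_t)3)

struct node {
	struct node *next; /* may carry flag bits in its two low bits */
	struct node *prev;
};

/* Strip flag bits from a possibly tagged pointer. */
static struct node *untag(struct node *p)
{
	return (struct node *)((uintptr_t)p & ~FLAG_MASK);
}

/* Set flag bits on a pointer (nodes must be at least 4-byte aligned). */
static struct node *tag(struct node *p, uintptr_t flag)
{
	return (struct node *)((uintptr_t)untag(p) | (flag & FLAG_MASK));
}

/*
 * Walk the circular list once and verify that next/prev agree, masking
 * the flag bits on every read instead of clearing and restoring them.
 * Returns 1 if consistent, 0 otherwise. The list is never written.
 */
static int check_list(struct node *head)
{
	struct node *n = head;

	do {
		struct node *next = untag(n->next);

		if (untag(next->prev) != n)
			return 0;
		if (untag(untag(n->prev)->next) != n)
			return 0;
		n = next;
	} while (n != head);
	return 1;
}
```

Since the checker only reads the pointers, a concurrent writer that updates the flag bits never observes them temporarily cleared, which is the window the race diagram in the commit message describes.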
    • sched/psi: Fix use-after-free in ep_remove_wait_queue() · c2dbe32d
      Munehisa Kamata authored
      
      If a non-root cgroup gets removed while a thread that registered a
      trigger is polling on a pressure file within the cgroup, the polling
      waitqueue gets freed in the following path:
      
       do_rmdir
         cgroup_rmdir
           kernfs_drain_open_files
             cgroup_file_release
               cgroup_pressure_release
                 psi_trigger_destroy
      
      However, the polling thread still has a reference to the pressure file and
      will access the freed waitqueue when the file is closed or upon exit:
      
       fput
         ep_eventpoll_release
           ep_free
             ep_remove_wait_queue
               remove_wait_queue
      
      This results in the use-after-free reported below.
      
      The fundamental problem here is that cgroup_file_release() (and
      consequently the waitqueue's lifetime) is not tied to the file's real
      lifetime. Using wake_up_pollfree() here might be less than ideal, but it
      is in line with the comment at commit 42288cb4 ("wait: add
      wake_up_pollfree()"), since the waitqueue's lifetime is not tied to the
      file's and can be considered another special case. While this would be
      fixable by somehow tying cgroup_file_release() to the fput(), it would
      require sizable refactoring at the cgroups or a higher layer, which
      might be more justifiable if we identify more cases like this.
      
        BUG: KASAN: use-after-free in _raw_spin_lock_irqsave+0x60/0xc0
        Write of size 4 at addr ffff88810e625328 by task a.out/4404
      
      	CPU: 19 PID: 4404 Comm: a.out Not tainted 6.2.0-rc6 #38
      	Hardware name: Amazon EC2 c5a.8xlarge/, BIOS 1.0 10/16/2017
      	Call Trace:
      	<TASK>
      	dump_stack_lvl+0x73/0xa0
      	print_report+0x16c/0x4e0
      	kasan_report+0xc3/0xf0
      	kasan_check_range+0x2d2/0x310
      	_raw_spin_lock_irqsave+0x60/0xc0
      	remove_wait_queue+0x1a/0xa0
      	ep_free+0x12c/0x170
      	ep_eventpoll_release+0x26/0x30
      	__fput+0x202/0x400
      	task_work_run+0x11d/0x170
      	do_exit+0x495/0x1130
      	do_group_exit+0x100/0x100
      	get_signal+0xd67/0xde0
      	arch_do_signal_or_restart+0x2a/0x2b0
      	exit_to_user_mode_prepare+0x94/0x100
      	syscall_exit_to_user_mode+0x20/0x40
      	do_syscall_64+0x52/0x90
      	entry_SYSCALL_64_after_hwframe+0x63/0xcd
      	</TASK>
      
       Allocated by task 4404:
      
      	kasan_set_track+0x3d/0x60
      	__kasan_kmalloc+0x85/0x90
      	psi_trigger_create+0x113/0x3e0
      	pressure_write+0x146/0x2e0
      	cgroup_file_write+0x11c/0x250
      	kernfs_fop_write_iter+0x186/0x220
      	vfs_write+0x3d8/0x5c0
      	ksys_write+0x90/0x110
      	do_syscall_64+0x43/0x90
      	entry_SYSCALL_64_after_hwframe+0x63/0xcd
      
       Freed by task 4407:
      
      	kasan_set_track+0x3d/0x60
      	kasan_save_free_info+0x27/0x40
      	____kasan_slab_free+0x11d/0x170
      	slab_free_freelist_hook+0x87/0x150
      	__kmem_cache_free+0xcb/0x180
      	psi_trigger_destroy+0x2e8/0x310
      	cgroup_file_release+0x4f/0xb0
      	kernfs_drain_open_files+0x165/0x1f0
      	kernfs_drain+0x162/0x1a0
      	__kernfs_remove+0x1fb/0x310
      	kernfs_remove_by_name_ns+0x95/0xe0
      	cgroup_addrm_files+0x67f/0x700
      	cgroup_destroy_locked+0x283/0x3c0
      	cgroup_rmdir+0x29/0x100
      	kernfs_iop_rmdir+0xd1/0x140
      	vfs_rmdir+0xfe/0x240
      	do_rmdir+0x13d/0x280
      	__x64_sys_rmdir+0x2c/0x30
      	do_syscall_64+0x43/0x90
      	entry_SYSCALL_64_after_hwframe+0x63/0xcd
      
      Fixes: 0e94682b ("psi: introduce psi monitor")
      Signed-off-by: Munehisa Kamata <kamatam@amazon.com>
      Signed-off-by: Mengchi Cheng <mengcc@amazon.com>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Acked-by: Suren Baghdasaryan <surenb@google.com>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Cc: stable@vger.kernel.org
      Link: https://lore.kernel.org/lkml/20230106224859.4123476-1-kamatam@amazon.com/
      Link: https://lore.kernel.org/r/20230214212705.4058045-1-kamatam@amazon.com