  1. Oct 03, 2022
    • init: kmsan: call KMSAN initialization routines · 3c206509
      Alexander Potapenko authored
      kmsan_init_shadow() scans the mappings created at boot time and creates
      metadata pages for those mappings.
      
      When the memblock allocator returns pages to pagealloc, we reserve 2/3 of
      those pages and use them as metadata for the remaining 1/3.  Once KMSAN
      starts, every page allocated by pagealloc has its associated shadow and
      origin pages.
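
      The 2/3 reservation described above can be pictured with a small
      arithmetic model.  This is only a sketch, not the kernel's
      implementation: the helper name split_for_kmsan and the grouping into
      consecutive triples are assumptions made for illustration.  The commit
      message states only that two thirds of the returned pages become
      metadata for the remaining third.

```python
def split_for_kmsan(pages):
    """Toy model (hypothetical helper): out of every three pages, keep one
    as a data page and use the other two as its shadow and origin pages."""
    usable = len(pages) - len(pages) % 3  # ignore a trailing partial group
    data, shadow, origin = [], [], []
    for i in range(0, usable, 3):
        data.append(pages[i])        # page handed on to the page allocator
        shadow.append(pages[i + 1])  # metadata: shadow for pages[i]
        origin.append(pages[i + 2])  # metadata: origin for pages[i]
    return data, shadow, origin


data, shadow, origin = split_for_kmsan(list(range(12)))
print(len(data), len(shadow), len(origin))
```

      In this model every data page ends up with exactly one shadow page and
      one origin page, which mirrors the invariant stated in the paragraph
      above.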
      
      kmsan_initialize() initializes the bookkeeping for init_task and enables
      KMSAN.
      
      Link: https://lkml.kernel.org/r/20220915150417.722975-18-glider@google.com
      
      
      Signed-off-by: Alexander Potapenko <glider@google.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Andrey Konovalov <andreyknvl@gmail.com>
      Cc: Andrey Konovalov <andreyknvl@google.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Eric Biggers <ebiggers@google.com>
      Cc: Eric Biggers <ebiggers@kernel.org>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Cc: Ilya Leoshkevich <iii@linux.ibm.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Marco Elver <elver@google.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michael S. Tsirkin <mst@redhat.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Petr Mladek <pmladek@suse.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Vegard Nossum <vegard.nossum@oracle.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  2. Sep 27, 2022
    • Maple Tree: add new data structure · 54a611b6
      Liam R. Howlett authored
      Patch series "Introducing the Maple Tree"
      
      The maple tree is an RCU-safe range based B-tree designed to use modern
      processor cache efficiently.  There are a number of places in the kernel
      that a non-overlapping range-based tree would be beneficial, especially
      one with a simple interface.  If you use an rbtree with other data
      structures to improve performance or an interval tree to track
      non-overlapping ranges, then this is for you.
      
      The tree has a branching factor of 10 for non-leaf nodes and 16 for leaf
      nodes.  With the increased branching factor, it is significantly shorter
      than the rbtree so it has fewer cache misses.  The removal of the linked
      list between subsequent entries also reduces the cache misses and the need
      to pull in the previous and next VMA during many tree alterations.
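
      The effect of the larger branching factor on tree height can be checked
      with a little arithmetic.  The sketch below is an idealized model, not
      maple tree code: it assumes every node is completely full and compares
      against a perfectly balanced binary tree, whereas a real rbtree can be
      up to about twice that tall.  The function names are hypothetical.

```python
def maple_levels(entries, leaf_fanout=16, node_fanout=10):
    """Levels needed when every node is full (best case for this model)."""
    levels, capacity = 1, leaf_fanout
    while capacity < entries:
        capacity *= node_fanout  # one more level of non-leaf nodes
        levels += 1
    return levels


def binary_levels(entries):
    """Minimum levels of a perfectly balanced binary tree, one entry per node."""
    return max(entries, 1).bit_length()


for n in (100, 1_000, 100_000):
    print(n, maple_levels(n), binary_levels(n))
```

      Under these assumptions, 1,000 entries fit in 3 levels with fanouts of
      16 and 10, against at least 10 levels for a binary tree, so a lookup
      touches far fewer nodes.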
      
      The first user that is covered in this patch set is the vm_area_struct,
      where three data structures are replaced by the maple tree: the augmented
      rbtree, the vma cache, and the linked list of VMAs in the mm_struct.  The
      long term goal is to reduce or remove the mmap_lock contention.
      
      The plan is to get to the point where we use the maple tree in RCU mode.
      Readers will not block for writers.  A single write operation will be
      allowed at a time.  A reader re-walks if stale data is encountered.  VMAs
      would be RCU enabled and this mode would be entered once multiple tasks
      are using the mm_struct.
      
      Davidlohr said
      
      : Yes I like the maple tree, and at this stage I don't think we can ask for
      : more from this series wrt the MM - albeit there seems to still be some
      : folks reporting breakage.  Fundamentally I see Liam's work to (re)move
      : complexity out of the MM (not to say that the actual maple tree is not
      : complex) by consolidating the three complementary data structures very
      : much worth it considering performance does not take a hit.  This was very
      : much a turn off with the range locking approach, which worst case scenario
      : incurred in prohibitive overhead.  Also as Liam and Matthew have
      : mentioned, RCU opens up a lot of nice performance opportunities, and in
      : addition academia[1] has shown outstanding scalability of address spaces
      : with the foundation of replacing the locked rbtree with RCU aware trees.
      
      Similar work has been discovered in the academic press:
      
      	https://pdos.csail.mit.edu/papers/rcuvm:asplos12.pdf
      
      Sheer coincidence.  We designed our tree with the intention of solving the
      hardest problem first.  Upon settling on a B-tree variant and a rough
      outline, we researched range-based B-trees and RCU B-trees and did find
      that article.  So it was nice to find reassurances that we were on the
      right path, but our design choice of using ranges made that paper unusable
      for us.
      
      This patch (of 70):
      
      There are additional BUG_ON() calls added within the tree, most of which
      are in debug code.  These will be replaced with WARN_ON() calls in the
      future.  There are also additional BUG_ON() calls within the code, which
      will also be reduced in number at a later date.  These exist to catch
      things such as out-of-range accesses, which would crash anyway.
      
      Link: https://lkml.kernel.org/r/20220906194824.2110408-1-Liam.Howlett@oracle.com
      Link: https://lkml.kernel.org/r/20220906194824.2110408-2-Liam.Howlett@oracle.com
      
      
      Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Tested-by: David Howells <dhowells@redhat.com>
      Tested-by: Sven Schnelle <svens@linux.ibm.com>
      Tested-by: Yu Zhao <yuzhao@google.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  3. Sep 12, 2022
    • page_ext: introduce boot parameter 'early_page_ext' · c4f20f14
      Li Zhe authored
      In commit 2f1ee091 ("Revert "mm: use early_pfn_to_nid in
      page_ext_init""), we call page_ext_init() after page_alloc_init_late() to
      avoid a panic.  It seems that we cannot track early page allocations in
      the current kernel even if the page structures have been initialized
      early.

      This patch introduces a new boot parameter 'early_page_ext' to resolve
      this problem.  If we pass it to the kernel, page_ext_init() is moved
      up and the feature 'deferred initialization of struct pages' is
      disabled, so that the page allocator is initialized early and the panic
      above is avoided.  This helps us catch early page allocations, which is
      especially useful when the amount of free memory right after boot
      differs from one kernel boot to another.
      
      [akpm@linux-foundation.org: fix section issue by removing __meminitdata]
      Link: https://lkml.kernel.org/r/20220825102714.669-1-lizhe.67@bytedance.com
      
      
      Signed-off-by: Li Zhe <lizhe.67@bytedance.com>
      Suggested-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Jason A. Donenfeld <Jason@zx2c4.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Mark-PK Tsai <mark-pk.tsai@mediatek.com>
      Cc: Masami Hiramatsu (Google) <mhiramat@kernel.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  4. Aug 23, 2022
    • arm64: fix rodata=full · 2e8cff0a
      Mark Rutland authored
      
      On arm64, "rodata=full" has been supported (but not documented) since
      commit:
      
        c55191e9 ("arm64: mm: apply r/o permissions of VM areas to its linear alias as well")
      
      As it's necessary to determine the rodata configuration early during
      boot, arm64 has an early_param() handler for this, whereas init/main.c
      has a __setup() handler which is run later.
      
      Unfortunately, this split meant that since commit:
      
        f9a40b08 ("init/main.c: return 1 from handled __setup() functions")
      
      ... passing "rodata=full" would result in a spurious warning from the
      __setup() handler (though RO permissions would be configured
      appropriately).
      
      Further, "rodata=full" has been broken since commit:
      
        0d6ea3ac ("lib/kstrtox.c: add "false"/"true" support to kstrtobool()")
      
      ... which caused strtobool() to parse "full" as false (in addition to
      many other values not documented for the "rodata=" kernel parameter).
      
      This patch fixes this breakage by:
      
      * Moving the core parameter parser to an early_param() handler, such
        that it is available early.
      
      * Adding an (optional) arch hook which arm64 can use to parse "full".
      
      * Updating the documentation to mention that "full" is valid for arm64.
      
      * Having the core parameter parser handle "on" and "off" explicitly,
        such that any undocumented values (e.g. typos such as "ful") are
        reported as errors rather than being silently accepted.
      
      Note that __setup() and early_param() have opposite conventions for
      their return values, where __setup() uses 1 to indicate a parameter was
      handled and early_param() uses 0 to indicate a parameter was handled.
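
      The parsing behaviour described above can be modelled in a few lines.
      This is a simplified user-space sketch, not the kernel code: the
      function names (legacy_strtobool, parse_rodata, arm64_hook,
      as_setup_return) are hypothetical, the first-character matching is a
      reduced model of the boolean parser, and returning -EINVAL for an
      unknown value is an assumption about how the error is surfaced.

```python
EINVAL = 22


def legacy_strtobool(s):
    """Reduced model of first-character boolean parsing: only the first
    character (or the first two, for 'on'/'off') is inspected."""
    if not s:
        return None
    c = s[0].lower()
    if c in ("y", "t", "1"):
        return True
    if c in ("n", "f", "0"):
        return False  # this is why "full" was read as false
    if c == "o" and len(s) > 1:
        if s[1].lower() == "n":
            return True
        if s[1].lower() == "f":
            return False
    return None


def arm64_hook(value):
    """Hypothetical arch hook: accepts the arch-specific value 'full'."""
    return value == "full"


def parse_rodata(value, arch_hook=None):
    """Strict parser sketch using the early_param() convention:
    status 0 means the parameter was handled."""
    if arch_hook is not None and arch_hook(value):
        return 0, "arch"
    if value == "on":
        return 0, True
    if value == "off":
        return 0, False
    return -EINVAL, None  # typos such as "ful" are rejected, not guessed


def as_setup_return(early_param_status):
    """Translate to the __setup() convention, where 1 means handled."""
    return 1 if early_param_status == 0 else 0


print(legacy_strtobool("full"))
print(parse_rodata("full", arm64_hook))
print(parse_rodata("ful", arm64_hook))
```

      The last helper makes the opposite return conventions explicit: a
      status of 0 from the early_param()-style parser maps to 1 in the
      __setup() convention.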
      
      Fixes: f9a40b08 ("init/main.c: return 1 from handled __setup() functions")
      Fixes: 0d6ea3ac ("lib/kstrtox.c: add "false"/"true" support to kstrtobool()")
      Signed-off-by: Mark Rutland <mark.rutland@arm.com>
      Cc: Andy Shevchenko <andy.shevchenko@gmail.com>
      Cc: Ard Biesheuvel <ardb@kernel.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Jagdish Gediya <jvgediya@linux.ibm.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Will Deacon <will@kernel.org>
      Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
      Link: https://lore.kernel.org/r/20220817154022.3974645-1-mark.rutland@arm.com
      
      
      Signed-off-by: Will Deacon <will@kernel.org>
  5. Aug 21, 2022
  6. Jul 27, 2022
  7. Jul 23, 2022
    • cgroup: Make !percpu threadgroup_rwsem operations optional · 6a010a49
      Tejun Heo authored
      
      3942a9bd ("locking, rcu, cgroup: Avoid synchronize_sched() in
      __cgroup_procs_write()") disabled percpu operations on threadgroup_rwsem
      because the implied synchronize_rcu() on write locking was pushing up the
      latencies too much for Android, which constantly moves processes between
      cgroups.

      This makes the hotter paths - fork and exit - slower as they're always
      forced into the slow path. There is no reason to force this on everyone,
      especially given that the more common static usage pattern can now completely
      avoid write-locking the rwsem. Write-locking is elided when turning on and
      off controllers on empty sub-trees and CLONE_INTO_CGROUP enables seeding a
      cgroup without grabbing the rwsem.
      
      Restore the default percpu operations and introduce the mount option
      "favordynmods" and config option CGROUP_FAVOR_DYNMODS for users who need
      lower latencies for the dynamic operations.
      
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Michal Koutný <mkoutny@suse.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Dmitry Shmidt <dimitrysh@google.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
  8. Jul 18, 2022
    • init: add "hostname" kernel parameter · 5a704629
      Dan Moulding authored
      The gethostname system call returns the hostname for the current machine. 
      However, the kernel has no mechanism to initially set the current
      machine's name in such a way as to guarantee that the first userspace
      process to call gethostname will receive a meaningful result.  It relies
      on some unspecified userspace process to first call sethostname before
      gethostname can produce a meaningful name.
      
      Traditionally the machine's hostname is set from userspace by the init
      system.  The init system, in turn, often relies on a configuration file
      (say, /etc/hostname) to provide the value that it will supply in the call
      to sethostname.  Consequently, the file system containing /etc/hostname
      usually must be available before the hostname will be set.  There may,
      however, be earlier userspace processes that could call gethostname before
      the file system containing /etc/hostname is mounted.  Such a process will
      get some other, likely meaningless, name from gethostname (such as
      "(none)", "localhost", or "darkstar").
      
      A real-world example where this can happen, and lead to undesirable
      results, is with mdadm.  When assembling arrays, mdadm distinguishes
      between "local" arrays and "foreign" arrays.  A local array is one that
      properly belongs to the current machine, and a foreign array is one that
      is (possibly temporarily) attached to the current machine, but properly
      belongs to some other machine.  To determine if an array is local or
      foreign, mdadm may compare the "homehost" recorded on the array with the
      current hostname.  If mdadm is run before the root file system is mounted,
      perhaps because the root file system itself resides on an md-raid array,
      then /etc/hostname isn't yet available and the init system will not yet
      have called sethostname, causing mdadm to incorrectly conclude that all of
      the local arrays are foreign.
      
      Solving this problem *could* be delegated to the init system.  It could be
      left up to the init system (including any init system that starts within
      an initramfs, if one is in use) to ensure that sethostname is called
      before any other userspace process could possibly call gethostname. 
      However, it may not always be obvious which processes could call
      gethostname (for example, udev itself might not call gethostname, but it
      could via udev rules invoke processes that do).  Additionally, the init
      system has to ensure that the hostname configuration value is stored in
      some place where it will be readily accessible during early boot. 
      Unfortunately, every init system will attempt to (or has already attempted
      to) solve this problem in a different, possibly incorrect, way.  This
      makes getting consistently working configurations harder for users.
      
      I believe it is better for the kernel to provide the means by which the
      hostname may be set early, rather than making this a problem for the init
      system to solve.  The option to set the hostname during early startup, via
      a kernel parameter, provides a simple, reliable way to solve this problem.
      It also could make system configuration easier for some embedded systems.
      
      [dmoulding@me.com: v2]
        Link: https://lkml.kernel.org/r/20220506060310.7495-2-dmoulding@me.com
      Link: https://lkml.kernel.org/r/20220505180651.22849-2-dmoulding@me.com
      
      
      Signed-off-by: Dan Moulding <dmoulding@me.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  9. Jul 15, 2022
  10. Jul 12, 2022
    • module: Move module's Kconfig items in kernel/module/ · 73b4fc92
      Christophe Leroy authored
      
      In init/Kconfig, the part dedicated to modules is quite large.
      
      Move it into a dedicated Kconfig in kernel/module/.
      
      MODULES_TREE_LOOKUP was outside of the 'if MODULES', but as it is
      only used when MODULES are set, move it in with everything else to
      avoid confusion.
      
      MODULE_SIG_FORMAT is left in init/Kconfig because this configuration
      item is not used in kernel/module/ but in kernel/ and can be
      selected independently from CONFIG_MODULES. It is for instance
      selected from security/integrity/ima/Kconfig.
      
      Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
      Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
  11. Jul 07, 2022
  12. Jul 02, 2022
  13. Jun 30, 2022
    • context_tracking: Split user tracking Kconfig · 24a9c541
      Frederic Weisbecker authored
      
      Context tracking is going to be used not only to track user transitions
      but also idle/IRQs/NMIs. The user tracking part will then become a
      separate feature. Prepare Kconfig for that.
      
      [ frederic: Apply Max Filippov feedback. ]
      
      Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Neeraj Upadhyay <quic_neeraju@quicinc.com>
      Cc: Uladzislau Rezki <uladzislau.rezki@sony.com>
      Cc: Joel Fernandes <joel@joelfernandes.org>
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Cc: Nicolas Saenz Julienne <nsaenz@kernel.org>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Xiongfeng Wang <wangxiongfeng2@huawei.com>
      Cc: Yu Liao <liaoyu15@huawei.com>
      Cc: Phil Auld <pauld@redhat.com>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Alex Belits <abelits@marvell.com>
      Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
      Reviewed-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
      Tested-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
  14. Jun 20, 2022
    • rcu-tasks: Add data structures for lightweight grace periods · 434c9eef
      Paul E. McKenney authored
      
      This commit adds fields to task_struct and to rcu_tasks_percpu that will
      be used to avoid the task-list scan for RCU Tasks Trace grace periods,
      and also initializes these fields.
      
      Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
      Cc: Neeraj Upadhyay <quic_neeraju@quicinc.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Andrii Nakryiko <andrii@kernel.org>
      Cc: Martin KaFai Lau <kafai@fb.com>
      Cc: KP Singh <kpsingh@kernel.org>
  15. Jun 09, 2022
    • gcc-12: disable '-Warray-bounds' universally for now · f0be87c4
      Linus Torvalds authored
      
      In commit 8b202ee2 ("s390: disable -Warray-bounds") the s390 people
      disabled the '-Warray-bounds' warning for gcc-12, because the new logic
      in gcc would cause warnings for their use of the S390_lowcore macro,
      which accesses absolute pointers.
      
      It turns out gcc-12 has many other issues in this area, so this takes
      that s390 warning disable logic, and turns it into a kernel build config
      entry instead.
      
      Part of the intent is that we can make this all much more targeted, and
      use this config flag to disable it in only particular configurations
      that cause problems, with the s390 case as an example:
      
              select GCC12_NO_ARRAY_BOUNDS
      
      and we could do that for other configuration cases that cause issues.
      
      Or we could possibly use the CONFIG_CC_NO_ARRAY_BOUNDS thing in a more
      targeted way, and disable the warning only for particular uses: again
      the s390 case as an example:
      
        KBUILD_CFLAGS_DECOMPRESSOR += $(if $(CONFIG_CC_NO_ARRAY_BOUNDS),-Wno-array-bounds)
      
      but this ends up just doing it globally in the top-level Makefile, since
      the current issues are spread fairly widely all over:
      
        KBUILD_CFLAGS-$(CONFIG_CC_NO_ARRAY_BOUNDS) += -Wno-array-bounds
      
      We'll try to limit this later, since the gcc-12 problems are rare enough
      that *much* of the kernel can be built with it without disabling this
      warning.
      
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Nathan Chancellor <nathan@kernel.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  16. May 27, 2022
  17. May 24, 2022
    • kbuild: link symbol CRCs at final link, removing CONFIG_MODULE_REL_CRCS · 7b453719
      Masahiro Yamada authored
      
      include/{linux,asm-generic}/export.h defines a weak symbol, __crc_*
      as a placeholder.
      
      Genksyms writes the version CRCs into the linker script, which will be
      used for filling the __crc_* symbols. The linker script format depends
      on CONFIG_MODULE_REL_CRCS. If it is enabled, __crc_* holds the offset
      to the reference of CRC.
      
      It is time to get rid of this complexity.
      
      Now that modpost parses text files (.*.cmd) to collect all the CRCs,
      it can generate C code that will be linked to the vmlinux or modules.
      
      Generate a new C file, .vmlinux.export.c, which contains the CRCs of
      symbols exported by vmlinux. It is compiled and linked to vmlinux in
      scripts/link-vmlinux.sh.
      
      Put the CRCs of symbols exported by modules into the existing *.mod.c
      files. No additional build step is needed for modules. As before,
      *.mod.c are compiled and linked to *.ko in scripts/Makefile.modfinal.
      
      No linker magic is used here. The new C implementation works in the
      same way, whether CONFIG_RELOCATABLE is enabled or not.
      CONFIG_MODULE_REL_CRCS is no longer needed.
      
      Previously, Kbuild invoked additional $(LD) to update the CRCs in
      objects, but this step is unneeded too.
      
      Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
      Tested-by: Nathan Chancellor <nathan@kernel.org>
      Tested-by: Nicolas Schier <nicolas@fjasle.eu>
      Reviewed-by: Nicolas Schier <nicolas@fjasle.eu>
      Tested-by: Sedat Dilek <sedat.dilek@gmail.com> # LLVM-14 (x86-64)
  18. May 19, 2022
  19. May 18, 2022
    • random: handle latent entropy and command line from random_init() · 2f14062b
      Jason A. Donenfeld authored
      
      Currently, start_kernel() adds latent entropy and the command line to
      the entropy pool *after* the RNG has been initialized, deferring when
      it's actually used by things like stack canaries until the next time
      the pool is seeded. This surely is not intended.
      
      Rather than splitting up which entropy gets added where and when between
      start_kernel() and random_init(), just do everything in random_init(),
      which should eliminate these kinds of bugs in the future.
      
      While we're at it, rename the awkwardly titled "rand_initialize()" to
      the more standard "random_init()" nomenclature.
      
      Reviewed-by: Dominik Brodowski <linux@dominikbrodowski.net>
      Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
  20. May 13, 2022
    • init: call time_init() before rand_initialize() · fe222a6c
      Jason A. Donenfeld authored
      
      Currently time_init() is called after rand_initialize(), but
      rand_initialize() makes use of the timer on various platforms, and
      sometimes this timer needs to be initialized by time_init() first. In
      order for random_get_entropy() to not return zero during early boot when
      it's potentially used as an entropy source, reverse the order of these
      two calls. The block doing random initialization was right before
      time_init() before, so changing the order shouldn't have any complicated
      effects.
      
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Stafford Horne <shorne@gmail.com>
      Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
    • mm/uffd: move USERFAULTFD configs into mm/ · 430529b5
      Peter Xu authored
      We used to have the USERFAULTFD configs stored in init/.  That made sense
      at first because that's the default place for storing syscall related
      configs.

      However, userfaultfd has evolved over the past few years and more
      config options were added.  They're no longer related to syscalls and
      are no longer suitable to keep in the init/ directory, because they're
      pure mm concepts.
      
      But it's not ideal either to keep the userfaultfd configs separate from
      each other.  Hence this patch moves the userfaultfd configs under init/ to
      be under mm/ so that we'll start to group all userfaultfd configs
      together.
      
      We do have quite a few examples of syscall related configs that are not
      put under init/Kconfig: FTRACE_SYSCALLS, SWAP, FILE_LOCKING,
      MEMFD_CREATE...  They all reside in the directory that best suits
      the concept.  So there seems to be no rule that syscall related
      CONFIG_* options must live under init/ only.
      
      Link: https://lkml.kernel.org/r/20220420144823.35277-1-peterx@redhat.com
      
      
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Axel Rasmussen <axelrasmussen@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  21. May 12, 2022
    • module: Introduce module unload taint tracking · 99bd9956
      Aaron Tomlin authored
      
      Currently, only the initial module that tainted the kernel is
      recorded e.g. when an out-of-tree module is loaded.
      
      The purpose of this patch is to allow the kernel to maintain a record of
      each unloaded module that taints the kernel. So, in addition to
      displaying a list of linked modules (see print_modules()) e.g. in the
      event of a detected bad page, unloaded modules that carried a taint or
      taints are displayed too. A tainted module unload count is maintained.
      
      The number of tracked modules is not fixed. This feature is disabled by
      default.
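
      The bookkeeping described above can be sketched as an unbounded list of
      records, one per tainted module that was unloaded, each carrying an
      unload count.  This is an illustrative model only: the class names, the
      field names and the output format are assumptions, not the kernel's
      data structures or its print format.

```python
from dataclasses import dataclass


@dataclass
class UnloadedTaint:
    name: str
    taints: int     # bitmask of taint flags the module carried
    count: int = 1  # how many times this (name, taints) pair was unloaded


class UnloadTaintLog:
    """Hypothetical tracker; the number of records is not capped."""

    def __init__(self):
        self.records = []

    def record_unload(self, name, taints):
        if taints == 0:
            return  # untainted modules are not tracked
        for rec in self.records:
            if rec.name == name and rec.taints == taints:
                rec.count += 1  # same module unloaded again
                return
        self.records.append(UnloadedTaint(name, taints))

    def format(self):
        # Illustrative format only: name(taint mask):unload count
        return " ".join(f"{r.name}({r.taints:#x}):{r.count}" for r in self.records)


log = UnloadTaintLog()
log.record_unload("foo", 0x1000)
log.record_unload("foo", 0x1000)
log.record_unload("bar", 0)
print(log.format())
```

      Keying the count on the (name, taints) pair keeps repeated load/unload
      cycles of the same module from growing the list.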
      
      Signed-off-by: Aaron Tomlin <atomlin@redhat.com>
      Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
  22. May 10, 2022
  23. May 07, 2022
    • init: Deal with the init process being a user mode process · 68d85f0a
      Eric W. Biederman authored
      It is silly for user_mode_thread to leave PF_KTHREAD set
      on the resulting task.  Update the init process so that
      it does not care if PF_KTHREAD is set or not.
      
      Ensure do_populate_rootfs flushes all delayed fput work by calling
      task_work_run.  In the rare instance that async_schedule_domain calls
      do_populate_rootfs synchronously, it is possible that do_populate_rootfs
      will be called directly from the init process, at which point fput
      will call "task_work_add(current, ..., TWA_RESUME)".  The files on the
      initramfs need to be completely put before we attempt to exec them
      (which is before the code enters userspace).  So call task_work_run
      just in case there are any pending fput operations.
      
      Link: https://lkml.kernel.org/r/20220506141512.516114-5-ebiederm@xmission.com
      
      
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
  24. May 06, 2022
    • kthread: Don't allocate kthread_struct for init and umh · 343f4c49
      Eric W. Biederman authored
      
      If kthread_is_per_cpu runs concurrently with free_kthread_struct the
      kthread_struct that was just freed may be read from.
      
      This bug was introduced by commit 40966e31 ("kthread: Ensure
      struct kthread is present for all kthreads"), when kthread_struct
      started to be allocated for all tasks that have PF_KTHREAD set.  This
      in turn required the kthread_struct to be freed in kernel_execve and
      violated the assumption that kthread_struct will have the same
      lifetime as the task.
      
      Looking a bit deeper this only applies to callers of kernel_execve
      which is just the init process and the user mode helper processes.
      These processes really don't want to be kernel threads but are for
      historical reasons, mostly because copy_thread does not know how to
      pass a kernel mode function to a process that has neither
      PF_KTHREAD nor PF_IO_WORKER set.
      
      Solve this by not allocating kthread_struct for the init process and
      the user mode helper processes.
      
      This is done by adding a kthread member to struct kernel_clone_args,
      setting kthread in fork_idle and kernel_thread, and adding
      user_mode_thread, which works like kernel_thread except that it does
      not set kthread.  In fork, the kthread_struct is only allocated if
      .kthread is set.
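
The shape of this change can be sketched in a userspace model. This is an illustration under stated assumptions, not the kernel implementation: the `*_model` types and `model_*` functions are hypothetical stand-ins for struct kernel_clone_args, the task, kernel_thread and user_mode_thread, and `calloc` replaces the kernel allocator.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical userspace model; these mimic, but are not, kernel types. */
struct kthread_model {
    int parked;
};

struct clone_args_model {
    bool kthread; /* models the new .kthread member */
};

struct task_model {
    struct kthread_model *kthread; /* stays NULL for init and umh tasks */
};

/* Models fork: allocate per-kthread state only when .kthread is set. */
struct task_model *model_fork(const struct clone_args_model *args)
{
    assert(args != NULL);
    struct task_model *t = calloc(1, sizeof(*t));
    if (t == NULL)
        return NULL;
    if (args->kthread) {
        t->kthread = calloc(1, sizeof(*t->kthread));
        if (t->kthread == NULL) {
            free(t);
            return NULL;
        }
    }
    return t;
}

/* Models kernel_thread: a real kernel thread, so .kthread is set. */
struct task_model *model_kernel_thread(void)
{
    struct clone_args_model args = { .kthread = true };
    return model_fork(&args);
}

/* Models user_mode_thread: same path, but .kthread is left unset. */
struct task_model *model_user_mode_thread(void)
{
    struct clone_args_model args = { .kthread = false };
    return model_fork(&args);
}

/* The per-kthread state now shares the task's lifetime. */
void model_free_task(struct task_model *t)
{
    if (t == NULL)
        return;
    free(t->kthread); /* free(NULL) is a no-op */
    free(t);
}
```

Because a task created through `model_user_mode_thread` never owns per-kthread state, nothing needs to be freed early when it later execs, which is the lifetime property the commit restores.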
      
      I have looked at kernel/kthread.c and since commit 40966e31
      ("kthread: Ensure struct kthread is present for all kthreads") there
      have been no assumptions added that to_kthread or __to_kthread will
      not return NULL.
      
      There are a few callers of to_kthread or __to_kthread that assume a
      non-NULL struct kthread pointer will be returned.  These functions are
      kthread_data(), kthread_parkme(), kthread_exit(), kthread(),
      kthread_park(), kthread_unpark(), kthread_stop().  All of those functions
      can reasonably be expected to be called when it is known that a task is a
      kthread, so that assumption seems reasonable.
      
      Cc: stable@vger.kernel.org
      Fixes: 40966e31 ("kthread: Ensure struct kthread is present for all kthreads")
      Reported-by: Максим Кутявин <maximkabox13@gmail.com>
      Link: https://lkml.kernel.org/r/20220506141512.516114-1-ebiederm@xmission.com
      
      
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      343f4c49
  25. Apr 29, 2022
  26. Apr 26, 2022
  27. Apr 13, 2022
  28. Apr 06, 2022
    • tangmeng's avatar
      kernel/do_mount_initrd: move real_root_dev sysctls to its own file · d772cc2c
      tangmeng authored
      
      kernel/sysctl.c is a kitchen sink where everyone leaves their dirty
      dishes, which makes it very difficult to maintain.
      
      To help with this maintenance let's start by moving sysctls to places
      where they actually belong.  The proc sysctl maintainers do not want to
      know what sysctl knobs you wish to add for your own piece of code, we
      just care about the core logic.
      
      All filesystem sysctls now get reviewed by fs folks.  This commit
      follows the fs commit and moves the real_root_dev sysctl to its own
      file, kernel/do_mount_initrd.c.
      
      Signed-off-by: tangmeng <tangmeng@uniontech.com>
      Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
      d772cc2c
    • Oliver Glitta's avatar
      mm/slub: use stackdepot to save stack trace in objects · 5cf909c5
      Oliver Glitta authored
      
      Many stack traces are similar so there are many similar arrays.
      Stackdepot saves each unique stack only once.
      
      Replace field addrs in struct track with depot_stack_handle_t handle.  Use
      stackdepot to save stack trace.
      
      The benefits are smaller memory overhead and the possibility to
      aggregate per-cache statistics in the following patch using the
      stackdepot handle instead of matching stacks manually.
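
The deduplication idea can be sketched with a small userspace model. This is an assumption-laden illustration, not the stackdepot implementation: all `model_*` names, the fixed-size table and the linear search are hypothetical simplifications chosen to keep the example short.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MODEL_MAX_STACKS 64
#define MODEL_MAX_DEPTH  16

/* Hypothetical handle type; 0 means "no stack saved". */
typedef unsigned int model_handle_t;

struct model_stack {
    size_t nr;
    unsigned long entries[MODEL_MAX_DEPTH];
};

static struct model_stack model_depot[MODEL_MAX_STACKS];
static size_t model_depot_used;

/* Save a trace once; an identical trace gets the same handle back. */
model_handle_t model_depot_save(const unsigned long *entries, size_t nr)
{
    assert(entries != NULL);
    if (nr == 0 || nr > MODEL_MAX_DEPTH)
        return 0;

    /* Linear search keeps the sketch short; it is a simplification. */
    for (size_t i = 0; i < model_depot_used; i++) {
        if (model_depot[i].nr == nr &&
            memcmp(model_depot[i].entries, entries,
                   nr * sizeof(*entries)) == 0)
            return (model_handle_t)(i + 1);
    }

    if (model_depot_used == MODEL_MAX_STACKS)
        return 0; /* table full: report "not saved" */

    model_depot[model_depot_used].nr = nr;
    memcpy(model_depot[model_depot_used].entries, entries,
           nr * sizeof(*entries));
    model_depot_used++;
    return (model_handle_t)model_depot_used;
}

/* Number of unique stacks stored so far. */
size_t model_depot_count(void)
{
    return model_depot_used;
}

/* Models the slimmer per-object record: a handle, not an address array. */
struct model_track {
    model_handle_t handle;
};
```

With this layout, two objects allocated from the same call path store equal handles, so per-cache statistics can be grouped by comparing one integer instead of comparing whole address arrays.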
      
      [ vbabka@suse.cz: rebase to 5.17-rc1 and adjust accordingly ]
      
      This was initially merged as commit 78869146 and reverted by commit
      ae14c63a due to several issues that should now be fixed.
      The problem of unconditional memory overhead by stackdepot has been
      addressed by commit 2dba5eb1 ("lib/stackdepot: allow optional init
      and stack_table allocation by kvmalloc()"), so the dependency on
      stackdepot will result in extra memory usage only when slab cache
      tracking is actually enabled, and not for all CONFIG_SLUB_DEBUG builds.
      The build failures on some architectures were also addressed, and the
      reported issue with xfs/433 test did not reproduce on 5.17-rc1 with this
      patch.
      
      Signed-off-by: Oliver Glitta <glittao@gmail.com>
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-and-tested-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      5cf909c5
  29. Mar 24, 2022
  30. Mar 11, 2022
  31. Feb 23, 2022