  Dec 27, 2022
    • sched: Introduce per-memory-map concurrency ID · af7f588d
      Mathieu Desnoyers authored
      This feature allows the scheduler to expose a per-memory map concurrency
      ID to user-space. This concurrency ID is within the possible cpus range,
      and is temporarily (and uniquely) assigned while threads are actively
      running within a memory map. If a memory map has fewer threads than
      cores, or is limited to running on a few cores concurrently through
      sched affinity or cgroup cpusets, the concurrency IDs will be values
      close to 0, thus allowing efficient use of user-space memory for
      per-cpu data structures.
      
      This feature is meant to be exposed by a new rseq thread area field.
      
      The primary purpose of this feature is to do the heavy-lifting needed
      by memory allocators to allow them to use per-cpu data structures
      efficiently in the following situations:
      
      - Single-threaded applications,
      - Multi-threaded applications on large systems (many cores) with limited
        cpu affinity mask,
      - Multi-threaded applications on large systems (many cores) with
        restricted cgroup cpuset per container.
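
      For illustration, here is a minimal user-space sketch of how an
      allocator might consume this ID. It assumes a kernel and uapi headers
      that expose the new mm_cid field in struct rseq, and a glibc (2.35+)
      that registers rseq and exports __rseq_offset; the per-slot pool it
      mentions is hypothetical.

        #include <linux/rseq.h>
        #include <stddef.h>
        #include <stdint.h>
        #include <stdio.h>

        extern ptrdiff_t __rseq_offset;        /* provided by glibc */

        static inline struct rseq *rseq_area(void)
        {
                return (struct rseq *)((char *)__builtin_thread_pointer() +
                                       __rseq_offset);
        }

        int main(void)
        {
                /*
                 * mm_cid is in [0, nr_possible_cpus) and densely packed per
                 * memory map; it is only stable while the thread keeps
                 * running (or within an rseq critical section).
                 */
                uint32_t cid = rseq_area()->mm_cid;

                printf("concurrency id of this thread: %u\n", cid);
                /* An allocator would use it roughly as:
                 *   freelist = &pool->per_slot[cid]; */
                return 0;
        }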
      
      One of the key concerns raised by scheduler maintainers is the overhead
      associated with additional spin locks or atomic operations in the
      scheduler fast path. This is why the following optimization is
      implemented.
      
      On context switch between threads belonging to the same memory map,
      transfer the mm_cid from prev to next without any atomic ops. This
      takes care of use cases involving frequent context switches between
      threads of the same memory map.
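
      A simplified sketch of that fast path follows. It is illustrative only
      (the real code in kernel/sched/core.c also handles cid allocation,
      recycling on migration and the cross-mm slow path) and assumes the
      task_struct::mm_cid field this patch introduces; the helper name is
      made up.

        #include <linux/sched.h>

        static inline void mm_cid_switch_sketch(struct task_struct *prev,
                                                struct task_struct *next)
        {
                if (prev->mm && prev->mm == next->mm) {
                        /*
                         * Same memory map: hand the concurrency ID over
                         * directly, no locks or atomics on this hot path.
                         */
                        next->mm_cid = prev->mm_cid;
                        prev->mm_cid = -1;
                        return;
                }
                /*
                 * Different memory maps take the slower, lock-protected
                 * cid allocation path (omitted here).
                 */
        }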
      
      Additional optimizations can be done if the spin locks added when
      context switching between threads belonging to different memory maps end
      up being a performance bottleneck. Those are left out of this patch
      though. A performance impact would have to be clearly demonstrated to
      justify the added complexity.
      
      The credit goes to Paul Turner (Google) for the original virtual cpu id
      idea. This feature is implemented based on discussions with Paul
      Turner and Peter Oskolkov (Google), but I took the liberty of
      implementing the scheduler fast-path optimizations and my own
      NUMA-awareness scheme. Rumor has it that Google has been running an
      rseq vcpu_id extension internally in production for a year. The
      tcmalloc source code indeed has comments hinting at a vcpu_id prototype
      extension to the rseq system call [1].
      
      The following benchmarks do not show any significant overhead added to
      the scheduler context switch by this feature:
      
      * perf bench sched messaging (process)
      
      Baseline:                    86.5±0.3 ms
      With mm_cid:                 86.7±2.6 ms
      
      * perf bench sched messaging (threaded)
      
      Baseline:                    84.3±3.0 ms
      With mm_cid:                 84.7±2.6 ms
      
      * hackbench (process)
      
      Baseline:                    82.9±2.7 ms
      With mm_cid:                 82.9±2.9 ms
      
      * hackbench (threaded)
      
      Baseline:                    85.2±2.6 ms
      With mm_cid:                 84.4±2.9 ms
      
      [1] https://github.com/google/tcmalloc/blob/master/tcmalloc/internal/linux_syscall_support.h#L26
      
      
      
      Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lore.kernel.org/r/20221122203932.231377-8-mathieu.desnoyers@efficios.com
  Nov 18, 2022
    • initramfs: remove unnecessary (void*) conversion · 4197530b
      XU pengfei authored
      Remove unnecessary void* type casting.
      
      Link: https://lkml.kernel.org/r/20221026080517.3221-1-xupengfei@nfschina.com
      
      
      Signed-off-by: XU pengfei <xupengfei@nfschina.com>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: David Disseldorp <ddiss@suse.de>
      Cc: "Eric W . Biederman" <ebiederm@xmission.com>
      Cc: Martin Wilck <mwilck@suse.com>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: wuchi <wuchi.zero@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • proc: give /proc/cmdline size · 941baf6f
      Alexey Dobriyan authored
      Most /proc files don't have a length (in the fstat sense).  This leads
      to inefficiencies when reading such files with APIs commonly found in
      modern programming languages.  They open the file, then fstat the
      descriptor, get st_size == 0, and either assume the file is empty or
      start reading without knowing the target size.
      
      cat(1) does OK because it uses a large enough buffer by default.  But
      naive programs copy-pasted from SO aren't as careful:
      
      	let mut f = std::fs::File::open("/proc/cmdline").unwrap();
      	let mut buf: Vec<u8> = Vec::new();
      	f.read_to_end(&mut buf).unwrap();
      
      will result in
      
      	openat(AT_FDCWD, "/proc/cmdline", O_RDONLY|O_CLOEXEC) = 3
      	statx(0, NULL, AT_STATX_SYNC_AS_STAT, STATX_ALL, NULL) = -1 EFAULT (Bad address)
      	statx(3, "", AT_STATX_SYNC_AS_STAT|AT_EMPTY_PATH, STATX_ALL, {stx_mask=STATX_BASIC_STATS|STATX_MNT_ID, stx_attributes=0, stx_mode=S_IFREG|0444, stx_size=0, ...}) = 0
      	lseek(3, 0, SEEK_CUR)                   = 0
      	read(3, "BOOT_IMAGE=(hd3,gpt2)/vmlinuz-5.", 32) = 32
      	read(3, "19.6-100.fc35.x86_64 root=/dev/m", 32) = 32
      	read(3, "apper/fedora_localhost--live-roo"..., 64) = 64
      	read(3, "ocalhost--live-swap rd.lvm.lv=fe"..., 128) = 116
      	read(3, "", 12)
      
      open/stat is OK, lseek looks silly but there are 3 unnecessary reads
      because Rust starts with 32 bytes per Vec<u8> and grows from there.
      
      In case of /proc/cmdline, the length is known precisely.
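
      For completeness, a hedged C sketch of the pattern this change
      enables: fstat() the file, size the buffer from st_size and read it in
      one go. On kernels without this change st_size is still 0, so a
      growing-buffer fallback would remain necessary.

        #include <fcntl.h>
        #include <stdio.h>
        #include <stdlib.h>
        #include <sys/stat.h>
        #include <unistd.h>

        int main(void)
        {
                struct stat st;
                int fd = open("/proc/cmdline", O_RDONLY | O_CLOEXEC);

                if (fd < 0 || fstat(fd, &st) < 0) {
                        perror("/proc/cmdline");
                        return 1;
                }

                char *buf = malloc(st.st_size + 1);    /* exact size, one read */
                if (!buf)
                        return 1;

                ssize_t n = read(fd, buf, st.st_size);
                if (n < 0) {
                        perror("read");
                        return 1;
                }
                buf[n] = '\0';
                fputs(buf, stdout);

                free(buf);
                close(fd);
                return 0;
        }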
      
      Make variables readonly while I'm at it.
      
      P.S.: I tried to scp /proc/cpuinfo today and got empty file
      	but this is separate story.
      
      Link: https://lkml.kernel.org/r/YxoywlbM73JJN3r+@localhost.localdomain
      
      
      Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  Nov 15, 2022
    • kallsyms: Add self-test facility · 30f3bb09
      Zhen Lei authored
      
      Add test cases for the basic functionality and performance of
      kallsyms_lookup_name(), kallsyms_on_each_symbol() and
      kallsyms_on_each_match_symbol(). The self-test also calculates the
      compression ratio of the kallsyms compression algorithm for the
      current symbol set.
      
      The basic functional test begins by testing a set of symbols whose
      address values are known. It then traverses all symbol addresses and
      finds the corresponding symbol name for each address. It is impossible
      to determine on its own whether each of these addresses is correct,
      but the three functions above can be used together with the addresses
      to cross-check one another. Because the traversal operation of
      kallsyms_on_each_symbol() is slow (only about 60 symbols can be tested
      per second), it is exercised on average only once every 128 symbols;
      the other two functions validate all symbols.
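
      As an illustration of the cross-check idea, here is a hedged sketch
      meant as built-in code (kallsyms_lookup_name() is not exported to
      modules); the helper name is made up, and the real tests live in the
      new kernel/kallsyms_selftest.c. It could be wired up from an early
      initcall.

        #include <linux/errno.h>
        #include <linux/init.h>
        #include <linux/kallsyms.h>
        #include <linux/kernel.h>
        #include <linux/printk.h>
        #include <linux/string.h>

        static int __init kallsyms_roundtrip_check(const char *name)
        {
                char buf[KSYM_SYMBOL_LEN];
                unsigned long addr = kallsyms_lookup_name(name);

                if (!addr)
                        return -ENOENT;        /* name -> address lookup failed */

                /*
                 * address -> name; core symbols come back as the bare name,
                 * module symbols get a " [module]" suffix appended.
                 */
                sprint_symbol_no_offset(buf, addr);
                if (strncmp(buf, name, strlen(name)))
                        return -EINVAL;        /* the two lookups disagree */

                pr_info("kallsyms roundtrip ok: %s -> %lx -> %s\n",
                        name, addr, buf);
                return 0;
        }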
      
      If the basic functions test is passed, print only performance test
      results. If the test fails, print error information, but do not perform
      subsequent performance tests.
      
      Start self-test automatically after system startup if
      CONFIG_KALLSYMS_SELFTEST=y.
      
      Example output (the 'kallsyms_selftest:' prefix is omitted):
       start
        ---------------------------------------------------------
       | nr_symbols | compressed size | original size | ratio(%) |
       |---------------------------------------------------------|
       |     107543 |       1357912   |      2407433  |  56.40   |
        ---------------------------------------------------------
       kallsyms_lookup_name() looked up 107543 symbols
       The time spent on each symbol is (ns): min=630, max=35295, avg=7353
       kallsyms_on_each_symbol() traverse all: 11782628 ns
       kallsyms_on_each_match_symbol() traverse all: 9261 ns
       finish
      
      Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
      Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
  Sep 29, 2022
    • random: split initialization into early step and later step · f6238499
      Jason A. Donenfeld authored
      
      The full RNG initialization relies on some timestamps, made possible
      with initialization functions like time_init() and timekeeping_init().
      However, these are only available rather late in initialization.
      Meanwhile, other things, such as memory allocator functions, make use of
      the RNG much earlier.
      
      So split RNG initialization into two phases. We can provide arch
      randomness very early on, and then later, after timekeeping and such are
      available, initialize the rest.
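
      As a hedged illustration of the constraint on the early phase (this is
      not the drivers/char/random.c code itself, and the probe function name
      is made up): before timekeeping is up, the only entropy worth relying
      on is architectural randomness via the arch_get_random_* helpers.
      Treat the exact includes and helper signatures as assumptions about
      the tree at the time of this series.

        #include <linux/init.h>
        #include <linux/kernel.h>
        #include <linux/printk.h>
        #include <asm/archrandom.h>

        static void __init early_seed_probe_sketch(void)
        {
                unsigned long seed[4];
                size_t got;

                /* Prefer dedicated seed instructions (e.g. RDSEED)... */
                got = arch_get_random_seed_longs(seed, ARRAY_SIZE(seed));
                if (!got)
                        /* ...falling back to plain RDRAND-style output. */
                        got = arch_get_random_longs(seed, ARRAY_SIZE(seed));

                if (got)
                        pr_info("early phase: %zu arch-random words\n", got);
                else
                        pr_info("early phase: no arch randomness, full init must wait\n");
        }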
      
      This ensures that, for example, slabs are properly randomized if RDRAND
      is available. Without this, CONFIG_SLAB_FREELIST_RANDOM=y loses a degree
      of its security, because its random seed is potentially deterministic,
      since it hasn't yet incorporated RDRAND. It also makes it possible to
      use a better seed in kfence, which currently relies on only the cycle
      counter.
      
      Another positive consequence is that on systems with RDRAND, running
      with CONFIG_WARN_ALL_UNSEEDED_RANDOM=y results in no warnings at all.
      
      One subtle side effect of this change is that on systems with no RDRAND,
      RDTSC is now only queried by random_init() once, committing the moment
      of the function call, instead of multiple times as before. This is
      intentional, as the multiple RDTSCs in a loop before weren't
      accomplishing very much, with jitter being better provided by
      try_to_generate_entropy(). Plus, filling blocks with RDTSC is still
      being done in extract_entropy(), which is necessarily called before
      random bytes are served anyway.
      
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Reviewed-by: Dominik Brodowski <linux@dominikbrodowski.net>
      Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
  Sep 27, 2022
    • Maple Tree: add new data structure · 54a611b6
      Liam R. Howlett authored
      Patch series "Introducing the Maple Tree"
      
      The maple tree is an RCU-safe range based B-tree designed to use modern
      processor cache efficiently.  There are a number of places in the kernel
      that a non-overlapping range-based tree would be beneficial, especially
      one with a simple interface.  If you use an rbtree with other data
      structures to improve performance or an interval tree to track
      non-overlapping ranges, then this is for you.
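
      For a feel of the interface, here is a hedged kernel-side sketch using
      the mtree_* calls from include/linux/maple_tree.h as they appear in
      this series, written as a trivial module; the demo names are made up
      and the details should be treated as illustrative rather than
      canonical.

        #include <linux/gfp.h>
        #include <linux/maple_tree.h>
        #include <linux/module.h>

        static DEFINE_MTREE(demo_mt);
        static int demo_value = 42;

        static int __init maple_demo_init(void)
        {
                void *entry;
                int ret;

                /* Associate one entry with the whole range [0x1000, 0x1fff]. */
                ret = mtree_store_range(&demo_mt, 0x1000, 0x1fff,
                                        &demo_value, GFP_KERNEL);
                if (ret)
                        return ret;

                /* Any index inside the range resolves to the stored entry. */
                entry = mtree_load(&demo_mt, 0x1234);
                if (entry)
                        pr_info("lookup 0x1234 -> %d\n", *(int *)entry);

                mtree_erase(&demo_mt, 0x1000);    /* drops the whole range */
                mtree_destroy(&demo_mt);
                return 0;
        }
        module_init(maple_demo_init);
        MODULE_LICENSE("GPL");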
      
      The tree has a branching factor of 10 for non-leaf nodes and 16 for leaf
      nodes.  With the increased branching factor, it is significantly shorter
      than the rbtree so it has fewer cache misses.  The removal of the linked
      list between subsequent entries also reduces the cache misses and the need
      to pull in the previous and next VMA during many tree alterations.
      
      The first user that is covered in this patch set is the vm_area_struct,
      where three data structures are replaced by the maple tree: the augmented
      rbtree, the vma cache, and the linked list of VMAs in the mm_struct.  The
      long term goal is to reduce or remove the mmap_lock contention.
      
      The plan is to get to the point where we use the maple tree in RCU mode.
      Readers will not block for writers.  A single write operation will be
      allowed at a time.  A reader re-walks if stale data is encountered.  VMAs
      would be RCU enabled and this mode would be entered once multiple tasks
      are using the mm_struct.
      
      Davidlohr said
      
      : Yes I like the maple tree, and at this stage I don't think we can ask for
      : more from this series wrt the MM - albeit there seems to still be some
      : folks reporting breakage.  Fundamentally I see Liam's work to (re)move
      : complexity out of the MM (not to say that the actual maple tree is not
      : complex) by consolidating the three complementary data structures very
      : much worth it considering performance does not take a hit.  This was very
      : much a turn-off with the range locking approach, which in the worst-case
      : scenario incurred prohibitive overhead.  Also as Liam and Matthew have
      : mentioned, RCU opens up a lot of nice performance opportunities, and in
      : addition academia[1] has shown outstanding scalability of address spaces
      : with the foundation of replacing the locked rbtree with RCU aware trees.
      
      Similar work has been discovered in the academic press:
      
      	https://pdos.csail.mit.edu/papers/rcuvm:asplos12.pdf
      
      Sheer coincidence.  We designed our tree with the intention of solving the
      hardest problem first.  Upon settling on a b-tree variant and a rough
      outline, we researched range-based b-trees and RCU b-trees and did find
      that article.  So it was nice to find reassurances that we were on the
      right path, but our design choice of using ranges made that paper unusable
      for us.
      
      This patch (of 70):
      
      The maple tree is an RCU-safe range based B-tree designed to use modern
      processor cache efficiently.  There are a number of places in the kernel
      that a non-overlapping range-based tree would be beneficial, especially
      one with a simple interface.  If you use an rbtree with other data
      structures to improve performance or an interval tree to track
      non-overlapping ranges, then this is for you.
      
      The tree has a branching factor of 10 for non-leaf nodes and 16 for leaf
      nodes.  With the increased branching factor, it is significantly shorter
      than the rbtree so it has fewer cache misses.  The removal of the linked
      list between subsequent entries also reduces the cache misses and the need
      to pull in the previous and next VMA during many tree alterations.
      
      The first user that is covered in this patch set is the vm_area_struct,
      where three data structures are replaced by the maple tree: the augmented
      rbtree, the vma cache, and the linked list of VMAs in the mm_struct.  The
      long term goal is to reduce or remove the mmap_lock contention.
      
      The plan is to get to the point where we use the maple tree in RCU mode.
      Readers will not block for writers.  A single write operation will be
      allowed at a time.  A reader re-walks if stale data is encountered.  VMAs
      would be RCU enabled and this mode would be entered once multiple tasks
      are using the mm_struct.
      
      There are additional BUG_ON() calls added within the tree, most of which
      are in debug code.  These will be replaced with WARN_ON() calls in the
      future.  There are also additional BUG_ON() calls within the code which
      will be reduced in number at a later date.  These exist to catch things
      such as out-of-range accesses which would crash anyway.
      
      Link: https://lkml.kernel.org/r/20220906194824.2110408-1-Liam.Howlett@oracle.com
      Link: https://lkml.kernel.org/r/20220906194824.2110408-2-Liam.Howlett@oracle.com
      
      
      Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Tested-by: David Howells <dhowells@redhat.com>
      Tested-by: Sven Schnelle <svens@linux.ibm.com>
      Tested-by: Yu Zhao <yuzhao@google.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
  Aug 23, 2022
    • arm64: fix rodata=full · 2e8cff0a
      Mark Rutland authored
      
      On arm64, "rodata=full" has been supported (but not documented) since
      commit:
      commit:
      
        c55191e9 ("arm64: mm: apply r/o permissions of VM areas to its linear alias as well")
      
      As it's necessary to determine the rodata configuration early during
      boot, arm64 has an early_param() handler for this, whereas init/main.c
      has a __setup() handler which is run later.
      
      Unfortunately, this split meant that since commit:
      
        f9a40b08 ("init/main.c: return 1 from handled __setup() functions")
      
      ... passing "rodata=full" would result in a spurious warning from the
      __setup() handler (though RO permissions would be configured
      appropriately).
      
      Further, "rodata=full" has been broken since commit:
      
        0d6ea3ac ("lib/kstrtox.c: add "false"/"true" support to kstrtobool()")
      
      ... which caused strtobool() to parse "full" as false (in addition to
      many other values not documented for the "rodata=" kernel parameter).
      
      This patch fixes this breakage by:
      
      * Moving the core parameter parser to an early_param(), such that it
        is available early.
      
      * Adding an (optional) arch hook which arm64 can use to parse "full".
      
      * Updating the documentation to mention that "full" is valid for arm64.
      
      * Having the core parameter parser handle "on" and "off" explicitly,
        such that any undocumented values (e.g. typos such as "ful") are
        reported as errors rather than being silently accepted.
      
      Note that __setup() and early_param() have opposite conventions for
      their return values, where __setup() uses 1 to indicate a parameter was
      handled and early_param() uses 0 to indicate a parameter was handled.
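
      A hedged sketch of the resulting parser shape follows. The real handler
      is the rodata early_param handler in init/main.c; the arch hook name
      below follows this patch as I read it, and the handler name is made up,
      so treat both as illustrative.

        #include <linux/init.h>
        #include <linux/printk.h>
        #include <linux/string.h>
        #include <linux/types.h>

        /* Weak default returns false; arm64 overrides it to accept "full". */
        bool arch_parse_debug_rodata(char *arg);

        static bool rodata_enabled = true;

        static int __init rodata_parse_sketch(char *str)
        {
                if (arch_parse_debug_rodata(str))
                        return 0;                /* the arch claimed the value */

                if (str && !strcmp(str, "on"))
                        rodata_enabled = true;
                else if (str && !strcmp(str, "off"))
                        rodata_enabled = false;
                else
                        pr_warn("Invalid rodata= value: '%s'\n", str);

                return 0;        /* early_param(): 0 == handled; __setup() uses 1 */
        }
        early_param("rodata", rodata_parse_sketch);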
      
      Fixes: f9a40b08 ("init/main.c: return 1 from handled __setup() functions")
      Fixes: 0d6ea3ac ("lib/kstrtox.c: add "false"/"true" support to kstrtobool()")
      Signed-off-by: Mark Rutland <mark.rutland@arm.com>
      Cc: Andy Shevchenko <andy.shevchenko@gmail.com>
      Cc: Ard Biesheuvel <ardb@kernel.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Jagdish Gediya <jvgediya@linux.ibm.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Will Deacon <will@kernel.org>
      Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
      Link: https://lore.kernel.org/r/20220817154022.3974645-1-mark.rutland@arm.com
      
      
      Signed-off-by: Will Deacon <will@kernel.org>