  1. Sep 12, 2022
  2. Jan 22, 2022
  3. Nov 20, 2021
    • ipc: WARN if trying to remove ipc object which is absent · 126e8bee
      Alexander Mikhalitsyn authored
      Patch series "shm: shm_rmid_forced feature fixes".
      
      Some time ago I hit a kernel crash after a CRIU restore procedure.
      Fortunately, since it was a CRIU restore, I had the dump files and could
      repeat the restore many times, so the crash reproduced easily.  After
      some investigation I constructed a minimal reproducer.  It turned out to
      be a use-after-free that happens only with sysctl
      kernel.shm_rmid_forced = 1.
      
      The core of the problem is that exit_shm() does not handle shp object
      destruction when task->sysvshm.shm_clist contains items from
      different IPC namespaces.  In most cases this list contains items
      from only one IPC namespace.
      
      How can this list contain objects from different namespaces?  The
      exit_shm() function is designed to clean up this list whenever a
      process leaves its IPC namespace.  But we made a mistake a long time ago
      and did not add an exit_shm() call to the setns() syscall path.
      
      The first idea was simply to add this call to the setns() syscall, but
      that obviously changes the semantics of setns() and is a
      userspace-visible change.  So I gave up on this idea.
      
      The first real attempt to address the issue was simply to skip the forced
      destroy when an shp object from a different IPC namespace than the
      current task's is encountered [1].  But that was not the best idea,
      because task->sysvshm.shm_clist was protected by an rwsem that belongs
      to the current task's IPC namespace, which means list corruption may
      occur.
      
      The second approach was to extend exit_shm() to properly handle shp
      objects from different IPC namespaces [2].  This is a really non-trivial
      thing; I put a lot of effort into it, but did not believe it was possible
      to make it fully safe, clean and clear.
      
      Thanks to Manfred Spraul's efforts, an elegant solution was designed.
      Thanks a lot, Manfred!
      
      Eric also suggested a way to address the issue in ("[RFC][PATCH] shm:
      In shm_exit destroy all created and never attached segments").  Eric's
      idea was to maintain a list of shm_clists, one per IPC namespace, using
      lock-less lists.  But there were some extra memory-consumption
      concerns.
      
      An alternative solution, suggested by me, was implemented in
      ("shm: reset shm_clist on setns but omit forced shm destroy").  The idea
      is pretty simple: we add an exit_shm() call to setns() but DO NOT
      destroy shm segments even if sysctl kernel.shm_rmid_forced = 1; we just
      clean up the task->sysvshm.shm_clist list.
      
      This changes the semantics of the setns() syscall a little, but compared
      with the "naive" solution of just adding exit_shm() without any special
      exclusions, it looks like the safer option.
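
      A minimal sketch of that cleanup, assuming an illustrative helper name
      and simplified locking (the final patch differs in detail):

	static void shm_clist_reset_on_setns(struct task_struct *task)
	{
		struct shmid_kernel *shp, *n;

		task_lock(task);
		/* Drop the tracking entries, but do NOT destroy the segments,
		 * even with kernel.shm_rmid_forced = 1. */
		list_for_each_entry_safe(shp, n, &task->sysvshm.shm_clist,
					 shm_clist)
			list_del_init(&shp->shm_clist);
		task_unlock(task);
	}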
      
      [1] https://lkml.org/lkml/2021/7/6/1108
      [2] https://lkml.org/lkml/2021/7/14/736
      
      This patch (of 2):
      
      Let's produce a warning if we are trying to remove a non-existent IPC
      object from the IPC namespace kht/idr structures.
      
      This allows us to catch possible bugs where the ipc_rmid() function is
      called with inconsistent struct ipc_ids* and struct kern_ipc_perm*
      arguments.
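
      A hedged sketch of the kind of check added (simplified, not the literal
      diff):

	void ipc_rmid(struct ipc_ids *ids, struct kern_ipc_perm *ipcp)
	{
		int idx = ipcid_to_idx(ipcp->id);

		/* Removing an id that is not registered in the idr indicates
		 * inconsistent (ids, ipcp) arguments -- warn rather than fail
		 * silently. */
		WARN_ON_ONCE(idr_remove(&ids->ipcs_idr, idx) != ipcp);

		/* ... remainder of the removal unchanged ... */
	}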
      
      Link: https://lkml.kernel.org/r/20211027224348.611025-1-alexander.mikhalitsyn@virtuozzo.com
      Link: https://lkml.kernel.org/r/20211027224348.611025-2-alexander.mikhalitsyn@virtuozzo.com
      
      
      Co-developed-by: Manfred Spraul <manfred@colorfullife.com>
      Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
      Signed-off-by: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Greg KH <gregkh@linuxfoundation.org>
      Cc: Andrei Vagin <avagin@gmail.com>
      Cc: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
      Cc: Vasily Averin <vvs@virtuozzo.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      126e8bee
  4. Sep 08, 2021
    • ipc: replace costly bailout check in sysvipc_find_ipc() · 20401d10
      Rafael Aquini authored
      sysvipc_find_ipc() was left with a costly way to check whether the offset
      position fed to it is bigger than the total number of IPC IDs in use.  So
      much so that the time it takes to iterate over the /proc/sysvipc/* files
      grows roughly quadratically for a custom benchmark that creates "N" SYSV
      shm segments and then times the read of /proc/sysvipc/shm (milliseconds):
      
          12 msecs to read   1024 segs from /proc/sysvipc/shm
          18 msecs to read   2048 segs from /proc/sysvipc/shm
          65 msecs to read   4096 segs from /proc/sysvipc/shm
         325 msecs to read   8192 segs from /proc/sysvipc/shm
        1303 msecs to read  16384 segs from /proc/sysvipc/shm
        5182 msecs to read  32768 segs from /proc/sysvipc/shm
      
      The root problem lies in the loop that computes the total number of ids
      in use in order to check whether the "pos" fed to sysvipc_find_ipc() has
      grown bigger than "ids->in_use".  That is quite an inefficient way to get
      to the maximum index in the id lookup table, especially when that value
      is already provided by struct ipc_ids.max_idx.
      
      This patch follows up on the optimization introduced via commit
      15df03c8 ("sysvipc: make get_maxid O(1) again") and gets rid of the
      aforementioned costly loop, replacing it with a simpler check based on
      the value returned by ipc_get_maxidx(), which allows for a smooth linear
      increase in time for the same custom benchmark:
      
           2 msecs to read   1024 segs from /proc/sysvipc/shm
           2 msecs to read   2048 segs from /proc/sysvipc/shm
           4 msecs to read   4096 segs from /proc/sysvipc/shm
           9 msecs to read   8192 segs from /proc/sysvipc/shm
          19 msecs to read  16384 segs from /proc/sysvipc/shm
          39 msecs to read  32768 segs from /proc/sysvipc/shm
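
      A hedged sketch of the simplified check (illustrative; the real function
      also handles the object locking and new_pos bookkeeping):

	static struct kern_ipc_perm *sysvipc_find_ipc(struct ipc_ids *ids,
						      loff_t pos)
	{
		int max_idx = ipc_get_maxidx(ids);
		struct kern_ipc_perm *ipc;

		/* O(1) bailout instead of counting every id in use: */
		if (max_idx == -1 || pos > max_idx)
			return NULL;

		for (; pos <= max_idx; pos++) {
			ipc = idr_find(&ids->ipcs_idr, pos);
			if (ipc != NULL)
				return ipc;
		}
		return NULL;
	}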
      
      Link: https://lkml.kernel.org/r/20210809203554.1562989-1-aquini@redhat.com
      
      
      Signed-off-by: Rafael Aquini <aquini@redhat.com>
      Acked-by: Davidlohr Bueso <dbueso@suse.de>
      Acked-by: Manfred Spraul <manfred@colorfullife.com>
      Cc: Waiman Long <llong@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      20401d10
  5. Jul 01, 2021
    • ipc/util.c: use binary search for max_idx · b869d5be
      Manfred Spraul authored
      If semctl(), msgctl() or shmctl() is called with IPC_INFO, SEM_INFO,
      MSG_INFO or SHM_INFO, then the return value is the highest used index in
      the kernel's internal array recording information about all SysV objects
      of the requested type for the current namespace.  (This information can
      be used with repeated ..._STAT or ..._STAT_ANY operations to obtain
      information about all SysV objects on the system.)
      
      There is a cache for this value.  But when the cache needs to be updated,
      the highest used index is determined by looping over all possible
      values.  With the introduction of IPCMNI_EXTEND_SHIFT, this could be a
      loop over 16 million entries.  And due to /proc/sys/kernel/*next_id, the
      index values do not need to be consecutive.
      
      With <write 16000000 to msg_next_id> followed by msgget() and
      msgctl(,IPC_RMID) in a loop, I have observed a performance increase of
      around a factor of 13000.
      
      As there is no get_last() function for idr structures, implement a
      "get_last()" using a binary search.

      As far as I can see, ipc is the only user that needs get_last(), so it is
      implemented in ipc/util.c rather than in a central location.
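
      A hedged sketch of such a binary search over the idr, using
      idr_get_next() as the probe (illustrative; assumes at least one entry is
      allocated):

	static int ipc_search_maxidx(struct idr *idr, int limit)
	{
		int lo = 0, hi = limit;

		while (lo < hi) {
			int mid = lo + (hi - lo + 1) / 2;
			int probe = mid;

			if (idr_get_next(idr, &probe))
				lo = mid;	/* an entry exists at idx >= mid */
			else
				hi = mid - 1;	/* nothing at or above mid */
		}
		return lo;			/* highest used index */
	}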
      
      [akpm@linux-foundation.org: tweak comment, fix typo]
      
      Link: https://lkml.kernel.org/r/20210425075208.11777-2-manfred@colorfullife.com
      
      
      Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
      Acked-by: Davidlohr Bueso <dbueso@suse.de>
      Cc: <1vier1@web.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b869d5be
  6. May 14, 2020
    • ipc/util.c: sysvipc_find_ipc() incorrectly updates position index · 5e698222
      Vasily Averin authored
      
      Commit 89163f93 ("ipc/util.c: sysvipc_find_ipc() should increase
      position index") is causing this bug (seen on 5.6.8):
      
         # ipcs -q
      
         ------ Message Queues --------
         key        msqid      owner      perms      used-bytes   messages
      
         # ipcmk -Q
         Message queue id: 0
         # ipcs -q
      
         ------ Message Queues --------
         key        msqid      owner      perms      used-bytes   messages
         0x82db8127 0          root       644        0            0
      
         # ipcmk -Q
         Message queue id: 1
         # ipcs -q
      
         ------ Message Queues --------
         key        msqid      owner      perms      used-bytes   messages
         0x82db8127 0          root       644        0            0
         0x76d1fb2a 1          root       644        0            0
      
         # ipcrm -q 0
         # ipcs -q
      
         ------ Message Queues --------
         key        msqid      owner      perms      used-bytes   messages
         0x76d1fb2a 1          root       644        0            0
         0x76d1fb2a 1          root       644        0            0
      
         # ipcmk -Q
         Message queue id: 2
         # ipcrm -q 2
         # ipcs -q
      
         ------ Message Queues --------
         key        msqid      owner      perms      used-bytes   messages
         0x76d1fb2a 1          root       644        0            0
         0x76d1fb2a 1          root       644        0            0
      
         # ipcmk -Q
         Message queue id: 3
         # ipcrm -q 1
         # ipcs -q
      
         ------ Message Queues --------
         key        msqid      owner      perms      used-bytes   messages
         0x7c982867 3          root       644        0            0
         0x7c982867 3          root       644        0            0
         0x7c982867 3          root       644        0            0
         0x7c982867 3          root       644        0            0
      
      Whenever an IPC item with a low id is deleted, the items with higher ids
      are duplicated, as if filling a hole.

      new_pos should jump over the hole of unused ids; pos can be updated
      inside the "for" loop.
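
      A hedged sketch of the corrected tail of sysvipc_find_ipc() (simplified;
      the upper-bound name is illustrative):

	/* Advance pos itself while skipping the hole of unused ids, so the
	 * position handed back to the caller reflects the entry that was
	 * actually found: */
	for (; pos < ipc_mni; pos++) {
		ipc = idr_find(&ids->ipcs_idr, pos);
		if (ipc != NULL)
			break;
	}
	*new_pos = pos + 1;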
      
      Fixes: 89163f93 ("ipc/util.c: sysvipc_find_ipc() should increase position index")
      Reported-by: Andreas Schwab <schwab@suse.de>
      Reported-by: Randy Dunlap <rdunlap@infradead.org>
      Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Waiman Long <longman@redhat.com>
      Cc: NeilBrown <neilb@suse.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Peter Oberparleiter <oberpar@linux.ibm.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Cc: <stable@vger.kernel.org>
      Link: http://lkml.kernel.org/r/4921fe9b-9385-a2b4-1dc4-1099be6d2e39@virtuozzo.com
      
      
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5e698222
  7. Apr 11, 2020
  8. Apr 07, 2020
    • proc: faster open/read/close with "permanent" files · d919b33d
      Alexey Dobriyan authored
      
      Now that "struct proc_ops" exists, we can start putting there things that
      could not fly with the VFS "struct file_operations"...

      Most of the fs/proc/inode.c file is dedicated to making
      open/read/.../close reliable in the event of disappearing /proc entries,
      which usually happens when a module is removed.  Files like
      /proc/cpuinfo, which never disappear, simply do not need such protection.
      
      Save 2 atomic ops, 1 allocation, 1 free per open/read/close sequence for such
      "permanent" files.
      
      Enable "permanent" flag for
      
      	/proc/cpuinfo
      	/proc/kmsg
      	/proc/modules
      	/proc/slabinfo
      	/proc/stat
      	/proc/sysvipc/*
      	/proc/swaps
      
      More will come once I figure out a foolproof way to prevent module
      authors from marking their stuff "permanent" for performance reasons
      when it is not.
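
      Roughly, marking an entry permanent comes down to setting a flag in its
      proc_ops; a hedged example (the instance shown is illustrative):

	static const struct proc_ops cpuinfo_proc_ops = {
		.proc_flags	= PROC_ENTRY_PERMANENT,
		.proc_open	= cpuinfo_open,
		.proc_read	= seq_read,
		.proc_lseek	= seq_lseek,
		.proc_release	= seq_release,
	};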
      
      This should help with scalability: benchmark is "read /proc/cpuinfo R times
      by N threads scattered over the system".
      
      	N	R	t, s (before)	t, s (after)
      	-----------------------------------------------------
      	64	4096	1.582458	1.530502	-3.2%
      	256	4096	6.371926	6.125168	-3.9%
      	1024	4096	25.64888	24.47528	-4.6%
      
      Benchmark source:
      
      #include <chrono>
      #include <iostream>
      #include <thread>
      #include <vector>
      
      #include <sys/types.h>
      #include <sys/stat.h>
      #include <fcntl.h>
      #include <unistd.h>
      
      const int NR_CPUS = sysconf(_SC_NPROCESSORS_ONLN);
      int N;
      const char *filename;
      int R;
      
      int xxx = 0;
      
      int glue(int n)
      {
      	cpu_set_t m;
      	CPU_ZERO(&m);
      	CPU_SET(n, &m);
      	return sched_setaffinity(0, sizeof(cpu_set_t), &m);
      }
      
      void f(int n)
      {
      	glue(n % NR_CPUS);
      
      	while (*(volatile int *)&xxx == 0) {
      	}
      
      	for (int i = 0; i < R; i++) {
      		int fd = open(filename, O_RDONLY);
      		char buf[4096];
      		ssize_t rv = read(fd, buf, sizeof(buf));
      		asm volatile ("" :: "g" (rv));
      		close(fd);
      	}
      }
      
      int main(int argc, char *argv[])
      {
      	if (argc < 4) {
       		std::cerr << "usage: " << argv[0] << ' ' << "N /proc/filename R\n";
      		return 1;
      	}
      
      	N = atoi(argv[1]);
      	filename = argv[2];
      	R = atoi(argv[3]);
      
      	for (int i = 0; i < NR_CPUS; i++) {
      		if (glue(i) == 0)
      			break;
      	}
      
      	std::vector<std::thread> T;
      	T.reserve(N);
      	for (int i = 0; i < N; i++) {
      		T.emplace_back(f, i);
      	}
      
      	auto t0 = std::chrono::system_clock::now();
      	{
      		*(volatile int *)&xxx = 1;
      		for (auto& t: T) {
      			t.join();
      		}
      	}
      	auto t1 = std::chrono::system_clock::now();
      	std::chrono::duration<double> dt = t1 - t0;
       	std::cout << dt.count() << '\n';
      
      	return 0;
      }
      
      P.S.:
      An explicit randomization marker is added because adding a non-function
      pointer would silently disable structure layout randomization.
      
      [akpm@linux-foundation.org: coding style fixes]
      Reported-by: kbuild test robot <lkp@intel.com>
      Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Joe Perches <joe@perches.com>
      Link: http://lkml.kernel.org/r/20200222201539.GA22576@avx2
      
      
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d919b33d
  9. Feb 04, 2020
  10. Dec 09, 2019
  11. May 15, 2019
    • ipc: do cyclic id allocation for the ipc object. · 99db46ea
      Manfred Spraul authored
      For ipcmni_extend mode, the sequence number space is only 7 bits.  So
      the chance of id reuse is relatively high compared with the non-extended
      mode.
      
      To alleviate this id reuse problem, this patch enables cyclic allocation
      for the index to the radix tree (idx).  The disadvantage is that this
      can cause a slight slow-down of the fast path, as the radix tree could
      be higher than necessary.
      
      To limit the radix tree height, I have chosen the following limits:
       1) The cycling is done over in_use*1.5.
       2) At a minimum, the cycling is done over:
         "normal" ipcmni mode: RADIX_TREE_MAP_SIZE elements
         "ipcmni_extended" mode: 4096 elements
      
      Result:
      - for normal mode:
      	No change for <= 42 active ipc elements. With more than 42
      	active ipc elements, a 2nd level would be added to the radix
      	tree.
      	Without cyclic allocation, a 2nd level would be added only with
      	more than 63 active elements.
      
      - for extended mode:
      	Cycling always creates at least a 2-level radix tree.
      	With more than 2730 active objects, a 3rd level is added,
      	compared with more than 4095 active objects before the 3rd
      	level is added without cyclic allocation.
      
      For a 2-level radix tree compared to a 1-level radix tree, I have
      observed < 1% performance impact.
      
      Notes:
      1) Normal "x=semget();y=semget();" is unaffected: the idx is then
        e.g. a and a+1, regardless of whether idr_alloc() or idr_alloc_cyclic()
        is used.
      
      2) The -1% happens in a microbenchmark after this situation:
      	x=semget();
      	for(i=0;i<4000;i++) {t=semget();semctl(t,0,IPC_RMID);}
      	y=semget();
      	Now perform semget calls on x and y that do not sleep.
      
      3) The worst-case reuse cycle time is unfortunately unaffected:
         If you have 2^24-1 ipc objects allocated, and get/remove the last
         possible element in a loop, then the id is reused after 128
         get/remove pairs.
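
      A hedged sketch of the allocation with the limits described above
      (simplified; the real patch's constants differ between normal and
      extended mode):

	int max_idx;

	max_idx = max_t(int, ids->in_use * 3 / 2, RADIX_TREE_MAP_SIZE);
	max_idx = min(max_idx, ipc_mni);

	/* cycle the index within [0, max_idx) instead of always reusing
	 * the lowest free slot: */
	idx = idr_alloc_cyclic(&ids->ipcs_idr, NULL, 0, max_idx, GFP_NOWAIT);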
      
      Performance check:
      A microbenchmark that performs no-op semop() randomly on two IDs,
      with only these two IDs allocated.
      The IDs were set using /proc/sys/kernel/sem_next_id.
      The test was run 5 times, averages are shown.
      
      1 & 2: Base (6.22 seconds for 10.000.000 semops)
      1 & 40: -0.2%
      1 & 3348: - 0.8%
      1 & 27348: - 1.6%
      1 & 15777204: - 3.2%
      
      Or: ~12.6 cpu cycles per additional radix tree level.
      The cpu is an Intel I3-5010U. ~1300 cpu cycles/syscall is slower
      than what I remember (spectre impact?).
      
      V2 of the patch:
      - use "min" and "max"
      - use RADIX_TREE_MAP_SIZE * RADIX_TREE_MAP_SIZE instead of
      	(2<<12).
      
      [akpm@linux-foundation.org: fix max() warning]
      Link: http://lkml.kernel.org/r/20190329204930.21620-3-longman@redhat.com
      
      
      Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
      Acked-by: Waiman Long <longman@redhat.com>
      Cc: "Luis R. Rodriguez" <mcgrof@kernel.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: "Eric W . Biederman" <ebiederm@xmission.com>
      Cc: Takashi Iwai <tiwai@suse.de>
      Cc: Davidlohr Bueso <dbueso@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      99db46ea
    • ipc: conserve sequence numbers in ipcmni_extend mode · 3278a2c2
      Manfred Spraul authored
      Rewrite, based on the patch from Waiman Long:
      
      The mixing in of a sequence number into the IPC IDs is probably to avoid
      ID reuse in userspace as much as possible.  With ipcmni_extend mode, the
      number of usable sequence numbers is greatly reduced leading to higher
      chance of ID reuse.
      
      To address this issue, we need to conserve the sequence number space as
      much as possible.  Right now, the sequence number is incremented for
      every new ID created.  In reality, we only need to increment the
      sequence number when the newly allocated ID is not greater than the last
      one allocated; only in that case may the new ID collide with an existing
      one.  This is done irrespective of the ipcmni mode.
      
      In order to avoid any races, the index is first allocated and then the
      pointer is replaced.
      
      Changes compared to the initial patch:
       - Handle failures from idr_alloc().
       - Avoid that concurrent operations can see the wrong sequence number.
         (This is achieved by using idr_replace()).
       - IPCMNI_SEQ_SHIFT is not a constant, thus renamed to
         ipcmni_seq_shift().
       - IPCMNI_SEQ_MAX is not a constant, thus renamed to ipcmni_seq_max().
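
      A hedged sketch of that ordering (simplified; field names other than the
      ipcmni_seq_shift()/ipcmni_seq_max() helpers are illustrative):

	/* 1) reserve the index with a NULL pointer so concurrent lookups
	 *    cannot observe a half-initialised object: */
	idx = idr_alloc(&ids->ipcs_idr, NULL, 0, 0, GFP_NOWAIT);
	if (idx < 0)
		return idx;

	/* 2) bump the sequence number only when the new index is not above
	 *    the last one handed out, i.e. only when reuse is possible: */
	if (idx <= ids->last_idx) {
		ids->seq++;
		if (ids->seq > ipcmni_seq_max())
			ids->seq = 0;
	}
	ids->last_idx = idx;

	new->seq = ids->seq;
	new->id = (new->seq << ipcmni_seq_shift()) + idx;

	/* 3) publish the fully initialised object: */
	idr_replace(&ids->ipcs_idr, new, idx);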
      
      Link: http://lkml.kernel.org/r/20190329204930.21620-2-longman@redhat.com
      
      
      Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
      Signed-off-by: Waiman Long <longman@redhat.com>
      Suggested-by: Matthew Wilcox <willy@infradead.org>
      Acked-by: Waiman Long <longman@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Davidlohr Bueso <dbueso@suse.de>
      Cc: "Eric W . Biederman" <ebiederm@xmission.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: "Luis R. Rodriguez" <mcgrof@kernel.org>
      Cc: Takashi Iwai <tiwai@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3278a2c2
    • ipc: allow boot time extension of IPCMNI from 32k to 16M · 5ac893b8
      Waiman Long authored
      The maximum number of unique System V IPC identifiers was limited to
      32k.  That limit should be big enough for most use cases.
      
      However, there are some users out there requesting for more, especially
      those that are migrating from Solaris which uses 24 bits for unique
      identifiers.  To satisfy the need of those users, a new boot time kernel
      option "ipcmni_extend" is added to extend the IPCMNI value to 16M.  This
      is a 512X increase, which should be big enough for users out there that
      need a large number of unique IPC identifiers.
      
      The use of this new option will change the pattern of the IPC
      identifiers returned by functions like shmget(2).  An application that
      depends on such pattern may not work properly.  So it should only be
      used if the users really need more than 32k of unique IPC numbers.
      
      This new option does have the side effect of reducing the maximum number
      of unique sequence numbers from 64k down to 128.  So it is a trade-off.
      
      The computation of a new IPC id is not done in the performance critical
      path.  So a little bit of additional overhead shouldn't have any real
      performance impact.
      
      Link: http://lkml.kernel.org/r/20190329204930.21620-1-longman@redhat.com
      
      
      Signed-off-by: Waiman Long <longman@redhat.com>
      Acked-by: Manfred Spraul <manfred@colorfullife.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Davidlohr Bueso <dbueso@suse.de>
      Cc: "Eric W . Biederman" <ebiederm@xmission.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: "Luis R. Rodriguez" <mcgrof@kernel.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Takashi Iwai <tiwai@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5ac893b8
  12. Apr 08, 2019
    • rhashtable: use bit_spin_locks to protect hash bucket. · 8f0db018
      NeilBrown authored
      
      This patch changes rhashtables to use a bit_spin_lock on BIT(1) of the
      bucket pointer to lock the hash chain for that bucket.
      
      The benefits of a bit spin_lock are:
       - no need to allocate a separate array of locks.
       - no need to have a configuration option to guide the
         choice of the size of this array
       - locking cost is often a single test-and-set in a cache line
         that will have to be loaded anyway.  When inserting at, or removing
         from, the head of the chain, the unlock is free - writing the new
         address in the bucket head implicitly clears the lock bit.
         For __rhashtable_insert_fast() we ensure this always happens
         when adding a new key.
       - even when locking costs 2 updates (lock and unlock), they are
         in a cacheline that needs to be read anyway.
      
      The cost of using a bit spin_lock is a little bit of code complexity,
      which I think is quite manageable.
      
      Bit spin_locks are sometimes inappropriate because they are not fair -
      if multiple CPUs repeatedly contend on the same lock, one CPU can
      easily be starved.  This is not a credible situation with rhashtable.
      Multiple CPUs may want to repeatedly add or remove objects, but they
      will typically do so at different buckets, so they will attempt to
      acquire different locks.
      
      As we have more bit-locks than we previously had spinlocks (by at
      least a factor of two) we can expect slightly less contention to
      go with the slightly better cache behavior and reduced memory
      consumption.
      
      To enhance type checking, a new struct is introduced to represent the
      pointer plus lock-bit that is stored in the bucket table.  This is
      "struct rhash_lock_head" and is empty.  A pointer to it needs to be cast
      to either an unsigned long or a "struct rhash_head *" to be useful.
      Variables of this type are most often called "bkt".

      Previously "pprev" would sometimes point to a bucket, and sometimes to a
      ->next pointer in an rhash_head.  As these are now different types,
      pprev is NULL when it would have pointed to the bucket.  In that case,
      'bkt' is used, together with the correct locking protocol.
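
      A hedged illustration of the underlying primitive (the actual code wraps
      this in helpers operating on struct rhash_lock_head pointers):

	/* lock bit 1 of the bucket-head word; unlock either explicitly, or
	 * implicitly by storing a new (unlocked) head pointer: */
	bit_spin_lock(1, (unsigned long *)bkt);
	/* ... insert into / remove from the head of the chain ... */
	bit_spin_unlock(1, (unsigned long *)bkt);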
      
      Signed-off-by: NeilBrown <neilb@suse.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      8f0db018
  13. Aug 22, 2018
  14. Jun 22, 2018
    • rhashtable: split rhashtable.h · 0eb71a9d
      NeilBrown authored
      
      Due to the use of rhashtables in net namespaces,
      rhashtable.h is included in much of the kernel,
      so a small change can require a large recompilation.
      This makes development painful.
      
      This patch splits out rhashtable-types.h which just includes
      the major type declarations, and does not include (non-trivial)
      inline code.  rhashtable.h is no longer included by anything
      in the include/ directory.
      Common include files only include rhashtable-types.h so a large
      recompilation is only triggered when that changes.
      
      Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: NeilBrown <neilb@suse.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0eb71a9d
  15. Apr 11, 2018
  16. Mar 24, 2018
    • ipc/util: Helpers for making the sysvipc operations pid namespace aware · 03f1fc09
      Eric W. Biederman authored
      
      Capture the pid namespace when /proc/sysvipc/msg /proc/sysvipc/shm
      and /proc/sysvipc/sem are opened, and make it available through
      the new helper ipc_seq_pid_ns.
      
      This makes it possible to report the pids in these files in the
      pid namespace of the opener of the files.
      
      Implement ipc_update_pid.  A simple inline helper that only updates
      a struct pid pointer if the new value does not equal the old value.  This
      removes the need for wordy code sequences like:
      
      	old = object->pid;
      	object->pid = new;
      	put_pid(old);
      
      and
      
      	old = object->pid;
      	if (old != new) {
      		object->pid = new;
      		put_pid(old);
      	}
      
      Allowing the following to be written instead:
      
      	ipc_update_pid(&object->pid, new);
      
      This is easier to read and ensures that the pid reference count is
      not touched when the old and the new values are the same.  Not touching
      the reference count in this case is important to help avoid issues
      like the one af_unix experienced, where multiple threads of the same
      process managed to bounce the struct pid between cpu cache lines
      by updating the pid's reference count.
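
      A hedged sketch of the helper (the exact reference-count handling in the
      kernel version may differ):

	static inline void ipc_update_pid(struct pid **pos, struct pid *pid)
	{
		struct pid *old = *pos;

		if (pid != old) {
			*pos = get_pid(pid);
			put_pid(old);
		}
	}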
      
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      03f1fc09
  17. Feb 07, 2018
  18. Nov 18, 2017
  19. Nov 02, 2017
    • License cleanup: add SPDX GPL-2.0 license identifier to files with no license · b2441318
      Greg Kroah-Hartman authored
      
      Many source files in the tree are missing licensing information, which
      makes it harder for compliance tools to determine the correct license.
      
      By default all files without license information are under the default
      license of the kernel, which is GPL version 2.
      
      Update the files which contain no license information with the 'GPL-2.0'
      SPDX license identifier.  The SPDX identifier is a legally binding
      shorthand, which can be used instead of the full boiler plate text.
      
      This patch is based on work done by Thomas Gleixner and Kate Stewart and
      Philippe Ombredanne.
      
      How this work was done:
      
      Patches were generated and checked against linux-4.14-rc6 for a subset of
      the use cases:
       - file had no licensing information in it.
       - file was a */uapi/* one with no licensing information in it,
       - file was a */uapi/* one with existing licensing information,
      
      Further patches will be generated in subsequent months to fix up cases
      where non-standard license headers were used, and references to license
      had to be inferred by heuristics based on keywords.
      
      The analysis to determine which SPDX License Identifier to be applied to
      a file was done in a spreadsheet of side-by-side results from the
      output of two independent scanners (ScanCode & Windriver) producing SPDX
      tag:value files created by Philippe Ombredanne.  Philippe prepared the
      base worksheet, and did an initial spot review of a few 1000 files.
      
      The 4.13 kernel was the starting point of the analysis with 60,537 files
      assessed.  Kate Stewart did a file by file comparison of the scanner
      results in the spreadsheet to determine which SPDX license identifier(s)
      to be applied to the file. She confirmed any determination that was not
      immediately clear with lawyers working with the Linux Foundation.
      
      Criteria used to select files for SPDX license identifier tagging was:
       - Files considered eligible had to be source code files.
       - Make and config files were included as candidates if they contained >5
         lines of source
       - File already had some variant of a license header in it (even if <5
         lines).
      
      All documentation files were explicitly excluded.
      
      The following heuristics were used to determine which SPDX license
      identifiers to apply.
      
       - when both scanners couldn't find any license traces, file was
         considered to have no license information in it, and the top level
         COPYING file license applied.
      
         For non */uapi/* files that summary was:
      
         SPDX license identifier                            # files
         ---------------------------------------------------|-------
         GPL-2.0                                              11139
      
         and resulted in the first patch in this series.
      
         If that file was a */uapi/* path one, it was "GPL-2.0 WITH
         Linux-syscall-note" otherwise it was "GPL-2.0".  Results of that was:
      
         SPDX license identifier                            # files
         ---------------------------------------------------|-------
         GPL-2.0 WITH Linux-syscall-note                        930
      
         and resulted in the second patch in this series.
      
       - if a file had some form of licensing information in it, and was one
         of the */uapi/* ones, it was denoted with the Linux-syscall-note if
         any GPL family license was found in the file or had no licensing in
         it (per prior point).  Results summary:
      
         SPDX license identifier                            # files
         ---------------------------------------------------|------
         GPL-2.0 WITH Linux-syscall-note                       270
         GPL-2.0+ WITH Linux-syscall-note                      169
         ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause)    21
         ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)    17
         LGPL-2.1+ WITH Linux-syscall-note                      15
         GPL-1.0+ WITH Linux-syscall-note                       14
         ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause)    5
         LGPL-2.0+ WITH Linux-syscall-note                       4
         LGPL-2.1 WITH Linux-syscall-note                        3
         ((GPL-2.0 WITH Linux-syscall-note) OR MIT)              3
         ((GPL-2.0 WITH Linux-syscall-note) AND MIT)             1
      
         and that resulted in the third patch in this series.
      
       - when the two scanners agreed on the detected license(s), that became
         the concluded license(s).
      
       - when there was disagreement between the two scanners (one detected a
         license but the other didn't, or they both detected different
         licenses) a manual inspection of the file occurred.
      
       - In most cases a manual inspection of the information in the file
         resulted in a clear resolution of the license that should apply (and
         which scanner probably needed to revisit its heuristics).
      
       - When it was not immediately clear, the license identifier was
         confirmed with lawyers working with the Linux Foundation.
      
       - If there was any question as to the appropriate license identifier,
         the file was flagged for further research and to be revisited later
         in time.
      
      In total, over 70 hours of logged manual review was done on the
      spreadsheet to determine the SPDX license identifiers to apply to the
      source files by Kate, Philippe, Thomas and, in some cases, confirmation
      by lawyers working with the Linux Foundation.
      
      Kate also obtained a third independent scan of the 4.13 code base from
      FOSSology, and compared selected files where the other two scanners
      disagreed against that SPDX file, to see if there were any new insights.  The
      Windriver scanner is based on an older version of FOSSology in part, so
      they are related.
      
      Thomas did random spot checks in about 500 files from the spreadsheets
      for the uapi headers and agreed with SPDX license identifier in the
      files he inspected. For the non-uapi files Thomas did random spot checks
      in about 15000 files.
      
      In the initial set of patches against 4.14-rc6, 3 files were found to have
      copy/paste license identifier errors, and have been fixed to reflect the
      correct identifier.
      
      Additionally Philippe spent 10 hours this week doing a detailed manual
      inspection and review of the 12,461 patched files from the initial patch
      version early this week with:
       - a full scancode scan run, collecting the matched texts, detected
         license ids and scores
       - reviewing anything where there was a license detected (about 500+
         files) to ensure that the applied SPDX license was correct
       - reviewing anything where there was no detection but the patch license
         was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
         SPDX license was correct
      
      This produced a worksheet with 20 files needing minor correction.  This
      worksheet was then exported into 3 different .csv files for the
      different types of files to be modified.
      
      These .csv files were then reviewed by Greg.  Thomas wrote a script to
      parse the csv files and add the proper SPDX tag to the file, in the
      format that the file expected.  This script was further refined by Greg
      based on the output to detect more types of files automatically and to
      distinguish between header and source .c files (which need different
      comment types.)  Finally Greg ran the script using the .csv files to
      generate the patches.
      
      Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
      Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      b2441318
  20. Sep 09, 2017
    • ipc: optimize semget/shmget/msgget for lots of keys · 0cfb6aee
      Guillaume Knispel authored
      ipc_findkey() used to scan all objects to look for the wanted key.  This
      is slow when using a high number of keys.  This change adds an rhashtable
      of kern_ipc_perm objects in ipc_ids, so that key lookup ceases to be O(n).
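
      A hedged sketch of the resulting lookup, assuming rhashtable parameters
      named ipc_kht_params (simplified):

	static struct kern_ipc_perm *ipc_findkey(struct ipc_ids *ids, key_t key)
	{
		struct kern_ipc_perm *ipcp;

		/* O(1) hash lookup instead of scanning every object: */
		ipcp = rhashtable_lookup_fast(&ids->key_ht, &key, ipc_kht_params);
		if (!ipcp)
			return NULL;

		rcu_read_lock();
		ipc_lock_object(ipcp);
		return ipcp;
	}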
      
      This change gives an 865% improvement in the reaim.jobs_per_min benchmark
      on a 56-thread Intel(R) Xeon(R) CPU E5-2695 v3 @ 2.30GHz with 256G of
      memory [1]
      
      Other (more micro) benchmark results, by the author: On an i5 laptop, the
      following loop executed right after a reboot took, without and with this
      change:
      
          for (int i = 0, k=0x424242; i < KEYS; ++i)
              semget(k++, 1, IPC_CREAT | 0600);
      
                       total       total          max single  max single
         KEYS        without        with        call without   call with
      
            1            3.5         4.9   µs            3.5         4.9
           10            7.6         8.6   µs            3.7         4.7
           32           16.2        15.9   µs            4.3         5.3
          100           72.9        41.8   µs            3.7         4.7
         1000        5,630.0       502.0   µs             *           *
        10000    1,340,000.0     7,240.0   µs             *           *
        31900   17,600,000.0    22,200.0   µs             *           *
      
       *: unreliable measure: high variance
      
      The duration for a lookup-only usage was obtained by the same loop once
      the keys are present:
      
                       total       total          max single  max single
         KEYS        without        with        call without   call with
      
            1            2.1         2.5   µs            2.1         2.5
           10            4.5         4.8   µs            2.2         2.3
           32           13.0        10.8   µs            2.3         2.8
          100           82.9        25.1   µs             *          2.3
         1000        5,780.0       217.0   µs             *           *
        10000    1,470,000.0     2,520.0   µs             *           *
        31900   17,400,000.0     7,810.0   µs             *           *
      
      Finally, executing each semget() in a new process gave, when still
      summing only the durations of these syscalls:
      
      creation:
                       total       total
         KEYS        without        with
      
            1            3.7         5.0   µs
           10           32.9        36.7   µs
           32          125.0       109.0   µs
          100          523.0       353.0   µs
         1000       20,300.0     3,280.0   µs
        10000    2,470,000.0    46,700.0   µs
        31900   27,800,000.0   219,000.0   µs
      
      lookup-only:
                       total       total
         KEYS        without        with
      
            1            2.5         2.7   µs
           10           25.4        24.4   µs
           32          106.0        72.6   µs
          100          591.0       352.0   µs
         1000       22,400.0     2,250.0   µs
        10000    2,510,000.0    25,700.0   µs
        31900   28,200,000.0   115,000.0   µs
      
      [1] http://lkml.kernel.org/r/20170814060507.GE23258@yexl-desktop
      
      Link: http://lkml.kernel.org/r/20170815194954.ck32ta2z35yuzpwp@debix
      
      
      Signed-off-by: Guillaume Knispel <guillaume.knispel@supersonicimagine.com>
      Reviewed-by: Marc Pardo <marc.pardo@supersonicimagine.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Cc: Serge Hallyn <serge@hallyn.com>
      Cc: Andrey Vagin <avagin@openvz.org>
      Cc: Guillaume Knispel <guillaume.knispel@supersonicimagine.com>
      Cc: Marc Pardo <marc.pardo@supersonicimagine.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0cfb6aee
    • ipc: convert kern_ipc_perm.refcount from atomic_t to refcount_t · 9405c03e
      Elena Reshetova authored
      The refcount_t type and its corresponding API should be used instead of
      atomic_t when the variable is used as a reference counter.  This allows
      us to avoid accidental reference counter overflows that might lead to
      use-after-free situations.
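
      A hedged before/after illustration (free_object() is a made-up stand-in
      for the actual release path):

	/* refcount_t saturates instead of wrapping on overflow: */
	refcount_set(&ipcp->refcount, 1);		/* was atomic_set()          */
	refcount_inc(&ipcp->refcount);			/* was atomic_inc()          */
	if (refcount_dec_and_test(&ipcp->refcount))	/* was atomic_dec_and_test() */
		free_object(ipcp);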
      
      Link: http://lkml.kernel.org/r/1499417992-3238-4-git-send-email-elena.reshetova@intel.com
      
      
      Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
      Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: David Windsor <dwindsor@gmail.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Serge Hallyn <serge@hallyn.com>
      Cc: <arozansk@redhat.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9405c03e
  21. Jul 13, 2017
  22. May 09, 2017
    • mm: introduce kv[mz]alloc helpers · a7c3e901
      Michal Hocko authored
      Patch series "kvmalloc", v5.
      
      There are many open-coded kmalloc-with-vmalloc-fallback instances in the
      tree.  Most of them are not careful enough or simply do not care about
      the underlying semantics of the kmalloc/page allocator, which means that
      a) some vmalloc fallbacks are basically unreachable because the kmalloc
      part will keep retrying until it succeeds, and b) the page allocator can
      invoke really disruptive steps like the OOM killer to make forward
      progress, which doesn't sound appropriate when we consider that the
      vmalloc fallback is available.
      
      As can be seen, implementing kvmalloc requires quite intimate
      knowledge of the page allocator and the memory reclaim internals, which
      strongly suggests that the helper should be implemented in the memory
      subsystem proper.
      
      Most callers I could find have been converted to use the helper
      instead (this is patch 6).  There are some more relying on __GFP_REPEAT
      in the networking stack, which I have converted as well; Eric Dumazet
      was not opposed [2] to converting them.
      
      [1] http://lkml.kernel.org/r/20170130094940.13546-1-mhocko@kernel.org
      [2] http://lkml.kernel.org/r/1485273626.16328.301.camel@edumazet-glaptop3.roam.corp.google.com
      
      This patch (of 9):
      
      Using kmalloc with a vmalloc fallback for larger allocations is a
      common pattern in the kernel code.  Yet we do not have any common helper
      for it, so users have invented their own helpers, some of them quite
      creative.  Let's just add kv[mz]alloc and make sure it is implemented
      properly.  This implementation makes sure not to apply large memory
      pressure for > PAGE_SIZE requests (__GFP_NORETRY) and also not to warn
      about allocation failures.  This also rules out the OOM killer, as
      vmalloc is a more appropriate fallback than a disruptive user-visible
      action.
      
      This patch also changes some existing users and removes helpers which
      are specific to them.  In some cases this is not possible (e.g.
      ext4_kvmalloc, libcfs_kvzalloc) because those seem to be broken and
      require a GFP_NO{FS,IO} context, which is not vmalloc compatible in
      general (note that the page table allocation is GFP_KERNEL).  Those need
      to be fixed separately.
      
      While we are at it, document the unsupported gfp mask for
      __vmalloc{_node}, because there seems to be a lot of confusion out there.
      kvmalloc_node will warn about flags incompatible with GFP_KERNEL (i.e.
      not a superset of it) to catch new abusers.  Existing ones will have to
      die slowly.
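
      A hedged usage example of the new helpers:

	void *buf;

	/* try kmalloc first, fall back to vmalloc for large sizes;
	 * kvfree() picks the matching free path either way: */
	buf = kvmalloc(size, GFP_KERNEL);	/* kvzalloc() for zeroed memory */
	if (!buf)
		return -ENOMEM;
	/* ... use buf ... */
	kvfree(buf);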
      
      [sfr@canb.auug.org.au: f2fs fixup]
        Link: http://lkml.kernel.org/r/20170320163735.332e64b7@canb.auug.org.au
      Link: http://lkml.kernel.org/r/20170306103032.2540-2-mhocko@kernel.org
      
      
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
      Reviewed-by: Andreas Dilger <adilger@dilger.ca>	[ext4 part]
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: David Miller <davem@davemloft.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a7c3e901
  23. Apr 02, 2017
    • kernel-api.rst: fix a series of errors when parsing C files · 0e056eb5
      Mauro Carvalho Chehab authored
      
      ./lib/string.c:134: WARNING: Inline emphasis start-string without end-string.
      ./mm/filemap.c:522: WARNING: Inline interpreted text or phrase reference start-string without end-string.
      ./mm/filemap.c:1283: ERROR: Unexpected indentation.
      ./mm/filemap.c:3003: WARNING: Inline interpreted text or phrase reference start-string without end-string.
      ./mm/vmalloc.c:1544: WARNING: Inline emphasis start-string without end-string.
      ./mm/page_alloc.c:4245: ERROR: Unexpected indentation.
      ./ipc/util.c:676: ERROR: Unexpected indentation.
      ./drivers/pci/irq.c:35: WARNING: Block quote ends without a blank line; unexpected unindent.
      ./security/security.c:109: ERROR: Unexpected indentation.
      ./security/security.c:110: WARNING: Definition list ends without a blank line; unexpected unindent.
      ./block/genhd.c:275: WARNING: Inline strong start-string without end-string.
      ./block/genhd.c:283: WARNING: Inline strong start-string without end-string.
      ./include/linux/clk.h:134: WARNING: Inline emphasis start-string without end-string.
      ./include/linux/clk.h:134: WARNING: Inline emphasis start-string without end-string.
      ./ipc/util.c:477: ERROR: Unknown target name: "s".
      
      Signed-off-by: Mauro Carvalho Chehab <mchehab@s-opensource.com>
      Acked-by: Bjorn Helgaas <bhelgaas@google.com>
      Signed-off-by: Jonathan Corbet <corbet@lwn.net>
      0e056eb5