- Jan 28, 2023
-
-
Rik van Riel authored
Instead of waiting for an RCU grace period between each ipc_namespace structure that is being freed, wait an RCU grace period for every batch of ipc_namespace structures. Thanks to Al Viro for the suggestion of the helper function. This speeds up the run time of the test case that allocates ipc_namespaces in a loop from 6 minutes, to a little over 1 second: real 0m1.192s user 0m0.038s sys 0m1.152s Signed-off-by:
Rik van Riel <riel@surriel.com> Reported-by:
Chris Mason <clm@meta.com> Tested-by:
Giuseppe Scrivano <gscrivan@redhat.com> Suggested-by:
Al Viro <viro@zeniv.linux.org.uk> Signed-off-by:
Al Viro <viro@zeniv.linux.org.uk>
-
- Jan 19, 2023
-
-
Christian Brauner authored
Convert to struct mnt_idmap. Last cycle we merged the necessary infrastructure in 256c8aed ("fs: introduce dedicated idmap type for mounts"). This is just the conversion to struct mnt_idmap. Currently we still pass around the plain namespace that was attached to a mount. This is in general pretty convenient but it makes it easy to conflate namespaces that are relevant on the filesystem with namespaces that are relevent on the mount level. Especially for non-vfs developers without detailed knowledge in this area this can be a potential source for bugs. Once the conversion to struct mnt_idmap is done all helpers down to the really low-level helpers will take a struct mnt_idmap argument instead of two namespace arguments. This way it becomes impossible to conflate the two eliminating the possibility of any bugs. All of the vfs and all filesystems only operate on struct mnt_idmap. Acked-by:
Dave Chinner <dchinner@redhat.com> Reviewed-by:
Christoph Hellwig <hch@lst.de> Signed-off-by:
Christian Brauner (Microsoft) <brauner@kernel.org>
-
Christian Brauner authored
Convert to struct mnt_idmap. Last cycle we merged the necessary infrastructure in 256c8aed ("fs: introduce dedicated idmap type for mounts"). This is just the conversion to struct mnt_idmap. Currently we still pass around the plain namespace that was attached to a mount. This is in general pretty convenient but it makes it easy to conflate namespaces that are relevant on the filesystem with namespaces that are relevent on the mount level. Especially for non-vfs developers without detailed knowledge in this area this can be a potential source for bugs. Once the conversion to struct mnt_idmap is done all helpers down to the really low-level helpers will take a struct mnt_idmap argument instead of two namespace arguments. This way it becomes impossible to conflate the two eliminating the possibility of any bugs. All of the vfs and all filesystems only operate on struct mnt_idmap. Acked-by:
Dave Chinner <dchinner@redhat.com> Reviewed-by:
Christoph Hellwig <hch@lst.de> Signed-off-by:
Christian Brauner (Microsoft) <brauner@kernel.org>
-
- Jan 18, 2023
-
-
Christian Brauner authored
Convert to struct mnt_idmap. Last cycle we merged the necessary infrastructure in 256c8aed ("fs: introduce dedicated idmap type for mounts"). This is just the conversion to struct mnt_idmap. Currently we still pass around the plain namespace that was attached to a mount. This is in general pretty convenient but it makes it easy to conflate namespaces that are relevant on the filesystem with namespaces that are relevent on the mount level. Especially for non-vfs developers without detailed knowledge in this area this can be a potential source for bugs. Once the conversion to struct mnt_idmap is done all helpers down to the really low-level helpers will take a struct mnt_idmap argument instead of two namespace arguments. This way it becomes impossible to conflate the two eliminating the possibility of any bugs. All of the vfs and all filesystems only operate on struct mnt_idmap. Acked-by:
Dave Chinner <dchinner@redhat.com> Reviewed-by:
Christoph Hellwig <hch@lst.de> Signed-off-by:
Christian Brauner (Microsoft) <brauner@kernel.org>
-
- Dec 12, 2022
-
-
Zhengchao Shao authored
When setup_mq_sysctls() failed in init_mqueue_fs(), mqueue_inode_cachep is not released. In order to fix this issue, the release path is reordered. Link: https://lkml.kernel.org/r/20221209092929.1978875-1-shaozhengchao@huawei.com Fixes: dc55e35f ("ipc: Store mqueue sysctls in the ipc namespace") Signed-off-by:
Zhengchao Shao <shaozhengchao@huawei.com> Cc: Alexey Gladkov <legion@kernel.org> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Jingyu Wang <jingyuwang_vip@163.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Waiman Long <longman@redhat.com> Cc: Wei Yongjun <weiyongjun1@huawei.com> Cc: YueHaibing <yuehaibing@huawei.com> Cc: Yu Zhe <yuzhe@nfschina.com> Cc: Manfred Spraul <manfred@colorfullife.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
-
- Oct 03, 2022
-
-
Jingyu Wang authored
iput() already handles null and non-null parameters, so there is no need to use if(). Link: https://lkml.kernel.org/r/20220908185452.76590-1-jingyuwang_vip@163.com Signed-off-by:
Jingyu Wang <jingyuwang_vip@163.com> Acked-by:
Roman Gushchin <roman.gushchin@linux.dev> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
-
- Jul 20, 2022
-
-
Hangyu Hua authored
commit db7cfc38 ("ipc: Free mq_sysctls if ipc namespace creation failed") Here's a similar memory leak to the one fixed by the patch above. retire_mq_sysctls need to be called when init_mqueue_fs fails after setup_mq_sysctls. Fixes: dc55e35f ("ipc: Store mqueue sysctls in the ipc namespace") Signed-off-by:
Hangyu Hua <hbh25y@gmail.com> Link: https://lkml.kernel.org/r/20220715062301.19311-1-hbh25y@gmail.com Signed-off-by:
Eric W. Biederman <ebiederm@xmission.com>
-
- Jul 18, 2022
-
-
Yu Zhe authored
Remove unnecessary void* type casting. Link: https://lkml.kernel.org/r/20220628021251.17197-1-yuzhe@nfschina.com Signed-off-by:
Yu Zhe <yuzhe@nfschina.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
-
- May 10, 2022
-
-
Waiman Long authored
When running the stress-ng clone benchmark with multiple testing threads, it was found that there were significant spinlock contention in sget_fc(). The contended spinlock was the sb_lock. It is under heavy contention because the following code in the critcal section of sget_fc(): hlist_for_each_entry(old, &fc->fs_type->fs_supers, s_instances) { if (test(old, fc)) goto share_extant_sb; } After testing with added instrumentation code, it was found that the benchmark could generate thousands of ipc namespaces with the corresponding number of entries in the mqueue's fs_supers list where the namespaces are the key for the search. This leads to excessive time in scanning the list for a match. Looking back at the mqueue calling sequence leading to sget_fc(): mq_init_ns() => mq_create_mount() => fc_mount() => vfs_get_tree() => mqueue_get_tree() => get_tree_keyed() => vfs_get_super() => sget_fc() Currently, mq_init_ns() is the only mqueue function that will indirectly call mqueue_get_tree() with a newly allocated ipc namespace as the key for searching. As a result, there will never be a match with the exising ipc namespaces stored in the mqueue's fs_supers list. So using get_tree_keyed() to do an existing ipc namespace search is just a waste of time. Instead, we could use get_tree_nodev() to eliminate the useless search. By doing so, we can greatly reduce the sb_lock hold time and avoid the spinlock contention problem in case a large number of ipc namespaces are present. Of course, if the code is modified in the future to allow mqueue_get_tree() to be called with an existing ipc namespace instead of a new one, we will have to use get_tree_keyed() in this case. The following stress-ng clone benchmark command was run on a 2-socket 48-core Intel system: ./stress-ng --clone 32 --verbose --oomable --metrics-brief -t 20 The "bogo ops/s" increased from 5948.45 before patch to 9137.06 after patch. This is an increase of 54% in performance. Link: https://lkml.kernel.org/r/20220121172315.19652-1-longman@redhat.com Fixes: 935c6912 ("ipc: Convert mqueue fs to fs_context") Signed-off-by:
Waiman Long <longman@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: David Howells <dhowells@redhat.com> Cc: Manfred Spraul <manfred@colorfullife.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
-
- Mar 22, 2022
-
-
Muchun Song authored
The inode allocation is supposed to use alloc_inode_sb(), so convert kmem_cache_alloc() of all filesystems to alloc_inode_sb(). Link: https://lkml.kernel.org/r/20220228122126.37293-5-songmuchun@bytedance.com Signed-off-by:
Muchun Song <songmuchun@bytedance.com> Acked-by: Theodore Ts'o <tytso@mit.edu> [ext4] Acked-by:
Roman Gushchin <roman.gushchin@linux.dev> Cc: Alex Shi <alexs@kernel.org> Cc: Anna Schumaker <Anna.Schumaker@Netapp.com> Cc: Chao Yu <chao@kernel.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Fam Zheng <fam.zheng@bytedance.com> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kari Argillander <kari.argillander@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Trond Myklebust <trond.myklebust@hammerspace.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
- Mar 08, 2022
-
-
Alexey Gladkov authored
Right now, the mqueue sysctls take ipc namespaces into account in a rather hacky way. This works in most cases, but does not respect the user namespace. Within the user namespace, the user cannot change the /proc/sys/fs/mqueue/* parametres. This poses a problem in the rootless containers. To solve this I changed the implementation of the mqueue sysctls just like some other sysctls. So far, the changes do not provide additional access to files. This will be done in a future patch. v3: * Don't implemenet set_permissions to keep the current behavior. v2: * Fixed compilation problem if CONFIG_POSIX_MQUEUE_SYSCTL is not specified. Reported-by:
kernel test robot <lkp@intel.com> Signed-off-by:
Alexey Gladkov <legion@kernel.org> Link: https://lkml.kernel.org/r/b0ccbb2489119f1f20c737cf1930c3a9c4e4243a.1644862280.git.legion@kernel.org Signed-off-by:
Eric W. Biederman <ebiederm@xmission.com>
-
- May 23, 2021
-
-
Varad Gautam authored
do_mq_timedreceive calls wq_sleep with a stack local address. The sender (do_mq_timedsend) uses this address to later call pipelined_send. This leads to a very hard to trigger race where a do_mq_timedreceive call might return and leave do_mq_timedsend to rely on an invalid address, causing the following crash: RIP: 0010:wake_q_add_safe+0x13/0x60 Call Trace: __x64_sys_mq_timedsend+0x2a9/0x490 do_syscall_64+0x80/0x680 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x7f5928e40343 The race occurs as: 1. do_mq_timedreceive calls wq_sleep with the address of `struct ext_wait_queue` on function stack (aliased as `ewq_addr` here) - it holds a valid `struct ext_wait_queue *` as long as the stack has not been overwritten. 2. `ewq_addr` gets added to info->e_wait_q[RECV].list in wq_add, and do_mq_timedsend receives it via wq_get_first_waiter(info, RECV) to call __pipelined_op. 3. Sender calls __pipelined_op::smp_store_release(&this->state, STATE_READY). Here is where the race window begins. (`this` is `ewq_addr`.) 4. If the receiver wakes up now in do_mq_timedreceive::wq_sleep, it will see `state == STATE_READY` and break. 5. do_mq_timedreceive returns, and `ewq_addr` is no longer guaranteed to be a `struct ext_wait_queue *` since it was on do_mq_timedreceive's stack. (Although the address may not get overwritten until another function happens to touch it, which means it can persist around for an indefinite time.) 6. do_mq_timedsend::__pipelined_op() still believes `ewq_addr` is a `struct ext_wait_queue *`, and uses it to find a task_struct to pass to the wake_q_add_safe call. In the lucky case where nothing has overwritten `ewq_addr` yet, `ewq_addr->task` is the right task_struct. In the unlucky case, __pipelined_op::wake_q_add_safe gets handed a bogus address as the receiver's task_struct causing the crash. do_mq_timedsend::__pipelined_op() should not dereference `this` after setting STATE_READY, as the receiver counterpart is now free to return. Change __pipelined_op to call wake_q_add_safe on the receiver's task_struct returned by get_task_struct, instead of dereferencing `this` which sits on the receiver's stack. As Manfred pointed out, the race potentially also exists in ipc/msg.c::expunge_all and ipc/sem.c::wake_up_sem_queue_prepare. Fix those in the same way. Link: https://lkml.kernel.org/r/20210510102950.12551-1-varad.gautam@suse.com Fixes: c5b2cbdb ("ipc/mqueue.c: update/document memory barriers") Fixes: 8116b54e ("ipc/sem.c: document and update memory barriers") Fixes: 0d97a82b ("ipc/msg.c: update and document memory barriers") Signed-off-by:
Varad Gautam <varad.gautam@suse.com> Reported-by:
Matthias von Faber <matthias.vonfaber@aox-tech.de> Acked-by:
Davidlohr Bueso <dbueso@suse.de> Acked-by:
Manfred Spraul <manfred@colorfullife.com> Cc: Christian Brauner <christian.brauner@ubuntu.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: <stable@vger.kernel.org> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
- Apr 30, 2021
-
-
Alexey Gladkov authored
The rlimit counter is tied to uid in the user_namespace. This allows rlimit values to be specified in userns even if they are already globally exceeded by the user. However, the value of the previous user_namespaces cannot be exceeded. Signed-off-by:
Alexey Gladkov <legion@kernel.org> Link: https://lkml.kernel.org/r/2531f42f7884bbfee56a978040b3e0d25cdf6cde.1619094428.git.legion@kernel.org Signed-off-by:
Eric W. Biederman <ebiederm@xmission.com>
-
- Jan 24, 2021
-
-
Christian Brauner authored
Extend some inode methods with an additional user namespace argument. A filesystem that is aware of idmapped mounts will receive the user namespace the mount has been marked with. This can be used for additional permission checking and also to enable filesystems to translate between uids and gids if they need to. We have implemented all relevant helpers in earlier patches. As requested we simply extend the exisiting inode method instead of introducing new ones. This is a little more code churn but it's mostly mechanical and doesnt't leave us with additional inode methods. Link: https://lore.kernel.org/r/20210121131959.646623-25-christian.brauner@ubuntu.com Cc: Christoph Hellwig <hch@lst.de> Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Reviewed-by:
Christoph Hellwig <hch@lst.de> Signed-off-by:
Christian Brauner <christian.brauner@ubuntu.com>
-
Christian Brauner authored
The various vfs_*() helpers are called by filesystems or by the vfs itself to perform core operations such as create, link, mkdir, mknod, rename, rmdir, tmpfile and unlink. Enable them to handle idmapped mounts. If the inode is accessed through an idmapped mount map it into the mount's user namespace and pass it down. Afterwards the checks and operations are identical to non-idmapped mounts. If the initial user namespace is passed nothing changes so non-idmapped mounts will see identical behavior as before. Link: https://lore.kernel.org/r/20210121131959.646623-15-christian.brauner@ubuntu.com Cc: Christoph Hellwig <hch@lst.de> Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Reviewed-by:
Christoph Hellwig <hch@lst.de> Signed-off-by:
Christian Brauner <christian.brauner@ubuntu.com>
-
Christian Brauner authored
The two helpers inode_permission() and generic_permission() are used by the vfs to perform basic permission checking by verifying that the caller is privileged over an inode. In order to handle idmapped mounts we extend the two helpers with an additional user namespace argument. On idmapped mounts the two helpers will make sure to map the inode according to the mount's user namespace and then peform identical permission checks to inode_permission() and generic_permission(). If the initial user namespace is passed nothing changes so non-idmapped mounts will see identical behavior as before. Link: https://lore.kernel.org/r/20210121131959.646623-6-christian.brauner@ubuntu.com Cc: Christoph Hellwig <hch@lst.de> Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Reviewed-by:
Christoph Hellwig <hch@lst.de> Reviewed-by:
James Morris <jamorris@linux.microsoft.com> Acked-by:
Serge Hallyn <serge@hallyn.com> Signed-off-by:
Christian Brauner <christian.brauner@ubuntu.com>
-
- May 08, 2020
-
-
Oleg Nesterov authored
Commit cc731525 ("signal: Remove kernel interal si_code magic") changed the value of SI_FROMUSER(SI_MESGQ), this means that mq_notify() no longer works if the sender doesn't have rights to send a signal. Change __do_notify() to use do_send_sig_info() instead of kill_pid_info() to avoid check_kill_permission(). This needs the additional notify.sigev_signo != 0 check, shouldn't we change do_mq_notify() to deny sigev_signo == 0 ? Test-case: #include <signal.h> #include <mqueue.h> #include <unistd.h> #include <sys/wait.h> #include <assert.h> static int notified; static void sigh(int sig) { notified = 1; } int main(void) { signal(SIGIO, sigh); int fd = mq_open("/mq", O_RDWR|O_CREAT, 0666, NULL); assert(fd >= 0); struct sigevent se = { .sigev_notify = SIGEV_SIGNAL, .sigev_signo = SIGIO, }; assert(mq_notify(fd, &se) == 0); if (!fork()) { assert(setuid(1) == 0); mq_send(fd, "",1,0); return 0; } wait(NULL); mq_unlink("/mq"); assert(notified); return 0; } [manfred@colorfullife.com: 1) Add self_exec_id evaluation so that the implementation matches do_notify_parent 2) use PIDTYPE_TGID everywhere] Fixes: cc731525 ("signal: Remove kernel interal si_code magic") Reported-by:
Yoji <yoji.fujihar.min@gmail.com> Signed-off-by:
Oleg Nesterov <oleg@redhat.com> Signed-off-by:
Manfred Spraul <manfred@colorfullife.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Acked-by:
"Eric W. Biederman" <ebiederm@xmission.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Markus Elfring <elfring@users.sourceforge.net> Cc: <1vier1@web.de> Cc: <stable@vger.kernel.org> Link: http://lkml.kernel.org/r/e2a782e4-eab9-4f5c-c749-c07a8f7a4e66@colorfullife.com Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
- Apr 07, 2020
-
-
Somala Swaraj authored
Signed-off-by:
somala swaraj <somalaswaraj@gmail.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/20200301135530.18340-1-somalaswaraj@gmail.com Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
- Feb 04, 2020
-
-
Manfred Spraul authored
Update and document memory barriers for mqueue.c: - ewp->state is read without any locks, thus READ_ONCE is required. - add smp_aquire__after_ctrl_dep() after the READ_ONCE, we need acquire semantics if the value is STATE_READY. - use wake_q_add_safe() - document why __set_current_state() may be used: Reading task->state cannot happen before the wake_q_add() call, which happens while holding info->lock. Thus the spin_unlock() is the RELEASE, and the spin_lock() is the ACQUIRE. For completeness: there is also a 3 CPU scenario, if the to be woken up task is already on another wake_q. Then: - CPU1: spin_unlock() of the task that goes to sleep is the RELEASE - CPU2: the spin_lock() of the waker is the ACQUIRE - CPU2: smp_mb__before_atomic inside wake_q_add() is the RELEASE - CPU3: smp_mb__after_spinlock() inside try_to_wake_up() is the ACQUIRE Link: http://lkml.kernel.org/r/20191020123305.14715-4-manfred@colorfullife.com Signed-off-by:
Manfred Spraul <manfred@colorfullife.com> Reviewed-by:
Davidlohr Bueso <dbueso@suse.de> Cc: Waiman Long <longman@redhat.com> Cc: <1vier1@web.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Will Deacon <will.deacon@arm.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
Davidlohr Bueso authored
pipelined_send() and pipelined_receive() are identical, so merge them. [manfred@colorfullife.com: add changelog] Link: http://lkml.kernel.org/r/20191020123305.14715-3-manfred@colorfullife.com Signed-off-by:
Davidlohr Bueso <dave@stgolabs.net> Signed-off-by:
Manfred Spraul <manfred@colorfullife.com> Cc: <1vier1@web.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Waiman Long <longman@redhat.com> Cc: Will Deacon <will.deacon@arm.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
- Sep 26, 2019
-
-
Markus Elfring authored
Null pointers were assigned to local variables in a few cases as exception handling. The jump target “out” was used where no meaningful data processing actions should eventually be performed by branches of an if statement then. Use an additional jump target for calling dev_kfree_skb() directly. Return also directly after error conditions were detected when no extra clean-up is needed by this function implementation. Link: http://lkml.kernel.org/r/592ef10e-0b69-72d0-9789-fc48f638fdfd@web.de Signed-off-by:
Markus Elfring <elfring@users.sourceforge.net> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
Markus Elfring authored
dev_kfree_skb() input parameter validation, thus the test around the call is not needed. This issue was detected by using the Coccinelle software. Link: http://lkml.kernel.org/r/07477187-63e5-cc80-34c1-32dd16b38e12@web.de Signed-off-by:
Markus Elfring <elfring@users.sourceforge.net> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
- Sep 05, 2019
-
-
Al Viro authored
For vfs_get_keyed_super users. Signed-off-by:
Al Viro <viro@zeniv.linux.org.uk>
-
- Jul 17, 2019
-
-
Kees Cook authored
Andreas Christoforou reported: UBSAN: Undefined behaviour in ipc/mqueue.c:414:49 signed integer overflow: 9 * 2305843009213693951 cannot be represented in type 'long int' ... Call Trace: mqueue_evict_inode+0x8e7/0xa10 ipc/mqueue.c:414 evict+0x472/0x8c0 fs/inode.c:558 iput_final fs/inode.c:1547 [inline] iput+0x51d/0x8c0 fs/inode.c:1573 mqueue_get_inode+0x8eb/0x1070 ipc/mqueue.c:320 mqueue_create_attr+0x198/0x440 ipc/mqueue.c:459 vfs_mkobj+0x39e/0x580 fs/namei.c:2892 prepare_open ipc/mqueue.c:731 [inline] do_mq_open+0x6da/0x8e0 ipc/mqueue.c:771 Which could be triggered by: struct mq_attr attr = { .mq_flags = 0, .mq_maxmsg = 9, .mq_msgsize = 0x1fffffffffffffff, .mq_curmsgs = 0, }; if (mq_open("/testing", 0x40, 3, &attr) == (mqd_t) -1) perror("mq_open"); mqueue_get_inode() was correctly rejecting the giant mq_msgsize, and preparing to return -EINVAL. During the cleanup, it calls mqueue_evict_inode() which performed resource usage tracking math for updating "user", before checking if there was a valid "user" at all (which would indicate that the calculations would be sane). Instead, delay this check to after seeing a valid "user". The overflow was real, but the results went unused, so while the flaw is harmless, it's noisy for kernel fuzzers, so just fix it by moving the calculation under the non-NULL "user" where it actually gets used. Link: http://lkml.kernel.org/r/201906072207.ECB65450@keescook Signed-off-by:
Kees Cook <keescook@chromium.org> Reported-by:
Andreas Christoforou <andreaschristofo@gmail.com> Acked-by:
"Eric W. Biederman" <ebiederm@xmission.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
- May 26, 2019
-
-
Al Viro authored
... so that we could lift the capability checks into ->get_tree() caller Signed-off-by:
Al Viro <viro@zeniv.linux.org.uk>
-
- May 15, 2019
-
-
Davidlohr Bueso authored
Our msg priorities became an rbtree as of d6629859 ("ipc/mqueue: improve performance of send/recv"). However, consuming a msg in msg_get() remains logarithmic (still being better than the case before of course). By applying well known techniques to cache pointers we can have the node with the highest priority in O(1), which is specially nice for the rt cases. Furthermore, some callers can call msg_get() in a loop. A new msg_tree_erase() helper is also added to encapsulate the tree removal and node_cache game. Passes ltp mq testcases. Link: http://lkml.kernel.org/r/20190321190216.1719-2-dave@stgolabs.net Signed-off-by:
Davidlohr Bueso <dbueso@suse.de> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
Davidlohr Bueso authored
We already store the current task fo the new waiter before calling wq_sleep() in both send and recv paths. Trivially remove the redundant assignment. Link: http://lkml.kernel.org/r/20190321190216.1719-1-dave@stgolabs.net Signed-off-by:
Davidlohr Bueso <dbueso@suse.de> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
Li Rongqing authored
msgctl10 of ltp triggers the following lockup When CONFIG_KASAN is enabled on large memory SMP systems, the pages initialization can take a long time, if msgctl10 requests a huge block memory, and it will block rcu scheduler, so release cpu actively. After adding schedule() in free_msg, free_msg can not be called when holding spinlock, so adding msg to a tmp list, and free it out of spinlock rcu: INFO: rcu_preempt detected stalls on CPUs/tasks: rcu: Tasks blocked on level-1 rcu_node (CPUs 16-31): P32505 rcu: Tasks blocked on level-1 rcu_node (CPUs 48-63): P34978 rcu: (detected by 11, t=35024 jiffies, g=44237529, q=16542267) msgctl10 R running task 21608 32505 2794 0x00000082 Call Trace: preempt_schedule_irq+0x4c/0xb0 retint_kernel+0x1b/0x2d RIP: 0010:__is_insn_slot_addr+0xfb/0x250 Code: 82 1d 00 48 8b 9b 90 00 00 00 4c 89 f7 49 c1 ee 03 e8 59 83 1d 00 48 b8 00 00 00 00 00 fc ff df 4c 39 eb 48 89 9d 58 ff ff ff <41> c6 04 06 f8 74 66 4c 8d 75 98 4c 89 f1 48 c1 e9 03 48 01 c8 48 RSP: 0018:ffff88bce041f758 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13 RAX: dffffc0000000000 RBX: ffffffff8471bc50 RCX: ffffffff828a2a57 RDX: dffffc0000000000 RSI: dffffc0000000000 RDI: ffff88bce041f780 RBP: ffff88bce041f828 R08: ffffed15f3f4c5b3 R09: ffffed15f3f4c5b3 R10: 0000000000000001 R11: ffffed15f3f4c5b2 R12: 000000318aee9b73 R13: ffffffff8471bc50 R14: 1ffff1179c083ef0 R15: 1ffff1179c083eec kernel_text_address+0xc1/0x100 __kernel_text_address+0xe/0x30 unwind_get_return_address+0x2f/0x50 __save_stack_trace+0x92/0x100 create_object+0x380/0x650 __kmalloc+0x14c/0x2b0 load_msg+0x38/0x1a0 do_msgsnd+0x19e/0xcf0 do_syscall_64+0x117/0x400 entry_SYSCALL_64_after_hwframe+0x49/0xbe rcu: INFO: rcu_preempt detected stalls on CPUs/tasks: rcu: Tasks blocked on level-1 rcu_node (CPUs 0-15): P32170 rcu: (detected by 14, t=35016 jiffies, g=44237525, q=12423063) msgctl10 R running task 21608 32170 32155 0x00000082 Call Trace: preempt_schedule_irq+0x4c/0xb0 retint_kernel+0x1b/0x2d RIP: 0010:lock_acquire+0x4d/0x340 Code: 48 81 ec c0 00 00 00 45 89 c6 4d 89 cf 48 8d 6c 24 20 48 89 3c 24 48 8d bb e4 0c 00 00 89 74 24 0c 48 c7 44 24 20 b3 8a b5 41 <48> c1 ed 03 48 c7 44 24 28 b4 25 18 84 48 c7 44 24 30 d0 54 7a 82 RSP: 0018:ffff88af83417738 EFLAGS: 00000282 ORIG_RAX: ffffffffffffff13 RAX: dffffc0000000000 RBX: ffff88bd335f3080 RCX: 0000000000000002 RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88bd335f3d64 RBP: ffff88af83417758 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000001 R11: ffffed13f3f745b2 R12: 0000000000000000 R13: 0000000000000002 R14: 0000000000000000 R15: 0000000000000000 is_bpf_text_address+0x32/0xe0 kernel_text_address+0xec/0x100 __kernel_text_address+0xe/0x30 unwind_get_return_address+0x2f/0x50 __save_stack_trace+0x92/0x100 save_stack+0x32/0xb0 __kasan_slab_free+0x130/0x180 kfree+0xfa/0x2d0 free_msg+0x24/0x50 do_msgrcv+0x508/0xe60 do_syscall_64+0x117/0x400 entry_SYSCALL_64_after_hwframe+0x49/0xbe Davidlohr said: "So after releasing the lock, the msg rbtree/list is empty and new calls will not see those in the newly populated tmp_msg list, and therefore they cannot access the delayed msg freeing pointers, which is good. Also the fact that the node_cache is now freed before the actual messages seems to be harmless as this is wanted for msg_insert() avoiding GFP_ATOMIC allocations, and after releasing the info->lock the thing is freed anyway so it should not change things" Link: http://lkml.kernel.org/r/1552029161-4957-1-git-send-email-lirongqing@baidu.com Signed-off-by:
Li RongQing <lirongqing@baidu.com> Signed-off-by:
Zhang Yu <zhangyu31@baidu.com> Reviewed-by:
Davidlohr Bueso <dbueso@suse.de> Cc: Manfred Spraul <manfred@colorfullife.com> Cc: Arnd Bergmann <arnd@arndb.de> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
- May 02, 2019
-
-
Al Viro authored
Signed-off-by:
Al Viro <viro@zeniv.linux.org.uk>
-
- Feb 28, 2019
-
-
David Howells authored
Convert the mqueue filesystem to use the filesystem context stuff. Notes: (1) The relevant ipc namespace is selected in when the context is initialised (and it defaults to the current task's ipc namespace). The caller can override this before calling vfs_get_tree(). (2) Rather than simply calling kern_mount_data(), mq_init_ns() and mq_internal_mount() create a context, adjust it and then do the rest of the mount procedure. (3) The lazy mqueue mounting on creation of a new namespace is retained from a previous patch, but the avoidance of sget() if no superblock yet exists is reverted and the superblock is again keyed on the namespace pointer. Yes, there was a performance gain in not searching the superblock hash, but it's only paid once per ipc namespace - and only if someone uses mqueue within that namespace, so I'm not sure it's worth it, especially as calling sget() allows avoidance of recursion. Signed-off-by:
David Howells <dhowells@redhat.com> Signed-off-by:
Al Viro <viro@zeniv.linux.org.uk>
-
- Feb 07, 2019
-
-
Arnd Bergmann authored
A lot of system calls that pass a time_t somewhere have an implementation using a COMPAT_SYSCALL_DEFINEx() on 64-bit architectures, and have been reworked so that this implementation can now be used on 32-bit architectures as well. The missing step is to redefine them using the regular SYSCALL_DEFINEx() to get them out of the compat namespace and make it possible to build them on 32-bit architectures. Any system call that ends in 'time' gets a '32' suffix on its name for that version, while the others get a '_time32' suffix, to distinguish them from the normal version, which takes a 64-bit time argument in the future. In this step, only 64-bit architectures are changed, doing this rename first lets us avoid touching the 32-bit architectures twice. Acked-by:
Catalin Marinas <catalin.marinas@arm.com> Signed-off-by:
Arnd Bergmann <arnd@arndb.de>
-
- Oct 03, 2018
-
-
Eric W. Biederman authored
Linus recently observed that if we did not worry about the padding member in struct siginfo it is only about 48 bytes, and 48 bytes is much nicer than 128 bytes for allocating on the stack and copying around in the kernel. The obvious thing of only adding the padding when userspace is including siginfo.h won't work as there are sigframe definitions in the kernel that embed struct siginfo. So split siginfo in two; kernel_siginfo and siginfo. Keeping the traditional name for the userspace definition. While the version that is used internally to the kernel and ultimately will not be padded to 128 bytes is called kernel_siginfo. The definition of struct kernel_siginfo I have put in include/signal_types.h A set of buildtime checks has been added to verify the two structures have the same field offsets. To make it easy to verify the change kernel_siginfo retains the same size as siginfo. The reduction in size comes in a following change. Signed-off-by:
"Eric W. Biederman" <ebiederm@xmission.com>
-
- Aug 27, 2018
-
-
Arnd Bergmann authored
Christoph Hellwig suggested a slightly different path for handling backwards compatibility with the 32-bit time_t based system calls: Rather than simply reusing the compat_sys_* entry points on 32-bit architectures unchanged, we get rid of those entry points and the compat_time types by renaming them to something that makes more sense on 32-bit architectures (which don't have a compat mode otherwise), and then share the entry points under the new name with the 64-bit architectures that use them for implementing the compatibility. The following types and interfaces are renamed here, and moved from linux/compat_time.h to linux/time32.h: old new --- --- compat_time_t old_time32_t struct compat_timeval struct old_timeval32 struct compat_timespec struct old_timespec32 struct compat_itimerspec struct old_itimerspec32 ns_to_compat_timeval() ns_to_old_timeval32() get_compat_itimerspec64() get_old_itimerspec32() put_compat_itimerspec64() put_old_itimerspec32() compat_get_timespec64() get_old_timespec32() compat_put_timespec64() put_old_timespec32() As we already have aliases in place, this patch addresses only the instances that are relevant to the system call interface in particular, not those that occur in device drivers and other modules. Those will get handled separately, while providing the 64-bit version of the respective interfaces. I'm not renaming the timex, rusage and itimerval structures, as we are still debating what the new interface will look like, and whether we will need a replacement at all. This also doesn't change the names of the syscall entry points, which can be done more easily when we actually switch over the 32-bit architectures to use them, at that point we need to change COMPAT_SYSCALL_DEFINEx to SYSCALL_DEFINEx with a new name, e.g. with a _time32 suffix. Suggested-by:
Christoph Hellwig <hch@infradead.org> Link: https://lore.kernel.org/lkml/20180705222110.GA5698@infradead.org/ Signed-off-by:
Arnd Bergmann <arnd@arndb.de>
-
- Apr 20, 2018
-
-
Arnd Bergmann authored
Three ipc syscalls (mq_timedsend, mq_timedreceive and and semtimedop) take a timespec argument. After we move 32-bit architectures over to useing 64-bit time_t based syscalls, we need seperate entry points for the old 32-bit based interfaces. This changes the #ifdef guards for the existing 32-bit compat syscalls to check for CONFIG_COMPAT_32BIT_TIME instead, which will then be enabled on all existing 32-bit architectures. Signed-off-by:
Arnd Bergmann <arnd@arndb.de>
-
Arnd Bergmann authored
This is a preparatation for changing over __kernel_timespec to 64-bit times, which involves assigning new system call numbers for mq_timedsend(), mq_timedreceive() and semtimedop() for compatibility with future y2038 proof user space. The existing ABIs will remain available through compat code. Signed-off-by:
Arnd Bergmann <arnd@arndb.de>
-
- Mar 25, 2018
-
-
Eric W. Biederman authored
This reverts commit 36735a6a. Aleksa Sarai <asarai@suse.de> writes: > [REGRESSION v4.16-rc6] [PATCH] mqueue: forbid unprivileged user access to internal mount > > Felix reported weird behaviour on 4.16.0-rc6 with regards to mqueue[1], > which was introduced by 36735a6a ("mqueue: switch to on-demand > creation of internal mount"). > > Basically, the reproducer boils down to being able to mount mqueue if > you create a new user namespace, even if you don't unshare the IPC > namespace. > > Previously this was not possible, and you would get an -EPERM. The mount > is the *host* mqueue mount, which is being cached and just returned from > mqueue_mount(). To be honest, I'm not sure if this is safe or not (or if > it was intentional -- since I'm not familiar with mqueue). > > To me it looks like there is a missing permission check. I've included a > patch below that I've compile-tested, and should block the above case. > Can someone please tell me if I'm missing something? Is this actually > safe? > > [1]: https://github.com/docker/docker/issues/36674 The issue is a lot deeper than a missing permission check. sb->s_user_ns was is improperly set as well. So in addition to the filesystem being mounted when it should not be mounted, so things are not allow that should be. We are practically to the release of 4.16 and there is no agreement between Al Viro and myself on what the code should looks like to fix things properly. So revert the code to what it was before so that we can take our time and discuss this properly. Fixes: 36735a6a ("mqueue: switch to on-demand creation of internal mount") Reported-by:
Felix Abecassis <fabecassis@nvidia.com> Reported-by:
Aleksa Sarai <asarai@suse.de> Signed-off-by:
"Eric W. Biederman" <ebiederm@xmission.com>
-
- Feb 11, 2018
-
-
Linus Torvalds authored
This is the mindless scripted replacement of kernel use of POLL* variables as described by Al, done by this script: for V in IN OUT PRI ERR RDNORM RDBAND WRNORM WRBAND HUP RDHUP NVAL MSG; do L=`git grep -l -w POLL$V | grep -v '^t' | grep -v /um/ | grep -v '^sa' | grep -v '/poll.h$'|grep -v '^D'` for f in $L; do sed -i "-es/^\([^\"]*\)\(\<POLL$V\>\)/\\1E\\2/" $f; done done with de-mangling cleanups yet to come. NOTE! On almost all architectures, the EPOLL* constants have the same values as the POLL* constants do. But they keyword here is "almost". For various bad reasons they aren't the same, and epoll() doesn't actually work quite correctly in some cases due to this on Sparc et al. The next patch from Al will sort out the final differences, and we should be all done. Scripted-by:
Al Viro <viro@zeniv.linux.org.uk> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
- Feb 07, 2018
-
-
Jonathan Haws authored
Previous behavior added tasks to the work queue using the static_prio value instead of the dynamic priority value in prio. This caused RT tasks to be added to the work queue in a FIFO manner rather than by priority. Normal tasks were handled by priority. This fix utilizes the dynamic priority of the task to ensure that both RT and normal tasks are added to the work queue in priority order. Utilizing the dynamic priority (prio) rather than the base priority (normal_prio) was chosen to ensure that if a task had a boosted priority when it was added to the work queue, it would be woken sooner to to ensure that it releases any other locks it may be holding in a more timely manner. It is understood that the task could have a lower priority when it wakes than when it was added to the queue in this (unlikely) case. Link: http://lkml.kernel.org/r/1513006652-7014-1-git-send-email-jhaws@sdl.usu.edu Signed-off-by:
Jonathan Haws <jhaws@sdl.usu.edu> Reviewed-by:
Steven Rostedt (VMware) <rostedt@goodmis.org> Reviewed-by:
Davidlohr Bueso <dbueso@suse.de> Cc: Ingo Molnar <mingo@kernel.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Deepa Dinamani <deepa.kernel@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
- Jan 12, 2018
-
-
Eric W. Biederman authored
Call clear_siginfo to ensure stack allocated siginfos are fully initialized before being passed to the signal sending functions. This ensures that if there is the kind of confusion documented by TRAP_FIXME, FPE_FIXME, or BUS_FIXME the kernel won't send unitialized data to userspace when the kernel generates a signal with SI_USER but the copy to userspace assumes it is a different kind of signal, and different fields are initialized. This also prepares the way for turning copy_siginfo_to_user into a copy_to_user, by removing the need in many cases to perform a field by field copy simply to skip the uninitialized fields. Signed-off-by:
"Eric W. Biederman" <ebiederm@xmission.com>
-
- Jan 05, 2018
-
-
Al Viro authored
Instead of doing that upon each ipcns creation, we do that the first time mq_open(2) or mqueue mount is done in an ipcns. What's more, doing that allows to get rid of mount_ns() use - we can go with considerably cheaper mount_nodev(), avoiding the loop over all mqueue superblock instances; ipcns->mq_mnt is used to locate preexisting instance in O(1) time instead of O(instances) mount_ns() would've cost us. Based upon the version by Giuseppe Scrivano <gscrivan@redhat.com>; I've added handling of userland mqueue mounts (original had been broken in that area) and added a switch to mount_nodev(). Signed-off-by:
Al Viro <viro@zeniv.linux.org.uk>
-