- Feb 22, 2023
-
-
Pavel Begunkov authored
There are no users of wq_list_merge, kill it. Signed-off-by:
Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/5f9ad0301949213230ad9000a8359d591aae615a.1677002255.git.asml.silence@gmail.com Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
- Jul 25, 2022
-
-
Pavel Begunkov authored
It's annoying to have io-wq.h as a dependency every time we want some of struct io_wq_work_list helpers, move them into a separate file. Signed-off-by:
Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/c1d891ce12b30767d1d2a3b7db2ca3abc1ecc4a2.1655802465.git.asml.silence@gmail.com Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Pavel Begunkov authored
Move io_uring types to linux/include, need them public so tracing can see the definitions and we can clean trace/events/io_uring.h Signed-off-by:
Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/a15f12e8cb7289b2de0deaddcc7518d98a132d17.1655384063.git.asml.silence@gmail.com Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Pavel Begunkov authored
Add a comment on why we keep ->cancel_seq in struct io_wq_work instead of struct io_kiocb despite it needed only by io_uring but not io-wq. Signed-off-by:
Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/988e87eec9dc700b5dae933df3aefef303502f6c.1655310733.git.asml.silence@gmail.com Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Jens Axboe authored
In preparation for splitting io_uring up a bit, move it into its own top level directory. It didn't really belong in fs/ anyway, as it's not a file system only API. This adds io_uring/ and moves the core files in there, and updates the MAINTAINERS file for the new location. Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
- Apr 25, 2022
-
-
Jens Axboe authored
The current cancelation will lookup and cancel the first request it finds based on the key passed in. Add a flag that allows to cancel any request that matches they key. It completes with the number of requests found and canceled, or res < 0 if an error occured. Signed-off-by:
Jens Axboe <axboe@kernel.dk> Link: https://lore.kernel.org/r/20220418164402.75259-4-axboe@kernel.dk
-
- Apr 12, 2022
-
-
Jens Axboe authored
There are two reasons why this isn't the best idea: - It's an odd area to grab a bit of storage space, hence it's an odd area to grab storage from. - It puts the 3rd io_kiocb cacheline into the hot path, where normal hot path just needs the first two. Use 'cflags' for joint fd/cflags storage. We only need fd until we successfully issue, and we only need cflags once a request is done and is completed. Fixes: 6bf9c47a ("io_uring: defer file assignment") Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
- Apr 07, 2022
-
-
Jens Axboe authored
If an application uses direct open or accept, it knows in advance what direct descriptor value it will get as it picks it itself. This allows combined requests such as: sqe = io_uring_get_sqe(ring); io_uring_prep_openat_direct(sqe, ..., file_slot); sqe->flags |= IOSQE_IO_LINK | IOSQE_CQE_SKIP_SUCCESS; sqe = io_uring_get_sqe(ring); io_uring_prep_read(sqe,file_slot, buf, buf_size, 0); sqe->flags |= IOSQE_FIXED_FILE; io_uring_submit(ring); where we prepare both a file open and read, and only get a completion event for the read when both have completed successfully. Currently links are fully prepared before the head is issued, but that fails if the dependent link needs a file assigned that isn't valid until the head has completed. Conversely, if the same chain is performed but the fixed file slot is already valid, then we would be unexpectedly returning data from the old file slot rather than the newly opened one. Make sure we're consistent here. Allow deferral of file setup, which makes this documented case work. Cc: stable@vger.kernel.org # v5.15+ Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
- Jan 08, 2022
-
-
Eric W. Biederman authored
The point of using set_child_tid to hold the kthread pointer was that it already did what is necessary. There are now restrictions on when set_child_tid can be initialized and when set_child_tid can be used in schedule_tail. Which indicates that continuing to use set_child_tid to hold the kthread pointer is a bad idea. Instead of continuing to use the set_child_tid field of task_struct generalize the pf_io_worker field of task_struct and use it to hold the kthread pointer. Rename pf_io_worker (which is a void * pointer) to worker_private so it can be used to store kthreads struct kthread pointer. Update the kthread code to store the kthread pointer in the worker_private field. Remove the places where set_child_tid had to be dealt with carefully because kthreads also used it. Link: https://lkml.kernel.org/r/CAHk-=wgtFAA9SbVYg0gR1tqPMC17-NYcs0GQkaYg1bGhh1uJQQ@mail.gmail.com Link: https://lkml.kernel.org/r/87a6grvqy8.fsf_-_@email.froward.int.ebiederm.org Suggested-by:
Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by:
"Eric W. Biederman" <ebiederm@xmission.com>
-
- Dec 07, 2021
-
-
Hao Xu authored
add a helper to merge two wq_lists, it will be useful in the next patches. Reviewed-by:
Pavel Begunkov <asml.silence@gmail.com> Signed-off-by:
Hao Xu <haoxu@linux.alibaba.com> Link: https://lore.kernel.org/r/20211207093951.247840-2-haoxu@linux.alibaba.com Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
- Oct 19, 2021
-
-
Pavel Begunkov authored
The invariant of io_wq_work_list is that it's empty IFF ->first is NULL, so no need to initially set ->last. With now having more users of the list it may play a role, i.e. used in each tw iteration and on every completion flushing. Signed-off-by:
Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/c464ab5cab6e46a858c6d39c107e92b3b5291f13.1633373302.git.asml.silence@gmail.com Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Pavel Begunkov authored
Use single linked lists for keeping iopoll requests, takes less space, may be faster, but mostly will be of benefit for further patches. Signed-off-by:
Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/314033676b100cd485518c3bc55e1b95a0dcd71f.1632516769.git.asml.silence@gmail.com Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Pavel Begunkov authored
Apart from just using lists (i.e. io_wq_work_list), we also want to have stacks, which are a bit faster, and have some interoperability between them. Add a stack implementation based on io_wq_work_node and some helpers. Signed-off-by:
Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/5d3a412a5ac0d47e0f0499d70d2207d70a68925e.1632516769.git.asml.silence@gmail.com Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
- Aug 29, 2021
-
-
Jens Axboe authored
io-wq divides work into two categories: 1) Work that completes in a bounded time, like reading from a regular file or a block device. This type of work is limited based on the size of the SQ ring. 2) Work that may never complete, we call this unbounded work. The amount of workers here is just limited by RLIMIT_NPROC. For various uses cases, it's handy to have the kernel limit the maximum amount of pending workers for both categories. Provide a way to do with with a new IORING_REGISTER_IOWQ_MAX_WORKERS operation. IORING_REGISTER_IOWQ_MAX_WORKERS takes an array of two integers and sets the max worker count to what is being passed in for each category. The old values are returned into that same array. If 0 is being passed in for either category, it simply returns the current value. The value is capped at RLIMIT_NPROC. This actually isn't that important as it's more of a hint, if we're exceeding the value then our attempt to fork a new worker will fail. This happens naturally already if more than one node is in the system, as these values are per-node internally for io-wq. Reported-by:
Johannes Lundberg <johalun0@gmail.com> Link: https://github.com/axboe/liburing/issues/420 Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
- Aug 23, 2021
-
-
Pavel Begunkov authored
Prepare nodes that we're going to add before actually linking them, it's always safer and costs us nothing. Signed-off-by:
Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/f7e53f0c84c02ed6748c488ed0789b98f8cc6185.1628471125.git.asml.silence@gmail.com Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
- Jun 18, 2021
-
-
Pavel Begunkov authored
io-wq now doesn't have anything to do with creds now, so move ->creds from struct io_wq_work into request (aka struct io_kiocb). Signed-off-by:
Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/8520c72ab8b8f4b96db12a228a2ab4c094ae64e1.1623949695.git.asml.silence@gmail.com Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
- Jun 17, 2021
-
-
Jens Axboe authored
io-wq defaults to per-node masks for IO workers. This works fine by default, but isn't particularly handy for workloads that prefer more specific affinities, for either performance or isolation reasons. This adds IORING_REGISTER_IOWQ_AFF that allows the user to pass in a CPU mask that is then applied to IO thread workers, and an IORING_UNREGISTER_IOWQ_AFF that simply resets the masks back to the default of per-node. Note that no care is given to existing IO threads, they will need to go through a reschedule before the affinity is correct if they are already running or sleeping. Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
- May 26, 2021
-
-
Pavel Begunkov authored
There is an old problem with io-wq cancellation where requests should be killed and are in io-wq but are not discoverable, e.g. in @next_hashed or @linked vars of io_worker_handle_work(). It adds some unreliability to individual request canellation, but also may potentially get __io_uring_cancel() stuck. For instance: 1) An __io_uring_cancel()'s cancellation round have not found any request but there are some as desribed. 2) __io_uring_cancel() goes to sleep 3) Then workers wake up and try to execute those hidden requests that happen to be unbound. As we already cancel all requests of io-wq there, set IO_WQ_BIT_EXIT in advance, so preventing 3) from executing unbound requests. The workers will initially break looping because of getting a signal as they are threads of the dying/exec()'ing user task. Cc: stable@vger.kernel.org Signed-off-by:
Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/abfcf8c54cb9e8f7bfbad7e9a0cc5433cc70bdc2.1621781238.git.asml.silence@gmail.com Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
- Apr 12, 2021
-
-
Jens Axboe authored
io-wq relies on a manager thread to create/fork new workers, as needed. But there's really no strong need for it anymore. We have the following cases that fork a new worker: 1) Work queue. This is done from the task itself always, and it's trivial to create a worker off that path, if needed. 2) All workers have gone to sleep, and we have more work. This is called off the sched out path. For this case, use a task_work items to queue a fork-worker operation. 3) Hashed work completion. Don't think we need to do anything off this case. If need be, it could just use approach 2 as well. Part of this change is incrementing the running worker count before the fork, to avoid cases where we observe we need a worker and then queue creation of one. Then new work comes in, we fork a new one. That last queue operation should have waited for the previous worker to come up, it's quite possible we don't even need it. Hence move the worker running from before we fork it off to more efficiently handle that case. Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
- Mar 18, 2021
-
-
Stefan Metzmacher authored
Link: https://lore.kernel.org/r/8c1d14f3748105f4caeda01716d47af2fa41d11c.1615809009.git.metze@samba.org Signed-off-by:
Stefan Metzmacher <metze@samba.org> Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
- Mar 06, 2021
-
-
Jens Axboe authored
If we go async with a request, grab the creds that the task currently has assigned and make sure that the async side switches to them. This is handled in the same way that we do for registered personalities. Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
- Mar 05, 2021
-
-
Jens Axboe authored
This allows us to do task creation and setup without needing to use completions to try and synchronize with the starting thread. Get rid of the old io_wq_fork_thread() wrapper, and the 'wq' and 'worker' startup completion events - we can now do setup before the task is running. Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
- Mar 04, 2021
-
-
Jens Axboe authored
If we move it in there, then we no longer have to care about it in io-wq. This means we can drop the cred handling in io-wq, and we can drop the REQ_F_WORK_INITIALIZED flag and async init functions as that was the last user of it since we moved to the new workers. Then we can also drop io_wq_work->creds, and just hold the personality u16 in there instead. Suggested-by:
Pavel Begunkov <asml.silence@gmail.com> Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Jens Axboe authored
If we put the io-wq from io_uring, we really want it to exit. Provide a helper that does that for us. Couple that with not having the manager hold a reference to the 'wq' and the normal SQPOLL exit will tear down the io-wq context appropriate. On the io-wq side, our wq context is per task, so only the task itself is manipulating ->manager and hence it's safe to check and clear without any extra locking. We just need to ensure that the manager task stays around, in case it exits. Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
- Feb 25, 2021
-
-
Jens Axboe authored
exec will cancel any threads, including the ones that io-wq is using. This isn't a problem, in fact we'd prefer it to be that way since it means we know that any async work cancels naturally without having to handle it proactively. But it does mean that we need to setup a new manager, as the manager and workers are gone. Handle this at queue time, and cancel work if we fail. Since the manager can go away without us noticing, ensure that the manager itself holds a reference to the 'wq' as well. Rename io_wq_destroy() to io_wq_put() to reflect that. In the future we can now simplify exec cancelation handling, for now just leave it the same. Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Jens Axboe authored
Before the io-wq thread change, we maintained a hash work map and lock per-node per-ring. That wasn't ideal, as we really wanted it to be per ring. But now that we have per-task workers, the hash map ends up being just per-task. That'll work just fine for the normal case of having one task use a ring, but if you share the ring between tasks, then it's considerably worse than it was before. Make the hash map per ctx instead, which provides full per-ctx buffered write serialization on hashed writes. Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
- Feb 24, 2021
-
-
Jens Axboe authored
We're now just using fork like we would from userspace, so there's no need to try and impose extra restrictions or accounting on the user side of things. That's already being done for us. That also means we don't have to pass in the user_struct anymore, that's correctly inherited through ->creds on fork. Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
- Feb 22, 2021
-
-
Jens Axboe authored
We want to use this in io_uring proper as well, for the SQPOLL thread. Rename it from fork_thread() to io_wq_fork_thread(), and make it available through the io-wq.h header. Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Jens Axboe authored
We are no longer grabbing state, so no need to maintain an IO identity that we COW if there are changes. Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Jens Axboe authored
The async workers are siblings of the task itself, so by definition we have all the state that we need. Remove any of the state grabbing that we have, and requests flagging what they need. Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Jens Axboe authored
Instead of using regular kthread kernel threads, create kernel threads that are like a real thread that the task would create. This ensures that we get all the context that we need, without having to carry that state around. This greatly reduces the code complexity, and the risk of missing state for a given request type. With the move away from kthread, we can also dump everything related to assigned state to the new threads. Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Jens Axboe authored
We don't support attach anymore, so doesn't make sense to carry the use_refs reference count. Get rid of it. Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Jens Axboe authored
We hit this case when the task is exiting, and we need somewhere to do background cleanup of requests. Instead of relying on the io-wq task manager to do this work for us, just stuff it somewhere where we can safely run it ourselves directly. Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
- Feb 10, 2021
-
-
Jens Axboe authored
task_work is a LIFO list, due to how it's implemented as a lockless list. For long chains of task_work, this can be problematic as the first entry added is the last one processed. Similarly, we'd waste a lot of CPU cycles reversing this list. Wrap the task_work so we have a single task_work entry per task per ctx, and use that to run it in the right order. Signed-off-by:
Pavel Begunkov <asml.silence@gmail.com> Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
- Feb 04, 2021
-
-
Pavel Begunkov authored
Saving one lock/unlock for io-wq is not super important, but adds some ugliness in the code. More important, atomic decs not turning it to zero for some archs won't give the right ordering/barriers so the io_steal_work() may pretty easily get subtly and completely broken. Return back 2-step io-wq work exchange and clean it up. Signed-off-by:
Pavel Begunkov <asml.silence@gmail.com> Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
- Feb 01, 2021
-
-
Jens Axboe authored
It's no longer used as IORING_OP_CLOSE got rid for the need of flagging it as uncancelable, kill it of. Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
- Dec 20, 2020
-
-
Jens Axboe authored
io_uring no longer issues full cancelations on the io-wq, so remove any remnants of this code and the IO_WQ_BIT_CANCEL flag. Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
- Dec 18, 2020
-
-
Xiaoguang Wang authored
For the first time a req punted to io-wq, we'll initialize io_wq_work's list to be NULL, then insert req to io_wqe->work_list. If this req is not inserted into tail of io_wqe->work_list, this req's io_wq_work list will point to another req's io_wq_work. For splitted bio case, this req maybe inserted to io_wqe->work_list repeatedly, once we insert it to tail of io_wqe->work_list for the second time, now io_wq_work->list->next will be invalid pointer, which then result in many strang error, panic, kernel soft-lockup, rcu stall, etc. In my vm, kernel doest not have commit cc29e1bf ("block: disable iopoll for split bio"), below fio job can reproduce this bug steadily: [global] name=iouring-sqpoll-iopoll-1 ioengine=io_uring iodepth=128 numjobs=1 thread rw=randread direct=1 registerfiles=1 hipri=1 bs=4m size=100M runtime=120 time_based group_reporting randrepeat=0 [device] directory=/home/feiman.wxg/mntpoint/ # an ext4 mount point If we have commit cc29e1bf ("block: disable iopoll for split bio"), there will no splitted bio case for polled io, but I think we still to need to fix this list corruption, it also should maybe go to stable branchs. To fix this corruption, if a req is inserted into tail of io_wqe->work_list, initialize req->io_wq_work->list->next to bu NULL. Cc: stable@vger.kernel.org Signed-off-by:
Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Reviewed-by:
Pavel Begunkov <asml.silence@gmail.com> Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
- Dec 09, 2020
-
-
Pavel Begunkov authored
Instead of iterating over each request and cancelling it individually in io_uring_cancel_files(), try to cancel all matching requests and use ->inflight_list only to check if there anything left. In many cases it should be faster, and we can reuse a lot of code from task cancellation. Signed-off-by:
Pavel Begunkov <asml.silence@gmail.com> Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
- Oct 21, 2020
-
-
Jens Axboe authored
This one was missed in the earlier conversion, should be included like any of the other IO identity flags. Make sure we restore to RLIM_INIFITY when dropping the personality again. Fixes: 98447d65 ("io_uring: move io identity items into separate struct") Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-