  1. Feb 26, 2023
    • io_uring/poll: allow some retries for poll triggering spuriously · c16bda37
      Jens Axboe authored
      If we get woken spuriously when polling and fail the operation with
      -EAGAIN again, then we generally only allow polling again if data
      had been transferred at some point. This is indicated with
      REQ_F_PARTIAL_IO. However, if the spurious poll triggers when the socket
      was originally empty, then we haven't transferred data yet and we will
      fail the poll re-arm. This either punts the socket to io-wq if it's
      blocking, or it fails the request with -EAGAIN if not. Neither condition
      is desirable, as the former will slow things down, while the latter
      will confuse the application.
      
      We want to ensure that a repeated poll trigger doesn't lead to infinite
      work making no progress; that's what the REQ_F_PARTIAL_IO check was
      for. But it doesn't protect against a loop after the first receive, and
      it's unnecessarily strict if we started out with an empty socket.
      
      Add a somewhat random retry count, just to put an upper limit on the
      potential number of retries that will be done. This should be high enough
      that we won't really hit it in practice, unless something needs to be
      aborted anyway.
      
      Cc: stable@vger.kernel.org # v5.10+
      Link: https://github.com/axboe/liburing/issues/364
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
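
      For illustration only, a minimal userspace C sketch of the idea described
      above: allow a poll re-arm on a spurious trigger even without prior data
      transfer, but cap how many such retries are allowed. The names
      APOLL_MAX_RETRY, struct apoll_state and may_rearm_poll are assumptions
      made for this sketch, not necessarily the identifiers used in the patch.

      #include <stdbool.h>

      /* Arbitrary upper bound; high enough that real workloads won't hit it. */
      #define APOLL_MAX_RETRY 128

      struct apoll_state {
              bool partial_io;        /* some data was transferred at some point */
              int retries;            /* starts at APOLL_MAX_RETRY */
      };

      /* Decide whether a request that failed with -EAGAIN may re-arm its poll. */
      static bool may_rearm_poll(struct apoll_state *apoll)
      {
              if (apoll->partial_io)
                      return true;            /* progress was made before, keep polling */
              return --apoll->retries > 0;    /* empty so far: retry a bounded number of times */
      }
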
  2. Jan 12, 2023
    • io_uring/poll: attempt request issue after racy poll wakeup · 6e5aedb9
      Jens Axboe authored
      
      If we have multiple requests waiting on the same target poll waitqueue,
      then it's quite possible to get a request triggered and get disappointed
      in not being able to make any progress with it. If we race in doing so,
      we'll potentially leave the poll request on the internal tables, but
      removed from the waitqueue. That means that any subsequent trigger of
      the poll waitqueue will not kick that request into action, causing an
      application to potentially wait for completion of a request that will
      never happen.
      
      Fix this by adding a new poll return state, IOU_POLL_REISSUE. Rather
      than have complicated logic for how to re-arm a given type of request,
      just punt it for a reissue.
      
      While in there, move the 'ret' variable to the only section where it
      gets used. This avoids confusion about its scope.
      
      Cc: stable@vger.kernel.org
      Fixes: eb0089d6 ("io_uring: single shot poll removal optimisation")
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
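
      A rough sketch of the control flow this introduces, written as a small
      userspace C model rather than the kernel code itself: the poll check
      gains a distinct IOU_POLL_REISSUE state, and the task-work handler punts
      such a request back to issue instead of carrying per-opcode re-arm logic.

      #include <stdio.h>

      /* Model of the poll-check return states; IOU_POLL_REISSUE is the new one. */
      enum poll_state {
              IOU_POLL_DONE,          /* event consumed, complete the request */
              IOU_POLL_NO_ACTION,     /* nothing to do, leave the poll armed */
              IOU_POLL_REISSUE,       /* lost a wakeup race, issue the request again */
      };

      static void poll_task_work(enum poll_state ret)
      {
              switch (ret) {
              case IOU_POLL_NO_ACTION:
                      return;
              case IOU_POLL_REISSUE:
                      puts("reissue request");        /* punt back to the issue path */
                      return;
              case IOU_POLL_DONE:
                      puts("complete request");
                      return;
              }
      }
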
  3. Nov 25, 2022
    • io_uring/poll: fix poll_refs race with cancelation · 12ad3d2d
      Lin Ma authored
      
      There is an interesting race condition of poll_refs which could result
      in a NULL pointer dereference. The crash trace is like:
      
      KASAN: null-ptr-deref in range [0x0000000000000008-0x000000000000000f]
      CPU: 0 PID: 30781 Comm: syz-executor.2 Not tainted 6.0.0-g493ffd6605b2 #1
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
      1.13.0-1ubuntu1.1 04/01/2014
      RIP: 0010:io_poll_remove_entry io_uring/poll.c:154 [inline]
      RIP: 0010:io_poll_remove_entries+0x171/0x5b4 io_uring/poll.c:190
      Code: ...
      RSP: 0018:ffff88810dfefba0 EFLAGS: 00010202
      RAX: 0000000000000001 RBX: 0000000000000000 RCX: 0000000000040000
      RDX: ffffc900030c4000 RSI: 000000000003ffff RDI: 0000000000040000
      RBP: 0000000000000008 R08: ffffffff9764d3dd R09: fffffbfff3836781
      R10: fffffbfff3836781 R11: 0000000000000000 R12: 1ffff11003422d60
      R13: ffff88801a116b04 R14: ffff88801a116ac0 R15: dffffc0000000000
      FS:  00007f9c07497700(0000) GS:ffff88811a600000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007ffb5c00ea98 CR3: 0000000105680005 CR4: 0000000000770ef0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      PKRU: 55555554
      Call Trace:
       <TASK>
       io_apoll_task_func+0x3f/0xa0 io_uring/poll.c:299
       handle_tw_list io_uring/io_uring.c:1037 [inline]
       tctx_task_work+0x37e/0x4f0 io_uring/io_uring.c:1090
       task_work_run+0x13a/0x1b0 kernel/task_work.c:177
       get_signal+0x2402/0x25a0 kernel/signal.c:2635
       arch_do_signal_or_restart+0x3b/0x660 arch/x86/kernel/signal.c:869
       exit_to_user_mode_loop kernel/entry/common.c:166 [inline]
       exit_to_user_mode_prepare+0xc2/0x160 kernel/entry/common.c:201
       __syscall_exit_to_user_mode_work kernel/entry/common.c:283 [inline]
       syscall_exit_to_user_mode+0x58/0x160 kernel/entry/common.c:294
       entry_SYSCALL_64_after_hwframe+0x63/0xcd
      
      The root cause is a small oversight in io_poll_check_events() when it
      runs concurrently with the poll cancel routine io_poll_cancel_req().
      
      The interleaving to trigger use-after-free:
      
      CPU0                                       |  CPU1
                                                 |
      io_apoll_task_func()                       |  io_poll_cancel_req()
       io_poll_check_events()                    |
        // do while first loop                   |
        v = atomic_read(...)                     |
        // v = poll_refs = 1                     |
        ...                                      |  io_poll_mark_cancelled()
                                                 |   atomic_or()
                                                 |   // poll_refs = IO_POLL_CANCEL_FLAG | 1
                                                 |
        atomic_sub_return(...)                   |
        // poll_refs = IO_POLL_CANCEL_FLAG       |
        // loop continue                         |
                                                 |
                                                 |  io_poll_execute()
                                                 |   io_poll_get_ownership()
                                                 |   // poll_refs = IO_POLL_CANCEL_FLAG | 1
                                                 |   // gets the ownership
        v = atomic_read(...)                     |
        // poll_refs not change                  |
                                                 |
        if (v & IO_POLL_CANCEL_FLAG)             |
         return -ECANCELED;                      |
        // io_poll_check_events return           |
        // will go into                          |
        // io_req_complete_failed() free req     |
                                                 |
                                                 |  io_apoll_task_func()
                                                 |  // also go into io_req_complete_failed()
      
      And the interleaving to trigger the kernel WARNING:
      
      CPU0                                       |  CPU1
                                                 |
      io_apoll_task_func()                       |  io_poll_cancel_req()
       io_poll_check_events()                    |
        // do while first loop                   |
        v = atomic_read(...)                     |
        // v = poll_refs = 1                     |
        ...                                      |  io_poll_mark_cancelled()
                                                 |   atomic_or()
                                                 |   // poll_refs = IO_POLL_CANCEL_FLAG | 1
                                                 |
        atomic_sub_return(...)                   |
        // poll_refs = IO_POLL_CANCEL_FLAG       |
        // loop continue                         |
                                                 |
        v = atomic_read(...)                     |
        // v = IO_POLL_CANCEL_FLAG               |
                                                 |  io_poll_execute()
                                                 |   io_poll_get_ownership()
                                                 |   // poll_refs = IO_POLL_CANCEL_FLAG | 1
                                                 |   // gets the ownership
                                                 |
        WARN_ON_ONCE(!(v & IO_POLL_REF_MASK)))   |
        // v & IO_POLL_REF_MASK = 0 WARN         |
                                                 |
                                                 |  io_apoll_task_func()
                                                 |  // also go into io_req_complete_failed()
      
      After reading the source code and discussing it with Pavel, it turns
      out that the atomic poll refs scheme keeps the io_poll_check_events()
      loop going only so that nobody else can grab the ownership in the
      meantime. Therefore, this patch simply adds another AND operation to
      make sure the loop stops once poll_refs is exactly equal to
      IO_POLL_CANCEL_FLAG. Since io_poll_cancel_req() grabs ownership and
      eventually makes its way to io_req_complete_failed(), the req will be
      reclaimed as expected.
      
      Fixes: aa43477b ("io_uring: poll rework")
      Signed-off-by: Lin Ma <linma@zju.edu.cn>
      Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
      [axboe: tweak description and code style]
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
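
      A sketch of the shape of the fix, modeled with C11 atomics in userspace:
      the loop condition additionally masks poll_refs with IO_POLL_REF_MASK,
      so a value that is exactly IO_POLL_CANCEL_FLAG terminates the loop
      instead of being treated as an outstanding reference. The bit positions
      and the helper name here are assumptions made for this sketch.

      #include <stdatomic.h>
      #include <stdbool.h>

      #define IO_POLL_CANCEL_FLAG     (1u << 31)
      #define IO_POLL_REF_MASK        ((1u << 20) - 1)  /* illustrative width */

      /* One pass of the event-check loop; returns true if the caller must loop again. */
      static bool poll_check_events_step(atomic_uint *poll_refs)
      {
              unsigned int v = atomic_load(poll_refs);

              if (v & IO_POLL_CANCEL_FLAG)
                      return false;   /* cancelled: stop and let the request be failed */

              /* ... handle the events covered by the references we observed ... */

              /*
               * Drop the refs we consumed. Masking the result with
               * IO_POLL_REF_MASK is the fix: if only IO_POLL_CANCEL_FLAG remains
               * set, the loop ends rather than spinning on a value that holds no
               * real references.
               */
              return ((atomic_fetch_sub(poll_refs, v & IO_POLL_REF_MASK)
                       - (v & IO_POLL_REF_MASK)) & IO_POLL_REF_MASK) != 0;
      }
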
    • io_uring: make poll refs more robust · a26a35e9
      Pavel Begunkov authored
      
      poll_refs carries two functions: the first is ownership over the
      request; the second is notifying io_poll_check_events() that there was
      an event but the wake up couldn't grab the ownership, so
      io_poll_check_events() should retry.
      
      We want to make poll_refs more robust against overflows. Instead of
      always incrementing it, which covers two purposes with one atomic,
      check if poll_refs is elevated enough and, if so, set a retry flag
      without attempting to grab ownership. The gap between the bias check
      and the following atomics may seem racy, but we don't need it to be
      strict. Moreover, there can be at most 4 parallel updates: by the first
      and the second poll entries, __io_arm_poll_handler() and cancellation.
      Of those four, only poll wake ups may be executed multiple times, but
      they're protected by a spinlock.
      
      Cc: stable@vger.kernel.org
      Reported-by: Lin Ma <linma@zju.edu.cn>
      Fixes: aa43477b ("io_uring: poll rework")
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Link: https://lore.kernel.org/r/c762bc31f8683b3270f3587691348a7119ef9c9d.1668963050.git.asml.silence@gmail.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
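
      A userspace C sketch of the approach, with illustrative constants (the
      real bias value, flag bits and helper names may differ): ownership
      acquisition checks whether poll_refs is already elevated past a bias
      and, if so, only sets a retry flag instead of adding yet another
      reference.

      #include <stdatomic.h>
      #include <stdbool.h>

      #define IO_POLL_REF_MASK        ((1u << 20) - 1)  /* illustrative width */
      #define IO_POLL_RETRY_FLAG      (1u << 30)
      #define IO_POLL_REF_BIAS        128

      /* Returns true if the caller now owns the request. */
      static bool poll_get_ownership(atomic_uint *poll_refs)
      {
              /*
               * Refs are already suspiciously elevated: don't increment (which
               * could eventually overflow), just tell the current owner to
               * re-check events.
               */
              if (atomic_load(poll_refs) >= IO_POLL_REF_BIAS) {
                      atomic_fetch_or(poll_refs, IO_POLL_RETRY_FLAG);
                      return false;
              }
              /* Fast path: we own the request if there were no prior references. */
              return !(atomic_fetch_add(poll_refs, 1) & IO_POLL_REF_MASK);
      }
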
    • io_uring: cmpxchg for poll arm refs release · 2f389343
      Pavel Begunkov authored
      
      Replace atomically subtracting the ownership reference at the end of
      arming a poll with a cmpxchg. We try to release ownership by setting 0,
      assuming that poll_refs didn't change while we were arming. If it did
      change, we keep the ownership and use it to queue a tw, which is fully
      capable of processing all events and even tolerates spurious wake ups.
      
      It's a bit more elegant as we reduce races between setting the
      cancellation flag and getting refs with this release, and with that we
      don't have to worry about any kind of underflow. This is not the
      fastest path for polling, and the performance difference between
      cmpxchg and atomic dec there is usually negligible anyway.
      
      Cc: stable@vger.kernel.org
      Fixes: aa43477b ("io_uring: poll rework")
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Link: https://lore.kernel.org/r/0c95251624397ea6def568ff040cad2d7926fd51.1668963050.git.asml.silence@gmail.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
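
      A compact userspace C model of the release step described above (names
      are assumptions for this sketch): at the end of arming, try to cmpxchg
      poll_refs from 1 to 0; if that fails because the refs changed, keep the
      ownership and queue task work instead.

      #include <stdatomic.h>

      /* queue_tw stands in for queueing the request's task work. */
      static void poll_arm_release(atomic_uint *poll_refs, void (*queue_tw)(void))
      {
              unsigned int expected = 1;      /* only our own ownership reference */

              /*
               * Swap 1 -> 0 to release ownership. If poll_refs changed while we
               * were arming (an event fired or cancellation raced in), keep the
               * ownership and let task work process whatever happened; it
               * tolerates spurious wake ups, so this is always safe.
               */
              if (!atomic_compare_exchange_strong(poll_refs, &expected, 0u))
                      queue_tw();
      }
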
    • io_uring: add io_aux_cqe which allows deferred completion · 9b8c5475
      Dylan Yudaken authored
      
      Use the just introduced deferred post cqe completion state when
      possible in io_aux_cqe. If not possible, fall back to io_post_aux_cqe.

      This introduces a complication because of allow_overflow. For deferred
      completions we cannot know, without locking the completion_lock,
      whether it will overflow (and even if we locked it, another post could
      sneak in and cause this cqe to be in overflow).
      However, since overflow protection is mostly a best-effort defence in
      depth to prevent infinite loops of CQEs for poll, just checking the
      overflow bit is going to be good enough and will result in at most 16
      (the array size of deferred cqes) overflows.
      
      Suggested-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Dylan Yudaken <dylany@meta.com>
      Link: https://lore.kernel.org/r/20221124093559.3780686-6-dylany@meta.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
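
      A userspace C sketch of the fallback logic (the struct, field and helper
      names below are invented for illustration; only io_aux_cqe and
      io_post_aux_cqe are from the commit above): prefer the small deferred
      completion array when it is usable, apply only the cheap overflow-bit
      check there, and otherwise fall back to posting the CQE directly.

      #include <stdbool.h>
      #include <stdint.h>

      struct cqe { uint64_t user_data; int32_t res; uint32_t cflags; };

      #define DEFERRED_CQE_SLOTS 16   /* matches the "at most 16 overflows" note */

      struct ctx_model {
              bool locked;            /* completion state held, deferral possible */
              bool cq_overflowed;     /* best-effort overflow bit */
              int nr_deferred;
              struct cqe deferred[DEFERRED_CQE_SLOTS];
      };

      /* Stub for the direct posting path (io_post_aux_cqe in the kernel). */
      static bool post_aux_cqe(struct ctx_model *ctx, struct cqe c, bool allow_overflow)
      {
              (void)ctx; (void)c;
              return allow_overflow;  /* placeholder */
      }

      static bool aux_cqe(struct ctx_model *ctx, struct cqe c, bool allow_overflow)
      {
              /* Deferred path: only the overflow bit can be checked without locking. */
              if (ctx->locked && ctx->nr_deferred < DEFERRED_CQE_SLOTS &&
                  (allow_overflow || !ctx->cq_overflowed)) {
                      ctx->deferred[ctx->nr_deferred++] = c;
                      return true;
              }
              return post_aux_cqe(ctx, c, allow_overflow);    /* fall back */
      }
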
    • io_uring: defer all io_req_complete_failed · 973fc83f
      Dylan Yudaken authored
      
      All failures happen under lock now, and can be deferred. To be
      consistent when the failure has happened after some multishot cqe has
      been deferred (and to keep ordering), always defer failures.

      To make this obvious at the caller (and to help prevent a future bug),
      rename io_req_complete_failed to io_req_defer_failed.
      
      Signed-off-by: Dylan Yudaken <dylany@meta.com>
      Link: https://lore.kernel.org/r/20221124093559.3780686-4-dylany@meta.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: always lock in io_apoll_task_func · c06c6c5d
      Dylan Yudaken authored
      
      This is required for the failure case (io_req_complete_failed) and is
      missing.
      
      The alternative would be to only lock in the failure path; however,
      all of the non-error paths in io_poll_check_events that do not return
      IOU_POLL_NO_ACTION end up locking anyway. The only extraneous lock
      would be for multishot poll overflowing the CQE ring; however,
      multishot poll would probably benefit from being locked as it will
      allow completions to be batched.
      
      So it seems reasonable to lock always.
      
      Signed-off-by: Dylan Yudaken <dylany@meta.com>
      Link: https://lore.kernel.org/r/20221124093559.3780686-3-dylany@meta.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
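
      A small userspace C model of the resulting control flow (names invented
      for the sketch): once the poll check reports anything other than
      IOU_POLL_NO_ACTION, the lock is taken unconditionally, so the failure
      path is covered and completions can be batched under the same lock.

      #include <stdbool.h>
      #include <stdio.h>

      enum poll_state { IOU_POLL_NO_ACTION, IOU_POLL_DONE, IOU_POLL_FAILED };

      static void apoll_task_func_model(enum poll_state ret, bool *locked)
      {
              if (ret == IOU_POLL_NO_ACTION)
                      return;                         /* nothing to do, no lock needed */

              if (!*locked) {                         /* lock unconditionally from here on */
                      *locked = true;
                      puts("ctx lock acquired");
              }

              if (ret == IOU_POLL_FAILED)
                      puts("defer failed completion");        /* needs the lock */
              else
                      puts("complete request");               /* batched under the lock */
      }
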