  1. Feb 26, 2023
    • nfc: fix memory leak of se_io context in nfc_genl_se_io · 25ff6f8a
      Fedor Pchelkin authored
      
      The callback context for sending/receiving APDUs to/from the selected
      secure element is allocated inside nfc_genl_se_io and supposed to be
      eventually freed in se_io_cb callback function. However, there are several
      error paths where the bwi_timer is not charged to call se_io_cb later, and
      the cb_context is leaked.
      
      The patch proposes to free the cb_context explicitly on those error paths.
      
      At the moment we can't simply check the 'dev->ops->se_io()' return value, as it
      may be negative in both cases: when the timer was charged and when it was not.
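
The ownership rule described above can be sketched in plain userspace C. This is only an illustration under assumptions: `io_ctx`, `driver_se_io`, `fire_timer` and the `outcome` enum are hypothetical stand-ins, not the NFC driver API. The point is that the context is freed on the one error path where the completion callback will never run, while a negative return after the timer is armed leaves the free to the callback.

```c
#include <stdlib.h>

/* Hypothetical stand-in for the se_io callback context. */
struct io_ctx {
	int se_idx;
};

static int live_ctx;            /* contexts allocated and not yet freed */
static struct io_ctx *pending;  /* context owned by the armed "timer" */

static void ctx_free(struct io_ctx *ctx)
{
	free(ctx);
	live_ctx--;
}

/* Stands in for the timer expiring and running the completion callback. */
void fire_timer(void)
{
	if (pending) {
		ctx_free(pending);
		pending = NULL;
	}
}

enum outcome { IO_OK, IO_FAIL_BEFORE_ARM, IO_FAIL_AFTER_ARM };

/* Hypothetical driver op. Both failure outcomes return a negative value,
 * so the caller cannot tell from the return code who owns ctx. */
static int driver_se_io(struct io_ctx *ctx, enum outcome o)
{
	if (o == IO_FAIL_BEFORE_ARM) {
		ctx_free(ctx);  /* the callback will never run: free here */
		return -1;
	}
	pending = ctx;          /* timer armed: the callback owns ctx now */
	return o == IO_FAIL_AFTER_ARM ? -1 : 0;
}

/* Allocates the context and hands it over; one request in flight at a time. */
int submit_io(int se_idx, enum outcome o)
{
	struct io_ctx *ctx = malloc(sizeof(*ctx));

	if (!ctx)
		return -1;
	live_ctx++;
	ctx->se_idx = se_idx;
	return driver_se_io(ctx, o);
}

int live_contexts(void)
{
	return live_ctx;
}
```

Calling `submit_io()` with each outcome and then `fire_timer()` should leave `live_contexts()` at zero in all three cases.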
      
      Fixes: 5ce3f32b ("NFC: netlink: SE API implementation")
      Reported-by: <syzbot+df64c0a2e8d68e78a4fa@syzkaller.appspotmail.com>
      Signed-off-by: Fedor Pchelkin <pchelkin@ispras.ru>
      Signed-off-by: Alexey Khoroshilov <khoroshilov@ispras.ru>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/sched: cls_api: Move call to tcf_exts_miss_cookie_base_destroy() · 37e1f3ac
      Nathan Chancellor authored
      
      When CONFIG_NET_CLS_ACT is disabled:
      
        ../net/sched/cls_api.c:141:13: warning: 'tcf_exts_miss_cookie_base_destroy' defined but not used [-Wunused-function]
          141 | static void tcf_exts_miss_cookie_base_destroy(struct tcf_exts *exts)
              |             ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      
      Due to the way the code is structured, it is possible for a definition
      of tcf_exts_miss_cookie_base_destroy() to be present without actually
      being used. Its single callsite is in an '#ifdef CONFIG_NET_CLS_ACT'
      block but a definition will always be present in the file. The version
      of tcf_exts_miss_cookie_base_destroy() that actually does something
      depends on CONFIG_NET_TC_SKB_EXT, so the stub function is used in both
      CONFIG_NET_CLS_ACT=n and CONFIG_NET_CLS_ACT=y + CONFIG_NET_TC_SKB_EXT=n
      configurations.
      
      Move the call to tcf_exts_miss_cookie_base_destroy() in
      tcf_exts_destroy() out of the '#ifdef CONFIG_NET_CLS_ACT', so that it
      always appears used to the compiler, while not changing any behavior
      with any of the various configuration combinations.
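
A minimal sketch of the pattern, assuming two made-up macros (`CONFIG_ACT` and `CONFIG_SKB_EXT`) in place of the real kernel options and hypothetical function names. The helper always has a definition, real or stub, and its call sits outside the `#ifdef`, so the compiler sees it used in every configuration.

```c
/* CONFIG_ACT and CONFIG_SKB_EXT are hypothetical stand-ins for
 * CONFIG_NET_CLS_ACT and CONFIG_NET_TC_SKB_EXT; build with
 * -DCONFIG_ACT and/or -DCONFIG_SKB_EXT to try each combination. */
static int cookie_destroy_calls;
static int action_destroy_calls;

#if defined(CONFIG_ACT) && defined(CONFIG_SKB_EXT)
/* The version that actually does something. */
static void miss_cookie_destroy(void)
{
	cookie_destroy_calls++;
}
#else
/* Stub used in every other configuration. */
static void miss_cookie_destroy(void)
{
}
#endif

void exts_destroy(void)
{
	/* Outside the #ifdef: always referenced, so no -Wunused-function.
	 * When CONFIG_ACT is off this is the stub, so behavior is unchanged. */
	miss_cookie_destroy();
#ifdef CONFIG_ACT
	action_destroy_calls++;
#endif
}

int cookie_calls(void)
{
	return cookie_destroy_calls;
}

int action_calls(void)
{
	return action_destroy_calls;
}
```

With neither macro defined, `exts_destroy()` calls only the empty stub and both counters stay at zero.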
      
      Fixes: 80cd22c3 ("net/sched: cls_api: Support hardware miss to tc action")
      Signed-off-by: Nathan Chancellor <nathan@kernel.org>
      Reviewed-by: Simon Horman <simon.horman@corigine.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. Feb 24, 2023
  3. Feb 23, 2023
    • sctp: add a refcnt in sctp_stream_priorities to avoid a nested loop · 68ba4463
      Xin Long authored
      
      With this refcnt added in sctp_stream_priorities, we don't need to
      traverse all streams to check if the prio is used by other streams
      when freeing one stream's prio in sctp_sched_prio_free_sid(). This
      can avoid a nested loop (up to 65535 * 65535), which may cause a soft
      lockup, as Ying reported:
      
          watchdog: BUG: soft lockup - CPU#23 stuck for 26s! [ksoftirqd/23:136]
          Call Trace:
           <TASK>
           sctp_sched_prio_free_sid+0xab/0x100 [sctp]
           sctp_stream_free_ext+0x64/0xa0 [sctp]
           sctp_stream_free+0x31/0x50 [sctp]
           sctp_association_free+0xa5/0x200 [sctp]
      
      Note that this counter doesn't need to use the refcount_t type, as
      access to it is always protected by the sock lock.
      
      v1->v2:
       - add a check in sctp_sched_prio_set to avoid the possible prio_head
         refcnt overflow.
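
The idea can be sketched with a plain counter on a shared object. The names below (`prio_head`, `stream`, `users`) are hypothetical and simplified, not the SCTP structures. Each stream that points at a priority head holds one count, so freeing a stream's priority is a constant-time decrement rather than a scan over all other streams.

```c
#include <stdlib.h>

/* Hypothetical shared priority head; 'users' counts streams pointing at it.
 * A plain int is enough here because the sketch is single-threaded, which
 * mirrors the "always under the sock lock" argument above. */
struct prio_head {
	int prio;
	int users;
};

struct stream {
	struct prio_head *head;
};

static int heads_alive;

struct prio_head *prio_head_new(int prio)
{
	struct prio_head *h = malloc(sizeof(*h));

	if (!h)
		return NULL;
	h->prio = prio;
	h->users = 1;
	heads_alive++;
	return h;
}

struct prio_head *prio_head_get(struct prio_head *h)
{
	h->users++;
	return h;
}

static void prio_head_put(struct prio_head *h)
{
	if (--h->users == 0) {
		free(h);
		heads_alive--;
	}
}

/* O(1): no need to walk every other stream to see who still uses the head. */
void stream_free_prio(struct stream *s)
{
	if (s->head) {
		prio_head_put(s->head);
		s->head = NULL;
	}
}

int prio_heads_alive(void)
{
	return heads_alive;
}
```

Two streams sharing one head keep it alive until the second `stream_free_prio()` call.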
      
      Fixes: 9ed7bfc7 ("sctp: fix memory leak in sctp_stream_outq_migrate()")
      Reported-by: Ying Xu <yinxu@redhat.com>
      Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Link: https://lore.kernel.org/r/825eb0c905cb864991eba335f4a2b780e543f06b.1677085641.git.lucien.xin@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • ipv6: Add lwtunnel encap size of all siblings in nexthop calculation · 4cc59f38
      Lu Wei authored
      
      In function rt6_nlmsg_size(), the length of nexthop is calculated
      by multiplying the nexthop length of fib6_info by the number of
      siblings. However, if the fib6_info has no lwtunnel but the siblings
      have lwtunnels, the nexthop length is less than it should be, and
      it will trigger a warning in inet6_rt_notify() as follows:
      
      WARNING: CPU: 0 PID: 6082 at net/ipv6/route.c:6180 inet6_rt_notify+0x120/0x130
      ......
      Call Trace:
       <TASK>
       fib6_add_rt2node+0x685/0xa30
       fib6_add+0x96/0x1b0
       ip6_route_add+0x50/0xd0
       inet6_rtm_newroute+0x97/0xa0
       rtnetlink_rcv_msg+0x156/0x3d0
       netlink_rcv_skb+0x5a/0x110
       netlink_unicast+0x246/0x350
       netlink_sendmsg+0x250/0x4c0
       sock_sendmsg+0x66/0x70
       ___sys_sendmsg+0x7c/0xd0
       __sys_sendmsg+0x5d/0xb0
       do_syscall_64+0x3f/0x90
       entry_SYSCALL_64_after_hwframe+0x72/0xdc
      
      This bug can be reproduced by script:
      
      ip -6 addr add 2002::2/64 dev ens2
      ip -6 route add 100::/64 via 2002::1 dev ens2 metric 100
      
      for i in 10 20 30 40 50 60 70;
      do
      	ip link add link ens2 name ipv_$i type ipvlan
      	ip -6 addr add 2002::$i/64 dev ipv_$i
      	ifconfig ipv_$i up
      done
      
      for i in 10 20 30 40 50 60;
      do
      	ip -6 route append 100::/64 encap ip6 dst 2002::$i via 2002::1 dev ipv_$i metric 100
      done
      
      ip -6 route append 100::/64 via 2002::1 dev ipv_70 metric 100
      
      This patch fixes it by adding up the nexthop_len of every sibling using
      rt6_nh_nlmsg_size().
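
The difference between the two size calculations can be shown with a small model. The struct, the `NH_BASE_LEN` constant and the function names are hypothetical; only the arithmetic is the point. Multiplying the first nexthop's size by the sibling count under-estimates as soon as a later sibling carries an encap that the first one lacks.

```c
#include <stddef.h>

#define NH_BASE_LEN 16u  /* hypothetical fixed size of one nexthop entry */

struct nexthop {
	size_t encap_len;  /* 0 means no lwtunnel encap on this nexthop */
};

static size_t nh_size(const struct nexthop *nh)
{
	return NH_BASE_LEN + nh->encap_len;
}

/* Old approach: assume every sibling is as large as the first entry. */
size_t size_by_multiplying(const struct nexthop *nhs, size_t n)
{
	return n ? nh_size(&nhs[0]) * n : 0;
}

/* Fixed approach: add up the size of every sibling individually. */
size_t size_by_summing(const struct nexthop *nhs, size_t n)
{
	size_t total = 0;

	for (size_t i = 0; i < n; i++)
		total += nh_size(&nhs[i]);
	return total;
}
```

For one plain nexthop followed by two siblings with a 40-byte encap, multiplying gives 48 bytes while summing gives 128.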
      
      Fixes: beb1afac ("net: ipv6: Add support to dump multipath routes via RTA_MULTIPATH attribute")
      Signed-off-by: Lu Wei <luwei32@huawei.com>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Link: https://lore.kernel.org/r/20230222083629.335683-2-luwei32@huawei.com
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
  4. Feb 22, 2023
    • netfilter: x_tables: fix percpu counter block leak on error path when creating new netns · 0af8c09c
      Pavel Tikhomirov authored
      
      Here is the stack where we allocate percpu counter block:
      
        +-< __alloc_percpu
          +-< xt_percpu_counter_alloc
            +-< find_check_entry # {arp,ip,ip6}_tables.c
              +-< translate_table
      
      And it can be leaked on this code path:
      
        +-> ip6t_register_table
          +-> translate_table # allocates percpu counter block
          +-> xt_register_table # fails
      
      There is no freeing of the counter block when xt_register_table fails.
      Note: xt_percpu_counter_free should be called to free it like we do in
      do_replace through cleanup_entry helper (or in __ip6t_unregister_table).
      
      Probability of hitting this error path is low AFAICS (xt_register_table
      can only return ENOMEM here, as it is not replacing anything, as we are
      creating new netns, and it is hard to imagine that all previous
      allocations succeeded and after that one in xt_register_table failed).
      But it's worth fixing even the rare leak.
      
      Fixes: 71ae0dff ("netfilter: xtables: use percpu rule counters")
      Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • netfilter: ctnetlink: make event listener tracking global · fdf64911
      Florian Westphal authored
      
      Pernet tracking doesn't work correctly because another netns might have
      set NETLINK_LISTEN_ALL_NSID on its event socket.
      
      In this case it's expected that events originating in other net
      namespaces are also received.
      
      Making pernet-tracking work while also honoring NETLINK_LISTEN_ALL_NSID
      requires much more intrusive changes both in netlink and nfnetlink,
      f.e. adding a 'setsockopt' callback that lets nfnetlink know that the
      event socket entered (or left) ALL_NSID mode.
      
      Move to global tracking instead: if there is an event socket anywhere
      on the system, all net namespaces which have conntrack enabled and
      use autobind mode will allocate the ecache extension.
      
      netlink_has_listeners() returns false only if the given group has no
      subscribers in any net namespace, the 'net' argument passed to
      nfnetlink_has_listeners is only used to derive the protocol (nfnetlink),
      it has no other effect.
      
      For proper NETLINK_LISTEN_ALL_NSID-aware pernet tracking of event
      listeners a new netlink_has_net_listeners() is also needed.
      
      Fixes: 90d1daa4 ("netfilter: conntrack: add nf_conntrack_events autodetect mode")
      Reported-by: Bryce Kahle <bryce.kahle@datadoghq.com>
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • netfilter: xt_length: use skb len to match in length_mt6 · 05c07c0c
      Xin Long authored
      
      For IPv6 Jumbo packets, the ipv6_hdr(skb)->payload_len is always 0,
      and its real payload_len ( > 65535) is saved in hbh exthdr. With 0
      length for the jumbo packets, it may mismatch.
      
      To fix this, we can just use skb->len instead of parsing exthdrs, as
      the hbh exthdr parsing has been done before coming to length_mt6 in
      ip6_rcv_core() and br_validate_ipv6() and also the packet has been
      trimmed according to the correct IPv6 (ext)hdr length there, and skb
      len is trustable in length_mt6().
      
      Note that this patch is especially needed after the IPv6 BIG TCP was
      supported in kernel, which is using IPv6 Jumbo packets. Besides, to
      match the packets greater than 65535 more properly, a v1 revision of
      xt_length may be needed to extend "min, max" to u32 in the future,
      and for now the IPv6 Jumbo packets can be matched by:
      
        # ip6tables -m length ! --length 0:65535
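
A rough model of why the header field mismatches for jumbo packets. The `pkt` and `length_rule` structs and both function names are hypothetical simplifications, not the xt_length code. It compares a length derived from the 16-bit payload_len field with one taken from the full packet length.

```c
#include <stdbool.h>
#include <stdint.h>

#define IPV6_HDR_LEN 40u  /* size of the fixed IPv6 header */

/* Hypothetical, simplified view of a received packet. */
struct pkt {
	uint16_t payload_len;  /* header field; 0 for jumbo packets */
	uint32_t len;          /* real total length, like skb->len */
};

struct length_rule {
	uint16_t min, max;
	bool invert;  /* models '!' in front of --length */
};

/* Old approach: derive the length from the 16-bit header field. */
bool match_by_header(const struct pkt *p, const struct length_rule *r)
{
	uint16_t pktlen = (uint16_t)(p->payload_len + IPV6_HDR_LEN);
	bool in_range = pktlen >= r->min && pktlen <= r->max;

	return in_range != r->invert;
}

/* Fixed approach: use the full packet length. */
bool match_by_len(const struct pkt *p, const struct length_rule *r)
{
	bool in_range = p->len >= r->min && p->len <= r->max;

	return in_range != r->invert;
}
```

For a 70000-byte jumbo packet and the inverted rule 0:65535, `match_by_header()` sees a length of 40 and does not match, while `match_by_len()` matches as intended.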
      
      Fixes: 7c4e983c ("net: allow gso_max_size to exceed 65536")
      Fixes: 0fe79f28 ("net: allow gro_max_size to exceed 65536")
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • netfilter: ebtables: fix table blob use-after-free · e58a171d
      Florian Westphal authored
      
      We are not allowed to return an error at this point.
      Looking at the code, it looks like ret is always 0 at this
      point, but it's not.
      
      t = find_table_lock(net, repl->name, &ret, &ebt_mutex);
      
      ... this can return a valid table, with ret != 0.
      
      This bug causes update of table->private with the new
      blob, but then frees the blob right away in the caller.
      
      Syzbot report:
      
      BUG: KASAN: vmalloc-out-of-bounds in __ebt_unregister_table+0xc00/0xcd0 net/bridge/netfilter/ebtables.c:1168
      Read of size 4 at addr ffffc90005425000 by task kworker/u4:4/74
      Workqueue: netns cleanup_net
      Call Trace:
       kasan_report+0xbf/0x1f0 mm/kasan/report.c:517
       __ebt_unregister_table+0xc00/0xcd0 net/bridge/netfilter/ebtables.c:1168
       ebt_unregister_table+0x35/0x40 net/bridge/netfilter/ebtables.c:1372
       ops_exit_list+0xb0/0x170 net/core/net_namespace.c:169
       cleanup_net+0x4ee/0xb10 net/core/net_namespace.c:613
      ...
      
      ip(6)tables appears to be ok (ret should be 0 at this point) but make
      this more obvious.
      
      Fixes: c58dd2dd ("netfilter: Can't fail and free after table replacement")
      Reported-by: <syzbot+f61594de72d6705aea03@syzkaller.appspotmail.com>
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • netfilter: ip6t_rpfilter: Fix regression with VRF interfaces · efb056e5
      Phil Sutter authored
      
      When calling ip6_route_lookup() for the packet arriving on the VRF
      interface, the result is always the real (slave) interface. Expect this
      when validating the result.
      
      Fixes: acc641ab ("netfilter: rpfilter/fib: Populate flowic_l3mdev field")
      Signed-off-by: Phil Sutter <phil@nwl.cc>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • netfilter: conntrack: fix rmmod double-free race · e6d57e9f
      Florian Westphal authored
      
      nf_conntrack_hash_check_insert() callers free the ct entry directly, via
      nf_conntrack_free.
      
      This isn't safe anymore because
      nf_conntrack_hash_check_insert() might place the entry into the conntrack
      table and then delete the entry again because it found that a conntrack
      extension has been removed at the same time.
      
      In this case, the just-added entry is removed again and an error is
      returned to the caller.
      
      Problem is that another cpu might have picked up this entry and
      incremented its reference count.
      
      This results in a use-after-free/double-free, once by the other cpu and
      once by the caller of nf_conntrack_hash_check_insert().
      
      Fix this by making nf_conntrack_hash_check_insert() not fail anymore
      after the insertion, just like before the 'Fixes' commit.
      
      This is safe because a racing nf_ct_iterate() has to wait for us
      to release the conntrack hash spinlocks.
      
      While at it, make the function return -EAGAIN in the rmmod (genid
      changed) case, this makes nfnetlink replay the command (suggested
      by Pablo Neira).
      
      Fixes: c56716c6 ("netfilter: extensions: introduce extension genid count")
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • netfilter: ctnetlink: fix possible refcount leak in ctnetlink_create_conntrack() · ac489398
      Hangyu Hua authored
      
      nf_ct_put() needs to be called to drop the reference taken by
      nf_conntrack_find_get(), to avoid a refcount leak when
      nf_conntrack_hash_check_insert() fails.
      
      Fixes: 7d367e06 ("netfilter: ctnetlink: fix soft lockup when netlink adds new entries (v2)")
      Signed-off-by: Hangyu Hua <hbh25y@gmail.com>
      Acked-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
  5. Feb 21, 2023
  6. Feb 20, 2023
    • scm: add user copy checks to put_cmsg() · 5f1eb1ff
      Eric Dumazet authored
      
      This is a followup of commit 2558b803 ("net: use a bounce
      buffer for copying skb->mark")
      
      x86 and powerpc define user_access_begin, meaning
      that they are not able to perform user copy checks
      when using user_write_access_begin() / unsafe_copy_to_user()
      and friends [1]
      
      Instead of waiting for bugs to trigger on other arches,
      add a check_object_size() in put_cmsg() to make sure
      that new code tested on x86 with CONFIG_HARDENED_USERCOPY=y
      will perform more security checks.
      
      [1] We can not generically call check_object_size() from
      unsafe_copy_to_user() because UACCESS is enabled at this point.
      
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Kees Cook <keescook@chromium.org>
      Acked-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • devlink: drop leftover duplicate/unused code · fce10282
      Paolo Abeni authored
      
      The recent merge from net left over some unused code in
      leftover.c - nomen omen.
      
      Just drop the unused bits.
      
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Reviewed-by: Jiri Pirko <jiri@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: make default_rps_mask a per netns attribute · 50bcfe8d
      Paolo Abeni authored
      
      That really was meant to be a per netns attribute from the beginning.
      
      The idea is that once proper isolation is in place in the main
      namespace, additional demux in the child namespaces will be redundant.
      Let's make child netns default rps mask empty by default.
      
      To avoid bloating the netns with a possibly large cpumask, allocate
      it on-demand during the first write operation.
      
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • l2tp: Avoid possible recursive deadlock in l2tp_tunnel_register() · 9ca5e7ec
      Shigeru Yoshida authored
      
      When a file descriptor of pppol2tp socket is passed as file descriptor
      of UDP socket, a recursive deadlock occurs in l2tp_tunnel_register().
      This situation is reproduced by the following program:
      
      int main(void)
      {
      	int sock;
      	struct sockaddr_pppol2tp addr;
      
      	sock = socket(AF_PPPOX, SOCK_DGRAM, PX_PROTO_OL2TP);
      	if (sock < 0) {
      		perror("socket");
      		return 1;
      	}
      
      	addr.sa_family = AF_PPPOX;
      	addr.sa_protocol = PX_PROTO_OL2TP;
      	addr.pppol2tp.pid = 0;
      	addr.pppol2tp.fd = sock;
      	addr.pppol2tp.addr.sin_family = PF_INET;
      	addr.pppol2tp.addr.sin_port = htons(0);
      	addr.pppol2tp.addr.sin_addr.s_addr = inet_addr("192.168.0.1");
      	addr.pppol2tp.s_tunnel = 1;
      	addr.pppol2tp.s_session = 0;
      	addr.pppol2tp.d_tunnel = 0;
      	addr.pppol2tp.d_session = 0;
      
      	if (connect(sock, (const struct sockaddr *)&addr, sizeof(addr)) < 0) {
      		perror("connect");
      		return 1;
      	}
      
      	return 0;
      }
      
      This program causes the following lockdep warning:
      
       ============================================
       WARNING: possible recursive locking detected
       6.2.0-rc5-00205-gc96618275234 #56 Not tainted
       --------------------------------------------
       repro/8607 is trying to acquire lock:
       ffff8880213c8130 (sk_lock-AF_PPPOX){+.+.}-{0:0}, at: l2tp_tunnel_register+0x2b7/0x11c0
      
       but task is already holding lock:
       ffff8880213c8130 (sk_lock-AF_PPPOX){+.+.}-{0:0}, at: pppol2tp_connect+0xa82/0x1a30
      
       other info that might help us debug this:
        Possible unsafe locking scenario:
      
              CPU0
              ----
         lock(sk_lock-AF_PPPOX);
         lock(sk_lock-AF_PPPOX);
      
        *** DEADLOCK ***
      
        May be due to missing lock nesting notation
      
       1 lock held by repro/8607:
        #0: ffff8880213c8130 (sk_lock-AF_PPPOX){+.+.}-{0:0}, at: pppol2tp_connect+0xa82/0x1a30
      
       stack backtrace:
       CPU: 0 PID: 8607 Comm: repro Not tainted 6.2.0-rc5-00205-gc96618275234 #56
       Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.1-2.fc37 04/01/2014
       Call Trace:
        <TASK>
        dump_stack_lvl+0x100/0x178
        __lock_acquire.cold+0x119/0x3b9
        ? lockdep_hardirqs_on_prepare+0x410/0x410
        lock_acquire+0x1e0/0x610
        ? l2tp_tunnel_register+0x2b7/0x11c0
        ? lock_downgrade+0x710/0x710
        ? __fget_files+0x283/0x3e0
        lock_sock_nested+0x3a/0xf0
        ? l2tp_tunnel_register+0x2b7/0x11c0
        l2tp_tunnel_register+0x2b7/0x11c0
        ? sprintf+0xc4/0x100
        ? l2tp_tunnel_del_work+0x6b0/0x6b0
        ? debug_object_deactivate+0x320/0x320
        ? lockdep_init_map_type+0x16d/0x7a0
        ? lockdep_init_map_type+0x16d/0x7a0
        ? l2tp_tunnel_create+0x2bf/0x4b0
        ? l2tp_tunnel_create+0x3c6/0x4b0
        pppol2tp_connect+0x14e1/0x1a30
        ? pppol2tp_put_sk+0xd0/0xd0
        ? aa_sk_perm+0x2b7/0xa80
        ? aa_af_perm+0x260/0x260
        ? bpf_lsm_socket_connect+0x9/0x10
        ? pppol2tp_put_sk+0xd0/0xd0
        __sys_connect_file+0x14f/0x190
        __sys_connect+0x133/0x160
        ? __sys_connect_file+0x190/0x190
        ? lockdep_hardirqs_on+0x7d/0x100
        ? ktime_get_coarse_real_ts64+0x1b7/0x200
        ? ktime_get_coarse_real_ts64+0x147/0x200
        ? __audit_syscall_entry+0x396/0x500
        __x64_sys_connect+0x72/0xb0
        do_syscall_64+0x38/0xb0
        entry_SYSCALL_64_after_hwframe+0x63/0xcd
      
      This patch fixes the issue by getting/creating the tunnel before
      locking the pppol2tp socket.
      
      Fixes: 0b2c5972 ("l2tp: close all race conditions in l2tp_tunnel_register()")
      Cc: Cong Wang <cong.wang@bytedance.com>
      Signed-off-by: Shigeru Yoshida <syoshida@redhat.com>
      Reviewed-by: Guillaume Nault <gnault@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv6: icmp6: add drop reason support to icmpv6_echo_reply() · ac03694b
      Eric Dumazet authored
      
      Change icmpv6_echo_reply() to return a drop reason.
      
      For the moment, return NOT_SPECIFIED or SKB_CONSUMED.
      
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv6: icmp6: add SKB_DROP_REASON_IPV6_NDISC_NS_OTHERHOST · c34b8bb1
      Eric Dumazet authored
      
      Hosts can often receive neighbour discovery messages
      that are not for them.
      
      Use a dedicated drop reason to make clear the packet is dropped
      for this normal case.
      
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv6: icmp6: add SKB_DROP_REASON_IPV6_NDISC_BAD_OPTIONS · 784d4477
      Eric Dumazet authored
      
      This is a generic drop reason for any error detected
      in ndisc_parse_options().
      
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv6: icmp6: add drop reason support to ndisc_redirect_rcv() · ec993edf
      Eric Dumazet authored
      
      Change ndisc_redirect_rcv() to return a drop reason.
      
      For the moment, return PKT_TOO_SMALL, NOT_SPECIFIED
      and values from icmpv6_notify().
      
      More reasons are added later.
      
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv6: icmp6: add drop reason support to ndisc_router_discovery() · 2f326d9d
      Eric Dumazet authored
      
      Change ndisc_router_discovery() to return a drop reason.
      
      For the moment, return PKT_TOO_SMALL, NOT_SPECIFIED
      and SKB_CONSUMED.
      
      More reasons are added later.
      
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv6: icmp6: add drop reason support to ndisc_recv_rs() · 243e37c6
      Eric Dumazet authored
      
      Change ndisc_recv_rs() to return a drop reason.
      
      For the moment, return PKT_TOO_SMALL, NOT_SPECIFIED
      or SKB_CONSUMED. More reasons are added later.
      
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv6: icmp6: add drop reason support to ndisc_recv_na() · 3009f9ae
      Eric Dumazet authored
      
      Change ndisc_recv_na() to return a drop reason.
      
      For the moment, return PKT_TOO_SMALL, NOT_SPECIFIED
      or SKB_CONSUMED. More reasons are added later.
      
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv6: icmp6: add drop reason support to ndisc_recv_ns() · 7c9c8913
      Eric Dumazet authored
      
      Change ndisc_recv_ns() to return a drop reason.
      
      For the moment, return PKT_TOO_SMALL, NOT_SPECIFIED
      or SKB_CONSUMED.
      
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: add location to trace_consume_skb() · dd1b5278
      Eric Dumazet authored
      
      kfree_skb() includes the location, it makes sense
      to add it to consume_skb() as well.
      
      After patch:
      
       taskd_EventMana  8602 [004]   420.406239: skb:consume_skb: skbaddr=0xffff893a4a6d0500 location=unix_stream_read_generic
               swapper     0 [011]   422.732607: skb:consume_skb: skbaddr=0xffff89597f68cee0 location=mlx4_en_free_tx_desc
            discipline  9141 [043]   423.065653: skb:consume_skb: skbaddr=0xffff893a487e9c00 location=skb_consume_udp
               swapper     0 [010]   423.073166: skb:consume_skb: skbaddr=0xffff8949ce9cdb00 location=icmpv6_rcv
               borglet  8672 [014]   425.628256: skb:consume_skb: skbaddr=0xffff8949c42e9400 location=netlink_dump
               swapper     0 [028]   426.263317: skb:consume_skb: skbaddr=0xffff893b1589dce0 location=net_rx_action
                  wget 14339 [009]   426.686380: skb:consume_skb: skbaddr=0xffff893a51b552e0 location=tcp_rcv_state_process
      
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • xsk: support use vaddr as ring · 9f78bf33
      Xuan Zhuo authored
      
      When we try to start AF_XDP on machines that have been running for a
      long time, memory fragmentation can leave too little contiguous
      physical memory, and the start fails.
      
      If the size of the queue is 8 * 1024, then the size of the desc[] is
      8 * 1024 * 8 = 16 * PAGE, but we also add the struct xdp_ring size, so it
      is 16 pages+. This requires an order-5 allocation. If there are a lot
      of queues, that is difficult on these long-running machines.
      
      Here we actually waste 15 pages: order-5 memory is 32 pages, but
      we only use 17 pages.
      
      This patch replaces __get_free_pages() by vmalloc() to allocate memory
      to solve these problems.
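
The page arithmetic above can be checked with a short calculation. This is a sketch that assumes 4 KiB pages and a hypothetical 64-byte ring header; the real struct xdp_ring size is not given in the text, and any header smaller than a page leads to the same result. A physically contiguous allocation is rounded up to a power-of-two number of pages, which is where the waste comes from.

```c
#include <stddef.h>

#define PAGE_SZ        4096u  /* assumed page size */
#define RING_HDR_BYTES 64u    /* hypothetical ring header size */

/* Total bytes for a ring of 'entries' descriptors of 'entry_size' bytes. */
size_t ring_bytes(size_t entries, size_t entry_size)
{
	return RING_HDR_BYTES + entries * entry_size;
}

/* Pages actually needed to hold 'bytes'. */
unsigned int pages_needed(size_t bytes)
{
	return (unsigned int)((bytes + PAGE_SZ - 1) / PAGE_SZ);
}

/* Smallest order such that 2^order pages cover 'bytes'. */
unsigned int alloc_order(size_t bytes)
{
	unsigned int pages = pages_needed(bytes);
	unsigned int order = 0;

	while ((1u << order) < pages)
		order++;
	return order;
}

/* Pages allocated but unused when rounding up to a power of two. */
unsigned int pages_wasted(size_t bytes)
{
	return (1u << alloc_order(bytes)) - pages_needed(bytes);
}
```

For 8 * 1024 entries of 8 bytes each, this gives 17 pages needed, a 32-page block and 15 pages wasted. A virtually contiguous allocation needs no such rounding.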
      
      Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
      Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
      Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/smc: fix application data exception · 475f9ff6
      D. Wythe authored
      
      There is a certain probability that the following
      exceptions will occur in the wrk benchmark test:
      
      Running 10s test @ http://11.213.45.6:80
        8 threads and 64 connections
        Thread Stats   Avg      Stdev     Max   +/- Stdev
          Latency     3.72ms   13.94ms 245.33ms   94.17%
          Req/Sec     1.96k   713.67     5.41k    75.16%
        155262 requests in 10.10s, 23.10MB read
      Non-2xx or 3xx responses: 3
      
      The error turns out to be an HTTP 400 error, which is a serious
      exception in our test: it means the application data was
      corrupted.
      
      Consider the following scenarios:
      
      CPU0                            CPU1
      
      buf_desc->used = 0;
                                      cmpxchg(buf_desc->used, 0, 1)
                                      deal_with(buf_desc)
      
      memset(buf_desc->cpu_addr,0);
      
      This will cause the data received by a victim connection to be cleared,
      thus triggering an HTTP 400 error in the server.
      
      This patch swaps the order of clearing 'used' and the memset, and adds a
      barrier to ensure memory consistency.
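
A userspace analogue of the corrected ordering, using C11 atomics rather than the kernel's barrier primitives. The `buf_desc` layout and both function names are hypothetical. The buffer is cleared first, and only then is the 'used' flag released, so a thread that wins the compare-and-swap can never have its data wiped by a late memset.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical, simplified buffer descriptor. */
struct buf_desc {
	atomic_int used;
	char data[64];
};

/* Release a buffer: clear the memory first, then mark it free. The release
 * store keeps the memset from being reordered after the flag update. */
void buf_release(struct buf_desc *b)
{
	memset(b->data, 0, sizeof(b->data));
	atomic_store_explicit(&b->used, 0, memory_order_release);
}

/* Try to claim a free buffer; pairs with the release store above. */
bool buf_try_acquire(struct buf_desc *b)
{
	int expected = 0;

	return atomic_compare_exchange_strong_explicit(&b->used, &expected, 1,
						       memory_order_acquire,
						       memory_order_relaxed);
}
```

The release/acquire pair here plays the role that the added barrier plays in the patch; the exact kernel primitives used there are not shown in the text above.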
      
      Fixes: 1c552696 ("net/smc: Clear memory when release and reuse buffer")
      Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
      Reviewed-by: Wenjia Zhang <wenjia@linux.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/smc: fix potential panic dues to unprotected smc_llc_srv_add_link() · e40b801b
      D. Wythe authored
      
      There is a certain chance to trigger the following panic:
      
      PID: 5900   TASK: ffff88c1c8af4100  CPU: 1   COMMAND: "kworker/1:48"
       #0 [ffff9456c1cc79a0] machine_kexec at ffffffff870665b7
       #1 [ffff9456c1cc79f0] __crash_kexec at ffffffff871b4c7a
       #2 [ffff9456c1cc7ab0] crash_kexec at ffffffff871b5b60
       #3 [ffff9456c1cc7ac0] oops_end at ffffffff87026ce7
       #4 [ffff9456c1cc7ae0] page_fault_oops at ffffffff87075715
       #5 [ffff9456c1cc7b58] exc_page_fault at ffffffff87ad0654
       #6 [ffff9456c1cc7b80] asm_exc_page_fault at ffffffff87c00b62
          [exception RIP: ib_alloc_mr+19]
          RIP: ffffffffc0c9cce3  RSP: ffff9456c1cc7c38  RFLAGS: 00010202
          RAX: 0000000000000000  RBX: 0000000000000002  RCX: 0000000000000004
          RDX: 0000000000000010  RSI: 0000000000000000  RDI: 0000000000000000
          RBP: ffff88c1ea281d00   R8: 000000020a34ffff   R9: ffff88c1350bbb20
          R10: 0000000000000000  R11: 0000000000000001  R12: 0000000000000000
          R13: 0000000000000010  R14: ffff88c1ab040a50  R15: ffff88c1ea281d00
          ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
       #7 [ffff9456c1cc7c60] smc_ib_get_memory_region at ffffffffc0aff6df [smc]
       #8 [ffff9456c1cc7c88] smcr_buf_map_link at ffffffffc0b0278c [smc]
       #9 [ffff9456c1cc7ce0] __smc_buf_create at ffffffffc0b03586 [smc]
      
      The reason here is that when the server tries to create a second link,
      smc_llc_srv_add_link() has no protection and may add a new link to the
      link group. This breaks the protection that llc_conf_mutex is meant to
      provide.
      
      Fixes: 2d2209f2 ("net/smc: first part of add link processing as SMC server")
      Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
      Reviewed-by: Larysa Zaremba <larysa.zaremba@intel.com>
      Reviewed-by: Wenjia Zhang <wenjia@linux.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/sched: taprio: dynamic max_sdu larger than the max_mtu is unlimited · 64cb6aad
      Vladimir Oltean authored
      
      It makes no sense to keep randomly large max_sdu values, especially if
      larger than the device's max_mtu. These are visible in "tc qdisc show".
      Such a max_sdu is practically unlimited and will cause no packets for
      that traffic class to be dropped on enqueue.
      
      Just set max_sdu_dynamic to U32_MAX, which in the logic below causes
      taprio to save a max_frm_len of U32_MAX and a max_sdu presented to user
      space of 0 (unlimited).
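
The rule described above can be sketched in userspace C. This is a minimal illustration, not the kernel code: the struct `tc_limits`, the helper `resolve_limits()`, and the convention that `UINT32_MAX` means "no user-configured limit" are assumptions made for the example.

```c
#include <stdint.h>

/* Hypothetical container for the two per-traffic-class values. */
struct tc_limits {
    uint32_t max_frm_len; /* limit used on enqueue */
    uint32_t max_sdu;     /* value reported to user space, 0 = unlimited */
};

/*
 * Hypothetical helper. max_sdu_from_user == UINT32_MAX is assumed to mean
 * that the user did not configure a limit for this traffic class.
 */
static struct tc_limits resolve_limits(uint32_t max_sdu_dynamic,
                                       uint32_t max_sdu_from_user,
                                       uint32_t max_mtu,
                                       uint32_t hard_header_len)
{
    struct tc_limits out;
    uint32_t max_sdu;

    /* A dynamic limit above the device MTU can never drop a packet. */
    if (max_sdu_dynamic > max_mtu)
        max_sdu_dynamic = UINT32_MAX;

    /* The effective limit is the stricter of the two. */
    max_sdu = max_sdu_dynamic < max_sdu_from_user ? max_sdu_dynamic
                                                  : max_sdu_from_user;

    if (max_sdu != UINT32_MAX) {
        out.max_frm_len = max_sdu + hard_header_len;
        out.max_sdu = max_sdu;
    } else {
        out.max_frm_len = UINT32_MAX; /* nothing is ever oversized */
        out.max_sdu = 0;              /* shown as unlimited */
    }
    return out;
}
```

With a dynamic value of 9000 on a device whose max_mtu is 1500 and no user limit, this sketch yields a max_frm_len of UINT32_MAX and a reported max_sdu of 0.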
      
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      64cb6aad
    • net/sched: taprio: don't allow dynamic max_sdu to go negative after stab adjustment · bdf366bd
      Vladimir Oltean authored
      
      The overhead specified in the size table comes from the user. With small
      time intervals (or gates always closed), the overhead can be larger than
      the max interval for that traffic class, and their difference is
      negative.
      
      What we want to happen is for max_sdu_dynamic to have the smallest
      non-zero value possible (1), which means that all packets on that traffic
      class are dropped on enqueue. However, since max_sdu_dynamic is u32, a
      negative result wraps around to a large value and oversized frames are
      never dropped.
      
      Use max_t with int to clamp max_frm_len to no less than
      dev->hard_header_len + 1, which in turn makes max_sdu_dynamic no
      smaller than 1.
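
The wraparound and the clamp can be reproduced in a small userspace sketch. The `max_t()` macro below is a simplified stand-in for the kernel macro, and `naive_max_sdu()` and `clamped_max_sdu()` are hypothetical helpers written only for this illustration. Converting an out-of-range unsigned value to int is implementation-defined in ISO C; the sketch assumes the usual two's complement behaviour of GCC and Clang.

```c
#include <stdint.h>

/* Simplified userspace stand-in for the kernel's max_t(type, a, b). */
#define max_t(type, a, b) ((type)(a) > (type)(b) ? (type)(a) : (type)(b))

/* Without the clamp: the unsigned subtraction wraps to a huge value. */
static uint32_t naive_max_sdu(uint32_t interval_len, uint32_t overhead,
                              uint32_t hard_header_len)
{
    uint32_t max_frm_len = interval_len - overhead; /* may wrap */

    return max_frm_len - hard_header_len;
}

/* With the clamp: compare as int so a wrapped value is seen as negative. */
static uint32_t clamped_max_sdu(uint32_t interval_len, uint32_t overhead,
                                uint32_t hard_header_len)
{
    uint32_t max_frm_len = interval_len - overhead; /* may wrap */

    max_frm_len = max_t(int, max_frm_len, hard_header_len + 1);
    return max_frm_len - hard_header_len;
}
```

For an interval of 10 bytes, an overhead of 24 and a 14-byte header, the naive helper returns a value close to U32_MAX, while the clamped helper returns 1, so every packet on that traffic class counts as oversized.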
      
      Fixes: fed87cc6 ("net/sched: taprio: automatically calculate queueMaxSDU based on TC gate durations")
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      bdf366bd
    • net/sched: taprio: fix calculation of maximum gate durations · 09dbdf28
      Vladimir Oltean authored
      
      taprio_calculate_gate_durations() depends on netdev_get_num_tc() and
      this returns 0. So it calculates the maximum gate durations for no
      traffic class.
      
      I had tested the blamed commit only with another patch in my tree, one
      which in the end I decided isn't valuable enough to submit ("net/sched:
      taprio: mask off bits in gate mask that exceed number of TCs").
      
      The problem is that having this patch threw off my testing. By moving
      the netdev_set_num_tc() call earlier, it implicitly gave
      taprio_calculate_gate_durations() the information it needed.
      
      Extract only the portion from the unsubmitted change which applies the
      mqprio configuration to the netdev earlier.
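
Why the ordering matters can be seen in a heavily simplified sketch. It assumes a per-TC loop bounded by the number of traffic classes and takes the longest single schedule entry as the maximum, which is cruder than the real calculation. The names `struct entry`, `MAX_TC` and `calc_max_open()` are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_TC 16 /* hypothetical upper bound; num_tc must not exceed it */

/* Hypothetical schedule entry: which gates are open, and for how long. */
struct entry {
    uint32_t gate_mask;
    uint64_t interval;
};

/*
 * Simplified: for each traffic class, record the longest entry during
 * which its gate is open. The outer loop is bounded by num_tc, so with
 * num_tc == 0 nothing is computed at all.
 */
static void calc_max_open(const struct entry *entries, size_t n, int num_tc,
                          uint64_t max_open[MAX_TC])
{
    for (int tc = 0; tc < MAX_TC; tc++)
        max_open[tc] = 0;

    for (int tc = 0; tc < num_tc; tc++) {
        for (size_t i = 0; i < n; i++) {
            if ((entries[i].gate_mask & (1u << tc)) &&
                entries[i].interval > max_open[tc])
                max_open[tc] = entries[i].interval;
        }
    }
}
```

If the number of traffic classes is still 0 when the function runs, every duration stays at 0. Once the count has been set beforehand, the same schedule produces non-zero durations.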
      
      Link: https://patchwork.kernel.org/project/netdevbpf/patch/20230130173145.475943-15-vladimir.oltean@nxp.com/
      
      Fixes: a306a90c ("net/sched: taprio: calculate tc gate durations")
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      09dbdf28
    • rxrpc: Fix overproduction of wakeups to recvmsg() · c0783818
      David Howells authored
      
      Fix three cases of overproduction of wakeups:
      
       (1) rxrpc_input_split_jumbo() conditionally notifies the app that there's
           data for recvmsg() to collect if it queues some data - and then its
           only caller, rxrpc_input_data(), goes and wakes up recvmsg() anyway.
      
           Fix rxrpc_input_data() to only do the wakeup in failure cases.
      
       (2) If a DATA packet is received for a call by the I/O thread whilst
           recvmsg() is busy draining the call's rx queue in the app thread, the
           call will be left on the recvmsg() queue for recvmsg() to pick up, even
           though there isn't any data on it.
      
           This can cause an unexpected recvmsg() with a 0 return and no MSG_EOR
           set after the reply has been posted to a service call.
      
           Fix this by discarding pending calls from the recvmsg() queue that
           don't need servicing yet.
      
       (3) Not-yet-completed calls get requeued after having data read from them,
           even if they have no data to read.
      
           Fix this by only requeuing them if they have data waiting on them; if
           they don't, the I/O thread will requeue them when data arrives or they
           fail.
      
      Signed-off-by: David Howells <dhowells@redhat.com>
      cc: Marc Dionne <marc.dionne@auristor.com>
      cc: linux-afs@lists.infradead.org
      Link: https://lore.kernel.org/r/3386149.1676497685@warthog.procyon.org.uk
      
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      c0783818