  1. Feb 01, 2023
    • KVM: Destroy target device if coalesced MMIO unregistration fails · b1cb1fac
      Sean Christopherson authored
      
      Destroy and free the target coalesced MMIO device if unregistering said
      device fails.  As clearly noted in the code, kvm_io_bus_unregister_dev()
      does not destroy the target device.
      
        BUG: memory leak
        unreferenced object 0xffff888112a54880 (size 64):
          comm "syz-executor.2", pid 5258, jiffies 4297861402 (age 14.129s)
          hex dump (first 32 bytes):
            38 c7 67 15 00 c9 ff ff 38 c7 67 15 00 c9 ff ff  8.g.....8.g.....
            e0 c7 e1 83 ff ff ff ff 00 30 67 15 00 c9 ff ff  .........0g.....
          backtrace:
            [<0000000006995a8a>] kmalloc include/linux/slab.h:556 [inline]
            [<0000000006995a8a>] kzalloc include/linux/slab.h:690 [inline]
            [<0000000006995a8a>] kvm_vm_ioctl_register_coalesced_mmio+0x8e/0x3d0 arch/x86/kvm/../../../virt/kvm/coalesced_mmio.c:150
            [<00000000022550c2>] kvm_vm_ioctl+0x47d/0x1600 arch/x86/kvm/../../../virt/kvm/kvm_main.c:3323
            [<000000008a75102f>] vfs_ioctl fs/ioctl.c:46 [inline]
            [<000000008a75102f>] file_ioctl fs/ioctl.c:509 [inline]
            [<000000008a75102f>] do_vfs_ioctl+0xbab/0x1160 fs/ioctl.c:696
            [<0000000080e3f669>] ksys_ioctl+0x76/0xa0 fs/ioctl.c:713
            [<0000000059ef4888>] __do_sys_ioctl fs/ioctl.c:720 [inline]
            [<0000000059ef4888>] __se_sys_ioctl fs/ioctl.c:718 [inline]
            [<0000000059ef4888>] __x64_sys_ioctl+0x6f/0xb0 fs/ioctl.c:718
            [<000000006444fa05>] do_syscall_64+0x9f/0x4e0 arch/x86/entry/common.c:290
            [<000000009a4ed50b>] entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
        BUG: leak checking failed
      
      Fixes: 5d3c4c79 ("KVM: Stop looking for coalesced MMIO zones if the bus is destroyed")
      Cc: stable@vger.kernel.org
      Reported-by: 柳菁峰 <liujingfeng@qianxin.com>
      Reported-by: Michal Luczaj <mhal@rbox.co>
      Link: https://lore.kernel.org/r/20221219171924.67989-1-seanjc@google.com
      Link: https://lore.kernel.org/all/20230118220003.1239032-1-mhal@rbox.co

      Signed-off-by: Sean Christopherson <seanjc@google.com>
      b1cb1fac
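The ownership rule behind this fix can be sketched in userspace C. This is a minimal sketch and not the kernel code: `struct bus`, `bus_unregister_dev()`, `unregister_zone()` and the `live_devices` counter are hypothetical stand-ins. The only property taken from the commit message is that unregistering never frees the target device, so the caller has to free it whether or not unregistration succeeds.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-ins for illustration; not the kernel's types. */
struct device {
	int id;
};

struct bus {
	struct device *dev;	/* a single slot keeps the sketch short */
	bool fail_unregister;	/* knob to force the failure path */
};

static int live_devices;	/* allocation counter used to spot leaks */

static struct device *device_create(int id)
{
	struct device *dev = calloc(1, sizeof(*dev));

	if (dev) {
		dev->id = id;
		live_devices++;
	}
	return dev;
}

static void device_destroy(struct device *dev)
{
	free(dev);
	live_devices--;
}

/* Detaches dev from the bus but never frees it, on success or failure. */
static int bus_unregister_dev(struct bus *bus, struct device *dev)
{
	if (bus->dev == dev)
		bus->dev = NULL;
	return bus->fail_unregister ? -ENOMEM : 0;
}

/* The caller owns dev, so dev is destroyed regardless of the result. */
static int unregister_zone(struct bus *bus, struct device *dev)
{
	int r = bus_unregister_dev(bus, dev);

	device_destroy(dev);
	return r;
}
```

Freeing only on the success path would leak the device whenever `bus_unregister_dev()` reports an error, which is the kind of leak the report above shows.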
  2. Jan 20, 2023
    • kvm/vfio: Fix potential deadlock on vfio group_lock · 51cdc8bc
      Yi Liu authored
      
      Currently it is possible that the final put of a KVM reference comes from
      vfio during its device close operation.  This occurs while the vfio group
      lock is held; however, if the vfio device is still in the kvm device list,
      then the following call chain could result in a deadlock:
      
      VFIO holds group->group_lock/group_rwsem
        -> kvm_put_kvm
         -> kvm_destroy_vm
          -> kvm_destroy_devices
           -> kvm_vfio_destroy
            -> kvm_vfio_file_set_kvm
             -> vfio_file_set_kvm
              -> try to hold group->group_lock/group_rwsem
      
      The key function is kvm_destroy_devices(), which triggers the destroy cb
      of kvm_device_ops.  That cb calls back into vfio and tries to take
      group_lock.  If this path does not call back into vfio, the deadlock
      cannot occur.  KVM provides another point at which the kvm-vfio device
      can be freed: when the device file descriptor is closed.  Use it by
      providing the release cb instead of the destroy cb.  Also rename
      kvm_vfio_destroy() to kvm_vfio_release().
      
      	/*
      	 * Destroy is responsible for freeing dev.
      	 *
      	 * Destroy may be called before or after destructors are called
      	 * on emulated I/O regions, depending on whether a reference is
      	 * held by a vcpu or other kvm component that gets destroyed
      	 * after the emulated I/O.
      	 */
      	void (*destroy)(struct kvm_device *dev);
      
      	/*
      	 * Release is an alternative method to free the device. It is
      	 * called when the device file descriptor is closed. Once
      	 * release is called, the destroy method will not be called
      	 * anymore as the device is removed from the device list of
      	 * the VM. kvm->lock is held.
      	 */
      	void (*release)(struct kvm_device *dev);
      
      Fixes: 421cfe65 ("vfio: remove VFIO_GROUP_NOTIFY_SET_KVM")
      Reported-by: Alex Williamson <alex.williamson@redhat.com>
      Suggested-by: Kevin Tian <kevin.tian@intel.com>
      Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
      Signed-off-by: Yi Liu <yi.l.liu@intel.com>
      Reviewed-by: Matthew Rosato <mjrosato@linux.ibm.com>
      Link: https://lore.kernel.org/r/20230114000351.115444-1-mjrosato@linux.ibm.com
      Link: https://lore.kernel.org/r/20230120150528.471752-1-yi.l.liu@intel.com

      [aw: update comment as well, s/destroy/release/]
      Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
      51cdc8bc
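The difference between the two callbacks can be illustrated with a small userspace sketch. Everything here is an assumption made for illustration (`struct vm`, `device_fd_close()`, `vm_destroy_devices()`, the call counters); only the contract comes from the quoted comment: release runs when the device file descriptor is closed, and a released device is no longer on the VM's device list, so destroy is never invoked for it.

```c
#include <stddef.h>

#define MAX_DEVS 4

struct device;

struct device_ops {
	void (*destroy)(struct device *dev);	/* called from VM teardown */
	void (*release)(struct device *dev);	/* called when the fd is closed */
};

struct device {
	const struct device_ops *ops;
	int destroy_calls;
	int release_calls;
};

/* Hypothetical VM with a fixed-size device list. */
struct vm {
	struct device *devs[MAX_DEVS];
};

static void count_destroy(struct device *dev) { dev->destroy_calls++; }
static void count_release(struct device *dev) { dev->release_calls++; }

static const struct device_ops release_ops = { .release = count_release };
static const struct device_ops destroy_ops = { .destroy = count_destroy };

/* Closing the device fd: unlink from the VM first, then release. */
static void device_fd_close(struct vm *vm, struct device *dev)
{
	if (!dev->ops->release)
		return;		/* destroy-style devices stay on the list */

	for (size_t i = 0; i < MAX_DEVS; i++) {
		if (vm->devs[i] == dev)
			vm->devs[i] = NULL;
	}
	dev->ops->release(dev);
}

/* VM teardown: only devices still on the list see destroy. */
static void vm_destroy_devices(struct vm *vm)
{
	for (size_t i = 0; i < MAX_DEVS; i++) {
		struct device *dev = vm->devs[i];

		if (!dev)
			continue;
		vm->devs[i] = NULL;
		if (dev->ops->destroy)
			dev->ops->destroy(dev);
	}
}
```

With a release-style device, teardown finds nothing left to call back into, which is why the nested lock acquisition in the call chain above cannot happen on that path.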
  3. Jan 11, 2023
  4. Dec 29, 2022
    • KVM: Clean up error labels in kvm_init() · 9f1a4c00
      Sean Christopherson authored
      
      Convert the last two "out" labels to "err" labels now that the dust has
      settled, i.e. now that there are no more planned changes to the order
      of things in kvm_init().
      
      Use "err" instead of "out" as it's easier to describe what failed than it
      is to describe what needs to be unwound, e.g. if allocating a per-CPU kick
      mask fails, KVM needs to free any masks that were allocated, and of course
      needs to unwind previous operations.
      
      Reported-by: Chao Gao <chao.gao@intel.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20221130230934.1014142-51-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      9f1a4c00
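The naming convention described above, where each label says what failed instead of what must be unwound, might look like the following userspace sketch. The resources, `init_sketch()` and the `fail_step` knob are hypothetical; they only mirror the shape of an init path with several steps, not the contents of kvm_init().

```c
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical resources standing in for the steps of an init path. */
static void *kick_masks;
static void *vcpu_cache;
static bool device_registered;

/* Knob for experiments: which step should fail (0 means none). */
static int fail_step;

static int init_sketch(void)
{
	kick_masks = fail_step == 1 ? NULL : malloc(16);
	if (!kick_masks)
		return -1;

	vcpu_cache = fail_step == 2 ? NULL : malloc(16);
	if (!vcpu_cache)
		goto err_vcpu_cache;

	if (fail_step == 3)
		goto err_register;

	device_registered = true;
	return 0;

	/* Each label names the step that failed; unwinding falls through. */
err_register:
	free(vcpu_cache);
	vcpu_cache = NULL;
err_vcpu_cache:
	free(kick_masks);
	kick_masks = NULL;
	return -1;
}
```

A reader adding a new step only has to name its label after that step and place the matching cleanup in reverse order.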
    • KVM: Opt out of generic hardware enabling on s390 and PPC · 441f7bfa
      Sean Christopherson authored
      
      Allow architectures to opt out of the generic hardware enabling logic,
      and opt out on both s390 and PPC, which don't need to manually enable
      virtualization as it's always on (when available).
      
      In addition to letting s390 and PPC drop a bit of dead code, this will
      hopefully also allow ARM to clean up its related code, e.g. ARM has its
      own per-CPU flag to track which CPUs have enabled hardware due to the
      need to keep hardware enabled indefinitely when pKVM is enabled.
      
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Acked-by: Anup Patel <anup@brainfault.org>
      Message-Id: <20221130230934.1014142-50-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      441f7bfa
    • KVM: Register syscore (suspend/resume) ops early in kvm_init() · 35774a9f
      Sean Christopherson authored
      
      Register the suspend/resume notifier hooks at the same time KVM registers
      its reboot notifier so that all the code in kvm_init() that deals with
      enabling/disabling hardware is bundled together.  Opportunistically move
      KVM's implementations to reside near the reboot notifier code for the
      same reason.
      
      Bunching the code together will allow architectures to opt out of KVM's
      generic hardware enable/disable logic with minimal #ifdeffery.
      
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20221130230934.1014142-49-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      35774a9f
    • KVM: Make hardware_enable_failed a local variable in the "enable all" path · e6fb7d6e
      Isaku Yamahata authored
      
      Rework detecting hardware enabling errors to use a local variable in the
      "enable all" path to track whether or not enabling was successful across
      all CPUs.  Using a global variable complicates paths that enable hardware
      only on the current CPU, e.g. kvm_resume() and kvm_online_cpu().
      
      Opportunistically add a WARN if hardware enabling fails during
      kvm_resume(), as KVM is all kinds of hosed if CPU0 fails to enable
      hardware.  The WARN is largely futile in the current code, as KVM BUG()s
      on spurious faults on VMX instructions, e.g. attempting to run a vCPU on
      a CPU where hardware enabling failed will explode.
      
        ------------[ cut here ]------------
        kernel BUG at arch/x86/kvm/x86.c:508!
        invalid opcode: 0000 [#1] SMP
        CPU: 3 PID: 1009 Comm: CPU 4/KVM Not tainted 6.1.0-rc1+ #11
        Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
        RIP: 0010:kvm_spurious_fault+0xa/0x10
        Call Trace:
         vmx_vcpu_load_vmcs+0x192/0x230 [kvm_intel]
         vmx_vcpu_load+0x16/0x60 [kvm_intel]
         kvm_arch_vcpu_load+0x32/0x1f0
         vcpu_load+0x2f/0x40
         kvm_arch_vcpu_ioctl_run+0x19/0x9d0
         kvm_vcpu_ioctl+0x271/0x660
         __x64_sys_ioctl+0x80/0xb0
         do_syscall_64+0x2b/0x50
         entry_SYSCALL_64_after_hwframe+0x46/0xb0
      
      But, the WARN may provide a breadcrumb to understand what went awry, and
      someday KVM may fix one or both of those bugs, e.g. by finding a way to
      eat spurious faults no matter the context (easier said than done due to
      side effects of certain operations, e.g. Intel's VMCLEAR).
      
      Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
      [sean: rebase, WARN on failure in kvm_resume()]
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20221130230934.1014142-48-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      e6fb7d6e
    • KVM: Use a per-CPU variable to track which CPUs have enabled virtualization · 37d25881
      Sean Christopherson authored
      
      Use a per-CPU variable instead of a shared bitmap to track which CPUs
      have successfully enabled virtualization hardware.  Using a per-CPU bool
      avoids the need for an additional allocation, and arguably yields easier
      to read code.  Using a bitmap would be advantageous if KVM used it to
      avoid generating IPIs to CPUs that failed to enable hardware, but that's
      an extreme edge case and not worth optimizing, and the low level helpers
      would still want to keep their individual checks as attempting to enable
      virtualization hardware when it's already enabled can be problematic,
      e.g. Intel's VMXON will fault.
      
      Opportunistically change the order in hardware_enable_nolock() to set
      the flag if and only if hardware enabling is successful, instead of
      speculatively setting the flag and then clearing it on failure.
      
      Add a comment explaining that the check in hardware_disable_nolock()
      isn't simply paranoia.  Waaay back when, commit 1b6c0168 ("KVM: Keep
      track of which cpus have virtualization enabled") added the logic as a
      guard against CPU hotplug racing with hardware enable/disable.  Now that
      KVM has eliminated the race by taking cpu_hotplug_lock for read (via
      cpus_read_lock()) when enabling or disabling hardware, at first glance it
      appears that the check is now superfluous, i.e. it's tempting to remove
      the per-CPU flag entirely...
      
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20221130230934.1014142-47-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      37d25881
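The two commits above (a local failure counter in the "enable all" path, and a per-CPU flag that is set only after enabling succeeds) can be combined in one userspace sketch. This is an illustration under assumptions, not the kernel implementation: a plain loop stands in for on_each_cpu(), a fixed array stands in for per-CPU storage, and `cpu_is_broken[]`, `enable_one()` and `enable_all()` are hypothetical names.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define NR_CPUS 4

static bool hardware_enabled[NR_CPUS];	/* one flag per simulated CPU */
static bool cpu_is_broken[NR_CPUS];	/* knob: enabling fails on this CPU */

/* Returns 0 on success; the flag is set only after enabling succeeded. */
static int enable_one(int cpu)
{
	if (hardware_enabled[cpu])
		return 0;	/* enabling twice can be harmful, so skip it */
	if (cpu_is_broken[cpu])
		return -1;
	hardware_enabled[cpu] = true;
	return 0;
}

/* Callback for the "all CPUs" path; failures go to the caller's counter. */
static void enable_one_cb(int cpu, void *failed)
{
	if (enable_one(cpu))
		atomic_fetch_add((atomic_int *)failed, 1);
}

static void disable_one(int cpu)
{
	if (!hardware_enabled[cpu])
		return;		/* never disable what was not enabled */
	hardware_enabled[cpu] = false;
}

static int enable_all(void)
{
	atomic_int failed = 0;	/* local to this path, not a global */

	/* Stand-in for on_each_cpu(). */
	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		enable_one_cb(cpu, &failed);

	if (atomic_load(&failed)) {
		for (int cpu = 0; cpu < NR_CPUS; cpu++)
			disable_one(cpu);
		return -1;
	}
	return 0;
}
```

Because the counter lives on the caller's stack, single-CPU paths can call `enable_one()` directly and act on its return value without touching any shared failure state.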
    • KVM: Remove on_each_cpu(hardware_disable_nolock) in kvm_exit() · 667a83bf
      Isaku Yamahata authored
      
      Drop the superfluous invocation of hardware_disable_nolock() during
      kvm_exit(), as it's nothing more than a glorified nop.
      
      KVM automatically disables hardware on all CPUs when the last VM is
      destroyed, and kvm_exit() cannot be called until the last VM goes
      away as the calling module is pinned by an elevated refcount of the fops
      associated with /dev/kvm.  This holds true even on x86, where the caller
      of kvm_exit() is not kvm.ko, but is instead a dependent module, kvm_amd.ko
      or kvm_intel.ko, as kvm_chardev_ops.owner is set to the module that calls
      kvm_init(), not hardcoded to the base kvm.ko module.
      
      Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
      [sean: rework changelog]
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20221130230934.1014142-46-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      667a83bf
    • KVM: Drop kvm_count_lock and instead protect kvm_usage_count with kvm_lock · 0bf50497
      Isaku Yamahata authored
      
      Drop kvm_count_lock and instead protect kvm_usage_count with kvm_lock now
      that KVM hooks CPU hotplug during the ONLINE phase, which can sleep.
      Previously, KVM hooked the STARTING phase, which is not allowed to sleep
      and thus could not take kvm_lock (a mutex).  This effectively allows the
      task that's initiating hardware enabling/disabling to be preempted and/or
      migrated.
      
      Note, the Documentation/virt/kvm/locking.rst statement that kvm_count_lock
      is "raw" because hardware enabling/disabling needs to be atomic with
      respect to migration is wrong on multiple fronts.  First, while regular
      spinlocks can be preempted, the task holding the lock cannot be migrated.
      Second, preventing migration is not required.  on_each_cpu() disables
      preemption, which ensures that cpus_hardware_enabled correctly reflects
      hardware state.  The task may be preempted/migrated between bumping
      kvm_usage_count and invoking on_each_cpu(), but that's perfectly ok as
      kvm_usage_count is still protected, e.g. other tasks that call
      hardware_enable_all() will be blocked until the preempted/migrated owner
      exits its critical section.
      
      KVM does have lockless accesses to kvm_usage_count in the suspend/resume
      flows, but those are safe because all tasks must be frozen prior to
      suspending CPUs, and a task cannot be frozen while it holds one or more
      locks (userspace tasks are frozen via a fake signal).
      
      Preemption doesn't need to be explicitly disabled in the hotplug path.
      The hotplug thread is pinned to the CPU that's being hotplugged, and KVM
      only cares about having a stable CPU, i.e. to ensure hardware is enabled
      on the correct CPU.  Lockdep, i.e. check_preemption_disabled(), plays nice
      with this state too, as is_percpu_thread() is true for the hotplug thread.
      
      Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
      Co-developed-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20221130230934.1014142-45-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      0bf50497
    • KVM: Ensure CPU is stable during low level hardware enable/disable · 2c106f29
      Sean Christopherson authored
      
      Use the non-raw smp_processor_id() in the low level hardware enable/disable
      helpers as KVM absolutely relies on the CPU being stable, e.g. KVM would
      end up with incorrect state if the task were migrated between accessing
      cpus_hardware_enabled and actually enabling/disabling hardware.
      
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20221130230934.1014142-44-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      2c106f29
    • KVM: Disable CPU hotplug during hardware enabling/disabling · e4aa7f88
      Chao Gao authored
      
      Disable CPU hotplug when enabling/disabling hardware to close a corner
      case.  If the following sequence occurs:

        1. A hotplugged CPU marks itself online in cpu_online_mask
        2. The hotplugged CPU enables interrupts before invoking KVM's ONLINE
           callback
        3. hardware_{en,dis}able_all() is invoked on another CPU
      
      the hotplugged CPU will be included in on_each_cpu() and thus get sent
      through hardware_{en,dis}able_nolock() before kvm_online_cpu() is called.
      
              start_secondary { ...
                      set_cpu_online(smp_processor_id(), true); <- 1
                      ...
                      local_irq_enable();  <- 2
                      ...
                      cpu_startup_entry(CPUHP_AP_ONLINE_IDLE); <- 3
              }
      
      KVM currently fudges around this race by keeping track of which CPUs have
      done hardware enabling (see commit 1b6c0168 "KVM: Keep track of which
      cpus have virtualization enabled"), but that's an inefficient, convoluted,
      and hacky solution.
      
      Signed-off-by: Chao Gao <chao.gao@intel.com>
      [sean: split to separate patch, write changelog]
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20221130230934.1014142-43-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      e4aa7f88
    • KVM: Rename and move CPUHP_AP_KVM_STARTING to ONLINE section · aaf12a7b
      Chao Gao authored
      
      The CPU STARTING section doesn't allow callbacks to fail.  Move KVM's
      hotplug callback to the ONLINE section so that it can abort onlining a
      CPU in certain cases, e.g. when KVM fails to enable hardware
      virtualization on the hotplugged CPU, to avoid potentially breaking VMs
      running on existing CPUs.
      
      Place KVM's hotplug state before CPUHP_AP_SCHED_WAIT_EMPTY as it ensures
      when offlining a CPU, all user tasks and non-pinned kernel tasks have left
      the CPU, i.e. there cannot be a vCPU task around. So, it is safe for KVM's
      CPU offline callback to disable hardware virtualization at that point.
      Likewise, KVM's online callback can enable hardware virtualization before
      any vCPU task gets a chance to run on hotplugged CPUs.
      
      Drop kvm_x86_check_processor_compatibility()'s WARN that IRQs are
      disabled, as the ONLINE section runs with IRQs enabled.  The WARN wasn't
      intended to be a requirement, e.g. disabling preemption is sufficient,
      the IRQ thing was purely an aggressive sanity check since the helper was
      only ever invoked via SMP function call.
      
      Rename KVM's CPU hotplug callbacks accordingly.
      
      Suggested-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Chao Gao <chao.gao@intel.com>
      Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
      Reviewed-by: Yuan Yao <yuan.yao@intel.com>
      [sean: drop WARN that IRQs are disabled]
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20221130230934.1014142-42-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      aaf12a7b
    • KVM: Drop kvm_arch_check_processor_compat() hook · 81a1cf9f
      Sean Christopherson authored
      
      Drop kvm_arch_check_processor_compat() and its support code now that all
      architecture implementations are nops.
      
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
      Reviewed-by: Eric Farman <farman@linux.ibm.com>	# s390
      Acked-by: Anup Patel <anup@brainfault.org>
      Reviewed-by: Kai Huang <kai.huang@intel.com>
      Message-Id: <20221130230934.1014142-33-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      81a1cf9f
    • KVM: Drop kvm_arch_{init,exit}() hooks · a578a0a9
      Sean Christopherson authored
      
      Drop kvm_arch_init() and kvm_arch_exit() now that all implementations
      are nops.
      
      No functional change intended.
      
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Eric Farman <farman@linux.ibm.com>	# s390
      Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
      Acked-by: Anup Patel <anup@brainfault.org>
      Message-Id: <20221130230934.1014142-30-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      a578a0a9
    • KVM: Drop arch hardware (un)setup hooks · 63a1bd8a
      Sean Christopherson authored
      
      Drop kvm_arch_hardware_setup() and kvm_arch_hardware_unsetup() now that
      all implementations are nops.
      
      No functional change intended.
      
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Eric Farman <farman@linux.ibm.com>	# s390
      Acked-by: Anup Patel <anup@brainfault.org>
      Message-Id: <20221130230934.1014142-10-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      63a1bd8a
    • KVM: Teardown VFIO ops earlier in kvm_exit() · 73b8dc04
      Sean Christopherson authored
      
      Move the call to kvm_vfio_ops_exit() further up kvm_exit() to try and
      bring some amount of symmetry to the setup order in kvm_init(), and more
      importantly so that the arch hooks are invoked dead last by kvm_exit().
      This will allow arch code to move away from the arch hooks without any
      change in ordering between arch code and common code in kvm_exit().
      
      That kvm_vfio_ops_exit() is called last appears to be 100% arbitrary.  It
      was bolted on after the fact by commit 571ee1b6 ("kvm: vfio: fix
      unregister kvm_device_ops of vfio").  The nullified kvm_device_ops_table
      is also local to kvm_main.c and is used only when there are active VMs,
      so unless arch code is doing something truly bizarre, nullifying the
      table earlier in kvm_exit() is little more than a nop.
      
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Cornelia Huck <cohuck@redhat.com>
      Reviewed-by: Eric Farman <farman@linux.ibm.com>
      Message-Id: <20221130230934.1014142-5-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      73b8dc04
    • KVM: Allocate cpus_hardware_enabled after arch hardware setup · c9650228
      Sean Christopherson authored
      
      Allocate cpus_hardware_enabled after arch hardware setup so that arch
      "init" and "hardware setup" are called back-to-back and thus can be
      combined in a future patch.  cpus_hardware_enabled is never used before
      kvm_create_vm(), i.e. doesn't have a dependency with hardware setup and
      only needs to be allocated before /dev/kvm is exposed to userspace.
      
      Free the object before the arch hooks are invoked to maintain symmetry,
      and so that arch code can move away from the hooks without having to
      worry about ordering changes.
      
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Yuan Yao <yuan.yao@intel.com>
      Message-Id: <20221130230934.1014142-4-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      c9650228
    • KVM: Initialize IRQ FD after arch hardware setup · 5910ccf0
      Sean Christopherson authored
      
      Move initialization of KVM's IRQ FD workqueue below arch hardware setup
      as a step towards consolidating arch "init" and "hardware setup", and
      eventually towards dropping the hooks entirely.  There is no dependency
      on the workqueue being created before hardware setup, the workqueue is
      used only when destroying VMs, i.e. only needs to be created before
      /dev/kvm is exposed to userspace.
      
      Move the destruction of the workqueue before the arch hooks to maintain
      symmetry, and so that arch code can move away from the hooks without
      having to worry about ordering changes.
      
      Reword the comment about kvm_irqfd_init() needing to come after
      kvm_arch_init() to call out that kvm_arch_init() must come before common
      KVM does _anything_, as x86 very subtly relies on that behavior to deal
      with multiple calls to kvm_init(), e.g. if userspace attempts to load
      kvm_amd.ko and kvm_intel.ko.  Tag the code with a FIXME, as x86's subtle
      requirement is gross, and invoking an arch callback as the very first
      action in a helper that is called only from arch code is silly.
      
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20221130230934.1014142-3-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      5910ccf0
    • KVM: Register /dev/kvm as the _very_ last thing during initialization · 2b012812
      Sean Christopherson authored
      
      Register /dev/kvm, i.e. expose KVM to userspace, only after all other
      setup has completed.  Once /dev/kvm is exposed, userspace can start
      invoking KVM ioctls, creating VMs, etc...  If userspace creates a VM
      before KVM is done with its configuration, bad things may happen, e.g.
      KVM will fail to properly migrate vCPU state if a VM is created before
      KVM has registered preemption notifiers.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20221130230934.1014142-2-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      2b012812
  5. Dec 27, 2022
  6. Dec 02, 2022
  7. Nov 30, 2022
  8. Nov 24, 2022
  9. Nov 18, 2022
  10. Nov 17, 2022
    • KVM: Obey kvm.halt_poll_ns in VMs not using KVM_CAP_HALT_POLL · 9eb8ca04
      David Matlack authored
      
      Obey kvm.halt_poll_ns in VMs not using KVM_CAP_HALT_POLL on every halt,
      rather than just sampling the module parameter when the VM is first
      created. This restores the original behavior of kvm.halt_poll_ns for VMs
      that have not opted into KVM_CAP_HALT_POLL.
      
      Notably, this change restores the ability for admins to disable or
      change the maximum halt-polling time system wide for VMs not using
      KVM_CAP_HALT_POLL.
      
      Reported-by: Christian Borntraeger <borntraeger@de.ibm.com>
      Fixes: acd05785 ("kvm: add capability for halt polling")
      Signed-off-by: David Matlack <dmatlack@google.com>
      Message-Id: <20221117001657.1067231-4-dmatlack@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      9eb8ca04
    • KVM: Avoid re-reading kvm->max_halt_poll_ns during halt-polling · 175d5dc7
      David Matlack authored
      
      Avoid re-reading kvm->max_halt_poll_ns multiple times during
      halt-polling except when it is explicitly useful, e.g. to check if the
      max time changed across a halt. kvm->max_halt_poll_ns can be changed at
      any time by userspace via KVM_CAP_HALT_POLL.
      
      This bug is unlikely to cause any serious side-effects. In the worst
      case one halt polls for shorter or longer than it should, and then is
      fixed up on the next halt. Furthermore, this is still possible since
      kvm->max_halt_poll_ns is not synchronized with halts.
      
      Fixes: acd05785 ("kvm: add capability for halt polling")
      Signed-off-by: David Matlack <dmatlack@google.com>
      Message-Id: <20221117001657.1067231-3-dmatlack@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      175d5dc7
    • KVM: Cap vcpu->halt_poll_ns before halting rather than after · 97b6847a
      David Matlack authored
      
      Cap vcpu->halt_poll_ns based on the max halt polling time just before
      halting, rather than after the last halt. This arguably provides better
      accuracy if an admin disables halt polling in between halts, although
      the improvement is nominal.
      
      A side-effect of this change is that grow_halt_poll_ns() no longer needs
      to access vcpu->kvm->max_halt_poll_ns, which will be useful in a future
      commit where the max halt polling time can come from the module parameter
      halt_poll_ns instead.
      
      Signed-off-by: David Matlack <dmatlack@google.com>
      Message-Id: <20221117001657.1067231-2-dmatlack@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      97b6847a
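Taken together, the three halt-polling commits above describe one flow: pick the limit (the per-VM value if the VM uses KVM_CAP_HALT_POLL, otherwise the current module parameter), read it once, and cap the vCPU's polling time just before halting. The sketch below is a userspace illustration under assumptions: `struct vm`, `struct vcpu`, the `uses_halt_poll_cap` field, both helper names, and the initial parameter value are hypothetical and not taken from the kernel source.

```c
#include <stdbool.h>

/* Stand-in for the kvm.halt_poll_ns module parameter; arbitrary value. */
static unsigned int halt_poll_ns_param = 200000;

struct vm {
	bool uses_halt_poll_cap;	/* VM opted in via KVM_CAP_HALT_POLL */
	unsigned int max_halt_poll_ns;	/* only meaningful after opting in */
};

struct vcpu {
	struct vm *vm;
	unsigned int halt_poll_ns;	/* grows and shrinks between halts */
};

/* Per-VM limit if the VM opted in, else the current module parameter. */
static unsigned int max_halt_poll_ns(const struct vcpu *vcpu)
{
	if (vcpu->vm->uses_halt_poll_cap)
		return vcpu->vm->max_halt_poll_ns;
	return halt_poll_ns_param;
}

/* Read the limit once, cap the vCPU's value, return the time to poll. */
static unsigned int poll_time_before_halt(struct vcpu *vcpu)
{
	unsigned int max = max_halt_poll_ns(vcpu);

	if (vcpu->halt_poll_ns > max)
		vcpu->halt_poll_ns = max;
	return vcpu->halt_poll_ns;
}
```

Because the limit is sampled on every halt, setting the parameter to zero takes effect on the next halt for every VM that has not opted in.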
  11. Nov 12, 2022
    • KVM: Push dirty information unconditionally to backup bitmap · c57351a7
      Gavin Shan authored
      
      In mark_page_dirty_in_slot(), we bail out when no running vcpu exists
      and a running vcpu context is strictly required by the architecture.
      This may cause a backward compatibility issue.  Currently, saving
      vgic/its tables is the only known case where no running vcpu context is
      expected.  There may be other, unknown cases where no running vcpu
      context exists; those are reported by the warning message, and we bail
      out without pushing the dirty information to the backup bitmap.  An
      application would enable the backup bitmap to cover the unknown cases,
      but even with the backup bitmap enabled, the dirty information can't be
      pushed to it until the unknown cases are added to the allowed list of
      non-running vcpu contexts, which requires code changes to the host
      kernel.
      
      In order for a new application, where the backup bitmap has been
      enabled, to work with an unchanged host, we continue to push the dirty
      information to the backup bitmap instead of bailing out early.  With the
      added check on 'memslot->dirty_bitmap' in mark_page_dirty_in_slot(), a
      kernel crash is silently avoided under the combined conditions: no
      running vcpu context, kvm_arch_allow_write_without_running_vcpu() returns
      'true', and the backup bitmap (KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP) isn't
      enabled yet.
      
      Suggested-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Gavin Shan <gshan@redhat.com>
      Signed-off-by: Marc Zyngier <maz@kernel.org>
      Link: https://lore.kernel.org/r/20221112094322.21911-1-gshan@redhat.com
      c57351a7
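The decision described in this commit message can be sketched in userspace C: with a running vCPU the dirty page goes to that vCPU's ring, without one it goes to the backup bitmap, and if the bitmap has not been enabled there is simply nothing to write to. The types, the fixed ring size, the `npages` bound and `mark_dirty()` are all hypothetical; this is an illustration of the branch structure, not the kernel's mark_page_dirty_in_slot().

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define RING_SIZE 8

/* Hypothetical per-vCPU dirty ring. */
struct ring {
	uint64_t entries[RING_SIZE];
	size_t len;
};

/* Hypothetical memory slot. */
struct slot {
	uint64_t base_gfn;
	uint64_t npages;
	uint8_t *dirty_bitmap;	/* NULL until the backup bitmap is enabled */
};

/*
 * Record gfn as dirty.  vcpu_ring is NULL when there is no running vCPU.
 * Returns true if the page was recorded somewhere.
 */
static bool mark_dirty(struct ring *vcpu_ring, struct slot *slot, uint64_t gfn)
{
	uint64_t rel;

	if (gfn < slot->base_gfn || gfn - slot->base_gfn >= slot->npages)
		return false;
	rel = gfn - slot->base_gfn;

	if (vcpu_ring) {
		if (vcpu_ring->len == RING_SIZE)
			return false;	/* ring full; a caller must drain it */
		vcpu_ring->entries[vcpu_ring->len++] = rel;
		return true;
	}

	if (slot->dirty_bitmap) {
		slot->dirty_bitmap[rel / 8] |= (uint8_t)(1u << (rel % 8));
		return true;
	}

	/* No running vCPU and no bitmap: skip instead of dereferencing NULL. */
	return false;
}
```

The NULL check on the bitmap plays the role the commit message assigns to the added 'memslot->dirty_bitmap' check: the write is skipped instead of crashing when the bitmap has not been enabled.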
  12. Nov 10, 2022