- Feb 01, 2023
-
-
Sean Christopherson authored
Destroy and free the target coalesced MMIO device if unregistering said device fails. As clearly noted in the code, kvm_io_bus_unregister_dev() does not destroy the target device.

  BUG: memory leak
  unreferenced object 0xffff888112a54880 (size 64):
    comm "syz-executor.2", pid 5258, jiffies 4297861402 (age 14.129s)
    hex dump (first 32 bytes):
      38 c7 67 15 00 c9 ff ff 38 c7 67 15 00 c9 ff ff  8.g.....8.g.....
      e0 c7 e1 83 ff ff ff ff 00 30 67 15 00 c9 ff ff  .........0g.....
    backtrace:
      [<0000000006995a8a>] kmalloc include/linux/slab.h:556 [inline]
      [<0000000006995a8a>] kzalloc include/linux/slab.h:690 [inline]
      [<0000000006995a8a>] kvm_vm_ioctl_register_coalesced_mmio+0x8e/0x3d0 arch/x86/kvm/../../../virt/kvm/coalesced_mmio.c:150
      [<00000000022550c2>] kvm_vm_ioctl+0x47d/0x1600 arch/x86/kvm/../../../virt/kvm/kvm_main.c:3323
      [<000000008a75102f>] vfs_ioctl fs/ioctl.c:46 [inline]
      [<000000008a75102f>] file_ioctl fs/ioctl.c:509 [inline]
      [<000000008a75102f>] do_vfs_ioctl+0xbab/0x1160 fs/ioctl.c:696
      [<0000000080e3f669>] ksys_ioctl+0x76/0xa0 fs/ioctl.c:713
      [<0000000059ef4888>] __do_sys_ioctl fs/ioctl.c:720 [inline]
      [<0000000059ef4888>] __se_sys_ioctl fs/ioctl.c:718 [inline]
      [<0000000059ef4888>] __x64_sys_ioctl+0x6f/0xb0 fs/ioctl.c:718
      [<000000006444fa05>] do_syscall_64+0x9f/0x4e0 arch/x86/entry/common.c:290
      [<000000009a4ed50b>] entry_SYSCALL_64_after_hwframe+0x49/0xbe
  BUG: leak checking failed

Fixes: 5d3c4c79 ("KVM: Stop looking for coalesced MMIO zones if the bus is destroyed")
Cc: stable@vger.kernel.org
Reported-by:
柳菁峰 <liujingfeng@qianxin.com> Reported-by:
Michal Luczaj <mhal@rbox.co> Link: https://lore.kernel.org/r/20221219171924.67989-1-seanjc@google.com Link: https://lore.kernel.org/all/20230118220003.1239032-1-mhal@rbox.co Signed-off-by:
Sean Christopherson <seanjc@google.com>
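A hedged sketch of the fix's shape, based on the summary above (the destructor helper, kvm_iodevice_destructor(), is assumed from KVM's io-device plumbing): when kvm_io_bus_unregister_dev() fails, the unregister path now destroys the target device itself before bailing.

    list_for_each_entry_safe(dev, tmp, &kvm->coalesced_zones, list) {
        if (zone->pio == dev->zone.pio &&
            coalesced_mmio_in_range(dev, zone->addr, zone->size)) {
            r = kvm_io_bus_unregister_dev(kvm,
                    zone->pio ? KVM_PIO_BUS : KVM_MMIO_BUS, &dev->dev);
            /*
             * On failure, unregister destroys all devices on the bus
             * _except_ the target device, i.e. coalesced_zones has
             * already been modified.  Destroy the target device here
             * and bail; there are no zones left to walk.
             */
            if (r) {
                kvm_iodevice_destructor(&dev->dev);
                break;
            }
        }
    }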
-
- Jan 20, 2023
-
-
Yi Liu authored
Currently it is possible that the final put of a KVM reference comes from vfio during its device close operation. This occurs while the vfio group lock is held; however, if the vfio device is still in the kvm device list, then the following call chain could result in a deadlock:

  VFIO holds group->group_lock/group_rwsem
    -> kvm_put_kvm
     -> kvm_destroy_vm
      -> kvm_destroy_devices
       -> kvm_vfio_destroy
        -> kvm_vfio_file_set_kvm
         -> vfio_file_set_kvm
          -> try to hold group->group_lock/group_rwsem

The key function is kvm_destroy_devices(), which triggers the destroy callback of kvm_device_ops. That callback goes back into vfio and tries to take group_lock, so if this path never called back into vfio, the deadlock would be gone. There is in fact a way to do that: KVM provides another point at which to free the kvm-vfio device, namely when the device file descriptor is closed. This can be achieved by providing the release callback instead of the destroy callback. Also rename kvm_vfio_destroy() to kvm_vfio_release().

  /*
   * Destroy is responsible for freeing dev.
   *
   * Destroy may be called before or after destructors are called
   * on emulated I/O regions, depending on whether a reference is
   * held by a vcpu or other kvm component that gets destroyed
   * after the emulated I/O.
   */
  void (*destroy)(struct kvm_device *dev);

  /*
   * Release is an alternative method to free the device. It is
   * called when the device file descriptor is closed. Once
   * release is called, the destroy method will not be called
   * anymore as the device is removed from the device list of
   * the VM. kvm->lock is held.
   */
  void (*release)(struct kvm_device *dev);

Fixes: 421cfe65 ("vfio: remove VFIO_GROUP_NOTIFY_SET_KVM")
Reported-by:
Alex Williamson <alex.williamson@redhat.com> Suggested-by:
Kevin Tian <kevin.tian@intel.com> Reviewed-by:
Jason Gunthorpe <jgg@nvidia.com> Signed-off-by:
Yi Liu <yi.l.liu@intel.com> Reviewed-by:
Matthew Rosato <mjrosato@linux.ibm.com> Link: https://lore.kernel.org/r/20230114000351.115444-1-mjrosato@linux.ibm.com Link: https://lore.kernel.org/r/20230120150528.471752-1-yi.l.liu@intel.com [aw: update comment as well, s/destroy/release/] Signed-off-by:
Alex Williamson <alex.williamson@redhat.com>
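A sketch of the resulting ops table in virt/kvm/vfio.c (field names taken from the kvm_device_ops excerpt quoted above):

    static struct kvm_device_ops kvm_vfio_ops = {
        .name = "kvm-vfio",
        .create = kvm_vfio_create,
        /*
         * Was .destroy = kvm_vfio_destroy.  Release runs when the
         * device fd is closed, so kvm_destroy_devices() no longer
         * calls back into vfio while the group lock is held.
         */
        .release = kvm_vfio_release,
        .set_attr = kvm_vfio_set_attr,
        .has_attr = kvm_vfio_has_attr,
    };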
-
- Jan 11, 2023
-
-
David Woodhouse authored
Documentation/virt/kvm/locking.rst tells us that kvm->lock is taken outside vcpu->mutex. But that doesn't actually happen very often; it's only in some esoteric cases like migration with AMD SEV. This means that lockdep usually doesn't notice, and doesn't do its job of keeping us honest. Ensure that lockdep *always* knows about the ordering of these two locks, by briefly taking vcpu->mutex in kvm_vm_ioctl_create_vcpu() while kvm->lock is held. Signed-off-by:
David Woodhouse <dwmw@amazon.co.uk> Message-Id: <20230111180651.14394-3-dwmw2@infradead.org> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
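A minimal sketch of the trick in kvm_vm_ioctl_create_vcpu(): while kvm->lock is already held, take and immediately drop the new vCPU's mutex, purely so lockdep records the kvm->lock -> vcpu->mutex ordering on every configuration.

    mutex_lock(&kvm->lock);

    #ifdef CONFIG_LOCKDEP
        /* Ensure lockdep knows vcpu->mutex is taken *inside* kvm->lock. */
        mutex_lock(&vcpu->mutex);
        mutex_unlock(&vcpu->mutex);
    #endif

        /* ... rest of vCPU creation under kvm->lock ... */
    mutex_unlock(&kvm->lock);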
-
- Dec 29, 2022
-
-
Sean Christopherson authored
Convert the last two "out" labels to "err" labels now that the dust has settled, i.e. now that there are no more planned changes to the order of things in kvm_init(). Use "err" instead of "out" as it's easier to describe what failed than it is to describe what needs to be unwound, e.g. if allocating a per-CPU kick mask fails, KVM needs to free any masks that were allocated, and of course needs to unwind previous operations. Reported-by:
Chao Gao <chao.gao@intel.com> Signed-off-by:
Sean Christopherson <seanjc@google.com> Message-Id: <20221130230934.1014142-51-seanjc@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Allow architectures to opt out of the generic hardware enabling logic, and opt out on both s390 and PPC, which don't need to manually enable virtualization as it's always on (when available). In addition to letting s390 and PPC drop a bit of dead code, this will hopefully also allow ARM to clean up its related code, e.g. ARM has its own per-CPU flag to track which CPUs have enabled hardware due to the need to keep hardware enabled indefinitely when pKVM is enabled. Signed-off-by:
Sean Christopherson <seanjc@google.com> Acked-by:
Anup Patel <anup@brainfault.org> Message-Id: <20221130230934.1014142-50-seanjc@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
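A sketch of a plausible opt-out mechanism: a Kconfig symbol gating the generic code, selected by architectures that need manual enabling (the symbol name is an assumption based on this series):

    # virt/kvm/Kconfig -- selected by e.g. x86 and ARM; s390 and PPC don't
    config KVM_GENERIC_HARDWARE_ENABLING
           bool

    /* virt/kvm/kvm_main.c */
    #ifdef CONFIG_KVM_GENERIC_HARDWARE_ENABLING
    /* kvm_online_cpu(), hardware_enable_all(), reboot/suspend hooks, ... */
    #endif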
-
Sean Christopherson authored
Register the suspend/resume notifier hooks at the same time KVM registers its reboot notifier so that all the code in kvm_init() that deals with enabling/disabling hardware is bundled together. Opportunistically move KVM's implementations to reside near the reboot notifier code for the same reason. Bunching the code together will allow architectures to opt out of KVM's generic hardware enable/disable logic with minimal #ifdeffery. Signed-off-by:
Sean Christopherson <seanjc@google.com> Message-Id: <20221130230934.1014142-49-seanjc@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
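KVM's suspend/resume callbacks are wired up via syscore_ops; after this change the registration sits next to the reboot notifier in kvm_init(), roughly:

    static struct syscore_ops kvm_syscore_ops = {
        .suspend = kvm_suspend,
        .resume = kvm_resume,
    };

    /* in kvm_init(), grouped with the other hardware enable/disable setup */
    register_reboot_notifier(&kvm_reboot_notifier);
    register_syscore_ops(&kvm_syscore_ops);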
-
Isaku Yamahata authored
Rework detecting hardware enabling errors to use a local variable in the "enable all" path to track whether or not enabling was successful across all CPUs. Using a global variable complicates paths that enable hardware only on the current CPU, e.g. kvm_resume() and kvm_online_cpu().

Opportunistically add a WARN if hardware enabling fails during kvm_resume(); KVM is all kinds of hosed if CPU0 fails to enable hardware. The WARN is largely futile in the current code, as KVM BUG()s on spurious faults on VMX instructions, e.g. attempting to run a vCPU on a CPU where hardware enabling failed will explode.

  ------------[ cut here ]------------
  kernel BUG at arch/x86/kvm/x86.c:508!
  invalid opcode: 0000 [#1] SMP
  CPU: 3 PID: 1009 Comm: CPU 4/KVM Not tainted 6.1.0-rc1+ #11
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
  RIP: 0010:kvm_spurious_fault+0xa/0x10
  Call Trace:
   vmx_vcpu_load_vmcs+0x192/0x230 [kvm_intel]
   vmx_vcpu_load+0x16/0x60 [kvm_intel]
   kvm_arch_vcpu_load+0x32/0x1f0
   vcpu_load+0x2f/0x40
   kvm_arch_vcpu_ioctl_run+0x19/0x9d0
   kvm_vcpu_ioctl+0x271/0x660
   __x64_sys_ioctl+0x80/0xb0
   do_syscall_64+0x2b/0x50
   entry_SYSCALL_64_after_hwframe+0x46/0xb0

But, the WARN may provide a breadcrumb to understand what went awry, and someday KVM may fix one or both of those bugs, e.g. by finding a way to eat spurious faults no matter the context (easier said than done due to side effects of certain operations, e.g. Intel's VMCLEAR). Signed-off-by:
Isaku Yamahata <isaku.yamahata@intel.com> [sean: rebase, WARN on failure in kvm_resume()] Signed-off-by:
Sean Christopherson <seanjc@google.com> Message-Id: <20221130230934.1014142-48-seanjc@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
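A sketch of the reworked "enable all" path: the failure indicator is a stack-local atomic threaded through on_each_cpu() rather than a global (helper names are assumptions):

    static void hardware_enable_nolock(void *failed)
    {
        if (__hardware_enable_nolock())
            atomic_inc(failed);
    }

    static int hardware_enable_all(void)
    {
        atomic_t failed = ATOMIC_INIT(0);
        int r = 0;
        ...
        on_each_cpu(hardware_enable_nolock, &failed, 1);
        if (atomic_read(&failed))
            r = -EBUSY;
        ...
        return r;
    }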
-
Sean Christopherson authored
Use a per-CPU variable instead of a shared bitmap to track which CPUs have successfully enabled virtualization hardware. Using a per-CPU bool avoids the need for an additional allocation, and arguably yields easier to read code. Using a bitmap would be advantageous if KVM used it to avoid generating IPIs to CPUs that failed to enable hardware, but that's an extreme edge case and not worth optimizing, and the low level helpers would still want to keep their individual checks as attempting to enable virtualization hardware when it's already enabled can be problematic, e.g. Intel's VMXON will fault. Opportunistically change the order in hardware_enable_nolock() to set the flag if and only if hardware enabling is successful, instead of speculatively setting the flag and then clearing it on failure. Add a comment explaining that the check in hardware_disable_nolock() isn't simply paranoia. Waaay back when, commit 1b6c0168 ("KVM: Keep track of which cpus have virtualization enabled") added the logic as a guard against CPU hotplug racing with hardware enable/disable. Now that KVM has eliminated the race by taking cpu_hotplug_lock for read (via cpus_read_lock()) when enabling or disabling hardware, at first glance it appears that the check is now superfluous, i.e. it's tempting to remove the per-CPU flag entirely... Signed-off-by:
Sean Christopherson <seanjc@google.com> Message-Id: <20221130230934.1014142-47-seanjc@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
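A sketch of the per-CPU flag and the success-only ordering called out above:

    static DEFINE_PER_CPU(bool, hardware_enabled);

    static int __hardware_enable_nolock(void)
    {
        if (__this_cpu_read(hardware_enabled))
            return 0;

        if (kvm_arch_hardware_enable())
            return -EIO;

        /* Set the flag if and only if enabling actually succeeded. */
        __this_cpu_write(hardware_enabled, true);
        return 0;
    }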
-
Isaku Yamahata authored
Drop the superfluous invocation of hardware_disable_nolock() during kvm_exit(), as it's nothing more than a glorified nop. KVM automatically disables hardware on all CPUs when the last VM is destroyed, and kvm_exit() cannot be called until the last VM goes away as the calling module is pinned by an elevated refcount of the fops associated with /dev/kvm. This holds true even on x86, where the caller of kvm_exit() is not kvm.ko, but is instead a dependent module, kvm_amd.ko or kvm_intel.ko, as kvm_chardev_ops.owner is set to the module that calls kvm_init(), not hardcoded to the base kvm.ko module. Signed-off-by:
Isaku Yamahata <isaku.yamahata@intel.com> [sean: rework changelog] Signed-off-by:
Sean Christopherson <seanjc@google.com> Message-Id: <20221130230934.1014142-46-seanjc@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Isaku Yamahata authored
Drop kvm_count_lock and instead protect kvm_usage_count with kvm_lock now that KVM hooks CPU hotplug during the ONLINE phase, which can sleep. Previously, KVM hooked the STARTING phase, which is not allowed to sleep and thus could not take kvm_lock (a mutex). This effectively allows the task that's initiating hardware enabling/disabling to be preempted and/or migrated. Note, the Documentation/virt/kvm/locking.rst statement that kvm_count_lock is "raw" because hardware enabling/disabling needs to be atomic with respect to migration is wrong on multiple fronts. First, while regular spinlocks can be preempted, the task holding the lock cannot be migrated. Second, preventing migration is not required. on_each_cpu() disables preemption, which ensures that cpus_hardware_enabled correctly reflects hardware state. The task may be preempted/migrated between bumping kvm_usage_count and invoking on_each_cpu(), but that's perfectly ok as kvm_usage_count is still protected, e.g. other tasks that call hardware_enable_all() will be blocked until the preempted/migrated owner exits its critical section. KVM does have lockless accesses to kvm_usage_count in the suspend/resume flows, but those are safe because all tasks must be frozen prior to suspending CPUs, and a task cannot be frozen while it holds one or more locks (userspace tasks are frozen via a fake signal). Preemption doesn't need to be explicitly disabled in the hotplug path. The hotplug thread is pinned to the CPU that's being hotplugged, and KVM only cares about having a stable CPU, i.e. to ensure hardware is enabled on the correct CPU. Lockdep, i.e. check_preemption_disabled(), plays nice with this state too, as is_percpu_thread() is true for the hotplug thread. Signed-off-by:
Isaku Yamahata <isaku.yamahata@intel.com> Co-developed-by:
Sean Christopherson <seanjc@google.com> Signed-off-by:
Sean Christopherson <seanjc@google.com> Message-Id: <20221130230934.1014142-45-seanjc@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
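A sketch of the resulting locking, assuming the helper shapes above:

    /*
     * kvm_usage_count is now protected by the kvm_lock mutex: the task
     * doing hardware enabling may sleep, be preempted, and even migrate
     * while holding it; on_each_cpu() still disables preemption around
     * the actual per-CPU enable.
     */
    mutex_lock(&kvm_lock);
    kvm_usage_count++;
    if (kvm_usage_count == 1)
        on_each_cpu(hardware_enable_nolock, &failed, 1);
    mutex_unlock(&kvm_lock);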
-
Sean Christopherson authored
Use the non-raw smp_processor_id() in the low-level hardware enable/disable helpers, as KVM absolutely relies on the CPU being stable, e.g. KVM would end up with incorrect state if the task were migrated between accessing cpus_hardware_enabled and actually enabling/disabling hardware. Signed-off-by:
Sean Christopherson <seanjc@google.com> Message-Id: <20221130230934.1014142-44-seanjc@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Chao Gao authored
Disable CPU hotplug when enabling/disabling hardware to prevent the corner case where the following sequence occurs:

 1. A hotplugged CPU marks itself online in cpu_online_mask
 2. The hotplugged CPU enables interrupts before invoking KVM's ONLINE callback
 3. hardware_{en,dis}able_all() is invoked on another CPU

In that case the hotplugged CPU will be included in on_each_cpu() and thus get sent through hardware_{en,dis}able_nolock() before kvm_online_cpu() is called.

    start_secondary {
        ...
        set_cpu_online(smp_processor_id(), true);    <- 1
        ...
        local_irq_enable();                          <- 2
        ...
        cpu_startup_entry(CPUHP_AP_ONLINE_IDLE);     <- 3
    }

KVM currently fudges around this race by keeping track of which CPUs have done hardware enabling (see commit 1b6c0168 "KVM: Keep track of which cpus have virtualization enabled"), but that's an inefficient, convoluted, and hacky solution. Signed-off-by:
Chao Gao <chao.gao@intel.com> [sean: split to separate patch, write changelog] Signed-off-by:
Sean Christopherson <seanjc@google.com> Message-Id: <20221130230934.1014142-43-seanjc@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
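A sketch of the fix: hold cpu_hotplug_lock for read across the count update and the cross-CPU calls, so no CPU can newly mark itself online in the middle.

    cpus_read_lock();        /* a hotplugging CPU cannot reach step 1 above
                              * while this critical section is in flight */
    ...
    on_each_cpu(hardware_enable_nolock, &failed, 1);
    ...
    cpus_read_unlock();      /* CPUs onlined after this point get hardware
                              * enabled via kvm_online_cpu() instead */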
-
Chao Gao authored
The CPU STARTING section doesn't allow callbacks to fail. Move KVM's hotplug callback to the ONLINE section so that it can abort onlining a CPU in certain cases, e.g. when KVM fails to enable hardware virtualization on the hotplugged CPU, to avoid potentially breaking VMs running on existing CPUs. Place KVM's hotplug state before CPUHP_AP_SCHED_WAIT_EMPTY as it ensures that, when offlining a CPU, all user tasks and non-pinned kernel tasks have left the CPU, i.e. there cannot be a vCPU task around. So it is safe for KVM's CPU offline callback to disable hardware virtualization at that point. Likewise, KVM's online callback can enable hardware virtualization before any vCPU task gets a chance to run on hotplugged CPUs. Drop kvm_x86_check_processor_compatibility()'s WARN that IRQs are disabled, as the ONLINE section runs with IRQs enabled. The WARN wasn't intended to be a requirement, e.g. disabling preemption is sufficient, the IRQ thing was purely an aggressive sanity check since the helper was only ever invoked via SMP function call. Rename KVM's CPU hotplug callbacks accordingly. Suggested-by:
Thomas Gleixner <tglx@linutronix.de> Signed-off-by:
Chao Gao <chao.gao@intel.com> Signed-off-by:
Isaku Yamahata <isaku.yamahata@intel.com> Reviewed-by:
Yuan Yao <yuan.yao@intel.com> [sean: drop WARN that IRQs are disabled] Signed-off-by:
Sean Christopherson <seanjc@google.com> Message-Id: <20221130230934.1014142-42-seanjc@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
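A sketch of the registration after the move (state and callback names per the description above; CPUHP_AP_KVM_ONLINE must be defined before CPUHP_AP_SCHED_WAIT_EMPTY in enum cpuhp_state):

    r = cpuhp_setup_state_nocalls(CPUHP_AP_KVM_ONLINE, "kvm/cpu:online",
                                  kvm_online_cpu, kvm_offline_cpu);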
-
Sean Christopherson authored
Drop kvm_arch_check_processor_compat() and its support code now that all architecture implementations are nops. Signed-off-by:
Sean Christopherson <seanjc@google.com> Reviewed-by:
Philippe Mathieu-Daudé <philmd@linaro.org> Reviewed-by: Eric Farman <farman@linux.ibm.com> # s390 Acked-by:
Anup Patel <anup@brainfault.org> Reviewed-by:
Kai Huang <kai.huang@intel.com> Message-Id: <20221130230934.1014142-33-seanjc@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Drop kvm_arch_init() and kvm_arch_exit() now that all implementations are nops. No functional change intended. Signed-off-by:
Sean Christopherson <seanjc@google.com> Reviewed-by: Eric Farman <farman@linux.ibm.com> # s390 Reviewed-by:
Philippe Mathieu-Daudé <philmd@linaro.org> Acked-by:
Anup Patel <anup@brainfault.org> Message-Id: <20221130230934.1014142-30-seanjc@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Drop kvm_arch_hardware_setup() and kvm_arch_hardware_unsetup() now that all implementations are nops. No functional change intended. Signed-off-by:
Sean Christopherson <seanjc@google.com> Reviewed-by: Eric Farman <farman@linux.ibm.com> # s390 Acked-by:
Anup Patel <anup@brainfault.org> Message-Id: <20221130230934.1014142-10-seanjc@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Move the call to kvm_vfio_ops_exit() further up kvm_exit() to try and bring some amount of symmetry to the setup order in kvm_init(), and more importantly so that the arch hooks are invoked dead last by kvm_exit(). This will allow arch code to move away from the arch hooks without any change in ordering between arch code and common code in kvm_exit(). That kvm_vfio_ops_exit() is called last appears to be 100% arbitrary. It was bolted on after the fact by commit 571ee1b6 ("kvm: vfio: fix unregister kvm_device_ops of vfio"). The nullified kvm_device_ops_table is also local to kvm_main.c and is used only when there are active VMs, so unless arch code is doing something truly bizarre, nullifying the table earlier in kvm_exit() is little more than a nop. Signed-off-by:
Sean Christopherson <seanjc@google.com> Reviewed-by:
Cornelia Huck <cohuck@redhat.com> Reviewed-by:
Eric Farman <farman@linux.ibm.com> Message-Id: <20221130230934.1014142-5-seanjc@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Allocate cpus_hardware_enabled after arch hardware setup so that arch "init" and "hardware setup" are called back-to-back and thus can be combined in a future patch. cpus_hardware_enabled is never used before kvm_create_vm(), i.e. doesn't have a dependency with hardware setup and only needs to be allocated before /dev/kvm is exposed to userspace. Free the object before the arch hooks are invoked to maintain symmetry, and so that arch code can move away from the hooks without having to worry about ordering changes. Signed-off-by:
Sean Christopherson <seanjc@google.com> Reviewed-by:
Yuan Yao <yuan.yao@intel.com> Message-Id: <20221130230934.1014142-4-seanjc@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Move initialization of KVM's IRQ FD workqueue below arch hardware setup as a step towards consolidating arch "init" and "hardware setup", and eventually towards dropping the hooks entirely. There is no dependency on the workqueue being created before hardware setup; the workqueue is used only when destroying VMs, i.e. it only needs to be created before /dev/kvm is exposed to userspace. Move the destruction of the workqueue before the arch hooks to maintain symmetry, and so that arch code can move away from the hooks without having to worry about ordering changes. Reword the comment about kvm_irqfd_init() needing to come after kvm_arch_init() to call out that kvm_arch_init() must come before common KVM does _anything_, as x86 very subtly relies on that behavior to deal with multiple calls to kvm_init(), e.g. if userspace attempts to load kvm_amd.ko and kvm_intel.ko. Tag the code with a FIXME, as x86's subtle requirement is gross, and invoking an arch callback as the very first action in a helper that is called only from arch code is silly. Signed-off-by:
Sean Christopherson <seanjc@google.com> Message-Id: <20221130230934.1014142-3-seanjc@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Register /dev/kvm, i.e. expose KVM to userspace, only after all other setup has completed. Once /dev/kvm is exposed, userspace can start invoking KVM ioctls, creating VMs, etc... If userspace creates a VM before KVM is done with its configuration, bad things may happen, e.g. KVM will fail to properly migrate vCPU state if a VM is created before KVM has registered preemption notifiers. Cc: stable@vger.kernel.org Signed-off-by:
Sean Christopherson <seanjc@google.com> Message-Id: <20221130230934.1014142-2-seanjc@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
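The ordering fix boils down to making the misc-device registration the very last step of kvm_init(); a sketch (the error label is illustrative):

    /*
     * Registration must come last: this is the point at which /dev/kvm
     * becomes visible, and userspace can immediately start creating VMs,
     * so every bit of setup (e.g. preemption notifiers) must already be
     * done.
     */
    r = misc_register(&kvm_dev);
    if (r) {
        pr_err("kvm: misc device register failed\n");
        goto err_register;
    }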
-
- Dec 27, 2022
-
-
Lai Jiangshan authored
No code is using KVM_MMU_READ_LOCK() or KVM_MMU_READ_UNLOCK(). They used to be in virt/kvm/pfncache.c:

    KVM_MMU_READ_LOCK(kvm);
    retry = mmu_notifier_retry_hva(kvm, mmu_seq, uhva);
    KVM_MMU_READ_UNLOCK(kvm);

However, since 58cd407c ("KVM: Fix multiple races in gfn=>pfn cache refresh", 2022-05-25) the code is only relying on the MMU notifier's invalidation count and sequence number. Signed-off-by:
Lai Jiangshan <jiangshan.ljs@antgroup.com> Message-Id: <20221207120617.9409-1-jiangshanlai@gmail.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
- Dec 02, 2022
-
-
Sean Christopherson authored
Remove a comment about KVM_REQ_UNHALT being set by kvm_vcpu_check_block() that was missed when KVM_REQ_UNHALT was dropped. Fixes: c59fb127 ("KVM: remove KVM_REQ_UNHALT") Signed-off-by:
Sean Christopherson <seanjc@google.com> Message-Id: <20221201220433.31366-1-seanjc@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
- Nov 30, 2022
-
-
Sean Christopherson authored
When refreshing a gfn=>pfn cache, skip straight to unlocking if the cache is already valid instead of stuffing the "old" variables to turn the unmapping outro into a nop. Signed-off-by:
Sean Christopherson <seanjc@google.com> Signed-off-by:
David Woodhouse <dwmw@amazon.co.uk>
-
Sean Christopherson authored
Drop the @gpa param from the exported check()+refresh() helpers and limit changing the cache's GPA to the activate path. All external users just feed in gpc->gpa, i.e. this is a fancy nop. Allowing users to change the GPA at check()+refresh() is dangerous as those helpers explicitly allow concurrent calls, e.g. KVM could get into a livelock scenario. It's also unclear as to what the expected behavior should be if multiple tasks attempt to refresh with different GPAs. Signed-off-by:
Sean Christopherson <seanjc@google.com> Signed-off-by:
David Woodhouse <dwmw@amazon.co.uk>
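A sketch of the resulting API split (prototypes approximated from the description): the GPA is pinned at activation, and the concurrent-safe helpers implicitly operate on gpc->gpa.

    /* The GPA can only be (re)set here, on the activate path. */
    int kvm_gpc_activate(struct gfn_to_pfn_cache *gpc, gpa_t gpa,
                         unsigned long len);

    /* check()/refresh() allow concurrent calls and no longer take a GPA. */
    bool kvm_gpc_check(struct gfn_to_pfn_cache *gpc, unsigned long len);
    int kvm_gpc_refresh(struct gfn_to_pfn_cache *gpc, unsigned long len);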
-
Sean Christopherson authored
Don't partially reinitialize a gfn=>pfn cache when activating the cache, and instead assert that the cache is not valid during activation. Bug the VM if the assertion fails, as use-after-free and/or data corruption is all but guaranteed if KVM ends up with a valid-but-inactive cache. Signed-off-by:
Sean Christopherson <seanjc@google.com> Signed-off-by:
David Woodhouse <dwmw@amazon.co.uk>
-
Sean Christopherson authored
Drop kvm_gpc_unmap() as it has no users and unclear requirements. The API was added as part of the original gfn_to_pfn_cache support, but its sole usage[*] was never merged. Fold the guts of kvm_gpc_unmap() into the deactivate path and drop the API. Omit acquiring refresh_lock as concurrent calls to kvm_gpc_deactivate() are not allowed (this is not enforced, e.g. via lockdep, due to it being called during vCPU destruction). If/when temporary unmapping makes a comeback, the desirable behavior is likely to restrict temporary unmapping to vCPU-exclusive mappings and require the vcpu->mutex be held to serialize unmap. Use of the refresh_lock to protect unmapping was somewhat speculatively added by commit 93984f19 ("KVM: Fully serialize gfn=>pfn cache refresh via mutex") to guard against concurrent unmaps, but the primary use case of the temporary unmap, nested virtualization[*], doesn't actually need or want concurrent unmaps. [*] https://lore.kernel.org/all/20211210163625.2886-7-dwmw2@infradead.org Signed-off-by:
Sean Christopherson <seanjc@google.com> Signed-off-by:
David Woodhouse <dwmw@amazon.co.uk>
-
Michal Luczaj authored
Make kvm_gpc_refresh() use kvm instance cached in gfn_to_pfn_cache. No functional change intended. Suggested-by:
Sean Christopherson <seanjc@google.com> Signed-off-by:
Michal Luczaj <mhal@rbox.co> [sean: leave kvm_gpc_unmap() as-is] Signed-off-by:
Sean Christopherson <seanjc@google.com> Signed-off-by:
David Woodhouse <dwmw@amazon.co.uk>
-
Michal Luczaj authored
Make hva_to_pfn_retry() use kvm instance cached in gfn_to_pfn_cache. Suggested-by:
Sean Christopherson <seanjc@google.com> Signed-off-by:
Michal Luczaj <mhal@rbox.co> Signed-off-by:
Sean Christopherson <seanjc@google.com> Signed-off-by:
David Woodhouse <dwmw@amazon.co.uk>
-
Michal Luczaj authored
Make kvm_gpc_check() use kvm instance cached in gfn_to_pfn_cache. Suggested-by:
Sean Christopherson <seanjc@google.com> Signed-off-by:
Michal Luczaj <mhal@rbox.co> Signed-off-by:
Sean Christopherson <seanjc@google.com> Signed-off-by:
David Woodhouse <dwmw@amazon.co.uk>
-
Michal Luczaj authored
Move the assignment of immutable properties @kvm, @vcpu, and @usage to the initializer. Make _activate() and _deactivate() use stored values. Note, @len is also effectively immutable for most cases, but not in the case of the Xen runstate cache, which may be split across two pages and the length of the first segment will depend on its address. Suggested-by:
Sean Christopherson <seanjc@google.com> Signed-off-by:
Michal Luczaj <mhal@rbox.co> [sean: handle @len in a separate patch] Signed-off-by:
Sean Christopherson <seanjc@google.com> [dwmw2: acknowledge that @len can actually change for some use cases] Signed-off-by:
David Woodhouse <dwmw@amazon.co.uk>
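A sketch of the initializer after hoisting the immutable properties (prototypes approximated from the description; @len deliberately stays out, since the Xen runstate cache can change it):

    void kvm_gpc_init(struct gfn_to_pfn_cache *gpc, struct kvm *kvm,
                      struct kvm_vcpu *vcpu, enum pfn_cache_usage usage);

    /* activate/deactivate now rely on the values stored at init: */
    int kvm_gpc_activate(struct gfn_to_pfn_cache *gpc, gpa_t gpa,
                         unsigned long len);
    void kvm_gpc_deactivate(struct gfn_to_pfn_cache *gpc);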
-
Michal Luczaj authored
Remove the unused @kvm argument from gpc_unmap_khva(). Signed-off-by:
Michal Luczaj <mhal@rbox.co> Signed-off-by:
Sean Christopherson <seanjc@google.com> Signed-off-by:
David Woodhouse <dwmw@amazon.co.uk> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Michal Luczaj authored
Formalize "gpc" as the acronym and use it in function names. No functional change intended. Suggested-by:
Sean Christopherson <seanjc@google.com> Signed-off-by:
Michal Luczaj <mhal@rbox.co> Signed-off-by:
Sean Christopherson <seanjc@google.com> Signed-off-by:
David Woodhouse <dwmw@amazon.co.uk> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
- Nov 24, 2022
-
-
David Woodhouse authored
In the case where a GPC is refreshed to a different location within the same page, we didn't bother to update it. Mostly we don't need to, but since the ->khva field also includes the offset within the page, that does have to be updated. Fixes: 3ba2c95e ("KVM: Do not incorporate page offset into gfn=>pfn cache user address") Signed-off-by:
David Woodhouse <dwmw@amazon.co.uk> Reviewed-by:
Paul Durrant <paul@xen.org> Reviewed-by:
Sean Christopherson <seanjc@google.com> Cc: stable@kernel.org Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
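A sketch of the fix in the refresh path: when the existing mapping is reused, khva must still be recomputed because it embeds the offset within the page.

    } else {
        /*
         * The HVA=>PFN mapping is still valid, so reuse it, but the
         * offset within the page may have changed, so khva cannot be
         * left as-is.
         */
        gpc->khva = old_khva + page_offset;
    }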
-
- Nov 18, 2022
-
-
Paolo Bonzini authored
Since gfn_to_memslot() is relatively expensive, it helps to skip it if the memslot cannot possibly have dirty logging enabled. In order to do this, add to struct kvm a counter of the number of log-page memslots. While the correct value can only be read with slots_lock taken, the NX recovery thread is content with using an approximate value. Therefore, the counter is an atomic_t. Based on https://lore.kernel.org/kvm/20221027200316.2221027-2-dmatlack@google.com/ by David Matlack. Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
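A sketch of the pattern (field name and usage approximated): the atomic counter lets the NX recovery thread skip gfn_to_memslot() entirely when no memslot has dirty logging enabled.

    /* in struct kvm */
    atomic_t nr_memslots_dirty_logging;

    /* reader side, e.g. the NX recovery thread; an approximate value is fine */
    slot = NULL;
    if (atomic_read(&kvm->nr_memslots_dirty_logging)) {
        slot = gfn_to_memslot(kvm, gfn);
        WARN_ON_ONCE(!slot);
    }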
-
- Nov 17, 2022
-
-
David Matlack authored
Obey kvm.halt_poll_ns in VMs not using KVM_CAP_HALT_POLL on every halt, rather than just sampling the module parameter when the VM is first created. This restores the original behavior of kvm.halt_poll_ns for VMs that have not opted into KVM_CAP_HALT_POLL. Notably, this change restores the ability for admins to disable or change the maximum halt-polling time system-wide for VMs not using KVM_CAP_HALT_POLL. Reported-by:
Christian Borntraeger <borntraeger@de.ibm.com> Fixes: acd05785 ("kvm: add capability for halt polling") Signed-off-by:
David Matlack <dmatlack@google.com> Message-Id: <20221117001657.1067231-4-dmatlack@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
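A sketch of the resulting selection logic (the override flag's name follows this series; treat it as illustrative): VMs that never enabled KVM_CAP_HALT_POLL read the module parameter on every halt.

    static unsigned int kvm_vcpu_max_halt_poll_ns(struct kvm_vcpu *vcpu)
    {
        struct kvm *kvm = vcpu->kvm;

        if (kvm->override_halt_poll_ns) {
            /* Userspace enabled KVM_CAP_HALT_POLL on this VM. */
            return READ_ONCE(kvm->max_halt_poll_ns);
        }

        /* Otherwise obey the (writable) kvm.halt_poll_ns module param. */
        return READ_ONCE(halt_poll_ns);
    }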
-
David Matlack authored
Avoid re-reading kvm->max_halt_poll_ns multiple times during halt-polling except when it is explicitly useful, e.g. to check if the max time changed across a halt. kvm->max_halt_poll_ns can be changed at any time by userspace via KVM_CAP_HALT_POLL. This bug is unlikely to cause any serious side-effects. In the worst case one halt polls for shorter or longer than it should, and then is fixed up on the next halt. Furthermore, this is still possible since updates to kvm->max_halt_poll_ns are not synchronized with halts. Fixes: acd05785 ("kvm: add capability for halt polling") Signed-off-by:
David Matlack <dmatlack@google.com> Message-Id: <20221117001657.1067231-3-dmatlack@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
David Matlack authored
Cap vcpu->halt_poll_ns based on the max halt polling time just before halting, rather than after the last halt. This arguably provides better accuracy if an admin disables halt polling in between halts, although the improvement is nominal. A side-effect of this change is that grow_halt_poll_ns() no longer needs to access vcpu->kvm->max_halt_poll_ns, which will be useful in a future commit where the max halt polling time can come from the module parameter halt_poll_ns instead. Signed-off-by:
David Matlack <dmatlack@google.com> Message-Id: <20221117001657.1067231-2-dmatlack@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
- Nov 12, 2022
-
-
Gavin Shan authored
In mark_page_dirty_in_slot(), we bail out when no running vcpu exists and a running vcpu context is strictly required by the architecture. This may cause a backwards compatibility issue. Currently, saving vgic/its tables is the only known case where no running vcpu context is expected. There may be other, unknown cases where no running vcpu context exists; those are reported by the warning message, and we bail out without pushing the dirty information to the backup bitmap. To cover them, the application would enable the backup bitmap for the unknown cases. However, the dirty information still can't be pushed to the backup bitmap, even though the backup bitmap is enabled for those unknown cases in the application, until the unknown cases are added to the allowed list of non-running vcpu contexts with extra code changes to the host kernel. In order to make such a new application, with the backup bitmap enabled, work with an unchanged host, we continue to push the dirty information to the backup bitmap instead of bailing out early. With the added check on 'memslot->dirty_bitmap' in mark_page_dirty_in_slot(), a kernel crash is silently avoided under the combined conditions: no running vcpu context, kvm_arch_allow_write_without_running_vcpu() returns 'true', and the backup bitmap (KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP) isn't enabled yet. Suggested-by:
Sean Christopherson <seanjc@google.com> Signed-off-by:
Gavin Shan <gshan@redhat.com> Signed-off-by:
Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20221112094322.21911-1-gshan@redhat.com
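A sketch of the resulting dispatch in mark_page_dirty_in_slot(): with the added dirty_bitmap check, a write without a running vCPU silently lands in the backup bitmap when one exists, instead of crashing.

    if (kvm->dirty_ring_size && vcpu)
        kvm_dirty_ring_push(vcpu, slot, rel_gfn);
    else if (memslot->dirty_bitmap)
        set_bit_le(rel_gfn, memslot->dirty_bitmap);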
-
- Nov 10, 2022
-
-
Gavin Shan authored
ARM64 needs to dirty memory outside of a VCPU context when VGIC/ITS is enabled. This conflicts with the fact that ring-based dirty page tracking always requires a running VCPU context. Introduce a new flavor of dirty ring that requires the use of both VCPU dirty rings and a dirty bitmap. The expectation is that for non-VCPU sources of dirty memory (such as the VGIC/ITS on arm64), KVM writes to the dirty bitmap. Userspace should scan the dirty bitmap before migrating the VM to the target. Use an additional capability to advertise this behavior. The newly added capability (KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP) can't be enabled before KVM_CAP_DIRTY_LOG_RING_ACQ_REL on ARM64. In this way, the newly added capability is treated as an extension of KVM_CAP_DIRTY_LOG_RING_ACQ_REL. Suggested-by:
Marc Zyngier <maz@kernel.org> Suggested-by:
Peter Xu <peterx@redhat.com> Co-developed-by:
Oliver Upton <oliver.upton@linux.dev> Signed-off-by:
Oliver Upton <oliver.upton@linux.dev> Signed-off-by:
Gavin Shan <gshan@redhat.com> Acked-by:
Peter Xu <peterx@redhat.com> Signed-off-by:
Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20221110104914.31280-4-gshan@redhat.com
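From userspace, the ordering requirement looks roughly like this (vm_fd and ring_size are illustrative; error checking omitted):

    struct kvm_enable_cap cap = {
        .cap = KVM_CAP_DIRTY_LOG_RING_ACQ_REL,
        .args[0] = ring_size,
    };
    ioctl(vm_fd, KVM_ENABLE_CAP, &cap);    /* must be enabled first */

    cap.cap = KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP;
    cap.args[0] = 0;
    ioctl(vm_fd, KVM_ENABLE_CAP, &cap);    /* then the bitmap extension */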
-
Gavin Shan authored
The VCPU isn't expected to be runnable when the dirty ring becomes soft full, until the dirty pages are harvested and the dirty ring is reset from userspace. So there is a check at each guest entry to see whether the dirty ring is soft full or not; the VCPU is stopped from running if its dirty ring has been soft full. A similar check will be needed when the feature is going to be supported on ARM64. As Marc Zyngier suggested, a new event will avoid the pointless overhead of checking the size of the dirty ring ('vcpu->kvm->dirty_ring_size') at each guest entry. Add KVM_REQ_DIRTY_RING_SOFT_FULL. The event is raised when the dirty ring becomes soft full in kvm_dirty_ring_push(). The event is only cleared in the check, done in the newly added helper kvm_dirty_ring_check_request(). Since the VCPU is not runnable when the dirty ring becomes soft full, the KVM_REQ_DIRTY_RING_SOFT_FULL event is always set to prevent the VCPU from running until the dirty pages are harvested and the dirty ring is reset by userspace. kvm_dirty_ring_soft_full() becomes a private function with the newly added helper kvm_dirty_ring_check_request(). The alignment for the various event definitions in kvm_host.h is changed to tab characters by the way. In order to avoid using 'container_of()', the argument @ring is replaced by @vcpu in kvm_dirty_ring_push(). Link: https://lore.kernel.org/kvmarm/87lerkwtm5.wl-maz@kernel.org Suggested-by:
Marc Zyngier <maz@kernel.org> Signed-off-by:
Gavin Shan <gshan@redhat.com> Reviewed-by:
Peter Xu <peterx@redhat.com> Reviewed-by:
Sean Christopherson <seanjc@google.com> Signed-off-by:
Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20221110104914.31280-2-gshan@redhat.com
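A sketch of the new helper, following the description above (treat details as approximate): the request is re-raised while the ring remains soft full, so the vCPU stays blocked until userspace resets the ring.

    bool kvm_dirty_ring_check_request(struct kvm_vcpu *vcpu)
    {
        /*
         * The request is cleared only here; keep re-raising it while the
         * ring is still soft full so the vCPU cannot enter the guest
         * before userspace harvests and resets the ring.
         */
        if (kvm_check_request(KVM_REQ_DIRTY_RING_SOFT_FULL, vcpu) &&
            kvm_dirty_ring_soft_full(&vcpu->dirty_ring)) {
            kvm_make_request(KVM_REQ_DIRTY_RING_SOFT_FULL, vcpu);
            vcpu->run->exit_reason = KVM_EXIT_DIRTY_RING_FULL;
            return true;
        }

        return false;
    }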
-