.. SPDX-License-Identifier: GPL-2.0

=================
KVM-specific MSRs
=================

:Author: Glauber Costa <glommer@redhat.com>, Red Hat Inc, 2010

KVM uses a set of custom MSRs to service some guest requests.

Custom MSRs have a range reserved for them, which goes from
0x4b564d00 to 0x4b564dff. There are MSRs outside this range,
but they are deprecated and their use is discouraged.

Custom MSR list
---------------

The currently supported custom MSRs are:

MSR_KVM_WALL_CLOCK_NEW:
	0x4b564d00

data:
	4-byte aligned physical address of a memory area which must be
	in guest RAM. This memory is expected to hold a copy of the following
	structure::

	  struct pvclock_wall_clock {
		u32   version;
		u32   sec;
		u32   nsec;
	  } __attribute__((__packed__));

	whose data will be filled in by the hypervisor. The hypervisor is only
	guaranteed to update this data at the moment of the MSR write.
	Users that want to reliably query this information more than once have
	to write to this MSR more than once. Fields have the following meanings:

	version:
		the guest has to check version before and after reading the
		time information and check that both values are equal and even.
		An odd version indicates an in-progress update.

	sec:
		number of seconds for wallclock at time of boot.

	nsec:
		number of nanoseconds for wallclock at time of boot.

	In order to get the current wallclock time, the system_time from
	MSR_KVM_SYSTEM_TIME_NEW needs to be added.

	Note that although MSRs are per-CPU entities, the effect of this
	particular MSR is global.

	Availability of this MSR must be checked via bit 3 in 0x40000001 cpuid
	leaf prior to usage.
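The version check described for MSR_KVM_WALL_CLOCK_NEW can be sketched in C as
follows. This is a minimal illustration, not the guest kernel's code: the
helper name ``wall_clock_snapshot`` is hypothetical, fixed-width types stand in
for u32, and the acquire fences (a GCC/Clang builtin) are an assumption about
how a guest might keep the data reads between the two version reads.

```c
#include <stdbool.h>
#include <stdint.h>

/* Same layout as the documented structure, with fixed-width types. */
struct pvclock_wall_clock {
	uint32_t version;
	uint32_t sec;
	uint32_t nsec;
} __attribute__((__packed__));

/*
 * Try to take one consistent snapshot of the wall clock (hypothetical helper).
 * Returns true on success. Returns false if an update was in progress or
 * raced with the read; the caller should then retry.
 */
static bool wall_clock_snapshot(const volatile struct pvclock_wall_clock *wc,
				uint32_t *sec, uint32_t *nsec)
{
	uint32_t before = wc->version;

	if (before & 1)			/* odd: update in progress */
		return false;
	__atomic_thread_fence(__ATOMIC_ACQUIRE);
	*sec = wc->sec;
	*nsec = wc->nsec;
	__atomic_thread_fence(__ATOMIC_ACQUIRE);
	return wc->version == before;	/* unchanged: snapshot is consistent */
}
```

A caller would typically loop until the helper returns true, and write the MSR
again whenever it needs a fresh value.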

MSR_KVM_SYSTEM_TIME_NEW:
	0x4b564d01

data:
	4-byte aligned physical address of a memory area which must be in
	guest RAM, plus an enable bit in bit 0. This memory is expected to hold
	a copy of the following structure::

	  struct pvclock_vcpu_time_info {
		u32   version;
		u32   pad0;
		u64   tsc_timestamp;
		u64   system_time;
		u32   tsc_to_system_mul;
		s8    tsc_shift;
		u8    flags;
		u8    pad[2];
	  } __attribute__((__packed__)); /* 32 bytes */

	whose data will be filled in by the hypervisor periodically. Only one
	write, or registration, is needed for each VCPU. The interval between
	updates of this structure is arbitrary and implementation-dependent.
	The hypervisor may update this structure at any time it sees fit until
	a value with bit 0 clear is written to the MSR.

	Fields have the following meanings:

	version:
		the guest has to check version before and after reading the
		time information and check that both values are equal and even.
		An odd version indicates an in-progress update.

	tsc_timestamp:
		the tsc value at the current VCPU at the time
		of the update of this structure. Guests can subtract this value
		from the current tsc to derive a notion of elapsed time since
		the structure update.

	system_time:
		a host notion of monotonic time, including sleep
		time, at the time this structure was last updated. Unit is
		nanoseconds.

	tsc_to_system_mul:
		multiplier to be used when converting
		tsc-related quantity to nanoseconds

	tsc_shift:
		shift to be used when converting tsc-related
		quantity to nanoseconds. This shift will ensure that
		multiplication with tsc_to_system_mul does not overflow.
		A positive value denotes a left shift, a negative value
		a right shift.

		The conversion from tsc to nanoseconds involves an additional
		right shift by 32 bits. With this information, guests can
		derive per-CPU time by doing::

			time = (current_tsc - tsc_timestamp);
			if (tsc_shift >= 0)
				time <<= tsc_shift;
			else
				time >>= -tsc_shift;
			time = (time * tsc_to_system_mul) >> 32;
			time = time + system_time;

	flags:
		bits in this field indicate extended capabilities
		coordinated between the guest and the hypervisor. Availability
		of specific flags has to be checked in 0x40000001 cpuid leaf.
		Current flags are:

		+-----------+--------------+----------------------------------+
		| flag bit  | cpuid bit    | meaning                          |
		+-----------+--------------+----------------------------------+
		|           |              | time measures taken across       |
		|    0      |      24      | multiple cpus are guaranteed to  |
		|           |              | be monotonic                     |
		+-----------+--------------+----------------------------------+
		|           |              | guest vcpu has been paused by    |
		|    1      |     N/A      | the host                         |
		|           |              | See 4.70 in api.txt              |
		+-----------+--------------+----------------------------------+

	Availability of this MSR must be checked via bit 3 in 0x40000001 cpuid
	leaf prior to usage.
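The tsc-to-nanoseconds conversion described for MSR_KVM_SYSTEM_TIME_NEW can be
written as compilable C. The sketch below is illustrative only: the function
names ``pvclock_scale_delta`` and ``pvclock_ns`` are hypothetical, and
splitting the multiplication into 32-bit halves is one way (an assumption, not
necessarily what any guest does) to evaluate ``(time * tsc_to_system_mul) >> 32``
without needing an intermediate wider than 64 bits.

```c
#include <stdint.h>

/*
 * Scale a TSC delta to nanoseconds: shift by tsc_shift, multiply by
 * tsc_to_system_mul, then drop the low 32 bits of the product.
 */
static uint64_t pvclock_scale_delta(uint64_t delta, uint32_t mul, int8_t shift)
{
	uint64_t hi, lo;

	if (shift >= 0)
		delta <<= shift;
	else
		delta >>= -shift;

	/*
	 * (delta * mul) >> 32, computed from two partial products that
	 * each fit in 64 bits.
	 */
	hi = (delta >> 32) * mul;
	lo = (delta & 0xffffffffu) * mul;

	return hi + (lo >> 32);
}

/* Per-CPU time in nanoseconds for a given TSC reading. */
static uint64_t pvclock_ns(uint64_t tsc, uint64_t tsc_timestamp,
			   uint64_t system_time, uint32_t mul, int8_t shift)
{
	return system_time +
	       pvclock_scale_delta(tsc - tsc_timestamp, mul, shift);
}
```

For example, with a multiplier of 0x80000000 (one half in 32.32 fixed point)
and a shift of 0, a delta of 1000 ticks scales to 500 ns. The inputs should
come from a snapshot validated with the version check.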

MSR_KVM_WALL_CLOCK:
	0x11

data and functioning:
	same as MSR_KVM_WALL_CLOCK_NEW. Use that instead.

	This MSR falls outside the reserved KVM range and may be removed in the
	future. Its usage is deprecated.

	Availability of this MSR must be checked via bit 0 in 0x40000001 cpuid
	leaf prior to usage.

MSR_KVM_SYSTEM_TIME:
	0x12

data and functioning:
	same as MSR_KVM_SYSTEM_TIME_NEW. Use that instead.

	This MSR falls outside the reserved KVM range and may be removed in the
	future. Its usage is deprecated.

	Availability of this MSR must be checked via bit 0 in 0x40000001 cpuid
	leaf prior to usage.

	The suggested algorithm for detecting kvmclock presence is then::

		if (!kvm_para_available())    /* refer to cpuid.txt */
			return NON_PRESENT;

		flags = cpuid_eax(0x40000001);
		if (flags & (1 << 3)) {		/* bit 3: new MSRs */
			msr_kvm_system_time = MSR_KVM_SYSTEM_TIME_NEW;
			msr_kvm_wall_clock = MSR_KVM_WALL_CLOCK_NEW;
			return PRESENT;
		} else if (flags & (1 << 0)) {	/* bit 0: deprecated MSRs */
			msr_kvm_system_time = MSR_KVM_SYSTEM_TIME;
			msr_kvm_wall_clock = MSR_KVM_WALL_CLOCK;
			return PRESENT;
		} else {
			return NON_PRESENT;
		}
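For experimenting outside a guest, the MSR selection can be isolated from the
CPUID access. The following is a sketch under stated assumptions: the caller
has already confirmed it runs on KVM and passes in the EAX value of cpuid leaf
0x40000001; ``struct kvmclock_msrs`` and ``kvmclock_pick_msrs`` are
hypothetical names used only for illustration.

```c
#include <stdint.h>

#define MSR_KVM_WALL_CLOCK		0x11
#define MSR_KVM_SYSTEM_TIME		0x12
#define MSR_KVM_WALL_CLOCK_NEW		0x4b564d00
#define MSR_KVM_SYSTEM_TIME_NEW		0x4b564d01

/* Hypothetical container for the chosen MSR indices. */
struct kvmclock_msrs {
	uint32_t system_time;
	uint32_t wall_clock;
};

/*
 * 'features' is EAX of cpuid leaf 0x40000001.
 * Returns 1 and fills *msrs if kvmclock is advertised, 0 otherwise.
 */
static int kvmclock_pick_msrs(uint32_t features, struct kvmclock_msrs *msrs)
{
	if (features & (1u << 3)) {		/* bit 3: new MSRs */
		msrs->system_time = MSR_KVM_SYSTEM_TIME_NEW;
		msrs->wall_clock = MSR_KVM_WALL_CLOCK_NEW;
		return 1;
	}
	if (features & (1u << 0)) {		/* bit 0: deprecated MSRs */
		msrs->system_time = MSR_KVM_SYSTEM_TIME;
		msrs->wall_clock = MSR_KVM_WALL_CLOCK;
		return 1;
	}
	return 0;
}
```

The new MSRs are preferred whenever bit 3 is set, even if bit 0 is set as
well.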

MSR_KVM_ASYNC_PF_EN:
	0x4b564d02

data:
	Asynchronous page fault (APF) control MSR.

	Bits 63-6 hold the 64-byte aligned physical address of a 64-byte memory
	area which must be in guest RAM and must be zeroed. This memory is
	expected to hold a copy of the following structure::

	  struct kvm_vcpu_pv_apf_data {
		/* Used for 'page not present' events delivered via #PF */
		__u32 flags;

		/* Used for 'page ready' events delivered via interrupt notification */
		__u32 token;

		__u8 pad[56];
		__u32 enabled;
	  };

	Bits 5-4 of the MSR are reserved and should be zero. Bit 0 is set to 1
	when asynchronous page faults are enabled on the vcpu, 0 when disabled.
	Bit 1 is 1 if asynchronous page faults can be injected when the vcpu is
	in cpl == 0. Bit 2 is 1 if asynchronous page faults are delivered to L1
	as #PF vmexits. Bit 2 can be set only if KVM_FEATURE_ASYNC_PF_VMEXIT is
	present in CPUID. Bit 3 enables interrupt based delivery of 'page ready'
	events. Bit 3 can only be set if KVM_FEATURE_ASYNC_PF_INT is present in
	CPUID.

	'Page not present' events are currently always delivered as a synthetic
	#PF exception. During delivery of these events the APF CR2 register
	contains a token that will be used to notify the guest when the missing
	page becomes available. Also, to make it possible to distinguish between
	a real #PF and an APF, the first 4 bytes of the 64-byte memory location
	('flags') will be written to by the hypervisor at the time of injection.
	Only the first bit of 'flags' is currently supported; when set, it
	indicates that the guest is dealing with an asynchronous 'page not
	present' event. If 'flags' is '0' during a page fault, it is a regular
	page fault. The guest is supposed to clear 'flags' when it is done
	handling the #PF exception so that the next event can be delivered.

	Note, since APF 'page not present' events use the same exception vector
	as regular page faults, the guest must reset 'flags' to '0' before it
	does something that can generate a normal page fault.

	Bytes 4-7 of the 64-byte memory location ('token') will be written to by
	the hypervisor at the time of APF 'page ready' event injection. The
	content of these bytes is a token which was previously delivered with a
	'page not present' event. The event indicates that the page is now
	available. The guest is supposed to write '0' to 'token' when it is done
	handling the 'page ready' event and to write '1' to MSR_KVM_ASYNC_PF_ACK
	after clearing the location; writing to the MSR forces KVM to re-scan
	its queue and deliver the next pending notification.

	Note, the MSR_KVM_ASYNC_PF_INT MSR, which specifies the interrupt vector
	for 'page ready' APF delivery, needs to be written to before enabling
	the APF mechanism in MSR_KVM_ASYNC_PF_EN, or interrupt #0 can get
	injected. The MSR is available if KVM_FEATURE_ASYNC_PF_INT is present in
	CPUID.

	Note, previously, 'page ready' events were delivered via the same #PF
	exception as 'page not present' events, but this is now deprecated. If
	bit 3 (interrupt based delivery) is not set, APF events are not
	delivered.

	If APF is disabled while there are outstanding APFs, they will
	not be delivered.

	Currently 'page ready' APF events will always be delivered on the
	same vcpu as the 'page not present' event was, but the guest should not
	rely on that.
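The bit layout of MSR_KVM_ASYNC_PF_EN can be illustrated with a small helper
that composes the value to write. This is only a sketch: the ``APF_*`` macro
names and ``apf_en_value`` are local, hypothetical names chosen for this
example, and returning 0 for invalid input is a design choice of the sketch,
not documented hypervisor behaviour.

```c
#include <stdint.h>

/* Control bits of MSR_KVM_ASYNC_PF_EN (local names for this example). */
#define APF_ENABLED			(1ull << 0)
#define APF_SEND_ALWAYS			(1ull << 1)	/* also at cpl == 0 */
#define APF_DELIVERY_AS_PF_VMEXIT	(1ull << 2)
#define APF_DELIVERY_AS_INT		(1ull << 3)

/*
 * Compose the MSR value from a guest physical address and control bits.
 * Returns 0 (a value that leaves APF disabled) if the address is not
 * 64-byte aligned or if 'flags' touches anything outside bits 3-0.
 */
static uint64_t apf_en_value(uint64_t gpa, uint64_t flags)
{
	if ((gpa & 0x3f) || (flags & ~0xfull))
		return 0;
	return gpa | flags;
}
```

A guest using interrupt based delivery would pass
``APF_ENABLED | APF_DELIVERY_AS_INT`` after checking the matching CPUID feature
bit and after programming MSR_KVM_ASYNC_PF_INT.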

MSR_KVM_STEAL_TIME:
	0x4b564d03

data:
	64-byte aligned physical address of a memory area which must be
	in guest RAM, plus an enable bit in bit 0. This memory is expected to
	hold a copy of the following structure::

	  struct kvm_steal_time {
		__u64 steal;
		__u32 version;
		__u32 flags;
		__u8  preempted;
		__u8  u8_pad[3];
		__u32 pad[11];
	  };

	whose data will be filled in by the hypervisor periodically. Only one
	write, or registration, is needed for each VCPU. The interval between
	updates of this structure is arbitrary and implementation-dependent.
	The hypervisor may update this structure at any time it sees fit until
	a value with bit 0 clear is written to the MSR. The guest is required to
	make sure this structure is initialized to zero.

	Fields have the following meanings:

	version:
		a sequence counter. In other words, the guest has to check
		this field before and after reading the time information and
		make sure both values are equal and even. An odd version
		indicates an in-progress update.

	flags:
		At this point, always zero. May be used to indicate
		changes in this structure in the future.

	steal:
		the amount of time in which this vCPU did not run, in
		nanoseconds. Time during which the vcpu is idle will not be
		reported as steal time.

	preempted:
		indicates whether the vCPU that owns this struct is running or
		not. Non-zero values mean the vCPU has been preempted. Zero
		means the vCPU is not preempted. NOTE: it is always zero if
		the hypervisor doesn't support this field.
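Reading 'steal' under the version protocol might look like the sketch below.
It is an illustration, not the guest kernel's implementation: the helper
``steal_time_read`` and its bounded retry count are hypothetical, fixed-width
types replace the __u* types, and the acquire fences (a GCC/Clang builtin) are
an assumed way of ordering the reads.

```c
#include <stdint.h>

struct kvm_steal_time {
	uint64_t steal;
	uint32_t version;
	uint32_t flags;
	uint8_t  preempted;
	uint8_t  u8_pad[3];
	uint32_t pad[11];
};

/* The layout above fills the 64-byte aligned area exactly. */
_Static_assert(sizeof(struct kvm_steal_time) == 64, "unexpected size");

/*
 * Read 'steal' consistently. Returns 0 and stores the value on success,
 * or -1 if no stable snapshot was seen within max_tries attempts.
 */
static int steal_time_read(const volatile struct kvm_steal_time *st,
			   unsigned int max_tries, uint64_t *steal_ns)
{
	while (max_tries--) {
		uint32_t v = st->version;
		uint64_t s;

		if (v & 1)		/* odd: update in progress */
			continue;
		__atomic_thread_fence(__ATOMIC_ACQUIRE);
		s = st->steal;
		__atomic_thread_fence(__ATOMIC_ACQUIRE);
		if (st->version == v) {	/* unchanged: value is consistent */
			*steal_ns = s;
			return 0;
		}
	}
	return -1;
}
```

The retry bound only keeps the example from spinning forever; a real guest
might simply loop until the read succeeds.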

MSR_KVM_EOI_EN:
	0x4b564d04

data:
	Bit 0 is 1 when PV end of interrupt is enabled on the vcpu; 0
	when disabled.  Bit 1 is reserved and must be zero.  When PV end of
	interrupt is enabled (bit 0 set), bits 63-2 hold a 4-byte aligned
	physical address of a 4-byte memory area which must be in guest RAM and
	must be zeroed.

	Bit 0 (the least significant bit) of the 4-byte memory location will be
	written to by the hypervisor, typically at the time of interrupt
	injection.  A value of 1 means that the guest can skip writing EOI to
	the APIC (using MSR or MMIO write); instead, it is sufficient to signal
	EOI by clearing the bit in guest memory - this location will
	later be polled by the hypervisor.
	A value of 0 means that the EOI write is required.

	It is always safe for the guest to ignore the optimization and perform
	the APIC EOI write anyway.

	The hypervisor is guaranteed to only modify this least
	significant bit while in the current VCPU context. This means that the
	guest does not need to use either a lock prefix or memory ordering
	primitives to synchronise with the hypervisor.

	However, the hypervisor can set and clear this memory bit at any time.
	To make sure the hypervisor does not interrupt the guest and clear the
	bit in the window between the guest testing it (to detect whether it
	can skip the APIC EOI write) and the guest clearing it (to signal EOI
	to the hypervisor), the guest must both read the bit and clear it
	using a single CPU instruction, such as test and clear, or compare and
	exchange.
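The combined test and clear of the PV EOI bit can be approximated in portable
C with a compiler builtin. This is a hedged sketch: ``pv_eoi_test_and_clear``
is a hypothetical name, and ``__atomic_fetch_and`` (a GCC/Clang builtin) is
used here as a stand-in for a single test-and-clear instruction. It is an
atomic read-modify-write and therefore stronger than strictly required, since
no lock prefix is needed for this protocol.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Read and clear bit 0 of the PV EOI word in one read-modify-write.
 * Returns true if the bit was set: the EOI has then been signalled by
 * clearing it and the APIC EOI write can be skipped. Returns false if
 * the caller still has to perform the regular APIC EOI write.
 */
static bool pv_eoi_test_and_clear(uint32_t *pv_eoi_word)
{
	uint32_t old = __atomic_fetch_and(pv_eoi_word, ~(uint32_t)1,
					  __ATOMIC_RELAXED);

	return old & 1;
}
```

A guest that does not want to depend on this optimization can ignore the word
and always perform the APIC EOI write.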

MSR_KVM_POLL_CONTROL:
	0x4b564d05

	Control host-side polling.

data:
	Bit 0 enables (1) or disables (0) host-side HLT polling logic.

	KVM guests can request the host not to poll on HLT, for example if
	they are performing polling themselves.

MSR_KVM_ASYNC_PF_INT:
	0x4b564d06

data:
	Second asynchronous page fault (APF) control MSR.

	Bits 0-7: APIC vector for delivery of 'page ready' APF events.
	Bits 8-63: Reserved.

	Interrupt vector for asynchronous 'page ready' notification delivery.
	The vector has to be set up before the asynchronous page fault mechanism
	is enabled in MSR_KVM_ASYNC_PF_EN.  The MSR is only available if
	KVM_FEATURE_ASYNC_PF_INT is present in CPUID.

MSR_KVM_ASYNC_PF_ACK:
	0x4b564d07

data:
	Asynchronous page fault (APF) acknowledgment.

	When the guest is done processing a 'page ready' APF event and the
	'token' field in 'struct kvm_vcpu_pv_apf_data' is cleared, it is
	supposed to write '1' to bit 0 of the MSR. This causes the host to
	re-scan its queue and check if there are more notifications pending.
	The MSR is available if KVM_FEATURE_ASYNC_PF_INT is present in CPUID.

MSR_KVM_MIGRATION_CONTROL:
	0x4b564d08

data:
	This MSR is available if KVM_FEATURE_MIGRATION_CONTROL is present in
	CPUID.  Bit 0 represents whether live migration of the guest is allowed.

	When a guest is started, bit 0 will be 0 if the guest has encrypted
	memory and 1 if the guest does not have encrypted memory.  If the
	guest is communicating page encryption status to the host using the
	``KVM_HC_MAP_GPA_RANGE`` hypercall, it can set bit 0 in this MSR to
	allow live migration of the guest.