Authors: Feng Wu <feng.wu@intel.com>

VT-d Posted-interrupt (PI) design for XEN

Important Definitions
=====================
VT-d posted-interrupts: posted-interrupts support in root-complex side
CPU-side posted-interrupts: posted-interrupts support in CPU side
IRTE: Interrupt Remapping Table Entry
Posted-interrupt Descriptor Address: the address of the posted-interrupt descriptor
Virtual Vector: the guest vector of the interrupt
URG: indicates if the interrupt is urgent

Posted-interrupt descriptor:
The Posted Interrupt Descriptor hosts the following fields:
Posted Interrupt Request (PIR): Provides storage for posting (recording) interrupts (one bit
per vector, for up to 256 vectors).

Outstanding Notification (ON): Indicates if there is a notification event outstanding (not
processed by processor or software) for this Posted Interrupt Descriptor. When this field is 0,
hardware modifies it from 0 to 1 when generating a notification event, and the entity receiving
the notification event (processor or software) resets it as part of posted interrupt processing.

Suppress Notification (SN): Indicates if a notification event is to be suppressed (not
generated) for non-urgent interrupt requests (interrupts processed through an IRTE with
URG=0).

Notification Vector (NV): Specifies the vector for the notification event (interrupt).

Notification Destination (NDST): Specifies the physical APIC-ID of the destination logical
processor for the notification event.
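
As an illustration of how the fields above interact, here is a minimal C sketch
of the posting step. It is a simplified software model, not the hardware
algorithm and not Xen code: 'pi_desc_model' and 'pi_post' are names made up for
this example, and a real descriptor is updated atomically rather than with plain
stores.

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified, non-atomic model of a posted-interrupt descriptor. */
struct pi_desc_model {
    uint32_t pir[8];   /* PIR: one bit per vector, 256 vectors in total */
    bool     on;       /* Outstanding Notification */
    bool     sn;       /* Suppress Notification */
    uint8_t  nv;       /* Notification Vector */
    uint32_t ndst;     /* Notification Destination (physical APIC ID) */
};

/*
 * Record 'vector' in PIR and decide whether a notification event is due.
 * Returns true when the poster should send vector 'nv' to APIC ID 'ndst'.
 */
static bool pi_post(struct pi_desc_model *d, uint8_t vector, bool urgent)
{
    d->pir[vector / 32] |= UINT32_C(1) << (vector % 32);

    /* A notification is already outstanding, so nothing more is sent. */
    if (d->on)
        return false;

    /* SN only suppresses notifications for non-urgent (URG=0) requests. */
    if (d->sn && !urgent)
        return false;

    d->on = true;
    return true;
}
```

In this model the interrupt is always recorded in PIR, and only the notification
is conditional on ON, SN and URG.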

Background
==========
With the development of virtualization, there are more and more device
assignment requirements. However, today when a VM is running with
assigned devices (such as a NIC), external interrupt handling for the assigned
devices always needs VMM intervention.

VT-d Posted-interrupt is an enhanced method to handle interrupts
in the virtualization environment. Interrupt posting is the process by
which an interrupt request is recorded in a memory-resident
posted-interrupt-descriptor structure by the root-complex or software,
followed by an optional notification event issued to the CPU.

With VT-d Posted-interrupt we get the following advantages:
- Direct delivery of external interrupts to running vCPUs without VMM
intervention
- Decreased interrupt migration complexity. On vCPU migration, software
can atomically co-migrate all interrupts targeting the migrating vCPU. For
virtual machines with assigned devices, migrating a vCPU across pCPUs
either incurs the overhead of forwarding interrupts in software (e.g. via VMM
generated IPIs), or the complexity of independently migrating each interrupt
targeting the vCPU to the new pCPU. However, after enabling VT-d PI, the
destination vCPU of an external interrupt from assigned devices is stored in the
IRTE (i.e. the Posted-interrupt Descriptor Address). When a vCPU is migrated to
another pCPU, we set this new pCPU in the 'NDST' field of the Posted-interrupt
descriptor, which makes the interrupt migration automatic.

Here is what Xen currently does for external interrupts from assigned devices:

When a VM is running and an external interrupt from an assigned device occurs
for it, a VM-Exit happens, then:

vmx_do_extint() --> do_IRQ() --> __do_IRQ_guest() --> hvm_do_IRQ_dpci() -->
raise_softirq_for(pirq_dpci) --> raise_softirq(HVM_DPCI_SOFTIRQ)

softirq HVM_DPCI_SOFTIRQ is bound to dpci_softirq()

dpci_softirq() --> hvm_dirq_assist() --> vmsi_deliver_pirq() --> vmsi_deliver() -->
vmsi_inj_irq() --> vlapic_set_irq()

vlapic_set_irq() does the following things:
1. If CPU-side posted-interrupt is supported, call vmx_deliver_posted_intr() to deliver
the virtual interrupt via the posted-interrupt infrastructure.
2. Otherwise, set the related vIRR in the vLAPIC page and call vcpu_kick() to kick
the related vCPU. Before VM-Entry, vmx_intr_assist() will help to inject the
interrupt into the guest.

However, once VT-d PI is supported, when a guest is running in non-root mode and an
external interrupt from an assigned device occurs for it, no VM-Exit is needed;
the guest can handle this entirely in non-root mode, thus avoiding all of the above
code flow.

Posted-interrupt Introduction
=============================
There are two components in the Posted-interrupt architecture:
Processor Support and Root-Complex Support

- Processor Support
Posted-interrupt processing is a feature by which a processor processes
the virtual interrupts by recording them as pending on the virtual-APIC
page.

Posted-interrupt processing is enabled by setting the "process posted
interrupts" VM-execution control. The processing is performed in response
to the arrival of an interrupt with the posted-interrupt notification vector.
In response to such an interrupt, the processor processes virtual interrupts
recorded in a data structure called a posted-interrupt descriptor.

For more information about APICv and CPU-side Posted-interrupt, please refer
to Chapter "APIC VIRTUALIZATION AND VIRTUAL INTERRUPTS", and Section
"POSTED-INTERRUPT PROCESSING" in the Intel SDM:
http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-manual-325462.pdf

- Root-Complex Support
Interrupt posting is the process by which an interrupt request (from IOAPIC
or MSI/MSIx capable sources) is recorded in a memory-resident
posted-interrupt-descriptor structure by the root-complex, followed by
an optional notification event issued to the CPU complex. The interrupt
request arriving at the root-complex carries the identity of the interrupt
request source and a 'remapping-index'. The remapping-index is used to
look up an entry in the memory-resident interrupt-remap-table. Unlike
interrupt-remapping, the interrupt-remap-table-entry for a posted-interrupt
specifies a virtual-vector and a pointer to the posted-interrupt descriptor.
The virtual-vector specifies the vector of the interrupt to be recorded in
the posted-interrupt descriptor. The posted-interrupt descriptor hosts storage
for the virtual-vectors and contains the attributes of the notification event
(interrupt) to be issued to the CPU complex to inform CPU/software about pending
interrupts recorded in the posted-interrupt descriptor.

For more information about VT-d PI, please refer to
http://www.intel.com/content/www/us/en/intelligent-systems/intel-technology/vt-directed-io-spec.html

Design Overview
===============
In this design, we will cover the following items:
1. Add a variable to control whether to enable VT-d posted-interrupt or not.
2. VT-d PI feature detection.
3. Extend posted-interrupt descriptor structure to cover VT-d PI specific items.
4. Extend IRTE structure to support VT-d PI.
5. Introduce a new global vector which is used for waking up the blocked vCPU.
6. Update IRTE when guest modifies the interrupt configuration (MSI/MSIx configuration).
7. Update posted-interrupt descriptor during vCPU scheduling.
8. How to wake up a blocked vCPU when an interrupt is posted for it (wakeup notification handler).
9. New boot command line option for Xen, which lets the user control the VT-d PI feature.
10. Multicast/broadcast and lowest priority interrupts consideration.


Implementation details
======================
- New variable to control VT-d PI

Like the variable 'iommu_intremap' for interrupt remapping, it is very straightforward
to add a new one, 'iommu_intpost', for posted-interrupt. 'iommu_intpost' is set
only when interrupt remapping and VT-d posted-interrupt are both enabled.

- VT-d PI feature detection.
Bit 59 in the VT-d Capability Register is used to report VT-d Posted-interrupt support.
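
The two items above can be sketched as follows. This is only an illustration
under stated assumptions: 'cap_intr_post' and 'compute_intpost' are names chosen
for this example, 'requested' stands for the user's choice, and a single
capability register value is considered.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit 59 of the VT-d Capability Register reports posted-interrupt support. */
#define CAP_PI_BIT 59

static bool cap_intr_post(uint64_t cap)
{
    return (cap >> CAP_PI_BIT) & 1;
}

/*
 * Hypothetical helper computing the value of 'iommu_intpost'.
 * 'intremap' models 'iommu_intremap'; posting is only usable when
 * interrupt remapping is on and the hardware reports PI support.
 */
static bool compute_intpost(bool requested, bool intremap, uint64_t cap)
{
    return requested && intremap && cap_intr_post(cap);
}
```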

- Extend posted-interrupt descriptor structure to cover VT-d PI specific items.
Here is the new structure for posted-interrupt descriptor:

struct pi_desc {
    DECLARE_BITMAP(pir, NR_VECTORS); /* bit 255:0 - Posted Interrupt Requests */
    union {
        struct
        {
        u16 on     : 1,  /* bit 256 - Outstanding Notification */
            sn     : 1,  /* bit 257 - Suppress Notification */
            rsvd_1 : 14; /* bit 271:258 - Reserved */
        u8  nv;          /* bit 279:272 - Notification Vector */
        u8  rsvd_2;      /* bit 287:280 - Reserved */
        u32 ndst;        /* bit 319:288 - Notification Destination */
        };
        u64 control;     /* bit 319:256 - the fields above as one word */
    };
    u32 rsvd[6];         /* bit 511:320 - Reserved */
} __attribute__ ((aligned (64)));
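
To make the layout above concrete, the following standalone sketch restates the
structure with standard C types and checks its size and field positions. It
assumes a little-endian target and GCC/Clang bit-field layout; the 'ctrl_*'
helpers are invented for this example, and 'pir' is declared as a plain array in
place of DECLARE_BITMAP().

```c
#include <stddef.h>
#include <stdint.h>

#define NR_VECTORS 256

struct pi_desc {
    uint32_t pir[NR_VECTORS / 32];  /* stand-in for DECLARE_BITMAP() */
    union {
        struct {
            uint16_t on     : 1,
                     sn     : 1,
                     rsvd_1 : 14;
            uint8_t  nv;
            uint8_t  rsvd_2;
            uint32_t ndst;
        };
        uint64_t control;
    };
    uint32_t rsvd[6];
} __attribute__ ((aligned (64)));

/* The descriptor is exactly one 64-byte unit; control starts at bit 256. */
_Static_assert(sizeof(struct pi_desc) == 64, "descriptor must be 64 bytes");
_Static_assert(offsetof(struct pi_desc, control) == 32, "control follows PIR");

/* Decode the raw control word using the documented bit positions. */
static unsigned int ctrl_on(uint64_t c)   { return c & 1; }
static unsigned int ctrl_sn(uint64_t c)   { return (c >> 1) & 1; }
static unsigned int ctrl_nv(uint64_t c)   { return (c >> 16) & 0xff; }
static uint32_t     ctrl_ndst(uint64_t c) { return (uint32_t)(c >> 32); }
```

Having 'control' overlay all the non-PIR fields allows them to be read or
replaced as a single 64-bit value.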

- Extend IRTE structure to support VT-d PI.

Here is the new structure for IRTE:
/* interrupt remap entry */
struct iremap_entry {
  union {
    struct { u64 lo, hi; };
    /* remapped format */
    struct {
        u16 p       : 1,
            fpd     : 1,
            dm      : 1,
            rh      : 1,
            tm      : 1,
            dlm     : 3,
            avail   : 4,
            res_1   : 4;
        u8  vector;
        u8  res_2;
        u32 dst;
        u16 sid;
        u16 sq      : 2,
            svt     : 2,
            res_3   : 12;
        u32 res_4   : 32;
    } remap;
    /* posted format */
    struct {
        u16 p       : 1,
            fpd     : 1,
            res_1   : 6,
            avail   : 4,
            res_2   : 2,
            urg     : 1,  /* urgent */
            im      : 1;  /* IRTE mode: set for posted format */
        u8  vector;       /* virtual vector */
        u8  res_3;
        u32 res_4   : 6,
            pda_l   : 26; /* descriptor address, bits 31:6 */
        u16 sid;
        u16 sq      : 2,
            svt     : 2,
            res_5   : 12;
        u32 pda_h;        /* descriptor address, bits 63:32 */
    } post;
  };
};
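
Because the posted-interrupt descriptor is 64-byte aligned, the low 6 bits of
its address are always zero and do not need to be stored. The sketch below shows
one way to split an address into the 'pda_l' and 'pda_h' fields and to join them
again. The 'pda_fields' type and both helper names are made up for this example.

```c
#include <stdint.h>

/* The two address fields of a posted-format IRTE. */
struct pda_fields {
    uint32_t pda_l;  /* descriptor address bits 31:6 (26 bits) */
    uint32_t pda_h;  /* descriptor address bits 63:32 */
};

/* Split a 64-byte aligned descriptor address into the IRTE fields. */
static struct pda_fields pda_split(uint64_t addr)
{
    struct pda_fields f;

    f.pda_l = (uint32_t)((addr >> 6) & 0x3ffffff);
    f.pda_h = (uint32_t)(addr >> 32);
    return f;
}

/* Rebuild the descriptor address from the IRTE fields. */
static uint64_t pda_join(struct pda_fields f)
{
    return ((uint64_t)f.pda_h << 32) | ((uint64_t)f.pda_l << 6);
}
```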

- Introduce a new global vector which is used to wake up the blocked vCPU.

Currently, there is a global vector 'posted_intr_vector', which is used as the
global notification vector for all vCPUs in the system. This vector is stored in
the VMCS and the CPU considers it a _special_ vector, using it to notify the
related pCPU when an interrupt is recorded in the posted-interrupt descriptor.

The CPU handles this existing global vector in a _special_ way compared to
normal vectors; please refer to 29.6 in the Intel SDM
http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-manual-325462.pdf
for more information about how the CPU handles it.

With VT-d PI, the VT-d engine can issue a notification event when the
assigned devices issue interrupts. We need to add a new global vector to
wake up the blocked vCPU; please refer to the later section in this design for
how this new global vector is used.

- Update IRTE when guest modifies the interrupt configuration (MSI/MSIx configuration).
After VT-d PI is introduced, the format of the IRTE is changed as follows:
	Descriptor Address: the address of the posted-interrupt descriptor
	Virtual Vector: the guest vector of the interrupt
	URG: indicates if the interrupt is urgent
	Other fields continue to have the same meaning

'Descriptor Address' identifies the destination vCPU of this interrupt, since
each vCPU has a dedicated posted-interrupt descriptor.

'Virtual Vector' gives the guest vector of the interrupt.

When the guest changes the configuration of an interrupt, such as the
CPU affinity or the vector, we need to update the associated IRTE accordingly.

- Update posted-interrupt descriptor during vCPU scheduling

The basic idea here is:
1. When vCPU is running
        - Set 'NV' to 'posted_intr_vector'.
        - Clear 'SN' to accept posted-interrupts.
        - Set 'NDST' to the pCPU on which the vCPU will be running.
2. When vCPU is blocked
        - Set 'NV' to 'pi_wakeup_vector', so we can wake up the
          related vCPU when a posted-interrupt happens for it.
          Please refer to the above section about the new global vector.
        - Clear 'SN' to accept posted-interrupts.
3. When vCPU is preempted or sleeping
        - Set 'SN' to suppress non-urgent interrupts
          (currently, we only support non-urgent interrupts).
          When a vCPU is preempted or sleeping, it doesn't need to accept
          posted-interrupt notification events: since we don't change the
          behavior of the scheduler when the interrupt occurs, we still need
          to wait for the next scheduling of the vCPU. When external interrupts
          from assigned devices occur, the interrupts are recorded in PIR, and
          will be synced to IRR before VM-Entry.
        - Set 'NV' to 'posted_intr_vector'.
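
The three cases above can be summarized in a short C sketch. It is a model of
the rules only, not Xen code: 'pi_ctrl', 'vcpu_state' and 'pi_update_for_state'
are hypothetical names, the two vector numbers are arbitrary placeholders, and
plain stores are used where a real implementation would have to update the
descriptor atomically because hardware can modify it concurrently.

```c
#include <stdbool.h>
#include <stdint.h>

/* Placeholder vector numbers, chosen only for this sketch. */
#define POSTED_INTR_VECTOR 0xf2
#define PI_WAKEUP_VECTOR   0xf1

enum vcpu_state { VCPU_RUNNING, VCPU_BLOCKED, VCPU_PREEMPTED };

/* The descriptor fields that scheduling touches. */
struct pi_ctrl {
    bool     sn;
    uint8_t  nv;
    uint32_t ndst;
};

/* Apply the per-state rules; 'dest' is only used for VCPU_RUNNING. */
static void pi_update_for_state(struct pi_ctrl *pi, enum vcpu_state s,
                                uint32_t dest)
{
    switch (s) {
    case VCPU_RUNNING:
        pi->nv = POSTED_INTR_VECTOR;
        pi->sn = false;          /* accept notifications */
        pi->ndst = dest;         /* pCPU the vCPU will run on */
        break;
    case VCPU_BLOCKED:
        pi->nv = PI_WAKEUP_VECTOR;
        pi->sn = false;          /* a notification must wake the vCPU */
        break;
    case VCPU_PREEMPTED:
        pi->sn = true;           /* interrupts still land in PIR */
        pi->nv = POSTED_INTR_VECTOR;
        break;
    }
}
```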

- How to wake up a blocked vCPU when an interrupt is posted for it (wakeup notification handler).

Here is the scenario for the usage of the new global vector:

1. vCPU0 is running on pCPU0
2. vCPU0 is blocked and vCPU1 is currently running on pCPU0
3. An external interrupt from an assigned device occurs for vCPU0. If we
still use 'posted_intr_vector' as the notification vector for vCPU0, the
notification event for vCPU0 (the event will go to pCPU0) will be consumed
by vCPU1 incorrectly (remember this is a special vector to the CPU). The worst
case is that vCPU0 will never be woken up again since the wakeup event
for it is always consumed by other vCPUs incorrectly. So we need to introduce
another global vector, named 'pi_wakeup_vector', to wake up the blocked vCPU.

After using 'pi_wakeup_vector' for vCPU0, the VT-d engine will issue the
notification event using this new vector. This new vector is not a SPECIAL one
to the CPU; it is just a normal vector. The CPU simply receives a normal
external interrupt, so we get control in the handler of this new vector. In this
handler, the hypervisor can take action, such as waking up the blocked vCPU.

Here is what we do for the blocked vCPU:
1. Define a per-cpu list 'pi_blocked_vcpu', which stores the blocked
vCPUs on the pCPU.
2. When the vCPU is going to block, insert the vCPU
into the per-cpu list belonging to the pCPU it was running on.
3. When the vCPU is unblocked, remove the vCPU from the related pCPU list.

In the handler of 'pi_wakeup_vector', we do:
1. Get the physical CPU.
2. Iterate over the list 'pi_blocked_vcpu' of the current pCPU; if 'ON' is set,
we unblock the associated vCPU.
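
The handler logic can be sketched as a scan over one pCPU's list. This is a
single-threaded model under stated assumptions: 'blocked_vcpu' and
'pi_wakeup_scan' are invented names, the list is a plain singly linked list
without the locking a real per-cpu list needs, and removal from the list is
folded into the scan for brevity.

```c
#include <stdbool.h>
#include <stddef.h>

/* One entry of a pCPU's 'pi_blocked_vcpu' list (simplified). */
struct blocked_vcpu {
    int  id;
    bool on;        /* models the ON bit of the vCPU's descriptor */
    bool blocked;
    struct blocked_vcpu *next;
};

/*
 * Walk the list of the current pCPU. Every vCPU whose ON bit is set is
 * unblocked and unlinked. Returns the number of vCPUs woken up.
 */
static unsigned int pi_wakeup_scan(struct blocked_vcpu **head)
{
    unsigned int woken = 0;
    struct blocked_vcpu **pp = head;

    while (*pp != NULL) {
        struct blocked_vcpu *v = *pp;

        if (v->on) {
            v->blocked = false;  /* unblock the vCPU */
            *pp = v->next;       /* remove it from the pCPU list */
            v->next = NULL;
            woken++;
        } else {
            pp = &v->next;
        }
    }
    return woken;
}
```

vCPUs whose ON bit is clear stay on the list, so a later notification on the
same pCPU can still find them.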

When the vCPU is blocked, we change the posted-interrupt descriptor and
put the vCPU in the pCPU's blocking list. We don't change the status of the
posted-interrupt descriptor back when the vCPU is unblocked or when the blocking
operation directly returns because there are events to be delivered. Instead,
we do it right before VM-Entry.

- New boot command line for Xen, which controls VT-d PI feature by user.

Like 'intremap' for interrupt remapping, we add a new boot command line
option 'intpost' for posted-interrupts.
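
As a rough illustration, the sketch below parses a comma-separated option string
containing 'intremap' and 'intpost'. The exact syntax is an assumption made for
this example (in particular the optional "no-" prefix), as are the names
'iommu_opts' and 'parse_iommu_opts'; it is not Xen's actual parser.

```c
#include <stdbool.h>
#include <string.h>

struct iommu_opts {
    bool intremap;
    bool intpost;
};

/* Parse e.g. "intpost" or "no-intremap,intpost"; unknown tokens are skipped. */
static void parse_iommu_opts(const char *s, struct iommu_opts *o)
{
    while (*s != '\0') {
        size_t len = strcspn(s, ",");   /* length of the current token */
        const char *tok = s;
        size_t tlen = len;
        bool val = true;

        /* Assumed convention: a "no-" prefix turns the option off. */
        if (tlen > 3 && strncmp(tok, "no-", 3) == 0) {
            val = false;
            tok += 3;
            tlen -= 3;
        }

        if (tlen == 8 && strncmp(tok, "intremap", 8) == 0)
            o->intremap = val;
        else if (tlen == 7 && strncmp(tok, "intpost", 7) == 0)
            o->intpost = val;

        s += len;
        if (*s == ',')
            s++;
    }
}
```

The parsed 'intpost' value is only a request; as described earlier, posting is
enabled only if interrupt remapping is also enabled and the hardware supports it.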

- Multicast/broadcast and lowest priority interrupts consideration.

With VT-d PI, the destination vCPU information of an external interrupt
from assigned devices is stored in the IRTE, which leads to the following
design considerations:
1. Multicast/broadcast interrupts cannot be posted.
2. For lowest-priority interrupts, new Intel CPUs/chipsets/root-complexes
(starting from Nehalem) ignore the TPR value, and instead support two other
ways (configurable by BIOS) to handle lowest priority interrupts:
	A) Round robin: In this method, the chipset simply delivers lowest priority
interrupts in a round-robin manner across all the available logical CPUs. While
this provides good load balancing, it is not always the best thing to do, as
interrupts from the same device (like a NIC) will start running on all the CPUs,
thrashing caches and taking locks. This led to the next scheme.
	B) Vector hashing: In this method, hardware applies a hash function
to the vector value in the interrupt request, and uses that hash to pick a logical
CPU to route the lowest priority interrupt to. This way, a given vector always goes
to the same logical CPU, avoiding the thrashing problem above.

So, the gist of the above is that lowest priority interrupts have never been
delivered as "lowest priority" in physical hardware.

Vector hashing is used in this design.
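
A minimal sketch of vector hashing over a set of candidate vCPUs follows. The
hash used here (vector modulo the number of candidates) is an assumption chosen
for simplicity, since this document does not specify the hash function, and
'vector_hash_dest' is a made-up name.

```c
#include <stdint.h>

/*
 * Pick one destination among the candidates in 'dest_mask' (bit i set means
 * vCPU i is a candidate). Returns the chosen vCPU index, or -1 if the mask
 * is empty. The same vector always maps to the same vCPU for a given mask.
 */
static int vector_hash_dest(uint8_t vector, uint32_t dest_mask)
{
    unsigned int n = 0, target;

    /* Count the candidates. */
    for (uint32_t m = dest_mask; m != 0; m &= m - 1)
        n++;
    if (n == 0)
        return -1;

    /* Assumed hash: select the (vector mod n)-th candidate. */
    target = vector % n;
    for (int i = 0; i < 32; i++) {
        if (dest_mask & (UINT32_C(1) << i)) {
            if (target == 0)
                return i;
            target--;
        }
    }
    return -1;
}
```

Once a single destination vCPU has been chosen this way, the interrupt can be
posted to that vCPU's descriptor like any other single-destination interrupt.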