=encoding utf8

=head1 NAME

xl-numa-placement - Guest Automatic NUMA Placement in libxl and xl

=head1 DESCRIPTION

=head2 Rationale

NUMA (which stands for Non-Uniform Memory Access) means that the memory
access time of a program running on a CPU depends on the relative
distance between that CPU and that memory. In fact, most NUMA
systems are built in such a way that each processor has its local memory,
on which it can operate very fast. On the other hand, getting and storing
data from and on remote memory (that is, memory local to some other processor)
is considerably more complex and slower. On these machines, a NUMA node is
usually defined as a set of processor cores (typically a physical CPU package)
and the memory directly attached to that set of cores.

NUMA awareness becomes very important as soon as many domains start
running memory-intensive workloads on a shared host. In fact, the cost
of accessing non node-local memory locations is very high, and the
performance degradation is likely to be noticeable.

For more information, have a look at the L<Xen NUMA Introduction|https://wiki.xenproject.org/wiki/Xen_on_NUMA_Machines>
page on the Wiki.


=head2 Xen and NUMA machines: the concept of I<node-affinity>

The Xen hypervisor deals with NUMA machines through the concept of
I<node-affinity>. The node-affinity of a domain is the set of NUMA nodes
of the host where the memory for the domain is being allocated (mostly,
at domain creation time). This is, at least in principle, different from
and unrelated to the vCPU (hard and soft, see below) scheduling affinity,
which instead is the set of pCPUs where the vCPU is allowed (or prefers)
to run.

Of course, despite the fact that they belong to and affect different
subsystems, the domain node-affinity and the vCPUs' affinity are not
completely independent.
In fact, if the domain node-affinity is not explicitly specified by the
user, via the proper libxl calls or xl config item, it will be computed
based on the vCPUs' scheduling affinity.

Notice that, even if the node-affinity of a domain may change on-line,
it is very important to "place" the domain correctly when it is first
created, as most of its memory is allocated at that time and cannot
(for now) be moved easily.


=head2 Placing via pinning and cpupools

The simplest way of placing a domain on a NUMA node is setting the hard
scheduling affinity of the domain's vCPUs to the pCPUs of the node. This
also goes under the name of vCPU pinning, and can be done through the
"cpus=" option in the config file (more about this below). Another option
is to pool together the pCPUs spanning the node and put the domain in
such a I<cpupool> with the "pool=" config option (as documented in our
L<Wiki|https://wiki.xenproject.org/wiki/Cpupools_Howto>).

In both the above cases, the domain will not be able to execute outside
the specified set of pCPUs for any reason, even if all those pCPUs are
busy doing something else while there are other, idle pCPUs.

So, when doing this, local memory accesses are 100% guaranteed, but that
may come at the cost of some load imbalances.
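As a minimal sketch of the two approaches above, assume a hypothetical
host where node 0 contains pCPUs 0-7, and where a cpupool named
C<Pool-node0> has already been created over those pCPUs. Both the pCPU
range and the pool name are assumptions made for illustration only; the
actual topology of a host can be inspected with C<xl info -n>. A guest
config file would then contain one of the following two fragments:

    # Option 1: pin all the vCPUs of the domain to the pCPUs of node 0
    cpus = "0-7"

    # Option 2: put the domain in a cpupool spanning node 0
    pool = "Pool-node0"

In either case, the vCPUs of the guest never run on pCPUs outside of
node 0, even when those pCPUs are idle.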


=head2 NUMA aware scheduling

If using the credit1 scheduler, and starting from Xen 4.3, the scheduler
itself always tries to run the domain's vCPUs on one of the nodes in
its node-affinity. Only if that turns out to be impossible will it
pick any free pCPU. Locality of access is less guaranteed than in the
pinning case, but that comes along with better chances to exploit all
the host resources (e.g., the pCPUs).

Starting from Xen 4.5, credit1 supports two forms of affinity: hard and
soft, both on a per-vCPU basis. This means each vCPU can have its own
soft affinity, stating where that vCPU prefers to execute. This is
less strict than what (also starting from 4.5) is called hard affinity,
as the vCPU can potentially run everywhere; it just prefers some pCPUs
rather than others.
In Xen 4.5, therefore, NUMA-aware scheduling is achieved by matching the
soft affinity of the vCPUs of a domain with its node-affinity.

In fact, as it was for 4.3, if all the pCPUs in a vCPU's soft affinity
are busy, it is possible for the domain to run outside of them. The
idea is that slower execution (due to remote memory accesses) is still
better than no execution at all (as it would happen with pinning). For
this reason, NUMA aware scheduling has the potential of bringing
substantial performance benefits, although this will depend on the
workload.

Notice that, for each vCPU, the following three scenarios are possible:

=over

=item *

a vCPU I<is pinned> to some pCPUs and I<does not have> any soft affinity.
In this case, the vCPU is always scheduled on one of the pCPUs to which
it is pinned, without any specific preference among them;


=item *

a vCPU I<has> its own soft affinity and I<is not> pinned to any particular
pCPU. In this case, the vCPU can run on every pCPU. Nevertheless, the
scheduler will try to have it running on one of the pCPUs in its soft
affinity;


=item *

a vCPU I<has> its own soft affinity and I<is also> pinned to some
pCPUs. In this case, the vCPU is always scheduled on one of the pCPUs
onto which it is pinned, with, among them, a preference for the ones
that also form its soft affinity. In case pinning and soft affinity
form two disjoint sets of pCPUs, pinning "wins", and the soft affinity
is just ignored.


=back
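The three scenarios can be sketched with C<xl vcpu-pin>, assuming Xen 4.5
or later, where the command takes a hard affinity followed by a soft
affinity. The domain name C<guest1> and the pCPU ranges are hypothetical
and only serve as an illustration:

    # Scenario 1: pinned to pCPUs 0-7, no specific soft affinity
    xl vcpu-pin guest1 all 0-7 all

    # Scenario 2: not pinned, soft affinity on pCPUs 0-7
    xl vcpu-pin guest1 all all 0-7

    # Scenario 3: pinned to pCPUs 0-7, with a preference for pCPUs 0-3
    xl vcpu-pin guest1 all 0-7 0-3

The resulting hard and soft affinity of each vCPU can then be checked
with C<xl vcpu-list guest1>.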


=head2 Guest placement in xl

If using xl for creating and managing guests, it is very easy to ask for
either manual or automatic placement of them across the host's NUMA nodes.

Note that xm/xend does a very similar thing, the only differences being
the details of the heuristics adopted for automatic placement (see below),
and the lack of support (in both xm/xend and the Xen versions where that
was the default toolstack) for NUMA aware scheduling.


=head2 Placing the guest manually

Thanks to the "cpus=" option, it is possible to specify where a domain
should be created and scheduled, directly in its config file. This
affects NUMA placement and memory accesses as, in this case, the
hypervisor constructs the node-affinity of a VM based directly on its
vCPU pinning when it is created.

This is very simple and effective, but requires the user/system
administrator to explicitly specify the pinning for each and every domain,
or Xen won't be able to guarantee the locality for their memory accesses.

That, of course, also means the vCPUs of the domain will only be able to
execute on those same pCPUs.

It is also possible to have a "cpus_soft=" option in the xl config file,
to specify the soft affinity for all the vCPUs of the domain. This affects
the NUMA placement in the following way:

=over

=item *

if only "cpus_soft=" is present, the VM's node-affinity will be equal
to the nodes to which the pCPUs in the soft affinity mask belong;


=item *

if both "cpus_soft=" and "cpus=" are present, the VM's node-affinity
will be equal to the nodes to which the pCPUs present both in hard and
soft affinity belong.


=back
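To make the two rules above concrete, assume a hypothetical host with
two nodes, where node 0 contains pCPUs 0-3 and node 1 contains pCPUs 4-7
(this topology is an assumption made for illustration only). The two
alternative config fragments below would then result in the node-affinity
stated in the comments:

    # Fragment A: only soft affinity. pCPUs 2-5 belong to nodes 0 and 1,
    # so the node-affinity is {0, 1}
    cpus_soft = "2-5"

    # Fragment B: both hard and soft affinity. The pCPUs present in both
    # masks are 2-3, which belong to node 0, so the node-affinity is {0}
    cpus      = "0-3"
    cpus_soft = "2-5"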


=head2 Placing the guest automatically

If neither "cpus=" nor "cpus_soft=" is present in the config file, libxl
tries to figure out on its own on which node(s) the domain could fit best.
If it finds one (some), the domain's node-affinity is set to it (them),
and both memory allocations and NUMA aware scheduling (for the credit
scheduler and starting from Xen 4.3) will comply with it. Starting from
Xen 4.5, this also means that the mask resulting from this "fitting"
procedure will become the soft affinity of all the vCPUs of the domain.
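One way to check the outcome of automatic placement, sketched here for a
hypothetical domain called C<guest1> whose config file contains neither
"cpus=" nor "cpus_soft=", is to look at what xl reports once the domain
is running:

    # Also print the NUMA node-affinity of each domain
    xl list -n

    # Print the hard and soft affinity of each vCPU of the domain
    xl vcpu-list guest1

On Xen 4.5 and later, the soft affinity shown for the vCPUs is expected
to cover the pCPUs of the node(s) chosen by the placement algorithm.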

It is worthwhile noting that optimally fitting a set of VMs on the NUMA
nodes of a host is an incarnation of the Bin Packing Problem. In fact,
the various VMs with different memory sizes are the items to be packed,
and the host nodes are the bins. As this problem is known to be NP-hard,
we will be using some heuristics.

The first thing to do is find the nodes or the sets of nodes (from now
on referred to as 'candidates') that have enough free memory and enough
physical CPUs for accommodating the new domain. The idea is to find a
spot for the domain with at least as much free memory as it is configured
to have, and as many pCPUs as it has vCPUs. After that, the actual
decision on which candidate to pick happens according to the following
heuristics:

=over

=item *

candidates involving fewer nodes are considered better. In case
two (or more) candidates span the same number of nodes,


=item *

candidates with a smaller number of vCPUs runnable on them (due
to previous placement and/or plain vCPU pinning) are considered
better. In case the same number of vCPUs can run on two (or more)
candidates,


=item *

the candidate with the greatest amount of free memory is
considered to be the best one.


=back

Giving preference to candidates with fewer nodes ensures better
performance for the guest, as it avoids spreading its memory among
different nodes. Favoring candidates with fewer vCPUs already runnable
there ensures a good balance of the overall host load. Finally, if several
candidates fulfil these criteria, prioritizing the nodes that have the
largest amounts of free memory helps keep the memory fragmentation
small, and maximizes the probability of being able to put more domains
there.
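As a worked illustration of the three criteria, consider a hypothetical
guest with 4 vCPUs and 4096 MB of memory, on a host where every node has
8 pCPUs. All the figures below are made up for the sake of the example,
and the criteria are applied strictly in the order in which they are
listed above, which is a simplification: the comparison actually
implemented in libxl may combine them differently.

    candidate   nodes   vCPUs already runnable   free memory
    {0}           1               8                6000 MB
    {1}           1               4                5000 MB
    {2}           1               4                9000 MB
    {1,2}         2               8               14000 MB

All four candidates have enough free memory and enough pCPUs. Candidate
{1,2} is discarded first, as it spans two nodes while the others span
only one. Among the single-node candidates, {0} loses because more vCPUs
are already runnable on it. Candidates {1} and {2} are tied on that
criterion, so {2} is picked, as it has the greatest amount of free memory.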


=head2 Guest placement in libxl

xl achieves automatic NUMA placement because that is what libxl does
by default. No API is provided (yet) for modifying the behaviour of
the placement algorithm. However, if your program is calling libxl,
it is possible to set the C<numa_placement> build info key to C<false>
(it is C<true> by default) with something like the below, to prevent
any placement from happening:

    libxl_defbool_set(&domain_build_info->numa_placement, false);

Also, if C<numa_placement> is set to C<true>, the domain's vCPUs must
not be pinned (i.e., C<<< domain_build_info->cpumap >>> must have all its
bits set, as it is by default), or domain creation will fail with
C<ERROR_INVAL>.

Starting from Xen 4.3, in case automatic placement happens (and is
successful), it will affect the domain's node-affinity and I<not> its
vCPU pinning. Namely, the domain's vCPUs will not be pinned to any
pCPU on the host, but the memory for the domain will come from the
selected node(s) and the NUMA aware scheduling (if the credit scheduler
is in use) will try to keep the domain's vCPUs there as much as possible.

Besides that, to look at and/or tweak the placement algorithm,
search for "Automatic NUMA placement" in libxl_internal.h.

Note this may change in future versions of Xen/libxl.


=head2 Xen < 4.5

The concept of vCPU soft affinity was introduced for the first time
in Xen 4.5. In 4.3, it is the domain's node-affinity that drives the
NUMA-aware scheduler. The main difference is that soft affinity is
per-vCPU, and so each vCPU can have its own mask of pCPUs, while
node-affinity is per-domain, which is the equivalent of having all the
vCPUs with the same soft affinity.


=head2 Xen < 4.3

As NUMA aware scheduling is a new feature of Xen 4.3, things are a little
bit different for earlier versions of Xen. If no "cpus=" option is specified
and Xen 4.2 is in use, the automatic placement algorithm still runs, but
the result is used to I<pin> the vCPUs of the domain to the output node(s).
This is consistent with what was happening with xm/xend.

On a version of Xen earlier than 4.2, there is no automatic placement at
all in xl or libxl, and hence no node-affinity, vCPU affinity or pinning
is introduced or modified.


=head2 Limitations

Analyzing various possible placement solutions is what makes the
algorithm flexible and quite effective. However, that also means
it won't scale well to systems with an arbitrary number of nodes.
For this reason, automatic placement is disabled (with a warning)
if it is requested on a host with more than 16 NUMA nodes.