1========================= 2Capacity Aware Scheduling 3========================= 4 51. CPU Capacity 6=============== 7 81.1 Introduction 9---------------- 10 11Conventional, homogeneous SMP platforms are composed of purely identical 12CPUs. Heterogeneous platforms on the other hand are composed of CPUs with 13different performance characteristics - on such platforms, not all CPUs can be 14considered equal. 15 16CPU capacity is a measure of the performance a CPU can reach, normalized against 17the most performant CPU in the system. Heterogeneous systems are also called 18asymmetric CPU capacity systems, as they contain CPUs of different capacities. 19 20Disparity in maximum attainable performance (IOW in maximum CPU capacity) stems 21from two factors: 22 23- not all CPUs may have the same microarchitecture (µarch). 24- with Dynamic Voltage and Frequency Scaling (DVFS), not all CPUs may be 25 physically able to attain the higher Operating Performance Points (OPP). 26 27Arm big.LITTLE systems are an example of both. The big CPUs are more 28performance-oriented than the LITTLE ones (more pipeline stages, bigger caches, 29smarter predictors, etc), and can usually reach higher OPPs than the LITTLE ones 30can. 31 32CPU performance is usually expressed in Millions of Instructions Per Second 33(MIPS), which can also be expressed as a given amount of instructions attainable 34per Hz, leading to:: 35 36 capacity(cpu) = work_per_hz(cpu) * max_freq(cpu) 37 381.2 Scheduler terms 39------------------- 40 41Two different capacity values are used within the scheduler. A CPU's 42``capacity_orig`` is its maximum attainable capacity, i.e. its maximum 43attainable performance level. A CPU's ``capacity`` is its ``capacity_orig`` to 44which some loss of available performance (e.g. time spent handling IRQs) is 45subtracted. 46 47Note that a CPU's ``capacity`` is solely intended to be used by the CFS class, 48while ``capacity_orig`` is class-agnostic. The rest of this document will use 49the term ``capacity`` interchangeably with ``capacity_orig`` for the sake of 50brevity. 51 521.3 Platform examples 53--------------------- 54 551.3.1 Identical OPPs 56~~~~~~~~~~~~~~~~~~~~ 57 58Consider an hypothetical dual-core asymmetric CPU capacity system where 59 60- work_per_hz(CPU0) = W 61- work_per_hz(CPU1) = W/2 62- all CPUs are running at the same fixed frequency 63 64By the above definition of capacity: 65 66- capacity(CPU0) = C 67- capacity(CPU1) = C/2 68 69To draw the parallel with Arm big.LITTLE, CPU0 would be a big while CPU1 would 70be a LITTLE. 71 72With a workload that periodically does a fixed amount of work, you will get an 73execution trace like so:: 74 75 CPU0 work ^ 76 | ____ ____ ____ 77 | | | | | | | 78 +----+----+----+----+----+----+----+----+----+----+-> time 79 80 CPU1 work ^ 81 | _________ _________ ____ 82 | | | | | | 83 +----+----+----+----+----+----+----+----+----+----+-> time 84 85CPU0 has the highest capacity in the system (C), and completes a fixed amount of 86work W in T units of time. On the other hand, CPU1 has half the capacity of 87CPU0, and thus only completes W/2 in T. 88 891.3.2 Different max OPPs 90~~~~~~~~~~~~~~~~~~~~~~~~ 91 92Usually, CPUs of different capacity values also have different maximum 93OPPs. Consider the same CPUs as above (i.e. same work_per_hz()) with: 94 95- max_freq(CPU0) = F 96- max_freq(CPU1) = 2/3 * F 97 98This yields: 99 100- capacity(CPU0) = C 101- capacity(CPU1) = C/3 102 103Executing the same workload as described in 1.3.1, which each CPU running at its 104maximum frequency results in:: 105 106 CPU0 work ^ 107 | ____ ____ ____ 108 | | | | | | | 109 +----+----+----+----+----+----+----+----+----+----+-> time 110 111 workload on CPU1 112 CPU1 work ^ 113 | ______________ ______________ ____ 114 | | | | | | 115 +----+----+----+----+----+----+----+----+----+----+-> time 116 1171.4 Representation caveat 118------------------------- 119 120It should be noted that having a *single* value to represent differences in CPU 121performance is somewhat of a contentious point. The relative performance 122difference between two different µarchs could be X% on integer operations, Y% on 123floating point operations, Z% on branches, and so on. Still, results using this 124simple approach have been satisfactory for now. 125 1262. Task utilization 127=================== 128 1292.1 Introduction 130---------------- 131 132Capacity aware scheduling requires an expression of a task's requirements with 133regards to CPU capacity. Each scheduler class can express this differently, and 134while task utilization is specific to CFS, it is convenient to describe it here 135in order to introduce more generic concepts. 136 137Task utilization is a percentage meant to represent the throughput requirements 138of a task. A simple approximation of it is the task's duty cycle, i.e.:: 139 140 task_util(p) = duty_cycle(p) 141 142On an SMP system with fixed frequencies, 100% utilization suggests the task is a 143busy loop. Conversely, 10% utilization hints it is a small periodic task that 144spends more time sleeping than executing. Variable CPU frequencies and 145asymmetric CPU capacities complexify this somewhat; the following sections will 146expand on these. 147 1482.2 Frequency invariance 149------------------------ 150 151One issue that needs to be taken into account is that a workload's duty cycle is 152directly impacted by the current OPP the CPU is running at. Consider running a 153periodic workload at a given frequency F:: 154 155 CPU work ^ 156 | ____ ____ ____ 157 | | | | | | | 158 +----+----+----+----+----+----+----+----+----+----+-> time 159 160This yields duty_cycle(p) == 25%. 161 162Now, consider running the *same* workload at frequency F/2:: 163 164 CPU work ^ 165 | _________ _________ ____ 166 | | | | | | 167 +----+----+----+----+----+----+----+----+----+----+-> time 168 169This yields duty_cycle(p) == 50%, despite the task having the exact same 170behaviour (i.e. executing the same amount of work) in both executions. 171 172The task utilization signal can be made frequency invariant using the following 173formula:: 174 175 task_util_freq_inv(p) = duty_cycle(p) * (curr_frequency(cpu) / max_frequency(cpu)) 176 177Applying this formula to the two examples above yields a frequency invariant 178task utilization of 25%. 179 1802.3 CPU invariance 181------------------ 182 183CPU capacity has a similar effect on task utilization in that running an 184identical workload on CPUs of different capacity values will yield different 185duty cycles. 186 187Consider the system described in 1.3.2., i.e.:: 188 189- capacity(CPU0) = C 190- capacity(CPU1) = C/3 191 192Executing a given periodic workload on each CPU at their maximum frequency would 193result in:: 194 195 CPU0 work ^ 196 | ____ ____ ____ 197 | | | | | | | 198 +----+----+----+----+----+----+----+----+----+----+-> time 199 200 CPU1 work ^ 201 | ______________ ______________ ____ 202 | | | | | | 203 +----+----+----+----+----+----+----+----+----+----+-> time 204 205IOW, 206 207- duty_cycle(p) == 25% if p runs on CPU0 at its maximum frequency 208- duty_cycle(p) == 75% if p runs on CPU1 at its maximum frequency 209 210The task utilization signal can be made CPU invariant using the following 211formula:: 212 213 task_util_cpu_inv(p) = duty_cycle(p) * (capacity(cpu) / max_capacity) 214 215with ``max_capacity`` being the highest CPU capacity value in the 216system. Applying this formula to the above example above yields a CPU 217invariant task utilization of 25%. 218 2192.4 Invariant task utilization 220------------------------------ 221 222Both frequency and CPU invariance need to be applied to task utilization in 223order to obtain a truly invariant signal. The pseudo-formula for a task 224utilization that is both CPU and frequency invariant is thus, for a given 225task p:: 226 227 curr_frequency(cpu) capacity(cpu) 228 task_util_inv(p) = duty_cycle(p) * ------------------- * ------------- 229 max_frequency(cpu) max_capacity 230 231In other words, invariant task utilization describes the behaviour of a task as 232if it were running on the highest-capacity CPU in the system, running at its 233maximum frequency. 234 235Any mention of task utilization in the following sections will imply its 236invariant form. 237 2382.5 Utilization estimation 239-------------------------- 240 241Without a crystal ball, task behaviour (and thus task utilization) cannot 242accurately be predicted the moment a task first becomes runnable. The CFS class 243maintains a handful of CPU and task signals based on the Per-Entity Load 244Tracking (PELT) mechanism, one of those yielding an *average* utilization (as 245opposed to instantaneous). 246 247This means that while the capacity aware scheduling criteria will be written 248considering a "true" task utilization (using a crystal ball), the implementation 249will only ever be able to use an estimator thereof. 250 2513. Capacity aware scheduling requirements 252========================================= 253 2543.1 CPU capacity 255---------------- 256 257Linux cannot currently figure out CPU capacity on its own, this information thus 258needs to be handed to it. Architectures must define arch_scale_cpu_capacity() 259for that purpose. 260 261The arm and arm64 architectures directly map this to the arch_topology driver 262CPU scaling data, which is derived from the capacity-dmips-mhz CPU binding; see 263Documentation/devicetree/bindings/arm/cpu-capacity.txt. 264 2653.2 Frequency invariance 266------------------------ 267 268As stated in 2.2, capacity-aware scheduling requires a frequency-invariant task 269utilization. Architectures must define arch_scale_freq_capacity(cpu) for that 270purpose. 271 272Implementing this function requires figuring out at which frequency each CPU 273have been running at. One way to implement this is to leverage hardware counters 274whose increment rate scale with a CPU's current frequency (APERF/MPERF on x86, 275AMU on arm64). Another is to directly hook into cpufreq frequency transitions, 276when the kernel is aware of the switched-to frequency (also employed by 277arm/arm64). 278 2794. Scheduler topology 280===================== 281 282During the construction of the sched domains, the scheduler will figure out 283whether the system exhibits asymmetric CPU capacities. Should that be the 284case: 285 286- The sched_asym_cpucapacity static key will be enabled. 287- The SD_ASYM_CPUCAPACITY_FULL flag will be set at the lowest sched_domain 288 level that spans all unique CPU capacity values. 289- The SD_ASYM_CPUCAPACITY flag will be set for any sched_domain that spans 290 CPUs with any range of asymmetry. 291 292The sched_asym_cpucapacity static key is intended to guard sections of code that 293cater to asymmetric CPU capacity systems. Do note however that said key is 294*system-wide*. Imagine the following setup using cpusets:: 295 296 capacity C/2 C 297 ________ ________ 298 / \ / \ 299 CPUs 0 1 2 3 4 5 6 7 300 \__/ \______________/ 301 cpusets cs0 cs1 302 303Which could be created via: 304 305.. code-block:: sh 306 307 mkdir /sys/fs/cgroup/cpuset/cs0 308 echo 0-1 > /sys/fs/cgroup/cpuset/cs0/cpuset.cpus 309 echo 0 > /sys/fs/cgroup/cpuset/cs0/cpuset.mems 310 311 mkdir /sys/fs/cgroup/cpuset/cs1 312 echo 2-7 > /sys/fs/cgroup/cpuset/cs1/cpuset.cpus 313 echo 0 > /sys/fs/cgroup/cpuset/cs1/cpuset.mems 314 315 echo 0 > /sys/fs/cgroup/cpuset/cpuset.sched_load_balance 316 317Since there *is* CPU capacity asymmetry in the system, the 318sched_asym_cpucapacity static key will be enabled. However, the sched_domain 319hierarchy of CPUs 0-1 spans a single capacity value: SD_ASYM_CPUCAPACITY isn't 320set in that hierarchy, it describes an SMP island and should be treated as such. 321 322Therefore, the 'canonical' pattern for protecting codepaths that cater to 323asymmetric CPU capacities is to: 324 325- Check the sched_asym_cpucapacity static key 326- If it is enabled, then also check for the presence of SD_ASYM_CPUCAPACITY in 327 the sched_domain hierarchy (if relevant, i.e. the codepath targets a specific 328 CPU or group thereof) 329 3305. Capacity aware scheduling implementation 331=========================================== 332 3335.1 CFS 334------- 335 3365.1.1 Capacity fitness 337~~~~~~~~~~~~~~~~~~~~~~ 338 339The main capacity scheduling criterion of CFS is:: 340 341 task_util(p) < capacity(task_cpu(p)) 342 343This is commonly called the capacity fitness criterion, i.e. CFS must ensure a 344task "fits" on its CPU. If it is violated, the task will need to achieve more 345work than what its CPU can provide: it will be CPU-bound. 346 347Furthermore, uclamp lets userspace specify a minimum and a maximum utilization 348value for a task, either via sched_setattr() or via the cgroup interface (see 349Documentation/admin-guide/cgroup-v2.rst). As its name imply, this can be used to 350clamp task_util() in the previous criterion. 351 3525.1.2 Wakeup CPU selection 353~~~~~~~~~~~~~~~~~~~~~~~~~~ 354 355CFS task wakeup CPU selection follows the capacity fitness criterion described 356above. On top of that, uclamp is used to clamp the task utilization values, 357which lets userspace have more leverage over the CPU selection of CFS 358tasks. IOW, CFS wakeup CPU selection searches for a CPU that satisfies:: 359 360 clamp(task_util(p), task_uclamp_min(p), task_uclamp_max(p)) < capacity(cpu) 361 362By using uclamp, userspace can e.g. allow a busy loop (100% utilization) to run 363on any CPU by giving it a low uclamp.max value. Conversely, it can force a small 364periodic task (e.g. 10% utilization) to run on the highest-performance CPUs by 365giving it a high uclamp.min value. 366 367.. note:: 368 369 Wakeup CPU selection in CFS can be eclipsed by Energy Aware Scheduling 370 (EAS), which is described in Documentation/scheduler/sched-energy.rst. 371 3725.1.3 Load balancing 373~~~~~~~~~~~~~~~~~~~~ 374 375A pathological case in the wakeup CPU selection occurs when a task rarely 376sleeps, if at all - it thus rarely wakes up, if at all. Consider:: 377 378 w == wakeup event 379 380 capacity(CPU0) = C 381 capacity(CPU1) = C / 3 382 383 workload on CPU0 384 CPU work ^ 385 | _________ _________ ____ 386 | | | | | | 387 +----+----+----+----+----+----+----+----+----+----+-> time 388 w w w 389 390 workload on CPU1 391 CPU work ^ 392 | ____________________________________________ 393 | | 394 +----+----+----+----+----+----+----+----+----+----+-> 395 w 396 397This workload should run on CPU0, but if the task either: 398 399- was improperly scheduled from the start (inaccurate initial 400 utilization estimation) 401- was properly scheduled from the start, but suddenly needs more 402 processing power 403 404then it might become CPU-bound, IOW ``task_util(p) > capacity(task_cpu(p))``; 405the CPU capacity scheduling criterion is violated, and there may not be any more 406wakeup event to fix this up via wakeup CPU selection. 407 408Tasks that are in this situation are dubbed "misfit" tasks, and the mechanism 409put in place to handle this shares the same name. Misfit task migration 410leverages the CFS load balancer, more specifically the active load balance part 411(which caters to migrating currently running tasks). When load balance happens, 412a misfit active load balance will be triggered if a misfit task can be migrated 413to a CPU with more capacity than its current one. 414 4155.2 RT 416------ 417 4185.2.1 Wakeup CPU selection 419~~~~~~~~~~~~~~~~~~~~~~~~~~ 420 421RT task wakeup CPU selection searches for a CPU that satisfies:: 422 423 task_uclamp_min(p) <= capacity(task_cpu(cpu)) 424 425while still following the usual priority constraints. If none of the candidate 426CPUs can satisfy this capacity criterion, then strict priority based scheduling 427is followed and CPU capacities are ignored. 428 4295.3 DL 430------ 431 4325.3.1 Wakeup CPU selection 433~~~~~~~~~~~~~~~~~~~~~~~~~~ 434 435DL task wakeup CPU selection searches for a CPU that satisfies:: 436 437 task_bandwidth(p) < capacity(task_cpu(p)) 438 439while still respecting the usual bandwidth and deadline constraints. If 440none of the candidate CPUs can satisfy this capacity criterion, then the 441task will remain on its current CPU. 442