1 2 3NOTE; all this assumes a linear relation between frequency and work capacity, 4we know this is flawed, but it is the best workable approximation. 5 6 7PELT (Per Entity Load Tracking) 8------------------------------- 9 10With PELT we track some metrics across the various scheduler entities, from 11individual tasks to task-group slices to CPU runqueues. As the basis for this 12we use an Exponentially Weighted Moving Average (EWMA), each period (1024us) 13is decayed such that y^32 = 0.5. That is, the most recent 32ms contribute 14half, while the rest of history contribute the other half. 15 16Specifically: 17 18 ewma_sum(u) := u_0 + u_1*y + u_2*y^2 + ... 19 20 ewma(u) = ewma_sum(u) / ewma_sum(1) 21 22Since this is essentially a progression of an infinite geometric series, the 23results are composable, that is ewma(A) + ewma(B) = ewma(A+B). This property 24is key, since it gives the ability to recompose the averages when tasks move 25around. 26 27Note that blocked tasks still contribute to the aggregates (task-group slices 28and CPU runqueues), which reflects their expected contribution when they 29resume running. 30 31Using this we track 2 key metrics: 'running' and 'runnable'. 'Running' 32reflects the time an entity spends on the CPU, while 'runnable' reflects the 33time an entity spends on the runqueue. When there is only a single task these 34two metrics are the same, but once there is contention for the CPU 'running' 35will decrease to reflect the fraction of time each task spends on the CPU 36while 'runnable' will increase to reflect the amount of contention. 37 38For more detail see: kernel/sched/pelt.c 39 40 41Frequency- / CPU Invariance 42--------------------------- 43 44Because consuming the CPU for 50% at 1GHz is not the same as consuming the CPU 45for 50% at 2GHz, nor is running 50% on a LITTLE CPU the same as running 50% on 46a big CPU, we allow architectures to scale the time delta with two ratios, one 47Dynamic Voltage and Frequency Scaling (DVFS) ratio and one microarch ratio. 48 49For simple DVFS architectures (where software is in full control) we trivially 50compute the ratio as: 51 52 f_cur 53 r_dvfs := ----- 54 f_max 55 56For more dynamic systems where the hardware is in control of DVFS we use 57hardware counters (Intel APERF/MPERF, ARMv8.4-AMU) to provide us this ratio. 58For Intel specifically, we use: 59 60 APERF 61 f_cur := ----- * P0 62 MPERF 63 64 4C-turbo; if available and turbo enabled 65 f_max := { 1C-turbo; if turbo enabled 66 P0; otherwise 67 68 f_cur 69 r_dvfs := min( 1, ----- ) 70 f_max 71 72We pick 4C turbo over 1C turbo to make it slightly more sustainable. 73 74r_cpu is determined as the ratio of highest performance level of the current 75CPU vs the highest performance level of any other CPU in the system. 76 77 r_tot = r_dvfs * r_cpu 78 79The result is that the above 'running' and 'runnable' metrics become invariant 80of DVFS and CPU type. IOW. we can transfer and compare them between CPUs. 81 82For more detail see: 83 84 - kernel/sched/pelt.h:update_rq_clock_pelt() 85 - arch/x86/kernel/smpboot.c:"APERF/MPERF frequency ratio computation." 86 - Documentation/scheduler/sched-capacity.rst:"1. CPU Capacity + 2. Task utilization" 87 88 89UTIL_EST / UTIL_EST_FASTUP 90-------------------------- 91 92Because periodic tasks have their averages decayed while they sleep, even 93though when running their expected utilization will be the same, they suffer a 94(DVFS) ramp-up after they are running again. 95 96To alleviate this (a default enabled option) UTIL_EST drives an Infinite 97Impulse Response (IIR) EWMA with the 'running' value on dequeue -- when it is 98highest. A further default enabled option UTIL_EST_FASTUP modifies the IIR 99filter to instantly increase and only decay on decrease. 100 101A further runqueue wide sum (of runnable tasks) is maintained of: 102 103 util_est := \Sum_t max( t_running, t_util_est_ewma ) 104 105For more detail see: kernel/sched/fair.c:util_est_dequeue() 106 107 108UCLAMP 109------ 110 111It is possible to set effective u_min and u_max clamps on each CFS or RT task; 112the runqueue keeps an max aggregate of these clamps for all running tasks. 113 114For more detail see: include/uapi/linux/sched/types.h 115 116 117Schedutil / DVFS 118---------------- 119 120Every time the scheduler load tracking is updated (task wakeup, task 121migration, time progression) we call out to schedutil to update the hardware 122DVFS state. 123 124The basis is the CPU runqueue's 'running' metric, which per the above it is 125the frequency invariant utilization estimate of the CPU. From this we compute 126a desired frequency like: 127 128 max( running, util_est ); if UTIL_EST 129 u_cfs := { running; otherwise 130 131 clamp( u_cfs + u_rt , u_min, u_max ); if UCLAMP_TASK 132 u_clamp := { u_cfs + u_rt; otherwise 133 134 u := u_clamp + u_irq + u_dl; [approx. see source for more detail] 135 136 f_des := min( f_max, 1.25 u * f_max ) 137 138XXX IO-wait; when the update is due to a task wakeup from IO-completion we 139boost 'u' above. 140 141This frequency is then used to select a P-state/OPP or directly munged into a 142CPPC style request to the hardware. 143 144XXX: deadline tasks (Sporadic Task Model) allows us to calculate a hard f_min 145required to satisfy the workload. 146 147Because these callbacks are directly from the scheduler, the DVFS hardware 148interaction should be 'fast' and non-blocking. Schedutil supports 149rate-limiting DVFS requests for when hardware interaction is slow and 150expensive, this reduces effectiveness. 151 152For more information see: kernel/sched/cpufreq_schedutil.c 153 154 155NOTES 156----- 157 158 - On low-load scenarios, where DVFS is most relevant, the 'running' numbers 159 will closely reflect utilization. 160 161 - In saturated scenarios task movement will cause some transient dips, 162 suppose we have a CPU saturated with 4 tasks, then when we migrate a task 163 to an idle CPU, the old CPU will have a 'running' value of 0.75 while the 164 new CPU will gain 0.25. This is inevitable and time progression will 165 correct this. XXX do we still guarantee f_max due to no idle-time? 166 167 - Much of the above is about avoiding DVFS dips, and independent DVFS domains 168 having to re-learn / ramp-up when load shifts. 169 170