Let's start with an example:
If period is 250ms and quota is also 250ms, the group will get
1 CPU worth of runtime every 250ms.
# echo 250000 > cpu.cfs_quota_us /* quota = 250ms */
# echo 250000 > cpu.cfs_period_us /* period = 250ms */
A question: what would be different if we replaced the settings above with the following?
If period is 1000ms and quota is also 1000ms, the group will get
1 CPU worth of runtime every 1000ms.
# echo 1000000 > cpu.cfs_quota_us /* quota = 1000ms */
# echo 1000000 > cpu.cfs_period_us /* period = 1000ms */
Both settings give this control group one CPU's worth of runtime, so what difference does it make whether the period is set to 250ms or 1000ms?
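In both cases the ratio of quota to period is 1, so the long-run limit is one full CPU. The same limit could just as well be expressed with the default 100ms period; a sketch, with values chosen only for illustration:
# echo 100000 > cpu.cfs_quota_us /* quota = 100ms */
# echo 100000 > cpu.cfs_period_us /* period = 100ms */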
To understand the difference, we have to look at how CFS Bandwidth Control is implemented; its documentation introduces a concept called burst.
CPU time is divided into periods according to cpu.cfs_period_us. When a group sees a burst of demand, it is allowed to exceed its quota within the current period and then pay the excess back in later periods. Over the long run, the overall limit still holds.
So how many later periods does it take to pay the excess back? And what if the group keeps exceeding its quota in those later periods as well? In newer kernels, cpu.cfs_burst_us puts an upper bound on the burst.
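As a sketch of how that bound is configured on a kernel that supports it (the paths below assume the cgroup v1 cpu controller, and 100ms is just an illustrative value; the burst may not exceed the quota):
# echo 250000 > cpu.cfs_quota_us /* quota = 250ms */
# echo 250000 > cpu.cfs_period_us /* period = 250ms */
# echo 100000 > cpu.cfs_burst_us /* burst = 100ms, the most unused quota the group may accumulate and borrow */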
*** Burst feature ***
This feature borrows time now against our future underrun, at the cost of increased interference against the other system users. All nicely bounded.
Traditional (UP-EDF) bandwidth control is something like:
(U = Sum u_i) <= 1
This guarantees both that every deadline is met and that the system is stable. After all, if U were > 1, then for every second of walltime, we'd have to run more than a second of program time, and obviously miss our deadline, but the next deadline will be further out still; there is never time to catch up, unbounded fail.
The burst feature observes that a workload doesn't always execute the full quota; this enables one to describe u_i as a statistical distribution.
For example, have u_i = {x,e}_i, where x is the p(95) and x+e the p(100) (the traditional Worst Case Execution Time, WCET). This effectively allows u to be smaller, increasing the efficiency (we can pack more tasks in the system), but at the cost of missing deadlines when all the odds line up. However, it does maintain stability, since every overrun must be paired with an underrun as long as our x is above the average.
That is, suppose we have 2 tasks, both specify a p(95) value; then we have a p(95)*p(95) = 90.25% chance both tasks are within their quota and everything is good. At the same time we have a p(5)*p(5) = 0.25% chance both tasks will exceed their quota at the same time (guaranteed deadline fail). Somewhere in between there's a threshold where one exceeds and the other doesn't underrun enough to compensate; this depends on the specific CDFs.
At the same time, we can say that the worst case deadline miss will be Sum e_i; that is, there is a bounded tardiness (under the assumption that x+e is indeed WCET).
The interference when using burst is valued by the possibilities for missing the deadline and the average WCET. Test results showed that when there are many cgroups or the CPU is under-utilized, the interference is limited. More details are shown in: https://lore.kernel.org/lkml/5371BD36-55AE-4F71-B9D7-B86DC32E3D2B@linux.alibaba.com/
Back to the earlier question: what does the value of cfs_period_us actually change? The larger the period, the longer each accounting window. From the application's point of view, a small period makes execution smoother, while a large period produces more jitter. From the kernel's point of view, a small period means more scheduling and accounting overhead, while a large period keeps that overhead lower.
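A convenient way to watch this in practice is the group's cpu.stat file (a sketch, assuming the cgroup v1 cpu controller; the field names come from the sched-bwc documentation, the numbers below are only illustrative, and nr_bursts/burst_time appear only on kernels with burst support):
# cat cpu.stat
nr_periods 1000
nr_throttled 37
throttled_time 1234567890
With a small period the same wall-clock interval spans more periods, so throttling happens in smaller, more frequent slices; with a large period the slices are larger but less frequent, which is exactly the smoothness-versus-jitter trade-off described above.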
Further reading
The burst feature is still evolving, as you can see by comparing these two versions of the document:
https://www.kernel.org/doc/html/v5.13/scheduler/sched-bwc.html
https://www.kernel.org/doc/html/latest/scheduler/sched-bwc.html
The regular Alibaba kernel (3.10.0-327.ali2010.rc7.alios7.x86_64) does not have cpu.cfs_burst_us yet. Its burst capability is also rather limited: it can only borrow.