1. Introduction
1.1 VPA components
VPA consists of three components: recommender, updater, and admission-controller.
recommender: computes recommended resource values from the container metrics fetched through the metrics-server API.
updater: compares the values in each pod's requests with the recommender's recommendations and, under certain conditions, evicts the pod.
admission-controller: a webhook that watches pod creation and writes the recommender's computed values into the pod's requests and limits.
1.2 Custom resources
Two custom resources are involved: VerticalPodAutoscaler and VerticalPodAutoscalerCheckpoint.
VerticalPodAutoscaler:
Created by the user; it specifies the target of vertical scaling and stores the recommendations computed by the recommender.
apiVersion: autoscaling.k8s.io/v1beta2
kind: VerticalPodAutoscaler
metadata:
  name: nginx-vpa
  namespace: default
spec:
  targetRef:
    apiVersion: "apps/v1"
    kind: Deployment
    name: nginx
  updatePolicy:
    updateMode: "Off"
  resourcePolicy:
    containerPolicies:
    - containerName: "nginx"
      minAllowed:
        cpu: "90m"
        memory: "10Mi"
      maxAllowed:
        cpu: "1000m"
        memory: "1Gi"
status:
  conditions:
  - lastTransitionTime: "2021-11-24T08:54:41Z"
    status: "True"
    type: RecommendationProvided
  recommendation:
    containerRecommendations:
    - containerName: nginx
      lowerBound:
        cpu: 90m
        memory: 131072k
      target:
        cpu: 90m
        memory: 131072k
      uncappedTarget:
        cpu: 12m
        memory: 131072k
      upperBound:
        cpu: "1"
        memory: 131072k
    - containerName: yellow-pod-container
      lowerBound:
        cpu: 12m
        memory: 131072k
      target:
        cpu: 12m
        memory: 131072k
      uncappedTarget:
        cpu: 12m
        memory: 131072k
      upperBound:
        cpu: 407m
        memory: 425500k
VerticalPodAutoscalerCheckpoint:
Created and maintained by the recommender; it persists metric histogram state.
One checkpoint is created per container covered by a VPA.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscalerCheckpoint
metadata:
  creationTimestamp: "2021-11-25T08:00:20Z"
  name: nginx-vpa-nginx
  namespace: default
spec:
  containerName: nginx
  vpaObjectName: nginx-vpa
status:
  cpuHistogram:
    bucketWeights:
      "0": 10000
      "9": 4
      "11": 2
      "12": 16
      "13": 2
      "15": 1
      "16": 11
      "17": 5
      "31": 40
      "34": 2
      "44": 2
      "60": 13
      "75": 744
    referenceTimestamp: "2021-11-28T00:00:00Z"
    totalWeight: 560.4882004282056
  firstSampleStart: null
  lastSampleStart: "2021-11-28T14:13:59Z"
  lastUpdateTime: null
  memoryHistogram:
    bucketWeights:
      "0": 10000
      "1": 67
      "34": 141
    referenceTimestamp: "2021-11-29T00:00:00Z"
    totalWeight: 23.976566526407623
  totalSamplesCount: 9371
  version: v3
2. recommender component
2.1 YAML overview
2.1.1 Version differences
v0.7
apiVersion: autoscaling.k8s.io/v1beta2
kind: VerticalPodAutoscaler
metadata:
  name: nginx-vpa
  namespace: default
spec:
  targetRef:           # target resource; may also be a custom resource readable via the scale client, e.g. CloneSet, which can also serve as an HPA target
    apiVersion: "apps/v1"
    kind: Deployment
    name: nginx
  updatePolicy:        # Auto, Recreate, or Off. Off: only compute recommendations, the updater never evicts pods. Auto: the updater evicts pods
    updateMode: "Auto"
  resourcePolicy:
    containerPolicies: # the containers to compute recommendations for
    - containerName: "nginx"
      minAllowed:      # when minAllowed/maxAllowed are set, the recommender clamps the final recommendation to this range
        cpu: "90m"
        memory: "10Mi"
      maxAllowed:
        cpu: "1000m"
        memory: "1Gi"
v0.8
Adds controlledResources.
Lets you restrict recommendations to cpu only, or memory only.
This field is used by the recommender.
Allowed values: enum: ["cpu", "memory"]
// Specifies the type of recommendations that will be computed
// (and possibly applied) by VPA.
// If not specified, the default of [ResourceCPU, ResourceMemory] will be used.
ControlledResources *[]v1.ResourceName `json:"controlledResources,omitempty" patchStrategy:"merge" protobuf:"bytes,5,rep,name=controlledResources"`
v0.9
Adds controlledValues.
This field is not used by the recommender; it is used by the admission-controller.
Allowed values: enum: ["RequestsAndLimits", "RequestsOnly"]
// Specifies which resource values should be controlled.
// The default is "RequestsAndLimits".
// +optional
ControlledValues *ContainerControlledValues `json:"controlledValues,omitempty" protobuf:"bytes,6,rep,name=controlledValues"`
2.1.2 targetRef
Specifies the target resource.
It can be any of: daemonSet, deployment, replicaSet, statefulSet, replicationController, job, cronJob.
It can also be a custom resource whose information the scale client can fetch, e.g. CloneSet, which can also serve as an HPA target.
Note: the updater's eviction fraction (evictionToleranceFraction) only takes effect when the target is a deployment, replicaSet, replicationController, statefulSet, or job. For any other resource, all pods get evicted, because the computed tolerance is 0.
targetRef:
  apiVersion: "apps/v1"
  kind: Deployment
  name: nginx
Source:
// Fetch the target resource's label selector from the VPA, e.g. a Deployment's selector.
func (f *vpaTargetSelectorFetcher) Fetch(vpa *vpa_types.VerticalPodAutoscaler) (labels.Selector, error) {
	if vpa.Spec.TargetRef == nil {
		return nil, fmt.Errorf("targetRef not defined. If this is a v1beta1 object switch to v1beta2.")
	}
	kind := wellKnownController(vpa.Spec.TargetRef.Kind)
	// informersMap := map[wellKnownController]cache.SharedIndexInformer{
	//     daemonSet:             factory.Apps().V1().DaemonSets().Informer(),
	//     deployment:            factory.Apps().V1().Deployments().Informer(),
	//     replicaSet:            factory.Apps().V1().ReplicaSets().Informer(),
	//     statefulSet:           factory.Apps().V1().StatefulSets().Informer(),
	//     replicationController: factory.Core().V1().ReplicationControllers().Informer(),
	//     job:                   factory.Batch().V1().Jobs().Informer(),
	//     cronJob:               factory.Batch().V1beta1().CronJobs().Informer(),
	// }
	informer, exists := f.informersMap[kind]
	if exists {
		return getLabelSelector(informer, vpa.Spec.TargetRef.Kind, vpa.Namespace, vpa.Spec.TargetRef.Name)
	}
	// not on a list of known controllers, use scale sub-resource
	// TODO: cache response
	groupVersion, err := schema.ParseGroupVersion(vpa.Spec.TargetRef.APIVersion)
	if err != nil {
		return nil, err
	}
	groupKind := schema.GroupKind{
		Group: groupVersion.Group,
		Kind:  vpa.Spec.TargetRef.Kind,
	}
	// Fetch the resource's selector via the scale client.
	selector, err := f.getLabelSelectorFromResource(groupKind, vpa.Namespace, vpa.Spec.TargetRef.Name)
	if err != nil {
		return nil, fmt.Errorf("Unhandled targetRef %s / %s / %s, last error %v",
			vpa.Spec.TargetRef.APIVersion, vpa.Spec.TargetRef.Kind, vpa.Spec.TargetRef.Name, err)
	}
	return selector, nil
}
2.1.3 updatePolicy
Can be set to Auto, Recreate, Off, or Initial.
Off or Initial: only recommendations are computed; the updater does not evict pods. No difference between the two was found at the code level.
Auto or Recreate: the updater evicts pods. Again, no difference between the two was found at the code level.
updatePolicy:
  updateMode: "Auto"
2.1.4 resourcePolicy
When minAllowed and maxAllowed are set, the recommender clamps the final recommendation to this range.
resourcePolicy:
  containerPolicies: # the containers to compute recommendations for
  - containerName: "nginx"
    minAllowed:      # the final recommendation is clamped to [minAllowed, maxAllowed]
      cpu: "90m"
      memory: "10Mi"
    maxAllowed:
      cpu: "1000m"
      memory: "1Gi"
2.2 Details
2.2.1 Flags
1. Metric storage
By default each container's metrics are stored in a VerticalPodAutoscalerCheckpoint.
--storage can redirect them to prometheus instead.
This is a nice pattern: a built-in default store, with the option of an external one. Could our own system do the same, e.g. a flag to choose whether to use sso?
storage = flag.String("storage", "", `Specifies storage mode. Supported values: prometheus, checkpoint (default)`)
2. Metric fetch interval
Metrics are fetched, and recommendations computed, once per minute by default.
Customizable via --recommender-interval.
metricsFetcherInterval = flag.Duration("recommender-interval", 1*time.Minute, `How often metrics should be fetched`)
3. Default minimum recommendation (defaultMinRecommend)
Minimum CPU recommendation: 25m * (1 / number of containers aggregated by the VPA)
Minimum memory recommendation: 250MB * (1 / number of containers aggregated by the VPA)
Note:
minRecommend = max(minAllowed, defaultMinRecommend)
i.e. the effective minimum is whichever of minAllowed and defaultMinRecommend is larger.
Customizable via --pod-recommendation-min-cpu-millicores and --pod-recommendation-min-memory-mb.
podMinCPUMillicores = flag.Float64("pod-recommendation-min-cpu-millicores", 25, `Minimum CPU recommendation for a pod`)
podMinMemoryMb = flag.Float64("pod-recommendation-min-memory-mb", 250, `Minimum memory recommendation for a pod`)
Source:
// containerNameToAggregateStateMap — the containers aggregated for this VPA
fraction := 1.0 / float64(len(containerNameToAggregateStateMap))
// CPU minimum: 25m * fraction
// memory minimum: 250MB * fraction
// minResources: map[cpu:25 memory:262144000]
minResources := model.Resources{
	model.ResourceCPU:    model.ScaleResource(model.CPUAmountFromCores(*podMinCPUMillicores*0.001), fraction),
	model.ResourceMemory: model.ScaleResource(model.MemoryAmountFromBytes(*podMinMemoryMb*1024*1024), fraction),
}
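For concreteness, a worked instance of the fraction math above, assuming a VPA that aggregates two containers (all values illustrative):

package main

import "fmt"

func main() {
	containers := 2
	fraction := 1.0 / float64(containers) // 1 / number of aggregated containers
	minCPUMillicores := 25.0 * fraction   // 12.5m per container
	minMemoryMB := 250.0 * fraction       // 125MB per container
	fmt.Println(minCPUMillicores, minMemoryMB)
}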
4. Final recommendation
final recommendation = bucketStart * (1 + recommendation-margin-fraction)
safetyMarginFraction = flag.Float64("recommendation-margin-fraction", 0.15, `Fraction of usage added as the safety margin to the recommended request`)
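A minimal sketch of the margin formula, assuming a bucket start of 100 millicores has already been estimated from the histogram (numbers illustrative):

package main

import "fmt"

const safetyMarginFraction = 0.15 // default of --recommendation-margin-fraction

func main() {
	bucketStart := 100.0 // millicores, estimated from the histogram percentile
	final := bucketStart * (1 + safetyMarginFraction)
	fmt.Println(final) // 115 — the 15% safety margin added on top
}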
5. Prometheus metrics
The recommender exposes its own metrics at:
http://127.0.0.1:8942/metrics
2.2.2 Processing logic
1. InitFromCheckpoints
Converts the scaled-up bucket indexes and weights stored in the checkpoint back into the pre-scaling weight array; buckets with no stored weight get 0.
The bucketWeights restored from a checkpoint are placed in feeder.clusterState.Vpas[vpaID].ContainersInitialAggregateState[checkpoint.Spec.ContainerName].
Values in the checkpoint:
checkpoint.BucketWeights: map[0:10000 9:2 11:1 12:8 13:1 16:6 17:3 31:21 34:1 44:1 60:7 75:365]
TotalWeight: 549.3782628171234
sumWeight: 10416 = 10000+2+1+8+1+6+3+21+1+1+7+365
ratio = totalWeight / sumWeight = 549.3782628171234 / 10416 = 0.052743688826528745
Restored bucketWeight = [527.4368882652875 0 0 0 0 0 0 0 0 0.10548737765305749 0 0.052743688826528745 0.42194951061222996 0.052743688826528745 0 0 0.31646213295917247 0.15823106647958624 0 0 0 0 0 0 0 0 0 0 0 0 0 1.1076174653571036 0 0 0.052743688826528745 0 0 0 0 0 0 0 0 0 0.052743688826528745 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.3692058217857012 0 0 0 0 0 0 0 0 0 0 0 0 0 0 19.251446421682992 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
TotalWeight: 549.3782628171234
e.g. 527.4368882652875 = 10000 * ratio(0.052743688826528745)
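The restore math can be reproduced in a few lines. The sketch below uses the checkpoint numbers from this example; the variable names and the bucket count are illustrative, not the recommender's actual identifiers:

package main

import "fmt"

func main() {
	// bucketWeights as stored in the checkpoint (max bucket scaled to 10000)
	stored := map[int]float64{0: 10000, 9: 2, 11: 1, 12: 8, 13: 1, 16: 6,
		17: 3, 31: 21, 34: 1, 44: 1, 60: 7, 75: 365}
	totalWeight := 549.3782628171234 // the checkpoint's totalWeight

	sumWeight := 0.0
	for _, w := range stored {
		sumWeight += w // 10416
	}
	ratio := totalWeight / sumWeight // 0.052743688826528745

	restored := make([]float64, 200) // real size comes from the histogram options
	for bucket, w := range stored {
		restored[bucket] = w * ratio
	}
	fmt.Println(restored[0], restored[75]) // 527.4368882652875, 19.251446421682992
}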
2. LoadVPAs
Loads every VPA into feeder.clusterState.Vpas.
3. LoadPods
Loads every pod into feeder.clusterState.pods.
Note:
aggregateContainerState = &AggregateContainerState{
	AggregateCPUUsage:    util.NewDecayingHistogram(CPUHistogramOptions, CPUHistogramDecayHalfLife),
	AggregateMemoryPeaks: util.NewDecayingHistogram(MemoryHistogramOptions, MemoryHistogramDecayHalfLife),
	CreationTime:         time.Now(),
}
1. pod.Containers[containerID.ContainerName] points at this aggregateContainerState
2. cluster.aggregateStateMap[aggregateStateKey] points at the same aggregateContainerState
3. vpa.aggregateContainerStates inside clusterState's VPA also points at the same aggregateContainerState
Because these three places share one aggregateContainerState, LoadRealTimeMetrics below can feed samples in through pod[].containers' aggregate while UpdateVPAs reads the same aggregate through the VPA.
Key source:
// findOrCreateAggregateContainerState returns (possibly newly created) AggregateContainerState
// that should be used to aggregate usage samples from container with a given ID.
// The pod with the corresponding PodID must already be present in the ClusterState.
func (cluster *ClusterState) findOrCreateAggregateContainerState(containerID ContainerID) *AggregateContainerState {
	aggregateStateKey := cluster.aggregateStateKeyForContainerID(containerID)
	aggregateContainerState, aggregateStateExists := cluster.aggregateStateMap[aggregateStateKey]
	if !aggregateStateExists {
		aggregateContainerState = NewAggregateContainerState()
		cluster.aggregateStateMap[aggregateStateKey] = aggregateContainerState
		// Link the new aggregation to the existing VPAs.
		for _, vpa := range cluster.Vpas {
			// cluster.aggregateStateMap[aggregateStateKey] and
			// vpa.aggregateContainerStates share this same aggregateContainerState
			if vpa.UseAggregationIfMatching(aggregateStateKey, aggregateContainerState) {
				cluster.VpasWithMatchingPods[vpa.ID] = true
			}
		}
	}
	return aggregateContainerState
}
4. LoadRealTimeMetrics
Takes the current usage fetched from metrics-server, works out which histogram bucket it falls into, and bumps that bucket's weight:
CPU: weight +0.1 (in practice not exactly 0.1 but roughly 0.14; the sketch at the end of this subsection shows where the extra factor likely comes from)
memory: weight +1 (in practice not exactly 1 but roughly 1.4)
Note: aggregation is keyed by container name, so for a deployment with 5 replicas, container1's CPU bucket weight grows by 0.1*5 per round.
targetPercentile = 0.9, lowerBoundPercentile = 0.5, upperBoundPercentile = 0.95.
The logic above picks the bucket and then derives the bucket start; because the first bucket start is below minResources, minResources is used instead.
The bucketWeights are persisted into the checkpoint; see InitFromCheckpoints above for the detailed layout.
Every hour, stale in-memory data is cleaned up, e.g. feeder.clusterState.pods and feeder.clusterState.vpas.
Note: stale VerticalPodAutoscalerCheckpoint objects are not deleted.
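The "0.14 instead of 0.1" is most likely the decaying histogram at work: each sample's weight is multiplied by 2^((t − referenceTimestamp)/halfLife). A sketch under that assumption, using the 24h half-life that matches CPUHistogramDecayHalfLife above (all names illustrative):

package main

import (
	"fmt"
	"math"
	"time"
)

// decayedWeight scales a base sample weight by 2^(elapsed/halfLife),
// mirroring how a decaying histogram ages its samples.
func decayedWeight(base float64, sampleTime, referenceTime time.Time, halfLife time.Duration) float64 {
	exp := float64(sampleTime.Sub(referenceTime)) / float64(halfLife)
	return base * math.Pow(2, exp)
}

func main() {
	ref := time.Date(2021, 11, 28, 0, 0, 0, 0, time.UTC)
	sample := ref.Add(12 * time.Hour)
	// 0.1 * 2^(12h/24h) ≈ 0.1414 — the "+0.14" observed in the notes.
	fmt.Println(decayedWeight(0.1, sample, ref, 24*time.Hour))
}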
5. UpdateVPAs
// Compute the recommendation from the VPA's bucketWeights and write it into the VPA's status.
partialSum := 0.0
threshold := percentile * h.totalWeight
bucket := h.minBucket
for ; bucket < h.maxBucket; bucket++ {
	partialSum += h.bucketWeight[bucket]
	if partialSum >= threshold {
		break
	}
}
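To see the percentile walk in action, here is a self-contained toy version with made-up bucket weights; minBucket/maxBucket are simplified to the slice bounds:

package main

import "fmt"

// percentileBucket walks the buckets until the partial weight sum
// crosses percentile*totalWeight, mirroring the loop above.
func percentileBucket(bucketWeight []float64, percentile float64) int {
	totalWeight := 0.0
	for _, w := range bucketWeight {
		totalWeight += w
	}
	partialSum := 0.0
	threshold := percentile * totalWeight
	bucket := 0
	for ; bucket < len(bucketWeight)-1; bucket++ {
		partialSum += bucketWeight[bucket]
		if partialSum >= threshold {
			break
		}
	}
	return bucket
}

func main() {
	weights := []float64{10, 20, 40, 20, 10}
	fmt.Println(percentileBucket(weights, 0.5))  // 2 — lowerBound percentile
	fmt.Println(percentileBucket(weights, 0.9))  // 3 — target percentile
	fmt.Println(percentileBucket(weights, 0.95)) // 4 — upperBound percentile
}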
6. MaintainCheckpoints
7. GarbageCollect
A VerticalPodAutoscalerCheckpoint is named vpaName-containerName.
When the container no longer exists, the checkpoint resource is not deleted:
$ kubectl get VerticalPodAutoscalerCheckpoint -n vpa
NAME                             AGE
nginx-vpa-nginx                  23d
nginx-vpa-nginxll                21h
nginx-vpa-yellow-pod-container   21h
3. updater component
3.1 Pod eviction conditions
3.1.1 Pods are not evicted while the current replica count < minReplicas (default 2)
Customizable via --min-replicas.
minReplicas = flag.Int("min-replicas", 2, `Minimum number of replicas to perform update`)
3.1.2 Pods with resourceDiff < 0.1 are not evicted
For a pod with two containers C1 and C2, computed per resource and summed:
resourceDiff = |(C1request + C2request) - (C1recommend + C2recommend)| / (C1request + C2request)
Source:
if updatePriority.resourceDiff < calc.config.MinChangePriority {
	klog.Info(fmt.Sprintf("not updating pod %v, resource diff too low: %v", pod.Name, updatePriority))
	return
}
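A minimal sketch of that check, assuming a two-container pod; the per-resource totals and absolute value follow the behavior described above, with illustrative numbers:

package main

import (
	"fmt"
	"math"
)

// resourceDiff sums, per resource, |sum(recommended) - sum(requested)| / sum(requested).
func resourceDiff(requests, recommended map[string]float64) float64 {
	diff := 0.0
	for res, req := range requests {
		diff += math.Abs(recommended[res]-req) / req
	}
	return diff
}

func main() {
	// C1 + C2 totals per resource (CPU in millicores, memory in MB)
	requests := map[string]float64{"cpu": 200, "memory": 512}
	recommended := map[string]float64{"cpu": 190, "memory": 500}
	fmt.Println(resourceDiff(requests, recommended)) // ~0.073 < 0.1, so no eviction
}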
3.1.3 If any container's request falls outside the VPA's range (lowerBound - upperBound), the pod is evicted
Note:
These are the lowerBound and upperBound in the VPA status, not minAllowed and maxAllowed.
lowerBound and upperBound are themselves kept within the minAllowed/maxAllowed range.
3.1.4 If the request is within the VPA range and the pod was not OOM-killed, the resourceDiff check only applies after the pod has run for 12 hours
Source:
if now.Before(pod.Status.StartTime.Add(podLifetimeUpdateThreshold)) {
	klog.Info(fmt.Sprintf("not updating a short-lived pod %v, request within recommended range", pod.Name))
	return
}
3.1.5 Eviction fraction
eviction-tolerance: the fraction of pods that may be evicted within one updater cycle.
// The updater runs once a minute by default; with two nginx pods it evicts one per cycle until every pod carries the target recommendation.
evictionToleranceFraction = flag.Float64("eviction-tolerance", 0.5,
	`Fraction of replica count that can be evicted for update, if more than one pod can be evicted.`)
The eviction-tolerance fraction only takes effect when the target is a deployment, replicaSet, replicationController, statefulSet, or job. For any other resource the updater evicts all pods within one cycle, because the computed tolerance is 0 (see the worked example after the CanEvict source below).
Source:
// NewPodsEvictionRestriction creates PodsEvictionRestriction for a given set of pods.
func (f *podsEvictionRestrictionFactoryImpl) NewPodsEvictionRestriction(pods []*apiv1.Pod) PodsEvictionRestriction {
	// We can evict pod only if it is a part of replica set
	// For each replica set we can evict only a fraction of pods.
	// Evictions may be later limited by pod disruption budget if configured.
	// ----- map from each pod's owner (creator) to its list of pods -----
	livePods := make(map[podReplicaCreator][]*apiv1.Pod)
	for _, pod := range pods {
		creator, err := getPodReplicaCreator(pod)
		if err != nil {
			klog.Errorf("failed to obtain replication info for pod %s: %v", pod.Name, err)
			continue
		}
		if creator == nil {
			klog.Warningf("pod %s not replicated", pod.Name)
			continue
		}
		livePods[*creator] = append(livePods[*creator], pod)
	}
	// map each pod to its creator
	podToReplicaCreatorMap := make(map[string]podReplicaCreator)
	// per creator: number of running pods, pending pods, and configured replicas
	creatorToSingleGroupStatsMap := make(map[podReplicaCreator]singleGroupStats)
	for creator, replicas := range livePods {
		actual := len(replicas)
		if actual < f.minReplicas {
			klog.V(2).Infof("too few replicas for %v %v/%v. Found %v live pods",
				creator.Kind, creator.Namespace, creator.Name, actual)
			continue
		}
		// replica counts are only resolved for job, statefulSet, replicationController and replicaSet
		var configured int
		if creator.Kind == job {
			// Job has no replicas configuration, so we will use actual number of live pods as replicas count.
			configured = actual
		} else {
			var err error
			// Only replicaSet, replicationController and statefulSet return their replicas field;
			// other resources return 0, so singleGroup.evictionTolerance becomes 0.
			configured, err = f.getReplicaCount(creator)
			if err != nil {
				klog.Errorf("failed to obtain replication info for %v %v/%v. %v",
					creator.Kind, creator.Namespace, creator.Name, err)
				continue
			}
		}
		singleGroup := singleGroupStats{}
		singleGroup.configured = configured
		singleGroup.evictionTolerance = int(float64(configured) * f.evictionToleranceFraction)
		for _, pod := range replicas {
			podToReplicaCreatorMap[getPodID(pod)] = creator
			if pod.Status.Phase == apiv1.PodPending {
				singleGroup.pending = singleGroup.pending + 1
			}
		}
		singleGroup.running = len(replicas) - singleGroup.pending
		creatorToSingleGroupStatsMap[creator] = singleGroup
	}
	podToReplicaCreatorMapStr, _ := json.Marshal(podToReplicaCreatorMap)
	klog.Info(fmt.Sprintf("NewPodsEvictionRestriction----podToReplicaCreatorMapStr %s", podToReplicaCreatorMapStr))
	creatorToSingleGroupStatsMapStr, _ := json.Marshal(creatorToSingleGroupStatsMap)
	klog.Info(fmt.Sprintf("NewPodsEvictionRestriction----creatorToSingleGroupStatsMapStr %s", creatorToSingleGroupStatsMapStr))
	return &podsEvictionRestrictionImpl{
		client:                       f.client,
		podToReplicaCreatorMap:       podToReplicaCreatorMap,
		creatorToSingleGroupStatsMap: creatorToSingleGroupStatsMap}
}
func (e *podsEvictionRestrictionImpl) CanEvict(pod *apiv1.Pod) bool {
	cr, present := e.podToReplicaCreatorMap[getPodID(pod)]
	podToReplicaCreatorMapStr, err := json.Marshal(e.podToReplicaCreatorMap)
	if err != nil {
		klog.Warningf("CanEvict e.podToReplicaCreatorMap --marshal err: %v", err)
	}
	klog.Info(fmt.Sprintf("CanEvict podToReplicaCreatorMapStr %s", podToReplicaCreatorMapStr))
	if present {
		klog.Info("CanEvict pod present")
		singleGroupStats, present := e.creatorToSingleGroupStatsMap[cr]
		if pod.Status.Phase == apiv1.PodPending {
			return true
		}
		if present {
			klog.Info("CanEvict pod also present")
			// When configured and evictionTolerance are both 0, shouldBeAlive = 0 - 0,
			// which is why every pod gets evicted for such targets.
			shouldBeAlive := singleGroupStats.configured - singleGroupStats.evictionTolerance
			if singleGroupStats.running-singleGroupStats.evicted > shouldBeAlive {
				klog.Info("singleGroupStats.running-singleGroupStats.evicted > shouldBeAlive")
				return true
			}
			// If all pods are running and eviction tollerance is small evict 1 pod.
			if singleGroupStats.running == singleGroupStats.configured &&
				singleGroupStats.evictionTolerance == 0 &&
				singleGroupStats.evicted == 0 {
				klog.Info("singleGroupStats.evictionTolerance == 0")
				return true
			}
		}
	}
	return false
}
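A worked instance of the singleGroupStats arithmetic above, assuming a ReplicaSet-backed target with 2 replicas and the default eviction-tolerance of 0.5 (names illustrative):

package main

import "fmt"

func main() {
	configured := 2
	evictionToleranceFraction := 0.5
	evictionTolerance := int(float64(configured) * evictionToleranceFraction) // 1
	shouldBeAlive := configured - evictionTolerance                           // 1
	running, evicted := 2, 0
	fmt.Println(running-evicted > shouldBeAlive) // true: one pod may be evicted this cycle

	// For a target whose replica count cannot be read, getReplicaCount returns 0:
	// configured=0, evictionTolerance=0, shouldBeAlive=0-0=0,
	// so running-evicted > 0 always holds and every running pod is evictable.
}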
3.1.6 Pod eviction order
1. If any container wants to grow, the pod takes precedence — pods scaling up are handled first.
2. A pod with a larger resourceDiff takes precedence — bigger diffs are handled first.
func (list byPriority) Swap(i, j int) {
	list[i], list[j] = list[j], list[i]
}

// Less implements reverse ordering by priority (highest priority first).
func (list byPriority) Less(i, j int) bool {
	// 1. If any container wants to grow, the pod takes precedence.
	// TODO: A better policy would be to prioritize scaling down when
	// (a) the pod is pending
	// (b) there is general resource shortage
	// and prioritize scaling up otherwise.
	if list[i].scaleUp != list[j].scaleUp {
		return list[i].scaleUp
	}
	// 2. A pod with larger value of resourceDiff takes precedence.
	return list[i].resourceDiff > list[j].resourceDiff
}
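A short usage sketch of this ordering; podPriority here is a stand-in for the updater's internal priority struct, not the real type:

package main

import (
	"fmt"
	"sort"
)

type podPriority struct {
	name         string
	scaleUp      bool
	resourceDiff float64
}

type byPriority []podPriority

func (l byPriority) Len() int      { return len(l) }
func (l byPriority) Swap(i, j int) { l[i], l[j] = l[j], l[i] }
func (l byPriority) Less(i, j int) bool {
	if l[i].scaleUp != l[j].scaleUp {
		return l[i].scaleUp // scale-ups first
	}
	return l[i].resourceDiff > l[j].resourceDiff // then larger diff first
}

func main() {
	pods := byPriority{
		{"a", false, 0.9},
		{"b", true, 0.2},
		{"c", true, 0.7},
	}
	sort.Sort(pods)
	fmt.Println(pods) // [{c true 0.7} {b true 0.2} {a false 0.9}]
}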
4. admission-controller component
The container's new request is simply recommendation.Target.
The container's new limit = oldLimit * recommend / oldRequest, i.e. the original limit-to-request ratio is preserved.
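The ratio-preserving rule can be illustrated with resource.Quantity milli-arithmetic; a hedged sketch, not the admission controller's actual code (the helper name is made up):

package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/api/resource"
)

// scaledLimit computes newLimit = oldLimit * target / oldRequest,
// preserving the original limit:request ratio.
func scaledLimit(oldLimit, oldRequest, target resource.Quantity) *resource.Quantity {
	milli := oldLimit.MilliValue() * target.MilliValue() / oldRequest.MilliValue()
	return resource.NewMilliQuantity(milli, oldLimit.Format)
}

func main() {
	oldLimit := resource.MustParse("200m")
	oldRequest := resource.MustParse("100m")
	target := resource.MustParse("90m") // recommendation.Target
	fmt.Println(scaledLimit(oldLimit, oldRequest, target)) // 180m: the 2.0 ratio is kept
}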
References
https://cloud.tencent.com/developer/news/841543
https://github.com/kubernetes/design-proposals-archive/blob/main/autoscaling/vertical-pod-autoscaler.md