
Running out of Private IPs in EKS

 


There are cases where pods stay stuck in the Pending state even though the cluster has not run
out of CPU or memory.
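
To see what is actually blocking scheduling, it helps to look at the events of a Pending pod first. A minimal check, assuming the workloads live in the test namespace used later in this post (the pod name is a placeholder):

# List the pods that are stuck in Pending
kubectl get pods -n test --field-selector=status.phase=Pending

# The Events section shows why the scheduler cannot place the pod
kubectl describe pod <pending-pod-name> -n test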

One possible cause is a shortage of private IP addresses. In that case the node group is shown
as "Degraded" in the EKS cluster configuration, and you can see the following error under Health issues:

"Amazon Autoscaling was unable to launch instances because there are not enough free addresses
in the subnet associated with your AutoScaling group(s)."

You will also see that the number of "Available IPv4 addresses" of the VPC subnet used by the
node group is 0.
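
The same number can be read from the CLI; a quick sketch, assuming you know the subnet ID used by the node group (the ID below is a placeholder):

# How many free IPv4 addresses are left in the subnet
aws ec2 describe-subnets \
  --subnet-ids subnet-0123456789abcdef0 \
  --query 'Subnets[].AvailableIpAddressCount'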

By limiting how many spare IP addresses the VPC CNI plugin (the aws-node daemonset) keeps reserved on each node, you can get some IPs back.


kubectl set env -n kube-system daemonset/aws-node MINIMUM_IP_TARGET=10 WARM_IP_TARGET=2
kubectl get daemonset -n kube-system aws-node -o json | jq -r '.spec.template.spec.containers[] | select(.name == "aws-node").env'
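
Changing these environment variables triggers a rolling restart of the aws-node pods. A small sketch to wait for the rollout and confirm the pods are healthy again (k8s-app=aws-node is the label used by the standard VPC CNI manifest):

kubectl rollout status daemonset/aws-node -n kube-system
kubectl get pods -n kube-system -l k8s-app=aws-node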


Once the change has rolled out, you can see that the number of "Available IPv4 addresses" in the VPC subnet has increased.

If the IPs are still not enough, consider two approaches.

1. Check the HPA status and adjust it appropriately, to avoid the case where too many pods are
created because the CPU and memory allocated to the application are too small (see the sketch after the output below).

kubectl get hpa -n test
NAME         REFERENCE               TARGETS           MINPODS   MAXPODS   REPLICAS   AGE
test-nginx   Deployment/test-nginx   10%/80%, 9%/80%   7         200       7          14d
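
If the subnet simply cannot hold maxReplicas pods, one quick mitigation is to lower that ceiling directly; a sketch with an illustrative value:

kubectl patch hpa test-nginx -n test --type merge -p '{"spec":{"maxReplicas":100}}'

The more durable fix is to raise the per-pod requests, as in the manifests below, so that each replica handles more load and fewer replicas (and therefore fewer IPs) are needed.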

---

apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: test-nginx
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: test-nginx
  minReplicas: 7
  maxReplicas: 200
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 80   # => adjust this target
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80   # => adjust this target


---

apiVersion: apps/v1
kind: Deployment
metadata:
  name: test-nginx
spec:
  selector:
    matchLabels:
      app: test-nginx
  replicas: 7
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 50%
      maxUnavailable: 50%
  template:
    metadata:
      labels:
        app: test-nginx
    spec:
      containers:
        - name: nginx
          image: nginx   # placeholder image
          imagePullPolicy: Always
          resources:
            requests:
              memory: "200Mi"   # => raise the request
              cpu: "100m"       # => raise the request
            limits:
              memory: "1Gi"
              cpu: "500m"
      nodeSelector:
        team: test
        environment: prod
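
Each running pod (other than host-network pods) takes one secondary IP from the node's subnet, so counting the pods scheduled on a node gives a rough idea of how many addresses a node group consumes. A quick sketch (the node name is a placeholder):

# Pods on one node; most of them hold one subnet IP each
kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName=ip-10-0-1-23.ec2.internal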

2. If IPs are still not enough even after tuning, add a node group that is assigned a different
subnet. (The snippet below is Terraform; a second node group, test2, is added on another private subnet.)

test = {
  desired_capacity = 5
  max_capacity     = 15
  min_capacity     = 4
  subnets          = [element(module.vpc.private_subnets, 0)]
  disk_size        = 30
  k8s_labels = {
    team        = "test"
    environment = "prod"
  }
},

test2 = {   # => additional node group on a different subnet
  desired_capacity = 5
  max_capacity     = 15
  min_capacity     = 4
  subnets          = [element(module.vpc.private_subnets, 4)]
  disk_size        = 30
  k8s_labels = {
    team        = "test"
    environment = "prod"
  }
},
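
Once the new node group is up, the INTERNAL-IP column shows which nodes landed in the new subnet's CIDR range, and the labels from k8s_labels can be used to filter:

kubectl get nodes -l team=test,environment=prod -o wide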

