Karpenter on AWS

AWS-specific Karpenter configuration, optimization, and troubleshooting

Overview

Karpenter on AWS (via EKS) differs significantly from OVH in:

Quota model: Spot/On-Demand quota (not vCPU quota)
Instance types: Hundreds of EC2 instance options
Cost optimization: Spot instances for 70% savings
Interruption handling: Spot termination requires graceful shutdown
Scaling speed: Typically 30-60s for new nodes

AWS NodePool Configuration

Basic Setup

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: workers
spec:
  template:
    spec:
      requirements:
        # Architecture
        - key: kubernetes.io/arch
          operator: In
          values: ["arm64", "amd64"]
        
        # Instance families (compute-optimized for workflows)
        - key: node.kubernetes.io/instance-family
          operator: In
          values: ["c6g", "c6i", "m6g", "m6i"]
        
        # Capacity type: Spot (cost) vs On-Demand (reliability)
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
        
        # No GPU nodes
        - key: karpenter.sh/gpu-count
          operator: In
          values: ["0"]
      
      nodeClassRef:
        name: default
  
  # Hard limits
  limits:
    cpu: 1000
    memory: 4000Gi
  
  # Consolidation: scale down idle nodes
  consolidateAfter: 30s
  consolidationPolicy: WhenEmpty
  
  # Node lifetime: recycle old nodes every 7 days
  expireAfter: 604800s

NodeClass Configuration

apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: default
spec:
  amiFamily: AL2
  role: "KarpenterNodeRole"
  
  # EBS configuration
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        volumeSize: 100Gi
        volumeType: gp3
        iops: 3000
        throughput: 250  # MB/s
        deleteOnTermination: true
        encrypted: true
  
  # Security group (auto-created by CloudFormation)
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: TES
  
  # Subnets (auto-created by CloudFormation)
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: TES
  
  # Tenancy
  tenancy: default
  
  # Tags for cost allocation
  tags:
    Name: "karpenter-worker"
    Cluster: "TES"
    ManagedBy: "Karpenter"

AWS vs OVH Differences

Aspect	AWS	OVH
Quota Mechanism	Spot/On-Demand quota per account	vCPU/RAM quota per project
Quota Enforcement	HTTP 400 on new SpotRequest	HTTP 412 on new instance
Quota Recovery	Automatic (quota applies globally)	Manual request increase
Spot Interruption	AWS terminates nodes (2-min warning)	No equivalent
Instance Selection	300+ types, dynamic pricing	~20 types, fixed pricing
Network	VPC subnets (managed)	Neutron vRack (manual)
Storage	EBS (per-node), EFS (shared)	Cinder (per-node), Manila (shared)
Bin-Packing	Prefers spot (cheaper) + smaller nodes	Largest fitting node

Key Implications

AWS Karpenter strategy:

Prefer Spot instances (70% cheaper) but tolerate interruption
Select smaller instance types for faster provisioning
Use On-Demand for system services (reliable)
Handle Spot termination gracefully

OVH Karpenter strategy:

Filter instance types to avoid oversized flavors
Respect vCPU/RAM quota explicitly
Use only On-Demand (no Spot equivalent)
Rely on consolidation to manage costs

Instance Type Selection

Recommended Families for Genomics

# Compute-optimized (CPU-heavy bioinformatics)
c6g/c6i:
  - c6g.large    # 2 vCPU, 4 GB (Graviton, ARM)
  - c6g.xlarge   # 4 vCPU, 8 GB
  - c6i.2xlarge  # 8 vCPU, 16 GB (Intel)

# General purpose (balanced)
m6g/m6i:
  - m6g.large    # 2 vCPU, 8 GB (Graviton)
  - m6i.xlarge   # 4 vCPU, 16 GB (Intel)
  - m6i.2xlarge  # 8 vCPU, 32 GB

# Memory-optimized (for large reference files in RAM)
r6g/r6i:
  - r6g.xlarge   # 4 vCPU, 32 GB (Graviton)
  - r6i.2xlarge  # 8 vCPU, 64 GB (Intel)

Avoid

# Too small (not cost-effective)
t3.micro, t3.small:
  - High per-vCPU cost
  - Burstable (unpredictable)

# Too large (over-provision)
c6g.4xlarge, m6i.4xlarge:
  - Overkill for most genomics tasks
  - Longer startup time
  - Less flexible bin-packing

# Expensive specialty
p4d, trn1:
  - GPU/TPU (not needed for most workflows)
  - 10x–100x normal cost

# Outdated generations
m5, t2:
  - More expensive than m6i/c6i
  - Slower provisioning

Configuration Example

requirements:
  - key: node.kubernetes.io/instance-family
    operator: In
    values: ["c6g", "c6i", "m6g", "m6i"]  # Whitelist good families
  
  - key: node.kubernetes.io/instance-cpu
    operator: Lt
    values: ["16"]  # Don't select >16 vCPU instances
  
  - key: node.kubernetes.io/instance-memory
    operator: Lt
    values: ["65536"]  # Don't select >64 GB instances

Spot vs On-Demand Strategy

Spot Instances (Cost Optimization)

Pros:

70% cheaper than On-Demand
Ideal for fault-tolerant workloads
Automatic bin-packing across AZs

Cons:

2-minute termination warning
Not suitable for long-running tasks
Availability varies by region/type

Configuration:

requirements:
  - key: karpenter.sh/capacity-type
    operator: In
    values: ["spot", "on-demand"]  # Prefer spot

Pod configuration to tolerate Spot:

apiVersion: v1
kind: Pod
metadata:
  name: task
spec:
  terminationGracePeriodSeconds: 120  # 2 min shutdown
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          preference:
            matchExpressions:
              - key: karpenter.sh/capacity-type
                operator: In
                values: ["on-demand"]  # Prefer on-demand for critical tasks

On-Demand Instances (Reliability)

Use for:

System services (Karpenter, Funnel, Cromwell)
Long-running tasks (>1 hour)
Guaranteed availability

Configuration:

# System NodePool (always on-demand)
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
  limits:
    cpu: 50      # Small, stable system
    memory: 200Gi

Cost Optimization

1. Consolidation (Scale Down Idle Nodes)

consolidateAfter: 30s       # Wait 30s before consolidating
consolidationPolicy: WhenEmpty  # Only consolidate when empty

Effect: Nodes automatically removed 30s after all pods move.

2. Spot Instance Preference

requirements:
  - key: karpenter.sh/capacity-type
    operator: In
    values: ["spot", "on-demand"]  # Spot first, fallback to On-Demand

Effect: 70% cost savings on compute nodes.

3. Right-Sized Instances

Don’t use oversized instances. Example cost per hour:

c6g.large   (2 vCPU, 4GB)   — $0.084/hr (Spot) ~ $0.28/hr (On-Demand)
c6g.xlarge  (4 vCPU, 8GB)   — $0.168/hr (Spot) ~ $0.56/hr (On-Demand)
c6i.2xlarge (8 vCPU, 16GB)  — $0.34/hr (Spot) ~ $1.13/hr (On-Demand)

Running 100 tasks:
- c6g.large ×100:    ~$8.40 (spot)
- c6i.2xlarge ×25:   ~$8.50 (spot, over-provisioned)

→ Use smallest fitting instance type

4. Lifecycle: Expire Old Nodes

expireAfter: 604800s  # 7 days

Recycles nodes to catch security patches, kernel updates.

5. EBS Performance Settings

blockDeviceMappings:
  - ebs:
      volumeType: gp3
      iops: 3000          # Baseline (free tier)
      throughput: 250     # 250 MB/s (included in 3000 IOPS)
      # Higher IOPS/throughput = more cost

Pricing:

gp3: 3000 IOPS + 125 MB/s = $0.10/volume/month
Each 100 additional IOPS: +$0.006/month
Each 1 MB/s above 125: +$0.04/month

AWS Spot Interruption Handling

How Spot Interruption Works

AWS sends 2-minute warning via EC2 Instance Metadata Service
Karpenter drains node (sends SIGTERM to pods)
Workload manager (Cromwell/job controller) catches signal
Task checkpoints or fails gracefully
Node terminated after 2 minutes

Enabling Graceful Shutdown

Funnel task pods:

terminationGracePeriodSeconds: 120  # 2 minutes
lifecycle:
  preStop:
    exec:
      command: ["/bin/sh", "-c", "sleep 5"]  # Allow graceful shutdown

Cromwell jobs:

runtime {
  docker: "my-image:latest"
  cpu: 4
  memory: "8 GB"
  # Cromwell will receive SIGTERM and fail task gracefully
}

Monitoring Interruptions

# Check node drain events
kubectl get events -n karpenter --sort-by='.lastTimestamp' | grep -i drain

# Check CloudWatch for SpotInstanceInterruption warnings
aws ec2 describe-spot-instance-requests --query \
  'SpotInstanceRequests[?Status.Code==`capacity-oversubscribed`]'

Troubleshooting

Issue: Nodes Not Scaling Up

Diagnosis:

kubectl describe nodepool workers
kubectl logs -n karpenter deployment/karpenter | tail -50

Common causes:

Cause	Check	Fix
Spot quota exhausted	`aws service-quotas list-service-quotas --service-code ec2`	Request quota increase or use on-demand
On-Demand quota exhausted	Same as above	Request quota increase
Pod security policy	`kubectl get psp`	Adjust PSP or use Pod Security Standards
Subnet capacity	`aws ec2 describe-subnets`	Add more subnets or expand CIDR
Instance type unavailable	`aws ec2 describe-spot-price-history`	Switch to different instance family

Issue: Spot Interruption Too Frequent

Check interruption rate:

aws ec2 describe-spot-instance-request-history --region us-east-1 \
  --query 'SpotInstanceRequestHistory[?EventCode==`instance-terminated-capacity-oversubscribed`]' | wc -l

Solutions:

Switch to less-popular instance types (lower interruption rate)
Use On-Demand instead of Spot for critical workloads
Enable Capacity Rebalancing (cost extra but reduces interruption)

Issue: Slow Node Provisioning

Target: <1 minute from pod creation to running

Check:

kubectl get events | grep -i provision
kubectl logs -n karpenter deployment/karpenter | grep -i latency

Improvements:

Use smaller instances (faster to provision)
Warm node pool (pre-provision standby nodes)
Switch to Spot (typically faster than On-Demand)

Installation Guide — AWS EKS setup
Cost & Capacity — Budget planning
Troubleshooting — Detailed issue resolution
Karpenter Docs — Official Karpenter documentation
AWS EC2 Pricing — Pricing calculator

Last Updated: March 13, 2026
Version: 1.0

Cromwell + TES on Kubernetes

Step-by-step guides to deploy Cromwell and Funnel TES on managed Kubernetes — AWS EKS and OVHcloud MKS — with Karpenter autoscaling, EFS/NFS shared storage, and S3 object storage.

Karpenter on AWS

Overview

AWS NodePool Configuration

Basic Setup

NodeClass Configuration

AWS vs OVH Differences

Key Implications

Instance Type Selection

Recommended Families for Genomics

Avoid

Configuration Example

Spot vs On-Demand Strategy

Spot Instances (Cost Optimization)

On-Demand Instances (Reliability)

Cost Optimization

1. Consolidation (Scale Down Idle Nodes)

2. Spot Instance Preference

3. Right-Sized Instances

4. Lifecycle: Expire Old Nodes

5. EBS Performance Settings

AWS Spot Interruption Handling

How Spot Interruption Works

Enabling Graceful Shutdown

Monitoring Interruptions

Troubleshooting

Issue: Nodes Not Scaling Up

Issue: Spot Interruption Too Frequent

Issue: Slow Node Provisioning

Karpenter on AWS

Overview

AWS NodePool Configuration

Basic Setup

NodeClass Configuration

AWS vs OVH Differences

Key Implications

Instance Type Selection

Recommended Families for Genomics

Avoid

Configuration Example

Spot vs On-Demand Strategy

Spot Instances (Cost Optimization)

On-Demand Instances (Reliability)

Cost Optimization

1. Consolidation (Scale Down Idle Nodes)

2. Spot Instance Preference

3. Right-Sized Instances

4. Lifecycle: Expire Old Nodes

5. EBS Performance Settings

AWS Spot Interruption Handling

How Spot Interruption Works

Enabling Graceful Shutdown

Monitoring Interruptions

Troubleshooting

Issue: Nodes Not Scaling Up

Issue: Spot Interruption Too Frequent

Issue: Slow Node Provisioning

Related Documentation