AWS EKS Installation Guide
Deploying Cromwell + Funnel TES on AWS EKS with Karpenter
NOTE: This is a stub page — contents will be revised after production validation on eu-north-1. The architecture and phase outline below reflect the intended deployment plan.
Architecture Overview
AWS (eu-north-1 / Stockholm, 3 AZ)
┌────────────────────────────────────────────────────┐
│ VPC (10.0.0.0/16) │
│ ├─ Public subnets (ALB, NAT Gateway) │
│ └─ Private subnets (EKS nodes, EFS mount targets) │
└────────────────────────────────────────────────────┘
│
├─ EKS Cluster (Kubernetes 1.34)
│ ├─ System Node (t4g.medium, always-on)
│ │ ├─ Karpenter Controller
│ │ └─ Funnel Server
│ └─ Worker Nodes (Karpenter-managed, Spot)
│ └─ Task Pods (on demand)
│
├─ EFS (Elastic File System)
│ └─ Multi-AZ shared storage, NFS-mounted
│
├─ EBS volumes (per task)
│ └─ gp3, auto-provisioned by Karpenter
│
├─ ECR (Elastic Container Registry)
│ └─ Private container image registry
│
└─ S3
└─ Task I/O, workflow inputs/outputs, cold archive
Key Technologies
| Technology | Role | Details |
|---|---|---|
| EKS | Kubernetes cluster | Managed service, $0.10/hr control plane |
| Karpenter | Auto-scaling | Scales workers on demand, Spot support |
| EFS | Shared storage | Multi-AZ NFS, CSI driver |
| EBS | Task local storage | gp3 volumes, auto-provisioned |
| S3 | Object storage | Workflow I/O, cold archive |
| ECR | Container registry | Private registry for task images |
| Funnel | TES orchestrator | Kubernetes-native TES API server |
| Cromwell | Workflow manager | WDL submission, runs on system node |
Deployment Phases (Planned)
| Phase | Description | Estimated Time |
|---|---|---|
| Phase 0 | Prerequisites: AWS account, IAM, quotas, tools | 15 min |
| Phase 1 | VPC + EKS cluster creation (eksctl / CloudFormation) | 25–35 min |
| Phase 2 | Karpenter autoscaler installation | 10 min |
| Phase 3 | EFS shared storage + CSI driver | 10 min |
| Phase 4 | S3 configuration + IAM policies | 10 min |
| Phase 5 | ECR registry setup + image push | 10 min |
| Phase 6 | Funnel TES deployment | 10 min |
| Phase 7 | Cromwell integration + verification | 15 min |
Estimated total: ~90–120 minutes
(P)rerequisites
P.1 Micromamba / Conda
Install Micromamba or Conda and create a dedicated `aws` environment. This keeps all tooling and configuration localised and isolated from the system Python.
# Create environment (one-time)
micromamba create -n aws
micromamba activate aws
micromamba install \
-c conda-forge \
-c defaults \
python=3
⭐ Keep this env active during the install procedure!
P.2 AWS CLI v2
The AWS CLI is the primary tool for interacting with all AWS services (EC2, EKS, EFS, S3, IAM, Service Quotas, SSM). Install it as a standalone binary into the conda env:
BIN_DIR=$(dirname $(which python3))
mkdir -p awscli_release && cd awscli_release
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip awscliv2.zip
./aws/install --bin-dir "$BIN_DIR" --install-dir "$BIN_DIR/../lib/awscli"
cd ..
Verify:
aws --version
# aws-cli/2.x.x Python/3.x.x Linux/...
P.3 eksctl
eksctl is the official CLI for creating and managing EKS clusters and node groups. Install as a standalone binary:
BIN_DIR=$(dirname $(which python3))
mkdir -p eksctl_release && cd eksctl_release
# pick version: https://github.com/eksctl-io/eksctl/releases
EKSCTL_VERSION="0.224.0"
curl -sLO "https://github.com/eksctl-io/eksctl/releases/download/v${EKSCTL_VERSION}/eksctl_Linux_amd64.tar.gz"
tar -zxvf eksctl_Linux_amd64.tar.gz
mv eksctl "$BIN_DIR"
cd ..
Verify:
eksctl version
# 0.224.0
Minimum version: 0.224.0 — the Karpenter subnet and security-group discovery tags (`karpenter.sh/discovery`) were auto-applied but silently broken in earlier releases; the fix shipped in v0.224.0 (#8684). Older versions can cause Karpenter to fail to provision nodes.
P.4 kubectl
kubectl is the Kubernetes CLI. Install the latest stable release into the conda env:
BIN_DIR=$(dirname $(which python3))
mkdir -p kubectl_release && cd kubectl_release
curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
chmod +x kubectl
mv kubectl "$BIN_DIR"
cd ..
Verify:
kubectl version --client
# Client Version: v1.3x.x
P.5 helm
Helm is the Kubernetes package manager, used to deploy Karpenter and the EFS CSI driver. Install the latest v4 release:
BIN_DIR=$(dirname $(which python3))
mkdir -p helm_release && cd helm_release
# pick version: https://github.com/helm/helm/releases
wget https://get.helm.sh/helm-v4.1.3-linux-amd64.tar.gz
tar -zxvf helm-v4.1.3-linux-amd64.tar.gz
mv linux-amd64/helm "$BIN_DIR"
cd ..
Verify:
helm version
# version.BuildInfo{Version:"v4.1.3", ...}
Helm 4 vs Helm 3 — what changed for this installer:
- `helm upgrade --install` now defaults to server-side apply (SSA) for fresh installs. For Karpenter and the ALB controller this is transparent and improves conflict handling. Re-runs (upgrade path) latch to the previous apply method automatically.
- `--atomic` is renamed `--rollback-on-failure`; `--force` is renamed `--force-replace`. The old flags still work but emit deprecation warnings. Neither flag is used in this installer, so no action is needed.
- `helm registry login`/`logout` must use the bare domain name only (no path). The installer already calls `helm registry logout public.ecr.aws` — correct as-is.
- Existing Helm v2 `apiVersion: v1` charts continue to install unchanged.
- `helm version` output format is identical to v3 (`version.BuildInfo{...}`).
P.6 gettext (envsubst)
The installer uses envsubst (from the gettext package) to render YAML and policy templates before applying them. Install via conda:
micromamba install -c conda-forge gettext
Verify:
envsubst --version
# envsubst (GNU gettext-runtime) ...
P.7 python3
The instance-type filtering script (update-nodepool-types.sh) is a Python 3 script embedded in the installer. It only uses the standard library — no extra pip packages are needed. python3 is already present from step P.1.
python3 --version
# Python 3.x.x
P.8 AWS Account & IAM Credentials
Account requirements
- AWS account with billing enabled in the target region (`eu-north-1` by default)
- IAM user or role with sufficient permissions (see table below)
- Programmatic access via Access Key + Secret Key, EC2 instance profile, or AWS SSO
Required IAM permissions
The deploying identity needs the following AWS-managed policies (or an equivalent custom inline policy):
| Policy | Needed for |
|---|---|
| `AmazonEKSClusterPolicy` + `AmazonEKSServicePolicy` | EKS cluster creation |
| `AmazonEC2FullAccess` | EC2 instances, VPC, security groups, AMI lookup |
| `IAMFullAccess` | Create Karpenter node role + IRSA roles/policies |
| `AWSCloudFormationFullAccess` | Karpenter prerequisite CloudFormation stack |
| `AmazonS3FullAccess` | Task I/O bucket creation and access |
| `AmazonElasticFileSystemFullAccess` | EFS filesystem + mount target creation |
| `AmazonSSMReadOnlyAccess` | AMI ID lookup via SSM parameter store |
| `ServiceQuotasReadOnlyAccess` | Spot vCPU quota auto-detection |
| `AmazonEC2ContainerRegistryFullAccess` | Image push to ECR during build phase |
⭐ A minimal scoped inline policy covering only the resources created by this installer is available at policies/iam-installer-policy.json.
Deploying identity vs runtime identities — these are separate. The permissions above are only needed by the person (or CI/CD pipeline) running the installer — a one-time operation. They are not embedded in the cluster.
Once the cluster is running, every component operates under its own purpose-built, minimally-scoped IAM role created by the installer:
| Runtime role | Used by | Scope |
|---|---|---|
| `KarpenterNodeRole-${CLUSTER_NAME}` | Worker EC2 nodes (instance profile) | ECR pull, EKS node join, SSM, EBS/EFS access |
| `${CLUSTER_NAME}-karpenter` | Karpenter controller pod (Pod Identity / IRSA) | EC2 run/terminate, SQS interruption queue — cluster-scoped |
| `${CLUSTER_NAME}-iam-role` | Funnel/TES pods (IRSA) | S3 read/write on the task bucket only |
| ALB controller role | ALB controller pod (IRSA) | ELB/EC2 management, cluster-scoped |

Your admin credentials are not stored in the cluster and are not used after the install completes.
Configure credentials
aws configure
# AWS Access Key ID: <your-access-key>
# AWS Secret Access Key: <your-secret-key>
# Default region name: eu-north-1
# Default output format: json
Verify:
aws sts get-caller-identity
# { "Account": "123456789012", "UserId": "AIDA...", "Arn": "arn:aws:iam::..." }
P.9 Quotas & Capacity
Before starting, ensure your AWS account has sufficient quota in the target region. Check via AWS Console → Service Quotas → EC2, or with the CLI:
# Standard Spot Instance vCPU quota
aws service-quotas list-service-quotas \
--service-code ec2 \
--region eu-north-1 \
--query "Quotas[?QuotaName=='All Standard (A, C, D, H, I, M, R, T, Z) Spot Instance Requests'].{Name:QuotaName,Value:Value}" \
--output table
# On-Demand vCPU quota (for system node)
aws service-quotas list-service-quotas \
--service-code ec2 \
--region eu-north-1 \
--query "Quotas[?QuotaName=='Running On-Demand Standard (A, C, D, H, I, M, R, T, Z) instances'].{Name:QuotaName,Value:Value}" \
--output table
| Resource | Minimal | Notes |
|---|---|---|
| Spot vCPU (Standard family) | 100+ | Karpenter worker pool; capped by `SPOT_QUOTA` in `env.variables` (auto-detected from Service Quotas if blank) |
| On-Demand vCPU (Standard family) | 4+ | System node (t4g.medium = 2 vCPU; keep headroom) |
| EBS gp3 storage (GB) | 500+ | Auto-provisioned per-task local volumes |
| EFS filesystems | 1 | Shared /mnt/efs across all worker nodes |
| VPCs | 1 | Created by CloudFormation prerequisite stack |
| Elastic IPs | 1 | NAT Gateway for private subnet outbound access |
| NAT Gateways | 1 | Outbound internet from private subnets |
| S3 buckets | 1 | Task I/O and workflow logs |
| ECR repositories | 1 | Funnel worker container image |
Request increases via AWS Console → Service Quotas → Request increase. Spot vCPU quota increases are typically approved within minutes; EBS and EIP increases within 1–2 hours.
(E)nvironment Setup
E.1: Download Installer
mkdir -p aws_installer
cd aws_installer
wget https://geertvandeweyer.github.io/aws/files/aws_installer.tar.gz
tar -xzf aws_installer.tar.gz
cd installer
E.2: Configure Environment Variables
Edit env.variables. Review and complete all settings — no shell logic belongs here, just plain values. Some variables requiring attention are highlighted below:
| Variable Name | Notes |
|---|---|
| `CLUSTER_NAME` | Name for the EKS cluster (used as prefix for all created resources) |
| `AWS_DEFAULT_REGION` | Target AWS region (e.g. `eu-north-1`) |
| `K8S_VERSION` | Tested with 1.34 |
| `SYSTEM_NODE_TYPE` | Always-on bootstrap node; `t4g.medium` (ARM64) keeps the permanent cost minimal |
| `WORKER_INSTANCE_FAMILIES` | Comma-separated EC2 category letters: c=compute, m=general, r=memory, i=storage |
| `WORKER_MIN_GENERATION` | Minimum instance generation (e.g. 3 → c3+, m3+, r3+) |
| `WORKER_EXCLUDE_TYPES` | Comma-separated substrings to disqualify types (e.g. `metal,nano,micro,small,flex`) |
| `WORKER_MAX_VCPU` | Per-instance vCPU cap (0 = no cap) |
| `WORKER_MAX_RAM_GIB` | Per-instance RAM cap in GiB (0 = no cap) |
| `WORKER_ARCH` | `amd64` / `arm64` / `graviton` / `both` |
| `WORKER_CPU_VENDOR` | `intel` / `amd` / `both` (only relevant when `WORKER_ARCH=amd64`) |
| `SPOT_QUOTA` | Spot vCPU limit; auto-detected from Service Quotas if blank; pre-fill to cap below your raw quota |
| `ALIAS_VERSION` | AL2023 AMI alias version tag (e.g. `v20260223`); auto-detected from SSM if blank; pre-fill to pin |
| `USE_EFS` | `true` to provision and mount EFS shared storage on all worker nodes |
| `EFS_ID` | Leave blank on first run; the installer creates the filesystem and writes back the ID |
| `ECR_IMAGE_REGION` | AWS region where the Funnel ECR image is stored (may differ from the cluster region) |
| `TES_VERSION` | Funnel image tag |
| `EXTERNAL_IP` | IP of the on-prem Cromwell server; only this IP gets inbound access to the TES endpoint |
| `READ_BUCKETS` | Additional S3 buckets worker tasks may read from (wildcards `*` allowed) |
| `WRITE_BUCKETS` | Additional S3 buckets worker tasks may write to |
| `EBS_IOPS` | gp3 IOPS for worker data disk (3,000–80,000; 3,000 = free baseline) |
| `EBS_THROUGHPUT` | gp3 throughput in MB/s (125–1,000; 250 = above baseline, small extra cost) |
Auto-derived values — the following variables can be left blank and will be resolved by the installer at runtime. Pre-fill them to skip the live lookup or to override the derived value:
| Variable | Resolved from |
|---|---|
| `AWS_ACCOUNT_ID` | `aws sts get-caller-identity` |
| `FUNNEL_IMAGE` | `${AWS_ACCOUNT_ID}.dkr.ecr.${ECR_IMAGE_REGION}.amazonaws.com/funnel:${TES_VERSION}` |
| `TES_S3_BUCKET` | `tes-tasks-${AWS_ACCOUNT_ID}-${AWS_DEFAULT_REGION}` |
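For orientation, an illustrative `env.variables` fragment is shown below — every value is an example chosen for this sketch, not a shipped default:

```shell
# Illustrative env.variables fragment — plain KEY=value lines, no shell logic.
CLUSTER_NAME=TES
AWS_DEFAULT_REGION=eu-north-1
K8S_VERSION=1.34
SYSTEM_NODE_TYPE=t4g.medium
WORKER_INSTANCE_FAMILIES=c,m,r
WORKER_MIN_GENERATION=3
WORKER_EXCLUDE_TYPES=metal,nano,micro,small,flex
WORKER_MAX_VCPU=0
WORKER_MAX_RAM_GIB=0
WORKER_ARCH=amd64
SPOT_QUOTA=          # blank: auto-detected from Service Quotas
USE_EFS=true
EFS_ID=              # blank on first run; written back by the installer
EXTERNAL_IP=203.0.113.10
EBS_IOPS=3000
EBS_THROUGHPUT=250
```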
E.3: Verify Prerequisites
# Check all tools are available
which aws eksctl kubectl helm envsubst python3
# Check AWS credentials
aws sts get-caller-identity
# { "Account": "123456789012", "UserId": "AIDA...", "Arn": "arn:aws:iam::..." }
# Check target region is accessible
aws ec2 describe-availability-zones --region "${AWS_DEFAULT_REGION}" \
--query 'AvailabilityZones[].ZoneName' --output text
# Check Spot vCPU quota
aws service-quotas list-service-quotas --service-code ec2 \
--query "Quotas[?QuotaName=='All Standard (A, C, D, H, I, M, R, T, Z) Spot Instance Requests'].Value" \
--output text
(D)eploy Cluster
What Happens
The installer orchestrates all setup in ordered phases (0–7) to provision AWS infrastructure, deploy Karpenter, configure storage, and deploy Funnel TES.
- Phase 0: Load `env.variables`, derive blank variables (`AWS_ACCOUNT_ID`, `ALIAS_VERSION`, `SPOT_QUOTA`), validate tools and credentials
- Phase 1: Deploy Karpenter prerequisite CloudFormation stack (IAM roles, SQS interruption queue)
- Phase 2: Create EKS cluster via `eksctl` from CloudFormation template, wait for `ACTIVE`, label bootstrap node
- Phase 3: Attach IAM policies to the Karpenter node role (EBS CSI + optional EFS CSI)
- Phase 4: Install Karpenter via Helm (OCI chart from public ECR), verify IRSA/Pod Identity
- Phase 4.1: Apply `EC2NodeClass` (rendered from template + injected userdata), then generate and apply `NodePool` via `update-nodepool-types.sh`
- Phase 5: Create EFS filesystem, deploy EFS CSI addon, configure security groups and mount targets; write back `EFS_ID` to `env.variables`
- Phase 6: Deploy AWS Load Balancer Controller via Helm
- Phase 7: Create Funnel IAM role (IRSA), create S3 task bucket, deploy all Funnel resources from YAML templates
Execution
cd installer/
./install-aws-eks.sh
The script prints coloured status lines (✅ / ⚠ / 💥) for each phase step and stops on the first unrecoverable error. All behaviour is driven by env.variables — re-runs are safe and idempotent.
NodePool only — to regenerate the Karpenter `NodePool` after changing instance family or quota settings without re-running the full installer:
./update-nodepool-types.sh
Phase 1: CloudFormation Prerequisites
Goal
Deploy the Karpenter prerequisite IAM and eventing infrastructure as a CloudFormation stack before the EKS cluster is created.
What Gets Created
- IAM role `KarpenterNodeRole-${CLUSTER_NAME}` — instance profile for all Karpenter-provisioned worker nodes
- Five `KarpenterController*` IAM policies scoped to the cluster — attached to the Karpenter controller role in Phase 4
- SQS interruption queue named `${CLUSTER_NAME}` — receives Spot interruption and rebalance notices
- EventBridge rules that forward EC2 Spot interruption, rebalance, health, and state-change events into the queue
Expected Output
============================================
Phase 1: CloudFormation prerequisites
============================================
Deploying CloudFormation stack for Karpenter prerequisites...
1/40 : Stack status: CREATE_IN_PROGRESS — waiting 15s...
2/40 : Stack status: CREATE_IN_PROGRESS — waiting 15s...
...
✅ CloudFormation stack is CREATE_COMPLETE
Common Issues
- `ROLLBACK_COMPLETE` — usually a missing permission on the deploying IAM identity (`IAMFullAccess` or equivalent required)
- `CAPABILITY_NAMED_IAM` not passed — the stack creates named IAM roles; the CLI flag is required
- Stack already in `ROLLBACK_IN_PROGRESS` from a previous failed run — delete the stack manually before retrying: `aws cloudformation delete-stack --stack-name "EKS-${CLUSTER_NAME}" --region "${AWS_DEFAULT_REGION}"`
Manual Verification
# Check stack status
aws cloudformation describe-stacks \
--stack-name "EKS-${CLUSTER_NAME}" --region "${AWS_DEFAULT_REGION}" \
--query "Stacks[0].StackStatus" --output text
# Expected: CREATE_COMPLETE
# Confirm KarpenterNodeRole exists
aws iam get-role --role-name "KarpenterNodeRole-${CLUSTER_NAME}" \
--query "Role.Arn" --output text
# Confirm SQS queue exists
aws sqs get-queue-url --queue-name "${CLUSTER_NAME}" \
--region "${AWS_DEFAULT_REGION}" --query "QueueUrl" --output text
✅ Phase 1 Checklist
- CloudFormation stack `EKS-${CLUSTER_NAME}` is `CREATE_COMPLETE`
- IAM role `KarpenterNodeRole-${CLUSTER_NAME}` exists
- SQS queue `${CLUSTER_NAME}` exists in the target region
Phase 2: EKS Cluster
Goal
Create the EKS control plane and a single always-on ARM64 baseline node via eksctl. This node hosts Karpenter and the Funnel server; all workflow work runs on Karpenter-provisioned Spot nodes.
What Gets Created
- EKS cluster `${CLUSTER_NAME}` (Kubernetes `${K8S_VERSION}`) with OIDC provider
- Managed nodegroup `${CLUSTER_NAME}-baseline-arm` — 1× `${SYSTEM_NODE_TYPE}` (ARM64, always-on)
- Pod Identity Association for the Karpenter controller SA → `${CLUSTER_NAME}-karpenter` IAM role
- IAM identity mapping so `KarpenterNodeRole-${CLUSTER_NAME}` nodes can join the cluster
- Addons: `eks-pod-identity-agent`, `vpc-cni`
- Node labels: `${BOOTSTRAP_LABEL_KEY}=true`, `workload-type=system`
- VPC subnets tagged `karpenter.sh/discovery=${CLUSTER_NAME}` and `kubernetes.io/cluster/${CLUSTER_NAME}=owned`
- kubeconfig updated at `~/.kube/config`
Expected Output
============================================
Phase 2: EKS cluster
============================================
Rendering cluster configuration...
Creating EKS cluster 'TES' in eu-north-1 (K8s 1.34)...
2026-03-29 ... creating EKS cluster "TES" in "eu-north-1" region ...
2026-03-29 ... creating managed nodegroup "TES-baseline-arm" ...
2026-03-29 ... EKS cluster "TES" in "eu-north-1" region is ready
Waiting for EKS cluster to reach status 'ACTIVE' (timeout 1800s)...
[0s/1800s] EKS cluster status: CREATING — waiting...
...
✅ EKS cluster is ACTIVE
✅ Cluster endpoint: https://ABCD1234.gr7.eu-north-1.eks.amazonaws.com
✅ VPC ID: vpc-0abc1234
Labeling bootstrap nodes: karpenter.io/bootstrap=true, workload-type=system
✅ Subnet tags applied
Duration: ~15–20 minutes.
`eksctl` polls internally; the `wait_for_status` call afterwards is a belt-and-suspenders check.
Common Issues
- `eksctl create cluster` hangs or fails — check CloudFormation events for the `eksctl`-managed stack: `aws cloudformation describe-stack-events --stack-name "eksctl-${CLUSTER_NAME}-cluster" --region "${AWS_DEFAULT_REGION}" --output table`
- VPC quota exhausted — default limit is 5 VPCs per region; request an increase or delete unused VPCs
- Nodegroup stuck `CREATE_FAILED` — typically insufficient EC2 quota for `${SYSTEM_NODE_TYPE}` on-demand instances
- `cluster.yaml` rendered with blank variables — means `env.variables` was not sourced before calling `envsubst`
Manual Verification
# Check cluster status
aws eks describe-cluster --name "${CLUSTER_NAME}" --region "${AWS_DEFAULT_REGION}" \
--query "cluster.status" --output text
# Expected: ACTIVE
# Check node is Ready and labelled
kubectl get nodes --show-labels
# ip-10-0-x-x.eu-north-1.compute.internal Ready <none> ... karpenter.io/bootstrap=true,...
# Check addons
aws eks list-addons --cluster-name "${CLUSTER_NAME}" --region "${AWS_DEFAULT_REGION}" \
--output table
# Check kubeconfig
kubectl cluster-info
✅ Phase 2 Checklist
- EKS cluster status is `ACTIVE` (`aws eks describe-cluster ...`)
- Bootstrap node is `Ready` (`kubectl get nodes`)
- Bootstrap node has label `karpenter.io/bootstrap=true`
- `kubectl cluster-info` connects successfully
Phase 3: Node IAM Policies
Goal
Attach the additional inline IAM policies that worker nodes need for EBS autoscale script downloads from S3 and (optionally) EFS mount access. The base KarpenterNodeRole created in Phase 1 covers EC2/ECR/EKS join; these policies add the storage-specific permissions.
What Gets Created
- Inline policy `EBSAutoscaleAndArtifactsPolicy` on `KarpenterNodeRole-${CLUSTER_NAME}` — allows nodes to download autoscale scripts from `${ARTIFACTS_S3_BUCKET}` and call EC2 autoscale APIs
- Inline policy `EFSClientPolicy` on `KarpenterNodeRole-${CLUSTER_NAME}` — allows `elasticfilesystem:ClientMount`/`ClientWrite`/`DescribeMountTargets` (only when `USE_EFS=true`)
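In sketch form, the `EFSClientPolicy` document looks roughly like the following. This is a minimal reconstruction from the actions listed above, not the installer's exact file — in particular, the installer may scope `Resource` to the filesystem ARN rather than `*`:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "elasticfilesystem:ClientMount",
        "elasticfilesystem:ClientWrite",
        "elasticfilesystem:DescribeMountTargets"
      ],
      "Resource": "*"
    }
  ]
}
```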
Expected Output
============================================
Phase 3: Node IAM policies
============================================
Rendering EBS autoscale policy...
✅ EBSAutoscaleAndArtifactsPolicy attached
✅ EFSClientPolicy attached
Common Issues
- Policy document render fails — check that `ARTIFACTS_S3_BUCKET` is set in `env.variables`
- `NoSuchEntityException` on `put-role-policy` — the CloudFormation stack (Phase 1) did not complete; the role does not exist yet
Manual Verification
# List inline policies on the node role
aws iam list-role-policies \
--role-name "KarpenterNodeRole-${CLUSTER_NAME}" \
--output text
# Expected: EBSAutoscaleAndArtifactsPolicy (+ EFSClientPolicy if USE_EFS=true)
✅ Phase 3 Checklist
- `EBSAutoscaleAndArtifactsPolicy` is attached to `KarpenterNodeRole-${CLUSTER_NAME}`
- `EFSClientPolicy` is attached when `USE_EFS=true`
Phase 4: Karpenter Controller
Goal
Install the Karpenter controller (Helm, OCI chart from public ECR) onto the bootstrap node, wire its IAM role via IRSA/Pod Identity, and verify it is ready to provision worker nodes.
What Gets Created
- Karpenter Helm release in namespace `${KARPENTER_NAMESPACE}` — `${KARPENTER_REPLICAS}` replica(s)
- Controller pinned to the bootstrap node via `nodeSelector: ${BOOTSTRAP_LABEL_KEY}=true`
- IRSA annotation on the `karpenter` ServiceAccount → `arn:aws:iam::${AWS_ACCOUNT_ID}:role/${CLUSTER_NAME}-karpenter`
- Five `KarpenterController*` policies verified as attached to the controller role
- Funnel namespace `${TES_NAMESPACE}` (created early, needed by later phases)
Expected Output
============================================
Phase 4: Karpenter controller
============================================
Tagging subnets for Karpenter and ALB discovery...
✅ Subnet tags applied
Creating namespace funnel...
Release "karpenter" does not exist. Installing it now.
NAME: karpenter
LAST DEPLOYED: Sun Mar 29 12:00:00 2026
NAMESPACE: kube-system
STATUS: deployed
✅ Karpenter controller installed
✅ Karpenter SA annotation: arn:aws:iam::123456789012:role/TES-karpenter
✓ KarpenterControllerInterruptionPolicy-TES attached
✓ KarpenterControllerNodeLifecyclePolicy-TES attached
✓ KarpenterControllerIAMIntegrationPolicy-TES attached
✓ KarpenterControllerEKSIntegrationPolicy-TES attached
✓ KarpenterControllerResourceDiscoveryPolicy-TES attached
✅ Karpenter controller verified
Common Issues
- Karpenter pod `CrashLoopBackOff` — IRSA annotation missing or wrong; check `kubectl -n kube-system describe sa karpenter`
- `no subnets found` in Karpenter logs — subnet tags did not propagate; wait 1–2 min and restart the controller: `kubectl -n kube-system rollout restart deployment karpenter`
- Helm pull fails with `401 Unauthorized` — stale public ECR credentials; the installer calls `helm registry logout public.ecr.aws` first, but retry if needed
- Pod stays `Pending` — bootstrap node is not yet Ready, or taint/nodeSelector mismatch
Manual Verification
# Check Karpenter pod is Running
kubectl get pods -n kube-system -l app.kubernetes.io/name=karpenter
# NAME READY STATUS RESTARTS AGE
# karpenter-xxxx-yyyy 1/1 Running 0 2m
# Check IRSA annotation
kubectl -n kube-system get sa karpenter \
-o jsonpath='{.metadata.annotations.eks\.amazonaws\.com/role-arn}'
# arn:aws:iam::123456789012:role/TES-karpenter
# Check logs for errors
kubectl -n kube-system logs -l app.kubernetes.io/name=karpenter --since=5m | grep -i error
✅ Phase 4 Checklist
- Karpenter pod is `Running` in `${KARPENTER_NAMESPACE}`
- `karpenter` ServiceAccount has the correct IRSA annotation
- All five controller policies are attached to the role
- No `error` lines in Karpenter logs
Phase 4.1: Karpenter EC2NodeClass & NodePool
Goal
Define how worker nodes are launched (EC2NodeClass) and what workloads they accept + how many (NodePool). The NodePool’s instance-type list is computed at install time from the live EC2 Spot catalog filtered by env.variables settings.
What Happens
This phase:
- Renders `userdata/workload-node.template.sh` (EBS autoscale setup script for worker nodes)
- Injects the rendered userdata into `yamls/karpenter-nodeclass.template.yaml` and applies `EC2NodeClass workload`
- Calls `update-nodepool-types.sh`, which:
  - Queries `aws ec2 describe-instance-types --filters "Name=supported-usage-class,Values=spot"`
  - Filters by `WORKER_INSTANCE_FAMILIES`, `WORKER_MIN_GENERATION`, `WORKER_EXCLUDE_TYPES`, and the vCPU/RAM caps
  - Generates a `NodePool workload` YAML with an explicit `node.kubernetes.io/instance-type` `In` values list
  - Sets `limits.cpu = ${SPOT_QUOTA} - 2` (reserves 2 vCPU for the bootstrap node)
  - Applies with `kubectl apply --server-side --force-conflicts`
What Gets Created
- `EC2NodeClass workload` — AL2023 AMI alias `${ALIAS_VERSION}`, two EBS volumes (20 GiB root + 100 GiB data at `${EBS_IOPS}`/`${EBS_THROUGHPUT}`), subnet and SG discovery via the `karpenter.sh/discovery=${CLUSTER_NAME}` tag
- `NodePool workload` — Spot-only, explicit instance-type list, `limits.cpu = SPOT_QUOTA - 2`, `consolidateAfter: 5m`, `expireAfter: 168h`
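For orientation, a trimmed sketch of what the generated `NodePool` might look like — the real manifest is produced by `update-nodepool-types.sh`, and the instance-type list and limits shown here are illustrative:

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: workload
spec:
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: workload
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot"]
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ["c5.large", "c5.xlarge", "m5.large"]   # illustrative subset
      expireAfter: 168h
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 5m
  limits:
    cpu: "98"   # SPOT_QUOTA - 2
```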
Expected Output
============================================
Phase 4.1: Karpenter EC2NodeClass + NodePool
============================================
Rendering workload node userdata...
Rendering EC2NodeClass...
ec2nodeclass.karpenter.k8s.aws/workload configured
✅ EC2NodeClass 'workload' applied
Generating Karpenter NodePool 'workload' with eligible instance types...
Querying Spot instance types in eu-north-1...
Filtering: families=[c,m,r] min_gen=3 exclude=[metal] vcpu_cap=0 ram_cap=0 min_mem=4096
Eligible instance types (42): c3.large, c3.xlarge, c5.large, ...
NodePool limits.cpu = 98
nodepool.karpenter.sh/workload configured
✅ Karpenter NodePool 'workload' applied
Configuration Details
Instance selection is driven by env.variables:
| Variable | Effect |
|---|---|
| `WORKER_INSTANCE_FAMILIES` | First letter(s) of instance type names to include (`c,m,r`) |
| `WORKER_MIN_GENERATION` | Minimum generation integer (3 → c5, m6i, r7g are in; c2/m2 are out) |
| `WORKER_EXCLUDE_TYPES` | Comma-separated substrings to disqualify (e.g. `metal,nano,micro,small,flex`) |
| `WORKER_MAX_VCPU` | Per-instance vCPU cap; 0 = no cap |
| `WORKER_MAX_RAM_GIB` | Per-instance RAM cap in GiB; 0 = no cap |
| `WORKER_MIN_MEMORY_MIB` | Minimum RAM per instance in MiB (e.g. 4096 removes tiny types) |
| `WORKER_ARCH` | `amd64` / `arm64` / `graviton` / `both` |
| `WORKER_CPU_VENDOR` | `intel` / `amd` / `both` (only applies when `WORKER_ARCH=amd64`) |
To regenerate the NodePool after changing quota or instance-family settings without re-running the full installer:
./update-nodepool-types.sh ./env.variables
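The core of the filtering logic can be sketched in stdlib-only Python. This is a simplified reconstruction, not the actual script — the real version reads the live `describe-instance-types` output, whereas here the type name, vCPU count, and memory are passed in directly:

```python
import re

def eligible(instance_type, vcpus, mem_mib,
             families=("c", "m", "r"), min_gen=3,
             exclude=("metal",), max_vcpu=0, max_ram_gib=0, min_mem_mib=4096):
    """Return True if an instance type passes env.variables-style filters."""
    # Split family letter(s) and generation number, e.g. "c5.large" -> ("c", 5)
    m = re.match(r"([a-z]+)(\d+)", instance_type)
    if not m or m.group(1)[0] not in families:
        return False            # wrong family letter (c/m/r/...)
    if int(m.group(2)) < min_gen:
        return False            # too old a generation
    if any(sub in instance_type for sub in exclude):
        return False            # disqualifying substring, e.g. "metal"
    if max_vcpu and vcpus > max_vcpu:
        return False            # per-instance vCPU cap (0 = no cap)
    if max_ram_gib and mem_mib > max_ram_gib * 1024:
        return False            # per-instance RAM cap (0 = no cap)
    return mem_mib >= min_mem_mib  # drop tiny types

print(eligible("c5.large", 2, 4096))      # True
print(eligible("c3.metal", 64, 262144))   # False (excluded substring)
print(eligible("m2.xlarge", 4, 16384))    # False (generation < 3)
print(eligible("t3.micro", 2, 1024))      # False (family + memory)
```

The surviving type names are then emitted into the `NodePool`'s explicit `node.kubernetes.io/instance-type` list.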
Manual Verification
# Check EC2NodeClass
kubectl get ec2nodeclass workload
# Inspect NodePool instance list
kubectl get nodepool workload \
-o jsonpath='{.spec.template.spec.requirements[?(@.key=="node.kubernetes.io/instance-type")].values}' \
| python3 -m json.tool
# Check NodePool limits
kubectl get nodepool workload -o jsonpath='{.spec.limits}' | python3 -m json.tool
# { "cpu": "98" }
# Check Karpenter logs for NodePool reconciliation
kubectl -n kube-system logs -l app.kubernetes.io/name=karpenter --since=2m
✅ Phase 4.1 Checklist
- `EC2NodeClass workload` exists and shows `Ready`
- `NodePool workload` exists with a non-empty instance-type list
- `limits.cpu` matches `SPOT_QUOTA - 2`
- Karpenter logs show no provisioning errors
Phase 5: EFS Shared Storage
Goal
Create an encrypted EFS filesystem and mount it on every worker node at /mnt/efs so that shared reference data (e.g. genome indices) is available across all tasks without S3 round-trips.
This phase is optional — set USE_EFS=false in env.variables to skip it entirely.
What Gets Created
- EFS filesystem `${CLUSTER_NAME}-efs` (encrypted, Standard tier) — tagged `karpenter.sh/discovery=${CLUSTER_NAME}`
- `EFS_ID` written back to `env.variables` (for re-runs and the destroy script)
- EKS addon `aws-efs-csi-driver`, scaled to 1 replica on the bootstrap node
- Kubernetes `StorageClass efs-sc`, `PersistentVolume efs-pv`, `PersistentVolumeClaim efs-pvc` in `${TES_NAMESPACE}`
- DaemonSet `efs-node-mount` — mounts EFS on each worker node’s host filesystem at `/mnt/efs`
- Security group `efs-mount-sg-${CLUSTER_NAME}` — allows TCP 2049 from the Karpenter node SG and the EKS cluster SG
- EFS mount targets in each private subnet of the VPC
Expected Output
============================================
Phase 5: EFS shared storage (optional)
============================================
Creating new EFS filesystem in VPC vpc-0abc1234...
✅ EFS filesystem created: fs-0bd10f52a04211916
Installing EFS CSI driver add-on...
attempt 1/36: addon status=CREATING — waiting 10s...
...
✅ EFS CSI Driver add-on ACTIVE
✅ EFS StorageClass, PV, PVC created
✅ efs-node-mount DaemonSet applied
Configuring EFS security groups and mount targets...
✅ EFS mount targets created
Common Issues
- Addon stays `DEGRADED` — restart the `efs-csi-controller` deployment: `kubectl -n kube-system rollout restart deployment efs-csi-controller`
- Mount target creation fails with `subnet already has a mount target` — non-fatal; the installer uses `|| true`; a mount target already exists in that AZ
- Worker pods can’t reach EFS — the EFS security group did not allow inbound TCP 2049 from the node SG; verify with `aws ec2 describe-security-group-rules`
- EFS PVC stuck `Pending` — EFS CSI driver not running or `StorageClass efs-sc` not created
Manual Verification
# Check EFS filesystem
aws efs describe-file-systems --file-system-id "${EFS_ID}" \
--region "${AWS_DEFAULT_REGION}" \
--query "FileSystems[0].LifeCycleState" --output text
# Expected: available
# Check mount targets
aws efs describe-mount-targets --file-system-id "${EFS_ID}" \
--region "${AWS_DEFAULT_REGION}" \
--query "MountTargets[].{Subnet:SubnetId,State:LifeCycleState}" --output table
# Check PVC
kubectl get pvc efs-pvc -n "${TES_NAMESPACE}"
# NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
# efs-pvc Bound efs-pv 150Gi RWX efs-sc 2m
# Check efs-node-mount DaemonSet (only shows desired=0 until workers exist)
kubectl get daemonset efs-node-mount -n "${TES_NAMESPACE}"
# Verify /mnt/efs is mounted on an existing node
kubectl debug node/$(kubectl get nodes -o name | head -1 | cut -d/ -f2) \
-it --image=busybox -- ls /mnt/efs
✅ Phase 5 Checklist
- EFS filesystem `EFS_ID` is set in `env.variables`
- EFS lifecycle state is `available`
- Mount targets exist in each private subnet
- `efs-pvc` is `Bound` in namespace `${TES_NAMESPACE}`
- `efs-node-mount` DaemonSet is deployed
Phase 6: AWS Load Balancer Controller
Goal
Install the AWS Load Balancer Controller so that the Ingress tes-ingress created in Phase 7 provisions an Application Load Balancer (ALB) for the Funnel TES endpoint.
What Gets Created
- IAM policy `AWSLoadBalancerControllerIAMPolicy` (or reuses existing)
- IRSA ServiceAccount `aws-load-balancer-controller` in `kube-system` with the policy attached
- Helm release `aws-load-balancer-controller` (chart `eks/aws-load-balancer-controller` v3.0.0)
- LB controller CRDs applied from the `eks-charts` GitHub repository
Expected Output
============================================
Phase 6: AWS Load Balancer Controller
============================================
✅ AWSLoadBalancerControllerIAMPolicy created: arn:aws:iam::123456789012:policy/...
Release "aws-load-balancer-controller" does not exist. Installing it now.
NAME: aws-load-balancer-controller
LAST DEPLOYED: Sun Mar 29 12:15:00 2026
NAMESPACE: kube-system
STATUS: deployed
✅ AWS Load Balancer Controller installed
Common Issues
- Controller pod `CrashLoopBackOff` — IRSA service account missing or wrong policy ARN
- ALB not provisioned after the Funnel ingress is applied — check controller logs: `kubectl -n kube-system logs -l app.kubernetes.io/name=aws-load-balancer-controller`
- `Failed to resolve security group` — subnets not tagged with `kubernetes.io/cluster/${CLUSTER_NAME}=owned` (applied in Phase 4)
- `invalid VPC ID` — `VPC_ID` was not exported from Phase 2; re-run from Phase 2
Manual Verification
# Check controller is Running
kubectl get deployment -n kube-system aws-load-balancer-controller
# NAME READY UP-TO-DATE AVAILABLE AGE
# aws-load-balancer-controller 1/1 1 1 3m
# Check IRSA annotation
kubectl -n kube-system get sa aws-load-balancer-controller \
-o jsonpath='{.metadata.annotations.eks\.amazonaws\.com/role-arn}'
✅ Phase 6 Checklist
- `aws-load-balancer-controller` deployment is `Available`
- IRSA ServiceAccount annotation points to the correct IAM role
- No errors in controller logs
Phase 7: Funnel TES Deployment
Goal
Create the Funnel IAM role (for S3 access via IRSA), provision the TES S3 bucket, and deploy all Funnel Kubernetes resources. After this phase the TES API is reachable via an ALB endpoint.
What Happens
- Resolves the OIDC provider ID from the EKS cluster (`cut -d'/' -f5` of the OIDC issuer URL)
- Creates IAM role `${CLUSTER_NAME}-iam-role` with an OIDC-scoped trust policy — allows the `funnel` ServiceAccount in `${TES_NAMESPACE}` to assume it
- Builds and attaches an inline S3 policy: full access to `${TES_S3_BUCKET}`, read-only to `${READ_BUCKETS}`, read-write to `${WRITE_BUCKETS}`
- Creates S3 bucket `${TES_S3_BUCKET}` with all public access blocked
- Renders and applies all Funnel YAML templates: `funnel-namespace`, `funnel-serviceaccount`, `funnel-rbac`, `funnel-crds`, `funnel-deployment`, `funnel-tes-service`, `tes-ingress-alb`, `funnel-configmap`, `ecr-auth-refresh`
- Waits for all Funnel pods to be Ready (10 min timeout)
- If `EXTERNAL_IP` is set, calls `setup_external_access.sh` to restrict the ALB security group to that IP
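The first step can be illustrated in isolation. The issuer URL below is a placeholder; the installer reads the real value from `aws eks describe-cluster`:

```shell
# Placeholder issuer URL — the real value comes from:
#   aws eks describe-cluster --name "${CLUSTER_NAME}" --query cluster.identity.oidc.issuer
ISSUER="https://oidc.eks.eu-north-1.amazonaws.com/id/ABCD1234567890EFGH"
# Splitting on '/', field 5 is the OIDC provider ID
echo "$ISSUER" | cut -d'/' -f5   # prints ABCD1234567890EFGH
```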
What Gets Created
- IAM role `${CLUSTER_NAME}-iam-role` with OIDC trust policy
- Inline S3 policy `${CLUSTER_NAME}-tes-policy` on the role
- S3 bucket `${TES_S3_BUCKET}` (private, public access blocked)
- Kubernetes resources in namespace `${TES_NAMESPACE}`:
  - ServiceAccount `funnel` annotated with the IAM role ARN (IRSA)
  - ClusterRole + ClusterRoleBinding for pod management
  - Funnel CRDs
  - Deployment `funnel` (Funnel server, pinned to bootstrap node)
  - Service `tes-service` (ClusterIP, port `${FUNNEL_PORT}`)
  - Ingress `tes-ingress` (ALB, port 80 → `${FUNNEL_PORT}`)
  - ConfigMap `funnel-config` (nerdctl executor config, EFS mounts if enabled)
  - CronJob `ecr-auth-refresh` (refreshes ECR credentials hourly)
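The OIDC-scoped trust policy on the role follows the standard IRSA pattern. A sketch for orientation only — the actual document is generated by the installer, and the account ID and provider ID below are placeholders:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::123456789012:oidc-provider/oidc.eks.eu-north-1.amazonaws.com/id/ABCD1234567890EFGH"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "oidc.eks.eu-north-1.amazonaws.com/id/ABCD1234567890EFGH:sub": "system:serviceaccount:funnel:funnel"
        }
      }
    }
  ]
}
```

The `sub` condition is what scopes the role to the `funnel` ServiceAccount in the `funnel` namespace, so no other pod can assume it.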
Expected Output
============================================
Phase 7: TES / Funnel deployment
============================================
OIDC provider ID: ABCD1234567890EFGH
✅ IAM role created: TES-iam-role
✅ S3 permissions attached
✅ S3 bucket created: tes-tasks-123456789012-eu-north-1
Applying funnel-namespace...
Applying funnel-serviceaccount...
Applying funnel-rbac...
Applying funnel-crds...
Applying funnel-deployment...
Applying funnel-tes-service...
Applying tes-ingress-alb...
Applying funnel-configmap...
Applying ecr-auth-refresh...
Waiting for Funnel pods to be Ready (10 min)...
✅ Funnel deployment complete
✅ Installation complete!
Next steps:
1. Retrieve the TES endpoint:
kubectl -n funnel get ingress tes-ingress
2. Configure Cromwell tes.conf to point at the TES endpoint
3. Submit a test task: funnel task run hello.json
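For step 3, a minimal `hello.json` might look like the following — a sketch of a GA4GH TES task document; the image and command are illustrative:

```json
{
  "name": "hello",
  "executors": [
    {
      "image": "ubuntu:22.04",
      "command": ["echo", "Hello from TES"]
    }
  ]
}
```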
Common Issues
- Funnel pod `Pending` — Karpenter has not provisioned a worker node yet; wait 30–60 s then check `kubectl get nodeclaims`
- `ImagePullBackOff` — ECR image not found in `${ECR_IMAGE_REGION}`; verify `FUNNEL_IMAGE` in `env.variables`
- ALB not created — LB controller not running (Phase 6 incomplete); check ingress events: `kubectl describe ingress tes-ingress -n ${TES_NAMESPACE}`
- IRSA not working (tasks can't write to S3) — OIDC fingerprint mismatch; run `aws iam list-open-id-connect-providers` and verify the OIDC provider ARN exists
- ConfigMap `funnel-config` renders with blank EFS mounts — `USE_EFS=true` but Phase 5 was skipped; re-run Phase 5 first
Manual Verification
# Check Funnel pod is Running (on bootstrap node)
kubectl get pods -n "${TES_NAMESPACE}"
# NAME READY STATUS RESTARTS AGE
# funnel-xxxxxxxxxx-yyyy 1/1 Running 0 3m
# Get the TES endpoint
kubectl get ingress tes-ingress -n "${TES_NAMESPACE}"
# NAME CLASS HOSTS ADDRESS PORTS AGE
# tes-ingress alb * k8s-funnel-xxx.eu-north-1.elb.amazonaws.com 80 5m
# Test TES API (replace with actual ALB DNS)
curl http://k8s-funnel-xxx.eu-north-1.elb.amazonaws.com/v1/service-info
# {"id":"funnel","name":"Funnel","type":{"artifact":"tes","type":"tes","version":"1.0.0"},...}
# Check IRSA annotation on Funnel ServiceAccount
kubectl get sa funnel -n "${TES_NAMESPACE}" \
-o jsonpath='{.metadata.annotations.eks\.amazonaws\.com/role-arn}'
# arn:aws:iam::123456789012:role/TES-iam-role
# Verify S3 bucket
aws s3 ls "s3://${TES_S3_BUCKET}"
# (empty — no output means accessible and empty)
✅ Phase 7 Checklist
- Funnel pod is `Running` in namespace `${TES_NAMESPACE}`
- Ingress `tes-ingress` has an ALB address
- `curl http://<ALB_DNS>/v1/service-info` returns Funnel service info JSON
- `funnel` ServiceAccount has IRSA annotation pointing to `${CLUSTER_NAME}-iam-role`
- S3 bucket `${TES_S3_BUCKET}` exists and is accessible
💰 Cost Estimation
Monthly Costs (Example)
| Component | Cost | Notes |
|---|---|---|
| EKS Cluster | ~$72 | Fixed per cluster |
| System Node (t4g.medium, on-demand) | ~$29 | Always-on ARM64 |
| Worker Nodes (Karpenter, Spot) | ~$100–800 | Highly variable — scales to zero when idle |
| EFS Storage | ~$0.30/GB-month | Scales with usage; 0 cost when empty |
| EBS Volumes | ~$50–200 | Per-task gp3 volumes; auto-deleted |
| S3 Storage & Transfer | ~$50–300 | Depends on workflow data volume |
| ALB | ~$20 | Fixed per load balancer |
| NAT Gateway | ~$35 | Fixed per AZ for private subnet egress |
| Total | ~$360–1500/month | Idle cluster (no tasks): ~$170/month |
Cost optimisation tips:
- Workers scale to zero between workflow runs — the dominant cost is proportional to active compute time
- Use `WORKER_MIN_GENERATION=5` or higher to target newer-generation instances with better price/performance
- Set `WORKER_MAX_VCPU` to avoid accidental selection of very large (expensive) instance types
- Pre-fill `SPOT_QUOTA` below your actual quota to leave headroom for other workloads
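The idle figure can be sanity-checked from the fixed-cost rows of the table above; EFS/EBS/S3 at near-zero usage contribute the remainder toward ~$170/month:

```shell
# Fixed monthly costs with no tasks running, per the table above:
# EKS control plane ($72) + system node ($29) + ALB ($20) + NAT gateway, single AZ ($35)
echo "$((72 + 29 + 20 + 35))"   # prints 156 — the fixed floor in USD/month
```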
Troubleshooting
EKS Cluster Creation Fails
# Check eksctl CloudFormation stack events
aws cloudformation describe-stack-events \
--stack-name "eksctl-${CLUSTER_NAME}-cluster" \
--region "${AWS_DEFAULT_REGION}" --output table | head -60
# Check Karpenter prerequisites stack
aws cloudformation describe-stacks \
--stack-name "EKS-${CLUSTER_NAME}" \
--query "Stacks[0].StackStatus" --output text
Common causes: insufficient IAM permissions on deploying identity; VPC quota exhausted; CAPABILITY_NAMED_IAM not passed.
Karpenter Not Provisioning Nodes
# Check NodePool and NodeClaims
kubectl get nodepool workload -o yaml
kubectl get nodeclaims
# Check Karpenter logs for provisioning decisions
kubectl -n kube-system logs -l app.kubernetes.io/name=karpenter --since=5m | grep -E "launched|failed|error"
Common causes: Spot quota exhausted; NodePool limits.cpu already reached; no eligible instance types after filtering; subnet tags missing.
Funnel Task Fails
# Check task pod logs
kubectl get pods -n "${TES_NAMESPACE}" --sort-by=.metadata.creationTimestamp
kubectl logs -n "${TES_NAMESPACE}" <task-pod-name>
# Check worker node has EFS mounted (if USE_EFS=true)
kubectl exec -n "${TES_NAMESPACE}" <task-pod-name> -- ls /mnt/efs
Common causes: EFS not mounted; S3 IRSA not working (check ServiceAccount annotation); container image not found in ECR.
📚 Related Documentation
- AWS CLI Guide — All CLI commands used by the installer with exact syntax and expected output
- AWS Cost & Capacity — Quota planning and cost breakdown
- AWS Troubleshooting — Detailed issue resolution
- Cromwell Documentation — Workflow orchestration
- Funnel Documentation — Task Execution Service
Last Updated: March 29, 2026
Status: Draft — pending production validation on eu-north-1
Phase 1: EKS Cluster Creation
cd ./AWS_installer/installer
./install-eks-karpenter.sh
# Runs CloudFormation, eksctl, Helm, kubectl automatically
Expected output:
✅ Deploying CloudFormation stack for Karpenter prerequisites...
✅ CloudFormation stack created successfully.
✅ EKS cluster created: TES
✅ Node group (system-ng) is ACTIVE
✅ kubeconfig updated
Duration: ~20–30 minutes
Verification
# Check cluster access
kubectl get nodes
# NAME STATUS ROLES AGE VERSION
# ip-10-0-1-xx.ec2.internal Ready <none> 5m v1.34.x
# Check system pods
kubectl get pods -n kube-system
# Should see: coredns, kube-proxy, aws-node, ebs-csi-driver, efs-csi-driver
# Check EKS cluster
aws eks describe-cluster --name TES --region us-east-1 --query cluster.status
# Should return: ACTIVE
Troubleshooting
Issue: eksctl create cluster hangs or fails
# Check CloudFormation events
aws cloudformation describe-stack-events --stack-name EKS-TES --region us-east-1
# Check EC2 instances
aws ec2 describe-instances --region us-east-1 --query 'Reservations[*].Instances[*].[InstanceId, State.Name, InstanceType]' --output table
# Delete and retry
eksctl delete cluster --name TES --region us-east-1
Phase 2: Karpenter Auto-scaling
Installation
The installer deploys Karpenter via Helm using the OCI chart from public ECR:
# Automatic (part of install script)
./install-eks-karpenter.sh
# Or manual (see installer for full flag set):
helm upgrade --install karpenter \
oci://public.ecr.aws/karpenter/karpenter \
--version "${KARPENTER_VERSION}" \
--namespace kube-system \
--create-namespace \
--set settings.clusterName="${CLUSTER_NAME}" \
--set settings.interruptionQueue="${CLUSTER_NAME}" \
--set replicas=1 \
--wait --timeout 10m
The chart is hosted on public ECR (`public.ecr.aws/karpenter/karpenter`) as an OCI artefact — not on `charts.karpenter.sh` (legacy, pre-v1 only). No `helm repo add` step is needed for OCI charts. IAM is wired via Pod Identity Association (set in `cluster.template.yaml` and created by `eksctl`), not via the old IRSA service-account annotation.
Karpenter NodePool Configuration
AWS Karpenter differs from OVH: there is no per-node vCPU quota, but Spot capacity is limited by your account's Spot vCPU quota.
NodePool YAML (auto-generated from env.variables):
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: workers
spec:
  template:
    spec:
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ["arm64", "amd64"]
        - key: karpenter.k8s.aws/instance-family
          operator: In
          values: ["c6g", "c6i", "m6g", "m6i"]  # Graviton + Intel
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand", "spot"]  # Prefer Spot for cost
        - key: karpenter.k8s.aws/instance-gpu-count
          operator: DoesNotExist  # No GPU nodes
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      expireAfter: 168h  # 7 days
  limits:
    cpu: 1000      # Max 1000 vCPU (adjust per AWS quota)
    memory: 4000Gi # Max 4000 GiB memory
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 30s
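The `nodeClassRef` points at an `EC2NodeClass` named `default`, which the installer generates. A hedged sketch of what that resource typically looks like — the discovery tag key and the role name are assumptions; the real values come from the installer templates:

```yaml
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: default
spec:
  amiSelectorTerms:
    - alias: al2023@latest                    # Amazon Linux 2023
  role: "KarpenterNodeRole-${CLUSTER_NAME}"   # assumed name; worker IAM role from the installer
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: "${CLUSTER_NAME}"
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: "${CLUSTER_NAME}"
```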
AWS-Specific Settings
Unlike OVH, AWS Karpenter controls:
| Setting | Purpose | AWS Specific |
|---|---|---|
| Instance family | Compute type (c6g = Graviton ARM) | Filter by cost/performance |
| Capacity type | On-demand vs Spot | Spot saves ~70% but can be interrupted |
| AMI | OS image (Amazon Linux 2023) | EBS-optimized, AWS-native |
| VPC subnets | Network placement | Auto-selected by CloudFormation |
| IAM instance role | Worker permissions | Created by installer |
| Security group | Firewall rules | Auto-created, allows internal traffic |
Verification
# Check Karpenter controller
kubectl get deployment -n kube-system karpenter
kubectl logs -n kube-system deployment/karpenter --tail=20
# Check NodePool
kubectl get nodepools
# NAME      NODECLASS   NODES   READY   AGE
# workers   default     0       True    2m
# Scale test: create 2 pods
kubectl create deployment test --image=nginx --replicas=2
kubectl get nodes -w
# Karpenter should provision nodes within 30s
Phase 3: EFS Shared Storage
Automatic Setup
The installer creates and mounts EFS:
# Automatic (part of install script)
./install-eks-karpenter.sh
# Or manual:
./mount-efs.sh # Runs EFS CSI driver + mounts
Expected output:
✅ EFS created: fs-0bd10f52a04211916
✅ EFS mounted at /mnt/efs (all nodes)
✅ EFS PV and PVC created
Verification
# Check EFS CSI driver
kubectl get daemonset -n kube-system efs-csi-node
# Check EFS mount on node
kubectl debug node/$(kubectl get nodes -o name | head -1 | cut -d/ -f2) -it --image=ubuntu
> df -h | grep efs
# 127.0.0.1:/ 150G 1G 149G 1% /mnt/efs
# Check EFS usage
kubectl exec -n kube-system -it $(kubectl get pods -n kube-system -l app=efs-csi-node -o name | head -1) -- \
df -h /mnt/efs
EFS Performance
| Tier | Throughput | Latency | Cost | Use Case |
|---|---|---|---|---|
| Standard | Bursting | <5ms | Low | Most workflows |
| Max IO | Provisioned | <5ms | Higher | High-throughput genomics |
For genomics workflows, Standard is typically sufficient.
Phase 4: S3 Object Storage
Bucket Setup
The installer creates S3 buckets:
export TES_S3_BUCKET="tes-tasks-123456789012-us-east-1"
export READ_BUCKETS="*" # Allow all buckets
# Automatic (part of install script)
./install-eks-karpenter.sh
# Or manual:
aws s3api create-bucket --bucket "$TES_S3_BUCKET" --region us-east-1
aws s3api put-bucket-versioning --bucket "$TES_S3_BUCKET" --versioning-configuration Status=Enabled
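The bucket-name convention embeds the account ID and region so the name is globally unique; a self-contained illustration (the account ID is a placeholder — in practice it comes from `aws sts get-caller-identity`):

```shell
ACCOUNT_ID=123456789012            # placeholder; real value from aws sts get-caller-identity
AWS_DEFAULT_REGION=us-east-1
TES_S3_BUCKET="tes-tasks-${ACCOUNT_ID}-${AWS_DEFAULT_REGION}"
echo "$TES_S3_BUCKET"              # prints tes-tasks-123456789012-us-east-1
```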
IAM Permissions
Worker pods get S3 access via IRSA (IAM Roles for Service Accounts):
# Installer creates IAM role: TES-iam-role
# Funnel service account linked to role
kubectl get serviceaccount funnel-worker -n funnel -o yaml | grep eks.amazonaws.com/role-arn
# arn:aws:iam::123456789012:role/TES-iam-role
S3 Policy
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": ["s3:*"],
"Resource": ["arn:aws:s3:::tes-tasks-*"]
},
{
"Effect": "Allow",
"Action": ["s3:GetObject"],
"Resource": ["arn:aws:s3:::cromwell-aws-*/*"]
}
]
}
Phase 5: Deploy Funnel TES
Prerequisites
- Funnel image pushed to ECR
- EKS cluster running (Phase 1–4 complete)
Automatic Deployment
./install-eks-karpenter.sh
# Runs all phases including Funnel deployment
Manual Deployment
# Apply Funnel ConfigMap with EFS mounts
envsubst < yamls/funnel-configmap.template.yaml | kubectl apply -f -
# Deploy Funnel server
kubectl apply -f yamls/funnel-namespace.yaml
kubectl apply -f yamls/funnel-deployment.yaml
kubectl apply -f yamls/funnel-tes-service.yaml
# Check status
kubectl get pods -n funnel
kubectl logs -n funnel deployment/funnel -f
Verification
# Get service endpoint
kubectl get svc -n funnel tes-service
# NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
# tes-service   LoadBalancer   10.100.x.x   a1234567-1234567890.elb.us-east-1.amazonaws.com   8000:3xxxx/TCP   5m
# Test TES API
export TES_SERVER=<EXTERNAL-IP>:8000
curl -X GET http://${TES_SERVER}/ga4gh/tes/v1/tasks
# Expected: { "tasks": [] }
Phase 6: Configure Cromwell
Local Installation
# Download Cromwell JAR
wget https://github.com/broadinstitute/cromwell/releases/download/86/cromwell-86.jar
# Create Cromwell config
cat > cromwell.conf << 'EOF'
include required(classpath("application"))
backend {
default = TES
providers {
TES {
actor-factory = "cromwell.backend.impl.tes.TesBackendFactory"
config {
root = "s3://tes-tasks-123456789012-us-east-1/cromwell"
endpoint = "http://<TES_ALB_DNS>:8000/v1/tasks"
# EBS-backed task disk via default runtime attributes
default-runtime-attributes {
  disk: "100 GB"
}
concurrent-job-limit = 1000
}
}
}
}
EOF
# Run Cromwell
java -Dconfig.file=cromwell.conf -jar cromwell-86.jar server
Submit Workflow
# Create WDL workflow
cat > hello.wdl << 'EOF'
workflow HelloWorld {
call hello
output {
String greeting = hello.message
}
}
task hello {
command {
echo "Hello, World!"
}
output {
String message = read_string(stdout())
}
runtime {
docker: "ubuntu:22.04"
cpu: 1
memory: "512 MB"
}
}
EOF
# Submit to Cromwell (multipart form, as the /api/workflows/v1 endpoint expects)
curl -X POST http://localhost:8000/api/workflows/v1 \
  -F workflowSource=@hello.wdl \
  -F workflowInputs='{}'
Phase 7: Verification & Testing
Cluster Health
# Check nodes
kubectl get nodes -o wide
# Check Karpenter
kubectl get nodepools
kubectl top nodes
# Check storage
kubectl get pvc -n funnel
kubectl get storageclasses
Workflow Test
# Monitor task execution
kubectl get pods -n funnel -w
# Check task logs
kubectl logs -n funnel <pod-name>
# Query TES API
curl http://${TES_SERVER}/ga4gh/tes/v1/tasks/${TASK_ID}