Upstream Pull Requests
This page documents pending pull requests to upstream repositories with functionality developed in this deployment.
1. karpenter-provider-ovhcloud
Status: PR submitted (PR #1)
Branch: main (4 commits)
Target: antonin-a/karpenter-provider-ovhcloud:main
Summary
Introduces Karpenter label patching controller and fixes for drift detection, pool creation, API parsing, and node tracking on OVH MKS.
Key Features & Fixes
New: Node Labels Controller
- Problem: Karpenter requires standard Kubernetes labels on nodes (capacity, architecture, zone) but OVH nodes don’t have them initially
- Solution: Dedicated controller watches for node registration and applies required labels from matching NodeClaim
- Benefit: Closes race window where Karpenter’s drift controller fires within seconds of node joining
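The controller's core step can be sketched as a pure function: given the node's current labels and the matching NodeClaim's labels, compute the patch of missing required labels. The label keys and function name below are illustrative, not the provider's exact code.

```go
package main

import "fmt"

// Standard labels Karpenter expects on every node. This set is illustrative;
// the real controller derives the values from the matching NodeClaim.
var requiredLabels = []string{
	"karpenter.sh/capacity-type",
	"kubernetes.io/arch",
	"topology.kubernetes.io/zone",
}

// patchMissingLabels copies any required label the node lacks from the
// NodeClaim's label set, returning the patch to apply (empty means no-op).
func patchMissingLabels(nodeLabels, claimLabels map[string]string) map[string]string {
	patch := map[string]string{}
	for _, key := range requiredLabels {
		if _, ok := nodeLabels[key]; !ok {
			if v, ok := claimLabels[key]; ok {
				patch[key] = v
			}
		}
	}
	return patch
}

func main() {
	node := map[string]string{"kubernetes.io/arch": "amd64"}
	claim := map[string]string{
		"karpenter.sh/capacity-type":  "on-demand",
		"kubernetes.io/arch":          "amd64",
		"topology.kubernetes.io/zone": "eu-west-1a",
	}
	fmt.Println(patchMissingLabels(node, claim))
}
```

Applying only the missing keys keeps the patch minimal, so repeated reconciles are idempotent.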
Fix: Drift Detection in Create()
- Ensure labels are added to nodes during the `Create()` method, not deferred
- Convert flavor UUIDs to human-readable names for Karpenter compatibility
Fix: Single-Zone Cluster Pool Creation
- Retry node pool creation without the `availabilityZones` parameter on single-zone clusters
- Prevents API 400 errors when the zone parameter is not applicable
Fix: RAM API Response Parsing
- OVH API returns RAM in GiB; convert to MiB for Karpenter compatibility
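The conversion itself is a one-liner, but getting the unit wrong makes Karpenter see nodes with 1024x too little memory. A minimal sketch (the helper name is hypothetical):

```go
package main

import "fmt"

// The OVH API reports flavor RAM in GiB, while Karpenter expects the
// node's memory capacity in MiB; 1 GiB = 1024 MiB.
func gibToMib(gib int64) int64 {
	return gib * 1024
}

func main() {
	fmt.Println(gibToMib(16)) // a 16 GiB flavor advertises 16384 MiB
}
```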
Fix: Node Provider-ID Tracking
- Track individual nodes to prevent duplicate `provider-id` assignment
Fix: CRD Template Serialization
- Remove `omitempty` from all template fields to ensure required fields are present in JSON
- Add mandatory `finalizers` field to node pool templates
2. funnel
Status: PR submitted (PR #1357)
Branch: feat/k8s-ovh-improvements (branched from develop)
Target: ohsu-comp-bio/funnel:develop
Summary
Infrastructure, database, server, worker, and Kubernetes backend enhancements.
Key Features & Fixes
Infrastructure (Docker image)
- Bumped Go base image: `1.23-alpine` → `1.26-alpine`
- Added `nerdctl` binary for containerd usage
- Exposed containerd socket and namespace environment variables in the final image
Database
- Problem: Task insertion and queueing used separate BoltDB write operations, causing lock contention at scale
- Solution: Combine task store and queue writes in a single atomic `db.Update` transaction
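The funnel fix uses BoltDB's `db.Update`; the stdlib-only sketch below illustrates just the pattern, with an in-memory store standing in for BoltDB. The point is that both writes happen under one lock acquisition, so no reader can observe a stored task missing from the queue.

```go
package main

import (
	"fmt"
	"sync"
)

// store mimics the shape of the fix: a single Update call holds the lock
// once and applies both writes atomically. The real change uses BoltDB's
// db.Update; this in-memory type only illustrates the pattern.
type store struct {
	mu    sync.Mutex
	tasks map[string]string
	queue []string
}

func (s *store) Update(fn func(tasks map[string]string, queue *[]string) error) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	return fn(s.tasks, &s.queue)
}

// createTask stores the task record and enqueues it in one transaction,
// instead of the two separate writes that caused lock contention.
func createTask(s *store, id, body string) error {
	return s.Update(func(tasks map[string]string, queue *[]string) error {
		tasks[id] = body
		*queue = append(*queue, id)
		return nil
	})
}

func main() {
	s := &store{tasks: map[string]string{}}
	if err := createTask(s, "task-1", "{}"); err != nil {
		panic(err)
	}
	fmt.Println(len(s.tasks), len(s.queue))
}
```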
Server
- Added gRPC keepalive policies for server and gateway clients
- Retry service config for transient errors (UNAVAILABLE, RESOURCE_EXHAUSTED)
Worker
- New: `Resources` struct support with a memory limit calculation helper
- New: Volume consolidation algorithm to merge input mounts to a common ancestor directory
- Prevents EBUSY when tasks manipulate input files
- Reduces mount label overhead
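The consolidation idea can be sketched as a common-ancestor computation over the input paths: mounting the deepest shared directory replaces N individual mounts. The helper name is hypothetical; funnel's actual algorithm in the worker is mountpoint-aware.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// commonAncestor returns the deepest directory containing every input path.
// Mounting that one directory replaces the individual per-file mounts.
func commonAncestor(paths []string) string {
	if len(paths) == 0 {
		return ""
	}
	ancestor := filepath.Dir(paths[0])
	for _, p := range paths[1:] {
		dir := filepath.Dir(p)
		// Walk the candidate ancestor upward until it prefixes this path.
		for ancestor != "/" && !strings.HasPrefix(dir+"/", ancestor+"/") {
			ancestor = filepath.Dir(ancestor)
		}
	}
	return ancestor
}

func main() {
	mounts := []string{
		"/data/run1/sample.bam",
		"/data/run1/sample.bai",
		"/data/run2/ref.fa",
	}
	fmt.Println(commonAncestor(mounts)) // /data
}
```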
Kubernetes Backend: Optional GenericS3
- Problem: GenericS3 (AWS S3 CSI configuration) was mandatory for all PV/PVC creation, blocking non-AWS deployments (OVH, on-premise)
- Solution:
- Add nil guards around `config.GenericS3` accesses in `CreatePV` and `CreatePVC`
- Deployments using hostPath or other non-S3 storage can now omit GenericS3
- Prevents index-out-of-bounds panic when GenericS3 is empty
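The panic scenario and the guard reduce to a few lines. The struct and field names below are illustrative stand-ins for funnel's config types, not its exact definitions.

```go
package main

import "fmt"

// GenericS3Config mirrors the shape of the optional S3 CSI settings;
// the field names are illustrative, not funnel's exact config struct.
type GenericS3Config struct {
	Endpoints []string
}

type Config struct {
	GenericS3 []GenericS3Config
}

// s3Endpoint returns the first configured endpoint, or "" when GenericS3
// is absent. Indexing GenericS3[0] unconditionally is what used to panic.
func s3Endpoint(c *Config) string {
	if len(c.GenericS3) == 0 || len(c.GenericS3[0].Endpoints) == 0 {
		return ""
	}
	return c.GenericS3[0].Endpoints[0]
}

func main() {
	fmt.Printf("%q\n", s3Endpoint(&Config{})) // "" -- no panic on empty config
}
```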
Kubernetes Backend: Configurable ConfigMaps
- Problem: Per-task ConfigMaps created unconditionally, causing:
- Duplicated full config (including credentials) N times in etcd
- etcd write pressure and API-server churn at scale (1000s of tasks)
- Leak risk if reconciler/worker crashes before cleanup
- Solution:
- New `ConfigMapTemplate` field (like the existing `ServiceAccountTemplate`, `RoleTemplate`)
- Default is `""` (disabled), fully backward compatible
- Renders the template with per-task values
- Only created when the template is explicitly configured
- Reference: `config/kubernetes/worker-configmap.yaml`
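The disabled-by-default behavior can be sketched as follows. The function name and the `TaskId`/`Namespace` context fields are illustrative assumptions, not funnel's actual template variables.

```go
package main

import (
	"bytes"
	"fmt"
	"text/template"
)

// renderConfigMap renders the operator-supplied ConfigMap template for one
// task, or returns "" when no template is configured (the default).
func renderConfigMap(tpl, taskID, namespace string) (string, error) {
	if tpl == "" {
		return "", nil // feature disabled: no per-task ConfigMap is created
	}
	t, err := template.New("configmap").Parse(tpl)
	if err != nil {
		return "", err
	}
	var buf bytes.Buffer
	err = t.Execute(&buf, map[string]string{"TaskId": taskID, "Namespace": namespace})
	return buf.String(), err
}

func main() {
	tpl := `metadata:
  name: funnel-worker-{{.TaskId}}
  namespace: {{.Namespace}}`
	out, _ := renderConfigMap(tpl, "task-42", "funnel")
	fmt.Println(out)
}
```

Returning early on the empty template is what makes the default fully backward compatible: existing deployments never hit the render path.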
Kubernetes Backend: Template Rendering Fix
- Problem: All seven Kubernetes resource files used `html/template`, which HTML-escapes interpolated values (`"` → `&#34;`, `&` → `&amp;`), corrupting YAML
- Example: a quoted value becomes `&#34;`-escaped in the output, breaking YAML parsing
- Solution: Replace all `html/template` with `text/template` (as the Go documentation directs for non-HTML output)
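The corruption is easy to reproduce: rendering the same snippet through both standard-library packages shows `html/template` escaping the quotes and ampersands that `text/template` leaves intact.

```go
package main

import (
	"bytes"
	"fmt"
	ht "html/template"
	tt "text/template"
)

// renderHTML renders tpl with html/template, which autoescapes for HTML
// output; renderText uses text/template, which emits values verbatim.
func renderHTML(tpl string, data any) string {
	var b bytes.Buffer
	ht.Must(ht.New("h").Parse(tpl)).Execute(&b, data)
	return b.String()
}

func renderText(tpl string, data any) string {
	var b bytes.Buffer
	tt.Must(tt.New("t").Parse(tpl)).Execute(&b, data)
	return b.String()
}

func main() {
	const tpl = `value: {{.}}`
	const val = `say "hi" & bye`
	fmt.Println(renderHTML(tpl, val)) // value: say &#34;hi&#34; &amp; bye -- broken YAML
	fmt.Println(renderText(tpl, val)) // value: say "hi" & bye
}
```

Since the two packages expose the same API, the swap is mechanical: only the import path changes.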
3. cromwell
Status: PR submitted (PR #7858)
Branch: ovh-tes-improvements (branched from working commit)
Target: broadinstitute/cromwell:develop (or feature branch)
Summary
S3/AWS endpoint flexibility, TES backend enhancements (memory retry, local-filesystem support, backoff limits), and logging improvements enabling Cromwell to work efficiently with OVH infrastructure and S3-compatible services.
Key Features & Fixes
Standard Logging
- Downgraded command-line logging from `info` to `debug` in StandardAsyncExecutionActor
- Reduces log verbosity in production
S3/AWS Endpoint Support
- Problem: Cromwell hardcoded AWS endpoints; S3-compatible services and custom endpoints not supported
- Solution:
- New `aws.endpoint-url` config parameter
- Propagated through AwsConfiguration, AwsAuthMode, and the S3 client builder
- Skip STS validation for non-AWS endpoints
- Force path-style access for S3-compatible services
- Custom URI handling in S3PathBuilder/Factory
- Robustness Fixes:
- Ignore errors creating “directory marker” objects
- Handle the empty key (bucket root) specially in `exists()` and `S3Utils`
- Tolerate 400/403 responses from S3-compatible services when checking existence
- Miscellaneous path handling tweaks (permissions NPE, `createDirectories()` no-op)
- Documentation: Extensive comments explaining non-AWS behavior
TES Backend: Memory Retry
- New Runtime Attribute: `memory_retry_multiplier`
- Scan stderr and logs for OOM indicators
- Extended `handleExecutionResult()` with memory-specific error handling
- Automatic task retry with increased memory allocation
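The retry decision reduces to two steps: detect an OOM signature, then scale the memory request. The sketch below (in Go for consistency with the rest of this page; Cromwell itself is Scala) uses illustrative OOM markers and hypothetical function names.

```go
package main

import (
	"fmt"
	"strings"
)

// Illustrative OOM markers; the real indicator list is a Cromwell
// configuration concern, not hardcoded like this.
var oomIndicators = []string{"OutOfMemoryError", "oom-kill", "Killed"}

func looksLikeOOM(stderr string) bool {
	for _, marker := range oomIndicators {
		if strings.Contains(stderr, marker) {
			return true
		}
	}
	return false
}

// nextMemoryGiB returns the memory for the retry attempt: scaled by the
// task's memory_retry_multiplier only when the failure looks like an OOM.
func nextMemoryGiB(current, multiplier float64, stderr string) float64 {
	if looksLikeOOM(stderr) && multiplier > 1 {
		return current * multiplier
	}
	return current
}

func main() {
	fmt.Println(nextMemoryGiB(8, 1.5, "java.lang.OutOfMemoryError: heap")) // 12
}
```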
TES Backend: Shared Filesystem Support
- New Config: `filesystems.local.local-root` (with legacy `efs` fallback)
- Inputs under the configured local root are not localized (reduces data movement)
- Custom hashing actor respects mountpoint boundaries
- Sibling-md5 file support for faster hashing
- Constructors and expression functions adjusted for local-filesystem paths
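The localization skip is, at its core, a path-prefix decision. The sketch below simplifies the PR's mountpoint-aware logic to a prefix check; the function name is hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// needsLocalization reports whether an input file must be copied to the
// task's working area. Paths already under the shared local root are used
// in place; anything else (e.g. an S3 URI) must be localized first.
func needsLocalization(path, localRoot string) bool {
	return !strings.HasPrefix(path, strings.TrimSuffix(localRoot, "/")+"/")
}

func main() {
	fmt.Println(needsLocalization("/shared/ref/genome.fa", "/shared")) // false: read in place
	fmt.Println(needsLocalization("s3://bucket/reads.fq", "/shared"))  // true: must be copied
}
```

Trimming the trailing slash before re-adding it avoids the classic false positive where `/shared-other/x` matches the bare prefix `/shared`.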
TES Backend: Backoff Limit Support
- New Runtime Attribute: `backoff_limit`
- Propagated to TES `backend_parameters` with debug logging
- Prevents excessive retries on permanent failures
TES Backend: File Hashing
- New: `TesBackendFileHashingActor.scala` with sibling-md5 file support
- Respects local filesystem mountpoints
TES Backend: JSON Formatting
- Recursive null-stripping for spray-json
- Corrected `size_bytes` type to `Long` for large files
Deployment Context
These pull requests enable running Cromwell and Funnel on OVH MKS infrastructure with:
- Node Provisioning: Karpenter automatically scales workers up/down
- Storage: S3-compatible object storage (OVH S3) + local shared filesystems
- Workload Execution: TES backend for Cromwell, containerized Funnel workers
- Cost Optimization: Consolidation of idle nodes, configurable resource limits
See Karpenter Deployment on OVH MKS for deployment details.