We are seeking a highly skilled Kubernetes Orchestration Engineer to lead the deployment and management of GPU-optimized Kubernetes environments that power AI/ML and hypercomputing workloads. This role is critical to ensuring scalable, reliable, and high-performance infrastructure across on-premises and hybrid cloud environments. As a core member of our infrastructure engineering team, you will work at the intersection of container orchestration, GPU resource management, and AI application scaling, enabling large-scale distributed training and inference across GPU clusters.
Must Have
- Strong experience with Kubernetes (K8s) and container orchestration in production environments.
- Expertise in managing GPU workloads in Kubernetes using NVIDIA GPU Operator, vGPU, and device plugin configurations (see the scheduling sketch after this list).
- Proficiency with container runtimes such as Docker and CRI-O, and orchestration tools like Helm and Kubernetes Operators.
- Solid understanding of networking within Kubernetes and service mesh integration (e.g., Istio, Linkerd).
- Familiarity with hybrid/multi-cloud Kubernetes platforms (e.g., GKE, EKS, AKS).
- Strong scripting and automation skills (e.g., YAML, Helm templating, Bash, Python).
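To illustrate the device plugin requirement referenced above, here is a minimal sketch, using the official kubernetes Python client, of scheduling a single-GPU pod. It assumes the NVIDIA GPU Operator or device plugin already advertises nvidia.com/gpu on the nodes; the pod name, namespace, and image tag are hypothetical placeholders, not part of this posting.

```python
# Minimal sketch: schedule a pod that requests one GPU via the NVIDIA device plugin.
# Assumes `nvidia.com/gpu` is already a schedulable resource on the cluster.
# Pod name, namespace, and image tag are hypothetical placeholders.
from kubernetes import client, config

def launch_gpu_pod(namespace: str = "default") -> None:
    config.load_kube_config()  # or config.load_incluster_config() when running in-cluster

    container = client.V1Container(
        name="cuda-smoke-test",
        image="nvidia/cuda:12.4.1-base-ubuntu22.04",  # assumed image tag
        command=["nvidia-smi"],
        resources=client.V1ResourceRequirements(
            limits={"nvidia.com/gpu": "1"}  # resource exposed by the NVIDIA device plugin
        ),
    )
    pod = client.V1Pod(
        api_version="v1",
        kind="Pod",
        metadata=client.V1ObjectMeta(name="gpu-smoke-test"),
        spec=client.V1PodSpec(containers=[container], restart_policy="Never"),
    )
    client.CoreV1Api().create_namespaced_pod(namespace=namespace, body=pod)

if __name__ == "__main__":
    launch_gpu_pod()
```

On a GPU-enabled cluster, the pod should land on a node with a free GPU and print the nvidia-smi output, which is a quick way to verify that the device plugin and driver stack are healthy.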
Key Responsibilities
- Design and deploy AI infrastructure on multi-GPU clusters using NVIDIA or AMD platforms.
- Configure GPU environments using CUDA, DGX Systems, and NVIDIA Kubernetes Device Plugin.
- Deploy and manage containerized environments with Docker, Kubernetes, and Slurm.
- Support and optimize AI models across training, fine-tuning, and inference pipelines for LLMs and deep learning models.
- Enable distributed training using DDP, FSDP, and ZeRO, with support for mixed precision (see the sketch after this list).
- Tune infrastructure to optimize model performance, throughput, and GPU utilization.
- Design and operate high-bandwidth, low-latency networks using InfiniBand and RoCE v2.
- Integrate GPUDirect Storage and optimize data flow across Lustre, BeeGFS, and Ceph/S3.
- Support fast data ingestion, ETL pipelines, and large-scale data staging.
- Leverage NVIDIA’s AI stack including cuDNN, NCCL, TensorRT, and Triton Inference Server.
- Conduct performance benchmarking with MLPerf and custom test suites.
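As referenced in the distributed-training bullet above, the following is a minimal, illustrative sketch of data-parallel training with PyTorch DistributedDataParallel and mixed precision. It assumes a torchrun launch (which sets LOCAL_RANK) and NCCL as the GPU communication backend; the model, data, and hyperparameters are toy placeholders, not part of this posting.

```python
# Minimal sketch: PyTorch DDP with mixed precision, one process per GPU.
# Assumes launch via torchrun (sets RANK/LOCAL_RANK/WORLD_SIZE) and NCCL.
# The model and data below are toy placeholders.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main() -> None:
    dist.init_process_group(backend="nccl")           # NCCL for GPU collectives
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(1024, 1024).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])       # gradient all-reduce across ranks
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    scaler = torch.cuda.amp.GradScaler()              # loss scaling for fp16

    for step in range(10):
        x = torch.randn(32, 1024, device=local_rank)
        y = torch.randn(32, 1024, device=local_rank)
        optimizer.zero_grad(set_to_none=True)
        with torch.cuda.amp.autocast():               # mixed-precision forward pass
            loss = torch.nn.functional.mse_loss(model(x), y)
        scaler.scale(loss).backward()                 # scaled backward; DDP syncs grads
        scaler.step(optimizer)
        scaler.update()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()  # e.g. torchrun --nproc_per_node=8 train_sketch.py
```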
Certifications
- Certified Kubernetes Administrator (CKA) – required
- Certified Kubernetes Application Developer (CKAD)
- NVIDIA Certified Kubernetes Specialist
Educational Qualifications
- Bachelor's in Computer Science/Applications, BTech in Computer Science, or MCA