Introduction
We are seeking a highly capable Infrastructure as Code (IaC) Engineer to lead the design, implementation, and management of automated infrastructure provisioning for high-performance AI data centers. This role is central to orchestrating compute, network, storage, and virtualization layers using modern IaC tools across on-premises and hybrid cloud environments.
The ideal candidate will play a strategic role in enabling scalable and repeatable deployment pipelines that support GPU clusters, AI model training environments, and containerized platforms such as Kubernetes and OpenShift.
Job Description
Must have
- 5+ years of experience in infrastructure automation or SRE roles with hands-on IaC deployment.
- Proficiency in Terraform and Ansible, in scripting languages such as Python and Bash, and in YAML.
- Experience automating infrastructure in GPU-intensive environments supporting AI/ML workloads.
- Strong understanding of networking (VXLAN, EVPN, BGP, RoCE) and virtualization platforms (OpenShift, VMware, KVM).
- Familiarity with Kubernetes, Helm, Operators, and container orchestration frameworks.
- Exposure to storage automation for AI data lakes (e.g., Ceph, BeeGFS, Lustre, or S3-compatible storage).
- Experience with CI/CD tools (GitLab CI/CD, Jenkins, ArgoCD, Flux) in IaC pipelines.
Responsibilities
- Design and implement IaC frameworks to automate the provisioning and configuration of data center infrastructure for AI workloads.
- Orchestrate and manage multi-layer automation across compute (GPU/CPU), networking (VXLAN, EVPN, BGP), storage (NVMe, object, parallel file systems), and virtualization platforms (KVM, VMware, OpenShift).
- Develop reusable Terraform modules, Ansible playbooks, and YAML templates to define infrastructure in version-controlled environments.
- Automate deployment of Kubernetes clusters and integrate with GPU operators for training and inference pipelines.
- Build and maintain CI/CD pipelines to deploy, test, and manage infrastructure changes using tools like GitLab CI/CD, Jenkins, or ArgoCD.
- Integrate with monitoring and observability stacks (Prometheus, Grafana, DCGM) for automated infrastructure validation and health monitoring.
- Work closely with AI/ML platform teams to align infrastructure deployment with model training, data pipelines, and security policies.
- Ensure compliance with security and operational standards through policy-as-code and drift detection mechanisms.
Certifications
- Certified Kubernetes Administrator (CKA), mandatory
- Certified Kubernetes Application Developer (CKAD)
- NVIDIA Certified Kubernetes Specialist
Educational Qualifications
- Bachelor's degree in Computer Science or Computer Applications, BTech in Computer Science, or MCA