We are looking for a highly capable AI Infrastructure Engineer to design, implement, and optimize GPU-accelerated compute environments that power advanced AI and machine learning workloads. This role is critical in building and supporting scalable, high-performance infrastructure across data centers and hybrid cloud platforms, enabling training, fine-tuning, and inference of modern AI models.
Must have
- 3–6 years of experience in AI/ML infrastructure engineering or high-performance computing (HPC).
- Solid experience with GPU-based systems, container orchestration, and AI/ML frameworks.
- Familiarity with distributed systems, performance tuning, and large-scale deployments.
- Expertise in modern GPU architectures (e.g., NVIDIA A100/H100, AMD MI300), multi-GPU configurations (NVLink, PCIe, HBM), and accelerator scheduling for AI training and inference workloads.
- Good understanding of modern AI model architectures, including LLMs (e.g., GPT, LLaMA), diffusion models, and multimodal encoder-decoder frameworks, with awareness of their compute and scaling requirements.
- Knowledge of leading AI/ML frameworks (e.g., TensorFlow, PyTorch), NVIDIA’s AI stack (CUDA, cuDNN, TensorRT), and open-source tools like Hugging Face, ONNX, and MLPerf for model development and benchmarking.
- Familiarity with AI pipelines for supervised/unsupervised training, fine-tuning (PEFT/LoRA/QLoRA), and batch or real-time inference, with expertise in distributed training, checkpointing, gradient accumulation/synchronization strategies, and mixed-precision optimization (see the sketch below).
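For orientation, a minimal PEFT/LoRA fine-tuning sketch of the kind this role supports; the model name, rank, and target modules are illustrative assumptions, not project specifics:

```python
# Minimal LoRA fine-tuning sketch using Hugging Face PEFT.
# Model name, rank, and target_modules are illustrative assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, TaskType, get_peft_model

model_name = "facebook/opt-350m"  # placeholder base model
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Attach low-rank adapters to the attention projections; the base
# weights stay frozen, so only a small fraction of parameters train.
lora_config = LoraConfig(
    r=8,                                  # adapter rank
    lora_alpha=16,                        # adapter scaling factor
    target_modules=["q_proj", "v_proj"],  # OPT attention projections
    lora_dropout=0.05,
    task_type=TaskType.CAUSAL_LM,
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of total
```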
Responsibilities
- Design and deploy AI infrastructure with multi-GPU clusters on NVIDIA or AMD platforms.
- Configure GPU environments using CUDA, DGX systems, and the NVIDIA Kubernetes device plugin (see the Kubernetes sketch after this list).
- Deploy and manage containerized environments with Docker, Kubernetes, and Slurm.
- Support and optimize training, fine-tuning, and inference pipelines for LLMs and other deep learning models.
- Enable distributed training using DDP, FSDP, and ZeRO, with support for mixed precision (see the DDP sketch after this list).
- Tune infrastructure to optimize model performance, throughput, and GPU utilization.
- Design and operate high-bandwidth, low-latency networks using InfiniBand and RoCE v2.
- Integrate GPUDirect Storage and optimize data flow across Lustre, BeeGFS, and Ceph/S3.
- Support fast data ingestion, ETL pipelines, and large-scale data staging.
- Leverage NVIDIA’s AI stack, including cuDNN, NCCL, TensorRT, and Triton Inference Server (a Triton client sketch follows this list).
- Conduct performance benchmarking with MLPerf and custom test suites.
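As a flavor of the GPU-scheduling work above, a minimal sketch that requests a GPU through the NVIDIA device plugin’s extended resource using the official Kubernetes Python client; the pod name, image, and namespace are placeholders:

```python
# Minimal sketch: schedule a pod onto a GPU node by requesting the
# "nvidia.com/gpu" extended resource. Names and image are placeholders.
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() inside a cluster

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="gpu-smoke-test"),
    spec=client.V1PodSpec(
        restart_policy="Never",
        containers=[
            client.V1Container(
                name="cuda",
                image="nvidia/cuda:12.2.0-base-ubuntu22.04",
                command=["nvidia-smi"],  # quick GPU visibility check
                resources=client.V1ResourceRequirements(
                    limits={"nvidia.com/gpu": "1"}  # one GPU per pod
                ),
            )
        ],
    ),
)
client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)
```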
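The distributed-training bullet (DDP with mixed precision over NCCL) corresponds to a skeleton like the following, launched with `torchrun --nproc_per_node=N`; the model and data are stand-ins for illustration:

```python
# DDP + mixed-precision skeleton. Launch with torchrun so that
# LOCAL_RANK and the rendezvous env vars are set automatically.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")  # NCCL for GPU collectives
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(1024, 1024).cuda(local_rank)  # placeholder model
model = DDP(model, device_ids=[local_rank])
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()  # loss scaling for fp16 stability

for step in range(10):  # placeholder training loop
    x = torch.randn(32, 1024, device=local_rank)
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():   # autocast ops to fp16/bf16
        loss = model(x).square().mean()
    scaler.scale(loss).backward()     # DDP all-reduces gradients here
    scaler.step(optimizer)
    scaler.update()

dist.destroy_process_group()
```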
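For the Triton Inference Server item, a minimal HTTP client call; the server URL, model name, and tensor names ("INPUT0"/"OUTPUT0") are assumptions about an already-deployed model, not a specific deployment here:

```python
# Minimal Triton HTTP client sketch. URL, model name, and tensor
# names are assumptions about a deployed model repository entry.
import numpy as np
import tritonclient.http as httpclient

triton = httpclient.InferenceServerClient(url="localhost:8000")

batch = np.random.rand(1, 3, 224, 224).astype(np.float32)
infer_input = httpclient.InferInput("INPUT0", batch.shape, "FP32")
infer_input.set_data_from_numpy(batch)

response = triton.infer(model_name="resnet50", inputs=[infer_input])
output = response.as_numpy("OUTPUT0")
print(output.shape)
```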
Certifications
- NVIDIA Certified Professional – Data Center AI
- Certified Kubernetes Administrator (CKA)
- CCNP or CCIE Data Center
- Cloud certification (AWS, Azure, or GCP)
Educational Qualifications
- Bachelor’s in Computer Science/Applications, BTech in Computer Science, or MCA.