Introduction
We are looking for 8+years experienced candidates for this role.
Responsibilities include:
- Lead deployment and configuration of on-prem Linux-based platforms for AI workloads.
- Install, configure, and harden approved Linux OS on GPU and non-GPU servers.
- Design and configure KVM virtualization
- Provision and manage VMs for application, database, observability, and platform components.
- Architect and implement GPU-enabled environments for production LLM inference.
- Deploy and operate containerized LLM serving stacks in production.
- Configure NVIDIA GPU drivers, CUDA runtime, GPU monitoring, and GPU health validation.
- Design and validate GPU utilization, isolation, workload placement patterns and health monitoring.
- Configure server networking, VLANs, Linux bridges, bonding, storage pools, and access controls.
- Apply security hardening, RBAC, encryption, and audit-ready configurations.
- Design HA and DR strategies and prepare systems for future scale-out.
- Coordinate with infrastructure, network, security, and hardware vendor teams during implementation.
- Lead troubleshooting, performance tuning, and stabilization.
- Produce architecture documentation, runbooks, and handover materials.

