Zafin is seeking a Cloud Site Reliability Engineer II (CSRE II) to lead strategic initiatives that ensure the reliability, scalability, and performance of our cloud infrastructure and applications. This advanced role requires mastery of cloud technologies, strategic planning, and incident management to drive innovative solutions and operational excellence.
As a CSRE II, you will influence the direction of cloud reliability strategies, mentor junior engineers, and lead significant projects that have a broad organisational impact. This position reports directly to the VP of Cloud Services and requires a proactive, collaborative mindset to achieve operational and strategic objectives.
Key Responsibilities
- Lead and manage the resolution of complex technical issues involving Zafin’s products and the Azure cloud environment.
- Design and implement strategic operational enhancements to improve resiliency and system reliability.
- Conduct in-depth Root Cause Analysis (RCA) for high-severity incidents and drive initiatives to prevent their recurrence.
- Represent the organisation in external client escalation calls, providing expert guidance and solutions.
- Architect and optimise cloud infrastructure for high performance, scalability, and cost-effectiveness.
- Provide thought leadership in managing and scaling container orchestration platforms such as AKS and OpenShift.
- Oversee the implementation of advanced monitoring solutions and integrate predictive analytics for proactive issue resolution.
- Develop and execute automation strategies to streamline operational workflows and enhance incident response capabilities.
- Create and maintain comprehensive documentation of cloud architectures, processes, and incident management strategies.
- Mentor and coach junior engineers, fostering a culture of continuous learning and innovation.
- Drive strategic initiatives, collaborating with cross-functional teams to achieve organisational objectives.