Manchester

Site Reliability Engineer (SRE)

Applies software engineering principles to operations, improving system reliability, scalability, and resilience through automation.

About the Company

This organisation operates complex, technology-driven systems where uptime, performance, and scalability are critical. Reliability is treated as a shared responsibility between engineering and operations, with SREs playing a key role in bridging the two.

The culture values automation, learning from failure, and data-driven decision-making. Engineers are encouraged to reduce manual work and build systems that can operate reliably at scale.

Typical benefits include flexible working, strong technical autonomy, and opportunities to work on high-impact infrastructure challenges.

The Role

As a Site Reliability Engineer, you will keep systems reliable, observable, and scalable. You will use software engineering approaches to automate operational tasks, improve monitoring, and reduce the risk of outages.

The role combines coding, systems thinking, and operational responsibility. It calls for both technical depth and a strong understanding of how systems behave in production.

Key Responsibilities

Design and implement reliability-focused tooling and automation
Monitor system performance and availability
Respond to and analyse production incidents
Improve observability through logging, metrics, and alerts
Work with engineering teams to design resilient systems
Reduce toil through automation and process improvement
Conduct post-incident reviews and implement improvements
Define and maintain reliability standards and practices

What We’re Looking For

Strong engineering background with an operational mindset
Experience with production systems and incident response
Ability to write code to automate operational tasks
Understanding of distributed systems and failure modes
Calm, analytical approach under pressure
Strong documentation and communication skills

Tools & Environment

You are likely to work with:

Programming or scripting languages
Monitoring, logging, and alerting platforms
Cloud infrastructure and container platforms
CI/CD pipelines and automation tools

How Success Is Measured

System reliability and uptime
Reduction in manual operational work
Quality of monitoring and alerting
Effectiveness of incident response
Improvements driven by post-incident learning

Benefits & Progression

SREs often progress into senior engineering, platform architecture, or reliability leadership roles. Benefits typically include flexible working, training budgets, and the opportunity to work on complex, large-scale systems.