Life Science People are excited to be working with a cutting-edge Health Technology company using AI & Machine Learning to revolutionise drug discovery. Their work focuses on developing life-changing medicines for a wide variety of diseases and reducing the time it takes to develop them.
As the Lead Site Reliability Engineer, you will build a team around you and have line management responsibilities whilst remaining hands-on and steering the direction of their cutting-edge infrastructure. You will lead a team of up to seven engineers building and maintaining the cloud and Kubernetes-based platforms that form the foundation of their drug discovery pipeline. You must be a strong communicator who can lead by example and guide your team to deliver robust, secure and reliable infrastructure solutions.
Your team will work alongside other infrastructure squads to promote industry best practices and ensure the software is resilient enough for their scientists to rely upon. You will also contribute to diverse areas such as cloud services, container technologies, authentication, network topology, sharded databases, scalable web services, and interfaces to external data sources and APIs.
Your responsibilities will include
- Co-ownership of the overall cloud architecture.
- Ownership of the company's site reliability goals and formulation of objectives aligned with high-level organisational strategy.
- Approving the defined targets for SLOs and SLIs, and participating in the negotiations to define SLAs.
- Driving large-scale infrastructure projects to delivery through coordination with engineering and security teams in order to achieve a common goal.
- Incident response management. Ownership of incident response and disaster recovery policies.
- Influencing the direction of infrastructure technology advancements. Designing around challenges associated with large-scale distributed systems and driving the harmonisation of the technology support layer to promote reuse across the organisation.
- Conceiving and driving infrastructure solutions to achieve business continuity goals.
- Constantly refining processes and working practices to remove obstacles and empower engineering teams to supply our users with ample infrastructure solutions.
- Designing infrastructure solutions and maintaining their specifications.
We are looking for someone with
- Evidence of creative thinking and problem solving, confidently applying novel strategies to move projects to important decision points quickly and efficiently.
- Excellent oral and written communication skills, e.g. the ability to tailor the complexity of communications as required whilst maintaining clarity.
- Ability to work under pressure, manage different projects and deliver to defined timelines.
- Experience successfully leading a Site Reliability, DevOps or engineering team with excellent communication skills and the ability to forge productive relationships and collaborations both internally and externally.
- Excellent understanding of AWS and Kubernetes. Knowledge of scalability challenges associated with containers, distributed systems and large-scale web applications.
- Experience with programming languages (any; bonus points for Python/Java/Go/C++).
- Comfortable being available outside working hours in the event of a high-severity incident.
- Experience with monitoring and alerting solutions (for example Grafana/Prometheus).
- Extensive knowledge of cloud networking architecture, cloud operations, automation and orchestration.
- Good knowledge of network protocols and components such as BGP, TCP, HTTP/S and load balancing.