In this episode, we talk about SRE (Site Reliability Engineering) with Vlad Ukis, who’s written a book about his experience at Siemens Healthineers’ digital health platform. His book, Establishing SRE Foundations: A Step-by-Step Guide to Introducing Site Reliability Engineering in Software Delivery Organizations, walks you through the steps to go from developing a cloud-based service to operating that service 24/7 with SRE, a method published and promoted by Google.
We’ve also covered this method in another episode with János Csorvási and Jeff Campbell, which is a great complement to this interview with Vlad Ukis.
The many aspects of Site Reliability Engineering
In this segment, we discuss the many different aspects that we must take into account when operating live cloud-based services. Vlad describes aspects ranging from change leadership (who to involve and how to involve them) to technical decisions and processes (on-call service, etc.).
In this respect, it is critical to develop and deploy the appropriate methodology, including the HR-related aspects (e.g. on-call service) that enable the use of SRE in your organization. Implementing SRE is one example of the end-to-end processes that Scrum Masters must be familiar with and, in some cases, help deploy in their organizations.
Product Manager (Product Owner) – Scrum team collaboration is key, also when implementing Site Reliability Engineering
As the team started implementing SRE, it quickly became clear that the Product Manager (Product Owner) role would have a critical impact on the success of the teams. This led Vlad to develop an approach to align the organization, from development to operations, and to establish SRE as an organizational initiative.
In this segment, we also discuss what happens when operations are kept apart from the product development cycle, a problem that SRE tries to eliminate. The topic of organizational alignment is also extensively discussed in the episode with János Csorvási and Jeff Campbell.
When to start with Site Reliability Engineering?
Vlad shares with us that many organizations try to adopt SRE only after a major incident. At that point, everyone is in a hurry, and the results are heavily influenced by that urgency. It’s important for organizations operating live systems to prepare for the transition ahead of time. In the book, Vlad discusses how to prepare your organization for the adoption of SRE before any major problem happens. In this segment, we explore some of the key change leadership topics that determine the level of readiness for adopting SRE.
If you want to explore more, Vlad participated in an interview with InfoQ, which you can check out here:
- Data-Driven Decision Making – Product Operations with Site Reliability Engineering
- Data-Driven Decision Making – Optimizing the Product Delivery Organization
You can purchase the book on Amazon: Establishing SRE Foundations: A Step-by-Step Guide to Introducing Site Reliability Engineering in Software Delivery Organizations by Vlad Ukis
About Vlad Ukis
Vlad is an R&D leader and reliability lead at Siemens Healthineers. In this capacity, he drives Continuous Delivery, SRE, and DevRel transformation, helping this large distributed development organization evolve architecture, deployment, testing, operations, and culture to implement these new processes at scale.
You can link with Vlad Ukis on LinkedIn.