Siemens Healthineers engaged Philipp and Vlad due to growing challenges with their platform. As more users started using the platform, the availability requirements increased, and they wanted to reduce downtime. However, their current operations capabilities did not allow them to achieve the uptime and availability needed, and there was also a problem with the time it took to recover from failures. To address these challenges, Siemens Healthineers decided to adopt SRE as a solution to improve their operations and increase reliability. The adoption of SRE was added to the list of big initiatives, and Vlad and Philipp worked through the organization to get buy-in and support for the change.
Overcoming Challenges in Adopting Agile, Scrum, SRE: A Journey in Change Management and Leadership
Philipp and Vlad faced several challenges in adopting SRE at Siemens Healthineers. One of the biggest was getting support from those who were skeptical about the change. They also struggled to find the business metrics to justify the change and to update the code in operation. The transition from code-and-forget to code-and-operate was another significant challenge. This change required a business transition as well, from on-premises software to software subscription/SaaS. The company previously had no responsibility for operating an online service, but now it had to take that responsibility on. The transition also affected the way the business worked and required customers to get ready to consume an online service. The team also realized that delivering services quickly meant they had to operate them too, and that continuous delivery had an impact on sales.
Despite these challenges, Philipp and Vlad succeeded in encouraging the teams to find out how good and how available their systems already were, and played the educator by explaining the solution. They encouraged the teams to measure their services and helped them feel the pain, so they would be motivated to improve those services. They also advised against throwing the teams into cold water and leaving them there, recommending guidance and support along the way instead. They linked the teams to common service quality indicators, so the teams could understand the impact of SRE adoption. The idea was to provide support while still allowing the teams to work independently, to encourage creativity and innovation.
Successful Adoption of SRE: The Importance of Metrics, DevOps, and Change Management Leadership
One of the biggest successes in the adoption of SRE at Siemens Healthineers was the Central Tools Team. This team built the necessary tools for the adoption of SRE, and enabled knowledge to be transferred to other teams through the adoption of these tools.
Philipp and Vlad also worked with the teams to come up with meaningful targets that reflected customer behavior. These targets helped the teams define their SLAs and SLOs. An SLA, or Service Level Agreement, typically outlines key performance indicators (KPIs) such as uptime, response time, and availability, and defines the consequences if these KPIs are not met. An SLO, or Service Level Objective, is a target performance level that a service provider aims to achieve, usually expressed in terms of KPIs such as availability, latency, or error rate.
The Central Tools Team also provided a standard way to collect alerts and gain visibility. The work was done once and scaled to all the teams through dashboards that were also used by Product Owners, further aiding the adoption of SRE.
The success of the Central Tools Team was enabled through technology, process, and coaching. The team had the knowledge and expertise needed to build the necessary tools, and the coaching sessions helped transfer that knowledge to the rest of the teams. The centralized solution the Central Tools Team provided for collecting alerts and gaining visibility made it easier for the teams to adopt SRE. This shows the importance of technology, process, and coaching in the successful adoption of SRE.
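To make the idea of an SLO a little more concrete, here is a minimal sketch in Python of how a team might check an availability SLO against measured request counts. It is an illustration only, not the tooling used at Siemens Healthineers: the function name `evaluate_slo`, the `SloReport` type, and the numbers in the usage note are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class SloReport:
    availability: float      # measured fraction of successful requests
    allowed_failures: float  # failures the SLO tolerates in this window
    budget_remaining: float  # fraction of that tolerance still unused
    slo_met: bool


def evaluate_slo(total_requests: int, failed_requests: int,
                 slo_target: float) -> SloReport:
    """Compare measured availability with a target such as 0.999 (hypothetical)."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be between 0 and 1")

    # Availability: the share of requests that succeeded.
    availability = (total_requests - failed_requests) / total_requests
    # Error budget: how many failures the target allows for this request volume.
    allowed_failures = (1 - slo_target) * total_requests
    # Negative values mean the budget has been overspent.
    budget_remaining = 1 - failed_requests / allowed_failures

    return SloReport(availability, allowed_failures, budget_remaining,
                     availability >= slo_target)
```

For example, with an assumed target of 99.9% and 1,000,000 requests in the measurement window, the service may fail about 1,000 requests. If 400 requests failed, the SLO is met and roughly 60% of the allowance is still unused. A figure like this is one way to turn "how available is our system already?" into a number a team can track.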
Resources for SRE adoption in your organization
In this episode, we refer to many different resources that can help you adopt SRE in your organization. Here’s the list of resources:
- Google has many different YouTube videos on SRE, which you can find on their YouTube CloudTech channel.
- Google also published about their work in defining SRE. You can find many of their books here.
- Vlad Ukis’s book on adopting SRE in an Enterprise is available here, and his previous episode on Establishing SRE / Site reliability engineering Foundations: A Step-by-Step Guide is here.
- We also refer to some of the tools that help the adoption of SRE, for example tools for logging and monitoring.
  - Cloud independent:
    - New Relic https://newrelic.com
    - Datadog https://www.datadoghq.com
    - Prometheus https://prometheus.io
    - Dynatrace https://www.dynatrace.de
- Vlad also has some articles on InfoQ.
About Vlad Ukis and Philipp Gündisch
Vlad is an R&D leader and reliability lead at Siemens Healthineers. In this capacity, he drives the Continuous Delivery, SRE, and DevRel transformation, helping this large distributed development organization evolve its architecture, deployment, testing, operations, and culture to implement these new processes at scale.
You can link with Vlad Ukis on LinkedIn.
Philipp studied Computer Science in Erlangen and worked in the Operations Team at “teamplay” in Siemens Healthineers. He was responsible for building up and implementing the SRE infrastructure based on tools from Microsoft Azure. Together with Vlad, he drove the mindset transformation by working with the development teams. Recently, Philipp went back to university, where he is in a doctoral program.
You can link with Philipp Gündisch on LinkedIn.