As enterprises place more emphasis on the resiliency of distributed computing systems, SRE practices are playing a crucial role, prompting vendors like New Relic to expand observability tools accordingly.
Service Level Management (SLM), a new feature in the New Relic One platform, became generally available this week at no cost to existing customers. It provides a framework for Site Reliability Engineering (SRE) teams to configure Service Level Indicators (SLIs) and Service Level Objectives (SLOs), automatically set baselines, and track the reliability of microservices based on these performance indicators.
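For readers new to the terminology: an SLI is commonly expressed as the ratio of good events to valid events over a time window, and an SLO is the target that ratio must meet. The Python sketch below illustrates that general definition only. It is not New Relic's implementation. The `ServiceLevel` class, its field names and the sample counts are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ServiceLevel:
    """Hypothetical request-based SLI paired with an SLO target (illustration only)."""
    name: str
    valid_events: int   # requests that count toward the SLI
    good_events: int    # valid requests that met the criterion (e.g. fast and error-free)
    slo_target: float   # e.g. 0.995 means 99.5% of valid events must be good

    def sli(self) -> float:
        """Fraction of valid events that were good; treat no traffic as fully reliable."""
        if self.valid_events == 0:
            return 1.0
        return self.good_events / self.valid_events

    def meets_slo(self) -> bool:
        """True when the measured SLI is at or above the objective."""
        return self.sli() >= self.slo_target


# Hypothetical sample counts for one SLO window.
checkout = ServiceLevel("checkout-latency", valid_events=200_000,
                        good_events=199_200, slo_target=0.995)
print(f"{checkout.name}: SLI={checkout.sli():.4f}, meets SLO: {checkout.meets_slo()}")
# prints: checkout-latency: SLI=0.9960, meets SLO: True
```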
A company spokesperson also said New Relic plans to announce a security offering this year but did not provide further details. FutureStack, the company's annual user conference, where it typically announces major product updates, is scheduled for May.
New Relic SLM beta testers said the vendor's update this week reflects how the move to a microservices architecture has expanded the role that observability tools and SREs play in their businesses, and they welcomed the possible addition of security monitoring to the mix. SREs have also begun to play an expanded role in DevSecOps environments.
Ultimately, combining multiple types of monitoring and measurement to capture user experience, rather than tracking the raw performance of individual infrastructure components, is what makes microservices different from monoliths, observability different from monitoring and SREs different from traditional system administrators, said one early adopter of New Relic SLM.
“Monitoring is very useful when failure modes are well understood, such as depletion of finite system resources like memory or threads,” said Andrew Myers, senior SRE manager at Zip.co, an Australian online payments company. “Observability helps us understand the state of a distributed system by looking at all the data it generates, not just individual [resources].”
Observability tools are entering a fierce consolidation phase
At least some companies have begun consolidating observability tools with New Relic. As New Relic's tools evolve, these companies are adding logs and distributed traces to its traditional APM tools and aggregating metrics data from third-party tools such as Prometheus, while phasing out competing tools such as Splunk and Grafana.
However, some companies are making consolidation choices that favor other vendors, and New Relic is playing catch-up in catering to SREs. Two of its main competitors, Dynatrace and Datadog, have offered SLI and SLO monitoring capabilities since 2020 and 2019, respectively.
These competitors also cover a whole category of IT security monitoring and DevSecOps that New Relic has yet to address. The observability market is ripe for further attrition and consolidation as users continue to reduce the number of IT management tools they use, including for security, and New Relic must keep pace with its competitors, including in security monitoring, to achieve long-term success.
“[Adding application security tools] would make sense as they continue to focus more on the software delivery lifecycle and developers,” said IDC analyst Stephen Elliot. “Code analysis is an interesting area, as are vulnerability assessments for developers.”
New Relic is also still emerging from a major shake-up in May 2021, when it named a new CEO and revamped its product portfolio to create New Relic One, a unified observability platform. According to the company's latest earnings report, its revenue has grown steadily since then, and it had 14,600 customers in its third fiscal quarter, which ended in January.
However, as it navigates the innovator's dilemma, which is also creating turbulence for enterprise IT vendors Splunk and ServiceNow, New Relic has yet to return to profitability. It forecasts relatively flat revenue for its fourth fiscal quarter and does not expect to be profitable before the end of fiscal year 2023.
SREs, observability create harmony out of chaos
At one company that adopted SLM, SREs acted as enablers as its microservices matured. They built a centralized observability stack with New Relic and used it to orchestrate communication among developers, platform engineers and product teams.
“In a monolithic environment, reliability belonged only to the SRE team – we were the only ones who cared about production issues,” said Stefan Kolesnikowicz, SRE director at Achievers, a maker of employee recognition software based in Toronto.
As Achievers' culture and its microservices deployments on Google Cloud Platform grew, “everyone became responsible for reliability,” he said. The distributed nature of microservices forces collaboration between the teams that develop and manage them, and their complexity is too great for any single team to handle alone.
The Achievers SRE team created a self-service portal for developers called Abattoir. The name nods to the oft-quoted “cattle versus pets” analogy, which arose with the highly automated, ephemeral infrastructure that underpins rapidly evolving microservices environments.
New Relic SLM will slot into Abattoir to allow software engineers and product teams to configure and track SLIs and SLOs for the services they manage, in part through a new integration with Terraform that automatically creates objects in the New Relic Observability Database in the background.
“We have a checkbox for that – basically engineers just say ‘Yes, I do,’” Kolesnikowicz said. “All of this is then translated from YAML, where the engineers write it, and pushed through Terraform, [which] talks to the New Relic API, which creates all of these objects in New Relic.”
This all reflects how system reliability has also risen to the top of the priority list at Achievers, Kolesnikowicz said, as it does for many companies where microservices are becoming mainstream.
“We're trying to be more stringent, so if your error budget runs out, that's your top priority, to increase your reliability before we can release new features and introduce more risk to our platform,” Kolesnikowicz said. “[New Relic SLM] is going to give us better insight into how well a system is performing and how it impacts the rest of the platform, and integrations with the product will let them see, ‘Hey, you're slipping on your error budget.’”
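The error budget Kolesnikowicz refers to is the number of bad events an SLO tolerates over its window, (1 - target) × valid events. The following is a minimal sketch that assumes a request-based SLO. The `error_budget` function and its feature-freeze flag are hypothetical illustrations of the policy he describes. They are not part of New Relic SLM.

```python
def error_budget(slo_target: float, valid_events: int, bad_events: int) -> dict:
    """Summarize a hypothetical request-based error budget for one SLO window.

    The budget is the number of bad events the SLO tolerates:
    (1 - slo_target) * valid_events.
    """
    allowed_bad = (1.0 - slo_target) * valid_events
    if allowed_bad <= 0:
        raise ValueError("need at least one valid event and a target below 1.0")
    remaining = allowed_bad - bad_events
    return {
        "allowed_bad": allowed_bad,
        "remaining": remaining,
        "fraction_consumed": bad_events / allowed_bad,
        # Hypothetical policy mirroring the quote above: once the budget is
        # exhausted, reliability work takes priority over new features.
        "freeze_features": remaining <= 0,
    }


# Hypothetical month: 99.9% target, 1,000,000 requests, 1,200 failures.
budget = error_budget(0.999, 1_000_000, 1_200)
print(f"allowed={budget['allowed_bad']:.0f}, "
      f"consumed={budget['fraction_consumed']:.0%}, "
      f"freeze={budget['freeze_features']}")
```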
SLI/SLO wishlists: burn rate alerts, edge metrics
Early adopters of SLM would like to see built-in error budget alerts added to the tool in a future release. They can already use the New Relic query language (NRQL) to configure custom alerts that fire when error budget burn rates reach certain thresholds, but it would be easier if such alerts came prepackaged with SLM.
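Burn rate measures how fast the error budget is being spent, compared with a pace that would use it up exactly at the end of the SLO window. The sketch below is plain Python, not NRQL. It shows the multi-window, multi-burn-rate alerting pattern described in Google's SRE Workbook. The 14.4 threshold (2% of a 30-day budget spent in one hour) and the pairing of a 1-hour and a 5-minute window are assumptions taken from that pattern. All names and counts are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class WindowCounts:
    """Request counts observed over one lookback window (hypothetical input)."""
    valid_events: int
    bad_events: int


def burn_rate(counts: WindowCounts, slo_target: float) -> float:
    """Observed error rate divided by the error rate the SLO allows.

    1.0 would use up the budget exactly at the end of the SLO window;
    14.4 sustained for one hour spends 2% of a 30-day budget.
    """
    if counts.valid_events == 0:
        return 0.0
    error_rate = counts.bad_events / counts.valid_events
    return error_rate / (1.0 - slo_target)


def should_page(long_window: WindowCounts, short_window: WindowCounts,
                slo_target: float, threshold: float = 14.4) -> bool:
    """Page only if both the long (e.g. 1h) and short (e.g. 5m) windows
    burn at or above the threshold (assumed workbook-style defaults)."""
    return (burn_rate(long_window, slo_target) >= threshold
            and burn_rate(short_window, slo_target) >= threshold)


# Hypothetical counts: last hour and last five minutes, 99.9% SLO.
hour = WindowCounts(valid_events=60_000, bad_events=1_200)
five_min = WindowCounts(valid_events=5_000, bad_events=90)
print(round(burn_rate(hour, 0.999), 1), should_page(hour, five_min, 0.999))
```

The key design choice is requiring both windows to exceed the threshold. The long window keeps brief blips from paging anyone, and the short window lets the alert clear soon after the problem stops.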
“It would also be great to have insights to help teams decide on realistic service-level objectives, using the historical data we have as a baseline,” said Zip.co's Myers. “It's something we needed to coach our teams on internally.”
According to Kolesnikowicz, another potential refinement for SLM would be expanded support for the Prometheus metrics that Achievers collects in its individual Kubernetes clusters through the Istio service mesh. New Relic One already aggregates Prometheus metrics for other uses, but those metrics have not yet been integrated with SLM.
“If you know the SRE book, [it says] you can bring the measurement closer to the user to improve its quality,” he said, referring to Google's influential book Site Reliability Engineering. “Today we measure [SLIs] server-side. We want to measure them at the load balancer, which would be in our Istio instance.”
Both error budget burn rates and support for Prometheus metrics are on the vendor’s near-term roadmap for SLM, a New Relic spokesperson said.
Beth Pariseau, Senior Writer at TechTarget, is an award-winning veteran of IT journalism. She can be reached on Twitter @PariseauTT.