IKEA INGKA: Excelling in customer satisfaction by eliminating prolonged service outages with an Observability as Code implementation

IKEA INGKA partnered with AIOPSGROUP to future-proof its leadership position in service excellence. IKEA is modernizing its IT infrastructure with an observability implementation and with automation for incident response and escalation. This enables the IKEA organization to work across the entire data landscape rather than in silos in the data center or cloud-native environments.

About IKEA INGKA

IKEA INGKA is a Swedish company that was established in 1943. The company designs and sells ready-to-assemble furniture, kitchen appliances, and home products, among other goods and home services. Across the globe, IKEA INGKA has 378 stores, welcomes 706 million visitors to these stores, and has more than 3.6 billion visits to IKEA.com.

In this article, we look at how IKEA INGKA (IKEA) partnered with AIOPSGROUP to build a solution for Delivery & Services technology that improves the buying experience for IKEA customers. The goal was to eliminate prolonged service outages caused by undetected service anomalies and protracted resolution times. The IKEA Delivery & Services team selected AIOPSGROUP to implement Observability as Code. This is a development methodology in which code is considered complete (DONE) only when enough analytics, proactive alarms, and escalation processes have been built in to support it. This project contributes to realizing the vision of having a holistic view of customers across their relationship with IKEA.

Meet the Team

As the Head of Product Engineering, Tom Haveman leads the engineering teams for Delivery & Services @ IKEA.

Observability as Code Implementation Leadership Team

Tom Haveman
Head of Product Engineering at IKEA

Giordano Tamburelli
Engineering Manager at IKEA

Rui Tiago Costa
Engineering Manager at IKEA

Fabio Fonseca
Engineering Manager at IKEA

Milad (yahia) Reyhani
Software Engineering Manager at IKEA

Petar Gadjev
Software Architect at AIOPSGROUP

IKEA’s way of retail is dynamic. They are constantly innovating to meet customer expectations. The vision for technology is not only to provide answers to technology challenges but also to have them implemented.

The Delivery & Services team at IKEA sees digital transformation as key to improving the “human to human connection”, equipping the services teams with customer stories and data points that will help them serve customers today and for years to come.

The drive to build observability into IKEA technical products started when new IKEA leadership was introduced to the existing IKEA monitoring and support tool, BMC Service Request Management. The BMC Remedy tool is an archaic, reactive, and complex central repository designed to receive all service alarms and requests within IKEA. Being central also meant that “someone else” was managing other departments’ alarms and issues. The Delivery & Services team at IKEA decided to take the lead and rebuild this process, which fell short of both IKEA’s “constantly innovating” philosophy and the basic monitoring and observability principles of proactiveness and accountability. The vision was to make IKEA technical products observable, so that one can determine the state and behavior of the entire system from its outputs and alerts.

Now, with the Observability as Code implementation, IKEA Delivery & Services teams will be able to:

  • Build / code alerts into the code of IKEA technical products
  • Proactively receive the alerts in real-time
  • Easily sift through vast amounts of events by filtering, tagging, and sorting
  • Avoid alarm fatigue and desensitization caused by the chaos of alerts and incidents
  • Receive alerts with context to events to make them informative and actionable
  • Reduce resolution time of incidents
  • Increase Customer Satisfaction
  • Increase IKEA sales

The AIOPSGROUP team helped the IKEA Delivery & Services division implement Observability as Code on the first project of this kind. The project also served as a proof of concept for the future adoption of an observability culture across IKEA engineering teams. It is being delivered via www.backstage.io, which serves as the development portal for any new team, team member, or project.

“Deliver the IKEA experience in the homes of the many people”

What is Observability?

Teams requiring operational visibility have expanded beyond sysadmins and IT Ops analysts — even developers are taking greater ownership of knowing what is going on for a better customer experience. To effectively do this, all roles need visibility inside their entire architecture — from third-party apps and services to their own — to fix and eventually prevent problems. When that capability is built-in — the premise of observability — it not only makes visibility easier, enables greater insight, and leaves more time for more strategic initiatives, but it is also critical to the overall success of Site Reliability Engineering (SRE). This provides a bridge between developers releasing code and operators maintaining infrastructure impacted by code. On top of this, it shifts some of the monitoring workloads onto development.

Simply defined, observability is instrumenting systems and applications to collect metrics and logs. It is building apps with the idea that someone is going to watch them. Think of it as a property of a system—another attribute, like functionality, performance, or testability.

Observability vs. Monitoring

Monitoring | Observability
Tells you whether the system works | Lets you ask why it is not working
A collection of metrics and logs about the system | The dissemination of information from that system
Failure-centric | Understands system behavior regardless of an outage
Is "the how" / something you do | Is "the goal" / something you have
I "monitor" you | You "make yourself" observable

The volume, velocity, and variety of the data that is being collected are fundamentally unmanageable by humans. Observability allows the questions to be asked and the systems to manage themselves, using artificial intelligence (AI) and machine learning (ML) for sophisticated analytics.

The following are the signals that we have found to be critical for achieving full observability:

  • Distributed tracing
  • Telemetry
  • Logging

“OBSERVABILITY AS CODE” - IKEA Monitoring and Observability Approach

The IKEA Engineering Leadership Team in Amsterdam worked with AIOPSGROUP to enable monitoring and observability solutions. This work also serves as a new foundation for continuing IKEA’s service excellence for many years to come.

The approach was piloted with the engineering team working on the IKEA Locker project completed in April 2021.

As with all DevOps capabilities, installing a tool is not enough to achieve the objectives, although tools can help or hinder the effort. Monitoring systems should not be confined to a single individual or team within an organization. Empowering all developers to be proficient with monitoring helps develop a culture of data-driven decision-making and improves overall system debuggability, reducing outages. It builds a culture of “You build it, you run it, you own it”.

Let us not forget that measuring a distributed system means having observability in many places and being able to view them all together. This might mean both a frontend and its database, or it might mean a mobile application running on a customer’s device, a cloud load balancer, and a set of microservices. Being able to connect data from all these sources in one place is a fundamental requirement of modern observability tools, and one that IKEA was able to achieve.

And the work does not stop there. Part of operating a system is learning from outages and mistakes. The process of writing retrospectives or postmortems with corrective actions is well documented. One outcome of this process is the development of improved monitoring. It is critical to a fast-moving organization like IKEA to allow its monitoring systems to be updated quickly and efficiently by anyone within the organization.

Impact and Benefits

While collecting more detailed impact metrics will require at least six months, the following are early indicators of the benefits and impact of this solution:

  • Report on the overall health of systems (Are my systems functioning? Do my systems have sufficient resources available?)
  • A holistic view of the health of key defined KPIs
  • Discovery of incident root causes in minutes instead of hours
  • Improved leading indicators of an outage or service degradation
  • Detection of outages, service degradations, bugs, and unauthorized activity
  • Identification of long-term trends
  • Exposed unexpected side effects of changes or added functionality
  • The ability to continuously improve the tool
  • The ability to continuously calibrate KPIs
  • Improved IT practices – Adopted infrastructure as code for our monitoring and alerting infrastructure. Selected observability products that support configuration through version-controlled code and execution of APIs or commands via infrastructure CD pipelines.

Solution Architecture

The architecture is based on the three observability pillars: distributed tracing, logging, and telemetry. Built on open-source standards like OpenTelemetry, the architecture uses platform-agnostic tools for telemetry and employs infrastructure-as-code principles.

Distributed tracing

The microservices are auto-instrumented with Splunk’s OpenTelemetry Java Instrumentation Agent. The agent sends trace and span data in OTLP format to an OpenTelemetry Collector, which acts as a gateway. With this setup, one could use any observability system as the backend. In our case, that was Splunk Observability (formerly SignalFx).

The instrumentation agent supports W3C Trace Context, so a call can be traced back to its originating side. Trace data is also written to each log message, which makes it possible to correlate the log messages of different services that belong to a single trace.
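
To make the trace propagation concrete, here is a minimal sketch of how a W3C Trace Context `traceparent` header can be read. In the architecture described here the Java agent does this automatically; the `TraceParent` class and its field names are hypothetical and exist only for illustration. The sketch assumes the version `00` header layout (`version-traceid-parentid-flags`) and rejects anything else.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not part of any agent API): parses a version-00 traceparent header. */
final class TraceParent {
    // Layout: 2 hex version, 32 hex trace id, 16 hex parent span id, 2 hex flags.
    private static final Pattern VERSION_00 =
        Pattern.compile("00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})");

    final String traceId;
    final String parentSpanId;
    final boolean sampled;

    private TraceParent(String traceId, String parentSpanId, boolean sampled) {
        this.traceId = traceId;
        this.parentSpanId = parentSpanId;
        this.sampled = sampled;
    }

    /** Returns the parsed header, or empty if the value is missing or malformed. */
    static Optional<TraceParent> parse(String header) {
        if (header == null) {
            return Optional.empty();
        }
        Matcher m = VERSION_00.matcher(header.trim());
        if (!m.matches()) {
            return Optional.empty();
        }
        String traceId = m.group(1);
        String parentSpanId = m.group(2);
        // All-zero identifiers are defined as invalid by the specification.
        if (traceId.matches("0+") || parentSpanId.matches("0+")) {
            return Optional.empty();
        }
        // The lowest bit of the flags byte is the "sampled" flag.
        boolean sampled = (Integer.parseInt(m.group(3), 16) & 0x01) != 0;
        return Optional.of(new TraceParent(traceId, parentSpanId, sampled));
    }
}
```

A downstream service that receives the same trace ID in this header can attach it to its own spans and log messages, which is what allows a request to be followed across services.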

Telemetry

Two different types of telemetry are used in the architecture: infrastructure telemetry and custom telemetry. Infrastructure telemetry is focused on measuring the performance of Kubernetes workloads, while custom telemetry is focused on measuring product KPIs.

For infrastructure telemetry, the team used the Splunk OpenTelemetry Connector for Kubernetes because IKEA microservice deployments target Kubernetes.

For custom telemetry, IKEA chose to include Micrometer.io in each microservice’s dependency list. It is easy to use and configure, and it is the default metrics provider for widely used Java frameworks like Spring Boot and Quarkus. Micrometer is platform-agnostic: it focuses on providing metric concepts and a feature-rich set of metric implementations, while also offering built-in support for most time-series databases (IKEA used the widely adopted Prometheus).
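
To illustrate what a custom product KPI metric looks like, the sketch below models a tagged counter using only the JDK. It is a stand-in for the concept, not the Micrometer API (Micrometer provides its own counter and tag types) and not IKEA's code. The class `KpiCounters` and the metric name `locker_reservations_total` are hypothetical. The output imitates the Prometheus text exposition style, and label-value escaping is deliberately left out.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;
import java.util.stream.Collectors;

/** Plain-JDK stand-in for a dimensional (tagged) counter; not the Micrometer API. */
final class KpiCounters {
    // One entry per time series: metric name plus its sorted tags.
    private final Map<String, LongAdder> series = new ConcurrentHashMap<>();

    /** Adds one to the series identified by the metric name and tags. */
    void increment(String name, Map<String, String> tags) {
        series.computeIfAbsent(seriesKey(name, tags), key -> new LongAdder()).increment();
    }

    /** Current value of a series, or 0 if it was never incremented. */
    long count(String name, Map<String, String> tags) {
        LongAdder adder = series.get(seriesKey(name, tags));
        return adder == null ? 0L : adder.sum();
    }

    /** Renders every series as "name{label="value"} count", one per line, sorted by key. */
    String scrape() {
        Map<String, LongAdder> sorted = new TreeMap<>(series);
        return sorted.entrySet().stream()
            .map(entry -> entry.getKey() + " " + entry.getValue().sum())
            .collect(Collectors.joining("\n"));
    }

    private static String seriesKey(String name, Map<String, String> tags) {
        if (tags.isEmpty()) {
            return name;
        }
        // Sorting the tags makes the key independent of insertion order.
        Map<String, String> sortedTags = new TreeMap<>(tags);
        String labels = sortedTags.entrySet().stream()
            .map(tag -> tag.getKey() + "=\"" + tag.getValue() + "\"")
            .collect(Collectors.joining(","));
        return name + "{" + labels + "}";
    }
}
```

The point of the tags is that one KPI, such as reservations, can be broken down by outcome at query time without defining a separate metric per outcome.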

Telemetry and distributed tracing data reach Splunk Observability through the OpenTelemetry Collector, the central gateway running on Kubernetes.

Logging

Logs play a crucial part in investigations. The more capability your organization has for analyzing logs, the faster it gets to the root cause of a problem. This is why Splunk is used to aggregate the logs from Kubernetes workloads. Apart from its feature-rich set of data analysis functions, it also provides a UI for building your own dashboards, with capabilities such as grouping log messages by trace ID, filtering, and anomaly detection.
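
The idea of grouping log messages by trace ID can be sketched in a few lines of Java. In the solution described here, Splunk does this work. The `LogCorrelator` class is hypothetical, and the `trace_id=<32 hex characters>` field format is an assumption about how the trace ID appears in a log line.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: groups log lines by the trace id embedded in each line. */
final class LogCorrelator {
    // Assumed log field format: trace_id=<32 lowercase hex characters>.
    private static final Pattern TRACE_ID = Pattern.compile("\\btrace_id=([0-9a-f]{32})\\b");

    private LogCorrelator() {}

    /** Returns trace id -> log lines, in order of first appearance; lines without a trace id are skipped. */
    static Map<String, List<String>> groupByTraceId(List<String> lines) {
        Map<String, List<String>> byTrace = new LinkedHashMap<>();
        for (String line : lines) {
            Matcher m = TRACE_ID.matcher(line);
            if (m.find()) {
                byTrace.computeIfAbsent(m.group(1), id -> new ArrayList<>()).add(line);
            }
        }
        return byTrace;
    }
}
```

Given lines from several microservices, each resulting group holds the messages for one request, which is the view an engineer needs when following an incident from the first failing call to its root cause.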

Technical stack

A summary of the technologies used in the solution is as follows:

  • Micrometer.io: Micrometer is used to collect and publish custom metrics to SignalFx from the microservices context.
  • Splunk OpenTelemetry Java Instrumentation Agent: The JVM Agent is used to instrument the microservices and send trace data in OTLP format to an OpenTelemetry Collector.
  • Splunk OpenTelemetry Connector for Kubernetes: This is a Kubernetes DaemonSet along with other Kubernetes objects in a Kubernetes cluster that provides a unified way to receive, process, and export metric, trace, and log data.
  • SignalFlow: This is the language used to define issue detectors based on distributed tracing data and telemetry data.
  • Terraform: Terraform configurations are used to deploy all the required infrastructure: Splunk OpenTelemetry Connector for Kubernetes and alert detectors in SignalFx.
  • Splunk: Splunk is used to aggregate the logs produced by the microservices running on Kubernetes. Custom-built dashboards facilitate issue investigations by providing search and correlation capabilities over the log messages.
  • Splunk Observability: The platform for infrastructure and application performance monitoring. It provides integrations and built-in dashboards for the major cloud providers and their services (AWS, GCP, Azure; VMs, Serverless, Container Orchestration, Storage)

Harmonizing the Architecture

Each new microservice is created with the following:

  • pom.xml with Micrometer.io dependencies
  • Dockerfile with OpenTelemetry JVM Agent
  • Helm chart for the microservice with configurations enabling the JVM Agent to send traces and spans to an OpenTelemetry Collector running in a K8S cluster
  • Terraform module for the deployment of a Splunk OpenTelemetry Connector for Kubernetes configured to receive infrastructure telemetry, custom telemetry, tracing data, and forward everything to Splunk Observability
  • Terraform modules providing dashboards that visualize microservice JVM metrics, error rate, and latency
  • Terraform modules providing generic alert detectors for issues with the error rate and latency of microservices running on K8S

All the mentioned items are built into the Development Portal Provisioner so that every new microservice will get observability as code basics out of the box.
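
The generic error-rate detector mentioned above is defined in SignalFlow and deployed through Terraform in this solution. To show only the underlying logic, here is a minimal Java sketch of a threshold detector over a sliding window of requests. The `ErrorRateDetector` class, the window size, and the threshold are illustrative assumptions, not the project's actual detector definition.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Illustrative only: fires when the error rate over the last N requests exceeds a threshold. */
final class ErrorRateDetector {
    private final int windowSize;
    private final double threshold;
    private final Deque<Boolean> window = new ArrayDeque<>();
    private int errors;

    ErrorRateDetector(int windowSize, double threshold) {
        if (windowSize <= 0) {
            throw new IllegalArgumentException("windowSize must be positive");
        }
        this.windowSize = windowSize;
        this.threshold = threshold;
    }

    /** Records one request outcome and returns true if the detector is firing. */
    boolean record(boolean isError) {
        window.addLast(isError);
        if (isError) {
            errors++;
        }
        // Evict the oldest outcome once the window is over capacity.
        if (window.size() > windowSize && window.removeFirst()) {
            errors--;
        }
        // Evaluate only on a full window, so a handful of early requests cannot trigger an alert.
        return window.size() == windowSize && errorRate() > threshold;
    }

    /** Share of failed requests in the current window, between 0.0 and 1.0. */
    double errorRate() {
        return window.isEmpty() ? 0.0 : (double) errors / window.size();
    }
}
```

A latency detector follows the same pattern, with a latency value per request and a percentile or mean in place of the error share.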

Looking Ahead

The IKEA team in partnership with AIOPSGROUP also continues to add new capabilities to the existing Observability solution. The new capabilities include:

  • Programmatically define SLOs and alerts based on the SLOs in a platform-agnostic way
  • Route telemetry to Splunk Enterprise and correlate it with aggregated logs and other data
  • Route incident management events to Splunk Enterprise and correlate them with telemetry, logs, and SLOs to identify trends, detect anomalies and create a holistic view of the system
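
As a rough illustration of what defining alerts from SLOs can mean, the sketch below computes how much of an error budget has been consumed. The `ErrorBudget` class, the 99.9% target, and the alert threshold are assumptions made for the example and do not describe the planned implementation.

```java
/** Illustrative error-budget arithmetic for a request-based SLO. */
final class ErrorBudget {
    private ErrorBudget() {}

    /** Number of failed requests the SLO tolerates, e.g. a 0.999 target over 100,000 requests allows about 100. */
    static double allowedFailures(double sloTarget, long totalRequests) {
        return (1.0 - sloTarget) * totalRequests;
    }

    /** Fraction of the error budget already spent; values above 1.0 mean the SLO is breached. */
    static double budgetConsumed(double sloTarget, long totalRequests, long failedRequests) {
        double allowed = allowedFailures(sloTarget, totalRequests);
        if (allowed <= 0.0) {
            // No budget at all: any failure exhausts it.
            return failedRequests > 0 ? Double.POSITIVE_INFINITY : 0.0;
        }
        return failedRequests / allowed;
    }

    /** True when the consumed share of the budget has reached the given alerting threshold. */
    static boolean shouldAlert(double sloTarget, long totalRequests, long failedRequests, double alertAt) {
        return budgetConsumed(sloTarget, totalRequests, failedRequests) >= alertAt;
    }
}
```

For example, with a 99.9% target and 100,000 requests, 50 failures consume roughly half of the budget, so an alert set at 75% of the budget would not fire yet, while 80 failures would trigger it.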

Working Together – IKEA & AIOPSGROUP

IKEA, led by Head of Product Engineering Tom Haveman, worked through the goals, expectations, and projected outcomes of the initiative. The team documented this as their “Strategic Intent” and conducted success planning workshops to create common expectations, assess the complexity of the project, and understand the transformation readiness of the organization. Based on this, an action plan was created to ensure that the expected impact, benefits, and behavior change from the new engineering approach were maximized.

The AIOPSGROUP team developed the Observability as Code solution, applying its wealth of experience from AIOPS Monitoring for eCommerce. AIOPSGROUP Architect Petar Gadjev provided deep technical architecture and design guidance, including best practices for performance and scalability. This has resulted in high performance and low maintenance, with centralized data, increased mobility, and data integrity. The teams worked in close collaboration with each other and with IKEA product owners to achieve the set goals.

AIOPSGROUP