App reliability with Azure: smarter monitoring without expensive tools

Applications need to work. Always. That sounds obvious, but in practice, application reliability is a constant challenge. Downtime costs money, sluggish performance frustrates users, and unexpected errors create stress for IT teams. Many organizations turn to feature-rich, high-cost monitoring platforms like Dynatrace. At Lume, we prefer to build app reliability on Azure-native tools, an approach that is often smarter, more flexible, and more cost-effective.

Improve application reliability in three phases
Azure-native tools are cost-effective and flexible
Learn how health modeling improves reliability

Maarten Ghijsens - Cloud Architect

Our approach to app reliability

At Lume, we use a three-phase methodology. This structure lets us work with the client, step by step, towards a more reliable application environment.

Phase 1

It all starts with a clear understanding of the current situation and goals. In this first phase, we dive deep into discussions with the client.

Intake and expectations

What are the specific concerns? Does the client want a general overview, or are there particular issues like slowness or downtime? Clarifying expectations up front is crucial.

Architecture and infrastructure analysis

We examine the technical setup of the application. What does the infrastructure look like? What architectural decisions have been made? This matters because certain design choices (or the lack of them) can impact performance.

Identifying workloads

We map out the different workloads: the specific processes or tasks the application handles. A common example is an app that generates payslips monthly and tax statements annually. The annual run is a heavy workload that executes only once a year. If it shares infrastructure with the daily tasks, it can slow down the entire app while it runs. We look for those kinds of ‘hidden’ intensive processes.
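As a minimal sketch of how such hidden workloads might be flagged, assume each workload has been inventoried with its run frequency, its peak CPU usage, and whether it shares infrastructure with other workloads. The `Workload` fields, the thresholds, and the sample numbers below are all hypothetical and only serve to illustrate the idea.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    name: str
    runs_per_year: int            # how often the workload executes
    peak_cpu_percent: float       # highest CPU usage observed during a run
    shares_infrastructure: bool   # True if it runs on the same resources as other workloads


def hidden_heavy_workloads(workloads, max_runs_per_year=12, cpu_threshold=80.0):
    """Return names of workloads that run rarely, run hot, and share infrastructure."""
    return [
        w.name
        for w in workloads
        if w.runs_per_year <= max_runs_per_year
        and w.peak_cpu_percent >= cpu_threshold
        and w.shares_infrastructure
    ]


# Hypothetical inventory based on the payslip example; the numbers are made up.
inventory = [
    Workload("daily-api", runs_per_year=365, peak_cpu_percent=60.0, shares_infrastructure=True),
    Workload("payslips", runs_per_year=12, peak_cpu_percent=55.0, shares_infrastructure=True),
    Workload("tax-statements", runs_per_year=1, peak_cpu_percent=95.0, shares_infrastructure=True),
]

flagged = hidden_heavy_workloads(inventory)
print(flagged)
```

With this sample inventory, only the annual tax statement run is flagged: it is the one workload that is both rare and heavy while sharing resources with the daily tasks.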

Current monitoring setup

What is the client using today? Which tools? What data is being collected?

Action plan

Based on all that information, we create a concrete action plan for the next phase.

At this stage, we’re not yet installing tools or collecting data. It’s still about understanding, analysis, and planning.

Phase 2

In the second phase, we roll up our sleeves and start implementing and improving monitoring and reliability. This involves several steps:

Standardizing and centralizing

We often see a mix of tools and approaches used by internal teams and external partners. Everyone has their own way of doing things, which leads to fragmented monitoring that isn’t very effective. Our first step is therefore usually to standardize and centralize: we bring everything together in one place, with one clear set of guidelines.

The power of Azure-native tools

Here’s where we differ from platforms like Dynatrace. We believe strongly in the flexibility and strength of Azure’s native ecosystem. We use a core set of tools:

Azure Monitor: The central hub for monitoring data in Azure.
Log Analytics: For storing and analyzing logs.
Application Insights: Focused on app performance and diagnostics.
OpenTelemetry: An open standard for collecting telemetry (logs, metrics, traces) from apps.
Grafana: For powerful, flexible dashboards and visualizations (also available as a managed service in Azure).

With those five tools, we can cover the monitoring and alerting needs of most scenarios. Yes, expensive tools such as Dynatrace may have highly specific, complex features that are hard to replicate. The key question remains, though: do you really need them?

Quite often, you’ll pay for a costly suite while using only a fraction of its features. Our approach is more cost-effective and focuses on what truly adds value. For instance, we add business context to monitoring so the impact of technical issues on the business becomes much clearer.

Starter packs for a quick launch

We’ve developed ready-to-go starter packs: sets of alerts and dashboards that provide immediate insights into basic infrastructure metrics (CPU, RAM, slowest DB queries, etc.). This gives you a solid start. Later, we fine-tune them per application and workload. We also offer an IaC Starter Pack (Infrastructure as Code), so new alerts and dashboards can be deployed consistently and automatically.
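The idea of a shared baseline that is later fine-tuned per workload can be sketched as follows. This is an illustration only: the rule names, fields, and threshold values are hypothetical, and the structure is neither the content of Lume's starter packs nor the schema of Azure Monitor alert rules.

```python
from copy import deepcopy

# Hypothetical baseline alert rules, standing in for a starter pack.
STARTER_PACK = {
    "cpu_percent": {"threshold": 85, "window_minutes": 5, "severity": "warning"},
    "memory_percent": {"threshold": 90, "window_minutes": 5, "severity": "warning"},
    "slow_query_ms": {"threshold": 2000, "window_minutes": 15, "severity": "info"},
}


def rules_for(overrides):
    """Return the baseline rules with per-workload overrides applied.

    The baseline itself is never modified, so every workload starts
    from the same defaults.
    """
    rules = deepcopy(STARTER_PACK)
    for metric, changes in overrides.items():
        rules.setdefault(metric, {}).update(changes)
    return rules


# A batch workload that is expected to run hot gets a higher CPU threshold.
batch_rules = rules_for({"cpu_percent": {"threshold": 95}})
print(batch_rules["cpu_percent"])
```

Keeping the overrides small and explicit makes it easy to see where a workload deviates from the standard, which is also what makes such definitions suitable for deployment through Infrastructure as Code.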

Health analysis: How healthy is the app?

Once we have the data, we can analyze the application’s health. We assess four key areas:

Observability: Do we see everything we need to see?
Availability: Are the app and its components available?
Scalability: Can the app handle variable loads? Do we spot workloads that might cause issues during peak times?
Fault tolerance: How well does the app handle failures? Can it recover on its own?
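
Some of these areas can be reduced to simple numbers. As a minimal sketch, assume availability is measured as the share of successful health probes over a period. That is one common definition, not necessarily the one used in every engagement, and the probe counts and the 99.9% target below are hypothetical.

```python
def availability_percent(probe_results):
    """Return the percentage of successful probes in a list of booleans."""
    if not probe_results:
        raise ValueError("at least one probe result is required")
    successes = sum(1 for ok in probe_results if ok)
    return 100.0 * successes / len(probe_results)


# Hypothetical day of one-minute probes: 1440 probes, 3 of which failed.
probes = [True] * 1437 + [False] * 3

availability = availability_percent(probes)
meets_target = availability >= 99.9  # hypothetical availability target

print(round(availability, 2), meets_target)
```

Three failed minutes in a day already drop availability to roughly 99.79%, below the assumed 99.9% target, which shows how little room such a target leaves.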

Health modeling: Intelligent monitoring

This is a crucial step: without health modeling, monitoring is just a good intention. We define what ‘healthy’, ‘degraded’ (reduced performance), and ‘unhealthy’ mean per workload. 90% CPU usage might be fine for one workload, but critical for another. By implementing this model:

We make reliability measurable.
We avoid alert fatigue: when there are too many irrelevant alerts, all of them end up being ignored. We only want alerts for real issues or risks.
We fine-tune the alerts from the starter packs and add new, specific ones.
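
A health model of this kind can be sketched in a few lines. The example below is a simplified illustration, assuming CPU usage is the only signal and that alerts fire only when a workload moves into a different non-healthy state. The workload names and thresholds are hypothetical, and a real model would combine several signals.

```python
from dataclasses import dataclass
from enum import Enum


class Health(Enum):
    HEALTHY = "healthy"
    DEGRADED = "degraded"
    UNHEALTHY = "unhealthy"


@dataclass(frozen=True)
class CpuModel:
    degraded_at: float   # CPU % at which the workload counts as degraded
    unhealthy_at: float  # CPU % at which the workload counts as unhealthy


# Hypothetical per-workload definitions: the same CPU value means different things.
HEALTH_MODELS = {
    "batch-reporting": CpuModel(degraded_at=95.0, unhealthy_at=99.0),
    "checkout-api": CpuModel(degraded_at=70.0, unhealthy_at=90.0),
}


def classify(workload, cpu_percent):
    """Map a CPU reading to a health state using the workload's own thresholds."""
    model = HEALTH_MODELS[workload]
    if cpu_percent >= model.unhealthy_at:
        return Health.UNHEALTHY
    if cpu_percent >= model.degraded_at:
        return Health.DEGRADED
    return Health.HEALTHY


def should_alert(previous, current):
    """Alert only on a change into a non-healthy state, not on every reading."""
    return current is not previous and current is not Health.HEALTHY


print(classify("batch-reporting", 90.0))  # Health.HEALTHY
print(classify("checkout-api", 90.0))     # Health.UNHEALTHY
```

Here 90% CPU is healthy for the batch workload but unhealthy for the API. Because `should_alert` reacts to state changes rather than individual readings, a workload that stays unhealthy does not generate a new alert on every measurement, which is one simple way to limit alert fatigue.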

Phase 3 (optional)

If you want complete peace of mind, we also offer managed services.

Alert and incident management

We receive and analyze alerts 24/7, create incidents in our system, and handle them. Azure platform issues? We resolve them immediately. Application issues? We escalate them to the appropriate development teams or partners. The client doesn’t have to worry about a thing.

Problem management

We identify recurring problems and proactively suggest solutions to prevent them from happening again.

Quarterly review

Optionally, we offer a quarterly review to discuss:

Triggered alerts and handled incidents
Trends in performance and reliability
Logging and monitoring costs
New workloads or applications
The need to adjust the health model
Application errors and their trends

Want more reliable apps without the overhead?

Want to learn more about how Lume can improve your app reliability using Azure-native tools? Struggling with slow applications, unexpected downtime, or fragmented monitoring? We’d be happy to show how our approach can help you gain better insight, control, and reliability without the costs of monolithic tools.
