We asked IT Admins & Entrepreneurs about today’s most urgent challenges in dealing with outages.
David Gildeh is the CEO and Founder of dataloop.io, a company working on new, easy-to-use monitoring services.
David’s advice for tackling outages, in short:
- Find the root cause of the outage
- Keep monitoring dashboards simple and allow for individual alerts
- Keep everyone in mind (not just IT teams) when making monitoring systems available
David, what do you consider the biggest challenge in dealing with outages?
Definitely finding the root cause. In today’s complex, constantly changing cloud environments, being able to identify the root cause of an outage quickly is really important.
Today’s monitoring systems collect a lot of data and can sometimes overwhelm users making it hard to pinpoint exactly what went wrong.
To make things worse, a lot of operations teams leave a "sea of red" checks on their dashboard or spammy alerts, masking real errors. This is mainly due to the amount of friction involved in updating monitoring checks as the system changes between releases.
How do you go about addressing these problems?
We address the issue of friction in keeping monitoring checks up to date and try to reduce it.
This significantly reduces the number of spammy alerts coming from operations teams’ monitoring systems, so they’re not getting false alerts just because things have changed between releases. It also makes it a breeze to add coverage, so there are no gaps that could create a blind spot ahead of a later outage.
We do this in our product in several ways:
- Synchronized Configuration
You can update your monitoring configuration per section to fix failed checks or add new ones. As you make changes in our UI or via our command-line/API tools, they are synchronized to all your servers within seconds. This way you can immediately add new checks, collect new metrics, or modify existing checks, keeping your configuration up to date. This typically reduces the process of adding or modifying a check from 15+ minutes to less than 30 seconds, so users don’t fall behind and can maintain 100% coverage of their environment.
- Simple User Interface
We’ve reduced what usually takes 10-20 steps with most monitoring systems to just 2-3 steps. Again, we’ve lowered the friction of updating your monitoring configuration so you can easily get rid of spammy alerts and reach 100% coverage.
Today’s User Interfaces must be geared not only toward operational people but ideally to the whole team.
- If This Then That Alerting
We have a powerful IFTTT alert engine that allows users to quickly set up routines that check issues from multiple sources before taking an action (i.e. sending an alert). This significantly reduces the number of spammy alerts and also lets users automate some actions (like clearing disk space) so that issues are resolved quickly without human intervention, again avoiding outages.
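To make the pattern concrete, here is a minimal sketch of such an if-this-then-that rule in Python. It is not Dataloop.IO’s engine or API; the function names and thresholds are hypothetical stand-ins. The shape of the rule is the point: confirm an issue from more than one source, attempt an automated fix first, and only page a human if the fix fails.

```python
#!/usr/bin/env python3
"""Illustrative IFTTT-style alert rule (not Dataloop.IO's API)."""
import shutil
import subprocess

DISK_CRIT_PCT = 90  # hypothetical threshold


def disk_pct(path="/"):
    """Percentage of the filesystem at `path` that is in use."""
    u = shutil.disk_usage(path)
    return u.used / u.total * 100


def logs_show_write_errors():
    """Second signal to confirm real impact; a real rule might
    query a log index here. Stubbed out for this sketch."""
    return False


def clear_old_tmp_files():
    """Automated remediation: prune week-old files from /tmp."""
    subprocess.run(
        ["find", "/tmp", "-type", "f", "-mtime", "+7", "-delete"],
        check=False,
    )


def alert(message):
    """Stand-in for a paging/notification integration."""
    print(f"ALERT: {message}")


def run_rule():
    # IF the disk is nearly full AND the logs confirm impact...
    if disk_pct() > DISK_CRIT_PCT and logs_show_write_errors():
        clear_old_tmp_files()            # ...THEN try the automated fix
        if disk_pct() > DISK_CRIT_PCT:   # only page a human if that failed
            alert(f"disk still >{DISK_CRIT_PCT}% full after cleanup")


if __name__ == "__main__":
    run_rule()
```

Requiring two independent signals before acting is what cuts the spammy alerts: a threshold blip on one metric alone never pages anyone.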
Longer term, we will look into new visualization and machine learning algorithms that help users quickly understand the data they’re seeing, with the computer aiding the search for root causes during outages. But we see that as step 2. The first step is getting all the correct checks and metrics into the monitoring solution, which still requires humans to set up with their context of the system architecture and expected behaviour. Hence today we focus on making the collection part incredibly easy (reducing friction).
What has to change in order to minimize these problems?
To sum it up, monitoring tools need to get better at handling the increasing complexity and rate of change happening in today’s modern cloud services.
They need to make collecting information incredibly easy and frictionless so that all teams, not just operations, can add their own checks and metrics, and then present that information in a way that helps users understand what’s really going on.
To date most monitoring tools have fallen behind and are incredibly clunky and complex to keep up to date and use.
Three additional questions about dataloop.io
What is Dataloop and what makes it special?
Dataloop.IO is a new monitoring tool designed for DevOps & Operations teams running online services. It’s been designed to work out of the box with Cloud, DevOps and Micro-Services, making it effortless to set up monitoring across multiple teams, in minutes. What makes it special is our powerful real-time sync engine that allows users to quickly write new Nagios checks in any language, test them remotely, and then deploy them to hundreds of servers in real time, instantly adding coverage and collecting new metrics. This makes Dataloop.IO easy for other teams to adopt and to set up for their own metrics and checks. It also makes Dataloop.IO incredibly agile in keeping up with the rate of change as the service evolves between releases.
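For readers unfamiliar with Nagios checks: a check is just a small program that prints one status line and exits with a conventional code (0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN), which is why they can be written in any language. A minimal disk check in Python might look like this (the thresholds are arbitrary examples, not Dataloop.IO defaults):

```python
#!/usr/bin/env python3
"""Minimal Nagios-style disk usage check (illustrative)."""
import shutil
import sys

WARN_PCT = 80  # example warning threshold
CRIT_PCT = 90  # example critical threshold


def check_disk(path="/"):
    """Return a (status line, exit code) pair per the Nagios convention."""
    usage = shutil.disk_usage(path)
    pct = usage.used / usage.total * 100
    if pct >= CRIT_PCT:
        return f"CRITICAL - disk {pct:.0f}% full", 2
    if pct >= WARN_PCT:
        return f"WARNING - disk {pct:.0f}% full", 1
    return f"OK - disk {pct:.0f}% full", 0


if __name__ == "__main__":
    message, code = check_disk()
    print(message)       # the monitoring system reads this line
    sys.exit(code)       # and this exit code
```

A sync engine like the one described above only has to ship this one file to each server and run it on a schedule, which is what makes adding coverage fast.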
How does it work?
Find out more about Dataloop’s functionality from this concept video:
For an even more detailed explanation, check out David’s presentation at Monitorama 2014.
Who is your target group?
Dataloop.IO is designed specifically for Operations & DevOps teams running online services. These teams will typically piece together tools like Nagios, Graphite and StatsD, to get the equivalent of Dataloop.IO over several months. We can provide the same capability in minutes, with a solution that is designed for the world of Cloud, DevOps and Micro-Services.
Thank you for this interview, David.