When Your Website Slows Down, Don’t Frown

Turn that frown upside down

It can be a real nail-biter to figure out why your site took a turn for the worse. It may not have slipped into its sluggish habits intentionally, but it did. This condition is not forever. You just have to get your site back on track. What trainer will be able to whip your site back into shape, you ask? The one and only Deterministic Root Cause Algorithm.

This lean, mean, mathematical fighting machine tracks down the reason for high response time. In an N-tier architecture, the problem can be with the web server, the database server, or other server components. To pinpoint the problem, we determine which server is the source of the high response time. The algorithm traces backwards from server to server, checking for a large increase in response time at each hop. As soon as the large increase disappears, we can zero in on the last server that showed it: the root cause of the high response time.

Here is the skinny: The slowness initially affects one server, then propagates back toward the web server. Once we find that server, we determine which resources had increased usage and which specific processes were using them.

The Technical Details:

Alerts for high response time are set up automatically on each server. By default, an alert fires when the response time stays above 1 second for 5 consecutive minutes. As a user, you can then configure both the threshold and the duration to suit your needs.
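To make the threshold-and-duration rule concrete, here is a minimal sketch in Python. It is not our production code: the class name `ResponseTimeAlert`, its fields, and the assumption that we receive one average response time sample per minute are all hypothetical and chosen for illustration.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class ResponseTimeAlert:
    """Hypothetical alert rule: fire when every one of the most recent
    `duration_min` per-minute samples is above `threshold_s`."""

    threshold_s: float = 1.0  # default from the post: 1 second
    duration_min: int = 5     # default from the post: 5 consecutive minutes

    def should_fire(self, per_minute_art: Sequence[float]) -> bool:
        # Not enough history yet to cover the full duration.
        if len(per_minute_art) < self.duration_min:
            return False
        recent = per_minute_art[-self.duration_min:]
        return all(sample > self.threshold_s for sample in recent)


# Default rule, and a user-configured rule with a different threshold and duration.
default_rule = ResponseTimeAlert()
custom_rule = ResponseTimeAlert(threshold_s=2.0, duration_min=2)
```

A single minute back under the threshold resets the streak, which is what "consecutive" means here.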

When the average response time (ART) is calculated for a server, we consider the processes that we have intercepted and summarized. We want to get an accurate picture, so we only consider connections where that machine is the server, and not ones where it is a client. Once we see the impact on the server, we can fine-tune the alert so the issue does not affect the connecting clients.
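The server-side-only rule can be sketched as follows. This is an illustration under stated assumptions, not our actual data pipeline: the `ConnectionSample` record and the `average_response_time` helper are hypothetical names, and each sample is assumed to carry the client, the server, and one measured response time.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class ConnectionSample:
    """Hypothetical record for one summarized connection."""
    client: str
    server: str
    response_time_s: float


def average_response_time(samples: Iterable[ConnectionSample],
                          machine: str) -> Optional[float]:
    """ART for `machine`, counting only connections where it is the server."""
    served = [s.response_time_s for s in samples if s.server == machine]
    if not served:
        return None  # nothing recorded where this machine is the server
    return sum(served) / len(served)


samples = [
    ConnectionSample("web1", "app", 0.4),
    ConnectionSample("web2", "app", 0.6),
    ConnectionSample("app", "db", 2.0),  # here "app" is the client, so it is ignored for app's ART
]
app_art = average_response_time(samples, "app")
```

In this example the slow call from `app` to `db` does not inflate the ART of `app`; it counts toward the ART of `db` instead.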

The Algorithm:

Let’s break it down:

1. If the threshold is exceeded for the full length of the set duration, an alert is triggered. We compare the ART for each minute of the duration to the ART for the previous hour. Here, the ART is calculated over all connections to the server. If the ART for each minute is at least 50% higher than it was in the preceding hour, the root cause algorithm kicks in. If not, we trigger a normal resource alert, similar to the alerts on high memory and CPU.

2. Starting from the server that raised the alert, we trace the connections in which this machine acts as a client to other servers. For example, if the initial server is A, and it connects to a server B, then we get the ART for server B over the connections coming from server A. If the value for each minute of the duration is at least 25% higher than in the previous hour, we trace back to server B and repeat. But if that is not the case, (you got it), server A is the root cause.

3. If there are multiple servers to trace, which all have a 25% increase in response time, we follow every path. This can result in multiple root cause servers and/or multiple paths to a single root cause server. You want the report on all of them? You got it! If there are more than three paths to a server at the end of running the algorithm and we are not confident of the cause, we will keep you in the loop with a regular resource alert email until we can identify the cause.

4. Once we determine the root cause(s), we examine resource usage on the root cause server. We determine which resources had a large percent increase in usage. Then for each of these resources, we find which processes had the largest usage increase. Knowing this, we can report, for each root cause server, which resources had a large usage increase and which specific processes are the root cause.
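The four steps above can be sketched in Python. Treat this as a rough illustration rather than the implementation we run: the function names, the dictionary shapes (`downstream`, `edge_art`), the cycle guard, and the example servers are all assumptions made for the sketch. Only the 50% and 25% thresholds and the three-path limit come from the steps themselves.

```python
def increased(minutes, previous_hour, factor):
    """True if every per-minute ART is at least `factor` times the previous-hour ART."""
    return previous_hour > 0 and all(m >= factor * previous_hour for m in minutes)


def find_root_causes(start, downstream, edge_art, edge_factor=1.25):
    """Steps 2 and 3: follow every client->server hop whose ART rose by at least 25%.

    downstream: dict mapping a server to the servers it connects to as a client
    edge_art:   dict mapping (client, server) to (per_minute_arts, previous_hour_art)
    Returns a dict mapping each root cause server to the list of paths leading to it.
    """
    roots = {}
    stack = [[start]]
    while stack:
        path = stack.pop()
        current = path[-1]
        slow_next = [
            nxt for nxt in downstream.get(current, [])
            if nxt not in path                      # cycle guard (an assumption)
            and (current, nxt) in edge_art          # no ART recorded: cannot trace
            and increased(*edge_art[(current, nxt)], edge_factor)
        ]
        if not slow_next:
            # No downstream hop shows the increase, so this server is a root cause.
            roots.setdefault(current, []).append(path)
        for nxt in slow_next:
            stack.append(path + [nxt])
    return roots


def is_confident(roots, max_paths=3):
    """Step 3 fallback: more than three paths to one server means low confidence."""
    return all(len(paths) <= max_paths for paths in roots.values())


def percent_increases(before, after):
    """Step 4: rank resources (or processes) by percent increase in usage, largest first."""
    changes = [
        (name, (after.get(name, old) - old) / old * 100.0)
        for name, old in before.items() if old > 0
    ]
    return sorted(changes, key=lambda item: item[1], reverse=True)


# Step 1 gate: the alerting server's own ART must be at least 50% above the previous hour.
if increased([1.8, 1.9, 2.0, 1.9, 1.8], previous_hour=1.0, factor=1.5):
    downstream = {"web": ["app"], "app": ["db", "cache"]}
    edge_art = {
        ("web", "app"): ([0.9, 1.0, 1.1, 1.0, 0.9], 0.4),
        ("app", "db"): ([0.8, 0.9, 0.8, 0.9, 0.8], 0.2),
        ("app", "cache"): ([0.01, 0.01, 0.01, 0.01, 0.01], 0.01),
    }
    roots = find_root_causes("web", downstream, edge_art)
    ranked = percent_increases({"cpu": 20.0, "memory": 50.0},
                               {"cpu": 80.0, "memory": 55.0})
```

In this made-up example the slowdown is followed from `web` to `app` to `db`, while the `cache` hop is dropped because its ART did not rise, so `db` ends up as the single root cause. The same ranking helper can be applied twice on the root cause server: once across resources, then across processes for each resource that stands out.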

Limitations:

As with everything, there is always a limitation. The ART is calculated from the data collected from intercepted and summarized processes. If a server’s processes are not intercepted, things just do not add up. This is what happens when there are too many short-lived processes to summarize. You can count on us to crunch the numbers and isolate the problem if we can.

This kind of limitation exists with the Postgres database. Because so much work is done in short-lived processes, we often cannot record any ART value for Postgres. If the problem is with this database, there would be no way to trace it.

While we cannot guarantee that the resources and processes mentioned above are the root cause, we have a hunch that it is the case. And if not, we will not stop until we find it.

The Formula:

The greater the increase in response time and resource usage, the more likely it is to lead us to the accurate root cause.

Why We Chose HBase

A long-awaited follow-up post to A RAM-Based Data Architecture.

It is about time we told you why we chose HBase as our preferred data store: Simplicity.

When you are faced with making a decision between A and B, you have to remember there is more than meets the eye. Initially, both Cassandra and HBase were in the running to be AppFirst’s next data store. But as time progressed and we had the chance to explore both, we were convinced that HBase offered not only efficiency, but also the consistency we need to present thorough data to our customers on a regular basis.

The hype and positive reviews surrounding HBase were certainly a contributing factor in our decision making. With active and ongoing open source development, we were reassured that HBase would be here to stay. Its performance had a significant impact on our decision as well. It sustains an enormous number of writes, and the read cycle times were much better than we had anticipated. Further, it gives us the option to interact with the Hadoop ecosystem, including HDFS, MapReduce, and ZooKeeper. Our enthusiasm for HBase skyrocketed when we discovered how to create MapReduce apps to do a number of management tasks. While Cassandra also has these capabilities, its data model was fundamentally more complex.

As we all know, good performance is dependent on maintenance and management. With Cloudera’s competent and resourceful support, installing, upgrading, and deploying HBase is really as easy as 1, 2, 3. Cloudera has taken simplification to a whole new level. The only thing for you to do is install your RPMs or DEBs. The management of the cluster is first-rate. Not only is it easy to add a server to the cluster to increase data capacity, but rebalancing the cluster is equally effortless. Cloudera also provides Puppet templates to deploy a consistent cluster. What more could we ask for?

Well, we could ask for just one thing. The current version only allows us to recover data manually. We are confident, however, that the next version will offer an alternative.

What can the cloud do for you?

As a SaaS-based solution, we already know about the advantages of operating in the cloud. But how do you, the reader/customer/StumbleUpon visitor, feel about the cloud? How do you know if it can benefit you and your company? How do you know that the cloud can keep your data safe and your application stable?

We recently sat down with Eli Almog, a thought leader in the application performance management space. Eli formerly served in the office of the CTO at BMC. Over the course of our discussion, he went over how to prepare to migrate to the cloud, and what to do once you’re there. Click the link below to view the interview as a PDF.

We have also embedded a beautiful infographic on how cloud computing has changed business. It was designed from a CSC survey of 3,645 IT decision makers in eight countries.

>>>What can the cloud do for me?

Ahead in the Cloud - The CSC Cloud Usage Index

Regarding the issues that occurred on December 8th

Hello to all our valued customers,

AppFirst is going through very exciting times right now. We’ve successfully rolled out our log file data offering (hundreds of logs have already been uploaded in just a short time) and we’re constantly working to provide improved services to all of our customers.

As much as we’d like to share all of those successes today, instead we’re unhappily writing about a production problem we encountered yesterday that you may have noticed on your dashboards.

What’s Been Happening

Recently, there have been problems stemming from our Postgres database, which is where we currently store all of our process data. We now monitor over 100 million processes, and with our current system architecture, Postgres can no longer handle that amount of data. The result was slow page load times and data loading errors.

How We’re Resolving These Issues For A Bright Future

Now that we have identified the problem, we’re improving our architecture to keep all the process data in HBase rather than Postgres. We’re confident this will resolve the current scaling issues and architectural problems we’re encountering. We’re taking a three-step course of action to achieve our goals:

Step 1: Earlier this morning, we stopped reporting process data to our free Developer users. We’re not happy that we had to do this, but our product is already operating 5x faster than before.

>> If you’re a Developer user and still need/want that process data, please contact us and we will re-enable it for you.

Step 2: Next week, we will begin moving our process data to HBase. During this time, process data will be stored in both HBase and Postgres. This will help keep load times down and everyone happy.

Step 3: After the New Year, we’re going to be moving all of our process data to HBase. We chose HBase because it’s a much more scalable solution. Postgres will continue to be used to store account info and things of that nature.

In the meantime, AppFirst continues to be grateful for your support, and we apologize for any inconvenience this may have caused. We’re working extremely hard to ensure that we remain a reliable solution for you.

We’re losing sleep so you don’t have to.

David Roth
Co-Founder and CEO