Foundations of Data Systems

Part I of the series: Data-Intensive Applications

Every complex system rests on a set of foundations that make it what it is. We build upon these foundations so that our applications can handle future growth and solve our problems without major setbacks.

The better we understand and apply the relevant principles, and the stronger the foundations we build, the easier it will be to scale and the fewer problems we will have down the line.

Most complex applications are formed out of simpler building blocks. In particular, data-intensive applications usually have some, or all, of the following building blocks:

  • Data stores: We need to store data/state somewhere, usually a database.
  • Caches: To increase performance, we store the most frequently accessed data, or the results of expensive operations, for faster reads (see the cache-aside sketch below).
  • Search indexes: Ways to let users search and filter data quickly.
  • Stream processing: Asynchronous message passing between services.
  • Batch processing: Periodically operating on the stored data to extract insights, or to transform it into relevant subsets.

Within each of these categories there’s a whole universe of choice, and the general rule of thumb is to pick the right tool for the job: one that fits your particular needs and nothing more. If you’re storing relational data, for example, chances are a relational database will serve you better than a graph database.
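To make the cache building block concrete, here’s a minimal sketch of the common cache-aside read path, assuming a plain in-process dict as the cache and a hypothetical `load_from_db` callable standing in for a real database query. A production setup would typically use something like Redis or Memcached and add expiry and invalidation, but the shape of the read path is much the same.

```python
# Cache-aside sketch: check the cache, fall back to the data store on a
# miss, then populate the cache for subsequent reads.
# `load_from_db` is a hypothetical stand-in for a real database query.
from typing import Callable, Optional

cache: dict[str, str] = {}

def get_user_name(user_id: str, load_from_db: Callable[[str], Optional[str]]) -> Optional[str]:
    if user_id in cache:             # 1. cheap, in-memory lookup
        return cache[user_id]
    name = load_from_db(user_id)     # 2. cache miss: hit the slower store
    if name is not None:
        cache[user_id] = name        # 3. make the next read fast
    return name

if __name__ == "__main__":
    fake_db = {"u1": "Ada", "u2": "Grace"}
    print(get_user_name("u1", fake_db.get))  # miss: reads the "database"
    print(get_user_name("u1", fake_db.get))  # hit: served from the cache
```

Whatever concrete cache you pick, the interesting decisions are how entries expire and how they’re invalidated when the underlying data changes.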

We can see all of these components in action in an example from Kleppmann’s book:

Generic data system architecture – Martin Kleppmann

Here you can identify all the different simpler components that could be in play to form a generic complex system.

When picking the relevant building blocks, there are fundamental targets we want to achieve.

According to Kleppmann, successful data systems need to be reliable, scalable and maintainable, so we should choose our components carefully in order to meet these criteria.

Reliability

“The system should continue to work correctly, even in the face of adversity”.

Every system should aim for good reliability. Unfortunately, we can’t protect against every single scenario, so it’s likely that our systems will experience reliability issues at some point, in the form of faults.

While we can’t prevent every fault, we can make our systems as fault tolerant as possible. Even better if we aim for auto-recovery or self-healing.

There are faults we can recover from and others that paint a much harder picture; a security breach that could’ve been prevented with better engineering is a good example of the latter.

Some of the common ones we could recover from include hardware and software faults, along with human error.

Hardware faults typically relate to the loss of machines. Most modern applications tolerate these by adding redundancy; multi-AZ and multi-region deployments are a common example. If an entire availability zone goes down, we can fail over to a different one, perhaps with a bit of extra latency, and continue to operate.
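As a rough illustration of that kind of redundancy, here’s a minimal sketch of client-side failover across replicas in different availability zones. The endpoint names are made up, and `fetch` is a placeholder for whatever call your database driver actually exposes; it’s assumed to raise `ConnectionError` when a replica is unreachable.

```python
import random

# Hypothetical replica endpoints in different availability zones.
REPLICAS = [
    "https://db.eu-west-1a.example.com",
    "https://db.eu-west-1b.example.com",
    "https://db.eu-west-1c.example.com",
]

class AllReplicasDown(Exception):
    """Raised when no replica could serve the request."""

def query_with_failover(sql, fetch):
    """Try each replica in turn; fall back to the next one on failure.

    `fetch(endpoint, sql)` stands in for your driver's query call and is
    assumed to raise ConnectionError when the endpoint is unreachable.
    """
    # Shuffle so we spread load instead of always hammering the first zone.
    replicas = random.sample(REPLICAS, k=len(REPLICAS))
    last_error = None
    for endpoint in replicas:
        try:
            return fetch(endpoint, sql)
        except ConnectionError as exc:
            last_error = exc  # this zone is unreachable, try the next one
    raise AllReplicasDown("every replica failed") from last_error
```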

Software faults are a bit harder to plan for: by definition they’re introduced via bugs in the system, and if we were aware of them, they could’ve been eliminated altogether. These issues can lie dormant until certain conditions occur, or affect critical services that were fine until some threshold was reached. The best way to prevent them is extensive testing.
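As a small example of what “extensive testing” can look like in practice, here’s a pytest-style sketch that exercises the boundaries of a hypothetical `allocate_shard` helper, exactly the kind of threshold and edge-case conditions where dormant bugs tend to hide. The implementation shown is a toy one, included only so the tests run.

```python
# Boundary tests for a hypothetical shard-allocation helper.
import pytest

def allocate_shard(user_id: int, shard_count: int) -> int:
    """Toy implementation, included only to make the tests below runnable."""
    if shard_count <= 0:
        raise ValueError("shard_count must be positive")
    return user_id % shard_count

@pytest.mark.parametrize("user_id", [0, 1, 2**31 - 1, 2**63 - 1])
@pytest.mark.parametrize("shard_count", [1, 2, 16, 1024])
def test_allocated_shard_is_always_in_range(user_id, shard_count):
    # The invariant that must never break, even at extreme inputs.
    shard = allocate_shard(user_id, shard_count)
    assert 0 <= shard < shard_count

def test_zero_shards_is_rejected_explicitly():
    # Misconfiguration should fail loudly, not silently corrupt routing.
    with pytest.raises(ValueError):
        allocate_shard(42, 0)
```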

Human error is the third factor that can affect reliability, and boy can we screw things up badly! If you don’t believe me, read about the major network outage that Facebook suffered recently. Human errors are, funnily enough, the easier ones to deal with: ideally you don’t put people in situations where they can screw up, you have monitoring in place to detect issues quickly, you have mechanisms to recover, and overall you implement good engineering practices to minimise as many dangerous situations as possible.
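One concrete way to keep people out of situations where they can screw up is to put a machine between the human and production. Here’s a minimal, made-up sketch of a pre-deploy guardrail that refuses obviously dangerous configuration values; the setting names and thresholds are assumptions for illustration only.

```python
# Pre-deploy guardrail sketch: validate a (hypothetical) config dict before
# it reaches production, so an obvious human mistake is caught by a machine
# rather than by an outage.

SAFETY_CHECKS = {
    "replication_factor": lambda v: v >= 2,      # a single copy risks data loss
    "max_connections": lambda v: 0 < v <= 10_000,
    "backup_enabled": lambda v: v is True,
}

def validate_config(config: dict) -> list[str]:
    """Return human-readable problems; an empty list means 'safe to deploy'."""
    problems = []
    for key, is_ok in SAFETY_CHECKS.items():
        if key not in config:
            problems.append(f"missing required setting: {key}")
        elif not is_ok(config[key]):
            problems.append(f"unsafe value for {key}: {config[key]!r}")
    return problems

if __name__ == "__main__":
    proposed = {"replication_factor": 1, "max_connections": 500}
    for problem in validate_config(proposed):
        print("BLOCKED:", problem)
```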

Reliability is a very important criterion to optimise for. In some cases, like air traffic control, unreliability can put people’s lives in danger, but even when that’s not the case, unreliable systems can cause loss of revenue, which puts companies in jeopardy and can even drive them into the ground.

Scalability

“As the system grows, there should be reasonable ways of dealing with that growth”.

In growing systems, today’s performance is no guarantee of tomorrow’s. With more traffic, more data, more regions, more users and so on, the assumption is that all those new interactions will expose critical bottlenecks where simply adding more metal (processing power) won’t be enough.

Achieving good scalability is a hard problem: it’s a constant balancing act between “thinking ahead” and “not overspending too far in advance”.

Nowadays, there are cloud solutions that scale elastically to serve your needs and allocate resources for you as you grow. However, there are parts of a system that can’t simply be scaled that way.

To figure out how scalable your system is, you need to stress test it, and always know how far you’re able to push it.

The first thing to do is to figure out the current load on your system, then start running tests with increased load and see what happens. You need to define KPIs in advance in order to have a reliable measure of success: if measuring resources, things like CPU utilisation and memory usage; if measuring performance, median response time and throughput. Watch for outliers using the 95th and 99th percentiles.
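As a sketch of what summarising such a test run might look like, here’s a small script that computes throughput, the median, and the p95/p99 tail from a list of recorded response times. The sample data is simulated, and the nearest-rank percentile is a deliberate simplification.

```python
# Summarise a load-test run: throughput plus median/p95/p99 latency.
import random

def percentile(sorted_values, p):
    """Nearest-rank percentile; good enough for a quick load-test summary."""
    index = min(len(sorted_values) - 1, int(len(sorted_values) * p / 100))
    return sorted_values[index]

def summarise(response_times, test_duration_s):
    times = sorted(response_times)
    return {
        "requests": len(times),
        "throughput_rps": len(times) / test_duration_s,
        "median_ms": percentile(times, 50) * 1000,
        "p95_ms": percentile(times, 95) * 1000,
        "p99_ms": percentile(times, 99) * 1000,
    }

if __name__ == "__main__":
    # Simulated run: most requests are fast, a few outliers are slow.
    samples = [random.uniform(0.02, 0.1) for _ in range(9_900)]
    samples += [random.uniform(0.5, 2.0) for _ in range(100)]
    print(summarise(samples, test_duration_s=60))
```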

Once you’ve identified breaking points, you can start formulating a strategy to deal with the potential failures or plan for growth accordingly.

Maintainability

“Over time, many different people will work on the system, and they should all be able to work on it productively”

The majority of software cost goes into its constant maintenance (fixing bugs, addressing technical debt, adding new features etc), which makes building maintainable systems a priority.

When designing systems, we should do it in a forward thinking way, attempting to minimise future pain.

I once read somewhere that you should code as if the person maintaining your code is a psychopath who knows where you live. Unfortunately I can’t recall where I read it, but I find it both amusing and true.

Kleppmann suggests we focus on 3 design principles:

  • Operability: make it easy for your ops teams to run the software.
  • Simplicity: make it easy for future engineers to understand the system by removing complexity.
  • Evolvability: make it easy for engineers to make changes and add features.

Guided by these three criteria (reliability, scalability and maintainability), we can increase our chances of building better systems, and achieve a good result when dealing with data-intensive applications.

As you can imagine, this is not enough to build complex data-intensive systems, but it’s a start. Lots of design and engineering decisions come into play. What is more likely to happen is that your system will evolve over time, and that you won’t have to make all of these decisions at once, but you’ll have time to consider your options.

Keep these principles in mind when choosing your building blocks and you’ll be a good step ahead.

Next in the series we will explore Replication and Partitioning in Distributed Systems.