Big data architecture patterns for performance

This blog post will give some overview of how to build a high performance architecture to process your data. Some real benchmarks conducted by Incentergy will be explained and how we leverage our knowledge when using our platform.

Turbo snail for big data

Lets start with a list of everything that makes your program slow:

  • believing in marketing promises
  • quadratic O(n2) or exponential algorithms O(2n) like bubble sort or generating the power set
  • disc writes and reads
  • network access
  • inter-process-communication
  • big data
  • locks
  • user mode

Now a list with everything that makes your program fast:

  • smart programmers
  • pushing customers
  • using efficiently main memory
  • parallelism
  • quantitative modeling
  • smart benchmarks
  • kernel mode

In my opinion the most important part are good programmers and smart benchmarks.

Now lets look into some parts in more detail.

Computer architecture

Latency numbers every programmer should know. This is the absolute basic that you should know to develop a high performance architecture. Speed of software is always influenced by behavior of hardware. Hardware is based on different components that have quite different characteristics using them wisely is very important.

Operating system architecture

Inter process communication is expensive if you want to have a high performance architecture you should have a thread pool that is working on the tasks in parallel using all the CPUs in your machine. Further if you have a GPU use it for the heavy math lifting. There is currently also lots of research going on using the GPU.

If your application runs in user mode it will always be slower than running in kernel mode. There are some specially tuned systems like TUX which use kernel capabilities to run very fast but are horrible to debug.

ConcurrentHashMap vs. Memcached vs. MySQL InnoDB

The following image shows a comparison between a ConcurrentHashMap in Java vs. memcached vs. MySQL when parallel accessing them (smaller is better). We are comparing here apples with oranges because the architecture of these three systems is quite different but it gives you a feeling where your cpu cycles go.

HashMap vs Memcached vs MySQL

Results for run with ‘100.000’ objects:

Type Insert Lookup Remove
ConcurrentHashMap 264ms 93ms 82ms
Memcached 6549ms 5976ms 4900ms
MySQL InnoDB 55754ms 26002ms 57899ms

As you can see the map which is in the same process is about 25 – 90 times faster than memcached and memcached is about 5 – 12 times faster than MySQL InnoDB.

So if you want to be fast use a HashMap.

If you want to have the benchmark source contact us.

Why Hadoop is slow

We are running a lot of hadoop jobs to be more specific mahout jobs and we are seeing that the system has a hard time keeping the CPU usage on the cluster higher than 30%. The problem here is that Hadoop is heavily disc and network bound and this is slow. So if you can process your data in memory you should.

Our system runs online (in-memory) and offline (hadoop) machine learning algorithms to have best of both worlds. This is sometimes also called the lambda architecture.

If you want to give it a try for doing real time online marketing optimization contact us.

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.