This blog post gives an overview of how to build a high-performance architecture for processing your data. It explains some real benchmarks conducted by Incentergy and how we apply what we learned in our platform.
Let's start with a list of everything that makes your program slow:
- believing in marketing promises
- quadratic O(n^2) or exponential O(2^n) algorithms, like bubble sort or generating the power set
- disk writes and reads
- network access
- big data
- user mode
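To make the point about quadratic algorithms concrete, here is a minimal sketch that solves the same problem (does an array contain a duplicate?) once with nested loops and once with a hash set. The class and method names are made up for this illustration, and the timings it prints depend entirely on your machine and JVM.

```java
import java.util.HashSet;
import java.util.Set;

class DuplicateCheck {
    // O(n^2): compares every pair of elements.
    static boolean hasDuplicateQuadratic(int[] values) {
        for (int i = 0; i < values.length; i++) {
            for (int j = i + 1; j < values.length; j++) {
                if (values[i] == values[j]) {
                    return true;
                }
            }
        }
        return false;
    }

    // Expected O(n): a single pass that remembers what was already seen.
    static boolean hasDuplicateLinear(int[] values) {
        Set<Integer> seen = new HashSet<>();
        for (int value : values) {
            if (!seen.add(value)) { // add() returns false if the value was already present
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        int n = 50_000;
        int[] values = new int[n];
        for (int i = 0; i < n; i++) {
            values[i] = i; // all distinct: the worst case for both variants
        }
        long t0 = System.nanoTime();
        boolean quadratic = hasDuplicateQuadratic(values);
        long t1 = System.nanoTime();
        boolean linear = hasDuplicateLinear(values);
        long t2 = System.nanoTime();
        System.out.printf("quadratic: %b in %d ms%n", quadratic, (t1 - t0) / 1_000_000);
        System.out.printf("linear:    %b in %d ms%n", linear, (t2 - t1) / 1_000_000);
    }
}
```

Doubling n roughly quadruples the work of the first variant but only doubles the work of the second, which is why the choice of algorithm matters more than most low-level tuning.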
Now a list with everything that makes your program fast:
- smart programmers
- pushing customers
- using main memory efficiently
- quantitative modeling
- smart benchmarks
- kernel mode
In my opinion, the most important parts are good programmers and smart benchmarks.
Now let's look at some of these in more detail.
Latency numbers every programmer should know
These are the absolute basics you need to know to develop a high-performance architecture. The speed of software is always influenced by the behavior of the hardware. Hardware consists of different components with quite different characteristics, so using them wisely is very important.
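As a rough illustration, the sketch below puts a few of the commonly circulated "latency numbers" side by side and relates each one to a main memory access. The figures are approximate, order-of-magnitude values from the widely shared list and stem from older hardware; they are not measurements from this post, and the class name is made up for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class LatencyNumbers {
    // Approximate, commonly cited figures in nanoseconds (assumption: older
    // hardware, order of magnitude only, not measured for this post).
    static final Map<String, Double> NANOS = new LinkedHashMap<>();
    static {
        NANOS.put("L1 cache reference", 0.5);
        NANOS.put("Main memory reference", 100.0);
        NANOS.put("Read 1 MB sequentially from memory", 250_000.0);
        NANOS.put("Round trip within same datacenter", 500_000.0);
        NANOS.put("Disk seek", 10_000_000.0);
        NANOS.put("Read 1 MB sequentially from disk", 20_000_000.0);
        NANOS.put("Packet California -> Netherlands -> California", 150_000_000.0);
    }

    // How many main memory references fit into one such operation.
    static double relativeToMainMemory(String operation) {
        return NANOS.get(operation) / NANOS.get("Main memory reference");
    }

    public static void main(String[] args) {
        for (Map.Entry<String, Double> e : NANOS.entrySet()) {
            System.out.printf("%-48s %,15.1f ns  (%,.3f x main memory)%n",
                    e.getKey(), e.getValue(), relativeToMainMemory(e.getKey()));
        }
    }
}
```

The exact values matter less than the ratios: a disk seek or a network round trip costs thousands of memory accesses, which is the reason disk and network show up in the "slow" list.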
Operating system architecture
Inter-process communication is expensive. If you want a high-performance architecture, you should have a thread pool that works on the tasks in parallel, using all the CPUs in your machine. Furthermore, if you have a GPU, use it for the heavy math lifting. There is currently a lot of research going on into using the GPU for this.
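A minimal sketch of the thread pool idea, using the standard java.util.concurrent executor: the work is split into one chunk per worker and all chunks run inside a single process. The task (summing squares) and the class name are placeholders chosen for the example, not part of our platform.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ParallelSum {
    // Splits [0, n) into one chunk per worker and sums the squares in parallel.
    static long sumOfSquares(int n, int workers) {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            List<Callable<Long>> tasks = new ArrayList<>();
            int chunk = (n + workers - 1) / workers; // ceiling division
            for (int w = 0; w < workers; w++) {
                final int from = w * chunk;
                final int to = Math.min(n, from + chunk);
                tasks.add(() -> {
                    long sum = 0;
                    for (long i = from; i < to; i++) {
                        sum += i * i;
                    }
                    return sum;
                });
            }
            long total = 0;
            // invokeAll blocks until every task has finished.
            for (Future<Long> result : pool.invokeAll(tasks)) {
                total += result.get();
            }
            return total;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a worker task failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        int cpus = Runtime.getRuntime().availableProcessors();
        System.out.println(sumOfSquares(1_000_000, cpus));
    }
}
```

Sizing the pool with availableProcessors() suits CPU-bound work; tasks that mostly wait on disk or network usually need a different sizing.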
If your application runs in user mode, it pays for every system call with a switch into the kernel and back, so it will generally be slower than code running in kernel mode. There are some specially tuned systems like TUX which use kernel capabilities to run very fast but are horrible to debug.
ConcurrentHashMap vs. Memcached vs. MySQL InnoDB
The following image compares a ConcurrentHashMap in Java, memcached, and MySQL under parallel access (smaller is better). We are comparing apples with oranges here, because the architectures of these three systems are quite different, but it gives you a feeling for where your CPU cycles go.
Results for a run with 100,000 objects:
As you can see, the map, which lives in the same process, is about 25 to 90 times faster than memcached, and memcached is about 5 to 12 times faster than MySQL InnoDB.
So if you want to be fast, use a HashMap.
If you want the benchmark source, contact us.
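This is not the benchmark behind the numbers above. It is only a minimal sketch of how the in-process part of such a comparison could be timed, assuming a simple put-then-get workload spread over several threads; the class name, the workload, and the single warm-up run are all assumptions made for illustration. The memcached and MySQL sides would add a network round trip and a server process on top of this.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class MapBenchmarkSketch {
    // Writes `objects` entries and reads each one back, spread over `threads`
    // threads. Returns the elapsed wall-clock time in nanoseconds.
    static long run(ConcurrentHashMap<Integer, String> map, int objects, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        CountDownLatch done = new CountDownLatch(threads);
        long start = System.nanoTime();
        for (int t = 0; t < threads; t++) {
            final int offset = t;
            pool.execute(() -> {
                // Each thread handles the keys offset, offset + threads, ...
                for (int key = offset; key < objects; key += threads) {
                    map.put(key, "value-" + key);
                    map.get(key);
                }
                done.countDown();
            });
        }
        try {
            done.await(); // wait until every worker has finished
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            pool.shutdown();
        }
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        int threads = Runtime.getRuntime().availableProcessors();
        run(new ConcurrentHashMap<>(), 100_000, threads); // warm-up run for the JIT
        ConcurrentHashMap<Integer, String> map = new ConcurrentHashMap<>();
        long nanos = run(map, 100_000, threads);
        System.out.printf("%d objects, %d threads: %.2f ms%n", map.size(), threads, nanos / 1e6);
    }
}
```

A single timed run like this is noisy; for numbers you want to publish, repeat the measurement several times or use a benchmarking harness.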
Why Hadoop is slow
We run a lot of Hadoop jobs, more specifically Mahout jobs, and we see that the system has a hard time keeping the CPU usage on the cluster above 30%. The problem is that Hadoop is heavily disk- and network-bound, and both are slow. So if you can process your data in memory, you should.
Our system runs online (in-memory) and offline (Hadoop) machine learning algorithms to get the best of both worlds. This is sometimes also called the lambda architecture.
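To illustrate the general shape of such a setup (not our actual implementation), here is a minimal sketch in which a query combines a precomputed batch view with an in-memory view of recent events. Simple per-key counters stand in for the machine learning models, and all names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class LambdaViewSketch {
    // Batch view: computed offline (for example by a Hadoop job) and swapped in as a whole.
    private volatile Map<String, Long> batchView = new HashMap<>();
    // Speed view: updated in memory for events the batch job has not processed yet.
    private final ConcurrentHashMap<String, Long> speedView = new ConcurrentHashMap<>();

    // Online path: count the event immediately in memory.
    void onEvent(String key) {
        speedView.merge(key, 1L, Long::sum);
    }

    // Offline path: publish a freshly recomputed batch view.
    // Simplification (assumption): the batch result covers every event seen so
    // far, so the speed view can be cleared.
    void publishBatchView(Map<String, Long> recomputed) {
        batchView = new HashMap<>(recomputed);
        speedView.clear();
    }

    // A query merges both views.
    long query(String key) {
        return batchView.getOrDefault(key, 0L) + speedView.getOrDefault(key, 0L);
    }

    public static void main(String[] args) {
        LambdaViewSketch views = new LambdaViewSketch();
        views.onEvent("clicks");
        views.onEvent("clicks");
        System.out.println("before batch run: " + views.query("clicks"));
        Map<String, Long> recomputed = new HashMap<>();
        recomputed.put("clicks", 2L);
        views.publishBatchView(recomputed);
        views.onEvent("clicks");
        System.out.println("after batch run plus one new event: " + views.query("clicks"));
    }
}
```

A real system also has to handle events that arrive while the batch job is running, which this sketch deliberately ignores.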