Everybody thinks that big data is a magical pill that you gulp, and voila, everything falls into place. One of the most common traps is to think of cost only at the level of online or front-end systems. A common discussion reads something like this: "A typical $500/month machine easily handles two thousand queries per second (QPS), so if I need 10 thousand more QPS, I would need $2,500/month as an infrastructure cost."
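Just to make that back-of-the-envelope math explicit (using the same illustrative $500/month and 2,000 QPS figures quoted above), a quick sketch:

```python
# Naive front-end cost estimate, using the hypothetical figures quoted above.
cost_per_machine = 500           # USD per month, illustrative
qps_per_machine = 2_000          # queries per second one machine handles, illustrative
extra_qps_needed = 10_000        # additional traffic to provision for

machines_needed = -(-extra_qps_needed // qps_per_machine)   # ceiling division -> 5
monthly_cost = machines_needed * cost_per_machine            # 5 * 500 = 2,500 USD/month
print(machines_needed, monthly_cost)
```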
What this argument misses is a host of downstream aspects that take up the bulk of the cost. We can broadly classify the entire infrastructure into five components and then assess each of them for cost.
Entry-point frontend systems
Note that these are not ‘front-end’ as in UI/UX systems, but the main entry points of your platform for the external world. Typical examples would be ad servers, beacon listeners, RTB bidders, B2C websites, etc. These are invariably core business systems that need to run 24x7. Any downtime there not only affects your revenue but, more importantly, has a negative brand effect.
Naturally, you plan as much buffer and redundancy here as you can afford. Surprisingly, the cost calculations here are not hard unless your traffic fluctuates unpredictably. It's simple math: one machine handles X QPS, so let us provision for Y QPS, factoring in enough buffer. Also, with techniques like auto-scaling on the cloud, you can track real traffic as closely as you want without over-provisioning in a big way.
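A minimal sketch of that provisioning rule, with the buffer and redundancy numbers picked purely for illustration rather than as recommendations:

```python
import math

def machines_for(peak_qps: float, qps_per_machine: float,
                 buffer_factor: float = 1.5, redundancy: int = 1) -> int:
    """Provision for peak traffic times a safety buffer, then add spare
    machines so a single failure does not drop requests.
    All default values here are illustrative assumptions."""
    base = math.ceil(peak_qps * buffer_factor / qps_per_machine)
    return base + redundancy

# e.g. 10,000 QPS peak, 2,000 QPS per machine, 50% buffer, one spare -> 9 machines
print(machines_for(10_000, 2_000))
```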
Storage accessed by online systems
This could typically be in-memory caching systems like memcache or redis, or persistent stores like RDBMS or NoSQL. This depends a lot on how you architect your systems and how much caching makes sense. A typical low-latency system like an ad server or bid server would need a lot of cache since it cannot afford to hit disks at runtime. But memory is generally super-expensive, and so the architecture has to be judicious in its usage.
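To illustrate why the cache sits squarely in the hot path, here is a minimal cache-aside sketch using the redis-py client; fetch_from_db is a hypothetical stand-in for whatever RDBMS/NoSQL store backs the cache:

```python
import redis  # assumes the redis-py client is installed

r = redis.Redis(host="localhost", port=6379)

def fetch_from_db(key: str) -> bytes:
    # Hypothetical placeholder for a disk-backed RDBMS/NoSQL lookup.
    return b"profile-data"

def get_profile(key: str, ttl_seconds: int = 300) -> bytes:
    """Cache-aside lookup: serve from RAM if possible, fall back to the
    slower persistent store only on a miss, then repopulate the cache."""
    cached = r.get(key)
    if cached is not None:
        return cached
    value = fetch_from_db(key)        # the expensive disk hit you want to avoid at high QPS
    r.set(key, value, ex=ttl_seconds)
    return value
```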
From a cost perspective, this component can be quite tricky to estimate. The thumb rule is that RAM is 10x more expensive than SSD, which is 10x more expensive than HDD; performance-wise, RAM is roughly 1000x faster than SSD, which is around 100x faster than HDD. With such a wide variety at your disposal, you can architect the systems creatively to hit the right balance of latency and cost. Another typical mistake is that when the frontend system scales, this caching/persistent store may not scale with it, so it is important to benchmark its upper limits and provision for proper redundancy. Auto-scaling this storage tier is not as easy as with the stateless frontend systems.
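To make the thumb rule concrete, a rough sketch of what the same working set costs per month in each tier; the ratios are the ones quoted above, and the absolute HDD price and working-set size are purely illustrative assumptions:

```python
# Relative cost per GB, anchored on an assumed HDD price purely for illustration.
HDD_COST_PER_GB = 0.03               # USD/GB-month, illustrative assumption
COST_PER_GB = {
    "HDD": HDD_COST_PER_GB,
    "SSD": HDD_COST_PER_GB * 10,     # thumb rule: SSD ~10x HDD
    "RAM": HDD_COST_PER_GB * 100,    # thumb rule: RAM ~10x SSD
}

working_set_gb = 500                 # hypothetical hot data set
for tier, cost in COST_PER_GB.items():
    print(f"{tier}: ${working_set_gb * cost:,.0f}/month")
# Keeping only the truly hot fraction in RAM and the rest on SSD/HDD is what keeps this bill sane.
```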
Analytics – the meat in the sandwich
Big data in raw form is just loud noise. If you have terabytes of raw data, it's pretty useless for the business unless a user can make sense of it. Typically, analytics involves two aspects. One is the well-defined, OLAP-like reports that can be easily and quickly consumed by clients and business users alike. The second is the ad-hoc queries that analysts fire when they are trying to make business decisions. This does, however, complicate the infrastructure, since ad-hoc queries can range from a simple "count(*)" on a small table to joins across multi-terabyte tables. Building a good analytics platform is a long journey. While technologies like Hadoop, Hive and Spark enable this, none of them is a drop-in solution that works out of the box. You will need to invest in them, tune them to your needs and strive for a balance between cost and ease of data access. There is never an easy answer there!
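One way to see why ad-hoc queries complicate the cost picture is to price them by the data they scan. The per-terabyte rate below is a hypothetical figure, not any particular engine's pricing:

```python
def adhoc_query_cost(tb_scanned: float, usd_per_tb: float = 5.0) -> float:
    """Rough cost of a scan-based ad-hoc query.
    The per-TB rate is an illustrative assumption, not a vendor price."""
    return tb_scanned * usd_per_tb

# A simple count(*) over a few GB is effectively free...
print(adhoc_query_cost(0.005))    # a few cents
# ...while joins over multi-terabyte tables are not, especially when a team
# of analysts runs dozens of them a day.
print(adhoc_query_cost(8) * 30)   # ~ $1,200 for 30 such queries
```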
Geographic distribution of infrastructure
It's not uncommon for infrastructure to be spread around the globe. While most dedicated hosting providers make it easy, you need to be clear on the maintenance and costs involved. The first part is the bandwidth cost, which is an often-understated component. As with RAM, one needs to be judicious with bandwidth, since it is not only expensive but also inconsistent in capacity. The more data you propagate around the world, the more uncertainty you add to your systems, since bandwidth speeds are not the most reliable.
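A rough sketch of how cross-region fan-out adds up; the per-GB egress rate and the traffic figures are illustrative assumptions, since real prices vary by provider and route:

```python
def monthly_transfer_cost(gb_per_day: float, regions: int,
                          usd_per_gb: float = 0.09) -> float:
    """Cost of fanning data out from one region to the others every day.
    The per-GB egress rate is an illustrative assumption."""
    return gb_per_day * 30 * (regions - 1) * usd_per_gb

# Replicating 200 GB/day from one region to three others:
print(monthly_transfer_cost(200, regions=4))   # ~ $1,620/month
```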
Again, when data starts scaling, a good engineering team should be able to architect a solution that is not super-expensive and is as predictable as possible. Another angle to pay heed to is how geographical distance affects the storage accessed by online systems. Not all systems around the world need access to all the data, so application and domain knowledge, backed by good architecture, will help optimize this spread.
Storage of raw data
This is generally different from the storage directly accessed by online systems. Typically, this would be network storage, HDFS, S3 or the like: a lot of unstructured data in raw form. While these costs are dropping by the day (S3 is hardly $0.03/GB-month, and the price keeps dropping every now and then), note that the volume of data collected is also exploding every day. Therefore, one needs to strike a balance between cost and the utility of the data. The next time you consider the cost of data, please factor in the entire ecosystem – one that can bring disparate nodes under a unified umbrella and still justify the investment.
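As a parting illustration of why that balance matters, a sketch of how the raw-storage bill compounds over time, using the ~$0.03/GB-month figure above; the starting size and growth rate are assumed examples:

```python
def storage_bill(start_tb: float, monthly_growth: float, months: int,
                 usd_per_gb_month: float = 0.03) -> float:
    """Cumulative object-storage bill when the dataset compounds every month.
    Uses the ~$0.03/GB-month figure quoted above; growth rate is an assumption."""
    total, tb = 0.0, start_tb
    for _ in range(months):
        total += tb * 1024 * usd_per_gb_month   # this month's bill
        tb *= (1 + monthly_growth)              # data keeps exploding
    return total

# 50 TB today, growing 10% per month: the cumulative bill over two years
print(f"${storage_bill(50, 0.10, 24):,.0f}")
```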
By Vikram, CTO and co-founder of Vizury