How do we harvest, process and analyse millions of data points every day?
As I look through the past year, there has never been an uneventful or boring day at Draup. Each new day brings in some challenging yet interesting problems to solve. Our entire stack is divided into four important components and each of them is a complete product by itself. We will go through an overview of each without diving into much detail.
Harvesters – We have data from 1000s of different sources flowing into our database. Each data point has its own refresh cycle ranging from real-time to quarterly refresh. Volumes have sometimes gone up to 300 million records a day (200 GB) based on seasonality. Data processing time should not be affected by data volumes, irrespective of its size as decision making is affected with even minor delays.
The data might not have a well-defined schema, this is where MongoDb helps us. Its simple yet dynamic and extremely scalable document store model fits correctly with all our varied needs. We have an internal python-based tool that helps us onboard new sources quickly.
ETL – Getting the data into the system might be an easy task but processing them, which includes joins that go into 100 mil * 100 mil, is a very memory-intensive as well as compute-intensive task. We use the Databricks platform to ETL, decomplexify and merge vast volumes of related data. Databricks is a cloud-based managed spark ecosystem which allows us to concentrate on solving business challenges without worrying too much about the infrastructure and scaling. We have developed some proprietary algorithms which help us deduplicate similar data points across various sources, translate all data into English and make it ready to be consumed by Machine Learning models.
Gateway is the brain of Draup, all important business logics/decisions are made here. This is also a place where data from different machine learning algorithms, rule-based engines, psychological models and manual human intelligence converge. It follows a micro-services architecture with multiple apps, each solving a different problem and communicating through APIs. The apps generally have well defined data schemas to work with. Given the relational nature of the data, MySQL is the natural choice as our database engine. We use Django as it helps us create smaller development cycles, is easy to learn, and helps us control quality and maintain a well-defined MVC architecture. We also use celery coupled with Redis as the broker for a lot of our long running async processes. All different data points along with model results are available for our internal subject matter experts and analyst teams to review and correct through the Gateway web interface. All corrections made by them are fed back to the Machine Learning algorithms as learning data set.
Finally, we have the application which is our front facing SaaS product. It has DRF (Django Rest Framework) at the backend with all the data being consumed by a ReactJs application. To make the user experience better, by allowing our users to scoop through our proprietary heuristics and data much faster, elastic search is used.
We are primarily a Big Data and Machine Learning based startup, so scalability and efficiency is the key for us. All the components are hosted on managed cloud services. The provisioning of the architecture is automated through Terraform, we can bring multiple machines up in seconds based on the load. The deployments are managed through ansible scripts run on Jenkins.All the applications have their own Dev, Qa and Production environments that helps us test our features thoroughly and make sure we deliver bug free software.
I hope this was useful. If you are part of a budding startup and need guidance on solving similar problems, feel free to reach out to us.