
Collecting Trillions of Data Points at Scale

Engineering March 24, 2023




Akash Doifode

Draup’s Harvester team is responsible for gathering a vast amount of data from various publicly available sources. This team collects trillions of data points to be used by downstream processes and Data Science teams. 

These downstream teams use the collected data to develop new insights and run machine learning models. The data we collect comes from over 8,000 sources and includes job descriptions, demographic data, publicly available resumes, university profiles, courses, news, and much more.

Scaling a system that can handle such huge volumes while still maintaining 99.99% uptime is of great engineering interest. 

So, how do we do it? 

Distributing Workload in a Producer-Consumer Framework  

At the core of our Harvester system is the Producer-Consumer Framework. 

In the producer-consumer framework, producers are responsible for generating tasks, and consumers process those tasks. Because producers and consumers operate independently and at different throughputs, the workload can be distributed among multiple producers and consumers, improving performance and scalability.
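
To make the pattern concrete, here is a minimal, hypothetical sketch of a producer and a consumer sharing a Redis list as the queue. The queue name and task shape are assumptions for illustration, not Draup’s actual code.

```python
import json
import redis

# Hypothetical queue name and task shape, for illustration only.
QUEUE = "harvest:tasks"
r = redis.Redis(host="localhost", port=6379)

def produce(urls):
    """Producer: turn input URLs into tasks and push them onto the queue."""
    for url in urls:
        task = {"url": url, "source": "example_source"}
        r.lpush(QUEUE, json.dumps(task))

def consume():
    """Consumer: block until a task is available, then process it."""
    while True:
        _, raw = r.brpop(QUEUE)           # blocks until a task arrives
        task = json.loads(raw)
        print("processing", task["url"])  # a real consumer would scrape here

if __name__ == "__main__":
    produce(["https://example.com/jobs/1", "https://example.com/jobs/2"])
```

Because the two sides share only the queue, either side can be scaled out independently by simply starting more producer or consumer processes.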

The Architecture of the Producer System 

We store the input metadata (URLs in our case) in MongoDB because it offers high performance and flexibility compared to traditional relational databases. The producer reads the input data from MongoDB, transforms it, and pushes the job into the queue along with the relevant information.
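
As a rough sketch of this step (the database, collection, and field names are assumptions, not Draup’s actual schema), a producer might read pending URLs for a source with pymongo and enqueue them:

```python
import json
import redis
from pymongo import MongoClient

# Hypothetical database, collection, and field names for illustration.
client = MongoClient("mongodb://localhost:27017")
inputs = client["harvester"]["input_metadata"]
queue = redis.Redis()

def run_producer(source, batch_limit=1000):
    """Read pending URLs for a source, transform them into tasks, and enqueue them."""
    cursor = inputs.find({"source": source, "status": "pending"}).limit(batch_limit)
    for doc in cursor:
        task = {"url": doc["url"], "source": source,
                "attributes": doc.get("attributes", {})}
        queue.lpush("harvest:tasks", json.dumps(task))
        inputs.update_one({"_id": doc["_id"]}, {"$set": {"status": "queued"}})
```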

The Moving Parts of the Producer System 

  • Airflow – Airflow is used to schedule jobs at a frequency based on each source’s needs (see the DAG sketch after this list). It lets us define complex scheduling rules and dependencies and easily monitor and manage our data pipelines. Airflow’s scalability and reliability make it suitable for a system that processes enormous volumes of data.
  • Broker – Redis on AWS (Amazon Web Services) ElastiCache – Redis is used as the broker and acts as the link between different system components. We run Redis on Amazon ElastiCache to get fast, reliable data storage and retrieval in the cloud. Redis gives us fast read-and-write performance, enabling producers and consumers to scale without worrying about database throughput and bottlenecks.
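
Here is a minimal sketch of how such a schedule could be declared, assuming Airflow 2.x; the DAG name, interval, and callable are illustrative, not Draup’s actual pipeline definitions.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

def run_producer():
    # In the real system this step would read input metadata from MongoDB
    # and push tasks onto the Redis-backed queue.
    pass

# Hypothetical DAG: run the producer for one source every 6 hours.
with DAG(
    dag_id="harvest_example_source",
    start_date=datetime(2023, 1, 1),
    schedule_interval=timedelta(hours=6),
    catchup=False,
) as dag:
    PythonOperator(task_id="produce_tasks", python_callable=run_producer)
```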

The Config Layer – Programming the Harvester 

The Config Layer acts as the starting point for the harvester system and serves as the interface between developers and the system. In this layer, developers specify parameters such as rate limits, batch limits, harvesting lag, input metadata, and output locations for a particular source. It also contains source-wise attributes and their respective data types, providing crucial meta-information to all other layers in the system. 
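
As an illustration only (the field names below are assumptions, not Draup’s actual config schema), a per-source config might look something like this:

```python
# Hypothetical per-source config; every field name here is an illustrative assumption.
EXAMPLE_SOURCE_CONFIG = {
    "source": "example_jobs_site",
    "rate_limit_per_minute": 60,        # max requests per minute for this source
    "batch_limit": 500,                 # max tasks the producer enqueues per run
    "harvesting_lag_hours": 24,         # minimum gap before re-harvesting a URL
    "input_collection": "input_metadata",
    "output_location": "harvester.example_jobs_site_raw",
    "priority": "high",                 # routes tasks to the priority queue
    "attributes": {                     # source-wise attributes and their data types
        "title": "string",
        "location": "string",
        "posted_date": "datetime",
    },
}
```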

Efficient Task Queuing with Celery 

We use Celery for our task queueing requirements. It allows us to check the status of tasks and retry failed jobs. With its robust and flexible design, Celery is used in a wide range of applications, from small one-off scripts to large-scale distributed systems. 
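
A minimal sketch of what a harvesting task might look like with Celery (the broker URL, task body, and retry settings are assumptions for illustration):

```python
import requests
from celery import Celery

# Hypothetical Celery app wired to the Redis broker.
app = Celery("harvester",
             broker="redis://localhost:6379/0",
             backend="redis://localhost:6379/1")

@app.task(bind=True, max_retries=3, default_retry_delay=60)
def harvest_url(self, url, source):
    """Fetch one URL; retry on transient failures."""
    try:
        html = requests.get(url, timeout=30).text  # stand-in for the scraper framework
        return {"url": url, "source": source, "length": len(html)}
    except requests.RequestException as exc:
        raise self.retry(exc=exc)
```

Calling harvest_url.delay(url, source) enqueues the task and returns an AsyncResult, which is what lets us check task status and retry failed jobs.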

But task queuing with Celery is not enough for the volumes of data that we process. 

Building a Custom Distribution System 

Earlier, Celery distributed tasks among workers, and each source had its own queue. However, Celery does not support global rate limiting; its built-in rate limits are applied per worker rather than across the whole system.

This meant we had to develop a custom distribution system that helps us track tasks better and enforce rate limits globally. In this custom distribution system, we have divided the queue into two parts: a priority queue and a default queue.

  • Priority Queue – The priority queue is used for business-critical sources. We keep dedicated machines on the worker side to execute tasks from the priority queue, which decouples these sources from less important ones.
  • Default Queue – The default queue is used for the non-priority sources we push.

The queue size can be scaled up to store tasks at scale.

In addition to the above, the Distribution system is equipped with a Rate Limit Store. This is used to store all active tasks and track them until completion. Once an output is received, that task is removed from the Rate Limit Store. 
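
A minimal sketch of how such a rate-limit store could work (the key layout and limit are assumptions, not Draup’s implementation): track active tasks per source in Redis and dispatch a task only while the source is under its limit.

```python
import redis

r = redis.Redis()

def try_dispatch(source, task_id, limit):
    """Dispatch a task only if the source is below its global rate limit."""
    active_key = f"active:{source}"      # hypothetical key layout
    if r.scard(active_key) >= limit:
        return False                     # over the limit; leave the task queued
    r.sadd(active_key, task_id)          # track the task until completion
    return True

def mark_done(source, task_id):
    """Remove the task from the store once its output is received."""
    r.srem(f"active:{source}", task_id)
```

In practice the check-and-add would need to be atomic (for example, via a Lua script) so that concurrent workers cannot race past the limit.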

How the Consumer System Works 

The consumers are responsible for executing tasks from the queue. Input is read from the different queues, and the scraper framework (developed by the Draup team) uses this information to scrape data from various sources. The output is returned in a specific data format, and an appropriate data handler is called to write the data to the database. The data handler writes the data in a format optimized for fast retrieval and easy access, making it easy for other parts of the system to use (a simplified sketch of this loop follows the worker descriptions below).

  • Priority worker – Priority workers listen to the priority queue.
  • Default worker – Default workers listen to the default queue.
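
A simplified sketch of the consumer loop under these assumptions (the queue names and the scraper and data-handler interfaces are hypothetical stand-ins):

```python
import json
import redis

r = redis.Redis()

def scrape(task):
    """Stand-in for Draup's scraper framework."""
    return {"url": task["url"], "data": {}}

def write_output(source, record):
    """Stand-in for the data handler that persists records to the database."""
    print("writing record for", source)

def worker(queues=("harvest:priority", "harvest:default")):
    """A priority worker would pass only the priority queue here."""
    while True:
        _queue_name, raw = r.brpop(queues)   # keys are checked in the given order
        task = json.loads(raw)
        record = scrape(task)
        write_output(task["source"], record)
```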

Optimized Scraping with Puppeteer 

We use Puppeteer for harvesting dynamic websites. Puppeteer is a Node library that provides a high-level API to control headless Chrome or Chromium over the DevTools Protocol. It lets us harvest data from websites and perform automated tasks that would otherwise require manual labor or complex configurations. We have optimized our Puppeteer setup by adding custom resource-type filtering while rendering the web page, so we fetch only the resources from the page that matter to us. For example, in most cases we only need the HTML, not images or ads. This cut the response time of queries through Puppeteer by 60-70 percent.
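
Draup’s filtering runs in Puppeteer itself (a Node library); purely as a Python illustration of the same idea, here is a sketch using pyppeteer, an unofficial Python port of the Puppeteer API, that aborts images, stylesheets, fonts, and media while letting the document through. The set of blocked resource types is an assumption for illustration.

```python
import asyncio
from pyppeteer import launch

BLOCKED = {"image", "stylesheet", "font", "media"}  # assumed resource types to skip

async def fetch_html(url):
    browser = await launch(headless=True)
    page = await browser.newPage()
    await page.setRequestInterception(True)

    async def intercept(request):
        # Abort heavy resources; let the document and scripts through.
        if request.resourceType in BLOCKED:
            await request.abort()
        else:
            await request.continue_()

    page.on("request", lambda req: asyncio.ensure_future(intercept(req)))
    await page.goto(url)
    html = await page.content()
    await browser.close()
    return html

if __name__ == "__main__":
    print(len(asyncio.run(fetch_html("https://example.com"))))
```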

Duplicate URL Checker

One of the major challenges in data collection is handling duplicates. To avoid collecting duplicated data, we have implemented a duplicate URL checker. If the URL has already been processed, it will not be processed again, saving time and resources. This system is critical for ensuring the data’s accuracy and maintaining our system’s integrity. 

URL Store – We use Elasticsearch for our URL store and for checking duplicates, as it is fast at searching URLs and indexing the data.
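
A simplified sketch of such a check (the index name, document ID scheme, and 8.x Python client are assumptions):

```python
import hashlib
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
INDEX = "harvested-urls"  # hypothetical index name

def _doc_id(url):
    return hashlib.sha256(url.encode()).hexdigest()  # stable ID per URL

def is_duplicate(url):
    """Return True if the URL has already been processed."""
    return bool(es.exists(index=INDEX, id=_doc_id(url)))

def mark_processed(url, source):
    """Record the URL so future runs skip it."""
    es.index(index=INDEX, id=_doc_id(url), document={"url": url, "source": source})
```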

Logging and Monitoring Mechanism 

The system’s logging and monitoring mechanism uses Kibana (ELK) and Grafana (with Prometheus) dashboards. These tools provide an in-depth view of the system’s performance and behavior. We have built customized dashboards to monitor various aspects of the system, including individual tasks, Redis servers, databases, and worker nodes. By leveraging these dashboards, we can proactively identify potential issues in real time and keep the system running smoothly.
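
As a small, hypothetical example of the worker-side half of this setup (metric names and labels are assumptions), a worker could expose counters and latencies for Prometheus to scrape and Grafana to graph:

```python
from prometheus_client import Counter, Histogram, start_http_server

# Hypothetical worker metrics scraped by Prometheus and visualized in Grafana.
TASKS_DONE = Counter("harvester_tasks_completed_total",
                     "Tasks completed", ["source", "queue"])
TASK_SECONDS = Histogram("harvester_task_duration_seconds",
                         "Time spent per task", ["source"])

def record_task(source, queue, seconds):
    TASKS_DONE.labels(source=source, queue=queue).inc()
    TASK_SECONDS.labels(source=source).observe(seconds)

if __name__ == "__main__":
    start_http_server(8000)  # expose /metrics for Prometheus to scrape
```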

Understanding the Harvester Task Flow 

Now that you understand all the building blocks, let us look at the task flow of the harvester system.

When it comes to harvesting websites, there are typically two types of processes: creating a list of links from the website and obtaining details from each link. The exact information varies with the website and its purpose, but the overall flow remains the same.

  • Listing Process: The listing process begins with a producer that creates the initial link. This link is added to the listing queue, from which the distribution server retrieves it based on the configurations set in the Config Layer. The server determines the importance of the source and places the link in either the priority queue or the default queue. Workers then retrieve the links from the queue and extract all the URLs from the page. The Duplicate URL Checker verifies that the links are unique, and the unique links are then sent to the Details Process.
  • Details Process: The Details Process takes over once the unique URLs are received. The distribution server again splits the tasks between the priority queue and the default queue. Workers retrieve the tasks from each queue and extract the details from each link. The resulting data is stored in MongoDB (a condensed sketch of this two-stage flow follows).
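
Here is that condensed, hypothetical sketch of the two-stage flow; the helper functions and the in-memory set are stand-ins for the real scraper framework and the Elasticsearch-backed URL store.

```python
import re
import requests

def fetch(url):
    return requests.get(url, timeout=30).text

def extract_links(html):
    # Naive link extraction for illustration; the real scraper framework is source-aware.
    return re.findall(r'href="(https?://[^"]+)"', html)

seen = set()  # stands in for the Elasticsearch-backed URL store

def listing_process(listing_url):
    """Stage 1: fetch a listing page, extract links, and drop duplicates."""
    links = extract_links(fetch(listing_url))
    fresh = [u for u in links if u not in seen]
    seen.update(fresh)
    return fresh

def details_process(detail_urls):
    """Stage 2: fetch each unique link and hand the details to a data handler."""
    for url in detail_urls:
        html = fetch(url)
        record = {"url": url, "length": len(html)}  # placeholder 'details'
        print("would write to MongoDB:", record)

if __name__ == "__main__":
    details_process(listing_process("https://example.com"))
```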

What Next? 

We are currently working on scaling the system 10x, improving the overall harvest rate, adding more sources, and making the system more resilient. We are also working to find patterns across different websites and build a machine learning model that can automatically create configs for different types of websites, alert us when a website’s format changes significantly, correctly map each website’s schema, and help us scale this system to millions of websites globally.

We also plan to open-source this project soon. Watch this space for updates!