Optimizing Data Processing at Scale

Posted on June 10, 2020 by Shubham Jain

Shubham Jain, Senior Software QA Engineer

How PubMatic processes 100+ billion ad impressions daily without losing a single record. 

As one of the largest independent ad tech companies, PubMatic processes over 100 billion ads and bid requests each day. Our data processing has grown exponentially in the last few years, with 50% more daily ad impressions processed this year compared with 2019. This growth has been accelerated by the increased web and app traffic resulting from the COVID-19 pandemic. With the ever-increasing volume, velocity, variety, variability, and complexity of information, we have had to solve new challenges to ensure processing occurs without any glitches for our clients.

Key Challenges in Data Processing:  

    1. Failures: There are countless ways in which a system can fail, including network errors, latency issues, memory leaks, power outages, hardware failures, malicious attacks, and so on. To make our platform fault-tolerant, we need to identify potential failures and develop proactive measures to circumvent them. The likelihood of a problem and its potential impact determine the extent to which the system needs to respond. Our system needs to detect and handle possible failures in an infrastructure involving thousands of software and hardware components.
    2. Volume and variety of data: We process ad impressions with 12 different varieties of semi-structured data generated through multiple products and solutions. At the same time, our data is generated from nine data centers geographically distributed across the US, Europe, and Asia. This introduces a lot of variability in the data, which must be handled at huge volumes.
    3. Quality: Having the right kind of data can help a business make sound decisions. Inaccurate data is not only useless; it can also be dangerously misleading when applied to business problems. To safeguard and grow the business, it is paramount to ensure the quality of the data so that discrepancies can be caught in time.
    4. Deployment velocity: The fast-paced ad tech industry demands speedy development to continuously add value for customers and stay ahead of the competition. The market leaders are the ones that have aligned their organizations to deliver innovation with agility and speed. This means maintaining a high deployment velocity.

How PubMatic is Solving These Challenges 

Zero tolerance for failure 

Our highly skilled engineering team has developed many in-house tools and frameworks that help us take preventive action before an incident happens, including:

  • Recycler tool: We use Kafka to ingest data into our system. If Kafka is unreachable for any reason, our APIs dump records into local storage. The recycler tool automatically processes those records within a few minutes and pushes them into Kafka once it is up and running again.
  • Dummy pipeline: We have implemented a dummy pipeline parallel to the main pipeline, which sends one dummy record at a specific time interval to the load balancer in each data center. We then compare the source record count with the target record count in the Hadoop file system output to ensure our system is running smoothly.
  • Alerting and Monitoring: We have developed an in-house framework for 24/7 alerting and monitoring. Using open-source and home-grown monitoring tools, we detect health issues and any discrepancies in the data that could impact publisher revenue. If these metrics fall outside the expected range, the system sends notifications through multiple channels and surfaces information to help identify the possible root causes.
  • High Availability (HA): Zero downtime is very important for any business. At PubMatic, we recently migrated our entire data center without any downtime. We did this by introducing a project called “Analytics 365,” which was developed to provide the most frequently used reports to our customers with zero downtime in a highly available, redundant environment that can be relied upon in the event of any unforeseen issues.
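
The recycler behaviour described above (write to local storage when Kafka is unreachable, then replay once it is back) can be sketched roughly as follows. This is a minimal illustration and not PubMatic's implementation: SpoolingSender, FlakyBroker, the JSON-lines spool file, and the use of ConnectionError to signal an unreachable broker are all assumptions made for the example, and the send callable is a hypothetical stand-in for a real Kafka producer call.

```python
import json
import os
from typing import Callable, List


class FlakyBroker:
    """Hypothetical test double for Kafka: fails while `up` is False."""

    def __init__(self) -> None:
        self.up = True
        self.received: List[dict] = []

    def send(self, record: dict) -> None:
        if not self.up:
            raise ConnectionError("broker unreachable")
        self.received.append(record)


class SpoolingSender:
    """Send records to a broker; spool them to a local file on failure."""

    def __init__(self, send: Callable[[dict], None], spool_path: str) -> None:
        self.send = send              # assumed stand-in for a producer call
        self.spool_path = spool_path  # assumed local spool file (JSON lines)

    def publish(self, record: dict) -> bool:
        """Return True if delivered, False if the record was spooled."""
        try:
            self.send(record)
            return True
        except ConnectionError:
            with open(self.spool_path, "a", encoding="utf-8") as f:
                f.write(json.dumps(record) + "\n")
            return False

    def recycle(self) -> int:
        """Replay spooled records in order; keep those that still fail."""
        if not os.path.exists(self.spool_path):
            return 0
        with open(self.spool_path, encoding="utf-8") as f:
            pending = [json.loads(line) for line in f if line.strip()]
        replayed, remaining = 0, []
        for record in pending:
            try:
                self.send(record)
                replayed += 1
            except ConnectionError:
                remaining.append(record)
        # Rewrite the spool so only undelivered records are left behind.
        with open(self.spool_path, "w", encoding="utf-8") as f:
            for record in remaining:
                f.write(json.dumps(record) + "\n")
        return replayed
```

In a setup like this, recycle() would be called on a timer (every few minutes, say). A production version would also need to guard against concurrent writers and against duplicate delivery if the process crashes partway through a replay, neither of which this sketch attempts.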

Dealing with volume and variety of data:

PubMatic has more than 50 Kafka clusters across nine data centers, with 1,000-plus servers handling 150 terabytes of compressed data daily. To deal with such a huge volume, our engineering team has made optimizations in several areas. For example, separating bidless impressions from our regular pipeline reduced the regular pipeline's overall processing time by 40% and its storage requirements by 30%. These optimizations allow our platform to serve more requests with existing hardware.
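
As a rough illustration of separating bidless impressions from the regular pipeline, the sketch below routes records into two streams before any heavier processing happens. It assumes that a bidless impression is a record that received no bids and that this is visible in a hypothetical bid_count field; the actual record schema and routing logic are not described in this post.

```python
from typing import Iterable, List, Tuple


def split_bidless(records: Iterable[dict]) -> Tuple[List[dict], List[dict]]:
    """Partition records into (regular, bidless).

    Assumption: a record is "bidless" when its hypothetical `bid_count`
    field is missing or zero. Real records may encode this differently.
    """
    regular: List[dict] = []
    bidless: List[dict] = []
    for record in records:
        if record.get("bid_count", 0) > 0:
            regular.append(record)
        else:
            bidless.append(record)
    return regular, bidless


if __name__ == "__main__":
    sample = [
        {"id": 1, "bid_count": 2},
        {"id": 2, "bid_count": 0},
        {"id": 3},  # no bid information at all: treated as bidless
    ]
    regular, bidless = split_bidless(sample)
    print(len(regular), "regular,", len(bidless), "bidless")
```

The idea is that the regular pipeline then only pays the processing and storage cost for records that carry bids, while bidless records can follow a lighter path.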

Quality:

Quality is paramount for us, and automation has become important for enabling stable, high-frequency releases. In 2018, we started an initiative called “Automation first,” which mandated that all testing be done through code. We have achieved more than 95% automation across all components using PySpark, with more than 30,000 test cases.

Release velocity:

PubMatic is set to improve release velocity in 2020, with an added mission to run end-to-end automation in one hour to help us release faster. To achieve this goal, we have taken the necessary actions to improve our functional and performance automation in each area.

When empowered with correct data, the business can make smarter decisions that drive positive customer outcomes. Through the systems and processes we’ve built, PubMatic ensures that data is accurate, high-quality, precise, and healthy, helping our publishers make informed decisions and increase their monetization.