PubMatic receives over 12 trillion advertiser bids each month, with over 55 billion daily ad impressions served to millions of devices across the globe. We also generate more than 700 terabytes of data on a daily basis. All of this data is processed in a complex infrastructure with thousands of servers.
You can imagine the challenges we face with validating all of this heterogeneous and complex data on such a large scale and volume. In fact, there are 12 different varieties of data, many being semi-structured, generated through multiple products and solutions. Further, our data is generated from geographically distributed locations such as the US, Europe, Japan, and Singapore, to name a few. This introduces a lot of variability in the data.
Along with the above challenges, we also need to consider the various types of errors we might encounter with all this data, including communication incidents, network errors and so on. At PubMatic, the most important parts of testing big data applications are creating the test data and then verifying the resulting data against the business logic. As big data testing engineers, we consider these three “Vs” important to improve data quality: volume, variety and variability.
In light of the complex and challenging nature of big data, we created an automation framework to help us test this huge dataset.
Automation for Testing Big Data Applications
We developed an automation framework that validates the business logic and lets us introduce volume, variety and variability in testing. The framework is highly configurable, and it allows plugins to be written and then integrated with it.
Two key plugins are used:
Data Refresh Plugin
The data refresh plugin pulls gigabytes of live data in seconds, which gives us a higher volume of test data. The plugin also sanitizes the data and readies it for use in the framework.
NLP-Based Data Generation Plugin
This Natural Language Processing (NLP) plugin generates the variation in data that the test cases call for. It takes business rules in text format as input and produces data based on these rules. This data generator plugin also helps us feed both functional and negative test data into the application.
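To make the idea of turning text rules into test data concrete, here is a minimal sketch. It is not the plugin itself: the rule grammar (`<field> is <value>`, `<field> greater than <number>`, joined by `and`), the field names, and the helper names `parse_rule` and `generate_record` are all assumptions made for illustration, and a simple regular expression stands in for real NLP.

```python
import re

# Hypothetical rule grammar, assumed for this sketch only:
#   "<field> is <value>", "<field> greater than <n>", "<field> less than <n>",
# with clauses joined by "and".
CLAUSE = re.compile(
    r"^\s*(?P<field>\w+)\s+(?P<op>is|greater than|less than)\s+(?P<value>\S+)\s*$"
)

def parse_rule(rule):
    """Split a plain-text rule into (field, operator, value) constraints."""
    constraints = []
    for clause in re.split(r"\s+and\s+", rule.strip(), flags=re.IGNORECASE):
        match = CLAUSE.match(clause.lower())
        if match is None:
            raise ValueError(f"unsupported clause: {clause!r}")
        constraints.append((match["field"], match["op"], match["value"]))
    return constraints

def generate_record(constraints, negative=False):
    """Build one record that satisfies every constraint.

    With negative=True the first constraint is deliberately violated,
    which yields a negative test record.
    """
    record = {}
    for index, (field, op, value) in enumerate(constraints):
        violate = negative and index == 0
        if op == "is":
            record[field] = f"not_{value}" if violate else value
        elif op == "greater than":
            number = float(value)
            record[field] = number - 1 if violate else number + 1
        else:  # "less than"
            number = float(value)
            record[field] = number + 1 if violate else number - 1
    return record

rule = "platform is mobile and bid greater than 2.5"
constraints = parse_rule(rule)
positive = generate_record(constraints)
negative = generate_record(constraints, negative=True)
print(positive)
print(negative)
```

The same parsed constraints drive both the functional record and the negative one, so a single written rule can yield both kinds of test data.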
Finding the Correct Dataset for Testing
Several operations need to be performed on the business rules to get the correct dataset for tests, including:
- Tokenization:
This is the first step, which involves chopping up the test cases into pieces.
- Dropping Stop Words:
In this step we drop extremely common words such as the, is, an, etc., which add little value in identifying the correct dataset. This is done with the help of Python’s NLTK library.
- Normalization:
We canonicalize the tokens so that matches occur despite superficial differences. For example, we convert P.M.P. to PMP (Private Marketplace), or OpenExchange/oRTB impression to RTB impression.
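The steps above can be sketched in a few lines of plain Python. This is an illustrative stand-in rather than our production code: the stop-word set is a tiny hand-written sample instead of NLTK's list, and the `SYNONYMS` table and function names are assumptions modeled on the examples in the text.

```python
import re

# Small sample list for illustration; NLTK ships a much fuller one.
STOP_WORDS = {"a", "an", "and", "for", "in", "is", "of", "on", "the", "to"}

# Hypothetical synonym table: different spellings mapped to one canonical token.
SYNONYMS = {"openexchange": "rtb", "ortb": "rtb"}

def tokenize(text):
    """Split on whitespace and slashes, then trim surrounding punctuation."""
    raw = re.split(r"[\s/]+", text.lower())
    tokens = [token.strip(",;:!?()\"'") for token in raw]
    return [token for token in tokens if token]

def drop_stop_words(tokens):
    """Remove very common words that do not help pick a dataset."""
    return [token for token in tokens if token not in STOP_WORDS]

def normalize(tokens):
    """Remove periods (p.m.p. -> pmp) and map known synonyms to one form.

    Note: this naive rule would also strip the point from decimals.
    """
    result = []
    for token in tokens:
        token = token.replace(".", "")
        if token:
            result.append(SYNONYMS.get(token, token))
    return result

def preprocess(text):
    return normalize(drop_stop_words(tokenize(text)))

tokens = preprocess("Validate the P.M.P. deal for an OpenExchange /oRTB impression.")
print(tokens)
```

Running the three steps in this order keeps each one simple: stop words are dropped before normalization, so the synonym table only has to cover meaningful tokens.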
My team also implemented several strategies to classify the base record required for the business validation.
- Stemming and Lemmatization:
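During this stage we reduce inflectional and derivational forms to a common base; for example, winning and won are both converted to win. We use the Porter Stemmer algorithm for the stemming. Irregular forms such as won fall to lemmatization, since a suffix-stripping stemmer cannot reach them.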
- Machine Learning Algorithm based on Naïve Bayes:
We use a machine learning (ML) algorithm to classify the use case. It uses the bag of words model to identify the correct data set. With the implementation of Naïve Bayes, we have seen accuracy improve dramatically.
- Rule-Based Classification:
Alternatively, rather than always resorting to an ML algorithm, we use rule-based classification. This gives us very high accuracy. However, rule-based classification requires the test cases and business logic to be written in an intelligent manner.
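The classification strategies above can be illustrated with a small, self-contained sketch. Everything specific here is an assumption: `crude_stem` is a rough stand-in and not the Porter Stemmer, the training sentences and dataset labels are made up, and trying the keyword rules before falling back to Naïve Bayes is one plausible way to combine the two approaches, not necessarily how our framework does it.

```python
import math
from collections import Counter, defaultdict

IRREGULAR = {"won": "win"}  # lemma lookup for forms a suffix rule cannot reach

def crude_stem(token):
    """Very rough stand-in for a stemmer; NOT the Porter algorithm."""
    if token in IRREGULAR:
        return IRREGULAR[token]
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            token = token[: -len(suffix)]
            # Undouble a trailing consonant after "ing"/"ed": "winn" -> "win".
            if suffix != "s" and token[-1] == token[-2] and token[-1] not in "aeiou":
                token = token[:-1]
            break
    return token

class NaiveBayes:
    """Multinomial Naive Bayes over bag-of-words counts, add-one smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter()              # label -> number of examples
        self.vocabulary = set()

    def train(self, examples):
        for text, label in examples:
            words = [crude_stem(word) for word in text.lower().split()]
            self.word_counts[label].update(words)
            self.doc_counts[label] += 1
            self.vocabulary.update(words)

    def classify(self, text):
        words = [crude_stem(word) for word in text.lower().split()]
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, docs in self.doc_counts.items():
            counts = self.word_counts[label]
            denominator = sum(counts.values()) + len(self.vocabulary)
            score = math.log(docs / total_docs)  # log prior
            for word in words:
                score += math.log((counts[word] + 1) / denominator)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Hypothetical keyword rules and made-up training data.
RULES = [("pmp", "pmp_records"), ("rtb", "rtb_records")]
TRAINING = [
    ("pmp deal winning bid", "pmp_records"),
    ("private marketplace deal won", "pmp_records"),
    ("rtb auction impression", "rtb_records"),
    ("open auction impressions served", "rtb_records"),
]

def pick_dataset(text, model):
    """Try the keyword rules first; fall back to the Naive Bayes model."""
    words = set(text.lower().split())
    for keyword, label in RULES:
        if keyword in words:
            return label
    return model.classify(text)

model = NaiveBayes()
model.train(TRAINING)
print(pick_dataset("deal wins", model))
print(pick_dataset("rtb impression check", model))
```

The rules give a predictable answer whenever a test case names the product explicitly, while the model covers loosely worded test cases that no rule matches.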
The Core Framework
The big data testing framework provides an easy interface to validate business logic based on simple SQL queries. The queries run against the input and output of the big data application. Query-based validations reduce complexity and make test case automation simpler.
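As a rough illustration of query-based validation, the sketch below compares an input table with an output table using one SQL query. Python's built-in `sqlite3` stands in for Spark SQL so the example runs anywhere; the table names, columns and the business rule itself are hypothetical.

```python
import sqlite3

# In-memory database standing in for the job's input and output datasets.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE input_bids (publisher_id INTEGER, bid REAL, won INTEGER);
    CREATE TABLE output_summary (publisher_id INTEGER, impressions INTEGER, revenue REAL);
    INSERT INTO input_bids VALUES (1, 2.0, 1), (1, 3.0, 1), (1, 1.5, 0), (2, 4.0, 1);
    INSERT INTO output_summary VALUES (1, 2, 5.0), (2, 1, 4.0);
    """
)

# Hypothetical business rule: each publisher's output row must equal the count
# and sum of its winning input bids. The query returns only the violations.
MISMATCH_QUERY = """
    SELECT e.publisher_id
    FROM (
        SELECT publisher_id, COUNT(*) AS impressions, SUM(bid) AS revenue
        FROM input_bids WHERE won = 1 GROUP BY publisher_id
    ) AS e
    LEFT JOIN output_summary AS o ON o.publisher_id = e.publisher_id
    WHERE o.publisher_id IS NULL
       OR o.impressions != e.impressions
       OR ABS(o.revenue - e.revenue) > 1e-9
"""

def validate(connection, query):
    """Return the violating rows; an empty list means the test passes."""
    return connection.execute(query).fetchall()

mismatches = validate(conn, MISMATCH_QUERY)
print("PASS" if not mismatches else f"FAIL: {mismatches}")
```

Writing the check so that it returns only violating rows means the test case is just the query: an empty result is a pass, and any rows returned point directly at the failing records. With Spark, a query of the same shape could be run through `spark.sql` once the input and output are registered as temporary views.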
The framework can be extended to any data format, including Parquet, Avro, JSON or even CSV. Since the core of the framework is Spark™, it leverages Spark libraries to support these formats.
Because test cases reduce to simple queries, we can automate them while development is still in progress, which translates to faster releases.
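The benefit of format independence can be shown with a small sketch: once every format is loaded into the same row shape, the validation logic does not care where the rows came from. This is only an analogy built on the standard library's `csv` and `json` modules; in the framework this role is played by Spark's readers, and the `READERS` registry and `load_rows` helper are hypothetical names.

```python
import csv
import io
import json

def read_csv(stream):
    """Each CSV row becomes a dict keyed by the header line."""
    return list(csv.DictReader(stream))

def read_json_lines(stream):
    """One JSON object per line."""
    return [json.loads(line) for line in stream if line.strip()]

# Hypothetical registry: supporting a new format means adding one entry.
READERS = {"csv": read_csv, "json": read_json_lines}

def load_rows(stream, fmt):
    try:
        reader = READERS[fmt]
    except KeyError:
        raise ValueError(f"no reader registered for format {fmt!r}") from None
    return reader(stream)

csv_rows = load_rows(io.StringIO("publisher_id,bid\n1,2.5\n"), "csv")
json_rows = load_rows(io.StringIO('{"publisher_id": "1", "bid": "2.5"}\n'), "json")
print(csv_rows == json_rows)
```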
The framework and automation approach provides our internal teams and our partners a better quality product and shorter time-to-market. Some of the noticeable benefits that we saw were:
- Significant time reduction in test data creation, from hours to minutes
- Faster releases: we are able to release at 2X our traditional release frequency
- Since our framework uses Spark, our automation can scale with future increases in data size
- The framework provides the flexibility to write future plugins
- We can automate alongside feature development, which shrinks the automation backlog, shortens time to market and dramatically cuts the QA functional test cycle
- The framework helps us to incorporate the volume, variety and variability in the test data, improving the quality of our big data jobs