PubMatic handles 516 billion daily impressions, and maintaining a large infrastructure can pose challenges. The ad tech industry is rapidly evolving, and we are committed to remaining ahead of the trends impacting our industry. We are continuously innovating, which requires effective automation testing of our owned and operated cloud infrastructure to ensure stable software performance and that project goals are met on time.
Traditional testing approaches posed challenges due to the complexity and scale of PubMatic’s infrastructure. Our goal was to create a solution, known as Project Flash, that would drastically reduce test suite execution time with minimal changes.
Problems With the Traditional Test Execution Approach
Non-atomic test cases with complex dependencies and multiple steps can be challenging to manage, as issues or changes in one step can have a ripple effect throughout the test case, resulting in inefficient and time-consuming test execution processes that may delay the overall testing lifecycle.
Scalability can also be a challenge for traditional test execution, especially for large or complex applications, which can limit the ability to run a high number of tests in parallel, leading to longer testing times.
The Software Development Engineer in Test (SDET) team manages over 120,000 extensive test scenarios, which could take days to execute sequentially. As a result, even small changes that were developed quickly faced significant delays in bug detection and release qualification. This hindered time-to-market and called for a streamlined testing lifecycle.
Project Flash: Linux Containers (LXC)-Based Automation Design
We decided to implement LXC-based virtual parallel execution, as LXC containers can be created easily on bare-metal machines. LXC is a container-based virtualization system, which we used to isolate execution programs, such as different types of server components with multiple processes, so that they can run in parallel on the same machine. Any number of LXC containers can be quickly spawned and managed using the LXD daemon. By leveraging the capabilities of LXD and LXC, we achieved the initial phase of parallelism.
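As a rough illustration of how a pool of containers might be spawned, the sketch below builds one `lxc launch` command per worker. The image alias, the container naming scheme, and the worker count are assumptions made for this example and are not taken from PubMatic's setup. The commands are only printed by default; they are executed only when `dry_run=False` is passed on a host where the LXD client and daemon are available.

```python
import subprocess


def build_launch_commands(image, prefix, count):
    """Return one `lxc launch <image> <name>` command per worker container."""
    if count < 1:
        raise ValueError("count must be at least 1")
    return [
        ["lxc", "launch", image, f"{prefix}-{i:02d}"]
        for i in range(1, count + 1)
    ]


def launch_containers(commands, dry_run=True):
    """Print each command; run it only when dry_run is False."""
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            # Requires the `lxc` client and a running LXD daemon on this host.
            subprocess.run(cmd, check=True)


# Hypothetical image alias and container prefix, chosen for illustration only.
commands = build_launch_commands("ubuntu:22.04", "flash-worker", 3)
launch_containers(commands)
```

Keeping command construction separate from execution makes the naming scheme easy to check without touching a real host.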
Ansible plays a pivotal role in this process by managing each container for parallel execution. Each LXC container has its own unique IP address, which Ansible treats as a host. Using ansible-playbook, we can give each host specific instructions on what to execute, along with runtime parameters.
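One way to picture the host-per-container idea is the sketch below, which renders a minimal INI-style inventory and builds an `ansible-playbook` invocation restricted to a single container with `--limit`, passing runtime parameters through `-e`. The IP addresses, group name, playbook file name, and the `suite` variable are all hypothetical placeholders; the actual playbooks and parameters used in Project Flash are not described in this article.

```python
def render_inventory(group, hosts):
    """Render a minimal INI-style Ansible inventory for one host group."""
    lines = [f"[{group}]"]
    lines.extend(hosts)
    return "\n".join(lines) + "\n"


def playbook_command(inventory_path, playbook, host, extra_vars):
    """Build an ansible-playbook invocation limited to a single container."""
    cmd = ["ansible-playbook", "-i", inventory_path, playbook, "--limit", host]
    for key, value in sorted(extra_vars.items()):
        # Each runtime parameter is passed as its own -e key=value pair.
        cmd.extend(["-e", f"{key}={value}"])
    return cmd


# Hypothetical container IPs, suite names, and file names, for illustration only.
suite_by_host = {"10.0.3.11": "suite_a", "10.0.3.12": "suite_b"}
inventory_text = render_inventory("flash_workers", list(suite_by_host))
commands = [
    playbook_command("inventory.ini", "run_suite.yml", host, {"suite": suite})
    for host, suite in suite_by_host.items()
]

print(inventory_text)
for cmd in commands:
    print(" ".join(cmd))
```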
Moreover, using automation containers, we can dynamically execute test suites in parallel. These containers operate independently, communicating exclusively within their specified LXC instances to ensure the execution of programs remains free from dependencies.
To overcome the issue of data dependency during parallel execution, we developed a Python program that creates a replica of the existing test data set and maps it to individual suites.
Pain points that were solved during the transition from traditional to parallel architecture:
Dataset deadlock (solved conflict)
- When using the sequential approach, automation utilized the same dataset. However, running tests in parallel caused conflicts as each suite modified the shared dataset in the database. To address this, allocating a separate dataset for each suite is crucial. This can be achieved through data replication, creating an exact replica of the existing dataset. By implementing this solution, each suite has its own data, ensuring independent handling and improved efficiency during parallel execution.
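The replication idea can be sketched as follows, using an in-memory SQLite database as a stand-in for the real test database, which this article does not name. The table name, columns, and suite names are hypothetical. Each suite receives its own copy of the base table, so a write made by one suite is not visible to another suite or to the original data.

```python
import re
import sqlite3

_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def replicate_dataset(conn, base_table, suites):
    """Copy base_table once per suite and return a suite -> table mapping."""
    # Table names cannot be bound as SQL parameters, so validate them first.
    for name in [base_table, *suites]:
        if not _IDENTIFIER.fullmatch(name):
            raise ValueError(f"unsafe identifier: {name!r}")
    mapping = {}
    for suite in suites:
        replica = f"{base_table}_{suite}"
        conn.execute(f"DROP TABLE IF EXISTS {replica}")
        conn.execute(f"CREATE TABLE {replica} AS SELECT * FROM {base_table}")
        mapping[suite] = replica
    conn.commit()
    return mapping


# Hypothetical base dataset with two rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test_data (id INTEGER, status TEXT)")
conn.executemany("INSERT INTO test_data VALUES (?, ?)", [(1, "new"), (2, "new")])

mapping = replicate_dataset(conn, "test_data", ["suite_a", "suite_b"])

# suite_a modifies only its own replica.
conn.execute(f"UPDATE {mapping['suite_a']} SET status = 'used' WHERE id = 1")
```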
Database performance stability issues
- Transitioning to a multi-threaded system posed stability challenges, including deadlocks, slow query execution, and connection drops. To address these concerns, we thoroughly examined the automation code, restructured lengthy queries, and streamlined repetitive data operations. Additionally, we reviewed and improved the database transactions issued by the automation so that they target specific datasets rather than broader ones. These measures aimed to improve system performance and stability during the transition.
Simplification of debugging in parallel execution
- Pinpointing failures becomes challenging when multiple automated processes run at the same time. If all instances use a generic dataset, any process can encounter issues with the same data. In the initial phase, we established a dedicated directory to store timed automation logs for failed executions. These logs were analyzed alongside other ongoing automation suites, allowing us to adopt an incremental strategy for identifying and resolving each failure.
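A minimal sketch of such a failure-log directory is shown below. It assumes that "timed" logs means one file per failed suite with a timestamp in the file name; that naming scheme, the function name, and the sample content are illustrative assumptions rather than the format actually used.

```python
import tempfile
from datetime import datetime
from pathlib import Path


def write_failure_log(log_dir, suite, output, when=None):
    """Store the output of a failed suite in a timestamped file; return its path."""
    when = when or datetime.now()
    log_dir = Path(log_dir)
    log_dir.mkdir(parents=True, exist_ok=True)
    path = log_dir / f"{suite}_{when:%Y%m%d_%H%M%S}.log"
    path.write_text(output, encoding="utf-8")
    return path


# Example run against a temporary directory with a fixed timestamp.
with tempfile.TemporaryDirectory() as tmp:
    path = write_failure_log(
        tmp, "suite_a", "example failure output", datetime(2024, 1, 2, 3, 4, 5)
    )
    saved_name = path.name
    saved_text = path.read_text(encoding="utf-8")

print(saved_name)  # suite_a_20240102_030405.log
```

Because the suite name and time are both in the file name, failures from suites that ran at the same moment stay in separate files and can be reviewed one at a time.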
Test case idempotency
- With the transition to a parallel test execution architecture, it becomes important to ensure that the same test case can be executed multiple times without affecting the system under test. This is known as test case idempotency. Without it, executing multiple instances of the same test case concurrently can lead to race conditions and inconsistent test results.
To ensure test case idempotency, it may be necessary to modularize the test cases, breaking them down into smaller, independent units. The test environment must also be configured so that parallel executions of the same test case do not interfere with each other. Additionally, synchronization mechanisms should be implemented to control access to shared resources.
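The sketch below illustrates these ideas on a toy example and is not PubMatic's test code. A dictionary stands in for a shared resource in the system under test; each run works on a key unique to that run, access is guarded by a lock, and cleanup runs in a `finally` block so that repeated or concurrent runs leave the shared state unchanged.

```python
import threading
import uuid
from concurrent.futures import ThreadPoolExecutor

# Stand-in for a shared resource in the system under test.
shared_store = {}
store_lock = threading.Lock()


def run_test_case():
    """Create, verify, and remove a record under a key unique to this run."""
    key = f"record-{uuid.uuid4().hex}"
    with store_lock:  # synchronize access to the shared resource
        shared_store[key] = "created"
    try:
        with store_lock:
            return shared_store.get(key) == "created"
    finally:
        with store_lock:
            shared_store.pop(key, None)  # cleanup leaves no trace behind


# Run the same test case 50 times across 8 worker threads.
with ThreadPoolExecutor(max_workers=8) as pool:
    futures = [pool.submit(run_test_case) for _ in range(50)]
    results = [f.result() for f in futures]

print(all(results), len(shared_store))
```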
The outcome has been remarkable, reducing automation execution time from 48 hours to one hour, resulting in accelerated release signoffs. The implementation of LXC containerization-based automation architecture, known as Project Flash, has empowered PubMatic to keep pace with the rapidly evolving ad tech industry.