
PubMatic’s Approach to an Auto-Scaling ads.txt Crawler System

By Abbas Suterwala, Manager, Engineering
June 28, 2018

ads.txt is an IAB initiative, introduced in May 2017, that allows publishers to publicly list the companies authorized to sell their digital inventory. In practice, ads.txt is a simple text file hosted on a publisher's web server that helps combat domain spoofing: by reading this file, buyers can verify whether a company is an authorized seller of that publisher's digital inventory. As of April 2018, ads.txt had roughly 60 percent adoption among the top 1,000 US publishers.

As a supporter of ads.txt, and in order to comply with the initiative, PubMatic needed a system to read the file on each domain in an automated fashion and then filter out unauthorized inventory. Sounds simple, right?

But, as always, things got interesting when we had to scale it to millions of domains, filtering billions of requests, all in real time.

Main Components of the ads.txt Crawler and Filtering System

The following are the main components of the ads.txt crawler and filtering system we developed:

Domain Ingestion

The first challenge we had to solve was how to build the list of domains for the crawler. We prepared two pipelines as input for the crawler system:

  • A scheduled job to aggregate allowlisted domains in our platform.
  • A Spark job to parse the AdServer logs for non-allowlisted domains (sketched below).
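
The second pipeline does the heavier lifting. Below is a minimal sketch of such a Spark job in Java, assuming the AdServer logs and the allowlist are stored as Parquet with a domain column; the paths and schema here are hypothetical, not our production layout.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class DomainExtractionJob {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("adserver-domain-extraction")
                .getOrCreate();

        // Hypothetical inputs: one row per ad request, plus the current allowlist.
        Dataset<Row> logs = spark.read().parquet("hdfs:///logs/adserver/dt=2018-06-28/");
        Dataset<Row> allowlisted = spark.read().parquet("hdfs:///reference/allowlisted_domains/");

        // Distinct domains seen in traffic that are not already allowlisted.
        Dataset<Row> traffic = logs.select("domain").distinct();
        Dataset<Row> candidates = traffic.join(
                allowlisted,
                traffic.col("domain").equalTo(allowlisted.col("domain")),
                "left_anti");

        // Feed the crawler's input domain list.
        candidates.write().mode("overwrite").parquet("hdfs:///crawler/input/domains/");
        spark.stop();
    }
}
```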

Domain Crawler

The crawler has two major components:

  • ads.txt Fetcher –

The fetcher picks up a batch of domains from the input domain list and locks those rows so that no other ads.txt fetcher (running on a different instance) picks up the same domains in the current run. It then gathers the ads.txt files from the various domains in parallel and submits them to the next step, the ads.txt Parser.
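
A minimal sketch of that claim-and-fetch pattern is below. The crawl_queue table, its columns, and the locked_by tagging are illustrative assumptions, not our actual schema; the point is simply that each instance atomically claims a batch before fetching, so two fetchers never crawl the same domain in one run.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

public class AdsTxtFetcher {
    private final HttpClient http = HttpClient.newHttpClient();

    /** Claims a batch of unclaimed domains by tagging them with this instance's
     *  ID, so fetchers on other instances skip them in the current run. */
    List<String> claimBatch(Connection conn, String instanceId, int batchSize) throws SQLException {
        try (PreparedStatement claim = conn.prepareStatement(
                "UPDATE crawl_queue SET locked_by = ? WHERE locked_by IS NULL LIMIT ?")) {
            claim.setString(1, instanceId);
            claim.setInt(2, batchSize);
            claim.executeUpdate();
        }
        List<String> domains = new ArrayList<>();
        try (PreparedStatement select = conn.prepareStatement(
                "SELECT domain FROM crawl_queue WHERE locked_by = ?")) {
            select.setString(1, instanceId);
            try (ResultSet rs = select.executeQuery()) {
                while (rs.next()) {
                    domains.add(rs.getString(1));
                }
            }
        }
        return domains;
    }

    /** Fetches http://<domain>/ads.txt for every claimed domain in parallel. */
    List<CompletableFuture<HttpResponse<String>>> fetchAll(List<String> domains) {
        List<CompletableFuture<HttpResponse<String>>> futures = new ArrayList<>();
        for (String domain : domains) {
            HttpRequest request = HttpRequest.newBuilder(
                    URI.create("http://" + domain + "/ads.txt")).GET().build();
            futures.add(http.sendAsync(request, HttpResponse.BodyHandlers.ofString()));
        }
        return futures;
    }
}
```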

  • ads.txt Parser –

The parser takes each downloaded ads.txt file and extracts the authorized seller records. Only incremental changes are sent to the data store.
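
Per the IAB specification, each data record in ads.txt is a comma-separated line: the ad system's domain, the publisher's account ID, the relationship (DIRECT or RESELLER), and an optional certification authority ID. Here is a simplified parser sketch in modern Java:

```java
import java.util.ArrayList;
import java.util.List;

public class AdsTxtParser {

    /** One authorization record from an ads.txt file. */
    record AdsTxtRecord(String adSystemDomain, String publisherId,
                        String relationship, String certAuthorityId) {}

    static List<AdsTxtRecord> parse(String fileContents) {
        List<AdsTxtRecord> records = new ArrayList<>();
        for (String line : fileContents.split("\\r?\\n")) {
            // Strip comments (everything after '#') and surrounding whitespace.
            int hash = line.indexOf('#');
            if (hash >= 0) {
                line = line.substring(0, hash);
            }
            line = line.trim();
            // Skip blank lines and variable declarations such as contact= or subdomain=.
            if (line.isEmpty() || line.contains("=")) {
                continue;
            }
            String[] fields = line.split(",");
            if (fields.length < 3) {
                continue; // malformed record
            }
            records.add(new AdsTxtRecord(
                    fields[0].trim().toLowerCase(),
                    fields[1].trim(),
                    fields[2].trim().toUpperCase(),
                    fields.length > 3 ? fields[3].trim() : null));
        }
        return records;
    }
}
```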

Domain Filtering

Each incoming bid request carries a domain name and a publisher ID. The AdServer matches these against the parsed ads.txt data in real time. Given the low-latency and small-memory-footprint requirements of our AdServer, we cache the data locally on each AdServer instance.
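
A minimal sketch of such a per-instance cache follows. The allow-when-no-file policy matches the spec's default (a domain without an ads.txt file places no restrictions on sellers); the class and method names are illustrative.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** In-memory authorization cache held on each AdServer instance and
 *  refreshed periodically from the parsed ads.txt data store. */
public class AdsTxtCache {
    // domain -> publisher IDs authorized to sell that domain's inventory
    private final Map<String, Set<String>> authorized = new ConcurrentHashMap<>();

    public void put(String domain, Set<String> publisherIds) {
        authorized.put(domain.toLowerCase(), Set.copyOf(publisherIds));
    }

    /** Filter decision for a bid request: allow if the domain publishes no
     *  ads.txt file, or if its file lists this publisher ID. */
    public boolean isAuthorized(String domain, String publisherId) {
        Set<String> ids = authorized.get(domain.toLowerCase());
        return ids == null || ids.contains(publisherId);
    }
}
```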

ads.txt Reporting

Our team also created a reporting component that sends crawler coverage stats to stakeholders after each run. The report includes total scanned domains, declared domains, unauthorized domains, and various kinds of errors, which helps us understand publisher ads.txt adoption.
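
As an illustration of what such a report aggregates, a per-run tally might look like the sketch below; the outcome categories are hypothetical, not our exact schema.

```java
import java.util.EnumMap;
import java.util.Map;

/** Per-run coverage tally behind the crawler report. */
public class CrawlRunStats {
    enum Outcome { DECLARED, NO_ADS_TXT, UNAUTHORIZED_ENTRY, DNS_ERROR, HTTP_ERROR, PARSE_ERROR }

    private final Map<Outcome, Long> counts = new EnumMap<>(Outcome.class);
    private long scanned;

    /** Record the outcome of scanning one domain. */
    public void record(Outcome outcome) {
        scanned++;
        counts.merge(outcome, 1L, Long::sum);
    }

    public String summary() {
        return "scanned=" + scanned + " " + counts;
    }
}
```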

Technical Challenges

While building the ads.txt crawler system, we also had to solve several major technical challenges, including:

  • Ingestion: We had to create a crawler that could efficiently ingest millions of domains across multiple sources at scale.
  • Scanning at scale: Given the number of domains, our crawler needed to auto-scale easily across multiple instances as the number of domains continually increases.
  • Filtering requests in real-time: Data checks need to happen in real time, so we had to build a highly efficient filtering system with appropriate data structures.
  • Scaling the database: Our databases needed tuning so they could effectively handle both the write and read loads.

In the initial stages, and throughout subsequent updates, we relied on Java and Spring Boot for the services and MySQL as the data store to get the entire system up and running.
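
As a sketch of the Spring Boot glue (the property name and timing below are assumptions), each crawler instance runs its crawl pass on a fixed delay; because every pass starts by claiming its own batch of domains, adding instances scales the crawl horizontally:

```java
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.scheduling.annotation.EnableScheduling;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

@SpringBootApplication
@EnableScheduling
public class CrawlerApplication {
    public static void main(String[] args) {
        SpringApplication.run(CrawlerApplication.class, args);
    }
}

@Component
class CrawlLoop {
    /** One crawl pass: claim a batch, fetch ads.txt files in parallel,
     *  parse them, and upsert only the deltas into the data store. */
    @Scheduled(fixedDelayString = "${crawler.delay-ms:60000}")
    public void crawlPass() {
        // claim batch -> fetch -> parse -> write incremental changes
    }
}
```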

What Does This Mean for Partners?

PubMatic supports a fraud-free ecosystem and was an early supporter of ads.txt. By creating an auto-scaling ads.txt crawler, we are able to better root out fraud in programmatic advertising and protect our partners. To learn more about our quality initiatives, check out our latest content or contact us.