Puppeteer-based Simple Data Scraper: Advanced Options

This is the fourth part of our Puppeteer series, where we show how Puppeteer can easily be leveraged to scrape data from web pages. We started off by showing what Puppeteer is capable of and how it can be used to extract data using CSS selectors.

In the second article we showed how to build a rule-based data scraper with Puppeteer, then proceeded to show how Puppeteer's native functions can also be used to build a rule-based scraper. In this article, we show how Puppeteer's advanced capabilities can make our scraper better equipped to handle real-world use cases. Namely, we will explore the following:

  • Controlling page load behavior
  • HTTP Authentication
  • Adding extra headers
  • Changing user agent
  • Adding cookies
  • Intercepting requests and rejecting resources

The simple scraper function

Let's start with a little reminder of what our scraper function looks like:

// simple-scraper.js

const debug = require('debug')('simple-scraper');

// import the function we created before
const {startBrowser} = require("./utils/browser-utils");

// import waitForTask and scrapeTask we defined before
const {waitForTask, scrapeTask} = require("./utils/page-utils");



async function scraper(url, rules, options = {}) {
    let browser;

    const {
        headless = true,
        gotoOptions = {
            timeout: null,
            waitUntil: 'domcontentloaded'
        },
        waitFor, // the selector to wait for
        waitForTimeout // waitFor timeout
    } = options;

    try {
        debug('Starting');
        browser = await startBrowser({
            headless,
        });

        const page = await browser.newPage();

        debug(`Navigate to URL ${url}`);
        await page.goto(url, {...gotoOptions});

        // only if waitFor is specified
        if (waitFor) {
            debug(`Wait for ${waitFor}`)
            await waitForTask(page, waitFor, waitForTimeout);
        }

        debug(`Start scraping`);

        // Scrape data using the DOM API
        return await scrapeTask(page, rules);


    } catch (e) {
        // handle error
        debug(`Error ${e.message}`);
        return null;
    } finally {

        // close browser
        if (browser) {
            await browser.close();
        }
    }
}

module.exports = {
    scraper,
}
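
As a quick refresher, a call to this function might look like the following. The URL and the shape of the rules object below are purely illustrative; the actual rule format is the one defined in the previous article:

// hypothetical usage of simple-scraper.js
const {scraper} = require('./simple-scraper');

(async () => {
    // illustrative rules object; adapt it to the rule format from the previous article
    const rules = {
        title: 'h1'
    };

    const data = await scraper('https://example.com', rules, {
        waitFor: 'h1',         // wait for the main heading before scraping
        waitForTimeout: 10000  // give up waiting after 10 seconds
    });

    console.log(data);
})();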

HTTP Authentication

Puppeteer provides the .authenticate() method on the page object to perform HTTP authentication. The method takes an object with the username and password properties.

Let's add this as an option to our scraper function:

// simple-scraper.js

// ... code

async function scraper(url, rules, options = {}) {
    let browser;

    const {
        headless = true,
        authenticate = null, // auth object with username and password properties
        gotoOptions = {
            timeout: null,
            waitUntil: 'domcontentloaded'
        },
        waitFor, // the selector to wait for
        waitForTimeout // waitFor timeout
      
    } = options;
    
    
    // ...code
    
    const page = await browser.newPage();

    // apply http authenticate if provided
    if (authenticate) {
        await page.authenticate(authenticate);
    }
    
    // ...code
    
}

module.exports = {
    scraper,
}
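
To use it, callers pass the credentials through the new authenticate option. The URL and credentials below are placeholders for a page protected by HTTP basic authentication; the snippet assumes it runs inside an async function with scraper required as above:

// hypothetical usage: scraping a page behind HTTP basic auth
const data = await scraper('https://example.com/protected', rules, {
    authenticate: {
        username: 'user',   // placeholder username
        password: 'secret'  // placeholder password
    }
});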

Adding Extra Headers

To add extra headers, Puppeteer provides the .setExtraHTTPHeaders() method on the page object. The method takes an object of type Record&lt;string, string&gt; (a plain object mapping header names to string values):

// simple-scraper.js

// ... code

async function scraper(url, rules, options = {}) {
    let browser;

    const {
        headless = true,
        authenticate = null, // auth object with username and password properties
        headers = null,
        gotoOptions = {
            timeout: null,
            waitUntil: 'domcontentloaded'
        },
        waitFor, // the selector to wait for
        waitForTimeout // waitFor timeout
      
    } = options;
    
    
    // ...code
    
    const page = await browser.newPage();

    // apply http authenticate if provided
    if (authenticate) {
        await page.authenticate(authenticate);
    }
    
    // add extra headers
    if (headers) {
        await page.setExtraHTTPHeaders(headers);
    }
    
    // ...code
    
}

module.exports = {
    scraper,
}
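
With the headers option in place, callers can attach extra headers to every request the page makes. The header names and values below are only examples, and the snippet again assumes an async context with scraper required as above:

// hypothetical usage: sending extra HTTP headers with every request
const data = await scraper('https://example.com', rules, {
    headers: {
        'Accept-Language': 'en-US,en;q=0.9',  // example header
        'X-Custom-Header': 'simple-scraper'   // example custom header
    }
});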

Changing User Agent

To change the user agent sent with your requests, you can use Puppeteer's .setUserAgent() method on the page object. This method takes a string argument representing the desired user agent.

// simple-scraper.js

// ... code

async function scraper(url, rules, options = {}) {
    let browser;

    const {
        headless = true,
        authenticate = null, // auth object with username and password properties
        headers = null,
        userAgent = null,
        gotoOptions = {
            timeout: null,
            waitUntil: 'domcontentloaded'
        },
        waitFor, // the selector to wait for
        waitForTimeout // waitFor timeout
      
    } = options;
    
    //...code 
    
    // add extra headers
    if (headers) {
        await page.setExtraHTTPHeaders(headers);
    }
    
    if (userAgent) {
        await page.setUserAgent(userAgent);
    }
    
    // ...code
    
}

module.exports = {
    scraper,
}
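
Here is how the option might be used, with an example desktop Chrome user agent string (the snippet again assumes an async context and the scraper import from above):

// hypothetical usage: overriding the default user agent
const data = await scraper('https://example.com', rules, {
    userAgent: 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'
});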

Adding Cookies

To add cookies to a page, Puppeteer provides the .setCookie() method on the page object. This method accepts one or more objects describing cookie properties such as name, value, url, domain, path, and expires.

An example of a cookie:

const cookie = {
    name: 'sessid',
    value: '233444422',
    url: 'https://ujeebu.com',
    domain: 'ujeebu.com',
    path: '/',
    expires: Math.floor(Date.now() / 1000) + 3600 * 24 // cookie expiration date (Unix time in seconds), here 24 hours from now
};

Let's wire this into our scraper as a cookies option:

// simple-scraper.js

// ... code

async function scraper(url, rules, options = {}) {
    let browser;

    const {
        headless = true,
        authenticate = null, // auth object with username and password properties
        headers = null,
        userAgent = null,
        cookies = [],
        gotoOptions = {
            timeout: null,
            waitUntil: 'domcontentloaded'
        },
        waitFor, // the selector to wait for
        waitForTimeout // waitFor timeout
      
    } = options;
    
    
    // ...code
    
    if (userAgent) {
        await page.setUserAgent(userAgent);
    }
    
    // add  cookies
    if (cookies.length) {
        await page.setCookie(...cookies);
    }
    
    // ...code
    
}

module.exports = {
    scraper,
}
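
Callers can now pass one or more cookies through the cookies option. The cookie below mirrors the example above and is purely illustrative; as before, the snippet assumes an async context:

// hypothetical usage: setting a session cookie before navigation
const data = await scraper('https://ujeebu.com', rules, {
    cookies: [
        {
            name: 'sessid',
            value: '233444422',
            url: 'https://ujeebu.com',
            domain: 'ujeebu.com',
            path: '/'
        }
    ]
});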

Blocking Resource Requests

In Puppeteer, we can intercept every request made by the page using the setRequestInterception() method of the page object. With request interception activated, we can:

  • abort requests for resources (images, CSS, fonts or other media) to speed up the scraper and to save bandwidth,
  • abort requests based on a specific pattern,
  • provide custom responses to the browser.

A simple code example of blocking CSS files and images:


// activate request interception
await page.setRequestInterception(true);

// listen to request
page.on('request', (req) => {
    const resourceType = req.resourceType();
    
    // abort the request if the resource type is stylesheet or image
    if (resourceType === 'stylesheet' || resourceType === 'image') {
      return req.abort();
    }
    
   
    req.continue();
  
});

We can also block requests based on filename patterns:


// pattern for font files ending with one of these extensions: (eot, otf, ttf, woff, woff2)
const pattern = /\.(eot|otf|ttf|woff|woff2)$/;

// activate request interception
await page.setRequestInterception(true);

page.on('request', (req) => {

    // abort the request if its URL matches our fonts pattern
    if (req.url().match(pattern)) {
        return req.abort();
    }

    return req.continue();
});

We can also provide a custom response, for example to simulate an error:

// activate request interception
await page.setRequestInterception(true);

page.on('request', (req) => {

    if (req.url().endsWith('data.json')) {
        // respond with a custom HTTP status code and custom JSON body
        return req.respond({
            status: 400,
            contentType: 'application/json',
            body: JSON.stringify({ error: 'Invalid data' })
        });
    }

    return req.continue();
});

Let's add a function to our scraper utilities that handles all three cases and makes them configurable:

// ./utils/page-utils.js 

const debug = require('debug')('simple-scraper:page');
let scrapeFunctions = require("./scrape-utils");

// waitForTask function code ...
// scrapeTask function code ...


/**
 *
 * @param {import("puppeteer").Page} page
 * @param {{patterns: string[], resourceTypes: string[], interceptors: {pattern: string; response: {status: number; contentType: string; body: string};}[]}} rejections - URL patterns and resource types to abort, plus interceptors that return custom responses
 */
async function interceptTask(page, rejections = {}) {
    const {
        patterns = [],
        resourceTypes = [],
        interceptors = []
    } = rejections;

    debug(`Activate request Interception`);
    await page.setRequestInterception(true);
    page.on('request', (req) => {

        const reqURL = req.url();
        const reqResourceType = req.resourceType();

        if (
            // abort if one of the given patterns matches the request's URL
            !!patterns.find((pattern) => reqURL.match(pattern)) ||

            // or if one of the given resource types matches the request's resourceType
            resourceTypes.includes(reqResourceType)
        ) {
            return req.abort();
        }
        
        // respond with custom response if one of the given interceptors matches our request's URL
        const interceptor = interceptors.find((reqInter) => reqURL.match(reqInter.pattern));
        if (interceptor) {
            return req.respond(interceptor.response);
        }
        return req.continue();
    });
}

module.exports = {
    waitForTask,
    scrapeTask,
    interceptTask
}

Now let's integrate the function in our main scraper function:

// simple-scraper.js

const debug = require('debug')('simple-scraper');

// import the function we created before
const {startBrowser} = require("./utils/browser-utils");

// import waitForTask, interceptTask and scrapeTask we defined before
const {waitForTask, scrapeTask, interceptTask} = require("./utils/page-utils");

// ... code

async function scraper(url, rules, options = {}) {
    let browser;

    const {
        headless = true,
        authenticate = null, // auth object with username and password properties
        headers = null,
        userAgent = null,
        cookies = [],
        gotoOptions = {
            timeout: null,
            waitUntil: 'domcontentloaded'
        },
        rejections = {
            patterns: [], // list of request URL patterns to abort
            resourceTypes: [], // list of resource types to abort
            interceptors: [], // list of request interceptors to apply ({pattern: string, response: {status: number, contentType: string, body: string}})
        },
        waitFor, // the selector to wait for
        waitForTimeout // waitFor timeout
      
    } = options;
    
    
    // ...code
    
       const page = await browser.newPage();

        // apply http authenticate if provided
        if (authenticate) {
            await page.authenticate(authenticate);
        }

        // add extra headers
        if (headers) {
            await page.setExtraHTTPHeaders(headers);
        }

        // reject resources or patterns
        // or intercept and change responses
        if (
            (rejections.patterns || []).length ||
            (rejections.resourceTypes || []).length ||
            (rejections.interceptors || []).length
        ) {
            await interceptTask(page, {...rejections});
        }
    
    // ...code
    
}

module.exports = {
    scraper,
}
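
Putting it all together, here is a sketch of how the rejections option might be used to block images and stylesheets, abort font files by URL pattern, and stub out a JSON endpoint. The URL, patterns, and response are illustrative, and the snippet assumes an async context with scraper required as above:

// hypothetical usage: blocking resources and stubbing a response
const data = await scraper('https://example.com', rules, {
    rejections: {
        resourceTypes: ['image', 'stylesheet'],      // abort images and CSS
        patterns: ['\\.(eot|otf|ttf|woff|woff2)$'],  // abort font files by URL pattern
        interceptors: [
            {
                pattern: 'data\\.json$',
                response: {
                    status: 400,
                    contentType: 'application/json',
                    body: JSON.stringify({ error: 'Invalid data' })
                }
            }
        ]
    }
});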

Conclusion

In this article we further explored Puppeteer's capabilities by making use of functionality that is often necessary when building scrapers, then integrated it within our simple scraper that we defined in an earlier post.

While Puppeteer is an excellent tool that helps developers handle web page rendering with all its intricacies, using it at scale for data extraction purposes comes with its own set of challenges: memory management, containerization, and proxy rotation, among other things.

At Ujeebu we devised a Scraping API that can scale to thousands of requests per second, while handling blocks and anti-scraping mechanisms. Give it a try here. The first 5000 credits (approx. 1000 requests) are on us, and no credit card is required.