Introduction

Web scraping is the process of extracting data from websites. One popular library for web scraping is Puppeteer. Puppeteer is a Node.js library that provides a high-level API to control headless Chrome or Chromium over the DevTools Protocol.

In this article, we will show you how to scrape data from any website using Puppeteer.

Prerequisites

You need Node.js installed on your machine. This tutorial was tested on Node.js v17.9.1 (npm v8.11.0).
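You can check which versions are installed by running:

node -v
npm -v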

Setting up the project

Create a folder for the project as follows:

mkdir puppeteer-simple-scraper
cd puppeteer-simple-scraper

Initialize npm for your project:

npm init

This command creates a package.json file in your project directory. This file contains information about your project, such as its name, version, and dependencies. Fill out the prompts that appear in the terminal, or press Enter to accept the default values; some fields can be left blank.
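If you accept the defaults (or run npm init -y to skip the prompts), the generated file will look roughly like this; the exact values depend on the answers you gave:

{
  "name": "puppeteer-simple-scraper",
  "version": "1.0.0",
  "description": "",
  "main": "index.js",
  "scripts": {
    "test": "echo \"Error: no test specified\" && exit 1"
  },
  "author": "",
  "license": "ISC"
}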

Once the package.json file has been created, you can use npm to install dependencies for your project.

Let's install the Puppeteer and debug libraries:

npm install --save puppeteer debug

This command installs Puppeteer along with a version of Chromium that will work with the installed Puppeteer API. It also installs the debug library, which we will use to print debug information.

Once you have installed Puppeteer, you can start writing your scraping script.

Start the browser

Let's start with a function that launches a headless browser:

vim ./utils/browser-utils.js
// file ./utils/browser-utils.js

const puppeteer = require("puppeteer");
const debug = require('debug')('simple-scraper:browser');

const defaultOptions = {
    headless: true,
    ignoreHTTPSErrors: true
};

async function startBrowser(options){
    options = {
        ...defaultOptions,
        ...options
    };
    let browser;
    try {
        debug(`Launching the browser`);
        browser = await puppeteer.launch({
            ...options
        });
    } catch (err) {
        debug(`Error launching browser: ${err.message}`);
        throw err;
    }
    return browser;
}

module.exports = {
    startBrowser
};

The startBrowser function above launches a new instance of the Chrome browser using Puppeteer. It takes an 'options' parameter, a set of configurable options to apply to the browser. Here we use just two:

  • headless: when set to true, the browser runs in headless mode. This is useful in our case (i.e. when you want to deploy your scraper on a machine that does not have a user interface). If set to false, the browser runs with a user interface and you can watch your script execute.
  • ignoreHTTPSErrors: when set to true, Puppeteer ignores any HTTPS-related errors.

There are many other configurable options that you can find in the official Puppeteer documentation.
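As an illustration only (none of these values are required for this tutorial), here is how a few more of those launch options could be passed to our startBrowser function:

// hypothetical example (inside an async function): extra Puppeteer launch options
const browser = await startBrowser({
    headless: true,
    ignoreHTTPSErrors: true,
    slowMo: 50,                                     // slow every Puppeteer operation down by 50 ms
    defaultViewport: { width: 1280, height: 800 },  // viewport size for new pages
    args: ['--no-sandbox', '--disable-gpu']         // extra Chromium command-line flags
});

Next, let's write the scraper itself.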

vim simple-scraper.js
// file ./simple-scraper.js

const debug = require('debug')('simple-scraper');

// import the function we created before
const {startBrowser} = require("./utils/browser-utils");


async function scraper(url, options = {}) {
    let browser;

    const {
        headless,
        gotoOptions = {
            timeout: null,
            waitUntil: 'domcontentloaded'
        }
    } = options;

    try {
        debug('Starting the browser');
        browser = await startBrowser({
            headless,
        });

        const page = await browser.newPage();

        debug(`Navigate to URL ${url}`);
        await page.goto(url, {...gotoOptions});

        // our page is ready, start scraping...
    } catch (e) {
        // handle error
        debug(`Error ${e.message}`);

    } finally {

        // close browser    
        if (browser) {
            await browser.close();
        }
    }
}

Scraping Data from a Single Page

Let's edit our scraper function to scrape all products from the URL https://ujeebu.com/docs/scrape-me/load-more/.

  • Wait for content

The products on our page are not loaded right away. We will use Puppeteer's page.waitForSelector function to wait for the products list to be ready:

// file ./simple-scraper.js

// ... code before page.goto

    debug(`Navigate to URL ${url}`);
    await page.goto(url, {...gotoOptions});

    debug(`Wait for selector: .products-list`);
    await page.waitForSelector('.products-list', {timeout: 0});
// ... code after page.goto
  • Scrape content

In vanilla JavaScript, the document object has two functions, .querySelector and .querySelectorAll, to query the DOM for elements that match a given CSS selector. The first returns the first matching element and the second returns a NodeList of all elements that match the selector. In Puppeteer, we have the functions .$ and .$$, which are respectively similar to .querySelector and .querySelectorAll.
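As a quick illustration of .$ on its own (the 'h1' selector here is only an example and is not needed for our scraper):

// grab the first <h1> element on the page, if there is one
const heading = await page.$('h1');
if (heading) {
    const headingText = await heading.evaluate(node => node.innerText);
    debug(`Page heading: ${headingText}`);
}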

Let's use them to scrape content from the page:

// file ./simple-scraper.js

// ... code before waitForSelector call

    debug(`Wait for selector: .products-list`);
    await page.waitForSelector('.products-list', {timeout: 0});

    // .$ selects only the first matching element, so we use .$$ to select all matching elements
    debug(`Getting product elements: .product-card`);
    const products = await page.$$('.product-card');

    const data = [];
    debug(`Scraping data from ${products.length} elements`);
    for (const product of products) {
        const productData = await product.evaluate(node => {
            return {
                'title': node.querySelector('.title').innerText,
                'price': node.querySelector('.price').innerText,
                'description': node.querySelector('.description').innerText,
                'image': node.querySelector('.card__image>img').getAttribute('src'),
            }
        });
        data.push(productData);
    }

// ... code after waitForSelector call

In the code above, we used .$$ to select all product elements from the page, then looped over each of them and used .evaluate to extract the information we need from each element.

Alternatively, we can merge the two calls .$$ and .evaluate into a single call using .$$eval:

const data = await page.$$eval('.product-card', products => {
    // Extract data from each product card on the page
    return products.map(product => ({
        'title': product.querySelector('.title').innerText,
        'price': product.querySelector('.price').innerText,
        'description': product.querySelector('.description').innerText,
        'image': product.querySelector('.card__image>img').getAttribute('src'),
    }));
});

Our final function looks like this:

// file ./simple-scraper.js

const debug = require('debug')('simple-scraper');

// import the function we created before
const {startBrowser} = require("./utils/browser-utils");


async function scraper(url, options = {}) {
    let browser;

    const {
        headless = true,
        gotoOptions = {
            timeout: null,
            waitUntil: 'domcontentloaded'
        },
    } = options;

    try {
        debug('Starting');
        browser = await startBrowser({
            headless,
        });

        const page = await browser.newPage();

        debug(`Navigate to URL ${url}`);
        await page.goto(url, {...gotoOptions});

        debug(`Wait for selector: .products-list`);
        await page.waitForSelector('.products-list', {timeout: 0});

        // we used .$$ to select all products
        debug(`Getting product elements: .product-card`);
        const products = await page.$$('.product-card');
        const data = [];

        debug(`Scraping product information of ${products.length} products`);
        for (const product of products) {
            const productData = await product.evaluate(node => {
                return {
                    'title': node.querySelector('.title').innerText,
                    'price': node.querySelector('.price').innerText,
                    'description': node.querySelector('.description').innerText,
                    'image': node.querySelector('.card__image>img').getAttribute('src'),
                }
            });
            data.push(productData);
        }

        return data;


    } catch (e) {
        // handle errors
        debug(`Error ${e.message}`);
        return null;
    } finally {

        // close browser
        if (browser) {
            await browser.close();
        }
    }
}

module.exports = {
    scraper,
}

Finally, let's call our scraper function:

// file index.js

const { scraper } = require('./simple-scraper');

(async () => {
    
    // scrape the url
    const data = await scraper('https://ujeebu.com/docs/scrape-me/load-more/', {
        headless: false
    });

    // log the scraped data
    console.log(JSON.stringify(data, null, 2));
})();
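To run the script and see the debug() messages, enable the matching namespaces through the DEBUG environment variable; this is how the debug library decides what to print:

node index.js

DEBUG=simple-scraper* node index.js

The first command prints only the scraped JSON; the second also prints the messages logged under the simple-scraper and simple-scraper:browser namespaces.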

Conclusion

In this post, we showed how Puppeteer can be used to programmatically load a URL in Google Chrome and extract data from it using CSS selectors. In an upcoming post, we will showcase more advanced scraping use cases.

Why Puppeteer? At Ujeebu we use a plethora of tools for scraping and are constantly experimenting with new as well as proven technologies to help our customers achieve their data extraction goals in a cost-effective way.

If you don't have time to deal with headless browsers and libraries such as Puppeteer, and would like to automate your scraping efforts as much as possible, we have an API just for you. Try us out today. The first 5000 credits (approx. 1000 requests) are on us, and no credit card is required.