Introduction

With the increasing adoption of client-side frameworks, scraping sites to extract as much information as possible often requires executing JavaScript. In our experience at Ujeebu, close to 5% of the blogs and news sites we extract articles from require JavaScript execution. For other types of pages, this percentage can be significantly higher.

There is a variety of packages out there that can be leveraged for this: headless Chrome driven by Puppeteer, Splash (Python), and HtmlUnit (Java), among others.

In this article, we will focus on Puppeteer and present a quick tutorial on how to use it for scraping, page scrolling, and ad blocking.

Installation and integration

1. What is Puppeteer?

Puppeteer is a Node.js library developed by the Google Chrome team that lets you control an instance of headless Chrome (Chrome without a UI). Some of the things it makes possible include:

  • Scraping website content
  • Generating screenshots and PDFs
  • Automating the filling of forms
  • Performing UI tests
  • Creating automated tests
  • Diagnosing performance issues with the Dev Tools API

2. Installation

To start, let's create a new project directory and use npm to install the Puppeteer library:

mkdir scrape-example
cd scrape-example
npm init
npm install puppeteer

3. Scraping

The following will open Chrome and go to the URL we would like to scrape:

const puppeteer = require("puppeteer")

const run = async () => {
  
  const browser = await puppeteer.launch({ headless: false })
  const page = await browser.newPage()

  await page.goto("https://www.thriftbooks.com/")

  await browser.close()

  return null
}

run().then(value => {
  console.log(value)
})

Next, we perform a search by filling out the search form and clicking the submit button:


  await page.waitForSelector('input.Search-input');

  await page.type('input.Search-input', 'AI');
  // await page.$eval('input.Search-input', el => el.value = 'AI');

  await page.click('.Search-submit>button');
  // await page.keyboard.press('Enter');
  
  await page.waitForSelector('.SearchContentResults-tilesContainer');

After the search results page loads, let's do some scraping. Here we extract the title, author, and price of each book listed on the results page:

  let data = await page.evaluate(() => {
    let results = [];
    let books = document.querySelectorAll('.AllEditionsItem-tile');
    books.forEach((book) => {
      let title = book.querySelector('.AllEditionsItem-tileTitle').innerText;
      let author = book.querySelector('a[itemprop=author]').innerText;
      let price = book.querySelector('.SearchResultListItem-dollarAmount').innerText;
      results.push({ title, author, price });
    });
    return results;
  });

The full code:

const puppeteer = require("puppeteer")

const run = async () => {
  
  const browser = await puppeteer.launch({ headless: false })
  const page = await browser.newPage()

  await page.goto("https://www.thriftbooks.com/")


  await page.waitForSelector('input.Search-input');

  await page.type('input.Search-input', 'AI');
  // await page.$eval('input.Search-input', el => el.value = 'AI');

  await page.click('.Search-submit>button');
  // await page.keyboard.press('Enter');
  
  await page.waitForSelector('.SearchContentResults-tilesContainer');

  let data = await page.evaluate(() => {
    let results = [];
    let books = document.querySelectorAll('.AllEditionsItem-tile');
    books.forEach((book) => {
      let title = book.querySelector('.AllEditionsItem-tileTitle').innerText;
      let author = book.querySelector('a[itemprop=author]').innerText;
      let price = book.querySelector('.SearchResultListItem-dollarAmount').innerText;
      results.push({ title, author, price });
    });
    return results;
  });

  await browser.close()

  return data
}

run().then(value => {
  console.log(value)
})

4. Scrolling a web page

We sometimes need to scroll down a web page to reveal its full content, for example when content is lazy-loaded:

  await page.evaluate(async () => {
    await new Promise((resolve, reject) => {
      try {
        let bodyElement = document.querySelector('body');

        let counter = 50;     // number of scroll steps
        let scrollWait = 100; // delay between steps, in milliseconds
        let scrollStep = 0;
        let scrollingElement = (document.scrollingElement || bodyElement);
        if (scrollingElement) {
          scrollStep = Math.floor(scrollingElement.scrollHeight / counter);
        }

        const interval = setInterval(() => {
          if (scrollStep > 0) {
            window.scrollBy(0, scrollStep);
          }
          counter--;

          if (counter <= 0) {
            clearInterval(interval);
            resolve();
          }
        }, scrollWait);

      } catch (err) {
        reject(err.toString());
      }
    });

  });
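The per-tick distance comes from dividing the page's scrollable height by the number of steps. That arithmetic can be factored into a tiny standalone helper, which makes it easy to check in isolation (the function name and values below are ours, for illustration):

```javascript
// Compute how far to scroll on each tick so that `steps` ticks
// cover the full scrollable height of the page.
function computeScrollStep(scrollHeight, steps) {
  if (!scrollHeight || steps <= 0) return 0;
  return Math.floor(scrollHeight / steps);
}

// Example: a 5000px-tall page scrolled in 50 steps of 100px each.
console.log(computeScrollStep(5000, 50)); // 100
console.log(computeScrollStep(0, 50));    // 0 (nothing to scroll)
```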

5. Blocking Ads

In what follows, we block ads based on the hosts listed in the file:
http://winhelp2002.mvps.org/hosts.txt

We first parse the file:

const fs = require('fs');

const hostFile = fs.readFileSync('hosts.txt', 'utf8').split('\n');
let hosts = {}; // object with ads domains as keys
for (let i = 0; i < hostFile.length; i++) {
  const frags = hostFile[i].split(' ');
  if (frags.length > 1 && frags[0] === '0.0.0.0') {
    hosts[frags[1].trim()] = true;
  }
}
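To sanity-check the parsing logic, here is the same loop run against a few inline lines in the MVPS hosts format (the entries below are made up for illustration; comment lines and lines without a `0.0.0.0` prefix are skipped):

```javascript
// A small sample in the same format as the downloaded hosts file.
const sample = [
  '# comment line',
  '0.0.0.0 ads.example.com',
  '0.0.0.0 tracker.example.net',
  'localhost'
].join('\n');

const lines = sample.split('\n');
let hosts = {}; // object with ad domains as keys
for (let i = 0; i < lines.length; i++) {
  const frags = lines[i].split(' ');
  if (frags.length > 1 && frags[0] === '0.0.0.0') {
    hosts[frags[1].trim()] = true;
  }
}

console.log(hosts['ads.example.com']); // true
console.log(hosts['localhost']);       // undefined
```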

Then we filter out unwanted ad requests when loading a page using Puppeteer:

  await page.setRequestInterception(true);
  page.on('request', request => {
    let domain = null;

    const frags = request.url().split('/');
    if (frags.length > 2) {
      domain = frags[2];
    }

    if (hosts[domain] === true) {
      request.abort();
    } else {
      request.continue();
    }
  });
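Splitting the URL on '/' works for typical request URLs, but the WHATWG URL class built into both Node.js and the browser handles edge cases (ports, credentials in the URL, malformed input) more robustly. A sketch of the same lookup using it (the `isBlocked` helper and the blocklist entry are ours, for illustration):

```javascript
// Decide whether a request URL's host appears in the blocklist.
function isBlocked(url, hosts) {
  try {
    return hosts[new URL(url).hostname] === true;
  } catch (e) {
    return false; // malformed URL: let the request through
  }
}

const hosts = { 'ads.example.com': true }; // hypothetical blocklist entry
console.log(isBlocked('https://ads.example.com/banner.js', hosts)); // true
console.log(isBlocked('https://www.thriftbooks.com/', hosts));      // false
```

Note that `hostname` drops the port, unlike `split('/')[2]`, so blocklist entries match regardless of the port a request uses.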

Conclusion

Rendering JS-heavy pages with Puppeteer is a breeze: it hides the intricacies of running your own headless browser instances. That being said, Puppeteer, and JavaScript execution in general, is memory- and CPU-intensive, so running it at scale requires careful provisioning and monitoring. In an upcoming post we will delve into the scaling issues that come with running multiple instances of Puppeteer, and show how we solved them at Ujeebu.