Scraping a Wikipedia Page with Node.js Using Puppeteer

Scrape a Web Page With Node.js and Export the Scraped Data to a CSV File

Parag Mahale
JavaScript in Plain English


Web scraping a wiki page with Node.js using Puppeteer

Web scraping is a method of extracting data from a website. It is generally used for price monitoring, market research, news monitoring, etc.

In this article, we’ll be scraping a table within a Wikipedia page and generating a CSV file containing that data.

I like mangoes, and I want to keep a record of all the different kinds of mangoes in a CSV file. We'll be using Puppeteer to scrape the wiki page.

Puppeteer is a Node library that provides a high-level API to control Chrome or Chromium over the DevTools Protocol.

Scraping a Wikipedia Page with Node.js:

Make sure you have Node.js and npm installed.

Create a folder and initialize npm.

mkdir wiki-scraper
npm init -y

Install Puppeteer.

cd wiki-scraper
npm i puppeteer

Create index.js.

touch index.js

Open the folder in a code editor of your choice.

Getting started with Web Scraping with Node.js using Puppeteer:

In index.js,

Setting up Puppeteer (index.js)

This is a basic setup for Puppeteer. Here, we are doing the following things in order:

  • Create a browser instance
  • Create/Open a new tab
  • Navigate to the URL
  • Capture a screenshot of the page
  • Close the browser.

It saves a screenshot of the page we provided.

captured screenshot with puppeteer

Scraping the mango data off of the Wikipedia page

For scraping the table of mangoes, we’ll find the CSS selectors for that specific table.

For that, open the Inspect developer tool and look for the table.

Inspecting the wiki page of mangoes

Then, in index.js,

Scraping the selected table (index.js)

Here, we are:

  • Waiting for the selector table.wikitable to be loaded.
  • Creating a 2D array consisting of all the rows and columns.
scraping result

Now that we have the table data, let's convert the 2D array into a CSV file.

Generating CSV file with the scraped data

There are many conventions for a CSV file, but we’ll be using the following:

  1. A comma (,) represents a new column
  2. A new line (\n) represents a new row
  3. Double quotes (" ") represent a cell

With this, in index.js

Creating a CSV file of the scraped content (index.js)

Here we are:

  • Creating the first row of the CSV file.
  • Creating a file and writing the first row into the file.
  • Iterating over the content and generating a string according to the rules mentioned above.
  • After the iteration, appending the generated string to the CSV file.

And thus, we have the types of mangoes in a CSV file!

The code in this article is available on GitHub.
