Scraping a Wikipedia Page with Node.js Using Puppeteer

Scrape a Web Page with Node.js and Export the Scraped Data to a CSV File

Published in

JavaScript in Plain English

4 min read · Mar 6, 2022

Web scraping a wiki page with Node.js using Puppeteer

Web scraping is the automated extraction of data from a website. It is commonly used for price monitoring, market research, news monitoring, and similar tasks.

In this article, we’ll be scraping a table within a Wikipedia page and generating a CSV file containing that data.

I like mangoes, and I want to keep a record of all the different kinds of mangoes in a CSV file. We’ll use Puppeteer to scrape the wiki page.

Puppeteer is a Node library that provides a high-level API to control Chrome or Chromium over the DevTools Protocol.

Scraping a Wikipedia Page with Node.js:

Make sure you have Node.js and npm installed.

Create a folder, move into it, and initialize npm.

mkdir wiki-scraper
cd wiki-scraper
npm init -y

Install Puppeteer.

npm i puppeteer

Create index.js

touch index.js

Open the folder in a code editor of your choice.

Getting started with Web Scraping with Node.js using Puppeteer:

In index.js:

setting up puppeteer — index.js

This is a basic Puppeteer setup. Here, we do the following, in order:

Create a browser instance
Create/Open a new tab
Navigate to the URL
Capture a screenshot of the page
Close the browser.

Running the script saves a screenshot of the page at the URL we provided.
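
The five steps above can be sketched roughly as follows. This is a minimal sketch rather than the article's exact code: the function name captureScreenshot, the default file name screenshot.png, and reading the URL from the command line are assumptions made for illustration. The optional third parameter is also an assumption; it only exists so that a stub can stand in for Puppeteer during testing.

```javascript
// index.js
// Assumes `npm i puppeteer` has already been run in this folder.

// captureScreenshot is a hypothetical helper name, not part of Puppeteer.
async function captureScreenshot(
  url,
  outputPath = 'screenshot.png',
  puppeteer = require('puppeteer') // loaded only if no stub is passed in
) {
  const browser = await puppeteer.launch();        // 1. create a browser instance
  try {
    const page = await browser.newPage();          // 2. open a new tab
    await page.goto(url);                          // 3. navigate to the URL
    await page.screenshot({ path: outputPath });   // 4. capture a screenshot
  } finally {
    await browser.close();                         // 5. close the browser, even on errors
  }
  return outputPath;
}

// Only runs when a URL is given on the command line: node index.js <url>
const targetUrl = process.argv[2];
if (targetUrl) {
  captureScreenshot(targetUrl)
    .then((file) => console.log(`Saved ${file}`))
    .catch((err) => {
      console.error(err);
      process.exitCode = 1;
    });
}
```

Run it with node index.js <url>, passing the address of the wiki page you want to capture.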

captured screenshot with puppeteer

Scraping the mango data off of the Wikipedia page

To scrape the table of mangoes, we first need a CSS selector that matches that specific table.

To find it, open the browser’s developer tools (right-click the table and choose Inspect) and locate the table element.

Inspecting the wiki page of mangoes

Then, in index.js:

scraping the selected table — index.js

Here, we are:

Waiting for the selector table.wikitable to be loaded.
Creating a 2D array consisting of all the rows and columns.
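
A sketch of those two steps, under a few assumptions: the helper names tableToMatrix and scrapeTable are made up for illustration, only the first element matching table.wikitable is read, and the header row is kept as the first entry of the result. The article's own code may differ on each of these points.

```javascript
// Runs in the browser context: converts one <table> element into a 2D array.
// It must not use variables from the Node.js scope, because Puppeteer
// serializes the function before sending it to the page.
function tableToMatrix(table) {
  return Array.from(table.rows, (row) =>
    Array.from(row.cells, (cell) => cell.innerText.trim())
  );
}

// scrapeTable is a hypothetical helper; the third parameter only exists
// so a stub can replace Puppeteer in tests.
async function scrapeTable(
  url,
  selector = 'table.wikitable',
  puppeteer = require('puppeteer')
) {
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    await page.goto(url);
    await page.waitForSelector(selector);             // wait until the table is loaded
    return await page.$eval(selector, tableToMatrix); // rows x columns of cell text
  } finally {
    await browser.close();
  }
}

// Only runs when a URL is given on the command line: node index.js <url>
const targetUrl = process.argv[2];
if (targetUrl) {
  scrapeTable(targetUrl)
    .then((rows) => console.log(rows))
    .catch((err) => {
      console.error(err);
      process.exitCode = 1;
    });
}
```

page.$eval reads a single table. To collect rows from every matching table, page.$$eval with a selector such as table.wikitable tr is an alternative.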

scraping result

Now that we have the table data, let’s convert the 2D array into a CSV file.

Generating CSV file with the scraped data

There are many conventions for a CSV file, but we’ll be using the following:

A comma (,) separates columns
A new line (\n) separates rows
Double quotes (" ") wrap each cell
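
These three rules can be sketched as two small helpers. The names toCsvCell and toCsvRow are hypothetical, and the sample values are only placeholders. Doubling any double quote found inside a cell is an addition that the article does not mention; it follows the common CSV convention so that a quote inside a cell does not end the cell early.

```javascript
// Wrap one cell in double quotes; embedded quotes are doubled ("" stands for ").
function toCsvCell(value) {
  return `"${String(value).replace(/"/g, '""')}"`;
}

// Join the cells of one row with commas and end the row with a new line.
function toCsvRow(cells) {
  return cells.map(toCsvCell).join(',') + '\n';
}

// Sample row with placeholder values.
process.stdout.write(toCsvRow(['Alphonso', 'India']));
// prints: "Alphonso","India"
```

Quoting every cell keeps commas inside a cell from being read as column separators.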

With this, in index.js

creating a CSV file of the scraped content

Here we are:

Creating the first row of the CSV file.
Creating a file and writing the first row into the file.
Iterating over the content and generating a string according to the rules mentioned above.
Appending the generated string to the CSV file after the iteration.
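
The four steps above might look like the sketch below. It uses Node's built-in fs module (writeFileSync and appendFileSync). The helper names writeCsv and toCsvRow are hypothetical, and the article's own code may be organized differently.

```javascript
const fs = require('fs');

// Hypothetical helper: quote each cell, join with commas, end with a new line.
function toCsvRow(cells) {
  return cells
    .map((cell) => `"${String(cell).replace(/"/g, '""')}"`)
    .join(',') + '\n';
}

// Hypothetical helper that follows the four steps described above.
function writeCsv(filePath, header, rows) {
  // 1. Create the first (header) row of the CSV file.
  const headerLine = toCsvRow(header);

  // 2. Create the file and write the header row into it.
  fs.writeFileSync(filePath, headerLine);

  // 3. Iterate over the content and build up the CSV string.
  let body = '';
  for (const row of rows) {
    body += toCsvRow(row);
  }

  // 4. Append the generated string to the file.
  fs.appendFileSync(filePath, body);
}
```

A call could look like writeCsv('mangoes.csv', ['Name', 'Origin'], rows), where the file name and column names are placeholders. If the scraped array already begins with the table's header row, pass that row as the header and the remaining rows as the body.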

And thus, we have the types of mangoes in a CSV file!

The code in this article is available on GitHub.

Written by Parag Mahale

Web developer, freelancer, blogger. My portfolio — s0npaRi11.github.io
