Scraping a Wikipedia Page with Node.js Using Puppeteer
Scrape a Web Page with Node.js and Export the Scraped Data to a CSV File
Web scraping is the automated extraction of data from websites. It is commonly used for price monitoring, market research, news monitoring, and similar tasks.
In this article, we’ll be scraping a table within a Wikipedia page and generating a CSV file containing that data.
I like mangoes, and I want to keep a record of all the different kinds of mangoes in a CSV file. We’ll be using Puppeteer to scrape the wiki page.
Puppeteer is a Node library that provides a high-level API to control Chrome or Chromium over the DevTools Protocol.
Scraping a Wikipedia Page with Node.js:
Make sure you have Node.js and npm installed.
Create a folder, move into it, and initialize npm:
mkdir wiki-scraper
cd wiki-scraper
npm init -y
Install Puppeteer:
npm i puppeteer
Create index.js:
touch index.js
Open the folder in a code editor of your choice.
Getting started with Web Scraping with Node.js using Puppeteer:
In index.js, start with a basic Puppeteer setup that does the following things in order:
- Create a browser instance
- Create/Open a new tab
- Navigate to the URL
- Capture a screenshot of the page
- Close the browser.
Running the script saves a screenshot of the page at the URL we provided.
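As a rough sketch of the five steps above, the setup could look like the following. The Wikipedia URL, the screenshot file name, and the helper name captureScreenshot are assumptions made for illustration and are not taken from the original code. The Puppeteer module is passed in as a parameter only so the steps can be exercised without launching a real browser.

```javascript
// Assumed target page and output file; swap in your own values.
const PAGE_URL = 'https://en.wikipedia.org/wiki/List_of_mango_cultivars';
const SCREENSHOT_PATH = 'screenshot.png';

// Hypothetical helper: `puppeteer` is the module returned by require('puppeteer').
async function captureScreenshot(puppeteer, url, screenshotPath) {
  const browser = await puppeteer.launch(); // create a browser instance
  try {
    const page = await browser.newPage(); // create/open a new tab
    await page.goto(url); // navigate to the URL
    await page.screenshot({ path: screenshotPath }); // capture a screenshot
  } finally {
    await browser.close(); // close the browser, even if a step above failed
  }
}

// Usage in index.js (needs puppeteer installed and network access):
// captureScreenshot(require('puppeteer'), PAGE_URL, SCREENSHOT_PATH);
```

Wrapping the steps in try/finally is a design choice rather than a requirement: it keeps a stray Chromium process from lingering if navigation fails.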
Scraping the mango data off of the Wikipedia page
For scraping the table of mangoes, we’ll find the CSS selectors for that specific table.
To do that, open the browser's developer tools (Inspect) and locate the table element.
Then, in index.js, we are:
- Waiting for the selector table.wikitable to be loaded.
- Creating a 2D array consisting of all the rows and columns.
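One way to sketch these two steps is shown below. The helper name scrapeTable is hypothetical, and it assumes it receives a Puppeteer page that has already navigated to the article. It relies on page.waitForSelector and page.$$eval, which runs its callback inside the browser against every element matching the selector.

```javascript
// Hypothetical helper: `page` is a Puppeteer Page already pointed at the article.
async function scrapeTable(page, selector = 'table.wikitable') {
  // Wait until a matching table is present in the DOM.
  await page.waitForSelector(selector);

  // The callback runs in the browser. Each <tr> becomes an array holding
  // the trimmed text of its header (<th>) and data (<td>) cells.
  // Note: rows from every table matching the selector are included.
  return page.$$eval(`${selector} tr`, (rows) =>
    rows.map((row) =>
      Array.from(row.querySelectorAll('th, td'), (cell) => cell.innerText.trim())
    )
  );
}

// Usage, inside an async function after page.goto(...):
// const tableData = await scrapeTable(page);
```

The result is an array of rows, where each row is an array of cell strings. The first row usually holds the table's header cells.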
Now that we have the table data, let’s convert the 2D array into a CSV file.
Generating CSV file with the scraped data
There are many conventions for a CSV file, but we’ll be using the following:
- A comma (,) represents a new column
- A new line (\n) represents a new row
- And double quotes (" ") represent a cell
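The three rules above can be sketched as a pair of small functions. The names toCsvCell and toCsvRow are made up for this example. Doubling any double quote found inside a cell is an extra assumption: it is the usual CSV escaping convention, but it is not one of the rules listed above.

```javascript
// Wrap one cell in double quotes. Embedded quotes are doubled so that
// a value such as: say "hi"  stays a single, valid cell.
function toCsvCell(value) {
  return `"${String(value).replace(/"/g, '""')}"`;
}

// Join the quoted cells with commas and end the row with a new line.
function toCsvRow(cells) {
  return cells.map(toCsvCell).join(',') + '\n';
}

// Sample data for illustration only.
const sampleRow = toCsvRow(['Alphonso', 'India']);
// sampleRow is the string "Alphonso","India" followed by a new line.
```

Quoting every cell means commas inside a cell, which are common in Wikipedia notes columns, do not get mistaken for column separators.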
With this, in index.js, we are:
- Creating the first row of the CSV file.
- Creating a file and writing the first row into the file.
- Iterating over the content and generating a string according to the rules mentioned above.
- After the iteration, appending the generated string to the CSV file.
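A minimal sketch of these four steps, using only Node's built-in fs module, might look like this. The function name writeCsv, the file name, and the column names in the usage comment are assumptions for illustration and may differ from the original code. The real header depends on the table you scraped.

```javascript
const fs = require('fs');

// Hypothetical helper: writes a header row, then appends all data rows.
function writeCsv(filePath, header, rows) {
  // Quote a cell and double any embedded quotes (assumed escaping rule).
  const quote = (cell) => `"${String(cell).replace(/"/g, '""')}"`;

  // 1 and 2. Build the first row, then create the file with it.
  const firstRow = header.map(quote).join(',') + '\n';
  fs.writeFileSync(filePath, firstRow);

  // 3. Iterate over the content and build one string for all rows.
  let body = '';
  for (const row of rows) {
    body += row.map(quote).join(',') + '\n';
  }

  // 4. After the iteration, append the generated string to the file.
  fs.appendFileSync(filePath, body);
}

// Usage with made-up column names, where tableData is the scraped 2D array:
// writeCsv('mangoes.csv', ['Name', 'Origin', 'Notes'], tableData.slice(1));
```

Building the whole body in memory and appending once keeps file access to two calls, which is fine for a table of this size. For very large tables, a write stream would be a reasonable alternative.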
And thus, we have the types of mangoes in a CSV file!
The code in this article is available on GitHub.