Web scraping is a powerful technique for programmatically gathering data from websites. It’s particularly useful for extracting structured data from web pages, such as tables, lists, and paragraphs. In this article, we’ll explore how to scrape a table from Wikipedia using Node.js, focusing on the “Wikipedia:About” page as our example. We’ll use two popular Node.js libraries: axios for making HTTP requests and cheerio for parsing HTML and traversing the DOM.

Prerequisites

Before we start, ensure you have Node.js installed on your machine. You’ll also need npm (Node Package Manager), which comes with Node.js, to install the required libraries.
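You can confirm both are available from a terminal; any reasonably recent Node.js release (and the npm bundled with it) should work for this tutorial:

node -v
npm -v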

Step 1: Setting Up Your Project

Create a new directory for your project and initialize a new Node.js project by running npm init in your terminal. You can accept the default configurations for simplicity. Next, install axios and cheerio by executing:

npm install axios cheerio

Step 2: Writing the Scraping Script

Let’s break down the script into several key operations:

  • Fetching the Web Page: We use axios to make a GET request to the URL of the Wikipedia page we want to scrape. axios returns a promise that resolves to a response object whose data property contains the page's HTML.
  • Loading the HTML: With cheerio, we load the HTML content fetched by axios, which allows us to use a syntax similar to jQuery for traversing and manipulating the DOM.
  • Extracting the Table Data: We identify the table we want to scrape using a selector (e.g., .nowraplinks). We then loop through each row (tr), read its header cell (th) and data cell (td), and pair them as key-value objects in a structured array. A short sanity check for this selector follows this list.
  • Logging the Data: Finally, we log the structured data to the console, allowing us to see the scraped information.
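As a quick sanity check before assembling the full script, you can confirm that the selector actually matches something on the page. The minimal sketch below (using the same URL and the .nowraplinks class as the script that follows) fetches the page, loads it into cheerio, and reports how many matching elements and rows it finds:

const axios = require("axios");
const cheerio = require("cheerio");

axios.get("https://en.wikipedia.org/wiki/Wikipedia:About")
  .then(response => {
    // Load the fetched HTML and test the selector
    const $ = cheerio.load(response.data);
    const tables = $(".nowraplinks");
    console.log(`Matched ${tables.length} element(s) with class "nowraplinks"`);
    console.log(`Rows found: ${tables.find("tr").length}`);
  })
  .catch(error => {
    console.error("Error fetching data:", error);
  });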

Here’s the complete script:

const axios = require("axios");
const cheerio = require("cheerio");

const url = "https://en.wikipedia.org/wiki/Wikipedia:About";

axios.get(url)
  .then(response => {
    const html = response.data;
    const $ = cheerio.load(html);

    // Find the table(s) with the specified class
    const targetTable = $(".nowraplinks");

    // If the target table is found, parse its content
    if (targetTable.length > 0) {
      const data = [];
      const trElements = targetTable.find("tr");

      trElements.each((index, tr) => {
        const th = $(tr).find("th").text().trim();
        const td = $(tr).find("td").text().trim();

        // Only add the row if both the header and the cell have text
        if (th && td) {
          data.push({ [th]: td });
        }
      });

      // Split the data array at the row whose header is "Overview(outline)".
      // cheerio's .text() concatenates nested elements, so the header text
      // contains no space between "Overview" and "(outline)".
      const splitIndex = data.findIndex(obj => Object.keys(obj)[0] === "Overview(outline)");

      // If the marker row is not found, keep everything in the first table
      const firstTableData = splitIndex === -1 ? data : data.slice(0, splitIndex);
      const secondTableData = splitIndex === -1 ? [] : data.slice(splitIndex);

      console.log("First Table Data:", firstTableData);
      console.log("Second Table Data:", secondTableData);
    } else {
      console.log("No table with the specified class found.");
    }
  })
  .catch(error => {
    console.error("Error fetching data:", error);
  });
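
If you prefer async/await to a promise chain, the same logic can be structured as follows (an equivalent sketch of the fetching and row-extraction steps, without the table split):

const axios = require("axios");
const cheerio = require("cheerio");

async function scrape() {
  try {
    // Fetch the page and load its HTML into cheerio
    const { data: html } = await axios.get("https://en.wikipedia.org/wiki/Wikipedia:About");
    const $ = cheerio.load(html);

    // Collect each row's header/cell pair, as in the script above
    const rows = [];
    $(".nowraplinks tr").each((index, tr) => {
      const th = $(tr).find("th").text().trim();
      const td = $(tr).find("td").text().trim();
      if (th && td) rows.push({ [th]: td });
    });

    console.log(rows);
  } catch (error) {
    console.error("Error fetching data:", error);
  }
}

scrape();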

Step 3: Running Your Script

Save your script as scrape.js (or another name of your choice) in your project directory. Run the script with Node.js by executing:

node scrape.js

The terminal will output the structured data from the table, showing you the results of your scraping endeavor.

Example output:

First Table Data: [
  {
    'About Wikipedia': "Readers' index to Wikipedia\n" +
      'Statistics\n' +
      'Administration\n' +
      'FAQs\n' +
      'Purpose\n' +
      'Who writes Wikipedia?\n' +
      'Organization\n' +
      'Censorship\n' +
      'In brief\n' +
      'General disclaimer'
  },
  {
    "Readers' FAQ": 'Student help\n' +
      'Navigation\n' +
      'Searching\n' +
      'Viewing media\n' +
      'Help\n' +
      'Mobile access\n' +
      'Parental advice\n' +
      'Other languages\n' +
      'Researching with Wikipedia\n' +
      'Citing Wikipedia\n' +
      'Copyright'
  },
  {
    'Introductionsto contributing': 'Main introduction\n' +
      'List of tutorials and introductions\n' +
      'The answer\n' +
      "Dos and don'ts\n" +
      'Learning the ropes\n' +
      'Common mistakes\n' +
      'Newcomer primer\n' +
      'Simplified ruleset\n' +
      'The "Missing Manual"\n' +
      'Your first article\n' +
      'Wizard\n' +
      'Young Wikipedians\n' +
      'The Wikipedia Adventure\n' +
      'Accounts\n' +
      'Why create an account?\n' +
      'Logging in\n' +
      'Email confirmation\n' +
      'Editing\n' +
      'Toolbar\n' +
      'Conflict\n' +
      'VisualEditor\n' +
      'User guide'
  },
  {
    'Pillars, policies and guidelines': 'Five pillars\n' +
      'Manual of Style\n' +
      'Simplified\n' +
      'Etiquette\n' +
      'Expectations\n' +
      'Oversight\n' +
      'Principles\n' +
      'Ignore all rules\n' +
      'The rules are principles\n' +
      'Core content policies\n' +
      'Policies and guidelines\n' +
      'Vandalism\n' +
      'Appealing blocks\n' +
      'What Wikipedia is not'
  },
  ...
]

Second Table Data: [
  {
    'Overview(outline)': 'Censorship\n' +
      'Conflict-of-interest editing\n' +
      'political editing incidents\n' +
      'Criticism\n' +
      'Biases\n' +
      'gender\n' +
      'geographical\n' +
      'ideological\n' +
      'racial\n' +
      'Deletion of articles\n' +
      'deletionism and inclusionism\n' +
      'notability\n' +
      '"Ignore all rules"\n' +
      'MediaWiki\n' +
      'Plagiarism\n' +
      "Predictions of the project's end\n" +
      'Reliability\n' +
      'Fact-checking\n' +
      'Citation needed\n' +
      'Vandalism'
  },
  {
    'Community(Wikipedians)EventsWiki LovesPeople(list)': 'Administrators\n' +
      'AfroCrowd\n' +
      'Arbitration Committee\n' +
      'Art+Feminism\n' +
      'Bots\n' +
      'Lsjbot\n' +
      'Edit count\n' +
      'List of Wikipedias\n' +
      'The Signpost\n' +
      'Wikimedian of the Year\n' +
      'Wikipedian in residence\n' +
      'WikiProject\n' +
      'Women in Red\n' +
      'Events\n' +
      'Edit-a-thon\n' +
      'WikiConference India\n' +
      'Wiki Indaba\n' +
      'WikiConference North America\n' +
      'Wikimania\n' +
      'Wiki Loves\n' +
      'Earth\n' +
      'Folklore\n' +
      'Monuments\n' +
      'Pride\n' +
      'Science\n' +
      'People(list)\n' +
      "Esra'a Al Shafei\n" +
      'Florence Devouard\n' +
      'Sue Gardner\n' +
      'James Heilman\n' +
      'Maryana Iskander\n' +
      'Dariusz Jemielniak\n' +
      'Rebecca MacKinnon\n' +
      'Katherine Maher\n' +
      'Magnus Manske\n' +
      'Ira Brad Matetsky\n' +
      'Erik Möller\n' +
      'Jason Moore\n' +
      'Raju Narisetti\n' +
      'Steven Pruitt\n' +
      'Annie Rauwerda\n' +
      'Larry Sanger\n' +
      'María Sefidari\n' +
      'Lisa Seitz-Gruwell\n' +
      'Rosie Stephenson-Goodknight\n' +
      'Lila Tretikov\n' +
      'Jimmy Wales\n' +
      '\n' +
      '\n' +
      'Edit-a-thon\n' +
      'WikiConference India\n' +
      'Wiki Indaba\n' +
      'WikiConference North America\n' +
      'Wikimania\n' +
      '\n' +
      'Earth\n' +
      'Folklore\n' +
      'Monuments\n' +
      'Pride\n' +
      'Science\n' +
      '\n' +
      "Esra'a Al Shafei\n" +
      'Florence Devouard\n' +
      'Sue Gardner\n' +
      'James Heilman\n' +
      'Maryana Iskander\n' +
      'Dariusz Jemielniak\n' +
      'Rebecca MacKinnon\n' +
      'Katherine Maher\n' +
      'Magnus Manske\n' +
      'Ira Brad Matetsky\n' +
      'Erik Möller\n' +
      'Jason Moore\n' +
      'Raju Narisetti\n' +
      'Steven Pruitt\n' +
      'Annie Rauwerda\n' +
      'Larry Sanger\n' +
      'María Sefidari\n' +
      'Lisa Seitz-Gruwell\n' +
      'Rosie Stephenson-Goodknight\n' +
      'Lila Tretikov\n' +
      'Jimmy Wales'
  },
  ...
]

Important Considerations

When scraping websites, it's crucial to respect the site's robots.txt rules and terms of service. Always ensure your scraping activities are ethical and do not overload the site or interfere with other users' access. Furthermore, web pages can change over time, so you might need to update your selectors or parsing logic to accommodate those changes.
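
Neither axios nor cheerio reads robots.txt for you, so reviewing it is a separate, manual step. As a rough sketch only, the snippet below downloads Wikipedia's robots.txt and prints its Disallow rules; a real check would also match those rules against your user agent and the specific paths you plan to request:

const axios = require("axios");

axios.get("https://en.wikipedia.org/robots.txt")
  .then(response => {
    // Show only the Disallow rules for a quick overview
    const disallowed = response.data
      .split("\n")
      .filter(line => line.startsWith("Disallow:"));
    console.log(disallowed.join("\n"));
  })
  .catch(error => {
    console.error("Could not fetch robots.txt:", error);
  });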

Conclusion

Web scraping with Node.js is a powerful approach to data collection and analysis, allowing developers to automate the extraction of information from the web. By understanding the basics of web scraping demonstrated in this tutorial, you can apply similar techniques to a wide range of web scraping tasks, unlocking the potential for data-driven insights and automation in your projects.