In today's digital age, the internet is an ocean of information. But often, the data we need isn't readily available in a convenient format. This is where web scraping comes in handy.

As a beginner developer, you might be wondering how you can start building your own web scraper. Well, have no fear because I'm here to guide you through the process of building your very first web scraper using Node.js and cheerio.

What Is Web Scraping?

First, let's define what web scraping is. It's the process of automatically extracting data from websites and storing it in a structured format. It's a useful tool for developers to gather information from different sources and use it for various purposes.

Some of its uses include:

  • Market Research: Scraping e-commerce websites to gather pricing information and product details.
  • Competitor Analysis: Extracting data from competitor websites to analyze their strategies and offerings.
  • Content Aggregation: Collecting news articles, blog posts, or social media posts for content aggregation platforms.
  • Lead Generation: Extracting contact information from business directories or social media profiles.
  • Search Engine Indexing: Crawling web pages to index content for search engines like Google.

How To Build Your Own Web Scraper

Now, let's talk about the tools you'll need to build your web scraper. Node.js is a JavaScript runtime that allows you to run JavaScript on the server side. Cheerio is a JavaScript library that parses HTML and lets you traverse and manipulate it, much like jQuery does in the browser, so you can extract the data you need from a page.
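
If you have used jQuery before, Cheerio's API will feel familiar. Here is a minimal sketch of the idea (the HTML string and selectors are invented purely for illustration):

```javascript
const cheerio = require("cheerio");

// A tiny, made-up HTML snippet just to demonstrate the API
const html = `<ul><li class="book">Dune</li><li class="book">Hyperion</li></ul>`;

// Load the HTML, then query it with jQuery-style selectors
const $ = cheerio.load(html);

$("li.book").each((i, el) => {
  console.log($(el).text()); // "Dune", then "Hyperion"
});
```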

Prerequisites

Before we dive into web scraping, ensure you have Node.js installed on your system. You can download and install it from the official Node.js website.

Here's how you can build your first web scraper step-by-step:

Step 1: Install Node.js

Start by installing Node.js on your computer. You can download it from the official website and follow the installation instructions.
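
Once the installation finishes, you can confirm that both Node.js and npm are available from your terminal (the exact version numbers will vary depending on what you installed):

```bash
node -v
npm -v
```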

Step 2: Create a Project Folder

Open your terminal and navigate to your desired directory. Create a new folder for your project.

```bash
mkdir web-scraper
cd web-scraper
```

Step 3: Initialize a New Node.js Project

Run the following command in your terminal to create a package.json file.

```bash
npm init -y
```

Step 4: Install Cheerio

Install the Cheerio library using npm.

```bash
npm install cheerio
```

Step 5: Create scraper.js File

Create a new file named scraper.js in your project folder.

```bash
touch scraper.js
```

Step 6: Require Modules

In your scraper.js file, require the modules you'll need: cheerio and http or https.

```javascript
// scraper.js
const https = require("https"); // or const http = require('http');
const cheerio = require("cheerio");
```
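
As an aside, if your project is set up for ES modules (for example, with "type": "module" in package.json), the equivalent imports would look like this. This sketch assumes a reasonably recent version of Node.js and Cheerio:

```javascript
// ES module equivalent of the require() calls above
import https from "node:https";
import * as cheerio from "cheerio";
```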

Step 7: Make a Request to the Website

Use the https (or http) module to make a GET request to the website you want to scrape. For this example, I will be scraping panmacmillan.com to get the best fantasy books of 2023.

```javascript
// URL of the website to scrape
const url =
  "https://www.panmacmillan.com/blogs/science-fiction-and-fantasy/best-new-fantasy-books";

// Make a GET request to fetch the HTML content of the website
https.get(url, (response) => {
  let data = "";

  // A chunk of data has been received.
  response.on("data", (chunk) => {
    data += chunk;
  });

  // The whole response has been received. Process the data.
  response
    .on("end", () => {
      // We'll process the HTML here in the next step
    })
    .on("error", (error) => {
      console.log("Error fetching data:", error);
    });
});
```
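
One thing this snippet doesn't guard against is a non-200 response, for example if the page has moved and the server replies with a redirect. As a defensive sketch, you could check response.statusCode before collecting data (how you handle failures is up to you):

```javascript
const https = require("https");

const url =
  "https://www.panmacmillan.com/blogs/science-fiction-and-fantasy/best-new-fantasy-books";

https.get(url, (response) => {
  // Bail out early on anything other than a successful response
  if (response.statusCode !== 200) {
    console.log(`Request failed with status code: ${response.statusCode}`);
    response.resume(); // consume the response data to free up memory
    return;
  }

  let data = "";
  response.on("data", (chunk) => (data += chunk));
  response.on("end", () => {
    // Hand the HTML off to Cheerio here, as in the next step
    console.log(`Received ${data.length} characters of HTML`);
  });
});
```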

Step 8: Use Cheerio to Extract the Data

Use the cheerio module to load the HTML from the website and select the elements you want to scrape. I have already inspected the page to find the HTML tags and classes of the elements I wanted. You can also read the Cheerio documentation for more ways to select elements and traverse the DOM.

  • Use the .text() or .attr() method to extract the data from the selected elements.
  • Use the console.log() function to print the data to the console.

```javascript
// The whole response has been received. Process the data.
response.on("end", () => {
  // Load the HTML content into Cheerio
  const $ = cheerio.load(data);

  // Create an array to store title info
  const titles = [];

  // Select the elements you want to scrape
  $("figure").each((i, el) => {
    // Loop through each 'figure' element
    const title = $(el).find("h3 a").text(); // Get each title
    const author = $(el).find("h4 a").text(); // Get each author

    // If both title and author are not empty, add to titles array
    if (title !== "" && author !== "") {
      titles.push({ title, author });
    }
  });

  // Print the scraped data
  titles.forEach((title) => {
    console.log(`${title.title} by ${title.author}`);
  });
});
```
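
The example above only uses .text(), but .attr() works the same way. For instance, assuming each title is wrapped in a link, you could also capture each book's URL (whether the href is relative or absolute depends on the site's markup):

```javascript
// Inside the same "end" handler, after: const $ = cheerio.load(data);
$("figure").each((i, el) => {
  const title = $(el).find("h3 a").text();
  const link = $(el).find("h3 a").attr("href"); // undefined if no link is found

  if (title !== "" && link !== undefined) {
    console.log(`${title} -> ${link}`);
  }
});
```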

Here's an example of what your scraper.js file might look like:

```javascript
// scraper.js
const https = require("https"); // or const http = require('http');
const cheerio = require("cheerio");

// URL of the website to scrape
const url =
  "https://www.panmacmillan.com/blogs/science-fiction-and-fantasy/best-new-fantasy-books";

// Make a GET request to fetch the HTML content of the website
https
  .get(url, (response) => {
    let data = "";

    // A chunk of data has been received.
    response.on("data", (chunk) => {
      data += chunk;
    });

    // The whole response has been received. Process the data.
    response.on("end", () => {
      // Load the HTML content into Cheerio
      const $ = cheerio.load(data);

      // Create an array to store title info
      const titles = [];

      // Select the elements you want to scrape
      $("figure").each((i, el) => {
        // Loop through each 'figure' element
        const title = $(el).find("h3 a").text(); // Get each title
        const author = $(el).find("h4 a").text(); // Get each author

        // If both title and author are not empty, add to titles array
        if (title !== "" && author !== "") {
          titles.push({ title, author });
        }
      });

      // Print the scraped data
      titles.forEach((title) => {
        console.log(`${title.title} by ${title.author}`);
      });
    });
  })
  .on("error", (error) => {
    console.log("Error fetching data:", error);
  });
```
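
Save the file and run it from your project folder:

```bash
node scraper.js
```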

And here's the result in my console:

[Screenshot of terminal output]

With this, you have built your first web scraper! You can now use this as a starting point to scrape more complex websites and extract more data. However, it is also essential to understand the legal implications and respect the terms of service of the websites you scrape.
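
For example, many sites publish a robots.txt file describing which paths automated clients may access. As a simplified sketch, you could fetch and read it before scraping (this only prints the file; parsing and honoring its rules is up to you):

```javascript
const https = require("https");

// Fetch and print a site's robots.txt (simplified sketch)
https.get("https://www.panmacmillan.com/robots.txt", (response) => {
  let data = "";
  response.on("data", (chunk) => (data += chunk));
  response.on("end", () => console.log(data));
});
```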

Of course, web scraping can get complicated, but as you continue to learn and practice, you'll be able to handle more complex scraping scenarios.

Happy scraping!