Building a Scraping Server in Node.js & Deploying to Windows Azure (Beginners)

Hi geeks! Today I'm going to show you how to build a simple scraping server in Node.js and deploy it to Windows Azure. Before we start, some of you may ask, "What is scraping?" Well, building a data-driven application is only fun when you have some data. Suppose you want to get some data from an online news portal and use it in your app. You don't know where the database server is, and even if you were lucky enough to know its address, chances are you wouldn't get permission to access it. Nowadays most application developers expose APIs so that developers like you and me can use them according to our needs. But what if there is no API? That's where scraping comes in.

If you know a little bit of HTML then you know what the DOM (Document Object Model) is. If you don't, here you go: when you write HTML and run it in your browser, the whole HTML document is converted into an object tree, where each HTML element is treated like an object. From the concept of an object, we know that an object has some properties. Likewise, a DOM element has properties like id, style, attributes, CSS class names, etc. If you want, you can manipulate an element in real time in the browser, and that's exactly the kind of thing we are going to do today. If you are still not sure what I'm saying, keep calm and read to the end.
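
To make that concrete, here is a tiny sketch you can try in your browser's console. The element id "headline" is made up for illustration; it is not from the site we are going to scrape:

// Every element in the page is an object with properties you can read and change.
var headline = document.getElementById('headline'); // hypothetical id, just for illustration
console.log(headline.id);          // "headline"
console.log(headline.className);   // whatever CSS classes the element has
headline.style.color = 'red';      // manipulate it in real time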

Let's scrape the Art & Entertainment section of the Bangladeshi news portal "The Daily Star". If you go to http://www.thedailystar.net/, you will find the section shown below,

We are going to scrape the HTML of this section, get data like the section name, title, picture and description from it, and use that data to build an app. So let's start.

Fire up Visual Studio. I've already installed the Node.js Tools for Visual Studio. If you don't have them installed on your machine, please do so. You can download the extension from this link,

https://nodejstools.codeplex.com/

Now quickly create a Basic Azure Node.js Express 4 Application from the JavaScript > Node.js node, give it a name and hit OK.

I chose the Basic Azure Node.js Express 4 Application since I want to deploy my Node.js app to Windows Azure. The template configures your Node.js app for Azure right from the start, so you don't have to deal with configuration issues later.
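
One thing the template takes care of, for example, is listening on whatever port Azure assigns to your app instead of a hard-coded one. This is just a sketch of the idea; the exact lines in your generated project may look a bit different:

// Azure hands your app its port through an environment variable, so the
// startup code listens on process.env.PORT and only falls back to a fixed
// port (here 3000) when running locally.
var express = require('express');
var app = express();

app.set('port', process.env.PORT || 3000);

app.listen(app.get('port'), function () {
    console.log('Express server listening on port ' + app.get('port'));
});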

The newly created project will ask you to install the dependencies. Just press yes. It will install the dependencies in the background.

You can always see what is going on in the Output window. After installation, the dependencies can be found under the npm node. From there you can delete a package or check it for updates. Just right-click on the package and pick what you want from the menu.

In app.js you will see that the app has two routes already configured. Rather than creating another one, I'll rename the users route to entertainment.
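
After the rename, the relevant lines in app.js should look roughly like this (a sketch based on the default Express 4 template; your generated file may differ a bit):

// app.js (relevant lines only)
var express = require('express');
var app = express();

var routes = require('./routes/index');
var entertainment = require('./routes/entertainment'); // was: require('./routes/users')

app.use('/', routes);
app.use('/entertainment', entertainment);              // was: app.use('/users', users)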

Now if you rebuild and run the app you will get the index view, since the empty route [just the / (slash) one] is configured to show the index.jade view in the views folder. Don't worry about the .jade extension. The Express template uses the Jade view engine to render HTML; it is just a cleaner way to write HTML. I'll talk about Jade in another post.
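
For reference, the handler behind that empty route lives in routes/index.js and, as generated by the template, should look roughly like this:

routes/index.js

var express = require('express');
var router = express.Router();

/* GET home page. */
router.get('/', function (req, res) {
    res.render('index', { title: 'Express' });
});

module.exports = router;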

Now if you add entertainment after the slash (/), you will get this,

Instead of rendering a view like the empty route does, here we are sending a raw response to the browser.
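
That raw response comes from the default handler the template put in the (now renamed) route file, which should look roughly like this:

routes/entertainment.js (the renamed users.js)

var express = require('express');
var router = express.Router();

/* GET listing. */
router.get('/', function (req, res) {
    res.send('respond with a resource');
});

module.exports = router;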

Scraping a website can be done with raw JavaScript, or with libraries like jQuery or Cheerio. For the sake of simplicity I'll use Cheerio.

Installing a library, or in other words a package, is very easy. Right-click on the npm node and select "Install New npm Packages". It will load up a window like the one below, where you can search for a specific npm package. We will need two packages: Cheerio, and a package called request for handling HTTP calls.

After installing the packages, they can be accessed under the npm node.

Let's first paste some code into the entertainment.js file, then I'll describe what the code really does. Replace the existing code with the following,

entertainment.js

var express = require('express');
var request = require('request');
var cheerio = require('cheerio');

var router = express.Router();

var url = "http://www.thedailystar.net";

router.get('/', function (req, res) {
    request(url, function (error, response, body) {
        if (!error && response.statusCode === 200) {
            var data = scrapeDataFromHtml(body);
            return res.send(data);
        }
        // Something went wrong: log the error and let the caller know.
        console.log(error);
        res.status(500).send('Could not fetch the page.');
    });
});

var scrapeDataFromHtml = function(html) {
    var data = {};

    // Load the raw HTML into cheerio so we can query it with jQuery-like selectors.
    var $ = cheerio.load(html);

    // Every selector below is scoped to the box with the class name "creem-box".
    var sectionTitle = $('a', '.creem-box').first().text().trim();   // first anchor: section name
    var heading = $('h2 > a', '.creem-box').text().trim();           // anchor inside the h2: article heading
    var imageSource = $('a > img', '.creem-box').attr("src");        // src of the image inside an anchor
    var description = $('.intro', '.creem-box').text().trim();       // intro text, selected by its class name
    var fullNewsLink = $('a', '.creem-box').eq(2).attr("href");      // href of the anchor at index 2 (eq() is zero-based)

    data = {
        sectionTitle: sectionTitle,
        heading : heading,
        imageSource: imageSource,
        description: description,
        fullNewsLink : url + fullNewsLink
    };
    return data;
};

module.exports = router;

First, I have two reference variables for my installed modules, "request" and "cheerio". I'm keeping the address of "The Daily Star" news portal in a url variable. I've removed the existing code inside router.get('/', function (req, res) { ... } and initiated an HTTP GET call to the url with request(url, function (error, response, body) { ... }. The callback function has three parameters: the first is the error parameter, describing any kind of error in the HTTP call; the second is the response of the HTTP call; and the last one is the raw HTML body. If the call succeeds, I pass the returned HTML to another function, scrapeDataFromHtml(); otherwise I log the error and send back an error response. The real work begins here: Cheerio has a load() method which takes raw HTML and converts it into a Cheerio object so that you can manipulate it. Look closely at the HTML of the "Art & Entertainment" section.

For the first step we want the name of the section. Cheerio has a selector function whose syntax is given below

$( selector, [context], [root] )

Here the selector is the HTML element you want to select. [context] is the part of the HTML document where Cheerio should search for the selector. [root] is not that important here. To learn more about Cheerio, go to their GitHub repository at this address, https://github.com/cheeriojs/cheerio
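
To make the context idea concrete, here is a tiny sketch with some made-up HTML (not the actual markup of The Daily Star):

var cheerio = require('cheerio');

// Two anchors in total, but only one of them lives inside .box.
var $ = cheerio.load('<div class="box"><a href="#">Inside</a></div><a href="#">Outside</a>');

console.log($('a').length);          // 2 -- no context, both anchors match
console.log($('a', '.box').length);  // 1 -- with context, only the anchor inside .box matches
console.log($('a', '.box').text());  // "Inside"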

According to the function syntax, to grab the section name I've set the context to the div element with the class name "creem-box". Then we added first() so that only the first anchor tag is selected (there are several anchor tags inside the box, and first() gives us the very first one). Then we take the text inside the anchor tag with text() and trim() it to remove extra white space if there is any. The rest of the code is pretty simple. The article heading can be found in an anchor tag under the h2 tag; in other words, the anchor tag is a direct child of the h2 tag, so we can either attach first() to our selector again or use the syntax 'h2 > a', which means "select the child anchor tag of the h2 tag". For the image, the source can be found in the src attribute, so we attached attr('src'). The description text is inside a p tag, but instead of selecting the tag itself I chose to select it by its class name. The link to the full article can be found in a later anchor tag, so we've attached eq(2) to the selector (eq() is zero-indexed, so eq(2) picks the anchor at index 2, i.e. the third match) and took the href attribute. I'm pointing out the data we scrape out of the document below.

Since we built an object from all of that data and send the raw object as the response for the entertainment route, if you run the app and go to the entertainment route you will get the object as JSON in your browser (some browsers will download it as a .json file).
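
The shape of that JSON follows the data object we built above. The values below are placeholders, since the real text and links depend on whatever article is in the section when you run the app:

// Hypothetical example of the response -- the keys are real, the values are made up.
{
    "sectionTitle": "Art & Entertainment",
    "heading": "Some article headline",
    "imageSource": "http://www.thedailystar.net/path/to/some-image.jpg",
    "description": "A short intro paragraph of the article...",
    "fullNewsLink": "http://www.thedailystar.net/path/to/the-full-article"
}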

We are almost done. Time to deploy the app. It's easy as pie. Right-click the project and select Publish. You will get a window like this; select "Microsoft Azure Web Apps" as the publish target. Sign in and click the New button right beside the existing web apps combobox. Give your site a unique name and select your default App Service plan.

Now hit Create to create a publish profile for the app. After that, this window will appear, where you can hit Publish to publish the app to the Windows Azure cloud. It will do all the necessary work for you, and when it is done your browser will pop up and you can navigate to the entertainment route to get a result like the one below.

I hope you enjoyed this post. Don't forget to share if you like it. See you next time.