Building a Small Crawler with Node.js's http Module

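I have just started learning Node.js. (I had wanted to for a long time but never found the time. I have just come from a course on imooc (慕课网), so I am writing up my notes here.)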

First, a quick review of what an HTTP request goes through once it has been issued.

The life of an HTTP request (using the Chrome browser as an example):

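1. Chrome searches its own DNS cache (viewable at chrome://net-internals/#dns);

2. If the browser has no cached entry, or the entry has expired, the operating system's own DNS cache is searched;

3. The local hosts file is read;

4. The browser makes a system call to resolve the name through DNS:

a) The ISP's DNS server checks its own cache;

b) If nothing is cached there, the ISP's server issues an iterative DNS resolution request:

i. The ISP's server returns the result to the operating system kernel and caches it;

ii. The operating system kernel returns the result to the browser;

iii. The browser finally has the IP address that corresponds to www.imooc.com.

5. With the IP address for the domain in hand, the browser starts the TCP three-way handshake;

6. Once the TCP connection is established, the browser can send its HTTP request to the server. For example, it can use the HTTP GET method to request the root path of the domain, with a protocol version such as HTTP/1.0;

7. The server receives the request, processes it on the back end according to the path and parameters, and returns the result to the browser. If a page was requested, the complete HTML of that page is returned;

8. The browser receives the complete HTML. While it parses and renders the page, the static resources the page references (JS, CSS, images) are each fetched with their own HTTP request, and each of those goes through the seven main steps above;

9. The browser renders the page with the resources it has obtained and finally presents the complete page to the user.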
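Steps 4 to 7 can be sketched with Node.js's built-in dns and net modules. This is only a rough illustration, not how a browser is implemented: the helper names buildGetRequest and rawGet are made up for this example, port 80 and the HTTP/1.0 request line are assumptions taken from the description above, and dns.lookup goes through the operating system's resolver rather than any browser cache.

```javascript
const dns = require('dns');
const net = require('net');

// Build the raw text of a minimal HTTP/1.0 GET request (step 6).
// Hypothetical helper, named for this sketch only.
function buildGetRequest(host, path) {
  return 'GET ' + path + ' HTTP/1.0\r\n' +
         'Host: ' + host + '\r\n' +
         'Connection: close\r\n' +
         '\r\n';
}

// Resolve the host name (step 4), open a TCP connection (step 5),
// send the request (step 6) and collect the raw response (step 7).
// Hypothetical helper; assumes plain HTTP on port 80.
function rawGet(host, path, callback) {
  dns.lookup(host, function (err, address) {
    if (err) return callback(err);

    const socket = net.connect({ host: address, port: 80 });
    const chunks = [];
    let finished = false;

    // Make sure the callback fires only once.
    function finish(error, text) {
      if (finished) return;
      finished = true;
      callback(error, text);
    }

    // The 'connect' event fires once the TCP handshake has completed.
    socket.on('connect', function () {
      socket.write(buildGetRequest(host, path));
    });
    socket.on('data', function (chunk) {
      chunks.push(chunk);
    });
    socket.on('end', function () {
      finish(null, Buffer.concat(chunks).toString('utf8'));
    });
    socket.on('error', function (error) {
      finish(error);
    });
  });
}
```

With network access, calling rawGet('www.imooc.com', '/', callback) should hand the callback the status line, headers, and body as one string. The http module used later in this post wraps these same steps and parses the response for you.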

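Writing a crawler in Node.js relies on the http module. You also need the cheerio module: run cnpm install cheerio (or npm install cheerio) to download it into the current project. cheerio loads an HTML string into a jQuery-like object, conventionally named $, which you can then use just as you would use jQuery to work with the DOM and call jQuery-style methods. Following the video, I will do a simple crawl of an imooc course here.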

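How a real crawler should be implemented depends on the specific requirements. This is only a simple application.

The code of crawler.js is as follows: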

var http = require('http');
var cheerio = require('cheerio');
var url = 'http://www.imooc.com/learn/348';

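/**
 * Extract the chapters and their videos from the course page HTML.
 * Requires the cheerio module installed locally in the project
 * (npm install cheerio); by default require() does not look in global (-g) installs.
 * @param  {string} html  HTML source of the course page
 * @return {Array}        chapters in the form [{ title, videos: [{ id, title }] }]
 */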
function filterChapters(html) {
  var $ = cheerio.load(html);

  var $chapters = $('.chapter');

  //console.log(chapters);
  var chapters = [];
  var $chapter, chapterTitle, $videos;
  var v, videos = [];
  var $tempVideo;
  $chapters.each(function(index,element) {
    
    $chapter = $(this);
    chapterTitle = $chapter.find('strong').text().trim();
    $videos = $chapter.find('.video').children('li');

    videos = [];
    $videos.each(function() {
      $tempVideo = $(this);
      v = {
        id: $tempVideo.data('media-id'),
        title: $tempVideo.children('a').text().trim()
      };

      videos.push(v);
    });

    chapters.push({
      title: chapterTitle,
      videos: videos
    });
  });

  return chapters;
}

function printCourseInfo(course) {

  var chapterTitle = '';
  var res = '';
  course.forEach(function(item) {
    chapterTitle = item.title;

    res += chapterTitle + '\n';

    item.videos.forEach(function(video) {
      res += '  【' + video.id + '】 ' + video.title + '\n';
    });
  });
  console.log(res);
}

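http.get(url, function(res) {
  var html = '';

  // Decode as UTF-8 so multi-byte characters split across chunks stay intact
  res.setEncoding('utf8');

  res.on('data', function(data) {
    html += data;
  });

  // The whole page has arrived: parse it and print the course outline
  res.on('end', function() {
    var courses = filterChapters(html);
    printCourseInfo(courses);
  });
}).on('error', function(err) {
  console.log('Failed to fetch the page: ' + err.message);
});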
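One detail about accumulating the response with html += data: when the chunks arrive as Buffers, each chunk is converted to a string on its own, so a multi-byte character (common on a Chinese page) that happens to be split across two chunks can come out garbled. Calling res.setEncoding('utf8') on the response guards against this. The sketch below reproduces the effect without any network access, using the built-in string_decoder module; the split point is chosen by hand for the demonstration.

```javascript
const { StringDecoder } = require('string_decoder');

const text = '慕课网';
const bytes = Buffer.from(text, 'utf8'); // 3 characters, 9 bytes

// Simulate two network chunks, cutting in the middle of the second character.
const chunks = [bytes.subarray(0, 4), bytes.subarray(4)];

// Naive accumulation: each Buffer is decoded in isolation.
let naive = '';
chunks.forEach(function (chunk) {
  naive += chunk;
});

// StringDecoder buffers incomplete characters until the rest arrives.
const decoder = new StringDecoder('utf8');
let decoded = '';
chunks.forEach(function (chunk) {
  decoded += decoder.write(chunk);
});
decoded += decoder.end();

console.log(naive === text);   // false: the split character was corrupted
console.log(decoded === text); // true
```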

Output:

[Screenshot: console output of crawler.js]

(I am recording this code here as an introduction to crawling with Node.js.)