0
点赞
收藏
分享

微信扫一扫

c#.net——c#.net异步实现网页信息爬取



之前研究各种语言异步的时候就想做一个C#版本的异步,但是毕竟好久不做了(也就是在大学期间用asp.net做了几个管理系统)

一开始写代码时完全蒙蔽了,语法啥的都忘差不多了~~研究了好几天,也参考了网上许多资料,终于写出了几行low逼代码

c#.net——c#.net异步实现网页信息爬取_c#


实现内容:异步并发爬取网页信息


首先异步的语法和其他语言都大同小异,async、await,定义异步方法的话要加async修饰符,如果你想在await调用,那么返回值必须为 Task

c#.net——c#.net异步实现网页信息爬取_爬虫_02

c#.net——c#.net异步实现网页信息爬取_异步_03


若返回值为Task,但是我们并没有await,也就是阻塞时返回继续执行,VS编译器也会给予提示,但可以正常运行

c#.net——c#.net异步实现网页信息爬取_异步_04


c#.net——c#.net异步实现网页信息爬取_并发_05


接下来说一下爬取程序的实现,也是也很简单,就是HTML匹配抓取关键字

先写一个方法 实现文本截取

public static string get_key_by_content(string content, string start, string end)
{
if (content.Contains(start) && content.Contains(end))
{
int start_point = content.IndexOf(start, 0, System.StringComparison.Ordinal) + start.Length;
int end_point = content.IndexOf(end, start_point, System.StringComparison.Ordinal);
return content.Substring(start_point, end_point - start_point);
}
else
{
return "";
}
}



以CSDN博客为例,截取标题、阅读量等信息


c#.net——c#.net异步实现网页信息爬取_并发_06


我们发现标题有一个超链接,于是我们把帖子的子url摘出来

string link_view = Common.get_key_by_content(content, "<span class=\"link_view\" title=\"阅读次数\">", "</span>");
string link_title = Common.get_key_by_content(content, "<span class=\"link_title\"><a href=\"/"+childurl+"\">", "</a>");



整个爬取函数:

static async Task Spider(string RootUrl, int child, string childurl)
{

using (var client = new HttpClient())
{
client.BaseAddress = new Uri(RootUrl);
client.DefaultRequestHeaders.Accept.Clear();
client.DefaultRequestHeaders.Accept.Add(new MediaTypeWithQualityHeaderValue("text/html"));

try
{
HttpResponseMessage response = await client.GetAsync(childurl);
response.EnsureSuccessStatusCode();
string content = await response.Content.ReadAsStringAsync();
string link_view = Common.get_key_by_content(content, "<span class=\"link_view\" title=\"阅读次数\">", "</span>");
string link_title = Common.get_key_by_content(content, "<span class=\"link_title\"><a href=\"/"+childurl+"\">", "</a>");
Console.WriteLine("第" + (child + 1).ToString() + "个帖子 " + link_title+" "+link_view);
}
catch (HttpRequestException e)
{
Console.WriteLine(e.Message);
}
}

}



然后我们定义执行异步任务函数

static async void StartTask(string url)
{
for (int p = 0; p < Common.get_article_length(); p++)
{
//await WhileRun(url, p, Common.get_article_by_id(p));
WhileRun(url, p, Common.get_article_by_id(p));
}

}

static async Task WhileRun(string RootUrl, int p, string childurl)
{
while (true)
{
await Spider(RootUrl, p, childurl);
Thread.Sleep(1000);
}
}



接下来只要定义几个帖子就可以了,由于C#没有全局的概念,所以我们要封装成一个类,同时给出获取帖子地址、帖子总数长度等方法。


完整代码:

using System;
using System.Collections.Generic;
using System.IO;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Text;
using System.Threading;
using System.Threading.Tasks;



public class Common
{
static string[] articles = {
"sm9sun/article/details/75395020",
"sm9sun/article/details/75446875",
"sm9sun/article/details/75449050",
"sm9sun/article/details/75735001",
"sm9sun/article/details/75309443",
"sm9sun/article/details/76358660"
};


public static string get_article_by_id(int id)
{
if(id < articles.Length&&id>=0)
{
return articles[id];
}
else
{
return "error";
}
}

public static int get_article_length()
{
return articles.Length;
}

public static string get_key_by_content(string content, string start, string end)
{
if (content.Contains(start) && content.Contains(end))
{
int start_point = content.IndexOf(start, 0, System.StringComparison.Ordinal) + start.Length;
int end_point = content.IndexOf(end, start_point, System.StringComparison.Ordinal);
return content.Substring(start_point, end_point - start_point);
}
else
{
return "";
}
}



}

class Program
{

static void Main()
{
string RootUrl = "http://blog.csdn.net";
StartTask(RootUrl);
Console.WriteLine("gogogo!");
Console.ReadLine();

}

static async void StartTask(string url)
{
for (int p = 0; p < Common.get_article_length(); p++)
{
//await WhileRun(url, p, Common.get_article_by_id(p));
WhileRun(url, p, Common.get_article_by_id(p));
}

}

static async Task WhileRun(string RootUrl, int p, string childurl)
{
while (true)
{
await Spider(RootUrl, p, childurl);
Thread.Sleep(1000);
}
}

static async Task Spider(string RootUrl, int child, string childurl)
{

using (var client = new HttpClient())
{
client.BaseAddress = new Uri(RootUrl);
client.DefaultRequestHeaders.Accept.Clear();
client.DefaultRequestHeaders.Accept.Add(new MediaTypeWithQualityHeaderValue("text/html"));

try
{
HttpResponseMessage response = await client.GetAsync(childurl);
response.EnsureSuccessStatusCode();
string content = await response.Content.ReadAsStringAsync();
string link_view = Common.get_key_by_content(content, "<span class=\"link_view\" title=\"阅读次数\">", "</span>");
string link_title = Common.get_key_by_content(content, "<span class=\"link_title\"><a href=\"/"+childurl+"\">", "</a>");
Console.WriteLine("第" + (child + 1).ToString() + "个帖子 " + link_title+" "+link_view);
}
catch (HttpRequestException e)
{
Console.WriteLine(e.Message);
}
}

}
}



运行截图:



c#.net——c#.net异步实现网页信息爬取_.net_07



顺带一提,我不是故意刷阅读量的~!~!

不是~!~!

真的不是~!~!

c#.net——c#.net异步实现网页信息爬取_.net_08




举报

相关推荐

0 条评论