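Preface

Python: What? You're writing a crawler in Java?
That's right! Today I'm breaking with convention and writing a simple crawler in Java. The main reason is that my graduation project is built on Java and some of its features need a crawler; it's too late to change the topic now, so I can only grit my teeth and push on (I chose this road, so however absurd it is I have to walk it to the end o(╥﹏╥)o)

That is also why this casual post exists: after half a day of digging I have a little understanding of jsoup, so I tried using it to write a simple crawler.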

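I. Jsoup Overview

1.1 Introduction

jsoup is an HTML parser for Java that can directly parse a URL or a piece of HTML text. It provides a very convenient API for extracting and manipulating data through DOM traversal, CSS selectors, and jQuery-like methods.

1.2 Main features of Jsoup

Parse HTML from a URL, a file, or a string.
Find and extract data using DOM traversal or CSS selectors.
Manipulate HTML elements, attributes, and text.

Note: jsoup is released under the MIT license, so it can safely be used in commercial projects.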
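
To make the features listed above a bit more concrete, here is a minimal sketch that parses an HTML string, picks elements with a CSS selector, and reads an attribute and some text. The HTML snippet, the class name JsoupFeatureDemo, and the helper names are all made up for illustration; only the jsoup calls (Jsoup.parse, select, selectFirst, attr, text) come from the library.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.util.ArrayList;
import java.util.List;

public class JsoupFeatureDemo {
    // Hypothetical HTML used only for this demo.
    static final String HTML =
            "<div id=\"content\">"
            + "<p class=\"intro\">Hello <b>jsoup</b></p>"
            + "<a href=\"/a\">first</a><a href=\"/b\">second</a>"
            + "</div>";

    // Collects the href attribute of every <a> inside the element with id "content".
    static List<String> extractLinks(String html) {
        Document doc = Jsoup.parse(html);
        Elements links = doc.select("#content a[href]");
        List<String> hrefs = new ArrayList<>();
        for (Element link : links) {
            hrefs.add(link.attr("href"));
        }
        return hrefs;
    }

    // Returns the text of the first <p class="intro">, or "" if there is none.
    static String introText(String html) {
        Element p = Jsoup.parse(html).selectFirst("p.intro");
        return p == null ? "" : p.text();
    }

    public static void main(String[] args) {
        System.out.println(extractLinks(HTML)); // prints [/a, /b]
        System.out.println(introText(HTML));    // prints Hello jsoup
    }
}
```

Nothing here touches the network, so it is a convenient way to try out selectors before pointing the crawler at a real page.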


II. Usage Steps

1. Importing jsoup in IDEA

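I'm not using a cracked copy of IDEA; I use the free Community Edition, which has all the features I need.
First, create a new Maven project.
Then open the Maven repository search site, https://mvnrepository.com/ (a third-party index rather than the official Maven site), and search for jsoup in the search bar.

Open the jsoup entry in the results.

Select the latest version at the time of writing, 1.14.3.

Copy the dependency snippet shown on that page into the pom.xml file.

Note that the snippet must be wrapped in a <dependencies> element, otherwise Maven reports an error.
Here it is in full; just copy and paste the whole block.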

<!-- https://mvnrepository.com/artifact/org.jsoup/jsoup -->
<dependencies>
    <dependency>
        <groupId>org.jsoup</groupId>
        <artifactId>jsoup</artifactId>
        <version>1.14.3</version>
    </dependency>
</dependencies>

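2. Crawler code design

With jsoup set up, we can start writing the crawler.

The complete code is as follows (swap in whatever URL you want to crawl, and adjust the code to match):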

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URL;
import java.net.URLConnection;

public class test {
    public static void main(String[] args) throws IOException {
        String url = "https://mp.weixin.qq.com/s/YPrqMOYYrAtCni2VT8c4jA?";
        // Fetch and parse the page with a 10-second timeout
        Document document = Jsoup.parse(new URL(url), 10000);
        //System.out.println(document);
        Element id = document.getElementById("js_content");
        Elements imgs = id.getElementsByTag("img");
        int pid = 0;
        //System.out.println(imgs);
        for (Element img : imgs) {
            // Read the data-src attribute; skip img tags that do not have one
            String src = img.attr("data-src");
            if (src.isEmpty()) {
                continue;
            }
            //System.out.println(src);
            // Open a connection to the image URL
            URL target = new URL(src);
            URLConnection urlConnection = target.openConnection();
            pid++;
            // The directory D:\code\picture must already exist.
            // try-with-resources closes both streams even if the copy fails.
            try (InputStream inputStream = urlConnection.getInputStream();
                 OutputStream outputStream = new FileOutputStream("D:\\code\\picture\\" + pid + ".jpg")) {
                // Copy in 8 KB chunks instead of one byte at a time
                byte[] buffer = new byte[8192];
                int length;
                while ((length = inputStream.read(buffer)) != -1) {
                    outputStream.write(buffer, 0, length);
                }
            }
            System.out.println(pid + ".jpg downloaded!");
        }
    }
}

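Let's go through it piece by piece.

First, the pile of import statements at the top were all added automatically by IDEA while I typed the code.
Start by visiting the page whose images we want to crawl.
It contains many images, and their addresses sit in the data-src attribute of the img tags. So the overall idea is: send a request to the site, get the response, pull the data-src attribute out of each img tag in the response, and finally use input and output streams to save the images to your own computer.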

1. Set a url variable holding the address to crawl:

String url = "https://mp.weixin.qq.com/s/YPrqMOYYrAtCni2VT8c4jA?";
Document document = Jsoup.parse(new URL(url), 10000);
// You can print the parsed HTML of the page to inspect it
//System.out.println(document);
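
As a side note, jsoup also has a Connection API (Jsoup.connect) that lets you set request headers before fetching. Some sites answer differently when no browser-like User-Agent is sent; whether the page used in this post needs that is not something I have verified, so treat the sketch below as an optional variant. The class name FetchDemo, the helper names, and the User-Agent value are my own placeholders.

```java
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.io.IOException;

public class FetchDemo {
    // Builds a jsoup connection with an explicit User-Agent and a 10-second timeout.
    // Nothing is sent over the network until get() is called.
    static Connection prepare(String url) {
        return Jsoup.connect(url)
                .userAgent("Mozilla/5.0") // assumed value, adjust as needed
                .timeout(10_000);
    }

    // Sends the GET request and parses the response body into a Document.
    static Document fetch(String url) throws IOException {
        return prepare(url).get();
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: FetchDemo <url>");
            return;
        }
        Document document = fetch(args[0]);
        System.out.println(document.title());
    }
}
```

The returned Document is the same type that Jsoup.parse(new URL(url), 10000) produces, so the rest of the crawler would not need to change.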

2. Locate the img tags:

Element id = document.getElementById("js_content");
Elements imgs = id.getElementsByTag("img");
//System.out.println(imgs);

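Right-click on the page, choose Inspect, select an image, and keep moving up the tree: its ancestor div has the id js_content. So document.getElementById("js_content") fetches the element whose id is js_content, and id.getElementsByTag("img") then returns every img tag inside that element.
You can print imgs to check the result.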
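
The same lookup can also be written as one CSS selector. The sketch below is only an illustration on a made-up HTML string (the class name ImageSrcDemo and the example.com URLs are hypothetical); it additionally skips img tags whose data-src is missing or empty, which would otherwise make new URL(src) throw later on.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

public class ImageSrcDemo {
    // Hypothetical page: one usable image, one without data-src,
    // one with an empty data-src, and one outside #js_content.
    static final String HTML =
            "<div id=\"js_content\">"
            + "<img data-src=\"https://example.com/1.jpg\">"
            + "<img src=\"x.png\">"
            + "<img data-src=\"\">"
            + "</div>"
            + "<img data-src=\"https://example.com/outside.jpg\">";

    // Returns the non-empty data-src values of all <img> tags inside #js_content.
    static List<String> imageSources(String html) {
        Document document = Jsoup.parse(html);
        List<String> sources = new ArrayList<>();
        // "#js_content img[data-src]": img descendants of the element with
        // id js_content that carry a data-src attribute
        for (Element img : document.select("#js_content img[data-src]")) {
            String src = img.attr("data-src").trim();
            if (!src.isEmpty()) {
                sources.add(src);
            }
        }
        return sources;
    }

    public static void main(String[] args) {
        System.out.println(imageSources(HTML)); // prints [https://example.com/1.jpg]
    }
}
```

Unlike getElementById followed by getElementsByTag, select simply returns an empty result when the id does not exist, so there is no null to check.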

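3. Download the images

int pid = 0;
//System.out.println(imgs);
for (Element img : imgs) {
    // Read the data-src attribute; skip img tags that do not have one
    String src = img.attr("data-src");
    if (src.isEmpty()) {
        continue;
    }
    //System.out.println(src);
    // Open a connection to the image URL
    URL target = new URL(src);
    URLConnection urlConnection = target.openConnection();
    pid++;
    // The directory D:\code\picture must already exist.
    // try-with-resources closes both streams even if the copy fails.
    try (InputStream inputStream = urlConnection.getInputStream();
         OutputStream outputStream = new FileOutputStream("D:\\code\\picture\\" + pid + ".jpg")) {
        // Copy in 8 KB chunks instead of one byte at a time
        byte[] buffer = new byte[8192];
        int length;
        while ((length = inputStream.read(buffer)) != -1) {
            outputStream.write(buffer, 0, length);
        }
    }
    System.out.println(pid + ".jpg downloaded!");
}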
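
If you would rather not write the copy loop by hand, the standard library's Files.copy can do the stream-to-file step. This is a minimal sketch and not part of the crawler above: the class name SaveStreamDemo and the save helper are hypothetical, and the demo writes a few fake bytes into a temporary directory instead of downloading anything.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class SaveStreamDemo {
    // Copies the whole stream into targetDir/<pid>.jpg, creating the directory
    // if needed, and returns the number of bytes written.
    static long save(InputStream in, Path targetDir, int pid) throws IOException {
        Files.createDirectories(targetDir);
        Path target = targetDir.resolve(pid + ".jpg");
        // try-with-resources closes the input stream when the copy is done
        try (InputStream input = in) {
            return Files.copy(input, target, StandardCopyOption.REPLACE_EXISTING);
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("picture");
        byte[] fakeImage = {1, 2, 3, 4}; // stand-in for real image bytes
        long written = save(new ByteArrayInputStream(fakeImage), dir, 1);
        System.out.println(written + " bytes written to " + dir.resolve("1.jpg"));
    }
}
```

In the crawler, the stream from urlConnection.getInputStream() would take the place of the ByteArrayInputStream, and the target directory would be the picture folder.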