Web Scraping

12a 2024/10/3 About 9 minutes

This article mainly introduces two scraping libraries: parsel for Python and cheerio for JavaScript.

Requesting web page data

Any of the common HTTP libraries can be used to make the request.

python

# Python: request the page with the requests library
import requests

# Request headers that imitate a browser
headers = {'user-agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.190 Safari/537.36'}

def getHTMLText(url, timeout=30, headers=None, encoding='utf-8'):
    try:
        # headers=None makes requests fall back to its default headers
        r = requests.get(url, headers=headers, timeout=timeout)
        r.raise_for_status() # raise an exception for 4xx/5xx status codes
        r.encoding = encoding
        return r.text
    except requests.RequestException as e:
        print(e)
        return ""

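// Request with axios
import axios from 'axios';

function getHTMLText(url) {
    const config = {
        method: 'get',
        url
    };

    axios(config)
        .then(function (response) {
            // getMessage is the caller's own handler for the returned HTML (not defined here)
            getMessage(response.data);
        })
        .catch(function (error) {
            console.log(error);
        });
}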

// XMLHttpRequest (no promises, browser only)
function getHTMLText(url) {
    const request = new XMLHttpRequest();

    request.timeout = 3000;

    request.open('get', url);

    request.onreadystatechange = function() {
        // readyState 4 means the request has finished
        if (request.readyState === 4) {
            if (request.status >= 200 && request.status < 300) {
                const p = document.getElementsByTagName('p')[0];
                p.innerHTML = request.response;
            }
        }
    };

    // Without send() the request is never dispatched
    request.send();
}

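Parsing JSON data

In Python, JSON data is parsed with the json module from the standard library.

import json

# Sample values so the calls below can run
dict1 = {'name': 'parsel', 'stars': 1}
str1 = '{"name": "cheerio", "stars": 2}'

# Encode an object as a string (dumps) or write it to a file (dump)
# json.dumps(obj, sort_keys=False, indent=None)
# json.dump(obj, fp, sort_keys=False, indent=None)
json.dumps(dict1)

# Decode a string (loads) or the contents of a file (load) into an object
# json.loads(string)
# json.load(fp)
json.loads(str1)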
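
As a small illustration of the four functions above, the sketch below round-trips a dictionary through a string and through a file-like object. The sample record and the use of io.StringIO in place of a real file are assumptions made for this example.

```python
import io
import json

# Sample data (hypothetical) with a non-ASCII value
record = {"title": "爬虫", "pages": 9, "tags": ["parsel", "cheerio"]}

# dict -> str; ensure_ascii=False keeps non-ASCII characters readable
text = json.dumps(record, sort_keys=True, ensure_ascii=False)

# str -> dict
restored = json.loads(text)

# dict -> file-like object, then file-like object -> dict
buffer = io.StringIO()
json.dump(record, buffer, indent=2, ensure_ascii=False)
buffer.seek(0)
from_file = json.load(buffer)

print(text)
print(restored == record, from_file == record)
```

With a real file, replace the StringIO buffer with open(path, 'w', encoding='utf-8') for dump and open(path, encoding='utf-8') for load.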

Parsing HTML data

parsel supports both CSS and XPath rules.
cheerio uses its own rules, which are largely the same as jQuery selectors.

python

# HTTP request library
import requests
# Scraping (parsing) library
import parsel
# JSON parsing library
import json

// HTTP request library
import axios from 'axios';
// cheerio scraping library
import * as cheerio from 'cheerio';

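CSS matching rules

Basic CSS selector rules
- Select nodes by tag name
- Use .class or #id to select nodes that have that class or id
- :: extracts information from the matched node: ::text gets its text, and ::attr(href) gets the value of its href attribute (both are parsel extensions, not standard CSS)
- * is the wildcard and matches every element, so it can be used to reach the content of all descendant nodes

from parsel import Selector

# htmlData is the HTML string fetched earlier
selector = Selector(text=htmlData)

# Get the href attribute value of the first a tag
selector.css('a::attr(href)').get()
# Get the text of every descendant element of the node with id "images"
selector.css('#images *::text').getall()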

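XPath matching rules

Basic syntax

Symbol	Meaning
/	Starts from the root node and steps to the target node
//	Searches recursively for all matching nodes at any depth (from the root at the start of an expression; use .// to start from the current node)
.	The current node
..	The parent of the current node
*	Matches any element node
@	Selects an attribute
[]	A predicate, used to filter the matching nodes

Select all nodes: use the double slash // to select every node in the document, e.g. //node()
Select nodes by tag name: use the tag name, e.g. //book
Select nodes by attribute: use square brackets [] and @ to select nodes with a given attribute value, e.g. //book[@category="children"]
Select parent, child and sibling nodes: use .. for the parent, / for children, and the following-sibling:: or preceding-sibling:: axes for siblings, e.g. //book/title/.. and //book/author/following-sibling::title
Select nodes with a wildcard: use the asterisk * to match any element, e.g. //book/* selects all child elements of every book node
Select nodes with logical operators: use and, or and not(), e.g. //book[price<10 and @category="children"]
Process nodes with built-in functions: use built-in functions on the text and numeric values of nodes, e.g. //book[substring(title,1,3)="The"] selects books whose title starts with "The"
Select nodes with axes: e.g. //book/ancestor::library selects the library ancestors of book nodes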

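from parsel import Selector

# htmlData is the HTML string fetched earlier
selector = Selector(text=htmlData)
# Return the text content of title (.get() extracts the string from the match)
selector.xpath('/html/head/title/text()').get()
# Select the bookstore node under the root (for an XML document, create the selector with type='xml')
selector.xpath('/bookstore')
# Find all book nodes whose category attribute is "web"
selector.xpath('//book[@category="web"]')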

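cheerio matching rules

To extract a value, you must call attr(), prop() or text(); otherwise you get a selection object rather than a plain value (usually a string). The commonly used methods are listed below, and the cheerio documentation has the details. The selectors are largely the same as jQuery's.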

import * as cheerio from 'cheerio';

const $ = cheerio.load(htmlData);

// jQuery-style selectors are supported
const $selected = $('[data-selected=true]');
const $p = $('p:contains("hello")');
// XML namespaces are supported (escape the colon ':' as '\\:')
const $main = $('[xml\\:id="main"]');

// $selected, $p and $main are each a selection object; the methods below select further from a selection
// All a tags inside div
const links = $('div').find('a');
// Children of ul that are li tags
const listItems = $('ul').children('li');
// All child nodes of div, including text and comment nodes (each separate line break counts as a text node)
const contents = $('div').contents();
// The parent of li
const parentOfItem = $('li').parent();
// All ancestors of li, and the ancestors of li up to (but not including) the one that matches the selector
const ancestors = $('li').parents();
const ancestorsUntil = $('li').parentsUntil('div');
// The first match found walking up from li (an empty selection if nothing matches)
const list = $('li').closest('ul');
// The next or previous element
const nextItem = $('li:first').next();
const prevItem = $('li:eq(1)').prev();
// All following siblings, all preceding siblings, and all other siblings
const nextAll = $('li:first').nextAll();
const prevAll = $('li:last').prevAll();
const siblings = $('li:eq(1)').siblings();
// All siblings up to (but not including) the one that matches the selector
const nextUntil = $('li:first').nextUntil('li:last-child');
const prevUntil = $('li:last').prevUntil('li:first-child');

// Filters (these can also be written in the selector string, which is more efficient)
// Select the second, first and last element
const secondItem = $('li').eq(1);
const firstItem = $('li').first();
const lastItem = $('li').last();
// Filter elements: the li elements with the .item class, and those without it
const matchingItems = $('li').filter('.item');
const nonMatchingItems = $('li').not('.item');

// Get and set attributes and properties (use prop for values that need no explicit value, such as checked; same as jQuery)
// prop can also read tagName, innerHTML, outerHTML, textContent and innerText
// Set the 'src' attribute of an image element
$('img').attr('src', 'https://example.com/image.jpg');
// Set the 'checked' property of a checkbox element
$('input[type="checkbox"]').prop('checked', true);
// Get the 'href' attribute of a link element
const href = $('a').attr('href');
// Get the 'disabled' property of a button element
const isDisabled = $('button').prop('disabled');

// Add, remove, or toggle classes (toggle removes the class if present and adds it if absent)
// Add a class to an element
$('div').addClass('new-class');
// Add multiple classes to an element
$('div').addClass('new-class another-class');
// Remove a class from an element
$('div').removeClass('old-class');
// Remove multiple classes from an element
$('div').removeClass('old-class another-class');
// Toggle a class on an element (add if it doesn't exist, remove if it does)
$('div').toggleClass('active');

// Set or get all of the text content; the content of script and style tags is included in the output, so use .prop('innerText') if that is not wanted
// Set the text content of an element
$('h1').text('Hello, World!');
// Get the text content of an element
const text = $('p').text();

// Set or get the inner HTML
// Set the inner HTML of an element
$('div').html('<p>Hello, World!</p>');
// Get the inner HTML of an element
const html = $('div').html();

// Insert new elements at a given position
// Append an element to the end of a parent element
$('ul').append('<li>Item</li>');
// Prepend an element to the beginning of a parent element
$('ul').prepend('<li>Item</li>');
// Insert an element before a target element
$('li').before('<li>Item</li>');
// Insert an element after a target element
$('li').after('<li>Item</li>');
// Insert the element itself at a given position
// Insert an element after a target element
$('<p>Inserted element</p>').insertAfter('h1');
// Insert an element before a target element
$('<p>Inserted element</p>').insertBefore('h1');

// Wrap the element itself or all of its children in a tag; unwrap removes the parent tag while keeping its children
// Wrap an element in a div
$('p').wrap('<div></div>');
// Wrap the inner HTML of an element in a div
$('div').wrapInner('<div></div>');
// Unwrap an element
$('p').unwrap();

// Replace, remove, or empty the matched elements
// Replace an element with another element
$('li').replaceWith('<li>Item</li>');
// Remove an element from the document
$('li').remove();
// Remove an element's children from the document
$('li').empty();

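Downloading video and audio streams

Use you-get to download audio and video from web pages; install you-get and ffmpeg first.

# Show the available formats without downloading
you-get -i https://www.bilibili.com/video/BV1ao4y197Fn
# Download a specific format
you-get --format=dash-flv720 https://www.bilibili.com/video/BV1ao4y197Fn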

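Batch downloading by running the command automatically

from subprocess import run

url_template = "https://www.bilibili.com/video/BV1eW411Q7CE?p={}"
# Download parts 1 to 7 of the video
for i in range(1, 8):
    url = url_template.format(i)
    print("Downloading video page {}".format(i))
    # Pass the arguments as a list so the shell does not interpret the '?' in the URL
    run(["you-get", url])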
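
A slightly more defensive variant of the loop above is sketched here. It separates building the commands from running them, so the list can be checked with a dry run first, and it records which downloads failed. The function names build_commands and download_all and the dry_run flag are hypothetical helpers introduced for this example; only the you-get command and the URL come from the text above.

```python
from subprocess import run

BASE_URL = "https://www.bilibili.com/video/BV1eW411Q7CE"

def build_commands(base_url, first_page, last_page):
    """Return one you-get argument list per page, first_page..last_page inclusive."""
    return [["you-get", f"{base_url}?p={page}"]
            for page in range(first_page, last_page + 1)]

def download_all(commands, dry_run=True):
    """Run each command in order; with dry_run=True only print it."""
    failed = []
    for command in commands:
        print(" ".join(command))
        if dry_run:
            continue
        # A non-zero return code means you-get reported an error
        result = run(command)
        if result.returncode != 0:
            failed.append(command)
    return failed

commands = build_commands(BASE_URL, 1, 7)
failed = download_all(commands, dry_run=True)
```

Calling download_all(commands, dry_run=False) performs the actual downloads and returns the commands that need to be retried.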

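Simulating user visits with selenium

Getting page content

from selenium import webdriver

# Recent versions of selenium no longer require downloading and selecting a browser driver separately
# path = 'msedgedriver.exe'
browser = webdriver.Edge()

url = 'https://www.baidu.com'

browser.get(url)

# Source code of the page
content = browser.page_source

# Close the browser (quit() ends the whole session; close() only closes the current window)
browser.quit()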

Locating elements with selenium

from selenium import webdriver
from selenium.webdriver.common.by import By

# Create the browser object
browser = webdriver.Chrome()

# Open the site
url = 'https://www.baidu.com'
browser.get(url)

# Locate an element by its id (find_elements returns a list of all matches instead):
button = browser.find_element(By.ID, 'su')
# button = browser.find_elements(By.ID, 'su')
print(button)

# Locate an element by the value of its name attribute:
button = browser.find_element(By.NAME, 'wd')
print(button)

# Locate an element with an XPath expression:
button = browser.find_element(By.XPATH, '//input[@id="su"]')
print(button)

# Locate elements by tag name:
button = browser.find_elements(By.TAG_NAME, 'input')
print(button)

# Locate elements with a CSS selector:
button = browser.find_elements(By.CSS_SELECTOR, '#su')
print(button)

# Locate links by their text (exact match):
button = browser.find_elements(By.LINK_TEXT, '地图')
print(button)

# Locate links by their text (partial match):
button = browser.find_elements(By.PARTIAL_LINK_TEXT, '地')
print(button)

# Locate an element by its class attribute:
button = browser.find_element(By.CLASS_NAME, 'wrapper_new')
print(button)

Getting element information with selenium

from selenium import webdriver
from selenium.webdriver.common.by import By

# Create the browser object
browser = webdriver.Chrome()

# Open the site
url = 'https://www.baidu.com'
browser.get(url)

button = browser.find_element(By.ID, 'su')

# Get an attribute of the element
print(button.get_attribute('class'))

# Get the tag name of the element:
print(button.tag_name)

# Get the text of the element:
print(button.text)

# Get the position of the element:
print(button.location)

# Get the size of the element:
print(button.size)
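
When several elements need the same inspection, the calls above can be gathered into one helper. This is a minimal sketch; describe_element is a hypothetical name, and the function only relies on the WebElement members already used above (get_attribute, tag_name, text, location, size).

```python
def describe_element(element):
    """Collect the basic facts about a Selenium WebElement into a dict."""
    return {
        "tag": element.tag_name,
        "class": element.get_attribute("class"),
        "text": element.text,
        "location": element.location,  # dict with 'x' and 'y'
        "size": element.size,          # dict with 'height' and 'width'
    }
```

For example, describe_element(browser.find_element(By.ID, 'su')) returns the five values printed one by one in the code above.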

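Simulating human interaction with selenium

from selenium import webdriver
from selenium.webdriver.common.by import By

# Create the browser object
# path = 'chromedriver.exe'
browser = webdriver.Chrome()

# Open the site
url = 'https://www.baidu.com'
browser.get(url)

# Operations on input components
search_box = browser.find_element(By.ID, 'kw')
# Type the text "selenium"
search_box.send_keys('selenium')
# Clear the text just typed
search_box.clear()
# Submit the query, like pressing Enter
search_box.submit()

# Operations on button components
button = browser.find_element(By.ID, 'su')
# Click the button
button.click()

# Run JavaScript directly
# Scroll the page to the bottom
js_bottom = 'document.documentElement.scrollTop=100000'
browser.execute_script(js_bottom)

# Page navigation
# Go back to the previous page
browser.back()
# Go forward to the next page
browser.forward()

# If you run into a verification check, it may fail because the page was created by the automation program
# You can open a new page by hand and switch the driver to that page's window handle
browser.switch_to.window(browser.window_handles[-1]) # switch to the last page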

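APIs with encrypted query fields

[Not tried in practice] Use the browser developer tools to inspect how the API is called and the call stack of the request, to find out which encryption function is called and when. Try calling that function in the console to look for a pattern, compare it with mainstream encryption algorithms, then reimplement the algorithm in Python or another language to reproduce the encrypted field.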
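
To make the last step concrete, here is a minimal sketch of reimplementing a signing routine in Python once it has been identified in the page's JavaScript. The scheme shown (sort the parameters, join them as a query string, append a secret, take the MD5 hex digest) is a purely hypothetical example of a common pattern, not the algorithm of any particular site. The parameter names, the sign field and the secret are all made up.

```python
import hashlib
from urllib.parse import urlencode

def sign_params(params, secret):
    """Return a copy of params with a hypothetical 'sign' field added."""
    # Sort by key so the same parameters always produce the same string
    canonical = urlencode(sorted(params.items()))
    # Hypothetical rule: MD5 over the canonical string followed by the secret
    digest = hashlib.md5((canonical + secret).encode("utf-8")).hexdigest()
    return {**params, "sign": digest}

signed = sign_params({"page": 1, "keyword": "selenium"}, secret="made-up-secret")
print(signed)
```

Comparing the output of such a function with the value the browser actually sends for the same parameters is a quick way to check whether the guessed rule is right.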