Task 1 Documentation

Project environment:

jupyter 4.5.0
Python 3.7.4
bs4
requests

Target site:

https://www.scrapethissite.com/pages/forms/

1. Installing the dependencies

Install the requests and bs4 libraries with pip:

pip install requests beautifulsoup4

Here,

  • requests is used to send the web request and fetch the page source.
  • BeautifulSoup is used to parse the HTML and extract the data.

2. Define the site's base URL

Looking at the address bar, the base URL is

url = "https://www.scrapethissite.com/pages/forms/"

3. Send the request and fetch the page source

We fetch the page with requests and, for convenience, parse the HTML source with Python's built-in html.parser:

response = requests.get(url)
html_content = response.text
soup = BeautifulSoup(html_content, "html.parser")

4. Locate all the team data nodes

By inspecting the page source, we find where the team data is stored: each tr tag with class "team" holds one team's complete record.

teams = soup.find_all("tr",class_="team")

5. Iterate and extract the team fields

Likewise, by inspecting the page source we find the tag that holds each field we need. For example, the team name:

team_name = team.find("td",class_="name").text.strip()

Here,

  • .text extracts the text inside the tag
  • .strip() removes leading and trailing whitespace and newlines
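A tiny illustration with a made-up HTML fragment shaped like one cell of the table:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment resembling one <td> on the page
html = '<td class="name">\n  Boston Bruins\n</td>'
cell = BeautifulSoup(html, "html.parser").find("td", class_="name")

print(repr(cell.text))          # '\n  Boston Bruins\n'
print(repr(cell.text.strip()))  # 'Boston Bruins'
```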

Note that some cells can carry either of two different classes. For example, the win-percentage cell is either text-success or text-danger. If we look up only one of them, the lookup returns None for the other rows and calling .text on it raises an error. We therefore select whichever variant is present:

win_td = team.select_one("td.pct.text-success, td.pct.text-danger")
win_pct = win_td.text.strip() if win_td else ""  

6. Store the data and print progress

To fit the CSV format, we join the fields with commas, collect each row in lines, and print progress:

line = f"{TeamName},{Year},{Wins},{Losses},{OTLosses},{win_pct},{GoalsFor},{GoalsAgainst},{diff}\n"
print(f"Extracted team: {TeamName}")
lines.append(line)

Save the data:

with open('hockey_teams_page_1.csv', 'w') as f:
    f.write("TeamName,Year,Wins,Losses,OTLosses,Win%,GoalsFor(GF),GoalsAgainst(GA),+/-\n")
    f.writelines(lines)

7. Pagination scraping

7.1. The pagination URL

Inspecting the address bar, the pagination URL takes the form:

<base URL>?page=<page number>

so we construct each page's URL as

page_url = f"{url}?page={page_num}"

Of course we need to loop over the page numbers; the details are omitted here.
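As a sketch, the loop over pages 2 through 5 (the range used in the full code later) just formats each URL:

```python
url = "https://www.scrapethissite.com/pages/forms/"

page_urls = []
for page_num in range(2, 6):
    page_url = f"{url}?page={page_num}"
    page_urls.append(page_url)

print(page_urls[0])   # https://www.scrapethissite.com/pages/forms/?page=2
print(page_urls[-1])  # https://www.scrapethissite.com/pages/forms/?page=5
```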

7.2. Iterating over the teams

Identical to the first-page code, so it is not repeated here.

7.3. Automatic file naming

filename = f"hockey_teams_page_{page_num}.csv"

You could also enumerate the pages to achieve a similar effect.
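For instance, Python's enumerate can pair each page URL with its page number (a sketch; page_urls is a hypothetical list built from the same pattern):

```python
url = "https://www.scrapethissite.com/pages/forms/"
page_urls = [f"{url}?page={n}" for n in range(2, 6)]

filenames = []
# start=2 because the first paginated URL is page 2
for page_num, page_url in enumerate(page_urls, start=2):
    filenames.append(f"hockey_teams_page_{page_num}.csv")

print(filenames)
```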

8. Search-based scraping

8.1. Define the list of teams to search

We build the search list; each entry contains:

  • the display name (used as the search query)
  • a file-name suffix (used to distinguish the saved files)

teams_to_search = [
    ("Boston Bruins", "boston_bruins"),
    ("Buffalo Sabres", "buffalo_sabres")
]

8.2. Construct the search URL

Inspecting the page gives the search URL parameter:

<base URL>?q=<query (the team name)>

We also need to replace the spaces in the team name with + (as required by the URL parameter format), then append the search parameter:

search_url = f"{url}?q={team_name.replace(' ', '+')}"

As an aside, you can use this same format as a trick for search-engine queries, too. :D
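For simple names the manual replace is enough; the standard library's urllib.parse.quote_plus performs the same substitution and also escapes other special characters, which is safer in general:

```python
from urllib.parse import quote_plus

url = "https://www.scrapethissite.com/pages/forms/"
team_name = "Boston Bruins"

# Manual replacement, as used in this document
manual = f"{url}?q={team_name.replace(' ', '+')}"
# quote_plus also percent-encodes any other special characters
encoded = f"{url}?q={quote_plus(team_name)}"

print(encoded)            # https://www.scrapethissite.com/pages/forms/?q=Boston+Bruins
print(manual == encoded)  # True
```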

8.3. Iterate

We only need to change the requested page to our search_url, along with the print messages and the saved CSV file names.

The full code for Task 1:

import requests
from bs4 import BeautifulSoup

url = "https://www.scrapethissite.com/pages/forms/"

# Original page-1 scraping code
response = requests.get(url)
html_content = response.text

soup = BeautifulSoup(html_content, "html.parser") 

teams = soup.find_all("tr", class_="team")

lines = []
for team in teams:
    TeamName = team.find("td", class_="name").text.strip()
    Year = team.find("td", class_="year").text.strip()
    Wins = team.find("td", class_="wins").text.strip()
    Losses = team.find("td", class_="losses").text.strip()
    OTLosses = team.find("td", class_="ot-losses").text.strip()
    win_td = team.select_one("td.pct.text-success, td.pct.text-danger")
    win_pct = win_td.text.strip() if win_td else ""
    GoalsFor = team.find("td", class_="gf").text.strip()
    GoalsAgainst = team.find("td", class_="ga").text.strip()
    diff_td = team.select_one("td.diff.text-success, td.diff.text-danger")
    diff = diff_td.text.strip() if diff_td else ""
    line = f"{TeamName},{Year},{Wins},{Losses},{OTLosses},{win_pct},{GoalsFor},{GoalsAgainst},{diff}\n"
    print(f"Extracted team: {TeamName}")
    lines.append(line)

print(f"Extraction complete: {len(teams)} teams in total")

with open('hockey_teams_page_1.csv', 'w') as f:
    f.write("TeamName,Year,Wins,Losses,OTLosses,Win%,GoalsFor(GF),GoalsAgainst(GA),+/-\n")
    f.writelines(lines)

# Pagination: scrape pages 2-5
for page_num in range(2, 6):
    # Build the current page's URL
    page_url = f"{url}?page={page_num}"
    response = requests.get(page_url)
    html_content = response.text

    soup = BeautifulSoup(html_content, "html.parser") 

    teams = soup.find_all("tr", class_="team")

    lines = []
    for team in teams:
        TeamName = team.find("td", class_="name").text.strip()
        Year = team.find("td", class_="year").text.strip()
        Wins = team.find("td", class_="wins").text.strip()
        Losses = team.find("td", class_="losses").text.strip()
        OTLosses = team.find("td", class_="ot-losses").text.strip()
        win_td = team.select_one("td.pct.text-success, td.pct.text-danger")
        win_pct = win_td.text.strip() if win_td else ""
        GoalsFor = team.find("td", class_="gf").text.strip()
        GoalsAgainst = team.find("td", class_="ga").text.strip()
        diff_td = team.select_one("td.diff.text-success, td.diff.text-danger")
        diff = diff_td.text.strip() if diff_td else ""
        line = f"{TeamName},{Year},{Wins},{Losses},{OTLosses},{win_pct},{GoalsFor},{GoalsAgainst},{diff}\n"
        print(f"Extracted page {page_num} team: {TeamName}")
        lines.append(line)

    print(f"Page {page_num} complete: {len(teams)} teams in total")

    # Save to a CSV file named after the page number
    filename = f"hockey_teams_page_{page_num}.csv"
    with open(filename, 'w') as f:
        f.write("TeamName,Year,Wins,Losses,OTLosses,Win%,GoalsFor(GF),GoalsAgainst(GA),+/-\n")
        f.writelines(lines)

# Search-based scraping
teams_to_search = [
    ("Boston Bruins", "boston_bruins"),
    ("Buffalo Sabres", "buffalo_sabres")
]

for team_name, file_suffix in teams_to_search:
    # Build the search URL (replace spaces with +, per URL parameter format)
    search_url = f"{url}?q={team_name.replace(' ', '+')}"
    response = requests.get(search_url)
    html_content = response.text

    soup = BeautifulSoup(html_content, "html.parser") 

    teams = soup.find_all("tr", class_="team")

    lines = []
    for team in teams:
        TeamName = team.find("td", class_="name").text.strip()
        Year = team.find("td", class_="year").text.strip()
        Wins = team.find("td", class_="wins").text.strip()
        Losses = team.find("td", class_="losses").text.strip()
        OTLosses = team.find("td", class_="ot-losses").text.strip()
        win_td = team.select_one("td.pct.text-success, td.pct.text-danger")
        win_pct = win_td.text.strip() if win_td else ""
        GoalsFor = team.find("td", class_="gf").text.strip()
        GoalsAgainst = team.find("td", class_="ga").text.strip()
        diff_td = team.select_one("td.diff.text-success, td.diff.text-danger")
        diff = diff_td.text.strip() if diff_td else ""
        line = f"{TeamName},{Year},{Wins},{Losses},{OTLosses},{win_pct},{GoalsFor},{GoalsAgainst},{diff}\n"
        print(f"Extracted {team_name} team: {TeamName}")
        lines.append(line)

    print(f"{team_name} complete: {len(teams)} teams in total")

    # Save under the designated file name
    filename = f"hockey_teams_{file_suffix}.csv"
    with open(filename, 'w') as f:
        f.write("TeamName,Year,Wins,Losses,OTLosses,Win%,GoalsFor(GF),GoalsAgainst(GA),+/-\n")
        f.writelines(lines)

Task 2 Documentation

Project environment:

jupyter 4.5.0
Python 3.7.4
requests

Target site:

https://jsonplaceholder.typicode.com/posts

requests was installed earlier, so this is not repeated.

1. Analyze the page and the API

We are scraping a pure JSON API with no front-end page rendering, i.e. an AJAX-style asynchronous data endpoint. The userId parameter filters for all posts by a given user, and the endpoint returns an array of objects with id, title, body, and userId fields.

2. The request

2.1. Request type and parameters

We use a GET request to fetch the posts of the user identified by userId.
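Rather than hand-building the query string, requests can encode the parameter itself via its params argument. A minimal sketch (no request is actually sent; a PreparedRequest just exposes the final URL):

```python
import requests

base_url = "https://jsonplaceholder.typicode.com/posts"
# PreparedRequest shows the final URL without any network I/O
prepared = requests.Request("GET", base_url, params={"userId": 1}).prepare()
print(prepared.url)  # https://jsonplaceholder.typicode.com/posts?userId=1
```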

2.2. Configure the request headers

headers = {
    "User-Agent": "<browser identifier>",
    "X-Requested-With": "XMLHttpRequest"  # mark as an AJAX request
}

3. Data processing

3.1. Parsing

Call response.json() directly to convert the JSON string returned by the API into a Python list.
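response.json() is essentially json.loads applied to the response body; with a made-up payload shaped like this endpoint's output:

```python
import json

# Hypothetical payload in the shape the endpoint returns
payload = '[{"userId": 1, "id": 1, "title": "hello", "body": "world"}]'
data = json.loads(payload)

print(type(data).__name__)  # list
print(data[0]["title"])     # hello
```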

3.2. Iterate and extract the fields

For each post we extract:

  • post id: post_id
  • post title: title
  • post body: content

and join them with "," into a CSV row.

4. Save the data

  1. First create the folder: os.makedirs creates a user_posts folder to hold the CSV files
  2. Name each file after the user id, write the header, then write the rows in bulk. Note: use UTF-8 encoding
  3. Use time.sleep(1) to throttle the request rate
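One caveat with comma-joining: a post body can itself contain commas or newlines, which would corrupt the CSV. The standard csv module quotes such fields automatically; a sketch with a made-up row (column names follow the script below):

```python
import csv
import io

# Hypothetical post whose body contains a comma and a newline
posts = [{"id": 1, "title": "a title", "body": "line one,\nline two"}]

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["post_id", "title", "content"])
for post in posts:
    writer.writerow([post["id"], post["title"], post["body"]])

print(buf.getvalue())
```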

The full code for Task 2:

import os
import requests
import time

def scrape(user_id: int):
    # 1. Target API URL
    # Changing the user_id parameter fetches posts for different users
    base_url = "https://jsonplaceholder.typicode.com/posts"
    params = {
        "userId": user_id  # filter posts by user ID
    }

    # 2. Send the request
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
        "X-Requested-With": "XMLHttpRequest"
    }

    response = requests.get(base_url, params=params, headers=headers)

    # 3. Parse the JSON directly
    data = response.json()

    # 4. Extract the data
    lines = []
    for post in data:
        post_id = post.get("id")
        title = post.get("title")
        content = post.get("body")
        # join into one CSV row
        line = f"{post_id},{title},{content}\n"
        lines.append(line)

    # 5. Save the file
    os.makedirs('user_posts', exist_ok=True)
    with open(f'user_posts/{user_id}.csv', 'w', encoding='utf-8') as f:
        f.write("post_id,title,content\n")  # header
        f.writelines(lines)
    print(f"Extracted posts for user {user_id}")

# Loop over several users + sleep to avoid being blocked
for user_id in [1, 2, 3, 4, 5]:
    scrape(user_id)
    time.sleep(1)