Task 1 Documentation
Project environment:
jupyter 4.5.0
Python 3.7.4
bs4
requests
Target site:
https://www.scrapethissite.com/pages/forms/
1. Installing dependencies
Install the requests and bs4 libraries with pip:
pip install requests beautifulsoup4
Here,
- requests is used to send HTTP requests and fetch the page source.
- BeautifulSoup is used to parse the HTML and extract the data.
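If you want to confirm the installation worked, a quick import check is enough (a minimal sketch, purely for verification):
import requests
import bs4

print(requests.__version__)
print(bs4.__version__)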
2. Defining the site's base URL
Looking at the address of the target page, the base URL is:
url = "https://www.scrapethissite.com/pages/forms/"
3. Sending the request and fetching the page source
We fetch the page with requests, then use Python's built-in html.parser (for convenience) to parse the HTML source:
response = requests.get(url)
html_content = response.text
soup = BeautifulSoup(html_content, "html.parser")
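As an optional safeguard (not part of the original steps), you can check that the request actually succeeded before parsing:
response.raise_for_status()  # raises requests.HTTPError if the status code is 4xx/5xx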
4. Locating every team's data node
Inspecting the page source, we can see where the team data lives: each tr tag with class "team" holds the complete data for one team.
teams = soup.find_all("tr",class_="team")
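As a quick sanity check (purely illustrative), you can print how many rows the selector matched:
print(len(teams))  # number of <tr class="team"> rows found on this page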
5. Iterating and extracting the team fields
Again by inspecting the page source, we find the tag that holds each field we need. For example, the team name:
team_name = team.find("td",class_="name").text.strip()
Here,
.text extracts the text inside the tag, and .strip() removes the whitespace and newlines around it.
Note that some fields can sit under one of two different classes. For example, the win-percentage cell carries either text-success or text-danger. If we ignore this and look up only one of the two classes, the lookup can return nothing and extracting from it raises an error.
So we use a conditional to take whichever match is actually present:
win_td = team.select_one("td.pct.text-success, td.pct.text-danger")
win_pct = win_td.text.strip() if win_td else ""
6. Storing the data and printing progress
To produce CSV output, we separate the fields with commas.
Each row is collected in the lines list.
Print progress:
line = f"{TeamName},{Year},{Wins},{Losses},{OTLosses},{win_pct},{GoalsFor},{GoalsAgainst},{diff}\n"
print(f"已提取队伍:{TeamName}")
lines.append(line)
Save the data:
with open('hockey_teams_page_1.csv', 'w') as f:
    f.write("TeamName,Year,Wins,Losses,OTLosses,Win%,GoalsFor(GF),GoalsAgainst(GA),+/-\n")
    f.writelines(lines)
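Joining fields with commas by hand works here because none of the values contain commas. If they might, Python's built-in csv module handles the quoting automatically; a minimal alternative sketch (assuming the rows are collected as lists of field values inside the loop instead of pre-joined strings):
import csv

header = ["TeamName", "Year", "Wins", "Losses", "OTLosses", "Win%",
          "GoalsFor(GF)", "GoalsAgainst(GA)", "+/-"]
rows = []  # inside the loop: rows.append([TeamName, Year, Wins, Losses, OTLosses, win_pct, GoalsFor, GoalsAgainst, diff])

with open('hockey_teams_page_1.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(header)
    writer.writerows(rows)  # csv.writer quotes any field containing a comma or newline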
7. Scraping across pages
7.1 Pagination URL
Looking at the address bar, the URL for a given page is:
the base URL followed by ?page=<page number>
So the URL for each page is built as:
page_url = f"{url}?page={page_num}"
Of course we also need to loop over the page numbers, as sketched below.
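The loop itself can be as simple as the following sketch (pages 2 to 5, matching the full code at the end):
for page_num in range(2, 6):  # pages 2, 3, 4, 5
    page_url = f"{url}?page={page_num}"
    response = requests.get(page_url)
    # ...then parse and extract exactly as for page 1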
7.2 Iterating over the teams
This is identical to the page-1 code, so it is not repeated here.
7.3 Naming the output files automatically
filename = f"hockey_teams_page_{page_num}.csv"
Of course, you could also enumerate the file names to achieve a similar effect.
8. Scraping by search
8.1 Defining the list of teams to search for
We build a search list in which each entry contains:
- the team's display name (used for the search query)
- a file-name suffix (used to tell the output files apart)
teams_to_search = [ ("Boston Bruins", "boston_bruins"), ("Buffalo Sabres", "buffalo_sabres") ]
8.2 Building the search URL
Inspecting the page gives the search parameter format:
the base URL followed by ?q=<typed text> (the team name)
We also need to replace the spaces in the team name with + (as the URL parameter format requires), and append the search parameter:
search_url = f"{url}?q={team_name.replace(' ', '+')}"
As an aside, the same trick works for search keywords when you use a search engine. :D
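As another option (not the approach used in the code below), requests can build and URL-encode the query string for you via the params argument, so the spaces do not need to be replaced by hand:
response = requests.get(url, params={"q": team_name})  # requests encodes the space in the query string automatically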
8.3 Iterating
We only need to change the requested page to our search_url, and adjust the printed messages and the name of the saved CSV file.
Full code for Task 1:
import requests
from bs4 import BeautifulSoup

url = "https://www.scrapethissite.com/pages/forms/"

# Original page-1 scraping code (kept as-is)
response = requests.get(url)
html_content = response.text
soup = BeautifulSoup(html_content, "html.parser")
teams = soup.find_all("tr", class_="team")
lines = []
for team in teams:
    TeamName = team.find("td", class_="name").text.strip()
    Year = team.find("td", class_="year").text.strip()
    Wins = team.find("td", class_="wins").text.strip()
    Losses = team.find("td", class_="losses").text.strip()
    OTLosses = team.find("td", class_="ot-losses").text.strip()
    win_td = team.select_one("td.pct.text-success, td.pct.text-danger")
    win_pct = win_td.text.strip() if win_td else ""
    GoalsFor = team.find("td", class_="gf").text.strip()
    GoalsAgainst = team.find("td", class_="ga").text.strip()
    diff_td = team.select_one("td.diff.text-success, td.diff.text-danger")
    diff = diff_td.text.strip() if diff_td else ""
    line = f"{TeamName},{Year},{Wins},{Losses},{OTLosses},{win_pct},{GoalsFor},{GoalsAgainst},{diff}\n"
    print(f"Extracted team: {TeamName}")
    lines.append(line)
print(f"Extraction finished: {len(teams)} teams in total")
with open('hockey_teams_page_1.csv', 'w') as f:
    f.write("TeamName,Year,Wins,Losses,OTLosses,Win%,GoalsFor(GF),GoalsAgainst(GA),+/-\n")
    f.writelines(lines)

# Added: scrape pages 2-5
for page_num in range(2, 6):
    # Build the URL for the current page
    page_url = f"{url}?page={page_num}"
    response = requests.get(page_url)
    html_content = response.text
    soup = BeautifulSoup(html_content, "html.parser")
    teams = soup.find_all("tr", class_="team")
    lines = []
    for team in teams:
        TeamName = team.find("td", class_="name").text.strip()
        Year = team.find("td", class_="year").text.strip()
        Wins = team.find("td", class_="wins").text.strip()
        Losses = team.find("td", class_="losses").text.strip()
        OTLosses = team.find("td", class_="ot-losses").text.strip()
        win_td = team.select_one("td.pct.text-success, td.pct.text-danger")
        win_pct = win_td.text.strip() if win_td else ""
        GoalsFor = team.find("td", class_="gf").text.strip()
        GoalsAgainst = team.find("td", class_="ga").text.strip()
        diff_td = team.select_one("td.diff.text-success, td.diff.text-danger")
        diff = diff_td.text.strip() if diff_td else ""
        line = f"{TeamName},{Year},{Wins},{Losses},{OTLosses},{win_pct},{GoalsFor},{GoalsAgainst},{diff}\n"
        print(f"Extracted team on page {page_num}: {TeamName}")
        lines.append(line)
    print(f"Page {page_num} finished: {len(teams)} teams in total")
    # Save to a CSV file named after the page number
    filename = f"hockey_teams_page_{page_num}.csv"
    with open(filename, 'w') as f:
        f.write("TeamName,Year,Wins,Losses,OTLosses,Win%,GoalsFor(GF),GoalsAgainst(GA),+/-\n")
        f.writelines(lines)

# Added: scrape by search
teams_to_search = [
    ("Boston Bruins", "boston_bruins"),
    ("Buffalo Sabres", "buffalo_sabres")
]
for team_name, file_suffix in teams_to_search:
    # Build the search URL (spaces replaced with +, matching the URL parameter format)
    search_url = f"{url}?q={team_name.replace(' ', '+')}"
    response = requests.get(search_url)
    html_content = response.text
    soup = BeautifulSoup(html_content, "html.parser")
    teams = soup.find_all("tr", class_="team")
    lines = []
    for team in teams:
        TeamName = team.find("td", class_="name").text.strip()
        Year = team.find("td", class_="year").text.strip()
        Wins = team.find("td", class_="wins").text.strip()
        Losses = team.find("td", class_="losses").text.strip()
        OTLosses = team.find("td", class_="ot-losses").text.strip()
        win_td = team.select_one("td.pct.text-success, td.pct.text-danger")
        win_pct = win_td.text.strip() if win_td else ""
        GoalsFor = team.find("td", class_="gf").text.strip()
        GoalsAgainst = team.find("td", class_="ga").text.strip()
        diff_td = team.select_one("td.diff.text-success, td.diff.text-danger")
        diff = diff_td.text.strip() if diff_td else ""
        line = f"{TeamName},{Year},{Wins},{Losses},{OTLosses},{win_pct},{GoalsFor},{GoalsAgainst},{diff}\n"
        print(f"Extracted team for {team_name}: {TeamName}")
        lines.append(line)
    print(f"Search for {team_name} finished: {len(teams)} teams in total")
    # Save under the chosen file name
    filename = f"hockey_teams_{file_suffix}.csv"
    with open(filename, 'w') as f:
        f.write("TeamName,Year,Wins,Losses,OTLosses,Win%,GoalsFor(GF),GoalsAgainst(GA),+/-\n")
        f.writelines(lines)
Task 2 Documentation
Project environment:
jupyter 4.5.0
Python 3.7.4
requests
Target site:
https://jsonplaceholder.typicode.com/posts
requests is already installed, so installation is not repeated here.
1. Analyzing the page and the API
The target is a pure JSON API with no front-end page rendering; it is an AJAX-style asynchronous data endpoint.
Filtering with the userId parameter returns all posts of the given user as an array of objects with the fields id, title, body and userId; an example record is shown below.
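For reference, each element of the returned array has roughly this shape (field values here are only illustrative):
{
    "userId": 1,
    "id": 1,
    "title": "example title",
    "body": "example body text"
}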
2. Making the request
2.1 Request type and parameters
We send a GET request to fetch the posts of the given userId, as sketched below.
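In code, this is a plain requests.get with the userId query parameter (a minimal sketch; the full code below adds headers on top of this):
params = {"userId": 1}  # e.g. fetch the posts of user 1
response = requests.get("https://jsonplaceholder.typicode.com/posts", params=params)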
2.2 Configuring the request headers
headers = {
    "User-Agent": "a browser UA string",
    "X-Requested-With": "XMLHttpRequest"  # mark the request as AJAX
}
3. Processing the data
3.1 Parsing
response.json() converts the JSON string returned by the API directly into a Python list.
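If the endpoint returned something other than JSON (for example an error page), response.json() would raise an exception, so an optional status check first does not hurt (not part of the original code):
response.raise_for_status()  # fail fast on HTTP errors
data = response.json()       # raises ValueError if the body is not valid JSON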
3.2 Iterating and extracting the fields
For each post we extract:
- post id: post_id
- post title: title
- post body: content
and join them with "," into one CSV line.
4. Saving the data
- First create the output folder: os.makedirs creates a user_posts folder to hold the CSV files.
- Name each file after the user id, write the header row, then write all the data rows. Note: use UTF-8 encoding.
- Use time.sleep(1) to throttle the scraping rate.
Full code for Task 2:
import os
import requests
import time

def scrape(user_id: int):
    # 1. Target API URL
    # Changing the user_id parameter fetches posts for different users
    base_url = "https://jsonplaceholder.typicode.com/posts"
    params = {
        "userId": user_id  # filter posts by user ID
    }
    # 2. Send the request
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
        "X-Requested-With": "XMLHttpRequest"
    }
    response = requests.get(base_url, params=params, headers=headers)
    # 3. Parse the JSON response directly
    data = response.json()
    # 4. Extract the fields
    lines = []
    for post in data:
        post_id = post.get("id")
        title = post.get("title")
        # the body field contains newline characters, so flatten them to keep one post per CSV line
        content = post.get("body", "").replace("\n", " ")
        # Build the CSV row
        line = f"{post_id},{title},{content}\n"
        lines.append(line)
    # 5. Save the file
    os.makedirs('user_posts', exist_ok=True)
    with open(f'user_posts/{user_id}.csv', 'w', encoding='utf-8') as f:
        f.write("post_id,title,content\n")  # header row
        f.writelines(lines)
    print(f"Extracted posts for user {user_id}")

# Loop over several users, sleeping between requests to avoid being blocked
for user_id in [1, 2, 3, 4, 5]:
    scrape(user_id)
    time.sleep(1)
