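A simple crawler: scraping stock data from Eastmoney (20231230)
Target site: https://quote.eastmoney.com/center/gridlist.html#hs_a_board
Requirement: scrape the stock data for the different boards in the Eastmoney quote center.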

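The goal is to scrape the stock data under every tab and save it locally as Excel files.
The target data does not appear in the page source, so the page's network traffic has to be inspected to find out which request returns it. Press F12 to open the browser's developer tools and locate that request there.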

First, look at the URL that returns the first page of the 沪深京A股 board:
# 沪深京A股 (Shanghai, Shenzhen and Beijing A-shares)
url = "https://62.push2.eastmoney.com/api/qt/clist/get?cb=jQuery1124007675389012158473_1703949729655&pn=1&pz=20&po=1&np=1&ut=bd1d9ddb04089700cf9c27f6f7426281&fltt=2&invt=2&wbp2u=|0|0|0|web&fid=f3&fs=m:0+t:6,m:0+t:80,m:1+t:2,m:1+t:23,m:0+t:81+s:2048&fields=f1,f2,f3,f4,f5,f6,f7,f8,f9,f10,f12,f13,f14,f15,f16,f17,f18,f20,f21,f23,f24,f25,f22,f11,f62,f128,f136,f115,f152&_=1703949729656"
The request method is GET.
The Preview pane shows that the data we want is stored in the 'diff' field under 'data', as shown in the figure below.

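Inspection shows that the response is not bare JSON: it is wrapped in a call to the callback named in the URL, 'jQuery1124007675389012158473_1703949729655( ... );', so the wrapper has to be stripped before the text can be converted:
data = response.text
# Match everything from the start up to (but not including) the first '('
left_data = re.search(r'^.*?(?=\()', data).group()
# Replace the matched prefix plus the escaped '(' with an empty string
data = re.sub(left_data + r'\(', '', data)
# Replace the trailing ');' with an empty string
data = re.sub(r'\);', '', data)
# Use eval to convert data into a dict
data = eval(data)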
Note that matching the prefix together with the opening parenthesis, i.e. 'jQuery1124007675389012158473_1703949729655(', and passing that match straight to re.sub, as below, raises an error:
left_data = re.search(r'^.*?\(', data).group()
print(left_data)
data = re.sub(left_data, '', data)  # raises re.error

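The error occurs because the '(' inside the matched text is not escaped, so re.sub treats it as the start of a capture group. Escaping it with a backslash solves the problem. That is why the working code first extracts only the text before the opening parenthesis and then appends an escaped '\(' to build the pattern used for the replacement.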
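The escaping issue can be reproduced and fixed in a few lines. This is a minimal sketch on a made-up, abbreviated response string (the callback name and payload are placeholders, not real API output). It uses re.escape from the standard library as an alternative to appending an escaped parenthesis by hand.

```python
import re

# Placeholder JSONP text; the callback name and payload are made up for illustration.
sample = 'jQuery123_456({"data": {"diff": [{"f12": "000001"}]}});'

# Matching through the '(' and reusing the match as a pattern fails,
# because the unescaped '(' opens a group that is never closed.
prefix = re.search(r'^.*?\(', sample).group()
try:
    re.sub(prefix, '', sample)
    failed = False
except re.error:
    failed = True

# re.escape turns the literal text into a safe pattern, so the same
# prefix can be removed without splitting off the parenthesis first.
body = re.sub(re.escape(prefix), '', sample, count=1)
# Drop the trailing ');' that closes the callback call.
body = re.sub(r'\);$', '', body)

print(failed)
print(body)
```

Whether to escape by hand or with re.escape is a matter of taste; re.escape also covers other special characters that might appear in a callback name.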

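Once the response has been converted to a dict, take the 'diff' part out of 'data'. A record structure for the fields we need can be defined up front; matching the table headers on the page against the field codes in the response gives the following:
diff = data['data']['diff']
for item in diff:
    row = {
        "代码": item['f12'],          # stock code
        "名称": item['f14'],          # name
        "最新价": item['f2'],         # latest price
        "涨跌幅": item['f3'],         # change (%)
        "涨跌额": item['f4'],         # change (amount)
        "成交量(手)": item['f5'],     # volume (lots)
        "成交额": item['f6'],         # turnover
        "振幅(%)": item['f7'],        # amplitude (%)
        "最高": item['f15'],          # high
        "最低": item['f16'],          # low
        "今开": item['f17'],          # open
        "昨收": item['f18'],          # previous close
        "量比": item['f10'],          # volume ratio
        "换手率": item['f8'],         # turnover rate
        "市盈率(动态)": item['f9'],   # P/E (dynamic)
        "市净率": item['f23'],        # P/B
    }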
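Because the same field mapping is needed wherever a record is built, one option is to keep it in a single dictionary and build each row from it. The sketch below assumes the field codes listed above; FIELD_MAP, to_row and the sample record are names and values introduced here for illustration only.

```python
# Column header -> field code, taken from the mapping above.
FIELD_MAP = {
    "代码": "f12",
    "名称": "f14",
    "最新价": "f2",
    "涨跌幅": "f3",
    "涨跌额": "f4",
    "成交量(手)": "f5",
    "成交额": "f6",
    "振幅(%)": "f7",
    "最高": "f15",
    "最低": "f16",
    "今开": "f17",
    "昨收": "f18",
    "量比": "f10",
    "换手率": "f8",
    "市盈率(动态)": "f9",
    "市净率": "f23",
}


def to_row(item):
    """Build one output row from a single entry of the 'diff' list."""
    return {header: item[code] for header, code in FIELD_MAP.items()}


# Made-up record with placeholder values, not real market data.
sample_item = {code: f"value-of-{code}" for code in FIELD_MAP.values()}
row = to_row(sample_item)
print(row["代码"], row["市净率"])
```

Keeping the mapping in one place means a renamed column or an extra field only has to be changed once.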
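Paging through the results and switching to other boards while watching the request URL reveals a pattern (see the figure): 'pn' is the page number, and the 'fid' and 'fs' parameters change with the board. The board-specific part of the URL can therefore be kept in a dictionary: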
# Board name -> the fid value plus the fs filter that follows it in the URL
cmd = {
    "沪深京A股": "f3&fs=m:0+t:6,m:0+t:80,m:1+t:2,m:1+t:23,m:0+t:81+s:2048",
    "上证A股": "f3&fs=m:1+t:2,m:1+t:23",
    "深证A股": "f3&fs=m:0+t:6,m:0+t:80",
    "北证A股": "f3&fs=m:0+t:81+s:2048",
    "新股": "f26&fs=m:0+f:8,m:1+f:8",
    "创业板": "f3&fs=m:0+t:80",
    "科创板": "f3&fs=m:1+t:23",
    "沪股通": "f26&fs=b:BK0707",
    "深股通": "f26&fs=b:BK0804",
    "B股": "f3&fs=m:0+t:7,m:1+t:3",
    "风险警示板": "f3&fs=m:0+f:4,m:1+f:4",
}
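The URL comparison behind this dictionary can also be done in code instead of by eye. The following is a small sketch that uses urllib.parse to list the query parameters that differ between two request URLs. diff_query is a hypothetical helper name, and the URLs are shortened versions of the real ones, kept only long enough to show the idea.

```python
from urllib.parse import urlsplit, parse_qsl


def diff_query(url_a, url_b):
    """Return {param: (value_in_a, value_in_b)} for parameters that differ."""
    query_a = dict(parse_qsl(urlsplit(url_a).query))
    query_b = dict(parse_qsl(urlsplit(url_b).query))
    keys = sorted(set(query_a) | set(query_b))
    return {k: (query_a.get(k), query_b.get(k))
            for k in keys if query_a.get(k) != query_b.get(k)}


# Shortened URLs: only a few of the real query parameters are kept.
base = "https://62.push2.eastmoney.com/api/qt/clist/get"
page1 = base + "?pn=1&pz=20&fid=f3&fs=m:1+t:2,m:1+t:23"
page2 = base + "?pn=2&pz=20&fid=f3&fs=m:1+t:2,m:1+t:23"
other = base + "?pn=1&pz=20&fid=f26&fs=b:BK0707"

# Turning the page changes only pn; switching boards changes fid and fs.
# parse_qsl decodes '+' as a space, so values are printed in decoded form.
print(diff_query(page1, page2))
print(diff_query(page1, other))
```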
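While scraping, we need to know when to stop fetching the current board. As the figure shows, the 沪深京A股 board has 279 pages in total. To see what happens past the last page, set the page parameter (pn) in the URL to 280 and look at the response.


The response shows that 'data' is null once the page number is past the end, so while iterating over each board we can stop as soon as 'data' is null: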
# eval has no name for JSON's null, so bind one before get_html is first called
null = "null"
for i in cmd.keys():
    page = 0
    stocks = []
    while True:
        page += 1
        data = get_html(cmd[i], page)
        if data['data'] != null:
            print(f"Scraping {i}, page {page}")
            diff = data['data']['diff']
            for item in diff:
                row = {
                    "代码": item['f12'],
                    "名称": item['f14'],
                    "最新价": item['f2'],
                    "涨跌幅": item['f3'],
                    "涨跌额": item['f4'],
                    "成交量(手)": item['f5'],
                    "成交额": item['f6'],
                    "振幅(%)": item['f7'],
                    "最高": item['f15'],
                    "最低": item['f16'],
                    "今开": item['f17'],
                    "昨收": item['f18'],
                    "量比": item['f10'],
                    "换手率": item['f8'],
                    "市盈率(动态)": item['f9'],
                    "市净率": item['f23'],
                }
                stocks.append(row)
        else:
            # 'data' is null: the previous page was the last one for this board
            break
    # One Excel file per board
    df = pd.DataFrame(stocks)
    df.to_excel("股票_" + i + ".xlsx", index=False)
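The loop above depends on eval and on binding the name null by hand. As a hedged alternative, the sketch below parses the wrapped response with json.loads, which maps JSON null to Python's None, and stops on None instead. parse_jsonp, iter_items, the max_pages cap and the fake responses are all assumptions made for illustration; they are not part of the original script, and the fake text only imitates the response shape described in this article.

```python
import json


def parse_jsonp(text):
    """Strip a callback wrapper such as name(...); and parse the JSON inside."""
    start = text.index('(') + 1   # just after the first '('
    end = text.rindex(')')        # the last ')', which closes the callback call
    return json.loads(text[start:end])


def iter_items(fetch_text, max_pages=1000):
    """Yield entries of data['diff'] page by page until 'data' is null.

    fetch_text(page) must return the raw response text for that page.
    max_pages is a safety cap so an unexpected response cannot loop forever.
    """
    for page in range(1, max_pages + 1):
        data = parse_jsonp(fetch_text(page))
        if data['data'] is None:
            break
        yield from data['data']['diff']


# Fake two-page source standing in for the network request (placeholder values).
def fake_fetch(page):
    pages = {
        1: 'cb({"data": {"diff": [{"f12": "A"}, {"f12": "B"}]}});',
        2: 'cb({"data": {"diff": [{"f12": "C"}]}});',
    }
    return pages.get(page, 'cb({"data": null});')


codes = [item["f12"] for item in iter_items(fake_fetch)]
print(codes)
```

Passing the fetch function in as an argument keeps the stop condition testable without touching the network.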
The result of running the script is shown below:


Complete source code:
import requests
import re
import pandas as pd

header = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Cookie": "qgqp_b_id=18c28b304dff3b8ce113d0cca03e6727; websitepoptg_api_time=1703860143525; st_si=92728505415389; st_asi=delete; HAList=ty-100-HSI-%u6052%u751F%u6307%u6570; st_pvi=46517537371152; st_sp=2023-10-29%2017%3A00%3A19; st_inirUrl=https%3A%2F%2Fcn.bing.com%2F; st_sn=8; st_psi=20231229230312485-113200301321-2076002087"
}


def get_html(cmd, page):
    # cmd is the board-specific "fid value + fs filter" string; page is the page number (pn)
    url = f"https://7.push2.eastmoney.com/api/qt/clist/get?cb=jQuery112409467675731682619_1703939377395&pn={page}&pz=20&po=1&np=1&ut=bd1d9ddb04089700cf9c27f6f7426281&fltt=2&invt=2&wbp2u=|0|0|0|web&fid={cmd}&fields=f1,f2,f3,f4,f5,f6,f7,f8,f9,f10,f12,f13,f14,f15,f16,f17,f18,f20,f21,f23,f24,f25,f22,f11,f62,f128,f136,f115,f152&_=1703939377396"
    response = requests.get(url, headers=header)
    data = response.text
    # Strip the callback wrapper: the prefix before '(' plus the '(' itself
    left_data = re.search(r'^.*?(?=\()', data).group()
    data = re.sub(left_data + r'\(', '', data)
    # Strip the trailing ');'
    data = re.sub(r'\);', '', data)
    # eval resolves the bare word null through the module-level name defined below
    data = eval(data)
    return data


# Board name -> the fid value plus the fs filter that follows it in the URL
cmd = {
    "沪深京A股": "f3&fs=m:0+t:6,m:0+t:80,m:1+t:2,m:1+t:23,m:0+t:81+s:2048",
    "上证A股": "f3&fs=m:1+t:2,m:1+t:23",
    "深证A股": "f3&fs=m:0+t:6,m:0+t:80",
    "北证A股": "f3&fs=m:0+t:81+s:2048",
    "新股": "f26&fs=m:0+f:8,m:1+f:8",
    "创业板": "f3&fs=m:0+t:80",
    "科创板": "f3&fs=m:1+t:23",
    "沪股通": "f26&fs=b:BK0707",
    "深股通": "f26&fs=b:BK0804",
    "B股": "f3&fs=m:0+t:7,m:1+t:3",
    "风险警示板": "f3&fs=m:0+f:4,m:1+f:4",
}

# eval has no name for JSON's null, so bind one before get_html is first called
null = "null"
for i in cmd.keys():
    page = 0
    stocks = []
    while True:
        page += 1
        data = get_html(cmd[i], page)
        if data['data'] != null:
            print(f"Scraping {i}, page {page}")
            diff = data['data']['diff']
            for item in diff:
                row = {
                    "代码": item['f12'],
                    "名称": item['f14'],
                    "最新价": item['f2'],
                    "涨跌幅": item['f3'],
                    "涨跌额": item['f4'],
                    "成交量(手)": item['f5'],
                    "成交额": item['f6'],
                    "振幅(%)": item['f7'],
                    "最高": item['f15'],
                    "最低": item['f16'],
                    "今开": item['f17'],
                    "昨收": item['f18'],
                    "量比": item['f10'],
                    "换手率": item['f8'],
                    "市盈率(动态)": item['f9'],
                    "市净率": item['f23'],
                }
                stocks.append(row)
        else:
            # 'data' is null: the previous page was the last one for this board
            break
    # One Excel file per board; writing .xlsx needs an Excel engine such as openpyxl
    df = pd.DataFrame(stocks)
    df.to_excel("股票_" + i + ".xlsx", index=False)
