[MindSpore] Training Faster RCNN with MindSpore fails with: cause:TBEException:ERROR:
Problem:
[Functional module]
System: EulerOS x86_64
mindspore-ascend 1.1.1.20210201
Package Version
---------------- -------------------
addict 2.4.0
asttokens 2.0.4
astunparse 1.6.3
attrs 20.3.0
certifi 2020.12.5
cffi 1.14.5
chardet 4.0.0
cycler 0.10.0
Cython 0.29.22
decorator 4.4.2
easydict 1.9
hccl 0.1.0
idna 2.10
kiwisolver 1.3.1
matplotlib 3.3.4
mindspore-ascend 1.1.1.20210201
mmcv 0.2.14
mpmath 1.2.1
numpy 1.17.5
opencv-python 4.5.1.48
packaging 20.9
Pillow 8.1.2
pip 21.0.1
protobuf 3.15.5
psutil 5.8.0
pycocotools 2.0.2
pycparser 2.20
pyparsing 2.4.7
python-dateutil 2.8.1
PyYAML 5.4.1
requests 2.25.1
scipy 1.6.1
setuptools 52.0.0.post20210125
six 1.15.0
sympy 1.7.1
te 0.4.0
topi 0.4.0
urllib3 1.26.3
wheel 0.36.2
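[Steps and symptoms]
1. Obtain the Faster RCNN source code from the MindSpore GitHub repository
2. In src/config.py, change the dataset path (no other source code was modified)
3. Convert the pretrained ResNet model with the src/convert_checkpoint.py script
4. Run the training script in the scripts folder: invoke the shell script run_standalone_train_ascend.sh with the required path
5. A train folder is generated under scripts; the log in the train folder reports the following error:
[Screenshot information]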
[WARNING] ME(32429:140219627501376,MainProcess):2021-03-16-17:01:07.330.962 [mindspore/ops/operations/array_ops.py:2302] WARN_DEPRECATED: The usage of Pack is deprecated. Please use Stack.
[WARNING] ME(32429:140219627501376,MainProcess) [mindspore/train/serialization.py:386] Some parameters ("params") were not loaded correctly.
[WARNING] ME(...) [mindspore/ops/operations/array_ops.py:2302] WARN_DEPRECATED: The usage of Pack is deprecated. Please use Stack.
[WARNING] ME(33607:140071613749056,MainProcess):2021-03-16-17:03:15.350.267 [mindspore/ops/operations/array_ops.py:2302] WARN_DEPRECATED: The usage of Pack is deprecated. Please use Stack.
[WARNING] DEVICE(32429,python): [mindspore/ccsrc/runtime/device/ascend/kernel_select_ascend.cc:482] SelectKernelInfo]
[WARNING] DEVICE(32429,python):2021-03-16-17:03:23.200.358 [mindspore/ccsrc/runtime/device/ascend/kernel_select_ascend.cc:282] TagRaiseReduce] node:[TopK]reduce precision from int64 to int32
[WARNING] DEVICE(32429,python):2021-03-16-17:03:23.200.416 [mindspore/ccsrc/runtime/device/ascend/kernel_select_ascend.cc:282] TagRaiseReduce] node:[TopK]reduce precision from int64 to int32
[WARNING] DEVICE(32429,python):2021-03-16-17:03:23.342.200 [mindspore/ccsrc/runtime/device/ascend/kernel_select_ascend.cc:482] SelectKernelInfo] The node [kernel_graph_1:[CNode]0{[0]: ValueNode
[WARNING] DEVICE(32429,python):2021-03-16-17:03:23.342.344 [mindspore/ccsrc/runtime/device/ascend/kernel_select_ascend.cc:282] TagRaiseReduce] node:[TopK]reduce precision from int64 to int32
[WARNING] DEVICE(?,python): [mindspore.cc...] TagRaiseReduce] node:[TopK]reduce precision from int64 to int32
[WARNING] SESSION(32429,python):2021-03-16-17:03:28.628.333 [mindspore/ccsrc/backend/session/ascend_session.cc:1412] 246 node(s) used reduced precision to select the kernel.
WARNING: 'ControlDepend' is deprecated from version 1.1 and will be removed in a future version; use 'Depend' instead.
[ERROR] KERNEL(32429,python):2021-03-16-17:08:05.756.789 [mindspore/ccsrc/backend/kernel_compiler/tbe/tbe_kernel_parallel_build.cc:88] TbeOpParallelBuild] task compile Failed, task id:984, cause:TBEException:ERROR:
Traceback (most recent call last):
File "/disk1/miniconda3/envs/gwc/lib/python3.7/site-packages/mindspore/_extends/parallel_compile/tbe_compiler/compiler.py", line 113, in build_op
op_func(*inputs_args, *outputs_args, *attrs_args, kernel_name=kernel_name)
File "/root/miniconda3/envs/gwc/lib/python3.7/site-packages/te/utils/para_check.py", line 529, in _in_wrapper
return func(*args, **kwargs)
File "/usr/local/Ascend/opp/op_impl/built-in/ai_core/tbe/impl/select.py", line 305, in select
"x1", error_detail)
File "/root/miniconda3/envs/gwc/lib/python3.7/site-packages/te/utils/error_manager/error_manager_vector.py", in raise_err_two_input_shape_invalid
raise RuntimeError(args_dict, msg)
RuntimeError: a tuple holding the error details and the failure message. The errCode field is 'E80029', with the parameters "operation name": "Select_...", "param_name1": "condition", "param_name2": "x1", "error_detail": "Shape of tensor condition and x1 must be equal!". In op[...], the shapes are invalid: "Shape of tensor condition and x1 must be equal!".
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/disk1/miniconda3/envs/gwc/lib/python3.7/site-packages/mindspore/_extends/parallel_compile/tbe_compiler/compiler.py", line 154, in <module>
result = compile_with_json(in_args)
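File "/disk1/miniconda3/envs/gwc/lib/python3.7/site-packages/mindspore/_extends/parallel_compile/tbe_compiler/compiler.py", line 149, in compile_with_json
ret = build_op(op_build, json_str)
File "/disk1/miniconda3/envs/gwc/lib/python3.7/site-packages/mindspore/_extends/parallel_compile/tbe_compiler/compiler.py", line 116, in build_op
raise RuntimeError(e)
In op[Select_16037015126926824515_0], the shapes of input tensors condition and x1 do not match, [Shape of tensor condition and x1 must be equal!]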
input_args: {
  "full_name": "The network uses a ResNet50 model combined with multi-level feature extraction",
  "gen_model": "single",
  "impl_path": "",
  "op_info": {
    "attrs": null,
    "inputs": [
      [
        {
          "dtype": "integer",
          "format": "NC1HWC0",
          "name": "condition_0",
          "ori_format": "NCHW",
          "ori_shape": [1280, 256, 7, 7],
          "param_type": "required",
          "range": [[1280, 1280], [256, 256], [7, 7], [7, 7]],
          "shape": [1280, 8, 7, 7, 32],
          "valid": true
        }
      ],
      [
        {
          "dtype": "float32",
          "format": "NC1HWC0",
          "name": "x1_0",
          "ori_format": "NCHW",
          "ori_shape": [1280, 256, 7, 7],
          "param_type": "required",
          "range": [[1280, 1280], [256, 256], [7, 7], [7, 7]]
        }
      ]
    ],
    "is_dynamic_shape": false,
    "kernel_name": "Select_16...",
    "name": "select",
    "outputs": [
      [
        {
          "dtype": "float32",
          "format": "NC1HWC0",
          "name": "y",
          "ori_format": "NCHW",
          "ori_shape": [1280, 256, ...],
          "param_type": "required",
          "range": ...,
          "shape": ...,
          "valid": true
        }
      ]
    ],
    "socVersion": "Ascend91..."
  },
  "platform": "TBE"
}
In file /disk2/gwc/faster_rcnn0315/scripts/train/src/FasterRcnn/roi_align.py(177)/ res = self.select(mask, roi_feats_t, res)/
In file /disk2/gwc/faster_rcnn0315/scripts/train/src/FasterRcnn/faster_rcnn_r50.py(276)/ roi_feats = self.roi_align(rois,/
In file /disk2/gwc/faster_rcnn0315/scripts/train/src/network_define.py(181)/ the loss values are assigned from self.backbone(x, img_shape, gt_bboxe, gt_label, gt_num)/
In file /root/miniconda3/envs/gwc/lib/python3.7/site-packages/mindspore/train/dataset_helper.py(87)/ return self.network(...)/
WARNING: 'ControlDepend' is deprecated as of version 1.1 and will no longer be supported from version 1.2; use 'Depend' instead.
Start create dataset!
CHECKING MINDRECORD FILES ...
CHECKING MINDRECORD FILES DONE!
Create dataset done!
Traceback (most recent call last):
File "train.py", line 178, in <module>
model.train(config.epoch_size, dataset, callbacks=cb)
File "/root/miniconda3/envs/gwc/lib/python3.7/site-packages/mindspore/train/model.py", line 592, in train
sink_size=sink_size)
File "/root/miniconda3/envs/gwc/lib/python3.7/site-packages/mindspore/train/model.py", in _train
self._train_dataset_sink_process(epoch, train_dataset, list_callback, cb_params, sink_size)
File "/root/miniconda3/envs/gwc/lib/python3.7/site-packages/mindspore/train/model.py", line 452, in _train_dataset_sink_process
outputs = self._train_network(*inputs)
File "/root/miniconda3/envs/gwc/lib/python3.7/site-packages/mindspore/nn/cell.py", line 322, in __call__
out = self.compile_and_run(*inputs)
File "/root/miniconda3/envs/gwc/lib/python3.7/site-packages/mindspore/nn/cell.py", line 578, in compile_and_run
self.compile(*inputs)
File "/root/miniconda3/envs/gwc/lib/python3.7/site-packages/mindspore/nn/cell.py", line 565, in compile
_executor.compile(self, *inputs, phase=self.phase, auto_parallel_mode=self._auto_parallel_mode)
File "...", line 505, in compile
result = self._executor.compile(obj, args_list, phase, use_vm)
RuntimeError: mindspore/ccsrc/backend/kernel_compiler/tbe/tbe_kernel_parallel_build.cc:88 TbeOpParallelBuild] task compile Failed, task id:984, cause:TBEException:ERROR:
Traceback (most recent call last):
File "/disk1/miniconda3/envs/gwc/lib/python3.7/site-packages/mindspore/_extends/parallel_compile/tbe_compiler/compiler.py", line 113, in build_op
op_func(*inputs_args, *outputs_args, *attrs_args, kernel_name=kernel_name)
File "/root/miniconda3/envs/gwc/lib/python3.7/site-packages/te/utils/para_check.py", line 529, in _in_wrapper
return func(*args, **kwargs)
File "/usr/local/Ascend/opp/op_impl/built-in/ai_core/tbe/impl/select.py", line 305, in select
"x1", error_detail)
File "/root/miniconda3/envs/gwc/lib/python3.7/site-packages/te/utils/error_manager/error_manager_vector.py", in raise_err_two_input_shape_invalid
raise RuntimeError(args_dict, msg)
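The error information is as follows:
{
'errCode': 'E80029',
'op_id': 16037015126926824515,
'param_names': ['condition', 'x1'],
'error_description': 'Shapes do not match'
}
In the operation with op_id 16037015126926824515,
the shapes of the input tensors condition and x1 must be equal.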
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
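File "/disk1/miniconda3/envs/gwc/lib/python3.7/site-packages/mindspore/_extends/parallel_compile/tbe_compiler/compiler.py", line 154, in <module>
result = compile_with_json(in_args)
File "/disk1/miniconda3/envs/gwc/lib/python3.7/site-packages/mindspore/_extends/parallel_compile/tbe_compiler/compiler.py", line 149, in compile_with_json
ret = build_op(op_build, json_str)
File "/disk1/miniconda3/envs/gwc/lib/python3.7/site-packages/mindspore/_extends/parallel_compile/tbe_compiler/compiler.py", line 116, in build_op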
raise RuntimeError(e)
RuntimeError: ({'error_code': 'E80029', 'operation_name': 'Select_16037015126926824515_0', 'parameter_names': ['condition', 'x1'], 'error_description': 'Shapes of tensors condition and x1 must be equal!'}, Within operation Select_16037015126926824515_0, the shapes of tensors condition and x1 are mismatched.)
input_args: given in JSON format. 'full_name' holds the full name; 'gen_model' is 'single', the model type; 'impl_path' is empty, meaning no implementation path is specified; 'op_info' holds the operation information.
In file /disk2/gwc/faster_rcnn0315/scripts/train/src/FasterRcnn/roi_align.py(177)/ res = self.select(mask, roi_feats_t, res)/
In file /disk2/gwc/faster_rcnn0315/scripts/train/src/FasterRcnn/faster_rcnn_r50.py(276)/ roi_feats = self.roi_align(rois,/
In file /disk2/gwc/faster_rcnn0315/scripts/train/src/network_define.py(181)/ loss_1 through loss_6 are assigned from the output of self.backbone/
In file /root/miniconda3/envs/gwc/lib/python3.7/site-packages/mindspore/train/dataset_helper.py(87)/ return self.network(...)/
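A note on the shapes in the log: both condition_0 and x1_0 are logged with the same ori_shape [1280,256,7,7], yet the Select kernel reports that their shapes differ. The log gives the 5-D NC1HWC0 shape of condition_0 as [1280,8,7,7,32]; the 5-D shape of x1_0 is not shown. The sketch below illustrates how one NCHW shape can map to different NC1HWC0 shapes when the C0 block size differs. It is only an illustration: nchw_to_nc1hwc0 is a hypothetical helper, and the C0 values (32 for 8-bit types, 16 for float32) are assumptions about the Ascend layout that this post does not state.

```python
def nchw_to_nc1hwc0(shape, c0):
    """Return the 5-D NC1HWC0 shape for a 4-D NCHW shape and a C0 block size."""
    n, c, h, w = shape
    c1 = (c + c0 - 1) // c0  # channels are split into C1 blocks of C0 each (ceil division)
    return [n, c1, h, w, c0]


# ori_shape logged for both condition_0 and x1_0
ori_shape = [1280, 256, 7, 7]

# Assumption (not stated in the post): C0 = 32 for 8-bit types, C0 = 16 for float32.
condition_5d = nchw_to_nc1hwc0(ori_shape, c0=32)
x1_5d = nchw_to_nc1hwc0(ori_shape, c0=16)

print(condition_5d)           # [1280, 8, 7, 7, 32], the shape logged for condition_0
print(x1_5d)                  # [1280, 16, 7, 7, 16]
print(condition_5d == x1_5d)  # False
```

If these assumptions hold, two tensors with identical NCHW shapes would reach the kernel with different 5-D shapes, which is one possible reading of the E80029 message. The post itself does not confirm that this is the cause.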
Solution:
Try reinstalling te and topi,
if you are using the Atlas packages.
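After a reinstall it can help to confirm which versions actually ended up in the environment. The following is a minimal sketch that only reads installed package metadata. The package names are taken from the pip list in this post; installed_version and report are hypothetical helper names, and the sketch does not check whether the versions are compatible with each other.

```python
def installed_version(name):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        from importlib.metadata import version, PackageNotFoundError  # Python 3.8+
    except ImportError:
        # Fallback for Python 3.7 (the version used in this post); needs setuptools.
        import pkg_resources
        try:
            return pkg_resources.get_distribution(name).version
        except pkg_resources.DistributionNotFound:
            return None
    try:
        return version(name)
    except PackageNotFoundError:
        return None


# Package names as they appear in the pip list above.
PACKAGES = ["mindspore-ascend", "te", "topi", "hccl"]


def report(packages=PACKAGES):
    """Map each package name to its installed version (None if not installed)."""
    return {name: installed_version(name) for name in packages}


if __name__ == "__main__":
    for name, ver in report().items():
        print(f"{name}: {ver or 'not installed'}")
```

Running it before and after the reinstall shows whether te and topi changed; in the environment listed above both were at 0.4.0.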
