
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one

This error came up during model training, and extensive searching across online forums did not turn up a fix. The full error message reads:

RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument find_unused_parameters=True to torch.nn.parallel.DistributedDataParallel, and by making sure all forward function outputs participate in calculating loss. If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's forward function. Please include the loss function and the structure of the return value of forward of your module when reporting this issue (e.g. list, dict, iterable). Parameter indices which did not receive grad for rank 1: 0 1 2 3 4 5. In addition, you can set the environment variable TORCH_DISTRIBUTED_DEBUG to either INFO or DETAIL to print out information about which particular parameters did not receive gradient on this rank.
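For reference, here is a minimal sketch of the two mitigations the message suggests, assuming a single-node job launched with torchrun; the nn.Linear stand-in model and the NCCL backend are illustrative placeholders, not taken from the original post:

    import os
    import torch
    import torch.distributed as dist
    import torch.nn as nn
    from torch.nn.parallel import DistributedDataParallel as DDP

    # Richer logging about which parameters missed their gradients; set this
    # before initializing the process group (or export it from the shell).
    os.environ["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"  # or "INFO"

    dist.init_process_group(backend="nccl")     # torchrun supplies rank/world size
    local_rank = int(os.environ["LOCAL_RANK"])  # torchrun sets LOCAL_RANK
    torch.cuda.set_device(local_rank)

    model = nn.Linear(10, 1).cuda(local_rank)   # stand-in for the real model

    # find_unused_parameters=True makes DDP traverse the autograd graph each
    # iteration and mark parameters that did not contribute to the loss, so
    # the gradient reduction can still finish (at some extra overhead).
    ddp_model = DDP(model, device_ids=[local_rank], find_unused_parameters=True)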

Specifically, a module handling a particular task had been added to the backbone, but it never participated in the loss computation:

    self.ema = EMA(746)  # instantiated in __init__ but never called in forward()

Since this module was never called in the forward function, commenting out the instantiation solved the problem!
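To make the failure mode concrete, here is a hypothetical reconstruction of the pattern; the Backbone class and layer shapes are illustrative, while EMA(746) is the module from the original code:

    import torch
    import torch.nn as nn

    class Backbone(nn.Module):
        def __init__(self):
            super().__init__()
            self.conv = nn.Conv2d(3, 746, kernel_size=3, padding=1)
            # self.ema = EMA(746)  # registered here but never called in
            #                      # forward(), so its parameters never receive
            #                      # gradients and DDP raises the reduction error

        def forward(self, x):
            # The EMA submodule is never invoked, so if it were registered
            # above, its parameters would sit outside the autograd graph.
            return self.conv(x)

The alternative to deleting the line is to actually call the module, e.g. applying self.ema to the feature map inside forward, so its parameters take part in producing the loss; otherwise fall back on find_unused_parameters=True as shown earlier.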
