1 Error Description
1.1 System Environment
Hardware Environment(Ascend/GPU/CPU): GPU
Software Environment:
– MindSpore version (source or binary): 1.6.0
– Python version (e.g., Python 3.7.5): 3.7.6
– OS platform and distribution (e.g., Linux Ubuntu 16.04): Ubuntu 4.15.0-74-generic
– GCC/Compiler version (if compiled from source):
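1.2 Basic Information
1.2.1 Script
The training script builds a single-operator GRU network, which updates values through gated recurrent units. The script is shown below. The imports (import numpy as np, import mindspore.nn as nn, from mindspore import Tensor) are omitted so that the line numbers match the error log: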
01 class Net(nn.Cell):
02     def __init__(self):
03         super(Net, self).__init__()
04         self.gru = nn.GRU(10, 16, 1000, has_bias=True, batch_first=True, bidirectional=False)  # input_size=10, hidden_size=16, num_layers=1000
05
06     def construct(self, x, h0):
07         output = self.gru(x, h0)
08         return output
09
10 net = Net()
11 x = Tensor(np.ones([3, 5, 10]).astype(np.float32))  # (batch, seq_len, input_size)
12 h0 = Tensor(np.ones([1 * 1000, 3, 16]).astype(np.float32))  # (num_directions * num_layers, batch, hidden_size)
13 output, hn = net(x, h0)
14 print('output', output.shape)
1.2.2 Error
The error message is as follows:
Traceback (most recent call last):
File "demo.py", line 13, in <module>
output, hn = net(x, h0)
…
RuntimeError: mindspore/ccsrc/pipeline/jit/static_analysis/evaluator.cc:198 Eval] Exceed function call depth limit 1000, (function call depth: 1001, simulate call depth: 997).
It's always happened with complex construction of code or infinite recursion or loop.
Please check the code if it's has the infinite recursion or call 'context.set_context(max_call_depth=value)' to adjust this value.
If max_call_depth is set larger, the system max stack depth should be set larger too to avoid stack overflow.
For more details, please refer to the FAQ at https://www.mindspore.cn.
The function call stack (See file ' /rank_0/om/analyze_fail.dat' for more details):
# 0 In file demo.py(07)
    output = self.gru(x, h0)
2 Cause Analysis
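In MindSpore 1.6, the script fails when the network is called on line 13, and the function call stack points to line 07 inside construct: output = self.gru(x, h0).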
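Next, look at the error message. The RuntimeError reads "Exceed function call depth limit 1000, (function call depth: 1001, simulate call depth: 997)", which means the function call depth limit was exceeded. Checking this against the official nn.GRU interface shows that the configured num_layers is too large, so the number of nested loops exceeds the threshold. We therefore need to reduce num_layers. Alternatively, as the message suggests, the threshold can be raised with context.set_context(max_call_depth=value), but this is not recommended: with that many nesting levels, execution will be very slow.
Inspecting the code shows that num_layers is 1000 on line 04, and it needs to be changed to a value within a reasonable range. Because the first dimension of h0 is num_directions * num_layers, the 1 * 1000 on line 12 must be changed to match.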
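3 Solution
Based on the cause identified above, the fix is straightforward: reduce num_layers on line 04 to a reasonable value, and update the first dimension of h0 on line 12 to match.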
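The change described above can be sketched as follows. This is only an illustration under assumptions: the original post does not say which value was chosen, so num_layers=2 is an assumed value, and gru_shapes and run_fixed_demo are hypothetical helper names introduced here. MindSpore is imported inside run_fixed_demo so that the shape helper can be checked on its own.

```python
def gru_shapes(num_layers, batch_size, seq_len, hidden_size, bidirectional=False):
    """Return (h0_shape, output_shape) for a GRU created with batch_first=True."""
    num_directions = 2 if bidirectional else 1
    h0_shape = (num_directions * num_layers, batch_size, hidden_size)
    output_shape = (batch_size, seq_len, num_directions * hidden_size)
    return h0_shape, output_shape


def run_fixed_demo(num_layers=2):
    """Run the corrected script. Requires MindSpore and NumPy.

    num_layers=2 is an assumed value; any reasonably small value should do.
    """
    import numpy as np
    import mindspore.nn as nn
    from mindspore import Tensor

    class Net(nn.Cell):
        def __init__(self):
            super(Net, self).__init__()
            # Same arguments as the failing script, except for num_layers.
            self.gru = nn.GRU(10, 16, num_layers, has_bias=True,
                              batch_first=True, bidirectional=False)

        def construct(self, x, h0):
            return self.gru(x, h0)

    # Keep h0 consistent with num_layers instead of hard-coding 1 * 1000.
    h0_shape, _ = gru_shapes(num_layers, batch_size=3, seq_len=5, hidden_size=16)

    net = Net()
    x = Tensor(np.ones([3, 5, 10]).astype(np.float32))
    h0 = Tensor(np.ones(h0_shape).astype(np.float32))
    output, hn = net(x, h0)
    print('output', output.shape)
    return output.shape
```

Calling run_fixed_demo() in an environment with MindSpore installed runs the corrected network. The output shape does not depend on num_layers, so gru_shapes(2, 3, 5, 16) gives (3, 5, 16) for the output, the same as for any other layer count.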
The script now runs successfully and prints:
output (3, 5, 16)
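4 Summary
Steps for locating the error:
1. Find the line of user code that raised the error: output = self.gru(x, h0);
2. Use the keywords in the error log to narrow down the problem: Exceed function call depth limit 1000, (function call depth: 1001, simulate call depth: 997);
3. Pay close attention to the hints in the error message and to whether the initialization is correct.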
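Step 2 can be partly automated. The sketch below is a small standard-library helper, not part of MindSpore, that pulls the limit and the observed depths out of the error text quoted above. The name parse_depth_error and the dictionary keys are hypothetical, and the regular expression assumes the message wording shown in this log.

```python
import re

# Error text copied from the log in section 1.2.2.
ERROR_LINE = ("Exceed function call depth limit 1000, "
              "(function call depth: 1001, simulate call depth: 997).")


def parse_depth_error(message):
    """Extract the depth limit and the reported depths, or return None."""
    pattern = (r"Exceed function call depth limit (\d+), "
               r"\(function call depth: (\d+), simulate call depth: (\d+)\)")
    match = re.search(pattern, message)
    if match is None:
        return None
    limit, call_depth, simulate_depth = (int(g) for g in match.groups())
    return {"limit": limit, "call_depth": call_depth,
            "simulate_depth": simulate_depth}


info = parse_depth_error(ERROR_LINE)
print(info)  # {'limit': 1000, 'call_depth': 1001, 'simulate_depth': 997}
```

A call depth just above the limit, as here, suggests that the structure of the network (in this case num_layers) is what drives the depth, rather than unbounded recursion.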