For anyone still developing non-standard deep learning components in TensorFlow, be warned that some of the ops under tf.raw_ops do not check tensor shapes or memory bounds. When the inputs do not all have the right shape, the code can run without any error and still produce nonsensical updates.
The documentation is sparse, but it seems the separation of concerns happens at this interface: the C++ implementations appear to assume that everything passed in is already sized correctly. The trouble is that an out-of-bounds access in GPU code is less likely to end in a segmentation fault than the same mistake in CPU code, so the only way to detect the problem is to write tests and monitor training metrics.
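As a rough illustration of the kind of guard I mean, here is a minimal sketch of a shape check you could run just before calling a raw op. The helper name check_same_shape and the FakeTensor stand-in are made up for this example and are not part of TensorFlow. It assumes the op needs all of its inputs to share one shape, which will not be true for every op, and it assumes each input exposes a fully defined .shape (as eager tensors and NumPy arrays do).

```python
from typing import Any, Mapping, Tuple


def check_same_shape(tensors: Mapping[str, Any]) -> Tuple[int, ...]:
    """Raise ValueError unless every named input has the same shape.

    Hypothetical helper: it only relies on each value having a `.shape`
    attribute that can be converted to a tuple.
    """
    if not tensors:
        raise ValueError("no tensors given")
    shapes = {name: tuple(t.shape) for name, t in tensors.items()}
    if len(set(shapes.values())) != 1:
        # Include every name and shape so the failing input is easy to spot.
        raise ValueError(f"shape mismatch before raw op call: {shapes}")
    return next(iter(shapes.values()))


class FakeTensor:
    """Stand-in with only a `.shape`, so the sketch runs without TensorFlow."""

    def __init__(self, shape: Tuple[int, ...]) -> None:
        self.shape = shape


# Matching shapes pass and the common shape is returned.
common = check_same_shape({"var": FakeTensor((4, 3)), "grad": FakeTensor((4, 3))})

# A mismatched input fails loudly instead of being handed to the op.
try:
    check_same_shape({"var": FakeTensor((4, 3)), "grad": FakeTensor((4, 2))})
    mismatch_detected = False
except ValueError:
    mismatch_detected = True
```

In real code you would pass the actual tensors in place of FakeTensor and call the helper right before the raw op. A unit test that feeds deliberately wrong shapes and expects the ValueError is a cheap way to make sure the guard stays in place.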
Hope this is helpful to somebody.