When verbose level 4 is enabled, Blender prints Cycles kernel performance data for GPU backends, with timings reported for groups of kernels. Metal is the exception, since it does not use the debug_enqueue_* methods.

This change introduces a new CYCLES_DEBUG_PER_KERNEL_PERFORMANCE environment variable that makes it possible to get timings for each kernel separately instead of grouped with others. This is done by adding explicit synchronization after each kernel execution.

Differential Revision: https://developer.blender.org/D15971
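
For illustration only, here is a minimal sketch of the idea and not the actual Cycles code. SketchQueue, enqueue(), synchronize() and per_kernel_requested() are hypothetical names, and the kernels are plain CPU callbacks standing in for GPU work. The sketch also assumes that the variable merely needs to be set, which may differ from how Blender interprets its value.

```cpp
#include <chrono>
#include <cstdlib>
#include <functional>
#include <map>
#include <string>
#include <utility>
#include <vector>

using Clock = std::chrono::steady_clock;

// Hypothetical stand-in for a GPU queue: work is deferred until synchronize().
class SketchQueue {
 public:
  explicit SketchQueue(bool per_kernel) : per_kernel_(per_kernel) {}

  void enqueue(const std::string &name, std::function<void()> work)
  {
    if (pending_.empty()) {
      group_start_ = Clock::now();  // a new timing group starts here
    }
    pending_.emplace_back(name, std::move(work));
    if (per_kernel_) {
      synchronize();  // explicit sync: the group holds exactly one kernel
    }
  }

  // Runs all pending work and records the elapsed time under one label
  // built from the names of every kernel in the group.
  void synchronize()
  {
    if (pending_.empty()) {
      return;
    }
    std::string label;
    for (auto &kernel : pending_) {
      kernel.second();
      if (!label.empty()) {
        label += " ";
      }
      label += kernel.first;
    }
    const std::chrono::duration<double> elapsed = Clock::now() - group_start_;
    timings_[label] += elapsed.count();
    pending_.clear();
  }

  const std::map<std::string, double> &timings() const { return timings_; }

 private:
  bool per_kernel_;
  Clock::time_point group_start_;
  std::vector<std::pair<std::string, std::function<void()>>> pending_;
  std::map<std::string, double> timings_;
};

// Assumption: the mode is on whenever the variable is set to any value.
inline bool per_kernel_requested()
{
  return std::getenv("CYCLES_DEBUG_PER_KERNEL_PERFORMANCE") != nullptr;
}
```

Constructing the queue with SketchQueue(per_kernel_requested()) yields one timing entry per kernel when the variable is set, and one entry per group of kernels otherwise. The extra synchronization serializes execution, so per-kernel numbers come at the cost of overall throughput.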