index = i0  + i1 * N0 + i2 * N0*N1 + i3 * N0*N1*N2

so if iterating over dimension 2,

for i2 = 0, 1 2, ...
index = i0 + i1*N0 + 0       + i3 * N0*N1*N2
      = i0 + i1*N0 + 1*N0*N1 + i3 * N0*N1*N2
      = i0 + i1*N0 + 2*N0*N1 + i3 * N0*N1*N2

so the step size is N0*N1, or more formally, the step size is:

prod( N(1:(n-1)) )

but the ``base address'', i0 + i1*N0 + i3*N0*N1*N2
doesn't occupy some range 0 .. N, it occupies a pain in the arse range

         0 ..              N0*N1-1, 
  N0*N1*N2 ..   N0*N1*N2 + N0*N1-1,
2*N0*N1*N2 .. 2*N0*N1*N2 + N0*N1-1, etc

or more formally

i_end*prod(N(1:(end-1)) + (0 .. prod(N(1:(n-1)) 
   big bit                   twiddly bit

but the maximum value the base address takes is:

N0-1 + (N1-1)*N0 + (N3-1)*N0*N1*N2 = N0*N1*N2*N3 - N0*N1*N2 + N0*N1 - 1


so how can we easily iterate over all these?

we know there are 1 .. N0*N1*N3 rows to iterate over

take j = 0 .. N0*N1*N3-1

then (j % (N0*N1)) can correspond to the twiddly bit
and floor(j / (N0*N1)) * N0*N1*N2 can correspond to the big bit

floor(j / (N0*N1)) is in the range 0 ... (N3-1)
so floor(j / (N0*N1))*N0*N1*N2 is in the range 0 .. (N3-1)*N0*N1*N2 = 0 .. N0*N1*N2*N3 - N0*N1*N2

so floor(j / (N0*N1))*N0*N1*N2 + (j%(N0*N1)) is in the range 0 .. N0*N1*N2*N3 - N0*N1*N2 + N0*N1 - 1

so we should use

base address = floor(j/(N0*N1))*N0*N1*N2 + (j%(N0*N1))