gpuClustering.h - countModules

The whole kernel:

countModules kernel

  template <bool isPhase2>
  __global__ void countModules(uint16_t const* __restrict__ id,
                               uint32_t* __restrict__ moduleStart,
                               int32_t* __restrict__ clusterId,
                               int numElements) {
    int first = blockDim.x * blockIdx.x + threadIdx.x;
    constexpr int nMaxModules = isPhase2 ? phase2PixelTopology::numberOfModules : phase1PixelTopology::numberOfModules;
    assert(nMaxModules < maxNumModules);
    for (int i = first; i < numElements; i += gridDim.x * blockDim.x) {
      clusterId[i] = i;
      if (invalidModuleId == id[i])
        continue;
      auto j = i - 1;
      while (j >= 0 and id[j] == invalidModuleId)
        --j;
      if (j < 0 or id[j] != id[i]) {
        // boundary...
        auto loc = atomicInc(moduleStart, nMaxModules);
        moduleStart[loc + 1] = i;
      }
    }
  }

1. Init for clustering

Again we have some part of the code here that has nothing to do with counting the modules.

9 10	`for (int i = first; i < numElements; i += gridDim.x * blockDim.x) { clusterId[i] = i;`

We initialise the clusterIds for the findClus kernel.

2. Digi order

Let's say we have a snippet from our id array.

Instead of having numbers for the id we'll use letters, A, B, C and D, and mark invalid module ids with ❌.

id

A

❌

A

B

❌

B

C

❌

D

Digis ordered by modules

It is a prerequisite and we know that digis belonging to one module will appear consecutive in our buffer. They might be separated by invalid digis/hits.

3. Look for boundary elements

Let's use our example digi array from the previous point.

In the first row we'll show id and in the second column the threadIdx.x.

id	A	A	❌	❌	A	B	❌	B	B	C	C	❌	❌	D
thid.x	0	1	2	3	4	5	6	7	8	9	10	11	12	13

Let's execute some of our code:

11
12
13

if (invalidModuleId == id[i])
  continue;
auto j = i - 1;

id	A	A	❌	❌	A	B	❌	B	B	C	C	❌	❌	D
thid.x	0	1	2	3	4	5	6	7	8	9	10	11	12	13
i	0	1	2	3	4	5	6	7	8	9	10	11	12	13
j	-1	0	❌	❌	3	4	❌	6	7	8	9	❌	❌	12

14
15

while (j >= 0 and id[j] == invalidModuleId)
  --j;

id	A	A	❌	❌	A	B	❌	B	B	C	C	❌	❌	D
thid.x	0	1	2	3	4	5	6	7	8	9	10	11	12	13
i	0	1	2	3	4	5	6	7	8	9	10	11	12	13
j before	-1	0	❌	❌	3	4	❌	6	7	8	9	❌	❌	12
while					↓			↓						↓
j after	-1	0	❌	❌	1	4	❌	5	7	8	9	❌	❌	10

if (j < 0 or id[j] != id[i]) {
  // boundary...
  auto loc = atomicInc(moduleStart, nMaxModules);
  moduleStart[loc + 1] = i;
}

Let's set cond = (j < 0 or id[j] != id[i]). Check when this will be true (T is true, F is false, ❌ is not evaluated because that thread terminated early):

id	A	A	❌	❌	A	B	❌	B	B	C	C	❌	❌	D
thid.x	0	1	2	3	4	5	6	7	8	9	10	11	12	13
i	0	1	2	3	4	5	6	7	8	9	10	11	12	13
j after	-1	0	❌	❌	1	4	❌	5	7	8	9	❌	❌	10
cond	T	F	❌	❌	F	T	❌	F	F	T	F	❌	❌	T

Let's just look at the first and last rows and get rid if False and not evaluated threads for cond to better see what is happening.

id

A

❌

A

B

❌

B

C

❌

D

cond

T

4. set `moduleStart` for each module

18
19

auto loc = atomicInc(moduleStart, nMaxModules);
moduleStart[loc + 1] = i;

We fill the moduleStart array with starting module indices. Note that we can't make sure that the first module we mark is A and then B, etc. This code is executed competitively so we might have different moduleStart array each execution:

pos	0	1	2	3	4	5	6
moduleStart	4	0	5	9	13	0	0
moduleStart	4	0	9	13	5	0	0
moduleStart	4	13	0	9	5	0	0

The order will be determined by in what order each thread reaches the line 18.

18	`auto loc = atomicInc(moduleStart, nMaxModules);`

5. Conclusion

Conclusion

We initialise our clusterIds for our later clustering algorrithm in findClus and fill our moduleStart array with the indices of the first digi/hit in each module.