Finding an efficient shuffle in SIMD
SIMD instructions are extensions to standard instruction sets that apply the same operation to registers containing several words; they are a way to parallelize programs at a low level. Among SIMD instructions, some shift the words inside a register, or mix words coming from two different registers: these instructions are called "shuffles". A pattern that often comes up when using shuffles consists in taking the left-most word of one register and concatenating it with all the words (except the left-most one) of another register, as shown in the figure below.
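To make the pattern concrete, here is a small model in Python, using the same abstraction as the rest of this article: registers are lists of word values. Note that the exact lane order (which end is "left-most") depends on how the figure maps lanes to indices; this sketch assumes index 0 is the left-most word, and is purely illustrative, not AVX semantics.

```python
# Model 8-word registers as plain lists; index 0 plays the role of the
# left-most word (an assumption -- the real lane order depends on the figure).

def target_shuffle(a, b):
    """Concatenate the left-most word of a with words 1..7 of b."""
    return [a[0]] + b[1:]

src0 = list(range(8))        # [0, 1, ..., 7]
src1 = list(range(8, 16))    # [8, 9, ..., 15]
print(target_shuffle(src0, src1))   # [0, 9, 10, 11, 12, 13, 14, 15]
```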
A colleague who is a specialist in SIMD challenged me to find the minimum number of instructions needed to achieve this pattern with the AVX instruction set (specified here). I used an enumerative approach.
I went through several steps before arriving at the final solution, which I believe minimizes both computation time and memory usage. I will describe the successive steps here.
The basic idea is to represent each word by a unique value, since we do not consider arithmetic or logical operations. Thus, initially, the first 8-word register contains the values 0 to 7, and the second register the values 8 to 15. We can already note that there are at most 16^8 = 2^32 possible values for the destination register. The natural data structure for storing the reachable values is a tree, in which the depth in the tree corresponds to the index of the word in the destination register. Each node contains an array of pointers to its children, indexed by the value of the child node. Thus, on 4-word registers, the result [ 2, 0, 3, 2 ] is stored as the node reached by successively following the pointers at indexes 2, 0, 3 and 2 of the encountered nodes (see figure below).
Additionally, the program must store how each result was reached, which is done in the leaf nodes by keeping pointers to the operand nodes, the instruction used, and the immediate, if the instruction takes one.
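The tree described above can be sketched as a small trie in Python. This is a minimal illustration, not the original program: the `Node`, `insert` and `origin` names are mine, and the origin is stored as a plain tuple (instruction name, operands, immediate).

```python
# Each trie level corresponds to one word index of the destination register;
# each node maps a word value to its child. Leaves record how the result was
# built (instruction, operand nodes, immediate).

class Node:
    def __init__(self):
        self.children = {}   # word value -> child Node
        self.origin = None   # (instruction, operands, immediate) at leaves

def insert(root, values, origin):
    """Insert a result vector; return True if it was not already present."""
    node = root
    for v in values:
        node = node.children.setdefault(v, Node())
    if node.origin is not None:
        return False         # this result was already reached
    node.origin = origin
    return True

root = Node()
print(insert(root, [2, 0, 3, 2], ("shuffle", ("src0",), 0b1011)))  # True
print(insert(root, [2, 0, 3, 2], ("blend", ("src0", "src1"), 3)))  # False
```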
The first algorithm I thought of is the following:
Current = [ src0, src1 ] (list)
Processed = { } (set)
Results = { src0, src1 } (set)
while (!Current.empty()):
    c = Current.head()
    for each instruction [single op] and each immediate imm:
        g = inst(c, imm)
        if (g ∉ Results):
            Results.add(g)
            Current.add(g)  # add to tail
    for each p in Processed:
        for each instruction [double op] and each immediate imm:
            g = inst(c, p, imm)
            if (g ∉ Results):
                Results.add(g)
                Current.add(g)  # add to tail
    Current.pop()
    Processed.add(c)
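This worklist algorithm can be made runnable on a toy instruction set. In the sketch below, `rot` (rotate words by an immediate) and `blend` (pick each word from one of two registers according to the bits of the immediate) are stand-ins for the real AVX shuffles, and registers are shrunk to 2 words so the search closes quickly; none of these names come from the original program.

```python
from collections import deque

# Toy stand-ins for real shuffle instructions, on 2-word registers.
def rot(c, imm):
    return tuple(c[(i + imm) % 2] for i in range(2))

def blend(c, p, imm):
    # Bit i of imm selects word i from c (bit set) or from p (bit clear).
    return tuple(c[i] if (imm >> i) & 1 else p[i] for i in range(2))

src0, src1 = (0, 1), (2, 3)
current = deque([src0, src1])
processed = []
results = {src0, src1}

while current:
    c = current[0]
    for imm in range(2):                 # single-operand instructions
        g = rot(c, imm)
        if g not in results:
            results.add(g)
            current.append(g)
    for p in processed:                  # double-operand instructions
        for imm in range(4):
            g = blend(c, p, imm)
            if g not in results:
                results.add(g)
                current.append(g)
    current.popleft()
    processed.append(c)

print(len(results))   # 16: every 2-word value over {0, 1, 2, 3} is reachable
```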
The problem with this algorithm is that the results are not stored according to their order, i.e. the number of instructions required to reach them. Even if the order is stored in each leaf (computed from the orders of the parents), it is not easy to update the nodes for which the current node serves as an operand. Indeed, when iterating over the Processed set, the nodes are not visited in ascending order of their order, so the same result may be obtained later with a lower order. The other problem is that it is difficult to determine a stopping condition that guarantees an optimal result, that is, a point from which we are sure that no result of an order lower than a given bound can still be produced.
Another idea is to store the values to be processed according to their order, with one list per order. A tree is still kept, both to quickly test whether a result has already been obtained and to store how it was constructed. The goal is to iterate over the lists of values to be processed so as to produce all the results of a given order in one pass. The algorithm becomes the following:
order[0] = [ src0, src1 ] (list)
Results = { src0, src1 } (set)
target_order = 1
while (target_order < MAX_ORDER):
    for each elem in order[target_order - 1]:
        for each instruction [single op] and each immediate imm:
            g = inst(elem, imm)
            if (g ∉ Results):
                Results.add(g)
                order[target_order].add(g)
    for all combinations (i, j) with i <= j such that i + j == target_order - 1:
        for each elem0 in order[i]:
            for each elem1 in order[j]:
                for each instruction [double op] and each immediate imm:
                    g = inst(elem0, elem1, imm)
                    if (g ∉ Results):
                        Results.add(g)
                        order[target_order].add(g)  # add to tail
    if Results.contains(target):
        print target stack
    target_order += 1
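The layered algorithm can be sketched on the same toy instruction set as before (`rot` and `blend` are illustrative stand-ins, not real AVX instructions). Here `order[k]` holds the values first reached with exactly k instructions, so the first layer in which the target appears gives its minimal instruction count.

```python
# Toy stand-ins for real shuffle instructions, on 2-word registers.
def rot(c, imm):
    return tuple(c[(i + imm) % 2] for i in range(2))

def blend(c, p, imm):
    # Bit i of imm selects word i from c (bit set) or from p (bit clear).
    return tuple(c[i] if (imm >> i) & 1 else p[i] for i in range(2))

src0, src1 = (0, 1), (2, 3)
order = [[src0, src1]]            # order[0]: the source registers cost nothing
results = {src0, src1}
target = (2, 1)
MAX_ORDER = 5

for target_order in range(1, MAX_ORDER):
    layer = []
    for elem in order[target_order - 1]:        # single-operand instructions
        for imm in range(2):
            g = rot(elem, imm)
            if g not in results:
                results.add(g)
                layer.append(g)
    for i in range(target_order):               # double-operand instructions
        j = target_order - 1 - i                # operand costs sum to order-1
        if j < i:
            continue
        for elem0 in order[i]:
            for elem1 in order[j]:
                for imm in range(4):
                    g = blend(elem0, elem1, imm)
                    if g not in results:
                        results.add(g)
                        layer.append(g)
    order.append(layer)
    if target in layer:
        print("target reached with", target_order, "instruction(s)")
        break
```

On this toy example the target (2, 1) is a single blend of the two sources, so it appears in the first layer.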
Other optimizations were made successively:
- Deleting the val, index and order fields in the nodes to reduce the memory footprint:
- The index can instead be passed as an additional parameter to the functions.
- The val field is only useful for going from one node to the next, or for finding the value corresponding to a given node; in both cases, val avoids having to scan the pointer array of the parent node. However, in this algorithm the tree is never browsed in that way, and the value corresponding to a node is only needed when displaying the result.
- The order field is no longer useful in this algorithm.
- The use of arrays rather than lists, since there is no insertion or deletion at the head; arrays can be resized at constant amortized cost.
- The "packing" of the values stored in the arrays so as to use all the bits (8 values per word).
- The use of different node types for the leaves, to avoid unnecessarily paying the memory cost of the fields specific to each kind of node (the pointer array for intermediate nodes; the function, operand parents and immediate for leaf nodes).
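The "packing" optimization from the list above can be illustrated as follows: with 16 possible word values, each value fits in 4 bits, so a whole 8-word result packs into a single 32-bit integer, matching the "8 values per word" figure in the text. The `pack`/`unpack` names are mine.

```python
def pack(values):
    """Pack 8 values in [0, 16) into one 32-bit integer, 4 bits each."""
    word = 0
    for i, v in enumerate(values):
        word |= (v & 0xF) << (4 * i)
    return word

def unpack(word):
    """Recover the 8 packed values from a 32-bit integer."""
    return [(word >> (4 * i)) & 0xF for i in range(8)]

v = [0, 9, 10, 11, 12, 13, 14, 15]
print(hex(pack(v)))          # 0xfedcba90
assert unpack(pack(v)) == v  # round-trips losslessly
```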