Vector Operations

The multi-media extensions in today’s mainstream processors implement vector operations only in a limited fashion.

Vector instructions are characterized by large numbers of operations which are performed as the result of one instruction (Single Instruction Multiple Data, SIMD).

Compared with scalar operations, this can be said about the multi-media instructions, but it is a far cry from what vector computers like the Cray-1 or vector units for machines like the IBM 3090 did.

To compensate for the limited number of operations performed for one instruction (four float or two double operations on most machines) the surrounding loops have to be executed more often.

The example in section A.1 shows this clearly: each cache line requires SM iterations.
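The relation between register width and iteration count is simple arithmetic. The following is a minimal sketch, not the code from section A.1; the helper name iterations_per_line is hypothetical, and the 64-byte cache line used in the note below is an assumption.

```c
#include <stddef.h>

/* Hypothetical helper: number of vector-wide accesses (and therefore
   loop iterations) needed to cover one cache line, given the cache
   line size and the vector register width, both in bytes. */
size_t iterations_per_line(size_t line_bytes, size_t vector_bytes)
{
    /* Round up so a partially used final vector still counts. */
    return (line_bytes + vector_bytes - 1) / vector_bytes;
}
```

Assuming a 64-byte cache line, a 16-byte register (four floats) gives iterations_per_line(64, 16) == 4, while a register as wide as the cache line needs only a single iteration.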



With wider vector registers and operations, the number of loop iterations can be reduced.

This results in more than just improvements in the instruction decoding etc.; here we are more interested in the memory effects.




With a single instruction loading or storing more data, the processor has a better picture of the memory use of the application and does not have to try to piece together the information from the behavior of individual instructions.

Furthermore, it becomes more useful to provide load or store instructions which do not affect the caches.

With 16 byte wide loads of an SSE register in an x86 CPU, it is a bad idea to use uncached loads since later accesses to the same cache line have to load the data from memory again (in case of cache misses).

If, on the other hand, the vector registers are wide enough to hold one or more cache lines, uncached loads or stores do not have negative impacts.

It becomes more practical to perform operations on data sets which do not fit into the caches.
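For stores, current x86 processors already offer instructions which bypass the caches. The sketch below is an illustration under stated assumptions, not a recommendation: the function name fill_streaming is hypothetical, dst is assumed to be 16-byte aligned, and n is assumed to be a multiple of 4. It uses the SSE intrinsic _mm_stream_ps where the compiler defines __SSE__ and falls back to ordinary stores elsewhere. Note that it shows stores, not the uncached loads discussed above.

```c
#include <stddef.h>
#if defined(__SSE__)
#include <xmmintrin.h>
#endif

/* Hypothetical helper: fill n floats at dst with value.
   Assumes dst is 16-byte aligned and n is a multiple of 4. */
void fill_streaming(float *dst, size_t n, float value)
{
#if defined(__SSE__)
    __m128 v = _mm_set1_ps(value);      /* four copies of value */
    for (size_t i = 0; i < n; i += 4)
        _mm_stream_ps(dst + i, v);      /* non-temporal 16-byte store */
    _mm_sfence();                       /* order the streaming stores */
#else
    /* Fallback without SSE: ordinary, cached stores. */
    for (size_t i = 0; i < n; i++)
        dst[i] = value;
#endif
}
```

With 16-byte registers it takes several such stores to cover one cache line, which is why the data is written strictly sequentially here.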







Having large vector registers does not necessarily mean the latency of the instructions is increased;

vector instructions do not have to wait until all data is read or stored.

The vector unit could start with the data which has already been read if it can recognize the code flow.

That means, if, for instance, a vector register is to be loaded and then all vector elements multiplied by a scalar, the CPU could start the multiplication operation as soon as the first part of the vector has been loaded.



It is just a matter of the sophistication of the vector unit.

What this shows is that, in theory, the vector registers can grow really wide, and that programs could potentially be designed today with this in mind.

In practice, there are limitations imposed on the vector register size by the fact that the processors are used in multi-process and multithread OSes.

As a result, the context switch times, which include storing and loading register values, are important.




With wider vector registers there is the problem that the input and output data of the operations cannot always be laid out sequentially in memory.

This might be because a matrix is sparse, because a matrix is accessed by columns instead of rows, or because of many other factors.


Vector units provide, for this case, ways to access memory in non-sequential patterns.

A single vector load or store can be parametrized and instructed to load data from many different places in the address space.

Using today’s multi-media instructions, this is not possible at all.

The values would have to be explicitly loaded one by one and then painstakingly combined into one vector register.





The vector units of the old days had different modes to allow the most useful access patterns:

(1) using striding, the program can specify how big the gap between two neighboring vector elements is.

The gap between all elements must be the same, but this would, for instance, make it easy to read a column of a matrix into a vector register in one instruction instead of one instruction per row.

(2) using indirection, arbitrary access patterns can be created.

The load or store instruction receives a pointer to an array which contains the addresses or offsets of the real memory locations which have to be loaded.
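The two access modes can be emulated in scalar C to make their semantics concrete. This is only a sketch: the function names gather_strided and gather_indexed are hypothetical, the destination array stands in for a vector register, and a real vector unit would perform each gather as a single instruction.

```c
#include <stddef.h>

/* Mode (1), striding: element i comes from base[i * stride].
   The gap between neighboring elements is the same everywhere. */
void gather_strided(double *dst, const double *base,
                    size_t stride, size_t count)
{
    for (size_t i = 0; i < count; i++)
        dst[i] = base[i * stride];
}

/* Mode (2), indirection: element i comes from base[index[i]].
   The index array holds offsets, so any access pattern is possible. */
void gather_indexed(double *dst, const double *base,
                    const size_t *index, size_t count)
{
    for (size_t i = 0; i < count; i++)
        dst[i] = base[index[i]];
}
```

For a matrix m stored row by row with cols columns, column c is read with gather_strided(dst, m + c, cols, rows). The indexed variant fits sparse data, where the offsets of the non-zero elements are kept in a separate array.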

It is unclear at this point whether we will see a revival of true vector operations in future versions of mainstream processors.


Maybe this work will be relegated to coprocessors.

In any case, should we get access to vector operations, it is all the more important to correctly organize the code performing such operations.

The code should be self-contained and replaceable, and the interface should be general enough to efficiently apply vector operations.

For instance, interfaces should allow adding entire matrices instead of operating on rows, columns, or even groups of elements.

The larger the building blocks, the better the chance of using vector operations.
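A whole-matrix interface of this kind could look like the following sketch. The name matrix_add and the contiguous row-major layout are assumptions made for illustration; the point is only that the caller hands over the entire operation, so the implementation can be replaced by one using wider vector operations without changing any caller.

```c
#include <stddef.h>

/* Hypothetical interface: dst = a + b for two rows x cols matrices
   stored contiguously in row-major order. */
void matrix_add(double *dst, const double *a, const double *b,
                size_t rows, size_t cols)
{
    size_t n = rows * cols;

    /* One flat loop over all elements; the function is free to
       process the data in chunks as wide as the hardware allows. */
    for (size_t i = 0; i < n; i++)
        dst[i] = a[i] + b[i];
}
```

An interface which adds one row or one element per call would hide the size of the full operation from the implementation.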






In [11] the authors make a passionate plea for the revival of vector operations.

They point out many advantages and try to debunk various myths.

They paint an overly simplistic image, though.

As mentioned above, large register sets mean high context switch times, which have to be avoided in general purpose OSes.

See the problems of the IA-64 processor when it comes to context switch-intensive operations.

The long execution time for vector operations is also a problem if interrupts are involved.

If an interrupt is raised, the processor must stop its current work and start working on handling the interrupt.

After that, it must resume executing the interrupted code.

It is generally a big problem to interrupt an instruction in the middle of the work; it is not impossible, but it is complicated.

For long-running instructions this has to happen, or the instructions must be implemented in a restartable fashion, since otherwise the interrupt reaction time is too high. The latter is not acceptable.








Vector units were also forgiving as far as the alignment of memory accesses is concerned, which shaped the algorithms which were developed.

Some of today’s processors (especially RISC processors) require strict alignment so the extension to full vector operations is not trivial.

There are big potential upsides to having vector operations, especially when striding and indirection are supported, so we can hope to see this functionality in the future.