2.2: Short Vector Processing (Multimedia Operations)
Introduction
This section introduces the learner to short vector processing (multimedia operations). Vector processing was initially developed for supercomputing applications; today it is also important for multimedia operations.
Activity Details
Vector processor
A vector processor, also called an array processor, is a central processing unit (CPU) that implements an instruction set containing instructions that operate on one-dimensional arrays of data called vectors. Scalar processors, by contrast, have instructions that operate on single data items. Vector processors can greatly improve performance on certain workloads, notably numerical simulation and similar tasks.
Vector processors are often associated with supercomputers: they are machines built primarily to handle large scientific and engineering calculations. Their performance derives from a heavily pipelined architecture that operations on vectors and matrices can exploit efficiently.
Vector processors provide high-level operations that work on linear arrays of numbers, called vectors.
Properties of Vector Processors
Vector processors have the following properties:
- A single vector instruction implies a lot of work (an entire loop), so fewer instructions are fetched.
- Each result is independent of the previous result.
   a. Multiple operations can be executed in parallel
   b. Simpler design, high clock rate
   c. Compiler (programmer) ensures no dependencies
- Reduces branches and branch problems in pipelines.
- Vector instructions access memory with a known pattern.
   a. Effective prefetching
   b. Amortizes memory latency over a large number of elements
   c. Can exploit a high-bandwidth memory system
   d. No (data) caches required!
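The latency-amortization property above can be illustrated with a rough cost model. The sketch below assumes a hypothetical 100-cycle memory latency and a pipelined memory that delivers one element per cycle once the first has arrived; neither figure comes from the text, and `cycles_per_element` is an illustrative name.

```python
def cycles_per_element(latency: int, n_elements: int) -> float:
    """Average cycles per element when one pipelined access streams n elements.

    Assumed model (not from the text): the first element arrives after
    `latency` cycles and each later element arrives one cycle after the
    previous one.
    """
    if n_elements <= 0:
        raise ValueError("n_elements must be positive")
    total_cycles = latency + (n_elements - 1)
    return total_cycles / n_elements


# Hypothetical 100-cycle memory latency.
scalar_cost = cycles_per_element(latency=100, n_elements=1)   # one element per access
vector_cost = cycles_per_element(latency=100, n_elements=64)  # one 64-element vector load

print(f"scalar access: {scalar_cost} cycles per element")
print(f"vector access: {vector_cost} cycles per element")
```

Under these assumptions the latency is paid once per vector rather than once per element, which is the sense in which a vector load amortizes it.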
Styles of Vector Architectures
There are two styles of vector architecture:
- Memory-memory vector processors
  - All vector operations are memory to memory
- Vector-register processors
  - All vector operations are between vector registers (except vector load and store)
  - Vector equivalent of load-store architectures
  - Includes all vector machines since the late 1980s
  - The rest of this section assumes a vector-register architecture
Components of a Vector Processor
- Scalar CPU: registers, data paths, instruction fetch logic
- Vector register
  - Fixed-length memory bank holding a single vector
  - Typically 8 to 32 vector registers, each holding 1 to 8 Kbits
  - Has at least 2 read ports and 1 write port
  - For multimedia (MM): can be viewed as an array of 64b, 32b, 16b, or 8b elements
- Vector functional units (FUs)
  - Fully pipelined; start a new operation every clock
  - Typically 2 to 8 FUs: integer and FP
  - Multiple data paths (pipelines) used for each unit to process multiple elements per cycle
- Vector load-store units (LSUs)
  - Fully pipelined unit to load or store a vector
  - Multiple elements fetched/stored per cycle
  - May have multiple LSUs
- Cross-bar to connect FUs, LSUs, and registers
Basic Vector Instructions
These are listed in Table 1 below.
Table 1
| Instr. | Operands | Operation | Comment |
| --- | --- | --- | --- |
| VADD.VV | V1, V2, V3 | V1 = V2 + V3 | vector + vector |
| VADD.SV | V1, R0, V2 | V1 = R0 + V2 | scalar + vector |
| VMUL.VV | V1, V2, V3 | V1 = V2 \(\times\) V3 | vector \(\times\) vector |
| VMUL.SV | V1, R0, V2 | V1 = R0 \(\times\) V2 | scalar \(\times\) vector |
| VLD | V1, R1 | V1 = M[R1..R1 + 63] | load, stride = 1 |
| VLDS | V1, R1, R2 | V1 = M[R1..R1 + 63*R2] | load, stride = R2 |
| VLDX | V1, R1, V2 | V1 = M[R1 + V2[i], i = 0..63] | indexed ("gather") |
| VST | V1, R1 | M[R1..R1 + 63] = V1 | store, stride = 1 |
| VSTS | V1, R1, R2 | M[R1..R1 + 63*R2] = V1 | store, stride = R2 |
| VSTX | V1, R1, V2 | M[R1 + V2[i], i = 0..63] = V1 | indexed ("scatter") |
+ all the regular scalar instructions (RISC style)...
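The arithmetic instructions in Table 1 can be modeled in a few lines of Python. This is only a sketch of their semantics, assuming a vector register is a plain list of 64 numbers; the function names (`vadd_vv`, `vadd_sv`, `vmul_vv`, `vmul_sv`) are illustrative stand-ins for the mnemonics and do not belong to any real instruction set library.

```python
MVL = 64  # Table 1 uses element indices 0..63, so 64 elements per register


def _check_same_length(v2, v3):
    """Both source registers must hold the same number of elements."""
    if len(v2) != len(v3):
        raise ValueError("vector operands must have the same length")


def vadd_vv(v2, v3):
    """VADD.VV V1, V2, V3: element-wise vector + vector."""
    _check_same_length(v2, v3)
    return [a + b for a, b in zip(v2, v3)]


def vadd_sv(r0, v2):
    """VADD.SV V1, R0, V2: add the scalar R0 to every element."""
    return [r0 + b for b in v2]


def vmul_vv(v2, v3):
    """VMUL.VV V1, V2, V3: element-wise vector * vector."""
    _check_same_length(v2, v3)
    return [a * b for a, b in zip(v2, v3)]


def vmul_sv(r0, v2):
    """VMUL.SV V1, R0, V2: multiply every element by the scalar R0."""
    return [r0 * b for b in v2]


v2 = list(range(MVL))      # 0, 1, ..., 63
v3 = [10] * MVL
v1 = vadd_vv(v2, v3)       # one "instruction" produces all 64 sums
scaled = vmul_sv(2, v2)    # one "instruction" produces all 64 products
print(v1[:4], scaled[:4])
```

Each call stands for a single vector instruction, whereas a scalar processor would need a loop with one add or multiply per element.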
Vector Memory Operations
- Load/store operations move groups of data between registers and memory
- Three types of addressing
  - Unit stride
    - Fastest
  - Non-unit (constant) stride
  - Indexed (gather-scatter)
    - Vector equivalent of register indirect
    - Good for sparse arrays of data
    - Increases the number of programs that vectorize
    - Compress/expand variant also
- Support for various combinations of data widths in memory
  - {.L, .W, .H, .B} \(\times\) {64b, 32b, 16b, 8b}
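The three addressing types can be sketched as follows. The model assumes memory is a Python list addressed by element index rather than by byte, and the helper names (`vld`, `vldx`, `vst`, `vstx`) are illustrative only.

```python
def vld(mem, base, vl, stride=1):
    """Unit-stride load (stride=1) or constant-stride load (stride=R2)."""
    return [mem[base + i * stride] for i in range(vl)]


def vldx(mem, base, index):
    """Indexed load ("gather"): element i comes from mem[base + index[i]]."""
    return [mem[base + offset] for offset in index]


def vst(mem, base, vec, stride=1):
    """Unit-stride or constant-stride store."""
    for i, value in enumerate(vec):
        mem[base + i * stride] = value


def vstx(mem, base, index, vec):
    """Indexed store ("scatter"): element i goes to mem[base + index[i]]."""
    for offset, value in zip(index, vec):
        mem[base + offset] = value


mem = list(range(100, 132))              # mem[k] == 100 + k
unit = vld(mem, base=4, vl=4)            # mem[4], mem[5], mem[6], mem[7]
strided = vld(mem, base=4, vl=4, stride=3)   # mem[4], mem[7], mem[10], mem[13]
gathered = vldx(mem, base=4, index=[0, 5, 2, 9])

# Scatter two values into non-adjacent locations, as for a sparse array.
vstx(mem, base=0, index=[1, 3], vec=[-1, -2])
print(unit, strided, gathered, mem[:5])
```

Gather and scatter take their offsets from another vector, which is why they suit sparse data where the useful elements are not evenly spaced.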
Vector Code Example
Y[0:63] = Y[0:63] + a*X[0:63]
Table 2 below compares the scalar and vector versions of this computation.
Table 2
| 64 element SAXPY: scalar | 64 element SAXPY: vector |
| --- | --- |
| LD R0, a | LD R0, a #load scalar a |
| ADDI R4, Rx, #512 | VLD V1, Rx #load vector X |
| loop: LD R2, 0(Rx) | VMUL.SV V2, R0, V1 #vector mult |
| MULTD R2, R0, R2 | VLD V3, Ry #load vector Y |
| LD R6, 0(Ry) | VADD.VV V4, V2, V3 #vector add |
| ADDD R6, R2, R6 | VST V4, Ry #store vector Y |
| SD R6, 0(Ry) | |
| ADDI Rx, Rx, #8 | |
| ADDI Ry, Ry, #8 | |
| SUB R20, R4, Rx | |
| BNZ R20, loop | |
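To make the comparison concrete, the sketch below mirrors both versions of the example in Python and counts the instructions each one would execute. The counts assume the scalar version has two setup instructions and nine instructions in its loop body, and that the vector version has six instructions in total; `saxpy_scalar` and `saxpy_vector` are illustrative names.

```python
N = 64  # number of elements in X and Y


def saxpy_scalar(a, x, y):
    """Element-at-a-time loop, counting executed instructions."""
    y = list(y)
    executed = 2                  # LD a, ADDI (compute end address)
    for i in range(N):
        y[i] = a * x[i] + y[i]    # LD, MULTD, LD, ADDD, SD
        executed += 9             # ... plus ADDI, ADDI, SUB, BNZ
    return y, executed


def saxpy_vector(a, x, y):
    """Whole-vector version, one Python statement per vector instruction."""
    v1 = list(x)                              # VLD     V1, Rx
    v2 = [a * e for e in v1]                  # VMUL.SV V2, R0, V1
    v3 = list(y)                              # VLD     V3, Ry
    v4 = [p + q for p, q in zip(v2, v3)]      # VADD.VV V4, V2, V3
    executed = 6                              # LD a + 4 above + VST
    return v4, executed


x = list(range(N))
y = [1] * N
scalar_result, scalar_count = saxpy_scalar(2, x, y)
vector_result, vector_count = saxpy_vector(2, x, y)
print(scalar_count, vector_count)
```

Both versions compute the same values; the difference is that the scalar loop fetches and executes its body once per element, while each vector instruction covers all 64 elements.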
Vector Length
A vector register can hold some maximum number of elements for each data width (the maximum vector length, or MVL). What should happen when the application's vector length is not exactly MVL? The vector-length (VL) register controls the length of any vector operation, including a vector load or store. For example, vadd.vv with VL=10 is equivalent to:
for (I=0; I<10; I++) V1[I]=V2[I]+V3[I]
The VL can be anything from 0 to MVL.
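When an application vector is longer than MVL, one common approach is to process it in chunks of at most MVL elements, setting VL for each chunk. This is usually called strip mining; the text does not name it, so the sketch below is offered as an illustration under that assumption, with MVL taken to be 64 and all function names hypothetical.

```python
MVL = 64  # assumed maximum vector length


def vadd_vv(v2, v3, vl):
    """Add the first vl elements, as vadd.vv would with the VL register set to vl."""
    if not 0 <= vl <= MVL:
        raise ValueError("VL must be between 0 and MVL")
    return [v2[i] + v3[i] for i in range(vl)]


def strip_mined_add(a, b):
    """Add two equal-length arrays of any size in chunks of at most MVL elements."""
    if len(a) != len(b):
        raise ValueError("arrays must have the same length")
    result = []
    vl_trace = []                      # VL value used for each chunk
    start = 0
    while start < len(a):
        vl = min(MVL, len(a) - start)  # last chunk may be shorter than MVL
        chunk = vadd_vv(a[start:start + vl], b[start:start + vl], vl)
        result.extend(chunk)
        vl_trace.append(vl)
        start += vl
    return result, vl_trace


a = list(range(150))
b = [1] * 150
total, trace = strip_mined_add(a, b)
print(trace)
```

For 150 elements the loop issues two full-length operations and one shorter operation for the remainder, so no separate scalar clean-up loop is needed.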
Conclusion
This section introduced the learner to short vector processing (multimedia operations), which provides high-level operations on linear arrays of numbers.
Assessment
1. State the properties of the vector processors
- Each result independent of previous result
=> Long pipeline, compiler ensures no dependencies
=> High clock rate
- Vector instructions access memory with known pattern
=> Highly interleaved memory
=> Amortize memory latency over \(\approx\) 64 elements
=> No (data) caches required! (Do use instruction cache)
- Reduces branches and branch problems in pipelines
- Single vector instruction implies lots of work (\(\approx\) loop)
=> Fewer instruction fetches