2.2: Short Vector Processing (Multimedia Operations)
Introduction
This section introduces short vector processing (multimedia operations). Vector processing was initially developed for supercomputing applications; today it is also important for multimedia operations.
Activity Details
Vector processor
A vector processor, also called an array processor, is a central processing unit (CPU) that implements an instruction set containing instructions that operate on one-dimensional arrays of data called vectors. By contrast, the instructions of a scalar processor operate on single data items. Vector processors can greatly improve performance on certain workloads, notably numerical simulation and similar tasks.
Vector processors are commonly associated with supercomputers: they are machines built primarily to handle large scientific and engineering calculations. Their performance derives from a heavily pipelined architecture, which operations on vectors and matrices can exploit efficiently.
Vector processors have high-level operations that work on linear arrays of numbers: “vectors”.
Properties of Vector Processors
Vector processors have the following properties:
 A single vector instruction implies a lot of work (a whole loop), so there are fewer instruction fetches
 Each result independent of previous result.
a. Multiple operations can be executed in parallel
b. Simpler design, high clock rate
c. Compiler (programmer) ensures no dependencies
 Reduces branches and branch problems in pipelines
 Vector instructions access memory with known pattern
a. Effective prefetching
b. Amortize memory latency over a large number of elements
c. Can exploit a high bandwidth memory system
d. No (data) caches required!
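The independence property above can be illustrated with a minimal Python sketch. The function names are hypothetical and chosen only for illustration; they do not belong to any real vector instruction set. The first loop has independent iterations, so it could in principle be issued as one vector operation. The second carries a value from one iteration to the next, so it cannot be treated as a single simple vector operation.

```python
def elementwise_add(b, c):
    """Each result depends only on b[i] and c[i], so all elements
    could be computed in parallel by one vector add."""
    return [b[i] + c[i] for i in range(len(b))]


def running_sum(b):
    """a[i] depends on a[i-1]: a loop-carried dependency, so the
    iterations cannot simply be executed side by side."""
    a = [0] * len(b)
    for i in range(len(b)):
        previous = a[i - 1] if i > 0 else 0
        a[i] = previous + b[i]
    return a


print(elementwise_add([1, 2, 3], [10, 20, 30]))  # [11, 22, 33]
print(running_sum([1, 2, 3]))                    # [1, 3, 6]
```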
Styles of Vector Architectures
There are two styles:
 Memory-memory vector processors
a. All vector operations are memory to memory
 Vector-register processors
 All vector operations are between vector registers (except vector load and store)
 Vector equivalent of load-store architectures
 Includes all vector machines since the late 1980s
 We assume vector-register processors for the rest of this section
Components of a Vector Processor
 Scalar CPU: registers, data paths, instruction fetch logic
 Vector register
 Fixed-length memory bank holding a single vector
 Typically 8 to 32 vector registers, each holding 1 to 8 Kbits
 Has at least 2 read ports and 1 write port
 For multimedia (MM): can be viewed as an array of 64b, 32b, 16b, or 8b elements
 Vector functional units (FUs)
 Fully pipelined, start new operation every clock
 Typically 2 to 8 FUs: integer and FP
 Multiple data paths (pipelines) used for each unit to process multiple elements per cycle
 Vector load-store units (LSUs)
 Fully pipelined unit to load or store a vector
 Multiple elements fetched/stored per cycle
 May have multiple LSUs
 Crossbar to connect FUs, LSUs, and registers
Basic Vector Instructions
These are listed in Table 1 below.
Table 1
Instr.  Operands  Operation  Comment 
VADD.VV  V1, V2, V3  V1 = V2 + V3  vector + vector 
VADD.SV  V1, R0, V2  V1 = R0 + V2  scalar + vector 
VMUL.VV  V1, V2, V3  V1 = V2\(\times\)V3  vector \(\times\) vector 
VMUL.SV  V1, R0, V2  V1 = R0\(\times\)V2  scalar \(\times\) vector 
VLD  V1, R1  V1 = M [R1..R1 + 63]  load, stride = 1 
VLDS  V1, R1, R2  V1 = M [R1..R1 + 63*R2]  load, stride = R2 
VLDX  V1, R1, V2  V1 = M [R1 + V2[i], i = 0..63]  indexed ("gather") 
VST  V1, R1  M [R1..R1 + 63] = V1  store, stride = 1 
VSTS  V1, R1, R2  M [R1..R1 + 63*R2] = V1  store, stride = R2 
VSTX  V1, R1, V2  M [R1 + V2[i], i = 0..63] = V1  indexed ("scatter") 
+ all the regular scalar instructions (RISC style)...
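As a rough illustration of the arithmetic instructions in Table 1, the following Python sketch models a vector register as a plain list. The function names mirror the mnemonics but are hypothetical helpers, and the lists here are shorter than the 64 elements assumed in the table. This is a model of the semantics only, not of how the hardware pipelines the work.

```python
def vadd_vv(v2, v3):
    """VADD.VV: element-wise vector + vector."""
    return [x + y for x, y in zip(v2, v3)]


def vadd_sv(r0, v2):
    """VADD.SV: add the scalar r0 to every element of v2."""
    return [r0 + x for x in v2]


def vmul_vv(v2, v3):
    """VMUL.VV: element-wise vector * vector."""
    return [x * y for x, y in zip(v2, v3)]


def vmul_sv(r0, v2):
    """VMUL.SV: multiply every element of v2 by the scalar r0."""
    return [r0 * x for x in v2]


print(vadd_vv([1, 2, 3], [4, 5, 6]))  # [5, 7, 9]
print(vadd_sv(10, [1, 2, 3]))         # [11, 12, 13]
print(vmul_vv([1, 2, 3], [4, 5, 6]))  # [4, 10, 18]
print(vmul_sv(2, [1, 2, 3]))          # [2, 4, 6]
```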
Vector Memory Operations
 Load/store operations move groups of data between registers and memory
 Three types of addressing
 Unit stride
 Fastest
 Non-unit (constant) stride
 Indexed (gather-scatter)
 Vector equivalent of register indirect
 Good for sparse arrays of data
 Increases number of programs that vectorize
 Compress/expand variant also
 Support for various combinations of data widths in memory
 {.L, .W, .H, .B} \(\times\) {64b, 32b, 16b, 8b}
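The three addressing types can be sketched in Python as follows. This is a simplified model under stated assumptions: memory is a flat list, addresses index whole elements rather than bytes, and the helper names (vld, vlds, vldx, vstx) and the explicit vl argument are illustrative rather than taken from a real instruction set.

```python
def vld(mem, base, vl):
    """Unit-stride load: consecutive elements starting at base."""
    return [mem[base + i] for i in range(vl)]


def vlds(mem, base, stride, vl):
    """Strided load: every stride-th element starting at base."""
    return [mem[base + i * stride] for i in range(vl)]


def vldx(mem, base, index, vl):
    """Indexed load (gather): element i comes from base + index[i]."""
    return [mem[base + index[i]] for i in range(vl)]


def vstx(mem, base, index, v, vl):
    """Indexed store (scatter): element i goes to base + index[i]."""
    for i in range(vl):
        mem[base + index[i]] = v[i]


mem = list(range(100, 132))            # 32 elements: 100, 101, ..., 131
print(vld(mem, 4, 4))                  # [104, 105, 106, 107]
print(vlds(mem, 0, 8, 4))              # [100, 108, 116, 124]
print(vldx(mem, 2, [0, 5, 1, 9], 4))   # [102, 107, 103, 111]
vstx(mem, 0, [3, 1], [-1, -2], 2)
print(mem[:5])                         # [100, -2, 102, -1, 104]
```

If the 32 elements are read as a 4 × 8 row-major matrix, the strided load with stride 8 picks out its first column, which is a typical use of non-unit stride. The gather shows why indexed addressing suits sparse data: the index vector can name any scattered set of elements.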
Vector Code Example
Y[0:63] = Y[0:63] + a*X[0:63]
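See Tables 2 and 3 below.
Table 2: 64-element SAXPY, scalar code
LD  R0, a  #load scalar a
ADDI  R4, Rx, #512  #R4 = address just past the last element of X
loop:  LD  R2, 0(Rx)  #load X[i]
MULTD  R2, R0, R2  #a*X[i]
LD  R6, 0(Ry)  #load Y[i]
ADDD  R6, R2, R6  #a*X[i] + Y[i]
SD  R6, 0(Ry)  #store Y[i]
ADDI  Rx, Rx, #8  #advance to the next element of X
ADDI  Ry, Ry, #8  #advance to the next element of Y
SUB  R20, R4, Rx  #distance remaining to the end of X
BNZ  R20, loop  #repeat until all 64 elements are done
Table 3: 64-element SAXPY, vector code
LD  R0, a  #load scalar a
VLD  V1, Rx  #load vector X
VMUL.SV  V2, R0, V1  #vector mult
VLD  V3, Ry  #load vector Y
VADD.VV  V4, V2, V3  #vector add
VST  V4, Ry  #store vector Y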
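The same computation can be sketched in Python to make the contrast concrete. The function names are illustrative, and the list operations only stand in for the vector instructions named in the comments; this is a semantic model, not a performance model. Counting from the listings above, the scalar version executes 2 + 9 × 64 = 578 instructions (two set-up instructions plus a nine-instruction loop body run 64 times), while the vector version needs 6.

```python
def saxpy_scalar(a, x, y):
    """Mirror of the scalar loop: one multiply and one add per iteration."""
    for i in range(len(x)):
        y[i] = a * x[i] + y[i]
    return y


def saxpy_vector(a, x, y):
    """Mirror of the vector code: each line stands for one vector instruction."""
    v1 = list(x)                          # VLD: load vector X
    v2 = [a * e for e in v1]              # VMUL.SV: scalar * vector
    v3 = list(y)                          # VLD: load vector Y
    v4 = [p + q for p, q in zip(v2, v3)]  # VADD.VV: vector + vector
    y[:] = v4                             # VST: store vector Y
    return y


x = [float(i) for i in range(64)]
y_scalar = [1.0] * 64
y_vector = [1.0] * 64
saxpy_scalar(2.0, x, y_scalar)
saxpy_vector(2.0, x, y_vector)
print(y_scalar == y_vector)  # True
print(y_vector[:4])          # [1.0, 3.0, 5.0, 7.0]
```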
Vector Length
A vector register can hold some maximum number of elements for each data width (the maximum vector length, or MVL). What should be done when the application vector length is not exactly MVL? The vector-length (VL) register controls the length of any vector operation, including a vector load or store. For example, vadd.vv with VL=10 is equivalent to:
for (I=0; I<10; I++) V1[I]=V2[I]+V3[I]
The VL can be anything from 0 to MVL
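One common way to handle an application vector longer than MVL is to process it in strips of at most MVL elements, setting VL for each strip; this technique is usually called strip mining. The sketch below assumes MVL = 64 to match the earlier examples, and the function names are hypothetical.

```python
MVL = 64  # maximum vector length (assumed, matching the 64-element examples)


def strip_lengths(n, mvl=MVL):
    """Return the VL value used for each strip when processing n elements."""
    lengths = []
    remaining = n
    while remaining > 0:
        vl = min(mvl, remaining)  # value written to the VL register
        lengths.append(vl)
        remaining -= vl
    return lengths


def strip_mined_add(b, c, mvl=MVL):
    """Add two equal-length lists, one strip of at most mvl elements at a time."""
    a = []
    start = 0
    for vl in strip_lengths(len(b), mvl):
        # Stands for one vector add executed with the VL register set to vl.
        a.extend(b[start + i] + c[start + i] for i in range(vl))
        start += vl
    return a


print(strip_lengths(150))  # [64, 64, 22]
```

Here a 150-element operation becomes two full-length strips and one final strip with VL = 22, so no separate scalar clean-up loop is needed.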
Conclusion
This section introduced short vector processing (multimedia operations), which provides high-level operations on linear arrays of numbers.
Assessment
1. State the properties of vector processors.
 Each result independent of previous result
=> Long pipeline, compiler ensures no dependencies
=> High clock rate
 Vector instructions access memory with known pattern
=> Highly interleaved memory
=> Amortize memory latency over \(\approx\) 64 elements
=> No (data) caches required! (Do use instruction cache)
 Reduces branches and branch problems in pipelines
 Single vector instruction implies lots of work (\(\approx\) loop)
=> Fewer instruction fetches