# 2.2: Short Vector Processing (Multimedia Operations)

• Harrison Njoroge
• African Virtual University

### Introduction

This section introduces short vector processing (multimedia operations). Vector processing was initially developed for supercomputing applications; today it is also important for multimedia operations.

## Activity Details

### Vector processor

A vector processor, also called an array processor, is a central processing unit (CPU) that implements an instruction set containing instructions that operate on one-dimensional arrays of data called vectors. This contrasts with scalar processors, whose instructions operate on single data items. Vector processors can greatly improve performance on certain workloads, notably numerical simulation and similar tasks.

Vector processors are machines built primarily to handle large scientific and engineering calculations, and such machines are commonly called supercomputers. Their performance derives from a heavily pipelined architecture that operations on vectors and matrices can exploit efficiently.

Vector processors provide high-level operations that work on linear arrays of numbers ("vectors").

## Properties of Vector Processors

Vector processors have the following properties:

1. Single vector instruction implies lots of work (loop), i.e. it has fewer instruction fetches

2. Each result independent of previous result.
a. Multiple operations can be executed in parallel
b. Simpler design, high clock rate
c. Compiler (programmer) ensures no dependencies

3. Reduces branches and branch problems in pipelines

4. Vector instructions access memory with known pattern
a. Effective prefetching
b. Amortize memory latency over a large number of elements
c. Can exploit a high bandwidth memory system
d. No (data) caches required!
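
As a rough illustration of property 4b, consider a simplified model that is an assumption of this note, not a description of any particular machine: memory has a startup latency of $$L$$ cycles and then delivers one element per cycle. A vector load of $$n$$ elements then takes about $$L + n$$ cycles, so the average cost per element is

$$\frac{L + n}{n} = 1 + \frac{L}{n}$$

With the purely illustrative values $$L = 100$$ and $$n = 64$$, this gives $$164 / 64 \approx 2.56$$ cycles per element, whereas 64 separate uncached scalar loads that each pay the full latency would cost about 100 cycles per element under the same model.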

## Styles of Vector Architectures

There are two styles:

1. Memory-memory vector processors

a. All vector operations are memory to memory

2. Vector-register processors

1. All vector operations between vector registers (except vector load and store)

2. Vector equivalent of load-store architectures

3. Includes all vector machines since late 1980s

4. The rest of this section assumes a vector-register architecture

### Components of a Vector Processor

• Scalar CPU: registers, data paths, instruction fetch logic

• Vector registers

  • Fixed-length memory bank holding a single vector

  • Typically 8-32 vector registers, each holding 1 to 8 Kbits

  • Has at least 2 read ports and 1 write port

  • MM (multimedia): can be viewed as an array of 64b, 32b, 16b, or 8b elements

• Vector functional units (FUs)

  • Fully pipelined, start a new operation every clock

  • Typically 2 to 8 FUs: integer and FP

  • Multiple data paths (pipelines) used for each unit to process multiple elements per cycle

• Vector load-store units (LSUs)

  • Fully pipelined unit to load or store a vector

  • Multiple elements fetched/stored per cycle

  • May have multiple LSUs

• Cross-bar to connect FUs, LSUs, registers

### Basic Vector Instructions

The basic vector instructions are listed in Table 1 below.

Table 1

| Instr. | Operands | Operation | Comment |
|---|---|---|---|
| VADD.VV | V1, V2, V3 | V1 = V2 + V3 | vector + vector |
| VADD.SV | V1, R0, V2 | V1 = R0 + V2 | scalar + vector |
| VMUL.VV | V1, V2, V3 | V1 = V2 $$\times$$ V3 | vector $$\times$$ vector |
| VMUL.SV | V1, R0, V2 | V1 = R0 $$\times$$ V2 | scalar $$\times$$ vector |
| VLD | V1, R1 | V1 = M[R1..R1 + 63] | load, stride = 1 |
| VLDS | V1, R1, R2 | V1 = M[R1..R1 + 63*R2] | load, stride = R2 |
| VLDX | V1, R1, V2 | V1 = M[R1 + V2[i], i = 0..63] | indexed ("gather") |
| VST | V1, R1 | M[R1..R1 + 63] = V1 | store, stride = 1 |
| VSTS | V1, R1, R2 | M[R1..R1 + 63*R2] = V1 | store, stride = R2 |
| VSTX | V1, R1, V2 | M[R1 + V2[i], i = 0..63] = V1 | indexed ("scatter") |

These are provided in addition to all the regular scalar instructions (RISC style).
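
To make the semantics in Table 1 concrete, here is a minimal Python sketch that models a few of the instructions. It is only an illustration: Python lists stand in for vector registers and for memory, addresses are counted in elements rather than bytes, and the function names are hypothetical helpers chosen to mirror the mnemonics.

```python
# Toy model of some Table 1 instructions (illustrative only, not a real ISA).
# Assumption: memory is a Python list and addresses are element indices.
MVL = 64  # Table 1 operates on elements 0..63


def vadd_vv(v2, v3):
    """VADD.VV V1, V2, V3: element-wise vector + vector."""
    return [a + b for a, b in zip(v2, v3)]


def vadd_sv(r0, v2):
    """VADD.SV V1, R0, V2: scalar + vector."""
    return [r0 + b for b in v2]


def vmul_vv(v2, v3):
    """VMUL.VV V1, V2, V3: element-wise vector x vector."""
    return [a * b for a, b in zip(v2, v3)]


def vmul_sv(r0, v2):
    """VMUL.SV V1, R0, V2: scalar x vector."""
    return [r0 * b for b in v2]


def vld(mem, r1):
    """VLD V1, R1: unit-stride load of MVL elements starting at r1."""
    return mem[r1:r1 + MVL]


def vst(mem, v1, r1):
    """VST V1, R1: unit-stride store of v1 into memory starting at r1."""
    mem[r1:r1 + len(v1)] = v1
```

Each function corresponds to a single vector instruction, yet it performs the work of a whole scalar loop, which is the point made in property 1 above.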

### Vector Memory Operations

• Load/store operations move groups of data between registers and memory

• Unit stride

  • Fastest

• Non-unit (constant) stride

• Indexed (gather-scatter)

  • Vector equivalent of register indirect

  • Good for sparse arrays of data

  • Increases the number of programs that vectorize

  • A compress/expand variant also exists

• Support for various combinations of data widths in memory

  • {.L, .W, .H, .B} $$\times$$ {64b, 32b, 16b, 8b}
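
The addressing modes above can be sketched in Python as follows. This is a hedged illustration rather than a definition of any real instruction set: memory is a flat list, addresses and strides are counted in elements, the explicit `vl` argument is an added convenience, and the function names are hypothetical helpers modelled on VLDS, VLDX and VSTX from Table 1.

```python
# Toy model of strided and indexed (gather/scatter) vector memory access.
# Assumption: memory is a flat Python list addressed in elements, not bytes.


def vlds(mem, base, stride, vl):
    """Strided load: fetch vl elements at base, base+stride, base+2*stride, ..."""
    return [mem[base + i * stride] for i in range(vl)]


def vldx(mem, base, index):
    """Indexed load ("gather"): fetch mem[base + index[i]] for every i."""
    return [mem[base + i] for i in index]


def vstx(mem, values, base, index):
    """Indexed store ("scatter"): write values[i] to mem[base + index[i]]."""
    for value, i in zip(values, index):
        mem[base + i] = value


# A 4x4 matrix stored row by row: column 1 is a stride-4 access.
matrix = list(range(16))
column1 = vlds(matrix, base=1, stride=4, vl=4)

# A sparse array: only the listed positions hold non-zero data.
sparse = [0.0] * 10
positions = [2, 5, 9]
vstx(sparse, [1.5, 2.5, 3.5], base=0, index=positions)
nonzeros = vldx(sparse, base=0, index=positions)
```

The strided case shows why a constant stride is useful for matrix columns, and the gather/scatter case shows how an index vector lets code over sparse data vectorize.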

### Vector Code Example

Y[0:63] = Y[0:63] + a*X[0:63]

Table 2 below compares the scalar and vector versions of this computation.

Table 2

| 64 element SAXPY: scalar | 64 element SAXPY: vector |
|---|---|
| LD R0, a | LD R0, a #load scalar a |
| ADDI R4, Rx, #512 | VLD V1, Rx #load vector X |
| loop: LD R2, 0(Rx) | VMUL.SV V2, R0, V1 #vector mult |
| MULTD R2, R0, R2 | VLD V3, Ry #load vector Y |
| LD R6, 0(Ry) | VADD.VV V4, V2, V3 #vector add |
| ADDD R6, R2, R6 | VST V4, Ry #store vector Y |
| SD R6, 0(Ry) | |
| ADDI Rx, Rx, #8 | |
| ADDI Ry, Ry, #8 | |
| SUB R20, R4, Rx | |
| BNZ R20, loop | |
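
The two versions of the SAXPY computation can be mirrored in Python to check that they produce the same result. This is a sketch under simplifying assumptions: lists stand in for memory and vector registers, and the function names `saxpy_scalar` and `saxpy_vector` are hypothetical.

```python
# Sketch comparing the scalar loop and the vector sequence for
# Y[0:63] = Y[0:63] + a*X[0:63]. Lists model memory and vector registers.


def saxpy_scalar(a, x, y):
    """Scalar version: one multiply and one add per loop iteration."""
    for i in range(len(x)):
        y[i] = y[i] + a * x[i]
    return y


def saxpy_vector(a, x, y):
    """Vector version: one line per vector instruction in the listing."""
    v1 = list(x)                           # VLD     V1, Rx
    v2 = [a * e for e in v1]               # VMUL.SV V2, R0, V1
    v3 = list(y)                           # VLD     V3, Ry
    v4 = [p + q for p, q in zip(v2, v3)]   # VADD.VV V4, V2, V3
    y[:] = v4                              # VST     (store vector Y)
    return y


x = [float(i) for i in range(64)]
result_scalar = saxpy_scalar(2.0, x, [1.0] * 64)
result_vector = saxpy_vector(2.0, x, [1.0] * 64)
```

Counting instructions in the listing, the scalar version executes 2 setup instructions plus 9 instructions in each of the 64 loop iterations, 578 in total, while the vector version executes 6. This is the reduction in instruction fetches and branches described in the properties above.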

### Vector Length

A vector register can hold some maximum number of elements for each data width, called the maximum vector length (MVL). What happens when the application vector length is not exactly MVL? A vector-length (VL) register controls the length of any vector operation, including a vector load or store. For example, VADD.VV with VL = 10 is equivalent to:

for (I = 0; I < 10; I++) V1[I] = V2[I] + V3[I]

VL can take any value from 0 to MVL.
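
The role of the VL register can be sketched in Python as below. The first function models a vector add that only touches the first `vl` elements. The second shows one common way to handle an array whose length is not a multiple of MVL, by processing it in chunks of at most MVL elements (often called strip-mining, a technique this section does not describe in detail). The value MVL = 64 and the function names are assumptions made for illustration.

```python
# Sketch of a vector add controlled by a vector-length (VL) value.
# Assumption: MVL = 64, matching the 64-element examples in this section.
MVL = 64


def vadd_vv(v2, v3, vl):
    """VADD.VV with VL = vl: only elements 0..vl-1 are processed."""
    if not 0 <= vl <= MVL:
        raise ValueError("VL must be between 0 and MVL")
    return [v2[i] + v3[i] for i in range(vl)]


def add_arrays(a, b):
    """Add two equal-length arrays of any length in chunks of at most MVL."""
    n = len(a)
    out = []
    start = 0
    while start < n:
        vl = min(MVL, n - start)  # last chunk may be shorter than MVL
        out.extend(vadd_vv(a[start:start + vl], b[start:start + vl], vl))
        start += vl
    return out


# VL = 10 on full 64-element registers processes only the first 10 elements.
first_ten = vadd_vv(list(range(64)), list(range(64)), 10)

# 150 elements are handled as chunks of 64, 64 and 22.
total = add_arrays(list(range(150)), list(range(150)))
```

Setting VL to the size of the final partial chunk avoids the need for a separate scalar clean-up loop.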

### Conclusion

This section introduced short vector processing (multimedia operations), which provides high-level operations on linear arrays of numbers.

### Assessment

1. State the properties of the vector processors

• Each result independent of previous result

=> Long pipeline, compiler ensures no dependencies

=> High clock rate

• Vector instructions access memory with known pattern

=> Highly interleaved memory

=> Amortize memory latency over $$\approx$$ 64 elements

=> No (data) caches required! (Do use instruction cache)

• Reduces branches and branch problems in pipelines
• Single vector instruction implies lots of work ($$\approx$$ loop)

=> Fewer instruction fetches

This page titled 2.2: Short Vector Processing (Multimedia Operations) is shared under a CC BY-SA license and was authored, remixed, and/or curated by Harrison Njoroge (African Virtual University) .