
2.2: Short Vector Processing (Multimedia Operations)


    Introduction

    This section introduces the learner to short vector processing (multimedia operations), which was initially developed for supercomputing applications. Today it is important for multimedia operations.

    Activity Details

    Vector processor

    A vector processor, also called an array processor, is a central processing unit (CPU) that implements an instruction set containing instructions that operate on one-dimensional arrays of data called vectors, in contrast to scalar processors, whose instructions operate on single data items. Vector processors can greatly improve performance on certain workloads, notably numerical simulation and similar tasks.

    Commonly called supercomputers, vector processors are machines built primarily to handle large scientific and engineering calculations. Their performance derives from a heavily pipelined architecture that operations on vectors and matrices can exploit efficiently.

    Vector processors have high-level operations that work on linear arrays of numbers: “vectors”.
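    To see what “high-level” means here, the following C loop is a minimal sketch (not from the original text) of the work that a single vector add instruction performs in one instruction:

      void vector_add(double *v1, const double *v2, const double *v3, int n)
      {
          /* One vector instruction such as VADD.VV V1, V2, V3 replaces this
           * entire loop: each element result is independent of the others. */
          for (int i = 0; i < n; i++)
              v1[i] = v2[i] + v3[i];
      }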

    Properties of Vector Processors

    Vector processors have the following properties:

    1. A single vector instruction implies a lot of work (an entire loop), i.e. fewer instruction fetches

    2. Each result is independent of the previous result (a vectorizable vs. non-vectorizable loop is sketched after this list)
      a. Multiple operations can be executed in parallel
      b. Simpler design, high clock rate
      c. Compiler (programmer) ensures no dependencies

    3. Reduces branches and branch problems in pipelines

    4. Vector instructions access memory with known pattern
      a. Effective prefetching
      b. Amortize memory latency over a large number of elements
      c. Can exploit a high bandwidth memory system
      d. No (data) caches required!
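    As an illustration of property 2, here is a minimal C sketch (an assumption for illustration, not from the original text): the first loop has independent element results and can be turned into vector instructions, while the second carries a dependence from one iteration to the next and cannot be vectorized as written.

      /* Vectorizable: a[i] depends only on b[i] and c[i]. */
      void scale_mul(double *a, const double *b, const double *c, int n)
      {
          for (int i = 0; i < n; i++)
              a[i] = b[i] * c[i];
      }

      /* Not vectorizable as written: a[i] needs the previous result a[i-1]. */
      void running_sum(double *a, int n)
      {
          for (int i = 1; i < n; i++)
              a[i] = a[i] + a[i - 1];
      }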

    Styles of Vector Architectures

    There are two styles:

    1. Memory-memory vector processors

      a. All vector operations are memory to memory

    2. Vector-register processors

      a. All vector operations are between vector registers (except vector load and store)

      b. Vector equivalent of load-store architectures

      c. Includes all vector machines since the late 1980s

      d. We assume vector-register machines for the rest of this section

    Components of a Vector Processor

    • Scalar CPU: registers, data paths, instruction fetch logic

    • Vector register

      • Fixed length memory bank holding a single vector

      • Typically 8-32 vector registers, each holding 1 to 8 Kbits

      • Has at least 2 read ports and 1 write port

      • For multimedia (MM) operations: can be viewed as an array of 64-bit, 32-bit, 16-bit, or 8-bit elements

    • Vector functional units (FUs)

      • Fully pipelined, start new operation every clock

      • Typically 2 to 8 FUs: integer and FP

      • Multiple data paths (pipelines) used for each unit to process multiple elements per cycle

    • Vector load-store units (LSUs)

      • Fully pipelined unit to load or store a vector

      • Multiple elements fetched/stored per cycle

      • May have multiple LSUs

    • Cross-bar to connect FUs, LSUs, and registers

    Basic Vector Instructions

    These are shown in Table 1 below:

    Table 1

    Instr.    Operands      Operation                        Comment
    VADD.VV   V1, V2, V3    V1 = V2 + V3                     vector + vector
    VADD.SV   V1, R0, V2    V1 = R0 + V2                     scalar + vector
    VMUL.VV   V1, V2, V3    V1 = V2 \(\times\) V3            vector \(\times\) vector
    VMUL.SV   V1, R0, V2    V1 = R0 \(\times\) V2            scalar \(\times\) vector
    VLD       V1, R1        V1 = M[R1..R1 + 63]              load, stride = 1
    VLDS      V1, R1, R2    V1 = M[R1..R1 + 63*R2]           load, stride = R2
    VLDX      V1, R1, V2    V1 = M[R1 + V2i], i = 0..63      indexed ("gather")
    VST       V1, R1        M[R1..R1 + 63] = V1              store, stride = 1
    VSTS      V1, R1, R2    M[R1..R1 + 63*R2] = V1           store, stride = R2
    VSTX      V1, R1, V2    M[R1 + V2i] = V1, i = 0..63      indexed ("scatter")

    + all the regular scalar instructions (RISC style)...
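    As a rough guide to the memory instructions in Table 1, the three load forms correspond to the following C loops (an illustrative sketch, assuming a fixed length of 64 elements and strides/indices measured in elements):

      #define VL 64   /* element count used in Table 1 */

      /* VLD V1, R1 -- unit-stride load */
      void vld(double *v1, const double *m)
      {
          for (int i = 0; i < VL; i++) v1[i] = m[i];
      }

      /* VLDS V1, R1, R2 -- load with constant stride */
      void vlds(double *v1, const double *m, long stride)
      {
          for (int i = 0; i < VL; i++) v1[i] = m[i * stride];
      }

      /* VLDX V1, R1, V2 -- indexed load ("gather") */
      void vldx(double *v1, const double *m, const long *index)
      {
          for (int i = 0; i < VL; i++) v1[i] = m[index[i]];
      }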


    Vector Memory Operations

    • Load/store operations move groups of data between registers and memory

    • Three types of addressing

      • Unit stride

        • Fastest

      • Non-unit (constant) stride

      • Indexed (gather-scatter)

        • Vector equivalent of register indirect

        • Good for sparse arrays of data (see the sketch after this list)

        • Increases the number of programs that vectorize

        • Compress/expand variant also

    • Support for various combinations of data widths in memory

      • {.L, .W, .H, .B} \(\times\) {64b, 32b, 16b, 8b}
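    The value of indexed (gather-scatter) addressing is easiest to see on a sparse update. The C loop below is a hypothetical sketch (assuming the index values within one vector are distinct): it vectorizes by gathering a[k[i]], adding, and scattering the results back.

      void sparse_update(double *a, const double *c, const int *k, int n)
      {
          /* Gather a[k[i]], add c[i], scatter back: VLDX / VADD.VV / VSTX. */
          for (int i = 0; i < n; i++)
              a[k[i]] = a[k[i]] + c[i];
      }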

    Vector Code Example

    Y[0:63] = Y[0:63] + a*X[0:63]

    See Tables 2 and 3 below; a short-vector (SIMD) version in C follows them.

    Table 2

    64-element SAXPY: scalar               64-element SAXPY: vector

    LD      R0, a                          LD      R0, a         # load scalar a
    ADDI    R4, Rx, #512                   VLD     V1, Rx        # load vector X
    loop: LD R2, 0(Rx)                     VMUL.SV V2, R0, V1    # vector multiply
    MULTD   R2, R0, R2                     VLD     V3, Ry        # load vector Y
    LD      R4, 0(Ry)                      VADD.VV V4, V2, V3    # vector add
    ADDD    R4, R2, R4                     VST     Ry, V4        # store vector Y

    Table 3 (remainder of the scalar loop)

    SD      R4, 0(Ry)
    ADDI    Rx, Rx, #8
    ADDI    Ry, Ry, #8
    SUB     R20, R4, Rx
    BNZ     R20, loop
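    For the “multimedia operations” of the section title, the same SAXPY idea appears today as short-vector (SIMD) extensions in ordinary CPUs. The sketch below uses x86 SSE intrinsics, which are not part of the vector ISA in Tables 1-3, to process four single-precision elements per instruction; it assumes n is a multiple of 4.

      #include <xmmintrin.h>   /* SSE: 128-bit registers holding 4 x float */

      void saxpy_sse(float *y, const float *x, float a, int n)
      {
          __m128 va = _mm_set1_ps(a);              /* broadcast scalar a   */
          for (int i = 0; i < n; i += 4) {
              __m128 vx = _mm_loadu_ps(&x[i]);     /* load 4 elements of X */
              __m128 vy = _mm_loadu_ps(&y[i]);     /* load 4 elements of Y */
              vy = _mm_add_ps(vy, _mm_mul_ps(va, vx));
              _mm_storeu_ps(&y[i], vy);            /* store 4 elements of Y */
          }
      }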

    Vector Length

    A vector register can hold some maximum number of elements for each data width (the maximum vector length, or MVL). What should be done when the application vector length is not exactly MVL? A vector-length (VL) register controls the length of any vector operation, including a vector load or store. For example, vadd.vv with VL = 10 performs: for (I = 0; I < 10; I++) V1[I] = V2[I] + V3[I]

    The VL can be anything from 0 to MVL
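    In C terms, handling an application vector length that is not a multiple of MVL is done by strip-mining: process the array in chunks of at most MVL elements and set VL for each chunk. A minimal sketch follows (MVL = 64 is an assumption, matching the examples above):

      #define MVL 64

      void vadd_stripmined(double *v1, const double *v2, const double *v3, int n)
      {
          for (int i = 0; i < n; i += MVL) {
              int vl = (n - i < MVL) ? (n - i) : MVL;  /* value loaded into VL   */
              for (int j = 0; j < vl; j++)             /* one vadd.vv with VL=vl */
                  v1[i + j] = v2[i + j] + v3[i + j];
          }
      }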

    Conclusion

    This section introduced the learner to short vector processing (multimedia operations), which provides high-level operations on linear arrays of numbers.

    Assessment

    1. State the properties of the vector processors

    • Each result independent of previous result

    => Long pipeline, compiler ensures no dependencies

    => High clock rate

    • Vector instructions access memory with known pattern

    => Highly interleaved memory

    => Amortize memory latency over \(\approx\) 64 elements

    => No (data) caches required! (Do use instruction cache)

    • Reduces branches and branch problems in pipelines
    • Single vector instruction implies lots of work (\(\approx\) loop)

    => Fewer instruction fetches


    This page titled 2.2: Short Vector Processing (Multimedia Operations) is shared under a CC BY-SA license and was authored, remixed, and/or curated by Harrison Njoroge (African Virtual University) .
