Fixed vs floating point representation
When we work with FPGAs, we need to deal with fixed-point arithmetic. Even though recent FPGAs such as the Intel Stratix 10 implement floating-point multipliers, Digital Signal Processing (DSP) on an FPGA is still, in most cases, implemented in fixed-point arithmetic.
Many people struggle with fixed-point binary representation, or with quantizing a floating-point value to fixed point.
Fixed-point representation introduction
In VHDL we generally use a signed binary fixed-point number representation. Using the signed binary representation, we can take advantage of the standard library for the
Just to be clear, you can also use an unsigned binary number representation, but if you are implementing a DSP architecture it is very likely that you will need to deal with both positive and negative numbers.
In signed fixed-point representation we need to define the number of bits we are using to represent our number.
For example, if we use 8 bits, we can represent every integer in the range -128 <= Number <= +127
i.e. -2^(N-1) <= Number <= +(2^(N-1))-1 with N = 8.
In the signed (two's complement) representation there is one more negative value than positive values.
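The asymmetry of the signed range is easier to see on the bit patterns. The following is a minimal Python sketch, assuming two's complement encoding; the helper name `to_twos_complement` is hypothetical and only used for illustration. Zero takes one of the codes whose most significant bit is 0, so only 127 codes remain for positive values, while all 128 codes whose most significant bit is 1 are negative.

```python
def to_twos_complement(value: int, n_bits: int) -> str:
    """Return the n_bits-wide two's complement bit string of value (illustrative helper)."""
    lo = -(1 << (n_bits - 1))
    hi = (1 << (n_bits - 1)) - 1
    if not lo <= value <= hi:
        raise ValueError(f"{value} does not fit in {n_bits} signed bits")
    # Masking with 2^N - 1 maps negative values onto their two's complement code.
    return format(value & ((1 << n_bits) - 1), f"0{n_bits}b")


for v in (-128, -1, 0, 127):
    print(f"{v:5d} -> {to_twos_complement(v, 8)}")
```

With 8 bits, -128 maps to 10000000 and +127 maps to 01111111; +128 is rejected because it has no 8-bit signed code.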
If we deal with unsigned binary numbers, the range is
0 <= Number <= +(2^N)-1
For example, an 8-bit unsigned number covers the range 0..255.
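As a quick check of the two range formulas, here is a small Python sketch that evaluates them for a few word lengths. The function names `signed_range` and `unsigned_range` are assumptions made for this example.

```python
from typing import Tuple


def signed_range(n_bits: int) -> Tuple[int, int]:
    """(min, max) of an n_bits signed integer: -2^(N-1) .. 2^(N-1)-1."""
    return -(2 ** (n_bits - 1)), 2 ** (n_bits - 1) - 1


def unsigned_range(n_bits: int) -> Tuple[int, int]:
    """(min, max) of an n_bits unsigned integer: 0 .. 2^N-1."""
    return 0, 2 ** n_bits - 1


for n in (8, 12, 16):
    print(n, "bits: signed", signed_range(n), "unsigned", unsigned_range(n))
```

For N = 8 this gives (-128, 127) for the signed case and (0, 255) for the unsigned case, matching the values above.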
Fixed-point representation of an FIR coefficient
What we are going to say in this section is valid for any kind of binary quantization.
When we need to quantize the impulse response of an FIR filter, we need to:
- set the number of bits used to represent the coefficients
- scale the floating-point impulse response with respect to its maximum value
- convert the scaled coefficients to fixed-point values
    N = 8;
    MaxPos = (2^(N-1))-1;
    MaxNeg = (2^(N-1));
    h = [ 0.02674 -0.01668 -0.07822 0.26686 0.60294 0.26686 -0.07822 -0.01668 0.02674];
    hNorm = h./max(h)
    ptrNeg = find(hNorm<0);
    hQ = hNorm .* MaxPos
    hQ(ptrNeg) = hNorm(ptrNeg) .* MaxNeg
    hQ = round(hQ)
    plot(h)
    title('Floating-Point Response')
    figure
    plot(hQ)
    title('Quantized Response')
Figure 1 shows the floating-point and quantized versions of the FIR impulse response.
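The statement hNorm = h./max(h) normalizes the impulse response to the range -1..+1.
Notice that we are assuming that the maximum value is positive and is also the largest value in magnitude; otherwise the normalized negative values would fall below -1.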
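If that assumption does not hold, one possible alternative (not the method used in this post) is to normalize by the largest absolute value instead. The Python sketch below compares the two choices on a made-up three-tap response whose largest magnitude is negative; the function name `normalize` and the `use_abs` flag are hypothetical.

```python
def normalize(h, use_abs=True):
    """Scale h by its peak; use_abs=False reproduces the h./max(h) choice."""
    peak = max(abs(x) for x in h) if use_abs else max(h)
    if peak == 0:
        raise ValueError("all-zero impulse response cannot be normalized")
    return [x / peak for x in h]


# Made-up example: the largest magnitude belongs to a negative tap.
h_example = [0.25, -0.5, 0.125]

print(normalize(h_example, use_abs=False))  # [1.0, -2.0, 0.5], outside -1..+1
print(normalize(h_example))                 # [0.5, -1.0, 0.25], inside -1..+1
```

Dividing by max(h) leaves the -0.5 tap at -2.0, which would overflow after scaling, whereas dividing by the largest magnitude keeps every value within -1..+1.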
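The statement ptrNeg = find(hNorm<0) finds the indices of all the negative values.
The statement hQ = hNorm .* MaxPos quantizes the positive values by multiplying the normalized impulse response by the maximum positive value.
The statement hQ(ptrNeg) = hNorm(ptrNeg) .* MaxNeg then overwrites the negative entries, scaling them by the magnitude of the most negative value.
Finally, hQ = round(hQ) rounds the scaled floating-point coefficients to the nearest integers, which gives the fixed-point values.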
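For readers without Scilab or MATLAB at hand, the steps described above can be sketched in plain Python. This is a port written for illustration, not the original listing: the names `round_half_away` and `quantize_coefficients` are assumptions, and the rounding helper rounds ties away from zero on the assumption that this matches the behaviour of round in the listing (no coefficient here falls exactly on a tie, so the result does not depend on it).

```python
import math


def round_half_away(x: float) -> int:
    """Round to the nearest integer, ties away from zero."""
    return int(math.copysign(math.floor(abs(x) + 0.5), x))


def quantize_coefficients(h, n_bits=8):
    """Quantize FIR coefficients to n_bits signed integers."""
    max_pos = 2 ** (n_bits - 1) - 1   # e.g. 127 for 8 bits
    max_neg = 2 ** (n_bits - 1)       # e.g. 128 for 8 bits
    peak = max(h)                     # assumes the peak is positive, as in the post
    h_q = []
    for x in h:
        x_norm = x / peak
        # Negative values are scaled by 2^(N-1), positive ones by 2^(N-1)-1.
        scale = max_neg if x_norm < 0 else max_pos
        h_q.append(round_half_away(x_norm * scale))
    return h_q


h = [0.02674, -0.01668, -0.07822, 0.26686, 0.60294,
     0.26686, -0.07822, -0.01668, 0.02674]

h_q = quantize_coefficients(h)
print(h_q)  # [6, -4, -17, 56, 127, 56, -17, -4, 6]
```

Every quantized coefficient lies within -128..+127, so each one fits in an 8-bit signed word.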
In this post, we learned how to quantize the impulse response of an FIR filter to fixed point. The quantization was implemented using Scilab commands (compatible with MATLAB).