When FPGAs do not implement a multiplier HW macro
Modern FPGAs implement multiplication using dedicated hardware resources. These resources typically implement at least an 18×18 multiply-and-accumulate function, and in many FPGAs a single 18×18 multiplier block can also be configured as two independent 9×9 multipliers. The details depend on the technology you are using.
Depending on the technology and FPGA you are using, there may be no dedicated multiplier at all. In this case, performing a multiplication can be a problem, because it has to be built from general-purpose logic. Multiplication is a very demanding operator in terms of area and timing, so you need to pay attention to the number of bits of the operands in order to minimize the area and timing impact on your FPGA.
In some particular cases, it is possible to reduce the number of bits of the multiplier operands.
In this post, I want to present an example that you can use as a guideline for multiplier optimization.
Sine and Cosine quantization
A typical example where we can optimize the number of bits of a multiplier is when one of the two operands is a quantized sine or cosine value.
DISCLAIMER: this is a particular case; it cannot be applied to every multiplication.
The assumption is that we need to perform multiplications like the following:
$$M_1 = op_1 \cdot \sin(a), \qquad M_2 = op_2 \cdot \cos(a) \qquad \text{(EQ1)}$$
op1 and op2 are quantized with N bits, while sin(a) and cos(a) are quantized with K bits.
There is a very particular case where the angle “a” takes values ONLY in a restricted range, for instance ±10°.
In this case, cos(a) needs all K bits, but sin(a) needs only a few. Let's give a practical example with K = 8:
sin(±10°) = ±0.1736
cos(±10°) = 0.9848
Using 8 bits (scale factor 2^7 = 128), we have the quantized values:
round(128*sin(+10°)) = +22
round(128*sin(-10°)) = -22
round(128*cos(±10°)) = 126
As is clear, even with 8-bit quantization, the sine really needs only 6 bits instead of 8, since a 6-bit signed number covers the range [-32..+31], which contains [-22..+22].
For the cosine we have a problem: 126 needs all 8 bits, since a 7-bit signed number only covers [-64..+63]!
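The quantized values and bit widths above can be reproduced in simulation with a small VHDL package built on ieee.math_real. This is only a sketch for simulation (type real is not synthesizable); the package name trig_quant_pkg and the functions deg2rad, quantize and signed_bits are hypothetical helpers, not part of the original design.

```vhdl
library ieee;
use ieee.math_real.all;

-- Hypothetical helper package (simulation only: type real is not synthesizable)
package trig_quant_pkg is
  -- converts an angle from degrees to radians
  function deg2rad(d : real) return real;
  -- round(scale * x); scale = 2**(K-1) = 128 for K = 8
  function quantize(x : real; scale : real) return integer;
  -- minimum width of a two's complement number that can hold v
  function signed_bits(v : integer) return natural;
end package trig_quant_pkg;

package body trig_quant_pkg is
  function deg2rad(d : real) return real is
  begin
    return d * MATH_PI / 180.0;
  end function deg2rad;

  function quantize(x : real; scale : real) return integer is
  begin
    return integer(round(scale * x));
  end function quantize;

  function signed_bits(v : integer) return natural is
    variable n : natural := 1;
  begin
    -- grow n until v fits in [-2**(n-1), 2**(n-1)-1] (valid for |v| < 2**30)
    while v < -(2**(n-1)) or v > 2**(n-1) - 1 loop
      n := n + 1;
    end loop;
    return n;
  end function signed_bits;
end package body trig_quant_pkg;
```

For example, quantize(sin(deg2rad(10.0)), 128.0) returns 22, signed_bits(22) returns 6, and signed_bits(126) returns 8, matching the 6-bit sine and 8-bit cosine widths discussed above.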
Optimizing the cosine quantization
We have seen that the quantized sine ranges in [-22..+22], so 6 bits are enough, while the quantized cosine needs all 8 bits.
As you know, any number C can be written as:
$$C = (C - 1) + 1 \qquad \text{(EQ2)}$$
We will not win the Nobel prize for this equation, but it is useful for cosine quantization. Using EQ2, we can rewrite the cosine value as follows:
$$\cos(\pm 10^\circ) = (\cos(\pm 10^\circ) - 1) + 1 = (0.9848 - 1) + 1 = -0.0152 + 1 \qquad \text{(EQ3)}$$
If we quantize EQ3 using the same scale factor of 128, we have:
round(128*(cos(±10°)-1)) + 128 = -2 + 128 = 126
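As is clear, the quantized cosine term now needs far fewer than 8 bits: over the ±10° range, round(128*(cos(a)-1)) only takes values between -2 and 0, which fit in a 2-bit signed number, while the +128 term is a multiplication by 2^7 that does not require a multiplier.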
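To check that the reduced width holds over the whole ±10° range, and not only at its edges, the offset term of EQ3 can be computed in simulation. A sketch, again relying on ieee.math_real and therefore simulation-only; the package cos_offset_pkg and the function cos_offset_q are hypothetical names.

```vhdl
library ieee;
use ieee.math_real.all;

-- Hypothetical helper package: quantized cosine offset of EQ3 (simulation only)
package cos_offset_pkg is
  -- round(scale * (cos(a) - 1)), with the angle a given in degrees
  function cos_offset_q(deg : real; scale : real) return integer;
end package cos_offset_pkg;

package body cos_offset_pkg is
  function cos_offset_q(deg : real; scale : real) return integer is
  begin
    return integer(round(scale * (cos(deg * MATH_PI / 180.0) - 1.0)));
  end function cos_offset_q;
end package body cos_offset_pkg;
```

Scanning the angle from -10° to +10° in 0.1° steps, cos_offset_q(angle, 128.0) stays in [-2..0], and cos_offset_q(10.0, 128.0) + 128 gives back 126.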
VHDL implementation of multiplier optimization
The multiplications in EQ1 can be implemented in VHDL as follows.
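```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity mult is
  port (
    i_clk  : in  std_logic;
    i_rstb : in  std_logic;
    i_sin  : in  std_logic_vector( 5 downto 0);  -- round(128*sin(a)), a in +/-10 deg: [-22..+22]
    i_cos  : in  std_logic_vector( 3 downto 0);  -- round(128*(cos(a)-1)), a in +/-10 deg: [-2..0]
    i_op1  : in  std_logic_vector( 7 downto 0);
    i_op2  : in  std_logic_vector( 7 downto 0);
    o_m1   : out std_logic_vector(15 downto 0);  -- 8+8 bit output: op1*sin
    o_m2   : out std_logic_vector(15 downto 0)); -- 8+8 bit output: op2*cos
end mult;

architecture rtl of mult is
  -- used to implement the left shift by 7 (multiply by 2^7 = 128) of op2
  constant C_ZERO_FILL : std_logic_vector(6 downto 0) := (others => '0');
  -- input registers: operands and coefficients are sampled on the same edge
  signal r_op1      : signed(i_op1'length-1 downto 0);
  signal r_op2      : signed(i_op2'length-1 downto 0);
  signal r_sin      : signed(i_sin'length-1 downto 0);
  signal r_cos      : signed(i_cos'length-1 downto 0);
  -- op2*128, registered in the same stage as r_m2 to keep the terms aligned
  signal r_op2_x128 : signed(i_op2'length+C_ZERO_FILL'length-1 downto 0);
  -- reduced-width products: 6x8 and 4x8 bits instead of 8x8
  signal r_m1       : signed(i_sin'length+i_op1'length-1 downto 0);
  signal r_m2       : signed(i_cos'length+i_op2'length-1 downto 0);
begin

  p_mult : process (i_rstb, i_clk)
  begin
    if (i_rstb = '0') then
      r_op1      <= (others => '0');
      r_op2      <= (others => '0');
      r_sin      <= (others => '0');
      r_cos      <= (others => '0');
      r_op2_x128 <= (others => '0');
      r_m1       <= (others => '0');
      r_m2       <= (others => '0');
      o_m1       <= (others => '0');
      o_m2       <= (others => '0');
    elsif rising_edge(i_clk) then
      -- stage 1: register the inputs
      r_op1 <= signed(i_op1);
      r_op2 <= signed(i_op2);
      r_sin <= signed(i_sin);
      r_cos <= signed(i_cos);
      -- stage 2: small multipliers; op2*2^7 is only wiring (append 7 zeros)
      r_m1       <= r_sin * r_op1;
      r_m2       <= r_cos * r_op2;
      r_op2_x128 <= r_op2 & signed(C_ZERO_FILL);
      -- stage 3: sign-extend to 16 bits and add back the op2*128 term
      o_m1 <= std_logic_vector(resize(r_m1, o_m1'length));
      o_m2 <= std_logic_vector(resize(r_m2, o_m2'length) + resize(r_op2_x128, o_m2'length));
    end if;
  end process p_mult;

end rtl;
```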
In the VHDL code for the multiplier, the +128 part of the cosine does not need a multiplier: the term op2*128 is obtained by left-shifting op2 by 7 bits (appending seven zeros) and is then added to the product op2*round(128*(cos(a)-1)). As you know, a multiplication by a power of two, 2^N, can be implemented as a left shift by N bits.
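The identity op2*126 = op2*(-2) + op2*2^7 can also be captured in a combinational VHDL function, which is handy as a golden model in a testbench. This is a sketch assuming an 8-bit op2 and a 4-bit cosine offset, as in the entity above; the package cos_mult_pkg and the function mult_cos_opt are hypothetical. The shift is written with numeric_std shift_left after sign extension, which produces the same bits as appending seven zeros.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical helper package: combinational model of the optimized cosine product
package cos_mult_pkg is
  -- op2 * (c_off + 128) using a 4x8 multiplier plus op2 shifted left by 7
  function mult_cos_opt(op2 : signed(7 downto 0); c_off : signed(3 downto 0)) return signed;
end package cos_mult_pkg;

package body cos_mult_pkg is
  function mult_cos_opt(op2 : signed(7 downto 0); c_off : signed(3 downto 0)) return signed is
    variable prod    : signed(11 downto 0);  -- 4 + 8 bit product
    variable op2x128 : signed(15 downto 0);  -- op2 * 2^7
  begin
    prod    := c_off * op2;
    -- sign-extend, then shift left by 7: same bits as op2 & "0000000"
    op2x128 := shift_left(resize(op2, 16), 7);
    return resize(prod, 16) + op2x128;
  end function mult_cos_opt;
end package body cos_mult_pkg;
```

For every 8-bit value of op2, mult_cos_opt(op2, to_signed(-2, 4)) equals op2*126, so a 4×8 multiplier plus an adder replaces the 8×8 multiplier.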
Simulation of VHDL implementation of multiplier optimization
The simulation reports the values of the multiplications
op1*sin(10°)
op2*cos(10°)
where the second multiplication is implemented using the optimization presented in the previous section.
In this case, the value passed to the VHDL code for cos(10°) is not the quantization
round(128*cos(10°)) = round(128*0.9848) = 126
but the value
round(128*(cos(10°)-1)) = -2
The error signals compare the outputs of the VHDL code above with the classical multiplication
op2*cos(10°) = op2*126
As is clear, the output of the optimized VHDL multiplication matches the result of the classical multiplication.
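A comparison like this can be automated with a self-checking testbench. The sketch below is an assumption, not the testbench used for the simulation described above: the entity name tb_mult, the 10 ns clock and the four-clock settling wait are hypothetical choices, and it must be compiled together with the mult entity. It feeds i_sin = 22 and i_cos = -2, sweeps op1 and op2 over all 8-bit values, and asserts that the outputs equal op1*22 and op2*126.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical self-checking testbench for the mult entity
entity tb_mult is
end entity tb_mult;

architecture sim of tb_mult is
  constant C_SIN    : integer := 22;   -- round(128*sin(10 deg)), fed to i_sin
  constant C_COS_M1 : integer := -2;   -- round(128*(cos(10 deg)-1)), fed to i_cos
  constant C_COS    : integer := 126;  -- round(128*cos(10 deg)), reference only
  signal clk   : std_logic := '0';
  signal rstb  : std_logic := '0';
  signal sin_q : std_logic_vector(5 downto 0) := std_logic_vector(to_signed(C_SIN, 6));
  signal cos_q : std_logic_vector(3 downto 0) := std_logic_vector(to_signed(C_COS_M1, 4));
  signal op1   : std_logic_vector(7 downto 0) := (others => '0');
  signal op2   : std_logic_vector(7 downto 0) := (others => '0');
  signal m1    : std_logic_vector(15 downto 0);
  signal m2    : std_logic_vector(15 downto 0);
  signal done  : boolean := false;
begin
  -- 10 ns clock, stopped at the end of the test
  p_clk : process
  begin
    while not done loop
      clk <= '0'; wait for 5 ns;
      clk <= '1'; wait for 5 ns;
    end loop;
    wait;
  end process p_clk;

  dut : entity work.mult
    port map (i_clk => clk, i_rstb => rstb, i_sin => sin_q, i_cos => cos_q,
              i_op1 => op1, i_op2 => op2, o_m1 => m1, o_m2 => m2);

  p_stim : process
  begin
    wait for 20 ns;
    rstb <= '1';
    for v in -128 to 127 loop
      op1 <= std_logic_vector(to_signed(v, 8));
      op2 <= std_logic_vector(to_signed(v, 8));
      -- hold the operands for 4 clocks so the register pipeline is flushed
      for i in 1 to 4 loop
        wait until rising_edge(clk);
      end loop;
      wait for 1 ns;
      -- compare against the classical multiplications op1*22 and op2*126
      assert to_integer(signed(m1)) = v * C_SIN
        report "o_m1 mismatch" severity failure;
      assert to_integer(signed(m2)) = v * C_COS
        report "o_m2 mismatch" severity failure;
    end loop;
    report "all 256 operand values checked";
    done <= true;
    wait;
  end process p_stim;
end architecture sim;
```

The four-clock wait only needs to cover the register pipeline of the design under test; holding each operand until the pipeline settles keeps the check independent of the exact latency.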
Conclusion
In this post, we addressed a possible optimization that can be adopted in a multiplier when the value of one of the operands is close to its maximum. A typical example is the cosine when the angle is in the range ±10°.
In this case, we don’t need all the bits to represent the cosine value; instead, we can take advantage of the equation:
$$\cos(a) = (\cos(a) - 1) + 1$$