High-speed hardware implementation of fixed and runtime variable window length 1-D median filters

Supplementary results regarding this paper can be found below. The Verilog source code for regenerating the results of the paper is available for download. The article and its source code can be cited as follows:

Nikahd, Eesa, Payman Behnam, and Reza Sameni. "High-speed hardware implementation of fixed and runtime variable window length 1-D median filters." IEEE Transactions on Circuits and Systems II: Express Briefs 63, no. 5 (2016): 478-482.

 

Supplementary Results

Nonlinear digital filters play an important role in digital signal processing applications. In this paper, a novel architecture is proposed for the hardware implementation of fixed and run-time variable window length one-dimensional (1-D) median filters. In the proposed architecture, the maximum working clock frequency is almost independent of the median filter window length, while the hardware complexity is proportional to the number of samples in the window. This feature enables the construction of filters with relatively large window lengths with negligible reduction in the maximum clock frequency, whereas in previous architectures the maximum clock frequency drops significantly as the window length increases. The benchmark results show the efficiency of the proposed architecture in comparison with state-of-the-art techniques.

The following additional analyses were carried out during the revision and are provided here for the reviewers' information. Due to lack of space, only a summary of these results has been added to the paper.
1- Detailed post-synthesis power results for the variable length case
The detailed dynamic and static power consumption figures are as follows. The power results were extracted from the Design Compiler tool with the "power_preserve_rtl_hier_names" attribute set to true. In Table IV of the paper, only the total power (static plus dynamic) has been reported to save space.

MAXIMUM THROUGHPUT, AREA AND POWER RESULTS FOR VARIABLE WINDOW LENGTH ON 45NM CMOS ASIC

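| Max L | Max Freq. (MHz), 8-bit | Max Freq. (MHz), 16-bit | Area (µm²), 8-bit | Area (µm²), 16-bit | Static power (µW), 8-bit | Dyn. power (µW), 8-bit | Static power (µW), 16-bit | Dyn. power (µW), 16-bit |
|---|---|---|---|---|---|---|---|---|
| 5 | 1324.2 | 1324.2 | 1029.1 | 1873.2 | 9.2 | 70.2 | 17.9 | 131.1 |
| 8 | 1309.7 | 1309.7 | 1625.4 | 3086.4 | 15.7 | 113.1 | 29.5 | 212.3 |
| 20 | 1261.6 | 1255.2 | 3946.3 | 7891.5 | 43.7 | 287.2 | 83.3 | 530.8 |
| 51 | 1219.7 | 1210.8 | 9614.1 | 18113.6 | 104.2 | 692 | 188.8 | 1298.3 |
| 100 | 1157.8 | 1148.1 | 19783.1 | 38397.9 | 213.4 | 1410.8 | 385.2 | 2641.7 |
| 201 | 1089.2 | 1075.2 | 39411.6 | 76469.9 | 435.5 | 2852.2 | 792.7 | 537.2 |
| 301 | 998.9 | 974.2 | 59099.2 | 110625.4 | 685.3 | 4308 | 1235.6 | 8079 |
| 600 | 915.8 | 880.1 | 118286.2 | 211306.2 | 1372.8 | 8667.6 | 2467 | 16261.7 |
| 1001 | 877.1 | 828.4 | 233693.2 | 403147.1 | 2298.2 | 14382.9 | 4101.2 | 26101.5 |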
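As a quick cross-check of how the static and dynamic columns relate to the single total reported in the paper, the following Python sketch adds the two components for three 8-bit rows copied from the table above. The names `POWER_8BIT_UW`, `total_power_uw` and `static_share` are illustrative assumptions and are not part of the released Verilog sources.

```python
# Static and dynamic power (in microwatts) of the 8-bit variable-length
# filter, copied from three rows of the table: {max L: (static, dynamic)}.
POWER_8BIT_UW = {
    5: (9.2, 70.2),
    100: (213.4, 1410.8),
    1001: (2298.2, 14382.9),
}


def total_power_uw(static_uw, dynamic_uw):
    """Total power is the sum of the static and dynamic components."""
    return static_uw + dynamic_uw


def static_share(static_uw, dynamic_uw):
    """Fraction of the total power that is static."""
    return static_uw / total_power_uw(static_uw, dynamic_uw)


if __name__ == "__main__":
    for max_len, (static_uw, dynamic_uw) in POWER_8BIT_UW.items():
        total = total_power_uw(static_uw, dynamic_uw)
        share = 100.0 * static_share(static_uw, dynamic_uw)
        print(f"L={max_len}: total={total:.1f} uW, static share={share:.1f}%")
```

For these rows the static component stays a small fraction of the total, so the dynamic term dominates the totals quoted in the paper.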

2- FPGA-based implementation
For the FPGA implementation, Quartus II 12.1 from Altera was used as the synthesis tool, and an Altera Cyclone IVE FPGA with the following specifications was used as the target device: chip number EP4CE115F29I8L, 114480 logic cells, 3981312 memory bits, 1.0 V core voltage. Tables I and II show the results for fixed window length in terms of maximum clock frequency and logic cells with 8-bit and 16-bit word lengths, respectively. As expected, the overall trends of the maximum frequency and area are the same as in the ASIC results. Table III repeats these results for the variable-length filter, and the FPGA Tables IV and V report the hardware utilization for the fixed-length case.

Table I. Maximum Frequency And Area Results For Fixed Window Length And 8-Bit Word Length On Altera Cyclone IVE FPGA

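| L | Max Freq. (MHz), Moshnyaga[4] | Max Freq. (MHz), Chen[10] | Max Freq. (MHz), Our arch. | Area (logic cells), Moshnyaga[4] | Area (logic cells), Chen[10] | Area (logic cells), Our arch. |
|---|---|---|---|---|---|---|
| 5 | 144.7 | 151.4 | 157.6 | 210 | 201 | 197 |
| 8 | 127.8 | 141.3 | 145.8 | 354 | 345 | 337 |
| 20 | 91.8 | 110.7 | 140.5 | 1101 | 919 | 865 |
| 51 | 55.5 | 73.2 | 129.6 | 2689 | 2483 | 2187 |
| 100 | 36.1 | 44.3 | 127.3 | 5198 | 4498 | 4389 |
| 201 | 22.2 | 24.1 | 120.5 | 10229 | 9811 | 8789 |
| 301 | 15.8 | 16.7 | 116.8 | 16921 | 15623 | 13191 |
| 600 | 7.6 | 8.1 | 110.2 | 34989 | 33042 | 26393 |
| 1001 | 4.2 | 4.9 | 107.1 | 57289 | 55375 | 43995 |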
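To make the frequency trend in Table I concrete, the sketch below computes how much each architecture's maximum clock frequency drops between L = 5 and L = 1001, and the resulting frequency ratio at L = 1001. The values are copied from the 8-bit table; the names `FMAX_MHZ`, `frequency_drop_percent` and `frequency_ratio` are hypothetical helpers introduced only for this illustration.

```python
# Maximum clock frequency (MHz) at the smallest and largest window lengths,
# copied from Table I (8-bit words): {architecture: (f at L=5, f at L=1001)}.
FMAX_MHZ = {
    "Moshnyaga[4]": (144.7, 4.2),
    "Chen[10]": (151.4, 4.9),
    "Our arch.": (157.6, 107.1),
}


def frequency_drop_percent(f_small_window, f_large_window):
    """Relative loss of maximum frequency when the window grows."""
    return 100.0 * (f_small_window - f_large_window) / f_small_window


def frequency_ratio(f_ours, f_other):
    """How many times faster one design can be clocked than another."""
    return f_ours / f_other


if __name__ == "__main__":
    for name, (f5, f1001) in FMAX_MHZ.items():
        drop = frequency_drop_percent(f5, f1001)
        print(f"{name}: {drop:.1f}% drop from L=5 to L=1001")
    ours_1001 = FMAX_MHZ["Our arch."][1]
    chen_1001 = FMAX_MHZ["Chen[10]"][1]
    print(f"Ratio vs Chen[10] at L=1001: {frequency_ratio(ours_1001, chen_1001):.1f}x")
```

With these numbers the proposed architecture loses roughly a third of its maximum frequency over the whole range, while the two reference designs lose more than 95%.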

 

Table II. Maximum Frequency And Area Results For Fixed Window Length And 16-Bit Word Length On Altera Cyclone IVE FPGA

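| L | Max Freq. (MHz), Moshnyaga[4] | Max Freq. (MHz), Chen[10] | Max Freq. (MHz), Our arch. | Area (logic cells), Moshnyaga[4] | Area (logic cells), Chen[10] | Area (logic cells), Our arch. |
|---|---|---|---|---|---|---|
| 5 | 132.9 | 135.8 | 142.2 | 332 | 291 | 319 |
| 8 | 123.8 | 128.6 | 137.8 | 575 | 476 | 557 |
| 20 | 88.8 | 99.1 | 121.4 | 1402 | 1102 | 1381 |
| 51 | 52.9 | 72.9 | 119.7 | 3481 | 2907 | 3464 |
| 100 | 33.5 | 43.1 | 116.2 | 6929 | 6107 | 6812 |
| 201 | 20.9 | 22.9 | 111.6 | 15498 | 13028 | 13865 |
| 301 | 13.3 | 15.8 | 109.7 | 23987 | 20452 | 20811 |
| 600 | 6.9 | 7.4 | 104.4 | 43987 | 40701 | 41233 |
| 1001 | 3.9 | 4.2 | 100.38 | 72567 | 68272 | 69282 |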

Table III. Maximum Frequency And Area FPGA Results For Variable Window Size, 8-Bit And 16-Bit Word Length

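| Window size | Max Freq. (MHz), 8-bit | Max Freq. (MHz), 16-bit | Area (logic cells), 8-bit | Area (logic cells), 16-bit |
|---|---|---|---|---|
| 5 | 133.07 | 116.5 | 258 | 491 |
| 8 | 125.1 | 114.0 | 410 | 785 |
| 20 | 109.5 | 102.8 | 1050 | 2014 |
| 51 | 99.38 | 86.6 | 2694 | 5086 |
| 100 | 89.04 | 78.6 | 5386 | 10179 |
| 201 | 79.27 | 69.9 | 10763 | 20371 |
| 301 | 74.6 | 66.5 | 16159 | 30554 |
| 600 | 65.5 | 56.6 | 32389 | 61332 |
| 1001 | 60.6 | 52.2 | 53957 | 102563 |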

Table IV. Hardware Utilization For Fixed Window Length And 8-bit Word Length On FPGA

| Win. Len. | Total logic elements (%) | Dedicated logic registers (%) |
|---|---|---|
| 5 | 1 | 1 |
| 8 | 1 | 1 |
| 20 | 1 | 1 |
| 51 | 2 | 1 |
| 100 | 4 | 1 |
| 201 | 8 | 3 |
| 301 | 12 | 4 |
| 600 | 23 | 8 |
| 1001 | 38 | 14 |

Table V. Hardware Utilization For Fixed Window Length And 16-bit Word Length On FPGA

| Win. Len. | Total logic elements (%) | Dedicated logic registers (%) |
|---|---|---|
| 5 | 1 | 1 |
| 8 | 1 | 1 |
| 20 | 1 | 1 |
| 51 | 3 | 1 |
| 100 | 6 | 3 |
| 201 | 12 | 6 |
| 301 | 18 | 8 |
| 600 | 36 | 17 |
| 1001 | 61 | 28 |
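The utilization percentages in the two FPGA tables above appear to follow from the area figures of the proposed architecture in Tables I and II divided by the 114480 logic cells of the target device. The sketch below checks this for a few window lengths. It assumes that the "Total logic elements" column counts the same resource as the area columns of Tables I and II; the names `TOTAL_LOGIC_CELLS` and `utilization_percent` are illustrative only.

```python
TOTAL_LOGIC_CELLS = 114480  # logic cells of the EP4CE115 device, as stated in Section 2

# Area of the proposed architecture for selected window lengths,
# copied from Table I (8-bit) and Table II (16-bit): {L: area}.
OUR_AREA_8BIT = {100: 4389, 600: 26393, 1001: 43995}
OUR_AREA_16BIT = {100: 6812, 600: 41233, 1001: 69282}


def utilization_percent(used_cells, total_cells=TOTAL_LOGIC_CELLS):
    """Share of the device's logic cells occupied by the design, in percent."""
    return 100.0 * used_cells / total_cells


if __name__ == "__main__":
    for label, table in (("8-bit", OUR_AREA_8BIT), ("16-bit", OUR_AREA_16BIT)):
        for window_len, cells in table.items():
            pct = utilization_percent(cells)
            print(f"{label}, L={window_len}: {pct:.1f}% of logic cells")
```

For small windows the computed share is well below 1%, whereas the tables list 1; presumably the tool reports such cases as at most 1%.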

 

3- Detailed hardware utilization
In comparison with the latest work, Chen [10], our ASIC results show that in the proposed architecture, for the different window sizes, the combinational area occupies about 42%-44% of the total area, the non-combinational area takes up 32%, and the net interconnection area occupies about 24%-26% of the total area. The corresponding figures for Chen [10] are 47%-49%, 24%-26% and 25%-27%, respectively. These results are detailed in the ASIC Tables IV and V. Considering only the component areas, that is, excluding the net interconnection area (as noted by one of the reviewers), the average resource utilization ratios of the proposed architecture over the different word and window lengths are as follows: the combinational area occupies about 57% and the non-combinational area about 43% (27% of which is due to the register chain) of the total component area. For Chen's architecture, these ratios are 66% and 34%, respectively.
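The component-only ratios quoted above can be re-derived from the total-area ratios by dropping the net interconnection share and renormalising. The sketch below does this for the distinct (combinational, non-combinational) percentage pairs that occur in the ASIC Tables IV and V. Averaging the distinct pairs with equal weight is an assumption of this sketch, and the function names are hypothetical.

```python
# Distinct (combinational %, non-combinational %) shares of the TOTAL area,
# taken from the 8-bit and 16-bit ASIC utilization tables.
PROPOSED = [(44, 32), (42, 32)]
CHEN = [(49, 26), (51, 24), (47, 26), (48, 25)]


def component_shares(comb_pct, noncomb_pct):
    """Renormalise the two shares after excluding the net area."""
    component_total = comb_pct + noncomb_pct
    comb_share = 100.0 * comb_pct / component_total
    return comb_share, 100.0 - comb_share


def average_comb_share(rows):
    """Equal-weight mean of the combinational share over the given rows."""
    shares = [component_shares(comb, noncomb)[0] for comb, noncomb in rows]
    return sum(shares) / len(shares)


if __name__ == "__main__":
    for name, rows in (("Proposed", PROPOSED), ("Chen[10]", CHEN)):
        comb = average_comb_share(rows)
        print(f"{name}: comb {comb:.1f}%, non-comb {100.0 - comb:.1f}%")
```

Under these assumptions the averages round to the 57%/43% and 66%/34% splits stated in the text. The 27% register-chain figure cannot be recovered from the tables and is not modelled here.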

Table IV. Hardware Utilization For Fixed Window Length And 8-bit Word Length On ASIC

| Win. Len. | Comb Area / Total Area | Non-Comb Area / Total Area | Net Area / Total Area | Comb Area / Total Area [10] | Non-Comb Area / Total Area [10] | Net Area / Total Area [10] |
|---|---|---|---|---|---|---|
| 5 | 44% | 32% | 24% | 49% | 26% | 25% |
| 8 | 44% | 32% | 24% | 49% | 26% | 25% |
| 25 | 44% | 32% | 24% | 49% | 26% | 25% |
| 51 | 44% | 32% | 24% | 49% | 26% | 25% |
| 100 | 44% | 32% | 24% | 49% | 26% | 25% |
| 201 | 44% | 32% | 24% | 51% | 24% | 25% |
| 301 | 44% | 32% | 24% | 51% | 24% | 25% |
| 600 | 44% | 32% | 24% | 51% | 24% | 25% |
| 1001 | 44% | 32% | 24% | 51% | 24% | 25% |

Table V. Hardware Utilization For Fixed Window Length And 16-bit Word Length On ASIC

| Win. Len. | Comb Area / Total Area | Non-Comb Area / Total Area | Net Area / Total Area | Comb Area / Total Area [10] | Non-Comb Area / Total Area [10] | Net Area / Total Area [10] |
|---|---|---|---|---|---|---|
| 5 | 42% | 32% | 26% | 47% | 26% | 27% |
| 8 | 42% | 32% | 26% | 47% | 26% | 27% |
| 25 | 42% | 32% | 26% | 47% | 26% | 27% |
| 51 | 42% | 32% | 26% | 47% | 26% | 27% |
| 100 | 42% | 32% | 26% | 47% | 26% | 27% |
| 201 | 42% | 32% | 26% | 48% | 25% | 27% |
| 301 | 42% | 32% | 26% | 48% | 25% | 27% |
| 600 | 42% | 32% | 26% | 48% | 25% | 27% |
| 1001 | 42% | 32% | 26% | 48% | 25% | 27% |

4- Hardware sharing for high system clock and low data throughputs
The memory elements of our architecture cannot be shared, and according to the results of the proposed architecture, approximately 32% of the total area belongs to non-combinational elements. The combinational resources in each cell of the proposed architecture are two comparators and a controller, and these elements can be shared. To do so, we can use one comparator (instead of two) in each cell, in which case two clock cycles are needed to generate the output. In the first clock cycle, the oldest element in the register chain is compared with the content of Ri of all array cells in parallel, using the comparator in each cell. The cell holding the oldest sample is identified, and the elements to the right of this cell i are shifted left. In the second clock cycle, the new input sample is compared with the content of all array cells in parallel, and the correct position for the new sample is located. The element at this position and the elements to its right are shifted right, and the new sample is placed in the vacated position. We can therefore reduce the area by sharing the comparators; however, the latency increases to two clock cycles (decreasing the effective working frequency). We can also share the control unit by using a multiplexer, but since its design is simple, this is unlikely to reduce the occupied area significantly. Additionally, by adding some multiplexers and slightly modifying the control unit, one could also select between the two proposed designs automatically. We leave this extension for future research.
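The two-cycle update described above can be sketched as a small behavioural model in Python. This is not the authors' Verilog: it only mimics the data movement, with one list comprehension standing in for the per-cell comparators that operate in parallel. The class name `TwoPhaseMedianFilter`, the zero reset value, reading the median from cell index W // 2, and treating equal-valued samples as interchangeable are all assumptions made for this illustration.

```python
from collections import deque


class TwoPhaseMedianFilter:
    """Behavioural model of the shared-comparator (two-cycle) update."""

    def __init__(self, window_len):
        if window_len < 1:
            raise ValueError("window_len must be positive")
        self.window_len = window_len
        self.chain = deque([0] * window_len)  # register chain, oldest first
        self.cells = [0] * window_len         # array cells R_i, ascending order
        self.clock_cycles = 0

    def step(self, sample):
        # Clock 1: compare the oldest sample with every cell "in parallel",
        # drop the first matching cell and shift its right neighbours left.
        oldest = self.chain.popleft()
        match = [r == oldest for r in self.cells]
        del self.cells[match.index(True)]
        self.clock_cycles += 1

        # Clock 2: compare the new sample with every cell. Cells holding
        # larger values shift right and the sample fills the freed position.
        larger = [r > sample for r in self.cells]
        pos = larger.index(True) if True in larger else len(self.cells)
        self.cells.insert(pos, sample)
        self.chain.append(sample)
        self.clock_cycles += 1

        return self.cells[self.window_len // 2]


def reference_median(samples, window_len):
    """Brute-force check: sort the zero-padded window for every sample."""
    history = [0] * window_len
    outputs = []
    for s in samples:
        history = history[1:] + [s]
        outputs.append(sorted(history)[window_len // 2])
    return outputs


if __name__ == "__main__":
    data = [5, 3, 8, 1, 9, 2, 7, 3, 3, 6]
    filt = TwoPhaseMedianFilter(window_len=5)
    outputs = [filt.step(x) for x in data]
    assert outputs == reference_median(data, 5)
    print(outputs, "produced in", filt.clock_cycles, "clock cycles")
```

The model spends two clock cycles per input sample, which reflects the trade-off noted in the text: one comparator per cell instead of two, at the cost of halving the sample rate for a given clock.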