High-speed hardware implementation of fixed and runtime variable window length 1-D median filters

Nonlinear digital filters play an important role in digital signal processing applications. In this paper, a novel architecture is proposed for the hardware implementation of fixed and run-time variable window length one-dimensional (1-D) median filters. In the proposed architecture, the maximum working clock frequency is almost independent of the median filter window length, while the hardware complexity is proportional to the number of samples in the window. This feature enables the construction of filters with relatively large window lengths with negligible reduction in the maximum clock frequency, whereas in previous architectures the maximum clock frequency drops significantly as the window length is increased. The benchmark results show the efficiency of the proposed architecture in comparison with state-of-the-art techniques.

The following additional analyses were carried out during the revision and are included here for the reviewers' information. Owing to lack of space, only a summary of these results has been added to the paper.
1- Detailed post-synthesis power results for the variable length case
The detailed dynamic and static power consumption figures are as follows. The power results were extracted from the Design Compiler tool with the "power_preserve_rtl_hier_names" attribute set to true. To save space, only the total power (static plus dynamic) is reported in Table IV of the paper.

MAXIMUM THROUGHPUT, AREA AND POWER RESULTS FOR VARIABLE WINDOW LENGTH ON 45NM CMOS ASIC

Max L	Max Freq. (MHz)		Area (µm²)		Power (µW)
	8-bit	16-bit	8-bit	16-bit	8-bit Static	8-bit Dyn.	16-bit Static	16-bit Dyn.
5	1324.2	1324.2	1029.1	1873.2	9.2	70.2	17.9	131.1
8	1309.7	1309.7	1625.4	3086.4	15.7	113.1	29.5	212.3
20	1261.6	1255.2	3946.3	7891.5	43.7	287.2	83.3	530.8
51	1219.7	1210.8	9614.1	18113.6	104.2	692	188.8	1298.3
100	1157.8	1148.1	19783.1	38397.9	213.4	1410.8	385.2	2641.7
201	1089.2	1075.2	39411.6	76469.9	435.5	2852.2	792.7	537.2
301	998.9	974.2	59099.2	110625.4	685.3	4308	1235.6	8079
600	915.8	880.1	118286.2	211306.2	1372.8	8667.6	2467	16261.7
1001	877.1	828.4	233693.2	403147.1	2298.2	14382.9	4101.2	26101.5

2- FPGA-based implementation
For the FPGA implementation, Quartus II 12.1 from Altera was used as the synthesis tool, and an Altera Cyclone IVE FPGA with the following specifications was used as the target device: chip number EP4CE115F29I8L, 114480 logic cells, 3981312 memory bits, 1.0 V core voltage. Tables I and II show the results for fixed window length in terms of maximum clock frequency and logic cells with 8-bit and 16-bit word lengths, respectively. As expected, the overall trends of the maximum frequency and area are the same as in the ASIC results. Table III repeats these results for the variable-length filter, and Tables IV and V give the corresponding FPGA hardware utilization for the fixed-length filter.

Table I. Maximum Frequency And Area Results For Fixed Window Length And 8-Bit Word Length On Altera Cyclone IVE FPGA

L	Max Freq. (MHz)			Area (logic cells)
	Moshnyaga[4]	Chen[10]	Our arch.	Moshnyaga[4]	Chen[10]	Our arch.
5	144.7	151.4	157.6	210	201	197
8	127.8	141.3	145.8	354	345	337
20	91.8	110.7	140.5	1101	919	865
51	55.5	73.2	129.6	2689	2483	2187
100	36.1	44.3	127.3	5198	4498	4389
201	22.2	24.1	120.5	10229	9811	8789
301	15.8	16.7	116.8	16921	15623	13191
600	7.6	8.1	110.2	34989	33042	26393
1001	4.2	4.9	107.1	57289	55375	43995
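
As a rough illustration of how weakly the clock frequency of the proposed architecture depends on the window length, the short sketch below computes the fraction of the L = 5 frequency that is still available at L = 1001 for each design. The numbers are copied from Table I; the function and variable names are only illustrative.

```python
# Max clock frequency (MHz) at L = 5 and L = 1001, copied from Table I (8-bit, FPGA).
FREQ_MHZ = {
    "Moshnyaga[4]": (144.7, 4.2),
    "Chen[10]": (151.4, 4.9),
    "Our arch.": (157.6, 107.1),
}


def retained_fraction(f_small, f_large):
    """Fraction of the small-window clock frequency left at the large window."""
    return f_large / f_small


for name, (f5, f1001) in FREQ_MHZ.items():
    print(f"{name}: {100 * retained_fraction(f5, f1001):.1f}% of the L = 5 frequency")
```

With the Table I values, both reference designs keep only about 3% of their L = 5 frequency at L = 1001, while the proposed design keeps about 68%.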

Table II. Maximum Frequency And Area Results For Fixed Window Length And 16-Bit Word Length On Altera Cyclone IVE FPGA

L	Max Freq. (MHz)			Area (logic cells)
	Moshnyaga[4]	Chen[10]	Our arch.	Moshnyaga[4]	Chen[10]	Our arch.
5	132.9	135.8	142.2	332	291	319
8	123.8	128.6	137.8	575	476	557
20	88.8	99.1	121.4	1402	1102	1381
51	52.9	72.9	119.7	3481	2907	3464
100	33.5	43.1	116.2	6929	6107	6812
201	20.9	22.9	111.6	15498	13028	13865
301	13.3	15.8	109.7	23987	20452	20811
600	6.9	7.4	104.4	43987	40701	41233
1001	3.9	4.2	100.38	72567	68272	69282

Table III. Maximum Frequency And Area FPGA Results For Variable Window Size, 8-Bit And 16-Bit Word Length

Window size	Max Freq. (MHz)		Area (logic cells)
	8-bit	16-bit	8-bit	16-bit
5	133.07	116.5	258	491
8	125.1	114.0	410	785
20	109.5	102.8	1050	2014
51	99.38	86.6	2694	5086
100	89.04	78.6	5386	10179
201	79.27	69.9	10763	20371
301	74.6	66.5	16159	30554
600	65.5	56.6	32389	61332
1001	60.6	52.2	53957	102563

Table IV. Hardware Utilization For Fixed Window Length And 8-bit Word Length On FPGA

Win. Len.	Total logic elements (%)	Dedicated logic registers (%)
5	1	1
8	1	1
20	1	1
51	2	1
100	4	1
201	8	3
301	12	4
600	23	8
1001	38	14

Table V. Hardware Utilization For Fixed Window Length And 16-bit Word Length On FPGA

Win. Len.	Total logic elements (%)	Dedicated logic registers (%)
5	1	1
8	1	1
20	1	1
51	3	1
100	6	3
201	12	6
301	18	8
600	36	17
1001	61	28

3- Detailed hardware utilization
In comparison with the latest work, Chen[10], our ASIC results show that, across the different window sizes, the combinational area of the proposed architecture occupies about 42%-44% of the total area, the non-combinational area takes up 32%, and the net interconnection area occupies about 24%-26%. For Chen[10], the corresponding figures are 47%-49%, 24%-26% and 25%-27%, respectively. These results are reflected in Tables IV and V. Considering only the component areas (as noted by one of the reviewers), the average resource utilization of the proposed architecture over the different word and window lengths is as follows: the combinational area occupies about 57% and the non-combinational area 43% (27% of which is due to the register chain) of the total component area. For Chen's architecture, these ratios are 66% and 34%, respectively.

Table IV. Hardware Utilization For Fixed Window Length And 8-bit Word Length On ASIC

Table V. Hardware Utilization For Fixed Window Length And 16-bit Word Length On ASIC

4- Hardware sharing for high system clock and low data throughputs
The memory elements of our architecture cannot be shared, and according to our results approximately 32% of the total area belongs to non-combinational elements. The combinational resources in each cell of the proposed architecture are two comparators and a controller, and these elements can be shared. To do so, we can use one comparator (instead of two) in each cell, in which case two clock cycles are needed to generate the output. In the first clock cycle, the oldest element in the register chain is compared with the content of Ri of all array cells in parallel, using the single comparator in each cell. The cell holding the oldest sample is identified, and the elements to the right of this cell i are shifted left. In the second clock cycle, the new input sample is compared with the content of all array cells in parallel, and the correct position for the new sample is located. The element at this position and the elements to its right are shifted right, and the new sample is stored at this position. Sharing the comparators therefore reduces the area. However, the latency increases to two clock cycles, which lowers the effective working frequency. The control unit can also be shared by using a multiplexer, but since its design is simple, sharing it is unlikely to save significant area. Additionally, by adding some multiplexers and slightly modifying the control unit, one of the two proposed designs can be selected automatically. We leave this extension for future research.
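
The two-cycle update described above can be illustrated with a small software model. This is only a behavioural sketch under simplifying assumptions (an odd window length, a plain Python list standing in for the sorted array cells, and a deque standing in for the register chain); it is not the hardware description, and the names `two_phase_update` and `median_filter` are hypothetical.

```python
from collections import deque


def two_phase_update(sorted_cells, oldest, new_sample):
    """One filter step on a sorted window, split into two phases.

    sorted_cells: window contents in ascending order (models the array cells).
    oldest: sample leaving the window (head of the register chain).
    new_sample: incoming sample.
    """
    cells = list(sorted_cells)
    # Phase 1 (first clock): every cell is compared with the oldest sample;
    # the matching cell is removed and the cells to its right shift left.
    hit = next(i for i, v in enumerate(cells) if v == oldest)
    del cells[hit]
    # Phase 2 (second clock): every cell is compared with the new sample;
    # the insertion point and the cells to its right shift right.
    pos = sum(1 for v in cells if v <= new_sample)
    cells.insert(pos, new_sample)
    return cells


def median_filter(samples, window):
    """Running median over an odd-length window, built on two_phase_update."""
    chain = deque(samples[:window])   # arrival order (register chain)
    cells = sorted(chain)             # sorted order (array cells)
    out = [cells[window // 2]]
    for x in samples[window:]:
        oldest = chain.popleft()
        chain.append(x)
        cells = two_phase_update(cells, oldest, x)
        out.append(cells[window // 2])
    return out


print(median_filter([5, 1, 4, 2, 3, 9, 0], 3))
```

In the model, each phase needs only one comparison per cell, which mirrors the idea of reusing a single comparator per cell across two clock cycles at the cost of one extra cycle per output.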

Win. Len.	Comb. Area / Total Area	Non-Comb. Area / Total Area	Net Area / Total Area	Comb. Area / Total Area [10]	Non-Comb. Area / Total Area [10]	Net Area / Total Area [10]
5	44%	32%	24%	49%	26%	25%
8	44%	32%	24%	49%	26%	25%
25	44%	32%	24%	49%	26%	25%
51	44%	32%	24%	49%	26%	25%
100	44%	32%	24%	49%	26%	25%
201	44%	32%	24%	51%	24%	25%
301	44%	32%	24%	51%	24%	25%
600	44%	32%	24%	51%	24%	25%
1001	44%	32%	24%	51%	24%	25%
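
The component-only ratios quoted in Section 3 can be related to the table above by discarding the net area and rescaling the remaining two shares so that they sum to 100%. The sketch below does this for the rounded table entries; since the quoted averages cover several word and window lengths, the result should only be expected to agree approximately. The function name is illustrative.

```python
def component_ratios(comb_pct, noncomb_pct):
    """Rescale comb./non-comb. shares to sum to 100% once net area is excluded."""
    total = comb_pct + noncomb_pct
    return 100 * comb_pct / total, 100 * noncomb_pct / total


proposed = component_ratios(44, 32)      # proposed architecture, all rows
chen_small = component_ratios(49, 26)    # Chen[10], window lengths up to 100
chen_large = component_ratios(51, 24)    # Chen[10], window lengths 201 and above

for label, (comb, noncomb) in [("proposed", proposed),
                               ("Chen[10], small L", chen_small),
                               ("Chen[10], large L", chen_large)]:
    print(f"{label}: comb. {comb:.1f}%, non-comb. {noncomb:.1f}%")
```

This gives roughly 58%/42% for the proposed architecture and between about 65%/35% and 68%/32% for Chen[10], in line with the averages of 57%/43% and 66%/34% quoted in Section 3, given the rounding in the table.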