[alexander96parallel] 
Winser E. Alexander, Douglas S. Reeves, and Clay S. Gloster jr.
Parallel image processing with the block data parallel architecture.
Proceedings of the IEEE, 84(7):947968, July 1996.
[ bib 
.pdf ]
Many digital signal and image processing algorithms can be speeded up by executing them in parallel on multiple processors. The speed of parallel execution is limited by the need for communication and synchronization between processors. In this paper we present a paradigm for parallel processing that we call the block data flow paradigm (BDFP). The goal of this paradigm is to reduce interprocessor communication, and relax the synchronization requirements for such applications. We present the block data parallel architecture which implements this paradigm, and we present methods for mapping algorithms onto this architecture. We illustrate this methodology for several applications including twodimensional (2D) digital filters, the 2D discrete cosine transform, QR decomposition of a matrix, and Cholesky factorization of a matrix. We analyze the resulting system performance for these applications with regard to speedup and efficiency as the number of processors increases. Our results demonstrate that the block data parallel architecture is a flexible, highperformance solution for numerous digital signal and image processing algorithms.

[an03implementation] 
B. An, S. Balakrishnan, C.H. van Berkel, D. Cheresiz, B. Juurlink, and
S. Vassiliadis.
Implementation of mpeg4 on the philips co vector processor, December
2003.
[ bib 
.pdf ]
Multimedia applications provide new highly valuable services to the consumer and form, consequently, a new important workload for desktop systems. The increased computing power of the embedded processors required in baseband processing for new highbandwidth wireless communication protocols (e.g. UMTS, CDMA2000) can make multimedia processing possible also for the mobile devices, such as cell phones. These devices must meet high performance requirements of multimedia applications while maintaining low cost and low p[ower consumption. As a new platform which can provide the necessary computing power for processing the inner tranciever functions of the modem for UMTS, Philips develops a novel processor, called Co VectorProcessor (CVP), which is based on the following techniques. First, CVP exploits the datalevel parallelism (DLP) by processing 256bit data items, which are interpreted as vectors consisting of 32 8bit, 16 16bit, or eight 32bit elements. Therefore, a single vector instruction of CVP performs up to 32 operations. Second, CVP also exploits the instructionlevel parallelism (ILP) using VLIW approach: several vector instructions are packed in a single very long instruction word (VLIW), and are executed in parallel. This highperformance machine can be employed not only for its original purpose, the baseband processing, but also for multimedia processing. In this paper we investigate how well the parallel processing capabilities of CVP can be utilized for a typical media application, and estimate the performance levels which can be achieved. We use the MPEG4 video encoder as a benchmark. We identify two most timeconsuming kernels of this application, the Motion Estimation (ME) and the Discrete Cosine Transform (DCT), and rewrite them using CVP instructions. These kernels operate on 8 by 8 pixel blocks. We propose two different storage schemes: halfblock based and pixel based. Additionally, we propose some architecture extensions, such as emplying fullshuffles. We show that by using the appropriate storage schemes and the proposed extensions the performance of ME and DCT can be improved by factors of 2.88 and 1.84, respectively. We, therefore, show that the most important kernels of MPEG4 encoder can be vectorized and efficiencly implemented on CVP.

[ananthakrishna80blockstate] 
P. Ananthakrishna and S.K. Mitra.
Blockstate recursive digital filters with minimum roundoff noise.
In IEEE International Conference on ICASSP '80, volume 5, pages
8184, April 1980.
[ bib 
.pdf ]
The development of a block statestructure with minimum roundoff noise subject to l2norm dynamic range constraint is outlined. The pertinent equations for scaling and for roundoff noise analysis of blockstate structures are first derived. Next, a lower bound and the global minimum of the output noise due to the roundoff of the blockstate variables are derived. A method of deriving the minimum roundoff noise block statestructure is outlined. A numerical example is included. With regard to computational complexity and overall noise performance, the block state realization of recursive digital filters is shown to be superior.

[arun86ultrahighspeed] 
K.S. Arun.
Ultrahighspeed parallel implementation of loworder digital
filters.
In Proceedings of the IEEE International Symposium on Circuits
and Systems (ISCAS), volume 3, May 1986.
[ bib ]
Extant parallel implementations of onedimensional digital filters use linear of bilinear arrays of processors, and generate a maximum of one output in the time required for one multiplication and one addition. The objective of this presentation is to develop new parallel structures for both nonrecursive and recursive filters, that use many more processors and generate multiple outputs in the same time frame.

[asanovic02programmable]  Krste Asanović. Programmable Neurocomputing. MIT Press, November 2002. [ bib  .pdf ] 
[barnes80blockshift] 
Casper W. Barnes and Shinji Shinnaka.
Blockshift invariance and block implementation of discretetime
filters.
IEEE Transactions on Circuits and Systems, CAS27(8), August
1980.
[ bib 
.pdf ]
This paper describes the general class of linear, timeinvariant multivariable systems that can be used in block implementations of timeinvariant discretetime filters. Explicit relations between the properties of the block processor and the properties of the implemented filter are derived, including an explicit expression for the matrix transfer function of the block processor in terms of the singleinput, singleoutput filter transfer function. These properties and relations are independent of the form of realization of the block processor. It is shown that all irreducible statespace realizations of the block processor can be derived by a simple procedure from a simple realization of the required filter transfer function.

[benesty92fast] 
Jacob Benesty and Pierre Duhamel.
A fast exact least mean square adaptive algorithm.
IEEE Transactions on Signal Processing, 40(12):29042920,
December 1992.
[ bib 
.pdf ]
A general block formulation of the leastmeansquare (LMS) algorithm for adaptive filtering is presented. This formulation has an exact equivalence with the original LMS algorithm; hence it retains its convergence properties, while allowing a reduction in arithmetic complexity, even for very small block lengths. Working with small block lengths is interesting from an implementation point of view (large blocks result in large memory and large system delay) and allows a significant reduction in the number of operations. Tradeoffs between a number of operations and a convergence rate are obtainable by applying certain approximations to a matrix involved in the algorithm. Hence, the usual block LMS appears as a special case, which explains its convergence behavior according to the type of input signal (correlated or uncorrelated)

[bergmans96digital]  Jan W.M. Bergmans. Digital Baseband Transmission and Recording. Kluwer Academic Publishers, 1996. [ bib ] 
[berkel05vector] 
Kees van Berkel, Frank Heinle, Patrick P.E. Meuwissen, Kees Moerman, and
Matthias Weiss.
Vector processing as an enabler for softwaredefined radio in
handheld devices.
EURASIP Journal on Applied Signal Processing, 16:26132625,
2005.
[ bib 
.html ]
A major challenge of softwaredefined radio (SDR) is to realize many giga operations per second of flexible baseband processing within a power budget of only a few hundred mW. A heterogeneous hardware architecture with the programmable vector processor EVP as key component can support WLAN, UMTS, and other standards. A detailed rationale for the EVP architecture, based on the analysis of a number of key algorithms, as well as implementation and benchmarking results are described.

[berkel04vector]  C.H. van Berkel, Frank Heinle, Patrick P.E. Meuwissen, Kees Moerman, and Matthias Weiss. Vector processing as an enabler for softwaredefined radio in handsets from 3G+WLAN onwards. In Proceedings of the 2004 Software Defined Radio Technical Conference, volume B, pages 125130, November 2004. [ bib ] 
[berrou93near] 
Claude Berrou, Alain Glavieux, and Punya Thitimajshima.
Near shannon limit errorcorrecting coding and decoding: Turbocodes.
In Proceedings of the IEEE International Conference on
Communication (ICC), volume 2, pages 10641070. IEEE, May 1993.
[ bib 
.pdf ]
This paper deals with a new class of convolutional codes called Turbocodes, whose performances in terms of Bit Error Rate (BER) are close to the SHANNON limit. The TruboCode encoder is built using a parallel concatenation of two Recursive Systematic Convolutional codes and the associated decoder, using a feedback decoding rule, is implemented as P pipelined identical elementary decoders.

[braun01using] 
Gunnar Braun, Andreas Hoffman, Achim Nohl, and Heinrich Meyr.
Using static scheduling techniques for the retargeting of high speed,
compiled simulators for embedded processors from an abstract machine
description.
In ISSS, pages 5762, 2001.
[ bib 
.html ]
Instruction set simulators are indispensable tools for both the design of programmable architectures and software development. However, due to a constantly increasing processor complexity and the frequent demand for cycleaccurate models, such simulators have become defectively slow. The principle of compiled simulation addresses this shortcoming. Compiled simulators make use o a priori knowledge to accelerate simulation, with the highest efficiency achieved by emplyoing static scheduling techniques. In the past, such statically scheduled simulators have only been implemented for specific DSP architectures. The approach presented here discusses the application of static scheduling techniques to retargetable simulation tools based on the processor description language LISA. Principles and immmplementation issues are discussed in this paper, and results are presented for two selected processor architectures.

[brent82regular] 
Richard P. Brent and H.T. Kung.
A regular layout for parallel adders.
IEEE Transactions on Computers, C31:260264, March 1982.
[ bib 
.pdf ]
With VLSI architecture, the chip area and design regularity represent a better measure of cost than the conventional gate count. We show that addition of nbit binary numbers can be performed on a chip with a regular layout in time proportional to log n and with area proportional to n. Index Terms: Addition, areatime complexity, carry lookahead, circuit design, combinational logic, models of computation, parallel addition, parallel polynomial evaluation, prefix computation, VLSI.

[brent91stabilized] 
Richard P. Brent and Zhou Bing Bing.
A stabilized parallel algorithm for directform recursive filters.
IEEE Transactions on Computers, 40(3):333336, March 1991.
[ bib 
.pdf ]
A stabilized parallel algorithm for directform recursive filters is obtained using a new method of derivation in the Z domain. The algorithm is regular and modular, so very efficient VLSI architectures can be constructed to implement it. The degree of parallelism in these implementations can be chosen freely, and is not restricted to be a power of two.

[burrus71block] 
Charles S. Burrus.
Block implementation of digital filters.
IEEE Transactions on Circuit Theory, CT18(6), November 1971.
[ bib 
.pdf ]
A theory for implementation of recursive digital filters that process signals by blocks is presented. It is shown that this approach provides a very general family of filters that includes both the conventional scalar implementation and batch processing. The approach is based on a matrix representation of convolution and results in a statevariable description with block feedback. An eigenvalue analysis guarantees stability of the realization and indicates a reduction in sensitivity to roundoff and coefficient accuracy

[burrus72block] 
Charles S. Burrus.
Block realization of digital filters.
IEEE Transactions on Audio and Electroacoustics, AU20(4),
October 1972.
[ bib 
.pdf ]
Different forms of block recursive digital filters are formulated using a matrix representation of convolution. These approaches have the same relation to recursive filters that fast covolution does to nonrecursive filters. The multiplication efficiencies are calculated and compared showing that the block realization can become more efficient for filters with orders exceeding approximately 25. The general improvements on stability and sensitivity to roundoff error and coefficient accuracy are discussed.

[burrus77digital] 
C. Sidney Burrus.
Digital filter structures described by distributed arithmetic.
IEEE Transactions on Circuits and Systems, CAS24(12):674680,
December 1977.
[ bib 
.pdf ]
This paper presents a new formulation of digital filters that combines the description of signalprocessing and arithmetic operations. This is done by noting that multiplication is a form of convolution and therefore normal onedimensional scalar convolution is in fact twodimensional binary convolution. This is generalized to multidimensions and can be applied with tablelookup and transform techniques. The result is a unified description that describes a digital filter structure down to the bit level.

[chakrabarti93sorting] 
Chaitali Chakrabarti.
Sorting network based architectures for median filters.
IEEE Transactions on Circuits and Systems II: Analog and Digital
Signal Processing, 40(11):723727, November 1993.
[ bib 
http ]
Sorting network based architectures for computing nonrecursive and recursive median filters are presented. The proposed architectures are highly pipelined and consist of fewer compareswap units than existing architectures. The reduction in the number of compareswap units is achieved by minimizing computational overlap between successive outputs and also by using Batcher's oddeven merge sort (instead of bubblesort). The latency of these networks is reduced by building them with sorting units that sort 2 elements (sort2) as well as 3 elements (sort3) in 1 time unit.

[chakrabarti94novel] 
Chaitali Chakrabarti and LiYu Wang.
Novel sorting networkbased architecture for rank order filters.
IEEE Transactions on VLSI Systems, 2(4):502507, December
1994.
[ bib 
.pdf ]
This paper presents two novel sorting networkbased architectures for computing high sample rate nonrecursive rank order filters. The proposed architectures consist of significantly fewer comparators than existing sorting networkbased architectures that are based on bubblesort and Batcher's oddeven merge sort. The reduction in the number of comparators is obtained by sorting the columns of the window only once, and by merging the sorted columns in a way such that the number of candidate elements for the output is very small. The number of comparators per output is reduced even further by processing a block of outputs at a time. Block processing procedures that exploit the computational overlap between consecutive windows are developed for both the proposed networks.

[chakraborty03pipelining] 
Mrityunjoy Chakraborty and Suraiya Pervin.
Pipelining the adaptive decision feedback equalizer with zero
latency.
Signal Processing, 83(12):26752681, December 2003.
[ bib 
http ]
This paper extends a recently proposed approach to pipeline an adaptive filter to the case of least mean square (LMS)based adaptive decision feedback equalizer. It is shown that by certain manipulation of the equalizer weight update equation, a pipelined architecture can be realized which generates the output at the first stage itself and thus is free from any latency. The proposed method overcomes certain limitations of the conventional approaches in that due to the absence of any latency, there is no compulsion to employ delayed LMS algorithm for coefficient adaptation and thus no compromise with convergence rate becomes necessary.

[chang92designing] 
ShihFu Chang and David G. Messerschmitt.
Designing highthroughput vlc decoder. part i. concurrent vlsi
architectures.
IEEE Transactions on Circuits and Systems for Video Technology,
2(2):187196, June 1992.
[ bib 
.pdf ]
Two classes of architecturesthe treebased and the PLAbased architectureshave been discussed in the literature for the variable length code (VLC) decoder. Pipelined or parallel architectures in these two classes are proposed for highspeed implementation. The pipelined treebased architectures have the advantages of fully pipelined design, short clock cycle, and partial programmability. They are suitable for concurrent decoding of multiple independent bit streams. The PLAbased architectures have greater flexibility and can take advantages of some highlevel optimization techniques. The input/output rate can be fixed or variable to meet the application requirements. As an experiment, the authors have constructed a VLC based on a popular video compression system and compared the architectures. A layout of the major parts and a simulation of the critical path of the pipelined constantinputrate PLAbased architecture using a highlevel synthesis approach estimates that a decoding throughput of 200 Mb/s with a single chip is achievable with CMOS 2.0 µm technology

[chang99vlsi] 
HaoChieh Chang, LiangGee Chen, YungChi Chang, and ShengChieh Huang.
A vlsi architecture design of vlc encoder for high data rate
video/image coding.
In Proceedings of the 1999 IEEE International Symposium on
Circuits and Systems (ISCAS '99), volume 4, pages 398401, May 1999.
[ bib 
.pdf ]
An efficient architecture of variable length coding (VLC) is developed for recent multimedia applications, such as video and image compression. VLC plays a crucial part in these applications in that it provides a very effective coding gain. In this paper, we will describe an architecture design of VLC encoder. It can produce VLC codeword and amplitude, and pack them in order to achieve the constant wordlength output. In addition, in this pipeline architecture, the VLC codeword and the amplitude can be processed in one clock cycle such that the input data rate of VLC encoder can reach as high as the sampling rate of video/image data. Therefore, it is very suitable for very high data rate video and image compression applications.

[chanoux70synthesis] 
David Chanoux.
Synthesis of recursive digital filters using the fft.
IEEE Transactions on Audio and Electroacoustics,
AU18:211212, June 1970.
[ bib ]
It has been shown that recursive digital filters can be synthesized using the fast Fourier transform. An algorithm for computer implementation has been developed and used in comparing the computation times and noise figures of filters synthesized in this manner with the computation times and noise figures of filters synthesized by recursion. A model has beenn proposed for analysis of the noise in the twopole filter. Predictions of this model have been found to be in good agreement with noise measurements.

[clark81block] 
Gregory A. Clark, Sanjit K. Mitra, and Sydney R. Parker.
Block implementation of adaptive digital filters.
IEEE Transactions on Acoustics, Speech, and Signal Processing,
29(3), June 1981.
[ bib 
.pdf ]
Block digital filtering involves the calculation of a block or finite set of filter outputs from a block of input values. This paper presents a block adaptive filtering procedure in which the filter coefficients are adjusted once per each output block in accordance with a generalized least meansquare (LMS) algorithm. Analyses of convergence properties and computational complexity show that the block adaptive filter permits fast implementations while maintaining performance equivalent to that of the widely used LMS adaptive filter.

[clark02automatically] 
Nathan Clark, Wilkin Tang, and Scott Mahlke.
Automatically generating custom instruction set extensions.
In Proceedings of the 1st Workshop on Application Specific
Processors, pages 94101, November 2002.
[ bib 
.pdf ]
Generalpurpose processor that are utilized as cores are often incapable of achieving the challenging cost, performance, and power demands of highperformance audio, video, and netwworking applications. To meet these demands , most systems employ a number of hardware accelerators to offload the computationally demanding portions of the applications. As an alternative to this strategy, we examine customizing the computation capabilities of a core processor for a particular application. Our goal is to enable some or all of the computation that is offloaded to the accelerators to be taken over by the customized core. The computation capabilities of the core processor are extended with hardware in the form of a set custom function units and new instructions. The compiler is responsible for analyzing the target application and identifying a set of costeffective custom function units. In this paper we provide an overview of the system that we are developing to automatically identify instruction set extensions and report some preliminary analysis of four media benchmarks.

[coene03twodimensional]  Wim M.J. Coene. Twodimensional optical storage. In ODS 2003, Techn. Digest, pages 9092, 2003. [ bib ] 
[connell73huffman] 
J. Brian Connell.
A huffmanshannonfano code.
In Proceedings of the IEEE, pages 10461047, July 1973.
[ bib ]
A variablewordlength minimumredundant code is described. It has the advantages of both the Huffman and ShannonFano codes in that it reduces transmission time, storage space, translation table space, and encoding and decoding times

[constantinides01multiple] 
George A. Constantinides, Peter Y.K. Cheung, and Wayne Luk.
The multiple wordlength paradigm.
In Proceedings of the 11th Annual IEEE Symposium on
FieldProgrammable Custom Computing Machines, April 2001.
[ bib 
.pdf ]
This paper presents a paradigm for the design of multiple wordlength parallel processing systemks for DSP applications, based on varying the wordlength and scaling of each signal in a DSP block diagram. A technique for estimating the observable effects of truncation and roundoff error is illustrated, and used to form the basis of an optimization algorithm to automate the design of such multiple wordlength systems. Results from implementation on a reconfigurable computing platform show that significant logic usage savings and increased clock rates can be obtained by customizing the datapath precision to the algorithm according to the techniques described in this paper. On selected DSP benchmarks, we obtain up to 45% area reduction and up to 39% speed increase over standard design techniques.

[conte97challenges] 
Thomas M. Conte, Pradeep K. Dubey, Matthew D. Jennings, Ruby B. Lee, Alex
Peleg, Salliah Rathnam, Mike Schlansker, Peter Song, and Andrew Wolfe.
Challenges to combining generalpurpose and multimedia processors.
IEEE Computer, 3U(12):3337, 1997.
[ bib 
.html ]
Multimediaprocessor media extensions to generalpurpose processors present new challenges to the compiler writer, language designer, and microarchitect.

[conway99implementation] 
Conway.
Implementation of high speed viterbi detectors.
Electronics Letters, 35(24):20892090, November 1999.
[ bib 
.pdf ]
The normal Viterbi architecture and a radix 4 architecture are compared with a parallel ACS version using a latch based storage element. Results obtained using an eightstate EPR4 Viterbi detector show that parallel ACS architectures can provide a high level of performance with modest area requirements.

[cormen01introduction]  Thomas H. Cormen, Charles E. Leiserson, and Ronald L. Rivest. Introduction to algorithms. MIT Press, 2nd edition, 2001. [ bib ] 
[dang03efficient] 
B.L. Dang, Nur Engin, and G.N. Gaydadjiev.
Efficient filtering with the covector processor.
In Proceedings of PRORISC '03, pages 351356, November 2003.
[ bib 
.pdf ]
This paper describes the mapping of Finite Impulse Response (FIR) and Decimation filters on a new DSP architecture: the CoVector Processor (DVP) developed by Philips. This architecture is targeting the baseband signal processing algorithms for the third generation mobile communication (3G). CVP is a Very Long Instruction Word (VLIW) architecture with functional units supporting vector parallelism. To exploit efficiently the architecture, a large portion of the targeted DSP algorithms must be properly vectorized. In this paper, different vectorization strategies for FIR and Decimation filters for the CVP architecture are investigated. The approach used is to restructure the original sequential algorithms into block forms that are suitable for parallel processing. In addition, the vecotrization should fully utilize the MultiplyAccumulate (MAC) structure. It is shown that for the targeted filters, several good vectorization strategies can be applied. The benchmark results obtained using the proposed strategies outperform results of other architectures previously reported.

[derby03high] 
Jeff H. Derby and Jaime H. Moreno.
A highperformance embedded dsp core with novel simd features.
In Proceedings of the IEEE International Conference on
Acoustics, SPeech, and Signal Processing (ICASSP'03), volume 2, pages
II301II304, April 2003.
[ bib 
.pdf ]
A lowpower, highperformance, compilerfriendly DSP core has been under development in the IBM Communications Research & Development Center, as part of its eLite DSP project. This DSP incorporates instructionlevel parallelism through the packing of multiple instructions in 64bit longinstruction words, while datalevel parallelism is realized through the use of SIMD techniques, such that SIMD operations can be applied to both dynamically composed vectors and packed vectors. Dynamic composition of vectors is made possible through the use of a vector pointer mechanism, which permits the addressing in a very flexible way of groups of four 16bit elements in a large, multiport, scalar register file. This paper provides an overview of the architecture of this DSP core, with a focus on its SIMD features. We describe these features in some detail and discuss how they are used, with a block FIR filter and a radix4 FFT taken as examples.

[dongen91mapping]  Rene C.A. van Dongen. Mapping for digital video signal processors: models and algorithms. Master's thesis, Technische Universiteit Eindhoven, 1991. Wisk. en Inf. bibl.:ARC 01 WSK (600). Afstudeerhoogleraar E.H.L. Aarts. Philips Research Lab. [ bib ] 
[douglas98pipelined] 
Scott C. Douglas, Quanhong Zhu, and Kent F. Smith.
A pipelined lms adaptive fir filter architecture without adaptation
delay.
IEEE Transactions on Signal Processing, 46(3), March 1998.
[ bib 
.pdf ]
Past methods for mapping the leastmeansquare (LMS) adaptive finiteimpulseresponse (FIR) filter onto parallel and pipelined architectures either introduce delays in the coefficient updates or have excessive hardware requirements. In this correspondence, we describe a hardwareefficient pipelined architecture for the LMS adaptive FIR filter that produces the same output and error signals as would be produced by the standard LMS adaptive filter architecture without adaptation delays. Unlike existing architectures for delayless LMS adaptation, the new architecture's throughput is independent of the filter length.

[feijen99method]  W.H.J. Feijen and A.J.M. van Gasteren. On a method of multiprogramming. Monographs in computer science. Springer, 1999. [ bib ] 
[fettweis89parallel] 
Gerhard Fettweis and Heinrich Meyr.
Parallel viterbi algorithm implementation: Breaking the
acsbottleneck.
IEEE Transactions on Communications, 37(8):785790, August
1989.
[ bib 
.pdf ]
The central unit of a Viterbi decoder is a datadependent feedback loop which performs an addcompareselect (ACS) operation. This nonlinear recursion is the only bottleneck for a highspeed parallel implementation. A linear scale solution (architecture) is presented which allows the implementation of the Viterbi algorithm (VA) despite the fact that it contains a datadependent decision feedback loop. For a fixed processing speed it allows a linear speedup in the throughput rate by a linear increase in hardware complexity. A systolic array implementation is discussed for the addcompareselect unit of the VA. The implementation of the survivor memory is considered. The method for implementing the algorithm is based on its underlying finite state feature. Thus, it is possible to transfer this method to other types of algorithms which contain a datadependent feedback loop and have a finite state property.

[fettweis90algorithm] 
G. Fettweis, L. Thiele, and G. Meyr.
Algorithm transformations for unlimited parallelism.
In IEEE International Symposium on Circuits and Systems,
volume 3, pages 17561759, May 1990.
[ bib 
.pdf ]
A unified algebraic basis for transforming algorithms to achieve unlimited parallelism is provided. In the case of recursive algorithms, a certain class of algebraic structures is shown to be sufficient for the application of the lookahead computation. This approach is generalized to matrixbased algorithms where the operations form a semiring. Previously known approaches to speeding up recursions are shown to be special instances of the given approach. A block processing realization corresponding to the logarithmic lookahead computation is given which is efficient in its area and time requirements.

[fettweis91highspeed] 
Gerhard Fettweis and Heinrich Meyr.
Highspeed parallel viterbi decoding: Algorithm and
vlsiarchitecture.
IEEE Communications Magazine, pages 4655, May 1991.
[ bib 
.pdf ]
The Viterbi algorithm (VA) is considered as an example of a fairly complex algorithm that needs to be implemented for highspeed applications. A brief introduction to the algorithm is given, and the state of the art of highspeed Viterbi decoders is reviewed. The three principal levels of introducing additional parallelism into an algorithmbit level, word level, and algorithm levelare outlined, and a solution for the VA at the bit level is indicated.

[fettweis95algebraic] 
Gerhard Fettweis.
Algebraic survivor memory management design for viterbi detectors.
IEEE Transactions on Cummunications, 43(9), September 1995.
[ bib 
.pdf ]
The problem of survivor memory management of a Viterbi detector is classically solved either by a registerexchange implementation which has minimal latency, but large hardware complexity and power consumption, or by a traceback scheme with small power consumption, but larger latency. Here an algebraic formulation of the survivor memory management is introduced which provides a framework for the derivation of new algorithmic and architectural solutions. This allows for solutions to be designed with greatly reduced latency and/or complexity, as well as for achieving tradeoff between latency and complexity. VLSI case studies of specific new solutions have shown that at minimal latency more than 50% savings are possible in hardware complexity as well as power consumption.

[fettweis03embedded] 
Gerhard P. Fettweis.
Embedded simd vector signal processor design.
In Proc. International Workshop on Systems, Architectures,
Modeling and Simulation (SAMOS), pages 7176, July 2003.
[ bib 
.pdf ]
Digital signal processors (DSPs) have become a key component for the design of communications ICs, in particular for wireless solutions. However, why does it make sense to use them and for which target application area? How can the new processing power requirements be met? The block nature of wireless communications leads to SIMD (single instruction multiple data) vector signal processors being a natural fit to the problem. Novel methods of developing archhitectures make these VDSPs small in size, power consumption, and flexible in programming.

[freking99unrestricedly] 
Robert A. Freking and Keshab K. Parhi.
An unrestrictedly parallel scheme for ultrahighrate reprogrammable
huffman coding.
In Proceedings of the IEEE International Conference on ICASSP,
volume 4, pages 19371940, March 1999.
[ bib 
.pdf ]
This paper proposes a comprehensive method for overcoming the inherently serial nature of variablelength nearentropy coding to obtain unrestrictedly parallel realizations of Huffman compression. A codestream rearrangement technique together with a symbolstream orderrecovery procedure form a concurrent approach capable of exceeding all previously attainable code rate figures. Furthermore, the method is noteworthy for achieving 100% hardware utilization with no code rate overhead while maintaining data output in a traditional streamed format. To further this endeavor, bitserial encoder and decoder designs that possess compelling speed and area advantages are developed for service as parallel processing elements. However, both are suitable in more general contexts as well. The decoder, in particular, is optimally fast. The encoder and decoder designs are programmable, thus suggesting the appropriateness of the composite approach for a generalpurpose ultrahighspeed codec. The benefits for lowpower and variablerate applications are briefly discussed

[fridman00tigersharc] 
Jose Fridman and Zvi Greenfield.
The TigerSHARC dsp architecture.
IEEE Micro, 20(1):6676, 2000.
[ bib ]
This highly parallel DSP architecture based on a shortvector memory system incorporates techniques found in generalpurpose computing. It promises sustained performance close to its peak computational rates of 900 MFLOPS (32bit floatingpoint) or 3.6 BOPS (16bit fixedpoint).

[flinsenberg04creating] 
Ingrid Flinsenberg, Martijn van der Horst, Johan Lukkien, and Jacques Verriet.
Creating graph partitions for fast optimum route planning.
In WSEAS Transactions on computers, 2004.
[ bib ]
We investigate fast optimum route planning in large, realworld road networks for car navigation systems. We show how graph partitioning can be used to increase the speed of planning optimum routes. Creating a graph partition with future route planning in mind leads to a nonstandard graph partitioning problem. In particular, the quality of a partition, indicated by the objective value, is assumed to represent the execution time of the route planning process. We present an efficient approximation algorithm for creating graph partitions suited for fast optimum route planning. We study the relation between the objective value and the number of edges evaluated by the route planning algorithm, which is an objective measure of the route planning speed. Experiments show that the best partition according to the objective value does not lead to the fastest route planning process. We present a new objective value and show that better partitions result in faster route planning for our new objective value.

[ghazal00retargetable] 
Naji Ghazal, Richard Newton, and Jan Rabaey.
Retargetable estimation scheme for dsp architecture selection.
In Proceedings ASPDAC 2000, pages 377380, January 2000.
[ bib 
.pdf ]
Given the recent wave of innovation and diversification in digital signal processor (DSP) architecture, the need for quickly evaluating the true potential of considered architectural choices for a given application has been rising. We propose a new scheme, called Retargetable Estimation, that involves analysis of a highlevel description of a DSP application, with aggressive optimization search, to provide a performance estimate of its optimal implementation on the architectures considered. With this scheme, we present a new parameterized architecture model that allows quick retargeting to a wide range of architectural choices, and that emphasizes capturing an architecture's salient optimizing features. We show that for a set of DSP benchmarks and two full applications, handoptimized performance can be predicted reliably. We applied this scheme to two different processors.

[gil93computing] 
Joseph Gil and Michael Werman.
Computing 2d min, median, and max fitlers.
IEEE Transactions on Pattern Analysis and Machine Intelligence,
15(5):504507, May 1993.
[ bib 
.pdf ]
Fast algorithms for computing min, median, max, or any other order statistic filter transforms are described. The algorithms take constant time per pixel to compute min or max filters and polylog time per pixel, in the size of the filter, to compute the median filter. A logarithmic time per pixel lower bound for the computation of the median filter is shown.

[glossner97delftjava] 
C. John Glossner and Stamatis Vassiliadis.
The DelftJava engine: An introduction.
Lecture Notes in Computer Science, 1300, 1997.
[ bib 
.ps.gz ]
In this paper we introduce the DelftJava multithreaded processor achitecture and organization. The proposed architecture provides direct translation capability from the Java Virtual Machine instruction set into the DelftJava instruction set. The instruction set is a 32bit RISC instruction set architecture with support for multiple concurrent threads and Java specific constructs. The parallelism is extracted transparently to the programmer. Except for kernel programs, programmers need only be concerned with the semantics of the Java programming language. In addition, programmers who desire to take greater advantage of parallelism can execute privileged instructions which provide additional capabilities for Multimedia and DSP processing.

[gold68note] 
B. Gold and K.L. Jordan.
A note on digital filter synthesis.
Proceedings of the IEEE (Letters), 56:17171718, October 1968.
[ bib ]
It is commonly assumed that digital filters with both poles and zeros in the complex zplane can be synthesized using only recursive techniques while filters with zeros alone can be synthesized by either direct convolution or via the discrete Fourier transform (DFT). In this letter it is shown that no such restrictions hold and that both types of filters (those with zeros alone or those with both poles and zeros) can be synthesized using any of the three methods, namely, recursion, DFT, or direct convolution.

[gordon98survey] 
Daniel M. Gordon.
A survey of fast exponentiation methods.
Journal of Algorithms, 27(1):129146, 1998.
[ bib 
.html ]
Publickey cryptographic systems often involve raising elements of some group (e.g. GF(2 n ), Z/NZ, or elliptic curves) to large powers. An important question is how fast this exponentiation can be done, which often determines whether a given system is practical. The best method for exponentiation depends strongly on the group being used, the hardware the system is implemented on, and whether one element is being raised repeatedly to different powers, different elements are raised to a fixed power, or both powers and group elements vary. This problem has received much attention, but the results are scattered through the literature. In this paper we survey the known methods for fast exponentiation, examining their relative strengths and weaknesses.

[hauck98asynchronous] 
O. Hauck, H. Sauerwein, and S.A. Huss.
Asynchronous vlsi architectures for huffman codecs.
In Proceedings of the 1998 IEEE International Symposium on
Circuits and Systems (ISCAS '98), volume 5, pages 542545, May 1998.
[ bib 
.pdf ]
A novel asynchronous VLSI architecture for Huffman codecs employing fixed code books is presented. The main idea is to layout the Huffman tree in hardware and to exploit signal statistics via asynchronous logic thus avoiding expensive pipelining. This approach applies to encoders and decoders as well. The resulting circuits are both compact and fast. Postlayout HSpice simulations of an encoder tree demonstrate their efficiency. A comparison of twophase and fourphase protocol results in only a marginal speed advantage of twophase while incurring significant overhead and a more difficult design.

[horst05recursive] 
M. van der Horst, K. van Berkel, J. Lukkien, and R. Mak.
Recursive filtering on a vector dsp with linear speedup.
In IEEE 16th Int. Conf. Applicationspecific Systems,
Architectures and Processors, pages 379386. IEEE Computer Society, July
2005.
[ bib 
.pdf ]
Vector DSPs, or SIMD DSPs, have received considerable attention recently, since they are considered to be a viable alternative for dedicated hardware in signal processing for multimedia and wireless communication. It is possible to construct dedicated hardware for IIR filters with a linear speedup, but because of their recursive nature these filters are considered difficult to map efficiently on a vector DSP. The IIR programs for vector DSPs presented so far have their speedup bounded by the order of the filter. In this paper we present a program that has a linear speed up, in the sense that doubling the vector size will double the throughput. The program is a vectorization of the incremental blockstate architecture. The speedup of this program is not bounded by the order of the filter, and even works for low order filters. Besides strided memory access, no special processor features are required by our program. As a proof of concept we implemented it on the Philips EVP processor.

[howard92parallel] 
Paul G. Howard and Jeffrey Scott Vitter.
Parallel lossless image compression using huffman and arithmetic
coding.
In Data Compression Conference, pages 299308, March 1992.
[ bib 
.pdf ]
We show that highresolution images can be encoded and decoded efficiently in parallel. We present an algorithm based on the hierarchical multilevel progressive (MLP) method, used either with Huffman coding or with a new variant of arithmetic coding called quasiarithmetic coding. The coding step can be parallelized, even though the codes for different pixels are of different lengths; parallelization of the prediction and error modeling components is straightforward.

[huang79fast] 
Thomas S. Huang, George J. Yang, and Gregory Y. Tang.
A fast twodimensional median filtering algorithm.
IEEE Transactions on Acoustics, Speech, and Signal Processing,
27(1):1318, February 1979.
[ bib 
http ]
We present a fast algorithm for twodimensional median filtering. It is based on storing and updating the gray level histogram of the picture elements in the window. The algorithm is much faster than conventional sorting methods. For a window size of m × n, the computer time required is 0(n).

[huffman98method] 
David A. Huffman.
A method for the construction of minimum redundancy codes.
In Proceedings of the I.R.E., pages 10981101, September 1952.
[ bib 
.pdf ]
An optimum method of coding an ensemble of messages consisting of a finite number of members is developed. A minimumredundancy code is one constructed in such a way that the average number of coding digits per message is minimized.

[immink03signal] 
A.H.J. Immink, W.M.J. Coene, A.M. van der Lee, C. Busch, A.P. Hekstra, J.W.M.
Bergmans, J. Riani, S.J.L. v. Beneden, and T. Conway.
Signal processing and coding for twodimensional optical storage.
In Global Telecommunications Conference, 2003. GLOBECOM '03.
IEEE, volume 7, pages 39043908, December 2003.
[ bib ]
This paper introduces the concept of TwoDimensional Optical Storage (TwoDOS). In this concept, bits are written in a broad spiral consisting of a number of bitrows stacked together in a hexagonal packing. Bits with a value '1' are represented physically as circular pitholes on the disc, while bits with a value '0' are characterized by the absence of such a pithole. A scalar diffraction model is used to calculate the signal levels for various diameters of the pits. A stripewise Viterbi detector is proposed to perform 2D bitdetection with a limited state complexity of the trellis. Simulation results are shown for various diameters of the pits A 2D modulation code is applied to eliminate patterns that yield a high probability of erroneous detection.

[imminkXXadaptation] 
A.H.J. Immink, J. Riani, S.J.L. v. Beneden, J.W.M. Bergmans, M. Ciacci,
A. Nowbakht Irani, W.M.J. Coene, A.M. van der Lee, and D. Bruls.
Adaptation and timing recovery for twodimensional optical storage.
.
[ bib ]
This paper discusses several issues related to adaptation and timing recovery for TwoDimensional Optical Storage. PRML detection in the form of a stripewise Viterbi detector is used. This detector introduces and increasing detection delay when going from the outer rows towards the center of the broad spiral. For fast control loops in a decision directed mode, special measures are needed to avoid instability due to this delay. Another issue is the large span of the 2D intersymmbol interference at higher densities and tilt, leading to a large 2D Equalizer. Furthermore, writechannel imperfections such as timevarying lattice distortion due to multiplepass mastering require independent timing recovery on each row within the broad spiral.

[jiang94parallel] 
J. Jiang and S. Jones.
Parallel design of arithmetic coding.
In IEE Proceedings on Computers and Digital Techniques, volume
141, pages 327333, November 1994.
[ bib 
.pdf ]
The paper presents a parallel algorithm design for realtime implementation of arithmetic coding. The implementation comprises a parallelprocessing array arranged in a tree structure. Within each cycle, a group of input symbols can be encoded. This increases the arithmetic coding speed substantially. Details of a fixedprecision algorithm design, its implementation and simulation of its performance are reported.

[juhola91comparison] 
Martti Juhola, Jyrki Katajained, and Timo Raita.
Comparison of algorithms for standard median filtering.
IEEE Transactions on Signal Processing, 39(1):204208, January
1991.
[ bib 
.pdf ]
Standard median filtering that is searched repeatedly for a median from a sample set which changes only slightly between the subsequent searches is discussed. Several wellknown methods for solving this running median problem are reviewed, the (asymptotical) time complexities of the methods are analyzed, and simple variants are proposed which are especially suited for small sample sets, a frequent situation. Although the discussion is restricted to the onedimensional case, the ideas are easily extended to higher dimensions.

[karkada94high] 
Srikanth Karkada, Chaitali Chakrabarti, and Andreas Spanias.
High sample rate architectures for block adaptive filters.
In IEEE International Symposium on Circuits and Systems,
volume 4, pages 131134, 1994.
[ bib 
.pdf ]
In this paper we propose a variety of architectures for implementing block adaptive filters in the timedomain. These filters are based on a block implementation of the least mean squares (BLMS) algorithm. First, we present an architecture which directly maps the BLMS algorithm into an array of processors. Next, we describe an architecture where the weight vector is updated without explicitly computing the filter error. Third, we describe an architecture which exploits the redundant computations of overlapping windows. All the architectures have a significantly smaller sample period compared to frequency domain implementations. Moreover, the sample periods can be reduced even further by applying relaxed lookahead techniques.

[kim02fast] 
Jongmyon Kim and D. Scott Wills.
Fast vector median filter implementation using the color pack
instruction set.
In Proceedings of the IEEE 2002 10th Digital Signal Processing
Workshop, pages 339343, October 2002.
[ bib 
.pdf ]
Vector median filters (VMF) are widely used to reduce noise while preserving image details. However, their high computational complexity for color images makes them impractical for realtime systems. We propose a new realtime VMF implementation using compressed colorpack extension (CCPX) instruction set architecture. A 3x3 VMF applied to a 256x256 color image implemented with CCPX executes four times faster (15,190 cycles versus 65,650 cyles) than the same machine without CCPX. CCPX can be incorporated in range of architectures from current ILP processors to future massively data parallel machines. An implementation of a 4,096 node SIMD processor is examined.

[kleihorst01xetal] 
R.P. Kleihorst, A.A. Abbo, A. van der Avoird, M.J.R. Op de Beeck, and L. Sevat.
Xetal: A lowpower highperformance smart camera processor.
In IEEE Int. Symposium on Circuits and Systems (ISCAS),
volume 5, pages 215218, May 2001.
[ bib 
.pdf ]
Xetal is a digital signal processor to be combined with a 30 frames per second VGAformat CMOS or CCD image sensor or any other source of digital video data. The processor is fully programmable and therefore able to run a variety of algorithms ranging from image communication to machine vision. Xetal comprises a parallel processor array and a special purpose controller to achieve high computational performances (up to 5 GOPS) with a very modest power consumption. This can go down to 30 mW for simple applications such as a digital camera for video conferencing. The Xetal chip has been realized in a 0.18 µm CMOS process and takes up an area of 25 mm2

[knuth98art]  Donald E. Knuth. The Art of Computer Programming: Seminumerical algorithms, volume 2. AddisonWesley, 3rd edition, 1998. [ bib ] 
[knuth68art3]  Donald E. Knuth. The Art of Computer Programming: Sorting and Searching, volume 3. AddisonWesley, 2nd edition, 1968. [ bib ] 
[koc96analyzing] 
Cetin Kaya Koc, Tolga Acar, and Burton S. Kaliski Jr.
Analyzing and comparing montgomery multiplication algorithms.
IEEE Micro, 16(3):2633, June 1996.
[ bib 
.pdf ]
This paper discusses several Montgomery multiplication algorithms, two of which have been proposed before. We describe three additional algorithms, and analyze in detail the space and time requirements of all five methods. These algorithms have been implemented in C and in assembler. The analyses and actual performance results indicate that the Coarsely Integrated Operand Scanning (CIOS) method, detailed in this paper, is the most efficient of all five algorithms, at least for the general class of processor we considered. The Montgomery multiplication methods constitute the core of the modular exponentiation operation which is the most popular method used in publickey cryptography for encrypting and signing digital data.

[kolte99fast] 
Priyadarshan Kolte, Roger Smith, and Wen Su.
A fast median filter using altivec.
In International Conference on Computer Design (ICCD), pages
384391, 1999.
[ bib 
.pdf ]
This paper describes the design and implementation of a median filter for graphics images on the Motorola AltiVec architecture. The filter utilizes 16way SIMD parallelism to filter images at rates of 1.15 cycles/pixel for 3 ×3 squares and 6.6 cycles/pixel for 5 ×5 squares. The median filter is based on a new sorting network which sorts N^{2} numbers (arranged in an N ×N square) by sorting all columns, rows, and diagonal lines in the square. This paper also describes a scheme for efficient testing of the sorting network.

[konvalina04palindromepolynomials] 
John Konvalina and Valentin Matache.
Palindromepolyonomials with roots on the unit circle.
C.R. Math. Rep. Acad. Sci. Canada, 26:3944, 2004.
[ bib 
.pdf ]
Given a polynomial f(x) of degree n, let f^{r}(x) denote its reciprocal, i.e., f^{r}(x)=x^{n}f(1/x). If a polynomial is equal to its reciprocal, we call it a palindrome since the coefficients are the same when read backwards or forwards. In this mathematical note we show that palindromes whose coefficients satisfy a certain magnitude condition must have a root on the unit circle. More exactly our main result is the following. If a palindrome f(x) of even degree n with real coefficients ε_{0},ε_{1},..., ε_{n} satisfies the condition ε_{k} >=ε_{n/2} (π/([(n/2)/(n/2k)]+2)), for some k in{0,1,...,n/21}, then f(x) has unimodular roots. In particular, palindromes with coefficients 0 and 1 always have a root on the unit circle.

[krukowski96decomposition] 
A. Krukowski, I. Kale, and G.D. Cain.
Decomposition of iir transfer functions into parallel arbitraryorder
iir subfilters.
In Proceedings of the IEEE Nordic Signal Processing Symposium
(NORSIG '96), pages 175178, September 1996.
[ bib 
.pdf ]
This paper presents decomposition methods for any arbitrary complex IIR filter transfer function into a sum of variableorder IIR sections or firstorder IIR or allpass sections. Such transformation allows parallel processing of every section in each path by a separate processing element and hence greatly increases the filter computation speed. For the general case, both real and complex filters are decomposed into parallel complex IIR filters as well as real filters decomposed into a set of real IIR sections.

[kwan87high] 
H.K. Kwan and M.T. Tsim.
High speed 1d fir digital filtering architectures using polynomial
convolution.
In IEEE International Conference on ICASSP, volume 12, pages
18631866, April 1987.
[ bib 
.pdf ]
Two sets of 1D FIR digital filtering architectures are proposed to reduce computational complexity and to increase throughput rate. Each architecture is regular in structure with a high degree of parallelism and pipelining. Consequently, they are suitable for VLSI or multiprocessor implementation. In the paper, infinite linear convolution is first converted into finite length linear or cyclic convolution in polynomial ring. Certain algorithms that are used to reduce computational complexity in finite length linear or cyclic convolution can then be applied here to reduce computational complexity of polynomial convolution and give the resulting filter structure.

[ladner80parallel] 
Richard E. Ladner and Michael J. Fischer.
Parallel prefix computation.
Journal of the Association for Computing Machinery,
27(4):831838, October 1980.
[ bib 
http ]
The prefix problem is to compute all the products x1 o x2 . . . . o xk for i <= k <= n, where o is an associative operation. A recursive construction is used to obtain a product circuit for solving the prefix problem which has depth exactly [log2 n] and size bounded by 4n. An application yields fast, small Boolean circuits to simulate finitestate transducers. By simulating a sequental adder, a Boolean circuit which has depth 2[log2 n] + 2 and size bounded by 14n is obtained for nbit binary addition. The size can be decreased significantly by permitting the depth to increase by an additive constant. KEY WORDS AND PHRASES automaton, binary addition, circuit, combinational complexity, depth, fanout, parallelism, size, transducer

[lakatos02polynomials] 
Piroska Lakatos.
On polynomials having zeros on the unit circle.
Comptes Rendus Mathématiques de l'Acad. des Sci. (Soc. Roy. du
Canada), 24(2):9196, 2002.
[ bib ]
Nous prouvons que toute zéros du polynomial réel réciproqueh_m(z) = l(z^m+z^m1+...+z+1)+_k=1^[(m)/(2)]a_k(z^mk+z^k) (z inC)de degré m où l, a_{0}, a_{1},...,a_{[(m)/(2)]}inR, l <>0, m inN, m >=2 sont sur le cercle d'unité si l >=2 _{k=1}^{[(m)/(2)]}a_{k}. Utilisant ce résultat nous recevrons qu'un polynomials réciproque P_{m}(z) = _{j=0}^{m} A_{j}z^{j} (z inC) de degré m>=2 avec coefficients réels A_{j} inR a zéros sur le cercle d'unité supposant queA_m>=_k=1^m1  A_k  A_m .

[lakatos03zeros]  Piroska Lakatos and László Losonczi. On zeros of reciprocal polynomials of odd degree. Journal of Inequalities in Pure and Applied Mathematics, 4(3), 2003. [ bib  .pdf ] 
[lam88software] 
Monica Lam.
Software pipelining: An effective scheduling technique for vliw
machines.
In Proceedings of the SIGPLAN '88 Conference on Programming
Language Design and Implementation, pages 318328, June 1988.
[ bib 
http ]
This paper shows that software pipelining is an effective and viable scheduling technique for VLIW processors. In software pipelining, iterations of a loop in the source program are continuously initiated at constant intervals, before the preceding iterations complete. The advantage of software pipelining is that optimal performance can be achieved with compact object code. This paper extends previous results of software pipelining in two ways: First, this paper shows that by using an improved algorithm, nearoptimal performance can be obtained without specialized hardware. Second, we propose a hierarchical reduction scheme whereby entire control constructs are reduced to an object similar to an operation in a basic block. With this scheme, all innermost loops, including those containing conditional statements, can be software pipelined. It also diminishes the startup cost of loops with small number of iterations. Hierarchical reduction complements the software pipelining technique, permitting a consistent performance improvement be obtained. The techniques proposed have been validated by an implementation of a compiler for Warp, a systolic array consisting of 10 VLIW processors. This compiler has been used for developing a large number of applications in the areas of image, signal and scientific processing.

[langdon84introduction] 
Glen G. Langdon Jr.
An introduction to arithmetic coding.
IBM journal of research and development, 28(2):135149, March
1984.
[ bib 
.pdf ]
Arithmetic coding is a data compression technique that encodes data (the data string) by creating a code string which represents a fractional value on the number line between 0 and 1. The coding algorithm is symbolwise recursive; i.e., it operates upon and encodes (decodes) one data symbol per iteration or recursion. On each recursion the algorithm successively partitions an interval of the number line between 0 and 1, and retains one of the partitions as the new interval. Thus, the algorithm successively deals with smaller intervals, and the code string, viewed as a magnitude, lies in each of the nested intervals. The data string is recovered by using magnitude comparisons on the code string to recreate how the encoder must have succesively partitioned and retained each nested subinterval. Aritmetic coding differs considerably from the more familiar compression coding techniques, such as prefix (Huffman) codes. Also, it should not be confused with error control coding, whose object is to detect and correct errors in computer operations. This paper presents the key notions of arithmetic compression coding by means of simple examples

[lapointe91very] 
Marcel Lapointe, Paul Fortier, and Huu Tuê Huynh.
A very fast digital realization of a timedomain block lms filter.
In 1991 International Conference on Acoustics, SPeech, and
Signal Processing (ICASSP91), volume 3, pages 21012104, April 1991.
[ bib 
.pdf ]
A novel digital implementation of timedomain block LMS (least mean square) filtering is presented. The primitive operators are serialparallel multipliers which produce the results digit by digit, most significant first. The use of a redundant notation is essential here. These operators are placed in a parallel and pipeline structure, resulting in a fast realization with O(log(L×N)L) time, where N and L are filter length and block length, respectively. This new realization is particularly suitable for VLSI implementation because of this modularity and the high integration capacity of the serialparallel multipliers.

[larsen00exploiting] 
Samuel Larsen and Saman Amarasinghe.
Exploiting superword level parallelism with multimedia instruction
sets.
In Proceedings of the ACM SIGPLAN 2000 conference on Programming
language design and implementation, pages 145156, June 2000.
[ bib 
http ]
Increasing focus on multimedia applications has prompted the addition of multimedia extensions to most existing general purpose microprocessors. This added functionality comes primarily with the addition of short SIMD instructions. Unfortunately, access to these instructions is limited to inline assembly and library calls. Generally, it has been assumed that vector compilers provide the most promising means of exploiting multimedia instructions. Although vectorization technology is well understood, it is inherently complex and fragile. In addition, it is incapable of locating SIMDstyle parallelism within a basic block. In this paper we introduce the concept of Superword Level Parallelism (SLP) ,a novel way of viewing parallelism in multimedia and scientific applications. We believe SLPP is fundamentally different from the loop level parallelism exploited by traditional vector processing, and therefore demands a new method of extracting it. We have developed a simple and robust compiler for detecting SLPP that targets basic blocks rather than loop nests. As with techniques designed to extract ILP, ours is able to exploit parallelism both across loop iterations and within basic blocks. The result is an algorithm that provides excellent performance in several application domains. In our experiments, dynamic instruction counts were reduced by 46%. Speedups ranged from 1.24 to 6.70.

[lee97parallel] 
HorngYeong Lee, LeuShing Lan, MingHwa Sheu, and ChienHsing Wu.
A parallel architecture for arithmetic coding and its vlsi
implementation.
In IEEE 39th Midwest symposium on Circuits and Systems,
volume 3, August 1996.
[ bib 
.pdf ]
A new parallel architecture for arithmetic coding is presented in this paper. By dividing the input symbols into a number of groups and processing them in parallel, significant speedup can be achieved in comparison with existing architectures. The advantages of this parallel architecture are its easier expandability, higher speed, and smaller latency. The parallel arithmetic coder has also been implemented on VLSI using the VHDL technique. The resultant chip layout has a size of 4993×6503 μm^{2}

[lee02implications] 
Byong Kil Lee and Lizy Kurian John.
Implications of programmable general purpose processors for
compression/encryption applications.
In Proceedings of the 13th International Conference on
Applicationspecific Systems, Architectures and Processors (ASAP 2002),
pages 233242, July 2002.
[ bib 
.html ]
With the growth of the Internet and mobile communication industry, multimedia applications form a dominant computer workload. Media workloads are typically executed on Application Specific Integrated Circuits (ASICs), application specific processors (ASPs) or general purpose processors (GPPs). GPPs are flexible and allow changes in the applications and algorithms better than ASICs and ASPs. However, executing these applications on GPPs is done at a high cost. In this paper, we analyze media compression/decompression algorithms from the perspective of the overhead of executing them on a programmable general purpose processor versus ASPs. We choose nine encode/decode programs from audio, image/video and encryption applications. The instruction mix, memory access and parallelism aspects during the execution of these programs are analyzed. Memory access latency is observed to be the main factor influencing the execution time on general purpose processors. Most of these compression/decompression algorithms involve processing the data through execution phases (e.g. quantization, encoding, etc) and temporary results are stored and retrieved between these phases. A metric called overhead memoryaccess bandwidth per input/output byte is defined to characterize the temporary memory activity of each application. We observe that more than 90% of the memory accesses made by these programs are temporary data stores and loads arising from the general purpose nature of the execution platform. We also study the data parallelism in these applications, indicating the ability of instruction level and data level parallel processors to exploit the parallelism in these applications. The parallelism ranges from 6 to 529 in encode processes and 18 to 558 in decode processes.

[leiserson83optimizing] 
Charles E. Leiserson and James B. Saxe.
Optimizing synchronous systems.
Journal of VLSI and Computer Systems, 1(1):4167, 1983.
[ bib ]
The complexity of integratedcircuit chips produced today makes it feasible to build inexpensive, specialpurpose subsystems that rapidly solve sophisticated problems on behalf of a generalpurpose host computer. This paper contributes to the design methodology of efficient VLSI algorithms. We present a transformation that converts synchronous systems into more timeefficient, systolic implementations by removing combinational rippling. The problem of determining the optimized system can be reduced to the graphtheoretic singledestinationshortestpaths problem. More importantly from an engineering standpoint, however, the kinds of rippling that can be removed from a circuit at essentially no cost can be easily characterized. For example, if the only global communication in a system is broadcasting from the host computer, the broadcast can always be replaced by local communication.

[li00hardware] 
Yanbing Li, Tim Callahan, Ervan Darnell, Randolph Harr, Uday Kurkure, and Jon
Stockwood.
Hardwaresoftware codesign of embedded reconfigurable architectures.
In Proceedings of the 37th Conference on Design Automation,
2000.
[ bib 
.pdf ]
In this paper we describe a new hardware/software partitioning approach for embedded reconfigurable architectures consisting of a generalpurpose processor (CPU), a dynamically reconfigurable datapath (e.g. an FPGA), and a memory hierarchy. We have developed a framework called Nimble that automatically compiles systemlevel applications specified in C to executables on the target platform. A key component of this framework is a hardware/software partitioning algorithm that performs finegrained partitioning (at loop and basicblock levels) of an application to execute on the combined CPU and datapath. The partitioning algorithm optimizes the global application execution time, including the software and hardware execution times, communication time and datapath reconfiguration time. Experimental results on real applications show that our algorithm is effective in rapidly finding close to optimal solutions.

[lim92pipelined] 
Y.C. Lim and Bede Liu.
Pipelined recursive filter with minimum order augmentation.
IEEE Transactions on Signal Processing, 40(7), July 1992.
[ bib 
.pdf ]
Pipelining is an efficient way for improving the average computation speed of an arithmetic processor. The higher the degree of pipeline segmentation, the faster the possible pipeline clock speed. However, for an Mstage pipeline, the result of a given operation is available only M clock periods after initiating the computation. In a recursive filter, the output at the nth sampling instance, y(n), is a function of its previous outputs, y(n1) through y(nN). Hence, the computation of y(n) cannot be initiated before the computations of y(n1) through y(nN) are completed. Voelcker et al. and Kogge et al. independently devised two augmentation techniques for resolving the dependence problem in the computation of y(n). The advantage of their techniques is that the number of nonzero denominator coefficients of the augmented filter is the same as that of the prototype filter. However, using their techniques, the augmentation required to ensure stability may be excessively high resulting in a very complex numerator realization. In this paper, we present a technique which results in a minimum order augmentation. The complexity of the filter designed using our technique is very much lower. Various pipelining architectures are presented. It is also demonstrated by using an example that when compared to the prototype filter, the augmented filter has a lower coefficient sensitivity and better roundoff noise performance.

[lin89improving] 
HorngDar Lin and David G. Messerschmitt.
Improving the iteration bound of finite state machines.
In IEEE International Symposium on Circuits and Systems,
volume 2, pages 13281331, May 1989.
[ bib 
.pdf ]
For a given IC technology or computing environment, the speed of a finite state machine is limited by the iteration bound imposed by the recursive state transitions. This paper describes a general approach to improve the speed beyond the given iteration bound of an arbitrary synchronous finite state machine, or a discretetime finitestate Markov process. The methods proposed can be implemented with pipelined arrays of simple hardware modules, achieving a throughput rate on the order of the reciprocal of a singlegate delay or latch setup time, whichever is limiting, at the expense of latency. Combining the pipelined arrays and modules in parallel further increases the throughput, which is theoretically unbounded. The implemented concurrent FSM is fully efficient and has minimal overhead, where the complexity grows only linearly with the speedup. Our approach is practical for FSM's with a small number of states, or other special structure.

[lin91finite] 
HorngDar Lin and David G. Messerschmitt.
Finite state machine has unlimited concurrency.
IEEE Transactions on Circuits and Systems, 38(5), May 1991.
[ bib 
.pdf ]
Because of the recursion in state transitions, the throughput of a finite state machine is generally thought to be limited for a given IC technology or processing environment. Existing methods of improving throughput are limited in generality and/or achievable speedup, due to their complexity overhead. This paper describes general methods for introducing concurrency, by which the throughput can be improved at the expense of latency. The methods are applicable to software and hardware implementation using parallelism or pipelining. The methods demonstrate that there is no theoretical limit to concurrency in a discretetime finite state machine. In practice, our methods can effectively improve the throughput, as opposed to the response time, for finite state machines with a moderate number of states.

[lin92designing] 
HorngDar Lin and David G. Messerschmitt.
Designing a highthroughput vlc decoder. part ii. parallel decoding
methods.
IEEE Transactions on Circuits and Systems for Video Technology,
2(2), June 1992.
[ bib 
.pdf ]
For applications in graphic computers, image and video composition, highdefinition television (HDTV), and optical fiber networks, Huffmancoded images need to be reconstructed at a high throughput rate. Part I showed several architectural and architecturespecific optimization techniques. However, due to the recursion within the reconstruction algorithm, the achievable throughput rate for a given decoding architecture in a given IC technology is limited. The authors propose various concurrent decoding methods to relax the throughput limit by using parallel or pipelined hardware. These methods are simple, effective, flexible, and applicable to general decoder architectures. Unlimited concurrency can be achieved at the expense of additional latency, the overhead is low, and the complexity increases linearly with the throughput improvement. It is believed that the proposed methods and architectures make it possible to reconstruct arbitrarily high resolution Huffmancoded images and video in real time with current electronics.

[long89lms] 
Guozhu Long, Fuyun Ling, and John G. Proakis.
The lms algorithm with delayed coefficient adaptation.
IEEE Transactions on Acoustics, Speech and Signal Processing,
37(9), September 1989.
[ bib 
.pdf ]
The behavior of the delayed leastmeansquare (DLMS) algorithm is studied. It is found that the step size in the coefficient update plays a key role in the convergence and stability of the algorithm. An upper bound for the step size is derived that ensures the stability of the DLMS. The relationship between the step size and the convergence speed, and the effect of the delay on the convergence speed, are also studied. The analytical results are supported by computer simulations.

[lu85fast] 
HuiHung Lu, Edward Ashford Lee, and David G. Messerschmitt.
Fast recursive filtering with multiple slow processing elements.
IEEE Transactions on Circuits and Systems, CAS32(11), November
1985.
[ bib 
.pdf ]
This paper describes systolic realizations of FIR and IIR digital filters with sample rates much higher than the speed of a single arithmetic unit or processing element. The architecture trades increased throughtput for increased latency. For IIR filters, the technique is based on blockstate filter descriptions in which the state update matrix is converted to triangular or quasitriangular form via a unitary or orthogonal similarity transformation. The effect of this transformation on the roundoff noise is examined in the Appendix. The latency, complexity, and suitability to VLSI implementations are considered, as well as an attractive application to interpolation and decimation

[lucke92new] 
Lori E. Lucke and Keshab K. Parhi.
A new vlsi architecture for rank order and stack filters.
In IEEE International Symposium on Circuits and Systems,
volume 1, pages 101104, May 1992.
[ bib 
.pdf ]
Onedimensional rank order filters are nonlinear filters which choose an output based on its rank within a onedimensional window of sample inputs determined by sorting the inputs. Several extensions to rank order filtering include recursive rank order filtering, twostage rank order filtering, stack filtering, and twodimensional rank order filtering. Two classes of VLSI architectures are commonly used for rank order filtering. The first cllass stores the inputs within the sample window in a shift register and employs sorting networks to completely sort the window of inputs when a new sample arrives. The second class called running order sorters utilizes the overlapping windows between consecutive outputs to efficiently maintain a sorted list of the inputs within the window. Although the class two architectures are computationally more efficient than the class one architectures, they are difficult both to pipeline and to apply to the rank order filter extensions. A new VLSI architecture for rank order filtering is presented in this paper. This architecture is an addition to the class two rank order filter architectures and can be pipelined and applied to the extensions to rank order filtering.

[lucke92parallel] 
Lori E. Lucke and Keshab K. Parhi.
Parallel structures for rank order and stack filters.
In IEEE Int. Conf. on Acoustics, Speech and Signal Processing,
volume 5, pages 645648, March 1992.
[ bib 
http ]
Rank order filters are nonlinear filters which choose an output based on the rank within a window of sample inputs determined by sorting the inputs. Rank order filters are a subset of the class of stack filters which are also nonlinear. A systematic method for applying block processing, which transforms singleinput singleoutput structures to parallelinput paralleloutput structures for nonrecursive rank order and stack filters is presented. This method takes advantage of shared substructures within the block structure to efficiently generate a block filter whose complexity can be up to onehalf the size of the original filter structure times the block size. This method can also be applied to twodimensional nonrecursive rank order filters.

[lucke94parallel] 
Lori E. Lucke and Keshab K. Parhi.
Parallel processing architectures for rank order and stack filters.
IEEE Transactions on Signal Processing, 42(5):11781189, May
1994.
[ bib 
.pdf ]
Many architectures have been proposed for rank order and stack filtering. To achieve additional speedup in these structures requires the use of parallel processing techniques such as pipelining and block processing. Pipelining is well understood but few block architectures have been developed for rank order and stack filtering. Block processing is essential for additional speedup when the original architecture has reached the throughput limits caused by the underlying technology. A trivial block structure simply repeats a single input, single output structure to generate a multiple input, multiple output structure. Therefore the architecture can achieve speedups equal to the number of multiple outputs or the block size. However, unlike linear filters, the rank order and stack filter outputs are calculated using comparisons. It is possible to share these comparisons within the block structure and thus substantially reduce the size of the block structure. The authors introduce a systematic method for applying block processing to rank order filters and stack filters. This method takes advantage of shared comparisons within the block structure to generate a block filter with shared substructures whose complexity is reduced by up to onethird compared to the original filter structure times the block size. Furthermore, block processing is important for the generation of low power designs. A block structure can trade its increased speedup for a throughput equal to the original single output architecture but with a significantly lower power requirement. The power reduction in the trivial block structures is limited by the power supply voltage. They demonstrate how block structures with shared substructures allow them to continue decreasing the power consumption beyond the limit imposed by the supply voltage.

[luk93structured] 
Wayne Luk, Davind Ferguson, and Ian Page.
Structured hardware compilation of parallel programs.
In More FPGAs, pages 8294. Abingdon EE&CS Books, 1994.
[ bib 
.pdf ]
A major bottleneck in automatic hardware synthesis is the time to place and route the netlist produced by a hardware compiler. This paper presents a method which exploits the syntax of the source program to guide its layout in a deviceindependent manner. The technique has been used in prototyping a hardware compiler for a commerciallyavailable device, the Algotronix CAL1024 FieldProgrammable Gate Array. The potential of this approach is evaluated.

[luk98visualising] 
Wayne Luk and Scott Guo.
Visualising reconfigurable libraries for fpgas.
In Proceedings of the IEEE Symposium on FieldProgrammable
Custom Computing Machines, pages 167176. IEEE Computer Society Press,
1997.
[ bib 
.ps.gz ]
This paper describes a framework and tools for visualising hardware libraries for FleldProgrammable Gate Arrays (FPGAs), which should also be useful for circuit design in general. Our approach integrates the visualisation of design behaviour and structure, supports various simulation modes, and assists the development of runtime reconfigurable designs in FPGAs such as Xilinx 6200 devices. Our tools can automatically generate a block diagram from a concise parametrised description. Design operations are animated by projecting a dataflow model on the block diagram. The user can select to view data values on specific input and output ports and internal paths. Numerical, symbolic and bitlevel simulation and their combination are supported, and the animation speed can be adjusted. The tools should benefit both library users and suppliers, since they can be used (a) to show the internal structure of a design, (b) to illustrate effective usage of library components, and (c) to present the consequences of parametrising designs in different ways.

[lutovac98symbolic] 
Miroslav D. Lutovac, Dejan V. Tošić, and Brian L. Evans.
Symbolic analysis of programmable digital filters.
In Proceedings of the 5th International Workshop on Symbolic
Methods and Applications to Circuit Design (SMACD'98), pages 177180,
October 1998.
[ bib ]
This paper focuses on symbolic derivation of system and noise transfer functions and symbolic synthesis of programmable digital filters. A new symbolic approach is presented to obtain robust and costeffective realizations not available by classical digital filter design. The advantages of the new approach are discussed.

[mak02design] 
Rudolf H. Mak.
Design and performance analysis of buffers: a constructive approach.
Eighth International Symposium on Asynchronus Circuits and
Systems, April 2002.
[ bib ]
This paper presents a theoretical framework to reason about the correctness of VLSIprograms for buffers and to compare the performance of the corresponding circuits. A very simple calculus consisting of only two operators is presented that suffices to establish the functional correctness of complicated buffer designs. Furthermore, sequence functions are presented both as a formalism to show absence of deadlock and as a vehicle for performance analysis. t is shown that the class of square FIFOs is optimal in the sense that no buffer of the same capacity and i/odistance can accommodate a larger range of occupancies, when run at its minimum cycle time. Moreover, the theory accurately predicts the size of the range of occupancies that has been found experimentally.

[makXXperformance] 
Rudolf H. Mak.
Performance analysis of block computations: with an application to
elastic block sorters.
.
[ bib ]
The main feature of this report is a theoretical framework that allows performance analysis of block computations. Apart from common metrics like throughput and latency, also some lesser known metrics like occupancy and elasticity are taken into account. The latter metric in particular is an indications how wel a design can cope with jitter and bursts in its environment. We have applied the framework to a particular type of block computation, viz. block sorting. For a wide range of block sorter designs we have determined the values of the various metrics. The outcome is a very diverse design landscape in which it depends very much on the metrics considered which design is judged best. Apart from comparison between fundamentally different designs the performance analyses also provide details that can be used to finetune individual designs.

[marpe03context] 
Detlev Marpe, Heiko Schwarz, and Thomas Wiegand.
Contextbased adaptive binary arithmetic coding in the h.264/avc
video compression standard.
IEEE Transactions on Circuits and Systems for Video Technology,
13(7):620636, July 2003.
[ bib 
.pdf ]
ContextBased Adaptive Binary Arithmetic Coding (CABAC) as a normative part of the new ITUT/ISO/IEC standard H.264/AVC for video compression is presented. By combining an adaptive binary arithmetic coding technique with context modeling, a high degree of adaptation and redundancy reduction is achieved. The CABAC framework also includes a novel lowcomplexity method for binary arithmetic coding and probability estimation that is well suited for efficient hardware and software implementations. CABAC significantly outperforms the baseline entropy coding method of H.264/AVC for the typical area of envisaged target applications. For a set of test sequences representing typical material used in broadcast applications and for a range of acceptable video quality of about 30 to 38 dB, average bitrate savings of 9%14% are achieved.

[matsubara99pipelined] 
Katsushige Matsubara, Kyoshi Nishikawa, and Hitoshi Kiya.
Pipelined lms adaptive filter using a new lookahead transformation.
IEEE Transactions on Circuits and Systems II:Analog and Digital
Signal Processing, 46(1), January 1999.
[ bib 
.pdf ]
This paper proposes an adaptive least mean square (LMS) algorithm that can be pipelined and has an efficient architecture for hardware implementation. The proposed algorithm shows the possibility of having LMS algorithms perform pipelined processing without degrading the convergence characteristics. An architecture based on the proposed algorithm is considered.

[matsubara99pipelineda] 
Katsushige Matsubara, Kyoshi Nishikawa, and Hitoshi Kiya.
Pipelined adaptive filters based on lookaheadbased delayed lms
algorithm.
Electronics and Communications in Japan, Part 2, 82(1), 1999.
[ bib 
http ]
As an adaptive algorithm enabling pipelining, the delayed least mean square (DLMS) algorithm is well known. In a pipelined adaptive filter based on the conventional DLMS, the operating clock rate and the convergence characteristic are in a tradeoff relationship, so that they cannot be improved at the same time. In this paper, a lookaheadbased DLMS (LDLMS) algorithm, as an extension of the DLMS based on the lookahead transformation, and its pipelined processing configuration are proposed. By the proposed method, it is shown that the adaptive filter can carry out pipelined processing at a high operating clock rate without degrading the convergence characteristic. The proposed method is characterized by minimal possibility of the delay affecting the convergence property (effective delay) under a desired operating clock rate. Also, by means of computer simulation, the effectiveness of the proposed method is confirmed. (c) 1999 Scripta Technica, Electron Comm Jpn Pt 2, 82(1): 5562, 1999

[meng86implementations] 
Teresa H.Y. Meng and David G. Messerschmitt.
Implementations of arbitrarily fast adaptive lattice filters with
multiple slow processing elements.
In IEEE International Conference on Acoustics, Speech, and
Signal Processing, volume 11, pages 11531156, April 1986.
[ bib 
.pdf ]
This paper describes systolic realizations of adaptive lattice filters with arbitrarily high sampling rates using relatively slow processing elements. The lattice filter is a desirable structure for this purpose because it can adapt coefficients within each stage using the residual errors from the previous stage and the adaptation algorithms at each stage can be modelled as timevarying linear systems. The properties of independence among stages and linear stateupdates make adaptive lattice filters suitable for implementations with arbitrary levels of concurrency. The architecture introduced can be called the vectorized adaptive lattice filter with parallel input and output. It has the property that an arbitrarily high sampling rate can be achieved at the expense of latency. An example of vectorized adaptive lattice filters, the implementation of a normalized prewindowed leastsquares lattice predictor, is illustrated, and its latency and complexity are discussed.

[meng87arbitrarily] 
Teresa H.Y. Meng and David G. Messerschmitt.
Arbitrarily high sampling rate adaptive filters.
IEEE Transactions on Acoustics, Speech, and Signal Processing,
35(4):455470, April 1987.
[ bib 
.pdf ]
Adaptive filters have an inherent feedback from error signal back to the adaptation of the coefficients. This represents a problem in high sampling rate realizations. We demonstrate a realization of adaptive filters for which there is no theoretical limit on sampling rate in a given speed of hardware, at the expense of additional hardware and latency. Our realization does not change the inputoutput characteristics aside from finite precision effects, and hence does not degrade the filter tracking capability. The basis for our realization is the adaptive lattice filter, which uses only local feedback in adapting reflection coefficients at each stage, and for which the recursive portion of the adaptation algorithms in each stage is linear. Our realization is based on these two properties and the basic technique of lookahead computation. Several forms of our realization applying to different recursive leastsquares and stochasticgradient adaptation algorithms are described.

[miller85implementation] 
T.K. Miller and S.T. Alexander.
An implementation of the lms adaptive filter using an simd
multiprocessor ring architecture.
In International Conference on Acoustics, Speech, and Signal
Processing (ICASSP), volume 10, pages 16051608, April 1985.
[ bib 
.pdf ]
A new architecture for a SingleInstruction Multiple Data (SIMD) implementation of the LMS adaptive algorithm is investigated. This is denoted as a ring architecture, due to its physical configuration, and it effectively solves the latency problem often associated with prediction error feedback in adaptive filters. The multiprocessor ring efficiently updates the filter input vector by operating as a pipeline structure, while behaving as a parallel structure in computing the filter output and applying the weight adaptation algorithm. Lastly, individual processor timing and capacity considerations are examined.

[miller86simd] 
T.K. Miller, S.T. Alexander, and L.J. Faber.
An simd multiplicessor ring architecture for the lms adaptive
algorithm.
IEEE Transactions on Communications, 34(1):8992, January
1986.
[ bib 
.pdf ]
A new architecture for a single instruction stream, multiple data stream (SIMD) implementation of the LMS adaptive algorithm is investigated. This is denoted as a ring architecture, due to its physical configuration, and it effectively solves the latency problem often associated with prediction error feedback in adaptive filters. The multiprocessor ring efficiently updates the filter input vector by operating as a pipeline structure, while behaving as a parallel structure in computing the filter output and applying the weight adaptation algorithm. Last, individual processor timing and capacity considerations are examined.

[mitra74digital] 
S.K. Mitra and K. Hirano.
Digital allpass networks.
IEEE Transactions on Circuits and Systems, CAS21(5), September
1974.
[ bib 
.pdf ]
A systematic method is outlined to realize an mthorder allpass digital transfer function using only m multipliers as a cascade of firstorder and/or secondorder allpass sections. The realization is based on the multiplier extraction approach in which the nthorder filter section is considered as a digital (n + 1)pair of which n pairs of input and output terminal variables are constrained by n multipliers. The transfer matrix parameters of the digital (n + 1)pair, containing only delays and adders, are first identified from which the realization is obtained by inspection. Both canonic and noncanonic realizations are derived. All realizations are then compared with regard to the effect of multiplication roundoff and hardware requirements.

[mitra78block] 
Sanjit K. Mitra and R. Gnanasekaran.
Block implementation of recursive digital filtersnew structures and
properties.
IEEE Transactions on Circuits and Systems, CAS25(4), April
1978.
[ bib 
.pdf ]
Several new structures for the block implementation of IIR digital filters are proposed. The relation between the pole locations of the block structure to that of the original scalar transfer function is derived. A method to obtain the scalar transfer function from a given block structure is described.

[mou87fast] 
Z.J. Mou and P. Duhamel.
Fast fir filtering: algorithms and implementations.
Signal Processing, 13(4):377384, December 1987.
[ bib ]
We first establish through a simple example a new fast FIR filtering algorithm based on a divideandconquer approach. This algorithm does not require the use of overlap techniques as is usual in the approaches based on cyclic or aperiodic convolutions. We outline the advantages of the proposed algorithm when implemented both in software and in hardware. Finally, we give a systematic way of deriving these algorithms.

[moyer76efficient] 
Alan L. Moyer.
An efficient parallel algorithm for digital iir filters.
In Proceedings of the IEEE International Conference on
Acoustics, Speech, and Signal Processing (ICASSP '76), volume 1, pages
525528, April 1976.
[ bib 
.pdf ]
A parallel algorithm is presented which speeds up the operation of arbitrary infinite impulse response (IIR) digital filters by an integer factor n. When n is a power of two or highly composite a special numerator factorization significantly reduces the number of multipliers required for large n. The resulting filter can operate at a sampling rate exceeding the inverse of a multiply time. Hence, such filters are not multiplier speed limited. The IIR algorithm is shown to share the desirable property of the FIR algorithm that when resampling is done at the output, part of the filter hardware can be eliminated. The IIR parallel algorithm implements a degenerate filter form. The implications of this on practical filter design are considered and a design example given. It is shown that implementations of these parallel filters can exhibit reduced sensitivity allowing a reduction in the multiplier constant length.

[nguyen99exploiting] 
Huy Nguyen and Lizy Kurian John.
Exploiting simd parallelism in dsp and multimedia algorithms using
the altivec technology.
In Proceedings of the 13th international conference on
Supercomputing (ICS'99), pages 1120. ACM Press, 1999.
[ bib 
http ]
AltiVec technology is Motorola's highperformance vector parallel processing extension to the PowerPC RISC microprocessor. It is designed to improve the performance of algorithms and applications that can exploit data parallelism such as those in digital signal processing (DSP) and multimedia. In this paper, we investigate the behavior of the AltiVec technology on a set of common DSP and multimedia algorithms. These algorithms include digital filters, fast Fourier transforms, inverse discrete cosine transforms, and vector arithmetic. Each algorithm has one nonAltiVec version, and one version implemented with the AltiVec instruction set by using the AltiVec programming model. The AltiVec version of the algorithms is evaluated using an AltiVec emulator. Traces for both versions are obtained by using an AltiVecenabled trace generator, and tracedriven simulation is performed by using a cycleaccurate performance simulator. The observed speedup for the AltiVec version of DSP and multimedia algorithms ranges from a factor of 1.60 up to 11.66, and the number of dynamic instructions is reduced by a factor of 1.82 up to 10.25. In addition to quantifying the speedup, we also perform detailed instruction level analysis, which helps to understand issues that become significant while utilizing AltiVec to exploit SIMD parallelism. Efficient utilization of a vector extension such as AltiVec currently requires significant programming effort.

[nikara04multiple] 
Jari Nikara, Stamatis Vassiliadis, Jarmo Takala, and Petri Liuha.
Multiplesymbol parallel decoding for variable length codes.
IEEE Transactions on Very Large Scale Integration (VLSI)
Systems, 12(7):676685, July 2004.
[ bib 
.pdf ]
In this paper, a multiplesymbol parallel variable length decoding (VLD) scheme is introduced. The scheme is capable of decoding all the codewords in an Nbit block of encoded input data stream. The proposed method partially breaks the recursive dependency related to the VLD. First, all possible codewords in the block are detected in parallel and lengths are returned. The procedure results redundant number of codeword lengths from which incorrect values are removed by recursive selection. Next, the index for each symbol corresponding the detected codeword is generated from the length determining the page and the partial codeword defining the offset in symbol table. The symbol lookup can be performed independently from symbol table. Finally, the sum of the valid codeword lengths is provided to an external shifter aligning the encoded input stream for a new decoding cycle. In order to prove feasibility and determine the limiting factors of our proposal, the variable length decoder has been implemented on an fieldprogrammable gatearray (FPGA) technology. When applied to MPEG2 standard benchmark scenes, on average 4.8 codewords are decoded per cycle resulting in the throughput of 106 million symbols per second.

[nikias84fast] 
Chrysostomos L. Nikias.
Fast block data processing via a new iir digital filter structure.
IEEE Transactions on Acoustics, Speech, and Signal Processing,
ASSP32(4), August 1984.
[ bib 
.pdf ]
A new structure for realizing IIR digital filters is introduced based on the idea of processing sequences by blocks. It is shown that this structure exhibits high inherent parallelism that is ideally suited for VLSI implementation or multimicroprocessor systems. This enables highspeed processing with the number of multiplies and additions per output sample much less than those of the blockstate or canonical filter realizations. The roundoff noise level in the output of the new filter structure is derived and compared to the noise levels of the blockstate and canonical forms. Further, it is shown that the new filter structure will present no scaling problems if the canonical filter is scaled. Finally, the extension of the new filter structure to the realization of periodically timevarying digital filters is also presented.

[nishikawa01pipeline] 
Kiyoshi Nishikawa and Hitoshi Kiya.
Pipeline implementation of gradienttype adaptive filters.
Electronics and Communications in Japan (Part 3: Fundamental
Electronic Science), 84(5):3342, January 2001.
[ bib 
http ]
This paper describes pipeline realization of a gradienttype adaptive filter on the algorithmic level. By pipeline realization, the longest path (critical path) of the filter can be reduced so that the throughput characteristic can be improved. In addition, the reduction of the critical path relaxes the restriction on the operators to be used and the power consumption can be decreased. Evaluation criteria of the pipeline realization are throughput, convergence, latency, and computational complexity. Using the delayed LMS adaptive filter as one of the gradienttype adaptive filters, it is shown that pipeline realization is possible that not only has a high throughput but also convergence and latency identical to those in the LMS adaptive filter taking into account the hardware complexity. (c) 2001 Scripta Technica, Electron Comm Jpn Pt 3, 84(5): 3342, 2001

[oneil01retiming] 
Timothy W. O'Neil and Edwin H.M. Sha.
Retiming synchrnous dataflow graphs to reduce execution time.
IEEE Transactions on Signal Processing, 49(10):23972407,
October 2001.
[ bib 
.pdf ]
Many common iterative or recursive DSP applications be can represented by synchronous dataflow graphs (SDFGs). A great deal of research has been done attempting to optimize such applications through retiming. However, despite its proven effectiveness in transforming singlerate dataflow graphs to equivalent DFGs with smaller clock periods, the use of retiming for attempting to reduce the execution time of synchronous DFGs has never been explored. In this paper, we do just this. We develop the basic definitions and results necessary to express and study SDFGs. We review the problems faced when attempting to retime an SDFG in order to minimize clock period and then present algorithms for doing this. Finally, we demonstrate the effectiveness of our methods on several examples.

[osorio97new] 
Roberto R. Osorio and Javiers D. Bruguera.
New arithmetic coder/decoder architectures based on pipelining.
In Proceedings of the IEEE International Conference on
ApplicationSpecific Systems, Architectures and Processors, pages 106115,
July 1997.
[ bib 
.pdf ]
In this paper we present new VLSI architectures for the arithmetic encoding and decoding of multilevel images. In these algorithms the speed is limited by their recursive natures and the arithmetic and memory access operations. They become specially critical in the case of decoding. In order to reduce the cycle length we propose working with two executions of the algorithm which alternate in the use of the pipelined hardware with a minimum increase in its cost.

[osorio04arithmetic] 
Roberto R. Osorio and Javier D. Bruguera.
Arithmetic coding architecture for H.264/AVC CABAC compression
system.
In Proceedings of the EUROMICRO Systems on Digitial System
Design, pages 6269, August 2004.
[ bib 
.pdf ]
In this paper we propose an efficient implementation of CABAC's binary arithmetic coder and context management system. CABAC is the context adaptive binary arithmetic coder used in new H.264/AVC video standard. Arithmetic coding allows a significant enhancement in compression. However, implementation complexity is a drawback due to hardware cost and slowness. In this paper we show the need for a hardware implementation of arithmetic coding in current video compression systems. We propose a fast and efficient implementation of the encoding algorithm. We prove that memory accesses constitute a bottleneck and propose solutions that apply to the encoding algorithm and context management system. As a result, a fast architecture is presented, able to process one symbol per cycle.

[padiy04signal] 
Alexander Padiy, Bin Yin, Coen Verschuren, Juil Lee, Ruud Vlutters, and Theo
Janssen.
Signal processing for 35gb on a singlelayer bluray disc.
In ODS 2004, April 2004.
[ bib ]
We report on the technical progress in increasing the recording density of optical storage systems by means of improved readchannel signal processing and writechannel optimization. The recording density increase is realized by employing PRML (Viterbi) bit detection in combination with improved timing recovery and adaptive equalization algorithms, and by using a signal quality characterization scheme which enables a proper control of the write process in the considered range of storage densities. The Bluray Disc (BD) system employing blueviolet laser with a wavelength of 405nm, objective lens with numerical aperture of 0.85 and disc cover layer thickness of 0.1 mm is used as an experimental platform in our present study.

[parhi87architecture] 
Keshab Kumar Parhi, WenLung Chen, and David G. Messerschmitt.
Architecture considerations for high speed recursive filtering.
In Proceedings of the IEEE International Symposium on Circuits
and Systems, pages 374377, May 1987.
[ bib ]
In this paper, we introduce pipeline interleaving in the incremental blockstate structure for pipelined block implementation of high sampling rate recursive digital filters. We present a new approach to the state update implementation using a novel decomposition technique. This nover state update implementation, and the incremental output computation (based on the incremental blockstate structure) lead to an efficient implementation of pipelined block recursive digital filters of (asymptotic) complexity linear in filter order (based on a quasidiagonal state update matrix) and block size. The complexity of this implementation as measured by the number of multiplications is asymptotically same as that of nonrecursive systems independent fo the implementation methodology. However, for smaller block sizes, bitlevel pipelined bitserial wordparallel implementation may lead to reduced complexity realization as compared to bitlevel pipelined bitparallel wordparallel implementation. This comparison is assisted by the concept of an implementable delay operator introduced in this paper.

[parhi87block] 
Keshab Kumar Parhi and David G. Messerschmitt.
Block digital filtering via incremental blockstate structure.
In Proceedings of the IEEE International Symposium on Circuits
and Systems, May 1987.
[ bib ]
Previous approaches on block realization of recursive digital filters have been based on blockstate and parallel blockstate representations. However, the multiplication complexity of these block structures is proportional to the square of the block size, and hence is unacceptable for very large block sizes. In this paper, we introduce a new incremental blockstate structure for block ralization of LTI recursive digital filters of multiplication complexity linear in block size. Block realization of multirate recursive digital filters via the incremental blockstate structure is also presented.

[parhi87concurrent] 
Keshab K. Parhi and David G. Messerschmitt.
Concurrent cellular vlsi adaptive filter architectures.
IEEE Transactions on circuits and systems, CAS34(10), October
1987.
[ bib 
.pdf ]
Previous approaches to highsamplingrate adaptive filter implementations have been based on wordlevel pipelined wordparallel (or block) realizations. In this paper, we show that adaptive filters can be implemented in an areaefficient manner by first using pipelining to the maximum possible extend, and then using block processing in combination with pipelining if further increase in sampling rate is needed. We show that, with the use of a decomposition technique, highspeed realizations can be achieved using pipelining with a logarithmic increase in hardware (the block realizations require a linear increase in hardware). We derive pipelined wordparallel realizations of highsamplingrate adaptive lattice filters using the techniques of lookahead computation, decomposed state update implementation, and incremental output computation. These three techniques combined make it possible to achieve asymptotically optimal complexity realizations (i.e., the same complexity asymptoctically as nonrecursive systems) of highspeed adaptive lattice filters (in both bitserial and bitparallel methodologies) and provide a system solution to highspeed adaptive filtering. The adaptive lattice filter structures are ideal for highsamplingrate implementations, since the error residuals of a particular stage are adapted orderrecursively based on those of the previous stage, and the coefficient update recursion inside each stage is linear in nature. An example of a normalized stochastic gradient adaptive lattice filter is presented, and its complexity, latency, and implementation methodology tradeoffs are studied

[parhi88pipelined] 
Keshab Kumar Parhi and David G. Messerschmitt.
Pipelined vlsi recursive filter architectures using scattered
lookahead and decomposition.
In Proceedings of the IEEE International Conference on
Acoustics, Speech and Signal Processing, April 1988.
[ bib 
.pdf ]
This paper explores various approaches to pipelining recursive digital filters. Past attempt to pipeline direct form recursive filters was based on clustered lookahead computation, which leads to a linear increase in hardware with respect to the number of loop pipeline stages. Unfortunately, the pipelined filters derived using this technique are not guaranteed to be stable. In this paper, we introduce a new lookahead approach (referred to as scattered lookahead) to pipeline recursive filters in a way that guarantees stability. We also propose a new decomposition technique to implement the nonrecursive portion (generated due to the scattered lookahead) in a decomposed manner (for cases where the number of loop pipeline stages is a power of 2) to obtain pipelined realizations of logarithmic implementation complexity with respect to the number of loop pipeline stages (as opposed to linear). Based on the scattered lookahead technique, we present fully pipelined and fully hardware efficient bidirectional linear systolic arrays and unidirectional ring arrays for implementation of high speed recursive digital filters.

[parhi89algorithm] 
Keshab K. Parhi.
Algorithm transformation techniques for concurrent processors.
Proceedings of the IEEE, 77(12):18791895, December 1989.
[ bib 
.pdf ]
Progress in supercomputing technology has led to two major trends. First, many existing algorithms need to be redesigned for efficient concurrent implementation using supercomputers. Second, a continuous increase will be apparent in the number of applicationspecific VLSI integrated circuits, which can provide the performance of supercomputers using single chips or chipsets (at the expense of design time for algorithm and architecture development). Both of these approaches require considerable effort in the development of algorithms for specific applications. Four independent algorithm transformation methodologiesprogram unfolding, retiming, lookahead algorithms, and index mapping transformationsare reviewed. These transformation techniques exploit the available parallelism in iterative dataflow programs and create additional parallelism if necessary.

[parhi89pipeline] 
Keshab K. Parhi and David G. Messerschmitt.
Pipeline interleaving and parallelism in recursive digital filters.
part i: Pipelining using scattered lookahead and decomposition.
IEEE Transactions on Acoustics, Speech, and Signal Processing,
37(7), July 1989.
[ bib 
.pdf ]
The computational latency associated with the internal recursion or feedback in recursive systems limits the opportunities to use a pipelining technique to achieve high sampling rate realizations. Pipelining recursive loops by simply inserting latches is useful for applications requiring moderate sampling rates and where multiple independent computations are available to be interleaved in the pipeline; but not where a single recursive operation needs to be performed at very high sampling rates. In this paper, we introduce a new lookahead approach (referred to as scattered lookahead) to pipeline recursive loops in a way that guarantees stability. We also propose a new decomposition technique to implement the nonrecursive portion (generated due to the scattered lookahead process) in a decomposed manner to obtain concurrent stable pipelined realizations of logarithmic implementation complexity with respect to the number of loop pipeline stages (as opposed to linear). The upper bound on the roundoff error in these pipelined filters is shown to improve with increase in the number of direct form and state space form recursive digital filters. Based on the scattered lookahead technique, we present fully pipelined and fully hardware efficient linear bidirectional systolic arrays for recursive digital filters. The decomposition technique is also extended to time varying recursive systems.

[parhi89pipeline2] 
Keshab K. Parhi and David G. Messerschmitt.
Pipeline interleaving and parallelism in recursive digital filters.
part ii: Pipelined incremental block filtering.
IEEE Transactions on Acoustics, Speech, and Signal Processing,
37(7), July 1989.
[ bib 
.pdf ]
In the companion paper, we proposed high speed pipelined realizations of recursive digital filters of logarithmic complexity with respect to the number of loop pipeline stages using scattered lookahead and decomposition techniques. In this paper, we address block implementation and finegrain pipelined block implementation of recursive digital filters. Recently, Wu and Cappello proposed a direct form block filter structure for secondorder recursive filters of complexity linear in block size. We extend this linear complexity block filter structure to higher order systems, and refer to it as incremental block filter. Block implementation of state space recursive digital filters has been known for a long time. The two existing popular block structures are the blockstate structure proposed by Barnes and Shinnake, and the parralel blockstate structure proposed by Nikias. However, the multiplication complexity of these structures is proportional to the square of the block size. The blockstate update operation in these filter structures is performed based on the clustered lookahead computation, and requires a linear complexity in block size. But, the output computation of the complete block is done all at once and requires a square complexity in block size. In this paper, we introduce a new technique of incremental output computation which requires a linear complexity in block size. Based on the clustered lookahead and incremental output computation approaches, we derive our incremental blockstate structure of block implementation of state space filters of multiplication complexity linear in block size. The incremental blockstate structure is also extended for the multirate recursive filtering case. Finally, we combine the techniques of scattered lookahead, clustered lookahead, decomposition, and incremental output computation to introduce several pipeline stages inside the recursive loop of the block filter. We derive deeply pipelined block filter structures for implementation of direct form and state space form recursive digital filters. The multiplication complexity of these pipelined block filters is linear with respect to the block size, logarithmic with respect to the number of loop pipeline stages, and the complexities due to pipelining and block processing are additive.

[parhi91highspeed] 
Keshab K. Parhi.
Highspeed huffman decoder architectures.
In Conference Record of the 25th Asilomar Conference on Signals,
Systems and Computers, volume 1, pages 6468, November 1991.
[ bib 
.pdf ]
The author presents pipelined and parallel architectures for highspeed implementation of Huffman decoders using lookahead computation techniques. Huffman decoders are used in highdefinition television, video, and other data compression systems. The achievable speed in these decoders is inherently limited due to their sequential nature of computation. The unequal code word length of the Huffman code words makes it difficult to apply lookahead. This problem is overcome by representing Huffman decoders as finite state machines which can exploit lookahead. The proposed approach is useful for highspeed Huffman decoder implementations where the number of symbols of the decoder is low.

[parhi92highspeed] 
Keshab K. Parhi.
Highspeed vlsi architectures for huffman and viterbi decoders.
IEEE Transactions on Circuits and Systems II: Analog and Digital
Signal Processing, 39(6):385391, June 1992.
[ bib 
.pdf ]
This paper presents pipelined and parallel architectures for highspeed implementation of Huffman and Viterbi decoders (both of which belong to the class of treebased decoders). Huffman decoders are used for lossless compression. The Viterbi decoder is commonly used in communications systems. The achievable speed in these decoders is inherently limited due to their sequential nature of computation. This speed limitation is overcome using our previously proposed technique of lookahead computation. The incremental computation technique is used to obtain efficient parallel (or block) implementations. The decomposition technique is exploited to reduce the hardware complexity in pipelined Viterbi decoders, but not in Huffman decoders. Logic minimization is used to reduce the hardware overhead complexity in pipelined Huffman decoders.

[parhi99vlsi]  Keshab K. Parhi. VLSI Digital Signal Processing Systems. Wiley InterScience, 1999. [ bib ] 
[park88novel] 
SeongMo Park, Winser E. Alexander, Jung H. Kim, William E. Batchelor, and
William T. Krakow.
A novel vlsi architecture for the realtime implementation of 2d
signal processing systems.
In Proceedings of the 1988 IEEE International Conference on
Computer Design: VLSI in Computers and Processors (ICCD '88), pages
582585, October 1988.
[ bib 
.pdf ]
This paper presents a high performance VLSI architecture for twodimensional (2D) digital signal processing applications. The architecture uses a specially designed digital signal processor (DSP) as a nod in a multiprocessor system for realtime or near realtime 2D signal processing. The DSP is custom designed and has a throughput of two multiplications and three additions in a single cycle. It has wavefront array processor (WAP) properties and operates asycnhronously in a multiprocessor system. This architecture extends the concepts used in onedimensional DSP chips to 2D applications and extends the concept of using a single processor to the use of a multiprocessor system.

[parker97lowarea] 
David A. Parker and Keshab K. Parhi.
Lowarea/power parallel fir digital filter implementations.
Journal of VLSI Signal Processing, 17:7592, 1997.
[ bib 
http ]
This paper presents a novel approach for implementing areaefficient parallel (block) finite impulse response (FIR) filters that require less hardware than traditional block FIR filter implementations. Parallel processing is a powerful technique because it can be used to increase the throughput of a FIR filter or reduce the power consumption of a FIR filter. However, a traditional block filter implementation causes a linear increase in the hardware cost (area) by a factor of L, the block size. In many design situations, this large hardware penalty cannot be tolerated. Therefore, it is important to design parallel FIR filter structures that require less area than traditional block FIR filtering structures. In this paper, we propose a method to design parallel FIR filter structures that require a lessthanlinear increase in the hardware cost. A novel adjacent coefficient sharing based substructure sharing technique is introduced and used to reduce the hardware cost of parallel FIR filters. A novel coefficient quantization technique, referred to as a scalable maximum absolute difference (MAD) quantization process, is introduced and used to produce quantized filters with good spectrum characteristics. By using a combination of fast FIR filtering algorithms, a novel coefficient quantization process and area reduction techniques, we show that parallel FIR filters can be implemented with up to a 45% reduction in hardware compared to traditional parallel FIR filters.

[pitas89fast] 
Ioannis Pitas.
Fast algorithms for running ordering and max/min calculation.
IEEE Transactions on Circuits and Systems, 36(6), June 1989.
[ bib 
http ]
Order statistics are used in a variety of filtering techniques (e.g. median, 03b1trimmed mean, nonlinear order statistics filtering, morphological filtering). Their computation is relatively fast, because it requires only comparisons. The author presents an algorithm that requires a significantly smaller number of comparisons and is significantly faster than the traditional approach to order statistics filtering. Also proposed are filter structures for order statistics filtering that are much faster than the known sorting structures.

[poltmann95conversion] 
Rainer D. Poltmann.
Conversion of the delayed lms algorithm into the lms algorithm.
IEEE Signal Processing Letters, 2(12), December 1995.
[ bib 
.pdf ]
For some applications of the adaptive finite impulse response (FIR) filtering, the adaptation algorithm can be implemented only with a delay in the coefficient update. It is well known that this has an adverse effect on the convergence behavior of the algorithm. In this contribution it is shown in which way the delayed LMS (DLMS) algorithm can be transformed into the standard LMS algorithm at only slightly increased computational expense.

[potkonjak92fast] 
Miodrag Potkonjak and Jan Rabaey.
Fast implementation of recursive programs using transformations.
In IEEE International Conference on Acoustics, Speech, and
Signal Processing (ICASSP), volume 5, pages 569572, 1992.
[ bib 
.pdf ]
An automatic transformational approach used to reduce the iteration bound of recursive DSP algorithms is presented. The proposed approach combines delay retiming, algebraic transformations and loop unrolling in a well defined order. The effectiveness of the approach is demonstrated using examples.

[ramsey00design] 
Norman Ramsey, Jack W. Davidson, and Mary F. Fernández.
Design principles for machinedescription languages.
www.eecs.harvard.edu/ nr/pubs/desprinabstract.html.
[ bib 
.pdf ]
We have taken a federated approach to the design of domainspecific languages that support automatic generation of systems software. These languages are intended to help generate parts of software tools that manipulate machine instructions. This paper explains the oals and principles that have governed the design and implementation of these languages. The primary design goal for languages in the confeederation is to ensure that the same machine descriptions can be used to help build a variety of different tools. Other goals include producing useful results from partial descriptions, providing mechanisms that build users' trust in descriptions, and making descriptions concise and readable. Our design and development of machinedescription languages has been guided by a dozen languagedesign principles. Using examples both from our work and from others' work, we describe and illustrate each principle. Some of the principles are well known to the programminglanguage community and have been previously described in the literature, but many of the principles were not obious from the beginning and have not been as well described in the literature. These principles have emerged from our goals and our application area. They will interest other researchers who design machinedescription languages as well as other designers of domainspecific languages.

[reed60polynomial]  I.S. Reed and G. Solomon. Polynomial codes over certain finite fields. Journal of the Society for Industrial and Applied Mathematics, 8(2):300304, June 1960. [ bib  http ] 
[rem95lecture]  Martin Rem. Lecture notes in parallel programming (2r700), 1995. [ bib ] 
[riani03equalization] 
J. Riani, J.W.M. Bergmans, S.J.L. v. Beneden, W.M.J. Coene, and A.H.J. Immink.
Equalization and target response optimization for high density
twodimensional optical storage.
In Proceedings of the 24th Symposium on Information and
communication theory in the Benelux, pages 141148, 2003.
[ bib ]
A TwoDimensional Optical Storage (TwoDOS) technology is currently under development, in which data is stored in a hexagonal bidimensional lattice, along broad spirals. This paper investigates the problem of 2D Partial Response Equalization and the design of the target response for the TwoDOS receiver. In the first part we investigate 2D Minimum Mean Square Error (MMSE) Equalization on a hexagonal lattice. An analytical expression describing the MMSE equalizer is presented. The second part gives a Bit Error Rate (BER) analysis of a general 2D Partial Response Maximum Likelihood (PRML) receiver. Performance consequences of channel noise and residual Inter Symbol Interference (ISI) are analyzed. This analysis is used to derive a measure to optimize the target response in terms of BER. Simulations were conducted to confirm the validity of the analytical results.

[rissanen79arithmetic] 
J. Rissanen and G.G. Langdon Jr.
Arithmetic coding.
IBM journal of research and development, 23(2):149162, 1979.
[ bib 
.pdf ]
The earlier introduced arithmetic coding idea has been generalized to a very broad and flexible coding technique which includes virtually all known variable rate noiseless coding techniques as special cases. An outstanding feature of this technique is that alphabet extensions are not required. A complete decodability analysis is given. The relationship of arithmetic coding to other known nonblock codes is illuminated.

[robelly04automatic] 
J.P. Robelly, G. Cichon, H. Seidel, and G. Fettweis.
Automatic code generation for simd dsp architectures: An algebraic
approach.
In Proceedings of the International Conference on Parallel
Computing in Electrical Engineering (PARELEC'04), pages 372375, September
2004.
[ bib 
.pdf ]
Driven by the ever increasing algorithm complexity on the field of mobile communications systems, SIMD DSP architectures have emerged as an approach that offers the necessary processing power at reasonable levels of die size and power consumption. However, this kind of DSP architectures imposes new challenges for programmers, since algorithms have to be designed to exploit the available parallelism on the processor. Taking as a starting point an algebraic framework that captures the SIMD computational model, we report in this paper about our efforts to design and automatically generate object code for our family of DSP architectures independent of the available SIMD parallelism. We show how these algebraic structures can be used as a high level programming language that offers a unified approach to design and describe algorithms using SIMD parallelism. Moreover, we show how these algebraic structures offer concise rules for the automatic code generation.

[robelly04implementation] 
J.P. Robelly, G. Cichon, H. Seidel, and G. Fettweis.
Implementation of recursive digital filters into vector simd dsp
architectures.
In Proceedings of the International Conference on Acoustics,
Speech and Signal Processing (ICASSP) 2004, May 2004.
[ bib 
.pdf ]
Recently, digital signal processors featuring vector SIMD instructions have gained renewed attention, since they offer the potential to speed up the computation of digital signal processing algorithms. However, when implementing recursive algorithms the maximum achievable speed up factors are upper bounded. In this paper we investigate these performace limitations when pure recursive filters are implemented into SIMD DSP architectures. We also show that the number of additional vector operations introduced by the transformation grows linearly with the level of parallelism and that it does not depend on the recursion order. These results enable the achievement of important speed up factors even for low order recursions. Moreover, we introduce a suitable algebraic notation of the block formulation of the recursive filter, which reveals the processor instructions required to implement the algorithm into the SIMD DSP.

[rudbergXXhigh]  Mikael Karlsson Rudberg and Lars Wanhammar. High speed pipelined parallel huffman decoding. The article was only available online and has been taken off the website. Google might still contain a html version in its cache though. [ bib  http ] 
[sakai86parallel] 
Hideaki Sakai.
A parallel leastsquares linear prediction method based on the
circular lattice filter.
IEEE Transactions on Acoustics, Speech, and Signal Processing,
34(3), June 1986.
[ bib 
.pdf ]
In this correspondence, we propose a parallel leastsquares method for linear prediction and estimation based on the circular lattice filter. Instead of minimizing the total sum of prediction errors, the method proceeds by dividing the sum into several parts and then minimizing each part in parallel, and then averaging the resulting estimates to form a final estimate. It is shown that this estimator is asymptotically efficient in the same way as the traditional leastsquares estimator.

[saramaki85design] 
Tapio Saramäki.
On the design of digital filters as a sum of two allpass filters.
IEEE Transactions on Circuits and Systems, CAS32(11), November
1985.
[ bib 
.pdf ]
The necessary and sufficient conditions are given for a digital filter transfer function to be implementable as a sum of two allpass filters. The conditions are derived directly in the zplane. The class of filters satisfying these conditions is shown to be wider than the class of filters obtained via the bilinear transformation from the corresponding conventional analog filters. An example shows that the given conditions enable us to design complementary filter pairs with different numerator and denominator orders directly using magnitude squared functions. These filters compare favorably with the corresponding classical filters.

[schaffer03recursive] 
Rainer Schaffer, Michael Hosemann, Renate Merker, and Gerhard Fettweis.
Recursive filtering on simd architectures.
In IEEE Workshop on Signal Processing Systems, Design and
Implementation (SIPS 2003), August 2003.
[ bib 
.pdf ]
Recursive filters are used frequently in digital signal processing. They can be implemented in dedicated hardware or in software on a digital signal processor (DSP). Software solutions often are preferable for their speed of implementation and flexibility. However, contermporary DSPs are mostly not fast enough to perform filtering for high datarates or large filters. A method to increase the computational power of a DSP without sacrificing efficiency is to use multiple processor elements controlled by the singleinstruction multipledata (SIMD) paradigm. The parallelization of recursive algorithms is difficult, because of the data dependencies. We are using design methods for parallel processor arrays to realize implementations that can be used on a prallel DSP. Further, we are focusing on the partitioning of the algorithm so that the realization can be used for different architectures. Consequences for the architecture are considered, too.

[sekanina04evolutionary] 
Lukáš Sekanina.
Evolutionary design space exploration for median circuits.
Lecture notes in computer science, 3005:240249, 2004.
[ bib 
.pdf ]
This paper shows that it is possible to (1) discover novel implementations of median circuits using evolutionary techniques and (2) find out suitable median circuits in case that only limited resources are available for their implementation. These problems are approached using Cartesian genetic programming and an ordinary compare?swap encoding. Combining the proposed approaches a method is demonstrated for effective exploration of the design space of median circuits under various constraints.

[setz96designing]  R.B.W. Setz. Designing parallel programs for linear block computations. Master's thesis, Eindhoven University of Technology, June 1996. [ bib ] 
[shah02network] 
Niraj Shah and Kurt Keutzer.
Network processors: Origin of species.
In Proceedings of ISCIS XVI, The Seventeenth International
Symposium on Computer and Information Sciences, 2002.
[ bib 
.pdf ]
Numerous programmable alternatives to network processing have emerged in the past few years to meet the current and future needs of network equipment. They all promise various tradeoffs between performance and flexibility. In this paper we attempt to understand these new network processing alternatives. We present five major aspects of network processor architectures: approaches to parallel processing, elements of specialpurpose hardware, structure of memory architectures, types of onchip communication mechanisms, and use of peripherals. For each of these aspects, we include examples of specific network processor features.

[shanbhag93relaxed] 
Naresh R. Shanbhag and Keshab K. Parhi.
Relaxed lookahead pipelined lms adaptive filters and their
application to adpcm coder.
IEEE Transactions on Circuits and Systems II: Analog and Digital
Signal Processing, 40(12):753766, December 1993.
[ bib 
.pdf ]
Relaxed lookahead is presented as an attractive technique for pipelining adaptive filters. Unlike conventional lookahead, relaxed lookahead does not attempt to maintain the inputoutput (I/O) mapping between the serial and pipelined architectures but preserves the adaptation characteristics, resulting in a small hardware overhead. Relaxed lookahead is employed to develop finegrained pipelined architectures for leastmeansquare (LMS) adaptive filtering. The proposed architecture achieves the desired speedup, with little or no degradation in the convergence behavior. Simulation results verifying the convergence analysis results for the pipelined LMS filter are presented. The filter is then employed to develop a highspeed adaptive differential pulsecodemodulation (ADPCM) codec. The new architecture has a negligible hardware overhead which is independent of the number of quantizer levels, the predictor order and the pipelining level. Additionally, the pipelined codec has a much lower output latency than the level of pipelining. Simulations with image data indicate that speedups of up to 43 can be achieved with less than 1dB loss in SNR.

[shanbhag95pipelined] 
Naresh R. Shanbhag and Keshab K. Parhi.
Pipelined adaptive dfe architectures using relaxed lookahead.
IEEE Transactions on Signal Processing, 43(6):13681385, June
1995.
[ bib 
.pdf ]
Finegrain pipelined adaptive decisionfeedback equalizer (ADFE) architectures are developed using the relaxed lookahead technique. This technique, which is an approximation to the conventional lookahead computation, maintains functionality of the algorithm rather than the inputoutput behavior. Thus, it results in substantial hardware savings as compared to either parallel processing or lookahead techniques. Pipelining of the decision feedback loop and the adaptation loop is achieved by the use of delay relaxation and sum relaxation. Both the conventional and the predictor form of ADFE have been pipelined. Results of the convergence analysis of the proposed algorithms are also provided. The performance of the pipelined algorithms for the equalization of a magnetic recording channel is studied. It is shown that the conventional ADFE results in an SNR loss of about 0.6 dB per unit increase in the speedup factor. The predictor form of ADFE is much more robust and results in less than 0.1 dB SNR loss per unit increase in the speedup factor. Speedups of up to 8 and 45 have been demonstrated for the conventional and predictor forms of ADFE.

[shanbhag96pipelined] 
Naresh R. Shanbhag and GiHong Im.
Pipelined adaptive iir filter architectures using scattered and
relaxed lookahead transformations.
IEEE Transactions on Signal Processing, 44(7), July 1996.
[ bib 
.pdf ]
Finegrain pipelined architectures for adaptive infinite impulse response (AIIR) filters are presented in this paper. The AIIR filters are equation error based. The proposed architectures are developed by employing a combination of scattered lookahead and relaxed lookahead pipelining techniques. First, a pipelined system identification scenario is developed. Then, the scattered lookahead technique is applied to the nonadaptive (but timevarying) recursive section. Finally, a relaxed lookahead technique is applied to the adaptive blocks. Convergence analysis of the pipelined architectures is presented and verified via simulations. It is shown that speedups of up to 8 and more can be achieved with marginal or no degradation in performance.

[shanbhag97lowpower] 
Naresh R. Shanbhag and Manish Goel.
Lowpower adaptive filter architectures and their application to
51.84 mb/s atmlan.
IEEE Transactions on Signal Processing, 45(5):12761290, May
1997.
[ bib 
.pdf ]
In this paper, we present lowpower and highspeed algorithms and architectures for complex adaptive filters. These architectures have been derived via the application of algebraic and algorithm transformations. The strength reduction transformation is applied at the algorithmic level as opposed to the traditional application at the architectural level. This results in a power reduction by 21% as compared with the traditional crosscoupled structure. A finegrained pipelined architecture for the strengthreduced algorithm is then developed via the relaxed lookahead transformation. This technique, which is an approximation of the conventional lookahead computation, maintains the functionality of the algorithm rather than the inputoutput behavior. Convergence analysis of the proposed architecture is presented and supported via simulation results. The pipelined architecture allows highspeed operation with negligible hardware overhead. It also enables an additional power savings of 39 to 69% when combined with powersupply reduction. Thus, an overall power reduction ranging from 6090% over the traditional crosscoupled architecture is achieved. The proposed architecture is then employed as a receive equalizer in a communication system for a data rate of 51.84 Mb/s over 100 m of UTP3 wiring in an ATMLAN environment. Simulation results indicate that speedups of up to 156 can be achieved with about a 0.8dB loss in performance.

[shirai95design] 
Katsuhiko Shirai and Jin Hiwatashi.
A design system for special purpose processors based on architectures
for distributed processing.
In Proceedings of the conference on European design automation,
pages 380385. IEEE Computer Society Press, 1995.
[ bib 
.pdf ]
In recent years, although design systems using highlevel synthesis are becoming practicable, yet they are not powerful enough to design the digital signal processing (DSP) applications which are needed in realtime processing. In this paper, we introduce the system synthesis based on the assumption that a behavioral description is mapped to an architecture for distributed processing, and design several DSP algorithms which include the Forward algorithm for HMM (Hidden Markov Model) speech recognition. We also evaluate execution time and hardware resources of those applications, and indicate that the new design methodology is suitable for those DSP algorithms.

[shrimali93highspeed] 
Gireesh Shrimali and Keshab K. Parhi.
Highspeed arithmetic coder/decoder architectures.
In IEEE Conference on Acoustics, Speech and Signal Processing,
volume 1, pages 361364, April 1993.
[ bib 
.pdf ]
The state of art in data compression is arithmetic coding, not the better known Huffman method. To a unuique data string, arithmetic coding technique assigns a unique fractional value on the number line between 0 and 1. The speed of this algorithm is limited because of its inherent recursive nature. In this paper we present the design of fast decoders using a novel interval tree search method. The decoder can be modeled as a FSM (finite state machine), enabling the application of lookahead technique to achieve higher speeds. Lookahead approach leads to slight degradation in performance (in terms of the adder/subtractor delay in the coder/decoder due to increased word lengths). We improve the performance of the decoder by using redundant arithmetic. The tree search method combined with redundant arithmetic and lookahead leads to desired speedups without any degradation in performance.

[sidhu01fast] 
Reetinder Sidhu and Viktor K. Prasanna.
Fast regular expression matching using fpgas.
In IEEE Symposium on FieldProgrammable Custom Computing
Machines, April 2001.
[ bib 
.html ]
This paper presents an efficient method for finding matches to a given regular expression in given text using FPGAs. To match a regular expression of length n, a serial machine requires O(2^{n}) memory and takes O(1) time per text character. The proposed approach requires only O(n^{2}) space and still processes a text character in O(1) time (one clock cycle). The improvement is due to the Nondeterministic Finite Automation (NFA) used to perform the matching. As far as the authors are aware, this is the first practical use of a nondeterministic state machine on programmable logic. Furthermore, the paper presents a simple, fast algorithm that quickly constructs the NFA for the given regular expression. Fast NFA construction is crucial because the NFA structure depends on the regular expression, which is known only at runtime. Implementations of the algorithm for conventional FPGAs and the SelfReconfigurable Gate Array (SRGA) are described. To evaluate performance, the NFA logic was mapped onto the Virtex XCV100 FPGA and the SRGA. Also, the performance of GNU grep for matching regular expressions was evaluated on an 800 Mhz Pentium III machine. The proposed approach was faster than best case grep performance in most cases. It was orders of magnitude faster than worst case grep performance. Logic for the largest NFA considered fit in less than a 1000 CLBs while DFA storage for grep in the worst case consumed a few hundred megabytes.

[siegel79interconnection] 
Howard Jay Siegel.
Interconnection networks for simd machines.
IEEE Computer, 12(6):5765, 1979.
[ bib 
.pdf ]
Many SIMD interconnection networks have been proposed. To put the different approaches into perspective, this analysis compares a number of single and multistage networks.

[siegel84highly] 
Leah Jamieson Siegel.
Highly parallel architectures and algorithms for speech analysis.
In IEEE International conference on Acoustics, Speech, and
Signal Processing (ICASSP'84), volume 9, pages 331334, March 1984.
[ bib 
.pdf ]
Highly parallel computer architectures and algorithms for speech analysis operations are surveyed. The classes of architectures considered include SIMD, skewedSIMD, MIMD, and data flow machines, associative processors, and pipelined systolic and wavefront arrays. Parallel algorithms for digital filtering, FFTs, and linear predictive coding are summarized. System requirements are analyzed in terms of number and complexity of processors, memory requirements, and nature and complexity of interprocessor communications.

[soderstrand95new] 
Michael A. Soderstrand, Antonio E. de la Serna, and Herschel H. Loomis Jr.
New approach to clustered lookahead pipelined iir digital filters.
IEEE Transactions on Circuits and Systems II: Analog and Digital
Signal Processing, 42(4), April 1995.
[ bib 
.pdf ]
Authors Lim and Liu recently introduced the minimum order augmentation technique for pipelining IIR digital filters which guarantees the addition of the least number of superfluous poles to obtain a stable pipelined IIR filter while maintaining a clustered lookahead pipeline structure. Unfortunately, this minimization of the superfluous poles comes at the expense of adding additional denominator multipliers. In this paper, we introduce a minimum order clustered lookahead that achieves the minimum number of superfluous poles possible while minimizing the total number of multipliers for a clustered lookahead pipeline structure. We show that while our new technique does in some instances require more superfluous poles, the increase in hardware complexity with respect to incremental augmentation is lower when compared to the Lim and Liu approach. A MATLAB computer program is described which allows the design of any order pipelined IIR filter. Examples demonstrate that stable clustered lookahead pipelined IIR filters can be designed with the minimum number of superfluous poles (as achieved by Lim and Liu), but with fewer denominator multipliers thus reducing significantly the computational complexity in many cases. Furthermore, an analytic solution to the secondorder case gives a very practical approach to pipelined IIR filter design with great insight into the stability characteristics of pipelined IIR filters.

[sung86efficient] 
Wonyong Sung and Sanjit K. Mitra.
Efficient multiprocessor implementation of recursive digital
filters.
In Proceedings of the 2003 International Symposium on Circuits
and Systems (ISCAS '03), volume 11, pages 257260, April 1986.
[ bib 
.pdf ]
Efficient computation method for the recursive digital filtering is studied in the multiprocessor environment. The method solves the dependency problem by seperate computations of the particular and transient solutions. The throughput of the algorithm increases linearly with the number of processors, making it possible to increase the throughput effectively by using multiple number of processors. The implementations of the algorithm using a vectorprocessor and a multiprocessor in a ring network are also studied

[voelcker70digital] 
Herbert B. Voelcker and E. Eugene Hartquist.
Digital filtering via block recursion.
IEEE Transactions on Audio and Electroacoustics, AU18(2), June
1970.
[ bib 
.pdf ]
Digital filters which have poles in their transfer functions are usually implemented by various direct feedback arrangements. Such filters can also be synthesized in a form amenable to implementation via the FFT. The key feature of this procedure is block feedback through a special finiteresponse filter. Although the procedure appears to offer few immediate practical advantages, its asymptotic properties are interesting. It also provides a useful conceptual link between recursive and nonrecursive filtering techniques.

[wei95parallel] 
Belle W.Y. Wei and Teresa H. Meng.
A parallel decoder of programmable huffman codes.
IEEE Transactions on Circuits and Systems for Video Technology,
5(2):175178, April 1995.
[ bib 
.pdf ]
Huffman coding, a variablelength entropy coding scheme, is an integral component of international standards on image and video compression including highdefinition television (HDTV). The highbandwidth HDTV systems of data rate in excess of 100 Mpixels/s presents a challenge for designing a fast and economic circuit for intrinsically sequential Huffman decoding operations. This paper presents an algorithm and a circuit implementation for parallel decoding of programmable Huffman codes by using the numerical properties of Huffman codes. The 1.2 µm CMOS implementation for a single JPEG AC table of 256 codewords of up to 16b codeword lengths is estimated to run at 10 MHz with a chip area of 11 mm2, decoding one codeword per cycle. The design can be pipelined to deliver a throughput of 80 MHz for decoding input streams of consecutive Huffman codes. Furthermore, our programmable scheme can be easily integrated into data paths of video processors to support different Huffman tables used in image/video applications.

[wiegand03overview] 
Thomas Wiegand, Gary J. Sullivan, Gisle Bjöntegaard, and Ajay Luthra.
Overview of the h.264/avc video coding standard.
IEEE Transactions on Circuits and Systems for Video Technology,
13(7):560576, July 2003.
[ bib 
http ]
H.264/AVC is newest video coding standard of the ITUT Video Coding Experts Group and the ISO/IEC Moving Picture Experts Group. The main goals of the H.264/AVC standardization effort have been enhanced compression performance and provision of a networkfriendly video representation addressing conversational (video telephony) and nonconversational (storage, broadcast, or streaming) applications. H.264/AVC has achieved a significant improvement in ratedistortion efficiency relative to existing standards. This article provides an overview of the technical features of H.264/AVC, describes profiles and applications for the standard, and outlines the history of the standardization process.

[witten87arithmetic] 
Ian H. Witten, Radford M. Neal, and John G. Cleary.
Arithmetic coding for data compression.
Communications of the ACM, 30(6):520540, June 1987.
[ bib 
http ]
The state of the art in data compression is arithmetic coding, not the betterknown Huffman method. Arithmetic coding gives greater compression, is faster for adaptive models, and clearly separates the model from the channel encoding.

[wolfe88flexible] 
Andrew Wolfe and John P. Shen.
Flexible processors: A promising applicationspecific processor
design approach.
In Proceedings of the 21st annual workshop on Microprogramming
and microarchitecture, pages 3039, 1988.
[ bib 
http ]
A new approach to application specific processor design is presented in this paper. Existing application specific processors are either based on existing general purpose processors or custom designed special purpose processors. The availability of a new technology, the Xilinx Logic Cell Array, presents the opportunity for a new alternative. The Flexible Processor Cell is a prototype of an extremely confgurable application specific processor. Flexible processors can potentially provide the performance advantages of special purpose processors as well as the cost advantages of general purpose processors. The flexible processor concept opens many potential areas for future research in processor architecture and implementation. This paper presents the design, implementation, and preliminary performance evaluation of an experimental flexible processor.

[wu87computeraided] 
ChengWen Wu and Peter R. Capello.
Computeraided design of vlsi secondorder sections.
In Proceedings of the IEEE International Converence on
Acoustics, Speech and Signal Processing (ICASSP '87), volume 12, pages
19071910, April 1987.
[ bib 
.pdf ]
A CAD tool is presented for producing very high throughput IIR filters. The architecture is a cascade of wordparallel, blocked, secondorder sections. Because it is applicationspecific, it is a very high level CAD tool. An engineer only needs to specify 1) the word size W, 2) the block size B; and 3) each secondorder section's coefficients. Using this information, the CAD tool will generate CIF files for a filter system that, operating at 10Mhz, can process 5B/W million samples per second. Our purpose is to illustrate the benefits of applying both bitlevel array architecture and applicationspecific CAD to the problem of IIR filtering. The resulting CAD system reduces the costs of very high throughput IIR filters with respect to design, fabrication, and operation.

[wu93new] 
ChenMie Wu, Mohan Vishwanath, Robert M. Owens, and Mary Jane Irwin.
A new blocked iir algorithm.
In Proceedings of ICASSP93, pages 113116, April 1993.
[ bib ]
In this paper a new IIR algorithm is presented. It is based on a block implementation method for IIR filters. This new algorithm has two stages. First, a circular convolution is used to transform the input into intermediate results. Second, a correction circuit transforms the intermediate result into the correct output. Because the correction circuit uses part of the previous output to correct the current intermediate result, our approach is an overlapsave based algorithm. The multiplacitive complexity of this algorithm is shown to be 2 log kn + 2 log n + 5 real multiplications per output point, for a block of size kn, where n is the order of the filter and k >= 2, is a constant. This is less than the counts for other known algorithms. Another advantage of this algorithm is that it can be easily implemented in an optimal manner in hardware.

[wu96mimd] 
MinYou Wu and Wei Shu.
Mimd programs on simd architectures.
In The 6th symposium on the Frontiers of Massively Parallel
Computation, pages 162170, October 1996.
[ bib 
.pdf ]
Can MIMD programs execute on SIMD architectures? To attack this difficult problem various methods have been developed to fill the gap between MIMD applications and SIMD architectures.

[youssef98parallel] 
Abdou Youssef.
Parallel algorithms for entropycoding techniques.
NISTIR 6113, December 1998.
[ bib 
.html ]
With the explosion of imaging applications, and due to the massive amounts of imagery data, data compression is essential. Lossless compression, also called entropy coding, is of special importance because not only it serves as a standalone system for certain applications such as medical imaging, it also is an inherent part of lossy compression. Therefore, fast entropy codingdecoding algorithms are desirable. In this paper we will develop parallel algorithms for several widely used entropy coding techniques, namely, arithmetic coding, runlength encoding (RLE), and Huffman coding. Our parallel arithmetic coding algorihtm takes O(log^{2} N) time on an Nprocessor hypercube, where N is the input size. For RLE, our parallel coding and decoding algorithms take O(logN) time on an Nprocessors computer. Finally, in the case of Huffman coding, the parallel coding algorithm takes O(log^{2} N + n logn), where n is the alphabet size, n << N. As for decoding, both arithmetic and Huffman are hard to parallelize. However, special provisions could be made in many applications to make arithmetic and Huffman decoding fairly parallel.

[zeman81fast] 
Jan Zeman and Allen G. Lindgren.
Fast digital filters with low roundoff noise.
IEEE Transactions on Circuits and Systems, CAS28(7), July
1981.
[ bib 
.pdf ]
A novel approach to the realization of fast and efficient IIR digital filters is presented. The realization is based on recently introduced statespace structures and retains and enhances their low noise and sensitivity properties while reducing the number of required multiplications. This permits the implementation of higher order optimal forms requiring only 1.34 to 1.65 times the number of multiplications used in the direct form (well known for its poor coefficient sensitivity and high output noise levels). The high inherent parallelism of the statespace structure is also further enhanced by the proposed fast digital filter (FDF). The resulting blockprocessing algorithm is suitable for high speed implementation of digital filters on parallel processing systems. These systems employ fast and efficient techniques, such as distributed arithmetic or multimicroprocessor techniques, to perform the required computation of the longindependent inner products. In addition to low quantization noise, the FDF structure has other desirable properties including low sensitivity and improved limit cycle behavior. For output decimating IIR filters additional savings in the number of multiplications is achieved with the proposed structure

[zhou88twolevel] 
B.B. Zhou and R.P. Brent.
A twolevel pipelined implementation of directform recursive
filters.
Technical report, Australian National University, Canberra,
Australia, April 1988.
TRCS8806.
[ bib 
.pdf ]
A stabilized parallel algorithm for directform recursive filters is obtained using a new method of derivation in the Z domain. The algorithm is regular and modular, so very efficient VLSI architectures can be constructed to implement it. The degree of parallelism in these implementations can be chosen freely, and is not restricted to be a power of two.

[kozyrakis03overcoming] 
Christos Kozyrakis and David Patterson.
Overcoming the limitations of conventional vector processors.
In Proceedings of the 30th Annual International Symposium on
Computer Architecture (ISCA'03), pages 283293. IEEE, ACM Press, 2003.
[ bib 
http ]
Despite their superior performance for multimedia applications, vector processors have three limitations that hinder their widespread acceptance. First, the complexity and size of the centralized vector register file limits the number of functional units. Second, precise exceptions for vector instructions are difficult to implement. Third, vector processors require an expensive onchip memory system that supports high bandwidth at low access latency. We introduce CODE, a scalable vector microarchitecture that addresses these three shortcomings. It is designed around a clustered vector register file and uses a separate network for operand transfers across functional units. With extensive use of decoupling, it can hide the latency of communication across functional units and provides 26% performance improvement over a centralized organization. CODE scales efficiently to 8 functional units without requiring wide instruction issue capabilities. A renaming table makes the clustered register file transparent at the instruction set level. Renaming also enables precise exceptions for vector instructions at a performance loss of less than 5%. Finally, decoupling allows CODE to tolerate large increases in memory latency at sublinear performance degradation without using onchip caches. Thus, CODE can use economical, offchip, memory systems.

[weis06fast] 
Ben Weiss.
Fast median and bilateral filtering.
ACM Transactions on Graphics (TOG), 25(3):519526, July 2006.
[ bib 
.pdf ]
Median filtering is a cornerstone of modern image processing and is used extensively in smoothing and denoising applications. The fastest commercial implementations (e.g. in Adobe Photoshop CS2) exhibit O(r) runtime in the radius of the filter, which limits their usefulness in realtime or resolutionindependent contexts. We introduce a CPUbased, vectorizable O(log r) algorithm for median filtering, to our knowledge the most efficient yet developed. Our algorithm extends to images of any bitdepth, and can also be adapted to perform bilateral filtering. On 8bit data our median filter outperforms Photoshop's implementation by up to a factor of fifty.

[tuck88optimally] 
Russ Tuck.
An optimally portable simd programming language.
In Frontiers of Massively Parallel Computation, pages 617624.
IEEE, October 1988.
[ bib 
.pdf ]
Existing programming languages for SIMD (SingleInstruction MutlitpleData) parallel computers make implicit architectural assumptions. These limit each language to architectures satisfying its assumptions. This paper presents a theoretical foundation for developing much more portable languages for SIMD computers. It also describes work in progress on the design and implementation of such a language.<br/>An optimally portable programming language for a set of architectures is one which allows each program to specify the subset of those architectures on which it must be able to run, and which thena llows the program to exploit exactly those architectural features available on all of the target architectures. The features available on an architecture are defined to be those the architecture can implement with a constantbounded number of operations. This definition ensures reasonable execution efficiency, and identifies architectural differences which are relevant to algorithm selection.<br/>An optimally portable programming language for SIMD computers, called Porta_SIMD (portasimm'd), is being developed to demonstrate these ideas. Based on C++, it currently runs on the Connection Machine and PixelPlanes 4.

[hawver96processor] 
Dennis M. Hawver and George B. Adams III.
Processor autonomy and its effect on parallel program execution.
In Frontiers of Massively Parallel Computing, pages 144153.
IEEE, October 1996.
[ bib 
.pdf ]
Processor autonomy is the potential of a processing element in a parallel computer to act differently from other processors during execution. A new parallel architecture taxonomy is presented that includes the necessary and sufficient conditions to achieve processor autonomy. Processor autonomy is possible when multiple data address, data value, instruction address, or instruction value streams are available. Parallel program execution can be significantly aided by processor autonomy, allowing various mappings and dynamic reassignment of PEs to streams. Parallel program performance is evaluated for several sorting algorithms using one form of data address autonomy, indirect addressing, and one form of instruction value autonomy, branch selection.

[gurd88taxonomy] 
J.R. Gurd.
A taxonomy of parallel computer architectures.
In Design and Application of Parallel Digital Processors, pages
5761, April 1988.
[ bib 
.pdf ]
Discusses a framework within which it is possible to compare the overall performance of a wide variety of parallel computing systems. In this context, a computing system comprises components at several levels of abstraction, ranging from applications, through algorithms and various levels of language, to hardware implementation. The basis of the desired framework will be a series of classification schemes or taxonomies, at the various levels of abstraction, together with a set of mapping strategies that guide selection of appropriate options at each level, topdown. At present, it is not feasible to address all levels simultaneously, and so the author concentrates on the `lowest' level of abstraction (i.e. the hardware implementation)

[duncan90survey] 
Ralph Duncan.
A survey of parallel computer architectures.
IEEE Computer, 23(2):516, February 1990.
[ bib 
.pdf ]
An attempt is made to place recent architectural innovations in the broader context of parallel architecture development by surveying the fundamentals of both newer and more established parallel computer architectures and by placing these architectural alternatives in a coherent framework. The primary emphasis is on architectural constructs rather than specific parallel machines. Three categories of architecture are defined and discussed: synchronous architectures, comprising vector, SIMD (singleinstructionstream, multipledatastream) and systolic machines; MIMD (multipleinstructionstream, multipledatastream) with either distributed or shared memory; and MIMDbased paradigms, comprising MIMD/SIMD hybrid, dataflow, reduction, and wavefront types.

[seitz84concurrent] 
Charles L. Seitz.
Concurrent vlsi architectures.
IEEE Transactions on Computers, C33(12):12471265, December
1984.
[ bib 
.pdf ]
This tutorial paper addresses some of the principles and provides examples of concurrent architectures and designs that have been inspired by VLSI technology. The circuit density offered by VLSI provides the means for implementing systems with very large numbers of computing elements, while its physical characteristics provide an incentive to organize systems so that the elements are relatively loosely coupled. One class of computer architectures that evolve from this reasoning include an interesting and varied class of concurrent machines that adhere to a structural model based on the repetition of regularly connected elements. The systems included under this structural model range from 1) systems that combine storage and logic at a fine grain size, and are typically aimed at computations with images or storage retrieval, to 2) systems that combine registers and arithmetic at a medium grain size to form computational or systolic arrays for signal processing and matrix computations, to 3) arrays of instruction interpreting computers that use teamwork to perform many of the same demanding computations for which we use highperformance single process computers today.

[blas05kestrel] 
Andrea Di Blas, David M. Dahle, Mark Diekhans, Leslie Grate, Jeffrey
Hirschberg, Kevin Karplus, Hansjörg Keller, Mark Kendrick, Francisco J.
MesaMartinez, David Pease, Eric Rice, Angela Schultz, Don Speck, and Richard
Hughey.
The ucsc kestrel parallel processor.
IEEE Transactions on Parallel and Distributed Systems,
16(1):8092, January 2005.
[ bib 
.pdf ]
The architectural landscape of highperformance computing stretches from superscalar uniprocessor to explicitly parallel systems, to dedicated hardware implementations of algorithms. Singlepurpose hardware can achieve the highest performance and uniprocessors can be the most programmable. Between these extremes, programmable and reconfigurable architectures provide a wide range of choice in flexibility, programmability, computational density, and performance. The UCSC Kestrel parallel processor strives to attain singlepurpose performance while maintaining user programmability. Kestrel is a singleinstruction stream, multipledata stream (SIMD) parallel processor with a 512element linear array of 8bit processing elements. The system design focuses on efficient highthroughput DNA and protein sequence analysis, but its programmability enables high performance on computational chemistry, image processing, machine learning, and other applications. The Kestrel system has had unexpected longevity in its utility due to a careful design and analysis process. Experience with the system leads to the conclusion that programmable SIMD architectures can excel in both programmability and performance. This work presents the architecture, implementation, applications, and observations of the Kestrel project at the University of California at Santa Cruz.

[harrington06method] 
Steven J. Harrington.
Us patent #7019760: Method and apparatus for fast computation of
associative operations over fixed size regions of a digital image, March
2006.
Xerox Corporation.
[ bib ]
This invention relates to a method and appaaratus for fast computation of associative operations over fixed size regions of a digital image. More particularly, this invention relates to the application of associative operationssuch as MINIMUM, MAXIMUM, AND and ORto fixed size regions of a digital image such as hexagonal, octagonal and rectangular regions. This is accomplished by tiling the image and calculating twodimensional running calculations of the operator for each possible corner overlap configuration of a window that is run over the tiles of the image and thus used to analyze the image. Combining the appropriate running calculations for a particular window position yields the final result for that position.

[lin93parallel] 
HorngDar Lin and David G. Messerschmitt.
Parallel viterbi decoding methods for uncontrollable and controllable
sources.
IEEE Transactions on Communications, 41(1):6269, January
1993.
[ bib 
http ]
The fastest conventional Viterbi decoding method processes all states in parallel, but still decodes code symbols sequentially. In order to decode faster than this rate, parallel methods must be developed to break the processing into independent stateparallel decoding tasks. This paper discusses parallel and controllable sources. For general, uncontrollable Markov processes, we extend a previously known parallel method to a hierarchical parallel decoding approach, which achieves a lower latency. For controllable Markov sources in telecommunications applications, we propose new parallel decoding methods by controlling the source processes in appropriate ways. This paper focuses on the parallel decoding methods for controllable sources because these methods have zero processing overhead. Because the methods modify the coding processs, they bring positive changes to framing and negative changes to latency and code performance. However, we can adjust the parameters of the methods to make the degration negligible. Becuase of their low overhead, the methods are most attractive for highspeed decoders for convolutional and trellis codes, and also applicable to other sequential algorithms for suboptimal decoding and estimation of complex Markov sources.

[thapar89block] 
Hemant K. Thapar and John M. Cioffi.
A block processing method for designing highspeed viterbi detectors.
In IEEE Int. Conf. on Communications, volume 2, pages
10961100, June 1989.
[ bib 
http ]
This paper describes a block processing method for the Viterbi algorithm to circumvent the data rate dependence of conventional architectures on available integrated circuit technology. The method applies the Viterbi algorithm to a block of received symbols instead of the conventional symbolbysymbol approach. Backward and forward recursions are used to provide highly modular and layered architectures, requiring replication and suitable interconnection of basic addcompareselect modules. The data rate may be increased by adding more layers of the modules. The area/data rate tradeoff is linear with a slope proportional to the square of the number of states.

[omura68viterbi] 
Jim K. Omura.
On the viterbi decoding algorithm.
IEEE Transactions on Information Theory, 15(1):177179,
January 1969.
[ bib 
http ]
A new interpretation of the Viterbi decoding algorithm based on the statespace approach to dyamical systems is presented. In this interpretation the optimum decoder solves a generalized regulator control problem by dynamic programming techniques.

[horst07scalable]  M.G. van der Horst and R.H. Mak. Scalable parallel rank order filters. Technical report, Technische Universiteit Eindhoven, February 2007. CSReport 0705, http://alexandria.tue.nl/repository/books/627228.pdf. [ bib  .pdf ] 
[flynn66very] 
Michael J. Flynn.
Very highspeed computing systems.
In Proceedings of the IEEE, volume 12, pages 19011909,
December 1966.
[ bib 
http ]
Very highspeed computers may be classified as follows: 1) Single Instruction StreamSingle Data Stream (SISD) 2) Single Instruction StreamMultiple Data Stream (SIMD) 3) Multiple Instruction StreamSingle Data Stream (MISD) 4) Multiple Instruction StreamMultiple Data Stream (MIMD). Stream, as used here, refers to the sequence of data or instructions as seen by the machine during the execution of a program. The constituents of a system: storage, execution, and instruction handling (branching) are discussed with regard to recent developments and/or systems limitations. The constituents are discussed in terms of concurrent SISD systems (CDC 6600 series and, in particular, IBM Model 90 series), since multiple stream organizations usually do not require any more elaborate components. Representative organizations are selected from each class and the arrangement of the constituents is shown.

[oberman99amd] 
Stuart Oberman, Greg Favor, and Fred Weber.
Amd 3dnow! technology: Architecture and implementations.
IEEE Micro, 19(2):3748, April 1999.
[ bib 
.pdf ]
The AMDK62 microprocessor is the first implementation of AMD 3DNow!, a technology innovation for the x86 architecture that drives today's personal computers. 3DNow! technology is a set of 21 new instructions designed to open the traditional processing bottlenecks for floatingpointintensive and multimedia applications. Using these instructions, applications can implement more powerful solutions to create a more entertaining and productive PC platform. Examples of the type of improvements that 3DNow! technology enables are faster frame rates on highresolution scenes, better physical modeling of realworld environments, sharper and more detailed 3D imaging, smoother video playback, and near theaterquality audio. Future AMD processors such as the AMDK7, designed to operate at frequencies greater than 500 MHz, should provide even higher performance implementations of 3DNow! technology.

[peleg96mmx] 
Alex Peleg and Uri Weiser.
Mmx technology extension to the intel architecture.
IEEE Micro, 16(4):4250, August 1996.
[ bib 
.pdf ]
Designed to accelerate multimedia and communications software, MMX technology improves performance by introducing data types and instructions to the IA that exploit the parallelism in these applications. MMX technology extends the Intel architecture (IA) to improve the performance of multimedia, communications, and other numericintensive applications. It uses a SIMD (singleinstruction, multipledata) technique to exploit the parallelism inherent in many algorithms, producing full application performance of 1.5 to 2 times faster than the same applications run on the same processor without MMX. The extension also maintains full compatibility with existing IA microprocessors, operating systems, and applications while providing new instructions and data types that applications can use to achieve a higher level of performance on the host CPU.

[thakkar99internet] 
Shreekant Thakkar and Tom Huff.
The internet streaming simd extensions.
Computer, 12(12):2634, December 1999.
[ bib 
http ]
Because floatingpoint computation is the heart of 3D geometry, speeding up floatingpoint computation is vital to overall 3D performance. To produce a visually perceptible difference in graphics applications, Intel's 32bit processorsbased on the IA32 architecturerequired an increase of 1.5 to 2 times the native floatingpoint performance. One path to better performance involves studying how the system uses data. Today's 3D applications can execute a lot faster by differentiating between data used repeatedly and streaming datadata used only once and then discarded. The Pentium III's new floatingpoint extension lets programmers designate data as streaming and provides instructions that handle this data efficiently. The authors designed the Internet Streaming SIMD Extensions (ISSE) to enable a new level of visual computing on the volume PC platform. They discuss their results in terms of boosting the performance of 3D and video applications.

[kozyrakis03scalable] 
Christoforos E. Kozyrakis and David A. Patterson.
Scalable vector processors for embedded systems.
IEEE Micro, 23(6):3645, December 2003.
[ bib 
.pdf ]
For embedded applications with datalevel parallelism, a vector processor offers high performance at low power consumption and low design complexity. Unlike superscalar and VLIW designs, a vector processor is scalable and can optimally match specific application requirements.To demonstrate that vector architectures meet the requirements of embedded media processing, we evaluate the Vector IRAM, or VIRAM (pronounced VIRAM), architecture developed at UC Berkeley, using benchmarks from the Embedded Microprocessor Benchmark Consortium (EEMBC). Our evaluation covers all three components of the VIRAM architecture: the instruction set, the vectorizing compiler, and the processor microarchitecture. We show that a compiler can vectorize embedded tasks automatically without compromising code density. We also describe a prototype vector processor that outperforms highend superscalar and VLIW designs by 1.5x to 100x for media tasks, without compromising power consumption. Finally, we demonstrate that clustering and modular design techniques let a vector processor scale to tens of arithmetic data paths before wide instructionissue capabilities become necessary.

[chatterjee90scan] 
Siddharta Chatterjee, Guy E. Blelloch, and Marco Zaga.
Scan primitives for vector computers.
In Proceedings of the 1990 conference on Supercomputing, pages
666675, November 1990.
[ bib 
.pdf ]
The authors describe an optimized implementation of a set of scan (also called allprefixsums) primitives on a single processor of a CRAY YMP, and demonstrate that their use leads to greatly improved performance for several applications that cannot be vectorized with existing computer technology. The algorithm used to implement the scans is based on an algorithm for parallel computers. A set of segmented versions of these scans is only marginally more expensive than the unsegmented versions. The authors describe a radix sorting routine based on the scans that is 13 times faster than a Fortran version and within 20% of a highly optimized library sort routine, three operations on trees that are between 10 to 20 times faster than the corresponding C versions, and a connectionist learning algorithm that is 10 times faster than the corresponding C version for sparse and irregular networks.

[loomis84high] 
Herschel H. Loomis Jr. and Bhaskar Sinha.
High speed recursive digital filter realization.
Circuits, Systems and Signal Processing, 3(3):267294,
September 1984.
[ bib 
.pdf ]
Pipeline techniques have been successfully applied to speeding up processing in both general and specialpurpose digital computers. Application of these techniques to nonrecursive (FIR) filters has been suggested and is quite straightforward. Application to recursive (IIR) filters has not previously been shown. In this paper, the technique for applying pipeline techniques to recursive filters is shown and the advantages and disadvantages of the technique are discussed. Using these techniques, recursive digital filters operating at hitherto impossibly high rates can be designed.

[datar02maintaining] 
Mayur Datar, Aristides Gionis, Piotr Indyk, and Rajeev Motwani.
Maintaining stream statistics over sliding windows.
2002 Annual ACMSIAM symposium on discrete algorithms,
31(6):17941813, 2002.
[ bib 
.pdf ]
We consider the problem of maintaining aggregates and statistics over data streams,<br/>with respect to the last N data elements seen so far. We refer to this model as the sliding window<br/>model. We consider the following basic problem: Given a stream of bits, maintain a count of the<br/>number of 1?s in the last N elements seen from the stream. We show that, using O( 1 / e log2 N) bits<br/>of memory, we can estimate the number of 1?s to within a factor of 1 + e . We also give a matching<br/>lower bound of ?( 1/e log2 N) memory bits for any deterministic or randomized algorithms. We extend<br/>our scheme to maintain the sum of the last N positive integers and provide matching upper and<br/>lower bounds for this more general problem as well. We also show how to efficiently compute the Lp<br/>norms (p in [1, 2]) of vectors in the sliding window model using our techniques. Using our algorithm,<br/>one can adapt many other techniques to work for the sliding window model with a multiplicative<br/>overhead of O( 1/e logN) in memory and a 1 + e factor loss in accuracy. These include maintaining<br/>approximate histograms, hash tables, and statistics or aggregates such as sum and averages.

[arasu04resource] 
Arvind Arasu and Jennifer Widom.
Resource sharing in continous slidingwindow aggregates.
Technical report, Stanford University, 2004.
[ bib 
.pdf ]
We consider the problem of resource sharing when processing large numbers of continuous queries. We specifically address slidingwindow aggregates over data streams, an important class of continuous operators for which sharing has not been addressed. We present a suite of sharing techniques that cover a wide range of possible scenarios: different classes of aggregation functions (algebraic, distributive, holistic), different window types (timebased, tuplebased, suffix, historical), and different input models (single stream, multiple substreams). We provide precise theoretical performance guarantees for our techniques, and show their practical effectiveness through a thorough experimental study.

[fuller98motorola]  Sam Fuller. Motorola's altivec technology. White paper, January 1998. ALTIVECWP. [ bib ] 
[richardson00umts] 
K.W. Richardson.
Umts overview.
Electronics and Communication Engineering Journal,
12(3):93100, June 2000.
[ bib 
.pdf ]
The Universal Mobile Telecommunications System (UMTS) as specified by the Third Generation Partnership Project (3GPP) was formally adopted by the ITU as a member of its family of IMT2000 Third Generation Mobile Communication standards in November 1999. This paper provides some background to the UMTS standard and an overview of the system architecture. Some information about the current status of technology trials is provided as well as predictions for the services that future UMTS networks are likely to deliver to the end user.

[parker99defining] 
Dana J. Parker.
Defining dvd.
IEEE Multimedia, 6(1):8084, January 1999.
[ bib 
.pdf ]
The accepted and muchrepeated definition for DVD is digital versatile disc or digital video disc, depending on whom you ask. The words behind the letters are not the only things about DVD whose meaning depends on the respondent; you will also get a different answer regarding DVD's purpose and future depending on individual viewpoint. The author gives a brief history of DVD and considers DVDROM and DVD Video.

[gilder97fiber]  George Gilder. Fiber keeps its promise. Forbes ASAP, April 1997. [ bib  .html ] 
[gasteren90shape]  A.J.M. van Gasteren. On the shape of mathematical arguments. Springer, 1990. [ bib ] 
[carroll89theory]  John Carroll and Darrell Long. Theory of finite automata. Prentice Hall, 1989. [ bib ] 
[nilsson04accelerator] 
Anders Nilsson, Eric Tell, and Dake Liu.
An accelerator architecture for programmable multistandard baseband
processors.
In Wireless Networks and Emerging Technologies, July 2004.
[ bib 
.pdf ]
Programmability will be increasingly important in future multistandard radio systems. We are proposing an archi tecture for fully programmable baseband processing, based on a programmable DSP processor and a number of config urable accelerators which communicate via a configurable network. Acceleration of common cycleconsuming DSP jobs is necessary in order to manage wideband modula tion schemes. In this paper we investigate which jobs are suitable for acceleration in a programmable baseband proc sessor supporting a number of common Wireless LAN and 3G standards. Simulations show that with the proposed set of accelerators, our architecture can support the discussed standards, including IEEE 802.11a 54 Mbit/s wireless LAN reception, at a clock frequency not exceeding 120 MHz.

[nilsson05design]  Anders Nilsson. Design of multistandard baseband processors. Thesis No. 1173. Linköpings Studies in Science and Technology, June 2005. [ bib  .pdf ] 
[nilsson07design] 
Anders Nilsson.
Design of programmable multistandard baseband processors.
PhD thesis, Linköping University, 2007.
[ bib 
.pdf ]
Efficient programmable baseband processors are important to enable true multistandard radio platforms as convergence of mobile communication devices and systems requires multistandard processing devices. The processors do not only need the capability to handle differences in a single standard, often there is a great need to cover several completely different modulation methods such as OFDM and CDMA with the same processing device. Programmability can also be used to quickly adapt to new and updated standards within the ever changing wireless communication industry since a pure ASIC solution will not be flexible enough. ASIC solutions for multistandard baseband processing are also less area efficient than their programmable counterparts since processing resources cannot be efficiently shared between different operations. However, as baseband processing is computationally demanding, traditional DSP architectures cannot be used due to their limited computing capacity. Instead VLIW and SIMDbased processors are used to provide sufficient computing capacity for baseband applications. The drawback of VLIWbased DSPs is their low power efficiency due to the wide instructions that need to be fetched every clock cycle and their controlpath overhead. On the other hand, pure SIMDbased DSPs lack the possibility to perform different concurrent operations. Since memory access power is the dominating part of the power consumption in a processor, other alternatives should be investigated.<br/><br/>In this dissertation a new and unique type of processor architecture has been designed that instead of using the traditional architectures has started from the application requirements with efficiency in mind. The architecture is named “Single Instruction stream Multiple Tasks”, SIMT in short. The SIMT architecture uses the vector nature of most baseband programs to provide a good tradeoff between the flexibility of a VLIW processor and the processing efficiency of a SIMD processor. The contributions of this project are the design and research of key architectural components in the SIMT architecture as well as development of design methodologies. Methodologies for accelerator selection are also presented. Furthermore data dependency control and memory management are studied. Architecture and performance characteristics have also been compared between the SIMT and more traditional processor architectures.<br/><br/>A complete system is demonstrated by the BBP2 baseband processor that has been designed using SIMT technology. The SIMT principle has previously been proven in a small scale in silicon in the BBP1 processor implementing a Wireless LAN transceiver. The second demonstrator chip (BBP2) was manufactured early 2007 and implements a full scale system with multiple SIMD clusters and a controller core supporting multiple threads. It includes enough memory to run symbol processing of DVBH/T, WiMAX, IEEE 802.11a/b/g and WCDMA, and the silicon area is 11 mm2 in a 0.12 um CMOS technology.

[wu05single] 
Di Wu, Tiejun Hu, and Dake Liu.
A single scalar dsp based programmable h.264 decoder.
In Swedish SystemonChip Conference (SSoCC), 2005.
[ bib 
.pdf ]
This paper presents a feasibility study of applying a legacy reconfigurable single scalar DSP processor to media processing. The latest video compression standard  H.264 was adopted as target application. First, a pure software H.264 decoder was implemented based on the legacy DSP processor.<br/>Although realtime performance was not achieved, it exposed essential computational as well memory access costs. The second design explored the possibility of achieving both programmability programmable DSP processor and realtime performance with affordable hardware acceleration. At the same time, features of H.264 processing were analyzed for more suitable accelerator design. In order to avoid conflicts in memory access, tasks scheduling was carefully considered.<br/>In the end, performance of proposed solution and the bottlenecks which still limit the performance are presented as conclusion.

[alvarez05performance] 
Mauricio Alvarez, Esther Salami, Alex Ramirez, and Mateo Valero.
A performance characterization of high definition digital video
decoding using h.264/avc.
In Proceedings of the IEEE International Workload
Characterization Symposium, pages 2433, October 2005.
[ bib 
.pdf ]
H.264/AVC is a new international video coding standard that provides higher coding efficiency with respect to previous standards at the expense of a higher computational complexity. The complexity is even higher when H.264/AVC is used in applications with high bandwidth and high quality like high definition (HD) video decoding. In this paper, we analyze the computational requirements of H.264 decoder with a special emphasis in HD video and we compare it with previous standards and lower resolutions. The analysis was done with a SIMD optimized decoder using hardware performance monitoring. The main objective is to identify the application bottlenecks and to suggest the necessary support in the architecture for processing HD video efficiently. We have found that H.264/AVC decoding of HD video perform many more operations per frame than MPEG4 and MPEG2, has new kernels with more demanding memory access patterns and has a lot data dependent branches that are difficult to predict. In order to improve the H.264/AVC decoding process at HD it is necessary to explore a better support in media instructions, specialized prefetching techniques and possibly, the use of some kind of multiprocessor architecture.

[eilert05design] 
Johan Eilert and Dake Liu.
Design of a floating point dsp for full precision mpeg1 layer ii and
iii decoding.
In Swedish SystemonChip Conference, April 2005.
[ bib 
.pdf ]
The purpose of this work has been to design and<br/>evaluate a oating point DSP for MPEG1 layer II and III audio<br/>decoding. The intention has been to minimize the word length and<br/>memory usage while still satisfying the full precision requirements<br/>for both layer II and III as speci ed in the standard. This is<br/>compared to one of our previous works where we have only<br/>aimed for a speci c word length and the audio quality was not<br/>considered.<br/>A oating point number representation offers a better tradeoff<br/>between dynamic range and precision than a xed point<br/>representation for a given word length. Further, using a oating<br/>point representation means that smaller memories can be used<br/>which leads to smaller chip area and lower power consumption.<br/>The result is a very area and MIPS effective DSP for MPEG1<br/>audio decoding.

[semiconductor06itrsexecsum]  Semiconductor Industry. International Technology Roadmap for Semiconductors  2006 update  Overview and Working Group Summaries. Semiconductor Industry, 2006. [ bib  .pdf ] 
[lofar.org]  Lofar Consortium. Lofar. www.lofar.org. [ bib  www: ] 
[herk92fast]  Marcel van Herk. A fast algorithm for local minimum and maximum filters on rectangular and octagonal kernels. Pattern Recognition Letters, 13:517521, July 1992. [ bib ] 
[cray77cray1]  Inc. Cray Research. CRAY1 Computer System Hardware Reference Manual. Cray Research, Inc., rev. c edition, November 1977. Publication No. 2240004. [ bib  .pdf ] 
[cdc75star100]  Control Data Corporation. Control Data STAR100 Computer Harware Reference Manual. Control Data Corporation, rev. 09 edition, December 1975. Publication No. 60256000. [ bib  .pdf ] 
[gwennap98altivec]  Linley Gwennap. Altivec vectorizes powerpc. Microprocessor Report, 12(6), May 1998. [ bib ] 
[blank90maspar] 
Tom Blank.
The maspar mp1 architecture.
In IEEE Computer Society International Conference, pages
2024, March 1990.
[ bib 
.pdf ]
The MasPar MP1 architecture is described. It is a massively parallel SIMD machine with the following key characteristics: scalable architecture in terms of the number of processing elements, system memory, and system communication bandwidth; reducedinstructionsetcomputerlike instruction set design which leverages optimizing compiler technology; adherence to industrystandard floating point formats, specifically VAX and IEEE floating point; and an architectural design amenable to a VLSI implementation. The architecture provides not only high computational capability, but also a mesh and global interconnect style of communication. The computational model and subsystems of the MP1, including the interconnection mechanisms, are described

[horst07multidimensional]  M.G. van der Horst and R.H. Mak. Multidimensional parallel rank order filtering. In IEEE Workshop on Signal Processing Systems (SiPS), pages 627632, October 2007. [ bib  .pdf ] 
This file has been generated by bibtex2html 1.87.