# University of L'Aquila

# DEPARTMENT OF INFORMATION ENGINEERING, COMPUTER SCIENCE AND MATHEMATICS



Ph.D. Program in ICT - Systems Engineering, Telecommunications and HW/SW Platforms

XXXIII cycle

# A Model for Early Power Estimation on multirate Digital systems

SSD - ING-INF/01

PH.D. STUDENT Graziano Battisti

COURSE COORDINATOR Prof. Vittorio Cortellessa

ADVISOR Prof. Marco Faccio

Co-Advisor

Prof. Fortunato Santucci

A.A. 2020/2021

# Sommario

Growing demand for connection capacity in rural, remote, and even urban areas is likely to increase the cost of purely terrestrial coverage. In this scenario, satellite communications are set to play a significant role in 5/6G ecosystems, providing ubiquitous coverage, broadcast/multicast services, and support in emergency and disaster situations. To this end, several constellations have been proposed or are already under development, both for geostationary orbit (GEO) (e.g., ViaSat-2, Inmarsat's Global Xpress (GX) network, Intelsat's EpicNG platform) and for non-GEO orbits (e.g., LeoSat, the OneWeb system, the SpaceX Starlink satellite system).

Semi-transparent satellite transponders based on Digital Transparent Processors (DTP) are regarded as attractive solutions for providing adequate flexibility with respect to evolving standards and adaptability to time-varying traffic patterns. However, their hardware design raises several critical issues, especially for large constellations of small satellites (e.g., LEO). In this scenario, it is important to rely on a supporting design procedure that, starting from link-budget requirements, enables an accurate hardware design based on a trade-off between performance and complexity.

This work proposes a new modeling approach that makes the energy consumption of HW architectures explicit in the conceptual phase of the design. In the DSP domain, this phase is often tied to the arithmetic form of the algorithm, and Signal Flow Graphs (SFG) are used as its basis.

The model was first used to study, from an energy standpoint, the implementation of the entire DTP chain on FPGA, comparing the various solutions and identifying the most effective strategies. The method was then applied to optical communications, in particular as the design basis for the modulator and demodulator of an optical biotelemetry system based on multilevel coding with short pulses. In this case, the analysis made it possible to compare several solutions and select the most suitable one in terms of energy during the conceptual design phase of a real case. The results obtained in both cases for different configurations, requirements, and design goals provide promising insights toward the formulation of optimization problems. Finally, a model is defined for the early estimation of the dynamic power of DTPs, and of other systems, implemented on FPGA. The theoretical framework for DTPs is further extended to incorporate technology-related elements into the analysis and design process. The complexity, which previous work expressed in terms of elementary blocks, is detailed further by expressing it in terms of hardware primitives.

A particular new contribution is the estimation of energy consumption and its inclusion both in the performance evaluation and in the design approach. The early energy consumption model yields expressions for hardware complexity and power estimation that are valid for currently available commercial FPGAs and for full-custom VLSI implementations.

# **Summary**

Increasing connection capacity needs for rural, remote, and even urban areas are likely to increase the cost of pure terrestrial coverage. Satellite communications are an essential solution in the 5/6G ecosystems to provide ubiquitous coverage, broadcast/multicast provision, and emergency/disaster recovery. In this direction, several satellite constellations have either been proposed or are already under development for both the Geostationary Earth Orbit (GEO) (e.g., ViaSat-2, Inmarsat's Global Xpress (GX) network, Intelsat's EpicNG platform) and Low Earth Orbit (LEO) (e.g., LeoSat, the OneWeb system, the SpaceX Starlink satellite system).

Semi-transparent satellite transponders based on Digital Transparent Processors (DTP) are considered appealing solutions that provide adequate flexibility with respect to evolving standards and adaptivity to time-varying traffic patterns. Nevertheless, their hardware design is a critical concern, especially when large constellations of small LEO satellites are considered. In this scenario, a supporting design procedure that moves from link-budget requirements makes it possible to carry out an accurate hardware design based on a performance-complexity trade-off. In this regard, we plan to extend and exploit an equivalent noise model and a design procedure that we have developed, validated, and documented in a series of papers.

This work proposes a new modeling approach incorporating the energy consumption of HW architectures in the conceptual phase of the design. This phase, in the Digital Signal Processing domain, is often related to the arithmetic form of the algorithm using Signal Flow Graphs (SFG) as a basis.

The model was first used to study, from the energy point of view, the implementation of the entire DTP chain on FPGA, comparing the various solutions and identifying the most effective strategies. We also applied the early power estimation model in the optical communication field, where the method served as the design basis for the modulator and demodulator of an optical biotelemetry system based on multilevel coding with short pulses. This analysis allowed us to compare various solutions and choose the most appropriate one during the conceptual design phase of a real scenario. The results obtained for both case studies, using different configurations, requirements, and design goals, provide promising insights in the perspective of formulating optimization problems. From this experience, a model for early dynamic power estimation of DTPs implemented in FPGAs is defined. The theoretical framework for DTPs is further extended to incorporate technology-related elements into the analysis and design process. The complexity, previously based on elementary blocks [1], is expressed in terms of hardware primitives.

Another contribution is the estimation of energy consumption and its inclusion in the performance evaluation at the design stage. This early energy consumption model allows us to obtain expressions for hardware complexity and power estimation that are valid for commercially available FPGAs and for full-custom VLSI solutions.

# **Contents**

- Sommario
- Summary
- 1 Introduction and Background
  - 1.1 Contribution
  - 1.2 Outline
- 2 Theory of Multirate Systems
  - 2.1 Introduction
  - 2.2 Multirate system definition
  - 2.3 Sampling frequency
    - 2.3.1 Reduction of sampling frequency (Decimation)
    - 2.3.2 Increase in sampling rate (Interpolation)
    - 2.3.3 Blocks usually used in literature
    - 2.3.4 Resampling properties
  - 2.4 Filter banks
  - 2.5 Polyphase Channelizer
  - 2.6 Conclusion
- 3 Power Estimation Model
  - 3.1 Introduction
  - 3.2 Power consumption in CMOS circuits
    - 3.2.1 Basic Model
    - 3.2.2 Influencing factors
  - 3.3 Probabilistic Model and propagation input output
    - 3.3.1 Method for calculating the Toggle rate of an output
  - 3.4 Power analysis method
  - 3.5 Signal-Flow Graph
  - 3.6 Conclusion
- 4 DTP Architecture: A Case Study
  - 4.1 Introduction
  - 4.2 DTP Requirements
  - 4.3 Analog-to-Digital Converter
  - 4.4 IF to Analytic
  - 4.5 Analysis Channelizer
  - 4.6 Synthesis Channelizer
  - 4.7 Analytic to IF
  - 4.8 Noise Model
  - 4.9 Hardware Complexity Model
    - 4.9.1 Sub-blocks case analyzed
    - 4.9.2 IF to Analytic block in detail
  - 4.10 FPGA implementation: Simulation Results
  - 4.11 Conclusion
- 5 New Multilevel Pulsed Modulation for Optical Biotelemetry: Another Case Study
  - 5.1 Introduction
  - 5.2 Optical Biotelemetry Overview
  - 5.3 The proposed Multilevel Pulsed Modulation Technique
  - 5.4 System Design and Implementation
    - 5.4.1 Power Analysis in Pre-Design Phase
    - 5.4.2 Design and Implementation
  - 5.5 Experimental set-up and Measurements: System Characterization, Validation and Results
  - 5.6 Conclusion
- 6 Conclusions
- Publications
- Bibliography

# Chapter 1

# **Introduction and Background**

Recent years have seen a substantial proliferation of networked devices, making always-on, everywhere coverage increasingly essential and strategic. To this end, integrating terrestrial 5G and Beyond-5G (B5G) networks with satellite constellations offers many solutions. In particular, Low Earth Orbit (LEO) small-satellite constellations appear to be among the most appropriate [2]. Many of the use cases contemplated in these networks, such as Enhanced Mobile Broadband (eMBB), Ultra-Reliable Low Latency Communications (URLLC), or massive Machine Type Communications (mMTC), find in LEO constellations good opportunities for realizability and ubiquity. Compared to Geostationary Earth Orbit (GEO) constellations, LEO satellites fly at a lower altitude and therefore introduce lower latency (typically 2 ms for LEO versus $\geq 100$ ms for GEO [2], see Figure 1.1), which makes them more suitable to meet the fundamental requirements of the use cases above. Given their low altitude, they can also communicate with different types of ground terminals, including vehicles and Internet of Things (IoT) devices with more limited communication capabilities.

The main disadvantage of such a satellite network is the high number of satellites needed to guarantee continuous ground coverage, since a single satellite covers only a small portion, about 0.45% [2], of the Earth's surface, together with the management of the resulting complexity. For example, the Kepler, Telesat, and Starlink constellations will consist of 140, 300, and between 12000 and 42000 satellites, respectively [3, 4]. Disturbances from small satellites and the use of low orbits pose significant challenges to the design and performance of these networks and open up new opportunities for innovation. Starting from this scenario, we implemented a transparent on-board transponder, the Digital Transparent Processor (DTP), derived from the link-budget constraints.

Figure 1.1: GEO and LEO satellites.

Figure 1.2: Design workflow.

The DTP is used to process a large portion of the spectrum divided into channels, remaining fully transparent to the information they carry, and then to perform switching operations between channels with an appropriate delay and noise figure. The methodology depicted in Figure 1.2 provides multiple solutions defined by different sets of parameters: input/output data size, coefficient width, and the number of channels. The solutions are analyzed to choose the best or most appropriate one to implement. In this phase, the solutions are evaluated under various profiles: hardware complexity, time delay, and power consumption. The aspect of **power estimation** is highlighted in this part of the research. In particular, the main object of the research is the identification of a power consumption model that is suitable for the design stage examined (high level or mathematical signal level) and that also has a low computational complexity, in order to steer the design and avoid costly redesign cycles.

To support the need for more information during the conceptual phase of design, and in particular to increase knowledge about power consumption, we surveyed the state of the art. Najm in [5] stated: "With the advent of portable and high-density microelectronic devices, the power dissipation of very large scale integrated (VLSI) circuits is becoming a critical concern. Accurate and efficient power estimation during the design phase is required in order to meet the power specifications without a costly redesign process." and introduced many methodologies for power estimation. In [6], the three fundamental techniques used in the literature to perform power estimates are presented: the probabilistic-based method, the simulation-based method, and the statistical-based method. They are summarized in Table 1.1. These estimation techniques need different levels of detail and appropriate power modelling techniques. Figure 1.3 shows the classification of the compared solutions.

| Estimation technique | State of the art                            | Advantages                   | Limitations                                                  |
|----------------------|---------------------------------------------|------------------------------|--------------------------------------------------------------|
| Probabilistic based  | [7], [8], [9], [10], [11], [12], [13], [14] | High estimation speed        | Low accuracy                                                 |
| Simulation based     | [15], [16], [17], [18]                      | 1. High accuracy; 2. Generic | 1. Large amount of memory resources; 2. Low estimation speed |
| Statistical based    | [19], [20], [21], [22], [23]                | Moderate accuracy            | Moderate estimation speed                                    |

**Table 1.1:** Estimation techniques advantages and limitations.

Figure 1.3: Classification of the related works analyzed in [6] (power characterization techniques).

The power model that best suits a high level of design based on signal theory is the analytical one, which, although generally of below-average accuracy, offers a good compromise as an estimator in the early stages of design. This work identifies early power estimation strategies for the design stage that are computationally light while guaranteeing adequate accuracy to compare the power consumption trends of sub-blocks or their alternatives. In general terms, the expected goal is to describe a **"fast and light power estimation model for multirate digital circuits"**. To validate its effectiveness in the design phase, the model has been applied to an implementation of the DTP on an FPGA platform carried out within the research group. After analyzing the factors involved in calculating the estimate, the model is implemented in the simplest possible version, leaving open the parameters that future extensions can refine to make it progressively more accurate.

### 1.1 Contribution

The main contribution of this work is the identification of a fast power estimation strategy for digital hardware blocks in multirate domains. It gives a rough estimate of their consumption that is adequate to identify a design direction and avoid costly redesign cycles. From this strategy, a generalized model can easily be extracted and included in a broader methodology for the design of digital circuits for Digital Signal Processing applications, capable of producing estimates of different kinds starting from the theoretical or mathematical formulation.

Several domain-specific languages for the design of digital systems oriented to numerical processing could gain from this a valuable tool for design refinement. In particular, the most suitable ones are those whose formalism is closest to the mathematical one, i.e., those with a functional paradigm mentioned in [24]. Languages such as $C\lambda$ash [25] (or Clash), a modern functional hardware description language that offers many advantages and design facilities in this area, could be integrated or coupled with these models, providing fast estimates already in the stages immediately following the requirements analysis. More generally, term rewriting techniques could be used to synthesize digital circuits [26]; in addition to offering a high-level mathematical view, they maintain a strong link with the hardware implementation, which makes them more suitable for a system of estimators. An example of how these languages can be used in numerical signal processing can be seen in [27], where an integer polyphase channelizer is implemented with a formalism close to the mathematical one.

### 1.2 Outline

This thesis is divided into three fundamental parts. The first, more theoretical, consists of two chapters introducing the basics of multirate design techniques and the power model used. The second consists of the validation through two case studies: the DTP architecture and the Multilevel Pulsed Modulation for Optical Biotelemetry. The third contains the conclusions.

Chapter 2 introduces multirate systems theory, which is often a valuable theoretical tool for the efficient implementation of digital signal processing systems. Chapter 3 introduces the adopted power model, which starts from a transistor-level view [18, 28] and then generalizes to more complex blocks. Chapter 3 also shows, using a full-adder block as an example, how to set up a pseudo-Boolean problem that finds the worst case of a logic network in terms of the number of internal commutations (switching activity). This leads to splitting the dynamic power into two contributions: a fixed internal one and one due to the outputs, which is weighted by the effective fanout.

Chapter 4 introduces, as a case study, the Digital Transparent Processor (DTP) realized for satellite applications by the research group [29,30]. This DTP is implemented with a typical fixed architecture that is fully configurable in its word widths, coefficients, and rounding and saturation parameters. This parameterization is the basis for meeting the link-budget specifications set by the requirements. The chapter highlights how the same link-budget requirement can correspond to various solutions, the fundamental ones being Uniform Contribution (CU), Uniform Parameters (PU), Minimum RAM (MR), and CU/MR trade-off (TO). At the end of the chapter, we introduce the noise model used and estimate the hardware complexity.

Chapter 5 shows another case where we apply power estimation in the pre-design stage. The chapter explains the design, implementation, and characterization of a novel multilevel synchronized pulse position modulation technique suitable to reduce the overall power consumption of wireless ultra-wide-band optical biotelemetry links for implantable and wearable medical devices.

Chapter 6 concludes the thesis and provides recommendations that impact the design of telecommunications systems.

# Chapter 2

# **Theory of Multirate Systems**

### 2.1 Introduction

In recent years, digital signal processing has been gaining ground over analog processing. In this scenario, several sub-blocks operating at different sampling frequencies often have to coexist, and two subsystems working at different sampling frequencies must be made compatible before they can communicate. Working with different sampling frequencies is also highly profitable in signal and image processing, and therefore in telecommunications. For example, a system that receives a wideband digital signal and decomposes it into several non-overlapping narrowband channels for transmission will process each narrowband channel at its own sampling frequency, reduced to its Nyquist limit, thus saving transmission bandwidth [31].

Leveraging this last point, multirate systems, and polyphase decompositions in particular, can be used in two ways: to improve computational performance, by decomposing a system into subsystems that a given technology can handle and thus raising the achievable frequency limits by operating on undersampled subsystems; or to exploit parallelism so that technologically inferior hardware can be used.

This chapter first introduces the basic operations of decimation and interpolation and lists several of their properties, including the noble ones, which make it possible to implement arbitrary rational changes of the sampling rate.

Finally, polyphase decomposition and switch models, which are key tools in multirate systems, lead to the design of the decimation and interpolation filters needed for the case study of this work, i.e., a digital transparent satellite processor (DTP).

### 2.2 Multirate system definition

In general, the expressions multirate, multirate systems, or multirate signal processing denote systems or signal processing methods in which more than one sampling frequency is present. This means that two signals coming from two different parts of the system are not directly comparable.

In signal processing it is often necessary to modify the sampling rate. An example is the conversion between two standard audio sampling rates: from Digital Audio Tape (DAT) (48 kHz) to Compact Disc (CD) (44.1 kHz) or vice versa.

In other situations, sampling rate conversion can be used as an expedient to reduce the computational cost. In signal processing, in fact, there is no reason why all the processing steps should run at the same sampling rate.

Why multirate filters?

- Multirate filters can bring efficiency to a particular filter implementation;
- In general, a multirate filter is one whose stages operate at different sample rates.

In [32] Fredric J. Harris says: "However, multirate filters are also often used in designs where this is not the case. For example you may have a system where the input sample rate and output sample rate are the same, but internally there is decimation and interpolation occurring in a series of filters, such that the final output of the system has the same sample rate as the input. Such a design may exhibit lower cost than could be achieved with a single-rate filter for various reasons".
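The efficiency argument can be made concrete with a rough operation count. The sketch below assumes a direct-form FIR filter that costs one multiply-accumulate (MAC) per tap per computed output sample; the function name `fir_mac_rate`, the tap count, and the decimation factor are hypothetical values chosen only for illustration.

```python
def fir_mac_rate(num_taps, output_rate_hz):
    """MAC operations per second for a direct-form FIR filter (one MAC per tap per output)."""
    return num_taps * output_rate_hz


num_taps = 64    # hypothetical filter length
fs = 48_000      # input sampling rate in Hz (the DAT rate quoted in the text)
M = 4            # hypothetical decimation factor

# Filter at the full rate, then discard M - 1 of every M outputs.
full_rate_cost = fir_mac_rate(num_taps, fs)

# Compute only the outputs that survive decimation.
decimating_cost = fir_mac_rate(num_taps, fs // M)

print(full_rate_cost, decimating_cost)
```

Under these assumptions the decimating structure needs a factor M fewer operations per second, which is the kind of saving the quoted passage alludes to.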

### 2.3 Sampling frequency

A sampled signal can be seen as a discrete-time sequence derived from an analog signal:

$$x_c(t) \Rightarrow x[n] = x_c(nT), \qquad F_c = \frac{1}{T}$$

Sometimes it is useful to have the signal sampled at a frequency different from the original one. Since the sampling rate is usually thought of as coming directly from the conversion, an "obvious" solution would be to reconstruct the analog signal and then resample it:

$$x_c(t) \Rightarrow x'[n] = x_c(nT'), \qquad F_c' = \frac{1}{T'}$$

However, it is possible to change the sampling frequency of a signal without reconstructing the analog signal and performing the analog-to-digital conversion again.

### 2.3.1 Reduction of sampling frequency (Decimation)

Decimation is the basic procedure used to lower the sampling rate (the literature also uses the terms subsampling and downsampling).

$$x_d[n] = x[nM] = x_c(nMT)$$

Figure 2.1 depicts the symbol for the decimation block, while Figures 2.2 and 2.3 show its behaviour in the time and frequency domains, respectively.

$$x[n] \longrightarrow M \downarrow \longrightarrow x_d[n] = x[nM]$$

Figure 2.1: Down-sampling block.



Figure 2.2: Down-sampling in the time domain.



Figure 2.3: Down-sampling in the frequency domain.

The sampling frequency can be reduced by a factor M without aliasing if the original sampling frequency is at least M times higher than the Nyquist rate<sup>1</sup>. That is, we want:

$$X(e^{j\omega}) = 0 \quad \text{for} \quad |\omega| > \frac{\pi}{M}$$
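A small numerical sketch can illustrate why this band-limiting condition matters. It assumes two real tones and M = 2; the helper `downsample` and the chosen frequencies are illustrative and not taken from the thesis. One tone lies inside $|\omega| \leq \pi/M$, the other outside, and after decimation the two become indistinguishable.

```python
import math


def downsample(x, M):
    """Keep every M-th sample: x_d[n] = x[nM]."""
    return x[::M]


M = 2
n = range(32)

# Tone inside the alias-free band |w| <= pi/M.
low = [math.cos(0.25 * math.pi * k) for k in n]
# Tone outside the alias-free band.
high = [math.cos(0.75 * math.pi * k) for k in n]

low_d = downsample(low, M)    # cos(0.5*pi*m)
high_d = downsample(high, M)  # cos(1.5*pi*m), which equals cos(0.5*pi*m)

# The out-of-band tone has folded onto the in-band one.
aliased = all(abs(a - b) < 1e-9 for a, b in zip(low_d, high_d))
print(aliased)
```

The out-of-band tone lands exactly on the frequency of the in-band tone, which is why a low-pass filter normally precedes the decimator.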

### 2.3.2 Increase in sampling rate (Interpolation)

Interpolation is the basic procedure used to increase the sampling frequency (in the literature terms like oversampling or upsampling are also used).

$$x_e[n] = \begin{cases} x[n/L], & n = 0, \pm L, \pm 2L, \dots \\ 0, & \text{otherwise} \end{cases}$$

Figure 2.4 depicts the symbol for the interpolation block, while Figures 2.5 and 2.6 show its behaviour in the time and frequency domains, respectively.

$$x[n] \longrightarrow L \uparrow \longrightarrow x_e[n] = x[n/L]$$

Figure 2.4: Up-sampling block.



**Figure 2.5:** Up-sampling in the time domain.

As with decimation, interpolation also has undesirable effects that, although less destructive, deserve attention. In this case we speak of imaging rather than aliasing. This effect introduces replicas of the signal spectrum in the frequency domain when the final band is considered equal to the starting one.

<sup>1</sup> Sampling theorem: $\Omega_s = \frac{2\pi}{T_s} \geq 2\Omega_N$, where $\Omega_N$ is the bandwidth of the continuous-time signal; equivalently, $\frac{\pi}{T_s} \geq \Omega_N$.



**Figure 2.6:** Up-sampling in the frequency domain.
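The imaging effect can be checked numerically. The sketch below assumes zero-insertion up-sampling and a plain DFT written out by hand; the helper names `upsample` and `dft` and the test sequence are illustrative. The spectrum of the up-sampled sequence turns out to be the original spectrum repeated L times, and the extra copies are the images.

```python
import cmath


def upsample(x, L):
    """Insert L - 1 zeros between consecutive samples."""
    y = [0.0] * (len(x) * L)
    y[::L] = x
    return y


def dft(x):
    """Direct O(N^2) discrete Fourier transform."""
    N = len(x)
    return [
        sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N))
        for k in range(N)
    ]


x = [1.0, 2.0, 0.5, -1.0]   # arbitrary test sequence
L = 3

X = dft(x)
X_e = dft(upsample(x, L))

# Each bin of the longer DFT repeats a bin of the original DFT.
images_match = all(abs(X_e[k] - X[k % len(x)]) < 1e-9 for k in range(len(X_e)))
print(images_match)
```

An interpolation (anti-imaging) low-pass filter after the up-sampler is what removes the L - 1 unwanted copies.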

### 2.3.3 Blocks usually used in literature

The elementary blocks seen above are represented in the literature by various symbols, all equivalent, which often emphasize a possible physical implementation of the block.



Figure 2.7: Commonly used blocks.

In particular, the last elements in Figure 2.7 anticipate one of the end points of this short discussion, namely the polyphase decomposition of the signal.

#### 2.3.4 Resampling properties

Before describing the various properties of resampling, the following notation is introduced:

$$DS(N, x(n))$$

denotes, in a function-like notation, a block whose output is the signal x(n) decimated by order N.

Dually:

$$US(N, x(n))$$

denotes the oversampling of the signal x(n) by order N, obtained by inserting zeros.

#### Property I - Linearity of the downsampling

$$DS(M, a_1 \cdot x_1(n) + a_2 \cdot x_2(n)) = a_1 \cdot DS(M, x_1(n)) + a_2 \cdot DS(M, x_2(n))$$

The undersampling operation is linear with respect to the linear combination of two signals: it can be factored out of the combination or distributed over its terms, see Figure 2.8 on the left and right, respectively.



Figure 2.8: Property I.

#### Property II (Dual) - Linearity of the upsampling

$$US(M, a_i \cdot x(n)) = a_i \cdot US(M, x(n))$$

By duality, the oversampling operation is also linear with respect to the linear combination of two signals: it can be factored out of the combination or distributed over its terms, see Figure 2.9 on the left and right, respectively.



Figure 2.9: Property II.
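Properties I and II can be verified on short sequences. In the sketch below, `DS` and `US` mirror the notation introduced above, while the helper `combine`, the coefficients, and the test signals are hypothetical values chosen for illustration.

```python
def DS(M, x):
    """Down-sampling of order M: keep every M-th sample."""
    return x[::M]


def US(L, x):
    """Up-sampling of order L: insert L - 1 zeros between samples."""
    y = [0.0] * (len(x) * L)
    y[::L] = x
    return y


def combine(a1, x1, a2, x2):
    """Linear combination a1*x1 + a2*x2, sample by sample."""
    return [a1 * u + a2 * v for u, v in zip(x1, x2)]


a1, a2 = 2.0, -3.0
x1 = [1.0, -2.0, 3.0, 0.5, 4.0, -1.0]
x2 = [0.0, 1.0, 1.0, -2.0, 2.0, 5.0]

# Property I: resample the combination, or combine the resampled signals.
ds_left = DS(3, combine(a1, x1, a2, x2))
ds_right = combine(a1, DS(3, x1), a2, DS(3, x2))

# Property II: the same check for up-sampling.
us_left = US(2, combine(a1, x1, a2, x2))
us_right = combine(a1, US(2, x1), a2, US(2, x2))

print(ds_left == ds_right, us_left == us_right)
```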

#### **Properties III and IV (Dual)**

These properties show what happens to delay blocks when they are moved from downstream to upstream of a resampler. Figure 2.10 shows the operation graphically for a decimator and, dually, for an interpolator.



Figure 2.10: Properties III and IV.

#### Properties V and VI (Dual) - Noble properties

Two new properties are derived from the previous ones, which are often referred to as the noble properties of multirate filters. They are very important because they highlight the simplifications that can be made on a generic filter. It should always be kept in mind that the output side of a decimator works at a lower frequency, as does, dually, the input side of an interpolator.



Figure 2.11: Properties V and VI or Noble Properties.
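The sketch below checks the noble properties numerically in their standard textbook form, on the assumption that this is what Figure 2.11 depicts: filtering with $H(z^M)$ followed by down-sampling by M equals down-sampling by M followed by $H(z)$, and up-sampling by L followed by $H(z^L)$ equals $H(z)$ followed by up-sampling by L. The helper names, the filter taps, and the input are illustrative.

```python
def downsample(x, M):
    return x[::M]


def upsample(x, L):
    y = [0.0] * (len(x) * L)
    y[::L] = x
    return y


def expand(h, M):
    """Taps of H(z^M): insert M - 1 zeros between the taps of H(z)."""
    out = [0.0] * ((len(h) - 1) * M + 1)
    out[::M] = h
    return out


def fir(x, h):
    """Causal FIR filter with zero initial state, output truncated to len(x)."""
    return [
        sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
        for n in range(len(x))
    ]


def close(a, b):
    return len(a) == len(b) and all(abs(u - v) < 1e-9 for u, v in zip(a, b))


h = [1.0, -2.0, 0.5]                      # arbitrary FIR taps
x = [float(v) for v in range(1, 13)]      # arbitrary input, length 12
M = L = 3

# Decimator: filter at the high rate with H(z^M), or at the low rate with H(z).
dec_high_rate = downsample(fir(x, expand(h, M)), M)
dec_low_rate = fir(downsample(x, M), h)

# Interpolator: filter at the low rate with H(z), or at the high rate with H(z^L).
int_low_rate = upsample(fir(x, h), L)
int_high_rate = fir(upsample(x, L), expand(h, L))

print(close(dec_high_rate, dec_low_rate), close(int_low_rate, int_high_rate))
```

In both cases the low-rate variant performs the same arithmetic on fewer samples, which is the simplification the properties make available.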

#### **Rational Resampling**

Resampling by a rational factor is very often necessary. It can be carried out by suitably cascading the under- and oversampling blocks as in Figure 2.12, but in general the order in which the under- and oversampling are carried out matters, and the two blocks cannot always be exchanged.



**Figure 2.12:** Rational resampling  $\frac{L}{M}$ .

The positions of the sub- and oversampler can be exchanged if and only if the two orders are coprime; Figure 2.13 shows an example.



Figure 2.13: Example of subsampling with order 2 and oversampling with order 3 in cascade.
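The coprimality condition can be observed on a short sequence. The sketch below uses the orders 2 and 3 of Figure 2.13 and, as a hypothetical counterexample, the non-coprime pair 2 and 4; the helper names and the input are illustrative.

```python
def downsample(x, M):
    return x[::M]


def upsample(x, L):
    y = [0.0] * (len(x) * L)
    y[::L] = x
    return y


def commutes(x, M, L):
    """True if down-by-M then up-by-L equals up-by-L then down-by-M for this x."""
    return upsample(downsample(x, M), L) == downsample(upsample(x, L), M)


x = [float(v) for v in range(1, 13)]   # length 12, a multiple of M

coprime_case = commutes(x, 2, 3)       # gcd(2, 3) = 1
non_coprime_case = commutes(x, 2, 4)   # gcd(2, 4) = 2

print(coprime_case, non_coprime_case)
```

For the non-coprime pair, the up-sample-first path keeps every input sample while the down-sample-first path has already discarded half of them, so the outputs differ.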

An important application of all these properties can be expressed by the corollary of Figure 2.14.



**Figure 2.14:** Rational resampling  $\frac{L}{M}$  with filter in the middle.
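A minimal sketch of the structure in Figure 2.14 follows, assuming a Hamming-windowed sinc low-pass filter with cutoff $\pi/\max(L, M)$ and DC gain L. The filter design, the helper names, and the `taps_per_phase` parameter are assumptions made for illustration, not the design used in this work. The first lines also work out the ratio for the DAT-to-CD conversion mentioned in Section 2.2.

```python
import math
from fractions import Fraction


def upsample(x, L):
    y = [0.0] * (len(x) * L)
    y[::L] = x
    return y


def lowpass(num_taps, cutoff, gain):
    """Hamming-windowed sinc low-pass; cutoff in rad/sample, taps scaled to the given DC gain."""
    center = (num_taps - 1) / 2
    taps = []
    for n in range(num_taps):
        t = n - center
        ideal = cutoff / math.pi if t == 0 else math.sin(cutoff * t) / (math.pi * t)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / (num_taps - 1))
        taps.append(ideal * window)
    scale = gain / sum(taps)
    return [scale * h for h in taps]


def fir(x, h):
    """Causal FIR filter with zero initial state, output truncated to len(x)."""
    return [
        sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
        for n in range(len(x))
    ]


def resample(x, L, M, taps_per_phase=8):
    """Up-sample by L, low-pass filter, then down-sample by M."""
    h = lowpass(taps_per_phase * max(L, M) + 1, math.pi / max(L, M), gain=L)
    return fir(upsample(x, L), h)[::M]


# DAT (48 kHz) to CD (44.1 kHz): the ratio L/M in lowest terms.
ratio = Fraction(44_100, 48_000)
print(ratio.numerator, ratio.denominator)

# Small check with L/M = 3/2: a constant input stays close to constant after the start-up transient.
y = resample([1.0] * 40, 3, 2)
steady = y[20:50]
print(min(steady), max(steady))
```

The gain L compensates for the amplitude loss caused by zero insertion, and a single filter serves as both the anti-imaging and the anti-aliasing stage because the tighter of the two cutoffs is used.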

### 2.4 Filter banks

In recent years, multirate processing algorithms have been widely used together with filter banks, especially in telecommunications and multimedia applications. Filter banks are made up of low-pass, band-pass, and high-pass filters, combined according to appropriate architectures and designed to decompose the spectrum of the input signal into a certain number of contiguous bands (sub-bands). In this case we speak of an analysis bank. The bank that performs the signal reconstruction is called a synthesis bank.

An analysis bank is a set of filters $H_k(z)$ whose aim is to decompose the original signal x[n] into M sub-band signals $x_k[n]$, as shown in Figure 2.15, while a synthesis bank is a set of M synthesis filters $F_k(z)$ combining M signals $y_k[n]$ into a recovered signal $\tilde{x}[n]$, as shown in Figure 2.16.



Figure 2.15: Analysis filter bank.



Figure 2.16: Synthesis filter bank.

# 2.5 Polyphase Channelizer

This section introduces the filter bank known as the polyphase channelizer. The treatment is deliberately intuitive and informal, as it is the same one used for the seminars held during these three years of study.

A polyphase filter bank consists of two parts, a polyphase filter and an FFT block. The former is used to decimate the input signal before sending it to the latter. The FFT, on the other hand, splits the signal into its different channels [33].

Suppose we have a number J of frequency-multiplexed channels that constitute an FDM frame. This frame can be shifted in frequency so that it is centered on the origin of the frequency axis, using a complex representation of the samples ($\in \mathbb{C}$) and an appropriate Hilbert filter capable of transforming the signal into its analytic form.

#### 2.6 Conclusion

Adopting multirate solutions significantly influences the final architecture of the system, changing its working frequencies and hardware complexity. For this reason, it is important to study the various alternatives also from the point of view of power consumption, making their use transparent and, if necessary, finding the best ones.

# Chapter 3

# **Power Estimation Model**

#### 3.1 Introduction

The mobility paradigm and the Internet of Things (IoT) introduce new challenges in energy consumption and hardware complexity. The new constraints and requirements increase the complexity of the VLSI design process. While complexity and performance problems manifest primarily as heat dissipation, those introduced by mobility and IoT are mostly energy-saving problems. To address these challenges, sophisticated design methodologies and algorithms have been developed for electronic design automation (EDA), but these often require the design to be at an advanced level of detail. The importance of power consumption in design highlights the need for appropriate estimators, which help identify the alternatives that are more efficient in this respect. Since power estimates may be required at various phases of the design, it may sometimes be necessary to sacrifice accuracy to increase the speed of tool response, or vice versa, while maintaining some degree of reliability. Obtaining a power estimate is much more complex than estimating circuit area and delay, because power depends not only on circuit topology but also on signal activity. Typically, design exploration is performed at each level of abstraction, which motivates power estimation tools at different levels. At a high level of abstraction little is known about the implementation of the circuit, so only less accurate estimates can be obtained.

# 3.2 Power consumption in CMOS circuits

Power consumption in digital CMOS circuits can be seen as the sum of three components [28, 34]:

- Dynamic power $(P_{dyn})$;
- Short-circuit power $(P_{short})$;
- Static power $(P_{static})$;

$$P = P_{dyn} + P_{short} + P_{static}$$

Although over the years the contribution of static power $P_{static}$ has grown to be on par with that of dynamic power $P_{dyn}$, only the latter can be directly managed in the initial design phase, while the former can be considered a direct consequence of the occupied area and of the technology node in use. The short-circuit power $P_{short}$ can be considered dynamic, because it changes with the frequency and the activity on the lines, but also static, because it is always linked to the number of internal components.

During the design phase there are many strategies that can be used to reduce the power dissipated in integrated circuits. Among the best known are the reduction of the supply voltage and the scaling of the technology node; these, however, and the first one in particular, are not very practicable in the FPGA field.

A more general strategy, which must be taken into account from the early stages of design, is instead the reorganization of the logic synthesis and of the computation architecture.

In CMOS circuits, dynamic power consumption depends on the number of transitions on the internal signals. Power optimization strategies can therefore aim at reducing the number of these transitions.

In this document, when we talk about dissipated power we refer to the dynamic component.

A simple and fast method to estimate power dissipation is developed. It is useful for choosing between equivalent solutions for the internal blocks that already differ in area and delay, and it opens the door to possible optimization algorithms. Ultimately, it makes it possible to analyze whether changing the architecture can actually contribute to reducing power dissipation.

#### 3.2.1 Basic Model

The basic model adopted for power estimation is based on the fact that a fundamental contribution to power consumption is provided by dynamic power, i.e. when a logical signal switches between the two states. Basically this is seen as the energy needed to write a logical state and then delete it on a line (output).

Fig. 3.1 depicts the charge and discharge of the output capacitance of a CMOS circuit, which is responsible for the dissipation of dynamic power. The losses during the switching of the two transistors p and n are neglected, and the power dissipated over an entire LOW-HIGH-LOW output cycle is considered.



Figure 3.1: Charge phase and discharge phase of output in CMOS circuits.

By representing the internal connections of a circuit as capacitances we can write the formula:

$$P_{dyn} = \frac{1}{2}C V^2 F$$

where C is the line capacitance, V is the supply voltage and F is the frequency, defined as the inverse of the symbol period on the line (in synchronous circuits it is usually the clock frequency).

### 3.2.2 Influencing factors

When generalizing the formula for calculating the power of a circuit, attention must be paid to many factors, and it is important to identify a model for the signals on the lines that is easy to treat.

#### Probabilistic model of digital signals

The probabilistic method most used in the literature to characterize the switching activity of a digital signal in logic networks describes each signal through the following parameters:

- Static Probability (SP): probability of the symbol 1 ($\in$ [0, 1]);
- Signal Transition (ST): transition frequency ([Tr/s] or [MTr/s]).

To better understand these parameters we always refer to the synchronous case, in which the signals change according to a general clock signal, later indicated with CLK. This is not strictly necessary, however, as we will see from the formulas.

In our discussion we will often refer to an alternative representation used in the literature that is more suitable for synchronous cases. Here the Signal Transition is replaced with the Toggle rate (TR), which can be defined as the inverse of the average number of clock cycles between two successive transitions of the signal, that is:

$$TR(a) = \frac{ST(a)}{F_{CLK}}$$

where ST(a) is the Signal Transition of the signal a, and $F_{CLK}$ is the clock frequency. The Toggle rate is often also expressed as a percentage.

**Example 1** If the clock frequency is $F_{CLK} = 1 \ MHz$ and $ST(a) = 1 \ MTr/s$ (average bit rate), we obtain:

$$TR(a) = \frac{ST(a)}{F_{CLK}} = \frac{1 MTr/s}{1 MHz} = 1$$

or even: TR(a) = 100%.

**Example 2** Since the clock signal itself completes two transitions in each clock period, we obtain:

$$TR(CLK) = \frac{ST(CLK)}{F_{CLK}} = \frac{2 * F_{CLK}}{F_{CLK}} = 2$$

or even: TR(CLK) = 200%.

Applying what has been said so far, our basic model becomes:

$$P_{dyn}(a) = \frac{1}{2}CV^2 F_{CLK} TR(a)$$
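The two examples and the basic model above can be reproduced with a few lines of Python. This is only a sketch: the function names are illustrative, and the capacitance and voltage values are hypothetical placeholders chosen to make the arithmetic visible.

```python
def toggle_rate(st, f_clk):
    """Toggle rate TR = ST / F_CLK, in transitions per clock cycle."""
    return st / f_clk


def dynamic_power(c, v, f_clk, tr):
    """Basic model: P_dyn = 1/2 * C * V^2 * F_CLK * TR, in watts."""
    return 0.5 * c * v ** 2 * f_clk * tr


F_CLK = 1e6                                # 1 MHz, as in Examples 1 and 2
tr_a = toggle_rate(1e6, F_CLK)             # Example 1: ST(a) = 1 MTr/s
tr_clk = toggle_rate(2 * F_CLK, F_CLK)     # Example 2: the clock signal itself

# Hypothetical line: C = 10 fF, V = 1.0 V (assumed values, not from the text)
p_a = dynamic_power(10e-15, 1.0, F_CLK, tr_a)
print(tr_a, tr_clk, p_a)
```

Since the model is linear in TR, a line that toggles like the clock dissipates twice the power of a data line with TR = 100% and the same capacitance.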

#### Fanout

Analyzing the dynamic power function, a parameter still to be discussed is the capacitance of the wire $(C_{net})$. At this level of design detail it seems inevitable to assume the same $C_{net}$ for all wires, as a more precise estimate would require more complex procedures and more detailed knowledge.

The only correction that is possible at this stage is to bring out the fanout as a multiplicative factor of a capacitance $C_{tech}$ that will be estimated later as a technology parameter.

Fig. 3.2 shows how the load capacitance of a CMOS logic gate is influenced not only by the connection line $(C_W^n)$, but also by the capacitance of the inputs of the cascaded blocks and by their number (fanout).



**Figure 3.2:** Capacitors involved in the calculation of  $C_{net}$ .

The fanout FO (later FOA, average fanout) is the number of blocks/circuits driven by a wire. At this point the expression of dynamic power becomes:

$$P_{dyn}(a) = \frac{1}{2}C_{tech} FO V^2 F_{CLK} TR(a)$$

The previous formula can be further generalized if we consider generic blocks of variable size or made up of sub-blocks. These can be divided into a fixed internal part and an output part that is affected by fanout. To model this aspect the formula is transformed into the following:

$$P_{dyn}(a) = \frac{1}{2}(\alpha_{tech} + \beta_{tech}FO) V^2 F_{CLK} TR(a)$$

where $\alpha_{tech}$ represents the internal contribution and $\beta_{tech}$ the output contribution to energy consumption.

Generalizing to all the internal blocks of a circuit, considering that a line can be seen as output of one and only one of them, the total dynamic power is expressed as follows:

$$P_{dyn} = \frac{1}{2} \sum_{\forall net} (\alpha_{tech}(net) + \beta_{tech}(net) FO(net)) V^2 F_{CLK} TR(net)$$

where $\alpha_{tech}(net)$ is the internal contribution of the block that generates the signal on the net, and $\beta_{tech}(net)FO(net)$ is the contribution given by the capacitance of the line itself, for each net present in the circuit.

Grouping by homogeneous blocks (for example: LUT, FullADDER, CARRY chain etc.) we can introduce the average fan-out for each of them FOA(block), and similarly the average Toggle rate TRA(block) getting:

$$P_{dyn} = \frac{1}{2} \sum_{\forall block} (\alpha_{tech}(block) + \beta_{tech}(block) FOA(block)) V^2 F_{CLK} TRA(block)$$
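The grouped formula lends itself to a direct evaluation. The following sketch assumes that the sum over all blocks can be written as a sum over block types, each weighted by the number of blocks of that type; the `BlockGroup` structure, the block counts and all coefficient values are hypothetical and serve only as an illustration.

```python
from dataclasses import dataclass


@dataclass
class BlockGroup:
    name: str
    count: int     # number of blocks of this type (assumed known)
    alpha: float   # alpha_tech(block), internal contribution
    beta: float    # beta_tech(block), output contribution per unit of fanout
    foa: float     # average fanout FOA(block)
    tra: float     # average toggle rate TRA(block)


def total_dynamic_power(groups, v, f_clk):
    """P_dyn = 1/2 * sum over blocks of (alpha + beta*FOA) * V^2 * F_CLK * TRA."""
    acc = sum(g.count * (g.alpha + g.beta * g.foa) * g.tra for g in groups)
    return 0.5 * acc * v ** 2 * f_clk


# Hypothetical design: the numbers below are placeholders, not measured data
groups = [
    BlockGroup("LUT", 120, 2.0e-15, 1.0e-15, 2.5, 0.20),
    BlockGroup("FullADDER", 32, 3.0e-15, 1.0e-15, 1.5, 0.35),
]
print(total_dynamic_power(groups, v=1.0, f_clk=100e6))
```

Keeping one record per block type makes it easy to compare two equivalent architectures by changing only the counts and the average activity figures.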

#### Parameters $\alpha_{tech}$ and $\beta_{tech}$

The parameters $\alpha_{tech}$ and $\beta_{tech}$, although amply justified by normal VLSI design practice, can be traced back to the power contributions of the constituent elements inside a circuit.

By analyzing a simple, commonly used example network it is possible to derive these parameters and to show why they change from block to block, as well as their dependence on the input, which is estimated as a worst case.

In what follows we consider the network of a full adder and calculate its worst case from the point of view of power consumption.

#### The Full Adder's case

The Full Adder is a basic circuit in digital electronics but also in computing and telecommunication systems.



Figure 3.3: Full adder gate level.

The description of the Full Adder in Figure 3.3 is given by the following formulas:

$$out = a \oplus b \oplus c_i$$
$$c_o = ab + (a \oplus b)c_i$$

In our case the study is carried out using the AIG (And-Inverter Graph) description or, more precisely, a NAND-Inverter Graph, as shown in Figure 3.4.

Figure 3.4: Full adder gate-level with NAND and Inverter gates only.

We set up a maximization problem on a linear pseudo-Boolean function (Boolean variables and integer coefficients) that sums the variations of the internal nodes between two (Boolean) instances of the circuit, at time $t_0$ and at the next time $t_1$, taking into account the respective fan-out (integer). To do this, we use the following constraints in product-of-sums form:

$$\begin{split} H^{n} = & (a^{n} + d_{1}^{n})(\overline{a^{n}} + \overline{d_{1}^{n}}) \\ & (b^{n} + d_{2}^{n})(\overline{b^{n}} + \overline{d_{2}^{n}}) \\ & (d_{5}^{n} + d_{6}^{n})(\overline{d_{5}^{n}} + \overline{d_{6}^{n}}) \\ & (c_{i}^{n} + d_{7}^{n})(\overline{c_{i}^{n}} + \overline{d_{7}^{n}}) \\ & (d_{1}^{n} + d_{3}^{n})(b^{n} + d_{3}^{n})(\overline{d_{1}^{n}} + \overline{b^{n}} + \overline{d_{3}^{n}}) \\ & (a^{n} + d_{4}^{n})(d_{2}^{n} + d_{4}^{n})(\overline{a^{n}} + \overline{d_{2}^{n}} + \overline{d_{4}^{n}}) \\ & (d_{3}^{n} + d_{5}^{n})(d_{4}^{n} + d_{5}^{n})(\overline{d_{3}^{n}} + \overline{d_{4}^{n}} + \overline{d_{5}^{n}}) \\ & (d_{6}^{n} + d_{8}^{n})(c_{i}^{n} + d_{8}^{n})(\overline{d_{6}^{n}} + \overline{c_{i}^{n}} + \overline{d_{8}^{n}}) \\ & (d_{5}^{n} + d_{9}^{n})(d_{7}^{n} + d_{9}^{n})(\overline{d_{5}^{n}} + \overline{d_{7}^{n}} + \overline{d_{9}^{n}}) \\ & (d_{5}^{n} + d_{10}^{n})(c_{i}^{n} + d_{10}^{n})(\overline{d_{5}^{n}} + \overline{c_{i}^{n}} + \overline{d_{10}^{n}}) \\ & (a^{n} + d_{11}^{n})(b^{n} + d_{11}^{n})(\overline{a^{n}} + \overline{b^{n}} + \overline{d_{11}^{n}}) \\ & (d_{8}^{n} + out^{n})(d_{9}^{n} + out^{n})(\overline{d_{8}^{n}} + \overline{d_{9}^{n}} + \overline{out^{n}}) \\ & (d_{10}^{n} + co^{n})(d_{11}^{n} + co^{n})(\overline{d_{10}^{n}} + \overline{d_{11}^{n}} + \overline{co^{n}}) = 1 \quad n \in \{0, 1\} \end{split}$$

where $d_g^n$ indicates the output value of the g-th gate and the superscript n refers to the time $t_0$ or $t_1$ for n=0 or n=1 respectively.

The pseudo-Boolean objective function uses the following optimization variables, each indicating whether the corresponding node changes between $t_0$ and $t_1$:

$$d_g = d_g^0 \oplus d_g^1 \quad \forall g \in \{Internal \ gates\}.$$

getting:

$$W = d_1 + d_2 + d_3 + d_4 + 3 d_5 + d_6 + d_7 + d_8 + d_9 + d_{10} + d_{11} + out + co$$

where  $out = d_{12}$  and  $co = d_{13}$  and the only node that has a different fanout than the unit fanout is  $d_5$ .

Using the MiniZinc solver, an optimal solution is obtained by passing from the inputs $(a(t_0), b(t_0), c_i(t_0)) = (a^0, b^0, c_i^0) = (1, 1, 1)$ to $(a(t_1), b(t_1), c_i(t_1)) = (a^1, b^1, c_i^1) = (1, 0, 0)$, with W = 11.

If the outputs out and co are excluded from the objective function, the configuration remains the same, with W = 10.
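Because the full adder has only three inputs, the same worst case can be cross-checked by exhaustive search over the 64 input transitions, without a solver. The sketch below is not the MiniZinc model used in this work: it is a plain brute-force enumeration, and the gate-level netlist in `nodes` is an assumption (inverters $d_1$, $d_2$, $d_6$, $d_7$ and two-input NAND gates for the other nodes) chosen to be consistent with the constraints, with $out = d_{12}$, $co = d_{13}$ and with the weight 3 of $d_5$.

```python
from itertools import product


def nand(x, y):
    return 1 - (x & y)


def nodes(a, b, ci):
    """Values of d1..d11, out, co for one input vector (assumed netlist)."""
    d1, d2, d7 = 1 - a, 1 - b, 1 - ci      # inverters on the inputs
    d3 = nand(d1, b)
    d4 = nand(a, d2)
    d5 = nand(d3, d4)                      # a XOR b, drives d6, d9 and d10
    d6 = 1 - d5
    d8 = nand(d6, ci)
    d9 = nand(d5, d7)
    d10 = nand(d5, ci)
    d11 = nand(a, b)
    out = nand(d8, d9)                     # sum output
    co = nand(d10, d11)                    # carry output
    return [d1, d2, d3, d4, d5, d6, d7, d8, d9, d10, d11, out, co]


# Weights of the objective W: fanout 3 for d5, unit fanout elsewhere
WEIGHTS = [1, 1, 1, 1, 3, 1, 1, 1, 1, 1, 1, 1, 1]


def w_cost(v0, v1, with_outputs=True):
    """Weighted number of nodes that toggle when the inputs go from v0 to v1."""
    n0, n1 = nodes(*v0), nodes(*v1)
    last = 13 if with_outputs else 11
    return sum(WEIGHTS[g] * (n0[g] ^ n1[g]) for g in range(last))


vectors = list(product((0, 1), repeat=3))
w_max = max(w_cost(v0, v1) for v0 in vectors for v1 in vectors)
w_max_internal = max(w_cost(v0, v1, False) for v0 in vectors for v1 in vectors)
print(w_max, w_max_internal, w_cost((1, 1, 1), (1, 0, 0)))
```

Under these assumptions the enumeration gives a maximum of 11 with the outputs and 10 without them, both reached by the transition from (1, 1, 1) to (1, 0, 0). Exhaustive search is viable only because the circuit is tiny; for larger blocks a constraint solver remains the practical choice.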

Returning to the model of dynamic power, and considering that there are two outputs, for our full adder we deduce that:

$$\max(P_{dyn}) = \frac{1}{2} (10\alpha'_{tech} + 2\beta'_{tech} FOA) V^2 F_{CLK} TRA$$

where the coefficients $\alpha'_{tech}$ and $\beta'_{tech}$ are to be understood as the per-line multiplicative contributions to the line capacitances, with the fanout applied to the output lines.

# 3.3 Probabilistic model and input-output propagation

The idea behind probabilistic techniques is to directly propagate the input statistics to obtain the switching probability of each node in the circuit and the output. This approach is potentially very efficient, as only one pass through the circuit is needed [28]. However, it requires a new simulation engine with a set of rules for propagating the signal statistics.

Consider, for example, the output probability of a single AND gate. The output is at logic level 1 only when each of its inputs is 1. When the inputs are independent, the static probability at the output is simply the product of the static probabilities of the inputs: $PS_{AND} = PS_a \ PS_b$ (here and in the following, PS denotes the static probability). Table 3.1 lists the formulas for other basic logic gates.

| Gate                    | Static Probability         |
|-------------------------|----------------------------|
| $a \ AND \ b$           | $PS(a)\,PS(b)$             |
| $a \ OR \ b$            | $PS(a) + PS(b) - PS(ab)$   |
| $NOT(a) = \overline{a}$ | $1 - PS(a)$                |

**Table 3.1:** Static probability propagation (PS) in basic logic gates.
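The rules of Table 3.1 translate directly into code. The following minimal sketch assumes independent inputs, so that PS(ab) = PS(a) PS(b); the function names are illustrative.

```python
def ps_not(pa):
    """Static probability at the output of an inverter."""
    return 1.0 - pa


def ps_and(pa, pb):
    """AND gate with independent inputs."""
    return pa * pb


def ps_or(pa, pb):
    """OR gate with independent inputs: PS(ab) = PS(a) * PS(b)."""
    return pa + pb - pa * pb


def ps_nand(pa, pb):
    """NAND gate obtained as NOT(AND)."""
    return ps_not(ps_and(pa, pb))


print(ps_and(0.5, 0.5), ps_or(0.5, 0.5), ps_nand(0.5, 0.5))
```

These one-pass rules are what makes the probabilistic approach efficient, but they are exact only as long as the independence assumption holds.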

In addition to the previous formulas there are other aspects to analyze in signal probability propagation. First of all, a delay model must be introduced to account for the delays on the paths between inputs and outputs. In a general model each gate has its own specific delay.

Using the circuit in Figure 3.5 we assume that the two NAND gates 1 and 2 have response times  $\Delta_1$  and  $\Delta_2$ , respectively.



**Figure 3.5:** Example of a logic circuit with glitching and spatial correlation [28]

We observe that variations in the output z can be triggered by signals traveling along paths of different lengths. That is, z will have a transition probability relative both to the path with delay $\Delta_2$ and to the path with delay $\Delta_1 + \Delta_2$. The total switching activity of signal z will be the sum of these two probabilities. This study also highlights possible **glitches** within the network.

Another problem is **spatial correlation**. When two logic signals are analyzed together, they can be assumed to be independent only if they have no common input signal in their support. If there is one or more common inputs, these signals are spatially correlated.

Again with reference to Figure 3.5, the probability PS(z) cannot be computed directly from PS(w) and PS(y) by applying the static probability formula of the NAND gate for independent inputs $(PS(z) \neq 1 - PS(w)PS(y))$. Since PS(w) = 1 - PS(x)PS(y), we get PS(z) = 1 - (1 - PS(x)PS(y))PS(y), from which, knowing that PS(y)PS(y) = PS(y) because the two factors refer to the same signal, we obtain PS(z) = 1 - PS(y) + PS(x)PS(y).

For example, with PS(x) = PS(y) = 0.5 we have:

$$PS(z) = 1 - PS(y) + PS(x)PS(y) = 1 - 0.5 + 0.25 = 0.75$$

which differs from the direct calculation 1 - PS(w)PS(y) = 0.625 obtained by considering w and y independent, with PS(w) = 0.75.

| $\boldsymbol{x}$ | y | w | z |
|------------------|---|---|---|
| 0                | 0 | 1 | 1 |
| 0                | 1 | 1 | 0 |
| 1                | 0 | 1 | 1 |
| 1                | 1 | 0 | 1 |

**Table 3.2:** Truth Table for the circuit in Figure 3.5.

This is easily verified from the truth table in Table 3.2, where $w$ and $z$ are each 0 in exactly one of the four configurations, so that PS(w) = PS(z) = 0.75.
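The effect of spatial correlation can be reproduced by comparing the naive gate-by-gate propagation with an exact computation that enumerates the primary inputs. The sketch below assumes, as in the truth table, that both gates are two-input NANDs with w = NAND(x, y) and z = NAND(w, y); the helper `exact_ps` is an illustrative name.

```python
from itertools import product


def nand(a, b):
    return 1 - (a & b)


def z_of(x, y):
    """Output of the circuit: w = NAND(x, y), z = NAND(w, y)."""
    return nand(nand(x, y), y)


def exact_ps(func, probs):
    """Exact static probability of func, enumerating independent primary inputs."""
    total = 0.0
    for bits in product((0, 1), repeat=len(probs)):
        weight = 1.0
        for bit, p in zip(bits, probs):
            weight *= p if bit else 1.0 - p
        total += weight * func(*bits)
    return total


ps_x = ps_y = 0.5
ps_w = exact_ps(nand, [ps_x, ps_y])        # w depends on independent inputs
ps_z_naive = 1.0 - ps_w * ps_y             # treats w and y as independent
ps_z_exact = exact_ps(z_of, [ps_x, ps_y])  # accounts for the shared input y
print(ps_w, ps_z_naive, ps_z_exact)
```

The enumeration grows exponentially with the number of primary inputs in the support, which is why practical tools resort to approximations of the correlation instead.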

A third important issue is **time correlation**. In probabilistic methods the static probability alone is not sufficient to model a digital signal; as introduced earlier, the average switching activity or toggle rate (TR) is also needed. It represents precisely the probability that a signal will make a transition from 0 to 1 or vice versa.

In a synchronous system, the static probability and the toggle rate are related by the following:

$$\frac{TR}{2} \le PS \le 1 - \frac{TR}{2}$$

#### 3.3.1 Method for calculating the Toggle rate of an output

#### Partial derivative method

It remains to be analyzed how the toggle rate at the output of a block is calculated from the behavior of its inputs.

First of all, this depends strongly on the particular function implemented by the circuit. This explains why algorithms based only on generic components (such as LUTs or reprogrammable cells) are highly approximate, although they are often adopted to keep complexity manageable.

The relationship is based on the analysis of the partial derivatives of the Boolean function.

Given the Boolean function:

$$y = f(\underline{x}) = f(x_0, x_1, \dots, x_{n-1})$$

the partial derivative with respect to the variable $x_i$ is defined as:

$$\frac{\partial f(\underline{x})}{\partial x_i} = \frac{\partial y}{\partial x_i} \stackrel{\triangle}{=} f_{x_i} \oplus f_{\overline{x_i}}$$

where $f_{x_i}$ and $f_{\overline{x_i}}$ are Shannon's cofactors of f with respect to $x_i$.

The output switching activity is then calculated by summing the contributions of the individual inputs:

$$TR(y) = \sum_{i=0}^{n-1} P\left(\frac{\partial y}{\partial x_i}\right) TR(x_i)$$

where $P(\cdot)$ is evaluated with the following rules, which are also used to calculate the Static Probability SP(y) of the output of a logic function with independent inputs:

$$P(a) = SP(a)$$

$$P(\overline{a}) = 1 - P(a)$$

$$P(ab) = P(a)P(b)$$

$$P(a+b) = P(a) + P(b) - P(a)P(b)$$

where the operators on the left-hand side, i.e. in the argument of the functional P, are to be considered Boolean, while the others are the usual operations in $\mathbb{R}$.
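The partial derivative method can be sketched for small functions by computing each Boolean difference explicitly. The code below is an illustration under stated assumptions: inputs are independent, the probability of each Boolean difference is obtained by enumeration rather than symbolically, and the names `prob_of`, `boolean_difference` and `output_toggle_rate` are hypothetical.

```python
from itertools import product


def prob_of(func, sp):
    """P(func = 1) for independent inputs with static probabilities sp."""
    total = 0.0
    for bits in product((0, 1), repeat=len(sp)):
        weight = 1.0
        for bit, p in zip(bits, sp):
            weight *= p if bit else 1.0 - p
        total += weight * func(*bits)
    return total


def boolean_difference(f, i):
    """df/dx_i: XOR of the two Shannon cofactors of f with respect to x_i."""
    def diff(*bits):
        hi = bits[:i] + (1,) + bits[i + 1:]
        lo = bits[:i] + (0,) + bits[i + 1:]
        return f(*hi) ^ f(*lo)
    return diff


def output_toggle_rate(f, sp, tr):
    """TR(y) = sum over i of P(df/dx_i) * TR(x_i)."""
    return sum(prob_of(boolean_difference(f, i), sp) * tr[i]
               for i in range(len(sp)))


def and2(a, b):
    return a & b


def xor2(a, b):
    return a ^ b


sp = [0.5, 0.5]   # static probabilities of the two inputs
tr = [0.5, 0.5]   # toggle rates of the two inputs
print(output_toggle_rate(and2, sp, tr), output_toggle_rate(xor2, sp, tr))
```

For the AND gate the derivative with respect to one input is the other input, so each input contributes SP of the other times its own toggle rate; for the XOR gate both derivatives are identically 1 and the input toggle rates simply add.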

In the case of non-independent inputs, the more general formulas in Figure 3.6 are used:



Figure 3.6: Calculation of probability of Boolean functions

#### **Entropy-Based method**

An alternative method for estimating the switching activity of a line, also used at word level, is the one based on an entropy-based approach [35].

### 3.4 Power analysis method

In the DSP domain, the initial theoretical design is approached from a mathematical point of view by making strong use of difference equations. In fact, the conceptual blocks produced in the very early stages of design can all be formally characterized by mathematical processes. These mathematical processes, seen as mathematical systems, can be translated into graphs that, from the studies of Shannon first and Mason later, take the name of Signal-Flow Graphs (SFG). In practice, and especially in specific application domains, these graphs can have an "unambiguous" link to a physical digital implementation.

As reported by Robichaud in [36], however, a complex mathematical system can be translated into various graphs, depending on the choice of intermediate signals and on the interpretation of the precedence or parallelism of operations. For this reason, from the same system of equations one can obtain various solutions that are equivalent but differ in terms of realizability and weights. This opens the door first to the exploration of the solution space and then to the definition of metrics within it to evaluate the different effects.

For example, a + b + c can be written as (a + b) + c, as a + (b + c), or, using ad hoc circuits, as (a + b + c) computed as a whole.

In modern digital signal processing systems, multirate techniques bring a strong improvement in performance and feasibility. Often, however, these techniques are not represented in the classical SFG formalization, and therefore an augmented SFG formalism is introduced for our blocks.

This formalism needs to represent not only the different frequency domains of the system, but also the information needed for the various estimates, including power. For this reason the graphs also carry annotations on the number of bits of the signals.

## 3.5 Signal-Flow Graph

Signal-Flow Graphs (SFG), invented by Claude Shannon and later, independently, by Samuel Jefferson Mason, after whom they are also called Mason graphs, are used to graphically represent algebraic equations.

They consist of nodes (also called signals) and annotated arcs that introduce a multiplicative factor. A node with multiple incoming arcs represents the sum of its input signals.



**Figure 3.7:** Signal Flow Graph transformation.

An example of these graphs, very common in the literature, is visible in Figure 3.7a.

The equivalences in Figure 3.7b are used to perform the transformation from SFG to Augmented SFG (SFG+).

The following notation is used to annotate the signal information:

$$x: n@clk!\,TR$$

where signal x is n bits wide and works in the clock domain "clk" with a toggle rate equal to TR.

#### **Example of power estimation using SFG**





Function block nodes:

- $z^{-1}$: Register
- $+$: Summator
- $a_0$: Multiplier
- $L \uparrow$: Up-converter
- $N \downarrow$: Down-converter
- $F(.)$: Generic function

Signal node: $x: n@clk!\,TR$ (signal x is n bits wide and works in the clock domain "clk" with a toggle rate TR).

Having 2 arcs in output, the signal fanout is doubled.

Signal nodes of the example, with $|a_0| = |a_1| = |a|$:

- $x: n@clk!\,TR_x$
- $d^1: n@clk!\,TR_{d1}$
- $d^2: n@clk!\,TR_{d2}$
- $d^3: n + |a| - 1@clk!\,TR_{d3}$
- $d^4: n + |a| - 1@clk!\,TR_{d4}$
- $y: n + |a|@clk!\,TR_y$

Hardware complexity:

- 2 registers n bit = 2n Flip Flop D
- 2 multipliers n x |a| (output n + |a| - 1)
- 1 adder n bit (output n+1)

Power estimation for signal. The signal connections are:

| Signal                                          | Connections                 | Fanout of   |
|-------------------------------------------------|-----------------------------|-------------|
| $\lvert x \rvert$                               | $n$                         | Input       |
| $\lvert d^{1} \rvert$                           | $2n$                        | Reg 1       |
| $\lvert d^{2} \rvert$                           | $n$                         | Reg 2       |
| $\lvert d^{3} \rvert = \lvert d^{4} \rvert$     | $n + \lvert a \rvert - 1$   | Multipliers |
| $\lvert y \rvert$                               | $n + \lvert a \rvert$       | Adder       |

Hypothesis:

- $\alpha_{clk}$ (power of the clock tree)
- $\alpha_{FF}$, $\alpha_{MULT}$, $\alpha_{FA}$
- $\beta_{FF} = \beta_{MULT} = \beta_{FA} = \beta$

Conversion:

- 1 register n bit = n FF
- 1 multiplier = 1 MULT
- 1 adder n bit = n FA

Power estimation: $P_{dyn} = P_{clk} + P_{data}$, with

$$\begin{split} P_{clk} &= CV^2 \sum_{f \in FF} \alpha_{clk} \ F_f \\ P_{data} &= \frac{1}{2} \left( \sum_{b \in blocks} \alpha_b TR_b \, F_b + \sum_{s \in signals} \beta \, FO_s \, TR_s \, F_s \right) \end{split}$$

where $blocks = \{all\ elementary\ blocks\ FF, MULT, FA\}$ and $signals = \{all\ signals\}$.
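To make the bookkeeping of the example concrete, the following sketch evaluates the two sums for the structure described above. It is only an illustration under several assumptions: the two functions transcribe the expressions of $P_{clk}$ and $P_{data}$ literally, $FO_s$ is taken as the number of connections listed for each signal, a single clock domain and a single toggle rate are used for all blocks and signals, the adder is counted as n FA following the conversion rule, and every numeric coefficient is a hypothetical placeholder.

```python
def p_clk(c, v, alpha_clk, ff_clocks):
    """P_clk = C * V^2 * sum over flip-flops of alpha_clk * F_f."""
    return c * v ** 2 * sum(alpha_clk * f for f in ff_clocks)


def p_data(blocks, signals, beta):
    """P_data = 1/2 * (sum alpha_b*TR_b*F_b + sum beta*FO_s*TR_s*F_s).

    blocks:  iterable of (alpha_b, TR_b, F_b), one entry per elementary block
    signals: iterable of (FO_s, TR_s, F_s), one entry per signal
    """
    block_term = sum(a * tr * f for a, tr, f in blocks)
    signal_term = sum(beta * fo * tr * f for fo, tr, f in signals)
    return 0.5 * (block_term + signal_term)


n, a_bits = 8, 8                 # word lengths n and |a| (assumed)
f, tr = 100e6, 0.5               # single clock and toggle rate (assumed)
alpha_ff, alpha_mult, alpha_fa = 1e-15, 40e-15, 3e-15   # hypothetical
alpha_clk, beta = 0.5, 1e-15                            # hypothetical

# Conversion: 2 registers = 2n FF, 2 multipliers = 2 MULT, 1 adder = n FA
blocks = ([(alpha_ff, tr, f)] * (2 * n)
          + [(alpha_mult, tr, f)] * 2
          + [(alpha_fa, tr, f)] * n)

# Connections per signal, as listed for the example
connections = {"x": n, "d1": 2 * n, "d2": n,
               "d3": n + a_bits - 1, "d4": n + a_bits - 1, "y": n + a_bits}
signals = [(fo, tr, f) for fo in connections.values()]

total = p_clk(1e-15, 1.0, alpha_clk, [f] * (2 * n)) + p_data(blocks, signals, beta)
print(sum(connections.values()), total)
```

With per-domain frequencies and per-signal toggle rates in place of the single values assumed here, the same two functions apply to multirate graphs, where each signal carries its own clock annotation.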



#### 3.6 Conclusion

Augmented SFG (SFG+) makes it possible to estimate power and area in a way equivalent to the Register Transfer Level (RTL) estimation method. The proposed methodology transforms the classical SFG representation into a more functionally explicit form. This new representation allows moving from the arithmetic description to the architectural description without going through the RTL representation. In this way, power estimation is brought forward to the conceptual design phase.

## **Chapter 4**

## **DTP Architecture: A Case Study**

#### 4.1 Introduction

In this chapter, the Digital Transparent Processor (DTP) studied by our research group in [1,37] for satellite applications is introduced and used as a case study to extract and test the power model and to derive estimates. Starting from the reference system scenario, the digital processing blocks are designed, from their functional description and RTL scheme up to the VHDL implementation.

#### 4.2 DTP Requirements

According to the reference system scenario described in [38], which was motivated by a European Space Agency (ESA) project, the coverage of the European geographic area is implemented by partitioning the reference area into 79 beams. In particular, the Ka-band is considered, with a 500 MHz uplink bandwidth in the range 29.5-30.0 GHz and a 500 MHz downlink bandwidth in the range 19.7-20.2 GHz. A multi-beam coverage with a four-frequency reuse pattern is assumed. Therefore each beam is assigned a 125 MHz bandwidth (with single polarization) or a 250 MHz bandwidth (with double polarization). The coverage pattern is generated on board by adopting a beamforming scheme that involves an analog beamforming network and an Array-Fed Reflector (AFR) with a seven Feeds Per Beam scheme [38]. Figure 4.1 shows the architecture of the transparent transponder, which includes an analog front-end for beamforming (Beam Forming Network), the next part of the RF chain (Low Noise Amplifier, Mixer, Automatic Gain Control, etc.), and the digital core that is representative of the DTP. For each beam a DTP chain is considered. The digital section is composed of a Digital RX Chain, a Switch and a Digital TX Chain. The Switch performs the intra-beam and inter-beam routing. The Digital RX Chain performs the decomposition of the uplink signal into *J* independent signals. The Digital TX Chain performs the complementary operation. In the present manuscript the on-board switch section is considered a transparent element from the signal processing perspective, and therefore it was not subject to analysis.

Figure 4.1: Block diagram of the transponder architecture.

DTP synthesis and analysis channelizers [39] are designed as non-critically sampled and endowed with a sum-to-one feature, which is described in Figure 4.2. The processed bandwidth is split into several Elementary Switched Bandwidths (ESBWs) that may be independently managed in order to configure the best frequency planning. In fact, the ESBW is the smallest unit that can be allocated, alone or in combination with adjacent ESBWs, to host on-board processing of communication carriers. Adjacent ESBWs are filtered with a sum-to-one transfer function so that, if they are contiguously switched from the uplink input to the downlink output, no linear distortion occurs at their intermediate boundaries when considering a user channel whose bandwidth spans multiple ESBWs. Figure 4.2 further illustrates a possible trade-off between lost bandwidth resources and processor hardware complexity: if smaller ESBWs are used, the transition bandwidth, which is unavoidably lost at user channel edges, becomes smaller and user channels can be placed closer together with minimum guard bands. However, this requires a larger number of channels J and increased hardware complexity.

**Figure 4.2:** Reduction of edge transition-bandwidth waste with increase of J.

The DTP model is composed of an ADC, an "IF to Analytic" block, an analysis channelizer (working on J channels), the switch, and their dual blocks. Figure 4.3 shows the DTP processing chain for a simplified scenario, where the routing from different beams (and ADCs) is not considered. The processes within the chain are labeled from $P_0$ (i.e. the ADC) to $P_5$ (the Digital to Analog Converter, DAC). When considering the internal structure of each stage, three basic building blocks have to be considered to implement the separation of the input channels and the recomposition of the output pass-band signal: 1) FIR (Finite Impulse Response) filter elements, implemented in direct form, 2) FFT butterfly structures and 3) Dual Port RAM Buffers (DPRB). The architecture also includes some Saturation and Rounding Blocks (SRB) that are necessary to maintain a limited-size fixed-point arithmetic; in some cases these blocks can also shift the binary words in order to scale the signal amplitude.

Figure 4.3: DTP processing chain.

#### 4.3 Analog-to-Digital Converter

A signal with maximum spectral extension equal to $2f_0$ and spectrum centered on the Intermediate Frequency $f_0$ (I.F.) is assumed as the input of the ADC. This signal, provided by the analog RF section of the on-board receiver, is sampled at the frequency $F_s=4f_0$, in accordance with the Nyquist theorem, and then quantized and encoded in two's complement with $n_s$ bits.

#### 4.4 IF to Analytic

The "IF to Analytic" block ($P_1$) then extracts the analytic signal from the IF input digital signal. Figure 4.4 shows the complete and detailed Register Transfer Level (RTL) architecture of the block, in which the equivalent sources of round-off noise are also indicated [30], [40]. In particular, the architecture includes some Saturation and Rounding Blocks (SRB) that are necessary when operating with limited-size fixed-point arithmetic.



Figure 4.4: RTL diagram of the "IF to Analytic" process.

A first level of SRBs is placed after the multiplication level, in order to reduce the data-path width, and a second one after the accumulation process. This simplified architecture, in which the complex output signal is obtained through a single processing branch (the other branch is a simple FIFO register), can be obtained by assuming a polyphase implementation of the decimation process and a half-band prototype FIR filter [41]. The processing branch is composed of the status register, the multiplication level and the accumulation block (which may be implemented by a pipelined binary tree adder). In order to break the combinatorial path, the output of the block is buffered and a level of buffering is interposed between the multiplication level and the accumulation process. One or more levels of buffers may also be allocated within the pipelined binary tree adder, depending on the number of words to be added and on the processing time constraints.
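The structure of a binary tree adder such as the one mentioned above can be illustrated with a short Python sketch. It lists, level by level, how many two-input adders are needed and how wide their outputs become, and it sums a list of words in the same pairwise order. The function names and the policy of adding one bit per level are assumptions made for illustration; the sketch does not model where the pipeline buffers are placed.

```python
def tree_adder_plan(n_words, word_bits):
    """Return [(adders, output_bits), ...] for each level of a binary tree adder.

    At every level, words are added in pairs; an unpaired word is passed
    to the next level unchanged. Each level is given one extra bit so that
    the sum cannot overflow (an assumed word-growth policy).
    """
    plan = []
    remaining, bits = n_words, word_bits
    while remaining > 1:
        adders = remaining // 2
        bits += 1
        plan.append((adders, bits))
        remaining = adders + remaining % 2
    return plan


def tree_sum(words):
    """Sum a list of integers level by level, mirroring the plan above."""
    level = list(words)
    while len(level) > 1:
        nxt = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])    # unpaired word goes to the next level
        level = nxt
    return level[0] if level else 0


if __name__ == "__main__":
    print(tree_adder_plan(5, 16))
    print(tree_sum([1, 2, 3, 4, 5]))
```

Since every level is a natural place for a pipeline register, the number of levels returned by `tree_adder_plan` is also an upper bound on the buffer levels that can be inserted inside the adder.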

#### 4.5 Analysis Channelizer

Within the analysis channelizer ($P_2$), the analytic signal is decomposed into J independent complex signals whose spectrum is shifted towards a low frequency range. This block is composed of a polyphase network, implemented through a "Variable Coefficient Polyphase Filter" (VCPF) [42], an in-place FFT block and two memory elements (Dual Port RAM Block, DPRB). The first DPRB is required to reorganize the output of the VCPF and to insert the delays needed for the management of the "extended set of polyphase components" (see [40] and [42]), while the second DPRB is used to reorganize the output of the FFT block and to restore the "natural order" in the channelizer output sequence. Figure 4.5 shows the RTL diagram of the equivalent polyphase network, in which the input and the output signals are denoted as complex signals.



Figure 4.5: RTL view of the VCPF used in the analysis channelizer.

The FFT block can in turn be implemented through a non-parallel architecture, derived from the Decimation-In-Frequency (DIF) radix-2 algorithm [43]. Given a number of channels equal to $J = 2^k$, the FFT block is implemented through $\log_2 J = k$ cascaded stages. A single stage is composed of a FIFO memory block, a butterfly unit and a complex multiplier (depicted in Figure 4.6). Figure 4.7 shows the block diagram of the overall analysis channelizer in the equivalent non-parallel architecture.



Figure 4.6: RTL diagram of the butterfly stage.



Figure 4.7: Block diagram of the analysis channelizer in the non-parallel architecture.

Wherever memory elements are required to store the coefficients, e.g. in VCPF and FFT blocks, ROM blocks with a buffered output are adopted. Figure 4.8 shows the actual structure considered for coefficients' storage.



Figure 4.8: Buffered ROM.

#### 4.6 Synthesis Channelizer

The synthesis channelizer ($P_3$) combines the J input signals in order to obtain the output analytic signal. This block consists of the VCPF, the in-place radix-2 IFFT block, the DPRB and the elements for the composition of the flows coming from the decimation branches. Figure 4.9 shows the block diagram of the overall synthesis channelizer. The DPRB is used to reorganize the output of the IFFT block and to restore the "natural order" in the VCPF input sequence.



Figure 4.9: Equivalent implementation of the synthesis channelizer.

The implementation of the IFFT coincides with that of the FFT block used in the analysis channelizer: the IFFT block is therefore composed of $\log_2 J = k$ stages of butterfly elements and complex multipliers, in the same architecture shown in Figure 4.6. The only difference is the sign of the twiddle-factor exponents.
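The relationship between the two transforms can be illustrated with a small algorithmic sketch in Python. It implements a radix-2 Decimation-In-Frequency transform with $\log_2 J$ butterfly stages, where a single flag selects the sign of the twiddle-factor exponent, and a separate reordering step that plays the role of the memory restoring the "natural order". The function names are hypothetical, and the code models only the arithmetic, not the FIFO-based non-parallel hardware architecture or its fixed-point word lengths.

```python
import cmath


def fft_dif(x, inverse=False):
    """Radix-2 DIF transform; the result is in bit-reversed order.

    The only difference between forward and inverse transform is the sign
    of the twiddle-factor exponent. No 1/J scaling is applied.
    """
    a = list(x)
    n = len(a)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    sign = 1.0 if inverse else -1.0
    span = n // 2
    while span >= 1:                       # one pass per stage: log2(n) stages
        for start in range(0, n, 2 * span):
            for k in range(span):
                w = cmath.exp(sign * 2j * cmath.pi * k / (2 * span))
                u, v = a[start + k], a[start + k + span]
                a[start + k] = u + v               # butterfly sum
                a[start + k + span] = (u - v) * w  # difference times twiddle
        span //= 2
    return a


def bit_reverse_order(a):
    """Reorder a bit-reversed sequence into natural order."""
    n = len(a)
    bits = n.bit_length() - 1
    out = [0] * n
    for i in range(n):
        r = int(format(i, f"0{bits}b")[::-1], 2) if bits else 0
        out[r] = a[i]
    return out


if __name__ == "__main__":
    x = [1, 2, 3, 4, 0, -1, -2, -3]
    X = bit_reverse_order(fft_dif(x))
    y = [v / len(x) for v in bit_reverse_order(fft_dif(X, inverse=True))]
    print([round(v.real, 6) for v in y])
```

In this sketch the inverse transform reuses exactly the same butterfly loop as the forward one, which is the software counterpart of sharing the stage architecture between the FFT and IFFT blocks.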

The VCPF likewise coincides with the VCPF used in the analysis channelizer, with only one modification in the status register: 2J complex samples have to be stored through a FIFO for each tap, while only J samples are involved in the analysis block. Figure 4.10 depicts the RTL diagram of the VCPF.

#### 4.7 Analytic to IF

The "Analytic to IF" block processes the output signal of the synthesis channelizer in order to obtain the discrete IF signal (see Figure 4.11 for the simplified RTL architecture of the block). For this purpose, the oversampling condition is restored on the channelizer output through an interpolator with a conversion factor of 2. The analytic signal is thus obtained, and the discrete IF signal is then extracted by taking its real part. The processing branch is composed of the status register, the multiplication level and the accumulation block. The output of the block is buffered and a level of buffering is interposed between the multiplication level and the accumulation process. One or more levels of buffers may be allocated within the pipelined binary tree adder that implements the accumulation process.

Figure 4.10: RTL view of the VCPF used in the synthesis channelizer.

Figure 4.11: RTL diagram of the "Analytic to IF" process.

#### 4.8 Noise Model

In this section we report the analytical solution of the noise model for the DTP previously introduced in [37]. Unlike the processed bandwidth, which directly imposes the sampling rate of the digital section, the modulation and coding schemes impose a requirement on the signal-to-noise ratio that the link budget must satisfy.

For this reason, having a noise model allows for the exploration of design alternatives. In addition, in order to overcome some initial assumptions, the extended model used in [37] also makes it possible to handle different values of power spectral density across different ESBWs and more general constraints on the pass-band and stop-band in filter design.

From the DTP hardware architecture described above, the following sources of degradation are identified:

- Quantization errors caused by the analog-to-digital conversion (ADC);
- Non-ideality in filter implementation, i.e., the adoption of finite-length impulse responses due to the window and the finite number of bits for the representation of their coefficients; these effects can also be expressed in terms of linear distortion with respect to a reference (ideal) behavior;
- Use of fixed-point arithmetic in registers that operate within the entire processing chain.



Figure 4.12: Signal and noise levels in the on-board receiving chain.

The hybrid receive chain shown in Figure 4.12 includes the "Analog Amplifier" (AA) block, which represents the entire RF analog receive front-end and typically contains:

- LNA (Low Noise Amplifier);
- mixer;
- AGC (Automatic Gain Control);
- AAF (Anti-Aliasing IF Filter).

According to the AWGN channel model, the first noise contribution, taken at the input of the AA block, is characterized by a power spectral density $N_{01} = k t_a$, where $t_a$ is the antenna noise temperature and $k$ is the Boltzmann constant. The AA block that follows is characterized by the equivalent noise temperature $t_{AA}$, the noise figure $F_{AA}$, and the overall power gain $G_{AA}$.

In the DTP block, the degradation sources can be characterized as the sum of an AWGN component and a term with flat power spectral density:

$$N_{DTP} = N_{0R} + \delta^2 E_{S2}$$

where the term $N_{0R}$ denotes the power spectral density related to the overall rounding and quantization noise along the whole DTP chain, and the second term accounts for the linear distortion noise in the implementation of the FIR filters. The latter is proportional to the useful signal power spectral density through $\delta^2$, where the coefficient $\delta$ accounts for the ripple in the FIR filters.

In each DTP step, as shown graphically in Figure 4.13, two contributions, $N_R$ and $N_L$, are added to the total noise. The former is related to rounding and quantization, the latter to linear distortions in the FIR filters.



Figure 4.13: Spectral representation of signals along the DTP chain.

For each block of the DTP, the noise it introduces and the cumulative noise up to its output are given in Table 4.1, where a uniform parameter (UP) configuration is considered, and where:

- $P_0$: ADC block;
- $P_1$: IF to Analytic filter block;
- $P_2$: Variable Coefficient Polyphase Filter (VCPF) and FFT blocks;
- $P_3$: IFFT and Variable Coefficient Polyphase Filter (VCPF) blocks;
- $P_4$: Analytic to IF filter block.

| Block | Noise contribution $N_R$ | Noise contribution $N_L$ | Cumulative noise $N_R^T$ | Cumulative noise $N_L^T$ |
|-------|--------------------------|--------------------------|--------------------------|--------------------------|
| $P_0$ | $N_{R0} = \frac{q_0^2}{12} \frac{1}{4f_0}$ | $N_{L0} = 0$ | $N_{R0}^T = N_{R0}$ | $N_{L0}^T = N_{L0} = 0$ |
| $P_1$ | $N_{R1} = \left(\frac{N_1}{2} \frac{q_m^2}{12} + \frac{q_s^2}{12}\right) \frac{1}{2f_0}$ | $N_{L1} = \delta^2 (A_1^2 S_i)\, 4f_0 \frac{1}{2f_0}$ | $N_{R1}^T = (A_1^2 N_{R0}^T) + N_{R1}$ | $N_{L1}^T = (A_1^2 N_{L0}^T) + N_{L1} = N_{L1}$ |
| $P_2$ | $N_{R2} = J\left(\frac{2N_2}{J}\frac{q_m^2}{3}\frac{1}{A_2^2} + \frac{q_s^2}{6} + \frac{q_{fft}^2}{3}\right)\frac{J}{4f_0}$ | $N_{L2} = \delta^2 (A_1^2 S_i)\, 2f_0 \frac{J}{4f_0}$ | $N_{R2}^T = N_{R1}^T + N_{R2}$ | $N_{L2}^T = N_{L1}^T + N_{L2}$ |
| $P_3$ | $N_{R3} = \left(J\frac{q_{fft}^2}{3} + \frac{N_3}{J}\frac{q_m^2}{3} + \frac{q_s^2}{6}\right)\frac{1}{2f_0}$ | $N_{L3} = \frac{1}{2} J \delta^2 (A_1^2 S_i)\, 2f_0 \frac{1}{2f_0}$ | $N_{R3}^T = N_{R2}^T + N_{R3}$ | $N_{L3}^T = N_{L2}^T + N_{L3}$ |
| $P_4$ | $N_{R4} = 2\left(\frac{N_4}{2}\frac{q_m^2}{12} + \frac{q_s^2}{12}\right)\frac{1}{4f_0}$ | $N_{L4} = 2\delta^2 (A_1^2 S_i)\, 4f_0 \frac{1}{4f_0}$ | $N_{R4}^T = \frac{1}{A_4^2}(N_{R3}^T + N_{R4})$ | $N_{L4}^T = \frac{1}{A_4^2}(N_{L3}^T + N_{L4})$ |

Table 4.1: Summary of Noise Computation in the DTP Chain for the UP approach.
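To make the use of Table 4.1 concrete, the following Python sketch evaluates its expressions row by row and accumulates them along the chain. It is only a transcription of the table under stated assumptions: the class and function names are hypothetical, the symbols ($q_0$, $q_m$, $q_s$, $q_{fft}$, $N_1 \ldots N_4$, $A_1$, $A_2$, $A_4$, $\delta$, $S_i$) are used exactly as they appear in the table, and the numerical values in the example are arbitrary placeholders rather than design values.

```python
from dataclasses import dataclass


@dataclass
class DtpNoiseParams:
    f0: float      # intermediate frequency f_0
    J: int         # number of channels
    q0: float      # q_0, q_m, q_s, q_fft as used in Table 4.1
    qm: float
    qs: float
    qfft: float
    N1: int        # N_1 ... N_4 as used in Table 4.1
    N2: int
    N3: int
    N4: int
    A1: float      # gains A_1, A_2, A_4
    A2: float
    A4: float
    delta: float   # FIR ripple coefficient
    Si: float      # S_i


def dtp_noise(p):
    """Evaluate Table 4.1; returns the lists (N_R, N_L, N_R^T, N_L^T) for P0..P4."""
    f0, J = p.f0, p.J
    dist = p.delta ** 2 * (p.A1 ** 2 * p.Si)   # common factor delta^2 (A_1^2 S_i)

    n_r = [
        (p.q0 ** 2 / 12) / (4 * f0),
        (p.N1 / 2 * p.qm ** 2 / 12 + p.qs ** 2 / 12) / (2 * f0),
        J * (2 * p.N2 / J * p.qm ** 2 / 3 / p.A2 ** 2
             + p.qs ** 2 / 6 + p.qfft ** 2 / 3) * J / (4 * f0),
        (J * p.qfft ** 2 / 3 + p.N3 / J * p.qm ** 2 / 3 + p.qs ** 2 / 6) / (2 * f0),
        2 * (p.N4 / 2 * p.qm ** 2 / 12 + p.qs ** 2 / 12) / (4 * f0),
    ]
    n_l = [
        0.0,
        dist * 4 * f0 / (2 * f0),
        dist * 2 * f0 * J / (4 * f0),
        0.5 * J * dist * 2 * f0 / (2 * f0),
        2 * dist * 4 * f0 / (4 * f0),
    ]

    # Cumulative terms: P1 scales the incoming noise by A_1^2, P2 and P3
    # simply add their own contribution, P4 divides the total by A_4^2.
    n_rt, n_lt = [n_r[0]], [n_l[0]]
    n_rt.append(p.A1 ** 2 * n_rt[0] + n_r[1])
    n_lt.append(p.A1 ** 2 * n_lt[0] + n_l[1])
    for i in (2, 3):
        n_rt.append(n_rt[i - 1] + n_r[i])
        n_lt.append(n_lt[i - 1] + n_l[i])
    n_rt.append((n_rt[3] + n_r[4]) / p.A4 ** 2)
    n_lt.append((n_lt[3] + n_l[4]) / p.A4 ** 2)
    return n_r, n_l, n_rt, n_lt


if __name__ == "__main__":
    params = DtpNoiseParams(f0=1.0, J=4, q0=1.0, qm=1.0, qs=1.0, qfft=1.0,
                            N1=12, N2=12, N3=12, N4=12,
                            A1=1.0, A2=1.0, A4=2.0, delta=0.1, Si=1.0)
    for name, values in zip(("N_R", "N_L", "N_R^T", "N_L^T"), dtp_noise(params)):
        print(name, values)
```

Keeping the rounding and distortion terms in separate lists makes it easy to see which of the two dominates at each stage when word lengths or filter ripple are varied.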

#### 4.9 Hardware Complexity Model

In [1] and [P1], a hardware complexity model for the DTP was introduced that is suitable for estimating and comparing alternative solutions, but it stops at a level of detail that is not adequate for power estimation. For each block, that model computes:

- number of Flip-Flops (FF);
- number of adders (expressed in full-adder equivalents, FA);
- number of multipliers (MULT or DSP).

As shown later, some of these elementary blocks change, and their fan-out also needs to be traced.

#### 4.9.1 Sub-blocks case analyzed

This section reports the tests performed on a Xilinx FPGA with the following Vivado power analysis settings: Static Probability = 0.5 and Toggle Rate = 50.0. The power analysis is performed separately for each DTP sub-block.

For each of these sub-blocks, several cases have been analyzed with reference to different configurations. The configuration parameters include the number of bits used to represent the input and output data of the block ($n_{si}$ and $n_{so}$, respectively) and the number of bits ($n_h$) used to represent the filter coefficients and the FFT twiddle factors. Furthermore, the number of additional bits ($n_m$) used to represent the internal multiplication results is also selectable (i.e., multiplication results are represented by $n_{si} + n_m$ bits). Finally, the number of filter taps (taps) and the number of channels (Channels) extracted by the channelizer complete the set of configurable parameters.

The power estimation has been applied to the three main sub-blocks of the DTP chain: i) the "IF to Analytic" (IF2A) block, ii) the "Variable Coefficients Polyphase Filter" (VCPF), and iii) the "FFT" block. Figure 4.15 compares the results of our high-level model with those of the Xilinx power analyzer for the three sub-blocks, with the parameter settings provided in Table 4.2.

#### 4.9.2 IF to Analytic block in detail

For the IF2A block, the hardware complexity is recalculated by going first through Table 4.3 and then through Table 4.4, up to the computation of the signal lines.

#### 4.10 FPGA implementation: Simulation Results

The hardware complexity, extended with the signal-line counts, confirms the power estimate of our model. The comparison with the results obtained from the Vivado synthesis tool is shown in Figure 4.15. From this comparison, the model appears suitable for early estimation in the conceptual design phase.

| Case | Block | $n_{si}$ | $n_{so}$ | $n_h$ | $n_m$ | taps | Channels |
|------|-------|----------|----------|-------|-------|------|----------|
| 1    | IF2A  | 13       | 11       | 15    | 3     | 100  | -        |
| 2    | IF2A  | 13       | 11       | 15    | 3     | 150  | -        |
| 3    | IF2A  | 13       | 11       | 15    | 3     | 200  | -        |
| 4    | IF2A  | 17       | 16       | 15    | 3     | 100  | -        |
| 5    | IF2A  | 17       | 16       | 15    | 3     | 150  | -        |
| 6    | IF2A  | 17       | 16       | 15    | 3     | 200  | -        |
| 7    | IF2A  | 17       | 16       | 15    | 9     | 100  | -        |
| 8    | IF2A  | 17       | 16       | 15    | 9     | 150  | -        |
| 9    | IF2A  | 17       | 16       | 15    | 9     | 200  | -        |
| 10   | VCPF  | 13       | 13       | 15    | 3     | 23   | 8        |
| 11   | VCPF  | 14       | 14       | 15    | 3     | 25   | 16       |
| 12   | VCPF  | 15       | 15       | 15    | 3     | 25   | 32       |
| 13   | VCPF  | 16       | 16       | 15    | 3     | 25   | 64       |
| 14   | VCPF  | 17       | 17       | 15    | 3     | 27   | 128      |
| 15   | FFT   | 20       | 20       | 17    | -     | -    | 8        |
| 16   | FFT   | 20       | 20       | 17    | -     | -    | 16       |
| 17   | FFT   | 20       | 20       | 17    | -     | -    | 32       |
| 18   | FFT   | 20       | 20       | 17    | -     | -    | 64       |
| 19   | FFT   | 20       | 20       | 17    | -     | -    | 128      |
| 20   | FFT   | 20       | 20       | 17    | -     | -    | 256      |
| 21   | FFT   | 20       | 20       | 17    | -     | -    | 512      |
| 22   | FFT   | 20       | 20       | 17    | -     | -    | 1024     |
| 23   | FFT   | 10       | 10       | 15    | -     | -    | 8        |
| 24   | FFT   | 10       | 10       | 15    | -     | -    | 16       |
| 25   | FFT   | 10       | 10       | 15    | -     | -    | 32       |
| 26   | FFT   | 10       | 10       | 15    | -     | -    | 64       |
| 27   | FFT   | 10       | 10       | 15    | -     | -    | 128      |
| 28   | FFT   | 10       | 10       | 15    | -     | -    | 256      |
| 29   | FFT   | 10       | 10       | 15    | -     | -    | 512      |
| 30   | FFT   | 10       | 10       | 15    | -     | -    | 1024     |

Table 4.2: Use cases

**Registers**

| Row | Iterations | No. of registers | No. of bits |
|-----|------------|------------------|-------------|
| 1   | -          | $\frac{3N}{2} + n_{BL} + 1$ | $n_{si}$ |
| 2   | for $i$ from 1 to $n_{BL}$ | $\lceil N / 2^{n_{UBL}\, i} \rceil$ | $n_{si} + n_m + n_{UBL}\, i$ |

**Multipliers**

| Row | No. of multipliers | No. of bits |
|-----|--------------------|-------------|
| 3   | $N$                | $n_{si} \times n_h$ |

**Adders**

| Row | Iterations | No. of adders | No. of bits |
|-----|------------|---------------|-------------|
| 4   | for $i$ from 1 to $n_{ACC}$ | $\lceil \lceil N / 2^{i-1} \rceil / 2 \rceil$ | $n_{si} + n_m + i$ |

**Saturation and Rounding Blocks**

| Row | No. of SRBs | $b$ | $n_i$ | $n_H$ | $n_L$ |
|-----|-------------|-----|-------|-------|-------|
| 5   | $N$         | 1   | $n_{si} + n_h$ | 1 | $n_h - n_m - 1$ |
| 6   | 1           | 1   | $n_{si} + n_m + n_{ACC}$ | 0 | $n_{si} + n_m + n_{ACC} - n_{so}$ |
| 7   | 1           | 1   | $n_{si}$ | 0 | $n_{si} - n_{so}$ |

**Clock lines**

| Row | No. of clock lines |
|-----|--------------------|
| 8   | $2 + \frac{5N}{2} + \sum_{i=1}^{n_{BL}} \left( \left\lceil \frac{N}{2^{n_{UBL}\, i}} \right\rceil - 1 \right)$ |

**Signal lines**

| Row | No. of signal lines |
|-----|---------------------|
| 9   | $2 - n_{si} + \frac{(5 + 2n_{si})N}{2} + \sum_{i=1}^{n_{BL}} \left( \left\lceil \frac{N}{2^{n_{UBL}\, i}} \right\rceil - 1 \right)$ |

**Table 4.3:** Hardware Complexity for the IF2A Block



Figure 4.14: RTL diagram of the "IF to Analytic" process.

| Block                   | LUT | FF  | DSP | Signal | Clock lines |
|-------------------------|-----|-----|-----|--------|-------------|
| Reg(n)                  | $n$ | $n$ | -   | $3(n-1) + 2n^*$ | $n-1$ |
| Mult(n)                 | -   | -   | 1   | 48     | -           |
| Adder(n)                | $n$ | -   | -   | $2n$   | -           |
| $SRB(n_i, n_H, n_L, b)$ | $\left\lceil \frac{n_L - 2}{2} \right\rceil + \left\lceil \frac{n_i - n_L}{2} \right\rceil + \left\lceil \frac{n_H - 1}{2} \right\rceil + 1$ | - | - | = LUT | - |

**Table 4.4:** Contribution of internal blocks to the computation of primitives and signals.
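The per-block contributions of Table 4.4 lend themselves to a direct transcription. The sketch below, in Python, encodes each row as a function returning the primitive and signal counts, plus a helper that sums a bill of elementary blocks. The function names and the dictionary layout are hypothetical; the asterisk attached to the $2n$ term of the register row is ignored, and the parameter $b$ of the SRB is accepted but unused because it does not appear in the tabulated expression.

```python
import math
from collections import Counter


def reg(n):
    """Register of n bits (Table 4.4, row Reg)."""
    return {"lut": n, "ff": n, "dsp": 0, "signal": 3 * (n - 1) + 2 * n, "clock": n - 1}


def mult(n):
    """Multiplier mapped on one DSP primitive (row Mult); n is not used by the table."""
    return {"lut": 0, "ff": 0, "dsp": 1, "signal": 48, "clock": 0}


def adder(n):
    """Adder of n bits (row Adder)."""
    return {"lut": n, "ff": 0, "dsp": 0, "signal": 2 * n, "clock": 0}


def srb(n_i, n_h, n_l, b=1):
    """Saturation and Rounding Block (row SRB); signal count equals the LUT count."""
    lut = (math.ceil((n_l - 2) / 2) + math.ceil((n_i - n_l) / 2)
           + math.ceil((n_h - 1) / 2) + 1)
    return {"lut": lut, "ff": 0, "dsp": 0, "signal": lut, "clock": 0}


def total(items):
    """Sum the contributions of (count, cost) pairs into a single dictionary."""
    acc = Counter()
    for count, cost in items:
        for key, value in cost.items():
            acc[key] += count * value
    return dict(acc)


if __name__ == "__main__":
    # Illustrative bill of blocks, not a complete IF2A configuration.
    bill = [(2, reg(13)), (1, mult(13)), (1, adder(17)), (1, srb(28, 1, 11))]
    print(total(bill))
```

A complete estimate for a sub-block would pair these functions with the element counts and word lengths of Table 4.3; the bill used in the example is deliberately small and purely illustrative.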



Figure 4.15: Estimation of the dynamic power for the considered use cases.

#### IF to Analytic (Analytic to IF)

The parameters of the "IF to Analytic" block for a specific hardware configuration are:

$$(n_{si}, n_{so}, n_h, n_m, order)$$

where  $n_{si}$ ,  $n_{so}$ ,  $n_h$  and  $n_m$  are the number of bits of the input signal, the output signal, the filter coefficients and the additional bits after multiplication  $(n_{si} + n_m)$ , respectively, while order is the filter order.

Figure 4.16 shows the results obtained from simulations carried out for different values of *order*, with the parameter settings provided in Table 4.5; correction coefficients were used to highlight the consistency of the method.

|       | $n_{si}$ | $n_{so}$ | $n_h$ | $n_m$ | order           |
|-------|----------|----------|-------|-------|-----------------|
| Case1 | 13       | 11       | 15    | 3     | [200, 300, 400] |
| Case2 | 17       | 16       | 15    | 3     | [200, 300, 400] |
| Case3 | 17       | 16       | 15    | 9     | [200, 300, 400] |

Table 4.5: Use cases for "IF to Analytic" Block



Figure 4.16: Estimation for IF to Analytic (IF2ANA) or Analytic to IF (ANA2IF) blocks.

#### Variable Coefficients Polyphase Filter

The parameters of the "Variable Coefficients Polyphase Filter" block (VCPF) for a specific hardware configuration are:

$$(n_{si}, n_{so}, n_w, n_m, order, Channel)$$

where  $n_{si}$ ,  $n_{so}$ ,  $n_w$  and  $n_m$  are the number of bits of the input signal, the output signal, the filter coefficients and the additional bits after multiplication  $(n_{si}+n_m)$ , respectively, while order is the filter order and the Channel is the number of channels in the channelizer.

Figure 4.17 compiles the results obtained from both simulation runs and model predictions, with the parameter settings listed in Table 4.6 and the position of the coefficients. The goal is to highlight the consistency of the method.

| $n_{si}$ | $n_{so}$ | $ n_w $ | $n_m$ | order | Channel |
|----------|----------|---------|-------|-------|---------|
| 13       | 13       | 15      | 3     | 23    | 8       |
| 14       | 14       | 15      | 3     | 25    | 16      |
| 15       | 15       | 15      | 3     | 25    | 32      |
| 16       | 16       | 15      | 3     | 25    | 64      |
| 17       | 17       | 15      | 3     | 27    | 128     |

Table 4.6: Use cases for simulation and estimation for "VCPF" Block



Figure 4.17: Estimation of "Variable Coefficients Polyphases Filter" (VCPF) block.

#### **Fast Fourier Transform**

The parameters of the "FFT" block (FFT) for a specific hardware configuration are:

$$(n_{si}, n_{so}, n_w, Channel)$$

where $n_{si}$, $n_{so}$ and $n_w$ are the number of bits of the input signal, the output signal and the coefficients, respectively, while Channel is the order (number of points) of the FFT.

Figure 4.18 shows the results obtained from both simulations and the model for various values of *Channel*, with the parameter settings provided in Table 4.7; correction coefficients were used to highlight the consistency of the method.

|       | $n_{si}$ | $n_{so}$ | $n_w$           | Channel                              |
|-------|----------|----------|-----------------|--------------------------------------|
| Case1 | 20       | 20       | 17              | [8, 16, 32, 64, 128, 256, 512, 1024] |
| Case2 | 10       | 10       | 15              | [8, 16, 32, 64, 128, 256, 512, 1024] |

Table 4.7: Use cases for simulation and estimation for "FFT" Block



Figure 4.18: Estimation for FFT (IFFT) blocks.

#### 4.11 Conclusion

The hardware complexity, extended with the signal-line counts, confirms the power estimate of our model. The comparison with the results obtained from the Vivado synthesis tool is shown in Figure 4.15. From this comparison, the model appears suitable for early estimation in the conceptual design phase. From the previous analysis, it is possible to identify the critical sub-blocks in terms of power consumption. Improvement strategies can then be developed at the conceptual design stage by choosing alternative solutions.

## Chapter 5

## New Multilevel Pulsed Modulation for Optical Biotelemetry: Another Case Study

#### 5.1 Introduction

The design of new neural system architectures is driven by the need to acquire, code and decode cortical neural signals, with the aim of designing novel biomedical apparatus to monitor and drive, for example, external prosthetic devices in patients with disabilities [44–49]. Moreover, a class of these devices consists of diagnostic and therapeutic systems capable of sending and receiving signals both inside and outside the human body by means of implantable and wearable sensors, in order to provide reliable, efficient and high-quality healthcare for patients [50–52].

The electronic devices designed for all these medical applications must provide transcutaneous telemetry paradigms supporting high data rate transmission at minimum Bit Error Ratio (BER), with the further important need to operate at low voltage and low power consumption [53–60]. In this sense, optical Ultra-Wide-Band (UWB) biotelemetry systems have been shown to achieve data rates up to 300 Mbps with an energy consumption below 37 pJ/bit [61]. In general, with respect to radio-frequency transmission links, for example, optical biotelemetry ensures a high level of electromagnetic compatibility and signal integrity [53–60].

In addition to these properties, the UWB optical biotelemetry reported in [61] has a further advantage, since the technique employs sub-nanosecond laser pulses to generate the clock signal and to transmit the bitstream through a proper coding procedure. As a consequence, the laser operates only for a time much shorter than the clock period, with a consequent decrease of the overall system power consumption. To date, a strong limitation is that this architecture transmits only one bit per symbol. The ability to reliably generate sub-nanosecond laser pulses by using a Vertical Cavity Surface Emitting Laser (VCSEL) opens the possibility of proposing different multilevel data coding approaches, in which each symbol to be transmitted is composed of more than one bit.

Thus, multilevel modulation techniques make it possible to decrease the main transmission clock frequency (i.e., the baud rate) and, at the same time, to achieve a high data rate, with the final consequence of increasing the overall energy efficiency of the system (i.e., lower power consumption). In this regard, various modulation techniques have been proposed in the literature and implemented for both Impulse Radio UWB and optical (fiber and wireless) links. Among them, there are schemes based on Multilevel Dual Header-Pulse Interval Modulation (MDH-PIM), Digital Pulse Interval Modulation (DH-PIM), Pulse Amplitude Modulation (PAM), Pulse Width Modulation (PWM) and Pulse Position Modulation (PPM) [62–66]. However, most of these approaches suffer from the lack of an intrinsic synchronism between the transmitter and the receiver and/or have a reduced energy efficiency.

The aim of this work is to present a new multilevel pulsed modulation solution that intrinsically allows for synchronization between the transmitter and the receiver of UWB-inspired optical data links. In other words, the proposed approach implements a Synchronized PPM (S-PPM) technique that also guarantees low-power and high-data-rate operation, making it particularly suitable for high-performance, high-efficiency wireless optical biotelemetry for implantable and wearable systems. The overall electronic system implementing the novel multilevel pulsed data coding has been designed considering, as a case example, that each symbol to be transmitted is composed of 3 bits. However, the proposed architecture is easily scalable to different symbol sizes; it can be implemented in any kind of programmable/configurable hardware for fast prototyping and testing, and it can be integrated on-chip in standard CMOS technologies through a full-custom microelectronic design. Preliminary experimental results are reported for the complete UWB-inspired wireless optical biotelemetry system, which implements the proposed pulsed modulation technique on an FPGA board. In this regard, an internal reference clock of 40 MHz (i.e., a baud rate of 40 Msymbol/s, each symbol being composed of 3 bits) has been used, achieving an equivalent transmission data rate of 120 Mbps with an efficiency of about 16.25 pJ/bit, corresponding to 48.75 pJ/symbol.
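The figures quoted above are related by simple arithmetic, which the following Python sketch makes explicit. The function name is hypothetical. The average power it returns is not a value reported in this work: it is only the product of energy per bit and bit rate, under the assumption that the quoted efficiency holds at the full data rate.

```python
def link_figures(baud_rate_hz, bits_per_symbol, energy_per_bit_j):
    """Derive data rate, energy per symbol and implied average power."""
    data_rate_bps = baud_rate_hz * bits_per_symbol
    energy_per_symbol_j = energy_per_bit_j * bits_per_symbol
    # Assumption: the energy-per-bit figure applies at the full data rate.
    average_power_w = energy_per_bit_j * data_rate_bps
    return data_rate_bps, energy_per_symbol_j, average_power_w


if __name__ == "__main__":
    rate, e_sym, power = link_figures(40e6, 3, 16.25e-12)
    print(f"data rate      : {rate / 1e6:.0f} Mbps")
    print(f"energy/symbol  : {e_sym * 1e12:.2f} pJ")
    print(f"implied power  : {power * 1e3:.2f} mW")
```

With a 40 MHz symbol clock and 3 bits per symbol, the sketch reproduces the 120 Mbps and 48.75 pJ/symbol values stated in the text.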

The hardware solution used for the transmission and reception phases (i.e., modulation and demodulation operations) has been designed following the best practices identified during the research on power estimation. In order to validate the best practices used and implemented, the design was uniform from a power consumption point of view.

**Figure 5.1:** Overview of a transcutaneous UWB-inspired optical biotelemetry.

## 5.2 Optical Biotelemetry Overview

The overall proposed system is shown in Figure 5.1. The system is composed of two internal blocks, the DIGITAL DATA CODING and the LASER DRIVER, and two external blocks, the PHOTODIODE CONDITIONING CIRCUIT and the DIGITAL DATA DECODING. Each time a SYMBOL (composed of N bits) must be transmitted (with a baud rate set by CLOCK\_M), the signal EN enables the DIGITAL DATA CODING block, which implements the novel multilevel pulsed data coding technique and generates the TRANSMITTED CODED PULSED SIGNAL. These voltage pulses are the input of the LASER DRIVER block, which includes a Vertical Cavity Surface Emitting Laser (VCSEL) that generates LASER PULSES with a pulse width consistent with the TRANSMITTED CODED PULSED SIGNAL. These laser pulses are transmitted through the channel, represented by the skin tissue, and reach the external photodiode (PD). The PHOTODIODE CONDITIONING CIRCUIT block converts the current pulses generated by the PD into voltage pulses, called the RECEIVED CODED PULSED SIGNAL, which, in turn, are used by the DIGITAL DATA DECODING block to provide the RECOVERED SYMBOL and the RECOVERED CLOCK\_M.



**Figure 5.2:** Data coding process of the proposed multilevel synchronized pulse position modulation technique.

## 5.3 The proposed Multilevel Pulsed Modulation Technique

The developed multilevel pulsed coding technique implements a new Synchronized Pulse Position Modulation (S-PPM) paradigm. The related timing diagram is shown in Figure 5.2, where the generic CLOCK signal has a period T and its frequency corresponds to the transmitted symbol rate. Considering, for example, symbols composed of 3 bits (i.e., 8 different symbols/levels), the proposed S-PPM works as follows: for every transmitted symbol, a pulse aligned with the rising edge of the CLOCK is generated and transmitted. These pulses, named SYNC PULSES, do not carry data information; they are only used by the DIGITAL DATA DECODING block for the clock recovery operations, reducing the complexity of the digital architectures and the Bit Error Ratio (BER). Moreover, an additional pulse is generated as a function of the transmitted symbol.

As an example, when transmitting the symbol "001", a pulse is generated at the time T/8, corresponding to a phase delay of 45 degrees with respect to the rising edge of the CLOCK signal. Likewise, for the transmission of the symbol "010" the pulse is generated at the time 2T/8, corresponding to a phase delay of 90 degrees. These pulses are the DATA PULSES and carry the information about the transmitted symbol. No DATA PULSE is generated only when the transmitted symbol is equal to "000". The combination of SYNC PULSES and DATA PULSES provides the TRANSMITTED CODED PULSED SIGNAL.
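The pulse timing described above can be illustrated with a minimal Python sketch. It assumes a natural binary mapping in which the symbol value k gives a DATA PULSE delay of kT/8 (consistent with the "001" case at T/8 and with the "010" case at 2T/8 described for the coding block); the function name `sppm_pulse_times` and its parameters are hypothetical and are not part of the implemented design.

```python
from fractions import Fraction

def sppm_pulse_times(symbols, bits=3, period=Fraction(1)):
    """Return (time, kind) pairs for a sequence of integer symbols.

    Assumption: symbol value k (natural binary) maps to a delay of
    k * T / 2**bits after the rising edge of CLOCK; symbol 0 has no DATA pulse.
    """
    levels = 2 ** bits
    pulses = []
    for n, sym in enumerate(symbols):
        if not 0 <= sym < levels:
            raise ValueError(f"symbol {sym} does not fit in {bits} bits")
        start = n * period
        pulses.append((start, "SYNC"))      # one SYNC pulse per symbol
        if sym != 0:                        # "000" carries no DATA pulse
            pulses.append((start + sym * period / levels, "DATA"))
    return pulses

# Symbols "001", "000", "010" with the period T normalised to 1
for t, kind in sppm_pulse_times([0b001, 0b000, 0b010]):
    print(f"{kind} pulse at t = {t} T")
```

Exact fractions are used only to keep the slot boundaries free of rounding; in hardware the delays are produced by the phase-shifted clocks rather than computed.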

## 5.4 System Design and Implementation

The proposed modulation approach requires the design and development of novel digital architectures for its implementation. In this regard, all the digital circuits of the functional/logic blocks performing the data coding and decoding have been implemented on a commercially available FPGA-based development board (Xilinx KCU105 with a Kintex UltraScale XCKU040-2FFVA1156E FPGA) through a hardware description in VHDL (Very High Speed Integrated Circuits Hardware Description Language), setting the LVCMOS15 hardware operating mode (signal voltage levels from 0 to 2 V).

### 5.4.1 Power Analysis in Pre-Design Phase

Starting from the previous power consumption studies, a preliminary analysis has been performed to identify the best hardware architectures for the modulator and the demodulator. In particular, this analysis focused on choosing the solution with the lowest average switching activity. Since the employed Pulse Position Modulation can be considered a Time-to-Digital conversion, the core of the modulator/demodulator is a counter. In this regard, the following three possible solutions have been considered, analysed and compared:

- Classical Code Based Counter;
- Gray Code Based Counter;
- Polyphase Based Architecture (proposed solution/approach).

|           | STAGE I                                        | STAGE II                                               | STAGE III                              |
|-----------|------------------------------------------------|--------------------------------------------------------|----------------------------------------|
| Classical | $P_{I}^{C} = F_s N \log_2 N$                   | $P_{II}^{C} = \frac{F_s}{2} \sum_{i=1}^{\log_2 N} 2^i$ | $P_{III}^{C} = \frac{F_s}{2} \log_2 N$ |
| Gray      | $P_{I}^{G} = F_s N \log_2 N$                   | $P_{II}^{G} = \frac{F_s}{2} N$                         | $P_{III}^{G} = \frac{F_s}{2} \log_2 N$ |
| Polyphase | $P_{I}^{P} = F_s \left(\frac{N}{2} - 1\right)$ | $P_{II}^{P} = \frac{F_s}{2} N$                         | $P_{III}^{P} = \frac{F_s}{4} N$        |

Table 5.1: Switching activity.

Referring to Figure 5.3, Table 5.1 summarises the switching activity in the three stages of the different solutions analysed, where $F_s$ is the symbol frequency and $N$ is the number of levels. To comply with the power studies, the transitions of the various signals are considered over the entire cycle (in practice, every 2 symbol transitions).

In the following, each main stage is described and analysed in more detail:



**Figure 5.3:** Classical, Gray and Polyphase architectures.

- **STAGE I**: The symbol synchronisation signal is connected in parallel to the counter (FF) or phase shifter (Delay) sub-blocks. The main difference between them is that, although the phase shifter blocks are larger in number, they work at the symbol frequency, whereas the counters need a clock frequency that is N times higher than the symbol frequency.
- **STAGE II**: The total transitions within the symbol period have been calculated: they are equal to half the number of levels in the Gray and Polyphase cases, while they are $\frac{1}{2} \sum_{i=1}^{\log_2 N} 2^i$ in the Classical case.
- **STAGE III**: Finally, the output signals are updated every symbol period (i.e., with a frequency $F_s$). Considering all the analysed cases, the main difference lies in the number of signals: in the Classical and Gray cases they are equal to $\log_2 N$, while in the Polyphase case they are equal to N/2.
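The Stage II counts for the two counter-based solutions can be cross-checked by brute force. The sketch below, with hypothetical helper names, counts the bit toggles of a natural binary counter and of a Gray counter over one full wrap-around cycle of N states; the results correspond to $\sum_{i=1}^{\log_2 N} 2^i = 2N - 2$ and to $N$, i.e., to the Stage II terms before the factor 1/2 used in the analysis. It is only an illustrative check and does not model the polyphase architecture.

```python
def toggles_per_cycle(codes):
    """Count single-bit transitions over one full wrap-around cycle."""
    total = 0
    for i, code in enumerate(codes):
        nxt = codes[(i + 1) % len(codes)]
        total += bin(code ^ nxt).count("1")   # bits that change between states
    return total

def binary_sequence(n_levels):
    """States of a natural binary counter with n_levels states."""
    return list(range(n_levels))

def gray_sequence(n_levels):
    """States of a reflected Gray counter with n_levels states."""
    return [k ^ (k >> 1) for k in range(n_levels)]

for n_levels in (4, 8, 16, 32, 64):
    b = toggles_per_cycle(binary_sequence(n_levels))
    g = toggles_per_cycle(gray_sequence(n_levels))
    print(f"N = {n_levels:2d}: binary toggles = {b}, Gray toggles = {g}")
```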

Figure 5.4 shows a comparative analysis of the power estimation results (in millions of transitions per second, Mtr/s) obtained for the following cases of interest for the considered applications:

- 60 Mb/s, from 2 to 6 bits per symbol;
- 120 Mb/s, from 2 to 6 bits per symbol;
- 180 Mb/s, from 2 to 6 bits per symbol.



Figure 5.4: Power estimation for Multilevel Link.
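As a rough illustration of how the expressions of Table 5.1 can be evaluated for these operating points, the following Python sketch computes the three stage terms for each architecture. It assumes that $F_s$ equals the data rate divided by the number of bits per symbol (expressed in MHz, so that the results are in Mtr/s), that $N = 2^{\text{bits}}$, and that the per-architecture figure is the plain sum of the three stages; the last point is an assumption of this sketch, and the curves of Figure 5.4 may combine the stages differently. The function name is hypothetical.

```python
def switching_activity(data_rate_mbps, bits_per_symbol):
    """Evaluate the Table 5.1 expressions (in Mtr/s) for one operating point.

    Assumptions: F_s = data rate / bits per symbol (MHz), N = 2**bits levels,
    and the total is the plain sum of the three stage terms.
    """
    fs = data_rate_mbps / bits_per_symbol
    n = 2 ** bits_per_symbol
    log_n = bits_per_symbol                     # log2(N)
    stages = {
        "classical": (fs * n * log_n,
                      fs / 2 * sum(2 ** i for i in range(1, log_n + 1)),
                      fs / 2 * log_n),
        "gray": (fs * n * log_n, fs / 2 * n, fs / 2 * log_n),
        "polyphase": (fs * (n / 2 - 1), fs / 2 * n, fs / 4 * n),
    }
    return {name: sum(terms) for name, terms in stages.items()}

for rate in (60, 120, 180):
    for bits in range(2, 7):
        totals = switching_activity(rate, bits)
        print(rate, bits, {k: round(v, 1) for k, v in totals.items()})
```

For the 120 Mb/s, 3-bit case (i.e., $F_s$ = 40 MHz and $N$ = 8), the three stage terms sum to 1300, 1180 and 360 Mtr/s for the Classical, Gray and Polyphase solutions, respectively.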

### 5.4.2 Design and Implementation

In more detail, the digital architecture that implements the S-PPM coding technique is shown in Figure 5.5. First of all, each transmitted symbol is stored in a SYMBOL BUFFER block through the EN signal. At the same time, the PLL TX block generates the four clock signals $\phi 1$, $\phi 2$, $\phi 3$ and $\phi 4$ starting from CLOCK\_M. These clock signals are replicas of CLOCK\_M except for a phase delay equal to 45, 90, 135 and 180 degrees, respectively. The Look-up Table LUT1 accepts as input the clock signals and the symbol, implementing the combinatorial function described in the time diagram of Figure 5.6. When a symbol equal to "000" is transmitted, the LUT1 output is always equal to zero. Instead, if a "010" symbol is transmitted, the LUT1 output becomes high when $\phi 1$ and $\phi 2$ have a high logic state and $\phi 3$ and $\phi 4$ a low logic state. As a result, the LUT1 output shows a pulse with a duration of T/8, delayed by 2T/8 with respect to the rising edge of CLOCK\_M. In the same way, any other symbol generates a pulse with the same length but a different delay. In other words, each symbol generates a pulse with a specific delay corresponding to one of the $2^N$ possible levels, where N is the number of bits contained in the symbol. In order to reduce the duration of the pulse generated at the LUT1 output, this signal is connected to the clock port of the Flip Flop FF2.

Figure 5.5: Digital architecture of the implemented DIGITAL DATA CODING block.

Figure 5.6: Time diagram of the main signals involved in the SYMBOL CODING operations.

The FF2 output is connected to the asynchronous reset pin of the same Flip Flop. In this way, it is possible to generate the DATA PULSE signal, which is a replica of the LUT1 output except for its length, reduced from the initial value of T/8 to the shorter Flip Flop reset time, thereby reducing the optical power required for the communication. In the same way, the Flip Flop FF1 generates the SYNC PULSES starting from CLOCK\_M, independently of the symbol. Finally, the TRANSMITTED CODED PULSED SIGNAL is obtained as an OR operation between the SYNC PULSES and the DATA PULSES.

Figure 5.7: Digital architecture of the implemented DIGITAL DATA DECODING block.

The purpose of the DIGITAL DATA DECODING block, whose implemented architecture is reported in Figure 5.7, is to recover the transmitted symbol from the incoming RECEIVED CODED PULSED SIGNAL. The CLOCK RECOVERY block replicates the CLOCK\_M used in the transmitter from the SYNC PULSES of the RECEIVED CODED PULSED SIGNAL. Moreover, the same circuit also regenerates the signals $\phi 1$, $\phi 2$, $\phi 3$ and $\phi 4$. After a proper synchronization introduced by the DELAY block, the RECEIVED CODED PULSED SIGNAL is connected to the clock pin of four different Flip Flops whose D inputs are connected to the signals $\phi 1$, $\phi 2$, $\phi 3$ and $\phi 4$. Considering the time diagram in Figure 5.8, each incoming pulse of the RECEIVED CODED PULSED SIGNAL triggers the FFs, which acquire the status of the signals $\phi 1$, $\phi 2$, $\phi 3$ and $\phi 4$. In more detail, each SYNC PULSE of the received pulse train always acquires a low logic state on all the inputs, resetting the FFs. Any subsequent DATA PULSE acquires a unique logic state sequence of the signals $\phi 1$, $\phi 2$, $\phi 3$ and $\phi 4$. As an example, the first data pulse in Figure 5.8 generates the sequence "1100". The four output bits of the FFs are stored in a DATA BUFFER block at a RECOVERED CLOCK\_M rising edge, just before the subsequent reset. Since the stored bit sequence is unique for each delay allowed for the DATA PULSES, this sequence is used by LUT2, which provides the corresponding symbol as output. Finally, the symbol is saved in a buffer that provides the RECOVERED SYMBOL.
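The phase-based coding and decoding described above can be summarised by a small behavioural model. The Python sketch below is not the VHDL implementation: it assumes 50% duty-cycle phase clocks, a natural binary mapping between the symbol value and the T/8 slot, and sampling in the middle of each slot as a stand-in for the DELAY block. All names (`phases`, `lut1`, `LUT2`, `encode`, `decode`) are hypothetical.

```python
T = 8  # CLOCK_M period in arbitrary time units (one unit = T/8)

def phases(t):
    """Logic state of (phi1..phi4) at time t: assumed 50% duty-cycle replicas
    of CLOCK_M delayed by 45, 90, 135 and 180 degrees (k*T/8, k = 1..4)."""
    return tuple(int(((t - k) % T) < T // 2) for k in (1, 2, 3, 4))

# Phase word observed in each T/8 slot (sampled mid-slot); slot 0 is the SYNC slot.
SLOT_WORD = {slot: phases(slot + 0.5) for slot in range(8)}
LUT2 = {word: slot for slot, word in SLOT_WORD.items()}   # phase word -> symbol

def lut1(symbol, t):
    """Coder LUT: high only while the phase word matches the symbol's slot."""
    return int(symbol != 0 and phases(t) == SLOT_WORD[symbol])

def encode(symbol):
    """Pulse arrival times within one symbol period (SYNC first, then DATA)."""
    return [0.0] + ([float(symbol)] if symbol else [])

def decode(pulse_times, sample_offset=0.5):
    """Sample the phases at each pulse; the last captured word selects the symbol."""
    word = (0, 0, 0, 0)
    for t in pulse_times:
        word = phases(t + sample_offset)   # offset stands in for the DELAY block
    return LUT2[word]

for s in range(8):
    print(f"symbol {s:03b} -> phase word {SLOT_WORD[s]} -> decoded {decode(encode(s)):03b}")
```

Under these assumptions the eight slots produce eight distinct phase words, which is what allows LUT2 to map the captured word back to the symbol; the slot at 2T/8 gives the word "1100", and a symbol without DATA PULSE leaves the all-zero word captured by the SYNC PULSE.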



**Figure 5.8:** Time diagram of the main signals involved in the DIGITAL DATA DECODING operations.

Furthermore, as far as the analog circuits implementing the LASER DRIVER and the PHOTODIODE CONDITIONING CIRCUIT blocks are concerned, we employed the same solutions reported in our previous work [61]. These circuits are used with a high-speed VCSEL (VCSEL-850 by Thorlabs) with an emission wavelength $\lambda = 850$ nm and a fast silicon PD (FDS-025 by Thorlabs) with a 47 ps response time and a 250 $\mu$m active area diameter.

## 5.5 Experimental set-up and Measurements: System Characterization, Validation and Results

Figure 5.9 shows a photo of the experimental setup used for the overall system characterization. The setup includes a tissue sample (i.e., porcine skin) emulating the presence of human tissue and providing the typical attenuation and scattering expected in the optical path. The VCSEL and the PD are mounted on their corresponding electronic conditioning circuits. They are positioned on the two sides of the tissue sample, optically aligned and in close contact with the porcine skin, which has a thickness of about 4 mm (i.e., the working distance between the VCSEL and the PD).

**Figure 5.9:** Photograph showing the experimental setup for characterizing the developed optical biotelemetry system.

The overall system has been characterized by choosing an internal reference clock (i.e., the CLOCK\_M signal) of 40 MHz (i.e., a baud rate of 40 Msymbol/s, each symbol composed of 3 bits), providing an equivalent transmission data rate of 120 Mbps. Different measurements have been conducted by employing a digital oscilloscope (LeCroy Wavemaster 8600A, 6 GHz, 20 GS/s). In order to evaluate the overall performance of the complete implemented system (i.e., correctness of the operations, BER, power consumption, etc.), a pseudo-random bit sequence (PRBS) of $2^{31}-1$ bits has been generated by the employed FPGA board.

In more detail, Figure 5.10 reports the waveforms of the main signals of the DIGITAL DATA CODING block captured from the oscilloscope. In particular, this is an example of an experimental timing diagram demonstrating the correctness of the pulsed data coding process performed by the system: it shows the input symbol to be transmitted (composed of 3 bits, from LSB to MSB) and the corresponding transmitted coded pulsed signal. Furthermore, Figure 5.11 shows the transmitted serial bitstream and the corresponding received coded pulsed signal, together with the internal reference signal (generated inside the clock recovery block) and the recovered main clock signal (i.e., RECOVERED CLOCK\_M). This demonstrates the correct functionality of the CLOCK RECOVERY block, inside the DIGITAL DATA DECODING block, which is fundamental to properly proceed with the data recovery and, thus, to provide the correct RECOVERED SYMBOL at the output of the receiver. Moreover, Figure 5.12 reports the transmitted bitstream and the corresponding regenerated symbol (composed of 3 bits, from LSB to MSB) provided at the output of the receiver. This result shows the perfect matching between the received decoded data (i.e., the RECOVERED SYMBOL) and the transmitted PRBS serial bitstream. In addition, it is possible to evaluate a time latency of about 2 symbol periods (i.e., about 50 ns) between the transmitted and the recovered symbols/bitstreams, due to the clock recovery and the data decoding processes.

**Figure 5.10:** Experimentally captured waveforms of the implemented architecture showing the main signals during transmission of a PRBS operating with a baud rate of 40 MHz (from top to bottom): the input symbol to be transmitted (composed of 3 bits, from LSB to MSB) and the corresponding transmitted coded pulsed signal.

Finally, the experimental measurements have also demonstrated that the proposed optical biotelemetry is able to achieve a BER lower than $10^{-10}$ with a maximum overall system power consumption of about 1.95 mW. Operating at 120 Mbps, the corresponding energy efficiency is about 16.25 pJ/bit. These results have been obtained with a VCSEL pulse width of about 500 ps at FWHM and a pulsed driving current with a peak value of about 10 mA (with a null DC bias current).
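The energy efficiency figure follows directly from the measured power and the data rate. The short Python check below uses only the values quoted in the text; the variable names are illustrative.

```python
power_w = 1.95e-3        # maximum overall system power consumption (W)
bit_rate = 120e6         # equivalent transmission data rate (bit/s)
bits_per_symbol = 3

# Energy per bit is power divided by bit rate; convert J to pJ.
energy_per_bit_pj = power_w / bit_rate * 1e12
energy_per_symbol_pj = energy_per_bit_pj * bits_per_symbol

print(f"{energy_per_bit_pj:.2f} pJ/bit, {energy_per_symbol_pj:.2f} pJ/symbol")
```

The result is about 16.25 pJ/bit, i.e., 48.75 pJ for each 3-bit symbol.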

With respect to the previous results achieved by the same authors in [67] with the S-OOK data coding, under similar operating conditions, the energy efficiency has improved by a factor of about 2.4, with a reduced complexity and criticality of the overall electronics. In this case, in fact, the system operates with a lower internal main reference clock (i.e., a baud rate of 40 MHz instead of 120 MHz) while guaranteeing the same output data rate (i.e., 120 Mbps). In other words, with the S-OOK data coding each 3-bit symbol would require, on average, 4.5 pulses to be transmitted. On the contrary, with the present S-PPM technique each symbol requires, on average, 1.875 pulses, providing a theoretical reduction in power consumption by a factor of about 2.4.

It is worth noting that, for a fixed baud rate, increasing the number of bits per symbol (i.e., beyond 3) makes it possible to further enhance the energy efficiency and, at the same time, to increase the overall transmission data rate.
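The average pulse counts quoted above can be reproduced with a simple counting model, sketched below in Python. The model assumes equiprobable symbols; for S-OOK it assumes one SYNC pulse per bit plus one DATA pulse for each '1' bit, which is an assumption of this sketch chosen because it reproduces the 4.5 pulses quoted for 3 bits. The function names are hypothetical.

```python
from fractions import Fraction

def sook_pulses_per_symbol(bits):
    """Assumed S-OOK model: one SYNC pulse per bit plus a DATA pulse for each '1'."""
    return bits * (1 + Fraction(1, 2))

def sppm_pulses_per_symbol(bits):
    """One SYNC pulse per symbol plus a DATA pulse unless the symbol is all zeros."""
    levels = 2 ** bits
    return 1 + Fraction(levels - 1, levels)

for bits in range(1, 7):
    sook = sook_pulses_per_symbol(bits)
    sppm = sppm_pulses_per_symbol(bits)
    print(f"{bits} bit/symbol: S-OOK {float(sook):.3f}, "
          f"S-PPM {float(sppm):.3f}, ratio {float(sook / sppm):.2f}")
```

For 3 bits per symbol the model gives 4.5 and 1.875 pulses, i.e., a ratio of 2.4; since the S-PPM count stays below 2 pulses per symbol while the number of bits grows, the pulses per transmitted bit decrease as the symbol size increases.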



**Figure 5.11:** Experimentally captured waveforms of the implemented architecture showing the main signals during transmission of a PRBS operating with a baud rate of 40 MHz (from top to bottom): the transmitted bitstream, the corresponding received coded pulsed signal, the internal reference signal of the clock recovery block and the recovered main clock.



**Figure 5.12:** Experimentally captured waveforms of the implemented architecture showing the main signals during transmission of a PRBS operating with a baud rate of 40 MHz (from top to bottom): the transmitted bitstream and the corresponding regenerated symbol (composed by 3 bits, from LSB to MSB) provided at the output of the receiver.

## 5.6 Conclusion

This case study reports on a novel multilevel synchronized pulse position modulation technique for wireless ultra-wide-band optical biotelemetry links intended for implantable and wearable medical devices. The proposed modulation paradigm has been designed to increase the data transmission efficiency and, at the same time, to reduce the overall power consumption of the biotelemetry system. The proposed modulation scheme allows both for generating the synchronization clock signal and for transmitting symbols composed of more than one bit by means of a proper multilevel pulsed data coding approach. As a case example, the digital electronic implementation of the novel multilevel pulsed modulation has been developed to transmit symbols composed of 3 bits. This has been achieved by designing a proper data coding process and by employing 500 ps laser pulses to generate the coded bitstream to be transmitted. Preliminary experimental results have been reported for an optical biotelemetry system developed on an FPGA board with an internal reference clock of 40 MHz, corresponding to a transmission data rate of 120 Mbps. The resulting energy efficiency has been found to be 16.25 pJ/bit, corresponding to 48.75 pJ/symbol. Moreover, the optical link is able to achieve a BER lower than $10^{-10}$ with a maximum overall power consumption of about 1.95 mW.

Finally, the reported architecture can manage different symbol sizes and can be easily implemented in any kind of programmable and configurable hardware for fast prototyping and testing.

The validation of the power consumption estimation performed for the designed digital architecture and its comparison with alternative solutions have provided interesting results and will be the focus of a subsequent publication [J1] and of future work.

## Chapter 6

## **Conclusions**

In the emerging global framework for fifth/sixth-generation (5/6G) wireless technologies, transparent satellites may be considered as an appealing solution to provide backhaul connectivity to the on-ground Relay Nodes.

A Digital Transparent Processor (DTP) is an on-board satellite processor used to switch channels within the frame at the physical layer. DTPs thus represent an enabler for the envisioned applications in the integration between terrestrial mobile networks and satellite constellations. The methodologies proposed so far do not provide a single solution for the implementation of a DTP but a family of solutions that differ in the sets of parameters used (input/output data size, coefficient width, and number of channels).

These solutions are analyzed in order to choose the most appropriate one to implement. In this phase of the design, the solutions are evaluated under various profiles: hardware complexity, time delay, and power consumption. **Power estimation** is the aspect highlighted in this thesis. In particular, the main objective of the research has been the identification of a power consumption model that is suitable for the design stage examined (high level, or mathematical signal level) and that also has a low computational complexity, in order to steer the design and avoid costly redesign cycles.

Moving from a modeling framework and a related design methodology for Digital Transparent Processors, which we have developed and validated in recent papers, the present work proposes a significant extension of the previous modeling approach to incorporate power consumption analysis.

Based on the Augmented SFG (SFG+), the proposed methodology makes power and area estimation possible with results similar to those of Register Transfer Level (RTL) estimation methods. This new representation supports moving from the arithmetic to the architectural description without considering the RTL representation. In this way, power estimation is brought forward to the conceptual design phase.

The SFG+ model was validated through two case studies: the DTP itself and an optical communication uplink. In both cases, the results support the validity of the model, which is capable of performing power, hardware complexity, and area estimations. The SFG+ model estimates were compared with those obtained in the design and simulation environment offered by the Vivado tool, confirming the effectiveness of the SFG+ model. From the application point of view, the difference between the two case studies generalizes the use of the proposed model. It also demonstrates the robustness and versatility of the model, as well as some features linked to the deterministic character of the designed components. In the case study concerning the optical biotelemetry exploiting the multilevel pulse position modulation, the method has been applied in the conceptual phase of a real application design, which will be validated not only on the FPGA prototype but also as an on-chip full-custom VLSI implementation.

Ongoing work is concerned with further refinements of the models and the formulation of an explicit optimization problem that is expected to yield a robust framework for the careful design of advanced transponders in the 5G/6G ecosystems.

### **Publications**

#### **In Reviewed Conference Proceedings**

- [P1]: V. Sulli, G. Marini, F. Santucci, G. Battisti and M. Faccio, "Performance and Hardware Complexity Trade-offs for Digital Transparent Processors in 5G Satcoms," 2019 IEEE Aerospace Conference, Big Sky, MT, USA, 2019, pp. 1-9.
- [P2]: G. Marini, G. Battisti, V. Sulli, F. Santucci and M. Faccio, "Augmented Hardware Complexity for Digital Transparent Processor in Power Estimation Perspective," scheduled.
- [P3]: G. Di Patrizio Stanchieri, G. Battisti, A. De Marcellis, M. Faccio, E. Palange, T. G. Constandinou, "A New Multilevel Pulsed Modulation Technique for Low Power High Data Rate Optical Biotelemetry," IEEE Bio-CAS Conference 2021.

#### **Journals**

[J1]: G. Di Patrizio Stanchieri, G. Battisti, A. De Marcellis, M. Faccio, E. Palange, T. G. Constandinou, "A Multilevel Optical Pulsed Modulation for High Efficiency Biotelemetry," scheduled in Transaction.

## **Bibliography**

- [1] V. Sulli, G. Marini, F. Santucci, G. Battisti, and M. Faccio. Performance and hardware complexity trade-offs for digital transparent processors in 5g satcoms. In *2019 IEEE Aerospace Conference*, pages 1–9, 2019.
- [2] I. Leyva-Mayorga, B. Soret, M. Röper, D. Wübben, B. Matthiesen, A. Dekorsy, and P. Popovski. Leo small-satellite constellations for 5g and beyond-5g communications. *IEEE Access*, 8:184955–184964, 2020.
- [3] M. Mitry. Routers in space: Kepler communications' cubesats will create an internet for other satellites. *IEEE Spectrum*, 57(2):38–43, 2020.
- [4] Inigo del Portillo, Bruce G. Cameron, and Edward F. Crawley. A technical comparison of three low earth orbit satellite constellation systems to provide global broadband. *Acta Astronautica*, 159:123–135, 2019.
- [5] F. N. Najm. A survey of power estimation techniques in vlsi circuits. *IEEE Transactions on Very Large Scale Integration (VLSI) Systems*, 2(4):446–455, 1994.
- [6] Y. Nasser, J. Lorandel, J. Prévotet, and M. Hélard. Rtl to transistor level power modelling and estimation techniques for fpga and asic: A survey. *IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems*, pages 1–1, 2020.
- [7] Chi-Ying Tsui, M. Pedram, and A. M. Despain. Exact and approximate methods for calculating signal and transition probabilities in fsms. In *31st Design Automation Conference*, pages 18–23, 1994.
- [8] D. Marculescu, R. Marculescu, and M. Pedram. Stochastic sequential machine synthesis targeting constrained sequence generation. In *33rd Design Automation Conference Proceedings*, *1996*, pages 696–701, 1996.
- [9] J. Monteiro, S. Devadas, and B. Lin. A methodology for efficient estimation of switching activity in sequential logic circuits. In *31st Design Automation Conference*, pages 12–17, 1994.

- [10] F. N. Najm. Transition density: a new measure of activity in digital circuits. *IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems*, 12(2):310–323, 1993.
- [11] C. Tsui, M. Pedram, and A. M. Despain. Efficient estimation of dynamic power consumption under a real delay model. In *Proceedings of 1993 International Conference on Computer Aided Design (ICCAD)*, pages 224–228, 1993.
- [12] R. Marculescu, D. Marculescu, and M. Pedram. Switching activity analysis considering spatiotemporal correlations. In *IEEE/ACM International Conference on Computer-Aided Design*, pages 294–299, 1994.
- [13] P. H. Schneider and S. Krishnamoorthy. Effects of correlations on accuracy of power analysis-an experimental study. In *Proceedings of 1996 International Symposium on Low Power Electronics and Design*, pages 113–116, 1996.
- [14] S. Garg, S. Tata, and R. Arunachalam. Static transition probability analysis under uncertainty. In *IEEE International Conference on Computer Design: VLSI in Computers and Processors*, 2004. *ICCD* 2004. *Proceedings.*, pages 380–386, 2004.
- [15] Laurence W. Nagel. SPICE2: A Computer Program to Simulate Semiconductor Circuits. PhD thesis, EECS Department, University of California, Berkeley, May 1975.
- [16] Charlie X. Huang, Bill Zhang, An chang Deng, and Burkhard Swirski. The design and implementation of powermill. In *Proceedings of the International Symposium on Low Power Design*, pages 105–110, 1995.
- [17] Christian Piguet. Low-power CMOS circuits: technology, logic design and CAD tools. CRC/Taylor Francis, 2006.
- [18] S. Alipour, B. Hidaji, and A. S. Pour. Circuit level, static power, and logic level power analyses. In *2010 IEEE International Conference on Electro/Information Technology*, pages 1–4, 2010.
- [19] Burch, Najm, Yang, and Trick. Mcpower: a monte carlo approach to power estimation. In 1992 IEEE/ACM International Conference on Computer-Aided Design, pages 90–97, 1992.
- [20] M. G. Xakellis and F. N. Najm. Statistical estimation of the switching activity in digital circuits. In *31st Design Automation Conference*, pages 728–733, 1994.

- [21] Y. H. Park and E. S. Park. Statistical power estimation of cmos logic circuits with variable errors. *Electronics Letters*, 34(11):1054–1056, 1998.
- [22] Yaseer Arafat Durrani and Teresa Riesgo. Efficient power analysis approach and its application to system-on-chip design. *Microprocessors and Microsystems*, 46:11–20, 2016.
- [23] Gaurav Verma, Chetna Dabas, Ashish Goel, Manish Kumar, and Vijay Khare. Clustering based power optimization of digital circuits for fpgas. *Journal of Information and Optimization Sciences*, 38(6):1029–1037, 2017.
- [24] Lenny Truong and Pat Hanrahan. A Golden Age of Hardware Description Languages: Applying Programming Language Techniques to Improve Design Productivity. In Benjamin S. Lerner, Rastislav Bodík, and Shriram Krishnamurthi, editors, 3rd Summit on Advances in Programming Languages (SNAPL 2019), volume 136 of Leibniz International Proceedings in Informatics (LIPIcs), pages 7:1–7:21, Dagstuhl, Germany, 2019. Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik.
- [25] Gerardus Johannes Maria Smit, Jan Kuper, and C.P.R. Baaij. A mathematical approach towards hardware design. In P.M. Athanas, J. Becker, J. Teich, and I. Verbauwhede, editors, *Dagstuhl Seminar on Dynamically Reconfigurable Architectures*, Dagstuhl Seminar Proceedings, page 11, Germany, December 2010. Internationales Begegnungs- und Forschungszentrum für Informatik. eemcs-eprint-19169.
- [26] Christiaan Baaij and Jan Kuper. Using rewriting to synthesize functional languages to digital circuits. In Jay McCarthy, editor, *Trends in Functional Programming*, pages 17–33, Berlin, Heidelberg, 2014. Springer Berlin Heidelberg.
- [27] R. Wester, D. Sarakiotis, E. Kooistra, and J. Kuper. Specification of apertif polyphase filter bank in clash. In *CPA*, 2012.
- [28] Luciano Lavagno, Grant Martin, and Louis Scheffer. *Electronic Design Automation for Integrated Circuits Handbook 2 Volume Set.* CRC Press, Inc., USA, 2006.
- [29] V. Sulli, D. Giancristofaro, F. Santucci, and M. Faccio. An analytical method for performance evaluation of digital transparent satellite processors. In *2016 IEEE Global Communications Conference (GLOBECOM)*, pages 1–7, Dec 2016.

- [30] V. Sulli, D. Giancristofaro, F. Santucci, and M. Faccio. Computing the hardware complexity of digital transparent satellite processors on the basis of performance requirements. In 2017 IEEE International Conference on Communications (ICC), pages 1–7, May 2017.
- [31] Paulo S. R. Diniz, Eduardo A. B. da Silva, and Sergio L. Netto. *Multirate systems*, pages 455–502. Cambridge University Press, 2 edition, 2010.
- [32] Fredric J. Harris. *Multirate Signal Processing for Communication Systems*. Prentice Hall PTR, USA, 2004.
- [33] P. M. Krishna and T. P. S. Babu. Polyphase channelizer demystified [lecture notes]. *IEEE Signal Processing Magazine*, 33(1):144–150, 2016.
- [34] Neil H. E. Weste and Kamran Eshraghian. *Principles of CMOS VLSI Design: A Systems Perspective*. Addison-Wesley Longman Publishing Co., Inc., USA, 1985.
- [35] D. Marculescu, R. Marculescu, and M. Pedram. Information theoretic measures for power analysis [logic design]. *IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems*, 15(6):599–610, 1996.
- [36] Louis P. A. Robichaud, M. Boisvert, and J. Robert. Signal flow graphs and applications. 1962.
- [37] V. Sulli, D. Giancristofaro, F. Santucci, M. Faccio, and G. Marini. Design of digital satellite processors: From communications link performance to hardware complexity. *IEEE Journal on Selected Areas in Communications*, 36(2):338–350, 2018.
- [38] P. Gabellini, N. Gatti, G. Gallinaro, D. Giancristofaro, and P. Angeletti. Proposed reference system scenarios for performance assessment of a high-throughput highly-reconfigurable bent-pipe processor for access networks. In *ESA Workshop Adv. Flexible Telecom Payloads (ESA/ESTEC)*, pages 1–7, 2008.
- [39] R. E. Crochiere and L. R. Rabiner. *Multirate Digital Signal Processing*. Prentice-Hall, 1983.
- [40] V. Sulli, F. Santucci, M. Faccio, and D. Giancristofaro. Performance of satellite digital transparent processors through equivalent noise. *IEEE Transactions on Aerospace and Electronic Systems*, 54(6):2643–2661, 2018.

- [41] Heinz G. Göckler and Helmut Eyssele. Study of on-board digital fdm-demultiplexing for mobile scpc satellite communications. *European Transactions on Telecommunications*, 3(1):7–14, 1992.
- [42] M. Iwabuchi, K. Sakaguchi, and K. Araki. Study on multi-channel receiver based on polyphase filter bank. In 2008 2nd International Conference on Signal Processing and Communication Systems, pages 1–7, 2008.
- [43] Y. Gao. Hardware implementation of a 32-point radix-2 FFT architecture. http://www.eit.lth.se/sprapport.php?uid=856, 2015.
- [44] James J Jun, Nicholas A Steinmetz, Joshua H Siegle, Daniel J Denman, Marius Bauza, Brian Barbarits, Albert K Lee, Costas A Anastassiou, Alexandru Andrei, Çağatay Aydın, et al. Fully integrated silicon probes for high-density recording of neural activity. *Nature*, 551(7679):232–236, 2017.
- [45] Nicholas A Steinmetz, Christof Koch, Kenneth D Harris, and Matteo Carandini. Challenges and opportunities for large-scale electrophysiology with Neuropixels probes. *Current Opinion in Neurobiology*, 50:92–100, 2018.
- [46] Guosong Hong, Xiao Yang, Tao Zhou, and Charles M Lieber. Mesh electronics: a new paradigm for tissue-like brain probes. *Current Opinion in Neurobiology*, 50:33–41, 2018.
- [47] Gian Nicola Angotzi, Fabio Boi, Aziliz Lecomte, Ermanno Miele, Mario Malerba, Stefano Zucca, Antonino Casile, and Luca Berdondini. SiNAPS: An implantable active pixel sensor CMOS-probe for simultaneous large-scale neural recordings. *Biosensors and Bioelectronics*, 126:355–364, 2019.
- [48] Michael Haas, Jens Anders, and Maurits Ortmanns. A bidirectional neural interface featuring a tunable recorder and electrode impedance estimation. In *2016 IEEE Biomedical Circuits and Systems Conference (BioCAS)*, pages 372–375. IEEE, 2016.
- [49] Sara S Ghoreishizadeh, Dorian Haci, Yan Liu, Nick Donaldson, and Timothy G Constandinou. Four-wire interface ASIC for a multi-implant link. *IEEE Transactions on Circuits and Systems I: Regular Papers*, 64(12):3056–3067, 2017.
- [50] Ha Uk Chung, Bong Hoon Kim, Jong Yoon Lee, Jungyup Lee, Zhaoqian Xie, Erin M Ibler, KunHyuck Lee, Anthony Banks, Ji Yoon Jeong, Jongwon Kim, et al. Binodal, wireless epidermal electronic systems with in-sensor analytics for neonatal intensive care. *Science*, 363(6430), 2019.

- [51] Shuang Song, Mario Konijnenburg, Roland van Wegberg, Jiawei Xu, Hyunsoo Ha, Wim Sijbers, Stefano Stanzione, Dwaipayan Biswas, Arjan Breeschoten, Peter Vis, et al. A 769 μW battery-powered single-chip SoC with BLE for multi-modal vital sign monitoring health patches. *IEEE Transactions on Biomedical Circuits and Systems*, 13(6):1506–1517, 2019.
- [52] Hojoong Kim, Yun-Soung Kim, Musa Mahmood, Shinjae Kwon, Nathan Zavanelli, Hee Seok Kim, You Seung Rim, Fayron Epps, and Woon-Hong Yeo. Fully integrated, stretchable, wireless skin-conformal bioelectronics for continuous stress monitoring in daily life. *Advanced Science*, 7(15):2000810, 2020.
- [53] Han Yuan and Bin He. Brain-computer interfaces using sensorimotor rhythms: current state and future perspectives. *IEEE Transactions on Biomedical Engineering*, 61(5):1425–1435, 2014.
- [54] T Liu, J Anders, and M Ortmanns. Bidirectional optical transcutaneous telemetric link for brain machine interface. *Electronics Letters*, 51(24):1969–1971, 2015.
- [55] Kerron Duncan and Ralph Etienne-Cummings. Selecting a safe power level for an indoor implanted UWB wireless biotelemetry link. In *2013 IEEE Biomedical Circuits and Systems Conference (BioCAS)*, pages 230–233. IEEE, 2013.
- [56] Alexander D Rush and Philip R Troyk. A power and data link for a wireless-implanted neural recording system. *IEEE Transactions on Biomedical Engineering*, 59(11):3255–3262, 2012.
- [57] Asimina Kiourti and Konstantina S Nikita. A review of in-body biotelemetry devices: Implantables, ingestibles, and injectables. *IEEE Transactions on Biomedical Engineering*, 64(7):1422–1430, 2017.
- [58] Wen Li, Yida Duan, and Jan Rabaey. A 200-Mb/s energy efficient transcranial transmitter using inductive coupling. *IEEE Transactions on Biomedical Circuits and Systems*, 13(2):435–443, 2018.
- [59] Yevhenii Antonenko, Mychailo Buriak, Oleksii Osypenko, Dmytro Shtoda, and Nikolay Chizh. Wireless charger for implantable biotelemetry system. In *2018 9th International Conference on Ultrawideband and Ultrashort Impulse Signals (UWBUSIS)*, pages 260–263, 2018.
- [60] Iman Ghotbi, Mohammad Najjarzadegan, Ali Esmailiyan, Shahin Jafarabadi Ashtiani, and Omid Shoaei. A wireless pulsed-current battery charger for implantable biomedical stimulators. In *2016 IEEE 59th International Midwest Symposium on Circuits and Systems (MWSCAS)*, pages 1–4. IEEE, 2016.
- [61] Andrea De Marcellis, Guido Di Patrizio Stanchieri, Marco Faccio, Elia Palange, and Timothy G Constandinou. A 300 Mbps 37 pJ/bit pulsed optical biotelemetry. *IEEE Transactions on Biomedical Circuits and Systems*, 14(3):441–451, 2020.
- [62] AM Zaiton, CH Eng, and F Jasman. Pulse position modulation characterization for indoor visible light communication system. In *Journal of Physics: Conference Series*, volume 1502, page 012005. IOP Publishing, 2020.
- [63] Nobuhiko Kikuchi. Multilevel signaling technology for increasing transmission capacity in high-speed short-distance optical fiber communication. *IEICE Transactions on Electronics*, 102(4):316–323, 2019.
- [64] Diouba Sacko and AA Kéïta. Techniques of modulation: pulse amplitude modulation, pulse width modulation, pulse position modulation. *International Journal of Engineering and Advanced Technology*, 7(2):100–108, 2017.
- [65] Majid Zarie, Aakbar Asghari Varzaneh, Mahdi Akbari Allah Abadi, and Farhad Sadeghi Almaloo. Multilevel dual header pulse interval modulation scheme for optical wireless communications. *International Journal of Electrical and Electronic Engineering & Telecommunications*, 1(9), 2020.
- [66] Geetika Mehandiratta, RS Kaler, and Gurpreet Kaur. Transmission analysis of 112 Gbps dual polarization QPSK/16QAM using coherent receiver with digital signal processing. In *2018 International Conference on Intelligent Circuits and Systems (ICICS)*, pages 87–92. IEEE, 2018.
- [67] Andrea De Marcellis, Elia Palange, Luca Nubile, Marco Faccio, Guido Di Patrizio Stanchieri, and Timothy G Constandinou. A pulsed coding technique based on optical UWB modulation for high data rate low power wireless implantable biotelemetry. *Electronics*, 5(4):69, 2016.