J Signal Process Syst. Author manuscript; available in PMC 2024 Jul 12.

*Published in final edited form as:*

J Signal Process Syst. 2022 May; 94(5): 455–472.

Published online 2021 May 3. doi:10.1007/s11265-021-01662-2

PMCID: PMC11245300

NIHMSID: NIHMS1955742

PMID: 39006237

Xiaomin Wu, Da-Ting Lin, Rong Chen,^{*} and Shuvra S. Bhattacharyya^{*}

## Abstract

In this paper, we develop methods for efficient and accurate information extraction from calcium-imaging-based neural signals. The particular form of information extraction we investigate involves predicting behavior variables linked to animals from which the calcium imaging signals are acquired. More specifically, we develop algorithms to systematically generate compact deep neural network (DNN) models for accurate and efficient calcium-imaging-based predictive modeling. We also develop a software tool, called NeuroGRS, to apply the proposed methods for compact DNN derivation with a high degree of automation. GRS stands for Greedy inter-layer order with Random Selection of intra-layer units, which describes the central algorithm developed in this work for deriving compact DNN structures. Through extensive experiments using NeuroGRS and calcium imaging data, we demonstrate that our methods enable highly streamlined information extraction from calcium images of the brain with minimal loss in accuracy compared to much more computationally expensive approaches.

## 1. Introduction

Predictive modeling based on calcium imaging data plays an important role in optic causality discovery based on optogenetics and calcium imaging. In this paper, we are concerned with the efficient and accurate utilization of calcium images to predict behavior variables linked to animals. While the methods developed in the paper are relevant to a wide variety of prediction scenarios, we demonstrate them concretely to predict the motion of a target mouse. The specific prediction problem involves predicting among the states of remaining stationary, moving relatively slowly (fine motion), or moving fast (*coarse motion*).

Real-time performance on resource-constrained hardware platforms is very important for such prediction problems. The capability to derive accurate predictions in real time allows scientists to dynamically adjust experiment parameters based on observations during an experiment. Additionally, real-time performance is critical when prediction is used as part of a closed-loop (feedback) system, such as in a device for neuromodulation [1]. In such a system, real-time prediction is essential to ensure the timeliness of control functions, such as neural stimulation. In practical systems for real-time, calcium-image-based prediction, hardware resource constraints must often be considered as well. Tools for calcium-imaging-based experiments are more cost-effective and more easily customized when they operate on commodity computers, such as desktops or laptops, as opposed to high-end servers. Connections to cloud-based servers involve large communication latencies, which interfere with real-time performance. In scenarios where prediction is deployed in biomedical devices (as in the neuromodulation example above), size, cost, and power consumption constraints may severely limit the computational resources that are available.

Neural network prediction models, especially deep neural network (DNN) models, are becoming increasingly popular in many application fields because of their potential for high accuracy, and their capability for exploiting large datasets when such data is available for model training. DNNs form an important class of machine learning methods that learn complex function approximations from input/output examples [2]. In recent years, DNNs have demonstrated accuracy levels that are close to or even higher than those of humans in fields such as computer vision [3], natural language processing [4], and speech recognition [5]. However, the model sizes associated with state-of-the-art DNN models have been increasing at a rapid pace. Large DNN models require very high computational power, which makes them unsuitable for real-time and resource-constrained applications of calcium imaging. Additionally, the presence of redundant information in large DNN models may make it more difficult for scientists and biomedical engineers to understand and interpret predictive models that utilize calcium imaging data.

In this paper, we develop new methods to transform large DNN models into effective *compact DNN models*, which require far fewer resources for real-time computation and much less memory compared to the corresponding original models, while achieving similar prediction accuracy. The transformation process that we investigate in this paper, from large to compact DNN form, is referred to as *pruning*. In addition to facilitating real-time, resource-constrained prediction, the ability of compact (pruned) models to identify the most important connections between artificial neurons enhances scientists’ understanding of brain function by more clearly showing relationships between neuron activities and behavior variables. Here and throughout the rest of this paper, we use *artificial neuron* to refer to a neuron within a DNN and *neuron* (without the “artificial” qualifier) to refer to a neuron in the brain. Also in the context of pruning, we refer to the DNN that is input to the pruning process as the *original* network, and the resulting smaller network as the *pruned* network.

We develop a new pruning approach called GRS, which stands for Greedy inter-layer order with Random Selection of intra-layer units. GRS is a form of pruning called *structured pruning*, where connections are removed from the original network according to regular patterns rather than at arbitrary points in the network (e.g., see [6]). Structured pruning is easier to exploit for efficient implementation compared to unstructured pruning approaches, and unlike unstructured pruning, it requires no additional cost in terms of hardware or specialized libraries to efficiently exploit [6]. We show that our proposed GRS approach can be used in isolation or in combination with additional network transformation stages that are based on unstructured pruning.

We refer to our combination of GRS with unstructured pruning, as described above, as GRS+TQ, where “TQ” represents the specific sequence of two unstructured pruning stages that we have applied in this work to post-process the network derived by GRS. Here “TQ” stands for Threshold and Quantize. The first unstructured pruning stage uses a thresholding approach to iteratively remove low-weight connections, while the second unstructured pruning stage performs quantization to reduce the number of distinct weight values that need to be maintained in memory.

This work is among the first to systematically generate compact DNN models for accurate and efficient calcium-imaging-based predictive modeling. The pruning approaches that we develop, GRS and GRS+TQ, are applicable across different types of DNN models, which further strengthens their utility in design space exploration for real-time, resource-constrained applications of calcium image analysis. We demonstrate this generality by evaluating GRS and GRS+TQ on two different types of DNN models: multilayer perceptron (MLP) models and convolutional neural network (CNN) models. Intuitively, GRS is well suited to implementation using off-the-shelf processors and neural network libraries, whereas GRS+TQ provides the potential for further improvement through specialized hardware/software for handling the irregular computations resulting from unstructured pruning.

Building on our proposed pruning methods, we develop a software tool to apply the methods with a high degree of automation. Our software tool, called NeuroGRS, provides a novel platform for experimentation with streamlined DNN implementations of behavior prediction based on neural activity data. Through extensive experiments using NeuroGRS and calcium imaging data, we demonstrate that GRS and GRS+TQ lead to highly streamlined information extraction from calcium images of the brain with minimal loss in accuracy compared to much more computationally expensive approaches.

## 2. Related Work

Li et al. apply calcium images of the mouse brain to a DNN model to predict forelimb reach results [7]. Due to their direct use of calcium images as input to the DNN, they use a highly complex model, ResNet18 [8], which is composed of 18 layers. In contrast, we demonstrate in this paper that by operating on neural activity signals, we can apply much more compact and efficient DNN models while maintaining accurate prediction of mouse motion. The activity signals are derived by preprocessing the input calcium image stream. The datasets that we experiment with in this work include the effect of such preprocessing.

Lee et al. [9] apply online learning with an incremental linear discriminant analysis (LDA) method to predict mouse motion based on neural activity signals that are extracted from calcium images. In this paper, we apply DNNs as prediction models, which provide higher prediction accuracy and more stable performance across different datasets, and can easily be used for online learning.

Liu et al. and Frankle and Carbin use carefully designed comparison experiments to demonstrate that for structured pruning methods, training the pruned models from scratch yields similar accuracy to that of the pruned models with the weights that are inherited from the original network [10, 11]. We use this insight to allow our proposed new pruning process to jointly optimize the weights and network architecture (rather than constraining the network to use the weights of the original network).

Li et al. and Hu et al. show that significant loss of accuracy can result from pruning multiple filters or artificial neurons without retraining between the pruning operations [12, 13]. Motivated by these results, our proposed approach, GRS, applies retraining after each removal of a filter or artificial neuron throughout the pruning process. This approach increases the computational cost of the pruning process, but helps to maximize the accuracy of the pruned solutions.

State-of-the-art pruning methods mostly focus on identifying units that provide the least contribution based on pre-trained weights (e.g., see [14, 15, 12, 16, 17, 18, 19, 20]). GRS differs from these methods in that GRS does not restrict itself to the use of pre-trained weights. Instead, as described earlier in this section, GRS retrains the weights after every pruning operation. In our experiments, we include a comparison of GRS with a representative approach, that of Li et al. [12], which is based on using pre-trained weights. Details of this comparison are discussed in Section 5.3.

The main novelty of our work is two-fold. First, we introduce a new structured pruning method that involves weight retraining throughout the pruning process. Second, we present the first study, to our knowledge, of DNN pruning for efficient behavior prediction from calcium-imaging-based neural signals, and we demonstrate the utility of our proposed new structured pruning approach on this important problem in neural signal analysis. Additionally, we develop a software tool, called NeuroGRS, which combines the proposed structured pruning method with two unstructured pruning techniques. NeuroGRS also provides automated synthesis of C/C++ code for deployment of efficient behavior prediction implementations on resource-constrained platforms.

## 3. Methods

In this section, we present the methods developed in this work, including the NeuroGRS tool, which enables automated application of the proposed pruning methods.

### 3.1. NeuroGRS

Input to NeuroGRS includes a set of $n$ alternative, overparameterized DNN models (*candidates*) ${M}_{1},{M}_{2},\dots ,{M}_{n}$. This input is provided by the neuromodulation system designer (user) based on his or her own previous experience with DNN models and off-the-shelf customized models that he or she has access to. Each ${M}_{i}$ has an associated *model type* $\mathit{mtype}\left({M}_{i}\right)$, which must be either MLP or CNN. Extension of NeuroGRS to support other model types is an interesting direction for future work.

NeuroGRS provides an automated framework that allows the designer to derive an optimal model from among the candidate models. This capability is provided while taking into account the optimized application of structured pruning and also (optionally) further model compression through unstructured pruning.

A dataflow representation of the computational process underlying NeuroGRS is shown in Fig. 1. To facilitate experimentation and adaptation to different application requirements, NeuroGRS is developed using modular interfaces and components for the different blocks illustrated in Fig. 1. Background on dataflow and details on dataflow modeling in NeuroGRS are discussed in Section 3.2 and Section 3.3.

Fig. 1

A dataflow representation of the computational process underlying NeuroGRS.

NeuroGRS evaluates each candidate ${M}_{i}$ individually by training it, applying pruning (GRS or GRS+TQ) to the trained model, and then evaluating the resulting pruned model $P\left({M}_{i}\right)$. The training subsystem involved in this process is represented by the block labeled Training in Fig. 1, and the pruning subsystem encompasses the blocks labeled TQ, GRS, Switch and Select. The interaction among these four blocks is described in Section 3.3, and details on the GRS and TQ pruning processes are presented in Section 3.4 and Section 3.6, respectively.

As shown in Fig. 1, the output from the pruning process is a set of pruned models $P\left({M}_{1}\right),P\left({M}_{2}\right),\dots ,P\left({M}_{n}\right)$, and a set of corresponding vectors ${\mathbf{v}}_{1},{\mathbf{v}}_{2},\dots ,{\mathbf{v}}_{n}$, where each $P\left({M}_{i}\right)$ is the pruned version of ${M}_{i}$ and each ${\mathbf{v}}_{i}$ encapsulates selected design evaluation metrics for the pruned model $P\left({M}_{i}\right)$. In particular, ${\mathbf{v}}_{i}$ is a 3-element vector that gives the measured accuracy, FLOP count, and number of remaining (unpruned) weights in $P\left({M}_{i}\right)$.

The blocks with dashed borders in Fig. 1 (${D}_{T}$ and ${D}_{V}$) represent datasets that are used for training and pruning, respectively. We refer to these two datasets as the Training Dataset and Validation Dataset, respectively. More details on these datasets and their usage in NeuroGRS are discussed in Section 3.4, Section 3.6, and Section 4.

After all candidates have been pruned and evaluated, an optimized design is selected based on application-specific criteria along with the results ${\mathbf{v}}_{1},{\mathbf{v}}_{2},\dots ,{\mathbf{v}}_{n}$. This selection process is represented by the block in Fig. 1 labeled Design Selection. The application-specific selection criteria can be configured flexibly by the user, for example by defining constraints that must be satisfied for a subset of the metrics, and defining optimization priorities for the remaining metrics. Examples of this kind of selection criteria are (a) a constraint on accuracy together with an objective of optimizing (minimizing) a weighted sum of FLOPs and memory cost (model size), and (b) constraints on FLOPs and memory cost together with an objective of optimizing accuracy. When constraints coexist with optimization objectives, the selection process optimizes the specified objectives subject to the given constraints. The flexibility and modularity provided for customizing the design selection processes is important for design of neuromodulation systems, which must satisfy stringent implementation constraints, as described in Section 1, in addition to their goal of providing high accuracy prediction capabilities.
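The two example criteria (a) and (b) can be illustrated with a small Python sketch. This is only an illustration under stated assumptions, not the Design Selection implementation in NeuroGRS: the `Metrics` container, the function names, the weights, and the example numbers are all hypothetical.

```python
from typing import NamedTuple, Optional, Sequence


class Metrics(NamedTuple):
    """Hypothetical container for one result vector v_i."""
    accuracy: float  # measured accuracy of the pruned model
    flops: float     # FLOP count of the pruned model
    weights: int     # number of remaining (unpruned) weights


def select_min_cost(results: Sequence[Metrics], min_accuracy: float,
                    w_flops: float = 1.0, w_weights: float = 1.0) -> Optional[int]:
    """Criterion (a): constrain accuracy, minimize a weighted sum of FLOPs and model size."""
    feasible = [i for i, v in enumerate(results) if v.accuracy >= min_accuracy]
    if not feasible:
        return None  # no candidate satisfies the constraint
    return min(feasible,
               key=lambda i: w_flops * results[i].flops + w_weights * results[i].weights)


def select_max_accuracy(results: Sequence[Metrics], max_flops: float,
                        max_weights: int) -> Optional[int]:
    """Criterion (b): constrain FLOPs and model size, maximize accuracy."""
    feasible = [i for i, v in enumerate(results)
                if v.flops <= max_flops and v.weights <= max_weights]
    if not feasible:
        return None
    return max(feasible, key=lambda i: results[i].accuracy)


# Made-up result vectors for three pruned candidates P(M_1), P(M_2), P(M_3).
results = [Metrics(0.92, 5000, 2400), Metrics(0.90, 1200, 600), Metrics(0.85, 400, 200)]
choice_a = select_min_cost(results, min_accuracy=0.88)
choice_b = select_max_accuracy(results, max_flops=1500, max_weights=1000)
```

In both functions the constraints are applied first as a feasibility filter, and the objective is then optimized over the feasible candidates only, which mirrors the "optimize subject to constraints" behavior described above.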

To generate the vectors $\left\{{\mathbf{v}}_{i}\right\}$ at the output of the pruning process, each $P\left({M}_{i}\right)$ is executed on each instance in the validation dataset ${D}_{V}$, and the results from these executions are aggregated to derive the corresponding result vector ${\mathbf{v}}_{i}$. In the current version of NeuroGRS, the aggregation process involves simply averaging across all dataset instances. However, this aggregation process can easily be adapted to incorporate more elaborate statistics (e.g., minimum, maximum, and average), and similarly, the Design Selection block can easily be extended to take into account more elaborate design evaluation results.

### 3.2. Dataflow Modeling in NeuroGRS

As mentioned in Section 3.1, the system model illustrated in Fig. 1 is a dataflow representation. Dataflow is a useful modeling format for signal and information processing systems because it provides well-defined interfaces between functional components, and exposes important forms of high-level application structure that are useful for reliable and efficient implementation in hardware or software [21]. In the form of dataflow that is commonly applied for signal processing system design, an application is represented as a directed graph. Vertices in the graph are called *actors* and represent functional modules, such as digital filters or machine learning classifiers, and edges represent first-in, first-out communication channels that buffer data as it passes from the output of one actor to the input of another. Each unit of data that passes through a dataflow edge is called a *token*. Tokens can have arbitrary types associated with them, such as integers, floating point values, images or video streams.

A dataflow actor executes as a sequence of *firings*, where each firing can be viewed as a discrete unit or quantum of the actor’s computation. A firing can be executed when an actor has sufficient data, where this notion of sufficiency is defined precisely as part of the design of the actor. The actual time at which different actors execute is determined by a subsystem called a *scheduler*, which is not part of the dataflow graph model. This separation of concerns between functional specification (provided by the dataflow graph) and dispatching of component executions (provided by the scheduler) is an important feature of dataflow-based design processes. For more details on the form of dataflow that we apply in this work, we refer the reader to [21, 22].

### 3.3. Pruning Modes

The Switch and Select actors shown in Fig. 1 provide conditional execution of the TQ (Threshold and Quantize) actor based on the Boolean-valued system input EUP, which stands for “Enable Unstructured Pruning.” This conditional execution provides two alternative modes of pruning, GRS and GRS+TQ, where the mode used can be configured using the EUP input.

The Switch actor has two inputs, labeled c (control) and d (data), and two outputs, labeled T (true) and F (false). The actor consumes a Boolean-valued token on its control input. The Boolean value of the token consumed from the control input indicates which output (T or F) the actor should produce its next output token onto. Upon reading the next token $\tau $ from the data input, the token $\tau $ is copied onto the output that is indicated by the control input. This process is repeated for each pair of corresponding tokens that arrive at the control and data inputs. Similarly, the Select actor reads a token from its T or F input based on the value of the token arriving on its control input. The token read from T or F is copied onto the data output. More background about dataflow graphs involving Switch and Select actors can be found in [23].

Since the EUP signal is applied simultaneously as the control input for both the Switch and Select actors in Fig. 1, the subgraph involving GRS, TQ, Switch and Select applies the GRS+TQ algorithm only if the EUP signal is true-valued and otherwise, if EUP is false-valued, the subgraph applies only GRS.
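The behavior of the Switch and Select actors, and of the pruning subgraph they control, can be sketched in Python as follows. Dataflow edges are modeled as FIFO queues. The wiring shown (GRS feeding the data input of Switch, the T output feeding TQ, and the F output bypassing TQ) is our reading of the description above and is an assumption, as are all function names; the `grs` and `tq` arguments are placeholder callables, not the actual actors.

```python
from collections import deque


def switch(control: deque, data: deque, out_true: deque, out_false: deque) -> None:
    """One firing of Switch: route one data token to T or F per one control token."""
    c = control.popleft()
    token = data.popleft()
    (out_true if c else out_false).append(token)


def select(control: deque, in_true: deque, in_false: deque, out: deque) -> None:
    """One firing of Select: copy one token from T or F (per control) to the output."""
    c = control.popleft()
    out.append((in_true if c else in_false).popleft())


def run_pruning_subgraph(model, eup: bool, grs, tq):
    """Apply GRS, then TQ only if EUP is true (assumed wiring of the subgraph)."""
    grs_out = deque([grs(model)])            # edge GRS -> Switch.d
    sw_ctrl, sel_ctrl = deque([eup]), deque([eup])  # EUP drives both control inputs
    sw_t, sw_f, tq_out, out = deque(), deque(), deque(), deque()
    switch(sw_ctrl, grs_out, sw_t, sw_f)
    if sw_t:                                 # TQ fires only if a token reached its input
        tq_out.append(tq(sw_t.popleft()))
    select(sel_ctrl, tq_out, sw_f, out)
    return out.popleft()


# Placeholder actors that simply tag the model name.
with_tq = run_pruning_subgraph("M", True, lambda m: m + "+GRS", lambda m: m + "+TQ")
without_tq = run_pruning_subgraph("M", False, lambda m: m + "+GRS", lambda m: m + "+TQ")
```

Because the same EUP value is placed on both control edges, Select always reads from the branch that Switch wrote to, so no token is left stranded on the unused branch.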

### 3.4. Structured Pruning with GRS

GRS is an iterative algorithm that selects and removes a computational unit from the input model on each iteration. The selection process applies a greedy strategy. Here, by a unit, we mean a single artificial neuron within a dense layer or a single filter within a convolutional layer. The iterative removal of units is carried out until the accuracy loss reaches a pre-defined threshold on the maximum acceptable accuracy loss or the size of each network layer reaches a pre-defined threshold on the minimum number of units in the layer.

Algorithm 1 gives a pseudocode description of the GRS algorithm. The algorithm takes as input a DNN model ${M}_{GRS}$, training dataset ${D}_{T}$, validation dataset ${D}_{V}$, minimum layer size constraint vector *MinStruct*, and accuracy drop tolerance $\mathcal{T}$. The output is a pruned version ${\mathcal{M}}_{GRS}$ of ${M}_{GRS}$ along with the accuracy *Acc* determined from evaluating the model on ${D}_{V}$; the number of FLOPs in ${\mathcal{M}}_{GRS}$; and the number of parameters (model size) in ${\mathcal{M}}_{GRS}$. The inputs $\mathcal{T}$ and *MinStruct* have been omitted in Fig. 1 to avoid excessive clutter in the diagram.

The vector *MinStruct* has positive-integer-valued elements and is indexed by the hidden layers in ${M}_{GRS}$. Each vector element *MinStruct*$\left[i\right]$ gives the minimum number of units to retain in hidden layer $i$ of the model throughout the pruning process. The *MinStruct* input therefore gives the user a means for controlling the maximum amount to which each layer can be pruned. The accuracy drop tolerance $\mathcal{T}$ is a real value in $\left(0,1\right]$. The pruning process in GRS is constrained so that the accuracy is not allowed to fall below $\mathcal{T}\times $ *OriValAcc*, where *OriValAcc* is the accuracy of the input (unpruned) model ${M}_{GRS}$ when evaluated on ${D}_{V}$. Hold-out validation is used to assess the accuracy of the intermediate model in each iteration of the pruning process. We use hold-out validation instead of $k$-fold cross-validation to reduce computational cost since the validation process must be performed repeatedly, one or more times per iteration.

The symbol [] represents an empty list. For a given hidden layer $\lambda $ within a given model $M$, $\lambda .$*UnitCount* represents the number of units in $\lambda $. Given a nonempty list $L$ of real numbers, $\mathit{\text{argmax}}\left(L\right)$ gives the index of an element of the list that has maximum value (ties are broken arbitrarily).

Function *validation* takes a DNN model and a dataset as arguments, evaluates the model with the dataset, and returns the average accuracy of the model across the dataset. The function *randomCutOneUnit* takes a DNN model and layer of the model as its arguments, randomly selects one unit from the layer, removes the selected unit, and returns the modified (smaller) model. Function *fineTuning* takes a DNN model and a dataset as arguments, and retrains the model using the dataset.

GRS finishes when either (a) the minimum unit counts imposed by *MinStruct* are reached for all hidden layers or (b) all candidate pruning operations reduce the accuracy below $\mathcal{T}\times $ *OriValAcc*. In our experiments, we use $\mathcal{T}=0.985$.
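Since the pseudocode of Algorithm 1 is not reproduced in the text, the following Python sketch shows one way the described procedure could be organized. It is a reconstruction under assumptions, not the authors' algorithm: in particular, we assume that each iteration builds one trial model per prunable hidden layer (random unit removal followed by fine-tuning) and greedily keeps the trial with the highest validation accuracy. The toy model and the stand-in functions are hypothetical, and the FLOP and parameter counts that GRS also reports are omitted.

```python
import copy
import random


def grs(model, d_train, d_val, min_struct, tol,
        unit_count, random_cut_one_unit, fine_tuning, validation):
    """Sketch of GRS; returns the pruned model and its validation accuracy."""
    ori_val_acc = validation(model, d_val)
    acc = ori_val_acc
    while True:
        candidates, accs = [], []  # both start as the empty list []
        for layer, min_units in enumerate(min_struct):
            if unit_count(model, layer) <= min_units:
                continue  # MinStruct forbids pruning this layer further
            cand = random_cut_one_unit(copy.deepcopy(model), layer)
            cand = fine_tuning(cand, d_train)  # retrain after every removal
            candidates.append(cand)
            accs.append(validation(cand, d_val))
        if not candidates:
            break  # (a) minimum unit counts reached for all hidden layers
        best = max(range(len(accs)), key=accs.__getitem__)  # argmax
        if accs[best] < tol * ori_val_acc:
            break  # (b) all candidate removals violate the accuracy constraint
        model, acc = candidates[best], accs[best]
    return model, acc


# Toy stand-ins (hypothetical): a "model" is a list of hidden layers,
# and each layer is a list of unit identifiers.
rng = random.Random(0)


def unit_count(model, layer):
    return len(model[layer])


def random_cut_one_unit(model, layer):
    model[layer].remove(rng.choice(model[layer]))
    return model


def fine_tuning(model, dataset):
    return model  # a real implementation would retrain the weights here


def validation(model, dataset):
    # Toy accuracy that drops sharply once a layer becomes too small.
    return 0.9 if len(model[0]) >= 3 and len(model[1]) >= 2 else 0.5


original = [list(range(5)), list(range(4))]
pruned, acc = grs(original, None, None, min_struct=[1, 1], tol=0.985,
                  unit_count=unit_count, random_cut_one_unit=random_cut_one_unit,
                  fine_tuning=fine_tuning, validation=validation)
```

In this toy setting, pruning stops with 3 and 2 units in the two hidden layers, because removing any further unit drops the toy accuracy below $0.985 \times 0.9$. Each trial operates on a deep copy, so rejected removals never alter the current model.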

### 3.5. Baseline Models for GRS

To help demonstrate the effectiveness of GRS, we compare GRS with two baseline methods in Section 5.3. The first baseline method was presented by Li et al. [12], while the second method is developed for experimentation purposes as part of our work on NeuroGRS. More specifically, the second method is a variation of GRS that replaces the greedy inter-layer traversal order with a random inter-layer traversal. We refer to these methods, respectively, as NWM (Natural inter-layer order and Weight Magnitude based selection of intra-layer unit to prune), and RRS (Random inter-layer order and Random Selection of intra-layer unit).

NWM was introduced as a pruning method that focuses on filters in convolutional layers. In our experiments, we used an extended version of NWM that also prunes artificial neurons in dense layers. NWM removes a preset number of units with relatively small sum of absolute weight magnitudes from each layer. The removal is performed by traversing the network layer by layer, starting at the input side of the network and traversing towards the output side. We refer to this as the *natural* inter-layer traversal order. In the remainder of this paper, when we write “NWM”, we refer to our extended version of the method by Li et al. [12], unless otherwise stated.
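The weight-magnitude criterion used by NWM can be illustrated for a single dense layer with a short Python sketch. The representation of a layer as a list of per-unit weight lists and the function names are assumptions made for illustration; for a convolutional filter, the sum would run over all kernel weights of the filter.

```python
from typing import List


def smallest_units(layer_weights: List[List[float]], num_to_prune: int) -> List[int]:
    """Return indices of the units with the smallest sum of absolute weights."""
    scores = [sum(abs(w) for w in unit) for unit in layer_weights]
    order = sorted(range(len(scores)), key=scores.__getitem__)
    return sorted(order[:num_to_prune])


def prune_layer(layer_weights: List[List[float]], num_to_prune: int) -> List[List[float]]:
    """Remove the selected units; a full implementation would also remove the
    matching input connections in the following layer."""
    drop = set(smallest_units(layer_weights, num_to_prune))
    return [unit for i, unit in enumerate(layer_weights) if i not in drop]


# Four units with two incoming weights each (made-up values).
layer = [[0.5, -0.4], [0.01, -0.02], [1.0, 0.3], [-0.1, 0.05]]
selected = smallest_units(layer, 2)
remaining = prune_layer(layer, 2)
```

Applying `prune_layer` to each layer in turn, from the input side towards the output side, corresponds to the natural inter-layer traversal order described above.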

We use RRS as a baseline in our experiments to help validate and quantify the utility of optimizing the inter-layer traversal order during the structured pruning process of GRS. Liu et al. show that the pruned structure is of special importance [10]. Since removal of units in different layers results in different intermediate structures during the pruning process, the greedy inter-layer order selection of GRS is motivated by the objective of deriving more favorable intermediate structures. For example, unsuitable intermediate structures may cause an early stop of the overall structured pruning process, especially when the accuracy drop tolerance $\mathcal{T}$ is strict.

### 3.6. Unstructured Pruning Extensions with TQ

Pruning Stage T applies the concept of pruning weights that have relatively low magnitudes. This concept has been presented previously in the literature (e.g., see [19]), and our contribution here is to integrate the concept into the NeuroGRS framework as an optional follow-on to the GRS Pruning Stage, and with a subsequent refinement that provides further optimization through weight quantization (Stage Q) [20].

Pruning Stage T in NeuroGRS seeks to reduce the FLOP count and weight count of a model by removing weights that have relatively small absolute values, and that contribute relatively little to the forward propagation computation of the given DNN model. We sometimes refer to this stage more concisely as *Stage T*. Like GRS, Stage T is an iterative algorithm. At each iteration, all weights whose magnitudes fall below a threshold *Thresh* are temporarily removed. If the model resulting from the removals has sufficient accuracy, then the temporary removals are made permanent, and the threshold *Thresh* for the next iteration is increased. The iteration continues until the temporary weight removals result in an accuracy level that falls below the minimum accuracy constraint. As with GRS, hold-out validation is used to assess the accuracy of the intermediate model in each iteration of the pruning process.

Algorithm 2 gives a pseudocode description of Stage T. Input to Stage T consists of an input DNN model ${M}_{T}$, validation dataset ${D}_{V}$, accuracy drop tolerance $\mathcal{T}$, initial setting for *Thresh*, and constant amount *ThreshStep* by which to increase *Thresh* before transitioning to a new algorithm iteration. As in Algorithm 1, $\mathcal{T}$ is a real value in $\left(0,1\right]$, and the pruning process in Stage T is constrained so that the accuracy is not allowed to fall below a factor of $\mathcal{T}$ times the accuracy of the input model $\left({M}_{T}\right)$. The $\mathcal{T}$ value provided to Stage T need not be the same as that provided to Algorithm GRS.

Function *validation* operates as specified in Section 3.4. Function *cutWeights* takes a DNN model and a threshold value as arguments. The function removes all weights from the model whose magnitudes are smaller than the threshold. The function keeps a record of the weights that it removes in a given call to the function while discarding any weight removal records associated with the previous call. Function *undoCuts* reinserts the weights that were removed in the most recent call to *cutWeights*.

Stage T calls function *undoCuts* only if it is found from Function *validation* that the most recent weight removals have caused the accuracy to become unacceptably low. The weight removal process (iteration) of Pruning Stage T terminates after the first call to *undoCuts* completes.

In our experiments with Stage T, we use $\mathcal{T}=0.995$, *Thresh* $=0.001$, and *ThreshStep* $=0.001$.
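The iteration of Stage T described above can be sketched in Python as follows. This is a sketch under assumptions rather than a reproduction of Algorithm 2: the model is reduced to a flat list of weights, a removed weight is represented by the value 0.0, and the guard that stops once every weight has been removed is our own addition to guarantee termination. The toy validation function and weight values are made up.

```python
def cut_weights(weights, thresh):
    """Zero out weights with magnitude below thresh; return the result and a
    record of the removals made in this call only."""
    record = {i: w for i, w in enumerate(weights) if w != 0.0 and abs(w) < thresh}
    pruned = [0.0 if i in record else w for i, w in enumerate(weights)]
    return pruned, record


def undo_cuts(weights, record):
    """Reinsert the weights removed by the most recent call to cut_weights."""
    restored = list(weights)
    for i, w in record.items():
        restored[i] = w
    return restored


def stage_t(weights, d_val, tol, thresh, thresh_step, validation):
    """Sketch of Stage T: raise the threshold until accuracy becomes too low."""
    min_acc = tol * validation(weights, d_val)
    while True:
        trial, record = cut_weights(weights, thresh)
        if validation(trial, d_val) < min_acc:
            return undo_cuts(trial, record)  # terminate after the first undo
        weights = trial                      # make the removals permanent
        if not any(weights):
            return weights                   # added guard: nothing left to remove
        thresh += thresh_step


def toy_validation(weights, dataset):
    # Toy accuracy: the model only works while the weight at index 3 survives.
    return 0.9 if weights[3] != 0.0 else 0.6


result = stage_t([0.5, 0.0005, -0.0015, 0.0025, -0.8], None,
                 tol=0.995, thresh=0.001, thresh_step=0.001,
                 validation=toy_validation)
```

With the experiment settings above, the first two iterations remove the weights 0.0005 and -0.0015; the third would remove 0.0025, which violates the toy accuracy constraint, so that removal is undone and the iteration ends.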

Pruning Stage Q applies the concept of quantizing weights, which can result in smaller memory requirements and faster operations when specialized software libraries or hardware is used [20]. As illustrated in Fig. 1, Pruning Stage Q is applied after Pruning Stage T whenever the EUP (Enable Unstructured Pruning) input to NeuroGRS is true-valued.

Stage Q takes as input a DNN model ${M}_{Q}$, a validation dataset ${D}_{V}$, an accuracy drop tolerance $\mathcal{T}$, and an initial maximum number of decimal places ${n}_{d}$ to control the rounding. Stage Q starts by rounding all weights in ${M}_{Q}$ to ${n}_{d}$ decimal places. If the resulting, weight-rounded model has sufficient accuracy (determined using ${D}_{V}$), then the number of decimal places to retain is decreased by 1 (to $\left({n}_{d}-1\right)$). This process is iteratively repeated until the minimum accuracy constraint is crossed, at which point the model from the previous iteration is output as the final weight-rounded model. For brevity, we omit a pseudocode sketch of Stage Q. In our experiments with Stage Q, we use ${n}_{d}=4$ and $\mathcal{T}=0.990$.
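Although a pseudocode listing of Stage Q is omitted, the procedure is simple enough to illustrate with a brief Python sketch. Several details here are assumptions that the text does not settle: each trial rounds the original weights rather than the previously rounded ones, the input model is returned unchanged if even the first rounding is rejected, and the loop stops at zero decimal places. The flat weight list and the toy validation function are hypothetical.

```python
def stage_q(weights, d_val, tol, n_d, validation):
    """Sketch of Stage Q: keep the coarsest rounding that preserves accuracy."""
    min_acc = tol * validation(weights, d_val)
    best = list(weights)  # assumed fallback if the first rounding is rejected
    while n_d >= 0:       # assumed lower bound on the number of decimal places
        trial = [round(w, n_d) for w in weights]
        if validation(trial, d_val) < min_acc:
            break         # constraint crossed: output the previous model
        best = trial
        n_d -= 1
    return best


def make_toy_validation(reference, max_dev=0.01):
    """Toy accuracy: high while every weight stays within max_dev of the reference."""
    def validation(weights, dataset):
        ok = all(abs(w - r) <= max_dev for w, r in zip(weights, reference))
        return 0.9 if ok else 0.5
    return validation


original = [0.12345, -0.6789, 0.5]
quantized = stage_q(original, None, tol=0.990, n_d=4,
                    validation=make_toy_validation(original))
```

In this toy run, rounding to 4, 3, and 2 decimal places is accepted, while rounding to 1 decimal place moves the weights too far from their original values, so the 2-decimal-place model is returned.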

Exploiting the unstructured pruning result of TQ in general requires a specialized software library or specialized hardware that is capable of exploiting irregular, sparse DNN computations. The model output by the TQ block can be applied to platforms that are equipped with such specialized software or hardware. However, experimentation using such specialized hardware/software is beyond the scope of this paper. Instead, we assess the cost (in accuracy) and potential benefit (in computation and memory savings) resulting from TQ by measuring the accuracy drop, the reduction in FLOPs, and the reduction in DNN parameters.

### 3.7. Synthesis of Real-time Implementations

From the optimized model produced by the Design Selection block in Fig. 1, NeuroGRS automatically generates a C/C++ implementation of the model. Such synthesized implementations are useful for experimenting with or deploying the models in real-time, resource-constrained operational scenarios, where the efficiency of the inference code is critical. The automated C/C++ code synthesis feature of NeuroGRS is developed by applying capabilities of the DSPCAD Framework, which provides tools for prototyping and experimenting with dataflow-based design flows for signal and information processing systems [24]. Here, DSPCAD stands (in reverse order) for computer-aided design (CAD) for digital signal processing (DSP) systems. Further details on the code synthesis process of NeuroGRS are beyond the scope of this paper. For details on the DSPCAD Framework, which provides an important foundation for NeuroGRS’s code synthesis capability, we refer the reader to [24].

## 4. Datasets

Our experiments with NeuroGRS use calcium imaging data that was generated in an open-field experiment. The experiment was focused on studying general locomotion activity levels of a group of mice [25]. NeuroGRS takes extracted neural signals from calcium imaging as inputs. Examples of input neural signals (neurons 0 to 9) and behavior labels are shown in Fig. 2. Neuron signals were extracted using the method described in [25]. Each dataset corresponds to a single calcium imaging session. In this work, we used 9 datasets from the first three imaging sessions (e1, e2, e3) on three different mice, identified as mouse 04 (m04), mouse 05 (m05) and mouse 06 (m06). The datasets associated with these three mice have calcium imaging traces involving 273, 140, and 114 neurons, respectively.

Fig. 2

Example of input neural signals and behavior labels.

Each dataset includes 3000 samples, where each sample corresponds to a given instant in time $t$ during the experiment, and has the form of a vector ${v}_{t}$ that is indexed by the neurons extracted from the corresponding calcium imaging trace. For a given neuron $n$, ${v}_{t}\left[n\right]$ gives the measured signal value associated with $n$ at time $t$. The label associated with a sample ${v}_{t}$ indicates whether or not the mouse is engaged in fine motion at time $t$. In this context, fine motion is defined to mean that the mouse is moving at a speed $s\left(t\right)$, where $0.2\ \mathrm{cm/s}<s\left(t\right)<2\ \mathrm{cm/s}$. The label value “1” indicates that the mouse is engaged in fine motion, whereas “0” indicates that the mouse does not exhibit fine motion. The sampling interval used across the 3000 samples in each dataset is 100 milliseconds (10 Hz sample rate).
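As a small worked example of the labeling rule and the sampling parameters above, the following Python sketch derives a label from a speed value and computes the recording duration covered by one dataset. The function name and the example speeds are hypothetical.

```python
def fine_motion_label(speed_cm_s: float) -> int:
    """Return 1 if 0.2 cm/s < s(t) < 2 cm/s (fine motion), otherwise 0."""
    return 1 if 0.2 < speed_cm_s < 2.0 else 0


# Made-up speeds: stationary, fine motion, fine motion, coarse motion.
labels = [fine_motion_label(s) for s in (0.05, 0.8, 1.9, 3.5)]

# 3000 samples at a 10 Hz sample rate cover 300 seconds of recording.
num_samples = 3000
sample_rate_hz = 10
duration_s = num_samples / sample_rate_hz
```

Both bounds are strict, so a speed of exactly 0.2 cm/s or 2 cm/s receives the label 0.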

In our experiments, we apply NeuroGRS independently to each of the 9 datasets described above. To apply NeuroGRS to a given dataset, we first partition the dataset into three independent portions, which results in three smaller datasets. These smaller datasets are referred to as $D_S$, $D_V$, and $D_E$. The dataset $D_S$ is used as a starting point for constructing the training dataset $D_T$, and the dataset $D_V$ is used for validation during pruning, as described in Section 3.4 and Section 3.6. The third dataset $D_E$ (the subscript “E” stands for “evaluation”) is used for testing. The dataset $D_E$ is not used in the process of pruning and design selection; it is only used to evaluate the performance of the pruned models that are derived by NeuroGRS. The size ratio used when partitioning each dataset is $|D_S| : |D_V| : |D_E| = 8:1:1$, where $|d|$ denotes the number of samples in dataset $d$.
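As a rough illustration of the 8:1:1 partition, the following sketch splits a list of samples into $D_S$, $D_V$, and $D_E$. The text does not state how samples are assigned to the three portions, so the contiguous, time-ordered split used here is an assumption, and the function name `partition_8_1_1` is hypothetical.

```python
def partition_8_1_1(samples):
    """Split samples into (D_S, D_V, D_E) with a size ratio of 8:1:1.

    ASSUMPTION: the split is contiguous in the original sample order.
    The text only specifies the size ratio, not the assignment rule.
    """
    n = len(samples)
    n_v = n // 10           # validation portion
    n_e = n // 10           # evaluation (test) portion
    n_s = n - n_v - n_e     # remainder seeds the training set
    d_s = samples[:n_s]
    d_v = samples[n_s:n_s + n_v]
    d_e = samples[n_s + n_v:]
    return d_s, d_v, d_e


# Each dataset has 3000 samples; integer indices stand in for real samples.
d_s, d_v, d_e = partition_8_1_1(list(range(3000)))
print(len(d_s), len(d_v), len(d_e))
```

For a 3000-sample dataset this yields portions of 2400, 300, and 300 samples.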

The original datasets involving mouse locomotion are imbalanced in that the fine motion class (label 1) is represented less frequently than class 0: the minimum, maximum, and average percentages of label 1 among the 9 datasets are 13%, 45.3%, and 26.1%, respectively. Bias caused by imbalanced datasets during training can lead machine learning algorithms to ignore or excessively suppress the minority class. To minimize such bias, balanced data is extracted from the $D_S$ set associated with each dataset when deriving the associated training dataset $D_T$. We evaluate and test model performance using the original distribution of a given dataset; that is, we do not modify the imbalance ratios of $D_V$ and $D_E$.

A random undersampling method is used to select a proper subset $D_0 \subset D_S$ of the original samples from $D_S$ having label 0 such that $|D_0| = |D_1|$, where $D_1$ is the set of samples in $D_S$ having label 1. The training dataset $D_T$, as represented in Fig. 1, is then derived as $D_T = D_0 \cup D_1$. The datasets $D_V$ and $D_E$ are taken from the original datasets without application of undersampling.
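The undersampling step can be sketched as follows. This is an illustrative implementation under stated assumptions rather than the code used in NeuroGRS: the function name `undersample`, the `(vector, label)` sample representation, the fixed seed, and the final shuffle are all hypothetical choices.

```python
import random


def undersample(d_s, seed=0):
    """Build a balanced training set D_T = D_0 ∪ D_1 from D_S.

    d_s: list of (vector, label) pairs with label in {0, 1}.
    ASSUMPTION: label 0 is the majority class, as in the datasets described
    in the text; random.sample raises ValueError otherwise.
    """
    rng = random.Random(seed)
    d_1 = [s for s in d_s if s[1] == 1]          # all minority-class samples
    zeros = [s for s in d_s if s[1] == 0]        # majority-class candidates
    d_0 = rng.sample(zeros, len(d_1))            # random subset with |D_0| = |D_1|
    d_t = d_0 + d_1
    rng.shuffle(d_t)                             # ASSUMPTION: mix classes before training
    return d_t


# Toy D_S with 30 label-0 and 10 label-1 samples (placeholder vectors).
toy_d_s = [([float(i)], 0) for i in range(30)] + [([float(i)], 1) for i in range(10)]
d_t = undersample(toy_d_s)
print(len(d_t), sum(label for _, label in d_t))
```

Only $D_S$ is passed through this step; $D_V$ and $D_E$ keep their original class distribution.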

As mentioned in Section 3.1, NeuroGRS can operate on MLP models as well as CNN models. Vectors of neuron signals are applied directly as inputs to MLP models in NeuroGRS. On the other hand, for CNN models, neuron signals are first rearranged into a 2D square matrix $\Gamma$ using a spatial arrangement algorithm. One objective of the rearrangement is to keep the input small to reduce computational requirements. Another objective is to associate spatial relationships in the neural signals according to neuron position information that is available in the datasets.

In the input rearrangement algorithm for CNN models, neurons are first sorted in increasing order according to the sum $x(\mu) + y(\mu)$, where $(x(\mu), y(\mu))$ gives the spatial coordinates associated with a given neuron $\mu$. $\Gamma$ is an $n \times n$ matrix, where $n = 1$ initially, and $n$ is increased as the algorithm operates until all of the neurons have been inserted into $\Gamma$. Each neuron occupies a distinct matrix element within $\Gamma$. At the end of the insertion process, any vacant elements in $\Gamma$ have the value 0.

Algorithm 3 sketches the procedure used to transfer neurons from the sorted list $L$ of $(x(\mu) + y(\mu))$ values into $\Gamma$, while progressively increasing the size of $\Gamma$ as needed. The function *createMatrix* creates a $1 \times 1$ matrix containing a single, zero-valued element. The function *pop* removes and returns the first element from a given list. The function *extendMatrix* takes as arguments the matrix $\Gamma$ along with the current matrix size $n$. The function enlarges $\Gamma$ to be an $(n+1) \times (n+1)$ matrix by adding a row of zeros as the new topmost row, and a column of zeros as the new rightmost column. The data in the lower left $n \times n$ submatrix of the enlarged version of $\Gamma$ is unchanged. Indexing of matrix rows and columns in Algorithm 3 starts at 0.
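The helper functions described above can be sketched in a few lines. The functions `create_matrix` and `extend_matrix` follow the textual descriptions of *createMatrix* and *extendMatrix*. The driver `arrange`, and in particular the order in which newly created vacant elements are filled, is an assumption made only for illustration and should not be read as a reproduction of Algorithm 3.

```python
def create_matrix():
    """Create a 1x1 matrix holding a single zero-valued element."""
    return [[0]]


def extend_matrix(gamma, n):
    """Grow an n x n matrix to (n+1) x (n+1).

    A row of zeros becomes the new topmost row and a column of zeros the new
    rightmost column, so the old data forms the lower-left n x n submatrix.
    """
    assert len(gamma) == n and all(len(row) == n for row in gamma)
    return [[0] * (n + 1)] + [row + [0] for row in gamma]


def arrange(neurons):
    """Place neuron signal values into a square matrix.

    neurons: list of (x, y, value) tuples, one per neuron.
    """
    # Sort in increasing order of x + y, as described in the text.
    pending = sorted(neurons, key=lambda nrn: nrn[0] + nrn[1])
    gamma, n = create_matrix(), 1
    vacant = [(0, 0)]  # (row, column) positions that are still unoccupied
    while pending:
        if not vacant:
            gamma = extend_matrix(gamma, n)
            n += 1
            # ASSUMPTION (hypothetical fill order): new top row from left to
            # right, then the remaining new right column from top to bottom.
            vacant = [(0, c) for c in range(n)] + [(r, n - 1) for r in range(1, n)]
        _, _, value = pending.pop(0)  # "pop": remove and return the first element
        row, col = vacant.pop(0)
        gamma[row][col] = value
    return gamma


# Five toy neurons (x, y, signal value); coordinates and values are made up.
demo = arrange([(0, 0, 1.0), (1, 0, 2.0), (0, 2, 3.0), (2, 2, 4.0), (3, 3, 5.0)])
for row in demo:
    print(row)
```

In this sketch the matrix is extended only when it is full, so the final size is the smallest $n$ with $n^2$ at least the number of neurons, and the remaining elements keep the value 0.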

## 5. Experiments and Results

We experiment with four different overparameterized models as the inputs $\{M_i\}$ to NeuroGRS. These include two MLP models, nn1 and nn2, and two CNN models, cnn1 and cnn2. The structure of each initial model is summarized in Fig. 3. The difference between nn1 and nn2, or between cnn1 and cnn2, is the number of layers. We studied different numbers of layers in this context to explore how the number of layers impacts the pruning results derived by NeuroGRS, and to demonstrate that NeuroGRS can effectively prune models with different numbers of layers. In Section 5.2, we report on pruning experiments that demonstrate the effectiveness of NeuroGRS in deriving compact DNN models for calcium-image-based prediction. In Section 5.3, we perform comparison experiments between Algorithm GRS, the structured pruning stage of NeuroGRS, and the two baseline methods, NWM and RRS, discussed in Section 3.5.

Fig. 3

Structure of input DNN models.

The models $\{M_i\}$ can be viewed as models that are representative of what neuromodulation system designers would apply as input to a tool such as NeuroGRS. We design the models $\{M_i\}$ for the experiments conducted in this research because, to the best of our knowledge, there are no publicly available, baseline DNN models that are suitable for real-time calcium-imaging-based behavior prediction on resource-constrained platforms. Popular, state-of-the-art DNN models in the literature, such as ResNet or VGGNet, focus on very different kinds of classification problems on inputs of much larger dimensionality. In our neural signal processing context, these models do not match well with our objective of deriving streamlined models from calcium imaging signals.

Thus, to investigate how different structures are handled by NeuroGRS, we develop four moderately sized models (two MLP models and two CNN models) as our initial models. These initial models, $\{M_i\}$, were carefully designed to reach acceptable prediction accuracy on our datasets without excessive model complexity. These models are suitable for evaluating the effectiveness of NeuroGRS since they are designed from the beginning with compactness as a design objective. Further improvements in compactness provided by NeuroGRS are achieved on top of compactness properties that are inherent in the careful design of the initial models.

In Section 5.4, we apply the inference model synthesis capabilities described in Section 3.7, and we experiment with the synthesized models on two resource-constrained platforms: a laptop computer and a Raspberry Pi. Whereas the inference experiments in Section 5.2 and Section 5.3 are performed using Keras [26], the experiments in Section 5.4 are performed using models derived from the synthesis capabilities of NeuroGRS.

### 5.1. Common Experiment Parameters

This section briefly summarizes settings that are common to the experiments presented in Section 5.2 and Section 5.3. All of the training and retraining in NeuroGRS uses the Adam optimizer [27]. For training, the batch size is set to 32, and the number of epochs is 150 for each of the four input models nn1, nn2, cnn1, and cnn2. The number of epochs is set to 50 when retraining intermediate models during the pruning process. A decay ratio of 0.95 is applied to the initial dropout ratio of 0.5 at each retraining iteration as the model is shrinking. The results presented for each dataset/model combination are averaged across 10 independent trials.

As mentioned in Section 3, the accuracy drop tolerances (ADTs) of GRS, Pruning Stage T, and Stage Q are set to $\mathcal{T}_{GRS} = 0.985$, $\mathcal{T}_{T} = 0.995$, and $\mathcal{T}_{Q} = 0.990$, respectively. These settings ensure that the accuracy after all three stages is no less than $\mathcal{T}_{GRS} \times \mathcal{T}_{T} \times \mathcal{T}_{Q} \approx 0.970$ times the original accuracy $ACC_{\text{original}}$. We have empirically tuned $\mathcal{T}_{GRS}$, $\mathcal{T}_{T}$, and $\mathcal{T}_{Q}$, and found that GRS is more sensitive to the ADT than Stage T and Stage Q. Additionally, we found that Stage Q is more sensitive to the ADT than Stage T. In other words, if we allocate slightly more ADT to GRS than to Stage T and Stage Q, then the result (in terms of the compactness of the pruned model) is typically better than if we decompose the overall ADT equally across all three stages. Decomposing an overall ADT level $Z$ equally in this context means assigning an ADT of $Z^{1/3}$ to each of the three stages.

Based on the trends that we observed empirically, we provided slightly more ADT to GRS compared to Stage Q, and slightly more ADT to Stage Q compared to Stage T. For our application, we estimated that an overall ADT of 0.97 is acceptable, meaning that the accuracy of the pruned model should be no less than $0.97 \times ACC_{\text{original}}$. For other applications, the three ADT parameters can be tuned based on specific runtime and accuracy requirements. As described above, the tuning process was performed empirically in our study; that is, we tuned the ADTs by iterating through a series of experiments. Systematic optimization of the parameters $\mathcal{T}_{GRS}$, $\mathcal{T}_{T}$, and $\mathcal{T}_{Q}$ is an interesting direction for future work.
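The arithmetic behind these settings can be checked with a short calculation. The variable names below are illustrative; the three tolerance values and the overall level of 0.97 are those quoted in this section.

```python
# Per-stage accuracy drop tolerances quoted in the text.
t_grs, t_t, t_q = 0.985, 0.995, 0.990

# Overall bound: product of the three per-stage tolerances.
overall = t_grs * t_t * t_q

# Equal decomposition of an overall ADT level Z = 0.97: Z ** (1/3) per stage.
z = 0.97
equal_share = z ** (1.0 / 3.0)

print(f"overall tolerance       : {overall:.4f}")
print(f"equal per-stage share   : {equal_share:.4f}")
print(f"GRS gets more tolerance : {t_grs < equal_share}")
print(f"T gets less tolerance   : {t_t > equal_share}")
```

A smaller tolerance value permits a larger accuracy drop, so 0.985 for GRS is looser than the equal share of about 0.9899, while 0.995 for Stage T is tighter.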

### 5.2. Pruning Experiments

Pruning experiments with the GRS Algorithm and the GRS+TQ combination are performed on nn1, nn2, cnn1, and cnn2 to obtain compact versions of these models while keeping the prediction accuracy above a specified minimum value. As mentioned in Section 3.4, the accuracy tolerances for Stages GRS, T, and Q were set to 0.985, 0.995, and 0.990, respectively. The results for nn1, nn2, cnn1, and cnn2 are summarized in Table 1, Table 2, Table 3, and Table 4, respectively.

### Table 1

Results of pruning experiments with nn1 as the input model.

| Dataset | Acc_S (loss%) | FLOPs_S (% of initial) | Params_S (% of initial) | Acc_U (loss%) | FLOPs_U (% of initial) | Params_U (% of initial) |
|---|---|---|---|---|---|---|
| m04e1 | 0.967 (0.55%) | 6425 (34.19%) | 3234 (34.21%) | 0.966 (0.68%) | 2803 (14.91%) | 202 (2.14%) |
| m04e2 | 0.949 (0.17%) | 5068 (26.97%) | 2554 (27.01%) | 0.95 (−0.01%) | 1968 (10.47%) | 393 (4.16%) |
| m04e3 | 0.915 (2.23%) | 2743 (14.6%) | 1384 (14.64%) | 0.911 (2.61%) | 1211 (6.45%) | 326 (3.45%) |
| m05e1 | 0.934 (2.1%) | 3392 (32.99%) | 1716 (33.01%) | 0.929 (2.62%) | 2325 (22.61%) | 235 (4.53%) |
| m05e2 | 0.901 (0.4%) | 4461 (43.39%) | 2255 (43.37%) | 0.897 (0.92%) | 3152 (30.65%) | 484 (9.33%) |
| m05e3 | 0.897 (−0.06%) | 5635 (54.8%) | 2849 (54.78%) | 0.891 (0.58%) | 4392 (42.72%) | 214 (4.12%) |
| m06e1 | 0.937 (1.4%) | 3981 (46.19%) | 2017 (46.17%) | 0.928 (2.32%) | 3164 (36.72%) | 344 (7.9%) |
| m06e2 | 0.907 (2.0%) | 3823 (44.36%) | 1938 (44.36%) | 0.904 (2.4%) | 2918 (33.86%) | 396 (9.09%) |
| m06e3 | 0.938 (0.49%) | 3237 (37.56%) | 1642 (37.6%) | 0.933 (0.95%) | 2055 (23.84%) | 295 (6.77%) |

### Table 2

Results of pruning experiments with nn2 as the input model.

| Dataset | Acc_S (loss%) | FLOPs_S (% of initial) | Params_S (% of initial) | Acc_U (loss%) | FLOPs_U (% of initial) | Params_U (% of initial) |
|---|---|---|---|---|---|---|
| m04e1 | 0.974 (0.24%) | 5894 (33.47%) | 2962 (33.51%) | 0.972 (0.44%) | 3529 (20.04%) | 28 (0.32%) |
| m04e2 | 0.953 (0.21%) | 6224 (35.34%) | 3127 (35.38%) | 0.952 (0.28%) | 2796 (15.88%) | 43 (0.49%) |
| m04e3 | 0.921 (2.09%) | 7049 (40.03%) | 3541 (40.06%) | 0.918 (2.41%) | 5025 (28.54%) | 55 (0.63%) |
| m05e1 | 0.94 (0.83%) | 6086 (66.91%) | 3069 (66.94%) | 0.94 (0.83%) | 4128 (45.38%) | 60 (1.33%) |
| m05e2 | 0.906 (0.47%) | 6370 (70.03%) | 3212 (70.06%) | 0.903 (0.76%) | 4474 (49.18%) | 270 (5.91%) |
| m05e3 | 0.894 (0.33%) | 7023 (77.21%) | 3541 (77.23%) | 0.89 (0.74%) | 5399 (59.35%) | 276 (6.03%) |
| m06e1 | 0.941 (0.38%) | 4881 (65.67%) | 2466 (65.71%) | 0.939 (0.56%) | 3878 (52.17%) | 277 (7.39%) |
| m06e2 | 0.919 (0.54%) | 5414 (72.85%) | 2735 (72.88%) | 0.918 (0.69%) | 4136 (55.65%) | 108 (2.89%) |
| m06e3 | 0.93 (0.46%) | 3813 (51.31%) | 1927 (51.36%) | 0.931 (0.36%) | 2647 (35.62%) | 66 (1.78%) |

### Table 3

Results of pruning experiments with cnn1 as the input model.

| Dataset | Acc_S (loss%) | FLOPs_S (% of initial) | Params_S (% of initial) | Acc_U (loss%) | FLOPs_U (% of initial) | Params_U (% of initial) |
|---|---|---|---|---|---|---|
| m04e1 | 0.969 (1.09%) | 24907 (44.59%) | 12504 (44.6%) | 0.961 (1.97%) | 23304 (41.72%) | 872 (3.11%) |
| m04e2 | 0.952 (−0.33%) | 12863 (23.02%) | 6463 (23.06%) | 0.945 (0.33%) | 11987 (21.46%) | 1507 (5.38%) |
| m04e3 | 0.913 (2.63%) | 14041 (25.14%) | 7051 (25.15%) | 0.906 (3.34%) | 13197 (23.62%) | 2034 (7.26%) |
| m05e1 | 0.881 (4.03%) | 1677 (7.6%) | 862 (7.74%) | 0.887 (3.36%) | 1541 (6.98%) | 370 (3.33%) |
| m05e2 | 0.854 (0.62%) | 3029 (13.73%) | 1545 (13.88%) | 0.853 (0.75%) | 2533 (11.48%) | 505 (4.54%) |
| m05e3 | 0.851 (1.58%) | 5989 (27.14%) | 3034 (27.25%) | 0.85 (1.73%) | 5469 (24.78%) | 1480 (13.3%) |
| m06e1 | 0.898 (2.65%) | 2061 (9.34%) | 1057 (9.49%) | 0.885 (4.0%) | 1847 (8.37%) | 335 (3.01%) |
| m06e2 | 0.867 (−4.29%) | 8318 (37.69%) | 4204 (37.75%) | 0.858 (−3.26%) | 7706 (34.91%) | 570 (5.12%) |
| m06e3 | 0.916 (0.88%) | 3149 (14.27%) | 1601 (14.38%) | 0.913 (1.17%) | 2870 (13.01%) | 360 (3.24%) |

### Table 4

Results of pruning experiments with cnn2 as the input model.

| Dataset | Acc_S (loss%) | FLOPs_S (% of initial) | Params_S (% of initial) | Acc_U (loss%) | FLOPs_U (% of initial) | Params_U (% of initial) |
|---|---|---|---|---|---|---|
| m04e1 | 0.972 (0.88%) | 12154 (22.23%) | 6107 (22.28%) | 0.97 (1.05%) | 11721 (21.44%) | 1375 (5.02%) |
| m04e2 | 0.952 (0.35%) | 23644 (43.25%) | 11865 (43.28%) | 0.953 (0.22%) | 23162 (42.37%) | 8572 (31.28%) |
| m04e3 | 0.93 (1.76%) | 6567 (12.01%) | 3308 (12.07%) | 0.93 (1.78%) | 5918 (10.82%) | 1310 (4.78%) |
| m05e1 | 0.93 (1.24%) | 8283 (39.67%) | 4183 (39.76%) | 0.925 (1.79%) | 6965 (33.36%) | 1748 (16.63%) |
| m05e2 | 0.881 (0.52%) | 6231 (29.85%) | 3154 (29.98%) | 0.882 (0.44%) | 5988 (28.68%) | 2341 (22.27%) |
| m05e3 | 0.862 (1.89%) | 7743 (37.09%) | 3915 (37.21%) | 0.859 (2.27%) | 6835 (32.74%) | 1105 (10.51%) |
| m06e1 | 0.917 (3.13%) | 5035 (24.12%) | 2551 (24.25%) | 0.918 (3.05%) | 4736 (22.68%) | 982 (9.34%) |
| m06e2 | 0.871 (3.04%) | 7453 (35.7%) | 3767 (35.8%) | 0.866 (3.56%) | 6514 (31.2%) | 1249 (11.88%) |
| m06e3 | 0.936 (0.7%) | 7938 (38.02%) | 4010 (38.12%) | 0.937 (0.55%) | 7616 (36.48%) | 3103 (29.52%) |

For each of the four input models, results are provided for the 9 datasets described in Section 4. The columns labeled Acc_S, FLOPs_S, and Params_S provide the accuracy, FLOP count, and number of parameters after the GRS stage of NeuroGRS. Here, the “_S” suffix stands for “Structured pruning”. The parenthetic term “(loss %)” gives the loss in accuracy compared to the original, unpruned model. This accuracy loss is measured as $(\alpha_o - \alpha_p)/\alpha_o$, where $\alpha_o$ and $\alpha_p$ represent the accuracy levels of the original and pruned models, respectively. Similarly, the parenthetic term “(% of initial)” gives a measure of the FLOP count or parameter count of the pruned model relative to the original model. For example, the “% of initial” value of 44.59% for the FLOP count of model cnn1 on dataset m04e1 (Table 3) means that GRS reduced the FLOP count by 55.41% for this model/dataset combination.
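The two parenthetic measures can be written as small helper functions. The function names are illustrative. The FLOP counts in the example are taken from the tables for nn2 on m04e1 (FLOPs_S in Table 2 and FLOPs_O in Table 10); the accuracy values in the second example are hypothetical and chosen only to show that an accuracy gain yields a negative loss.

```python
def accuracy_loss_pct(acc_original, acc_pruned):
    """Accuracy loss in percent: 100 * (alpha_o - alpha_p) / alpha_o."""
    return 100.0 * (acc_original - acc_pruned) / acc_original


def pct_of_initial(pruned_count, original_count):
    """FLOP or parameter count of the pruned model as a percentage of the original."""
    return 100.0 * pruned_count / original_count


# nn2 on m04e1: 5894 FLOPs after GRS versus 17609 FLOPs for the original model.
flops_pct = pct_of_initial(5894, 17609)
print(f"% of initial FLOPs: {flops_pct:.2f}%  (reduction: {100.0 - flops_pct:.2f}%)")

# Hypothetical accuracies: the pruned model is slightly more accurate,
# so the reported loss is negative.
print(f"loss: {accuracy_loss_pct(0.900, 0.909):.2f}%")
```

Note that the tabulated percentages are averages over 10 trials, so recomputing them from the rounded table entries may differ in the last digit.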

Similarly, the columns labeled Acc_U, FLOPs_U, and Params_U provide the accuracy, FLOP count, and number of parameters after GRS+TQ; the “_U” suffix indicates that the effects of “Unstructured pruning” are included in these results.

Measurements of FLOP counts and parameter counts are based on the TensorFlow library [28]. Each trial includes an independent execution through the entire dataflow of NeuroGRS, starting with the training of the input model, with the exception that the Design Selection block is disabled. Instead of applying Design Selection, we collect and report the results for all four pairs $(P(M_i), \mathbf{v}_i)$ (see Fig. 1).

The results in Table 1 through Table 4 show that NeuroGRS can prune all of the four input models into very compact DNN models without much loss in testing accuracy. Some of the results even show that higher accuracy is achieved by the pruned models compared to the corresponding original models (e.g., see the Acc_S result for Dataset m05e3 in Table 1). Such increases in accuracy are indicated by negative “loss %” values in the tables. It has been argued that pruning can sometimes increase accuracy in this way because pruned structures are simpler and help to reduce overfitting [19].

To summarize and compare the effectiveness of NeuroGRS on all four input DNN models on calcium imaging data, the results from Table 1 through Table 4 are aggregated in Table 5, which shows average results for each of the four input models across all 9 datasets. As shown in Table 5, all of the pruned models maintain most of the accuracy provided by the corresponding original models. Additionally, the MLP models (nn1 and nn2) are seen to suit our datasets better, with higher testing accuracy, lower parameter counts, and lower FLOP counts. These trends are seen in Table 5 for both structured-only (GRS) and structured+unstructured (GRS+TQ) pruning modes. Overall, the results demonstrate the effectiveness of NeuroGRS in consistently providing optimization capability across a diverse set of input models and datasets.

### Table 5

Summary of pruning experiments over all four input models.

| Model | Acc_S (loss%) | FLOPs_S (% of initial) | Params_S (% of initial) | Acc_U (loss%) | FLOPs_U (% of initial) | Params_U (% of initial) |
|---|---|---|---|---|---|---|
| nn1 | 0.927 (1.03%) | 4307 (37.23%) | 2177 (37.24%) | 0.923 (1.45%) | 2665 (24.69%) | 321 (5.72%) |
| nn2 | 0.931 (0.62%) | 5861 (56.98%) | 2953 (57.01%) | 0.929 (0.79%) | 4001 (40.2%) | 131 (2.97%) |
| cnn1 | 0.9 (0.98%) | 8448 (22.5%) | 4258 (22.59%) | 0.895 (1.49%) | 7828 (20.7%) | 893 (5.37%) |
| cnn2 | 0.917 (1.5%) | 9450 (31.33%) | 4762 (31.42%) | 0.915 (1.63%) | 8828 (28.86%) | 2421 (15.69%) |

### 5.3. Comparison Among GRS, NWM, and RRS

In this section, we present an experimental comparison of GRS with the NWM and RRS methods for structured pruning, which were described as baseline methods in Section 3.5. The comparison experiments are performed on all four initial models $\{M_i\}$.

Table 6 shows aggregated results for each of the four input models across all 9 datasets. More detailed experimental results for the four input models across these datasets are provided in the appendix. The abbreviations AL, FCI, and PCI stand, respectively, for Accuracy Loss, FLOP Count Improvement, and Parameter Count Improvement, where the losses and improvements are interpreted with respect to the original (unpruned) model. For example, the AL difference of GRS vs. NWM for a given input model $M_i$ is computed from the nine values defined by:

$$\frac{\left(\text{Acc\_O}\left(d\right)-\text{Acc\_G}\left(d\right)\right)}{\text{Acc\_O}\left(d\right)}-\frac{\left(\text{Acc\_O}\left(d\right)-\text{Acc\_N}\left(d\right)\right)}{\text{Acc\_O}\left(d\right)}=\frac{\left(\text{Acc\_N}\left(d\right)-\text{Acc\_G}\left(d\right)\right)}{\text{Acc\_O}\left(d\right)}$$

(1)

for $d \in D$, where $D = \{\text{m04e1}, \text{m04e2}, \dots, \text{m06e3}\}$ denotes our set of nine datasets, and $\text{Acc\_X}(d)$ represents the accuracy measured in our experiments for model $X$ on dataset $d$ for the given input model $M_i$. Here, $X = O$ denotes the original model (without any pruning), while $X = G$, $X = N$, and $X = R$ denote the pruned models derived by applying GRS, NWM, and RRS, respectively, to the original model. The AL difference values are reported as the maximum, minimum, and average of the nine values

$$\left\{\frac{\text{Acc\_N}\left(d\right)-\text{Acc\_G}\left(d\right)}{\text{Acc\_O}\left(d\right)}\mid d\in D\right\}.$$

(2)
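As a worked instance of Equations (1) and (2), the following sketch computes the AL difference statistics for GRS vs. NWM with nn1 as the input model, using the accuracy values listed for nn1 in Table 9 of the appendix. The names `al_difference` and `summarize` are illustrative helpers, not part of NeuroGRS.

```python
# Accuracy values for input model nn1, copied from Table 9:
# dataset -> (Acc_O, Acc_G, Acc_N)
ACC_NN1 = {
    "m04e1": (0.976, 0.967, 0.973),
    "m04e2": (0.951, 0.949, 0.956),
    "m04e3": (0.945, 0.915, 0.928),
    "m05e1": (0.947, 0.934, 0.944),
    "m05e2": (0.910, 0.901, 0.905),
    "m05e3": (0.903, 0.897, 0.905),
    "m06e1": (0.955, 0.937, 0.936),
    "m06e2": (0.931, 0.907, 0.926),
    "m06e3": (0.944, 0.938, 0.938),
}


def al_difference(acc_o, acc_g, acc_n):
    """AL difference of GRS vs. NWM for one dataset: (Acc_N - Acc_G) / Acc_O."""
    return (acc_n - acc_g) / acc_o


def summarize(values):
    """Return (min, max, average) of a collection of values."""
    values = list(values)
    return min(values), max(values), sum(values) / len(values)


diffs = [al_difference(*row) for row in ACC_NN1.values()]
al_min, al_max, al_avg = summarize(diffs)
print(f"min {100 * al_min:.2f}%  max {100 * al_max:.2f}%  avg {100 * al_avg:.2f}%")
```

Computed from the rounded table entries, the minimum, maximum, and average agree to two decimal places with the nn1 entries for the GRS vs. NWM AL difference in Table 6 (−0.10%, 2.04%, and 0.78%).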

### Table 6

Summary of comparison experiments over all four input models.

| Model | AL min (vs. NWM) | AL max (vs. NWM) | AL avg (vs. NWM) | FCI min (vs. NWM) | FCI max (vs. NWM) | FCI avg (vs. NWM) | PCI min (vs. NWM) | PCI max (vs. NWM) | PCI avg (vs. NWM) | AL min (vs. RRS) | AL max (vs. RRS) | AL avg (vs. RRS) | FCI min (vs. RRS) | FCI max (vs. RRS) | FCI avg (vs. RRS) | PCI min (vs. RRS) | PCI max (vs. RRS) | PCI avg (vs. RRS) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| nn1 | −0.10% | 2.04% | 0.78% | 8.13% | 60.59% | 24.18% | 8.33% | 60.52% | 24.22% | −0.32% | 1.93% | 0.87% | 19.96% | 67.49% | 41.45% | 19.91% | 67.43% | 41.41% |
| nn2 | −0.98% | 0.50% | 0.00% | −11.54% | 15.93% | 3.03% | −11.54% | 15.91% | 3.04% | −0.50% | 0.46% | −0.12% | −5.93% | 11.25% | 2.39% | −5.92% | 11.23% | 2.40% |
| cnn1 | −3.48% | 1.62% | −0.25% | 3.19% | 48.33% | 21.91% | 3.23% | 48.27% | 21.91% | −2.83% | 2.18% | 0.74% | 47.72% | 79.25% | 62.92% | 47.72% | 79.17% | 62.87% |
| cnn2 | 0.00% | 3.51% | 1.14% | 19.13% | 50.99% | 34.54% | 19.08% | 50.89% | 34.46% | 0.31% | 3.51% | 1.39% | 43.30% | 63.75% | 56.87% | 43.29% | 63.63% | 56.82% |

The FCI difference and PCI difference values reported in Table 6 are calculated by taking Equation 2 and replacing each $\text{Acc\_X}(d)$ term with $\text{FLOPs\_X}(d)$ or $\text{Params\_X}(d)$, respectively, for the corresponding model specifier X. For example, the FCI difference statistics for the GRS vs. RRS comparison are computed from the nine values

$$\left\{\frac{\text{FLOPs\_R}\left(d\right)-\text{FLOPs\_G}\left(d\right)}{\text{FLOPs\_O}\left(d\right)}\mid d\in D\right\}.$$

(3)

Note that from the formulation of the AL, FCI, and PCI differences, as described above, a positive value for the AL difference in Table 6 means that GRS performs worse than the associated baseline method, while positive values for the FCI and PCI differences indicate that GRS performs better than the associated baseline method.

The results show that for models nn1, cnn1, and cnn2, GRS provides large advantages on average in reducing FLOP count and parameter count compared to both NWM and RRS. In terms of average accuracy, one case (GRS vs. NWM for nn2) shows no change and two cases (GRS vs. NWM for cnn1 and GRS vs. RRS for nn2) show a slight improvement with GRS; in all other cases, the average accuracy for GRS is worse than that of NWM or RRS by a small percentage (0.74%–1.39%).

For nn2, the trend is different: GRS still improves the average FLOP count and parameter count relative to both baselines, but the magnitude of the improvement is small (about 2% to 3%). This is because nn2 contains only one hidden layer, whereas the advantage of GRS stems from its ability to flexibly apply different inter-layer orderings in the pruning process.

By looking at the two “min” columns for the FCI difference and the two “min” columns for the PCI difference in Table 6, we see that GRS provides highly consistent improvements in FLOP count and parameter count for models nn1, cnn1, and cnn2: for each of these three models, improvements in FLOP and parameter counts are delivered by GRS on all 9 datasets.

### 5.4. Resource-Constrained Inference

To demonstrate the utility of NeuroGRS's structured pruning capabilities for neural signal processing on resource-constrained platforms, we present experiments in this section using a moderate-complexity laptop computer platform and a low-complexity Raspberry Pi platform. Both platforms are used in isolation, without any cloud computing support. These experiments also demonstrate the code synthesis capabilities presented in Section 3.7. Our experiments in this section do not employ specialized hardware/software subsystems that are needed to exploit unstructured pruning. Therefore, we focus only on applying GRS pruning in these experiments and disable the TQ part of the NeuroGRS dataflow.

First, we employ a laptop computer platform, based on a single, moderately priced device, the Intel Core i7 7700HQ CPU. The experiment does not apply cloud computing or high-end servers, which are often not appropriate platforms for real-time neural signal processing systems, as discussed in Section 1. On the targeted CPU platform, we execute pruned model implementations that are generated using NeuroGRS together with its code synthesis capabilities. For comparison, we also implement the corresponding original model for execution on the targeted device. This implementation is generated with NeuroGRS's code synthesis capability using an option that bypasses the entire pruning process, and just generates code for the given input model.

The execution time results reported in this section only pertain to the inference code that is synthesized by NeuroGRS, not to the execution of the pruning algorithms and code synthesis within NeuroGRS. The implementations that are evaluated in this section are single-threaded since our focus is to isolate the benefit provided by the proposed pruning methods. Application of the proposed methods to multi-threaded implementations is an interesting direction for further study.

We report only the runtime associated with inference computations, and omit startup overhead that is required to initialize the model. In particular, we give results on the average runtime required to perform a prediction operation on a single input image frame once the system has been initialized to process an arbitrary number of frames. The resulting runtime measurements can be used to assess the runtime cost of prediction in relation to the frame rate of 10 Hz, which is associated with our collection of datasets (see Section 4).

The results on runtime evaluation are shown in Table 7. The results are tabulated for the original model and the pruned model that results from GRS. Each value represents the average measured runtime to process a single input image frame, as described above. Each experiment is executed on all nine datasets. For each dataset, GRS is executed 10 times to derive 10 pruned models. Recall that randomization is applied within the GRS algorithm, which means that the solution derived by GRS is not unique. To account for this variation, we average across 10 different pruning solutions for each dataset. Runtime is measured by executing each pruned model on each dataset 100 times, and averaging the resulting $9 \times 10 \times 100 = 9000$ measurements to derive the corresponding value shown in Table 7. For the original model, we average across the same number of executions (9000) for consistency, even though the original model has no difference in structure among the 10 independent executions for each dataset.

### Table 7

Results of inference experiments over all four input models for the laptop computer platform.

| Model | Inference runtime, original model (ms) | Inference runtime, pruned model (ms) | Improvement |
|---|---|---|---|
| nn1 | 0.048 | 0.027 | 43.36% |
| nn2 | 0.044 | 0.029 | 34.28% |
| cnn1 | 0.608 | 0.183 | 69.81% |
| cnn2 | 0.656 | 0.209 | 68.08% |

Note that the magnitudes of the improvements in Table 7 are generally different from the improvements in FLOPs that are reported in Section 5.2 and Section 5.3. This is because the runtime includes overhead in executing the model, such as communication overhead between the DNN layers, which is not captured by the FLOP count assessments in Section 5.2 and Section 5.3. Nevertheless, the results in Table 7 represent large improvements delivered by GRS.

Next, we employ a Raspberry Pi platform (Raspberry Pi Zero W V1.1). The experimental setup is identical to that used for the laptop-targeted experiments, as described above, except that the target platform is different. For example, we again average across 9000 measurements to derive each reported value. The results are shown in Table 8. As with the laptop computer platform, the results show that NeuroGRS provides large improvements. The improvements for the Raspberry Pi are slightly to moderately larger than those reported for the laptop platform, depending on the input DNN model.

### Table 8

Results of inference experiments over all four input models for the Raspberry Pi platform.

| Model | Inference runtime, original model (ms) | Inference runtime, pruned model (ms) | Improvement |
|---|---|---|---|
| nn1 | 1.203 | 0.572 | 52.44% |
| nn2 | 1.074 | 0.666 | 37.97% |
| cnn1 | 16.277 | 4.646 | 71.46% |
| cnn2 | 16.139 | 5.031 | 68.83% |

With a 10 Hz frame rate, the runtimes of the original models on the laptop are already well within the frame interval (0.1 s), and the absolute reductions provided by pruning are negligible in comparison to the frame interval, although they are significant in a relative sense. The impact of runtime reduction is more significant on the Raspberry Pi, as shown in Table 8, especially for larger DNNs, such as cnn1 and cnn2. For these two DNNs, we see improvements of 11.631 ms and 11.108 ms, respectively. The runtime improvement percentages (relative improvement levels) on the laptop and Raspberry Pi are similar.
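The relationship between these runtimes and the frame interval can be made concrete with a short calculation over the Raspberry Pi measurements in Table 8. The dictionary layout and variable names are illustrative; the runtimes are the tabulated values and the 100 ms frame interval follows from the 10 Hz frame rate.

```python
FRAME_INTERVAL_MS = 100.0  # 10 Hz frame rate

# Raspberry Pi runtimes from Table 8: model -> (original ms, pruned ms)
RUNTIMES_MS = {
    "nn1": (1.203, 0.572),
    "nn2": (1.074, 0.666),
    "cnn1": (16.277, 4.646),
    "cnn2": (16.139, 5.031),
}

savings = {}
for model, (original_ms, pruned_ms) in RUNTIMES_MS.items():
    saved_ms = original_ms - pruned_ms                      # absolute reduction
    relative_pct = 100.0 * saved_ms / original_ms           # relative improvement
    budget_pct = 100.0 * pruned_ms / FRAME_INTERVAL_MS      # share of one frame interval
    savings[model] = saved_ms
    print(f"{model}: saves {saved_ms:.3f} ms ({relative_pct:.1f}%), "
          f"pruned model uses {budget_pct:.2f}% of the frame interval")
```

The relative improvements recomputed from the rounded runtimes are close to, but need not exactly match, the percentages listed in Table 8.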

The runtime improvement provided by NeuroGRS can be significant and useful under any of the following three scenarios: (1) when the application requires a higher frame rate; (2) when the application is deployed on a device with lower computational power, such as the Raspberry Pi device that we have experimented with; and (3) when the inference model is much more complex than the models that we have studied in our experiments. Also, considering other neural signal preprocessing steps that are needed for end-to-end application processing, the time reduction of the pruned model derived by NeuroGRS provides more time for other processing steps, which facilitates the operation of complex end-to-end applications in real time.

## 6. Conclusion

In this work, we have introduced a new structured pruning method called Greedy inter-layer order with Random Selection of intra-layer units (GRS) and combined it with two state-of-the-art unstructured pruning methods, which we refer to collectively as TQ. Additionally, we have developed a novel software tool, called NeuroGRS, which neural signal processing system designers can use to apply GRS or GRS+TQ with a high degree of automation. NeuroGRS includes capabilities not only for deriving compact DNN models but also for synthesizing efficient code for deploying those models on resource-constrained platforms. We have demonstrated the effectiveness of GRS, GRS+TQ, and NeuroGRS through extensive experiments involving nine calcium imaging datasets acquired from animal models. Useful directions for future work include extending NeuroGRS to support model types other than convolutional neural networks and multilayer perceptrons, and to exploit graphics processing unit and multicore CPU acceleration capabilities in the inference target platform.

## Acknowledgements

This work was supported by the NIH NINDS (R01NS110421) and the BRAIN Initiative.

## Appendix: Experiment Results on Separate Datasets

The results for nn1, nn2, cnn1, and cnn2 are summarized in Table 9, Table 10, Table 11, and Table 12, respectively. In each of these tables, the columns labeled Acc_X, FLOPs_X, and Params_X provide the accuracy, FLOP count, and number of parameters for the model represented by X. Here, $X = O$ denotes the original model (without any pruning), while $X = G$, $X = N$, and $X = R$ denote the pruned models derived by applying GRS, NWM, and RRS, respectively, to the original model.

### Table 9

Results of comparison experiments with nn1 as the input model.

Dataset | ACC_O | FLOPs_O | Params_O | ACC_G | FLOPs_G | Params_G | ACC_N | FLOPs_N | Params_N | ACC_R | FLOPs_R | Params_R |
---|---|---|---|---|---|---|---|---|---|---|---|---|
m04e1 | 0.976 | 9457 | 18795 | 0.967 | 6425 | 3234 | 0.973 | 10740 | 5406 | 0.976 | 15345 | 7718 |
m04e2 | 0.951 | 9457 | 18795 | 0.949 | 5068 | 2554 | 0.956 | 16455 | 8277 | 0.955 | 17752 | 8931 |
m04e3 | 0.945 | 9457 | 18795 | 0.915 | 2743 | 1384 | 0.928 | 5805 | 2930 | 0.930 | 11540 | 5804 |
m05e1 | 0.947 | 5201 | 10283 | 0.934 | 3392 | 1716 | 0.944 | 6944 | 3513 | 0.944 | 8434 | 4264 |
m05e2 | 0.910 | 5201 | 10283 | 0.901 | 4461 | 2255 | 0.905 | 6240 | 3159 | 0.908 | 7876 | 3982 |
m05e3 | 0.903 | 5201 | 10283 | 0.897 | 5635 | 2849 | 0.905 | 7235 | 3659 | 0.906 | 7930 | 4009 |
m06e1 | 0.955 | 4369 | 8619 | 0.937 | 3981 | 2017 | 0.936 | 4682 | 2381 | 0.939 | 5701 | 2887 |
m06e2 | 0.931 | 4369 | 8619 | 0.907 | 3823 | 1938 | 0.926 | 5773 | 2930 | 0.925 | 7995 | 4051 |
m06e3 | 0.944 | 4369 | 8619 | 0.938 | 3237 | 1642 | 0.938 | 4929 | 2498 | 0.935 | 6546 | 3318 |


### Table 10

Results of comparison experiments with nn2 as the input model.

Dataset | ACC_O | FLOPs_O | Params_O | ACC_G | FLOPs_G | Params_G | ACC_N | FLOPs_N | Params_N | ACC_R | FLOPs_R | Params_R |
---|---|---|---|---|---|---|---|---|---|---|---|---|
m04e1 | 0.979 | 17609 | 8841 | 0.974 | 5894 | 2962 | 0.976 | 8369 | 4204 | 0.971 | 6169 | 3100 |
m04e2 | 0.956 | 17609 | 8841 | 0.953 | 6224 | 3127 | 0.953 | 7109 | 3584 | 0.957 | 6598 | 3327 |
m04e3 | 0.942 | 17609 | 8841 | 0.921 | 7049 | 3541 | 0.923 | 6004 | 3017 | 0.916 | 7764 | 3901 |
m05e1 | 0.948 | 9097 | 4585 | 0.94 | 6086 | 3069 | 0.943 | 7535 | 3799 | 0.944 | 7109 | 3584 |
m05e2 | 0.906 | 9097 | 4585 | 0.906 | 6370 | 3212 | 0.901 | 5320 | 2683 | 0.903 | 5831 | 2941 |
m05e3 | 0.884 | 9097 | 4585 | 0.894 | 7023 | 3541 | 0.885 | 7109 | 3584 | 0.893 | 7166 | 3613 |
m06e1 | 0.950 | 7433 | 3753 | 0.941 | 4881 | 2466 | 0.940 | 4904 | 2478 | 0.938 | 4904 | 2478 |
m06e2 | 0.920 | 7433 | 3753 | 0.919 | 5414 | 2735 | 0.923 | 5020 | 2536 | 0.921 | 5159 | 2606 |
m06e3 | 0.934 | 7433 | 3753 | 0.93 | 3813 | 1927 | 0.935 | 4835 | 2443 | 0.925 | 4556 | 2302 |


### Table 11

Results of comparison experiments with cnn1 as the input model.

Dataset | ACC_O | FLOPs_O | Params_O | ACC_G | FLOPs_G | Params_G | ACC_N | FLOPs_N | Params_N | ACC_R | FLOPs_R | Params_R |
---|---|---|---|---|---|---|---|---|---|---|---|---|
m04e1 | 0.979 | 55863 | 28033 | 0.969 | 24907 | 12504 | 0.970 | 44280 | 22216 | 0.975 | 51567 | 25881 |
m04e2 | 0.951 | 55863 | 28033 | 0.952 | 12863 | 6463 | 0.948 | 31398 | 15762 | 0.949 | 50041 | 25112 |
m04e3 | 0.938 | 55863 | 28033 | 0.913 | 14041 | 7051 | 0.910 | 35429 | 17770 | 0.933 | 43015 | 21586 |
m05e1 | 0.919 | 22071 | 11137 | 0.881 | 1677 | 862 | 0.891 | 3131 | 1601 | 0.901 | 13265 | 6712 |
m05e2 | 0.863 | 22071 | 11137 | 0.854 | 3029 | 1545 | 0.858 | 5545 | 2817 | 0.869 | 19643 | 9914 |
m05e3 | 0.868 | 22071 | 11137 | 0.851 | 5989 | 3034 | 0.848 | 8888 | 4502 | 0.863 | 21385 | 10792 |
m06e1 | 0.920 | 22071 | 11137 | 0.898 | 2061 | 1057 | 0.866 | 3918 | 1994 | 0.900 | 17561 | 8867 |
m06e2 | 0.836 | 22071 | 11137 | 0.867 | 8318 | 4204 | 0.859 | 9023 | 4564 | 0.843 | 20048 | 10122 |
m06e3 | 0.928 | 22071 | 11137 | 0.916 | 3149 | 1601 | 0.931 | 13815 | 6976 | 0.930 | 20641 | 10418 |


### Table 12

Results of comparison experiments with cnn2 as the input model.

Dataset | ACC_O | FLOPs_O | Params_O | ACC_G | FLOPs_G | Params_G | ACC_N | FLOPs_N | Params_N | ACC_R | FLOPs_R | Params_R |
---|---|---|---|---|---|---|---|---|---|---|---|---|
m04e1 | 0.981 | 27417 | 54671 | 0.972 | 12154 | 6107 | 0.973 | 23613 | 11846 | 0.975 | 46777 | 23465 |
m04e2 | 0.958 | 27417 | 54671 | 0.952 | 23644 | 11865 | 0.952 | 43549 | 21836 | 0.956 | 47316 | 23735 |
m04e3 | 0.951 | 27417 | 54671 | 0.93 | 6567 | 3308 | 0.940 | 29147 | 14615 | 0.938 | 40842 | 20494 |
m05e1 | 0.941 | 10521 | 20879 | 0.93 | 8283 | 4183 | 0.949 | 13251 | 6682 | 0.937 | 18103 | 9127 |
m05e2 | 0.881 | 10521 | 20879 | 0.881 | 6231 | 3154 | 0.888 | 14642 | 7375 | 0.894 | 19541 | 9849 |
m05e3 | 0.885 | 10521 | 20879 | 0.862 | 7743 | 3915 | 0.870 | 16381 | 8250 | 0.881 | 19635 | 9896 |
m06e1 | 0.945 | 10521 | 20879 | 0.917 | 5035 | 2551 | 0.926 | 9029 | 4558 | 0.937 | 18275 | 9213 |
m06e2 | 0.911 | 10521 | 20879 | 0.871 | 7453 | 3767 | 0.903 | 18100 | 9121 | 0.903 | 19933 | 10046 |
m06e3 | 0.944 | 10521 | 20879 | 0.936 | 7938 | 4010 | 0.945 | 15590 | 7857 | 0.945 | 18716 | 9436 |

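As a reading aid for the tables in this appendix, the following minimal sketch shows one way the ACC_X, FLOPs_X, and Params_X columns might be compared between the original model (X = O) and the GRS-pruned model (X = G). The three rows are copied from Table 10 (nn2). The names `Result`, `TABLE_10`, and `compare` are hypothetical and are not part of NeuroGRS.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    """One (ACC_X, FLOPs_X, Params_X) triple from a table row."""
    acc: float
    flops: int
    params: int


# Rows copied from Table 10 (nn2): dataset -> (original O, GRS-pruned G).
TABLE_10 = {
    "m04e1": (Result(0.979, 17609, 8841), Result(0.974, 5894, 2962)),
    "m05e1": (Result(0.948, 9097, 4585), Result(0.940, 6086, 3069)),
    "m06e1": (Result(0.950, 7433, 3753), Result(0.941, 4881, 2466)),
}


def compare(original: Result, pruned: Result) -> dict:
    """Accuracy drop and FLOP/parameter reduction factors of a pruned model."""
    return {
        "acc_drop": round(original.acc - pruned.acc, 3),
        "flops_ratio": round(original.flops / pruned.flops, 2),
        "params_ratio": round(original.params / pruned.params, 2),
    }


summary = {name: compare(o, g) for name, (o, g) in TABLE_10.items()}
for name, stats in summary.items():
    print(name, stats)
```

For the m04e1 row, this gives an accuracy drop of 0.005 with roughly a threefold reduction in both FLOP count and parameter count. The same comparison applies to the NWM and RRS columns by substituting the corresponding triples.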

## Contributor Information

Xiaomin Wu, University of Maryland, College Park.

Da-Ting Lin, National Institute on Drug Abuse.

Rong Chen, University of Maryland School of Medicine.

Shuvra S. Bhattacharyya, University of Maryland, College Park.

## References

1. Andrews RJ: Neuromodulation: Advances in the nextdecade. Annals of the New York Academy of Sciences pp. 212–220(2010) [PubMed] [Google Scholar]

2. Hornik K, Stinchcombe M, White H: Multilayer feedforward networks are universalapproximators. Neural networks2(5),359–366(1989) [Google Scholar]

3. Krizhevsky A, Sutskever I, Hinton GE: ImageNet classification with deep convolutional neuralnetworks. In: Advances in Neural Information Processing Systems, pp.1106–1114(2012) [Google Scholar]

4. Collobert R, Weston J: A unified architecture for natural language processing:Deep neural networks with multitask learning. In:Proceedings of the 25th international conference on Machinelearning, pp. 160–167.ACM; (2008) [Google Scholar]

5. Hannun A, Case C, Casper J, Catanzaro B, Diamos G, Elsen E, Prenger R, Satheesh S, Sengupta S, Coates A, et al.: Deep speech: Scaling up end-to-end speechrecognition. arXivpreprint arXiv:1412.5567 (2014) [Google Scholar]

6. Anwar S, Hwang K, Sung W: Structured pruning of deep convolutional neuralnetworks. ACM Journal on Emerging Technologies in Computing Systems13(3), 1–18(2017) [Google Scholar]

7. Li C, Chan DC, Yang X, Ke Y, Yung WH: Prediction of forelimb reach results from motor cortexactivities based on calcium imaging and deep learning.Frontiers in cellular neuroscience13, 88 (2019) [PMC free article] [PubMed] [Google Scholar]

8. He K, Zhang X, Ren S, Sun J: Deep residual learning for imagerecognition. In: Proceedings of the IEEEconference on computer vision and pattern recognition, pp.770–778(2016) [Google Scholar]

9. Lee Y, Madayambath SC, Liu Y, Lin DT, Chen R, Bhattacharyya SS: Online learning in neural decoding using incrementallinear discriminant analysis. In: 2017 IEEEInternational Conference on Cyborg and Bionic Systems (CBS), pp.173–177.IEEE; (2017) [Google Scholar]

10. Liu Z, Sun M, Zhou T, Huang G, Darrell T: Rethinking the value of network pruning.arXivpreprint arXiv:1810.05270(2018) [Google Scholar]

11. Frankle J, Carbin M: The lottery ticket hypothesis: Finding sparse, trainableneural networks. arXivpreprint arXiv:1803.03635(2018) [Google Scholar]

12. Li H, Kadav A, Durdanovic I, Samet H, Graf HP: Pruning filters for efficient ConvNets.arXivpreprint arXiv:1608.08710(2016) [Google Scholar]

13. Hu H, Peng R, Tai YW, Tang CK: Network trimming: A data-driven neuron pruning approachtowards efficient deep architectures. arXivpreprint arXiv:1607.03250(2016) [Google Scholar]

14. Luo JH, Wu J, Lin W: Thinet: A filter level pruning method for deep neuralnetwork compression. In: Proceedings of the IEEEinternational conference on computer vision, pp.5058–5066(2017) [Google Scholar]

15. Molchanov P, Tyree S, Karras T, Aila T, Kautz J: Pruning convolutional neural networks for resourceefficient inference. arXivpreprint arXiv:1611.06440(2016) [Google Scholar]

16. He Y, Zhang X, Sun J: Channel pruning for accelerating very deep neuralnetworks. In: Proceedings of the IEEEInternational Conference on Computer Vision, pp.1389–1397(2017) [Google Scholar]

17. Suau X, Zappella L, Palakkode V, Apostoloff N: Principal filter analysis for guided networkcompression. arXivpreprint arXiv:1807.10585(2018) [Google Scholar]

18. Liu Z, Li J, Shen Z, Huang G, Yan S, Zhang C: Learning efficient convolutional networks throughnetwork slimming. In: Proceedings of the IEEEInternational Conference on Computer Vision, pp.2736–2744(2017) [Google Scholar]

19. Han S, Pool J, Tran J, Dally W: Learning both weights and connections for efficientneural network. In: Advances in neural information processing systems, pp.1135–1143(2015) [Google Scholar]

20. Han S, Mao H, Dally WJ: Deep compression: Compressing deep neural networks withpruning, trained quantization and Huffman coding.arXivpreprint arXiv:1510.00149(2015) [Google Scholar]

21. Bhattacharyya SS, Deprettere E, Leupers R, Takala J (eds.): Handbook of Signal Processing Systems,third edn. Springer; (2019) [Google Scholar]

22. Lee EA, Parks TM: Dataflow process networks.Proceedings of the IEEE83(5),773–799(1995) [Google Scholar]

23. Buck JT, Lee EA: Scheduling dynamic dataflow graphs using the token flowmodel. In: Proceedings of the InternationalConference on Acoustics, Speech, and Signal Processing(1993) [Google Scholar]

24. Lin S, Liu Y, Lee K, Li L, Plishker W, Bhattacharyya SS: The DSPCAD framework for modeling and synthesis ofsignal processing systems. In: Ha S, Teich J (eds.) Handbook of Hardware/Software Codesign, pp.1–35.Springer; (2017) [Google Scholar]

25. Barbera G, Liang B, Zhang L, Gerfen CR, Culurciello E, Chen R, Li Y, Lin DT: Spatially compact neural clusters in the dorsal striatumencode locomotion relevant information.Neuron92(1),202–213(2016) [PMC free article] [PubMed] [Google Scholar]

26. Keras (2020).https://keras.io/

27. Kingma DP, Ba J: Adam: A method for stochasticoptimization (2014).ArXiv:1412.6980 [cs.LG] [Google Scholar]

28. Abadi M, et al.: TensorFlow: Large-scale machine learning on heterogeneous distributed systems (2016).ArXiv:1603.04467v2 [cs.DC]