# 2.2. Deep Belief Networks and Deep Boltzmann Machines

## A Definitive Guide To Build Training Data For Computer Vision

Tech giants like Google, Microsoft, Amazon, and Facebook have declared their product strategies with the AI first approach. The AI effect has influenced the product roadmaps of all enterprise companies, which now have prominent AI based applications getting launched each quarter to automate their business processes. Computer Vision, specifically, is being vastly explored and applied across industries from traditional banking to cutting edge self-driving cars. Amazing, isn’t it?

But how does one start to implement Computer Vision, or CV for short? The major steps are as follows:

1. Collect lots of data
2. Label it
3. Get GPUs: training ML models requires huge computational resources
4. Choose an algorithm -> Train your model -> Test it -> Teach the model what it doesn’t know yet
5. Repeat the above point till you get acceptable quality

Each of these 5 steps has its own list of technical and operational challenges. In this article, I will help you out with step 2 (labeling of training data) to get you started.

I have written about the ways you can start gathering training data. This depends on the use case you plan to work on.

## 2.1. Convolutional Neural Networks

Convolutional Neural Networks (CNNs) were inspired by the visual system’s structure, and in particular by the models of it proposed in [18]. The first computational models based on these local connectivities between neurons and on hierarchically organized transformations of the image are found in Neocognitron [19], which describes that when neurons with the same parameters are applied on patches of the previous layer at different locations, a form of translational invariance is acquired. Yann LeCun and his collaborators later designed Convolutional Neural Networks employing the error gradient and attaining very good results in a variety of pattern recognition tasks [20–22].

A CNN comprises three main types of neural layers, namely, (i) convolutional layers, (ii) pooling layers, and (iii) fully connected layers. Each type of layer plays a different role. Figure 1 shows a CNN architecture for an object detection in image task. Every layer of a CNN transforms the input volume to an output volume of neuron activation, eventually leading to the final fully connected layers, resulting in a mapping of the input data to a 1D feature vector. CNNs have been extremely successful in computer vision applications, such as face recognition, object detection, powering vision in robotics, and self-driving cars.

(i) Convolutional Layers. In the convolutional layers, a CNN utilizes various kernels to convolve the whole image as well as the intermediate feature maps, generating various feature maps. Because of the advantages of the convolution operation, several works (e.g., [23, 24]) have proposed it as a substitute for fully connected layers with a view to attaining faster learning times.

(ii) Pooling Layers. Pooling layers are in charge of reducing the spatial dimensions (width × height) of the input volume for the next convolutional layer.
The pooling layer does not affect the depth dimension of the volume. The operation performed by this layer is also called subsampling or downsampling, as the reduction of size leads to a simultaneous loss of information. However, such a loss is beneficial for the network because the decrease in size leads to less computational overhead for the upcoming layers of the network, and it also works against overfitting. Average pooling and max pooling are the most commonly used strategies. In [25] a detailed theoretical analysis of max pooling and average pooling performances is given, whereas in [26] it was shown that max pooling can lead to faster convergence, select superior invariant features, and improve generalization. There are also a number of other variations of the pooling layer in the literature, each inspired by different motivations and serving distinct needs, for example, stochastic pooling [27], spatial pyramid pooling [28, 29], and def-pooling [30].

(iii) Fully Connected Layers. Following several convolutional and pooling layers, the high-level reasoning in the neural network is performed via fully connected layers. Neurons in a fully connected layer have full connections to all activations in the previous layer, as their name implies. Their activation can hence be computed with a matrix multiplication followed by a bias offset. Fully connected layers eventually convert the 2D feature maps into a 1D feature vector. The derived vector either could be fed forward into a certain number of categories for classification [31] or could be considered as a feature vector for further processing [32].

The architecture of CNNs employs three concrete ideas: (a) local receptive fields, (b) tied weights, and (c) spatial subsampling. Based on local receptive field, each unit in a convolutional layer receives inputs from a set of neighboring units belonging to the previous layer.
This way neurons are capable of extracting elementary visual features such as edges or corners. These features are then combined by the subsequent convolutional layers in order to detect higher order features. Furthermore, the idea that elementary feature detectors, which are useful on a part of an image, are likely to be useful across the entire image is implemented by the concept of tied weights. The concept of tied weights constrains a set of units to have identical weights. Concretely, the units of a convolutional layer are organized in planes. All units of a plane share the same set of weights. Thus, each plane is responsible for constructing a specific feature. The outputs of planes are called feature maps. Each convolutional layer consists of several planes, so that multiple feature maps can be constructed at each location.

During the construction of a feature map, the entire image is scanned by a unit whose states are stored at corresponding locations in the feature map. This construction is equivalent to a convolution operation, followed by an additive bias term and sigmoid function:

$$\mathbf{y}^{(d)} = \sigma\left(\mathbf{W}\mathbf{y}^{(d-1)} + \mathbf{b}\right), \tag{1}$$

where \(d\) stands for the depth of the convolutional layer, \(\mathbf{W}\) is the weight matrix, and \(\mathbf{b}\) is the bias term. For fully connected neural networks, the weight matrix is full, that is, connects every input to every unit with different weights. For CNNs, the weight matrix \(\mathbf{W}\) is very sparse due to the concept of tied weights. Thus, \(\mathbf{W}\) has the form of

$$\mathbf{W} = \begin{bmatrix} \mathbf{w} & 0 & \cdots & 0 \\ 0 & \mathbf{w} & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & \cdots & 0 & \mathbf{w} \end{bmatrix}, \tag{2}$$

where \(\mathbf{w}\) are matrices having the same dimensions as the units’ receptive fields. Employing a sparse weight matrix reduces the number of the network’s tunable parameters and thus increases its generalization ability. Multiplying \(\mathbf{W}\) with layer inputs is like convolving the input with \(\mathbf{w}\), which can be seen as a trainable filter.
If the input to convolutional layer \(d\) is of dimension \(N \times N\) and the receptive field of units at a specific plane of convolutional layer \(d\) is of dimension \(m \times m\), then the constructed feature map will be a matrix of dimensions \((N-m+1) \times (N-m+1)\). Specifically, the element of the feature map at location \((i, j)\) will be

$$y_{ij}^{(d)} = \sigma\left(x_{ij}^{(d)} + b\right), \tag{3}$$

with

$$x_{ij}^{(d)} = \sum_{\alpha=0}^{m-1} \sum_{\beta=0}^{m-1} w_{\alpha\beta}\, y_{(i+\alpha)(j+\beta)}^{(d-1)}, \tag{4}$$

where the bias term \(b\) is scalar. Using (4) and (3) sequentially for all \((i, j)\) positions of the input, the feature map for the corresponding plane is constructed.

One of the difficulties that may arise with training of CNNs has to do with the large number of parameters that have to be learned, which may lead to the problem of overfitting. To this end, techniques such as stochastic pooling, dropout, and data augmentation have been proposed. Furthermore, CNNs are often subjected to pretraining, that is, to a process that initializes the network with pretrained parameters instead of randomly set ones. Pretraining can accelerate the learning process and also enhance the generalization capability of the network.

Overall, CNNs were shown to significantly outperform traditional machine learning approaches in a wide range of computer vision and pattern recognition tasks [33], examples of which will be presented in Section 3. Their exceptional performance combined with the relative ease of training are the main reasons that explain the great surge in their popularity over the last few years.
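The feature-map construction described in this section (slide an m × m filter of tied weights over an N × N input, add a scalar bias, apply a sigmoid) can be sketched in a few lines of plain Python. This is only an illustration: the function names, the toy image, and the edge filter are assumptions made for the example and do not come from the reviewed work.

```python
import math


def sigmoid(z):
    """Logistic sigmoid used as the unit nonlinearity."""
    return 1.0 / (1.0 + math.exp(-z))


def feature_map(prev, w, b):
    """Build one feature map from an N x N input and an m x m filter.

    prev: N x N list of lists (output of the previous layer)
    w:    m x m list of lists (the tied weights shared by the plane)
    b:    scalar bias
    Returns an (N - m + 1) x (N - m + 1) list of lists.
    """
    n, m = len(prev), len(w)
    size = n - m + 1
    out = []
    for i in range(size):
        row = []
        for j in range(size):
            # Weighted sum over the unit's receptive field.
            x_ij = sum(w[a][c] * prev[i + a][j + c]
                       for a in range(m) for c in range(m))
            # Additive bias followed by the sigmoid.
            row.append(sigmoid(x_ij + b))
        out.append(row)
    return out


# Toy input (assumed for illustration): dark left half, bright right half.
image = [[0, 0, 1, 1],
         [0, 0, 1, 1],
         [0, 0, 1, 1],
         [0, 0, 1, 1]]
# Hypothetical 2 x 2 filter that responds to a dark-to-bright vertical edge.
edge_filter = [[-1, 1],
               [-1, 1]]
fmap = feature_map(image, edge_filter, b=0.0)
```

With a 4 × 4 input and a 2 × 2 filter the result is 3 × 3, and only the middle column, which straddles the edge, rises above 0.5. The same four weights are reused at every position, which is what keeps the parameter count small.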

## 2.2. Deep Belief Networks and Deep Boltzmann Machines

Deep Belief Networks and Deep Boltzmann Machines are deep learning models that belong in the “Boltzmann family,” in the sense that they utilize the Restricted Boltzmann Machine (RBM) as learning module. The Restricted Boltzmann Machine (RBM) is a generative stochastic neural network. DBNs have undirected connections at the top two layers which form an RBM and directed connections to the lower layers. DBMs have undirected connections between all layers of the network. A graphic depiction of DBNs and DBMs can be found in Figure 2. In the following subsections, we will describe the basic characteristics of DBNs and DBMs, after presenting their basic building block, the RBM.

## 2.2.2. Deep Belief Networks

Deep Belief Networks (DBNs) are probabilistic generative models which provide a joint probability distribution over observable data and labels. They are formed by stacking RBMs and training them in a greedy manner, as was proposed in [39]. A DBN initially employs an efficient layer-by-layer greedy learning strategy to initialize the deep network and subsequently fine-tunes all weights jointly with the desired outputs. DBNs are graphical models which learn to extract a deep hierarchical representation of the training data. They model the joint distribution between the observed vector \(\mathbf{x}\) and the \(\ell\) hidden layers \(\mathbf{h}^{k}\) as follows:

$$P\left(\mathbf{x}, \mathbf{h}^{1}, \ldots, \mathbf{h}^{\ell}\right) = \left(\prod_{k=0}^{\ell-2} P\left(\mathbf{h}^{k} \mid \mathbf{h}^{k+1}\right)\right) P\left(\mathbf{h}^{\ell-1}, \mathbf{h}^{\ell}\right),$$

where \(\mathbf{x} = \mathbf{h}^{0}\), \(P(\mathbf{h}^{k} \mid \mathbf{h}^{k+1})\) is a conditional distribution for the visible units at level \(k\) conditioned on the hidden units of the RBM at level \(k+1\), and \(P(\mathbf{h}^{\ell-1}, \mathbf{h}^{\ell})\) is the visible-hidden joint distribution in the top-level RBM.

The principle of greedy layer-wise unsupervised training can be applied to DBNs with RBMs as the building blocks for each layer [33, 39]. A brief description of the process follows:

1. Train the first layer as an RBM that models the raw input \(\mathbf{x} = \mathbf{h}^{0}\) as its visible layer.
2. Use that first layer to obtain a representation of the input that will be used as data for the second layer. Two common solutions exist. This representation can be chosen as being the mean activation \(P(\mathbf{h}^{1} = 1 \mid \mathbf{h}^{0})\) or samples of \(P(\mathbf{h}^{1} \mid \mathbf{h}^{0})\).
3. Train the second layer as an RBM, taking the transformed data (samples or mean activation) as training examples (for the visible layer of that RBM).
4. Iterate steps (2 and 3) for the desired number of layers, each time propagating upward either samples or mean values.
5. Fine-tune all the parameters of this deep architecture with respect to a proxy for the DBN log-likelihood, or with respect to a supervised training criterion (after adding extra learning machinery to convert the learned representation into supervised predictions, e.g., a linear classifier).

There are two main advantages in the above-described greedy learning process of the DBNs [40].
First, it tackles the challenge of appropriate selection of parameters, which in some cases can lead to poor local optima, thereby ensuring that the network is appropriately initialized. Second, there is no requirement for labelled data since the process is unsupervised. Nevertheless, DBNs are also plagued by a number of shortcomings, such as the computational cost associated with training a DBN and the fact that the steps towards further optimization of the network based on maximum likelihood training approximation are unclear [41]. Furthermore, a significant disadvantage of DBNs is that they do not account for the two-dimensional structure of an input image, which may significantly affect their performance and applicability in computer vision and multimedia analysis problems. However, a later variation of the DBN, the Convolutional Deep Belief Network (CDBN) ([42, 43]), uses the spatial information of neighboring pixels by introducing convolutional RBMs, thus producing a translation invariant generative model that successfully scales when it comes to high dimensional images, as is evidenced in [44].
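The greedy layer-wise procedure of this subsection (train an RBM, propagate mean activations upward, train the next RBM on them) can be sketched as follows. The text does not say how each RBM is trained, so this sketch assumes one-step contrastive divergence (CD-1), a common choice. The class name `RBM`, the function `train_dbn`, the hyperparameters, and the toy data are all hypothetical, and the final fine-tuning step is left out.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


class RBM:
    """Binary RBM trained with CD-1 (an assumption; the text names no rule)."""

    def __init__(self, n_visible, n_hidden, rng):
        self.rng = rng
        self.W = [[rng.gauss(0.0, 0.01) for _ in range(n_hidden)]
                  for _ in range(n_visible)]
        self.b = [0.0] * n_visible  # visible biases
        self.c = [0.0] * n_hidden   # hidden biases

    def hidden_probs(self, v):
        return [sigmoid(self.c[j] + sum(v[i] * self.W[i][j] for i in range(len(v))))
                for j in range(len(self.c))]

    def visible_probs(self, h):
        return [sigmoid(self.b[i] + sum(h[j] * self.W[i][j] for j in range(len(h))))
                for i in range(len(self.b))]

    def sample(self, probs):
        return [1.0 if self.rng.random() < p else 0.0 for p in probs]

    def cd1_update(self, v0, lr):
        h0 = self.hidden_probs(v0)                # positive phase
        v1 = self.visible_probs(self.sample(h0))  # one Gibbs step back down
        h1 = self.hidden_probs(v1)                # negative phase
        for i in range(len(v0)):
            for j in range(len(h0)):
                self.W[i][j] += lr * (v0[i] * h0[j] - v1[i] * h1[j])
        for i in range(len(v0)):
            self.b[i] += lr * (v0[i] - v1[i])
        for j in range(len(h0)):
            self.c[j] += lr * (h0[j] - h1[j])


def train_dbn(data, layer_sizes, epochs=20, lr=0.1, seed=0):
    """Greedy layer-wise pretraining; returns the list of trained RBMs."""
    rng = random.Random(seed)
    rbms, layer_input, n_in = [], data, len(data[0])
    for n_hidden in layer_sizes:
        rbm = RBM(n_in, n_hidden, rng)
        for _ in range(epochs):
            for v in layer_input:
                rbm.cd1_update(v, lr)
        # Propagate mean activations upward as data for the next layer.
        layer_input = [rbm.hidden_probs(v) for v in layer_input]
        rbms.append(rbm)
        n_in = n_hidden
    return rbms


# Toy binary data (assumed for illustration): two repeating patterns.
data = [[1, 1, 0, 0], [0, 0, 1, 1]] * 5
dbn = train_dbn(data, layer_sizes=[3, 2])
top = dbn[1].hidden_probs(dbn[0].hidden_probs(data[0]))
```

Propagating mean activations rather than samples is one of the two options the procedure allows; swapping in `rbm.sample(rbm.hidden_probs(v))` would give the other.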

## 2.2.3. Deep Boltzmann Machines

Deep Boltzmann Machines (DBMs) [45] are another type of deep model using the RBM as their building block. The difference in architecture from DBNs is that, in the latter, the top two layers form an undirected graphical model and the lower layers form a directed generative model, whereas in the DBM all the connections are undirected. DBMs have multiple layers of hidden units, where units in odd-numbered layers are conditionally independent given the even-numbered layers, and vice versa. As a result, inference in the DBM is generally intractable. Nonetheless, an appropriate selection of interactions between visible and hidden units can lead to more tractable versions of the model.

During network training, a DBM jointly trains all layers of a specific unsupervised model, and instead of maximizing the likelihood directly, the DBM uses a stochastic maximum likelihood (SML) [46] based algorithm to maximize the lower bound on the likelihood. Such a process would seem vulnerable to falling into poor local minima [45], leaving several units effectively dead. Instead, a greedy layer-wise training strategy was proposed [47], which essentially consists in pretraining the layers of the DBM, similarly to the DBN, namely, by stacking RBMs and training each layer to independently model the output of the previous layer, followed by a final joint fine-tuning.

Regarding the advantages of DBMs, they can capture many layers of complex representations of input data and they are appropriate for unsupervised learning since they can be trained on unlabeled data, but they can also be fine-tuned for a particular task in a supervised fashion. One of the attributes that sets DBMs apart from other deep models is that the approximate inference process of DBMs includes, apart from the usual bottom-up process, a top-down feedback, thus incorporating uncertainty about inputs in a more effective manner.
Furthermore, in DBMs, by following the approximate gradient of a variational lower bound on the likelihood objective, one can jointly optimize the parameters of all layers, which is very beneficial especially in cases of learning models from heterogeneous data originating from different modalities [48].

As far as the drawbacks of DBMs are concerned, one of the most important ones is, as mentioned above, the high computational cost of inference, which is almost prohibitive when it comes to joint optimization in sizeable datasets. Several methods have been proposed to improve the effectiveness of DBMs. These include accelerating inference by using separate models to initialize the values of the hidden units in all layers [47, 49], or other improvements at the pretraining stage [50, 51] or at the training stage [52, 53].
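To make the combination of bottom-up and top-down signals concrete, here is a minimal sketch of approximate inference in a DBM with two hidden layers. It assumes the standard mean-field fixed-point updates; the text above only says that inference is approximate and includes top-down feedback. Biases are omitted, and the weights and the function name `mean_field` are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def mean_field(v, W1, W2, n_iters=10):
    """Mean-field inference for a DBM with two hidden layers (biases omitted).

    v:  visible vector of length D
    W1: D x H1 weights between the visible layer and hidden layer 1
    W2: H1 x H2 weights between hidden layer 1 and hidden layer 2
    Returns the approximate posterior means (mu1, mu2).
    """
    n_h1, n_h2 = len(W2), len(W2[0])
    mu1 = [0.5] * n_h1
    mu2 = [0.5] * n_h2
    for _ in range(n_iters):
        # Hidden layer 1 combines bottom-up input from v with
        # top-down feedback from hidden layer 2.
        mu1 = [sigmoid(sum(v[i] * W1[i][j] for i in range(len(v)))
                       + sum(W2[j][k] * mu2[k] for k in range(n_h2)))
               for j in range(n_h1)]
        # Hidden layer 2 only receives input from hidden layer 1.
        mu2 = [sigmoid(sum(mu1[j] * W2[j][k] for j in range(n_h1)))
               for k in range(n_h2)]
    return mu1, mu2


# Hypothetical toy weights, chosen only to show the effect of feedback.
v = [1.0, 0.0]
W1 = [[2.0, -1.0],
      [0.5, 0.5]]
W2 = [[1.0],
      [-1.0]]
mu1, mu2 = mean_field(v, W1, W2)
# Setting W2 to zero removes the top-down term and leaves a purely
# bottom-up estimate for hidden layer 1.
mu1_bottom_up, _ = mean_field(v, W1, [[0.0], [0.0]])
```

Comparing `mu1` with `mu1_bottom_up` shows how the second hidden layer shifts the first layer's beliefs, which is the feedback a stacked feed-forward pass would not have.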

## Server application

The server application is responsible for processing the content of the queried images and identifying the location from them. The server application utilizes a trained deep learning model to recognize the location from the image and returns the location information to the mobile application in JSON format. To communicate with the Android application, the server application uses an open source message broker, ActiveMQ [55].

## Deep learning model

A deep learning model is configured for the indoor scene recognition task. TensorFlow [56], an open-source machine learning library, was utilized to build the deep learning model. Figure 3 illustrates the architecture of the developed deep learning model. The model is built using convolutional layers, pooling layers, and fully connected layers at the end.

Fig. 3. Architecture of the developed deep learning model

For an input RGB image, a convolutional layer computes the output of neurons that are each connected to a local region of the input. A convolutional layer can be applied to raw input data as well as to the output of another convolutional layer. During the convolution operation, the filter (kernel) slides over the raw pixels of the RGB image or over the feature map generated by the previous layer, computing the dot product between the weights and each region of the input.

Let \(M_i^{l-1}\) be the feature map from the previous layer and \(w_k^{l}\) the weight matrix in the current layer; the convolution operation then produces a new feature map \(M_k^{l}\):

$$Y = \sum_{i \in N_{K}} M_i^{l-1} * w_k^{l} + b_k^{l} \tag{1}$$

$$M_k^{l} = f(Y) \tag{2}$$

Here, \(Y\) is the output of the convolution operation, \(N_{K}\) represents the number of kernels in the current layer, and \(b_k^{l}\) is the bias value. The bias is an additional parameter used in CNNs to adjust the output of the convolutional layer; it helps the model fit the input data better.

An activation function \(f(Y)\) is applied to the output of the convolution operation to generate the feature map \(M_k^{l}\). The activation function, also known as the transfer function, decides the output by mapping the resulting values of the convolution operation to a specific interval such as [0,1] or [−1,1]. Here we utilized the Rectified Linear Unit (ReLU) as the activation function.
ReLU is a commonly used activation function in CNNs and is faster to compute than other functions.

For an input \(x\), the ReLU function \(f(x)\) is

$$f(x) = \begin{cases} 0, & \text{if } x < 0 \\ x, & \text{if } x \ge 0 \end{cases} \tag{3}$$

Once the convolution operation is completed, a pooling operation is applied to the resulting feature map to reduce its spatial size by downsampling. Average pooling and max pooling are the two common functions used for the pooling operation.

For a feature map of volume \(W1 \times H1 \times D1\), the pooling operation produces a feature map of reduced volume \(W2 \times H2 \times D2\), where:

$$W2 = (W1 - F)/S + 1, \quad H2 = (H1 - F)/S + 1, \quad D2 = N_{K}$$

Here, \(S\) is the stride and \(F\) is the spatial extent. We used the max pooling function, where the MAX operation is applied to a local region and returns the maximum value in that region.

The feature maps, in the form of n-dimensional matrices, are flattened into vectors before being fed to the fully connected layer. The fully connected layer combines the feature vector to build a model. Moreover, the softmax function is used to normalize the output of the fully connected layer, so that the outputs are represented as a probability distribution.

For an input image \(x\), the softmax function applied in the output layer computes the probability that \(x\) belongs to a class \(c_{k}\) by

$$p(y = c_{k} \mid x; P) = \frac{e^{P_{c_{k}}^{T} x}}{\sum_{c_{i}=1}^{n} e^{P_{c_{i}}^{T} x}} \tag{4}$$

where \(n\) is the number of classes and \(P\) is the parameter of the model.

Our model consists of 7 convolutional layers, each followed by a max pooling layer. The first convolutional layer has 128 filters of size \(3 \times 3\) and the last layer has 512 filters of size \(3 \times 3\). The other convolutional layers use 256 filters of size \(3 \times 3\).
Each max pooling layer is responsible for reducing the dimensions of the features obtained in its preceding convolutional layer. Dimensionality reduction helps the convolutional neural network model to achieve translation invariance, reduce computation, and lower the number of parameters. At the end, the architecture contains a fully connected layer with 4096 nodes followed by an output layer with softmax activation. The deep learning model was trained using more than 5000 images to identify 42 indoor locations. The model takes an RGB image of size \(224 \times 224\) pixels as input and classifies the image into one of the 42 class labels learned during the training phase.
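The building blocks named in this section (ReLU, max pooling with output size (W1 − F)/S + 1, and softmax) can be illustrated with a small plain-Python sketch. It is not the authors' TensorFlow code: the function names are hypothetical, the 2 × 2 window with stride 2 is an assumed setting (the text does not state F or S), and integer division is used where the window does not divide the input evenly.

```python
import math


def relu(x):
    """ReLU: zero for negative inputs, identity otherwise."""
    return x if x >= 0 else 0.0


def pooled_size(w1, f, s):
    """Output width (or height) of pooling: (W1 - F) / S + 1, floored."""
    return (w1 - f) // s + 1


def max_pool(fmap, f=2, s=2):
    """Max pooling over one 2D feature map; f and s are assumed values."""
    h2 = pooled_size(len(fmap), f, s)
    w2 = pooled_size(len(fmap[0]), f, s)
    return [[max(fmap[i * s + a][j * s + c]
                 for a in range(f) for c in range(f))
             for j in range(w2)]
            for i in range(h2)]


def softmax(scores):
    """Normalize class scores into a probability distribution."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(z - shift) for z in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Toy convolution output (assumed values), passed through ReLU and pooling.
conv_out = [[-1.0, 2.0, 0.5, -3.0],
            [4.0, -2.0, 1.5, 0.0],
            [0.0, 1.0, -1.0, -1.0],
            [3.0, -5.0, -2.0, 2.5]]
activated = [[relu(x) for x in row] for row in conv_out]
pooled = max_pool(activated)        # 4 x 4 -> 2 x 2
probs = softmax([2.0, 1.0, 0.1])    # three hypothetical class scores
```

Under the same assumed setting, `pooled_size(224, 2, 2)` gives 112, so each pooling stage would roughly halve the width and height of a 224 × 224 input.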

## Image dataset

Our indoor image dataset [57] contains more than 5000 images classified into different directories. Each directory represents one indoor location or class. Moreover, each directory contains a JSON file with the location information required by the Android application to locate the user. The images in the dataset were captured on the ground floor of building B09 of Qatar University. To account for the various orientations of users, we captured pictures of the same location from different angles. The images were captured using a Samsung Galaxy smartphone, an LG smartphone, and a Lenovo smartphone. We used a variety of mobile phones so that the dataset covers the different kinds of pictures produced by different cameras. Each image is in RGB format and resized to \(224 \times 224\) pixels. This dataset can also be used for indoor scene recognition applications. Sample images from the dataset are displayed in Fig. 4.

Fig. 4. Sample images from dataset

## Navigation module

The navigation module is responsible for providing navigation instructions to the user. We utilized the indoor map and CAD drawings of indoor areas to create the navigation module. The navigation module contains the routing information from each point of interest to every other point of interest. The navigation information inside the navigation module is stored in a JSON array format. One JSON array includes the navigation instructions for one specific route. We created the navigation instructions manually for each route. The common commands used in navigation instructions are turn right, turn left, walk straight, etc. Moreover, the instructions provide information such as the distance between the current location of the user (computed by the system) and critical locations such as junctions in indoor areas, doors, or lifts. The distance between each location associated with the captured visual scenes was measured manually and represented in terms of steps. To convert the distance measured in meters to steps, we considered the walking patterns of normal people.
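As a rough illustration of how one route could be stored as a JSON array and how distances in meters could be turned into steps, consider the sketch below. The article does not publish its JSON schema or the step length it used, so the field names (`command`, `distance_m`, `landmark`), the 0.75 m step length, and the function names are all hypothetical.

```python
import json
import math

# Assumed average step length in meters; the source does not give its value.
ASSUMED_STEP_LENGTH_M = 0.75


def meters_to_steps(distance_m, step_length_m=ASSUMED_STEP_LENGTH_M):
    """Convert a measured distance to a whole number of steps (rounded up)."""
    return math.ceil(distance_m / step_length_m)


# Hypothetical JSON array for one route; the schema is illustrative only.
route_json = """
[
  {"command": "walk straight", "distance_m": 7.5, "landmark": "junction"},
  {"command": "turn right", "distance_m": 0.0, "landmark": null},
  {"command": "walk straight", "distance_m": 3.0, "landmark": "lift"}
]
"""


def render_instructions(route):
    """Turn the parsed route into user-facing instruction strings."""
    lines = []
    for step in route:
        if step["distance_m"] > 0:
            n_steps = meters_to_steps(step["distance_m"])
            lines.append(
                f'{step["command"]} for {n_steps} steps to the {step["landmark"]}'
            )
        else:
            lines.append(step["command"])
    return lines


instructions = render_instructions(json.loads(route_json))
```

Rounding up is a design choice made here so that a user is never told to stop short of a junction or door; a real deployment would calibrate the step length to its users.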

## QR code decoder

We analyzed two open source barcode reader libraries, ‘Zxing’ [59] and ‘Zbar’ [60], for QR code encoding and decoding operations. We found that the ‘Zxing’ library is more effective than ‘Zbar’ in low light and challenging conditions. We implemented the ‘Zxing’ library and the ‘Zbar’ library to read and extract the information from QR codes.

## QR dataset

The QR dataset contains more than 25 directories, where each directory is associated with an indoor location. The names of the directories are unique (the same as the unique ID embedded in the QR codes). Each directory includes a JSON file which contains information about the location. The QR codes were created using the pyzbar library (the Zbar library for the Python language). Each indoor location is mapped to a unique QR code. A four-digit unique ID beginning from ‘1000’ was manually assigned to each QR code. The QR codes were printed on normal A4 sheets and pasted in indoor areas. In each location, we provided more than 25 copies of the QR code to provide a reliable navigation service. Figure 7 shows a sample of QR codes attached to the floor.

Fig. 7. QR codes attached on the floor

## BLE positioning module

The positioning module combines two popular positioning techniques to estimate the location of the user in real time. BLE fingerprinting [11] and multilateration [61] were utilized to localize the user in the indoor map. When the system senses only two nearby beacons, the fingerprinting technique is used, where the observed fingerprint is compared with the pre-stored fingerprints in the database. If the system is able to detect more than two nearby beacons, the multilateration technique is used to compute the position of the user.
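The switching rule described above can be sketched as follows. This is a simplified illustration, not the system's implementation: fingerprints are matched by Euclidean distance in RSSI space, multilateration is reduced to three beacons in 2D with distances assumed to be already estimated, and cases with fewer than two beacons fall back to fingerprinting. All names, coordinates, and RSSI values are hypothetical.

```python
import math


def nearest_fingerprint(observed, database):
    """Return the stored location whose fingerprint is closest to `observed`.

    observed: {beacon_id: rssi}
    database: {location: {beacon_id: rssi}}
    """
    def distance(stored):
        shared = observed.keys() & stored.keys()
        if not shared:
            return math.inf
        return math.sqrt(sum((observed[b] - stored[b]) ** 2 for b in shared))
    return min(database, key=lambda loc: distance(database[loc]))


def trilaterate(beacons):
    """Solve for (x, y) from three (x, y, distance) tuples.

    Subtracting the first circle equation from the other two gives a
    2 x 2 linear system, solved here with Cramer's rule.
    """
    (x1, y1, d1), (x2, y2, d2), (x3, y3, d3) = beacons[:3]
    a11, a12 = 2 * (x2 - x1), 2 * (y2 - y1)
    a21, a22 = 2 * (x3 - x1), 2 * (y3 - y1)
    b1 = d1**2 - d2**2 + x2**2 - x1**2 + y2**2 - y1**2
    b2 = d1**2 - d3**2 + x3**2 - x1**2 + y3**2 - y1**2
    det = a11 * a22 - a12 * a21
    if det == 0:
        raise ValueError("beacons are collinear")
    return ((b1 * a22 - a12 * b2) / det, (a11 * b2 - a21 * b1) / det)


def locate(rssi, distances, beacon_xy, database):
    """More than two beacons: multilateration; otherwise: fingerprinting."""
    if len(rssi) > 2:
        strongest = sorted(rssi, key=rssi.get, reverse=True)[:3]
        return trilaterate([(*beacon_xy[b], distances[b]) for b in strongest])
    return nearest_fingerprint(rssi, database)


# Hypothetical beacon layout and fingerprint database.
beacon_xy = {"b1": (0.0, 0.0), "b2": (4.0, 0.0), "b3": (0.0, 4.0)}
database = {"lobby": {"b1": -60, "b2": -75}, "lift": {"b1": -80, "b2": -55}}

# Two beacons visible: fingerprint match against the database.
place = locate({"b1": -62, "b2": -73}, {}, beacon_xy, database)

# Three beacons visible: position from distances (true point is (1, 1)).
dists = {"b1": math.sqrt(2), "b2": math.sqrt(10), "b3": math.sqrt(10)}
position = locate({"b1": -50, "b2": -65, "b3": -66}, dists, beacon_xy, database)
```

With more than three beacons, a least-squares fit over all of them would normally replace the three-beacon solve used here; converting RSSI to distance is also left out because the text does not describe the model used.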