**Symbolic Frameworks**

Symbolic computation frameworks (as in CNTK, MXNET, TensorFlow, Theano) are specified as a symbolic graph of vector operations, such as matrix add/multiply or convolution. A layer is just a composition of those operations. The fine granularity of the building blocks (operations) allows users to invent new complex layer types without implementing them in a low-level language (as in Caffe).

I’ve used different symbolic computation frameworks in my work. However, I found each of them has their pros and cons in their design and current implementation, and none of them can perfectly satisfy all needs. For my problem needs , I decided to work with Theano.

Here we compare the following symbolic computation frameworks:

- Software: Theano
- Creator: Université de Montréal
- Software license: BSD license
- Open source: Yes
- Platform: Cross-platform
- Written in: Python
- Interface: Python
- CUDA support: Yes
- Automatic differentiation: Yes
- Has pre-trained models: Through Lasagne’s model zoo
- Recurrent Nets: Yes
- Convolutional Nets: Yes
- RBM/DBNs: Yes

- Software: TensorFlow
- Creator: Google Brain Team
- Software license: Apache 2.0
- Open source: Yes
- Platform: Linux, Mac OS X,
- Windows support on roadmap
- Written in: C++, Python
- Interface: Python, C/C++
- CUDA support: Yes
- Automatic differentiation: Yes
- Has pre-trained models: No
- Recurrent Nets: Yes
- Convolutional Nets: Yes
- RBM/DBNs: Yes

- Software: MXNET
- Creator: Distributed (Deep) Machine Learning Community
- Software license: Apache 2.0
- Open source: Yes
- Platform: Ubuntu, OS X, Windows, AWS, Android, iOS, JavaScript
- Written in: C++, Python, Julia, Matlab, R, Scala
- Interface: C++, Python, Julia, Matlab, JavaScript, R, Scala
- CUDA support: Yes
- Automatic differentiation: Yes
- Has pre-trained models: Yes
- Recurrent Nets: Yes
- Convolutional Nets: Yes
- RBM/DBNs: Yes

**Non-symbolic frameworks**

**PROS**:

- Non-symbolic (imperative) neural network frameworks like
**torch**,**caffe**etc. tend to have very similar design in their computation part. - In terms of expressiveness, imperative frameworks with a good design can also expose graph-like interface (e.g.
**torch/nngraph**).

**CONS**:

- The main drawbacks of imperative frameworks actually lie in manual optimization. For example, in-place operation has to be manually implemented.
- Most imperative frameworks are not designed well enough to have comparable expressiveness as symbolic frameworks.

**Symbolic frameworks**

**PROS**:

- Symbolic frameworks can possibly infer optimization automatically from the dependency graph.
- A symbolic framework can exploit much more memory reuse opportunities, as is well done in
**MXNET**. - Symbolic frameworks can automatically compute an optimal schedule. This is explained in
**TensorFlow whitepaper**.

**CONS:**

- Available open source symbolic frameworks currently are still not good enough to beat imperative frameworks in performance.

**Adding New Operations**

Theano / MXNET | TensorFlow |
---|---|

Can add Operation in Python with inline C support. | Forward in C++, symbolic gradient in Python. |

**Code Re-usability**

Training deep networks are time-consuming. So, Caffe has released some pre-trained model/weights (model zoo) which could be used as initial weights while transfer learning or fine tuning deep networks on domain specific or custom images.

**Theano**Lasagne is a high-level framework built on top of Theano. It’s very easy to use Caffe pre-tained model weights in Lasagne.**TensorFlow**No support for pre-trained model.******MXNET**MXNET has a*caffe_converter tool*which allows to convert pre-trained caffe model weights to fit MXNET.

**Low-level Tensor Operators**

A reasonably efficient implementation of low-level operators can serve as ingredients in writing new models, saving the effort to write new Operations.

Theano | TensorFlow | MXNET |
---|---|---|

A lot of basic Operations | Fairly good | Very few |

**Control Flow Operator**

Control flow operators make the symbolic engine more expressive and generic.

Theano | TensorFlow | MXNET |
---|---|---|

Supported | Experimental | Not Supported |

**High**-**level** **Support**

**Theano**Pure symbolic computation framework. High-level frameworks can be built to fit desired means of use. Successful examples include*Keras*,*Lasagne*,*blocks*.**TensorFlow**Has good design considerations for neural network training, and at the same time avoid being totally a neural network framework, which is a wonderful job. The**graph collection**,**queues**,**image augmenters**etc. can be useful building blocks for a higher-level wrapper.**MXNET**Apart from the symbolic part, MXNET also comes with all necessary*components*for image classification, going all the way through data loading to building a model that has a method to start training.

**Performance**

**Benchmarking Using Single-GPU**

I benchmark LeNet model on MNIST Dataset using a Single-GPU (NVIDIA Quadro K1200 GPU).

Theano | TensorFlow | MXNET |
---|---|---|

Great | Not so good | Excellent |

**Memory**

GPU memory is limited and may usually be a problem for large models.

Theano | TensorFlow | MXNET |
---|---|---|

Great | Not so good | Excellent |

**Single**-**GPU** **Speed**

**Theano** takes a long time to compile a graph, especially with complex models. **TensorFlow** is a bit slower.

Theano / MXNET | TensorFlow |
---|---|

comparable to CuDNNv4 | about 0.5x slower |

**Parallel**/**Distributed** **Support**

Theano | TensorFlow | MXNET |
---|---|---|

experimental multi-GPU | multi-GPU | distributed |

**Conclusion**

Theano (with higher-level Lasagne & Keras) is a great choice for deep learning models. It’s very easy to implement new networks & modify existing networks using Lasagne/Keras. I prefer python, and thus prefer using Lasagne/Keras due to their very mature python interface. However, they do not support R. I have tried using transfer learning and fine tuning in Lasagne/Keras, and it’s very easy to modify an existing network and customize it with domain-specific custom data.

Comparisons of different frameworks show that **MXNET is the best choice (better performance/memory).** Moreover, it has a great R support. In fact, it is the only framework that supports all functions in R. In MXNET, transfer learning and fine tuning networks are possible, but not as easy (as compared to Lasagne/Keras). This makes modifying existing trained networks more difficult, and thus a bit difficult to use domain-specific custom data.

MXNet is a modern interpretation and rewrite of a number of ideas being talked about in the deep learning infrastructure. It’s designed from the ground up to work well with multiple GPUs and multiple computers.

When doing multi-device work in other frameworks, the end user frequently has to think about when to do computation and how data is synchronized. In MXNet, every operation is lazy. It only computes values when the resources are available to compute them.

In practice, you get incredible device utilization automatically. A new batch can be copying to a GPU, while that same GPU can be running a forward pass as a CPU does a complex parameter update on the CPU — all at the same time!

In addition:

- It is built on a dataflow graph (like Tensorflow, Theano, Torch, Caffe).
- It manages its own memory internally (like Theano) and is able to reuse memory locations.
- It has a backend written in C++ and cuda, which is exported via a C interface. This allows simple language bindings. It currently supports Python, R, Julia, Go and Javascript (in varying degrees).
- It allows use of Torch natively from Python.
- It can be deployed on mobile.

For a high level overview of MXNet, check out their Arxiv Paper.

There are two modes of computation in MXNet: imperative and symbolic. The imperative API exposes an interface similarly to Numpy. The symbolic API lets users define computation graphs, similar to Theano and Tensorflow.

The basic building block for the imperative API is an `NDArray`

. Much like Numpy, this object holds a tensor (or multi-dimensional array). Unlike Numpy, this object also stores a pointer to where the memory is held (CPU or GPU).

For example, to construct a tensor of zeros on CPU and GPU:

```
import mxnet as mx
cpu_tensor = mx.nd.zeros((10,), ctx=mx.cpu())
gpu_tensor = mx.nd.zeros((10,), ctx=mx.gpu(0))
```

In MXNet, you need to specify where arrays are held. This is done by passing in an appropriate context to GPU or CPU.

Much like in Numpy, basic math operations can be done on these.

```
ctx = mx.cpu() # which context to put memory on.
a = mx.nd.ones((10, ), ctx=ctx)
b = mx.nd.ones((10, ), ctx=ctx)
c = (a + b) / 10.
d = b + 1
```

Unlike Numpy, everything done on these arrays is done lazily. Each one of these operations will return instantly with a NDArray that represents some future computation. MXNet’s true power is revealed in how it does these computations. An operation gets scheduled to be computed, and runs when its dependencies are met. For example, to compute the value of `d`

, you need to know the value of `b`

, so that must be computed first. In this example, you don’t need to know the value of `c`

to compute `d`

, meaning that the execution engine is free to compute `c`

and `d`

in whatever order it wants, or even concurrently in an efficient manner.

To actually read the values in the NDArray, you can call:

```
numpy_d = d.asnumpy()
```

This function blocks until the value of `d`

has been computed, then converts it to a Numpy array.

While the imperative API is extremely powerful by itself, it is often very rigid and hard to prototype with. Everything must be known about the computation ahead of time, and must be written out by the user. The symbolic API tries to remedy this. Instead of working with defined arrays, you work with symbols that can be “compiled” or interpreted to a executable set of operations.

Take a similar example to the one above:

```
import mxnet as mx
a = mx.sym.Variable("A") # represent a placeholder. These can be inputs, weights, or anything else.
b = mx.sym.Variable("B")
c = (a + b) / 10
d = c + 1
```

With these symbols defined, we can inspect the graph’s inputs and outputs.

```
d.list_arguments()
# ['A', 'B']
d.list_outputs()
# ['_plusscalar0_output'] This is the default name from adding to scalar.
```

The graph allows shape inference.

```
# define input shapes
inp_shapes = {'A':(10,), 'B':(10,)}
arg_shapes, out_shapes, aux_shapes = d.infer_shape(**inp_shapes)
arg_shapes # the shapes of all the inputs to the graph. Order matches d.list_arguments()
# [(10, ), (10, )]
out_shapes # the shapes of all outputs. Order matches d.list_outputs()
# [(10, )]
aux_shapes # the shapes of auxiliary variables. These are variables that are not trainable such as batch normalization population statistics. For now, they are save to ignore.
# []
```

Unlike other frameworks, MXNet’s graphs are completely stateless. They just represent some function that has arguments and outputs. In MXNet there is no difference between “weights”, or parameters of a model and its inputs (data fed in). They are both arguments to the graph. For example, a graph representing a logistic regression would have three arguments: the input data, some weights, and biases.

To actually perform computation in the graph, we need to create what MXNet calls an `Executor`

. This can be done by `bind`

-ing an output symbol with a specific set of input variables.

```
input_arguments = {}
input_arguments['A'] = mx.nd.ones((10, ), ctx=mx.cpu())
input_arguments['B'] = mx.nd.ones((10, ), ctx=mx.cpu())
executor = d.bind(ctx=mx.cpu(),
args=input_arguments, # this can be a list or a dictionary mapping names of inputs to NDArray
grad_req='null') # don't request gradients
```

This `Executor`

allocates all the necessary memory and temporary variables to perform the computation. Once bound, an `Executor`

always works with the same memory locations for inputs and outputs. To actually compute results for given values, you need to modify the contents of the input variables, call `forward`

, and then read out the outputs.

```
import numpy as np
# The executor
executor.arg_dict
# {'A': NDArray, 'B': NDArray}
executor.arg_dict['A'][:] = np.random.rand(10,) # Note the [:]. This sets the contents of the array instead of setting the array to a new value instead of overwriting the variable.
executor.arg_dict['B'][:] = np.random.rand(10,)
executor.forward()
executor.outputs
# [NDArray]
output_value = executor.outputs[0].asnumpy()
```

Like in the imperative API, calling forward is also lazy and will return before computation is finished.

Now, the great thing about doing computation with a dataflow graph is that you can automatically differentiate with respect to inputs. This can be done automatically with the `backward`

function. The `Executor`

needs a place to put the output gradients though for this to work, as shown below.

```
# allocate space for inputs
input_arguments = {}
input_arguments['A'] = mx.nd.ones((10, ), ctx=mx.cpu())
input_arguments['B'] = mx.nd.ones((10, ), ctx=mx.cpu())
# allocate space for gradients
grad_arguments = {}
grad_arguments['A'] = mx.nd.ones((10, ), ctx=mx.cpu())
grad_arguments['B'] = mx.nd.ones((10, ), ctx=mx.cpu())
executor = d.bind(ctx=mx.cpu(),
args=input_arguments, # this can be a list or a dictionary mapping names of inputs to NDArray
args_grad=grad_arguments, # this can be a list or a dictionary mapping names of inputs to NDArray
grad_req='write') # instead of null, tell the executor to write gradients. This replaces the contents of grad_arguments with the gradients computed.
executor.arg_dict['A'][:] = np.random.rand(10,)
executor.arg_dict['B'][:] = np.random.rand(10,)
executor.forward()
# in this particular example, the output symbol is not a scalar or loss symbol.
# Thus taking its gradient is not possible.
# What is commonly done instead is to feed in the gradient from a future computation.
# this is essentially how backpropagation works.
out_grad = mx.nd.ones((10,), ctx=mx.cpu())
executor.backward([out_grad]) # because the graph only has one output, only one output grad is needed.
executor.grad_arrays
# [NDarray, NDArray]
```

One of MXNet’s design goals is to be flexible. As a result, it often exposes APIs at multiple different levels. The above call to `bind`

needs all input and gradient arguments defined before hand. A lot of this information could be calculated using shape inference. A simpler API, `simple_bind`

, also exists for the same task. This allocates all the necessary space needed. All it needs is a few input shapes to fully specify the graph.

```
input_shapes = {'A': (10,), 'B': (10, )}
executor = d.simple_bind(ctx=mx.cpu(),
grad_req='write', # instead of null, tell the executor to write gradients
**input_shapes)
executor.arg_dict['A'][:] = np.random.rand(10,)
executor.arg_dict['B'][:] = np.random.rand(10,)
executor.forward()
out_grad = mx.nd.ones((10,), ctx=mx.cpu())
executor.backward([out_grad])
```

**The real power of MXNet is that you can combine the two styles.** For example, you can use symbolic operations to construct a neural network. Bind an executor to compute for a given `batch_size`

, then use the imperative API to do gradient updates. All of these operations will be performed lazily when the needed data is available.

Now for a more complicated example: a small, fully connected neural network with one hidden layer for MNIST.

Visualization of the network. Inputs to the graph are in gray, all other squares are operations performed on the inputs. In this example we are using a batch size of 128.

```
import mxnet as mx
import numpy as np
# First, the symbol needs to be defined
data = mx.sym.Variable("data") # input features, mxnet commonly calls this 'data'
label = mx.sym.Variable("softmax_label")
# One can either manually specify all the inputs to ops (data, weight and bias)
w1 = mx.sym.Variable("weight1")
b1 = mx.sym.Variable("bias1")
l1 = mx.sym.FullyConnected(data=data, num_hidden=128, name="layer1", weight=w1, bias=b1)
a1 = mx.sym.Activation(data=l1, act_type="relu", name="act1")
# Or let MXNet automatically create the needed arguments to ops
l2 = mx.sym.FullyConnected(data=a1, num_hidden=10, name="layer2")
# Create some loss symbol
cost_classification = mx.sym.SoftmaxOutput(data=l2, label=label)
# Bind an executor of a given batch size to do forward pass and get gradients
batch_size = 128
input_shapes = {"data": (batch_size, 28*28), "softmax_label": (batch_size, )}
executor = cost_classification.simple_bind(ctx=mx.gpu(0),
grad_req='write',
**input_shapes)
# The above executor computes gradients. When evaluating test data we don't need this.
# We want this executor to share weights with the above one, so we will use bind
# (instead of simple_bind) and use the other executor's arguments.
executor_test = cost_classification.bind(ctx=mx.gpu(0),
grad_req='null',
args=executor.arg_arrays)
# initialize the weights
for r in executor.arg_arrays:
r[:] = np.random.randn(*r.shape)*0.02
# Using skdata to get mnist data. This is for portability. Can sub in any data loading you like.
from skdata.mnist.views import OfficialVectorClassification
data = OfficialVectorClassification()
trIdx = data.sel_idxs[:]
teIdx = data.val_idxs[:]
for epoch in range(10):
print "Starting epoch", epoch
np.random.shuffle(trIdx)
for x in range(0, len(trIdx), batch_size):
# extract a batch from mnist
batchX = data.all_vectors[trIdx[x:x+batch_size]]
batchY = data.all_labels[trIdx[x:x+batch_size]]
# our executor was bound to 128 size. Throw out non matching batches.
if batchX.shape[0] != batch_size:
continue
# Store batch in executor 'data'
executor.arg_dict['data'][:] = batchX / 255.
# Store label's in 'softmax_label'
executor.arg_dict['softmax_label'][:] = batchY
executor.forward()
executor.backward()
# do weight updates in imperative
for pname, W, G in zip(cost_classification.list_arguments(), executor.arg_arrays, executor.grad_arrays):
# Don't update inputs
# MXNet makes no distinction between weights and data.
if pname in ['data', 'softmax_label']:
continue
# what ever fancy update to modify the parameters
W[:] = W - G * .001
# Evaluation at each epoch
num_correct = 0
num_total = 0
for x in range(0, len(teIdx), batch_size):
batchX = data.all_vectors[teIdx[x:x+batch_size]]
batchY = data.all_labels[teIdx[x:x+batch_size]]
if batchX.shape[0] != batch_size:
continue
# use the test executor as we don't care about gradients
executor_test.arg_dict['data'][:] = batchX / 255.
executor_test.forward()
num_correct += sum(batchY == np.argmax(executor_test.outputs[0].asnumpy(), axis=1))
num_total += len(batchY)
print "Accuracy thus far", num_correct / float(num_total)
```

MXNet does contain a whole host of higher level APIs to dramatically reduce complexity of more straight forward models like the one above. Here is an example of the same model using the `FeedForward`

class. A lot more is hidden, but the code is a lot smaller and simpler.

```
import mxnet as mx
import numpy as np
import logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__) # get a logger to accuracies are printed
data = mx.sym.Variable("data") # input features, when using FeedForward this must be called data
label = mx.sym.Variable("softmax_label") # use this name aswell when using FeedForward
# When using Forward its best to have mxnet create its own variables.
# The names of them are used for initializations.
l1 = mx.sym.FullyConnected(data=data, num_hidden=128, name="layer1")
a1 = mx.sym.Activation(data=l1, act_type="relu", name="act1")
l2 = mx.sym.FullyConnected(data=a1, num_hidden=10, name="layer2")
cost_classification = mx.sym.SoftmaxOutput(data=l2, label=label)
from skdata.mnist.views import OfficialVectorClassification
data = OfficialVectorClassification()
trIdx = data.sel_idxs[:]
teIdx = data.val_idxs[:]
model = mx.model.FeedForward(symbol=cost_classification,
num_epoch=10,
ctx=mx.gpu(0),
learning_rate=0.001)
model.fit(X=data.all_vectors[trIdx],
y=data.all_labels[trIdx],
eval_data=(data.all_vectors[teIdx], data.all_labels[teIdx]),
eval_metric="acc",
logger=logger)
```

I hope this tutorial has been a good introduction to MXNet at its core. Personally, I find this low level of work essential when trying out the latest research work. In general, with almost all of these frameworks, there is a trade off between how expressive you must be (how much code you have to write), and flexibility to implement new and unthought of techniques.

I’ve only scratched the surface of what is inside MXNet — for more information check out the docs. They include both documentation on how to use MXNet as well as design notes from the developers.