Edge AI Computing with the Coral Edge TPU — an unboxing review and performance benchmark by the axonX hackathon team

At Rubix, axonX is one of the innovation teams, and over the last couple of years it has been involved in several IoT/streaming-data projects. More recently, this has led us to explore technologies and trends around Reinforcement Learning, Online/Real-time Learning, and Edge AI Computing.
During our monthly knowledge-sharing and “hackathon” days, we have discussed and elaborated on several use cases and corresponding (machine/deep learning) algorithms related to these AI technologies and trends. In that spirit, we bought a couple of $79 (now $59) Coral USB Accelerator devices to tinker with, to see if they could be applied in some of our use cases. Each device contains a single Google Edge TPU (Tensor Processing Unit).

So, what sets a TPU apart from, say, GPUs or even “traditional” CPUs?

CPUs are general-purpose workhorses; they boast ever-expanding instruction sets to support a wide variety of applications: arithmetic, video compression, encryption — anything goes. GPUs take a narrower approach: they support a smaller instruction set, but have many times more cores, which makes them well suited to simple work that is easily parallelizable, like 3D rendering and mining bitcoins.

TPUs have been around since 2016; they are purpose-built ASICs with a further-reduced instruction set, optimized solely for tensor calculations — the kind that Google’s TensorFlow framework uses extensively. The first three generations of TPUs that Google launched were designed for data centers (a.k.a. “Cloud TPUs”), typically coming in packages of up to 128 cores and providing cost-efficient model-training capabilities.

In a “traditional” machine learning use case, it’s feasible to transport production data to a data center for inferencing, that is: applying a model to input data — for example, feeding a picture through an image classifier. At axonX, we have experience running models in production on TensorFlow Serving (or Seldon Serving) with the machine learning pipelines of the Google Kubeflow framework. This is a valid MLOps approach in many situations, and we’d love to tell you all about it if you’re interested! However, that’s not what this article is about.

In some cases, the network cannot be relied upon for inferencing, due to limitations in bandwidth, latency, or reliability. A self-driving car, for instance, can’t hit the brakes every time it drives out of range of a cell tower to get a decision from a centralized model, so we need local processing power for these decisions and/or predictions. A high-speed camera inspecting items falling through an industrial hopper has milliseconds to decide whether there are defects; again, no use case for centralized or cloud-based solutions.

Enter the Coral Edge TPU.
https://www.youtube.com/embed/ydzJPeeMiMI

The single-core device, announced in 2018 and distributed in several small form factors through Google’s subsidiary Coral since 2019, focuses on power-efficient inferencing.

The products complement each other: training models on huge sets of data is a data-intensive and power-hungry task, best left to Cloud TPUs in Google’s data centers. The Coral Edge TPU fills the gap to the edge: it allows you to take the models to the data and apply them there, instead of the other way around. Data can be processed, filtered, and aggregated before it leaves the device, in ways that previously required callouts to centralized services, forming a bottleneck in data pipelines. This makes sense; it’s a lot quicker to send the text string “apple, apple, pear, banana” across a wire than a video stream where fruit makes an occasional appearance. Better yet, the Edge TPU’s small form factor and low power use make it suitable for consumer products, remote sensors, and other IoT projects, without needing any kind of network connectivity at all.

So, how does the Edge TPU work its magic?

By doing one thing only and doing it… not so well. But well enough, and efficiently. In addition to the reduced instruction set, it operates on narrow 8-bit registers (like the first-generation Cloud TPU), which translates to lower-resolution edge weights in neural network connections. To put this in perspective: a full-width 64-bit float can represent any of 2⁶⁴ = 18,446,744,073,709,551,616 distinct values for some weight, whereas 8 bits can represent only 2⁸ = 256 values. But, as it turns out, that’s enough!
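To make the resolution trade-off concrete, here is a small illustrative sketch of the affine mapping that 8-bit quantization schemes such as TensorFlow Lite’s use: floats are squeezed onto 256 integer steps via a scale and a zero point. The weight values below are made up.

```python
import numpy as np

# Made-up float weights, to be squeezed into 8-bit signed integers.
weights = np.array([-0.92, -0.1, 0.0, 0.37, 1.4], dtype=np.float32)

scale = (weights.max() - weights.min()) / 255.0        # 256 levels -> 255 steps
zero_point = -128 - int(round(weights.min() / scale))  # the int8 that maps to 0.0

q = np.clip(np.round(weights / scale) + zero_point, -128, 127).astype(np.int8)
dequantized = (q.astype(np.float32) - zero_point) * scale

print(q)            # e.g. [-128  -38  -27   14  127]
print(dequantized)  # close to the originals, with 256 levels of resolution
```

Note that 0.0 maps exactly onto the zero point, so it survives quantization losslessly; every other weight is approximated by the nearest of the 256 steps.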

Therefore, models that have been trained with, say, 64-bit weights first need to be converted (“quantized”) to the required TensorFlow Lite model format before they can run on an Edge TPU. Google claims that this does not significantly impact model accuracy, although training models from scratch on 8-bit signed integers should yield the best results.
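As a minimal sketch of what that conversion can look like, here is the post-training full-integer quantization flow of the TensorFlow 2.x TFLiteConverter API (directory names and input shapes are illustrative; our own flow at the time went through TensorFlow 1, as noted at the end of this article):

```python
import numpy as np
import tensorflow as tf

def representative_data_gen():
    # A few hundred typical inputs let the converter calibrate the value
    # ranges of activations; random data stands in for real samples here.
    for _ in range(100):
        yield [np.random.rand(1, 224, 224, 3).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_saved_model("saved_model_dir")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_data_gen
# Fail if any op cannot be expressed with 8-bit integer arithmetic.
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

with open("model_quant.tflite", "wb") as f:
    f.write(converter.convert())
```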

Once a model uses only 8-bit weights, it needs to be compiled into an Edge TPU-compatible model using the Edge TPU compiler. This is due to another cost-cutting optimization: the Edge TPU has only 8 MB of RAM on board, which might not fit larger models.
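The compilation step itself is a one-liner; a minimal sketch, assuming the edgetpu_compiler command-line tool is installed and driven from Python (file names are illustrative):

```python
import subprocess

# Invoke the Edge TPU compiler on a fully int8-quantized model; it writes
# model_quant_edgetpu.tflite next to the input file.
subprocess.run(["edgetpu_compiler", "model_quant.tflite"], check=True)
```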

Enter the compiler. Given a model, it first reserves a small amount of space for the model executable, then tries to fit as many of the model’s parameters as possible into the space that is left. If everything fits, the TPU only needs to cache the model once, after which it can run inferences on it in quick succession. If, on the other hand, not all parameter data can be kept on-device, the compiler caches as many layers as it can, so the runtime only needs to pull in the remaining layers on each inference.

image source: https://coral.ai/static/docs/images/compiler/cache-overflow.svg

Another feature of the compiler is its ability to “co-compile” multiple smaller models, combining them to fit together in memory, or as much of them as space allows; this is done simply by passing several models to a single compiler invocation. Since loading a model that was not co-compiled involves flushing the previously loaded one from the cache, this strategy can speed up scenarios where inferences on different models are run in an interleaved fashion.
The Coral Edge TPU can also be used for so-called “model fine-tuning”, or transfer learning: retraining a pre-trained model on a targeted, relatively small set of training data, which takes much less time than the initial training run. On the Edge TPU, however, this is limited to the last layer of the model.

So, let’s put all that theory to the test!

Time for some figures: we ran a simple image classification model, comparing a plain CPU implementation with running the model on the Coral TPU attached to an old Lenovo Yoga notebook.

According to the Coral TPU’s online specifications, its performance should in principle support real-time video object recognition at speeds of up to 400 FPS, for example with Google’s MobileNet V2 deep learning model.

For demonstration purposes, we deployed such a model (MobileNet SSD V2, driven from Python) on a compact Lenovo Yoga notebook running Ubuntu 18.04 with the USB Accelerator attached, and obtained a video detection speed of around 46 FPS.
As a comparison, during an earlier team hackathon session we ran a CPU-based implementation of the same kind of video object recognition on an HP PC with a 6-core CPU, obtaining a detection speed of around 0.7 FPS, which is significantly slower than the Coral Edge TPU result above.
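For the curious, here is a rough sketch of how such a frames-per-second figure can be measured with Coral’s pycoral Python library; the model and image file names are illustrative, and our actual demo processed a live video stream rather than a single repeated frame:

```python
import time

from PIL import Image
from pycoral.adapters import classify, common
from pycoral.utils.edgetpu import make_interpreter

# Load a compiled Edge TPU model and prepare one input frame.
interpreter = make_interpreter("mobilenet_v2_quant_edgetpu.tflite")
interpreter.allocate_tensors()

image = Image.open("frame.jpg").convert("RGB").resize(
    common.input_size(interpreter), Image.LANCZOS)
common.set_input(interpreter, image)

interpreter.invoke()  # the first run also loads the model into the TPU cache

runs = 500
start = time.perf_counter()
for _ in range(runs):
    interpreter.invoke()
elapsed = time.perf_counter() - start

top = classify.get_classes(interpreter, top_k=1)[0]
print(f"{runs / elapsed:.1f} FPS, top class {top.id} ({top.score:.2f})")
```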

On top of this, we compared the performance of the GPU and the TPU available on Google Colab by running the same notebook on the two backends. After adjusting the code to ensure that it takes advantage of the TPU, we obtained similar performance when training the model with a batch size of 64. With larger batch sizes, e.g. 2048, the TPU performed much better than the GPU runtime. The following table shows the median time per step during training for different batch sizes.
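As a rough sketch of the kind of adjustment involved (a toy MNIST model rather than our actual notebook), routing Keras training through Colab’s TPU backend with tf.distribute looks roughly like this:

```python
import tensorflow as tf

# Attach to the Colab TPU runtime and build a distribution strategy.
resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu="")
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.TPUStrategy(resolver)

# The model must be created inside the strategy scope to live on the TPU.
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=["accuracy"])

(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
train_ds = (
    tf.data.Dataset.from_tensor_slices(
        ((x_train / 255.0).astype("float32"), y_train))
    .shuffle(10_000)
    # Large batches keep all TPU cores busy; fixed shapes suit the TPU.
    .batch(2048, drop_remainder=True))
model.fit(train_ds, epochs=5)
```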

Summarizing: the Coral TPU is useful for running models and making predictions on the edge. In general, if you train on a TPU with a sufficiently large batch size and many epochs, you can get more than three times the performance of a GPU. There’s still room for improvement in terms of compatibility between TensorFlow 2 and TPUs — you currently need TensorFlow 1 to quantize existing models to the required TFLite model format. However, we can see that the future is bright, both for training quickly and efficiently and for doing machine learning on the edge!