Part 1: Why Are Graphics Cards Important for Deep Learning?

Ashok kumar
7 min read · Sep 2, 2021


Introduction

We all know how powerful graphics cards are. Most of you will have heard about the exciting things happening with deep learning, and you will also have heard that deep learning requires a lot of hardware. I have seen people training a simple deep learning model for days on their laptops (typically without GPUs), which leaves the impression that deep learning needs big systems to run.

Nvidia, the best-known manufacturer of graphics cards and AI solutions

Table of contents

  1. Why do we need a graphics card for deep learning?
  2. How does it work in deep learning?

Why do we need a graphics card for deep learning?

Training a deep learning model requires a large dataset, and therefore a large amount of computation and memory traffic. To process that data efficiently, a GPU is the optimal choice: the larger the computation, the greater the advantage of a GPU over a CPU. Serial, control-heavy tasks are still easier to optimize on a CPU. Generally speaking, GPUs are fast because they have high-bandwidth memory and hardware that performs floating-point arithmetic at significantly higher rates than conventional CPUs. A GPU's original job is to perform the calculations needed to render 3D computer graphics.

High bandwidth

But then in 2007 NVIDIA created CUDA. CUDA is a parallel computing platform that provides an API for developers, allowing them to build tools that can make use of GPUs for general-purpose processing.
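
To give a feel for what that looks like, here is a minimal sketch of a general-purpose GPU kernel launched from Python through CUDA using the Numba library (this assumes Numba and a CUDA-capable GPU are installed; the kernel and variable names are only illustrative):

```python
import numpy as np
from numba import cuda

@cuda.jit
def add_one(arr):
    i = cuda.grid(1)            # absolute index of this GPU thread
    if i < arr.shape[0]:
        arr[i] += 1.0           # each thread updates a single element

data = np.zeros(1_000_000, dtype=np.float32)
d_data = cuda.to_device(data)                      # copy the array to GPU memory
threads_per_block = 256
blocks = (data.size + threads_per_block - 1) // threads_per_block
add_one[blocks, threads_per_block](d_data)         # launch the kernel
result = d_data.copy_to_host()                     # copy the result back
```

Every thread processes one element, which is exactly the kind of massively parallel, data-heavy work a GPU is built for.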

Processing large blocks of data is basically what machine learning does, so GPUs come in handy for ML tasks. TensorFlow and PyTorch are examples of libraries that already make use of GPUs, and with the RAPIDS suite of libraries we can now manipulate dataframes and run machine learning algorithms on GPUs as well. (We will look at Nvidia CUDA programming in the next part.)
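
For example, moving a computation onto the GPU in PyTorch is a one-line change (a minimal sketch, assuming PyTorch is installed; it falls back to the CPU when no GPU is present):

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
a = torch.randn(1024, 1024, device=device)
b = torch.randn(1024, 1024, device=device)
c = a @ b            # the matrix multiply runs on the GPU when one is available
print(device, c.shape)
```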

To significantly reduce training time, you can use deep learning GPUs, which let you perform AI computing operations in parallel. When assessing GPUs, you need to consider the ability to interconnect multiple GPUs, the supporting software available, licensing, data parallelism, GPU memory use and performance. Over the past few years, GPU manufacturers have kept improving GPU specifications such as architecture, die size (nm), clock frequency and FLOPS.
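
As one concrete example of data parallelism, PyTorch ships a DataParallel wrapper that replicates a model across the visible GPUs and splits each batch between them (a rough sketch; the layer sizes are arbitrary, and DistributedDataParallel is usually preferred for serious training):

```python
import torch
import torch.nn as nn

model = nn.Linear(512, 10)                  # arbitrary toy model
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)          # replicate the model, split each batch
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

x = torch.randn(64, 512, device=device)
y = model(x)                                # forward pass uses all visible GPUs
```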

How does it work in deep learning?

Graphics processing units, or GPUs, are specialized hardware for the manipulation of images and the calculation of local image properties. The mathematical bases of neural networks and of image manipulation are similar, embarrassingly parallel tasks involving matrices, which is why GPUs have become increasingly used for machine learning. As of 2016, GPUs are popular for AI work, and they continue to evolve in a direction that facilitates deep learning, both for training and for inference in devices such as self-driving cars. GPU vendors such as Nvidia are also developing additional connective capability, such as NVLink, for the kind of dataflow workloads AI benefits from. As GPUs have been increasingly applied to AI acceleration, manufacturers have incorporated neural-network-specific hardware to further accelerate these tasks: Tensor Cores are intended to speed up the training of neural networks.

GPUs are well suited to training artificial intelligence and deep learning models because they can process many computations simultaneously. They have a large number of cores, which allows many parallel processes to run at once.

Training a model without a graphics card simply takes too long, so Nvidia designed its architectures for high-speed computation and introduced the Tensor Core. Like all microprocessors, Tensor Cores are designed to carry out arithmetic and logical operations, and one operation of particular importance is matrix multiplication. Multiplying two 4×4 matrices involves 64 multiplications and 48 additions. Convolution and matrix multiplication are the areas where the new cores shine.
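
That operation count is easy to verify: each of the 16 output elements of a 4×4 product needs 4 multiplications and 3 additions, as the small check below shows.

```python
# Operation count for multiplying two n x n matrices (n = 4 here).
n = 4
multiplications = n * n * n        # 4 per output element x 16 elements = 64
additions = n * n * (n - 1)        # 3 per output element x 16 elements = 48
print(multiplications, additions)  # -> 64 48
```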

CUDA cores have been present on every single GPU developed by Nvidia in the past decade while Tensor Cores have recently been introduced.

Tensor Cores can compute far faster than CUDA cores. A CUDA core performs one operation per clock cycle, whereas a Tensor Core can perform multiple operations per clock cycle.
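
Those "multiple operations" are a fused matrix multiply-accumulate on small tiles, roughly D = A × B + C. The sketch below imitates one such step with NumPy on the CPU purely to show the shapes and precisions involved (the 4×4 tile and the FP16/FP32 mix follow Volta's scheme; this is an illustration, not how the hardware is actually programmed):

```python
import numpy as np

A = np.random.rand(4, 4).astype(np.float16)    # FP16 inputs
B = np.random.rand(4, 4).astype(np.float16)
C = np.random.rand(4, 4).astype(np.float32)    # FP32 accumulator
# One Tensor Core step is equivalent to this fused multiply-accumulate:
D = A.astype(np.float32) @ B.astype(np.float32) + C
```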

CUDA cores are slower than Tensor Cores

Everything comes with a cost, and here, the cost is accuracy. Accuracy takes a hit to boost the computation speed. On the other hand, CUDA cores produce very accurate results.

For machine learning models, CUDA cores are not as effective as Tensor Cores in terms of either cost or computation speed. Hence, Tensor Cores are the preferred choice for training machine learning models.

If you are a developer who wants to learn about this technology in depth, check out Nvidia's official developer blog, which includes dozens of posts on the topic.

What is a tensor?

To understand what Tensor Cores do and what they can be used for, we first need to cover exactly what tensors are. Microprocessors, regardless of what form they come in, all perform math operations (add, multiply, etc.) on numbers.

Sometimes these numbers need to be grouped together because they have some meaning to each other. For example, when a chip is processing data for rendering graphics, it may be dealing with single integer values (such as +2 or +115) for a scaling factor, or a group of floating-point numbers (+0.1, -0.5, +0.6) for the coordinates of a point in 3D space. In the latter case, the position of the point requires all three pieces of data.

A tensor is a mathematical object that describes the relationship between other mathematical objects that are all linked together. They are commonly shown as an array of numbers, where the number of dimensions of the array determines the kind of tensor.

The simplest type of tensor you can get would have zero dimensions, and consist of a single value — another name for this is a scalar quantity. As we start to increase the number of dimensions, we can come across other common math structures:

  • 1 dimension = vector
  • 2 dimensions = matrix

Strictly speaking, a scalar is a rank-0 tensor, a vector is rank 1, and a matrix is rank 2, but for the sake of simplicity, and because of how this relates to Tensor Cores in a graphics processor, we'll just deal with tensors in the form of matrices.
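
The same hierarchy is easy to see in code, where each extra dimension adds one level of nesting (a small NumPy illustration):

```python
import numpy as np

scalar = np.array(3.0)                 # 0 dimensions: a single value
vector = np.array([1.0, 2.0, 3.0])     # 1 dimension
matrix = np.array([[1.0, 2.0],
                   [3.0, 4.0]])        # 2 dimensions
print(scalar.ndim, vector.ndim, matrix.ndim)   # -> 0 1 2
```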

Tensors and TensorFlow

A tensor is a mathematical object represented by an array of components that are functions of the coordinates of a space. Google created its own machine learning framework that uses tensors because tensors allow for highly scalable neural networks.

Google surprised industry analysts when it open-sourced its TensorFlow machine learning software library, but this may have been a stroke of genius, because TensorFlow quickly became one of the most popular machine learning frameworks among developers. Google was also using TensorFlow internally, and it benefits Google if more developers know how to use TensorFlow, because it increases the potential talent pool for the company to recruit from. Meanwhile, chip companies seem to be optimizing their products either for TensorFlow directly or for tensor calculations (as Nvidia is doing with the V100). In other words, chip companies are battling each other to improve Google's open-sourced machine learning framework, a situation that can only benefit Google.

Finally, Google also built its own specialized Tensor Processing Unit, and if the company decides to offer cloud services powered by the TPU, there will be a wide market of developers that could stand to benefit from it (and purchase access to it).

Nvidia Tensor Cores

The Tensor Cores in the Volta-based Tesla V100 are essentially mixed-precision FP16/FP32 cores, which Nvidia has optimized for deep learning applications.

The new mixed-precision cores can deliver up to 120 Tensor TFLOPS for both training and inference applications. According to Nvidia, V100’s Tensor Cores can provide 12x the performance of FP32 operations on the previous P100 accelerator, as well as 6x the performance of P100’s FP16 operations. The Tesla V100 comes with 640 Tensor Cores (eight for each SM).
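
In practice, frameworks expose these mixed-precision cores through automatic mixed precision. The sketch below uses PyTorch's autocast and gradient scaler (it assumes a CUDA-capable GPU, and the model and sizes are placeholders):

```python
import torch
import torch.nn as nn

model = nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
scaler = torch.cuda.amp.GradScaler()           # scales the loss to avoid FP16 underflow

x = torch.randn(64, 1024, device="cuda")
with torch.cuda.amp.autocast():                # matmuls run in FP16, accumulation in FP32
    loss = model(x).float().mean()
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
```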

In the chart below, Nvidia shows that for matrix-matrix multiplication (GEMM), an operation used constantly in the training of neural networks, the V100 can be more than 9x faster than the Pascal-based P100 GPU.

Tesla V100 Tensor Cores and CUDA 9 deliver up to 9x higher performance for GEMM operations

The company said that this result is due to the custom crafting of the Tensor Cores and their data paths to maximize their floating point performance with a minimal increase in power consumption.

Nvidia's take on Tensor Cores

Unprecedented Acceleration for HPC and AI

Tensor Cores enable mixed-precision computing, dynamically adapting calculations to accelerate throughput while preserving accuracy. The latest generation expands these speedups to a full range of workloads. From 10X speedups in AI training with Tensor Float 32 (TF32), a revolutionary new precision, to 2.5X boosts for high-performance computing with floating point 64 (FP64), NVIDIA Tensor Cores deliver new capabilities to all workloads.
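
For reference, frameworks let you opt in to TF32 explicitly; in PyTorch it is a pair of backend flags (available in recent PyTorch releases and only effective on Ampere-or-newer GPUs):

```python
import torch

torch.backends.cuda.matmul.allow_tf32 = True   # use TF32 for matrix multiplications
torch.backends.cudnn.allow_tf32 = True         # use TF32 inside cuDNN convolutions
```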

