Audio classification project

07 Jan 2018

This is the fourth article for the website

I am working on a project on audio classification using deep learning. In this post, I will try to explain both deep learning and audio classification.

Firstly, let’s consider audio classification. It is the process of sorting audio data into distinct groups, which helps divide huge chunks of data into useful categories. Human beings have inborn or trained capabilities to distinguish between different colours and sounds from the environment. Children learn to tell different colours and different sounds apart within the first few years of life.
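To make the idea of sorting audio into groups concrete, here is a minimal sketch in plain Python. It is not deep learning and not the method used in my project: it generates synthetic sine tones and splits them into two groups using one hand-picked feature, the zero-crossing rate. The sample rate, the threshold and the function names (`make_tone`, `zero_crossing_rate`, `classify`) are all assumptions made up for this illustration.

```python
import math

SAMPLE_RATE = 8000  # samples per second (assumed value for this toy example)


def make_tone(freq_hz, duration_s=0.1, sample_rate=SAMPLE_RATE):
    """Generate a pure sine tone as a list of float samples."""
    n = int(duration_s * sample_rate)
    return [math.sin(2 * math.pi * freq_hz * i / sample_rate) for i in range(n)]


def zero_crossing_rate(samples):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(
        1 for a, b in zip(samples, samples[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(samples) - 1)


def classify(samples, threshold=0.1):
    """Put a clip into one of two groups using a hand-picked threshold.

    A sine tone crosses zero about twice per cycle, so at 8000 samples
    per second a threshold of 0.1 corresponds to roughly 400 Hz.
    """
    if zero_crossing_rate(samples) > threshold:
        return "high-pitched"
    return "low-pitched"


for freq in (200, 1000):
    print(freq, "Hz ->", classify(make_tone(freq)))
```

Real recordings are far messier than pure tones, and a single hand-picked threshold like this stops working very quickly. That gap is what the rest of this post is about.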

This leads to an interesting question! When can we build machines that can think for themselves? This human inquisitiveness gave rise to the field of Artificial Intelligence. The goal of this field is to eventually create machines that display intelligence comparable to that of human beings. The field may be in vogue right now, but it dates back to about 1956. In the beginning, machines succeeded at many of the routine tasks that humans could do. Tasks that required precision, repetition and large amounts of computation became very easy for computers. However, many tasks that are intuitive to human beings, like identifying images, remained extremely difficult for computers.

To solve these intuitive problems, we use Deep Learning. It is inspired by the workings of the human nervous system. The human brain forms new connections between neurons when a new task is learnt and strengthens existing connections when a task is repeated many times.


Describing a colour or an image in mathematical terms is very difficult. Deep Learning takes care of this problem. It lets computers learn from experience and understand the world in a hierarchical manner, building complex concepts out of simpler ones. For example, let us consider the simple task of buying potatoes from a market.


If we follow traditional coding, it becomes almost impossible to describe the humble potato to the computer. Potatoes come in many shapes and varieties, and it is difficult to describe the changing curvature, the loop on the upper part, the elliptical shape, etc. to the computer in lines of code. Enter the savior, Neural Networks.

A neural network will go through many images of potatoes (millions!) and try to develop its own intuition about what a potato is. This intuition takes the form of a graph of related concepts. In the end, it can recognise a potato reliably. The network will take a shape like this.

Neural network after feeding one image of potato

Neural network after feeding millions of images of potato

We can see that the neural network has become far more complex after going through millions of potato images (experience), and there are now many more interconnections (hierarchy) between the neurons. The human brain shows similar enormous growth as it passes through the stages from infancy to adulthood.
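The idea of learning from examples instead of hand-written rules can be sketched with a single artificial neuron (a perceptron) in plain Python. This is a toy under loud assumptions, not a deep network and not a real potato detector: the two features ("roundness" and "brownness"), the rule used to label the synthetic examples, and all function names are made up for illustration. The point is only that the weights are never typed in by hand; they are adjusted each time the neuron gets an example wrong.

```python
import random


def make_examples(n, rng):
    """Create n synthetic examples with two hypothetical features in [0, 1].

    Each example is ((roundness, brownness), label). The label is 1
    ("potato") when the features sum to more than 1, else 0. This rule
    is invented purely to generate toy data. Points too close to the
    boundary are skipped so that the two groups are cleanly separable.
    """
    examples = []
    while len(examples) < n:
        x = (rng.random(), rng.random())
        total = x[0] + x[1]
        if abs(total - 1.0) < 0.1:
            continue
        examples.append((x, 1 if total > 1.0 else 0))
    return examples


def predict(weights, bias, x):
    """Fire (return 1) when the weighted sum of the inputs is positive."""
    return 1 if weights[0] * x[0] + weights[1] * x[1] + bias > 0 else 0


def train(examples, epochs=1000, lr=0.1):
    """Perceptron rule: nudge the weights whenever a prediction is wrong."""
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, label in examples:
            error = label - predict(weights, bias, x)
            if error:
                mistakes += 1
                weights[0] += lr * error * x[0]
                weights[1] += lr * error * x[1]
                bias += lr * error
        if mistakes == 0:  # every training example is now classified correctly
            break
    return weights, bias


def accuracy(weights, bias, examples):
    """Fraction of examples whose predicted label matches the true label."""
    correct = sum(1 for x, label in examples if predict(weights, bias, x) == label)
    return correct / len(examples)


rng = random.Random(0)  # fixed seed so the run is repeatable
train_set = make_examples(200, rng)
test_set = make_examples(100, rng)
weights, bias = train(train_set)
print("learned weights:", weights, "bias:", bias)
print("training accuracy:", accuracy(weights, bias, train_set))
print("test accuracy:", accuracy(weights, bias, test_set))
```

A deep network stacks many such neurons in layers and trains them with gradient-based methods rather than this simple update rule, but the principle of adjusting connections based on experience is the same.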

Deep learning has already shown great promise in solving problems of image, text and speech classification. A lot of major companies have already started using deep learning in their products.

In spite of this, a lot of progress is yet to be made in speech classification.