Big Data

Prerequisites

Basics of Linux File systems.
Basic understanding of System Design.

What to expect from this course

This course covers the basics of Big Data and how it has evolved to become what it is today. We will take a look at a few realistic scenarios where Big Data would be a perfect fit. An interesting assignment on designing a Big Data system is followed by understanding the architecture of Hadoop and the tooling around it.

What is not covered under this course

Writing programs to draw analytics from data.

Course Contents

Overview of Big Data
Usage of Big Data Techniques
Evolution of Hadoop
Architecture of Hadoop
1. HDFS
2. Yarn
MapReduce Framework
Other Tooling Around Hadoop
1. Hive
2. Pig
3. Spark
4. Presto
Data Serialization and Storage

Overview of Big Data

Big Data is a collection of large datasets that cannot be processed using traditional computing techniques. It is not a single technique or a tool, rather it has become a complete subject, which involves various tools, techniques, and frameworks.
Big Data could consist of
1. Structured data
2. Unstructured data
3. Semi-structured data
Characteristics of Big Data:
1. Volume
2. Variety
3. Velocity
4. Variability
Examples of Big Data generation include stock exchanges, social media sites, jet engines, etc.

Usage of Big Data Techniques

Take the example of the traffic lights problem.
1. There are more than 300,000 traffic lights in the US as of 2018.
2. Let us assume that we placed a device on each of them to collect metrics and send it to a central metrics collection system.
3. If each of the IoT devices sends 10 events per minute, we have 300000 x 10 x 60 x 24 = 432 x 10 ^ 7 events per day.
4. How would you go about processing that and telling me how many of the signals were “green” at 10:45 am on a particular day?
Consider the next example on Unified Payments Interface (UPI) transactions:
1. We had about 1.15 billion UPI transactions in the month of October 2019 in India.
2. If we try to extrapolate this data to about a year and try to find out some common payments that were happening through a particular UPI ID, how do you suggest we go about that?