An introduction to Big Data
History of Big Data

Before 2000, most IT applications used by businesses were built to help them with their daily operations.

Typically, these applications stored customer details and transactions, for example at a retail shop or a bank.

This worked well for a small organization like a retail shop or a bank, but not for a large organization with many business functions. There was a separate application for each function, such as Sales, Finance, and Inventory, and because these applications came from different vendors and were built on different technologies or platforms, they created silos of data that could not be integrated. This was overcome by ERP systems like SAP, which integrated all the different functions and stored all the data in a central database.

Things were fine until the arrival of the internet, when businesses started their own websites and people began visiting these sites to gather information before calling or visiting them.

This caused an overall increase in data, because it was no longer just transaction data created by individuals but information shared by organizations, for example a newspaper publishing all the content it had.

All this information was available on the internet, but there was no easy way to access it, because people had to already know a website's address to visit it. There are about 700 million websites and well over a trillion web pages. This problem was solved by the arrival of search engine companies like Yahoo.


Origination of Big Data

Imagine the problem faced by a search engine like Yahoo: it had to search through a huge data set and return results quickly. To make matters worse, data was exploding in volume, variety, and velocity, so storing and retrieving this huge amount of data, arriving in many different formats, was a problem.


Just to get an idea of the data being generated daily:

A Boeing 737 will generate 240 terabytes of flight data during a single flight across the US.
Twitter generates 7 TB of data daily.
Facebook ingests 500 terabytes of new data every day.
IBM claims 90% of today’s stored data was generated in just the last two years.


The amount of data created every day is mind-boggling, and it will only continue to increase. The problem with so much data is that it becomes almost impossible to separate the important facts from the unimportant ones.

So we need computer programs to help us filter the data. But a single program working through terabytes of data requires a lot of time, and by the time the processing has completed, the answer may no longer be relevant. For example, knowing the traffic patterns of the last three days does not help me decide how to cross a street at this instant.

The other thing that makes working with all of this data so difficult is the fact that most of the data is unstructured.

Some sources of unstructured data are social networks, web logs, RFID information, video and audio archives, sensor data, military surveillance, astronomy, genomics, and internet search indexing.

Computer programs work well with structured data. If there is a data field that only holds postal codes, it is easy to search on that field and get all of the stores within a particular geographical area. But what if you are looking for key words in recorded conversations? The data is there, but accessing it becomes significantly harder.
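The contrast can be sketched in a few lines of Python. The store records and call transcripts below are made-up illustrative data: a structured field supports a simple, precise test, while unstructured text forces us to scan (or index) every record for a keyword.

```python
# Hypothetical structured records: each store has a well-defined postal_code field.
stores = [
    {"name": "Store A", "postal_code": "94103"},
    {"name": "Store B", "postal_code": "94107"},
    {"name": "Store C", "postal_code": "10001"},
]

# Structured query: find all stores whose postal code starts with "941".
bay_area = [s["name"] for s in stores if s["postal_code"].startswith("941")]
print(bay_area)  # ['Store A', 'Store B']

# Unstructured query: search free text for a keyword.
# Every record must be scanned; there is no field to filter on.
transcripts = [
    "customer asked about a refund for order 123",
    "caller wanted store opening hours",
]
matches = [t for t in transcripts if "refund" in t.lower()]
print(matches)  # ['customer asked about a refund for order 123']
```

Real systems replace the naive keyword scan with full-text indexing, but the underlying difficulty is the same: unstructured data has no schema to query against.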

Over a history spanning more than 30 years, SQL database servers traditionally held gigabytes of information, and reaching even that milestone took a long time. In the past 15 years, data warehouses and enterprise analytics expanded these volumes to terabytes, so storage and retrieval of data was not such a problem. But the arrival of internet companies like Facebook, Google, and Twitter changed the entire scenario: they created data sets so huge that existing relational databases could not handle them.


So the problem could be summed up as: 1) Huge data, growing daily. 2) Unstructured data, for example Facebook posts. 3) Data arriving very fast, for example tweets.

The existing hardware and software could not handle it because:

Scalability was an issue, as data was increasing daily.
Processing capacity was limited and retrieval of data was slow.
There was a huge cost associated with the existing RDBMS vendors.

The Solution for Big Data Problem: Hadoop


For those companies that are able to process this amount of data, there is a tremendous opportunity to make timely decisions and achieve business goals. At the same time, however, organizations are struggling to gain deeper insights from this data. Business leaders continue to make decisions without access to the trusted information they need, and CEOs understand that they must do a better job of capturing data and understanding the resulting information.

So the solution to the problems caused by big data should address the issues mentioned above:

A new database that can handle huge volumes and scale out: to counter the storage issue, the Hadoop Distributed File System (HDFS) came into existence.
Fast retrieval of such huge data: this was taken care of by distributed storage of data and massively parallel processing using MapReduce.
Hardware should not be expensive: this was taken care of by using lots of inexpensive commodity servers.

History of Hadoop


Hadoop is a well-adopted, standards-based, open-source software framework built on the foundation of Google’s MapReduce and Google File System papers. It’s meant to leverage the power of massive parallel processing to take advantage of Big Data, generally by using lots of inexpensive commodity servers.

2004—Initial versions of what are now the Hadoop Distributed File System and MapReduce implemented by Doug Cutting and Mike Cafarella.
December 2005—Nutch ported to the new framework. Hadoop runs reliably on 20 nodes.
January 2006—Doug Cutting joins Yahoo!.
February 2006—Apache Hadoop project officially started to support the standalone development of MapReduce and HDFS.
February 2006—Adoption of Hadoop by the Yahoo! Grid team.
April 2006—Sort benchmark (10 GB/node) run on 188 nodes in 47.9 hours.
May 2006—Yahoo! set up a 300-node Hadoop research cluster.

What is MapReduce and why is it needed?


MapReduce is a software framework that breaks big problems into small, manageable tasks and then distributes them to multiple servers. Actually, “multiple servers” is an understatement: hundreds of computers may hold the data that needs to be processed. These servers are called nodes, and they work together in parallel to arrive at a result.


MapReduce is a programming model for data processing. The model is simple, yet not too simple to express useful programs in. Hadoop can run MapReduce programs written in various languages, including Java, Ruby, Python, and C++. Most importantly, MapReduce programs are inherently parallel, putting very large-scale data analysis into the hands of anyone with enough machines at their disposal.

MapReduce can work with raw data that’s stored in disk files, in relational databases, or both. The data may be structured or unstructured, and is commonly made up of text, binary, or multi-line records. Weblog records, e-commerce click trails and complex documents are just three examples of the kind of data that MapReduce routinely consumes.


What is HDFS and why is it needed?


The Hadoop Distributed File System (HDFS) is designed to store very large data sets reliably, and to stream those data sets at high bandwidth to user applications. In a large cluster, thousands of servers both host directly attached storage and execute user application tasks.

HDFS is a highly fault-tolerant, distributed, reliable, scalable file system for data storage. HDFS stores multiple copies of data on different nodes: a file is split into blocks (64 MB by default) and stored across multiple machines. A Hadoop cluster typically has a single NameNode and a number of DataNodes that together form the HDFS cluster.
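The block-splitting and replication idea can be shown with a toy simulation. The node names, the round-robin placement rule, and treating block size in whole "MB" units are all simplifications for illustration; real HDFS uses rack-aware replica placement, but the 64 MB default block size and 3 replicas match the classic defaults described above.

```python
BLOCK_SIZE = 64   # pretend "MB"; a real default block is 64 * 1024 * 1024 bytes
REPLICATION = 3   # classic HDFS default: 3 copies of every block
datanodes = ["dn1", "dn2", "dn3", "dn4"]

def split_into_blocks(file_size_mb):
    # A file becomes ceiling(size / BLOCK_SIZE) blocks; only the last is partial.
    return [min(BLOCK_SIZE, file_size_mb - i) for i in range(0, file_size_mb, BLOCK_SIZE)]

def place_replicas(num_blocks):
    # Toy round-robin placement: each block is copied to REPLICATION distinct nodes.
    # (The real NameNode picks nodes rack-aware; this only shows the bookkeeping.)
    placement = {}
    for b in range(num_blocks):
        placement[b] = [datanodes[(b + r) % len(datanodes)] for r in range(REPLICATION)]
    return placement

blocks = split_into_blocks(200)        # a 200 "MB" file
print(blocks)                          # [64, 64, 64, 8]
print(place_replicas(len(blocks))[0])  # ['dn1', 'dn2', 'dn3']
```

The NameNode's job corresponds to the `placement` dictionary here: it records which DataNodes hold each block, while the blocks themselves live only on the DataNodes. If one node fails, every block it held still exists on two other nodes.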

How will Hadoop help your career growth?

Ideally, to enter the Big Data sector you should have at least one year of relevant IT experience.

As a fresher you can still learn the required skills, since most companies hire freshers based on their logical and programming abilities.

Even as a fresher, it is worth starting to learn Hadoop, as it will help you switch your career once you have some experience behind you.


Huge demand for skilled professionals

According to a 2015 Forbes report, about 90% of global organizations are investing in big data analytics, and about one third of those organizations call their investments “very significant.” Hence, Big Data and Hadoop will not remain merely a technology but a magic wand in the hands of companies trying to mark their presence in the market.


A McKinsey Global Institute study states that by 2018 the US will face a shortage of about 190,000 data scientists and 1.5 million managers and analysts who can understand and make decisions using Big Data.


Big bucks, as per the statistics:

The average Hadoop Developer salary in the United States is $102,000.


Hadoop practitioners are among the highest-paid IT professionals today, with salaries ranging up to $85K (source: Indeed job portal), and the market demand for them is growing rapidly.


According to Glassdoor, the average salary for a Hadoop & Java Developer at TCS, India is ₹677k – ₹738k.