DISTRIBUTED STORAGE CLUSTER AND HADOOP

 





CONTENT -

  1. Introduction

  2. What are the Issues faced?

  3. How data is increasing?

  4. What is Big Data?

  5. Challenges

  6. Types of Big Data

  7. 3 V’s of Big Data

  8. Distributed Storage

  9. Solution to Big Data

  10. Hadoop Architecture


INTRODUCTION -

Everything was going fine before the Internet, but once it became widespread, technology companies like Google and Facebook started facing a problem. The number of users is increasing day by day, and so is their data. There are approximately 4.57 billion Internet users in the world, and almost 346 million new users joined in a single year.


WHAT ARE THE ISSUES FACED?

Anything a user enters that gets stored in a database is data, and industries can use that data for commercial purposes. The issue is that the amount of data has been increasing exponentially day by day, which raised several questions -

  • Where to store data?

  • Once it is stored, how to process the data?

  • How to retrieve data faster?

  • How to store and retrieve data in real time?

  • How to find raw data for the industry?

  • How to manage that untapped data?


HOW DATA IS INCREASING?

  1. SOCIAL MEDIA - Social media is where people connect with each other online and share their emotions and experiences through images, audio, videos, etc.

Social media is one of the major contributors to Big Data. Instagram, Facebook, and WhatsApp collect a lot of data, such as personal details, pictures, likes, and reactions.

 

  • FACEBOOK - Facebook is a social media platform that had almost 2.7 billion active users as of the second quarter of 2020. It generates 4 petabytes of data per day as people chat and upload images, videos, etc.


  2. GOOGLE - Google is a search engine with 4 billion users. It processes 3.5 billion searches per day, which works out to about 40,000 searches per second on average. Google processes approximately 20 petabytes of data per day through an average of 100,000 MapReduce jobs spread across its massive computing clusters.


  3. INTERNET OF THINGS (IoT) - IoT connects everyday devices to the Internet and makes them smarter. Nowadays we have smart air conditioners, smart rooms, etc. These devices generate a humongous amount of data. It is estimated that by 2025 there will be 41.6 billion connected IoT devices, all generating data.

Many other sources are also causing data to increase.
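The per-second figure quoted for Google searches can be checked with a quick calculation. This is only a small sketch of the arithmetic; the helper name `per_second` is made up for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds in a day


def per_second(per_day: float) -> float:
    """Convert a daily count into an average per-second rate."""
    return per_day / SECONDS_PER_DAY


searches_per_day = 3.5e9  # figure quoted in the text
rate = per_second(searches_per_day)
print(f"{rate:,.0f} searches per second")  # prints "40,509 searches per second"
```

So "40,000 searches per second" is a rounded version of roughly 40,500.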


WHAT IS BIG DATA?


Big Data is the name of a problem: a tsunami of data that keeps increasing exponentially day by day.

Big Data shows up in areas such as science, astronomy, sensor networks, medical records, social data, etc.

 Problems with big data:

  1. Huge Volumes

  2. Data in different types and formats

  3. Impacting the Business



CHALLENGES-

  1. STORING THE DATA - Data arrives in huge volumes, and where to store it is a big issue. A traditional system cannot hold that much data.

Buying one expensive machine with a huge storage capacity is not a good idea either, because it raises another issue: reading and processing all that data from a single machine becomes slow.

For example, if we have a 500 MB file but only 200 MB of storage left, what do we do?


  2. VARIOUS FORMATS OF DATA - Earlier, we used to store data in relational databases, but currently about 80% of data is unstructured. There are now different types of data:

  1. STRUCTURED DATA

  2. UNSTRUCTURED DATA

  3. SEMI-STRUCTURED DATA

It’s hard to handle this data in a traditional manner.


  3. PROCESSING DATA FASTER - Let’s take an example. We have a hard disk of 100 MB and we store data on it. More data arrives, so we increase its size from 100 MB to 500 MB, and as still more data arrives we keep increasing its size again and again. Now all the data is stored, but have you thought about how we are going to retrieve or process it?


Although CPU speed, RAM size, and disk capacity have improved a lot, the thing that has not improved nearly as much is disk speed. For the last 7 to 10 years, the read/write speed of a disk has stayed around 80 MB/sec.
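To see why disk speed matters, here is a rough back-of-the-envelope sketch. It uses the 80 MB/sec figure from above; the 1 TB data size and the 100 disks are assumptions chosen only for illustration, and the model ignores network and coordination overhead.

```python
def read_time_seconds(size_mb: float, speed_mb_per_s: float, disks: int = 1) -> float:
    """Time to read size_mb when it is split evenly across disks read in parallel."""
    return size_mb / (speed_mb_per_s * disks)


ONE_TB_MB = 1_000_000  # assumed data size: 1 TB expressed in MB (decimal units)
SPEED = 80             # MB/sec, the disk speed quoted in the text

single = read_time_seconds(ONE_TB_MB, SPEED)               # everything on one disk
parallel = read_time_seconds(ONE_TB_MB, SPEED, disks=100)  # spread over 100 disks

print(f"1 disk:    {single / 3600:.1f} hours")    # prints "1 disk:    3.5 hours"
print(f"100 disks: {parallel / 60:.1f} minutes")  # prints "100 disks: 2.1 minutes"
```

The disk itself is no faster in the second case; the gain comes purely from reading many disks at the same time.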

So these are the problems industries face once their data grows into Big Data.



TYPES OF BIG DATA-



  1. STRUCTURED DATA - Data stored in a relational database, in the form of rows and columns, is known as structured data.

Examples - stock information, credit card details, hospital medical records, bank records, etc.

Facebook developed its own SQL-based query language for handling Big Data, known as Hive Query Language (HiveQL).


  2. UNSTRUCTURED DATA - Images, audio, videos, etc. are unstructured data. Almost 80% of data is unstructured, and most of it is generated by social media.


  3. SEMI-STRUCTURED DATA - JSON, XML, CSV files, tab-delimited files, log files, etc. are semi-structured data.

Log files record what happens in an application from login to logout. On Facebook, for example, when we log in, what activity we do, and when we log out are all stored in a log file.
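The difference between structured and semi-structured data can be sketched in a few lines. The table, the field names, and the log records below are all invented for illustration; they are not real Facebook data or formats.

```python
import json
import sqlite3

# Structured: a relational table where every row has the same columns.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (account_id INTEGER, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(101, 2500), (102, 480)])
rows = conn.execute(
    "SELECT account_id, balance FROM accounts ORDER BY account_id"
).fetchall()

# Semi-structured: each record names its own fields,
# and records do not have to share the same fields.
log_lines = [
    '{"user": "asha", "event": "login"}',
    '{"user": "asha", "event": "like", "post_id": 42}',
    '{"user": "asha", "event": "logout"}',
]
events = [json.loads(line) for line in log_lines]

print(rows)                       # every row has exactly two columns
print(events[1].get("post_id"))   # this record has a post_id field
print(events[0].get("post_id"))   # this one does not, so .get returns None
```

Unstructured data such as an image has no fields at all, so there is nothing comparable to query without extra processing.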


3 V’s OF BIG DATA-



  1. VOLUME - The main characteristic of Big Data is its huge volume, collected from many different sources. Every sector is flooded with data. Travel, education, entertainment, health, banking, shopping: each of them can benefit immensely from Big Data solutions. Data is collected from diverse sources, including business transactions, social media, sensors, browsing history, etc.

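Data sizes are measured in units such as kilobytes (KB), megabytes (MB), gigabytes (GB), terabytes (TB), petabytes (PB), exabytes (EB), and zettabytes (ZB), each about 1,000 times the previous one.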

  2. VELOCITY - Velocity is the speed at which data arrives. In today’s world, the company that acts on its data faster is in a better position, so speed matters a lot.

A streaming service such as Amazon Kinesis from Amazon Web Services is an example of an application that handles the velocity of data.


  3. VARIETY - In this new era of technology, data comes in different formats and types. Beyond relational data, the share of images, videos, etc. has increased.

The different types of data are- 

  1. STRUCTURED DATA

  2. UNSTRUCTURED DATA

  3. SEMI-STRUCTURED DATA


DISTRIBUTED STORAGE - 

Distributed storage means that when a file can’t be stored on one PC, we split the file and store the pieces on different PCs.

Let’s understand with an example - we have a file of 100 MB but only 50 MB of storage on one PC, so the file can’t be stored as it is. Rather than storing it by vertical scaling, we can store it in a horizontally scaled manner.


VERTICAL SCALING (SCALE UP) - We add more storage to the same machine. This stores the data, but at retrieval or processing time it increases the read/write (input/output) time, because everything still goes through one machine.


HORIZONTAL SCALING (SCALE OUT) - We add more PCs rather than adding storage to one. The advantage of horizontal scaling is that it not only stores the data but also retrieves and processes it at a faster rate, which is good for industries.
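The 100 MB file and 50 MB storage example above can be sketched as follows. This is a toy model, not how any real storage system is implemented: the "PCs" are just dictionary entries, the names `pc-1` and `pc-2` are invented, and sizes are scaled down from MB to KB to keep the example light.

```python
KB = 1024


def split_into_chunks(data: bytes, chunk_size: int) -> list:
    """Cut data into consecutive pieces of at most chunk_size bytes."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


file_data = bytes(range(256)) * (100 * KB // 256)  # a 100 KB stand-in for the 100 MB file
node_capacity = 50 * KB                            # each PC can hold only 50 KB

# Horizontal scaling: one chunk goes to each PC.
chunks = split_into_chunks(file_data, node_capacity)
nodes = {f"pc-{n}": chunk for n, chunk in enumerate(chunks, start=1)}

# Reading the file back means joining the chunks in their original order.
restored = b"".join(nodes.values())

print(len(nodes), restored == file_data)  # prints "2 True"
```

Neither PC could hold the file alone, but together they store it completely, and each PC only has to read half of it.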





SOLUTION TO BIG DATA - 


The solution to Big Data, HADOOP, was given by DOUG CUTTING. Hadoop got its name from his son’s toy elephant, which was called Hadoop.

HADOOP IS A FRAMEWORK WRITTEN IN JAVA.

Hadoop stores and processes data in a distributed manner and in parallel.

TWO MAIN COMPONENTS OF HADOOP ARE-

  1. HDFS (Hadoop Distributed File System) -> for distributed storage

  2. MapReduce -> for parallel processing
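To give a feel for the MapReduce idea, here is a minimal word-count sketch in plain Python. It does not use Hadoop or its Java API; the function names `map_phase`, `shuffle`, and `reduce_phase` are chosen for illustration, and everything runs in one process rather than on a cluster.

```python
from collections import defaultdict


def map_phase(chunk: str):
    """Map: emit a (word, 1) pair for every word in one chunk of input."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: combine the values for each key into a single result."""
    return {key: sum(values) for key, values in grouped.items()}


# Each chunk stands in for a piece of a file stored on a different node.
chunks = ["big data is big", "data is everywhere"]

mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
counts = reduce_phase(shuffle(mapped))
print(counts)  # prints {'big': 2, 'data': 2, 'is': 2, 'everywhere': 1}
```

In a real cluster the map step would run on the nodes that already hold each chunk, which is what makes the processing parallel.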







HADOOP ARCHITECTURE-


MASTER/SLAVE ARCHITECTURE-


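The NameNode is the master and the DataNodes are the slaves.

The NameNode runs on expensive, reliable hardware and stores the metadata, that is, the record of which blocks make up each file and which DataNodes hold them.

DataNodes run on commodity hardware and store the actual file contents, split into blocks and copied according to the replication factor.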
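The division of work between the NameNode and the DataNodes can be sketched like this. It is a simplified model under stated assumptions: a block size of 128 MB and a replication factor of 3 are used as illustrative values, the DataNode names are invented, and blocks are placed round-robin, which is not HDFS's real placement policy.

```python
import math

BLOCK_SIZE_MB = 128  # assumed block size for this sketch
REPLICATION = 3      # assumed replication factor for this sketch


def place_blocks(file_size_mb, datanodes,
                 block_size_mb=BLOCK_SIZE_MB, replication=REPLICATION):
    """Return NameNode-style metadata: block index -> DataNodes holding a copy.

    Placement is plain round-robin, a simplification for illustration only.
    """
    if replication > len(datanodes):
        raise ValueError("need at least as many DataNodes as replicas")
    num_blocks = math.ceil(file_size_mb / block_size_mb)
    metadata = {}
    for block in range(num_blocks):
        metadata[block] = [
            datanodes[(block + r) % len(datanodes)] for r in range(replication)
        ]
    return metadata


datanodes = ["dn1", "dn2", "dn3", "dn4"]  # hypothetical DataNode names
metadata = place_blocks(500, datanodes)   # the 500 MB file from the earlier example

for block, holders in metadata.items():
    print(f"block {block}: {holders}")
```

The dictionary plays the role of the NameNode's metadata: it holds no file contents, only where each block lives. Because every block sits on three DataNodes, losing one machine does not lose the file.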


Thank you for reading !!

Hope you like it!!

For more blogs stay connected…

