Managing Huge Data with Distributed Storage and Hadoop Cluster

Saumy Srivastava
9 min read · Sep 17, 2020

Stepping into the world of Big Data and Distributed Computing

Recently, I started my journey as an ARTH learner in the program "ARTH 2020" under the guidance of the world record holder Mr. Vimal Daga. As this is just the starting phase of my journey, I'm sharing some of my learnings here.

How much data is created on the Internet each day?

Ever thought about how much data is produced every day in this tech-driven world? Before getting into the depth of it, here is an overview.

Big Data Growth Statistics

The data growth stats for 2020 tell us that big data is growing rapidly and that the amount of data generated each day is increasing at an unprecedented rate. The majority of the world's data has emerged in only the last few years. If you've ever wondered how much data is created every day, the commonly cited figure for 2020 stands at about 2.5 quintillion bytes per day (there are 18 zeroes in a quintillion, just for your information).

In the last two years alone, around 90% of the world’s data has been created. By the end of 2020, around 44 zettabytes will make up the entire digital universe. Furthermore, it is estimated that around 463 exabytes of data will be generated each day by humans as of 2025. These impressive technology growth stats for big data show no sign of slowing down. Thus, Big Data is the future of the world.

Tabular Representation of various Memory Sizes

Here’s How Much Data Big Companies Manage

It makes us wonder: just how much data are tech giants like Google and Facebook generating? What does that huge amount of data look like? Here are the stats from four of the biggest tech companies out there right now.

Google: 40,000 Google Web Searches Per Second

Google recently revealed some big stats on big data. More than 3.7 billion people now have regular access to the internet, and that results in roughly 40,000 web searches per second on Google. Our own habits confirm this: no one says "Let me check on the internet." It's always "Let me Google that."

Furthermore, over half of all those web searches are made from mobile devices. It is likely that the web search totals will continue to grow as more and more people across the world start using mobile devices.

Facebook: 4 Petabytes Per Day

Back in 2012, Facebook revealed that this figure was around 500 terabytes. The huge change since then is due to the amount of time users spend on Facebook, as reflected in the data growth statistics throughout 2020. In fact, people spend more time on Facebook than on any other social network.

Ever wondered how many photos are uploaded to Facebook every day? It is a huge number: around 350 million each day.

Another stat Facebook revealed in 2012 was that over 100 petabytes of data were stored in a single Hadoop disk cluster.

Twitter: 12 Terabytes Per Day

One can’t think that 140-character messages comprise large stores of data, but it turns out that the Twitter community generates more than 12 terabytes of data per day.

That equals 84 terabytes per week and 4,368 terabytes, or about 4.3 petabytes, per year. That is certainly a lot of data for short, character-limited messages like those shared on the network. This is because Twitter boasts around 330 million monthly active users, of whom more than 40 percent, about 145 million, use the service on a daily basis.

Amazon: $373 Million in Sales Per Day

  • Amazon is dominating the marketplace: in 2017 it processed around $373 million in sales every day, compared to about $120 million per day back in 2014.
  • Each month, around 206 million people around the world get on their devices and visit Amazon.com.
  • Amazon S3: on top of everything else the company handles, Amazon offers a comprehensive cloud storage solution that naturally facilitates the transfer and storage of massive data troves. Because of this, it is difficult to pinpoint just how much data Amazon handles in total.

What Does Big Data Actually Mean?

3 V’s Of BIG DATA

Big data is a combination of structured, semi-structured and unstructured data collected by organizations that can be mined for useful information and used for predictive modeling and other advanced analytics applications.

In today's tech-savvy world, big data is often characterized by the 3 V's: the large volume of data, the wide variety of data types, and the velocity at which the data is generated, collected and processed.

Importance of Big Data

Many tech giants use the big data accumulated in their systems to improve operations, provide better customer service, develop personalized marketing campaigns based on specific customer preferences and, ultimately, increase profitability.

Companies that utilize big data hold a potential advantage over those that don't, since they are able to make faster and more informed decisions, provided they use the data effectively. Using customer data as an example, the different branches of analytics that can be performed include comparative analysis of customer engagement, social media listening to learn what customers are saying about their products, marketing analysis to develop new ideas for promoting new products, and many more.

How Big Data Is Stored and Processed

The need to handle big data places some unique demands on computing infrastructure. The computing power required to access and process huge volumes and varieties of data quickly can overwhelm a single server or server cluster. Companies must apply adequate processing capacity to big data tasks in order to achieve the required velocity.

This can potentially require hundreds or thousands of servers that distribute the processing work and operate collaboratively, often based on technologies like Hadoop and Apache Spark.

Distributed Storage

In distributed storage there are multiple machines and multiple processors. A set of such nodes is called a 'cluster' (which basically means a collection). This kind of system can scale linearly: if you double the number of nodes in your system, you get roughly double the storage, and the throughput of the system roughly doubles as well.

Distributed Storage

In the present scenario, a wide range of systems and applications, especially in high-performance computing, depend on distributed environments to process and analyse huge amounts of data. As the amount of data increases at an unprecedented rate, the main aim is to develop efficient, reliable and scalable storage systems. One such solution is the Distributed File System (DFS), which presents a hierarchical and unified view of multiple file servers shared over the network.

Role of Apache Hadoop

Hadoop is an open-source framework that lets you store and process big data in a distributed environment across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.

Hadoop Cluster

A Hadoop cluster is designed specifically to store and manage huge amounts of data and to perform big data computation efficiently. It is a collection of commodity hardware interconnected and working together as a single unit.

Building on the solutions described by Google in its papers on the Google File System and MapReduce, Doug Cutting and his team developed an open-source project called 'Hadoop'. The Hadoop framework is written in Java, and it works in an environment that provides distributed storage and computation across clusters of machines.

Hadoop runs applications using the MapReduce model, in which data is processed in parallel across the nodes of the cluster. In short, Hadoop is used to develop applications that can perform complete statistical analysis on huge amounts of data.

Hadoop Framework

MapReduce Algorithm

MapReduce is a programming model that consists of writing map and reduce functions. The map function accepts key/value pairs and produces a sequence of intermediate key/value pairs. The data is then shuffled to group all values belonging to the same key together. After that, the reduce function combines the values that share a key and produces a new key/value pair for each.

How MapReduce Works?
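To make the map and reduce steps concrete, here is a minimal sketch of the classic word-count job written against the standard Hadoop MapReduce Java API. The mapper emits (word, 1) for every word it sees, the shuffle groups those pairs by word, and the reducer sums the counts; the input and output paths passed on the command line are just placeholders, not values from this article.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: for every word in a line, emit the pair (word, 1).
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: after the shuffle groups values by word, sum the counts.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. /user/demo/input
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // e.g. /user/demo/output
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

A job like this is typically packaged into a jar and submitted with something like `hadoop jar wordcount.jar WordCount /user/demo/input /user/demo/output`, after which Hadoop distributes the map and reduce tasks across the nodes of the cluster.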

Hadoop Distributed File System

The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It differs from other distributed file systems in significant ways: HDFS is highly fault-tolerant, is designed to be deployed on low-cost hardware, and provides high-throughput access to application data, making it suitable for applications with large data sets. In fact, HDFS deployments of more than a thousand nodes exist.

In HDFS, files are divided into blocks, and file access follows single-writer and multi-reader semantics. To meet the fault-tolerance requirement, multiple replicas of each block are stored on different DataNodes. The number of replicas is called the replication factor (a small sketch of writing a file with an explicit replication factor appears after the list below). The Hadoop framework also includes the following two modules:

  • Hadoop YARN − This is a framework which is used for job scheduling and cluster resource management.
  • Hadoop Common − These are Java libraries and the utilities required by other Hadoop modules.
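To illustrate blocks and the replication factor in code, here is a minimal sketch that writes a file to HDFS through the Hadoop FileSystem Java API with an explicit replication factor and block size. The NameNode address and file path are assumptions for the example, not values from any real cluster.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed NameNode address; replace with your cluster's fs.defaultFS.
        conf.set("fs.defaultFS", "hdfs://namenode:9000");

        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/user/demo/sample.txt"); // illustrative path

        short replicationFactor = 3;          // three copies on different DataNodes
        long blockSize = 128 * 1024 * 1024L;  // 128 MB blocks (the HDFS default)

        // create(path, overwrite, bufferSize, replication, blockSize)
        try (FSDataOutputStream out =
                 fs.create(file, true, 4096, replicationFactor, blockSize)) {
            out.writeUTF("Hello HDFS: this file is split into blocks and replicated.");
        }

        // The replication factor can also be changed after the file is written.
        fs.setReplication(file, (short) 2);

        fs.close();
    }
}
```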

NameNode

The NameNode is the master node in the Apache Hadoop HDFS architecture. It maintains the file system namespace and metadata and manages the mapping of blocks to the DataNodes (slave nodes).

DataNode

These are the slave nodes in HDFS. Unlike the NameNode, a DataNode runs on commodity hardware, that is, an inexpensive system that is not of high quality or high availability. A DataNode is a block server that stores the data in its local file system.
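To see the NameNode/DataNode split in practice, the sketch below (again with an assumed NameNode address and an illustrative file path) asks the NameNode, through the same FileSystem API, which DataNodes hold each block of a file. Only the block metadata comes from the NameNode; reads of the actual data go straight to the DataNodes.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000"); // assumed NameNode address

        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/user/demo/sample.txt");    // illustrative path

        FileStatus status = fs.getFileStatus(file);
        // Block-to-DataNode mapping is metadata served by the NameNode.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());

        for (BlockLocation block : blocks) {
            System.out.printf("offset=%d length=%d hosts=%s%n",
                    block.getOffset(), block.getLength(),
                    String.join(",", block.getHosts()));
        }
        fs.close();
    }
}
```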

How Is Facebook Deploying Big Data?

With around 2.38 billion active users, Facebook generates an enormous amount of data each day, which makes it one of the biggest users of Hadoop clusters. It is known as one of the most robust users of Hadoop anywhere.

Facebook’s Hadoop Cluster

Facebook runs one of the biggest Hadoop clusters in existence, spanning more than 4,000 machines and storing hundreds of millions of gigabytes. Hadoop provides Facebook with a common infrastructure that is both efficient and reliable.

Facebook built its first user-facing Hadoop application, Facebook Messenger, on top of the Hadoop database Apache HBase, whose layered architecture supports the huge number of messages exchanged in a single day.

Facebook’s HDFS

Facebook stores much of the data on its massive Hadoop cluster, which has grown exponentially in recent years. Today the cluster holds a staggering 30 petabytes of data.

Conclusion

The availability of big data, low-cost commodity hardware, and new information management and analytics software has produced a unique moment in the history of data analysis. Hadoop is one of the frameworks that make it possible to work on such large datasets. Distributed storage, and thus distributed computing, will have vast scope and demand in the coming future.

“The future belongs to those who prepare for it. So get ready for the enthralling journey of Distributed Computing.”

#bigdata #hadoop #bigdatamanagement #arthbylw #vimaldaga #righteducation #educationredefine #rightmentor #worldrecordholder #ARTH #linuxworld #makingindiafutureready #righeudcation
