Big data came as a helping aid during the Nepalese earthquake crisis! Oh yes… that’s true.
You may be taken aback to know that big data played an important role in the Nepalese earthquake crisis: it extended data-driven help and support by connecting people outside the country with their crisis-stricken relatives, family and friends, and by helping direct aid to them.
Big data refers to the tremendously growing capability of collecting voluminous data from a variety of sources and analyzing it with computer algorithms to create useful insights. In today’s age, where huge amounts of structured and unstructured data are being generated by a wide range of internet-enabled devices and machines, the major challenge is to store, process and manage that data effectively while facilitating seamless retrieval at scale. Data analysis is among the most sought-after practices in today’s world of highly competitive businesses, and with such voluminous data, organizations burn the candle at both ends to create useful business insights.
Hadoop, the most popular big data platform, comes to the rescue in such scenarios, handling the storage, processing, retrieval and management of petabytes and even exabytes of data.
Hadoop is an Apache project and an open-source framework engineered for scalable, economical, reliable and efficient storage and processing of big data in a distributed fashion. Hadoop scales from a single server to a large number of machines, each offering local computation and storage. Rather than relying on the hardware to deliver high availability, the Hadoop library detects and handles failures at the application layer.
Hadoop’s ability to analyze data stored across different machines and locations cost-effectively and readily makes it all the more desirable. The MapReduce model, which divides a query into small pieces and processes all of those pieces in parallel, makes Hadoop even more powerful.
The Hadoop ecosystem comprises HDFS (Hadoop Distributed File System), Hadoop MapReduce, Hadoop Common and Hadoop YARN. HDFS and Hadoop MapReduce are considered the main components of the Hadoop core, and in this blog we will discuss these two components.
HDFS (Hadoop Distributed File System) and Hadoop MapReduce have been developed in such a way that they can be co-deployed on a single cluster, allowing the processing to move to where the data resides.
The Hadoop Distributed File System (HDFS) has been designed to run on commodity machines using a distributed file system design. HDFS is not only fault tolerant but can also store voluminous data across many machines. The machines storing the data run on low-cost hardware, and the data files are stored redundantly across them to avoid data loss in case of any failure. A command-line interface is provided for interacting with HDFS, and the cluster status can be monitored to keep track of the data. HDFS supports streaming access to file system data for quick data access, and it enforces authentication and file permissions.
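To make this concrete, here is a minimal sketch of interacting with HDFS programmatically through Hadoop’s Java FileSystem API. The NameNode address and file paths are hypothetical placeholders, and the same operations are also available through the hdfs dfs command-line interface mentioned above.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        // Point the client at the cluster; "localhost:9000" is a placeholder NameNode address.
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://localhost:9000");
        FileSystem fs = FileSystem.get(conf);

        // Create a directory and write a small file (paths are hypothetical).
        Path dir = new Path("/user/demo");
        fs.mkdirs(dir);
        Path file = new Path(dir, "hello.txt");
        try (FSDataOutputStream out = fs.create(file, true)) { // true = overwrite if it exists
            out.writeUTF("Hello, HDFS!");
        }

        // List the directory to confirm the file was stored.
        for (FileStatus status : fs.listStatus(dir)) {
            System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
        }
        fs.close();
    }
}
```

Behind the scenes, HDFS splits such files into blocks and replicates each block across several machines, which is what provides the redundancy described above.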
Hadoop MapReduce is a data processing technique that divides the input data into several chunks and processes them in parallel. The technique involves two tasks, namely map and reduce. The map task converts each chunk of input data into intermediate key/value pairs, while the reduce task takes the output of the map as its input and combines those key/value pairs into a smaller set of results. As the name implies, reduce is performed after the map operation.
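To see the two tasks in action, below is a minimal sketch of the classic word-count job written against Hadoop’s Java MapReduce API; the class names are illustrative, and the input and output paths would be supplied on the command line.

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map task: emit an intermediate (word, 1) pair for every word in a line of input.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce task: sum the counts emitted for each word into a single total.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // local pre-aggregation on each node
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Each map task works on one split of the input in parallel across the cluster, and the framework then groups all values for a given word and hands them to a reducer, which is exactly the map-then-reduce flow described above.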
In our upcoming blogs, we will discuss more about HDFS and other Hadoop essentials.