Hey, welcome to my article on what technologies Facebook and YouTube use to store their Big Data.
Consider Facebook, Instagram, and similar services: they store a huge amount of data every day. According to a 2012 report, Facebook stores about 500 TB per day, and that data stays there for its lifetime unless we delete it, so the total keeps growing day by day.
So where does Facebook store its Big Data? Does it keep everything on one hard disk of 1000 brontobytes and, when that fills up, move on to another? In principle it could, but it does not. Why? Such a disk would be very expensive, although cost alone would not stop a company like Facebook. There are many other reasons why Facebook cannot store its data this way.
One of the main reasons is I/O operations. What does that mean? We know that RAM is much faster than a hard disk. Storing 10 MB on a hard disk takes 1 to 2 seconds, while storing 4 GB takes 1 to 2 minutes, so the time needed grows with the size of the data. In the same way, opening a large program takes time. Storing data is an output operation and opening (reading) it is an input operation. In short, as the size of the data increases, I/O takes more and more time.
If Facebook stores nearly 500 TB of data per day, think about how long it would take to store or retrieve that on a single disk: at the speeds above, one day's data alone would take months to write. The solution is distributed storage. The data is spread across a number of servers, which solves the storage problem and also the speed (velocity) of storing and retrieving data. How is the load distributed across the servers? This is where Hadoop comes in.
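A rough check of that estimate, using the disk speed quoted earlier (4 GB in 1 to 2 minutes), can be sketched in a few lines of Python. The cluster size of 1000 servers is a made-up number for illustration, and the calculation assumes the load splits perfectly evenly, which a real system would not achieve.

```python
# Back-of-the-envelope estimate using the article's own figures.
# Assumption: a single hard disk writes 4 GB in 1 to 2 minutes.

TB_IN_GB = 1000  # decimal units, for simplicity


def days_to_write(total_tb: float, gb_per_minute: float, servers: int = 1) -> float:
    """Days needed to write total_tb when the load is split evenly across servers."""
    total_gb = total_tb * TB_IN_GB
    minutes = total_gb / (gb_per_minute * servers)
    return minutes / (60 * 24)


daily_volume_tb = 500
fast_disk = 4 / 1  # 4 GB per minute
slow_disk = 4 / 2  # 2 GB per minute

single_fast = days_to_write(daily_volume_tb, fast_disk)
single_slow = days_to_write(daily_volume_tb, slow_disk)
print(f"One disk: {single_fast:.0f} to {single_slow:.0f} days")

# Hypothetical cluster size, chosen only for illustration.
cluster = 1000
spread = days_to_write(daily_volume_tb, slow_disk, servers=cluster)
print(f"{cluster} servers: about {spread * 24:.1f} hours")
```

With these assumptions a single disk needs roughly three to six months for one day's data, while the same work spread over many servers drops to a matter of hours.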
Big Data is actually not a technology. It is a problem that arises when the amount of data to be stored is larger than the capacity of the storage available.
How Big Data is Classified?
Big Data is classified into 3 different categories.
Structured Data refers to the data that has a proper structure associated with it. For example, the data that is present within databases, CSV files, and Excel spreadsheets can be referred to as Structured Data.
Semi-Structured Data refers to the data that has some structure but does not follow a fixed, proper structure. For example, the data that is present within emails, log files, and Word documents can be referred to as Semi-Structured Data.
Un-Structured Data refers to the data that does not have any structure associated with it at all. For example, the image files, the audio files, and the video files can be referred to as Un-Structured Data.
This is how Big Data is classified into different categories.
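To make the three categories concrete, here is a small Python sketch. All sample values (names, log events, bytes) are invented for illustration only.

```python
import csv
import io
import json

# Structured: every row follows the same fixed columns.
csv_text = "user_id,name,age\n1,Asha,29\n2,Ravi,34\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Semi-structured: records describe themselves with keys,
# but the set of fields can differ from one record to the next.
log_lines = [
    '{"event": "login", "user_id": 1}',
    '{"event": "upload", "user_id": 2, "file": "cat.jpg", "size_kb": 512}',
]
events = [json.loads(line) for line in log_lines]

# Unstructured: raw bytes with no fields to query (made-up image-like data).
image_bytes = b"\xff\xd8\xff\xe0" + bytes(16)

print(rows[0]["name"])          # a column can be addressed by name
print(sorted(events[1].keys())) # fields vary per record
print(len(image_bytes))         # only the size is known without decoding
```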
Characteristics of Big Data
Big Data is characterized by 3 important characteristics.
Volume refers to the amount of data that is getting generated.
Velocity refers to the speed at which the data is getting generated.
And Variety refers to the different types of data that are getting generated.
These are the 3 important characteristics of Big Data.
Traditional Approach of Storing and Processing Big Data
In the traditional approach, the data generated by organizations such as banks, stock markets, or hospitals is given as input to an ETL (Extract, Transform, and Load) system.
An ETL system extracts this data, transforms it (that is, converts it into the proper format), and finally loads it into the database.
Once this process is completed, the end users would be able to perform various operations, such as generate reports and perform analytics by querying this data.
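The extract, transform, and load steps described above can be sketched as follows. This is a minimal illustration using Python's built-in csv and sqlite3 modules, not how any particular ETL product works. The sample feed, the table name, and the function names are all hypothetical.

```python
import csv
import io
import sqlite3

# Hypothetical raw export from a source system (e.g. a bank's transaction feed).
RAW = """account,amount,date
 A-100 , 250.50 ,2012-01-05
 B-200 , 99 ,2012-01-05
 A-100 , 10.25 ,2012-01-06
"""


def extract(text):
    """Extract: read raw rows from the source."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: trim stray whitespace and convert amounts to numbers."""
    return [(r["account"].strip(), float(r["amount"]), r["date"].strip()) for r in rows]


def load(conn, records):
    """Load: write the cleaned records into the database."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS transactions (account TEXT, amount REAL, date TEXT)"
    )
    conn.executemany("INSERT INTO transactions VALUES (?, ?, ?)", records)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(conn, transform(extract(RAW)))

# End users can now query the loaded data, e.g. a per-account report.
report = conn.execute(
    "SELECT account, SUM(amount) FROM transactions GROUP BY account ORDER BY account"
).fetchall()
print(report)
```

Everything here runs on one machine against one database, which is exactly the part that becomes hard to scale as the data grows.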
But as this data grows, it becomes a challenging task to manage and process this data using this traditional approach.
This is one of the reasons for not using the traditional approach for storing and processing the Big Data.
Now, let's try to understand some of the major drawbacks of using the traditional approach for storing and processing Big Data.
The first drawback is cost. It is an expensive system that requires a lot of investment to implement or upgrade, so small and mid-sized companies would not be able to afford it.
The second drawback is scalability. As the data grows, expanding this system becomes a challenging task.
And the last drawback is that it is time-consuming. It takes a lot of time to process this data and extract valuable information from it, because the system is designed and built on legacy computing systems.
I hope this makes clear why the traditional approach, or legacy computing systems, are not used to store and process Big Data.
This is why Facebook uses distributed storage instead. The data is spread across a number of servers, which solves both the storage problem and the speed (velocity) of storing and retrieving data. Hadoop is the technology that distributes this load across the servers.
To understand how data distribution works in Hadoop, here is the procedure:
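Hadoop's storage layer, HDFS (Hadoop Distributed File System), is a distributed file system that follows a master-slave architecture for data distribution. In this architecture there is a cluster that consists of a single NameNode (the master node) and multiple DataNodes (the slave nodes). The NameNode keeps track of where each piece of a file is stored, and the DataNodes hold the data itself.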
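The master-slave idea can be sketched with a toy model in Python. This is not Hadoop's API. The class names mirror the roles described above, but the methods, the round-robin placement, and the tiny 4-byte block size are assumptions made for illustration. Real HDFS uses far larger blocks, keeps several copies of each block, and does not pass file contents through the master.

```python
class DataNode:
    """Slave node: stores the actual blocks of data."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}  # block_id -> bytes


class NameNode:
    """Master node: keeps only metadata (which block lives on which DataNode)."""

    def __init__(self, datanodes, block_size):
        self.datanodes = datanodes
        self.block_size = block_size
        self.file_blocks = {}  # filename -> list of (block_id, datanode)

    def write(self, filename, data):
        """Split data into blocks and place them on DataNodes in round-robin order."""
        placements = []
        for start in range(0, len(data), self.block_size):
            index = start // self.block_size
            block_id = f"{filename}#{index}"
            node = self.datanodes[index % len(self.datanodes)]
            node.blocks[block_id] = data[start:start + self.block_size]
            placements.append((block_id, node))
        self.file_blocks[filename] = placements

    def read(self, filename):
        """Look up the block locations and reassemble the file in order."""
        return b"".join(node.blocks[bid] for bid, node in self.file_blocks[filename])


nodes = [DataNode(f"dn{i}") for i in range(3)]
master = NameNode(nodes, block_size=4)  # tiny block size, for illustration only
master.write("photo.jpg", b"0123456789AB")

print({n.name: list(n.blocks) for n in nodes})
print(master.read("photo.jpg"))
```

Because each server holds only part of the file, the blocks can be written and read in parallel, which is what removes the single-disk bottleneck.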
I hope this article was helpful to you.