Introduction to Azure HDInsight
Big data is often too large and too complex for organizations to analyse in traditional relational database management systems. Although such systems can handle large volumes of data, the challenge is not only size but also complexity, which is why big data is usually described in terms of three Vs: Volume, Variety and Velocity. Apache Hadoop lets organizations handle big data efficiently for purposes such as batch processing, real-time processing and predictive analytics.
Today we look in more detail at Azure HDInsight, which is Apache Hadoop running on the Azure cloud, along with its features, such as scalability and resiliency, and its use cases.
The three Vs of Big data
Before we get into Azure HDInsight, we need to understand the three Vs of big data:
- Volume is the sheer amount of data you need to process, such as web server logs that you analyse to understand click streams as people browse the web.
- Variety covers all kinds of data, structured or unstructured: rows and columns of organized data, free text, images, videos, audio clips and so on.
- Velocity is the speed at which data arrives. You could have a constant stream of data, such as readings from sensors and IoT devices, which you need to analyse and visualize in real time.
These are all big data problems that traditional relational databases cannot handle efficiently.
Big Data Handling Issues
Big data handling raises major issues in batch processing, real-time processing and predictive analytics. Let's look at each in more detail.
- Batch processing: data arrives in huge volumes and a variety of formats. We collect it, filter out unwanted data and cleanse it, so that we end up with valid data in the right format and structure to analyse and visualize, for example as tables in a database.
- Real-time processing: we capture data as it arrives and filter out what we are not interested in, for example to find out how many tweets you received in the last hour or how many people visited your website in the past 30 minutes.
- Predictive analytics: you apply statistical models to the data you have captured and, based on historical data, try to predict future outcomes. You might apply a regression technique or cluster the data into small groups, or carry out basket analysis to understand user preferences and make recommendations accordingly.
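The three workload types above can be sketched in a few lines of plain Python. This is a toy, single-machine illustration rather than how Hadoop or HDInsight implements them; the record layout, the function names (`cleanse_batch`, `count_recent`, `fit_line`) and the sample values are all assumptions made up for the example.

```python
from datetime import datetime, timedelta


def cleanse_batch(raw_records):
    """Batch step: drop malformed records and normalise the rest."""
    clean = []
    for record in raw_records:
        user = (record.get("user") or "").strip().lower()
        try:
            amount = float(record.get("amount"))
        except (TypeError, ValueError):
            continue  # unwanted data: amount is missing or not numeric
        if not user:
            continue  # unwanted data: no user name
        clean.append({"user": user, "amount": amount})
    return clean


def count_recent(event_times, now, window):
    """Real-time step: count events inside the trailing time window."""
    cutoff = now - window
    return sum(1 for t in event_times if cutoff < t <= now)


def fit_line(xs, ys):
    """Predictive step: least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


if __name__ == "__main__":
    # Hypothetical sample data, purely for illustration.
    raw = [{"user": " Alice ", "amount": "12.5"}, {"user": "bob", "amount": "n/a"}]
    print(cleanse_batch(raw))

    now = datetime(2024, 1, 1, 12, 0, 0)
    tweets = [now - timedelta(minutes=m) for m in (5, 20, 45, 90)]
    print(count_recent(tweets, now, timedelta(hours=1)))

    slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
    print(slope * 5 + intercept)  # predicted value for x = 5
```

On a real cluster the same ideas would run in parallel across many nodes; the point here is only to make the distinction between the three kinds of processing concrete.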
Apache Hadoop & Azure HDInsight
Now we look at Apache Hadoop, which is the basis for Azure HDInsight. Hadoop is an open-source distributed data processing cluster technology, and Microsoft HDInsight is Apache Hadoop running on Azure: an instance of the Hortonworks HDP Hadoop distribution running on virtual machines in the Azure cloud. You can spin up a cluster simply by going to the Azure portal and requesting a Hadoop cluster; Azure then manages the provisioning and running of all the virtual machines and the installation of Hadoop on them. The virtual machines are configured as a cluster running the Hortonworks implementation of Hadoop.
As storage moves into the cloud, it moves either into an Azure storage account or into Azure Data Lake, a new service for managing large volumes of data. The Hadoop cluster understands this, and the advantage is that the shared storage still behaves the same way for Hadoop, so there is no change to the way Hadoop processes data. Because the shared storage is in the cloud, other applications can write data to it or read data from it directly. Storage is therefore separated from processing in the cluster, which lets us manage the lifetime of the data and the lifetime of the cluster separately.
Another feature of Azure HDInsight is that Hadoop's metadata is stored in an instance of Azure SQL Database. There is no need to provision or manage that database explicitly, as would otherwise be required when running Hadoop on a local server.
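Because the data lives in cloud storage rather than on the cluster's own disks, jobs refer to it by URI. The sketch below builds URIs in the forms commonly used with HDInsight for Azure Blob storage (`wasb://` or `wasbs://`) and Azure Data Lake Storage Gen1 (`adl://`). The helper names and the account, container and path values are hypothetical placeholders, not part of any Azure SDK.

```python
def blob_uri(account, container, path, secure=True):
    """Build a Blob storage URI as addressed from a Hadoop job.

    'wasbs' is the TLS-encrypted variant of the 'wasb' scheme.
    """
    scheme = "wasbs" if secure else "wasb"
    return f"{scheme}://{container}@{account}.blob.core.windows.net/{path.lstrip('/')}"


def data_lake_gen1_uri(account, path):
    """Build an Azure Data Lake Storage Gen1 URI."""
    return f"adl://{account}.azuredatalakestore.net/{path.lstrip('/')}"


if __name__ == "__main__":
    # Hypothetical account, container and path names.
    print(blob_uri("examplestore", "logs", "/2024/01/access.log"))
    print(data_lake_gen1_uri("examplelake", "/clickstream/raw"))
```

Since these URIs point at storage that exists independently of the cluster, the same path remains valid after the cluster is deleted and a new one is created against the same account.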
Features of Azure HDInsight
- Mainly used to create, manage and analyse big data workloads
- Can be deployed quickly on virtual machines using the Azure portal
- Clusters can be created with any number of nodes
- Pay-as-you-go service
- Clusters can be reused when requirements scale down
- Cost-effective way to collect and store structured and unstructured data
- Makes it easier to uncover previously undiscovered information in unstructured data
- Flat network storage technology provides a high-speed connection between the cluster nodes and Blob storage
- HDInsight follows a master-slave pattern: a central master node operates and controls the cluster, and the secondary nodes are integrated with Azure deployments