What is Hive? Architecture & Modes

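This tutorial covers what Hive is, how it differs from relational databases, its architecture, its job execution flow, its modes of operation, and HiveServer2.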

What is Hive?

Hive is an ETL and data warehousing tool built on top of the Hadoop Distributed File System (HDFS). It makes it easier to perform data warehousing operations on data stored in Hadoop.

Important characteristics of Hive

To set up MySQL as the database that stores Hive's metadata, see the tutorial "Installation and Configuration of HIVE and MYSQL".

Some of the key points about Hive:

Hive vs Relational Databases:

Hive offers functionality that relational databases are not designed to provide. For very large volumes of data, in the petabyte range, being able to query the data and get results in a reasonable time is important. Hive handles this by running queries in parallel across the nodes of a Hadoop cluster. Note that Hive is built for batch processing, so individual queries usually have higher latency than queries on a relational database.

Some key differences between Hive and relational databases are the following:

Relational databases enforce the schema when data is written ("schema on write"), and the same schema applies when the data is read. A table is created first, and data is then inserted into it. On relational database tables, operations such as inserts, updates, and modifications can be performed.

Hive is "schema on read" only: the schema is applied when the data is queried, not when it is loaded. As a result, operations such as updates and modifications do not work in Hive versions before 0.14. A Hive query in a typical cluster runs on multiple data nodes, so in those versions it is not possible to update or modify data in place across the nodes.

Hive also follows a "write once, read many" pattern: data is typically loaded once and then queried many times. In later Hive versions, a table can additionally be updated after data has been inserted.

NOTE: Newer versions of Hive relax the restriction on modifying data. Hive 0.14 introduced UPDATE and DELETE as new features.
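The difference between schema on write and schema on read can be illustrated with a small Python sketch. This is a toy model, not Hive code: the raw data, the `SCHEMA` list, and both functions are hypothetical names chosen for the example. It only shows where the schema check happens: at load time (rejecting bad rows) or at read time (leaving the file untouched and returning an empty value for fields that do not fit, similar in spirit to the NULLs Hive returns for malformed fields).

```python
import csv
import io

# Hypothetical raw file contents, stored exactly as they arrived.
RAW_FILE = "1,alice,30\n2,bob,not_a_number\n3,carol,41\n"

# Hypothetical table schema: (column name, Python type used for casting).
SCHEMA = [("id", int), ("name", str), ("age", int)]


def cast(value, target_type):
    """Cast one field; return None when the value does not fit the type."""
    try:
        return target_type(value)
    except ValueError:
        return None


def load_schema_on_write(raw_text, schema):
    """Validate every row while loading; reject data that breaks the schema."""
    accepted = []
    for fields in csv.reader(io.StringIO(raw_text)):
        row = [cast(v, t) for v, (_, t) in zip(fields, schema)]
        if None in row:
            raise ValueError(f"row rejected at load time: {fields}")
        accepted.append(row)
    return accepted


def read_schema_on_read(raw_text, schema):
    """Leave the file as-is; apply the schema only when a query reads it."""
    names = [name for name, _ in schema]
    for fields in csv.reader(io.StringIO(raw_text)):
        values = (cast(v, t) for v, (_, t) in zip(fields, schema))
        yield dict(zip(names, values))


rows = list(read_schema_on_read(RAW_FILE, SCHEMA))
```

With schema on read, the malformed second row is still readable and its `age` comes back as `None`. With schema on write, the same file is rejected when loading is attempted.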

Hive Architecture

[Figure: Apache Hive architecture]

The figure above shows the Apache Hive architecture in detail.

Hive consists of three core parts:

  1. Hive Clients
  2. Hive Services
  3. Hive Storage and Computing

Hive Clients:

Hive provides different drivers for communicating with different types of applications. For Thrift-based applications, it provides a Thrift client.

For Java applications, it provides JDBC drivers. For other types of applications, it provides ODBC drivers. These clients and drivers in turn communicate with the Hive server in Hive Services.
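As a small illustration of how a JDBC client addresses the Hive server, the sketch below builds a connection URL of the form `jdbc:hive2://host:port/database`, which is the format used by the HiveServer2 JDBC driver. The helper name `hiveserver2_jdbc_url` and the host `hive.example.com` are hypothetical. Port 10000 and the `default` database are the usual HiveServer2 defaults, but treat them as assumptions to check against your own installation.

```python
def hiveserver2_jdbc_url(host, port=10000, database="default"):
    """Build a HiveServer2 JDBC URL: jdbc:hive2://<host>:<port>/<database>."""
    if not host:
        raise ValueError("host must not be empty")
    return f"jdbc:hive2://{host}:{port}/{database}"


# Hypothetical host name, used only for illustration.
url = hiveserver2_jdbc_url("hive.example.com")
```

A Java application would pass a URL like this to its JDBC driver. The sketch only assembles the string and does not open a connection.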

Hive Services:

Clients interact with Hive through Hive Services. Any client that wants to perform query-related operations in Hive has to communicate through Hive Services.

The CLI (command line interface) acts as a Hive service for DDL (Data Definition Language) operations. All drivers communicate with the Hive server and with the main driver in Hive Services, as shown in the architecture diagram above.

The driver in Hive Services is the main driver. It communicates with all types of JDBC, ODBC, and other client-specific applications, and passes their requests to the metastore and file system for further processing.

Hive Storage and Computing:

Hive services such as the metastore, file system, and job client in turn communicate with Hive storage.

Job execution flow:

[Figure: Job execution flow in Hive with Hadoop]

The figure above shows the job execution flow in Hive with Hadoop.

The data flow in Hive follows this pattern:

  1. The query is submitted from the UI (user interface).
  2. The driver interacts with the compiler to get the plan. (Here, the plan refers to the query execution process and the gathering of its related metadata.)
  3. The compiler creates the plan for the job to be executed and sends a metadata request to the metastore.
  4. The metastore sends the metadata back to the compiler.
  5. The compiler sends the proposed plan for executing the query to the driver.
  6. The driver sends the execution plan to the execution engine.
  7. The execution engine (EE) acts as a bridge between Hive and Hadoop to process the query and perform DFS operations.
  8. Results are fetched from the driver.
  9. Results are sent to the execution engine. Once the EE has fetched the results from the data nodes, it sends them back to the driver and then to the UI (front end).
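The hand-offs in the steps above can be traced with a toy Python model. None of these classes are Hive APIs: `Metastore`, `Compiler`, `ExecutionEngine`, and `Driver` are hypothetical stand-ins, the "plan" is just a dictionary, and the "Hadoop cluster" is an in-memory dictionary. The sketch only records which component talks to which, in order.

```python
from dataclasses import dataclass, field


@dataclass
class Metastore:
    """Toy metastore: maps table names to column lists."""
    tables: dict

    def get_metadata(self, table):
        return self.tables[table]


@dataclass
class Compiler:
    metastore: Metastore

    def compile(self, query, trace):
        # Steps 3-4: request metadata from the metastore, then build a plan.
        table = query.split()[-1]  # naive: assumes "SELECT ... FROM <table>"
        trace.append("compiler -> metastore: get metadata")
        columns = self.metastore.get_metadata(table)
        trace.append("metastore -> compiler: metadata")
        return {"table": table, "columns": columns}


@dataclass
class ExecutionEngine:
    storage: dict  # stands in for data held on Hadoop data nodes

    def execute(self, plan, trace):
        # Step 7: the engine is the bridge between Hive and Hadoop.
        trace.append("execution engine -> hadoop: run job")
        return self.storage[plan["table"]]


@dataclass
class Driver:
    compiler: Compiler
    engine: ExecutionEngine
    trace: list = field(default_factory=list)

    def run(self, query):
        self.trace.append("ui -> driver: execute query")               # step 1
        self.trace.append("driver -> compiler: get plan")              # step 2
        plan = self.compiler.compile(query, self.trace)                # steps 3-4
        self.trace.append("compiler -> driver: plan")                  # step 5
        self.trace.append("driver -> execution engine: execute plan")  # step 6
        rows = self.engine.execute(plan, self.trace)                   # step 7
        self.trace.append("execution engine -> driver: results")       # steps 8-9
        self.trace.append("driver -> ui: results")
        return rows


driver = Driver(
    Compiler(Metastore({"users": ["id", "name"]})),
    ExecutionEngine({"users": [(1, "alice"), (2, "bob")]}),
)
rows = driver.run("SELECT * FROM users")
```

After the call, `driver.trace` holds the messages in the order they were exchanged, which makes it easy to compare the model against the numbered steps.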

Hive is continuously in contact with the Hadoop file system and its daemons via the execution engine. The dotted arrow in the job flow diagram shows the execution engine's communication with the Hadoop daemons.

Different modes of Hive

Hive can operate in two modes, depending on the number of data nodes in the Hadoop cluster.

These modes are local mode and MapReduce mode.

When to use Local mode:

When to use Map reduce mode:

In Hive, a property determines which mode Hive works in. By default, it works in MapReduce mode; for local mode, use the following setting.

To make Hive work in local mode, set:

SET mapred.job.tracker=local;

Starting with version 0.7, Hive also supports running MapReduce jobs in local mode automatically.
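The automatic choice can be pictured as a simple threshold check. The Python sketch below is a model of the rule as described in the Hive "Getting Started" guide, not Hive's actual code: the function name `should_run_locally` is hypothetical, and the defaults (128 MB of input, 4 map tasks, at most one reducer, feature off unless `hive.exec.mode.local.auto` is enabled) are assumptions to verify for your Hive version. The property names and exact boundary handling have changed between releases.

```python
# Assumed defaults, taken from the Hive "Getting Started" guide.
DEFAULT_MAX_INPUT_BYTES = 128 * 1024 * 1024  # hive.exec.mode.local.auto.inputbytes.max
DEFAULT_MAX_MAP_TASKS = 4                    # hive.exec.mode.local.auto.tasks.max


def should_run_locally(
    input_bytes,
    map_tasks,
    reduce_tasks,
    auto_enabled=False,  # models hive.exec.mode.local.auto (off by default)
    max_input_bytes=DEFAULT_MAX_INPUT_BYTES,
    max_map_tasks=DEFAULT_MAX_MAP_TASKS,
):
    """Return True when a job is small enough to run in local mode."""
    if not auto_enabled:
        return False
    return (
        input_bytes < max_input_bytes
        and map_tasks < max_map_tasks
        and reduce_tasks <= 1
    )


small_job = should_run_locally(10 * 1024 * 1024, 2, 1, auto_enabled=True)
large_job = should_run_locally(1024 ** 3, 2, 1, auto_enabled=True)
```

Here `small_job` qualifies for local mode, while `large_job` exceeds the input size threshold and would run as a regular MapReduce job on the cluster.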

What is Hive Server2 (HS2)?

HiveServer2 (HS2) is a server interface that enables remote clients to execute queries against Hive and retrieve the results.

The current implementation is based on Thrift RPC and adds advanced features such as multi-client concurrency and authentication.

Summary:

Hive is an ETL and data warehouse tool built on top of the Hadoop ecosystem, used for processing structured and semi-structured data.

It also allows user-specific logic to be added to meet client requirements.

 
