Here are frequently asked data engineer interview questions for freshers as well as experienced candidates to help them land the right job.
Data engineering is a term used in big data. It focuses on the practical application of data collection and analysis. The data generated from various sources is just raw data; data engineering helps convert this raw data into useful information.
Data modeling is the method of documenting a complex software design as a diagram so that anyone can easily understand it. It is a conceptual representation of the data objects, the associations between those data objects, and the rules governing them.
There are mainly two types of schemas in data modeling: 1) Star schema and 2) Snowflake schema.
Following are the differences between structured and unstructured data:

| Parameter | Structured Data | Unstructured Data |
| --- | --- | --- |
| Storage | DBMS | Unmanaged file structures |
| Standard | ADO.NET, ODBC, and SQL | SMTP, XML, CSV, and SMS |
| Integration Tool | ETL (Extract, Transform, Load) | Manual data entry or batch processing that includes codes |
| Scaling | Schema scaling is difficult. | Scaling is very easy. |
Following are the components of a Hadoop application:
- NameNode: It is the centerpiece of HDFS. It stores the metadata of HDFS and tracks the various data files across the cluster. The actual data is not stored here; it is stored in the DataNodes.
- Hadoop MapReduce: It is a framework that allows for the creation of Map and Reduce jobs and submits them to a specific cluster.
- HDFS: It stands for Hadoop Distributed File System.
Blocks are the smallest unit of a data file. Hadoop automatically splits huge files into small pieces called blocks (64 MB by default in Hadoop 1.x, 128 MB in Hadoop 2.x and later).
Block Scanner verifies the list of blocks that are present on a DataNode.
Following are the steps that occur when Block Scanner finds a corrupted data block:
1) First of all, when Block Scanner finds a corrupted data block, the DataNode reports it to the NameNode.
2) The NameNode starts the process of creating a new replica from an uncorrupted replica of the block.
3) The NameNode tries to match the replication count of the correct replicas with the replication factor; until the match is found, the corrupted data block is not deleted.
There are two types of messages that the NameNode receives from DataNodes: 1) Block report and 2) Heartbeat.
There are five XML configuration files in Hadoop: core-site.xml, hdfs-site.xml, mapred-site.xml, yarn-site.xml, and hadoop-policy.xml.
Four V's of big data are: 1) Volume, 2) Velocity, 3) Variety, and 4) Veracity.
Important features of Hadoop are:
- It is an open-source framework.
- It is fault-tolerant: data is replicated across nodes, so node failures do not cause data loss.
- It supports distributed storage and processing of very large datasets.
- It scales horizontally on commodity hardware.
- It moves computation to the data (data locality), which reduces network traffic.
COSHH is the abbreviation of Classification and Optimization based Scheduling for Heterogeneous Hadoop systems.
Star Schema or Star Join Schema is the simplest type of Data Warehouse schema. It is known as a star schema because its structure is like a star. In the star schema, the center of the star has one fact table with multiple associated dimension tables. This schema is used for querying large data sets.
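As an illustration, a typical star-schema query joins the central fact table to its dimension tables (the table and column names below are hypothetical):

```sql
-- Total sales per year and product category across the star schema
SELECT d.year, p.category, SUM(f.sales_amount) AS total_sales
FROM fact_sales f
JOIN dim_date d ON f.date_id = d.date_id
JOIN dim_product p ON f.product_id = p.product_id
GROUP BY d.year, p.category;
```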
Follow these steps to deploy a big data solution:
1) Integrate data from data sources like RDBMS, SAP, MySQL, and Salesforce.
2) Store the extracted data in either a NoSQL database or HDFS.
3) Deploy the big data solution using processing frameworks like Pig, Spark, and MapReduce.
File System Check or FSCK is a command used by HDFS. The FSCK command is used to check for inconsistencies and problems in files.
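For example, a typical invocation (the path / is just an example target):

```sh
# Check the health of the whole HDFS namespace, listing files,
# their blocks, and the locations of each block
hdfs fsck / -files -blocks -locations
```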
A Snowflake Schema is an extension of a Star Schema, and it adds additional dimension tables. It is called a snowflake schema because its diagram looks like a snowflake. The dimension tables are normalized, which splits the data into additional tables.

| Star Schema | Snowflake Schema |
| --- | --- |
| Dimension hierarchies are stored in a single dimension table. | Each hierarchy is stored in separate tables. |
| Chances of data redundancy are high. | Chances of data redundancy are low. |
| It has a very simple DB design. | It has a complex DB design. |
| It provides a faster way for cube processing. | Cube processing is slow due to the complex joins. |
Hadoop works with scalable distributed file systems like S3, HFTP FS, FS, and HDFS. The Hadoop Distributed File System is modeled on the Google File System and is designed to run easily on a large cluster of commodity machines.
Data engineers have many responsibilities. They manage the source systems of data. Data engineers simplify complex data structures and prevent the duplication of data. Many times they also provide ELT and data transformation.
The full form of YARN is Yet Another Resource Negotiator.
Modes in Hadoop are 1) Standalone mode, 2) Pseudo-distributed mode, and 3) Fully distributed mode.
Perform the following steps to achieve security in Hadoop (Kerberos authentication):
1) The first step is to secure the authentication channel of the client to the server; the server provides a time-stamped ticket (TGT) to the client.
2) In the second step, the client uses the received time-stamped ticket to request a service ticket from the TGS (Ticket-Granting Server).
3) In the last step, the client uses the service ticket to authenticate itself to a specific server.
In Hadoop, the NameNode and DataNodes communicate with each other. The heartbeat is a signal sent by each DataNode to the NameNode on a regular basis to show its presence.

| NAS (Network-Attached Storage) | DAS (Direct-Attached Storage) |
| --- | --- |
| Storage capacity is 10⁹ to 10¹² bytes. | Storage capacity is 10⁹ bytes. |
| Management cost per GB is moderate. | Management cost per GB is high. |
| Transmits data using Ethernet or TCP/IP. | Transmits data using IDE/SCSI. |
Here are a few fields or languages used by data engineers: probability and linear algebra, machine learning, trend analysis and regression, and SQL and HiveQL databases.
Big data is a large amount of structured and unstructured data that cannot be easily processed by traditional data storage methods. Data engineers use Hadoop to manage big data.
FIFO is a Hadoop job scheduling algorithm. In FIFO scheduling, the scheduler selects jobs from a work queue, oldest job first.
Default port numbers on which Task Tracker, NameNode, and JobTracker run in Hadoop are as follows:
- Task Tracker: 50060
- NameNode: 50070
- JobTracker: 50030
In order to disable Block Scanner on an HDFS DataNode, set dfs.datanode.scan.period.hours to 0.
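For example, in hdfs-site.xml:

```xml
<!-- Disable the periodic block scanner on DataNodes -->
<property>
  <name>dfs.datanode.scan.period.hours</name>
  <value>0</value>
</property>
```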
The distance between two nodes is equal to the sum of their distances to their closest common ancestor node. The method getDistance() is used to calculate the distance between two nodes.
Commodity hardware is easy to obtain and affordable. It is a system that is compatible with Windows, MS-DOS, or Linux.
The replication factor is the total number of replicas of a file in the system.
The NameNode stores the metadata for HDFS, such as block information and namespace information.
In a Hadoop cluster, the NameNode uses rack information to improve network traffic: while serving a read or write request, it prefers DataNodes that are on the same rack as, or a rack near to, the client. The NameNode maintains the rack ID of each DataNode to obtain this rack information. This concept is called Rack Awareness in Hadoop.
Following are the functions of Secondary NameNode:
- It keeps a copy of the FsImage file and the EditLog.
- It periodically merges the EditLog into the FsImage (checkpointing), which keeps the EditLog from growing too large.
- Its copy of the FsImage can be used to recover the file system metadata if the NameNode crashes.
The NameNode is the single point of failure in Hadoop, so while it is down the user cannot submit or execute a new job. If the NameNode is down, jobs may fail, and the user needs to wait for the NameNode to restart before running any job.
There are three basic phases of a reducer in Hadoop:
1. Shuffle: Here, the Reducer copies the output from the Mappers.
2. Sort: Hadoop sorts the input to the Reducer by key, so that all values with the same key are grouped together.
3. Reduce: In this phase, the output values associated with each key are reduced to consolidate the data into the final output.
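A minimal sketch of a Reducer in Java (word-count style; the class follows the standard Hadoop MapReduce API):

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Receives each key with its grouped values after the shuffle and
// sort phases, then consolidates them into the final output.
public class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();  // consolidate all counts for this key
        }
        context.write(key, new IntWritable(sum));
    }
}
```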
The Hadoop framework uses a Context object with the Mapper class in order to interact with the rest of the system. The Context object gets the system configuration details and the job in its constructor.
We use the Context object to pass information in the setup(), cleanup() and map() methods. This object makes vital information available during the map operations.
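A minimal Mapper sketch showing the Context object in setup() and map() (the configuration key tokenmapper.lowercase is a hypothetical custom property):

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private boolean lowercase;

    @Override
    protected void setup(Context context) {
        // Context exposes the job configuration;
        // "tokenmapper.lowercase" is a hypothetical property.
        lowercase = context.getConfiguration().getBoolean("tokenmapper.lowercase", false);
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = lowercase ? value.toString().toLowerCase() : value.toString();
        for (String token : line.split("\\s+")) {
            if (!token.isEmpty()) {
                context.write(new Text(token), ONE);  // emit output via Context
            }
        }
    }
}
```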
It is an optional step between Map and Reduce. The Combiner takes the output from the Map function, creates key-value pairs, and submits them to the Reducer. The Combiner's task is to summarize the results from Map into summary records with an identical key.
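A Reducer with a suitable signature, such as the IntSumReducer sketched above, can often double as a combiner. A hypothetical driver fragment wiring it in:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

// Reuse the reducer as a combiner so map output is pre-aggregated
// locally before the shuffle. Safe here because integer summation
// is associative and commutative.
Job job = Job.getInstance(new Configuration(), "word count");
job.setMapperClass(TokenMapper.class);
job.setCombinerClass(IntSumReducer.class);
job.setReducerClass(IntSumReducer.class);
```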
The default replication factor available in HDFS is three. The default replication factor indicates that there will be three replicas of each data block.
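The replication factor of existing files can also be changed from the command line (the path /data is just an example):

```sh
# Set the replication factor of every file under /data to 3
# and wait (-w) until re-replication completes
hdfs dfs -setrep -w 3 /data
```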
In a big data system, the size of the data is huge, and that is why it does not make sense to move data across the network. Instead, Hadoop tries to move the computation closer to the data, so that the data remains local to its stored location.
In HDFS, the balancer is an administrative tool used by admin staff to rebalance data across the DataNodes; it moves blocks from over-utilized to under-utilized nodes.
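The balancer is typically run from the command line:

```sh
# Rebalance until every DataNode's utilization is within
# 10 percentage points of the cluster average
hdfs balancer -threshold 10
```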
Safemode is a read-only mode of the NameNode in a cluster. Initially, the NameNode is in Safemode. In Safemode it prevents writing to the file system while it collects data and statistics (block reports) from all the DataNodes.
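Safemode can be inspected and exited manually with dfsadmin:

```sh
hdfs dfsadmin -safemode get    # report whether Safemode is ON or OFF
hdfs dfsadmin -safemode leave  # force the NameNode out of Safemode
```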
Hadoop has a useful utility feature called Distributed Cache which improves the performance of jobs by caching the files utilized by applications. An application can specify a file for the cache using the JobConf configuration.
The Hadoop framework makes replicas of these files available on the nodes on which a task has to be executed. This is done before the execution of the task starts. Distributed Cache supports the distribution of read-only files as well as zip and jar files.
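With the newer MapReduce API, the equivalent call lives on the Job object (the HDFS path below is hypothetical):

```java
import java.net.URI;

// Make a read-only lookup file available locally on every task node
// before the tasks start executing
job.addCacheFile(URI.create("/apps/lookup/countries.txt"));
```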
The Hive Metastore stores the schema as well as the Hive table location.
Hive table definitions, mappings, and metadata are stored in the Metastore. This can be stored in any RDBMS supported by JPOX.
SerDe is a short name for Serializer/Deserializer. In Hive, a SerDe allows reading data from a table and writing data to a specific field in any format you want.
There are the following components in the Hive data model: 1) Tables, 2) Partitions, and 3) Buckets.
Hive provides an interface to manage data stored in the Hadoop ecosystem. Hive is also used for mapping and working with HBase tables. Hive queries are converted into MapReduce jobs in order to hide the complexity associated with creating and running MapReduce jobs.
Hive supports the following complex data types: ARRAY, MAP, STRUCT, and UNION (uniontype).
In Hive, .hiverc is the initialization file. This file is loaded when we start the Command Line Interface (CLI) for Hive. We can set the initial values of parameters in the .hiverc file.
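For example, a .hiverc file might set parameters like these (the values shown are just illustrations):

```sql
-- Example .hiverc: executed automatically when the Hive CLI starts
SET hive.cli.print.header=true;        -- show column headers in results
SET hive.exec.dynamic.partition=true;  -- allow dynamic partition inserts
```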
Yes, we can create more than one table schema for a data file. Hive saves each schema in the Hive Metastore. Based on these schemas, we can retrieve dissimilar results from the same data.
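A sketch using external tables (the path and column layouts are hypothetical): both tables read the same underlying files but expose different schemas.

```sql
-- Two schemas over the same HDFS directory
CREATE EXTERNAL TABLE logs_raw (line STRING)
LOCATION '/data/logs';

CREATE EXTERNAL TABLE logs_split (ts STRING, level STRING, msg STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/logs';
```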
There are many SerDe implementations available in Hive, and you can also write your own custom SerDe implementation. Following are some well-known SerDe implementations: LazySimpleSerDe (the default), OpenCSVSerde, RegexSerDe, and JsonSerDe.
Following is a list of table-generating functions: explode(array), explode(map), posexplode(), json_tuple(), parse_url_tuple(), and stack().
A skewed table is a table in which some column values appear far more often than others. In Hive, when we specify a table as SKEWED during creation, skewed values are written into separate files, and the remaining values go to another file.
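For example (the table, column, and values are hypothetical):

```sql
-- Rows where country is 'US' or 'IN' are stored in separate files
CREATE TABLE page_views (view_time INT, user_id BIGINT, country STRING)
SKEWED BY (country) ON ('US', 'IN')
STORED AS DIRECTORIES;
```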
Objects created by the CREATE statement in MySQL are as follows: database, table, index, view, user, stored procedure, function, trigger, and event.
In order to see the database structure in MySQL, you can use the DESCRIBE command. The syntax of this command is DESCRIBE table_name;.
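For example (the table name is hypothetical):

```sql
DESCRIBE employees;  -- lists each column with its type, nullability, key, and default
```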
Use the REGEXP operator to search for a string in a MySQL column. We can also define various types of regular expressions and search for matches using REGEXP.
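For example (the table and column names are hypothetical):

```sql
-- Find names that start with 'Da' (case-insensitive by default)
SELECT * FROM employees WHERE first_name REGEXP '^Da';
```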
Following are ways in which data analytics and big data can increase company revenue:
1) Use data efficiently to make sure that the business grows.
2) Increase customer value.
3) Turn analytics into improved staffing-level forecasts.
4) Cut down the production cost of the organization.