HDFS is a distributed file system for storing very large data files, running on clusters of commodity hardware. It is fault tolerant, scalable, and simple to expand. Hadoop comes bundled with HDFS (Hadoop Distributed File System).
When data exceeds the storage capacity of a single physical machine, it becomes essential to divide it across a number of separate machines. A file system that manages storage-specific operations across a network of machines is called a distributed file system. HDFS is one such file system.
An HDFS cluster primarily consists of a NameNode, which manages the file system metadata, and DataNodes, which store the actual data.
Read/write operations in HDFS operate at a block level. Data files in HDFS are broken into block-sized chunks, which are stored as independent units. The default block size is 128 MB in Hadoop 2.x and later (it was 64 MB in Hadoop 1.x), and it is configurable.
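To make the block arithmetic concrete, here is a minimal sketch in plain Java (no Hadoop classes involved). The class name BlockSplit, the method blockSizes, and the 128 MB block size are assumptions chosen for this illustration; the code only models how a file length maps to block-sized chunks and does not talk to HDFS.

```java
import java.util.ArrayList;
import java.util.List;

public class BlockSplit {
    static final long MB = 1024L * 1024L;

    // Returns the size in bytes of each chunk that a file of fileSize bytes
    // would be cut into for the given block size. The last chunk holds only
    // the bytes that remain, so it can be smaller than a full block.
    static List<Long> blockSizes(long fileSize, long blockSize) {
        List<Long> sizes = new ArrayList<>();
        for (long offset = 0; offset < fileSize; offset += blockSize) {
            sizes.add(Math.min(blockSize, fileSize - offset));
        }
        return sizes;
    }

    public static void main(String[] args) {
        long blockSize = 128 * MB; // assumed block size for this example
        // A 300 MB file needs two full chunks plus one partial chunk.
        System.out.println(blockSizes(300 * MB, blockSize));
        // A 1 MB file needs a single chunk that is only 1 MB long.
        System.out.println(blockSizes(1 * MB, blockSize));
    }
}
```

The sizes are printed in bytes. Passing a different second argument to blockSizes shows how the chunk count changes with the configured block size.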
HDFS operates on a concept of data replication wherein multiple replicas of data blocks are created and are distributed on nodes throughout a cluster to enable high availability of data in the event of node failure.
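The effect of replication on availability can be illustrated with a toy model in plain Java. Everything here is hypothetical: the class ReplicaSketch, its methods, and the round-robin placement rule are invented for this sketch and are not HDFS's real placement policy. The replication factor of 3 used in main matches HDFS's default setting, but the code itself does not use any Hadoop API.

```java
import java.util.ArrayList;
import java.util.List;

public class ReplicaSketch {
    // Toy placement rule: replica r of block b is stored on node (b + r) % nodeCount.
    // This is only an illustration, not the policy HDFS actually applies.
    static List<List<Integer>> place(int blocks, int nodeCount, int replication) {
        if (replication > nodeCount) {
            throw new IllegalArgumentException("need at least as many nodes as replicas");
        }
        List<List<Integer>> placement = new ArrayList<>();
        for (int b = 0; b < blocks; b++) {
            List<Integer> nodes = new ArrayList<>();
            for (int r = 0; r < replication; r++) {
                nodes.add((b + r) % nodeCount);
            }
            placement.add(nodes);
        }
        return placement;
    }

    // Returns true if every block still has a replica on some node other than failedNode.
    static boolean survivesFailure(List<List<Integer>> placement, int failedNode) {
        for (List<Integer> nodes : placement) {
            boolean alive = false;
            for (int node : nodes) {
                if (node != failedNode) {
                    alive = true;
                }
            }
            if (!alive) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Five blocks on four nodes, three replicas each: losing node 0 loses no data.
        System.out.println(survivesFailure(place(5, 4, 3), 0));
        // With a single replica per block, losing node 0 makes some blocks unreachable.
        System.out.println(survivesFailure(place(5, 4, 1), 0));
    }
}
```

With three replicas on distinct nodes, any single node failure leaves at least two copies of every block, which is the property the paragraph above describes.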
Did you know? A file in HDFS that is smaller than a single block does not occupy a full block's worth of storage.
A data read request is served by HDFS, the NameNode, and the DataNodes. Let's call the reader a 'client'. The diagram below depicts the file read operation in Hadoop.
In this section, we will understand how data is written into HDFS through files.
In this section, we look at the Java interface used for accessing Hadoop's file system.
To interact with Hadoop's filesystem programmatically, Hadoop provides multiple Java classes. The package org.apache.hadoop.fs contains classes useful for manipulating files in Hadoop's filesystem. The supported operations include open, read, write, and close. The file API for Hadoop is generic and can be extended to interact with filesystems other than HDFS.
Reading a file from HDFS, programmatically
The java.net.URL class can be used for reading the contents of a file. To begin with, we need to make Java recognize Hadoop's hdfs URL scheme. This is done by calling the static setURLStreamHandlerFactory method on the URL class and passing it an instance of FsUrlStreamHandlerFactory. This method can be called only once per JVM, hence it is enclosed in a static block.
An example:
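import java.io.InputStream;
import java.net.URL;
import org.apache.hadoop.fs.FsUrlStreamHandlerFactory;
import org.apache.hadoop.io.IOUtils;

public class URLCat {
    static {
        // Teach java.net.URL the hdfs scheme; this can be done only once per JVM.
        URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory());
    }

    public static void main(String[] args) throws Exception {
        InputStream in = null;
        try {
            // args[0] is the URL of the file on HDFS.
            in = new URL(args[0]).openStream();
            // Copy the file to standard output using a 4096-byte buffer;
            // 'false' means copyBytes does not close the streams itself.
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(in);
        }
    }
}
This code opens and reads the contents of a file. The path of this file on HDFS is passed to the program as a command-line argument.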
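The once-per-JVM restriction on setURLStreamHandlerFactory can be observed without a Hadoop installation. The sketch below is plain Java and stands in for FsUrlStreamHandlerFactory with a hypothetical DemoFactory that serves a made-up "demo" scheme from memory; the class SchemeDemo, the scheme name, and the returned text are all assumptions for illustration only.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;
import java.net.URLStreamHandler;
import java.net.URLStreamHandlerFactory;
import java.nio.charset.StandardCharsets;

public class SchemeDemo {
    // Hypothetical factory: handles only the invented "demo" scheme.
    static class DemoFactory implements URLStreamHandlerFactory {
        @Override
        public URLStreamHandler createURLStreamHandler(String protocol) {
            if (!"demo".equals(protocol)) {
                return null; // let the JVM use its built-in handlers
            }
            return new URLStreamHandler() {
                @Override
                protected URLConnection openConnection(URL target) {
                    return new URLConnection(target) {
                        @Override
                        public void connect() {
                            // Nothing to connect to; data is served from memory.
                        }

                        @Override
                        public InputStream getInputStream() {
                            String body = "contents of " + getURL().getPath();
                            return new ByteArrayInputStream(body.getBytes(StandardCharsets.UTF_8));
                        }
                    };
                }
            };
        }
    }

    static {
        // Same pattern as in URLCat: register the factory once, in a static block.
        URL.setURLStreamHandlerFactory(new DemoFactory());
    }

    // Reads the whole stream behind the given URL into a string.
    static String read(String spec) throws IOException {
        try (InputStream in = new URL(spec).openStream()) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        }
    }

    // Tries to register a second factory; the JVM rejects this with an Error.
    static boolean registerAgain() {
        try {
            URL.setURLStreamHandlerFactory(new DemoFactory());
            return true;
        } catch (Error alreadyDefined) {
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(read("demo:/temp.txt"));
        System.out.println("second registration accepted: " + registerAgain());
    }
}
```

Because a second registration is refused, a program that depends on another library installing its own URLStreamHandlerFactory cannot also use this approach, which is one reason to treat the URL-based method as a convenience rather than the only way to read from HDFS.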
The command-line interface is one of the simplest ways to interact with HDFS. It supports filesystem operations such as reading files, creating directories, moving files, deleting data, and listing directories.
We can run '$HADOOP_HOME/bin/hdfs dfs -help' to get detailed help on every command. Here, 'dfs' is a shell command of HDFS which supports multiple subcommands.
Some of the widely used commands are listed below along with some details of each one.
1. Copy a file from the local filesystem to HDFS
$HADOOP_HOME/bin/hdfs dfs -copyFromLocal temp.txt /
This command copies file temp.txt from the local filesystem to HDFS.
2. We can list files present in a directory using -ls
$HADOOP_HOME/bin/hdfs dfs -ls /
We can see the file 'temp.txt' (copied earlier) listed under the '/' directory.
3. Command to copy a file to the local filesystem from HDFS
$HADOOP_HOME/bin/hdfs dfs -copyToLocal /temp.txt
We can see temp.txt copied to the local filesystem.
4. Command to create a new directory
$HADOOP_HOME/bin/hdfs dfs -mkdir /mydirectory
Check whether a directory is created or not. Now, you should know how to do it ;-)