Things you should know about HDFS

Data Analysis Skills 07

BW L.
Data Engineering Insight

--

This post is part of the Hadoop Series.

Starting from this post, we’ll cover the basics you need to know in order to work with Hadoop clusters. This post covers the storage layer of a Hadoop cluster: HDFS.

Hadoop Distribution

Your company might be using one of the popular Hadoop distributions: Hortonworks (before it merged with Cloudera), Cloudera, or MapR. (Or, if you are using Hadoop in the cloud, it could be AWS EMR or Google Dataproc. Those work a little differently, and I can cover them in another post. Comment and let me know.)

Hadoop Server

A Hadoop cluster might have below nodes (servers):

  • Name nodes, or master nodes. These nodes run the cluster’s coordinating services and keep track of files and jobs. A user will normally not be allowed to log in to name nodes.
  • Data nodes, or worker nodes. These are the servers that do the heavy lifting on data. A user will normally not be allowed to log in to data nodes.
  • Client nodes, or edge nodes. These are the working environment where users interact with the cluster.

In all my Hadoop posts, Hadoop server refers to one of the above nodes, usually a client node. All commands should work the same on any of these Linux servers, and all commands in these Hadoop-related posts should be run on a Hadoop server unless mentioned otherwise. I am assuming you know how to use an SSH client (PuTTY or Cmder on Windows, Terminal on Mac) to log in to a Linux server.
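
For example, logging in to an edge node from your laptop might look like this (the username and hostname are placeholders; your admin will give you the real ones):

ssh john@edgenode01.example.com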

HDFS In a Nutshell

HDFS is the storage layer of Hadoop. Below are some basics you should know about HDFS:

  • Each file stored in the Hadoop cluster is divided into blocks once it exceeds the block size configured by the Hadoop admins (128 MB by default).
  • Each block is replicated to three different data nodes by default.
  • Each Hive table is stored inside a folder on HDFS. The folder can contain many files, each of which is stored as described above.
  • When scheduling a Hive or MapReduce job, YARN (the resource manager of Hadoop) will try to execute tasks on the nodes that store blocks of the table’s files. This brings the compute close to the data and increases the efficiency of the computation.
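
If you’re curious how a particular file is split up, the fsck tool will show its blocks and where the replicas live (a quick check, assuming you have read access; the path is a placeholder):

hdfs fsck /user/john/data.csv -files -blocks -locations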

HDFS Path

An HDFS path looks like hdfs://server:8020/path/to/hdfs/file

where hdfs is the protocol, server is the name node’s hostname, and 8020 is the port number for HDFS. /path/to/hdfs/file is the path to the file. For example, the Hive table default.nyse_stocks is located at hdfs://server:8020/nyse

When using Hadoop tools like Hive or Spark, the hdfs://server:8020 prefix can usually be omitted.
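
In other words, the two commands below should list the same folder (assuming your name node really is called server):

hadoop fs -ls hdfs://server:8020/nyse
hadoop fs -ls /nyse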

HDFS home directory

An HDFS home directory is required for every Hadoop user to run any jobs in Hive, Spark, or MapReduce. A user’s HDFS home folder might look like hdfs://server:8020/user/john. As above, server and 8020 are often omitted.

When a new user logs in to a Hadoop server, their HDFS home directory might not exist yet. This is a common and sometimes confusing issue for new users. To resolve it, the cluster admin would normally set up automation to create each new user’s HDFS home folder.
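
If no such automation exists, the admin-side fix is usually a couple of commands run as the HDFS superuser (a minimal sketch; the user john and his group are placeholders):

hadoop fs -mkdir /user/john
hadoop fs -chown john:john /user/john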

Use HDFS to store large data files

Since most Linux servers have limited disk space, HDFS is more suitable for storing large data files before a Hive table is created.

For example, say a data file named data.csv exists in user john’s home folder /home/john on a server that is connected to a Hadoop cluster. John can then run the command below to upload the file to his HDFS home directory:
hadoop fs -put data.csv
The result of this command is a copy of data.csv being created at hdfs://server:8020/user/john/data.csv

Pay attention to the difference between the local file path /home/john/data.csv and the HDFS path hdfs://server:8020/user/john/data.csv

Basic HDFS commands

To access HDFS, you’ll need to connect to a server through an SSH client like PuTTY, Cmder, or the Mac terminal. All commands below should be run after you have connected to a Hadoop server.

All HDFS commands start with hadoop fs. Below are some basic commands.

  • Listing files:
    hadoop fs -ls <path>
    If path is omitted, the contents of the user’s HDFS home directory are listed.
  • Creating a new directory:
    hadoop fs -mkdir <dir name>
  • Copying a file from local to HDFS:
    hadoop fs -put <local file path> <HDFS file path>
  • Copying a file from HDFS to local:
    hadoop fs -get <HDFS file path>
  • Deleting a file:
    hadoop fs -rm <HDFS file path>
    When a file is deleted, it’s moved to the .Trash folder under your HDFS home directory.
    You can use the -skipTrash option to remove a file permanently:
    hadoop fs -rm -skipTrash <HDFS file path>
  • Deleting an HDFS folder:
    hadoop fs -rm -r <HDFS folder name>
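
Putting these together, a typical session might look like this (app.log and logs are placeholder names):

hadoop fs -mkdir logs            # create logs/ under your HDFS home
hadoop fs -put app.log logs/     # upload a local file into it
hadoop fs -ls logs               # confirm the upload
hadoop fs -get logs/app.log .    # download it back to the current directory
hadoop fs -rm -r logs            # clean up (the folder goes to .Trash)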

The name node doesn’t like too many files

Although Hadoop is designed for big data, too many files can hurt the performance of the cluster. The name node tracks every single file and block on HDFS in memory. When there are millions and millions of files, especially a huge number of small files (smaller than one block), the name node will struggle to keep track of them all and the cluster will start to underperform. When saving data to HDFS as a Hive table, pay attention to the number of files the table has. There are Hive properties that can optimize the final table’s file size, but that will be a topic for another in-depth Hive post.
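
You can check how many directories, files, and bytes a table folder uses with the count command (using the /nyse folder from earlier as an example):

hadoop fs -count /nyse

The output is four columns: directory count, file count, total size in bytes, and the path.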

Examples

Upload data to your HDFS home

To upload a data file or folder from a Hadoop server to HDFS under a new folder named data, run the commands below on the Hadoop server:

hadoop fs -mkdir data 
hadoop fs -put <file name> data/

Download data from HDFS to Hadoop server

There will be times when you want to bring data from HDFS back to the Hadoop server:
hadoop fs -get <hdfs path> <hadoop server path>

Work with data in HDFS

You don’t need to download a file if you just want to take a quick look at it. For text files:

hadoop fs -cat <hdfs file path>
hadoop fs -head <hdfs file path>
hadoop fs -tail <hdfs file path>
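
Be careful with -cat on large files: it streams the whole file to your terminal. Piping into standard Linux tools is usually safer (the path is a placeholder):

hadoop fs -cat /user/john/data.csv | head -100    # first 100 lines only
hadoop fs -cat /user/john/data.csv | grep IBM     # search without downloading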

What’s next

Now that you have an understanding of HDFS, the next step is to actually use it from Hive and Spark. Continue to the following post:
