Shell Commands 04 — Text Processing

Data Analysis Skills 05

BW L.
Data Engineering Insight

--

This is the fourth article of the shell commands series.

Thanks for your patience if you have read through the previous three shell command posts. Starting from this post, I will use data processing as the running example when explaining shell commands.

Most Linux systems come with powerful text processing commands that have been around since the early days of Linux. This post only covers some basic commands for text processing. Keep in mind that each of these commands is very powerful and I only scratch the surface in this post. Your best friend: “Google <command name> <purpose>”.

Downloading data using wget

Syntax: wget <URL>

The wget command downloads the content at the URL to the current working folder. For example, US baby names is a popular dataset containing the names given to newborns. To start working on it, we first need to download it.

$ mkdir -p data/baby-names
$ cd [esc].
$ wget http://www.ssa.gov/oact/babynames/state/namesbystate.zip
URL transformed to HTTPS due to an HSTS policy
--2021-07-22 22:50:08-- https://www.ssa.gov/oact/babynames/state/namesbystate.zip
Resolving www.ssa.gov (www.ssa.gov)... 2001:1930:d07::aaaa, 137.200.39.19
Connecting to www.ssa.gov (www.ssa.gov)|2001:1930:d07::aaaa|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 22511679 (21M) [application/zip]
Saving to: ‘namesbystate.zip’
namesbystate.zip 100%[==============================================================================================================>] 21.47M 1.67MB/s in 16s
2021-07-22 22:50:29 (1.31 MB/s) - ‘namesbystate.zip’ saved [22511679/22511679]

The first command creates a new folder called data/baby-names. In fact, two folders are created thanks to the -p option: first the data folder, then a subfolder baby-names under it.

The second command is a shortcut that I use every day. By pressing the “esc” key and then the “.” key, the shell automatically inserts the last word of the previous command, in this case data/baby-names. So this command is effectively cd data/baby-names, which enters the folder we just created.
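As a quick sketch, the two commands together (same folder name as above, without the esc-dot shortcut):

```shell
# -p creates any missing intermediate folders and does not
# complain if they already exist
mkdir -p data/baby-names
cd data/baby-names
pwd    # the path printed should end with data/baby-names
```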

Now the data is in a zip file and the next step will be unzipping it. But before we get there, here are some useful options.

  • -q quietly download without printing out the progress
  • -O file_name Save the download under another name. When using wget to download the same file more than once, wget will append numbers like .1 after the original file name. Using the -O option overwrites the same file and will not create the .1 copy.

Unzipping .zip file using unzip command

Now let’s unzip the dataset.

unzip namesbystate.zip

The result is 51 TXT files and one PDF file. Many datasets come in a format like this: the TXT files are the data we need, and the PDF file is a README describing what’s in the data. Remember we are using a Mac? Open the PDF file by

open StateReadMe.pdf

The file will be opened by Preview. If you are on a server and no UI is available, you can download the zip file to your laptop, or just read the most important sentence from this file below:

Each record in a file has the format: 2-digit state code, sex (M = male or F = female), 4-digit year of birth (starting with 1910), the 2–15 character name, and the number of occurrences of the name.

Now we want to know how many records in the New York State file have “John” in them. We’ll pipe together the steps below:

  • Filter all lines that contain “John” using the grep command
  • Count the number of lines in the grep output using the wc command
$ cat NY.TXT | grep John | wc -l
681
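If you don’t have the dataset handy, the same pipeline can be sketched on a few hand-made lines in the same record format (the sample records below are made up, not from the real dataset):

```shell
# Three made-up records in the NY.TXT format: state,sex,year,name,count
printf 'NY,F,1910,John,7\nNY,M,1910,John,931\nNY,M,1910,Mary,12\n' > sample.txt

# Filter lines containing John, then count them
cat sample.txt | grep John | wc -l    # prints 2
```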

Filtering text using grep

The grep command is frequently used in pipes when we need to find specific text in a file or in the output of a command. We have seen an example in the previous post.

Syntax: grep [options] pattern [file]

The easiest way to search for anything is to grep the text directly. For example, in the baby names dataset above, we want to find all lines in the New York State file that contain John.

$ cat NY.TXT | grep John 
NY,F,1910,John,7
NY,F,1911,John,8
...omitting the rest of lines ...

The output of this command has 681 lines as we found out from the previous command.

We can use the head or tail command to limit the output to the first or last few lines. See the respective sections for details.

Another useful example is to pipe it with the ps command.

ps -ef | grep $USER

The ps -ef command prints out a snapshot of all processes at the time the command is executed. The grep command then filters for lines containing the current user’s username $USER. The result is the list of processes run by the current user. Very useful to find out what’s running.
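For instance, to count how many processes the current user is running (the exact number will differ on your machine):

```shell
# grep -v grep drops the grep process itself, which would
# otherwise match because its own command line contains $USER
ps -ef | grep $USER | grep -v grep | wc -l
```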

The Head and Tail

Syntax: head [options] [file name]

head NY.TXT

The head command prints out the first 10 lines of the file NY.TXT. If combined with piping, it prints out the first 10 lines from stdin. We can change the number of lines by using the -n option.

head -n 100 NY.TXT

This command is equivalent to

cat NY.TXT | head -n 100

Syntax: tail [options] [file name]

Similar to head, the tail command prints the last 10 lines from the file or stdin.
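A quick way to see both in action is a throwaway numbered file (generated here with seq instead of the dataset):

```shell
seq 1 100 > nums.txt    # a file with the numbers 1..100, one per line
head -n 3 nums.txt      # prints 1, 2, 3 (one per line)
tail -n 3 nums.txt      # prints 98, 99, 100 (one per line)
```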

A very useful option for the tail command is the -f option, which causes the command not to stop after printing the last line of the file. Rather, it waits for new data to be appended to the file. This is extremely useful when trying to debug a program: we tail -f the log file and monitor what’s printed out while interacting with the program.

Counting using wc

Syntax: wc [options] [file name]

The wc command can count words, lines, and characters. Check the manual using the man wc command. Frequently used options:

  • -l counts the number of lines, as used in the command before the grep section above.
  • -w counts words
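A minimal sketch of the two options on made-up input:

```shell
printf 'one two\nthree\n' > words.txt
wc -l words.txt      # 2 lines (file name is echoed after the number)
wc -w words.txt      # 3 words
wc -l < words.txt    # reading from stdin prints just the number
```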

The wc command is also very useful in shell scripting to run certain commands based on a condition. For example, the commands below check the number of Java processes run by the current user and kill the process if there is exactly one. Don’t worry if you don’t understand all the syntax below. We’ll get to them in future posts.

# grep -v grep excludes the grep process itself from the count
n=$(ps -ef | grep $USER | grep java | grep -v grep | wc -l)
if [ $n -eq 1 ]; then
  pid=$(ps -ef | grep $USER | grep java | grep -v grep | awk '{print $2}')
  echo "[ $(date) ] Killing PID $pid ..." >> server.log
  kill -9 $pid
fi

awk

The awk command scans and processes files. It’s very complicated and we only touch the basics of this command here. Each line of the input file or stdin is processed and the results printed out.

Syntax: awk [options] '{action}' [file name]

Each line of text is treated as one or more fields, separated by whitespace by default. The fields are represented as $1, $2, ... and $0 represents the entire line. So the command below will print out the first field of the file file.txt.

cat file.txt | awk '{print $1}'

But in the baby name data we downloaded, the fields are separated by “,”. So to print out a specific field, we’ll use the -F option.

cat NY.TXT | awk -F "," '{print $4}'

This tells awk to use “,” as the separator and print out the 4th field, which is the name.
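Beyond picking fields, awk can also do light aggregation. On a couple of made-up records in the same format, summing the 5th field (the occurrence count) looks like this:

```shell
# Two made-up records: state,sex,year,name,count
printf 'NY,F,1910,John,7\nNY,M,1910,John,931\n' > sample.txt

# Print the name field
awk -F "," '{print $4}' sample.txt

# Sum the 5th field across all lines; END runs after the last line
awk -F "," '{sum += $5} END {print sum}' sample.txt    # prints 938
```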

sed

There are times when awk cannot easily help with the task at hand; there’s another command, sed, that might be useful. Given the length of this post already, I’ll leave it to you to research. sed stands for “stream editor”. I often use it to get rid of lines I don’t want.
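As one small sketch of that use, sed’s d command deletes lines matching a pattern (the input below is made up):

```shell
# /drop/d deletes every line that matches the pattern "drop"
printf 'keep this\ndrop this line\nkeep that\n' | sed '/drop/d'
# prints:
# keep this
# keep that
```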
