A Programmer's Journal

Korn Shell Wildcards

Posted on Dec. 17, 2011, under Uncategorized

Most of us working on Unix know about the shell wildcards (* and ?), which are very useful when working with files.
For example, *.db matches any file ending in .db in the current directory:

# ls -lrt *.db
-rw-r--r--   1 staff   staff      14336 Dec  9 18:58 messages.54672.db
-rw-r--r--   1 staff   staff      23552 Dec 11 14:07 messages.76698.db
-rw-r--r--   1 staff   staff    4536320 Dec 11 22:30 messages.57286.db
-rw-r--r--   1 staff   staff     283648 Dec 11 22:52 messages.80986.db

One of the lesser-known features of the wildcards is the ability to OR them:

# ls -lrt *.@(57286|80986).db
-rw-r--r--    1 staff   staff      4536320 Dec 11 22:30 messages.57286.db
-rw-r--r--    1 staff   staff       283648 Dec 11 22:52 messages.80986.db

Further, you can negate them as well:

# ls -lrt !(*.@(57286|80986)).db
-rw-r--r--   1 staff   staff      14336 Dec  9 18:58 messages.54672.db
-rw-r--r--   1 staff   staff      23552 Dec 11 14:07 messages.76698.db

Here is a quick reference of all KSH wildcards.
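
For a quick feel of the composite pattern operators, here is a minimal sketch that exercises each of them in a case statement. The matches function is a hypothetical helper written for this illustration, and the file names are taken from the listings above. The first line is only there so that the same snippet also runs under bash, which is an assumption about your environment; ksh needs nothing extra.

```shell
# bash needs extglob switched on for these operators; ksh has them built in.
if [ -n "${BASH_VERSION:-}" ]; then shopt -s extglob; fi

# matches STRING PATTERN: hypothetical helper, succeeds if STRING matches PATTERN
matches() {
  case $1 in
    $2) return 0 ;;
  esac
  return 1
}

# ?(list)  zero or one occurrence      *(list)  zero or more occurrences
# +(list)  one or more occurrences     @(list)  exactly one of the alternatives
# !(list)  anything except the alternatives
matches "messages.db"       'messages?(.old).db'  && echo "?() matched"
matches "aaa.db"            '+(a).db'             && echo "+() matched"
matches ".db"               '*(a).db'             && echo "*() matched"
matches "messages.57286.db" '*.@(57286|80986).db' && echo "@() matched"
matches "messages.54672"    '!(*.@(57286|80986))' && echo "!() matched"
```

Each list can hold several alternatives separated by |, and the operators can be nested, as the last line shows.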

I find this feature really useful when I want to process many files but skip a few.
For example, removing all the log files in a directory except the one for the active process (assuming the PID is part of the file name).
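
The log-file case could look roughly like the sketch below. The purge_logs function, the name.PID.log naming scheme and the example directory are all assumptions made for illustration; adjust the pattern to however your files are actually named. As before, the first line only matters if you run it under bash.

```shell
# bash needs extglob switched on for !(...); ksh has it built in.
if [ -n "${BASH_VERSION:-}" ]; then shopt -s extglob; fi

# purge_logs DIR PID: hypothetical helper that removes every *.log file in DIR
# except the ones whose name ends in .PID.log (assumed naming: name.PID.log)
purge_logs() {
  dir=$1
  keep_pid=$2
  for f in "$dir"/!(*."$keep_pid").log; do
    # If nothing matched, the pattern is left as-is, so check before deleting
    [ -f "$f" ] && rm -- "$f"
  done
  return 0
}

# Example call (hypothetical directory), keeping the current shell's own log:
#   purge_logs /var/tmp/myapp "$$"
```

It is worth running the same pattern through ls first to see what it matches before letting rm loose on it.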

Hope you find this useful in your day-to-day work!

~A Programmer


Golden Rule for Super Fast Shell Scripts!

Posted on Jun. 15, 2010, under Uncategorized

Shell scripts are very powerful and can be written in very little time, especially when it comes to processing text files. Most often these scripts are written as a stop-gap solution until a more permanent, efficient C program can replace them. However, once the script is installed in production, people realize it's working just fine and there is no need to spend more effort writing a C program from scratch. Over time the script gets used more and more (a phenomenon one of my good friends labels “if you build it, they will come”, a reference to the movie “Field of Dreams”; more on this in some other post), and it soon becomes a performance bottleneck in the system.

This is partly because there are so many ways of accomplishing a single task in shell scripts that it's difficult for most developers to figure out which one is the most efficient. In this post I’ll cover a single “Golden Rule” that I have discovered, which helps me write really efficient Korn shell scripts. Here it is:

Never launch a child process in a processing loop!

A processing loop, as referred to here, is a loop that iterates over every record in the input file.

The Golden Rule can be expressed in mathematical notation as:

performance = A/(# of child processes launched per input record)

where ‘A’ is a system-dependent constant.

Note: Even though I have used the Korn shell for all the examples here, the basic principle should hold true for any shell.

What is a Child Process?

The crux of the Golden Rule is to make sure that we do not launch a child process over and over again, once per record, within a script execution, because creating a child process involves a lot of overhead. Here are some tips on what causes a child process to be created:

  • Any utility that’s not a shell built-in, such as cut, sed or grep.
  • Every pipeline, since the shell creates child processes to run it.
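
If you are not sure whether a command is a built-in, the shell can tell you. The sketch below uses the standard command -v, which prints a full path for utilities found on disk and just the name for built-ins; is_external is a hypothetical helper name, and the list of commands is only an example. What counts as a built-in varies between shells and versions, so check on the system where the script will actually run.

```shell
# is_external NAME: hypothetical helper, succeeds if NAME resolves to a file
# on disk (and therefore needs a child process every time it is run)
is_external() {
  case $(command -v "$1" 2>/dev/null) in
    /*) return 0 ;;   # a path such as /usr/bin/cut
    *)  return 1 ;;   # a built-in, function, alias, or not found at all
  esac
}

for cmd in echo read cut sed grep; do
  if is_external "$cmd"; then
    echo "$cmd: external, costs a child process on every call"
  else
    echo "$cmd: not an external utility here (most likely a built-in)"
  fi
done
```

In ksh, whence -v gives a more descriptive answer for a single command.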

It will be easier to demo the Golden Rule than to talk about it. So, let's dive into some samples.

Examples

Let's take a very simple example, where you have a CSV input file with a first name, a last name and an email address. The job of the process is to parse the input file, verify that each email address contains an ‘@’ and a ‘.’, and split out any invalid records to an error file.

I’ll show the same script written in 3 different styles, in decreasing order of child processes per record and increasing order of efficiency.

Sample 1

#!/bin/ksh

> valid
> invalid

cat "$1" | while read line
do
  ## Three pipelines used here to parse each record
  fname=$(echo "$line" | cut -d, -f1)
  lname=$(echo "$line" | cut -d, -f2)
  email=$(echo "$line" | cut -d, -f3)
  ## Another pipeline to check validity of the email
  if echo "$email" | egrep ".+@.+\..+" >/dev/null; then
    echo "$fname,$lname,$email" >> valid
  else
    echo "$fname,$lname,$email" >> invalid
  fi
done

Sample 2

#!/bin/ksh

> valid
> invalid

cat "$1" | while IFS=, read fname lname email
do
  ## Eliminated the need for three pipelines by using IFS with read
  if echo "$email" | egrep ".+@.+\..+" >/dev/null; then
    echo "$fname,$lname,$email" >> valid
  else
    echo "$fname,$lname,$email" >> invalid
  fi
done

Sample 3

#!/bin/ksh

cat "$1" | while IFS=, read fname lname email
do
  ## Eliminated the need for egrep by using KSH's built-in pattern matching capability
  if [[ $email = (.)+@(.)+\.(.)+ ]]; then
    echo "$fname,$lname,$email"
  else
    echo "$fname,$lname,$email" >&2
  fi
done > valid 2> invalid
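
The same split can also be written with a plain shell glob in a case statement, which should behave the same in any POSIX-style shell. This is a sketch rather than one of the timed samples: split_emails is a hypothetical function name, the glob ?*@?*.?* is my translation of the egrep expression .+@.+\..+ used earlier (at least one character, an @, at least one character, a dot, at least one character), and the input is read with a redirect instead of cat.

```shell
# split_emails INPUT VALID INVALID: hypothetical helper
#   INPUT   CSV file with first-name,last-name,email on each line
#   VALID   file that receives records whose email passes the check
#   INVALID file that receives all other records
split_emails() {
  while IFS=, read -r fname lname email; do
    case $email in
      ?*@?*.?*) echo "$fname,$lname,$email" ;;       # looks like x@y.z
      *)        echo "$fname,$lname,$email" >&2 ;;   # anything else
    esac
  done < "$1" > "$2" 2> "$3"
}
```

Calling split_emails input.csv valid invalid (file names are placeholders) does the whole job without starting a single child process inside the loop.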

Performance Matrix

Input Records  Time  Sample 1     Sample 2    Sample 3
1,000          Real  0m9.380s     0m2.533s    0m0.085s
               User  0m2.744s     0m0.781s    0m0.045s
               Sys   0m7.960s     0m2.037s    0m0.039s
10,000         Real  1m25.238s    0m22.515s   0m0.970s
               User  0m24.663s    0m7.001s    0m0.379s
               Sys   1m11.544s    0m17.786s   0m0.299s
100,000        Real  14m42.842s   4m6.492s    0m6.527s
               User  4m8.237s     1m15.282s   0m3.667s
               Sys   12m12.653s   3m13.174s   0m2.862s
1,000,000      Real  145m58.457s  41m17.773s  1m11.483s
               User  41m15.294s   12m28.498s  0m39.565s
               Sys   121m8.872s   32m21.106s  0m30.701s

The raw output of the ‘time’ command follows, just in case I made a mistake while entering values into the table.

1,000 Input Records

Sample1

real	0m9.380s
user	0m2.744s
sys	0m7.960s

Sample2

real	0m2.533s
user	0m0.781s
sys	0m2.037s

Sample3

real	0m0.085s
user	0m0.045s
sys	0m0.039s

10,000 Input Records

Sample1

real	1m25.238s
user	0m24.663s
sys	1m11.544s

Sample2

real	0m22.515s
user	0m7.001s
sys	0m17.786s

Sample3

real	0m0.970s
user	0m0.379s
sys	0m0.299s

100,000 Input Records

Sample1

real	14m42.842s
user	4m8.237s
sys	12m12.653s

Sample2

real	4m6.492s
user	1m15.282s
sys	3m13.174s

Sample3

real	0m6.527s
user	0m3.667s
sys	0m2.862s

1,000,000 Input Records

Sample1

real	145m58.457s
user	41m15.294s
sys	121m8.872s

Sample2

real	41m17.773s
user	12m28.498s
sys	32m21.106s

Sample3

real	1m11.483s
user	0m39.565s
sys	0m30.701s

Conclusion

As you can see from the performance matrix, as the number of child processes launched per record goes down, performance improves drastically.
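
To put a number on the improvement, the real times from the 1,000,000-record run can be converted to milliseconds and divided. A minimal sketch: to_ms and speedup are hypothetical helper names, the code assumes the NmS.sssS format printed by time above (always three decimals), and the ratios are truncated to one decimal place.

```shell
# to_ms TIME: convert a value such as 145m58.457s into milliseconds
to_ms() {
  t=${1%s}           # drop the trailing "s"   -> 145m58.457
  min=${t%%m*}       # minutes                 -> 145
  rest=${t#*m}       # seconds with fraction   -> 58.457
  sec=${rest%%.*}    # whole seconds           -> 58
  frac=${rest#*.}    # milliseconds, 3 digits  -> 457
  # 1$frac - 1000 turns "085" into 85 without it being read as octal
  echo $(( (min * 60 + sec) * 1000 + 1$frac - 1000 ))
}

# speedup SLOW FAST: how many times faster FAST is than SLOW, one decimal
speedup() {
  slow=$(to_ms "$1")
  fast=$(to_ms "$2")
  tenths=$(( slow * 10 / fast ))
  echo "$(( tenths / 10 )).$(( tenths % 10 ))"
}

echo "Sample 1 vs Sample 2: $(speedup 145m58.457s 41m17.773s)x"   # 3.5x
echo "Sample 2 vs Sample 3: $(speedup 41m17.773s 1m11.483s)x"     # 34.6x
echo "Sample 1 vs Sample 3: $(speedup 145m58.457s 1m11.483s)x"    # 122.5x
```

In other words, removing the cut pipelines bought a factor of about 3.5, and removing the last per-record child process bought a further factor of about 34.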

There are situations, however, when it seems impossible to avoid running child-processes for every record (especially when third party utilities are involved, like having to update the database for every record). Rest assured, there is a way around that (read KSH co-processes)! I’ll cover that in another post sometime.

Hope this post helps you write faster running scripts!

~A Programmer



Hello world!

Posted on Jun. 13, 2010, under Uncategorized

I was thinking about what the first post here should be, and WordPress (the software running this site) helped me out. It created a default post with this title, and I realized there is no better way to start a programmer’s journal than with a “Hello world!” post! :D

So here is a quick introduction about me/this journal.

I got my first taste of programming with GWBasic on a DOS-based PC when I was 9 years old. Since then I have been fascinated with computers. Then in my first year of college I was introduced to the amazing world of C programming and have been programming since then. I have been working professionally for the last 7 years in the software/IT field with the main focus on Unix/C/KSH scripting.

Over the years, I have searched the internet for solutions to many different types of problems and have gotten a lot of help from people sharing tutorials/tips/recipes/source-code etc. However, until now I have only used that information and never given back. This journal is an effort to do exactly that… give back.

I plan to share with everyone the different kinds of problems that I face in day-to-day programming and how I solved them. The posts will range from a theoretical approach to a mid-level design to a code implementation.

Hopefully you will find some of them of use!

~A Programmer
