20170213

"The UNIX Programming Environment" by Brian Kernighan & Rob Pike

  • The UNIX programming environment is unusually rich and productive.
  • Many UNIX programs do quite trivial tasks in isolation, but, combined with other programs, become general and useful tools.
  • To use the UNIX system and its components well, you must understand not only how to use the programs, but also how they fit into the environment.
  • The file system is central to the operation and use of the system, so you must understand it to use the system well.
  • The UNIX system is full duplex: the characters you type on the keyboard are sent to the system, which sends them back to the terminal to be printed on the screen.
  • RETURN may be typed by pressing the RETURN key or, equivalently, holding down the CONTROL key and typing an ‘m’.
  • The next command to try is who, which tells you everyone who is currently logged in.
  • “tty” stands for “teletype,” an archaic synonym for “terminal.”
  • If you type the line kill character, by default an at-sign @, it causes the whole line to be discarded, just as if you’d never typed it, and starts you over on a new line.
  • The sharp character # erases the last character typed; each # erases one more character, back to the beginning of the line (but not beyond).
  • Another common choice is ctl-u for line kill.
  • If you precede either # or @ by backslash \, it loses its special meaning.
  • The backslash, sometimes called the escape character, is used extensively to indicate that the following character is in some way special.
  • Most shells interpret # as introducing a comment, and ignore all text from the # to the end of the line.
  • If you type while the system is printing, your input characters will appear intermixed with the output characters, but they will be stored away and interpreted in the correct order.
  • The proper way to log out is to type ctl-d instead of a command; this tells the shell that there is no more input.
  • To read your mail, type $ mail
  • The ctl-d signals the end of the letter by telling the mail command that there is no more input.
  • The ls command lists the names (not contents) of files.
  • Options follow the command name on the command line, and are usually made up of an initial minus sign ‘-’ and a single letter meant to suggest the meaning.
  • cat prints the contents of all the files named by its arguments.
  • Renaming a file is done by “moving” it from one name to another.
  • Beware that if you move a file to another one that already exists, the target file is replaced.
  • To make a copy of a file (that is, to have two versions of something), use the cp command.
  • The rm command removes all the files you name.
  • don’t forget that case distinctions matter.
  • The first command counts the lines, words and characters in one or more files; it is named wc after its word-counting function.
  • The definition of a “word” is very simple: any string of characters that doesn’t contain a blank, tab or newline.
  • The second command is called grep; it searches files for lines that match a pattern.
  • grep will also look for lines that don’t match the pattern, when the option -v is used.
  • The third command is sort, which sorts its input into alphabetical order line by line.
  • Another file-examining command is tail, which prints the last 10 lines of a file.
  • cmp finds the first place where two files differ.
  • The other file comparison command is diff, which reports on all lines that are changed, added or deleted.
  • Generally speaking, cmp is used when you want to be sure that two files really have the same contents.
  • diff is used when the files are expected to be somewhat different, and you want to know exactly which lines differ. diff works only on files of text.
  • Files in different directories can have the same name without any conflict.
  • Our basic tool is the command pwd (“print working directory”), which prints the name of the directory you are currently in.
  • ls lists the contents of the current directory; given the name of a directory, it lists the contents of that directory.
  • “Pathname” has an intuitive meaning: it represents the full name of the path from the root through the tree of directories to a particular file. It is a universal rule in the UNIX system that wherever you can use an ordinary filename, you can use a pathname.
  • Each file and directory has read-write-execute permissions for the owner, a group, and everyone else, which can be used to control access.
  • The command mkdir makes a new directory.
  • “..” refers to the parent of whatever directory you are currently in, the directory one level closer to the root. “.” is a synonym for the current directory.
  • rmdir will only remove an empty directory.
  • The shell takes the * to mean “any string of characters.”
  • There is a program called echo that is especially valuable for experimenting with the meaning of the shorthand characters. As you might guess, echo does nothing more than echo its arguments.
  • The * is not limited to the last position in a filename--*’s can be anywhere and can occur several times.
  • The pattern [...] matches any of the characters inside the brackets. A range of consecutive letters or digits can be abbreviated.
  • The ? pattern matches any single character.
  • Note that the patterns match only existing filenames. In particular, you cannot make up new filenames by using patterns.
  • If you should ever have to turn off the special meaning of *, ?, etc., enclose the entire argument in single quotes. You can also precede a special character with a backslash.
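The matching and quoting rules above can be sketched with echo in a throwaway directory (patterns-demo and the ch* filenames are invented for the illustration):

```shell
mkdir patterns-demo && cd patterns-demo
touch ch1 ch2 ch10 notes

echo ch*          # ch1 ch10 ch2 -- * matches any string of characters
echo ch[12]       # ch1 ch2     -- [...] matches any one bracketed character
echo ch?          # ch1 ch2     -- ? matches any single character
echo 'ch*'        # ch*         -- single quotes turn off the special meaning
echo ch\*         # ch*         -- so does a preceding backslash

cd .. && rm -r patterns-demo
```

Since patterns match only existing filenames, the quoted forms are the only ones that print the metacharacter itself.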
  • The symbol > means “put the output in the following file, rather than on the terminal.” The file will be created if it doesn’t already exist, or the previous contents overwritten if it does.
  • The symbol >> operates much as > does, except that it means “add to the end of.”
  • In a similar way, the symbol < means to take the input for a program from the following file, instead of from the terminal.
  • This is an essential property of most commands: if no filenames are specified, the standard input is processed.
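The three redirection forms can be sketched as follows; filelist is a hypothetical filename:

```shell
ls > filelist     # create filelist (or overwrite it) with the output of ls
ls >> filelist    # run ls again and add its output to the end
wc -l < filelist  # wc takes its standard input from filelist
wc -l filelist    # same count, but wc also prints the filename it was given
rm filelist
```

The last two lines show the "essential property": with no filename, wc reads the standard input, which < has attached to the file.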
  • A pipe is a way to connect the output of one program to the input of another program without any temporary file; a pipeline is a connection of two or more programs through pipes.
  • The vertical bar character | tells the shell to set up a pipeline.
  • Any program that reads from the terminal can read from a pipe instead; any program that writes on the terminal can write to a pipe.
  • The programs in a pipeline actually run at the same time, not one after another.
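A couple of pipelines in that spirit (the grep pattern here is only an example):

```shell
who | wc -l                 # count the users currently logged in
ls /bin | grep tar | wc -l  # count the files in /bin whose names contain "tar"
```

In each pipeline all the programs start together, and data flows through the pipes as it is produced, with no temporary file.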
  • You can run two programs with one command line by separating the commands with a semicolon; the shell recognizes the semicolon and breaks the line into two commands.
  • The ampersand & at the end of a command line says to the shell “start this command running, then take further commands from the terminal immediately,” that is, don’t wait for it to complete.
  • An instance of a running program is called a process.
  • If you forget the process-id, you can use the command ps to tell you about everything you have running. If you are desperate, kill 0 will kill all your processes except your login shell.
  • Processes have the same sort of hierarchical structure that files do: each process has a parent, and may well have children.
  • If you say $ nohup command & the command will continue to run if you log out.
  • If there is a file named .profile in your login directory, the shell will execute the commands in it when you log in, before printing the first prompt.
  • The prompt string, which we have been showing as $, is actually stored in a shell variable called PS1, and you can set it to anything you like.
  • Probably the most useful shell variable is the one that controls where the shell looks for commands (i.e. PATH).
  • You can obtain the value of any shell variable by prefixing its name with a $.
  • Personal variables are conventionally spelled in lower case to distinguish them from those used by the shell itself, like PATH.
  • It’s necessary to tell the shell that you intend to use the variables in other programs; this is done with the command export.
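A sketch of defining, using and exporting a variable (recipes is an invented personal variable; PATH is the shell's own):

```shell
recipes=$HOME/recipes     # create a variable by assigning to it (no spaces around =)
echo $recipes             # prefixing the name with $ extracts the value
PATH=$PATH:$recipes       # extend the list of directories searched for commands
export recipes            # now child shells and programs will see it too
```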
  • Everything in the UNIX system is a file.
  • A file is a sequence of bytes. (A byte is a small chunk of information, typically 8 bits long. For our purposes, a byte is equivalent to a character.) No structure is imposed on a file by the system, and no meaning is attached to its contents--the meaning of the bytes depends solely on the programs that interpret the file.
  • By convention borrowed from C, the character representation of a newline is \n, but this is only a convention used by programs to make it easy to read--the value stored in the file is the single byte 012.
  • Programs retrieve the data in a file by a system call (a subroutine in the kernel) called read. Each time read is called, it returns the next part of a file--the next line of text typed on the terminal, for example. read also says how many bytes of the file were returned, so end of file is assumed when a read says “zero bytes are being returned.”
  • The format of a file is determined by the programs that use it.
  • file reads the first few hundred bytes of a file and looks for clues to the file type.
  • A runnable program is marked by a binary “magic number” at its beginning.
  • All text consists of lines terminated by newline characters, and most programs understand this simple format.
  • In UNIX systems there is just one kind of file, and all that is required to access a file is its name.
  • If you design a file format, you should think carefully before choosing a non-textual representation.
  • The current directory is an attribute of a process, not a person or a program.
  • If a process creates a child process, the child inherits the current directory of its parent.
  • Whenever you embark on a new project, or whenever you have a set of related files, say a set of recipes, you could create a new directory with mkdir and put the files there.
  • The command du (disc usage) was written to tell how much disc space is consumed by the files in a directory, including all its subdirectories.
  • Despite their fundamental properties inside the kernel, directories sit in the file system as ordinary files.
  • The directory / is called the root of the file system. Every file in the system is in the root directory or one of its subdirectories, and the root is its own parent directory.
  • Every file has a set of permissions associated with it, which determine who can do what with the file.
  • We must warn you: there is a special user on every UNIX system, called the superuser, who can read or modify any file on the system. The special login name root carries superuser privileges; it is used by system administrators when they do system maintenance. There is also a command called su that grants super-user status if you know the root password.
  • In real life, most security breaches are due to passwords that are given away or easily guessed.
  • The file /etc/passwd is the password file; it contains all the login information about each user.
  • When you give your password to login, it encrypts it and compares the result against the encrypted password in /etc/passwd. If they agree, it lets you log in. The mechanism works because the encryption algorithm has the property that it’s easy to go from the clear form to the encrypted form, but very hard to go backwards.
  • The -l option of ls prints the permissions information.
  • But the ‘s’ instead of an ‘x’ in the execute field for the file owner states that, when the command is run, it’s to be given the permissions corresponding to the file owner, in this case root.
  • The setuid bit is a simple but elegant idea that solves a number of security problems.
  • Note that a program is just a file with execute permissions.
  • The -d option of ls asks it to tell you about the directory itself, rather than its contents, and the leading d in the output signifies that ‘.’ is indeed a directory.
  • The chmod (change mode) command changes permission on files.
  • The octal modes are specified by adding together a 4 for read, 2 for write, and 1 for execute permission.
  • If a directory is writable, however, people can remove files in it regardless of the permission on the files themselves.
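The octal arithmetic can be sketched with chmod on a hypothetical file named secret:

```shell
touch secret
chmod 644 secret   # rw-r--r--  owner: 4+2=6; group and others: 4 (read only)
chmod 755 secret   # rwxr-xr-x  owner: 4+2+1=7; group and others: 4+1=5
chmod u-w secret   # the symbolic form: remove the owner's write permission
ls -l secret       # the -l listing shows the resulting permission string
rm -f secret
```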
  • The modification date reflects changes to the file’s contents, not its modes. The permissions and dates are not stored in the file itself, but in a system structure called an index node, or inode.
  • All the directory hierarchy does is provide convenient names for files. The system’s internal name for a file is its i-number: the number of the inode holding the file’s information.
  • It is the i-number that is stored in the first two bytes of a directory, before the name.
  • The first two bytes in each directory entry are the only connection between the name of a file and its contents. A filename in a directory is therefore called a link, because it links a name in the directory hierarchy to the inode, and hence to the data.
  • The rm command does not actually remove inodes; it removes directory entries or links. Only when the last link to a file disappears does the system remove the inode, and hence the file itself.
  • The ln command makes a link to an existing file.
  • The purpose of a link is to give two names to the same file, often so it can appear in two different directories.
  • Two links to a file point to the same inode, and hence have the same i-number.
  • The integer printed between the permissions and the owner is the number of links to the file. Because each link just points to the inode, each link is equally important--there is no difference between the first link and subsequent ones.
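Links and i-numbers can be observed directly with ls -li; recipe and junket are invented filenames:

```shell
echo hello > recipe
ln recipe junket       # a second link to the same inode
ls -li recipe junket   # both lines show the same i-number and a link count of 2
rm recipe              # removes one link; the inode and data survive
cat junket             # prints hello: the file is still there
rm junket              # the last link goes, and with it the file itself
```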
  • cp makes copies of files.
  • It’s often a good idea to change the permission on a backup copy so it’s harder to remove it accidentally.
  • mv moves or renames files, simply by rearranging the links.
  • The more familiar you are with the layout of the file system, the more effectively you will be able to use it.
  • /tmp is cleaned up automatically when the system starts.
  • /dev contains device files.
  • The inode of a regular file contains a list of disc blocks that store the file’s contents. For a device file, the inode instead contains the internal name for the device, which consists of its type--character (c) or block (b)--and a pair of numbers, called the major and minor device numbers. The major number encodes the type of device, while the minor number distinguishes different instances of the device.
  • The program /etc/mount reports the correspondence between device files and directories.
  • The root file system has to be present for the system to execute. /bin, /dev and /etc are always kept on the root system, because when the system starts only files in the root file system are accessible, and some files such as /bin/sh are needed to run at all.
  • Because the subsystems may be mounted and dismounted, it is illegal to make a link to a file in another subsystem.
  • The df (disc free space) command reports the available space on the mounted file subsystems.
  • The tty command tells you which terminal you are using.
  • The device /dev/tty is a synonym for your login terminal, whatever terminal you are actually using.
  • Redirecting news to the file /dev/null causes its output to be thrown away.
  • Data written to /dev/null is discarded without comment, while programs that read from /dev/null get end-of-file immediately, because reads from /dev/null always return zero bytes.
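Both directions can be tried directly:

```shell
echo 'unwanted output' > /dev/null   # the data is discarded without comment
wc -c < /dev/null                    # prints 0: reads hit end of file at once
```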
  • The simplest command is a single word, usually naming a file for execution.
  • A command usually ends with a newline, but a semicolon ; is also a command terminator.
  • The precedence of | is higher than that of ‘;’ as the shell parses your command line.
  • Parentheses can be used to group commands.
  • Data flowing through a pipe can be tapped and placed in a file (but not another pipe) with the tee command, which is not part of the shell, but is nonetheless handy for manipulating pipes.
  • Another command terminator is the ampersand &. It’s exactly like the semicolon or newline, except that it tells the shell not to wait for the command to complete.
  • The command sleep waits the specified number of seconds before exiting.
  • The filename-matching characters only match filenames beginning with a period if the period is explicitly supplied in the pattern.
  • Characters like * that have special properties are known as metacharacters.
  • The easiest and best way to protect special characters from being interpreted is to enclose them in single quote characters.
  • Quotes of one kind protect quotes of the other kind.
  • Quoted strings can contain newlines.
  • The string ‘> ’ is a secondary prompt printed by the shell when it expects you to type more input to complete a command. The secondary prompt string is stored in the shell variable PS2, and can be modified to taste.
  • A backslash at the end of a line causes the line to be continued; this is the way to present a very long line to the shell.
  • The metacharacter # is almost universally used for shell comments; if a shell word begins with #, the rest of the line is ignored.
  • echo has a single option, -n, to suppress the last newline.
  • If a file is executable and if it contains text, then the shell assumes it to be a file of shell commands. Such a file is called a shell file.
  • When the shell executes a file of commands, each occurrence of a $1 is replaced by the first argument, each $2 is replaced by the second argument, and so on through $9.
  • The shell provides a shorthand $* that means “all the arguments.”
  • The argument $0 is the name of the program being executed.
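A minimal shell file showing the argument variables; cx is a two-letter name in the book's style, recreated here as a sketch:

```shell
cat > cx <<'EOF'
# cx: make the named files executable; $* stands for all the arguments
chmod +x $*
EOF
sh cx cx      # run the shell file on itself: $1 is "cx"
ls -l cx      # now shows the x permission bits
rm cx
```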
  • The value of a variable is extracted by preceding the name by a dollar sign.
  • You can create new variables by assigning them values; traditionally, variables with special meaning are spelled in upper case, so ordinary names are in lower case.
  • The shell built-in set displays the values of all your defined variables.
  • The value of a variable is associated with the shell that creates it, and is not automatically passed to the shell’s children.
  • The shell provides a command ‘.’ (dot) that executes the commands in a file in the current shell, rather than in a subshell.
  • Always export variables you want set in all your shells and subshells.
  • The standard error was invented so that error messages would always appear on the terminal.
  • Every program has three default files established when it starts, numbered by small integers called file descriptors. The standard input, 0, and the standard output, 1, which we are already familiar with, are often redirected from and into files and pipes. The last, numbered 2, is the standard error output, and normally finds its way to your terminal.
  • The construction 2>filename (no spaces are allowed between the 2 and the >) directs the standard error output into the file.
  • The notation 2>&1 tells the shell to put the standard error on the same stream as the standard output.
  • The << signals the here document construction; the word that follows is used to delimit the input, which is taken to be everything up to an occurrence of that word on a line by itself.
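A sketch of the three constructions (out, errlog and both are invented filenames; the inner sh -c command exists only to produce some output on each stream):

```shell
sh -c 'echo result; echo oops 1>&2' > out  2> errlog  # stdout and stderr split
sh -c 'echo result; echo oops 1>&2' > both 2>&1       # both streams in one file

tr a-z A-Z <<End
a here document: the input is right here, up to the word End
End

rm out errlog both
```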
  • The shell is actually a programming language.
  • If you’ve done something twice, you’re likely to do it again.
  • “grep pattern filenames ...” searches the named files or the standard input and prints each line that contains an instance of the pattern.
  • Regular expressions are specified by giving special meaning to certain characters, just like the *, etc., used by the shell.
  • The metacharacters ^ and $ “anchor” the pattern to the beginning (^) or end ($) of the line.
  • Regular expression metacharacters overlap with shell metacharacters, so it’s always a good idea to enclose grep patterns in single quotes.
  • grep supports character classes much like those in the shell, so [a-z] matches any lower case letter. But there are differences; if a grep character class begins with a circumflex ^, the pattern matches any character except those in the class.
  • A period ‘.’ is equivalent to the shell’s ?: it matches any character.
  • The closure operator * applies to the previous character or metacharacter (including a character class) in the expression, and collectively they match any number of successive matches of the character or metacharacter.
  • No grep regular expression matches a newline; the expressions are applied to each line individually.
  • fgrep searches for many literal strings simultaneously, while egrep interprets true regular expressions--the same as grep, but with an “or” operator and parentheses to group expressions.
  • Parentheses can be used to group so (xy)* matches any of the empty string, xy, xyxy, and so on.
  • The vertical bar | is an “or” operator.
  • There are two other closure operators in egrep, + and ?. The pattern x+ matches one or more x’s, and x? matches zero or one x, but no more.
  • fgrep interprets no metacharacters, but can look efficiently for thousands of words in parallel.
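A few of the pattern forms side by side, on an invented test file named words:

```shell
cat > words <<'EOF'
apple
apply
banana
EOF
grep '^app' words            # anchored: lines beginning with app
grep 'appl[ey]' words        # character class: matches apple or apply
grep -v banana words         # the lines that do NOT match
egrep 'apple|banana' words   # egrep's "or" operator
rm words
```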
  • sort sorts its input by line in ASCII order.
  • The command uniq is the inspiration for the -u flag of sort: it discards all but one of each group of adjacent duplicate lines.
  • The comm command is a file comparison program.
  • The tr command transliterates the characters in its input.
  • In practice, dd is often used to deal with raw, unformatted data, whatever the source; it encapsulates a set of facilities for dealing with binary data.
  • The basic idea of sed is simple: “sed ‘list of ed commands’ filenames...” reads lines one at a time from the input files; it applies the commands from the list, in order, to each line and writes its edited form on the standard output.
  • sed does not alter the contents of its input files. It writes on the standard output, so the original files are not changed.
  • Although automatic printing is usually convenient, sometimes it gets in the way. It can be turned off by the -n option; in that case, only lines explicitly printed with a p command appear in the output.
  • Like sed, awk does not alter its input files.
  • awk splits each input line automatically into fields, that is, strings of non-blank characters separated by blanks or tabs.
  • awk calls the fields $1, $2, ..., $NF, where NF is a variable whose value is set to the number of fields.
  • awk normally assumes that white space (any number of blanks and tabs) separates fields, but the separator can be changed to any single character.
  • The field $0 is the entire input line, unchanged. In a print statement, items separated by commas are printed separated by the output field separator, which is by default a blank.
  • awk uses the same comment convention as the shell does: a # marks the beginning of a comment.
  • awk provides two special patterns, BEGIN and END. BEGIN actions are performed before the first input line has been read.
  • END actions are done after the last line of input has been processed.
  • awk’s real strength lies in its ability to do calculations on the input data as well.
  • Variables are initialized to zero by default so you usually don’t have to worry about initialization.
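A sketch combining fields, BEGIN, END and an accumulating variable (the sample data is invented):

```shell
printf 'alice 3\nbob 5\ncarol 2\n' |
awk '
BEGIN { print "name  n" }        # done before any input is read
      { total = total + $2 }     # total starts at zero automatically
END   { print "total", total }   # done after the last input line
'
```

This prints the heading, then "total 10".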
  • tail takes advantage of a file system operation called seeking, to advance to the end of a file without reading the intervening data.
  • A major theme in shell programming is therefore making programs robust so they can handle improper input and give helpful information when things go wrong.
  • The shell variable $# holds the number of arguments that a shell file was called with.
  • Every command returns an exit status--a value returned to the shell to indicate what happened. The exit status is a small integer; by convention, 0 means “true” (the command ran successfully) and non-zero means “false” (the command ran unsuccessfully).
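Since the shell's if statement tests exactly this status, the convention can be sketched as:

```shell
if grep root /etc/passwd > /dev/null
then
    echo 'exit status 0: the pattern was found'
else
    echo 'non-zero exit status: no match (or an error)'
fi
```

The command's output is discarded here; only its exit status matters to if.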
  • There are three loops: for, while, and until. The for is by far the most commonly used.
  • The conditional command that controls a while or until can be any command.
  • To distinguish our files from those belonging to other processes, the shell variable $$ (the process id of the shell command), is incorporated into the filenames; this is a common convention.
  • “:” is a shell builtin command that does nothing but evaluate its arguments and return true.
  • ${var} is equivalent to $var, and can be used to avoid problems with variables inside strings containing letters or numbers.
  • If the variable is undefined, and the name is followed by a question mark, then the string after the ? is printed and the shell exits (unless it’s interactive).
  • The shell protects programs run with & from interrupts but not from hangups.
  • The shell built-in command trap sets up a sequence of commands to be executed when a signal occurs. “trap sequence-of-commands list of signal numbers” The sequence-of-commands is a single argument, so it must almost always be quoted. The signal numbers are small integers that identify the signal.
  • The command sequence that forms the first argument to trap is like a subroutine call that occurs immediately when the signal happens.
  • The signal 9 is one that can’t be caught or ignored: it always kills.
  • The shell built-in command shift moves the entire argument list one position to the left: $2 becomes $1, $3 becomes $2, etc. “$@” provides all the arguments (after the shift), like $*, but uninterpreted.
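A sketch of trap in the style the book uses for temporary files (demo is an invented name):

```shell
new=/tmp/demo.$$                   # $$ keeps the filename unique to this process
trap 'rm -f $new; exit 1' 1 2 15   # on hangup, interrupt or terminate: clean up
echo scratch > $new
# ... the real work of the script would go here ...
rm -f $new
```

Note that the whole command sequence is one quoted argument to trap.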
  • The kill command only terminates processes specified by a process-id.
  • The shell variable IFS (internal field separator) is a string of characters that separates words in argument lists such as backquotes and for statements.
  • echo -n suppresses the final newline.
  • The string “$@” is treated specially by the shell, and converted into exactly the arguments to the shell file.
  • If $@ is not quoted, it is identical to $*; the behavior is special only when it is enclosed in double quotes.
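The difference shows up as soon as an argument contains a blank; show-args is an invented helper:

```shell
cat > show-args <<'EOF'
for i in "$@"           # "$@" preserves each original argument intact
do
    echo "$i"
done
EOF
sh show-args 'one word' two   # prints two lines: "one word", then "two"
rm show-args
```

With unquoted $* instead, the loop would see three arguments: one, word, and two.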
  • touch changes the last-modified time of its argument file to the present time, without actually modifying the file.
  • Always maintaining backup copies makes it safer to try out ideas: if something doesn’t work out, it’s painless to revert to the original program.
  • The simplest input and output routines are called getchar and putchar. Each call to getchar gets the next character from the standard input, which may be a file or pipe or the terminal (the default)--the program doesn't know which. Similarly, putchar(c) puts the character c on the standard output, which is also by default the terminal.
  • The return value from main is the program’s exit status.
  • When a C program is executed, the command-line arguments are made available to the function main as a count argc and an array argv of pointers to character strings that contain the arguments.
  • The function strcmp compares two strings, returning zero if they are the same.
  • Before it can be read or written a file must be opened by the standard library function fopen.
  • When a program is started, three files are open already, and file pointers are provided for them. These files are the standard input, the standard output, and the standard error output; the corresponding file pointers are called stdin, stdout, and stderr.
  • Since there is a limit (about 20) on the number of files that a program may have open simultaneously, it’s best to free files when they are no longer needed.
  • Output written on stderr appears on the user’s terminal even if the standard output is redirected.
  • The routine efopen encapsulates a very common operation: try to open a file; if it’s not possible, print an error message and exit.
  • The function atoi converts a character string to an integer.
  • The main principle is that a program should only do one basic job--if it does too many things, it gets bigger, slower, harder to maintain, and harder to use.
  • There’s no good solution to writing bug-free code except to take care to produce a clean, simple design, to implement it carefully, and to keep it clean as you modify it.
  • “Memory fault” means that your program tried to reference an area of memory that it was not allowed to. It usually means that a pointer points somewhere wild.
  • “Core dumped” means that the kernel saved the state of your executing program in a file called core in the current directory.
  • lint examines C programs for potential errors, portability problems, and dubious constructions.
  • It’s always worth running lint after a long stretch of editing, making sure that you understand each warning that it gives.
  • The function popen is analogous to fopen, except that the first argument is a command instead of a filename.
  • The function mktemp creates a file whose name is guaranteed to be different from any existing file.
  • The routine getenv(“var”) searches the environment for the shell variable var and returns its value as a string of characters, or NULL if the variable is not defined.
  • The lowest level of I/O is a direct entry into the operating system.
  • All input and output is done by reading or writing files, because all peripheral devices, even your terminal, are files in the file system.
  • Whenever I/O is to be done on the file, the file descriptor is used instead of the name to identify the file.
  • All input and output is done by two system calls, read and write, which are accessed from C by functions of the same name.
  • A return value of zero implies end of file, and -1 indicates an error of some sort.
  • The function sleep causes the program to be suspended for the specified number of seconds; it is described in sleep(3).
  • open is rather like fopen, except that instead of returning a file pointer, it returns a file descriptor, which is an int.
  • It is an error to try to open a file that does not exist. The system call creat is provided to create new files, or to rewrite old ones.
  • The system call close breaks the connection between a filename and a file descriptor, freeing the file descriptor for use with some other file.
  • Sometimes it is nice to know what specific error occurred; for this purpose all system calls, when appropriate, leave an error number in an external integer called errno.
  • There is also an array of character strings sys_errlist indexed by errno that translates the numbers into a meaningful string.
  • The system call lseek provides a way to move around in a file without actually reading or writing.
  • A directory is a file containing a list of filenames and an indication of where they are located. The “location” is actually an index into another table called the inode table. The inode for a file is where all information about the file except its name is kept.
  • Part of the inode is described by a structure called stat, defined in &lt;sys/stat.h&gt;.
  • stat takes a filename and returns inode information for that file (or -1 if there is an error).
  • The function getlogin returns your login name, or NULL if it can’t.
  • system takes one argument, a command line exactly as typed at the terminal (except for the newline at the end) and executes it in a sub-shell.
  • The execlp call overlays the existing program with the new one, runs that, then exits.
  • A variant of execlp called execvp is useful when you don’t know in advance how many arguments there are going to be.
  • The splitting is done by a system call named fork: proc_id = fork(); which splits the program into two copies, both of which continue to run.
  • The fork makes two copies of the program.
  • Often, the parent waits for the child to terminate before continuing itself. This is done with the system call wait.
  • The system call signal alters the default action. It has two arguments. The first is a number that specifies the signal. The second is either the address of a function, or a code which requests that the signal be ignored or be given the default action.
  • #include &lt;signal.h&gt;
    signal(SIGINT, SIG_IGN);
    causes interrupts to be ignored, while
    signal(SIGINT, SIG_DFL);
    restores the default action of process termination.
  • The file &lt;setjmp.h&gt; declares the type jmp_buf as an object in which the stack position can be saved.
  • Signals are sent to all your processes.
  • Ever since Backus-Naur Form was developed for Algol, languages have been described by formal grammars.
  • yacc is a parser generator, that is, a program for converting a grammatical specification of a language into a parser that will parse statements in the language.
  • A lexical chunk is traditionally called a token.
  • yacc processes the grammar and the semantic actions into a parsing function, named yyparse, and writes it out as a file of C code.
  • The entry to the lexical analyzer must be named yylex, since that is the function that yyparse calls each time it wants another token. (All names used by yacc start with y.)
  • make is most useful when the program being created is large enough to be spread over several source files.
  • The header file &lt;math.h&gt; contains type declarations for the standard mathematical functions. &lt;errno.h&gt; contains names for the errors that can be incurred.
  • make can deduce what recompilation is needed after changes are made to any of the files involved.
  • The program lex creates lexical analyzers in a manner analogous to the way that yacc creates parsers: you write a specification of the lexical rules of your language, using regular expressions and fragments of C to be executed when a matching string is found.
  • The simple computer is a stack machine: when an operand is encountered, it is pushed onto a stack (more precisely, code is generated to push it onto a stack); most operators operate on items on the top of the stack.
  • Stack machines usually result in simple interpreters.
  • Performance evaluation functions: The first is computing Ackermann’s function ack(3,3). This is a good test of the function-call mechanism; it requires 2432 calls, some nested quite deeply.
  • The second test is computing the Fibonacci numbers with values less than 100 a total of one hundred times; this involves mostly arithmetic with an occasional function call.
  • It is possible to instrument a C program to determine how much time each function uses. The program must be recompiled with profiling turned on, by adding the option -p to each C compilation and load.
  • The main documentation for a command is usually the manual page--a one-page description in the UNIX Programmer’s Manual.
  • The UNIX philosophy:
    • First, let the machine do the work. Use programs like grep and wc and awk to mechanize tasks that you might do by hand on other systems.
    • Second, let other people do the work. Use programs that already exist as building blocks in your programs, with the shell and the programmable filters to glue them together.
    • Third, do the job in stages. Build the simplest thing that will be useful, and let your experience with that determine what (if anything) is worth doing next.
    • Fourth, build tools. Write programs that mesh with the existing environment, enhancing it rather than merely adding to it.
  • ed edits one file at a time.
  • Regular expressions involving * choose the leftmost match and make it as long as possible.
