Pages

20170212

"Expert C Programming" by Peter Van Der Linden

  • C programming is a craft that takes years to perfect.
  • Some programmers have developed the habit of writing the literal first, like this: if (3==i). Then, if an equal sign is accidentally left out, the compiler will complain about an “attempted assignment to literal.”
  • Note that ctime has nothing to do with the language C, it just means “convert time.”
  • The function ctime() converts its arguments into local time, which will vary from Coordinated Universal Time (also known as Greenwich Mean Time), depending on where you are on the globe.
  • We should really use the gmtime() function to obtain the largest UTC time value. This function doesn’t return a printable string, so we call asctime() to get this.
  • New Zealand, because of its easternmost position with respect to time zones, holds the unhappy distinction of being the first country to encounter bugs triggered by particular dates.
  • Performance is almost everything in a compiler.
  • Compiler performance has two aspects: runtime performance (how fast the code runs) and compile time performance (how long it takes to generate code).
  • Compiler-writers start with zero because we’re used to thinking in terms of offsets.
  • It simplifies things to treat arrays as pointers.
  • You get better code if the compiler does the work of allocating registers for individual uses of a variable, rather than reserving them for its entire lifetime at declaration.
  • The functionality left out of the C compiler had to show up somewhere; in C’s case it appears at runtime, either in application code or in the runtime library.
  • Macro use is best confined to naming literal constants, and providing shorthand for a few well-chosen constructs. Define the macro name all in capitals so that, in use, it’s instantly clear it’s not a function call. Shun any use of the C preprocessor that modifies the underlying language so that it’s no longer C.
  • No formatter should ever change anything except the white space in a program.
  • C is one of the most successful programming languages of the last two decades, perhaps the most successful.
  • All successful programming languages are eventually standardized. However, the problem with standards manuals is that they only make sense if you already know what they mean.
  • Anyone learning or using C should be working with ANSI C, not K&R C.
  • Program portability is valuable, so you should always put the necessary casts, return values, and so on in your real-world code.
  • Just because it’s written down in an international standard doesn’t mean that it’s complete, consistent, or even correct.
  • Prototypes make it easy for a compiler to check function use with definition.
  • Prototypes are an extension of function declarations so that not just the name and return type are known, but also all the parameter types, allowing the compiler to check for consistency between parameter use and declaration.
  • Don’t omit the parameter names. Although the compiler doesn’t check these, they often convey extra semantic information to the programmer.
  • The keyword const doesn’t turn a variable into a constant! A symbol with the const qualifier merely means that the symbol cannot be used for assignment. This makes the value read-only through that symbol; it does not prevent the value from being modified through some other means internal (or even external) to the program.
  • Avoid unnecessary complexity by minimizing your use of unsigned types. Specifically, don’t use an unsigned type to represent a quantity just because it will never be negative.
  • Only use unsigned types for bit fields or binary masks.
  • gcc is a robust, aggressive optimizing compiler, available for many hardware platforms and sometimes better than the manufacturer’s compiler.
  • Any time you encounter the string malloc(strlen(str)); it is almost always sure to be an error, when malloc(strlen(str)+1); was meant. This is because almost all the other string-handling routines include the room needed for the trailing nul terminator, so people get used to not making the special provision for it that strlen needs.
  • Reviewing areas for improvement is one of the factors that gradually improves the science of software engineering and the art of programming language design. That’s why C++ is so disappointing: it does nothing to address some of the most fundamental problems in C, and its most important addition (classes) builds on the deficient C type model.
  • Memorize this little rhyme to recall the correct terminology for pointer and ASCII zero:
    • The one “l” NUL ends an ASCII string,
    • The two “l” NULL points to no thing.
  • The ASCII character with the bit pattern of zero is termed a “NUL.” The special pointer value that means the pointer points nowhere is “NULL.” The two terms are not interchangeable in meaning.
  • Runtime error checking is almost unknown in C--checking for dereferencing an invalid pointer is about the only case, and even that limited case can’t be fully done under MS-DOS.
  • All virtual memory architectures will fault a process that dereferences a pointer outside its address space as soon as this happens.
  • Runtime checking goes against the C philosophy that the programmer knows what he or she is doing and is always right.
  • A conformant C compiler must permit at least 257 case labels for a switch statement (ANSI C Standard, section 5.2.4.1). This is to allow a switch on an 8-bit character (256 possible values, plus EOF).
  • It is always the case in C that where you have some statements opening a block you can always add some declarations in between, like this:
    {
        declarations
        statements
    }
  • You might use this if allocating memory was expensive, and hence avoided if possible.
  • Another use is to declare some variable whose use is really localized to this block.
  • C++ takes this a step further still, and allows arbitrary intermingling of statements and declarations, and even embedding declarations in the middle of “for” statements.
  • The fact that all the cases are optional, and any form of statement, including labelled statements, is permitted, means that some errors can’t be detected even by lint.
  • Perhaps the biggest defect in the switch statement is that cases don’t break automatically after the actions for a case label. Once a case statement is executed, the flow of control continues down, executing all the following cases until a break statement is reached.
  • This is known as “fall through” and was intended to allow common end processing to be done, after some case-specific preparation had occurred. In practice it’s a severe misfeature, as almost all case actions end with a break;.
  • Default fall through is wrong 97% of the time.
  • We concluded that default fall through on switches is a design defect in C. The overwhelming majority of the time you don’t want to do it and have to write extra code to defeat it.
  • One new feature introduced with ANSI C is the convention that adjacent string literals are concatenated into one string. This replaces the old way of constructing multi-line messages using escaped newlines, and starting each continuation string in column one.
  • The automatic concatenation means that a missing comma in an initialization list of string literals no longer causes a diagnostic message. A missing comma now results in a silent marriage of adjacent strings.
  • An automated program can output a comma or no comma by having a statically declared character initialized to space and then set to comma.
  • Whenever you define a C function, its name is globally visible by default. You can prefix the function name with the redundant extern keyword or leave it off, and the effect is the same. The function is visible to anything that links with that object file. If you want to restrict access to the function, you are obliged to specify the static keyword.
  • With the benefit of practical experience, default global visibility has been conclusively and repeatedly demonstrated to be a mistake. Software objects should have the most limited scope by default. Programmers should explicitly take action when they intent to give something global scope.
  • One problem is that C is so terse. Just adding, changing, or omitting a single character often gives you a program that is still valid but does something entirely different. Worse than that, many symbols are “overloaded”--given different meanings when used in different contexts. Even some keywords are overloaded with several meanings, which is the main reason that C scope rules are not intuitively clear to programmers.
  • When sizeof’s operand is a type it has to be enclosed in parentheses, which makes people wrongly believe it is a function call, but for a variable this is not required.
  • The more work you make one symbol do, the harder it is for the compiler to detect anomalies in our use of it.
  • You should always put parentheses around an expression that mixes Booleans, arithmetic, or bit-twiddling with anything else.
  • Some authorities recommend that there are only two precedence levels to remember in C: multiplication and division come before addition and subtraction. Everything else should be in parentheses. We think that’s excellent advice.
  • Every operator has a level of precedence and a “left” or “right” associativity assigned to it. The precedence indicates how “tightly” the operands in an unbracketed expression bind.
  • All assignment-operators have right associativity.
  • The only use of associativity is to disambiguate an expression of two or more equal-precedence operators.
  • If you need to take associativity into account to figure out the value of an expression, it’s usually better to rewrite the expression into two expression, or to use parentheses.
  • The nominal task of gets() is to read in a string from a stream. The caller tells it where to put the incoming characters. But gets() does not check the buffer space; in fact, it can’t check the buffer space.
  • A knowledgeable malefactor can amend the return address in the procedure activation records on the stack by stashing the right binary patterns in the argument string This will divert the flow of execution not back to where it came from, but to a special instruction sequence (also carefully deposited on the stack) that calls execv() to replace the running image with a shell.
  • The fgets() routine sets a limit on the number of characters read, so it won’t exceed the size of the buffer.
  • Many people are surprised to learn that ANSI C mandates the argc, argv convention of passing arguments to a C program, but it does.
  • The backslash character can be used to “escape” several characters, including a newline. An escaped newline is treated as one logical line and this can be used to continue long strings. A problem arises if you inadvertently slip a space or two in between the backslash and the carriage return, as \ whitespace newline is different than \newline.
  • The ANSI standard specifies a convention that has come to be known as the maximal munch strategy. Maximal munch says that if there's more than one possibility for the next token, the compiler will prefer to bite off the one involving the longest sequence of characters.
  • As in BCPL, C++ comments are introduced by // and go to the end of a line.
  • The C++ language allows the C notation for comments, too.
  • Automatic variables go away once the flow of control leaves the scope in which they are declared. That means that even if you return a pointer to such a variable, as here, there’s no telling what it points to once the function is exited.
  • In C, automatic variables are allocated on the stack.
  • Memory management works best if you can writes the “free” at the same time as you write the “malloc.”
  • It takes discipline to ensure that code is kept lint clean, and it would save much trouble if the lint warnings were automatically generated by the compiler.
  • Back in the early days of C on UNIX, an explicit decision was made to extract full semantic checking for the compiler. This error checking was instead done by a stand-alone program known as “lint.” By omitting comprehensive error-checking, the compiler could be made smaller, faster, and simpler.
  • Lint is your software conscience. It tells you when you are doing bad things. Always use lint. Listen to your conscience.
  • The value is not just in removing the existing bugs, but in preventing new bugs from contaminating the source base.
  • The economics of software is such that the earlier in the development cycle a bug is found, the cheaper it is to fix. So it is a good investment to have lint (or preferably the compiler itself) do the extra work to find problems rather than the debugger; but better a debugger find the problems than an internal test group. The worst option of all is to leave the problems to be found by customers.
  • Even if you could make your programming language 100% reliable, you would still be prey to catastrophic bugs in the algorithm.
  • C’s declaration syntax is trivial for a compiler (or compiler-writer) to process, but hard for the average programmer.
  • Ambiguity is a very undesirable property of a programming language grammar, as it significantly complicates the job of a compiler-writer.
  • An important building block is a declarator--the heart of any declaration; roughly, a declarator is the identifier and any pointers, function brackets, or array indications that go along with it.
  • A declaration gives the basic underlying type of the variable and any initial value.
  • Structs are just a bunch of data items grouped together. Other languages call this a “record.”
  • We can follow a struct definition by some variable names, declaring variables of this struct type.
  • The only other point to watch that we can write an optional “structure tag” after the keyword “struct.”
  • Structs can also have bitfields, unnamed fields, and word-aligned fields. These are obtained by following the field declaration with a colon and a number representing the field length in bits.
  • Our preference is not to mix a struct declaration with definitions of variables.
  • We should be much more concerned with how easy the code is to read, not to write. We write code once, but it is read many times during subsequent program maintenance. It’s just a little simpler to read a line that only does one thing. For this reason, variable declarations should be separate from the type declaration.
  • Parameters are passed in registers (for speed) where possible.
  • By putting an array inside a struct like this:
        struct s_tag { int a[100]; };
    you can now treat the array as a first-class type. You can copy the entire array with an assignment statement, pass it to a function by value, and make it the return type of a function.
  • Unions are known as the variant part of variant records in many other languages.
  • Instead of each member being stored after the end of the previous one, all the members have an offset of zero. The storage for the individual members is thus overlaid: only one member at a time can be stored there.
  • Unions usually occur as part of a larger struct that also has implicit or explicit information about which type of data is actually present. There’s an obvious type insecurity here of storing data as one type and retrieving it as another.
  • Unions are typically used to save space, by not storing all possibilities for certain data items that cannot occur together.
  • Unions can also be used, not for one interpretation of two different pieces of data, but to get two different interpretations of the same data.
  • Enums (enumerated types) are simply a way of associating a series of names with a series of integer values. In a weakly typed language like C, they provide very little that can’t be done with a #define, so they were omitted from most early implementations of K&R C.
  • The integer values start at zero by default. If you assign a value in the list, the next value is one greater, and so on. There is one advantage to enums: unlike #defined names which are typically discarded during compilation, enum names usually persist through to the debugger, and can be used while debugging your code.
  • Just because is says constant, it doesn’t necessarily mean constant.
  • Typedefs are a funny kind of declaration: they introduce a new name for a type rather than reserving space for a variable.
  • Don’t put several declarators together in one typedef. And never, ever, bury the typedef in the middle of a declaration.
  • Typedef creates aliases for data types rather than new data types.
  • You can extend a macro typename with other type specifiers, but not a typedef’d typename.
  • A typedef’d name provides the type for every declarator in a declaration.
  • Everything within a namespace must be unique, but an identical name can be applied to things in different namespaces.
  • This was not true for very old compilers, and is one reason people prefixed field names with a unique initial in the BSD 4.2 kernel code.
  • If you use the same identifier for the type and the tag in a typedef, it has the effect of making the keyword “struct” optional, which provides completely the wrong mental model for what is going on.
  • Don’t bother with typedefs for structs. All they do is save you from writing the word “struct”, which is a clue that you probably shouldn’t be hiding anyway.
  • Always use a tag in a structure definition, even if it’s not needed. It will be later.
  • A pretty good principle in computer science, when you have two different things, is to use two different names to refer to them. It reduces the opportunities for confusion (always a good policy in software). If you’re struck for a name for a structure tag, just give it a name that ends in “_tag.”
  • One of the problems with the strcmp() routine to compare two strings is that it returns zero if the strings are identical. This leads to convoluted code when the comparison is part of a conditional statement.
  • Recall that objects in C must have exactly one definition, and they may have multiple external declarations.
  • A definition is the special kind of declaration that creates an object; a declaration indicates a name that allows you to refer to an object created here or elsewhere.
  • The declaration of an external object tells the compiler the type and name of the object, and that memory allocation is done somewhere else.
  • The symbol appearing on the left of an assignment is sometimes called an l-value (for “left-hand-side” or “locator” value), while a symbol on the right of an assignment is sometimes called an r-value (for “right-hand-side”).
  • The problem of the external declaration of a pointer not matching the definition of an array is simple to fix--change the declaration so it does match the definition.
  • Both arrays and pointers can be initialized with a literal string in their definition.
  • A string literal created by a pointer initialization is defined as read-only in ANSI C.
  • In contrast to a pointer, an array initialized by a literal string is writable. The individual characters can later be changed.
  • Professional C programmers have to be proficient with the use of malloc() and pointers to anonymous memory().
  • The compiler creates an output file containing relocatable objects. These objects are the data and machine instructions corresponding to the source programs.
  • Most compilers are not one giant program. They usually consist of up to half-a-dozen smaller programs, invoked by a control program called a “compiler driver.” Some pieces that can be conveniently split out into individual programs are: the preprocess, the syntactic and semantic checker, the code generator, the assembler, the optimizer, the linker, and, of course, a driver program to inoe all these pieces and pass the right options to each. An optimizer can be added after almost any of these phases.
  • An object file isn’t directly executable; it needs to be fed into a linker first. The linker identifies the main routine as the initial entry point (place to start executing), binds symbolic references to memory addresses, unites all the object files, and joins them with the libraries to produce an executable.
  • Dynamic linking allows a system to provide a big collection of libraries with many useful services, but the program will look for these at runtime rather than having to library binaries bound in as part of the executable.
  • If a copy of the libraries is physically part of the executable, then we say the executable has been statically linked; if the executable merely contains filenames that enable the loader to find the program’s library references at runtime, then we say it has been dynamically linked.
  • A major purpose of dynamic linking is to decouple programs from the particular library version they use.
  • Because this is an interface between application programs and the services provided by library binary executables, we call it an Application Binary Interface or ABI.
  • A dynamically linked executable is smaller than its statically linked counterpart.
  • All executables dynamically linked to a particular library share a single copy of the library at runtime.
  • Dynamic linking is “just-in-time” linking. It does mean that programs need to be able to dine their libraries at runtime. The linker accomplished this by putting library filenames or pathnames into the executable; and this in turn, means that libraries cannot be moved completely arbitrarily.
  • Anyone can create a static or dynamic library. You simple compile some code, without a main routine, and process the resulting .o files with the correct utility--”ar” for static libraries, or “ld” for dynamic libraries.
  • Dynamic linking is now the default on computers running System V release 4 UNIX, and it should always be used.
  • The best policy is to avoid problems by making sure that all applications are dynamically linked.
  • Static libraries are known as archives and they are created and updated by the ar--for archive--utility.
  • Convention dictates that static libraries have a “.a” extension on their filename.
  • A dynamically linked library is created by the link editor, ld. The conventional file extension for a dynamic library is “.so” meaning “shared object”--every program linked against this library shares the same one copy, in contrast to static linking, in which everyone is (wastefully) given their own copy of the contents of the library. In its simplest form, a dynamic library can be created by using the -G option to cc.
  • Position-independent code means that the generated code makes sure that every global data access is done through an extra indirection. This makes it easy to relocate the data simply by changing one value in the table of global offsets.
  • A rule of thumb is to always use PICode for libraries. Position-independent code is especially useful for shared libraries because each process that uses a shared library will generally map it at a different virtual address (though sharing one physical copy).
  • A related turn is “pure code.” A pure executable is one that contains only code (no static or initialized data). It is “pure” in the sense that it doesn’t have to be modified to be executed by any specific process. It references its data off the stack of from another (impure) segment. A pure code segment can be shared. If you are generating PIcode (indicating sharing) you usually want it to be pure, too.
  • Here are the essential UNIX linking facts of life:
    • Dynamic libraries are called libsomething.so, and static libraries are called libsomething.a.
      • By convention, all dynamic libraries have a filename of the form libname.so (version numbers may be appened to the name).
    • You tell the compiler to link with, for example, libthread.so by giving the option -lthread.
      • The command line argument to the C compiler doesn’t mention the entire pathname to the library file.
      • The compiler is told to link against a library with the command line option -lname where the library is called libname.so
    • The compiler expects to find the libraries in certain directories.
      • The compiler option -Lpathname is used to tell the linker a list of other directories in which to search for libraries that have been specified with the -l options.
      • Using environment variables is not officially frowned on, for reasons of security, performance, and build/execute independance. Use the -Lpathname -Rpathname options at link-time instead.
    • Identify your libraries by looking at the header files you have used.
      • The man-pages show the exact argument types each routine expects, and should mention the library it’s in.
    • Symbols from static libraries are extracted in a more restricted way than symbols from dynamic libraries.
      • With dynamic libraries, all the library symbols go into the virtual address space of the output file, and all the symbols are available to all the other files in the link. In contrast, static linking only looks through the archive for the undefined symbols presently known to the loader at the time the archive is processed.
  • Each header file that you include potentially represents a library against which you must link. This top carries over into C++, too. A big problem of name inconsistency shows up here. Header files usually do not have a name that looks anything like the name of the corresponding library.
  • Another inconsistency is that a single library may contain routines that satisfy the prototypes declared in multiple header files.
  • If you’re in doubt, use the nm utility to list the routines that a library contains.
  • The order of the statically linked libraries on the compiler command line is significant.
  • Another problem occurs if you mention the static libraries before your own code. There won’t be any undefined symbols yet, so nothing will be extracted.
  • Always put the -l library options at the rightmost end of your compilation command line.
  • Interpositioning (some people call it “interposing”) is the practice of supplanting a library function by a user-written function of the same name. It enables a library function to be replaced in a particular program, usually for debugging or performance reasons.
  • Not only are all the calls that you make to the library routine replaced by calls to your version, but all calls from system routines now reference your routine instead.
  • Over the years we have seen no convincing examples where interpositioning was essential but the effect could not be obtained in a different (perhaps less convenient) manner.
  • Don’t make any symbols in your programs global, unless they are meant to be part of your interface!
  • If an identifier is reserved, it means that the user is not allowed to redefine it. However, this is not a constraint, so it does not require an error message when it sees it happen. It just causes unportable, undefined behavior.
  • Use the ldd command to list the dynamic dependencies of an executable. This command will tell you the libraries that a dynamically linked program needs.
  • Turing proposed that a human interrogator converse (via teletype, to avoid sight and sound clues) with another person and with a computer. If the human interrogator was unable to correctly identify which was which after a period of five minutes, then the computer would be said to have exhibited artificial intelligence. This scenario has come to be called the Turing Test.
  • The above program’s inability to directly answer a straightforward question is a dead giveaway to a computer scientist, and highlights the central weakness in the Turing test: simply exchanging semi-appropriate phrases doesn’t indicate though-we have to look at the content of what is communicated.
  • The Turing test has repeatedly been shown to be inadequate.
  • It’s common programming technique to label or tag important data with a unique number identifying what it is. The labelling number is often termed a “magic” number; it confers the mysterious power of being able to identify a collection of random bits.
  • Under SVr4 executables are marked by the first byte of a file containing hex 7F followed by the letters “ELF” at bytes 2,3 and 4 of the file.
  • A segment on UNIX is a section of related stuff in a binary.
  • A segment in the Intel x86 memory model is the result of a design in which (for compatibility reasons) the address space is not uniform, but is divided into 64-Kbyte ranges known as segments.
  • When you run size on an executable, it tells you the size of three segments known as text, data, and bss in the file.
  • Another way to examine the contents of an executable file is to use the n or dump utilities.
  • The BSS segment gets its name from abbreviating “Block Started by Symbol.”
  • Since the BSS segment only holds variables that don’t have any value yet, it doesn’t actually need to store the image of these variables. The size that BSS will require at runtime is recorded in the object file, but BSS (unlike the data segment) doesn’t take up any actual space in the object file.
  • The text segment contains the program instructions.
  • The data segment contains the initialized global and static variables, complete with their assigned values. The size of the BSS segment is then obtained from the executable, and the loader obtains a block of this size, putting it right after the data segment.
  • The data segment is typically the largest segment in any process.
  • The stack segment contains a single data structure, the stack. A classic computer science object, the stack is a dynamic area of memory that implements a last-in-first-out queue.
  • A push operation makes the stack grow larger, and a pop removes a value from it.
  • The runtime maintains a pointer, often in a register and usually called sp, that indicates the current top of the stack.
  • The stack provide the storage area for local variables declared inside functions.
  • The stack stores the “housekeeping” information involved when a function call is made. This housekeeping information is known as a stack frame, or more generally, a procedure activation record.
  • The stack also works as a scratch-pad area--every time the program needs some temporary storage, perhaps to evaluate a length arithmetic expression, it can push partial results onto the stack, popping them when needed.
  • A stack would not be needed for recursive calls. If not for these, a fixed amount of space for local variables, parameters, and return addresses would be known at compile time, and could be allocated in the BSS.
  • Although we talk about the top of the stack, the stack grows downwards on most processors, towards memory addresses with lower values.
  • The procedure activation record is a data structure that supports an invocation of a procedure, and also records everything needed to get back to where you came from before the call.
  • The runtime maintains a pointer, often in a register and usually called fp, which indicates the active stack frame. This will be the stack frame nearest to or at the top of the stack.
  • Pointers that have lost their validity in this way (by referencing something that is no longer live) are known as “dangling pointers”--they don’t reference anything useful, just kind of dangle in space. If you need to return a pointer to something defined in a function, then define the thing as static. This will ensure that space for the variable is allocated in the data segment instead of on the stack.
  • The storage class specifier auto is never needed.
  • Setjump saves a copy of the program counter and the current pointer to the top of the stack. Then longjump restores these values, effectively transferring control and resetting the state back to where you were when you did the save.
  • A goto can’t jump out of the current function in C (that’s why this is a “longjump”--you can jump a long way away, even to a function in a different file).
  • You can only longjump back to somewhere you have already been, where you did a setjump, and that still has a live activation record.
  • A setjump/longjump is most useful for error recovery.
  • The header file needs to be included in any source file that uses setjump or longjump.
  • In DOS, the stack size ha to be specified as part of building the executable, and it cannot be grown at runtime.
  • If you’re working on the OS kernel, most of the runtime tools are not available to you, because the kernel does not run as a user process.
  • A kernel “panics”, or comes to an abrupt halt, when it detects a situation that “cannot” arise.
  • A word to the wise: it’s possible to embed assembler code in C source.
  • A knowledge of memory architecture helps a programmer to understand some of the C conventions and restrictions.
  • The modern Pentium is a direct descendant of Intel’s 8086 from 15 years before, and it contains architectural irregularities to provide a backwards compatibility with it.
  • The 80x86 is a difficult and frustrating architecture for which to write compilers and application programs.
  • A guide to memory prefix use:
    • kilo = 2^10 = one thousand bytes
    • mega = 2^20 = one million bytes
    • giga = 2^30 = one billion bytes
    • tera = 2^40 = one trillion bytes
  • Note that all disk manufacturers use decimal rather than binary notation for disk capacity.
  • A 64-bit address space is really large.
  • When you have access to huge amounts of data, the latency for moving it around will start to dominate software performance.
  • It is very inconvenient for a program to be restricted by the amount of main memory installed on a machine. So early on in computing, the concept of virtual memory was developed to remove this restriction. The basic idea is to use cheap but slow disk space to extend your fast but expensive main memory.
  • Virtual memory is organized into “pages.” A page is the unit that the OS moves around and protects, typically a few Kbytes in size.
  • When a memory image travels between disk and physical memory we say it is being paged in (if going to memory) or paged out (going to disk).
  • A process can only operate on pages that are in memory. When a process makes a reference to a page that isn’t in memory, the MMu generates a page fault. The kernel responds to the event and decides whether the reference was valid or invalid. If invalid, the kernel signals “segmentation violation” to the process. If valid, the kernel retrieves the pave from the disk.
  • Write-through cache--This always initiates a write to main memory at the same time it writes to the cache.
  • Write-back cache--In the first instance, this writes only to cache. The data is transferred to main memory when the cache line is about to be written again an a save hasn’t taken place yet.
  • In UNIX, disk inodes are cached in memory. This is why the filesystem can be corrupted by powering the machine off without first flushing the cache to disk with the “sync” command.
  • Just as the stack grows dynamically on demand, so the data segment contains an object that can do this, namely, the heap.
  • The heap area is for dynamically allocated storage, that is, storage obtained through malloc (memory allocation) and accessed through a pointer. Everything in the heap is anonymous--you cannot access it directly by name, only indirectly through a pointer.
  • There are two common types of heap problems:
    • freeing or overwriting something that is still in use (this is a “memory corruption”)
    • not freeing something that is no longer in use (this is a “memory leak”)
  • If the programmer does not free each malloced block when it is no longer needed, the process will acquire more and more memory without releasing the portions no longer in use.
  • Whenever you write malloc, write a corresponding free statement.
  • If you don’t know where to put the “free” that corresponds to your “malloc”, then you’ve probably created a memory leak.
  • The alloca() routine allocates memory on the stack; when you leave the function in which you called it, the memory is automatically freed.
  • The main user-visible symptom of a memory leak is that the guilty process slows down.
  • Looking for a memory leak is a two-step process. First you use the swap command to see how much swap space is available. Type the command three or four times over the space of a minute or two, to see if available swap specs keeps keeping smaller.
  • Of all the network investigative tools, the absolute tops is snoop.
  • The absolute best feature of snoop, though, is the -a option. This causes snoop to output a click on the workstation loudspeaker for each packet. You can listen to your network ether traffic.
  • The second step is to identify the suspected process, and see if it is guilty of a memory leak.
  • Again, repeat the command several times; any program that dynamically allocates memory can be observed growing in size. If a process appears to be constantly growing and never leveling off, then suspect a memory leak.
  • Most memory leaks aren’t quite as blatant as overwriting the only pointer to the block before it can be freed, so they are harder to identify and debug.
  • A signal is an event notification or a software-generated interrupt, much used in UNIX systems programming and hardly ever used in applications programming.
  • The “core dump” part of the message is just a throwback to the days when all memory was made of ferrite rings, or “cores”.
  • Alignment means that data items can only be stored at an address that is a multiple of their size.
  • The compilers automatically allocate and pad data (in memory) to achieve alignment.
  • Segmentation faults are generated by an exception in the memory management unit (the hardware responsible for supporting virtual memory). The usual cause is dereferencing (looking at the contents of the address contained in) a pointer with an uninitialized or illegal value.
  • But never nest one conditional operator inside another, as it quickly becomes too hard to see what goes with what.
  • Common immediate causes of segmentation fault:
    • dereferencing a pointer that doesn’t contain a valid value
    • dereferencing a null pointer (often because the null pointer was returned from a system routine, and used without checking)
    • accessing something without the correct permission--for example, attempting to store a value into a read-only text segment would cause this error
    • running out of stack or heap space (virtual memory is huge but not infinite)
  • The common programming errors that (eventually) lead to something that gives a segmentation fault, in order of occurrence, are:
    • Bad pointer value errors: using a pointer before giving it a value, or passing a bad pointer to a library routine.
    • Overwriting errors: writing past either end of an array, writing past either end of a malloc’d block, or overwriting some of the heap management structures.
    • Freeing errors: freeing the same block twice, freeing something that you didn’t malloc, freeing some memory that is still in use, or freeing an invalid pointer.
  • You can amend your free statements to clear a pointer after freeing what is points to: free(p); p = NULL; This ensures that if you do use a pointer after you have freed it, at least the program core dumps once.
  • The correct way to free an element while traversing down a linked list is to use a temporary variable to store the address of the next element. Then you can safely free the current element at any time, and not have to worry about referencing it again to get the address of the next.
  • If your program needs more memory than the operating system can give it, it will be terminated with a “segmentation fault”.
  • Signal handlers are not supposed to call library functions (except under restricted circumstances stated in the standard).
  • Prototypes are intended to reduce a common (and hard-to-find) class of errors, namely a mismatch between formal and actual parameter types.
  • Never mix the old and new styles of function declaration and definition.
  • If a program sets the terminal into a funny mode, it will stay in a funny mode.
  • Raw I/O achieves a blocking read--if no character is available, the process waits there until one comes in. If you need a nonblocking read, you can use the ioctl() (I/O control) system call.
  • If a library or system call encounters problems, it will set errno to indicate the cause of the problem.
  • Curses (think “cursor”) is a library of screen management calls, with implementations on all popular platforms.
  • Interrupt-driven programs are much more complex and difficult to get working, but the paradigm enables a process to make productive use of time otherwise spent waiting for input. The use of threads diminishes the need to use interrupt-driven I/O techniques.
  • Did you know that most debuggers allow you to make function calls from the debugger command line? When you debug the code and you’re stopped at a breakpoint you can easily check the integrity of your data structures by manually issuing a call to your print routine.
  • Debugging code into existence means writing a fast slapdash first attempt, and then getting it working by successive refinements over a period of weeks by changing parts that don’t work.
  • Coding for debuggability means breaking the system down into parts, and getting the program structure working first. Only when you have got the basic program working should you code the complicated refinements, the performance tweaks, and the algorithm optimizations.
  • Hashing is a way to speed up access to an element in a table of data.
  • Hashing is a tried and tested table lookup optimization, and it’s used everywhere in systems software: in databases, operating systems, and compilers.
  • Sometimes taking the time to break a programming problem into smaller parts is the fastest way of solving it.
  • All array names that are function parameters are always converted to pointers by the compiler.
  • To a compiler, an array is an address, and a pointer is the address of an address.
  • An array reference a[i] is always rewritten to *(a+i) by the compiler at compile-time.
  • An array name in an expression becomes a pointer, and there you are: pointers and arrays are interchangeable in expressions because they all boil down to pointers in the end, and both can be subscripted. Just as with addition, the subscript operator is commutative (it doesn’t care which way round the arguments come, 5 + 3 equals 3 + 5). This is why, given a declaration like int a[10];, both the following are correct: “a[6]” and “6[a]”.
  • The compiler automatically scales a subscript to the size of the object pointed at.
  • Word has gotten out that it is “more efficient” to program array algorithms using pointers instead of arrays. The popular belief is usually wrong. With modern production-quality optimizing compiler, there is not necessarily a difference in the code generated for one-dimensional arrays and that for references through a pointer.
  • Pointers are not necessarily any faster than subscripts when processing one-dimensional arrays.
  • Parameter is a variable defined in a function definition or a function prototype declaration.
  • Argument is a value used in a particular call to a function.
  • The standard stipulates that a declaration of a parameters as “array of type” shall be adjusted to “pointer to type”.
  • The reason arrays are passed to functions as pointers is efficiency.
  • The array/pointer equivalence for parameters was done for efficiency. All non-array data arguments in C are passed “by value” (a copy of the argument is made and passed to the called function; the function cannot change the value of the actual variable used as an argument, just the value of the copy it has).
  • It simplifies the compiler if the convention is adopted that all arrays are passed as a pointer to the start, and everything else is passed by making a copy of it. Similarly the return value of a function can never be an array or function, only a pointer to an array or function.
  • Data can be explicitly passed as call-by-reference by using the “address of” operator. This causes the address of the argument to be sent, rather than a copy of the argument.
  • Our preference is always to define the parameter as a pointer, since that is what the compiler rewrites it to.
  • Note that there is one thing that you can do with a pointer that you cannot do with an array name: change its value. Array names are not modifiable l-values; their value cannot be altered.
  • An entire array is passed to a function by giving it a pointer to the zeroth element, but you can give it a pointer to any element, and pass in just the last part of the array.
  • By passing in an address that is one before the start of the array (a[-1]), you can effectively give the array an index range of 1 to N, instead of 0 to N-1.
  • Definitions should match declarations. If you defined it as an array, your extern declaration should be an array. And likewise for a pointer.
  • Arrays in C are one-dimensional.
  • Whenever you see “array” in C, think “vector,” that is, a one-dimensional array of something, possibly another array.
  • With multidimensional arrays in C the rightmost subscript varies the fastest, a convention known as “row major addressing.”
  • In the simplest case, one-dimensional arrays can be given an initial value by enclosing the list of values in the usual curly braces. As a convenience, if you leave out the size of the array, it creates it with the exact size to hold that number of initial values.
  • You can only assign to an entire array during its declaration. There’s no good reason for this restriction.
  • Multidimensional arrays can be initialized with nested braces.
  • If the array dimension is bigger than the number of initial values provided, the remaining elements are set to zero. If they are pointers they are set to NULL. If they are floating points elements, they are set to 0.0.
  • By declaring a pointer-to-character we leave open the possibility that other characters can be stored adjacent to the first, implicitly forming a string.
  • If you declare an array of pointers to strings, and allocate memory for these strings as needed, you will be much more frugal with machine resources.
  • Wherever possible with strings, don’t make a fresh copy of the entire string.
  • The reason you see char ** argv is that argv is an array of pointers (i.e. char *argv[]). This decays into a pointer to the element, namely a pointer to a pointer.
  • One-dimensional arrays of any type can be used as function arguments in C. The parameter is rewritten as a pointer to its first element, so you may need a convention for indicating the total size of the array. There are two basic methods:
    • Send an additional parameter that gives the number of elements (this is what argv does)
    • Give the last element in the array a special value to indicate the end of the data (this is what the nul character at the end of a string does). The special value must be one that can’t otherwise occur in the data.
  • Arrays of two or more dimension cannot be used as parameters in the general case in C. You cannot pass a general multidimensional array to a function. You can pass specific arrays of known prespecified size, but it cannot be done in the general case.
  • In summary, therefore, if your multidimensional arrays are all fixed at the exactly same size, they can be passed to a function without trouble. The more useful general case of arrays of arbitrary size as function parameters breaks down like this:
    • One dimension--works OK, but you need to include count or end-marker with “out-of-bound” value.
    • Two dimensions--can’t be done directly, but you can rewrite the matrix to a one-dimensional Iliffe vector, and use the same subscripting notation.
    • Three or more dimension--doesn’t work for any type. You must break it down into a series of lesser-dimension arrays.
  • Strictly speaking, an array cannot be directly returned by a function. You can, however, have a function returning a pointer to anything you like, including an array.
  • Make sure you don’t return a pointer to an auto variable.
  • Programmers use dynamic arrays when we don’t know the size of the data in advance.
  • It can be instructive to use the strings utility to look inside a binary to see the error messages a program can generate.
  • Two lines is acceptable for a rare message.
  • Keep messages informative, non-inflammatory, and try to avoid unprofessional language such as profanity, colloquialisms, humor, or even exaggerations.
  • Here’s how a dynamic array can be effected in C. The basic idea is to use the malloc() (memory allocation) library call to obtain a pointer to large chunk of memory. Then reference it like an array, using the fact that a subscripted array decays to a pointer plus offset.
  • Dynamic arrays are also useful for avoiding predefined limits.
  • All significant software has bugs of one kind or another.
  • A compiler should never abort, no matter what input you give it.
  • Fixing reported problems is one of our highest priorities, but we can only fix problems that we know about.
  • Report to customer support all the product defects that you find. We can only fix the bugs that we know about and can reproduce.
  • Zen and the art of software maintenance suggests that you should spend a little time investigating any bugs you find; there may be an easy workaround.
  • There is a library call, realloc(), that relocates an existing block of memory to a different (usually large) size, preserving its contents while doing this.
  • In practice, dont’ assign the return value from realloc() directly to the character pointer; if reallocation fails, it will nullify the pointer, and you’ll lose the reference to the existing table!
  • The system may slow down at unpredictable points when a large table suddenly needs to grow. The growth multiple is a vital parameter.
  • You may have to protect table access with locks to prevent one thread reading it while another thread is reallocating it.
  • An alternative approach for a dynamically growing data structure is a linked list, at the cost of no random access. A linked list can only be accessed serially (unless you start caching, address of frequently accessed elements), whereas arrays give you random access. This can make a huge difference to performance.
  • Program proofs are not practical because most programmers find them too inaccessible.
  • For most practical purposes C++ is a superset of ANSI C.
  • Object-oriented programming (naturally) involves the use of objects as the central theme. There are lots of ways to define a software object; most of them agree that a key element is grouping together data with the code that processes it, and having some fancy ways of treating it as a unit. Many programming languages refer to this type of thing as a “class.”
  • Abstraction: The process of refining away the unimportant details of an object, so that only the essential characteristics that describe it remain.
  • Class: A user-defined type, just as int is a built-in type. Anything in a class is known as a member of the class. Member functions of a class (the operations) are also known as methods.
  • Object: A specific variable of a class type, just as j may be a specific variable of type int. An object is also known as an instance of a class.
  • Encapsulation: Grouping together the types, data, and functions that make up a class.
  • Inheritance: This is the big one--allowing one class to receive the data structures and functions described in a simpler base class.
  • Now C++ is a rather large language. As a concrete example, the size of a C compiler front-end might be around 40,000 lines; a C++ compiler front-end might be twice as big, or more.
  • Abstraction is the notion of looking at a group of “somethings” and realizing that they have common themes. You can then ignore the unimportant differences and just record the key data items that characterize the thing.
  • C supports abstraction through allowing the user to define new types (struct, enum) that are almost as convenient as the predefined types (int, char, etc.), and to use them in a similar way. We say “almost as convenient” because C does not allow the predefined operators (*, <<, [], +, etc.) to be redefined for user-defined types. C++ removes this barrier.
  • When you bundle together an abstract data type with its operations, it is termed “encapsulation.”
  • A class encapsulates (bundles together) code with its related data.
  • A class is the software realization of encapsulation. A class is a type, just like char, int, double, and struct rec * are types, and so you must declare variables of the class to do anything useful.
  • It’s a helpful convention to start class names with a capital letter.
  • A class is just a user-defined type with all the operations on it.
  • It’s prefered to not make data public.
  • The availability is a keyword that says who can access the declaration that follow.
  • Private declarations are visible outside the class (the name is known), but they are not accessible.
  • Public: The declarations that come after are visible outside the class and can be set, called, and manipulated as desired.
  • Protected: The declarations that come after are visible to functions inside this class, and to functions in classes derived from this class.
  • Private: The declarations that come after can only be used by member functions of this class.
  • Friend: This says that the function is not a member of the class, but can access private and protected data just as though it were.
  • C++ comments start with // and go to the line end.
  • C++ source files usually have an extension of .cpp or .cc or .C.
  • The :: is called “the scope resolution operator.” The identifier that precedes it is a class to look in. If there’s no preceding identifier, ti means global scope.
  • Invoking a member function on a class object is equivalent to the “sending a message to that object” terminology that other object-oriented languages use.
  • Every method has a this pointer parameters implicitly passed to it, allowing an object to refer to itself inside a method.
  • Most classes have at least one constructor. This is a function that is implicitly called whenever an object of that class is created. It sets initial values in the object. There is also a corresponding cleanup function, called a “destructor”, which is invoked when an object is destroyed (goes out of scope or is returned to the heap). Destructors are less common than constructors, and you write the code to handle any special termination needs, garbage collection, and so on.
  • Constructors and destructors are needed because no one outside the class is able to access the private data.
  • Constructor functions always have the same name as the class.
  • Constructors are needed because classes typically contain structs with many fields, requiring sophisticated initialization. An object’s constructor function will be called automatically whenever an object of that class is created, and should never be invoked explicitly by the programmer.
  • Single inheritance occurs when a class specializes, or refines, the data structures and methods of a single base class.
  • Inheritance: Deriving one class from another such that all of the other’s characteristics are automatically available.
  • Inheritance usually provides increasing specialization as you go from a simple base class to a more specific derived class.
  • An important part of OOP is figuring out the hierarchies of the abstract data types in your application. The major novelty that C++ provides, which cannot easily be accomplished by discipline use of C, is inheritance.
  • Inheritance takes place between two classes (and not between individual functions).
  • Inheritance say the derived class is a variation of the base class and there are many detailed semantics governing how they can access each other.
  • Multiple inheritance allows two classes to be combined into one, so that objects of the resulting class would behave as objects of either class.
  • Overloading simply means reusing an existing name, but using it to operate on different types.
  • Since C++ allows the creation of new types, the programmer is given the ability to overload names and operators for those new types too.
  • Overloading (by definition) is always resolved at compile time.
  • In order to conserve programmer sanity, you should only overload an operator for similar operations; don’t overload * so that is now does division.
  • Polymorphism refers to the ability to have one name for a function or an operator, and use it for several different derived class types. Each object will carry out a different variant of the operation in a manner appropriate to itself.
  • C++ demands that you warn it when you start supplanting base class methods with derived class ones. You warn it by adding the keyword virtual to the base class method that you might be replacing.
  • The word virtual is a bit of a misnomer in this context. Elsewhere through computer science, “virtual” means letting the user see something that is not really there, and supporting the illusion by some means. Here, it means not letting the user see something that is really there (the base class method).
  • Polymorphism and interposing both allow multiple functions to have the same one identifier.
  • There is a property of programming languages known as orthogonality. This refers to the degree to which different features follow the same underlying principles.
  • The major purpose of a programming language is to provide a framework for expressing problem solutions in terms which a computer can process. The better a language is a representing this discourse, the more successful it will be.
  • A language will be successful if its constructs are useful “building blocks” for solving problems in a given domain.
  • The market for a single language is always less than for a general machine.
  • C++ has some point improvements, but it retains many of the flaws of C, and piles up another big layer of complexity on top.
  • A typecast can be written in the more normal-looking format of float(i) as well as the strange-looking C style of (float) i.
  • C++ allows a constant integer to define the size of an array.
  • Declarations can be intermingled with statements, dropping the C requirement that all declarations precede all statements in a block.
  • The best way to get started with C++ is to start programming in its ANSI C subset.
  • Name-mangling is a horrible kludge which will likely live on in C++ for a long time.
  • Name-mangling is a hack for doing type checking between different files. But it implies that all your C++ rules must be compiled with the same compiler as the name-mangling scheme may differ among compilers.
  • Companies that don’t listen to their customers always go out of business. Companies that try to push the state of the art often succeed.
  • We all learn far more from our mistakes. than from our successes.
  • Programmers are sometimes more qualified than managers to judge which of the candidates has the best technical skills as an “individual contributor.”
  • Of course, part of what you look for when you interview people is not just what they respond with, but how they respond.
  • The ‘-Xc’ tells the compiler to reject any non-ANSI C constructs. It’s a good idea to always use this option when writing new code, because it will help you attain maximum program portability.
  • Something short can be simpler to read than something long. However, extreme brevity is also difficult to read.
  • Don’t try to do too many things in one statement.
  • Everyone knows that debugging is twice as hard as writing a program in the first place. So if you’re as clever as you can be when you write it, how will you ever debug it?
  • A system call gets into the kernel by issuing a “trap” or interrupt.
  • Library routines are usually slower than in-line code because of the subroutine call overhead, but system calls are much slower still because of the context switch to the kernel.
  • For raw performance, minimize the number of system calls wherever possible, but remember, many routines in the C library do their work by making system calls.
  • All the UNIX routines that manipulate files use either a file pointer or a file descriptor to identify the file they are operating on.
  • All the system calls that manipulate files take (or return) a “file descriptor” as an argument.
  • The file descriptor is a small integer (usually between 0 and 255) used in the Sun implementation to index into the per-process table-of-open-files.
  • So a file descriptor is the offset in the per-process table-of-open-files.
  • A FILE pointer holds the address of a file structure that is used to represent the open I/O stream.
  • Trick questions are fun to pose to your friends, but they don’t really belong in an interview.
  • Now, the first thing you learn in complexity theory is that the notation O(N) means that as N (usually the number of things being process) increases, the time taken increases at most linearly. Similarly, O(N^2) means that as N increases, the processing time increases much, much faster, by up to N-squared in fact. The second thing that you learn in complexity theory is that everything in a binary tree has O(log(n)).
  • Humanity’s higher purpose is to strive, to seek, to create.

No comments:

Post a Comment