Garbage collection in programs

Submitted by peter on Thu, 11/30/2017 - 06:31

Programs and programming languages waste memory. GC, Garbage Collection, recycles memory for reuse, reducing overall memory wastage. Here is the good, the bad, and manual collection.

Why do we have garbage memory? Programming code allocates memory for every little bit of data, one chunk at a time. When a program finishes using a chunk, the chunk remains unused until it is recycled. Recycling is a problem. Some programming languages recycle automatically, some manually, and all have limitations based on their design. PHP is an example of a hybrid automatic GC you can manually control for special applications.

Think of something using variables x, y, and z. X is allocated some memory then y is allocated some memory. X is no longer needed so x is deallocated, ready for reuse. The code now needs space for z. If z is the same size or smaller than x, z can reuse the space allocated to x. When z is bigger, z has to get new memory.

In a small program used for a short time, you allocate new memory for every variable and everything is cleaned up when the program finishes. The problems start with long running programs. A transaction server might run non stop for a few years.

A new problem is the computer that hibernates instead of having a daily shutdown and startup cycle. You leave your email program running and simply shut the lid on your notebook computer. The next day you open up your notebook and the email program continues using the same memory. You now need GC in your email program.

Three things take up time in memory usage. Allocating space for a new variable takes time and takes extra time if the memory allocation attempts to find an existing space to reallocate. Deallocating memory takes less time until it kicks off code that attempts to recycle many small spaces as one larger space, commonly called garbage collection. Recycling memory opens up the possibility of something secret remaining in memory, which introduces extra cycles to wipe memory clean during deallocation or GC.

Aggregation

We can reduce allocation time wastage by aggregating memory requests. Aggregation is an ancient approach used on mainframes back before Bill Gates made the Apple computer usable as a PC1.

Aggregation bunches variables together based on their use. For example, all the variables used by a function are placed in one big memory allocation instead of many small allocations. This is several times more efficient for allocation, deallocation, and a security wipe. You can do this in any programming language where there is a code unit like a function or an object.

Some code units need a variable amount of memory and that is often handled by allocating a common amount then reverting to conventional allocation when more is needed. This is still more efficient than what most code does as most code reallocates a variable every time the variable expands.

Deallocation uses very little time and aggregation makes only a small difference. Aggregation makes a big difference when the GC kicks in, either as part of deallocation or as a separate task. To deallocation, an aggregate is just one variable, not many small variables, and either is easy. To GC, the difference is huge.

Aggregation creates a conflict of interest when you swipe memory for security. Think about a user login function. The variables contain the user name and password entered for the login. The password has to be wiped for security. The user name is public and does not matter. At the variable level, the a wipe of the password is a fast operation. At the aggregate level, the same wipe has to cover all the data, the password, the user name, and what may be a heap of other variables such as the date, time, IP address. An aggregate wipe could be hundreds of times longer than the bit needed for security.

Reuse

Reuse of variables is fast when it works. Think about a simple function to check for illegal characters in a user name during registration. The function could be locked into memory once, could allocate the variables once, then serve many user registration processes. The function would be a long running service retaining the same memory. If your computer has eight processors, there could be one function in each processor. Allocation occurs eight times at startup then never uses any more memory allocation/deallocation/GC until you restart your computer next year.

Realtime systems and other well crafted software have some reuse. Reuse works well when data has the same length or a fixed limit. Twitter bleats have a fixed length, making a Twitter server easier to design for reuse. Email messages have a variable length address, a variable length title and a variable length body, making reuse difficult for email.

C is one of many languages with an option to allocate static variables. C lets you write code for multiple processors. Reuse is practical in C after you overcome all the complexities of programming in C.

Recycle

Recycling of allocated memory can be random where the allocation process searches through deallocated memory for something the same size or it can be planned reuse. Reuse is faster but cannot be used everywhere.

Random recycling is a slow expensive process when the deallocated memory list is long. GC has to occur occasionally to reduce the length of the deallocated list.

One of the ancient approaches is to split memory usage into input, temporary, and output. Instead of a function allocating space for input, the function uses a reference to the input. A reference can also be used for the output to remove the memory allocation in whatever receives the output.

In the dinosaur days of mainframes, when people were developing most of what you use now, a function could be called with space for the input and output. The function would read from the supplied input and write to the supplied output area, reducing the need for memory allocation in the function.

Reentrant

Reentrant code has only one copy of the code and it uses external memory, through automatic allocation or supplied by the calling code. There are real advantages to using memory supplied by the calling code.

A function might request two input values, an output variable space, and use a thousand bytes of workspace. The calling code supplies an aggregate area with the inputs, the output, and the thousand spare bytes. The function never allocates memory which means any number of processes can use the code any time and things like interrupts are irrelevant to the called function.

Suppose your code calls dozens of functions and each one requires from one thousand to five thousand bytes. You could allocation five thousand bytes at the start and reuse the same space for every function. This can be a massive reduction in memory allocation overheads.

Use less memory

PHP 7 replaces PHP 5 as the standard for PHP development. PHP 7 can be many times faster than PHP 5 due to a reduction in the memory allocated by the code. Less memory. Less allocation overheads. Less deallocation overheads. Less GC. Given the way PHP is used, everything in PHP 7 is better.

If PHP was used for something like an email client or a word processor, we would not see the same speed increases in PHP 7. PHP starts clean, shiny, and new for every page request. The PHP 7 improvements are the best approach for page serving. For an application that sits there all day, reuse and recycling is more important.

Garbage Collection language examples

Languages have GC running automatically, GC running manually, or nothing and require extra code added for GC. Here are some examples.

PHP

PHP adds a reference count to each variable. When the reference count is back to zero, the variable is no longer used and can be deallocated. GC is automatic and can be switched off for short running processes. The automatic collection has a default setting to operate when 10,000 unused variables are ready for collection and that number can be reconfigured. When there is some natural break in your process, perhaps a wait for a network event, you can manually run GC.

C

C has manual memory management and optional automatic memory management through add-on code. You can make variables static to lock them in for the life of the program execution. Variables can be automatically allocated on a "stack" then released for reuse when a code block is exited. You can also perform dynamic memory allocation with malloc then release the memory with free but the memory will be lost until you add code to recycle the memory.

C is as close as you can get to the power and control of Assembler without writing Assembler. The manual approach with C is painful until you have to write a realtime program that may run for years. For something that runs 24/7, you need that level of complete control.

Go

Go is a modern update of C style languages designed for large applications but not reatime applications. Go does automatic GC with no controls. Go runs GC frequently to keep each GC pause short. You cannot defer GC until a natural break, making some interactive applications pause at the wrong time. Early versions of Go had noticeable GC pauses and recent versions hide the GC to the point where you might not notice it unless your were specifically looking for it.

Conclusion

You need garbage collection in programs. Good design and code will reduce the amount of garbage. No garbage collection is the best approach for applications with a short execution time and that is becoming rare. Automatic GC works for most applications but not realtime. Manual GC is needed for realtime and complete control of heavily interactive applications.

1 The first Apple computer was similar to the many bare computer boards already in use worldwide except the Apple computer board lacked usable software, relegating the Apple to a toy instead of a PC. Bill Gates/Microsoft converted their BASIC compiler to run on the weird processor used in the Apple board. Suddenly the Apple device was useful.