What is Mark and Sweep?

In computer programming, mark and sweep is an algorithm used in garbage collection. In languages such as C++ and Java, programs create objects on the heap, and some of them later become unused. Every object that was created still occupies heap memory, so when new objects are needed there may not be enough space left. The unused objects must then be removed, but doing this manually costs the programmer a lot of time and effort. The mark and sweep algorithm collects this garbage automatically.

Description of Mark and Sweep

Any garbage collection algorithm has to perform two basic operations. First, it must find all the unreachable, unused objects. Second, it must recover the memory used by those unreachable objects. The mark and sweep algorithm performs both operations in two phases: 1. the mark phase and 2. the sweep phase. Both phases are discussed in detail below.

Mark Phase

Every object has a bit assigned to it called the mark bit, which is set to zero (0) when the object is created. In the first phase, i.e., the mark phase, the collector sets the mark bit to 1 (true) for every object that is reachable. Objects whose mark bit is still 0 after this phase are the unreachable ones. To find out whether a particular object is in use, the collector starts from the root references and follows every reference from there. This is a graph traversal, and a depth first search strategy may be used for it. Look at the following figure, in which reachable objects are connected by arrows and unreachable objects are not connected to the root.

Every object is treated as a node in this traversal. Each node that can be reached is visited, and the traversal continues until all the reachable nodes or objects have been visited.

We can consider the below algorithm for the operation that is done in mark phase.

Note: this assumes that there is only one root variable. If there is more than one root, Mark() must be called for each of the available root variables.
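The mark step described above can be sketched in C roughly as follows. This is only a minimal sketch under stated assumptions: the Object layout, the MAX_REFS limit, and the helper MarkFromRoots() are hypothetical names chosen for illustration, and a recursive depth first search is just one possible traversal strategy.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_REFS 4  /* hypothetical limit on outgoing references per object */

/* Hypothetical object layout: a mark bit plus the outgoing references. */
typedef struct Object {
    bool marked;                    /* mark bit, 0 when the object is created */
    struct Object *refs[MAX_REFS];  /* referenced objects, NULL when unused */
} Object;

/* Depth first traversal: mark obj and everything reachable from it. */
void Mark(Object *obj)
{
    if (obj == NULL || obj->marked)
        return;  /* no object, or already visited (this also stops on cycles) */

    obj->marked = true;
    for (size_t i = 0; i < MAX_REFS; i++)
        Mark(obj->refs[i]);
}

/* With several roots, Mark() is called once for each root. */
void MarkFromRoots(Object *roots[], size_t rootCount)
{
    assert(roots != NULL || rootCount == 0);
    for (size_t i = 0; i < rootCount; i++)
        Mark(roots[i]);
}
```

Because an already marked object is never visited again, the traversal terminates even when objects refer to each other in a cycle. A production collector would typically use an explicit stack instead of recursion to avoid overflowing the call stack on deep object graphs.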

Sweep Phase

In the second phase, i.e., the sweep phase, the name itself says what happens: all the unused or unreachable objects are swept away. The collector walks over the whole heap after the traversal. Every object whose mark bit is still false (zero) is freed, so its memory becomes available for newly created objects. Every object whose mark bit is set is kept, and its mark bit is reset to zero for the next collection cycle. The algorithm in the sweep phase can be written as follows:

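// Algorithm in sweep phase

void Sweep()
{
    Object *pHeap = pHeapStart;

    while (pHeap < pHeapEnd)
    {
        // Read the successor first, before the current object may be freed
        Object *pNext = GetNext(pHeap);

        if (!Marked(pHeap))
            Free(pHeap);       // unreachable: put it on the free object list
        else
            UnMarkBit(pHeap);  // reachable: reset the mark bit for the next cycle

        pHeap = pNext;
    }
}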

Example to explain Mark and Sweep

Let us look at an example that shows how the mark and sweep algorithm works. Assume seven objects A, B, C, D, E, F, G, some of which are connected to each other and some of which are not. Root R is the base object from which the traversal begins. Look at the following connection of objects.

In the above figure, a connection means that there is a reference between two objects. To find out which objects are reachable, the traversal begins at root R and moves to A, because the two are connected. It continues until it has reached all the nodes that are connected. However, there is no connection between C and E, F, nor between E, F and any other object that is connected to the root. The traversal therefore never visits E and F. These two objects are considered unreachable even though they are connected to each other (reference counting, which is explained below, cannot reclaim such objects). So E and F are swept away as per the algorithm, and the memory occupied by them is freed.
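The example can be replayed with a small C sketch. Since the figure is not reproduced here, the exact edges are an assumption made for illustration (R refers to A, A to B and C, B to D, C to G, and E and F refer only to each other). The array-based heap and the names build_example(), mark() and collect() are hypothetical as well.

```c
#include <assert.h>
#include <stdbool.h>

enum { A, B, C, D, E, F, G, OBJECT_COUNT };
#define MAX_REFS 2  /* hypothetical limit on references per object */

typedef struct {
    bool allocated;      /* false once the object has been swept */
    bool marked;         /* mark bit */
    int refs[MAX_REFS];  /* indices of referenced objects, -1 when unused */
} Object;

Object heap[OBJECT_COUNT];

void link_objects(int from, int to)
{
    for (int i = 0; i < MAX_REFS; i++) {
        if (heap[from].refs[i] == -1) {
            heap[from].refs[i] = to;
            return;
        }
    }
    assert(0 && "too many references from one object");
}

/* Builds the assumed graph; E and F refer only to each other. */
void build_example(void)
{
    for (int i = 0; i < OBJECT_COUNT; i++) {
        heap[i].allocated = true;
        heap[i].marked = false;
        for (int j = 0; j < MAX_REFS; j++)
            heap[i].refs[j] = -1;
    }
    link_objects(A, B);
    link_objects(A, C);
    link_objects(B, D);
    link_objects(C, G);
    link_objects(E, F);
    link_objects(F, E);
}

/* Mark phase: depth first search from one object. */
void mark(int obj)
{
    if (obj < 0 || heap[obj].marked)
        return;
    heap[obj].marked = true;
    for (int i = 0; i < MAX_REFS; i++)
        mark(heap[obj].refs[i]);
}

/* Runs both phases and returns the number of objects that were swept. */
int collect(int root)
{
    int freed = 0;

    mark(root);
    for (int i = 0; i < OBJECT_COUNT; i++) {
        if (!heap[i].allocated)
            continue;
        if (heap[i].marked) {
            heap[i].marked = false;     /* reset for the next cycle */
        } else {
            heap[i].allocated = false;  /* sweep the unreachable object */
            freed++;
        }
    }
    return freed;
}
```

Calling build_example() and then collect(A), where A stands for the object that root R refers to, sweeps exactly the two objects E and F under the assumed edges.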

Advantages and Disadvantages of Mark and Sweep Algorithm

Advantages:

The mark and sweep algorithm can handle cyclic references. Unlike some garbage collection algorithms, it does not end up in an infinite loop even if cycles occur during the traversal, because an object that is already marked is not visited again.

The algorithm adds no overhead to the normal execution of the program. All the work is done while the collector runs, and nothing extra has to be done each time a reference is created or changed.

Disadvantages:

While the garbage collection algorithm runs, the execution of the main program is stopped.

After the algorithm has run, all the unreachable objects have been removed from memory. The reachable objects stay where they were, with gaps between them, i.e., the memory becomes fragmented. Take a look at the figure below.

The white spaces are empty memory, which separates the reachable objects represented by the grey blocks.

Mark and Sweep (vs) Stop and Copy Algorithm

We are now going to compare two garbage collection algorithms and the difference in how they work. As already discussed, mark and sweep performs garbage collection in two phases.

The stop and copy algorithm is another garbage collection technique, and it also removes the fragmentation of memory shown in the above figure. Its main working principle is that it splits the whole memory available to objects into two halves. One half is known as the current half (where all the allocations take place), and the second half remains idle (nothing happens in it). Unlike mark and sweep, this technique has no free list to maintain or search. In the stop and copy algorithm, memory is allocated to new objects from the first half, and when the first half is fully occupied, garbage collection is performed on it.

Figure: stop and copy removing memory fragmentation. Comparing each sentence of the above paragraph with this figure gives a clear idea of how the stop and copy technique works.

As in mark and sweep, the collector traverses the heap starting from the root node (object). Each node that is traversed is transferred (copied) to the second half of the heap, which is idle. When the copy is finished, i.e., all the reachable nodes have been copied into the idle half, the designations of the two halves are switched. The first half, which was the current heap, becomes the idle heap, and the second half (the former idle heap) becomes the current heap. So all the reachable nodes have been copied, but what about the unreachable nodes or objects? They are simply left behind in the half that is now idle, which means they were not copied into the second half along with the reachable nodes.
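The copying step can be sketched in C as follows. This is a simplified illustration, not a definitive implementation. It assumes fixed-size objects with at most two references, a hypothetical forward field that records where an object was copied to, and a breadth first scan of the copied objects (in the style of Cheney's algorithm) in place of a depth first traversal. All names are chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define HALF_CAPACITY 8  /* hypothetical number of object slots per half */

typedef struct Object {
    struct Object *forward;  /* new address once copied, NULL otherwise */
    struct Object *refs[2];  /* outgoing references, NULL when unused */
    int value;
} Object;

Object spaceA[HALF_CAPACITY], spaceB[HALF_CAPACITY];
Object *current = spaceA;  /* half in which allocations take place */
Object *idle = spaceB;     /* half that stays unused until a collection */
size_t used = 0;           /* slots handed out in the current half */

/* Allocation only advances an index; returns NULL when the half is full. */
Object *allocate(int value)
{
    if (used == HALF_CAPACITY)
        return NULL;
    Object *o = &current[used++];
    o->forward = NULL;
    o->refs[0] = o->refs[1] = NULL;
    o->value = value;
    return o;
}

/* Copies o into the idle half on first visit and returns its new address. */
Object *copy_object(Object *o, size_t *next)
{
    if (o == NULL)
        return NULL;
    if (o->forward == NULL) {
        assert(*next < HALF_CAPACITY);
        Object *dst = &idle[(*next)++];
        *dst = *o;         /* the copy still holds the old references */
        o->forward = dst;  /* leave a forwarding address behind */
    }
    return o->forward;
}

/* Copies everything reachable from the roots, then swaps the two halves. */
void collect(Object *roots[], size_t rootCount)
{
    size_t next = 0, scan = 0;

    for (size_t i = 0; i < rootCount; i++)
        roots[i] = copy_object(roots[i], &next);

    /* Fix up the references held by objects that were already copied. */
    while (scan < next) {
        Object *o = &idle[scan++];
        for (size_t r = 0; r < 2; r++)
            o->refs[r] = copy_object(o->refs[r], &next);
    }

    Object *tmp = current;
    current = idle;
    idle = tmp;
    used = next;  /* survivors now sit contiguously at the start */
}
```

In this sketch the survivors end up packed next to each other, which is how the technique removes fragmentation. Allocation is a bounds check plus an increment, with no free list to search. Unreachable objects are never copied, so they are left behind in the half that becomes idle.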

Advantages of stop and copy:

This technique successfully eliminates fragmentation of memory.

Memory allocation is done in constant time.

Drawbacks of stop and copy:

The main drawback of this technique is that half of the memory is wasted, because one of the two halves always has to stay idle.

Mark and Sweep vs Reference Counting

Now that we know mark and sweep, let us take a look at reference counting and how it compares. Mark and sweep performs a traversal to find all the reachable nodes. Reference counting instead keeps track of how many references point to each object and uses that count to decide which objects are inaccessible or unreachable. If the number of references to an object is zero, no other object can reach it, and hence it is unreachable.

Formally, reference counting is a technique that stores the number of references, pointers, or handles to a resource such as disk space, a block of memory, or any other resource. The term may also refer to garbage collection algorithms that use such reference counts. When an object becomes unreachable and is destroyed, the reference count of every object that the destroyed object referred to is decreased by one. So destroying one object can eventually lead to a large number of objects (those that were referenced, directly or indirectly, only through the destroyed object) being freed.
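The cascading release can be illustrated with a short C sketch. It assumes a hypothetical Node type with a single outgoing reference and a global liveNodes counter that exists only to make the effect observable. None of these names come from a real library.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct Node {
    int refCount;        /* number of references to this node */
    struct Node *child;  /* single outgoing reference, for brevity */
} Node;

int liveNodes = 0;  /* hypothetical counter, only for observing the effect */

Node *node_new(void)
{
    Node *n = malloc(sizeof *n);
    assert(n != NULL);
    n->refCount = 1;  /* the creator holds the first reference */
    n->child = NULL;
    liveNodes++;
    return n;
}

void node_retain(Node *n)
{
    if (n != NULL)
        n->refCount++;
}

/* Drops one reference; destroys the node when the count reaches zero. */
void node_release(Node *n)
{
    if (n == NULL)
        return;
    assert(n->refCount > 0);
    if (--n->refCount == 0) {
        Node *child = n->child;
        free(n);
        liveNodes--;
        node_release(child);  /* the destroyed node no longer refers to it */
    }
}

/* Stores a reference in parent, adjusting both affected counts. */
void node_set_child(Node *parent, Node *child)
{
    node_retain(child);
    node_release(parent->child);
    parent->child = child;
}
```

If a refers to b and b refers to c, and the program has already dropped its own references to b and c, then releasing a frees all three nodes one after another.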

Advantages of Reference Counting:

In reference counting, objects are reclaimed as soon as they are no longer referenced by other objects, and they are reclaimed in an incremental fashion. Every object has a clearly defined lifetime.

The technique is widely implemented because it is one of the simplest forms of memory management.

Disadvantages of Reference Counting:

A notable disadvantage of reference counting is that it cannot reclaim objects that form a reference cycle. For example, suppose A and B are two objects that refer to each other, but neither is referenced by the root node or by any other object that is connected to the root node. The reference counts of A and B never drop to zero, because each is still referenced by the other, so neither object is ever freed.
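The cycle problem can be reproduced in a few lines of C. The Obj type, the freed flag and the function names below are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct Obj {
    int refCount;     /* number of references to this object */
    bool freed;       /* set when the count reaches zero */
    struct Obj *ref;  /* single outgoing reference */
} Obj;

/* Drops one reference and "frees" the object when the count hits zero. */
void release(Obj *o)
{
    if (o == NULL || o->freed)
        return;
    assert(o->refCount > 0);
    if (--o->refCount == 0) {
        o->freed = true;
        release(o->ref);
    }
}

/* Returns true if A and B are reclaimed once the program lets go of them. */
bool cycle_is_reclaimed(void)
{
    Obj a = { 1, false, NULL };  /* count 1: the program's own reference */
    Obj b = { 1, false, NULL };

    a.ref = &b; b.refCount++;    /* A refers to B */
    b.ref = &a; a.refCount++;    /* B refers to A */

    release(&a);                 /* the program drops its reference to A */
    release(&b);                 /* the program drops its reference to B */

    return a.freed && b.freed;   /* both counts are stuck at 1 */
}
```

After both external references are dropped, each count is still 1 because of the other object, so neither object is freed. A tracing collector such as mark and sweep would not mark either object and would reclaim both.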

Explanation of Mark and Sweep in C language and C#

In C programming, memory is allocated to objects by the function malloc(). So, to write a garbage collector, we first need to write a memory allocator, i.e., our own malloc(). Memory that is allocated dynamically in this way resides in the heap (a section of memory).

So, Mark and Sweep in C language works as follows:

We need to know two terms here: the used list and the free list. The used list holds the blocks of memory that are currently in use, and the free list holds the available memory that is idle. First, the collector attempts to scan every memory location that the program can access. Second, the program's variables are stored in those accessible locations, so any reference to a used block must be found there. A used block to which no reference is found can be moved back to the free list.
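The two lists can be sketched in C as follows. This is a minimal sketch that assumes a small pool of fixed-size blocks in place of a general-purpose malloc(). The names Block, pool_alloc() and pool_release() are hypothetical. A collector built on top of it would call something like pool_release() for every used block that it finds to be unreferenced.

```c
#include <assert.h>
#include <stddef.h>

#define BLOCK_COUNT 4  /* hypothetical pool dimensions */
#define BLOCK_SIZE 64

typedef struct Block {
    struct Block *next;                 /* link within the free or used list */
    unsigned char payload[BLOCK_SIZE];  /* memory handed to the program */
} Block;

Block pool[BLOCK_COUNT];
Block *freeList = NULL;  /* idle blocks */
Block *usedList = NULL;  /* blocks currently in use */

/* Puts every block of the pool on the free list. */
void pool_init(void)
{
    freeList = NULL;
    usedList = NULL;
    for (int i = 0; i < BLOCK_COUNT; i++) {
        pool[i].next = freeList;
        freeList = &pool[i];
    }
}

/* Moves one block from the free list to the used list. */
void *pool_alloc(void)
{
    if (freeList == NULL)
        return NULL;  /* out of memory: this is when a collector would run */
    Block *b = freeList;
    freeList = b->next;
    b->next = usedList;
    usedList = b;
    return b->payload;
}

/* Moves the block that owns payload from the used list to the free list. */
void pool_release(void *payload)
{
    for (Block **link = &usedList; *link != NULL; link = &(*link)->next) {
        if ((*link)->payload == payload) {
            Block *b = *link;
            *link = b->next;
            b->next = freeList;
            freeList = b;
            return;
        }
    }
    assert(0 && "pointer does not belong to a used block");
}

size_t list_length(const Block *list)
{
    size_t n = 0;
    for (; list != NULL; list = list->next)
        n++;
    return n;
}
```

Every block is on exactly one of the two lists at any time, so the sum of both list lengths always equals the pool size.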

Solving Vehicle Routing problem using Mark and Sweep

The Vehicle Routing Problem asks us to “find the optimal set of routes for a group of vehicles to deliver their passengers to their destinations”. It is somewhat similar to the traveling salesman problem. The main objective of the vehicle routing problem is to minimize the total cost of reaching the destinations. Look at the following figure, in which three optimal routes are found from the depot. There are many approaches to solving a vehicle routing problem, such as ‘branch and bound’ and ‘branch and cut’.

In the Vehicle Routing Problem, the best nodes out of all the possible nodes are chosen for a single optimal route, and the other nodes take no part in forming that route.

Applications and implementation of Mark and Sweep

The mark and sweep algorithm is used in programming languages to reclaim the space taken up by objects that are no longer functional, that is, objects that are no longer in use or no longer referenced. It is used in many software applications to free up memory space and improve the performance of the application.

The mark and sweep technique is implemented in two phases, the mark phase and the sweep phase. The mark phase traverses the entire graph of objects and marks every reachable one, which separates the reachable objects from the idle, unreachable ones. The sweep phase deletes all the unreachable objects from memory and releases their memory so that it can be allocated for some other purpose (for example, to newly created objects).

Explanation of Concurrent Mark Sweep (CMS)

The Concurrent Mark Sweep (CMS) collector is a garbage collector designed for applications that need shorter garbage collection pauses. The CMS collector is generational and performs both minor and major garbage collections. To reduce the pause time, CMS uses separate garbage collector threads that trace the reachable objects while the application threads keep running. During each garbage collection cycle, the CMS collector pauses all the application threads for a brief period at the beginning of the collection and again toward the middle of the collection. Another term to be aware of is concurrent mode failure, a message reported when this low pause collector is in use. It occurs when the new generation fills up so fast that objects overflow into the tenured generation, while CMS is unable to clear out the tenured generation in the background quickly enough.

Of the two pauses, the one that occurs in the middle of the collection tends to be the longer one. Multiple threads are used to do the collection work during these pauses. Keep in mind that the CMS collector does most of its work, including the sweeping of unreachable objects, without stopping the application threads. In the following figure, you can see how the pauses differ between serial, parallel, and concurrent garbage collection. The blue arrows represent the execution of the application, and the light red arrows represent the pauses.

If too much time is spent in garbage collection, the CMS collector throws an OutOfMemoryError. Specifically, the error is thrown if more than 98 percent of the total time is spent in garbage collection and less than 2 percent of the heap is recovered. This feature prevents applications from running for an extended period while making little or no progress because the heap is too small. To disable this feature, add the option -XX:-UseGCOverheadLimit to the command line.