Patching IO2BO in LLVM

Apr 18, 2023

Prologue

This is a toy project I did that I found very interesting. The project is based on an “old” paper called IntPatch in 2010. Code base is here. Code is messy and needs refactoring.

IO2BO

What is IO2BO

Info

IO2BO (Integer Overflow to Buffer Overflow) vulnerability is defined as “An integer overflow occurs when a program performs a calculation to determine how much memory to allocate, which causes less memory to be allocated than expected, leading to a buffer overflow.”

Here is a simple example from the paper. We know the maximum 32-bit signed integer is 0x7FFFFFFF. Suppose in line 467 we read in an int32 of 0x80000001. In line 469 we multiply it with 4 which gives us 4 and use it to allocate the memory. If we look further we found in line 483 we think we have more memory than we actually have allocated, which means a buffer overflow problem was caused by the integer overflow.

Why is it important

According to the paper, IO2BO is widely used by attackers: bypassing SSH, heap corruption attack IO2BO: nearly 1/2 of integer overflow vulnerabilities, 1/3 of heap overflow vulnerabilities. Yet many programmers are not yet aware of the danger of IO2BO. Even though some may be aware of it, the problem is tedious and error-prone to fix.

Integer overflow is generally validated in two ways by developers:

The first way is postcondition testing. We check if the result of the operation matches our expectation. This method is efficient, but it is useless when aggressive optimization discard checks like (a*b)/b != a.

// postcondition testing
if (b != 0 && (a*b)/b != a){
  MSG("overflow occurs");
}

The second way is precondition testing. We check if both variable is within the valid range before we calculate. This method can circumvent compiler optimization. The problem is the overhead it induces is non-trivial.

// precondition testing
if ( a > 0 && b > 0 && a > INT_MAX/b)
  MSG("overflow occurs");
else if ( a > 0 && b < 0 && b < INT_MIN/a )
  MSG("overflow occurs");
else if ( a < 0 && b > 0 && a < INT_MIN/b )
  MSG("overflow occurs");
else if ( a < 0 && b < 0 && b < INT_MAX/a )
  MSG("overflow occurs");
else
  c = a*b

The motivation of IntPatch is to eliminate IO2BO at compile time. The programmer don’t need to worry about them anymore.

Method

Type System

The idea is simple: the input brings uncertainty; the arithmetic operation propogates the uncertainty and may cause overflow; the memory operation consumes the uncertainty and leads to buffer overflow. We need to first locate the vulnerable arithmetic operations and then patch them. We can do this by introducing a type system. There are four types:

Untainted and non-overflow (00)
Tainted and non-overflow (10)
Untainted and may-overflow (01)
Tainted and may-overflow (11)

Tainted: the variable may be screwed up due to an overflow somewhere. May-overflow: the variable itself may overflow.

The arrow in the graph indicates the order of the type system. Our goal of the type system is to locate the vulnerabilities, which is defined by “a type 11 variable being consumed by a store instruction.” We assume the input variable is of type 11 and arithmetic operations propogates its taintedness and may-overflowness.

Type Inference

At the very beginning we only know the type of the input is 11. We use type inference rules to infer the type of other variables in LLVM IR.

Assignment: this is straightforward. The variable assigned has the same type as the variable used.
Branch: branch instruction is always harmless.
Compare: the result of a comparison can be tainted (since the operands may overflow), but the result is never overflow since it is a boolean.
Arithmetic: the arithmetic operations are always possible for overflow. The result can also be tainted if at least one of the operands is tainted.
Load: here comes the interesting part. When we load data from memory to a variable, we don’t know the type of the data, since we only keep track of the type of virtual registers so far. Therefore, we introduce a new type map, tp_v, for memory. tp_v also take in a register, but the output of the map is the type of the memory chunk pointed by that register. It is possible that many registers points to the same memory chunk, so when we load a memory chunk, we conservatively assume the worst case of all the possible types. This means the type of the memory chunk is the conjunction of all the tp_v of those registers who may point to this memory chunk. The situation that many registers point to the same memory chunk is called “pointer aliasing.”
Store: when storing, we update the tp_v of the memory chunk with the type of the value to store.
Phi-node: it is just the conjunction of all the possible types of this phi-node, represented by all the type of all the possible incoming values.

Patching

After we detect the vulnerable variabe, we add check statements to it and redirect to the error handling function in runtime.

Implementation

The implementation is straightforward. We built this extension upon llvm::ModulePass. The algorithm is similar to the reaching definitions analysis: we use a big while loop for fix-point algorithm and iterate the type inference rules until it hits the stable state. For pointer aliasing, we used the built-in llvm::AliasSetTracker.

After we found all the vulnerable variables, we create patches for them. Take add for example: we create three ICmpSGT instructions to compare the both operands and the result with 0. If the two operands share the same sign but the result doesn’t, overflow occurs and we use a CondBr to redirect the control flow to the error handling block we created. In the block we just terminates the program with the magic number 42.

An interesting part is how to initialize the input variable. In the original paper, they created a configuration file for each of the application and initialized manually. Instread we look for the scanf function call in the IR file and initialize the corresponding parameter. It seems to work pretty well and we don’t need to bother creating a configuration file for each program.

Results

We tested this program on some contrived programs.

#include <stdio.h>
#include <stdlib.h> 
int main(){
    int input;
    FILE* pFile;
    pFile = fopen("input1", "r");
    fscanf(pFile, "%d", &input);
    printf("%d", input);
    int plus10=input+10;
    printf("%d", plus10);
    fclose(pFile);
    return 0;
}

The input1 file contains the integer 2147483647. Before patched, the llvm IR of the program is

define dso_local i32 @main() #0 {
  %1 = alloca i32, align 4
  %2 = alloca i32, align 4
  %3 = alloca ptr, align 8
  %4 = alloca i32, align 4
  store i32 0, ptr %1, align 4
  %5 = call noalias ptr @fopen(ptr noundef @.str, ptr noundef @.str.1)
  store ptr %5, ptr %3, align 8
  %6 = load ptr, ptr %3, align 8
  %7 = call i32 (ptr, ptr, ...) @__isoc99_fscanf(ptr noundef %6, ptr noundef @.str.2, ptr noundef %2)
  %8 = load i32, ptr %2, align 4
  %9 = call i32 (ptr, ...) @printf(ptr noundef @.str.2, i32 noundef %8)
  %10 = load i32, ptr %2, align 4
  %11 = add i32 %10, 10
  store i32 %11, ptr %4, align 4
  %12 = load i32, ptr %4, align 4
  %13 = call i32 (ptr, ...) @printf(ptr noundef @.str.2, i32 noundef %12)
  %14 = load ptr, ptr %3, align 8
  %15 = call i32 @fclose(ptr noundef %14)
  ret i32 0
}

and the output is 2147483647-2147483639 with return value 0. We run the llvm again with the patching pass. It gives some interesting output. In the type analysis process, we spotted the alias pointers and add them to the alias set. We find type 11 variable used in a sink, which indicates vulnerabilities.

At the end of all iterations, we found the tainted and may-overflow variables. On closer look, they actually make sense.

%3 is tainted because it is used as the paramter of scanf
%9 is tainted because it loads %3
%11 for the same reason
%12 is tainted because it is the result of arithmetic operation on %3
store is tainted because it stores a tainted variable to %5.
%13 is tainted because it uses %5, which is just stored with tainted data.

After patchin the IR looks like this. Note that we add a bunch of icmp instructions, split the original basic block and insert a br to redirect to the new tempBB if overflow occurs. tempBB terminates the program and prevents the adversary from exploiting the vulnerablility.

; Function Attrs: inlinehint noinline nounwind optnone uwtable
define dso_local i32 @main() #0 !prof !35 {
  %pgocount = load i64, ptr @__profc_main, align 8
  %1 = add i64 %pgocount, 1
  store i64 %1, ptr @__profc_main, align 8
  %2 = alloca i32, align 4
  %3 = alloca i32, align 4
  %4 = alloca ptr, align 8
  %5 = alloca i32, align 4
  store i32 0, ptr %2, align 4
  %6 = call noalias ptr @fopen(ptr noundef @.str, ptr noundef @.str.1)
  store ptr %6, ptr %4, align 8
  %7 = load ptr, ptr %4, align 8
  %8 = call i32 (ptr, ptr, ...) @__isoc99_fscanf(ptr noundef %7, ptr noundef @.str.2, ptr noundef %3)
  %9 = load i32, ptr %3, align 4
  %10 = call i32 (ptr, ...) @printf(ptr noundef @.str.2, i32 noundef %9)
  %11 = load i32, ptr %3, align 4
  %12 = add i32 %11, 10
  %13 = icmp sgt i32 %11, 0
  %14 = icmp sgt i32 %12, 0
  %15 = icmp eq i1 %13, true
  %16 = icmp ne i1 %13, %14
  %17 = and i1 %15, %16
  br i1 %17, label %tempBB, label %18

18:                                               ; preds = %0
  store i32 %12, ptr %5, align 4
  %19 = load i32, ptr %5, align 4
  %20 = call i32 (ptr, ...) @printf(ptr noundef @.str.2, i32 noundef %19)
  %21 = load ptr, ptr %4, align 8
  %22 = call i32 @fclose(ptr noundef %21)
  ret i32 0

tempBB:                                           ; preds = %0
  ret i32 42
}

The output of the patched program is just 2147483647 with a return value of 42.

Epilogue

Compiler is fun.