High Performance FillChar Method Comparison for 32 bit (every four bytes)



In C/C++, if you want to clear an array by filling a byte-value, you can use the following:

1
void * __cdecl memset(void *_Dst, int Val, size_t _Size)
void * __cdecl memset(void *_Dst, int Val, size_t _Size)

This allows you to set every single byte given a memory location and the size to fill. However, if you want to fill the continuous memory locations every four byte (given a 32-bit value), you may have to write a short function like the following C/C++ code.

1
2
3
4
5
6
inline
void FillInteger1(int *arr, int sz, int value) {
    for (int i = 0; i < sz; i ++) {
        arr[i] = value;
    }
}
inline
void FillInteger1(int *arr, int sz, int value) {
    for (int i = 0; i < sz; i ++) {
        arr[i] = value;
    }
}

Let’s write other implementations using Win32 Inline Assembly before a recommendation is drawn.

The second version, we can use the assembly instruction rep stosd which is essentially the same as rep stos dword ptr [edi] that will store the content of register eax to where edi points to, for ecx times. The edi can be incremented or decremented depending on the direction flag. So the following clears the direction flag using cld before the instruction rep stosd

1
2
3
4
5
6
7
8
9
10
11
12
inline
void FillInteger2(int *arr, int sz, int value) {
    __asm  {
        push edi // points to the destination
        mov edi, arr
        mov eax, value
        mov ecx, sz
        cld // clear the direction flag
        rep stosd
        pop edi
   }
}
inline
void FillInteger2(int *arr, int sz, int value) {
    __asm  {
        push edi // points to the destination
        mov edi, arr
        mov eax, value
        mov ecx, sz
        cld // clear the direction flag
        rep stosd
        pop edi
   }
}

Please be noted that the x86 register edi must not be altered (they must at least be restored on the exit of function), therefore, a push and pop is used to save the state.

Assembly language gives you many alternatives because of so many different instructions (at least on x86), which lead to the same purpose. The following assembly code does the same thing.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
inline
void FillInteger3(int *arr, int sz, int value) {
    __asm {
        mov eax, arr
        mov edx, sz
        shl edx, 2  // multiply by four
        mov ecx, value
        add eax, edx
        neg edx
        jns D
    L:
        mov [eax + edx], ecx // copy
        add edx, 4  // increment by four
        js L
    D:
    }
}
inline
void FillInteger3(int *arr, int sz, int value) {
	__asm {
		mov eax, arr
		mov edx, sz
		shl edx, 2  // multiply by four
		mov ecx, value
		add eax, edx
		neg edx
		jns D
    L:
		mov [eax + edx], ecx // copy
		add edx, 4  // increment by four
		js L
    D:
	}
}

It multiplies the counter by four (so makes the number of bytes), which is essentially like the C/C++ code (but slower compared to assembly):

1
2
3
4
5
6
7
8
9
inline
void FillInteger3c(int *arr, int sz, int value) {
    sz *= sizeof(int);
    char *p = (char*)arr + sz;
    sz = -sz;
    for (; sz < 0; sz += 4) {
        *((int*)(p + sz)) = value;  // convert to int* then assign value
    }  
}
inline
void FillInteger3c(int *arr, int sz, int value) {
	sz *= sizeof(int);
	char *p = (char*)arr + sz;
	sz = -sz;
	for (; sz < 0; sz += 4) {
		*((int*)(p + sz)) = value;  // convert to int* then assign value
	}  
}

The fourth implementation relies on the x86 assembly instruction loop, which seems a bit slow (more CPU cycles). The loop everytime decrements the register ECX and check for zero, if it is zero, the loop ended otherwise, jump to a specific memory location.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
inline
void FillInteger4(int *arr, int sz, int value) {
    __asm {
        push ebx
        mov ebx, value
        mov eax, arr
        mov ecx, sz
        xor edx, edx
    L:
        mov [eax + edx], ebx
        add edx, 4
        loop L
        pop ebx
    }
}
inline
void FillInteger4(int *arr, int sz, int value) {
	__asm {
		push ebx
		mov ebx, value
		mov eax, arr
		mov ecx, sz
		xor edx, edx
	L:
		mov [eax + edx], ebx
		add edx, 4
		loop L
		pop ebx
	}
}

Performance Comparison – DEBUG – Visual Studion 2010

Let’s compare the performances of the above five approaches… The timing procedure looks like this:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
#include <time.h>
    const int SZ = 5000;
    const int SZ2 = 2000000;
    int *arr = new int[SZ];
 
    clock_t start;   
    clock_t finish;  
    double interval;
 
    start = clock();
    for (int i = 0; i <= SZ2; i ++) {
        FillInteger1(arr, SZ, i);
    }
 
    finish = clock() - start;
    interval = finish / (double)CLOCKS_PER_SEC;
    printf("FillInteger1 = %f seconds elapsed\n", interval);
 
    // the rest is similar for other approaches...
#include <time.h>
	const int SZ = 5000;
	const int SZ2 = 2000000;
	int *arr = new int[SZ];

	clock_t start;   
	clock_t finish;  
	double interval;

	start = clock();
	for (int i = 0; i <= SZ2; i ++) {
		FillInteger1(arr, SZ, i);
	}

	finish = clock() - start;
	interval = finish / (double)CLOCKS_PER_SEC;
	printf("FillInteger1 = %f seconds elapsed\n", interval);

	// the rest is similar for other approaches...

Under Visual Studio 2010, C++, DEBUG (Compiler Option), we have the following:

debug High Performance FillChar Method Comparison for 32 bit (every four bytes) array assembly C/C++ implementation integer

The naive first solution and the C/C++ version of the third version are slowest among all. The Fourth assembly looks inefficient. The second version (based on rep stosd) and the third assembly version performs quite well under DEBUG.

It is noted that all functions are inlined so no function calls are required, which is faster.

Let’s look at the code generated for first version.

debug-vc-filldword High Performance FillChar Method Comparison for 32 bit (every four bytes) array assembly C/C++ implementation integer

It can be observed that the compiler may just translate each C/C++ statement without a decent optimisation. For, the assembly versions, the compiler just put the code as they are without any modifications. The compiler the code you give is optimised, e.g., the following is the translation by Visual Studio for the second assembly version.

debug-vc-filldword2 High Performance FillChar Method Comparison for 32 bit (every four bytes) array assembly C/C++ implementation integer

Performance Comparison – RELEASE – Visual Studion 2010

Let’s take a comparison when the compiler is set to RELEASE mode.

release High Performance FillChar Method Comparison for 32 bit (every four bytes) array assembly C/C++ implementation integer

Did you predict this result? The naive C/C++ version works quite well, which indicates that the modern C/C++ compiler is good enough in most cases.. The performance of the assembly versions do not vary a lot because they were just simply ‘translated’ by the compiler without much changes on the generated code.

The C/C++ version is pretty optimised by the compiler:

release-vc-filldword High Performance FillChar Method Comparison for 32 bit (every four bytes) array assembly C/C++ implementation integer

and the assembly versions do not change:

release-vc-filldword2 High Performance FillChar Method Comparison for 32 bit (every four bytes) array assembly C/C++ implementation integer

Conclusion

A few implementations are presented on how to fill an array every 4 bytes. The performance comparisons are given. We can see that under DEBUG mode, the compiler does not optimise the code but under RELEASE, they are doing quite well.

Do you have other faster/slower implementations? Share it please so let’s push this to its limit, something like FastCode challenge.

–EOF (Coding For Speed) —

GD Star Rating
loading...
1130 words Last Post: Counting the number of Leading Zeros for a 32-bit Integer (Signed or Unsigned)
Next Post: Fast Power of Two Equal or Larger than a 32-bit Integer

The Permanent URL is: High Performance FillChar Method Comparison for 32 bit (every four bytes) (AMP Version)

Leave a Reply