Monday, 19 August 2013

Program Execution

OBJECTIVE:
 * Understanding program Execution
 * Causes for performance bottlenecks in sequential and parallel programs.
 * Improving the performance
Reference Books:
 * Hi Performance Computing - O'Reilly series
PROGRAM EXECUTION:
 Lets us see the execution process of a C program.
  C Program
     |
     |
  Compiler
     |   -> assembly lang.
     |
  assembler
     |   -> Object file *
     |
Lib modules  --> Linker
Other Objects    |   -> Object file
     |
  Loader
     |
     |
  Executable file 
 * Object file created from after the assembling may have some 
functions or variables, which are used in this file and defined elsewhere.
GENERIC FORMAT OF OBJECT FILE:
 Typical format of an object file will be as follows. 
 |-----------------------| 
 | Header Info |
 |-----------------------|
 | Machine code |
 |-----------------------|
 |   Initialized Data |
 |-----------------------|
 |     Symbol Table |
 |-----------------------|
 |    Relocation Info. |
 |-----------------------|
 for example consider the following C code.
#inclide<stdio.h>
#include<math.h>
float arr[100];
int size=100;
void main()
{
    int i;
    float sum;
    for(i=0, sum=0.0; i<size; i++)
    {
 arr[i]=sqrt(arr[i]);
 sum+=arr[i];
    }
    printf("sum is %f\n",sum);
}
compile the above program using gcc -c to get the object output.
(we can see the assembly output using -S option)
At the time of compilation, functions `sqrt' and `printf' are not defined 
in the program. These are called undefined symbols.
similarly there are some defined symbols like arr, sum, main., these can 
be used in any other functions in a different module (in a separate .c file).
so there should be some information stored in the object file about the these.

-> here there is a difference between the global variables(arr,size) and the 
local variables(i,sum).
 * Global variables are stored in the data segment whereas the local 
variables are stored in the stack area .
 * Global variables can be accessed by other functions (in this or other 
modules)  whereas local variables are local the function where they are defined.
[Static variables (defined with the "static" keyword in C are also stored in 
data segment. ]

Even in the global variables there are some initialized variables and some 
uninitialized variables.

    /-- Initialized Variables
   Data Segment --| 
    \-- Uninitialized variables

the object file generated from the above C code will be some thing like this.
*(this may vary depends on the processor architecture)

 Offset Contains Comments

H    {  0  94  No. of bytes of machine code
e  I {  4   4  "   "   "    "  Initialized data
a  n {  8 400  "   "   "    "  Uninitialized data
d  f { 12  60  "   "   "    "  Symbol Table
e  o { 16  ??  "   "   "    "  Relocation table
r    {
 
M    {  20 ??  main: first executable statement in main
a c  { ... ...  ...
c o  { 66 ??  call sqrt
h d  { ... ...  ...
i e  { 102 ??  call printf
n    {
e    {

********************************************************

Continuing from the Above lecture, remaining 3 sections (namely initialized 
data, symbol table, relocation information) of one of the possible generic 
formats of .obj file looks like the following.
(Ref: the C program discussed in 1st lecture)
------------------------------------------------------------------------------
Offset Contents Symbol table Comments
------------------------------------------------------------------------------
(Initialized Data)
114 100  (N. A.)  value "size"; this is the ONLY entry 
      for the above C program, as array arr
     is not initialized, even though it is 
     also global 
(Symbol Table)
118 ??  size  name of symbol "size" and its location
     (within initialized data sec.)
     is to be stored 
130 ??  arr  name of symbol "arr" and the starting 
            address of arr
142 ??  main()  name of symbol "main" and its location
     (offset in code segment) for OS to 
     start with main
154 ??  sqrt()  name of symbol "sqrt", no addr. info. 
     function used here but defined in math.h
166 ??  printf() name of symbol "printf", no addr. info.
     function used here, defined in stdio.h 
(Relocation Information)
178 ??    stores the info about the offsets at 
     which external vars are called

Initialized Global variables are allocated in initialized data section and 
uninitialized global vars are indirectly referred to by recording their sizes 
in the header section.
We know that the sequence for getting executable code is as follows  
   source code
      |
   compiler and assembler
      |
   object file
      |
 library------> linker
      |
   executable   
Generic format of .out file (on disk) is as follows
  ______________________________
  | HEADER       |
  |____________________________|
  | MACHINE CODE SEGMENT | 
  |____________________________|
  | INITIALIZED DATA    |
  |____________________________|

Process may be described as a program under under execution.
While under execution the generic format of (logical) address 
space of the program is as follows

  |----------------------------------|
  | MACHINE CODE      | read-only
  |       | data
  |----------------------------------|
  | INITIALIZED DATA SECTION   | 
  |       | read
  |----------------------------------| write 
  | UNINITIALIZED DATA SECTION | data
  |       |
  |----------------------------------|  
  | HEAP (for dyn mem alloc)   | grows downwards
  |----------------------------------|
  | STACK (local vars and fn.  | grows upwards
  |  call/parameter passing    | 
  |----------------------------------|

Dynamically allocated memory (created using malloc or new statements) 
is from Heap.
Different object modules are combined in the linking stages which 
primarily involves 2 steps - relocation & linking.

The first thing to do is relocation, which is combining two different object modules together by creating a new load module and writing the machine codes of object modules into it one after the other. Only the first module put into load module has its offsets unchanged but all other machine code sections of remaining modules have their offsets modified according to the amount of code input previously. This is called as relocation. As a consequence of relocation, the location info. (address) of the symbols would have changed. Next thing to be done is modification of the addresses of the external variables in the locations where they are called/referred from. When modules are separate, these entries were initialized to zero as compiler had no information about the whereabouts of these external vars. The external references may be functions also. Linking is a kind of patch-up work to be done after relocation. Note that the generic format obtained after linking and loading is the same as that discussed at the start. Now let us look at the use of stack for storing of local vars during function calls. | | | | | | | | |-------------------------------|<--- sp | local vars of func2 | |-------------------------------| | prev. FP contents |<--- fp |-------------------------------| | ret address in func1 | |-------------------------------| | parameters to func2 | |-------------------------------|<--- sp | local vars of func1 | |-------------------------------| | prev. FP contents |<--- fp |-------------------------------| | ret addr in main | |-------------------------------| | parameters to func1 | |-------------------------------| | local vars of main | |-------------------------------|<--- fp As shown above, the first call is from main to func1. The Local vars in main are in the stack, and the the parameters to be passed to func1 are pushed onto them. The return address in main is pushed above that. The first instrn. in the function func1 typically pushes the previous fp (frame pointer) value on the stack, and changes the fp to point to this same (pushed) location. Local variables for function func1 are allocated on top of the stack and the stack pointer is moved (up). Local variables within the function are accesses using the fp and an appropriate offset. When func1 calls func2, parameters, and return address are pushed in stack as shown in the figure. When returning from function func2, the last instrn. typically restores the sp(stack pointer) to a location below the current fp, and the fp to the contents of the memory location pointed by it (this corresponds to the fp value of the caller (func1)). On returning from func2, in func1, the sp has to incremented (brought down in the figure) to account the popping of pushed parameters. Thus the sp(stack pointer) and fp(frame pointer) respectively do the functions of popping the stack and guarding sp from going out of context due to excessive popping.

Assembly Language

  -----------------------------
  -----------------------------


A look at the SPARC assembly code of a sample C program:
-------------------------------------------------------


Consider the following C program:

#include<stdio.h>
#include<math.h>

double a[100];

main()
{
  int i;
  double sum;

  for(i=0, sum=0.0; i<100; i++)
  {
       a[i] = sqrt(a[i]);
       sum += a[i];
  } 
  printf("sum = %4.2f\n", sum);
}


The optimized assembly code of this program on a SPARC machine (obtained
using the CC compiler with optimization flag -O) is given below. 
A SPARC instruction has the format
         add g1, g2, g3
where g1 and g2 are source registers and g3 is the destination 
register.
The code can be understood by noting the following features about 
SPARC architecture and instruction set. 

Register Window:
----------------
SPARC is a RISC machine. An important feature of SPARC is register windowing.
When a program is running, it has access to 32 32-bit processor registers which 
include eight global registers plus 24 registers that belong to the current
register window. The first 8 registers in the window are called the `IN' registers (i0-i7). 
When a function is called, these registers may contain arguments that
can be used. The next 8 are local registers which are scratch registers that can
be used for anything while the function executes. The last 8 registers are the 
`OUT' registers (o0-o7)which the function uses to pass arguments to functions 
that it calls.

                                                                           
                                                                           
         R0  +-------+                 R0  +-------+
             |       |                     |       |
             | global|                     | global|
             |  reg. |                     | reg.  |
             |       |                     |       |
         R7  +-------+                 R7  +-------+
                                                                           
             +-------+                     +-------+
             |       |                     |       |
             |       |                     |       |
             |       |                     |       |
             |       |                     +-------+ <--- CWP
             |       |                 R8  |       |
             |       |                     |  out  |
             |       |                     |  reg. |
             |       |                     |       |
             |       |                R15  +-------+
             |       |                R16  |       |
             |       |                     | local |
             |       |                     |  reg. |
             |       |                     |       |
          R8 +-------+ <--- CWP       R23  +-------+
             |   o0  |                R24  |  i0   |
             |   ::  |                     |  ::   |
             |   o5  |                     |  i5   |
             |   sp  |                     |  fp   |
             |  temp |                R31  |  lr   |    (link reg)
        R15  +-------+                     +-------+
        R16  |       |                     |       |
             | local |                     |       |
             |  reg. |                     |       |
             |       |                     |       |
        R23  +-------+                     +-------+
        R24  |       |                     |       |
             |   in  |                     |       |
             |  reg. |                     |       |
             |       |                     |       |
        R31  +-------+                     +-------+
             |       |                     |       |

When one function calls another, the callee can choose to execute a SAVE 
instruction. This instruction decrements current workspace pointer (CWP), 
a pointer to current register window, shifting the window downward. 
The caller's OUT registers then become the callee's IN registers, and 
the callee gets a new set of local and OUT
registers for its own use. Only the pointer changes because the registers 
and return address do not need to be stored on a stack.
Similarly the RESTORE instruction (executed upon a return from function)
restores the register pointer to the caller's register window. 

A few special registers in the SPARC register window are:
 o6 is sp  in caller / i6 is fp  in callee;
 07 in caller/i7 in callee is return addr. register 
                           (a.k.a. link register) 

SPARC's delayed branches:
------------------------

SPARC handles branching in a very interesting way. Delayed branch means that 
the instruction following a branch instruction is executed (irrespective of 
whether the branch is taken or not)  while the processor prepares to transfer 
control to the destination.  Delayed branching  is implemented in order to 
avoid "pipeline stalls" (wasted cycles) due to branching. (See discussion
on Control Hazards in Pipelining to come in a future class/lecture notes)

-------------------------------------------------------------------------------

The program is organized into the 'text' and 'data' sections corresponding to
the code and global data of the program. The data section in turn consists of 
the initialized global data and the bss section for uninitialized global data.

The sethi instruction:
---------------------
This instruction moves the most significant 22 bits of a constant into a 
specified register, the low order 10 bits of the register are cleared (set 
to 0). Since the length of an instruction in SPARC is 32-bits, a 32-bit
address cannot be directly moved with a single instruction. The approach used in
SPARC is to move a 32-bit address using two instructions, one which moves the
first 22 bits using the sethi instruction to a register, and then adding the 
least significant 10 bits to that register.  

The ld instruction: 
------------------
This instruction loads a word from memory to a specified register. The first 
operand specifies a memory address and the second operand (destination) 
specifies the register. The ldd(load double) instruction moves 
two consecutive words into the register pair denoted by the second operand. 

The store instruction:
---------------------
This instruction stores a word into a memory location. The first operand is a 
register containing the word, and the second operand specifies a memory location.
The std instruction stores a double word. 

-----------------------------------------------------------------------------

.section ".text", #alloc, #execinstr
.file  "t1.c"

.section ".data1", #alloc, #write
.align 4

.L147:
.ascii "Sum = %4.2f\n\000"              ;; format string used in printf

.section ".bss", #alloc, #write
.common a,800,8                         ;; alloc. for array a (100 x 8)

.section ".text", #alloc, #execinstr

/* 00000       0 */   .align 8
!
!   CONSTANT POOL
!
  .L_const_seg_900000101:
/* 000000 0 */   .word 0, 0
/* 0x0008 0 */   .align 4

!
!  SUBROUTINE main
!
! OFFSET SOURCE LINE LABEL INSTRUCTION

                   .global main
                 main:

/* 000000      */        save %sp ,-112, %sp  ;; alloc. in stack for function 
           ;;  frame -- for storing local 
           ;;  vars. functions.

! FILE t1.c

! 1 !#include<stdio.h>
! 2 !#include<math.h>
! 4 !double a[100];
! 6 !main()
! 7 !{
! 8 !    int i;
! 9 !    double sum;
!       11 !    for(i=0, sum=0.0; i < 100; i++)

/* 0x0004   */             sethi          %hi (.L_const_seg_90000101), %g2
/* 0x0008   */             or             %g0, 0, %i1
/* 0x000c   */             ldd            [%g2+%lo (.L_const_seg_900000101)], %f30
/* 0x0010   */             sethi          %hi(a), %g2


!  12 ! {
! 13 !     a[i] = sqrt(a[i]);


/* 0x0014   13  */         ld             [%g2+%lo(a)], %o0   !volatile
/* 0x0018   11  */         add            %g2, %lo(a), %i0

 .L(00000110:
/* 0x001c   13  */     ld             [%i0+4], %o1  !volatile
!     !         sum += a[i];

/* 0x0020   14  */         add            %i1, 1, %i1
/* 0x0024   13  */         call           sqrt           ;; params = %o0 %o1 
                                                         ;; Result = %f0


/* 0x0028       */         std            %f30, [%sp + 96]      
/* 0x002c       */    st             %f0, [%i0]     ;; volatile
/* 0x0030   14  */         cmp            %i1, 100
/* 0x0034   13  */         st             %f1, [%i0 + 4] ;; volatile
/* 0x0038   14  */         ldd            [%sp + 96], %f30
/* 0x003c       */         ld             [%i0], %f6     ;; volatile
/* 0x0040       */         ld             [%i0+4], %f7   ;; volatile
/* 0x0044       */         add            %i0, 8, %i0
/* 0x0048       */         faddd          %f30, %f6, %f30
/* 0x004c       */         bl, a          .L900000110
/* 0x0050       */         ld             [%i0], %o0     ;; volatile

        .L77000005:

! 15       ! }
!       16       ! printf("Sum: %4.2f\n", sum);

/* 0x0054   16  */         st             %f30, [%sp + 108]
/* 0x0058       */         sethi          %hi(.L147), %g2
/* 0x005c       */         st             %f31, [%sp + 104]
/* 0x0060       */         add            %g2, %lo(.L147), %i0
/* 0x0064       */         ld             [%sp + 108], %i1
/* 0x0068       */         ld             [%sp + 104], %i2
/* 0x006c       */         call           printf       ;; params = %i0 %i1 %i2 
                                                       ;; Result = ! (tail call)

/* 0x0070       */         restore        %g0, %g0, %g0
/* 0x0074    0  */         .type          main, 2
/* 0x0074       */         .size          main, (.-main)
/* 0x0074    0  */         .global        _fsr_init_value
/* 0x0074       */         _fsr_init_value =  0
 

Introduction to Automata Theory, Languages, and Computation

The Anatomy of an HLA Program