Saturday 10 August 2013

Anatomy of a Program in Memory

Memory management is the heart of operating systems; it is crucial for both programming and system administration. In the next few posts I’ll cover memory with an eye towards practical aspects, but without shying away from internals. While the concepts are generic, examples are mostly from Linux and Windows on 32-bit x86. This first post describes how programs are laid out in memory.
Each process in a multi-tasking OS runs in its own memory sandbox. This sandbox is the virtual address space, which in 32-bit mode is always a 4GB block of memory addresses. These virtual addresses are mapped to physical memory by page tables, which are maintained by the operating system kernel and consulted by the processor. Each process has its own set of page tables, but there is a catch. Once virtual addresses are enabled, they apply to all software running in the machine, including the kernel itself. Thus a portion of the virtual address space must be reserved to the kernel:
Kernel/User Memory Split
This does not mean the kernel uses that much physical memory, only that it has that portion of address space available to map whatever physical memory it wishes. Kernel space is flagged in the page tables as exclusive to privileged code (ring 2 or lower), hence a page fault is triggered if user-mode programs try to touch it. In Linux, kernel space is constantly present and maps the same physical memory in all processes. Kernel code and data are always addressable, ready to handle interrupts or system calls at any time. By contrast, the mapping for the user-mode portion of the address space changes whenever a process switch happens:
Process Switch Effects on Virtual Memory
Blue regions represent virtual addresses that are mapped to physical memory, whereas white regions are unmapped. In the example above, Firefox has used far more of its virtual address space due to its legendary memory hunger. The distinct bands in the address space correspond to memory segments like the heap, stack, and so on. Keep in mind these segments are simply a range of memory addresses and have nothing to do with Intel-style segments. Anyway, here is the standard segment layout in a Linux process:
Flexible Process Address Space Layout In Linux
When computing was happy and safe and cuddly, the starting virtual addresses for the segments shown above were exactly the same for nearly every process in a machine. This made it easy to exploit security vulnerabilities remotely. An exploit often needs to reference absolute memory locations: an address on the stack, the address for a library function, etc. Remote attackers must choose this location blindly, counting on the fact that address spaces are all the same. When they are, people get pwned. Thus address space randomization has become popular. Linux randomizes the stack, memory mapping segment, and heap by adding offsets to their starting addresses. Unfortunately the 32-bit address space is pretty tight, leaving little room for randomization and hampering its effectiveness.
The topmost segment in the process address space is the stack, which stores local variables and function parameters in most programming languages. Calling a method or function pushes a new stack frame onto the stack. The stack frame is destroyed when the function returns. This simple design, possible because the data obeys strict LIFO order, means that no complex data structure is needed to track stack contents – a simple pointer to the top of the stack will do. Pushing and popping are thus very fast and deterministic. Also, the constant reuse of stack regions tends to keep active stack memory in the cpu caches, speeding up access. Each thread in a process gets its own stack.
It is possible to exhaust the area mapping the stack by pushing more data than it can fit. This triggers a page fault that is handled in Linux by expand_stack(), which in turn calls acct_stack_growth() to check whether it’s appropriate to grow the stack. If the stack size is below RLIMIT_STACK (usually 8MB), then normally the stack grows and the program continues merrily, unaware of what just happened. This is the normal mechanism whereby stack size adjusts to demand. However, if the maximum stack size has been reached, we have a stack overflow and the program receives a Segmentation Fault. While the mapped stack area expands to meet demand, it does not shrink back when the stack gets smaller. Like the federal budget, it only expands.
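You can trigger this limit on purpose from a managed language as well. Below is a throwaway sketch in Java (not part of the original article); the JVM sizes each thread's stack with the -Xss option and reports exhaustion as a StackOverflowError rather than delivering a segmentation fault to your code:

public class StackDemo {
    static long depth = 0;

    // Each call pushes another frame onto the thread's stack; eventually
    // the stack area cannot grow any further and the JVM signals overflow.
    static void recurse() {
        depth++;
        recurse();
    }

    public static void main(String[] args) {
        try {
            recurse();
        } catch (StackOverflowError e) {
            System.out.println("stack exhausted after " + depth + " frames");
        }
    }
}
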
Dynamic stack growth is the only situation in which access to an unmapped memory region, shown in white above, might be valid. Any other access to unmapped memory triggers a page fault that results in a Segmentation Fault. Some mapped areas are read-only, hence write attempts to these areas also lead to segfaults.
Below the stack, we have the memory mapping segment. Here the kernel maps contents of files directly to memory. Any application can ask for such a mapping via the Linux mmap() system call (implementation) or CreateFileMapping() / MapViewOfFile() in Windows. Memory mapping is a convenient and high-performance way to do file I/O, so it is used for loading dynamic libraries. It is also possible to create an anonymous memory mapping that does not correspond to any files, being used instead for program data. In Linux, if you request a large block of memory via malloc(), the C library will create such an anonymous mapping instead of using heap memory. ‘Large’ means larger than MMAP_THRESHOLD bytes, 128 kB by default and adjustable via mallopt().
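Higher-level runtimes expose the same mechanism. As a rough illustration (not from the original article), Java's FileChannel.map() returns a MappedByteBuffer that is backed by an mmap() of the file on Linux; the file name example.dat below is just a placeholder for any existing file:

import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MapDemo {
    public static void main(String[] args) throws Exception {
        // Map an existing file read-only into this process's address space.
        // The mapping appears as a new area in the memory mapping segment.
        try (RandomAccessFile file = new RandomAccessFile("example.dat", "r");
             FileChannel channel = file.getChannel()) {
            MappedByteBuffer buf = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
            System.out.println("mapped " + buf.capacity() + " bytes");
        }
    }
}
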
Speaking of the heap, it comes next in our plunge into address space. Like the stack, the heap provides runtime memory allocation; unlike the stack, it is meant for data that must outlive the function doing the allocation. Most languages provide heap management to programs. Satisfying memory requests is thus a joint affair between the language runtime and the kernel. In C, the interface to heap allocation is malloc() and friends, whereas in a garbage-collected language like C# the interface is the new keyword.
If there is enough space in the heap to satisfy a memory request, it can be handled by the language runtime without kernel involvement. Otherwise the heap is enlarged via the brk() system call (implementation) to make room for the requested block. Heap management is complex, requiring sophisticated algorithms that strive for speed and efficient memory usage in the face of our programs’ chaotic allocation patterns. The time needed to service a heap request can vary substantially. Real-time systems have special-purpose allocators to deal with this problem. Heaps also become fragmented, shown below:
Fragmented Heap
Finally, we get to the lowest segments of memory: BSS, data, and program text. Both BSS and data store contents for static (global) variables in C. The difference is that BSS stores the contents of uninitialized static variables, whose values are not set by the programmer in source code. The BSS memory area is anonymous: it does not map any file. If you say static int cntActiveUsers, the contents of cntActiveUsers live in the BSS.
The data segment, on the other hand, holds the contents for static variables initialized in source code. This memory area is not anonymous. It maps the part of the program’s binary image that contains the initial static values given in source code. So if you say static int cntWorkerBees = 10, the contents of cntWorkerBees live in the data segment and start out as 10. Even though the data segment maps a file, it is a private memory mapping, which means that updates to memory are not reflected in the underlying file. This must be the case, otherwise assignments to global variables would change your on-disk binary image. Inconceivable!
The data example in the diagram is trickier because it uses a pointer. In that case, the contents of pointer gonzo – a 4-byte memory address – live in the data segment. The actual string it points to does not, however. The string lives in the text segment, which is read-only and stores all of your code in addition to tidbits like string literals. The text segment also maps your binary file in memory, but writes to this area earn your program a Segmentation Fault. This helps prevent pointer bugs, though not as effectively as avoiding C in the first place. Here’s a diagram showing these segments and our example variables:
ELF Binary Image Mapped Into Memory
You can examine the memory areas in a Linux process by reading the file /proc/pid_of_process/maps. Keep in mind that a segment may contain many areas. For example, each memory mapped file normally has its own area in the mmap segment, and dynamic libraries have extra areas similar to BSS and data. The next post will clarify what ‘area’ really means. Also, sometimes people say “data segment” meaning all of data + bss + heap.
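As a quick way to see this, a process can dump its own areas (a minimal sketch, assuming a Linux system; any language that can read a file will do, Java shown here to match the later examples):

import java.nio.file.Files;
import java.nio.file.Paths;

public class ShowMaps {
    public static void main(String[] args) throws Exception {
        // Each line of /proc/self/maps describes one memory area:
        // address range, permissions, offset, device, inode, and pathname.
        for (String line : Files.readAllLines(Paths.get("/proc/self/maps"))) {
            System.out.println(line);
        }
    }
}
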
You can examine binary images using the nm and objdump commands to display symbols, their addresses, segments, and so on. Finally, the virtual address layout described above is the “flexible” layout in Linux, which has been the default for a few years. It assumes that we have a value for RLIMIT_STACK. When that’s not the case, Linux reverts back to the “classic” layout shown below:
Classic Process Address Space Layout In Linux
That’s it for virtual address space layout. The next post discusses how the kernel keeps track of these memory areas. Coming up we’ll look at memory mapping, how file reading and writing ties into all this and what memory usage figures mean.

Mutex vs. Semaphore, what is the difference?



The Toilet Example  (c) Copyright 2005, Niclas Winquist ;)

Mutex:

Is a key to a toilet. One person can have the key - occupy the toilet - at a time. When finished, the person gives (frees) the key to the next person in the queue.

Officially: "Mutexes are typically used to serialise access to a section of  re-entrant code that cannot be executed concurrently by more than one thread. A mutex object only allows one thread into a controlled section, forcing other threads which attempt to gain access to that section to wait until the first thread has exited from that section."
Ref: Symbian Developer Library

(A mutex is really a semaphore with value 1.)
Semaphore:
Is the number of free identical toilet keys. For example, say we have four toilets with identical locks and keys. The semaphore count - the count of keys - is set to 4 at the beginning (all four toilets are free), and the count is decremented as people come in. If all toilets are full, i.e. there are no free keys left, the semaphore count is 0. Now, when e.g. one person leaves the toilet, the semaphore is increased to 1 (one free key), and the key is given to the next person in the queue.

Officially: "A semaphore restricts the number of simultaneous users of a shared resource up to a maximum number. Threads can request access to the resource (decrementing the semaphore), and can signal that they have finished using the resource (incrementing the semaphore)."
Ref: Symbian Developer Library
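The toilet example maps directly onto a counting semaphore in code. A minimal sketch in Java using java.util.concurrent.Semaphore (a Semaphore with one permit behaves like the mutex above):

import java.util.concurrent.Semaphore;

public class Toilets {
    // Four identical toilets: a counting semaphore with 4 permits (keys).
    private static final Semaphore keys = new Semaphore(4);

    static void visit(String person) throws InterruptedException {
        keys.acquire();            // take a key; blocks while all 4 are in use
        try {
            System.out.println(person + " is inside");
            Thread.sleep(100);     // ... occupy the toilet ...
        } finally {
            keys.release();        // return the key for the next person in the queue
        }
    }

    public static void main(String[] args) {
        for (int i = 0; i < 10; i++) {
            final String name = "person-" + i;
            new Thread(() -> {
                try { visit(name); } catch (InterruptedException ignored) { }
            }).start();
        }
    }
}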

Priority Queues, Insertion Sort, Selection Sort


  • A Priority Queue ranks its elements by key with a total order relation.

  • Keys:

    • Every element has its own key.
    • Keys are not necessarily unique.
    • A key is often a numerical value but the concept is general, i.e., not limited to numerics.

  • Total Order Relation

    • Denoted by ≤.
    • Reflexive: k ≤ k.
    • Antisymmetric: if k1 ≤ k2 and k2 ≤ k1 then k1 = k2.
    • Transitive: if k1 ≤ k2 and k2 ≤ k3 then k1 ≤ k3.

  • A Priority Queue supports these fundamental methods:

    • insertItem(Object k, Object e)
    • Object removeMinElement()

Sorting with a Priority Queue

  • A Priority Queue P can be used for sorting by inserting a Sequence S of n elements and calling removeMinElement until P is empty.

Algorithm PriorityQueueSort(S,P):
Input: A sequence S storing n keys, on which a total order relation is defined, and a Priority Queue P that compares keys with the same relation.
Output: The sequence S sorted by the total order relation.
    while !S.isEmpty() do
        e ← S.removeFirst()
        P.insertItem(e,e)
    while !P.isEmpty() do
        e ← P.removeMinElement()
        S.insertLast(e)

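As a concrete illustration, here is a minimal Java sketch of PriorityQueueSort that substitutes java.util.PriorityQueue and ArrayDeque for the course's Priority Queue and Sequence ADTs:

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;
import java.util.PriorityQueue;

public class PriorityQueueSortDemo {
    // Phase 1: move every key from S into P. Phase 2: remove the minimum
    // repeatedly and append it to S, which leaves S sorted.
    static void priorityQueueSort(Deque<Integer> s) {
        PriorityQueue<Integer> p = new PriorityQueue<>();
        while (!s.isEmpty()) p.add(s.removeFirst());
        while (!p.isEmpty()) s.addLast(p.poll());
    }

    public static void main(String[] args) {
        Deque<Integer> s = new ArrayDeque<>(List.of(7, 4, 8, 2, 5, 3, 9));
        priorityQueueSort(s);
        System.out.println(s);   // [2, 3, 4, 5, 7, 8, 9]
    }
}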

The Priority Queue ADT

  • A Priority Queue ADT must support the following methods:

    • size():
        • Return the number of elements in P. Input: none; Output: integer

    • isEmpty():
        • Test whether P is empty. Input: none; Output: boolean
    (size() and isEmpty() are the two methods inherited from the Container interface.)

    • insertItem(k,e):
        • Insert a new element e with key k into P. Input: objects k, e; Output: none

    • minElement():
        • Return but don’t remove an element of P with the smallest key; an error occurs if P is empty. Input: none; Output: Object e





The Priority Queue ADT (cont.)

    • minKey():
        • Return the smallest key in P; an error occurs if P is empty. Input: none; Output: Object k

    • removeMinElement():
        • Remove from P and return an element with the smallest key; an error condition occurs if P is empty. Input: none; Output: Object e
 Interface SimplePriorityQueue

Class Item
Item.java

public class Item {
    private Object key, elem;
    public Item (Object k, Object e) {
        key = k;
        elem = e;
    }
    public Object key() { return key; }
    public Object element() { return elem; }
    public void setKey(Object k) { key = k; }
    public void setElement(Object e) { elem = e; }
}

Comparators

  • The most general and reusable form of a priority queue makes use of comparator objects. We don’t want to have to implement a different priority queue for every key type and every way of comparing them.

  • Comparator objects are external to the keys that are to be compared and compare two objects. If we required keys to compare themselves, ambiguities could result. For example, numerical vs. lexicographical comparison.

  • When the priority queue needs to compare two keys, it uses the comparator it is given to do the comparison.
  • Thus any priority queue can be general enough to store any object.

  • The Comparator ADT includes the following methods, all of which return boolean values:
    • isLessThan(Object a, Object b)
    • isLessThanOrEqualTo(Object a, Object b)
    • isEqualTo(Object a, Object b)
    • isGreaterThan(Object a, Object b)
    • isGreaterThanOrEqualTo(Object a, Object b)
    • isComparable(Object a)
 Interface Comparator
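
For instance, a comparator for Integer keys might look like the sketch below (assuming the course's Comparator interface declares exactly the methods listed above):

public class IntegerComparator implements Comparator {
    private int val(Object o) { return ((Integer) o).intValue(); }

    public boolean isLessThan(Object a, Object b)             { return val(a) <  val(b); }
    public boolean isLessThanOrEqualTo(Object a, Object b)    { return val(a) <= val(b); }
    public boolean isEqualTo(Object a, Object b)              { return val(a) == val(b); }
    public boolean isGreaterThan(Object a, Object b)          { return val(a) >  val(b); }
    public boolean isGreaterThanOrEqualTo(Object a, Object b) { return val(a) >= val(b); }
    public boolean isComparable(Object a)                     { return a instanceof Integer; }
}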


Implementation with an Unsorted Sequence

  • Let’s try to implement a priority queue with an unsorted sequence S.

  • The element-items of S are a composition of two objects, the key k and the element e.

  • We can implement insertItem() by using insertFirst() of the Sequence ADT. This would take O(1) time.


  • However, because we always insertFirst() regardless of the key value, our sequence is unordered.



Implementation with an Unsorted Sequence (cont.)

  • Thus, for methods such as minElement(), minKey(), and removeMinElement(), we need to look at all elements of S. The worst-case time complexity for these methods is Θ(n).


Implementation with a Sorted Sequence

  • Another implementation uses a sequence S that is sorted by keys such that the first element of S has the smallest key.

  • We can implement minElement(), minKey(), and removeMinElement() by accessing the first element of S. Thus, these methods are O(1), assuming our sequence has O(1) front-removal.


  • However, these advantages come at a price. To implement insertItem(), we might now have to scan through the entire sequence. Thus insertItem() is O(n).



Cost of Operations
Sequence-Based Priority Queue



Method                                    Unsorted Sequence    Sorted Sequence
size, isEmpty                             O(1)                 O(1)
insertItem                                O(1)                 O(n)
minElement, minKey, removeMinElement      Θ(n)                 O(1)

Class SequenceSimplePriorityQueue

  • A sorted-sequence implementation of the SimplePriorityQueue ADT.

    public class SequenceSimplePriorityQueue implements SimplePriorityQueue {
    // Implementation of a priority queue using a sorted sequence
        protected Sequence seq = new NodeSequence();
        protected Comparator comp;
        // auxiliary methods
        protected Object extractKey (Position pos) {
            return ((Item)pos.element()).key();
        }
        protected Object extractElem (Position pos) {
            return ((Item)pos.element()).element();
        }
        protected Object extractElem (Item kep) {
            return kep.element();
        }




Class SequenceSimplePriorityQueue (cont.)

    // methods of the SimplePriorityQueue ADT
    public SequenceSimplePriorityQueue (Comparator c) {
        this.comp = c;
    }
    public int size () { return seq.size(); }
    public boolean isEmpty () { return seq.isEmpty(); }
    public void insertItem (Object k, Object e) throws InvalidKeyException {
        if (!comp.isComparable(k))
            throw new InvalidKeyException("The key is not valid");
        else
            if (seq.isEmpty())
                seq.insertFirst(new Item(k,e));
            else
                if (comp.isGreaterThan(k,extractKey(seq.last())))
                    seq.insertAfter(seq.last(),new Item(k,e));
                else {
                    Position curr = seq.first();
                    while (comp.isGreaterThan(k,extractKey(curr)))
                        curr = seq.after(curr);
                    seq.insertBefore(curr,new Item(k,e));
                }
    }


Class SequenceSimplePriorityQueue (cont.)

    public Object minElement () throws EmptyContainerException {
        if (seq.isEmpty())
            throw new EmptyContainerException("The priority queue is empty");
        else
            return extractElem(seq.first());
    }
    // methods minKey and removeMinElement are not
    // shown.
}


Selection Sort

  • Selection Sort is a variation of PriorityQueueSort that uses an unsorted sequence to implement the priority queue P.
  • Phase 1, the insertion of an item into P takes O(1) time.
  • Phase 2, removing an item from P takes time proportional to the number of elements in P.

                  Sequence S             Priority Queue P
Input             {7,4,8,2,5,3,9}        {}
Phase 1:  (a)     {4,8,2,5,3,9}          {7}
          (b)     {8,2,5,3,9}            {7,4}
          …       …                      …
          (g)     {}                     {7,4,8,2,5,3,9}
Phase 2:  (a)     {2}                    {7,4,8,5,3,9}
          (b)     {2,3}                  {7,4,8,5,9}
          (c)     {2,3,4}                {7,8,5,9}
          (d)     {2,3,4,5}              {7,8,9}
          (e)     {2,3,4,5,7}            {8,9}
          (f)     {2,3,4,5,7,8}          {9}
          (g)     {2,3,4,5,7,8,9}        {}



Selection Sort (cont.)

  • Clearly, a bottleneck exists in phase 2. The first call to removeMinElement takes Θ(n), the second Θ(n-1), etc., until the last removal takes Θ(1) time.

  • The total time needed for phase 2 is then
    Θ(n) + Θ(n-1) + … + Θ(1) = Θ(n + (n-1) + … + 1).

  • Recall that 1 + 2 + … + n = n(n+1)/2.

  • The total time complexity of phase 2 is then Θ(n²).

  • Thus the total time complexity of the algorithm is
    O(n) + Θ(n²) = Θ(n²).
Insertion Sort

  • Insertion sort is the sort that results when we perform a PriorityQueueSort implementing the priority queue with a sorted sequence.

  • The cost of phase 2 is improved to O(n).

  • However, phase 1 now becomes the bottleneck for the running time. The first insertItem takes O(1), the second O(2), until the last insertItem takes O(n).

  • The run time of phase 1 is
    O(1) + O(2) + … + O(n) = O(n²).

  • The overall run time of the algorithm is
    O(n²) + O(n) = O(n²).
Insertion Sort (cont.)



                  Sequence S             Priority Queue P
Input             {7,4,8,2,5,3,9}        {}
Phase 1:  (a)     {4,8,2,5,3,9}          {7}
          (b)     {8,2,5,3,9}            {4,7}
          (c)     {2,5,3,9}              {4,7,8}
          (d)     {5,3,9}                {2,4,7,8}
          (e)     {3,9}                  {2,4,5,7,8}
          (f)     {9}                    {2,3,4,5,7,8}
          (g)     {}                     {2,3,4,5,7,8,9}
Phase 2:  (a)     {2}                    {3,4,5,7,8,9}
          (b)     {2,3}                  {4,5,7,8,9}
          …       …                      …
          (g)     {2,3,4,5,7,8,9}        {}


Comparison of Selection Sort and Insertion Sort



                     Phase 1      Phase 2      Total
Selection Sort       O(n)         Θ(n²)        Θ(n²)
Insertion Sort       O(n²)        O(n)         O(n²)
  • Selection sort and insertion sort both take O(n²).

  • Selection sort will always take Ω(n²) time, no matter what the input sequence.

  • The cost of insertion sort varies depending on the input sequence.

    • If we remove from the rear of an already sorted input sequence S, or remove from the front of a reverse-ordered sequence, then the (best-case) cost of phase 1 of insertion sort is O(n). Thus, nearly sorted sequences perform better with insertion sort.

Huffman Coding & Dictionaries

Motivation: Compression and data coding

Data compression is a highly important topic in computer science and information theory. Many situations require the use of coding schemes for data compression, including:
  • Storage of a large file on disk.
  • Storage of a large array or other dataset in memory.
  • Transmission of a large piece of information over a bandwidth-limited channel such as a phone line.
Many coding schemes are in common use in today's computer systems:
  • JPEG images
  • MPEG movies
  • data compression algorithms in a modem
Data compression is an example of the space-time tradeoff. In order to save storage costs, we expend additional CPU time in running the compression algorithm.
In this lecture we limit our focus to Huffman coding, a variable-length prefix encoding algorithm for compression of character streams. Although more sophisticated algorithms are now available and more widely used, Huffman's algorithm has many desirable qualities and is an interesting application of the use of binary trees.

Bits and bytes

Digital computers store data in binary or base-2 format. A binary digit may have one of two values:
Value Alternate representations
0 off, false, 0 volts (TTL)
1 on, true, 5 volts (TTL)
A bit is a binary digit and is the atomic unit in a digital computer.
Most modern computers do not store 1-bit numbers, however. For example, a machine with a 32-bit operating system stores addresses as 32-bit words.
A byte is an 8-bit number and is typically the smallest word size on a computer. An 8-bit binary number is a base-2 representation of 8 digits. For example, 01101101 in base 2 equals 64 + 32 + 8 + 4 + 1 = 109 in base 10.

Longer words (such as 16-bit, 32-bit, and 64-bit words) are constructed from 8-bit bytes. Other commonly used numbering systems used in computer science are octal (base 8) and hexadecimal (base 16).

Unicode and ASCII

Java stores characters as 16-bit unsigned integers according to the Unicode character-set standard. The Unicode set stores 2^16 = 65,536 characters and is an international standard that includes Asian character sets.
All of the letters in the English language, as well as numbers, punctuation, and control characters, are represented in the lowest 7 bits of the Unicode set, with the upper 9 bits set to zero. This subset of the Unicode set is known as the ASCII standard and includes 2^7 = 128 characters. All of the characters encountered in this course will be part of the ASCII standard (see next page).
The following fragment of Java code
char[] charVals = {'m','n','o'};
for(int i = 0; i < charVals.length; i++)
{
    System.out.println("Character '" + charVals[i]
        + "' is stored as " + (int)charVals[i]
        + " in base-10.");
}
yields the output
Character 'm' is stored as 109 in base-10.
Character 'n' is stored as 110 in base-10.
Character 'o' is stored as 111 in base-10.

ASCII Table

+---------------------------------------------------------------+
|  0 NUL|  1 SOH|  2 STX|  3 ETX|  4 EOT|  5 ENQ|  6 ACK|  7 BEL|
|  8 BS |  9 HT | 10 NL | 11 VT | 12 NP | 13 CR | 14 SO | 15 SI |
| 16 DLE| 17 DC1| 18 DC2| 19 DC3| 20 DC4| 21 NAK| 22 SYN| 23 ETB|
| 24 CAN| 25 EM | 26 SUB| 27 ESC| 28 FS | 29 GS | 30 RS | 31 US |
| 32 SP | 33  ! | 34  " | 35  # | 36  $ | 37  % | 38  & | 39  ' |
| 40  ( | 41  ) | 42  * | 43  + | 44  , | 45  - | 46  . | 47  / |
| 48  0 | 49  1 | 50  2 | 51  3 | 52  4 | 53  5 | 54  6 | 55  7 |
| 56  8 | 57  9 | 58  : | 59  ; | 60  < | 61  = | 62  > | 63  ? |
| 64  @ | 65  A | 66  B | 67  C | 68  D | 69  E | 70  F | 71  G |
| 72  H | 73  I | 74  J | 75  K | 76  L | 77  M | 78  N | 79  O |
| 80  P | 81  Q | 82  R | 83  S | 84  T | 85  U | 86  V | 87  W |
| 88  X | 89  Y | 90  Z | 91  [ | 92  \ | 93  ] | 94  ^ | 95  _ |
| 96  ` | 97  a | 98  b | 99  c |100  d |101  e |102  f |103  g |
|104  h |105  i |106  j |107  k |108  l |109  m |110  n |111  o |
|112  p |113  q |114  r |115  s |116  t |117  u |118  v |119  w |
|120  x |121  y |122  z |123  { |124  | |125  } |126  ~ |127 DEL|
+---------------------------------------------------------------+



Variable-length encoding

Unicode and ASCII are examples of fixed-length encoding schemes, since all characters require the same amount of storage (16 bits and 8 bits, respectively).
In contrast, Huffman coding is a variable-length encoding scheme. The number of bits required to store a coded character varies according to the relative frequency or weight of that character. Although little if any space is saved on infrequent characters, frequently used characters require very little space (say one, two, or three bits).
The encoding process is based on a Huffman coding tree, which is built from the observed frequencies of characters in the document. The document is scanned, and the number of occurrences of each character in the document is recorded. Next, a binary tree is built in which the external nodes store the characters and their frequency in the document. When the binary tree is built from the document being coded, Huffman's algorithm optimizes the length of the encoded document.
Pre-scanning a document and generating a customized Huffman coding tree may be impractical, however. Often, typical frequencies of characters in a given context are used instead of the specific frequencies in a particular document.

Building a Huffman coding tree

Suppose that the following character frequencies are observed in a string that we wish to encode:  
character    C    D    E     F    K    L    U    Z
frequency    32   42   120   24   7    42   37   2
The first step is to construct a priority queue and insert the character-frequency (element-key) pairs into it. Assuming that we are using a sorted sequence-based priority queue, the above table could be schematically represented by the illustration below.
Step 1:

 

Building a Huffman coding tree (2)

In step 2, the two items with the lowest keys are removed from the priority queue. A new binary tree is created with the lowest-key item as the left external node and the second-lowest-key item as the right external node. The new tree is then inserted back into the priority queue
Step 2:


This process continues until only one node (tree) is left in the priority queue.
Step 3:

Building a Huffman coding tree (3)

Step 4:

Step 5:
Building a Huffman coding tree (4)

Final tree (after n=8 steps)
Building a Huffman coding tree (5)

Algorithm Huffman(X):
    Input: String X of length n
    Output: Coding tree for X
    Compute the frequency f(c) of each character c in X.
    Initialize a priority queue Q.
    for each character c in X do
        Create a single-node tree T storing c.
        Insert T into Q with key f(c).
    while Q.size() > 1 do
        f1 ← Q.minKey()
        T1 ← Q.removeMinElement()
        f2 ← Q.minKey()
        T2 ← Q.removeMinElement()
        Create a new tree T with left subtree T1 and right subtree T2.
        Insert T into Q with key f1 + f2.
    return tree Q.removeMinElement()
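
The same construction can be written against java.util.PriorityQueue. A minimal sketch (the Node class and method names below are illustrative, not the course's classes):

import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;

public class HuffmanBuilder {
    // A coding-tree node: external nodes carry a character, internal nodes do not.
    static class Node {
        final long freq;
        final Character ch;        // null for internal nodes
        final Node left, right;
        Node(long freq, Character ch, Node left, Node right) {
            this.freq = freq; this.ch = ch; this.left = left; this.right = right;
        }
    }

    // Build a Huffman coding tree for the string x, following the algorithm above.
    static Node buildTree(String x) {
        Map<Character, Long> freq = new HashMap<>();
        for (char c : x.toCharArray()) freq.merge(c, 1L, Long::sum);

        PriorityQueue<Node> q =
            new PriorityQueue<>((a, b) -> Long.compare(a.freq, b.freq));
        for (Map.Entry<Character, Long> e : freq.entrySet())
            q.add(new Node(e.getValue(), e.getKey(), null, null));

        while (q.size() > 1) {
            Node t1 = q.poll();                        // tree with the smallest key
            Node t2 = q.poll();                        // tree with the next smallest key
            q.add(new Node(t1.freq + t2.freq, null, t1, t2));
        }
        return q.poll();                               // the finished coding tree
    }
}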

Decoding

A Huffman code is a prefix code, meaning that no code is the prefix of another. To decode a bit stream from the leftmost bit, start at the root node of the tree and move to the left child if the bit is "0" or the right child if the bit is "1". When an external node is reached, the character it stores is sent to the decoded string. The next bit is then decoded from the root of the tree.
Decode 1011001110111101. (Using the code table on the next page this splits as 101|100|1110|111101, i.e. L, U, C, K.)
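
Continuing the sketch above, decoding is a walk down the tree driven by the bits (again illustrative code, not the course's classes):

public class HuffmanDecoder {
    // Walk the tree: '0' goes to the left child, '1' to the right child.
    // When an external node is reached, emit its character and restart at the root.
    static String decode(HuffmanBuilder.Node root, String bits) {
        StringBuilder out = new StringBuilder();
        HuffmanBuilder.Node cur = root;
        for (char bit : bits.toCharArray()) {
            cur = (bit == '0') ? cur.left : cur.right;
            if (cur.ch != null) {
                out.append(cur.ch);
                cur = root;
            }
        }
        return out.toString();
    }
}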

Encoding

Create a lookup table storing the binary code corresponding to the path to each letter. This can be done via a preorder traversal. If the Huffman tree is used to encode ASCII text, a 128-element array can be used for this purpose. For example, the following code fragment can be used to store the string representing the set of bit-wise operations required to encode the character 'C':
String[] encoder = new String[128];
encoder['C'] = "1110";
 

Character    Frequency    Code      # bits
C            32           1110      4
D            42           110       3
E            120          0         1
F            24           11111     5
K            7            111101    6
L            42           101       3
U            37           100       3
Z            2            111100    6
Encode the string DEED: 110|0|0|110 = 11000110 (8 bits).

Analysis

Define f_i = frequency of character c_i, i = 1, …, n.
l_i = path length = length of the code for c_i, i.e., number of bits.
Weighted path length W = Σ_{i=1..n} f_i · l_i.

It can be shown that a Huffman code is the prefix code that minimizes the weighted path length.
Average expected cost per character = (Σ_i f_i · l_i) / (Σ_i f_i).
Example from the table on the previous page:
expected cost per character = (32·4 + 42·3 + 120·1 + 24·5 + 7·6 + 42·3 + 37·3 + 2·6) / (32 + 42 + 120 + 24 + 7 + 42 + 37 + 2)
= 785 / 306 ≈ 2.57.

A fixed-length encoding on 8 characters would require 3 bits per character.


Dictionaries


  • The Dictionary ADT
  • Implementation with Sorted and Unsorted Sequences
  • Binary Search


Adapted from the Goodrich and Tamassia lecture on Searching

The Dictionary ADT

  • A dictionary is an abstract model of a database.

  • Like a priority queue, a dictionary supports key-element pairs.

  • The main operation supported by a dictionary is searching by key.

  • Methods inherited from the SimpleContainer ADT
    • size()
    • isEmpty()

  • Query methods:
    • Object findElement(Object k)
    • Enumeration findAllElements(Object k)

  • Update methods:
    • void insertItem(Object k, Object e)
    • Object remove(Object k)
    • Enumeration removeAll(Object k)

  • Special sentinel object:
    • NO_SUCH_KEY, returned by an unsuccessful search.
 Interface SimpleDictionary

Example Operations on a Dictionary



Operation             Output         Unordered Dictionary
insertItem(5,A)                      {(5,A)}
insertItem(7,B)                      {(5,A),(7,B)}
insertItem(2,C)                      {(5,A),(7,B),(2,C)}
insertItem(8,D)                      {(5,A),(7,B),(2,C),(8,D)}
findElement(7)        B              {(5,A),(7,B),(2,C),(8,D)}
findElement(4)        NO_SUCH_KEY    {(5,A),(7,B),(2,C),(8,D)}
insertItem(2,E)                      {(5,A),(7,B),(2,C),(8,D),(2,E)}
findAllElements(2)    C,E            {(5,A),(7,B),(2,C),(8,D),(2,E)}
removeAll(2)          C,E            {(5,A),(7,B),(8,D)}

Implementing a Dictionary with a Sequence

  • Unordered sequence:


    • Searching takes O(n) time.
    • Inserting takes O(1) time with a doubly linked-list, O(n) time with an array.

  • Ordered sequence


    • Searching takes O(log n) time with an array, O(n) time with a linked list.
    • Inserting takes O(n) time.

  • In the ordered sequence implementation, we can search faster if the sequence is array-based.
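
To make the unordered-sequence case above concrete, here is a minimal sketch using java.util.LinkedList in place of the course's Sequence classes; method names follow the SimpleDictionary ADT above:

import java.util.Iterator;
import java.util.LinkedList;

public class UnorderedSeqDictionary {
    public static final Object NO_SUCH_KEY = new Object();   // sentinel for failed searches

    private static class Pair {
        final Object k; final Object e;
        Pair(Object k, Object e) { this.k = k; this.e = e; }
    }

    private final LinkedList<Pair> seq = new LinkedList<>();

    public int size() { return seq.size(); }
    public boolean isEmpty() { return seq.isEmpty(); }

    // O(1): always insert at the front, regardless of key order.
    public void insertItem(Object k, Object e) { seq.addFirst(new Pair(k, e)); }

    // O(n): scan the whole sequence in the worst case.
    public Object findElement(Object k) {
        for (Pair p : seq) if (p.k.equals(k)) return p.e;
        return NO_SUCH_KEY;
    }

    // O(n): scan, then unlink the matching node.
    public Object remove(Object k) {
        Iterator<Pair> it = seq.iterator();
        while (it.hasNext()) {
            Pair p = it.next();
            if (p.k.equals(k)) { it.remove(); return p.e; }
        }
        return NO_SUCH_KEY;
    }

    public static void main(String[] args) {
        UnorderedSeqDictionary d = new UnorderedSeqDictionary();
        d.insertItem(5, "A"); d.insertItem(7, "B"); d.insertItem(2, "C");
        System.out.println(d.findElement(7));                    // B
        System.out.println(d.findElement(4) == NO_SUCH_KEY);     // true
    }
}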




Cost of Sequence Dictionary Operations



                     Unsorted Sequence                 Sorted Sequence
Method               Array       Doubly Linked List    Array          Doubly Linked List
size, isEmpty        O(1)        O(1)                  O(1)           O(1)
findElement          O(n)        O(n)                  O(log n)       O(n)
findAllElements      Θ(n)        Θ(n)                  O(log n + s)   O(n)
insertItem           O(1)        O(1)                  O(n)           O(n)
remove               O(n)        O(n)                  O(n)           O(n)
removeAll            Θ(n)        Θ(n)                  O(n)           O(n)
-- space --          O(N)        O(n)                  O(N)           O(n)


Binary Search

  • Narrow down the search range in stages.

  • Instead of scanning the sequence we do bisections.

  • All candidate items have key(low) ≤ k ≤ key(high).

  • Example: findElement(22)


Pseudo-Code for Binary Search

Algorithm BinarySearch(S, k, low, high):
    if low > high then
        return NO_SUCH_KEY
    else
        mid ← ⌊(low + high) / 2⌋
        if k = key(mid) then
            return k
        else if k < key(mid) then
            return BinarySearch(S, k, low, mid-1)
        else
            return BinarySearch(S, k, mid+1, high)

  • The Dictionary ADT method findElement(k) on a sorted array can be implemented by calling BinarySearch(S, k, 0, S.size()-1).
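
A direct Java translation of the pseudo-code, as a minimal sketch that stores the keys in a sorted int array instead of the course's Sequence ADT and returns the index of the match (with -1 standing in for NO_SUCH_KEY):

public class BinarySearchDemo {
    static final int NO_SUCH_KEY = -1;   // stand-in for the sentinel object

    static int binarySearch(int[] s, int k, int low, int high) {
        if (low > high) return NO_SUCH_KEY;
        int mid = (low + high) / 2;
        if (k == s[mid]) return mid;                                  // found
        else if (k < s[mid]) return binarySearch(s, k, low, mid - 1);
        else return binarySearch(s, k, mid + 1, high);
    }

    public static void main(String[] args) {
        int[] s = {2, 4, 5, 7, 8, 9, 12, 14, 17, 19, 22, 25, 27, 28, 33, 37};
        System.out.println(binarySearch(s, 22, 0, s.length - 1));    // prints 10
    }
}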



Analysis of Binary Search

  • A constant number of primitive operations is involved at each recursive level. Thus the running time is proportional to the number of recursive calls.
  • The size of the candidate list is cut by at least half at each step (i.e., at each recursive level).

comparison      size of candidate list
0               n
1               ≤ n/2
2               ≤ n/4
i               ≤ n/2^i
log2 n          ≤ 1

The recursive calls stop when the size of the candidate list is ≤ 1. The number of recursive calls is m such that n/2^m ≤ 1, i.e., m ≥ log2 n.

-->  Cost of binary search = O(log n).