003 - Hashing Table & Binary Tree

Thursday, May 7, 2020

AVL & B-Tree

AVL Tree & B-Tree

AVL Tree

Definition

AVL tree is a data structure that, as the name suggests, is created by two Soviet inventors by the names of Georgy Adelson-Velsky (AV) and Evgenii Landis (L). They published this in their 1962 paper titled "An Algorithm for the Organization of Information".

In computer science, the AVL tree is the first ever tree that is self-balancing, because when the heights of the two child subtrees of any node differ by at most one; if at any time they differ by more than one, a recursive function is done to rebalance this property.

It takes O(log n) time complexity for search, insertion, and deletion, where n is the number of nodes in the tree. Sometimes, insertion and deletion require the tree to be rebalanced by one or more tree rotations.

Here is an example of an AVL tree with balance factors (green):

Source: wikipedia.org

Balance Factor

In a binary tree, the balance factor of a node n is defined to be the height difference of its two child sub-trees in the following formula:

BalanceFactor(n) := height(RightSubtree(n)) - height(LeftSubtree(n))

A binary tree is considered an AVL tree when the invariant holds for every node n in the tree.

BalanceFactor(n) ∈ {-1, 0, 1}

A node with BalanceFactor(n) < 0 is left-heavy, while BalanceFactor(n) > 0 is right-heavy, and BalanceFactor(n) = 0 is balanced.

Operations

Operations of an AVL tree has the same actions as the operations carried out on an unbalanced BST (binary search tree). But the modifications have to observe and restore the height balance of sub-trees.

Searching

In AVL, searching for a key is carried out the same way as in BST (binary search tree).

The search in AVL require the employment of a comparison function that establishes a total order (or preorder) on the set of keys. This has the time complexity of O(log n).

Insert

When inserting a node into a tree, we use the same method as in BST. If the tree is still empty, then the node becomes the root of the tree. If the tree is already occupied, then we go down recursively from the root searching for the compatible location to insert the new node. This traversal is guided by the comparison function. In this case, the node always replaces a NULL reference (left or right) of an external node in the tree, either becoming a left-child or right-child.

After this insertion if a tree becomes unbalanced, only ancestors of the newly inserted node are unbalanced. This is because only those nodes have their sub-trees altered. So it is necessary to check each of the node's ancestors for consistency with the invariants of AVL trees: this is called "retracing". This is achieved by considering the balance factor of each node. If a tree becomes unbalanced due to the insertion of a value, then the tree will apply some rotations in order for it to be rebalanced.

Delete
In deletion, the properties of an AVL tree have to stay the same so that it will not become any other kind of trees. To delete a root node from the tree, we have to replace the root with the rightmost child of the left sub-tree. After the replace, we consider the balance factor and then we have to rebalance the tree if it is not balanced.

B-Tree

Definition

B-Tree is another self-balancing tree data structure and it is not the abbreviation for Binary Search Tree, so do not confuse them together. It is invented by Rudolf Bayer and Edward M. McCreight while working at the Boeing Research Labs when they were trying to optimize managing index pages for large random access files.

Like other self-balancing trees, B-Trees have the time complexity of O(log n) when doing search, insertion, and deletion. Unlike any other self-balancing tree data structures though, B-Tree is the most suitable for storage systems that read and write large blocks of data, such as discs, and is commonly used in databases and file systems. What makes B-Tree unique to other trees is that one node can contain more than two data or degrees and can have more than two children. Other than that, B-Tree grows and shrinks from the root, which separates it from Binary Search Tree (which grows and shrinks from downward).

Properties
The properties of a B-Tree of order n is as follows:
1. Every node has at most n children.
2. Every non-leaf node (except root) has at least [n / 2] child nodes.
3. The root has at least two children if it is not a leaf.
4. A non-leaf node with k children contains [k - 1] keys.
5. All leaves are at the same level and carry no information.

Operations
Search
The search algorithm is similar to the one in BST. We start from the root and recursively traverse down (similar to BST's inorder traversal). For every visited non-leaf node, we recur down until it finds the key of the node. If we do not find the key from any leaf node, then we return NULL.

Insert
Insertions can be very slow in a sorted sequential file because room for the inserted record must be made. Inserting a record before the first record requires shifting all the record down by one level. One solution in insertions is to leave some spaces. Instead of densely packing all the records in a block, the block can have some free space to allow for subsequent insertions. Those spaces would be marked as if they were "deleted" records.

These are the simplified steps for insertion algorithm according to geeksforgeeks.org:
1) Initialize x as root.
2) While x is not leaf, do the following:

Find the child of x that is going to be traversed next. Let the child be y.
If y is not full, change x to point to y.
If y is full, split it and change x to point to one of the two parts of y. If k is smaller than mid key in y, then set x as the first part of y. Else second part of y. When we split y, we move a key from y to its parent x.

3) The loop in step 2 stops when x is leaf. x must have space for 1 extra key as we have been splitting all nodes in advance. So, simply insert k to x.

Delete
In deletion, the index of the B-Tree can stay the same and the record can just be marked as "deleted". The database will remain sorted.

Exercise:

* insert: 5,6,7,0,4,3,8
* delete: 3,4,8

AVL Tree

Insert 5

Insert 6

Insert 7

Insert 0

Insert 4

Insert 3

Insert 8

Delete 3

Delete 4

Delete 8

B-Tree

With max degree of 3:

Insert 5

Insert 6

Insert 7

Insert 0

Insert 4

Insert 3

Insert 8

Delete 3

Delete 4

Delete 8

Sunday, April 5, 2020

Review I

NIM : 2301862393

Name: Joerio C.

Data Structures Introduction

POINTER

Pointer is a data type whose value refers to another value stored elsewhere in a computer memory using its address.

This computer memory is what we more know as Random Access Memory or RAM.

And every compiler or intrepreter has different ways to access its memories.

Why does C uses operator "&" to scan or read a data?

- to store the memory in an anddress

- to determine which position the data is stored in

There are a lot of things that use pointers, especially in the complex data structures field. Some of the many applications are:

1. Trees

2. Linked List

3. Stack

4. Queue

5. Graphs

Pointers are also used for the dynamic memory allocation of a variable.

An example of pointer variables in C:

int x;

int *px; //pointer variable of x

px = &x;

to assign a value of x we can write:

x = 10; or *px = 10;

Another example:

Output: 20

The value of x is changed to 20 in "*px = 20;", because px was assigned to be the address of x in "px = &x ' which makes pointer px (*px) equal to x -- they are interconnected now.

To understand more about this, here is a visualization that explains about the correlation between a variable, its memory, and its address (the concept of pointer):

Code:

int x;

int *ptr;

ptr = &x;

Explanation:

memory: (blank)
address: 2345

ptr:

memory: 2345
address: 5000

Notice that the address to variable x is also the memory to pointer variable ptr and ptr has a different address than x, which is why we can make a pointer to that pointer variable that is also known as pointer to pointer.

Pointer to Pointer:

int x = 10;

int *px; //pointer to x

int **ppx; //pointer to pointer of x

px = &x;

ppx = &px;

Now, if the address of x (or &x) is 1002, then the value to pointer to x (px) is 1002.

And if the address of the pointer to x (px) is 2004, then the value to pointer to pointer to x (ppx) is 2004. In other words, ppx stores the address of px, while px stores the address of x.

ARRAY

Array is a collection of homogenous data items stored in a contiguous memory locations. For example, if we declare an array with the data type of 'int', then the whole data that we store has to be of type 'int' or else it would not be stored or an error will occur.

To declare an array variable, we can use:

int arr[10];

This is an array of data type integer, which means it can only store integers only. Values like decimals will not be stored as decimals as it will automatically be rounded by our compiler to from the data type of integer.

Also, this array variable arr can store up to 10 different data, because when we first declared it on the code above to have a size of 10 in '[10]'. But, we can store as many data as we need so long as our computer's memory is capable of doing so.

A visualation of array of integers:

Linked List

Linked List is a data structure that consists of a sequence of data records such that each record there is a field that contains a reference to the next record.

Linked list is used in many algorithms for solving real-time problems, when the number of elements to be stored is unpredictable and/or during sequential access of elements.

There are 2 types of linked list:

1. Single linked list

2. Double linked list

Single linked list has a pointer most people call *next to connect one node to the next node, so does double linked list. The only difference is that double linked list also has a pointer that connects a node to the previous node behind it which most people call *prev.

Linked lists have what we call the pointer head and pointer tail. The head is basically the first data in the linked list, while tail is the last.

Linked lists have the same function as arrays other than the *next and *prev pointers. In addition to that, there are some more major differences that distinct an array to a linked list.

Array:

is a linear collection of data elements
stores value in consecutive memory locations
can be random in accessing data

Linked List:

is a linear collection of nodes
stores value in random memory locations
accesses data only in sequential manner

To use linked list in the C language, we need the malloc.h or stdlib.h library mainly for the memory allocation function, which is malloc() itself. The purpose of this malloc function is to allocate memories dynamically (in run-time). To deallocate the memories used, we can use the function free(). We deallocate memories so that our computer can use that memory for other programs if needed.

Example:

Now if we want to free the head, we can use the function free():

Stack

Stack operations:

push: add an item x to the top of the stack (enstack)
pop: remove an item from the top of the stack (destack)
top/peek: reveal/return the top of the stack

Infix, Prefix, Postfix

Stack is applied for mathematical equations. When we input a normal equation (called an infix), the computer reads it and converts the equation into a prefix or postfix for easier computation due to having no brackets to show precedence or priority of the operators.

Infix: 4 * 5

Prefix: * 4 5

Postfix: 4 5 * ( scans from left to right, has the complexity of O(N) )

An example of how infix is converted into postfix using stack:

A * B + C becomes A B * C +

Hashing

Hashing is some kind of data structure used to store many data using mapping and the function called Hash function. The Hash function is used to store a value in the map provided that the user gives the key to that value.

Hash function's format: Hash(key) → index
Supposed we want to make a phonebook using the Hashing. The data that we store are phone numbers, these are the values that the Hashtable will later commit into memory. Say that we want to store Bob's phone number into the phonebook's index number 1, then we would write the Hash function as such: " Hash(Bob) → 1 ". In this case, Bob is the key that can only be used to access his index in the mapping.

Hashing is almost like Linked List. The difference is that we can store multiple data into the same index, which we call the Hash Chain. Once we have the key and the index, then it is not very complicated to find somebody else's phone numbers even though it is stored in the same index as Bob here. As you can see in the picture below.
Image result for hashing chain

Image source: geeksforgeeks.org

The vertical white boxes containing the numbers 0-5 is what we call the Hashtable, while the horizontal colored boxes are what we call Hash Chain.

Now, the process in which Hash Chains are created are called collision. To further explain what a collision is, we can imagine buying a new phonebook. In that phonebook, we want to list all our friends' phone numbers ascending according to the first letter of their names.
For example, we have Alan, Henry, Jaka, and Jono. Let's say we have 26 indexes in our book, which are sorted alphabets (from A to Z). And here is how we sort them:

The first letter of Alan's name is 'A', which is the 1st letter of the alphabets. So, we simply put his name into the 1st index of our phonebook.
Next, we have Henry whose first letter of his name is 'H', which is the 8th letter of the alphabets. That is why we are putting Henry in the 8th index.
Then, we have Jaka and Jono that both name starts with 'J'. Following the rules above, we are placing Jaka and Jono into the 10th index of our phonebook. But, how can we possibly fit 2 names into the same index? This is what we precisely call as a collision. And to handle such problem, we use linked list in order to store these two names into one index. So now, the 10th index of our phonebook has Jaka and then Jono, who is connected with linked list (visualization: "10. Jaka → Jono").

When we were trying to figure out which index they belong to in the illustration above, we call it as finding our key. Alan has the key of 1; Henry has the key of 8; Jaka and Jono both have the keys of 10. And to further make us understand on how to find our hashtable key, study the code of the getHashKey() function written in C compiler below:

How is Hashing utilized in the current Blockchain technology?
In Blockchain, every block has a hash of the previous block. The previous block of our current block is called as the parent block.
According to a Blockchain developer, every block in the Blockchain has a hash of the previous block. If we change any data in the current block, the hash of the block will be changed. This will affect the current block, because the current block has the address of the previous block. They are interconnected to each other. In other words, we have to change the data in the current block, then we also have to change the data in our parent block. (note: current block is also called present block)

Binary Tree

A Binary Tree is a special type of tree in which every node or vertex has up to 2 child nodes. The starting node is called the root node. There are 2 terms to refer to child nodes (corresponding to which side of the node it is branching to): the left child node & the right child node. The nodes without any child node branching from it are called the leaf nodes.

Image source: javatpoint.com

From the image above, we can see that the root is on the number 30. Then, 30 has two child nodes, which are 15 (left child node) and 60 (right child node). We also have some leaf nodes on the number 7, 17, 27, 45, and 75 -- all because they have no child nodes branching from them.

Binary Trees can be represented in 2 ways: using Arrays or using Linked Lists.

There are 2 types of Binary Trees:
1. Full Binary Tree / Proper Binary Tree / 2-tree: a tree in which all the node has exactly two children, except for the leaf nodes.
2. Complete Binary Tree: a binary tree in which at every level, except possibly the last, has to be filled and all nodes are as far left as possible.

Tuesday, March 10, 2020

003 - Hashing & Binary Tree

Hashing
Hashing is some kind of data structure used to store many data using mapping and the function called Hash function. The Hash function is used to store a value in the map provided that the user gives the key to that value.

Hash function's format: Hash(key) → index
Supposed we want to make a phonebook using the Hashing. The data that we store are phone numbers, these are the values that the Hashtable will later commit into memory. Say that we want to store Bob's phone number into the phonebook's index number 1, then we would write the Hash s function as such: " Hash(Bob) → 1 ". In this case, Bob is the key that can only be used to access his index in the mapping.

Hashing is almost like Linked List. The difference is that we can store multiple data into the same index, which we call the Hash Chain. Once we have the key and the index, then it is not very complicated to find somebody else's phone numbers even though it is stored in the same index as Bob here. As you can see in the picture below.
Image result for hashing chain

Image source: geeksforgeeks.org

The vertical white boxes containing the numbers 0-5 is what we call the Hashtable, while the horizontal colored boxes are what we call Hash Chain.

How is Hashing utilized in the current Blockchain technology?
In Blockchain, every block has a hash of the previous block. The previous block of our current block is called as the parent block.
According to a Blockchain developer, every block in the Blockchain has a hash of the previous block. If we change any data in the current block, the hash of the block will be changed. This will affect the current block, because the current block has the address of the previous block. They are interconnected to each other. In other words, we have to change the data in the current block, then we also have to change the data in our parent block. (note: current block is also called present block)

Binary Tree
A Binary Tree is a special type of tree in which every node or vertex has up to 2 child nodes. The starting node is called the root node. There are 2 terms to refer to child nodes (corresponding to which side of the node it is branching to): the left child node & the right child node. The nodes without any child node branching from it are called the leaf nodes.

Binary Trees can be represented in 2 ways: using arrays or using Linked Lists.

Types of Binary Trees:
1. Full Binary Tree / Proper Binary Tree / 2-tree: a tree in which all the node has exactly two children, except for the leaf nodes.
2. Complete Binary Tree: a binary tree in which at every level, except possibly the last, has to be filled and all nodes are as far left as possible.