Unit-I

Random Access Machine Model

Algorithms can be measured in a machine-independent way using the Random Access Machine (RAM) model. This model assumes a single processor. In the RAM model, instructions are executed one after the other, with no concurrent operations. This model of computation is an abstraction that allows us to compare algorithms on the basis of performance. The assumptions made in the RAM model to accomplish this are:

Each simple operation takes one time step. Loops and subroutines are not simple operations. Each memory access takes one time step, and there is no shortage of memory.

For any given problem, the running time of an algorithm is taken to be the number of time steps. The space used by an algorithm is taken to be the number of RAM memory cells it uses.

Review Of Data Structures

Introduction

When we talk about balanced search trees, we are specifically talking about two things:

1. A search tree.
2. A height-balanced tree.

In this post we will consider a Binary Search Tree and try to equip it with features to balance itself upon modification. Before starting with the logic and code, we need to understand the motive behind this kind of data structure and, more precisely, the need for such a complex arrangement. Apart from that, we also need to understand what it means for a tree to be termed balanced.

What does balancing mean? We say a tree is balanced when it is height-balanced, which means the following: it is not skewed; it has a uniform distribution of nodes throughout the structure; and its height is as small as possible. A general relation can be derived: the height is proportional to log N, where N is the total number of nodes.

   

Why is balancing important in a search tree? Balanced search trees make the tree data structure efficient: they guarantee O(log N) running time for all the dictionary operations (Insert, Search, Delete). This is, of course, achieved by paying some cost, as mentioned below:

Re-balancing upon insertion of a new node. Re-balancing upon deletion of an existing node.

Balanced Trees

Mergeable Sets

In computer science, a disjoint-set data structure, also called a union-find data structure or merge-find set, is a data structure that keeps track of a set of elements partitioned into a number of disjoint (non-overlapping) subsets. It provides near-constant-time operations (bounded by the inverse Ackermann function) to add new sets, to merge existing sets, and to determine whether elements are in the same set. In addition to many other uses (see the Applications section), disjoint sets play a key role in Kruskal's algorithm for finding the minimum spanning tree of a graph. Initially there are 10 subsets and each subset has a single element in it.

When each subset contains only a single element, each element is its own representative in the array Arr. (The figures showing the contents of Arr after each step are not reproduced here.)

Let's perform some operations:

1) Union(2, 1)
2) Union(4, 3)
3) Union(8, 4)
4) Union(9, 3)
5) Union(6, 5)

After performing some Union(A, B) operations, you can see that there are now 5 subsets. The first has elements {3, 4, 8, 9}, the second has {1, 2}, the third has {5, 6}, the fourth has {0} and the fifth has {7}. All these subsets are called connected components. One can also relate these elements to the nodes of a graph. The elements in one subset can be considered as nodes of a graph which are connected to each other directly or indirectly, so each subset can be considered a connected component. From this, we can infer that the Union-Find data structure is useful in graphs for performing operations like connecting nodes, finding connected components, etc. Let's perform some Find(A, B) operations. 1) Find(0, 7) - as 0 and 7 are disconnected, this gives a false result. 2) Find(8, 9) - though 8 and 9 are not connected directly, there exists a path connecting 8 and 9, so it gives a true result. When we see the above operations in terms of components: Union(A, B) - replace the components containing the two objects A and B with their union. Find(A, B) - check whether the two objects A and B are in the same component or not. So if we perform the operation Union(5, 2) on the above components, then:

Now the components {5, 6} and {1, 2} are merged into the single component {1, 2, 5, 6}, and Arr is updated accordingly.
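The array-based scheme described above can be sketched in C. This is an illustrative sketch only (the function names and the fixed size of 10 are my own choices, and it omits the union-by-rank and path-compression optimizations that give the near-constant-time bounds mentioned earlier): each element stores a parent link in Arr, Union links the roots of two subsets, and Find checks whether two elements share a root.

#include <stdio.h>

#define N 10

int Arr[N];                  /* Arr[i] holds the parent/representative link of element i */

/* Initially every element is in its own subset. */
void initialize(void) {
    for (int i = 0; i < N; i++)
        Arr[i] = i;
}

/* Follow parent links until an element is its own representative. */
int root(int a) {
    while (Arr[a] != a)
        a = Arr[a];
    return a;
}

/* Union(A, B): merge the subsets containing a and b. */
void set_union(int a, int b) {
    int ra = root(a), rb = root(b);
    if (ra != rb)
        Arr[rb] = ra;        /* attach b's root under a's root */
}

/* Find(A, B): are a and b in the same subset (connected component)? */
int find(int a, int b) {
    return root(a) == root(b);
}

int main(void) {
    initialize();
    set_union(2, 1); set_union(4, 3); set_union(8, 4);
    set_union(9, 3); set_union(6, 5);
    printf("Find(0, 7) = %d\n", find(0, 7));   /* 0: disconnected */
    printf("Find(8, 9) = %d\n", find(8, 9));   /* 1: connected via 3 and 4 */
    return 0;
}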

Algorithm Design Techniques

Iterative Techniques

In computational mathematics, an iterative method is a mathematical procedure that generates a sequence of improving approximate solutions for a class of problems, in which the n-th approximation is derived from the previous ones. A specific implementation of an iterative method, including the termination criteria, is an algorithm of the iterative method. An iterative method is called convergent if the corresponding sequence converges for given initial approximations. A mathematically rigorous convergence analysis of an iterative method is usually performed; however, heuristic-based iterative methods are also common. In the problems of finding the root of an equation (or a solution of a system of equations), an iterative method uses an initial guess to generate successive approximations to a solution. In contrast, direct methods attempt to solve the problem by a finite sequence of operations. In the absence of rounding errors, direct methods would deliver an exact solution (for example, solving a linear system of equations Ax = b by Gaussian elimination). Iterative methods are often the only choice for nonlinear equations. However, iterative methods are often useful even for linear problems involving a large number of variables (sometimes of the order of millions), where direct methods would be prohibitively expensive (and in some cases impossible) even with the best available computing power.
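As a concrete illustration of an iterative method, the C sketch below applies Newton's iteration x(n+1) = x(n) - f(x(n))/f'(x(n)) to f(x) = x^2 - 2; the starting guess, tolerance and iteration limit are assumptions chosen for the example, not prescribed by the text.

#include <stdio.h>
#include <math.h>

int main(void) {
    double x = 1.0;                    /* initial guess */
    double tol = 1e-12;                /* termination criterion */
    for (int n = 0; n < 100; n++) {
        double fx  = x * x - 2.0;      /* f(x)  = x^2 - 2 */
        double dfx = 2.0 * x;          /* f'(x) = 2x      */
        double next = x - fx / dfx;    /* Newton update   */
        if (fabs(next - x) < tol) {    /* stop once successive approximations agree */
            x = next;
            break;
        }
        x = next;
    }
    printf("Approximate root: %.12f (sqrt(2) = %.12f)\n", x, sqrt(2.0));
    return 0;
}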

Divide and Conquer

In the divide and conquer approach, the problem at hand is divided into smaller sub-problems and then each sub-problem is solved independently. When we keep on dividing the sub-problems into even smaller sub-problems, we may eventually reach a stage where no more division is possible. Those "atomic", smallest possible sub-problems are solved. The solutions of all sub-problems are finally merged in order to obtain the solution of the original problem.

Broadly, we can understand divide-and-conquer approach in a three-step process.

Divide/Break This step involves breaking the problem into smaller sub-problems. Sub-problems should represent a part of the original problem. This step generally takes a recursive approach to divide the problem until no sub-problem is further divisible. At this stage, sub-problems become atomic in nature but still represent some part of the actual problem.

Conquer/Solve This step receives a lot of smaller sub-problems to be solved. Generally, at this level, the problems are considered 'solved' on their own.

Merge/Combine When the smaller sub-problems are solved, this stage recursively combines them until they formulate a solution of the original problem. This algorithmic approach works recursively, and the conquer and merge steps work so closely together that they appear as one.

Examples The following computer algorithms are based on the divide-and-conquer programming approach −

Merge Sort
Quick Sort
Binary Search
Strassen's Matrix Multiplication
Closest Pair (of points)

There are various ways available to solve any computer problem, but the ones mentioned above are good examples of the divide and conquer approach.
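Binary search, one of the examples listed above, shows the pattern in its simplest form: each call discards half of a sorted array and recurses on the remaining half. A minimal C sketch (the sample array and keys are illustrative):

#include <stdio.h>

/* Returns the index of key in a[lo..hi], or -1 if it is absent.
   The array must already be sorted in increasing order. */
int binary_search(const int a[], int lo, int hi, int key) {
    if (lo > hi)
        return -1;                                   /* empty sub-array: not found */
    int mid = lo + (hi - lo) / 2;                    /* divide */
    if (a[mid] == key)
        return mid;
    if (key < a[mid])
        return binary_search(a, lo, mid - 1, key);   /* conquer the left half  */
    return binary_search(a, mid + 1, hi, key);       /* conquer the right half */
}

int main(void) {
    int a[] = {10, 14, 19, 26, 27, 31, 33, 35, 42, 44};
    int n = sizeof a / sizeof a[0];
    printf("index of 31: %d\n", binary_search(a, 0, n - 1, 31));
    printf("index of 20: %d\n", binary_search(a, 0, n - 1, 20));
    return 0;
}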

Dynamic Programming

The dynamic programming approach is similar to divide and conquer in breaking down the problem into smaller and yet smaller possible sub-problems. But unlike divide and conquer, these sub-problems are not solved independently. Rather, the results of these smaller sub-problems are remembered and used for similar or overlapping sub-problems. Dynamic programming is used where we have problems that can be divided into similar sub-problems, so that their results can be re-used. Mostly, these algorithms are used for optimization. Before solving the sub-problem at hand, a dynamic algorithm will try to examine the results of the previously solved sub-problems. The solutions of sub-problems are combined in order to achieve the best solution. So we can say that −

The problem should be able to be divided into smaller overlapping sub-problems.
An optimum solution can be achieved by using an optimum solution of smaller sub-problems.
Dynamic algorithms use memoization.

Comparison In contrast to greedy algorithms, where local optimization is addressed, dynamic algorithms are motivated by an overall optimization of the problem. In contrast to divide and conquer algorithms, where solutions are combined to achieve an overall solution, dynamic algorithms use the output of a smaller sub-problem and then try to optimize a bigger sub-problem. Dynamic algorithms use memoization to remember the output of already solved sub-problems.

Example The following computer problems can be solved using the dynamic programming approach −

Fibonacci number series
Knapsack problem
Tower of Hanoi
All-pairs shortest path by Floyd-Warshall
Shortest path by Dijkstra
Project scheduling

Dynamic programming can be used in both a top-down and a bottom-up manner. And of course, most of the time, referring to the previous solution output is cheaper than recomputing it in terms of CPU cycles.
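The Fibonacci series from the list above is the classic illustration of overlapping sub-problems: fib(n-1) and fib(n-2) share most of their work, so a memo table stores each result the first time it is computed. A top-down C sketch (the table size is an arbitrary assumption):

#include <stdio.h>

#define MAX 90
long long memo[MAX];                  /* memo[i] == 0 means "not computed yet" */

/* Top-down dynamic programming: remember the results of solved sub-problems. */
long long fib(int n) {
    if (n <= 1)
        return n;
    if (memo[n] != 0)
        return memo[n];               /* re-use a previously computed answer */
    memo[n] = fib(n - 1) + fib(n - 2);
    return memo[n];
}

int main(void) {
    printf("fib(50) = %lld\n", fib(50));   /* runs in O(n) rather than exponential time */
    return 0;
}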

Greedy Algorithms

An algorithm is designed to achieve the optimum solution for a given problem. In the greedy algorithm approach, decisions are made from the given solution domain. Being greedy, the choice that seems to provide an optimum solution is chosen at each step. Greedy algorithms try to find a localized optimum solution, which may eventually lead to a globally optimized solution. However, in general, greedy algorithms do not provide globally optimized solutions.

Counting Coins This problem is to count up to a desired value by choosing the fewest possible coins, and the greedy approach forces the algorithm to pick the largest possible coin. If we are provided coins of ₹ 1, 2, 5 and 10 and we are asked to count ₹ 18, then the greedy procedure will be −

1 − Select one ₹ 10 coin, the remaining count is 8
2 − Then select one ₹ 5 coin, the remaining count is 3
3 − Then select one ₹ 2 coin, the remaining count is 1
4 − And finally, the selection of one ₹ 1 coin solves the problem

Though it seems to be working fine (for this count we need to pick only 4 coins), if we slightly change the problem the same approach may not produce the optimum result. For the currency system where we have coins of value 1, 7 and 10, counting coins for the value 18 is absolutely optimum, but for a count like 15 it may use more coins than necessary. For example, the greedy approach will use 10 + 1 + 1 + 1 + 1 + 1, a total of 6 coins, whereas the same problem could be solved by using only 3 coins (7 + 7 + 1). Hence, we may conclude that the greedy approach picks an immediately optimized solution and may fail where global optimization is a major concern.
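The coin-counting procedure translates almost directly into code. The C sketch below repeatedly takes the largest coin that still fits; the hard-coded denominations are just the two systems discussed above, and, as noted, the result is optimal for {1, 2, 5, 10} but not for {1, 7, 10}.

#include <stdio.h>

/* Greedy coin counting: always take the largest coin not exceeding the
   remaining amount. Denominations must be given in decreasing order. */
int greedy_coins(const int coins[], int ncoins, int amount) {
    int used = 0;
    for (int i = 0; i < ncoins && amount > 0; i++) {
        while (amount >= coins[i]) {      /* take this coin as long as it fits */
            amount -= coins[i];
            used++;
        }
    }
    return used;
}

int main(void) {
    int indian[] = {10, 5, 2, 1};
    int odd[]    = {10, 7, 1};
    printf("{1,2,5,10}, amount 18: %d coins\n", greedy_coins(indian, 4, 18)); /* 4 */
    printf("{1,7,10},   amount 15: %d coins\n", greedy_coins(odd, 3, 15));    /* 6, though 3 (7+7+1) is possible */
    return 0;
}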

Examples Most networking algorithms use the greedy approach. Here is a list of a few of them −

Travelling Salesman Problem
Prim's Minimal Spanning Tree Algorithm
Kruskal's Minimal Spanning Tree Algorithm
Dijkstra's Shortest Path Algorithm
Graph - Map Coloring
Graph - Vertex Cover
Knapsack Problem
Job Scheduling Problem

There are lots of similar problems that use the greedy approach to find an optimum solution.

Unit-II

Selection Sort

Selection sort is a simple sorting algorithm. This sorting algorithm is an in-place comparison-based algorithm in which the list is divided into two parts, the sorted part at the left end and the unsorted part at the right end. Initially, the sorted part is empty and the unsorted part is the entire list. The smallest element is selected from the unsorted array and swapped with the leftmost element, and that element becomes a part of the sorted array. This process continues moving the unsorted array boundary one element to the right. This algorithm is not suitable for large data sets as its average and worst case complexities are of O(n²), where n is the number of items.

How Selection Sort Works? Consider the following depicted array as an example.

For the first position in the sorted list, the whole list is scanned sequentially. At the first position, where 14 is stored presently, we search the whole list and find that 10 is the lowest value.

So we replace 14 with 10. After one iteration 10, which happens to be the minimum value in the list, appears in the first position of the sorted list.

For the second position, where 33 is residing, we start scanning the rest of the list in a linear manner.

We find that 14 is the second lowest value in the list and it should appear at the second place. We swap these values.

After two iterations, the two smallest values are positioned at the beginning in a sorted manner.

The same process is applied to the rest of the items in the array. Following is a pictorial depiction of the entire sorting process −

Now, let us learn some programming aspects of selection sort.

Algorithm

Step 1 − Set MIN to location 0
Step 2 − Search the minimum element in the list
Step 3 − Swap with value at location MIN
Step 4 − Increment MIN to point to next element
Step 5 − Repeat until list is sorted
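One straightforward way to code these steps in C (a sketch using the example array from the walkthrough):

#include <stdio.h>

void selection_sort(int a[], int n) {
    for (int i = 0; i < n - 1; i++) {      /* boundary between sorted and unsorted parts */
        int min = i;                       /* Step 1: assume the first unsorted element is the minimum */
        for (int j = i + 1; j < n; j++)    /* Step 2: search the unsorted part for the minimum */
            if (a[j] < a[min])
                min = j;
        int tmp = a[i];                    /* Step 3: swap it into position i */
        a[i] = a[min];
        a[min] = tmp;
    }                                      /* Steps 4-5: advance the boundary and repeat */
}

int main(void) {
    int a[] = {14, 33, 27, 10, 35, 19, 42, 44};
    int n = sizeof a / sizeof a[0];
    selection_sort(a, n);
    for (int i = 0; i < n; i++) printf("%d ", a[i]);
    printf("\n");
    return 0;
}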

Bubble Sort Algorithm

Bubble sort is a simple sorting algorithm. This sorting algorithm is a comparison-based algorithm in which each pair of adjacent elements is compared and the elements are swapped if they are not in order. This algorithm is not suitable for large data sets as its average and worst case complexity are of O(n²), where n is the number of items.

How Bubble Sort Works? We take an unsorted array for our example. Bubble sort takes O(n²) time so we're keeping it short and precise.

Bubble sort starts with very first two elements, comparing them to check which one is greater.

In this case, value 33 is greater than 14, so these two are already in sorted positions. Next, we compare 33 with 27.

We find that 27 is smaller than 33 and these two values must be swapped.

The new array should look like this −

Next we compare 33 and 35. We find that both are in already sorted positions.

Then we move to the next two values, 35 and 10.

We know then that 10 is smaller than 35. Hence they are not sorted.

We swap these values. We find that we have reached the end of the array. After one iteration, the array should look like this −

To be precise, we are now showing how an array should look like after each iteration. After the second iteration, it should look like this −

Notice that after each iteration, at least one value moves to the end.

And when no swap is required, bubble sort learns that the array is completely sorted.

Now we should look into some practical aspects of bubble sort.

Algorithm

We assume list is an array of n elements. We further assume that the swap function swaps the values of the given array elements.

begin BubbleSort(list)
   for all elements of list
      if list[i] > list[i+1]
         swap(list[i], list[i+1])
      end if
   end for
   return list
end BubbleSort
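A runnable C version of the same idea follows. It is a sketch rather than a literal transcription of the pseudocode: it adds the early-exit check mentioned above, so the function stops as soon as a full pass makes no swap.

#include <stdio.h>

void bubble_sort(int list[], int n) {
    for (int pass = 0; pass < n - 1; pass++) {
        int swapped = 0;
        for (int i = 0; i < n - 1 - pass; i++) {      /* the largest remaining value bubbles to the end */
            if (list[i] > list[i + 1]) {
                int tmp = list[i];                    /* swap adjacent out-of-order elements */
                list[i] = list[i + 1];
                list[i + 1] = tmp;
                swapped = 1;
            }
        }
        if (!swapped)                                 /* no swap in a full pass: already sorted */
            break;
    }
}

int main(void) {
    int a[] = {14, 33, 27, 35, 10};
    int n = sizeof a / sizeof a[0];
    bubble_sort(a, n);
    for (int i = 0; i < n; i++) printf("%d ", a[i]);
    printf("\n");
    return 0;
}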

Insertion Sort

This is an in-place comparison-based sorting algorithm. Here, a sub-list is maintained which is always sorted. For example, the lower part of an array is maintained to be sorted. An element which is to be 'insert'ed into this sorted sub-list has to find its appropriate place and then it has to be inserted there. Hence the name, insertion sort. The array is searched sequentially and unsorted items are moved and inserted into the sorted sub-list (in the same array). This algorithm is not suitable for large data sets as its average and worst case complexity are of O(n²), where n is the number of items.

How Insertion Sort Works? We take an unsorted array for our example.

Insertion sort compares the first two elements.

It finds that both 14 and 33 are already in ascending order. For now, 14 is in sorted sub-list.

Insertion sort moves ahead and compares 33 with 27.

And finds that 33 is not in the correct position.

It swaps 33 with 27. It also checks with all the elements of sorted sub-list. Here we see that the sorted sub-list has only one element 14, and 27 is greater than 14. Hence, the sorted sub-list remains sorted after swapping.

By now we have 14 and 27 in the sorted sub-list. Next, it compares 33 with 10.

These values are not in a sorted order.

So we swap them.

However, swapping makes 27 and 10 unsorted.

Hence, we swap them too.

Again we find 14 and 10 in an unsorted order.

We swap them again. By the end of the third iteration, we have a sorted sub-list of 4 items.

This process goes on until all the unsorted values are covered in a sorted sub-list. Now we shall see some programming aspects of insertion sort.

Algorithm

Now we have a bigger picture of how this sorting technique works, so we can derive simple steps by which we can achieve insertion sort.

Step 1 − If it is the first element, it is already sorted. return 1;
Step 2 − Pick next element
Step 3 − Compare with all elements in the sorted sub-list
Step 4 − Shift all the elements in the sorted sub-list that are greater than the value to be sorted
Step 5 − Insert the value
Step 6 − Repeat until list is sorted
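A possible C implementation of these steps; it shifts the larger sorted elements right and drops the new value into the gap, which is the usual idiom rather than the repeated swapping shown in the walkthrough.

#include <stdio.h>

void insertion_sort(int a[], int n) {
    for (int i = 1; i < n; i++) {          /* a[0..i-1] is the sorted sub-list */
        int value = a[i];                  /* Step 2: pick the next element */
        int j = i - 1;
        while (j >= 0 && a[j] > value) {   /* Steps 3-4: shift larger sorted elements right */
            a[j + 1] = a[j];
            j--;
        }
        a[j + 1] = value;                  /* Step 5: insert the value */
    }
}

int main(void) {
    int a[] = {14, 33, 27, 10, 35, 19, 42, 44};
    int n = sizeof a / sizeof a[0];
    insertion_sort(a, n);
    for (int i = 0; i < n; i++) printf("%d ", a[i]);
    printf("\n");
    return 0;
}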

Sorting Techniques Sorting refers to arranging data in a particular format. A sorting algorithm specifies the way to arrange data in a particular order. The most common orders are numerical or lexicographical order. The importance of sorting lies in the fact that data searching can be optimized to a very high level if data is stored in a sorted manner. Sorting is also used to represent data in more readable formats. Following are some examples of sorting in real-life scenarios −

Telephone Directory − The telephone directory stores the telephone numbers of people sorted by their names, so that the names can be searched easily.



Dictionary − The dictionary stores words in an alphabetical order so that searching of any word becomes easy.

In-place Sorting and Not-in-place Sorting Sorting algorithms may require some extra space for comparison and temporary storage of a few data elements. Algorithms that do not require any extra space are said to sort in-place, for example within the array itself. This is called in-place sorting. Bubble sort is an example of in-place sorting.

However, in some sorting algorithms, the program requires space which is more than or equal to the elements being sorted. Sorting which uses equal or more space is called not-in-place sorting. Merge-sort is an example of not-in-place sorting.

Stable and Not Stable Sorting If a sorting algorithm, after sorting the contents, does not change the sequence of similar content in which they appear, it is called stable sorting.

If a sorting algorithm, after sorting the contents, changes the sequence of similar content in which they appear, it is called unstable sorting.

Stability of an algorithm matters when we wish to maintain the sequence of original elements, like in a tuple for example.

Adaptive and Non-Adaptive Sorting Algorithm A sorting algorithm is said to be adaptive, if it takes advantage of already 'sorted' elements in the list that is to be sorted. That is, while sorting if the source list has some element already sorted, adaptive algorithms will take this into account and will try not to re-order them.

A non-adaptive algorithm is one which does not take into account the elements which are already sorted. They try to force every single element to be re-ordered to confirm their sortedness.

Important Terms Some terms are generally coined while discussing sorting techniques, here is a brief introduction to them −

Increasing Order A sequence of values is said to be in increasing order, if the successive element is greater than the previous one. For example, 1, 3, 4, 6, 8, 9 are in increasing order, as every next element is greater than the previous element.

Decreasing Order A sequence of values is said to be in decreasing order, if the successive element is less than the current one. For example, 9, 8, 6, 4, 3, 1 are in decreasing order, as every next element is less than the previous element.

Non-Increasing Order A sequence of values is said to be in non-increasing order, if the successive element is less than or equal to its previous element in the sequence. This order occurs when the sequence contains duplicate values. For example, 9, 8, 6, 3, 3, 1 are in non-increasing order, as every next element is less than or equal to (in case of 3) but not greater than any previous element.

Non-Decreasing Order A sequence of values is said to be in non-decreasing order, if the successive element is greater than or equal to its previous element in the sequence. This order occurs when the sequence contains duplicate values. For example, 1, 3, 3, 6, 8, 9 are in non-decreasing order, as every next element is greater than or equal to (in case of 3) but not less than the previous one.

Quick Sort

Quick sort is a highly efficient sorting algorithm and is based on partitioning an array of data into smaller arrays. A large array is partitioned into two arrays, one of which holds values smaller than a specified value, say the pivot, based on which the partition is made, and another array holds values greater than the pivot value. Quick sort partitions an array and then calls itself recursively twice to sort the two resulting sub-arrays. This algorithm is quite efficient for large-sized data sets, as its average case complexity is O(n log n); its worst case complexity is O(n²), where n is the number of items.

Partition in Quick Sort The following representation explains how to find the pivot value in an array.

The pivot value divides the list into two parts. And recursively, we find the pivot for each sub-list until all lists contain only one element.

Quick Sort Pivot Algorithm

Based on our understanding of partitioning in quick sort, we will now try to write an algorithm for it, which is as follows.

Step 1 − Choose the highest index value as pivot
Step 2 − Take two variables to point left and right of the list excluding pivot
Step 3 − left points to the low index
Step 4 − right points to the high index
Step 5 − while value at left is less than pivot move right
Step 6 − while value at right is greater than pivot move left
Step 7 − if both step 5 and step 6 do not match swap left and right
Step 8 − if left ≥ right, the point where they met is the new pivot

Quick Sort Algorithm

Using pivot algorithm recursively, we end up with smaller possible partitions. Each partition is then processed for quick sort. We define recursive algorithm for quicksort as follows −

Step 1 − Make the right-most index value pivot
Step 2 − Partition the array using pivot value
Step 3 − Quicksort left partition recursively
Step 4 − Quicksort right partition recursively

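A compact C sketch of the recursive procedure above. It uses the right-most element as the pivot and a simple partition scheme (Lomuto-style); this is one common way to realize the steps, not the only one.

#include <stdio.h>

static void swap(int *x, int *y) { int t = *x; *x = *y; *y = t; }

/* Partition a[low..high] around the right-most element (the pivot)
   and return the pivot's final index. */
static int partition(int a[], int low, int high) {
    int pivot = a[high];
    int i = low - 1;                       /* boundary of the "less than pivot" region */
    for (int j = low; j < high; j++)
        if (a[j] < pivot)
            swap(&a[++i], &a[j]);
    swap(&a[i + 1], &a[high]);             /* place the pivot between the two regions */
    return i + 1;
}

void quick_sort(int a[], int low, int high) {
    if (low < high) {
        int p = partition(a, low, high);   /* Steps 1-2: choose pivot and partition */
        quick_sort(a, low, p - 1);         /* Step 3: sort the left partition  */
        quick_sort(a, p + 1, high);        /* Step 4: sort the right partition */
    }
}

int main(void) {
    int a[] = {14, 33, 27, 10, 35, 19, 42, 44};
    int n = sizeof a / sizeof a[0];
    quick_sort(a, 0, n - 1);
    for (int i = 0; i < n; i++) printf("%d ", a[i]);
    printf("\n");
    return 0;
}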

Heap Sort

A heap is a special case of a balanced binary tree data structure where the root-node key is compared with its children and arranged accordingly. If α has a child node β then − key(α) ≥ key(β). As the value of the parent is greater than that of the child, this property generates a Max Heap. Based on this criterion, a heap can be of two types. For input → 35 33 42 10 14 19 27 44 26 31

Min-Heap − Where the value of the root node is less than or equal to either of its children.

Max-Heap − Where the value of the root node is greater than or equal to either of its children.

Both trees are constructed using the same input and order of arrival.

Max Heap Construction Algorithm We shall use the same example to demonstrate how a Max Heap is created. The procedure to create Min Heap is similar but we go for min values instead of max values. We are going to derive an algorithm for max heap by inserting one element at a time. At any point of time, heap must maintain its property. While insertion, we also assume that we are inserting a node in an already heapified tree. Step 1 − Create a new node at the end of heap.

Step 2 − Assign new value to the node.
Step 3 − Compare the value of this child node with its parent.
Step 4 − If value of parent is less than child, then swap them.
Step 5 − Repeat step 3 & 4 until Heap property holds.

Note − In Min Heap construction algorithm, we expect the value of the parent node to be less than that of the child node. Let's understand Max Heap construction by an animated illustration. We consider the same input sample that we used earlier.

Max Heap Deletion Algorithm

Let us derive an algorithm to delete from max heap. Deletion in Max (or Min) Heap always happens at the root to remove the Maximum (or minimum) value.

Step 1 − Remove root node.
Step 2 − Move the last element of last level to root.
Step 3 − Compare the value of this child node with its parent.
Step 4 − If value of parent is less than child, then swap them.
Step 5 − Repeat step 3 & 4 until Heap property holds.
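Both procedures can be sketched in C with an array-based heap, where index 0 is the root and the children of index i are 2i+1 and 2i+2. The array layout and helper names below are assumptions made for the example; insertion bubbles a value up, and deletion moves the last element to the root and sifts it down.

#include <stdio.h>

static void swap(int *x, int *y) { int t = *x; *x = *y; *y = t; }

/* Insertion: place the new value at the end, then bubble it up while it is
   larger than its parent (max-heap property). */
void heap_insert(int heap[], int *n, int value) {
    int i = (*n)++;
    heap[i] = value;
    while (i > 0 && heap[(i - 1) / 2] < heap[i]) {
        swap(&heap[(i - 1) / 2], &heap[i]);
        i = (i - 1) / 2;
    }
}

/* Deletion: remove the root (maximum), move the last element to the root,
   then sift it down while a child is larger. */
int heap_delete_max(int heap[], int *n) {
    int max = heap[0];
    heap[0] = heap[--(*n)];
    int i = 0;
    for (;;) {
        int largest = i, l = 2 * i + 1, r = 2 * i + 2;
        if (l < *n && heap[l] > heap[largest]) largest = l;
        if (r < *n && heap[r] > heap[largest]) largest = r;
        if (largest == i) break;
        swap(&heap[i], &heap[largest]);
        i = largest;
    }
    return max;
}

int main(void) {
    int input[] = {35, 33, 42, 10, 14, 19, 27, 44, 26, 31};
    int heap[16], n = 0;
    for (int i = 0; i < 10; i++)
        heap_insert(heap, &n, input[i]);
    while (n > 0)
        printf("%d ", heap_delete_max(heap, &n));   /* prints the values in decreasing order */
    printf("\n");
    return 0;
}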

Merge Sort Algorithm Merge sort is a sorting technique based on the divide and conquer technique. With a worst-case time complexity of O(n log n), it is one of the most respected algorithms. Merge sort first divides the array into equal halves and then combines them in a sorted manner.

How Merge Sort Works? To understand merge sort, we take an unsorted array as the following −

We know that merge sort first divides the whole array iteratively into equal halves until the atomic values are reached. We see here that an array of 8 items is divided into two arrays of size 4.

This does not change the sequence of appearance of items in the original. Now we divide these two arrays into halves.

We further divide these arrays and we achieve atomic value which can no more be divided.

Now, we combine them in exactly the same manner as they were broken down. Please note the color codes given to these lists. We first compare the element for each list and then combine them into another list in a sorted manner. We see that 14 and 33 are in sorted positions. We compare 27 and 10 and in the target list of 2 values we put 10 first, followed by 27. We change the order of 19 and 35 whereas 42 and 44 are placed sequentially.

In the next iteration of the combining phase, we compare lists of two data values and merge them into a list of four data values, placing all in sorted order.

After the final merging, the list should look like this −

Now we should learn some programming aspects of merge sorting.

Algorithm

Merge sort keeps on dividing the list into equal halves until it can no more be divided. By definition, if it is only one element in the list, it is sorted. Then, merge sort combines the smaller sorted lists keeping the new list sorted too.

Step 1 − if it is only one element in the list it is already sorted, return.
Step 2 − divide the list recursively into two halves until it can no more be divided.
Step 3 − merge the smaller lists into new list in sorted order.
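A C sketch of this scheme; the recursive formulation and the temporary buffer are implementation choices of the example, with the merge function doing the step-3 combination of two sorted halves.

#include <stdio.h>
#include <string.h>

/* Merge the two sorted halves a[lo..mid] and a[mid+1..hi] using buffer tmp. */
static void merge(int a[], int tmp[], int lo, int mid, int hi) {
    int i = lo, j = mid + 1, k = lo;
    while (i <= mid && j <= hi)
        tmp[k++] = (a[i] <= a[j]) ? a[i++] : a[j++];
    while (i <= mid) tmp[k++] = a[i++];
    while (j <= hi)  tmp[k++] = a[j++];
    memcpy(&a[lo], &tmp[lo], (hi - lo + 1) * sizeof(int));
}

void merge_sort(int a[], int tmp[], int lo, int hi) {
    if (lo >= hi)                       /* Step 1: one element is already sorted */
        return;
    int mid = lo + (hi - lo) / 2;
    merge_sort(a, tmp, lo, mid);        /* Step 2: divide and sort each half */
    merge_sort(a, tmp, mid + 1, hi);
    merge(a, tmp, lo, mid, hi);         /* Step 3: merge the sorted halves */
}

int main(void) {
    int a[] = {14, 33, 27, 10, 35, 19, 42, 44};
    int tmp[8];
    merge_sort(a, tmp, 0, 7);
    for (int i = 0; i < 8; i++) printf("%d ", a[i]);
    printf("\n");
    return 0;
}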

Shell Sort Shell sort is a highly efficient sorting algorithm and is based on the insertion sort algorithm. This algorithm avoids large shifts, as happens in insertion sort when a smaller value is to the far right and has to be moved to the far left. This algorithm uses insertion sort on widely spread elements first to sort them, and then sorts the less widely spaced elements. This spacing is termed the interval. The interval is calculated based on Knuth's formula as −

Knuth's Formula
h = h * 3 + 1
where h is the interval, with initial value 1

This algorithm is quite efficient for medium-sized data sets. Its running time depends on the gap sequence; it is better than the O(n²) of plain insertion sort, though not linear (with Knuth's gaps the worst case is about O(n^(3/2))), where n is the number of items.

How Shell Sort Works? Let us consider the following example to have an idea of how shell sort works. We take the same array we have used in our previous examples. For our example and ease of understanding, we take the interval of 4. Make a virtual sub-list of all values located at the interval of 4 positions. Here these values are {35, 14}, {33, 19}, {42, 27} and {10, 44}

We compare values in each sub-list and swap them (if necessary) in the original array. After this step, the new array should look like this −

Then, we take interval of 2 and this gap generates two sub-lists - {14, 27, 35, 42}, {19, 10, 33, 44}

We compare and swap the values, if required, in the original array. After this step, the array should look like this −

Finally, we sort the rest of the array using interval of value 1. Shell sort uses insertion sort to sort the array.

Following is the step-by-step depiction −

We see that it required only four swaps to sort the rest of the array.

Algorithm

Following is the algorithm for shell sort.

Step 1 − Initialize the value of h
Step 2 − Divide the list into smaller sub-lists of equal interval h
Step 3 − Sort these sub-lists using insertion sort
Step 4 − Repeat until complete list is sorted
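A C sketch following these steps. For brevity the gap is simply halved each round (n/2, n/4, ..., 1) instead of being computed with Knuth's formula; that gap sequence is an assumption of this example.

#include <stdio.h>

void shell_sort(int a[], int n) {
    for (int gap = n / 2; gap > 0; gap /= 2) {        /* Steps 1-2: choose interval h and shrink it */
        for (int i = gap; i < n; i++) {               /* Step 3: gapped insertion sort */
            int value = a[i];
            int j = i;
            while (j >= gap && a[j - gap] > value) {  /* shift sub-list elements right */
                a[j] = a[j - gap];
                j -= gap;
            }
            a[j] = value;
        }
    }                                                 /* Step 4: repeat with smaller gaps until sorted */
}

int main(void) {
    int a[] = {35, 33, 42, 10, 14, 19, 27, 44};
    int n = sizeof a / sizeof a[0];
    shell_sort(a, n);
    for (int i = 0; i < n; i++) printf("%d ", a[i]);
    printf("\n");
    return 0;
}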

External Sorting Review External Sorting--This term is used to refer to sorting methods that are employed when the data to be sorted is too large to fit in primary memory.

Characteristics of External Sorting

1. During the sort, some of the data must be stored externally. Typically the data will be stored on tape or disk. 2. The cost of accessing data is significantly greater than either bookkeeping or comparison costs. 3. There may be severe restrictions on access. For example, if tape is used, items must be accessed sequentially.

Criteria for Developing an External Sorting Algorithm 1. Minimize number of times an item is accessed. 2. Access items in sequential order

Sort Merge Sort merge is the strategy of choice for external sorting because it: 1. Accesses records sequentially 2. Minimizes block accesses 3. Gives a stable sort

Sort Merge Strategy 1. Divide the file into runs such that the size of a run is small enough to fit into main memory 2. Sort each run in main memory using a fast in-memory sorting algorithm 3. Merge the resulting runs together into successively bigger runs, until the file is sorted.

Balanced Multiway Merging Strategy
1. Select an equal number of I/O units (e.g., tapes, disks)
2. Sort Step
   o Select half of the I/O units to be output files
   o Using an in-memory sorting algorithm, create the initial runs and divide them evenly among the output files
3. Merge Step
   o (a) On each merge step, alternate the I/O units so that on one step, a unit is an input file and on the next step it is an output file
   o (b) Read one run from each of the input files, merge the runs, and store the resulting run on an output file. Alternate output files so that the runs are evenly distributed on the output files.
   o (c) Repeat step b until all the runs on the input files have been merged.
   o (d) Repeat steps a-c until the file has been sorted.
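The sort-merge strategy can be simulated in memory with a short C sketch: the "file" is split into fixed-size runs (pretending only RUN records fit in main memory), each run is sorted with an in-memory sort, and the runs are then merged by repeatedly taking the smallest current head. A real external sort would keep the runs on disk or tape and buffer the I/O; those details are omitted here, and the data and run size are illustrative.

#include <stdio.h>
#include <stdlib.h>

#define RUN 4                                   /* pretend only RUN records fit in main memory */

static int cmp(const void *a, const void *b) {
    return *(const int *)a - *(const int *)b;
}

int main(void) {
    int data[] = {35, 33, 42, 10, 14, 19, 27, 44, 26, 31, 18, 7};
    int n = sizeof data / sizeof data[0];
    int out[12];

    /* Phase 1: create sorted runs of length RUN (each run stands in for one
       chunk that would be read from disk, sorted in memory, and written back). */
    for (int start = 0; start < n; start += RUN) {
        int len = (n - start < RUN) ? n - start : RUN;
        qsort(&data[start], len, sizeof(int), cmp);
    }

    /* Phase 2: merge the runs; head[r] is the next unread position of run r,
       playing the role of one input buffer per run. */
    int nruns = (n + RUN - 1) / RUN;
    int head[8];
    for (int r = 0; r < nruns; r++) head[r] = r * RUN;

    for (int k = 0; k < n; k++) {
        int best = -1;
        for (int r = 0; r < nruns; r++) {       /* pick the smallest current head */
            int end = (r + 1) * RUN < n ? (r + 1) * RUN : n;
            if (head[r] < end && (best < 0 || data[head[r]] < data[head[best]]))
                best = r;
        }
        out[k] = data[head[best]++];
    }

    for (int k = 0; k < n; k++) printf("%d ", out[k]);
    printf("\n");
    return 0;
}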

Unit-III

Lower Bounding Techniques

Decision Tree

A decision tree is a structure that includes a root node, branches, and leaf nodes. Each internal node denotes a test on an attribute, each branch denotes the outcome of a test, and each leaf node holds a class label. The topmost node in the tree is the root node. The following decision tree is for the concept "buy computer", which indicates whether a customer at a company is likely to buy a computer or not. Each internal node represents a test on an attribute. Each leaf node represents a class.

The benefits of having a decision tree are as follows −

It does not require any domain knowledge.
It is easy to comprehend.
The learning and classification steps of a decision tree are simple and fast.

Tree Pruning Tree pruning is performed in order to remove anomalies in the training data due to noise or outliers. The pruned trees are smaller and less complex.

Tree Pruning Approaches There are two approaches to prune a tree − 

Pre-pruning − The tree is pruned by halting its construction early.



Post-pruning - This approach removes a sub-tree from a fully grown tree.

Cost Complexity The cost complexity is measured by the following two parameters −

Number of leaves in the tree, and
Error rate of the tree.

Adversaries

String processing

Processing strings of characters is one of the oldest applications of mechanical computers, arguably predating numerical computation by at least fifty years. Assuming you've already solved the problem of how to represent characters in memory (e.g. as the C char type encoded in ASCII), there are two standard ways to represent strings:

As a delimited string, where the end of a string is marked by a special character. The advantages of this method are that only one extra byte is needed to mark the end of an arbitrarily long string, that strings can be manipulated by simple pointer operations, and in some cases that common string operations that involve processing the entire string can be performed very quickly. The disadvantage is that the delimiter can't appear inside any string, which limits what kind of data you can store in a string.

As a counted string, where the string data is prefixed or supplemented with an explicit count of the number of characters in the string. The advantage of this representation is that a string can hold arbitrary data (including delimiter characters) and that one can quickly jump to the end of the string without having to scan its entire length. The disadvantage is that maintaining a separate count typically requires more space than adding a one-byte delimiter (unless you limit your string length to 255 characters) and that more care needs to be taken to make sure that the count is correct.
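The two representations can be contrasted with a tiny C sketch: C's native strings are delimited by a '\0' byte, while a counted string is modeled here by a small struct that carries an explicit length (the struct and variable names are made up for illustration).

#include <stdio.h>
#include <string.h>

/* Counted string: an explicit length plus a pointer to the bytes.
   The data may contain any byte, including '\0'. */
struct counted_string {
    size_t len;
    const char *data;
};

int main(void) {
    /* Delimited string: the '\0' terminator marks the end, so strlen()
       must scan the whole string to find its length. */
    const char *delimited = "hello";
    printf("delimited length: %zu\n", strlen(delimited));

    /* Counted string: the length is stored, so no scan is needed and an
       embedded '\0' (after "hel") does not truncate the data. */
    struct counted_string counted = { 7, "hel\0lo!" };
    printf("counted length:   %zu\n", counted.len);
    return 0;
}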

Knuth-Morris-Pratt (KMP)

Algorithm: Knuth-Morris-Pratt
Processing time: O(m)
Matching time: O(n)

Algorithm
• The Knuth-Morris-Pratt (KMP) string searching algorithm differs from the brute-force algorithm by keeping track of information gained from previous comparisons.
• A failure function (f) is computed that indicates how much of the last comparison can be reused if it fails.
• Specifically, f(j) is defined to be the longest prefix of the pattern P[0..j] that is also a suffix of P[1..j] - Note: not a suffix of P[0..j]
• Example - values of the KMP failure function:

j     0 1 2 3 4 5
P[j]  a b a b a c
f(j)  0 0 1 2 3 0

• This shows how much of the beginning of the string matches up to the portion immediately preceding a failed comparison.
- If the comparison fails at position 4, we know the a, b in positions 2, 3 is identical to positions 0, 1.

Pros
1. Its searching complexity is O(m+n), which is faster than brute force and Rabin-Karp.
2. It's fairly easy to implement.

Cons
1. It needs additional space and time - O(m) for pre-processing.
2. It can be optimized a bit (Knuth-Morris-Pratt).

Final Words
Obviously this algorithm is quite useful because it improves in a very elegant manner on brute force matching. On the other hand, you must know that there are faster string searching algorithms, like the Boyer-Moore algorithm. However, the Morris-Pratt algorithm can be quite useful in many cases, so understanding its principles can be very handy.
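A C sketch of the failure function and the matching loop described above; running it on the pattern "ababac" reproduces the table 0 0 1 2 3 0 given earlier. The buffer size and the test strings are assumptions of the example.

#include <stdio.h>
#include <string.h>

/* f[j] = length of the longest prefix of P[0..j] that is also a suffix of P[1..j]. */
static void compute_failure(const char *p, int m, int f[]) {
    f[0] = 0;
    int k = 0;
    for (int j = 1; j < m; j++) {
        while (k > 0 && p[j] != p[k])
            k = f[k - 1];              /* fall back to the next shorter reusable prefix */
        if (p[j] == p[k])
            k++;
        f[j] = k;
    }
}

/* Returns the index of the first occurrence of pat in text, or -1. */
int kmp_search(const char *text, const char *pat) {
    int n = (int)strlen(text), m = (int)strlen(pat);
    int f[64];                         /* assumes the pattern is shorter than 64 characters */
    compute_failure(pat, m, f);
    for (int i = 0, k = 0; i < n; i++) {
        while (k > 0 && text[i] != pat[k])
            k = f[k - 1];              /* reuse information from previous comparisons */
        if (text[i] == pat[k])
            k++;
        if (k == m)
            return i - m + 1;          /* full match ends at position i */
    }
    return -1;
}

int main(void) {
    const char *pat = "ababac";
    int f[64];
    compute_failure(pat, 6, f);
    for (int j = 0; j < 6; j++) printf("%d ", f[j]);   /* prints 0 0 1 2 3 0 */
    printf("\nfound at index %d\n", kmp_search("abacababacab", pat));
    return 0;
}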

Boyer-Moore

Rabin-Karp

The Rabin-Karp algorithm is a string searching algorithm created by Michael O. Rabin and Richard M. Karp in 1987 that uses hashing to find a substring in a text. It is used for multiple pattern matching rather than single pattern matching. For a text of length n and a pattern of length m, its average and best case running time is O(n), but the worst case performance of O(nm) is a reason why it is not widely used. However,

it has the unique advantage of being able to find any one of k strings or less in O(n) time on average, regardless of the magnitude of k. A practical application of Rabin-Karp is detecting plagiarism. Given source material, Rabin-Karp can rapidly search through a paper for instances of sentences from the source material, ignoring details such as case and punctuation. Because of the abundance of the sought strings, single-string searching algorithms are impractical.


Shifting substrings search and competing algorithms

A simple substring search algorithm checks all possible positions:

1 function NaiveSearch(string s[1..n], string sub[1..m])
2     for i from 1 to n-m+1
3         for j from 1 to m
4             if s[i+j-1] ≠ sub[j]
5                 jump to next iteration of outer loop
6         return i
7     return not found

This algorithm works well in many practical cases, but can exhibit relatively long running times on certain examples, such as searching for a string of 10,000 "a"s followed by a "b" in a string of 10 million "a"s, in which case it exhibits its worst-case Θ(mn) time. The Knuth-Morris-Pratt algorithm reduces this to Θ(n) time using precomputation to examine each text character only once; the Boyer-Moore algorithm skips forward not by 1 character, but by as many as possible for the search to succeed, effectively decreasing the number of times we iterate through the outer loop, so that the number of characters examined can be as small as n/m in the best case. The Rabin-Karp algorithm focuses instead on speeding up lines 3-6.

Use of hashing for shifting substring search

Rather than pursuing more sophisticated skipping, the Rabin-Karp algorithm seeks to speed up the testing of equality of the pattern to the substrings in the text by using a hash function. A hash function is a function which converts every string into a numeric value, called its hash value; for example, we might have hash("hello") = 5. Rabin-Karp exploits the fact that if two strings are equal, their hash values are also equal. Thus, it would seem all we have to do is compute the hash value of the substring we're searching for, and then look for a substring with the same hash value. However, there are two problems with this. First, because there are so many different strings, to keep the hash values small we have to assign some strings the same number. This means that if the hash values match, the strings might not match; we have to verify that they do, which can take a long time for long substrings. Luckily, a good hash function promises us that on most reasonable inputs, this won't happen too often, which keeps the average search time good. The algorithm is as shown:

1 function RabinKarp(string s[1..n], string sub[1..m])
2     hsub := hash(sub[1..m]); hs := hash(s[1..m])
3     for i from 1 to n-m+1
4         if hs = hsub
5             if s[i..i+m-1] = sub
6                 return i
7         hs := hash(s[i+1..i+m])
8     return not found

Lines 2, 5, and 7 each require Θ(m) time. However, line 2 is only executed once, and line 5 is only executed if the hash values match, which is unlikely to happen more than a few times. Line 4 is executed n times, but only requires constant time. So the only problem is line 7. If we naively recompute the hash value for the substring s[i+1..i+m], this would require Θ(m) time, and since this is done on each loop, the algorithm would require Ω(mn) time, the same as the most naive algorithms. The trick to solving this is to note that the variable hs already contains the hash value of s[i..i+m-1]. If we can use this to compute the next hash value in constant time, then our problem will be solved. We do this using what is called a rolling hash. A rolling hash is a hash function specially designed to enable this operation. One simple example is adding up the values of each character in the substring. Then, we can use this formula to compute the next hash value in constant time: hash(s[i+1..i+m]) = hash(s[i..i+m-1]) - s[i] + s[i+m]

This simple function works, but will result in statement 5 being executed more often than other more sophisticated rolling hash functions such as those discussed in the next section. Notice that if we're very unlucky, or have a very bad hash function such as a constant function, line 5 might very well be executed n times, on every iteration of the loop. Because it requires Θ(m) time, the whole algorithm then takes a worst-case Θ(mn) time.

Hash function used

The key to Rabin-Karp performance is the efficient computation of hash values of the successive substrings of the text. One popular and effective rolling hash function treats every substring as a number in some base, the base being usually a large prime. For example, if the substring is "hi" and the base is 101, the hash value would be 104 × 101¹ + 105 × 101⁰ = 10609 (ASCII of 'h' is 104 and of 'i' is 105). Technically, this algorithm is only similar to the true number in a non-decimal system representation, since for example we could have the "base" less than one of the "digits". See hash function for a much more detailed discussion. The essential benefit achieved by such representation is that it is possible to compute the hash value of the next substring from the previous one by doing only a constant number of operations, independent of the substrings' lengths. For example, if we have text "abracadabra" and we are searching for a pattern of length 3, we can compute the hash of "bra" from the hash for "abr" (the previous substring) by subtracting the number added for the first 'a' of "abr", i.e. 97 × 101² (97 is ASCII for 'a' and 101 is the base we are using), multiplying by the base and adding for the last 'a' of "bra", i.e. 97 × 101⁰ = 97. If the substrings in question are long, this algorithm achieves great savings compared with many other hashing schemes. Theoretically, there exist other algorithms that could provide convenient recomputation, e.g. multiplying together the ASCII values of all characters so that shifting the substring would only entail dividing by the first character and multiplying by the last. The limitation, however, is the limited size of the integer data type and the necessity of using modular arithmetic to scale down the hash results, for which see the hash function article; meanwhile, those naive hash functions that would not produce large numbers quickly, like just adding ASCII values, are likely to cause many hash collisions and hence slow down the algorithm. Hence the described hash function is typically the preferred one in Rabin-Karp.
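The polynomial rolling hash described above can be written out in C. The base 101 follows the example in the text; the modulus, variable names and the demonstration on "abracadabra" are assumptions added so the arithmetic stays within a 64-bit integer.

#include <stdio.h>
#include <string.h>

#define BASE 101ULL
#define MOD  1000000007ULL

/* Hash of s[0..m-1] treated as a number in base BASE (mod MOD). */
static unsigned long long hash_of(const char *s, int m) {
    unsigned long long h = 0;
    for (int i = 0; i < m; i++)
        h = (h * BASE + (unsigned char)s[i]) % MOD;
    return h;
}

int main(void) {
    const char *text = "abracadabra";
    int m = 3;

    /* BASE^(m-1) mod MOD, used to remove the contribution of the leading character. */
    unsigned long long high = 1;
    for (int i = 0; i < m - 1; i++)
        high = (high * BASE) % MOD;

    unsigned long long h = hash_of(text, m);           /* hash of "abr" */
    printf("hash(\"abr\") = %llu\n", h);

    /* Roll the window one character to the right: drop text[i-1], append text[i+m-1]. */
    for (int i = 1; i + m <= (int)strlen(text); i++) {
        h = (h + MOD - ((unsigned char)text[i - 1] * high) % MOD) % MOD;  /* remove leading char */
        h = (h * BASE + (unsigned char)text[i + m - 1]) % MOD;            /* shift and add new char */
        printf("hash(\"%.3s\") = %llu (recomputed: %llu)\n",
               &text[i], h, hash_of(&text[i], m));                        /* both values agree */
    }
    return 0;
}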

Rabin-Karp and multiple pattern search

Rabin-Karp is inferior for single pattern searching to the Knuth-Morris-Pratt algorithm, the Boyer-Moore string searching algorithm and other faster single pattern string searching algorithms because of its slow worst case behavior. However, Rabin-Karp is an algorithm of choice for multiple pattern search. That is, if we want to find any of a large number, say k, fixed length patterns in a text, we can create a simple variant of Rabin-Karp that uses a Bloom filter or a set data structure to check whether the hash of a given string belongs to a set of hash values of patterns we are looking for:

function RabinKarpSet(string s[1..n], set of string subs, m) {
    set hsubs := emptySet
    for each sub in subs
        insert hash(sub[1..m]) into hsubs
    hs := hash(s[1..m])
    for i from 1 to n-m+1
        if hs ∈ hsubs
            if s[i..i+m-1] = a substring with hash hs
                return i
        hs := hash(s[i+1..i+m])
    return not found
}

Here we assume all the substrings have a fixed length m, but this assumption can be eliminated. We simply compare the current hash value against the hash values of all the substrings simultaneously using a quick lookup in our set data structure, and then verify any match we find against all substrings with that hash value. Other algorithms can search for a single pattern in O(n) time, and hence they can be used to search for k patterns in O(n k) time. In contrast, the variant Rabin-Karp above can find all k patterns in O(n+k) time in expectation, because a hash table checks whether a substring hash equals any of the pattern hashes in O(1) time.
