Listen to this story
An unordered collection of elements is sorted by placing them in a monotonically rising (or decreasing) order. The efficiency of sorting big data sets, both in terms of time and memory used, calls for advancements in sorting algorithms and their implementations in modern computing. Machine learning models might be used to improve these sorting algorithms. By analyzing experimental data, machine learning enables the creation of adaptable algorithms. In order to choose an algorithm depending on the properties of the data set, this article reviews evolutionary algorithms. Following are the topics to be covered.
Table of contents
- Sorting in machine learning
- Genetic algorithms used for optimizing sorting
- Advantages of using genetic algorithm
Since the dawn of computers, sorting has drawn much attention as a fundamental data activity. Let’s understand sorting algorithms.
Sign up for your weekly dose of what's up in emerging technology.
Sorting in machine learning
A sorting operation involves placing data in a specific order. The sorting algorithm defines the technique for arranging data in a given order. Data searching may be highly efficient when data is kept in a sorted way, which is why sorting is important. Data representation in more comprehensible ways is another use for sorting.
Algorithms for sorting data may need a little more room for comparison and short-term storage of a few data components. These algorithms are claimed to sort in-place, for instance, within the array itself, and they don’t take up any more space. It is referred to as in-place sorting. An illustration of in-place sorting is the bubble sort. But for some sorting algorithms, the amount of space used by the programme is greater than or equal to the number of elements to be sorted. Not-in-place sorting is defined as sorting with an equal or greater space need. An illustration of not-in-place sorting is merge-sort.
Based on the effects after sorting the data, these sorting algorithms are classified as stable and non-stable. For example, when we wish to maintain the sequence of elements in a tuple, algorithm stability matters.
If a sorting algorithm uses items that have previously been “sorted” in the list that needs to be sorted, it is said to be adaptive. In other words, while sorting, adaptive algorithms will aim to avoid reordering elements if the source list already has part of them sorted. An algorithm that ignores the items that have previously been sorted is said to be non-adaptive. To ensure that the elements are properly sorted, they attempt to push each element into a new order.
Why optimization is required?
Although programme optimization has been greatly automated by compiler technology, a lot of human involvement is still required to produce high-quality code. There are two justifications for this assertion:
- The inconsistent implementations of compilers.
- Traditional compilers don’t contain semantic information, which limits their ability to change data.
Are you looking for a complete repository of Python libraries used in data science, check out here.
Genetic algorithms used for optimizing sorting
The optimal algorithm can only be found through searching because there are no analytical models of the performance of sorting algorithms in terms of the architectural parameters of the machine. Additionally, past research on sorting complexity was predicated on the incorrect assumption that accessing each piece takes the same amount of time given modern technology.
In a genetic algorithm, the parameter values of the sorting and selection primitives, as well as their composition, determine the search space. The goal of the search is to find the hierarchical sorting that best suits the machine’s architectural features and the input set’s characteristics.
An analogy between the genetic structure and behaviour of chromosomes in the population serves as the foundation for genetic algorithms. The basis of GAs based on this comparison is as follows.
- The population’s members compete for resources and mate.
- Successive (fittest) individuals then mate to have more children than other individuals.
- Parents occasionally have children that are better than either parent; this is because genes from the “fittest” parents spread across the generation.
- As a result, each next generation becomes more environment-friendly.
Sorting and selection primitives are employed as the tree-based nodes in the schema. To create new children and alter the population, genetic operators are utilized. The two operations that the majority of genetic algorithms employ are crossover and mutation.
Subtrees from various trees are traded during the crossover. Crossover’s goal is to produce new offspring who perform better than their parents. When the new children inherit the finest subtrees from the parents, this is likely to occur. In most cases, a single point crossover is employed, with the crossover point being chosen at random.
This operator makes adjustments to just one tree. It gives the population variance. The population cannot continue to be the same after any given generation due to the mutation. This strategy allows the search to partially escape local optima. In an effort to identify better values, the mutation modifies the parameter values. The following modifications are possible with the mutation operator:
- Change the parameters’ values at random when sorting and choosing basic nodes.
- Switch two subtrees.
- Including a new subtree.
- Take a subtree out. With this technique, unnecessary subtrees can be eliminated.
The likelihood of an individual reproducing is determined by the fitness function. The likelihood that an organism will reproduce and evolve increases with fitness. The fitness function will be the performance. But in designing the fitness function, the following two factors have been taken into account:
- A sorting algorithm that works well with all potential inputs is the objective. Therefore, a tree’s base fitness is its average performance. A penalty is set on the trees with variable performance by increasing the base fitness by a number that relies on the standard deviation of their performance while sorting the test inputs; however, the sorting algorithm has to perform well consistently across inputs is also an objective.
- The population’s fitness variance is significant in the initial generations. In other words, some sorting trees do far better than others. Since these few trees would have a considerably higher likelihood of reproducing, most of the children would be their descendants if our fitness function was directly proportional to the performance of the tree. These descendants would eventually make up the majority of the population. Premature convergence might emerge from this, which would restrict the algorithm from investigating portions of the search space outside of the vicinity of the highly suited trees. To solve this issue, our fitness function takes advantage of the performance order or rank of the population’s sorting trees. The absolute performance difference between trees is ignored when applying the performance ranking, therefore, trees with lower performance have a higher likelihood of reproducing than trees with higher performance. This eliminates the issue of early convergence and convergence to a local optimum.
Advantages and disadvantages of using genetic algorithms
There are some benefits of using genetic algorithms over other optimization methods.
- Sorting algorithms can be expressed as a tree using primitives, with each primitive representing a node. As a result, genetic algorithms may be simply utilized to explore the universe of potential trees for the best tree structure and parameter values associated with each node.
- Genetic algorithms maintain the finest subtrees and increase their chances of reproduction. Because a sub-tree is also a sorting algorithm, sorting algorithms can make use of this.
A genetic algorithm is a promising problem-solving tool, but it has a few flaws that might lead to inefficiency.
- It is tough to fine-tune the settings. It necessitates the determination of numerous parameters such as population size, mutation rate, and maximum run duration, as well as the creation of algorithms for selection, recombination, and mutation. Finding viable alternatives for these is a difficult task with little to no theoretical backing.
- There is no guarantee that convergence will occur. There is no guarantee that the algorithm will find a global optimum. It is possible that it will become caught in one of the local optima.
Choosing a suitable evolution algorithm is a critical decision. The evolution algorithm decides how many children will be produced, how many members of the current generation will be replaced, and so on. With this article, we have understood the optimization requirement in sorting and the usage of GA for optimizing sorting algorithms.