5 Data Mining Techniques You Should Know About


Data mining is the collective name for a set of methods for detecting previously unknown, non-trivial, practically useful, and interpretable knowledge in data, knowledge that is needed for decision-making in various spheres of human activity. The term was introduced by Gregory Piatetsky-Shapiro in 1989.

Data mining methods build on a wide range of classification, modeling, and forecasting techniques that use decision trees, artificial neural networks, genetic algorithms, evolutionary programming, associative memory, and fuzzy logic. Statistical methods (descriptive analysis, correlation and regression analysis, factor analysis, analysis of variance, component analysis, discriminant analysis, time series analysis, survival analysis, link analysis) are often counted among data mining methods as well. Such methods, however, presuppose some a priori notions about the data being analyzed, which is somewhat at odds with the goal of data mining (the discovery of previously unknown, non-trivial, and practically useful knowledge).

One of the most important functions of data mining tools is to present the results of calculations visually (visualization), which makes the tools usable by people who have no special mathematical training.

Using statistical methods of data analysis, by contrast, requires a good knowledge of probability theory and mathematical statistics.

Decision tree

A decision tree (also called a classification tree or regression tree) is a decision support tool used in statistics and data analysis to build predictive models. The tree consists of “branches” and “leaves”. The edges (“branches”) record the attribute values on which the objective function depends, the “leaves” hold the values of the objective function, and the remaining nodes hold the attributes by which the cases are distinguished. To classify a new case, you walk down the tree to a leaf and return the value stored there. Decision trees of this kind are widely used in data mining. The goal is to create a model that predicts the value of the target variable from several input variables.

Each leaf represents the value of the target variable determined by the values of the input variables along the path from the root to that leaf. Each internal node corresponds to one of the input variables. The tree can be “learned” by splitting the original set of cases into subsets based on tests of attribute values. This process is repeated recursively on each resulting subset. The recursion stops when all cases in the subset at a node have the same value of the target variable, or when further splitting no longer improves the predictions. This process, top-down induction of decision trees (TDIDT), is an example of a “greedy” algorithm and is by far the most common strategy for learning decision trees from data, although it is not the only possible one.
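The top-down, greedy induction described above can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation: the article does not name a splitting criterion, so Gini impurity is an assumption here, and the weather data and the names build_tree and classify are made up for the example.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 means the subset is pure)."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())

def build_tree(rows, labels, attributes):
    """Grow a tree recursively; each row is a dict of attribute name -> value."""
    # Stop when the subset is pure or no attributes are left: emit a leaf.
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]

    def split_impurity(attr):
        # Weighted impurity of the subsets produced by splitting on attr.
        groups = {}
        for row, label in zip(rows, labels):
            groups.setdefault(row[attr], []).append(label)
        return sum(len(g) / len(labels) * gini(g) for g in groups.values())

    # Greedy step: pick the locally best attribute and never revisit the choice.
    best = min(attributes, key=split_impurity)
    remaining = [a for a in attributes if a != best]
    branches = {}
    for value in {row[best] for row in rows}:
        subset = [(r, l) for r, l in zip(rows, labels) if r[best] == value]
        sub_rows, sub_labels = zip(*subset)
        branches[value] = build_tree(list(sub_rows), list(sub_labels), remaining)
    return (best, branches)  # internal node: (attribute, {value: subtree})

def classify(tree, case):
    """Walk from the root down to a leaf and return the value stored there."""
    while isinstance(tree, tuple):
        attr, branches = tree
        tree = branches[case[attr]]
    return tree

# Hypothetical toy data: should we play outside?
rows = [
    {"outlook": "sunny", "windy": "no"},
    {"outlook": "sunny", "windy": "yes"},
    {"outlook": "rainy", "windy": "no"},
    {"outlook": "rainy", "windy": "yes"},
]
labels = ["play", "play", "play", "stay"]

tree = build_tree(rows, labels, ["outlook", "windy"])
print(classify(tree, {"outlook": "rainy", "windy": "yes"}))  # stay
```

The sketch handles only categorical attributes and raises a KeyError for an attribute value it never saw during training; real libraries add numeric thresholds, pruning, and fallbacks for unseen values.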

Artificial neural network

An artificial neural network (ANN) is a mathematical model, together with its software or hardware implementation, built on the principles by which biological neural networks (the networks of nerve cells in a living organism) are organized and function. The concept arose from studying the processes that occur in the brain and from attempts to simulate them. The first such attempt was the neural network model of W. McCulloch and W. Pitts. Once learning algorithms had been developed, the resulting models began to be used for practical purposes: in forecasting, pattern recognition, control tasks, and so on.

An artificial neural network is a system of connected, interacting simple processors (artificial neurons). Such processors are usually quite simple, especially in comparison with the processors used in personal computers. Each processor in the network deals only with the signals it periodically receives and the signals it periodically sends to other processors. Nevertheless, when connected into a sufficiently large network with controlled interaction, these individually simple processors can together perform rather complex tasks.

  • From the point of view of machine learning, a neural network is a special case of methods for pattern recognition, discriminant analysis, clustering, and so on.
  • From the mathematical point of view, training a neural network is a multiparameter nonlinear optimization problem.
  • From the point of view of cybernetics, neural networks are used in problems of adaptive control and as algorithms for robotics.
  • From the point of view of computer technology and programming, a neural network is a way of solving the problem of effective parallelism.
  • From the point of view of artificial intelligence, artificial neural networks are the basis of the philosophical trend of connectionism and the main direction in the structural approach to studying whether natural intelligence can be constructed (modeled) with computer algorithms.

Neural networks are not programmed in the usual sense of the word; they are trained. The ability to learn is one of the main advantages of neural networks over traditional algorithms. Technically, training consists of finding the coefficients (weights) of the connections between neurons. During training, a neural network can detect complex dependencies between input and output data and can also generalize. This means that, if training succeeds, the network can return a correct result for data that was absent from the training sample, as well as for incomplete or “noisy”, partially distorted data.
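To make "training is finding the connection coefficients" concrete, here is a minimal sketch with a single artificial neuron. The perceptron learning rule, the logical AND task, and the names train_neuron and predict are assumptions chosen for illustration; the article does not prescribe a particular learning algorithm, and practical networks have many neurons and use gradient-based training.

```python
def step(x):
    """Threshold activation: the neuron fires (1) when its input is non-negative."""
    return 1 if x >= 0 else 0

def predict(weights, bias, inputs):
    """Output of one neuron: weighted sum of the inputs plus bias, then threshold."""
    return step(sum(w * x for w, x in zip(weights, inputs)) + bias)

def train_neuron(samples, epochs=20, learning_rate=1):
    """Perceptron rule: nudge the weights whenever the neuron's output is wrong."""
    weights = [0, 0]
    bias = 0
    for _ in range(epochs):
        for inputs, target in samples:
            error = target - predict(weights, bias, inputs)
            # No change when error == 0; otherwise move toward the target.
            weights = [w + learning_rate * error * x
                       for w, x in zip(weights, inputs)]
            bias += learning_rate * error
    return weights, bias

# Training sample: the truth table of logical AND.
samples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

weights, bias = train_neuron(samples)
predictions = [predict(weights, bias, inputs) for inputs, _ in samples]
print(predictions)  # [0, 0, 0, 1]
```

Nothing in the code states the AND rule explicitly: the behavior is stored entirely in the learned weights and bias. A single neuron of this kind can only learn linearly separable functions, which is one reason practical networks connect many neurons in layers.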

Genetic algorithm

A genetic algorithm is a heuristic search algorithm used to solve optimization and modeling problems by randomly selecting, combining, and varying the sought parameters, using mechanisms similar to natural selection. It is a kind of evolutionary computation, which solves optimization problems using the methods of natural evolution, such as inheritance, mutation, selection, and crossover. A distinctive feature of the genetic algorithm is its emphasis on the “crossover” operator, which recombines candidate solutions and plays a role analogous to that of crossing in living nature.
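The selection, crossover, and mutation steps can be sketched as follows. This is a toy illustration under stated assumptions: the problem (maximize the number of 1-bits in a string, often called OneMax), the truncation selection scheme, the elitism step, and all parameter values are choices made for the example, not something the article specifies.

```python
import random

def fitness(bits):
    """Toy objective: the more 1-bits, the fitter the candidate."""
    return sum(bits)

def crossover(a, b, rng):
    """Single-point crossover: the head of one parent joined to the tail of the other."""
    point = rng.randrange(1, len(a))
    return a[:point] + b[point:]

def mutate(bits, rng, rate=0.05):
    """Flip each bit independently with a small probability."""
    return [1 - b if rng.random() < rate else b for b in bits]

def evolve(n_bits=20, pop_size=30, generations=40, seed=0):
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_bits)]
                  for _ in range(pop_size)]
    history = []  # best fitness seen in each generation
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        history.append(fitness(population[0]))
        parents = population[: pop_size // 2]  # selection: keep the fitter half
        children = [population[0]]             # elitism: the best survives unchanged
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            children.append(mutate(crossover(a, b, rng), rng))
        population = children                  # inheritance: children replace parents
    best = max(population, key=fitness)
    history.append(fitness(best))
    return best, history

best, history = evolve()
print(fitness(best), "of", len(best), "bits set")
```

Because the best candidate is always copied into the next generation, the best fitness can never decrease from one generation to the next; without that elitism step, a good solution can be lost to mutation.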

Fuzzy logic

Fuzzy logic is a branch of mathematics that generalizes classical logic and set theory. It is based on the notion of a fuzzy set: a set for which the membership function of an element can take any value in the interval [0, 1], not only 0 or 1. On the basis of this concept, various logical operations on fuzzy sets are introduced and the concept of a linguistic variable is formulated, whose values are fuzzy sets.

Fuzzy logic studies reasoning under conditions of fuzziness and vagueness, similar to everyday reasoning, and the application of such reasoning in computer systems.
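A short sketch may help make membership degrees and fuzzy operations concrete. The article does not say which operations are meant, so the min/max/complement definitions below (the common choice usually attributed to Zadeh) are an assumption, as are the triangular membership shape and the "temperature" sets with their breakpoints.

```python
def triangular(x, left, peak, right):
    """Membership degree in [0, 1] for a triangular fuzzy set (left < peak < right)."""
    if x <= left or x >= right:
        return 0.0
    if x <= peak:
        return (x - left) / (peak - left)
    return (right - x) / (right - peak)

# Fuzzy logical operations on membership degrees (assumed min/max definitions).
def fuzzy_and(a, b):
    return min(a, b)

def fuzzy_or(a, b):
    return max(a, b)

def fuzzy_not(a):
    return 1.0 - a

# Linguistic variable "temperature" whose values are fuzzy sets (hypothetical ranges, in °C).
def warm(t):
    return triangular(t, 10.0, 20.0, 40.0)

def hot(t):
    return triangular(t, 20.0, 40.0, 60.0)

t = 25.0
print(warm(t))                     # 0.75
print(hot(t))                      # 0.25
print(fuzzy_and(warm(t), hot(t)))  # 0.25
print(fuzzy_or(warm(t), hot(t)))   # 0.75
print(fuzzy_not(warm(t)))          # 0.25
```

At 25 °C the temperature is "warm" to degree 0.75 and "hot" to degree 0.25 at the same time, which classical two-valued set membership cannot express. When every degree is exactly 0 or 1, these operations reduce to the ordinary Boolean AND, OR, and NOT.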

Association rule learning

Association rule learning refers to techniques that help identify relations between variables in large databases. It helps uncover underlying patterns in the data: which variables occur, and which combinations of variables co-occur very frequently in the dataset.

Association rules are used extensively to examine and forecast customer behavior, and they are particularly well suited to analysis in the retail industry. For instance, Amazon knows which items people tend to purchase together: those who purchase a smartphone tend to purchase a cover for it as well. Association rules help uncover such relationships.
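The smartphone-and-cover example can be quantified with two standard measures, support and confidence. The article does not name them, so treat this as a minimal sketch: the transactions are made-up data, and the helper names support and confidence are chosen for the example.

```python
# Hypothetical shopping baskets, one set of items per transaction.
transactions = [
    {"smartphone", "cover", "charger"},
    {"smartphone", "cover"},
    {"smartphone", "earphones"},
    {"laptop", "mouse"},
    {"cover"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Of the transactions containing the antecedent, the share that also contain the consequent."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

# Rule under test: {smartphone} -> {cover}
print(support({"smartphone", "cover"}, transactions))  # 0.4
print(round(confidence({"smartphone"}, {"cover"}, transactions), 3))  # 0.667
```

Here 40% of all baskets contain both items, and two out of three smartphone buyers also bought a cover. Algorithms such as Apriori search for all rules whose support and confidence exceed chosen thresholds without enumerating every possible itemset.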

To conclude, data mining techniques like the ones described above are powerful tools not just in research; they are also used extensively across industries to improve the sales and revenues of organizations. It is therefore important for both academics and researchers to understand how these algorithms work and how they can be applied.

Tools such as R, Python, and MATLAB provide powerful libraries for all of the data mining techniques mentioned above. If you are interested, we encourage you to read further on the subject and implement the techniques in at least one programming language.
