An Introduction to Data Mining
Data mining is an interdisciplinary subfield of computer science used to discover patterns in complex datasets. The field has been widely studied since the 1970s because it can produce useful insights that help us better understand the underlying relationships and trends in a dataset.
Common data mining tasks are:
- Anomaly Detection: Identifying unusual data records and determining whether they represent errors, noise, or exceptions that may require further investigation (a short sketch follows this list).
- Dependency Modeling: Searching for relationships between variables.
- Clustering: Identifying groups of records that are similar with respect to a certain set of variables.
- Classification: Generalizing a known structure to apply to new data.
- Regression: Finding a function that represents the dataset with the least amount of error.
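To make the anomaly detection task concrete, here is a minimal sketch using scikit-learn's IsolationForest on synthetic two-dimensional data; the dataset and the contamination rate are illustrative assumptions, not part of any particular real-world analysis:

```python
# A minimal anomaly-detection sketch using scikit-learn's IsolationForest.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
normal = rng.normal(loc=0.0, scale=1.0, size=(200, 2))   # typical records
outliers = rng.uniform(low=-6, high=6, size=(5, 2))      # unusual records
X = np.vstack([normal, outliers])

# contamination is our assumed fraction of anomalous records.
model = IsolationForest(contamination=0.05, random_state=42).fit(X)
labels = model.predict(X)  # +1 = normal, -1 = anomaly

print(f"Flagged {np.sum(labels == -1)} records for further investigation")
```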
Over the last 50 years, the data mining process has been refined into standard stage-based approaches. Here are some of the most common:
Knowledge Discovery in Databases (KDD) is organized into 5 major stages:
- Selection: Selecting a target dataset, or a subset of variables or records, on which discovery is to be performed.
- Preprocessing: Identifying and removing noise and outlier records, and finding strategies to handle missing fields.
- Transformation: Reducing the number of effective variables using dimensionality reduction (as sketched in the example after this list).
- Data Mining: Defining a specific task, choosing an appropriate data mining algorithm, and running it.
- Interpretation/Evaluation: Interpreting the processed results.
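Here is a minimal sketch of the preprocessing and transformation stages, assuming a hypothetical numeric dataset held in a pandas DataFrame; real pipelines would tailor the missing-value and outlier strategies to the domain:

```python
# A minimal sketch of the KDD preprocessing and transformation stages.
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# A made-up numeric dataset standing in for real domain data.
df = pd.DataFrame(
    np.random.default_rng(0).normal(size=(100, 5)),
    columns=[f"x{i}" for i in range(5)],
)
df.iloc[3, 1] = np.nan  # simulate a missing field

# Preprocessing: handle missing fields and clip extreme outliers.
df = df.fillna(df.mean())
df = df.clip(lower=df.quantile(0.01), upper=df.quantile(0.99), axis=1)

# Transformation: reduce the number of effective variables with PCA.
reduced = PCA(n_components=2).fit_transform(df)
print(reduced.shape)  # (100, 2) -> ready for the data mining stage
```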
Cross-Industry Standard Process for Data Mining (CRISP-DM, often shortened to CRISP). This process is widely used by consulting agencies and is more business-oriented. While KDD is mostly thought of as a linear sequence of steps, CRISP foresees iterations between its stages that progressively refine the results. It is organized into 6 major stages:
- Business Understanding: Defining the business perspective and project goals and translating them into a data mining problem definition.
- Data Understanding: Getting familiar with the data and defining initial hypotheses about underlying correlations or patterns.
- Data Preparation: Similar to the selection, preprocessing and transformation stages in KDD.
- Modeling: Comparing different algorithms/strategies suitable for the problem defined in the previous stages (see the sketch after this list).
- Evaluation: Deciding which model best fits the business perspective and evaluating the quality of the identified models.
- Deployment: Producing something that can be communicated to external audiences. It can be a simple report or the definition of a mathematical model.
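The modeling and evaluation stages can be illustrated with a short sketch that compares two candidate algorithms via cross-validation; the dataset and the candidate models here are placeholders for whatever the business problem actually requires:

```python
# A minimal sketch of comparing candidate models with cross-validation.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)  # stand-in for a real business dataset
candidates = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "decision tree": DecisionTreeClassifier(random_state=0),
}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean accuracy {scores.mean():.3f}")
```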
Sample, Explore, Modify, Model, Assess (SEMMA). This approach can be seen as a list of best practices for modeling data mining problems. Like CRISP, it is meant to be an iterative process. It is organized into 5 major stages:
- Sample: Selecting and extracting a subset of data that can be used in the following stages (a short example follows this list).
- Explore: Similar to the data understanding stage in CRISP.
- Modify: Similar to the preprocessing and transformation stages in KDD.
- Model: Similar to the modeling stage in CRISP.
- Assess: Evaluating the quality and the reliability of the model defined in the previous stages.
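As a tiny illustration of the sample stage, here is a sketch that draws a random subset of a hypothetical large pandas DataFrame; the 1% sampling fraction is an arbitrary illustrative choice, tuned in practice to the dataset and task:

```python
# A minimal sketch of the SEMMA sample stage.
import numpy as np
import pandas as pd

# A made-up "large" dataset standing in for production data.
big = pd.DataFrame({"value": np.random.default_rng(1).normal(size=1_000_000)})

# Draw a 1% random sample to explore and model on.
sample = big.sample(frac=0.01, random_state=1)
print(len(sample))  # 10000 rows
```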
A challenging task related to data mining is presenting the results in a way that is easily understood and that emphasizes the discovered facts. Finding a proper data visualization may require not only design skills but also deep mathematical and programming knowledge. Here are some brilliant examples of data visualization:
The New York Times analysis of Obama's budget proposal for 2013 shows the expenditure items proposed by President Obama for 2013. The web page allows the user to slice the items in different ways, each providing a clearer understanding of the data. It also makes it simple to grasp the proportions between the various items.
The Guardian analysis of the Brexit vote. In this example, the UK map has been transformed so that each area's size represents its population: the larger an area is drawn, the more populated it is. The color of each area represents the result of the vote. The analysis also shows the correlation between the vote and some demographic variables.
Game of Thrones characters. (Spoiler alert) This example lists the major characters that have appeared so far in Game of Thrones, along with their family and, where applicable, the cause of their death. In this case the data presented is a set of facts, while in the previous examples the data was mostly numeric. Interactivity allows the user to quickly understand the relationships between the entities shown.
Machine learning and data mining
All the common data mining tasks described above have been widely studied, and today we have solid mathematical theories that describe each of them and that can be used to perform them. In spite of this, performing data mining tasks can sometimes be computationally expensive. For this reason, machine learning techniques are often used to find approximate solutions. Machine learning is a subfield of computer science that groups a set of algorithms and techniques that perform tasks without being explicitly programmed. Machine learning involves 3 main concepts: a task T that should be performed, a performance measure P that describes how well the task is performed, and the experience E that measures how many times the task has been performed. A machine is said to learn if the performance with which it carries out the task increases along with the experience.
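As a rough illustration of this definition, the sketch below trains the same classifier on increasingly large portions of a dataset and measures its held-out accuracy; the dataset and model are arbitrary stand-ins for the task T:

```python
# Performance P (held-out accuracy) tends to improve as the
# experience E (number of training examples) grows.
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)
for n in (50, 200, 800):  # increasing experience E
    model = LogisticRegression(max_iter=2000).fit(X_train[:n], y_train[:n])
    print(f"E={n:4d} examples -> P={model.score(X_test, y_test):.3f}")
```

Machine learning techniques can be grouped by the approach used in the learning process: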
- Supervised learning: A computer program is presented with example inputs and their desired outputs. The goal is to learn a general rule that maps inputs to outputs. Supervised learning can be used to perform classification and regression tasks.
- Unsupervised learning: No labels are given to the learning algorithm, leaving it on its own to find structure in its input. Unsupervised learning can be used to perform clustering and dependencies modeling tasks.
- Reinforcement learning: A computer program interacts with a dynamic environment and tries to maximize a reward. Reinforcement learning is often used to perform optimization tasks.
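As a tiny illustration of reinforcement learning, the sketch below implements an epsilon-greedy agent for a two-armed bandit; the reward probabilities and the exploration rate are made-up values chosen only for demonstration:

```python
# An epsilon-greedy agent learning to maximize reward on a two-armed bandit.
import numpy as np

rng = np.random.default_rng(7)
true_reward_prob = [0.3, 0.7]      # hidden environment dynamics
estimates, counts = [0.0, 0.0], [0, 0]

for step in range(1000):
    # Explore with probability 0.1, otherwise exploit the best estimate.
    arm = rng.integers(2) if rng.random() < 0.1 else int(np.argmax(estimates))
    reward = float(rng.random() < true_reward_prob[arm])
    counts[arm] += 1
    # Incremental average update of the value estimate for this arm.
    estimates[arm] += (reward - estimates[arm]) / counts[arm]

print(estimates)  # converges toward [0.3, 0.7]; the agent favors arm 1
```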
Another way to group machine learning techniques is by their mathematical model. There are many different approaches used for machine learning purposes. Here are some of the most commonly used for data mining tasks:
- Artificial neural networks (ANN): A mathematical model derived from the observation of biological neural network behavior. It is mostly used to perform classification and regression tasks.
- Support vector machines (SVM): A model that can be used for classification and regression tasks. The main differences from ANNs are that this model has a complete mathematical formulation (training reduces to a convex optimization problem) and that it can avoid some of the intrinsic limitations of ANNs, such as getting stuck in local minima (a short classification sketch follows this list).
- Bayesian networks: A probabilistic model that captures strong relationships between variables and represents them as a directed acyclic graph. It can be used to perform dependency modeling tasks.
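As a quick illustration, here is a minimal SVM classification sketch with scikit-learn; the dataset and parameters are illustrative choices rather than a recommended configuration:

```python
# A minimal SVM classification sketch.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# An RBF-kernel SVM; training solves a convex optimization problem,
# so there is no risk of getting stuck in a local minimum.
clf = SVC(kernel="rbf", C=1.0).fit(X_train, y_train)
print(f"Test accuracy: {clf.score(X_test, y_test):.3f}")
```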
Conclusion
All the techniques we discussed above are strongly related to statistical/mathematical problems. In recent years their use has steadily increased, driven by the growth of web technologies and the ever-larger volume of user-generated data. Along with the diffusion of data mining and machine learning techniques, the literature about their misuse has grown as well. Many authors today point out common misconceptions about statistics and common errors in the analysis process. The abuse of data mining techniques is sometimes called data dredging: in general, blindly applying algorithms without a proper study of the problem compromises the statistical value of the results. Despite these problems, in a globalized world where the amount of data grows constantly, data mining and machine learning techniques remain very valuable tools for understanding the world itself.
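As a closing illustration of data dredging, the sketch below tests many random "explanatory" variables against a random target and inevitably "discovers" a spurious correlation; everything here is pure noise by construction:

```python
# Data dredging in miniature: enough blind hypothesis tests on random
# noise will always surface an apparently "strong" correlation.
import numpy as np

rng = np.random.default_rng(0)
target = rng.normal(size=50)
best_r, best_i = 0.0, -1
for i in range(1000):  # 1000 candidate "explanatory" variables
    candidate = rng.normal(size=50)
    r = np.corrcoef(target, candidate)[0, 1]
    if abs(r) > abs(best_r):
        best_r, best_i = r, i

print(f"Variable {best_i} correlates with the target: r = {best_r:.2f}")
```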