How to Find Outliers in Data using Machine Learning
No matter how carefully data is collected, every data scientist has felt the frustration of finding outliers in it.
An outlier is a data point that is noticeably different from the rest. Outliers may represent errors in measurement, poor data collection, or variables that were not considered when the data was collected.
Why You Shouldn't Just Delete Outliers
Many data analysts are tempted to delete outliers. However, this is sometimes the wrong choice. In machine learning, as in conventional analytical models, you need to resist the urge to simply hit the delete button when you come across such an anomaly in the hope of improving your model's accuracy. Rather than reacting reflexively, handle outliers with caution.
One cannot recognize outliers while collecting data; you won't know what values are outliers until you begin analyzing the data.
Many statistical tests are sensitive to outliers and therefore, the ability to detect them is an important part of data analytics.
The interpretability of an outlier model is very important, and decisions seeking to tackle an outlier need some context or rationale.
In fact, outliers sometimes can be helpful indicators. For example, in some applications of data analytics like credit card fraud detection, outlier analysis becomes important because here, the exception rather than the rule may be of interest to the analyst.
Put simply, you have three options when you detect outliers: accept them, correct them, or delete them. If the outlier is unlikely to significantly alter the outcome, you may "accept" it.
Otherwise, you can either "correct" it or delete it. However, you should reserve deletion for data points that are definitely wrong.
Impact on Machine Learning Models
Machine learning algorithms, too, are sensitive to the statistics and distribution of the input variables. In supervised models, outliers can mislead the training process, resulting in longer training times or less accurate models.
According to Alvira Swalin, a data scientist at Uber, machine learning models such as linear and logistic regression are easily influenced by outliers in the training data.
Some models, such as boosting methods, even increase the weights of misclassified points at every training iteration, which can give outliers a growing influence on the result.
Detecting Outliers with Statistical Methods
How do you find outliers? There is no single method for detecting them, because every dataset is different and what counts as unusual depends on the facts behind it.
A rule of thumb is that you, the domain expert, can inspect the raw, unfiltered observations and decide whether a value is an outlier or not.
There are more scientific methods, though. You can carry out two types of analysis to find outliers: univariate, which involves just one variable, and multivariate, which involves two or more. The following are some common methods for outlier analysis:
Box Plot
In descriptive statistics, a box plot is a method for graphically depicting groups of numerical data through their quartiles.
Box plots may also have lines extending vertically from the boxes (whiskers) indicating variability outside the upper and lower quartiles, hence the terms box-and-whisker plot and box-and-whisker diagram. Outliers are plotted as individual points beyond the whiskers. (Source: Wikipedia)
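The same idea can be applied numerically. The sketch below assumes the common convention of placing the whisker limits 1.5 times the interquartile range (IQR) beyond the quartiles; the function name, the multiplier `k`, and the sample values are illustrative choices, not something prescribed by the text above.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    # quantiles(..., n=4) returns the three cut points Q1, median, Q3
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

# Hypothetical sample: nine similar readings and one extreme value
data = [12, 13, 14, 15, 15, 16, 17, 18, 19, 95]
print(iqr_outliers(data))
```

A larger `k` makes the rule more tolerant; whether a flagged value is truly wrong still needs a domain judgment.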
Scatter Plot
A scatter plot is a chart type that is normally used to observe and visually display the relationship between variables. The values of the variables are represented by dots.
The position of each dot on the horizontal and vertical axes indicates the values of the respective data point; hence, scatter plots make use of Cartesian coordinates to display the values of the variables in a data set. Scatter plots are also known as scattergrams, scatter graphs, or scatter charts. (Source: CFI)
Mathematical Function
Z-score: A Z-score is a numerical measurement that describes a value's relationship to the mean of a group of values. Z-score is measured in terms of standard deviations from the mean (Source: Investopedia)
Z-score is a measure of a point's relationship to the average of all points in the dataset. When scored, the values receive a positive or negative number. This number is the number of standard deviations above or below the average value.
Thus, when an analyst calculates z-scores and finds data points whose absolute z-score exceeds a chosen threshold (3 is a common choice), those points are candidate outliers.
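As a rough illustration of the z-score check, the sketch below flags values whose absolute z-score exceeds a cutoff. The cutoff of 3 is a widely used convention and an assumption here, as are the function name and the sample data.

```python
import statistics

def zscore_outliers(values, threshold=3.0):
    """Return values whose absolute z-score exceeds the threshold."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation
    if std == 0:
        return []  # all values identical, nothing can be flagged
    return [v for v in values if abs((v - mean) / std) > threshold]

# Hypothetical sample: nineteen readings near 50 and one reading of 120
readings = [50, 51, 49, 52, 48, 50, 51, 49, 50, 52,
            48, 51, 50, 49, 50, 51, 49, 50, 52, 120]
print(zscore_outliers(readings))
```

Note that the mean and standard deviation are themselves pulled by extreme values, so on very small samples a single outlier may never reach a high z-score.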
Detecting Outliers in Machine Learning
How do you detect and deal with outliers in machine learning and predictive analytics? One common approach is called "one-class classification" (OCC).
This involves fitting a model on the "normal" data, and then predicting whether the new data collected is normal or an anomaly.
However, a one-class classifier can only judge whether new data is "normal" relative to the data it was trained on. In other words, the OCC may give incorrect predictions if the training set itself contains outliers.
Author Charu C Aggarwal, in his book "Outlier Analysis", discusses many outlier detection methods. Some notable ones include:
Probabilistic and Statistical Models
These methods fit a statistical distribution to the data and flag points that are unlikely under that distribution.
Linear Models
These models represent the data as a linear combination of features, using linear correlations to model the data in lower dimensions. For example, with principal component analysis, data points with large residual errors may be outliers.
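One way to picture the residual-error idea is to project the data onto its leading principal component and measure how far each point lies from that projection. The following is a minimal sketch using NumPy's SVD; the synthetic data, the function name, and the choice of a single component are assumptions made for illustration.

```python
import numpy as np

def pca_reconstruction_error(X, n_components=1):
    """Distance between each point and its projection onto the top principal components."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:n_components]
    projected = centered @ components.T @ components
    return np.linalg.norm(centered - projected, axis=1)

rng = np.random.default_rng(0)
t = rng.uniform(-5, 5, size=50)
# 50 points close to the line y = 2x, plus one point well off that line
X = np.column_stack([t, 2 * t + rng.normal(0, 0.1, size=50)])
X = np.vstack([X, [0.0, 8.0]])

errors = pca_reconstruction_error(X)
print("index with largest residual:", int(np.argmax(errors)))
```

Points that follow the dominant linear trend reconstruct almost perfectly, while the off-trend point is left with a large residual.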
High-Dimensional Outlier Detection
High-dimensional data present a major challenge. Many of the current algorithms cannot address the problems of a large number of features.
A paper by Aggarwal and his colleague Philip S Yu states that, for effectiveness, high dimensional outlier detection algorithms must satisfy many properties, including the provision of interpretability in terms of the reasoning which creates the abnormality.
In machine learning, one cannot just "ignore" data outliers. They can impair the training process, and create cascading errors.
What Are the 3 Different Types of Outliers?
There are 3 different categories of outliers in machine learning:
- Type 1: Global Outliers
- Type 2: Contextual Outliers
- Type 3: Collective Outliers
Global Outliers: Type 1
A data point is considered a global outlier if its value lies far outside the entirety of the data set in which it is found.
Contextual or Conditional Outliers: Type 2
Contextual or conditional outliers are data points whose values diverge considerably from other data points within the same context. In time-series data sets, such as records of a particular quantity over time, the "context" is almost always temporal.
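To make the distinction concrete, the sketch below compares each value only against other values from the same context. It assumes the context is a simple label (a season) and uses the median absolute deviation as the yardstick; both choices, along with the names and the temperature readings, are hypothetical.

```python
import statistics
from collections import defaultdict

def contextual_outliers(records, threshold=3.0):
    """Flag (context, value) pairs that lie far from the median of their own context."""
    by_context = defaultdict(list)
    for context, value in records:
        by_context[context].append(value)

    flagged = []
    for context, value in records:
        group = by_context[context]
        median = statistics.median(group)
        # Median absolute deviation: a spread measure that resists extreme values
        mad = statistics.median([abs(v - median) for v in group])
        if mad > 0 and abs(value - median) / mad > threshold:
            flagged.append((context, value))
    return flagged

# Hypothetical temperatures in degrees Celsius: 28 is ordinary in summer, odd in winter
records = [("winter", 2), ("winter", 3), ("winter", 1), ("winter", 4),
           ("winter", 2), ("winter", 28),
           ("summer", 27), ("summer", 29), ("summer", 28), ("summer", 30),
           ("summer", 26), ("summer", 28)]
print(contextual_outliers(records))
```

The winter reading of 28 sits inside the overall range of the data, so a global check would miss it; it only stands out once the season is taken into account.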
Collective Outliers: Type 3
A subset of data points in a data set is considered anomalous if those values, as a group, deviate significantly from the whole data set, even though the individual data points are not themselves anomalous in either a contextual or a global sense.
In time-series data sets, this can show up as otherwise normal peaks and valleys occurring outside the time frame in which that seasonal pattern is expected, or as a combination of time series that is in an outlier state as a group.
Advanced Outlier Detection Techniques
Isolation Forest
Isolation Forest is an unsupervised learning algorithm that works by isolating observations: it randomly selects a feature and then randomly selects a split value between the maximum and minimum values of that feature. Because outliers are few and different from the rest, they tend to be isolated in fewer splits than normal points.
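A minimal sketch with scikit-learn's `IsolationForest` is shown below. The synthetic data, the `contamination` value, and the random seeds are assumptions chosen for the example and would need tuning on real data.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
# 200 ordinary points around the origin, plus two injected extreme points
X = rng.normal(0, 1, size=(200, 2))
X = np.vstack([X, [[8.0, 8.0], [-9.0, 7.5]]])

# contamination is the assumed share of outliers in the data
model = IsolationForest(n_estimators=100, contamination=0.01, random_state=42)
labels = model.fit_predict(X)      # 1 = inlier, -1 = outlier
scores = model.score_samples(X)    # lower score = more anomalous

print("labels of injected points:", labels[-2:])
```

The `contamination` setting decides how many points are flagged, so it is worth checking the raw scores rather than relying on the labels alone.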
Local Outlier Factor (LOF)
LOF is an algorithm for identifying density-based local outliers. It compares the local density of a point with the local densities of its neighbors.
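The density comparison can be tried out with scikit-learn's `LocalOutlierFactor`. In this sketch, the two clusters, the extra point, and `n_neighbors=20` are illustrative assumptions; the extra point is placed near a tight cluster so that it is unusual locally even though a looser cluster elsewhere is just as spread out.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
dense = rng.normal(loc=0.0, scale=0.3, size=(100, 2))    # tight cluster
sparse = rng.normal(loc=10.0, scale=1.5, size=(100, 2))  # loose cluster
candidate = np.array([[2.5, 2.5]])                       # close to the tight cluster, but outside it
X = np.vstack([dense, sparse, candidate])

lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)              # 1 = inlier, -1 = outlier
scores = -lof.negative_outlier_factor_   # larger score = more anomalous

print("label of candidate:", labels[-1])
print("index of highest LOF score:", int(np.argmax(scores)))
```

A score close to 1 means a point is about as dense as its neighbors; scores well above 1 indicate a point sitting in a much sparser spot than the points around it.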
One-Class SVM
One-Class SVM is a variation of the Support Vector Machine algorithm that can be used for outlier detection. It learns a decision boundary that encompasses the normal data points.
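The fit-on-normal, predict-on-new workflow might look like the following sketch with scikit-learn's `OneClassSVM`. The training data, the two new points, and the `nu` and `gamma` settings are assumptions for illustration only.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
# Training data assumed to contain only "normal" observations
X_train = rng.normal(0, 1, size=(300, 2))

# nu is an upper bound on the fraction of training points treated as errors
model = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05)
model.fit(X_train)

# One new point near the training data, one far away from it
X_new = np.array([[0.1, -0.2], [6.0, 6.0]])
pred = model.predict(X_new)   # 1 = normal, -1 = anomaly
print(pred)
```

Because the boundary is learned from the training set alone, any outliers left in that set will widen what the model treats as normal.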
Best Practices for Outlier Handling
1. Understand Your Data
Before applying any outlier detection method, it's crucial to understand your data domain and the context in which outliers might occur.
2. Use Multiple Methods
Don't rely on a single method for outlier detection. Combine statistical, visual, and machine learning approaches for better results.
3. Document Your Decisions
Keep a record of which outliers you remove, modify, or keep, along with the rationale for your decisions.
4. Validate Your Approach
Test your outlier handling strategy on a held-out subset of your data to check whether it actually improves model performance.
Impact on Model Performance
Regression Models
Outliers can significantly impact regression models by:
- Pulling the regression line towards them
- Increasing the residual sum of squares
- Affecting the model's coefficients
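These effects are easy to reproduce on a toy example. The sketch below fits an ordinary least-squares line by hand to points lying exactly on y = 2x + 1, then replaces one value with an extreme one; the data and helper names are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def rss(xs, ys, slope, intercept):
    """Residual sum of squares of the fitted line."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))

xs = list(range(10))
ys_clean = [2 * x + 1 for x in xs]      # points exactly on y = 2x + 1
ys_outlier = ys_clean[:-1] + [60]       # last value replaced by an extreme one

clean_slope, clean_intercept = fit_line(xs, ys_clean)
out_slope, out_intercept = fit_line(xs, ys_outlier)

print("clean fit:       ", clean_slope, clean_intercept)
print("with one outlier:", out_slope, out_intercept)
print("RSS clean vs outlier:",
      rss(xs, ys_clean, clean_slope, clean_intercept),
      rss(xs, ys_outlier, out_slope, out_intercept))
```

A single extreme value changes both coefficients and leaves the fitted line describing none of the points particularly well.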
Classification Models
In classification tasks, outliers can:
- Create decision boundaries that don't generalize well
- Increase training time
- Reduce model accuracy
Conclusion
Outlier detection is a critical step in the machine learning pipeline that requires careful consideration and domain expertise. While outliers can sometimes be removed to improve model performance, they can also contain valuable information about rare events or data quality issues.
The key is to approach outlier detection systematically, using multiple methods and understanding the context of your data. Whether you choose to remove, modify, or keep outliers should be based on a thorough understanding of your data and the specific requirements of your machine learning task.
By implementing robust outlier detection strategies, you can build more reliable and accurate machine learning models that better represent the underlying patterns in your data.
References:
- Machine Learning Mastery
- Towards Data Science
- Data Science Foundation
- Neural Designer
- Wikipedia
- Analytics Vidhya