KNN Imputer: Filling Missing Values with Nearest Neighbors

Vishvaasswaminathan
4 min readSep 28, 2023


KNNImputer is a scikit-learn class used to fill in or predict missing values in a dataset. It is a more useful method that builds on the KNN algorithm, rather than the naïve approach of filling all missing values with the mean or the median.

In this approach, we specify k, the number of nearest neighbors to consider; each missing value is then predicted as the mean of those neighbors' values. It is implemented by the KNNImputer class.

In today’s world, data is being collected from a number of sources and is used for analyzing, generating insights, validating theories, and whatnot. Data collected from different sources may often have some information missing, whether due to a problem in the data collection or extraction process or to simple human error.

Dealing with these missing values thus becomes an important step in data preprocessing. The choice of imputation method is crucial, since it can significantly impact one’s work.

To see this imputer in action, we will import it from Scikit-Learn’s impute package.
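As a minimal sketch of that import, the snippet below brings in KNNImputer and confirms it works on a tiny array (the small array here is an illustration, not the article's Titanic data):

```python
import numpy as np
from sklearn.impute import KNNImputer

# A tiny array with one missing value, just to confirm the import works.
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [np.nan, 6.0]])

# With 2 neighbors, the NaN is replaced by the mean of the other rows'
# values in that column: (1.0 + 3.0) / 2 = 2.0.
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)
print(X_filled)
```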

One thing to note here is that the KNN Imputer does not recognize text data values. It will generate errors if we do not change these values to numerical values. For example, in our Titanic dataset, the categorical columns ‘Sex’ and ‘Embarked’ have text data.

A good way to modify the text data is to perform one-hot encoding or create “dummy variables”. The idea is to convert each category into a binary data column by assigning a 1 or 0. Other options would be to use LabelEncoder or OrdinalEncoder from Scikit-Learn’s preprocessing package.

cat_variables = df[['Sex', 'Embarked']]
cat_dummies = pd.get_dummies(cat_variables, drop_first=True)
cat_dummies.head()

Now we have 3 dummy variable columns. In the “Sex_male” column, 1 indicates that the passenger is male and 0 female. The “Sex_female” column is dropped since the “drop_first” parameter is set to True. Similarly, there are only 2 columns for “Embarked” because the third one has been dropped.

Next, we will drop the original “Sex” and “Embarked” columns from the data frame and add the dummy variables.
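A sketch of that drop-and-concatenate step is below; the small DataFrame stands in for the Titanic data, which the article assumes is already loaded as df:

```python
import pandas as pd

# Small stand-in for the Titanic DataFrame loaded earlier in the article.
df = pd.DataFrame({
    'Age': [22.0, 38.0, None],
    'Sex': ['male', 'female', 'female'],
    'Embarked': ['S', 'C', 'Q'],
})

# One-hot encode the categorical columns, dropping the first level of each.
cat_dummies = pd.get_dummies(df[['Sex', 'Embarked']], drop_first=True)

# Drop the original text columns and attach the dummy columns.
df = pd.concat([df.drop(columns=['Sex', 'Embarked']), cat_dummies], axis=1)
print(df.columns.tolist())
```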

Another critical point here is that the KNN Imputer is a distance-based imputation method, and it requires us to normalize our data. Otherwise, the different scales of our data will lead the KNN Imputer to generate biased replacements for the missing values. For simplicity, we will use Scikit-Learn’s MinMaxScaler, which will scale our variables to have values between 0 and 1.
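The scaling step can be sketched as follows; the toy numeric frame here stands in for the encoded Titanic data, and the NaN is one of the missing values we will impute later (MinMaxScaler leaves NaNs untouched):

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy numeric frame standing in for the encoded Titanic data.
df = pd.DataFrame({'Age': [22.0, 38.0, None, 54.0],
                   'Fare': [7.25, 71.28, 8.05, 51.86]})

# MinMaxScaler rescales each column to [0, 1]; NaNs pass through unchanged.
scaler = MinMaxScaler()
df_scaled = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)
print(df_scaled)
```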

Now that our dataset has dummy variables and is normalized, we can move on to the KNN imputation. Let’s import it from Scikit-Learn’s impute package and apply it to our data. In this example, we are setting the parameter ‘n_neighbors’ to 5, so each missing value will be replaced by the mean value of its 5 nearest neighbors, measured by Euclidean distance.
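A minimal sketch of that final step, assuming the scaled data lives in a DataFrame called df_scaled (randomly generated here for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy scaled data standing in for the preprocessed Titanic frame.
rng = np.random.default_rng(0)
df_scaled = pd.DataFrame(rng.random((10, 3)),
                         columns=['Age', 'Fare', 'Sex_male'])
df_scaled.loc[2, 'Age'] = np.nan  # introduce a missing value

# Replace each missing value with the mean of its 5 nearest neighbors,
# measured by NaN-aware Euclidean distance.
imputer = KNNImputer(n_neighbors=5)
df_imputed = pd.DataFrame(imputer.fit_transform(df_scaled),
                          columns=df_scaled.columns)
print(df_imputed.isna().sum().sum())  # 0
```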

Conclusion
There are different ways to handle missing data. Simple methods include removing an entire observation if it has a missing value, or replacing missing values with the mean, median, or mode. However, these methods can waste valuable data or reduce the variability of your dataset. In contrast, the KNN Imputer preserves the value and variability of your dataset, and it is more precise and efficient than using average values.
