
DS Wannabe Prep (1): how to code K-nearest neighbours in Python for machine learning interviews


a data point's prediction is determined by its closest neighbours

this makes KNN non-parametric, which is different from parametric models such as:

- linear regression
- logistic regression

how knn works

to make a prediction:

1. find the K closest neighbours

common distance metrics (see the sketch after this list):

- euclidean distance
- cosine similarity

2. use the neighbours for prediction

- regression: average the neighbours' values
- classification: take the majority class among the neighbours
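As a quick reference, here is a minimal sketch of the two distance metrics above; the function names and the use of NumPy are my own illustrative choices, not part of the original post.

import numpy as np

def euclidean_distance(a, b):
    # straight-line distance between two feature vectors
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return np.sqrt(np.sum((a - b) ** 2))

def cosine_similarity(a, b):
    # cosine of the angle between two vectors; 1 means same direction
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(euclidean_distance([0, 0], [3, 4]))  # 5.0
print(cosine_similarity([1, 0], [1, 1]))   # ~0.707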

Implementation

1. obtaining the data
2. querying the nearest neighbours

Package implementation in a class

- the training data is shared between "train" and "predict"

# regression problem
class KNN:

    def __init__(self):
        self.x = None
        self.y = None

    def train(self, x, y):
        # KNN is a lazy learner: training only stores the data
        self.x = x
        self.y = y

    def distance(self, a, b):
        # euclidean distance between two feature vectors
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) ** 0.5

    def predict(self, x, k):
        # get the distance from the new data point to all the training data points
        distance_label = [
            (self.distance(x, train_point), train_label)
            for train_point, train_label in zip(self.x, self.y)
        ]
        # sort them in ascending order based on the distance and keep the k closest
        neighbours = sorted(distance_label)[:k]
        # regression problem: take the mean of the neighbours' labels
        return sum(label for _, label in neighbours) / k

Dimension of data:

x is a two-dimensional array

first dimension: no. of data points
second dimension: no. of features
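For example, a minimal usage sketch of the regression class above (the toy numbers are my own, not from the original post):

# each row is a data point, each column a feature
x_train = [[1.0, 2.0], [2.0, 1.0], [3.0, 3.0], [5.0, 4.0]]
y_train = [1.5, 1.8, 3.1, 4.9]

model = KNN()
model.train(x_train, y_train)

# predict the value of a new point from its 3 nearest neighbours
print(model.predict([2.0, 2.0], 3))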
For a classification problem, only the return statement in predict changes:

# classification: count the majority of labels from its neighbours
# (requires `from collections import Counter` at the top of the file)
neighbour_labels = [label for _, label in neighbours]
return Counter(neighbour_labels).most_common()[0][0]
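A minimal usage sketch for the classification variant (assuming the return statement above has been swapped into predict and Counter imported; the toy labels are my own):

from collections import Counter

x_train = [[1.0, 1.0], [1.2, 0.9], [4.0, 4.2], [3.8, 4.1]]
y_train = ["cat", "cat", "dog", "dog"]

clf = KNN()
clf.train(x_train, y_train)
print(clf.predict([1.1, 1.0], 3))  # expected: "cat"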

Space and time complexity

Space & time complexity of the train function: O(1), since it only stores references to the training data.

Predict function: computing the distances is O(MN), M being the no. of data points and N being the no. of features.

Sorting: O(M log(M))

Total time complexity: O(MN) + O(M log(M)) -> O(M log(M))

assuming log(M) is larger than N, i.e. the no. of TRAINING DATA points is much larger than the no. of FEATURES
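As a rough worked example with made-up numbers (my own, not from the post): for M = 1,000,000 training points and N = 10 features, the distance pass costs about M*N = 10^7 operations while the sort costs about M*log2(M) ≈ 2*10^7, so the sorting term dominates; with N = 100 features the distance pass (10^8 operations) would dominate instead.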

How to choose K

k is a hyper-parameter

- it is predetermined, not learned from the data

when k is too small, the prediction can be noisy

when k is too large, the prediction averages over too many data points and is not accurate either

simplest approach (a minimal sketch is shown below):

k = square root of the no. of data points
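In code, this rule of thumb looks like the following (assuming x_train holds the training points, as in the earlier toy example):

import math

k = max(1, int(math.sqrt(len(x_train))))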

Cross-validation

use the training data to tune hyper-parameters

shuffle the training data and divide it into n equal-sized partitions

then pick a range of candidate values for the hyper-parameter k

For each candidate k:

use n-1 partitions for training and the remaining 1 partition for validation

compute the validation error for each k and then select the one with the minimum error

For a more robust approach:

repeat this exercise on different partitions: every partition has a chance to be a validation data set.

so at the end, we will have n validation errors associated with each candidate:

$$\text{CV Error} = \frac{\sum_{i=1}^{n} \text{Validation Error}_i}{n}$$

Finally, we have to select the candidate with the lowest CV Error.
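A minimal sketch of this selection procedure, using the regression KNN class defined above (the fold count, candidate list, and squared-error metric are my own illustrative choices, not from the original post):

import random

def cross_validate_k(x, y, candidate_ks, n_folds=5):
    # shuffle the indices and split them into n_folds roughly equal partitions
    indices = list(range(len(x)))
    random.shuffle(indices)
    folds = [indices[i::n_folds] for i in range(n_folds)]

    cv_errors = {}
    for k in candidate_ks:
        fold_errors = []
        for i in range(n_folds):
            val_idx = set(folds[i])
            train_idx = [j for j in indices if j not in val_idx]

            model = KNN()
            model.train([x[j] for j in train_idx], [y[j] for j in train_idx])

            # mean squared validation error on the held-out partition
            errors = [(model.predict(x[j], k) - y[j]) ** 2 for j in val_idx]
            fold_errors.append(sum(errors) / len(errors))

        # CV error = average of the n validation errors for this candidate
        cv_errors[k] = sum(fold_errors) / n_folds

    # select the candidate with the lowest CV error
    return min(cv_errors, key=cv_errors.get)

For example, cross_validate_k(x, y, [1, 3, 5, 7]) returns the candidate k with the lowest average validation error on a reasonably sized data set.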
