
DS Wannabe Prep (1): how to code K-nearest neighbours in Python for machine learning interviews


a data point's prediction is determined by its closest neighbours

this makes KNN non-parametric, which is different from parametric models such as:

- linear regression
- logistic regression

how knn works

to make a prediction:

1. find the K closest neighbours

common distance metrics (see the sketch after this list):

- euclidean distance
- cosine similarity

2. use the neighbours for prediction

- regression: average the neighbours' values
- classification: take the majority class among the neighbours
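As a quick reference, here is a minimal sketch of the two distance metrics above; the function names and the use of NumPy are my own illustrative choices, not part of the original post.

import numpy as np

def euclidean_distance(a, b):
    # straight-line distance between two feature vectors
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return np.sqrt(np.sum((a - b) ** 2))

def cosine_similarity(a, b):
    # cosine of the angle between two vectors; 1 means same direction
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(euclidean_distance([0, 0], [3, 4]))  # 5.0
print(cosine_similarity([1, 0], [1, 1]))   # ~0.707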

Implementation

1. obtaining the data
2. querying the nearest neighbours

Package implementation in a class

- the training data is shared between "train" and "predict"

# regression problem
class KNN:

    def __init__(self):
        self.x = None
        self.y = None

    def train(self, x, y):
        # KNN is a lazy learner: training only stores the data
        self.x = x
        self.y = y

    def distance(self, a, b):
        # euclidean distance between two feature vectors
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) ** 0.5

    def predict(self, x, k):
        # get the distance from the new data point to all the training data points
        distance_label = [
            (self.distance(x, train_point), train_label)
            for train_point, train_label in zip(self.x, self.y)
        ]
        # sort them in ascending order based on the distance and keep the k closest
        neighbours = sorted(distance_label)[:k]
        # regression problem: take the mean of the neighbours' labels
        return sum(label for _, label in neighbours) / k

Dimension of data:

x is a two-dimensional array

first dimension: no. of data points
second dimension: no. of features
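For example, a minimal usage sketch of the regression class above (the toy numbers are my own, not from the original post):

# each row is a data point, each column a feature
x_train = [[1.0, 2.0], [2.0, 1.0], [3.0, 3.0], [5.0, 4.0]]
y_train = [1.5, 1.8, 3.1, 4.9]

model = KNN()
model.train(x_train, y_train)

# predict the value of a new point from its 3 nearest neighbours
print(model.predict([2.0, 2.0], 3))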
For a classification problem, only the return statement in predict changes:

# classification: count the majority of labels from its neighbours
# (requires `from collections import Counter` at the top of the file)
neighbour_labels = [label for _, label in neighbours]
return Counter(neighbour_labels).most_common()[0][0]
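A minimal usage sketch for the classification variant (assuming the return statement above has been swapped into predict and Counter imported; the toy labels are my own):

from collections import Counter

x_train = [[1.0, 1.0], [1.2, 0.9], [4.0, 4.2], [3.8, 4.1]]
y_train = ["cat", "cat", "dog", "dog"]

clf = KNN()
clf.train(x_train, y_train)
print(clf.predict([1.1, 1.0], 3))  # expected: "cat"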

Space and time complexity

Space & time complexity of the train function: O(1), since it only stores references to the training data.

Predict function: computing the distances is O(MN), M being the no. of data points and N being the no. of features.

Sorting: O(M log(M))

Total time complexity: O(MN) + O(M log(M)) -> O(M log(M))

assuming log(M) is larger than N, i.e. the no. of TRAINING DATA points is much larger than the no. of FEATURES
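As a rough worked example with made-up numbers (my own, not from the post): for M = 1,000,000 training points and N = 10 features, the distance pass costs about M*N = 10^7 operations while the sort costs about M*log2(M) ≈ 2*10^7, so the sorting term dominates; with N = 100 features the distance pass (10^8 operations) would dominate instead.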

How to choose K

k is a hyper-parameter

- it is predetermined, not learned from the data

when k is too small, the prediction can be noisy

when k is too large, the prediction averages over too many data points and is not accurate either

simplest approach (a minimal sketch is shown below):

k = square root of the no. of data points
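In code, this rule of thumb looks like the following (assuming x_train holds the training points, as in the earlier toy example):

import math

k = max(1, int(math.sqrt(len(x_train))))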

Cross-validation

use the training data to tune hyper-parameters

shuffle the training data and divide it into n equal-sized partitions

then pick a range of candidate values for the hyper-parameter k

For each candidate k:

use n-1 partitions for training and the remaining 1 partition for validation

compute the validation error for each k and then select the one with the minimum error

For a more robust approach:

repeat this exercise on different partitions: every partition has a chance to be a validation data set.

so at the end, we will have n validation errors associated with each candidate:

$$\text{CV Error} = \frac{\sum_{i=1}^{n} \text{Validation Error}_i}{n}$$

Finally, we have to select the candidate with the lowest CV Error.
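A minimal sketch of this selection procedure, using the regression KNN class defined above (the fold count, candidate list, and squared-error metric are my own illustrative choices, not from the original post):

import random

def cross_validate_k(x, y, candidate_ks, n_folds=5):
    # shuffle the indices and split them into n_folds roughly equal partitions
    indices = list(range(len(x)))
    random.shuffle(indices)
    folds = [indices[i::n_folds] for i in range(n_folds)]

    cv_errors = {}
    for k in candidate_ks:
        fold_errors = []
        for i in range(n_folds):
            val_idx = set(folds[i])
            train_idx = [j for j in indices if j not in val_idx]

            model = KNN()
            model.train([x[j] for j in train_idx], [y[j] for j in train_idx])

            # mean squared validation error on the held-out partition
            errors = [(model.predict(x[j], k) - y[j]) ** 2 for j in val_idx]
            fold_errors.append(sum(errors) / len(errors))

        # CV error = average of the n validation errors for this candidate
        cv_errors[k] = sum(fold_errors) / n_folds

    # select the candidate with the lowest CV error
    return min(cv_errors, key=cv_errors.get)

For example, cross_validate_k(x, y, [1, 3, 5, 7]) returns the candidate k with the lowest average validation error on a reasonably sized data set.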
