I'm told that I need to normalize features before modeling with KNN. What's the difference between scaling to the range 0 to 1, taking a unit norm, and computing a z-score? I've been taught all of these as ways to "normalize" data, but I don't know how or why each one applies.
Answer:
RESCALING attribute data maps each feature onto a fixed range, typically [0, 1] or [-1, 1]. This is useful for optimization algorithms such as gradient descent, which are used within machine learning algorithms that weight inputs (e.g. regression and neural networks). Rescaling is also used for algorithms that rely on distance measurements, for example K-Nearest-Neighbors (KNN), where a feature with a large numeric range would otherwise dominate the distance. Rescaling like this is sometimes called "normalization". The MinMaxScaler class in Python's scikit-learn does this.
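As a rough illustration of the rescaling described above, here is a minimal pure-Python sketch of the min-max formula applied to one feature column. The function name `min_max_scale`, the `feature_range` argument, and the sample heights are made up for this example; in practice scikit-learn's MinMaxScaler applies the same kind of per-feature transform.

```python
def min_max_scale(column, feature_range=(0.0, 1.0)):
    """Linearly map the values of one feature column onto feature_range."""
    lo, hi = feature_range
    col_min, col_max = min(column), max(column)
    span = col_max - col_min
    if span == 0:
        # A constant column has no spread; map every value to the lower bound.
        return [lo for _ in column]
    return [lo + (x - col_min) / span * (hi - lo) for x in column]


# Hypothetical feature column (heights in cm), invented for illustration.
heights_cm = [150.0, 160.0, 170.0, 190.0]

scaled = min_max_scale(heights_cm)
print(scaled)  # [0.0, 0.25, 0.5, 1.0]

scaled_symmetric = min_max_scale(heights_cm, feature_range=(-1.0, 1.0))
print(scaled_symmetric)  # [-1.0, -0.5, 0.0, 1.0]
```

Note that the minimum and maximum come from the data itself, so a single extreme outlier squeezes all other values into a narrow part of the range.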
STANDARDIZING attribute data shifts and scales each feature to a mean of 0 and a standard deviation of 1 (the z-score). It is most meaningful when the input features are roughly Gaussian. This works better with linear regression, logistic regression and linear discriminant analysis. The StandardScaler class in Python's scikit-learn does this.
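The z-score computation described above can be sketched in a few lines of standard-library Python. The function name `standardize` and the sample values are invented for this example. The sketch uses the population standard deviation, which is also what scikit-learn's StandardScaler uses.

```python
import statistics


def standardize(column):
    """Return z-scores: subtract the column mean, divide by the population std."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)
    if std == 0:
        # A constant column has zero spread; every z-score is defined as 0 here.
        return [0.0 for _ in column]
    return [(x - mean) / std for x in column]


# Hypothetical feature column with mean 5 and population std 2.
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

z = standardize(values)
print(z)  # [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
```

Unlike min-max rescaling, the result is not bounded to a fixed range: a value three standard deviations above the mean simply becomes 3.0.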
NORMALIZING attribute data rescales the components of a single feature vector so that the complete vector has a length of 1. This usually means dividing each component of the feature vector by the Euclidean length of the vector, but the Manhattan length or another norm can be used instead. This pre-processing method is useful for sparse attribute features and for algorithms that use distance to learn, such as KNN.
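To make the contrast with the two methods above concrete, here is a minimal sketch of unit-norm scaling. It works on one sample (one row) at a time rather than on one feature column. The function name `normalize_vector` and its `norm` argument are hypothetical names for this example; scikit-learn provides a Normalizer class that performs this kind of per-sample scaling.

```python
import math


def normalize_vector(vector, norm="l2"):
    """Scale one feature vector (one sample) so that its norm equals 1."""
    if norm == "l2":
        # Euclidean length: square root of the sum of squared components.
        length = math.sqrt(sum(x * x for x in vector))
    elif norm == "l1":
        # Manhattan length: sum of absolute component values.
        length = sum(abs(x) for x in vector)
    else:
        raise ValueError(f"unsupported norm: {norm!r}")
    if length == 0:
        # An all-zero vector cannot be scaled to length 1; return it unchanged.
        return list(vector)
    return [x / length for x in vector]


# Hypothetical sample with two features, invented for illustration.
sample = [3.0, 4.0]

unit_l2 = normalize_vector(sample)             # Euclidean length becomes 1
unit_l1 = normalize_vector(sample, norm="l1")  # absolute values sum to 1
print(unit_l2)  # [0.6, 0.8]
```

After this transform only the direction of each sample is kept, so two samples whose features are in the same proportions become identical regardless of their original magnitudes.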
Scaling: into the range 0 to 1
Standardization: mean 0, standard deviation 1
Normalization: the feature vector has length 1
ref) http://datareality.blogspot.com/2016/11/scaling-normalizing-standardizing-which.html