본문 바로가기

Data handling

스케일링/표준화/정규화 차이

728x90

I'm told that I need to normalize features before modeling with KNN. What's the difference between scaling to 0 and 1, taking a unit norm, or z score? I've been taught all of these to "normalize" data, but don't know how or why each apply.

Answer:

RESCALING attribute data to values to scale the range in [0, 1] or [-1, 1] is useful for the optimization algorithms, such as gradient descent, that are used within machine learning algorithms that weight inputs (e.g. regression and neural networks). Rescaling is also used for algorithms that use distance measurements for example K-Nearest-Neighbors (KNN). Rescaling like this is sometimes called "normalization". MinMaxScaler class in python skikit-learn does this.

STANDARDIZING attribute data assumes a Gaussian distribution of input features and "standardizes" to a mean of 0 and a standard deviation of 1. This works better with linear regression, logistic regression and linear discriminate analysis. Python StandardScaler class in scikit-learn works for this.

NORMALIZING attribute data is used to rescale components of a feature vector to have the complete vector length of 1. This usually means dividing each component of the feature vector by the Euclidiean length of the vector but can also be Manhattan or other distance measurements. This pre-processing rescaling method is useful for sparse attribute features and algorithms using distance to learn such as KNN.


스케일링은 : 0~1 사이로

표준화는 : 평균 0, 표준편차는 1로

정규화는 : 특징 벡터의 길이가 1이되게 


ref) http://datareality.blogspot.com/2016/11/scaling-normalizing-standardizing-which.html