Data Mining Process
Pipeline
-
The workflow of a typical data mining application
-
Data collection
-
Pre-processing (cleaning the collected data for use)
-
Analytical processing and algorithms
-
Post-processing
-
Major Tasks in Data Preprocessing
-
- Data cleaning
- : Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
=> How should missing or incorrectly recorded values be handled?
-
- Data integration
- : Integration of multiple databases, data cubes, or files
=> How can multiple data sources be combined without breaking consistency?
-
- Data reduction
- : Dimensionality reduction, Data compression
=> How should the data be reduced when it is too large?
-
- Data transformation
- : Discretization, Normalization
=> How should the data be transformed?
Data Cleaning
📙 NOTE: Data Quality: Why Preprocess the Data?
- Measures for data quality: A multidimensional view
- Accuracy: how accurately the data were collected. correct or wrong, accurate or not
- Completeness: whether anything failed to be recorded. not recorded, unavailable, …
- Consistency: whether the data are internally consistent. some modified but some not, dangling, …
- Timeliness: how regularly the data are updated. timely update?
- Believability: how trustworthy the data are
- Interpretability: how easily the data can be understood?
Real-world data are highly varied and messy: “Data in the Real World Is Dirty.”
: Lots of potentially incorrect data, e.g. faulty recording instruments, human or computer error, transmission errors
-
- Incomplete
- : lacking attribute values, lacking certain attributes of interest, or containing only aggregate data (missing values are present in the data)
- e.g. Occupation=“ ” (missing data)
-
- Noisy
- : containing noise, errors, or outliers (noise, errors, or outliers mixed into the data)
- e.g. Salary=“-10” (an error)
-
- Inconsistent
- : containing discrepancies in codes or names (disagreement between two features)
- e.g. Age=“42”, Birthday=“03/07/2010”
- Was rating “1, 2, 3”, now rating “A, B, C”
- discrepancy between duplicate records
-
- Intentional
- : disguised missing data (a default or placeholder value entered on purpose)
- e.g. 7th of July is everyone’s birthday.
=> These issues must be handled properly to produce clean data.
Incomplete (Missing) Data
-
Data is not always available
e.g. many tuples have no recorded value for several attributes, such as customer income in sales data
-
Missing data may be due to
-
equipment malfunction
-
inconsistent with other recorded data and thus deleted
-
data not entered due to misunderstanding
-
certain data may not be considered important at the time of entry
-
not register history or changes of the data
-
-
Missing data may need to be inferred
Assumption for Missing Data
📙 NOTE: Notation
- Missing or Not for variable Y: M = 0 (observed), M=1 (missing)
- Other observed variable X
-
Missing Completely at Random (MCAR)
-
Missingness is independent of attributes and occurs entirely at random: $P(Y \vert M) = P(Y)$
-
Unrealistically strong assumption in practice
-
-
Missing at Random (MAR)
-
Missingness is related to other attributes (whether a value is missing depends on other observed variables): $P(Y \vert X, M) = P(Y \vert X)$
-
e.g. In a survey, the salary field tends to be blank for respondents under 25 -> people under 25 are likely to be students without a job, so there is no salary to report (the missingness correlates with the observed age variable)
-
The missingness can be explained by the variables that are actually observed
-
e.g. Missing data on hobbies tend to be more common among individuals with higher incomes
-
-
Missing Not at Random (MNAR)
-
Missingness is related to unobserved measurements (related to the reason it’s missing) : $P(Y\vert X, M)\neq P(Y\vert X)$
-
Informative or non-ignorable missingness
-
e.g. Missing values in education level can imply lower education levels
-
The missingness has a cause, but the variable responsible for it is not recorded in the data, so the mechanism is hard to identify.
-
(Figure) y: variable with missing values; blue: observed data; red: missing data
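As a rough illustration (not from the lecture), here is a minimal NumPy/pandas sketch that simulates the three missingness mechanisms on made-up age/salary data, where age plays the role of the observed variable X and salary the role of Y:

```python
# A hypothetical sketch of MCAR / MAR / MNAR on synthetic data; names and
# thresholds are illustrative only.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1_000
age = rng.integers(18, 65, size=n)                               # observed variable X
salary = 20_000 + 1_000 * (age - 18) + rng.normal(0, 5_000, n)   # variable Y

df = pd.DataFrame({"age": age, "salary": salary})

# MCAR: every salary is missing with the same probability, regardless of anything
mcar_mask = rng.random(n) < 0.2

# MAR: missingness depends only on the observed variable (here, age < 25)
mar_mask = (df["age"] < 25) & (rng.random(n) < 0.8)

# MNAR: missingness depends on the (unobserved) salary value itself
mnar_mask = (df["salary"] < 30_000) & (rng.random(n) < 0.8)

for name, mask in [("MCAR", mcar_mask), ("MAR", mar_mask), ("MNAR", mnar_mask)]:
    col = df["salary"].mask(mask)                # selected entries become NaN
    print(name, "missing rate:", col.isna().mean().round(3))
```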
How to Handle Missing Data?
-
Eliminate data tuples or attributes (delete the data)
-
Usually done when class label is missing (when doing classification)—not effective when the % of missing values per attribute varies considerably
-
This is the most convenient option when there is plenty of data.
-
-
Fill in the missing value manually: tedious + infeasible?
-
Fill it in automatically with (impute the data)
-
a global constant : e.g., “unknown”, a new class?!
-
the attribute mean
-
the attribute mean for all samples belonging to the same class: smarter
-
the most probable value: inference-based such as Bayesian formula or decision tree
-
-
- Imputation Using (Mean/Median) Values
- : Use the attribute's mean or median

| 👍 Pros | 👎 Cons |
|---|---|
| Easy & fast | No major drawbacks |
| Works well with small numerical datasets | If anything, the accuracy is not particularly good |
-
- Imputation Using (Most Frequent) or (Zero/Constant) Values
- : Use the most frequent value, or zero / a constant

| 👍 Pros | 👎 Cons |
|---|---|
| Works well with categorical features | No major drawbacks |
-
- Imputation Using Inference-based Methods (e.g., k-NN, multivariate imputation, deep learning)
- : Predict the missing value from the other features

| 👍 Pros | 👎 Cons |
|---|---|
| Can be much more accurate than the mean, median, or most-frequent imputation methods (it depends on the dataset) | High computational cost with large datasets; an additional model has to be built |
| | You have to specify the columns that contain information about the target column that will be imputed |
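A minimal scikit-learn sketch of the imputation strategies above (assuming scikit-learn is installed; the column names and values are made up):

```python
# Hypothetical example: mean imputation vs. inference-based (k-NN) imputation.
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer, KNNImputer

df = pd.DataFrame({
    "age":    [22, 25, np.nan, 31, 40],
    "salary": [28_000, np.nan, 52_000, 61_000, np.nan],
})

# Mean imputation (strategy can also be "median", "most_frequent", or "constant")
mean_imp = SimpleImputer(strategy="mean")
df_mean = pd.DataFrame(mean_imp.fit_transform(df), columns=df.columns)

# Inference-based imputation: each missing value is estimated from its
# k nearest neighbours in the other columns.
knn_imp = KNNImputer(n_neighbors=2)
df_knn = pd.DataFrame(knn_imp.fit_transform(df), columns=df.columns)

print(df_mean)
print(df_knn)
```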
Noisy Data
-
Noise : random error or variance in a measured variable
-
Incorrect attribute values may be due to
-
faulty data collection instruments
-
data entry problems
-
data transmission problems
-
technology limitation
-
inconsistency in naming convention
-
-
Other data problems which require data cleaning
-
duplicate records
-
incomplete data
-
inconsistent data
-
=> These are all data quality problems!
How to Handle Noisy Data?
-
Binning
-
first sort data and partition into (equal-frequency) bins
-
then one can smooth by bin means, smooth by bin medians, or smooth by bin boundaries (each value is replaced by the nearer of the bin's minimum or maximum), etc.
-
-
Regression
-
smooth by fitting the data into regression functions
-
Predict an attribute from one or more other attributes and find the best-fitting line.
-
-
Clustering
-
detect and remove outliers
-
Group similar values into clusters and remove the values that fall outside them (i.e., smoothing aimed at outliers); see the sketch at the end of this section.
-
-
Combined computer and human inspection
-
detect suspicious values and check by human
-
e.g., deal with possible outliers
-
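As a rough sketch of the clustering idea above, DBSCAN can flag isolated values as noise; the price values, the artificial outlier 120, and the eps/min_samples settings are illustrative only:

```python
# Hypothetical sketch: cluster similar values and treat points that belong to
# no cluster (DBSCAN label -1) as suspected outliers.
import numpy as np
from sklearn.cluster import DBSCAN

values = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34, 120.0]).reshape(-1, 1)

labels = DBSCAN(eps=7, min_samples=3).fit_predict(values)
outliers = values.ravel()[labels == -1]
print("suspected outliers:", outliers)   # the isolated value 120 is labelled noise
```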
Data Integration
-
Data integration
Combines data from multiple sources into a coherent store
-
Schema integration
Integrate metadata from different sources
-
Entity identification problem
Identify real-world entities from multiple data sources (since samples come from different sources, identifiers such as names can differ)
(e.g. Bill Clinton = William Clinton)
-
Detecting and resolving data value conflicts
For the same real world entity, attribute values from different sources are different
Possible reasons: different representations, different scales
(e.g. metric vs. British units)
Handling Redundancy in Data Integration
-
Redundant data often occur when multiple databases are integrated
-
Object identification: The same attribute or object may have different names in different databases
-
Derivable data: One attribute may be a “derived” attribute in another table, e.g., annual revenue
-
-
Redundant attributes may be detected by correlation analysis and covariance analysis
-
Careful integration of the data from multiple sources may help reduce/avoid redundancies and inconsistencies and improve mining speed and quality
Correlation Analysis (Nominal Data)
-
$\chi^2$ (chi-square) test
-
$\chi^2 = \sum \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}}$
-
Given two categorical variables, compare the observed counts with the expected counts to see how strongly the two variables are related to each other.
-
The larger the $\chi^2$ value, the more likely the variables are related
-
The cells that contribute the most to the $\chi^2$ value are those whose actual count is very different from the expected count
-
-
Correlation does not imply causality.
-
number of hospitals and number of car-theft in a city are correlated
-
Both are causally linked to the third variable: population
-
Chi-Square Calculation: An Example
| | Play chess | Not play chess | Sum (row) |
|---|---|---|---|
| Like science fiction | 250 (90) | 200 (360) | 450 |
| Not like science fiction | 50 (210) | 1000 (840) | 1050 |
| Sum (col.) | 300 | 1200 | 1500 |

$\chi^2 = \frac{(250-90)^2}{90} + \frac{(50-210)^2}{210} + \frac{(200-360)^2}{360} + \frac{(1000-840)^2}{840} = 507.93$
-
The correlation between the first variable (whether one plays chess) and the second variable (whether one likes science fiction).
-
$\chi^2$ (chi-square) calculation (numbers in parentheses are expected counts calculated based on the data distribution in the two categories)
-
The chi-square statistic is judged against a distribution that depends on the degrees of freedom k (for this 2×2 table, $k = (2-1)(2-1) = 1$).
-
It shows that like_science_fiction and play_chess are correlated in the group
-
$P(\chi^2 > 507.93) \approx 0$
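The same result can be checked with SciPy (assuming it is installed); `correction=False` disables the Yates continuity correction so the value matches the hand calculation:

```python
# Verify the chi-square example with scipy.stats.chi2_contingency.
import numpy as np
from scipy.stats import chi2_contingency

# Rows: like / don't like science fiction; columns: play / don't play chess
observed = np.array([[250, 200],
                     [50, 1000]])

chi2, p_value, dof, expected = chi2_contingency(observed, correction=False)
print(chi2)      # ≈ 507.93
print(expected)  # [[90, 360], [210, 840]] — the counts in parentheses above
print(p_value)   # ≈ 0, so the two variables are strongly related
```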
Correlation Analysis (Numeric Data)
: Correlation coefficient (also called Pearson’s product moment coefficient)
-
$r_{A,B} = \frac{\sum_{i=1}^{n}(a_i - \bar A)(b_i - \bar B)}{(n-1)\sigma_A \sigma_B} = \frac{\sum_{i=1}^{n}(a_ib_i) - n\bar A\bar B}{(n-1)\sigma_A \sigma_B}$
-
$n$ = the number of tuples,
-
$\bar A$ and $\bar B$ = the respective means of $A$ and $B$
-
$\sigma_A$ and $\sigma_B$ = the respective standard deviations of $A$ and $B$
-
$\sum(a_ib_i)$ = the sum of the $AB$ cross-product.
-
-
If $r_{A,B} > 0$, $A$ and $B$ are positively correlated ($A$’s values increase as $B$’s). The higher, the stronger correlation.
-
$r_{A,B} = 0$: uncorrelated (no linear relationship; independence implies $r_{A,B} = 0$, but the converse does not hold)
-
$r_{A,B} < 0$: negatively correlated
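A minimal NumPy sketch of the formula above, using two made-up series:

```python
# Compute Pearson's correlation coefficient directly and via NumPy's built-in.
import numpy as np

a = np.array([2.0, 3.0, 5.0, 4.0, 6.0])
b = np.array([5.0, 8.0, 10.0, 11.0, 14.0])

n = len(a)
# Direct implementation of the formula (sample standard deviation, ddof=1)
r = ((a - a.mean()) * (b - b.mean())).sum() / ((n - 1) * a.std(ddof=1) * b.std(ddof=1))

print(r)
print(np.corrcoef(a, b)[0, 1])   # same value from NumPy's correlation matrix
```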
Visually Evaluating Correlation
Covariance (Numeric Data)
: Covariance is similar to correlation
-
$Cov(A,B) = E((A - \bar A)(B - \bar B)) = \frac {\sum_{i=1}^{n}(a_i-\bar A)(b_i - \bar B)}{n}$
-
$Cov(A,B) = E(A \cdot B) - \bar A \bar B \qquad r_{A,B} = \frac{Cov(A, B)}{\sigma_A \sigma_B}$
-
- Positive covariance
- : If $Cov(A,B) > 0$, then $A$ and $B$ both tend to be larger than their expected values.
-
- Negative covariance
- : If $Cov(A,B) < 0$ then if $A$ is larger than its expected value, $B$ is likely to be smaller than its expected value.
-
- Independence
- : If $A$ and $B$ are independent, then $Cov(A,B) = 0$, but the converse is not true
Covariance: An Example
-
Suppose two stocks $A$ and $B$ have the following values in one week: (2, 5), (3, 8), (5, 10), (4, 11), (6, 14).
-
Question: If the stocks are affected by the same industry trends, will their prices rise or fall together?
-
$E(A)=(2+3+5+4+6)/5=20/5=4$
-
$E(B)=(5+8+10+11+14)/5=48/5=9.6$
-
$Cov(A,B) = (2×5+3×8+5×10+4×11+6×14)/5 − 4 × 9.6 = 4$
-
-
Thus, $A$ and $B$ rise together since $Cov(A, B) > 0$.
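The example can be verified with NumPy; `bias=True` selects the $n$ denominator used in the definition above:

```python
# Reproduce the stock example: Cov(A, B) = E(A·B) − E(A)E(B).
import numpy as np

A = np.array([2, 3, 5, 4, 6], dtype=float)
B = np.array([5, 8, 10, 11, 14], dtype=float)

cov_manual = (A * B).mean() - A.mean() * B.mean()   # E(A·B) − E(A)E(B)
cov_numpy = np.cov(A, B, bias=True)[0, 1]           # population covariance (n denominator)
print(cov_manual, cov_numpy)                        # both 4.0 -> the prices move together
```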
Data Reduction
Data Reduction 1: Dimensionality Reduction
-
Curse of dimensionality
-
When dimensionality increases, data becomes increasingly sparse
-
Density and distance between points, which are critical to clustering and outlier analysis, become less meaningful
-
-
Dimensionality reduction
-
Avoid the curse of dimensionality
-
Help eliminate irrelevant features and reduce noise
-
Reduce time and space required in data mining
-
Allow easier visualization
-
-
Dimensionality reduction techniques
-
Wavelet transforms
-
Principal Component Analysis (see the sketch at the end of this section)
-
Attribute Subset Selection
-
One way to reduce dimensionality of data
-
Redundant attributes
-
Duplicate much or all of the information contained in one or more other attributes
-
e.g., purchase price of a product and the amount of sales tax paid
-
-
Irrelevant attributes
-
Contain no information that is useful for the data mining task at hand
-
e.g., students’ ID is often irrelevant to the task of predicting students’ GPA
-
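Returning to the PCA technique listed above, a minimal scikit-learn sketch (the random data and component count are purely illustrative):

```python
# Hypothetical sketch of PCA-based dimensionality reduction.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 10))          # 100 samples, 10 attributes

pca = PCA(n_components=2)               # keep the 2 directions of largest variance
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                  # (100, 2)
print(pca.explained_variance_ratio_)    # fraction of variance kept per component
```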
Data Reduction 2: Numerosity Reduction
-
Clustering
-
Partition data set into clusters based on similarity, and store cluster representation (e.g., centroid and diameter) only
-
Can be very effective if data is clustered but not if data is “smeared”
-
Can have hierarchical clustering and be stored in multi-dimensional index tree structures
-
-
Sampling
-
Obtaining a small sample $s$ to represent the whole data set $N$
-
Sampling is typically used in data mining when processing the entire set of data of interest is too expensive or time consuming.
-
Key principles for effective sampling are:
-
A sample is representative if it has approximately the same properties (of interest) as the original data set.
-
If a sample is representative, using the sample will work almost as well as using the entire data set.
-
-
Types of Sampling
-
Simple Random Sampling
-
There is an equal probability of selecting any particular object
-
Sampling without replacement
-
As each object is selected, it is removed from the population
-
The same object cannot be picked up more than once
-
-
Sampling with replacement
-
Objects are not removed from the population as they are selected for the sample
-
The same object can be picked up more than once
-
-
Simple random sampling may have very poor performance in the presence of skew
-
-
Stratified Sampling
- Split the data into several partitions; then draw a random sample from each partition (see the sketch below)
Sampling: With or without Replacement
Sampling: Stratified Sampling
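A minimal pandas sketch of simple random sampling (with and without replacement) and stratified sampling; the "label" column and 80/20 split are hypothetical:

```python
# Hypothetical sketch of the sampling schemes above.
import pandas as pd

df = pd.DataFrame({
    "value": range(100),
    "label": ["A"] * 80 + ["B"] * 20,     # skewed class distribution
})

# Simple random sampling without replacement
srs_wo = df.sample(n=10, replace=False, random_state=0)

# Simple random sampling with replacement (the same row can appear more than once)
srs_w = df.sample(n=10, replace=True, random_state=0)

# Stratified sampling: draw 10% from each class separately
strat = df.groupby("label").sample(frac=0.1, random_state=0)
print(strat["label"].value_counts())      # preserves the 80/20 class ratio (8 A, 2 B)
```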
Data Transformation
-
A function that maps the entire set of values of a given attribute to a new set of replacement values s.t. each old value can be identified with one of the new values
-
Methods
-
Smoothing: Remove noise from data (e.g., remove noise by averaging, such as a moving average)
-
Attribute/feature construction
- New attributes constructed from the given ones
-
Simple functions: $x^k$, $\log(1+x)$, $e^x$, $\left\vert x \right\vert$, $\dots$
- For financial data, ratios matter more than raw magnitudes, so a $\log$ transform is usually applied.
-
Normalization: Scaled to fall within a smaller, specified range (values are mapped into a common range so that all data can be handled consistently)
-
min-max normalization
-
z-score normalization
-
normalization by decimal scaling
-
-
Discretization: Concept hierarchy climbing (cutting a continuous variable into intervals)
-
Normalization
-
Min-max normalization: to $[newmin_A, newmax_A]$
-
$v' = \frac{v - min_A}{max_A - min_A}(newmax_A - newmin_A) + newmin_A$
-
Subtract the attribute's minimum $min_A$ from a value $v$ and divide by $max_A - min_A$; every value then falls between 0 and 1.
-
Normalization maps the largest value to 1 and the smallest to 0, scaling everything in between linearly.
-
-
Ex. Let income range 12,000 to 98,000 normalized to $[0.0, 1.0]$
-
Then 73,600 is mapped to $\frac{73600 - 12000}{98000 - 12000}(1 - 0) + 0 = 0.716$
-
-
Z-score normalization ($\mu$: mean, $\sigma$: standard deviation):
-
$v' = \frac{v - \mu_A}{\sigma_A}$
-
Ex. Let $\mu=54,000$ , $\sigma=16,000$ . Then, $\frac{73600-54000}{16000}=1.225$
-
=> e.g., standard scores: convert to units of standard deviation ($\sigma$) to see how far above the mean a value lies.
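A short NumPy sketch reproducing the two normalizations on the income example:

```python
# Min-max and z-score normalization; the income array is illustrative.
import numpy as np

income = np.array([12_000, 54_000, 73_600, 98_000], dtype=float)

# Min-max normalization to [0.0, 1.0]
min_max = (income - income.min()) / (income.max() - income.min())
print(min_max)          # 73,600 -> (73600 - 12000) / (98000 - 12000) ≈ 0.716

# Z-score normalization with the given mean and standard deviation
mu, sigma = 54_000, 16_000
z = (income - mu) / sigma
print(z)                # 73,600 -> (73600 - 54000) / 16000 = 1.225
```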
Discretization
-
Three types of attributes
-
Nominal — values from an unordered set, e.g., color, profession
-
Ordinal — values from an ordered set, e.g., military or academic rank
-
Numeric — real numbers, e.g., integer or real numbers
-
-
Discretization: Divide the range of a continuous attribute into intervals
The key question is what criterion is used to split the range.
-
Interval labels can then be used to replace actual data values
-
Reduce data size by discretization
-
Supervised vs. unsupervised
-
Split (top-down) vs. merge (bottom-up)
-
Discretization can be performed recursively on an attribute
-
Prepare for further analysis, e.g., classification
=> A method frequently used to simplify the data.
-
Simple Discretization: Binning
-
Equal-width (distance) partitioning
-
Divides the range into N intervals of equal size: uniform grid
-
if $A$ and $B$ are the lowest and highest values of the attribute, the width of the intervals will be: $W = (B - A)/N$
-
The most straightforward, but outliers may dominate presentation
-
Skewed data is not handled well
-
-
Equal-depth (frequency) partitioning
-
Divides the range into N intervals, each containing approximately same number of samples
-
Good data scaling
-
Managing categorical attributes can be tricky
-
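A minimal pandas sketch contrasting the two partitioning schemes on the price data used in the example below (`pd.cut` gives equal-width bins, `pd.qcut` equal-depth bins):

```python
# Equal-width vs. equal-depth partitioning with pandas.
import pandas as pd

prices = pd.Series([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

equal_width = pd.cut(prices, bins=3)    # 3 intervals of equal width
equal_depth = pd.qcut(prices, q=3)      # 3 intervals with roughly equal counts

print(equal_width.value_counts().sort_index())
print(equal_depth.value_counts().sort_index())
```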
Binning Methods for Data Smoothing
-
Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
-
Partition into equal-frequency (equi-depth) bins:
-
Bin 1: 4, 8, 9, 15
-
Bin 2: 21, 21, 24, 25
-
Bin 3: 26, 28, 29, 34
-
-
Smoothing by bin means:
-
Bin 1: 9, 9, 9, 9
-
Bin 2: 23, 23, 23, 23
-
Bin 3: 29, 29, 29, 29
-
-
Smoothing by bin boundaries:
-
Bin 1: 4, 4, 4, 15
-
Bin 2: 21, 21, 25, 25
-
Bin 3: 26, 26, 26, 34
-
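A small Python sketch that reproduces the smoothing results above (rounding bin means to whole dollars as in the example):

```python
# Smoothing by bin means and by bin boundaries on the equal-depth bins above.
bins = [
    [4, 8, 9, 15],
    [21, 21, 24, 25],
    [26, 28, 29, 34],
]

# Smoothing by bin means: every value is replaced by its bin's (rounded) mean
by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]
print(by_means)        # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]

# Smoothing by bin boundaries: every value is replaced by the nearer boundary
def smooth_by_boundaries(b):
    lo, hi = min(b), max(b)
    return [lo if v - lo <= hi - v else hi for v in b]

print([smooth_by_boundaries(b) for b in bins])
# [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```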
Binarization (One-hot-encoding)
-
Binarization maps a continuous or categorical attribute into one or more binary attributes
-
Often convert a continuous attribute to a categorical attribute and then convert the categorical attribute to a set of binary attributes
-
Typically used for association analysis
-
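A minimal pandas sketch of the continuous -> categorical -> binary pipeline described above (the age bins and labels are made up):

```python
# Hypothetical binarization / one-hot encoding example.
import pandas as pd

df = pd.DataFrame({"age": [15, 22, 37, 41, 68]})

# Continuous -> categorical (discretization)
df["age_group"] = pd.cut(df["age"], bins=[0, 20, 40, 100],
                         labels=["young", "middle", "senior"])

# Categorical -> set of binary indicator attributes
one_hot = pd.get_dummies(df["age_group"], prefix="age")
print(one_hot)
```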
Data Exploration
Summary Statistics
-
Summary statistics are numbers that summarize properties of the data
-
For categorical attributes
-
Frequency
-
Mode
-
-
For continuous attributes
-
Percentiles
-
Measures of Location: Mean and Median
-
Measures of Spread: Range and Variance
-
-
Histograms
-
Used to plot the distribution of the values of a single attribute (how a single variable is distributed)
-
It divides the values into bins and shows a bar plot of the number of objects in each bin
-
The height of each bar indicates the number of objects
-
The shape of the histogram depends on the number of bins (choosing the number of bins, i.e., how finely the $x$-axis is divided, is important)
-
-
Two-Dimensional Histograms
- Used to plot the joint distribution of the values of two attributes
Box Plots
-
Another way of displaying the distribution of data
-
Box plots can be used to compare attributes
-
50th percentile: the median
-
25th and 75th percentiles: the quartiles (Q1 and Q3)
Scatter Plots
-
Used to plot the pairwise correlations between attributes (how two variables relate to each other)
-
Each data object is depicted as a marker.
-
The position of a marker is determined by the values of its attributes.
-
While two-dimensional scatter plots are most common, three-dimensional scatter plots can also be utilized.
-
Often, additional attributes can be displayed by using the size, shape, and color of the markers that represent the objects
-
-
Additional attributes can be incorporated into scatter plots through:
-
Marker Size: Representing an attribute using the size of markers.
-
Marker Shape: Differentiating markers based on an additional attribute.
-
Marker Color: Using color variation to convey information about another attribute.
-
-
petal_length and petal_width have a linear relationship (the values rise roughly in proportion).
sepal_width and sepal_length show no clear relationship.
sepal_width와 sepal_length는 관련이 없다.