Data Mining Process

Pipeline

  • The workflow of a typical data mining application

    • Data collection

    • Pre-processing (cleaning the data so it can be used)

    • Analytical processing and algorithms

    • Post-processing


Major Tasks in Data Preprocessing

  • Data cleaning
    : Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies

    => How are missing variables and incorrectly recorded values handled?

  • Data integration
    : Integration of multiple databases, data cubes, or files

    => How can multiple data sources be combined without breaking consistency?

  • Data reduction
    : Dimensionality reduction, Data compression

    => How can the data be reduced when it is too large?

  • Data transformation
    : Discretization, Normalization

    => How should the data be transformed?

Data Cleaning

📙 NOTE: Data Quality: Why Preprocess the Data?

  • Measures for data quality: A multidimensional view
  • Accuracy: how accurately the data were collected (correct or wrong, accurate or not)
  • Completeness: whether anything failed to be recorded (not recorded, unavailable, …)
  • Consistency: whether the data are internally consistent (some modified but some not, dangling, …)
  • Timeliness: how regularly the data are updated (timely update?)
  • Believability: how much the data can be trusted to be correct
  • Interpretability: how easily the data can be understood?


Real-world data are highly varied and dirty: “Data in the Real World Is Dirty.”

: Lots of potentially incorrect data, e.g. faulty recording instruments, human or computer error, transmission errors

  • Incomplete
    : lacking attribute values, lacking certain attributes of interest, or containing only aggregate data. (missing values exist among the data)
    • e.g. Occupation=“ ” (missing data)
  • Noisy
    : containing noise, errors, or outliers (noise, errors, or outliers mixed into the data)
    • e.g. Salary=“−10” (an error)
  • Inconsistent
    : containing discrepancies in codes or names (disagreement between two features)
    • e.g. Age=“42”, Birthday=“03/07/2010”
    • Was rating “1, 2, 3”, now rating “A, B, C”
    • discrepancy between duplicate records
  • Intentional
    : deliberately entered default values (disguised missing data)
    • e.g. 7th of July is everyone’s birthday.

=> These issues must be handled appropriately to produce clean data.

Incomplete (Missing) Data

  • Data is not always available

    e.g. many tuples have no recorded value for several attributes, such as customer income in sales data

  • Missing data may be due to

    • equipment malfunction

    • inconsistent with other recorded data and thus deleted

    • data not entered due to misunderstanding

    • certain data may not be considered important at the time of entry

    • the history or changes of the data were not registered

  • Missing data may need to be inferred

Assumption for Missing Data

📙 NOTE: Notation

  • Missingness indicator for variable Y: M = 0 (observed), M = 1 (missing)
  • X: other observed variables
  • Missing Completely at Random (MCAR)

    • Missingness is independent of attributes and occurs entirely at random: $P(Y \vert M) = P(Y)$

    • Unrealistically strong assumption in practice

  • Missing at Random (MAR)

    • Missingness is related to other observed attributes (whether a value goes missing depends on other variables): $ P(Y\vert X, M) = P(Y\vert X) $

    • e.g. In a survey, the salary field is mostly blank for respondents under 25; people under 25 are often students with no job and hence no salary. (a correlation)

    • The missingness can be explained by the variables that are actually observed

    • e.g. Missing data on hobbies tend to be more common among individuals with higher incomes

  • Missing Not at Random (MNAR)

    • Missingness is related to unobserved measurements (related to the reason it’s missing) : $P(Y\vert X, M)\neq P(Y\vert X)$

    • Informative or non-ignorable missingness

    • e.g. Missing values in education level can imply lower education levels

    • The missingness does have a cause, i.e. it depends on some variable, but that variable is not recorded in the data, so the mechanism is hard to identify.


[Figure: scatter plot of a variable y containing missing values; blue markers are observed data, red markers are missing data]
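
A minimal NumPy sketch of how the three mechanisms could be simulated; the variables (age as the observed X, income as the Y that may go missing) and all numbers are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
age = rng.integers(18, 65, size=n)                          # observed variable X
income = 20000 + 800 * age + rng.normal(0, 5000, size=n)    # variable Y that may go missing

# MCAR: missingness is pure chance, unrelated to any variable
m_mcar = rng.random(n) < 0.2

# MAR: missingness depends only on the observed variable (age),
# e.g. respondents under 25 often leave the income field blank
m_mar = (age < 25) & (rng.random(n) < 0.7)

# MNAR: missingness depends on the (unobserved) value itself,
# e.g. high earners refuse to report their income
m_mnar = (income > np.quantile(income, 0.8)) & (rng.random(n) < 0.7)

income_mcar = np.where(m_mcar, np.nan, income)
income_mar = np.where(m_mar, np.nan, income)
income_mnar = np.where(m_mnar, np.nan, income)
```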

How to Handle Missing Data?

  • Eliminate data tuples or attributes (delete the data)

    • Usually done when class label is missing (when doing classification)—not effective when the % of missing values per attribute varies considerably

    • Easiest option when there is plenty of data.

  • Fill in the missing value manually: tedious + infeasible?

  • Fill it in automatically (fill in the data) with

    • a global constant : e.g., “unknown”, a new class?!

    • the attribute mean

    • the attribute mean for all samples belonging to the same class: smarter

    • the most probable value: inference-based such as Bayesian formula or decision tree

  • Imputation using mean/median values
    : use the attribute’s mean or median
    • 👍 Pros: easy and fast; works well with small numerical datasets
    • 👎 Cons: no major drawbacks; if anything, accuracy is not that good

  • Imputation using the most frequent value or a zero/constant value
    : use the most frequent value, or zero/a constant
    • 👍 Pros: works well with categorical features
    • 👎 Cons: no major drawbacks

  • Imputation using inference-based methods (e.g., k-NN, multivariate imputation, deep learning)
    : predict the missing value from the other features of the data
    • 👍 Pros: can be much more accurate than mean, median, or most-frequent imputation (it depends on the dataset)
    • 👎 Cons: high computational cost with large datasets, and an additional model has to be built; you have to specify the columns that contain information about the target column that will be imputed
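
A short scikit-learn sketch of these imputation strategies on a made-up numeric matrix; the toy values are hypothetical, and `KNNImputer` stands in for the inference-based methods.

```python
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

# toy matrix (age, salary) with missing entries; the values are made up
X = np.array([[25.0, 50000.0],
              [32.0, np.nan],
              [47.0, 81000.0],
              [np.nan, 62000.0],
              [51.0, 90000.0]])

# mean (or median) imputation
X_mean = SimpleImputer(strategy="mean").fit_transform(X)

# constant (e.g. zero) imputation
X_zero = SimpleImputer(strategy="constant", fill_value=0).fit_transform(X)

# most-frequent imputation (useful for categorical features)
X_mode = SimpleImputer(strategy="most_frequent").fit_transform(X)

# inference-based: k-NN imputation predicts the missing value from the other columns
X_knn = KNNImputer(n_neighbors=2).fit_transform(X)
```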


Noisy Data

  • Noise : random error or variance in a measured variable

  • Incorrect attribute values may be due to

    • faulty data collection instruments

    • data entry problems

    • data transmission problems

    • technology limitation

    • inconsistency in naming convention

  • Other data problems which require data cleaning

    • duplicate records

    • incomplete data

    • inconsistent data

=> These are all data quality issues!

How to Handle Noisy Data?

  • Binning

    • first sort data and partition into (equal-frequency) bins

    • then one can smooth by bin means, smooth by bin medians, smooth by bin boundaries (each value is replaced by the nearest boundary, i.e. the bin minimum or maximum), etc.

  • Regression

    • smooth by fitting the data into regression functions

    • Use one or more attributes to predict another attribute and find the best-fitting line.

  • Clustering

    • detect and remove outliers

    • Group similar values into clusters and remove values that fall outside the groups. (i.e., smoothing aimed at outliers)

  • Combined computer and human inspection

    • detect suspicious values and check by human

    • e.g., deal with possible outliers


Data Integration

  • Data integration

    Combines data from multiple sources into a coherent store

  • Schema integration

    Integrate metadata from different sources

  • Entity identification problem

    Identify real world entities from multiple data sources (each sample may come from a different source, so things like names can differ)

    (e.g. Bill Clinton = William Clinton)

  • Detecting and resolving data value conflicts

    For the same real world entity, attribute values from different sources are different

    Possible reasons: different representations, different scales

    (e.g. metric vs. British units)

Handling Redundancy in Data Integration

  • Redundant data occur often when integration of multiple databases

    • Object identification: The same attribute or object may have different names in different databases

    • Derivable data: One attribute may be a “derived” attribute in another table, e.g., annual revenue

  • Redundant attributes may be able to be detected by correlation analysis and covariance analysis

  • Careful integration of the data from multiple sources may help reduce/avoid redundancies and inconsistencies and improve mining speed and quality

Correlation Analysis (Nominal Data)

  • $\chi^2$ (chi-square) test

    • $\chi^2=\sum{(Observed-Expected)^2\over Expected}$

    • Given two categorical variables, compare the observed counts against the expected counts to see how strongly the two variables are related.

    • The larger the $\chi^2$ value, the more likely the variables are related

    • The cells that contribute the most to the $\chi^2$ value are those whose actual count is very different from the expected count

  • Correlation does not imply causality.

    • number of hospitals and number of car-theft in a city are correlated

    • Both are causally linked to the third variable: population

Chi-Square Calculation: An Example

|                          | Play chess | Not play chess | Sum (row) |
| ------------------------ | ---------- | -------------- | --------- |
| Like science fiction     | 250 (90)   | 200 (360)      | 450       |
| Not like science fiction | 50 (210)   | 1000 (840)     | 1050      |
| Sum (col.)               | 300        | 1200           | 1500      |

$\chi^2={(250-90)^2\over 90}+{(50-210)^2\over 210}+{(200-360)^2\over 360}+{(1000-840)^2\over 840}=507.93$

  • Correlation between the first variable (whether a person plays chess) and the second variable (whether they like science fiction)

  • $\chi^2$ (chi-squared) calculation (numbers in parentheses are expected counts calculated based on the data distribution in the two categories)

  • The chi-square distribution depends on the degrees of freedom $k$.

  • It shows that like_science_fiction and play_chess are correlated in the group

  • $p(\chi^2>507.93) \approx 0$
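
The same calculation can be checked with SciPy; `correction=False` disables the Yates continuity correction so the result matches the hand calculation above.

```python
import numpy as np
from scipy.stats import chi2_contingency

# observed contingency table from the example
observed = np.array([[250, 200],    # like science fiction
                     [50, 1000]])   # not like science fiction

chi2, p, dof, expected = chi2_contingency(observed, correction=False)
print(chi2)      # ~507.93
print(p)         # ~0 -> the two attributes are strongly correlated
print(expected)  # [[ 90. 360.], [210. 840.]]
```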

Correlation Analysis (Numeric Data)

: Correlation coefficient (also called Pearson’s product moment coefficient)

  • $r_{A,B} = \frac{\sum_{i=1}^{n}(a_i - \bar A)(b_i - \bar B)}{(n-1)\sigma_A \sigma_B} = \frac{\sum_{i=1}^{n}(a_ib_i) - n\bar A\bar B}{(n-1)\sigma_A \sigma_B}$

    • $n$ = the number of tuples,

    • $\bar A$ and $\bar B$ = the respective means of $A$ and $B$

    • $\sigma_A$ and $\sigma_B$ = the respective standard deviations of $A$ and $B$

    • $\sum(a_ib_i)$ = the sum of the $AB$ cross-products.

  • If $r_{A,B} > 0$, $A$ and $B$ are positively correlated ($A$’s values increase as $B$’s). The higher, the stronger correlation.

  • $r_{A,B} = 0$: uncorrelated (no linear relationship; this does not by itself imply independence)

  • $r_{A,B} < 0$: negatively correlated
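
A small NumPy sketch of the formula above on two hypothetical attributes $A$ and $B$; `np.corrcoef` gives the same value.

```python
import numpy as np

# hypothetical attribute values
A = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
B = np.array([2.1, 3.9, 6.2, 8.0, 9.8])

# sample correlation with (n - 1) in the denominator, as in the formula above
n = len(A)
r_manual = ((A - A.mean()) * (B - B.mean())).sum() / ((n - 1) * A.std(ddof=1) * B.std(ddof=1))
r_numpy = np.corrcoef(A, B)[0, 1]
print(r_manual, r_numpy)   # both ~0.999 -> strong positive correlation
```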

Visually Evaluating Correlation


Covariance (Numeric Data)

: Covariance is similar to correlation

  • $Cov(A,B) = E((A - \bar A)(B - \bar B)) = \frac {\sum_{i=1}^{n}(a_i-\bar A)(b_i - \bar B)}{n}$

  • $Cov(A,B) = E(A \cdot B) - \bar A \bar B \qquad r_{A,B} = \frac{Cov(A, B)}{\sigma_A \sigma_B}$

  • Positive covariance
    : If $Cov(A,B) > 0$, then $A$ and $B$ both tend to be larger than their expected values.
  • Negative covariance
    : If $Cov(A,B) < 0$ then if $A$ is larger than its expected value, $B$ is likely to be smaller than its expected value.
  • Independence
    : If $A$ and $B$ are independent, then $Cov(A,B) = 0$; but the converse is not true

Covariance: An Example

  • Suppose two stocks $A$ and $B$ have the following values in one week: (2, 5), (3, 8), (5, 10), (4, 11), (6, 14).

  • Question: If the stocks are affected by the same industry trends, will their prices rise or fall together?

    • $E(A)=(2+3+5+4+6)/5=20/5=4$

    • $E(B)=(5+8+10+11+14)/5=48/5=9.6$

    • $Cov(A,B) = (2×5+3×8+5×10+4×11+6×14)/5 − 4 × 9.6 = 4$

  • Thus, $A$ and $B$ rise together since $Cov(A, B) > 0$.
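
A quick NumPy check of this worked example, using the population covariance (dividing by $n$) as in the formula above.

```python
import numpy as np

A = np.array([2.0, 3.0, 5.0, 4.0, 6.0])     # stock A prices
B = np.array([5.0, 8.0, 10.0, 11.0, 14.0])  # stock B prices

# Cov(A, B) = E(A*B) - E(A)E(B), dividing by n
cov_ab = np.mean(A * B) - A.mean() * B.mean()
print(cov_ab)                          # 4.0 -> the stocks tend to rise together

# np.cov divides by (n - 1) by default; bias=True divides by n instead
print(np.cov(A, B, bias=True)[0, 1])   # also 4.0
```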

Data Reduction

Data Reduction 1: Dimensionality Reduction

  • Curse of dimensionality

    • When dimensionality increases, data becomes increasingly sparse

    • Density and distance between points, which are critical to clustering and outlier analysis, become less meaningful

  • Dimensionality reduction

    • Avoid the curse of dimensionality

    • Help eliminate irrelevant features and reduce noise

    • Reduce time and space required in data mining

    • Allow easier visualization

  • Dimensionality reduction techniques

    • Wavelet transforms

    • Principal Component Analysis

Attribute Subset Selection

  • One way to reduce dimensionality of data

  • Remove redundant attributes

    • Duplicate much or all of the information contained in one or more other attributes

    • e.g., purchase price of a product and the amount of sales tax paid

  • Remove irrelevant attributes

    • Contain no information that is useful for the data mining task at hand

    • e.g., students’ ID is often irrelevant to the task of predicting students’ GPA

Data Reduction 2: Numerosity Reduction

  • Clustering

    • Partition data set into clusters based on similarity, and store cluster representation (e.g., centroid and diameter) only

    • Can be very effective if data is clustered but not if data is “smeared”

    • Can have hierarchical clustering and be stored in multi-dimensional index tree structures

  • Sampling

    • Obtaining a small sample $s$ to represent the whole data set $N$

    • Sampling is typically used in data mining when processing the entire set of data of interest is too expensive or time consuming.

    • Key principles for effective sampling are:

      • A sample is representative if it has approximately the same properties (of interest) as the original data set.

      • If a sample is representative, using the sample will work almost as well as using the entire data set.

Types of Sampling

  • Simple Random Sampling

    • There is an equal probability of selecting any particular object

    • Sampling without replacement

      • As each object is selected, it is removed from the population

      • The same object cannot be picked up more than once

    • Sampling with replacement

      • Objects are not removed from the population as they are selected for the sample

      • The same object can be picked up more than once

    • Simple random sampling may have very poor performance in the presence of skew

  • Stratified Sampling

    • Split the data into several partitions; then draw a random sample from each partition

Sampling: With or without Replacement

Sampling: Stratified Sampling
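
A minimal pandas sketch of the three sampling schemes on a hypothetical, skewed data set (the column names `cls` and `value` are made up).

```python
import pandas as pd

# hypothetical data set with a skewed class attribute (90 "A"s, 10 "B"s)
df = pd.DataFrame({
    "cls": ["A"] * 90 + ["B"] * 10,
    "value": range(100),
})

# simple random sampling without replacement: each row can be picked at most once
srs_without = df.sample(n=20, replace=False, random_state=0)

# simple random sampling with replacement: the same row may be picked several times
srs_with = df.sample(n=20, replace=True, random_state=0)

# stratified sampling: partition by "cls" and draw 20% from each stratum,
# so the rare class "B" is still represented proportionally
stratified = df.groupby("cls").sample(frac=0.2, random_state=0)
print(stratified["cls"].value_counts())  # A: 18, B: 2
```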

Data Transformation

  • A function that maps the entire set of values of a given attribute to a new set of replacement values s.t. each old value can be identified with one of the new values

  • Methods

    • Smoothing: Remove noise from data (e.g. by applying averages, such as a moving average)

    • Attribute/feature construction

      -New attributes constructed from the given ones

    • Simple function: $x^k$, $\log(1+x)$, $e^x$, $\left \vert x \right \vert$, $\dots$

      -For financial data, ratios matter more than raw values, so a $\log$ transform is usually applied.

    • Normalization: Scaled to fall within a smaller, specified range (put values into a fixed range so that all data are handled in a consistent form)

      • min-max normalization

      • z-score normalization

      • normalization by decimal scaling

    • Discretization: Concept hierarchy climbing (cut a continuous variable into intervals)

Normalization

  • Min-max normalization: to $newmin_A$, $newmax_A$

    • $v' = \frac {v-min_A}{max_A - min_A} (newmax_A - newmin_A) + newmin_A$

      • Subtract the attribute's minimum $min$ from the value $v$ and divide by $max - min$; every value then falls between 0 and 1.

      • Normalization maps the largest value to 1 and the smallest to 0 and linearly rescales everything in between.

    • Ex. Let income range 12,000 to 98,000 normalized to $[0.0, 1.0]$

    • Then 73,600 is mapped to $\frac{73600-12000}{98000-12000} (1 - 0) + 0 = 0.716$

  • Z-score normalization ($\mu$: mean, $\sigma$: standard deviation):

    • $v' = \frac{v - \mu_A}{\sigma_A}$

    • Ex. Let $\mu=54,000$ , $\sigma=16,000$ . Then, $\frac{73600-54000}{16000}=1.225$

=> e.g. standard scores: express a value in units of standard deviation to see how far above the mean it lies (in $\sigma$)
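
A small NumPy sketch reproducing both normalizations for the income example; the intermediate income values other than 12,000, 73,600, and 98,000 are made up.

```python
import numpy as np

income = np.array([12000.0, 47000.0, 54000.0, 73600.0, 98000.0])

# min-max normalization to [0.0, 1.0]
v_minmax = (income - income.min()) / (income.max() - income.min())

# z-score normalization using the slide's parameters (mu = 54,000, sigma = 16,000)
mu, sigma = 54000.0, 16000.0
v_zscore = (income - mu) / sigma

print(v_minmax)   # 73,600 maps to ~0.716
print(v_zscore)   # 73,600 maps to 1.225
```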

Discretization

  • Three types of attributes

    • Nominal — values from an unordered set, e.g., color, profession

    • Ordinal — values from an ordered set, e.g., military or academic rank

    • Numeric — real numbers, e.g., integer or real numbers

  • Discretization: Divide the range of a continuous attribute into intervals
    The key question is what criterion to use for the split

    • Interval labels can then be used to replace actual data values

    • Reduce data size by discretization

    • Supervised vs. unsupervised

    • Split (top-down) vs. merge (bottom-up)

    • Discretization can be performed recursively on an attribute

    • Prepare for further analysis, e.g., classification

    => A widely used method for simplifying data

Simple Discretization: Binning

  • Equal-width (distance) partitioning

    • Divides the range into N intervals of equal size: uniform grid

    • if $A$ and $B$ are the lowest and highest values of the attribute, the width of intervals will be: $W = (B - A)/N$

    • The most straightforward, but outliers may dominate presentation

    • Skewed data is not handled well

  • Equal-depth (frequency) partitioning

    • Divides the range into N intervals, each containing approximately same number of samples

    • Good data scaling

    • Managing categorical attributes can be tricky

Binning Methods for Data Smoothing

  • Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34

  • Partition into equal-frequency (equi-depth) bins:

    • Bin 1: 4, 8, 9, 15

    • Bin 2: 21, 21, 24, 25

    • Bin 3: 26, 28, 29, 34

  • Smoothing by bin means:

    • Bin 1: 9, 9, 9, 9

    • Bin 2: 23, 23, 23, 23

    • Bin 3: 29, 29, 29, 29

  • Smoothing by bin boundaries:

    • Bin 1: 4, 4, 4, 15

    • Bin 2: 21, 21, 25, 25

    • Bin 3: 26, 26, 26, 34
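
A short NumPy sketch reproducing this smoothing example; it assumes the data are already sorted and split into equal-frequency bins of four values each.

```python
import numpy as np

prices = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])
bins = prices.reshape(3, 4)   # three equal-frequency bins (data already sorted)

# smoothing by bin means: every value is replaced by its bin's mean
by_means = np.repeat(bins.mean(axis=1, keepdims=True), 4, axis=1)

# smoothing by bin boundaries: every value is replaced by the closer of the bin's min/max
lo = bins.min(axis=1, keepdims=True)
hi = bins.max(axis=1, keepdims=True)
by_bounds = np.where(bins - lo <= hi - bins, lo, hi)

print(np.round(by_means).astype(int))  # [[ 9  9  9  9] [23 23 23 23] [29 29 29 29]]
print(by_bounds)                       # [[ 4  4  4 15] [21 21 25 25] [26 26 26 34]]
```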

Binarization (One-hot-encoding)

  • Binarization maps a continuous or categorical attribute into one or more binary attributes

    • Often convert a continuous attribute to a categorical attribute and then convert the categorical attribute to a set of binary attributes

    • Typically used for association analysis
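
A minimal pandas sketch of this two-step conversion (the `age` attribute and its bin edges are hypothetical).

```python
import pandas as pd

# hypothetical continuous attribute
age = pd.Series([15, 23, 37, 45, 62], name="age")

# first discretize the continuous attribute into categories ...
age_cat = pd.cut(age, bins=[0, 20, 40, 100], labels=["young", "middle", "senior"])

# ... then binarize each category into its own 0/1 attribute (one-hot encoding)
onehot = pd.get_dummies(age_cat, prefix="age", dtype=int)
print(onehot)
#    age_young  age_middle  age_senior
# 0          1           0           0
# 1          0           1           0
# ...
```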

Data Exploration

Summary Statistics

  • Summary statistics are numbers that summarize properties of the data

    • For categorical attributes

      • Frequency

      • Mode

    • For continuous attributes

      • Percentiles

      • Measures of Location: Mean and Median

      • Measures of Spread: Range and Variance
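
A short pandas sketch computing these summary statistics on a made-up table (the `profession` and `income` columns are hypothetical).

```python
import pandas as pd

df = pd.DataFrame({
    "profession": ["nurse", "engineer", "nurse", "teacher", "nurse"],  # categorical
    "income": [42000, 71000, 45000, 39000, 47000],                     # continuous
})

# categorical attribute: frequency and mode
print(df["profession"].value_counts())           # frequency of each value
print(df["profession"].mode()[0])                # most frequent value: "nurse"

# continuous attribute: percentiles, location, and spread
print(df["income"].quantile([0.25, 0.5, 0.75]))  # percentiles
print(df["income"].mean(), df["income"].median())                   # location
print(df["income"].max() - df["income"].min(), df["income"].var())  # range, variance
```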

Histograms

  • Used to plot the distribution of the values of a single attribute

    • It divides the values into bins and shows a bar plot of the number of objects in each bin

    • The height of each bar indicates the number of objects

    • The shape of the histogram depends on the number of bins (choosing the number of bins, i.e. how finely the $x$-axis is divided, matters)

  • Two-Dimensional Histograms

    • Used to plot the joint distribution of the values of two attributes
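
A minimal matplotlib sketch showing how a histogram's shape changes with the number of bins; the data are synthetic.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
values = rng.normal(loc=50, scale=10, size=1000)   # a single synthetic attribute

# same data, different numbers of bins: the shape of the histogram changes
fig, axes = plt.subplots(1, 2, figsize=(8, 3))
axes[0].hist(values, bins=10)
axes[0].set_title("10 bins")
axes[1].hist(values, bins=50)
axes[1].set_title("50 bins")
plt.show()

# two-dimensional histogram of the joint distribution of two attributes
other = values + rng.normal(scale=5, size=1000)    # a second, correlated attribute
plt.hist2d(values, other, bins=20)
plt.show()
```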

Box Plots

  • Another way of displaying the distribution of data

  • Box plots can be used to compare attributes

  • 50th percentile: the median

  • 25th and 75th percentiles: the lower and upper quartiles

Scatter Plots

  • Used to plot the pairwise correlations between attributes (how two variables relate to each other)

    • Each data object is depicted as a marker.

    • The position of a marker is determined by the values of its attributes.

    • While two-dimensional scatter plots are most common, three-dimensional scatter plots can also be utilized.

    • Often, additional attributes can be displayed by using the size, shape, and color of the markers that represent the objects

  • Additional attributes can be incorporated into scatter plots through:

    • Marker Size: Representing an attribute using the size of markers.

    • Marker Shape: Differentiating markers based on an additional attribute.

    • Marker Color: Using color variation to convey information about another attribute.

[Figure: pairwise scatter plots (pair plot) of the Iris dataset attributes]

  • petal_length and petal_width have a linear relationship (the values are roughly proportional).

  • sepal_width and sepal_length show no clear relationship.
  • sepal_width와 sepal_length는 관련이 없다.