4-2. [Supervised Learning] Regression Analysis (2)

김초송 2023. 3. 23. 14:30

Multicollinearity Experiment

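Multicollinearity
: A phenomenon in which independent variables are strongly correlated with one another, so the apparent effect of a particular independent variable on the dependent variable is relatively reduced (its coefficient estimate becomes unstable and its standard error inflates).
Check for multicollinearity before running a regression analysis.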

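1. Install the package that provides the VIF (variance inflation factor) check
2. Load the data
3. Check the correlation coefficients among the independent variables
4. Build the regression model
5. Check for multicollinearity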

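# Install and load the car package, which provides vif()
install.packages("car", lib = .libPaths()[1])
library(car)

# Load the data (the file is EUC-KR encoded)
test2 <- read.csv("C:/Data/test_vif2.csv", fileEncoding = 'euc-kr')
test2

# Correlation matrix of the independent variables:
# 아이큐 (IQ), 공부시간 (study hours), 등급평균 (grade average)
cor(test2[, c("아이큐", "공부시간", "등급평균")])

# Regress 시험점수 (exam score) on all three independent variables
model <- lm(시험점수 ~ 아이큐 + 공부시간 + 등급평균, data = test2)
summary(model)

# Flag the variables whose VIF exceeds 10
vif(model) > 10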

# cor result: 아이큐 (IQ) and 등급평균 (grade average) are strongly correlated (0.97)
            아이큐  공부시간  등급평균
아이큐   1.0000000 0.7710712 0.9736894
공부시간 0.7710712 1.0000000 0.7300546
등급평균 0.9736894 0.7300546 1.0000000

# Result of summary(model)
Call:
lm(formula = 시험점수 ~ 아이큐 + 공부시간 + 등급평균, 
    data = test2)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.3146 -1.2184 -0.4266  1.5516  5.6358 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept) 50.30669   35.70317   1.409   0.2085  
아이큐       0.05875    0.55872   0.105   0.9197  
공부시간     0.48876    0.17719   2.758   0.0329 *
등급평균     7.37578    8.63161   0.855   0.4256  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.952 on 6 degrees of freedom
Multiple R-squared:  0.9155,	Adjusted R-squared:  0.8733 
F-statistic: 21.68 on 3 and 6 DF,  p-value: 0.001275

# VIF result (TRUE means the VIF exceeds 10)
  아이큐 공부시간 등급평균 
    TRUE    FALSE     TRUE

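아이큐 (IQ) and 등급평균 (grade average), the two strongly correlated independent variables, both have high p-values (0.9197 and 0.4256), so neither is statistically significant in this model.
The regression should therefore be run separately, with only one of the two correlated variables in each model.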

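Variables with a VIF (variance inflation factor) greater than 10 are usually the ones flagged.
Strict threshold -> 5 or higher
Lenient threshold -> 15 to 20 or higher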
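
To make these thresholds concrete: the VIF of variable j is 1 / (1 - R_j^2), where R_j^2 is the R-squared obtained by regressing variable j on the remaining independent variables. Below is a minimal Python sketch of that definition using NumPy. The helper name `vif` and the synthetic columns `iq`, `grade`, and `hours` are assumptions made for illustration; they are not the values in test_vif2.csv.

```python
import numpy as np

def vif(X):
    """Return the VIF of each column of X (shape: n_samples x n_features)."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    out = []
    for j in range(k):
        y = X[:, j]                                 # treat column j as the response
        others = np.delete(X, j, axis=1)            # all remaining columns
        A = np.column_stack([np.ones(n), others])   # add an intercept term
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

# Synthetic data (hypothetical): grade is almost a linear function of iq
rng = np.random.default_rng(0)
iq = rng.normal(100, 15, size=200)
grade = 0.05 * iq + rng.normal(0, 0.1, size=200)
hours = rng.normal(10, 3, size=200)                 # unrelated to the other two

vifs = vif(np.column_stack([iq, grade, hours]))
flagged = vifs > 10                                 # same rule as vif(model) > 10 in R
print(vifs, flagged)
```

In this sketch the two nearly collinear columns are flagged and the independent one is not, which is the same pattern as the TRUE / FALSE / TRUE result from R.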

model5 <- lm(시험점수 ~ 아이큐 + 공부시간, data = test2)
summary(model5)

model6 <- lm(시험점수 ~ 등급평균 + 공부시간, data = test2)
summary(model6)

# model5
Call:
lm(formula = 시험점수 ~ 아이큐 + 공부시간, data = test2)

Residuals:
   Min     1Q Median     3Q    Max 
-6.341 -1.130 -0.191  1.450  5.542 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)  23.1561    15.9672   1.450   0.1903  
아이큐        0.5094     0.1808   2.818   0.0259 *
공부시간      0.4671     0.1720   2.717   0.0299 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.875 on 7 degrees of freedom
Multiple R-squared:  0.9053,	Adjusted R-squared:  0.8782 
F-statistic: 33.45 on 2 and 7 DF,  p-value: 0.0002617

# model6
Call:
lm(formula = 시험점수 ~ 등급평균 + 공부시간, data = test2)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.3179 -1.2020 -0.5051  1.3484  5.6317 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  54.0194     4.9092  11.004 1.14e-05 ***
등급평균      8.2326     2.6398   3.119   0.0169 *  
공부시간      0.4960     0.1514   3.275   0.0136 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.662 on 7 degrees of freedom
Multiple R-squared:  0.9154,	Adjusted R-squared:  0.8912 
F-statistic: 37.87 on 2 and 7 DF,  p-value: 0.0001762

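When the correlated variables are fitted in separate models, 아이큐 (IQ), 등급평균 (grade average), and 공부시간 (study hours) all turn out to be statistically significant independent variables.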

Implementing Multiple Regression in Python

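1. Load the data
2. Check for missing values
3. Analyze the correlations
4. Build the multiple regression model
5. Train the model
6. Interpret the results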

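import pandas as pd

# 1. Load the data
insurance = pd.read_csv("C:/Data/graph/insurance.csv")

# 2. Check for missing values
insurance.isnull().sum()

# 3. Correlation analysis (numeric columns only; sex, smoker, and region are strings)
import seaborn as sns

sns.heatmap(insurance.corr(numeric_only=True), annot=True, cmap='Blues', linewidths=0.2)

# 4. Build the multiple regression model
import statsmodels.formula.api as smf # the summary output resembles R's

model = smf.ols(formula='expenses ~ age + sex + bmi + children + smoker + region',
                data=insurance)
                
# 5. Train the model
result = model.fit()

# 6. Interpret the results
result.summary()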

OLS Regression Results

Dep. Variable:	expenses	R-squared:	0.751
Model:	OLS	Adj. R-squared:	0.749
Method:	Least Squares	F-statistic:	500.9
Date:	Thu, 23 Mar 2023	Prob (F-statistic):	0.00
Time:	11:31:40	Log-Likelihood:	-13548.
No. Observations:	1338	AIC:	2.711e+04
Df Residuals:	1329	BIC:	2.716e+04
Df Model:	8
Covariance Type:	nonrobust
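
The formula has six terms, yet the summary reports Df Model: 8. That is because the formula interface dummy-codes each string column, keeping one level as the reference: sex and smoker contribute one indicator each and the four-level region contributes three, on top of the three numeric columns. The count can be illustrated with `pd.get_dummies(..., drop_first=True)`; the rows below are hypothetical and only need to cover every category level.

```python
import pandas as pd

# Hypothetical rows covering every category level; not taken from insurance.csv
df = pd.DataFrame({
    "age": [25, 40, 31, 52],
    "sex": ["female", "male", "male", "female"],
    "bmi": [24.0, 31.5, 29.0, 35.2],
    "children": [0, 2, 1, 3],
    "smoker": ["yes", "no", "no", "yes"],
    "region": ["southwest", "southeast", "northwest", "northeast"],
})

# drop_first=True keeps k - 1 indicator columns for a k-level category,
# mirroring the reference-level coding used by the formula interface
encoded = pd.get_dummies(df, drop_first=True)
n_terms = encoded.shape[1]   # 3 numeric + 1 (sex) + 1 (smoker) + 3 (region)
print(list(encoded.columns), n_terms)
```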

Coefficient of determination (R-squared): 0.75 -> create derived variables to try to raise it to 0.9 or higher

# Derived variable 1: bmi30 is 1 when bmi is 30 or higher, otherwise 0
insurance['bmi30'] = (insurance['bmi'] >= 30).astype(int)

model2 = smf.ols(formula='expenses ~ age + sex + bmi + children + smoker + region + bmi30',
                data=insurance)
result2 = model2.fit()
result2.summary()

# Derived variable 2: 1 when bmi30 is 1 and the person is a smoker, otherwise 0
insurance['bmi30_smokeryes'] = insurance.apply(lambda x : 
                                               1 if x['bmi30'] == 1 and x['smoker']=='yes' 
                                               else 0, axis=1)
                                               
model3 = smf.ols(formula='expenses ~ age + sex + bmi + children + smoker + region + bmi30 + bmi30_smokeryes',
                data=insurance)
result3 = model3.fit()
result3.summary()

axis=1: the lambda function is applied to each row
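
As a side note, the row-wise `apply(..., axis=1)` can also be written with vectorized column comparisons, which avoids calling a Python function once per row. The sketch below compares the two on a small hypothetical DataFrame (the rows are made up, not taken from insurance.csv). The cutoff of 30 for `bmi30` is an assumption taken from the `x['bmi'] >= 30` condition used in the sklearn section of this post.

```python
import pandas as pd

# Hypothetical rows, not taken from insurance.csv
df = pd.DataFrame({
    "bmi": [27.9, 33.8, 30.0, 22.7],
    "smoker": ["yes", "no", "yes", "no"],
})

# Row-wise version: the lambda receives one row at a time
row_wise = df.apply(
    lambda x: 1 if x["bmi"] >= 30 and x["smoker"] == "yes" else 0, axis=1
)

# Vectorized version: boolean masks over whole columns, cast to 0/1
df["bmi30"] = (df["bmi"] >= 30).astype(int)
df["bmi30_smokeryes"] = ((df["bmi"] >= 30) & (df["smoker"] == "yes")).astype(int)

print(df)
print(row_wise.tolist() == df["bmi30_smokeryes"].tolist())
```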

OLS Regression Results

Dep. Variable:	expenses	R-squared:	0.864
Model:	OLS	Adj. R-squared:	0.863
Method:	Least Squares	F-statistic:	842.1
Date:	Thu, 23 Mar 2023	Prob (F-statistic):	0.00
Time:	13:52:53	Log-Likelihood:	-13144.
No. Observations:	1338	AIC:	2.631e+04
Df Residuals:	1327	BIC:	2.637e+04
Df Model:	10
Covariance Type:	nonrobust
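
The adjusted R-squared in this summary can be checked against the other reported figures with the standard formula 1 - (1 - R^2)(n - 1) / (n - p - 1), where n is the number of observations and p is Df Model. This is a small sketch; the function name `adjusted_r2` is hypothetical, and because the input R-squared is itself rounded to three digits the result is only approximate.

```python
def adjusted_r2(r2, n_obs, n_predictors):
    """Adjusted R-squared from R-squared, sample size, and number of predictors."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)

# Values read from the summary: R-squared 0.864, 1338 observations, Df Model 10
adj = adjusted_r2(0.864, 1338, 10)
print(round(adj, 3))
```

With 1338 observations the penalty for ten predictors is small, which is why the adjusted value stays so close to the plain R-squared.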

Building a Multiple Regression Model with sklearn

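1. Load the data
2. Add the derived variable
3. Set the independent and dependent variables
4. Split into training and test data
5. Build the multiple regression model
6. Check the coefficient of determination
7. Predict on the test data
8. Evaluate the model
- Classification: accuracy
- Regression: correlation coefficient and mean squared error (MSE)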

import pandas as pd

insurance = pd.read_csv("C:/Data/graph/insurance.csv")
insurance.head()

insurance['bmi30_smokeryes'] = insurance.apply(lambda x : 
                                               1 if x['bmi'] >= 30 and x['smoker']=='yes' 
                                               else 0, axis=1)
                                               
col_name = list(insurance.columns)
col_name.remove('expenses')
x = pd.get_dummies(insurance[col_name]) # independent variables (string columns one-hot encoded)
y = insurance['expenses'] # dependent variable

from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.1, random_state=1)
print(x_train.shape, x_test.shape, y_train.shape, y_test.shape)

from sklearn.linear_model import LinearRegression

model1 = LinearRegression()
model1.fit(x_train, y_train)

model1.score(x_train, y_train) # coefficient of determination (R-squared) on the training data

result = model1.predict(x_test)

# MSE (mean squared error)
from sklearn import metrics

metrics.mean_squared_error(y_test, result)

# Correlation coefficient between actual and predicted values
from scipy.stats import pearsonr

corr, _ = pearsonr(y_test, result)
corr

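Mean squared error: $\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$, where $y_i$ is the actual value and $\hat{y}_i$ is the predicted value; the smaller, the better
Correlation coefficient: the closer to 1, the better the model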
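
Both evaluation metrics are simple enough to compute by hand, which helps when checking what the library calls return. The following is a minimal NumPy sketch on hypothetical numbers (the arrays `y_true` and `y_pred` are made up and are not results from the insurance model). MSE is the mean of the squared differences, RMSE is its square root, and the Pearson correlation is read from the off-diagonal of `np.corrcoef`.

```python
import numpy as np

y_true = np.array([3.0, 5.0, 7.0, 9.0])   # hypothetical actual values
y_pred = np.array([2.0, 5.0, 8.0, 9.0])   # hypothetical predictions

mse = np.mean((y_true - y_pred) ** 2)      # mean of squared errors
rmse = np.sqrt(mse)                        # same units as the target
corr = np.corrcoef(y_true, y_pred)[0, 1]   # Pearson correlation coefficient

print(mse, rmse, corr)
```

RMSE is often reported alongside MSE because it is on the same scale as the dependent variable, here the insurance expenses.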

This post was written while taking 아이티윌's 'Big Data & Machine Learning Expert Training Course'.
