3-2. [Supervised Learning] Decision Trees (1R/RIPPER)

김초송 2023. 3. 21. 16:32

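Representative decision tree algorithms

CART (Classification and Regression Trees)
C4.5 / C5.0
: decision trees; the rule-based algorithms covered in this post (the 1R algorithm and the JRip algorithm) are grouped with this family
CHAID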

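Rule-based algorithms

Algorithms implemented using information gain (RIPPER grows its rules with an information-gain criterion; 1R simply picks the single feature with the lowest error rate)
Information gain = entropy before the split - entropy after the split
Entropy: the impurity (degree of disorder) of a data set
If a data set mixes records of different classes, its entropy is high
If its records mostly belong to the same class, its entropy is low
Source: https://eehoeskrap.tistory.com/13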
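
The definitions above can be checked with a few lines of Python. This is a minimal sketch, not code from the course: the function names `entropy` and `information_gain` and the toy labels are assumptions made for illustration, and the entropy after the split is taken as the size-weighted average over the child groups.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(parent, children):
    """Entropy before the split minus the size-weighted entropy after it."""
    total = len(parent)
    after = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - after

# Made-up toy labels: 4 edible and 4 poisonous mushrooms.
parent = ["edible"] * 4 + ["poisonous"] * 4
# A hypothetical split that separates the two classes perfectly.
pure_split = [["edible"] * 4, ["poisonous"] * 4]
# A hypothetical split that leaves both halves fully mixed.
mixed_split = [["edible", "edible", "poisonous", "poisonous"]] * 2

print(entropy(parent))                         # 1.0
print(information_gain(parent, pure_split))    # 1.0
print(information_gain(parent, mixed_split))   # 0.0
```

A split that produces pure groups recovers all of the parent's entropy as gain, while a split that leaves the groups as mixed as before gains nothing.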

	Categorical / discrete target	Continuous target
CHAID	Chi-squared statistic	ANOVA F-statistic
CART	Gini index	Variance reduction
C4.5 / C5.0	Entropy index
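
For comparison with entropy, the two CART criteria in the table can be sketched as follows. The helper names and the toy values are assumptions for illustration; the Gini index is computed as one minus the sum of squared class proportions, and variance reduction as the parent variance minus the size-weighted child variances.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())

def variance(values):
    """Population variance of a list of numbers."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def variance_reduction(parent, children):
    """Parent variance minus the size-weighted variance of the child groups."""
    total = len(parent)
    return variance(parent) - sum(len(c) / total * variance(c) for c in children)

# Made-up toy data.
print(gini(["edible", "edible", "poisonous", "poisonous"]))  # 0.5
print(gini(["edible"] * 4))                                  # 0.0
target = [1.0, 1.0, 3.0, 3.0]
print(variance_reduction(target, [[1.0, 1.0], [3.0, 3.0]]))  # 1.0
```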

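1R algorithm
Classifies the data using a single fact (one feature)
Simple, but makes many errors
RIPPER algorithm
Classifies the data using several facts (conditions)
The machine examines the data, finds the patterns, and organizes them into rules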
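
To make the 1R idea concrete, here is a minimal sketch in Python. It is not the OneR package's implementation: the function `one_r` and the toy records are assumptions for illustration. For each feature it builds one rule per value (predict the majority class for that value) and keeps the feature whose rules make the fewest training errors.

```python
from collections import Counter, defaultdict

def one_r(rows, target):
    """Return (best_feature, rules, errors) for a list of dict records."""
    best = None
    features = [key for key in rows[0] if key != target]
    for feature in features:
        # Count class labels for every value of this feature.
        counts = defaultdict(Counter)
        for row in rows:
            counts[row[feature]][row[target]] += 1
        # One rule per value: predict that value's majority class.
        rules = {value: c.most_common(1)[0][0] for value, c in counts.items()}
        # Training errors: records outside their value's majority class.
        errors = sum(sum(c.values()) - c.most_common(1)[0][1] for c in counts.values())
        if best is None or errors < best[2]:
            best = (feature, rules, errors)
    return best

# Made-up toy records; the column names mirror the mushroom data, the values are invented.
rows = [
    {"odor": "foul", "cap_color": "brown", "type": "poisonous"},
    {"odor": "foul", "cap_color": "white", "type": "poisonous"},
    {"odor": "none", "cap_color": "brown", "type": "edible"},
    {"odor": "none", "cap_color": "white", "type": "edible"},
    {"odor": "almond", "cap_color": "brown", "type": "edible"},
]
feature, rules, errors = one_r(rows, "type")
print(feature, errors)  # odor 0
print(rules)
```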

Implementing rule-based algorithms in R

# 1. Load the data
mush <- read.csv("C:/Data/mushrooms.csv", stringsAsFactors = TRUE)
head(mush)
nrow(mush)

# 2. Check for missing values
colSums(is.na(mush))

# 3. Split into training and test data
install.packages("caret", lib=.libPaths()[1])
library(caret)

train_num <- createDataPartition(mush$type, p=0.8, list=FALSE)
train <- mush[train_num, ] # training data, 80%
test <- mush[-train_num, ] # test data, 20%

nrow(train)
nrow(test)

# 4. Min-max normalization (only for numeric data; skipped here because every column is categorical)

# 5. Create and train the model
install.packages("OneR", lib=.libPaths()[1])
library(OneR)

model <- OneR(type ~ ., data=train)
model

# Output
Call:
OneR.formula(formula = type ~ ., data = train)

Rules:
If odor = almond   then type = edible # almond odor -> edible
If odor = anise    then type = edible # anise odor -> edible
If odor = creosote then type = poisonous # creosote odor -> poisonous
If odor = fishy    then type = poisonous # fishy odor -> poisonous
If odor = foul     then type = poisonous
If odor = musty    then type = poisonous
If odor = none     then type = edible # no odor -> edible
If odor = pungent  then type = poisonous # pungent odor -> poisonous
If odor = spicy    then type = poisonous # spicy odor -> poisonous

Accuracy:
6406 of 6500 instances classified correctly (98.55%) # accuracy on the training data: 98.55 %

Usage: OneR(label column ~ ., data = training data)

# Hypothesis test
summary(model)
# Output
Contingency table:
           odor
type        almond anise creosote fishy   foul musty   none pungent spicy  Sum
  edible     * 304 * 324        0     0      0     0 * 2739       0     0 3367
  poisonous      0     0    * 151 * 462 * 1732  * 28     94   * 214 * 452 3133
  Sum          304   324      151   462   1732    28   2833     214   452 6500
---
Maximum in each column: '*'

Pearson's Chi-squared test: # chi-squared test
X-squared = 6136, df = 8, p-value < 2.2e-16  # chi-squared statistic, degrees of freedom, p-value

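Statistical data analysis -> hypothesis testing
Null hypothesis: odor cannot separate poisonous mushrooms from edible ones (odor and type are independent)
Alternative hypothesis: odor can separate poisonous mushrooms from edible ones (accepted when the p-value is below 0.05, i.e. the association is statistically significant). Here the p-value is below 2.2e-16, so the null hypothesis is rejected.
Machine learning data analysis -> evaluate the model's predictions on the held-out test data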
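
The chi-squared statistic reported by summary(model) can be reproduced by hand from the contingency table. The sketch below copies the counts from that table and applies the standard Pearson formula, summing (observed - expected)^2 / expected over all cells; the variable names are assumptions for illustration.

```python
# Counts copied from the contingency table printed by summary(model).
odors = ["almond", "anise", "creosote", "fishy", "foul", "musty", "none", "pungent", "spicy"]
edible = [304, 324, 0, 0, 0, 0, 2739, 0, 0]
poisonous = [0, 0, 151, 462, 1732, 28, 94, 214, 452]

table = [edible, poisonous]
row_sums = [sum(row) for row in table]
col_sums = [sum(col) for col in zip(*table)]
total = sum(row_sums)

chi2 = 0.0
for i, row in enumerate(table):
    for j, observed in enumerate(row):
        # Expected count if odor and type were independent.
        expected = row_sums[i] * col_sums[j] / total
        chi2 += (observed - expected) ** 2 / expected

# Degrees of freedom = (rows - 1) * (columns - 1).
df = (len(table) - 1) * (len(odors) - 1)
print(round(chi2), df)  # 6136 8
```

The result agrees with the X-squared = 6136 and df = 8 shown in the R output.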

# 6. Predict with the model
result <- predict(model, test[, -1]) # drop column 1, the type label
result

# 7. Evaluate the model
# 7-1. Check the accuracy
sum(result == test[, 1]) / nrow(test)

# 7-2. Check the false negatives (human lives are at stake)
install.packages("gmodels", lib=.libPaths()[1])
library(gmodels)
cross <- CrossTable(test[ , 1], result)
cross$t

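Class of interest: Positive = poisonous mushroom
True Positive: a poisonous mushroom correctly predicted as poisonous
True Negative: an edible mushroom correctly predicted as edible
False Positive: an edible mushroom predicted as poisonous
False Negative: a poisonous mushroom predicted as edible -> keeping this count low matters most
Precision: of the cases predicted as the class of interest, the proportion that really belong to it
= TP / (TP + FP)
Recall: of the cases that really belong to the class of interest, the proportion predicted as such
= TP / (TP + FN)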
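
As a worked example of the two formulas, the sketch below plugs in counts read off the OneR contingency table shown earlier. These are training-set counts, not test results: 3039 poisonous mushrooms fall under odors predicted as poisonous (TP), 94 odorless poisonous mushrooms are predicted as edible (FN), and no edible mushroom is predicted as poisonous (FP). The function names are assumptions for illustration.

```python
def precision(tp, fp):
    """Share of positive predictions that are truly positive."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Share of truly positive cases that were predicted positive."""
    return tp / (tp + fn)

# Training-set counts taken from the OneR contingency table (Positive = poisonous).
tp = 151 + 462 + 1732 + 28 + 214 + 452  # poisonous, predicted poisonous
fn = 94                                 # poisonous with odor = none, predicted edible
fp = 0                                  # edible, predicted poisonous

print(tp)                        # 3039
print(precision(tp, fp))         # 1.0
print(round(recall(tp, fn), 4))  # 0.97
```

Precision is perfect, but the recall shows that about 3% of the poisonous mushrooms would be labeled edible, which is exactly the error that matters most here.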

# 8. Improve the model
install.packages("RWeka", lib=.libPaths()[1])
library(RWeka)

model2 <- JRip(type ~ ., data=train)
model2

# Output
JRIP rules:
===========

(odor = foul) => type=poisonous (1732.0/0.0) # odor is foul -> poisonous
(gill_size = narrow) and (gill_color = buff) => type=poisonous (914.0/0.0) # gill size is narrow and gill color is buff -> poisonous
(gill_size = narrow) and (odor = pungent) => type=poisonous (214.0/0.0)
(odor = creosote) => type=poisonous (151.0/0.0)
(spore_print_color = green) => type=poisonous (59.0/0.0)
(stalk_surface_above_ring = silky) and (gill_spacing = close) => type=poisonous (51.0/0.0)
(habitat = leaves) and (gill_attachment = free) and (population = clustered) => type=poisonous (12.0/0.0)
 => type=edible (3367.0/0.0)

Number of Rules : 8

# Two-way cross table (confusion matrix on the training data)
summary(model2)

# Output
=== Summary ===

Correctly Classified Instances        6500              100      %
Incorrectly Classified Instances         0                0      %
Kappa statistic                          1     
Mean absolute error                      0     
Root mean squared error                  0     
Relative absolute error                  0      %
Root relative squared error              0      %
Total Number of Instances             6500     

=== Confusion Matrix ===

    a    b   <-- classified as
 3367    0 |    a = edible
    0 3133 |    b = poisonous

Well suited to categorical data (less so to continuous data such as the iris data set)
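
The JRIP rules printed earlier form an ordered decision list: a record is checked against each rule in turn, and if none matches it falls through to the default class (edible). The following sketch transcribes those rules into Python to show how such a list is applied. The data structure, the `classify` function, and the sample records are assumptions for illustration, not how RWeka stores or evaluates the model.

```python
# Rules transcribed from the JRIP output; each rule maps features to required values.
POISONOUS_RULES = [
    {"odor": "foul"},
    {"gill_size": "narrow", "gill_color": "buff"},
    {"gill_size": "narrow", "odor": "pungent"},
    {"odor": "creosote"},
    {"spore_print_color": "green"},
    {"stalk_surface_above_ring": "silky", "gill_spacing": "close"},
    {"habitat": "leaves", "gill_attachment": "free", "population": "clustered"},
]

def classify(mushroom):
    """Return 'poisonous' if any rule matches, otherwise the default class 'edible'."""
    for rule in POISONOUS_RULES:
        if all(mushroom.get(feature) == value for feature, value in rule.items()):
            return "poisonous"
    return "edible"

# Made-up records for illustration.
print(classify({"odor": "foul", "gill_size": "broad"}))                         # poisonous
print(classify({"odor": "none", "gill_size": "narrow", "gill_color": "buff"}))  # poisonous
print(classify({"odor": "almond", "gill_size": "broad"}))                       # edible
```

The second record is the kind of case 1R gets wrong: it has no odor, so the single odor rule calls it edible, while the extra gill conditions catch it.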

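Implementing rule-based algorithms in Python

Load the data
Check for missing values
Encode the data (convert it to numbers)
Convert the label column to 0 and 1
Split into training and test data
Create the model
Predict with the model
Evaluate the model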

import pandas as pd

mush = pd.read_csv("C:/Data/mushrooms.csv")
mush.head()

mush.isnull().sum() 

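# Separate the independent variables (features) from the dependent variable (label)
x = mush.iloc[ : , 1: ]
y = mush.iloc[ : , 0 ]
y

# One-hot encode the categorical features
x_dummy = pd.get_dummies(x)
x_dummy.shape

# Encode the label column as 0 and 1
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
y_prepro = encoder.fit_transform(y)

# Check the class of interest: classes_ lists the original labels in sorted order,
# so the edible label is encoded as 0 and the poisonous label as 1
print(encoder.classes_)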

from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x_dummy, y_prepro, test_size=0.2, random_state=1)
print(x_train.shape, x_test.shape, y_train.shape, y_test.shape)

import wittgenstein as lw

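model = lw.RIPPER(random_state=1)
model.fit(x_train, y_train, pos_class=1) # 1 = poisonous, the class of interest
predict = model.predict(x_test) # booleans: True means the positive class

# Accuracy (%)
sum(predict == y_test) / len(y_test) * 100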
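
The R section checks the false negatives with CrossTable, while the Python code above stops at accuracy. A minimal sketch of the same check in plain Python follows. The function `confusion_counts` and the toy labels are assumptions for illustration; in practice `y_test` and `predict` from the code above would be passed in instead.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Tally TP, FP, FN, TN, treating `positive` as the class of interest."""
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for actual, predicted in zip(y_true, y_pred):
        actual_pos = actual == positive
        # True == 1 in Python, so boolean predictions compare correctly with 0/1 labels.
        predicted_pos = predicted == positive
        if actual_pos and predicted_pos:
            counts["TP"] += 1
        elif predicted_pos:
            counts["FP"] += 1
        elif actual_pos:
            counts["FN"] += 1
        else:
            counts["TN"] += 1
    return counts

# Made-up labels: 1 = poisonous, 0 = edible; predictions given as booleans.
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [True, True, False, False, False, True]
counts = confusion_counts(y_true, y_pred)
print(counts)  # {'TP': 2, 'FP': 1, 'FN': 1, 'TN': 2}
```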

This post was written while taking the 아이티윌 (ITWILL) 'Big Data & Machine Learning Expert Training Course'.
