
[Big Data Analysis Engineer] Practical Exam Type 2, Practice Problem #2

EveningPrimrose 2023. 6. 14. 01:42

Data source: https://www.kaggle.com/datasets/arashnic/hr-analytics-job-change-of-data-scientists

 


import pandas as pd

x_train = pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/HRdata/X_train.csv")
y_train = pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/HRdata/y_train.csv")
x_test= pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/HRdata/X_test.csv")

display(x_train.head())
display(y_train.head())

# print(x_train.info())
# print(x_train.nunique())
# print(x_train.isnull().sum())
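# There are missing values, but they are not handled separately; the data is dummy-encoded as is

# Drop the ID column and the categorical columns that have a fairly large number of unique values.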
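Leaving missing values untreated works here because of how `pd.get_dummies` treats `NaN` in a categorical column. A minimal sketch on toy data (the values below are made up for illustration, not taken from the HR dataset): by default a missing entry simply gets 0 in every dummy column, and `dummy_na=True` adds an explicit indicator column instead.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with one missing entry in a categorical column
toy = pd.DataFrame({
    "gender": ["Male", np.nan, "Female"],
    "training_hours": [10, 20, 30],
})

# Default behaviour: the NaN row is 0/False in every gender_* column
default_dummies = pd.get_dummies(toy)

# dummy_na=True adds a separate indicator column for missing values
with_na_dummies = pd.get_dummies(toy, dummy_na=True)

print(default_dummies.columns.tolist())
print(with_na_dummies.columns.tolist())
```

Whether the extra indicator column helps depends on the data, so it is worth comparing both variants on the validation split.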
drop_col = ['enrollee_id', 'city', 'company_type', 'experience']
x_train_drop = x_train.drop(columns = drop_col)
x_test_drop = x_test.drop(columns = drop_col)

# import sklearn
# print(sklearn.__all__)
# import sklearn.model_selection
# print(dir(sklearn.model_selection))

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
x_train_dummies = pd.get_dummies(x_train_drop)
y = y_train['target'].astype('int')

x_test_dummies = pd.get_dummies(x_test_drop)
# Match the column order of train (get_dummies produces the columns in sorted order, so if this line raises an error, a column present in train is missing from test)
x_test_dummies = x_test_dummies[x_train_dummies.columns]
# print(help(train_test_split))
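If the column selection above fails because a category appears only in train, one possible workaround is `DataFrame.reindex`. The sketch below uses toy data with hypothetical category values; it is an alternative to the indexing used in this post, not part of the original solution.

```python
import pandas as pd

# Toy data (hypothetical): test lacks the "PhD" level and has an unseen "Primary" level
train_toy = pd.DataFrame({"education_level": ["Graduate", "Masters", "PhD"]})
test_toy = pd.DataFrame({"education_level": ["Graduate", "Primary"]})

train_dummies = pd.get_dummies(train_toy)
test_dummies = pd.get_dummies(test_toy)

# reindex keeps the train column order, fills columns absent from test with 0,
# and drops columns that only appear in test
test_aligned = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)

print(test_aligned.columns.tolist())
```

Plain indexing with `test_dummies[train_dummies.columns]` would raise a `KeyError` on this toy data, which is the error the comment above refers to.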

X_train, x_validation, Y_train, Y_validation = train_test_split(x_train_dummies, y, test_size = 0.33, random_state = 42)
rf = RandomForestClassifier(random_state = 23)
rf.fit(X_train, Y_train)

# import sklearn.metrics
# print(dir(sklearn.metrics))

from sklearn.metrics import accuracy_score, f1_score, recall_score, precision_score, roc_auc_score

# model_score
# Predicted labels and positive-class probabilities for the train and validation splits
predict_train_label = rf.predict(X_train)
predict_train_proba = rf.predict_proba(X_train)[:, 1]
predict_validation_label = rf.predict(x_validation)
predict_validation_proba = rf.predict_proba(x_validation)[:, 1]

# Check model performance with whichever metric the problem asks for
# accuracy, f1_score, recall, precision -> get the results from model.predict
# auc, or any wording about "probability" -> get the results from model.predict_proba and take the second column (index 1, the positive class): model.predict_proba()[:, 1]
print('train accuracy :', accuracy_score(Y_train, predict_train_label))
print('validation accuracy :', accuracy_score(Y_validation, predict_validation_label))
print('\n')
print('train f1_score :', f1_score(Y_train, predict_train_label))
print('validation f1_score :', f1_score(Y_validation, predict_validation_label))
print('\n')
print('train recall_score :', recall_score(Y_train, predict_train_label))
print('validation recall_score :', recall_score(Y_validation, predict_validation_label))
print('\n')
print('train precision_score :', precision_score(Y_train, predict_train_label))
print('validation precision_score :', precision_score(Y_validation, predict_validation_label))
print('\n')
print('train auc :', roc_auc_score(Y_train, predict_train_proba))
print('validation auc :', roc_auc_score(Y_validation, predict_validation_proba))
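To see why index 1 is the column to pass to `roc_auc_score`, it helps to look at the shape of `predict_proba` next to `classes_`. This is a small sketch on made-up toy data, not the HR dataset: the columns of `predict_proba` follow the order of `clf.classes_`, so with 0/1 labels column 1 is the probability of class 1.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy, hypothetical data: one feature, binary labels 0/1
X_toy = np.array([[0.0], [0.1], [0.2], [0.8], [0.9], [1.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

clf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_toy, y_toy)

proba = clf.predict_proba(X_toy)  # shape (n_samples, n_classes)
print(clf.classes_)               # column order of predict_proba
positive_proba = proba[:, 1]      # probability of clf.classes_[1]
print(positive_proba)
```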

# Predict on the test data in the same way
predict_test_label = rf.predict(x_test_dummies)
predict_test_proba = rf.predict_proba(x_test_dummies)[:, 1]

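# Note: the 'ID' and 'Reached.on.Time_Y.N' names below look like a template from another problem; in this dataset the id column is enrollee_id and the label column is target
# accuracy, f1_score, recall, precision -> submit the predicted labels
# pd.DataFrame({'ID': x_test.ID, 'Reached.on.Time_Y.N': predict_test_label}).to_csv('003000000.csv', index=False)

# auc, probability -> submit the predicted probabilities
# pd.DataFrame({'ID': x_test.ID, 'Reached.on.Time_Y.N': predict_test_proba}).to_csv('003000000.csv', index=False)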
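For this dataset, a submission frame would presumably pair `enrollee_id` with the prediction. The sketch below uses made-up ids and probabilities as stand-ins for `x_test['enrollee_id']` and `predict_test_proba`; the output column name `target` is an assumption based on the label column in `y_train`, so check the actual problem statement for the required header.

```python
import pandas as pd

# Hypothetical stand-ins for x_test['enrollee_id'] and predict_test_proba
toy_ids = pd.Series([101, 102, 103], name="enrollee_id")
toy_proba = [0.12, 0.87, 0.45]

submission = pd.DataFrame({"enrollee_id": toy_ids, "target": toy_proba})

# Called without a path, to_csv returns the CSV text, which is handy for a quick check;
# index=False keeps the row index out of the output
csv_text = submission.to_csv(index=False)
print(csv_text)
```

Passing a file name as the first argument of `to_csv` writes the same content to disk.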