[Binary classification : Tabular data] / 3rd level / Supervised Learning

ddoddo201 2021. 8. 3. 20:49

 

Kaggle Study

Binary Classification: Tabular data

 

 

3rd level. Home Credit Default Risk

 

💡 What is tabular data?

: Data organized in tables (rows and columns); it can be regarded as the most common form of data.

 

 

1️⃣ Problem description

  • Analysis goal: analyze past loan application data to predict whether an applicant will be able to repay a loan in the future.
  • Learning method: a classification problem in supervised learning
  • Classes: 0 (able to repay the loan), 1 (difficulty repaying the loan)

 

2️⃣ Data description

  • Data used: application_train/application_test
  • Identifier column: SK_ID_CURR (can be thought of as a unique ID for each customer)
  • Label column: TARGET (0: the loan was repaid, 1: the loan was not repaid)
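
As a first look at these two columns, the check below is a minimal sketch rather than the actual analysis. It assumes a tiny made-up CSV in place of application_train (the rows are invented for illustration; only the column names SK_ID_CURR and TARGET come from the description above) and uses only the Python standard library.

```python
import csv
import io
from collections import Counter

# Toy stand-in for application_train: these rows are made up for illustration.
# Only the column names SK_ID_CURR and TARGET come from the text above.
toy_csv = """SK_ID_CURR,TARGET
100001,0
100002,1
100003,0
100004,0
100005,0
"""

rows = list(csv.DictReader(io.StringIO(toy_csv)))

# The identifier column should not contain repeated values.
ids = [row["SK_ID_CURR"] for row in rows]
ids_are_unique = len(ids) == len(set(ids))

# Count how many rows fall into each class of the label column.
target_counts = Counter(int(row["TARGET"]) for row in rows)

# Share of rows labelled 1 (loan not repaid).
not_repaid_share = target_counts[1] / len(rows)

print("identifiers unique:", ids_are_unique)
print("class counts:", dict(target_counts))
print("share of TARGET == 1:", not_repaid_share)
```

Checking the class balance early is useful in binary classification, because a label column dominated by one class affects how accuracy should be interpreted.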

 

*Loan repayment: paying back the borrowed principal together with the interest

 

 


Supervised and Unsupervised Learning


 

💡 What are supervised learning and unsupervised learning?

1️⃣ Supervised learning

  • The correct answers (labels) are given to the model during training.
  • Because the answers are known, you can check how well the predictions or classifications match them.
  • Prediction (regression): Linear Regression
  • Classification: Decision Tree, Logistic Regression
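
To make the idea of "training with answers, then checking against them" concrete, here is a minimal sketch of logistic regression fitted by plain gradient descent. It is an illustration under assumptions, not the method used in the competition: the single standardized feature and its values are hypothetical, and the helper names (`fit_logistic`, `predict`) are invented for this example.

```python
import math


def sigmoid(z):
    """Map a real-valued score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, lr=0.1, epochs=1000):
    """Fit a one-feature logistic regression with batch gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Gradient of the mean log-loss with respect to w and b.
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = sum(errors) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


def predict(w, b, x):
    """Return class 1 when the predicted probability is at least 0.5."""
    return 1 if sigmoid(w * x + b) >= 0.5 else 0


# Hypothetical standardized feature values with known labels (the "answers").
xs = [-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0]
ys = [0, 0, 0, 0, 1, 1, 1, 1]

w, b = fit_logistic(xs, ys)
preds = [predict(w, b, x) for x in xs]

# Because labels are available, predictions can be scored against them.
accuracy = sum(p == y for p, y in zip(preds, ys)) / len(ys)
print("accuracy on the toy data:", accuracy)
```

In practice a library implementation would normally be used instead of hand-written gradient descent; the point here is only that the labels `ys` drive both the fitting and the evaluation.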

 

2️⃣ Unsupervised learning

  • No correct answers (labels) are given.
  • The algorithm groups similar data points into clusters.
  • Clustering: K-Means Clustering, Text Mining
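
The grouping of similar data points without labels can be sketched with a small K-Means implementation. This is a simplified illustration under assumptions: the 2-D points are made up, the initial centroids are picked by hand instead of randomly, and the function names are invented for this example.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, centroids, iterations=10):
    """Alternate assignment and update steps until the centroids stop moving."""
    labels = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda k: squared_distance(p, centroids[k]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for k in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:
                new_centroids.append(
                    tuple(sum(coord) / len(members) for coord in zip(*members))
                )
            else:
                # Keep an empty cluster's centroid where it was.
                new_centroids.append(centroids[k])
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels, centroids


# Made-up 2-D points forming two visibly separate groups; no labels are given.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]

# Hand-picked starting centroids (an assumption; real runs usually start randomly).
labels, centroids = kmeans(points, centroids=[points[0], points[3]])

print("cluster labels:", labels)
print("centroids:", centroids)
```

Unlike the supervised case, there is no answer column to score against; the cluster numbers are arbitrary names for the groups the algorithm found.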