
Spark ML

Date
2023/09/23
Category
Data Engineering
Tag
Apache Spark

Spark ML

์ŠคํŒŒํฌ์˜ ์—ฌ๋Ÿฌ ์ปดํฌ๋„ŒํŠธ

โ€ข
Spark SQL
โ€ข
Spark Streaming
โ€ข
MLlib
โ€ข
GraphX, etc.
MLlib (Machine Learning Library) was built to make machine learning easy and scalable to apply, and to make developing machine learning pipelines easy.

Machine Learning ์ด๋ž€?

โ€ข
๋ฐ์ดํ„ฐ๋ฅผ ์ด์šฉํ•ด ์ฝ”๋”ฉ์„ ํ•˜๋Š” ์ผ
โ€ข
์ตœ์ ํ™”์™€ ๊ฐ™์€ ๋ฐฉ๋ฒ•์„ ํ†ตํ•ด ํŒจํ„ด์„ ์ฐพ๋Š” ์ผ

๋จธ์‹ ๋Ÿฌ๋‹ ํŒŒ์ดํ”„๋ผ์ธ ๊ตฌ์„ฑ

๋ฐ์ดํ„ฐ ๋กœ๋”ฉ โ†’ ์ „์ฒ˜๋ฆฌ โ†’ ํ•™์Šต โ†’ ๋ชจ๋ธ ํ‰๊ฐ€

MLlib์€ DataFrame์œ„์—์„œ ๋™์ž‘

์•„์ง RDD API๊ฐ€ ์žˆ์ง€๋งŒ maintenance mode์ด๋ฉฐ, ์ƒˆ๋กœ์šด API๋Š” ๊ฐœ๋ฐœ ๋Š๊น€.
DataFrame์„ ์“ฐ๋Š” MLlib API๋ฅผ Spark ML์ด๋ผ๊ณ ๋„ ๋ถ€๋ฆ„.

SparkML์„ ์‚ฌ์šฉํ•˜๋Š” ์ด์œ 

๋‹ค๋ฅธ ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ์— ๋น„ํ•ด ์ŠคํŒŒํฌ๋Š” ๋Œ€์ค‘์ ์œผ๋กœ ์‚ฌ์šฉ๋˜๋Š” ๋ช‡๋ช‡ ์•Œ๊ณ ๋ฆฌ์ฆ˜๋งŒ ๊ตฌํ˜„๋˜์–ด ์žˆ๋‹ค. โ†’ ์ƒˆ๋กญ๊ฑฐ๋‚˜ ํ•ซํ•œ ๋ชจ๋ธ์ด ๋‚˜์™€๋„ ์ŠคํŒŒํฌ์—์„œ ์“ฐ๋ ค๋ฉด ๋‹ค๋ฅธ ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ๋ณด๋‹ค๋Š” ์กฐ๊ธˆ ๋” ๊ธฐ๋‹ค๋ ค์•ผ ํ•œ๋‹ค๋Š” ๋‹จ์ ์€ ์กด์žฌ.
๊ทธ๋Ÿผ SparkML์„ ์™œ ์“ฐ๋Š”๊ฑธ๊นŒ? ๋Œ€๋Ÿ‰์˜ ๋ฐ์ดํ„ฐ๋ฅผ ์ฒ˜๋ฆฌํ•˜๋Š”๋ฐ ๋งค์šฐ ์ ํ•ฉํ•˜๊ธฐ ๋•Œ๋ฌธ.
์ŠคํŒŒํฌ๋Š” ๋ฐ์ดํ„ฐ๋ฅผ ์ธ๋ฉ”๋ชจ๋ฆฌ ์ƒ์—์„œ ์ฒ˜๋ฆฌ. ๋ฐ์ดํ„ฐ๋ฅผ ๋ฉ”๋ชจ๋ฆฌ์— ์˜ฌ๋ ค์„œ ์ฒ˜๋ฆฌํ•˜๋ฉด ๋””์Šคํฌ๋ฅผ ์‚ฌ์šฉํ•˜๋Š” ๋งต๋ฆฌ๋“€์Šค๋‚˜ ๋จธํ•˜์›ƒ๋ณด๋‹ค 10๋ฐฐ์—์„œ 100๋ฐฐ๊นŒ์ง€ ๋น ๋ฅธ ๊ฒฐ๊ณผ๋ฅผ ์–ป์–ด๋‚ผ ์ˆ˜ ์žˆ๋‹ค.
ํ•™์Šต์— ํ•„์š”ํ•œ ์ „์ฒ˜๋ฆฌ๋ฅผ ์ŠคํŒŒํฌ๋กœ ์ง„ํ–‰ํ•˜๊ณ  ๋ชจ๋ธ๋ง์€ ํ…์„œํ”Œ๋กœ์šฐ์™€ ๊ฐ™์€ ํƒ€ ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ๋กœ ์ง„ํ–‰ํ•˜๊ฑฐ๋‚˜, ์ŠคํŒŒํฌ ์ง€์› ๋ชจ๋ธ๋กœ ์ถฉ๋ถ„ํ•œ ํ”„๋กœ์ ํŠธ๋ผ๋ฉด ๋ชจ๋ธ๋ง๊นŒ์ง€ ์ŠคํŒŒํฌ๋กœ ๋งˆ๋ฌด๋ฆฌํ•˜์—ฌ ์ž‘์—…์˜ ์†๋„๋ฅผ ๋†’์ผ ์ˆ˜ ์žˆ๋‹ค.

MLlib ์ปดํฌ๋„ŒํŠธ

โ€ข
Algorithms
โ—ฆ
Classification
โ—ฆ
Regression
โ—ฆ
Clustering
โ—ฆ
Recommendation
โ€ข
ํŒŒ์ดํ”„๋ผ์ธ
โ—ฆ
Training
โ—ฆ
Evaluating
โ—ฆ
Tuning
โ—ฆ
Persistence
โ€ข
Feature Engineering
โ—ฆ
Extraction
โ—ฆ
Transformation
โ€ข
Utils
โ—ฆ
Linear algebra
โ—ฆ
Statistics

์ฃผ์š” ์ปดํฌ๋„ŒํŠธ ์†Œ๊ฐœ

โ€ข
DataFrame
โ—ฆ
MLํŒŒ์ดํ”„๋ผ์ธ์—์„œ๋Š” ๋ฐ์ดํ„ฐํ”„๋ ˆ์ž„์ด ๊ธฐ๋ณธ ํฌ๋งท์ด๋ฉฐ, ํ…Œ์ŠคํŠธ์…‹์„ ๋กœ๋”ฉํ•˜๊ธฐ ์œ„ํ•ด ๊ธฐ๋ณธ์ ์œผ๋กœ csv, JSON, Parquet, JDBC๋ฅผ ์ง€์›. ML ํŒŒ์ดํ”„๋ผ์ธ์—์„œ ๋‹ค์Œ 2๊ฐ€์ง€์˜ ์ƒˆ๋กœ์šด ๋ฐ์ดํ„ฐ์†Œ์Šค๋ฅผ ์ถ”๊ฐ€ ์ง€์›ํ•จ.
โ—ฆ
์ด๋ฏธ์ง€ ๋ฐ์ดํ„ฐ์†Œ์Šค
โ–ช
Loads images such as JPEG and PNG from a given directory
โ—ฆ
LIBSVM data source
โ–ช
A machine learning training format made up of two columns, label and features
โ–ช
features ์ปฌ๋Ÿผ์€ ๋ฒกํ„ฐ ํ˜•ํƒœ์˜ ๊ตฌ์กฐ
โ€ข
Transformer
โ—ฆ
ํ”ผ์ณ ๋ณ€ํ™˜๊ณผ ํ•™์Šต๋œ ๋ชจ๋ธ์„ ์ถ”์ƒํ™”
โ—ฆ
Every Transformer has a transform() method.
โ—ฆ
๋ฐ์ดํ„ฐ๋ฅผ ํ•™์Šต์ด ๊ฐ€๋Šฅํ•œ ํฌ๋งท์œผ๋กœ ๋ฐ”๊ฟˆ.
โ—ฆ
Takes a DataFrame and produces a new DataFrame, usually adding one or more columns.
โ—ฆ
e.g. data normalization, tokenization, turning categorical data into numbers (one-hot encoding)
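A minimal sketch of a Transformer in use (assuming a SparkSession named spark already exists; the column names are made up for illustration):
from pyspark.ml.feature import Tokenizer

# hypothetical DataFrame with a free-text column named "sentence"
df = spark.createDataFrame([(0, "spark ml makes pipelines easy")], ["id", "sentence"])

tokenizer = Tokenizer(inputCol="sentence", outputCol="words")
tokenized_df = tokenizer.transform(df)  # returns a new DataFrame with a "words" column added
Python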
โ€ข
Estimator
โ—ฆ
๋ชจ๋ธ์˜ ํ•™์Šต ๊ณผ์ •์„ ์ถ”์ƒํ™”
โ—ฆ
Every Estimator has a fit() method.
โ—ฆ
fit()์€ DataFrame์„ ๋ฐ›์•„ Model์„ ๋ฐ˜ํ™˜
โ—ฆ
๋ชจ๋ธ์€ ํ•˜๋‚˜์˜ Transformer
ex)
lr = LinearRegression()
model = lr.fit(data)
โ€ข
Evaluator
โ—ฆ
metric์„ ๊ธฐ๋ฐ˜์œผ๋กœ ๋ชจ๋ธ์˜ ์„ฑ๋Šฅ์„ ํ‰๊ฐ€
ex) Root mean squared error (RMSE)
โ—ฆ
๋ชจ๋ธ์„ ์—ฌ๋Ÿฌ๊ฐœ ๋งŒ๋“ค์–ด์„œ, ์„ฑ๋Šฅ์„ ํ‰๊ฐ€ ํ›„ ๊ฐ€์žฅ ์ข‹์€ ๋ชจ๋ธ์„ ๋ฝ‘๋Š” ๋ฐฉ์‹์œผ๋กœ ๋ชจ๋ธ ํŠœ๋‹์„ ์ž๋™ํ™” ๊ฐ€๋Šฅ.
ex) BinaryClassificationEvaluator, CrossValidator
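For instance, a CrossValidator combines an Estimator, a parameter grid, and an Evaluator to pick the best model automatically. A rough sketch, assuming a train_df with "features" and "label" columns:
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

lr = LogisticRegression(featuresCol="features", labelCol="label")
grid = ParamGridBuilder().addGrid(lr.regParam, [0.01, 0.1]).build()  # candidate parameter values
evaluator = BinaryClassificationEvaluator(labelCol="label")

cv = CrossValidator(estimator=lr, estimatorParamMaps=grid,
                    evaluator=evaluator, numFolds=3)
cv_model = cv.fit(train_df)  # trains a model per fold/parameter combination and keeps the best one
Python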
โ€ข
Pipeline
โ—ฆ
ML์˜ ์›Œํฌํ”Œ๋กœ์šฐ
โ—ฆ
Holds multiple stages: Pipeline(stages=[...]).
โ—ฆ
์ €์žฅ๋  ์ˆ˜ ์žˆ์Œ (persist)
โ—ฆ
Transformer โ†’ Estimator โ†’ Evaluator โ†’ Model
ML Pipeline์€ ๊ฒฐ๊ตญ ํ•˜๋‚˜ ์ด์ƒ์˜ Transformer์™€ Estimator๊ฐ€ ์—ฐ๊ฒฐ๋œ ๋ชจ๋ธ๋ง ์›Œํฌํ”Œ๋กœ์šฐ๋กœ, ์ž…๋ ฅ์€ ๋ฐ์ดํ„ฐํ”„๋ ˆ์ž„์ด๊ณ  ์ถœ๋ ฅ์€ ๋จธ์‹ ๋Ÿฌ๋‹ ๋ชจ๋ธ์ธ ๊ฒƒ์ด๋‹ค. ML Pipeline ๊ทธ ์ž์ฒด๋„ Estimator์ด๋ฏ€๋กœ ์‹คํ–‰์€ fitํ•จ์ˆ˜์˜ ํ˜ธ์ถœ๋กœ ์‹œ์ž‘ํ•  ์ˆ˜ ์žˆ์œผ๋ฉฐ, ์ €์žฅํ–ˆ๋‹ค๊ฐ€ ๋‹ค์‹œ ๋กœ๋”ฉํ•˜๋Š” ๊ฒƒ์ด ๊ฐ€๋Šฅํ•ด ํ•œ๋ฒˆ ํŒŒ์ดํ”„๋ผ์ธ์„ ๋งŒ๋“ค์–ด๋‘๋ฉด ๋ฐ˜๋ณต์ ์ธ ๋ชจ๋ธ ๋นŒ๋”ฉ์ด ์‰ฌ์›Œ์ง„๋‹ค.
โ€ข
Parameter
โ—ฆ
Transformer์™€ Estimator์˜ ๊ณตํ†ต API๋กœ ๋‹ค์–‘ํ•œ ์ธ์ž๋ฅผ ์ ์šฉํ•ด์คŒ.
โ—ฆ
Param(ํ•˜๋‚˜์˜ ์ด๋ฆ„๊ณผ ๊ฐ’)๊ณผ ParamMap(Param ๋ฆฌ์ŠคํŠธ) ๋‘ ์ข…๋ฅ˜์˜ ํŒŒ๋ผ๋ฏธํ„ฐ๊ฐ€ ์กด์žฌ.
โ—ฆ
ํŒŒ๋ผ๋ฏธํ„ฐ๋Š” fit (Estimator) ํ˜น์€ transform (Transformer)์— ์ธ์ž๋กœ ์ง€์ • ๊ฐ€๋Šฅ.

Spark ML ํ”ผ์ณ๋ณ€ํ™˜

โ€ข
What a Feature Transformer does
โ—ฆ
๊ธฐ๋ณธ์ ์œผ๋กœ ๋จธ์‹ ๋Ÿฌ๋‹์—์„œ ๋ชจ๋“  ํ”ผ์ณ ๊ฐ’๋“ค์€ ์ˆซ์ž ํ•„๋“œ์ด์–ด์•ผ ํ•˜๋ฏ€๋กœย ํ…์ŠคํŠธ ํ•„๋“œ(์นดํ…Œ๊ณ ๋ฆฌ ๊ฐ’๋“ค)๋ฅผ ์ˆซ์ž ํ•„๋“œ๋กœ ๋ณ€ํ™˜
โ—ฆ
์ˆซ์ž ํ•„๋“œ๋ผ๊ณ  ํ•ด๋„ ๊ฐ€๋Šฅํ•œ ๊ฐ’์˜ ๋ฒ”์œ„๋ฅผ ํŠน์ • ๋ฒ”์œ„(0๋ถ€ํ„ฐ 1)๋กœ ๋ณ€ํ™˜ํ•˜๋Š” ํ‘œ์ค€ํ™”๊ฐ€ ํ•„์š”, ์ด๋ฅผย ํ”ผ์ณ ์Šค์ผ€์ผ๋ง(Scaling) ํ˜น์€ ์ •๊ทœํ™”(Normalization)๋ผ๊ณ  ํ•จ.
โ€ข
What a Feature Extractor does
โ—ฆ
๊ธฐ์กด ํ”ผ์ณ์—์„œ ์ƒˆ๋กœ์šด ํ”ผ์ณ๋ฅผ ์ถ”์ถœ
โ—ฆ
TF-IDF, Word2Vec, etc.
โ—ฆ
ํ…์ŠคํŠธ ๋ฐ์ดํ„ฐ๋ฅผ ์–ด๋–ค ํ˜•ํƒœ๋กœ ์ธ์ฝ”๋”ฉํ•˜๋Š” ๊ฒƒ์ด ์—ฌ๊ธฐ์— ํ•ด๋‹น
โ€ข
StringIndexer: ํ…์ŠคํŠธ ์นดํ…Œ๊ณ ๋ฆฌ๋ฅผ ์ˆซ์ž๋กœ ๋ณ€ํ™˜
โ—ฆ
Scikit-Learn์€ sklearn.preprocessing ๋ชจ๋“ˆ ์•„๋ž˜ ์—ฌ๋Ÿฌ ์ธ์ฝ”๋”(OneHotEncoder, LabelEncoder, OrdinalEncoder ๋“ฑ) ์กด์žฌ
โ—ฆ
Spark MLlib์˜ ๊ฒฝ์šฐ pyspark.ml.feature ๋ชจ๋“ˆ ๋ฐ‘์— ๋‘ ๊ฐœ์˜ ์ธ์ฝ”๋” ์กด์žฌ
โ—ฆ
StringIndexer, OneHotEncoder
โ—ฆ
์‚ฌ์šฉ๋ฒ•์€ Indexer ๋ชจ๋ธ์„ ๋งŒ๋“ค๊ณ (fit), Indexter ๋ชจ๋ธ๋กœ ๋ฐ์ดํ„ฐํ”„๋ ˆ์ž„์„ ๋ณ€ํ™˜(Transform)
โ€ข
Scaler: standardizes the range of numeric field values
โ—ฆ
Two scalers exist under the pyspark.ml.feature module
โ—ฆ
StandardScaler: subtracts the mean from each value and divides by the standard deviation; used when the values follow a normal distribution
โ—ฆ
MinMaxScaler: scales every value into the 0–1 range by subtracting the minimum from each value and dividing by (max − min); see the sketch below
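Scalers operate on a vector column, so the numeric fields are usually assembled first. A sketch (df and the column names are assumptions):
from pyspark.ml.feature import VectorAssembler, MinMaxScaler

assembler = VectorAssembler(inputCols=["trip_distance"], outputCol="features")
vdf = assembler.transform(df)

scaler = MinMaxScaler(inputCol="features", outputCol="features_scaled")
scaled_df = scaler.fit(vdf).transform(vdf)  # every value ends up between 0 and 1
Python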
โ€ข
Imputer: ๊ฐ’์ด ์—†๋Š” ํ•„๋“œ ์ฑ„์šฐ๊ธฐ
โ—ฆ
๊ฐ’์ด ์กด์žฌํ•˜์ง€ ์•Š๋Š” ๋ ˆ์ฝ”๋“œ๋“ค์ด ์กด์žฌํ•˜๋Š” ํ•„๋“œ๋“ค์˜ ๊ฒฝ์šฐ ๊ธฐ๋ณธ๊ฐ’(ํ‰๊ท ๊ฐ’, ์ค‘์•™๊ฐ’ ๋“ฑ)์„ ์ •ํ•ด ์ฑ„์›€

์ถ”์ฒœ ๋ชจ๋ธ

์œ ์ €๋ณ„ ์˜ํ™” ์ถ”์ฒœ ํŒŒ์ดํ”„๋ผ์ธ
โ€ข
ALS: Alternating Least Squares
One of the recommendation algorithms; the name stands for alternating least squares.
Since there are far more movies than any single user could watch, recommendation means predicting the ratings of the unseen movies and presenting them to the user from the highest predicted score down.
ALS fixes one of the two factor matrices and optimizes the other, repeating this alternation over and over — roughly the objective shown below.
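In the standard formulation (not specific to this post), with $r_{ui}$ the rating user $u$ gave item $i$ and $u_u$, $v_i$ the user and item factor vectors, each alternating step solves a regularized least-squares problem:
$$\min_{U,V} \sum_{(u,i)\,\mathrm{observed}} \left( r_{ui} - u_u^{\top} v_i \right)^2 + \lambda \left( \lVert u_u \rVert^2 + \lVert v_i \rVert^2 \right)$$
Fixing $V$ makes the problem linear in each $u_u$ (and vice versa), which is why the two matrices are updated alternately.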

์˜ˆ์ธก ๋ชจ๋ธ

๊ฑฐ๋ฆฌ๋ณ„ ํƒ์‹œ๋น„ ์˜ˆ์ธกํ•˜๊ธฐ
โ€ข
Linear Regression
์ข…์†๋ณ€์ˆ˜ y์™€ ํ•œ ๊ฐœ ์ด์ƒ์˜ ๋…๋ฆฝ๋ณ€์ˆ˜ x์— ๋Œ€ํ•œ ์„ ํ˜• ์ƒ๊ด€ ๊ด€๊ณ„๋ฅผ ๋ชจ๋ธ๋งํ•˜๋Š” ํšŒ๊ท€ ๋ถ„์„ ๋ฐฉ๋ฒ•.
์œ„์™€ ๊ฐ™์ด ๋ฐ์ดํ„ฐ๊ฐ€ ๋ถ„ํฌ๋˜์–ด์žˆ์„ ๋•Œ, ๋ฐ์ดํ„ฐ์˜ ๋ถ„ํฌ๊ฐ€ ๊ฐ€์žฅ ์ž˜ ๋งž๋Š” ์„ ์„ ๊ธ‹๋Š” ๊ฒƒ(์ตœ์ ํ™”).
โ€ข
RMSE(Root Mean Squared Error)
์˜ˆ์ธก ๋ชจ๋ธ์—์„œ ์˜ˆ์ธกํ•œ ๊ฐ’๊ณผ ์‹ค์ œ ๊ฐ’ ์‚ฌ์ด์˜ ํ‰๊ท  ์ฐจ์ด๋ฅผ ์ธก์ •ํ•œ๋‹ค.
์˜ˆ์ธก ๋ชจ๋ธ์ด ๋ชฉํ‘œ ๊ฐ’(์ •ํ™•๋„)์„ ์–ผ๋งˆ๋‚˜ ์ž˜ ์˜ˆ์ธกํ•  ์ˆ˜ ์žˆ๋Š”์ง€ ์ถ”์ •ํ•œ๋‹ค.

Practice code

Movie recommendation pipeline

git clone https://github.com/Y-gw/boaz-sparkML.git
Shell
ํŒŒ์ผ ๋งŒ๋“ค๊ธฐ
docker pull jupyter/all-spark-notebook
docker run -p 8888:8888 -e JUPYTER_ENABLE_LAB=yes -v {LOCAL_PATH}:/home/jovyan --name jupyter jupyter/all-spark-notebook
Shell
Copy the token printed in the container logs
http://localhost:8888/
Shell
Open it in a browser and log in with the token
โ€ข
DataFrame structure
+------+-------+------+----------+
|userId|movieId|rating| timestamp|
+------+-------+------+----------+
|     1|    296|   5.0|1147880044|
|     1|    306|   3.5|1147868817|
|     1|    307|   5.0|1147868828|
|     1|    665|   5.0|1147878820|
|     1|    899|   3.5|1147868510|
|     1|   1088|   4.0|1147868495|
|     1|   1175|   3.5|1147868826|
|     1|   1217|   3.5|1147878326|
|     1|   1237|   5.0|1147868839|
|     1|   1250|   4.0|1147868414|
|     1|   1260|   3.5|1147877857|
|     1|   1653|   4.0|1147868097|
|     1|   2011|   2.5|1147868079|
|     1|   2012|   2.5|1147868068|
|     1|   2068|   2.5|1147869044|
|     1|   2161|   3.5|1147868609|
|     1|   2351|   4.5|1147877957|
|     1|   2573|   4.0|1147878923|
|     1|   2632|   5.0|1147878248|
|     1|   2692|   5.0|1147869100|
+------+-------+------+----------+
only showing top 20 rows
Plain Text
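The steps below assume ratings_df, train_df, and test_df already exist; a minimal sketch of how they might be prepared (the file path, variable names, and split ratio are assumptions, not taken from the original repo):
# load the MovieLens ratings and split into train/test sets
ratings_df = spark.read.csv("ml-25m/ratings.csv", inferSchema=True, header=True)
train_df, test_df = ratings_df.randomSplit([0.8, 0.2], seed=42)
Python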
โ€ข
ML ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ ๋ถˆ๋Ÿฌ์™€์„œ ๋ชจ๋ธ ๋งŒ๋“ค๊ธฐ
from pyspark.ml.recommendation import ALS  # import the ALS algorithm

als = ALS(
    maxIter=5,
    regParam=0.1,
    userCol="userId",
    itemCol="movieId",
    ratingCol="rating",
    coldStartStrategy="drop"
)
model = als.fit(train_df)               # train using the fit() component
predictions = model.transform(test_df)  # test the model using the transform() component
predictions.show()
Python
+------+-------+------+----------+
|userId|movieId|rating|prediction|
+------+-------+------+----------+
|    76|   1342|   3.5| 2.9047337|
|    85|   1088|   2.0| 3.7284317|
|   132|   1238|   5.0| 3.2149928|
|   132|   1580|   3.0| 3.2497048|
|   137|   1645|   3.0|  3.167203|
|   230|    833|   3.0| 2.5753236|
|   230|   1088|   4.0|  3.115355|
|   243|   1580|   3.0| 2.5723686|
|   319|   1238|   5.0| 3.8150952|
|   333|   1088|   5.0|   4.05824|
|   368|   1580|   3.5|  3.603326|
|   368|   3175|   5.0| 3.5701354|
|   409|   8638|   5.0| 3.9008398|
|   458|   1580|   3.5| 3.1976578|
|   472|   3918|   3.0| 2.3450446|
|   548|   5803|   2.5| 2.6988087|
|   548|  36525|   3.5|  3.169841|
|   548|  82529|   3.0|   3.22782|
|   587|   6466|   4.0| 3.3879008|
|   597|   3997|   1.0| 1.9163384|
+------+-------+------+----------+
only showing top 20 rows
Plain Text
โ€ข
๋ชจ๋ธ ํ‰๊ฐ€
from pyspark.ml.evaluation import RegressionEvaluator

evaluator = RegressionEvaluator(metricName="rmse", labelCol='rating', predictionCol='prediction')
rmse = evaluator.evaluate(predictions)
print(rmse)
>> ex) 0.8184303257919787
Python
โ€ข
Recommendations
model.recommendForAllUsers(3).show()  # recommend the top 3 items for each user
Python
+------+--------------------+
|userId|     recommendations|
+------+--------------------+
|    12|[{151989, 6.29235...|
|    22|[{199187, 7.83498...|
|    26|[{151989, 5.92996...|
|    27|[{203086, 6.41190...|
|    28|[{151989, 8.20413...|
|    31|[{151989, 4.24829...|
|    34|[{151989, 6.02863...|
|    44|[{151989, 7.49052...|
|    47|[{151989, 5.90802...|
|    53|[{151989, 7.39449...|
|    65|[{205277, 6.63911...|
|    76|[{151989, 6.86239...|
|    78|[{151989, 7.89406...|
|    81|[{151989, 4.36021...|
|    85|[{151989, 5.68050...|
|    91|[{203086, 5.94486...|
|    93|[{151989, 6.51991...|
|   101|[{151989, 5.67113...|
|   103|[{151989, 6.55442...|
|   108|[{151989, 6.03695...|
+------+--------------------+
only showing top 20 rows
Plain Text
โ€ข
Code for a per-user recommendation API
from pyspark.sql.types import IntegerType

user_list = [65, 78, 81]
users_df = spark.createDataFrame(user_list, IntegerType()).toDF('userId')
Python
user_recs = model.recommendForUserSubset(users_df, 5)
movies_list = user_recs.collect()[0].recommendations
recs_df = spark.createDataFrame(movies_list)
Python

ํƒ์‹œ๋น„ ์˜ˆ์ธกํ•˜๊ธฐ

โ€ข
๊ตฌ์กฐ ํ™•์ธ ๋ฐ ํ•„์š” ์ปฌ๋Ÿผ ์ถ”์ถœ
trips_df.createOrReplaceTempView("trips")  # register the DataFrame so it can be queried with SQL
Python
query = """
SELECT
    trip_distance,  -- may need a cast
    total_amount
FROM trips
WHERE total_amount < 5000
    AND total_amount > 0
    AND trip_distance > 0
    AND trip_distance < 500
    AND passenger_count < 4
    AND TO_DATE(tpep_pickup_datetime) >= '2021-01-01'
    AND TO_DATE(tpep_pickup_datetime) < '2021-04-01'
"""
SQL
+-------------+------------+
|trip_distance|total_amount|
+-------------+------------+
|          2.1|        11.8|
|          0.2|         4.3|
|         14.7|       51.95|
|         10.6|       36.35|
|         4.94|       24.36|
|          1.6|       14.15|
|          4.1|        17.3|
|          5.7|        21.8|
|          9.1|        28.8|
|          2.7|       18.95|
|         6.11|        24.3|
|         1.21|       10.79|
|          7.4|       33.92|
|         1.01|        10.3|
|         0.73|       12.09|
|         1.17|       12.36|
|         0.78|        9.96|
|         1.66|        12.3|
|         0.93|         9.3|
|         1.16|       11.84|
+-------------+------------+
only showing top 20 rows
Plain Text
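The later steps use train_df and test_df; a plausible sketch of how they could be produced from the query above (the split ratio and seed are assumptions):
# run the query and split the result into train/test sets
data_df = spark.sql(query)
train_df, test_df = data_df.randomSplit([0.8, 0.2], seed=1)
Python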
โ€ข
Build the training dataset by creating a feature column
from pyspark.ml.feature import VectorAssembler

vassembler = VectorAssembler(inputCols=["trip_distance"], outputCol="features")
vtrain_df = vassembler.transform(train_df)
Python
โ€ข
regression ๋ชจ๋ธ ์ƒ์„ฑ
from pyspark.ml.regression import LinearRegression

lr = LinearRegression(
    maxIter=50,
    labelCol="total_amount",
    featuresCol="features"
)
model = lr.fit(vtrain_df)

vtest_df = vassembler.transform(test_df)
prediction = model.transform(vtest_df)
prediction.show()
Python
+-------------+------------+--------+-----------------+
|trip_distance|total_amount|features|       prediction|
+-------------+------------+--------+-----------------+
|         0.01|         3.3|  [0.01]|8.291036440655487|
|         0.01|         3.3|  [0.01]|8.291036440655487|
|         0.01|         3.3|  [0.01]|8.291036440655487|
|         0.01|         3.3|  [0.01]|8.291036440655487|
|         0.01|         3.3|  [0.01]|8.291036440655487|
|         0.01|         3.3|  [0.01]|8.291036440655487|
|         0.01|         3.3|  [0.01]|8.291036440655487|
|         0.01|         3.3|  [0.01]|8.291036440655487|
|         0.01|         3.3|  [0.01]|8.291036440655487|
|         0.01|         3.3|  [0.01]|8.291036440655487|
|         0.01|         3.3|  [0.01]|8.291036440655487|
|         0.01|         3.3|  [0.01]|8.291036440655487|
|         0.01|         3.3|  [0.01]|8.291036440655487|
|         0.01|         3.3|  [0.01]|8.291036440655487|
|         0.01|         3.8|  [0.01]|8.291036440655487|
|         0.01|         3.8|  [0.01]|8.291036440655487|
|         0.01|         3.8|  [0.01]|8.291036440655487|
|         0.01|         3.8|  [0.01]|8.291036440655487|
|         0.01|         3.8|  [0.01]|8.291036440655487|
|         0.01|         3.8|  [0.01]|8.291036440655487|
+-------------+------------+--------+-----------------+
only showing top 20 rows
Plain Text
โ€ข
๋ชจ๋ธ ํ‰๊ฐ€
model.summary.rootMeanSquaredError
>> ex) 4.872759850891687
Python
model.summary.r2  # about 82% of total_amount is explained by trip_distance
>> ex) 0.8237208415594777
Python
โ€ข
์ง์ ‘ ์ž…๋ ฅ๊ฐ’ ์„ค์ •ํ•ด์„œ ์˜ˆ์ธก
from pyspark.sql.types import DoubleType

distance_list = [1.1, 5.5, 10.5, 30.0]
distance_df = spark.createDataFrame(distance_list, DoubleType()).toDF("trip_distance")
vdistance_df = vassembler.transform(distance_df)
model.transform(vdistance_df).show()
Python
+-------------+--------+------------------+
|trip_distance|features|        prediction|
+-------------+--------+------------------+
|          1.1|   [1.1]|11.770960525327574|
|          5.5|   [5.5]|25.818360500150682|
|         10.5|  [10.5]|41.781315016995116|
|         30.0|  [30.0]|104.03683763268842|
+-------------+--------+------------------+
Plain Text

Reference

Spark MLlib