SSD

0. Abstract

• SSD discretizes the output space of bounding boxes into a set of default boxes over different aspect ratios and scales per feature map location.
→ In other words, at each feature map location the space of possible bounding boxes is split into default boxes with different aspect ratios and scales.
feature map? The grid of cells.
default box? The bounding boxes drawn inside a cell (the light-green cell in the paper's figure); the paper uses 4 or 6 per cell.

1. Introduction

• Object detection has to run fast enough for real-time use.
• YOLO is fast but less accurate.
• Faster R-CNN is computationally heavy and slow, but accurate.
→ SSD was proposed to get both: processing speed and accuracy.

2. The Single Shot Detector (SSD)

• Input: image & ground truth boxes
• Output (prediction): loc & conf ⇒ offsets & confidences for all object categories
  ◦ offsets: $(x, y, w, h)$, box coordinates predicted relative to a default box
    ▪ $x, y$: center coordinates of the box
    ▪ $w, h$: width and height of the box
  ◦ confidences for all object categories: $(c_1, c_2, \dots, c_p)$, one score per class
• Loss: weighted sum of the localization loss and the confidence loss
• Each cell of a feature map uses default boxes (anchor boxes) with different scales and aspect ratios
• Like a microscope, small feature maps detect large objects and large feature maps detect small objects

2.1 Model

We use the VGG-16 network as a base, but other networks should also produce good results.
• The network has feature maps of various sizes.
• 1x1 conv bottlenecks are applied in between.
• Auxiliary network
2.1.1 Multi-scale feature maps for detection
• Because the feature maps come in various sizes, objects of various sizes can be detected.
• The larger the feature map, the smaller the objects it detects.
2.1.2 Convolutional predictors for detection
• Multi-scale feature map vs. single-scale feature map
  ◦ SSD, which uses multi-scale feature maps, is more accurate than YOLO
• Single-stage detection: classification and localization are solved at the same time
  ◦ Class prediction and bbox regression run together
  ◦ SSD is faster than the 2-stage R-CNN
• Several feature layers are appended to the end of VGG16 (the base network)
→ base network + extra network (4 extra layers in the SSD paper)
→ Another model can be used as the base network (e.g. ResNet).
• Including feature maps taken from conv layers in the middle of the network, a total of 6 feature maps of different scales are used for prediction
• feature map: 38*38, 19*19, 10*10, 5*5, 3*3, 1*1
Conv4_3 : 38*38*4 = 5,776
Conv7 : 19*19*6 = 2,166
Conv8_2 : 10*10*6 = 600
Conv9_2 : 5*5*6 = 150
Conv10_2 : 3*3*4 = 36
Conv11_2 : 1*1*4 = 4
# of default boxes = 5,776 + 2,166 + 600 + 150 + 36 + 4 = 8,732
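The per-layer counts above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic: the name `FEATURE_MAPS` and the helper `boxes_per_layer` are made up for illustration, and the sizes and boxes-per-cell values are the ones listed above.

```python
# (layer name, feature map side length, default boxes per cell), as listed above.
FEATURE_MAPS = [
    ("Conv4_3", 38, 4),
    ("Conv7", 19, 6),
    ("Conv8_2", 10, 6),
    ("Conv9_2", 5, 6),
    ("Conv10_2", 3, 4),
    ("Conv11_2", 1, 4),
]


def boxes_per_layer(feature_maps):
    """Return {layer name: number of default boxes} for square feature maps."""
    return {name: size * size * k for name, size, k in feature_maps}


counts = boxes_per_layer(FEATURE_MAPS)
total_boxes = sum(counts.values())
print(counts)
print(total_boxes)
```

Most of the boxes come from the largest map: Conv4_3 alone contributes 5,776 of the total.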
2.1.3 Default boxes and aspect ratios
• At each cell of a feature map (64 cells for an 8x8 map), default bounding boxes are created, and for each default box the network predicts the box offsets and the per-class scores (whether an object is present in the box).
• The per-class scores are raw confidence values rather than normalized probabilities; they indicate whether an object is present in the box.
• # of channels: $k(c+4)$
* $k$: 4 or 6
* $c$: # of class scores → # of classes + 1 for the case where the box bounds nothing (background)
* 4: offsets (x, y, w, h)
• # of outputs for an $m \times n$ feature map: $k(c+4)mn$
• How do default boxes differ from the anchor boxes of Faster R-CNN?
  ◦ They are applied to several feature maps!
  ◦ "however we apply them to several feature maps of different resolutions"
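To make the $k(c+4)$ and $k(c+4)mn$ counts concrete, here is a small worked example in Python. The function names are hypothetical. The choice of $c = 21$ (20 object classes plus background) is an assumption used only for illustration, and the 38x38 map with $k = 4$ corresponds to Conv4_3.

```python
def predictor_channels(k, c):
    """Output channels of the predictor at one feature map: k * (c + 4)."""
    return k * (c + 4)


def predictor_outputs(k, c, m, n):
    """Total predicted values for an m x n feature map: k * (c + 4) * m * n."""
    return predictor_channels(k, c) * m * n


# Assumed example: c = 21 class scores, k = 4 default boxes per cell, 38x38 map.
channels = predictor_channels(k=4, c=21)             # 4 * 25
outputs = predictor_outputs(k=4, c=21, m=38, n=38)   # 100 * 1444
print(channels, outputs)
```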

2.2 Training

2.2.1 Matching strategy
• Default boxes are matched against the ground truth, keeping the default boxes whose IoU with a ground truth box reaches the threshold of 0.5.
• If several default boxes in one cell have an IoU of 0.5 or more, all of them are selected, not only the one with the largest IoU.
• It is a technique for getting the object detector to pick out the accurate bounding boxes among the boxes it predicts.
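The threshold step described above can be sketched as follows. This is a minimal illustration, not the paper's full matching procedure: boxes are assumed to be `(xmin, ymin, xmax, ymax)` tuples in normalized coordinates, and `iou` and `match_default_boxes` are hypothetical helper names.

```python
def iou(box_a, box_b):
    """IoU of two boxes given as (xmin, ymin, xmax, ymax)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def match_default_boxes(default_boxes, gt_box, threshold=0.5):
    """Return the indices of ALL default boxes with IoU >= threshold, not just the best one."""
    return [i for i, d in enumerate(default_boxes) if iou(d, gt_box) >= threshold]


defaults = [
    (0.0, 0.0, 0.5, 0.5),
    (0.1, 0.1, 0.6, 0.6),
    (0.5, 0.5, 1.0, 1.0),
]
gt = (0.05, 0.05, 0.55, 0.55)
matched = match_default_boxes(defaults, gt)
print(matched)
```

In this toy case the first two default boxes both overlap the ground truth well, so both are kept as positives, while the third barely touches it and is treated as background.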
2.2.2 Training objective
Weighted sum of the localization loss and the confidence loss
$N$ : # of matched default boxes (IoU of 0.5 or more)
$l$ : predicted box
$g$ : ground truth box
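For reference, the SSD paper writes this weighted sum as the objective below. Here $x$ denotes the match indicators, $c$ the class confidences, and $\alpha$ the weight on the localization term (the paper sets $\alpha = 1$); when $N = 0$ the loss is set to 0.

```latex
L(x, c, l, g) = \frac{1}{N}\left( L_{conf}(x, c) + \alpha \, L_{loc}(x, l, g) \right)
```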
localization loss (loc) → similar to Faster R-CNN
$cx, cy$ : center of the default bounding box ($d$)
$w$ : width
$h$ : height
$x^k_{ij}$ : 1 if the $i$-th default box is matched to the $j$-th ground truth box of category $k$ (IoU of 0.5 or more), 0 otherwise
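As a rough sketch of how these symbols fit together, the snippet below encodes a ground truth box against a default box and applies a Smooth L1 penalty, following the formulation in the SSD paper (center offsets scaled by the default box size, log ratios for width and height). The function names are hypothetical and the boxes are assumed to be `(cx, cy, w, h)` tuples.

```python
import math


def encode_offsets(gt, default):
    """Encode a ground truth box against a default box; both are (cx, cy, w, h)."""
    g_cx, g_cy, g_w, g_h = gt
    d_cx, d_cy, d_w, d_h = default
    return (
        (g_cx - d_cx) / d_w,   # center x offset, scaled by default box width
        (g_cy - d_cy) / d_h,   # center y offset, scaled by default box height
        math.log(g_w / d_w),   # log ratio of widths
        math.log(g_h / d_h),   # log ratio of heights
    )


def smooth_l1(x):
    """Smooth L1: quadratic for |x| < 1, linear otherwise."""
    ax = abs(x)
    return 0.5 * ax * ax if ax < 1.0 else ax - 0.5


def loc_loss_single(pred_offsets, gt, default):
    """Localization loss of one matched default box, summed over cx, cy, w, h."""
    target = encode_offsets(gt, default)
    return sum(smooth_l1(p - t) for p, t in zip(pred_offsets, target))


default_box = (0.5, 0.5, 0.2, 0.2)
gt_box = (0.52, 0.5, 0.2, 0.4)
target = encode_offsets(gt_box, default_box)
perfect = loc_loss_single(target, gt_box, default_box)  # prediction equals target
print(target, perfect)
```

Only matched boxes ($x^k_{ij} = 1$) contribute to this term, which is why the sketch takes a single matched pair as input.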
confidence loss (conf)
• The loss over all classes is computed with a softmax loss
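As given in the SSD paper, the softmax confidence loss sums over the positive (matched) boxes and the negative (background, class 0) boxes:

```latex
L_{conf}(x, c) = -\sum_{i \in Pos}^{N} x_{ij}^{p} \log\left(\hat{c}_i^{p}\right) - \sum_{i \in Neg} \log\left(\hat{c}_i^{0}\right),
\qquad
\hat{c}_i^{p} = \frac{\exp(c_i^{p})}{\sum_{p} \exp(c_i^{p})}
```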
2.2.3 Choosing scales and aspect ratios for default boxes
• Each of the 6 feature maps gets its own scale $s_k$
• Scale of the default boxes: $s_k$
<Formula for the scale of each feature map>
$s_k = s_{min} + \frac{s_{max}-s_{min}}{m-1}(k-1), \quad k \in [1, m]$
• $s_{min}$ = 0.2, $s_{max}$ = 0.9
• $m$: number of feature maps used for prediction (6 for SSD)
  ◦ $s_k$ of the first feature map (38*38): 0.2; $s_k$ of the last feature map (1*1): 0.9
  ◦ The smaller the feature map, the larger the scale of its default boxes
  ◦ This means that smaller feature maps can detect larger objects
• $a_r$ (aspect ratio): $\{1, 2, 3, \frac{1}{2}, \frac{1}{3}\}$
• $w^a_k$ (width of the default box): $s_k\sqrt{a_r}$
• $h^a_k$ (height of the default box): $s_k/\sqrt{a_r}$
2.2.4 Hard negative mining
• Most default boxes are background, so $x^p_{ij} = 0$ for most of them
• Because positives and negatives are imbalanced, the negatives are sorted by confidence loss in descending order and picked so that the positive : negative ratio is at most 1 : 3
⇒ Faster optimization and more stable training
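A minimal sketch of this selection step, assuming the per-box confidence losses have already been computed. The function name `hard_negative_mining` and the toy loss values are made up for illustration.

```python
def hard_negative_mining(conf_losses, is_positive, neg_pos_ratio=3):
    """Return indices of the negatives to keep: highest loss first, at most ratio * #positives."""
    num_pos = sum(is_positive)
    negatives = [i for i, pos in enumerate(is_positive) if not pos]
    negatives.sort(key=lambda i: conf_losses[i], reverse=True)
    return negatives[: neg_pos_ratio * num_pos]


# Toy example: one positive box (index 0) and seven background boxes.
losses = [0.9, 0.1, 2.5, 0.3, 1.7, 0.05, 0.8, 0.2]
positive = [True, False, False, False, False, False, False, False]
kept = hard_negative_mining(losses, positive)
print(kept)
```

With one positive, only the three background boxes with the largest confidence loss are kept; the remaining easy negatives do not contribute to the loss.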
2.2.5 Data augmentation
• Use the entire original input image
• Sample a patch so that the minimum IoU with the objects is 0.1, 0.3, 0.5, 0.7, or 0.9
• Randomly sample a patch
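The three options above could be sketched as below, with one option chosen at random per training image. Several details are assumptions rather than something these notes specify: the patch side range of [0.1, 1] of the image, the retry limit `max_tries`, the fallback to the whole image, and treating the minimum-IoU constraint as satisfied when any ground truth box overlaps enough. The aspect ratio limits on patches are ignored, and all function names are hypothetical.

```python
import random

MIN_IOU_CHOICES = (0.1, 0.3, 0.5, 0.7, 0.9)
WHOLE_IMAGE = (0.0, 0.0, 1.0, 1.0)


def iou(a, b):
    """IoU of two boxes given as (xmin, ymin, xmax, ymax)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def random_patch(rng, min_scale=0.1):
    """Sample a patch in normalized coordinates; side lengths in [min_scale, 1] (assumed)."""
    w = rng.uniform(min_scale, 1.0)
    h = rng.uniform(min_scale, 1.0)
    x = rng.uniform(0.0, 1.0 - w)
    y = rng.uniform(0.0, 1.0 - h)
    return (x, y, x + w, y + h)


def sample_training_patch(gt_boxes, rng, max_tries=50):
    """Pick one of the three options and return the patch to crop."""
    option = rng.choice(("whole", "min_iou", "random"))
    if option == "whole":
        return WHOLE_IMAGE
    if option == "random":
        return random_patch(rng)
    min_iou = rng.choice(MIN_IOU_CHOICES)
    for _ in range(max_tries):
        patch = random_patch(rng)
        # Assumption: one sufficiently overlapping ground truth box is enough.
        if any(iou(patch, gt) >= min_iou for gt in gt_boxes):
            return patch
    return WHOLE_IMAGE  # assumed fallback when no sampled patch qualifies


patch = sample_training_patch([(0.3, 0.3, 0.7, 0.7)], random.Random(0))
print(patch)
```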