
PR [Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks]

Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks

💡 Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks (2016, Jan, 06)

Authors: Shaoqing Ren, Kaiming He, Ross Girshick, Jian Sun

Abstract

  • Object detection at the time: detectors relied on region proposal algorithms to hypothesize object locations.
    • What is a region proposal?
      • The step that finds candidate regions likely to contain an object.
      • Rather than examining every pixel of the image, candidate regions are proposed first and the CNN only has to recognize those regions.
      • The CNN is then applied to each candidate region separately.
      • This reduces the amount of computation.
  • SPPnet, Fast R-CNN
    • Faster, because the CNN computation is shared through a common feature map.
      • feature map
        • A summary of the visual information extracted by the CNN.
        • The proposed regions (RoIs) are applied to the feature map, so features for each region are read from it.
    • The CNN no longer has to be run many times.

This paper introduces the Region Proposal Network (RPN).

  • The RPN shares full-image convolutional features with the detection network, which makes region proposals nearly cost-free.
    • full-image convolutional features
      • The convolutional feature map extracted once for the whole image.
    • Region proposals are obtained with almost no additional computational cost.
  • The CNN is applied once to the whole image to produce a feature map, and the RPN is applied to that feature map to generate region proposals.
  • Region Proposal Network (RPN)
    • A fully convolutional network.
    • Simultaneously predicts object bounds and an objectness score at each position.
    • Trained end-to-end.
      • The whole network, from input to final output, is connected by a single gradient flow, so all parameters are learned together from the error.

    RPN and Fast R-CNN are merged into a single network.

    ⇒ They share convolutional features!

    • What are convolutional features?
      • The output values a CNN extracts by applying convolutions to the input image.
    • In "attention" terms, the RPN component tells the unified network (RPN + Fast R-CNN) where to look.
    • Frame rate: 5 fps (on a GPU)

| Step | Fast R-CNN | Faster R-CNN |
| --- | --- | --- |
| 1. CNN (feature map generation) | ✅ | ✅ |
| 2. Region proposal | ❌ external Selective Search | ✅ CNN-based RPN |
| 3. RoI pooling | ✅ used | ✅ used (same) |
| 4. Classification + bbox regression | ✅ | ✅ |
| 5. Training scheme | ❗️ only partly end-to-end | ✅ fully end-to-end |

1 Introduction

The basic recipe for object detection

⇒ region proposal algorithm + CNN

The existing Fast R-CNN

  • If the time spent on region proposals is ignored, it achieves near real-time rates even with very deep networks.
  • The CNN backbone can be swapped (the paper uses VGG-16).

Object detection systems ⇒ the region proposal step has become the bottleneck that slows down test time the most.

Existing region proposal methods

  • They rely on simple features and favor cheap computation, and they cannot be trained (they depend only on hand-designed rules).

Selective Search

  • A widely used region proposal algorithm.
  • Greedy merging (similar-looking superpixels are merged one pair at a time, repeatedly)
    • Regions are generated by merging superpixels step by step.
    • superpixel
      • A small patch of pixels with similar color/texture, obtained by over-segmenting the image.
  • Based on hand-engineered low-level features (color, texture, etc.).

⇒ Compared with efficient detection networks, Selective Search is extremely slow: about 2 seconds per image in a CPU implementation.

Edge Boxes

  • Currently the best trade-off between proposal quality and speed.
    • 0.2 seconds per image
  • Even so, the proposal step still takes about as long as the detection network itself.

Isn't the comparison unfair, since the detection network runs on a GPU while region proposal methods run on a CPU?

One option: re-implement region proposal so that it also runs on the GPU.

⇒ Moving to the GPU may be an effective fix, but it ignores the downstream detection network and therefore misses the opportunity to share computation.

It is better for the RPN and the detection network to share computation.

Hence Faster R-CNN: the CNN is run on the whole image, and the resulting feature map is shared by the RPN and the detection network.

The paper shows that computing proposals with a deep CNN is an efficient solution, and that proposal computation becomes nearly cost-free.

Region proposals computed by a CNN ⇒ the detection computation is reused as is ⇒ efficient (nearly cost-free)

The paper introduces an RPN that shares convolutional layers with state-of-the-art object detection networks.

⇒ The marginal cost of computing proposals is very small.

The authors also observe that the feature maps used by region-based detectors such as Fast R-CNN can be used for generating region proposals as well.

⇒ Previously the feature map was used only for detection, while proposals came from a separate algorithm (Selective Search). The paper's finding is that the same feature map can also be used for region proposal.

The "RPN" proposed in the paper

  • On top of these convolutional features, the paper builds the RPN by adding a few extra convolutional layers.
    • These layers simultaneously regress region bounds and objectness scores at each location on a regular grid.
    • The RPN is therefore a fully convolutional network (FCN).
      • Fully convolutional network ⇒ a network made up only of convolutional operations.
      • It can be trained end-to-end specifically for the task of generating detection proposals.
  • It is designed to predict region proposals efficiently over a wide range of scales and aspect ratios.

  • Previous approaches to multi-scale detection (Figure 1 (a), (b) in the paper)
    • To detect objects of different sizes, the image or the filters were rescaled to several sizes.
      • The CNN had to be run separately for every scale.
  • "Anchor" boxes proposed in the paper: can detect objects of various scales and aspect ratios.
    • anchor boxes: boxes of several sizes are laid out over the feature map produced by the CNN and serve as references for predicting object locations.
    • Object locations are predicted from these anchor boxes, without having to compute images or filters at multiple scales.
      • How much should the box be adjusted?
      • What is the probability that the box contains an object?
    • The model performs well when trained and tested on single-scale images, which also benefits speed.

How the RPN is combined with the Fast R-CNN detection network

  • To unify the RPN with Fast R-CNN, the paper adopts a training scheme that alternates between fine-tuning for the region proposal task (RPN) and fine-tuning for object detection.
    • While fine-tuning for object detection, the region proposals are kept fixed.

The method is evaluated on the PASCAL VOC detection benchmarks.

  • RPN + Fast R-CNN achieves better accuracy than Selective Search + Fast R-CNN.
  • The unified RPN + Fast R-CNN system also avoids the computation that Selective Search incurs, so it is better in running time as well.
  • Even with a very deep backbone, the model runs at 5 fps, which makes it a practical detection system in both speed and accuracy.
  • The approach has also been used for 3D object detection, part-based detection, instance segmentation, and image captioning.

⇒ RPN + Fast R-CNN is not only cost-efficient and practical, it is also an effective way to improve detection accuracy.

2 Related Work

Object Proposals

  • Methods based on grouping super-pixels
    • Selective Search
  • Methods based on sliding windows
    • EdgeBoxes
  • Object proposal methods were used as external modules independent of the detectors.

    ⇒ No computation is shared with the detector.

Deep Networks for Object Detection

  • R-CNN trains CNNs end-to-end to classify the proposed regions into object categories or background.
  • R-CNN mainly acts as a classifier and does not predict object bounds (except for refining boxes by bounding-box regression).
    • Its accuracy ⇒ depends on the performance of the region proposal module.
  • Several papers have used deep networks to predict object bounding boxes.
    • In OverFeat, a fully-connected layer is trained to predict the bounding box of an object.
    • The fully-connected layer is then turned into a convolutional layer for detecting multiple class-specific objects.
      • MultiBox generalizes the "single-box" approach of OverFeat.

      ⇒ MultiBox generates region proposals from a network whose last fully-connected layer simultaneously predicts multiple class-agnostic boxes.

      • These class-agnostic boxes are used as proposals for R-CNN.
      • The MultiBox proposal network is applied to a single image crop or to multiple large image crops, in contrast to the fully convolutional scheme of this paper.
      • MultiBox does not share features between the proposal and detection networks.
        • Sharing convolutional computation has been attracting attention as a route to efficient yet accurate visual recognition.

    OverFeat

    • Rescales the image to several sizes and applies the CNN at every size (an image pyramid).

    SPP

    • Applies adaptively-sized pooling (spatial pyramid pooling) on shared convolutional feature maps.
    • This enables efficient region-based object detection.

    Fast R-CNN

    • Enables end-to-end detector training on shared convolutional features, with strong speed and accuracy.

3 Faster R-CNN

  • Faster R-CNN is composed of two modules.
    1. A deep fully convolutional network → proposes regions (the RPN)
    2. The Fast R-CNN detector → uses the proposed regions
  • The entire system is a single, unified network for object detection.
  • The RPN module tells Fast R-CNN where to pay "attention".

3.1 Region Proposal Networks

  • The RPN takes an image as input and outputs a set of rectangular object proposals, each with an objectness score.
  • This process is modeled with convolutional layers only, because the goal is to share computation (image → feature map) with Fast R-CNN.
    • The two networks share a common set of convolutional layers.

⇒ The paper's experiments use the Zeiler and Fergus model (ZF; 5 shareable convolutional layers) and the Simonyan and Zisserman model (VGG-16; 13 shareable convolutional layers).

Region Proposal

  • A small network is slid over the feature map output by the last shared convolutional layer.
  • small network
    • Takes an n×n spatial window of the feature map as input.
    • Each sliding window is mapped to a lower-dimensional feature (256-d for ZF, 512-d for VGG), followed by a ReLU.

⇒ This feature is then fed into two sibling layers: a box-regression layer (reg) and a box-classification layer (cls).

  • The paper uses n = 3, i.e. a 3×3 window at each position (n is the window size, not the number of anchor scales).

3.1.1 Anchors

  • At each sliding-window location, multiple region proposals are predicted simultaneously; the maximum number of proposals per location is denoted k.
  • reg layer (how much to adjust the box)
    • 4k outputs (x, y, w, h for each of the k boxes)
  • cls layer (whether the box is object or background)
    • 2k outputs (object / not-object for each of the k boxes)
  • The paper uses 3 scales and 3 aspect ratios, giving k = 9 anchors per position.
  • Total number of anchors
    • For a feature map of size W×H, the total number of anchors is W×H×k.
      • This is because the sliding window is applied at every position of the feature map.
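The anchor layout described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names `make_base_anchors` and `make_all_anchors` are hypothetical, the stride of 16 and the 60×40 feature map are example values, and centring each anchor on the middle of a feature-map cell is an assumption.

```python
import numpy as np

def make_base_anchors(scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return k = len(scales) * len(ratios) anchors centred at (0, 0) as (x1, y1, x2, y2)."""
    anchors = []
    for s in scales:
        area = float(s) ** 2
        for r in ratios:              # r = height / width
            w = np.sqrt(area / r)     # keeps w * h equal to the target area
            h = w * r
            anchors.append([-w / 2, -h / 2, w / 2, h / 2])
    return np.array(anchors)

def make_all_anchors(feat_w, feat_h, stride=16, base=None):
    """Tile the base anchors over every feature-map position: W * H * k boxes."""
    base = make_base_anchors() if base is None else base
    # Centre of each feature-map cell, expressed in input-image pixels (assumed convention).
    cx = (np.arange(feat_w) + 0.5) * stride
    cy = (np.arange(feat_h) + 0.5) * stride
    cx, cy = np.meshgrid(cx, cy)                        # both have shape (H, W)
    shifts = np.stack([cx, cy, cx, cy], axis=-1)        # (H, W, 4)
    all_anchors = shifts[:, :, None, :] + base[None, None, :, :]   # (H, W, k, 4)
    return all_anchors.reshape(-1, 4)

base = make_base_anchors()
anchors = make_all_anchors(feat_w=60, feat_h=40)
print(base.shape)      # (9, 4)
print(anchors.shape)   # (21600, 4), i.e. 60 * 40 * 9
```

Broadcasting the (H, W, 1, 4) grid of centres against the (1, 1, k, 4) base boxes is what makes the count come out as W×H×k without any explicit loop over positions.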

Translation-Invariant Anchors

  • An important property of the approach is that it is translation invariant.
    • Both the anchors and the functions that compute proposals relative to the anchors are the same at every position, so an object is still detected when its position in the image changes.
    • If an object is recognized somewhere in the image, the same function should be able to predict it at another location as well.
      • Faster R-CNN guarantees this. (MultiBox cannot, because it generates its anchors with k-means.)
    • Translation invariance also reduces the model size.
    • The output layer of MultiBox has 6.1×10^6 parameters, whereas the output layer of this paper's model has 512 × (4 + 2) × 9 ≈ 2.8×10^4 parameters (with VGG-16 as the CNN backbone).

    ⇒ With fewer parameters, the risk of overfitting on small datasets is also lower.
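The parameter counts can be checked with simple arithmetic. In the sketch below, the RPN breakdown is the one given in the notes; the MultiBox breakdown (1536 × (4 + 1) × 800, for a GoogLeNet-based model) is taken from the paper rather than from the notes above, and biases are ignored.

```python
# Output-layer parameter counts (weights only, biases ignored).
rpn_params = 512 * (4 + 2) * 9           # VGG-16 feature dim x (reg + cls outputs) x k anchors
multibox_params = 1536 * (4 + 1) * 800   # breakdown quoted in the paper for MultiBox

print(rpn_params)                          # 27648   (about 2.8e4)
print(multibox_params)                     # 6144000 (about 6.1e6)
print(round(multibox_params / rpn_params)) # 222, i.e. roughly two orders of magnitude
```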

Multi-Scale Anchors as Regression References

  • The anchor design is a new way of handling multiple scales and aspect ratios. Before it, there were two popular approaches to multi-scale prediction.
    1. Image/feature pyramids (manipulating the image)
    • The image is resized to several scales, and a feature map is computed for each scale.
    2. Sliding windows of multiple sizes on the feature map
    • Models for different aspect ratios are trained separately, using different filter sizes.
    • Think of it as a "pyramid of filters".

    ⇒ The second approach is usually adopted together with the first.

  • Compared with these two approaches, the anchor-based method builds on a "pyramid of anchors", which is more cost-efficient.

  • Because anchors of multiple scales are used, the image and the feature map only need a single scale.
  • Objects of various sizes can still be detected this way.

3.1.2 Loss Function

  • For training the RPN, each anchor is assigned a binary class label (object or not).
  • A positive label is assigned to two kinds of anchors.
    1. The anchor(s) with the highest Intersection-over-Union (IoU) overlap with a ground-truth box
    2. Any anchor whose IoU overlap with a ground-truth box is higher than 0.7

    ground-truth box: the correct box, i.e. the reference annotation provided by a human labeler

    A single ground-truth box may assign positive labels to multiple anchors.

    ⇒ The second condition is usually sufficient to determine the positive samples, but the paper still keeps the first condition, because in rare cases the second one finds no positive sample at all.

  • An anchor that is not positive is given a negative label (background) if its IoU (overlap with the ground truth) is lower than 0.3 for all ground-truth boxes.
  • Anchors that are neither positive nor negative do not contribute to training.

| Label | Description / use |
| --- | --- |
| Positive | IoU with a ground-truth box above 0.7, or the highest IoU for some ground-truth box (used as an object label during training) |
| Negative | IoU with every ground-truth box below 0.3 (used as a background label during training) |
| 0.3 ≤ IoU ≤ 0.7 | Not used in training |
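The labelling rules in the table can be sketched as follows. This is a minimal NumPy illustration under stated assumptions: boxes are given as (x1, y1, x2, y2), the helper names `iou_matrix` and `label_anchors` are hypothetical, and ties for the "highest IoU" rule are resolved by taking a single best anchor per ground-truth box.

```python
import numpy as np

def iou_matrix(anchors, gt_boxes):
    """Pairwise IoU between anchors (N, 4) and ground-truth boxes (M, 4)."""
    a = anchors[:, None, :]      # (N, 1, 4)
    g = gt_boxes[None, :, :]     # (1, M, 4)
    inter_w = np.clip(np.minimum(a[..., 2], g[..., 2]) - np.maximum(a[..., 0], g[..., 0]), 0, None)
    inter_h = np.clip(np.minimum(a[..., 3], g[..., 3]) - np.maximum(a[..., 1], g[..., 1]), 0, None)
    inter = inter_w * inter_h
    area_a = (a[..., 2] - a[..., 0]) * (a[..., 3] - a[..., 1])
    area_g = (g[..., 2] - g[..., 0]) * (g[..., 3] - g[..., 1])
    return inter / (area_a + area_g - inter)

def label_anchors(anchors, gt_boxes, pos_thr=0.7, neg_thr=0.3):
    """Return one label per anchor: 1 = positive, 0 = negative, -1 = ignored."""
    iou = iou_matrix(anchors, gt_boxes)
    max_iou = iou.max(axis=1)
    labels = np.full(len(anchors), -1, dtype=np.int64)
    labels[max_iou < neg_thr] = 0
    labels[max_iou > pos_thr] = 1
    # Rule 1: the best-matching anchor of every ground-truth box is positive,
    # even when its IoU is not above pos_thr.
    labels[iou.argmax(axis=0)] = 1
    return labels

anchors = np.array([[0, 0, 10, 10], [0, 0, 10, 6], [20, 20, 30, 30], [0, 0, 10, 9]], dtype=float)
gt = np.array([[0, 0, 10, 10]], dtype=float)
print(label_anchors(anchors, gt).tolist())  # [1, -1, 0, 1]
```

In the toy example the second anchor has IoU 0.6, so it falls in the ignored band between the two thresholds.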

With these labels, the RPN minimizes a multi-task loss that follows the one used in Fast R-CNN:

$$L(\{p_i\}, \{t_i\}) = \frac{1}{N_{cls}} \sum_i L_{cls}(p_i, p_i^*) + \lambda \frac{1}{N_{reg}} \sum_i p_i^* L_{reg}(t_i, t_i^*)$$

  • In the equation above, i is the index of an anchor in the mini-batch.
  • p_i is the predicted probability that anchor i is an object.
  • p_i^* (the ground-truth label) is 1 if the anchor is positive (object) and 0 if it is negative (background).
  • t_i and t_i^* are both vectors of box offsets parameterized relative to the anchor: t_i is the model's prediction, and t_i^* is the value for the ground-truth box associated with that anchor.
    • Why parameterize? → Because the box is expressed as a relative change with respect to the anchor.
  • Classification loss L_cls: the log loss over two classes (object vs. background).
  • Regression loss L_reg: a robust loss function (smooth L1) is used, which makes training more stable.
    • The factor p_i^* L_reg means the regression loss is activated only for positive anchors and is disabled for background anchors.
  • N_cls, N_reg: the cls term is normalized by the mini-batch size (N_cls = 256), and the reg term is normalized by the number of anchor locations (N_reg ≈ 2,400).
    • cls: object vs. background classification
    • reg: bounding-box position refinement
  • Balancing parameter λ: the classification and regression terms have different scales, so simply adding them could let one term dominate and bias training. λ is the weight that balances the two losses.

⇒ The normalization and the balancing parameter can be simplified; the paper notes that the normalization above is not strictly required.
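The loss described above can be written out directly. The sketch below is a NumPy illustration for checking the arithmetic, not training code: the function names are hypothetical, the defaults N_cls = 256, N_reg = 2400 and λ = 10 are the values quoted from the paper, and `p` is assumed to already be a probability (for example after a softmax).

```python
import numpy as np

def smooth_l1(x):
    """Robust (smooth L1) loss applied element-wise."""
    ax = np.abs(x)
    return np.where(ax < 1.0, 0.5 * x ** 2, ax - 0.5)

def rpn_loss(p, p_star, t, t_star, n_cls=256, n_reg=2400, lam=10.0):
    """p: (N,) predicted object probability; p_star: (N,) labels in {0, 1};
    t, t_star: (N, 4) predicted and target box offsets."""
    eps = 1e-12  # avoids log(0)
    # Log loss over the two classes (object vs. not object).
    l_cls = -(p_star * np.log(p + eps) + (1 - p_star) * np.log(1 - p + eps))
    # Regression loss, switched on only for positive anchors via p_star.
    l_reg = p_star * smooth_l1(t - t_star).sum(axis=1)
    return l_cls.sum() / n_cls + lam * l_reg.sum() / n_reg

p = np.array([0.9, 0.1])
p_star = np.array([1.0, 0.0])
t = np.zeros((2, 4))
t_star = np.zeros((2, 4))
t_star[1] = 5.0   # large offset error on a negative anchor: it must not affect the loss
loss = rpn_loss(p, p_star, t, t_star)
print(loss)       # only the classification term is non-zero here
```

The toy call shows the role of the p_i^* factor: the second anchor has a large regression error but is labelled background, so it contributes nothing to the regression term.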

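For bounding-box regression, the paper uses the following parameterization of the four coordinates:

$$t_x = (x - x_a)/w_a, \quad t_y = (y - y_a)/h_a, \quad t_w = \log(w/w_a), \quad t_h = \log(h/h_a)$$

$$t_x^* = (x^* - x_a)/w_a, \quad t_y^* = (y^* - y_a)/h_a, \quad t_w^* = \log(w^*/w_a), \quad t_h^* = \log(h^*/h_a)$$

Here x, y, w, h are the box center coordinates, width, and height; x, x_a, and x^* refer to the predicted box, the anchor box, and the ground-truth box respectively (likewise for y, w, h).

→ The changes in x, y, height, and width are expressed as ratios relative to the anchor, and those ratios are what the network predicts.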
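A small worked example makes the parameterization concrete. The `encode` and `decode` helpers below are hypothetical names; boxes are assumed to be given as (center x, center y, width, height), and the anchor and ground-truth values are made up for illustration.

```python
import numpy as np

def encode(box, anchor):
    """Offsets (t_x, t_y, t_w, t_h) of a box relative to an anchor; both are (x, y, w, h)."""
    x, y, w, h = box
    xa, ya, wa, ha = anchor
    return np.array([(x - xa) / wa, (y - ya) / ha, np.log(w / wa), np.log(h / ha)])

def decode(t, anchor):
    """Inverse of encode: apply offsets t to an anchor to recover a box."""
    tx, ty, tw, th = t
    xa, ya, wa, ha = anchor
    return np.array([xa + tx * wa, ya + ty * ha, wa * np.exp(tw), ha * np.exp(th)])

anchor = np.array([100.0, 100.0, 128.0, 128.0])
gt_box = np.array([116.0, 92.0, 256.0, 64.0])
t_star = encode(gt_box, anchor)
print(t_star)                   # t_x = 0.125, t_y = -0.0625, t_w = log 2, t_h = -log 2
print(decode(t_star, anchor))   # recovers gt_box up to floating-point error
```

Because the center shift is divided by the anchor size and the size change is a log-ratio, the same target values describe the same relative correction for a small anchor and a large one.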

How does this bounding-box regression differ from earlier RoI-based methods?

  • Earlier RoI-based methods ([1], [2] = SPPnet, Fast R-CNN)
    1. Region proposal (Selective Search, etc.) → produces RoIs of arbitrary sizes
    2. Features are pooled from each RoI (RoI pooling)
    3. The same regression weights are used for RoIs of all sizes

    Problem: RoI sizes all differ, but there is only one set of regression parameters (weights), so large and small objects are corrected in the same way → harder to optimize

| Earlier RoI-based methods | Faster R-CNN (RPN) |
| --- | --- |
| Pooling from RoIs of various sizes | Features of a fixed spatial size (3×3) on the feature map |
| Same regression weights for all RoIs | A separate regressor per scale/aspect ratio |
| Hard to reflect size differences between RoI features | The anchor design handles differences in size and ratio |

→ Why the Faster R-CNN approach works well

  • Fixed feature size → training is more stable and the implementation is simpler.
  • Separate regressor per scale/ratio → weights specialized for small and large objects can be learned.
  • Thanks to anchors, boxes of various sizes can be predicted even from features of a fixed size.

3.1.3 Training RPNs

  • The RPN is trained end-to-end by back-propagation.
    • The network behaves as one connected graph, so every path from the input image to the final loss is updated together by back-propagation.
  • It follows the "image-centric" sampling strategy of Fast R-CNN.
    • Each mini-batch comes from a single image that contains many positive (object) and negative (background) anchors.
    • It is possible to optimize the loss (make its value as small as possible) over all anchors, but negative (background) samples dominate, so training would be biased toward negatives.

    → Therefore 256 anchors are randomly sampled from an image to compute the mini-batch loss, with a positive-to-negative ratio of up to 1:1.

    If an image has fewer than 128 positive samples, the mini-batch is padded with negative anchors.
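The sampling rule above can be sketched like this. It assumes anchors have already been labelled with 1 (positive), 0 (negative) or -1 (ignored); the function name `sample_minibatch` and the toy label counts are made up for illustration.

```python
import numpy as np

def sample_minibatch(labels, batch_size=256, max_pos_fraction=0.5, rng=None):
    """Return indices of the anchors sampled for one mini-batch."""
    rng = np.random.default_rng() if rng is None else rng
    pos = np.flatnonzero(labels == 1)
    neg = np.flatnonzero(labels == 0)
    # At most half of the batch is positive (ratio of up to 1:1).
    n_pos = min(len(pos), int(batch_size * max_pos_fraction))
    # With fewer than 128 positives, negatives fill the remaining slots.
    n_neg = min(len(neg), batch_size - n_pos)
    pos_idx = rng.choice(pos, size=n_pos, replace=False)
    neg_idx = rng.choice(neg, size=n_neg, replace=False)
    return np.concatenate([pos_idx, neg_idx])

labels = np.full(6000, -1)
labels[:40] = 1        # only 40 positive anchors in this toy image
labels[40:4000] = 0    # 3960 negative anchors, the rest are ignored
idx = sample_minibatch(labels, rng=np.random.default_rng(0))
n_pos = int((labels[idx] == 1).sum())
n_neg = int((labels[idx] == 0).sum())
print(len(idx), n_pos, n_neg)  # 256 40 216
```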

3.2 Sharing Features for RPN and Fast R-CNN

  • If the RPN and Fast R-CNN are trained independently, each will modify its convolutional layers in a different way.

⇒ A technique is therefore needed that lets the two networks share convolutional layers.

The key idea of Faster R-CNN ⇒ the region proposal network (RPN) and Fast R-CNN share the feature map!

  1. Alternating training
    • First the RPN is trained, and its proposals are used to train Fast R-CNN.
    • The network tuned by Fast R-CNN is then used to initialize the RPN, and this process is iterated.
  2. Approximate joint training (training with a single forward/backward pass)
    • The RPN and Fast R-CNN networks are merged into one network during training.
    • One forward pass
      • The backbone CNN produces the feature map
      • The RPN generates proposals
      • The proposals are fed directly into the Fast R-CNN detector, which performs classification and bbox regression
      • Training procedure
        • Forward pass
          • The RPN generates proposals
          • The Fast R-CNN detector takes these proposals as input and computes its loss
        • Backward pass
          • Both the RPN loss and the Fast R-CNN loss are back-propagated to the shared conv layers
          • In the shared layers, the two gradients are combined for the update

        → The gradient with respect to the proposal box coordinates produced by the RPN is not computed, which is why this scheme is only approximate.

  3. Non-approximate joint training
    • The bounding boxes predicted by the RPN are themselves functions of the input (input image → RPN → bounding box).
    • The RoI pooling layer in Fast R-CNN takes both the convolutional features and the predicted bounding boxes as input.
    • Approach 2 (approximate joint training) treats the proposal box coordinates generated by the RPN as constants and does not compute gradients with respect to them. In a theoretically complete joint training, however, the box coordinates are also network outputs, so the loss must be back-propagated through the box coordinates too.

    ⇒ Complete joint training requires propagating gradients through the box coordinates, which in turn requires an RoI pooling layer that is differentiable with respect to the box coordinates.

4-Step Alternating Training

  • A 4-step training algorithm is used to learn the shared features.
    1. Train the RPN (as described in Section 3.1.3).
    • This network is initialized with an ImageNet-pre-trained model and fine-tuned for the region proposal task.
    2. Train a separate detection network with Fast R-CNN, using the proposals generated by the step-1 RPN.
    • The detection network is also initialized with an ImageNet-pre-trained model.
    • At this point the two networks do not share convolutional layers.
    3. Use the detector network to initialize RPN training.
    • The backbone weights learned by Fast R-CNN (the detector network) initialize the RPN; the backbone is kept fixed and only the layers unique to the RPN are fine-tuned.

    ⇒ The two networks now share convolutional layers.

    4. Keeping the shared convolutional layers fixed, fine-tune the layers unique to Fast R-CNN.
    • Both networks now use the same convolutional layers and form a unified network.

3.3 Implementation Details

  • Both the region proposal network and the object detection network are trained and tested on images of a single scale.
  • Images are rescaled so that the shorter side is 600 pixels.
  • The longer side is scaled accordingly, so the aspect ratio is preserved.
  • On the rescaled images, the total stride of both ZF and VGG at the last convolutional layer is 16 pixels. Even with such a large stride the results are good.
    • ZF net, VGG net
      • Representative CNN backbones
    • Stride
      • How far a filter moves at each step as it slides across its input
  • Anchors
    • 3 scales and 3 aspect ratios are used: box areas of 128^2, 256^2, and 512^2 pixels, and aspect ratios of 1:1, 1:2, and 2:1.
    • The method does not use an image pyramid or a filter pyramid.

Anchor boxes that cross the image boundary

  • During training, all anchors that cross the image boundary are ignored.
    • This keeps them from contributing to the loss.
  • A typical 1000 × 600 image has about 20,000 anchors (≈ 60 × 40 × 9).
    • After removing the anchors that cross the image boundary, only about 6,000 anchors per image remain.
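The anchor counts can be reproduced approximately with a short script. This is a rough sketch under assumptions: the feature-map size is taken as the image size divided by the stride (rounded down), anchors are centred on feature-map cells, and the function name `count_anchors` is hypothetical. The number of inside-image anchors depends on these conventions, so it need not match the paper's figure of about 6,000.

```python
import numpy as np

def count_anchors(img_w=1000, img_h=600, stride=16,
                  scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Count all anchors and those lying fully inside the image."""
    feat_w, feat_h = img_w // stride, img_h // stride
    cx = (np.arange(feat_w) + 0.5) * stride
    cy = (np.arange(feat_h) + 0.5) * stride
    cx, cy = np.meshgrid(cx, cy)
    total, inside = 0, 0
    for s in scales:
        for r in ratios:                 # r = height / width, area stays s * s
            w = s / np.sqrt(r)
            h = s * np.sqrt(r)
            keep = ((cx - w / 2 >= 0) & (cy - h / 2 >= 0) &
                    (cx + w / 2 <= img_w) & (cy + h / 2 <= img_h))
            total += cx.size
            inside += int(keep.sum())
    return total, inside

total, inside = count_anchors()
print(total)    # 62 * 37 * 9 = 20646, close to the "about 20,000" above
print(inside)   # anchors that do not cross the image boundary
```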

  • Some RPN proposals overlap heavily with each other.
  • To reduce this redundancy, the paper applies non-maximum suppression (NMS) to the proposal regions based on their cls scores (whether each anchor is an object or not).
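Greedy NMS over scored proposals can be sketched as below. The implementation is a generic NumPy version, not the authors' code; the IoU threshold of 0.7 is the value the paper reports for RPN proposals, and boxes are assumed to be (x1, y1, x2, y2).

```python
import numpy as np

def nms(boxes, scores, iou_thr=0.7):
    """Greedy non-maximum suppression; returns the indices of the kept boxes."""
    order = np.argsort(-scores)          # highest score first
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        # Intersection of the current best box with all remaining boxes.
        iw = np.clip(np.minimum(boxes[i, 2], boxes[rest, 2]) - np.maximum(boxes[i, 0], boxes[rest, 0]), 0, None)
        ih = np.clip(np.minimum(boxes[i, 3], boxes[rest, 3]) - np.maximum(boxes[i, 1], boxes[rest, 1]), 0, None)
        inter = iw * ih
        iou = inter / (areas[i] + areas[rest] - inter)
        order = rest[iou <= iou_thr]     # drop boxes that overlap the kept box too much
    return keep

boxes = np.array([[0, 0, 10, 10], [0, 0, 10, 9], [20, 20, 30, 30]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))  # [0, 2]
```

In the toy example the second box overlaps the first with IoU 0.9 and a lower score, so it is suppressed, while the third box does not overlap and is kept.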

4 Experiments

4.1 Experiments on PASCAL VOC

  • Faster R-CNN is evaluated on the PASCAL VOC 2007 detection benchmark.
  • The dataset consists of about 5k trainval images and 5k test images over 20 object categories.

ImageNet pre-trained networks

  • ZF net
    • the "fast" version
    • 5 convolutional layers
    • 3 fully-connected layers
  • VGG-16 model
    • 13 convolutional layers
    • 3 fully-connected layers

⇒ Accuracy is measured by mAP (mean Average Precision), the standard final metric for object detection.

mAP (mean Average Precision)

  • The Average Precision is computed per class and then averaged over all classes.
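As a rough illustration of the definition, the sketch below computes AP per class as the area under an interpolated precision-recall curve and then averages over classes. It assumes each detection has already been marked as a true or false positive; the official PASCAL VOC 2007 protocol uses 11-point interpolation instead, so the numbers here would not match the benchmark exactly. The scores and class names are made up.

```python
import numpy as np

def average_precision(scores, is_true_positive, n_gt):
    """AP for one class: area under the interpolated precision-recall curve."""
    order = np.argsort(-np.asarray(scores, dtype=float))
    tp = np.asarray(is_true_positive, dtype=float)[order]
    cum_tp = np.cumsum(tp)
    cum_fp = np.cumsum(1.0 - tp)
    recall = cum_tp / n_gt
    precision = cum_tp / (cum_tp + cum_fp)
    # Make precision monotonically non-increasing in recall (standard interpolation).
    precision = np.maximum.accumulate(precision[::-1])[::-1]
    # Sum precision over the recall increments.
    prev_recall = np.concatenate([[0.0], recall[:-1]])
    return float(np.sum((recall - prev_recall) * precision))

def mean_average_precision(per_class_ap):
    """mAP: the mean of the per-class AP values."""
    return float(np.mean(per_class_ap))

ap_cat = average_precision([0.9, 0.8, 0.7], [1, 0, 1], n_gt=2)
ap_dog = average_precision([0.6, 0.5], [1, 1], n_gt=2)
print(ap_cat, ap_dog, mean_average_precision([ap_cat, ap_dog]))
```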

  • The results table in the paper (Table 2) shows Fast R-CNN trained and tested with different region proposal methods.
  • In that table, SS (Selective Search) reaches an mAP of 58.7%, while RPN with Fast R-CNN reaches 59.9%.
  • Using the RPN is also much faster than using SS or EB (EdgeBoxes), because the convolutional computation is shared.

Ablation Experiments on RPN

Several ablation studies were carried out to investigate how the RPN behaves as a proposal method.

Ablation study?

→ An ablation study removes or changes components and features of a model step by step and observes how performance changes, in order to find the elements that affect performance the most.

  1. Effect of sharing convolutional layers between the RPN and the Fast R-CNN detection network
    • Training is stopped after the second step of the 4-step training process.
    • With separate (unshared) networks, the result drops slightly to 58.7%.

    ⇒ The observation is that performance improves in the third step, when the detector-tuned features are used to fine-tune the RPN.

  2. Disentangling the influence of the RPN on the Fast R-CNN detection network
    • For this purpose, a Fast R-CNN model is trained with 2,000 SS proposals and the ZF net.
    • This detector is frozen, and the mAP is evaluated while changing the proposal regions.
    • In these ablation experiments, the RPN does not share features with the detector.
    • Replacing Selective Search with 300 RPN proposals gives an mAP of 56.8%. The drop in mAP is due to the mismatch between training and testing proposals.
    • The RPN still gives a competitive result (55.1%) when only the top-ranked 100 proposals are used, which indicates that the top-ranked RPN proposals are accurate.

Role of the cls output

  • Removing the cls layer = proposals have no scores

    ⇒ Neither NMS (non-maximum suppression) nor ranking is possible, so N proposals are sampled at random.

Results:

  • N = 1000 → mAP nearly unchanged (55.8%)
    • When many proposals are used, ranking matters little.
  • N = 100 → mAP drops sharply (44.6%)
    • When only a few proposals are used, ranking by cls score is important for keeping accuracy.

⇒ The cls scores have a large effect on the accuracy of the highest-ranked proposals.

Role of the reg output

  • Removing the reg layer = the anchor boxes are used directly as proposals

Results:

  • mAP drops from 56.8% to 52.1% (with 300 proposals)
  • Anchor boxes with multiple scales/aspect ratios are not accurate enough on their own
  • bbox regression is essential for refining the box position

⇒ reg is the key to precise proposals

Effect of changing the backbone

  • RPN with ZF net → mAP = 56.8%
  • RPN with VGG-16 → mAP = 59.2%
  • The detector is the same in both cases: the one trained with SS (Selective Search) + ZF

Results:

  • The stronger the backbone, the better the quality of the RPN proposals
  • RPN + ZF is already competitive with SS

⇒ RPN + VGG can be expected to do better than SS

Performance of VGG-16

Table 3 shows the results of VGG-16 for both proposal and detection.

  • With RPN + VGG, the mAP is 68.5% even without sharing features, slightly higher than the SS baseline.
  • With shared features, the result is 69.9%.
  • When the RPN and the detection network are additionally trained on the PASCAL VOC 2007 and 2012 datasets together, the mAP is 73.2%.
  • Table 4 reports the results on the PASCAL VOC 2012 test set, and Tables 6 and 7 give the detailed numbers.

Table 5

  • Running time comparison
    • Selective Search (SS): 1 to 2 seconds per image (very slow)
    • Fast R-CNN + VGG-16:
      • With 2,000 SS proposals: 320 ms
      • With the SVD optimization applied: 223 ms
    • Faster R-CNN (RPN + VGG-16)
      • Total: 198 ms
      • Thanks to shared conv features, the RPN itself takes only 10 ms
    • ZF net backbone
      • Reaches 17 fps

⇒ Feature sharing and the smaller number of proposals give a large speed-up over the SS-based system.
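The per-image times and the frame rates quoted in these notes are related by a simple conversion, sketched here (the helper name `fps` is made up):

```python
def fps(ms_per_image):
    """Convert per-image latency in milliseconds to frames per second."""
    return 1000.0 / ms_per_image

print(round(fps(198), 2))    # 5.05 -> the "5 fps" quoted for Faster R-CNN with VGG-16
print(round(fps(223), 2))    # 4.48 -> Fast R-CNN with SVD, before adding any proposal time
print(round(1000 / 17, 1))   # 58.8 -> milliseconds per image at 17 fps (ZF)
```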

Anchor settings (Table 8)

  • Default: 3 scales × 3 aspect ratios → mAP = 69.9%
  • Only one anchor per position: mAP drops by 3 to 4 points
  • 3 scales + 1 aspect ratio: 69.8% (almost the same)
  • 1 scale + 3 aspect ratios: higher mAP than with a single anchor

⇒ It is better to vary both scale and aspect ratio.

Effect of λ (Table 9)

  • Default λ = 10 → the cls term and the reg term have similar magnitude after normalization
  • Varying λ from 1 to 100 → performance changes by only about 1%

⇒ The result is insensitive to λ over a wide range

Recall-to-IoU analysis

  • Recall-to-IoU metric
    • How many ground-truth boxes are covered by the proposals at or above a given IoU threshold?
  • This metric is only loosely related to the final mAP → it is more a diagnostic of proposal quality
  • Experiment (Figure 4)
    • Methods compared: RPN, SS, EdgeBoxes (EB)
    • The number of proposals is reduced from 2000 → 1000 → 300

⇒ The recall of the RPN stays stable even with few proposals → efficient

One-stage vs Two-stage

  • One-stage (OverFeat style):
    • Class-specific detection in a single pass (based on sliding windows)
    • Location and class are predicted together in one stage
    • Experimental setting:
      • Dense sliding windows (3 scales × 3 aspect ratios)
      • Fast R-CNN directly predicts class scores and regresses boxes from these windows
      • A version with a 5-scale image pyramid is also tested
  • Two-stage (Faster R-CNN):
    • Stage 1: class-agnostic RPN → generates proposals
    • Stage 2: Fast R-CNN predicts class and bbox from the proposals
    • RoI pooling extracts features that fit the proposal locations well

Results (Table 10, ZF backbone):

  • One-stage: mAP = 53.9%
  • Two-stage: mAP = 58.7% (4.8 points higher)
  • Speed: the one-stage system is actually slower, because it has many more proposals to process
  • Earlier work ([2], [39]) also reports a drop of about 6% when SS is replaced by sliding windows

→ The two-stage structure (proposal → refined classification/regression) is more accurate and more efficient than the sliding-window-based one-stage system

Experiments on MS COCO

Dataset

  • MS COCO: 80 object categories
  • Data used
    • train: 80k images
    • val: 40k images
    • test-dev: 20k images
  • Metric: mAP

Results (Table 11)

mAP@0.5 vs mAP@[.5, .95]

| mAP@0.5 | mAP@[.5, .95] |
| --- | --- |
| IoU threshold = 0.5 | Average over IoU thresholds from 0.5 to 0.95 in steps of 0.05 |
| Traditional metric used for PASCAL VOC | Standard COCO metric |
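The difference between the two metrics comes down to which IoU thresholds are averaged. The sketch below only illustrates that averaging; `toy_curve` is a made-up function standing in for "mAP measured at a given IoU threshold" and does not reflect any real detector.

```python
import numpy as np

# The ten IoU thresholds 0.50, 0.55, ..., 0.95 used by the COCO-style metric.
thresholds = np.linspace(0.5, 0.95, 10)

def coco_style_map(map_at_threshold):
    """Average a function (IoU threshold -> mAP) over the ten thresholds."""
    return float(np.mean([map_at_threshold(t) for t in thresholds]))

def toy_curve(t):
    """Made-up curve for illustration: mAP falls linearly as the threshold tightens."""
    return 0.8 - 0.8 * (t - 0.5)

print(len(thresholds))              # 10
print(toy_curve(0.5))               # 0.8, which plays the role of mAP@0.5
print(coco_style_map(toy_curve))    # about 0.62, which plays the role of mAP@[.5, .95]
```

Since accuracy can only stay the same or fall as the IoU threshold tightens, mAP@[.5, .95] is never higher than mAP@0.5 for the same model.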

  • Fast R-CNN baseline
    • mAP@0.5 = 39.3%
    • mAP@[.5, .95] = 19.3% (on par with the original Fast R-CNN results)
  • Faster R-CNN (trained on the train set)
    • mAP@0.5 = 42.1%
    • mAP@[.5, .95] = 21.5%
  • Faster R-CNN (trained on the train + val sets)
    • mAP@0.5 = 42.7%
    • mAP@[.5, .95] = 21.9%

📌 Key takeaways

  1. In the COCO experiments, an extra anchor scale and a wider negative-sample range improve performance
  2. Compared with Fast R-CNN, Faster R-CNN improves mAP@0.5 by 2.8 points and mAP@[.5, .95] by 2.2 points
  3. The RPN is especially effective at improving localization accuracy at higher IoU thresholds

What if the backbone of Faster R-CNN is replaced with a stronger network?

Performance comparison (VGG-16 → ResNet-101)

  • Dataset: MS COCO val set

| Backbone | mAP@0.5 | mAP@[.5, .95] |
| --- | --- | --- |
| VGG-16 | 41.5% | 21.2% |
| ResNet-101 | 48.4% | 27.2% |

Effect of COCO training data on PASCAL VOC performance

Background

  • MS COCO is much larger than PASCAL VOC and has more classes.
  • The COCO categories are a superset of the VOC categories.

⇒ A COCO-trained model can be applied to VOC directly.

Experiment 1: COCO model → evaluated directly on VOC (no fine-tuning)

  • Result: VOC 2007 test mAP = 76.1%
  • For comparison, training on VOC 07 + 12 data only: 73.2%

⇒ Even without using any VOC data, the model trained only on COCO outperforms the model trained on VOC data.

Experiment 2: fine-tuning the COCO model on VOC

  • The COCO model replaces the ImageNet-pre-trained model as the initial weights.
  • The Faster R-CNN system is then fine-tuned on VOC data.
  • Result: VOC 2007 test mAP = 78.8%

5 Conclusion

  • The paper presents the RPN for efficient and accurate region proposal generation.
  • By sharing convolutional features, the cost of the region proposal step is greatly reduced.
  • The method is a deep-learning-based object detection system that can detect at near real-time rates.
This post is licensed under CC BY 4.0 by the author.