PR [Vision-Language Models for Vision Tasks: A Survey]

Vision-Language Models for Vision Tasks: A Survey

Abstract

1 Introduction

  • ์ด๋ฏธ์ง€ ์ธ์‹ (Visual Recognition)์€ Computer Vision ์—ฐ๊ตฌ๋ถ„์•ผ์—์„œ ๋‹ค์–‘ํ•œ ์‚ฐ์—…์—์„œ์˜ ์ ์šฉ์„ ์œ„ํ•œ ๊ธฐ์ดˆ์ด๋‹ค.
  • ๋”ฅ๋Ÿฌ๋‹์˜ ๋“ฑ์žฅ์œผ๋กœ ์ด๋ฏธ์ง€ ์ธ์‹ ์—ฐ๊ตฌ๋Š” ์—„์ฒญ ์„ฑ๊ณผ๋ฅผ ์ด๋ฃจ์—ˆ๋‹ค. (ํ•˜์ง€๋งŒ, ๋‹ค์Œ๊ณผ ๊ฐ™์€ ๊ณผ์ œ๋ฅผ ๋‚จ๊ฒผ๋‹ค.)
    • ์•„๋ฌด๊ฒƒ๋„ ์—†๋Š” ์ƒํƒœ (from scratch)์—์„œ๋ถ€ํ„ฐ ๋”ฅ๋Ÿฌ๋‹ ๋ชจ๋ธ์„ ํ•™์Šต์‹œํ‚ค๋Š” ๊ฒƒ์€ ๊ต‰์žฅํžˆ ์‹œ๊ฐ„์ด ์˜ค๋ž˜ ๊ฑธ๋ฆฐ๋‹ค.
    • DNN์„ ํ•™์Šต์‹œํ‚ค๊ธฐ ์œ„ํ•œ ๋ฐ์ดํ„ฐ์…‹์„ ํ™•๋ณดํ•˜๊ธฐ๊ฐ€ ํž˜๋“ค๋‹ค.
  • Pre-Training, Fine-Tuning and Prediction ๋ฐฉ์‹์˜ ๋“ฑ์žฅ
    • DNN์„ ๋จผ์ € ์—„์ฒญ๋‚œ ์–‘์˜ ๋ฐ์ดํ„ฐ๋ฅผ ํ†ตํ•˜์—ฌ ์‚ฌ์ „ ํ•™์Šตํ•˜๋ฉฐ, ์‚ฌ์ „ ํ•™์Šต (Pre-Trained)๋œ ๋ชจ๋ธ์€ ํŠน์ • ๋ฌธ์ œ์— ๋งž์ถฐ์„œ Fine-Tuned ๋œ๋‹ค.

    โ‡’ ์œ„์—์„œ ์†Œ๊ฐœํ•œ ๋”ฅ๋Ÿฌ๋‹์˜ ๋“ฑ์žฅ์ด ๋‚จ๊ธด ๊ณผ์ œ๋ฅผ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•œ ๊ณผ์ •์„ ๊ฐ€์†ํ™”! (์™„์ „ํžˆ ํ•ด๊ฒฐํ•œ ๊ฒƒ์€ ์•„๋‹ˆ๊ณ , ์—ฌ์ „ํžˆ ํŠน์ • ๋ฌธ์ œ์— (task-specific) ๋งž์ถฐ fine-tuning ํ•˜๋Š” ๊ฒƒ๊ณผ, fine-tuning์„ ํ•˜๊ธฐ ์œ„ํ•œ ๋ฐ์ดํ„ฐ๋ฅผ ํ™•๋ณดํ•˜๋Š” ๊ฒƒ์— ๋Œ€ํ•œ ๊ฐœ์„ ์ ์€ ์•„์ง ์กด์žฌ)

  • Vision-Language Model Pre-training and Zero-shot Prediction
    • ํ•ด๋‹น ๋ฐฉ์‹์˜ ๋“ฑ์žฅ์œผ๋กœ ์ธํ•ด VLM ๋ชจ๋ธ์€ (์ด๋ฏธ์ง€-ํ…์ŠคํŠธ) ์Œ์œผ๋กœ ์ด๋ฃจ์–ด์ง„ ๋Œ€๊ทœ๋ชจ ๋ฐ์ดํ„ฐ ์…‹์„ ํ†ตํ•ด ์‚ฌ์ „ํ•™์Šต๋˜๋ฉฐ, fine-tuning ๊ณผ์ • ์—†์ด๋„ ๋‹ค์–‘ํ•œ task(๋ถ„์•ผ?)์— ์ ์šฉ์ด ๊ฐ€๋Šฅํ–ˆ๋‹ค.
    • VLM์˜ pre-training ๊ณผ์ •์ด CLIP ๋ชจ๋ธ๊ณผ ๊ฐ™์€ โ€˜๋Œ€์กฐํ•™์Šตโ€™ ๋ฐฉ์‹์„ ์‚ฌ์šฉํ•˜์—ฌ vision-language์˜ ์œ ์‚ฌ์„ฑ์„ ์ž˜ ํฌ์ฐฉํ•˜๋ฉฐ, zero-shot prediction์— ํฐ ๊ธฐ์—ฌ
      • โ€˜๋Œ€์กฐํ•™์Šตโ€™ : ์„œ๋กœ ๋งž๋Š” ์ด๋ฏธ์ง€-ํ…์ŠคํŠธ ์Œ์€ ์ตœ๋Œ€ํ•œ ๊ฐ€๊น๊ฒŒ, ๋งž์ง€ ์•Š๋Š” ์Œ์€ ์ตœ๋Œ€ํ•œ ๋ฉ€๊ฒŒ ์„ค์ •
      • zero-shot prediction : ํ›ˆ๋ จ ๊ณผ์ •์—์„œ ํ•œ ๋ฒˆ๋„ ๋ฐฐ์šด ์  ์—†๋Š” ํด๋ž˜์Šค(class)๋‚˜ ๋ ˆ์ด๋ธ”(label)์— ์†ํ•˜๋Š” ๋ฐ์ดํ„ฐ์— ๋Œ€ํ•ด ์˜ˆ์ธกํ•˜๋Š” ๊ฒƒ
    • ์ „์ด ํ•™์Šต (Transfer Learning), ์ง€์‹ ์ฆ๋ฅ˜ (Knowledge Distillation)๊ณผ ๊ฐ™์€ ๋‹ค์–‘ํ•œ ์—ฐ๊ตฌ์— ๊ธฐ์—ฌ
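The zero-shot prediction idea above can be sketched in a few lines: compare an image embedding against the embeddings of class prompts ("a photo of a {class}") and pick the closest one. This is a minimal illustration with made-up toy vectors, not CLIP's actual encoders or weights:

```python
import numpy as np

def zero_shot_classify(image_emb, text_embs):
    """Pick the class whose prompt embedding is most similar to the image.

    image_emb: (d,) image feature; text_embs: (num_classes, d) features of
    prompts such as "a photo of a {class}". All values here are illustrative.
    """
    # L2-normalise so the dot product equals cosine similarity
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    sims = txt @ img                      # one cosine similarity per class
    return int(np.argmax(sims)), sims

img = np.array([1.0, 0.0])                       # toy image feature
prompts = np.array([[0.9, 0.1], [0.0, 1.0]])     # class 0 aligns with the image
pred, _ = zero_shot_classify(img, prompts)       # predicts class 0
```

No fine-tuning is involved: classification reduces to a nearest-prompt lookup in the shared embedding space.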

2 Background

2.1 Training Paradigms for Visual Recognition

์ด๋ฏธ์ง€ ์ธ์‹ ๊ธฐ์ˆ ์˜ ๋ฐœ์ „ 5๋‹จ๊ณ„

  1. Traditional Machine Learning and Prediction
  2. Deep Learning from Scratch and Prediction
  3. Supervised Pre-training, Fine-tuning and Prediction
  4. Unsupervised Pre-training, Fine-tuning and Prediction
  5. Vision-Language Model Pre-training and Zero-shot Prediction

2.1.1 Traditional Machine Learning and Prediction

  • ๋”ฅ๋Ÿฌ๋‹ ์ด์ „์˜ Visual Recognition

โ†’ ์‚ฌ๋žŒ์ด โ€˜๋ฌด์—‡์„ ๋ด์•ผํ• ์ง€โ€™ ๋ฏธ๋ฆฌ ์ •ํ•ด์ฃผ๊ณ  ํ•™์Šต์‹œํ‚ค๋Š” ๋ฐฉ์‹์œผ๋กœ ์—ฐ๊ตฌ

2.1.2 Deep Learning from Scratch and Prediction

  • ๋”ฅ๋Ÿฌ๋‹์˜ ๋“ฑ์žฅ
    • End - to - End ํ•™์Šต ๊ฐ€๋Šฅ (์ž…๋ ฅ ๋ฐ์ดํ„ฐ๊ฐ€ ์ฃผ์–ด์ง€๋ฉด ์‚ฌ๋žŒ์˜ ๊ฐœ์ž…์—†์ด ํ•˜๋‚˜์˜ ํ†ตํ•ฉ๋œ ์‹ ๊ฒฝ๋ง ๋ชจ๋ธ์ด ์ตœ์ข… ์ถœ๋ ฅ๊นŒ์ง€ ๋ชจ๋“  ๊ฒƒ์„ ์ฒ˜๋ฆฌ)

    ํ•œ๊ณ„์ โ“ย โ†’ ๋А๋ฆผ ์ˆ˜๋ ด ์†๋„ (ํ•™์Šต์†๋„?), ๋ฐ์ดํ„ฐ ์ˆ˜์ง‘์˜ ์–ด๋ ค์›€

2.1.3 Supervised Pre-Training, Fine-tuning and Prediction

  • ์ง€๋„ํ•™์Šต์„ ํ†ตํ•ด ์‚ฌ์ „ํ•™์Šต๋œ ๋ชจ๋ธ์„ Fine-Tuningํ•˜์—ฌ ๋ชฉ์ ์— ๋งž๊ฒŒ ๋ณ€ํ™”์‹œํ‚ค๋Š” ๋ฐฉ์‹์œผ๋กœ ๋ฐœ์ „

โ‡’ ๋ชจ๋ธ์˜ ํ•™์Šต์†๋„๋ฅผ ๊ฐ€์†ํ™”ํ•˜๊ณ , ํ•œ์ •๋œ ํ•™์Šต ๋ฐ์ดํ„ฐ๋ฅผ ๊ฐ€์ง€๊ณ ๋„ ์ข‹์€ ์„ฑ๋Šฅ์„ ๊ธฐ๋ก

2.1.4 Unsupervised Pre-Training, Fine-tuning & Prediction

  • ์‚ฌ์ „ ํ•™์Šต ๋‹จ๊ณ„์—์„œ ๋งŽ์€ ์–‘์˜ ๋ผ๋ฒจํ™”๋œ ๋ฐ์ดํ„ฐ๊ฐ€ ํ•„์š”ํ•œ ๊ฒƒ์„ ๋ง‰๊ธฐ ์œ„ํ•˜์—ฌ, ์ •๋‹ต์ด ์—†๋Š” ๋ฐ์ดํ„ฐ๋กœ๋ถ€ํ„ฐ ํ›ˆ๋ จํ•˜๊ณ , ํŠน์ง•์„ ๋ฝ‘๋Š” ๋น„์ง€๋„ํ•™์Šต ๋ฐฉ์‹์ด ๋“ฑ์žฅ

โ‡’ ์ดํ›„, ๋ชฉ์ ์— ๋งž๊ฒŒ Fine-Tuningํ•˜์—ฌ ์‚ฌ์šฉํ•˜๋Š” ๋ฐฉ์‹์œผ๋กœ ๋ฐœ์ „

2.1.5 VLM Pre-training and Zero-shot Prediction

  • ์ž์—ฐ์–ด ์ฒ˜๋ฆฌ ๋ถ„์•ผ์—์„œ ์˜๊ฐ์„ ๋ฐ›์•„, ์ธํ„ฐ๋„ท์—์„œ ์‰ฝ๊ฒŒ ์ˆ˜์ง‘ํ•  ์ˆ˜ ์žˆ๋Š” ์ด๋ฏธ์ง€-ํ…์ŠคํŠธ ์Œ์˜ ๋ฐ์ดํ„ฐ๋ฅผ ํ†ตํ•˜์—ฌ vision-language์— ๋Œ€ํ•œ ๊ณ ์ฐจ์›์ ์ธ ์ง€์‹์„ ํ•™์Šตํ•˜๊ณ , Zero-shot predictions์ด ๊ฐ€๋Šฅํ•ด์กŒ๋‹ค.
  • ์ดํ›„ VLM์„ ๋ฐœ์ „์‹œํ‚ค๊ธฐ ์œ„ํ•œ ์‹œ๋„
    • ๋‹ค์–‘ํ•œ ํ…์ŠคํŠธ-์ด๋ฏธ์ง€ ๋ฐ์ดํ„ฐ ํ™œ์šฉ
    • ํฌ๊ณ , ํ‘œํ˜„๋ ฅ์ด ํ’๋ถ€ํ•œ ๋ชจ๋ธ ๊ฐœ๋ฐœ
    • ์ƒˆ๋กœ์šด ์‚ฌ์ „ํ•™์Šต ๋ชฉํ‘œ ์„ค๊ณ„

2.2 Development of VLMs for Visual Recognition

์ด๋ฏธ์ง€ ์ธ์‹์„ ์œ„ํ•œ VLM ๋ชจ๋ธ์˜ ๋ฐœ์ „ ๊ณผ์ •

2.3 Relevant Surveys

  • Until now, surveys existed only for areas such as Visual Question Answering, Natural Language for Visual Reasoning, and Phrase Grounding.

💡 This paper covers:

  • Recent VLM pre-training methods
  • Two approaches for applying the knowledge learned by VLMs to visual recognition
  • Benchmarking of VLMs for visual recognition

3 VLM Foundations

3.1 Network Architectures

  • VLM ์‚ฌ์ „ํ•™์Šต์€ โ€˜์ด๋ฏธ์ง€์™€ ํ…์ŠคํŠธ์˜ ํ•ต์‹ฌ ํŠน์ง•์„ ์ž˜ ๋ฝ‘์•„๋‚ด๋Š” ๋”ฅ๋Ÿฌ๋‹ ๋„คํŠธ์›Œํฌ๋ฅผ ๋งŒ๋“œ๋Š” ๊ณผ์ •โ€™
  • ํ…์ŠคํŠธ/์ด๋ฏธ์ง€ ์ธ์ฝ”๋” : ํ…์ŠคํŠธ/์ด๋ฏธ์ง€ ์ƒ˜ํ”Œ์„ ์ž…๋ ฅ๋ฐ›์•„ ํ…์ŠคํŠธ/์ด๋ฏธ์ง€ ์ž„๋ฒ ๋”ฉ(์ˆซ์ž)์œผ๋กœ ๋ณ€ํ™˜

3.1.1 Architectures for Learning Image Features

  • ์–ด๋–ป๊ฒŒ ์ด๋ฏธ์ง€ ํŠน์ง• ํ•™์Šต?
    • CNN ๊ธฐ๋ฐ˜์˜ ์•„ํ‚คํ…์ณ, Transformer ๊ธฐ๋ฐ˜์˜ ์•„ํ‚คํ…์ณ

CNN-based Architectures

  • Convolutional networks such as VGG, ResNet, and EfficientNet (designed to learn image features)
    • ResNet: skip connections between convolutional blocks mitigate the vanishing/exploding gradient problem and make much deeper models trainable.
      • Skip connection: the output of an earlier layer is passed not just to the next layer but skips over several layers.
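A skip connection is easiest to see in code: the block computes a transformation f(x) and adds the input back, so the block only has to learn a residual. A minimal numpy sketch (two matrix multiplies stand in for the convolutions; shapes and weights are toy values):

```python
import numpy as np

def residual_block(x, w1, w2):
    """y = f(x) + x: the skip connection adds the input back onto the
    transformed output, so gradients can flow around the block."""
    relu = lambda v: np.maximum(v, 0.0)
    out = relu(x @ w1) @ w2   # two toy "conv" layers as matrix multiplies
    return out + x            # the skip connection

x = np.ones(4)
w_zero = np.zeros((4, 4))
# Even if the block's weights contribute nothing, the input passes through,
# which is why very deep stacks of such blocks remain trainable.
y = residual_block(x, w_zero, w_zero)
```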

vision-language ๋ชจ๋ธ๋ง๊ณผ ํŠน์ง•๋“ค์„ ๋” ์ž˜ ์ถ”์ถœํ•˜๊ธฐ ์œ„ํ•ด์„œ ๋‹ค์–‘ํ•œ ์—ฐ๊ตฌ๊ฐ€ ์ด๋ฃจ์–ด์กŒ๋‹ค.

  • ์˜ˆ์‹œ) ResNet-D
    • antialiased rect-2 blur pooling ์‚ฌ์šฉ
      • antialiased rect-2 blur pooling? โ‡’ ๋‹ค์šด์ƒ˜ํ”Œ๋ง ๊ณผ์ •์—์„œ ๋ฐœ์ƒํ•˜๋Š” ๊ณ„๋‹จ ํ˜„์ƒ์„ ๋ฐฉ์ง€ํ•˜๋Š” ๊ธฐ์ˆ 
    • global average pooling(์ „์—ญ ํ‰๊ท  ํ•„ํ„ฐ) โ†’ attention pooling(์–ดํ…์…˜ ํ’€๋ง) (transformer multi-head attention) ๋Œ€์น˜
      • ์ด๋ฏธ์ง€์˜ ์ตœ์ข… ํŠน์ง• ๋งต์—์„œ ์ค‘์š”ํ•œ ๋ถ€๋ถ„์— ๋” ๋†’์€ ๊ฐ€์ค‘์น˜๋ฅผ ๋ถ€์—ฌ

Transformer-based Architectures

๋Œ€ํ‘œ์ ์ธ Transformer ์•„ํ‚คํ…์ณ๋ฅผ ๊ฐ€์ง„ ์ด๋ฏธ์ง€ ํ•™์Šต ๋ชจ๋ธ์ธ ViT

  • multi-head self-attention layer์™€ feed-forward network๋กœ ๊ตฌ์„ฑ๋œ Transfomer ๋ธ”๋ก๋“ค์„ ์ธต์ธต์ด ์Œ“๋Š”๋‹ค.
  • multi-head self-attention layer : ์ด๋ฏธ์ง€ ์ •๋ณด๋“ค ๊ฐ„์˜ ๊ด€๊ณ„๋ฅผ ํŒŒ์•…
  • feed-forward network : multi-head self-attention layer์—์„œ ์–ป์€ ์ƒˆ๋กœ์šด ์ •๋ณด๋“ค์„ ๊ฐœ๋ณ„์ ์œผ๋กœ ์ฒ˜๋ฆฌ/๋ณ€ํ™˜
    1. ์ด๋ฏธ์ง€ ๋ถ„ํ•  (Split into patches)
    2. ๋ฒกํ„ฐํ™”, ์œ„์น˜ ์ •๋ณด ์ถ”๊ฐ€ (Linear Projection & Position Embedding)

๐Ÿ’ก ViT ์ˆ˜์ • ๋‚ด์šฉ : ์ค€๋น„๋œ ์ด๋ฏธ์ง€ ๋ฒกํ„ฐ๋“ค์„ Transformer ์ธ์ฝ”๋” ์ฒ˜๋ฆฌ ์ „์— ์ •๊ทœํ™” ๊ณ„์ธต ์ถ”๊ฐ€ (์ด ๋ถ€๋ถ„์—!)

  1. Transformer ์ธ์ฝ”๋” ์ฒ˜๋ฆฌ

3.1.2 Architectures for Learning Language Features

  • ์–ด๋–ป๊ฒŒ ํ…์ŠคํŠธ ํŠน์ง•๋“ค์„ ํ•™์Šต?
  • Transformer ์•„ํ‚คํ…์ณ๊ฐ€ ์ด๋ฅผ ๋‹ด๋‹นํ•˜๋ฉฐ, CLIP๊ณผ ๊ฐ™์€ ๋ชจ๋ธ๋“ค ๋˜ํ•œ ํ‘œ์ค€์ ์ธ Transformer ๊ตฌ์กฐ๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ ํ•œ๋‹ค.

3.2 VLM Pre-training Objectives

  • VLM์˜ ํ•ต์‹ฌ์œผ๋กœ์จ, VLM ์‚ฌ์ „ํ•™์Šต ๋ชฉํ‘œ๋Š” ํ’๋ถ€ํ•œ vision-language ์‚ฌ์ด์˜ ์ƒ๊ด€๊ด€๊ณ„๋ฅผ ์œ„ํ•ด ์„ค์ •๋˜์—ˆ์Œ
  • ํฌ๊ฒŒ Contrastive Objectives, generative objectives, alignment objectives๋กœ ๋‚˜๋‰œ๋‹ค.

3.2.1 Contrastive Objectives

Image Contrastive Learning

  • The goal is to learn discriminative image features
  • In the embedding space, pull a query close to its positive keys (the same image) and push it away from its negative keys (other images)

  • τ: a temperature hyperparameter that controls the strength of the learning signal

Image-Text Contrastive Learning

  • ์ด๋ฏธ์ง€ ํ…์ŠคํŠธ VLM ๋Œ€์กฐ ํ•™์Šต ๋ฐฉ์‹
  • ์ด๋ฏธ์ง€ โ†” ํ…์ŠคํŠธ ์–‘๋ฐฉํ–ฅ์œผ๋กœ ํ•™์Šต์ด ์ด๋ฃจ์–ด์ง€๋ฉฐ, image-text infoNCE ์†์‹ค์„ ์ตœ์†Œํ™”ํ•˜๋Š” ๋ฐฉํ–ฅ์œผ๋กœ ์ง„ํ–‰๋œ๋‹ค.
  • infoNCE โ‡’ ๋งŽ์€ ๋…ธ์ด์ฆˆ๋“ค๊ณผ์˜ ๋Œ€์กฐ๋ฅผ ํ†ตํ•ด ์ •๋ณด๋Ÿ‰์„ ์ตœ๋Œ€ํ™”ํ•˜๋Š” ์†์‹คํ•จ์ˆ˜ (์˜ค๋‹ต๋“ค ์‚ฌ์ด์—์„œ ํ•˜๋‚˜์˜ ์ •๋‹ต์„ ๊ตฌ๋ณ„ํ•˜๊ฒŒ ํ•˜์—ฌ ์˜ค๋‹ต๋“ค๊ณผ ์ตœ๋Œ€ํ•œ ๋งŽ์ด ๋น„๊ตํ•˜์—ฌ ๋งŽ์€ ์ •๋ณด๋Ÿ‰์„ ํš๋“)

Image-Text-Label Contrastive Learning

  • ์ด๋ฏธ์ง€-ํ…์ŠคํŠธ ๋ผ๋ฒจ ๋Œ€์กฐ ํ•™์Šต์€ ์œ„์˜ ์ด๋ฏธ์ง€ ํ…์ŠคํŠธ ๋Œ€์กฐ ํ•™์Šต์— โ€˜์ง€๋„โ€™ ๊ฐœ๋…์„ ๋”ํ•˜์—ฌ ๊ฐ•ํ™”ํ•˜์˜€๋‹ค.
  • ์›๋ž˜ ์ง (Image-Text)๋งŒ์„ ์ •๋‹ต์œผ๋กœ ์ทจ๊ธ‰ํ•˜๋˜ ๋ฐฉ์‹์—์„œ, ๊ฐ™์€ ํด๋ž˜์Šค (๋ผ๋ฒจ)์— ์†ํ•˜๋Š” ๋ชจ๋“  ์ด๋ฏธ์ง€์™€ ํ…์ŠคํŠธ๋ฅผ ์ •๋‹ต์œผ๋กœ ๊ฐ„์ฃผํ•˜๋„๋ก ๊ทœ์น™ ํ™•์žฅ!

3.2.2 Generative Objectives

  • ์ƒ์„ฑ ํ•™์Šต ๋ชฉํ‘œ (Generative Objectives)๋Š” ๋ชจ๋ธ์ด ์ด๋ฏธ์ง€/ํ…์ŠคํŠธ๋ฅผ ์ƒ์„ฑํ•˜๋Š” ๊ณผ์ •์—์„œ ์˜๋ฏธ์  ํŠน์ง•์„ ํ•™์Šต

Masked Image Modelling

  • Cross-Patch ๋ฐฉ์‹์œผ๋กœ ์ด๋ฏธ์ง€-ํ…์ŠคํŠธ์˜ ์œ ์‚ฌ๋„๋ฅผ ํ•™์Šต
    • Cross-Patch ๋ฐฉ์‹ : ์ž…๋ ฅ ์ด๋ฏธ์ง€์— ์—ฌ๋Ÿฌ ํŒจ์น˜๋“ค์„ ๋ฌด์ž‘์œ„๋กœ ์„ค์ •ํ•˜์—ฌ ์ด๋ฏธ์ง€๋ฅผ ์ผ๋ถ€๋ถ„ ๊ฐ€๋ฆฌ๊ณ , ์•ˆ๊ฐ€๋ ค์ง„ ๋ถ€๋ถ„์„ ํ† ๋Œ€๋กœ ๊ฐ€๋ ค์ง„ ๋ถ€๋ถ„์„ ์ฑ„์›Œ๋‚˜๊ฐ€๋Š” ๊ณผ์ •!

Masked Language Modelling

  • ์ž…๋ ฅ ํ…์ŠคํŠธ ํ† ํฐ๋“ค ์ค‘ ์ผ์ • ํผ์„ผํŠธ๋ฅผ ๋งˆ์Šคํ‚นํ•˜๊ณ , ๋งˆ์Šคํ‚น๋˜์ง€ ์•Š์€ ํ† ํฐ๋“ค์„ ํ† ๋Œ€๋กœ ๋‹ค์‹œ ์ฑ„์›Œ๋‚˜๊ฐ„๋‹ค.

Masked Cross-Modal Modelling

  • ์œ„์—์„œ ์„ค๋ช… 2๊ฐ€์ง€ ๋ฐฉ์‹์„ ํ†ตํ•ฉ
  • ์ด๋ฏธ์ง€-ํ…์ŠคํŠธ ์Œ ๋ฐ์ดํ„ฐ์—์„œ ์ด๋ฏธ์ง€, ํ…์ŠคํŠธ๋ฅผ ๊ฐ๊ฐ ๋ฌด์ž‘์œ„๋กœ ๋งˆ์Šคํ‚นํ•˜๊ณ , ๋งˆ์Šคํ‚น๋˜์ง€ ์•Š์€ ๋ถ€๋ถ„๋“ค์„ ํ† ๋Œ€๋กœ ๋‹ค์‹œ ์ฑ„์›Œ๋‚˜๊ฐ„๋‹ค.

Image-to-Text Generation

  • ์ด๋ฏธ์ง€์™€ ์ด์ „ ๋ฌธ๋งฅ์„ ํ†ตํ•˜์—ฌ ๋งค ์ˆœ๊ฐ„ ๋‹ค์Œ ์ •๋‹ต ๋‹จ์–ด๋ฅผ ๊ฐ€์žฅ ๋†’์€ ํ™•๋ฅ ๋กœ ์˜ˆ์ธกํ•˜๋„๋ก ํ›ˆ๋ จ์‹œํ‚จ๋‹ค.

3.2.3 Alignment Objectives

  • Alignment objectives place image-text pairs in an embedding space and align them through global matching or region-word matching ('align' → make [image, text] share the same meaning)

Image-Text Matching (global matching)

  • global ์ƒ๊ด€๊ด€๊ณ„๋ฅผ ์•„๋ž˜์™€ ๊ฐ™์ด ์„ค์ •
    • score ํ•จ์ˆ˜ S๋ฅผ ํ†ตํ•˜์—ฌ ์ด๋ฏธ์ง€์™€ ํ…์ŠคํŠธ ์‚ฌ์ด์˜ ์ •๋ ฌํ™•๋ฅ (์ด๋ฏธ์ง€์™€ ํ…์ŠคํŠธ๊ฐ€ ์„œ๋กœ ์ง์ด ๋งž์„ ๊ฐ€๋Šฅ์„ฑ)์„ ์ด์ง„ ๋ถ„๋ฅ˜ ์†์‹ค๋กœ ์ธก์ •ํ•œ๋‹ค. (์•„๋ž˜ ์‹์—์„œ p๊ฐ€ 1์ด๋ฉด ์ด๋ฏธ์ง€์™€ ํ…์ŠคํŠธ๊ฐ€ ์„œ๋กœ ์ง์„ ์ด๋ฃจ๊ณ , 0์ด๋ฉด ์ด๋ฃจ์ง€ ์•Š์Œ)
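That binary classification loss can be sketched directly: push the score S(image, text) through a sigmoid to get a match probability, then apply binary cross-entropy against the label p ∈ {0, 1}. The score values below are toy numbers:

```python
import numpy as np

def itm_loss(score, label):
    """Binary cross-entropy on the matching score S(image, text):
    label = 1 for a paired (image, text), 0 for a mismatched one."""
    prob = 1.0 / (1.0 + np.exp(-score))  # sigmoid: score -> P(match)
    return -(label * np.log(prob) + (1 - label) * np.log(1 - prob))

loss_match = itm_loss(score=4.0, label=1)     # true pair scored high: low loss
loss_mismatch = itm_loss(score=4.0, label=0)  # mismatch wrongly scored high
```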

Region-Word Matching (local matching)

  • Measures local cross-modal correlations (between a part of the image and a part of the text), connecting specific words/phrases to specific regions/objects
    • Used in areas such as object detection

3.3 VLM Pre-Training Frameworks

  • two-tower ํ”„๋ ˆ์ž„์›Œํฌ๊ฐ€ VLM ์‚ฌ์ „ํ•™์Šต์— ๊ฐ€์žฅ ๋งŽ์ด ํ™œ์šฉ๋œ๋‹ค.
    • ์ด๋ฏธ์ง€์™€ ํ…์ŠคํŠธ๊ฐ€ ๊ฐ๊ฐ์˜ ์ธ์ฝ”๋”์—์„œ ๊ฐœ๋ณ„์ ์œผ๋กœ ์ธ์ฝ”๋”ฉ๋จ
  • two-leg ํ”„๋ ˆ์ž„์›Œํฌ (์ด๊ฒƒ๋„ ์ธ์ฝ”๋”๊ฐ€ ๊ฐ๊ฐ ๋‚˜๋‰˜์–ด์žˆ๊ธดํ•จ)
    • multi-modal fusion ๊ณ„์ธต์„ ์ถ”๊ฐ€ํ•˜์—ฌ ํŠน์ง•๋“ค ๊ฐ„์˜ ์ƒํ˜ธ์ž‘์šฉ์„ ๊ฐ€๋Šฅํ•˜๊ฒŒ ํ•œ๋‹ค.
  • one-tower ํ”„๋ ˆ์ž„์›Œํฌ
    • ์ด๋ฏธ์ง€์™€ ํ…์ŠคํŠธ๋ฅผ ํ•˜๋‚˜์˜ ์ธ์ฝ”๋”์—์„œ ํ†ตํ•ฉํ•˜์—ฌ ์ฒ˜๋ฆฌํ•œ๋‹ค.

3.4 Evaluation Setups and Downstream Tasks

  • VLM ๋ชจ๋ธ๋“ค์„ ํ‰๊ฐ€ํ• ๋•Œ ๊ฐ€์žฅ ํ”ํžˆ ์‚ฌ์šฉ๋˜๋Š” ๋ฐฉ์‹ ์†Œ๊ฐœ

3.4.1 Zero-shot Prediction

  • zero-shot ์˜ˆ์ธก์€ ์‚ฌ์ „ ํ•™์Šต๋œ ๋ชจ๋ธ์„ ํŒŒ์ธํŠœ๋‹ํ•˜์ง€ ์•Š๊ณ ๋„, ๋‹ค์–‘ํ•œ downstream ์ž‘์—…๋“ค(์•„๋ž˜์— ๋‚˜์˜ค๋Š”)์— ์ ์šฉ์ด ๊ฐ€๋Šฅํ•œ์ง€์— ๋Œ€ํ•ด ํ‰๊ฐ€ํ•˜๋Š” ๊ฐ€์žฅ ํ”ํ•œ ๋ฐฉ์‹์ด๋‹ค.

Image Classification

Semantic Segmentation

Object Detection

Image-Text Retrieval

  • Retrieve the relevant text given an image / retrieve the relevant images given a text

3.4.2 Linear Probing

  • ์‚ฌ์ „ํ•™์Šต๋œ VLM ๋ชจ๋ธ์„ ์–ผ๋ ค์„œ(์ ˆ๋Œ€ ์ˆ˜์ •ํ•˜์ง€ ์•Š์€ ์ฑ„๋กœ) VLM ์ž์ฒด๊ฐ€ ์–ผ๋งˆ๋‚˜ ํŠน์ง•๋“ค์„ ์ž˜ ์ถ”์ถœํ•˜๋Š”์ง€ ํ‰๊ฐ€ํ•˜๋Š” ๋ฐฉ์‹
  • ์ด๋ฏธ์ง€ ๋ถ„๋ฅ˜๋‚˜, ํ–‰๋™์ธ์‹ ๋ถ€๋ถ„์— ์ฃผ๋กœ ํ™œ์šฉ๋œ๋‹ค.
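Concretely, linear probing trains only a linear classifier on top of the frozen features. A minimal sketch with toy 2-D "features" standing in for the frozen encoder's output (a real probe would use logistic regression over high-dimensional embeddings):

```python
import numpy as np

def linear_probe(features, labels, lr=0.1, steps=200):
    """Train only a linear softmax classifier (W, b) on frozen features.
    The encoder that produced `features` is never updated."""
    n, d = features.shape
    k = labels.max() + 1
    W, b = np.zeros((d, k)), np.zeros(k)
    onehot = np.eye(k)[labels]
    for _ in range(steps):
        logits = features @ W + b
        p = np.exp(logits - logits.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)
        grad = (p - onehot) / n            # softmax cross-entropy gradient
        W -= lr * features.T @ grad
        b -= lr * grad.sum(axis=0)
    return W, b

feats = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
labels = np.array([0, 0, 1, 1])
W, b = linear_probe(feats, labels)
preds = np.argmax(feats @ W + b, axis=1)   # probe accuracy measures the features
```

The probe's accuracy is attributed to the frozen features: if a linear layer suffices to separate the classes, the VLM's representations are considered good.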

4 Datasets

4.1 Datasets for Pre-Training VLMs

  • crowd-labelled ๋ฐ์ดํ„ฐ์…‹๋ณด๋‹ค image-text ๋ฐ์ดํ„ฐ ์…‹์ด ๋” ํ™•๋ณดํ•˜๊ธฐ ์‰ฝ๊ณ , ๊ฐ€๊ฒฉ๋„ ์ €๋ ด
  • image-text ๋ฐ์ดํ„ฐ์…‹ ๋ง๊ณ ๋„, ๋ชจ๋ธ์ด ์ด๋ฏธ์ง€์˜ ์ง€์—ญ์ ์ธ ํŠน์ง•๋“ค์„ ์ž˜ ์ดํ•ดํ•˜๊ธฐ ์œ„ํ•ด axuxiliary(๋ถ€๊ฐ€์˜) ๋ฐ์ดํ„ฐ์…‹์„ ํ™œ์šฉํ•˜๊ธฐ ์œ„ํ•œ ์—ฐ๊ตฌ๋“ค๋„ ์ง„ํ–‰๋˜์—ˆ๋‹ค.

5 Vision-Language Model Pre-Training

  • VLM Pre-Training์€ 3๊ฐ€์ง€์˜ ํ•™์Šต ๋ชฉํ‘œ๋ฅผ ๊ฐ€์ง€๊ณ  ์—ฐ๊ตฌ๊ฐ€ ์ด๋ฃจ์–ด์กŒ๋‹ค.
  • Contrastive Objectives, Generative Objectives, Alignment Objectives

5.1 VLM Pre-Training with Contrastive Objectives

  • Contrastive objectives are designed to learn discriminative image and text features

5.1.1 Image Contrastive Learning

  • ์ด๋ฏธ์ง€ ์–‘์‹์—์„œ ์ฐจ๋ณ„์ ์ธ ํŠน์ง•๋“ค์„ ํ•™์Šตํ•˜๋Š” ๊ฒƒ์ด ๋ชฉํ‘œ
    • ์ด ๋ฐฉ์‹์€ ์ด๋ฏธ์ง€ ๋ฐ์ดํ„ฐ๋ฅผ ์ตœ๋Œ€ํ•œ ํ™œ์šฉํ•˜๊ธฐ ์œ„ํ•ด ์ฃผ ๋ชฉํ‘œ ์™ธ์— ๋ถ€๊ฐ€์ ์ธ ๋ชฉํ‘œ๋กœ์จ ์ž์ฃผ ํ™œ์šฉ

5.1.2 Image-Text Contrastive Learning

  • Image-text contrastive learning pulls paired image-text embeddings close together and pushes mismatched image-text embeddings apart, thereby learning vision-language correlations
  • Example: CLIP
    • Measures similarity via the dot product of the image and text embedding vectors
    • Uses a symmetric image-text InfoNCE loss
      • Symmetric image-text InfoNCE? → (a score in the image→text direction plus a score in the text→image direction)

    • Influenced by CLIP, the ALIGN model scaled up pre-training with a massive 1.8 billion image-text pairs

      ⇒ That data contains many poorly matched, noisy pairs (a bet on data quantity for pre-training)

      → So a noise-robust contrastive learning method is used

    • Two research directions emerged: pre-training on sheer quantity as in ALIGN, and pre-training that extracts as much information as possible from smaller amounts of data
  • Methods that use limited data efficiently
    • DeCLIP: proposes nearest-neighbor supervision
      • Enables efficient pre-training from similar (image-text) pairs under limited data
    • OTTER: creates pseudo (image-text) pairs
      • Reduces the amount of data needed for pre-training
    • ZeroVL: makes the most of the data under limited-data conditions
      • Uses debiased data sampling and data augmentation
  • Methods that perform image-text contrastive learning across multiple semantic levels
    • FILIP: uses a 'local' scheme that compares each word against each image region one by one
    • PyramidCLIP: applies contrastive learning across multiple levels (from the global view of the image down to fine details), using both cross-level (vertical information exchange between levels) and peer-level (horizontal information exchange within the same level) interactions
  • Recent VLMs are also studied with data augmentation applied to image-text pair data
    • LA-CLIP and ALIP: use an LLM to generate more detailed, richer captions (text) for a given image, improving data quality
    • RA-CLIP: when learning an image-text pair, retrieves semantically related image-text pairs from a database and uses them together during training

5.1.3 Image-Text-Label Contrastive Learning

  • This pre-training scheme adds image-classification labels to image-text contrastive learning
  • It combines supervised learning on image labels with unsupervised VLM learning on image-text pair data
    • UniCL: learns features that are both discriminative (helping distinguish the data from other data) and task-specific (helping make the distinctions a particular task requires)

5.1.4 Discussion

  • Contrastive learning
    • Gives matching image-text pairs the same embedding and mismatched image-text pairs different embeddings
    • Provides the VLM with features that discriminate images, together with the features of the matching text
      • A big contribution to zero-shot prediction
    • Limitations of contrastive learning
      • Jointly optimizing positive image-text pairs to be close and negative pairs to be far is complex
      • It depends heavily on the temperature hyperparameter that controls how discriminative the features are
        • This temperature is set heuristically, by experience rather than by any principled method

5.2 VLM Pre-training with Generative Objectives

  • ์ƒ์„ฑ์  VLM (์ด๋ฏธ์ง€์™€ ํ…์ŠคํŠธ ์ž…๋ ฅ์„ ๋ฐ”ํƒ•์œผ๋กœ ์ƒˆ๋กœ์šด ์ฝ˜ํ…์ธ ๋ฅผ ์ƒ์„ฑํ•˜๋Š” ๋ชจ๋ธ) ์„ ์‚ฌ์ „ํ•™์Šตํ•˜๋Š” ๋ฐฉ์‹
    • ์ด๋ฏธ์ง€ ๋งˆ์Šคํ‚น ๋ฐฉ์‹, ํ…์ŠคํŠธ ๋งˆ์Šคํ‚น ๋ฐฉ์‹, ์ด 2๊ฐ€์ง€๋ฅผ ๋ชจ๋‘ ํ™œ์šฉํ•˜๋Š” cross-modal modelling

5.2.1 Masked Image Modelling

  • ์ž๊ธฐ์ง€๋„ ํ•™์Šต ๋ฐฉ์‹์œผ๋กœ์จ, ์ด๋ฏธ์ง€์˜ ์ผ๋ถ€๋ถ„์„ ๋งˆ์Šคํ‚น ์ฒ˜๋ฆฌํ•œ ํ›„, ์ธ์ฝ”๋”๊ฐ€ ํ•ด๋‹น ๋ถ€๋ถ„์„ ์˜ˆ์ธกํ•˜๊ณ  ๋ณต์›ํ•˜๋„๋ก ํ›ˆ๋ จ์‹œํ‚ด (๋งˆ์Šคํ‚น ์•ˆ๋œ ๋ถ€๋ถ„์„ ๊ธฐ๋ฐ˜์œผ๋กœ!)
  • FLAVA : BeiT ๋ชจ๋ธ์—์„œ์ฒ˜๋Ÿผ, ์ง์‚ฌ๊ฐํ˜• ๋ชจํ˜•์˜ ๋ธ”๋ก์œผ๋กœ ๋งˆ์Šคํ‚นํ•˜๋Š” ๋ฐฉ์‹์„ ์ฑ„ํƒ
  • KELIP, SegCLIP : ์ด๋ฏธ์ง€ ํŒจ์น˜์˜ 75%์„ ๋งˆ์Šคํ‚นํ•˜์—ฌ ๋ชจ๋ธ์„ ํ•™์Šต์‹œํ‚ด
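The masking step itself is simple to illustrate: randomly hide a fixed ratio of patch tokens (75% as above) and let the model reconstruct them from the visible ones. A toy sketch of the corruption step only (reconstruction would be the encoder's job; the patch count and seed are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def mask_patches(patches, ratio=0.75):
    """Randomly mask `ratio` of the patch tokens; the model is then trained
    to reconstruct the masked patches from the visible ones."""
    n = patches.shape[0]
    masked_idx = rng.choice(n, size=int(n * ratio), replace=False)
    visible = np.ones(n, dtype=bool)
    visible[masked_idx] = False
    corrupted = patches.copy()
    corrupted[masked_idx] = 0.0          # zero out the masked patches
    return corrupted, visible

patches = rng.normal(size=(16, 48))      # 16 toy patch tokens
corrupted, visible = mask_patches(patches)  # 12 of 16 patches hidden
```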

5.2.2 Masked Language Modelling

  • As with image masking, some tokens of a sentence are masked and the model learns to predict the masked text tokens
  • FLAVA: masks 15% of the text tokens and learns to predict them from the remaining tokens
  • FIBER: adopts masked language modelling to learn better language features

5.2.3 Masked Cross-Modal Modelling

  • The masked cross-modal approach
    • Masks a subset of the image patches and a subset of the text tokens, and trains the VLM to reconstruct both from the unmasked parts

5.2.4 Image-to-Text Generation

  • ์ฃผ์–ด์ง„ ์ด๋ฏธ์ง€์— ๋ถ€ํ•ฉํ•˜๋Š” ํ…์ŠคํŠธ๋ฅผ ์ƒ์„ฑํ•˜๋Š” ๊ฒƒ์ด ๋ชฉํ‘œ
  • VLM์„ ํ† ํฐํ™”๋œ ํ…์ŠคํŠธ๋ฅผ ์˜ˆ์ธกํ•˜๋„๋ก ํ•™์Šตํ•˜์—ฌ Vision-Language Correlation(์‹œ๊ฐ-์–ธ์–ด ์ƒ๊ด€๊ด€๊ณ„)์˜ ์„ธ๋ถ€์ ์ธ ํŠน์ง•๋“ค๊นŒ์ง€ ํฌ์ฐฉํ•˜๋„๋ก ํ•จ
  • ์šฐ์„  ์ž…๋ ฅ๋œ ์ด๋ฏธ์ง€๋ฅผ ๋ชจ๋ธ์ด ์ดํ•ดํ•  ์ˆ˜ ์žˆ๋Š” ์ˆซ์ž ๋ฒกํ„ฐ ํ˜•ํƒœ(Intermediate Embedding)์œผ๋กœ ๋ฐ”๊พธ๊ณ , ํ•ด๋‹น ์ด๋ฏธ์ง€์— ๋งž๋Š” ํ…์ŠคํŠธ๋กœ ๋””์ฝ”๋”ฉํ•˜์—ฌ ๋ฌธ์žฅ ์ƒ์„ฑ
  • ํ…์ŠคํŠธ ๋””์ฝ”๋”๊ฐ€ ๋ฌธ์žฅ์„ ์ƒ์„ฑํ•˜๋Š” ๋งค ์ˆœ๊ฐ„๋งˆ๋‹ค ์ด๋ฏธ์ง€๋กœ๋ถ€ํ„ฐ ํ•„์š”ํ•œ ์‹œ๊ฐ์  ํžŒํŠธ๋ฅผ ์ฐธ๊ณ ! ๋ผ๊ณ  ์ƒ๊ฐํ•˜์ž

5.2.5 Discussion

  • ์ƒ์„ฑ์  VLM ๋ชจ๋ธ์„ ์‚ฌ์ „ํ•™์Šตํ•˜๋Š” ๋ฐฉ์‹์€ ์ด๋ฏธ์ง€-์–ธ์–ด์˜ ํŠน์ง•๋“ค์„ ํ’๋ถ€ํ•˜๊ฒŒ ํ•™์Šตํ•  ์ˆ˜ ์žˆ๊ธฐ ๋•Œ๋ฌธ์— ๋‹ค๋ฅธ VLM์˜ ์‚ฌ์ „ํ•™์Šต ๋ฐฉ์‹์—๋„ ์ถ”๊ฐ€์ ์œผ๋กœ ์‚ฌ์šฉ๋˜๊ณคํ•œ๋‹ค.
  • ์ด๋ฏธ์ง€/ํ…์ŠคํŠธ/Cross-Modal ๋งˆ์Šคํ‚น ๋ฐฉ์‹์€ ์ด๋ฏธ์ง€-์–ธ์–ด์˜ ์„ธ๋ถ€์ ์ธ ํŠน์ง•๋“ค๊นŒ์ง€ ํ•™์Šตํ•˜๊ธฐ ๋•Œ๋ฌธ์— zero-shot ์˜ˆ์ธก์— ๋” ๊ฐ•ํ•จ

5.3 VLM Pre-Training with Alignment Objectives

  • ์ฃผ์–ด์ง„ ํ…์ŠคํŠธ๊ฐ€ ์ด๋ฏธ์ง€๋ฅผ ์ž˜ ์„ค๋ช…ํ•˜๊ณ  ์žˆ๋Š”์ง€, ์ด๋ฏธ์ง€์— ๋ถ€ํ•ฉํ•˜๋Š” ์„ค๋ช…์ธ์ง€ ์˜ˆ์ธกํ•˜๊ธฐ ์œ„ํ•ด ์ด๋ฏธ์ง€์™€ ํ…์ŠคํŠธ๋“ค์„ ์ •๋ ฌํ•˜๋Š” ๊ฒƒ์„ VLM์˜ ๋ชฉํ‘œ๋กœ ์‚ผ๋Š”๋‹ค.

5.3.1 Image-Text Matching

  • ์ด๋ฏธ์ง€-ํ…์ŠคํŠธ ๋งค์นญ ๋ฐฉ์‹
    • ์ด๋ฏธ์ง€ ์ „์ฒด์™€ ํ…์ŠคํŠธ ์ „์ฒด๋ฅผ ๋ณด๊ณ  (Global image-text Correlation)์„ ์ด๋ฏธ์ง€-ํ…์ŠคํŠธ ์‚ฌ์ด์˜ ์ƒ๊ด€๊ด€๊ณ„ (์ด๋ฏธ์ง€์™€ ํ…์ŠคํŠธ๊ฐ€ ์Œ์ด ๋งž๋Š”์ง€)๋ฅผ ์˜ˆ์ธกํ•˜๋„๋ก ํ•จ
  • FLAVA ๋ชจ๋ธ์€ ์ด๋ฏธ์ง€-ํ…์ŠคํŠธ ์Œ์œผ๋กœ ์ด๋ฃจ์–ด์ง„ ๋ฐ์ดํ„ฐ๋“ค์„ ์„œ๋กœ ์Œ์ด ๋งž๊ฒŒ ๋งค์นญํ•จ (๋ถ„๋ฅ˜, ์ด์ง„๋ถ„๋ฅ˜ ์†์‹ค์„ ํ†ตํ•ด)
  • FIBER ๋ชจ๋ธ : ์„œ๋กœ ๋งž์ง€ ์•Š์€ ์ด๋ฏธ์ง€-ํ…์ŠคํŠธ ์Œ์„ ๊ตฌ๋ถ„ํ•˜๋Š” ๊ฐ•ํ•œ ๋ถ€์ •์ ์ธ ํŠน์ง•๋“ค์„ ํ•™์Šตํ•˜๋„๋กํ•˜์—ฌ ์ด๋ฏธ์ง€์™€ ํ…์ŠคํŠธ๋“ค์„ ๋” ์ž˜ ์ •๋ ฌํ•  ์ˆ˜ ์žˆ๋„๋ก ํ•จ

5.3.2 Region-Word Matching

  • Aims to align parts of the image (not the whole image) with parts of the text, so the model learns local, fine-grained features of images and texts
    • Yields good zero-shot prediction and also performs well in object detection and semantic segmentation
  • GLIP, FIBER, and DetCLIP all replace the classification logits (the raw scores before the model's final probabilities) with local image-language alignment scores
    • Region-word alignment score: the dot product of the similarities between local image and text features

5.3.3 Discussion

  • ์ด๋ฏธ์ง€-ํ…์ŠคํŠธ๊ฐ€ ์„œ๋กœ ๋งž๋„๋ก ์ •๋ ฌํ•˜๋Š” Alignment Objectives๋Š” ์ด๋ฏธ์ง€ ๋ฐ์ดํ„ฐ์™€ ํ…์ŠคํŠธ ๋ฐ์ดํ„ฐ๊ฐ€ ์„œ๋กœ ๋งž๋Š”์ง€์— ๋Œ€ํ•ด ์˜ˆ์ธกํ•˜๋„๋ก ํ•™์Šต
    • ์žฅ์ 
      • ์ด๋ฏธ์ง€์™€ ํ…์ŠคํŠธ ์‚ฌ์ด์˜ ์„ธ๋ฐ€ํ•˜๊ณ  ์ •๊ตํ•œ ์ƒ๊ด€๊ด€๊ณ„๋“ค์„ ์ž˜ ํ•™์Šตํ•จ
    • ๋‹จ์ 
      • ์ด๋ฏธ์ง€์™€ ํ…์ŠคํŠธ ์‚ฌ์ด์˜ ๊ด€๊ณ„์—๋งŒ ์ง‘์ค‘ํ•˜๊ธฐ ๋•Œ๋ฌธ์—, ์ด๋ฏธ์ง€ ๋‚ด๋ถ€์˜ ๊ด€๊ณ„ (์ด๋ฏธ์ง€ ๋‚ด์—์„œ ๋ˆˆ, ์ฝ” ์‚ฌ์ด์˜ ๊ด€๊ณ„), ํ…์ŠคํŠธ ๋‚ด๋ถ€์˜ ๊ด€๊ณ„(๋ฌธ๋ฒ•์  ๊ด€๊ณ„)๋ฅผ ์ž˜ ํ•™์Šตํ•˜์ง€ ๋ชปํ•จ

โ‡’ ๋”ฐ๋ผ์„œ Alignment Objectives (์ •๋ ฌ ๋ชฉํ‘œ)๋Š” ๋‹จ๋…์œผ๋กœ ์‚ฌ์šฉ๋˜๊ธฐ๋ณด๋‹ค๋Š” ๋‹ค๋ฅธ VLM ์‚ฌ์ „ ํ•™์Šต์— ์ถ”๊ฐ€๋˜๋Š” ๋ณด์กฐ ์†์‹ค๋กœ ์ž์ฃผ ์‚ฌ์šฉ๋จ

5.4 Summary and Discussion

💡 VLM pre-training models

⇒ These two research streams form the main axes of VLM research.

6 VLM Transfer Learning

6์žฅ์—์„œ ์†Œ๊ฐœํ•  ๊ฒƒ๋“ค

  • ์‚ฌ์ „ํ•™์Šต๋œ VLM์˜ ์ „์ดํ•™์Šต ๋ฐฉ์‹์— ๋Œ€ํ•œ ๋™๊ธฐ
  • ์ „์ดํ•™์Šต์„ ์œ„ํ•œ ๊ธฐ๋ณธ์ ์ธ ๊ตฌ์„ฑ
  • 3๊ฐ€์ง€ ์ „์ด ํ•™์Šต ์ ‘๊ทผ๋ฒ•

6.1 Motivation of Transfer Learning

  • ์‚ฌ์ „ํ•™์Šต๋œ VLM ๋ชจ๋ธ๋“ค์ด ๋‹ค์–‘ํ•œ ๋‹ค์šด์ŠคํŠธ๋ฆผ ์ž‘์—…์— ์ ์šฉ๋ ๋•Œ ๋‹ค์Œ๊ณผ ๊ฐ™์€ 2๊ฐ€์ง€ ์ฐจ์ด์ ์— ๋ถ€๋”ชํž˜
    1. ์ด๋ฏธ์ง€์™€ ํ…์ŠคํŠธ ๋ฐ์ดํ„ฐ์˜ ๋ถ„ํฌ ์ฐจ์ด
      • ๋ฐ์ดํ„ฐ๊ฐ€ ํŠน์ • ์ด๋ฏธ์ง€ ์Šคํƒ€์ผ๊ณผ ํ…์ŠคํŠธ ํ˜•์‹์„ ๊ฐ€์ง€๊ณ  ์žˆ์„ ์ˆ˜ ์žˆ์Œ
    2. ํ•™์Šต ๋ชฉํ‘œ์˜ ์ฐจ์ด
      • VLM ์˜ ์‚ฌ์ „ ํ•™์Šต ๋ชฉํ‘œ๋Š” ํŠน์ • ์ž‘์—…์— ์–ฝ๋งค์ด์ง€ ์•Š๊ณ , ๋ฒ”์šฉ์ ์ธ ์ง€์‹์„ ํ•™์Šตํ•˜๋„๋ก ์„ค์ •๋˜์ง€๋งŒ, ์‹ค์ œ ์ž‘์—…์€ ์ด๋ฏธ์ง€ ๋ถ„๋ฅ˜, ๊ฐ์ฒด ํƒ์ง€์ฒ˜๋Ÿผ ๋น„๊ต์  ๊ตฌ์ฒด์ ์ด๊ธฐ ๋•Œ๋ฌธ!

6.2 Common Setup of Transfer Learning

  • ์ „์ด ํ•™์Šต์˜ ๊ธฐ๋ณธ์ ์ธ ๊ตฌ์„ฑ (3๊ฐ€์ง€๊ฐ€ ์žˆ์Œ)
    1. Supervised Transfer (์ง€๋„ ์ „์ด ํ•™์Šต)
      • ๋ผ๋ฒจ(์ •๋‹ต)์ด ์žˆ๋Š” ๋‹ค์šด์ŠคํŠธ๋ฆผ ๋ฐ์ดํ„ฐ ์ „์ฒด๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๋ฏธ์„ธ ์กฐ์ •
    2. Few-shot Supervised Transfer ( ์ ์€ ๋ฐ์ดํ„ฐ๋ฅผ ํ™œ์šฉํ•œ ์ง€๋„ ์ „์ด ํ•™์Šต )
      • ์•„์ฃผ ์ ์€ ์ˆ˜์˜ ๋ผ๋ฒจ์ด ์žˆ๋Š” ๋‹ค์šด์ŠคํŠธ๋ฆผ ์ƒ˜ํ”Œ๋งŒ์„ ์‚ฌ์šฉํ•˜์—ฌ ๋ฏธ์„ธ์กฐ์ •
    3. Unsupervised Transfer (๋น„์ง€๋„ ์ „์ดํ•™์Šต)
      • ๋ผ๋ฒจ (์ •๋‹ต)์ด ์—†๋Š” ๋‹ค์šด์ŠคํŠธ๋ฆผ ๋ฐ์ดํ„ฐ๋ฅผ ํ™œ์šฉํ•˜์—ฌ ๋ฏธ์„ธ ์กฐ์ •
      • 3๊ฐ€์ง€ ๋ฐฉ์‹ ์ค‘ ๊ฐ€์žฅ ์–ด๋ ต์ง€๋งŒ, ๊ฐ€์žฅ ์œ ๋งํ•˜๊ณ  ํšจ์œจ์ ์ž„
      • ์ตœ๊ทผ์—๋Š” ๋น„์ง€๋„ ์ „์ดํ•™์Šต์— ๋Œ€ํ•œ ์—ฐ๊ตฌ๊ฐ€ ํ™œ๋ฐœํ•จ!

6.3 Common Transfer Learning Methods

  • ๊ธฐ์กด์— ์กด์žฌํ•˜๋Š” VLM ์ „์ดํ•™์Šต ๋ฐฉ์‹์„ 3๊ฐ€์ง€์˜ ์นดํ…Œ๊ณ ๋ฆฌ๋กœ ๋‚˜๋ˆ”

6.3.1 Transfer via Prompt Tuning

  • ์ž์—ฐ์–ด ์ฒ˜๋ฆฌ ๋ถ„์•ผ์˜ โ€˜prompt learningโ€™์—์„œ ์˜๊ฐ์„ ๋ฐ›์€ ๋ฐฉ์‹
  • ์ „์ฒด VLM ๋ชจ๋ธ์„ ๋ฏธ์„ธ์กฐ์ •ํ•˜๊ธฐ ๋ณด๋‹ค๋Š”, ๋‹ค์šด์ŠคํŠธ๋ฆผ์— ๋งž๊ฒŒ VLM์„ ์กฐ์ •ํ•˜๊ธฐ ์œ„ํ•ด ์ตœ์ ์˜ ํ”„๋กฌํ”„ํŠธ๋ฅผ ์ฐพ๋Š” ๊ฒƒ (์ด์— ๋Œ€ํ•ด ์•„๋ž˜์™€ ๊ฐ™์€ 3๊ฐ€์ง€ ๋ฐฉํ–ฅ์˜ ์—ฐ๊ตฌ๊ฐ€ ์กด์žฌ)

💡 Text Prompt Tuning

  • ์‚ฌ๋žŒ์ด ์ง์ ‘ ํ”„๋กฌํ”„ํŠธ๋ฅผ ๋งŒ๋“œ๋Š”(ํ”„๋กฌํ”„ํŠธ ์—”์ง€๋‹ˆ์–ด๋ง) ๋Œ€์‹ , ๋ผ๋ฒจ์ด ์žˆ๋Š” ๋‹ค์šด์ŠคํŠธ๋ฆผ ์ƒ˜ํ”Œ๋“ค์„ ์ด์šฉํ•˜์—ฌ ์ตœ์ ์˜ ํ”„๋กฌํ”„ํŠธ๋ฅผ ํ•™์Šตํ•˜๋Š” ๋ฐฉ์‹
    • ์‰ฝ๊ฒŒ ์ดํ•ดํ•ด๋ณด์ž๋ฉด,,
      ๊ฝƒ ์ข…๋ฅ˜ 3๊ฐ€์ง€๋ฅผ ๋ถ„๋ฅ˜ํ•˜๋Š” ๋‹ค์šด์ŠคํŠธ๋ฆผ ์ž‘์—…์ด๋ผ๋ฉด, ๊ฐ๊ฐ์˜ ๊ฝƒ ์ข…๋ฅ˜(์žฅ๋ฏธ, ํŠค๋ฆฝ, ๋“ฑ)์˜ ๋ผ๋ฒจ์ด ์žˆ๋Š” ๋ฐ์ดํ„ฐ ๋ช‡๊ฐœ๋ฅผ ์ƒ˜ํ”Œ๋กœ ํ™œ์šฉํ•˜์—ฌ 3๊ฐ€์ง€ ๊ฝƒ์„ ๊ฐ€์žฅ ์ž˜ ๋ถ„๋ฅ˜ํ•˜๋Š” ์ตœ์ ์˜ ํ”„๋กฌํ”„ํŠธ๋ฅผ ์ฐพ์Œ
    • ์—ฌ๊ธฐ์„œ ํ”„๋กฌํ”„ํŠธ๋Š” ๋ชจ๋ธ์ด ์ตœ์ ํ™”๋œ ์ˆซ์ž ๋ฒกํ„ฐ๋“ค์˜ ์กฐํ•ฉ์ž„
  • CoOp ๋ชจ๋ธ (ํ•ด๋‹น ๊ธฐ์ˆ ์˜ ์ดˆ๊ธฐ ๋ชจ๋ธ)
    • ํ•™์Šต ๊ฐ€๋Šฅํ•œ ๋‹จ์–ด ๋ฒกํ„ฐ๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๊ฐ ํด๋ž˜์Šค ์ด๋ฆ„์— ์ตœ์ ํ™”๋œ ๋ฌธ๋งฅ์„ ํ•™์Šตํ•˜๋Š” ๋ฐฉ๋ฒ• ์ œ์•ˆ
    • ๊ณผ์ ํ•ฉ ๋ฐฉ์ง€ ๋ฐฉ์‹
      • CoOp ๋ชจ๋ธ์€ ๋ชจ๋“  ์ด๋ฏธ์ง€์— ๋™์ผํ•œ ํ•™์Šต๋œ ํ”„๋กฌํ”„ํŠธ๋ฅผ ๋ถ€์—ฌ, ๊ณผ์ ํ•ฉ ๋ฐœ์ƒ ์œ„ํ—˜์ด ๋†’์Œ

      โ‡’ ์ž…๋ ฅ๋˜๋Š” ์ด๋ฏธ์ง€ ๊ฐ๊ฐ์— ๋งž์ถฐ ๋™์ ์œผ๋กœ ๋‹ค๋ฅธ ํ”„๋กฌํ”„ํŠธ๋ฅผ ์ƒ์„ฑํ•˜๋Š” ๋ฐฉ๋ฒ• ์ œ์•ˆ

  • ๊ทธ ์ด์™ธ์˜ ๊ฐœ์„  ๋ฐฉ๋ฒ•๋“ค
    • SubPT ๋ชจ๋ธ : ํ”„๋กฌํ”„ํŠธ์˜ ์ผ๋ฐ˜ํ™” ์„ฑ๋Šฅ์„ ๋†’์ด๊ธฐ ์œ„ํ•œ ๋ถ€๋ถ„ ๊ณต๊ฐ„ ๊ฐœ๋… ๋„์ž…
    • LASP : ํ•™์Šต ๊ฐ€๋Šฅํ•œ ํ”„๋กฌํ”„ํŠธ๊ฐ€ ๋„ˆ๋ฌด ์—‰๋šฑํ•œ ๋ฐฉํ–ฅ์œผ๋กœ ํ•™์Šต๋˜์ง€ ์•Š๊ฒŒ โ€˜๊ทœ์ œโ€™ ๋„์ž…
    • VPT : ์ผ๋ฐ˜ํ™” ์„ฑ๋Šฅ์„ ์œ„ํ•œ ๊ฐ ์ด๋ฏธ์ง€์— ๋งž๋Š” โ€˜ํ”„๋กฌํ”„ํŠธ ๋ถ„ํฌโ€™ ๋ชจ๋ธ๋ง
    • KgCoOp : ํ•™์Šต ๋•Œ ๋ณด์ง€ ๋ชปํ•œ ์ƒˆ๋กœ์šด ํด๋ž˜์Šค์— ๋Œ€ํ•œ ์ผ๋ฐ˜ํ™” ์„ฑ๋Šฅ ํ–ฅ์ƒ
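The core mechanic of CoOp-style text prompt tuning can be sketched very compactly: the class-name embedding stays frozen, and only a learnable context vector is optimized so the composed text feature moves toward the image features of that class. Everything below (shapes, the "encoder" as a simple sum, the squared-error objective) is a toy illustration, not the paper's exact formulation:

```python
import numpy as np

def tune_prompt(image_feats, class_emb, ctx_dim=2, lr=0.5, steps=100):
    """CoOp-style sketch: optimise only the learnable context vector `ctx`;
    the class-name embedding is frozen throughout."""
    ctx = np.zeros(ctx_dim)                 # learnable context, random/zero init
    target = image_feats.mean(axis=0)       # toy target: the class's image mean
    for _ in range(steps):
        text_feat = ctx + class_emb         # toy "text encoder": sum of tokens
        grad = 2 * (text_feat - target)     # gradient of squared error w.r.t. ctx
        ctx -= lr * grad                    # only ctx is updated
    return ctx

imgs = np.array([[2.0, 0.0], [2.2, -0.2]])  # toy image features of one class
cls = np.array([1.0, 1.0])                  # frozen class-name embedding
ctx = tune_prompt(imgs, cls)
text_feat = ctx + cls                       # tuned prompt feature for this class
```

The point of the exercise: the entire VLM (here, the frozen `cls` and the image features) is untouched; only a handful of prompt parameters are trained, which is what makes prompt tuning so parameter-efficient.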

💡 Visual Prompt Tuning

  • VLM ์„ ์ƒˆ๋กœ์šด ์ž‘์—…์— ์ ์šฉํ•  ๋•Œ, ํ…์ŠคํŠธ ์ž…๋ ฅ์ด ์•„๋‹Œ ์ด๋ฏธ์ง€ ์ธ์ฝ”๋”์˜ ์ž…๋ ฅ์„ ์กฐ์ ˆํ•˜๋Š” ๋ฐฉ์‹
    • VLM ๋ชจ๋ธ์˜ ์‚ฌ์ „ํ•™์Šต๋œ ๊ฐ€์ค‘์น˜๋Š” ์ˆ˜์ •ํ•˜์ง€ ์•Š๊ณ , ์ด๋ฏธ์ง€ ์ธ์ฝ”๋”์— ๋“ค์–ด๊ฐ€๋Š” ์ž…๋ ฅ ๋ฐ์ดํ„ฐ๋งŒ ์กฐ์ •
    • VLM ๋ชจ๋ธ์— ์ž…๋ ฅ์œผ๋กœ ํ™œ์šฉ๋˜๋Š” ์ด๋ฏธ์ง€ ๋ฐ์ดํ„ฐ ํŒจ์น˜๋“ค ์‚ฌ์ด์— ํ•™์Šต ๊ฐ€๋Šฅํ•œ ํŒจ์น˜(๋น„์ฃผ์–ผ ํ”„๋กฌํ”„ํŠธ ๋ฒกํ„ฐ) ์ถ”๊ฐ€
    • ํ•™์Šต ๊ฐ€๋Šฅํ•œ ํŒจ์น˜?
      • ๋ฐ์ดํ„ฐ๊ฐ€ ์•„๋‹˜! ๋ชจ๋ธ์˜ ํŒŒ๋ผ๋ฏธํ„ฐ๋ผ๊ณ  ์ƒ๊ฐ!
      • ์ฒ˜์Œ์—๋Š” ๊ทธ๋ƒฅ ๋ฌด์ž‘์œ„ ์ˆซ์ž๋กœ ์‹œ์ž‘ํ•ด์„œ ํ›ˆ๋ จ์„ ํ†ตํ•˜์—ฌ ํ•ด๋‹น ์ˆซ์ž๊ฐ€ ๋ชจ๋ธ์˜ ๋ชฉ์ ์— ๋งž๊ฒŒ ์กฐ๊ธˆ์”ฉ ์กฐ์ •๋จ

    โ‡’ ์ด ์ˆซ์ž๋“ค์€ ํ•™์Šต์ด ๋๋‚˜๋ฉด ํ˜„์žฌ ํŠน์ • ๋‹ค์šด์ŠคํŠธ๋ฆผ ์ž‘์—…์— ๊ฐ€์žฅ ์ตœ์ ํ™”๋œ ์ƒํƒœ๊ฐ€ ๋จ!

💡 Text-Visual Prompt Tuning

Text prompt tuning + visual prompt tuning

  • Adjusts the input image and text simultaneously
  • As in visual prompt tuning, learnable visual prompt vectors are added among the image patches
  • Likewise, learnable text prompt vectors are inserted into the input text

⇒ As training proceeds, the prompt vectors attached to the text and image data are each trained into the state most optimized for the specific downstream task!

⇒ The image and text sides cooperate, adjusting together in the direction that reduces the final loss most effectively

Discussion

  • ํ”„๋กฌํ”„ํŠธ ํŠœ๋‹ : ํŒŒ๋ผ๋ฏธํ„ฐ ํšจ์œจ์ ์ž„! (๊ฑฐ๋Œ€ํ•œ ๋”ฅ๋Ÿฌ๋‹ ๋ชจ๋ธ์˜ ์•„์ฃผ ์ž‘์€ ์ผ๋ถ€ ํŒŒ๋ผ๋ฏธํ„ฐ๋งŒ ํ›ˆ๋ จ or ์ˆ˜์ •)
    • ๊ณ„์‚ฐ๋Ÿ‰์ด ์ ์Œ
  • ํ™œ์šฉํ•˜๊ธฐ ์‰ฌ์šฐ๋ฉฐ, ๊ฐ„๋‹จํ•จ (์ถ”๊ฐ€์ ์ธ ๋„คํŠธ์›Œํฌ ๋ ˆ์ด์–ด/๋„คํŠธ์›Œํฌ ๋ ˆ์ด์–ด์˜ ๋ณ€๊ฒฝ ์ด ํ•„์š”ํ•˜์ง€ ์•Š์Œ)
  • ๊ณ ์ •๋œ ํ”„๋กฌํ”„ํŠธ๊ฐ€ ์ƒˆ๋กœ์šด ์ด๋ฏธ์ง€์™€ ์ž˜ ๋งž์ง€ ์•Š์•„ ์˜ˆ์ธก ์„ฑ๋Šฅ์ด ์ €ํ•˜๋˜๋Š” ์œตํ†ต์„ฑ ๋ถ€์กฑ์˜ ํ•œ๊ณ„๋Š” ์•„์ง ์กด์žฌ
    • ๊ณ ์ •๋œ ํ”„๋กฌํ”„ํŠธ? : ํŠน์ • ๋‹ค์šด์ŠคํŠธ๋ฆผ ์ž‘์—…์— ๋งž์ถฐ ํ”„๋กฌํ”„ํŠธ๊ฐ€ ํ•™์Šต๋˜์—ˆ๊ธฐ ๋•Œ๋ฌธ์— ์ƒˆ๋กœ์šด ๋ถ„์•ผ์˜ ์ด๋ฏธ์ง€์—๋Š” ์„ฑ๋Šฅ์ด ๋‚ฎ์Œ

6.3.2 Transfer via Feature Adaptation

Feature adaptation is a transfer method that fine-tunes a VLM by adapting its image or text features

  • Uses additional lightweight feature adapter modules
  • CLIP-Adapter: inserts several learnable linear layers after the text and image encoders of the original CLIP model
    • Only the newly inserted linear layers are trained on downstream data, transforming the features CLIP extracts
  • SVL-Adapter: uses an additional encoder that performs self-supervised learning on the input images as its adapter
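The adapter idea can be sketched as a small bottleneck MLP whose output is blended back into the frozen feature with a residual ratio; the shapes, the blend ratio alpha, and the zero-initialized weights below are illustrative choices, not CLIP-Adapter's published configuration:

```python
import numpy as np

def adapter(feat, w_down, w_up, alpha=0.2):
    """Adapter sketch: a lightweight bottleneck MLP transforms the frozen
    encoder's feature; a residual ratio alpha blends the adapted feature
    with the original one, so the frozen feature is never fully discarded."""
    relu = lambda v: np.maximum(v, 0.0)
    adapted = relu(feat @ w_down) @ w_up   # the only learnable layers
    return alpha * adapted + (1 - alpha) * feat

feat = np.array([1.0, 2.0, 3.0, 4.0])      # frozen VLM feature (toy values)
w_down = np.zeros((4, 2))                  # untrained adapter adds nothing yet,
w_up = np.zeros((2, 4))                    # so the output stays proportional
out = adapter(feat, w_down, w_up)          # to the original frozen feature
```

Only `w_down`/`w_up` would be trained on downstream data; the encoder producing `feat` stays frozen, which is what keeps the method lightweight.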

Conclusion: feature adapters adapt VLMs to downstream data, and are emerging as an alternative to prompt tuning, the transfer method introduced above

Discussion

  • Feature adaptation
    • Strength: this transfer method is highly flexible (usable across many downstream tasks) and effective
    • Weakness: it must modify the network structure, and it cannot handle intellectual-property constraints on the model or data.

6.3.3 Other Transfer Methods

  • ์œ„์—์„œ ์†Œ๊ฐœํ•œ ๋ฐฉ๋ฒ• ์ด์™ธ์—๋„ ๋‹ค์–‘ํ•œ ์ „์ดํ•™์Šต ๋ฐฉ๋ฒ•์ด ์กด์žฌ
  • Wise-FT : ์›๋ณธ VLM์˜ ๊ฐ€์ค‘์น˜์™€ ๋ฏธ์„ธ ์กฐ์ •๋œ VLM์˜ ๊ฐ€์ค‘์น˜๋ฅผ ๊ฒฐํ•ฉํ•˜๋Š” ๋ฐฉ์‹
  • Mask-CLIP : ์ด๋ฏธ์ง€ ์ธ์ฝ”๋” ์•„ํ‚คํ…์ฒ˜๋ฅผ ์ˆ˜์ •, ์ด๋ฏธ์ง€ ์ „์ฒด์— ๋Œ€ํ•œ ํ’๋ถ€ํ•œ ์ด๋ฏธ์ง€ ํŠน์ง• ์ถ”์ถœ
  • VT-CLIP : ์‹œ๊ฐ์  ์œ ๋„ ์–ดํ…์…˜ ๋„์ž…
  • CuPL & VCD : GPT-3์™€ ๊ฐ™์€ LLM์„ ํ™œ์šฉํ•˜์—ฌ ๋‹จ์ˆœํ•œ ํ…์ŠคํŠธ ํ”„๋กฌํ”„ํŠธ๋ฅผ ๋” ์ƒ์„ธํ•œ ํ”„๋กฌํ”„ํŠธ๋กœ ํ™•์žฅ
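Wise-FT's weight combination is a plain linear interpolation between the two checkpoints, which can be written in a few lines; the single-tensor "state dict" and alpha value below are toy stand-ins:

```python
import numpy as np

def wise_ft(zero_shot_w, finetuned_w, alpha=0.5):
    """Wise-FT sketch: interpolate the zero-shot VLM's weights with the
    fine-tuned weights; alpha trades robustness for downstream accuracy."""
    return {k: (1 - alpha) * zero_shot_w[k] + alpha * finetuned_w[k]
            for k in zero_shot_w}

w_zero_shot = {"layer": np.array([0.0, 2.0])}  # original VLM weights (toy)
w_finetuned = {"layer": np.array([2.0, 0.0])}  # fine-tuned weights (toy)
merged = wise_ft(w_zero_shot, w_finetuned, alpha=0.5)
```

At alpha=0 the merged model is the robust zero-shot VLM; at alpha=1 it is the fine-tuned model; intermediate values aim for the best of both.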

6.4 Summary and Discussion

  • VLM์˜ ์ „์ด ํ•™์Šต ๋ฐฉ์‹์˜ ๊ฐ€์žฅ ๋ฉ”์ธ์ด ๋˜๋Š” ๋ฐฉ์‹ 2๊ฐ€์ง€
    • Prompt Tuning
    • Feature Adapter
  • ์ง€๊ธˆ๊นŒ์ง€๋Š” few-shot ์ง€๋„ ์ „์ดํ•™์Šต์— ๋Œ€ํ•œ ์—ฐ๊ตฌ๊ฐ€ ํ™œ๋ฐœํ–ˆ๋‹ค๋ฉด, ์ตœ๊ทผ์—๋Š” ๋น„์ง€๋„ ์ „์ด ํ•™์Šต์— ๋Œ€ํ•œ ์—ฐ๊ตฌ๊ฐ€ ํ™œ๋ฐœํ•˜๊ฒŒ ์ด๋ฃจ์–ด์ง€๊ณ  ์žˆ๋‹ค.

7 VLM Knowledge Distillation

  • VLM knowledge distillation
  • A pre-trained VLM holds general knowledge covering a wide range of visual and textual concepts
  • However, dense prediction tasks such as object detection and semantic segmentation demand pixel-level understanding

⇒ How can the VLM's general visual and textual knowledge be transferred (distilled) into models designed for dense prediction tasks?

7.1 Motivation of Distilling Knowledge from VLMs

  • ์ง€์‹ ์ฆ๋ฅ˜์™€ ์ „์ด ํ•™์Šต์˜ ์ฐจ์ด์ 
์ง€์‹ ์ฆ๋ฅ˜ (Knowledge Distillation)์ „์ด ํ•™์Šต (Transfer Learning)
VLM์˜ โ€˜์ง€์‹โ€™๋งŒ์„ ๊ฐ€์ ธ์™€ ์™„์ „ํžˆ ๋‹ค๋ฅธ ๊ตฌ์กฐ๋ฅผ ๊ฐ€์ง„ ๋ชจ๋ธ์—๊ฒŒ ์ „๋‹ฌ (VLM ์•„ํ‚คํ…์ฒ˜์— ์–ฝ๋งค์ผ ํ•„์š” X)๊ธฐ์กด์˜ ์‚ฌ์ „ํ•™์Šต๋œ VLM ์•„ํ‚คํ…์ฒ˜๋Š” ๊ทธ๋Œ€๋กœ ๋‘” ์ƒํƒœ์—์„œ ์ผ๋ถ€ ์ž‘์€ ๋ถ€๋ถ„๋งŒ ์ˆ˜์ •/์ถ”๊ฐ€ํ•˜์—ฌ ์ƒˆ๋กœ์šด ์ž‘์—…์— ์ ์‘์‹œํ‚ด
Faster R-CNN, DETR ๊ฐ™์€ ํƒ์ง€ ๋ชจ๋ธ์˜ ์•„ํ‚คํ…์ฒ˜์˜ ์žฅ์ ์„ ์‚ด๋ฆฌ๋ฉด์„œ VLM ์ง€์‹์„ ์ „๋‹ฌํ•˜๋Š” ๊ฒƒ์ด ๊ฐ€๋Šฅ!๋งŒ์•ฝ ๋‹ค์šด์ŠคํŠธ๋ฆผ ์ž‘์—…์ด ์›๋ณธ VLM ์•„ํ‚คํ…์ฒ˜์— ์ ํ•ฉํ•˜์ง€ ์•Š์•„๋„ ๊ทธ ๊ตฌ์กฐ๋ฅผ ๋ฌด์กฐ๊ฑด ๋”ฐ๋ผ์•ผํ•จ

7.2 Common Knowledge Distillation Methods

  • ๋Œ€๋ถ€๋ถ„์˜ ์ง€์‹ ์ฆ๋ฅ˜ ๋ฐฉ์‹์€ ์ด๋ฏธ์ง€ ์ „์ฒด์— ๋Œ€ํ•œ ์ง€์‹ ์ˆ˜์ค€์„ ์ด๋ฏธ์ง€์˜ ์ผ๋ถ€(์ง€์—ญ์ ) ํ˜น์€ ํ”ฝ์…€ ๋‹จ์œ„์˜ ์ž‘์—…(๋” ์„ธ๋ฐ€ํ•œ ์ž‘์—…๋“ค)๋“ค์„ ํ•ด๊ฒฐํ•˜๋Š” ๋ชจ๋ธ์— ์ „๋‹ฌํ•˜๋Š” ๋ฐฉ์‹์ž„ (๊ฐ์ฒด ํƒ์ง€(Object Detection) or ์˜์ƒ ๋ถ„ํ• (Semantic Segmentation))

7.2.1 Knowledge Distillation for Object Detection

Open-Vocabulary Object Detection

  • ์ผ๋ฐ˜์ ์ธ ๊ฐ์ฒด ํƒ์ง€ ๋ชจ๋ธ : โ€˜๊ฐœโ€™, โ€˜์ž๋™์ฐจโ€™ ๋“ฑ ์ •ํ•ด์ง„ ํด๋ž˜์Šค๋งŒ ์•Œ๊ธฐ ๋•Œ๋ฌธ์— ์–ดํœ˜๋ ฅ์ด ์ œํ•œ์ 
  • CLIP๊ณผ ๊ฐ™์€ VLM ๋ชจ๋ธ๋“ค์€ ์ธํ„ฐ๋„ท์˜ ์ˆ˜์‹ญ์–ต ๊ฐœ์˜ ์ด๋ฏธ์ง€-ํ…์ŠคํŠธ ์Œ์œผ๋กœ ํ•™์Šต๋˜์—ˆ๊ธฐ ๋•Œ๋ฌธ์— ์–ดํœ˜๋ ฅ์ด ๋” ๋„“์Œ!
  • ViLD, ZSD-YOLO, OADP ๋ชจ๋‘ ๊ฐ์ฒด ํƒ์ง€ ์„ฑ๋Šฅ์„ ๋†’์ด๊ธฐ ์œ„ํ•ด VLM์˜ ๋ฐฉ๋Œ€ํ•œ ์ง€์‹์„ ์ฆ๋ฅ˜(์ „๋‹ฌ)๋ฐ›์Œ

โ‡’ VLM์˜ ์ด ๋ฐฉ๋Œ€ํ•œ ์ง€์‹์„ ๊ธฐ์กด ๊ฐ์ฒด ํƒ์ง€ ๋ชจ๋ธ์— ์ฆ๋ฅ˜(์ „๋‹ฌ)ํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ์—ฐ๊ตฌ

Prompt Learning์„ ํ†ตํ•œ ์ง€์‹ ์ฆ๋ฅ˜๋ฅผ ์—ฐ๊ตฌ

  • โ€˜ํ”„๋กฌํ”„ํŠธโ€™๋ฅผ ํ•™์Šตํ•˜๋Š” ๋ฐฉ์‹์œผ๋กœ, VLM์˜ ์ง€์‹์„ ํƒ์ง€๊ธฐ์— ์ „๋‹ฌ

VLM์ด ์ƒ์„ฑํ•œ ๊ฐ€์ƒ ๋ผ๋ฒจ ํ™œ์šฉ

  • ์ด๋ฏธ ํ•™์Šต๋œ VLM์„ ์ผ์ข…์˜ โ€˜์ž๋™ ๋ผ๋ฒจ๋ง ๊ธฐ๊ณ„โ€™๋กœ ์‚ฌ์šฉํ•˜์—ฌ ๊ฐ์ฒด ํƒ์ง€ ํ•™์Šต์— ํ•„์š”ํ•œ ๋ฐ์ดํ„ฐ๋ฅผ ์ƒ์„ฑ

7.2.2 Knowledge Distillation for Semantic Segmentation

  • ์˜์ƒ ๋ถ„ํ• (Semantic Segmentation)์„ ์œ„ํ•œ ์ง€์‹ ์ฆ๋ฅ˜ ๋ฐฉ๋ฒ•
  • ๊ฐ์ฒด ํƒ์ง€์™€ ๋งˆ์ฐฌ๊ฐ€์ง€๋กœ, ์ด๋ฏธ์ง€์˜ ํ”ฝ์…€์ด ์„ค๋ช…ํ•˜๊ณ  ์žˆ๋Š” ํด๋ž˜์Šค ๋ชฉ๋ก์˜ ๋ฒ”์œ„๋ฅผ ํ™•์žฅํ•˜๊ธฐ ์œ„ํ•ด (์–ดํœ˜๋ ฅ ํ–ฅ์ƒ)
    • CLIPSeg : ์˜์ƒ ๋ถ„ํ• ๋งŒ์„ ์œ„ํ•œ ๋ชจ๋ธ๋กœ์จ, ๊ฐ€๋ฒผ์šด ํŠธ๋žœ์Šคํฌ๋จธ ๋””์ฝ”๋”๋ฅผ ์ถ”๊ฐ€
    • LSeg : CLIP์˜ ํ…์ŠคํŠธ ์ž„๋ฒ ๋”ฉ๊ณผ ํ”ฝ์…€ ๋‹จ์œ„์˜ ์ด๋ฏธ์ง€ ์ž„๋ฒ ๋”ฉ ์‚ฌ์ด์˜ ์ƒ๊ด€๊ด€๊ณ„๋ฅผ ์ตœ๋Œ€ํ™”
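A toy sketch of the LSeg idea (stand-in tensors, not the real model): every pixel embedding is correlated with the class text embeddings, and each pixel is assigned the class whose text embedding it matches most strongly.

```python
import numpy as np

def segment(pixel_embs, text_embs):
    """pixel_embs: (H, W, D), text_embs: (C, D) -> (H, W) class-index map."""
    p = pixel_embs / np.linalg.norm(pixel_embs, axis=-1, keepdims=True)
    t = text_embs / np.linalg.norm(text_embs, axis=-1, keepdims=True)
    scores = p @ t.T                 # (H, W, C) pixel-text correlations
    return scores.argmax(axis=-1)    # per-pixel class decision

text = np.array([[1.0, 0.0], [0.0, 1.0]])           # e.g. "cat", "sofa" (toy)
pixels = np.array([[[0.9, 0.1], [0.2, 0.8]],
                   [[0.7, 0.3], [0.1, 0.9]]])
print(segment(pixels, text))
```

Because the "classifier" is just a set of text embeddings, swapping in new class names extends the segmentation vocabulary without retraining the image encoder.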

Knowledge Distillation for Weakly-Supervised Semantic Segmentation (VLM knowledge distillation under weak supervision)

  • ์•ฝํ•œ ์ง€๋„ (weak-supervision) ์ด๋ž€?
    • ์ •๊ตํ•œ ํ”ฝ์…€ ๋‹จ์œ„์˜ ์ •๋‹ต ์—†์ด, ์ด๋ฏธ์ง€ ๋ ˆ๋ฒจ์˜ ๋ผ๋ฒจ๊ณผ ๊ฐ™์ด ๋ถˆ์™„์ „ํ•˜๊ณ  ์•ฝํ•œ ํ˜•ํƒœ์˜ ์ •๋‹ต๋งŒ์„ ํ™œ์šฉ
    • ๊ฐ•ํ•œ ์ง€๋„ : ์‚ฌ์ง„ ์†์— ์†ŒํŒŒ, ๊ณ ์–‘์ด๋ฅผ ํ”ฝ์…€ ๋‹จ์œ„๋กœ ํŠน์ • ์ง€์–ด์คŒ
    • ์•ฝํ•œ ์ง€๋„ : ๊ทธ๋ƒฅ ์ด๋ฏธ์ง€ ์ „์ฒด์— ๋Œ€ํ•œ ์„ค๋ช… (๊ณ ์–‘์ด์™€ ์†ŒํŒŒ์•ผ)
  • ์•ฝํ•œ ์ง€๋„์˜ ๊ฐ€์žฅ ํฐ ํ•œ๊ณ„์  : ์ด๋ฏธ์ง€ ๋‚ด์— ํŠน์ • ๊ฐ์ฒด๊ฐ€ ์žˆ๋‹ค๋Š” ์ •๋ณด๋งŒ์œผ๋กœ โ€˜์–ด๋–ค ํ”ฝ์…€โ€™์ด ๊ทธ ๊ฐ์ฒด๋ฅผ ๋‚˜ํƒ€๋‚ด๊ณ  ์žˆ๋Š”์ง€ ์•Œ๊ธฐ ํž˜๋“ฆ

โ‡’ ํด๋ž˜์Šค ํ™œ์„ฑํ™” ๋งต์˜ ํ’ˆ์งˆ์„ ๋†’์ด๋Š” ๋ฐ VLM์˜ ์ง€์‹์„ ํ™œ์šฉ (ํด๋ž˜์Šค ํ™œ์„ฑํ™” ๋งต : ๋ชจ๋ธ์ด ํŠน์ • ํ”ฝ์…€์„ ๊ฐ์ฒด๋กœ ํŒ๋‹จํ• ๋•Œ, ์ด๋ฏธ์ง€์˜ ์–ด๋А ๋ถ€๋ถ„์„ ์ฃผ๋กœ ๋ณด์•˜๋Š”์ง€๋ฅผ ๋ณด์—ฌ์ฃผ๋Š” ํžˆํŠธ๋งต)

  • CLIP-ES, CLIMS
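A minimal class-activation-map (CAM) sketch with toy arrays: a class's heatmap is the classifier-weighted sum of the last convolutional feature maps. Methods such as CLIP-ES and CLIMS use VLM knowledge to refine exactly this kind of map.

```python
import numpy as np

def class_activation_map(features, class_weights):
    """features: (K, H, W), class_weights: (K,) -> (H, W) heatmap in [0, 1]."""
    cam = np.tensordot(class_weights, features, axes=1)  # weighted sum over K channels
    cam = np.maximum(cam, 0.0)                           # keep positive evidence only
    return cam / (cam.max() + 1e-8)                      # normalize to [0, 1]

feats = np.array([[[1.0, 0.0], [0.0, 0.0]],
                  [[0.0, 2.0], [0.0, 0.0]]])             # toy 2-channel feature maps
cam = class_activation_map(feats, np.array([1.0, 0.5]))
print(cam)
```

The resulting heatmap serves as a noisy pixel-level seed; the weakly-supervised methods above then sharpen it into segmentation pseudo labels.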

7.3 Summary and Discussion

  • Compared with transfer learning, knowledge distillation is more flexible: it is not constrained by the original VLM's architecture and can be applied to a wide range of downstream tasks.
  • Most knowledge distillation studies target object detection or semantic segmentation.

8 Performance Comparison

8.1 Performance of VLM Pre-Training

  • ์‚ฌ์ „ํ•™์Šต๋œ VLM ๋“ค์ด ์–ด๋А ์ •๋„์˜ ์„ฑ๋Šฅ์„ ๋ณด์—ฌ์ฃผ๋Š”์ง€ Zero-shot Prediction(์ œ๋กœ์ƒท ์˜ˆ์ธก) ํ‰๊ฐ€ ๋ฐฉ์‹์„ ํ†ตํ•ด ๋น„๊ต/๋ถ„์„
  • ํ‰๊ฐ€ ๋ฐฉ์‹ : ๋ชจ๋ธ์„ ์ถ”๊ฐ€๋กœ fine-tuningํ•˜์ง€ ์•Š๊ณ , ์‚ฌ์ „ ํ•™์Šต๋งŒ ๋œ ์ƒํƒœ๋กœ ํ‰๊ฐ€
  • ํ‰๊ฐ€ ํ•ญ๋ชฉ : ์ด๋ฏธ์ง€ ๋ถ„๋ฅ˜, ๊ฐ์ฒด ํƒ์ง€, ์˜์ƒ ๋ถ„ํ•  ๋“ฑ ์—ฌ๋Ÿฌ ์ข…๋ฅ˜์˜ ์‹œ๊ฐ ์ธ์‹ ์ž‘์—…์— ๋Œ€ํ•ด ํ‰๊ฐ€
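A toy sketch of this zero-shot evaluation protocol: no fine-tuning takes place — class names become prompts such as "a photo of a {label}", the (stand-in) text embeddings act as classifier weights, and each image embedding simply picks the nearest prompt.

```python
import numpy as np

def zero_shot_accuracy(image_embs, class_text_embs, labels):
    """Classify each image by its nearest class prompt and score accuracy."""
    img = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    txt = class_text_embs / np.linalg.norm(class_text_embs, axis=1, keepdims=True)
    preds = (img @ txt.T).argmax(axis=1)    # nearest class prompt per image
    return float((preds == np.asarray(labels)).mean())

text = np.array([[1.0, 0.0], [0.0, 1.0]])           # stand-in prompt embeddings
images = np.array([[0.8, 0.2], [0.3, 0.7], [0.9, 0.4]])
print(zero_shot_accuracy(images, text, [0, 1, 0]))
```

Real evaluations follow this same shape, only with a trained image/text encoder pair and benchmark datasets in place of the toy arrays.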

  • ์œ„์˜ ํ…Œ์ด๋ธ” 7,8์„ ํ†ตํ•ด์„œ๋„ ์•Œ ์ˆ˜ ์žˆ๋“ฏ์ด, ํŠน์ • ์ž‘์—… (๊ฐ์ฒด ํƒ์ง€, ์˜์ƒ ๋ถ„ํ• ) ๋ถ„์•ผ์—์„œ๋„ ์ข‹์€ ์„ฑ๋Šฅ์„ ๋ณด์ž„
  • ์ˆซ์ž๊ฐ€ ๋‚ฎ๋‹ค๊ณ  ์ƒ๊ฐํ–ˆ์ง€๋งŒ,, ์ œ๋กœ์ƒท ์˜ˆ์ธก ํ‰๊ฐ€๋ผ๋Š” ๋งค์šฐ ์–ด๋ ค์šด ์กฐ๊ฑดํ•˜์—์„œ ๋‹ฌ์„ฑ๋œ ์ ์ˆ˜์ด๋ฉฐ, ๊ฐ์ฒด ํƒ์ง€๋‚˜ ์˜์ƒ ๋ถ„ํ• ๊ณผ ๊ฐ™์€ ํŠน์ • ์ž‘์—…์— fine-tuned๋˜์ง€ ์•Š์€ ์ƒํƒœ๋ผ๋Š” ๊ฒƒ์„ ๊ฐ์•ˆํ–ˆ์„๋•Œ, ์ข‹์€ ์„ฑ๋Šฅ์ด๋ผ๊ณ  ๋ณผ ์ˆ˜ ์žˆ๋‹ค.

8.2 Performance of VLM Transfer Learning

  • VLM ์ „์ดํ•™์Šต์˜ ์„ฑ๋Šฅ
    • ์ง€๋„ ์ „์ดํ•™์Šต, few-shot ์ง€๋„ ์ „์ดํ•™์Šต, ๋น„์ง€๋„ ์ „์ดํ•™์Šต ๋ฐฉ์‹์œผ๋กœ ๋‚˜๋ˆ„์–ด ์ง„ํ–‰

8.3 Performance of VLM Knowledge Distillation

  • VLM ์ง€์‹ ์ฆ๋ฅ˜ ๋ฐฉ์‹์˜ ์„ฑ๋Šฅ
    • ์–ด๋–ป๊ฒŒ ๊ฐ์ฒด ํƒ์ง€/์˜์ƒ ๋ถ„ํ•  ๋ถ„์•ผ์—์„œ ์ง€์‹ ์ฆ๋ฅ˜๊ฐ€ ๋„์›€์„ ์ค„ ์ˆ˜ ์žˆ๋Š”์ง€ ํ™•์ธ
    • ๊ฐ์ฒด ํƒ์ง€์— ๊ฐ€์žฅ ๋งŽ์ด ์‚ฌ์šฉ๋˜๋Š” ๋ฐ์ดํ„ฐ์…‹, ์˜์ƒ ๋ถ„ํ• ์— ๊ฐ€์žฅ ๋งŽ์ด ์‚ฌ์šฉ๋˜๋Š” ๋ฐ์ดํ„ฐ์…‹ ํ™œ์šฉ

8.4 Summary

9 Future Directions

  • Future research directions for VLM pre-training
    • Fine-grained vision-language correlation modeling: linking text to specific parts of an image (pixels/patches) rather than the whole image
    • Unified vision and language learning: processing image and text together within a single unified Transformer encoder
    • Pre-training with multiple languages: pre-training on diverse languages rather than only English (reducing cultural and regional bias)
    • Data efficiency: training should be possible with less data
    • Leveraging LLMs: using LLMs to generate richer, more accurate text descriptions and exploiting them during pre-training
  • Future research directions for VLM transfer learning
    • Unsupervised VLM transfer learning
      • Beyond supervised/few-shot learning, which depends on labeled data and risks overfitting, transfer should be possible without any labels (this area needs much more research!)
    • Test-time VLM transfer learning
      • To overcome the inefficiency of training separately for each downstream task, prompts should be adaptable on the fly at test time, when predictions are made
    • VLM transfer learning with LLMs
      • Instead of hand-crafting prompts or learning them from data, research is needed on using LLMs to automatically generate the prompts that best describe a downstream task
  • Future research directions for VLM knowledge distillation
    • Knowledge distillation from multiple VLMs
      • It should be possible to receive knowledge from several VLMs rather than a single one
    • Extension to other visual recognition tasks
      • Beyond object detection and semantic segmentation, distillation should extend to broader visual recognition tasks such as person re-identification and instance segmentation

10 Conclusion

  • VLM์˜ ํ•ต์‹ฌ ๊ฐ€์น˜
    • ์‹œ๊ฐ ์ธ์‹์„ ์œ„ํ•œ VLM์€ ์›น ๋ฐ์ดํ„ฐ๋ฅผ ํšจ๊ณผ์ ์œผ๋กœ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ์œผ๋ฉฐ, ํŠน์ • ์ž‘์—…์— ๋Œ€ํ•œ ํŒŒ์ธํŠœ๋‹ ์—†์ด๋„ ์ œ๋กœ์ƒท ์˜ˆ์ธก์ด ๊ฐ€๋Šฅํ•จ

    โ‡’ ๊ตฌํ˜„์ด ๊ฐ„๋‹จํ•˜๋ฉด์„œ๋„ ๊ด‘๋ฒ”์œ„ํ•œ ์‹œ๊ฐ์  ์ธ์‹ ์ž‘์—…์—์„œ ํฐ ์„ฑ๊ณผ

  • VLM ๋ฐ์ดํ„ฐ์…‹, ์ ‘๊ทผ๋ฒ•, ์„ฑ๋Šฅ์— ๋Œ€ํ•œ ์ •๋ณด๋ฅผ ์š”์•ฝํ•˜์—ฌ VLM ์‚ฌ์ „ ํ•™์Šต์˜ ์ตœ๊ทผ ๋ฐœ์ „์— ๋Œ€ํ•œ ์ „์ฒด์ ์ธ ๊ทธ๋ฆผ์„ ํŒŒ์•…ํ•  ์ˆ˜ ์žˆ์œผ๋ฉฐ, VLM์˜ ์•ž์œผ๋กœ์˜ ์—ฐ๊ตฌ ๋ฐฉํ–ฅ์„ ์ œ์‹œ
This post is licensed under CC BY 4.0 by the author.