VLM: Paper Review: TokenFlow: Unified Image Tokenizer for Multimodal Understanding and Generation


AI바라기 2024. 12. 23. 18:06

Abstract

We present TokenFlow, a novel unified image tokenizer that bridges the long-standing gap between multimodal understanding and generation. Prior research attempted to unify these two tasks with a single reconstruction-targeted Vector Quantization (VQ) encoder. We observe that understanding and generation require fundamentally different granularities of visual information. This leads to a critical trade-off that particularly compromises performance on multimodal understanding tasks. TokenFlow addresses this with an innovative dual-codebook architecture that decouples semantic and pixel-level feature learning while keeping them aligned through a shared mapping mechanism. Through shared indices, this design gives direct access to both the high-level semantic representations crucial for understanding and the fine-grained visual features essential for generation. Our extensive experiments demonstrate TokenFlow's superiority across multiple dimensions. Leveraging TokenFlow, we show for the first time that discrete visual input can surpass LLaVA-1.5 13B in understanding performance, with an average improvement of 7.2%. For image reconstruction, we achieve a strong FID score of 0.63 at 384x384 resolution. Moreover, TokenFlow establishes state-of-the-art performance in autoregressive image generation with a GenEval score of 0.55 at 256x256 resolution, comparable to SDXL.

TokenFlow: A Unified Image Tokenizer for Multimodal Understanding and Generation

Core problem:

Multimodal understanding and generation tasks require different levels of detail (granularity) in visual information.
Prior work: tried to unify the two tasks with a single Vector Quantization (VQ) encoder, which creates a trade-off that degrades understanding performance in particular.

TokenFlow's solution:

Dual-codebook architecture: learns semantic features and pixel-level features separately.
Shared mapping mechanism: keeps the two kinds of features aligned.
Result: shared indices give direct access to both high-level semantic representations (important for understanding) and fine-grained visual features (important for generation).

Main results:

Understanding: discrete visual input surpasses LLaVA-1.5 13B, with an average improvement of 7.2%.
Image Reconstruction: FID score of 0.63 at 384x384 resolution.
Autoregressive Image Generation: GenEval score of 0.55 at 256x256 resolution, comparable to SDXL.

Conclusion:

TokenFlow is an efficient and powerful image tokenizer for multimodal understanding and generation.
It reports state-of-the-art results among autoregressive image generators (GenEval 0.55) and, with discrete visual input, understanding performance above LLaVA-1.5 13B.
It overcomes the limitations of prior work and shows that the two tasks can be unified.

1. Introduction

Large Language Models (LLMs) have revolutionized natural language processing through a unified autoregressive framework, showing remarkable capabilities across diverse tasks. In the multimodal domain of vision and language, however, a fundamental divide remains between the perception and generation paradigms. Current approaches handle them with separate architectures: multimodal understanding models use vision encoders and projection layers to align visual representations with pre-trained LLMs, while visual generation relies on diffusion-based methods or on discrete image tokens for autoregressive generation. This divergence motivates the pursuit of unified approaches capable of both understanding and generation.

The recent release of GPT-4o has greatly increased interest in developing more generalist multimodal models. Early efforts to unify perception and generation focused mainly on equipping LLMs with the abilities of diffusion models. These approaches introduce substantial architectural complexity and computational overhead, which highlights the need for a more elegant unified solution. Recent efforts have explored a promising direction: using a single transformer architecture to unify visual and textual information within a next-token prediction framework. This approach relies on VQ encoders to convert visual inputs into discrete tokens that can be processed alongside text, potentially offering a simpler and more efficient framework. Treating both modalities as sequences of discrete tokens enables end-to-end training within a single architecture.

However, these unified approaches face a fundamental challenge. Multimodal understanding demands rich semantic representations to support complex reasoning, whereas visual generation requires precise encoding of spatial structure and textural details. Current methods mostly use reconstruction-targeted VQ encoders, which are optimized primarily for reconstruction fidelity. This optimization suits generation tasks but may limit the ability to capture the high-level semantic features that understanding tasks depend on. Janus attempts to resolve this conflict by using separate encoders for understanding and generation, but this increases model complexity without resolving the underlying representation mismatch. These limitations point to a critical gap in the field: the absence of a unified visual encoding mechanism that serves both perception and generation objectives effectively.

This raises our central research question: can a single image tokenizer derive representations suitable for both multimodal understanding and generation? To address this challenge, we propose TokenFlow, a novel unified image tokenizer with a unique dual-flow design that bridges the gap between understanding and generation. The key insight is to decouple the learning of semantic and pixel-level features while keeping them aligned through a shared index mapping. By mapping patches that are similar at both the semantic and the pixel level to the same indices, the quantized features can be applied directly to both autoregressive visual generation and multimodal understanding. Unlike a concurrent approach that confines different feature levels to a single codebook, TokenFlow's dual-codebook design enables specialized learning while maintaining cross-level correlations through shared indices. This innovation gives simultaneous access to semantic and pixel-level representations without compromising either.

Specifically, TokenFlow adopts a dual-encoder architecture paired with corresponding specialized codebooks. The semantic encoder, learned from a CLIP-style teacher, provides strong semantic priors, while the pixel encoder captures detailed visual information. The extracted features are quantized by minimizing the weighted summation of semantic and pixel-level distances, which creates a joint representation space.

Our framework shows remarkable scalability, maintaining codebook utilization above 95% even for large-scale codebooks with over 130K entries, well ahead of previous approaches in both capacity and efficiency. TokenFlow also achieves a strong FID score of 0.63 at 384x384 resolution. For text-to-image synthesis, we establish a new state-of-the-art GenEval score of 0.55 at 256x256 resolution in the autoregressive paradigm while requiring far fewer sampling steps than existing methods. On multimodal understanding benchmarks, TokenFlow achieves new state-of-the-art performance with minimal training overhead, surpassing LLaVA-1.5 13B by 7.2% on average. This is the first demonstration that discrete visual inputs can surpass this strong baseline. These results validate TokenFlow's effectiveness as a unified visual tokenizer that bridges the long-standing gap between understanding and generation tasks.

1. Introduction: Challenges in Multimodal (Vision & Language) Research and the Emergence of TokenFlow

Current trends and limitations in multimodal research:

Progress of LLMs: major successes in NLP through the autoregressive framework.
Separation of vision and language:
- Multimodal Understanding: vision encoders and projection layers align visual representations with pre-trained LLMs.
- Visual Generation: diffusion-based methods, or autoregressive generation over discrete image tokens.
- This separation creates the need for a unified approach that covers both understanding and generation.
Arrival of GPT-4o: increased interest in developing more generalist multimodal models.
Limitations of early unification attempts:
- They try to give LLMs the capabilities of diffusion models.
- They incur high architectural complexity and computational overhead.
Recent research direction:
- A single transformer architecture unifies visual and textual information within a next-token prediction framework.
- VQ encoders convert visual inputs into discrete tokens.
- Treating both modalities as sequences of discrete tokens enables end-to-end training.
Fundamental limitation of unified approaches:
- Multimodal Understanding: needs rich semantic representations for complex reasoning.
- Visual Generation: needs precise encoding of spatial structure and textural details.
- Reconstruction-targeted VQ encoders: optimized for reconstruction fidelity, so limited in capturing high-level semantic features.
- Janus: uses separate encoders, but increases model complexity and leaves the underlying representation mismatch unresolved.
Core problem: there is no unified visual encoding mechanism that serves both perception and generation objectives effectively.

TokenFlow: a proposed unified image tokenizer:

Central research question: can a single image tokenizer derive representations suitable for both multimodal understanding and generation?
Key ideas of TokenFlow:
- Dual-flow design: bridges the gap between understanding and generation.
- Decoupled learning: semantic and pixel-level features are learned separately.
- Shared index mapping: keeps the two kinds of features aligned.
- Result: the quantized features apply to both autoregressive visual generation and multimodal understanding.
What sets TokenFlow apart:
- Dual-codebook design: specialized learning while maintaining cross-level correlations through shared indices.
- Simultaneous access to semantic and pixel-level representations.
Characteristics of TokenFlow:
- Dual-encoder architecture: paired with specialized codebooks.
- Semantic encoder: learned from a CLIP-style teacher, providing strong semantic priors.
- Pixel encoder: captures detailed visual information.
- Joint representation space: features are quantized by minimizing the weighted summation of semantic and pixel-level distances.

TokenFlow's results:

Scalability: maintains codebook utilization above 95% even with large-scale codebooks of more than 130K entries.
Image Reconstruction: FID score of 0.63 at 384x384 resolution.
Text-to-Image Synthesis: GenEval score of 0.55 at 256x256 resolution, state of the art within the autoregressive paradigm.
Multimodal Understanding: surpasses LLaVA-1.5 13B by 7.2% on average (the first time discrete visual inputs exceed this baseline).
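
To make the codebook utilization figure concrete, here is a minimal sketch that assumes the common definition: the fraction of codebook entries selected at least once over a set of tokenized images. The function name `codebook_utilization` and the toy data are illustrative, not taken from the paper.

```python
def codebook_utilization(indices, codebook_size):
    """Fraction of codebook entries that are selected at least once.

    Assumed definition (not spelled out in the paper text quoted here):
    utilization = |distinct indices used| / K.
    """
    used = {i for i in indices if 0 <= i < codebook_size}
    return len(used) / codebook_size


# Toy example: an 8-entry codebook where the tokens hit 6 distinct entries.
tokens = [0, 1, 1, 3, 4, 4, 6, 7, 0, 3]
util = codebook_utilization(tokens, codebook_size=8)
print(f"utilization = {util:.2%}")
```

Under this definition, utilization above 95% for a codebook of 131,072 (2^17) entries means that more than roughly 124,500 entries are actually in use.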

Conclusion:

TokenFlow is a unified visual tokenizer that handles both understanding and generation tasks effectively.
It overcomes the limitations of prior work and suggests a new direction for multimodal research.

2. Related Work

2.1. Tokenization for Visual Generation.

Vector quantized (VQ) image tokenizers have played an important role in the recent progress of autoregressive image generation. [54] proposed VQVAE, which quantizes patch-level features using codebook entries learned with an encoder-decoder structure under a reconstruction loss. VQVAE-2 advanced this framework with exponential moving average updates and a hierarchical multi-scale approach. VQGAN further improved the architecture by incorporating adversarial and perceptual losses, producing more precise and detailed representations. Recent advances in VQ tokenizers have focused on three main directions: improving reconstruction fidelity and generation quality, improving codebook utilization, and exploring novel architectures such as multi-scale VQVAE for next-scale prediction of images. These methods preserve local details well after quantization, but they often struggle to capture semantic-level information, which limits their effectiveness on autoregressive multimodal image understanding tasks. Our proposed TokenFlow addresses this limitation by introducing dual codebooks with a shared mapping, achieving state-of-the-art performance in both autoregressive generation and multimodal understanding.

2.2. Tokenization for Unified Multimodal Understanding and Generation

Efforts to bridge the gap between multimodal understanding and generation have emerged recently. Approaches such as Chameleon, EMU3, and Show-o use VQ tokenizers to encode images for both tasks. However, these methods typically require multimodal training from scratch and often suffer degraded performance on visual perception tasks because of the limited semantic representation in the tokenized features. SEED-LLaMA introduced a new VQ tokenizer that incorporates high-level semantics for understanding and uses SD as the generation decoder. Janus tried to address the modality gap by using separate tokenizers for understanding and generation, but this increases model complexity without resolving the underlying challenge. A concurrent work proposed a unified vision tower that aligns discrete visual features with text during pre-training. However, that approach confines low-level and high-level representations to a single flow, which caps downstream performance. In contrast, our work posits that the key to unifying understanding and generation lies in learning a universal mapping. By defining dual codebooks with a shared mapping, TokenFlow enables flexible combinations of low-level and high-level features, yielding superior performance across all downstream tasks.

2. Related Work: Tokenization for Visual Generation and for Unified Multimodal Understanding & Generation

2.1. Tokenization for Visual Generation:

VQ-based image tokenizers: played a key role in the progress of autoregressive image generation.
Main models:
- VQVAE: quantizes patch-level features and learns the codebook with a reconstruction loss.
- VQVAE-2: adds exponential moving average updates and a hierarchical multi-scale approach.
- VQGAN: adds adversarial and perceptual losses to produce more precise representations.
Recent research directions:
- Improving reconstruction fidelity and generation quality.
- Improving codebook utilization.
- Exploring novel architectures such as multi-scale VQVAE for next-scale prediction.
Limitation: local details are preserved well, but semantic-level information is hard to capture, which makes these tokenizers hard to apply to multimodal understanding tasks.
What sets TokenFlow apart: dual codebooks with a shared mapping, achieving state-of-the-art performance in both autoregressive generation and multimodal understanding.

2.2. Tokenization for Unified Multimodal Understanding and Generation:

Goal: bridge the gap between multimodal understanding and generation.
Main models:
- Chameleon, EMU3, Show-o: use VQ tokenizers to encode images for both tasks.
- SEED-LLaMA: a VQ tokenizer that incorporates high-level semantics for understanding, with SD as the generation decoder.
- Janus: uses separate tokenizers for understanding and generation.
Limitations:
- Often require multimodal training from scratch.
- Performance can degrade on visual perception tasks (limited semantic representation in the tokenized features).
- Janus: increases model complexity without resolving the underlying problem.
- Concurrent work: proposes a unified vision tower that aligns discrete visual features with text during pre-training, but confines low-level and high-level representations to a single flow, which limits downstream performance.
What sets TokenFlow apart:
- Key idea: the key to unifying understanding and generation is learning a universal mapping.
- Dual codebooks with a shared mapping: flexible combinations of low-level and high-level features.
- Result: superior performance across all downstream tasks.

Conclusion:

Prior work pursued either tokenization for visual generation or the unification of multimodal understanding and generation, and each line of work has its limitations.
TokenFlow overcomes them with a new approach, dual codebooks with a shared mapping, and achieves state-of-the-art performance in both areas.
TokenFlow suggests a new direction for multimodal research.

3. Method

3.1. Motivation

Unifying multimodal understanding and generation in a consistent next-token prediction paradigm requires a VQ tokenizer that extracts indices from input images. Existing VQ tokenizers excel at pixel-level image reconstruction, but our investigation reveals a significant limitation in their image understanding capabilities. We ran experiments that use these tokenizers as feature extractors within the LLaVA-1.5 framework. As shown in Exp. 2-4 of Tab. 1, these discrete tokenizers consistently lag behind the continuous tokenizer CLIP ViT-B/14. We hypothesize that this performance gap stems mainly from pre-training objectives that optimize for better low-level reconstruction quality. As a result, the extracted features mostly encode low-level information and lack the semantic-level understanding that complex visual reasoning requires.

Another straightforward solution for unified understanding and generation is to distill discrete tokens from a pre-trained CLIP and then equip them with image reconstruction capability. As Exp. 5 demonstrates, VQKD, distilled from CLIP ViT-B/14, narrows the performance gap considerably compared with other discrete tokenizers. We further ran an experiment that reconstructs the original image from the quantized features extracted by VQKD. As shown in Fig. 8, the reconstructed images exhibit significant blurring and a clear loss of high-frequency details. We attribute this to the nature of the VQKD encoder, which maps semantically close patches to the same codebook index. As visualized in Fig. 4 (a), VQKD tends to map images with the same semantic meaning to the same codebook index, whereas VQGAN (Fig. 4 (b)) tends to map visually similar images to the same codebook index, prioritizing low-level features over semantic content. Reconstructing fine-grained details from the low-level-dissimilar patches that VQKD aggregates is therefore very difficult. These observations underline the need for a new tokenization approach that can handle both high-level semantic understanding and low-level visual reconstruction.

3.2. Unified Image Tokenizer

To bridge this gap, we propose TokenFlow (Fig. 3), a novel unified image tokenizer that enables joint representation learning at both the semantic and the pixel level. We believe the key to unifying understanding and generation lies in learning a universal mapping. If the tokenizer can map patches that are similar at both the high level and the low level to the same codebook index, the quantized features can be decoded easily and applied directly to both autoregressive visual generation tasks and multimodal understanding tasks.

Encoder. Unlike previous approaches that use a single encoder to extract low-level image information, we propose a dual-encoder architecture consisting of a semantic encoder $E_{\text{sem}}$ and a pixel encoder $E_{\text{pix}}$. This design allows two distinct types of image features to be extracted. The semantic encoder is initialized from a pre-trained text-aligned vision encoder (e.g., CLIP ViT-B/14). This initialization strategy facilitates better learning of high-level text-aligned embeddings in the semantic codebook and ultimately improves the model's multimodal understanding capabilities. For brevity, we omit the spatial indices of the feature representations. $\hat{z}_{\text{sem}} = E_{\text{sem}}(x) \in \mathbb{R}^{d_{\text{sem}}}$ and $\hat{z}_{\text{pix}} = E_{\text{pix}}(x) \in \mathbb{R}^{d_{\text{pix}}}$ are the features encoded by the semantic and pixel encoders.

Quantization. We introduce an innovative quantization approach that uses dual codebooks: semantic-level embeddings $Z_{\text{sem}} = \{z_{\text{sem},i}\}_{i=1}^{K} \in \mathbb{R}^{K \times d_{\text{sem}}}$ and pixel-level embeddings $Z_{\text{pix}} = \{z_{\text{pix},i}\}_{i=1}^{K} \in \mathbb{R}^{K \times d_{\text{pix}}}$, where $K$ is the number of codebook entries. The two codebooks share a unified mapping, so the quantization process can consider high-level semantic information and low-level pixel details at the same time. Given the encoded feature representations $\hat{z}_{\text{sem}}$ and $\hat{z}_{\text{pix}}$, we compute their distances to the respective codebook embeddings after l2-normalization:

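$$d_{\text{sem},i} = \lVert \hat{z}_{\text{sem}} - z_{\text{sem},i} \rVert_2^2, \quad i = 1, \dots, K \tag{1}$$

$$d_{\text{pix},i} = \lVert \hat{z}_{\text{pix}} - z_{\text{pix},i} \rVert_2^2, \quad i = 1, \dots, K \tag{2}$$

$$i^* = \arg\min_i \left( d_{\text{sem},i} + w_{\text{dis}} \cdot d_{\text{pix},i} \right) \tag{3}$$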

The optimal quantization index $i^*$ is determined by minimizing the weighted sum of these two distances, as in Eq. (3), where $w_{\text{dis}}$ is the distance balance weight. This joint optimization differs markedly from previous VQ methods, which typically focus on learning the distribution of a single feature type. We additionally adopt a multi-scale VQ (MSVQ) structure to enrich the codebook representation. The shared mapping strategy lets the codebook learn the joint distribution of high-level semantics and low-level features, which brings several key advantages.

❶ Scalability: our approach shows consistent performance improvements on both generative and understanding tasks as the codebook size grows, because a larger codebook offers more possible combinations of high-level and low-level features. Even with the codebook size scaled to 131,072, it still maintains a very high utilization rate above 95% while achieving the best image reconstruction quality and multimodal understanding performance.

❷ Multi-task Capabilities: by learning the joint distribution of semantic and pixel-level features, our method bridges the gap between generation and understanding tasks. This unified representation allows a single tokenizer to excel in both domains. The design can also integrate additional codebooks seamlessly to embed other types of feature representations, which makes it extensible to more downstream tasks without architectural modifications.

Decoder and Training Objective. Our architecture incorporates two distinct decoders, a semantic decoder $D_{\text{sem}}$ and a pixel decoder $D_{\text{pix}}$, for reconstructing the semantic features and the original image respectively. We use a teacher model (the same model used to initialize the semantic encoder) to extract the target features. The semantic loss $\mathcal{L}_{\text{sem}}$ is computed as the l2 distance between the decoded features and the teacher-extracted features. The reconstruction loss is formulated as:

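$$\mathcal{L}_{\text{pix}} = \ell_2(x, \hat{x}) + \mathcal{L}_{\text{P}}(x, \hat{x}) + \lambda_{\text{G}} \mathcal{L}_{\text{G}}(\hat{x}) \tag{4}$$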

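where $\hat{x} = D_{\text{pix}}(z)$, $\ell_2$ denotes the pixel-wise reconstruction loss, $\mathcal{L}_{\text{P}}(\cdot)$ denotes the perceptual loss computed with LPIPS, and $\mathcal{L}_{\text{G}}(\cdot)$ denotes the adversarial loss with weight coefficient $\lambda_{\text{G}}$. Following vector quantization conventions, we use the straight-through gradient estimator: $z = \operatorname{sg}[z - \hat{z}] + \hat{z}$, where $\operatorname{sg}[\cdot]$ denotes the stop-gradient operation. The codebook learning objective is $\mathcal{L}_{\text{VQ}} = \lVert \operatorname{sg}[\hat{z}] - z \rVert_2^2 + \beta \lVert \hat{z} - \operatorname{sg}[z] \rVert_2^2$, where the second term is the commitment loss with balancing factor $\beta$. The total training objective is the sum of all losses: $\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{sem}} + \mathcal{L}_{\text{VQ}} + \mathcal{L}_{\text{pix}}$.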
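
As a short worked step (standard VQ reasoning, not spelled out in the text above), it may help to see why the straight-through estimator and the two terms of the VQ loss behave as described. Treating sg[·] as the identity in the forward pass and as a constant in the backward pass gives:

$$z = \operatorname{sg}[z - \hat{z}] + \hat{z} \;\Rightarrow\; \frac{\partial z}{\partial \hat{z}} = I, \qquad \frac{\partial \mathcal{L}}{\partial \hat{z}} = \frac{\partial \mathcal{L}}{\partial z}$$

$$\frac{\partial}{\partial z} \lVert \operatorname{sg}[\hat{z}] - z \rVert_2^2 = 2\,(z - \hat{z}), \qquad \frac{\partial}{\partial \hat{z}}\, \beta \lVert \hat{z} - \operatorname{sg}[z] \rVert_2^2 = 2\beta\,(\hat{z} - z)$$

So the forward value fed to the decoder is the quantized vector, the decoder's gradient is copied unchanged to the encoder output, the first VQ term moves only the codebook entry toward the encoder output, and the commitment term moves only the encoder output toward the codebook entry.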

3.3. Visual Generation with TokenFlow

Using the next-scale prediction paradigm, TokenFlow helps achieve SOTA performance in autoregressive text-to-image generation. Below we describe the training and inference strategies for high-quality image synthesis.

Training Strategy. Our visual generation architecture builds on a pre-trained LLM. For text encoding, we use the model's native BPE tokenizer to convert the input text into discrete token sequences and extract feature representations. The original vocabulary is extended with specialized visual tokens. We extract image tokens with TokenFlow, pass them through an MLP, and concatenate them with the text tokens for training. Given the autoregressive nature of the model, we use a cross-entropy loss computed only on the image tokens. To enable classifier-free guidance at inference, we randomly replace the conditioning text with an empty string with probability $p_{\text{drop}} = 0.1$ during training. To improve training stability and prevent loss spikes, we incorporate QK-normalization and norm re-ordering.

Inference Strategy. We observe that conventional top-k top-p sampling strategies often lead to image collapse and repetitive local patterns when used in the next-scale paradigm. A likely reason is that the cross-entropy training objective mainly establishes attention-based relationships with the top-1 prediction. Independent top-k sampling for each token at inference can produce tokens that lack direct correlation with one another, leading to inconsistent or repetitive patterns that the attention of subsequent scales can only partially repair. The problem becomes more severe with a limited number of inference steps. To address this fundamental limitation, we propose a new multi-step sampling approach. (i) Initial sampling: perform top-k top-p sampling with parameters $k_1$ and $p_1$. (ii) Refinement: use the sampled output as the input to a second sampling round at the same scale, with reduced parameters $k_2 < k_1$ and $p_2 < p_1$. This progressive narrowing of the sampling space preserves creative diversity while enforcing consistency through the refinement step. Empirical results show markedly more coherent and visually appealing generations than single-pass sampling methods (see Fig. 5 and the detailed ablation in Appendix B.1).
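
The two-round sampling idea can be sketched in plain Python. This is an illustration under stated assumptions, not the authors' implementation: `logits_fn` is a hypothetical stand-in for the generator (it receives either `None` or the draft tokens of the current scale and returns one logit list per position), and the way the toy version conditions on the draft is invented purely to make the example runnable.

```python
import math
import random


def top_k_top_p_sample(logits, k, p, rng):
    """Sample one index after top-k filtering followed by top-p (nucleus) filtering."""
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    m = max(logits[i] for i in order)
    weights = [math.exp(logits[i] - m) for i in order]
    total = sum(weights)
    kept, cum = [], 0.0
    for idx, w in zip(order, weights):
        kept.append((idx, w / total))
        cum += w / total
        if cum >= p:  # smallest prefix whose probability mass reaches p
            break
    r = rng.random() * sum(pr for _, pr in kept)
    acc = 0.0
    for idx, pr in kept:
        acc += pr
        if r <= acc:
            return idx
    return kept[-1][0]


def multi_step_sample(logits_fn, num_tokens, k1, p1, k2, p2, rng):
    """Broad first draw, then a narrower re-draw at the same scale."""
    assert k2 < k1 and p2 < p1
    # Round 1: initial sampling with (k1, p1), no draft available yet.
    draft = [top_k_top_p_sample(l, k1, p1, rng) for l in logits_fn(None, num_tokens)]
    # Round 2: feed the draft back in and re-sample with reduced (k2, p2).
    return [top_k_top_p_sample(l, k2, p2, rng) for l in logits_fn(draft, num_tokens)]


def toy_logits_fn(draft, num_tokens):
    """Hypothetical model stub: fixed logits, boosted at the drafted token."""
    rows = [[2.0, 1.5, 1.0, 0.5, 0.0] for _ in range(num_tokens)]
    if draft is not None:
        for pos, tok in enumerate(draft):
            rows[pos][tok] += 3.0  # stand-in for conditioning on the draft
    return rows


rng = random.Random(0)
tokens = multi_step_sample(toy_logits_fn, 6, k1=3, p1=0.95, k2=2, p2=0.8, rng=rng)
print(tokens)
```

The point of the sketch is only the control flow: the second round sees the first round's output and samples from a tighter candidate set, which is how the narrowing of the sampling space is meant to enforce consistency.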

3.4. Multimodal Understanding with TokenFlow

TokenFlow functions as a multi-scale VQ tokenizer, and its quantized multi-scale features can be fed directly into a pre-trained LLM for multimodal understanding training, following the LLaVA-1.5 paradigm. The joint feature representations of the dual flow serve as the input to the model. We validate several feature input strategies: (i) features from all scales, (ii) the final-scale feature only, and (iii) residual features from all scales. As detailed in Appendix B.1, we find that the final-scale features achieve the best overall performance. This suggests that the final scale captures the semantic information most relevant to multimodal understanding, while additional scale features or residual features may introduce noise that degrades performance. Our model shows substantial improvements over existing discrete multimodal methods. Notably, these gains come with minimal computational overhead: training on the LLaVA 1.5 data takes less than 24 hours on 8×A100 GPUs.
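
The three input strategies can be illustrated with a toy sketch. It rests on two simplifying assumptions that are not stated in the text: each scale contributes a residual feature map already resized to the same number of tokens, and the feature at scale s is the running sum of the residuals up to s. The function names and strategy labels are hypothetical.

```python
def scale_features(residuals):
    """Running sums of residual maps: feature at scale s = r_1 + ... + r_s (assumed)."""
    feats, running = [], None
    for r in residuals:
        if running is None:
            running = [row[:] for row in r]
        else:
            running = [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(running, r)]
        feats.append(running)
    return feats


def build_llm_input(residuals, strategy):
    """Select the visual token sequence handed to the LLM."""
    feats = scale_features(residuals)
    if strategy == "all_scales":   # (i) features from every scale
        return [tok for f in feats for tok in f]
    if strategy == "final_scale":  # (ii) final-scale feature only
        return feats[-1]
    if strategy == "residuals":    # (iii) residual features from every scale
        return [tok for r in residuals for tok in r]
    raise ValueError(f"unknown strategy: {strategy}")


# Two scales, two tokens per scale, feature dimension 2.
residuals = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.5, 0.5], [0.5, 0.5]],
]
final = build_llm_input(residuals, "final_scale")
print(final)  # [[1.5, 0.5], [0.5, 1.5]]
```

Besides the reported accuracy difference, the final-scale option also keeps the visual sequence shortest in this simplified setting, since the other two strategies concatenate tokens from every scale.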

3. Method: TokenFlow, a Unified Image Tokenizer for Understanding and Generation

3.1. Motivation: Limitations of Existing Visual Encoders and the Need for TokenFlow

Problem statement:
- Existing VQ tokenizers: strong at pixel-level reconstruction but limited in image understanding.
- In experiments within the LLaVA-1.5 framework, discrete tokenizers (VQGAN, VQGAN-f8, VQGAN-f4) show lower understanding performance than the continuous tokenizer (CLIP ViT-B/14) (see Exp. 2-4 in Table 1).
- Cause: their pre-training objectives target low-level reconstruction quality, so the extracted features mostly encode low-level information and lack the semantic-level understanding needed for visual reasoning.
Limitations of the CLIP-based alternative:
- Approach: distill discrete tokens from a pre-trained CLIP and add reconstruction capability.
- VQKD (distilled from CLIP ViT-B/14) improves on other discrete tokenizers, but its reconstructed images show significant blurring and loss of high-frequency details (see Fig. 8).
- Cause: the VQKD encoder maps semantically close patches to the same codebook index (see Fig. 4 (a)), so patches that share an index can look very different at the low level, which makes fine-grained details hard to reconstruct.
Conclusion: a new tokenization approach is needed that handles both high-level semantic understanding and low-level visual reconstruction effectively.

3.2. Unified Image Tokenizer: TokenFlow

Core idea: to unify understanding and generation, learn a universal mapping that assigns patches to the same codebook index only when they are similar at both the high level and the low level.
TokenFlow architecture (Fig. 3):
- Dual encoder:
  - Semantic encoder ($E_{\text{sem}}$): initialized from a pre-trained text-aligned vision encoder (e.g., CLIP ViT-B/14). Facilitates learning of high-level text-aligned embeddings and improves multimodal understanding capabilities.
  - Pixel encoder ($E_{\text{pix}}$): extracts low-level visual features.
- Quantization:
  - Dual codebooks:
    - Semantic-level embeddings ($Z_{\text{sem}}$): store semantic information.
    - Pixel-level embeddings ($Z_{\text{pix}}$): store pixel details.
    - The two codebooks share a unified mapping.
  - Quantization process:
    - Compute the distances between the encoded features ($\hat{z}_{\text{sem}}$, $\hat{z}_{\text{pix}}$) and the embeddings of each codebook (Eq. 1, 2).
    - Choose the optimal quantization index $i^*$ that minimizes the weighted sum of distances (Eq. 3).
    - The distance balance weight ($w_{\text{dis}}$) controls how much semantic versus pixel-level information counts.
  - Multi-scale VQ (MSVQ) structure: enriches the codebook representation.
- Advantages of the shared mapping:
  - Learns the joint distribution of high-level semantics and low-level features.
  - Scalability: performance on both generative and understanding tasks improves as the codebook grows. Utilization stays above 95% even at a codebook size of 131,072.
  - Multi-task capabilities: bridges the gap between generation and understanding tasks. With the unified representation, a single tokenizer performs well in both domains.
  - Extensibility: more codebooks can be integrated without architectural modifications to embed other feature representations and extend to more downstream tasks.
- Decoder and training objective:
  - Decoders:
    - Semantic decoder ($D_{\text{sem}}$): reconstructs the semantic features.
    - Pixel decoder ($D_{\text{pix}}$): reconstructs the original image.
  - Training objective:
    - Teacher model: extracts the target features (same model as the semantic encoder initialization).
    - Semantic loss ($\mathcal{L}_{\text{sem}}$): l2 distance between the decoded features and the teacher-extracted features.
    - Reconstruction loss ($\mathcal{L}_{\text{pix}}$): pixel-wise reconstruction loss, perceptual loss (LPIPS), and adversarial loss (Eq. 4).
    - Codebook learning objective ($\mathcal{L}_{\text{VQ}}$): codebook term plus commitment loss, both using the stop-gradient operation.
    - Total training objective ($\mathcal{L}_{\text{total}}$): $\mathcal{L}_{\text{sem}} + \mathcal{L}_{\text{VQ}} + \mathcal{L}_{\text{pix}}$.
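
The shared-index quantization of Eqs. (1)-(3) can be sketched in plain Python for a single patch. This is a minimal illustration, not the reference implementation: the function names are made up, and normalizing both the features and the codebook entries is an assumption about what "after l2-norm" means.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]


def sq_dist(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize_shared(z_sem, z_pix, sem_codebook, pix_codebook, w_dis=1.0):
    """Return the single index i* minimising d_sem,i + w_dis * d_pix,i (Eq. 3)."""
    assert len(sem_codebook) == len(pix_codebook)  # both codebooks hold K entries
    zs, zp = l2_normalize(z_sem), l2_normalize(z_pix)
    best_i, best_d = -1, float("inf")
    for i, (cs, cp) in enumerate(zip(sem_codebook, pix_codebook)):
        d = sq_dist(zs, l2_normalize(cs)) + w_dis * sq_dist(zp, l2_normalize(cp))
        if d < best_d:
            best_i, best_d = i, d
    return best_i


# K = 3 paired entries; entry i of one codebook always goes with entry i of the other.
sem_codebook = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
pix_codebook = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]

# Entry 2 matches both the semantic and the pixel feature exactly.
print(quantize_shared([1.0, 0.0], [0.0, 1.0], sem_codebook, pix_codebook))  # 2

# Same features, different weights: a small w_dis favours the semantic match
# (entry 1), a large w_dis favours the pixel match (entry 2).
print(quantize_shared([0.0, 1.0], [0.0, 1.0], sem_codebook, pix_codebook, w_dis=0.5))  # 1
print(quantize_shared([0.0, 1.0], [0.0, 1.0], sem_codebook, pix_codebook, w_dis=2.0))  # 2
```

Note that the search never picks "the nearest semantic entry" and "the nearest pixel entry" separately. One index is chosen for the pair, which is what keeps the two codebooks aligned.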

3.3. Visual Generation with TokenFlow

Achieves SOTA performance in autoregressive text-to-image generation using the next-scale prediction paradigm.
Training Strategy:
- Built on a pre-trained LLM.
- Text is encoded with the model's BPE tokenizer.
- Image tokens are extracted with TokenFlow and passed through an MLP.
- Text tokens and image tokens are concatenated.
- Cross-entropy loss is computed on image tokens only.
- For classifier-free guidance, the conditioning text is randomly replaced with an empty string ($p_{\text{drop}} = 0.1$).
- QK-normalization and norm re-ordering improve training stability.
Inference Strategy:
- Problem: conventional top-k top-p sampling can cause image collapse and repetitive local patterns.
- Possible cause: the cross-entropy objective mainly establishes attention-based relationships with the top-1 prediction, so tokens sampled independently via top-k may lack direct correlation with one another.
- Solution: a multi-step sampling approach:
  - Initial sampling: top-k top-p sampling ($k_1$, $p_1$).
  - Refinement: a second sampling round with reduced parameters ($k_2 < k_1$, $p_2 < p_1$).
  - Result: more coherent and visually appealing generations than single-pass sampling methods (see Fig. 5).
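
Two of the training details above, caption dropout for classifier-free guidance and a loss restricted to image tokens, are easy to illustrate in isolation. The sketch below is a plain-Python approximation with hypothetical function names. A real training loop would run on batched tensors in a deep learning framework.

```python
import math
import random


def maybe_drop_caption(caption, p_drop, rng):
    """Replace the caption with an empty string with probability p_drop."""
    return "" if rng.random() < p_drop else caption


def image_only_cross_entropy(logits, targets, is_image):
    """Mean cross-entropy over the positions flagged as image tokens."""
    losses = []
    for row, tgt, img in zip(logits, targets, is_image):
        if not img:
            continue  # text positions contribute no loss
        m = max(row)
        log_z = m + math.log(sum(math.exp(v - m) for v in row))
        losses.append(log_z - row[tgt])
    return sum(losses) / len(losses)


rng = random.Random(0)
captions = [maybe_drop_caption("a red bicycle", 0.1, rng) for _ in range(10000)]
drop_rate = captions.count("") / len(captions)
print(f"empirical drop rate: {drop_rate:.3f}")

# One text position (ignored) and one image position with uniform logits.
loss = image_only_cross_entropy(
    logits=[[5.0, 1.0], [0.0, 0.0]],
    targets=[1, 0],
    is_image=[False, True],
)
print(loss)  # log(2) for the single image position
```

Dropping the caption for a fraction of the samples teaches the same model an unconditional distribution, which is what classifier-free guidance later contrasts against the conditional one at inference.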

3.4. Multimodal Understanding with TokenFlow

Functions as a multi-scale VQ tokenizer.
Quantized multi-scale features are fed directly into a pre-trained LLM for multimodal understanding training (LLaVA-1.5 paradigm).
The joint feature representations of the dual flow serve as the model input.
Feature input strategies evaluated:
- Features from all scales.
- Final-scale feature only.
- Residual features from all scales.
Result: the final-scale features achieve the best overall performance (see Appendix B.1).
Interpretation: the final scale captures the semantic information most relevant to multimodal understanding, while additional scale features or residual features may introduce noise.
Outcome: substantial improvements over existing discrete multimodal methods, with minimal computational overhead (less than 24 hours of training on 8×A100 GPUs).

Conclusion:

TokenFlow is a unified image tokenizer that handles both high-level semantic understanding and low-level visual reconstruction effectively.
It achieves strong performance in both understanding and generation through its dual encoder, dual codebooks, shared mapping, and multi-step sampling.
It suggests a new direction for multimodal research.

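TokenFlow Method in a Nutshell (Operation "Catch Two Rabbits"):

1. The problem: limits of the existing rabbit-catching tool (VQ tokenizer)

Existing tool: good at catching pixel information (rabbit A), poor at catching semantic information (rabbit B).
Result: image generation (rabbit A) works well, but image understanding (rabbit B) underperforms.

2. The solution: build a smarter tool (TokenFlow)

Core strategy: build a smart rabbit-catching tool (TokenFlow) that captures both pixel information and semantic information.
TokenFlow's secret weapons:
- Two eyes (Dual-Encoder):
  - Semantic eye (Semantic Encoder): handles the semantic information of the image (rabbit B). It starts from a pre-trained CLIP-style encoder, so it is already smart.
  - Pixel eye (Pixel Encoder): handles the pixel information of the image (rabbit A).
- Two pouches (Dual-Codebook):
  - Semantic pouch (Semantic Codebook): holds what the semantic eye sees.
  - Pixel pouch (Pixel Codebook): holds what the pixel eye sees.
  - Secret passage (Shared Mapping): the two pouches use the same slot numbering, so slot i in one pouch is always paired with slot i in the other.
- Smart storage using both pouches (Quantization):
  - Compare what the two eyes saw (features) with the contents of the two pouches (codebooks) by computing distances.
  - Rather than choosing one pouch, pick the single slot number whose semantic entry and pixel entry are jointly closest (minimizing the weighted sum of distances).
  - Store in several stages (multi-scale) for a richer representation (MSVQ).
- Training specialized for image generation (Visual Generation Training):
  - Training method: for text-to-image generation, text is processed the usual way (BPE tokenizer) and images are processed with TokenFlow.
  - Generation tip (Multi-step Sampling): at each scale, do not stop after one sampling pass. Sample once, then refine with a narrower second pass (initial sampling -> refinement) to get more natural and consistent images.

3. The result: both rabbits caught!

TokenFlow: handles both pixel information and semantic information well.
Outcome: better performance than the existing rabbit-catching tools in both image generation (rabbit A) and image understanding (rabbit B).

One-line summary: TokenFlow is a smart tokenizer that uses a dual encoder, dual codebooks, a shared mapping, and multi-step sampling to handle both pixel information and semantic information, and as a result performs strongly in both image generation and understanding.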

Image input: an image is fed into TokenFlow.
Looking with two eyes (Encoding):
- Semantic Encoder: captures the semantic information of the image and converts it into a vector (embedding).
- Pixel Encoder: captures the pixel-level details of the image and likewise converts them into a vector (embedding).
Finding the best match across the two pouches (Quantization):
- Semantic Codebook: compute the distance from the semantic vector to every entry of the Semantic Codebook.
- Pixel Codebook: compute the distance from the pixel vector to every entry of the Pixel Codebook.
- Similarity measure: squared L2 distance between l2-normalized vectors (Eq. 1, 2).
- Final selection: the nearest entry is not chosen separately in each codebook. A single shared index is selected, the one that minimizes the weighted sum of the semantic and pixel distances (Eq. 3).
Using the selected codebook entry:
- The index of the selected entry represents that image region (patch).
- Much like a word's index in a dictionary, this index is used to convert the image into a discrete token sequence.

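Role of the codebook:

A set of learned vectors: a codebook is a collection of vectors (embeddings) that carry various kinds of semantic or pixel information, playing a role similar to a dictionary.
Efficient representation: representing an image with specific codebook vectors (discrete tokens) instead of continuous vectors compresses the information and makes it efficient to process.
Two codebooks:
- Semantic Codebook: holds vectors that express the abstract, high-level meaning of an image.
- Pixel Codebook: holds vectors that express the detailed pixel information of an image.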
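
To make the dictionary analogy concrete, the sketch below looks up a short index sequence in both codebooks. Concatenating the two retrieved vectors into one joint feature per token is an assumption made here for illustration. The function name and toy codebooks are hypothetical.

```python
def lookup_joint(indices, sem_codebook, pix_codebook):
    """For each shared index, fetch both embeddings and concatenate them."""
    return [sem_codebook[i] + pix_codebook[i] for i in indices]


# Toy codebooks with K = 2 entries: 2-d semantic and 3-d pixel embeddings.
sem_codebook = [[1.0, 0.0], [0.0, 1.0]]
pix_codebook = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]

joint = lookup_joint([1, 0, 1], sem_codebook, pix_codebook)
print(joint[0])  # [0.0, 1.0, 0.4, 0.5, 0.6]
```

Because one index addresses both codebooks, a token sequence produced for generation can be turned into semantic features for understanding without re-encoding the image.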

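In short, TokenFlow analyzes an image from two perspectives (semantic and pixel), compares each view against its own codebook, and selects one shared index whose paired entries are jointly the closest match. Through this shared mapping, the same index gives access to both the semantic embedding and the pixel embedding, which is how the two codebooks stay tightly linked.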