
KAIROS - DCGAN Text-to-Image Generation

Generative AI Model

A Deep Convolutional Generative Adversarial Network (DCGAN) implementation for text-to-image generation using GloVe embeddings and the COCO dataset. The project features a complete PyTorch implementation with comprehensive evaluation metrics, automatic checkpointing, and visualization tools. It was trained on the Microsoft COCO dataset with 300-dimensional GloVe word embeddings projected into a 1024-dimensional space for a richer text representation.

Role

Deep Learning Engineer

Research Scientist

Collaborators

Jason Olefson

Joo Young Gonzalez

Duration

4 months

Tools

PyTorch

Python

NumPy

Pillow

GloVe Embeddings

Jupyter Notebook

View GitHub Repository

View Jupyter Notebook


Key Features

Text-to-Image Generation: Generate high-quality 64x64 RGB images from text descriptions using DCGAN architecture

GloVe Embeddings: 300-dimensional word embeddings projected into a 1024-dimensional space for robust text representation

COCO Dataset: Trained on Microsoft COCO dataset with 82,783 training and 40,504 validation image-caption pairs

Comprehensive Evaluation: FID, IS, Text-Image Matching, and CLIP scores for detailed performance analysis

Automatic Checkpointing: Model checkpoints saved every 10 epochs with resume capability for long training runs

Visualization Tools: Generate sample images, training GIFs, loss plots, and real vs generated image comparisons


Model Architecture

Generator

Input: 100-dimensional noise vector concatenated with 1024-dimensional text embedding
Output: 3-channel RGB image (64x64 pixels)
Architecture: Transposed convolutional layers with batch normalization and ReLU activations
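A minimal PyTorch sketch of a conditional DCGAN generator matching this description. Layer widths, names, and the exact way the noise vector and projected text embedding are fused are illustrative assumptions, not the repository's actual code.

```python
import torch
import torch.nn as nn

class Generator(nn.Module):
    """DCGAN-style generator: (noise ++ text embedding) -> 64x64 RGB image."""
    def __init__(self, noise_dim=100, embed_dim=1024, feat=64):
        super().__init__()
        in_dim = noise_dim + embed_dim  # 1124-d conditioning vector
        self.net = nn.Sequential(
            # (in_dim, 1, 1) -> (feat*8, 4, 4)
            nn.ConvTranspose2d(in_dim, feat * 8, 4, 1, 0, bias=False),
            nn.BatchNorm2d(feat * 8),
            nn.ReLU(True),
            # -> (feat*4, 8, 8)
            nn.ConvTranspose2d(feat * 8, feat * 4, 4, 2, 1, bias=False),
            nn.BatchNorm2d(feat * 4),
            nn.ReLU(True),
            # -> (feat*2, 16, 16)
            nn.ConvTranspose2d(feat * 4, feat * 2, 4, 2, 1, bias=False),
            nn.BatchNorm2d(feat * 2),
            nn.ReLU(True),
            # -> (feat, 32, 32)
            nn.ConvTranspose2d(feat * 2, feat, 4, 2, 1, bias=False),
            nn.BatchNorm2d(feat),
            nn.ReLU(True),
            # -> (3, 64, 64); tanh maps pixel values into [-1, 1]
            nn.ConvTranspose2d(feat, 3, 4, 2, 1, bias=False),
            nn.Tanh(),
        )

    def forward(self, noise, text_embed):
        # noise: (B, 100), text_embed: (B, 1024)
        z = torch.cat([noise, text_embed], dim=1).unsqueeze(-1).unsqueeze(-1)
        return self.net(z)
```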

Discriminator

Input: 3-channel RGB image (64x64) combined with a spatially replicated text embedding
Output: Binary classification score (real/fake probability)
Architecture: Convolutional layers with batch normalization and leaky ReLU activations
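A matching sketch for the conditional discriminator. Concatenating the spatially tiled text embedding at the 4x4 feature-map stage follows the common GAN-INT-CLS pattern and is an assumption about where the fusion happens in this project.

```python
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    """DCGAN-style discriminator: image + tiled text embedding -> real/fake score."""
    def __init__(self, embed_dim=1024, feat=64):
        super().__init__()
        self.conv = nn.Sequential(
            # (3, 64, 64) -> (feat, 32, 32)
            nn.Conv2d(3, feat, 4, 2, 1, bias=False),
            nn.LeakyReLU(0.2, inplace=True),
            # -> (feat*2, 16, 16)
            nn.Conv2d(feat, feat * 2, 4, 2, 1, bias=False),
            nn.BatchNorm2d(feat * 2),
            nn.LeakyReLU(0.2, inplace=True),
            # -> (feat*4, 8, 8)
            nn.Conv2d(feat * 2, feat * 4, 4, 2, 1, bias=False),
            nn.BatchNorm2d(feat * 4),
            nn.LeakyReLU(0.2, inplace=True),
            # -> (feat*8, 4, 4)
            nn.Conv2d(feat * 4, feat * 8, 4, 2, 1, bias=False),
            nn.BatchNorm2d(feat * 8),
            nn.LeakyReLU(0.2, inplace=True),
        )
        # Final layer sees image features concatenated with the tiled text embedding.
        self.out = nn.Sequential(
            nn.Conv2d(feat * 8 + embed_dim, 1, 4, 1, 0, bias=False),
            nn.Sigmoid(),
        )

    def forward(self, image, text_embed):
        h = self.conv(image)                                   # (B, feat*8, 4, 4)
        t = text_embed[:, :, None, None].expand(-1, -1, 4, 4)  # tile embedding over 4x4 grid
        return self.out(torch.cat([h, t], dim=1)).view(-1)     # (B,) real/fake probability
```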

Text Processing

GloVe Embeddings: Pre-trained 300-dimensional word vectors (glove.6B.300d.txt)
Projection Layer: Linear layer mapping 300d vectors to 1024d space
Caption Processing: Average word embeddings for sentence-level representation
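A sketch of this text pipeline: load glove.6B.300d.txt, average the word vectors of a caption, and project from 300 to 1024 dimensions. The class and function names are illustrative, not the repository's actual identifiers.

```python
import numpy as np
import torch
import torch.nn as nn

def load_glove(path="glove.6B.300d.txt"):
    """Load GloVe word vectors into a {word: 300-d float32 array} dict."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            word, *values = line.rstrip().split(" ")
            vectors[word] = np.asarray(values, dtype=np.float32)
    return vectors

class TextEncoder(nn.Module):
    """Average GloVe word vectors for a caption, then project 300d -> 1024d."""
    def __init__(self, glove, out_dim=1024):
        super().__init__()
        self.glove = glove
        self.project = nn.Linear(300, out_dim)

    def forward(self, caption: str) -> torch.Tensor:
        words = [w for w in caption.lower().split() if w in self.glove]
        # Fall back to a zero vector if no caption word is in the GloVe vocabulary.
        avg = np.mean([self.glove[w] for w in words], axis=0) if words else np.zeros(300, np.float32)
        return self.project(torch.from_numpy(avg))
```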


Training Configuration

Image Resolution: 64x64 pixels (3-channel RGB)

Batch Size: 512 images per batch

Total Epochs: 70 epochs with automatic checkpointing every 10 epochs

Learning Rate: 0.0002 (Adam optimizer)

Optimizer Settings: Beta1 = 0.5, Beta2 = 0.999

Noise Dimension: 100-dimensional latent noise vector

Embedding Dimension: 1024-dimensional (projected from 300-dimensional GloVe vectors)

Total Training Time: Approximately 24 hours on GPU hardware
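A sketch of how these settings translate into PyTorch, reusing the Generator and Discriminator sketches above. The checkpoint file naming and dictionary layout are assumptions for illustration.

```python
import torch
import torch.nn as nn

# Hyperparameters mirroring the configuration listed above.
NOISE_DIM, EMBED_DIM = 100, 1024
BATCH_SIZE, EPOCHS, LR = 512, 70, 2e-4

generator = Generator(NOISE_DIM, EMBED_DIM)
discriminator = Discriminator(EMBED_DIM)

# Separate Adam optimizers for G and D with beta1 = 0.5, beta2 = 0.999.
opt_g = torch.optim.Adam(generator.parameters(), lr=LR, betas=(0.5, 0.999))
opt_d = torch.optim.Adam(discriminator.parameters(), lr=LR, betas=(0.5, 0.999))
criterion = nn.BCELoss()  # binary real/fake objective

def save_checkpoint(epoch, path="saved_models"):
    # Checkpoint every 10 epochs so long runs can be resumed (file name is illustrative).
    if (epoch + 1) % 10 == 0:
        torch.save({"epoch": epoch,
                    "generator": generator.state_dict(),
                    "discriminator": discriminator.state_dict(),
                    "opt_g": opt_g.state_dict(),
                    "opt_d": opt_d.state_dict()},
                   f"{path}/checkpoint_epoch_{epoch + 1}.pt")
```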


Evaluation Results

FID Score (Fréchet Inception Distance): 322.30

IS Score (Inception Score): 0.53

Text-Image Matching: -0.15 ± 0.40

CLIP Score: 0.70

Generated Outputs: 20+ sample images per generation run with quality progression visualization


Dataset and Resources

Microsoft COCO Dataset:
- Training images: 82,783 images from train2014/
- Validation images: 40,504 images from val2014/
- Captions: captions_train2014.json and captions_val2014.json

GloVe Embeddings:
- glove.6B.300d.txt (300-dimensional word vectors)
- Pre-trained on 6 billion tokens from Wikipedia and Gigaword

Dependencies:
- torch==2.0.0
- torchvision
- numpy==1.21.5
- Pillow==10.0.0
- matplotlib, imageio, scipy, h5py==3.6.0


Repository Structure

models/: Generator and Discriminator architectures plus text processing models

saved_models/: Trained model checkpoints and final weights

generated_images/: Output samples, GIFs, and evaluation visualizations

utils.py: Utility functions for model operations

data_util.py: Dataset processing and loading utilities

DCGAN_Text2Image.ipynb: Main training notebook with all pipeline steps


Usage and Generation

Generate Images from Text:
Load the trained generator, provide a text description, and the model generates a corresponding image by combining the GloVe text embedding for the caption with random noise, as sketched below.
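A minimal inference sketch, reusing the Generator and TextEncoder sketches above. The checkpoint and output file names are placeholders, not the repository's actual paths.

```python
import torch
from PIL import Image

# Load the trained generator (file name is illustrative).
generator = Generator()
generator.load_state_dict(torch.load("saved_models/generator_final.pt", map_location="cpu"))
generator.eval()

encoder = TextEncoder(load_glove("glove.6B.300d.txt"))

caption = "a brown dog running across a grassy field"
with torch.no_grad():
    noise = torch.randn(1, 100)                # 100-d latent noise
    embed = encoder(caption).unsqueeze(0)      # (1, 1024) caption embedding
    fake = generator(noise, embed)             # (1, 3, 64, 64) in [-1, 1]

# Rescale from [-1, 1] to [0, 255] and save as a PNG.
img = ((fake[0].permute(1, 2, 0) + 1) * 127.5).clamp(0, 255).byte().numpy()
Image.fromarray(img).save("generated_images/sample.png")
```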

Resume Training:
Automatic checkpointing makes it possible to resume training from any saved epoch without losing the recorded loss history or model weights.
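A sketch of the resume logic, assuming the checkpoint dictionary layout from the training sketch above; the file name is illustrative.

```python
import torch

# Restore model and optimizer state from a saved checkpoint.
ckpt = torch.load("saved_models/checkpoint_epoch_40.pt", map_location="cpu")
generator.load_state_dict(ckpt["generator"])
discriminator.load_state_dict(ckpt["discriminator"])
opt_g.load_state_dict(ckpt["opt_g"])
opt_d.load_state_dict(ckpt["opt_d"])
start_epoch = ckpt["epoch"] + 1  # continue from the epoch after the checkpoint

for epoch in range(start_epoch, EPOCHS):
    ...  # regular DCGAN training step
```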

Evaluation:
Comprehensive evaluation metrics (FID, IS, CLIP scores) provide quantitative measures of generation quality and text-image alignment.
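As an illustration of how one such metric can be computed, here is an FID sketch using torchmetrics (with its image extras installed). This is not necessarily the metric code used in the notebook, and the random tensors are stand-ins for real batches of images.

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

# With normalize=True, images are expected as floats in [0, 1].
fid = FrechetInceptionDistance(feature=2048, normalize=True)

real_batch = torch.rand(32, 3, 64, 64)   # stand-in for real COCO images
fake_batch = torch.rand(32, 3, 64, 64)   # stand-in for generator outputs

fid.update(real_batch, real=True)
fid.update(fake_batch, real=False)
print(f"FID: {fid.compute():.2f}")
```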

Future Improvements

Advanced Text Encoders: Implement CLIP or BERT-based text encoders for improved semantic understanding

Higher Resolution: Scale architecture to 128x128 or 256x256 pixel images

Advanced Architectures: Consider StyleGAN, Progressive GAN, or Diffusion-based approaches

Enhanced Evaluation: Implement human evaluation studies and more comprehensive perceptual metrics

Hyperparameter Optimization: Systematic tuning of learning rates, batch sizes, and network architecture