

KAIROS - DCGAN Text-to-Image Generation
Generative AI Model
A Deep Convolutional Generative Adversarial Network (DCGAN) implementation for text-to-image generation using GloVe embeddings and the Microsoft COCO dataset. The project features a complete PyTorch implementation with comprehensive evaluation metrics, automatic checkpointing, and visualization tools. The model is trained on COCO image-caption pairs, with 300-dimensional GloVe word embeddings projected to a 1024-dimensional space to condition the generator and discriminator.
Role
Deep Learning Engineer
Research Scientist
Collaborators
Jason Olefson
Joo Young Gonzalez
Duration
4 months
Tools
PyTorch
Python
NumPy
Pillow
GloVe Embeddings
Jupyter Notebook
View GitHub Repository
View Jupyter Notebook
Key Features
Text-to-Image Generation: Generate high-quality 64x64 RGB images from text descriptions using DCGAN architecture
GloVe Embeddings: 300-dimensional word embeddings projected to a 1024-dimensional space for robust text representation
COCO Dataset: Trained on Microsoft COCO dataset with 82,783 training and 40,504 validation image-caption pairs
Comprehensive Evaluation: FID, IS, Text-Image Matching, and CLIP scores for detailed performance analysis
Automatic Checkpointing: Model checkpoints saved every 10 epochs with resume capability for long training runs
Visualization Tools: Generate sample images, training GIFs, loss plots, and real vs generated image comparisons
Model Architecture
Generator
Input: 100-dimensional noise vector concatenated with 1024-dimensional text embedding
Output: 3-channel RGB image (64x64 pixels)
Architecture: Transposed convolutional layers with batch normalization and ReLU activations
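A minimal PyTorch sketch of a generator with these dimensions follows; the layer widths and the ngf base channel count are illustrative assumptions rather than the repository's exact code.

import torch
import torch.nn as nn

class Generator(nn.Module):
    """DCGAN generator: (noise + projected text embedding) -> 64x64 RGB image."""
    def __init__(self, noise_dim=100, embed_dim=1024, ngf=64):
        super().__init__()
        self.net = nn.Sequential(
            # Input: (noise_dim + embed_dim) x 1 x 1
            nn.ConvTranspose2d(noise_dim + embed_dim, ngf * 8, 4, 1, 0, bias=False),
            nn.BatchNorm2d(ngf * 8),
            nn.ReLU(True),                                               # 4x4
            nn.ConvTranspose2d(ngf * 8, ngf * 4, 4, 2, 1, bias=False),
            nn.BatchNorm2d(ngf * 4),
            nn.ReLU(True),                                               # 8x8
            nn.ConvTranspose2d(ngf * 4, ngf * 2, 4, 2, 1, bias=False),
            nn.BatchNorm2d(ngf * 2),
            nn.ReLU(True),                                               # 16x16
            nn.ConvTranspose2d(ngf * 2, ngf, 4, 2, 1, bias=False),
            nn.BatchNorm2d(ngf),
            nn.ReLU(True),                                               # 32x32
            nn.ConvTranspose2d(ngf, 3, 4, 2, 1, bias=False),
            nn.Tanh(),                                                   # 64x64, values in [-1, 1]
        )

    def forward(self, noise, text_embedding):
        # Concatenate noise and text embedding, then reshape to a 1x1 spatial map
        x = torch.cat([noise, text_embedding], dim=1).unsqueeze(-1).unsqueeze(-1)
        return self.net(x)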
Discriminator
Input: 3-channel RGB image (64x64) concatenated with spatial text embedding
Output: Binary classification score (real/fake probability)
Architecture: Convolutional layers with batch normalization and leaky ReLU activations
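A matching discriminator sketch; tiling a compressed text embedding over the 4x4 feature map is a common conditioning approach and an assumption here, as the repository's exact fusion point may differ.

import torch
import torch.nn as nn

class Discriminator(nn.Module):
    """Conditional DCGAN discriminator: (64x64 RGB image, text embedding) -> real/fake probability."""
    def __init__(self, embed_dim=1024, proj_dim=128, ndf=64):
        super().__init__()
        self.image_net = nn.Sequential(
            nn.Conv2d(3, ndf, 4, 2, 1, bias=False),
            nn.LeakyReLU(0.2, inplace=True),                   # 32x32
            nn.Conv2d(ndf, ndf * 2, 4, 2, 1, bias=False),
            nn.BatchNorm2d(ndf * 2),
            nn.LeakyReLU(0.2, inplace=True),                   # 16x16
            nn.Conv2d(ndf * 2, ndf * 4, 4, 2, 1, bias=False),
            nn.BatchNorm2d(ndf * 4),
            nn.LeakyReLU(0.2, inplace=True),                   # 8x8
            nn.Conv2d(ndf * 4, ndf * 8, 4, 2, 1, bias=False),
            nn.BatchNorm2d(ndf * 8),
            nn.LeakyReLU(0.2, inplace=True),                   # 4x4
        )
        # Compress the 1024-d text embedding before tiling it over the 4x4 feature map
        self.text_proj = nn.Linear(embed_dim, proj_dim)
        self.classifier = nn.Sequential(
            nn.Conv2d(ndf * 8 + proj_dim, 1, 4, 1, 0, bias=False),
            nn.Sigmoid(),                                      # real/fake probability
        )

    def forward(self, image, text_embedding):
        features = self.image_net(image)                                 # (B, ndf*8, 4, 4)
        text = self.text_proj(text_embedding)                            # (B, proj_dim)
        text = text.view(text.size(0), -1, 1, 1).expand(-1, -1, 4, 4)    # tile spatially
        return self.classifier(torch.cat([features, text], dim=1)).view(-1)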
Text Processing
GloVe Embeddings: Pre-trained 300-dimensional word vectors (glove.6B.300d.txt)
Projection Layer: Linear layer mapping 300d vectors to 1024d space
Caption Processing: Average word embeddings for sentence-level representation
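A sketch of this pipeline; the helper names load_glove and embed_caption are illustrative, and captions are assumed to be lower-cased and whitespace-tokenized before averaging.

import numpy as np
import torch
import torch.nn as nn

def load_glove(path="glove.6B.300d.txt"):
    """Parse the GloVe text file into a {word: 300-d vector} dictionary."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors

projection = nn.Linear(300, 1024)  # projects averaged GloVe vectors into the 1024-d conditioning space

def embed_caption(caption, vectors):
    """Average the GloVe vectors of the caption's words, then project to 1024 dimensions."""
    words = [vectors[w] for w in caption.lower().split() if w in vectors]
    mean = np.mean(words, axis=0) if words else np.zeros(300, dtype=np.float32)
    return projection(torch.from_numpy(mean).unsqueeze(0))  # shape (1, 1024)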
Training Configuration
Image Resolution: 64x64 pixels (3-channel RGB)
Batch Size: 512 images per batch
Total Epochs: 70 epochs with automatic checkpointing every 10 epochs
Learning Rate: 0.0002 (Adam optimizer)
Optimizer Settings: Beta1 = 0.5, Beta2 = 0.999
Noise Dimension: 100-dimensional latent noise vector
Embedding Dimension: 1024-dimensional (projected from 300-dimensional GloVe vectors)
Total Training Time: Approximately 24 hours on GPU hardware
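Wiring these settings together follows the standard DCGAN recipe; Generator and Discriminator refer to the sketches above, and the binary cross-entropy loss is an assumption consistent with the sigmoid discriminator output.

import torch
import torch.nn as nn

# Hyperparameters as listed above
batch_size, num_epochs = 512, 70
lr, beta1, beta2 = 0.0002, 0.5, 0.999
noise_dim, embed_dim = 100, 1024

generator = Generator(noise_dim, embed_dim)
discriminator = Discriminator(embed_dim)

criterion = nn.BCELoss()
opt_g = torch.optim.Adam(generator.parameters(), lr=lr, betas=(beta1, beta2))
opt_d = torch.optim.Adam(discriminator.parameters(), lr=lr, betas=(beta1, beta2))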
Evaluation Results
FID Score (Fréchet Inception Distance): 322.30
IS Score (Inception Score): 0.53
Text-Image Matching: -0.15 ± 0.40
CLIP Score: 0.70
Generated Outputs: 20+ sample images per generation run with quality progression visualization
Dataset and Resources
Microsoft COCO Dataset:
- Training images: 82,783 images from train2014/
- Validation images: 40,504 images from val2014/
- Captions: captions_train2014.json and captions_val2014.json
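The image-caption pairs can be loaded with torchvision's CocoCaptions wrapper, as sketched below; pycocotools is required, and the file paths are taken from this listing and may need adjusting to the local directory layout.

import torchvision.transforms as T
from torchvision.datasets import CocoCaptions

transform = T.Compose([
    T.Resize(64),
    T.CenterCrop(64),
    T.ToTensor(),
    T.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),  # match the generator's Tanh range [-1, 1]
])

train_set = CocoCaptions(
    root="train2014/",
    annFile="captions_train2014.json",
    transform=transform,
)
image, captions = train_set[0]  # a 3x64x64 tensor and a list of caption strings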
GloVe Embeddings:
- glove.6B.300d.txt (300-dimensional word vectors)
- Pre-trained on 6 billion tokens from Wikipedia and Gigaword
Dependencies:
- torch==2.0.0
- torchvision
- numpy==1.21.5
- Pillow==10.0.0
- matplotlib, imageio, scipy, h5py==3.6.0
Repository Structure
models/: Generator and Discriminator architectures plus text processing models
saved_models/: Trained model checkpoints and final weights
generated_images/: Output samples, GIFs, and evaluation visualizations
utils.py: Utility functions for model operations
data_util.py: Dataset processing and loading utilities
DCGAN_Text2Image.ipynb: Main training notebook with all pipeline steps
Usage and Generation
Generate Images from Text:
Load the trained generator, provide a text description, and the model produces a corresponding image by combining the GloVe text embedding with random noise.
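A sketch of that inference path, reusing the Generator and embed_caption sketches above; the checkpoint filename and the example caption are assumptions.

import torch
from torchvision.utils import save_image

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

generator = Generator(noise_dim=100, embed_dim=1024).to(device)
# Checkpoint filename is an assumption; use whichever weights file lives in saved_models/
generator.load_state_dict(torch.load("saved_models/generator_final.pth", map_location=device))
generator.eval()

vectors = load_glove("glove.6B.300d.txt")
caption = "a brown dog playing in the snow"                   # example caption
text_embedding = embed_caption(caption, vectors).to(device)   # (1, 1024)
noise = torch.randn(1, 100, device=device)

with torch.no_grad():
    fake = generator(noise, text_embedding)

save_image((fake + 1) / 2, "generated_images/sample.png")  # map Tanh output [-1, 1] to [0, 1]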
Resume Training:
Automatic checkpointing allows training to resume from any saved epoch, restoring model weights and recorded loss values so long runs continue without losing progress.
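A sketch of resuming from a checkpoint; the file name and dictionary keys are an assumed convention rather than the repository's exact format.

import torch

checkpoint = torch.load("saved_models/checkpoint_epoch_60.pth", map_location=device)

generator.load_state_dict(checkpoint["generator_state_dict"])
discriminator.load_state_dict(checkpoint["discriminator_state_dict"])
opt_g.load_state_dict(checkpoint["opt_g_state_dict"])
opt_d.load_state_dict(checkpoint["opt_d_state_dict"])
g_losses, d_losses = checkpoint["g_losses"], checkpoint["d_losses"]  # recorded loss history

start_epoch = checkpoint["epoch"] + 1  # continue from the next epoch
for epoch in range(start_epoch, num_epochs):
    ...  # training loop continues exactly as before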
Evaluation:
Comprehensive evaluation metrics (FID, IS, CLIP scores) provide quantitative measures of generation quality and text-image alignment.
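As one example, FID can be computed with torchmetrics, as sketched below; torchmetrics is not among the listed dependencies, so this is an alternative to whatever the notebook itself implements.

import torch
from torchmetrics.image.fid import FrechetInceptionDistance

# real_batch and fake_batch: float tensors of shape (N, 3, 64, 64) scaled to [0, 1]
fid = FrechetInceptionDistance(feature=2048, normalize=True)
fid.update(real_batch, real=True)
fid.update(fake_batch, real=False)
print("FID:", fid.compute().item())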
Future Improvements
Advanced Text Encoders: Implement CLIP or BERT-based text encoders for improved semantic understanding
Higher Resolution: Scale architecture to 128x128 or 256x256 pixel images
Advanced Architectures: Consider StyleGAN, Progressive GAN, or Diffusion-based approaches
Enhanced Evaluation: Implement human evaluation studies and more comprehensive perceptual metrics
Hyperparameter Optimization: Systematic tuning of learning rates, batch sizes, and network architecture