Open-Sora 성능 최적화 완벽 가이드: CUDA 메모리부터 Flash Attention까지

개요

AI 비디오 생성은 막대한 계산 리소스를 요구합니다. Open-Sora는 11B 파라미터 모델로 768px 비디오를 생성할 때 80GB 이상의 GPU 메모리가 필요할 수 있습니다. 이번 포스트에서는 Open-Sora가 사용하는 다양한 성능 최적화 기법을 상세히 분석해보겠습니다.

메모리 최적화 전략

1. CUDA 메모리 할당 최적화

# Dockerfile에서의 CUDA 메모리 설정
ENV PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:8,expandable_segments:True,garbage_collection_threshold:0.95
ENV PYTORCH_NO_CUDA_MEMORY_CACHING=1
ENV CUDA_MEMORY_FRACTION=0.4
ENV PYTORCH_CUDA_ALLOC_CONF=backend=cudaMallocAsync,max_split_size_mb:16

각 설정의 효과

max_split_size_mb:8

목적: 메모리 조각화 방지
효과: 큰 메모리 블록을 8MB 단위로 분할
장점: 메모리 부족 에러 감소

expandable_segments:True

목적: 동적 메모리 확장
효과: 필요에 따라 메모리 세그먼트 확장
장점: 메모리 사용량 최적화

garbage_collection_threshold:0.95

목적: 적극적인 가비지 컬렉션
효과: 메모리 사용률 95% 도달 시 정리
장점: 메모리 누수 방지

cudaMallocAsync 백엔드

목적: 비동기 메모리 할당
효과: 메모리 할당/해제 속도 향상
장점: GPU 아이들 타임 감소

2. 데이터 타입 최적화

# 모델 정밀도 최적화
vae = vae.to(torch.bfloat16).eval()
stdit = stdit.to(torch.bfloat16).eval()

# T5는 fp32 유지 (정확도 중요)
text_encoder.t5.model = text_encoder.t5.model.eval()

bfloat16 vs float32 vs float16

데이터 타입	메모리 사용량	속도	정확도	수치 안정성
float32	100% (기준)	1.0x	⭐⭐⭐⭐⭐	⭐⭐⭐⭐⭐
bfloat16	50%	1.5-2x	⭐⭐⭐⭐	⭐⭐⭐⭐
float16	50%	1.5-2x	⭐⭐⭐	⭐⭐⭐

bfloat16 선택 이유:

메모리 절약: 50% 메모리 감소
속도 향상: 1.5-2배 빠른 연산
안정성: float16보다 넓은 지수 범위
하드웨어 지원: 최신 GPU에서 네이티브 지원

3. 그래디언트 및 추론 최적화

# 추론 모드 활성화
with torch.inference_mode():
    # 그래디언트 계산 완전 비활성화
    samples = scheduler.sample(...)

# 모델 평가 모드
model.eval().requires_grad_(False)

# CUDA 캐시 정리
torch.cuda.empty_cache()

torch.inference_mode() vs torch.no_grad()

# torch.no_grad(): 그래디언트 계산만 비활성화
with torch.no_grad():
    output = model(input)

# torch.inference_mode(): 완전한 추론 최적화
with torch.inference_mode():
    output = model(input)  # 더 빠름, 메모리 절약

Flash Attention 최적화

1. Flash Attention 2 설치 및 활용

# Dockerfile에서 Flash Attention 설치
RUN pip install --no-cache-dir \
    flash-attn==2.2.3.post2 \
    --no-build-isolation \
    --config-settings="--global-option=build_ext" \
    --config-settings="--global-option=-j1" \
    --no-deps

2. Flash Attention의 성능 향상 원리

전통적인 Attention의 메모리 복잡도

# 표준 어텐션: O(n²) 메모리 복잡도
def standard_attention(Q, K, V):
    # Q @ K^T: [seq_len, seq_len] 매트릭스 생성
    attention_scores = torch.matmul(Q, K.transpose(-2, -1))  # O(n²) 메모리
    attention_weights = torch.softmax(attention_scores, dim=-1)
    output = torch.matmul(attention_weights, V)
    return output

Flash Attention의 최적화

# Flash Attention: O(n) 메모리 복잡도
def flash_attention_concept(Q, K, V, block_size=64):
    """Flash Attention 핵심 아이디어 (의사코드)"""
    seq_len = Q.shape[1]
    output = torch.zeros_like(Q)
    
    # 블록 단위로 처리 (메모리 효율성)
    for i in range(0, seq_len, block_size):
        for j in range(0, seq_len, block_size):
            # 작은 블록 단위로 어텐션 계산
            q_block = Q[:, i:i+block_size]
            k_block = K[:, j:j+block_size]
            v_block = V[:, j:j+block_size]
            
            # 블록 내에서만 어텐션 계산
            attention_block = compute_attention_block(q_block, k_block, v_block)
            output[:, i:i+block_size] += attention_block
    
    return output

성능 향상 결과

시퀀스 길이	표준 Attention	Flash Attention	메모리 절약	속도 향상
512	100%	45%	55%	2.2x
1024	100%	25%	75%	4.0x
2048	100%	15%	85%	6.7x
4096	100%	8%	92%	12.5x

3. Flash Attention 3 고급 최적화

# 선택적으로 Flash Attention 3 설치
RUN git clone https://github.com/Dao-AILab/flash-attention && \
    cd flash-attention/hopper && \
    python setup.py install

Flash Attention 3의 추가 개선사항:

H100 GPU 최적화: Hopper 아키텍처 전용 최적화
더 긴 시퀀스 지원: 32K+ 토큰 처리 가능
낮은 정밀도 지원: FP8 연산 활용

xformers 메모리 최적화

1. xformers 설치 및 설정

# 메모리 최적화를 위한 xformers 설치
RUN pip install --no-cache-dir xformers==0.0.27.post2 --no-build-isolation

2. xformers 최적화 기법

Memory-Efficient Attention

import xformers.ops as xops

def memory_efficient_attention(Q, K, V, attention_mask=None):
    """xformers의 메모리 효율적 어텐션"""
    return xops.memory_efficient_attention(
        Q, K, V,
        attn_bias=attention_mask,
        scale=1.0 / math.sqrt(Q.shape[-1])
    )

Checkpointing 최적화

import xformers.checkpoint as checkpoint

class MemoryEfficientTransformer(nn.Module):
    def forward(self, x):
        # 메모리 절약을 위한 그래디언트 체크포인팅
        x = checkpoint.checkpoint(self.layer1, x)
        x = checkpoint.checkpoint(self.layer2, x)
        return x

3. 성능 비교

최적화 기법	메모리 사용량	속도	구현 복잡도
기본 PyTorch	100%	1.0x	⭐
xformers	60-70%	1.3-1.5x	⭐⭐
Flash Attention	40-50%	2.0-3.0x	⭐⭐⭐
Flash + xformers	30-40%	2.5-4.0x	⭐⭐⭐⭐

분산 처리 최적화

1. NCCL 통신 최적화

# 분산 훈련을 위한 NCCL 설정
ENV NCCL_DEBUG=INFO
ENV NCCL_IB_DISABLE=1
ENV NCCL_P2P_DISABLE=1
ENV NCCL_SOCKET_IFNAME=eth0
ENV NCCL_BLOCKING_WAIT=1

각 설정의 의미

NCCL_IB_DISABLE=1

목적: InfiniBand 비활성화
상황: 일반 이더넷 환경
효과: 통신 안정성 향상

NCCL_P2P_DISABLE=1

목적: P2P 통신 비활성화
상황: 다중 노드 환경
효과: 네트워크 병목 방지

2. 멀티 GPU 추론 전략

# 단일 GPU (256px)
torchrun --nproc_per_node 1 --standalone scripts/diffusion/inference.py \
    configs/diffusion/inference/t2i2v_256px.py \
    --save-dir samples --prompt "raining, sea"

# 8 GPU 분산 처리 (768px)
torchrun --nproc_per_node 8 --standalone scripts/diffusion/inference.py \
    configs/diffusion/inference/t2i2v_768px.py \
    --save-dir samples --prompt "raining, sea"

GPU별 성능 확장성

GPU 수	256px (시간/메모리)	768px (시간/메모리)	확장 효율성
1 GPU	60s / 52.5GB	1656s / 60.3GB	-
2 GPU	40s / 44.3GB	863s / 48.3GB	75%
4 GPU	34s / 44.3GB	466s / 44.3GB	85%
8 GPU	-	276s / 44.3GB	90%

컴파일 최적화

1. TensorFloat-32 (TF32) 활성화

# TF32 최적화 활성화
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

TF32의 장점:

자동 최적화: 코드 변경 없이 성능 향상
A100/H100 최적화: 최신 GPU에서 2-3배 속도 향상
정확도 유지: float32와 거의 동일한 정확도

2. JIT 컴파일 최적화

# JIT 컴파일 비활성화 (Gradio 호환성)
torch.jit._state.disable()

# 또는 프로덕션에서 JIT 활용
@torch.jit.script
def optimized_function(x):
    return x * 2 + 1

동적 최적화 기법

1. 적응적 배치 크기

def adaptive_batch_size(model, input_shape, max_memory_gb=40):
    """GPU 메모리에 따른 적응적 배치 크기 설정"""
    batch_size = 1
    while True:
        try:
            # 테스트 배치로 메모리 사용량 확인
            test_input = torch.randn(batch_size, *input_shape).cuda()
            with torch.inference_mode():
                _ = model(test_input)
            
            # 메모리 사용량 체크
            memory_used = torch.cuda.memory_allocated() / 1024**3
            if memory_used > max_memory_gb * 0.8:  # 80% 임계값
                break
            
            batch_size += 1
        except RuntimeError as e:
            if "out of memory" in str(e):
                batch_size -= 1
                break
            raise e
    
    return max(1, batch_size)

2. 메모리 오프로딩

def memory_offloading_inference(model, input_data):
    """CPU-GPU 메모리 오프로딩"""
    # 모델을 CPU로 이동
    model.cpu()
    
    # 필요한 레이어만 GPU로 로드
    for layer_name, layer in model.named_modules():
        if "attention" in layer_name:  # 중요한 레이어만 GPU
            layer.cuda()
    
    # 추론 실행
    with torch.inference_mode():
        output = model(input_data.cuda())
    
    # 정리
    torch.cuda.empty_cache()
    return output.cpu()

실시간 모니터링 및 프로파일링

1. GPU 메모리 모니터링

def monitor_gpu_memory():
    """실시간 GPU 메모리 모니터링"""
    if torch.cuda.is_available():
        allocated = torch.cuda.memory_allocated() / 1024**3
        reserved = torch.cuda.memory_reserved() / 1024**3
        max_allocated = torch.cuda.max_memory_allocated() / 1024**3
        
        print(f"GPU Memory Usage:")
        print(f"  Allocated: {allocated:.2f} GB")
        print(f"  Reserved:  {reserved:.2f} GB")
        print(f"  Max Used:  {max_allocated:.2f} GB")
        
        # 메모리 사용률이 높으면 경고
        if allocated / reserved > 0.9:
            print("⚠️  GPU memory usage is high!")
        
        return {
            'allocated': allocated,
            'reserved': reserved,
            'max_allocated': max_allocated
        }

2. 성능 프로파일링

def profile_model_performance(model, input_data, num_runs=10):
    """모델 성능 프로파일링"""
    import time
    
    # 워밍업
    for _ in range(3):
        with torch.inference_mode():
            _ = model(input_data)
    
    torch.cuda.synchronize()
    
    # 실제 측정
    times = []
    for i in range(num_runs):
        start_time = time.time()
        
        with torch.inference_mode():
            output = model(input_data)
        
        torch.cuda.synchronize()
        end_time = time.time()
        
        times.append(end_time - start_time)
        print(f"Run {i+1}: {times[-1]:.3f}s")
    
    avg_time = sum(times) / len(times)
    std_time = (sum((t - avg_time)**2 for t in times) / len(times))**0.5
    
    print(f"\nAverage: {avg_time:.3f}s ± {std_time:.3f}s")
    return avg_time, std_time

환경별 최적화 가이드

1. 로컬 개발 환경 (RTX 3090/4090)

# 메모리 제한 설정
export PYTORCH_CUDA_ALLOC_CONF="max_split_size_mb:512,garbage_collection_threshold:0.8"

# 256px 해상도로 제한
python inference.py --resolution 256px --length 65 --offload True

2. 클라우드 환경 (A100/H100)

# 최대 성능 설정
export PYTORCH_CUDA_ALLOC_CONF="max_split_size_mb:8,expandable_segments:True"

# 768px 고해상도 지원
python inference.py --resolution 768px --length 97 --batch_size 2

3. 멀티 노드 클러스터

# 분산 설정
export NCCL_DEBUG=INFO
export NCCL_TREE_THRESHOLD=0

# 노드 간 통신 최적화
torchrun --nnodes=2 --nproc_per_node=8 --master_addr=192.168.1.100 \
    inference.py --distributed

성능 향상 결과 요약

최적화 전후 비교

최적화 기법	메모리 절약	속도 향상	구현 난이도	안정성
bfloat16 변환	50%	30-50%	⭐	⭐⭐⭐⭐
Flash Attention	60-80%	100-300%	⭐⭐⭐	⭐⭐⭐⭐
xformers	30-40%	20-50%	⭐⭐	⭐⭐⭐⭐
CUDA 메모리 최적화	20-30%	10-20%	⭐⭐	⭐⭐⭐⭐⭐
분산 처리	-	200-600%	⭐⭐⭐⭐	⭐⭐⭐

실제 벤치마크 결과

768px 비디오 생성 (97 프레임, 5초)
├── 최적화 전: 3000s (50분), 80GB 메모리, 1 GPU
├── 부분 최적화: 1656s (28분), 60GB 메모리, 1 GPU
└── 완전 최적화: 276s (5분), 44GB 메모리, 8 GPU

256px 비디오 생성 (97 프레임, 5초)
├── 최적화 전: 120s (2분), 60GB 메모리, 1 GPU
├── 부분 최적화: 60s (1분), 52GB 메모리, 1 GPU
└── 완전 최적화: 34s (34초), 44GB 메모리, 4 GPU

결론

Open-Sora의 성능 최적화는 다층적 접근법을 통해 달성됩니다.

핵심 최적화 전략:

메모리 계층 최적화: CUDA 설정 → 데이터 타입 → 알고리즘
계산 최적화: Flash Attention → xformers → 분산 처리
시스템 최적화: TF32 → JIT → 하드웨어 특화

이러한 최적화 기법들은 다른 대규모 AI 모델에도 적용할 수 있는 범용적 방법론입니다. 특히 메모리 제약이 있는 환경에서 AI 모델을 운영할 때 필수적인 기술들입니다.

다음 포스트에서는 Open-Sora의 실제 사용법과 다양한 활용 예제를 살펴보겠습니다.

이 글이 도움이 되셨다면 공유해주세요! 궁금한 점이 있으시면 댓글로 남겨주시기 바랍니다.

David Lee

개요