Luuu
@luuu
#dailychallenge MoE (Mixture of Experts) is a neural network architecture that splits an LLM into multiple small experts, organizes those experts into a network, and activates only a specific subset of them based on the input. This improves efficiency by cutting computational cost while still maintaining high performance. To sum up, it can
- make inference cheaper
- push performance higher

However,
- if the gating is inefficient, it won't perform any better
- when specific experts are overused (load-balancing issue), training becomes imbalanced (mode-collapse issue)
- in a distributed computing environment, MoE increases inter-device communication, which slows things down
- it is harder to train than a standard transformer model

Given these characteristics, CoT goes well with MoE (a minimal routing sketch is below).
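To make the routing idea concrete, here is a minimal sketch in Python/NumPy of top-k gated MoE routing. Everything here is an illustrative assumption rather than any specific model's implementation: the expert count, `top_k`, layer sizes, and the `softmax`/`moe_forward` helpers are made up for the example. The point it shows is that a gating network scores the experts per token, only the top-k experts actually run, and their outputs are mixed by the renormalized gate weights.

```python
# Minimal sketch of top-k gated Mixture-of-Experts routing.
# Assumed toy setup: 4 tiny feed-forward experts, top-2 routing.
import numpy as np

rng = np.random.default_rng(0)
d_model, d_hidden, n_experts, top_k = 8, 16, 4, 2

# Each "expert" is a small two-layer feed-forward block.
experts = [
    (rng.standard_normal((d_model, d_hidden)) * 0.1,
     rng.standard_normal((d_hidden, d_model)) * 0.1)
    for _ in range(n_experts)
]
# Gating network: one linear layer producing a logit per expert.
w_gate = rng.standard_normal((d_model, n_experts)) * 0.1


def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


def moe_forward(x):
    """Route each token to its top-k experts and mix their outputs."""
    logits = x @ w_gate                      # (tokens, n_experts)
    probs = softmax(logits)
    top_idx = np.argsort(-probs, axis=-1)[:, :top_k]

    out = np.zeros_like(x)
    for t in range(x.shape[0]):              # per-token routing
        chosen = top_idx[t]
        weights = probs[t, chosen]
        weights = weights / weights.sum()    # renormalize over chosen experts
        for e_id, w in zip(chosen, weights):
            w1, w2 = experts[e_id]
            h = np.maximum(x[t] @ w1, 0.0)   # ReLU hidden layer
            out[t] += w * (h @ w2)
    return out, top_idx


tokens = rng.standard_normal((5, d_model))   # 5 toy tokens
y, routing = moe_forward(tokens)
print("output shape:", y.shape)
print("experts chosen per token:\n", routing)
# Only top_k of n_experts run per token: that sparsity is where the
# compute savings come from, and also why unlucky routing can overload
# a few experts (the load-balancing issue mentioned above).
```

In practice, production MoE layers add an auxiliary load-balancing loss and capacity limits on top of this routing step to keep experts evenly used, which is exactly the failure mode the cast points out.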