Source URL: https://simonwillison.net/2024/Dec/25/deepseek-v3/#atom-everything
Source: Simon Willison’s Weblog
Title: deepseek-ai/DeepSeek-V3-Base
Feedly Summary: deepseek-ai/DeepSeek-V3-Base
No model card or announcement yet, but this new model release from Chinese AI lab DeepSeek (an arm of Chinese hedge fund High-Flyer) looks very significant.
It’s a huge model – 685B parameters, 687.9 GB on disk (TIL how to size a git-lfs repo). The architecture is a Mixture of Experts with 256 experts, using 8 per token.
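One way to arrive at that on-disk figure without cloning the repository is to sum the file sizes reported by the Hugging Face Hub metadata API. This is a minimal sketch, not necessarily the method from the linked TIL; it assumes the huggingface_hub Python package and the deepseek-ai/DeepSeek-V3-Base repo id:

```python
# Sketch: total the file sizes of a git-lfs backed Hugging Face repo
# without cloning it, via the Hub metadata API (assumption: the
# huggingface_hub package is installed and the repo id is correct).
from huggingface_hub import HfApi

api = HfApi()
info = api.model_info("deepseek-ai/DeepSeek-V3-Base", files_metadata=True)

total_bytes = sum(f.size or 0 for f in info.siblings)
print(f"{total_bytes / 1e9:.1f} GB across {len(info.siblings)} files")
```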
VB from Hugging Face used the config files to compare it to DeepSeek v2:
| Property | v2 | v3 |
| --- | --- | --- |
| vocab_size | 102400 | 129280 |
| hidden_size | 4096 | 7168 |
| intermediate_size | 11008 | 18432 |
| num_hidden_layers | 30 | 61 |
| num_attention_heads | 32 | 128 |
| max_position_embeddings | 2048 | 4096 |
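That comparison can be reproduced straight from the two config.json files on the Hub. A rough sketch, assuming the huggingface_hub package and that deepseek-ai/DeepSeek-V2 is the right repo id for the older model:

```python
# Sketch: download each model's config.json from the Hugging Face Hub
# and print the fields compared above (repo ids are assumptions).
import json
from huggingface_hub import hf_hub_download

def load_config(repo_id: str) -> dict:
    path = hf_hub_download(repo_id, "config.json")
    with open(path) as f:
        return json.load(f)

v2 = load_config("deepseek-ai/DeepSeek-V2")
v3 = load_config("deepseek-ai/DeepSeek-V3-Base")

fields = [
    "vocab_size", "hidden_size", "intermediate_size",
    "num_hidden_layers", "num_attention_heads", "max_position_embeddings",
]
for key in fields:
    print(f"{key}: {v2.get(key)} -> {v3.get(key)}")
```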
The new model is apparently available to some people via both chat.deepseek.com and the DeepSeek API. As far as I can tell this is a staged rollout – I don’t seem to have access myself yet.
Paul Gauthier got API access and used it to update his new Aider Polyglot leaderboard – DeepSeek v3 preview scored 48.4%, putting it in second place behind o1-2024-12-17 (high) and in front of both claude-3-5-sonnet-20241022 and gemini-exp-1206!
Via @ivanfioravanti
Tags: aider, hugging-face, generative-ai, ai, llms
AI Summary and Description: Yes
Summary: The text discusses the significance of the new AI model (DeepSeek-V3) developed by DeepSeek, a Chinese AI lab. This model is notable for its large size and improvements over its predecessor, DeepSeek v2, particularly with its enhanced architecture and capabilities. Insights about its rollout and performance in comparison to other models are also shared.
Detailed Description:
The new release of DeepSeek-V3 by a Chinese AI lab presents several critical advancements and implications for stakeholders in the fields of AI, particularly in generative AI and LLMs (Large Language Models):
– **Model Specifications**:
– **Parameters**: DeepSeek-V3 boasts an impressive 685 billion parameters, significantly increasing its capacity for complex computations and nuanced understanding.
– **Disk Size**: The model requires 687.9 GB of disk space, indicating its extensive resource needs.
– **Architecture**: It employs a Mixture of Experts (MoE) architecture with 256 experts, activating 8 experts per token, optimizing performance while managing resource use (a minimal routing sketch follows this list).
– **Comparative Enhancements**:
– The text outlines a comparison between the new model and its predecessor, highlighting substantial upgrades in critical parameters:
– **Vocabulary Size**: Increased from 102,400 to 129,280.
– **Hidden Size**: Increased from 4,096 to 7,168.
– **Intermediate Size**: Increased from 11,008 to 18,432.
– **Number of Hidden Layers**: Increased from 30 to 61.
– **Attention Heads**: Increased from 32 to 128.
– **Max Position Embeddings**: Increased from 2,048 to 4,096.
– **Access and Availability**:
– The model is currently in a staged rollout, becoming available through platforms such as chat.deepseek.com and via the DeepSeek API. As of now, access may be limited to select users.
– **Performance Metrics**:
– The model’s early performance evaluation places it at 48.4% on the Aider Polyglot leaderboard, second behind only o1-2024-12-17 (high) and ahead of both claude-3-5-sonnet-20241022 and gemini-exp-1206.
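To make the "8 of 256 experts per token" point concrete, here is a minimal, illustrative top-k routing sketch in NumPy. It is not DeepSeek's actual routing code (DeepSeek-V3 also uses shared experts and its own load-balancing scheme); the dimensions, weight names, and expert shapes are assumptions chosen only to show the mechanism:

```python
# Illustrative top-k Mixture of Experts routing (not DeepSeek's implementation).
# Each token scores all experts, keeps the top k, and mixes only those k outputs.
import numpy as np

rng = np.random.default_rng(0)

d_model, n_experts, top_k = 64, 256, 8       # toy hidden size; 256 experts, 8 active
W_gate = rng.standard_normal((d_model, n_experts)) * 0.02
# One tiny linear "expert" per slot (assumed shapes, for illustration only).
experts = [rng.standard_normal((d_model, d_model)) * 0.02 for _ in range(n_experts)]

def moe_layer(x: np.ndarray) -> np.ndarray:
    """x: (n_tokens, d_model) -> (n_tokens, d_model) using top_k experts per token."""
    logits = x @ W_gate                                   # (n_tokens, n_experts)
    top_idx = np.argsort(logits, axis=-1)[:, -top_k:]     # indices of the 8 best experts
    top_logits = np.take_along_axis(logits, top_idx, axis=-1)
    weights = np.exp(top_logits - top_logits.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)             # softmax over the selected 8
    out = np.zeros_like(x)
    for t in range(x.shape[0]):                           # route each token separately
        for w, e in zip(weights[t], top_idx[t]):
            out[t] += w * (x[t] @ experts[e])
    return out

tokens = rng.standard_normal((4, d_model))
print(moe_layer(tokens).shape)   # (4, 64): only 8 of 256 experts touched per token
```

The point of the sparsity is that per-token compute scales with the 8 active experts rather than the full 256, which is how a 685B-parameter model stays tractable at inference time.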
Implications for AI and LLM Security:
– With the introduction of such expansive models, the potential for both positive applications and security vulnerabilities increases, necessitating a focus on both AI security and governance.
– Stakeholders should consider the implications of such models in terms of data privacy, responsible AI use, and compliance with emerging regulations surrounding AI technologies.
This new model from DeepSeek represents a significant milestone in AI model development, raising the stakes in the generative AI space and highlighting the need for ongoing vigilance in security and compliance matters.