
LLM Parameters


When we start learning about Large Language Models (LLMs), it is natural to become curious about how parameter count, training data size, context size, tokens, and similar factors affect a model's performance. It is just as natural to ask how the existing models out in the wild, both open and closed source, use these parameters and what their strengths and weaknesses are. Comparing the training data sizes used by such models also helps one understand how many resources a comparable model would need in order to be trained from scratch.
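To make the "how many resources" question concrete, here is a minimal sketch that uses the widely quoted rule of thumb that training compute is roughly 6 × parameters × training tokens. The rule is an approximation, and the hardware figures (peak throughput and utilization) are assumptions chosen purely for illustration, not measurements of any real training run.

```python
def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute with the C ~ 6 * N * D rule of thumb."""
    return 6.0 * n_params * n_tokens


def gpu_days(total_flops: float, peak_flops_per_gpu: float, utilization: float) -> float:
    """Convert total FLOPs into GPU-days at an assumed sustained utilization."""
    seconds = total_flops / (peak_flops_per_gpu * utilization)
    return seconds / 86_400  # seconds per day


# Illustrative inputs: an 8B-parameter model trained on 15T tokens.
flops = training_flops(8e9, 15e12)

# Assumed hardware (hypothetical): 1e15 FLOP/s peak per accelerator, 40% utilization.
days = gpu_days(flops, peak_flops_per_gpu=1e15, utilization=0.4)

print(f"{flops:.2e} FLOPs, about {days:,.0f} GPU-days")
```

Doubling either the parameter count or the token count doubles the estimate, which is why training data size matters as much as model size when budgeting a from-scratch run.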

I have tried to collect this information in the table below. It should also help an AI practitioner select the right model for a given use case. Please add your comments and provide more information as needed.

| No. | Model Family | Developer | Open/Closed | Key Architectures & Encoding | Parameter Sizes | Training Data Size | Training Data Sources | Context Size (Tokens) | Output Dimensions (Hidden State) | Strengths (+) & Weaknesses (-) |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | GPT-4o | OpenAI | Closed | Transformer, multi-modal; supports audio, vision, and text as input and output. | Unknown (est. >1T) | Unknown | Public data and “data licensed from third-party providers,” which likely includes web crawls, books, and code. | 128k | Unknown | (+) State-of-the-art multi-modal capabilities, exceptional reasoning and coding, excellent for conversational tasks. (-) Closed-source and proprietary, high API cost, prone to “hallucinations.” |
| 2 | Claude 3.5 Sonnet | Anthropic | Closed | Transformer, multi-modal, focus on responsible AI. | Unknown | Unknown | A proprietary mix of publicly available internet data (as of April 2024), non-public data from third parties, and data from human contractors. | 200k | Unknown | (+) Strong performance in reasoning, coding, and multi-modal tasks, a focus on safety and “Constitutional AI.” (-) Closed-source and proprietary, can be overly cautious, API access is required. |
| 3 | Gemini 1.5 Pro | Google | Closed | Mixture of Experts (MoE), multi-modal. | Unknown | Unknown | A mix of publicly available data and proprietary datasets. The specific sources are not publicly detailed. | 1M | Unknown | (+) Exceptionally large context window, enabling recall on very long documents and videos, strong multi-modal capabilities, highly efficient Mixture-of-Experts (MoE) architecture. (-) Closed-source, training-data details are proprietary, still-developing ecosystem. |
| 4 | Llama 3 | Meta AI | Open-Source | Standard Transformer decoder-only architecture. | 8B, 70B, 405B | 15T tokens | “Publicly available sources” with a focus on quality. The dataset was pre-processed using a mix of human annotators and previous Llama models to filter for high-quality data. | 8k, 128k | 4,096 (8B), 8,192 (70B), 16,384 (405B) | (+) Top-tier open-source performance, available in a variety of sizes, permissive license. (-) Smaller models may not compete with state-of-the-art closed models, can require significant fine-tuning. |
| 5 | Mixtral 8x7B | Mistral AI | Open-Source | Mixtral uses a sparse Mixture of Experts (MoE) architecture. Mistral 7B uses Grouped-Query Attention (GQA). | 46.7B (MoE) | Unknown | Not publicly disclosed. It is assumed to be a high-quality, filtered web dataset similar to other large models, but Mistral AI keeps its data sources private. | 32k | 4,096 | (+) Highly efficient and fast due to its MoE architecture, strong performance for its size, excellent for running on consumer-grade hardware. (-) More complex to fine-tune, performance can be less consistent on highly specialized tasks. |
| 6 | Gemma 2 | Google | Open-Source | Standard Transformer architecture, based on Gemini’s research. | 2B, 9B, 27B | ~2T-13T tokens | A mix of publicly available web data and synthetic data, curated to align with the training data for the larger Gemini models. | 8k | Unknown | (+) Lightweight and optimized for on-device applications, strong performance for its size, built on Google’s ethical AI principles. (-) Smaller parameter count limits its overall capabilities compared to frontier models. |
| 7 | Qwen 2 | Alibaba | Open-Source | Transformer decoder-only architecture. | 0.5B to 72B | Unknown | Not publicly disclosed. Given its strong multilingual performance, it’s likely a diverse dataset covering multiple languages and codebases. The Qwen2.5-Coder model was trained on 5.5T tokens, including source code and synthetic data. | 128k | 4,096 (for the 7B model) | (+) Strong multilingual capabilities, supports a wide range of tasks, large context window. (-) The largest models can have a more restrictive license, not as widely adopted in some regions. |
| 8 | Phi-3 | Microsoft | Open-Source | Transformer decoder-only architecture, focused on high-quality training data. | 3.8B, 7B, 14B | 3.3T tokens | A mix of “rigorously filtered public documents, high-quality educational materials, and specially created synthetic data.” It emphasizes the quality of data over sheer quantity. | 4k, 8k, 128k | Unknown | (+) Exceptionally strong performance for its small size, highly efficient and can run on consumer hardware. (-) Limited knowledge base compared to larger models, can struggle with complex, open-ended tasks. |
| 9 | Falcon 180B | TII | Open-Source | Causal decoder-only Transformer, Multi-Query Attention (MQA, in its multigroup variant) & Rotary Positional Embeddings (RoPE). | 180B | 3.5T tokens | Predominantly from RefinedWeb, a massive filtered and deduplicated web dataset. It also includes curated conversational data. | 2k | 14,848 | (+) One of the largest open-source models available, strong performance on a variety of benchmarks. (-) Large size makes it expensive to run and fine-tune, limited context window. |
| 10 | T5 | Google | Open-Source | Encoder-decoder Transformer. | 60M - 11B | ~750GB (Colossal Clean Crawled Corpus) | The Colossal Clean Crawled Corpus (C4), a massive, filtered version of the Common Crawl web archive. | 512 | 768 (base) | (+) Handles all NLP tasks as “text-to-text” conversion, highly versatile architecture. (-) Less effective for creative writing or conversational tasks, very limited context size in original versions. |
| 11 | BERT | Google | Open-Source | Encoder-only Transformer. | 110M, 340M | ~3.3B words (Wikipedia + BookCorpus) | English Wikipedia and the BookCorpus dataset, a collection of public-domain books. | 512 | 768 | (+) Excellent for classification and understanding tasks, highly influential foundational model. (-) Not a generative model, cannot produce new text, limited context size. |
| 12 | BLOOM | BigScience | Open-Source | Causal decoder-only Transformer with ALiBi (Attention with Linear Biases) positional embeddings. | 176B | ~350B tokens | The ROOTS corpus, which includes hundreds of sources in 46 natural and 13 programming languages, with a large portion from the OSCAR web corpus. | 2k | 14,336 | (+) The first truly multilingual model trained on 46 natural and 13 programming languages, developed by a large consortium of researchers. (-) Can be less performant than monolingual models on some tasks, large size makes it difficult to deploy. |
| 13 | Code Llama | Meta AI | Open-Source | Causal decoder-only Transformer that is a specialized version of the Llama 2 architecture. It uses Rotary Positional Embeddings (RoPE) to encode token positions and is trained with a code-infilling method, Fill-in-the-Middle (FIM). | 7B, 13B, 34B, 70B | Unknown | A specialized fine-tune of Llama 2. It was trained on an additional 500B tokens of code-specific datasets, along with natural language data related to code. | 100k | 4,096 (7B) | (+) State-of-the-art for code generation and understanding, highly effective for programming tasks. (-) Less effective for general-purpose language tasks. |
| 14 | DBRX | Databricks | Open-Source | Causal decoder-only Transformer architecture with a fine-grained Mixture-of-Experts (MoE). | 132B (MoE) | 12T tokens | A carefully curated, proprietary dataset of text and code. Databricks has emphasized that the data was selected for “quality and domain diversity” using their platform tools. | 32k | 6,144 | (+) High-performance Mixture-of-Experts model, strong for coding and reasoning tasks. (-) Can be more complex to fine-tune than monolithic models. |
| 15 | Zephyr | Hugging Face | Open-Source | Causal decoder-only Transformer design with Grouped-Query Attention (GQA) and Rotary Positional Embeddings (RoPE); uses Distilled Supervised Fine-Tuning (dSFT) and Distilled Direct Preference Optimization (dDPO) for knowledge distillation. | 7B | Unknown | An instruction-tuned model. It was fine-tuned using a mix of synthetic data generated by other LLMs and high-quality human-annotated data. The exact sources are not fully detailed. | 4k | 4,096 | (+) An instruction-tuned model from the Mistral family, known for its high-quality chat capabilities and strong performance for its size. (-) Primarily a chat model, its general-purpose performance is not as strong as its base model. |
| 16 | Yi | 01.AI | Open-Source | Causal decoder-only Transformer with Rotary Positional Embeddings (RoPE) and efficiency optimizations: pre-normalization, SwiGLU activation, and Grouped-Query Attention (GQA). | 6B, 9B, 34B | Unknown | The training data sources are not publicly specified, but the model’s strong performance in both Chinese and English suggests a high-quality, large-scale bilingual dataset. | 4k, 8k, 200k | 4,096 | (+) Strong bilingual (Chinese/English) performance, good for code and reasoning tasks, a large context window in some variants. (-) Performance on other languages may not be as strong. |
| 17 | Grok-1 | xAI | Open-Source | Causal decoder-only Transformer with Mixture-of-Experts (MoE), using Rotary Positional Embeddings (RoPE) and Grouped-Query Attention (GQA). | 314B (MoE) | Unknown | The training data comes from “the web” and includes real-time data from X (formerly Twitter). | 8k | Unknown | (+) Designed to be witty and rebellious, draws on real-time data from X (formerly Twitter), large MoE architecture. (-) Training data from social media can lead to biased or inappropriate outputs, less predictable behavior. |
| 18 | PaLM 2 | Google | Closed | Causal decoder-only Transformer trained on a high-quality dataset. | Unknown (sizes: Gecko, Otter, Bison, Unicorn) | Unknown | A massive, multilingual corpus with a focus on “multilingual text from hundreds of languages, which includes diverse sources like web documents, books, and conversational data.” | 32k | Unknown | (+) Highly capable and efficient, strong multilingual abilities and reasoning. (-) Closed-source, proprietary, and details are not public. |
| 19 | DeepSeek-V3 | DeepSeek AI | Open-Source | Causal decoder-only Transformer architecture combined with a fine-grained Mixture-of-Experts (MoE) and Multi-head Latent Attention (MLA). | 671B | 14.8T tokens | The training data is a mixture of public web content, code, mathematical data, and books. The exact composition is proprietary but is known to have a significant focus on diverse and high-quality English and Chinese data. | 128k | 7,168 | (+) Efficient architecture, strong reasoning and coding, high-quality bilingual performance. (-) General capability gaps, regulatory and bias concerns, and prompt sensitivity. |
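
As a rough illustration of using the table to shortlist a model for a use case, the sketch below encodes a handful of rows and filters them by minimum context size and by whether open weights are required. The `ModelEntry` class and the `shortlist` helper are hypothetical names introduced here, and only the largest listed context size is kept for each model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelEntry:
    name: str
    open_weights: bool
    max_context_tokens: int  # largest context size listed in the table


# A few rows transcribed from the table above ("k" taken as 1,000 tokens).
CATALOG = [
    ModelEntry("GPT-4o", False, 128_000),
    ModelEntry("Claude 3.5 Sonnet", False, 200_000),
    ModelEntry("Gemini 1.5 Pro", False, 1_000_000),
    ModelEntry("Llama 3", True, 128_000),
    ModelEntry("Mixtral 8x7B", True, 32_000),
    ModelEntry("Gemma 2", True, 8_000),
    ModelEntry("Falcon 180B", True, 2_000),
]


def shortlist(catalog, min_context, open_only=False):
    """Return names of models that meet the context and licensing constraints."""
    return [
        m.name
        for m in catalog
        if m.max_context_tokens >= min_context and (m.open_weights or not open_only)
    ]


# Long-document use case that must run on open weights.
print(shortlist(CATALOG, min_context=100_000, open_only=True))

# Very long context, any license.
print(shortlist(CATALOG, min_context=150_000))
```

In practice one would add further columns (parameter size, license terms, cost) as extra fields and filters; context size and openness are simply the easiest ones to compare directly.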

This data is valid as of August 2025 and is expected to change. It is by no means an exhaustive list, and I have left out many models. Newer models are released almost every day, and LLMs in general are improving at a rapid pace. It is still good to have the data at hand for comparison, and to understand how the different parameters affect a model's performance and its suitability for a particular use case.
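
One concrete way parameter size affects applicability is the memory needed just to hold the weights. The sketch below is a back-of-the-envelope estimate that multiplies parameter count by bytes per parameter at a few common precisions. It deliberately ignores activations, the KV cache, and runtime overhead, so real requirements will be higher; the function name is made up for this example.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(n_params: float, dtype: str) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9


# Two parameter sizes taken from the table: a small and a large open model.
for name, n_params in [("Phi-3 (3.8B)", 3.8e9), ("Llama 3 (70B)", 70e9)]:
    sizes = ", ".join(
        f"{dtype}: {weight_memory_gb(n_params, dtype):.1f} GB"
        for dtype in BYTES_PER_PARAM
    )
    print(f"{name} -> {sizes}")
```

This is why the smaller models in the table are described as suitable for consumer hardware, while the largest ones typically need multiple accelerators even before any serving overhead is counted.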

Please do provide your comments and let me know if you have any questions or if you would like to add more models to the list.

Happy coding!

Written by Naresh Mehta
Ideas analyzed logically to make sense & grow upon...