LLM Parameters

When we start learning about Large Language Models (LLMs), it is only natural to become curious about how parameter count, training data size, context size, tokens, and similar factors affect a model's performance. It is equally natural to ask how the existing models, both open and closed source, differ in these respects, and what their strengths and weaknesses are. Comparing the training data sizes behind these models also gives a sense of how many resources a comparable model would need in order to be trained from scratch.
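To make the resource question concrete, here is a rough back-of-the-envelope sketch. It relies on the widely used rule of thumb that training a dense Transformer costs roughly 6 x parameters x tokens floating-point operations. That rule, the function names, and the sustained per-GPU throughput are all assumptions chosen for illustration, not figures published for any of the models listed here.

```python
# Rough training-compute estimate using the common "6 * N * D" rule of thumb.
# N = parameter count, D = number of training tokens.
# The rule is an approximation for dense Transformers, not an exact cost.

SECONDS_PER_DAY = 86_400


def training_flops(params: float, tokens: float) -> float:
    """Approximate total training FLOPs as 6 * params * tokens."""
    return 6.0 * params * tokens


def gpu_days(flops: float, sustained_flops_per_gpu: float) -> float:
    """Convert total FLOPs into GPU-days for an assumed sustained throughput."""
    return flops / (sustained_flops_per_gpu * SECONDS_PER_DAY)


# Llama 3 sizes and token budget as listed in the table.
LLAMA3_TOKENS = 15e12
LLAMA3_SIZES = {"8B": 8e9, "70B": 70e9, "405B": 405e9}

# Hypothetical sustained throughput per GPU (FLOP/s); adjust for real hardware.
ASSUMED_THROUGHPUT = 4e14

for name, params in LLAMA3_SIZES.items():
    flops = training_flops(params, LLAMA3_TOKENS)
    days = gpu_days(flops, ASSUMED_THROUGHPUT)
    print(f"Llama 3 {name}: ~{flops:.2e} FLOPs, ~{days:,.0f} GPU-days")
```

The absolute numbers depend heavily on the assumed throughput, so they are best used to compare models with each other, not to budget a real training run.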

I have tried to collect this information in the table below. It should also help an AI practitioner select the right model for a given use case. Please do add your comments and provide more information as needed.

No.	Model Family	Developer	Open/Closed	Key Architectures & Encoding	Parameter Sizes	Training Data Size	Training Data Sources	Context Size (Tokens)	Output Dimensions (Hidden State)	Strengths (+) & Weaknesses (-)
1	GPT-4o	OpenAI	Closed	Transformer, Multi-modal, supports audio, vision, and text as input and output.	Unknown (est. >1T)	Unknown	Public data and “data licensed from third-party providers,” which likely includes web crawls, books, and code.	128k	Unknown	+ State-of-the-art multi-modal capabilities, exceptional reasoning, and coding, excellent for conversational tasks. - Closed-source and proprietary, high API cost, prone to “hallucinations.”
2	Claude 3.5 Sonnet	Anthropic	Closed	Transformer, Multi-modal, focus on responsible AI.	Unknown	Unknown	A proprietary mix of publicly available internet data (as of April 2024), non-public data from third parties, and data from human contractors.	200k	Unknown	+ Strong performance in reasoning, coding, and multi-modal tasks, a focus on safety and “Constitutional AI.” - Closed-source and proprietary, can be overly cautious, API access is required.
3	Gemini 1.5 Pro	Google	Closed	Mixture of Experts (MoE), Multi-modal.	Unknown	Unknown	A mix of publicly available data and proprietary datasets. The specific sources are not publicly detailed.	1M	Unknown	+ Exceptionally large context window, enabling recall on very long documents and videos, strong multi-modal capabilities, highly efficient Mixture-of-Experts (MoE) architecture. - Closed-source, training data details are proprietary, ecosystem is still developing.
4	Llama 3	Meta AI	Open-Source	Standard Transformer decoder-only architecture.	8B, 70B, 405B	15T tokens	“Publicly available sources” with a focus on quality. The dataset was pre-processed using a mix of human annotators and previous Llama models to filter for high-quality data.	8k, 128k	4,096 (8B), 8,192 (70B), 16,384 (405B)	+ Top-tier open-source performance, available in a variety of sizes, and permissive license. - Smaller models may not compete with state-of-the-art closed models, can require significant fine-tuning.
5	Mixtral 8x7B	Mistral AI	Open-Source	Sparse Mixture of Experts (MoE) decoder-only Transformer with Grouped-Query Attention (GQA), as in Mistral 7B.	46.7B total (MoE, ~12.9B active per token)	Unknown	Not publicly disclosed. It is assumed to be a high-quality, filtered web dataset similar to other large models, but Mistral AI keeps its data sources private.	32k	4,096	+ Highly efficient and fast due to its MoE architecture, strong performance for its size, excellent for running on consumer-grade hardware. - More complex to fine-tune, performance can be less consistent on highly specialized tasks.
6	Gemma 2	Google	Open-Source	Standard Transformer architecture, based on Gemini’s research.	2B, 9B, 27B	~2T-13T tokens	A mix of publicly available web data and synthetic data, curated to align with the training data for the larger Gemini models.	8k	Unknown	+ Lightweight and optimized for on-device applications, strong performance for its size, built on Google’s ethical AI principles. - Smaller parameter count limits its overall capabilities compared to frontier models.
7	Qwen 2	Alibaba	Open-Source	Transformer decoder-only architecture.	0.5B to 72B	Unknown	Not publicly disclosed. Given its strong multilingual performance, it’s likely a diverse dataset covering multiple languages and codebases. The Qwen2.5-Coder model was trained on 5.5T tokens, including source code and synthetic data.	128k	3,584 (for the 7B model)	+ Strong multilingual capabilities, supports a wide range of tasks, and a large context window. - The largest models can have a more restrictive license, not as widely adopted in some regions.
8	Phi-3	Microsoft	Open-Source	Transformer decoder-only architecture, focused on high-quality training data.	3.8B, 7B, 14B	3.3T tokens	A mix of “rigorously filtered public documents, high-quality educational materials, and specially created synthetic data.” It emphasizes the quality of data over sheer quantity.	4k, 8k, 128k	Unknown	+ Exceptionally strong performance for its small size, highly efficient and can run on consumer hardware. - Limited knowledge base compared to larger models, can struggle with complex, open-ended tasks.
9	Falcon 180B	TII	Open-Source	Causal Decoder-Only Transformer, Multi-Query Attention (MQA) & Rotary Positional Embeddings (RoPE)	180B	3.5T tokens	Predominantly from RefinedWeb, a massive filtered and deduplicated web dataset. It also includes curated conversational data.	2k	14,848	+ One of the largest open-source models available, strong performance on a variety of benchmarks. - Large size makes it expensive to run and fine-tune, limited context window.
10	T5	Google	Open-Source	Encoder-decoder Transformer.	60M - 11B	~750GB (Colossal Clean Crawled Corpus)	The Colossal Clean Crawled Corpus (C4), a massive, filtered version of the Common Crawl web archive.	512	512 (small), 768 (base)	+ Handles all NLP tasks as “text-to-text” conversion, highly versatile architecture. - Less effective for creative writing or conversational tasks, very limited context size in original versions.
11	BERT	Google	Open-Source	Encoder-only Transformer.	110M, 340M+	~3.3B words (Wikipedia + BookCorpus)	English Wikipedia (~2.5B words) and the BookCorpus dataset (~0.8B words), a collection of books by unpublished authors.	512	768 (base), 1,024 (large)	+ Excellent for classification and understanding tasks, highly influential foundational model. - Not a generative model, cannot produce new text, limited context size.
12	BLOOM	BigScience	Open-Source	Causal decoder-only Transformer with ALiBi (Attention with Linear Biases) positional embeddings.	176B	~350B tokens	The ROOTS corpus, which includes hundreds of sources in 46 natural and 13 programming languages, with a large portion from the OSCAR web corpus.	2k	14,336	+ One of the first open multilingual models at this scale, trained on 46 natural and 13 programming languages, developed by a large consortium of researchers. - Can be less performant than monolingual models on some tasks, large size makes it difficult to deploy.
13	Code Llama	Meta AI	Open-Source	Causal decoder-only Transformer that is a specialized version of the Llama 2 architecture. It uses Rotary Positional Embeddings (RoPE) to encode token positions and is trained with a code infilling objective, Fill-in-the-Middle (FIM).	7B, 13B, 34B, 70B	500B additional tokens (on top of Llama 2)	A specialized fine-tune of Llama 2. It was trained on an additional 500B tokens of code-specific datasets, along with natural language data related to code.	100k	4,096 (7B)	+ State-of-the-art for code generation and understanding, highly effective for programming tasks. - Less effective for general-purpose language tasks.
14	DBRX	Databricks	Open-Source	Causal decoder-only Transformer architecture with a fine-grained Mixture-of-Experts (MoE)	132B (MoE)	12T tokens	A carefully curated, proprietary dataset of text and code. Databricks has emphasized that the data was selected for “quality and domain diversity” using their platform tools.	32k	6144	+ High-performance Mixture-of-Experts model, strong for coding and reasoning tasks. - Can be more complex to fine-tune than monolithic models.
15	Zephyr	Hugging Face	Open-Source	Causal decoder-only Transformer design, Grouped-Query Attention (GQA), and Rotary Positional Embeddings (RoPE), Distilled Supervised Fine-Tuning (dSFT) and Distilled Direct Preference Optimization (dDPO) for knowledge distillation	7B	Unknown	An instruction-tuned model. It was fine-tuned using a mix of synthetic data generated by other LLMs and high-quality human-annotated data. The exact sources are not fully detailed.	4k	4096	+ An instruction-tuned model from the Mistral family, known for its high-quality chat capabilities and strong performance for its size. - Primarily a chat model, its general-purpose performance is not as strong as its base model.
16	Yi	01.AI	Open-Source	Causal decoder-only Transformer, Rotary Positional Embeddings (RoPE) with efficiency optimizations using pre-normalization, SwiGLU activation, and Grouped Query Attention (GQA)	6B, 9B, 34B	Unknown	The training data sources are not publicly specified, but the model’s strong performance in both Chinese and English suggests a high-quality, large-scale bilingual dataset.	4k, 8k, 200k	4096	+ Strong bilingual (Chinese/English) performance, good for code and reasoning tasks, a large context window in some variants. - Performance on other languages may not be as strong.
17	Grok-1	xAI	Open-Source	Causal decoder-only Transformer model with Mixture-of-Experts (MoE) using Rotary Positional Embeddings (RoPE) and Grouped-Query Attention (GQA)	314B (MoE)	Unknown	The training data comes from “the web” and includes real-time data from X (formerly Twitter).	8k	Unknown	+ Designed to be witty and rebellious, drawing from real-time data from X (formerly Twitter), and a large MoE architecture. - Training data from social media can lead to biased or inappropriate outputs, less predictable behavior.
18	PaLM 2	Google	Closed	Causal decoder-only Transformer trained on a high-quality dataset.	Unknown (sizes: Gecko, Otter, Bison, Unicorn)	Unknown	A massive, multilingual corpus with a focus on “multilingual text from hundreds of languages, which includes diverse sources like web documents, books, and conversational data.”	32k	Unknown	+ Highly capable and efficient, strong multilingual abilities, and reasoning. - Closed-source, proprietary, and details are not public.
19	DeepSeek-V3	DeepSeek AI	Open-Source	Causal decoder-only Transformer architecture combined with a fine-grained Mixture-of-Experts (MoE), Multi-head Latent Attention (MLA)	671B (MoE)	14.8T tokens	The training data is a mixture of public web content, code, mathematical data, and books. The exact composition is proprietary but is known to have a significant focus on diverse and high-quality English and Chinese data.	128k	7,168	+ Efficient architecture, strong reasoning and coding, high-quality bilingual performance. - General capability gaps, regulatory and bias concerns, and prompt sensitivity.
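
The parameter sizes in the table translate fairly directly into the memory needed just to hold a model's weights. The sketch below multiplies the parameter count by the bytes per parameter for a few common numeric precisions. It is a lower bound under stated assumptions: it ignores activations, the KV cache, and runtime overhead, and it uses the total parameter count for MoE models on the assumption that all experts are kept in memory. The function and dictionary names are made up for this example.

```python
# Estimate the memory needed to store model weights at different precisions.
# This covers weights only; activations, KV cache, and framework overhead
# come on top and are not modelled here.

BYTES_PER_PARAM = {
    "fp32": 4.0,
    "fp16/bf16": 2.0,
    "int8": 1.0,
    "int4": 0.5,
}


def weight_memory_gb(params: float, bytes_per_param: float) -> float:
    """Return weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return params * bytes_per_param / 1e9


# Total parameter counts as listed in the table.
MODELS = {
    "Llama 3 8B": 8e9,
    "Mixtral 8x7B": 46.7e9,
    "Falcon 180B": 180e9,
    "DeepSeek-V3": 671e9,
}

for model, params in MODELS.items():
    sizes = ", ".join(
        f"{precision}: {weight_memory_gb(params, nbytes):,.1f} GB"
        for precision, nbytes in BYTES_PER_PARAM.items()
    )
    print(f"{model} -> {sizes}")
```

Comparing these figures against the memory of the hardware you have in mind is a quick first filter before looking at benchmarks.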

This collection of data is valid as of August 2025, but it is expected to change in the future. It is by no means an exhaustive list, and I have left out many models. Newer models are being released almost every day, and LLMs in general are improving rapidly. Still, it is good to have the data at hand for comparison and to understand how the different parameters affect a model's performance and its suitability for a particular use case.
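
As a small illustration of using the table for model selection, the sketch below encodes a few rows and filters them by licence type and minimum context size. The `ModelRow` class and `shortlist` function are hypothetical helpers written for this example. Context sizes are taken from the largest variant listed in the table, reading "k" as 1,000 tokens, which is a simplifying assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelRow:
    name: str
    open_weights: bool  # True for rows marked Open-Source in the table
    max_context: int    # tokens, largest variant listed in the table


# A handful of rows copied from the table above.
ROWS = [
    ModelRow("GPT-4o", False, 128_000),
    ModelRow("Claude 3.5 Sonnet", False, 200_000),
    ModelRow("Gemini 1.5 Pro", False, 1_000_000),
    ModelRow("Llama 3", True, 128_000),
    ModelRow("Mixtral 8x7B", True, 32_000),
    ModelRow("Falcon 180B", True, 2_000),
    ModelRow("DeepSeek-V3", True, 128_000),
]


def shortlist(rows, min_context, open_only=False):
    """Return names of models meeting the context and licence requirements."""
    return [
        row.name
        for row in rows
        if row.max_context >= min_context and (row.open_weights or not open_only)
    ]


# Example: open models that can take at least 100k tokens of context.
print(shortlist(ROWS, min_context=100_000, open_only=True))
# -> ['Llama 3', 'DeepSeek-V3']
```

The same pattern extends to other columns, such as parameter size or architecture, if those are added as fields.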

Please do share your comments, and let me know if you have any questions or would like more models added to the list.

Happy coding!
