
# Supported Models

## 🤖 LLM Model Selection Guide (2025 Q2)

Only pay for the power you need. This guide distills public benchmarks, vendor docs, and independent evaluations so you can weigh price, speed, context window, multimodality, tool use, and raw reasoning in seconds.

## 🗺️ Quick-pick matrix

| Need this most | Pick this model | Why |
| --- | --- | --- |
| ⚡ Sub-300 ms, < $0.001 | Gemini 2.0 Flash Lite · Claude 3 Haiku | First-token latency ≈ 0.25 s; rock-bottom cost |
| 🧠 Elite reasoning (budget ≠ issue) | Claude 3 Opus / 3.7 Sonnet IR | Top scores on MMLU, GPQA, HumanEval |
| 📚 ≥ 1 M-token context | Gemini 1.5 Flash / Pro | Up to 2 M tokens (1 M input + 1 M output); multimodal |
| 🖥️ Self-host / on-prem | Llama 3 70B Instruct | Open weights, GPT-3.5-grade accuracy |
| 🔍 Turn-key RAG | Cohere Command R / R-Plus | Built-in retrieval & function calling; 128 K context |
| 💵 AWS-native, 32 K context | Amazon Nova Pro | Runs inside Bedrock; IAM, VPC, KMS integration |
| 🐍 Bilingual EN-ZH coding | DeepSeek R1-V1 | 90 % MMLU, 97 % MATH500; open weights |
(Prices = input-token unless noted. See the pricing sheet.)
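
To make the matrix actionable in code, here is a minimal routing sketch. The `Need` fields, thresholds, and model labels are illustrative assumptions, not official identifiers or ShadAI API calls.

```python
from dataclasses import dataclass

# Hypothetical requirements object; fields and thresholds are illustrative.
@dataclass
class Need:
    max_latency_ms: int = 10_000
    min_context_tokens: int = 8_000
    needs_rag: bool = False
    self_hosted: bool = False

def pick_model(need: Need) -> str:
    """Walk the quick-pick matrix from most to least constraining need."""
    if need.self_hosted:
        return "llama-3-70b-instruct"      # open weights
    if need.needs_rag:
        return "command-r-plus"            # turn-key RAG
    if need.min_context_tokens > 200_000:
        return "gemini-1.5-pro"            # up to 2 M tokens
    if need.max_latency_ms < 300:
        return "gemini-2.0-flash-lite"     # sub-300 ms tier
    return "claude-3.7-sonnet-ir"          # default: elite reasoning

print(pick_model(Need(max_latency_ms=250)))  # -> gemini-2.0-flash-lite
```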

## Google · Gemini family

### Gemini 2.0 Flash Lite • 1 M ctx

- **Multimodal:** text + image + audio in, text out
- **Tool use:** native function calling (JSON schema); see the sketch below
- **Benchmarks:** MMLU-Pro 71.6 %; HiddenMath 55 %
- **Strengths:** fastest and cheapest Gemini; watermarking on media
- **Caveats:** shallower reasoning vs Flash/Pro
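
A function declaration in the JSON-schema (OpenAPI-subset) form that Gemini's function calling expects; the `get_weather` function and its fields are hypothetical:

```python
# Hypothetical tool declaration in the JSON-schema style used by
# Gemini function calling (an OpenAPI-subset schema).
get_weather_tool = {
    "name": "get_weather",  # hypothetical function name
    "description": "Look up the current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {
            "city": {"type": "string", "description": "City name"},
            "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
        },
        "required": ["city"],
    },
}
```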

### Gemini 2.0 Flash • 1 M ctx

- 2× the speed of 1.5 Pro; adds multimodal output (image & speech)
- MMLU-Pro 77.6 %; HiddenMath 63.5 %
- Parallel search/tool calls improve accuracy & grounding

### Gemini 1.5 Flash 8B / Flash • 1.5 M ctx

- Distilled from 1.5 Pro; the sweet spot for cost/quality
- MMLU-Pro 67 %; good for RAG over giant PDFs & video

### Gemini 1.5 Pro • 2 M ctx

- First model past 90 % MMLU; excels at code & long-form reasoning
- Closed beta; pricey & slower

## Anthropic · Claude 3 / 3.5 / 3.7 line

| Model | Context | Benchmarks\* | Multimodal | Tool use | Notes |
| --- | --- | --- | --- | --- | --- |
| Haiku | 200 K | 55 % MMLU | Text | ⚙︎ (prompt-level) | Sub-sec, budget tier |
| Sonnet 3.5 | 200 K | 78 % MMLU, 64 % HumanEval | Vision SOTA (June ’24) | ⚙︎ | Mid-tier; beats 3 Opus |
| Sonnet 3.7 IR | 200 K (+128 K CoT) | ≈ GPT-4 on hard maths | Vision | ⚙︎ + integrated reasoning switch | Fast or reflective modes |
| Opus 3.0 | 100 K | 80 % MMLU | Vision | ⚙︎ | Deep reasoning, higher latency |

\*Percentages are representative single-shot scores.
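
All tiers accept explicit tool definitions through the Messages API; a minimal sketch with the Anthropic Python SDK, where the `get_stock_price` tool is hypothetical and the model ID should be checked against current vendor docs:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-3-5-sonnet-20240620",  # illustrative; verify current model IDs
    max_tokens=1024,
    tools=[{
        "name": "get_stock_price",  # hypothetical tool
        "description": "Return the latest price for a ticker symbol.",
        "input_schema": {
            "type": "object",
            "properties": {"ticker": {"type": "string"}},
            "required": ["ticker"],
        },
    }],
    messages=[{"role": "user", "content": "What is ACME trading at?"}],
)

# When the model decides to call the tool, the response carries a
# tool_use block with the parsed arguments.
for block in response.content:
    if block.type == "tool_use":
        print(block.name, block.input)
```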

### Alignment & safety

All Claude 3 models are trained with Constitutional AI; 3.7 adds “Constitutional Classifiers”, making it the toughest of the major models to jailbreak.

## DeepSeek · R1-V1 (open)

- **Arch:** 37 B active · 671 B MoE, 128 K ctx
- **Benchmarks:** 90.8 % MMLU, 97 % MATH500, Codeforces 96th percentile
- **Multimodal:** text-only
- **Tool use:** no native JSON schema, but excels with ReAct prompts (see the sketch below)
- **Safety:** minimal RLHF, so add your own moderation layer
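
Without a native function-calling schema, tool use works in-prompt via a ReAct loop; a minimal sketch where `call_llm` and `run_tool` are placeholders for your own model client and tool implementations:

```python
import re

REACT_PROMPT = """Answer the question with this loop:
Thought: reason about what to do next
Action: tool_name[input]
Observation: (the tool result is inserted here)
... repeat Thought/Action/Observation as needed ...
Final Answer: the answer

Tools: search[query], calculator[expression]
Question: {question}
"""

def react_loop(question: str, call_llm, run_tool, max_steps: int = 5) -> str:
    # call_llm(prompt) -> str and run_tool(name, arg) -> str are placeholders
    # for your model client and tool implementations.
    transcript = REACT_PROMPT.format(question=question)
    for _ in range(max_steps):
        reply = call_llm(transcript)
        transcript += reply
        if "Final Answer:" in reply:
            return reply.split("Final Answer:")[-1].strip()
        match = re.search(r"Action:\s*(\w+)\[(.*?)\]", reply)
        if match:
            observation = run_tool(match.group(1), match.group(2))
            transcript += f"\nObservation: {observation}\n"
    return "No final answer within the step budget."
```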

## Meta · Llama 3 & 4

### Llama 3 Instruct

| Size | Context | MMLU | HumanEval | Best for |
| --- | --- | --- | --- | --- |
| 1 B / 8 B | 128 K | 60 % | 20-30 % | Edge, CPU, tagging |
| 70 B | 128 K | 85 % | 50 % | Self-hosted RAG, privacy |
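
For the self-hosted rows above, a minimal serving sketch with Hugging Face `transformers`. It assumes you have accepted Meta's license for the gated weights, and shows the 8B variant because it fits on a single ~16 GB GPU (70B needs multiple GPUs or quantization):

```python
from transformers import pipeline

# Chat-style text generation with the official instruct weights.
generator = pipeline(
    "text-generation",
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # gated repo: accept the license first
    device_map="auto",
)

messages = [{"role": "user", "content": "Tag this ticket: 'Refund not received.'"}]
out = generator(messages, max_new_tokens=64)

# With chat-format input, generated_text is the conversation including the reply.
print(out[0]["generated_text"][-1]["content"])
```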

### Llama 4 Scout 17 B • 10 M ctx

- **Multimodal:** text + image; 12 languages
- **Benchmarks:** outperforms Llama 3 70B; SOTA in the ≤20 B class
- **Tool use:** designed for agents; Bedrock/Azure endpoints

### Llama 4 Maverick 17 B • 1 M ctx

- 128-expert MoE; higher accuracy than Scout, slightly slower
- Aims at GPT-4 class while 3-4× cheaper

## Cohere · Command family

| Model | Params | Context | Benchmarks | Tool use | Niche |
| --- | --- | --- | --- | --- | --- |
| Light | ~ ? B | 4 K | 55 % MMLU | Prompt-level | Cheap chat, routing |
| R | 35 B | 128 K | 75 % MMLU, top Arena (’24) | Native func-call, cites sources | Long-doc RAG |
| R Plus | 104 B | 128 K | ≈ GPT-4 on RAG, 2× faster | Multi-step, self-correct | Frontier RAG & agents |
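
Command R's built-in grounding takes retrieved chunks directly in the chat call and returns citations; a minimal sketch with the Cohere Python SDK, where the document snippets are hypothetical stand-ins for your retriever's output:

```python
import cohere

co = cohere.Client()  # reads CO_API_KEY from the environment

response = co.chat(
    model="command-r",
    message="What is our refund window?",
    documents=[  # hypothetical snippets standing in for retrieved chunks
        {"title": "Refund policy", "snippet": "Refunds are accepted within 30 days."},
        {"title": "Shipping FAQ", "snippet": "Orders ship within 2 business days."},
    ],
)

print(response.text)       # grounded answer
print(response.citations)  # spans linking the answer back to the documents
```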

## Amazon · Nova series

| Model | Context | Multimodal | Benchmarks (≈) | Fine-tune | Ideal |
| --- | --- | --- | --- | --- | --- |
| Micro | 128 K | ❌ | Beats GPT-4o-mini by ≈2 % | ✔︎ text | Low-latency bulk chat |
| Lite | 300 K | ✔︎ (image + video) | Strong vis-text, < Nova Pro | ✔︎ text+vision | Doc & media analysis |
| Pro | 300 K | ✔︎✔︎ | Near GPT-4o on RAG, 2× faster, 65 % cheaper | ✔︎ | Enterprise, multilingual, agents |
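
Because Nova runs inside Bedrock, invocation is a standard AWS call governed by IAM; a minimal sketch using boto3's Converse API, where the model ID and region are assumptions to verify for your account:

```python
import boto3

# Standard AWS controls apply: IAM roles/policies, VPC endpoints, KMS.
bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

response = bedrock.converse(
    modelId="amazon.nova-pro-v1:0",  # assumed ID; verify availability in your region
    messages=[{"role": "user", "content": [{"text": "Summarize: ..."}]}],
    inferenceConfig={"maxTokens": 512, "temperature": 0.2},
)

print(response["output"]["message"]["content"][0]["text"])
```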

## 🛠️ Cost-saving tips

1. **Prototype on Flash Lite / Haiku**, then upgrade only where accuracy gaps appear.
2. **Keep sessions alive.** Reuse sessions that already hold your ingested data instead of creating new ones and ingesting the same files again.
3. **Don't repeat ingestion.** Skip ingestion for data you have already processed to avoid re-processing and extra token usage (see the sketch after this list).
4. **Optimize your data.** The leaner the input, the lower the cost.
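
Tip 3 can be enforced with a tiny content-hash ledger kept next to your app; a hedged sketch that is independent of the ShadAI client, where the ledger path and the ingest call are placeholders:

```python
import hashlib
import json
import pathlib

LEDGER = pathlib.Path("ingested.json")  # placeholder ledger location

def already_ingested(path: str) -> bool:
    """Return True if this exact file content was ingested before."""
    digest = hashlib.sha256(pathlib.Path(path).read_bytes()).hexdigest()
    seen = json.loads(LEDGER.read_text()) if LEDGER.exists() else {}
    if seen.get(path) == digest:
        return True  # same bytes already processed: skip re-ingestion
    seen[path] = digest
    LEDGER.write_text(json.dumps(seen))
    return False

if not already_ingested("report.pdf"):
    ...  # call your ShadAI ingestion here, e.g. inside an existing session
```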

Sources: Google DeepMind, Anthropic (Claude 3.7 IR white-paper), DeepSeek-AI, Meta AI (Llama 3 & 4 model cards), Cohere tech blog, AWS re:Invent 2024 Nova launch, FloTorch CRAG benchmark.