Lesson overview | Previous part | Lesson overview
Capability Benchmarks: Part 6: LLM Benchmark Families to References
6. LLM Benchmark Families
LLM Benchmark Families is the part of capability benchmarks that turns the approved TOC into a concrete learning path. The subsections below keep the focus on Chapter 17's canonical job: measurement, reliability, uncertainty, and decision support for AI systems.
6.1 MMLU and multitask knowledge
MMLU and multitask knowledge is part of the canonical scope of capability benchmarks. In this chapter, the object under study is not merely a dataset or a model, but the full benchmark protocol: the items, prompts, outputs, graders, uncertainty statements, and decision rules that turn model behavior into evidence.
The basic mathematical pattern is an empirical estimator. For a model or system evaluated on items , the local estimate is written
The formula is intentionally simple. The difficulty lies in deciding what counts as an item, which loss or score is meaningful, whether the items are independent, and whether the estimate answers the real product or research question. For mmlu and multitask knowledge, those choices determine whether the reported number is evidence or decoration.
A useful invariant is that every evaluation claim should be reproducible as a tuple , where is the system, is the task sample, is the prompt or intervention policy, is the grader, and is the aggregation rule. If any part of this tuple is missing, the number cannot be audited.
| Component | What to record | Why it matters |
|---|---|---|
| Item definition | IDs, source, split, and allowed transformations | Prevents accidental drift in mmlu and multitask knowledge |
| Scoring rule | Exact formula for s_m(z_i) | Makes comparisons repeatable |
| Aggregation | Mean, weighted mean, worst group, or pairwise model | Determines the scientific claim |
| Uncertainty | Standard error, interval, or posterior summary | Separates signal from sampling noise |
| Audit trail | Code version and random seeds | Makes failures debuggable |
Examples of correct use:
- Report mmlu and multitask knowledge with item count, prompt protocol, grader version, and a confidence interval.
- Use paired comparisons when two models answer the same evaluation items.
- Inspect at least one meaningful slice before concluding that the aggregate result is reliable.
- Store raw outputs so future graders can be replayed without querying the model again.
- Document whether the metric is measuring capability, reliability, user value, or risk.
Non-examples:
- A leaderboard point estimate without sample size.
- A benchmark score produced with an undocumented prompt template.
- A model-graded result without judge identity, rubric, or agreement check.
- A robustness claim measured only on the easiest in-distribution examples.
- An online win declared before the randomization and logging checks pass.
Worked evaluation pattern for mmlu and multitask knowledge:
- Define the evaluation population in words before writing code.
- Choose the smallest metric set that answers the decision question.
- Compute the point estimate and an uncertainty statement together.
- Run a slice or paired analysis to check whether the aggregate hides structure.
- Archive raw outputs, scores, and seeds before changing the prompt or grader.
For AI systems, mmlu and multitask knowledge is especially delicate because the same model can be used with many prompts, decoding policies, tools, retrieval contexts, and safety filters. The measured quantity is therefore a property of the system configuration, not just the base weights.
| AI connection | Evaluation consequence |
|---|---|
| Prompting | Treat prompt templates as part of the protocol, not as invisible setup |
| Decoding | Temperature and sampling change both mean score and variance |
| Retrieval | Retrieved context creates an extra source of failure and leakage |
| Tool use | Tool errors need separate attribution from model reasoning errors |
| Safety layer | Guardrail behavior can improve risk metrics while changing capability metrics |
Implementation checklist:
- Use deterministic seeds for synthetic or sampled evaluation subsets.
- Print metric denominators, not only percentages.
- Keep missing, invalid, timeout, and refusal outcomes explicit.
- Prefer typed result records over loose CSV columns.
- Separate raw model outputs from normalized grader inputs.
- Track the smallest reproducible command that generated the result.
- Record whether the estimate is item-weighted, token-weighted, user-weighted, or domain-weighted.
- Write the decision rule before seeing the final score whenever the result will guide a release.
The mathematical habit to build is skepticism with structure. A score is not ignored because it is noisy; it is interpreted through the design that produced it. MMLU and multitask knowledge is one place where that habit becomes concrete.
6.2 HumanEval and functional correctness
HumanEval and functional correctness is part of the canonical scope of capability benchmarks. In this chapter, the object under study is not merely a dataset or a model, but the full benchmark protocol: the items, prompts, outputs, graders, uncertainty statements, and decision rules that turn model behavior into evidence.
The basic mathematical pattern is an empirical estimator. For a model or system evaluated on items , the local estimate is written
The formula is intentionally simple. The difficulty lies in deciding what counts as an item, which loss or score is meaningful, whether the items are independent, and whether the estimate answers the real product or research question. For humaneval and functional correctness, those choices determine whether the reported number is evidence or decoration.
A useful invariant is that every evaluation claim should be reproducible as a tuple , where is the system, is the task sample, is the prompt or intervention policy, is the grader, and is the aggregation rule. If any part of this tuple is missing, the number cannot be audited.
| Component | What to record | Why it matters |
|---|---|---|
| Item definition | IDs, source, split, and allowed transformations | Prevents accidental drift in humaneval and functional correctness |
| Scoring rule | Exact formula for s_m(z_i) | Makes comparisons repeatable |
| Aggregation | Mean, weighted mean, worst group, or pairwise model | Determines the scientific claim |
| Uncertainty | Standard error, interval, or posterior summary | Separates signal from sampling noise |
| Audit trail | Code version and random seeds | Makes failures debuggable |
Examples of correct use:
- Report humaneval and functional correctness with item count, prompt protocol, grader version, and a confidence interval.
- Use paired comparisons when two models answer the same evaluation items.
- Inspect at least one meaningful slice before concluding that the aggregate result is reliable.
- Store raw outputs so future graders can be replayed without querying the model again.
- Document whether the metric is measuring capability, reliability, user value, or risk.
Non-examples:
- A leaderboard point estimate without sample size.
- A benchmark score produced with an undocumented prompt template.
- A model-graded result without judge identity, rubric, or agreement check.
- A robustness claim measured only on the easiest in-distribution examples.
- An online win declared before the randomization and logging checks pass.
Worked evaluation pattern for humaneval and functional correctness:
- Define the evaluation population in words before writing code.
- Choose the smallest metric set that answers the decision question.
- Compute the point estimate and an uncertainty statement together.
- Run a slice or paired analysis to check whether the aggregate hides structure.
- Archive raw outputs, scores, and seeds before changing the prompt or grader.
For AI systems, humaneval and functional correctness is especially delicate because the same model can be used with many prompts, decoding policies, tools, retrieval contexts, and safety filters. The measured quantity is therefore a property of the system configuration, not just the base weights.
| AI connection | Evaluation consequence |
|---|---|
| Prompting | Treat prompt templates as part of the protocol, not as invisible setup |
| Decoding | Temperature and sampling change both mean score and variance |
| Retrieval | Retrieved context creates an extra source of failure and leakage |
| Tool use | Tool errors need separate attribution from model reasoning errors |
| Safety layer | Guardrail behavior can improve risk metrics while changing capability metrics |
Implementation checklist:
- Use deterministic seeds for synthetic or sampled evaluation subsets.
- Print metric denominators, not only percentages.
- Keep missing, invalid, timeout, and refusal outcomes explicit.
- Prefer typed result records over loose CSV columns.
- Separate raw model outputs from normalized grader inputs.
- Track the smallest reproducible command that generated the result.
- Record whether the estimate is item-weighted, token-weighted, user-weighted, or domain-weighted.
- Write the decision rule before seeing the final score whenever the result will guide a release.
The mathematical habit to build is skepticism with structure. A score is not ignored because it is noisy; it is interpreted through the design that produced it. HumanEval and functional correctness is one place where that habit becomes concrete.
6.3 BIG-bench and capability extrapolation
BIG-bench and capability extrapolation is part of the canonical scope of capability benchmarks. In this chapter, the object under study is not merely a dataset or a model, but the full benchmark protocol: the items, prompts, outputs, graders, uncertainty statements, and decision rules that turn model behavior into evidence.
The basic mathematical pattern is an empirical estimator. For a model or system evaluated on items , the local estimate is written
The formula is intentionally simple. The difficulty lies in deciding what counts as an item, which loss or score is meaningful, whether the items are independent, and whether the estimate answers the real product or research question. For big-bench and capability extrapolation, those choices determine whether the reported number is evidence or decoration.
A useful invariant is that every evaluation claim should be reproducible as a tuple , where is the system, is the task sample, is the prompt or intervention policy, is the grader, and is the aggregation rule. If any part of this tuple is missing, the number cannot be audited.
| Component | What to record | Why it matters |
|---|---|---|
| Item definition | IDs, source, split, and allowed transformations | Prevents accidental drift in big-bench and capability extrapolation |
| Scoring rule | Exact formula for s_m(z_i) | Makes comparisons repeatable |
| Aggregation | Mean, weighted mean, worst group, or pairwise model | Determines the scientific claim |
| Uncertainty | Standard error, interval, or posterior summary | Separates signal from sampling noise |
| Audit trail | Code version and random seeds | Makes failures debuggable |
Examples of correct use:
- Report big-bench and capability extrapolation with item count, prompt protocol, grader version, and a confidence interval.
- Use paired comparisons when two models answer the same evaluation items.
- Inspect at least one meaningful slice before concluding that the aggregate result is reliable.
- Store raw outputs so future graders can be replayed without querying the model again.
- Document whether the metric is measuring capability, reliability, user value, or risk.
Non-examples:
- A leaderboard point estimate without sample size.
- A benchmark score produced with an undocumented prompt template.
- A model-graded result without judge identity, rubric, or agreement check.
- A robustness claim measured only on the easiest in-distribution examples.
- An online win declared before the randomization and logging checks pass.
Worked evaluation pattern for big-bench and capability extrapolation:
- Define the evaluation population in words before writing code.
- Choose the smallest metric set that answers the decision question.
- Compute the point estimate and an uncertainty statement together.
- Run a slice or paired analysis to check whether the aggregate hides structure.
- Archive raw outputs, scores, and seeds before changing the prompt or grader.
For AI systems, big-bench and capability extrapolation is especially delicate because the same model can be used with many prompts, decoding policies, tools, retrieval contexts, and safety filters. The measured quantity is therefore a property of the system configuration, not just the base weights.
| AI connection | Evaluation consequence |
|---|---|
| Prompting | Treat prompt templates as part of the protocol, not as invisible setup |
| Decoding | Temperature and sampling change both mean score and variance |
| Retrieval | Retrieved context creates an extra source of failure and leakage |
| Tool use | Tool errors need separate attribution from model reasoning errors |
| Safety layer | Guardrail behavior can improve risk metrics while changing capability metrics |
Implementation checklist:
- Use deterministic seeds for synthetic or sampled evaluation subsets.
- Print metric denominators, not only percentages.
- Keep missing, invalid, timeout, and refusal outcomes explicit.
- Prefer typed result records over loose CSV columns.
- Separate raw model outputs from normalized grader inputs.
- Track the smallest reproducible command that generated the result.
- Record whether the estimate is item-weighted, token-weighted, user-weighted, or domain-weighted.
- Write the decision rule before seeing the final score whenever the result will guide a release.
The mathematical habit to build is skepticism with structure. A score is not ignored because it is noisy; it is interpreted through the design that produced it. BIG-bench and capability extrapolation is one place where that habit becomes concrete.
6.4 HELM and multi-metric transparency
HELM and multi-metric transparency is part of the canonical scope of capability benchmarks. In this chapter, the object under study is not merely a dataset or a model, but the full benchmark protocol: the items, prompts, outputs, graders, uncertainty statements, and decision rules that turn model behavior into evidence.
The basic mathematical pattern is an empirical estimator. For a model or system evaluated on items , the local estimate is written
The formula is intentionally simple. The difficulty lies in deciding what counts as an item, which loss or score is meaningful, whether the items are independent, and whether the estimate answers the real product or research question. For helm and multi-metric transparency, those choices determine whether the reported number is evidence or decoration.
A useful invariant is that every evaluation claim should be reproducible as a tuple , where is the system, is the task sample, is the prompt or intervention policy, is the grader, and is the aggregation rule. If any part of this tuple is missing, the number cannot be audited.
| Component | What to record | Why it matters |
|---|---|---|
| Item definition | IDs, source, split, and allowed transformations | Prevents accidental drift in helm and multi-metric transparency |
| Scoring rule | Exact formula for s_m(z_i) | Makes comparisons repeatable |
| Aggregation | Mean, weighted mean, worst group, or pairwise model | Determines the scientific claim |
| Uncertainty | Standard error, interval, or posterior summary | Separates signal from sampling noise |
| Audit trail | Code version and random seeds | Makes failures debuggable |
Examples of correct use:
- Report helm and multi-metric transparency with item count, prompt protocol, grader version, and a confidence interval.
- Use paired comparisons when two models answer the same evaluation items.
- Inspect at least one meaningful slice before concluding that the aggregate result is reliable.
- Store raw outputs so future graders can be replayed without querying the model again.
- Document whether the metric is measuring capability, reliability, user value, or risk.
Non-examples:
- A leaderboard point estimate without sample size.
- A benchmark score produced with an undocumented prompt template.
- A model-graded result without judge identity, rubric, or agreement check.
- A robustness claim measured only on the easiest in-distribution examples.
- An online win declared before the randomization and logging checks pass.
Worked evaluation pattern for helm and multi-metric transparency:
- Define the evaluation population in words before writing code.
- Choose the smallest metric set that answers the decision question.
- Compute the point estimate and an uncertainty statement together.
- Run a slice or paired analysis to check whether the aggregate hides structure.
- Archive raw outputs, scores, and seeds before changing the prompt or grader.
For AI systems, helm and multi-metric transparency is especially delicate because the same model can be used with many prompts, decoding policies, tools, retrieval contexts, and safety filters. The measured quantity is therefore a property of the system configuration, not just the base weights.
| AI connection | Evaluation consequence |
|---|---|
| Prompting | Treat prompt templates as part of the protocol, not as invisible setup |
| Decoding | Temperature and sampling change both mean score and variance |
| Retrieval | Retrieved context creates an extra source of failure and leakage |
| Tool use | Tool errors need separate attribution from model reasoning errors |
| Safety layer | Guardrail behavior can improve risk metrics while changing capability metrics |
Implementation checklist:
- Use deterministic seeds for synthetic or sampled evaluation subsets.
- Print metric denominators, not only percentages.
- Keep missing, invalid, timeout, and refusal outcomes explicit.
- Prefer typed result records over loose CSV columns.
- Separate raw model outputs from normalized grader inputs.
- Track the smallest reproducible command that generated the result.
- Record whether the estimate is item-weighted, token-weighted, user-weighted, or domain-weighted.
- Write the decision rule before seeing the final score whenever the result will guide a release.
The mathematical habit to build is skepticism with structure. A score is not ignored because it is noisy; it is interpreted through the design that produced it. HELM and multi-metric transparency is one place where that habit becomes concrete.
6.5 MT-Bench, Chatbot Arena, and LLM-as-judge
MT-Bench, Chatbot Arena, and LLM-as-judge is part of the canonical scope of capability benchmarks. In this chapter, the object under study is not merely a dataset or a model, but the full benchmark protocol: the items, prompts, outputs, graders, uncertainty statements, and decision rules that turn model behavior into evidence.
The basic mathematical pattern is an empirical estimator. For a model or system evaluated on items , the local estimate is written
The formula is intentionally simple. The difficulty lies in deciding what counts as an item, which loss or score is meaningful, whether the items are independent, and whether the estimate answers the real product or research question. For mt-bench, chatbot arena, and llm-as-judge, those choices determine whether the reported number is evidence or decoration.
A useful invariant is that every evaluation claim should be reproducible as a tuple , where is the system, is the task sample, is the prompt or intervention policy, is the grader, and is the aggregation rule. If any part of this tuple is missing, the number cannot be audited.
| Component | What to record | Why it matters |
|---|---|---|
| Item definition | IDs, source, split, and allowed transformations | Prevents accidental drift in mt-bench, chatbot arena, and llm-as-judge |
| Scoring rule | Exact formula for s_m(z_i) | Makes comparisons repeatable |
| Aggregation | Mean, weighted mean, worst group, or pairwise model | Determines the scientific claim |
| Uncertainty | Standard error, interval, or posterior summary | Separates signal from sampling noise |
| Audit trail | Code version and random seeds | Makes failures debuggable |
Examples of correct use:
- Report mt-bench, chatbot arena, and llm-as-judge with item count, prompt protocol, grader version, and a confidence interval.
- Use paired comparisons when two models answer the same evaluation items.
- Inspect at least one meaningful slice before concluding that the aggregate result is reliable.
- Store raw outputs so future graders can be replayed without querying the model again.
- Document whether the metric is measuring capability, reliability, user value, or risk.
Non-examples:
- A leaderboard point estimate without sample size.
- A benchmark score produced with an undocumented prompt template.
- A model-graded result without judge identity, rubric, or agreement check.
- A robustness claim measured only on the easiest in-distribution examples.
- An online win declared before the randomization and logging checks pass.
Worked evaluation pattern for mt-bench, chatbot arena, and llm-as-judge:
- Define the evaluation population in words before writing code.
- Choose the smallest metric set that answers the decision question.
- Compute the point estimate and an uncertainty statement together.
- Run a slice or paired analysis to check whether the aggregate hides structure.
- Archive raw outputs, scores, and seeds before changing the prompt or grader.
For AI systems, mt-bench, chatbot arena, and llm-as-judge is especially delicate because the same model can be used with many prompts, decoding policies, tools, retrieval contexts, and safety filters. The measured quantity is therefore a property of the system configuration, not just the base weights.
| AI connection | Evaluation consequence |
|---|---|
| Prompting | Treat prompt templates as part of the protocol, not as invisible setup |
| Decoding | Temperature and sampling change both mean score and variance |
| Retrieval | Retrieved context creates an extra source of failure and leakage |
| Tool use | Tool errors need separate attribution from model reasoning errors |
| Safety layer | Guardrail behavior can improve risk metrics while changing capability metrics |
Implementation checklist:
- Use deterministic seeds for synthetic or sampled evaluation subsets.
- Print metric denominators, not only percentages.
- Keep missing, invalid, timeout, and refusal outcomes explicit.
- Prefer typed result records over loose CSV columns.
- Separate raw model outputs from normalized grader inputs.
- Track the smallest reproducible command that generated the result.
- Record whether the estimate is item-weighted, token-weighted, user-weighted, or domain-weighted.
- Write the decision rule before seeing the final score whenever the result will guide a release.
The mathematical habit to build is skepticism with structure. A score is not ignored because it is noisy; it is interpreted through the design that produced it. MT-Bench, Chatbot Arena, and LLM-as-judge is one place where that habit becomes concrete.
7. Common Mistakes
| # | Mistake | Why It Is Wrong | Fix |
|---|---|---|---|
| 1 | Treating a point estimate as exact | Every finite evaluation has sampling error in capability benchmarks. | Report uncertainty with the point estimate. |
| 2 | Changing prompts between models | The protocol changed with the treatment in capability benchmarks. | Lock prompt, decoding, and grader before comparison. |
| 3 | Ignoring invalid outputs | Missingness can be correlated with model quality in capability benchmarks. | Track invalid, timeout, refusal, and parse-failure rates. |
| 4 | Overfitting to a public leaderboard | Repeated testing leaks information from the benchmark in capability benchmarks. | Use private holdouts and regression suites. |
| 5 | Averaging incomparable metrics | Different scales do not have shared units in capability benchmarks. | Normalize by a stated decision rule or report separately. |
| 6 | Forgetting paired structure | Two models often answer the same items in capability benchmarks. | Use paired bootstrap or paired tests where possible. |
| 7 | Reporting only aggregate performance | Subgroup failures can be hidden in capability benchmarks. | Add slice and tail-risk views. |
| 8 | Trusting model judges blindly | LLM judges have position, verbosity, and self-preference biases in capability benchmarks. | Calibrate judges against human labels. |
| 9 | Peeking during online experiments | Optional stopping inflates false positives in capability benchmarks. | Use fixed horizons or sequential-valid methods. |
| 10 | Conflating evaluation with monitoring | Chapter 17 measures controlled evidence; production monitoring is ongoing operations in capability benchmarks. | Hand off drift dashboards to Chapter 19 concepts. |
8. Exercises
-
(*) Benchmarks as noisy estimators. (a) Define the relevant evaluation object. (b) Write the estimator in LaTeX notation. (c) Give one example where the estimator is reliable. (d) Give one example where the same number would be misleading. (e) Describe what the theory notebook should verify computationally.
-
(*) Capability versus observed score. (a) Define the relevant evaluation object. (b) Write the estimator in LaTeX notation. (c) Give one example where the estimator is reliable. (d) Give one example where the same number would be misleading. (e) Describe what the theory notebook should verify computationally.
-
(*) Metric pluralism for LLM systems. (a) Define the relevant evaluation object. (b) Write the estimator in LaTeX notation. (c) Give one example where the estimator is reliable. (d) Give one example where the same number would be misleading. (e) Describe what the theory notebook should verify computationally.
-
(**) Benchmark lifecycle and saturation. (a) Define the relevant evaluation object. (b) Write the estimator in LaTeX notation. (c) Give one example where the estimator is reliable. (d) Give one example where the same number would be misleading. (e) Describe what the theory notebook should verify computationally.
-
(**) What benchmark scores can and cannot certify. (a) Define the relevant evaluation object. (b) Write the estimator in LaTeX notation. (c) Give one example where the estimator is reliable. (d) Give one example where the same number would be misleading. (e) Describe what the theory notebook should verify computationally.
-
(**) Model and system under test. (a) Define the relevant evaluation object. (b) Write the estimator in LaTeX notation. (c) Give one example where the estimator is reliable. (d) Give one example where the same number would be misleading. (e) Describe what the theory notebook should verify computationally.
-
(***) Task, item, and evaluation sample. (a) Define the relevant evaluation object. (b) Write the estimator in LaTeX notation. (c) Give one example where the estimator is reliable. (d) Give one example where the same number would be misleading. (e) Describe what the theory notebook should verify computationally.
-
(***) Prompt protocol and decoding policy. (a) Define the relevant evaluation object. (b) Write the estimator in LaTeX notation. (c) Give one example where the estimator is reliable. (d) Give one example where the same number would be misleading. (e) Describe what the theory notebook should verify computationally.
-
(***) Scorer, metric, and aggregate estimate. (a) Define the relevant evaluation object. (b) Write the estimator in LaTeX notation. (c) Give one example where the estimator is reliable. (d) Give one example where the same number would be misleading. (e) Describe what the theory notebook should verify computationally.
-
(***) Confidence interval and leaderboard rank. (a) Define the relevant evaluation object. (b) Write the estimator in LaTeX notation. (c) Give one example where the estimator is reliable. (d) Give one example where the same number would be misleading. (e) Describe what the theory notebook should verify computationally.
9. Why This Matters for AI
| Concept | AI Impact |
|---|---|
| Protocol as measurement | Prevents hidden prompt or grader changes from masquerading as model progress |
| Uncertainty intervals | Keeps model rankings honest when differences are smaller than sampling noise |
| Slice metrics | Reveals failures on languages, domains, formats, or user groups hidden by averages |
| Calibration | Lets systems decide when to answer, abstain, ask for help, or escalate |
| Robustness | Tests whether behavior survives realistic perturbations and distribution shift |
| Ablations | Separates real improvements from accidental metric movement |
| Online tests | Measures causal user impact rather than offline proxy success |
| Audit trails | Turns evaluation from a screenshot into reproducible scientific evidence |
10. Conceptual Bridge
This section sits after the training-data pipeline because evaluation depends on clean holdouts, contamination audits, and well-documented data provenance. It does not repeat those pipeline mechanics; it consumes their outputs as the basis for credible measurement.
It also sits before alignment and production chapters. Alignment asks how to shape model behavior with supervised data, preferences, policies, and feedback. Production MLOps asks how deployed systems are observed and maintained over time. Capability Benchmarks supplies the measurement discipline both chapters need.
The recurring mathematical pattern is empirical risk with uncertainty. Whether the object is a benchmark item, a calibrated probability, a shifted subgroup, an ablation comparison, or an online treatment effect, the learner should ask: what distribution generated this evidence, what estimator did we compute, and what decision is justified by the uncertainty?
16 Data Pipeline
-> clean eval data, manifests, decontamination
17 Evaluation and Reliability
-> benchmarks, calibration, robustness, ablations, online tests
18 Alignment and Safety
-> SFT, preferences, policies, human feedback
19 Production ML and MLOps
-> monitoring, serving, retraining, observability
References
- MMLU: Measuring Massive Multitask Language Understanding
- HELM: Holistic Evaluation of Language Models
- HumanEval: Evaluating Large Language Models Trained on Code
- BIG-bench
- Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
- OpenAI Evals
- EleutherAI lm-evaluation-harness
- Concentration Inequalities
- Estimation Theory
- Hypothesis Testing