Few-shot prompting can improve classification accuracy, but past a certain point it often starts to degrade it. The main reasons:
Examples influence the model probabilistically, not deterministically.
-
Example distribution bias
The model starts overfitting to the examples instead of the actual task.
Example:
Positive → enthusiastic language
Negative → angry language
The model may misclassify a calm positive or a calm negative input, because the examples unintentionally taught emotional tone instead of sentiment criteria.
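One rough way to spot this kind of bias before sending the prompt is to check whether every example for a label is emotionally intense. The sketch below is only an illustration: the example texts are made up, and `intensity` is an assumed heuristic (exclamation marks plus all-caps words), not a real tone detector.

```python
import re

# Hypothetical few-shot examples as (text, label) pairs; not from a real dataset.
examples = [
    ("I LOVE this phone!!! Best purchase ever!", "positive"),
    ("Absolutely amazing, wow!!", "positive"),
    ("This is GARBAGE!!! I want a refund!", "negative"),
    ("Worst service ever, I'm furious!!", "negative"),
]

def intensity(text: str) -> int:
    """Crude tone proxy (assumption): exclamation marks plus all-caps words."""
    caps_words = re.findall(r"\b[A-Z]{2,}\b", text)
    return text.count("!") + len(caps_words)

def calm_labels(examples, threshold=1):
    """Labels that have at least one low-intensity example."""
    return {label for text, label in examples if intensity(text) <= threshold}

all_labels = {label for _, label in examples}
# Labels whose examples are all emotionally loud.
missing = all_labels - calm_labels(examples)
print(sorted(missing))
```

If a label shows up in `missing`, adding a neutral-toned example for it (for instance a plain "Works as described." for the positive class) may help the model separate sentiment from tone.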
-
Spurious pattern learning
LLMs pick up accidental correlations.
Example:
Class A examples are short
Class B examples are long
The model may now classify by length instead of semantics.
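The length shortcut is easy to check for mechanically. The following is a minimal sketch, assuming the few-shot examples are available as (text, label) pairs; the example texts and the helper names `length_ranges` and `length_separates` are hypothetical.

```python
# Hypothetical few-shot examples as (text, label) pairs.
examples = [
    ("Refund please.", "A"),
    ("Card declined.", "A"),
    ("I have been trying to reset my password since yesterday and the link never arrives.", "B"),
    ("The app crashes every time I open the settings page on my tablet after the update.", "B"),
]

def length_ranges(examples):
    """Map each label to the (min, max) word count of its examples."""
    ranges = {}
    for text, label in examples:
        n = len(text.split())
        lo, hi = ranges.get(label, (n, n))
        ranges[label] = (min(lo, n), max(hi, n))
    return ranges

def length_separates(examples):
    """True if the word-count ranges of the labels do not overlap at all."""
    spans = sorted(length_ranges(examples).values())
    return all(prev[1] < cur[0] for prev, cur in zip(spans, spans[1:]))

if length_separates(examples):
    print("Warning: length alone separates the classes in these examples.")
```

When the check fires, mixing short and long examples within each class removes the shortcut, so the model has to rely on content.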
-
Context crowding
Each additional example leaves the model with less attention budget for the actual input.
Eventually: