Soaring to New Heights: The Impact of Python and ...
Anyone evaluating a code LLM on HumanEval, MBPP, or DebugBench can hit all three. It takes 3 seconds and exits with code 1 on critical failures so you can block an eval run in CI before wasting ...
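The exit-code behavior described above can be sketched roughly as follows. This is a minimal illustration, not the tool's actual implementation: the excerpt does not name the three problems it detects, so the two checks here (`check_duplicate_ids`, `check_missing_tests`), the `Finding` type, the `run_preflight` function, and the `task_id`/`test` field names are all hypothetical stand-ins.

```python
import sys
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, List


@dataclass
class Finding:
    check: str       # name of the check that produced this finding
    critical: bool   # critical findings should block the eval run
    message: str


# Hypothetical check: the same task_id appearing twice in a benchmark file.
def check_duplicate_ids(tasks: List[dict]) -> Iterator[Finding]:
    seen = set()
    for task in tasks:
        task_id = task.get("task_id")
        if task_id in seen:
            yield Finding("duplicate_ids", True, f"duplicate task_id: {task_id}")
        seen.add(task_id)


# Hypothetical check: a task with no test code (reported as a warning only).
def check_missing_tests(tasks: List[dict]) -> Iterator[Finding]:
    for task in tasks:
        if not task.get("test"):
            yield Finding("missing_tests", False, f"no tests for {task.get('task_id')}")


Check = Callable[[List[dict]], Iterable[Finding]]


def run_preflight(tasks: List[dict], checks: List[Check]) -> int:
    """Run every check and return 1 if any finding is critical, else 0."""
    findings = [f for check in checks for f in check(tasks)]
    for f in findings:
        level = "CRITICAL" if f.critical else "warning"
        print(f"[{level}] {f.check}: {f.message}", file=sys.stderr)
    return 1 if any(f.critical for f in findings) else 0
```

A CI step would pass the return value to `sys.exit(...)`, so a critical finding yields a non-zero exit status and the pipeline stops before the eval job starts. Warnings are printed but leave the exit code at 0.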