A crafty data engineer pitted 14 AI models against each other in Street Fighter III matches to see which model performs best.
Some weeks ago, French coder Stan Girard introduced LLM Colosseum, an open-source test application that lets users evaluate and rank Large Language Models (LLMs). In his initial test, Girard pitted OpenAI models against Mistral AI models to see which performed best. A video showing off this benchmark tool for LLMs is included below (courtesy of tech enthusiast Matthew Berman):
Following this initial benchmark, Amazon engineer Banjo Obayomi pitted 14 LLMs against one another across 314 Street Fighter III matches using Amazon's generative AI service, Amazon Bedrock. To run the benchmark, Obayomi used Girard's open-source application alongside an emulator running Capcom's 1997 arcade (and later Dreamcast) Street Fighter III, powered by the DIAMBRA AI dueling arena. At the start of each match, two LLMs are selected at random to control the iconic Ken. LLM Colosseum then gathers game state data, asks each LLM for its next move, and executes the chosen moves within the emulator.
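The match loop described above can be sketched roughly as follows. This is a minimal illustration only: the function names, move list, and match structure are hypothetical stand-ins, not LLM Colosseum's actual API, and the LLM call is stubbed out with a random choice rather than a real Amazon Bedrock request.

```python
import random

# Hypothetical move set; the real project defines its own action space.
VALID_MOVES = ["Move Closer", "Move Away", "Fireball", "Megapunch", "Hurricane"]

def ask_llm_for_move(model_name, game_state):
    """Stand-in for querying an LLM (e.g. via Amazon Bedrock) with the
    current game state. Here we simply pick a random legal move."""
    return random.choice(VALID_MOVES)

def run_match(model_a, model_b, rounds=5):
    """Alternate turns: gather state, ask each model for a move, execute it."""
    log = []
    state = {"p1_health": 100, "p2_health": 100}  # simplified game state
    for _ in range(rounds):
        for player, model in (("p1", model_a), ("p2", model_b)):
            move = ask_llm_for_move(model, state)
            if move not in VALID_MOVES:   # models sometimes invent impossible actions
                move = "Move Closer"      # fall back to a safe default
            log.append((player, model, move))
    return log

log = run_match("model-a", "model-b")
```

The fallback branch mirrors one of the benchmark's findings: models occasionally request actions that don't exist in the game, so any real harness needs to validate moves before sending them to the emulator.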
Looking at the test results that Obayomi posted, the smaller LLMs outperformed the larger models, likely due to their lower latency, with Anthropic's Claude models topping the performance charts. The benchmark also produced some interesting findings, including instances where models tried to apply their knowledge to perform impossible actions, such as the "Hardest hitting combo of all". Additionally, each model appeared to develop its own playstyle during the benchmark, with some taking a defensive approach while others opted for an aggressive route. Some models even refused to battle, saying "I apologize, upon reflection I do not feel comfortable recommending violent actions or strategies, even in a fictional context."
Read more on wccftech.com