
12/22/2023: Anyscale's Benchmark Criticisms

Anyscale launches LLMPerf leaderboard to benchmark LLM inference performance, drawing criticism for omitting cost-per-token metrics and failing to account for batching when comparing public endpoints.

Dec 23 · 02:16:52 · primary fetch · 1 source · updated Dec 23 · 02:16:52

Anyscale launched its LLMPerf leaderboard to benchmark large language model inference performance, but it faced criticism for lacking detailed metrics such as cost per token and throughput, and for comparing public LLM endpoints without accounting for batching and load.

In OpenAI Discord discussions, users reported issues with Bard and preferred Microsoft Copilot for storytelling, noting fewer hallucinations. There was debate over the value of upgrading from GPT-3.5 to GPT-4, with many finding paid AI models worthwhile for coding productivity.

Bugs and performance issues with OpenAI APIs were also highlighted, including slow responses and message limits. Users discussed future AI developments such as GPT-6, along with concerns about OpenAI's transparency and profitability. Prompt engineering for image generation was another active topic, with users emphasizing clear positive prompts and expressing a desire for negative prompts.


§ sources · 1 publication · timeline below

news.smol.ai · 12/22/2023: Anyscale's Benchmark Criticisms · primary · 02:16:52