If there's Intelligent Life out There

Created Mar 05, 2025 by Aliza Flinn (@alizaflinn3888), Maintainer


Optimizing LLMs to be good at specific tests backfires on Meta and Stability.



Hugging Face has released its second LLM leaderboard to rank the best language models it has tested. The new leaderboard aims to be a more challenging, uniform standard for testing open large language model (LLM) performance across a variety of tasks. Alibaba's Qwen models dominate the leaderboard's inaugural rankings, taking three spots in the top 10.

Pumped to announce the brand new open LLM leaderboard. We burned 300 H100 to re-run new evaluations like MMLU-Pro for all major open LLMs! Some learnings: Qwen 72B is the king and Chinese open models are dominating overall; previous evaluations have become too easy for recent ... June 26, 2024

Hugging Face's second leaderboard tests language models across four areas: knowledge testing, reasoning over extremely long contexts, complex math abilities, and instruction following. Six benchmarks are used to test these qualities, with tasks including solving 1,000-word murder mysteries, explaining PhD-level questions in layperson's terms, and, most daunting of all, high-school math equations. A full breakdown of the benchmarks used can be found on Hugging Face's blog.
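
For readers who want to see how such scores are produced, here is a minimal sketch of running one of the new leaderboard's benchmarks locally with EleutherAI's lm-evaluation-harness, the open-source framework the leaderboard's evaluations build on. The model id and task name are illustrative assumptions; check the task list shipped with your installed harness version for the exact identifiers.

```python
# Minimal sketch: score an open model on one leaderboard v2 benchmark using
# EleutherAI's lm-evaluation-harness (pip install lm-eval).
# ASSUMPTIONS: the example model id and the task name "leaderboard_mmlu_pro"
# match your installed harness version; a GPU large enough for the model.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                      # a Hugging Face transformers model
    model_args="pretrained=Qwen/Qwen2-7B-Instruct",  # example model, not a recommendation
    tasks=["leaderboard_mmlu_pro"],                  # MMLU-Pro, one of the six new benchmarks
    batch_size=8,
)

# Per-task metrics (accuracy, etc.) are returned under the "results" key.
print(results["results"])
```

Running the full six-benchmark suite this way is expensive, which is why Hugging Face's 300-GPU cluster matters; a single benchmark on a 7B-class model is a more realistic local experiment.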

The frontrunner of the new leaderboard is Qwen, Alibaba's LLM, which takes 1st, 3rd, and 10th place with its handful of variants. Also appearing are Llama3-70B, Meta's LLM, and a handful of smaller open-source projects that managed to outperform the pack. Notably absent is any sign of ChatGPT; Hugging Face's leaderboard does not test closed-source models, to ensure reproducibility of results.

Tests to qualify for the leaderboard are run exclusively on Hugging Face's own computers, which, according to CEO Clem Delangue's Twitter, are powered by 300 Nvidia H100 GPUs. Because of Hugging Face's open-source and collaborative nature, anyone is free to submit new models for testing and admission to the leaderboard, with a new voting system prioritizing popular new entries for testing. The leaderboard can be filtered to show only a highlighted array of significant models, to avoid a confusing glut of small LLMs.
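
Since the results themselves are open, offline analysis is also possible. The sketch below assumes the leaderboard publishes its aggregated results table as a dataset on the Hugging Face Hub; the repo id used here is an assumption, so verify the current name on the leaderboard page before running it.

```python
# Sketch: pull the leaderboard's published results table for offline filtering.
# ASSUMPTION: the dataset repo id below is illustrative; check the leaderboard
# page for the actual id, since these tables may be renamed or versioned.
from datasets import load_dataset

table = load_dataset("open-llm-leaderboard/contents", split="train")
df = table.to_pandas()

# Inspect the available columns, then filter or sort as needed -- for example,
# to reproduce the web UI's "significant models only" view by hand.
print(df.columns.tolist())
print(df.head(10))
```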

As a pillar of the LLM space, Hugging Face has become a trusted source for LLM learning and community collaboration. After its first leaderboard launched in 2023 as a way to compare and reproduce testing results from various established LLMs, the board quickly exploded in popularity. Getting high ranks on the board became the goal of many developers, small and large, and as models have become generally stronger, 'smarter,' and optimized for the specific tests of the first leaderboard, its results have become less and less meaningful, hence the creation of a second version.

Some LLMs, including newer variants of Meta's Llama, severely underperformed in the new leaderboard compared to their high marks in the first. This stemmed from a trend of over-training LLMs only on the first leaderboard's benchmarks, leading to regressions in real-world performance. This performance regression, driven by hyperspecific and self-referential data, follows a trend of AI performance growing worse over time, proving once again, as Google's AI answers have shown, that LLM performance is only as good as its training data and that true artificial "intelligence" is still many, many years away.


Dallin Grimm is a contributing writer for Tom's Hardware. He has been building and breaking computers since 2017, serving as the resident youngster at Tom's. From APUs to RGB, Dallin has a handle on all the latest tech news.


- bit_user, quoting the article: "LLM performance is only as good as its training data and that true artificial 'intelligence' is still many, many years away." First, this statement discounts the role of network architecture.

  The definition of "intelligence" cannot be whether something processes information exactly like humans do, or else the search for extraterrestrial intelligence would be entirely futile. If there's intelligent life out there, it probably doesn't think quite like we do. Machines that act and behave intelligently also needn't necessarily do so, either.

- jp7189: I don't love the click-bait China vs. the world title. The truth is Qwen is open source, open weights, and can be run anywhere. It can be (and already has been) fine-tuned to add or remove bias. I applaud Hugging Face's work to create standardized tests for LLMs, and for putting the focus on open source, open weights first.

- jp7189, quoting bit_user: "First, this statement discounts the role of network architecture. Second, intelligence isn't a binary thing - it's more like a spectrum. There are various classes of cognitive tasks and abilities you may be familiar with, if you study child development or animal intelligence. The definition of 'intelligence' cannot be whether something processes information exactly like humans do, otherwise the search for extraterrestrial intelligence would be entirely futile. If there's intelligent life out there, it probably doesn't think quite like we do. Machines that act and behave intelligently likewise needn't necessarily do so, either."

  We're building tools to assist humans, therefore I would argue LLMs are more useful if we grade them by human intelligence standards.
