AI expert Andrej Karpathy, one of the founding members of OpenAI along with Elon Musk, conducted tests on the latter’s newly launched Grok 3. Sharing a detailed analysis of the results, Karpathy noted that the new model looks “quite encouraging indeed”.
Andrej Karpathy conducted various tests on Grok 3, the new AI model launched by Elon Musk’s xAI. (karpathy.ai)
Here is a list of the tests Karpathy conducted.
Pelican on a bicycle
Karpathy asked Grok to generate a scalable vector graphic (SVG) showing a pelican riding a bicycle. SVG is a web-friendly file format that uses mathematical formulas to store images.
He marked Grok 3 as a “fail” on this test and said the AI model’s results show that “pelicans are quite good but still a bit broken”. Karpathy said Claude’s results in the test are the best, but he suspects that is because Claude likely specifically targeted SVG capability during training.
Results of the ‘Draw an SVG of a pelican riding a bicycle’ test from various AI models. (X/@karpathy)
Sharing why the test is important, Karpathy said it stresses the LLMs’ ability to lay out many elements on a 2D grid, which is very difficult because LLMs can’t see like people do. “So it’s arranging things in the dark, in text,” he said.
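To see what “arranging things in text” means in practice, here is a minimal Python sketch that builds a tiny SVG by hand. It is only an illustration, not Karpathy’s prompt or any model’s output: the helper names `wheel` and `sketch_svg` and all coordinates are made up for this example. Every shape is placed purely by typing numbers, with no way to look at the result while writing it.

```python
def wheel(cx, cy, r=40):
    # One bicycle wheel: a circle described only by its centre and radius.
    return f'<circle cx="{cx}" cy="{cy}" r="{r}" fill="none" stroke="black"/>'


def sketch_svg():
    # Hypothetical layout: two wheels joined by a straight bar.
    # Every position is a hand-picked coordinate on a 300x200 canvas.
    parts = [
        '<svg xmlns="http://www.w3.org/2000/svg" width="300" height="200">',
        wheel(80, 150),
        wheel(220, 150),
        '<line x1="80" y1="150" x2="220" y2="150" stroke="black"/>',
        '</svg>',
    ]
    return "\n".join(parts)


svg_text = sketch_svg()
print(svg_text)
```

Saving the printed text to a file ending in `.svg` and opening it in a browser shows the drawing. A pelican on top would need many more shapes, each with coordinates that have to line up with the rest.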
Sense of humour
He concluded that Grok 3’s sense of humour has not improved over its predecessor Grok 2. “This is a common LLM issue with humour capability and general mode collapse. Famously, for example, 90% of 1,008 outputs asking ChatGPT for a joke were repetitions of the same 25 jokes,” Karpathy noted.
“Even when prompted in more detail away from simple pun territory (for example: give me a standup), I’m not sure that it is state of the art humor. Example generated joke: ‘Why did the chicken join a band? Because it had the drumsticks and wanted to be a cluck-star!’ In quick testing, thinking did not help, possibly it made it a bit worse,” he said.
Ethics
Karpathy said Grok 3 seems to be “a bit too overly sensitive to ‘complex ethical issues’”. Sharing an example, he said, “Generated a one-page essay basically refusing to answer whether it might be ethically justifiable to misgender someone if it meant saving one million people from dying.”
Random ‘gotcha’ moments
He said that Musk’s new model knows there are three ‘r’s in ‘strawberry’ but told him that there are only three ‘l’s in ‘lollapalooza’ (the word has four). However, he noted that turning on the ‘Thinking’ mode fixes this.
He also noted that the model answered that 9.11 is greater than 9.9, a problem common with other LLMs too. This problem was also solved in the ‘Thinking’ mode.
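Both ‘gotcha’ questions have answers that are easy to check mechanically. The short Python sketch below is an illustration added here, not part of Karpathy’s tests: it counts the letters directly and compares the two numbers as decimals. The variable names are made up for this example.

```python
from decimal import Decimal

# Count letters by plain string matching.
r_count = "strawberry".count("r")
l_count = "lollapalooza".count("l")

# Compare the two values as exact decimal numbers.
nine_eleven_is_greater = Decimal("9.11") > Decimal("9.9")

print("r in strawberry:", r_count)
print("l in lollapalooza:", l_count)
print("9.11 > 9.9:", nine_eleven_is_greater)
```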
Other tests done on Grok 3
According to Karpathy, Grok 3 was unable to solve his ‘emoji mystery’ question, where he gave a smiling face with an attached message hidden inside Unicode variation selectors.
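Variation selectors are invisible Unicode code points (U+FE00 to U+FE0F and U+E0100 to U+E01EF, 256 in total) that normally only tweak how the preceding character is displayed. Because there are exactly 256 of them, each one can stand in for one byte. The article does not say how Karpathy’s message was encoded, so the Python sketch below is just one possible scheme, with hypothetical function names, showing how text could ride along unseen behind an emoji.

```python
def byte_to_selector(b):
    # Assumed mapping: bytes 0-15 use U+FE00..U+FE0F,
    # bytes 16-255 use U+E0100..U+E01EF.
    if b < 16:
        return chr(0xFE00 + b)
    return chr(0xE0100 + (b - 16))


def selector_to_byte(ch):
    # Reverse the mapping; return None for ordinary characters.
    cp = ord(ch)
    if 0xFE00 <= cp <= 0xFE0F:
        return cp - 0xFE00
    if 0xE0100 <= cp <= 0xE01EF:
        return cp - 0xE0100 + 16
    return None


def hide(base, message):
    # Append one invisible selector per UTF-8 byte of the message.
    return base + "".join(byte_to_selector(b) for b in message.encode("utf-8"))


def reveal(text):
    # Collect every selector in the text and decode the bytes back to a string.
    found = (selector_to_byte(ch) for ch in text)
    return bytes(b for b in found if b is not None).decode("utf-8")


carrier = hide("\U0001F600", "hello")
print(len(carrier), reveal(carrier))
```

On screen, `carrier` typically looks like a lone smiling face, yet it carries five extra code points. A model asked to solve the puzzle has to notice those code points and work out the byte mapping on its own.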
Grok 3, like OpenAI’s o1 pro, was unable to generate three “tricky” tic tac toe boards. Karpathy said Grok 3 generated “nonsense boards/texts” in response to the question but was able to solve a few tic tac toe boards he gave it.