Generating and displaying diagrams in Mermaid, SVG, or CSS has become one of my go-to tests for reasoning. This seems fair because while SVG is admittedly syntactically difficult and maybe not emphasized in training, CSS is certainly a popular output target, and Mermaid is very simple. It seems like a SOTA model should be able to draw and modify things that it "understands".
I'm much more interested in stuff like Venn diagrams and bipartite graphs than pictures of cats or pelicans riding bikes. It's similar to a code-generation problem in that the output is a new artifact that's one step away from the problem presentation, but it has the advantage that it's simpler than code, is less likely to have exact-match training data, usually has one correct answer, and is easy to check. Try making Venn diagrams on a few circles with "exactly and only the following intersections" and gradually elaborating the spec.
This is a great way to get starter diagram boilerplate if that's what you're looking for. One-shot prompts for simple things are OK, sometimes. But it always completely falls apart when you try to iterate with small modifications: the model introduces errors in parts that were previously correct or ignores requested changes. Maybe it's wrong to conclude anything from that, but to me this looks bad for the "they can reason!" argument and very bad for trusting complicated work in other domains that are harder to check. Haven't read TFA yet, but whether it confirms or denies my gut here, hopefully it's going to add some perspective.
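For anyone who wants to try the Venn test, here is a rough sketch of the "easy to check" part. It assumes you have already pulled the circles out of the generated SVG as (cx, cy, r) tuples, which is a separate step not shown here. It also assumes one particular reading of "exactly and only the following intersections": an overlap counts as present when some sampled point lies inside exactly those circles and no others. The function names are made up for illustration, and grid sampling is only approximate, so very thin slivers can be missed.

```python
def observed_overlaps(circles, steps=200):
    """Approximate which overlap regions exist by sampling a grid.

    circles: dict mapping a label to (cx, cy, r).
    Returns a set of frozensets; each one names the circles that
    together cover at least one sampled point (two or more circles).
    """
    x_min = min(cx - r for cx, cy, r in circles.values())
    x_max = max(cx + r for cx, cy, r in circles.values())
    y_min = min(cy - r for cx, cy, r in circles.values())
    y_max = max(cy + r for cx, cy, r in circles.values())

    found = set()
    for i in range(steps + 1):
        x = x_min + (x_max - x_min) * i / steps
        for j in range(steps + 1):
            y = y_min + (y_max - y_min) * j / steps
            # Labels of every circle that strictly contains this point.
            inside = frozenset(
                label
                for label, (cx, cy, r) in circles.items()
                if (x - cx) ** 2 + (y - cy) ** 2 < r * r
            )
            if len(inside) >= 2:
                found.add(inside)
    return found


def check_venn(circles, required):
    """Compare the sampled overlaps against the requested ones.

    required: iterable of label groups, e.g. [("A", "B"), ("B", "C")].
    Returns the overlaps that are missing and the ones that were not asked for.
    """
    want = {frozenset(group) for group in required}
    got = observed_overlaps(circles)
    return {"missing": want - got, "unexpected": got - want}
```

Calling `check_venn({"A": (0, 0, 1), "B": (1.5, 0, 1), "C": (3, 0, 1)}, [("A", "B"), ("B", "C")])` on a simple chain of three circles should come back with both sets empty. Re-running the same check after each small modification is a cheap way to catch parts that used to be correct and silently broke.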