Not quite what you suggested, but I did some experiments several months ago "enhancing" the samples in tracker music with some models, and they sounded terrible. There really is something about the sound of tracker files that's just right. But sure, you could generate lo-fi samples; there are a lot of computer-generated samples in music, but putting them together into a pleasing combination is the hard bit.
If you're just having fun with it, there are a whole bunch of other things that produce interesting options, like asking it to theme according to a movie (think Clockwork Orange, Backrooms, anything with a strong aesthetic), or throw screenshots and photos at it and use it as a "design system" (magazine/print layouts can work well with this on stronger models).
As a few people have asked for screenshots, I spun it up. Here's a video of the basic gameplay: https://peterc.org/misc/fpscob.mp4 .. it's clunky, but it does play.
I'm not getting anywhere near the speeds advertised on my 3090 Ti, alas, but it's fun watching it "fill out" its answers. I did Simon's "SVG pelican on a bicycle" test on it and the result was quite minimalistic but fit the brief: https://gist.github.com/peterc/7672e74ec1437945e5fca5ce2c1c9... -- this was on the Q4 quant running on patched llama.cpp. I will be interested to see if Simon's looks much different.
Yeah, the patched llama.cpp. The reason is I saw that using the Q4 quant on vLLM is discouraged and the int8 won't fit on my 3090 Ti, but I could certainly give it a go. I also skipped Transformers as it needs to download the full weights and quantize them locally and I didn't fancy waiting for a 50GB download.
Stuttering John used to do this back on Howard Stern by asking celebrities questions that were far out of the expected gamut at red carpet events. This was all for shock/comedy value, but "who are you and what makes you famous" type questions can really throw celebs off script: https://www.youtube.com/watch?v=8P0hENpnMXk
I'm not the OP and I imagine all cases are different, but my dad was a software developer who had early cognitive decline in his 60s (he died of vascular dementia recently) and he used to talk about it a lot. He said it was like his tolerance for complexity kept closing in.
Where he could once hold an entire system and its details in his head (almost an essential skill in the 80s/90s), he could now only focus on smaller pieces at a time. He was fascinated to hear about any new tooling or approaches that came along, but no longer felt able to pick them up. He could still solve algorithmic problems and debug "in the small", but it was like he had to do math on a Post-it note where once he had a huge sheet of paper.
Its image processing is terrible. I ran several tests comparing it against Qwen 3.5 0.8b (yes, 7% the size) and Qwen beat it every time, with Gemma often getting things entirely wrong. I even gave it a plain image saying "This is a test" and it thought for 6 minutes trying to analyze it and failed. Qwen 3.5 0.8b confidently got it in under a second.
It may be that the Q6 quant I got is borked (or my LM Studio is), but either way, the 0.8b's performance is mind boggling in comparison.
For Qwen 3.5 0.8B presumably you're running it unquantized, because it's so small. Get at least the Q8 of Gemma 4 12B with the F32 mmproj and use an f16 kv cache.
Then run it with the latest llama.cpp that contains the Gemma 4 12B unified bug fixes, using --image-min-tokens 560 --image-max-tokens 2240 --batch-size 4096 --ubatch-size 4096 --temp 1.0 --top-p 0.95 --top-k 64 --jinja
It's understanding far more complex things for me and can reliably handle tiny text, so it should easily understand an image that only contains the text "This is a test".
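For anyone who wants to try those settings, here is a minimal sketch that assembles them into one command line. The sampling and image-token flags are copied from the comment above; everything else is an assumption for illustration: the `llama-server` binary name, the `-m` / `--mmproj` / `--cache-type-k` / `--cache-type-v` options (check them against your llama.cpp build), and the two GGUF file names, which are placeholders.

```python
import shlex


def build_llama_command(model_path, mmproj_path, binary="llama-server"):
    """Return an argument list for a llama.cpp run with the settings above.

    The binary name, -m/--mmproj and the --cache-type-* options are
    assumptions; the remaining flags are taken verbatim from the comment.
    """
    return [
        binary,
        "-m", model_path,            # assumed: main model GGUF (e.g. a Q8 quant)
        "--mmproj", mmproj_path,     # assumed: vision projector GGUF (e.g. F32)
        "--cache-type-k", "f16",     # assumed spelling of the f16 KV cache setting
        "--cache-type-v", "f16",
        "--image-min-tokens", "560",
        "--image-max-tokens", "2240",
        "--batch-size", "4096",
        "--ubatch-size", "4096",
        "--temp", "1.0",
        "--top-p", "0.95",
        "--top-k", "64",
        "--jinja",
    ]


if __name__ == "__main__":
    # Placeholder file names; substitute the files you actually downloaded.
    cmd = build_llama_command("gemma-Q8_0.gguf", "mmproj-F32.gguf")
    print(shlex.join(cmd))
```

Running it only prints the command, so you can eyeball it before pasting it into a shell.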
I guess Google implements more / stronger guard rails than Alibaba and thus confuses these small models. At least this was my impression with Gemma3 models where it often said that the image contains some nudity / sex scenes and therefore it cannot give a description of the image. Never understood the point of this behavior....
The biggest problem with all the Google models has always been RLHF, particularly safety training. They take a good, smart model and make it behave like a corporate person who has been to far too many forced anti-{sexism, racism...} seminars, so that it now lives in fear of saying something that could be construed as wrong by some moral standard.
The Qwen series adopted vision wayyy earlier than anyone else. No idea why the other labs were sleeping on it but they had about 2 years of experimentation without any competition.