You might want to check out the "Warp Processing" project: http://www.cs.ucr.edu/~vahid/warp/. It is probably exactly what you are thinking about: transparent analysis of the instruction stream at runtime, followed by synthesis of the hot spots and offloading them to the FPGA.
Why is that surprising? A lot of the statements in a typical program can be executed in parallel. It's just not worth creating threads for them, since the overhead of a thread is larger than the cost of simply executing those instructions sequentially. In fact, modern superscalar processors already do this in hardware: they track the data dependencies between instructions and execute the independent ones in parallel where possible.
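To make the data-dependency point concrete, here is a rough Python sketch. It is only a toy model, not how any real CPU is implemented: the `program` format and the `schedule` function are made up for illustration, and it assumes unlimited execution units, one step per statement, and only true (read-after-write) dependencies.

```python
# Each statement is (destination, [source variables]).
# This is a hypothetical toy format, not a real instruction set.
program = [
    ("a", ["x", "y"]),  # a = x + y
    ("b", ["x", "z"]),  # b = x * z   (independent of a)
    ("c", ["a", "b"]),  # c = a - b   (needs both a and b)
    ("d", ["y", "z"]),  # d = y + z   (independent of a, b and c)
]

def schedule(statements):
    """Return the earliest step at which each statement could run,
    given that a statement must wait for the values it reads."""
    ready_at = {}  # variable -> step in which its value is produced
    steps = []
    for dest, sources in statements:
        # Inputs the program never writes (x, y, z) are available at step 0.
        step = 1 + max((ready_at.get(s, 0) for s in sources), default=0)
        ready_at[dest] = step
        steps.append(step)
    return steps

steps = schedule(program)
print(steps)       # [1, 1, 2, 1]
print(max(steps))  # 2 steps instead of 4 when run strictly one after another
```

Three of the four statements could run at the same time, and only `c` has to wait. Spawning a thread for each statement would cost far more than the single addition it performs, which is why this kind of parallelism is left to the hardware rather than to threads.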