As the article says, it's largely a matter of lock-free structures and of not crossing the kernel/userspace boundary. Cache locality (combined with DirectIO, which somebody else mentioned) also helps a lot.
With a userspace stack, you also get lots of tuning capability that the kernel doesn't expose. One cool tunable is how long to busy-wait before giving up and sleeping until an interrupt arrives.
I'm not sure. My colleague explained to me that the main gain came from avoiding a malloc in kernel space and a data copy. But it's also true that doing many small reads instead of one large block implies many system calls. I don't know the real relative processing cost of these operations.
For what I have seen, most applications using DPDK do not require TCP/IP handling. For example, packet monitoring, data transfers between nodes that are just connected by a single cable (thus no need for routing or TCP arrival guarantees)...
But even if you have to reimplement the TCP/IP stack, it would probably still be faster: you don't make system calls (which are expensive), you don't copy every packet into and out of socket buffers, and you don't have to decide which application should receive each packet... It's hard, but it can yield real performance improvements.
On the other hand, I'd try other features for improved performance first, such as pinning applications to the same cores that handle the NIC queues, configuring the NIC RX queues, and using jumbo frames (>1514 bytes) if you control the network path. All of these can yield noticeable improvements without much effort.
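A rough sketch of those knobs on Linux, assuming an interface named eth0 and root access (the interface name, queue count, and ring size are placeholders to adjust for your hardware):

```shell
# Jumbo frames: every hop on the path must support the larger MTU.
ip link set dev eth0 mtu 9000

# RX queue configuration: set the combined channel count,
# then enlarge the RX ring (limits vary per NIC; see ethtool -l / -g).
ethtool -L eth0 combined 4
ethtool -G eth0 rx 4096
```

Core pinning can then be done with `taskset` or IRQ affinity so the application runs on the cores servicing those queues.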
I haven't yet run into a case where it makes sense to do it, but you can gain some efficiency because you avoid context switches between the kernel and the program.