“We are finding that computation in cloud data centres is now constrained by cooling requirements, and Google and Amazon are offloading computation to FPGAs to reduce power consumption so they can pack in more compute power,” said Bryan Donoghue, digital system lead at Cambridge Consultants, which is part of the Altran group.
“We have had a number of projects exploring flexible systems with power efficiency approaching that of dedicated hardware,” said Donoghue. “You can build a 16x16 multiplier in 5000 gates, a Teak-lite II DSP takes 100K gates, and ARM’s Cortex R7 core takes 1.3M gates. So I want to add more flexibility but not go all the way to a CPU.”
So the team has developed a very long instruction word (VLIW) approach that couples dedicated modules, such as a MAC, ALU or FFT block, through a programmable multiplexer.
The key is that the VLIW can be 100 or 200 bits long, with dozens of mini op codes that control the modules, the memory interface and the routing through the multiplexers, providing a dynamic data path on a cycle-by-cycle basis.
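The idea of one wide control word being sliced into per-module fields every cycle can be sketched as follows. The field names, widths and layout here are illustrative assumptions for a minimal model, not the actual instruction format of the Cambridge Consultants design:

```python
# Hypothetical VLIW control-word layout: each field drives one module
# or the routing multiplexers for a single cycle. Widths are invented
# for illustration only.
FIELDS = [             # (name, width in bits), packed LSB-first
    ("mac_op",    4),  # opcode for the MAC module
    ("alu_op",    4),  # opcode for the ALU module
    ("fft_op",    4),  # opcode for the FFT module
    ("mem_ctrl",  8),  # memory-interface control bits
    ("mux_route", 12), # multiplexer routing: re-wires the datapath
]

def decode(word: int) -> dict:
    """Split one VLIW control word into its named fields."""
    out = {}
    pos = 0
    for name, width in FIELDS:
        out[name] = (word >> pos) & ((1 << width) - 1)
        pos += width
    return out

# A new word is issued each cycle, so the datapath can be rewired
# cycle by cycle simply by changing the mux_route field.
word = 0b101010101010_00001111_0010_0001_0011
print(decode(word))
```

The appeal of this scheme is that decode is almost free in hardware: each field fans out directly to its module with no instruction-decode logic in between, which is where the low control overhead comes from.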
This gives a lot of flexibility in the development stage. “When you are coding the algorithm and need more modules, you can add those in and experiment,” he said.
The end result is a two-stage pipeline with very low control and datapath overhead. The design philosophy is very different from that of a CPU: rather than running as fast as possible, a 40nm design runs at 100MHz so that the core works at the same speed as the memory. That eliminates wait states and lets the team use a low-power library and optimise it even further.