At 100MHz this can give 1GMAC/s of performance. In many systems the algorithms parallelise well, so multiple cores can be combined to reach tens of GMAC/s while consuming tens of milliwatts of power.
As compilers aren’t very efficient with VLIW code, the core is programmed in assembler and the team has also built a set of tools to support the development.
“We have a toolset that helps us build these cores and have a big library of these, mix and match the modules and that squirts out the Verilog. We code in assembler rather than C or CUDA – but the competition is Verilog and it’s a lot easier to program in assembler.”
A graphical simulator called Sapphyre can be configured with the chosen modules, allowing developers to choose the data path. It is bit- and cycle-accurate, which is important for delivering the required performance, and it also produces cycle-by-cycle vectors that are then used as test vectors for the Verilog.
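The idea of reusing simulator output as test vectors can be sketched as a simple comparison between a "golden" trace and a trace captured from an RTL simulation. The following is only an illustration: the text format (one `cycle hexvalue` pair per line) and the function names `parse_vectors` and `first_mismatch` are assumptions made for this sketch, not the actual Sapphyre output format or tooling.

```python
from typing import Dict, Iterable, Optional, Tuple


def parse_vectors(lines: Iterable[str]) -> Dict[int, int]:
    """Parse hypothetical 'cycle hexvalue' lines into {cycle: value}.

    Blank lines and anything after '#' are ignored.
    """
    vectors: Dict[int, int] = {}
    for line in lines:
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        cycle, value = line.split()
        vectors[int(cycle)] = int(value, 16)
    return vectors


def first_mismatch(
    golden: Dict[int, int], dut: Dict[int, int]
) -> Optional[Tuple[int, Optional[int], Optional[int]]]:
    """Return (cycle, golden_value, dut_value) for the earliest differing cycle.

    A cycle present in only one trace counts as a mismatch, with None
    standing in for the missing value. Returns None if the traces agree.
    """
    for cycle in sorted(golden.keys() | dut.keys()):
        expected, actual = golden.get(cycle), dut.get(cycle)
        if expected != actual:
            return (cycle, expected, actual)
    return None


# Example: a simulator trace versus an RTL trace that diverges at cycle 1.
golden_trace = parse_vectors(["0 00ff", "1 0100", "# end of frame", "2 0a"])
rtl_trace = parse_vectors(["0 00ff", "1 0101", "2 0a"])
print(first_mismatch(golden_trace, rtl_trace))
```

Because the reference trace is cycle-accurate, a check of this kind can report the exact cycle at which the hardware description departs from the simulated behaviour, rather than only a wrong final result.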
“We also have a real time debug monitor embedded in the silicon via the multiplexer – that helps developing code on the actual silicon and it provides visibility of all the data in the system,” he said. “You can take that data and feed it back into the simulator for a replay and that gives great visibility.”
A typical design using the core in 40nm runs at 96MHz and uses 116K gates. This provides 384MMAC/s with a peak power of 8mW and an average power of 1mW in 0.25mm² of silicon. It can replace a CPU or DSP core in an ASIC to reduce power and area and boost performance.
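The quoted figures imply a few derived numbers that are easy to check. The short sketch below works them out; the helper names are illustrative only, and the energy-per-MAC estimate assumes that the 8mW peak power coincides with the full 384MMAC/s rate, which the figures above do not state explicitly.

```python
def macs_per_cycle(mac_rate_hz: float, clock_hz: float) -> float:
    """Multiply-accumulate operations completed per clock cycle."""
    return mac_rate_hz / clock_hz


def energy_per_mac_pj(power_w: float, mac_rate_hz: float) -> float:
    """Energy per MAC in picojoules, assuming power_w is drawn at mac_rate_hz."""
    return power_w / mac_rate_hz * 1e12


# Typical 40nm design: 384MMAC/s at 96MHz, 8mW peak.
typical_macs = macs_per_cycle(384e6, 96e6)
typical_energy = energy_per_mac_pj(8e-3, 384e6)

# Headline figure: 1GMAC/s at 100MHz.
headline_macs = macs_per_cycle(1e9, 100e6)

print(f"typical design: {typical_macs:.0f} MACs per cycle")
print(f"typical design: about {typical_energy:.1f} pJ per MAC at peak")
print(f"headline figure: {headline_macs:.0f} MACs per cycle")
```

On these assumptions the typical design completes 4 MACs per cycle at roughly 21pJ per MAC, while the 1GMAC/s figure at 100MHz corresponds to a configuration completing 10 MACs per cycle.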
The core can also be used for machine learning in AI systems, he says. “The modules change to array-based processing modules for CNN layers but the same architecture works well and we are doing work in that space,” he said.