As if it wasn’t bad enough that Moore’s Law improvements in the density and cost of transistors are slowing, the cost of designing chips and of the factories that are used to etch them is also on the rise. Any savings on any of these fronts will be most welcome to keep IT innovation leaping ahead.
One of the promising frontiers of research right now in chip design is using machine learning techniques to actually help with some of the tasks in the design process. We will be discussing this at our upcoming The Next AI Platform event in San Jose on March 10 with Elias Fallon, engineering director at Cadence Design Systems.
(You can see the full agenda and register to attend at this link; we hope to see you there.) The use of machine learning in chip design was also one of the topics that Jeff Dean, a senior fellow in the Research Group at Google who has helped invent many of the hyperscaler’s key technologies, talked about in his keynote address at this week’s 2020 International Solid State Circuits Conference in San Francisco.
Google, as it turns out, has more than a passing interest in compute engines, being one of the large consumers of CPUs and GPUs in the world and also the designer of TPUs spanning from the edge to the datacenter for doing both machine learning inference and training. So this is not just an academic exercise for the search engine giant and public cloud contender – particularly if it intends to keep advancing its TPU roadmap and if it decides, like rival Amazon Web Services, to start designing its own custom Arm server chips or decides to do custom Arm chips for its phones and other consumer devices.
With a certain amount of serendipity, some of the work that Google has been doing to run machine learning models across large numbers of different types of compute engines is feeding back into the work that it is doing to automate some of the placement and routing of IP blocks on an ASIC. (It is wonderful when an idea is fractal like that. . . .)
The pod of TPUv3 systems that Google showed off back in May 2018 can mesh together 1,024 of the tensor processors (which had twice as many cores and about a 15 percent clock speed boost as far as we can tell) to deliver 106 petaflops of aggregate 16-bit half precision multiplication performance (with 32-bit accumulation) using Google’s own – and very clever – bfloat16 data format. Those TPUv3 chips are all cross-coupled using a 32×32 toroidal mesh so they can share data, and each TPUv3 core has its own bank of HBM2 memory. This TPUv3 pod is a huge aggregation of compute, which can do either machine learning training or inference, but it is not necessarily as large as Google needs to build. (We will be talking about Dean’s comments on the future of AI hardware and models in a separate story.)
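To put the pod figures in perspective, here is a minimal Python sketch of the arithmetic and of what a 32×32 toroidal mesh implies. It is our own illustration, not Google's code: the function names and the (x, y) chip coordinates are hypothetical, and we are assuming the textbook definition of a 2D torus, where each chip links to four neighbors and the links wrap around at the edges. The per-chip number is simple division of the quoted aggregate and assumes all 1,024 chips contribute equally.

```python
# Figures quoted in the article; everything else here is an illustrative assumption.
POD_SIDE = 32          # the pod is described as a 32x32 toroidal mesh
POD_PETAFLOPS = 106    # aggregate bfloat16 performance quoted for the pod


def torus_neighbors(x, y, side=POD_SIDE):
    """Return the four neighbors of chip (x, y), wrapping around at the edges."""
    return [
        ((x - 1) % side, y),
        ((x + 1) % side, y),
        (x, (y - 1) % side),
        (x, (y + 1) % side),
    ]


def torus_hops(a, b, side=POD_SIDE):
    """Minimum number of link hops between chips a and b on the torus."""
    dx = abs(a[0] - b[0])
    dy = abs(a[1] - b[1])
    # On each axis, going the other way around the ring may be shorter.
    return min(dx, side - dx) + min(dy, side - dy)


chips = POD_SIDE * POD_SIDE
teraflops_per_chip = POD_PETAFLOPS * 1000 / chips

print(chips)                               # number of chips in the mesh
print(round(teraflops_per_chip, 1))        # average teraflops per chip
print(torus_neighbors(0, 0))               # corner chip still has four neighbors
print(torus_hops((0, 0), (31, 31)))        # opposite corners are close on a torus
```

The wraparound links are the point of the topology: in a plain 32×32 grid the opposite corners would be 62 hops apart, while under the torus assumption above they are two hops apart, and the worst case drops to 32 hops.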
Suffice it to say, Google is hedging with hybrid architectures that mix CPUs and GPUs – and perhaps someday other accelerators – for reinforcement learning workloads. Hence the research that Dean and his peers at Google have been involved in, which is also being brought to bear on ASIC design.
“One of the trends is that models are getting bigger,” explains Dean. “So the entire model doesn’t necessarily fit on a single chip. If you have essentially large models, then model parallelism – dividing the model up across multiple chips – is important, and getting good performance by giving it a bunch of compute devices is non-trivial and it is not obvious how to do that effectively.”
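The fitting problem Dean describes can be sketched in a few lines of Python. This is a deliberately naive illustration of model parallelism, not the method Google uses: the function name, the layer sizes, and the per-device memory capacity are all hypothetical. It walks through a model's layers in order and starts a new device whenever the next layer would not fit in the current one.

```python
def partition_layers(layer_sizes_gb, device_capacity_gb):
    """Assign consecutive layers to devices, opening a new device when one fills up.

    Returns a list of stages, where each stage is the list of layer indexes
    placed on one device. Sizes and capacity are in the same (hypothetical) units.
    """
    stages = [[]]
    used = 0
    for index, size in enumerate(layer_sizes_gb):
        if size > device_capacity_gb:
            raise ValueError(f"layer {index} does not fit on any single device")
        if used + size > device_capacity_gb:
            # Current device is full: move on to the next one.
            stages.append([])
            used = 0
        stages[-1].append(index)
        used += size
    return stages


# Hypothetical model: six layers, 32 GB in total, on devices with 16 GB each.
layers = [6, 6, 4, 8, 3, 5]
stages = partition_layers(layers, 16)
print(stages)  # layers grouped by device
```

A split like this only guarantees that each piece fits in memory. It says nothing about how much data has to move between devices or how evenly the work is balanced, which is the non-trivial part Dean is pointing to.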