A highlight from my time on IBM’s C/C++ compiler team was working on the compiler for the Cell processor. The Cell processor was designed with high-performance computing in mind. Its architecture consists of one general-purpose core (the Power Processor Element, or PPE) accompanied by eight special-purpose cores (Synergistic Processing Elements, or SPEs) for blazing-fast vector processing. This is similar to how GPUs are used today to speed up multimedia, gaming, and AI workloads, except that everything was on one chip.
Of course, it is very common today for CPUs to include integrated GPU and AI coprocessing units, but the Cell processor first shipped in 2006!
There are multiple reasons why such an architecture is difficult to develop software for. One of them is that every element has its own memory and hence its own address space. This means that moving data between PPE main memory and SPE local memory has to be done explicitly, which is tedious. I worked on a feature that maps all of the individual address spaces into one global, virtual, effective address space. Programmers could then use an effective address pointer (a C/C++ pointer qualified with the __ea keyword) to abstract away the communication and transfer between the various memory systems. They could simply write their program as they normally would, which led to a great productivity boost.
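To make the difference concrete, here is a minimal sketch in plain C. It is a simulation that runs on any machine, not Cell SDK code: the two arrays stand in for PPE main memory and SPE local memory, and every name in it (`ea_t`, `dma_get`, `ea_load`, `sum_explicit`, `sum_ea`) is hypothetical. The one-line software cache inside `ea_load` is only an assumption about how a compiler might hide the transfers; it is not a description of the actual implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Simulated memories, sized in 32-bit words (illustrative values only). */
#define MAIN_WORDS  256u  /* stand-in for PPE main memory            */
#define LOCAL_WORDS 16u   /* stand-in for an SPE's small local memory */
#define LINE_WORDS  4u    /* size of the hypothetical cache line      */

static int32_t main_memory[MAIN_WORDS];
static int32_t local_buf[LOCAL_WORDS];

/* Hypothetical "effective address": a word index into main_memory. */
typedef uint32_t ea_t;

/* Stand-in for an explicit transfer from main memory to local memory. */
void dma_get(int32_t *local_dst, ea_t src, size_t n_words)
{
    assert(src + n_words <= MAIN_WORDS);
    memcpy(local_dst, &main_memory[src], n_words * sizeof(int32_t));
}

/* Explicit style: the programmer stages the data chunk by chunk. */
int32_t sum_explicit(ea_t src, size_t count)
{
    int32_t total = 0;
    while (count > 0) {
        size_t n = count < LOCAL_WORDS ? count : LOCAL_WORDS;
        dma_get(local_buf, src, n);          /* copy one chunk in */
        for (size_t i = 0; i < n; i++)
            total += local_buf[i];
        src += (ea_t)n;
        count -= n;
    }
    return total;
}

/* Abstracted style: a load through an effective address. The transfer
 * happens behind the scenes via a one-line, read-only cache (an
 * assumption for illustration; writes and coherence are not modeled). */
static int32_t cache_line[LINE_WORDS];
static ea_t cache_base;
static int cache_valid = 0;

int32_t ea_load(ea_t addr)
{
    assert(addr < MAIN_WORDS);
    ea_t base = addr - (addr % LINE_WORDS);
    if (!cache_valid || cache_base != base) {
        dma_get(cache_line, base, LINE_WORDS); /* refill on a miss */
        cache_base = base;
        cache_valid = 1;
    }
    return cache_line[addr - base];
}

/* The loop now reads like ordinary code with no visible transfers. */
int32_t sum_ea(ea_t src, size_t count)
{
    int32_t total = 0;
    for (size_t i = 0; i < count; i++)
        total += ea_load(src + (ea_t)i);
    return total;
}
```

Both functions compute the same sum, but only `sum_explicit` makes the caller think about chunk sizes and staging buffers. With an `__ea`-qualified pointer, the idea is that a plain dereference plays the role of `ea_load`, so the loop in `sum_ea` is all the programmer has to write.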