The following blog documents a series of interviews with VMS Software Inc.’s John Reagan, who heads the compiler group. John was giving a series of presentations on leveraging LLVM for the new OpenVMS X86 port. After listening to his presentation in Malmo, Sweden, I decided to contact him and ask him to expound on his fascinating view of modern compiler technology and how they work. John has worked on various compilers at DEC, Compaq and HP and has a very deep background in many 3GL and 4GL languages and their tools.
John spent 31 years working for Hewlett Packard as a Compiler engineer for OpenVMS, specializing on Pascal, COBOL, Macro-32, the GEM backend, and assorted other compiler-related projects. He is currently working for VMS Software, Inc. as a Compiler architect/developer for compilers and compilation toolchain for OpenVMS on Itanium and future platforms.
Q: I have seen your presentation at the Bootcamp and at the Connect Conference in Malmo earlier this year. I had a number of questions on the presentation and I will start there. So, in your presentation, you explained you worked on the development of GEM and it was created as a common code generator for the Alpha 64 bit RISC architecture with a target independent IR which is used by all the languages. You mentioned GEM and GEMIR – those terms were unfamiliar to me, so I had several questions:
- What is the difference between the two?
- Does IR stand for Intermediate Representation?
Can you describe how this was done on GEM and how LLVM does it and how is this step is important for generating object code?
A:” Those are all good questions. To be clear, I wasn’t one of the original GEM developers. As a front-end developer, I used GEM, provided feedback to GEM, even worked with them to solve particular problems. I only became a real GEM developer when most of the original GEM developers were transitioned from Compaq to Intel. You asked about GEM; GEM is not an acronym. Back in the Digital days, there was a Prism project and Prism was going to be the follow-on architecture to VAX. It wasn’t done and Alpha was done instead and it was almost Prism. At the time, everyone got enamored with things that looked like jewels, so there was GEM and Opal and there are other names that were involved. GEM is not an acronym that had to do with Prism. So all of our VAX compilers were in this match of modules: different code generators, different approaches, different optimizers – some had great optimizers like the VAX Fortran compiler and some had almost no optimizers: the VAX COBOL compiler had no optimizer at all. So deciding from everything we learned on VAX, going to Alpha they decided: let’s write a common back end that everyone uses. We had some experience on VAX; you may have heard of the VCG or VAX Common Code Generator which was used by the VAX PL1, the VAX Ada and the VAX C compiler shared a common code generator. It wasn’t really common – it was cloned from each other and shared resources, but we knew as a compiler group we were going to have write compilers not only for the VMS, but we had to write compilers for MIPS Ultrix. That group came to us and said we want your Fortran compiler; it is a great Fortran compiler. We said how do we get our VAX Fortran compiler on MIPS Ultrix? You switch architecture and operating systems.
GEM was invented as a multi OS, multi-target backend. It provides interfaces that deal with command line processes, with file reading, file writing, etc. So a front end could say I need to read my source file and on VMS it knew how to call RMS and on Tru64 or MIPS ULTRIX, it knew how to call those file systems. That part of GEM is dealing with environmental issues. The other part of GEM is this intermediate representation where you take the source code and you turn it into a…think of it as a high level assembler: I want to create some variables and I want to add them together or multiply them or do a test and a branch. If you dumped it out you would say this has an assembler feel to it, but it is a little more abstract. It knows about descriptors, it knows about loops, but it has almost no target architecture information in it at all. For some things like weird BLISS linkages, we had to have register numbers encoded in some linkages, but for 99% of the generated code its target independent. GEM took that intermediate representation and internally, depending on whether it was an Alpha or MIPS target version of GEM, it turned it into a tree, does optimizations of code hoisting, and common sub-expressions and loop unrolling, stuff that you needed to do for Fortran and we needed to make sure that all the tricks we played for VAX Fortran were also performed by GEM. GEM also had to the learn new tricks for Alpha; it would be horribly disappointing if we came up with this really fast Alpha chip and then find out the Fortran compiler really sucked and then gave you worse performance than your older VAX did.
Having fast chips are one thing, but if you don’t have a good compiler, you are in trouble. So GEM does all that transformation, it knows how to write out the object information, because on different platforms it knows the object file is a different format. I look at GEM sometimes as Mr. Potato head; there’s different pieces inside of GEM; there’s the optimizer which is mostly target independent, but there are several code generators, there is one for Alpha, there is one for MIPS, there’s one for x86 32-bit for Visual Fortran. It knows how to write COFF files for Tru64 object file, it knows the Windows object format for the Windows systems, it knows how to write VMS object format, so when we build GEM for different targets, we just pick and choose the pieces we want and end up with a GEM that gets linked with the front end. The front end produces a symbol table and a generic sequence of the intermediate representation and tells GEM: here, go have fun and go generate code. What that enabled is that the Fortran compiler front end now can be the same Fortran front end for VMS Alpha as it was for MIPS Ultrix as it was for Tru64 Alpha as it was for linux on Itanium as it was for all those different targets and flavors that we had. It allows for the first time for COBOL to be optimized. COBOL now runs the same optimizer that Fortran runs through. The Alpha COBOL could now do strength reduction and loop unrolling. Our Alpha BASIC compile knows how to do pointer analysis. We had never done that before because it wasn’t worth the effort because no one was really looking for that type of performance on those languages, but you get all for free, right? That’s the advantage of that backend.
Going to something like LLVM, we have the same advantage again. We were presented with the issue of how do we keep the existing VMS customers happy; they’re moving code forward, but I also want to modernize VMS and give you newer language standards for things like C or C++ and Fortran and good quality code for all the different types of X86 chips whether they come from AMD or Intel. LLVM has support for all the different types, not just one or two type of chips. On Alpha, there’s just two or three different flavors of Alpha chips and differences between them were pretty minor in most cases. Internally they were faster, but from an outside view they behaved pretty much the same. Go look on X86 and over the years there are dozens of subsets of different instructions on different chips. Some chips like one sequence of instructions versus another sequence of instructions so the optimizations need to almost know what chip they are optimizing best for. To try to keep track of that is a difficult job; it takes a lot people. Something like LLVM; there are hundreds of people around the planet that are watching that stuff, constantly tweaking it and keeping track of that so I don’t have to, right?
We said we have all these front ends that generated this GEM intermediate language, how do we get that same front-end hooked to LLVM? We need those front-ends, because we have customers out there with millions of lines of BASIC, of Pascal, COBOL and Fortran, all those things – they expect our frontends – there’s no other place to go for those things. We could have made massive changes to each frontend to have GEM directly generate LLVM intermediate language, but why bother? They all generate a nice GEM intermediate language – pretty understandable, so we have written a GEM intermediate language converter. It takes the GEM intermediate language and generates the LLVM intermediate language. Now LLVM, being a compiler tool, this is not really rocket science. At this level for people in this industry it is the same thing: the LLVM intermediate language lets you record variables, which you can add two together and then talk about the results.
There are 200 GEM intermediate language intermediate representation node types: add, subtract, multiply, shift left, shift right, bit fetch, bit store all the various flavors 75% of them are one to one mappings on something equivalent on LLVM. Add is add, and so on. There is not really much else you need to do. There are a few places that are a little different since GEM has some built-in knowledge about knowing how to build VMS descriptors that made it easy for every frontend to build VMS descriptors. LLVM doesn’t know about VMS descriptors, so the one GEM intermediate node that talks about building a VMS descriptor that the frontend generates, the converter has to generate a sequence of fifteen or twenty intermediates on LLVM, but pretty much filled the datatype in first byte, filled the class in the second byte, filled the length in the next word it is relatively straight forward and very mechanical. All we had to go through these several hundred GEM nodes to convert them all: just brute force. Just pound your way through it. For the most part that gets you 90% the way there.”
Q: So what about the language of GEM? Is it written in C?
A: “No, GEM is a mixture. It started life as all BLISS. Over the years, it morphed into different languages. There was a switch over when all new stuff was written in C, but there is a lot written still in BLISS. Of course, if you work on GEM, you have to learn how to do it in BLISS. Our converter that we just wrote is in C++, there’s no reason to do it in BLISS. But of course all of these front ends, the BASIC front end, the COBOL front end, the Pascal front end – all in BLISS. The Fortran front end is in C Of course the C frontend is in C, and the BLISS frontend is in BLISS. So being in our group, you still need to know how read and manipulate BLISS.”
Q: So LLVM – is it written in C?
A: “Yes , all in C – well, 99% in C++ and uses extensive C++ features. There are a few lower level utility routines that are straight forward C99 C. It is very object oriented in its interface to build the LLVM IR. You do have to go and learn to read classes and inheritance and all those things. Having a good working knowledge of C++ is a good thing to read the LLVM sources. Whether it is things like C++, Swift or Rust or other things in the industry on linux platforms, the LLVM IR has already been shown to work for a lot of different languages.
I downloaded the LLVM source code and built it on OpenVMS Itanium to generate code for X86. That is what we are using for our cross compilers. There are hundreds of files that are part of LLVM that we compile. We had to make a few changes for VMS. I think right now it is under 500 lines across the entire code base. We’ve had a couple of extensions, for example in the unix world, every routine doesn’t have an argument count; in the VMS world, every routine expects an argument count passed in. So we’ve added some support to make sure every routine call has an argument count. That had to be done inside the code generator, not in the converter, that’s too high level for that. There’s some things we have to add for debug information; there’s some debug information for some of our languages like BASIC or PASCAL or COBOL or BLISS that still aren’t in the current DWARF standard.
So we have added some extensions to make sure those additional types get produced that our Debugger expects. It is pretty much very mechanical and high level. We haven’t made any fundamental changes to LLVM because we want to pick up new versions of LLVM going forward. I don’t want make the mistake of snapshotting LLVM today and sticking with it for the next ten years. It will be out of date in six months. LLVM churns out new versions quite often to add additional optimizations, new chipsets support, code mitigations for security issues like SPECTRE and Meltdown. So we need to stay current with that code generator technology is very important. “
Editor’s note: This seems to be a good place to stop the first blog article. Coming up in the next blog of our interview: history of LLVM and futures.