six thoughts on generating c — wingolog

So I work in compilers, which means that I write programs that translate
programs to programs. Sometimes you will want to target a language at a
higher level than just, like, assembler, and oftentimes C is that
language. Generating C is less fraught than writing C by hand, as the
generator can often avoid the undefined-behavior pitfalls that one has
to be so careful about when writing C by hand. Still, I have found some
patterns that help me get good results.

Today’s note is a quick summary of things that work for me. I won’t be
so vain as to call them “best practices”, but they are my practices, and
you can have them too if you like.

static inline functions enable data abstraction

When I learned C, in the early days of
GStreamer (oh bless its heart it
still has the same web page!), we used lots of preprocessor macros.
Mostly we got the message over time that many macro uses should have
been inline functions;
macros are for token-pasting and generating names, not for data access
or other implementation.

But what I did not appreciate until much later was that always-inline
functions remove any possible performance penalty for data abstractions.
For example, in Wastrel, I can
describe a bounded range of WebAssembly memory via a memory struct,
and an access to that memory in another struct:

struct memory { uintptr_t base; uint64_t size; };
struct access { uint32_t addr; uint32_t len; };

And then if I want a writable pointer to that memory, I can do so:

#define static_inline \
  static inline __attribute__((always_inline))

static_inline void* write_ptr(struct memory m, struct access a) {
  BOUNDS_CHECK(m, a);
  char *base = __builtin_assume_aligned((char *) m.base_addr, 4096);
  return (void *) (base + a.addr);
}

(Wastrel usually omits any code for BOUNDS_CHECK, and just relies on
memory being mapped into a PROT_NONE region of an appropriate size.
We use a macro there because if the bounds check fails and kills the
process, it’s nice to be able to use __FILE__ and __LINE__.)

Regardless of whether explicit bounds checks are enabled, the
static_inline attribute ensures that the abstraction cost is entirely
burned away; and in the case where bounds checks are elided, we don’t
need the size of the memory or the len of the access, so they won’t
be allocated at all.

If write_ptr wasn’t static_inline, I would be a little worried that
somewhere one of these struct values would get passed through memory.
This is mostly a concern with functions that return structs by value;
whereas in e.g. AArch64, returning a struct memory would use the same
registers that a call to void (*)(struct memory) would use for the
argument, the SYS-V x64 ABI only allocates two general-purpose registers
to be used for return values. I would mostly prefer to not think about
this flavor of bottleneck, and that is what static inline functions do
for me.

avoid implicit integer conversions

C has an odd set of default integer conversions, for example promoting
uint8_t to signed int, and also has weird boundary conditions for
signed integers. When generating C, we should probably sidestep these
rules and instead be explicit: define static inline u8_to_u32,
s16_to_s32, etc conversion functions, and turn on -Wconversion.

Using static inline cast functions also allows the generated code to assert
that operands are of a particular type. Ideally, you end up in a
situation where all casts are in your helper functions, and no cast is
in generated code.

wrap raw pointers and integers with intent

Whippet is a garbage collector
written in C. A garbage collector cuts across all data abstractions:
objects are sometimes viewed as absolute addresses, or ranges in a paged
space, or offsets from the beginning of an aligned region, and so on.
If you represent all of these concepts with size_t or uintptr_t or
whatever, you’re going to have a bad time. So Whippet has struct gc_ref,
struct gc_edge,
and the like: single-member structs whose purpose it is to avoid
confusion by partitioning sets of applicable operations. A
gc_edge_address call will never apply to a struct gc_ref, and so on
for other types and operations.

This is a great pattern for hand-written code, but it’s particularly
powerful for compilers: you will often end up compiling a term of a
known type or kind and you would like to avoid mistakes in the residualized
C.

For example, when compiling WebAssembly, consider struct.set‘s
operational
semantics:
the textual rendering states, “Assert: Due to validation, val is some
ref.struct structaddr.” Wouldn’t it be nice if this assertion could
translate to C? Well in this case it can: with single-inheritance
subtyping (as WebAssembly has), you can make a forest of pointer
subtypes:

typedef struct anyref { uintptr_t value; } anyref;
typedef struct eqref { anyref p; } eqref;
typedef struct i31ref { eqref p; } i31ref;
typedef struct arrayref { eqref p; } arrayref;
typedef struct structref { eqref p; } structref;

So for a (type $type_0 (struct (mut f64))), I might generate:

typedef struct type_0ref { structref p; } type_0ref;

Then if I generate a field setter for $type_0, I make it take a
type_0ref:

static inline void
type_0_set_field_0(type_0ref obj, double val) {
  ...
}

In this way the types carry through from source to target language.
There is a similar type forest for the actual object representations:

typedef struct wasm_any { uintptr_t type_tag; } wasm_any;
typedef struct wasm_struct { wasm_any p; } wasm_struct;
typedef struct type_0 { wasm_struct p; double field_0; } type_0;
...

And we generate little cast routines to go back and forth between
type_0ref and type_0* as needed. There is no overhead because all
routines are static inline, and we get pointer subtyping for free: if a
struct.set $type_0 0 instruction is passed a subtype of $type_0, the
compiler can generate an upcast that type-checks.

fear not `memcpy`

In WebAssembly, accesses to linear memory are not necessarily aligned,
so we can’t just cast an address to (say) int32_t* and dereference.
Instead we memcpy(&i32, addr, sizeof(int32_t)), and trust the compiler
to just emit an unaligned load if it can (and it can). No need for more
words here!

for ABI and tail calls, perform manual register allocation

So, GCC finally has
__attribute__((musttail)):
praise be. However, when compiling WebAssembly, it could be that you
end up compiling a function with, like 30 arguments, or 30 return
values; I don’t trust a C compiler to reliably shuffle between different
stack argument needs at tail calls to or from such a function. It could
even refuse to compile a file if it can’t meet its musttail
obligations; not a good characteristic for a target language.

Really you would like it if all function parameters were allocated to
registers. You can ensure this is the case if, say, you only pass the
first n values in registers, and then pass the rest in global
variables. You don’t need to pass them on a stack, because you can make
the callee load them back to locals as part of the prologue.

What’s fun about this is that it also neatly enables multiple return
values when compiling to C: simply go through the set of function types
used in your program, allocate enough global variables of the right
types to store all return values, and make a function epilogue store any
“excess” return values—those beyond the first return value, if any—in
global variables, and have callers reload those values right after
calls.

what’s not to like

Generating C is a local optimum: you get the industrial-strength
instruction selection and register allocation of GCC or Clang, you don’t
have to implement many peephole-style optimizations, and you get to link
to to possibly-inlinable C runtime routines. It’s hard to improve over
this design point in a marginal way.

There are drawbacks, of course. As a Schemer, my largest source of
annoyance is that I don’t have control of the stack: I don’t know how
much stack a given function will need, nor can I extend the stack of my
program in any reasonable way. I can’t iterate the stack to precisely
enumerate embedded pointers (but perhaps that’s
fine).
I certainly can’t slice a stack to capture a delimited continuation.

The other major irritation is about side tables: one would like to be
able to implement so-called zero-cost
exceptions,
but without support from the compiler and toolchain, it’s impossible.

And finally, source-level debugging is gnarly. You would like to be
able to embed DWARF information corresponding to the code you
residualize; I don’t know how to do that when generating C.

(Why not Rust, you ask? Of course you are asking that. For what it is
worth, I have found that lifetimes are a frontend issue; if I had a
source language with explicit lifetimes, I would consider producing
Rust, as I could machine-check that the output has the same guarantees
as the input. Likewise if I were using a Rust standard library. But if
you are compiling from a language without fancy lifetimes, I don’t
know what you would get from Rust: fewer implicit conversions, yes, but
less mature tail call support, longer compile times… it’s a wash, I
think.)

Oh well. Nothing is perfect, and it’s best to go into things with your
eyes wide open. If you got down to here, I hope these notes help you in
your generations. For me, once my generated C type-checked, it worked:
very little debugging has been necessary. Hacking is not always like
this, but I’ll take it when it comes. Until next time, happy hacking!

Source link