Architecture of an Assembler: Code generation: converting tokens to binary

Article by Ayman Alheraki on January 11 2026 10:37 AM

The code generation phase is the assembler’s core responsibility: translating the parsed instructions and directives into raw machine code. This involves converting symbolic tokens (mnemonics, registers, labels, and literals) into binary encodings that conform precisely to the x86-64 instruction set and machine code format.

This section provides a comprehensive breakdown of the tasks, rules, and decisions involved in binary code generation for the x86-64 architecture.

1. Objective of Code Generation

The code generation phase produces:

- The final binary instruction stream, byte-aligned and ready for execution or linking.
- An accurate relocation and symbol mapping table (for relocatable objects).
- A consistent mapping from mnemonics + operands to opcode + encoding bytes, including ModR/M, SIB, displacement, and immediate fields.

This binary output must reflect every size, addressing, and operand-type decision, along with the instruction semantics, determined during previous phases.

2. Instruction Encoding Pipeline

To generate correct machine code, each instruction undergoes a well-defined encoding process:

- Opcode Determination: The assembler selects the opcode byte(s) based on the instruction mnemonic and operand types. x86-64 instructions use a primary opcode of 1–3 bytes, optionally preceded by legacy prefixes and a REX prefix.
- REX Prefix Emission: For x86-64 instructions requiring 64-bit operands or extended registers (R8–R15, XMM8–XMM15, etc.), the assembler emits a REX prefix. Its W bit selects 64-bit operand size, and its R, X, and B bits extend the ModR/M reg field, the SIB index field, and the ModR/M r/m (or SIB base, or opcode register) field respectively.
- Operand Encoding: Operands are encoded based on type and role:
  - Register operands are translated into 3-bit codes placed in the ModR/M byte (or, for some instructions, in the opcode byte), with the fourth bit of an extended register carried in the REX prefix.
  - Memory operands require ModR/M and often SIB bytes, with possible displacement.
  - Immediate operands are encoded directly in little-endian byte format.
- ModR/M and SIB Generation: The assembler calculates ModR/M and SIB bytes when applicable, based on operand addressing mode and register roles (explained in detail in Chapter 3).
- Displacement and Immediate Emission: Any required displacement (for memory access) or immediate value is converted to the correct size and stored in little-endian order. In 64-bit mode, displacements are normally 8 or 32 bits; immediates are 8, 16, or 32 bits, with a full 64-bit immediate available for MOV r64, imm64.
- Instruction Size Finalization: The total size of the instruction is computed, which may be needed to adjust future relative addresses (e.g., for jumps or calls).
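
The pipeline above can be illustrated with a minimal Python sketch that encodes three forms of MOV. It is not taken from any particular assembler: the names REGS64, rex, modrm, and the encode_* functions are hypothetical, and only 64-bit registers and base-plus-displacement addressing are handled.

```python
# Hypothetical register table: the value is the 4-bit hardware register number.
REGS64 = {name: num for num, name in enumerate(
    ["rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi",
     "r8", "r9", "r10", "r11", "r12", "r13", "r14", "r15"])}

def rex(w=0, r=0, x=0, b=0):
    """Build a REX prefix with the bit layout 0100WRXB."""
    return 0x40 | (w << 3) | (r << 2) | (x << 1) | b

def modrm(mod, reg, rm):
    """Pack ModR/M: mod (2 bits), reg (3 bits), r/m (3 bits)."""
    return (mod << 6) | ((reg & 7) << 3) | (rm & 7)

def encode_mov_reg_reg(dst, src):
    """MOV r/m64, r64 -> REX.W + 89 /r, register-direct (mod = 11b)."""
    d, s = REGS64[dst], REGS64[src]
    return bytes([rex(w=1, r=s >> 3, b=d >> 3), 0x89, modrm(0b11, s, d)])

def encode_mov_reg_imm64(dst, value):
    """MOV r64, imm64 -> REX.W + (B8 + low 3 register bits) + 8-byte immediate."""
    d = REGS64[dst]
    imm = (value & 0xFFFFFFFFFFFFFFFF).to_bytes(8, "little")
    return bytes([rex(w=1, b=d >> 3), 0xB8 + (d & 7)]) + imm

def encode_mov_reg_mem(dst, base, disp=0):
    """MOV r64, [base + disp] -> REX.W + 8B /r, optional SIB, displacement."""
    d, b = REGS64[dst], REGS64[base]
    if disp == 0 and (b & 7) != 5:
        # mod = 00b means "no displacement", except r/m = 101b (rbp/r13).
        mod, disp_bytes = 0b00, b""
    elif -128 <= disp <= 127:
        mod, disp_bytes = 0b01, disp.to_bytes(1, "little", signed=True)
    else:
        mod, disp_bytes = 0b10, disp.to_bytes(4, "little", signed=True)
    out = bytearray([rex(w=1, r=d >> 3, b=b >> 3), 0x8B, modrm(mod, d, b)])
    if (b & 7) == 4:
        # r/m = 100b signals a SIB byte, so rsp/r12 as base need one:
        # scale = 00b, index = 100b (none), base = 100b.
        out.append(0x24)
    return bytes(out) + disp_bytes
```

For instance, encode_mov_reg_reg("rax", "rbx").hex() yields 4889d8, and the length of each returned byte string is the instruction size used in the finalization step.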

3. Data Segment Emission

Code generation also handles data directives, such as DB, DW, DD, DQ, which declare initialized data values. These are converted to raw bytes and appended to the data section.

For example:


DB 0xFF, 0x00, 0x7F

This generates the bytes:


FF 00 7F

Expressions or label references within data definitions are resolved during symbol resolution and relocation.
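
As a rough illustration of data emission, the following Python sketch converts already-resolved integer values into little-endian bytes for the four directives named above. The function name emit_data and the range check are assumptions made for this example; string operands and unresolved expressions are not handled.

```python
import struct

# Little-endian struct formats for each directive (1, 2, 4, and 8 bytes).
_FORMATS = {"DB": "<B", "DW": "<H", "DD": "<I", "DQ": "<Q"}

def emit_data(directive, values):
    """Return the raw bytes produced by one data directive."""
    fmt = _FORMATS[directive.upper()]
    bits = 8 * struct.calcsize(fmt)
    mask = (1 << bits) - 1
    out = bytearray()
    for v in values:
        # Accept both signed and unsigned values that fit in the element size.
        if not -(1 << (bits - 1)) <= v <= mask:
            raise ValueError(f"{v} does not fit in {directive}")
        out += struct.pack(fmt, v & mask)  # two's complement for negatives
    return bytes(out)
```

With this sketch, emit_data("DB", [0xFF, 0x00, 0x7F]) returns the bytes FF 00 7F from the example above, and emit_data("DW", [0x1234]) returns 34 12.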

4. Handling Address Relocations

When labels or symbols appear in operands (especially for jumps, calls, or memory accesses), the final value may not be known during initial code generation. The assembler flags such locations for relocation and stores:

- The offset within the instruction where the fix-up is needed.
- The type of relocation (relative, absolute, segment-relative).
- The symbol to be resolved by the linker.

The assembler must differentiate between:

- Relative encodings: e.g., JMP label with PC-relative offset.
- Absolute references: e.g., MOV RAX, [some_data] when the operand is encoded as an absolute address rather than RIP-relative.
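
A minimal sketch of how such fix-ups might be recorded and later applied is shown below in Python. The Relocation record, the kind names "rel32" and "abs64", and the helper functions are hypothetical; real object formats define their own relocation types and leave the patching to the linker.

```python
from dataclasses import dataclass

@dataclass
class Relocation:
    offset: int      # byte offset of the field to patch within the section
    kind: str        # "rel32" or "abs64" (illustrative names)
    symbol: str      # symbol whose address is needed
    addend: int = 0  # constant added to the symbol address

def emit_jmp_rel32(code, relocs, symbol):
    """Emit JMP rel32 (opcode E9) with a placeholder and record the fix-up."""
    code.append(0xE9)
    # rel32 is measured from the end of the 4-byte field, hence addend -4.
    relocs.append(Relocation(len(code), "rel32", symbol, addend=-4))
    code += b"\x00\x00\x00\x00"

def apply_relocations(code, relocs, symbols, base=0):
    """Patch each recorded location once symbol addresses are known."""
    for r in relocs:
        target = symbols[r.symbol] + r.addend
        if r.kind == "rel32":
            # PC-relative: target minus the address of the patched field.
            value = target - (base + r.offset)
            code[r.offset:r.offset + 4] = value.to_bytes(4, "little", signed=True)
        elif r.kind == "abs64":
            # Absolute: the full 64-bit address is stored as-is.
            code[r.offset:r.offset + 8] = target.to_bytes(8, "little")
        else:
            raise ValueError(f"unknown relocation kind: {r.kind}")
```

The sketch keeps the placeholder bytes at zero and carries the -4 adjustment in the addend; storing the addend in the instruction bytes instead is an equally valid design.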

5. Literal Pools and Constant Folding

Some assemblers support literal pools: dedicated areas for constants that the code references but that are stored outside the instruction stream. Where they are supported, the code generation phase allocates and aligns such constants as necessary.

Constant folding is also applied to optimize fixed arithmetic expressions at assembly time, replacing calculations with precomputed immediate values where legal.
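
Constant folding can be sketched in Python as a small evaluator over integer expressions. This example borrows Python's own expression parser for convenience, which is an assumption of the sketch: a real assembler would use its own grammar and operator set. The name fold_constant and the constants table (standing in for EQU-style definitions) are hypothetical.

```python
import ast
import operator

_BINOPS = {
    ast.Add: operator.add, ast.Sub: operator.sub, ast.Mult: operator.mul,
    ast.LShift: operator.lshift, ast.RShift: operator.rshift,
    ast.BitAnd: operator.and_, ast.BitOr: operator.or_, ast.BitXor: operator.xor,
}
_UNARYOPS = {ast.USub: operator.neg, ast.Invert: operator.invert}

def fold_constant(expr, constants=None):
    """Evaluate an integer expression at assembly time.

    `constants` maps already-defined symbolic names to integer values.
    """
    constants = constants or {}

    def ev(node):
        if isinstance(node, ast.Constant) and type(node.value) is int:
            return node.value
        if isinstance(node, ast.Name) and node.id in constants:
            return constants[node.id]
        if isinstance(node, ast.BinOp) and type(node.op) in _BINOPS:
            return _BINOPS[type(node.op)](ev(node.left), ev(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARYOPS:
            return _UNARYOPS[type(node.op)](ev(node.operand))
        # Anything else (unknown names, calls, floats) cannot be folded.
        raise ValueError(f"cannot fold: {ast.dump(node)}")

    return ev(ast.parse(expr, mode="eval").body)
```

An operand such as 4 * 1024 + 0x10 would then be emitted as the single immediate 4112 rather than computed at run time.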

6. Optimized Encodings and Short Forms

The assembler prefers short-form encodings when possible. Examples include:

- Encoding MOV AL, imm8 with a single-byte opcode and 8-bit immediate.
- Using JMP rel8 instead of JMP rel32 when within range.

This decision is made after computing instruction and displacement sizes, and may trigger backtracking or instruction rewriting if initial guesses were incorrect.
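
One common way to handle this is an iterative scheme that starts every jump in its short form and widens only those that turn out to be out of range, repeating until nothing changes. The Python sketch below assumes a simplified program model invented for this example: a list of ("label", name), ("jmp", target), and ("raw", size) items, with JMP rel8 taking 2 bytes (EB cb) and JMP rel32 taking 5 bytes (E9 cd).

```python
def relax_jumps(items):
    """Choose "short" (rel8) or "near" (rel32) for every jmp item.

    Returns a dict mapping the index of each jmp item to its chosen form.
    """
    size = {"short": 2, "near": 5}
    form = {i: "short" for i, item in enumerate(items) if item[0] == "jmp"}
    while True:
        # Layout pass: compute item start addresses and label addresses.
        addr, starts, labels = 0, {}, {}
        for i, (kind, arg) in enumerate(items):
            starts[i] = addr
            if kind == "label":
                labels[arg] = addr
            elif kind == "jmp":
                addr += size[form[i]]
            else:  # "raw": arg is a byte count
                addr += arg
        # Range check: rel8 is relative to the end of the 2-byte jump.
        changed = False
        for i, f in list(form.items()):
            if f == "short":
                disp = labels[items[i][1]] - (starts[i] + 2)
                if not -128 <= disp <= 127:
                    form[i] = "near"
                    changed = True
        if not changed:
            return form
```

Because jumps only ever grow in this scheme, the loop is guaranteed to terminate, at the cost of occasionally keeping a long form that a more elaborate algorithm could have shortened.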

7. Assembler Listings and Debug Information

In many implementations, code generation is also responsible for producing listings or debug maps, which align:

- Source lines
- Binary addresses
- Generated bytes

This aids in debugging and external tool integration. While not directly part of machine code, this metadata is tightly coupled with code generation.
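
A listing entry of this kind could be produced by a formatter along the following lines. This Python sketch is only illustrative: the column widths and the name listing_line are arbitrary choices, not the layout of any specific assembler.

```python
def listing_line(line_no, address, code, source):
    """Format one listing row: source line number, address, bytes, source text."""
    hex_bytes = " ".join(f"{b:02X}" for b in code)
    # Pad the byte column so the source text lines up across rows.
    return f"{line_no:5d}  {address:08X}  {hex_bytes:<30}  {source}"
```

Calling listing_line(3, 0x10, bytes.fromhex("4889d8"), "mov rax, rbx") places the line number 3, the address 00000010, and the bytes 48 89 D8 ahead of the original source text.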

8. Post-2020 Enhancements

Recent assembler designs include notable improvements in code generation:

- AVX-512 encoding support: Handling EVEX prefixes and extended register formats.
- Instruction bundling alignment for advanced CPUs and speculative execution.
- Compact instruction encoding selection based on CPU feature level and mode.
- Plugin-based code generators, allowing instruction classes (SIMD, FPU) to be modularly encoded.

9. Summary

The code generation phase is the culmination of all previous analysis. It synthesizes architecture-compliant binary output from a tokenized and parsed input stream, handling encoding complexity, relocation, alignment, and optimizations. This phase must conform strictly to the x86-64 ISA encoding rules and generate valid binary code ready for execution or linking.