Article by Ayman Alheraki on July 4 2025 11:58 AM
Designing and implementing a full assembler in C or C++ is an excellent practical exercise to consolidate the understanding of assembler architecture, data structures, parsing, code generation, and output formats. This section outlines a step-by-step approach to writing a minimal yet functional assembler targeting the x86-64 architecture, emphasizing modularity, clarity, and maintainability.
The sample assembler’s primary goals are:
Accept simple x86-64 assembly source code as input.
Parse and tokenize source lines into instructions, operands, and directives.
Build and maintain symbol tables and instruction tables.
Implement a two-pass assembly process for symbol resolution and code generation.
Output a relocatable object file in ELF64 format.
Include minimal error handling and diagnostics for usability.
Provide extensibility hooks for future features like macros or debugging info.
The assembler is designed as a learning tool rather than a production-ready solution, but it should follow modern C++17 or later standards to leverage features such as std::string_view, std::optional, and modern container classes for robustness.
The assembler is structured into distinct modules:
Lexer: Reads source lines and produces tokens. Handles whitespace, comments, identifiers, numbers, registers, and punctuation.
Parser: Analyzes token sequences to recognize instructions, directives, labels, and operands. Builds abstract syntax trees or intermediate representations.
Symbol Table: Maps labels and symbols to their addresses or values, supporting forward references.
Instruction Encoder: Converts parsed instructions and operands into machine code bytes.
Relocation Manager: Tracks references to unresolved symbols and prepares relocation entries.
Output Generator: Writes the ELF64 object file, including headers, sections, symbol tables, and relocation data.
Error Handler: Reports syntax, semantic, and resolution errors with line numbers and descriptions.
Each module is encapsulated in C++ classes or namespaces to promote maintainability.
The lexer reads source code line-by-line and produces tokens such as:
Identifiers: instruction mnemonics, labels, directives.
Numbers: decimal, hexadecimal, or binary literals.
Registers: recognizing valid x86-64 register names (e.g., rax, rdi).
Punctuation: commas, colons, brackets, operators.
Modern implementations use finite state machines or regular expressions for efficient tokenization. The lexer should support peeking and lookahead to facilitate parsing.
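The token categories above can be sketched with a small hand-written scanner. This is a minimal illustration, not the article's implementation: the names TokenKind, Token, isRegisterName, and tokenizeLine are hypothetical, the register list is deliberately tiny, and the sketch assumes that ';' starts a comment (a common assembler convention).

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical token kinds for this sketch; a real lexer would distinguish more.
enum class TokenKind { Identifier, Register, Number, Punctuation };

struct Token {
    TokenKind kind;
    std::string text;
};

// Tiny illustrative register list (case-sensitive); the full x86-64 set is larger.
inline bool isRegisterName(std::string_view s) {
    for (std::string_view r : {"rax", "rbx", "rcx", "rdx", "rsi", "rdi", "rsp", "rbp"})
        if (s == r) return true;
    return false;
}

// Tokenize one source line. Assumption: ';' starts a comment.
std::vector<Token> tokenizeLine(std::string_view line) {
    auto isAlnum = [](char ch) { return std::isalnum(static_cast<unsigned char>(ch)) != 0; };
    std::vector<Token> tokens;
    std::size_t i = 0;
    while (i < line.size()) {
        const char c = line[i];
        if (std::isspace(static_cast<unsigned char>(c))) { ++i; continue; }
        if (c == ';') break;  // the rest of the line is a comment
        if (std::isalpha(static_cast<unsigned char>(c)) || c == '_' || c == '.') {
            // Mnemonic, label, directive, or register name.
            const std::size_t start = i++;
            while (i < line.size() && (isAlnum(line[i]) || line[i] == '_' || line[i] == '.')) ++i;
            std::string text(line.substr(start, i - start));
            const TokenKind kind = isRegisterName(text) ? TokenKind::Register : TokenKind::Identifier;
            tokens.push_back({kind, std::move(text)});
            continue;
        }
        if (std::isdigit(static_cast<unsigned char>(c))) {
            // Covers spellings such as 42, 0x2A, and 0b1010; validation is left to the parser.
            const std::size_t start = i++;
            while (i < line.size() && isAlnum(line[i])) ++i;
            tokens.push_back({TokenKind::Number, std::string(line.substr(start, i - start))});
            continue;
        }
        // Everything else is treated as single-character punctuation.
        tokens.push_back({TokenKind::Punctuation, std::string(1, c)});
        ++i;
    }
    return tokens;
}
```

Producing the whole line's tokens up front makes peeking trivial (the parser simply indexes ahead), at the cost of buffering one line at a time.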
The parser consumes tokens from the lexer to identify:
Labels: identifiers followed by a colon.
Instructions: mnemonics with zero or more operands.
Directives: special assembler commands starting with a dot (.).
Operands: registers, immediates, memory references, or symbols.
Parsing can be implemented via recursive descent or table-driven parsers depending on complexity. The parser validates operand counts, addressing modes, and syntax consistency.
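As a rough sketch of the line-level grammar described above (optional label, then a mnemonic or directive, then comma-separated operands), the following function works on plain token strings to stay self-contained. ParsedLine and parseLine are hypothetical names, and operand validation against an instruction table is left out.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

// Hypothetical result of parsing one line.
struct ParsedLine {
    std::optional<std::string> label;   // set when the line starts with "name:"
    std::string mnemonic;               // empty for a label-only line
    bool isDirective = false;           // true when the mnemonic starts with '.'
    std::vector<std::string> operands;  // operand text, one entry per comma-separated operand
};

// Returns std::nullopt on a syntax error (empty operand or trailing comma).
std::optional<ParsedLine> parseLine(const std::vector<std::string>& tokens) {
    ParsedLine out;
    std::size_t pos = 0;

    // Label: an identifier immediately followed by a colon.
    if (tokens.size() >= 2 && tokens[1] == ":") {
        out.label = tokens[0];
        pos = 2;
    }
    if (pos == tokens.size()) return out;  // blank or label-only line

    out.mnemonic = tokens[pos++];
    out.isDirective = !out.mnemonic.empty() && out.mnemonic.front() == '.';

    // Operands: token runs separated by commas, e.g. "[", "rbx", "+", "8", "]".
    while (pos < tokens.size()) {
        if (tokens[pos] == ",") return std::nullopt;  // empty operand
        std::string operand;
        while (pos < tokens.size() && tokens[pos] != ",") operand += tokens[pos++];
        out.operands.push_back(std::move(operand));
        if (pos < tokens.size()) {
            ++pos;                                           // skip the comma
            if (pos == tokens.size()) return std::nullopt;  // trailing comma
        }
    }
    return out;
}
```

A fuller parser would classify each operand (register, immediate, memory reference, symbol) rather than keeping its text, but the control flow stays the same.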
During the first pass, labels are recorded in the symbol table with their corresponding offsets. Forward references (symbols used before definition) are recorded with placeholders, and backpatching is deferred until the second pass.
Symbol table entries include:
Symbol name (string).
Address or value (numeric).
Type (label, constant, external).
Linkage and visibility attributes.
Using a hash map or unordered map keyed by symbol names provides efficient lookup.
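One possible shape for such a table is sketched below. The field layout follows the entry list above, but the names (SymbolType, Binding, reference, define, lookup) and the "defined" flag used to mark forward-reference placeholders are assumptions made for illustration.

```cpp
#include <cstdint>
#include <optional>
#include <string>
#include <unordered_map>

enum class SymbolType { Label, Constant, External };
enum class Binding { Local, Global };

struct SymbolEntry {
    std::uint64_t value = 0;            // address or constant value
    SymbolType type = SymbolType::Label;
    Binding binding = Binding::Local;   // linkage/visibility attribute
    bool defined = false;               // false while only referenced (placeholder)
};

class SymbolTable {
    std::unordered_map<std::string, SymbolEntry> symbols_;

public:
    // Record a use of a symbol; creates an undefined placeholder on first sight.
    void reference(const std::string& name) { symbols_.try_emplace(name); }

    // Define a symbol; returns false if it was already defined (duplicate label).
    bool define(const std::string& name, std::uint64_t value,
                SymbolType type = SymbolType::Label) {
        SymbolEntry& entry = symbols_[name];
        if (entry.defined) return false;
        entry.value = value;
        entry.type = type;
        entry.defined = true;
        return true;
    }

    // Returns a copy of the entry, or std::nullopt if the name was never seen.
    std::optional<SymbolEntry> lookup(const std::string& name) const {
        auto it = symbols_.find(name);
        if (it == symbols_.end()) return std::nullopt;
        return it->second;
    }
};
```

After the first pass, any entry that is still undefined is either an error or an external symbol to be left for the linker, depending on how the source declared it.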
The assembler converts parsed instructions into binary machine code by:
Mapping mnemonics to opcode bytes.
Encoding operand types and addressing modes.
Generating prefixes (REX, operand size override) as needed.
Emitting ModR/M and SIB bytes for memory addressing.
Writing immediate values and displacement bytes.
Instruction encoding is a complex task due to x86-64's variable-length instructions and numerous addressing modes. Building an instruction table and helper functions to generate opcode sequences is recommended.
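To make the prefix and ModR/M steps concrete, here is a small helper that encodes only the register-to-register form of a 64-bit mov (opcode 0x89, "MOV r/m64, r64"). The enum Reg64 and the function encodeMovRegReg are hypothetical names; a real encoder would drive this from an instruction table and handle memory operands, SIB bytes, and immediates.

```cpp
#include <cstdint>
#include <vector>

// Hardware register numbers for the 64-bit general-purpose registers.
enum Reg64 : std::uint8_t {
    RAX = 0, RCX = 1, RDX = 2, RBX = 3, RSP = 4, RBP = 5, RSI = 6, RDI = 7,
    R8 = 8, R9 = 9, R10 = 10, R11 = 11, R12 = 12, R13 = 13, R14 = 14, R15 = 15
};

// Encode "mov dst, src" (Intel operand order) for two 64-bit registers.
std::vector<std::uint8_t> encodeMovRegReg(Reg64 dst, Reg64 src) {
    std::uint8_t rex = 0x48;        // 0100 1000: REX prefix with W=1 (64-bit operand size)
    if (src & 8) rex |= 0x04;       // REX.R supplies bit 3 of ModR/M.reg (the source here)
    if (dst & 8) rex |= 0x01;       // REX.B supplies bit 3 of ModR/M.rm (the destination here)

    // mod = 11 (register-direct), reg = low 3 bits of src, rm = low 3 bits of dst.
    std::uint8_t modrm = 0xC0 | ((src & 7) << 3) | (dst & 7);

    return {rex, 0x89, modrm};
}
```

For example, encodeMovRegReg(RAX, RBX) yields the bytes 48 89 D8, and encodeMovRegReg(R8, RAX) yields 49 89 C0 because the destination needs the REX.B extension bit.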
Symbols unresolved during the first pass are resolved in the second pass. The assembler:
Applies backpatches for all forward references in the generated code.
Generates relocation entries for external symbols or those resolved during linking.
Updates relocation tables accordingly.
Relocation data is critical for producing valid ELF object files usable by linkers.
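The second-pass logic can be sketched as follows for one common case: 32-bit PC-relative fields such as the displacement of jmp rel32 or call rel32. Fixup, Relocation, and applyFixups are hypothetical names. The sketch assumes every fixup field is 4 bytes long and ends its instruction, that all defined symbols live in the same section as the code, and that undefined symbols are externals; the addend of -4 is the usual convention for PC-relative 32-bit relocations on x86-64.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// A 4-byte PC-relative field recorded during the first pass.
struct Fixup {
    std::size_t offset;   // offset of the 4-byte field within the section
    std::string symbol;   // symbol the field refers to
};

// A relocation left for the linker (type would be PC-relative, e.g. R_X86_64_PC32).
struct Relocation {
    std::uint64_t offset;
    std::string symbol;
    std::int64_t addend;
};

// Patch fields whose symbol is defined locally; emit relocations for the rest.
std::vector<Relocation> applyFixups(
        std::vector<std::uint8_t>& code,
        const std::vector<Fixup>& fixups,
        const std::unordered_map<std::string, std::uint64_t>& defined) {
    std::vector<Relocation> relocs;
    for (const Fixup& f : fixups) {
        auto it = defined.find(f.symbol);
        if (it == defined.end()) {
            // Unknown here: assume external and let the linker fill the field in.
            relocs.push_back({f.offset, f.symbol, -4});
            continue;
        }
        // rel32 is measured from the end of the field (the next instruction).
        const std::int64_t rel = static_cast<std::int64_t>(it->second) -
                                 static_cast<std::int64_t>(f.offset + 4);
        const auto value = static_cast<std::uint32_t>(static_cast<std::int32_t>(rel));
        for (std::size_t i = 0; i < 4; ++i)  // little-endian store
            code[f.offset + i] = static_cast<std::uint8_t>(value >> (8 * i));
    }
    return relocs;
}
```

A production assembler would also range-check the displacement and choose the relocation type per instruction; both are omitted here.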
The assembler constructs the ELF64 object file including:
ELF header describing file type, target architecture, and entry point (unused and set to zero in a relocatable object).
Section headers for .text, .data, .bss, symbol tables, and relocation sections.
Raw machine code in .text and initialized data in .data.
Symbol tables recording defined and external symbols.
Relocation entries to support linker processing.
Binary output is written in standard ELF format with correct endianness (little-endian for x86-64) and alignment.
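As an illustration of the first item, the sketch below serializes the 64-byte ELF64 file header for a relocatable x86-64 object, byte by byte, so that the result does not depend on host endianness or struct padding. The helper names putLE and buildElfHeader are hypothetical; the field order and constants (ET_REL = 1, EM_X86_64 = 62) follow the ELF-64 header layout. Section headers, symbol tables, and relocation sections are not shown.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Append an unsigned integer in little-endian byte order.
template <typename T>
void putLE(std::vector<std::uint8_t>& out, T value) {
    for (std::size_t i = 0; i < sizeof(T); ++i)
        out.push_back(static_cast<std::uint8_t>(value >> (8 * i)));
}

// Build the ELF64 file header for a relocatable object.
// shoff: file offset of the section header table; shnum: number of section
// headers; shstrndx: index of the section-name string table.
std::vector<std::uint8_t> buildElfHeader(std::uint64_t shoff,
                                         std::uint16_t shnum,
                                         std::uint16_t shstrndx) {
    // e_ident: magic, ELFCLASS64 (2), ELFDATA2LSB (1), EV_CURRENT (1), System V ABI (0).
    std::vector<std::uint8_t> h = {0x7F, 'E', 'L', 'F', 2, 1, 1, 0};
    h.resize(16, 0);                   // remaining e_ident bytes are padding

    putLE<std::uint16_t>(h, 1);        // e_type      = ET_REL
    putLE<std::uint16_t>(h, 62);       // e_machine   = EM_X86_64
    putLE<std::uint32_t>(h, 1);        // e_version   = EV_CURRENT
    putLE<std::uint64_t>(h, 0);        // e_entry     (none in an object file)
    putLE<std::uint64_t>(h, 0);        // e_phoff     (no program headers)
    putLE<std::uint64_t>(h, shoff);    // e_shoff
    putLE<std::uint32_t>(h, 0);        // e_flags
    putLE<std::uint16_t>(h, 64);       // e_ehsize    (size of this header)
    putLE<std::uint16_t>(h, 0);        // e_phentsize
    putLE<std::uint16_t>(h, 0);        // e_phnum
    putLE<std::uint16_t>(h, 64);       // e_shentsize (size of one section header)
    putLE<std::uint16_t>(h, shnum);    // e_shnum
    putLE<std::uint16_t>(h, shstrndx); // e_shstrndx
    return h;
}
```

The same putLE helper can serialize section headers, symbol entries, and relocation entries, which keeps all endianness handling in one place.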
Implement comprehensive error reporting to aid users and developers:
Syntax errors: unexpected tokens, missing operands.
Semantic errors: invalid registers, undefined symbols.
Linkage errors: unresolved externals.
Provide source line numbers and clear messages.
Modern C++ exception handling or error codes combined with logging can be employed.
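One lightweight way to follow these guidelines with error codes rather than exceptions is to collect diagnostics as they occur and print them at the end. The sketch below is an assumption-laden example, not a prescribed design: ErrorKind, Diagnostic, Diagnostics, and formatDiagnostic are hypothetical names, and the "file:line: kind error: message" layout is just one familiar convention.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

enum class ErrorKind { Syntax, Semantic, Linkage };

struct Diagnostic {
    ErrorKind kind;
    std::size_t line;      // 1-based source line number
    std::string message;
};

// Collects diagnostics so assembly can continue and report several errors at once.
class Diagnostics {
    std::vector<Diagnostic> items_;

public:
    void report(ErrorKind kind, std::size_t line, std::string message) {
        items_.push_back({kind, line, std::move(message)});
    }
    bool hasErrors() const { return !items_.empty(); }
    const std::vector<Diagnostic>& all() const { return items_; }
};

inline const char* kindName(ErrorKind kind) {
    switch (kind) {
        case ErrorKind::Syntax:   return "syntax";
        case ErrorKind::Semantic: return "semantic";
        case ErrorKind::Linkage:  return "linkage";
    }
    return "unknown";
}

// Render one diagnostic as "file:line: kind error: message".
std::string formatDiagnostic(const Diagnostic& d, const std::string& file) {
    return file + ":" + std::to_string(d.line) + ": " + kindName(d.kind) +
           " error: " + d.message;
}
```

Passing a Diagnostics reference to the lexer, parser, and encoder lets each stage report problems without deciding how they are displayed.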
Utilizing modern tools improves development efficiency:
Use unit tests for lexer, parser, and encoder components.
Employ debug builds with verbose logging.
Profile to locate and remove performance bottlenecks, especially in parsing and encoding.
Modular design enables incremental feature addition like macros or debug info generation.
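#include <cstdint>
#include <istream>
#include <string>
#include <unordered_map>
#include <vector>

// Placeholder types so the interfaces below compile; real definitions carry more detail.
struct Token {};
struct Instruction {};
struct SymbolEntry {};

// Simplified lexer interface
class Lexer {
public:
    explicit Lexer(std::istream& input);
    Token getNextToken();        // consume and return the next token
    Token peekToken() const;     // look at the next token without consuming it
    // ...
};

// Parser example
class Parser {
    Lexer& lexer;
public:
    explicit Parser(Lexer& lex) : lexer(lex) {}
    bool parseInstruction(Instruction& instr);  // returns false on a syntax error
    bool parseLabel(std::string& label);        // returns false if no label is present
    // ...
};

// Symbol Table
class SymbolTable {
    std::unordered_map<std::string, SymbolEntry> symbols;
public:
    void defineSymbol(const std::string& name, std::uint64_t address);
    bool lookupSymbol(const std::string& name, SymbolEntry& entry) const;
    // ...
};

// Instruction Encoder
class Encoder {
public:
    std::vector<std::uint8_t> encode(const Instruction& instr, const SymbolTable& symtab);
    // ...
};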
The above classes illustrate modularity; each can be expanded with error handling, detailed parsing rules, and encoding logic.
Writing a complete assembler in C/C++ involves coordinating multiple complex components, including lexical analysis, parsing, symbol management, instruction encoding, relocation, and output generation. A clean, modular design paired with modern C++ features facilitates maintainability and extensibility. The result is a functional x86-64 assembler capable of converting human-readable assembly code into machine-level object files, providing a strong foundation for deeper exploration or production-grade development.