Article by Ayman Alheraki on January 11 2026 10:37 AM
A formal grammar definition is fundamental for designing an assembler’s parser. This appendix provides a comprehensive grammar specification for a sample x86-64 assembler, covering lexical tokens, instruction syntax, operand types, directives, and macros. The grammar is designed to be both expressive and extensible, facilitating efficient parsing and error checking in assembler implementations.
Identifiers: Names for labels, macros, variables, and symbols. Must begin with a letter or underscore, followed by letters, digits, or underscores.
Numbers:
Decimal: Sequence of digits without prefixes.
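Hexadecimal: Prefixed with 0x or suffixed with h (case-insensitive). A suffixed literal must begin with a decimal digit (e.g., 0FFh); otherwise it would match the identifier rule above.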
Binary: Prefixed with 0b or suffixed with b.
Octal: Prefixed with 0o or suffixed with o.
Strings: Enclosed in double quotes ("..."), supporting common escape sequences.
Comments:
Line comments start with ; or # and extend to the line end.
Block comments are enclosed between /* and */.
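The numeric literal rules above can be sketched as a small conversion routine. This is a minimal illustration, not the assembler's actual lexer: the name parse_number, the order in which the forms are tried, and the requirement that suffixed literals start with a digit are all assumptions made for the example.

```python
import re

# Hypothetical table of literal forms. Prefixed forms are tried first so that
# "0b101" is read as binary rather than as a malformed suffixed literal.
_NUMBER_FORMS = [
    (re.compile(r"0x([0-9a-f]+)", re.I), 16),        # 0x1F
    (re.compile(r"0b([01]+)", re.I), 2),             # 0b101
    (re.compile(r"0o([0-7]+)", re.I), 8),            # 0o17
    (re.compile(r"([0-9][0-9a-f]*)h", re.I), 16),    # 1Fh (must start with a digit)
    (re.compile(r"([0-7]+)o", re.I), 8),             # 17o
    (re.compile(r"([01]+)b", re.I), 2),              # 101b
    (re.compile(r"([0-9]+)"), 10),                   # 42
]

def parse_number(text):
    """Return the integer value of a numeric literal, or raise ValueError."""
    for pattern, base in _NUMBER_FORMS:
        match = pattern.fullmatch(text)
        if match:
            return int(match.group(1), base)
    raise ValueError(f"not a numeric literal: {text!r}")
```

Note the ambiguity the ordering has to resolve: 0bh does not fit the 0b prefix form, so it falls through to the hexadecimal suffix form and is read as hex 0B. A token such as ah never reaches this routine as a number because it begins with a letter and is therefore an identifier.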
The grammar is context-free, specified in Extended Backus-Naur Form (EBNF)-like notation for clarity.
program ::= { line }
line ::= [ label ] statement [ comment ] newline
label ::= identifier ':'
statement ::= instruction | directive | macro_definition | macro_invocation | empty
instruction ::= mnemonic [ operand_list ]
mnemonic ::= identifier
operand_list ::= operand { ',' operand }
operand ::= register | immediate | memory | label_reference | expression
directive ::= '.' identifier [ parameters ]
parameters ::= parameter { ',' parameter }
parameter ::= identifier | immediate | string | expression
macro_definition ::= '%macro' identifier parameters newline { line } '%endmacro'
macro_invocation ::= identifier [ operand_list ]
register ::= (register names set defined per ISA)
immediate ::= number | expression
memory ::= '[' expression ']'
label_reference ::= identifier
expression ::= ... (supporting arithmetic, symbols, and addressing modes)
comment ::= ';' .* | '#' .*
newline ::= '\n' | '\r\n'

Instructions start with a mnemonic token, case-insensitive.
Operands may include registers, immediate values, memory references, or labels.
Memory operands support complex addressing modes, such as base + index * scale + displacement.
Operand size can be indicated by a mnemonic suffix (e.g., movb, movw, movl, movq) or by size overrides through directives.
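As a rough sketch of the instruction rules above, the following code splits one source line into label, mnemonic, and operand list, and then separates a size suffix from the mnemonic. The function names, the tiny BASE_MNEMONICS set, and the simplification that comment markers never occur inside string literals are assumptions made for this example.

```python
import re

# Hypothetical data: operand-size suffixes and a tiny set of known mnemonics.
SUFFIX_BITS = {"b": 8, "w": 16, "l": 32, "q": 64}
BASE_MNEMONICS = {"mov", "add", "sub", "push", "pop", "ret"}

_LABEL = re.compile(r"\s*([A-Za-z_][A-Za-z0-9_]*)\s*:")

def parse_line(line):
    """Split a line into (label, mnemonic, operands).

    Simplification: ';' and '#' are assumed never to occur inside strings.
    """
    for marker in (";", "#"):
        line = line.split(marker, 1)[0]
    label = None
    match = _LABEL.match(line)
    if match:
        label = match.group(1)
        line = line[match.end():]
    parts = line.split(None, 1)
    if not parts:
        return label, None, []            # empty statement
    mnemonic = parts[0].lower()           # mnemonics are case-insensitive
    operands = [op.strip() for op in parts[1].split(",")] if len(parts) > 1 else []
    return label, mnemonic, operands

def split_size_suffix(mnemonic):
    """Return (base mnemonic, operand size in bits or None)."""
    if mnemonic in BASE_MNEMONICS:
        return mnemonic, None
    if mnemonic[:-1] in BASE_MNEMONICS and mnemonic[-1] in SUFFIX_BITS:
        return mnemonic[:-1], SUFFIX_BITS[mnemonic[-1]]
    raise ValueError(f"unknown mnemonic: {mnemonic}")
```

The known mnemonics are checked before the suffix is stripped so that a name such as sub, which happens to end in b, is not misread as a byte-sized variant of an instruction called su.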
Registers: Full coverage of x86-64 registers (e.g., rax, rcx, xmm0).
Immediate values: Numeric constants or expressions that evaluate at assembly time.
Memory operands:
Format: [base + index * scale + displacement].
Every component is optional, but at least one must be present. When a scale is given, it must be 1, 2, 4, or 8.
Support for segment overrides and RIP-relative addressing.
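The memory operand format can be illustrated with a small parser that breaks the bracketed text into base, index, scale, and displacement. This is a sketch under several assumptions: the MemOperand type, the parse_memory name, and the short REGISTERS set are hypothetical, displacements are limited to decimal or 0x-prefixed constants, and segment overrides and the extra restrictions on rsp and rip are left out.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Assumption: a small illustrative subset of the 64-bit registers.
REGISTERS = {"rax", "rbx", "rcx", "rdx", "rsi", "rdi", "rbp", "rsp", "rip"}

@dataclass
class MemOperand:
    base: Optional[str] = None
    index: Optional[str] = None
    scale: int = 1
    disp: int = 0

def parse_memory(text):
    """Parse '[base + index*scale + disp]' into a MemOperand."""
    text = text.strip()
    if not (text.startswith("[") and text.endswith("]")):
        raise ValueError("memory operand must be enclosed in [ ]")
    body = "".join(text[1:-1].split()).lower()       # drop all whitespace
    # Split into signed terms, e.g. "rax+rbx*4-8" -> rax, +rbx*4, -8.
    terms = re.findall(r"([+-]?)([^+-]+)", body)
    if not terms or "".join(s + t for s, t in terms) != body:
        raise ValueError(f"malformed memory operand: {text}")
    mem = MemOperand()
    for sign, term in terms:
        if "*" in term:                              # index * scale
            left, _, right = term.partition("*")
            reg, factor = (left, right) if left in REGISTERS else (right, left)
            if sign == "-" or reg not in REGISTERS or mem.index is not None:
                raise ValueError(f"invalid index term: {term}")
            if factor not in ("1", "2", "4", "8"):
                raise ValueError(f"scale must be 1, 2, 4, or 8: {term}")
            mem.index, mem.scale = reg, int(factor)
        elif term in REGISTERS:                      # base, or unscaled index
            if sign == "-":
                raise ValueError(f"register cannot be negated: {term}")
            if mem.base is None:
                mem.base = term
            elif mem.index is None:
                mem.index = term
            else:
                raise ValueError("too many registers in memory operand")
        else:                                        # numeric displacement
            value = int(term, 16) if term.startswith("0x") else int(term, 10)
            mem.disp += -value if sign == "-" else value
    return mem
```

For example, parse_memory("[rax + rbx*4 - 8]") yields base rax, index rbx, scale 4, and displacement -8. A full assembler would route the displacement through its expression evaluator so that labels and relocatable symbols are accepted as well.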
Expressions:
Arithmetic operations: +, -, *, /, %.
Parentheses for precedence.
Symbol resolution at assembly time.
Support for relocatable expressions with labels.
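The expression rules above map naturally onto a recursive-descent evaluator with one function per precedence level. The sketch below is illustrative only: the names tokenize and evaluate are hypothetical, symbols are looked up in a plain dictionary, division uses Python's floor semantics, and relocatable expressions are not modeled.

```python
import re

_TOKEN = re.compile(r"\s*(0x[0-9A-Fa-f]+|[0-9]+|[A-Za-z_][A-Za-z0-9_]*|[-+*/%()])")

def tokenize(text):
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        match = _TOKEN.match(text, pos)
        if not match:
            raise ValueError(f"unexpected character near column {pos + 1}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens

def evaluate(text, symbols):
    """Evaluate an assembly-time expression using the given symbol table."""
    tokens = tokenize(text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        token = peek()
        pos += 1
        return token

    def primary():                        # numbers, symbols, (...), unary minus
        token = advance()
        if token is None:
            raise ValueError("unexpected end of expression")
        if token == "(":
            value = additive()
            if advance() != ")":
                raise ValueError("missing ')'")
            return value
        if token == "-":
            return -primary()
        if token[0].isdigit():
            return int(token, 16) if token.lower().startswith("0x") else int(token)
        if token in symbols:
            return symbols[token]
        raise ValueError(f"undefined symbol or unexpected token: {token}")

    def multiplicative():                 # *, /, % bind tighter than + and -
        value = primary()
        while peek() in ("*", "/", "%"):
            op = advance()
            rhs = primary()
            if op == "*":
                value *= rhs
            elif rhs == 0:
                raise ValueError("division by zero")
            elif op == "/":
                value //= rhs
            else:
                value %= rhs
        return value

    def additive():
        value = multiplicative()
        while peek() in ("+", "-"):
            op = advance()
            rhs = multiplicative()
            value = value + rhs if op == "+" else value - rhs
        return value

    result = additive()
    if peek() is not None:
        raise ValueError(f"unexpected token: {peek()}")
    return result
```

With symbols = {"base": 0x1000, "count": 3}, the expression base + 4 * (count - 1) evaluates to 4104. Forward references to labels would need a second pass or a deferred-evaluation mechanism, which this sketch leaves out.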
Directives start with a dot (.) and give instructions to the assembler itself, such as:
.data, .text, .bss for section definitions.
.byte, .word, .long, .quad for data allocation.
.ascii, .asciz for string data.
.globl, .extern for symbol visibility.
.equ for symbol definitions.
Parameters follow the directive name as a comma-separated list, as defined by the parameters rule in the grammar.
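To make the data directives concrete, the sketch below turns already-evaluated parameters into bytes. The emit_data function is hypothetical. The sizes (.byte 1, .word 2, .long 4, .quad 8 bytes), little-endian order, and the zero byte that .asciz appends after each string follow GNU assembler conventions for x86-64, which this sample grammar is assumed to share.

```python
# Sizes in bytes, following GNU assembler conventions on x86-64 (assumed here).
DATA_SIZES = {".byte": 1, ".word": 2, ".long": 4, ".quad": 8}

def emit_data(directive, params):
    """Encode the parameters of a data directive as little-endian bytes.

    `params` holds integers for .byte/.word/.long/.quad and already-unescaped
    strings for .ascii/.asciz.
    """
    if directive in DATA_SIZES:
        size = DATA_SIZES[directive]
        bits = 8 * size
        out = bytearray()
        for value in params:
            # Accept both signed and unsigned values that fit in the field.
            if not -(1 << (bits - 1)) <= value < (1 << bits):
                raise ValueError(f"value {value} does not fit in {directive}")
            out += (value & ((1 << bits) - 1)).to_bytes(size, "little")
        return bytes(out)
    if directive in (".ascii", ".asciz"):
        terminator = b"\x00" if directive == ".asciz" else b""
        return b"".join(s.encode("latin-1") + terminator for s in params)
    raise ValueError(f"unsupported directive: {directive}")
```

For instance, emit_data(".word", [0x1234, -1]) produces the bytes 34 12 FF FF. Section and symbol directives such as .text or .globl change assembler state rather than emitting bytes, so they would be handled elsewhere.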
Macros are defined with %macro and terminated by %endmacro.
Support for positional parameters, optional parameters with defaults, and local variables.
Macro invocations match the macro name and accept argument lists.
Recursive macros and nested macro calls are supported.
Macro expansion occurs before instruction parsing to allow complex syntactic abstractions.
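A minimal sketch of such a pre-parse expansion pass is shown below. It follows the macro_definition rule in using named parameters after the macro name. The expand_macros function and its max_depth limit are assumptions for illustration; default values, local variables, and labels placed before an invocation are not handled.

```python
import re

_IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")

def expand_macros(lines, max_depth=16):
    """Return the source lines with macro definitions removed and calls expanded."""
    macros = {}       # name -> (parameter names, body lines)
    output = []
    current = None    # (name, params, body) while inside a definition

    def expand(line, depth):
        parts = line.split(None, 1)
        if not parts or parts[0] not in macros:
            return [line]                              # ordinary line
        if depth >= max_depth:
            raise RecursionError(f"macro nesting too deep at '{parts[0]}'")
        params, body = macros[parts[0]]
        args = [a.strip() for a in parts[1].split(",")] if len(parts) > 1 else []
        if len(args) != len(params):
            raise ValueError(f"'{parts[0]}' expects {len(params)} argument(s)")
        mapping = dict(zip(params, args))
        result = []
        for body_line in body:
            # Replace whole identifiers that name a parameter.
            substituted = _IDENT.sub(
                lambda m: mapping.get(m.group(0), m.group(0)), body_line)
            result.extend(expand(substituted, depth + 1))   # nested invocations
        return result

    for raw in lines:
        line = raw.strip()
        if line.startswith("%macro"):
            header = line.split(None, 2)               # '%macro', name, 'a, b'
            if len(header) < 2 or current is not None:
                raise ValueError("malformed or nested %macro")
            params = [p.strip() for p in header[2].split(",")] if len(header) > 2 else []
            current = (header[1], params, [])
        elif line == "%endmacro":
            if current is None:
                raise ValueError("%endmacro without %macro")
            macros[current[0]] = (current[1], current[2])
            current = None
        elif current is not None:
            current[2].append(line)                    # collect body line
        else:
            output.extend(expand(line, 0))
    if current is not None:
        raise ValueError(f"missing %endmacro for '{current[0]}'")
    return output
```

Because the output is a flat list of ordinary lines, the instruction parser never sees macro syntax. The depth limit is one simple way to stop runaway recursion; a macro system with conditional assembly could instead let recursive macros terminate on their own.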
Grammar includes robust handling of common syntax errors:
Missing operands.
Invalid operand types.
Incorrect directive usage.
Allows partial parsing: an error is isolated to the offending line, and parsing resumes at the next line so that assembly of the remaining lines continues.
Incorporates meaningful diagnostic messages tied to line and column numbers.
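Line-level recovery of this kind can be sketched as a loop that records a diagnostic and moves on instead of stopping at the first error. The Diagnostic type, the check_source function, and the small OPERAND_COUNTS table are hypothetical; a real assembler would drive this from its full instruction tables and parser.

```python
from dataclasses import dataclass

@dataclass
class Diagnostic:
    line: int       # 1-based line number
    column: int     # 1-based column of the mnemonic
    message: str

# Hypothetical table: number of operands each mnemonic expects.
OPERAND_COUNTS = {"mov": 2, "add": 2, "push": 1, "ret": 0}

def check_source(source):
    """Check every line and collect diagnostics instead of stopping early."""
    diagnostics = []
    for line_no, raw in enumerate(source.splitlines(), start=1):
        code = raw.split(";", 1)[0]                   # drop a line comment
        stripped = code.strip()
        if not stripped:
            continue
        column = len(code) - len(code.lstrip()) + 1
        parts = stripped.split(None, 1)
        mnemonic = parts[0].lower()
        if mnemonic not in OPERAND_COUNTS:
            diagnostics.append(
                Diagnostic(line_no, column, f"unknown mnemonic '{parts[0]}'"))
            continue                                  # recover at the next line
        operands = [op.strip() for op in parts[1].split(",")] if len(parts) > 1 else []
        found = len([op for op in operands if op])    # ignore empty operands
        expected = OPERAND_COUNTS[mnemonic]
        if found != expected:
            diagnostics.append(Diagnostic(
                line_no, column,
                f"'{mnemonic}' expects {expected} operand(s), found {found}"))
    return diagnostics
```

Because assembly source is line-oriented, the newline is a natural synchronization point: one bad line produces one diagnostic, and the lines after it are still checked.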
The grammar is designed for easy extension to support:
New instruction sets or ISA extensions.
Custom directives and pseudo-instructions.
Additional macro capabilities.
Modular design separates core grammar from architecture-specific extensions.
This grammar specification serves as a foundation for building a reliable, maintainable, and efficient assembler parser for the x86-64 architecture. It balances complexity with usability, ensuring that assembler writers can adapt it to evolving instruction sets and toolchain requirements.