Larc Assembly Language for CS 220

An Assembly Language for Larc

This document describes a basic assembly language for the Larc model computer. This language is to be implemented in an assembler named Lasm, written in Java. The language is also referred to as "Lasm," meaning "Larc Assembler."

In the Lasm assembly language, registers are referred to using a '$' followed by a decimal number: $0, $1, $2, ..., $15. Numbers can be specified in decimal form, including an optional minus sign, or in hexadecimal form, starting with 0x or 0X (and no minus sign). Instructions are always specified by mnemonics. Mnemonics are not case-sensitive.
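
The rules above for registers and numbers could be checked with code along the following lines. This is only a sketch: the class name OperandParser and its method names are made up for illustration, and limiting a register number to at most two digits is an assumption.

```java
// Hypothetical helper class; the names here are illustrative, not part of Lasm.
final class OperandParser {

    // Parses a register token such as "$7" and returns its number (0 to 15).
    static int parseRegister(String token) {
        if (!token.matches("\\$[0-9]{1,2}")) {
            throw new IllegalArgumentException("Not a register: " + token);
        }
        int n = Integer.parseInt(token.substring(1));
        if (n > 15) {
            throw new IllegalArgumentException("No such register: " + token);
        }
        return n;
    }

    // Parses a decimal number (optional minus sign) or a hexadecimal number
    // (0x or 0X prefix, no minus sign). Range checks are left to the caller,
    // since the legal range depends on the instruction or directive.
    static int parseNumber(String token) {
        if (token.matches("0[xX][0-9a-fA-F]+")) {
            return Integer.parseInt(token.substring(2), 16);
        }
        if (token.matches("-?[0-9]+")) {
            return Integer.parseInt(token);
        }
        throw new IllegalArgumentException("Not a number: " + token);
    }
}
```

Numbers too large for a Java int make Integer.parseInt throw NumberFormatException, which is a subclass of IllegalArgumentException, so a caller can treat every failure the same way.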

Registers $12, $13, $14, and $15 cannot be used in assembly language programs. Registers $12 and $13 can be used by the assembler in its implementation of pseudoinstructions. Registers $14 and $15 are reserved for the operating system.

Certain instructions assume that register $10 is the stack pointer, and certain instructions assume that register $11 is the link register, which holds the return address for a subroutine call. These uses are not enforced on the machine language level, but they are used by the assembler.

In addition to instructions, a program can include comments, labels, and directives. A comment begins with the character # and extends to the end of the line. Empty lines and lines containing only a comment are ignored. A comment can also occur after other items on a line.
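
Removing a comment from a line might look like the sketch below. It assumes that a # inside a double-quoted string (as used by the .asciz directive) does not start a comment; this document does not say so explicitly. The class name CommentStripper is hypothetical.

```java
// Hypothetical helper; assumes '#' inside a quoted string is ordinary text.
final class CommentStripper {

    // Returns the line with any trailing comment removed.
    static String strip(String line) {
        boolean inString = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inString) {
                if (c == '\\') {
                    i++;               // skip the character after a backslash
                } else if (c == '"') {
                    inString = false;  // closing quote
                }
            } else if (c == '"') {
                inString = true;       // opening quote
            } else if (c == '#') {
                return line.substring(0, i);
            }
        }
        return line;                   // no comment found
    }
}
```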

A label name must begin with a letter and can contain only letters, digits, and the underscore character. Labels are case-sensitive. A label name followed by a colon can be used to label a position in a program. Such a label can occur on a line by itself or preceding other content on a line. It is OK to have several such labels consecutively, even on the same line; all of those labels mark the same position in the program. Label names are also used in certain instructions, such as branch instructions.
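
As an illustration, the labels at the start of a line could be collected as in the sketch below. The class name LabelScanner is hypothetical. The sketch assumes that "letter" means an ASCII letter, that whitespace is tolerated between a label name and its colon, and that any comment has already been removed from the line.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; names and whitespace handling are assumptions.
final class LabelScanner {

    // One label definition: optional whitespace, a name, optional whitespace, a colon.
    private static final Pattern LABEL_DEF =
            Pattern.compile("\\s*([A-Za-z][A-Za-z0-9_]*)\\s*:");

    // Returns the labels defined at the start of the line, in order.
    static List<String> leadingLabels(String line) {
        List<String> labels = new ArrayList<>();
        Matcher m = LABEL_DEF.matcher(line);
        int pos = 0;
        while (true) {
            m.region(pos, line.length());
            if (!m.lookingAt()) {
                break;             // the rest of the line is not a label definition
            }
            labels.add(m.group(1));
            pos = m.end();         // continue right after the colon
        }
        return labels;
    }
}
```

All labels returned for one line would be entered in the symbol table with the same address.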

The supported directives are .asciz, .word, and .space; they are case-sensitive. A directive must be followed by whitespace, then the data for the directive on the same line:

A .word directive is followed by a 16-bit integer, in the range -32768 to 65535 (or 0x0 to 0xFFFF in hexadecimal). This directive takes up one memory location in the assembled program, and the given number is placed into that location.
A .space directive is followed by a positive integer. It takes up one or more locations in memory, with the number of locations given by the integer. Those locations are occupied by zeros in the assembled program.
An .asciz directive is followed by a string in double quotes. The string can contain the special escaped character sequences \", \n, and \\, representing a double quote, a line feed, and a single backslash. No other escape sequences are allowed. Except for the escape sequences, the string can contain only printable ASCII characters (with ASCII codes in the range 32 to 126). If the length of the string is N, it will take up N+1 locations in memory. Those locations contain the ASCII codes for the characters in the string, one character per location, followed by a zero to mark the end of the string.
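
The memory contents produced by the three directives could be computed as in the following sketch. The class name DirectiveEncoder and its methods are hypothetical; memory locations are represented as Java ints holding 16-bit values, and encodeAsciz is assumed to receive only the text between the double quotes.

```java
import java.util.Arrays;

// Hypothetical helper; one int per 16-bit memory location.
final class DirectiveEncoder {

    // .word: accepts -32768 to 65535 and returns the 16-bit pattern to store.
    static int encodeWord(int value) {
        if (value < -32768 || value > 65535) {
            throw new IllegalArgumentException("Out of range for .word: " + value);
        }
        return value & 0xFFFF;
    }

    // .space: the given number of zero-filled locations; the count must be positive.
    static int[] encodeSpace(int count) {
        if (count <= 0) {
            throw new IllegalArgumentException("Bad count for .space: " + count);
        }
        return new int[count];
    }

    // .asciz: body is the text between the quotes, with escapes not yet processed.
    static int[] encodeAsciz(String body) {
        int[] out = new int[body.length() + 1];
        int n = 0;
        for (int i = 0; i < body.length(); i++) {
            char c = body.charAt(i);
            if (c == '\\') {
                if (i + 1 >= body.length()) {
                    throw new IllegalArgumentException("Dangling backslash");
                }
                char e = body.charAt(++i);
                switch (e) {
                    case '"':  c = '"';  break;
                    case 'n':  c = '\n'; break;
                    case '\\': c = '\\'; break;
                    default:
                        throw new IllegalArgumentException("Bad escape: \\" + e);
                }
            } else if (c == '"' || c < 32 || c > 126) {
                throw new IllegalArgumentException("Illegal character in string");
            }
            out[n++] = c;
        }
        // Keep the n character codes plus one extra slot, which is still 0:
        // that slot is the terminating zero.
        return Arrays.copyOf(out, n + 1);
    }
}
```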

A line can contain at most one instruction or directive, and an instruction or directive cannot extend over more than one line. Whitespace is insignificant, except that there must be whitespace between a directive and its data, between an instruction name and its arguments, and between the arguments of an instruction.

The assembly language has 14 instructions that correspond directly to machine language instructions. In this table, $ra, $rb, and $rc must be replaced by legal register names:

add $ra $rb $rc	Reg[ra] = Reg[rb] + Reg[rc]
sub $ra $rb $rc	Reg[ra] = Reg[rb] - Reg[rc]
mul $ra $rb $rc	Reg[ra] = Reg[rb] * Reg[rc]
div $ra $rb $rc	Reg[ra] = Reg[rb] / Reg[rc]
sll $ra $rb $rc	Reg[ra] = Reg[rb] << Reg[rc]
srl $ra $rb $rc	Reg[ra] = Reg[rb] >>> Reg[rc]
nor $ra $rb $rc	Reg[ra] = ~(Reg[rb] | Reg[rc])
slt $ra $rb $rc	Reg[ra] = (Reg[rb] < Reg[rc])? 1 : 0
li $ra LIMM	Reg[ra] = sign_extend(LIMM)	LIMM is an integer in the range -128 to 255, but values from 128 to 255 are converted to the corresponding negative 8-bit number
lui $ra LIMM	Reg[ra] = LIMM << 8	LIMM is an integer in the range -128 to 255, but values from 128 to 255 are converted to the corresponding negative 8-bit number
lw $ra SIMM($rb)	Reg[ra] = Mem[ Reg[rb] + sign_extend(SIMM) ]	SIMM is an integer in the range -8 to 15, but values from 8 to 15 are converted to the corresponding negative 4-bit number
sw $ra SIMM($rb)	Mem[ Reg[rb] + sign_extend(SIMM) ] = Reg[ra]	SIMM is an integer in the range -8 to 15, but values from 8 to 15 are converted to the corresponding negative 4-bit number
jalr $ra $rb	temp = PC; PC = Reg[rb]; Reg[ra] = temp
syscall	call a system trap
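
The handling of LIMM and SIMM values in the table can be made concrete with a little bit manipulation. The sketch below is illustrative only; the class name Immediates and its methods are hypothetical. Register values are represented as 16-bit patterns stored in Java ints.

```java
// Hypothetical helper; values are 16-bit patterns held in Java ints.
final class Immediates {

    // Encodes an LIMM (-128 to 255) as the 8-bit field of an instruction.
    static int encodeLimm(int value) {
        if (value < -128 || value > 255) {
            throw new IllegalArgumentException("LIMM out of range: " + value);
        }
        return value & 0xFF;
    }

    // Encodes a SIMM (-8 to 15) as the 4-bit field of an instruction.
    static int encodeSimm(int value) {
        if (value < -8 || value > 15) {
            throw new IllegalArgumentException("SIMM out of range: " + value);
        }
        return value & 0xF;
    }

    // Sign-extends an 8-bit field to a 16-bit register value.
    static int signExtend8(int field) {
        return ((byte) field) & 0xFFFF;
    }

    // Sign-extends a 4-bit field to a 16-bit register value.
    static int signExtend4(int field) {
        int signed = ((field & 0xF) ^ 0x8) - 0x8;  // -8 to 7
        return signed & 0xFFFF;
    }
}
```

For example, 200 and -56 encode to the same 8-bit field, which is why "li $1 200" leaves the same value in the register as "li $1 -56".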

The two branch instructions, which take numerical LIMM values as offsets in machine language, take labels instead in assembly language. However, the address of the instruction plus one, minus the address of the label, must still be in the range -127 to 128, so that the offset to the label fits in a signed 8-bit LIMM. This ensures that the instruction corresponds to a single machine language instruction.

beqz $ra Label	if Reg[ra] == 0, then PC = address of Label
bnez $ra Label	if Reg[ra] != 0, then PC = address of Label
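
Computing the LIMM for a branch might look like the following sketch. It assumes that the machine adds the sign-extended LIMM to the address of the instruction plus one, which is how the range condition above is read here; the class name BranchOffset is hypothetical.

```java
// Hypothetical helper; assumes target = (instruction address + 1) + sign_extend(LIMM).
final class BranchOffset {

    // Returns the 8-bit LIMM field for a branch at instructionAddress to labelAddress.
    static int limmFor(int instructionAddress, int labelAddress) {
        int offset = labelAddress - (instructionAddress + 1);
        if (offset < -128 || offset > 127) {
            throw new IllegalArgumentException(
                    "Branch target out of range, offset = " + offset);
        }
        return offset & 0xFF;
    }
}
```

An assembler would report the out-of-range case as an error, since such a branch cannot be encoded as a single machine language instruction.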

The assembly language also introduces several "pseudoinstructions," which must be implemented using one or more machine language instructions. The implementation can make use of the assembler registers $12 and $13, if necessary.

la $ra Label	load address	Reg[ra] = address of Label
lbi $ra BIMM	load big immediate	Reg[ra] = BIMM	BIMM is an integer in the range -32768 to 65535 (or 0x0 to 0xFFFF in hexadecimal)
lwa $ra Label	load word from address	Reg[ra] = Mem[address of Label]
swa $ra Label	store word to address	Mem[address of Label] = Reg[ra]
mov $ra $rb	move (actually "copy")	Reg[ra] = Reg[rb]
b Label	branch	Reg[12] = address of Label, then jalr $0 $12
bl Label	branch and link (call subroutine)	Reg[12] = address of Label, then jalr $11 $12
ret	return from subroutine	jalr $0 $11
push $ra	push onto stack	subtract 1 from register 10, then sw $ra 0($10)
pop $ra	pop from stack	lw $ra 0($10), then add 1 to register 10
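
As an example of how a pseudoinstruction can be turned into basic instructions, here is one possible expansion of push and pop, using the assembler register $12 to hold the constant 1. This is only a sketch of one option; the class name StackExpander and its methods are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper; one possible expansion, using $12 as a scratch register.
final class StackExpander {

    // push: subtract 1 from $10, then store the register at the new top of stack.
    static List<String> expandPush(String reg) {
        return Arrays.asList(
                "li $12 1",
                "sub $10 $10 $12",
                "sw " + reg + " 0($10)");
    }

    // pop: load the register from the top of stack, then add 1 to $10.
    static List<String> expandPop(String reg) {
        return Arrays.asList(
                "lw " + reg + " 0($10)",
                "li $12 1",
                "add $10 $10 $12");
    }
}
```

Each pseudoinstruction here occupies three memory locations in the assembled program, which matters when the assembler computes the addresses of labels.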

Implementation note: Implementing the lbi instruction requires loading a 16-bit number into a register. The Larc machine language does not have such an instruction, so the load must be done using li, lui, and other instructions. For example, suppose that you want to load the hexadecimal number 0xFACE into register $3. Note that if the last 8 bits, 0xCE, are used as a LIMM in the instruction "li $3 0xCE," it will be treated as a negative number and sign-extended to 0xFFCE; that is, the number loaded into $3 is 0xFFCE. What we really need is 0x00CE. We can get that by applying a logical left shift by 8 bits to the number, followed by a logical right shift by 8 bits. So, the instruction "lbi $3 0xFACE" can be implemented as:

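    li $3 0xCE       # $3 = 0xFFCE (0xCE is sign-extended)
    li $12 8         # $12 = 8, the shift amount
    sll $3 $3 $12    # $3 = 0xCE00
    srl $3 $3 $12    # $3 = 0x00CE
    lui $12 0xFA     # $12 = 0xFA00
    add $3 $3 $12    # $3 = 0xFACE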

Each of these instructions can be translated directly to a single machine language instruction. Depending on the value that we need to load, a shorter sequence of instructions might be used. (Another option would be to have 0xFACE in some memory location, as part of the machine language program, and use the lw instruction to load the value from that memory location.)
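
In the assembler, the general six-instruction sequence could be generated as in the sketch below. The class name LbiExpander and its methods are hypothetical. The simulate method is not part of an assembler; it just repeats the 16-bit arithmetic of the sequence so that the result can be checked.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper; emits the general (unoptimized) expansion of lbi.
final class LbiExpander {

    // Returns the instructions for "lbi $rd value", with value in -32768 to 65535.
    static List<String> expand(int rd, int value) {
        if (value < -32768 || value > 65535) {
            throw new IllegalArgumentException("BIMM out of range: " + value);
        }
        int bits = value & 0xFFFF;
        int low = bits & 0xFF;     // last 8 bits
        int high = bits >>> 8;     // first 8 bits
        String r = "$" + rd;
        return Arrays.asList(
                String.format("li %s 0x%02X", r, low),
                "li $12 8",
                "sll " + r + " " + r + " $12",
                "srl " + r + " " + r + " $12",
                String.format("lui $12 0x%02X", high),
                "add " + r + " " + r + " $12");
    }

    // Computes the value that the sequence leaves in the register.
    static int simulate(int value) {
        int bits = value & 0xFFFF;
        int r = ((byte) (bits & 0xFF)) & 0xFFFF;  // li: sign-extended low byte
        r = (r << 8) & 0xFFFF;                    // sll by 8
        r = r >>> 8;                              // srl by 8
        int t = ((bits >>> 8) << 8) & 0xFFFF;     // lui: high byte shifted left by 8
        return (r + t) & 0xFFFF;                  // add
    }
}
```

For the example above, expand(3, 0xFACE) produces the same six instructions as shown, and simulate(0xFACE) returns 0xFACE.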

A similar issue arises with the la instruction, since the address of a label can be an arbitrary 16-bit number.

Compatibility note: Larc actually comes with an assembly language, which is documented in the Larc manual. It has the same 14 basic instructions and the same versions of the beqz, bnez, and la instructions as Lasm. Many programs are written with just those 17 instructions. To make it possible to run such programs, the Lasm parser implements the following extensions: It accepts .asciiz as equivalent to .asciz. It allows certain mnemonic names, such as $sp and $ra, to be used for registers. And it allows two additional directives, .text and .data. In the original Larc assembly language, a program has two parts, text and data. The text section can only contain instructions, and the data section can only contain the data directives .space, .word, and .asciiz. The beginning of the text section is marked by the .text directive; the beginning of the data section is marked by the .data directive. In the source code, the two sections can come in either order, but in the assembled program, the data section always comes at the end. Lasm will respect this structure, but does not require it.