CS220, Fall 2018, Lab 11

CPSC 220, Fall 2018
Lab 11: An Assembler for Larc

As we discussed in class, this lab and the next will be devoted to writing an assembler program for Larc. The assembler is to be written in Java. Our Larc assembly language, "Lasm," was discussed in class, and a handout about it was distributed. Your starting point is a program that already does the parsing phase of the assembler. Your assignment is to do code analysis and machine language code generation.

You are encouraged, but not required, to do this lab with a partner. I might try to pair up people who are still looking for a partner at the beginning of lab. If two people work together, only one should submit their work. Please make sure that both people's names are listed in the program!

To begin, you should start a project in Eclipse and copy the folder /classes/cs220/lasm into the src directory of the project. Be sure to copy the entire folder, not just the files inside the folder. The Java files in the folder are in a package named "lasm". (You could also download a zip file of the directory.)

Your Lasm assembler will be due by 3:00 PM on the last day of class, Monday, December 10. (There will be a new assignment to work on at the last lab of the semester, on Thursday, December 6!)

About Classes in the lasm Package

The assembler is defined by a set of classes in package lasm, and must be in that package when submitted. You will work on the main program, which is defined in Lasm.java. It should not be necessary to modify any of the other Java files, unless you decide to do some extra work for extra credit. You will, however, have to be familiar with the other classes in the package, and you will probably need to read some of the documentation in the comments in those classes. The other classes are:

ProgramItem.java — An abstract base class that represents one item in an assembly language program, which can be either a label, a directive, or an instruction. A ProgramItem has properties named lineNumber, size, and address. The line number is set by the Parser to record the line in the source code file where the item was found. The size and address are not currently used, but might be useful for recording information about an item during the analysis phase.
Directive.java — Represents one of the directives .asciz, .word, or .space. Properties include the directive name and the value, which is a string for an .asciz directive and an integer for .word or .space.
Label.java — Represents a label that is used to mark a position in a program. The only new property is the label's name. Names are case-sensitive. Only one Label object with a given name can exist.
Instruction.java — This is the most complicated kind of ProgramItem, and you will need to study this class. An instruction has a name (which is not case-sensitive), an opcode, and whatever data items are needed by the instruction. Note that labels in instructions are represented by Strings, not by objects of type Label.
Parser.java — Defines a static method that reads and parses a Lasm program from a Scanner. The parser will find most syntax errors in the input file. The exceptions are that it does not check that a string used as a label name in an instruction is the name of an actual label in the program, and it does not check that the offset amount for a beqz or bnez instruction is in the range -128 to 127. If syntax errors are found in the program, the return value is null. Otherwise, the return value is an ArrayList of ProgramItems.

Code Analysis

Once the assembler has the parsed Lasm program in the form of an ArrayList<ProgramItem>, the goal is to output a Larc machine language program that is equivalent to the original assembly language program. However, before that can be done, some analysis of the program is needed. In particular, you need to compute the address for each label that is used in the program, and you need to check that any label name that is used in an instruction is the name of a label that exists in the program. You might also want to check that the offsets for beqz and bnez instructions are in the legal range, -128 to 127. (That could be done in the code generation phase, but it's not really nice to start writing a machine language program only to get stopped in the middle by an error.)
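The two checks described above can be sketched in isolation. This is only a sketch: the class LabelCheckSketch and its methods are hypothetical and do not use the real lasm classes. It assumes you have already collected the defined label names into a Set and the label names used by instructions into a List.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of the lasm package.
class LabelCheckSketch {

    // Returns each referenced name that has no matching label, without repeats.
    // Label names are case-sensitive, so an exact match is required.
    static List<String> undefinedLabels(Set<String> defined, List<String> referenced) {
        List<String> missing = new ArrayList<>();
        for (String name : referenced) {
            if (!defined.contains(name) && !missing.contains(name)) {
                missing.add(name);
            }
        }
        return missing;
    }

    // A beqz or bnez offset must fit in the range -128 to 127.
    static boolean branchOffsetInRange(int offset) {
        return offset >= -128 && offset <= 127;
    }

    public static void main(String[] args) {
        Set<String> defined = new HashSet<>(Arrays.asList("loop", "string"));
        List<String> used = Arrays.asList("loop", "done", "string");
        System.out.println(undefinedLabels(defined, used)); // prints [done]
        System.out.println(branchOffsetInRange(-128));      // prints true
        System.out.println(branchOffsetInRange(128));       // prints false
    }
}
```

In a real assembler you would probably report the line number of the offending instruction as well, using the lineNumber property of the ProgramItem.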

The analysis can be done in one or more passes through the ArrayList of ProgramItems. In general, each pass takes the form of a foreach loop:

for ( ProgramItem item : program ) {
    if (item instanceof Label) {
        Label lab = (Label)item;
        ...
    }
    else if (item instanceof Directive) {
        Directive dir = (Directive)item;
        ...
    }
    else { // item must be an instruction
        Instruction ins = (Instruction)item;
        ...
    }
}

In my solution, one of my passes calculates an address for every item, not just for labels. To calculate addresses, you need to know how many locations in memory will be used by each item. Remember that some pseudoinstructions expand into several basic instructions, and so will occupy several locations in memory. For a basic assembler, you can assume that loading any label or BIMM will take six basic instructions. (For a BIMM, you could check whether fewer instructions will suffice. For a Label, things are more difficult.)
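Here is a minimal sketch of such an address pass. It is not my solution: SketchItem and AddressPassSketch are hypothetical stand-ins for the real lasm classes, and each item's size is assumed to be known already. The sizes in the example (1 for a basic instruction, 6 for la, 15 for the .asciz string) mirror the hello-world.s output shown at the end of this handout.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a parsed item; the real ProgramItem classes differ.
class SketchItem {
    final String labelName; // non-null only when the item is a label
    final int size;         // number of memory locations used (0 for a label)
    int address;            // filled in by the address pass

    SketchItem(String labelName, int size) {
        this.labelName = labelName;
        this.size = size;
    }
}

class AddressPassSketch {

    // Assigns an address to every item and returns a map from label name to address.
    static Map<String, Integer> assignAddresses(List<SketchItem> program) {
        Map<String, Integer> labelAddresses = new LinkedHashMap<>();
        int address = 0;
        for (SketchItem item : program) {
            item.address = address;
            if (item.labelName != null) {
                labelAddresses.put(item.labelName, address);
            }
            address += item.size; // a label has size 0, so it shares the next item's address
        }
        return labelAddresses;
    }

    public static void main(String[] args) {
        List<SketchItem> program = new ArrayList<>();
        program.add(new SketchItem(null, 1));      // li $1 1
        program.add(new SketchItem(null, 6));      // la $2 string
        program.add(new SketchItem(null, 1));      // li $3 14
        program.add(new SketchItem(null, 1));      // syscall
        program.add(new SketchItem(null, 1));      // li $1 0
        program.add(new SketchItem(null, 1));      // syscall
        program.add(new SketchItem("string", 0));  // label
        program.add(new SketchItem(null, 15));     // .asciz "Hello, world!\n"
        System.out.println(assignAddresses(program)); // prints {string=11}
    }
}
```

In the real program, the size and address properties of ProgramItem might be a natural place to record these values.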

The folder /classes/cs220/lasm_test_programs contains some Larc assembly programs that you can use for testing. Most of the test programs are written in a slightly more sophisticated assembly language than the one we have looked at; the Lasm parser accepts them, but it processes the extra features away, so that you don't need to understand them. One thing that you should note is that the ".data" section at the beginning of many of the programs is moved to the end of the parsed program by the Lasm assembler; the same thing is done by the official Larc assembler.

Code Generation

Code generation requires one final pass through the ArrayList, but it is the most difficult pass. Labels don't generate any code, but both Directives and Instructions do generate code. Instructions are the difficult case, since there are many cases to consider.

Recall that a Larc "machine language" program is actually a text file. The syntax is pretty strict. Empty lines and comment lines beginning with a # are allowed. Other than that, a line must contain either exactly 16 zeros and ones representing a 16-bit binary number or a 16-bit hexadecimal number starting with 0x and containing exactly four hexadecimal digits. Note that leading zeros must be included. (This is, of course, not real machine language, which should be given in binary form.)
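Both line formats are easy to get wrong by dropping leading zeros. The following sketch shows one way to produce them; the class and method names are hypothetical, and only the low 16 bits of the value are kept.

```java
// Hypothetical helper, not part of the lasm package.
class WordFormatSketch {

    // "0x" followed by exactly four hexadecimal digits.
    static String toHexLine(int value) {
        return String.format("0x%04x", value & 0xFFFF);
    }

    // Exactly 16 zeros and ones.
    static String toBinaryLine(int value) {
        String bits = Integer.toBinaryString(value & 0xFFFF);
        StringBuilder line = new StringBuilder();
        for (int i = bits.length(); i < 16; i++) {
            line.append('0'); // pad on the left with leading zeros
        }
        return line.append(bits).toString();
    }

    public static void main(String[] args) {
        System.out.println(toHexLine(10));     // prints 0x000a
        System.out.println(toHexLine(-1));     // prints 0xffff
        System.out.println(toBinaryLine(5));   // prints 0000000000000101
    }
}
```

Masking with 0xFFFF means that a negative number such as -1 comes out as its 16-bit two's complement representation.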

Once you have successfully produced a Larc machine language program, you can test it using the Larc simulator, /classes/cs220/larc/sim. You might already have "sim" defined as an alias for running the simulator. (You could also test your program in the Larc debugger, /classes/cs220/larc/db.jar. Warning: If you use the debugger on a .s file, it will create its own .out file, which can overwrite the one created by your assembler.)

To help me with debugging, I used comments in the machine language output file to show where the generated code is coming from. Here, for example, is my output for the hello-world.s test program:

# Code for "li $1 1" at address 0, lineCt 0:
0x8101
# Code for "la $2 string" at address 1, lineCt 1:
0x820b
0x8c08
0x422c
0x522c
0x9c00
0x022c
# Code for "li $3 14" at address 7, lineCt 7:
0x830e
# Code for "syscall" at address 8, lineCt 8:
0xf000
# Code for "li $1 0" at address 9, lineCt 9:
0x8100
# Code for "syscall" at address 10, lineCt 10:
0xf000
# Code for ".asciz "Hello, world!\n"" at address 11, lineCt 11:
0x0048
0x0065
0x006c
0x006c
0x006f
0x002c
0x0020
0x0077
0x006f
0x0072
0x006c
0x0064
0x0021
0x000a
0x0000
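Two patterns in the listing above can be reproduced with a short sketch. The bit layout used for li (a 4-bit field holding 8, then a 4-bit register number, then an 8-bit immediate) is only inferred from the lines 0x8101, 0x830e, and 0x8100 in this output, so check it against the Lasm handout before relying on it. The .asciz encoding, one word per character followed by a zero word, is likewise read off the listing. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the lasm package.
class CodeGenSketch {

    // Assumed layout, inferred from the sample output: 8 | register | 8-bit immediate.
    static String encodeLi(int register, int immediate) {
        int word = (0x8 << 12) | ((register & 0xF) << 8) | (immediate & 0xFF);
        return String.format("0x%04x", word);
    }

    // One 16-bit word per character, followed by a terminating zero word.
    static List<String> encodeAsciz(String text) {
        List<String> lines = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            lines.add(String.format("0x%04x", (int) text.charAt(i)));
        }
        lines.add("0x0000");
        return lines;
    }

    public static void main(String[] args) {
        System.out.println(encodeLi(1, 1));   // prints 0x8101
        System.out.println(encodeLi(3, 14));  // prints 0x830e
        List<String> words = encodeAsciz("Hello, world!\n");
        System.out.println(words.size());     // prints 15
        System.out.println(words.get(0));     // prints 0x0048
    }
}
```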
