From 15bba61aaef22a67fb6e1d35e577f6441052fae7 Mon Sep 17 00:00:00 2001
From: Connor Olding
Date: Wed, 19 Oct 2016 08:21:52 -0700
Subject: [PATCH] write notes on internal workings and potential improvements

---
 NOTES.md | 340 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 TODO     |  20 ++--
 2 files changed, 352 insertions(+), 8 deletions(-)
 create mode 100644 NOTES.md

diff --git a/NOTES.md b/NOTES.md
new file mode 100644
index 0000000..73c3ce4
--- /dev/null
+++ b/NOTES.md
@@ -0,0 +1,340 @@
+# lips internal architecture
+
+…since i'm new to parsers and doomed to forget how my own programs work.
+
+## public interface
+
+(note: this should be moved to the readme or something)
+
+in the simplest case, you just call the table returned by
+`require 'lips.init'` as seen in `example.lua`.
+
+lips also returns, in said table: the usual package metadata,
+and a `writers` table of pre-defined writers.
+
+in the call interface,
+a `writer` function and a table of `options` may be passed
+as further arguments after `fn_or_asm`.
+
+`writer` could be one of the `lips.writers` provided,
+after instantiating with a call (it makes use of closure locals).
+
+options is a table of string keys and any type of value.
+currently there is:
+```
+.unsafe (default false)
+    if set, don't wrap the main assembler call in a pcall().
+    i want to deprecate this, because it's unnecessary code
+    that the user could handle just as well from outside the interface.
+.offset (deprecated)
+    sets options.origin and options.base simultaneously.
+    since the address-base feature was added,
+    the preferred way of doing this is by setting them individually,
+    so this option is deprecated.
+.origin (default 0)
+    where to initially start writing in the file.
+    this simply tacks on an .org directive
+    to the start of the internal assembly.
+.base (default 0)
+    how far to offset the assembler in where it thinks it's writing.
+    this is incredibly important for writing code in ROM
+    that is read into a static place in RAM.
+    this simply tacks on a .base directive
+    to the start of the internal assembly.
+.path (default containing directory of assembly file, if applicable)
+    primarily used internally for handling relative imports.
+.labels (default {})
+    note: lips modifies the argument in-place.
+    allows for exporting/importing of label data.
+    that means you can declare labels in one file,
+    and allow a second to access them, in two separate passes.
+    otherwise, you would have to hardcode label locations.
+    the label format is quite simply a dictionary of string/number pairs:
+    { mylabel=0xDEADBEEF, ... }
+.debug_tokens (default false)
+    dumps the statements table after tokenizing and collecting into statements.
+    this is after UNARY and RELLABELSYM tokens have been disambiguated.
+.debug_pre (default false)
+    dumps statements after basic preprocessing:
+    variable substitution, expression parsing,
+    relative label substitution, etc.
+.debug_post (default false)
+    dumps statements after expanding preprocessor commands:
+    pseudo-instructions, expression evaluation, etc.
+.debug_asm (default false)
+    is arguably the least useful of states to dump in.
+    this will dump statements after being reduced to
+    !ORG and !DATA statements. anything else is a bug.
+    the values of the !BYTE statements are not printed.
+```
+
+## init.lua
+
+the path used to import `lips.init` is mangled
+so lips can find its components in each file.
+this has to be copy-pasted to every internal file,
+which is a small inconvenience.
+
+afaik there isn't really a better way of doing this in vanilla Lua 5.1,
+besides mandating that lips be an installed Lua package,
+which would be an inconvenience to users (and myself!).
+
+iirc the `gsub` can silently "fail" and allow a couple other
+methods of importing, `import "lips"` maybe? i don't remember.
+it might work without being in a dedicated directory too.
+
+other than that there's not a lot to say.
+i've intentionally written this file as stripped down as possible.
+
+i've gone for a one-class-per-file style,
+so `file` and `class` will often be synonyms in the following text.
+
+### room for improvement
+
+it would be nice to document options in init.lua,
+since ATM i'm abusing Lua's default-to-nil behavior of tables.
+that means options could be hidden within any file
+and don't demand any forward-declaration or inline documentation.
+
+eventually i'd like to make `writer` a key of `options`
+just to simplify the interface even further.
+maybe i could pull off `writer_or_options` for backwards compatibility?
+
+someday i'd like to add a `reader` option for handling of existing data,
+e.g. for implementing an automated .hook directive.
+
+## Parser
+
+"Parser" is a bit of a misnomer, since
+the class doesn't do any parsing itself.
+it defers parsing to the Lexer, Collector, and Preproc classes.
+it also handles writing of the parsed data through the Dumper class.
+
+the main method here is `Parser:method`
+which simply interfaces all the important bits of the assembling process.
+
+`self.statements` refers to the "commands" so-to-speak of the assembler
+at any point. the general format of this table is:
+```
+statements={
+    {'!BEEP', Tokens...},
+    ...
+    {'!BOOP', Tokens...},
+}
+```
+
+the `Parser:dump_debug` method allows for dumping the state of the
+`self.statements` table after any of the primary stages of assembling.
+refer to the `.debug_tokens`, `.debug_pre`, `.debug_post`, and `.debug_asm`
+options above.
+
+### room for improvement
+
+statements could be made type-restricted, instead of
+deferring "this crap ain't even assembled" to each individual stage/class.
+
+i'd like to come up with a better name, but i'm not in any rush.
+
+the debug dumper could be slightly prettier in certain cases.
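+
+for a concrete picture of the statements shape described in this section,
+here's a minimal sketch in plain Lua. everything in it is a stand-in:
+the `tok` field and the `dump` helper are made up for illustration
+(only `tt` is named in these notes), and the real Token and Statement
+classes carry more than this.
+
+```lua
+-- hypothetical stand-ins; these are not the real lips classes.
+-- each statement is { type, token, token, ... } as in the format above.
+local statements = {
+    {'!BEEP', {tt='NUM', tok=0x1000}},
+    {'!BOOP', {tt='NUM', tok=0xDE}, {tt='NUM', tok=0xAD}},
+}
+
+-- format one statement as "TYPE tt:tok tt:tok ..."
+local function dump(s)
+    local parts = {s[1]}
+    for i = 2, #s do
+        parts[#parts + 1] = ('%s:%s'):format(s[i].tt, tostring(s[i].tok))
+    end
+    return table.concat(parts, ' ')
+end
+
+for _, s in ipairs(statements) do
+    print(dump(s))
+end
+```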
+
+## Lexer
+
+transforms strings into the tokens they represent.
+this does not handle nor consider how they will be collected into statements.
+
+`.inc` directives (and their friends) are handled here:
+the appropriate files are placed and tokenized inline, not unlike in C.
+
+the `HEX` directive is its own mini-language and thus has its own mini-lexer.
+
+expressions are not parsed nor lexed here.
+they are simply extracted as whole strings for later processing.
+
+the `yield` closure wraps around the `_yield` function argument
+to pass error-handling metadata: the current filename and line number.
+
+the rest of the code should be self-explanatory, albeit ugly.
+
+### room for improvement
+
+this (character-based lexing) is actually a really bad way of writing a lexer.
+it doesn't clearly represent a syntax grammar, or possibly any grammar at all.
+
+but it works. for now.
+it's the code i need to change the least to add new features,
+which has gotta count for something, right?
+
+there's a couple TODOs and FIXMEs in here.
+
+## Collector
+
+## Preproc
+
+transforms complex statements into simpler statements
+that Dumper can later understand.
+
+the name is a bit of a misnomer, because there's
+very little processing after preprocessing. (see: `Dumper:load`)
+
+the `:check` method
+asserts that a token exists and is of a given type (`tt`).
+it will defer to the `:lookup` method if the token type mismatches,
+which isn't guaranteed to help.
+
+preprocessing is split into two stages: process and expand; and four passes:
+
+### pass 1
+
+resolves variables by substitution, parses expressions,
+and collects relative labels.
+
+this pass starts by creating a new, empty table of statements to fill.
+statements are passed through, possibly modified, or read and left-out.
+
+the reason for the copying is that taking indexes into an array (statements)
+that you're removing elements from is A Bad Idea.
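+
+a small sketch of why removing in place goes wrong and how copying avoids it.
+the statement contents are made up, and neither function is taken from
+Preproc; they only show the pattern, under the assumption that the goal
+is to drop every `!VAR` statement.
+
+```lua
+-- made-up statements; only the first element (the type) matters here.
+local function make_statements()
+    return { {'!VAR', 'x'}, {'!VAR', 'y'}, {'!DATA', 1}, {'!VAR', 'z'} }
+end
+
+-- the Bad Idea: table.remove() shifts later elements down,
+-- so whatever slides into slot i is never looked at.
+local function filter_in_place(t)
+    for i = 1, #t do
+        if t[i] and t[i][1] == '!VAR' then
+            table.remove(t, i)
+        end
+    end
+    return t
+end
+
+-- the copying approach: fill a new table, never touch the old one.
+local function filter_copy(t)
+    local new = {}
+    for _, s in ipairs(t) do
+        if s[1] ~= '!VAR' then
+            new[#new + 1] = s
+        end
+    end
+    return new
+end
+
+local bad = filter_in_place(make_statements())
+local good = filter_copy(make_statements())
+print(#bad, bad[1][2]) -- the !VAR for 'y' slipped through
+print(#good, good[1][1]) -- only the !DATA statement remains
+```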
+
+variable-declaring statements (`!VAR`) are read to a dictionary table,
+for future replacement of their keys with values by the `:lookup` method.
+
+note that the variable-parsing code itself calls `:lookup` through `:check`,
+so new variables can simply copy the values of previous variables.
+
+labels (`!LABEL`) are checked for RELLABEL tokens to collect
+for later replacement in pass 2.
+the positive and negative relative labels are collected into their own tables,
+appended and prepended respectively.
+the collection tables are arrays of tables containing the keys
+`index` and `name`.
+
+every statement that isn't eaten has its tokens looked-up by the
+`:lookup` method. at this state, it just handles variable substitution.
+
+### pass 2
+
+resolves relative labels by substitution.
+
+this code enables `self.do_labels` which tells `:lookup` to start
+handling relative labels as well, now that they've all been collected.
+
+`:lookup` is run on every token of every statement.
+
+the appending/prepending done in pass 1 ensures
+that the appropriate relative labels are found in the proper order.
+
+### pass 3
+
+attempts to parse and evaluate constant expressions.
+
+### pass 4
+
+expands pseudo-instructions, including the inference of implied registers.
+
+pseudo-instructions are defined in `overrides.lua`.
+overrides act as extensions to the Preproc class;
+they are passed Preproc's `self`.
+this keeps boilerplate out of `overrides.lua`,
+but makes our own file more of a mess,
+with more dependencies for arbitrary token/statement handling.
+
+### room for improvement
+
+as noted above, the name is a bit of a misnomer,
+so this class should probably be split in two.
+
+pass 3 (expressions) should be an attempt to evaluate constants,
+and parsing should be moved to be part of pass 1.
+
+pass 4 (expansion) is really messy.
+
+looking back, the `new_statements` ordeal
+only seems necessary for the (poor) error handling it provides.
+
+the handling of statement tables could be made better.
+
+## Expression
+
+handles parsing and evaluation of simple (usually mathematical) expressions.
+
+this class is actually completely independent of the rest of lips,
+besides the requirement of the `Base` class, which isn't specific to lips.
+
+### room for improvement
+
+right now, this is just a quick and dirty port of some
+C++ code i wrote a while back. so basically, everything could be improved.
+
+bitwise operators need to be implemented.
+possibly with LuaJIT and a Lua 5.1 fallback.
+maybe that should be its own file?
+
+i might want to consider generating an abstract syntax tree,
+instead of reverse polish notation,
+so that i can handle short-circuiting `&&` and `||` operators,
+among other things, like evaluating stuff
+in logical order instead of right-to-left for everything.
+
+## helper classes
+
+### Token
+
+implements error-checking for tokens,
+and provides convenience methods.
+
+also handles computation of numeric tokens,
+since Token objects contain all the data necessary to do so.
+
+### Statement
+
+implements some error-checking for statements.
+
+### Reader
+
+inherited by stuff
+
+### Muncher
+
+inherited by stuff
+
+### room for improvement
+
+Reader and Muncher classes shouldn't even be necessary;
+they could at least be reduced into one.
+
+## etc.
+
+etc!
+
+### overrides.lua
+
+refer to the section on Preproc.
+
+### data.lua
+
+contains most of the information required
+to assemble MIPS III assembly code.
+
+this file does not expose any functions or methods,
+only constant data.
+however, some of the data may be generated through local functions.
+
+### util.lua
+
+contains various utility functions to be lightly sprinkled over files.
+
+most of this shouldn't be specific to lips.
+
+### writers.lua
+
+implements a few must-have writer-generators.
+
+`make_tester` is just a variant of `make_verbose`
+that only prints addresses as necessary, reducing noise.
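+
+as a rough illustration of what a writer-generator with closure locals
+could look like: the `(pos, byte)` writer signature, the `make_collector`
+name, and the second returned accessor are all assumptions for this sketch,
+not something taken from `writers.lua`.
+
+```lua
+-- hypothetical writer-generator; the (pos, b) signature is an assumption.
+local function make_collector()
+    local bytes = {} -- closure locals: private to this one writer
+    local count = 0
+    local function write(pos, b)
+        bytes[pos] = b
+        count = count + 1
+    end
+    local function get()
+        return bytes, count
+    end
+    return write, get
+end
+
+-- each call to the generator yields a writer with its own state,
+-- which is why a writer has to be instantiated before being passed along.
+local w1, get1 = make_collector()
+local w2, get2 = make_collector()
+w1(0x10, 0xDE)
+w1(0x11, 0xAD)
+w2(0x10, 0xFF)
+
+local b1, n1 = get1()
+local b2, n2 = get2()
+print(n1, n2)
+```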
+ +### room for improvement in general + +for proper documentation, +i need to copy-paste and rewrite most of the crap here into +the appropriate files themselves. + +see also the TODO file. diff --git a/TODO b/TODO index af780b9..e983802 100644 --- a/TODO +++ b/TODO @@ -1,11 +1,5 @@ add basic command-line interface (patch.lua) -document options -maybe deprecate options.unsafe - -add delay slot warnings - -add arithmetic (using %() syntax?) add macros implement push/pop/jpop as macros be able to point to specific args of push/pop using variables @@ -14,5 +8,15 @@ allow generation of shared object files (zelda overlays specifically) don't require colons for +/- labels (this shouldn't break anything right?) -write tests for everything (try to focus on code paths) -test unary tokens +write tests for everything (try to focus on code paths and edge cases) +test unary tokens in particular + +improve parser terminology + +add a gameshark writer + +improve writer performance (just copypaste what you did in patch.lua) + +long term: add delay slot warnings + +externally document more stuff like syntax