write notes on internal workings and potential improvements

2024-11-14 09:19:02 -08:00 · 2016-10-19 08:21:52 -07:00 · 15bba61aae
commit 15bba61aae
parent cdc0f8edb2
2 changed files with 352 additions and 8 deletions
--- a/NOTES.md
+++ b/NOTES.md
@@ -0,0 +1,340 @@
+# lips internal architecture
+
+…since i'm new to parsers and doomed to forget how my own programs work.
+
+## public interface
+
+(note: this should be moved to the readme or something)
+
+in the simplest case, you just call the table returned by
+`require 'lips.init'` as seen in `example.lua`.
+
+lips also returns, in said table: the usual package metadata,
+and a `writers` table of pre-defined writers.
+
+in the call interface,
+a `writer` function and a table of `options` may be passed
+as further arguments after `fn_or_asm`.
+
+`writer` could be one of the `lips.writers` provided,
+after instantiating with a call (it makes use of closure locals).
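+
+a minimal sketch of why the instantiating call matters.
+`make_collector` and the `(pos, byte)` writer signature are made up
+for illustration; check `writers.lua` for the real ones.
+
+```lua
+-- hypothetical writer-generator; not part of lips.
+local function make_collector()
+    local bytes = {} -- closure local: private to this one instance
+    local function writer(pos, byte)
+        bytes[pos] = byte
+    end
+    return writer, bytes
+end
+
+local w1, out1 = make_collector()
+local w2, out2 = make_collector()
+w1(0, 0x12)
+w2(0, 0x34)
+-- each instance keeps its own state.
+assert(out1[0] == 0x12 and out2[0] == 0x34)
+```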
+
+options is a table of string keys and any type of value.
+currently there is:
+```
+.unsafe (default false)
+    if set, don't wrap the main assembler call in a pcall().
+    i want to deprecate this, because it's unnecessary code
+    that the user could handle just as well from outside the interface.
+.offset (deprecated)
+    sets options.origin and options.base simultaneously.
+    since the address-base feature was added,
+    the preferred way of doing this is by setting them individually,
+    so this option is deprecated.
+.origin (default 0)
+    where to initially start writing in the file.
+    this simply tacks on an .org directive
+    to the start of the internal assembly.
+.base (default 0)
+    how far to offset the assembler in where it thinks it's writing.
+    this is incredibly important for writing code in ROM
+    that is read into a static place in RAM.
+    this simply tacks on a .base directive
+    to the start of the internal assembly.
+.path (default containing directory of assembly file, if applicable)
+    primarily used internally for handling relative imports.
+.labels (default {})
+    note: lips modifies the argument in-place.
+    allows for exporting/importing of label data.
+    that means you can declare labels in one file,
+    and allow a second to access them, in two separate passes.
+    otherwise, you would have to hardcode label locations.
+    the label format is quite simply a dictionary of string/number pairs:
+    { mylabel=0xDEADBEEF, ... }
+.debug_tokens (default false)
+    dumps the statements table after tokenizing and collecting into statements.
+    this is after UNARY and RELLABELSYM tokens have been disambiguated.
+.debug_pre (default false)
+    dumps statements after basic preprocessing:
+    variable substitution, expression parsing,
+    relative label substitution, etc.
+.debug_post (default false)
+    dumps statements after expanding preprocessor commands:
+    pseudo-instructions, expression evaluation, etc.
+.debug_asm (default false)
+    is arguably the least useful of states to dump in.
+    this will dump statements after being reduced to
+    !ORG and !DATA statements. anything else is a bug.
+    the values of the !BYTE statements are not printed.
+```
+
+## init.lua
+
+the path used to import `lips.init` is mangled
+so lips can find its components in each file.
+this has to be copy-pasted to every internal file,
+which is a small inconvenience.
+
+afaik there isn't really a better way of doing this in vanilla Lua 5.1,
+besides mandating that lips be an installed Lua package,
+which would be an inconvenience to users (and myself!).
+
+iirc the `gsub` can silently "fail" and allow a couple other
+methods of importing, `require "lips"` maybe? i don't remember.
+it might work without being in a dedicated directory too.
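+
+a sketch of the idea, assuming the mangling is a `gsub` that strips
+a trailing `.init` (the exact pattern in `init.lua` may differ,
+and `component_prefix` is a made-up name):
+
+```lua
+-- hypothetical helper; the pattern is a guess, not copied from init.lua.
+local function component_prefix(modname)
+    -- the parentheses drop gsub's second return value (the match count).
+    return (modname:gsub('%.init$', '')) .. '.'
+end
+
+assert(component_prefix('lips.init') == 'lips.')
+-- no match: gsub returns the string unchanged instead of erroring,
+-- so a module name without the suffix would yield the same prefix.
+assert(component_prefix('lips') == 'lips.')
+assert(component_prefix('vendor.lips.init') == 'vendor.lips.')
+```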
+
+other than that there's not a lot to say.
+i've intentionally written this file as stripped down as possible.
+
+i've gone for a one-class-per-file style,
+so `file` and `class` will often be synonyms in the following text.
+
+### room for improvement
+
+it would be nice to document options in init.lua,
+since ATM i'm abusing Lua's default-to-nil behavior of tables.
+that means options could be hidden within any file
+and don't demand any forward-declaration or inline documentation.
+
+eventually i'd like to make `writer` a key of `options`
+just to simplify the interface even further.
+maybe i could pull off `writer_or_options` for backwards compatibility?
+
+someday i'd like to add a `reader` option for handling of existing data,
+e.g. for implementing an automated .hook directive.
+
+## Parser
+
+"Parser" is a bit of a misnomer, since
+the class doesn't do any parsing itself.
+it defers parsing to the Lexer, Collector, and Preproc classes.
+it also handles writing of the parsed data through the Dumper class.
+
+the main method here is `Parser:method`
+which simply interfaces all the important bits of the assembling process.
+
+`self.statements` refers to the "commands" so-to-speak of the assembler
+at any point. the general format of this table is:
+```
+statements={
+    {'!BEEP', Tokens...},
+    ...
+    {'!BOOP', Tokens...},
+}
+```
+
+the `Parser:dump_debug` method allows for dumping the state of the
+`self.statements` table after any of the primary stages of assembling.
+refer to the `.debug_tokens`, `.debug_pre`, `.debug_post`, and `.debug_asm`
+options above.
+
+### room for improvement
+
+statements could be made type-restricted, instead of
+deferring "this crap ain't even assembled" to each individual stage/class.
+
+i'd like to come up with a better name, but i'm not in any rush.
+
+the debug dumper could be slightly prettier in certain cases.
+
+## Lexer
+
+transforms strings into the tokens they represent.
+this does not handle nor consider how they will be collected into statements.
+
+`.inc` directives (and their friends) are handled here:
+the appropriate files are placed and tokenized inline, not unlike in C.
+
+the `HEX` directive is its own mini-language and thus has its own mini-lexer.
+
+expressions are not parsed nor lexed here.
+they are simply extracted as whole strings for later processing.
+
+the `yield` closure wraps around the `_yield` function argument
+to pass error-handling metadata: the current filename and line number.
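+
+roughly, the wrapping looks like this. the mini-lexer and the
+token types (`WORD`, `EOL`) are invented for the sketch;
+only the `yield`/`_yield` relationship mirrors the description above.
+
+```lua
+-- hypothetical mini-lexer; not the real Lexer.
+local function lex(lines, fn, _yield)
+    local line = 0
+    local function yield(tt, tok)
+        -- tack the error-handling metadata onto every token.
+        return _yield(tt, tok, fn, line)
+    end
+    for _, text in ipairs(lines) do
+        line = line + 1
+        for word in text:gmatch('%S+') do
+            yield('WORD', word)
+        end
+        yield('EOL', '\n')
+    end
+end
+
+local seen = {}
+lex({'li a0, 1', 'jr ra'}, 'test.asm', function(tt, tok, fn, line)
+    seen[#seen + 1] = ('%s:%d %s'):format(fn, line, tt)
+end)
+assert(seen[1] == 'test.asm:1 WORD')
+assert(seen[#seen] == 'test.asm:2 EOL')
+```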
+
+the rest of the code should be self-explanatory, albeit ugly.
+
+### room for improvement
+
+this (character-based lexing) is actually a really bad way of writing a lexer.
+it doesn't clearly represent a syntax grammar, or possibly any grammar at all.
+
+but it works. for now.
+it's the code i need to change the least to add new features,
+which has gotta count for something, right?
+
+there's a couple TODOs and FIXMEs in here.
+
+## Collector
+
+## Preproc
+
+transforms complex statements into simpler statements
+that Dumper can later understand.
+
+the name is a bit of a misnomer, because there's
+very little processing after preprocessing. (see: `Dumper:load`)
+
+the `:check` method
+asserts that a token exists and is of a given type (`tt`).
+it will defer to the `:lookup` method if the token type mismatches,
+which isn't guaranteed to help.
+
+preprocessing is split into two stages: process and expand; and four passes:
+
+### pass 1
+
+resolves variables by substitution, parses expressions,
+and collects relative labels.
+
+this pass starts by creating a new, empty table of statements to fill.
+statements are passed through, possibly modified, or read and left-out.
+
+the reason for the copying is that taking indexes into an array (statements)
+that you're removing elements from is A Bad Idea.
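+
+a sketch of the copy-instead-of-remove approach.
+the `type` field and `without_vars` are hypothetical;
+real statements look like the tables shown in the Parser section.
+
+```lua
+-- hypothetical statement shape: { type = '!VAR', ... }
+local function without_vars(statements)
+    local new_statements = {}
+    for _, s in ipairs(statements) do
+        if s.type ~= '!VAR' then
+            table.insert(new_statements, s) -- passed through
+        end
+        -- !VAR statements are read and left out.
+    end
+    return new_statements
+end
+
+local out = without_vars{
+    {type='!VAR'}, {type='!VAR'}, {type='!INST'}, {type='!LABEL'},
+}
+assert(#out == 2 and out[1].type == '!INST' and out[2].type == '!LABEL')
+```
+
+calling `table.remove` inside the `ipairs` loop instead would shift
+later elements down and skip the one right after each removal.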
+
+variable-declaring statements (`!VAR`) are read to a dictionary table,
+for future replacement of their keys with values by the `:lookup` method.
+
+note that the variable-parsing code itself calls `:lookup` through `:check`,
+so new variables can simply copy the values of previous variables.
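+
+a sketch of why that works. the token shape, the `VARSYM`/`NUM`
+type names, and `declare` are assumptions made for the example,
+and `:check` is folded into `lookup` for brevity.
+
+```lua
+-- hypothetical token shape: { tt = type, tok = value }
+local variables = {}
+
+local function lookup(tok)
+    if tok.tt == 'VARSYM' then
+        local value = variables[tok.tok]
+        if value == nil then
+            error('undefined variable: '..tostring(tok.tok))
+        end
+        return value
+    end
+    return tok
+end
+
+local function declare(name, tok)
+    variables[name] = lookup(tok) -- substitute before storing
+end
+
+declare('a', {tt='NUM', tok=5})
+declare('b', {tt='VARSYM', tok='a'}) -- b copies a's value
+assert(variables.b.tt == 'NUM' and variables.b.tok == 5)
+```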
+
+labels (`!LABEL`) are checked for RELLABEL tokens to collect
+for later replacement in pass 2.
+the positive and negative relative labels are collected into their own tables,
+appended and prepended respectively.
+the collection tables are arrays of tables containing the keys
+`index` and `name`.
+
+every statement that isn't eaten has its tokens looked-up by the
+`:lookup` method. at this stage, it just handles variable substitution.
+
+### pass 2
+
+resolves relative labels by substitution.
+
+this code enables `self.do_labels` which tells `:lookup` to start
+handling relative labels as well, now that they've all been collected.
+
+`:lookup` is run on every token of every statement.
+
+the appending/prepending done in pass 1 ensures
+that the appropriate relative labels are found in the proper order.
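+
+a sketch of how the ordering pays off, under assumed names
+(`collect`, `resolve`, and the `type`/`rel` fields are not from the source;
+only the `index`/`name` entries follow the notes):
+
+```lua
+-- hypothetical statement shape: { type='!LABEL', rel='+', name='...' }
+local function collect(statements)
+    local plus, minus = {}, {}
+    for i, s in ipairs(statements) do
+        if s.type == '!LABEL' and s.rel == '+' then
+            table.insert(plus, {index=i, name=s.name})     -- append
+        elseif s.type == '!LABEL' and s.rel == '-' then
+            table.insert(minus, 1, {index=i, name=s.name}) -- prepend
+        end
+    end
+    return plus, minus
+end
+
+-- the first hit is the nearest label in the wanted direction,
+-- because plus is sorted ascending and minus descending.
+local function resolve(list, from, forward)
+    for _, rl in ipairs(list) do
+        if forward and rl.index > from then return rl.name end
+        if not forward and rl.index < from then return rl.name end
+    end
+    return nil
+end
+
+local plus, minus = collect{
+    {type='!LABEL', rel='-', name='m1'}, {type='!INST'},
+    {type='!LABEL', rel='+', name='p1'},
+    {type='!LABEL', rel='-', name='m2'}, {type='!INST'},
+    {type='!LABEL', rel='+', name='p2'},
+}
+assert(resolve(plus, 2, true) == 'p1' and resolve(plus, 5, true) == 'p2')
+assert(resolve(minus, 2, false) == 'm1' and resolve(minus, 5, false) == 'm2')
+```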
+
+### pass 3
+
+attempts to parse and evaluate constant expressions.
+
+### pass 4
+
+expands pseudo-instructions, including the inference of implied registers.
+
+pseudo-instructions are defined in `overrides.lua`.
+overrides act as extensions to the Preproc class;
+they are passed Preproc's `self`.
+this keeps boilerplate out of `overrides.lua`,
+but makes our own file more of a mess,
+with more dependencies for arbitrary token/statement handling.
+
+### room for improvement
+
+as noted above, the name is a bit of a misnomer,
+so this class should probably be split in two.
+
+pass 3 (expressions) should be an attempt to evaluate constants,
+and parsing should be moved to be part of pass 1.
+
+pass 4 (expansion) is really messy.
+
+looking back, the `new_statements` ordeal
+only seems necessary for the (poor) error handling it provides.
+
+the handling of statement tables could be made better.
+
+## Expression
+
+handles parsing and evaluation of simple (usually mathematical) expressions.
+
+this class is actually completely independent of the rest of lips,
+besides the requirement of the `Base` class, which isn't specific to lips.
+
+### room for improvement
+
+right now, this is just a quick and dirty port of some
+C++ code i wrote a while back. so basically, everything could be improved.
+
+bitwise operators need to be implemented.
+possibly with LuaJIT and a Lua 5.1 fallback.
+maybe that should be its own file?
+
+i might want to consider generating an abstract syntax tree,
+instead of reverse polish notation,
+so that i can handle short-circuiting `&&` and `||` operators,
+among other things, like evaluating stuff
+in logical order instead of right-to-left for everything.
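+
+for reference, evaluating reverse polish notation boils down to
+a stack loop. this is a generic sketch, not the code in the Expression
+class; `eval_rpn` and the operator table are made up.
+
+```lua
+-- hypothetical RPN evaluator over a flat list of numbers and operators.
+local ops = {
+    ['+'] = function(a, b) return a + b end,
+    ['-'] = function(a, b) return a - b end,
+    ['*'] = function(a, b) return a * b end,
+    ['/'] = function(a, b) return a / b end,
+}
+
+local function eval_rpn(items)
+    local stack = {}
+    for _, item in ipairs(items) do
+        if type(item) == 'number' then
+            stack[#stack + 1] = item
+        else
+            local f = ops[item] or error('unknown operator: '..tostring(item))
+            local b = table.remove(stack) -- right operand is on top
+            local a = table.remove(stack)
+            if a == nil or b == nil then error('stack underflow') end
+            stack[#stack + 1] = f(a, b)
+        end
+    end
+    if #stack ~= 1 then error('malformed expression') end
+    return stack[1]
+end
+
+-- (1 + 2) * 3
+assert(eval_rpn{1, 2, '+', 3, '*'} == 9)
+```
+
+both operands are already computed by the time an operator is reached,
+which is why short-circuiting doesn't fit this model.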
+
+## helper classes
+
+### Token
+
+implements error-checking for tokens,
+and provides convenience methods.
+
+also handles computation of numeric tokens,
+since Token objects contain all the data necessary to do so.
+
+### Statement
+
+implements some error-checking for statements.
+
+### Reader
+
+inherited by stuff
+
+### Muncher
+
+inherited by stuff
+
+### room for improvement
+
+Reader and Muncher classes shouldn't even be necessary;
+they could at least be reduced into one.
+
+## etc.
+
+etc!
+
+### overrides.lua
+
+refer to the section on Preproc.
+
+### data.lua
+
+contains most of the information required
+to assemble MIPS III assembly code.
+
+this file does not expose any functions or methods,
+only constant data.
+however, some of the data may be generated through local functions.
+
+### util.lua
+
+contains various utility functions to be lightly sprinkled over files.
+
+most of this shouldn't be specific to lips.
+
+### writers.lua
+
+implements a few must-have writer-generators.
+
+`make_tester` is just a variant of `make_verbose`
+that only prints addresses as necessary, reducing noise.
+
+### room for improvement in general
+
+for proper documentation,
+i need to copy-paste and rewrite most of the crap here into
+the appropriate files themselves.
+
+see also the TODO file.
--- a/20
+++ b/20
@@ -1,11 +1,5 @@
 add basic command-line interface (patch.lua)

-document options
-maybe deprecate options.unsafe
-
-add delay slot warnings
-
-add arithmetic (using %() syntax?)
 add macros
 implement push/pop/jpop as macros
 be able to point to specific args of push/pop using variables
@@ -14,5 +8,15 @@ allow generation of shared object files (zelda overlays specifically)

 don't require colons for +/- labels (this shouldn't break anything right?)

-write tests for everything (try to focus on code paths)
-test unary tokens
+write tests for everything (try to focus on code paths and edge cases)
+test unary tokens in particular
+
+improve parser terminology
+
+add a gameshark writer
+
+improve writer performance (just copypaste what you did in patch.lua)
+
+long term: add delay slot warnings
+
+externally document more stuff like syntax