Research Project: Jaws

Every day, security researchers find bugs in programs that would enable malicious actors to perform nefarious deeds by exploiting them. Although most of the successful hacking groups are migrating towards “file-less malware” and “living-off-the-land techniques” to establish persistence and control, old school malware is still the bane of any unprepared organization. Antivirus products have been pretty effective against various types of malware, but they are not without their weaknesses. Signature-based antivirus struggles to detect polymorphic malware as well as new threats that haven’t made it to the various malware signature databases. AI-based antivirus fixes the last couple of issues, but it still struggles to detect malware that runs on virtual machines. Research done by my colleague Tony Nizzi concluded that an AI-based antivirus program would need to include at least some form of the virtual machine to be able to detect malicious code that runs on it.

The goal of my research is to shed light on a vulnerability in antivirus programs as a whole, much like other security researchers do when finding bugs in programs. The vulnerability is one that I think may enable a complete bypass of any antivirus program existing today.

I plan to use this post to follow my progress with my research project. I’ll try to just include helpful links to the really verbose documentation because I don’t want this post to get ridiculously thick or to read like a tutorial. Since this is an ongoing project, I will continue to update this post as my research develops. Also, due to the possible implications of this research, I will not be providing a link to my code at this time.

How I got the idea for my project:

I was in the Computer Science lounge at the University of Northern Iowa (where I attend college), and I was talking with another student who was working on an assignment for one of his classes. His assignment was to choose an esoteric programming language and write a short simple program in that language. He was showing me some of the more outrageous ones that he might choose when one of them piqued my interest. It was a language called ‘whitespace‘. The interesting thing about whitespace is that it is a completely invisible programming language, with its syntax consisting entirely of spaces, tabs, and line feed characters (whitespace characters). It was made as a joke on April fool’s day some years ago because most programming languages ignore whitespace, only using it as an arbitrary delimiter between the actual code that makes up the program.

That was when I thought about a cool feature of whitespace: it could be used to easily create polyglot code (a file that contains a valid program in more than one language). Since most programming languages ignore whitespace and the whitespace programming language ignores all other characters, whitespace code could be injected into any non-whitespace-controlled file and could be programmed to do anything. Even a harmless HTML or source C file could be made to do malicious things with the spaces between the actual code. It would be a manual analysis nightmare.

I thought about the implications of this kind of polyglot code. To most humans, such a file would appear to be doing one thing while an unknown virtual machine could be using it to do something else. More than that, though: if it was interpreted and ran on a virtual machine, it may also be able to trick antivirus programs that can’t interpret the code. I became curious to see if it would actually work. Since I had to do undergraduate research for my Computer Science degree and an Honors Thesis for the UNI Honor’s Program, I decided to dedicate my research in both to this idea.

Planning, planning, planning:

The first thing I needed to do for my research was lay out a plan for what I was actually going to need to be able to test my ideas. I came up with three major tasks that all break down into lots of smaller parts.

The first of the three major tasks is to create a new virtual machine that runs malicious code for testing. I quickly realized that the whitespace programming language doesn’t do what I need it to in order to emulate malicious behavior. Somewhere I read that programming languages have ‘computation’ and ‘side-effects’. Computation is the part that does the math and logic, but without side-effects like printing the output to a screen or saving it to a file all you’re doing is making your CPU hot. Whitespace’s side-effects only included printing output to the screen and taking input from a keyboard. Most malicious behavior involves interacting with files and networking, and whitespace simply wasn’t built to do those things. So instead I have decided to create my own programming language based on whitespace, but with modifications and additions. I think the best part of creating a programming language is getting to name it, so I took care to give mine some deep thought and came up with ‘Jaws’. Jaws is an acronym for Just-Another-White-Space, and it also has some other hidden meanings. I thought it was appropriate to give it the same name as the movie ‘Jaws’ because both describe a threat hidden beneath the surface. Also, jaws can be thought of as interlocking teeth, much like the way the Jaws code can interlock with other code in the same file.

Once the Jaws virtual machine is created, I will get to the second major task of my project: a traditional compiler that translates a visible programming language into my invisible one. After giving it some thought, I realized it would be miserable to write and attempt to debug code that I can’t even see. So I decided that I need to create a second compiler to at least translate a visible version of Jaws into the invisible version. If I have the time, I would like to add more features to my visible interpreted language to enable me to create complex programs more easily. I will have to decide later how much of that will be worth my time.

The third major task of my research will be to actually do the testing against different antivirus products. I’ve chosen C as the language that the Jaws virtual machine will be written in, so Jaws will be able to run on any architecture that C compiles to (which is a LOT). I’ve even considered that Jaws could be tested in things like web pages and browser extensions via WebAssembly, but that is probably beyond the scope of this project. I plan to compile the virtual machine to Windows and then run tests in which the Jaws VM executes malicious code alongside various antivirus programs. In the end, I plan to offer ways to improve antivirus design so that this kind of threat will be mitigated before it can be redeveloped and unleashed by a malicious party.

With introductions and explanations out of the way, the rest of this post will be tracking the progress of my research.

Task One – Creating the Jaws VM:

The first thing I need for my research is the actual virtual machine that my invisible code will run on. If you don’t know much about compilers and the like, I highly recommend reading this article to be able to understand some of the terminology I’ll be using.

My junior year of college when I took Compilers, we actually wrote the entire thing ourselves. Luckily for me, I found a better way to do it this time around. There is a super cool pair of tools called Lex & Yacc that help greatly with the creation of a compiler. Flex & Bison (evolutions of Lex and Yacc) are what I will use to define the tokens and grammar for my language, and then they will actually generate the parser for me!

The first thing you should know when reading Flex & Bison files is that they are divided up into 3 sections:

  1. the first is sort of “control” information,
  2. the second is the actual token (Flex) or grammar (Bison) definitions,
  3. the last is C code to be copied verbatim to the output.

These sections are split up with the ” %% ” delimiters. The control section includes things like C headers (between ‘%{‘ and ‘%}’ ) that will get placed in the generated file, as well as options for Flex or Bison when generating the lexer and parser. The second section is specific to Flex or Bison. My Flex file’s second section includes my definitions of each token and what to do with them when they are found in the input file. You can read more about Flex’s token definitions in the Flex manual. My Bison file’s second section consists of the formal grammar of the language. You can read more about Bison’s grammar syntax on this site. The last section containing C code is for defining the ‘main’ function and other functions that may be needed. It is left blank in my Flex file because my Bison file contains the ‘main’ function definition and there can only be one when the generated C files are compiled together.

Flex

The first code that I wrote for the Jaws virtual machine was to make the file that Flex uses to generate my Jaws lexer. In my Flex file, I have only defined 3 tokens (SPACE, TAB, and LF). Everything else is ignored as specified by a line containing the wildcard for all other characters ” .   ; “. Basically, all Flex does is capture these tokens and pass them on to Bison. My Flex file was relatively short, because there are only 3 tokens for my language and they are each only 1 character.

Bison

Next, I made the parser with Bison. Bison generates the parser from the grammar that I define in the Bison file. First, this parser takes the tokens in the order that they are read from Flex. Then, it groups the tokens into instructions, and makes sure that the instructions are in a legal order. I had to make sure that my formal grammar was correct in order for Bison to parse the individual tokens correctly. The most time consuming part of building the grammar was when I added the start/stop feature of my Jaws language.

I’ve mentioned that Jaws can be injected into files alongside programming languages like C and HTML, but what about whitespace-controlled programming languages like Python? In the case of Python, you can have arbitrary whitespace on lines separate from Python code. However, the whitespace characters in the lines of Python would break the functionality of the Jaws lines. For cases like that, I added a feature to Jaws that enables it to start and stop interpretation of whitespace characters using header and footer statements. Once the rest of the grammar was parsing correctly, I integrated the header and footer statements into the grammar. It was during this stage that I committed the most Computer Science act of them all: forgetting a single character and wasting an entire week figuring out what the extremely vague error it produced could possibly mean.

Once the code generated by Bison could correctly parse the complete formal grammar and print them back to the command line, it was time to figure out how to execute the instructions.

Code Execution

Having only written a single traditional compiler before this project, creating a runtime system for my virtual machine was uncharted territory. I initially had some preconceived notions of how to do it that I quickly figured out wouldn’t work. Bison’s grammar definitions allow for code execution as semantic actions while parsing the grammar. At first I thought, “Hey, I can just execute the instructions as it parses them!” That idea was quite literally too good to be true. It turned out, there was no way with Bison to execute flow-control operations such as jumping to a previous line of code that had already been parsed (ex. simulating a loop).

After brainstorming and consulting with my academic advisor, Dr. Eugene Wallingford, I decided the best plan of action was to create a data structure for an instruction and a program (which contains an array of instructions). At compile time, the program data structure fills its array of instructions. This is accomplished by instantiating an instruction after it is parsed and added it to the program’s array via a semantic action. For execution, each type of instruction has it’s own runtime function, and a pointer to that function is added to the instruction structure as it is parsed and instantiated.

I also had to create data structures to represent the memory of my runtime system. Jaws is a stack-based language, so I created memory structures for a stack and a heap. At the recommendation of Dr. Wallingford, I also created a data structure for a jump table that could help my runtime system execute jumps more quickly.

This is probably over-simplifying the rest of the implementation, but for the most part the rest of the runtime system was really just building out the instruction runtime functions and making sure they execute and interact with the memory structures correctly. After parsing is complete and the program is built, an instruction pointer is created for the array of instructions, and the program is executed by calling the runtime function each instruction points to.

Once all that was completed and the lingering bugs were found and fixed, I had completed my initial version of my invisible programming language! It was a lot of work to get to this point, and I was reaching new levels of geeking out seeing my seemingly blank files running like they were Python code.

Task Two – Creating the Traditional Compiler:

As I mentioned earlier in this post, I decided that I would need to create a visible language that I can compile into Jaws code. This would enable me to write and debug code much more quickly when I am developing Jaws programs. Since I am creating a second programming language for this purpose, I get to have more naming fun! I decided to call my Jaws visible code “Fin”. My reasoning can probably be inferred without me boring you with an explanation a second time.

For the Fin compiler, I decided to use Flex and Bison again to generate the lexer and parser. That allows me to reuse the formal grammar I defined in Jaws’s Bison file, and it also keeps me in the same mindset and language as I finish the Code Execution part of the Jaws VM in parallel with writing this compiler.

Flex

The Flex file for the Fin compiler was more work than it was for Jaws. Jaws only has 3 tokens, but a readable version of Fin called for a separate token for each instruction. For example, ‘add’, ‘sub’, ‘mult’, and ‘div’ are all individual tokens, rather than each one being made of the same 3 characters.

Another thing that added complexity to the Fin lexer was the addition of literals as tokens. In Jaws, instruction parameters such as numbers and IP addresses were simply defined in binary, with 1 being a tab and 0 being a space. However, in order to write code more easily, I want to be able to write things like “2048”, “C”, and “127.0.0.1:22” as parameters. To do this, I had to define regular expressions for each literal. Other than the number of tokens and the regular expressions for the literals, there wasn’t anything else that needed to be added to the Fin Flex file.

Bison

I was able to copy a lot of the existing Jaws Bison file to use for Fin. I had to redefine the tokens coming in from Flex. I also had to place their corresponding terminal symbols in the grammar and rewrite the semantic actions (more on that in the next paragraph). Other than that, though, the formal grammar remained mostly the same! Once I got all the tokens parsing correctly, I moved on to the code generation part of the compiler.

Code Generation

The code generation for the Fin compiler ended up being extremely simple, which was a welcome surprise. First, I defined a Jaws output file to write to as code was generated. Then I simply made a code execution block (semantic action) after each terminal symbol in the Bison grammar. In each code execution block, I used fprintf to write the Jaws invisible character combination to the output file. And that was it! To add some polish to the compiler, I added some command line arguments. Now the Fin compiler is complete and working like I want it to. After months of compiler development, it’s finally time to write some malicious Jaws code and test it!

Task Three – Jaws Malware vs. Antivirus:

(to be continued…)

6 Comments

  1. Hey really nice project ! I would love to read more about how Jaws is accessing the filesystem and working with sockets, and if there is any way I can contribute, please do inform ! Thanks again for such an innovative idea !

    • Windows sockets isn’t working quite yet, but Linux testing has been successful so far. The implementation can be found in the GitHub repo https://github.com/lawndoc/jaws, but basically it’s just using regular C instructions to perform the operations since Jaws is an interpreted language. The runtime system implementation can be found in runtime.c in the JawsVM directory. The file and network IO executors are in the ioa, ioc, and netcon IMP sections starting at line 582. If you have any further questions, feel free to start a discussion in https://github.com/lawndoc/jaws/discussions

    • Good question 🙂 I’ve been focusing my time on other projects lately, but I could probably be convinced to start working on Jaws some more again if there is a lot of interest in its continual development and academic exploration. The initial reason for the drop-off is because this was an undergraduate research project and the semester ended. Since then I’ve started a full-time job and have had other priorities that I’ve been spending my time on.

      Part of the reason I open-sourced Jaws is so that if others feel compelled to learn more, they can do so by contributing themselves. Time is worth a lot, and I only have so much of it as one person.

Leave a Reply

Your email address will not be published. Required fields are marked *