Reverse engineering (RE) is the process of discovering features and functionality of a hardware or software system. RE of software is applied where the original source code for a program is missing, proprietary, or otherwise unavailable. Motivation for RE ranges from extending support of legacy software to discovering security vulnerabilities to creating open source alternatives to proprietary software.
RE usually targets binary programs with a known instruction set architecture (ISA) and executable format. The RE process proceeds by disassembling the binary into assembly code, and where possible decompiling the assembly to yield high-level source code (for example, C source code).
However, in many cases the ISA is undocumented, unknown, or unavailable. In addition, malware has been shown to use custom virtual machines to avoid detection. Because disassembly requires knowledge of ISA features such as word size, instruction format, register size, and number of physical registers, such cases prove extremely time-intensive for the reverse engineer.
This project aims to discover to what extent machine learning can be used to detect ISA features from binaries of unknown provenance and, where detection succeeds, whether these features can be used to help disassemble the binary program so that instruction and control flow information can be recovered.
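As a rough illustration of what "detecting ISA features from raw bytes" could look like, the sketch below computes two simple statistics that might serve as inputs to a classifier: the Shannon entropy of the byte distribution, and the rate at which a byte repeats at a given offset (lag). The intuition, which is an assumption for this example and not a result of the project, is that a fixed-width instruction encoding tends to repeat opcode-like bytes at multiples of the instruction width. All function names and the synthetic "binary" are hypothetical.

```python
import math
import random
from collections import Counter


def byte_entropy(data: bytes) -> float:
    """Shannon entropy of the byte distribution, in bits per byte (0 to 8)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def lag_match_rates(data: bytes, max_lag: int = 8) -> dict:
    """For each lag, the fraction of positions i where data[i] == data[i + lag]."""
    rates = {}
    for lag in range(1, max_lag + 1):
        pairs = len(data) - lag
        if pairs <= 0:
            rates[lag] = 0.0
            continue
        matches = sum(1 for i in range(pairs) if data[i] == data[i + lag])
        rates[lag] = matches / pairs
    return rates


def synthetic_fixed_width_code(n_instructions: int, width: int, seed: int = 0) -> bytes:
    """Hypothetical stand-in for a real binary: each 'instruction' is one
    constant opcode byte followed by (width - 1) random operand bytes."""
    rng = random.Random(seed)
    out = bytearray()
    for _ in range(n_instructions):
        out.append(0xE5)  # arbitrary constant chosen for illustration
        out.extend(rng.randrange(256) for _ in range(width - 1))
    return bytes(out)


sample = synthetic_fixed_width_code(n_instructions=4000, width=4)
rates = lag_match_rates(sample)

# Feature vector that could be fed to a classifier (assumed design, not the project's).
features = {"entropy": byte_entropy(sample)}
features.update({f"lag_{lag}": rate for lag, rate in rates.items()})

# The lag with the highest match rate is a crude guess at the instruction width;
# ties go to the smallest lag because max() returns the first maximum it sees.
width_guess = max(sorted(rates), key=lambda lag: rates[lag])
print(features)
print("guessed instruction width:", width_guess)
```

On this synthetic input the match rate is much higher at lags 4 and 8 than at lags 1 to 3, reflecting the 4-byte layout. Real binaries are far noisier (mixed code and data, variable-length encodings, compression), which is precisely where learned models rather than a single hand-picked statistic may be needed.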
Useful experience for the project includes good knowledge of computer architecture and assembly, machine learning (using Python), and a passion for staring at random-looking byte sequences for hours at a time.