File Format Analysis For Mere Mortals

This article has been a work-in-progress for several years and might never be finished.

Note: This is largely aimed at reverse-engineering file formats from commercial videogames. A lot of the techniques are still applicable for more conventional software, but this article features game-specific terminology and use cases. Most videogame EULAs expressly prohibit reverse-engineering, so use this knowledge at your own discretion.


The Tools

First things first, you will need a hex editor. Ideally one which allows you to change how many bytes are displayed per line, as this makes it easier to identify multi-byte patterns. Personally, I use XVI32, but I’ve heard good things about 010 Editor, Hex Workshop and HxD as well.
I also recommend getting Process Monitor, which allows you to monitor system API calls in real-time. This is particularly useful for watching file read operations.


There are 10 types of people...

Hexadecimal is a number system based on sixteen digits instead of the usual ten. It's also known as "base 16" or simply "hex". It is favored by software engineers because it allows easy conversion to/from binary. A single hex digit always maps to exactly four binary digits, so a byte (8 binary digits) can always be written as exactly two hex digits.

In most programming languages, you can mark a number to be interpreted as hex by prefixing it with 0x. For example, the code will contain 0x41 and the compiler will interpret that as 65 rather than 41.

Here's a table of single-digit hex values:

Decimal   Hex   Binary
0         0     0000
1         1     0001
2         2     0010
3         3     0011
4         4     0100
5         5     0101
6         6     0110
7         7     0111
8         8     1000
9         9     1001
10        A     1010
11        B     1011
12        C     1100
13        D     1101
14        E     1110
15        F     1111

Some other useful values to know (decimal = hex = binary):
16 = 0x10 = 0001 0000
32 = 0x20 = 0010 0000
255 = 0xFF = 1111 1111

It's not too important to be able to convert from hex to decimal in your head, since your editor will help out a lot on that front, but it does help to memorize a few common values.
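
If you'd rather let code do the conversions, here's a minimal sketch in C#. It assumes .NET 6 or later with top-level statements, and the helper names (ToHex, ToBinary, FromHex) are made up for illustration.

```csharp
using System;

// Format a byte as two hex digits, e.g. 255 -> "FF".
static string ToHex(byte value) => value.ToString("X2");

// Format a byte as eight binary digits, e.g. 32 -> "00100000".
static string ToBinary(byte value) => Convert.ToString(value, 2).PadLeft(8, '0');

// Parse a hex string (without the 0x prefix) back into a number.
static int FromHex(string hex) => Convert.ToInt32(hex, 16);

Console.WriteLine(0x41);              // a hex literal; prints 65
Console.WriteLine(ToHex(255));        // FF
Console.WriteLine(ToBinary(0x20));    // 00100000
Console.WriteLine(FromHex("1E240"));  // 123456
```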

Curtis Lassam has some useful videos covering Binary and Hexadecimal which I recommend watching. You can ignore the bits about octal, that's not going to be relevant here (although it's interesting in its own right).


Data-type 101

So, for those of you that haven’t studied Computer Science 101, here’s a quick refresher on datatypes. Just cover byte, int16/32/64, float, and bool.
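
As a rough illustration of how much space each of those types takes up in a file, here's a sketch that reads one of each from a hand-made 20-byte buffer. It assumes .NET 6 or later with top-level statements; the sample values are arbitrary, and the one-byte bool is simply how BinaryReader stores it, which a given game may or may not match.

```csharp
using System;
using System.IO;

// One value of each type, laid out in the little-endian order BinaryReader expects.
byte[] sample =
{
    0x2A,                                           // byte  (1 byte)  = 42
    0x01,                                           // bool  (1 byte)  = true
    0xD2, 0x04,                                     // int16 (2 bytes) = 1234
    0x40, 0xE2, 0x01, 0x00,                         // int32 (4 bytes) = 123456
    0x15, 0xCD, 0x5B, 0x07, 0x00, 0x00, 0x00, 0x00, // int64 (8 bytes) = 123456789
    0x00, 0x00, 0x80, 0x3F,                         // float (4 bytes) = 1.0
};

using var br = new BinaryReader(new MemoryStream(sample));
byte  b    = br.ReadByte();
bool  flag = br.ReadBoolean();   // any non-zero byte reads as true
short s    = br.ReadInt16();
int   i    = br.ReadInt32();
long  l    = br.ReadInt64();
float f    = br.ReadSingle();

Console.WriteLine($"{b} {flag} {s} {i} {l} {f}");
Console.WriteLine($"bytes consumed: {br.BaseStream.Position}");
```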


What is a file?

A file is simply a series of bytes.
See also my earlier rant on file extensions


Know your tools

Talk about hex editors, mention that characters outside the ASCII printable range are displayed as '.'


Lilliput and Blefuscu

When dealing with multi-byte types, you’re going to need to be very careful about byte order, also referred to as endianness. In some systems, the most significant byte is stored first, so the bits are in left-to-right reading order across the whole multi-byte value. These systems are referred to as big-endian. In other systems, the bytes are stored in reverse order, least significant byte first, and these are referred to as little-endian. Note that only the byte order is reversed; the bits are still in big-endian order within each byte.

Let’s use an example. Say we want to store the number 123,456 as a 32-bit (4-byte) integer. In hex, this is 1E240. We want a 32-bit integer, so we pad the left-hand side with zeroes, giving us 0001E240. If we split this up into bytes, we get the byte sequence 00 01 E2 40. That’s in logical left-to-right reading order, so that’s how the value would be stored on a big-endian system. On a little-endian system, we reverse the order of the bytes, giving us 40 E2 01 00.
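
The same example can be reproduced in code. This is a small sketch using BinaryPrimitives from System.Buffers.Binary, assuming .NET 6 or later with top-level statements; the variable names are just for illustration.

```csharp
using System;
using System.Buffers.Binary;

byte[] big = new byte[4];
byte[] little = new byte[4];

// Store 123,456 as a 32-bit integer in each byte order.
BinaryPrimitives.WriteInt32BigEndian(big, 123456);
BinaryPrimitives.WriteInt32LittleEndian(little, 123456);

Console.WriteLine(BitConverter.ToString(big));     // 00-01-E2-40
Console.WriteLine(BitConverter.ToString(little));  // 40-E2-01-00

// Reading the little-endian bytes as big-endian gives 0x40E20100, a very different number.
int wrong = BinaryPrimitives.ReadInt32BigEndian(little);
Console.WriteLine(wrong.ToString("X8"));           // 40E20100
```

A value that comes out implausibly huge like this is often the first hint that you've guessed the byte order wrong.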

Files on a modern PC system will almost always be little-endian. Files on embedded systems, especially game consoles, tend more often to be big-endian (although, that said, the PS4 and Xbox One both use the little-endian AMD Jaguar). In cases where we're unsure, we can try to determine the endianness of a file by using inherent properties of integer and float datatypes, combined with the practicalities of the kinds of values being stored. Side note: In very rare cases, a single file will mix both byte orders, so be aware of that.

Integers will generally have more entropy in the lower bits, while the higher bits are more likely to be zeroes. This can be seen in the example above.

I tend to divide floats into two categories: integer-floats (1.0, -16.0, etc) and real-floats (22.741, -0.3, etc). Integer-floats tend to have entropy in the high bits, with zeroes in the lower bits. Real-floats tend to just be a complete mish-mash of bits. (Fractions that are exact powers of two, such as 0.5 or -0.125, behave like integer-floats in this respect.)
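
To see the difference for yourself, this sketch prints the raw bits of a few floats as hex. It assumes .NET 6 or later with top-level statements, and FloatBits is a made-up helper name.

```csharp
using System;

// Show the raw 32 bits of a float as eight hex digits, most significant byte first.
static string FloatBits(float value) => BitConverter.SingleToInt32Bits(value).ToString("X8");

Console.WriteLine(FloatBits(1.0f));     // 3F800000
Console.WriteLine(FloatBits(-16.0f));   // C1800000
Console.WriteLine(FloatBits(22.741f));  // no tidy run of trailing zeroes
```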

One of the key values I look out for is 3F 80 00 00 (and the little-endian equivalent 00 00 80 3F), which is 1.0 as a float. In 3D model files, this shows up incredibly often. Another good trick is to look for the size of the file in bytes, as an integer, in both byte-orders. It’s not uncommon for files to contain this value near the start.
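
Both tricks are easy to automate. The sketch below counts occurrences of 1.0 in each byte order and looks for the file's own length. It assumes .NET 6 or later with top-level statements, and it also assumes values sit on 4-byte boundaries, which is common but by no means guaranteed. The function names are hypothetical.

```csharp
using System;
using System.Buffers.Binary;

// Count 4-byte-aligned occurrences of the float 1.0 in each byte order.
static (int bigEndian, int littleEndian) CountFloatOnes(byte[] data)
{
    int big = 0, little = 0;
    for (int i = 0; i + 4 <= data.Length; i += 4)
    {
        uint value = BinaryPrimitives.ReadUInt32BigEndian(data.AsSpan(i, 4));
        if (value == 0x3F800000) big++;     // bytes 3F 80 00 00
        if (value == 0x0000803F) little++;  // bytes 00 00 80 3F
    }
    return (big, little);
}

// Find the first 4-byte-aligned offset that holds the file's own length, or -1 if none does.
static int FindFileSize(byte[] data, bool bigEndian)
{
    for (int i = 0; i + 4 <= data.Length; i += 4)
    {
        var span = data.AsSpan(i, 4);
        uint value = bigEndian
            ? BinaryPrimitives.ReadUInt32BigEndian(span)
            : BinaryPrimitives.ReadUInt32LittleEndian(span);
        if (value == (uint)data.Length) return i;
    }
    return -1;
}
```

Feed either function the result of File.ReadAllBytes on a file you're investigating. Whichever byte order racks up more hits is the better guess, though a handful of matches can always be coincidence.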


Strings, Part 1: 文字化け

When reading string data from a file, it's vital to use the correct character encoding, otherwise you may end up with garbled text known as Mojibake. When dealing with binary files, at least from games, there are only 3 encodings you'll usually need to worry about:

ASCII

The American Standard Code for Information Interchange (ASCII) is a 7-bit character encoding first published in 1963, and used heavily by the English-speaking western world. It consists of 33 non-printing control characters (many of which are now obsolete), and 95 printable characters. Typically this will be encoded as a byte per character, with the highest bit of each byte going unused. Various 8-bit extensions to ASCII have been published and used, but they're all kinda icky and for the sake of this article we can safely pretend they don't exist.

[image: ASCII table]

A detailing of the full 7-bit ASCII charset is shown above, thanks to asciitable.com. If you use a *nix system, you can also type man ascii into your terminal. Thanks to Joe McCray for teaching me that.

Unicode - UTF-8

UTF-8 is an extension of ASCII - that is to say that any ASCII text can be read by a UTF-8 decoder and it will work without mojibake. asdf
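
Here's a quick sketch of both halves of that: ASCII bytes surviving a UTF-8 decoder, and non-ASCII bytes turning into mojibake under the wrong decoder. It assumes .NET 6 or later with top-level statements (Encoding.Latin1 needs .NET 5 or newer), and Latin-1 is only standing in as an example of a wrong encoding.

```csharp
using System;
using System.Text;

// Pure ASCII bytes decode to the same text under UTF-8.
byte[] ascii = Encoding.ASCII.GetBytes("Hello");
string viaUtf8 = Encoding.UTF8.GetString(ascii);
Console.WriteLine(viaUtf8);                        // Hello

// Outside the ASCII range, UTF-8 spends several bytes on each character.
byte[] kanji = Encoding.UTF8.GetBytes("\u6587\u5B57");   // the first two characters of this section's title
Console.WriteLine(BitConverter.ToString(kanji));   // E6-96-87-E5-AD-97

// Decoding those bytes with the wrong encoding yields one junk character per byte.
string garbled = Encoding.Latin1.GetString(kanji);
Console.WriteLine(garbled.Length);                 // 6
```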

I highly recommend watching the Computerphile video on UTF-8.

Unicode - UTF-16

asdf


Strings, Part 2: The Terminator

Now that we know how to read a string, we need to know when to stop reading. Unlike primitive types such as int and float, strings don't have a constant length. In my experience, there are 3 main types of string termination that you're likely to run across:

can we add pictures for this section?

Null-Terminator

In every character encoding we've covered, a value of zero represents a null value. A null-terminator is where we use a null value to denote the end of a string. Essentially, you keep reading characters until you read a zero, at which point the string has ended. The canonical "Hello, World!" string will look like this:

Hello, World!
48656C6C6F2C20576F726C642100
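
A reader for this style might look like the sketch below. It assumes a single-byte encoding (ASCII here) and .NET 6 or later with top-level statements; ReadNullTerminated is a made-up name.

```csharp
using System;
using System.IO;
using System.Text;

// Read single-byte characters until a zero byte (or the end of the stream) is reached.
static string ReadNullTerminated(BinaryReader reader)
{
    var bytes = new MemoryStream();
    while (reader.BaseStream.Position < reader.BaseStream.Length)
    {
        byte b = reader.ReadByte();
        if (b == 0) break;        // the terminator is consumed but is not part of the string
        bytes.WriteByte(b);
    }
    return Encoding.ASCII.GetString(bytes.ToArray());
}

byte[] data = Convert.FromHexString("48656C6C6F2C20576F726C642100");
using var br = new BinaryReader(new MemoryStream(data));
Console.WriteLine(ReadNullTerminated(br));   // Hello, World!
```
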

Null-Terminator With Padding

This approach also uses a null character to end the string, but then after that is a series of padding bytes, up to a given maximum length. As such, the string will now take up a constant number of bytes in the file. This has the advantage that you don't have to read from the file one character at a time, but the disadvantage that strings cannot be longer than the prescribed length. Any strings significantly shorter than the max length are also wasting storage space. The length is left up to the programmer, and is not stored explicitly in the file. In the majority of cases I've seen it'll be a multiple of 4 or 16, often a power of 2.

With a max length of 16, the canonical "Hello, World!" string will look like this:

Hello, World!
48656C6C6F2C20576F726C6421000000

The padding bytes after the first null are not always zeroes; they may occasionally be garbage data. Typically the software will ignore them anyway, so it doesn't really matter.
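
In code, that means reading the whole field in one go and cutting it at the first null, ignoring whatever follows. A minimal sketch, assuming ASCII text and .NET 6 or later with top-level statements (ReadPaddedString and its fieldLength parameter are hypothetical names):

```csharp
using System;
using System.IO;
using System.Text;

// Read a fixed-size field, then keep only the bytes before the first zero.
static string ReadPaddedString(BinaryReader reader, int fieldLength)
{
    byte[] field = reader.ReadBytes(fieldLength);
    int end = Array.IndexOf(field, (byte)0);
    if (end < 0) end = field.Length;   // no terminator found: the string fills the whole field
    return Encoding.ASCII.GetString(field, 0, end);
}

byte[] data = Convert.FromHexString("48656C6C6F2C20576F726C6421000000");
using var br = new BinaryReader(new MemoryStream(data));
Console.WriteLine(ReadPaddedString(br, 16));   // Hello, World!
```

Because the reader always consumes exactly fieldLength bytes, it stays lined up with whatever comes next in the file, even when the padding is garbage.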

Explicit Length

In an explicit-length system, the length of the string (be it in bytes or chars) is stored explicitly in the file just before the string data. The length will usually be either a byte, int16, or int32 (I've seen all three). The string data may use a null-terminator as well, and the length number may or may not account for that null-terminator. Don't assume anything, check for this every time.

Using a 16-bit little-endian length, the canonical "Hello, World!" string will look like this:

Hello, World!
0D0048656C6C6F2C20576F726C6421
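
Here's a sketch of a reader for exactly that layout: a 16-bit little-endian length followed by ASCII bytes. It assumes .NET 6 or later with top-level statements, and ReadLengthPrefixed is a made-up name. The TrimEnd call is my own guess at handling a length that counts a trailing null; a format that stores the null outside the counted length would need an extra byte skipped instead.

```csharp
using System;
using System.IO;
using System.Text;

// Read a 16-bit little-endian length, then that many single-byte characters.
static string ReadLengthPrefixed(BinaryReader reader)
{
    ushort length = reader.ReadUInt16();    // BinaryReader always reads little-endian
    byte[] chars = reader.ReadBytes(length);
    // Drop a trailing null in case the stored length counted the terminator.
    return Encoding.ASCII.GetString(chars).TrimEnd('\0');
}

byte[] data = Convert.FromHexString("0D0048656C6C6F2C20576F726C6421");
using var br = new BinaryReader(new MemoryStream(data));
Console.WriteLine(ReadLengthPrefixed(br));   // Hello, World!
```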

Once, twice, three times a datastructure

Quite often games will need to store arrays of data. In memory this will appear as similar-looking repeating patterns. This is where it's very useful to be able to resize your editor's line width, as this can help in identifying the size of a repeating structure. If you find what appears to be a repeating structure, try resizing your editor's line width until the pattern forms vertical lines.
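
If your editor can't change its line width, a few lines of code will do the same job. This sketch dumps bytes at a chosen width; it assumes .NET 6 or later with top-level statements, and both HexDump and the sample records are invented for illustration.

```csharp
using System;
using System.Text;

// Render bytes as hex, starting a new line every bytesPerLine bytes.
static string HexDump(byte[] data, int bytesPerLine)
{
    var sb = new StringBuilder();
    for (int i = 0; i < data.Length; i++)
    {
        sb.Append(data[i].ToString("X2"));
        bool endOfLine = (i + 1) % bytesPerLine == 0 || i == data.Length - 1;
        sb.Append(endOfLine ? '\n' : ' ');
    }
    return sb.ToString();
}

// Three made-up 6-byte records: an ID byte, a little-endian 1.0 float, and an FF marker.
byte[] records =
{
    0x01, 0x00, 0x00, 0x80, 0x3F, 0xFF,
    0x02, 0x00, 0x00, 0x80, 0x3F, 0xFF,
    0x03, 0x00, 0x00, 0x80, 0x3F, 0xFF,
};

Console.WriteLine(HexDump(records, 8));   // the pattern drifts from line to line
Console.WriteLine(HexDump(records, 6));   // one record per line, so the columns line up
```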

insert GTA self radio screenshot/webm here

discuss length prefixing and how to determine the start of an array


Modelling for dummies

Know about vertex/index buffers
reference back to the 123 datastructure section for vertex structs, as well as general array concepts for both vbuffers and ibuffers


Block party

Know about block-based structures (PNG et al.)
Lengths sometimes include themselves, and sometimes don’t
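
PNG makes a good worked example. After an 8-byte signature, every chunk is a 4-byte big-endian length, a 4-byte type, the data, and a 4-byte CRC, and the length covers the data only. The sketch below walks that structure without checking the CRCs. It assumes .NET 6 or later with top-level statements, and ListPngChunks is a made-up name.

```csharp
using System;
using System.Buffers.Binary;
using System.Collections.Generic;
using System.Text;

// List the type and data length of each chunk in a PNG file.
static List<(string type, int length)> ListPngChunks(byte[] png)
{
    var chunks = new List<(string type, int length)>();
    int pos = 8;                                        // skip the 8-byte PNG signature
    while (pos + 12 <= png.Length)                      // 12 = length + type + CRC
    {
        uint length = BinaryPrimitives.ReadUInt32BigEndian(png.AsSpan(pos, 4));
        if (length > (uint)(png.Length - pos - 12)) break;   // truncated file, or not a real chunk
        string type = Encoding.ASCII.GetString(png, pos + 4, 4);
        chunks.Add((type, (int)length));
        pos += 12 + (int)length;                        // PNG lengths exclude the length, type and CRC fields
    }
    return chunks;
}
```

For a format where the length does include itself, the only change is how far pos advances, which is why it pays to check which convention you're dealing with before trusting the second block.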


Raiders of the Lost Ark

What to look out for when tearing apart archive formats. Recursion vs flat hierarchy


Honey, I Shrunk the Data

Compression and stuff.
Look out for zlib header byte pairs: 78 9C is the most common, and 78 01 and 78 DA also turn up.
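
Rather than memorizing byte pairs, you can test the rule they all follow: the low nibble of the first byte is 8 (deflate), and the two bytes read as a big-endian 16-bit number are a multiple of 31. A minimal sketch, assuming .NET 6 or later with top-level statements (LooksLikeZlibHeader is a made-up name):

```csharp
using System;

// Check whether two bytes could be the start of a zlib stream.
static bool LooksLikeZlibHeader(byte cmf, byte flg)
{
    bool isDeflate = (cmf & 0x0F) == 8;             // compression method 8 = deflate
    bool windowOk  = (cmf >> 4) <= 7;               // window size field must be 7 or less
    bool checkOk   = ((cmf << 8) | flg) % 31 == 0;  // header check: multiple of 31
    return isDeflate && windowOk && checkOk;
}

Console.WriteLine(LooksLikeZlibHeader(0x78, 0x9C));   // True
Console.WriteLine(LooksLikeZlibHeader(0x50, 0x4B));   // False
```

Two bytes is not much to go on, so treat a match as a hint and confirm it by actually trying to decompress from that offset.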


Poke it with a stick

Change things and see how the game reacts


Emergency code-oscopy

Use Process Monitor to look at read sizes. Note that some games just buffer the whole file in one read. This technique is ONLY useful when structs/primitives are read from the file one at a time.


Laziness is the mother of invention

Hex editing is a laborious task, especially when you’re dealing with potentially hundreds or even thousands of files, so let’s try automating things with code. Once I’ve got a rough idea of a file’s structure, I’ll write a small test program to try to load it. Once I can load a single file, I tell the program to try to load all files of that type, and I watch closely to see where it starts falling apart and crashing. I can then inspect the file it was loading when it crashed, see what’s different, and try to work out why it crashed. The boilerplate loading code I typically use is below:

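using System.IO;

// Try to parse one file; an exception here usually means a wrong guess about the format.
void LoadFile(string p_filePath)
{
    using (var br = new BinaryReader(File.OpenRead(p_filePath)))
    {
        // replace the following with the actual loading code
        int    foo = br.ReadInt32();     // 4-byte little-endian integer
        byte[] bar = br.ReadBytes(128);  // 128 raw bytes
        float  qux = br.ReadSingle();    // 4-byte little-endian float
    }
}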

We can then wrap that in a foreach loop, to load all the files from the game:

// gameDir is the folder holding the game's data files; "*.bin" is a placeholder extension
foreach (string filePath in Directory.GetFiles(gameDir, "*.bin"))
{
    LoadFile(filePath);
}

Programming is also really helpful as it allows us to quickly test theories. For example, I noticed that a lot of the image files in Lego Racers began with a byte 04, 08, or 98, and I wanted to know if it was always the case. So I wrote a quick test program, which would shout at me if it found a file that didn’t start with one of those values, and within minutes I had confirmed my suspicions.
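
A test program along those lines could look like the sketch below. It isn't the original program, just an illustration of the idea, assuming .NET 6 or later with top-level statements. The function name, the directory, and the file pattern are all placeholders.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Return every matching file whose first byte is not one of the expected values.
static List<string> FindUnexpectedFirstBytes(string dir, string pattern, byte[] expected)
{
    var offenders = new List<string>();
    foreach (string path in Directory.GetFiles(dir, pattern))
    {
        using var fs = File.OpenRead(path);
        int first = fs.ReadByte();                       // -1 if the file is empty
        if (first < 0 || Array.IndexOf(expected, (byte)first) < 0)
            offenders.Add(path);                         // this one breaks the theory
    }
    return offenders;
}
```

You'd call it with something like FindUnexpectedFirstBytes(gameDir, "*.bin", new byte[] { 0x04, 0x08, 0x98 }) and print whatever comes back. An empty list means the theory holds for every file checked.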


End Of File

some sort of conclusion here
note that offsets are only useful if you know what they’re relative to, and it’s not always the start of the file


note: change these to be all IEEE-referenced