File extensions are only hints

Or “if it’s called a duck, it still might not be a duck”

Note: This article was originally hosted on Medium.

Let’s consider for a moment what a file is. You have a body, and you have metadata. The body is simply an array of bytes. The metadata consists of a filename, along with details of the creation date, modification date, file size, etc. For a program to be able to open a file, it needs to know the format of the underlying data, and be able to decode said format.

So let’s say we have a JPEG image file. The file contains image data stored in the JPEG format, and the filename is foo.jpg. An image program, such as Photoshop or MS Paint, will contain logic for opening data in JPEG format, and displaying the image to the user. When displaying the Open File prompt, the program will use file extensions to filter the list of files to only display formats it understands.

But file extensions are not file formats.

If I take our foo.jpg and rename it to foo.wav, I’ve only changed the metadata; the internal data is unchanged. It has not suddenly become a Waveform Audio File. If an audio program attempts to open the file, it’ll expect WAV data, it won’t see what it’s looking for, and it’ll say to the user “hey, this file is corrupted or something and I can’t open it”. Just because the extension is .wav doesn’t mean the file contains the data commonly associated with that extension.
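To make that concrete, here is a minimal Python sketch of what “looking for WAV data” might involve. The function names sniff_format and sniff_file are made up for illustration, and only two formats are covered; a real program would check far more than this. It relies on two well-known signatures: JPEG data begins with the bytes FF D8 FF, and WAV data is a RIFF container with “WAVE” at offset 8.

```python
def sniff_format(header: bytes) -> str:
    """Guess a format from the first bytes of a file (hypothetical helper)."""
    # JPEG data starts with the bytes FF D8 FF.
    if header.startswith(b"\xff\xd8\xff"):
        return "jpeg"
    # WAV data is a RIFF container: "RIFF", a 4-byte size, then "WAVE".
    if header[:4] == b"RIFF" and header[8:12] == b"WAVE":
        return "wav"
    # Anything else is outside what this sketch knows about.
    return "unknown"


def sniff_file(path: str) -> str:
    """Read the first 12 bytes of a file and guess its format.

    The filename, including its extension, plays no part in the answer.
    """
    with open(path, "rb") as f:
        return sniff_format(f.read(12))
```

Calling sniff_file on a JPEG that has been renamed to foo.wav would still report "jpeg", because the function never looks at the name at all.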

Microsoft were nice enough to base a few features in Windows Explorer on file extensions. For one, it displays an extension-specific icon next to the file. Executable applications can have a custom icon specified in the file, which is displayed instead. Supported image formats also display a thumbnail, in “Large Icons” mode (and some others). In some view modes it also tells you the file “type”, based on the extension. For example, “Windows Batch File” for .bat, “Text Document” for .txt, etc.

They also put in a feature that hides the file extension for “known file types”. For example, “TPS Reports.docx” would just appear as “TPS Reports”. That’s not a terrible idea, but then they decided to have it enabled by default.

So, naturally, the bad people on the internet decided to abuse it. With extensions hidden, “definitely-not-a-virus.txt.exe” would be displayed as “definitely-not-a-virus.txt”, which looks like a text file to the un-savvy user, and it’s trivial to set the executable’s icon to look like a text document icon. Such a user would simply open the file, expecting it to be a text document, and then be rather confused when their computer grinds to a halt and continually advertises cheap viagra.
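The trick is easy to see with a simplified model of the hiding behaviour. This is only a sketch in Python, not how Windows Explorer is actually implemented; the displayed_name function and the KNOWN_EXTENSIONS set are assumptions invented for the example.

```python
import os

# Hypothetical list of extensions the file browser treats as "known".
KNOWN_EXTENSIONS = {".exe", ".txt", ".docx", ".bat"}


def displayed_name(filename: str, hide_known: bool = True) -> str:
    """Return the name a file browser would show (simplified model)."""
    # splitext only splits off the LAST extension, which is the real one.
    stem, ext = os.path.splitext(filename)
    if hide_known and ext.lower() in KNOWN_EXTENSIONS:
        return stem
    return filename


print(displayed_name("TPS Reports.docx"))
print(displayed_name("definitely-not-a-virus.txt.exe"))
print(displayed_name("definitely-not-a-virus.txt.exe", hide_known=False))
```

With hiding on, the second call shows a name ending in .txt, even though the real (last) extension is .exe. Turning hiding off shows the full name and gives the game away.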

Another place where it has been an issue is the videogame modding world. A few years ago I had the pleasure of reverse-engineering the file formats from LEGO Racers (High Voltage Software, 1999), along with my fellow modders from the Rock Raiders United forum. A couple of the formats had the less competent modders utterly stumped, those being .bmp and .mdb.

Everyone knows .bmp is a bitmap image, the format being widely documented online (even on Wikipedia, at the time of writing). But HVS had decided to screw with us; the internal data was a custom image format, and they’d just slapped the .bmp extension on. Several modders made clueless posts, wondering why they couldn’t open them in MS Paint, etc.

A similar issue arose with the .mdb files. It’s lesser-known, but that’s the extension for (older) MS Access databases. Obviously they weren’t actually Access databases; they used a custom material list format, but that didn’t stop clueless modders trying every trick under the sun to get Access to open them. We can thank Windows Explorer telling them “hey, this is an Access Database, look at the shiny icon and everything!” for that.
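A quick sanity check would have saved a lot of head-scratching here. The sketch below, in Python, tests whether a file’s first bytes are even plausible for what its extension claims. The extension_is_plausible function and the EXPECTED_SIGNATURES table are hypothetical, and only the BMP signature (the two ASCII characters “BM”) is filled in; other formats would need their own entries. The “custom” header bytes in the usage lines are invented placeholders, not the actual LEGO Racers format.

```python
import os
from typing import Optional

# Signature that genuine files with each extension start with.
# Only BMP is listed here; a real bitmap file begins with "BM".
EXPECTED_SIGNATURES = {".bmp": b"BM"}


def extension_is_plausible(filename: str, header: bytes) -> Optional[bool]:
    """Check whether the header matches what the extension suggests.

    Returns True or False when a signature is known for the extension,
    and None when this sketch has nothing to compare against.
    """
    ext = os.path.splitext(filename)[1].lower()
    signature = EXPECTED_SIGNATURES.get(ext)
    if signature is None:
        return None
    return header.startswith(signature)


# A genuine bitmap header versus made-up bytes standing in for a custom format.
print(extension_is_plausible("real.bmp", b"BM\x36\x00\x0c\x00"))
print(extension_is_plausible("custom.bmp", b"\x27\x00\x04\x00\x00\x00"))
print(extension_is_plausible("materials.mdb", b"\x27\x00\x04\x00\x00\x00"))
```

A False result doesn’t say what the file is, only that the extension’s hint was wrong and it’s time to open a hex editor.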

People ought to be more aware of this. Even fellow programmers have fallen into this trap, mistaking compiler object files (.obj) for Wavefront 3D model files (.obj). File extensions are only hints; they can never definitively tell us the format of the internal data. foo.wav is only probably a Waveform Audio File, and we cannot know for sure without looking inside it. Widespread use of particular formats (wav, gif, txt, etc.) has led to certain extensions being assumed to mean those particular formats, but it’s important to remember that this is not always the case.