Whispering PDFs

Mar 20, 2023

Hello, we’re going to be trying something a little bit different today. A colleague recommended this whisper.cpp thing, which is a port of OpenAI’s Whisper model to C/C++. So rather than type out this blog, the way it’s going to be formed is that I read a question I’m curious about, talk through it, edit it up, and slap in some screenshots and links to see what kind of blog comes out. This may not be the best blog, but really I’m just trying to figure out if this workflow works in general. We’re going to start off with a question that I found on Reddit on r/asknetsec. You can find it below.

r/asknetsec

This person is talking about books from the 90s that they’re downloading (good for them, looking for educational content), and they have some questions about those files potentially containing ransomware. That’s a valid thing to be concerned about. Things that they’ve checked:

They ran through various antivirus scanners.

That’s always a great start. They posit that some of the things the antivirus flagged might be there to avoid piracy detection. And, frankly, that might be correct. But more likely, there have been so many PDF exploits over the years that various antivirus engines use a lot of fuzzy signatures. Those may be tripping up the scanners and producing false positives. It’s a real thing that happens. We’ll touch back on all of those PDF exploits later in this blog.

They also say they checked that the file isn’t an .exe and just says .pdf.

Now we’re going to dive in a bit here and talk about what a file extension is, what a file system is, and how those two work together. A file extension such as .exe, or to take a more general case, .mp3, doesn’t actually determine the contents of the file. You can prove this by going to any file on your system, preferably one in your documents folder that isn’t load bearing and won’t crash your system if you change it, and renaming it so the file extension is something arbitrary. File extensions are really just helper strings. Certain file extensions are mapped to certain programs. So say you want to play a new music format. Let’s call it the boogie format, because you want to boogie down, and it has a file extension of .boogie.

How does your system know to open that in the relevant boogie music player?

When the installer for the program runs, it registers with the operating system (Windows or Linux) that this file extension should be opened with this program. Along the same lines, a .pdf file extension doesn’t affect the underlying file contents. The underlying contents may actually be an executable despite the file being named .pdf. But the cool thing is that if you try to open an executable file in a PDF reader, or a PDF in something that loads executable code such as rundll32, it’s just not going to work. So really, they’re kind of right for the wrong reasons.
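To make that concrete, here is a minimal Python sketch. It uses the standard library’s mimetypes module, which guesses a type purely from the file name, and compares that guess with what is actually at the start of the file. The file name and the filler bytes are made up for illustration; the only real detail is that Windows executables begin with the two bytes "MZ".

```python
import mimetypes
import os
import tempfile

# "MZ" is how Windows executables begin; the rest is filler for this demo.
FAKE_EXE_BYTES = b"MZ" + b"\x00" * 62


def guess_from_name(path):
    """Guess the type from the file name alone, like an extension lookup."""
    mime_type, _encoding = mimetypes.guess_type(path)
    return mime_type


def first_bytes(path, count=2):
    """Read the first few bytes of the file, ignoring its name."""
    with open(path, "rb") as handle:
        return handle.read(count)


def demo():
    """Write executable-looking bytes into a file that is named like a PDF."""
    with tempfile.TemporaryDirectory() as folder:
        path = os.path.join(folder, "totally_a_book.pdf")
        with open(path, "wb") as handle:
            handle.write(FAKE_EXE_BYTES)
        return guess_from_name(path), first_bytes(path)


if __name__ == "__main__":
    name_says, content_starts_with = demo()
    print("The name says:", name_says)
    print("The content starts with:", content_starts_with)
```

The name-based guess comes back as a PDF type even though the content starts with "MZ", which is the whole point: the extension is a label, not a guarantee.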

And as we get into that, one of the ways you can determine that a file is what it says it is, is to check that it matches the spec and can be parsed correctly. But you’d rather not have to know the entire PDF spec, of which there are a lot of versions. PDF has gone through, I think, seven revisions now, designated as:

PDF 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, and 1.7, and all non-deprecated features defined in a previous PDF version were also included in the subsequent PDF version.

If you don’t want to try to memorize all those, there are these great things called “magic headers” / “magic numbers” / “magic bytes”. Here’s a link below with a nice table where you can look up some of these values.

https://en.wikipedia.org/wiki/List_of_file_signatures : A PDF starts with a defined set of bytes, hex 25 50 44 46 2D, which maps to the ASCII characters %PDF-.
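Here is a minimal sketch of that check in Python. The function names are hypothetical, and it assumes the header sits at the very start of the file and that the version number follows directly after %PDF- on the first line. It only tells you what the file claims to be, not whether it is safe.

```python
PDF_MAGIC = bytes.fromhex("25 50 44 46 2D")  # the same bytes as b"%PDF-"


def looks_like_pdf(data):
    """Return True if the data begins with the PDF magic bytes."""
    return data.startswith(PDF_MAGIC)


def declared_version(data):
    """Return the version text after "%PDF-" (for example "1.4"), or None."""
    if not looks_like_pdf(data):
        return None
    first_line = data.splitlines()[0]
    return first_line[len(PDF_MAGIC):].decode("ascii", errors="replace").strip()


def check_file(path):
    """Read only the start of a file and report what it claims to be."""
    with open(path, "rb") as handle:
        head = handle.read(16)
    return looks_like_pdf(head), declared_version(head)
```

Calling check_file on a downloaded book gives back a pair: whether the magic bytes matched, and which PDF version the header declares.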

Now, conveniently, you can actually just check this. In the large majority of cases, it’s just going to be a PDF file. Now that doesn’t speak to whether or not it’s going to be malicious, but that’s at least how you confirm that it is that type of file, or at least was intended to be parsed as such. So the user also states “it all opens this actual PDF file on Google, and I can read as normal.”

That’s great.

Actually, perhaps inadvertently, they are doing the exact thing that I would recommend. The risk of a PDF being malicious comes from what I spoke about before: there are a lot of different features and a lot of different versions of the PDF file format. With that come bugs, and bugs can be exploited.

When we’re talking about a reader for a very sprawling file format like PDF, you want one that is battle tested. When they say, “opens an actual PDF file on Google,” I assume they’re referring to a web browser, such as Chrome or Firefox. These days, browsers typically auto-update, so you’re nearly always on the newest version.

They’re battle tested.

They’re being used by users all around the world thousands and thousands of times per day. Additionally, when they parse and render PDF files, they take steps to limit the memory-corruption bugs that have historically plagued PDF readers. Firefox uses a toolkit called PDF.js, a PDF parser and renderer written in JavaScript, which is a memory-safe language. Chrome uses PDFium, which is written in C++ but runs inside the browser’s sandbox.

The other side of this conversation is that this user very clearly just wants to read some PDFs. For their purposes, opening them in a web browser is exactly what I recommend. You’re very unlikely to encounter an exploit in a web browser, and it’s even less likely that a functional exploit is going to be spent on some random user trying to download PDFs from the 90s.

The flip side is that when we talk about PDF-based exploitation, we’re talking about valuable targets, such as people working in the finance department at a company. They often won’t be using a web browser to read a PDF, because they want to interact with some of the extended features of PDFs, such as signing documents for contracts. Those features may not be integrated into Firefox or Chrome.

So they will use something such as Adobe Acrobat.

And perhaps they pirated their copy of Adobe Acrobat and cannot update it.

As a result, there may be an exploit that can be delivered through a PDF file, resulting in their system being compromised.

That’s really one of the more common paths that you’ll see when we’re talking about vulnerable readers or malicious PDF files. For a user who just wants to read content from the 90s, I hope I’ve provided some information about how to determine that a file actually is a PDF. But by and large, regardless of whether it’s a PDF file or an executable file, if you open it in a web browser, you’re much less likely to be exploited, and that’s exactly what I would recommend as a baseline.

Now, the one last avenue of exploitation when you open a PDF file in a web browser, barring some type of zero-day exploit, is the human flaw, the phishing flaw.

When you’re reading a PDF document, you may see links that point out to other places and try to get you to click them or type in various information. Just remember that you’re reading an untrusted document and trying to gain information from it. If it asks you for your credit card information, you probably shouldn’t enter that. But this is more general phishing training and information security: make sure you don’t put your information into untrusted sources.
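If you’re curious where a document’s links point before you click anything, here is a rough Python sketch that lists them. It assumes link targets are stored in the raw file as /URI (https://...) entries, which is how PDF link actions are commonly written, and the function name is hypothetical. It is deliberately naive: it will miss links tucked inside compressed streams and it doesn’t handle escaped parentheses, so an empty result doesn’t mean the document has no links.

```python
import re

# Link targets in a PDF commonly appear as "/URI (https://...)".
# Naive on purpose: misses compressed streams and escaped parentheses.
URI_PATTERN = re.compile(rb"/URI\s*\(([^)]*)\)")


def list_link_targets(data):
    """Return the link targets found in raw PDF bytes, in file order."""
    return [match.decode("latin-1") for match in URI_PATTERN.findall(data)]


if __name__ == "__main__":
    # A made-up fragment of a link action, just to show the shape.
    sample = b"<< /Type /Action /S /URI /URI (https://example.com/login) >>"
    for target in list_link_targets(sample):
        print(target)
```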

I hope that’s answered some of the questions, and I guess we’re going to see how this type of blog ends up turning out. Hopefully it turns out pretty well. Goodbye for now.