December 12th, 2024

Converting untrusted PDFs into trusted ones: The Qubes Way (2013)

Qubes OS provides a method to convert untrusted PDFs into trusted ones using Disposable VMs, creating a "Simple Representation" in RGB format, though it limits text search and editing capabilities.


The article discusses a method for converting untrusted PDF files into trusted ones using Qubes OS, which enhances desktop security by isolating potentially harmful files. The author, Joanna Rutkowska, highlights the challenges posed by complex PDFs that can compromise systems. The existing method in Qubes OS involves using Disposable VMs to open files safely, but this can be cumbersome due to the time it takes to create these VMs. The proposed solution involves parsing the original PDF in a Disposable VM and generating a "Simple Representation" of the file, specifically in RGB format, which is easier to handle securely. This method allows for the safe conversion of PDFs by ensuring that only simple, non-malicious data is processed. The article details the implementation of this conversion service using Qubes' infrastructure, emphasizing the importance of strict policies to prevent malicious exploitation. While this approach effectively mitigates risks, it does come with limitations, such as the loss of text search and editing capabilities in the converted files. Overall, the method aims to provide a more efficient way to handle untrusted PDFs while maintaining security.

- Qubes OS offers a method to convert untrusted PDFs into trusted ones using Disposable VMs.

- The proposed solution involves creating a "Simple Representation" of PDFs in RGB format for safer processing.

- The conversion process is designed to minimize risks associated with parsing complex PDF files.

- Limitations include the loss of text search and editing capabilities in the converted documents.

- The implementation relies on strict policies to ensure security during file conversion.
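The security argument in the summary rests on the trusted side only ever accepting data simple enough to validate exhaustively. Below is a minimal Python sketch of that idea. It is not the Qubes implementation: the `"W H"` text header, the `MAX_DIM` limit, and the function name `parse_simple_rgb` are all assumptions made for illustration.

```python
MAX_DIM = 10_000  # hypothetical upper bound on page width/height in pixels


def parse_simple_rgb(header: str, pixels: bytes):
    """Validate an untrusted "W H" header and a raw 8-bit RGB pixel buffer.

    Returns (width, height, pixels) or raises ValueError. Nothing from the
    untrusted side is interpreted beyond two integers and a byte count.
    """
    parts = header.strip().split(" ")
    # Accept exactly two fields made only of ASCII digits.
    if len(parts) != 2 or not all(p.isascii() and p.isdigit() for p in parts):
        raise ValueError("malformed header")

    width, height = int(parts[0]), int(parts[1])
    if not (0 < width <= MAX_DIM and 0 < height <= MAX_DIM):
        raise ValueError("dimensions out of range")

    # Raw RGB has exactly 3 bytes per pixel; any other length is rejected.
    if len(pixels) != width * height * 3:
        raise ValueError("pixel buffer length mismatch")

    return width, height, pixels
```

The point of the sketch is that the whole check fits in a few lines: two bounded integers and a length comparison, compared with the far larger surface of a full PDF parser.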

AI: What people are saying
The comments on the article focus on alternative tools, output file size, lighter-weight isolation, and safer document formats.
  • Alternative Solutions: Users mention other tools like Dangerzone, which offer similar functionalities without requiring Qubes OS, and suggest exploring formats like ePub for safer document handling.
  • File Size Concerns: Some commenters express frustration over the large file sizes of the output PDFs, comparing them to PNGs and questioning the efficiency of the conversion process.
  • Security and Isolation: There are discussions about the necessity of using VMs for PDF rendering, with suggestions for simpler sandboxing methods, such as using Docker or chroot jails.
  • Parsing and Simplification: Commenters propose the idea of parsing PDFs to remove unnecessary elements while retaining essential content, which could enhance usability for machine learning applications.
  • General Security Practices: The conversation touches on broader security practices, including the potential for scanning PDFs for unsafe code and the importance of converting documents into safer formats.
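The file-size concern can be made concrete with a rough calculation of how large one page is as an uncompressed RGB raster. This is a sketch under stated assumptions: the page size (US Letter) and the 300 DPI resolution are illustrative choices, not values taken from the Qubes converter, and `raw_rgb_bytes` is a hypothetical helper.

```python
def raw_rgb_bytes(width_in: float, height_in: float, dpi: int) -> int:
    """Bytes needed for one uncompressed 8-bit RGB page raster."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * 3  # 3 bytes per pixel: R, G, B


# US Letter (8.5 x 11 in) at an assumed 300 DPI.
page = raw_rgb_bytes(8.5, 11, 300)
print(page)  # 25245000 bytes, roughly 24 MiB per page before compression
```

Any image compression applied when the output PDF is assembled reduces this figure, but the result is still an image per page rather than text and vector data, which is consistent with the commenters' observation that the output is large.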
12 comments
By @aspenmayer - about 2 months
Related:

https://dangerzone.rocks/

https://github.com/freedomofpress/dangerzone

> Take potentially dangerous PDFs, office documents, or images and convert them to safe PDFs.

From the learn more about page:

> Dangerzone was inspired by TrustedPDF but it works in non-Qubes operating systems, which is important, because most of the journalists I know use Macs and probably won’t be jumping to Qubes for some time.

> It uses gVisor sandboxes running in Linux containers to open dangerous documents, instead of virtual machines. And it also adds some features that TrustedPDF doesn’t have: it works with any office documents, not just PDFs; it uses optical character recognition (OCR) to make the safe PDF have a searchable text layer; and it compresses the final safe PDF.

Previously (announcement and details of gVisor sandboxing etc):

Safe Ride into the Dangerzone: Reducing Attack Surface with GVisor

https://news.ycombinator.com/item?id=41630076

By @mjevans - about 2 months
It looks like the qpdf-converter source, along with everything else, is now on GitHub, according to the Developer / Source Code links on the site.

https://github.com/QubesOS/qubes-app-linux-pdf-converter

Their source code seems to take the most obvious path... flatten it to an image printout then possibly do more? https://github.com/QubesOS/qubes-app-linux-pdf-converter/blo... https://github.com/QubesOS/qubes-app-linux-pdf-converter/blo...

Though at a quick skim I can't see any OCR steps.

By @nickpsecurity - about 2 months
This is a good approach. It’s an old design pattern in high-assurance systems where a gateway converts things into a usable, safer form. Another concept, often called LANGSEC, is generating parsers from simple grammars that are hopefully bulletproof. These ideas can be combined.

Two more things can happen.

The increasing volume of memory-safe utilities means they can be used on one or both sides of this. That might prevent the exploit entirely. Even with a memory-safe CPU, isolation can still help in case of hardware failures (esp. bitflips).

It can also be used to boost performance in non-Qubes systems where a secure (or OSS) processor is in use. They’re often slower than commodity CPUs. So, one can use the disposable VMs on commodity CPUs to filter data (block most attacks), transform it, and send it over a simple wire protocol. Commodity VMs might also present it back to the user in dressed-up form.

Outside of security, a long time ago, they were doing similar things to decrease latency and boost bandwidth on Beowulf clusters. A team made Fast (or Active?) Messages to eliminate TCP/IP as a bottleneck. So, sometimes a security technique can also be a performance booster.

By @dmwilcox - about 2 months
I haven't used this in a few years since switching off of Qubes but something no one mentioned is that output PDFs are *huge*. They're practically PNGs with a .PDF extension in terms of size.

I love the idea of making PDFs dumber and safer, but maybe ePub would fit the bill? I'm just thinking out loud. I would like to do this again, but the Qubes way of spinning up a disposable VM to produce a monster PDF file is unsatisfying. More generally, Qubes being slow was a big reason I switched off of it.

By @dang - about 2 months
Related:

Converting untrusted PDFs into trusted ones: The Qubes Way (2013) - https://news.ycombinator.com/item?id=10538888 - Nov 2015 (5 comments)

By @ddtaylor - about 2 months
I'm curious, we use many PDF parsing and formatting tools as part of an ML ingest pipeline. Our goal is to keep the document as close as possible to the original in meaning, but remove unwanted junk and simplify the document by removing everything non-content related or converting it to text the ML can work with.

Surely you can do that instead? Parse the PDFs and format them in basic ways without support for "extensions" or anything. Let the user read that before using the "real" document with extensions potentially enabled.

By @ngneer - about 2 months
Cute idea. Reminds me of the format conversions typically used to lessen the risk of steganography. But boy, the article took forever to get to the PDF -> RGB idea, talking about Simple Representation and everything in between. RGB has less complex parsing, ergo less attack surface.

By @trollbridge - about 2 months
PDFs are essentially a representation of PostScript. PostScript itself is expected to be run in a VM and is relatively straightforward to isolate/sandbox.

So there is a certain absurdity in needing to spin up an entire VM just to render a PDF. Running a standard PostScript renderer in a user executable (perhaps in a chroot jail to be a little bit paranoid) should be enough for safety. Or just stick it inside a Docker container.

Restrict the permissions on the user process to “read my static data files like fonts” and “write output to this 1 file, or a parsing error to this other 1 file”.

By @lysace - about 2 months
I wonder how Google handles this. Thousands of their software people will need to read PDFs from all over the web from work machines.

By @brian-armstrong - about 2 months
This seems like a lot of work. Surely you can just rewrite the PDF parser in Rust?

By @random3 - about 2 months
Isn't it possible to scan PDFs for unsafe code? Having a tool that makes docs safe from "now on" is only semi-useful if you don't know whether you were already compromised.

By @mavhc - about 2 months
What about PDF/A, the archival version with restricted features?