
How the Great Firewall of China Detects and Blocks Fully Encrypted Traffic


One of the cornerstones in censorship circumvention is fully encrypted protocols, which encrypt every byte of the payload in an attempt to “look like nothing”. In early November 2021, the Great Firewall of China (GFW) deployed a new censorship technique that passively detects—and subsequently blocks—fully encrypted traffic in real time. The GFW’s new censorship capability affects a large set of popular censorship circumvention protocols, including but not limited to Shadowsocks, VMess, and Obfs4. Although China had long actively probed such protocols, this was the first report of purely passive detection, leading the anti-censorship community to ask how detection was possible.

The paper discloses findings and suggestions to the developers of different anti-censorship tools, helping millions of users successfully evade this new form of blocking.

3 comments
  • That can be used as a heuristic, and that may be good enough to disrupt widespread use of VPN protocols.
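
    As a rough illustration of what a passive "looks like nothing" heuristic might check, here is a minimal sketch. The function name and all three thresholds are assumptions picked for illustration, not the GFW's actual rules: it flags a payload when its bits look evenly balanced and few of its bytes are printable ASCII.

```python
def looks_fully_encrypted(payload: bytes,
                          bit_lo: float = 0.425,
                          bit_hi: float = 0.575,
                          max_printable: float = 0.5) -> bool:
    """Hypothetical heuristic: True if the payload resembles uniform random bytes.

    All thresholds are illustrative assumptions, not measured censor behaviour.
    """
    if not payload:
        return False
    # Fraction of bits set to 1; random data sits close to 0.5.
    ones = sum(bin(b).count("1") for b in payload)
    bit_fraction = ones / (8 * len(payload))
    # Fraction of bytes in the printable ASCII range 0x20..0x7E.
    printable = sum(0x20 <= b <= 0x7E for b in payload) / len(payload)
    return bit_lo <= bit_fraction <= bit_hi and printable <= max_printable
```

    Plain text fails the printable-byte check and is left alone, which is the kind of gap the rest of this comment is about.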

    But it's going to be hard to create an ironclad mechanism against steganographic methods, because any protocol that contains random data or data that can't be externally validated can be used as a VPN tunnel.

    I can create "VPN over FTP", where I have a piece of software that takes in a binary stream and generates a comma-separated-value file that looks something like this:

    employee,id,position
    John Smith,54891,Recruiter
    Anne Johnson,93712,Receptionist
    

    etc.

    Then at the other end, I convert back.

    So I have an FTP connection that's transmitting a file that looks like this.

    That's human-readable, but the problem for the censor is that it's hard to tell that all of those fields might actually be encoding data, which might well be an encrypted VPN connection.
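
    A minimal sketch of that converter, assuming a made-up mapping: the name lists, the 10000 offset on the id, and the 4-byte length prefix are all hypothetical choices for illustration. Each row carries three bytes (one in the name, two in the id), and the position column is just filler.

```python
import csv
import io
import struct

# Hypothetical 16-entry word lists: each index encodes one 4-bit nibble.
FIRST = ["John", "Anne", "Maria", "David", "Linda", "James", "Susan", "Peter",
         "Karen", "Brian", "Laura", "Kevin", "Nancy", "Scott", "Helen", "Frank"]
LAST = ["Smith", "Johnson", "Brown", "Taylor", "Wilson", "Clark", "Lewis", "Walker",
        "Hall", "Young", "King", "Wright", "Green", "Baker", "Adams", "Nelson"]
POSITIONS = ["Recruiter", "Receptionist", "Analyst", "Engineer"]  # carries no data


def encode(payload: bytes) -> str:
    """Turn arbitrary bytes into an innocuous-looking CSV document."""
    # Prefix the length so the decoder can drop the zero padding again.
    framed = struct.pack(">I", len(payload)) + payload
    framed += b"\x00" * (-len(framed) % 3)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["employee", "id", "position"])
    for i in range(0, len(framed), 3):
        b0, b1, b2 = framed[i:i + 3]
        name = f"{FIRST[b0 >> 4]} {LAST[b0 & 0x0F]}"   # byte 0 -> two nibbles
        emp_id = 10000 + ((b1 << 8) | b2)              # bytes 1-2 -> 5-digit id
        writer.writerow([name, emp_id, POSITIONS[(i // 3) % len(POSITIONS)]])
    return out.getvalue()


def decode(text: str) -> bytes:
    """Recover the original bytes from a CSV produced by encode()."""
    rows = list(csv.reader(io.StringIO(text)))[1:]     # skip the header row
    framed = bytearray()
    for name, emp_id, _position in rows:
        first, last = name.split(" ")
        framed.append((FIRST.index(first) << 4) | LAST.index(last))
        framed += (int(emp_id) - 10000).to_bytes(2, "big")
    (length,) = struct.unpack(">I", framed[:4])
    return bytes(framed[4:4 + length])
```

    `decode(encode(data))` gives back `data`. A toy mapping like this has uniformly distributed names and ids when the input is encrypted, so a real encoder would need to mimic realistic field statistics too.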

    You can do traffic analysis, look for bursty traffic, but the problem is that as long as the VPN user is willing to blow bandwidth on it, that's easy to counter by just filling in the gaps with padding data.
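
    The padding idea can be sketched like this, assuming a hypothetical fixed 512-byte cell with a 2-byte length header (both numbers are arbitrary). The sender emits one cell per clock tick whether or not it has real data, so the wire sees a constant rate; the cells are assumed to be encrypted or encoded afterwards so the zero padding isn't visible.

```python
import struct

CELL_SIZE = 512                      # hypothetical fixed size of every cell
HEADER = struct.Struct(">H")         # 2-byte count of real payload bytes
CAPACITY = CELL_SIZE - HEADER.size   # real bytes that fit in one cell


def next_cell(queue: bytearray) -> bytes:
    """Build the cell for this tick, consuming up to CAPACITY queued bytes.

    If the queue is empty, the cell is pure padding (length header of 0).
    """
    chunk = bytes(queue[:CAPACITY])
    del queue[:CAPACITY]
    return HEADER.pack(len(chunk)) + chunk + bytes(CAPACITY - len(chunk))


def read_cell(cell: bytes) -> bytes:
    """Return only the real payload bytes of a received cell."""
    (n,) = HEADER.unpack_from(cell)
    return cell[HEADER.size:HEADER.size + n]
```

    Calling `next_cell(queue)` on a timer produces same-sized cells at a steady rate, and the receiver concatenates `read_cell(...)` results to rebuild the stream. The cost is exactly the wasted bandwidth mentioned above.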

    You can maybe detect one format, but I'd wager that (a) it takes a lot less effort to write a new encoder by hand than to detect its output, and (b) it's probably possible to automatically train one that can "learn" to generate similar-looking data by just being fed a bunch of files to emulate.

    A censor can definitely raise the bar for using a VPN. They don't need a 100% solution. And they can augment automated firewall blocks with severe legal penalties aimed at people who go out of their way to bypass blocks. They can reduce the reliability of VPNs, make it hard to pay for VPN service, and increase the bandwidth requirements or latency of VPNs.

    But on the flip side, steganography is probably going to be impossible to fully counter if one intends to blacklist rather than whitelist traffic. And if you whitelist traffic, you give up the benefits of full access to the Internet. Some countries have chosen to do that -- North Korea, for example. But that is a very costly trade to make.

    EDIT: Probably an even-more-obnoxious "host file" for steganographic data would be a file format that intrinsically encrypts data, like a password-protected ZIP file. For protocols protected by X.509 certificates, like TLS, China can mandate that everyone trust a CA that it runs so that it can conduct man-in-the-middle attacks on connections. But ZIP doesn't work that way -- it only uses a password. Users cannot trivially backdoor their ZIP encryption so as to let the Great Firewall see inside. So if someone starts using the encrypted ZIP file format as an encrypted VPN tunnel, China would be looking at blocking transfers of encrypted ZIP files. And there's gonna be less bandwidth overhead to an encrypted ZIP file in terms of encoding than my above CSV file.

    And even if China, after a long, arduous effort, transitions people off encrypted ZIP, all one needs is a new file format in use that uses encryption.

    • automatically train one that can "learn" to generate similar-looking data by just being fed a bunch of files to emulate

      Sounds like a job for a "compression prompt" for ChatGPT... [and thus, the AI wars began]

  • This really sucks, but we do know it's a cat and mouse game. AI/coded, doesn't matter, it's pattern recognition. It's only a matter of time until someone figures out how to change the pattern in a way that isn't detected.