A primer on perceptual hashing

Published on August 11th, 2021 by ASK-Solutions in Technology & Society

There seems to be much confusion about what perceptual hashing is, probably because of the word hashing, which most people associate with file integrity. Perceptual hashing has nothing to do with that. It is an application of AI: a machine learning technique.

What is an integrity hash?

An integrity hash is a simple calculation performed on the contents of a file, or on the contents of a packet received through a communications protocol. The calculation can be as simple as adding up all the byte values and keeping the result as a single byte. For instance, the message Hello World has the decimal byte values 72 101 108 108 111 32 87 111 114 108 100. Taking the sum gives 72 + 101 + 108 + 108 + 111 + 32 + 87 + 111 + 114 + 108 + 100 = 1052, which reduced to a byte (1052 modulo 256) is 28. So we can check whether the contents of the file or packet are correct by calculating the sum again and comparing it with the number 28. If it matches, the message is most probably Hello World, without spelling mistakes, wrong capitals, mangled characters, missing characters or extra characters.
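The byte sum described above can be checked with a few lines of Python. This is only a sketch of the toy checksum from the example; the function name byte_sum_checksum is made up for illustration.

```python
def byte_sum_checksum(data: bytes) -> int:
    """Add up all byte values and keep only the lowest 8 bits (one byte)."""
    return sum(data) % 256


message = "Hello World".encode("ascii")

print(list(message))               # the decimal byte values of the message
print(sum(message))                # 1052
print(byte_sum_checksum(message))  # 28
```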

A simple sum is a weak form of integrity hashing: it cannot detect many errors. In the example there are only 256 possible sums, as a byte can have a value from 0 up to 255, and swapping two characters leaves the sum unchanged. Better integrity hash algorithms are CRC16 and CRC32. Instead of a simple sum, the Cyclic Redundancy Check (CRC) is based on polynomial division: the message is treated as one long binary polynomial, and the remainder after dividing it by a fixed generator polynomial is the check value. For instance, the CCITT¹ 16-bit CRC used in telecommunications has the generator polynomial x¹⁶ + x¹² + x⁵ + 1.
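To make the difference concrete, here is a minimal bit-by-bit sketch of a 16-bit CRC with the polynomial x¹⁶ + x¹² + x⁵ + 1 (written as 0x1021). The start value of 0 and the absence of a final XOR are assumptions made for this example; real protocols that use this polynomial differ on those details. The result is cross-checked against binascii.crc_hqx from the Python standard library, which uses the same polynomial.

```python
import binascii

CCITT_POLY = 0x1021  # x^16 + x^12 + x^5 + 1; the x^16 term is implicit


def crc16_ccitt(data: bytes, crc: int = 0x0000) -> int:
    """Bitwise CRC-16, most significant bit first, start value assumed 0."""
    for byte in data:
        crc ^= byte << 8                  # feed the next byte into the top bits
        for _ in range(8):
            if crc & 0x8000:              # top bit set: shift and subtract the polynomial
                crc = ((crc << 1) ^ CCITT_POLY) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


original = b"Hello World"
swapped = b"Hello Wolrd"  # two letters transposed

# The simple byte sum cannot tell the two messages apart ...
print(sum(original) % 256 == sum(swapped) % 256)        # True
# ... but the CRC can.
print(crc16_ccitt(original) == crc16_ccitt(swapped))    # False
# Cross-check against the standard library implementation.
print(crc16_ccitt(original) == binascii.crc_hqx(original, 0))  # True
```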

Security hash

A security hash, or cryptographically strong hash, is very similar to an integrity hash. Instead of being designed to detect small and multiple errors in a message, it is designed so that the input cannot be reconstructed from the hash output. This means that no practical reverse calculation exists and that the resulting hash shows no recognizable pattern for particular inputs. Security hashes can be, and often are, used as integrity hashes. For instance, a hash such as SHA-3 can be used to store a password in a database without storing the password itself, but also to verify a large file download. MD5 has long been used for both purposes as well, but it is now considered broken and should no longer be relied on for security.
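Both uses can be sketched with Python's standard hashlib module. The download check uses SHA-3 as named above. For the password, the sketch uses PBKDF2 with a random salt; that choice, the iteration count and the helper names are assumptions for this example, not something prescribed here.

```python
import hashlib
import hmac
import os


def download_digest(data: bytes) -> str:
    """SHA-3 digest of a file's contents, as a hexadecimal string."""
    return hashlib.sha3_256(data).hexdigest()


def hash_password(password: str, salt: bytes, iterations: int = 100_000) -> bytes:
    """Salted, deliberately slow password hash (PBKDF2 is an assumed choice)."""
    return hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, iterations)


# A tiny change in the input gives a completely different digest.
a = download_digest(b"Hello World")
b = download_digest(b"Hello Wolrd")
print(a)
print(b)

# Only the salt and the hash are stored, never the password itself.
salt = os.urandom(16)
stored = hash_password("correct horse", salt)
print(hmac.compare_digest(hash_password("correct horse", salt), stored))  # True
print(hmac.compare_digest(hash_password("wrong horse", salt), stored))    # False
```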

Perceptual hash

A perceptual hash has little to do with integrity and security hashes. It is a machine learning technique that is used with neural networks. The networks are trained to return numbers representing the presence, or absence, of something of importance. These numbers form the resulting perceptual hash. How unique the resulting hash is depends on the number of important things the network looks for and on the training it has received. How usable the perceptual hash is depends on which important things were chosen and on how the neural network was trained.

Neural network

A neural network is a network of interconnected nodes. Each node has multiple inputs and an output. A node can, for instance, have inputs A and B and output Y, and perform a simple calculation such as A + B = Y or A × B = Y. Which calculation is chosen and how the nodes are connected depends on the application the neural network will be used for. A typical neural network consists of a large number of entry nodes, a huge number of intermediate nodes and one or a handful of exit nodes.
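As a minimal illustration of such nodes, the sketch below wires two adding nodes into one multiplying node. The layout (four entry values, two intermediate nodes, one exit node) and all names are chosen purely for this example.

```python
def add_node(a: float, b: float) -> float:
    """A node that calculates A + B = Y."""
    return a + b


def multiply_node(a: float, b: float) -> float:
    """A node that calculates A × B = Y."""
    return a * b


def tiny_network(inputs: list[float]) -> float:
    """Four entry values, two intermediate nodes, one exit node."""
    a, b, c, d = inputs
    y1 = add_node(a, b)           # first intermediate node
    y2 = add_node(c, d)           # second intermediate node
    return multiply_node(y1, y2)  # exit node


print(tiny_network([1.0, 2.0, 3.0, 4.0]))  # (1 + 2) × (3 + 4) = 21.0
```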

Creating a simple perceptual hash

Let us create a simple perceptual hash. First we need to determine what our input will be. Second we need to determine what type of neural network we need, and how many. Lastly we need to determine what we do with the output of the neural network: the perceptual hash.

Step 1: what do we want to achieve? As we're talking about perceptual hashes, let us choose image recognition. Visual perception is most probably the easiest to follow and understand.

We're interested in images that:

contain a My Little Pony²,
contain a flower pot and
do not contain a cat.

What is needed to do this:

A medium resolution image.
Greyscale suffices.
A neural network with an entry node for each pixel and 3 exits; let's call them MLP, FP and NOTCAT³.

For training:

a collection of images of My Little Ponies,
a collection of images of flower pots,
a collection of images of cats and
a collection of random images.

Let's build it

We determine that our input image will be 640 × 480 pixels in 8-bit greyscale. We believe such an image has enough definition to perceive the content we're looking for. Humans can do this with as little as a quarter of this size: just about everybody in the late eighties and early nineties was able to recognize images on their CGA and EGA computers.

So we need a neural network with 640 × 480 = 307,200 input nodes and 3 output nodes. We choose the intermediate nodes to be the weighted sum of two inputs, and let's choose a network complexity of 12 levels deep. This means that the input nodes are connected to a first set of 153,600 intermediate nodes. These first intermediate nodes in turn connect to 76,800 nodes. This halving is done 12 times in total, which leaves a last set of 75 intermediate nodes. All 3 output nodes are connected to this last set of 75 intermediate nodes.
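The layer sizes follow from simple arithmetic: if every intermediate node combines two inputs, each level has half as many nodes as the one before it. The sketch below works this out under that assumption; the function name layer_sizes is made up for the example.

```python
WIDTH, HEIGHT = 640, 480
LEVELS = 12


def layer_sizes(input_nodes: int, levels: int) -> list[int]:
    """Node count per level when every node combines two outputs of the previous level."""
    sizes = [input_nodes]
    for _ in range(levels):
        if sizes[-1] % 2:
            raise ValueError("layer cannot be split into pairs")
        sizes.append(sizes[-1] // 2)
    return sizes


sizes = layer_sizes(WIDTH * HEIGHT, LEVELS)
print(sizes[0])        # 307200 input nodes
print(sizes[1])        # 153600 nodes in the first intermediate level
print(sizes[2])        # 76800 nodes in the second intermediate level
print(sizes[-1])       # 75 nodes in the last intermediate level
print(sum(sizes[1:]))  # 307125 intermediate nodes in total
```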

Training our network

Now that we've built our neural network, we have to train it. For training we present the neural network with chosen images. With each image we also set the output to the value we desire. Then, for every node, we change the weighting factors of the weighted sum in such a manner that the output moves towards the desired output. The more we repeat this learning process, the better (or worse) the neural network becomes. Why worse? If we don't train the neural network in the correct manner with correctly chosen inputs, the neural network will learn something different from what we want it to learn.
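The idea of nudging the weighting factors towards the desired output can be shown on a single weighted-sum node. This is a toy sketch only: the update rule (a simple delta rule), the learning rate and the training samples are all assumptions for this example, and a real network adjusts many nodes at once.

```python
def train_node(samples, learning_rate=0.1, epochs=200):
    """Train one node Y = w_a * A + w_b * B by repeatedly correcting its weights."""
    w_a, w_b = 0.0, 0.0
    for _ in range(epochs):
        for (a, b), desired in samples:
            output = w_a * a + w_b * b
            error = desired - output
            # Move each weight a little in the direction that reduces the error.
            w_a += learning_rate * error * a
            w_b += learning_rate * error * b
    return w_a, w_b


# Hypothetical training data: the desired output simply follows input A.
samples = [((1.0, 0.0), 1.0), ((0.0, 1.0), 0.0), ((1.0, 1.0), 1.0)]
w_a, w_b = train_node(samples)
print(round(w_a, 3), round(w_b, 3))  # close to 1.0 and 0.0
```

If the samples had been chosen badly, for example with the desired outputs attached to the wrong inputs, the same loop would converge just as happily on the wrong weights.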

We offer three images to the neural network: two images of a My Little Pony with a flower pot, for which we indicate that the outputs of the neural network should be 100%, 100% and 0%, and a picture of a cat, for which we specify that the outputs should be 0%, 0% and 100%.

The resulting output

Now we have a trained neural network, leaving us with three outputs: MLP, FP and NOTCAT. We still have to do something with them. In our case we want a single number, a hash, that represents what we want to perceive: a My Little Pony with a flower pot and no cat. So we define the calculation Hash = (MLP * 100 + FP * 100) + NOTCAT.
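The calculation can be written out as follows. The sketch assumes that each output is a percentage between 0 and 100 that is rounded to a whole number before it is combined; the rounding and the sample values are assumptions for this example.

```python
def perceptual_hash(mlp: float, fp: float, notcat: float) -> int:
    """Hash = (MLP * 100 + FP * 100) + NOTCAT, with outputs as whole percentages."""
    mlp_pct = round(mlp)        # assumption: round each output to a whole percent
    fp_pct = round(fp)
    notcat_pct = round(notcat)
    return (mlp_pct * 100 + fp_pct * 100) + notcat_pct


print(perceptual_hash(100.0, 100.0, 0.0))  # pony with flower pot, no cat: 20000
print(perceptual_hash(0.0, 0.0, 100.0))    # only a cat: 100
print(perceptual_hash(100.0, 0.0, 0.0))    # pony only: 10000
print(perceptual_hash(0.0, 100.0, 0.0))    # flower pot only: also 10000
```

The last two lines show that this simple formula already maps different outputs onto the same number.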

We can store this hash in a database along with, for instance, the name of the My Little Pony. Now we've created an AI recognition algorithm to recognize My Little Ponies. We present an image and get back a name, but only when there's also a flower pot and no cat in the image. Otherwise it should return no name.

Instead of a name we can also store other data, such as a flag for adult imagery, a flag for child sexual abuse, or the case number of a running or finished child abuse investigation. Now we've got a database of hashes identifying known child abuse: a database of known CSAM. How well this identifies known child abuse depends on many factors, as we've seen.

Let's present two of the learning images and an unknown image to our neural network and calculate the perceptual hashes. [Figure: hashes output by the neural network] A duplicate hash arises because of the chosen simplicity of the network, the few features it can recognize and the limited training it has had. Both the Butter Cup learning image and a new donkey image lead to the same perceptual hash. In this situation, the donkey image results in a perceptual hash that is present in our database, namely that of Butter Cup.
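The false match can be mimicked with a small lookup table. All network outputs below are invented for illustration, and the hash function repeats the formula from this article under the assumption that outputs are percentages rounded to whole numbers.

```python
def perceptual_hash(mlp: float, fp: float, notcat: float) -> int:
    """Hash = (MLP * 100 + FP * 100) + NOTCAT, outputs rounded to whole percentages."""
    return (round(mlp) * 100 + round(fp) * 100) + round(notcat)


# Hypothetical network outputs (MLP, FP, NOTCAT) in percent.
butter_cup_outputs = (100.0, 100.0, 0.0)  # learning image
donkey_outputs = (99.7, 100.0, 0.2)       # new image the network never saw
cat_outputs = (0.0, 0.0, 100.0)

# The database maps a perceptual hash to whatever we chose to store with it.
database = {perceptual_hash(*butter_cup_outputs): "Butter Cup"}

print(database.get(perceptual_hash(*donkey_outputs)))  # Butter Cup (a false match)
print(database.get(perceptual_hash(*cat_outputs)))     # None
```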

Practical uses of Perceptual hashes

Perceptual hashes are used for many things, for example by Apple for detecting known child sexual abuse images. The criticism of this is that it is not the job of a commercial company to scan images on its customers' devices. The second problem with Apple's plans is what will happen to the hashes, and what value will be given to them.

With Apple's plans, the risks are the same as in this example. What happens with the donkey image above would, in Apple's case, lead to finding a hash corresponding to known child abuse. That is why it is so important that a human looks at the images. The perceptual hash algorithm, even if well designed and trained for the purpose for which it is used, is only suitable for making a pre-selection.

After detection, there are two choices: either send the images that yield the known hash to Apple and have them viewed by employees, or send only the values from which the hash is constructed. The first is very dangerous. It is a complete invasion of privacy, and the staff may see very shocking images, perhaps images of acquaintances or family. Most people can't handle this; it can do a lot of damage to the person, let alone the amount of discipline people have to muster in order to make the right decisions. That is why the people employed by government who investigate crimes, especially when it comes to child sexual abuse, are selected and specially trained to be able to perform and handle this kind of work. Sending only the values underlying the hash is very weak and meager; it actually yields nothing.

1: CCITT stands for Comité Consultatif International Télégraphique et Téléphonique. It is the former name of what has been called ITU-T since 1992, the Telecommunication Standardization Sector of the International Telecommunication Union. It is an organization that coordinates telecommunications standards.
2: My Little Pony (MY LITTLE PONY®) is a registered trademark of Hasbro, Inc. It is part of the My Little Pony: Friendship is Magic franchise. Among fans it is abbreviated to MLP.
3: We have chosen this misleading name on purpose, to show that how you name something doesn't have anything to do with what it does. The name "NOTCAT" implies an output indicating that something is not a cat, while we train and use the network so that this output indicates that something is a cat.

Tags: Privacy & Security, Software Development, Ownership & Control