In this blog post, we’ll learn about a powerful technique called hashing, which is used for many purposes, including data storage, data integrity, data security, and data compression.

The story follows King Data, ruler of a kingdom named DataLand, as he discovers the many benefits of hashing. Through his experiences, we’ll learn how hashing can be used to store data efficiently, verify its integrity, and protect it from unauthorized access. So come along on this magical journey and discover the wonders of hashing!

The story of King Data

Once upon a time, in a land far, far away, there was a kingdom called DataLand. To store and manage their data, the people of DataLand used hashing. King Data, the king of DataLand, was a wise and just ruler, but he had a problem. Data storage and management was becoming more challenging for his kingdom as it grew. He knew he needed a better way to keep track of everything.

One day, the king’s advisors suggested that he use a hash table to store his data. A hash table is a type of data structure that uses hashing to store and retrieve data efficiently. The king was skeptical at first, but he decided to give it a try. And it worked! Using a hash table, the king was able to store and retrieve his data much more quickly and easily than he could before.

But the benefits of hashing didn’t stop there. Hashing was also used by the king to verify the integrity of his data. For each piece of data, he would generate a hash value and store it alongside it. This way, he could be sure that the data had not been modified in any way. Finally, the king used hashing to protect his data from unauthorized access. By hashing his passwords and storing them in a database, he could verify the user’s identity when he logged in by comparing the entered password to the stored hash value.

And so, thanks to the magic of hashing, the kingdom of DataLand flourished and the people lived happily ever after.

So what is hashing?

A hash is a mathematical function that converts data into a fixed-size value. The hash function takes the data as input and produces a hash value as output.

A hash function takes an input, or key, and maps it to an output, or hash value, using a mathematical algorithm:

Hashing

Figure 1: Hash function concept

In most cases, the resulting hash value is shorter and of a fixed length, regardless of the size of the original data. A hash function spreads the data, and the range of operation of such a function is often much smaller than the input value, leading to hash function collisions, where the function returns the same result for different input values.

Any mathematical function can be used as a hash function to assign a fixed-length value to an input:

def my_hash_function(input: int) -> int:
    return input % 10

In example above, we have a function that takes an integer as an input and returns the remainder of dividing it by 10. This function is a hash function, because it takes an integer as an input and returns an integer as an output. The output is always a single digit, regardless of the size of the input. Unfortunately, this function is not very good at hashing. It’s easy to predict the output for any given input. For example, if we know that the input is 123, we can easily predict the output, which is 3. This is because the function is very simple and doesn’t use any randomness. If we know the input, we can easily predict the output.

For example by using my_hash_function hash function we can get hash collisions for inputs 13 and 123:

>>> my_hash_function(13)
3
>>> my_hash_function(123)
3

Ideal hash functions

An ideal hash function should be a one-way transformation, meaning it can’t be predicted/guessed what the input value was based on its calculation. This is because the hash function is used to verify the integrity of the data. If the hash function is reversible, it would be easy to modify the data and still have the same hash value. This would defeat the purpose of using a hash function in the first place.

Ideal hash functions meet the following criteria:

  • one-way and irreversible - on the basis of the calculated hash (output), the input to the function cannot be reconstructed,
  • high volatility and unpredictability - the function, based on very similar and similar inputs, returns values significantly different from each other, which makes guessing the input based on the known results, it is very unlikely that the result will be obtained, e.g. for the input 123 the ideal hash function will return kjabsfbkjashgfoash and for the input 1234 the ideal hash function will return 912nj12rhb12a8h12, which is totally different from the previous result,
  • collision resistance - the function assigns with a very low probability same results for different inputs,
  • no possibility to infer from the hash - based on the result of the function, it should not be possible to get useful information about the input,
  • deterministic - the function should always return the same result for the same input,
  • efficiency - a good hash function should be efficient to compute, meaning that it should be relatively fast to execute and not require a lot of resources.

Hashing MD5

Figure 2: Example MD5 hash function

Few words about hash collisions

Hash collisions are when two different inputs produce the same hash value. This is a problem because it means that the hash function is not perfect. It’s not a big deal if the hash function produces a few collisions, but if it produces a lot of collisions, it can cause problems.

Collisions can occur because the hash function is designed to map a large number of keys to a relatively small number of hash values. Since there are a limited number of hash values, it is inevitable that some keys will be mapped to the same hash value.

Huge file

Figure 3: Compute hash for a huge file

Due to the fact that we have a limited number of hash values, it is inevitable that some keys will be mapped to the same hash value. For example, if we have a hash function that returns a hash value of 10 digits, then we can have 10^10 different hash values. If we have 10^20 different keys, then we will have a collision with a probability of 10^10/10^20 = 10^-10. This is a very small probability, but it is still possible.

Why is it important to avoid hash collisions? For example, if the function used to calculate the hash of the password in application is not collision-resistant, it will be possible to guess the password by trying different combinations of characters. This is because the hash function is not one-way and reversible.

Hash collisions

Figure 4: Hash function collision for two different passwords

Finding a method for effective collision detection of a hash function may completely eliminate such a function from further use (an example of such a situation is the collision detection method in the MD5 function).

It’s important to note that collision is a normal part of the hashing process and can’t be completely avoided. However, the choice of a good hash function and a suitable collision resolution strategy can help to minimize the frequency and impact of collisions.

Real hash functions

In practice, hash functions are usually much more complex than the example above. They are usually implemented as a series of mathematical operations, such as addition, subtraction, multiplication, division, and so on. The result of these operations is then combined with the input data to produce the hash value.

Despite many advantages, hashing has its weaknesses. Each, even a secure cryptographic function will have some risk of collision, moreover, with the ever-increasing computing power, breaking increasingly heavy hashes becomes easier and easier. Therefore, it is worth using a combination of hashing and encryption, which will significantly increase the security of the data.

Hashing in practice

Hashing is a technique that is used for many purposes, including the following:

  • data storage - hashing can be used to store data in a way that is efficient and easy to search. For example, King Data used a hash table to store his data. A hash table is a data structure that uses hashing to store and retrieve data efficiently. It’s a very common data structure that is used in many programming languages, including Python, Java, and C++,
  • data integrity - hashing can be used to verify the integrity of data by generating a hash value for a piece of data and storing it alongside the data. Later, the hash value can be recalculated and compared to the stored value to ensure that the data has not been modified,
  • data security - hashing can be used as part of a cryptographic process to protect data from unauthorized access. For example, the king used hashing to protect his data from unauthorized access. By hashing his passwords and storing them in a database, he could verify the user’s identity when he logged in by comparing the entered password to the stored hash value,
  • data compression - in some cases, hashing can be used to reduce the size of data by representing it as a shorter hash value.

Which hashing algorithm to use?

There are many hashing algorithms that can be used for various purposes, and the best one to use will depend on your specific needs. Here are some general guidelines for choosing a hashing algorithm:

  • data storage and data integrity - for these purposes, any of the hashing algorithms (e.g. MD5, SHA-1, SHA-2, etc.) could potentially be suitable, depending on your specific needs. You should consider factors such as efficiency, collision resistance, and the avalanche effect when choosing a hashing algorithm,
  • cryptographic purposes - for cryptographic purposes, it is important to use a secure and reliable hashing algorithm. Some popular options include SHA-2, SHA-3 or Bcrypt. These algorithms are widely used for cryptographic purposes and are considered to be very secure,
  • password hashing - when storing passwords, it is important to use a secure and slow hashing algorithm to make it difficult for an attacker to crack the passwords. Some popular options for password hashing include Bcrypt, Scrypt, and Argon2. These algorithms are designed to be slow and secure, making them resistant to cracking attempts - they come from PBKDF algorithms family.

Passwords validation

When a user registers on a website, they usually have to provide a password. The password is then stored in a database and used to verify the user’s identity when they log in. It’s important to store passwords in a secure way, so that they can’t be easily accessed by an attacker. One way to do this is to use a hashing algorithm to hash the password and store the hash value in the database.

When a user enters a password, the system will hash the entered password and compare it to the stored hash value to verify the user’s identity. If the entered password produces the same hash value as the stored value, the user’s password is considered to be valid. If the entered password produces a different hash value, the user’s password is considered to be invalid.

It is important to use a strong and secure password hashing algorithm, such as bcrypt, scrypt, or Argon2, to ensure that passwords are properly hashed and secured. Moreover, you can use a salt to further increase the security of the passwords. You can also use a key stretching technique to block brute-force attacks.

Salting

Salting is a technique that is used to increase the security of a hashing algorithm. It works by adding a random string to the input before hashing it.

Salting

Figure 5: Salting mechanism

The salt is usually stored alongside the hashed value, so that it can be used to verify the hashed value later.

Originally, salts were to be used to complicate simple passwords. However, it has been proven that adding a salt to a password increases computation time only slightly. Ultimately, however, salts are used mainly to protect against attacks using rainbow table attacks.

Key-stretching

Key-stretching is a technique that is used to make it more difficult to crack a hash. It involves running the hashing algorithm multiple times on the same input data. This makes it more difficult for an attacker to crack the hash because it takes longer to calculate the hash value. For example, if the attacker tries to crack a password by trying different combinations of characters, it will take longer to try all the combinations if the password is hashed multiple times.

The implementation is very simple and consists in calling the hash function n times:

def key_stretching(password: str, hash_fun: Callable[[str], str], n: int) -> str:
    for _ in range(n):
        password = hash_fun(password)
    return password

Hashing in Python

Python has a built-in hash() function that can be used to generate a hash value for a given object. The hash() function uses the MurmurHash algorithm to generate the hash value. The hash() function is not suitable for cryptographic purposes, but it can be used for other purposes, such as data storage and data integrity:

>>> hash("Hello world!")
-54998617470514566
>>> hash("Kamil")
-5918316543820814548

You can also use the hashlib module to generate hash values for a given string. The hashlib module contains a number of hashing algorithms, including MD5, SHA-1, SHA-2, and SHA-3. The hashlib module is a part of the Python standard library, so you don’t need to install it separately. Here’s an example of how to use the hashlib module to generate a hash value for a given string:

import hashlib

def my_hash_function(string: str, algorithm: str = "sha256") -> str:
    hash_fun = getattr(hashlib, algorithm)
    return hash_fun(string.encode()).hexdigest()