Learning storage and retrieval with hash functions

If you’re walking around the City of London, you won’t have much trouble finding Gough Square, which is hidden behind Fleet Street. There stands the home of Dr. Samuel Johnson, who (among other things) wrote the first English dictionary.

There is something surprisingly magical about a dictionary. There is probably only one long ordered series that you know well and have known for a long time: your alphabet. This gives the dictionary some cool properties. Each entry has only one key. And indexing only works one way, so with only a dictionary, you cannot discover an unknown word just by knowing its meaning. Even a young child can look up “steatopygia”, but finding this word by searching the entire book for “a condition of having significant levels of tissue on the buttocks and thighs” would be a very long task.

Some computing fundamentals have overlapping definitions and are obscured by their ubiquity. While a hash has a simple definition, different aspects of it come into focus at different times. In today’s world, hash security is important, but in this article, we’re looking at storing and retrieving items with a key. We’ll use elementary computing concepts with a small degree of mathematical curiosity. We start with the hash table.

A hash function is any function that can be used to map data of arbitrary size to values of fixed size.
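As a rough illustration of that definition, here is a minimal sketch using SHA-256 from Python's standard `hashlib` module. SHA-256 is chosen only because it is readily available; it is not the hash function built later in this article, and the sample inputs are made up.

```python
import hashlib

# Inputs of very different sizes (sample data, chosen for illustration).
inputs = [b"a", b"Alan", b"x" * 10_000]

# Every SHA-256 digest is 32 bytes long, whatever the input size.
digest_lengths = [len(hashlib.sha256(data).digest()) for data in inputs]

for data, length in zip(inputs, digest_lengths):
    print(len(data), "bytes in ->", length, "bytes out")
```

The input sizes vary from one byte to ten thousand, but the output size never changes.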

Arrays and hash tables

From a computing standpoint, the English dictionary isn’t all that great, as there’s never been a need to limit its size. English was not created with efficiency in mind; if it had been, the dictionary would start with “A” and end with “ZZZZ”. In fact, just four letters, if used efficiently, can represent all English words (with enough room for another language).
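As a quick check of that arithmetic, assuming a 26-letter alphabet and codes of exactly four letters, the number of available codes works out like this:

```python
# Number of distinct four-letter codes over a 26-letter alphabet.
ALPHABET_SIZE = 26
CODE_LENGTH = 4

four_letter_codes = ALPHABET_SIZE ** CODE_LENGTH
print(four_letter_codes)  # 456976
```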

For a digital dictionary, we are interested in numeric keys and indexes. We know that the simplest form of storage is the fixed-size array. For example, we’ll play with a five-element array to store troubling pictures of our five best friends: Alan, Beth, Cath, Dave, and Eddie. We want to retrieve the photos using only their names and, eventually, delete them.

Now, we can use just two arrays: one with the image data, and one with a matching index that holds the keys. The data array will need cells large enough to hold the image data, while the index array will only need cells large enough to hold small strings.

In this example, the keys array contains the key “Alan”, whose index matches the slightly grumpy, dark-bearded face in the image array. I think Midjourney was having a rough day. This system is easy to maintain, as entries can be easily created and deleted in sync.

For our small example, this looks fine. But for a large array, the problem with this scheme is that as the number of entries increases, you will likely have to dig around longer to find the correct key. Even worse, you could search the entire array just to make sure the key wasn’t there. Once I start deleting entries, I also have to keep track of the scattered empty cells in some way. In other words, this system does not scale well.
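The two-array scheme and its search cost can be sketched roughly as follows. The function names `find_image` and `delete_image` are hypothetical, and the byte strings are placeholders standing in for real image data.

```python
from typing import Optional

# Two parallel fixed-size arrays: keys[i] names the owner of images[i].
# The byte strings are placeholders, not real image data.
keys: list[Optional[str]] = ["Alan", "Beth", "Cath", "Dave", "Eddie"]
images: list[Optional[bytes]] = [
    b"alan.png", b"beth.png", b"cath.png", b"dave.png", b"eddie.png",
]


def find_image(name: str) -> Optional[bytes]:
    """Scan the keys array from the start until the name turns up."""
    for i, key in enumerate(keys):
        if key == name:
            return images[i]
    return None  # every slot was checked before we could say "not here"


def delete_image(name: str) -> bool:
    """Clear both cells in sync; the emptied slot is left as a hole."""
    for i, key in enumerate(keys):
        if key == name:
            keys[i] = None
            images[i] = None
            return True
    return False
```

Looking up a missing name walks all five slots, and each deletion leaves a `None` hole that later insertions would have to hunt for.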

What we want is a function that sits between the nice human-readable key and the set of indexed images: something that could look at the key and say, “Oh, I put your image data here.”

You may also recognize this model from valet parking. I give the attendant the keys to my car, and in return he gives me a ticket. Then someone else goes and parks my car. Later, I hand over the ticket, which is matched to my car’s details, and my car is retrieved. The drop-off and retrieval each take a fixed amount of time. (At least that’s how I think the system works; my only reference is Ferris Bueller’s Day Off.)

So, what is inside the “hash function”? Surprisingly, it can be more or less anything, as long as it can produce valid index numbers.

The modulo operation

The starting point for examples is always the modulo operation, as it fits the bill perfectly: it provides a figure that is guaranteed to be below a certain maximum.

In arithmetic, the modulo operation returns the remainder, or signed remainder, of a division, after one number has been divided by another.

So if we have a dictionary with an array of five slots, we’ll need to output index positions 0 to 4. As you can see, modulo 5 does this with any integer value:
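A minimal sketch of modulo 5 applied to a few integers. The sample values are arbitrary, and `TABLE_SIZE` is a name chosen here for illustration.

```python
TABLE_SIZE = 5  # five slots, so valid indexes are 0 to 4

samples = [3, 5, 17, 207, 2347]
indexes = [value % TABLE_SIZE for value in samples]

print(indexes)  # [3, 0, 2, 2, 2]
```

Every result lands in the range 0 to 4, whatever the size of the input.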

I’m sure you’ve spotted the pattern. So we have our hash function, right? The index array should not be needed; we can just get the index from our name key! Well, not quite.

Given that 17, 207, and 2347 all produce the same modulo result, this cannot work on its own. Even for five random numbers, there are likely to be some collisions. The only sure way to cope with collisions, at least for a trivial algorithm, is to store the key as we did before. This isn’t really surprising: for a fixed-size array, no algorithm can fill our index perfectly without remembering what happened before. That would be like a valet parking service where the details of the car and where it is parked are not recorded anywhere. So our hash function is really saying, “Oh, start looking for your image data here.” What I’m describing is called open addressing.

I want the keys to be the names of the people pictured, like “Cath”, but a name is not a number. However, I can get a somewhat unique number from a string. We could produce a value based on positional encoding (think English dictionary order, counting in base 26), but to stay focused on the very simple modulo function, let’s just add all the ASCII values of the characters together.
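The ASCII-sum idea can be sketched like this, assuming a five-slot table. The function name `simple_hash` is hypothetical, and `ord` returns the ASCII value for the plain letters used in these names.

```python
TABLE_SIZE = 5


def simple_hash(name: str, table_size: int = TABLE_SIZE) -> int:
    """Add up the character codes of the name, then take the remainder."""
    total = sum(ord(ch) for ch in name)
    return total % table_size


for name in ("Alan", "Beth", "Eddie"):
    print(name, sum(ord(ch) for ch in name), simple_hash(name))
# Alan 380 0
# Beth 387 2
# Eddie 475 0
```

Note that the sum is case-sensitive: “beth” would not hash to the same slot as “Beth”.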

(Tip: If you’re on a Mac, you can do both modulo and other lines of math right in the Spotlight Search bar.)

Collisions

So, finally, let’s deal with collisions. If we store the key, we can fulfill our expectation of keeping the information close to where we looked first, simply by placing it in the next available slot down the keys array. This method is called linear probing.

So now we have all the ingredients. First, we tell our hashing algorithm that we want to store the image of “Alan”. It converts the string “Alan” to an integer, and then uses modulo to produce an index in the keys array. Then it adds the data, using the same index, to the values array.

Then we add “Beth” in the same way. This time, the index is two.

When we try to add Eddie next, we get a collision with Alan’s entry. So we move to the next available slot in the keys array.
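The three insertions just described can be sketched as follows. The names `simple_hash` and `store` are hypothetical, the byte strings are placeholders for image data, and wrapping around to slot 0 at the end of the array is an assumption the text does not spell out.

```python
from typing import Optional

TABLE_SIZE = 5
keys: list[Optional[str]] = [None] * TABLE_SIZE
values: list[Optional[bytes]] = [None] * TABLE_SIZE


def simple_hash(name: str) -> int:
    """Sum of character codes, modulo the table size."""
    return sum(ord(ch) for ch in name) % TABLE_SIZE


def store(name: str, image: bytes) -> int:
    """Store the image with linear probing; return the slot it landed in."""
    start = simple_hash(name)
    for step in range(TABLE_SIZE):
        index = (start + step) % TABLE_SIZE  # assumption: wrap around at the end
        if keys[index] is None or keys[index] == name:
            keys[index] = name
            values[index] = image
            return index
    raise RuntimeError("table is full")


print(store("Alan", b"alan.png"))    # 0
print(store("Beth", b"beth.png"))    # 2
print(store("Eddie", b"eddie.png"))  # 1 (slot 0 is taken by Alan)
print(keys)  # ['Alan', 'Eddie', 'Beth', None, None]
```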

To retrieve our data, we follow the algorithm again, but this time we check the key stored at each index. When we search for “Eddie”, we first see “Alan”, so we move on to the next slot. We didn’t have far to go. The hash algorithm told us where to start looking, just as it promised.
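Retrieval can be sketched the same way. This block starts from the table as it stands once Alan, Beth and Eddie have been stored. The function name `find` is hypothetical, and stopping at the first empty slot is only valid while nothing has been deleted.

```python
from typing import Optional

TABLE_SIZE = 5
# Table state after storing Alan, Beth and Eddie with linear probing.
keys: list[Optional[str]] = ["Alan", "Eddie", "Beth", None, None]
values: list[Optional[bytes]] = [b"alan.png", b"eddie.png", b"beth.png", None, None]


def simple_hash(name: str) -> int:
    """Sum of character codes, modulo the table size."""
    return sum(ord(ch) for ch in name) % TABLE_SIZE


def find(name: str) -> Optional[bytes]:
    """Start at the hashed index and probe forward until the key matches."""
    start = simple_hash(name)
    for step in range(TABLE_SIZE):
        index = (start + step) % TABLE_SIZE
        if keys[index] is None:
            # Assumption: no deletions have happened, so an empty slot
            # means the key was never stored.
            return None
        if keys[index] == name:
            return values[index]
    return None


print(find("Eddie"))  # b'eddie.png', found one step after Alan's slot
print(find("Dave"))   # None
```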

Conclusion

Naturally, this makes more sense for the larger arrays that you might use in real applications. In modern programming languages, you don’t need to specify the dictionary size; it is resized as needed. But for now, I think Johnson and his cat Hodge would be happy with their indirect contribution to computing.
