Chapter 5: Application of dictionaries

Indexing the employees

Josiah Wang

Let us now extend our employee database a little further.

Recall that in the second version of our employee database, we have a list of employee details.

[
 {'id': '14835634', 'name': 'Slađana Ellsworth', 'age': 24, 'nationality': 'poland'}, 
 {'id': '69983058', 'name': 'Arianna Dragović', 'age': 37, 'nationality': 'brazil'}, 
 {'id': '69448225', 'name': 'Delara Babič', 'age': 32, 'nationality': 'russia'}, 
 {'id': '83512249', 'name': 'Goda MacBeth', 'age': 27, 'nationality': 'india'} 
]

While this version gives more details, querying the list might become slow. This is especially true if you have a very large list. For example, you might have to iterate through the whole list to find all employees from Poland.
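To make the cost concrete, here is a minimal sketch of such a query over the list version. The function name find_by_nationality is just an illustrative choice, not something defined earlier in the course. Notice that the loop has to visit every record, even though only one of them matches.

```python
# The four sample employees from the list version above.
employees = [
    {'id': '14835634', 'name': 'Slađana Ellsworth', 'age': 24, 'nationality': 'poland'},
    {'id': '69983058', 'name': 'Arianna Dragović', 'age': 37, 'nationality': 'brazil'},
    {'id': '69448225', 'name': 'Delara Babič', 'age': 32, 'nationality': 'russia'},
    {'id': '83512249', 'name': 'Goda MacBeth', 'age': 27, 'nationality': 'india'},
]


def find_by_nationality(employee_list, nationality):
    """Return all employee dicts with the given nationality (hypothetical helper)."""
    matches = []
    for employee in employee_list:  # every record is checked, match or not
        if employee['nationality'] == nationality:
            matches.append(employee)
    return matches


polish_employees = find_by_nationality(employees, 'poland')
print([employee['name'] for employee in polish_employees])  # ['Slađana Ellsworth']
```

With four employees this is instant, but the work grows in step with the length of the list.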

Now, if you also recall, in the first version of the database we indexed the employee names by their ID.

{'14835634': 'Slađana Ellsworth',
 '69983058': 'Arianna Dragović', 
 '69448225': 'Delara Babič', 
 '83512249': 'Goda MacBeth'
}

This version is very fast. Python does not need to search through the entries one by one. When you say employee_dict['14835634'], it goes straight to the value stored under that key and retrieves it. Lightning fast! BAM! It is a bit like having already memorised where the character “B” is on your keyboard: you do not need to go hunting for it when you type!
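As a quick illustration, here is how lookups on the ID-indexed version might look. The ID '00000000' is a made-up example of a key that is not in the database.

```python
# The first version of the database: employee names indexed by ID.
employee_dict = {
    '14835634': 'Slađana Ellsworth',
    '69983058': 'Arianna Dragović',
    '69448225': 'Delara Babič',
    '83512249': 'Goda MacBeth',
}

# Direct lookup by key: no loop over the entries is needed.
print(employee_dict['14835634'])       # Slađana Ellsworth

# Membership tests on a dict also go through the key directly.
print('69448225' in employee_dict)     # True

# .get() returns None instead of raising KeyError for a missing key.
# '00000000' is a hypothetical ID that does not exist in the data.
print(employee_dict.get('00000000'))   # None
```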

Advanced discussion (optional):

I’ve oversimplified this - there is more going on than that! Python performs some hashing magic to generate a hash value for ‘14835634’, which is a big number (think of the git commit hashes you’ve seen). The big number tells Python which ‘filing cabinet’ it should use to store the value indexed by ‘14835634’. When you need to retrieve the value for ‘14835634’, Python hashes the key again, gets the same big number, and can then retrieve the value from the correct ‘filing cabinet’ almost instantly! You can use the hash() function to get the hash value of a Python object. You cannot use a mutable object like a list as a key precisely because it is not hashable (it has no hash value) - the list can change at any time, so its hash could not be relied on to stay the same!
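If you are curious, here is a small sketch you can run to see this for yourself. The variable names are only for illustration. The actual number printed for a string will normally differ each time you start Python, since string hashing is randomised per run by default, but within one run the same key always gives the same hash.

```python
key = '14835634'

# Hashing the same key twice in the same run gives the same big number.
first_hash = hash(key)
second_hash = hash(key)
print(first_hash)                  # some large integer (varies between runs)
print(first_hash == second_hash)   # True

# A list is mutable, so Python refuses to hash it.
try:
    hash(['14835634'])
    list_is_hashable = True
except TypeError as error:
    list_is_hashable = False
    print('Cannot hash a list:', error)

# An immutable tuple of strings is hashable, so it can be used as a dict key.
names_by_country_and_id = {('poland', '14835634'): 'Slađana Ellsworth'}
print(names_by_country_and_id[('poland', '14835634')])  # Slađana Ellsworth
```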

In any case, it is not important for you to know these details at this point. Just know that retrieving values from a Python dict is usually lightning fast! ⚡

Having said that, you might need to understand hashing more deeply for coding interviews in the future. When you need to read up more on hashing, find a good Data Structures and Algorithms tutorial or book on hash tables (e.g. this one has several online courses). Or just search on YouTube for hashing or hash tables.

Now, let us try to speed up retrieval by indexing our employees. This is how search engines usually work - why do you think they are usually quite fast searching through billions of documents? It’s definitely not by iterating through a loop!

Instead of a list of employees, we will now produce a dictionary of employees, indexed by their ID. So this is like the first version of our database, except that the value is now a dict of employee attributes. The 'id' attribute might be a bit redundant, but we will just keep it there for simplicity!

{ 
  '14835634' : {'id': '14835634', 'name': 'Slađana Ellsworth', 'age': 24,
                'nationality': 'poland'}, 
  '69983058' : {'id': '69983058', 'name': 'Arianna Dragović', 'age': 37, 
                'nationality': 'brazil'}, 
  '69448225' : {'id': '69448225', 'name': 'Delara Babič', 'age': 32, 
                'nationality': 'russia'}, 
  '83512249' : {'id': '83512249', 'name': 'Goda MacBeth', 'age': 27,  
                'nationality': 'india'} 
}
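To show the idea without touching any files, here is a minimal sketch that builds such a dictionary from a list of employees that is already in memory. The helper name index_by_id is my own illustrative choice and is not part of the exercise; reading the records from a text file is left to your own function.

```python
# The four sample employees from the list version of the database.
employees = [
    {'id': '14835634', 'name': 'Slađana Ellsworth', 'age': 24, 'nationality': 'poland'},
    {'id': '69983058', 'name': 'Arianna Dragović', 'age': 37, 'nationality': 'brazil'},
    {'id': '69448225', 'name': 'Delara Babič', 'age': 32, 'nationality': 'russia'},
    {'id': '83512249', 'name': 'Goda MacBeth', 'age': 27, 'nationality': 'india'},
]


def index_by_id(employee_list):
    """Return a dict mapping each employee's ID to that employee's dict.

    Hypothetical helper for illustration; assumes every ID is unique.
    """
    employee_dict = {}
    for employee in employee_list:
        # The key is the ID; the value is the whole attribute dict.
        employee_dict[employee['id']] = employee
    return employee_dict


employee_dict = index_by_id(employees)
print(employee_dict['69448225']['name'])  # Delara Babič
```

The values are the same dict objects as in the list, not copies, which is why keeping the 'id' attribute costs nothing extra here.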

Write a function load_indexed_employees() that reads from a given text file and returns a dict as illustrated above (indexed by ID). This should really just be a tiny modification from your load_employees() function from earlier!

Sample usage

>>> employee_dict = load_indexed_employees("employees_detail.txt")
>>> print(len(employee_dict))
20
>>> employee_ids = list(employee_dict)
>>> print(employee_ids)
['14835634', '69983058', '69448225', '83512249', '28836869', '82660090',
 '61098940', '50196408', '14973705', '25872463', '51904155', '06106396',
 '87491761', '37935295', '15304638', '84819522', '30195178', '55327620',
 '14817102', '54835190']
>>> print(employee_dict[employee_ids[0]])
{'id': '14835634', 'name': 'Slađana Ellsworth', 'age': 24, 'nationality': 'poland'}

Optional task: Once you have the main employee database, you could build further indexes on top of it, for example by nationality, so that you can quickly find out which employees are from a certain country (and then retrieve more details about each employee from the main database if needed). This is entirely optional (but feel free to do it - I’m not stopping you!). I have an example implementation of this in my solutions.

{'poland': ['14835634'],
 'brazil': ['69983058'],
 'russia': ['69448225'],
 'india': ['83512249', '51904155', '55327620'],
 'sweden': ['28836869'],
 ...
}
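One possible way to build such a nationality index from the main database is sketched below. This is not necessarily how the example solution does it, and the function name index_by_nationality is an assumption made for illustration. It uses only the four sample employees shown earlier, so 'india' has a single ID here.

```python
# A small main database, indexed by ID (the four sample employees).
employee_dict = {
    '14835634': {'id': '14835634', 'name': 'Slađana Ellsworth', 'age': 24,
                 'nationality': 'poland'},
    '69983058': {'id': '69983058', 'name': 'Arianna Dragović', 'age': 37,
                 'nationality': 'brazil'},
    '69448225': {'id': '69448225', 'name': 'Delara Babič', 'age': 32,
                 'nationality': 'russia'},
    '83512249': {'id': '83512249', 'name': 'Goda MacBeth', 'age': 27,
                 'nationality': 'india'},
}


def index_by_nationality(employees_by_id):
    """Return a dict mapping each nationality to a list of employee IDs.

    Hypothetical helper for illustration only.
    """
    nationality_index = {}
    for employee_id, employee in employees_by_id.items():
        # setdefault creates an empty list the first time a country is seen.
        nationality_index.setdefault(employee['nationality'], []).append(employee_id)
    return nationality_index


nationality_index = index_by_nationality(employee_dict)

# Two fast lookups: country -> IDs, then ID -> full details.
for employee_id in nationality_index['poland']:
    print(employee_dict[employee_id]['name'])  # Slađana Ellsworth
```

Storing only the IDs in the second index keeps it small, and the full details stay in one place in the main database.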