
Finding Matching Keys In Two Large Dictionaries And Doing It Fast

I am trying to find corresponding keys in two different dictionaries. Each has about 600k entries. Say for example: myRDP = { 'Actinobacter': 'GATCGA...TCA', 'subtilus sp.': '

Solution 1:

Use sets, because they have a built-in intersection method which ought to be quick:

myRDP = { 'Actinobacter': 'GATCGA...TCA', 'subtilus sp.': 'ATCGATT...ACT' }
myNames = { 'Actinobacter': '8924342' }

rdpSet = set(myRDP)
namesSet = set(myNames)

for name in rdpSet.intersection(namesSet):
    print(name, myNames[name])

# Prints: Actinobacter 8924342

Solution 2:

You could do this:

for key in myRDP:
    if key in myNames:
        print(key, myNames[key])

Your first attempt was slow because you were comparing every key in myRDP with every key in myNames. In algorithmic jargon, if myRDP has n elements and myNames has m elements, then that algorithm would take O(n×m) operations. For 600k elements each, this is 360,000,000,000 comparisons!

But testing whether a particular element is a key of a dictionary is fast -- in fact, this is one of the defining characteristics of dictionaries. In algorithmic terms, the key in dict test is O(1), or constant-time. So my algorithm will take O(n) time, which is one 600,000th of the time.
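To make the difference concrete, here is a small sketch with made-up data (the key names and dictionary sizes are illustrative, not from the question) showing the O(n) approach: one constant-time membership test per key, rather than a nested scan of both dictionaries.

```python
import timeit

# Two hypothetical dictionaries with overlapping keys (smaller than the
# question's 600k entries, but large enough to show the pattern).
myRDP = {f"species_{i}": "GATC" for i in range(100_000)}
myNames = {f"species_{i}": str(i) for i in range(0, 100_000, 2)}

def fast_match():
    # Each "k in myNames" test is O(1), so the whole loop is O(n).
    return [k for k in myRDP if k in myNames]

# Time 10 full passes over all 100,000 keys.
print(timeit.timeit(fast_match, number=10))
```

On a typical machine this finishes in well under a second, whereas a nested-loop comparison of the same two dictionaries would need 100,000 × 50,000 key comparisons.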

Solution 3:

In Python 3 you can just do

myNames.keys() & myRDP.keys()
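This works because in Python 3 dict.keys() returns a view object that supports set operations directly, so no intermediate set() construction is needed. A minimal sketch using the question's sample data:

```python
myRDP = {'Actinobacter': 'GATCGA...TCA', 'subtilus sp.': 'ATCGATT...ACT'}
myNames = {'Actinobacter': '8924342'}

# Key views behave like sets: & computes the intersection of the keys.
common = myNames.keys() & myRDP.keys()
print(common)  # {'Actinobacter'}

for key in common:
    print(key, myNames[key])
```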

Solution 4:

for key in myRDP:
    name = myNames.get(key, None)
    if name is not None:
        print(key, name)

dict.get returns the default value you give it (in this case, None) if the key doesn't exist.
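One caveat worth showing with a small example (the 'blank sp.' entry is made up to illustrate the point): truth-testing the result of get() would also skip keys whose values are falsy, such as an empty string, so comparing against None explicitly is safer.

```python
myNames = {'Actinobacter': '8924342', 'blank sp.': ''}

# get() returns the default (None here) when the key is missing.
print(myNames.get('Actinobacter'))  # 8924342
print(myNames.get('unknown sp.'))   # None

for key in ['Actinobacter', 'blank sp.', 'unknown sp.']:
    value = myNames.get(key)
    # "if value:" would wrongly drop 'blank sp.' (empty string is falsy);
    # "is not None" only filters out genuinely missing keys.
    if value is not None:
        print(key, value)
```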

Solution 5:

You could start by finding the common keys and then iterating over them. Set operations are fast because they are implemented in C, at least in CPython, the reference implementation.

common_keys = set(myRDP).intersection(myNames)
for key in common_keys:
    print(key, myNames[key])
