Skip to content Skip to sidebar Skip to footer

Efficient Knn Implementation Which Allows Inserts

Suppose I have multi-dimensional datasets, which have many vectors as data. I am writing an algorithm which needs to do k nearest neighbour searches for all those vectors - classic

Solution 1:

General comment: I don't quite understand why KD-Trees are so popular for high-dimensional kNN queries. In my experience, other trees scale much better with high dimensionality or large datasets (I tested up to 25Million points and (only) up to 40 dimensions). Some more details:

  • KD-Trees: As far as I know, KD-Trees should support insertion at any time, but there is a chance that they get imbalanced. I don't use python, so I don't know why your KD-tree does not support insertion/deletion on the fly.
  • Quadtree: Depending on the dimensionality, you could also use quadtree/octrees, but standard implementations are not good for more than 10 dimensions or so. In the reference above I tested a quadtree with a special 'hypecube' navigation approach. That requires a lot of memory but scales much better with dimensionality in terms of performance.
  • R-Tree/R*Tree: The original R-Trees are not very good with insertion on the fly. However, if you look at R+Trees, (R-Plus-Tree), they are quite fast with reinsertion and kNN queries.
  • PH-Trees have basically the same kNN performance as R+Trees, but much better insertion time, because PH-Trees do not need rebalancing, while having inherently limited depth and nodesize. Unfortunately, implementations gets a lot more complicated for >=64 dimensions (the tree uses one bit of a long integer for each dimensions). I'm not aware of an implementation that supports more than 63 dimensions.

Python:

  • R+Plus trees should be available for Python. If not, you could adapt a normal R-Tree (only the insertion algorithm is different)
  • I heard once of someone starting to implement a PH-Tree in Python, but I haven't seen any open-source variant yet.
  • If you have some time/interest to do your own implementation, you could look at the Java implementations here and translate them to Python. The library contains various multidimensional indexes, except KD-Trees. KD-Tree implementations that allow on-the-fly insertion can be found here and here.

Post a Comment for "Efficient Knn Implementation Which Allows Inserts"