10x speed-up - if you do even more careful clustering, you can discard very distant clusters that are most likely not relevant;

In one applied case I managed to speed up my calculation from 27 days to 4-5 hours.

Clustering

In my view, UMAP + HDBSCAN works reasonably well for word vectors. UMAP even has a cosine distance option (though not everything is 100% smooth there).

HDBSCAN also has a great perk: it labels noise points with a -1 "rubbish" cluster. In practice you may end up with a clustering that has reasonably good local structure.
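Dropping the rubbish cluster is then a one-line mask; the labels and vectors below are hypothetical stand-ins for real HDBSCAN output:

```python
import numpy as np

# hypothetical HDBSCAN output: -1 marks noise ("rubbish") points
labels = np.array([0, 0, -1, 1, -1, 1, 2])
vectors = np.random.rand(len(labels), 4)

# keep only points assigned to a real cluster
keep = labels != -1
clean_vectors = vectors[keep]
clean_labels = labels[keep]
```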

So, using UMAP + HDBSCAN you can remove ~50% of the data as rubbish. But you can also cluster your dataset into a reasonable number of clusters, say 5k - 10k (notice that you do not set the number of clusters as a parameter of HDBSCAN, which is arguably its best advantage). Then you can just calculate distances between your target phrases and these clusters and throw away the distant ones.
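The cluster-pruning step can be sketched as follows, with hypothetical data sizes and a cosine-distance helper standing in for whatever distance routine you actually use:

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical data: 1000 word vectors already assigned to 20 clusters
n_clusters = 20
vectors = rng.normal(size=(1000, 64))
labels = rng.integers(0, n_clusters, size=1000)

# one centroid per cluster
centroids = np.stack(
    [vectors[labels == c].mean(axis=0) for c in range(n_clusters)]
)

def cosine_dist(a, B):
    # cosine distance between one vector and each row of B
    a = a / np.linalg.norm(a)
    B = B / np.linalg.norm(B, axis=1, keepdims=True)
    return 1.0 - B @ a

# distance from one target phrase to every cluster centroid
target = rng.normal(size=64)
d = cosine_dist(target, centroids)

# keep only the nearest clusters and compare against their members alone,
# instead of against the whole dataset
nearest = np.argsort(d)[:5]
candidates = vectors[np.isin(labels, nearest)]
```

With 5k - 10k clusters the centroid pass is tiny compared to a full pairwise scan, and the candidate set shrinks accordingly.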

In plain terms: 10^3 x 10^6 comparisons << 10^6 x 10^6 comparisons. Also, at such a scale cosine distance is usually roughly normally distributed, so you can get away with keeping only the relations whose distance is smaller than the mean minus one standard deviation.
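A quick sketch of the thresholding idea, with synthetic distances standing in for real pairwise cosine distances; for a normal distribution, values below the mean minus one standard deviation are roughly the closest 16% of pairs:

```python
import numpy as np

rng = np.random.default_rng(1)

# hypothetical pairwise cosine distances; at large scale these are
# often approximately normally distributed
dists = rng.normal(loc=0.6, scale=0.1, size=100_000)

# keep only the relations closer than mean - one standard deviation
threshold = dists.mean() - dists.std()
close = dists[dists < threshold]

# fraction kept: ~0.16 by the normal-distribution tail
frac = close.size / dists.size
```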