
NOTE: this code has been evolving quickly into the functionality presented here. Eventually I will break it out into a simple exe as described below, but for now...

Currently the only way to find a user by the MD5 hash of a lowercased email address (user.email_hash, which is also the Gravatar key) is to maintain a local datastore containing the users you are interested in servicing, which you may index and query for this and other information.
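The hash itself is easy to produce. The real app is .NET, but a minimal illustrative sketch in Python (the `email_hash` helper name is mine, not from the API) looks like this:

```python
import hashlib

def email_hash(email: str) -> str:
    """Gravatar-style key: MD5 hex digest of the trimmed, lowercased email."""
    return hashlib.md5(email.strip().lower().encode("utf-8")).hexdigest()

# The same mailbox always yields the same 32-character key,
# regardless of case or surrounding whitespace.
print(email_hash("Jane.Doe@Example.com"))
```

You compute this key for each address you care about and look it up against the `email_hash` column in your local store.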

At first blush, this task seems straightforward. And it is: you surely can point a firehose at the API and get what you want, if you wish to piss off people you don't really want to piss off ;-) Not to mention poorly shepherding your rate-limit quota and saturating your network, disk and CPU.

With a bit of planning and a measured approach this task can be broken down into a handful of not-long and not-short running processes that can gently provide you with a valuable dataset with many uses, one of which will be included in the following example.

Enough talk; let me introduce you to a simple Soapi spike named UserDatabase.exe, a Windows console app that you may re-purpose to your heart's content.

UserDatabase.exe

The purpose of this app is to responsibly and efficiently maintain a local database of all (or most) Stack Exchange users above an arbitrary reputation threshold.

This implementation uses a System.Data.SQLite data store, but the code is quite easily ported to any RDBMS.

There are 2 modes of use for this application:

Performing a full update via a sequential trickle pull of all users at or above the minimum rep threshold (mode="PULL")

Refreshing with only the users created since the last pull (mode="REFRESH")

At the completion of each mode, users will be rank-ordered by siteUrl, rep desc, creation date, and userId (this is the little bit extra I was talking about).
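One plausible reading of that ranking step, sketched against an in-memory SQLite database in Python (the `users` table, its columns, and the per-site partitioning are my assumptions, not the app's actual schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema; the real UserDatabase.exe schema may differ.
conn.execute("""CREATE TABLE users (
    siteUrl  TEXT,
    userId   INTEGER,
    rep      INTEGER,
    creation INTEGER,
    rank     INTEGER,
    PRIMARY KEY (siteUrl, userId))""")
conn.executemany(
    "INSERT INTO users (siteUrl, userId, rep, creation) VALUES (?,?,?,?)",
    [("stackoverflow.com", 1, 5000, 100),
     ("stackoverflow.com", 2, 9000, 200),
     ("superuser.com",     7, 3000,  50)])

# Rank within each site: rep desc, then creation, then userId.
# ROW_NUMBER() requires SQLite 3.25+ (bundled with recent Pythons).
ranked = conn.execute("""
    SELECT siteUrl, userId,
           ROW_NUMBER() OVER (
               PARTITION BY siteUrl
               ORDER BY rep DESC, creation, userId) AS rank
    FROM users""").fetchall()
for site, uid, rank in ranked:
    conn.execute("UPDATE users SET rank = ? WHERE siteUrl = ? AND userId = ?",
                 (rank, site, uid))
conn.commit()
```

With the sample rows above, user 2 outranks user 1 on stackoverflow.com by reputation, and user 7 is ranked first on superuser.com.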

Additionally we will explore using Site.Aliases to migrate users when a site moves.

What to pull

For most use cases, you are only interested in users above a certain reputation.

The minimum recommended reputation is 100, as the vast majority of users on the larger sites are under 100 rep, anonymous, and/or inactive.

If you pull >= 1000 rep, the cycle will be very prompt: less than 5 minutes for ALL Stack Exchange users with rep >= 1000, resulting in a SQLite file around 10 MB and around 220 total requests.

If you pull >= 100 rep, the cycle will be less prompt: around 25 minutes to finish the sequential trickle pull of Stack Overflow, resulting in a SQLite file around 50 MB.

This is at max 1 request per second. If you open the firehose and pull at full speed you will save only a very small percentage of time, so being conservative is the best approach all the way around.

A full pull will take several hours to complete the sites with larger user bases such as Stack Overflow and Meta Stack Overflow.

In the case of Stack Overflow, a full pull means 3000+ requests, versus 100 requests at rep 1000 and 400 requests at rep 100. The choice is up to you.

If there is a valid use case for a pull down to rep 0, I would suggest performing it no more than once a week, at a time when you know you have 3000 extra Stack Overflow requests to burn.

To top off, or refresh with just those users that have been created since the last pull:

// this will typically involve a single request for each site and take less
// than a minute unless your minRep is set very low.
UserDatabase.exe refresh "my-foo-bar-fu-is-strong" "data source=users.db" 1000

NOTE: Pull processes are throttled at a very low rate, max 1 request per second, and full pulls should be considered long running processes. The upside is that the impact on your overall throttle quota will be negligible as will network and CPU usage. Another compelling reason to restrict request rate as such is the fact that many threads will be in contention for a lock on the database file. This restricted rate gives those threads a good chance at the lock and actually adds to efficiency.
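The throttling described above is a simple minimum-interval gate between requests. Here is an illustrative Python sketch of that idea (the `Throttle` class is hypothetical, not part of Soapi; the real app is .NET and enforces 1 request per second):

```python
import time

class Throttle:
    """Block until at least min_interval seconds have passed since the
    previous call. Illustrative only; not the actual Soapi throttle."""
    def __init__(self, min_interval: float = 1.0):
        self.min_interval = min_interval
        self._last = 0.0

    def wait(self) -> None:
        delta = time.monotonic() - self._last
        if delta < self.min_interval:
            time.sleep(self.min_interval - delta)
        self._last = time.monotonic()

# Demo with a shortened interval so it runs quickly; the real app
# would use Throttle(1.0) and call wait() before each API request.
throttle = Throttle(0.05)
start = time.monotonic()
for _ in range(3):
    throttle.wait()
elapsed = time.monotonic() - start
```

The first call passes immediately; each subsequent call is delayed so that calls are spaced at least `min_interval` apart, which is also what gives competing threads a fair shot at the database-file lock.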