I am a Software Engineering Master’s student @ Faculdade de Engenharia da Universidade do Porto, currently working on my Master’s thesis “Feature selection for automatic hate speech detection in text”. In other words, I am testing new features for hate speech detection in Twitter comments.

Most of the approaches target text classification itself, but my goal is to investigate how (Twitter) user profiling can improve the classification of tweets. To that end, one feature that is crucial for my research is modelling each user's social graph, i.e. social network analysis.

The drawback in generating a user's social network is that the API rate limits are REALLY low for the information required. My algorithm is quite simple and only considers one layer of depth: I check which users a given user follows and is followed by, and for each of those I check whether they follow each other. So, for each user I make a few requests:

List of followers

List of friends

For each follower, I check whether they follow / are followed by each friend via the 'show_friendship' endpoint.

The API rate limits for listing friends and followers are 15 requests per 15 minutes, and for 'show_friendship' it is 180 requests per 15 minutes.

Practically speaking, for a user with 67 followers and 13 friends, it takes more than 1 hour (!!) to generate their social graph. Considering my sample contains 1k users, it would take over 1,000 hours to generate all the required social graphs. And that assumes each user has at most 67 + 13 connections, which is quite low.
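
To make that estimate concrete, here's a quick back-of-the-envelope helper (entirely my own, not part of any API) for the pairwise 'show_friendship' calls, which dominate the cost:

```python
import math

def crawl_minutes(n_followers, n_friends, window_min=15, show_limit=180):
    """Rough minutes needed to crawl one user's 1-layer graph,
    counting only the pairwise show_friendship calls (they dominate;
    the two list calls add at most one extra window)."""
    pair_calls = n_followers * n_friends  # one call per (follower, friend) pair
    windows = math.ceil(pair_calls / show_limit)
    return windows * window_min

# 67 followers x 13 friends = 871 pairwise calls -> 5 windows -> 75 minutes
```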

Finally, is there a way to get my API rate limit increased? I am desperate enough to even pay for a temporary raise of my request limits for both the friends/followers list endpoints and the 'show_friendship' endpoint. Kindly please.

I’d love to read what you come up with! Even if it’s just a draft. Feel free to send it on!

There are a few ways you can optimise things: show_friendship might be OK when working with one "target" user at a time, but since you have 1,000 in your sample, it might be easier to avoid that call entirely and focus on the 1,000 target users as a whole.

Rate limits are per 15-minute window, but remember there is also App-Only Authentication, which gives you an additional 15 requests in every rate-limit window, effectively doubling your capacity for most endpoints.

You can use both friends/list and friends/ids to make calls - friends/list, if the number of friends is ≤ 200, will get you all friends in one call; friends/ids can get 5,000 in one call. You can use users/lookup first to check the number of friends and followers each target user has ahead of time, to decide which endpoint to use for which target user.

(friends/list returns full user objects while friends/ids returns just IDs, but you can just extract the IDs from the user objects, or call users/lookup on the IDs if you need to.)
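
A sketch of that per-user decision (the function name and threshold plumbing are mine; the 200 and 5,000 per-call limits are the documented ones):

```python
def pick_friends_endpoint(friends_count):
    """Choose the cheaper endpoint for one target user.

    friends/list returns up to 200 full user objects per call;
    friends/ids returns up to 5000 bare IDs per call.
    friends_count would come from a prior users/lookup call.
    """
    if friends_count <= 200:
        return "friends/list"  # full user objects, everything in one call
    return "friends/ids"       # IDs only, but 5000 per call
```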

This means you effectively have 60 calls every 15 minutes to fetch friends. If you spend one call per target user, you now have a maximum "throughput" of 240 Twitter accounts per hour.

The above for "friends" works the same way for followers, and remember you can crawl friends and followers in parallel.

Once you've crawled your original 1,000 target users - after 4 or 5 hours you'll have all their friends and followers. Now, instead of doing pairwise checks for which friends and followers of those target users follow each other, group all the IDs and run them through the same process - spending one call per user, for the same throughput of 240 per hour.
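
A sketch of that grouping step in Python (names and data shapes are illustrative - dicts mapping each target user to the sets of IDs already fetched):

```python
def second_pass_queue(friends_of, followers_of, already_crawled):
    """Union all friend/follower IDs collected for the target users
    into one deduplicated crawl queue, skipping IDs already fetched."""
    pool = set()
    for ids in friends_of.values():
        pool.update(ids)
    for ids in followers_of.values():
        pool.update(ids)
    return pool - set(already_crawled)
```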

This will take a while too - but you'll end up with the entire wider social network around those 1,000 sample users, and you can then apply whatever network analysis you need. I'd actually expect the network information to be far better at classifying things than content, but that's just me.

If you're careful about how you cache things and keep track of which user IDs you've already crawled, you can end up skipping a lot.
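
One simple way to get that skipping for free is to wrap the per-user fetch in a cache (a sketch of mine - `fetch` stands in for whatever function actually hits the API):

```python
def make_cached_fetcher(fetch, cache=None):
    """Wrap an expensive per-user fetch so repeat IDs cost no API calls.

    fetch: a function taking one user ID and returning its result.
    cache: optional pre-loaded dict, e.g. restored from disk between runs.
    """
    cache = {} if cache is None else cache

    def cached(user_id):
        if user_id not in cache:
            cache[user_id] = fetch(user_id)  # only hits the API once per ID
        return cache[user_id]

    return cached
```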

Also, setting a maximum "budget" of calls for each target user is important, because you don't want to get stuck crawling something like 60 million followers for @ArianaGrande, for example.

First of all, thank you very much for your reply and patience to go through everything.

My ultimate goal would be to feed each user's network to a neural network, but I probably won't have enough time or knowledge, as of now, to pull it off. Initially, I'm looking to compute the graphs and get some statistics (e.g. clustering coefficient, centrality measures, etc.). I can link to my work in about a month if you'd like!

It took me a while to find out how to do app-only authentication in Tweepy. Turns out all it required was changing one class during authentication (tweepy.OAuthHandler(...) to tweepy.AppAuthHandler(...)). Contrary to what you mentioned, not all endpoints' rate limits double: friends/list and followers/list do double, but friends/ids and followers/ids don't, sitting at around 45 (instead of 60) calls every 15 minutes (180 Twitter accounts/hour).

IgorBrigadir:

now instead of doing pairwise checks for which friends and followers of those target users follow each other, group all the IDs, and run them through the same process

So you suggest grouping all the collected followers/friends (excluding duplicates) and finding their respective friends/followers lists? If so, I guess it could work, as long as users with more followers/friends than a certain threshold are excluded, or their lists are randomly truncated to a fixed size. Although I still feel this would likely take too much time.

IgorBrigadir:

i’d actually expect the network information to be far better at classifying things than content

By 60 I meant using a combination of all of those - both app and user auth on the 2 endpoints. But I could be wrong - what I'd actually trust most is the x-rate-limit-remaining header in the responses.

Also, when making calls, give at least an extra ~5 seconds after the reported rate-limit reset time - I found it can be flaky sometimes and report "Rate limit exceeded" even when it should have reset after 15 minutes.

I can recommend Gephi for the network centrality measures - surprisingly, for some measures it's a lot faster than most Python implementations on larger graphs, and it makes prettier visualizations.

And yes, grouping users and budgeting only one API call per user still takes a lot of time to crawl, but it's still faster than with friendships/show. Another thing to keep in mind is that both the friends and followers endpoints return most-recent users first - so this way could miss a lot of important relationships between users.

IgorBrigadir:

By 60 I meant using a combination of all of those - both app and user auth on the 2 endpoints. But I could be wrong - what I'd actually trust most is the x-rate-limit-remaining header in the responses.

The 45 limit I was referring to was for combining followers/list and followers/ids. If I excluded users with more than 200 followers and friends, the following pseudocode would work:

if not rate_limit_exceeded("followers/list"):
    followers = get_followers_list(user)        # full user objects, up to 200 per call
elif not rate_limit_exceeded("followers/ids"):
    follower_ids = get_followers_ids(user)      # bare IDs, up to 5000 per call
else:
    # All limits exceeded: wait for the next 15-minute window, then retry
    wait_for_reset("followers/list")
    followers = get_followers_list(user)

, which would result in 30 (list) + 15 (ids) API calls per 15-minute window. I had no idea you could combine app and user authentication calls without one consuming the other's quota. In that case, I'm guessing I can make 45 × 2 = 90 API calls per window (~360 users/hour). This number sounds better.

IgorBrigadir:

Also when making calls give at least an extra ~5 seconds after the reported Rate Limit Reset time - i found it can be flaky sometimes and report “Rate limit exceeded” even when it should have reset after 15 min.

I'm using Tweepy's default implementation with wait_on_rate_limit=True. I looked at the code, and it already takes those extra 5 seconds into account.

IgorBrigadir:

I can recommend Gephi for the network centrality measures - surprisingly for some measures it’s a lot faster than most Python implementations on larger graphs, and it makes prettier visualizations.

I have some experience with Gephi, and I like it a lot, but I've never used a programming interface to communicate with it. I just googled it and found GephiStreamer. I'm not sure whether it grants the same speed as Gephi or whether it uses Gephi's algorithms. Any recommendations? I hardly see myself manually computing statistics for 1,000 users.
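
For what it's worth, Gephi doesn't strictly need a programming interface here: its spreadsheet importer accepts a plain CSV edge list with Source/Target columns, so a tiny writer (my own sketch) is enough to hand the crawled graph over and then run the centrality measures inside Gephi:

```python
def edge_list_csv(edges):
    """Serialize directed follow edges (src, dst) as a CSV edge table
    that Gephi's spreadsheet importer accepts."""
    lines = ["Source,Target"]
    lines.extend(f"{src},{dst}" for src, dst in edges)
    return "\n".join(lines)

# Write edge_list_csv(edges) to a .csv file and import it in Gephi's
# Data Laboratory as an edge table.
```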

I will look into this. Thanks for the feedback. I'm open to new Twitter network analysis ideas.

Edit:

I have already coded it using both the user and app auth and [followers/friends]/[list/ids]. I can make 30 + 15 calls under app auth and 15 + 15 under user authentication, which results in 75 requests per window, or 300 per hour (instead of the 360 I mentioned before). This means I can extract the followers/friends lists (for 1,000 users) in around 3 hours. Still a significant breakthrough.
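
The arithmetic behind that estimate, sketched as a helper of my own, assuming the friends and followers endpoint families have separate rate-limit buckets and are crawled in parallel (so 75 calls per window per family, one call per user):

```python
import math

def crawl_hours(n_users, calls_per_window=75, window_min=15):
    """Hours to fetch one list (friends OR followers) per user,
    at calls_per_window combined app + user auth requests."""
    windows = math.ceil(n_users / calls_per_window)
    return windows * window_min / 60

# 1000 users -> 14 windows -> 3.5 hours (window granularity rounds
# the ideal 1000/300 ≈ 3.3 hours up slightly)
```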

I still have to think through how I'm going to handle the followers/friends threshold so that I won't lose valuable information.