Spammers: Check. Projects next. And Other Goodies.

Hail Open Hubbites! We’ve been hard at work over the past month and would like to share some updates.

The first, and biggest, piece of news is that we ended our offshore partnership at the end of September. There were a number of drivers for this, but the immediate impact is that we are currently a four-person team. This explains why it’s been even more difficult than usual to keep up with all the ways folks get in touch with us: forum, blog posts, email, and tweets. We’re sorry about that and are working hard to keep up with everyone’s questions. We are planning on adding more members to the team over time, so this pressure will ease a bit.

We also finished processing our Machine Learning trained dataset and permanently deleted some 60K spam accounts. We’re really grateful for the work of our intern, Sourav Das, currently at MIT, for his amazing attitude and contribution. He led this work and delivered great results. This effort caps the push we made to get spammers off our site, which began when we suspected that as many as 2/3 of our accounts were spam. We currently have some 232K accounts in good standing. Having cleaned out some 500K accounts, we were pretty close in our estimate. The good news is that the rate of spam account creation is well within our ability to monitor.
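For anyone who wants to check the "pretty close" claim, here is a quick back-of-the-envelope sketch. It uses only the rounded figures quoted above and assumes (this is an assumption, not something stated precisely) that the 500K figure is the total number of spam accounts removed over the whole cleanup.

```python
# Rounded figures quoted in the post, in thousands of accounts.
good_standing_k = 232   # accounts currently in good standing
removed_spam_k = 500    # assumed to be the total removed as spam

# Approximate account total before the cleanup started.
total_k = good_standing_k + removed_spam_k

# Share of that total that turned out to be spam.
spam_share = removed_spam_k / total_k

print(f"accounts before cleanup: ~{total_k}K")
print(f"spam share: {spam_share:.1%} (original guess: {2/3:.1%})")
```

With these rounded inputs the spam share lands within a couple of percentage points of the original two-thirds guess.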

There is still more to do in this regard. There are accounts that created projects and made edits on the site, which makes them look like legitimate users, but the projects they created are really spammy advertisements. We plan on applying what we’ve learned from the ML work to train the algorithm to detect spam projects and start cleaning them off the site. Let’s take a quick look at what that means:

On our home page, we say that we are indexing 472K projects. That’s the number of undeleted projects on the Open Hub. However, not every project has analyzable code. When we count projects that are not deleted and have had an analysis at any time in the past, we see there are 292K such projects, about 62% of all projects. This means that the difference, some 180K projects that have never been analyzed, could well be spam projects. Some of them are legitimate OSS projects that don’t have analyzable code, such as documentation projects and the like. But there are only a tiny number of those. I’d wager that almost all of those 180K unanalyzed projects are junk. So the next ML project, to find them and get rid of them, is important so that we have real numbers about activity in the OSS community.

Let’s turn our attention back to the nearly 300K projects that have had analysis. Of these, nearly 28K projects have a CodePlex repository. As you probably know, CodePlex has gone into Read-Only mode and will be shutting down entirely by the end of this year. Most of those projects have only CodePlex repositories. In accordance with our mandate to provide analytics on available and active Open Source Software projects, we will be deleting all those projects. (We’ll start a background effort to find any new locations for those projects.) This will drop the number of undeleted projects that have been analyzed at some point in the past to some 263K projects.

Finally, when we consider projects that have jobs in a permanent state of failure, which blocks our ability to generate updated analytics, we have to remove an additional 40K projects, which drives down the number of active projects that can reliably be updated: we’re really looking at 223K projects.
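The funnel described above can be sketched as a few lines of arithmetic. This is only an illustration using the rounded figures from this post; it assumes every CodePlex project and every permanently failing project is subtracted in full, and because the inputs are rounded ("nearly 28K"), the results come out about 1K higher than the 263K and 223K quoted in the text.

```python
# Rounded counts from the post, in thousands of projects.
undeleted_k = 472       # undeleted projects shown on the home page
ever_analyzed_k = 292   # undeleted projects with at least one analysis
codeplex_k = 28         # analyzed projects with a CodePlex repository
failing_jobs_k = 40     # projects whose jobs are permanently failing

# Projects that have never been analyzed (suspected to be mostly spam).
never_analyzed_k = undeleted_k - ever_analyzed_k

# Analyzed projects left once CodePlex projects are removed.
after_codeplex_k = ever_analyzed_k - codeplex_k

# Projects that can still be reliably updated.
updatable_k = after_codeplex_k - failing_jobs_k

analyzed_share = ever_analyzed_k / undeleted_k

print(f"never analyzed:        ~{never_analyzed_k}K")
print(f"analyzed share:        {analyzed_share:.0%}")
print(f"after CodePlex removal: ~{after_codeplex_k}K")
print(f"reliably updatable:    ~{updatable_k}K")
```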

We have two major strategies to increase the number of available and active OSS projects on the Open Hub. The first is to lower the barrier to getting new projects into the Open Hub. We are working on an overhauled and streamlined workflow for adding projects from GitHub to the Open Hub. Right now, we support only a bulk upload of all repositories in a GitHub account into a single project. We plan on letting users create new projects, assign repositories to those new projects, and assign repositories to existing projects. This will let users quickly get their projects into the Open Hub.

If this works as well as hoped, we can expand it to other forges, such as GitLab. Your requests for other forges are most welcome.

The other strategy is to continue the cleanup work to examine failed jobs and see what can be recovered. But our user community is the best strategy we have.

Even with “only” 223K projects, there is no way we can manually review even the majority of them. Therefore, we will be making it possible to use your GitHub account to create an account on and sign into the Open Hub. By lowering the barrier to making edits on the Open Hub and relying on the maturity of the GitHub environment, we hope that more users will be willing to push their Open Source Software projects to the Open Hub and review existing projects to ensure we have the most up-to-date information.

We are also working on capping the number of enlistments that any one project can have. This will make it easier to review and maintain projects. We will introduce a new feature, “Open Hub Collections”, to gather blocks of related projects, like Linux distributions, so that users can quickly see which projects make up a collection and quickly navigate to related projects.

Gosh, there is so much more. There are ongoing database architecture upgrades, a rewrite of the analytics engine, a future overhaul of Ohloh SCM, ongoing reviews of pull requests against Ohcount, and the daily task of making sure everything we have is working the way it’s supposed to. Oh, and we updated the CSS so that the Open Hub is in line with the standards defined by the Black Duck Marketing team. But these will have to wait for another post.

Thank you for being part of the Open Source Community and the Open Hub. (And please provide your feedback in our survey, which will be open for only a little bit longer: https://www.surveymonkey.com/r/G5HN2JH)

About Peter Degen-Portnoy

Mars-One Round 3 Candidate. Engineer on the Open Hub development team at Black Duck Software. Family man, athlete, inventor