Purpose

This is a small test before starting work on something more robust for event growth tracking. I wanted to play with the data generated by the #mozsprint hackathon organized by the Mozilla Science Lab.

In this short notebook I will keep adding notes on the experiments I run to get insights from this two-day hackathon. Any contribution is more than welcome.

Also, if you want to join me on Trello, I created a board for this, and I will try to write a blog post with tips on getting ready for a hackathon :)

In this notebook I will be doing a lot of data transformation and conversion. I will use classic Python libraries like pandas and matplotlib, but I will also introduce some others needed for specific purposes, like creating a word cloud from text.

Questions we want to answer

When it comes to an event like #mozsprint, it is hard to get realtime feedback about its growth unless it is big enough to be discussed and debated on a social network such as Twitter. The Mozilla Science team did an amazing job preparing for #mozsprint by getting a lot of people involved all over the world, which resulted in a considerable amount of tweets (data) that one can extract and play with to see the event itself from different angles.

These are the questions that I was personally interested in:

How many tweets contain the hashtag #mozsprint? (a general idea of the size of this two-day event)

What is the growth of these tweets over time?

What is the distribution of the retweets containing the hashtag #mozsprint?

Who are the influential tweeps?

Who is most active using the hashtag #mozsprint?

What are the most tweeted terms related to #mozsprint?

What are the top 10 most important tweets?

What are the demographics of the tweeps involved in #mozsprint?

These are basic ideas that we can improve later, but they already give an insight into the evolution of the #mozsprint event.

Libraries used

The libraries used here are classic: pandas to store extracted information in dataframes, matplotlib and seaborn to plot the data we need, etc.

Because we are interacting with Twitter, we need a Python library flexible enough to make the information extraction as easy as possible. I chose tweepy because it looks sufficient for what we want to achieve here. We will ask Twitter to give us a stream of information for the #mozsprint keyword; the returned information will be saved in JSON format for the rest of this session.

Now let's set some options for the plots we will code. I like the way R plots look, so we will add a small trick to make matplotlib plots look like ggplot2 in R. We also define how seaborn plots will look.
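As a minimal sketch, matplotlib ships a ggplot-inspired style that can be switched on in one line (the seaborn side is a similar one-liner such as sns.set_style("darkgrid")):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, safe when running outside a notebook
import matplotlib.pyplot as plt

# Make matplotlib figures mimic ggplot2's grey background and grid lines.
plt.style.use("ggplot")
```

Every figure created after this call picks up the ggplot look automatically.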

Prepare your app

Because we are using tweepy, if you go to its official documentation you will learn that you first need to create an app tied to your Twitter account on the apps.twitter.com website. Once you create the application that will be used with your code, you will be given a set of tokens and secrets that you need to provide to your code here in order to connect to the Twitter API. I will show the ones I used myself at the time of writing this tutorial, just to give an idea of what these tokens look like, but by the time you read this they will no longer be valid.

Ok, now let's set the maximum number of tweets to extract. Remember: Twitter limits the number of records you can extract at a time, and when they spot your account asking for too much information from the API, you may be unable to use the API for at least 15 minutes (we will manage this later).

MAX_TWEETS = 8000

Ok, now let's connect to the API using the credentials we defined earlier.

You will notice the wait_on_rate_limit=True; this tells your program not to bail out on Twitter's rate-limit error (HTTP 429) but to wait the needed time in order to resume the task.

Data extraction

Now we are connected to the API and ready to send our first query to Twitter. We will use the classic search API function to ask for the hashtag #mozsprint.

data = Cursor(api.search, q='mozsprint').items(MAX_TWEETS)

Now we will initialise an empty list called mozsprint_data; it will contain the returned results as JSON objects. Each list element will basically be the JSON form of a tweet with all its information.

If you are wondering what a tweet looks like in JSON, this is an approximation of its structure; we will actually be looking at something very similar.
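For instance, here is a heavily pared-down sketch of that structure (the field values are made up, and a real tweet carries many more keys such as entities and place):

```python
import json

raw = """{
  "created_at": "Thu Jun 04 14:00:00 +0000 2015",
  "text": "RT @MozillaScience: #mozsprint is on!",
  "retweet_count": 12,
  "user": {"screen_name": "kaythaney", "location": "New York"}
}"""

# json.loads turns the raw payload into nested dicts we can index directly.
tweet = json.loads(raw)
```

Nested fields are then reachable with plain dict lookups, e.g. tweet['user']['screen_name'].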

Now we will :
- Fill that list of tweets
- save the json tweet results for future loading without asking the same information from the API

mozsprint_data = []
# You will use this line in production instead of this
# current_working_dir = os.path.dirname(os.path.realpath(__file__))
current_working_dir = "./"
log_tweets = current_working_dir + str(time.time()) + '_moztweets.txt'
with open(log_tweets, 'w') as outfile:
    for tweet in data:
        mozsprint_data.append(json.loads(json.dumps(tweet._json)))
        outfile.write(json.dumps(tweet._json))
        outfile.write("\n")

Awesome, it looks like a table we can use now. You will notice that some tweets start with RT, which marks a retweet, so this may create a bias if we want to know the number of original tweets versus the number of retweets.

Ok! This may be a bit biased. What this is telling us is that, first of all, not all tweeps specify a location in their profile data on Twitter; among the few who do and who are actually involved in #mozsprint, most are in India. I guess Twitter leaves this field optional to protect users' privacy, but it makes all the geo queries somewhat biased.

Of course! @MozillaScience was very active, followed by @kaythaney (Kaitlin Thaney is the director of MSL :) ). It seems to make sense, and this is great. Note here that we are not counting usernames in the tweet texts; we are playing with the users column in our tweet dataframe.

Ok, now let's see the distribution of retweets. For this we will create a similar generic function, but we will be using seaborn instead:

distplot provides one interface for plotting histograms, kernel density plots, rug plots, and fitted probability distributions. distplot tries to pick a good number of bins for your dataset, although all of the options for specifying bins in hist can be used.

Plotting engagement (data transformation)

One more thing we can examine is the trend of the tweets, in other words when people started tweeting about #mozsprint and when the peak was reached. For this we will focus on the tweet submission date column.

But for that we need to transform the dates, group by day, and then sum the variables we want to plot; this is a good example of data transformation.
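The transformation can be sketched on a toy frame like this (the column name created_at is assumed to mirror the tweet dataframe):

```python
import pandas as pd

# Toy stand-in for the tweets dataframe: one row per tweet.
tweets = pd.DataFrame({
    "created_at": pd.to_datetime([
        "2015-06-04 09:12:00", "2015-06-04 15:40:00", "2015-06-05 10:05:00",
    ]),
})

# Truncate each timestamp to its day, then count the tweets per day.
per_day = tweets.groupby(tweets["created_at"].dt.date).size()
```

per_day can then be plotted directly to reveal the pre-event plateau and the first-day peak.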

In fact, the first day of the hackathon was the peak, but we can observe a plateau before the sprint starts. That means two things: the event was successfully planned and announced, and the peak is a translation of how viral it went. Very good job, Mozilla Science team!

Final touch : The Word Cloud

Like any event, we are always curious about the words that gravitate around a topic. When #mozsprint started, you could observe a whole bunch of tweets and other hashtags as well as recurrent terms. In order to see if there is a correlation between these words and the current topic, we can follow several approaches; one of them is to extract all tweet texts and proceed with a text-mining technique to classify words by category or by context. But as a closing note to this notebook I wanted to do something lighter: a word cloud. Simple, clear, and it speaks for itself. So let's create one.

We are lucky because there is a library just for that.

text = " ".join(tweets['text'].values.astype(str))

This will concatenate all tweets into a single string! (you don't want to read that :) )

no_urls_no_tags = " ".join([word for word in text.split()
                            if 'http' not in word
                            and not word.startswith('@')
                            and word != 'RT'
                            ])

This removes all the retweet prefixes like RT, username mentions, URLs, etc.
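On top of that cleaning step, a quick term count with the standard library already shows what the word cloud will emphasize (the sample text here is invented):

```python
from collections import Counter

text = "RT @MozillaScience: #mozsprint is on http://mzl.la/demo join #mozsprint now"

# Same filtering as above: drop links, mentions and the RT marker, then count.
words = [w for w in text.split()
         if 'http' not in w and not w.startswith('@') and w != 'RT']
top_terms = Counter(words).most_common(2)
```

The most frequent surviving terms are exactly the ones the word cloud library will draw the largest.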

Permalink: http://blog.coderscrowd.com/twitter-hashtag-data-analysis-with-python/ (Tue, 01 Sep 2015)

This is the documentation for the QuotaWatcher utility, a small cron job developed to monitor disk usage on our servers.

In this post I am going to explain how this agent works, the steps we need to build it, and how it can be improved. Please feel free to comment and add your input.

All the code is heavily pep8'd :) I use PyCharm and Sublime to tackle every single formatting and quality problem.

I like this way of importing libraries: if some libraries are not already installed, the system will exit. There is room for improvement here; if a library does not exist, it would be possible to install it automatically, provided we run the code as admin or with enough permissions.
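The pattern is roughly this sketch, where json stands in for whichever third-party dependency the script actually needs:

```python
import sys

try:
    import json  # placeholder for a real third-party dependency
except ImportError as err:
    # Bail out with a readable message instead of a raw traceback.
    sys.exit("Missing required library: {0}".format(err))
```

If the import fails, the process stops immediately with a clear message rather than crashing later.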

We init the class as an object carrying a few settings, including a threshold above which an email is triggered to a recipient list. This object looks at the size of each subdirectory in path. You need to create an email address and add some variables to your PATH (discussed later).

@property
def loggy(self):
    return self._log

We need to inherit logging capabilities from the logging class we imported (see the code of this class later). This will allow us to log from within the class itself.

@staticmethod
def load_recipients_emails(emails_file):
    recipients = [line.rstrip('\n') for line in open(emails_file) if not line[0].isspace()]
    return recipients

We need to load the emails from a file created by the user. Usually I create two files: development_list, containing only email addresses I will use for testing, and production_list, containing the addresses I want to notify in production.

This is a wrapper around the famous du command. I use GNU parallel in case we have a lot of subdirectories and don't want to wait for sequential processing. Note that we could have done this with multithreading as well.

I didn't want to use the -h flag because we may want to sum up subdirectory sizes or do other postprocessing, so we'd rather keep them in a unified format (unit). For a more human-readable format, we can use the du_h() method.

@staticmethod
def list_folders(given_path):
    user_list = []
    for path in os.listdir(given_path):
        if not os.path.isfile(os.path.join(given_path, path)) and not path.startswith(".") and not path.startswith("archive"):
            user_list.append(path)
    return user_list

We need at some point to return a list of subdirectories; each will be passed through the same function (du).

Finally we create the function that will bring all this protocol together :

Read the list of receivers

Load the path we want to look into

For each subdirectory, calculate its size and append it to a list

Create a table to be populated row by row

Add the subdirectories and their sizes

Calculate the total size of the subdirectories

If one of the subdirectories is larger than the specified threshold, trigger the email

Report the usage as a percentage

def arguments():
    """Defines the command line arguments for the script."""
    main_desc = """Monitors changes in the size of dirs for a given path"""
    parser = ArgumentParser(description=main_desc)
    parser.add_argument("path", default=os.path.expanduser('~'), nargs='?',
                        help="The path to monitor. If none is given, takes the home directory")
    parser.add_argument("list", help="text file containing the list of persons to be notified, one per line")
    parser.add_argument("-s", "--notification_subject", default=None, help="Email subject of the notification")
    parser.add_argument("-t", "--threshold", type=int, default=2500000000000,
                        help="The threshold that will trigger the notification")
    parser.add_argument("-v", "--version", action="version",
                        version="%(prog)s {0}".format(__version__),
                        help="show program's version number and exit")
    return parser

The program takes into account the path to examine, the file listing the emails to notify, the subject of the alert, and the threshold that will trigger the email (here by default 2.5 TB).

Note that in the main we load some environment variables that you should specify in advance. It is up to the user to fill these out. It is always preferable to declare these as environment variables: most of the time they are confidential, so we'd better not hard-code them here.
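For example (the variable names below are illustrative, not necessarily the ones the agent uses):

```python
import os
import sys

def load_smtp_config():
    """Pull confidential settings from the environment instead of the source."""
    keys = ("SMTP_HOST", "SMTP_USER", "SMTP_PASSWORD")
    config = {key: os.environ.get(key) for key in keys}
    missing = [key for key, value in config.items() if value is None]
    if missing:
        sys.exit("Please export: {0}".format(", ".join(missing)))
    return config
```

Exporting the variables in the crontab or shell profile keeps the credentials out of version control.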

How to run the program

python quotawatcher.py dev_list -s "Hey Test" -t 2500000000000

Permalink: http://blog.coderscrowd.com/setting-up-a-quota-watcher-agent-in-python/ (Mon, 31 Aug 2015)

This is a quick tip on Python's is operator: when to use it, when not to, and what you should really consider when using it.

This is an issue that can affect your results; the effect can be small or big depending on what you do with the outcome of the is operator.

In this small tutorial I will show you that is is not really what it seems.

Comparing Integers and Strings

In this section we will compare integers and strings and test whether they are equal to each other.

x = 1
y = 1
print('x is y', bool(x is y))
('x is y', True)

That's great! Now let's try to replicate this with another pair of values for x and y.

x = 1000
y = 1000
print('x is y', bool(x is y))
('x is y', False)

As you can see, the comparison returns False here! Let's test with some floats instead of integers.

x = 0.005
y = 0.005
print('x is y', bool(x is y))
('x is y', False)

Strange? What if this type of comparison were added to a program that compares p-values? Or that selects an action based on a threshold?

Ok: now we compare the values of the objects, which returns the correct answer.
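For instance (going through int() forces the objects to be created at runtime, so CPython's compile-time constant sharing doesn't mask the effect; the identity result assumes CPython):

```python
# Two distinct 1000 objects, created at runtime.
x = int("1000")
y = int("1000")

value_equal = (x == y)   # compares values: True
same_object = (x is y)   # compares identities (memory locations): False in CPython
```

Use == whenever you mean "these hold equal values"; reserve is for genuine identity checks such as x is None.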

That's it for today. I hope this will make you think that Python is not as straightforward as it seems; compared to other languages that is true, but you still need to pay attention to these small things that may deeply affect your results and your data description.

Why is it working for small numbers ?

== is a value comparison; is is an object identity (memory location) comparison. You will often see that comparisons like x is 0 give the intended result because small values are cached in Python, but you always want to use == instead of is when checking equality, because this behavior cannot be relied upon.

More precisely, I think the current implementation sets the cache range from -5 to 256.
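You can probe that boundary directly; again int() is used to sidestep constant folding, and the identity results assume CPython:

```python
# 256 sits inside CPython's small-integer cache, 257 just outside it.
a, b = int("256"), int("256")
c, d = int("257"), int("257")

inside_cache = (a is b)    # True: both names point at the cached 256 object
outside_cache = (c is d)   # False: two freshly allocated 257 objects
```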

Permalink: http://blog.coderscrowd.com/when-python-is-is-not-what-it-is/ (Sat, 08 Aug 2015)

I came across gource, an open source tool for version control visualization, and I have been playing with it on some of the projects I am working on in our group at the BC Cancer Agency.

I wanted to blog about it, and share my experience using it.

Visualizing git commits is very informative: it gives an idea of the status of a given project, the contributions made by different team members, and the major changes within a single project.

What I am missing in this though, is a practical way of translating this into a (re)usable format, but I'll tackle this issue later in this post.

I generated a video for the Targeted Sequencing Pipeline, a project that I've been working on for a while now. In the video we can see some color coded nodes and here is a brief description of the most important ones :

Green : when a developer pushes to a git repo

Red : when a developer removes from a git repo

Yellow : when an update is made on some files

I tried to highlight some nodes so you can see file names being changed in real time. The videos are accelerated starting from the first git commit. You can appreciate how big some commits are :) which is not really recommended; I usually keep my commits as small as possible because I like to think of them as units rather than batches of many tasks at once.

I used to run a blog at biocodershub.net, but since I moved all my efforts towards developing coderscrowd.com, I was forced to shut it down. That said, I will keep re-releasing some successful posts that were highly debated because they reflect the reality of a data scientist's day-to-day work. This is one of those famous posts, originally shared by Paul Michael Agapow @agapow.

Something cynical for a rainy Friday at the end of a long week :^)

“The data is all in these [proprietary and undocumented format] files.”

“What I want is a program to browse, edit and validate gigabyte-size whole genome sequencing runs. It should import and export all known formats. And it has to run in a browser. And some of staff refuse to use anything but IE6.”

(After delivering an insignificant or negative result) “Can’t you analyse it again?”

“Why don’t we put the new server rack in your office?”

“That software you wrote is buggy! [What happened?] It’s not working! [How do you know that?] It’s broken! [In what way?] Can’t you just fix it? [How? I don't know what's wrong ...]“

(While waiting for the result of a Bayesian calculation) “Why does it take so long to get this answer? Can’t you just make it go faster?”

“I know you said that 30 data points were the minimum for statistical rigour. But we only got 5. Can’t you analyze it anyway?”

“We keep all those records in Excel files … uh, I think this is the most current version …”

“The Z lab showed you could do this [with 10 genes and a computing cluster]. So do you think you could this this with our data [200 whole genomes, on a PC]?”

“Good news – we got a huge grant for sequencing and annotating 6 squillion whole genomes. You’re not on the grant and we didn’t budget for any bioinformatic work but here’s the data. Can you have this done by next week?”

(After being told that an analysis is impossible or ill-considered) “But X over in Y’s lab does it all the time.”

“Uh, so what is it that you’re doing again?”

PR on Github or comment here , and your contribution will be added to the post, Thanks

Permalink: http://blog.coderscrowd.com/words-a-bioinformatician-never-wants-to-hear/ (Mon, 02 Mar 2015)

Are you using CodersCrowd ? No ? You should ! :)
What first started as a side project for me rapidly became something I'm excited to work on. Each time I think of a new feature, I just run and sketch some pros and cons, draw a cartoon, open my editor and let the magic happen.

The application is becoming a real tool that we should all support as bioinformaticians, because it will be (is) changing the way we work. Why? Because it is implemented in a way that exposes your skills to the crowd. You're like a circus trapezist (without the safety net): you write code, fearless of exposing your problems to the crowd, always looking to write better, more readable and reproducible code.

Well, look no more, comrade! It is right here in your browser; I spent hours designing that tool just for us.

Here are the milestones so far for CodersCrowd

First the app was built as a code sharing platform

Then I added networking around source code (reinventing social network)

I made a giant step by making all the code exchanged by users runnable, including the most used libraries in bioinformatics, such as BioPerl, Biopython and Bioconductor (the most used libraries that I am aware of). Thanks to Docker, my dream came true: I developed my Dockerfiles and kept refining them until I got what seems to me a reasonable setup for such a project. Then I worked with the awesome and very responsive folks at Linode to make things run like a charm, and plugged all of this into what was already done at CodersCrowd!

Most recently, I decided to include some realtime interaction between users by adding the TogetherJS library from Mozilla, and man! It looks fantastic and works exactly as I imagined it. We can now open team sessions to work on the same code: training sessions, teaching an entire classroom, live code reviewing, and most importantly we will keep learning something new every day!

Here is what realtime crowd programming looks like.

Check out this awesome video where I run a demo with two different users in two different browsers.

MAKE SURE YOU CLICK THE HD ON THE VIDEO

Feel free to share that video with your buddies, create as many sessions as you want, and just keep writing awesome and clean code!

Permalink: http://blog.coderscrowd.com/real-time-programming-for-bioinformatics-and-for-fun/ (Wed, 30 Jul 2014)

I received a considerable amount of feedback concerning the layout, along with suggestions to make the user experience at CodersCrowd as smooth and enjoyable as possible.

Whereas a lot of CodersCrowd members seem comfortable with the tabular presentation of the latest published problems, I received many suggestions to make members' contributions filterable by programming language. I liked the idea and decided to give CodersCrowd a new homepage :)

I have been working with genomics data and Neo4j for about a year now. One of the biggest pieces of advice I can give is about how you architect your queries for any moderately sized set of data. However, we need some background first:

Vcf Files

Variant Call Format (VCF) files are a common format for storing data in clinical sequencing pipelines. Below is what a file looks like:

What a mess! These quickly become unreadable. Luckily, parsers exist to make our lives much easier and offer a usable API; PyVCF is a popular one if you are interested. The basic structure of these is:

Which we can easily convert to:

Note: while long-time bioinformaticians will know this architecture differs from the standard spec, it works for us when we assume one sample per VCF file (which is what we are doing).

Neo4j

We have since taken this schema and loaded it into Neo4j. This story is about finding all the files that contain a particular allele. Initially I wanted to find all the VCF records with that allele, and I knew the various properties of the VCF record.

This would be an appropriate solution if we were using MongoDB or SQL. However, graph databases aren't built for querying by property restriction. Don't get me wrong, they can do it; it's just not where they shine.

How do we Optimize?

So how do we fix this? We use a trick called anchoring. Traversals are very fast, but when we have to scan tons of nodes the query isn't going to be fast. For this query, we know we are looking for a particular allele. The alleles have all been indexed, and more than that, they have been constrained:

CREATE CONSTRAINT ON (allele:Allele) ASSERT allele.name IS UNIQUE

where name is a parsable name that uniquely identifies the allele. Since the number of alleles is very small compared to the number of VCF records, and they are indexed, we anchor our query on the allele and then traverse.
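An anchored query might then look something like this sketch; the relationship type HAS_ALLELE and the example name value are illustrative, not our actual schema:

```cypher
// Anchor on the single indexed, constrained Allele node, then traverse outwards.
MATCH (record:VcfRecord)-[:HAS_ALLELE]->(allele:Allele {name: "chr7_140453136_A_T"})
RETURN record
```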

It is important to note that while node labels are good, node labels get applied after the traversal, whereas relationship labels get applied before it. Thus, best practice is to always use relationship labels, alongside node labels, in any traversal.

As stated above, the former query takes approximately two hours, while the latter takes 104 milliseconds. That is a HUGE speed-up! Remember, the speed-up comes from iterating over a small number of linked lists on Allele rather than over VcfRecord; when possible, we want to avoid filtering on properties at all. For reference, we have stored approximately 40,000 Alleles and 135,000,000 VcfRecords. The Alleles are indexed on a unique name, as they are our central fact for analysis. The VcfRecords have no index.

Lessons Learned

Always use indexes

Apply uniqueness constraints wherever possible

Filter on your smallest indexed set then traverse from there.

Try to avoid filtering on properties when you can

If you must filter on properties make sure you can filter on a small set of nodes

I hope this helps in your architecting efforts. Please feel free to reach out to me at Alex@Frieden.org with questions and suggestions.

Permalink: http://blog.coderscrowd.com/cypherization-optimizing-cypher-queries/ (Sun, 08 Jun 2014)

A few months after CodersCrowd went online, the amount of comments I received criticizing some of its functions and suggesting feature improvements was a source of joy each time I read my emails.

At the time I am writing this article, we are 573 users on CodersCrowd, and I first wanted to thank you all for joining us in this CrowdCoding experience.

Ok let's get right to the exciting updates we have for the next release of CodersCrowd.

The next version of CodersCrowd, 1.2, will be released with a very exciting feature: running bioinformatics gists live in the browser!! Ok, let's think about it for a second... what does that mean?

Live code debugging

Being able to reproduce your bugs/results

Embark your software into a demo mode instead of writing long description pages

Reimplement algorithms live with coders all over the place

Code on the move !! You can code from your mobile devices, from anywhere

Create Interactive Tutorials

Test codes

and the list is long ... :)

CodersCrowd now brings that possibility in its new version, to be released mid-April 2014 (but already accessible).

Here is a sneak peek of the new release.
When you open a code to view, whether it is a problem to be fixed or a valid code shared by a member, you will be able to see this small button below the code:

That little button will allow users to run the code live and see the exact same message they would see in a terminal... it is like having a terminal in your browser.

In this example in particular, it is a Python code that implements the Smith-Waterman and Needleman-Wunsch alignments. I found the code on GitHub, tried it and made sure it was working, then I intentionally introduced some errors to see if we would get the same error message we would see in a terminal.

Ok, now ! this is what you would see when you hit the Run button :

Awesome ! A nice Traceback with the exact line where the problem is !

Ok, to be honest I introduced two errors in the method names (Neeeedle() instead of needle(), and waterm() instead of water()). Since it is Python, execution stopped at the first bug found and reported the traceback.

Let's push it further: now let's fix the first error, post a solution, and see the outcome:

Excellent! It is partially working: we can see the diff in the posted solution and run it directly! The solution shows another traceback; that's the second bug. Let's fix it and run another solution: and the winner iiiiis... :)

Now let's see the entire code review of this example :

What can we run on CodersCrowd ?

I'd love to answer: pretty much everything you want :) but let's be realistic; still, what I am going to list here is already awesome. You can run:

Perl / Bioperl

Python / Biopython

Java / Biojava

Scipy, Numpy, Matplotlib

R / Bioconductor

Shell (awk..)

Yes, I am serious :)

Only Python and Perl are supported for the test phase though

Update : perl, python, shell and R are runnable now

Limitations ?

Hmm, yes, sure! Currently, codes using arguments are not supported (they run, but they will output errors); the reason is that I am not yet allowing access to IO on my server. For codes that access files, I will create demo files covering pretty much the long list of formats we use in bioinformatics, under a single location. If you think of a specific format, please leave a comment:

Fasta

Fastq

Bam/Sam

Genbank

alignments formats

text

Vcfs

etc ...

I will make an update with files that you can use as a test for your codes.

How is that possible ?

The idea of running code in the browser was one of the main features I wanted to implement at CodersCrowd. First I started with the CrowdCoding aspect of the application, and then I started to look around for how to implement that BIG CHALLENGE of running users' problems online as fast as a microsecond :) something similar to what JSFiddle does for the Javascript community. The task was huge, and the challenge was even bigger, until I met Docker.

I started to dig deep into their API and succeeded in doing what will (hopefully) be of great help to the community towards code reproducibility.

You need to be a member to run codes on CodersCrowd

Runnable code will be accessible only to members since it is still experimental; when we pass the test phase I will open public access to this feature.

Have fun !!

Rad

Permalink: http://blog.coderscrowd.com/runnable-gists/ (Sat, 29 Mar 2014)

As a registered member of CodersCrowd, you own a public page that is built and updated automatically as you contribute and share code and knowledge.

Your public profile page is accessible through a link like this one : http://coderscrowd.com/app/vcard/user/**<YOUR_USERNAME>**

Which will give you something like this :

Your public page will record your contributions and give your visitors an idea of your skills and interests. It will show statistics about the codes you post (how open you are to the community), the solutions you share (how knowledgeable you are), the points you earn and their rank among CodersCrowd users (1 being the top), and finally some statistics about the number of people in your network (coders you help).

Additionally, your visitors will get an idea of the programming languages you frequently use, with a number showing how many codes you wrote per tag.

As a member you can also edit your profile and update your information, such as a profile image, a bio, your experience and your publication list (something similar to LinkedIn, if you are familiar with it).

We have several programming languages implemented so far, so it is very easy to pick one problem and work on it.

First login to your account. You will be able to see the list of all codes posted on your dashboard.

One important feature is that you can easily pick a subject of your choice from there, using the search feature or sorting the list of codes by category. Here is an example:

This is an example of what you see when you login

You can sort this by language by clicking on : language

Or you can sort by status, focusing on pending codes, which means the code is not solved yet.

Or you can sort by the number of posted solutions

We have some recommendations though :

First, prioritize your needs: if you want to solve problems while learning a new language, sort by language and focus on the one you want.

Your chance of having the problem holder accept your solution is higher when you target codes with a small number of posted solutions.

Codes locked in the knowledge base are not 'workable' anymore, since the problem holders closed them by accepting other solutions; you may still want to visit them to learn more. If you are a hungry coder, focus on the codes with pending status.

When you select the code to be fixed, now comes the serious business. Click on the code and hit "click here to add solution"; you will come to a page similar to the one below, which I will describe in detail:

1 gives an idea of the problem holder (number of contributions and how many coders they helped)

2 shows some stats about the posted code, in particular how many coders are engaged with it, how many solutions they posted, and a score normalizing these two metrics against all the codes and their solutions in the database (in other words, whether this code is engaging a lot of interaction, which is another way of saying it is a hot topic)

3 If you want to add your contribution, you should hit that red button

4 You have to read the posted problem carefully in order to add a solution that solves the issue

5 Here is a cool way to rate the problem solver. By design we decided to make rating at CodersCrowd useful (no upvoting, no downvoting, which is very subjective); we don't rate the people, we rate the code. So you have several options here to judge how useful this contribution is. Rates go from 1 to 10. Is the problem well documented (the description)? Is it reproducible? Can you read the code easily (comments, structure...)? Is the error/problem well described? How do you judge the code size relative to the task at hand (smaller codes for big problems are usually smarter)?

6 This is the list of all contributions; each contributor provides a solution that differs from the original code. Differences are highlighted by this beautiful diff display: green for insertions, red for deletions (heuh? InDels?)
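As a minimal sketch of how such a diff could be computed (CodersCrowd's own display may work differently), Python's standard difflib produces exactly this kind of insertion/deletion markup; the two Perl snippets below are made-up examples:

```python
# Compute a unified diff between an original code and a contribution.
# Lines starting with '+' would be rendered green (insertions),
# lines starting with '-' red (deletions).
import difflib

original = [
    "open(F, $file);",
    "while (<F>) { print; }",
]
contribution = [
    "open(my $fh, '<', $file) or die $!;",
    "while (<$fh>) { print; }",
]

for line in difflib.unified_diff(original, contribution, lineterm=""):
    print(line)
```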

Additionally, below the code you can chat if you want to add your opinion or longer paragraphs.

That's it! Have fun!

(Originally posted at http://blog.coderscrowd.com/solving-problems-at-coderscrowd-where-to-start/, Thu, 12 Dec 2013 19:27:22 GMT)

Thanks to a brilliant suggestion from Matt Shirley, we added the Distraction Free Mode, which is simply a full-screen code-writing feature in your browser (it works perfectly on all browsers and on Android, less so on iOS devices, but we are working on it).

This feature is enabled in two cases: when you add a new code/problem, or when you write a contribution to someone else's problem.

To activate this feature, all you have to do is click the "DFM" button at the top of the editor, as shown in the figure below.

This will transform your browser into a code editor!

And that's awesome! To leave DFM, all you have to do is click "Turn off DFM".

Have fun!

PS: If you have an idea that you want to see implemented at CodersCrowd, you can use our help center to submit it to members' votes: all you have to do is write it down here.

(Originally posted at http://blog.coderscrowd.com/introducing-the-distraction-free-mode/, Thu, 12 Dec 2013 15:45:07 GMT)

Adding a code with bugs, for correction or for review, is very easy. We tried to design the interface to be as intuitive as possible, and here is a small tutorial on how to add your code.

First of all, you have to log in, and you will be directed automatically to your dashboard.
In the top right corner you will find a red "New Code / Problem" button, as shown in the figure below:

When you click this button, you will see the form below:

It says it all:

Enter your code title in a way that makes people interested in interacting with it. Remember, CodersCrowd is not a Q&A, so it is better to take an affirmative tone rather than ask a question; besides, once the code is solved it will go directly to the knowledge base, so it is better not to have a question as a title.

Enter a problem description: be as precise as possible, tell people what your working environment is so that they can try to reproduce your bug, and tell them what you did, how you did it, and what you expect as output. Most importantly, tell them the error output you got.

Enter your programming language so that your code is highlighted accordingly, and be more precise about the category.

Now it's time to enter your code: simply copy and paste it into the editor.

If your code contains a bug and you want to find a solution to it, choose "This is a problem to be fixed"; otherwise, if you are adding a working code that you want to push directly to the knowledge base, just choose the other option.

The last option is to choose whether you want your problem to be public or not. Some coders want to share their problem with a restricted audience, and this option is for them: only the authors and those who know the URL of the code will be able to work on it (similar to private videos on YouTube). This is useful if you are teaching a class and want to share a code with students, if you are exchanging code with collaborators prior to a publication, or if you are versioning your code yourself and want it to stay private for a while, until you publish your work. One important thing here: if you accept solutions from contributors, the code will go public in the knowledge base!

Yet another Crowd thing? What is Crowd Coding and why do we badly need it?

I am a big fan of crowdsourcing, a term coined by Jeff Howe, contributing editor at Wired magazine, and Mark Robinson, an editor at Wired. I have spent a lot of time reading about how powerful this process is for solving particular problems, from business-related problems to very specific tasks.

Shortly after the crowdsourcing concept started to gain popularity, we witnessed the birth of other concepts such as crowdfunding, another powerful funding channel that can save a lot of ambitious projects that lack financial support.

The idea of using the power of the crowd to solve different kinds of problems started to grow on me, especially since we are seeing a lot of collaborative efforts producing successful projects in bioinformatics. A good paper recently published by Benjamin M. Good and Andrew I. Su shows how useful this concept could be for bioinformatics, but given the lack of support for such ideas, we might never have them applied to the field at all (I will talk about this later in another post).

That said, even though the term crowdsourcing is not applied per se, we have watched a lot of projects that are actually crowdsourced, such as the UCSC Genome Browser, the Gene Expression Omnibus (GEO) and others, where scientists contribute by submitting the results of their experiments and make their data available to the public for further analysis and investigation.

What about bioinformatics development?

Some other examples link bioinformatics to crowd contribution; we may cite Taverna workflow development or the Galaxy Project. But are they really crowdsourcing projects? The answer is no.

What motivates people to share with the public? Reputation? Community recognition? Publications? Probably! But the most important thing to consider as a scientist is to ask yourself a simple question, every morning you wake up: what did you do to change your world today? Even if your contribution is very small, what impact can it have on your microenvironment?

I was browsing a lot of publications lately and noticed that people are racing to publish papers, getting known for that small yet rebranded discovery that will be tightly associated with the author's name. Personally, I think this is one of the most dangerous practices and the biggest harm one can do to science. As a result, you can count a lot of dead projects (codes, software), probably because the ambition behind them was driven by a time-limited goal rather than a solution with a long-term impact on people's lives (jobs). How many times have you browsed a paper looking for its source code, only to end up on a "page not found" or a server error page? A lot, right?

There are probably a lot of reasons for that, but the one I think of all the time is that the authors failed to make their solution a need, a must-have tool for other scientists. On the other hand, all successful scientists in bioinformatics (I am talking about software development) usually find THE niche they want to focus on, build a useful product, and keep improving it through versioning and new features, until the day (after many years) their product becomes obsolete. By then their scientific career is probably over; if not, they use the same recipe: find the niche, and make a useful product.

To do that, one cannot work in a closed environment. In particular, we cannot achieve such a goal by doing things the same old way. Science is growing fast, bioinformatics is growing fast, and so is computer science. I was personally interested in many programming languages in the past, and engaged in a lot of discussions (the favorite kind for bioinformaticians) about the best language to use, or your favorite tool, etc., but I ended up adopting all of them: the most useful one is the one you know, and the most practical is the one that does the job you need to finish, rapidly (not a problem at all if you have clusters) and in as few lines of code as possible. I, for example, replaced some of my old Perl pipelines with ones calling awk, sed and bedtools, to the point where I was convinced that these alone can do any kind of job!

The most important thing, though, is to learn how to do things, and the best way to do that is to get an idea of how other developers would do the same task. If, by the end of the day, you find a smarter way to do the job, well, you definitely learned something!

That's how I came up with the CrowdCoding concept (Crowd Computing).

CrowdCoding is the interaction of programmers around a given piece of code, using a web framework that captures the original source code and all possible alternative implementations, in an effort to solve a bug, review a code, or reimplement an algorithm.

I spent months drafting the framework and working on the concept, thinking about a name for the web application, and finally ended up calling it simply CodersCrowd.

What is CodersCrowd?

Everything at CodersCrowd is centered around source code. The application basically brings a solution to bioinformatics' scattered knowledge on the web. Until now, the most common sources of information for solving a particular issue have been mailing lists, forums or Q&A websites. Although they can be efficient at solving a one-time problem, the most annoying and damaging effect, in counterpart, is the generation of a massive amount of 'wasted' knowledge. In time, that solved problem will become someone else's problem, and because it was not efficiently stored as a use case, they will spend precious time looking around the web for a similar problem, torturing their neurons to find the right keywords to type into the browser to get the most accurate results from the search engine. Sometimes you succeed; sometimes you just give up and break your problem into smaller tasks that you can try to map onto similar cases that are easier to find on the web. What a waste of time!

What if you had a knowledge base? What if all these solutions around the web were stored somewhere, easily searchable? What if you could easily see a problem and all the related contributions that led to its solution? What if you had a CodersCrowd account?

CodersCrowd is divided into two blocks. First, you submit a problem, describing as clearly as possible what you are trying to do and what problems you have exactly. Then you monitor other developers' contributions trying to solve your issue; you will see a lot of different contributions, some of which will work for you and some of which won't. Once you find one or several solutions, you just have to validate them. By doing so, you simply create a record in a knowledge base, a use-case repository that others will find useful. As simple as that!

Why would others solve my problem? And for free?

The most important thing we all want is getting things done, and rapidly! To do so, we want our problems solved as soon as possible, to learn as much as possible, and to move on. If you are a CodersCrowd user, your profile will speak for you: you have to keep a balance between the number of problems you post and the number of solutions you provide. The more you give to the community, the more you will receive, and the faster you will move on; that is the main idea behind CodersCrowd.

It is an all-in-one action: at the same time you ask, get a response, learn, gain reputation, increase the likelihood of people helping you in the future, get things done, and make a lot of connections and probably future collaborators.

In the meantime, your profile page will exhibit your achievements and your records.

I will be writing another detailed post on exactly how to use CodersCrowd, with tips to maximize the chance of getting people to interact with your problems.

Thanks for your time reading this very first post on CodersCrowd's blog !