User Feed
https://www.biostars.org/
Posts for users that match 2+4+6
en-us
Sun, 17 Mar 2019 07:33:59 +0000
Comment: C: Retrieve many FASTA of the same gene from NCBI..
https://www.biostars.org/p/369806/#369834
What https://www.biostars.org/u/25721/ alludes to is that your answers tend to go off on tangents that are not relevant here: "Click BLASTN, enter query sequence, paste into Word", etc.
Instead, your answers should be more concept-oriented. Here, for example, what you are trying to suggest is that one could use BLAST to search for similar sequences, though you should point out that, in general, there is no guarantee that only sequences from genes with the same name will be returned as hits. In addition, what the OP wants are the complete sequences, whereas your answer would, at best, yield only the aligned region: a big difference.
For your approach to work, one would then need to post-process the results to select unique taxids, match the desired gene name, and use those to select the accession numbers for the individual genes. Finally, one would need to use the accession numbers to download the full sequence for each gene. See how a more appropriate answer is a lot more complicated even at a conceptual level. Talking about Word, pasting queries, selecting options and so on is distracting and also not quite correct overall.
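The post-processing steps described above could be sketched roughly as follows. This is only an illustration: it assumes the BLAST hits were exported as tab-separated rows with three hypothetical columns (accession, taxid, hit title), and the sample rows, the placeholder accessions and the `select_accessions` helper are all invented for the example. The final step, downloading the full sequences by accession, is not shown.

```python
import csv
import io

# Hypothetical tab-separated BLAST hits: accession, taxid, hit title.
# The rows and accession labels are made up for illustration only.
HITS = """\
ACC0001\t9606\tHomo sapiens cytochrome b (CYTB) gene, complete cds
ACC0002\t9606\tHomo sapiens cytochrome b (CYTB) gene, partial cds
ACC0003\t10090\tMus musculus cytochrome b (CYTB) gene, complete cds
ACC0004\t10116\tRattus norvegicus NADH dehydrogenase subunit 1 gene
"""


def select_accessions(stream, gene_name):
    """Return one accession per taxid among the hits whose title names the gene."""
    chosen = {}
    for accession, taxid, title in csv.reader(stream, delimiter="\t"):
        # Skip hits that are similar in sequence but annotated as another gene.
        if gene_name.lower() not in title.lower():
            continue
        # Keep only the first (highest ranked) hit for each organism.
        chosen.setdefault(taxid, accession)
    return list(chosen.values())


accessions = select_accessions(io.StringIO(HITS), "CYTB")
print(accessions)  # ['ACC0001', 'ACC0003']
```

Even this toy version needs three separate decisions (which gene names count as a match, which hit to keep per organism, and how to fetch the full records afterwards), which is the point made above about the conceptual complexity.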
Sun, 17 Mar 2019 07:33:59 +0000 369834
Answer: A: Bam header editing
https://www.biostars.org/p/369739/#369833
Sounds like what you need is to replace the read group, overwriting the read group of all alignments with a new one:
samtools addreplacerg
prints:
Usage: samtools addreplacerg [options] [-r <@RG line> | -R <existing id>] [-o <output.bam>] <input.bam>
Options:
  -m MODE   Set the mode of operation from one of overwrite_all, orphan_only [overwrite_all]
  -o FILE   Where to write output to [stdout]
  -r STRING @RG line text
  -R STRING ID of @RG line in existing header to use
      --input-fmt FORMAT[,OPT[=VAL]]...
               Specify input format (SAM, BAM, CRAM)
      --input-fmt-option OPT[=VAL]
               Specify a single input file format option in the form
               of OPTION or OPTION=VALUE
  -O, --output-fmt FORMAT[,OPT[=VAL]]...
               Specify output format (SAM, BAM, CRAM)
      --output-fmt-option OPT[=VAL]
               Specify a single output file format option in the form
               of OPTION or OPTION=VALUE
      --reference FILE
               Reference sequence FASTA FILE [null]
  -@, --threads INT
               Number of additional threads to use [0]
Now, if you really don't want to process the entire BAM file (and note that any text-based editing means converting to SAM and then back to BAM, which would probably be slower than `addreplacerg`), you could edit the BAM file directly, though doing so could easily corrupt the file if done incorrectly. Here is how a BAM file starts:
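As a minimal sketch of that layout, assuming the SAM/BAM specification's description of the decompressed stream (a 4-byte magic `BAM\1`, a little-endian 32-bit header length `l_text`, then the header text), the start of a BAM file can be read with the Python standard library. The `read_bam_header` helper and the tiny stand-in file are hypothetical and exist only so the example runs without real data.

```python
import gzip
import struct


def read_bam_header(raw_bytes):
    """Decompress BAM bytes and return (magic, header_text)."""
    # BGZF is a series of gzip members, which gzip.decompress can read.
    data = gzip.decompress(raw_bytes)
    magic = data[:4]
    # l_text: little-endian 32-bit length of the plain-text SAM header.
    (l_text,) = struct.unpack("<i", data[4:8])
    text = data[8:8 + l_text].decode("ascii")
    return magic, text


# Build a tiny stand-in for the start of a BAM file (header only, no
# references or alignments) so the example needs no real file.
header = "@HD\tVN:1.6\n@RG\tID:old\tSM:sample1\n"
payload = b"BAM\x01" + struct.pack("<i", len(header)) + header.encode("ascii")
payload += struct.pack("<i", 0)  # n_ref: number of reference sequences
fake_bam = gzip.compress(payload)

magic, text = read_bam_header(fake_bam)
print(magic)
print(text)
```

Because the header length is stored explicitly and everything sits inside compressed blocks, changing the `@RG` text in place also means rewriting `l_text` and recompressing, which is why direct editing is easy to get wrong.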
...
Sun, 17 Mar 2019 07:04:21 +0000 369833
Comment: C: Can we consider Bioinformatics as a engineering discipline?
https://www.biostars.org/p/369811/#369823
Traditional biology is descriptive, sentence-driven, enumerative, dare I say "mechanistic": this thing does this, that thing does that. But what if different combinations of "things" can give you the same "that"? There is no way to capture, model or understand those phenomena that way.
In my opinion, traditional biology is not well suited for representing the complexity of information processing inside the cell, hence the conundrum.
Sat, 16 Mar 2019 21:55:58 +0000 369823
Answer: A: Can we consider Bioinformatics as a engineering discipline?
https://www.biostars.org/p/369811/#369816
Bioinformatics is about information, as the name aptly states (even though originally that was not the intent).
Initially, "informatics" meant the use of information technology (computers) to process data of biological significance. But the more we know about how the cell operates the more we understand that it is the information encoding and information processing of the cell that we measure with bioinformatic techniques. We are in the science of information interpretation first and foremost.
A cell is not like a clock, a cell is not like a transistor, a cell is not like an engine. All of these examples are engineering products; the cell is radically different from each. So how could bioinformatics be engineering? I might even venture to say that bioinformatics is the "opposite" of engineering.
Sat, 16 Mar 2019 21:14:33 +0000 369816
Answer: A: Small dataset for data analysis in the own laptop
https://www.biostars.org/p/369759/#369812
Saccharomyces cerevisiae has a genome size of a mere 12 million bp and is also one of the most studied model organisms. Thanks to the small genome size, you can re-run just about any analysis on a laptop.
Another common strategy, say for human-size genomes, is to find an experiment that also publishes BAM files, then use that BAM file to extract only the FASTQ data that aligns to a smaller chromosome. For example, chromosome 22 is around 50 million bp long. Doing so will give you access to the "original" data, but instead of using the entire human genome you can repeat the analysis using just the target chromosome. The reduced size of the data and reference would allow you to practice the data analysis on a laptop.
Sat, 16 Mar 2019 20:55:28 +0000 369812
Comment: C: Are parasites good as model organisms?
https://www.biostars.org/p/368964/#368985
I haven't seen the original title, but I will say that we can definitely learn a lot from parasites when it comes to the function of cells. Then what we learn could be put to good use.
I think the title should reflect that message: what can we learn from parasites, or what are parasites good for, etc.
I don't think that parasites are good model organisms. They are usually very specific to one particular niche, and what you learn from one parasite will not translate to another organism, whereas translating to other organisms is how I would define a "model" organism.
Tue, 12 Mar 2019 14:11:19 +0000 368985
Answer: A: BWA error when aligning multiple files
https://www.biostars.org/p/368881/#368882
At least one of your FASTQ file pairs (among the many) uses an old-school naming scheme where the read names are not identical across the pair:
`29_1_1101_2537_1125/1` vs `29_1_1101_2537_1125/2`
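The mismatch above can be checked with a few lines of code. This is a minimal sketch, not a description of how BWA compares names: the `base_name` and `names_match` helpers are hypothetical, and stripping the trailing `/1` or `/2` is one assumed way to compare (or normalize) names written in the older scheme.

```python
import re

# A trailing "/1" or "/2" marks the mate in the older naming scheme.
MATE_SUFFIX = re.compile(r"/[12]$")


def base_name(read_name):
    """Strip a trailing /1 or /2 so both mates share one name."""
    return MATE_SUFFIX.sub("", read_name)


def names_match(name1, name2):
    """True if two read names are identical as written."""
    return name1 == name2


first, second = "29_1_1101_2537_1125/1", "29_1_1101_2537_1125/2"
print(names_match(first, second))                        # False
print(names_match(base_name(first), base_name(second)))  # True
```

Running such a check over the first read of every file pair would be one way to find which of the many pairs uses the old naming.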
Tue, 12 Mar 2019 01:53:49 +0000 368882
Comment: C: SAM to BAM conversion problem after HISAT2.
https://www.biostars.org/p/368774/#368824
Note how above you run
samtools view out.bam
which is empty so will, of course, raise an error. See what
samtools view out.sam
prints
I think your BAM file is empty for whatever reason; you would need to run the conversion again and see why it failed. I think it ran out of "something" and truncated the BAM file, but the SAM file should be fine.
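One quick way to tell an empty file from a truncated one is sketched below. It assumes the SAM/BAM specification's convention that a complete BAM file ends with a fixed 28-byte empty BGZF block; the `bam_status` helper and the sample files are hypothetical, and "complete" here only means the end marker is present, not that the contents are valid.

```python
import os
import tempfile

# 28-byte empty BGZF block that the SAM/BAM specification places at the
# end of a complete BAM file.
BGZF_EOF = bytes.fromhex(
    "1f8b08040000000000ff0600424302001b0003000000000000000000"
)


def bam_status(path):
    """Classify a file as 'empty', 'truncated' or 'complete'."""
    size = os.path.getsize(path)
    if size == 0:
        return "empty"
    with open(path, "rb") as handle:
        # Only the last 28 bytes are needed for the end marker test.
        handle.seek(max(0, size - len(BGZF_EOF)))
        tail = handle.read()
    return "complete" if tail == BGZF_EOF else "truncated"


# Stand-in files so the sketch runs without real alignments.
with tempfile.TemporaryDirectory() as tmp:
    samples = {
        "empty.bam": b"",
        "cut.bam": b"\x1f\x8b partial data",
        "whole.bam": b"\x1f\x8b partial data" + BGZF_EOF,
    }
    results = {}
    for name, content in samples.items():
        path = os.path.join(tmp, name)
        with open(path, "wb") as handle:
            handle.write(content)
        results[name] = bam_status(path)

print(results)
```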
Mon, 11 Mar 2019 19:05:36 +0000 368824
Comment: C: Progressive programming and multiple sequences alignment?
https://www.biostars.org/p/368594/#368610
I think the right course of action here is to ask the OP to add the answer separately rather than closing the question. They can't add the answer if the question is closed.
Mon, 11 Mar 2019 02:20:00 +0000 368610
Comment: C: Progressive programming and multiple sequences alignment?
https://www.biostars.org/p/368594/#368609
You might want to add the answer you found as a standalone answer to the question. It would nicely complete the question.
Mon, 11 Mar 2019 02:19:33 +0000 368609
Comment: C: GUESSmyLT guess the library type of your RNA-seq data (orientation, strandness)
https://www.biostars.org/p/368544/#368608
Interesting tool; the descriptions of the possible library types are also quite handy.
Some feedback on usage:
The example invocations are needlessly lengthy. You should not need to list the files as `home/.../read1.fastq`; just call the files `read1.fq` and `read2.fq`. Why bother with the absolute paths?
The use cases should be labeled by the information that is available to the end user:
1. if you have reference genome and annotations
1. if you have reference genome but no annotations
1. if you have transcript sequences but no genome
1. if you have no other information just the reads
Requiring snakemake to run your tool seems to add unneeded complexity.
In general, there seem to be too many dependencies. It feels like the task at hand (determining the library type) ought to be much simpler than having to first assemble transcripts. Not sure what the right answer is here, but this might be an interesting research problem on its own: how to detect the library type without assembling transcripts?
What I am basically saying is that transcript assembly is a different and much bigger, more complicated task than library type detection.
Mon, 11 Mar 2019 02:09:34 +0000 368608
Comment: C: Review of the GFF and GTF formats (GFF / GFF1 / GFF2 / GFF2.5 / GFF3 / GTF / GTF
https://www.biostars.org/p/368167/#368497
Pretty mindboggling.
I personally blame tophat as the tool that resurrected the dead GTF format. GFF seemed to have great uptake and seemed to be squeezing out GTF. Then tophat (version 1) was released and it required the GTF format. It is hard to convey how influential tophat was just a few years ago. Today tophat 1 is obsolete, but the resurrected format lives on.
Sat, 09 Mar 2019 16:47:37 +0000 368497
Comment: C: Finding Reads from a DNA Virus
https://www.biostars.org/p/366209/#368232
I was not aware of that, thanks for the information.
Fri, 08 Mar 2019 02:31:09 +0000 368232
Comment: C: need help with script for hisat2 alignment
https://www.biostars.org/p/366924/#366934
A more appropriate course of action is to follow up with the answer; that way you can help someone else who may run into a similar problem. That's the entire purpose of the site. Imagine if every post with a solution were deleted.
Fri, 01 Mar 2019 18:36:03 +0000 366934
Comment: C: How did Dobin et al. create these figures?
https://www.biostars.org/p/366414/#366464
I would ask the authors for this information. Unfortunately, data distribution, especially of intermediate data and of the data that the plots rely on, is not a priority.
My guess is that they counted the reads that overlap with each exon junction coordinate. For example, using bedtools you could create 1 bp flanking regions for each exon, then select all the alignments that overlap with these coordinates.
Wed, 27 Feb 2019 17:40:57 +0000 366464
Answer: A: Finding Reads from a DNA Virus
https://www.biostars.org/p/366209/#366271
Since the DNA is covered relatively evenly, whereas the RNA coverage depends on the copy number of each transcript, detecting from DNA ought to work more effectively, especially if you can pinpoint the location of the insertion.
Another way to say this is that the amount of viral RNA in the sample is probably at trace levels, whereas the DNA should end up at the same coverage as all the other DNA.
Wed, 27 Feb 2019 00:23:23 +0000 366271
Comment: C: Evolution of Biostars
https://www.biostars.org/p/365738/#365800
Actually, the title should be New Posts per year.
Mon, 25 Feb 2019 04:00:35 +0000 365800
Answer: A: Evolution of Biostars
https://www.biostars.org/p/365738/#365799
Made a chart with total posts (question+answer+comment) for each year:
![enter image description here][1]
[1]: https://i.imgur.com/nJRaKE9.png
Mon, 25 Feb 2019 03:54:26 +0000 365799
Comment: C: Evolution of Biostars
https://www.biostars.org/p/365738/#365796
Turns out that if the traffic were a "country", we'd be the 83rd most populous country, right between Greece and Bolivia.
Mon, 25 Feb 2019 03:38:16 +0000 365796
Comment: C: why is variant calling difficult?
https://www.biostars.org/p/365706/#365766
The length of the read does not help in the case of misleading alignments. It is the "mathematics" that is broken, or shall we say "overly simplistic", and not all that well suited to biology.
The alignments in question will be mathematically accurate, just biologically "wrong".
Sun, 24 Feb 2019 17:58:02 +0000 365766
Answer: A: why is variant calling difficult?
https://www.biostars.org/p/365706/#365765
Variant calling is difficult because alignments are mathematical constructs that work by maximizing various predetermined rewards and penalties, and by doing so imply that the simplest "explanation" is correct. Biology does not work this way. I have a chapter called [Misleading alignments][mis] in the Biostar Handbook that I will summarize below:
Imagine that the sequence below is subjected to two insertions of Cs at the locations indicated with carets:
CCAAACCCCCCCTCCCCCGCTTC
     ^       ^
The two sequences, when placed next to one another, would look like this:
CCAAACCCCCCCTCCCCCGCTTC
CCAAACCCCCCCCTCCCCCCGCTTC
A better way to visualize what’s happening in this example is shown in the expected alignment that would reflect the changes that we have introduced:
CCAAA-CCCCCCCT-CCCCCGCTTC
||||| |||||||| ||||||||||
CCAAACCCCCCCCTCCCCCCGCTTC
Now suppose we did not know what the changes were. Can we discover the variation that we introduced by using a global aligner? Let’s see:
global-align.sh CCAAACCCCCCCTCCCCCGCTTC CCAAACCCCCCCCTCCCCCCGCTTC
Here is what we obtain:
CCAAACCCCCCC--TCCCCCGCTTC
||||||||||||  .||||||||||
CCAAACCCCCCCCTCCCCCCGCTTC
The variation indicated by the alignment is different: instead of the two insertions of `C`, it shows one insertion of `CT` followed by a mismatch. Adding insult to injury, the variation is also shown to take place at a completely different location.
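To see why an aligner can prefer the second alignment, it helps to score both by hand. The sketch below assumes an affine gap scoring scheme (match +5, mismatch -4, gap open -10, gap extend -0.5); these values are chosen for illustration and are not necessarily what `global-align.sh` uses, and the `alignment_score` helper is hypothetical.

```python
def alignment_score(top, bottom, match=5, mismatch=-4, gap_open=-10, gap_extend=-0.5):
    """Score a pairwise alignment with affine gap penalties."""
    score = 0.0
    gap_row = None  # row currently inside a gap: "top", "bottom" or None
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            row = "top" if a == "-" else "bottom"
            # Opening a gap is expensive; extending an open gap is cheap.
            score += gap_extend if gap_row == row else gap_open
            gap_row = row
        else:
            score += match if a == b else mismatch
            gap_row = None
    return score


target = "CCAAACCCCCCCCTCCCCCCGCTTC"

# The alignment that reflects the two insertions that were introduced.
expected_score = alignment_score("CCAAA-CCCCCCCT-CCCCCGCTTC", target)

# The alignment reported by the global aligner.
found_score = alignment_score("CCAAACCCCCCC--TCCCCCGCTTC", target)

print(expected_score, found_score)  # 95.0 95.5
```

Under these assumed penalties, one longer gap plus a mismatch costs slightly less than two separate gap openings, so the mathematically best alignment is not the one that describes what actually happened to the sequence.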
You can see how diffi ...
Sun, 24 Feb 2019 17:50:01 +0000 365765
Answer: A: Evolution of Biostars
https://www.biostars.org/p/365738/#365763
Here are some traffic data over the last five years:
* 11 million users
* 57 million pageviews
PS. The total number of posts (including comments/answers) per year would also be an interesting plot to make.
![enter image description here][1]
[1]: https://i.imgur.com/oPl1e1Q.png
Sun, 24 Feb 2019 17:29:04 +0000 365763
Comment: C: Bioinformatics word cloud to use in classes
https://www.biostars.org/p/365479/#365528
Another word cloud, this time based on the words in the 1000 most highly voted post titles:
* <http://data.biostarhandbook.com/data/biostar-most-voted-question-titles.txt>
and using the https://wordart.com service.
![enter image description here][1]
[1]: https://i.imgur.com/2NTSDYs.jpg
Fri, 22 Feb 2019 16:34:14 +0000 365528
Comment: C: Bioinformatics word cloud to use in classes
https://www.biostars.org/p/365479/#365517
What tool creates the cloud itself? It has cool-looking styling; what parameters does it need to make it look like that? Now that I have played a bit with word clouds, I think that figuring out the right styling is a separate challenge of its own.
Fri, 22 Feb 2019 15:37:40 +0000 365517
Comment: C: Bioinformatics word cloud to use in classes
https://www.biostars.org/p/365479/#365515
Looks like R is missing, the second most common tag, probably because it is one letter long.
Fri, 22 Feb 2019 15:31:57 +0000 365515
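The one-letter guess above is easy to illustrate. The sketch below assumes that the word cloud tool drops words below some minimum length, which is an assumption and not documented behavior of any particular service; the `word_counts` helper and the sample titles are made up for the example.

```python
import re
from collections import Counter


def word_counts(titles, min_length=1):
    """Count lowercase words across titles, skipping words shorter than min_length."""
    counts = Counter()
    for title in titles:
        for word in re.findall(r"[A-Za-z0-9]+", title.lower()):
            if len(word) >= min_length:
                counts[word] += 1
    return counts


# Made-up titles, for illustration only.
titles = [
    "How to plot coverage in R",
    "R script for differential expression",
    "Convert BAM to fastq",
]

print(word_counts(titles)["r"])                # 2
print(word_counts(titles, min_length=2)["r"])  # 0
```

With a minimum length of two, a frequent one-letter term such as R disappears from the counts entirely, no matter how common it is.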