Introduces the commands that you need to manage and analyze directories, files, and large sets of genomic data. This is the fourth course in the Genomic Big Data Science Specialization from Johns Hopkins University.

Taught By

Liliana Florea, PhD

Assistant Professor

Transcript

In this section we will be looking at how we can compare the content of two files in Unix. In particular, we will be looking at the commands diff and comm. Let me go straight there. So we are in our plans directory, in which, just to remind you, we have one directory containing information for each of the tree species: apple, pear, and peach. And then we have a number of files that we will use to illustrate some of these concepts and commands.

The first command that we're going to look at is diff, which allows us to compare files line by line. For the purposes of this illustration, I would like to show you the file orchard: more orchard. And I quote: "Under the evening stars I walk, the smell of cool damp air beneath my feet, on a summer night in the orchard." Tiger Army, "In the Orchard". So these are not my lyrics. And now let's make a slightly modified version of this file. So I'm going to just copy the file orchard into orchard.1. This, for instance, might happen when you have to make changes to a particular text file, or when you have a program that you modify slightly, and now you have slightly different versions of the output that you would like to compare. So we're going to try to illustrate it here. Again, I'm going to invoke my editor, and let's say that I don't want the second line. So now the second line is gone, and I'm going to save it. You saw the content of the first file, which is four lines, and now let's take a look at the newly created file, which was just slightly edited: "Under the evening stars I walk, on a summer night in the orchard," by Tiger Army, "In the Orchard". So we're missing the second line.

Let's do a diff on the two files: diff orchard orchard.1. So the command diff is going to tell us all the differences between those two files, along with their location and type, that is, the location within the first file and within the second file.
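The steps narrated so far can be reproduced roughly as follows. This is a minimal sketch, not the instructor's exact session: the line breaks and wording of the orchard file are reconstructed from the narration, the scratch directory is an assumption, and sed stands in for the interactive editor.

```shell
# Work in a scratch directory so no existing files are touched (assumption).
cd "$(mktemp -d)"

# orchard: the four-line file read aloud in the lecture (line breaks assumed).
printf '%s\n' \
  'Under the evening stars I walk' \
  'The smell of cool damp air beneath my feet' \
  'On a summer night in the orchard' \
  'Tiger Army, In the Orchard' > orchard

# Make a copy, then rewrite the copy without line 2
# (sed stands in for editing the file by hand).
cp orchard orchard.1
sed '2d' orchard > orchard.1

# diff exits with status 1 when the files differ, so tolerate that here.
diff orchard orchard.1 || true
# 2d1
# < The smell of cool damp air beneath my feet
```

The header 2d1 reads as: line 2 of the first file was deleted, and the files are back in step after line 1 of the second file.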
So here this indicator tells us that line number two in the first file was deleted, right after line number one in the second file. Then it shows us the line, and at the beginning of the line it shows us the less-than sign, which is used to mark all the changes as they occur in the first file. And indeed, as you might recall, we simply removed the second line of the file orchard.

Now let's make a few more changes in orchard.1 to illustrate some other types of changes. So let's say that instead of "Under the evening stars I walk", I'm going to say "Under the bright night stars I walk", okay? So now, as you may recall, this file differs from the first one in the first line, where we changed "evening" to "bright night", and by not having the second line, the second verse. And let's again invoke diff orchard orchard.1. So we have a longer report now, because we have more changes, more differences. So now it shows us that the minimum number of changes that you can use to transform the first file into the second one is to simply change, that's the c, lines one and two into line one in the second file. So "Under the evening stars I walk, the smell of cool damp air beneath my feet" becomes "Under the bright night stars I walk", and so on. Just as a parenthesis here, I should say that the algorithms that were used to implement the diff method are the ones that also started the field of sequence alignment for computational biology. So there you have it.

I'm now going to move on to illustrate the next command, which is comm. So comm simply selects or rejects lines common to two files, and for the purposes of illustrating this, first I'm going to show you the sample files for our species. So apple, apple.samples: root, leaf, and fruit. And let's look at pear, and then peach. So one practical question might be: what is the difference between the samples that we have from apple versus the samples that we have from pear?
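Returning to the second diff run described above (first line reworded, second line removed), a minimal sketch is given here. The file contents are again reconstructed from the narration, and sed is used in place of the interactive editor; both are assumptions.

```shell
# Scratch directory (assumption) so no existing files are touched.
cd "$(mktemp -d)"

# orchard: the original four lines (line breaks assumed from the narration).
printf '%s\n' \
  'Under the evening stars I walk' \
  'The smell of cool damp air beneath my feet' \
  'On a summer night in the orchard' \
  'Tiger Army, In the Orchard' > orchard

# orchard.1: drop line 2 and reword line 1 ("evening" becomes "bright night").
sed -e '2d' -e '1s/evening/bright night/' orchard > orchard.1

# diff exits with status 1 when the files differ, so tolerate that here.
diff orchard orchard.1 || true
# 1,2c1
# < Under the evening stars I walk
# < The smell of cool damp air beneath my feet
# ---
# > Under the bright night stars I walk
```

Lines marked with < come from the first file and lines marked with > come from the second; 1,2c1 says that lines 1 to 2 of the first file were changed into line 1 of the second.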
So from apple we have root, leaf, and fruit, and we have flower, leaf, and root for pear. So first we're going to just type comm, followed by the names of the two files that we want compared. And that shows the output. Okay, so if you're looking at the output, just the way it is structured, you're going to see that it has three columns, and each line of the files is shown in one of the columns. So if the line appears only in the first file, it's going to be listed in column one. If the line appears only in the second file, it's going to be listed in the second column. And if it appears in both of them, it's going to be listed in the third. So according to this output, root is the only organ that has been sampled in both apple and pear.

However, you might be looking at this output and ask me why. It appears that leaf is listed individually for apple and for pear, so why was it not reported as being common? And here's the caveat: in order to be able to apply comm, the files need to be sorted. So, the lines have to be sorted; they have to appear in the same order. So, let's do that first. You might remember the command sort. So, we're going to sort apple.samples and save the result as apple.samples.sorted, and let's put that in the apple directory as well. And we'll do the same with pear. We're just modifying the previous command. So we're sorting the pear.samples file in the pear directory, and we're saving that as the pear.samples.sorted file in that same directory.

And now let's apply the command comm to our sorted versions. And now you can see that fruit appears only in the first file, only in apple. So, fruit was only sampled from apple. Flower was only sampled from pear. And leaf and root were sampled from both apple and pear. So, this makes sense. There are several options that allow us to simply carve out and see only one of these columns.
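The sort-then-compare steps above can be sketched as follows. The file contents and their order are reconstructed from the narration, and the scratch directory and the apple/ and pear/ paths are assumptions about the layout.

```shell
# Scratch directory (assumption) with one subdirectory per species.
cd "$(mktemp -d)"
mkdir -p apple pear

# Sample files in the unsorted order read out in the lecture (assumed).
printf '%s\n' root leaf fruit  > apple/apple.samples
printf '%s\n' flower leaf root > pear/pear.samples

# comm expects both inputs to be sorted, so sort each file first.
sort apple/apple.samples > apple/apple.samples.sorted
sort pear/pear.samples   > pear/pear.samples.sorted

# Three tab-separated columns:
#   column 1 = only in the first file   (fruit)
#   column 2 = only in the second file  (flower)
#   column 3 = in both files            (leaf, root)
comm apple/apple.samples.sorted pear/pear.samples.sorted
```

Running comm directly on the unsorted files gives the misleading result described in the narration, where leaf is not reported as common; some implementations also print a warning that the input is not in sorted order.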
So, let's apply the same command, comm, and now let's say that we only want to see how many organs were sampled in both of these. So the way to do that is to simply ignore the lines that appear only in the first file, which is -1, and to ignore the lines that appear only in the second file, -2. And that gives us leaf and root. Of course, we can, for instance, pipe this through a command such as wc -l, so we can get the number of such organs, or tissues, if you wish. And that's two.

We can also use these command-line arguments, -1, -2, and -3, in different combinations to obtain those tissues, or those lines, that are represented only in the first file or only in the second file. So, for instance, if we want to see only the lines that appear in the apple file, then we should ignore those lines that appear only in the second file, the pear file, and those that appear in both. So -2, -3, and that gives us fruit, which is only sampled in apple. And similarly, we can ignore those lines that appear only in the first file, that's -1, or that appear in both files, -3. And that will give us flower. So flower is the line that is specific to the file pear.samples, or pear.samples.sorted.

So, these two commands, diff and comm, can be used effectively to compare two files in Unix. Which concludes this section.
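The column-suppressing options described above can be sketched as follows. The sorted file contents are reconstructed from the narration; the scratch directory, the paths, and the variable names A and P are assumptions introduced only for this illustration.

```shell
# Scratch directory (assumption) with the already-sorted sample files.
cd "$(mktemp -d)"
mkdir -p apple pear
printf '%s\n' fruit leaf root  > apple/apple.samples.sorted
printf '%s\n' flower leaf root > pear/pear.samples.sorted

# Hypothetical shorthand for the two file names.
A=apple/apple.samples.sorted
P=pear/pear.samples.sorted

# Suppress columns 1 and 2: keep only lines common to both files.
comm -12 "$A" "$P"            # leaf, root

# Count the common lines.
comm -12 "$A" "$P" | wc -l    # 2

# Suppress columns 2 and 3: lines found only in the apple file.
comm -23 "$A" "$P"            # fruit

# Suppress columns 1 and 3: lines found only in the pear file.
comm -13 "$A" "$P"            # flower
```

Each digit names a column to hide, so the remaining column is printed without leading tabs and can be piped straight into other commands.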
