Reading a tab-separated file and pulling specific columns into a new file

I have a text file with columns separated by tabs. I need to go through it, find the columns whose headers match what I need (I know the names of the columns I need), then take those columns and put them in a new file.

So for example

ColumnA    ColumnB    ColumnC    ColumnD
Data       Data       Data       Data

I need to pull the headers for ColumnA and ColumnD, along with the data in those columns, and put them in a new file. How would I go about doing this? I have managed to open the file in Perl, and I can create a file, but other than that I can't seem to figure out how to go through the columns and get the ones I need.

I'm using ActivePerl on Windows right now, but am trying to switch to Linux.

But instead of writing the proper data to the file, it hangs, writing endless streams of data into the text file. I got it to work correctly by copying the data from the file into the __DATA__ section of the original script, but for obvious reasons this is not the solution. What am I doing wrong?

I think it might be because the file has two lines before the header. How would I skip the first two lines and then read the file?
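For what it's worth, here is a minimal sketch of one way to skip a fixed number of leading lines, assuming there are always exactly two junk lines before the header (the __DATA__ section here just stands in for the real file):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# assumption: there are always exactly two junk lines before the header
for (1 .. 2) {
    my $skip = <DATA>;    # read one leading line and throw it away
}
my $header = <DATA>;      # this is now the real header line
print $header;

__DATA__
first junk line
second junk line
ColumnA ColumnB ColumnC ColumnD
Data Data Data Data
```

The same two-reads-and-discard idea works on any filehandle, not just DATA.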

EDIT: When I try and read the file, I get "Uninitialized value $_ used in scalar chomp"
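For reference, that warning usually means chomp was called with no argument (so it operates on $_) at a point where $_ was never set, for example inside a loop that reads into a named variable. A sketch of the difference (the variable names are just illustrative):

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @lines = ("one\n", "two\n");
while (defined(my $line = shift @lines)) {
    # chomp;        # wrong here: chomp with no argument works on $_,
    #               # which was never set, hence the warning
    chomp $line;    # right: chomp the variable the line was read into
    print "$line\n";
}
```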

The only issue is that some of the other files I have to read are just a little bit different (Acc as opposed to ACC#, or even ACCESSION). Can I have the code that looks for the column do a partial match, or do I have to write separate code for each one? It wouldn't be too hard to write separate scripts and then have a bootstrapper that calls the right one depending on the file name (if I can determine the file name).
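A sketch of how a partial match could work with a case-insensitive regex (the header names here are just example placeholders):

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @field_names = ('ID', 'ACC#', 'Description');   # example headers
# find the index of the first header starting with "acc", in any case,
# so Acc, ACC#, and ACCESSION would all match
my ($acc_index) = grep { $field_names[$_] =~ /^acc/i } 0 .. $#field_names;
print "accession column is at index $acc_index\n";
```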

Thanks for all the help you have given me. I'm new to Perl, coming from C#, so it's a bit hard.

Avoid doing that unless you have a very good reason to: there is no point, in your case, in storing your whole file in an array when you can simply iterate over the lines one by one. You are needlessly using a lot of memory, and you also lose performance by copying every piece of data twice. If your file gets really big, you'll simply run out of memory.

But if you really have to load your file into an array (sometimes you do, because you need to go back and forth in the data, for example, or update it several times until you are ready to print it out, or you need to sort it before proceeding), then try to do it in a more Perlish way:

Perl Code:

my @lines = <FH>;

In your case, only the first line (the header line) deserves special processing, because you need to parse it to figure out which fields you will keep in the rest of the data; after that, you just go through the lines one by one.

So it could be:

Perl Code:

my $header = <FH>;   # get the first line
my @fields_names = split /\t/, $header;
# ...

Then you need a bit of work, with a simple or nested foreach or map, to list in an array the fields that you need to keep or print out. There are numerous ways to do that; I'll leave it to you to find out (but please don't hesitate to ask if you don't succeed).
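For example, one minimal sketch (the header names and the wanted names below are placeholders; adapt them to your file):

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @field_names = ('ColumnA', 'ColumnB', 'ColumnC', 'ColumnD');
my @wanted      = ('ColumnA', 'ColumnD');   # the columns to keep

# collect the index of every header that appears in @wanted
my @col_to_keep;
for my $i (0 .. $#field_names) {
    push @col_to_keep, $i if grep { $_ eq $field_names[$i] } @wanted;
}
print "@col_to_keep\n";   # the subscripts of the wanted columns
```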

Once you have figured out the array of subscripts of the fields you need, say @col_to_keep, which could be something like (1, 4, 3), just read the rest of the file and do something like:

Perl Code:

while (my $line = <FH>) {
    my @splitted_line = split /\t/, $line;
    my @output = @splitted_line[@col_to_keep];
    print OUT "@output";
}

Of course, I won't come back to the error of using a while on the array pointed out by Keath. Or maybe I will: you could do it if your loop removed the read element from the array (for example with a shift). For example,

Perl Code:

while (my $line = shift @lines) {
    # do something with $line
}

should work fine, because shift will progressively deplete @lines until it becomes empty, at which point the while condition will fail and the loop will stop.

But the really good way of scanning through the elements of the array is:

Perl Code:

foreach (@lines) {
    # do something with $_
}

or

Perl Code:

for (@lines) {
    # do something with $_
}

(for and foreach are exactly equivalent in this context, but I personally tend to prefer foreach because it conveys quite well, in English, the idea that you will visit every element once. But I also use for, especially when, for some reason, I wish to have a slightly more compact syntax.)

Alright. Sorry for leading you astray with that hash and header stuff. I like to use any data given to me, and I like to name my fields when possible.

If the position of the wanted columns is consistent across files, but the names are different, you can use numbers instead and an array slice to get the data you want.

Code:

#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;

# make a list of the fields you want
# first field is index 0
my @wanted_fields = qw/0 2/;

# no need to open the file listed as an argument;
# it will be read via the magic <> operator

# create new file for writing
open my $outfh, ">", 'Output.txt' or die "Cannot create output file: $!";

# pull the first line (header) out of the file; not using it
my $header = <>;

while (<>) {
    chomp;
    my @row = split /\t/;
    #print Dumper \@row;
    print $outfh join("\t", @row[@wanted_fields]), "\n";
}

Haha, you're fine. I appreciate the help. Like I said, I managed to get it working with some tweaks. I think the files have the same header order, but I'm not sure, and I wouldn't know without looking at the files first. This script is going to run on a computing cluster and should be automated, so hopefully the header order is the same. Again, thanks for the help and tips.