I tried to use wildcards (i.e. 1:*:*:ACAGTG, 1:*:**:ACAGTG, 1:*:***:ACAGTG, 1:*:****:ACAGTG in four different sed commands) but it didn't work. Any ideas how I can replace them all? There could be about a hundred variations of the numbers in about 30 million entries per file and I don't want to replace them individually.

jthill

11-22-2011 07:23 PM

I think you're confusing the shell's wildcard * with regular-expressions' 0-n repeat * operator. Your 1:*:*:ACAGTG (etc.) specifies a 1 followed by any number of colons followed by any other number of colons followed by a single colon .... but nowhere in there are you searching for anything between the colons. The only text your expression can match is " 1" followed by at least one colon followed by ACAGTG.

What I think you want is sed -i 's, 1:[^:]*:[^:]*:ACAGTG,/1,g'. That'll match " 1:Flew:OvertheCuckoo'sNest:ACAGTG" so you may want to hunt up how to restrict the matches a bit better.
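To see the difference between the two patterns, here is a minimal sketch. The read ID is made up for illustration (only the " 1:...:...:ACAGTG" tail shape comes from the thread), and the third, tighter pattern is just one possible restriction, assuming the first field is a single capital letter and the second is a number.

```shell
# Hypothetical read ID; only the shape of the tail is taken from the thread.
id='@READ42 1:N:0:ACAGTG'

# Original attempt: each ":*" means "zero or more colons", so the text
# between the colons can never match and the line is left unchanged.
attempt=$(printf '%s\n' "$id" | sed 's, 1:*:*:ACAGTG,/1,g')

# Suggested fix: "[^:]*" matches any run of non-colon characters.
fixed=$(printf '%s\n' "$id" | sed 's, 1:[^:]*:[^:]*:ACAGTG,/1,g')

# A tighter variant (assumption: one capital letter, then digits) that
# no longer matches arbitrary text between the colons.
odd=" 1:Flew:OvertheCuckoo'sNest:ACAGTG"
strict=$(printf '%s\n' "$odd" | sed 's, 1:[A-Z]:[0-9][0-9]*:ACAGTG,/1,g')

printf 'attempt: %s\n' "$attempt"
printf 'fixed:   %s\n' "$fixed"
printf 'strict:  %s\n' "$strict"
```

The first command prints the ID untouched, the second prints it with the tail replaced by /1, and the third leaves the odd line alone.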

Lokelo

11-22-2011 07:36 PM

That's great, thanks!

I'm pretty sure that the start and end are quite unique and there is no need to restrict the matches better.

Edit: actually, if I run it in a script it seems to work, but I have only tested the first command so far.

Hi,

the ! runs the last command that matches the following letters (history expansion). Example:

Code:

$ echo hello
hello
$ !ec
echo hello
hello
$

If you run the above 'sed' with double-quotes instead of single-quotes then bash will try to match a command that you issued earlier that starts with
/1g

Since there is no such command it gives the 'event not found' error. Are you sure that you used single-quotes? Those should prevent this kind of error.

Lokelo

11-23-2011 09:15 AM

Thanks for all the answers and sorry about the double posting of my question.

I couldn't get the version of sed. I'm working remotely on a high performance computer and sed -V (or -v) didn't come up with a version number.
I just started working with Linux a month ago, so I'm not 100% sure yet what I'm doing all the time.

The lines I posted above containing the error message were directly copy/pasted from the terminal.

Crts, you said if I run it with double quotes it will try to redo a previous command, but I ran it with single quotes.

The thing is, it runs perfectly fine if I use it within a script (see below), but I got the error message when using them directly in the terminal.

I only tried the second command once in a script, which didn't work. But since the command above worked I didn't follow it up further, although the other command is a bit more elegant.

grail

11-23-2011 09:28 AM

No short option for version so you would need --version.

I would ask, does the above actually work, ie have the changes been made in the file(s)?

Are you pushing all the commands into the background because the file(s) are so large?

Lokelo

11-23-2011 10:14 AM

The files are about 5 Gb each. Am I correct in the understanding that if I didn't use &, they would just be carried out in sequence? And with the & they are done in parallel?

The above works and the changes were made in the files. I checked by grepping the adaptor sequences (i.e. the strings of A, C, T and G at the end of the ID) and the searches came up with nothing. Each file has about 30 million entries containing the four lines shown above. That's why I didn't find the variation in the numbers until I learned about grep, since they are comparably rare.
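On the & question: yes, without & the shell waits for each command to finish before starting the next one, and with & each command is started in the background so they run side by side. A minimal sketch of that, plus the grep check, on two tiny stand-in files. The file names and read IDs are invented for the example, and it writes to new files instead of using sed -i so the inputs stay untouched.

```shell
# Scratch directory with two tiny stand-in files (names and IDs are made up).
tmpdir=$(mktemp -d)
printf '@r1 1:N:0:ACAGTG\n' > "$tmpdir/a.fastq"
printf '@r2 1:Y:12:ACAGTG\n' > "$tmpdir/b.fastq"

# "&" starts each sed in the background, so both run at the same time.
for f in "$tmpdir/a.fastq" "$tmpdir/b.fastq"; do
  sed 's, 1:[^:]*:[^:]*:ACAGTG,/1,g' "$f" > "$f.out" &
done

# "wait" blocks until every background job of this shell has finished.
wait

out_a=$(cat "$tmpdir/a.fastq.out")
out_b=$(cat "$tmpdir/b.fastq.out")

# Count lines that still contain the adaptor sequence; 0 means none are left.
leftover=$(cat "$tmpdir/a.fastq.out" "$tmpdir/b.fastq.out" | grep -c 'ACAGTG')

printf '%s\n%s\nleftover: %s\n' "$out_a" "$out_b" "$leftover"
rm -rf "$tmpdir"
```

In a script, the wait matters: without it the script can move on (or exit) while the background jobs are still writing.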

I have a limited time to assemble my transcriptomes from scratch and, as much as I would love to read up on everything in detail, I'm just focussing on what I need for the time being. Hence I use a lot of copying with just enough understanding to make it work.
However, I'm highly fascinated by this experience (when I was 16 I had the choice to go into chemistry or IT, and chemistry won, even though I still like using computers on a higher level) and certainly will get as much Linux knowledge as I can over time.

I will get the version number tomorrow morning.

jthill

11-23-2011 11:40 AM

No need for the version number: crts nailed it. Lose the bangs ("!").

Instead of s!this!that!g use s,this,that,g or s`this`that`g or whatever.

I like commas or backticks myself, they make a visible break. Until you have time to get better acquainted with shell syntax and its interactive assists, get in the habit of single-quoting any argument that has anything but alphanumerics or +-_/,. You'll gradually find more safe ones, but bang ("!") is high-priority metasyntax for interactively constructing command lines, fast, from pieces of earlier ones.
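A quick sketch showing that the delimiter is purely a matter of choice: sed takes whatever character follows the s as the separator. The sample text is made up. Run from a script, all three forms behave the same; only the ! form is at risk when typed at an interactive bash prompt.

```shell
# Made-up sample text.
line='this and this'

# Same substitution with three different delimiters, all single-quoted.
with_bang=$(printf '%s\n' "$line" | sed 's!this!that!g')
with_comma=$(printf '%s\n' "$line" | sed 's,this,that,g')
with_tick=$(printf '%s\n' "$line" | sed 's`this`that`g')

printf '%s\n%s\n%s\n' "$with_bang" "$with_comma" "$with_tick"
```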

gtg, sorry if this was too elliptical, happy thanksgiving,
Jim

crts

11-23-2011 11:56 AM

Quote:

Originally Posted by Lokelo
(Post 4531703)

I couldn't get the version of sed. I'm working remotely on a high performance computer and sed -V (or -v) didn't come up with a version number.
I just started working with Linux a month ago, so I'm not 100% sure yet what I'm doing all the time.

The lines I posted above containing the error message were directly copy/pasted from the terminal.

Crts, you said if I run it with double quotes it will try to redo a previous command, but I ran it with single quotes.

The thing is, it runs perfectly fine if I use it within a script (see below), but I got the error message when using them directly in the terminal.

Hmm, this is strange. But since you are working remotely I wonder which shell you are using on the remote system.
Another alternative would be to deactivate history expansion. In bash you can do it with

Code:

set +H

It works inside a script because history expansion is only enabled by default in interactive shells; a script runs in a non-interactive shell, where the ! is left alone.
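If you want to check which situation you are in, bash lists its active option flags in the special parameter $-, and an H in there means history expansion is on. A small sketch (the variable name histexpand is just for this example):

```shell
# "$-" holds the current shell option flags; in bash, "H" means
# history expansion ("!" handling) is enabled.
case $- in
  *H*) histexpand=on ;;
  *)   histexpand=off ;;
esac

echo "history expansion: $histexpand"
```

Typed at an interactive bash prompt this normally reports on, and off after set +H; run from a script it normally reports off.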

Can you post the name of the system and the shell you are logged in to? And how do you log in (ssh, telnet ...)? This is definitely not a 'sed' issue.

Lokelo

11-23-2011 07:15 PM

That seems to have done the trick. Thanks for all your help.
I agree, I like the look of commas. I will make sure that I acknowledge this forum in my thesis for your continued help!