Note: Presentation of the class paper will be arranged sometime in early April 2006. The final exam will be held sometime in the week of April 24 to 28, 2006.

Instructor:

Shekhar Joshi (C. P. Joshi),

Associate Professor of Plant Molecular Genetics, SFRES

Room 168, Forestry, Phone: 487-3480 (cpjoshi@mtu.edu)

Office hours: 9 am to 6 pm except when I teach this class!

Teaching assistants:

Shiv T. and Frank Xu

(FMGB Graduate students)

Course Description

The main purpose of this course is to provide extensive hands-on experience in using a variety of bioinformatics tools, and in future you could extrapolate that knowledge to other fields of biology such as genomics, molecular phylogenetics, and biotechnology. You will not write bioinformatics programs but will use the available ones for extensive sequence analysis.

Why was this course proposed?

A number of sequence analysis packages and databases are currently available from commercial sources as well as public web sites. In our day-to-day molecular biology research, we use some of these programs and databases to analyze the significance of the new genetic information that we obtain. But it is not always easy to choose the correct approach or appropriate tool. Databases are growing at a very fast pace and new questions are constantly popping up. Moreover, genomics is a new and exciting field of biotechnology that has recently witnessed many conceptual and technical advances. The ability to make sense of this information explosion will make our students more competitive in the current job markets in academia and industry. There is no doubt that this knowledge will be extremely valuable for living in this century.

FW4089/5089

Tools of Bioinformatics

GENERAL TEXTBOOKS (Optional Reading material)

1) Genes VII, Benjamin Lewin, 2000, Oxford University Press
2) Molecular Biology, Robert F. Weaver, 1999, McGraw-Hill Press
3) Bioinformatics, David W. Mount, 2001, CSH Press

All these books will provide only supplemental material for the course and may be available at the MTU Book Store or in the library.

Reading materials for the topics being covered in the class will be provided.

Although there is no specific prerequisite for this class, it is advisable to have taken at least one of the following and have some background in genomics and bioinformatics:

Some internet addresses where Bioinformatics information is available:

National Center for Biotechnology Information (GenBank): http://www.ncbi.nlm.nih.gov/

Genetics Computer Group: http://www.GCG.com

Protein analysis: http://www.expasy.ch

Celera Genomics: http://www.celera.com

GRADING SYSTEM

Grade Scale

100-95   = A    Excellent
94-90    = AB   Very Good
89-85    = B    Good
84-80    = BC   Above Average
79-75    = C    Average
74-70    = CD   Below Average
69-60    = D    Inferior
Below 60 = F    Failure

Course Points

Homework, quizzes, etc. = 30%
Mid-term Exam 1 = 30%
Final Exam = 30%
Class Participation = 10%

Exams: The midterm and cumulative final exams will be worth 100 points.

Class Paper = One Credit for FW5089
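
If you want to check where you stand, the weighting and the grade scale above can be sketched in a few lines of shell. This is only an illustration, not the official grade calculation: the function names course_score and letter_grade are made up here, scores are assumed to be whole percentages, integer division rounds down, and treating a score of exactly 60 as a D is an assumption.

```shell
# Weighted course score from four percentages (0-100):
# $1 homework/quizzes, $2 mid-term, $3 final exam, $4 class participation.
course_score() {
  echo $(( ($1 * 30 + $2 * 30 + $3 * 30 + $4 * 10) / 100 ))
}

# Map a whole-number score onto the letter scale (hypothetical helper).
letter_grade() {
  score=$1
  if   [ "$score" -ge 95 ]; then echo A
  elif [ "$score" -ge 90 ]; then echo AB
  elif [ "$score" -ge 85 ]; then echo B
  elif [ "$score" -ge 80 ]; then echo BC
  elif [ "$score" -ge 75 ]; then echo C
  elif [ "$score" -ge 70 ]; then echo CD
  elif [ "$score" -ge 60 ]; then echo D    # assumption: exactly 60 counts as D
  else echo F
  fi
}

total=$(course_score 90 80 85 100)
echo "$total $(letter_grade "$total")"
```

With these made-up marks the weighted sum is 86.5, which the integer arithmetic truncates to 86, so the sketch reports a B.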

Jobs! Jobs! Jobs!

Current Job trends: http://www.sloan.org/programs/scitech_page1.html

Jobs in Genomics: http://www.genomejobs.com

See also Science and Nature for Job ads.

Bioinformatics is a young science, but the information explosion has demanded more people in academia and industry. It is easy to find either a molecular biologist or a computer scientist, but the job of a bioinformatician needs both. A biologist who can compute and a computer scientist who can make sense of biological data are hot commodities.

Supply and demand!

This is what I heard but do not quote me anywhere!

MS in Bioinformatics: 60-100K

Ph.D. in Bioinformatics: 80-100K or higher

Not all CS people find that money attractive! But those who are interested in the topic do very well in this field. Biologists face new challenges and questions every day, and CS is providing the answers. True collaboration!

Having this course listed in your CVs will help in your job prospects.

http://www.bio.mtu.edu/campbell/bl4820/intro/plagiarism.htm

Plagiarism - What It Is and How to Avoid It!

Adapted from Notes prepared by Ron Gratz

Scientists do not work in isolation from each other. Attendance at scientific meetings exposes us to the work of our colleagues and allows for the free exchange of ideas. Reading the published literature in our fields is vital for all scientists, who must keep themselves current with what is being done in other laboratories. Scientists continually refer to the work of their colleagues and most scientific research is based at least in part on ideas derived from others. Review articles and textbooks are often wholly based on already published work. It is thus necessary for you as developing scientists to learn how to properly use previously reported knowledge.

While a free flow of ideas and information is vital to scientific progress, it also presents avenues for fraud, particularly plagiarism. Plagiarism can be defined as "Taking the ideas from another and passing them off as one's own" (Webster's New World Dictionary) and is unacceptable under any circumstances. Despite this universal disapproval, it is one of the more common faults with student papers. In some cases, it is downright dishonesty brought on by laziness, but more often it is a lack of experience in how to properly use material taken from another source.

To avoid plagiarism you must not only properly attribute the ideas of another but must also either paraphrase what the original author said or wrote or you must enclose that person's exact words in quotation marks. To use another's exact words with attribution but without quotation marks implies that the ideas belong to the original source but that the words are your own. Besides being dishonest, copying another’s work defeats the purpose of your education. Writing about the subject you are studying is a great way to learn. Ideas become more firmly implanted in your memory if you have to think about them and then write a coherent statement using them. Copying another’s work prevents you from learning, which is the whole purpose of your education.

Whenever the words or ideas of another individual are used, proper attribution must be given. In other words, you must give credit for those ideas and words to their originator. Not to do so is a clear case of plagiarism. Plagiarism in classwork may result in a failing grade or even expulsion from the university. Plagiarism in professional work may result in dismissal from an academic position, being barred from publishing in a particular journal or from receiving funds from a particular granting agency, or even a lawsuit and criminal prosecution.

In a review article, the author attempts to summarize all of the pertinent work done in a particular field of study. The goal is generally twofold: (1) to report what has been done and what has been learned; and (2) to use this knowledge to generate general conclusions based on these previous works. The author of a review article must be able to present the cited work accurately and be able to synthesize new ideas from this work. In order to accurately represent the work of others and at the same time avoid plagiarism, the author of a review will often paraphrase the statements made in the cited work.

The problem for many students, and some professional scientists, is that they do not know how to properly paraphrase another's words. Several general rules for paraphrasing that are relevant for students learning to master this skill are:

1. You should change both the sentence structure and the non-technical terms in order to avoid plagiarism.

2. You can also avoid plagiarism by altering the sequence of subject matter within and between sentences.

3. Don't paraphrase technical terms unless you are certain of their exact meaning and can provide an exact equivalent.

4. Accredit the original author within the group of sentences using his/her work.

FW4089 and FW5089: Bioinformatics questionnaire

Your name:

ID number:

Department:

Graduate student/Undergraduate:

Name of Advisor if Graduate student:

Motivation for taking this course:

Previous experience with Unix, GCG or other sequence analysis packages

What do you expect to get out of this course?

Have you understood the problems of plagiarism? Yes No

Do you know what my office hours are? Yes No

Are you clear about grading policy? Yes No

First QUIZ of Plant Bioinformatics

Date: January 10, 2006

Write one-line answers to as many questions as possible in the next 45 minutes. Feel free to refer to books, the web, etc. This will not be counted towards your grade. I just want to know where you stand with your molecular biology background:

1. DNA stands for
2. RNA stands for
3. DNA is made up of
4. RNA is made up of
5. What is the difference between deoxyribose sugar and ribose sugar?
6. What are the different types of nitrogen bases in DNA?
7. What are the different types of nitrogen bases in RNA?
8. What is the difference between purines and pyrimidines?
9. Name two purines and three pyrimidines.
10. Which purine pairs with which pyrimidine? State the number of H bonds between each pair.
11. What are the differences between DNA and RNA?
12. What are transcription and translation?
13. What is the central dogma in molecular biology?
14. What is reverse transcription?
15. What is a prokaryote?
16. What is a eukaryote?
17. What are the differences between prokaryotes and eukaryotes?
18. What is a genome?
19. What is genomics?
20. How many genomes are present in viruses, prokaryotes, plants and animals? Where?
21. What is bioinformatics?
22. What is the biological name for humans (binomial)?
23. How big is the human genome?
24. How many chromosomes are there in a human diploid and haploid cell?
25. How are human genes arranged in the genome?
26. How many human genes are there?
27. What proportion of the human genome is made up of genes?
28. What is a gene?
29. Why are eukaryotic genes said to be split?
30. How does DNA replicate? Conservatively or semi-conservatively? What is the difference?
31. How does DNA make RNA?
32. How many types of RNA are produced in a cell?
33. How many of these RNAs are said to be protein coding?
34. What is pre-mRNA? Is it present in bacteria?
35. What are the main three steps in pre-mRNA processing?
36. What are the 5' leader and 3' trailer sequences in pre-mRNA?
37. What is the difference between exons and introns?
38. How are introns spliced off?
39. Why are introns there?
40. How is the transcription process regulated in prokaryotes?
41. How is the transcription process regulated in eukaryotes?
42. What are a TATA box and an AATAAA box?
43. What is a transcription factor?
44. Why is TFIID said to be a commitment factor?
45. What is a transcription start site?
46. What is polyadenylation? Why is it an important biological process? Is it present in bacteria?
47. Describe the process of polyadenylation.
48. Define “protein”. In what alternative forms are proteins present in a cell?
49. How many types of amino acids are typically present? Name five amino acids. What are their 3-letter and 1-letter codes?
50. How is the code present in DNA used to make proteins?
51. Do you believe that the genome is life’s instruction book? Why?
52. If you have a disease gene (what does that mean?), do you always get the disease?
53. What is a mutation? Name a few types of mutations.
54. What are the translation start and stop sites?
55. What is tRNA?
56. What is rRNA?
57. What is a ribosome?
58. What is the genetic code? Who discovered it? (Bonus)
59. Is the genetic code universal? What does it tell us about our evolution?
60. Why is the code said to be made up of triplets?
61. What is codon bias?
62. What is the wobble hypothesis?
63. Who discovered the structure of DNA?
64. What is reverse transcription? Who discovered it?
65. Do you believe that viruses are the most evolved organisms? If yes, why? If not, why not?
66. What are mitosis and meiosis?
67. What are the main steps in mitosis? How many cells are produced at the end of one cycle of mitosis?
68. What are the main steps in meiosis? How many cells are produced at the end of one cycle of meiosis?
69. What is recombination?
70. Do bacteria recombine?
71. What is DNA sequencing? Who discovered it?
72. What are dideoxynucleotides? Why are they important in sequencing?
73. How can you sequence a gene?
74. Why is a DNA sequence written on only one line when it is double stranded?
75. Which DNA strand is always denoted when writing a gene sequence?
76. How can you derive which protein a gene encodes by just looking at a gene sequence? (BONUS)

Bioinformatics and The Human Genome

The human genome is the biggest gift of science to humanity.

In 2001 we achieved something that we had only dreamed of for many years. The human genome is just the beginning of our exciting and sometimes fearful journey. Fear of the unknown lurks out there, but the promise of tomorrow is also bright and vivid.

Sequenced organisms (from Science 291, Feb 2001, p. 1178)

Organism                      Genome size   Year completed   No. of genes
H. influenzae                 1.8 Mb        1995             1,740
S. cerevisiae (yeast)         12.1 Mb       1996             6,034
C. elegans (worm)             97 Mb         1998             19,099
A. thaliana (thale cress)     100 Mb        2000             25,000
D. melanogaster (fruit fly)   180 Mb        2000             13,061
H. sapiens (human)            3000 Mb       2001             35,000-45,000

Rice… Poplar… mouse… More than 200 genomes have been sequenced and the list is ever-increasing.

The human genome was a dream for which thousands of scientists worked for over 15 years.

Celera and the HGP provided two books for the price of one. Celera achieved it in 3 years but depended heavily on public data. How did we do what we set out to do? That is what is now written in the Science and Nature articles.

What it means is still unknown.

They say that pages equivalent to 200 New York telephone books would be needed to print the 3 billion bp of genome per cell. But the Internet allows this easily.
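
To get a feel for these numbers, here is a rough back-of-the-envelope check using shell arithmetic. The 3 billion bp and the 200 books come from the text above; the 2 bits per base (enough to tell A, C, G and T apart) is an assumption added for illustration, and real sequence files are usually stored less compactly.

```shell
genome_bp=3000000000      # 3 billion base pairs, from the text
books=200                 # telephone books, from the text

# How many letters each book would have to hold.
bp_per_book=$(( genome_bp / books ))
echo "bases per book: $bp_per_book"

# Assumption: 2 bits per base, 8 bits per byte.
bytes=$(( genome_bp * 2 / 8 ))
echo "bytes at 2 bits per base: $bytes"
```

That is 15 million letters per book, and under the 2-bit assumption about 750 million bytes for the whole genome, which is why a computer or the Internet handles it far more easily than paper.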

Humans were supposed to have 100,000 genes, but it seems only 32,000 are possible.

Does that make humans less powerful or inadequate in any way?

No. “The purpose of science is to find meaningful simplicity in the midst of complexity” (Herbert Simon, Nature 409, 771, 2001). DNA structure and PCR are the best examples.

One gene works harder, at many places and many times, so less is better in that crammed nuclear space: alternative splicing.

Human proteins have the same domains as worms, but the way these domains come together is unique.

We will know one day what makes up a human.

We are all unique! All sexually reproduced organisms have their entire ensemble of genes in one organism only once. One genotype occurs only once.

There are also some surprises in the human genome!

SNPs accumulate with a specific pattern.

Regulatory CpG islands occur more in gene-rich regions than in gene-poor regions.

Transposable elements (TEs) occur in gene-poor regions.

Only 1.1-1.5% of the genome is coding, not even the 3% widely estimated earlier.

Parts of chromosome 12 in men and chromosome 16 in women are recombination prone.

Repetitive DNA is only 40-45%.

Humans share 223 genes with bacteria that are absent from the worm, fly and yeast genomes.

Did the genome duplicate early on, similar to plants?

We will know how humans develop from a zygote: ontogeny.

We will know our phylogeny by looking at ontogeny: molecular archeology.

One day we will be able to trace our evolution using the genome information.

Genealogy of the human race!

CLASS PAPER (1 credit worth of extra work)

Each of you will select a different gene family from the human genome and write an essay on

How to build a better human?

You will also present your research findings to the class. You may select either a human disease or a trait that you are interested in studying further. Collect all necessary background information and collect the genes associated with your topic. Find the counterparts of your gene of interest in other organisms and develop a phylogenetic tree.

You are expected to use as many of the bioinformatics programs that you learnt in this class as possible to create a comprehensive database of the genes that you have selected.

Important: Provide me with a list of all reference works (printed materials and web site addresses) that you used. Write in your own words. I plan to put your essays and databases on the web, so watch out that you are not accused of plagiarism. See the handout for more information on plagiarism.

For FW5089: You have to do one more extra project to earn the fourth credit. I will discuss this separately with you all.

FW4089: How to use GCG in the GIS lab?

Sit at any computer and shake the mouse to wake the computer up. Press Control-Alt-Delete and then enter your username and password (first initial of your first name and first 7 numbers of your ID).

Your user IDs may be the MTU ones.

You will follow this procedure every time you come to class (unless things change in the next few days due to the new arrival of GCG on the Mango server):

Go to telnet and connect with oak by typing

telnet oak.ffr.mtu.edu

You will get a window for login: type your login name and enter your password; you should see oak%

Type source /gcg/gcgstartup and then hit return

Then type gcg

You should see the GCG logo!

Start using GCG programs!

For GCG manuals go to:

http://forestry.mtu.edu/manuals/gcg/index.htm

Tutorial on using Unix:

Useful Unix Commands: GCG is unfriendly!! It is not Mac or PC based.

Not for distribution. For personal use only.

Login: connect or telnet with oak, the server where GCG is loaded!

Type the password correctly and press Enter.

You should see oak%

Logout: Do not forget to log out at the end of the session. Nothing saved will be lost.

Important note: Do not give your username or password to anyone. If someone wants to use it for GCG, ask him or her to contact his or her supervisor and then me. Any unauthorized use will cost you your GCG privileges.

UNIX Commands

UNIX commands are entered at the prompt> and delivered to the system with the <RETURN> key.

UNIX commands have a syntax, just like any language; there is a correct order for the words in a command, and MANY incorrect orders. Mix up the order, and UNIX is unlikely to be clever enough to understand what you want it to do! It is a dumb computer!

The most general form of UNIX command syntax is

prompt> command -flag(s) argument(s)

prompt = oak%

The command is WHAT you want to do, the -flags help refine the command, saying HOW you want it done, and the arguments tell the OBJECT of the command - the things to be acted upon.

UNIX expects all of its commands to be lower-case, though flags and arguments may be a mixture of cases. Remember, UNIX is case-sensitive!
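
The command/flag/argument pattern and the case rule can be tried out safely with a short sketch like the one below. The scratch directory and the file name sample.seq are made up for illustration, and the last line assumes a typical case-sensitive UNIX file system.

```shell
# Work in a throwaway directory so nothing in your home directory is touched.
workdir=$(mktemp -d)
touch "$workdir/sample.seq"

# command: ls (WHAT)   flag: -l (HOW)   argument: the directory (OBJECT)
listing=$(ls -l "$workdir")
echo "$listing"

# Case matters for arguments too: the lower-case name matches the file,
# while the upper-case spelling normally does not.
ls "$workdir/sample.seq"
ls "$workdir/SAMPLE.SEQ" 2>/dev/null || echo "SAMPLE.SEQ: not found"
```

Try changing the flag or the argument and see how the output changes; the command name itself has to stay in lower case.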

As a trivial example, suppose you wanted to translate the following English request

"Would you please quickly shovel the snow in the driveway today?"

into UNIX. The translation might look something like

prompt> shovel -quickly -today snow

In fact, given the absence of vowels and longer words from most UNIX commands and flags, the actual command is more likely to be

prompt> sw -f -n snow

where sw is short for shovel, -f is short for fast (=quickly), and -n is short for now (=today).

For a genuine example of a UNIX command, consider

mango% ls -la Dirname

Here, ls is short for list, -l is short for long (=all details), and -a is short for all (=all files, even the hidden ones). Dirname is the name of the directory of files for which you want the listing.
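
To see what the -a flag adds, you can compare a plain listing with an ls -la listing of a small test directory. This is a sketch only: the scratch directory and the file names visible.txt and .hidden are made up for the illustration.

```shell
workdir=$(mktemp -d)
mkdir "$workdir/Dirname"
touch "$workdir/Dirname/visible.txt" "$workdir/Dirname/.hidden"

# Plain ls skips names that begin with a dot.
plain=$(ls "$workdir/Dirname")

# -l adds the details (permissions, owner, size, date); -a includes hidden files.
detailed=$(ls -la "$workdir/Dirname")

echo "$plain"
echo "$detailed"
```

The first listing shows only visible.txt; the second also shows .hidden together with the "." and ".." entries.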

Finally, when using GCG commands in UNIX, there is one important "feature" for the arguments; the case you use for the names of database entries is unimportant, but all filenames must be in lower case and typed or copied and pasted correctly.

Text files

Data on computers (text, programmes, sequences etc.) is held in blocks of information called 'files'.

Different files have different names and/or different locations - and there is a convention that filenames end with a three-letter extension that indicates the type of data held in the file, e.g., .txt for text, .seq for sequences, .pep for peptides, .dat for generic data, etc.

Files can be created, deleted, altered, overwritten, moved around, copied, renamed, printed out to a screen or a printer, searched, compared, sorted, counted and transferred over the network to computers on other sites.

Some UNIX commands for file management:

touch filename - create a file [ holding no information! ]

pico filename - edit the file using the pico editor [ use <CTRL> X to exit ]

cp filename newfilename - copy a file to a new file [ retains the old file ]

mv filename newfilename - move (rename) a file to a new file [ deletes the old file ]

cat filename - concatenate (print) a file's contents to the screen

more filename - print a file's contents to the screen, one page at a time [ use <SPACE> to see the next page ]

cat filename1 filename2 > filename3 - concatenate (print) the contents of the first two files into the third

rm filename - remove (delete) the file; dangerous to use with wildcard *
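
Two of the commands above deserve a quick trial run: joining files with cat and >, and deleting with a wildcard. The sketch below works in a scratch directory; the file names part1.seq, part2.seq and joined.seq are made up, and previewing the wildcard with ls before running rm is a suggested habit rather than a requirement.

```shell
# Work in a throwaway directory so no real files are at risk.
workdir=$(mktemp -d)
cd "$workdir"

printf 'ATGC\n' > part1.seq
printf 'GGTA\n' > part2.seq

# cat with > writes both files, in order, into a third file.
cat part1.seq part2.seq > joined.seq
cat joined.seq

# Preview what the wildcard matches before handing it to rm.
ls part*.seq
rm part*.seq
ls
```

After the rm, only joined.seq is left. Running ls with the same wildcard first shows exactly which files rm is about to delete.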

Exercise DNA Analysis - UNIX 1: create and manage files

Create a file named easyunix.txt

prompt> touch easyunix.txt

(NB: you may use any UNIX text editor you like - pico is probably the simplest, but we will use vi today)

prompt> vi easyunix.txt

Edit the file and enter "UNIX is EASY!". Press <ESC>, then exit by typing :x (lower case), which saves the changes.

Print easyunix.txt to the screen.

prompt> more easyunix.txt

Copy easyunix.txt to the file opinion.txt (How would you do this with cat? Hint!)

prompt> cp easyunix.txt opinion.txt

Rename easyunix.txt to unixcmds.txt

prompt> mv easyunix.txt unixcmds.txt

Edit the file unixcmds.txt with the vi editor. Move down the screen with the arrow cursor keys and type what you now know about UNIX. Exit and save the new changes.

prompt> vi unixcmds.txt

Print unixcmds.txt to the screen to see how clever you have become.

prompt> more unixcmds.txt

Delete opinion.txt.

prompt> rm opinion.txt

Directories

A directory is a group of files or other directories. A directory within another is often called a sub-directory, to reflect this hierarchical organization.

Directories can be created, copied, deleted, renamed, searched and transferred over the network to computers on other sites. Files can be moved between or copied among specified directories.

You work in one directory at a time. This is known as the present working directory. The directory you begin with when you log in is your home directory.

pwd: print working directory

You can easily return to your home directory from any other directory by giving the UNIX command "cd" with no argument.

Some UNIX commands for directory management:

cd dirname - change to the directory named dirname

cd .. - change to the directory above the present one [ ".." = up ]

cd - change to your home directory [ the default argument for cd is your home directory ]

ls - list the files in the present working directory

ls -l - a file list that is longer, more detailed

mkdir subdirname - make (create) a new sub-directory in the present directory

rmdir subdirname - remove (delete) a sub-directory in the present directory

mv filename dirname - move a file into a sub-directory
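
One thing the list above does not say is that rmdir only removes a directory that is empty. The sketch below shows this along with mkdir, mv, cd and pwd; it runs in a scratch directory, and the names results and blast.txt are made up for illustration.

```shell
# Work in a throwaway directory.
workdir=$(mktemp -d)
cd "$workdir"

mkdir results
touch blast.txt
mv blast.txt results      # results is a directory, so the file moves into it

cd results
pwd                       # the path now ends in /results
cd ..

# rmdir refuses to delete a directory that still holds files.
rmdir results 2>/dev/null || echo "results is not empty"

rm results/blast.txt
rmdir results             # works now that the directory is empty
```

Empty the directory first (or move its files elsewhere) and rmdir will then succeed.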

Exercise: create and manage directories

Create a sub-directory named Unixinfo

prompt> mkdir Unixinfo

Switch your present working directory to the new sub-directory

prompt> cd Unixinfo

Check to see you are there

prompt> pwd

Move a file from the directory above into your new present working directory (".." is a short form for the directory above, and "." is a short form for the present directory)

prompt> mv ../unixcmds.txt .

Has the file moved? It should occur in the second list (";" separates the two list commands)