
Python Dictionaries
“mappings”
Python Dictionaries
• Sequences are accessed by position, that is, by an index indicating how many entries they are past the first entry.
• There is another class of collections known generally as mappings
• The items in a mapping are accessed by a key, and in general no particular ordering may be assumed (Python 3.7 and later do keep a dict's keys in insertion order)
• Python has only one built-in mapping type, namely dictionaries, with type name dict
• Aside for the CS students in the class: Python dictionaries are implemented using hash tables
• Dictionaries are mutable:
• They can be changed in place and can grow and shrink on demand, like lists.
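A minimal sketch of what "changed in place" and "grow and shrink" mean in practice. The dictionary name inventory and its keys are made up for illustration.

```python
# Hypothetical example data, not taken from the slides
inventory = {'apples': 3}

inventory['pears'] = 5        # grow: assigning to a new key adds an item
inventory['apples'] += 1      # change in place: update an existing value
del inventory['pears']        # shrink: remove a key together with its value

print(inventory)              # {'apples': 4}
print(len(inventory))         # 1
```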
Python Dictionaries
• The literals used in directly defining a dictionary are a sequence of “key:value” pairs enclosed in curly braces.
D = {'food': 'Spam', 'quantity': 4, 'color':'pink'}
• We “index” the dictionary by key to fetch and change the key's associated value:
>>> D['food']
'Spam'
>>> D['quantity'] += 1
>>> D
{'food': 'Spam', 'color': 'pink', 'quantity': 5}
Python Dictionaries
• A dictionary can also be built up one item at a time
• First, create an empty dictionary:
>>> D = {}
• Then create keys by assignment:
>>> D['name'] = 'Bob'
>>> D['job'] = 'dev'
>>> D['age'] = 40
>>> D
{'age':40,'job':'dev','name':'Bob'}
>>> print(D['name'])
Bob
Dictionary Methods
• The result of d.keys() may be used like a sequence; to convert it to an actual sequence, use list(d.keys())
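The sketch below shows a few commonly used dictionary methods on the dictionary from the earlier slide. The orderings in the comments assume Python 3.7 or later, where keys keep their insertion order; the key 'size' is a made-up missing key.

```python
d = {'food': 'Spam', 'quantity': 4, 'color': 'pink'}

keys = list(d.keys())          # ['food', 'quantity', 'color']
values = list(d.values())      # ['Spam', 4, 'pink']
pairs = list(d.items())        # list of (key, value) tuples

# items() is convenient for looping over keys and values together
for key, value in d.items():
    print(key, '->', value)

# get() returns a default instead of raising KeyError for a missing key
size = d.get('size', 'unknown')    # 'unknown'
has_food = 'food' in d             # True
```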
• Dictionaries are the natural Python representation of tabular data.
• We will illustrate this by representing the codon table showing how the codons determine amino acids during translation.
Codon Table as Dictionary
>>> RNA_codon_table = {
#                       Second Base
#       U             C             A             G
# U
    'UUU': 'Phe', 'UCU': 'Ser', 'UAU': 'Tyr', 'UGU': 'Cys',   # UxU
    'UUC': 'Phe', 'UCC': 'Ser', 'UAC': 'Tyr', 'UGC': 'Cys',   # UxC
    'UUA': 'Leu', 'UCA': 'Ser', 'UAA': '---', 'UGA': '---',   # UxA
    'UUG': 'Leu', 'UCG': 'Ser', 'UAG': '---', 'UGG': 'Trp',   # UxG
# C
    'CUU': 'Leu', 'CCU': 'Pro', 'CAU': 'His', 'CGU': 'Arg',   # CxU
    'CUC': 'Leu', 'CCC': 'Pro', 'CAC': 'His', 'CGC': 'Arg',   # CxC
    'CUA': 'Leu', 'CCA': 'Pro', 'CAA': 'Gln', 'CGA': 'Arg',   # CxA
    'CUG': 'Leu', 'CCG': 'Pro', 'CAG': 'Gln', 'CGG': 'Arg',   # CxG
# A
    'AUU': 'Ile', 'ACU': 'Thr', 'AAU': 'Asn', 'AGU': 'Ser',   # AxU
    'AUC': 'Ile', 'ACC': 'Thr', 'AAC': 'Asn', 'AGC': 'Ser',   # AxC
    'AUA': 'Ile', 'ACA': 'Thr', 'AAA': 'Lys', 'AGA': 'Arg',   # AxA
    'AUG': 'Met', 'ACG': 'Thr', 'AAG': 'Lys', 'AGG': 'Arg',   # AxG
# G
    'GUU': 'Val', 'GCU': 'Ala', 'GAU': 'Asp', 'GGU': 'Gly',   # GxU
    'GUC': 'Val', 'GCC': 'Ala', 'GAC': 'Asp', 'GGC': 'Gly',   # GxC
    'GUA': 'Val', 'GCA': 'Ala', 'GAA': 'Glu', 'GGA': 'Gly',   # GxA
    'GUG': 'Val', 'GCG': 'Ala', 'GAG': 'Glu', 'GGG': 'Gly'    # GxG
}
Codon Table as Dictionary
>>> RNA_codon_table
{'ACC': 'Thr', 'GUC': 'Val', 'ACA': 'Thr', 'AAA': 'Lys',
'GUU': 'Val', 'AAC': 'Asn', 'CCU': 'Pro', 'UGG': 'Trp',
'AGC': 'Ser', 'AUC': 'Ile', 'CAU': 'His', 'AAU': 'Asn',
'AGU': 'Ser', 'ACU': 'Thr', 'GUG': 'Val', 'CAC': 'His',
'ACG': 'Thr', 'CAA': 'Gln', 'CCA': 'Pro', 'CCG': 'Pro',
'CCC': 'Pro', 'GGU': 'Gly', 'UCU': 'Ser', 'GCG': 'Ala',
'UGC': 'Cys', 'CAG': 'Gln', 'UGA': '---', 'UAU': 'Tyr',
'CGG': 'Arg', 'UCG': 'Ser', 'AGG': 'Arg', 'GGG': 'Gly',
'UCC': 'Ser', 'UCA': 'Ser', 'GAA': 'Glu', 'UAA': '---',
'GGA': 'Gly', 'UAC': 'Tyr', 'CGU': 'Arg', 'UGU': 'Cys',
'AUA': 'Ile', 'GCA': 'Ala', 'CUU': 'Leu', 'GGC': 'Gly',
'AUG': 'Met', 'CUG': 'Leu', 'GAG': 'Glu', 'CUC': 'Leu',
'AGA': 'Arg', 'CUA': 'Leu', 'GCC': 'Ala', 'AAG': 'Lys',
'GAU': 'Asp', 'UUU': 'Phe', 'GAC': 'Asp', 'GUA': 'Val',
'CGA': 'Arg', 'GCU': 'Ala', 'UAG': '---', 'AUU': 'Ile',
'UUG': 'Leu', 'UUA': 'Leu', 'CGC': 'Arg', 'UUC': 'Phe'}
“Pretty Printing”
>>> from pprint import pprint
>>> pprint(RNA_codon_table)
{'AAA': 'Lys',
'AAC': 'Asn',
'AAG': 'Lys',
'AAU': 'Asn',
'ACA': 'Thr',
'ACC': 'Thr',
'ACG': 'Thr',
'ACU': 'Thr',
'AGA': 'Arg',
. . .
'UCU': 'Ser',
'UGA': '---',
'UGC': 'Cys',
'UGG': 'Trp',
'UGU': 'Cys',
'UUA': 'Leu',
'UUC': 'Phe',
'UUG': 'Leu',
'UUU': 'Phe'}
Using the RNA_codon_table
>>> def translate_RNA_codon(codon):
        return RNA_codon_table[codon]
>>> translate_RNA_codon('AGA')
'Arg'
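As a possible next step, the same lookup can be applied codon by codon along a whole RNA string. This is only a sketch: the function name translate_RNA, the four-entry codon_table subset and the demo sequence are assumptions made for illustration, and '---' is treated as a stop codon as in the table above.

```python
# Small subset of the codon table, just enough for the demo sequence below
codon_table = {'AUG': 'Met', 'GCU': 'Ala', 'AGA': 'Arg', 'UAA': '---'}

def translate_RNA(seq, table=codon_table):
    """Translate seq three bases at a time, stopping at the first stop codon.

    Trailing bases that do not form a complete codon are ignored.
    """
    amino_acids = []
    for i in range(0, len(seq) - 2, 3):
        residue = table[seq[i:i+3]]     # dictionary lookup by codon
        if residue == '---':            # stop codon: end of translation
            break
        amino_acids.append(residue)
    return amino_acids

print(translate_RNA('AUGGCUAGAUAA'))    # ['Met', 'Ala', 'Arg']
```

A codon that is not in the table raises KeyError, which is usually what you want while debugging input data.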
Streams
• A stream is a temporally ordered sequence of indefinite length
• Usually limited to one type
• Two ends:
– source, provides the elements
– sink, absorbs the elements
• Examples of Python stream sources:
– files
– network connections
– output of special functions called generators
• Examples of stream sinks:
– files
– network connections
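To make the source and sink idea concrete, here is a small sketch in which a generator acts as the stream source and a plain list acts as the sink. The names countdown and received are invented for this example.

```python
def countdown(start):
    """Generator: a stream source that yields start, start-1, ..., 1."""
    while start > 0:
        yield start
        start -= 1

source = countdown(3)
received = []                 # the list plays the role of the sink

for element in source:        # elements arrive one at a time, in order
    received.append(element)

print(received)               # [3, 2, 1]
print(list(source))           # []  (a consumed stream has nothing left)
```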
Streams
• Input to a command-line shell or the Python interpreter becomes a stream of
characters
• When Python prints to the terminal, the output is also a stream of characters
• Illustrates the temporal nature of streams
• Keystrokes don't "come from" anywhere
• They are events that happen in time
• Implementation detail: buffering
Files
• Depending on a parameter used on creation, the elements “flowing” to/from the
file stream are either bytes or Unicode characters
• Some methods treat files as streams of bytes or characters, others as lines of
bytes or characters
• Most of the time, a file is a one-way sequence – it can be read from or written to
• While it is possible to create a two-way file object, better to think of it as two
separate streams
• Files opened for reading are assumed to already exist
• An attempt to open a non-existent file for reading results in an error
• Files can also be opened for appending to an existing file
• When a file is opened for writing it is created if it does not exist
• If the file does exist, its contents are erased as the result of being opened
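The rules above can be checked with a short experiment. This is a sketch that works in a fresh temporary directory; the file name demo.txt is arbitrary.

```python
import os
import tempfile

# Hypothetical scratch file in a fresh temporary directory
path = os.path.join(tempfile.mkdtemp(), 'demo.txt')

missing = False
try:
    f = open(path, 'r')          # the file does not exist yet
    f.close()
except FileNotFoundError:
    missing = True               # opening for reading failed

f = open(path, 'w')              # 'w' creates the file
f.write('first\n')
f.close()

f = open(path, 'a')              # 'a' keeps the contents and adds to the end
f.write('second\n')
f.close()

f = open(path, 'r')
after_append = f.read()          # 'first\nsecond\n'
f.close()

f = open(path, 'w')              # 'w' on an existing file erases its contents
f.close()

f = open(path, 'r')
after_rewrite = f.read()         # ''
f.close()

print(missing, repr(after_append), repr(after_rewrite))
```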
Creating File Objects
• File objects are created by a call to the built-in function open(path,mode)
• path is a string that specifies the location for the physical file represented by
the Python file object
• mode is a string of one to three characters which specifies the type of file interaction
• A mode string must contain a substring specifying the intended use of the file
• The use options are 'r', 'w', 'a', 'r+', 'w+', 'a+'; if mode is omitted, 'r' is used
• An optional single character specifies the file object's value type
• The value options are: 't', 'b'; 't' is the default
• The meaning of the mode string contents is given in the following tables
File Modes
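Use    Meaning
'r'    read; the file must already exist
'w'    write; the file is created if missing and erased if it exists
'a'    append; the file is created if missing, writes go to the end
'r+'   read and write; the file must already exist, contents are kept
'w+'   read and write; the file is created if missing and erased if it exists
'a+'   read and append; the file is created if missing

Value  Meaning
't'    text; the elements are Unicode characters
'b'    binary; the elements are bytes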
Creating File Objects
• Simple use of the open function:
f = open(r'C:\Users\rtindell\myfile','r')
• When you are finished using the file object, you must close it:
f.close()
• There are hazards to using this approach, which, although relatively rare, can have bad results
• If your script crashes before the close statement is executed, there may be
writes whose data was not actually written to the physical file
• This is because transfers to and from external drives are done in fairly large
chunks (blocks) because of the way the underlying hardware works
• The chunks are kept in special pieces of computer memory called buffers
• Requests for reading or writing are satisfied by the buffers until all the contents of the buffer have been used
• At that point, entire blocks are transferred between main memory and the drive
• If the buffer was half-full when your script crashed, the buffer data would never
be written to the disk
The with Statement
• Python provides a way to make sure files are closed regardless of other events
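The mechanism is the with statement. The sketch below uses a scratch file in a temporary directory (the name with_demo.txt is arbitrary) and a deliberately raised RuntimeError to stand in for a crash.

```python
import os
import tempfile

# Hypothetical scratch file for the demonstration
path = os.path.join(tempfile.mkdtemp(), 'with_demo.txt')

with open(path, 'w') as f:
    f.write('some data\n')
    print(f.closed)        # False: the file is open inside the block

print(f.closed)            # True: closed automatically on leaving the block

# The file is closed even if the block raises an exception
try:
    with open(path, 'r') as g:
        raise RuntimeError('simulated crash')
except RuntimeError:
    pass

print(g.closed)            # True
```

Because closing also flushes the buffer, data written inside the block reaches the file even when the block is left through an exception.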
File Read Methods
In the following, f is a file object
• f.read(count)
– Treats the file as an input stream of characters (bytes, if the file was opened in binary mode)
– Reads up to count characters from the current file position into a string and returns that string
– The file position is then just past the last character read
– If there were fewer than count characters left in the file, returns just the remaining characters
– If the file position is the end of the file, returns the empty string
• f.read()
– Reads the characters from the current position to the end of the file into a string and returns that string
– The file position is then the end of the file
File Read Methods
In the following, f is a file object
• f.readline()
– Treats the file as an input stream of lines
– Reads one line from the file and returns the entire line, including the end-of-line
character
– The file position is then the beginning of the next line of the file, or the end of the file if the file was exhausted
– If the initial file position is the end of the file, returns the empty string
• f.readline(count)
– Same as f.readline(), but limits the number of characters read to count.
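A short sketch of how the file position moves as the read methods are called. It first writes a two-line scratch file (the name and contents are invented for the example) and then reads it back in text mode.

```python
import os
import tempfile

# Hypothetical scratch file containing two short lines
path = os.path.join(tempfile.mkdtemp(), 'read_demo.txt')
with open(path, 'w') as f:
    f.write('ACGU\nGGCC\n')

with open(path, 'r') as f:
    first_two = f.read(2)         # 'AC'
    rest_of_line = f.readline()   # 'GU\n', up to and including the newline
    part = f.readline(3)          # 'GGC', at most 3 characters of the next line
    remainder = f.read()          # 'C\n', everything up to the end of the file
    at_end = f.read()             # '', the position is already at the end

print(repr(first_two), repr(rest_of_line), repr(part),
      repr(remainder), repr(at_end))
```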
File Write Methods
In the following, f is a file object, s is a string and seq is a sequence object
• f.write(s)
– Treats the file as an output stream of characters
– Writes the string s to the file represented by f
• f.writelines(seq)
– Treats the file as an output stream of lines
– All elements of seq must be strings
– Writes each element of seq to the file
– Despite the name, it does not add newline characters to the elements
• The print function also provides a mechanism for writing to a file
• You just pass the file object as the keyword argument file, after the values to print
• Example:
print('Hello','young','bioinformaticians',file=f)
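The three ways of writing can be compared side by side. This sketch writes to a scratch file (arbitrary name) and reads the result back with readlines() to show exactly which newlines ended up in the file.

```python
import os
import tempfile

# Hypothetical scratch file for the demonstration
path = os.path.join(tempfile.mkdtemp(), 'write_demo.txt')

with open(path, 'w') as f:
    f.write('line one\n')                            # newline written explicitly
    f.writelines(['line two\n', 'line three\n'])     # newlines supplied by us
    print('Hello', 'young', 'bioinformaticians', file=f)   # print adds the newline

with open(path, 'r') as f:
    lines = f.readlines()

print(lines)
# ['line one\n', 'line two\n', 'line three\n', 'Hello young bioinformaticians\n']
```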
FASTA Format
• FASTA formatted files are widely used in bioinformatics
• They consist of one or more base or amino acid sequences, each preceded by a single header line and broken up into fixed-size lines
• The header line starts with a ">" symbol.
• The first word on this line is the name of the sequence. The rest of the line optionally provides a description of the sequence.
• The first line below gives the meaning of the header line entries and does not exist in the FASTA file. The second line is the actual entry for our example.
Identifier    Molecule Type   Gene Name   Sequence Length
FOSB_MOUSE    Protein         fosB        338 bp
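Splitting a header line into its name and description takes only a few string operations. The sketch below uses the header of the FOSB_MOUSE example; the variable names are chosen for illustration.

```python
header = '>FOSB_MOUSE Protein fosB. 338 bp'

words = header[1:].split()          # drop the leading '>' and split on whitespace
identifier = words[0]               # 'FOSB_MOUSE'
description = ' '.join(words[1:])   # 'Protein fosB. 338 bp'

print(identifier)
print(description)
```

If the header has no description, words[1:] is empty and description becomes the empty string.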
FASTA Format
• FASTA sequence identifiers are usually more complex than previously shown and
distinguish various possible sources for the sequence
• Below is a table of identifier formats
In a genomic context, locus refers to position on a chromosome. It may, therefore, refer to a marker, a gene, or any other landmark that can be described.
accession = Accession Number
FASTA Example
>FOSB_MOUSE Protein fosB. 338 bp
MFQAFPGDYDSGSRCSSSPSAESQYLSSVDSFGSPPTAAASQECAGLGEMPGSFVPTVTA
ITTSQDLQWLVQPTLISSMAQSQGQPLASQPPAVDPYDMPGTSYSTPGLSAYSTGGASGS
GGPSTSTTTSGPVSARPARARPRRPREETLTPEEEEKRRVRRERNKLAAAKCRNRRRELT
DRLQAETDQLEEEKAELESEIAELQKEKERLEFVLVAHKPGCKIPYEEGPGPGPLAEVRD
LPGSTSAKEDGFGWLLPPPPPPPLPFQSSRDAPPNLTASLFTHSEVQVLGDPFPVVSPSY
TSSFVLTCPEVSAFAGAQRTSGSEQPSDPLNSPSLLAL
Creating a Sequence Dictionary from a FASTA File
• We next present three ways to read the contents of a file containing FASTA
format data into a dictionary whose keys are the sequence identifiers
• A dictionary value will be a length-two list of strings
• The first string will contain the sequence description if present in the FASTA file,
otherwise it will be the empty string
• The second string will be a single string containing the sequence itself
• If D were the name of our dictionary and seqid the identifier of a sequence
from the file, we could access the sequence of that name as
D[seqid][1]
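For orientation, this is roughly what such a dictionary looks like once built. The sketch is hand-written: the FOSB_MOUSE sequence is shortened to its first 20 residues, and the NO_DESCR entry is a made-up record without a description.

```python
D = {
    # sequence shortened to its first 20 residues for brevity
    'FOSB_MOUSE': ['Protein fosB. 338 bp', 'MFQAFPGDYDSGSRCSSSPS'],
    # hypothetical record whose header had no description
    'NO_DESCR': ['', 'ACGU'],
}

seqid = 'FOSB_MOUSE'
description = D[seqid][0]     # first list element: the description
sequence = D[seqid][1]        # second list element: the sequence itself

print(description)            # Protein fosB. 338 bp
print(sequence)               # MFQAFPGDYDSGSRCSSSPS
```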
Method 1: Reading the Entire File into a String
# file fasta_dict1.py
def fasta_to_dictionary(fpath):
    D = {}
    with open(fpath,'r') as f:
        S = f.read()
    J0 = S.split('>')                   # Separate entries
    J = [j for j in J0 if j != '']      # Eliminate empty entries
    # J is now a list of strings, each of which contains one of
    # the sequence specifications from the FASTA file
    for B in J:
        C = [k for k in B.splitlines() if k != '']
        # C[0] is the first line of B
        # and is thus the name-description line
        comps = C[0].split()
        key = comps[0]
        # First word is the identifier for the sequence
        # Remaining words in comps are sequence description components
        if len(comps) > 1:
            descr = ' '.join(comps[1:])
        else:
            descr = ''
        # Remaining lines of B contain the split-up sequence
        # so join them into a single line
        seq = ''.join(C[1:])
        D[key] = [descr,seq]
    return D
Main Body of fasta_dict1.py
# file fasta_dict1.py continued
# Test the function
D = fasta_to_dictionary('fdata')
if len(D) == 0:
    print('No FASTA data found')
else:
    for k in D:
        print('Sequence Identifier:\t\t',k)
        print('Sequence Description:\t',D[k][0])
        print('Sequence:')
        print(D[k][1],'\n')
Contents of Test File fdata
>FOSB_MOUSE Protein fosB. 338 bp
MFQAFPGDYDSGSRCSSSPSAESQYLSSVDSFGSPPTAAASQECAGLGEMPGSFVPTVTA
ITTSQDLQWLVQPTLISSMAQSQGQPLASQPPAVDPYDMPGTSYSTPGLSAYSTGGASGS
GGPSTSTTTSGPVSARPARARPRRPREETLTPEEEEKRRVRRERNKLAAAKCRNRRRELT
DRLQAETDQLEEEKAELESEIAELQKEKERLEFVLVAHKPGCKIPYEEGPGPGPLAEVRD
LPGSTSAKEDGFGWLLPPPPPPPLPFQSSRDAPPNLTASLFTHSEVQVLGDPFPVVSPSY
TSSFVLTCPEVSAFAGAQRTSGSEQPSDPLNSPSLLAL
>FOSB_RAT Protein fosB. 338 bp
MFQAFPGDYDSGSRCSSSPSAESQYLSSVDSFGSPPTAAASQECAGLGEMPGSFVPTVTA
ITTSQDLQWLVQPTLISSMAQSQGQPLASQPPAVDPYDMPGTSYSTPGLSAYSTGGASGS
GGPSTSTTTSGPVSARPARARPRRPREETLTPEEEEKRRVRRERNKLAAAKCRNRRRELT
DRLQAETDQLEEEKAELESEIAELQKEKERLEFVLVAHKPGCKIPYEEGPGPGPLAEVRD
LPGSTSAKEDGFGWLLPPPPPPPLPFQSSRDAPPNLTASLFTHSEVQVLGDPFPVVSPSY
TSSFVLTCPEVSAFAGAQRTSGSEQPSDPLNSPSLLAL
Output of fasta_dict1.py
Sequence Identifier: FOSB_MOUSE
Sequence Description: Protein fosB. 338 bp
Sequence:
MFQAFPGDYDSGSRCSSSPSAESQYLSSVDSFGSPPTAAASQECAGLGEMPGSFVPTVTAITTSQDL
QWLVQPTLISSMAQSQGQPLASQPPAVDPYDMPGTSYSTPGLSAYSTGGASGSGGPSTSTTTSGPVS
ARPARARPRRPREETLTPEEEEKRRVRRERNKLAAAKCRNRRRELTDRLQAETDQLEEEKAELESEI
AELQKEKERLEFVLVAHKPGCKIPYEEGPGPGPLAEVRDLPGSTSAKEDGFGWLLPPPPPPPLPFQS
SRDAPPNLTASLFTHSEVQVLGDPFPVVSPSYTSSFVLTCPEVSAFAGAQRTSGSEQPSDPLNSPSL
LAL
Sequence Identifier: FOSB_RAT
Sequence Description: Protein fosB. 338 bp
Sequence:
MFQAFPGDYDSGSRCSSSPSAESQYLSSVDSFGSPPTAAASQECAGLGEMPGSFVPTVTAITTSQDL
QWLVQPTLISSMAQSQGQPLASQPPAVDPYDMPGTSYSTPGLSAYSTGGASGSGGPSTSTTTSGPVS
ARPARARPRRPREETLTPEEEEKRRVRRERNKLAAAKCRNRRRELTDRLQAETDQLEEEKAELESEI
AELQKEKERLEFVLVAHKPGCKIPYEEGPGPGPLAEVRDLPGSTSAKEDGFGWLLPPPPPPPLPFQS
SRDAPPNLTASLFTHSEVQVLGDPFPVVSPSYTSSFVLTCPEVSAFAGAQRTSGSEQPSDPLNSPSL
LAL
Possible Problems with Script fasta_dict1.py
• If we were trying to process a very large file, reading its entire contents into a
string in memory might not be possible
• We will modify the script so that it processes one line at a time by iterating over the file object, which yields the same lines as repeated calls to the readline method
• Of course, this is only one part of a correction since the dictionary we build
would be at least as large as the file!
• We will address that problem later
• We will use the with statement so that you can see it in use
• Note that all statements that use the file object f must be in the block of the with statement
• Why? After you leave the with block, f has been closed.
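A quick way to convince yourself of this is to try using f after the block has ended. The sketch below creates a tiny scratch FASTA-like file (name and contents invented for the example) and then reads from it inside and outside the with block.

```python
import os
import tempfile

# Hypothetical scratch file with one tiny record
path = os.path.join(tempfile.mkdtemp(), 'closed_demo.txt')
with open(path, 'w') as f:
    f.write('>seq1\nACGU\n')

with open(path, 'r') as f:
    header = f.readline()        # fine: f is still open here

closed_error = False
try:
    f.readline()                 # f was closed when the with block ended
except ValueError:
    closed_error = True          # reading a closed file raises ValueError

print(repr(header), closed_error)    # '>seq1\n' True
```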
Method 2: Reading the File One Line at a Time
• Since the two scripts only differ in the fasta_to_dictionary function, we only
show that part here.
# file fasta_dict2.py
def fasta_to_dictionary(fpath):
    D = {}
    with open(fpath,'r') as f:
        key = ''
        descr = ''
        seq = ''
        for line in f:
            line = line.strip()
            if line.startswith('>'):
                if key != '':           # Finished with a sequence
                    D[key] = [descr,seq]
                comps = line[1:].split()    # Drop the leading '>'
                key = comps[0]
                if len(comps) > 1:
                    descr = ' '.join(comps[1:])
                else:
                    descr = ''
                seq = ''
            else:
                seq += line     # If there are lines preceding the first
                                # '>' line they will accumulate here and
                                # be discarded when we start processing
                                # the first '>' line
    # END OF with SUITE
    # Save the final sequence, which was terminated by file's end
    if key != '':
        D[key] = [descr,seq]
    return D
Exploring the Preceding Examples
• The files fasta_dict1.py and fasta_dict2.py have been posted in the Practice
Problems page of the course website (not Blackboard)
• One way to get an understanding of the scripts is to insert print statements to
print intermediate data that appear in the script
Exploring the Preceding Examples
• For example, in fasta_dict1.py, you could replace
with open(fpath,'r') as f:
    S = f.read()
J0 = S.split('>')
J = [j for j in J0 if j != '']
with
with open(fpath,'r') as f:
    S = f.read()
print(S)
J0 = S.split('>')
print(J0)
J = [j for j in J0 if j != '']
print(J)