On Fri, Mar 13, 2009 at 10:59 AM, psaffrey at googlemail.com
<psaffrey at googlemail.com> wrote:
> I'm reading in some rather large files (28 files each of 130MB). Each
> file is a genome coordinate (chromosome (string) and position (int))
> and a data point (float). I want to read these into a list of
> coordinates (each a tuple of (chromosome, position)) and a list of
> data points.
>
> This has taught me that Python lists are not memory efficient, because
> if I use lists it gets through 100MB a second until it hits the swap
> space and I have 8GB physical memory in this machine. I can use Python
> or numpy arrays for the data points, which is much more manageable.
> However, I still need the coordinates. If I don't keep them in a list,
> where can I keep them?
Assuming your data is in a plaintext file something like
'genomedata.txt' below, the following will load it into a numpy
structured array with a customized dtype. You can access the different
fields by name ('chromo', 'position', and 'dpoint'; change them to your
liking). I don't know whether this will work for your case, but it
might be worth a try.
===============================================
[186]$ cat genomedata.txt
gene1 120189 5.34849
gene2 84040 903873.1
gene3 300822 -21002.2020
[187]$ cat g2arr.py
import numpy as np

def g2arr(fname):
    # The 'S100' should be modified to be large enough for your string field.
    # int and float map to numpy's default integer and float types.
    dt = np.dtype({'names': ['chromo', 'position', 'dpoint'],
                   'formats': ['S100', int, float]})
    return np.loadtxt(fname, delimiter=' ', dtype=dt)

if __name__ == '__main__':
    arr = g2arr('genomedata.txt')
    # Print the whole record array, then each named field in turn.
    print(arr)
    print(arr['chromo'])
    print(arr['position'])
    print(arr['dpoint'])
=================================================
Take a look at the np.loadtxt and np.dtype documentation.
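
Since memory is the concern, it may help to check how many bytes each record takes for a given string width. The sketch below is only an illustration: it reads the three sample rows from an in-memory string instead of a file, `make_dtype` is a hypothetical helper, and the width of 8 is an assumption that only holds if your chromosome names fit in 8 bytes.

```python
import io
import numpy as np

# The three sample rows from genomedata.txt, held in memory for illustration.
SAMPLE = """gene1 120189 5.34849
gene2 84040 903873.1
gene3 300822 -21002.2020
"""

def make_dtype(str_width):
    # Hypothetical helper: str_width is the fixed number of bytes
    # reserved for the string field in every record.
    return np.dtype({'names': ['chromo', 'position', 'dpoint'],
                     'formats': ['S%d' % str_width, np.int64, np.float64]})

wide = np.loadtxt(io.StringIO(SAMPLE), delimiter=' ', dtype=make_dtype(100))
narrow = np.loadtxt(io.StringIO(SAMPLE), delimiter=' ', dtype=make_dtype(8))

# itemsize is the number of bytes per record: string width + 8 + 8.
print(wide.dtype.itemsize)
print(narrow.dtype.itemsize)
print(narrow['position'])
```

Every record reserves the full string width whether or not it is used, so a tighter width shrinks the whole array proportionally. Strings longer than the chosen width will not fit, so pick it from your longest chromosome name.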
Kurt