On Tue, Mar 6, 2012 at 4:45 PM, Chris Barker <chris.barker@noaa.gov> wrote:
> On Thu, Mar 1, 2012 at 10:58 PM, Jay Bourque <jayvius@gmail.com> wrote:
>
> > 1. Loading text files using loadtxt/genfromtxt need a significant
> > performance boost (I think at least an order of magnitude increase in
> > performance is very doable based on what I've seen with Erin's recfile
> > code)
> >
> > 2. Improved memory usage. Memory used for reading in a text file
> > shouldn't be more than the file itself, and less if only reading a
> > subset of file.
> >
> > 3. Keep existing interfaces for reading text files (loadtxt, genfromtxt,
> > etc). No new ones.
> >
> > 4. Underlying code should keep IO iteration and transformation of data
> > separate (awaiting more thoughts from Travis on this).
> >
> > 5. Be able to plug in different transformations of data at low level
> > (also awaiting more thoughts from Travis).
> >
> > 6. memory mapping of text files?
> >
> > 7. Eventually reduce memory usage even more by using same object for
> > duplicate values in array (depends on implementing enum dtype?)
> >
> > Anything else?
>
> Yes -- I'd like to see the solution be able to do high-performance
> reads of a portion of a file -- not always the whole thing. I seem to
> have a number of custom text files that I need to read that are laid
> out in chunks: a bit of a header, then a block of numbers, another
> header, another block. I'm happy to read and parse the header sections
> with pure Python, but would love a way to read the blocks of numbers
> into a numpy array fast. This will probably come out of the box with
> any of the proposed solutions, as long as they start at the current
> position of a passed-in file object, and can be told how much to read,
> then leave the file pointer in the correct position.
>
If you are set up with Cython to build extension modules, and you don't mind
testing an unreleased and experimental reader, you can try the text reader
that I'm working on: https://github.com/WarrenWeckesser/textreader

You can read a file like this, where the first line gives the number of
rows of the following array, and that pattern repeats:
5
1.0, 2.0, 3.0
4.0, 5.0, 6.0
7.0, 8.0, 9.0
10.0, 11.0, 12.0
13.0, 14.0, 15.0
3
1.0, 1.5, 2.0, 2.5
3.0, 3.5, 4.0, 4.5
5.0, 5.5, 6.0, 6.5
1
1.0D2, 1.25D-1, 6.25D-2, 99
with code like this:
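import numpy as np
from textreader import readrows

filename = 'data/multi.dat'

f = open(filename, 'r')
line = f.readline()
while len(line) > 0:
    # Each block starts with a line giving its number of rows.
    nrows = int(line)
    # Read exactly nrows rows from the current file position;
    # sci='D' accepts Fortran-style exponents such as 1.0D2.
    a = readrows(f, np.float32, numrows=nrows, sci='D', delimiter=',')
    print("a:")
    print(a)
    print("")
    # The file pointer is now at the next header line (or at EOF).
    line = f.readline()
f.close()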
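
If building a Cython extension is not an option, the same pattern (read a count, then read exactly that many rows from the current file position) can be approximated with plain NumPy. What follows is only a rough sketch under some assumptions: it targets Python 3, `read_blocks` is a hypothetical helper name that is not part of textreader or NumPy, and the Fortran-style `D` exponent is handled by a simple string replacement rather than by the parser itself. It makes no attempt to match the speed of a compiled reader.

```python
import io

import numpy as np


def read_blocks(f, dtype=np.float64, delimiter=","):
    """Yield one 2-D array per block from an open text file object.

    Hypothetical helper: each block is a line holding the row count,
    followed by that many delimited rows.
    """
    while True:
        header = f.readline()
        if not header.strip():
            break  # end of file, or a blank line where a count was expected
        nrows = int(header)
        # Consume exactly nrows lines so the file is left at the next header.
        # Assumption: 'D' only ever appears as an exponent marker.
        rows = [f.readline().replace("D", "E") for _ in range(nrows)]
        # np.loadtxt accepts a list of strings; ndmin=2 keeps one-row
        # blocks two-dimensional.
        yield np.loadtxt(rows, dtype=dtype, delimiter=delimiter, ndmin=2)


# Small in-memory stand-in for a file laid out like the example above.
sample = io.StringIO(
    "2\n"
    "1.0, 2.0, 3.0\n"
    "4.0, 5.0, 6.0\n"
    "1\n"
    "1.0D2, 1.25D-1, 6.25D-2, 99\n"
)
for a in read_blocks(sample):
    print(a.shape)
    print(a)
```

Because the helper only calls `readline()` on the object it is given, it can be handed a file that has already been advanced past a hand-parsed header, and it leaves the file pointer just after the last row it read.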
Warren