Requires Python >= 2.5. Stores a flattened version of the fasta file, without
spaces or headers, and reads it through either a numpy memmap (numpy binary format)
or fseek/fread, so the sequence data is never read into memory. Saves a pickle (.gdx)
of the start and stop locations (for fseek/mmap) of each header in the fasta file
for internal use.
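
The idea of a flattened file plus a start/stop index can be sketched in a few lines of standard-library Python. This is only an illustration of the technique, not pyfasta's actual on-disk format: the helper names flatten_fasta and fetch, the file name example.fasta.flat, and the toy sequences are all made up for the example, and plain mmap stands in for numpy's memmap.

```python
import mmap
import os
import tempfile


def flatten_fasta(fasta_text):
    """Return (flat_bytes, index); index maps header -> (start, stop)."""
    flat = bytearray()
    index = {}
    header = None
    start = 0
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Close the previous record before starting a new one.
            if header is not None:
                index[header] = (start, len(flat))
            header = line[1:]
            start = len(flat)
        else:
            flat.extend(line.encode("ascii"))
    if header is not None:
        index[header] = (start, len(flat))
    return bytes(flat), index


def fetch(flat_path, index, header, lo, hi):
    """Read bases [lo, hi) of one record through a read-only mmap."""
    start, stop = index[header]
    with open(flat_path, "rb") as fh:
        mm = mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ)
        try:
            # Only the requested slice is copied out of the mapping.
            return mm[start + lo:min(start + hi, stop)].decode("ascii")
        finally:
            mm.close()


# Toy input (hypothetical data).
example = ">chr1\nACGT\nACGT\n>chr2 extra words\nTTTT\nGG\n"
flat, index = flatten_fasta(example)

with tempfile.TemporaryDirectory() as tmp:
    flat_path = os.path.join(tmp, "example.fasta.flat")
    with open(flat_path, "wb") as out:
        out.write(flat)
    piece = fetch(flat_path, index, "chr2 extra words", 2, 5)

print(index)
print(piece)
```

Because every record is a contiguous byte range in the flat file, a slice of a record becomes a single offset calculation, which is what makes the seek or mmap approach cheap.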

Sometimes your fasta will have a long header like “AT1G51370.2 | Symbols: | F-box family protein | chr1:19045615-19046748 FORWARD” when you only want to key off “AT1G51370.2”. In this case, pass a function as the key_fn argument to the constructor; it takes the full header and returns the key to use.
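
A key function for the header above might look like the sketch below. The function name first_token and the file name in the trailing comment are hypothetical; only the key_fn argument itself comes from the text.

```python
def first_token(header):
    """Return the text before the first whitespace in a fasta header."""
    return header.split()[0]


long_header = (
    "AT1G51370.2 | Symbols: | F-box family protein | "
    "chr1:19045615-19046748 FORWARD"
)
short_key = first_token(long_header)
print(short_key)

# With pyfasta installed, the function would be handed to the constructor,
# for example (hypothetical file name):
#   f = Fasta("some.fasta", key_fn=first_token)
```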

It’s also possible to specify another record class as the underlying workhorse
for slicing and reading. The available classes are:

NpyFastaRecord, the default, which uses a numpy memmap

FastaRecord, which uses fseek/fread

MemoryRecord, which reads everything into memory and must reparse the original
fasta every time.

TCRecord, which is identical to NpyFastaRecord except that it saves the index
in a TokyoCabinet hash database, for cases when there are enough records that
loading the entire index from a pickle into memory is unwise. (Note that the
sequence is not loaded into memory in either case.)

The class to use is selected with the record_class kwarg to the Fasta
constructor.
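
The pluggable record class can be pictured with a small stand-alone sketch. Everything here (TinyFasta, SeekRecord, InMemoryRecord, the toy flat file) is hypothetical and written only to show the pattern of passing a class as record_class; it is not pyfasta's code.

```python
import os
import tempfile


class SeekRecord:
    """Hypothetical record that reads slices from disk with seek/read."""

    def __init__(self, path, start, stop):
        self.path, self.start, self.stop = path, start, stop

    def __getitem__(self, sl):
        # Only contiguous slices are handled; any step is ignored.
        lo, hi, _ = sl.indices(self.stop - self.start)
        with open(self.path, "rb") as fh:
            fh.seek(self.start + lo)
            return fh.read(max(hi - lo, 0)).decode("ascii")


class InMemoryRecord(SeekRecord):
    """Hypothetical record that loads its whole sequence up front."""

    def __init__(self, path, start, stop):
        super().__init__(path, start, stop)
        with open(path, "rb") as fh:
            fh.seek(start)
            self.seq = fh.read(stop - start).decode("ascii")

    def __getitem__(self, sl):
        return self.seq[sl]


class TinyFasta:
    """Hypothetical container; record_class picks the access strategy."""

    def __init__(self, flat_path, index, record_class=SeekRecord):
        self.records = {
            key: record_class(flat_path, start, stop)
            for key, (start, stop) in index.items()
        }

    def __getitem__(self, key):
        return self.records[key]


with tempfile.TemporaryDirectory() as tmp:
    flat_path = os.path.join(tmp, "toy.flat")
    with open(flat_path, "wb") as out:
        out.write(b"ACGTACGTTTTTGG")  # chr1 followed by chr2, flattened
    index = {"chr1": (0, 8), "chr2": (8, 14)}
    on_disk = TinyFasta(flat_path, index)["chr2"][1:4]
    in_memory = TinyFasta(
        flat_path, index, record_class=InMemoryRecord
    )["chr2"][1:4]

print(on_disk, in_memory)
```

Going by the text above, the equivalent pyfasta call would take the form Fasta(path, record_class=MemoryRecord), with the record class passed as a class rather than as an instance.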

In order to efficiently access the sequence content, pyfasta saves a separate, flattened file with all newlines and headers removed from the sequence. In the case of large fasta files, one may not wish to save 2 copies of a 5GB+ file. In that case, it’s possible to flatten the file “inplace”, keeping all the headers and retaining the validity of the fasta file; the only change is that the newlines are removed from each sequence. This can be specified via flatten_inplace=True.
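
What an in-place flatten leaves behind can be shown on a small string. This sketch only demonstrates the resulting layout (headers kept, each sequence joined onto one line); it works on text in memory rather than rewriting a file, and the function name flatten_inplace_text and the toy records are assumptions made for the example.

```python
def flatten_inplace_text(fasta_text):
    """Join each record's sequence lines into one line, keeping headers."""
    out = []
    seq_parts = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            # Emit the previous record's sequence before the next header.
            if seq_parts:
                out.append("".join(seq_parts))
                seq_parts = []
            out.append(line)
        elif line.strip():
            seq_parts.append(line.strip())
    if seq_parts:
        out.append("".join(seq_parts))
    return "".join(line + "\n" for line in out)


# Toy input (hypothetical data) with wrapped sequence lines.
wrapped = ">chr1 first\nACGT\nACGT\n>chr2\nTT\nGG\nA\n"
flattened = flatten_inplace_text(wrapped)
print(flattened, end="")
```

The output is still a valid fasta file, so other tools can keep reading it, while each sequence now occupies one contiguous span that can be addressed by offset.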