Have you tried to open the file with the suggested programs from the link (TextPad and UltraEdit)? A few weeks ago, I opened a 3.5GB csv file with Excel.
– Tasos Nov 7 '13 at 10:25


Data like this shouts 'database'. Pull it into any RDBMS you have available (they all have import tools); 4GB is no issue for them. Drop the columns you don't need, or create views exposing only the required columns.
– Jan Doggen Dec 29 '14 at 12:35

We frequently use bash or other shells on unix/OS X for basic file manipulation. If you only need to subset it, commands like grep, cut, and sed can get you a long way. If you want to work with it extensively and frequently, a SQL database might be a better fit. I don't know if SQLite would satisfy your requirements, but there are also heavy-duty open-source tools like MySQL and PostgreSQL.
– respectPotentialEnergy Jan 2 '15 at 13:44
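For example, a rough sketch of that kind of shell subsetting (the pattern and column numbers are invented, and note the caveat discussed further down: cut splits naively on commas, so quoted fields containing commas will break):

grep -i 'hospital' data.csv | cut -d',' -f1,2,5 > hospitals_subset.csv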

This is a perfect use case for Pandas in Python. Check it out!
– Blairg23 Jun 9 '16 at 19:40
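For instance, a minimal pandas sketch that reads the file in chunks so the whole 4GB never has to sit in memory at once (the file name and the "type" column are hypothetical):

import pandas as pd

# read the CSV in 100,000-row chunks instead of loading it all at once
chunks = pd.read_csv("data.csv", chunksize=100_000)
# keep only the rows of interest from each chunk
hospitals = pd.concat(chunk[chunk["type"] == "hospital"] for chunk in chunks)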

PSPad is totally free, not crippleware, and supports regex search and replace, shows hidden characters (useful because I get some bad data from customers), opens large files, and much more. I think it can also run JavaScript on the currently open tab, but that requires programming.
– Bulrush Sep 6 '18 at 11:28

As you're only taking a portion of the file, you may be able to use simple tools to subset it before processing. That may get it down to a reasonable size to work with.

If you're working on a POSIX (i.e., unix-like) system, you can use shell commands to reduce the file:

zcat -f filename | grep (pattern to match hospitals only) > outputFile

This lets you extract the matching lines without uncompressing the larger file on disk. (With today's systems, drive speed is often the limiting factor, so working with the data compressed can be more efficient if you have sufficient CPU.)

If it were tab-delimited, pipe-delimited, or fixed-width, you could also reduce the columns with the unix cut command ... but it's hit-or-miss with CSV, as it'll break horribly if there are commas inside quoted strings:
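For example, a sketch assuming pipe-delimited input where only columns 1, 3 and 7 are needed (the column numbers are invented for illustration):

cut -d'|' -f1,3,7 filename > outputFile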

If you are using Windows, you can install something like Cygwin (cygwin.com) and you'll be able to use the commands above (cat, grep, etc.).
– Jeremiah Orr Nov 13 '13 at 18:58


This is no good when there are new-lines inside cells. Use tools designed for CSV.
– D Read Nov 27 '13 at 12:47

@DRead: quite true. I'd personally argue that if you've got newlines in your cells, CSV might not be the best format for storing your data. Unfortunately, we're a bit at the mercy of whatever formats people publish their data in.
– Joe Nov 27 '13 at 14:06


I think this is a bit of a downside of CSV, but I don't think the existence of new-lines should deter you from it, as there are usually more important reasons for using it. Would you say the same about commas quoted inside cells? They make it just as painful for tools like sed and awk to work on cells. I think we just have to accept that although CSV has many advantages because of its simplicity, it's not as simple to parse as many believe, and you need to use proper tools and libraries.
– D Read Nov 28 '13 at 10:38

Others have mentioned ways to pull apart this file incrementally. It seems to me that you are also concerned about the resources a large file consumes. For some solutions you can read the compressed file incrementally, uncompressing as you go, and feed it through the csv module. For example, in Python with gzip'd input you would do it as follows:

import csv
import gzip

# open in text mode ("rt") so csv.reader receives strings rather than bytes
with gzip.open("test.csv.gz", "rt", newline="") as f:
    reader = csv.reader(f)
    for row in reader:
        pass  # do something with the row

You could do the same thing for zip archives with zipfile.
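A minimal sketch of the zip variant (file names assumed); zipfile hands back a binary stream, so io.TextIOWrapper is used to decode it for the csv module:

import csv
import io
import zipfile

with zipfile.ZipFile("test.csv.zip") as archive:
    # stream the member file without extracting it to disk
    with archive.open("test.csv") as raw:
        reader = csv.reader(io.TextIOWrapper(raw, encoding="utf-8", newline=""))
        for row in reader:
            pass  # do something with the row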

If this data is coming across a network, you can pay very little, or sometimes even gain performance, by not copying and uncompressing it first. It can cost less to pull it across the network compressed (smaller), in pieces, and operate on it as you go than it does to pull it across in bulk, uncompress it, and then read it. It also doesn't use up local storage.
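As a sketch of that idea (the URL is hypothetical), you can wrap the HTTP response in gzip and decompress while streaming, so nothing is written to disk:

import csv
import gzip
import urllib.request

with urllib.request.urlopen("https://example.com/data.csv.gz") as resp:
    # decompress on the fly as bytes arrive over the network
    with gzip.open(resp, mode="rt", newline="") as f:
        for row in csv.reader(f):
            pass  # process each row as it arrives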

On Windows, there is also a program called Delimit ("Open data files up to 2 billion rows and 2 million columns large!"): http://delimitware.com. Among other things, it can split and sort files and extract only selected rows or columns.

Load the file into a PostgreSQL database table with a COPY statement. This gives you the full power of SQL syntax, plus the ability to index columns for faster access.
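A minimal sketch of that workflow from psql (the table and column names are invented; match them to the CSV's actual header):

-- define a table matching the CSV layout
CREATE TABLE providers (id integer, name text, type text, state text);
-- bulk-load the file from the client side; HEADER skips the header line
\copy providers FROM 'data.csv' WITH (FORMAT csv, HEADER true)
-- index the columns you filter on for faster access
CREATE INDEX ON providers (type);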

For complex queries you have an optimizer that can figure out the fastest way to access the data. PostgreSQL also has smarter I/O than most applications: it detects sequential read access and reads ahead to pre-load data into memory.

Results can be viewed in Excel or other spreadsheets by accessing the data via an ODBC driver.

You can also access PostgreSQL via several cloud services such as Heroku or AWS. It would make a good cloud project: create a new machine instance, install PostgreSQL, copy the data to the instance, load it with the Postgres COPY command, and then access it with a PostgreSQL client application from your desktop.

I have used Perl to read 2GB text files before. Usually I use tab-delimited data as I don't need a module to process it. In fact, Perl is so fast reading text files that I convert large XLS files to tab-delimited text files before reading them. Perl can also read line-by-line for ginormous files. But I just read the whole 2GB file into an array at once.
– Bulrush Dec 31 '15 at 13:08

I have used utilities such as (g)awk to read large files like this record by record. I then extract the required information from each line and write it to an output file. For Windows users, (g)awk is available in Cygwin. I have also used Python to achieve the same result; you could implement this process in most programming languages.
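For instance, a rough gawk one-liner in that spirit (the column number and pattern are invented, and like cut it splits naively on commas, so quoted fields will trip it up):

gawk -F',' '$3 ~ /hospital/' input.csv > output.csv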

Talend Open Studio (TOS) can be used for many use cases, including data migration, file processing, etc.
You can easily build jobs using its visual editor by combining specialized connectors (read CSV files, select rows matching your criteria, write the result to one or more files or directly to a database, and more). The possibilities are endless because there are more than 800 connectors.

In the end, TOS generates a Java application which can be launched from the designer or from the command line (Windows or Unix).

If you're on Windows, I can't sing the praises of LogParser highly enough. It lets you query files in a wide variety of formats (mostly log formats, as that's what it was meant for, but XML and CSV are supported too). You query the file with a surprisingly complete SQL syntax, and you can even use it to import an entire file directly into an SQL database very easily.

It's extremely fast, too. I've used it to process 5 GB of IIS log files, and a query as complex as SELECT DISTINCT cs-username, cs(User-Agent) FROM .\*.log WHERE cs-username IS NOT NULL took about 2 minutes to run on my laptop.
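For example, a hedged sketch of filtering a large CSV down to a smaller one from the command line (file and field names are invented; with -i:CSV, LogParser takes field names from the header row):

LogParser.exe -i:CSV -o:CSV "SELECT * INTO filtered.csv FROM data.csv WHERE type = 'hospital'"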

I also like the Python solution, but for working with CSV files it may be safer to use the csv module (docs.python.org/2/library/csv.html). One reason is that without csv, splitting on the delimiter doesn't work if a field contains the delimiter.
– philshem♦ Dec 31 '14 at 8:49

Yes, you are right, in certain cases it might be the better choice!
– user4331 Dec 31 '14 at 16:21

split -d -l 10000 source.csv tempfile.part.

This will split source.csv into files containing 10,000 lines each, named tempfile.part.00, tempfile.part.01 ...

You can then combine the multiple output files from your selections back into one csv file, either by naming each piece and piping them into one file or by concatenating all files within a folder into one output file. (The tool for simple concatenation is cat rather than join, which merges files on a common field; check the manual pages for details.)
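For example, assuming the pieces were produced by split as above, cat stitches them back together in order:

cat tempfile.part.* > combined.csv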

I haven't used this approach before, but I have a question for you. If I use split on Windows or Linux, does split need to load the whole csv file into RAM? If my file is really big, can split handle it? Thank you in advance.
– Tasos Nov 12 '13 at 22:14

'split' is no good if there are carriage returns inside cells. Stick to tools designed for CSV.
– D Read Nov 27 '13 at 12:45

Use csvs-to-sqlite to import the CSV file into a SQLite database. You can do this as a mapping into a single database table (for example: csvs-to-sqlite myfile.csv mydatabase.db), or you can use the tool to split the data into separate tables automatically linked by foreign key relationships;

then query the SQLite database directly, or use datasette to publish a simple HTML interface that supports faceted and/or form-driven searching and filtering, as well as a form that lets you enter and run arbitrary read-only SQL queries. For example, datasette serve -p 8899 mydatabase.db serves the datasette browser on localhost port 8899.

As an added benefit, the datasette service also provides an API that lets you run arbitrary SQL queries against the data via a simple URL and retrieve the results as JSON.
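You can also query the same SQLite file from Python directly; a minimal sketch (csvs-to-sqlite names the table after the source CSV, so "myfile" is assumed here):

import sqlite3

conn = sqlite3.connect("mydatabase.db")
# table name follows the source CSV's name ("myfile" assumed)
for row in conn.execute("SELECT * FROM myfile LIMIT 10"):
    print(row)
conn.close()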
