Get and Filter Files in Directory

I have a JMP script that prompts the user for a directory and then filters for specific file types within that directory. My problem is that the filtering step often takes longer than getting the files in the first place. If the number of files is large (>100,000 files), the total time to collect and filter them can be more than 30 minutes. Is there a faster or more efficient way to do this? I am using JMP 11. Thanks!

Yes, rework the filtering loop to remove the N^2 behavior. JSL { lists } reach an element by walking from the front of the list, so getting to the i'th element means stepping over i elements. You looped from the back to keep the deleted elements from messing up the indexing, but that is pretty much the worst case for this kind of list: nearly every access walks almost the whole list. Here's a reworked version that removes the front-most element from the list of files, checks it, and, if it matches, inserts it as the front-most element of the filtered list (so the filtered list comes out in reverse order).

dir = "c:\";

// Time the directory scan.
t1 = Tick Seconds();
files = Files In Directory( dir, Recursive );
t2 = Tick Seconds();
t_getfiles = t2 - t1;
n_getfiles = N Items( files );

// Time the filter: repeatedly pull the first item off files
// and keep it if its name ends in ".jmp".
filteredFiles = {};
t3 = Tick Seconds();
While( (testname = Remove From( files, 1 )) != {},
	If( Ends With( testname[1], ".jmp" ),
		Insert Into( filteredFiles, testname, 1 )
	)
);
t4 = Tick Seconds();
t_filterfiles = t4 - t3;
n_filterfiles = N Items( filteredFiles );

Show( t_getfiles, n_getfiles, t_filterfiles, n_filterfiles );

t_getfiles = 4.0333333333333;
n_getfiles = 171630;
t_filterfiles = 258.5;
n_filterfiles = 975;

About 5 minutes (258.5 seconds of that is the filtering).

Another idea: you can load the file names into a data table and let a column formula do the filtering, like this:

dir = "c:\";

// Time the directory scan.
t1 = Tick Seconds();
files = Files In Directory( dir, Recursive );
t2 = Tick Seconds();
t_getfiles = t2 - t1;
n_getfiles = N Items( files );

// Time the filter: one row per file, with a formula column
// that is 1 when the name ends in ".jmp".
t3 = Tick Seconds();
dt = New Table( "directory",
	Add Rows( 0 ),
	New Column( "filename", Character, Nominal, Set Values( files ) ),
	New Column( "isTable",
		Numeric,
		Continuous,
		Format( "Best", 12 ),
		Formula( Ends With( :filename, ".jmp" ) )
	)
);
dt << Run Formulas; // make sure isTable is evaluated before selecting
dt << Select Where( :isTable == 1 );
dtFiltered = dt << Subset( Selected Rows( 1 ) );
t4 = Tick Seconds();
t_filterfiles = t4 - t3;
n_filterfiles = N Rows( dtFiltered );

Show( t_getfiles, n_getfiles, t_filterfiles, n_filterfiles );

t_getfiles = 4.01666666666642;
n_getfiles = 171635;
t_filterfiles = 0.716666666666697;
n_filterfiles = 975;

About 5 seconds in total, with less than a second of that spent filtering. The << Run Formulas message is required: without it, the data table can still be evaluating the formula for the isTable column when Select Where runs, so the selection finds nothing and the subset comes out empty.
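If you would rather end up with a list than a second data table, a variation on the same idea might look like the sketch below. It is untimed and untested on JMP 11, so treat it as an assumption rather than a measured result: it skips the formula column and asks the table for the matching row numbers with << Get Rows Where, then subscripts the filename column with those rows. The names rows and filteredFiles are just illustrative.

```jsl
dir = "c:\";
files = Files In Directory( dir, Recursive );

// One row per file name; no formula column, so nothing to wait on.
dt = New Table( "directory",
	New Column( "filename", Character, Nominal, Set Values( files ) )
);

// Row numbers (as a matrix) of the names ending in ".jmp".
rows = dt << Get Rows Where( Ends With( :filename, ".jmp" ) );

// Subscripting a character column with a matrix of rows
// is assumed here to return the values as a list.
filteredFiles = dt:filename[rows];

// The table was only scaffolding for the filter.
Close( dt, No Save );

Show( N Items( filteredFiles ) );
```

Because there is no formula to evaluate, this version should not need << Run Formulas, and the filtered names stay in their original order.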
