Data processing

Applications typically have a GUI part and a data processing (DP) part. The DP part performs the actual calculations and data manipulation. A good implementation is modular, which implies a clear separation between the GUI and DP parts.

In this section you will develop the DP part of a simple application. In the next section you will develop the GUI part.

The DP part of the application is specified as follows:

The input is the name of a text file. The output is a string that displays as a table containing the file name, a count of lines, a count of characters, and one row for each distinct character in the file with a count of how many times it appears. The rows for the distinct characters should be sorted by their counts.

You need to define a verb report that takes a filename as an argument and returns the specified result. You'll build pieces of the definition in the ijx window and then put them all together into the definition in a script.

The input is a filename, and in the report verb it will have the name y, so start by working with y in the ijx window.

y =. 'user\text.txt'

Read the file.

d =. fread y

The report will have two columns. The first column will be the labels 'File:', 'Lines:', 'Chars:', and each distinct character in the file. The second column will be the value for that row. Since the data is a mixture of text and numbers it makes sense to build the result as boxed data.
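As a minimal sketch of how the first three rows might be built as a boxed table, the lines below use a stand-in value for d so they run without a file. The sample text and the resulting counts are assumptions for illustration only; in your session d comes from fread.

```j
NB. Sketch only: d is a stand-in for the text returned by fread y.
y =. 'user\text.txt'
d =. 'abc',LF,'bcb',LF
r =. 'File:' ; y                 NB. first row: label ; value
r =. r ,: 'Lines:' ; +/ d = LF   NB. second row: number of LF characters
r =. r , 'Chars:' ; # d          NB. third row: number of characters
r
```

Each row is a label linked to its value, so r is a table of boxes with three rows and two columns.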

The next thing is to add the rows with the characters and their frequency counts. Each character is the label and its count is the value, so this just adds more items to r. Let's postpone that part of the problem, and work instead on converting the boxed table to the string result required by the spec. Use a comment to mark the bit we are skipping over for now.

NB. need to add frequency rows to r here

The numbers in the second column need to be converted to characters. The easiest way is to convert the contents of every box to characters: boxes that already hold characters are unaffected, and boxes that hold numbers are converted.
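One way to do the conversion, sketched here with a stand-in table, is to apply format (":) inside each box. The values in r below are assumed sample data, not results from a real file.

```j
NB. Sketch only: r is a stand-in table with numeric values in column two.
r =. 3 2 $ 'File:' ; 'user\text.txt' ; 'Lines:' ; 2 ; 'Chars:' ; 8
r =. ": each r    NB. format the contents of every box as characters
r
```

You can check the result with 3!:0 each r, which should report the character type (2) for every box.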

The display of r with all characters looks the same, but each box now contains characters.

The next step is interesting and the details are left for you to puzzle out. It adds a TAB after each label and an LF after each value. In the final result the TAB separates the label from its value, and the LF causes a new line for the next label. The boxed display shows the TAB and LF as blanks, but they really are in there.
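If you want something to experiment with, here is one possible form of this step, applied to a stand-in table that already contains only characters. The sample values and the name s are assumptions for illustration. Working out why the rank is needed is still left to you.

```j
NB. Sketch only: r is a stand-in table whose boxes all hold characters.
r =. 3 2 $ 'File:' ; 'user\text.txt' ; 'Lines:' ; '2' ; 'Chars:' ; '8'
r =. r ,each"1 TAB;LF   NB. append TAB to each label and LF to each value
s =. ; r                NB. raze the boxes into a single string
s
```

Displaying s shows one label and value per line, with a TAB between them.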

Now calculate the frequency rows. You need a verb freq that returns a table of boxes in which the first column holds the distinct characters and the second column holds the number of times each one appears in the file. The argument to freq is the file data, and inside freq it will have the name y, so let's start with y defined as the file data.

y =. fread 'user\text.txt'

The data can include TAB, CR, and LF characters and they should be removed. The dyad -. (less) can remove these unwanted characters.

d =. y -. TAB,CR,LF
d
abcbcb

The utility nubcount, defined in the script misc.ijs, returns a table of boxes in which the first column contains the distinct items of its argument and the second column contains their counts.
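To see the shape of such a table without loading the script, the sketch below builds a similar result from primitives. It is a stand-in written for illustration, not the definition of nubcount, and the name t is hypothetical.

```j
NB. Sketch only: a stand-in for the kind of table described above.
d =. 'abcbcb'
t =. (~. d) ;"0 #/.~ d   NB. distinct characters linked with their counts
t
```

Here ~. d gives the distinct characters in order of first appearance, and #/.~ d counts the occurrences of each, so for this d the rows pair a with 1, b with 3, and c with 2.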