Currently, I am working on a little demo app, trying to figure out the advantages and disadvantages of using lists or arrays. With "lists" I mean simple blocks of text written to memory and accessible using simple chunk expressions ("item x of line z of myVariable"). Here I use a return-and-tab delimited list for storing text data.

I assume that in 90% of all cases we are just working with text, or with numbers which are stored as text but look like numbers.

I wrote a test to compare arrays and lists at small and increasingly larger sizes.
I am working with LiveCode latest version 9.0 (dp 6) on Windows 10, 64 bit.

global gTableArray, gListArray
local sTestResult

on mouseUp
   lock screen
   set cursor to busy
   testArrayVizList
end mouseUp

command testArrayVizList
   put empty into msg
   put empty into sTestResult
   set the itemDelimiter to tab
   put "50,500,5000,50000,500000,1000000,2000000" into tTestIterations
   put replaceText(tTestIterations, ",", tab) into tTestIterations
   repeat with i = 1 to the number of items of tTestIterations
      put item i of tTestIterations into tIteration
      put tIteration & tab after sTestResult
      put buildArray(tIteration) & tab after sTestResult
      put buildList(tIteration) after sTestResult
      put cr after sTestResult
   end repeat
   delete the last char of sTestResult
   put sTestResult into msg
end testArrayVizList

function buildList pIteration
   put the milliseconds into startSec
   put empty into gListArray
   put "A,B,C,D,E,F,G,H,I,J,K,L,M,N,O,P,Q,R,S,T,U,V,W,X,Y,Z" into tCol
   put the number of items of tCol into tNum
   put 0 into i -- init local counter
   repeat pIteration
      add 1 to i
      put item random(tNum) of tCol into tChar
      put i & "," & tChar & cr after gListArray
   end repeat
   delete the last char of gListArray
   return (the milliseconds - startSec) / 1000
end buildList

function buildArray pIteration
   put the milliseconds into startSec
   put empty into gTableArray
   put "A,B,C,D,E,F,G,H,I,J,K,L,M,N,O,P,Q,R,S,T,U,V,W,X,Y,Z" into tCol
   put the number of items of tCol into tNum
   put 0 into i -- init local counter
   repeat pIteration
      add 1 to i
      put item random(tNum) of tCol into tChar
      put tChar into gTableArray[i]["col"]
   end repeat
   return (the milliseconds - startSec) / 1000
end buildArray

I used globals for testing purposes, since I can access global variables or a global array from anywhere, including the message box. (Usually I would use script-local variables.)

Interestingly enough, when smaller amounts of data are used, the list performs better than the array. With large amounts of data, the situation changes. In my case, 500,000 repeats is the break-even point: from there on, building a list takes more and more time, while the array starts to perform better.

Here are the results in seconds (with millisecond precision):

[Image: Iterations Arrays and Lists.JPG]

There is a problem: when building lists or arrays with something like 3-5 million repeats in this specific test case, LiveCode becomes non-responsive. On Windows it shows a spinning icon. I thought it would become responsive again, so I let the application run for several hours, but then I had to force-quit it. I have run into this situation in many cases when large amounts of data are loaded into memory. Is this a bug? Or is this something one should know beforehand?

LiveCode has a pretty heavy system load. It uses 500 MB of memory, compared to Microsoft Excel with 24 MB and Google Chrome with 130 MB. I was also wondering about this.

Questions:

Assuming that only text data is used (no images, no binary files to be stored), would simple lists not be better than arrays? Or is this a matter of personal preference (style/choice)?

I can't give you any advice about the array-versus-list choice - it is usually a matter of preference and expediency. In the situation of a non-responsive repeat loop, though, include a line within the loop:

wait 0 milliseconds with messages

function alphabetizeFld aString
   local tWordListArray
   repeat for each trueWord tWord in aString
      put tWord & cr after tWordListArray[char 1 of tWord]
   end repeat
   return tWordListArray
end alphabetizeFld

...and create an alphabetical list of all the words in a text field. You could create a bunch of variables to do this, but you might not need all of them, and it would leave a lot of script variables to deal with whether they are used or not. This way is cleaner: the containers are created as needed, and it doesn't matter which letters actually occur, because a bucket only comes into existence when a word starting with that letter appears.
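One handy follow-up (a sketch; tWordListArray is a stand-in for whatever variable the handler above fills) is to walk the buckets in alphabetical order and report how many words landed under each letter:

put the keys of tWordListArray into tLetters
sort lines of tLetters
repeat for each line tLetter in tLetters
   put tLetter & ":" && the number of lines of tWordListArray[tLetter] & cr after tReport
end repeat
put tReport into msg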

Keys were a difficult thing to get my head around at first, but it's really simple. An array is a collection of name/data pairs, and the key is the name. So when you "get the keys" of an array, what you're basically doing is...

put the name of each dataPair & cr after tDataNames

...only much more streamlined.

At that point you can do anything to tDataNames that you could do to any other container of lines: search, sort, filter, etc.

After that just run it back through the array, for example:

put the keys of myArray into tDataNames
filter tDataNames without tThingIDontWant
repeat for each line tKey in tDataNames
   put myArray[tKey] & cr after field "myFld"
end repeat

or send an array to the message box so you can get a look at the contents

Hans-Helmut wrote: There is a problem: when building lists or arrays with something like 3-5 million repeats in this specific test case, LiveCode becomes non-responsive. On Windows it shows a spinning icon. I thought it would become responsive again, so I let the application run for several hours, but then I had to force-quit it. I have run into this situation in many cases when large amounts of data are loaded into memory. Is this a bug? Or is this something one should know beforehand?

LiveCode will generally do what you ask of it: if you ask it to run in a tight loop, it'll run in a tight loop. It should not, however, trigger the Windows warning about being unresponsive; IMO that would be a bug. If you have a recipe for reproducing it, you may consider filing a bug report: http://quality.livecode.com/

LiveCode has a pretty heavy system load. It uses 500 MB of memory, compared to Microsoft Excel with 24 MB and Google Chrome with 130 MB. I was also wondering about this.

The core LiveCode engine takes very little RAM. For example, when run as a faceless process it can take as little as 6 MB. But as we add objects and data into its memory space, it can be larger roughly in proportion to what we're asking it to work on.

The standalones we make with LC often require far less memory than the LC IDE, because the IDE has a lot of stacks and data to maintain.

The IDE itself though, large as it is, doesn't account for such a large amount of RAM. 500 MBs is a lot; I've only seen that when working on unusually large amounts of data. On my Windows box the full LC IDE with just a couple modest stacks open takes less than 70 MB, and on Linux around 60 MB.

What sort of data were you working on at the time? And which version of LC and of Windows?

Assuming that only text data is used (no images, no binary files to be stored), would simple lists not be better than arrays? Or is this a matter of personal preference (style/choice)?

Both are very useful. Like a screwdriver and a wrench, they each have a place in your toolbox.

It may be helpful to think of "lists" as delimited strings, as it can help anticipate where they're efficient and where they're not.

When we use chunk expressions (the generic term for expressions that allow graceful traversal of strings, like "get word 2 of item 3 of line 4"), the engine needs to examine each character to find the delimiter in question, keeping count as it goes.

Chunk expressions make for wonderfully readable code, and are often quite efficient, certainly more so than counting delimiters and maintaining pointers on your own as we'd need to do in lower-level languages.

And for aggregate operations across an entire data set, they can often be faster. Using "repeat for each..." as you've done above lets the engine parse only one chunk each time through the loop, keeping track of where it left off, and often those are pretty darn fast.
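The difference can be sketched like this (a sketch; tData is assumed to hold a tab-and-return delimited list like the one built above):

set the itemDelimiter to tab
-- "repeat for each" parses incrementally; the engine remembers where it left off
repeat for each line tLine in tData
   put item 2 of tLine & cr after tFast
end repeat
-- "repeat with" re-scans tData from the top on every "line i" reference
repeat with i = 1 to the number of lines of tData
   put item 2 of line i of tData & cr after tSlow
end repeat

With a large tData, the second loop gets slower as i grows, while the first stays steady.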

Random access of specific elements within a data set, however, is where arrays can shine, sometimes offering an order-of-magnitude performance gain if the data set is large enough.

Arrays are essentially a collection of pointers to specific memory locations, in which each key is run through a simple hash to arrive at the address where the data resides. The hash used is a very simple one, so it tends to take few clock cycles, with the benefit being that once finished the engine knows exactly where to look for the contents associated with a given key.

So when you ask for, say, tArray["10000"], the key "10000" is hashed to obtain the address where the data associated with it is found, and the engine just grabs that and returns it.

So for any task that requires access to specific elements within the data, you'll find arrays are usually much faster. The amount of time needed to get a chunk from a delimited list is proportionate to the amount of the data preceding it, whereas time to access an array element is nearly constant regardless of how many elements are contained within it or the size of the data in those elements.
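A minimal sketch of measuring that difference, assuming gListArray and gTableArray were built with at least 500,000 iterations as in the test above:

put the milliseconds into tStart
get line 500000 of gListArray -- the engine scans and counts return characters up to here
put the milliseconds - tStart into tListTime
put the milliseconds into tStart
get gTableArray[500000]["col"] -- the key is hashed straight to an address
put the milliseconds - tStart into tArrayTime
put tListTime && tArrayTime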

Which one is best will often depend on the nature of the task, and the considerations aren't limited to performance.

For example, array elements can hold any data, including data with tabs and returns, and even binary data. Strings, of course, can contain only text, and you must remain mindful that a given segment doesn't contain any of the characters used to delimit it.

Strings are easy to conceptualize and work with - they can be easily displayed in a field any time you want to see what's in them. Since arrays are a collection of addresses, they have no single contiguous representation that allows us to see them as a whole in one readable form.

Strings are great for reading and writing to disk, easily interoperable with nearly any other program. If you use tab and CR, you can open your text in LibreCalc or Microsoft Excel as easily as dropping it into a LiveCode field.

Arrays can also be stored to disk, but first they need to be serialized, converted from the collection of memory addresses to a contiguous bytestream. In LC we do this with arrayEncode, and we can read the resulting binary data back into an array by converting it with arrayDecode. LC's encoded arrays use a unique format useful only in LC; they're structurally similar to MongoDB's BSON format, but not compatible with it. Great for exchanging data between LC processes, but not for exchanging data with other processes (though there are JSON translators available).
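A sketch of that round trip (the file name and path are just examples):

put arrayEncode(gTableArray) into tBinary
put tBinary into URL ("binfile:" & specialFolderPath("documents") & "/myData.lcarray")
-- later, or in another LC process:
put URL ("binfile:" & specialFolderPath("documents") & "/myData.lcarray") into tBinary
put arrayDecode(tBinary) into gTableArray

Note the "binfile:" scheme: the encoded array is binary data, so "file:" (which translates line endings) would corrupt it.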

Both delimited strings and arrays are very powerful and useful. As you spend more time with LC, chances are you'll use both liberally.

Richard Gaskin
Community volunteer, LiveCode Community Liaison
LiveCode development, training, and consulting services: Fourth World Systems: http://FourthWorld.com
LiveCode User Group on Facebook: http://FaceBook.com/groups/LiveCodeUsers/

I usually use lists - they are quite fast and easy to maintain. When there's a lot of data, I think of using arrays - as you showed, there's a point where they become faster.

Special cases:

When I have lists with many items where:
- a lot of different items need to be changed during processing,
- lines need to be added or deleted,
- or I'm not sure whether the last item of a line might be empty,
I use arrays.

If I'd need to loop using "repeat with" (and have more than a few lines), I switch to arrays and "repeat for each".

If I have a lot of code to write for the data, I switch to arrays: "myArray[23][42]" is less fingertip-abrasive than "item 42 of line 23 of myLine". ;-))

In any case I often keep the original data untouched, write the changed data into a new variable, and replace it when done.

Extra special case:
Playing with really big data (in this thread) I found out that sometimes it's faster to build the replacement variable as a continuously appended file while processing the data! Imagine - writing to a file can be faster than appending to a variable!
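A sketch of that pattern (tPath and the transformLine function are assumptions for illustration):

open file tPath for append
repeat for each line tLine in tBigData
   -- write appends to disk without re-growing a large variable in memory
   write transformLine(tLine) & return to file tPath
end repeat
close file tPath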

Btw., I had much fun re-reading the thread Monki linked above - that's where I learned to wrap my brain 'round arrays ;-) Since then I use them quite often; especially the form "myArray[23][42]" has become a welcome tool.

FourthWorld wrote:Random access of specific elements within a data set, however, is where arrays can shine, sometimes offering an order-of-magnitude performance gain if the data set is large enough.

Arrays are essentially a collection of pointers to specific memory locations, in which each key is run through a simple hash to arrive at the address where the data resides. The hash used is a very simple one, so it tends to take few clock cycles, with the benefit being that once finished the engine knows exactly where to look for the contents associated with a given key.

Really informative post, thanks. Something sort of related to this topic that I've been wondering about is switch vs. if...then...else. I was wondering if the above is how the switch structure is able to jump directly to the matching case? I'm a bit late moving to the switch structure, but it seems much faster when it can be used.

monki wrote:Really informative post, thanks.
Glad that was useful. Thanks for the feedback.

...is switch vs. if...then...else. I was wondering if the above is how the switch structure is able to jump directly to the matching case? I'm a bit late moving to the switch structure, but it seems much faster when it can be used.

Conditionals are so quick to execute it can be difficult to measure differences between forms unless you take the time to carefully craft a good test case. IMO it may not even be worth the effort, as any difference would likely be minimal compared to the developer time using the one that best fits the logic you want to express.

That said, there is one thing about both forms of evaluation that may be helpful here: LC uses "lazy" (short-circuit) evaluation, in that it only evaluates as much of an expression as it needs, stopping once the rest can no longer change the outcome. For example, in:

if ( "A" = "B" AND "A" = "A") then
DoSomething
else
DoSomethingElse
end if

...only the first comparison ("A" = "B") will be evaluated. Once that's known to be false, it won't bother testing the second comparison because it can't make a difference in the outcome at that point.

Switch statements are similar: the cases are evaluated in the order they're written, and if a case uses a compound expression, only as much of it as needed will be evaluated - the engine moves on as soon as it can determine the rest no longer matters.

This can be helpful when deciding the order in which to arrange comparisons: if you put the comparison least likely to succeed first, the engine won't bother with anything beyond it, saving a few clock cycles.

Richard Gaskin

FourthWorld wrote: This can be helpful when deciding the order in which to arrange comparisons: if you put the comparison least likely to succeed first, the engine won't bother with anything beyond it, saving a few clock cycles.

I can't give you any advice about the array-versus-list choice - it is usually a matter of preference and expediency. In the situation of a non-responsive repeat loop, though, include a line within the loop:

wait 0 milliseconds with messages

Thank you. This code, put at the beginning of each repeat loop that touches large chunks of data, did the trick, although there is still some strange behavior of the application after the routine finishes.
Also, I think that for beginners such "wait" instructions are not obvious or intuitive. I suggest collecting such "best practice" hints for beginners as a starter in the basic LiveCode documentation. Beginners especially will profit from best practices with detailed enough explanations; they can save us countless hours of unproductive time - collectively, we are possibly talking about thousands of hours. Billing for that lost time would bankrupt the company, while paying us for the time saved would make it flourish.

I do think it is probably a bug if, without such an additional line, the application crashes when large amounts of data are put into memory. But even now, using "wait 0 milliseconds with messages", my application in LiveCode 9.0 dp 6 hangs after processing. The message box is frozen, the computer (Windows 10) signals a lot of heavy work, and the fan on my notebook runs at high speed. I will try to define a repeatable case for when this happens.

I have been testing in different ways, and I also want to thank everybody for the highly appreciated comments. Still, this is not over: the next step is to go into more complex structures and relationships with XML, hierarchical lists, and JSON. Understanding arrays in LiveCode seems crucial to me and to all "advanced" beginners.

My conclusion so far:

Of course, looking up a key value is what arrays do very efficiently. But simply looking up line 10,000,000 in a list is also a matter of milliseconds; here, the "index" is the line number or chunk expression. The time differences are not that big. Sorting is easier using lists. Searching and selecting is more difficult, since lists do not have "keys".

It seems that chunk expressions in LiveCode are very efficient, and they are a more natural way of addressing a piece of information if the data is well structured. When it comes to addressing key values (for example, finding the frequency of trueWords in large blocks of text), arrays will be the better choice. For any non-textual data, of course, arrays are the only way to go.

Thank you. This code, put at the beginning of each repeat loop that touches large chunks of data, did the trick, although there is still some strange behavior of the application after the routine finishes.
Also, I think that for beginners such "wait" instructions are not obvious or intuitive.

Agreed. The "This application is non-responsive" warning is unnecessarily scary, and should be avoided.

On macOS we used to have to include that "wait 0..." line to ensure redraws, and that was deemed a bug that has since been corrected. I would hope the same criteria would apply here, and I would encourage you to submit a bug report on the "non-responsive" warning: http://quality.livecode.com/

I do think it is probably a bug if, without such an additional line, the application crashes when large amounts of data are put into memory. But even now, using "wait 0 milliseconds with messages", my application in LiveCode 9.0 dp 6 hangs after processing. The message box is frozen, the computer (Windows 10) signals a lot of heavy work, and the fan on my notebook runs at high speed. I will try to define a repeatable case for when this happens.

Please do. If we can't find a way to optimize the scripts, we may need to optimize the engine. But either way, I'm sure we can get to a satisfying result. I look forward to exploring that routine with you; we've seen some good optimization explorations in these forums.

Of course, looking up a key value is what arrays do very efficiently. But simply looking up line 10,000,000 in a list is also a matter of milliseconds.

If you're only looking up a single line infrequently, it may not matter much. But for the reasons I outlined in my post above from the 21st, if you need random access to elements within a data set often, you should see significant gains with arrays. If you don't, we may want to review a complete test stack with sample data so we can make sure we're all poking around in the same use case.

Sorting is easier using lists. Searching and selecting is more difficult, since lists do not have "keys".

Good observations, along with the others you included. I appreciate the time you took writing those here so others can benefit.

Arrays have no sense of order in themselves, but their keys are easy to obtain and sort, and the sorted list of keys can be used to traverse the array in the desired order.
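That traversal can be sketched like this (assuming numeric keys, as in the test arrays above):

put the keys of gTableArray into tKeys
sort lines of tKeys numeric
repeat for each line tKey in tKeys
   put gTableArray[tKey]["col"] & cr after tOrdered
end repeat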

But of course that relies on the thing you want to sort being the key. If you want to sort on any element in a sub-array, more scripting will be required.

Where the computational overhead of delimited strings poses no impairment to performance, they are a joy to work with in LiveCode.

Richard Gaskin

Thank you, Richard. I will do this to identify the problem. Working with LC9 dp6 still leads to unresponsiveness and crashes from time to time, at different times and on different occasions, and it has not been possible to reproduce them with the same scenario. I must admit I am also a lazy guy and just want to continue working instead of documenting. When I find something serious that happens again and again, I will report it. LC9 dp6 is not a stable version, I know. The unresponsiveness related to managing large chunks of data in memory should be possible to identify, since I found it in earlier versions as well. I will work on this separately.

Hans-Helmut wrote: The unresponsiveness related to managing large chunks of data in memory should be possible to identify, since I found it in earlier versions as well. I will work on this separately.

How large is the data on disk?

Richard Gaskin