Crowd counting with a 360 camera and Amazon Mechanical Turk

A proof of concept to establish how timelapse 360 photography could be used to count crowds and examine general crowd dynamics at events.

Outdoor events of all types, especially those that are temporary, moving (parades, processions), free to attend or unticketed, typically have a difficult time establishing accurate attendance figures. Organisers may be incentivised to over-estimate these figures for funding or political reasons; they might also be incentivised to under-estimate them for licensing reasons. For sufficiently large events, or events that have taken place multiple times, organisers and emergency services will likely have produced an estimate, though the methods and assumptions behind these often vary and there is rarely a clear process by which the estimate can be examined or scrutinised. History and policy suggest that a scientifically accurate estimate of attendance is rarely a requirement; arguably, even ticketed and paid events could be wrong about their levels of attendance. The increasing study of events of all types means this is a key area of investigation, or at least a key ‘stat’ for discussion. The technical skill and time required to carry out head counting, sampling and wider estimation are generally beyond the usual time-pressed event organiser.

For similar underlying reasons, outdoor events also struggle to collect formal feedback in the same way as events with a clearly defined venue or those run by continuously operating organisations. Organisers and artists will have on-the-ground knowledge of what the general audience response to the event was. It is hard to survey the audience without ‘breaking’ their experience of the event and, in many cases, the practical resources required to do an effective job are seen to be too great for the value of the data collected. Participant observation, in the tradition of qualitative or ethnographic research, has value, though representativeness and bias are key disadvantages. While event organisers might be able to use such methods, arguably they are more focused on delivery than on documentation and analysis during events. These methods are also rarely documented and rely principally on the individual assumptions of the organisers and artists involved.

There may be a compromise to be made between qualitative documentation (valid but fuzzy) and quantitative surveying (accurate but abstract). There is at least room to experiment and see what quantitative approaches could add, even at a rudimentary level, whilst being practical and repeatable.

Overall, it is argued that basic levels of attendance and audience behaviour could both be better documented and understood at events, especially those that are outdoor, free to attend, temporary and not ticketed. Resources available for this kind of investigation are limited, so methods should be appropriate to the scale involved and the level of accuracy required.

The field work:

On a typical day at work, the researcher took a Ricoh Theta V camera to the food court at university to collect data.

This camera is capable of 360 degree photography. Essentially it has two 12 megapixel cameras facing in opposite directions. The camera stitches these images together to create an effective 360 degree view of its whole surroundings, rather than the fixed field of view of a conventional camera. At the time of writing these types of camera cost between £150 and £400.

It is relatively easy and discreet to operate so can be used by a non-specialist with little previous training in a variety of settings.

The camera was set to capture a photo every 10 seconds; other exposure settings were not adjusted.

The camera took 38 photos over a period of approximately 6.3 minutes (380 seconds).

The photos were of sufficient resolution and quality to be able to discern people in the food court within a radius of about 20 meters.

There was no particular objective at hand other than to see if it was technically possible to count the number of individuals in a given area and to investigate any possible crowd dynamics or behaviours.

Manual analysis

Each photo (frame) was manually inspected and the individual people in it were counted.

It was decided to count heads only, rather than bodies or partially obscured figures.

All 38 photos were inspected and the total figures were calculated. Summary statistics are given below:

| Frame | Mean | Median | Max | Min | Std Dev | Variance |
|---|---|---|---|---|---|---|
| Heads counted (HC) | 29.5 | 30 | 38 | 21 | 3.9 | 15.2 |

For the short duration covered, many individuals remained relatively stationary and will likely have been counted in most frames.

At this stage, and in this particular environment, a 10 second interval arguably gives more detail than is needed: very similar results would have been observed had the camera only taken a photo every minute:

| Frame | Mean | Median | Max | Min | Std Dev | Variance |
|---|---|---|---|---|---|---|
| HC 10 seconds | 29.5 | 30 | 38 | 21 | 3.9 | 15.2 |
| HC 1 minute | 28.7 | 29 | 35 | 21 | 4.5 | 19.9 |
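The comparison between the two intervals can be reproduced with a few lines of Python. This is only a sketch: the counts below are made up for illustration and are not the study data, the function names are hypothetical, and it assumes the sample (rather than population) standard deviation, which the write-up does not specify.

```python
import statistics


def summarise(counts):
    """Return the summary statistics used in the tables above."""
    return {
        "mean": round(statistics.mean(counts), 1),
        "median": statistics.median(counts),
        "max": max(counts),
        "min": min(counts),
        # Assumption: sample standard deviation and variance.
        "std_dev": round(statistics.stdev(counts), 1),
        "variance": round(statistics.variance(counts), 1),
    }


def subsample(counts, base_interval=10, key_interval=60):
    """Keep only the frames that fall on the longer interval."""
    step = key_interval // base_interval
    return counts[::step]


# Made-up head counts, one per 10 second frame (NOT the study data).
example_counts = [30, 31, 29, 28, 32, 33, 30, 27, 26, 29, 31, 34, 35]

print("every 10 seconds:", summarise(example_counts))
print("every minute:    ", summarise(subsample(example_counts)))
```

With 38 frames at 10 second spacing, `subsample` would keep 7 frames, so the one-minute statistics rest on a much smaller sample and should be read with that in mind.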

A further analysis looked at clearly identifiable groups or individuals who moved through and entered/exited the frame over this time. Some of these were selected fairly arbitrarily and were ‘colour coded’. The table below shows the total duration they were in frame.

| Group | Red | Green | Yellow | Blue | Purple |
|---|---|---|---|---|---|
| Seconds in frame | 380 | 100 | 160 | 100 | 90 |

The chart above also shows the total count and the approximate times at which the identified groups were present. The individuals in the Red group were present for the whole period monitored (they were sat down throughout), whereas the others entered and exited (buying food, then leaving). One (Blue) left the frame for around a minute, returned, then left again.

When looking to identify groups or individuals, in this case, all of the groups identified as ‘significant’ in some way appeared for at least 1 minute. This reinforces the assumption that 1 minute intervals would be sufficient (in this environment at least), and 2 of the 5 groups identified (Red: 380 seconds and Yellow: 160 seconds) would be certain to appear in multiple frames at this interval.
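The reasoning about which groups would show up in more than one frame at a 1 minute interval can be checked with simple integer division. The sketch below uses the durations from the table; the function name is hypothetical, and it assumes each group's time in frame was one continuous stay, which is not strictly true for Blue.

```python
def guaranteed_key_frames(seconds_in_frame, key_interval=60):
    """Minimum number of key frames a continuous stay of this length must span."""
    return seconds_in_frame // key_interval


# Durations taken from the table above.
seconds_in_frame = {"Red": 380, "Green": 100, "Yellow": 160, "Blue": 100, "Purple": 90}

multi_frame_groups = [
    group
    for group, seconds in seconds_in_frame.items()
    if guaranteed_key_frames(seconds) >= 2
]
print(multi_frame_groups)  # ['Red', 'Yellow']
```

A group present for 100 seconds could still be caught in two frames if the timing is favourable; the function only reports what is guaranteed.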

Overall, where a single photo might provide some indication of how many individuals were present, or of what groups or individuals were doing at the time, a series of photos allows for a more reliable conclusion, even if the analysis at this point is fairly mundane. In an event context, as well as the raw figures, we might also be looking at the balance of those who are clearly engaged in the experience (watching, interacting, taking photos, dancing, cheering…) and those who are less engaged or not engaged at all (bystanders, walking past, looking away, disinterest…). Across multiple events, it may be possible to usefully describe the audience response in relation to that at similar events, whether these differences are prompted by changes in the event itself or are a result of the audience themselves. Importantly, while the protected characteristics of audiences (and artists) are frequently a topic of debate, it is highly debatable whether photo evidence could be reliably or ethically used to estimate these features. Nevertheless, it could be used alongside more traditional survey data, which arguably is already the case with individuals using their own observations or more traditional photography.

Amazon Mechanical Turk

“Amazon Mechanical Turk (MTurk) operates a marketplace for work that requires human intelligence. The MTurk web service enables companies to programmatically access this marketplace and a diverse, on-demand workforce.”

For our purposes, it was possible to set up a task (‘Human Intelligence Task’) on the MTurk platform and have a range of individuals (‘workers’) complete the same task as the researcher, as detailed above: counting the number of heads in a given image. Other typical tasks on the platform include identifying items in a picture, marking features such as buildings on a map, and various data entry or categorisation tasks.

Remote workers complete small tasks for small incentives which are paid through Amazon credits. According to one survey, most users are based in the US. The platform was initially limited to US users but is opening up to more territories, including the UK (for both ‘Requesters’ and ‘Workers’).

The person creating the task (‘Requester’) sets the parameters and reward for the task. In this case, the researcher set a time limit of 5 minutes per task and set the reward at $0.06 per task, which seemed to be a similar rate to other tasks of this sort. The researcher also set the number of assignments for each task at two. This meant that two different workers would each give a number for each of the 38 images, creating 76 tasks (38 × 2) to be completed. Each task could potentially be completed by a different worker, or workers could choose to complete multiple iterations of the task.

A worker is presented with one of the series of images and asked to state in a text box how many heads they can count within the image. The requester can review these results and choose to accept or reject any completed tasks. If a worker’s result is rejected, the task is released to be completed again by a different worker.
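The write-up does not say how the task was created, and it may well have been set up through the MTurk web interface. For anyone wanting to script the same settings, a rough sketch using the `create_hit` call in boto3 might look like the following. The title, description, keywords, 24 hour lifetime, layout ID and the `image_url` layout parameter are all assumptions made for illustration; only the reward, time limit and number of assignments come from the text above.

```python
def build_hit_params(image_url, layout_id):
    """Build keyword arguments for boto3's MTurk create_hit, one HIT per image."""
    return {
        "Title": "Count the heads in a photo",  # assumed wording
        "Description": "Count every visible head in the image.",  # assumed wording
        "Keywords": "image, counting",  # assumed
        "Reward": "0.06",  # US dollars per assignment, as in the text
        "MaxAssignments": 2,  # two different workers per image
        "AssignmentDurationInSeconds": 5 * 60,  # 5 minute time limit
        "LifetimeInSeconds": 24 * 60 * 60,  # assumed: HIT stays listed for a day
        "HITLayoutId": layout_id,  # hypothetical layout made in the requester UI
        "HITLayoutParameters": [{"Name": "image_url", "Value": image_url}],
    }


def submit_hits(image_urls, layout_id):
    """Create one HIT per image; requires AWS credentials and a funded account."""
    import boto3  # imported here so the sketch can be read without boto3 installed

    client = boto3.client("mturk", region_name="us-east-1")
    return [
        client.create_hit(**build_hit_params(url, layout_id))["HIT"]["HITId"]
        for url in image_urls
    ]


# Hypothetical image locations for the 38 frames.
urls = [f"https://example.com/frames/frame_{i:02d}.jpg" for i in range(1, 39)]
total_assignments = sum(build_hit_params(u, "LAYOUT_ID")["MaxAssignments"] for u in urls)
print(total_assignments)  # 76
```

MTurk also provides a sandbox environment, which would be a sensible place to try something like this before spending money.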

Overall, the tasks were completed in around 2 hours; the job was posted at around 12.00 GMT on a Friday. This included rejecting a number of tasks (30) where the count seemed to be too low (16 or under, though no upper limit was used). Despite this, two completions (13 and 12) were still accidentally approved. Including fees, the whole process cost $5.32 (£3.72), or $0.07 (£0.05) per task.

32 different workers contributed tasks that were approved. The average (mean) worker completed 3.4 tasks. The typical (median) worker completed 2 tasks. One worker completed 15 tasks. The average time spent by a worker whose task was rejected was 52.2 seconds, and they counted an average of 10.7 heads. For workers whose tasks were accepted, this was 86.6 seconds and an average of 32.1 heads. Broadly, we would assume this helps justify why the accepted tasks were approved, in that these individuals may have paid more attention to the task. The researcher did not keep track of their own time spent on each image but around 1-2 minutes was estimated.
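The rejection rule described above (reject any count of 16 or under, with no upper limit) is simple enough to apply automatically, which would also avoid approving low counts by accident. A minimal sketch, with a hypothetical function name and made-up submissions:

```python
REJECT_AT_OR_BELOW = 16  # threshold used in the experiment


def triage(submissions, threshold=REJECT_AT_OR_BELOW):
    """Split (assignment_id, head_count) pairs into approved and rejected lists."""
    approved, rejected = [], []
    for assignment_id, count in submissions:
        if count <= threshold:
            rejected.append((assignment_id, count))
        else:
            approved.append((assignment_id, count))
    return approved, rejected


# Made-up submissions for illustration only.
example = [("a1", 31), ("a2", 12), ("a3", 16), ("a4", 17), ("a5", 44)]
approved, rejected = triage(example)
print("approved:", approved)
print("rejected:", rejected)
```

A fixed threshold only suits a scene where the true count is roughly known in advance; for a less predictable crowd, comparing the two workers' answers for the same image might be a better signal.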

As each task was completed twice, these were grouped into two columns (A&B) to estimate the potential value of getting two workers to give a response for each image; the average of the two (MTurk Both) is also presented.

Comparing these stats:

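| | Mean | Median | Max | Min | Std Dev | Variance |
|---|---|---|---|---|---|---|
| Researcher | 29.5 | 30 | 38 | 21 | 3.9 | 15.2 |
| MTurkA | 31.9 | 32 | 47 | 13 | 6.8 | 46.7 |
| MTurkB | 32.4 | 33.5 | 44 | 12 | 6.7 | 44.6 |
| MTurk Both | 32.2 | 33 | 47 | 12 | 6.8 | 45.7 |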

Generally, the two MTurk sets were quite close to the researcher’s own estimate, and taking the average of the two was even closer.

The average difference between the researcher and MTurk Both was -2.7; that is, on average the researcher’s estimate was 2.7 heads fewer than that of MTurk Both. Given there were around 30 individuals in each frame, this is roughly equivalent to a 10% margin of error. (Note: the researcher later remembered that he didn’t count himself in any of the photos, therefore the difference is potentially more like -1.7 or 5%)
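The difference and the rough margin of error can be worked through per frame as follows. This is a sketch with made-up counts and a hypothetical function name; it assumes the margin is expressed relative to the researcher's mean count, whereas the text uses a round figure of 30.

```python
from statistics import mean


def compare(researcher, worker_a, worker_b):
    """Return (mean difference, relative difference) of researcher vs. MTurk average."""
    combined = [(a + b) / 2 for a, b in zip(worker_a, worker_b)]
    differences = [r - c for r, c in zip(researcher, combined)]
    mean_difference = mean(differences)
    # Assumption: relative to the researcher's own mean count.
    relative = abs(mean_difference) / mean(researcher)
    return mean_difference, relative


# Made-up counts for three frames (NOT the study data).
researcher = [30, 28, 32]
worker_a = [33, 30, 35]
worker_b = [31, 32, 34]

difference, relative = compare(researcher, worker_a, worker_b)
print(f"mean difference: {difference:.1f} heads, about {relative:.0%}")

# The same check using only the published means from the table above.
print(round(29.5 - 32.2, 1))  # -2.7
```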

Conclusions

A consumer level 360 camera is sufficient to count individuals within a distance of around 20 meters[1]. Beyond this distance, individuals may appear too small to distinguish effectively, but further experimentation is required. Particularly crowded environments might also present further challenges.

Using MTurk seems to provide a similar result in this experiment to manual examination by one researcher, though the variance and standard deviation are considerably higher. Combining the two sets of estimates brings the result closer still. It is relatively cheap to do so and surprisingly fast; far faster than getting a “real world” equivalent of 72 people to do the tasks. Some HTML knowledge is required, but nothing particularly advanced.

Using timelapse photography, it could be useful to establish a ‘base’ interval and a ‘key’ interval; in this case 10 seconds and 1 minute. This is for two reasons:

Looking at group or individual behaviour, it seems that those who would appear in at least two ‘key’ frames (2 minutes) would warrant further investigation, e.g. what activities are they doing, how many are within the group, what is their reaction to X, why are they still here compared to everyone else? In this example, the reasons are fairly banal but for events it may be useful to examine dwell time, audience reactions, spacing, dynamics and other factors.

For practical reasons, and for future work, it might be suitable to manually examine every ‘key’ interval to establish a broad estimate, then to leave the ‘base’ interval estimates to a service like MTurk, or perhaps not to count them at all. Functionally, there is little difference in this case: the camera battery would likely run out before the images filled up the built-in storage. To cover a two hour event in this way would create 720 photos (120 minutes, 7200 seconds, one photo every 10 seconds). At around £0.05 per image for MTurk head counting (one worker per image), having 720 images head-counted would cost £36.00.
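The projection above is easy to adapt to other event lengths or intervals. A small sketch, with hypothetical function and parameter names; the default cost of £0.05 per task is the approximate figure from this experiment, and asking two workers per image, as was done here, would double the total.

```python
def project(duration_minutes, base_interval_s=10, key_interval_s=60,
            cost_per_task=0.05, workers_per_image=1):
    """Return (base frames, key frames, MTurk cost in pounds) for an event."""
    seconds = duration_minutes * 60
    base_frames = seconds // base_interval_s
    key_frames = seconds // key_interval_s
    cost = round(base_frames * workers_per_image * cost_per_task, 2)
    return base_frames, key_frames, cost


print(project(120))                       # (720, 120, 36.0)
print(project(120, workers_per_image=2))  # (720, 120, 72.0)
```

The second value is the number of ‘key’ frames a researcher would need to inspect by hand under the split suggested above.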

Probably the biggest weakness at the moment is that the environment used in the example is fairly arbitrary and it may be hard to scale this up to larger crowds. Probably the term ‘views of crowds’ would be more accurate and this approach will be further tested in a more crowded environment. Will the ranges and results be as stable with a larger and more complex view of a crowd, where there are feasibly hundreds of individuals at further distances? Perhaps with larger numbers involved, the variance will reduce. How much longer would researchers or MTurk workers need to work on each image and therefore be paid as a result? On the other hand – coming back to the original debate, how much precision is really needed and for what purposes? If ‘a little’ extra precision is possible for ‘a little’ extra cost, it may be a worthwhile method; especially if the alternative is to do nothing.

[1] For reference, a five lane A-road in the centre of Leicester that hosts a carnival procession is about 24 meters wide at points, including pavements and central reservation. (London Road A6).