First, you can see that mappers had to read all table data from HDFS (80 GB), and only the final count was written to HDFS (11 bytes):

The good news is that each mapper calculated its own local row count, so the reducer just had to sum 242 counts to get the final row count:

So when calculating the number of rows in a table stored in SequenceFile or text format, remember that the map tasks have to read all table rows from HDFS, and this operation can be quite expensive depending on the table size.

At the same time, the reducer does not need to process all rows; it only receives a local count from each mapper. Network I/O between the mappers and the reducer is therefore very low, and the reduce operation stays cheap no matter how many rows the table contains.
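This division of work can be sketched in plain Python (a toy model of the MapReduce job, not Hive's actual implementation; the splits and row data are invented for illustration, standing in for the 242 real input splits):

```python
# Toy model of COUNT(*) as MapReduce: each mapper scans its own
# input split and emits a single local count; the reducer never
# sees the rows themselves, only one small integer per mapper.

def mapper(split):
    # Reads every row in the split -- this is the expensive part
    return sum(1 for _row in split)

def reducer(local_counts):
    # Cheap: sums one integer per mapper, regardless of table size
    return sum(local_counts)

# Hypothetical data: 3 splits standing in for the 242 real ones
splits = [["r1", "r2"], ["r3"], ["r4", "r5", "r6"]]
local_counts = [mapper(s) for s in splits]   # [2, 1, 3]
total = reducer(local_counts)                # 6
```

Note that the reducer's input size grows with the number of mappers, not with the number of rows, which is why the shuffle traffic for this query is tiny.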

Note that if your query includes a WHERE condition, the statement still requires the same I/O resources to execute:

SELECT COUNT(*) FROM orders WHERE code = 'A1';

Mappers still need to read all the data from HDFS; the code = 'A1' condition merely changes their local row counts (and adds some CPU time to compare values). The reducer still receives the local counts from 242 mappers and sums them.
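Extending the same toy model, a WHERE predicate only changes what each mapper counts, not how much it reads (the rows and predicate below are invented for illustration):

```python
# Toy model of COUNT(*) with a WHERE filter: the predicate is
# evaluated per row inside the mapper, so every row is still
# read from the input split even when it does not match.

def mapper_with_filter(split, predicate):
    rows_read = 0
    matched = 0
    for row in split:
        rows_read += 1        # every row is still read from HDFS
        if predicate(row):    # WHERE code = 'A1'
            matched += 1
    return rows_read, matched

splits = [
    [{"code": "A1"}, {"code": "B2"}],
    [{"code": "A1"}],
]
results = [mapper_with_filter(s, lambda r: r["code"] == "A1")
           for s in splits]
total_read = sum(r for r, _ in results)   # 3: all rows scanned
total_count = sum(m for _, m in results)  # 2: only matches counted
```

The `rows_read` counter makes the point explicit: the filter reduces the final count, but not the scan cost.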