For the dimensions, I think it makes sense to follow the format 200 by 200 pixels like ticket #38932 since we are planning to move forward with that format in 4.8 for suggested image dimensions.

Does min reflect the time format 00:00:00 correctly? For example, if it was 1:02:00 it would have to read 62 min to be correct or change it to 1:02:00 hours. I think adding a unit for the time layout might be a little confusing.

@adamsilverstein I'm not sure that would be feasible or approachable, considering all this should also be translatable...
By the way, the point is screen readers are often unable to get what specific format is, so they will announce something like 1:02:00 just as numbers and the colons doing a little pause:
one ... zero two ... zero zero.
thus making difficult to understand what that means. Providing an alternative text, in this case by the means of an aria-label, would be great and would allow to use a string in a more natural, spoken-like, language. I mean, as you would normally say that in your every-day English :) for example"
one hour, two minutes, zero seconds.
I have no idea how to do that programmatically though.

The size case is a bit easier. I've done something similar in another project recently, just avoiding to use abbreviations. Also, depending on what character is used for the "x", the size may be read out as:
384 /ˈɛks/ 216 pee /ˈɛks/ or 384 times 216 pee /ˈɛks/
Simply using a natural language would solve the issue:
384 by 216 pixels

The part about providing natural language and avoiding abbreviations for the screen reader makes perfect sense. Since we want to keep the displayed text concise, this likely means separate strings for display vs. speaking (which makes sense to me in this case).

In terms of how to announce something like 1:05:12 I think we should leave this up to the screen readers? Trying to accomplish this manually seems tricky, especially with local variations in plurals or even the use of am/pm.

Reviewing the latest patch I see that @milindmore22 already added code to provide a human readable duration value for the aria label. I am going to clean up that patch a little and provide some examples here.

This could use some additional testing (especially with screen readers) and also screenshots showing the results. Is the aria-label announced when it changes, or do we need to call wp.a11y.speak explicitly?

@afercia you can see the way the human_readable_duration function works by looking at the unit test in the patch. I added the test example you gave: array( '1:02:00', '1 hour, 2 minutes, 0 seconds' ) (input, output)

@adamsilverstein I like the approach! Wondering if it could be abstracted and be a human_readable_time thing, to be re-used in other places too. Thinking at the General Settings page, for example. And yes, also a human_readable_date would be very useful :)

Ideally, aria-label should be used on ​labelable elements. When used on other elements (previous patches used it on a span element), assistive technologies expect that element to have a native "role" (or an ARIA role). Since no ARIA role was set in the markup (and there's no applicable role), assistive technologies may announce something unexpected; VoiceOver, for example, fallbacks to a generic "group" role and announced the span element as "group". To avoid this, I've opted to use some screen-reader-text instead of an aria-label.

I've applied the same enhancements to the attachment details in the "Add Media" modal.

Note: the patch removes some unrelated trailing spaces (my editor does that automatically).

Looks good to me, @adamsilverstein please do feel free to review and commit when you have a chance :)