AEM+S3DataStore+Mongo cluster encounter S3 404

Issue

In a multi-node AEM cluster using S3DataStore, if you upload large images or videos (100 MB+) to the Assets UI via a secondary topology node, then errors similar to the one below would be logged.

17.04.2017 23:53:27.106 *ERROR* [JobHandler: /etc/workflow/instances/server0/2017-04-17/update_asset_5:/content/dam/test.jpg/jcr:content/renditions/original] com.day.cq.dam.core.impl.cache.CQBufferedImageCache Error while loading image /content/dam/rrd/dsg/migration-test/to-rrd/aem-tester/Perf1492487597162_Performance_100MB_JPG.jpg/jcr:content/renditions/original
java.lang.RuntimeException: Error occurred while obtaining InputStream for blobId [c260220808fc2ee8550a97daea0444123b02f90b#103376192]
at org.apache.jackrabbit.oak.plugins.blob.BlobStoreBlob.getNewStream(BlobStoreBlob.java:49)
at org.apache.jackrabbit.oak.plugins.value.BinaryImpl.getStream(BinaryImpl.java:53)
at org.apache.sling.jcr.resource.internal.helper.LazyInputStream.getStream(LazyInputStream.java:106)
at org.apache.sling.jcr.resource.internal.helper.LazyInputStream.read(LazyInputStream.java:65)
at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1792)
at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1769)
at org.apache.commons.io.IOUtils.copy(IOUtils.java:1744)
at com.day.cq.dam.commons.handler.StandardImageHandler.getImage(StandardImageHandler.java:245)
at com.day.cq.dam.core.impl.handler.JpegHandler.getImage(JpegHandler.java:79)
at com.day.cq.dam.core.impl.cache.CQBufferedImageCache.getImage(CQBufferedImageCache.java:116)
at com.day.cq.dam.core.impl.gfx.CommonsGfxRenderer.createImageLayer(CommonsGfxRenderer.java:488)
at com.day.cq.dam.core.impl.gfx.CommonsGfxRenderer.createLayer(CommonsGfxRenderer.java:163)
at com.day.cq.dam.core.impl.gfx.CommonsGfxRenderer.render(CommonsGfxRenderer.java:116)
at com.day.cq.dam.core.impl.RenditionMakerImpl$PlanBasedTemplate.apply(RenditionMakerImpl.java:132)
at com.day.cq.dam.core.impl.RenditionMakerImpl.generateRenditions(RenditionMakerImpl.java:216)
at com.day.cq.dam.core.process.CreateWebEnabledImageProcess.createWebEnabledImage(CreateWebEnabledImageProcess.java:165)
at com.day.cq.dam.core.process.ThumbnailProcess.execute(ThumbnailProcess.java:82)
at com.day.cq.workflow.compatibility.CQWorkflowProcessRunner.execute(CQWorkflowProcessRunner.java:93)
...
Caused by: java.io.IOException: org.apache.jackrabbit.core.data.DataStoreException: Could not length of dataIdentifier c260220808fc2ee8550a97daea0444123b02f90b
at org.apache.jackrabbit.oak.plugins.blob.datastore.DataStoreBlobStore.getStream(DataStoreBlobStore.java:539)
at org.apache.jackrabbit.oak.plugins.blob.datastore.DataStoreBlobStore.getInputStream(DataStoreBlobStore.java:322)
at org.apache.jackrabbit.oak.plugins.blob.BlobStoreBlob.getNewStream(BlobStoreBlob.java:47)
... 26 common frames omitted
Caused by: org.apache.jackrabbit.core.data.DataStoreException: Could not length of dataIdentifier c260220808fc2ee8550a97daea0444123b02f90b
at org.apache.jackrabbit.oak.blob.cloud.aws.s3.S3Backend.getLength(S3Backend.java:485)
at org.apache.jackrabbit.core.data.CachingDataStore.getLength(CachingDataStore.java:671)
at org.apache.jackrabbit.core.data.CachingDataStore.getRecord(CachingDataStore.java:469)
at org.apache.jackrabbit.oak.plugins.blob.datastore.DataStoreBlobStore.getDataRecord(DataStoreBlobStore.java:548)
at org.apache.jackrabbit.oak.plugins.blob.datastore.DataStoreBlobStore.getStream(DataStoreBlobStore.java:533)
... 28 common frames omitted
Caused by: com.amazonaws.services.s3.model.AmazonS3Exception: Not Found (Service: Amazon S3; Status Code: 404; Error Code: 404 Not Found; Request ID: 09ED6CDC6129E1DA)
at com.amazonaws.http.AmazonHttpClient.handleErrorResponse(AmazonHttpClient.java:1389)
at com.amazonaws.http.AmazonHttpClient.executeOneRequest(AmazonHttpClient.java:902)
at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:607)
at com.amazonaws.http.AmazonHttpClient.doExecute(AmazonHttpClient.java:376)
at com.amazonaws.http.AmazonHttpClient.executeWithTimer(AmazonHttpClient.java:338)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:287)
at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3826)
at com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:1015)
at com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:991)
at org.apache.jackrabbit.oak.blob.cloud.aws.s3.S3Backend.getLength(S3Backend.java:478)
... 32 common frames omitted

Environment

AEM Assets 6.1, 6.2 on MongoMK using S3DataStore

Cause

The cause of this issue is due to the async upload of the binary to S3 taking longer than job processing in the cluster. The leader node processes the DAM Update Asset workflow jobs, but the job gets processed before the file is uploaded to S3. See OAK-4903 for further details.

Resolution

This issue is fixed in AEM6.3.

A. Disable S3 Async Uploads

As a workaround in AEM6.1 and 6.2, do the following:

Login to AEM server and open the crx-quickstart/install/org.apache.jackrabbit.oak.plugins.blob.datastore.S3DataStore.config file in an editor

Set this property

asyncUploadLimit="0"

Restart AEM

Note:

Setting the asyncUploadLimit to 0 will have a performance impact on asset file uploads. After setting this, you will need to perform load testing to make sure the environment still meets performance needs.

B. Enable Job Distribution

Another approach to solve the issue without a performance hit is to enable sling job distribution. You can then process
sling jobs directly on the secondary cluster nodes in addition to the primary.