Now, having one of these is nice and tidy. Having them all over the model code is annoying, to say the least.

So, what do you do?

Well, with Jackson you can add custom serializers to the ObjectMapper object. Problem is, we’re using Spring, so we don’t really have direct access to the ObjectMapper instance used for serialization/deserialization.

The first option was to create a bean that builds the ObjectMapper and assign the serializers to it. However, there’s a better way – we can use a configuration bean supplied by Spring:
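A minimal sketch of what such a configuration bean might look like, assuming Spring Boot’s Jackson auto-configuration is in play (the `Money` type and its serializer are purely illustrative stand-ins for your own model classes):

```java
import java.io.IOException;

import com.fasterxml.jackson.core.JsonGenerator;
import com.fasterxml.jackson.databind.SerializerProvider;
import com.fasterxml.jackson.databind.ser.std.StdSerializer;
import org.springframework.boot.autoconfigure.jackson.Jackson2ObjectMapperBuilderCustomizer;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class JacksonConfig {

    // Hypothetical domain type – substitute your own model class.
    static class Money {
        final long cents;
        Money(long cents) { this.cents = cents; }
    }

    // Custom serializer: writes Money as "12.34" instead of {"cents":1234}.
    static class MoneySerializer extends StdSerializer<Money> {
        MoneySerializer() { super(Money.class); }

        @Override
        public void serialize(Money value, JsonGenerator gen, SerializerProvider provider)
                throws IOException {
            gen.writeString(String.format("%d.%02d", value.cents / 100, value.cents % 100));
        }
    }

    // Spring Boot picks this customizer up and applies it to the auto-configured
    // ObjectMapper, so every controller and message converter uses it – no need
    // to annotate each model class.
    @Bean
    public Jackson2ObjectMapperBuilderCustomizer customSerializers() {
        return builder -> builder.serializers(new MoneySerializer());
    }
}
```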

We’re using Kafka (2.0 on cluster, Java client 1.1) as our messaging backbone. A different cluster is deployed in each environment, and we recently started seeing some weird behaviour in one of our user acceptance environments.

First Problem: Old messages are suddenly reprocessed.

Once every few version releases, we suddenly saw some services re-processing old messages. After a lot of head banging and tedious head scratching, we found out that in Kafka, the retention of the offsets and the retention of the messages are not necessarily the same.

What does that mean? Well, when a message is sent to a specific Kafka topic, it’s retained for as long as the topic retention dictates. So, if our topic retention is 1 week, then after 1 week it will no longer be available.

The consumer offsets, however, are a different story. They are saved in an internal topic called __consumer_offsets, and their retention time is defined by the offsets.retention.minutes parameter in the broker config, with a default of 24 hours.

So what happened to us is this: our message retention was set to 2 weeks, and the offsets retention was 24 hours. After a period of not using the system, we deployed a new version. Once the new version was up, it queried the Kafka topic for its latest offset. However, the __consumer_offsets entry for this application id had already been deleted, and the default behaviour is to read from the beginning of the stream – which is exactly what happened to us: this is why we were consuming old messages when we released new versions, and it would only happen if we released a version after more than 24 hours but less than 2 weeks of inactivity.
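The two retentions are controlled by separate broker settings, so the fix is to align them. A sketch of the relevant broker config (the concrete values here are illustrative, not a recommendation):

```properties
# Broker config: message retention on topics (default: 7 days)
log.retention.hours=336            # 2 weeks

# Broker config: retention of entries in the internal __consumer_offsets topic.
# Should be at least as long as the message retention, otherwise offsets
# can expire while their messages are still readable.
offsets.retention.minutes=20160    # 2 weeks
```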

Second Problem: The producer attempted to use a producer id which is not currently assigned to its transactional id

This one was even more annoying. We’re using the Kafka Streams API, which promises exactly-once message processing. Every once in a while, we’d get the above error after a message had been processed. This would cause the Kafka stream to shut down, the app to restart – and then, to process the same message again(!).

Now, this was extremely weird. First of all, it was a violation of our "exactly-once" constraint. In addition, we had no idea what it meant!

Lately we started also seeing what seems to be a related error:
org.apache.kafka.common.errors.UnknownProducerIdException: Found no record of producerId=16003 on the broker. It is possible that the last message with the producerId=16003 has been removed due to hitting the retention limit.

This pointed us to the broker parameter transactional.id.expiration.ms, documented as follows:

The maximum amount of time in ms that the transaction coordinator will wait before proactively expire a producer’s transactional ID without receiving any transaction status updates from it.

Type: int

Default: 604800000

604800000 ms is 7 days. So basically, if a streaming application had no traffic for 7 days, its producer metadata was deleted – and that’s the behaviour we’d been seeing: the application consumed the message, processed it – and when it tried to commit the transaction and update the offset, it failed. This is why we processed, crashed, and re-processed.
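If the usage pattern can’t change, one option is to raise the expiration on the broker side so it outlives the longest expected idle period (the value below is illustrative):

```properties
# Broker config: how long transactional/producer id metadata survives
# without any transaction status updates (default: 604800000 ms = 7 days)
transactional.id.expiration.ms=1209600000   # 14 days
```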

Bottom line

Kafka is a tool built for massive data streaming, and its defaults are organized around that. Both of these issues occurred because this specific environment’s usage pattern is sporadic and does not correspond to the default configuration.

We have a process that saves a file to an S3 bucket. We needed a lambda to read the file, parse part of the content, and move the file to the appropriate folder in the bucket. So we set up a lambda that runs whenever a file is created in the base folder of the bucket, reads the file, and moves it to the appropriate place.
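The routing decision itself boils down to a pure function from the file’s name and content to a destination key. A minimal sketch, assuming a hypothetical `type:<name>` header line in the file (our real parsing logic is different):

```java
// Sketch: derive the destination folder inside the bucket from the first
// line of the file, assuming a hypothetical "type:<name>" header line.
public class S3KeyRouter {
    public static String destinationKey(String fileName, String content) {
        String firstLine = content.split("\n", 2)[0].trim();
        String folder = firstLine.startsWith("type:")
                ? firstLine.substring("type:".length()).trim()
                : "unclassified";
        return folder + "/" + fileName;
    }
}
```

The Lambda handler would then call this on the S3 event’s object, and issue a copy to the computed key plus a delete of the original via the AWS SDK.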

AWS Certified Solutions Architect – Associate

Exam

130 minutes

60 questions

Results are between 100 and 1000; passing score: 720

Scenario-based questions

IAM

Users

Groups

Roles

Policies

Users are part of Groups
Resources have Roles: i.e., for an instance to connect to S3, it needs to have a role
All the Groups and Roles get their permissions through Policies, which are defined in JSON:
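For example, a minimal policy allowing read access to a (hypothetical) S3 bucket looks like this:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::my-example-bucket",
        "arn:aws:s3:::my-example-bucket/*"
      ]
    }
  ]
}
```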

EC2

Placement groups

Cluster placement group – a group of instances within a single AZ that need low latency / high throughput (e.g. a Cassandra cluster). Only available for specific instance types.

Spread placement group – a group of instances that need to be placed separately from each other

Placement group names must be unique within an AWS account

Only available for certain instance types

Recommended to use homogeneous instances within a placement group

You can’t merge placement groups

You can’t move an existing instance into a placement group; you can only launch it into one

EFS

Supports NFSv4

Only pay for used storage

Scales up to petabytes

Support thousands of concurrent NFS connections

Data is stored across multiple AZs within a region

Route 53

DNS overview

NS – Name Server record. Meaning, if I go to helloRoute53gurus.com, and I’m the first one to try it in my ISP, then the ISP’s resolver will ask the .com TLD servers whether they have an NS record for helloRoute53gurus.
The .com zone will have a record that maps it to ns.awsdns.com. So the resolver will go to ns.awsdns.com, which will direct it to Route 53.

A – short for Address – the most basic record: the IP for the URL

TTL – time to live – how long to keep in cache

CNAME – resolve one domain to another (can’t be used for ‘naked’ domain names, e.g.: ‘google.com’ )

Alias – unique to Route 53; it maps resource records to an Elastic Load Balancer, a CloudFront distribution, or an S3 bucket configured as a website. They work like a CNAME (www.example.com -> elb1234.elb.amazonaws.com)

MX record – email records

PTR Records – reverse lookups

ELBs do not have predefined IPv4 addresses; you resolve them using a DNS name. So basically, if you have the domain "example.com" and you want to direct its traffic to an ELB, you need to use an Alias (not a CNAME, because it’s a naked domain name, and not an A record, because there is no fixed IP)

Multivalue Answer Routing – several records, each with IP addresses, and a health check for each resource. The IPs are returned randomly, so it’s good for dispersing traffic to different resources.

VPC

NAT Gateways – scale up to 10G; no need to patch, add security groups, or assign an IP (it’s automatic). They do need to be added to the routing table (so private instances can go out through them to the IGW)

Network ACL –

It’s like a SG, in the subnet level.

Each subnet is associated with one. A newly created ACL blocks all inbound/outbound traffic by default. You can associate multiple subnets with the same ACL, but only 1 ACL per subnet.

The traffic rules are evaluated from the lowest value and up.

Unlike SGs, ACLs are stateless: opening port 80 for inbound will not allow the outbound response on port 80. If you want to communicate on port 80, you have to define rules both for inbound and outbound. (Otherwise, traffic will go in and not out)

You can block IP addresses using ACL, you can’t with SG

ALB – you need at least 2 public subnets for an Application Load Balancer

SQS

Visibility – once a message is consumed, it’s marked as "invisible" for 30 seconds (the default; max is 12 hours), and if it’s not deleted within that time frame, it becomes visible again and is re-distributed to another consumer.

SWF – Simple Workflow Service

A kind of Amazon ETL/orchestration system, with Workers (who process jobs) and Deciders (who control the flow of jobs). The system enables dispatching jobs to multiple workers (which makes it easily scalable), tracking job status, and so forth.

SWF keeps track of all the tasks and events in an application (in SQS you’d have to do it manually)

Unlike SQS, In SWF a task is assigned only once and never duplicated (What happens if the job fails? IDK).

SWF enables you to incorporate human interaction – for example, if someone needs to approve received messages

SNS – Simple Notifications Services

Delivers notifications to:

Push notifications

SMS

Email

SQS queue

Any http endpoint

Lambda functions

Messages are stored across multiple AZs for redundancy

Messages are aggregated by Topics, and recipients can dynamically subscribe to Topics.

Elastic Transcoder

Convert video files between formats – e.g., formatting video files into different formats for portable devices

The SES -> Lambda invocation only sends the email’s metadata. If you want the email content, you need to use SNS (so SES -> SNS Topic -> Lambda), but bear in mind that SNS only supports emails up to 150K, so for anything larger you’d need to move to S3

We had an issue with some JQL queries returning weird results from the db, so we wanted to see exactly what’s arriving at the PostgreSQL server. To see that:

Edit the config file: /var/lib/postgresql/data/postgresql.conf

Uncomment and change the following:

#logging_collector = off # Enable capturing of stderr and csvlog
# into log files. Required to be on for
# csvlogs.
# (change requires restart)
# These are only used if logging_collector is on:
#log_directory = 'pg_log' # directory where log files are written,
# can be absolute or relative to PGDATA
#log_filename = 'postgresql-%Y-%m-%d_%H%M%S.log' # log file name pattern,
# can include strftime() escapes
[...]
log_statement = 'none' # none, ddl, mod, all

The logging_collector should be set to on to enable logging

The log_statement should be set to all to enable query logging

The log_directory and log_filename can stay the same, depends on what you want.

So your lines should look like:

logging_collector = on # Enable capturing of stderr and csvlog
# into log files. Required to be on for
# csvlogs.
# (change requires restart)
# These are only used if logging_collector is on:
#log_directory = 'pg_log' # directory where log files are written,
# can be absolute or relative to PGDATA
#log_filename = 'postgresql-%Y-%m-%d_%H%M%S.log' # log file name pattern,
# can include strftime() escapes
[...]
log_statement = 'all' # none, ddl, mod, all

Now restart your service, and you’re good to go: the logs will be at /var/lib/postgresql/data/pg_log

Don’t run this on production, as it will seriously fuck up your performance!

I’m planning to upload a different post on each one of the sessions I liked at the Re:Invent 2018, but for now, just to have everything at one place, here is the short list:

SVR322 – From Monolith to Modern Apps: Best Practices

We are a lean team consisting of developers, lead architects, business analysts, and a project manager. To scale our applications and optimize costs, we need to reduce the amount of undifferentiated heavy lifting (e.g., patching, server management) from our projects. We have identified AWS serverless services that we will use. However, we need approval from a security and cost perspective. We need to build a business case to justify this paradigm shift for our entire technology organization. In this session, we learn to migrate existing applications and build a strategy and financial model to lay the foundation to build everything in a truly serverless way on AWS.

ARC337 – Closing Loops and Opening Minds: How to Take Control of Systems, Big and Small

Whether it’s distributing configurations and customer settings, launching instances, or responding to surges in load, having a great control plane is key to the success of any system or service. Come hear about the techniques we use to build stable and scalable control planes at Amazon. We dive deep into the designs that power the most reliable systems at AWS. We share hard-earned operational lessons and explain academic control theory in easy-to-apply patterns and principles that are immediately useful in your own designs.

ARC403 – Resiliency Testing: Verify That Your System Is as Reliable as You Think

In this workshop, we illustrate how to set up your own resiliency testing. We set up a simple three-tier architecture and explore the failure modes with Bash and Python scripts. To participate, you need an account that can run AWS CloudFormation, AWS Step Functions, AWS Lambda, Application Load Balancers, Amazon EC2, Amazon RDS (MySQL), AWS Database Migration Service, and Route 53.

AWS global infrastructure provides the tools customers need to design resilient and reliable services. In this session, we discuss how to get the most out of these tools.

(Sorry, couldn’t find youtube / slides 😦 )

SRV305 – Inside AWS: Technology Choices for Modern Applications

AWS offers a wide range of cloud computing services and technologies, but we rarely give opinions about which services and technologies customers should choose. When it comes to building our own services, our engineering groups have strong opinions, and they express them in the technologies they pick. Join Tim Bray, senior principal engineer, to hear about the high-level choices that developers at AWS and our customers have to make. Here are a few: Are microservices always the best choice? Serverless, containers, or serverless containers? Is relational over? Is Java over? The talk is technical and based on our experience in building AWS services and working with customers on their cloud-native apps.