We have a number of robots installed at various locations, servicing customers. All robots get their instructions from a central cloud database with customer data. Each robot has an SQS queue that delivers the commands it has to execute, and the robots broadcast any events using SNS. Some Lambdas are triggered by these SNS messages and handle them.

Now we want better handling and a better overview of the errors occurring on the robots, and better statistics in general.

What we need is:

  • Get an alarm when an error happens that requires manual action to recover.
  • An overview of which types of errors happen most often.
  • Which errors happen before others (i.e. which error has led to a recovery_error that needs manual maintenance).
  • Overall performance stats for a given period:

    • Number of successful sessions
    • Failed sessions caused by user error
    • Failed sessions caused by technical errors
    • Errors where the robot cannot automatically recover and go back to initial position.

All messages have a type attribute, which can be status, warning, error or recovery_error, and a value attribute, which describes the specific status, error, etc.

My thought is to have a Lambda that is subscribed to all SNS messages and uploads them to another system, which would collect everything and let us extract the data mentioned above.
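A minimal sketch of what such a Lambda could look like, in Python. It assumes that type and value arrive as SNS message attributes named `type` and `value`; if they are fields in the message body instead, the lookup would have to change. The names `extract_events` and `handler` are only illustrative, and the forwarding step is left as a comment because the target system is not decided yet.

```python
from collections import Counter

# The four message types named in the question.
VALID_TYPES = {"status", "warning", "error", "recovery_error"}


def extract_events(event):
    """Yield (type, value) pairs from an SNS-triggered Lambda event.

    Assumption: `type` and `value` are SNS message attributes.
    Records without a known type or without a value are skipped.
    """
    for record in event.get("Records", []):
        attrs = record.get("Sns", {}).get("MessageAttributes", {})
        msg_type = attrs.get("type", {}).get("Value")
        value = attrs.get("value", {}).get("Value")
        if msg_type in VALID_TYPES and value is not None:
            yield msg_type, value


def handler(event, context):
    """Lambda entry point: count the events in this invocation."""
    counts = Counter(extract_events(event))
    # Forward `counts` (or the raw events) to the chosen system here.
    return {f"{t}:{v}": n for (t, v), n in counts.items()}
```

Keeping the parsing separate from the forwarding step means the same function can feed whichever backend is chosen later.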

Which AWS products would you recommend for this? I already looked a little at CloudWatch, but I'm not sure if it can cover our needs.

We have also considered just dumping all SNS messages into a database and running custom queries on the tables. But that sounds like a solution that could quickly require a lot of work on our side as our needs grow.

We'd prefer an off-the-shelf solution, and to adjust our workflow to it.

Thanks in advance for any tips.

Esben von Buchwald

1 Answer

CloudWatch provides time-based metrics, log ingestion, querying and dashboards out of the box. It also provides alarms based on those metrics. In general it covers your requirements: collecting metrics from your devices, raising an alarm when an error happens, and showing a stats dashboard for a given period. You can even send the data directly from the devices using the CloudWatch agent or API.
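As a rough sketch of how single events could be reported as a custom metric, the Python below builds one data point per event with a value of 1 and the unit `Count`, then sends them with boto3's `put_metric_data`. The namespace `Robots/Events`, the metric name `RobotEvent` and the dimension names `Type` and `Value` are made up for illustration, and `publish` assumes boto3 and AWS credentials are available.

```python
from datetime import datetime, timezone

NAMESPACE = "Robots/Events"  # hypothetical namespace, pick your own


def build_metric_datum(msg_type, value, timestamp=None):
    """Build one CloudWatch data point representing a single event."""
    return {
        "MetricName": "RobotEvent",  # hypothetical metric name
        "Dimensions": [
            {"Name": "Type", "Value": msg_type},
            {"Name": "Value", "Value": value},
        ],
        "Timestamp": timestamp or datetime.now(timezone.utc),
        "Value": 1.0,     # each event counts as one
        "Unit": "Count",
    }


def publish(events):
    """Send (type, value) events to CloudWatch as custom metrics.

    boto3 is imported here so the builder above works without it.
    Large batches may need to be split across several calls.
    """
    import boto3

    cloudwatch = boto3.client("cloudwatch")
    metric_data = [build_metric_datum(t, v) for t, v in events]
    if metric_data:
        cloudwatch.put_metric_data(Namespace=NAMESPACE, MetricData=metric_data)
```

Graphing such a metric with the Sum statistic over an hourly or daily period should give the number of events in each period, and an alarm on the Sum for the recovery_error type could cover the manual-action case.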

The managed Elasticsearch service with Kibana also provides great data aggregation capabilities and a better dashboard user experience.

Another approach is to use the AWS IoT services, which probably fit your requirements better.

Kane
  • Thanks! I'm confused about CloudWatch and its metrics. It looks like the normal use case is that you report a "snapshot" number at different times, like disk or memory usage, and it then shows the value over time. But can I also report single events to be counted by CloudWatch? Let's say that whenever any robot completes a successful operation, I want to report that to a metric by pushing a message to the CloudWatch API, and then let CloudWatch do the counting. Can I then have a chart showing the number of successful operations by any robot over the last hour, day, month, etc.? – Esben von Buchwald Mar 20 '20 at 12:23