Using Custom Metrics to Measure User Load in CloudWatch
Being part-time DevOps in a small shop can limit the time (and money) you spend load testing a web application or server API. And yet, if you’re publishing a consumer application with hopes of growing your user base, it’s important to keep your finger on the pulse of your infrastructure. So how does a stretched dev team decide when to scale or optimize before the whole house of cards comes crashing down?
There are a lot of right answers, even better ones than described here (e.g., auto-scaling groups), but for now we have EC2 instances and a pretty bomb monitoring tool provided by our friends at Amazon. Along with an unlimited supply of Diet Mountain Dew, our stretched DevOps person needs to work with what’s been given to them.
Enter CloudWatch
Setting up the pre-defined, existing metrics and alerts in Amazon is pretty trivial. It’s a table-stakes measure for a team and a good start. Since there are a ton of resources available on the procedures, we’re going to skip straight to strategy. If you need some white papers, Amazon’s are, so far, the best.
Anecdotally, our application suffered some common modes of failure as it scaled. The most common were:
- Application server CPU peaks, causing requests to slow down or grind to a halt and in some cases, time out.
- Disks filling up with unrotated or phantom application logs, unchecked operating system logs, pointers, or session files
- Database server CPU peaks caused by slow queries
Enter some basic graphs. CPU utilization and network traffic are a good start; during normal usage the two should move roughly in proportion to each other. We set up alerts for when usage peaked above a certain threshold (~75%).
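If you’d rather script the alarm than click through the console, the equivalent AWS CLI call looks roughly like this (the instance ID and SNS topic ARN are placeholders; adjust the period and evaluation window to taste):

```
aws cloudwatch put-metric-alarm \
  --alarm-name "app-server-high-cpu" \
  --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
  --statistic Average \
  --period 300 \
  --evaluation-periods 2 \
  --threshold 75 \
  --comparison-operator GreaterThanThreshold \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:ops-alerts
```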
Disk space for EC2 instances was also important given our failure modes, but disk usage isn’t available via CloudWatch’s default metrics. Instead we needed Amazon’s Perl-based custom monitoring scripts.
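Once those scripts are unpacked on an instance, a quick manual run pushes disk (and, if you like, memory) metrics for the root volume; the working directory below is just wherever you installed the package:

```
./mon-put-instance-data.pl --disk-space-util --disk-space-used --disk-space-avail --disk-path=/ --mem-util
```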
These aren’t the only graphs we set up for our infrastructure; we also have MySQL/Aurora and DynamoDB usage graphs, queue server and queue size monitoring, and network and load balancer monitoring. These are all great: we have a pulse, we can monitor trends, and we can get alerts when things start going amiss. But we still don’t really understand how many users our servers can support…
Enter The Meat
Our native mobile game connects to one of many Node servers, which process and return data during the user session. In a glorious moment, the dev team delivers an API endpoint that returns a local and global count of connected users on each server. We’re almost there; let’s get that into CloudWatch too!
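The response shape is whatever your API team gives you; ours boils down to a tiny JSON payload along these lines (the `global` field here is purely illustrative — the script below only reads the per-server count out of the `data` field):

```
{ "data": 1432, "global": 11207 }
```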
To do this we need three things: our chopping block, our machete, and the AWS custom monitoring scripts mentioned above. Since Amazon’s team has done the heavy lifting for us, there’s no reason to reinvent the wheel, but as you are about to see, I AM NOT A PERL DEVELOPER. This was my first foray into Perl scripting: I needed to pull the necessary pieces from Amazon’s script and add an HTTP request library (LWP) to contact my server API. I also added a snippet that runs an operating-system-level TCP connection count (the netstat call partway down the script). If this file is confusing, I’d highly recommend reviewing the original monitoring scripts package first.
This new file (mon-put-user-data.pl) is designed to live in the same directory as the existing AWS examples (mon-put-instance-data.pl):
```perl
#!/usr/bin/perl

BEGIN {
  use File::Basename;
  my $script_dir = &File::Basename::dirname($0);
  push @INC, $script_dir;
}

use strict;
use warnings;

use LWP::Simple;
use LWP::UserAgent;
use HTTP::Request;
use JSON;
use Sys::Hostname;
use Getopt::Long;
use Sys::Syslog qw(:DEFAULT setlogsock);
use Sys::Syslog qw(:standard :macros);
use CloudWatchClient;
use constant { NOW => 0 };
use Data::Dumper;

#
# For cloudwatch
#
my $version = '0.1';
my $client_name = 'CloudWatch-PutUserData';

my $enable_compression;
my $aws_credential_file;
my $aws_access_key_id;
my $aws_secret_key;
my $aws_iam_role;
my $from_cron;
my $parse_result = 1;
my $parse_error = '';
my $argv_size = @ARGV;
my $mcount = 0;
my %params = ();

my $now = time();
my $timestamp = CloudWatchClient::get_offset_time(NOW);
my $instance_id = CloudWatchClient::get_instance_id();

#
# Set default input DS, Namespace, Dimensions
#
$params{'Input'} = {};
my $input_ref = $params{'Input'};
$input_ref->{'Namespace'} = "System/Linux";
my %xdims = (("InstanceId" => $instance_id));

#
# Adds a new metric to the request
#
sub add_single_metric {
  my $name = shift;
  my $unit = shift;
  my $value = shift;
  my $dims = shift;
  my $metric = {};

  $metric->{"MetricName"} = $name;
  $metric->{"Timestamp"} = $timestamp;
  $metric->{"RawValue"} = $value;
  $metric->{"Unit"} = $unit;

  my $dimensions = [];
  foreach my $key (sort keys %$dims) {
    push(@$dimensions, {"Name" => $key, "Value" => $dims->{$key}});
  }
  $metric->{"Dimensions"} = $dimensions;

  push(@{$input_ref->{'MetricData'}}, $metric);
  ++$mcount;
}

#
# Prints out or logs an error and then exits.
#
sub exit_with_error {
  my $message = shift;
  report_message(LOG_ERR, $message);
  exit 1;
}

#
# Prints out or logs a message
#
sub report_message {
  my $log_level = shift;
  my $message = shift;
  chomp $message;

  if ($from_cron) {
    setlogsock('unix');
    openlog($client_name, 'nofatal', LOG_USER);
    syslog($log_level, $message);
    closelog;
  }
  elsif ($log_level == LOG_ERR) {
    print STDERR "\nERROR: $message\n";
  }
  elsif ($log_level == LOG_WARNING) {
    print "\nWARNING: $message\n";
  }
  elsif ($log_level == LOG_INFO) {
    print "\nINFO: $message\n";
  }
}

{
  # Capture warnings from GetOptions
  local $SIG{__WARN__} = sub { $parse_error .= $_[0]; };

  $parse_result = GetOptions(
    'from-cron' => \$from_cron,
    'aws-credential-file:s' => \$aws_credential_file,
    'aws-access-key-id:s' => \$aws_access_key_id,
    'aws-secret-key:s' => \$aws_secret_key,
    'enable-compression' => \$enable_compression,
    'aws-iam-role:s' => \$aws_iam_role,
  );
}

if (!defined($instance_id) || length($instance_id) == 0) {
  exit_with_error("Cannot obtain instance id from EC2 meta-data.");
}

#
# Params for connecting with and talking to the server API
#
my $clientId = '';
my $clientSecret = '';
my $clientPass = '';
my $authEndpoint = 'https://path.to.auth';
my $userEndpoint = 'https://path.to.data';
my $asaPort = 81;

#
# Collect data from netstat command: count established TCP connections
# on the game port, then strip the trailing newline from the shell output
#
my $cxns = `netstat -ant | grep $asaPort | grep EST | wc -l`;
chomp $cxns;
add_single_metric("TCP Connections", "Count", $cxns, \%xdims);

#
# Get auth token from core
#
my $ua = LWP::UserAgent->new;
my $req = HTTP::Request->new(POST => $authEndpoint);
$req->header('response_type' => 'json');
$req->content_type('application/x-www-form-urlencoded');
$req->content('grant_type=client_credentials&client_id=' . $clientId . '&client_secret=' . $clientSecret);
my $res = $ua->request($req);

#
# Check the authorization outcome
#
if ($res->is_success) {
  my $auth = decode_json($res->decoded_content);
  my $token = $auth->{'access_token'};

  #
  # Make the call for active user data
  #
  my $req = HTTP::Request->new(GET => $userEndpoint);
  $req->header('Authorization' => 'Bearer ' . $token);
  $req->header('response_type' => 'json');
  my $res = $ua->request($req);

  if ($res->is_success) {
    my $data = decode_json($res->decoded_content);
    my $users = $data->{'data'};
    add_single_metric("Active Users", "Count", $users, \%xdims);
  }

  if ($mcount > 0) {
    #
    # Attempt to send them to cloudwatch
    #
    my %opts = ();
    $opts{'aws-credential-file'} = $aws_credential_file;
    $opts{'aws-access-key-id'} = $aws_access_key_id;
    $opts{'aws-secret-key'} = $aws_secret_key;
    $opts{'retries'} = 2;
    $opts{'user-agent'} = "$client_name/$version";
    $opts{'enable_compression'} = 1 if ($enable_compression);
    $opts{'aws-iam-role'} = $aws_iam_role;

    my $response = CloudWatchClient::call_json('PutMetricData', \%params, \%opts);
    my $code = $response->code;
    my $message = $response->message;

    if ($code == 200 && !$from_cron) {
      my $request_id = $response->headers->{'x-amzn-requestid'};
      print "Successfully reported metrics to CloudWatch. Reference Id: $request_id\n";
    }
    elsif ($code < 100) {
      exit_with_error($message);
    }
    elsif ($code != 200) {
      exit_with_error("Failed to call CloudWatch: HTTP $code. Message: $message");
    }
  }
  else {
    print "Error: " . $res->status_line . "\n";
    exit_with_error($res->status_line);
  }
}
else {
  print "Error: " . $res->status_line . "\n";
  exit_with_error($res->status_line);
}
```
View the script on GitHub.
Lastly, as with any custom monitoring script, a cron job has to be installed to run the script on a schedule… mischief managed. With the metrics being sent, we can now access them in CloudWatch and add them to some graphs, measuring active users against resource usage!
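A crontab entry along these lines runs the collector every five minutes (the install path and credential file are placeholders — an attached IAM role via --aws-iam-role works instead of a credential file):

```
*/5 * * * * /home/ec2-user/aws-scripts-mon/mon-put-user-data.pl --from-cron --aws-credential-file=/home/ec2-user/aws-scripts-mon/awscreds.conf
```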
Looks like it’s time to take action on server 3. We have a few options available, from growing the instance size to better distributing the user load.
This opens up a variety of metric and DevOps KPI options, including estimating how much a single user costs to support in infrastructure, and lets us predict and project resource requirements at various growth rates. In the end, this is just one approach… the tip of the iceberg in terms of infrastructure strategy. But for a small shop with a single DevOps guy and limited resources, it’s invaluable insight to keep servers running and the business growing.
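As a rough illustration with made-up numbers: if an instance levels off around our 75% CPU alert at roughly 4,000 concurrent users and costs about $150/month, that works out to roughly $0.04 per concurrent user per month, and a projected jump to 20,000 concurrent users implies on the order of five instances (plus headroom) before any optimization.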
Stay tuned, I’ll add devops posts as we continue to refine our strategy.
Up next: optimizing at the application level to improve per-user cost.