I’ve written about installing and using Graphite before, and it’s a really great tool for measuring many kinds of metrics. Most of the guides online don’t touch on the security aspects of this setup, however, and there’s at least one point I thought was worth writing about.
How are we measuring
Metrics we gather from our applications have the following characteristics / requirements:
- We want to gather lots of data over time.
- Any single data point isn’t significant on its own, only in aggregate.
- Measuring is important, but not if it slows down our application in any way.
Graphite uses a stats collector / listener called Carbon. In a typical scenario Carbon listens on a TCP port, and clients report stats by connecting to it. It stores the stats in its database (Whisper), which Graphite then uses to display and query the information.
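To make this concrete, here is a minimal sketch of reporting a metric straight to Carbon over its plaintext protocol (one `path value timestamp` line per metric, sent over TCP; 2003 is Carbon’s default plaintext port, while the host and the function name are just my placeholders):

```python
import socket
import time

def send_to_carbon(path, value, host="127.0.0.1", port=2003):
    """Send one metric to Carbon using its plaintext protocol:
    '<metric path> <value> <unix timestamp>\n' over a TCP connection."""
    line = "%s %s %d\n" % (path, value, int(time.time()))
    # The TCP connect itself can block or fail -- exactly the coupling
    # between app and monitoring discussed below.
    sock = socket.create_connection((host, port), timeout=2)
    try:
        sock.sendall(line.encode("ascii"))
    finally:
        sock.close()
```

Note that every call pays for a TCP handshake (or for keeping a pooled connection alive), which is the cost weighed up next.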
Given the characteristics above, it’s easy to see why using Carbon to collect our data might not be the ideal choice. Why?
- Carbon requires leaving a TCP connection open. If the Carbon server, the network connection or anything along the path breaks, it not only stops gathering monitored data – it can slow down our application.
- Making sure a connection stays ‘alive’ requires techniques such as connection pooling and is generally quite resource-intensive.
- TCP has overhead that might not be necessary and slows things down.
So a fire-and-forget mechanism is much better for this purpose.
This is exactly the problem Statsd is trying to solve. Statsd listens on a UDP port (not TCP), aggregates/buffers metrics over a short period of time, and then forwards them to your Carbon instance all at once.
This means that your application is completely decoupled from your monitoring. If any component is down, the app doesn’t need to know about it. It will simply send UDP messages into the void. All you lose is those messages.
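As a sketch of how cheap this is from the application’s side (the metric name, host and function name are mine; the `name:1|c` counter format is the statsd wire format):

```python
import socket

def statsd_incr(name, host="127.0.0.1", port=8125):
    """Fire-and-forget: emit a statsd counter increment over UDP.
    If nothing is listening, the datagram simply vanishes and the
    application carries on unaffected."""
    payload = ("%s:1|c" % name).encode("ascii")
    # SOCK_DGRAM = UDP: no handshake, no connection state, no waiting
    # for an acknowledgement from the monitoring side.
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.sendto(payload, (host, port))
    sock.close()
```

A single `sendto()` either delivers the datagram or it doesn’t; either way the app never blocks on the monitoring stack.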
No free meals
So what’s the catch? As always, nothing comes for free. We know we might lose some data, and that’s acceptable. So what else might we compromise on?
We finally get down to the reason for writing this post in the first place. It’s a really subtle point, but somehow I haven’t seen it mentioned explicitly anywhere else.
Location, Location, Location
The one thing that I never see mentioned is where to place your statsd server in relation to Carbon. As programmers we live by the DRY (Don’t Repeat Yourself) principle: if a component is used many times, don’t copy & paste it all over the place – load and use it once.
So the natural instinct is to place statsd just in front of carbon, which in turn sits in front of our whisper database, on the monitoring server.
We then tell all our apps, running on different servers, to fire those small UDP messages at our monitoring server. Therein lies the problem. By its nature, UDP is susceptible to spoofing attacks, which are much harder to pull off with TCP. That means it’s very easy to craft fake statsd requests to your monitoring server, pretending they come from the IP address(es) of your apps.
This can completely obscure your real stats, or create a rather easy denial-of-service by overwhelming your monitoring server with fake stats.
So can we use a firewall to block these attacks? Probably not. The only reliable firewall rule would be based on source IP addresses, and if those are easy to spoof, the firewall isn’t useful.
It can however be very useful if your requests are TCP-based!
So the solution is terribly simple. Place a statsd collector on each server you run your app on, and let it connect to the remote monitoring server over Carbon’s TCP protocol. This way you can filter which app servers are allowed to connect to Carbon based on their source IP addresses. Spoofing the source IP of an established TCP connection is much harder.
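A minimal sketch of what the local statsd config might look like under this setup (the keys are statsd’s standard config options; the hostname is an assumed placeholder for your monitoring server):

```js
{
  port: 8125                            // listen for app metrics on local UDP
, graphiteHost: "monitoring.internal"   // remote Carbon server (assumed name)
, graphitePort: 2003                    // Carbon's default plaintext TCP port
}
```

On the monitoring server you can then allow only your known app servers, e.g. an iptables rule along the lines of `iptables -A INPUT -p tcp --dport 2003 -s 10.0.0.0/24 -j ACCEPT` (the subnet is illustrative) followed by a default drop on that port.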
There are very few drawbacks to this solution. Yes, you’d need to deploy more instances of statsd, but statsd is very lightweight and won’t consume many resources.
This is far from a huge security hole. I would definitely classify it as low when considering the bigger picture. The likelihood of it being exploited, and the potential gain from exploiting it, are in most cases negligible. Nevertheless, how much it matters depends on your exposure/profile, and people should be aware of it.
I’ve only scratched the surface of the risks to your graphite/carbon/statsd setup. There are of course many other considerations and potential issues, some probably worse than spoofed statsd packets – for example, access to your Graphite server (which isn’t even password-protected by default!). I might try to cover some of those aspects in future posts. For now, I felt it was important to make one point clear about the correct placement of your statsd server.
4 replies on “Statsd and Carbon security”
“Place your statsd collector on each server you run your app on, and let it connect to the remote monitoring server using carbon. ”
If you have a web app on a load-balanced web farm in which each instance of the web app can produce the same metric, my understanding is that this approach will not work. The reason is StatsD places metrics in “buckets” and accordingly reports the value of that bucket at each flushInterval, even if it is zero. Carbon directly writes (even overwrites) the value with that timestamp in the Whisper file.
So let’s say counter metricA can be produced more than once (let’s say three times) in any given flushInterval — once on webapp-1 and twice on webapp-2. [And using the proposed setup, webapp-1 and webapp-2 would connect to StatsD-1 and StatsD-2, respectively.] Then, the metricA “bucket” on StatsD-1 has a value of 1, and the metricA bucket on StatsD-2 has a value of 2. The final value committed to the Whisper file will be either 1 or 2, based on which network call gets there last. However, what you most likely wanted the value to be was 3. In other words, metricA must live on only one StatsD instance. It is still possible to have more than one instance; however, the web app would have to have logic to consistently send a given metric (metricA) to the same StatsD instance every time.
You make a good point. One alternative, if you have a farm of load-balanced servers, is to share one statsd collector on the private, internal network. Another is to define your stats so that they don’t overlap, e.g. include the server instance name as part of the metric name – server1.events.metricA, server2.events.metricA – then do your aggregation in Graphite.
I’m hoping that that bucket problem might be resolved by using something like https://github.com/juliusv/ne-statsd-backend (which /doesn’t/ do the same bucketing as vanilla statsd) on each server, talking to a modified statsd that receives stats over TCP and still allows this more spoof-proof approach.
Thanks for the info Jonathan. I’ve never used ne-statsd-backend, but it looks interesting. Any backend/collector supporting TCP is perhaps not as efficient, but is far less vulnerable to spoofing.