This post covers integrating Avamar VM backup statistics directly with other VM statistics, using the deep analytics of vCenter Operations to help manage image-based backups (see Chad’s videos here). Before proceeding, check out a recent related post here that discussed how vCenter Log Insight can receive Syslog messages and display useful Avamar backup-related activities for VM admins. At a high level, the difference between the two posts is "what happened" (vCLI) versus "what is the state over time" or "what might happen" (vCOps).
Part One of the vCOps integration is also important to review. It was released in February 2013 during VMware’s PEX. That first iteration laid the groundwork for performing active collections against Avamar for VM backup events and creating a mechanism to send them to vCenter Operations. Towards the end, the post details how to configure the collections. The value in this integration was that you could overlay "change events" per Virtual Machine and see when backups occurred on top of Health, Risk, Efficiency, and Faults. For example, if the health of a VM decreased, this could be caused by underlying Datastore performance problems. The backups could then be plotted against this decrease in Health. Beyond this, you could change the context to view backups from multiple VMs and how they aligned against Datastore health. The value? See the impact of backups, and get more information to help plan better.
This integration was very cool, and people were able to get it running with Avamar and even VDP (which took modifying a VDP firewall rule for Postgres access).
Now what’s the next level of coolness? Well it is actually in line with what most people think about traditional monitoring. Can we monitor the metrics of Avamar? Of course. For a VMware administrator can we extend relevant information and monitor the VM backup metrics per VM? Yes.
Are these some of the questions that a VMware admin or even a backup admin might be interested in?
- Is this VM in violation of an RPO?
- Which VMs are consuming the most unique space in our Protection storage?
- Which VMs have the least deduplication rate?
- What is the backup rate?
- Am I impacting storage?
- ..and when did this start and end or what is normal?
Introducing Part Two! Take a look at the screenshot below. Notice that there is a root-level grouping of Avamar statistics, and along with it an Avamar instance grouping as well. What does this mean? Since the VM object contains these statistics, we can easily relate any of our backup activities to VM stats. Do you want to see if storage latency increases during backups? No problem. How about snapshot space usage during backups? Sure. There are plenty of awesome possibilities here that I have really only started to explore!
What are we tracking per VM? Here’s the list. It is important to note that we are really tracking five things and the information is then placed into different calculated metrics.
- Protected/Scanned bytes – The whole size of the VM being protected
- New Bytes – The new bytes after source-based deduplication (on the image proxy) that is absorbed into the Protection storage
- Deduplication – A calculation of New Bytes and Protected
- Duration – How long a backup took
- Currently Available – Recovery points available
It is also important to mention the different calculated values. Since vCOps doesn’t do data manipulation (WYSIWYG) without things called Super Metrics, sometimes it is more efficient to send calculated values. An important point here is that we actually need Active Collections against Avamar as well. Why? If we want to estimate how many backups are currently around and do other calculations from this, we could simply look at the expiration of backups that have occurred and figure out which ones *should* be around. But that’s not good enough!
We need active collections against Avamar for things like "How many backups does this VM have right now" or "How much Protection storage is being consumed by all backups of this VM right now"? We then can take the number, and the details for those backup activities and calculate for example, averages, sums, and maxes for the duration of those backups that currently exist. These calculations can then be plotted over time. It will make more sense when you start to see it in action!
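As a rough illustration of the idea, here is a minimal Python sketch of how the "right now" aggregates could be derived once an active collection has returned the backups that currently exist for a VM. The `Backup` record and its field names are hypothetical and are not the actual Avamar schema or the collector's code; the point is only that the count, average, sum, and max are all computed over the same set of live recovery points.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Backup:
    """One recovery point reported as existing right now (hypothetical layout)."""
    duration_s: float   # how long the backup took, in seconds
    new_mb: float       # New Bytes absorbed into Protection storage, in MB
    scanned_gb: float   # Protected/Scanned size of the VM, in GB


def summarize(live_backups):
    """Aggregate the recovery points that an active collection says exist now."""
    if not live_backups:
        return {"currently_available": 0}
    return {
        "currently_available": len(live_backups),
        "duration_avg_s": mean(b.duration_s for b in live_backups),
        "new_sum_mb": sum(b.new_mb for b in live_backups),
        "new_max_mb": max(b.new_mb for b in live_backups),
        "scanned_max_gb": max(b.scanned_gb for b in live_backups),
    }


if __name__ == "__main__":
    live = [Backup(600.0, 100.0, 40.0), Backup(300.0, 50.0, 40.0)]
    print(summarize(live))
```

Running this on every collection cycle and sending the resulting values is what lets the aggregates be plotted over time as recovery points are created and expire.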
See the following for a summary of the metrics being shown.
- Currently Available – Amount of Recovery Points for the VM
- Deduplication Latest (%) – Latest VM deduplication percentage, calculated as 1 - (New Bytes/Scanned Bytes)
- Duration Average (seconds) – For recovery points that exist at this point, what is the average duration of backups
- Duration Latest (seconds) – Latest backup duration
- New Bytes All (%) – For recovery points that exist at this point, what is the aggregate percentage of New Bytes (Total New Bytes/Scanned Bytes)
- New Bytes Avg (MB/sec) – Reported backup New Bytes averaged over duration and plotted at start and finish times
- New Bytes Latest (%) – Latest New Bytes (New Bytes/Scanned Bytes)
- New Latest (MB) – Latest New Bytes
- New Max (MB) – For recovery points at this time, what is the maximum New Bytes
- New Sum (MB) – For recovery points at this time, what is the total New Bytes
- Scanned Avg (GB/sec) – Reported backup Scanned/Protected space averaged over duration and plotted at start and finish times
- Scanned Latest (GB) – Latest space Scanned/Protected
- Scanned Max (GB) – For recovery points at this time, what is the maximum Scanned/Protected space
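To make the per-backup calculations in this list concrete, here is a small Python sketch of the percentage and rate formulas. The function names are made up for illustration, and the inputs are assumed to already be in consistent units (bytes for the percentages; MB or GB and seconds for the rates). It is not the collector's actual code.

```python
def dedup_pct(new_bytes, scanned_bytes):
    """Deduplication (%) = 1 - (New Bytes / Scanned Bytes), as a percentage."""
    if scanned_bytes <= 0:
        raise ValueError("scanned_bytes must be positive")
    return (1 - new_bytes / scanned_bytes) * 100


def new_bytes_pct(new_bytes, scanned_bytes):
    """New Bytes (%) = New Bytes / Scanned Bytes, as a percentage."""
    if scanned_bytes <= 0:
        raise ValueError("scanned_bytes must be positive")
    return new_bytes / scanned_bytes * 100


def rate_points(start, end, amount, duration_s):
    """Average a reported amount over the backup duration and return two
    samples, one at the start time and one at the finish time, so the rate
    shows up as a flat segment across the backup window."""
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")
    rate = amount / duration_s
    return [(start, rate), (end, rate)]


if __name__ == "__main__":
    scanned = 40 * 1024**3   # 40 GB protected
    new = 512 * 1024**2      # 512 MB of new data
    print(dedup_pct(new, scanned), new_bytes_pct(new, scanned))
    print(rate_points(start=0, end=600, amount=512.0, duration_s=600.0))
```

Note that the deduplication and New Bytes percentages are complements of each other for a single backup, which is why a high deduplication rate and a low New Bytes percentage tell the same story.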
Here we have a screenshot that shows a few of these metrics being graphed. Notice in the top graph that we are showing between 7 and 8 recovery points at all times. The next graph shows the Scanned rate at somewhere around 10 GB/sec during backup operations. Also notice the last graph, which shows how much new data from all backups of that VM is actually living on the Protection storage at any given time. It maxes out at about 2 GB.
Another important point to mention here is that the last graph shows where analytics are starting to predict the growth of the space. The area in grey is a dynamic threshold (DT) that is generated from past data, and it is increasing. These DTs can be included in Health calculations and can trigger alerts.
How about some more analytical coolness? The following graphs together give an idea of what is normal for rate of backups, percent of new data, and duration. As you can see our scan rate is within the grey DT, the New Bytes All (%) is slightly below the normal range, and the duration of the backups is now returning to the normal levels. Awesome?
The thresholds in vCOps can be dynamic or static. So if you decide that you want to alert on this information there are plenty of options.
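To show the difference between the two kinds of threshold, here is a toy Python sketch. The static check is just a fixed limit. The "dynamic" band is a deliberately simple stand-in (mean plus or minus `k` standard deviations of past samples); it is an assumption for illustration only and is not how vCOps actually computes its dynamic thresholds.

```python
from statistics import mean, stdev


def static_breach(value, limit):
    """Static threshold: alert whenever the value exceeds a fixed limit."""
    return value > limit


def band(history, k=2.0):
    """Toy 'normal range' learned from past samples: mean +/- k * stdev.
    Needs at least two samples. Not the vCOps algorithm."""
    m = mean(history)
    s = stdev(history)
    return (m - k * s, m + k * s)


def outside_band(value, history, k=2.0):
    """True when the value falls outside the range learned from history."""
    low, high = band(history, k)
    return value < low or value > high


if __name__ == "__main__":
    durations = [600, 620, 580, 610, 590]   # past backup durations in seconds
    print(band(durations))
    print(outside_band(900, durations), static_breach(900, 800))
```

The practical difference is that the static limit has to be chosen by hand, while the band moves with whatever has been normal for that VM.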
One of the main benefits highlighted here is that we can track Avamar backup metrics against VMware Virtual Machine metrics. This is a hugely important point!
This can help to answer questions and plan for better backup windows. For example, in the following screenshot we are bringing in a backup and its average scan rate (10 GB/sec) over a period of time, and then showing the storage latency. Notice how the storage latency is not impacted!
And taking this a step further, here is another screenshot of the same information but over a longer period. Here you can see that yes there actually is storage latency at certain times, and if desired you could then drill into those periods.
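The same comparison can be done numerically. Below is a minimal Python sketch that splits latency samples into "during a backup" and "outside a backup" and averages each group. The sample and window layouts (plain timestamps with a latency value, and start/end pairs) are assumptions for illustration and do not reflect how vCOps stores or exposes this data.

```python
from statistics import mean


def split_by_windows(samples, windows):
    """samples: list of (timestamp, latency_ms); windows: list of (start, end).
    Returns the latency values inside and outside the backup windows."""
    inside, outside = [], []
    for t, latency in samples:
        if any(start <= t <= end for start, end in windows):
            inside.append(latency)
        else:
            outside.append(latency)
    return inside, outside


def latency_during_vs_outside(samples, windows):
    """Average latency during backups and outside them (None if no samples)."""
    inside, outside = split_by_windows(samples, windows)
    return (
        mean(inside) if inside else None,
        mean(outside) if outside else None,
    )


if __name__ == "__main__":
    samples = [(0, 5.0), (10, 6.0), (20, 15.0), (30, 17.0), (40, 5.0)]
    backup_windows = [(20, 30)]
    print(latency_during_vs_outside(samples, backup_windows))
```

A large gap between the two averages is a hint that the backup window deserves a closer look; a small gap matches what the first screenshot shows.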
Here’s a bit of a possible use case.
I mentioned backup planning earlier, and it can be very important. One of our major goals is to perform backups in a timely manner without impacting the SLAs or performance of a VM. This has been a major theme since the virtualization trend began, because there essentially are no wasted resources anymore: all resources are pooled and have a cost. So when we plan for backups, we want to leverage all the efficiency technology we can in the hypervisor stack (CBT) and the backup stack (source-based dedupe) to minimize impact. So how can we leverage this information to help plan?
When it comes to image based backups one of the major things to overcome and consider is VMware snapshots. They are an essential part of getting Operating System consistency during backups, and being able to present a consistent copy of a VM to be backed up. Consistent problems across customers with VMware snapshots can be boiled down to one thing if used correctly– consolidation of the snapshot (not plural).
Based on how these snapshots work, in order to remove the snapshot the new data that was written after the snapshot was taken needs to be hydrated back into the Virtual Machine disks. Depending on how much new data is being written to your VM, at high levels of new block writes, this process can become unpredictable and thus cause issues in the long run for certain VMs. At times this process can even fail leaving orphaned snapshots. A VM with an unconsolidated snapshot can consume double the space of the VM and can increase the response times for storage operations of a VM.
For this reason, understanding how your environment, and specifically your VMs, work with VMware snapshots is key to image-based protection! Wouldn’t it be nice to have this kind of insight from vCOps? Here’s a view that is actually unrelated to backups, highlighting how Snapshots relate to a VM.
Awesome! I can see the Snapshot space accumulating and being removed with consolidations at the bottom, and I can see the direct relationship between this and Total Latency at the top. Yes, the snapshots do cause latency! It is very cool to see, and a quick lesson for planning image-based protection. In summary, we want to minimize how long snapshots exist on VMs during backup to a) reduce how much data needs to be consolidated and b) reduce any impact on storage latency and other VMs.
What else can we do? Well, if you run a copy of vCenter Operations that has Enterprise licensing, you can build custom dashboards from this information and leverage more analysis and visualization tools or widgets such as the Heatmap. Don’t get me wrong here: everything you saw previously is available out of the box with vCenter Operations. But if you want pre-built dashboards, Enterprise is currently it.
I hope you liked it! As you can see the capabilities of bringing backup information into vCOps and directly into the VMs is important and will be key to running and relying on image based protection for your VMs. Would love to hear your feedback on this one.