Unofficial vCenter Operations for PowerCLI update!

Posted by vele1655 | Posted in vCOps | Posted on 15-03-2014

This is going to be a quick update on vCOps for PowerCLI. In the past few months I have added a few capabilities to the module.

  • Easy assessment of pre-defined metrics
  • vCOps resource lookup and cache on connect (can be used without PowerCLI now)
  • New API implementation of getMetricValuesFromMemory

 

Download the module here and at GitHub.

 

vCenter Operations Assessment

 

I am going to skip over documenting the assessment portion since it is still a work in progress. When you uncompress the zip file you will see the following shortcut, which starts an assessment based on existing metrics that are in the var directory. There is a file called “attr_resources.txt” which includes three sections: 1) a command section for collections, 2) metricKey, which grabs files with specific metrics for commands, and 3) metricKeyMatch, which grabs metrics based on regular expressions. The collection takes A LONG TIME to run since it is collecting historical information and must do this metric by metric, resource by resource =)

image

 

 

vCenter Operations Resource Lookup

 

It is now possible to run the metric collections without first logging in to PowerCLI/vCenter. This works because all vCOps resources and their identifiers are collected and cached when you connect. These items can be passed to the metric collection cmdlets.

 

image

 

 

Get-vCOpsResourceMetricRecent

 

vCOps has introduced a newer API method (getMetricValuesFromMemory) that allows access to metrics that reside in memory. In addition, a single query can request multiple resources and metrics. This helps in large scale environments when you are looking for the most recent metrics for many resources.

 

image

Avamar and VMware vCenter Operations – Part Two!

Posted by vele1655 | Posted in Avamar, vCOps | Posted on 27-08-2013

This post involves integrating Avamar VM backup statistics directly with other VM statistics while leveraging the deep analytics of vCenter Operations to help manage image based backups (see Chad’s videos here). Before proceeding, check out a recent related post here that discussed how vCenter Log Insight can be used to receive Syslog messages and display useful Avamar backup related activities for VM admins. At a high level, the difference between these two posts is "what happened" (vCLI) versus "what is the state over time" or "what might happen" (vCOps).

Part One of the vCOps integration is also important to review. This was released back in the February 2013 timeframe during VMware’s PEX. This first iteration laid the groundwork for performing active collections against Avamar for VM backup events and creating a mechanism to send these to vCenter Operations. Towards the end, the post details how to configure the collections. The value in this integration was that you could overlay "change events" per Virtual Machine and see when backups occurred on top of Health, Risk, Efficiency, and Faults. For example, if the health of a VM decreased, this could be caused by underlying Datastore performance problems. The backups could then be plotted against this decrease in Health. And even beyond this, you could change the context to view backups from multiple VMs and how they aligned against Datastore health. Value? See the impact of backups, and get more information to help plan better.

clip_image001

 

This integration was very cool and people were able to get it running with Avamar and even VDP (which took modifying a VDP firewall rule for Postgres access).

Now what’s the next level of coolness? Well it is actually in line with what most people think about traditional monitoring. Can we monitor the metrics of Avamar? Of course. For a VMware administrator can we extend relevant information and monitor the VM backup metrics per VM? Yes.

Are these some of the questions that a VMware admin or even a backup admin might be interested in?

  • Is this VM in violation of an RPO?
  • Which VMs are consuming the most unique space in our Protection storage?
  • Which VMs have the lowest deduplication rate?
  • What is the backup rate?
  • Am I impacting storage?
  • ..and when did this start and end or what is normal?

Introducing Part Two! Take a look at the screenshot below. Notice that there is a root level grouping of Avamar statistics, and along with them we have an Avamar instance grouping as well. What does this mean? Since the VM object contains these statistics we can easily relate any of our backup activities to VM stats. Do you want to see if storage latency increases during backups? No problem. How about snapshot space usage during backups? Sure. Plenty of awesome possibilities here that I have really only started to explore!

clip_image002

 

What are we tracking per VM? Here’s the list. It is important to note that we are really tracking five things and the information is then placed into different calculated metrics.

  • Protected/Scanned bytes – The whole size of the VM being protected
  • New Bytes – The new bytes after source-based deduplication (on the image proxy) that are absorbed into the Protection storage
  • Deduplication – A calculation based on New Bytes and Protected bytes
  • Duration – How long a backup took
  • Currently Available – Recovery points available

clip_image003

 

It is also important to mention the different calculated values. Since vCOps doesn’t do data manipulation (WYSIWYG) without things called Super Metrics, sometimes it is more efficient to send calculated values. An important point here is that we also need Active Collections against Avamar. Why? If we want to estimate how many backups are currently around and do other calculations from this, then we could simply look at the expiration of backups that have occurred and figure out which ones *should* be around. But that’s not good enough!

We need active collections against Avamar for things like "How many backups does this VM have right now?" or "How much Protection storage is being consumed by all backups of this VM right now?" We can then take that number, along with the details of those backup activities, and calculate, for example, averages, sums, and maxes for the duration of those backups that currently exist. These calculations can then be plotted over time. It will make more sense when you start to see it in action!
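To make the difference concrete, here is a minimal sketch of the expiration-based estimate only. The backup records, the dates, and the Completed/Expires property names are made up for illustration, and nothing here talks to Avamar.

```powershell
# Hypothetical backup records for one VM; none of this is read from Avamar.
$now = [datetime]'2013-08-27T12:00:00'
$backups = @(
    New-Object PSObject -Property @{ Completed = [datetime]'2013-08-20T01:10:00'; Expires = [datetime]'2013-08-27T01:10:00' }
    New-Object PSObject -Property @{ Completed = [datetime]'2013-08-23T01:12:00'; Expires = [datetime]'2013-08-30T01:12:00' }
    New-Object PSObject -Property @{ Completed = [datetime]'2013-08-26T01:09:00'; Expires = [datetime]'2013-09-02T01:09:00' }
)

# Estimate: every backup whose expiration is still in the future *should* exist.
$estimatedAvailable = @($backups | Where-Object { $_.Expires -gt $now }).Count
"Estimated recovery points: $estimatedAvailable"
```

An estimate like this only knows what should exist. An active collection asks Avamar what actually exists right now, which is why the integration needs both.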

See the following for a summary of the metrics being shown.

  • Currently Available – Number of Recovery Points for the VM
  • Deduplication Latest (%) – Latest VM Deduplication %, calculated as 1-(New Bytes/Scanned Bytes)
  • Duration Average (seconds) – For recovery points that exist at this point, what is the average duration of backups
  • Duration Latest (seconds) – Latest backup duration
  • New Bytes All (%) – For recovery points that exist at this point, what is the aggregate percentage of New Bytes (Total New Bytes/Scanned Bytes)
  • New Bytes Avg (MB/sec) – Reported backup New Bytes averaged over duration and plotted at start and finish times
  • New Bytes Latest (%) – Latest New Bytes (New Bytes/Scanned Bytes)
  • New Latest (MB) – Latest New Bytes
  • New Max (MB) – For recovery points at this time, what is the maximum New Bytes
  • New Sum (MB) – For recovery points at this time, what is the total New Bytes
  • Scanned Avg (GB/sec) – Reported backup Scanned/Protected space averaged over duration and plotted at start and finish times
  • Scanned Latest (GB) – Latest space Scanned/Protected
  • Scanned Max (GB) – For recovery points at this time, what is the maximum Scanned/Protected space
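As a rough sketch of how several of these calculated values fall out of the raw numbers, assume a small set of recovery points that exist right now. The records and the property names (NewBytes, ScannedBytes, DurationSeconds) are invented for this example and are not the names used by the module or by Avamar.

```powershell
# Hypothetical recovery points that currently exist for one VM, oldest first.
# Sizes are in bytes and durations in seconds; the property names are made up.
$recoveryPoints = @(
    New-Object PSObject -Property @{ NewBytes = 200MB; ScannedBytes = 40GB; DurationSeconds = 300 }
    New-Object PSObject -Property @{ NewBytes = 100MB; ScannedBytes = 40GB; DurationSeconds = 240 }
    New-Object PSObject -Property @{ NewBytes = 500MB; ScannedBytes = 40GB; DurationSeconds = 360 }
)

$latest        = $recoveryPoints[-1]
$newStats      = $recoveryPoints | Measure-Object -Property NewBytes -Sum -Maximum
$scannedStats  = $recoveryPoints | Measure-Object -Property ScannedBytes -Maximum
$durationStats = $recoveryPoints | Measure-Object -Property DurationSeconds -Average

$metrics = New-Object PSObject -Property @{
    CurrentlyAvailable = $recoveryPoints.Count
    # Deduplication % = 1 - (New Bytes / Scanned Bytes), for the latest backup
    DedupLatestPct     = 100 * (1 - ($latest.NewBytes / $latest.ScannedBytes))
    DurationAverageSec = $durationStats.Average
    DurationLatestSec  = $latest.DurationSeconds
    NewBytesLatestPct  = 100 * ($latest.NewBytes / $latest.ScannedBytes)
    NewLatestMB        = $latest.NewBytes / 1MB
    NewMaxMB           = $newStats.Maximum / 1MB
    NewSumMB           = $newStats.Sum / 1MB
    ScannedLatestGB    = $latest.ScannedBytes / 1GB
    ScannedMaxGB       = $scannedStats.Maximum / 1GB
}
$metrics | Format-List
```

The rate metrics (MB/sec and GB/sec) and New Bytes All (%) are left out of the sketch since they depend on how the values are spread over the backup window.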

Here we have a screenshot that shows a few of these metrics being graphed. Notice in the top graph that we are showing between 7 and 8 recovery points at all times. The next graph shows the Scanned rate being somewhere around 10 GB/sec during backup operations. Also notice the last graph, which shows how much new data for all backups of that VM is actually living on the Protection storage at all times. It maxes out at about 2 GB.

Another important point to mention here is that the last graph shows where analytics are starting to predict the growth of the space. The area in grey is a dynamic threshold (DT) that is generated based on past data, and it is increasing. These DTs can be included in Health calculations and trigger alerts.

clip_image004

 

How about some more analytical coolness? The following graphs together give an idea of what is normal for rate of backups, percent of new data, and duration. As you can see our scan rate is within the grey DT, the New Bytes All (%) is slightly below the normal range, and the duration of the backups is now returning to the normal levels. Awesome?

clip_image005

 

The thresholds in vCOps can be dynamic or static. So if you decide that you want to alert on this information there are plenty of options.

One of the main benefits highlighted here is that we can track Avamar backup metrics against VMware Virtual Machine metrics. This is a hugely important point!

This can help to answer questions and plan for better backup windows. For example, in the following screenshot we are plotting a backup and its average scan rate (10 GB/sec) over a period of time, and then showing the storage latency. Notice how the storage latency is not impacted!

clip_image006

 

 

And taking this a step further, here is another screenshot of the same information but over a longer period. Here you can see that yes there actually is storage latency at certain times, and if desired you could then drill into those periods.

clip_image007

 

Here’s a bit of a possible use case.

I mentioned backup planning earlier, and this can be a very important thing. One of our major goals is to perform backups in a timely manner while not impacting SLAs or performance of a VM. This has been a major theme since the virtualization trend began, since there essentially are no wasted resources anymore: all resources are pooled and have a cost. So when we plan for backups we want to leverage all the efficiency technology we can in the hypervisor stack (CBT) and the backup stack (source based dedupe) to minimize impact. So how can we leverage this information to help plan?

When it comes to image based backups, one of the major things to consider and overcome is VMware snapshots. They are an essential part of getting Operating System consistency during backups, and of being able to present a consistent copy of a VM to be backed up. When snapshots are used correctly, the consistent problems customers see with them boil down to one thing: consolidation of the snapshot (not plural).

Based on how these snapshots work, in order to remove the snapshot the new data that was written after the snapshot was taken needs to be hydrated back into the Virtual Machine disks. Depending on how much new data is being written to your VM, at high levels of new block writes, this process can become unpredictable and thus cause issues in the long run for certain VMs. At times this process can even fail leaving orphaned snapshots. A VM with an unconsolidated snapshot can consume double the space of the VM and can increase the response times for storage operations of a VM.

For this reason, understanding how your environment, and specifically your VMs, work with VMware snapshots is key to image based protection! Wouldn’t it be nice to have this kind of insight from vCOps? Here’s a view that is actually unrelated to backups, highlighting how Snapshots relate to a VM.

clip_image008

 

Awesome! I can see the Snapshot space accumulating and being removed with consolidations at the bottom, and I can actually see the direct relationship between this and Total Latency at the top. Yes, the snapshots do cause latency! So very cool to see, and a quick lesson for planning image based protection. In summary, we want to minimize how long snapshots exist on VMs during backup to a) reduce how much data needs to be consolidated and b) reduce any impact to storage latency and other VMs.

What else can we do? Well, if you run a copy of vCenter Operations that has Enterprise licensing then you can build custom dashboards of this information and leverage more analysis and visualization tools or widgets such as the Heatmap. Don’t get me wrong here, all of what you saw previously is available out of the box with vCenter Operations. But if you want pre-built dashboards, Enterprise is currently it.

clip_image009

 

I hope you liked it! As you can see the capabilities of bringing backup information into vCOps and directly into the VMs is important and will be key to running and relying on image based protection for your VMs. Would love to hear your feedback on this one.

Integrating Avamar VM Image Backup Reporting with VMware vCenter Operations

Posted by vele1655 | Posted in Avamar, vCOps | Posted on 25-02-2013

Have you ever wondered how your backups affect performance and other things in your environment? Are you used to manually hunting and pecking for this information? Shouldn’t an operator know right away that a backup was taking place when troubleshooting or performing root cause analysis after the fact? If a Virtual Machine or Datastore is underperforming, wouldn’t it be nice to know how many backups were occurring during these times and what is affected?

 

If you’re a VMware customer and are familiar with vCenter Operations Standard+ and utilize EMC Avamar for VM Image based backups these kinds of questions can be answered!  As usual with my stuff, I am providing an unofficial way to perform this integration, but if you find it valuable let VMware and EMC know!

 

First of all, the integration being shown was performed with vCenter Operations 5.6 Standard and above along with Avamar 6.1.

 

The integration that is occurring between Avamar and vCenter Operations is very simple.  We are taking backup events after they have finished and creating “change events” in vCOps for the start and finish of the backup operation.  You can see this from the Operational Events graph below. 
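As a rough sketch of that idea, the function below turns one finished backup record into a start event and a finish event. The function name, its parameters, and the message text are hypothetical and are not the module's own code. Posting the events to vCOps is left out.

```powershell
# Hypothetical helper: builds the two change events (start and finish) for one backup.
function New-BackupChangeEvent {
    param(
        [string]$VMName,        # VM the events belong to
        [datetime]$Started,     # backup start time
        [datetime]$Completed,   # backup finish time
        [long]$NewBytes         # bytes sent after deduplication
    )

    $seconds = [int]($Completed - $Started).TotalSeconds
    $newMB   = [int]($NewBytes / 1MB)
    $detail  = "$seconds sec, $newMB MB new"

    # One event at the start time and one at the finish time.
    New-Object PSObject -Property @{ Resource = $VMName; Time = $Started;   Message = "Avamar backup started ($detail)" }
    New-Object PSObject -Property @{ Resource = $VMName; Time = $Completed; Message = "Avamar backup finished ($detail)" }
}

$events = New-BackupChangeEvent -VMName 'wguest-01' -Started '2013-02-25T01:00:00' -Completed '2013-02-25T01:05:00' -NewBytes 150MB
$events | Select-Object Time, Message
```

Because the events are only built after the backup has finished, both can carry the duration and the new bytes in their description.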

 

We are looking at the Health graph of a specific VM along with its related events. Here we are hovering over the graph on an ME logo which corresponds to change events for this VM. Notice that the events are mostly clustered two at a time, representing the start and finish of backup jobs for this VM. Also notice how the Health (graph) is lower in some cases during these events? This represents the analytical capabilities of vCOps determining that a certain percentage of metrics or KPIs are above their “learned” dynamic threshold (DT) for this VM. This can result from a Virtual Machine’s own metrics that it relies on. The event is being shown here with the duration of the backup and the total bytes that were sent after changed blocks (CBT) were deduplicated within the Avamar VMware image proxy VM.

 

 

clip_image001[8]

 

 

Note! The concept of the VMware image proxy is important for this use case! Here is a similar graph that is showing the workload instead of health for a Virtual Machine. Notice how there is no extra workload being reported? Well, this makes sense since the VM is not doing anything during backup time, the proxy is! In certain cases, however, the VM workload can still spike because its underlying dependent items, such as storage, may be hitting peak demand. Anyhow, this offload is how VM image based backups occur with Avamar, but as you can see it can cause lower health due to using underlying shared resources (such as storage) and thus increasing response times or reducing available throughput.

 

 

clip_image002[7]

 

 

So how can we get to the bottom of this?  We showed the Health prior, but let’s now look at the anomalies view.  This summarizes possible reasons for the health being low and is somewhat inverse to the health graph.  Again we see backup events integrated and the anomalies being reported.

 

clip_image003[7]

 

 

Now let’s flip over to see related items such as Datastores that the VM is using.  You can see this easily from the Environment -> Members tab where the Datastores for the VM are listed.

 

clip_image004[7]

 

 

Here we have the same Operational Events view from the VM and are still looking at anomalies.  We start by reviewing the anomalies and selecting a workload badge on the graph on the left side.  You can see here that there is a time when vCOps determined that a Datastore was being limited by Disk IO and thus the demand was at 100% affecting all resources that relied on the Datastore. 

 

 

clip_image005[7]

 

 

Now let’s overlay the change events on the graph.  We can now see that the backup events can be shown to directly relate to the anomalies.  Notice how there are now multiple events being shown?  This view is from the Datastore and it is showing not only events for itself, but also events that occur from all of its children (VMs).  So here we can see an accumulation of backup events all occurring at the same time for different VMs!  This is a good thing of course since we are intentionally doing this.

 

clip_image006[8]

 

 

So you may be asking now, so what? Don’t do backups? The point here is to show that Avamar and vCenter Operations can be integrated to give an operator the information they need to properly react and plan for these types of events. In this example we are doing image based VM backups with source based deduplication from Avamar along with VMware’s changed block tracking (VADP integration). These features offload and reduce infrastructure utilization during backups. So what would your environment look like if you’re not taking advantage of CBT and source based dedupe? You tell me!

 

 

 

So how can I get this integration going? The integration used here is written in Powershell/.NET. It is very simple: we consume events from the Avamar Postgres database through a .NET provider and post them to the vCOps REST interface as change events. Below is a basic outline of the areas that will be covered when it comes to installing and configuring the integration.

 

 

  • Installing the Avamar vCenter Operations Integration
  • Success!  Continue on with Standard Interactive Usage..
  • Standard Non-Interactive Usage (service)
  • More details on Get-VMAvamarActivities

 

 

 

Installing the Avamar vCenter Operations Integration

Now in order to get the integration working there are a few requirements that I must call out.  First, what you can download from this site is not officially supported, take it as an early alpha release technical preview!  Your feedback can help shape future product integration so send it my way!

 

Requirements

 

Integration Points

  • Avamar Postgres Database
    • Review Avamar 6.1 Administration Guide, “Support for third-party reporting tools”
    • Review Avamar 6.1 Administration Guide, Appendix C “MCS and EMC Database Views”
  • Avamar
  • VMware vCenter
  • vCOps 5.6 REST Interface

 

 

The Powershell module contains cmdlets that can be leveraged either in a continuous service oriented fashion, or interactively. To make this easier, as part of the install process there is a step to create a login profile that stores authentication information (in a secure form) as well as a shortcut to launch the Avamar vCOps Service. So let’s get started!

 

Download Powershell Module and Postgres Data Provider

If Powershell and PowerCLI are installed then the first step is to download the Avamar vCOps Powershell module along with the proper Postgres .NET data provider assemblies. The Powershell module can be downloaded from here. The Postgres .NET data provider (Npgsql) must be downloaded separately from here. As shown in the screenshot below, from the Downloads page, download any of the version 2 ZIPs that match your .NET version. In my example, since I am using Powershell 3 I downloaded the Npgsql2.0.12.0-bin-ms.net4.0.zip. I highly suggest you use Powershell v3 to take advantage of Login Profiles.

clip_image007[8]

 

Unzip Downloads and Unblock

Once downloaded, unzip the Powershell module to a non-network drive and directory of your choice. This requirement is in place because the DLLs don’t play nicely when living on network drives. This limitation can be overcome with some simple hardcoding if you so please. Unzip the Npgsql file and grab two files, Npgsql.dll and Mono.Security.dll. Copy them to the newly created directory with the Powershell module in it. This is how the files should look in this directory.

clip_image008[7]

 

One more thing here for Powershell v2 users.  The DLLs may need to be Unblocked for security purposes (we do it automatically in v3).  Right click the DLLs individually and press Properties.  Press the Unblock button at the bottom.
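If you are on PowerShell v3 or later and would rather unblock the files by hand from the prompt than through the Properties dialog, the built-in Unblock-File cmdlet does the same job. A minimal sketch, assuming it is run from the directory that holds the module and the two DLLs:

```powershell
# Unblock-File ships with PowerShell 3.0 and later; it is not available in v2.
# Clears the "downloaded from the internet" mark on every DLL in the current directory.
Get-ChildItem -Path . -Filter *.dll | Unblock-File
```

On v2 the cmdlet does not exist, so the right click method above is still the way to go there.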

 

 

Create Login Profile

With that done, you should be ready to create your first Login Profile, which is used to store credentials (as a Secure String) and IP/DNS information for each integration point. This dramatically simplifies things and enhances the security a bit. Important note! The Secure Strings are .NET based, and are based on keys that are generated per logged-in User per Server for the Powershell instance. Each User that opens Powershell will need to run this command themselves with the credential information, or you will need to generate the proper Secure Strings for them with their unique Key (not shown here but possible). This leverages the ConvertTo-SecureString cmdlet.
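To show why the stored strings only work for the user who created them, here is a minimal sketch of a Secure String round trip using the standard cmdlets. It is not the module's own code, and the password value is just a placeholder. When no key is supplied, Windows protects the text with the current user's account on the current machine.

```powershell
# Encrypt a password for the current user on the current machine.
$secure    = ConvertTo-SecureString -String 'pass' -AsPlainText -Force
$encrypted = ConvertFrom-SecureString -SecureString $secure   # text that can be stored in a file

# Later, in a session run by the SAME user on the SAME machine:
$restored = ConvertTo-SecureString -String $encrypted

# Turn the Secure String back into plain text only at the moment it is needed.
$bstr = [Runtime.InteropServices.Marshal]::SecureStringToBSTR($restored)
try {
    $plain = [Runtime.InteropServices.Marshal]::PtrToStringBSTR($bstr)
}
finally {
    [Runtime.InteropServices.Marshal]::ZeroFreeBSTR($bstr)
}
```

A different user, or the same user on a different server, would get an error from the second ConvertTo-SecureString call, which is why each user has to create their own profile.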

 

Ok, now that you’ve got it downloaded and proper files in place the next thing to do is to open a Powershell window.  Change directory to where you unzipped the Powershell module.

 

Run the following command to import the Powershell module.

Dir *.psm1 | Import-Module

 

Now check to ensure you have the necessary cmdlets from this module loaded.  See screenshot below.

Get-Command -Module avamar_vcops

 

And finally you can create the profile if the cmdlets are loaded as seen in the screenshot below. Take the following command, copy it to Notepad, and edit the properties to include the proper IP/DNS, Username, and Password of Avamar, vCenter (VIServer) and vCenter Operations. If you do not wish to include vCenter Operations, remove the “vCOps=” line. This will mean you can only use the script in an interactive form: no posting the backup job events to vCOps, but possibly to a destination of your choice. Also, ensure you include passwords in single quotes (') to ensure special characters are preserved.

New-AvamarLoginProfile -Shortcut -Name "profile1" -LoginProfile @{
            Avamar=@{server="ave02";username="MCUser";password='pass'}
            AvamarDB=@{server="ave02";username="viewuser";password='viewpass'}
            VIServer=@{server="vcenter01";username="root";password='rootpass'}
            vCOps=@{server="vcops01";username="admin";password='pass'}
        }
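A quick illustration of why the single quotes matter for passwords. The password and the $word1 variable are made up for the example.

```powershell
$word1 = $null                 # stands in for any variable name that happens to appear in a password

$double = "P@ss$word1!"        # double quotes expand $word1, so part of the password is lost
$single = 'P@ss$word1!'        # single quotes keep every character as typed

"Double quoted: $double"
"Single quoted: $single"
```

For the profile command above, that means keeping every password value in single quotes.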

 

 

You should see the following output where “Shortcut has been created” is shown at the bottom.  I will explain the shortcut in a second.

 

clip_image009[7]

 

 

 

Test Login with Profile

Now the next step is to test and use the login profile. But first, if you are not in a PowerCLI window (that is, you followed the instructions and opened plain Powershell, although a PowerCLI window would work too), you need to load the PSSnapin from the Powershell window. By not having you open PowerCLI I can ensure that we are not connecting to multiple vCenter instances. During development I didn’t plan on this, so for now make sure you are only using 1 vCenter at a time per Powershell instance.

Add-PSSnapin vmware.vimautomation.core

 

From the same command prompt enter the following command, changing the “profile1” name to whatever you specified in the above cmdlet for the -Name parameter.

Connect-AvamarLoginProfile -Name profile1

 

This command will connect to all of the items that were specified in the Login Profile. If successful, you will see the output as follows without error. If it is not successful, it will show errors. The most likely errors are password or IP/DNS problems. Please refer to Requirements and previous steps if you are encountering errors.

clip_image010[6]

 

 

 

Success!  Continue on with Standard Interactive Usage..

Since we have successfully created and are using a profile now we have everything we need to get the integration working.

 

Start a VM Image Backup

In the case of this demo, I am using the integration in a fresh lab environment with an unused Avamar Virtual Edition node installed.  So I am going to start a backup so I have a backup record to see.

clip_image011[6]

 

 

Get-VMAvamarActivities

There is a semi-advanced cmdlet that we are going to be using in its most basic fashion to correlate Virtual Machines as Avamar VM Clients and their corresponding backups.  Under the covers the cmdlet can be used in a handful of different scenarios that I will cover in the advanced section.  In this case since there are no parameters specified we are returning every backup job for this VM with the following command.

Get-VM wguest-01 | Get-VMAvamarActivities

 

clip_image012[6]

 

It is important to also call out that there is a much more efficient way to get all VMs (for larger environments) and their associated Avamar accounts. The following command returns all VMs (-getAllVms), caches all Avamar VM client accounts without individual lookup (-useBulkAvamarVMLookup), and pulls only the last three days of jobs (-Start_Recorded_Date_Time).

Get-VMAvamarActivities -getAllVms -useBulkAvamarVMLookup -Start_Recorded_Date_Time ((Get-Date).addDays(-3))

 

These are all important parameters for the next part.  We are working towards being able to prepare for Non-Interactive Usage (service) where we continuously update the backup jobs.  In order to do this however, there is one more concept that needs to be covered.

 

Bookmarking (-useBookmark)

In order to track which records have been sent to vCenter Operations, a bookmark entry is created after each successful post when the (-vCOpsPost) parameter is specified, and it is stored as an AvamarNameorIP.CliXml file. See the following screenshot where the file was created after a successful post. The (-useBookmark) parameter, if specified, treats the bookmark as the last record seen and retrieves only the newer records. This way we can either run the integration on demand or run it continuously as a service, and it keeps track of where it left off after each individual successful Post per backup job and can recover from any error (as long as the bookmark is there). Important note: If you have multiple vCenters that are talking to the same Avamar instance, then bookmarking will not work properly. If you want to integrate multiple vCenters with a single Avamar instance and send the data to vCOps (using bookmarking), create separate directories with the same files in them. The separate directories will allow each instance of Powershell to maintain its own AvamarNameorIP.CliXml file. This limitation may go away in a future rev.

 

Bookmarking is required so that we don’t send duplicate data to vCenter Operations.
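The bookmark idea can be sketched with nothing more than Export-Clixml and Import-Clixml. The file name and the records below are invented for the example (the module names its file after the Avamar server), and the real cmdlet's internals may well differ.

```powershell
# Hypothetical bookmark file in the temp directory.
$bookmarkPath = Join-Path ([IO.Path]::GetTempPath()) 'example-bookmark.CliXml'

# Hypothetical backup records, oldest first.
$records = @(
    New-Object PSObject -Property @{ Client = 'wguest-01'; recorded_date_time = [datetime]'2013-02-20T01:00:00' }
    New-Object PSObject -Property @{ Client = 'wguest-01'; recorded_date_time = [datetime]'2013-02-21T01:00:00' }
    New-Object PSObject -Property @{ Client = 'wguest-01'; recorded_date_time = [datetime]'2013-02-22T01:00:00' }
)

# Pretend the first two records were posted successfully: remember the newest of them.
New-Object PSObject -Property @{ recorded_date_time = $records[1].recorded_date_time } |
    Export-Clixml -Path $bookmarkPath

# Next run: load the bookmark and keep only records that are strictly newer.
$bookmark = Import-Clixml -Path $bookmarkPath
$pending  = @($records | Where-Object { $_.recorded_date_time -gt $bookmark.recorded_date_time })
"Records still to post: $($pending.Count)"

Remove-Item -Path $bookmarkPath
```

Writing the bookmark only after a successful post is what lets a later run pick up where the last one stopped without sending anything twice.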

 

 

clip_image013[6]

 

 

So before running the Non-Interactive mode you have a choice to make.  You can either sync up the entire Avamar VM image job history to vCenter Operations for VMs that currently exist, or you can create a bookmark to begin at a certain point.  The next command gets all of the VMs, and starts from 3 days prior, posts the results to vCenter Operations and creates proper bookmarks.

 

Get-VMAvamarActivities -getAllVms -useBulkAvamarVMLookup -Start_Recorded_Date_Time ((Get-Date).addDays(-3)) -vCOpsPost

 

Or for pulling all backup jobs and creating bookmarks. (This is what is run continuously in Non-Interactive Usage)

Get-VMAvamarActivities -getAllVms -useBulkAvamarVMLookup -vCOpsPost

 

clip_image014[6]

 

Important Note: A bookmark will only be created with a successful post of a backup job! You can create a bookmark manually with the following command, changing the “AddDays” portion from the previous 30 days to something of your choice. This will set you up for running this as a service starting at this point, or for interactive usage from this point if you specify (-useBookmark). Note how we are specifying ToUniversalTime. It may not be extremely important, but it can be if you are trying to be exact, since the Avamar database stores events in UTC.

New-Object -TypeName PSObject -Property @{recorded_date_time=((Get-Date).AddDays(-30)).ToUniversalTime() } | Export-CliXml "$($DefaultAvamarServer.Server).CliXml"

 

Standard Non-Interactive Usage (service)

Open Created Shortcut

With Explorer open the directory that has the Powershell module.  Notice the “Avamar vCOps (profile1)” shortcut that exists.  This was created automatically for you when we created the Login Profile.  This shortcut can be used to launch the non-interactive “service” method of performing the integration.

clip_image015[6]

 

 

Once you double click the shortcut it will launch a Powershell window and title it appropriately for the profile name and Powershell PID. You can see here that we perform a Login, Backup Job query and post, and Logout. This sequence is run repeatedly with a 5-minute delay between runs. A parameter of (-once) can be added to the shortcut in order to only run once and quit.
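The shape of that loop can be sketched as follows. The function name and its parameters are hypothetical, and the action here is a placeholder rather than the real login, query and post, and logout sequence.

```powershell
# Hypothetical sketch of a "service" loop: run an action, wait, repeat.
function Invoke-CollectionLoop {
    param(
        [scriptblock]$Action,        # work to perform on each pass
        [int]$DelaySeconds = 300,    # 5 minutes between passes
        [switch]$Once                # run a single pass and return
    )

    while ($true) {
        & $Action
        if ($Once) { break }
        Start-Sleep -Seconds $DelaySeconds
    }
}

# Single pass, similar in spirit to adding -once to the shortcut.
$result = Invoke-CollectionLoop -Once -Action { 'login, query and post, logout' }
$result
```

Without the -Once switch the loop keeps going until the window is closed, which matches how the shortcut behaves by default.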

 

Below you can see the window with a couple of lines that begin with “Posting.”  Since we ran a backup job prior, it is actually posting this information to vCOps now.

 

clip_image016[6]

 

 

Verify Post of Change Event in vCOps

Now you can see that the event actually made it into vCOps by loading the standard vSphere oriented UI and navigating to the VM on the left side.  You can then click on Operations in the top, and Events below it.  Once there a health screen graph will appear, and you can select the Target above the graph in the “Related Events” portion of the graph toolbar.  If necessary you can modify the date range with the calendar button on the tool bar.  You should then see the appropriate change events marked ME (My Event) with the attached descriptions as you saw in the Powershell window above.

 

clip_image017[6]

 

 

 

All Done!  You now have the interactive and non-interactive modes working and are ready to push backup information continuously to vCOps and possibly do some other really cool stuff with Powershell and Avamar!

 

 

 

 

 

 

 

The vCOps Browser War–Chrome Wins

Posted by vele1655 | Posted in Monitoring, vCOps | Posted on 09-10-2012

This is going to be a short post about the features that Chrome has that make it the browser of choice, in my humble opinion, for VMware vCOps (if the browser is how you access vCOps).

1. Speed

In working with all of the browsers, Chrome seems to do an excellent job with handling large loads of data (thousands of VMs) when loading widgets and larger resource lists. You will find it able to load pages, widgets, and lists of resources quicker than other browsers. If you’re like me and mostly impatient, this will add some time to your day and make you happy =)

2. Zoom

When building larger dashboards with more widgets you will find that sometimes there are sizing issues with what is being displayed. Also, if you move the dashboard between systems with different resolutions, it may render differently than expected. The native zooming feature of Chrome is an easy way to adjust this and is perfect for sizing dashboards for the NOC!

To activate the zoom, first look for the top right options button with the three lines as shown below.

image

Then look for the “Zoom” option as highlighted in yellow. On the far right side of this option there is also a full-screen button that you can select once you have the proper zoom level.

image

Notice in the screenshots below how the zoom level allows for more viewing options aside from resizing the browser window.

Zoom (200%)

image

Zoom (50%)

image

 

3. Auto-Rotate Between Tabs

When building dashboards for NOCs specifically, making them more dynamic can be important, and if you are squeezed on the number of viewing stations, rotating through dashboards becomes essential.  In vCOps, there is a way to make the pre-defined dashboards rotate automatically (not Chrome tab rotation).  This works in all browsers, and can be accessed from “Reorder Tabs” under Dashboards (vCOps Enterprise).

image

Once in the configuration dialog you can drag and drop tabs (to reorder them for viewing) and set the order of rotation.  To drag and drop, select the label on the left side and drop it before or after other labels.  There are three options that control rotation between tabs.  The first is auto-transition on or off, the second is the number of “seconds” to linger on the tab, and the third is “to this tab”, which tells vCOps which tab to go to next.
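
To make the three rotation options concrete, here is a minimal PowerShell sketch of how they chain dashboards together.  The dashboard names and the hashtable layout are made up for illustration; this is not how vCOps stores the settings, just a model of the behavior described above.

```powershell
# Hypothetical dashboards; the three fields mirror the dialog options
# (auto-transition on/off, seconds to linger, and which tab comes next).
$tabs = @{
    'World' = @{ AutoTransition = $true; Seconds = 30; ToThisTab = 'VMs' }
    'VMs'   = @{ AutoTransition = $true; Seconds = 30; ToThisTab = 'VNX' }
    'VNX'   = @{ AutoTransition = $true; Seconds = 60; ToThisTab = 'World' }
}

# Walk four steps of the rotation, starting from the first tab.
$current = 'World'
$visited = @()
for ($i = 0; $i -lt 4; $i++) {
    $visited += $current
    $settings = $tabs[$current]
    # A tab with auto-transition off ends the rotation.
    if (-not $settings.AutoTransition) { break }
    # A real rotation would wait here: Start-Sleep -Seconds $settings.Seconds
    $current = $settings.ToThisTab
}

$visited -join ' -> '
```

Because each tab names its own successor, the rotation order does not have to match the left-to-right tab order, and a tab with auto-transition off simply stops the cycle.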

image

Once you have that set, you will notice that when you are on a tab that is set for auto-transition there is a play or pause button in green next to the “Dashboard Tools” in the top right corner of vCOps.

image

Again, this feature isn’t a browser feature.  The browser feature is the next coolest thing.  Say you want to view the vCOps vSphere and the vCOps Custom (Enterprise and E+) UIs on a single display in the NOC and have them switch back and forth.  This would be a rotation among Chrome tabs.  This feature is an add-on to Chrome called “Revolver-Tabs”.  Use the following link to add it to your Chrome browser.

https://chrome.google.com/webstore/detail/revolver-tabs/dlknooajieciikpedpldejhhijacnbda

Once it is installed you can go to settings and configure it.

image

Press Extensions on the left side, and then Options in blue below the extension.

image

And there you go, here are the options for rotating between the Chrome tabs.

image

And here is the button that is to the right of the URL location input that allows you to turn on or off the auto-rotation between tabs.

image

Enjoy, I wish I had this back in my Ops days!

End-to-End VDI Monitoring with vCOps and EMC

Posted by vele1655 | Posted in Monitoring, vCOps | Posted on 08-10-2012

Tags:

0

If you haven’t heard yet, vCenter Operations rocks! EMC has been doing a lot of work in creating a standard integration framework and building plugins that use this framework in order to feed vCOps more and more information about the physical world.  Now why is this important?  What we’re trying to move away from in the monitoring world is siloed tools, siloed skills, and the old way of doing things.  If you’ve gone down the cloud and virtualization path you have probably already realized that how we used to do things is no longer good enough, or that there is opportunity with new technology to do things better.

This post is going to show how vCenter Operations through plugins can be extensible and provide end-to-end visibility, alerting, trending, and analytics. 

What kinds of questions can be answered with this kind of solution?

  • What desktop caused my high load on my storage array?
  • What users are at risk during this disk failure?
  • What part of my solution is the most underutilized?
  • How can I give my VDI team more insight into the resources they consume?
  • How can I simplify my monitoring tool strategy?
  • How do I make my infrastructure invisible?
  • What does monitoring a software defined datacenter look like?

It is important to start this out by saying that vCOps out-of-the-box monitors the vSphere hypervisor and does the most detailed job I have seen.  However, in order to get a whole picture of the datacenter, what kinds of things do vCOps and the hypervisor layer need visibility into?  We tend to focus here on things that reside directly below the hypervisor.  In other words, what does it rely on in order to provide its services to virtual machines?  The two most common resources are the physical hosts and their storage.  The next level below this would then be the networking that connects the hosts to the storage.  For the physical side, however, let’s focus simply on the storage and the hypervisors.

Now in terms of a solution what does vCOps need visibility into?  The solution tends to encompass everything, but what we need above the hypervisor is the application information.  Here our application is VMware View for Virtual Desktops, so we need information about the desktops such as PCoIP latency, bandwidth, networking, etc.  If I can then paint the picture of the solution, it would entail View (Desktops) –> VMware vSphere (Hypervisor resources) –> EMC VNX (Storage).   In this example, we have plugins in vCOps that cover all three domains.  Let’s take a look at the following screenshot that gives an end-to-end view.

image

Note: vCOps does not let you build a health tree with more than 3 levels; this one was modified to show more than 3.

You may also ask, so what if my storage vendor does not have a plugin in vCOps that maps relationships internally and between datastore and NFS export/LUN?  The whole view of the right side below the view-efd datastore would not exist!

Now, notice how the topmost resource is the VDI Environment.  Notice the color of yellow?  This is a health metric.  In vCOps the health metric is created based on either static or dynamic thresholds and anomalies that violate these thresholds.  All anomalies affect the “Self” health score of a resource, and those anomalies affect any parents.  So in essence all health bubbles up.  So the VDI Environment resource gives us a quick understanding of how all its children, or how the whole environment, is faring.

image

As you look below the VDI Environment you can see that we have summarized other things such as User Sessions, Network, Storage, View Clients, etc.  First let’s take a look at how the VNX Connector integrates with the View connector.  It is important to note that there is no work needed to make this “connection” between the two adapters.  It is simply a matter of each adapter describing its own resources and their relationships to their fullest.  For example, the VMware vCenter adapter describes the VM to datastore relationship, the VNX adapter describes the VMware datastore to VNX LUN/NFS Export relationship, and the View adapter describes the desktop session to VM relationship, and thus we have an end-to-end view through the rest of the relationships built internally to each adapter.
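
Here is a small PowerShell sketch of why no extra “connection” work is needed.  Each adapter contributes only the relationships it knows about, and walking them end to end yields the full chain.  The record layout and the Get-DependencyChain helper are hypothetical, used only to illustrate the idea; they are not part of vCOps or any adapter.

```powershell
# Hypothetical relationship records; each adapter only describes its own edge.
$relationships = @(
    [pscustomobject]@{ Adapter = 'View';    Parent = 'Desktop Session'; Child = 'Virtual Machine' }
    [pscustomobject]@{ Adapter = 'vCenter'; Parent = 'Virtual Machine'; Child = 'Datastore' }
    [pscustomobject]@{ Adapter = 'VNX';     Parent = 'Datastore';       Child = 'NFS Export' }
)

# Follow parent -> child edges from a starting resource until none are left.
function Get-DependencyChain {
    param(
        [string]$Start,
        [object[]]$Relationships
    )
    $chain = @($Start)
    $current = $Start
    while ($true) {
        $next = $Relationships | Where-Object { $_.Parent -eq $current } | Select-Object -First 1
        if (-not $next) { break }
        $chain += $next.Child
        $current = $next.Child
    }
    return $chain
}

$chain = Get-DependencyChain -Start 'Desktop Session' -Relationships $relationships
$chain -join ' -> '
```

Remove the VNX record and the chain stops at the datastore, which is the same gap described earlier for a storage vendor without a plugin.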

Now let’s drill into the storage resource.  Here you can see that we have the view-efd datastore listed as the only datastore involved in desktops.

image

If we drill into the datastore, you will then see the VNX connector’s NFS Export resource listed.

image

If we double click the NFS export, you can then drill into the view-efd filesystem which you can see relies on the server_2 datamover and the EFD pool.  At any point in drilling through these resources you would then be able to see metrics for each of the resources.  Actually, what you can do is limitless based on the idea that you can build your own dashboards to meet your needs.  The VNX and View connector come with their own dashboards, but expect to see more information about how to build cool mashups in the future!

image

Cool?  This kind of drill down is possible with any EMC connector for vCOps by itself.  Now let’s step up to the next level, where we start with a View session and work our way down, going even a bit further to the backend disk!

Below we start with the View Environment again.  Here we can choose the User Sessions.

image

From there we see the sessions listed, and we can choose a session.

image

The session chosen then displays as its children the Teradici PCoIP information, networking, and the Virtual Machine.

image

If we look into the virtual machine we can see the networking it relies on as well as the datastore.  So here we are, we arrive again at the datastore that VM is using and we can start drilling down in a similar way to what you saw earlier.

image

You can see here the datastore, below it the NFS Export, and above it the Virtual Machines, Hosts, and other things that rely on that datastore. 

image

image

image

image

Here is where things go a bit beyond what we showed above.  At this point you can see we have the disk volume in the middle that supports the EFD pool and then below we have the LUN that supports the dVol.  This is where the VNX connection between FILE and BLOCK occurs.  The datamovers (file) which are responsible for NFS operations integrate with the storage processors (block) to provide their services.  In the following you can see this LUN relying on a raidgroup and a storage processor.

image

And to finish it up you see the backend disks that support that raid group!  From a user all the way through to the backend disk.  Now is this drill down that important?  Well, it’s very cool to see visually and to be able to navigate in real time.  But even more important is how the alerting and health from the backend disk will bubble all the way up the stack to the user sessions!

image

 

Here are a few views of the “View Main” dashboard along with a couple of extra widgets that help show how the two adapters can be used together.

View default dashboard

image

View & Desktop Stats

image

View & NFS Stats (VNX)

image

View & Backend Disk (VNX)

image

 

So what’s coming up?  More and more plugins from VMware and EMC!  What’s another gap?  Virtual Networking? Physical Servers? Blade Chassis?  Networking switches?  Google Nicira and APG Watch4Net and see what kinds of acquisitions have been made recently by EMC and VMware!

Unofficial VMware vCenter Operations Powershell Module

Posted by vele1655 | Posted in Monitoring, vCOps | Posted on 04-09-2012

Tags:

6

As you’ve seen in my posts I have documented the usage of vCenter Operations as a killer performance management solution.  One of the things that I would fault it for in the short term is the lack of programmatic abilities (common theme on this blog, I know).  So here is another post that is marked as “Unofficial” since the work done is not in any way condoned by VMware or my parent company, EMC.  However, the work can do some good to get the community involved in what certain things may look like if they were to become official.  There is nothing more important in developing products and features than the community to help direct these things in the most meaningful way!  So if you like what you see tell us about it!

The module that I am posting here was co-developed by Alan Renouf (Virtu-Al.Net). 

 

There can be many perceived purposes of these scripts.

  • How do I get those cool analytical stats (contention, health, stress, etc.) via CLI? And how can I feed these stats to a platform of my choice?
  • How can I build my own custom reporting solution with vCenter Operations?
  • How can I trigger workflows and scripts based on vCenter Operations conditions?

So what is possible with this vCOps PowerShell module?  It is a read-only capability to query vCenter Operations metrics based on piping a VMware object.  This means you can take any VMware object from PowerCLI (Get-VM, and others) and pass it via pipe to the vCOps PowerShell cmdlets (Get-VM VM01 | Get-vCOpsResourceMetric -metricKey "badge|alert_count_critical").  This then returns the metrics within the specified or default date range along with their calculated dynamic thresholds.

If you are interested in more about why this is unsupported see below the examples =)

Requirements

  • PowerCLI
  • Download the module here
  • Dir *.psm1 | import-module
  • Connect-VIServer VIServer
  • Connect-vCOpsServer -Server mgmt-vcops01 -Username admin -Password pass

 

Examples

The first step to get the cmdlets working is to connect to a working vCenter Operations instance, specifically the UI VM.

Connect-vCOpsServer -Server mgmt-vcops01 -Username admin -Password pass

 

Once you have done this the following cmdlet can be used to identify which attributes are available for a certain PowerCLI object.  Here we choose a VM and then check the attributes.  This is by far the slowest part, and is only necessary when you don’t know the attr_key that you want to call.  You can use this information to create an array of strings @("attr_key","attr_key") that can be used with the next cmdlet.

Get-VM | Select -Last 1 | Get-vCOpsResourceAttributes

image

 

This cmdlet is used to actually retrieve specific metrics.  Here we choose the “badge|alert_count_critical” metric with a default startDate.

Get-VM | Select -Last 1 | Get-vCOpsResourceMetric -metricKey "badge|alert_count_critical"

image

 

So say you wanted to do a handful of metrics?  This would be done by passing an array to the metricKey parameter as follows.

Get-VM | Select -Last 1 | Get-vCOpsResourceMetric -metricKey "badge|alert_count_critical","badge|health"
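
If the list of keys grows, it can be easier to build the array once and reuse it.  The sketch below uses only the two metric keys shown in this post.  The check for loaded cmdlets is my own addition so the snippet does nothing harmful in a session without PowerCLI or the vCOps module.

```powershell
# Metric keys taken from the examples in this post.
$metricKeys = @("badge|alert_count_critical", "badge|health")

# Only call the module cmdlets if they are loaded in this session.
if ((Get-Command Get-vCOpsResourceMetric -ErrorAction SilentlyContinue) -and
    (Get-Command Get-VM -ErrorAction SilentlyContinue)) {
    Get-VM | Select-Object -Last 1 | Get-vCOpsResourceMetric -metricKey $metricKeys
}
else {
    Write-Host "vCOps module or PowerCLI not loaded; would request: $($metricKeys -join ', ')"
}
```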

What if you wanted to specify a custom date range?  This following cmdlet would use the native Get-Date cmdlet to format a date query for ten minutes prior.  There is also an endDate parameter if you want to specify a range.  Keep in mind that the date range that will be returned is based on server time so it is essential that your client time settings align correctly with the vCOps server.

Get-VM | Select -Last 1 | Get-vCOpsResourceMetric -startDate (Get-Date).AddMinutes(-10)
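
For an explicit window, a rough sketch is to compute both bounds from a single Get-Date call and pass them together.  The -startDate and -endDate names come from the text above.  Splatting them from a hashtable and the guard around the cmdlet call are my own choices, not something the module requires.

```powershell
# Fix the end of the window once so both bounds derive from the same instant.
$endDate   = Get-Date
$startDate = $endDate.AddMinutes(-10)

# Parameter names as mentioned in the post; passed via splatting below.
$range = @{ startDate = $startDate; endDate = $endDate }

# Print the window in UTC to make comparing against the vCOps server clock easier.
"Querying from {0:u} to {1:u}" -f $startDate.ToUniversalTime(), $endDate.ToUniversalTime()

# Only call the module cmdlets if they are loaded in this session.
if ((Get-Command Get-vCOpsResourceMetric -ErrorAction SilentlyContinue) -and
    (Get-Command Get-VM -ErrorAction SilentlyContinue)) {
    Get-VM | Select-Object -Last 1 | Get-vCOpsResourceMetric -metricKey "badge|health" @range
}
```

Printing the window first is a cheap way to spot a client clock that has drifted from the server before wondering why a query came back empty.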

 

Have fun!

 

An Unsupported Implementation

In order to achieve what we are doing here there were a few things that we had to overcome. 

The first major hurdle had to do with the lack of being able to programmatically ask vCenter Operations for a list of metrics per device in the current (5.0.2) REST interface.  In order to achieve this I had to do a little fancy work with a read-only database HTTP form that is used to do debugging on the database (https://vcopsui/vcops-custom/dbAccessQuery.action).  This capability is Googleable, so the part that is unsupported is the programmatic usage of this HTTP form.

The second major hurdle was really a combination of things.  In order to use this interface, I had to “Act as if” I was the normal web UI accessing the form.  This meant diving into the workflow when connecting via the GUI and all of the authentication and license checking that needs to occur to validate a session.. fun, but not my first rodeo and obviously not supported =)

The third hurdle was based on SQL and generating a proper query.  I can’t say I’m even remotely capable at SQL but here’s what I did to get what I needed.  It gets a bit complex based on making sure we capture resource identifiers.  The work is actually about a hair away from working with even non-VMware adapters as well. 

image

 

So where’s the future?  It’s up to you.  At this point the output is a bit raw from the cmdlets.  The next logical step would be for someone to start building some cmdlets that are default views of the important metrics per type of resource (Get-VM | Get-vCOpsPerfView or Get-VM | Get-vCOpsCapacityView).
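
As a starting point, a view cmdlet like that could be a thin wrapper over Get-vCOpsResourceMetric.  The sketch below is hypothetical: Get-vCOpsPerfView does not exist in the module, the default key list reuses only the keys shown in this post (a real performance view would pick its own), and the output is passed through unshaped.

```powershell
# Hypothetical wrapper; not part of the module.
function Get-vCOpsPerfView {
    [CmdletBinding()]
    param(
        # Any PowerCLI object that Get-vCOpsResourceMetric accepts from the pipeline.
        [Parameter(Mandatory = $true, ValueFromPipeline = $true)]
        $InputObject,

        # Assumed default set; only keys that appear in this post.
        [string[]]$MetricKey = @("badge|health", "badge|alert_count_critical"),

        # Default window start: ten minutes ago.
        [datetime]$StartDate = (Get-Date).AddMinutes(-10)
    )
    process {
        # Forward each piped object to the module cmdlet with the preset keys.
        $InputObject | Get-vCOpsResourceMetric -metricKey $MetricKey -startDate $StartDate
    }
}
```

With the module loaded and connected, usage would look like Get-VM VM01 | Get-vCOpsPerfView.  Defining the function does not require the module; calling it does.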

VMware vCOps: Data Visualization at Scale

Posted by vele1655 | Posted in Monitoring, vCOps | Posted on 26-06-2012

Tags:

4

If you’re like me then you probably noticed the appealing ways in which VMware’s vCenter Operations displays data.  The very first thing that stuck out to me was how it was able to aesthetically present large amounts of data in a seemingly unique way.  It has been through further working with it that I realized that how it presented massive amounts of data and how it looked visually were just two of the many significant strengths it has as a performance management and performance intelligence platform.

This post is going first to show some behind the scenes monitoring work that we did this year for EMC World 2012 in Las Vegas so you get some context on the vCOps dashboards I am presenting later.  I was part of an EMC team that supported the physical and virtual infrastructure behind the Hands On Labs under Jeff Thomas and the vSpecialist Technology Enablement team.  During the show we provided 27 different hands on labs to partners, customers, and EMC employees.  Most of the hands on labs included standard operating systems as well as virtualized versions of our hardware.  Actually taking a step back from this, it is truly stunning what was accomplished and the amount of capital it would have required to create physical infrastructure to accomplish training 1850 people across 3500+ labs in a short period of time.  The actual number of virtualized EMC hardware platforms was more than the labs taken since some labs included multiple storage arrays; this total was somewhere around 3,800+.  Amazing!  What size datacenter would it take to house 3800 pieces of 1/4-2 rack hardware platforms and what would that cost? We fit all of the equipment in around 10 racks.

Some of the stats are as follows based on what the environment would have looked like were it physical.  Is that a fair comparison?  Maybe not, but if you consider what it would take to tear down and refresh hardware between training sessions, then you could probably take at least 1/5 of this.

vCPU: 35,485

Memory: 82.2TB

Ethernet Cards: 40,216

Virtual Disks: 29,818

Total Mapped Storage: 1PB

ESX Servers: 3,762

vCD Instances: 349

Anyways, take a second to consume some of these stats, they are astounding, 3,762 nested ESX Hypervisors used!  Below is a picture of what it looked like in the labs where we had 180 seats for our virtual labs.  Thanks to our partners VMware, Cisco, and Wyse for helping us put this thing together!

clip_image002

 

If you look at the screen on the left of the panorama picture above you can see the rotating dashboard that was up from vCOps, and on the right you can see the labs taken summary status screen (1st Image below).

image

Now as you can tell vCOps was up live to view for anyone taking the labs.  What I want to do in this post is go over the power of data visualization for NOCs and in tracking down problems in your environment.  This post isn’t going to focus on much of any of the other cool stuff in vCOps (Smart Alerting, Analytics, or the mass of widgets for dashboards).  I will be showing some of the dashboards we had which helped us run 3,000+ VMs at any one time.  By the end of this you should have a pretty good idea of how you can leverage vCOps to change how your NOC monitors (wish I had this a few years back) and displays critical infrastructure information!

I already have done a post on how to use a few of the widgets which should be helpful if you are trying to recreate what you see here (link).  Some of the information presented will be from the EMC VNX storage arrays that we used.  There was a huge announcement about our engineering relationship in creating and delivering all aspects of the VNX storage plugin and analytics suite (here).  This is a huge deal, as the plugin is going to have some uber-value to storage admins as well as virtual admins.  Other plugins exist out there for vCOps, but I believe this is the first initiative with vCOps where a storage company has wrapped their troubleshooting know-how into the plugin, worked vigorously to format the data so vCOps could take maximum advantage of it, and continue to work on things to make the two things better versus simply presenting metrics.  So if you think what you see in vCOps and the synthetic metrics (Health, Risk, Stress, etc.) and out-of-the-box dashboards are cool for VMware information, just wait till you see vCOps with the VNX in action.  The screenshots in this post are based on alpha engineering work, so no promises and this is not even close to everything you will get with the VNX adapter now or in the future =)

Enough already, on to the dashboards for the labs!  Talk about eye candy..

VMware World Dashboard

This dashboard was put together to give us a macro view of how our infrastructure was running.  What you’re going to notice here is that we are using vCenter Operations Enterprise 5.0.1, which gives us the ability to access the /vcops-custom URL to create our own dashboards.  The standard canned view (vcops-vsphere) is excellent at what it gives out of the box, but having run NOCs in my past I am a bit picky about how things show up and this definitely did it for me.  Here we can focus more on the meat and potatoes metrics instead of analytics (still have some here though).  So in reality we are using vCOps in more of a 1st generation performance management tool manner, but applying its powerful data visualization capabilities to a whole bunch of data.

So the first dashboard displayed in the top left and right corners is a summation of our workload across all of our clusters.  You can see that we had 7 clusters with dedicated vCenter instances, which all rolled up into the World resource.  By the way, the virtual Enginuity (virtual VMAX 10K/40K) were running on clusters 5 and 4 which is why they are at 51% and 42% utilization respectively (6 vCPU 100% consumption all the time per lab per director).  The widget in the middle top is a scoreboard that gives some cool consolidation information, such as 29 VMs per host, 15 vCPUs to pCPUs and a total of 1,160 Intel Cores (thanks Cisco for the UCS, clusters 4 and 5 were E7s—all I can say is wicked fast).  The other six widgets are standard graphing widgets that relate to how someone is used to seeing data from the MRTG/RRD/Cacti days.  However, notice that there are dynamic thresholds being displayed surrounding some of the data.  This is because we didn’t have the environment up for long enough to get that much useful analytical data (notice the focus of this post is data visualization, not the really cool analytics).

image

The bottom section of this dashboard was a summation of datastore stats (Read/Write IOs vs Latency) and ESX CPU utilization.  This dashboard is 100% VMware information, no storage array info presented yet.  If you have vCOps and haven’t checked out the “Analysis” view that has pre-configured heatmaps, do so now! =)  These heatmaps can be replicated into your own dashboards similar to what you see below.  The datastore heatmaps should be one of the premier things to pay attention to (at least a storage head like me would) as they give most of the picture of IO response time, i.e. how long an IO takes to traverse the hypervisor, the network, and the storage and back.  This kind of stuff seems trivial but it is ultra critical in maintaining and understanding the scalability of your environment.

image

So let’s break some of these widgets down.  I will give the widget on the left and the configuration of that widget on the right.

Scoreboard Health

image image

 

Scoreboard

image imageimage

 

Metric Graph

image image

 

Heat Maps

I use the heat maps A LOT! As you will see following this, they are an extremely useful way to present a whole bunch of data in a multi-dimensional sense (1D, 2D, and 3D).  By multi-dimensional I am simply referring to how you can present multiple metrics at the same time, ie. utilization is 1D, utilization and response time is 2D, utilization, response time, and IO size is 3D.  You can see this one is focusing on the ESX CPU utilization.  There is actually a grouping taking place here where we are combining hosts in a cluster into a set of resources being displayed in the larger cumulative boxes and the smaller boxes represent ESX hosts themselves.  You can see here that the “DEAL” cluster is running the hottest but is balanced between 45-60% across all hosts.  The least busy cluster is the “MONITORING” cluster in the bottom right corner.  I did not configure anything 3D from a metric perspective, actually just 1D since I am showing a single metric, utilization, for size and color. 

image image

 

The following heatmap is again solely a VMware datasource.  It shows the hypervisor’s perspective on latency when performing read operations to its datastores.  This information is a consolidated view from all hypervisors all rolled into a single resource (datastore).  Typically this view in other monitoring tools or ESXTOP would solely represent A SINGLE hypervisor talking to its datastores… not so useful unless consolidated together.  So very cool, a consolidated view and presented in a 2D fashion.  You can see here that the size of the box represents the amount of IOs, and the color represents the health, in our case latency.  What you’re seeing there as an example is the EFD datastore doing .2 ms of latency (NFS) with 1,440 IOs.  Very cool, 1) EFD’s 2) did you know the VNX has dedicated read cache (buffer) for NFS (.2 ms latency) on reads as well as a separate r/w cache for block under the covers?  Talk about a wicked fast array! EMC LOVES linked clones and fast provisioning under vCD for our lab use case.

image image

 

VMware VMs Dashboard

If I was going to sum up why data visualization is important in a large environment, nothing speaks better than this screenshot.  Yes, vCOps does great stuff to roll up all of this into synthetic health metrics based on anomalous activity, but this to me is like the old applied in a new way: the admin knows what metrics he cares about, and monitors them very carefully.  In our case here it is Compute (Util/Contention), Memory (Guest Active/Contention), Storage (BW/ReadWrite Latency), Network (Network Packets).  So in this screen we are visualizing around 21,000 metrics (3,000 VMs x 1.75 Metrics x 4 CMSN).  Spot any problem children?  Sure, just look at the color.  This dashboard allowed us to put all of our VMs in context/perspective and we were able to quickly isolate problem children that arose.  We even expanded this at times to include things like broadcast traffic per VM as we did run into situations with heavy packet loss at the hypervisors due to a massive amount of broadcast traffic from single virtual machines.
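
The metric count is simple arithmetic, spelled out here in PowerShell.  The variable names are mine, and I am reading “4 CMSN” as the four categories listed above (Compute, Memory, Storage, Network) with 1.75 as the average number of metrics shown per category.

```powershell
$vmCount            = 3000    # VMs on the dashboard
$metricsPerCategory = 1.75    # average metrics shown per category
$categories         = 4       # Compute, Memory, Storage, Network

# Total metrics visualized on the one screen.
$totalMetrics = $vmCount * $metricsPerCategory * $categories
$totalMetrics
```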

image

image

Heatmaps

So you can see here that we are grouping by cluster to consolidate ESX hosts.  We are also sizing the boxes by the Utilization % and coloring by contention.  So for the most part we have little to no contention for CPU cycles.  Do you notice anything funny on this one?  I have highlighted a set of boxes on the left side that are all equally sized.  Kinda funny, but those are the virtual VMAX 10/40K’s running which are all consuming 95%+ of their available CPU cycles.

image image
image image
image image
image image

 

Metric Graphs

Ok, so in come some of the EMC VNX specific metrics, and not necessarily dashboards.  The next dashboard is a combination of widgets and interactions between widgets that allow us to choose metrics to be graphed on the fly.  By interactions, I mean that it is possible to make certain widgets parents of other widgets, so that as one gets updated the child then gets updated with metrics from the selected parent resource, and the graph in turn gets updated from its parent metric table.  We are showing an example of how we can see the Storage Processor Utilization (or anything else) across all of our VNX arrays very easily.

image

image
image image
image
image image
image
image

image

 

VAAI NAS anyone?

image

 

Unofficial VNX Preview Dashboards

I am not going to go into detail for each widget here as the VNX connector and analytics suite is possibly soon to be released (Q3 2012).  Here are some dashboards that you might take a liking to with VNX metrics.  Notice how we are summarizing critical info from 6 VNXs by the large boxes in each widget.  On the left side we have the utilization for Storage Processors, Raidgroups, and Disks.  On the right side we show Utilization for FastCache, LUNs, and LUN throughput vs Response Time.  In the middle we are showing a weather map (by 5 minute iterations for past six hours that updates live) that updates based on response times for the LUNs. 

image

This shows a bit of a Unified scoreboard of Block and NAS.  Notice VNX06 cranking out at 11,500 IOs and 13% utilization.  Also, it is doing 16,200 NAS operations, and only 11,500 block operations meaning that 30% or so are being cached as reads on the NAS head.  We also display a rolling metric view on the right that cycles through the metrics being displayed on the scoreboard.
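
The “30% or so” figure can be checked with a couple of lines of PowerShell.  It assumes, as the text above does, that every NAS operation that did not turn into a block operation was served from cache on the NAS head.

```powershell
$nasOps   = 16200   # NAS operations on VNX06
$blockOps = 11500   # block operations on VNX06

# Share of NAS operations that never reached the block side.
$absorbed = ($nasOps - $blockOps) / $nasOps
[math]::Round($absorbed * 100, 1)
```

That works out to roughly 29%, in line with the estimate.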

image

The following view shows the frontend of the VNXs for the show, which are the NAS stats.  The top left widget shows the response time for the block LUNs that support the NAS exports, and below that is the utilization of those LUNs (all shown in microseconds).  The top right side shows some awfully cool stuff, expanded below.

image

 

Ok so on to the wicked detailed NAS stuff!  This is the NAS NFS Calls as the size of the box and the microseconds per call for the color.  The actual boxes represent the type of the IO, so as you can see 3 metrics per box!  This shows v3Reads at 3,061 calls with a response time of 638 microseconds (.638ms).

image

This one might be even cooler.  There are two pieces of data here, the amount of IOs and the size of the IOs.  We used a black to white color spectrum to represent black as no IOs and white as being white hot.  So here you can see in the example that we are doing 66 IOs at 16-32K for one of the six VNX’s.

image

This is all further down on the same NAS dashboard.  How about some VAAI NAS stats?

image

 

So yes, all VNX’s had NAS VAAI enabled!  Here you can see the operations taking place, which ones were the most common, and what the response time was for these operations.

image

As you can see here, our average NFS write size was somewhere in the 12-16KB range across all VNXs.  Reads were closer to the 32-64KB range.

imageimage

And the corresponding dVol (block volume supporting NAS export) was showing a somewhat similar breakdown of IOs.  The size of the boxes shows that we are for the most part equally sending IOs back to all dVols.

imageimage

 

Rotating Dashboards

One more thing.  Once you get all of these dashboards setup in your NOC, then you probably want them to rotate through the view automagically.  Hit the REORDER TABS button on the Dashboards menu. 

image

There are a few things that are set here.  The first is whether to include the dashboard in the cycle (“Auto Transition”), the second is how long to stay on that dashboard (“Seconds”), and the third is which dashboard to transition to next (“To this Tab”).  With these settings correct you should have a very entertaining NOC!

image

 

Summary

So that’s a crazy amount of data being displayed in only a handful of dashboards.  Man I wish I had this stuff when I ran my operations center!  One thing that I think is absolutely killer about the data visualizations presented here is that it is a single tool that both the virtualization and storage teams can leverage.  This means that if I know how to work vCOps for my VMware stats, then I am going to be pretty dangerous with storage stats as well. 

Anyone want to see this as a widget? Tell VMware! Check this link to where it came from.

Photo

Deleting Stale VM Resources in vCOps

Posted by vele1655 | Posted in Monitoring, vCOps | Posted on 01-06-2012

Tags:

0

EMC just held EMC World 2012 and we had a lot of really cool stuff going on behind the scenes in order to make the Hands On Labs happen.  For the most part the whole environment was run off of standard VMware products including vSphere, vCenter, vCloud Director, View, and vCenter Operations.  Since we were using a use case a bit outside of the norm there was definitely some custom development that went on behind the scenes to make things work as per expectations for a lab environment.  The next few posts that I release will be around some of the behind the scenes work that was done to make everything a reality. 

One of the special things in lab environments is the unbelievable amount of provisioning and de-provisioning that goes on.  Our goal was to leverage vCenter Operations for as much of the information as possible from the hypervisor up as well as storage.  This gives us a critical view into consumption profiles among certain labs and their respective virtual machines, along with the ability to quickly isolate and remediate problem virtual machines.

Keep in mind that for the EMC World labs, we virtualized 8+ physical hardware platforms, including VMAX, VMAXe, VNX, VNXe, Avamar, Data Domain, Greenplum, and Isilon; the list goes on.  Needless to say, none of these platforms were intended to run on top of a hypervisor in their current form.  But regardless, we did it anyway for the lab and ran into some challenges.  Imagine the environment: thousands of virtual machines, hundreds of massive virtual machines (storage arrays), and the unpredictability that can arise.  So in summary, monitoring the virtual machines was absolutely necessary so we could isolate problem children and remediate quickly.  This really is no different than Enterprise environments, except we are turning over entire enterprise environments in a matter of hours.

With that said let’s step into the vCOps portion of the discussion here.  vCOps was our main performance and “root cause” platform which we turned to throughout the show.  However, in order to make sure it could handle the turn and burn of the virtual machines we had to make some special tweaks to ensure that the flat files and database stayed at a steady state of metrics versus continually growing.  We also wanted to ensure that the visualizations and heat maps only showed relevant and current virtual machines; more to be said below.

So we achieved the pruning of vCOps by implementing a special cron job that runs a vPostgres database query to set a special flag on the VMs that the vCenter adapter has marked as Non-Existent.  The flag is set per VM resource, and the flagged VMs are then removed based on a schedule and an option in the configuration file of the Analytics VM.

So you may be asking, is this the only way to accomplish this?  For our scale, yes!  The 5.0.1 version of vCOps can scan every 24 hours for items to be deleted and items that are Non-Existent in VC.  Our use case called for a shorter auto-deletion period (1 hour).

So here are the steps in vCOps 5.0.1 to automatically remove stale virtual machines. 

 

Steps – Edit the Controller Properties file

1) SSH to the Analytics VM and login as root

2) vi /usr/lib/vmware-vcops/user/conf/controller/controller.properties

image

The first flag, “deleteNotExisting”, aligns with the screenshot below.  When browsing Resources, you can see there is a tag for “VM Entity Status”.  These tags are filters that can be enabled on widgets or on the resource list to pick only the VMs contained in them.  So for example, if you want a heatmap with only powered-on VMs, then in the heatmap widget configuration select the tag for the appropriate VCs and the PoweredOn: tag.  Notice that “deletionPeriod” is close to what we want as far as scanning for deletable VMs, but in our case we wanted it done every hour.  The “deletionSchedulePeriod” represents how often it will actually do a delete based on a certain flag being set.  For those that only need the VMs deleted every 24 hours, this is actually ALL YOU NEED to keep everything nice and tidy!

 

image

image

3) Restart the vCOps services on the Analytics VM (service vcops restart)

Congrats! You are now scanning every 24 hours for NotExisting items, and then executing a delete of the flagged resources every hour.  What if you need the scan to be more granular than every 24 hours? Continue on!
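If you want to double check the three flags discussed above after editing, a small helper like the one below can print them back.  This is only a sketch: `get_prop` is a hypothetical function of my own (not something shipped with vCOps), and it assumes the file uses plain `key = value` lines.

```shell
#!/bin/sh
# Print the value of one key from a Java-style .properties file.
# get_prop is a hypothetical helper, not part of vCOps.
get_prop() {
    # $1 = properties file, $2 = key
    sed -n "s/^[[:space:]]*$2[[:space:]]*=[[:space:]]*//p" "$1" | tail -n 1
}

CONF=/usr/lib/vmware-vcops/user/conf/controller/controller.properties

if [ -r "$CONF" ]; then
    # The three flags covered in the step above
    for key in deleteNotExisting deletionPeriod deletionSchedulePeriod; do
        printf '%s = %s\n' "$key" "$(get_prop "$CONF" "$key")"
    done
else
    echo "controller.properties not found at $CONF" >&2
fi
```

Run it on the Analytics VM as root; anywhere else it just reports that the file is missing.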

 

Please note, the following is NOT OFFICIALLY SUPPORTED!  I am posting it here to give some insight into a new feature and a hint at what might happen if the lab use case is fully baked into vCOps!

Steps – Open Postgres and List Databases (optional)

3) sudo -u postgres psql -U postgres -c "SELECT * FROM pg_database;"

image

There you go, there is the alivevm database.

Steps – Open Postgres, Open AliveVM Database, List Tables (optional)

4) sudo -u postgres psql -U postgres -d alivevm

image

5) \d

image

 

Steps – Create and Test cleanupNotRunning.sh File

6) Edit /etc/cron.hourly/cleanupNotRunning.sh 

#!/bin/sh
# Set flag 13 on every resource that the vCenter adapter has tagged as NotExisting.
/usr/bin/sudo -u postgres /opt/vmware/vpostgres/1.0/bin/psql -U postgres -d alivevm -c "UPDATE AliveResource SET flag = '13' WHERE resource_id IN (SELECT resource_id FROM AliveResource WHERE resource_id IN (SELECT ResourceMap.Child_Resource_Id FROM AliveResource, ResourceMap WHERE AliveResource.resource_id = ResourceMap.Parent_Resource_Id AND Name LIKE '%NotExisting%'));"

7) chmod +x /etc/cron.hourly/cleanupNotRunning.sh

8) /etc/cron.hourly/cleanupNotRunning.sh

image

Notice the line with “UPDATE 315”; this is the number of rows that were affected.  It should equal the sum of the counts for the tags listed as NotExisting in the screenshot above.  This statement sets a flag of “13” on those resources that have been marked as NotExisting by the vCenter Adapter.
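If you would rather see that number before anything gets flagged, the same selection can be wrapped in a count instead of an UPDATE.  This is a rough sketch and just as unsupported as the rest: it assumes the same psql path and table names as the script above, and `count_sql` is a helper name I made up for illustration.

```shell
#!/bin/sh
# Dry run: count the resources the cleanup script would flag, without changing anything.
PSQL=/opt/vmware/vpostgres/1.0/bin/psql

# Same selection as the cleanup script, wrapped in a count (hypothetical helper).
count_sql() {
    cat <<'EOF'
SELECT count(*) FROM AliveResource WHERE resource_id IN (
  SELECT ResourceMap.Child_Resource_Id
  FROM AliveResource, ResourceMap
  WHERE AliveResource.resource_id = ResourceMap.Parent_Resource_Id
    AND Name LIKE '%NotExisting%');
EOF
}

if [ -x "$PSQL" ]; then
    # -t prints only the resulting number, without headers
    /usr/bin/sudo -u postgres "$PSQL" -U postgres -d alivevm -t -c "$(count_sql)"
else
    echo "psql not found at $PSQL; printing the query instead:" >&2
    count_sql
fi
```

The number it returns should line up with the row count reported by the UPDATE when you run the real script.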

Steps – Create a Crontab Entry

The next step is to create a crontab entry that will execute the script from above every hour. 

9) Edit /etc/crontab

10) At the bottom, place a new line “0 * * * * root /etc/cron.hourly/cleanupNotRunning.sh &> /tmp/cleanupErrors.log” without the quotes, and press enter to ensure there is a blank line. 

11) service cron restart
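For anyone scripting steps 9 and 10 instead of editing by hand, here is a minimal sketch that appends the entry only if it is not already there.  `add_cron_entry` is a hypothetical helper, and I have spelled the redirect as `> file 2>&1`, which does the same thing as `&>` in bash but also works if cron runs the line with a plain POSIX sh.

```shell
#!/bin/sh
# Append the hourly cleanup entry to a crontab file only if it is not already present.
# add_cron_entry is a hypothetical helper; the entry is the one from step 10,
# with "&>" written as the POSIX-portable "> file 2>&1".
ENTRY='0 * * * * root /etc/cron.hourly/cleanupNotRunning.sh > /tmp/cleanupErrors.log 2>&1'

add_cron_entry() {
    # $1 = crontab file to edit
    if grep -qF "cleanupNotRunning.sh" "$1"; then
        echo "entry already present in $1"
    else
        printf '%s\n' "$ENTRY" >> "$1"
        echo "entry added to $1"
    fi
}

# Usage (as root), followed by "service cron restart":
#   add_cron_entry /etc/crontab
```

Running it twice is safe, since the second call finds the existing line and leaves the file alone.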

 

That’s it! See the screenshot below that shows what it should look like!  If you’ve gone through this successfully, then you might have a new avenue to understand vCOps a bit more from the ground up and possibly a way to keep your vCOps instance in the lab a bit cleaner.

 

image

 

 

Here is a little bonus!  Below is a view of the vCOps Analytics instance for CPU, Memory, and Disk Space.  Notice how the area in red grows at a slower rate than the areas outside of it.  The area in red is where we added this tweak, and then we turned it off to keep some historical stats for some of the lab VMs.  Also notice the reduction in CPU load and, more prominently, Memory that accompanies the pruning.  This vCOps instance collected from 6 VNXs, 7 vCenter instances, 92 ESX hosts, and 4000+ VMs concurrently.

image

Tutorial: Building Custom Dashboards in vCOps

Posted by vele1655 | Posted in Monitoring, vCOps | Posted on 12-04-2012

Tags:

13

This post is a semi-detailed tutorial on building your own dashboard using out-of-the-box widgets in VMware vCenter Operations Enterprise.  The information presented is based on how I’ve been using vCOps over the past year which should be a good starting point for anyone that is looking to go beyond the standard dashboards, dashboard templates, or analysis views.  I explain some of the basics and also some of the gotchas that you can run into when building these dashboards.  Some of the views presented are specific to storage, but overall it should be useful for any metrics and resources available within vCOps.

It is very important to mention that this configuration is done using the custom portal and NOT the vSphere portal in vCOps.  The vSphere portal is statically configured in a way that allows you to get very useful data and switch contexts for that data in an out-of-the-box way.  The custom portal is where you can go (Enterprise/+ only in v1+v5) and create your own complete dashboards.  The custom portal is also the location where you access the 3rd party metrics that may have been brought in through 3rd party adapters (Java RMI) or the vCOps open adapter (HTTP Post).

Requirements

 

New Dashboard

The first thing that I want to do is walk through how to create a new dashboard. vCOps comes with default templates that include a set of configured and unconfigured widgets. We could leverage these templates since they contain widgets and dimensional styles that may be desired, or we could use a template and remove the widgets once they get added. Instead of either of these approaches, I will demonstrate how to create a dashboard from individual widgets without a template.

Login to vCOps through the custom URL as listed above and press the “+” button on the dashboard row.

clip_image001

You will see the template screen load, go ahead and press the icon in the top left corner as shown below. This will bring up the screen that allows us to simply choose widgets.

clip_image002

Now we get to choose the style and name of the dashboard that we are creating. On the left side you can see the drop down where we decide how many columns will be presented in this vCOps dashboard.

clip_image004

We have selected three columns and you can see on the right side that our page setup has changed to three vertical columns. Our next option is to slide the two column dividers in the middle left and right, which creates custom dimensions for the width of our widgets.

clip_image006

In the following screenshot we are now showing the list of available widgets on the left and the subscribed widgets for this dashboard on the right. I can simply drag and drop any widget from the left to the right. The order does not matter on the right side and has no permanent effect on how widgets will be displayed on the dashboard, so simply drag and drop any widget that you would like.

clip_image008

I have dragged over a handful of widgets that I will now demonstrate how to configure. You can see the health-workload widget, generic scoreboard, heat maps, and a metric graph. This selection by no means ranks the usefulness of the widgets; they all have value, and I encourage you to play with them and test them out!

clip_image009

 

Gadgets with VNX Data – Performance Metrics

This section will demonstrate some of the common ways that you can visualize important VNX storage array information with the scoreboard, heatmap, and metric graph widgets. We showed you how to create a dashboard above, so go ahead and drag the widgets listed to a new dashboard. Once you have done this, click the configuration button on the Generic Scoreboard.

clip_image011

You will notice that the following widget configuration window appears. The information here should be pretty self-explanatory. However, I wanted to take a quick second to describe some of the buttons on the interface that are common to vCOps.

The following shows up under the configuration of the widget. On the bottom left of the red box there is an icon marked with an “x”. This icon is “clear selections” and removes any rows that you may have selected in the list. This is important to remember as vCOps is filled with selectable boxes and filters. It can be very easy to not get the results that you were expecting because you have an existing selection that may be hidden from you. The button to the right with the green arrow has to do with multiple selects. In this case we are selecting a resource, and on the right side vCOps is populating the metrics for this object. What if we had multiple objects (SP A & SP B) of the same Resource Kind with the same metrics? Wouldn’t it be easier just to select both SPs and then have the common metrics be displayed on the right? And then, when I select a metric and hit the multi-select above those metrics, have both SPs’ metrics populate in the list at the bottom? This is exactly how it works.

clip_image012

Ok, time to move on. The following window shows the configuration of the Scoreboard widget. We have populated metrics into the selected metrics box as described. We have also assigned “Box Labels” and “Measurement Units”. In the screenshot of the widget itself, below the widget configuration, you can see that those items show up above the metric and to the right of the metric. The other important thing has to do with the range.

Note: See the User’s Guide (v1.0.2) for more examples and detailed information here.

The range is a tough one to get correct without an example. The “green range” refers to anything up to the green maximum, so in our case we enter “40” for SP utilization. The “yellow range” refers to an amount between the minimum and maximum of that range, so for this we enter “40-60”. We repeat the same in orange as a range of “60-80”. The “red range” is then configured with the minimum of that range, in this case “80”, which covers anything from 80 and up. We also configure throughput as IO/s in the scoreboard, but feel free to choose any metric.
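If it helps to see the ranges written out as logic, here is a tiny sketch that maps a utilization value to the colors just described.  `scoreboard_color` is a name I made up for illustration, and how the widget itself treats a value sitting exactly on a boundary (40, 60, or 80) is an assumption on my part.

```shell
#!/bin/sh
# Map an SP utilization percentage to the scoreboard color ranges described above.
# Boundaries follow the post: green up to 40, yellow 40-60, orange 60-80, red 80 and up.
# Assumption: a value exactly on a boundary falls into the higher range.
scoreboard_color() {
    # $1 = integer value
    if [ "$1" -lt 40 ]; then
        echo green
    elif [ "$1" -lt 60 ]; then
        echo yellow
    elif [ "$1" -lt 80 ]; then
        echo orange
    else
        echo red
    fi
}

for v in 25 40 75 80; do
    printf '%s -> %s\n' "$v" "$(scoreboard_color "$v")"
done
```

The same pattern applies to any metric on the scoreboard; only the boundary numbers change.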

clip_image014

clip_image016

The next examples will be leveraging the “Heat Maps” widget. We will be repeating the usage of this widget across many different types of objects, so the explanation of how to configure it in general will only be done once.

The heatmap is an awesome way to visualize data. When looking at the widget generated, there are two main options for displaying metrics. The first is the “Size By” and the second is “Color By”. This allows us to specify two separate metrics for a single resource. The heatmap is used in the Analysis -> VC Analysis tab where VMware has packed in some great generic views. You can refer there for other examples of how you might format information. In terms of storage I would point you to the heatmaps that refer to Datastore, there are some really good ones there.

Getting back to visualizing the data. The heatmap allows us to size the boxes displayed by a quantity and color them, possibly by a health indicator. For example, if I want to see the hot datastores then I can create a heatmap that looks at the IOs as the size of the box and the response time for that datastore as the color. This is a very powerful view of exactly how my storage is performing out of the box for vCOps. You can even take this to a more VM specific heatmap where you see virtual disk IO and response times. The other bonus for heatmaps has to do with how we group the information being presented. It would be one thing just to show, say, VMs on a heatmap and size as discussed. It is another thing to actually group those VMs by ESX host first, and then display the VMs after that. This way we would have another metric displayed, which would be the ESX host’s aggregate IO being generated for those VMs, along with the previous view of simply VMs.

We leverage this grouping in many different cases. For the examples below we are actually pulling data from 5 different VNXs. In order to properly segregate the resources we needed to create a “tier” ahead of time where we could define what those resources were. Unfortunately, it isn’t possible to group by identifiers (uniqueness of resources). So we use tiers instead, and this generates what you see in the widget screenshot below the configuration, where you see the array name itself and the objects inside of the major group box pertaining to that tier or array.  Don’t worry about tiers if you’re just starting out!

The next thing to point out is the “Resource Kinds” drop down. This drop down is where we specify which resource we want to pull metrics for. By populating this drop down with a resource kind, the lists below it for size and color are populated. The section below the “Size By” and “Color By” lists refers to how we filter the data being returned. This would be a case where we wanted to display the resource kind of Virtual Machine, but there were specific VMs we wanted filtered or not filtered. We could create custom tiers, applications, or any of the listed tags to filter appropriately.  The combinations in this area are essentially limitless.

The next section to highlight is in the top right on the color spectrum. You can see that there are two squares which allow us to customize the colors that show up across the range of values returned from the “Color by” metric selection. If these are left blank then the widget will take the maximum and populate the color automatically. For percentages that are predictable for their health (utilization is a good candidate), it is always a good idea to fill in 0 and 70 for the range of green to red. However, keep in mind that some metrics may be reversed, where being higher might not represent being worse and could be the opposite. In these cases you can press the box below the color spectrum to reverse the colors. You can even use a small trick and choose black for one of the left or right color boxes which (if you have dark set for user preferences) will make the box empty until you hit a certain point in the range. The black would represent no activity instead of a good or bad value.

One last important thing to mention for heatmaps is around saving your configuration. Use the green plus button at the top right to create a new profile for each heatmap widget and configuration. When you have changed anything for this widget, either press the save button (disk below green plus icon) or press the green plus icon (for a new config).

clip_image018

clip_image020

clip_image022

clip_image024

clip_image026

clip_image027

clip_image029

clip_image030

clip_image032

clip_image033

clip_image035

clip_image036

clip_image038

clip_image039

The next widget that we will run through configuring is the metric graph widget. You can see here that there isn’t much to this one. You simply select one or many resources from the list in the middle and then use drag and drop or the multi-select icon to bring the resource metrics up on the right. Follow this with the same action, or the multi-select again, to get the metrics into the bottom list. You can then drag and drop them within the “Selected Metrics” window into the desired order.

clip_image041

You can see in the following screenshot that the legend at the bottom lists the values that we selected in the configuration. By default, this widget may display the graphs individually in a list. However, we wanted to see everything overlaid together, so we hit the button on the top left with the green and red lines, which toggles between many and single metrics per graph.

clip_image043

An important thing to note with the graphing is the calendar portion of configuring the widget. There is a line and an arrow in the middle of the line above the graph. You can press this which will reveal the calendar portion to properly set the desired timeframe. We selected last hour in this example, but this can be whatever you want. Simply press the line with the arrow (divider) to hide the date selection.

clip_image044

 

Health

Health is a very important and useful feature within vCOps. Health relies on a lot of different metrics and their anomalies, and ties this together with context sensitive information about how relevant those metrics are to the overall health of the resource and its dependencies. The longer you analyze the metrics, the better vCOps can make good decisions about proper dynamic thresholds (DT) for that metric. All of the EMC metrics can contribute to the health of that resource over time. Let’s take a look at some different views that might help us dig through the health of resources.

For the following examples we will be leveraging the Health Status and Resources widgets. These two widgets interact, so that at any time we can change the context of what we are viewing by clicking on the Resource List widget and have that change the view in the Health Tree.

We start by simply dragging the widgets over into the dashboard as described previously and configuring the health status widget. The only selection that we changed has to do with the Mode, where we set “parent”. This simply means that when this widget receives a request to change views, it looks at what was selected and then displays the parents of that resource. It is important to make sure that “Self Provider” is not enabled here.

clip_image046

The next thing is to click the “interactions” button on the top right of the dashboard view.

clip_image048

This is where we can set up the relationship that determines where the selection mentioned previously gets displayed. In this case we are selecting the “resources” widget as the provider, which means the health status widget will change context whenever we click a new resource in the list of the resource widget.

clip_image050

Go ahead and test this out once you save the configuration. You should be able to click through the resource list and have the information populated to the health tree widget.

Now the resource list that we are viewing may not need to cover all resources that vCOps has access to. If we want to limit the list, this is a perfect situation to leverage a filter. The following is a list of proper filters for this example: “Datastore,DISK,LUN_*,RAIDGROUP,SP,VIRTUAL MACHINE”. Remember the selection and de-selection process above the tag list. It never hurts to just use the de-select button (not the red slash, that is reverse). We have also leveraged the order capabilities below the filter to make the list a bit neater for viewing.

clip_image052

With that complete you can now see the more friendly listing of resources and their health.

clip_image054

Now let’s take this to the next level. There are plenty of other widgets that can be added to this dashboard and viewed at the same time that might be useful. I added the Metric Selector widget which you can now see listed under the Interactions section of the dashboard. I am going to populate the metric selector with the resources as a provider. This will allow us to select a specific metric for a resource once it is selected.

clip_image056

You’ll also notice that we added a Metric Graph widget. We can select the resources provider in the top drop down, followed by the metric provider in the bottom drop down. This will give the metric graph everything that it needs to generate graphs on the fly as we click through our resources and metrics.

clip_image058

There you go, you can now see this dashboard taking shape, all properly related and interacting. Here we have chosen a resource. This resource has populated the parents above it. We then can choose a metric for the resource (double click) which then displays the graph below it.

clip_image060

 

Relationships

Another very cool thing that was briefly highlighted in the previous examples is relationships. We demonstrated this capability by setting a “parent” for the health status widget. Let’s add a couple more useful widgets to the dashboard that should better demonstrate the relationships. We have added another Health Tree and a Root Cause Ranking widget. The interactions have then been set to be populated by the resources widget.

clip_image062

Here are the new widgets stacked on top of each other. We have selected the management datastore which would show us the immediate parents and children. If you have the EMC relationships or relationships with the resource you are viewing properly configured here (as I don’t) you would see the EMC LUN or NFS export listed below the VMware datastore (management). Below this health tree widget you can see that we are calling out the root causes for any health deterioration of the resource in question.
clip_image064

We are going to add one more health tree widget that allows us to take some pretty interesting views of the relationships. Where we previously were showing relationships based on the resources widget, we now can show relationships based on the actual health tree itself. This allows us to simply click a resource in the health tree which then populates the parent/child relationships in the new health tree.
clip_image066

Here is a view of that if it didn’t make sense when I described it. Here we are looking at the templates resource on the top. We then selected the “win7-basevm” resource which populated its direct parents and children. This would be a view that we would typically get if we double clicked the resource since the health tree itself will drill into whichever object we specify. By adding another health tree we can keep the top view stationary and explore relationships of any resource.
clip_image068

Here is a complete picture of the dashboard that we have created. You can see all of the widgets that we configured along with their dimensions, which were defined vertically per widget and horizontally per dashboard.

clip_image070

 

Hopefully this information is useful. I wouldn’t expect anyone to specifically use these examples exactly as is. The information presented is more for educational purposes so that you can go off and create your own dashboards!