VMware

Linux and DHCP reservations aren’t working

This is something I just came across myself: while deploying an Ubuntu Linux VM, the DHCP reservation did not work. The environment used a Windows Server 2016 DHCP server.

Looking at the wrong DHCP lease, I quickly saw an extremely long identifier instead of the expected MAC address and figured that the Linux VM used some kind of alternative identifier for the interface.

ifconfig on the system showed the correct hardware MAC address.

It took me a few minutes of research and testing until I found the root cause and a rather simple solution.

Many Linux distributions have replaced their NIC handling with a newer system called Netplan. By default, Netplan (via systemd-networkd) sends a DHCP client identifier (DUID) instead of the MAC address, which is why a MAC-based reservation on the Windows DHCP server no longer matches.

I read that a Windows Server 2019 DHCP server would likely handle this correctly, but I did not have time to test it. The following worked for me:

  • List the contents of /etc/netplan
    • ls /etc/netplan
    • there should be a file ending in .yaml
  • Edit this .yaml file
    • sudo nano /etc/netplan/<filename>.yaml

The YAML file holds the network configuration of your interfaces.

Under your NIC device name add the line dhcp-identifier: mac, then save (Ctrl+O) and exit (Ctrl+X) the file.
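
A minimal example of how the edited YAML file could look – the interface name ens192 and the renderer are assumptions and will differ in your environment:

    network:
      version: 2
      renderer: networkd
      ethernets:
        ens192:                  # your NIC device name
          dhcp4: true
          dhcp-identifier: mac   # send the hardware MAC as the DHCP client identifier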

Now you can either try to apply the netplan config via sudo netplan apply or simply reboot.

This should solve the issue.

VMware hosts network speed tests with iperf

Ever needed to run speed tests between your VMware ESXi hosts? There is a CLI command, iperf3, for this.

iperf3 runs in a server and a client mode: one host acts as the server, the other as the client. Some storage vendors even support the iperf3 command on their arrays.

Example scenario with two VMware ESX hosts:

  • IT-ESX-01P – will act as server
    • IP: 10.0.0.1
  • IT-ESX-02P – will act as client
    • IP: 10.0.0.2

Steps and commands to execute the network speed test:

  1. Enable SSH on both hosts, connect to each with e.g. PuTTY, and log on.
  2. IT-ESX-01P will act as our server
    1. disable the firewall
      1. esxcli network firewall set --enabled false
      2. The ESX firewall needs to be disabled temporarily to execute the tests – on client and server
    2. List the kernel network IP addresses
      1. esxcli network ip interface ipv4 get
      2. choose the VMkernel interface IP on the network you want to test; only VMkernel IPs will work
    3. go to the directory that holds the iperf3 command
      1. cd /usr/lib/vmware/vsan/bin
    4. start the iperf3 server on this host, bound to the VMkernel IP you want to test
      1. ./iperf3.copy -s -B 10.0.0.1
      2. this command starts the server (listener) on the host, bound to the specified IP address
  3. IT-ESX-02P will act as our client
    1. disable the firewall
      1. esxcli network firewall set --enabled false
      2. The ESX firewall needs to be disabled temporarily to execute the tests – on client and server
    2. go to the directory that holds the iperf3 command
      1. cd /usr/lib/vmware/vsan/bin
    3. execute the speed test against the server IP address
      1. ./iperf3 -c 10.0.0.1 -t 10 -V

      2. this will start sending packets to the server – you will see the flow on both sides
      3. cancelling this command (Ctrl+C) can take a minute; be patient, especially if you mistyped the IP or forgot to disable the firewall
  4. Review the results on the speed test
    1. Below are result samples for a 1 Gbit, a 10 Gbit and a 25 Gbit VMkernel network.
    2. Sample results – 1 Gbit
    3. Sample results – 10 Gbit
    4. Sample results – 25 Gbit
    5. Be aware that those results will vary; they depend on the network bandwidth available at the moment of the test and on the current load on the network cards of client and server.
  5. IT-ESX-01P exit server mode and enable firewall
    1. Ctrl+C will exit the server mode and return to the CLI
    2. enable the firewall
      1. esxcli network firewall set --enabled true
    3. EXIT SSH
  6. IT-ESX-02P enable firewall
    1. enable the firewall
      1. esxcli network firewall set --enabled true
    2. EXIT SSH
  7. Done
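
For quick reference, here are the commands from the steps above in one place – the IP belongs to the example scenario and needs to be replaced with your own VMkernel address:

    # on both hosts: temporarily disable the ESXi firewall
    esxcli network firewall set --enabled false

    # on the server host (IT-ESX-01P): list the VMkernel IPs, then start the listener
    esxcli network ip interface ipv4 get
    cd /usr/lib/vmware/vsan/bin
    ./iperf3.copy -s -B 10.0.0.1

    # on the client host (IT-ESX-02P): run the test against the server IP
    cd /usr/lib/vmware/vsan/bin
    ./iperf3 -c 10.0.0.1 -t 10 -V

    # afterwards, on both hosts: re-enable the firewall
    esxcli network firewall set --enabled true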


PRTG and VMware 6.7 vCenter host hardware status

The following script was created to work around an issue in the SOAP API in combination with VMware, hardware vendor drivers and PRTG. You could of course use the same script for other monitoring systems or any other purpose, adjusted to your needs.

You can find more information about the issue here: https://kb.paessler.com/en/topic/82458-vmware-host-hardware-status-soap-sensor-returns-warnings-after-update-to-vmware-6-7

In order to make this work you need to install the VMware PowerCLI PowerShell module on the PRTG probe server. Furthermore, you need to pass in a username and password as well as the vCenter name and the host name as it is known in vCenter.

For targets with LDAP authentication activated:

  • $host %host "%windowsdomain\%windowsuser" "%windowspassword"
  • $host %host.domain.local "%windowsdomain\%windowsuser" "%windowspassword"

Otherwise – you might need to use this format:

  • $host %host root MyRootPW

Test it in PowerShell as the probe user first – you should see the results. The script creates a sensor with multiple channels: hardware elements in GREEN status are only counted; elements in UNKNOWN status are counted and returned as text, and as long as no YELLOW or RED status (warning or error) occurs, the sensor stays green/okay. Warning or error levels are applied automatically and list the problematic hardware systems in the sensor message text.

My first attempt was to show all channels on top of the summary, but because over 100 separate hardware statuses came back and PRTG limits a sensor to 50 channels, I dropped the idea – although the script still contains all the code to handle it.
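
Since the original script is not embedded in this post, here is a minimal sketch of the approach – PowerCLI connection, reading the host's numeric health sensors, and emitting PRTG EXE/XML output. The parameter names and channel layout are assumptions, not the original script:

    # Minimal sketch - not the original script. Assumes VMware PowerCLI is installed
    # and the probe user is allowed to connect to vCenter.
    param(
        [string]$VCenter,      # vCenter name
        [string]$VMHostName,   # host name as known in vCenter
        [string]$User,
        [string]$Password
    )

    Import-Module VMware.PowerCLI -ErrorAction Stop
    Connect-VIServer -Server $VCenter -User $User -Password $Password | Out-Null

    # read the numeric hardware health sensors of the host via the vSphere API view
    $esx     = Get-VMHost -Name $VMHostName
    $sensors = $esx.ExtensionData.Runtime.HealthSystemRuntime.SystemHealthInfo.NumericSensorInfo

    Disconnect-VIServer -Server $VCenter -Confirm:$false

    # count sensors per health state (green / yellow / red / unknown)
    $green   = @($sensors | Where-Object { $_.HealthState.Key -eq 'green' })
    $yellow  = @($sensors | Where-Object { $_.HealthState.Key -eq 'yellow' })
    $red     = @($sensors | Where-Object { $_.HealthState.Key -eq 'red' })
    $unknown = @($sensors | Where-Object { $_.HealthState.Key -eq 'unknown' })

    # sensor message text: problem sensors if any, otherwise the unknown ones or OK
    $text = 'OK'
    if ($red.Count -or $yellow.Count) { $text = (($red + $yellow) | ForEach-Object { $_.Name }) -join '; ' }
    elseif ($unknown.Count)           { $text = ($unknown | ForEach-Object { $_.Name }) -join '; ' }

    # emit PRTG EXE/XML Advanced output - warning/error limits on the yellow/red channels
    $xml  = "<prtg>"
    $xml += "<result><channel>Green</channel><value>$($green.Count)</value></result>"
    $xml += "<result><channel>Yellow</channel><value>$($yellow.Count)</value><LimitMaxWarning>0</LimitMaxWarning><LimitMode>1</LimitMode></result>"
    $xml += "<result><channel>Red</channel><value>$($red.Count)</value><LimitMaxError>0</LimitMaxError><LimitMode>1</LimitMode></result>"
    $xml += "<result><channel>Unknown</channel><value>$($unknown.Count)</value></result>"
    $xml += "<text>$text</text></prtg>"
    Write-Output $xml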

 

VMware alert monitoring with PRTG and PowerShell

There is a way to read out and process ALL alerts of your VMware environment using PowerShell and report the results back to PRTG. The script further down in this article does this; the result is a sensor with the channels listed below.

This shows you the following channels:

  • Overall status
    • this will be green as long as there are no unacknowledged warnings or alerts in VMware
    • if a warning or alert is acknowledged, the sensor / script returns to green, because it is nothing new
  • Total Alerts – number of alerts, acknowledged and not acknowledged
  • Total Alerts – Acknowledged
  • Total Alerts – NOT Acknowledged
  • Total Warnings
  • Total Warnings – Acknowledged
  • Total Warnings – NOT Acknowledged
  • Total Warnings and Alerts
  • Total Warnings and Alerts – Acknowledged
  • Total Warnings and Alerts – NOT Acknowledged

As you can see, you can get more granular with your PRTG statuses if you use the channels for acknowledged warnings/alerts. You could set an upper warning or error limit of 0 on those channels if you still want PRTG to show a warning / error level for them.

While I was writing the script, I decided to create a new value lookup in PRTG to make the overall status clearer. If you adjust the script to add additional statuses for the overall status channel, you will need to adjust this lookup file as well.

Let's start with the value lookup file: copy the text from the first script block into a file you store here: C:\Program Files (x86)\PRTG Network Monitor\lookups\custom

Name the file: vmware.alerts.search.ovl
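
Since the lookup text itself is not embedded in this post, here is a generic example of the PRTG value lookup format as an orientation only – the states, values and texts are assumptions and must match whatever your script returns for the overall status channel:

    <?xml version="1.0" encoding="UTF-8"?>
    <ValueLookup id="vmware.alerts.search" desiredValue="1" xsi:noNamespaceSchemaLocation="PaeValueLookup.xsd" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
      <Lookups>
        <SingleInt state="Ok" value="1">OK - no unacknowledged warnings or alerts</SingleInt>
        <SingleInt state="Warning" value="2">Unacknowledged warnings present</SingleInt>
        <SingleInt state="Error" value="3">Unacknowledged alerts present</SingleInt>
      </Lookups>
    </ValueLookup>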

Now we need to create a custom EXE/XML sensor in this directory: C:\Program Files (x86)\PRTG Network Monitor\Custom Sensors\EXEXML

Name the file: VMwareAlerts.ps1

Once you have both files created, go to PRTG and add a new sensor of the type EXE/Script Advanced and select the newly created script file. As parameter you either type the host name of your vSphere server or, if you created the sensor underneath the vSphere device in PRTG, just use %host.

UPDATE: I changed the script because I found it better to use explicitly expected parameters, so that you always have control over the username and password used to connect to VMware. Going forward the script expects the vSphere server name, a username and a password as parameters.

There are still a few challenges you might need to overcome on top of this:

  • install the VMware PowerShell extensions on your PRTG probe server
  • credentials to connect to VMware can be a challenge, as I found while testing this
    • you might need to give the service account of the PRTG probe sufficient access rights – this needs working SSO
    • alternatively, use a stored credentials file in PowerShell – somewhat secure
    • or provide the credentials in clear text in PowerShell – least secure
    • please see line 20, respectively the command connect-viserver, for more details
  • the script has been updated – it now expects username and password as parameters

You might want to test the script before you add a sensor to PRTG – the best way is to run it directly on the PRTG probe server under the probe's service account, to make sure it will work as a sensor later on.

Keep in mind that the script expects a parameter – the VMware vSphere server name / web-address.
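
As the full script is not embedded in this post, the following is a minimal sketch of the approach – reading the triggered alarms from the vCenter root folder via PowerCLI and counting them by status and acknowledgement. The parameter names, the channel selection and the overall-status values are assumptions, not the original script:

    # Minimal sketch - not the original script. Assumes VMware PowerCLI on the probe server.
    param(
        [string]$VIServer,   # vSphere server name / web address
        [string]$User,
        [string]$Password
    )

    Import-Module VMware.PowerCLI -ErrorAction Stop
    Connect-VIServer -Server $VIServer -User $User -Password $Password | Out-Null

    # triggered alarms hang off the root folder of the inventory
    $rootFolder = Get-Folder -NoRecursion
    $alarms     = @($rootFolder.ExtensionData.TriggeredAlarmState)

    Disconnect-VIServer -Server $VIServer -Confirm:$false

    $alertsNotAck   = @($alarms | Where-Object { $_.OverallStatus -eq 'red'    -and -not $_.Acknowledged }).Count
    $warningsNotAck = @($alarms | Where-Object { $_.OverallStatus -eq 'yellow' -and -not $_.Acknowledged }).Count
    $acknowledged   = @($alarms | Where-Object { $_.Acknowledged }).Count

    # overall status value for the custom lookup (assumption: 1 = OK, 2 = Warning, 3 = Alert)
    $overall = 1
    if ($warningsNotAck -gt 0) { $overall = 2 }
    if ($alertsNotAck   -gt 0) { $overall = 3 }

    $xml  = "<prtg>"
    $xml += "<result><channel>Overall status</channel><value>$overall</value><ValueLookup>vmware.alerts.search</ValueLookup></result>"  # assumption: lookup id matches the file name
    $xml += "<result><channel>Total Alerts - NOT Acknowledged</channel><value>$alertsNotAck</value></result>"
    $xml += "<result><channel>Total Warnings - NOT Acknowledged</channel><value>$warningsNotAck</value></result>"
    $xml += "<result><channel>Total Warnings and Alerts - Acknowledged</channel><value>$acknowledged</value></result>"
    $xml += "</prtg>"
    Write-Output $xml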

This was also posted on the PRTG KB here.

 

How to create an independent backup network

Today we look at independent backup networks, especially in regard to LTO-7 and VMware ESX hosts. Be aware that this example also applies to any backup-to-disk (B2D) solution – though a good reseller / vendor would point this out right away anyway.

LTO-7 and later drives like LTO-8 have a write speed faster than a 1 Gbit network can handle, which makes it really necessary to think about your options. On top of that, you do not want to over-utilize the LAN side of your servers, so that the impact on the user / application facing side stays minimal. This leaves you with two options: you can group switch ports, assuming you have enough 1 Gbit ports – you will need at least 3 ports combined – or you create a dedicated backup network on a 10 Gbit basis.

Let’s run some numbers:

  • LTO-7 has a write speed of about 300 MB/s uncompressed and up to 750 MB/s compressed
  • LTO-8 has a write speed of about 360 MB/s uncompressed and up to 900 MB/s compressed

Now, your network connection is measured in Mbit/s, not MByte/s. One byte is 8 bits, so we need to multiply those byte-based speeds by 8 to see the required network speed.

  • LTO-7 uncompressed = 300 MB/s * 8 = 2,400 Mbit/s
  • LTO-7 compressed = 750 MB/s * 8 = 6,000 Mbit/s
  • LTO-8 uncompressed = 360 MB/s * 8 = 2,880 Mbit/s
  • LTO-8 compressed = 900 MB/s * 8 = 7,200 Mbit/s

Assuming you want to go with grouped 1 Gbit ports, you can see that LTO-7 would need about 6 ports and LTO-8 about 7 to 8 ports to fully utilize the drive speed and minimize your backup window. Additionally, think about the read speed, which affects you as well – not just for recovery but also for the verify pass of your backup.

This means: add at least one 10 Gbit switch and one 10 Gbit NIC per server. Let's go through an example:

  • 3x VMware ESX hosts – LAN side and management configured with 1 Gbit – we assume there is some kind of storage behind them that has the IOPS and speed we need, like an SSD-based array
  • 1x backup media server with an LTO-7 or LTO-8 drive connected – 1 Gbit on the LAN side

What we need – minimal:

  • 4x 10 Gbit NICs
  • 1x 10 Gbit switch
  • 4x Cat 6a or Cat 7 cables

What I would recommend – nice to have:

  • 4x 10 Gbit NICs – dual port
  • 2x 10 Gbit switches
  • 10x Cat 7 cables – 2x of them to stack/trunk the switches if they are not stacked otherwise

This is a nice-to-have for fail-over; the minimal configuration is sufficient as well.

Cable it all in and create a new IP scope / VLAN on the backup side – you do not need a default gateway etc. on the backup network (10 Gbit). Just use an independent IP scope and assign every host a static address.

This keeps regular network traffic and broadcasts away from this network, and your backup runs completely independently. You might also need to exclude this NIC / IP scope from the anti-virus solution on the backup media server, because it can influence the speed quite drastically. Having it separated and independent helps keep security up.

On the VMware hosts I even like to allow vMotion on this backup LAN, simply because it is extremely efficient there – independent from your LAN and, if you have one, from your iSCSI network as well. But that's just an idea.

Now, how will the backup grab the data from the 10 Gbit side of your VMware hosts, especially if you have a vSphere cluster and run the backup through the cluster?

Simple – you adjust the hosts file on your media server. Each and every VMware host needs to be listed in the hosts file on the media server with the IP it has in your 10 Gbit backup network. This way DNS and everything else behaves normally in your environment; only the backup media server reaches those hosts over the 10 Gbit network, because of the local name resolution. This is the easiest way to accomplish it.
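
A hypothetical example of such a hosts file (C:\Windows\System32\drivers\etc\hosts on a Windows media server) – the host names and the backup network IPs are made up and need to be replaced with your own:

    # backup network (10 Gbit) addresses of the ESX hosts - resolves backup traffic
    # to the dedicated backup NICs instead of the LAN addresses registered in DNS
    10.10.10.11   it-esx-01p.domain.local   it-esx-01p
    10.10.10.12   it-esx-02p.domain.local   it-esx-02p
    10.10.10.13   it-esx-03p.domain.local   it-esx-03p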

You do not need to add a 10 Gbit connection, backup network IP address etc. to your VMware vCenter server – it can stay on your LAN or server management network as it is. This also means there is no reason to list it in the hosts file on the media server.

How this works:

  1. your backup software contacts the vCenter server on the LAN side
  2. it is then redirected to the host that currently runs the VM you want to back up
  3. the media server contacts that VMware host directly – via the hosts file entry, over the 10 Gbit backup network
  4. the backup runs…

This would of course work with a physical server as well – like a physical file server – though that is rather rare today, and VMware backups in particular are large files that benefit most from the LTO-7 write speed, so the above makes the most sense there. It also wouldn't matter if you did the same with a Hyper-V environment or any other VM host/guest solution; in theory it always works the same way.

What real-world write speeds can you expect?
This is the big question. Here are some real-world examples – these are single jobs on a per-VM basis, meaning they even include the tape load and unload processing time and the catalog updates, using Veritas Backup Exec.

Backup size (VM size) | Elapsed time (h:mm) | Job rate write/overall | Job rate verify
4 TB                  | 6:47                | 17,227 MB/min          | 26,404 MB/min
2 TB                  | 6:47                | 6,822 MB/min           | 22,233 MB/min
1.21 TB               | 3:49                | 8,271 MB/min           | 20,235 MB/min
147 GB                | 0:33                | 6,491 MB/min           | 22,655 MB/min
138 GB                | 0:17                | 18,403 MB/min          | 27,726 MB/min
25 GB                 | 0:10                | 9,172 MB/min           | 20,700 MB/min

The above list is just an example – realistically we see overall speeds between roughly 3,000 MB/min and 18,000 MB/min. This is partly due to the VM itself – thin or thick provisioned, what kind of data it holds, and how busy the host is, since we might hit it twice when multiple drives back up from the same host at the same time. On average we see around 8,000 to 9,000 MB/min, which is still great – and I wanted to show that it can vary quite a bit, so don't be scared. Going from an LTO-4 LAN-based backup scenario to an LTO-7 independent backup network still cut our backup time to less than half. The slowest speeds we see today come from systems that can only be backed up on the LAN side; their ports are grouped, but we still don't reach the speeds we see on the backup network side. Many factors come into play, and it all depends on the individual situation.

I hope the information above helps some of you out there. Keep in mind that your situation might be different – run some numbers and scenarios of your own, and if you have questions, reach out. This remains an example of what I actually implemented at a company and how it affected the backup configuration and management.

WDS respective PXE boot and VMware

If you try to PXE boot a VMware guest system against e.g. WDS / Windows Deployment Services or similar, you might find that boot.wim etc. downloads unbelievably slowly – this can take several hours. It has especially to do with booting via the VMware EFI firmware; the VMware BIOS firmware does not cause this issue. You could switch an EFI system to BIOS to capture/deploy the image, but that is not really a solution, just a workaround.

The solution is pretty simple: the download is transferred via TFTP, and VMware has an issue with the block size – the variable block size negotiation between the VM guest and your PXE server gets messed up.

Set the Maximum Block Size to 1456, which is the exact value VMware needs to work properly. Also disable the Variable Window Extension and try to PXE boot again – it will now load in about a minute, depending on your WinPE image size, and your issues are in the past.

In detail:

  1. open your Windows Deployment Services
  2. right click on the server and select Properties
  3. navigate to the TFTP tab
  4. set the Maximum Block Size to 1456
  5. uncheck the Enable Variable Window Extension checkbox
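
If you prefer to script this instead of using the GUI, the Variable Window Extension can also be turned off with the wdsutil option shown below; the registry value for the maximum block size is an assumption and should be verified against your WDS version before relying on it:

    :: disable the TFTP Variable Window Extension via wdsutil
    wdsutil /Set-TransportServer /EnableTftpVariableWindowExtension:No

    :: assumed registry value for the maximum TFTP block size - verify the value name first,
    :: then restart the WDS service
    reg add "HKLM\SYSTEM\CurrentControlSet\Services\WDSServer\Providers\WDSTFTP" /v MaximumBlockSize /t REG_DWORD /d 1456 /f
    net stop WDSServer && net start WDSServer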

Build your own lab environment with VMware

If you have a virtual environment, like VMware, with a storage device behind it, you might be able to mirror your whole real / live environment and create a complete playground or shadow network: you can simulate any of your guest VMs just like the real system and change, update and adjust configurations within this lab environment. This is actually rather easy to accomplish and can save you a lot of headaches.

This is just an example; you might be able to accomplish something similar by just cloning single VMs, and in theory it wouldn't even matter whether you use VMware vSphere or something like Microsoft Hyper-V. Still, doing it this way has advantages – let me explain.

Assumed scenario:

You work with a VMware environment with one or more host systems and several guest VMs. You need to update software and configurations on those guest VMs, but you have to test this beforehand to ensure everything runs smoothly. The VMs are stored on a central storage device that can do volume-level snapshots.

Prepare the environment

  1. you need one NIC per host that you connect to a switch
  2. ideally use an independent switch that has no other network connections than the one NIC per host
  3. create a shadow-LAN virtual switch in your VMware cluster and assign the one NIC per host to it, so VMs across the cluster can communicate properly

What you do:

  1. create a clone of a storage snapshot and mount it to the VMware host systems as an additional volume
  2. browse the cloned volume's file system and add the VMs you need to the inventory (you might want to rename them to shadow-<servername> so you can easily identify them)
  3. re-configure their network connection from your regular LAN virtual switch to the shadow-LAN virtual switch you prepared (see the PowerCLI sketch after this list)
  4. start the added guest VM
    1. if you are asked whether you moved or copied the guest, answer that you moved it – to avoid hardware / MAC and other changes that could cause Windows to want to re-activate
    2. VMware might complain about a duplicate MAC address – you can ignore this, because you are on two different / independent networks
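
Step 3 can be done in the vSphere client, but if you register several shadow VMs at once, a small PowerCLI loop does the same job – a minimal sketch; the name prefix shadow-, the port group name SHADOW-LAN and the vCenter name are assumptions from this example:

    # minimal PowerCLI sketch: move every VM whose name starts with shadow- to the
    # shadow port group (standard virtual switch) in one go
    Import-Module VMware.PowerCLI -ErrorAction Stop
    Connect-VIServer -Server 'vcenter.domain.local' | Out-Null   # hypothetical vCenter; prompts for or reuses credentials

    Get-VM -Name 'shadow-*' |
        Get-NetworkAdapter |
        Set-NetworkAdapter -NetworkName 'SHADOW-LAN' -Confirm:$false

    Disconnect-VIServer -Confirm:$false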

Real usage example:

Let me give you a more detailed example of what I personally used and did with this. The example below should help you understand the whole principle better.

  • VMware cluster with e.g. 10x host systems – we had enough RAM and CPU capacity to tolerate 3x hosts going down – you won't need that much, but you should have some buffer for RAM and CPU usage
  • Nimble all-flash storage arrays in the background, connected to all the VMware hosts and using the Nimble VMware plugins (note: Nimble has since been bought by HPE)
    • the Nimble arrays are configured to take volume-level snapshots multiple times per day
  • all physical host systems had a dedicated network card (NIC) connected to an independent physical switch that was NOT connected to any other network switch
  • a virtual switch SHADOW-LAN was created and those physical NICs of the host systems were assigned to it
    • this allowed any VM connected to this virtual switch to communicate with other VMs connected to the same virtual switch on other hosts

Due to migrations, software updates and quality-controlled systems, we constantly faced the challenge of testing changes and adjustments thoroughly. So I came up with the solution of simply cloning a snapshot on the Nimble storage array the VM resided on and mounting it to the VMware cluster – which takes only minutes – and then adding e.g. domain controllers, DHCP servers, the necessary file servers and the target guest system to the inventory in VMware, adjusting their names so we could quickly identify them (even adding them to resource pools if necessary) and, most importantly, changing their virtual switch configuration to the shadow switch.

Advantages and possibilities:

This allowed us to simulate the whole real-world system (VMs) and every change we wanted. To get software into the lab, we attached virtual disks holding what we needed, or we even used a secondary internet connection to briefly reach licensing or update services that vendors only provide online. The advantages of the solution go even further:

  • simulate everything you have in your VMware environment and have it behave like the real / live system
  • if necessary, provide internet access by connecting a SECONDARY internet connection (router / firewall) to the physical shadow network switch
  • add real printers to the shadow switch to be able to test print-outs (we had those cases)
  • add physical workstations to simulate whole production environments
  • update / refresh the whole system in only a few minutes by using a fresh snapshot clone
  • minimal to almost no impact on the capacity / free space of your storage device
    • this is because the cloned Nimble snapshot creates a new branch and only the deltas (changes) consume storage – even for “bigger” simulations we are talking about only a few gigabytes of changed data, if that much, depending of course on your storage and what you do
  • we installed VMware console connections on quality-testing workstations so the testers could access the systems directly on the console
    • of course only granting them minimal rights to this specific pool of VMs
    • ensuring their access to the VMware environment had no impact on the real system
  • documenting any changes, challenges faced and solutions found
  • thanks to the physical switch – best practice is a layer-3 switch – we were able to simulate whole VLANs, routing etc. within the environment and even connect various physical systems like printers, workstations and, temporarily, an internet connection

It only takes a small amount of effort to prepare for those simulations initially, because the virtual shadow switch, the physical shadow switch and the hosts' network card connections to this physical switch are a one-time effort. After that you just clone and mount snapshots and add the VMs you need, adjusting their network connection to the virtual shadow switch.

Once set up, preparing a simulation usually takes less than 30 minutes until everything is cloned, mounted, added to the inventory (including the NIC change to the shadow switch) and booted up.

Why not just clone all VMs via VMware?

Good question – the answer is simple: it would have an impact on your storage capacity, because it would create actual full clones. It also takes longer to clone individual VMs than to just grab a storage-level snapshot and work with it down at the volume level on the storage. Even the clean-up might be more work or leave unwanted data behind, while a volume clone only requires you to remove the VM guest systems from the inventory and then unmount and delete the whole shadow volume.

I wrote this all up because I wanted to share it – the whole idea is not that special in theory, but I think it is a good example of how you can get a large and decent lab environment with only minimal effort. In any case, I hope the idea behind it helps some of you out there 🙂

 

 

Solution for VSS exceptions with VMware guests / VM tools

Today's blog is about a solution for VSS exceptions with VMware guests / VMware Tools and was initially posted by myself here: https://vox.veritas.com/t5/Backup-Exec/Solution-for-VSS-exceptions-with-VMware-guests-VM-tools/td-p/829072 – it actually became a KB entry for a (by now older) version of Veritas Backup Exec, and I did not want to leave it out of my blog. Here is the link to the KB article: https://www.veritas.com/docs/000009506

Here is the article I wrote – starting with a description of the actual issue:

We have been experiencing VSS issues with VMware guests in regards to the installed Backup Exec Agent and a previously installed VMware Tools VSS option.

Uninstalling the VMware Tools VSS option in various ways including restarts did not fix our issues. If you search the internet for solutions, you find many attempts but no real solution or explanation.

One of our admins spent several hours with Veritas support without a solution and was about to escalate the issue with them when we found the root cause and could actually fix our issues.

First the steps to solve this:

  1. Uninstall the VMware Tools VSS option (no restart will be required)
  2. Make sure VMware VSS service was deleted
    1. If it was not deleted, you might need to remove it manually, delete additional DLLs etc. and restart the system, but that is independent of this solution
  3. You might have already done steps 1 and 2 but still get VSS exceptions from the backup saying that you have more than one VSS provider installed:
    1. V-79-8192-38331 – The backup has detected that both the VMware VSS Provider and the Symantec VSS Provider have been installed on the virtual machine ‘hostname’. However, only one VSS Provider can be used on a virtual machine. You must uninstall the VMware VSS Provider.
    2. Now you wonder what causes this and you get stuck
    3. You could uninstall the Veritas/Symantec Backup Exec Agent and only back the system up per VMDK
    4. You would lose the GRT / granular backup / restore capabilities
  4. Check your registry for the following reg key
    1. HKLM\System\CurrentControlSet\Services\BeVssProviderConflict
    2. If this key exists, but your VMware VSS provider is uninstalled, you need to follow up with step 5
  5. Open a notepad as administrator
  6. Open this file in the notepad
    1. C:\ProgramData\VMware\VMware Tools\manifest.txt
  7. Search for the following two entries:
    1. Vcbprovider_2003.installed
    2. vcbprovider.installed
  8. Make sure both of them are set to FALSE; most likely one of them is TRUE (a scripted check is sketched below, after this list)
  9. Run a test backup
    1. This test backup now should not show the exception anymore
    2. The registry key should vanish (refresh/press F5) without you taking action
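
If you need to check this on more than one guest, the two conditions from steps 4 and 7/8 can be verified quickly from an elevated PowerShell prompt – a minimal sketch; the registry key and the manifest path are taken from this article:

    # check for the leftover registry key that the pre-freeze script creates (step 4)
    Test-Path 'HKLM:\SYSTEM\CurrentControlSet\Services\BeVssProviderConflict'

    # show the two vcbprovider entries in the VMware Tools manifest (steps 6-8);
    # both should read FALSE after the VMware VSS provider has been uninstalled
    Select-String -Path 'C:\ProgramData\VMware\VMware Tools\manifest.txt' -Pattern 'vcbprovider(_2003)?\.installed'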

So what happened?

You uninstalled the VMware Tools VSS provider, but this manifest file did not get updated. We could actually see that it sometimes gets updated and sometimes does not. This seems to be some kind of issue with the VMware Tools installer/uninstaller.

But why this manifest.txt file?

As we found out, there are scripts that get executed by Symantec/Veritas Backup Exec before the backup. You might find them in two locations, and which script is executed (from which location) seems to depend a bit on the Windows version. You could edit them both and simply remove the checks, but that wouldn't be correct – it is more correct to update the manifest.txt file. If you want, check the date/time of the manifest.txt file before you change it: you might see that it was not updated when you uninstalled the VMware Tools VSS provider (assuming you only did that and no additional installs/uninstalls within VMware Tools, and that you were still experiencing the issue).

Now, back to those scripts, you find them here:

  • C:\Windows
  • C:\Program Files\Symantec\Backup Exec\RAWS\VSS Provider

The name of the script that matters:

  • pre-freeze-script.bat

This script checks several DLLs, registry entries and paths, and on Windows 2008 and newer it also checks the ProgramData path for this specific manifest.txt file and the two entries mentioned.

Once you uninstall the VMware VSS provider and the file does not get updated, you might see this issue and wonder how to solve it. The solution is simply to update the file to mirror the uninstallation of the VMware VSS provider (vcbprovider). We double-checked this with several installations: if the file actually gets updated, those two values are set to FALSE; if it doesn't, at least one of the values remains TRUE, which causes pre-freeze-script.bat to write the registry key mentioned earlier and therefore causes the exceptions in the backup.

If you still have the same issues after updating the manifest.txt, check all the DLLs mentioned in the script and make sure they don't exist. You might also consider manually deleting the registry key (it seems to be just a dummy key) to make sure nothing prevents the script from deleting it. Make sure it does not re-appear after a backup! Otherwise you might still have DLLs left on your system that cause the script to re-create the registry key.

Hope this helps a few of you out there. This was an ongoing issue for a while, and I have come across it many times since Windows 2008. It applies to Windows 2008 R2, Windows 2012, Windows 2012 R2 and quite certainly Windows 2016 as well.

It helped us get rid of those issues completely, without even needing a single restart of the guest VM (removing the VMware VSS provider did not require a restart).