Replacing a VASA Provider’s certificate after vCenter Server convergence causes virtual volumes (vVols) to become inaccessible in the inventory of all ESXi hosts.

What causes this?

1. The convergence workflow installs RPMs related to the PSC services, which also means a new VMware Certificate Authority (VMCA) instance is created on the embedded VC node.

2. VMCA creates a new VMCA root certificate, which in turn is used for future certificate requests that the embedded node handles.

3. The old certificates are retained, which keeps VC <-> host communication working. Other solutions such as vVols stop working, however, because the new certificates issued to VASA providers carry the new root certificate details while the hosts still hold the old ones. This mismatch breaks the vVol workflow.

How do you resolve this?

Renew or refresh the certificates of the ESXi hosts connected to vCenter Server.

https://docs.vmware.com/en/VMware-vSphere/6.7/com.vmware.vsphere.security.doc/GUID-ECFD1A29-0534-4118-B762-967A113D5CAA.html

The certificate refresh has to be done manually per host.

Note: Bulk certificate management is currently not possible from the vCenter Server UI.

VMFS-6 heap memory exhaustion on ESXi 7.0/7.0b hosts

What is the VMFS heap and what is it used for?

The VMFS heap is defined by the advanced setting VMFS3.MaxHeapSizeMB. The main consumers of the VMFS heap are the pointer blocks, which are used to address file blocks in very large files/VMDKs on a VMFS filesystem. Therefore, the larger your VMDKs, the more VMFS heap you can consume.

How to check the current heap usage on an ESXi host:

vsish -e ls /system/heaps | grep vmfs3
vsish -e get /system/heaps/<output of above command>/stats


When the issue is observed

Any file open activity can encounter the issue.

Datastores showing “Not consumed” on hosts.

Consolidation fails with “Consolidation failed for disk node ‘scsi0:1’: 12 (Cannot allocate memory).”

vMotion, snapshot, and VM power on/power off activities.

Logs and keywords to check

vmkernel.log

2020-06-29T14:59:36.351Z cpu21:5630454)WARNING: HBX: 2439: Failed to initialize VMFS distributed locking on volume 5eb9e8f1-f4aeef84-4256-1c34da50d370: Out of memory
2020-06-29T14:59:36.351Z cpu21:5630454)Vol3: 4202: Failed to get object 28 type 1 uuid 5eb9e8f1-f4aeef84-4256-1c34da50d370 FD 0 gen 0 :Out of memory
2020-06-29T14:59:36.351Z cpu21:5630454)Vol3: 4202: Failed to get object 28 type 2 uuid 5eb9e8f1-f4aeef84-4256-1c34da50d370 FD 4 gen 1 :Out of memory
2020-06-29T14:59:36.356Z cpu21:5630454)WARNING: HBX: 2439: Failed to initialize VMFS distributed locking on volume 5eb9e8f1-f4aeef84-4256-1c34da50d370: Out of memory

vmkwarning.log

vmkwarning.0:2020-06-16T13:28:23.291Z cpu48:3479102)WARNING: Heap: 3651: Heap vmfs3 already at its maximum size. Cannot expand.
vmkwarning.0:2020-06-16T14:20:23.676Z cpu62:3479103)WARNING: Heap: 3651: Heap vmfs3 already at its maximum size. Cannot expand.
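
If you have copied vmkernel.log and vmkwarning.log off the host, a quick scan for the two signatures shown above can be sketched in Python as follows. This is only a rough sketch: the function name and the regular expressions are my own, built from the log lines in this post, and other builds may word the messages differently.

```python
import re

# Phrases taken from the log excerpts above; treat them as assumptions
# about the message wording on your build.
HEAP_FULL = re.compile(r"Heap vmfs3 already at its maximum size")
OUT_OF_MEMORY = re.compile(r"volume ([0-9a-f-]+): Out of memory")


def summarize(lines):
    """Count heap-full warnings and collect volume UUIDs that hit 'Out of memory'."""
    heap_full_count = 0
    volumes = set()
    for line in lines:
        if HEAP_FULL.search(line):
            heap_full_count += 1
        match = OUT_OF_MEMORY.search(line)
        if match:
            volumes.add(match.group(1))
    return heap_full_count, sorted(volumes)


sample = [
    "2020-06-29T14:59:36.351Z cpu21:5630454)WARNING: HBX: 2439: Failed to initialize "
    "VMFS distributed locking on volume 5eb9e8f1-f4aeef84-4256-1c34da50d370: Out of memory",
    "vmkwarning.0:2020-06-16T13:28:23.291Z cpu48:3479102)WARNING: Heap: 3651: "
    "Heap vmfs3 already at its maximum size. Cannot expand.",
]
print(summarize(sample))
```

To scan a real file, pass the open file object instead of the sample list.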

Check the consumed heap size using the vsish commands mentioned above.

Fix the issue by running the commands below for each VMFS6 datastore on each host.

1. Create an eager zeroed thick disk on each of the mounted VMFS6 datastores.

vmkfstools -c 10M -d eagerzeroedthick /vmfs/volumes/datastore/eztDisk

2. Delete the eager zeroed thick disk created in step 1.

vmkfstools -U /vmfs/volumes/datastore/eztDisk
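
Since the two commands have to be repeated per datastore, it can help to generate them first and review the list before running anything. Below is a minimal Python sketch that only builds the command strings. The function name and the datastore names are hypothetical; the vmkfstools options are the ones used above.

```python
import shlex


def heap_workaround_commands(datastores, disk_name="eztDisk", size="10M"):
    """Return the create/delete vmkfstools command pair for each datastore name."""
    commands = []
    for datastore in datastores:
        # Quote the path so datastore names containing spaces stay one argument.
        path = shlex.quote(f"/vmfs/volumes/{datastore}/{disk_name}")
        commands.append(f"vmkfstools -c {size} -d eagerzeroedthick {path}")
        commands.append(f"vmkfstools -U {path}")
    return commands


# "datastore1" and "datastore2" are placeholder names.
for command in heap_workaround_commands(["datastore1", "datastore2"]):
    print(command)
```

The sketch prints the commands rather than executing them, so nothing is created or deleted until you paste them into the ESXi shell yourself.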



SRM/vSphere Replication site pairing fails with the error “Cannot complete login due to an incorrect user name or password.”

When will you see this?

While attempting to pair sites after a re-installation or upgrade of VC/VR/SRM.

[Log Excerpt]

dr.log:

2020-05-05T21:32:13.527+05:30 warning vmware-dr[04864] [SRM@6876 sub=LocalHms] Failed to connect:
--> (vim.fault.InvalidLogin) {
--> faultCause = (vmodl.MethodFault) null,
--> faultMessage =
--> msg = "Received SOAP response fault from []: login
--> Cannot complete login due to an incorrect user name or password."
--> }
--> [context]zKq7AVMEAAgAAFaTwQAMdm13YXJlLWRyAAAqPwJ2bWFjb3JlLmRsbAABtM4CdmltLXR5cGVzLmRsbAAB/X8yAqXCBXZtb21pLmRsbAACz+AFAOt+GwBLjhsAyYghA39PAk1TVkNSMTIwLmRsbAADJlECBNITAEtFUk5FTDMyLkRMTAAF9FQBbnRkbGwuZGxsAA==[/context]
--> [backtrace begin] product: VMware vCenter Site Recovery Manager, version: 8.1.2, build: build-12686166, tag: vmware-dr, cpu: x86_64, os: windows, buildType: release
--> backtrace[03] vmacore.dll[0x00023F2A]
--> backtrace[04] vim-types.dll[0x0002CEB4]
--> backtrace[05] vim-types.dll[0x00327FFD]
--> backtrace[06] vmomi.dll[0x0005C2A5]
--> backtrace[07] vmomi.dll[0x0005E0CF]
--> backtrace[08] vmacore.dll[0x001B7EEB]
--> backtrace[09] vmacore.dll[0x001B8E4B]
--> backtrace[10] vmacore.dll[0x002188C9]
--> backtrace[11] MSVCR120.dll[0x00024F7F]
--> backtrace[12] MSVCR120.dll[0x00025126]
--> backtrace[13] KERNEL32.DLL[0x000013D2]
--> backtrace[14] ntdll.dll[0x000154F4]
--> [backtrace end]

/opt/vmware/hms/logs/hms.log

2020-05-05 09:44:28.246 ERROR com.vmware.vim.sso.client.impl.SoapBindingImpl tcweb-11 operationID=lro-2-71e1a81-37ab-HMS-201468 | SOAP fault
com.sun.xml.internal.ws.fault.ServerSOAPFaultException: Client received SOAP Fault from server: Access not authorized! Please see the server log to find more detail regarding exact cause of the failure.

2020-05-05 09:44:28.247 ERROR jvsl.security.authentication.sm tcweb-11 operationID=lro-2-71e1a81-37ab-HMS-201468 | Invalid token
com.vmware.vim.sso.client.exception.InvalidTokenRequestException: Request is invalid: ns0:InvalidRequest: Access not authorized!

2020-05-05 09:44:28.248 INFO hms.i18n.class com.vmware.hms.response.filter.I18nActivationResponseFilter tcweb-11 operationID=lro-2-71e1a81-37ab-HMS-201468 | The localized message is: Cannot complete login due to an incorrect user name or password.

Why would we see this?

One or more SolutionUsers have been removed from the groups they should be a part of, resulting in the issue.

Steps to resolve:

The following are the four SRM and VR SolutionUsers present in the environment:

SRM-
SRM-remote-
h5-dr-
com.vmware.vr-

The following are the groups these SolutionUsers should be a part of:

  1. SolutionUsers
    SRM-
    SRM-remote-
    h5-dr-
    com.vmware.vr-
  2. ActAsUsers
    CN=h5-dr-
    com.vmware.vr-
  3. Administrators
    SRM-
  4. LicenseService.Administrators
    SRM-
  5. SRM Remote Users
    SRM-remote-
  6. HmsRemoteUsers
    SRM-remote-

To restore any missing memberships:

  1. Log in to vCenter Server using the vSphere Flex client.
  2. Navigate to Administration -> Single Sign-On -> Users and Groups -> Groups -> Add Group members.
  3. Manually add the SolutionUsers to these groups.
  4. Re-register SRM/VR.
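
To work out which memberships are missing before you start clicking through the UI, the group mapping above can be written down as data and compared with what you actually see in each group. The Python sketch below assumes you have noted the current members per group by hand (it does not query SSO). It matches SolutionUsers by the name prefixes listed above, and treats a leading "CN=" as optional, which is my own assumption.

```python
# Required SolutionUser name prefixes per group, as listed above.
REQUIRED = {
    "SolutionUsers": ["SRM-", "SRM-remote-", "h5-dr-", "com.vmware.vr-"],
    "ActAsUsers": ["h5-dr-", "com.vmware.vr-"],
    "Administrators": ["SRM-"],
    "LicenseService.Administrators": ["SRM-"],
    "SRM Remote Users": ["SRM-remote-"],
    "HmsRemoteUsers": ["SRM-remote-"],
}

# Longest prefix first, so "SRM-remote-..." is never mistaken for "SRM-...".
_PREFIXES = sorted({p for ps in REQUIRED.values() for p in ps}, key=len, reverse=True)


def classify(member):
    """Map a member name to its SolutionUser prefix, or None if it is something else."""
    name = member[3:] if member.startswith("CN=") else member
    for prefix in _PREFIXES:
        if name.startswith(prefix):
            return prefix
    return None


def missing_memberships(actual):
    """Return {group: [missing prefixes]} given {group: [current member names]}."""
    missing = {}
    for group, required in REQUIRED.items():
        present = {classify(member) for member in actual.get(group, [])}
        absent = [prefix for prefix in required if prefix not in present]
        if absent:
            missing[group] = absent
    return missing


# Hypothetical member names; the "-a" suffix stands in for the real ID.
observed = {
    "SolutionUsers": ["SRM-a", "SRM-remote-a", "h5-dr-a", "com.vmware.vr-a"],
    "ActAsUsers": ["CN=h5-dr-a", "com.vmware.vr-a"],
    "Administrators": ["SRM-remote-a"],
    "LicenseService.Administrators": ["SRM-a"],
    "SRM Remote Users": ["SRM-remote-a"],
}
print(missing_memberships(observed))
```

Every group that shows up in the result needs the listed SolutionUser added back.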

SSO Domain Re-point fails in vCenter 6.7 at Authz data export

This article describes how to repoint an embedded PSC in one SSO domain to another embedded node in the same or a different SSO domain.

Why do this?

  • To have both vCenters connected under ELM (Enhanced Linked Mode)
  • It helps in managing multiple vCenters from one user interface

How to do it?

We can run the re-point command in pre-check mode and in execute mode.

Pre-check validates the current environment and reports potential errors we could encounter before we execute the command.

++Command syntax:
cmsso-util domain-repoint -m pre-check --src-emb-admin Administrator --replication-partner-fqdn vcsa2.gss.local --replication-partner-admin Administrator --dest-domain-name vsphere.local

cmsso-util domain-repoint -m execute --src-emb-admin Administrator --replication-partner-fqdn vcsa2.gss.local --replication-partner-admin Administrator --dest-domain-name vsphere.local
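
The two invocations differ only in the mode, so if you script this it is easy to build both from the same arguments and guarantee that execute runs with exactly what pre-check validated. Here is a small Python sketch that assembles the argument list. The function name is hypothetical, and the options are the ones from the syntax in this section, written with ASCII double hyphens.

```python
def repoint_command(mode, src_admin, partner_fqdn, partner_admin, dest_domain):
    """Build the cmsso-util domain-repoint argument list for the given mode."""
    if mode not in ("pre-check", "execute"):
        raise ValueError(f"unsupported mode: {mode!r}")
    return [
        "cmsso-util", "domain-repoint",
        "-m", mode,
        "--src-emb-admin", src_admin,
        "--replication-partner-fqdn", partner_fqdn,
        "--replication-partner-admin", partner_admin,
        "--dest-domain-name", dest_domain,
    ]


# Same values as the example commands in this section.
args = ("Administrator", "vcsa2.gss.local", "Administrator", "vsphere.local")
print(" ".join(repoint_command("pre-check", *args)))
print(" ".join(repoint_command("execute", *args)))
```

Keeping the arguments as a list also makes it straightforward to hand them to subprocess.run on the appliance without worrying about shell quoting.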

1. Pre-check mode fails in 6.7 U2 during the Authz data export

++In the /var/log/vmware/cloudvm/domain_consolidator.log you see the following error:

2019-04-25T20:49:29.215Z INFO domain_consolidator Started required services.
2019-04-25T20:49:29.659Z INFO domain_consolidator RC = 1
Stderr = Picked up JAVA_TOOL_OPTIONS: -Xms32M -Xmx128M
Exception in thread "main" java.lang.NoClassDefFoundError: org/springframework/context/support/AbstractApplicationContext
        at com.vmware.vim.vmomi.core.types.VmodlContext.initContext(VmodlContext.java:61)
        at com.vmware.vim.vmomi.core.types.VmodlContext.initContext(VmodlContext.java:42)

++The fix is to upgrade vCenter Server to version 6.7 U3.

++Workaround for the issue:

A. Validate that the spring* files under /opt/vmware/lib64 are at version 4.3.20.
B. Update the spring version referenced in vCenter 6.7 U2 from 4.3.9 to 4.3.20. The script /usr/lib/repoint/authzservice_component_script.py has hard-coded references to version 4.3.9.
You can run the command below to edit all the entries in the script:
sed -i 's/4\.3\.9/4.3.20/g' /usr/lib/repoint/authzservice_component_script.py

C. Proceed with running the pre-check command, which should now complete successfully.
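
If you would rather preview the substitution from step B before touching the script, the same replacement can be sketched in Python. The pattern escapes the dots and refuses to match when the version is part of a longer number (for example 4.3.90), which is slightly stricter than a plain text replace. The function name and the sample string are hypothetical and are not taken from the actual script.

```python
import re

# Match the literal version 4.3.9, but not as part of a longer version
# such as 14.3.9, 1.4.3.9 or 4.3.90.
_OLD_VERSION = re.compile(r"(?<![\d.])4\.3\.9(?!\d)")


def bump_spring_version(text, new_version="4.3.20"):
    """Return text with every standalone 4.3.9 replaced by new_version."""
    return _OLD_VERSION.sub(new_version, text)


# Hypothetical line, only to illustrate the replacement.
print(bump_spring_version("spring-core-4.3.9.RELEASE.jar"))
```

Read the script into a string, run it through the function, and diff the result against the original before writing anything back.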

2. Pre-check mode fails in 6.7 U3 during the Authz data export

++In the /var/log/vmware/cloudvm/domain_consolidator.log you see the following error:

2020-08-04T02:49:06.213Z INFO domain_consolidator Started required services.
2020-08-04T02:49:07.908Z INFO domain_consolidator RC = 1
Stderr = Picked up JAVA_TOOL_OPTIONS: -Xms32M -Xmx128M
Exception in thread "main" java.lang.Exception: QueryClient creation failed for VC:vcsa1.gss.local. Check 'domain_data_export.log
at com.vmware.vim.dataservices.ExportAuthzData.main(ExportAuthzData.java:224)

2020-08-04T02:49:07.909Z INFO domain_consolidator Export of Authz Data failed. Exception {
"resolution": null,
"problemId": null,
"detail": [
{
"id": "install.ciscommon.command.errinvoke",
"translatable": "An error occurred while invoking external command : '%(0)s'",
"args": [
"Stderr: Picked up JAVA_TOOL_OPTIONS: -Xms32M -Xmx128M\nException in thread \"main\" java.lang.Exception: QueryClient creation failed for VC:vcsa1.gss.local. Check 'domain_data_export.log'\n\tat

++In domain_data_export.log, the error below indicates that STS certificate validation failed.

2020-08-04T02:49:07.814Z [main DEBUG com.vmware.vim.sso.client.impl.SoapBindingImpl opId=] Sending SOAP request to the STS server
2020-08-04T02:49:07.833Z [main DEBUG com.vmware.vim.sso.client.impl.ssl.StsSslTrustManager opId=] The SSL certificate of STS service cannot be verified against the list of client-trusted certificates

sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target

++The fix is to update the STS service certificate in the MOB with the MACHINE_SSL_CERT certificate.

A. Validate the STS certificate from the vCenter MOB.

B. Open the MOB by going to https://vCenter_IP/lookupservice/mob?moid=ServiceRegistration&method=List in a browser. Log in using the administrator@vsphere.local account.

C. In the filterCriteria text field, modify the value field to contain only the tags <filterCriteria></filterCriteria> and click Invoke Method. This displays the ArrayOfLookupServiceRegistrationInfo objects.

D. Search for sts/STS on the page. Find the value of the corresponding sslTrust field. The content of that field is the Base64-encoded string of the old certificate.

E. Copy the string from the ArrayOfString field in the row of the sslTrust name (next to the ArrayOfString type), and save it as a file named sts.cer.

F. Note the thumbprint of the certificate by opening it.

G. Run this command to export the new certificate to a file:
/usr/lib/vmware-vmafd/bin/vecs-cli entry getcert --store MACHINE_SSL_CERT --alias __MACHINE_CERT --output /temp/new_sts.crt

H. The thumbprints of sts.cer and new_sts.crt do not match, which causes validation of the STS service to fail.

I. Run the command below to update the certificate information in the MOB.

python /usr/lib/vmidentity/tools/scripts/ls_update_certs.py --url https://psc.domain.com/lookupservice/sdk --fingerprint Thumbprint_of_sts.cer --certfile /temp/new_sts.crt --user administrator@vsphere.local --password Password
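
The thumbprint comparison in steps F to H can also be done without opening the certificates in a viewer. The Python sketch below assumes the thumbprint in question is the SHA-1 hash of the DER-encoded certificate, that the sslTrust value is that DER data in Base64, and that the exported file is PEM. All three are my assumptions, and the function names are hypothetical.

```python
import base64
import hashlib


def thumbprint_from_base64(b64_der):
    """SHA-1 thumbprint (colon-separated, upper case) of Base64-encoded DER data."""
    der = base64.b64decode("".join(b64_der.split()))
    digest = hashlib.sha1(der).hexdigest().upper()
    return ":".join(digest[i:i + 2] for i in range(0, len(digest), 2))


def thumbprint_from_pem(pem_text):
    """Same thumbprint, starting from PEM text with BEGIN/END marker lines."""
    body = [
        line.strip()
        for line in pem_text.splitlines()
        if line.strip() and not line.startswith("-----")
    ]
    return thumbprint_from_base64("".join(body))


# Placeholder payload (not a real certificate), only to show the call shape.
old_sslTrust = "aGVsbG8="
new_pem = "-----BEGIN CERTIFICATE-----\naGVsbG8=\n-----END CERTIFICATE-----\n"
print(thumbprint_from_base64(old_sslTrust) == thumbprint_from_pem(new_pem))
```

Feed the first function the sslTrust string from the MOB and the second the contents of the exported file. If the two results differ, you are in the mismatch situation described in step H.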


++Run the domain re-point command in pre-check mode again.