Someone recently asked me for advice on how to configure antivirus on virtualized endpoints after they received complaints about poor performance on their infrastructure. After I got very briefly excited that the request would give me a chance to discuss hypervisor-based malware protection, it turned out that the only option at this point was plain old (thick) antivirus agents in each guest VM. To my surprise, few of the – seemingly obvious – configuration options to lessen the pain had been set. Looking around online I found very thin coverage of this topic, so I decided to write up my take on it. Please feel free to let me know your opinions, criticism or suggestions.
As the technology matures, new options to protect data processed in virtualized environments become available. At this point, however, the malware protection strategy for virtual infrastructure (VI) in many organizations is still focused on agent-based technology. Unfortunately, conventional antivirus solutions can put considerable stress on shared VI resources. However, many antivirus solutions do provide configuration options to mitigate the performance hit on these shared environments. The following sections provide thoughts on what to implement.
Coordinated maintenance windows
As mentioned above the key point is to minimize concurrent load on the hardware resources. This requires planning inside and outside of the antivirus solution to avoid bottlenecks.
Identify service hours and maintenance tasks required for the virtual infrastructure
This might seem obvious, but an effort should be made to correctly identify the core service hours of the virtual workloads. In multi-tenant environments a capacity management team or similar personnel should be consulted to identify periods of low workload pressure that would be suitable maintenance windows. Careful planning is required to ensure that maintenance tasks do not affect other tenants during their core service hours, when resources are scarce. Capacity management should collate the relevant maintenance tasks for the environment so that capacity reserves are adequate for seamless service.
Schedule maintenance tasks in a coordinated way
Once the preferred/potential maintenance periods are known and the necessary activities are understood, the individual activities should be scheduled in non-conflicting timeslots where possible. Poor planning, with multiple support teams trying to run service tasks on multiple assets at the same time, might result in resource exhaustion for the whole infrastructure as illustrated below.
Ideally this can be avoided by assigning appropriate timeslots to the support teams, spreading out the overall load on the virtual environment across the maintenance periods.
Typical maintenance tasks to consider with regard to resource contention are conventional backup and restore jobs, snapshot creation and clean-up, software deployments to guests or host systems, antivirus scans, vulnerability assessments, inventory scans, etc. Once it is clear what time and timeframe are available for antivirus-related tasks, further planning can take place.
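A quick way to sanity-check a proposed schedule is to look for overlapping timeslots. The sketch below is only an illustration: the task names and times are made up, and in practice the data would come from whatever change calendar or scheduling tool your teams use.

```python
from datetime import datetime
from itertools import combinations

# Hypothetical maintenance tasks: (task name, start, end) for one night.
tasks = [
    ("backup", datetime(2024, 1, 1, 22, 0), datetime(2024, 1, 2, 0, 0)),
    ("av_scan", datetime(2024, 1, 1, 23, 30), datetime(2024, 1, 2, 1, 30)),
    ("patching", datetime(2024, 1, 2, 1, 30), datetime(2024, 1, 2, 3, 0)),
]

def find_conflicts(tasks):
    """Return pairs of task names whose timeslots overlap."""
    conflicts = []
    for (name_a, start_a, end_a), (name_b, start_b, end_b) in combinations(tasks, 2):
        # Two timeslots overlap when each one starts before the other ends.
        # A task that starts exactly when another ends is not a conflict.
        if start_a < end_b and start_b < end_a:
            conflicts.append((name_a, name_b))
    return conflicts

print(find_conflicts(tasks))  # [('backup', 'av_scan')]
```

Here the backup and the antivirus scan collide for half an hour, while patching starts exactly when the scan is supposed to end and is therefore not flagged.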
Methodical endpoint configuration
There are several configuration settings to optimize resource utilization within and outside the maintenance window. Often these settings are applied to logical groups (e.g. ‘Domains’ in Trend Micro, sites/groups for McAfee ePO managed endpoints, etc.), which requires the endpoints to be grouped according to their configuration requirements. This should be done as a first step where it does not conflict with existing categorization methodologies.
Set scan exclusions
Usually there are no special scan exclusion recommendations for guest operating systems in virtual machines, but general advice on exclusions is provided by Microsoft, antivirus vendors and often application vendors. From a performance point of view it might be beneficial to limit the scope of the full scan to file extensions known to carry risk, like .exe, .com, .vbs, etc. (e.g. the IntelliScan setting in Trend Micro).
For further details please refer to:
What Anti-Virus scanning exclusions should be considered for system and servers?
VMware http://www.vmware.com/files/pdf/VMware-View-AntiVirusDeployment-WP-en.pdf (pg.8)
Limit agent resource consumption
Many antivirus solutions provide the option to either tune or limit their resource consumption on the protected endpoint during selected scanning tasks. These settings can be used to reduce the load on the endpoints during resource-intensive tasks, but it is important to understand that:
– this is an estimate of the consumed resources and not a 100% reliable measure
– limiting the available resources might result in an extended runtime of the maintenance task
If it is decided to use these configuration options it should be verified that all necessary maintenance tasks are completed within the allocated time window on all agents.
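The verification step can be as simple as comparing reported task runtimes against the allocated window. The following is a minimal sketch, assuming you can export per-agent scan durations from your management console; the host names and numbers are invented for illustration.

```python
# Hypothetical per-agent scan durations in minutes, e.g. exported from
# the antivirus management console after a throttled scan run.
scan_minutes = {"vm-web-01": 42, "vm-db-01": 135, "vm-app-01": 88}

def agents_exceeding_window(durations, window_minutes):
    """Return the agents (sorted by name) whose task ran longer than the window."""
    return sorted(
        name for name, minutes in durations.items() if minutes > window_minutes
    )

print(agents_exceeding_window(scan_minutes, window_minutes=120))  # ['vm-db-01']
```

Agents that show up in this list are candidates for a less aggressive resource limit, a narrower scan scope or a longer timeslot.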
Aligning tasks and timings
Once the general maintenance windows are defined, the antivirus-specific tasks and timings can be reviewed. The key points to take into consideration are the update tasks (policies and virus signatures) and the on-demand scan timings.
Optimizing update tasks
Updates of agent components can consume considerable resources on shared infrastructure especially if the workload is scheduled to be executed at the same time for all agents across the cluster. This is easily illustrated by the simple equation below.
Update volume = ((update size per AV component * number of out-dated components) * number of endpoints)
Assuming there is a respectable number of live endpoints on a cluster, distributing the update of even a single antivirus component (e.g. a pattern file) can put quite a strain on the vSwitch(es). This pain point can be reduced by regular incremental updates, phased update windows and pull randomization within the update window. The options you have to implement mitigation measures depend on the capabilities of your antivirus product.
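To put some numbers on the equation and on the idea of pull randomization, here is a minimal sketch. The function names, the update sizes and the endpoint count are my own illustrative assumptions, not vendor figures, and real products expose randomization as a product-specific setting rather than through code like this.

```python
import random

def update_volume_mb(update_size_mb, outdated_components, endpoints):
    """Total data pulled if every endpoint updates every out-dated component."""
    return update_size_mb * outdated_components * endpoints

def randomized_pull_offsets(endpoints, window_minutes, seed=None):
    """Give each endpoint a random start offset (in minutes) inside the update window."""
    rng = random.Random(seed)
    return {name: rng.uniform(0, window_minutes) for name in endpoints}

# Illustrative only: 200 endpoints, 2 out-dated components each.
full = update_volume_mb(50, 2, 200)         # assumed 50 MB full component
incremental = update_volume_mb(2, 2, 200)   # assumed 2 MB incremental update
print(full, incremental)  # 20000 800

# Spread the 200 pulls over a 60 minute update window instead of
# letting every agent start at minute zero.
names = [f"vm-{i:03d}" for i in range(200)]
offsets = randomized_pull_offsets(names, window_minutes=60, seed=1)
```

With these assumed sizes, regular incremental updates cut the volume from roughly 20 GB to under 1 GB, and the random offsets keep the remaining traffic from arriving all at once.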
On Demand scan schedule
It is debatable whether on-demand scans (ODS) are necessary if all endpoints are protected by on-access scanning (OAS). Some vendors recommend turning ODS off in virtual infrastructures, but ultimately this is a risk decision your business will have to make. Points to consider –
– Is OAS enabled all the time, or might there be protection gaps introduced by automated processes or human error/deliberate acts?
– Does OAS reliably identify and stop relevant malicious code?
– Is OAS an appropriate risk mitigation for the environment or should it be turned off?
– Cost of other clean-up processes if malicious code got past OAS (out-dated AV component, disabled service, etc)
There are probably other reasons you might think of, and every situation will lead to a different decision. If you decide ODS is not for you, just skip ahead.
With the timings of the update tasks in mind, the support groups should then decide when the scheduled scan task is to run. It makes good sense to run the ODS after the updates have been applied to the endpoints. This ensures that the latest available virus signatures are used during the scan.
As mentioned in the previous sections, care must be taken to ensure that the on-demand scan runs at a time that does not conflict or overlap with other system-related tasks. Usually an on-demand scan can be configured with a maximum scan period, after which the ODS is forcibly ended. Keep in mind, however, that the scan might be stopped before it has inspected all data, potentially leaving malicious code undiscovered.
The picture below illustrates the effect of a default – unmethodical – antivirus configuration. The tasks which are scheduled to be carried out during the allocated maintenance window start as soon as the maintenance window starts. While this might not have a negative impact on the overall service (as it is happening during the maintenance period), it is certainly far from ideal.
The result of this setup would be random consumption spikes as agents on the shared infrastructure start their ODS task. This might trip monitoring thresholds and raise (debatably false) alarms, incurring unnecessary investigation work for support teams and/or dulling awareness of real alerts. Due to the overlap of pattern update and ODS, some agents might scan with out-dated pattern files, which also reduces the effectiveness of the scan.
The setup illustrated below is preferable as it will ensure the shared resources are utilized in a more controlled pattern avoiding spikes in consumption as much as possible.
While there might be some overlaps of start/end times of ODS between the groups the overall resource consumption will be more evenly spread out. The asset owner will have to review and decide how their assets can be set up in logical groups and define the various scan times of all groups.
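One simple way to derive the scan times for the groups is to spread the start times evenly between the end of the update phase and the latest point at which a scan can still finish inside the window. The sketch below assumes a fixed maximum scan period per group and uses invented group names and times; it is a starting point for planning, not a replacement for your product's scheduler.

```python
from datetime import datetime, timedelta

def staggered_scan_schedule(groups, update_end, window_end, scan_limit):
    """Spread ODS start times evenly between the end of the update phase and
    the latest start that still lets a scan of `scan_limit` finish in the window."""
    latest_start = window_end - scan_limit
    if latest_start < update_end:
        raise ValueError("window too short for a scan after the update phase")
    if len(groups) <= 1:
        step = timedelta(0)
    else:
        step = (latest_start - update_end) / (len(groups) - 1)
    return {group: update_end + i * step for i, group in enumerate(groups)}

# Assumed window: updates finish at 23:00, maintenance ends at 05:00,
# and each ODS is capped at two hours.
schedule = staggered_scan_schedule(
    ["group-a", "group-b", "group-c"],
    update_end=datetime(2024, 1, 1, 23, 0),
    window_end=datetime(2024, 1, 2, 5, 0),
    scan_limit=timedelta(hours=2),
)
for group, start in schedule.items():
    print(group, start.strftime("%H:%M"))
# group-a 23:00
# group-b 01:00
# group-c 03:00
```

In this example the groups do not overlap at all. With more groups or a longer scan limit the slots will start to overlap, which is the trade-off described above.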