<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:media="http://search.yahoo.com/mrss/"><channel><title><![CDATA[SQL DBA Blog]]></title><description><![CDATA[A Day in the Life of a DBA]]></description><link>https://blog.peterman.ca/</link><image><url>https://blog.peterman.ca/favicon.png</url><title>SQL DBA Blog</title><link>https://blog.peterman.ca/</link></image><generator>Ghost 3.0</generator><lastBuildDate>Tue, 20 Jan 2026 18:57:09 GMT</lastBuildDate><atom:link href="https://blog.peterman.ca/rss/" rel="self" type="application/rss+xml"/><ttl>60</ttl><item><title><![CDATA[The Dilemma of Automated SQL Server Patching]]></title><description><![CDATA[<p>	No matter what technological area of expertise, patching is always a requirement. This is no different in the life of a DBA. One of the tasks is to ensure that the environment is kept up to date with new releases and security updates, but are you able to automate patching? The</p>]]></description><link>https://blog.peterman.ca/the-logic-of-automated-sql-server-patching/</link><guid isPermaLink="false">5e702efea8675e0001f8d152</guid><dc:creator><![CDATA[Nathan Peterman]]></dc:creator><pubDate>Tue, 17 Mar 2020 03:33:08 GMT</pubDate><media:content url="https://blog.peterman.ca/content/images/2020/03/QkKWO.jpg" medium="image"/><content:encoded><![CDATA[<img src="https://blog.peterman.ca/content/images/2020/03/QkKWO.jpg" alt="The Dilemma of Automated SQL Server Patching"><p>	No matter what technological area of expertise, patching is always a requirement. This is no different in the life of a DBA. One of the tasks is to ensure that the environment is kept up to date with new releases and security updates, but are you able to automate patching? The answer: yes, but carefully. 
</p><p>	There are a few questions that should be answered when approaching the subject of automated patching:</p><ol><li>Has a backup been taken?</li><li>Is the environment ready for patching?</li><li>Are the databases ready for patching?</li><li>Is the server a member of an AlwaysOn availability group? Can it be failed over safely?</li><li>Is there a backout plan if anything goes wrong?</li><li>Can we send out an alert in the event of a failed backup?</li></ol><p>	When our team was approached about creating a method of deploying automated patching for SQL servers, we took the safe approach and made sure the automated process was aware of the environment that it was patching and would only patch if it met specific requirements. There are a few key areas that this post will concentrate on to make sure that each patch installed is either successful or an alert is sent out for a failure. What we would like to avoid is silent failures.</p><h3 id="logging">Logging</h3><p>	When a patch job is launched on a server, the first step in the process is to set up a patching log file. This step is critical, and the entire job will fail with an exit code if the process of log creation does not meet the following requirements:</p><ol><li>Does the Log Directory exist? If not, attempt to create it, and fail if it cannot be created</li><li>Does a Log exist? We can't overwrite old logs, so we need a method of renaming old log files while only keeping a specific number for housekeeping. If no log exists, create a new one</li><li>Can I successfully write to the log file? If not, the job fails.</li></ol><p>Once logging has been verified, we can continue with the patching process now that we know we have a central location shared between all steps to send output. 
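</p><p>The three log-setup requirements above can be sketched as a short routine. The sketch below is illustrative Python only (our actual tooling is PowerShell), and the log file name and retention count are assumptions:</p>

```python
import os
import sys

def prepare_patch_log(log_dir, log_name="patching.log", keep=5):
    """Create or rotate the patching log; raise OSError if logging
    cannot be guaranteed, so the caller can abort the patch job."""
    # 1. Does the Log Directory exist? If not, attempt to create it.
    os.makedirs(log_dir, exist_ok=True)  # raises OSError on failure

    log_path = os.path.join(log_dir, log_name)
    # 2. Does a Log exist? Never overwrite it; rotate old copies,
    #    keeping only `keep` of them for housekeeping.
    if os.path.exists(log_path):
        for i in range(keep - 1, 0, -1):
            old = "{}.{}".format(log_path, i)
            if os.path.exists(old):
                os.replace(old, "{}.{}".format(log_path, i + 1))
        os.replace(log_path, log_path + ".1")

    # 3. Can we successfully write to the log file? If not, the job fails.
    with open(log_path, "a") as f:
        f.write("Patch log initialized\n")
    return log_path

if __name__ == "__main__":
    try:
        prepare_patch_log("patch_logs")
    except OSError as e:
        # Fail the entire patch job with a non-zero exit code.
        sys.exit("Log setup failed, aborting patch job: %s" % e)
```

<p>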
From this point on, individual steps may fail based on their output, but the ability to log data is so critical that its failure should abort the entire patching process.</p><h3 id="investigation">Investigation</h3><p>	The next phase of the patching process is to gather information about the targeted environment and to check that it meets patching requirements. Some of the information gathered for further analysis later in the patching process:</p><ol><li>The name(s) and number of installed Instances</li><li>The SQL Port of the Engine Service</li><li>The SQL Cluster name (if there is one)</li><li>Whether HADR is enabled on the Instance</li><li>Verify that the account used for patching has appropriate SQL permissions</li><li>Verify that the account used for patching has appropriate OS permissions</li><li>AlwaysOn Availability group information and roles</li><li>AlwaysOn Availability group listener information</li><li>Disk space</li><li>Database health</li><li>OS pending reboot status</li></ol><h3 id="logic">Logic</h3><p>	Based on the information found during the Investigation phase, the fun part now begins. If we set up our job with all of the information gathering contained within the Investigation phase, we can then make a small sub-job that pre-checks the environment before the full automated patch job is launched. This way, small or easy failures can be corrected before a full patching job is launched on the SQL Server. </p><p>	The Logic phase will act on the information and run specific tasks to prepare the server for patching. The responsibility of a DBA is to maintain SQL Server availability and data integrity, and blindly patching a server can compromise both of those responsibilities. 
Some of the logic used before patching is as follows:</p><ol><li>Reboot the server if it has a pending reboot, and pick up where the job left off after the reboot</li><li>Check Database status and correct as needed</li><li>Launch a standardized backup job, or launch an automated backup process if no standardized job is found</li><li>If AlwaysOn exists, check the failover mode and data synchronization type, and fail over if possible so we are only patching a Secondary.</li><li>Verify AlwaysOn secondary status and patch level before patching the primary if a failover is not possible</li><li>Check for monitoring services and cleanly shut them down</li></ol><h3 id="patching">Patching</h3><p>	When entering the patching phase, we are at the point of no return. Once monitoring has been disabled on the server, we can't have any part of the process fail to run, as we would be left with an un-monitored environment. In our process, turning off monitoring is the last step before patching. Enabling monitoring again is the first step after a patch has completed its execution. At this point we are not yet verifying that patching was successful; we are just making sure we have a monitored environment and alerts can be sent out in the event of a failure. This phase is the watchdog patching sandwich.</p><h3 id="validations">Validations</h3><p>	After monitoring has started successfully, we move on to the validation phase with tasks that read through the SQL patching log files for events and exit codes to verify that the patching process has completed successfully. 
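</p><p>A validation task like this can be sketched as a small log scanner. The following Python sketch is illustrative only; the exit-code pattern and the set of success codes are assumptions about the patch log format (3010 is the common Windows installer code for "success, reboot required"):</p>

```python
import re

# Exit codes treated as success; 3010 ("reboot required") is a common
# Windows installer convention -- adjust for your environment.
SUCCESS_CODES = {0, 3010}

def check_patch_log(log_text):
    """Scan patch log text for an exit code; return (ok, detail)."""
    match = re.search(r"exit code(?:\s*\(decimal\))?\s*:\s*(-?\d+)",
                      log_text, re.IGNORECASE)
    if match is None:
        # A silent failure is what we want to avoid: no exit code
        # found means we cannot prove success, so treat it as failure.
        return False, "no exit code found - treat as a failed patch"
    code = int(match.group(1))
    if code in SUCCESS_CODES:
        return True, "patch succeeded with exit code %d" % code
    return False, "patch failed with exit code %d - raise an alert" % code
```

<p>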
If an error is discovered, an alert can be generated with a severity matching the system currently being patched.</p><h3 id="conclusion">Conclusion</h3><p>	The dilemma of patching SQL Server through an automated process, while maintaining a database admin's requirements for data integrity and availability, can be accomplished through some logic and smart patching processes that verify and correct issues before the patching process is launched.</p>]]></content:encoded></item><item><title><![CDATA[An Easy Way to Backup SQL Servers]]></title><description><![CDATA[<p>	Let me start off by saying, this is not a replacement for an Enterprise Backup Solution that should be handling the backups for Production environments. But if you are running a similar setup to my current organization, backups are not completed for DEV or QA environments. This makes for an</p>]]></description><link>https://blog.peterman.ca/a-backup-powershell-script/</link><guid isPermaLink="false">5e2870d4a8675e0001f8d091</guid><dc:creator><![CDATA[Nathan Peterman]]></dc:creator><pubDate>Wed, 22 Jan 2020 16:58:21 GMT</pubDate><media:content url="https://blog.peterman.ca/content/images/2020/01/BackupExample.png" medium="image"/><content:encoded><![CDATA[<img src="https://blog.peterman.ca/content/images/2020/01/BackupExample.png" alt="An Easy Way to Backup SQL Servers"><p>	Let me start off by saying, this is not a replacement for an Enterprise Backup Solution that should be handling the backups for Production environments. But if you are running a similar setup to my current organization, backups are not completed for DEV or QA environments. This makes for an interesting problem because these are the environments where SQL updates are tested. 
If, like me, you want to make your life easier when it comes time to roll back, and want to take a pre-update backup in some instances, you might want to look at using the following script:</p><figure class="kg-card kg-bookmark-card"><a class="kg-bookmark-container" href="https://github.com/Desani/DBA-Tools-Public"><div class="kg-bookmark-content"><div class="kg-bookmark-title">Desani/DBA-Tools-Public</div><div class="kg-bookmark-description">This repository will contain tools that leverage PowerShell and SQL Modules to help automate some tasks that need to be performed by DBA&amp;#39;s on the Microsoft SQL Server platform. - Desani/DBA-Too...</div><div class="kg-bookmark-metadata"><img class="kg-bookmark-icon" src="https://github.githubassets.com/favicon.ico" alt="An Easy Way to Backup SQL Servers"><span class="kg-bookmark-author">Desani</span><span class="kg-bookmark-publisher">GitHub</span></div></div><div class="kg-bookmark-thumbnail"><img src="https://avatars1.githubusercontent.com/u/9305718?s=400&amp;v=4" alt="An Easy Way to Backup SQL Servers"></div></a></figure><p>The Script I am referring to is: <a href="https://github.com/Desani/DBA-Tools-Public/blob/master/Backup-SQLDatabases.ps1">Backup-SQLDatabases.ps1</a></p><p>	The script allows for a number of command line arguments to be supplied to specifically target the required databases for backup. The only mandatory parameters are -Path, to dictate where the backups will be stored, and the type of backup being taken: Full, Log, or Diff.</p><p>I will go over a few different parameter combinations and what type of backups they produce.</p><!--kg-card-begin: markdown--><pre><code>.\Backup-SQLDatabases.ps1 -Full -Path G:\Backup -CheckSum -Verify
</code></pre>
<!--kg-card-end: markdown--><figure class="kg-card kg-image-card"><img src="https://blog.peterman.ca/content/images/2020/01/BackupExample-1.png" class="kg-image" alt="An Easy Way to Backup SQL Servers"></figure><p>	The command above launches a full backup of USER databases on the locally installed SQL Server with a backup checksum and then verifies each backup. The nice thing about this is that you do not have to supply a connection string to the local instance. The script will pull information about the instances installed on the local machine and use it for the connection to SQL Server. There is no problem if you would like to launch this script against a remote machine; just provide a connection string for the remote instance and a network path that is accessible by both the SQL Server engine service and the machine launching PowerShell.</p><p>	If no databases are specified, then all user databases will be selected for backup. If you would like to include System Databases, use the param -SystemDB, which will let the script know to include those in the backup. If you would like to back up only the system databases, use the param -NoUserDB, which will skip all user-created databases. You can use the param -Database to specify the exact databases that you would like backed up, whether you would like one or many targeted. </p><p>	The PowerShell script also leverages GridView for easy selection of specific databases. Use the param -SelectDB to have a list displayed with all available databases. Highlight the databases you would like included and push "Ok" to back up only the selected databases.</p><!--kg-card-begin: markdown--><pre><code>.\Backup-SQLDatabases.ps1 -Path X:\Backups -CheckSum -Verify -SelectDB -Full
</code></pre>
<!--kg-card-end: markdown--><figure class="kg-card kg-image-card"><img src="https://blog.peterman.ca/content/images/2020/01/image-34.png" class="kg-image" alt="An Easy Way to Backup SQL Servers"></figure><!--kg-card-begin: markdown--><pre><code>.\Backup-SQLDatabase.ps1 -Diff -Path &quot;\\NetworkLocation\sharedfolder&quot; -Database -SystemDB -NoUserDB -ConnectionString &quot;Servername,InstancePort&quot; -Retention 14
</code></pre>
<!--kg-card-end: markdown--><p>	This command creates a Diff backup of only the System Databases for a remote SQL Server Instance and attempts to remove backup files older than 14 days. This will result in no database backups, as Diff backups are not supported for System Databases.</p><!--kg-card-begin: markdown--><pre><code>.\Backup-SQLDatabase.ps1 -Full -CopyOnly -Path G:\Backup -Database &quot;foglightdb,dbadmin,master&quot; -SystemDB -AlwaysOn -SQLCredential sysdba
</code></pre>
<!--kg-card-end: markdown--><p>	This command creates a Full Copy-Only backup of only the databases dbadmin, foglightdb and master on the local SQL Server to G:\Backup, running the backup as the sysdba SQL account. -SystemDB and -AlwaysOn must be used to make sure the databases supplied are eligible for a backup, as master is a System Database and foglightdb is an AlwaysOn database in this example. -AlwaysOn only needs to be supplied when running the backup on a secondary AlwaysOn server.</p><p>	The parameters -Script and -WhatIf do not produce any backups. -Script will create a Query log file in the same directory as the backup logs with each database scripted out in TSQL so that backups can be run manually. -WhatIf just displays the results as if the commands had run, without any backups occurring. This can be used if you are unsure which databases will be targeted by different combinations of parameters.</p><p>	There are more parameters that can be explored by visiting the GitHub link provided. If there is any additional functionality that you would like included, feel free to shoot me an email and I will look at incorporating it.</p>]]></content:encoded></item><item><title><![CDATA[Identifying Problematic Queries]]></title><description><![CDATA[<p>	As a DBA, being able to quickly locate and track problematic queries is invaluable. I will share some queries that I have collected from various sources or created that have assisted in the troubleshooting process when trying to identify problematic queries. 
As time goes on, I will update this post</p>]]></description><link>https://blog.peterman.ca/identifying-problematic-queries/</link><guid isPermaLink="false">5e26088ba8675e0001f8cfe8</guid><dc:creator><![CDATA[Nathan Peterman]]></dc:creator><pubDate>Fri, 06 Dec 2019 17:16:00 GMT</pubDate><media:content url="https://blog.peterman.ca/content/images/2020/01/Activity.PNG" medium="image"/><content:encoded><![CDATA[<img src="https://blog.peterman.ca/content/images/2020/01/Activity.PNG" alt="Identifying Problematic Queries"><p>	As a DBA, being able to quickly locate and track problematic queries is invaluable. I will share some queries that I have collected from various sources or created that have assisted in the troubleshooting process when trying to identify problematic queries. As time goes on, I will update this post with additional content as new methods of investigation are found.</p><p>	You log in to a SQL Server that is reporting issues, open Activity Monitor to get a good idea of what is going on, and see the following:</p><figure class="kg-card kg-image-card"><img src="https://blog.peterman.ca/content/images/2020/01/image-25.png" class="kg-image" alt="Identifying Problematic Queries"></figure><p>Where do you start? What do you look into first? Below are some sample reported issues and the queries that I would use to start the troubleshooting process.</p><h3 id="problem-1-queries-are-being-blocked-and-you-would-like-to-identify-the-head-blocker-">Problem 1: Queries are being blocked and you would like to identify the head blocker. </h3><p>There are two queries that I use to identify the head blocker; the first is more consistent in its results, but the second includes a bit more information.</p><p>Query 1: Provides the SPID and the query text</p><!--kg-card-begin: markdown--><pre><code>SET NOCOUNT ON
GO
SELECT SPID, BLOCKED, REPLACE (REPLACE (T.TEXT, CHAR(10), ' '), CHAR (13), ' ' ) AS BATCH
INTO #T
FROM sys.sysprocesses R CROSS APPLY sys.dm_exec_sql_text(R.SQL_HANDLE) T
GO
WITH BLOCKERS (SPID, BLOCKED, LEVEL, BATCH)
AS
(
SELECT SPID,
BLOCKED,
CAST (REPLICATE ('0', 4-LEN (CAST (SPID AS VARCHAR))) + CAST (SPID AS VARCHAR) AS VARCHAR (1000)) AS LEVEL,
BATCH FROM #T R
WHERE (BLOCKED = 0 OR BLOCKED = SPID)
AND EXISTS (SELECT * FROM #T R2 WHERE R2.BLOCKED = R.SPID AND R2.BLOCKED &lt;&gt; R2.SPID)
UNION ALL
SELECT R.SPID,
R.BLOCKED,
CAST (BLOCKERS.LEVEL + RIGHT (CAST ((1000 + R.SPID) AS VARCHAR (100)), 4) AS VARCHAR (1000)) AS LEVEL,
R.BATCH FROM #T AS R
INNER JOIN BLOCKERS ON R.BLOCKED = BLOCKERS.SPID WHERE R.BLOCKED &gt; 0 AND R.BLOCKED &lt;&gt; R.SPID
)
SELECT N'    ' + REPLICATE (N'|         ', LEN (LEVEL)/4 - 1) +
CASE WHEN (LEN(LEVEL)/4 - 1) = 0
THEN 'HEAD -  '
ELSE '|------  ' END
+ CAST (SPID AS NVARCHAR (10)) + N' ' + BATCH AS BLOCKING_TREE
FROM BLOCKERS ORDER BY LEVEL ASC
GO
DROP TABLE #T
GO

</code></pre>
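<p>The recursive CTE above can be easier to reason about in procedural form. Here is the same blocking-tree idea sketched in Python; this is illustrative only (the T-SQL above is what actually runs), and it works from simple (spid, blocked_by) pairs in place of sys.sysprocesses:</p>

```python
def blocking_tree(sessions):
    """sessions: list of (spid, blocked_by) pairs, blocked_by 0 if not blocked.
    Returns indented lines with head blockers at the top, like the CTE output."""
    waiters = {}
    for spid, blocked in sessions:
        if blocked not in (0, spid):
            waiters.setdefault(blocked, []).append(spid)

    lines = []
    def walk(spid, depth):
        # Depth 0 is a head blocker; deeper levels are indented waiters.
        prefix = "HEAD -  " if depth == 0 else "|       " * (depth - 1) + "|------  "
        lines.append(prefix + str(spid))
        for child in sorted(waiters.get(spid, [])):
            walk(child, depth + 1)

    # Head blockers: sessions that block others but are not blocked themselves.
    blocked_map = dict(sessions)
    for spid in sorted(waiters):
        if blocked_map.get(spid, 0) in (0, spid):
            walk(spid, 0)
    return lines
```

<p>For example, sessions [(51, 0), (52, 51), (53, 52)] produce a tree with 51 as the HEAD and 53 indented beneath 52; sessions that are neither blocking nor blocked are excluded, just as in the CTE.</p>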
<!--kg-card-end: markdown--><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://blog.peterman.ca/content/images/2020/01/image-23.png" class="kg-image" alt="Identifying Problematic Queries"><figcaption>Sample Results for Query 1</figcaption></figure><p>Query 2: Provides a bit more information but is not as consistent at locating head blockers.</p><!--kg-card-begin: markdown--><pre><code>SELECT
db.name DBName,
tl.request_session_id,
wt.blocking_session_id,
OBJECT_NAME(p.OBJECT_ID) BlockedObjectName,
tl.resource_type,
h1.TEXT AS RequestingText,
h2.TEXT AS BlockingText,
tl.request_mode
FROM sys.dm_tran_locks AS tl
INNER JOIN sys.databases db ON db.database_id = tl.resource_database_id
INNER JOIN sys.dm_os_waiting_tasks AS wt ON tl.lock_owner_address = wt.resource_address
INNER JOIN sys.partitions AS p ON p.hobt_id = tl.resource_associated_entity_id
INNER JOIN sys.dm_exec_connections ec1 ON ec1.session_id = tl.request_session_id
INNER JOIN sys.dm_exec_connections ec2 ON ec2.session_id = wt.blocking_session_id
CROSS APPLY sys.dm_exec_sql_text(ec1.most_recent_sql_handle) AS h1
CROSS APPLY sys.dm_exec_sql_text(ec2.most_recent_sql_handle) AS h2
GO
</code></pre>
<!--kg-card-end: markdown--><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://blog.peterman.ca/content/images/2020/01/image-24.png" class="kg-image" alt="Identifying Problematic Queries"><figcaption>Sample Results for Query 2</figcaption></figure><p>Once you have a bit more information about the head blocker, you can dig into more options and more query information using the SPID provided.</p><p>Query 3: List all current waiting tasks with blocking session ID</p><!--kg-card-begin: markdown--><pre><code>SELECT w.session_id
     , w.wait_duration_ms
     , w.wait_type
     , w.blocking_session_id
     , w.resource_description
     , s.program_name
     , t.text
     , t.dbid
     , s.cpu_time
     , s.memory_usage
 FROM sys.dm_os_waiting_tasks as w
      INNER JOIN sys.dm_exec_sessions as s
         ON w.session_id = s.session_id
      INNER JOIN sys.dm_exec_requests as r 
         ON s.session_id = r.session_id
      OUTER APPLY sys.dm_exec_sql_text (r.sql_handle) as t
  WHERE s.is_user_process = 1;
</code></pre>
<!--kg-card-end: markdown--><figure class="kg-card kg-image-card"><img src="https://blog.peterman.ca/content/images/2020/01/image-26.png" class="kg-image" alt="Identifying Problematic Queries"></figure><h3 id="problem-2-high-cpu-usage">Problem 2: High CPU Usage</h3><p>	You log into the SQL Server, see that the CPU is being hit hard, and want to quickly determine the cause. Here are some quick queries that can be run to pull out the heavy-hitting CPU queries.</p><p>Query 1: Top 10 queries consuming high CPU currently running</p><!--kg-card-begin: markdown--><pre><code>SELECT TOP 10 s.session_id,
	r.status,
	r.blocking_session_id 'Blk by',
	r.wait_type,
	wait_resource,
	r.wait_time / (1000 * 60) 'Wait M',
	r.cpu_time,
	r.logical_reads,
	r.reads,
	r.writes,
	r.total_elapsed_time / (1000 * 60) 'Elaps M',
	Substring(st.TEXT,(r.statement_start_offset / 2) + 1,
	((CASE r.statement_end_offset
		WHEN -1
		THEN Datalength(st.TEXT)
		ELSE r.statement_end_offset
		END - r.statement_start_offset) / 2) + 1) AS statement_text,
	Coalesce(Quotename(Db_name(st.dbid)) + N'.' + Quotename(Object_schema_name(st.objectid, st.dbid)) + N'.' +
	Quotename(Object_name(st.objectid, st.dbid)), '') AS command_text,
	r.command,
	s.login_name,
	s.host_name,
	s.program_name,
	s.last_request_end_time,
	s.login_time,
	r.open_transaction_count
FROM sys.dm_exec_sessions AS s
JOIN sys.dm_exec_requests AS r
ON r.session_id = s.session_id
CROSS APPLY sys.Dm_exec_sql_text(r.sql_handle) AS st
WHERE r.session_id != @@SPID
ORDER BY r.cpu_time desc
</code></pre>
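<p>The SUBSTRING arithmetic in the query above is worth unpacking: statement_start_offset and statement_end_offset are byte offsets into the batch's nvarchar (UTF-16) text, so they are divided by 2 to get character positions, and an end offset of -1 means the statement runs to the end of the batch. A Python sketch of essentially the same calculation (illustrative only):</p>

```python
def extract_statement(batch_text, start_offset, end_offset):
    """Mimic the T-SQL SUBSTRING logic: the offsets reported by the DMVs
    are byte positions in UTF-16 text, so halve them to index characters;
    an end offset of -1 means 'to the end of the batch'."""
    start = start_offset // 2
    end = len(batch_text) if end_offset == -1 else end_offset // 2
    return batch_text[start:end]
```

<p>For example, with the batch "SELECT 1; SELECT 2;", offsets (20, -1) return the second statement.</p>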
<!--kg-card-end: markdown--><figure class="kg-card kg-image-card"><img src="https://blog.peterman.ca/content/images/2020/01/image-27.png" class="kg-image" alt="Identifying Problematic Queries"></figure><p>This will give you a list of the top 10 currently running heavy-hitting CPU queries. Usually this is enough to give me a good idea of where to start looking for more details. </p><p>The next query is one I use after I log in to a system where the symptoms of the problem have already subsided. This query is intensive, so running it while an issue is ongoing will only make the problem worse.</p><p>Query 2: Top 50 CPU Intensive Queries logged in the dynamic management view dm_exec_query_stats</p><!--kg-card-begin: markdown--><pre><code>SELECT TOP 50
	[Avg. MultiCore/CPU time(sec)] = qs.total_worker_time / 1000000 / qs.execution_count,
	[Total MultiCore/CPU time(sec)] = qs.total_worker_time / 1000000,
	[Avg. Elapsed Time(sec)] = qs.total_elapsed_time / 1000000 / qs.execution_count,
	[Total Elapsed Time(sec)] = qs.total_elapsed_time / 1000000,
	qs.execution_count,
	[Avg. I/O] = (total_logical_reads + total_logical_writes) / qs.execution_count,
	[Total I/O] = total_logical_reads + total_logical_writes,
	Query = SUBSTRING(qt.[text], (qs.statement_start_offset / 2) + 1,
		(
			(
				CASE qs.statement_end_offset
					WHEN -1 THEN DATALENGTH(qt.[text])
					ELSE qs.statement_end_offset
				END - qs.statement_start_offset
			) / 2
		) + 1
	),
	Batch = qt.[text],
	[DB] = DB_NAME(qt.[dbid]),
	qs.last_execution_time
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.[sql_handle]) AS qt
where qs.execution_count &gt; 5	--more than 5 occurences
ORDER BY [Total MultiCore/CPU time(sec)] DESC
</code></pre>
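<p>The time conversions above rely on dm_exec_query_stats reporting total_worker_time and total_elapsed_time in microseconds: dividing by 1,000,000 yields seconds, and dividing again by execution_count yields the per-execution average. A worked example of the arithmetic (note that the T-SQL version uses integer division, so sub-second averages display as 0):</p>

```python
# Worked example of the conversions used in the query above.
total_worker_time = 45_000_000  # microseconds, from dm_exec_query_stats
execution_count = 9

total_cpu_seconds = total_worker_time / 1_000_000      # seconds of CPU time
avg_cpu_seconds = total_cpu_seconds / execution_count  # seconds per execution
print(total_cpu_seconds, avg_cpu_seconds)
```
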
<!--kg-card-end: markdown--><figure class="kg-card kg-image-card"><img src="https://blog.peterman.ca/content/images/2020/01/image-28.png" class="kg-image" alt="Identifying Problematic Queries"></figure><p>The top 10 and top 50 queries can also be used to surface different resource issues by changing what they are ordered by. For example, to have the top 50 query order by the most I/O used, you would change the last line to:</p><!--kg-card-begin: markdown--><p><code>ORDER BY [Total I/O] DESC</code></p>
<!--kg-card-end: markdown--><h3 id="problem-3-high-ram-usage-or-very-low-page-life-expectancy">Problem 3: High RAM usage or very low Page Life Expectancy</h3><p>	I haven't run into this issue very many times. Usually, when the Page Life Expectancy is very low and items are continually being pushed out of memory, it is a resource issue that can be fixed by lowering the data footprint of larger tables or adding more resources to the server to accommodate data growth. There will be times, however, when you would like a better understanding of which tables are currently the biggest RAM hogs. The following queries can assist in troubleshooting:</p><p>Query 1: List Database Memory Usage</p><!--kg-card-begin: markdown--><pre><code>DECLARE @total_buffer INT;

SELECT @total_buffer = cntr_value
FROM sys.dm_os_performance_counters 
WHERE RTRIM([object_name]) LIKE '%Buffer Manager'
    AND counter_name = 'Database Pages';

;WITH src AS
(
    SELECT 
        database_id
        ,db_buffer_pages = COUNT_BIG(*)
    FROM sys.dm_os_buffer_descriptors
    --WHERE database_id BETWEEN 5 AND 32766
    GROUP BY database_id
)
SELECT
    [db_name] = CASE [database_id] WHEN 32767 
        THEN 'Resource DB' 
    ELSE DB_NAME([database_id]) END,
    db_buffer_pages,
    db_buffer_MB = db_buffer_pages / 128,
    db_buffer_percent = CONVERT(DECIMAL(6,3), 
    db_buffer_pages * 100.0 / @total_buffer)
FROM src
ORDER BY db_buffer_MB DESC;
</code></pre>
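<p>The division by 128 in the query above works because SQL Server data pages are 8 KB, so 128 pages make up 1 MB. A sketch of the conversion (illustrative only):</p>

```python
PAGE_SIZE_KB = 8  # SQL Server data pages are 8 KB

def buffer_usage(db_buffer_pages, total_buffer_pages):
    """Convert a buffer-pool page count to MB and percent of the pool,
    mirroring the arithmetic in the query above."""
    mb = db_buffer_pages * PAGE_SIZE_KB // 1024  # i.e. pages / 128
    percent = round(db_buffer_pages * 100.0 / total_buffer_pages, 3)
    return mb, percent
```

<p>So a database holding 131,072 pages of a 524,288-page buffer pool is using 1,024 MB, or 25 percent of the pool.</p>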
<!--kg-card-end: markdown--><figure class="kg-card kg-image-card"><img src="https://blog.peterman.ca/content/images/2020/01/image-29.png" class="kg-image" alt="Identifying Problematic Queries"></figure><p>Query 2: List Table Memory Usage when run against a specific Database</p><!--kg-card-begin: markdown--><pre><code>;WITH src AS
(
    SELECT
        [Object] = o.name,
        [Type] = o.type_desc,
        [Index] = COALESCE(i.name, ''),
        [Index_Type] = i.type_desc,
        p.[object_id],
        p.index_id,
        au.allocation_unit_id
    FROM
    sys.partitions AS p
    INNER JOIN
    sys.allocation_units AS au
    ON p.hobt_id = au.container_id
    INNER JOIN
    sys.objects AS o
    ON p.[object_id] = o.[object_id]
    INNER JOIN
    sys.indexes AS i
    ON o.[object_id] = i.[object_id]
    AND p.index_id = i.index_id
    WHERE
    au.[type] IN (1,2,3)
    AND o.is_ms_shipped = 0
)
SELECT
    src.[Object],
    src.[Type],
    src.[Index],
    src.Index_Type,
    buffer_pages = COUNT_BIG(b.page_id),
    buffer_mb = COUNT_BIG(b.page_id) / 128
FROM
src
INNER JOIN
sys.dm_os_buffer_descriptors AS b
ON src.allocation_unit_id = b.allocation_unit_id
WHERE
b.database_id = DB_ID()
GROUP BY
    src.[Object],
    src.[Type],
    src.[Index],
    src.Index_Type
ORDER BY
buffer_pages DESC;
</code></pre>
<!--kg-card-end: markdown--><figure class="kg-card kg-image-card"><img src="https://blog.peterman.ca/content/images/2020/01/image-30.png" class="kg-image" alt="Identifying Problematic Queries"></figure><p>Query 3: Active Queries with current Memory Grants</p><!--kg-card-begin: markdown--><pre><code>SELECT r.session_id 
        ,r.status 
    ,mg.granted_memory_kb 
    ,mg.requested_memory_kb 
    ,mg.ideal_memory_kb 
        ,mg.used_memory_kb 
        ,mg.max_used_memory_kb 
    ,mg.request_time 
    ,mg.grant_time 
    ,mg.query_cost 
    ,mg.dop 
    ,( 
        SELECT SUBSTRING(TEXT, statement_start_offset / 2 + 1, ( 
                    CASE 
                        WHEN statement_end_offset = - 1 
                            THEN LEN(CONVERT(NVARCHAR(MAX), TEXT)) * 2 
                        ELSE statement_end_offset 
                        END - statement_start_offset 
                    ) / 2) 
        FROM sys.dm_exec_sql_text(r.sql_handle) 
        ) AS query_text 
    ,qp.query_plan 
FROM sys.dm_exec_query_memory_grants AS mg 
INNER JOIN sys.dm_exec_requests r ON mg.session_id = r.session_id 
CROSS APPLY sys.dm_exec_query_plan(r.plan_handle) AS qp 
ORDER BY mg.required_memory_kb DESC;
</code></pre>
<!--kg-card-end: markdown--><figure class="kg-card kg-image-card"><img src="https://blog.peterman.ca/content/images/2020/01/image-31.png" class="kg-image" alt="Identifying Problematic Queries"></figure><p>	This should point you in the right direction on the tables/queries that are currently consuming large amounts of memory. Typically, when troubleshooting performance issues, there are not many times when I have to investigate RAM usage, but there have been a few.</p><h3 id="problem-4-there-is-an-active-session-to-the-database-that-is-taking-longer-than-expected-to-return-data">Problem 4: There is an active session to the database that is taking longer than expected to return data</h3><p>	Long-running queries are the most investigated issue, and there are a few queries that can quickly identify them as well as let you know what they are currently processing or waiting on.</p><p>Query 1: Current Long Running Queries</p><!--kg-card-begin: markdown--><pre><code>SELECT
   r.session_id
,   r.start_time
,   TotalElapsedTime_ms = r.total_elapsed_time
,   r.[status]
,   r.command
,   DatabaseName = DB_Name(r.database_id)
,   r.wait_type
,   r.last_wait_type
,   r.wait_resource
,   r.cpu_time
,   r.reads
,   r.writes
,   r.logical_reads
,   t.[text] AS [executing batch]
,   SUBSTRING(
            t.[text], r.statement_start_offset / 2, 
            (   CASE WHEN r.statement_end_offset = -1 THEN DATALENGTH (t.[text]) 
                   ELSE r.statement_end_offset 
               END - r.statement_start_offset ) / 2 
          ) AS [executing statement] 
,   p.query_plan
FROM
   sys.dm_exec_requests r
CROSS APPLY
   sys.dm_exec_sql_text(r.sql_handle) AS t
CROSS APPLY   
   sys.dm_exec_query_plan(r.plan_handle) AS p
ORDER BY 
   r.total_elapsed_time DESC;
</code></pre>
<!--kg-card-end: markdown--><figure class="kg-card kg-image-card"><img src="https://blog.peterman.ca/content/images/2020/01/image-32.png" class="kg-image" alt="Identifying Problematic Queries"></figure><p>Along with the Top 10 most CPU-intensive queries, this might be the troubleshooting query I run first most often. One advantage is that it also gives you a link to the query plan for quick troubleshooting. Note, however, that some running queries do not have a query plan in dm_exec_query_plan; when that is the case, they are excluded from this list. If you are not seeing the results that you expect, remove the CROSS APPLY for the query plan view and the query plan column from the SELECT.</p><p>Query 2: Top 100 Queries with the longest elapsed time run against a specific database</p><!--kg-card-begin: markdown--><pre><code>SELECT TOP 100
    qs.total_elapsed_time / qs.execution_count / 1000000.0 AS average_seconds,
    qs.total_elapsed_time / 1000000.0 AS total_seconds,
    qs.execution_count,
    SUBSTRING (qt.text,qs.statement_start_offset/2, 
         (CASE WHEN qs.statement_end_offset = -1 
            THEN LEN(CONVERT(NVARCHAR(MAX), qt.text)) * 2 
          ELSE qs.statement_end_offset END - qs.statement_start_offset)/2) AS individual_query,
    o.name AS object_name,
    DB_NAME(qt.dbid) AS database_name
  FROM sys.dm_exec_query_stats qs
    CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) as qt
    LEFT OUTER JOIN sys.objects o ON qt.objectid = o.object_id
where qt.dbid = DB_ID()
  ORDER BY average_seconds DESC;
</code></pre>
<!--kg-card-end: markdown--><figure class="kg-card kg-image-card"><img src="https://blog.peterman.ca/content/images/2020/01/image-33.png" class="kg-image" alt="Identifying Problematic Queries"></figure><p>This is a historical view of the top 100 longest-running queries, in case you log on to the server after the problem ceases to exist.</p><p>	With these first-responder scripts, you should be able to quickly identify the queries that are causing issues and then work on finding more information about what each query is attempting to do.</p>]]></content:encoded></item><item><title><![CDATA[Automating SQL Installations]]></title><description><![CDATA[<p>	Automating the installation of SQL Server is, I am sure, an ask of every DBA team. There are many different processes that companies select to automate server delivery, such as Ansible or Rundeck. Our team, like many others, was tasked with creating a process for automating the installation of SQL Server.</p>]]></description><link>https://blog.peterman.ca/automating-sql-installations/</link><guid isPermaLink="false">5e25f147a8675e0001f8cf83</guid><dc:creator><![CDATA[Nathan Peterman]]></dc:creator><pubDate>Sun, 17 Nov 2019 18:28:00 GMT</pubDate><media:content url="https://blog.peterman.ca/content/images/2020/01/Install2.PNG" medium="image"/><content:encoded><![CDATA[<img src="https://blog.peterman.ca/content/images/2020/01/Install2.PNG" alt="Automating SQL Installations"><p>	Automating the installation of SQL Server is, I am sure, an ask of every DBA team. There are many different processes that companies select to automate server delivery, such as Ansible or Rundeck. Our team, like many others, was tasked with creating a process for automating the installation of SQL Server. We chose to complete this using PowerShell, as this would allow us to use any Server Build Automation software to call the PowerShell script and maintain the standards that are already set up and tested. 
Before SQL can be installed, a Configuration.ini file needs to be generated that follows your company's standards. I will not be supplying an example of the Configuration.ini, but I will post examples of settings that were changed after the file was generated.</p><p>Microsoft's Help page for the Configuration.ini file:</p><p><a href="https://docs.microsoft.com/en-us/sql/database-engine/install-windows/install-sql-server-using-a-configuration-file?view=sql-server-ver15">https://docs.microsoft.com/en-us/sql/database-engine/install-windows/install-sql-server-using-a-configuration-file?view=sql-server-ver15</a></p><p>	To generate the Configuration.ini file, launch the SQL Server installation manually and follow the steps in the wizard until you get to the step that says "Ready to Install". Displayed on that page is the location of the Configuration.ini file, which you can copy and modify to work with the PowerShell installation script. </p><p>Here are some changes that we chose to make to the Configuration.ini, with an explanation of why each was done.</p><!--kg-card-begin: markdown--><pre><code>; Specifies that SQL Server Setup should not display the privacy statement when ran from the command line. 

SUPPRESSPRIVACYSTATEMENTNOTICE=&quot;True&quot;

; By specifying this parameter and accepting Microsoft R Open and Microsoft R Server terms, you acknowledge that you have read and understood the terms of use. 

IACCEPTROPENLICENSETERMS=&quot;True&quot;

; Accept SQL 2016 License Terms

IACCEPTSQLSERVERLICENSETERMS=&quot;True&quot;
</code></pre>
<!--kg-card-end: markdown--><p>These lines accept all of the license terms so that the installation can proceed.</p><!--kg-card-begin: markdown--><pre><code>; Setup will not display any user interface. 

QUIET=&quot;False&quot;

; Setup will display progress only, without any user interaction. 

QUIETSIMPLE=&quot;True&quot;
</code></pre>
<!--kg-card-end: markdown--><p>This will show you a progress window during installation, helpful if running the PowerShell script manually. If you would like it completely silent with no progress window, switch the true and false values.</p><!--kg-card-begin: markdown--><pre><code>; Specify whether SQL Server Setup should discover and include product updates. The valid values are True and False or 1 and 0. By default SQL Server Setup will include updates that are found. 

UpdateEnabled=&quot;True&quot;

; If this parameter is provided, then this computer will use Microsoft Update to check for updates. 

USEMICROSOFTUPDATE=&quot;False&quot;

; Specify the location where SQL Server Setup will obtain product updates. The valid values are &quot;MU&quot; to search Microsoft Update, a valid folder path, a relative path such as .\MyUpdates or a UNC share. By default SQL Server Setup will search Microsoft Update or a Windows Update service through the Window Server Update Services. 

UpdateSource=.\
</code></pre>
<!--kg-card-end: markdown--><p>I use these values to let the installer know to look in the installation directory for any updates and install them alongside the base install. </p><!--kg-card-begin: markdown--><pre><code>; Specify a default or named instance. MSSQLSERVER is the default instance for non-Express editions and SQLExpress for Express editions. This parameter is required when installing the SQL Server Database Engine (SQL), Analysis Services (AS), or Reporting Services (RS). 

INSTANCENAME=&quot;INSTANCE_NAME_TO_BE_REPLACED&quot;

; Specify the Instance ID for the SQL Server features you have specified. SQL Server directory structure, registry structure, and service names will incorporate the instance ID of the SQL Server instance. 

INSTANCEID=&quot;INSTANCE_NAME_TO_BE_REPLACED&quot;
</code></pre>
<!--kg-card-end: markdown--><p>The reason "INSTANCE_NAME_TO_BE_REPLACED" is used here is that the script that launches the install writes this value in before launching Setup.exe.</p><!--kg-card-begin: markdown--><pre><code>; The number of Database Engine TempDB files. 

SQLTEMPDBFILECOUNT=&quot;4&quot;
</code></pre>
<!--kg-card-end: markdown--><p>If this value is not supplied, SQL determines the optimal TempDB file count at installation time. I suggest you comment out or remove this line, or you will end up with a static number of TempDB files on every install.</p><p>	There are a few more changes that I made to the defaults, but they are more in line with the standards set out by my current organization. Review the rest of the file to determine if it makes sense for you.</p><p>Once you are satisfied with the Configuration.ini, we can switch our attention to the SQL Server installation. I will post an example of the PowerShell script that is used in our environment. You would need to customize it to work in your company, as it expects a certain filesystem structure to work properly. Right now the script is designed to be launched by a DBA, but it can easily be adapted to run fully automated.</p><p>The script checks to make sure that the data volumes are formatted with a 64k block size and will prompt you to format the volume if it is not already:</p><figure class="kg-card kg-image-card"><img src="https://blog.peterman.ca/content/images/2020/01/image-17.png" class="kg-image" alt="Automating SQL Installations"></figure><figure class="kg-card kg-image-card"><img src="https://blog.peterman.ca/content/images/2020/01/image-22.png" class="kg-image" alt="Automating SQL Installations"></figure><p>The service accounts and instance name can be provided as parameters when calling the script; if none are provided, the script will prompt you to enter them:</p><!--kg-card-begin: markdown--><pre><code>.\Install_SQL.ps1 -InstanceName &quot;SQL2016Instance&quot; -EngineServiceAccount &quot;svc.2K16eng&quot; -EngineServicePassword &quot;ENPass&quot; -AgentServiceAccount &quot;svc.2K16ag&quot; -AgentServicePassword &quot;AGPass&quot; -ISServiceAccount &quot;svc.2K16is&quot; -ISServicePassword &quot;ISPass&quot;

</code></pre>
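<!--kg-card-end: markdown--><p>Internally, the heart of such a launcher is only a few lines. The sketch below is illustrative rather than an excerpt of the real script: the paths are hypothetical, $SaPassword stands in for the SA password the script generates, and the Setup.exe switches shown (/ConfigurationFile, /SQLSVCPASSWORD, /AGTSVCPASSWORD, /SAPWD) are the documented command-line parameters for passing in the ini file and the passwords that should never be stored inside it:</p><!--kg-card-begin: markdown--><pre><code># Illustrative sketch only - paths are hypothetical
$template = 'D:\SQLInstall\Configuration.ini'     # template with placeholder
$config   = 'D:\SQLInstall\Configuration_run.ini' # working copy for this install

# Stamp the instance name into a copy of the template
(Get-Content $template) -replace 'INSTANCE_NAME_TO_BE_REPLACED', $InstanceName |
    Set-Content $config

# Launch setup; passwords go on the command line, not in the ini file
&amp; 'D:\SQLInstall\Setup.exe' &quot;/ConfigurationFile=$config&quot; `
    &quot;/SQLSVCPASSWORD=$EngineServicePassword&quot; `
    &quot;/AGTSVCPASSWORD=$AgentServicePassword&quot; `
    &quot;/SAPWD=$SaPassword&quot;
</code></pre>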
<!--kg-card-end: markdown--><figure class="kg-card kg-image-card"><img src="https://blog.peterman.ca/content/images/2020/01/image-19.png" class="kg-image" alt="Automating SQL Installations"></figure><p>The script will then check each account against AD to verify that the credentials provided are correct. After that, it will generate a SA account password and launch the installation. It will record the SA Password in the installation log file.</p><figure class="kg-card kg-image-card"><img src="https://blog.peterman.ca/content/images/2020/01/image-20.png" class="kg-image" alt="Automating SQL Installations"></figure><p>The script posted is designed to work with 2016 currently but can easily be modified to work with any version or multiple versions as needed.</p><p>Here is the PowerShell Script Installation Script:</p><figure class="kg-card kg-bookmark-card"><a class="kg-bookmark-container" href="https://gist.github.com/Desani/0253a144f21e9508780b5029e91098f1"><div class="kg-bookmark-content"><div class="kg-bookmark-title">SQL 2016 Installation PowerShell Script. For use with combination of customized Configuration.ini</div><div class="kg-bookmark-description">SQL 2016 Installation PowerShell Script. 
For use with combination of customized Configuration.ini - Install_SQL.ps1</div><div class="kg-bookmark-metadata"><img class="kg-bookmark-icon" src="https://github.githubassets.com/favicon.ico" alt="Automating SQL Installations"><span class="kg-bookmark-author">262588213843476</span><span class="kg-bookmark-publisher">Gist</span></div></div><div class="kg-bookmark-thumbnail"><img src="https://github.githubassets.com/images/modules/gists/gist-og-image.png" alt="Automating SQL Installations"></div></a></figure>]]></content:encoded></item><item><title><![CDATA[SQL Performance Troubleshooting Mini-Series - Performance Report]]></title><description><![CDATA[<p>	The goal of the Mini-Series was to explore the different performance counters that help quickly identify whether there is a potential issue with the underlying server infrastructure. Taking all of the performance counters mentioned in the previous posts, we are now ready to begin building out the SSRS report</p>]]></description><link>https://blog.peterman.ca/sql-performance-troubleshooting-mini-series-performance-report/</link><guid isPermaLink="false">5e233ba7a8675e0001f8cbca</guid><dc:creator><![CDATA[Nathan Peterman]]></dc:creator><pubDate>Sat, 12 Oct 2019 14:43:00 GMT</pubDate><media:content url="https://blog.peterman.ca/content/images/2020/01/PSSDIAG.PNG" medium="image"/><content:encoded><![CDATA[<img src="https://blog.peterman.ca/content/images/2020/01/PSSDIAG.PNG" alt="SQL Performance Troubleshooting Mini-Series - Performance Report"><p>	The goal of the Mini-Series was to explore the different performance counters that help quickly identify whether there is a potential issue with the underlying server infrastructure. Taking all of the performance counters mentioned in the previous posts, we are now ready to begin building out the SSRS report to tie everything together.</p><p>	To start, we need data to report against. 
All of the counters used are performance counters, and there are various methods to collect them. Performance Monitor can be used to capture and save the output into a BLG file.</p><figure class="kg-card kg-image-card"><img src="https://blog.peterman.ca/content/images/2020/01/image-13.png" class="kg-image" alt="SQL Performance Troubleshooting Mini-Series - Performance Report"></figure><p>You are able to use the Microsoft troubleshooting tool referred to as PSSDIAG. You can download it here: <br><a href="https://support.microsoft.com/en-us/help/830232/pssdiag-data-collection-utility">https://support.microsoft.com/en-us/help/830232/pssdiag-data-collection-utility</a></p><p>	I use a modified version of the PSSDIAG tool for data collection that is set up to only collect performance counters. I created a PowerShell wrapper that fills out the required information in PSSDIAG and launches it. I will include a copy of a sample ini file that I use for data collection at the end of this post.</p><figure class="kg-card kg-image-card"><img src="https://blog.peterman.ca/content/images/2020/01/image-15.png" class="kg-image" alt="SQL Performance Troubleshooting Mini-Series - Performance Report"></figure><p>	Now that you have the data collected, you need to process it and insert it into a SQL database for reporting purposes. If you have a BLG file created with Performance Monitor, you can convert it using a tool called relog.exe. Relog.exe is already included with Windows, so no installation is necessary. 
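You can try it quickly from a command prompt; this example is illustrative (the paths and the "PerfDB" DSN name are hypothetical):</p><!--kg-card-begin: markdown--><pre><code>:: Convert a perfmon binary log to CSV
relog C:\Perflogs\perfmon.blg -f CSV -o C:\Perflogs\perfmon.csv

:: Or load the counters straight into a SQL database via an ODBC DSN
relog C:\Perflogs\perfmon.blg -f SQL -o SQL:PerfDB!CounterLog
</code></pre><!--kg-card-end: markdown--><p>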
Here is the Microsoft commands article on relog.exe:</p><figure class="kg-card kg-bookmark-card"><a class="kg-bookmark-container" href="https://docs.microsoft.com/en-us/windows-server/administration/windows-commands/relog"><div class="kg-bookmark-content"><div class="kg-bookmark-title">relog</div><div class="kg-bookmark-description">Learn how to extract performance counter information from the performance counter log files.</div><div class="kg-bookmark-metadata"><span class="kg-bookmark-author">coreyp-at-msft</span><span class="kg-bookmark-publisher">Microsoft Docs</span></div></div><div class="kg-bookmark-thumbnail"><img src="https://docs.microsoft.com/en-us/media/logos/logo-ms-social.png" alt="SQL Performance Troubleshooting Mini-Series - Performance Report"></div></a></figure><p>	The recommended method of processing the data is a Microsoft troubleshooting tool called SQL Nexus. This method can only be used if PSSDIAG was used for data collection. Not only does it pull out perfmon data and process it, it also adds in other details from the SQL Server that I use in other parts of the report. </p><p>SQL Nexus can be downloaded here:</p><figure class="kg-card kg-bookmark-card"><a class="kg-bookmark-container" href="https://github.com/microsoft/SqlNexus"><div class="kg-bookmark-content"><div class="kg-bookmark-title">microsoft/SqlNexus</div><div class="kg-bookmark-description">SQL Nexus is a tool that helps you identify the root cause of SQL Server performance issues. It loads and analyzes performance data collected by SQLDiag and PSSDiag. 
It can dramatically reduce th...</div><div class="kg-bookmark-metadata"><img class="kg-bookmark-icon" src="https://github.githubassets.com/favicon.ico" alt="SQL Performance Troubleshooting Mini-Series - Performance Report"><span class="kg-bookmark-author">microsoft</span><span class="kg-bookmark-publisher">GitHub</span></div></div><div class="kg-bookmark-thumbnail"><img src="https://avatars2.githubusercontent.com/u/6154722?s=400&amp;v=4" alt="SQL Performance Troubleshooting Mini-Series - Performance Report"></div></a></figure><p>	When SQL Nexus processes the data, it leverages relog.exe to handle the perfmon counter data. When SQL Nexus processes the perfmon.blg file, it appears to only import 50% of the data points. I assume this is to save space and still give you an accurate picture of the readings from the environment. In some cases, this did not work out well for us, so after SQL Nexus finished processing the file, I truncated the tables and re-processed the data using relog.exe, ending up with no values skipped. SQL Nexus can be called through command-line arguments so that it processes the data and uploads it to a named database instead of the default "SQLNexus" database that is selected if you use the GUI. I wrote a PowerShell wrapper that processes the zipped PSSDIAG file with SQL Nexus, re-processes the data with relog.exe, and then launches an SSRS report against the data that was just processed.</p><p>	The SSRS report can be built using the information provided in the mini-series, but I will be working on packaging the report file in a downloadable Visual Studio project that will make deployment easier in your environment. 
I have added some additional information in the report that gets included in the database once it is processed with SQL Nexus.</p><p>Here are examples of some of the working files responsible for data collection and processing:</p><p>PSSDIAG.ps1 is responsible for launching PSSDIAG.cmd and inserting values into the XML that controls what gets logged by PSSDIAG.</p><figure class="kg-card kg-bookmark-card"><a class="kg-bookmark-container" href="https://gist.github.com/Desani/5bb0cbf5c8c5e6fb351ccd3c3aea9615"><div class="kg-bookmark-content"><div class="kg-bookmark-title">PSSDIAG CMD PowerShell wrapper file which auto completes some information pulled out of registry and launches PSSDIAG.cmd</div><div class="kg-bookmark-description">PSSDIAG CMD PowerShell wrapper file which auto completes some information pulled out of registry and launches PSSDIAG.cmd - PSSDIAG.ps1</div><div class="kg-bookmark-metadata"><img class="kg-bookmark-icon" src="https://github.githubassets.com/favicon.ico" alt="SQL Performance Troubleshooting Mini-Series - Performance Report"><span class="kg-bookmark-author">262588213843476</span><span class="kg-bookmark-publisher">Gist</span></div></div><div class="kg-bookmark-thumbnail"><img src="https://github.githubassets.com/images/modules/gists/gist-og-image.png" alt="SQL Performance Troubleshooting Mini-Series - Performance Report"></div></a></figure><p>PSSDIAG uses an XML file to determine what type of information gets logged by the tool. There is a tool called DiagManager that quickly helps generate these XML files. You are able to download the tool here:</p><figure class="kg-card kg-bookmark-card"><a class="kg-bookmark-container" href="https://github.com/microsoft/DiagManager"><div class="kg-bookmark-content"><div class="kg-bookmark-title">microsoft/DiagManager</div><div class="kg-bookmark-description">Pssdiag/Sqldiag Manager is a graphic interface that provides customization capabilities to collect data for SQL Server using sqldiag collector engine. 
The data collected can be used by SQL Nexus to...</div><div class="kg-bookmark-metadata"><img class="kg-bookmark-icon" src="https://github.githubassets.com/favicon.ico" alt="SQL Performance Troubleshooting Mini-Series - Performance Report"><span class="kg-bookmark-author">microsoft</span><span class="kg-bookmark-publisher">GitHub</span></div></div><div class="kg-bookmark-thumbnail"><img src="https://avatars2.githubusercontent.com/u/6154722?s=400&amp;v=4" alt="SQL Performance Troubleshooting Mini-Series - Performance Report"></div></a></figure><p>Here is an example pssdiag.xml file that we use:</p><figure class="kg-card kg-bookmark-card"><a class="kg-bookmark-container" href="https://gist.github.com/Desani/efe3b381927188dda990b345cee74a49"><div class="kg-bookmark-content"><div class="kg-bookmark-title">Sample XML file that was generated for SQL Server 2016. There are variables embedded in this file that are replaced by PSSDIAG.ps1 when run.</div><div class="kg-bookmark-description">Sample XML file that was generated for SQL Server 2016. There are variables embedded in this file that are replaced by PSSDIAG.ps1 when run. 
- pssdiag.xml</div><div class="kg-bookmark-metadata"><img class="kg-bookmark-icon" src="https://github.githubassets.com/favicon.ico" alt="SQL Performance Troubleshooting Mini-Series - Performance Report"><span class="kg-bookmark-author">262588213843476</span><span class="kg-bookmark-publisher">Gist</span></div></div><div class="kg-bookmark-thumbnail"><img src="https://github.githubassets.com/images/modules/gists/gist-og-image.png" alt="SQL Performance Troubleshooting Mini-Series - Performance Report"></div></a></figure><p>Finally, here is the PowerShell that is used to generate reports using the zip file created by the PSSDIAG.ps1 PowerShell.</p><p>Here is the PowerShell that unpacks the zip file, launches SQL Nexus to process the data, re-processes the data using relog.exe, and then connects to SSRS to generate a report file in Word document format.</p><figure class="kg-card kg-bookmark-card"><a class="kg-bookmark-container" href="https://gist.github.com/Desani/ba61b8b230072cff9af13d0a00735bf2"><div class="kg-bookmark-content"><div class="kg-bookmark-title">PowerShell used to process the PSSDIAG data with SQL Nexus and then launches an SSRS Report file.</div><div class="kg-bookmark-description">PowerShell used to process the PSSDIAG data with SQL Nexus and then launches an SSRS Report file. - GenerateReport.ps1</div><div class="kg-bookmark-metadata"><img class="kg-bookmark-icon" src="https://github.githubassets.com/favicon.ico" alt="SQL Performance Troubleshooting Mini-Series - Performance Report"><span class="kg-bookmark-author">262588213843476</span><span class="kg-bookmark-publisher">Gist</span></div></div><div class="kg-bookmark-thumbnail"><img src="https://github.githubassets.com/images/modules/gists/gist-og-image.png" alt="SQL Performance Troubleshooting Mini-Series - Performance Report"></div></a></figure><p>Once the report file has been packaged and uploaded you should be able to modify it and make it relevant for on-demand performance analysis. 
It should be able to quickly pin-point areas that need further investigation.</p>]]></content:encoded></item><item><title><![CDATA[SQL Performance Troubleshooting Mini-Series - SQL Performance]]></title><description><![CDATA[<p>	There are lots of different measurements when it comes to SQL Performance but for the purposes of the SQL Performance report being built out, I will be concentrating on 3 perfmon counters:</p><ol><li>Batch Requests/Sec</li><li>Page Life Expectancy</li><li>Lazy Writer</li></ol><figure class="kg-card kg-image-card"><img src="https://blog.peterman.ca/content/images/2020/01/image-12.png" class="kg-image"></figure><p>	These counters, when used with the other checks covered by</p>]]></description><link>https://blog.peterman.ca/sql-performance-troubleshooting-mini-series-sql-performance/</link><guid isPermaLink="false">5e247a00a8675e0001f8cc4a</guid><dc:creator><![CDATA[Nathan Peterman]]></dc:creator><pubDate>Sun, 08 Sep 2019 06:09:00 GMT</pubDate><media:content url="https://blog.peterman.ca/content/images/2020/01/Batch.PNG" medium="image"/><content:encoded><![CDATA[<img src="https://blog.peterman.ca/content/images/2020/01/Batch.PNG" alt="SQL Performance Troubleshooting Mini-Series - SQL Performance"><p>	There are lots of different measurements when it comes to SQL Performance but for the purposes of the SQL Performance report being built out, I will be concentrating on 3 perfmon counters:</p><ol><li>Batch Requests/Sec</li><li>Page Life Expectancy</li><li>Lazy Writer</li></ol><figure class="kg-card kg-image-card"><img src="https://blog.peterman.ca/content/images/2020/01/image-12.png" class="kg-image" alt="SQL Performance Troubleshooting Mini-Series - SQL Performance"></figure><p>	These counters, when used with the other checks covered by the previous blog posts, will be able to let us know if there are any issues with the underlying infrastructure that need to be looked into further. 
If everything looks good, then the next step is to start checking out query performance and optimization.</p><p>	Batch Requests/Sec is the number of Transact-SQL command batches received per second. This statistic is affected by all constraints (such as I/O, number of users, cache size, complexity of requests, and so on). High batch requests mean good throughput. Batch requests can be used as a good baseline to reflect server workload. (500 = Busy, 1000 = Watch CPU Capacity)</p><figure class="kg-card kg-image-card"><img src="https://blog.peterman.ca/content/images/2020/01/image-9.png" class="kg-image" alt="SQL Performance Troubleshooting Mini-Series - SQL Performance"></figure><p>	There is no measurement with this counter that will automatically flag an issue, but it is used to quickly determine how busy the SQL Server is and how that might affect the other areas of performance (CPU, memory and disk). The higher this number, the greater the demands will be for all resources on the server. </p><p>	Page Life Expectancy is the number of seconds a page will stay in the buffer pool without references. If this value gets below 300 seconds, it is an indication that SQL Server is doing too many logical reads, putting pressure on the buffer pool, or potentially that your SQL Server could use more memory to boost performance. Anything below 300 is a critical level. Some would argue that 300 is a dated figure and that the threshold should now be closer to 1000, given the greater availability of RAM on modern servers. Any number below 1000 should be investigated as a potential memory pressure issue.</p><figure class="kg-card kg-image-card"><img src="https://blog.peterman.ca/content/images/2020/01/image-10.png" class="kg-image" alt="SQL Performance Troubleshooting Mini-Series - SQL Performance"></figure><p> Here is the query used to create the Page Life Expectancy graph with threshold line:</p><!--kg-card-begin: markdown--><pre><code>SELECT det.CounterName
    ,d.CounterValue
    ,SUBSTRING(d.CounterDateTime,1,19) AS CounterDateTime
FROM dbo.CounterData d 
    INNER JOIN dbo.CounterDetails det on d.CounterID = det.CounterID
WHERE det.countername = 'Page life expectancy'
    AND det.ObjectName LIKE '%:Buffer Node'
    AND det.InstanceName = '000'

UNION ALL

SELECT 'Threshold'
    ,500
    ,SUBSTRING(d.CounterDateTime,1,19) AS CounterDateTime
FROM dbo.CounterData d 
    INNER JOIN dbo.CounterDetails det on d.CounterID = det.CounterID
WHERE det.countername = 'Page life expectancy'
    AND det.ObjectName LIKE '%:Buffer Node'
ORDER BY det.CounterName, CounterDateTime
</code></pre>
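<!--kg-card-end: markdown--><p>If you are on the server and want a live reading instead of waiting on collected perfmon data, the same counter is exposed through a DMV. This is a quick illustrative check, not part of the report:</p><!--kg-card-begin: markdown--><pre><code>-- Live Page Life Expectancy per NUMA node
SELECT instance_name AS buffer_node
    ,cntr_value AS ple_seconds
FROM sys.dm_os_performance_counters
WHERE counter_name = 'Page life expectancy'
    AND [object_name] LIKE '%Buffer Node%';
</code></pre>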
<!--kg-card-end: markdown--><p>	The Lazy Writer counter tracks how many times per second the Lazy Writer process moves dirty pages from the buffer to disk to free up buffer space. A value of less than 20 per second is generally considered acceptable; ideally, it should be close to zero. </p><figure class="kg-card kg-image-card"><img src="https://blog.peterman.ca/content/images/2020/01/image-11.png" class="kg-image" alt="SQL Performance Troubleshooting Mini-Series - SQL Performance"></figure><p>	If your Page Life Expectancy is low and you can see that the Lazy Writer is being activated to write pages to disk, you can tell that the SQL Server is having memory pressure issues and is forced to make excessive writes to disk. This in turn creates more IO on the disks, and you should be able to see higher usage than typical. For SQL Server to work optimally, the longer pages can be kept in memory, the less work the server has to put on the drives to move data into memory and to write data out when it gets forced out.</p><p>Here is the query used to generate the SSRS gauge that is shown above:</p><!--kg-card-begin: markdown--><pre><code>SELECT det.CounterName
    ,d.CounterValue
    ,SUBSTRING(d.CounterDateTime,1,19) AS CounterDateTime
FROM dbo.CounterData d INNER JOIN dbo.CounterDetails det on d.CounterID = det.CounterID
WHERE countername = 'Lazy writes/sec'
ORDER BY det.CounterName, CounterDateTime
</code></pre>
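<!--kg-card-end: markdown--><p>One caveat if you want to spot-check this counter live through sys.dm_os_performance_counters rather than perfmon: per-second counters in that DMV are cumulative totals since startup, so you have to sample twice and take the difference. An illustrative sketch:</p><!--kg-card-begin: markdown--><pre><code>-- Per-second counters in the DMV are cumulative, so sample twice
DECLARE @first bigint;

SELECT @first = cntr_value
FROM sys.dm_os_performance_counters
WHERE counter_name = 'Lazy writes/sec';

WAITFOR DELAY '00:00:05';

SELECT (cntr_value - @first) / 5.0 AS LazyWritesPerSec
FROM sys.dm_os_performance_counters
WHERE counter_name = 'Lazy writes/sec';
</code></pre>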
<!--kg-card-end: markdown--><p>The gauge is set up to check whether the maximum value returned is beyond a threshold.</p><p>The SQL Performance Check on the main page is a combination of the Page Life Expectancy and Lazy Writer checks. Here is the query used to generate the results for the main page gauge:</p><!--kg-card-begin: markdown--><pre><code>DECLARE @result1 varchar(2); 
DECLARE @result2 varchar(2); 

SET @result1 = (SELECT TOP 1
		CASE 
			WHEN d.CounterValue &lt;= 200
				THEN 0
			WHEN d.CounterValue &lt; 450
				THEN 1
			ELSE 2
		END as 'CounterValue'
	from dbo.CounterData d inner join dbo.CounterDetails det on d.CounterID = det.CounterID
	where countername = 'Lazy writes/sec'
	ORDER BY 'CounterValue' desc)
SET @result2 = (SELECT TOP 1
		CASE 
			WHEN d.CounterValue &gt; 1500
				THEN 0
			WHEN d.CounterValue &gt; 500
				THEN 1
			ELSE 2
		END as 'CounterValue'
	from dbo.CounterData d inner join dbo.CounterDetails det on d.CounterID = det.CounterID
	where countername = 'Page life expectancy'
	and det.ObjectName LIKE '%:Buffer Node'
	and det.InstanceName = '000'
	ORDER BY 'CounterValue' desc) 
Select
	CASE
		WHEN @result1 &gt;= @result2
			THEN @result1
		ELSE @result2
END AS Result
</code></pre>
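<!--kg-card-end: markdown--><p>As a side note, the closing CASE comparison can also be written with a table value constructor, which gives the same worst-case result and extends naturally if more checks are added later. A sketch:</p><!--kg-card-begin: markdown--><pre><code>-- Equivalent to the closing CASE: take the worst (highest) result
-- Works because the results are single-digit strings ('0', '1', '2')
SELECT MAX(v) AS Result
FROM (VALUES (@result1), (@result2)) AS checks(v);
</code></pre>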
<!--kg-card-end: markdown--><p>A result of 2 is a failed check, 1 is a warning and 0 is a passed check.</p><p>This is the last of the infrastructure checks before we merge it all together to generate a performance report. The next blog entry will cover the basics of building out an SSRS report and the methods that can be used for data collection.</p>]]></content:encoded></item><item><title><![CDATA[SQL Performance Troubleshooting Mini-Series - Disk Performance]]></title><description><![CDATA[<p>	There are quite a few counters that we are going to look at when it comes to determining if there are issues reading or writing to disks hosting the database files. In this post I will cover the following perfmon counters:</p><ol><li>Disk Latency - Reading (Avg Disk Sec/Read ms)</li></ol>]]></description><link>https://blog.peterman.ca/sql-performance-troubleshooting-mini-series-disk-performance/</link><guid isPermaLink="false">5e233b6ba8675e0001f8cbbe</guid><dc:creator><![CDATA[Nathan Peterman]]></dc:creator><pubDate>Fri, 09 Aug 2019 20:11:00 GMT</pubDate><media:content url="https://blog.peterman.ca/content/images/2020/01/Disk2.PNG" medium="image"/><content:encoded><![CDATA[<img src="https://blog.peterman.ca/content/images/2020/01/Disk2.PNG" alt="SQL Performance Troubleshooting Mini-Series - Disk Performance"><p>	There are quite a few counters that we are going to look at when it comes to determining if there are issues reading or writing to disks hosting the database files. 
In this post I will cover the following perfmon counters:</p><ol><li>Disk Latency - Reading (Avg Disk Sec/Read ms)</li><li>Disk Latency - Writing (Avg Disk Sec/Write ms)</li><li>Disk Throughput (Disk MegaBytes/sec)</li><li>Disk IOPS - Reading (Disk Reads/Sec)</li><li>Disk IOPS - Writing (Disk Writes/Sec)</li><li>Disk Queue Length (Current Disk Queue Length)</li></ol><figure class="kg-card kg-image-card"><img src="https://blog.peterman.ca/content/images/2020/01/Disk1.PNG" class="kg-image" alt="SQL Performance Troubleshooting Mini-Series - Disk Performance"></figure><p>	Disk Latency is the time it takes for the disk to respond to a read or write command. This counter measures the time in milliseconds of the average disk read or write. A high value for this counter is cause for concern even if the queue counters do not indicate an issue. Generally, the value needs to be less than 15 ms or 0.015 on this chart. In the following graphs, the threshold is set to 25 ms for our environment but I would argue that it should be no more than 15 ms. Here is the query that is used to generate the Disk Latency graphs:</p><!--kg-card-begin: markdown--><pre><code>SELECT 
    det.CounterName,
    d.CounterValue,
    SUBSTRING(d.CounterDateTime,1,19) AS CounterDateTime,
    det.ObjectName,
    det.InstanceName
FROM dbo.CounterData d 
    INNER JOIN dbo.CounterDetails det ON d.CounterID = det.CounterID
WHERE countername = 'Avg. Disk sec/Read'
    AND det.ObjectName = 'LogicalDisk'
    AND det.InstanceName LIKE '_:'
    AND det.InstanceName NOT LIKE 'C:'

UNION ALL

SELECT 'Threshold',
    0.015,
    SUBSTRING(d.CounterDateTime,1,19) AS CounterDateTime,
    'Threshold',
    'Threshold'
FROM dbo.CounterData d 
    INNER JOIN dbo.CounterDetails det ON d.CounterID = det.CounterID
WHERE countername = 'Avg. Disk sec/Transfer'
    AND det.ObjectName = 'LogicalDisk'
    AND det.InstanceName = 'C:'
ORDER BY det.CounterName,det.instancename, CounterDateTime 
</code></pre>
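<!--kg-card-end: markdown--><p>The collected perfmon numbers can also be cross-checked live: SQL Server keeps cumulative I/O stall totals per database file in a DMV. An illustrative sketch (not one of the report queries) of average read latency per file since instance startup:</p><!--kg-card-begin: markdown--><pre><code>-- Average read latency (ms) per database file, cumulative since startup
SELECT DB_NAME(vfs.database_id) AS database_name
    ,mf.physical_name
    ,vfs.io_stall_read_ms / NULLIF(vfs.num_of_reads, 0) AS avg_read_ms
FROM sys.dm_io_virtual_file_stats(NULL, NULL) vfs
    INNER JOIN sys.master_files mf
        ON vfs.database_id = mf.database_id AND vfs.file_id = mf.file_id
ORDER BY avg_read_ms DESC;
</code></pre>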
<!--kg-card-end: markdown--><p>In the above query you can exclude any volumes that you do not consider to be database volumes. We can see the output in the following SSRS graph:</p><p>I also create a table with the overall average read and write response time for each disk over the duration of the data collection, with the columns changing colour based on whether they exceed the threshold. Table example:</p><figure class="kg-card kg-image-card"><img src="https://blog.peterman.ca/content/images/2020/01/image-2.png" class="kg-image" alt="SQL Performance Troubleshooting Mini-Series - Disk Performance"></figure><figure class="kg-card kg-image-card"><img src="https://blog.peterman.ca/content/images/2020/01/image-4.png" class="kg-image" alt="SQL Performance Troubleshooting Mini-Series - Disk Performance"></figure><p>Here is the query that was used when generating the table:</p><!--kg-card-begin: markdown--><pre><code>;with cte as (
SELECT CounterValue
		,det.CounterName
		,det.ObjectName
		,det.InstanceName
		,PERCENT_RANK() OVER (PARTITION BY det.InstanceName ORDER BY CounterValue ) AS PctRank  
	FROM dbo.CounterData d inner join dbo.CounterDetails det on d.CounterID = det.CounterID
	WHERE countername = 'Avg. Disk sec/Read'
	AND det.ObjectName = 'LogicalDisk'
	AND det.InstanceName LIKE '_:'
	AND det.InstanceName NOT LIKE 'C:'
)
SELECT CONVERT(DECIMAL(16,3)
    ,AVG(CounterValue)) AS CounterValue
	,CounterName
	,ObjectName
	,InstanceName
	FROM cte 
	WHERE PctRank &lt;= '.9'
	GROUP BY CounterName,ObjectName,InstanceName
</code></pre>
<!--kg-card-end: markdown--><p>	When getting the average disk read and write times there will be erroneous values captured while a read or write is being initiated. We utilize PERCENT_RANK in this query and average the lowest 90% of the response times in order to eliminate some of the response time spikes that occur during normal operation. If a disk has a consistent top response time above 0.015, it is flagged as an issue that needs to be addressed. Modern storage systems in servers should be able to keep this value at 0.005 (5 ms) or lower; anything consistently higher than 0.015 indicates a busy storage system.</p><p>	Disk throughput is a measurement of the average number of megabytes transferred within a period of one second for a specific file size. Each storage medium has an upper limit for how much data it can handle, so a constant max value might indicate that the throughput upper limit threshold has been reached. I do not flag any thresholds on this graph because each storage system will be different. It will be up to the DBA to determine if the numbers displayed on these graphs are problematic.</p><figure class="kg-card kg-image-card"><img src="https://blog.peterman.ca/content/images/2020/01/image-5.png" class="kg-image" alt="SQL Performance Troubleshooting Mini-Series - Disk Performance"></figure><p>This graph is also accompanied by a table to better interpret the data, as seen here:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://blog.peterman.ca/content/images/2020/01/image-6.png" class="kg-image" alt="SQL Performance Troubleshooting Mini-Series - Disk Performance"><figcaption>The disk that houses the Temporary databases (T:) has the highest volume</figcaption></figure><p>Here are the queries that are used to generate both the table and the graph above:</p><!--kg-card-begin: markdown--><pre><code>-- GRAPH:
SELECT det.CounterName
	,d.CounterValue / 1024.0 / 1024.0 AS 'Disk MegaBytes/Sec'
	,SUBSTRING(d.CounterDateTime,1,19) AS CounterDateTime
    ,det.ObjectName
	,det.InstanceName
FROM dbo.CounterData d INNER JOIN dbo.CounterDetails det on d.CounterID = det.CounterID
WHERE countername = 'Disk Bytes/sec'
    AND det.ObjectName = 'LogicalDisk'
    AND det.InstanceName LIKE '_:'
    AND det.InstanceName NOT LIKE 'C:'
</code></pre>
<!--kg-card-end: markdown--><!--kg-card-begin: markdown--><pre><code>-- TABLE:
SELECT det.CounterName
	,SUM(d.CounterValue / 1024.0 / 1024.0 * 10) / (SELECT DATEDIFF(MINUTE,SUBSTRING(MIN(CounterDateTime),1,19),SUBSTRING(MAX(CounterDateTime),1,19)) FROM dbo.CounterData) AS AvgMBReadPerMin
	,det.InstanceName
FROM dbo.CounterData d INNER JOIN dbo.CounterDetails det on d.CounterID = det.CounterID
WHERE countername = 'Disk Read Bytes/sec'
	AND det.ObjectName = 'LogicalDisk'
	AND det.InstanceName LIKE '_:'
	AND det.InstanceName NOT LIKE 'C:'
GROUP BY det.CounterName,det.InstanceName
</code></pre>
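<!--kg-card-end: markdown--><p>To look at write volume specifically, the same table query can be pointed at the write counter instead. This is just a sketch of that variation, assuming 'Disk Write Bytes/sec' was included in the perfmon collection:</p><!--kg-card-begin: markdown--><pre><code>-- Same shape as the table query above, but for writes
SELECT det.CounterName
	,SUM(d.CounterValue / 1024.0 / 1024.0 * 10) / (SELECT DATEDIFF(MINUTE,SUBSTRING(MIN(CounterDateTime),1,19),SUBSTRING(MAX(CounterDateTime),1,19)) FROM dbo.CounterData) AS AvgMBWritePerMin
	,det.InstanceName
FROM dbo.CounterData d INNER JOIN dbo.CounterDetails det ON d.CounterID = det.CounterID
WHERE countername = 'Disk Write Bytes/sec'
	AND det.ObjectName = 'LogicalDisk'
	AND det.InstanceName LIKE '_:'
	AND det.InstanceName NOT LIKE 'C:'
GROUP BY det.CounterName,det.InstanceName
</code></pre>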
<!--kg-card-end: markdown--><p>If it is unexpected, this information can also be used to see if there might be excessive writing to the log or database files.</p><p>	The Disk IOPS counter captures the total number of individual disk IO requests completed over a period of one second. This is an indication of how busy the disks are on the virtual host. Each storage medium has an upper limit for how many IOPS it can handle, so a constant max value might indicate that the IOPS upper limit threshold has been reached. </p><p>	To give an example of IOPS per storage medium, a traditional spinning HDD will have an IOPS range of 50 to 250. Modern day SSD drives have an IOPS range of 5,000 to 50,000. High end storage appliances can easily break the 100K IOPS mark. Lower IOPS numbers in the perfmon data do not indicate a problem by themselves. I would only use these numbers in conjunction with other values that indicate that the server is not able to read/write data in a timely manner.</p><p>Here is an example of IOPS graphs within an SSRS report and a query example used to pull out the data:</p><figure class="kg-card kg-image-card"><img src="https://blog.peterman.ca/content/images/2020/01/image-7.png" class="kg-image" alt="SQL Performance Troubleshooting Mini-Series - Disk Performance"></figure><!--kg-card-begin: markdown--><pre><code>SELECT det.CounterName
    ,d.CounterValue
    ,SUBSTRING(d.CounterDateTime,1,19) AS CounterDateTime
    ,det.ObjectName
    ,det.InstanceName
FROM dbo.CounterData d INNER JOIN dbo.CounterDetails det on d.CounterID = det.CounterID
WHERE countername = 'Disk Reads/sec'
    AND det.ObjectName = 'LogicalDisk'
    AND det.InstanceName LIKE '_:'
    AND det.InstanceName NOT LIKE 'C:'
</code></pre>
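<!--kg-card-end: markdown--><p>The query above pulls read IOPS only ('Disk Reads/sec'); a matching query with 'Disk Writes/sec' covers writes. If the collection includes it, the LogicalDisk counter 'Disk Transfers/sec' gives the combined read-and-write IOPS in a single series. A sketch of that combined version:</p><!--kg-card-begin: markdown--><pre><code>SELECT det.CounterName
    ,d.CounterValue
    ,SUBSTRING(d.CounterDateTime,1,19) AS CounterDateTime
    ,det.ObjectName
    ,det.InstanceName
FROM dbo.CounterData d INNER JOIN dbo.CounterDetails det ON d.CounterID = det.CounterID
WHERE countername = 'Disk Transfers/sec' -- reads + writes combined
    AND det.ObjectName = 'LogicalDisk'
    AND det.InstanceName LIKE '_:'
    AND det.InstanceName NOT LIKE 'C:'
</code></pre>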
<!--kg-card-end: markdown--><p>	Current Disk Queue Length is a direct measurement of the disk queue present at the time of the sampling, that is, the number of outstanding IO requests waiting to be serviced by the disk. Any value above 2 for an extended period is an indication of the disk not being able to keep up with the workload required.</p><figure class="kg-card kg-image-card"><img src="https://blog.peterman.ca/content/images/2020/01/image-8.png" class="kg-image" alt="SQL Performance Troubleshooting Mini-Series - Disk Performance"></figure><p>	Spikes are typical when looking at this graph, but values that are consistently above the threshold might be indicative of storage not being able to keep up with throughput demands. </p><p>Here is the query used to generate the Disk Queue Length:</p><!--kg-card-begin: markdown--><pre><code>SELECT det.CounterName
	,d.CounterValue
	,SUBSTRING(d.CounterDateTime,1,19) AS CounterDateTime
    ,det.ObjectName
	,det.InstanceName
FROM dbo.CounterData d 
    INNER JOIN dbo.CounterDetails det on d.CounterID = det.CounterID
WHERE countername = 'Current Disk Queue Length'
    AND det.ObjectName = 'LogicalDisk'
    AND det.InstanceName LIKE '_:'
    AND det.InstanceName NOT LIKE 'C:'


UNION ALL

SELECT 'Threshold'
    ,2
    ,SUBSTRING(d.CounterDateTime,1,19) AS CounterDateTime
    ,'Threshold'
    ,'Threshold'
FROM dbo.CounterData d 
    INNER JOIN dbo.CounterDetails det on d.CounterID = det.CounterID
WHERE countername = 'Avg. Disk sec/Transfer'
    AND det.ObjectName = 'LogicalDisk'
    AND det.InstanceName = 'C:'
ORDER BY CounterName, InstanceName, CounterDateTime
</code></pre>
<!--kg-card-end: markdown--><p>	To produce the Disk Performance Check on the main page of the report, I used a SQL query that checks both read and write latency and reports whether those values are above the 0.015 threshold. Here is the query attached to the SSRS Gauge:</p><!--kg-card-begin: markdown--><pre><code>DECLARE @DriveIO decimal(18,5);
DECLARE @DriveIOWrite decimal(18,5);
DECLARE @result1 varchar(2); 
DECLARE @result2 varchar(2);
DECLARE @PerfResult varchar(2);

;with cte as (
                SELECT CounterValue
                        ,det.CounterName
                        ,det.ObjectName
                        ,det.InstanceName
                        ,PERCENT_RANK() OVER (PARTITION BY det.InstanceName ORDER BY CounterValue ) AS PctRank  
                    FROM dbo.CounterData d inner join dbo.CounterDetails det on d.CounterID = det.CounterID
                    where countername = 'Avg. Disk sec/Read'
                    and det.ObjectName = 'LogicalDisk'
                    and det.InstanceName LIKE '_:'
                    and det.InstanceName NOT LIKE 'C:'
                    and det.InstanceName NOT LIKE 'D:'
                    and det.InstanceName NOT LIKE 'F:'
                )
                SELECT @DriveIO = AVG(CounterValue) 
                FROM (Select TOP 10 CounterValue
                    FROM cte 
                    WHERE PctRank &lt;= 0.9
                    ORDER BY CounterValue DESC) AS Result 

SET @result1 = (SELECT 
	CASE
			WHEN @DriveIO &gt;= 0.015
				THEN 2
			WHEN @DriveIO &gt;= 0.0075
				THEN 1
			ELSE 0
			END AS 'Result' )

;with cte as (
                SELECT CounterValue
                        ,det.CounterName
                        ,det.ObjectName
                        ,det.InstanceName
                        ,PERCENT_RANK() OVER (PARTITION BY det.InstanceName ORDER BY CounterValue ) AS PctRank  
                    FROM dbo.CounterData d inner join dbo.CounterDetails det on d.CounterID = det.CounterID
                    where countername = 'Avg. Disk sec/Write'
                    and det.ObjectName = 'LogicalDisk'
                    and det.InstanceName LIKE '_:'
                    and det.InstanceName NOT LIKE 'C:'
                    and det.InstanceName NOT LIKE 'D:'
                    and det.InstanceName NOT LIKE 'F:'
                )
                SELECT @DriveIOWrite = AVG(CounterValue) 
                FROM (Select TOP 10 CounterValue
                    FROM cte 
                    WHERE PctRank &lt;= 0.9
                    ORDER BY CounterValue DESC) AS Result 

SET @result2 = (SELECT 
	CASE
			WHEN @DriveIOWrite &gt;= 0.015
				THEN 2
			WHEN @DriveIOWrite &gt;= 0.0075
				THEN 1
			ELSE 0
			END AS 'Result' )

SET @PerfResult = (Select
	CASE
		WHEN @result1 &gt;= @result2
			THEN @result1
		ELSE @result2
	END as Result)

SELECT @PerfResult
</code></pre>
<!--kg-card-end: markdown--><p>A result of 2 indicates that there is an issue, 1 indicates a warning, and 0 means the drive latency is under the acceptable limit.</p><p>When used in conjunction, this collection of performance counters should accurately tell you if there is an issue with disk performance, provided data collection is set up to run while the issue is present on the server.</p>]]></content:encoded></item><item><title><![CDATA[SQL Performance Troubleshooting Mini-Series - Memory Utilization]]></title><description><![CDATA[<p>	For the Memory Utilization section we will only be working with one counter. There are more counters that will let you know if there are memory pressure issues but they will be covered later in the SQL Performance section. This section will focus on whether there are any issues present with</p>]]></description><link>https://blog.peterman.ca/sql-performance-troubleshooting-mini-series-memory/</link><guid isPermaLink="false">5e233b88a8675e0001f8cbc6</guid><dc:creator><![CDATA[Nathan Peterman]]></dc:creator><pubDate>Mon, 15 Jul 2019 14:20:00 GMT</pubDate><media:content url="https://blog.peterman.ca/content/images/2020/01/Memory1-2.PNG" medium="image"/><content:encoded><![CDATA[<img src="https://blog.peterman.ca/content/images/2020/01/Memory1-2.PNG" alt="SQL Performance Troubleshooting Mini-Series - Memory Utilization"><p>	For the Memory Utilization section we will only be working with one counter. There are more counters that will let you know if there are memory pressure issues but they will be covered later in the SQL Performance section. 
This section will focus on whether there are any memory contention issues present between the OS and SQL Server.</p><figure class="kg-card kg-image-card"><img src="https://blog.peterman.ca/content/images/2020/01/Memory1.PNG" class="kg-image" alt="SQL Performance Troubleshooting Mini-Series - Memory Utilization"></figure><p>	The counter, Available MBytes, shows the amount of physical memory, in megabytes, available to processes running on the computer. Monitor this counter to ensure that the server maintains at least 20 percent of the total physical RAM available. Consider other applications that may be running on the physical host as well. Available Memory should never be less than 1000 MB as per the threshold on this graph.</p><p>To have the above graph displayed we use the following query against the perfmon data:</p><!--kg-card-begin: markdown--><pre><code>SELECT 
    det.CounterName,
    d.CounterValue,
    SUBSTRING(d.CounterDateTime,1,19) AS CounterDateTime
FROM dbo.CounterData d 
    INNER JOIN dbo.CounterDetails det ON d.CounterID = det.CounterID
WHERE countername = 'Available MBytes'

UNION ALL

SELECT 
    'Threshold',
    1000,
    SUBSTRING(d.CounterDateTime,1,19) AS CounterDateTime
FROM dbo.CounterData d 
    INNER JOIN dbo.CounterDetails det on d.CounterID = det.CounterID
WHERE countername = 'Available MBytes'
ORDER BY CounterName, CounterDateTime
</code></pre>
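<!--kg-card-end: markdown--><p>The overview-page memory check is described in this post only in prose, so here is a hedged sketch of how it could be written against the same perfmon tables: take the minimum Available MBytes sample, fail if it is under 1 GB, and warn if it is between 1 GB and 2 GB. The variable names are my own; the thresholds come from the description in this post:</p><!--kg-card-begin: markdown--><pre><code>DECLARE @MinAvailMB decimal(18,5);
DECLARE @MemResult varchar(2);

SELECT @MinAvailMB = MIN(d.CounterValue)
FROM dbo.CounterData d
    INNER JOIN dbo.CounterDetails det ON d.CounterID = det.CounterID
WHERE countername = 'Available MBytes'

SET @MemResult = (SELECT
	CASE
			WHEN @MinAvailMB &lt; 1024
				THEN 2 -- fail: under 1 GB free
			WHEN @MinAvailMB &lt; 2048
				THEN 1 -- warning: between 1 GB and 2 GB free
			ELSE 0 -- pass
			END AS 'Result' )

SELECT @MemResult
</code></pre>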
<!--kg-card-end: markdown--><p>	There isn't too much else to check: is there available memory or not? The memory check on the overview page pulls the minimum value for available memory, checks to make sure it is not below 1 GB, and gives a warning if there is between 1 GB and 2 GB free. These thresholds can be adjusted to values that make better sense for your environment. I do check to make sure there are no memory contention issues within SQL Server itself, but that will be covered in the SQL Performance topic.</p>]]></content:encoded></item><item><title><![CDATA[SQL Performance Troubleshooting Mini-Series - CPU Performance]]></title><description><![CDATA[<p>	For each of the posts in the mini-series, we are going to take a selected topic and focus on the information that gives us a general idea of whether or not there are performance issues that should be looked into further. These checks might not give all of the necessary</p>]]></description><link>https://blog.peterman.ca/sql-performance-troubleshooting-mini-series-cpu-performance/</link><guid isPermaLink="false">5e233b7aa8675e0001f8cbc2</guid><dc:creator><![CDATA[Nathan Peterman]]></dc:creator><pubDate>Thu, 20 Jun 2019 14:47:00 GMT</pubDate><media:content url="https://blog.peterman.ca/content/images/2020/01/CPU.PNG" medium="image"/><content:encoded><![CDATA[<img src="https://blog.peterman.ca/content/images/2020/01/CPU.PNG" alt="SQL Performance Troubleshooting Mini-Series - CPU Performance"><p>	For each of the posts in the mini-series, we are going to take a selected topic and focus on the information that gives us a general idea of whether or not there are performance issues that should be looked into further. 
These checks might not give all of the necessary information to troubleshoot, but they should outline whether there is an issue that needs investigation.</p><p>For CPU performance we will focus on the following Perfmon counters:</p><ol><li>CPU Utilization for individual Cores (% Processor Time *.*)</li><li>CPU Utilization Combined (% Processor Time)</li><li>CPU Privileged Time (% Privileged Time)</li><li>CPU Context Switches (Context Switches/sec)</li></ol><figure class="kg-card kg-image-card"><img src="https://blog.peterman.ca/content/images/2020/01/CPU1.PNG" class="kg-image" alt="SQL Performance Troubleshooting Mini-Series - CPU Performance"></figure><p>	There is a lot of logic that goes into creating the top-level check: it takes each individual check and combines them to give an overall result that is quick and easy to read. All of the heavy lifting for the logic is done within SQL and SSRS is used to display the results.</p><p>CPU Utilization for Individual Cores:</p><figure class="kg-card kg-image-card"><img src="https://blog.peterman.ca/content/images/2020/01/CPU3.PNG" class="kg-image" alt="SQL Performance Troubleshooting Mini-Series - CPU Performance"></figure><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://blog.peterman.ca/content/images/2020/01/CPU2.PNG" class="kg-image" alt="SQL Performance Troubleshooting Mini-Series - CPU Performance"><figcaption>The information is displayed with a threshold line placed at 85%</figcaption></figure><p>	This counter shows the percentage of time that the processor is executing application or operating system processes other than Idle. On the computer that is running SQL Server, this counter should be kept below 85 percent. 
In case of constant overloading, investigate whether there is abnormal process activity or if the server needs additional CPUs.</p><p>There are a few things I check with this counter's data:</p><ol><li>CPU Average Utilization, to make sure the CPU is not consistently used above 85% utilization.</li><li>CPU Max Utilization, which, like average utilization, checks to make sure that the max utilization data points are not above the threshold line.</li></ol><p>To generate the data as displayed above, use the following query against PSSDIAG data that was processed with SQL Nexus or Relog.exe and inserted into a SQL database.</p><!--kg-card-begin: markdown--><pre><code>SELECT 
    det.CounterName,
    d.CounterValue,
    SUBSTRING(d.CounterDateTime,1,19) AS CounterDateTime,
    det.InstanceName
FROM dbo.CounterData d 
    INNER JOIN dbo.CounterDetails det ON d.CounterID = det.CounterID
WHERE countername = '% Processor Time'
    AND det.InstanceName LIKE ('_,_')

UNION ALL

SELECT 
    'Threshold',
    85,
    SUBSTRING(d.CounterDateTime,1,19) AS CounterDateTime,
    'Threshold'
FROM dbo.CounterData d 
    INNER JOIN dbo.CounterDetails det ON d.CounterID = det.CounterID
WHERE countername = '% Processor Time'
    AND det.InstanceName = ('0,0')
ORDER BY CounterName, CounterDateTime
</code></pre>
<!--kg-card-end: markdown--><p>	When looking at this graph, you want to see an even utilization across all cores. If you see a few cores being heavily utilized and the others not having the same utilization, there might be an issue with MAX DOP or the number of cores assigned to queries.</p><p>Here is an example of the logic assigned to the CPU Max utilization check:</p><!--kg-card-begin: markdown--><pre><code>SELECT TOP 1 (CounterValue) MinValue
FROM (SELECT TOP 20 PERCENT d.CounterValue
		FROM dbo.CounterData d inner join dbo.CounterDetails det on d.CounterID = det.CounterID
			WHERE countername = '% Processor Time'
			    AND det.InstanceName LIKE ('_,_')
			ORDER BY CounterValue DESC) result
ORDER BY MinValue ASC
</code></pre>
<!--kg-card-end: markdown--><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://blog.peterman.ca/content/images/2020/01/CPU4.PNG" class="kg-image" alt="SQL Performance Troubleshooting Mini-Series - CPU Performance"><figcaption>SSRS Indicator Value and States for the above query</figcaption></figure><p>	I take the minimum value of the top 20 percent of values for the % of processor time, and if that minimum is above 90 percent utilized, the check fails and is displayed as a visual.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://blog.peterman.ca/content/images/2020/01/CPU5.PNG" class="kg-image" alt="SQL Performance Troubleshooting Mini-Series - CPU Performance"><figcaption>Max CPU Check that has failed</figcaption></figure><p>	The next two graphs give an overview that makes it easier to see the overall CPU utilization, as it is a combined view, and CPU Privileged time, which is a view of the amount of CPU usage assigned to processes external to SQL Server.</p><figure class="kg-card kg-image-card"><img src="https://blog.peterman.ca/content/images/2020/01/CPU6.PNG" class="kg-image" alt="SQL Performance Troubleshooting Mini-Series - CPU Performance"></figure><p>	In this example, CPU Utilization overall is higher than the 85% that we would like to see it under. The combined view makes it easy to see overall utilization. The CPU Privileged time demonstrates that the utilization as seen in the previous graph is indeed being used by the SQL Engine service. 
We are not having issues with another application on the server utilizing too much CPU.</p><p>Our final check that we display is CPU Context Switches:</p><figure class="kg-card kg-image-card"><img src="https://blog.peterman.ca/content/images/2020/01/CPU7.PNG" class="kg-image" alt="SQL Performance Troubleshooting Mini-Series - CPU Performance"></figure><p>	Context Switches/sec is the combined rate at which all processors on the computer are switched from one thread to another. Context switches occur when a running thread voluntarily relinquishes the processor, is pre-empted by a higher priority ready thread, or switches between user-mode and privileged (kernel) mode to use an Executive or subsystem service. If context switches are greater than 5,000 per second per core (the combined total is indicated by the threshold line), it may indicate excessive switching. Here is the logic that is used to display the data in the above example graph:</p><!--kg-card-begin: markdown--><pre><code>SELECT 
    det.CounterName,
    d.CounterValue,
    SUBSTRING(d.CounterDateTime,1,19) AS CounterDateTime
FROM dbo.CounterData d 
    INNER JOIN dbo.CounterDetails det ON d.CounterID = det.CounterID
WHERE countername = 'Context Switches/sec'

UNION ALL

SELECT 
    'Threshold',
    (CAST((SELECT TOP 1 [PropertyValue]
            FROM [dbo].[tbl_ServerProperties]
            WHERE PropertyName = 'cpu_count') AS INT) * 5000), 
    SUBSTRING(d.CounterDateTime,1,19) AS CounterDateTime
FROM dbo.CounterData d inner join dbo.CounterDetails det on d.CounterID = det.CounterID
WHERE countername = 'Context Switches/sec'
ORDER BY CounterName, CounterDateTime
</code></pre>
<!--kg-card-end: markdown--><p>The logic to determine the threshold line is built into the query to place it at 5000 times the number of CPUs assigned.</p><p>To get the check on the top-level page for an overall view of CPU Utilization we combine the logic for all CPU related checks into one query. For this query, we chose to leave out the Context Switch check as it was not relevant for our needs, but it can easily be incorporated into the overall check. Here is the query responsible for the CPU Performance check at the beginning of this post:</p><!--kg-card-begin: markdown--><pre><code>DECLARE @CPUAVG decimal(18,5);
DECLARE @CPUMaxAvg decimal(18,5);
DECLARE @result1 varchar(2); 
DECLARE @result2 varchar(2);
DECLARE @PerfResult varchar(2);

SET @CPUAVG = (select AVG(d.CounterValue)
                from dbo.CounterData d inner join dbo.CounterDetails det on d.CounterID = det.CounterID
                where countername = '% Processor Time'
                and det.InstanceName LIKE ('_,_'))

SET @result1 = (SELECT 
	CASE
			WHEN @CPUAVG &gt;= 80
				THEN 2
			WHEN @CPUAVG &gt;= 70
				THEN 1
			ELSE 0
			END AS 'Result' )


SET @CPUMaxAvg = (SELECT TOP 1 (CounterValue) MinValue
                    FROM (SELECT TOP 20 PERCENT d.CounterValue
                            FROM dbo.CounterData d inner join dbo.CounterDetails det on d.CounterID = det.CounterID
                                where countername = '% Processor Time'
                                and det.InstanceName LIKE ('_,_')
                                order by CounterValue DESC) result
                    ORDER BY MinValue ASC)

SET @result2 = (SELECT 
	CASE
			WHEN @CPUMaxAvg &gt;= 90
				THEN 2
			WHEN @CPUMaxAvg &gt;= 75
				THEN 1
			ELSE 0
			END AS 'Result' )

SET @PerfResult = (Select
	CASE
		WHEN @result1 &gt;= @result2
			THEN @result1
		ELSE @result2
	END as Result)

SELECT @PerfResult
</code></pre>
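<!--kg-card-end: markdown--><p>The Context Switch check was left out of the overall result here, but if you wanted to incorporate it, a third result could be computed with the same pattern and folded in. This is only a hedged sketch that extends the query above (it assumes the preceding declarations are in scope, and reuses the 5,000-per-core threshold and the dbo.tbl_ServerProperties cpu_count lookup from the threshold query earlier in this post):</p><!--kg-card-begin: markdown--><pre><code>DECLARE @AvgSwitches decimal(18,5);
DECLARE @result3 varchar(2);

SET @AvgSwitches = (select AVG(d.CounterValue)
                from dbo.CounterData d inner join dbo.CounterDetails det on d.CounterID = det.CounterID
                where countername = 'Context Switches/sec')

SET @result3 = (SELECT
	CASE
			WHEN @AvgSwitches &gt;= (CAST((SELECT TOP 1 [PropertyValue]
					FROM [dbo].[tbl_ServerProperties]
					WHERE PropertyName = 'cpu_count') AS INT) * 5000)
				THEN 2
			ELSE 0
			END AS 'Result' )

-- Fold into the overall result by taking the worst of the three
-- (@PerfResult comes from the preceding query):
SELECT CASE
		WHEN @result3 &gt;= @PerfResult
			THEN @result3
		ELSE @PerfResult
	END AS PerfResult
</code></pre>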
<!--kg-card-end: markdown--><p>If the returned value is equal to 2, the check fails; if it equals 1, it presents a warning; and if it returns 0, the check passes.</p><p>	With this information we are able to quickly determine if there are utilization issues with the CPU and whether we need to investigate a resource issue, or if we can take a look at optimizing the queries being sent to the server and maybe offload some of the processing to another server if optimization is not possible.</p>]]></content:encoded></item><item><title><![CDATA[SQL Performance Troubleshooting Mini-Series - Introduction]]></title><description><![CDATA[One of the tasks of the database team is to continually improve processes and automate what can be automated. In this mini-series we will be going over the four areas and talk about the counters and checks that will be used to generate a performance report]]></description><link>https://blog.peterman.ca/sql-performance-troubleshooting-mini-series-introduction/</link><guid isPermaLink="false">5e233778a8675e0001f8cbba</guid><dc:creator><![CDATA[Nathan Peterman]]></dc:creator><pubDate>Thu, 06 Jun 2019 15:27:00 GMT</pubDate><media:content url="https://blog.peterman.ca/content/images/2020/01/Capture1.PNG" medium="image"/><content:encoded><![CDATA[<img src="https://blog.peterman.ca/content/images/2020/01/Capture1.PNG" alt="SQL Performance Troubleshooting Mini-Series - Introduction"><p>	One of the tasks of the database team is to continually improve processes and automate what can be automated. My manager recently approached me with a request to see if there is some way we would be able to generate a report that would assist a DBA in problem investigations when it comes to performance issues.</p><p>	Our DBA responsibilities at this organization might differ from the role of a DBA at another organization. We are strictly in charge of the database infrastructure. 
We make sure the database is online and connections are able to occur, and that there is nothing, infrastructure-wise, preventing the database from running optimally. This means there are quantifiable numbers that we are able to check to make sure that the server falls within best practices; these do not include things like index performance or poorly written queries. If we can automate the quantifiable checks, that will free up time to delve deeper into performance issues faster and possibly point out issues with infrastructure immediately.</p><p>	In this mini-series we will be going over the four areas and talk about the counters and checks that will be used to generate a performance report:</p><figure class="kg-card kg-image-card"><img src="https://blog.peterman.ca/content/images/2020/01/Capture.PNG" class="kg-image" alt="SQL Performance Troubleshooting Mini-Series - Introduction"></figure><p>	At the end of the mini-series, I will be sharing some code that is used to generate a performance report in SQL Server Reporting Services that may be able to assist in some performance issue troubleshooting.</p><p>	To generate the performance report with logic, we use the data collection tool PSSDIAG on a SQL Instance configured to only collect perfmon data. There will be a blog post on generating a performance report which will cover what is collected and how it is collected. All counters in the mini-series posts will refer to a perfmon counter that is available on a SQL Server and included in the data collection.</p>]]></content:encoded></item></channel></rss>