After reading a blog post from Tom LaRock (b|t|g+) from about this time last year, and then a post from Brad Schulz (b) that expounded on it with outliers, I found that both posts do a great job of explaining how to get information on long-running jobs that have completed.

However, I was tasked with alerting on long-running jobs that are currently running. Because the sysjobhistory table is not updated until a job step completes, the approaches above would not fully work for my purpose. They are a great start, and a necessary insight into finding long-running jobs, but they only look at the past. I needed to look at the present.

There can be many reasons why a SQL Server Agent job hangs. Some of the issues resulting from a hung job are outlined here:

  • Resources can be consumed for the duration
  • Object blocking/locking due to open transactions
  • No notifications
  • The list goes on….

On to the solution –

To get the currently running jobs on a server, I executed the sp_help_job system stored procedure in the msdb database:
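A minimal sketch of that call looks like this:

```sql
-- Return only the Agent jobs that are not idle or suspended
EXEC msdb.dbo.sp_help_job @execution_status = 0;
```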

The @execution_status = 0 parameter brings back all jobs that are not idle or suspended. Refer to the sp_help_job documentation to see the other status values.

Now that I know which jobs are executing (if any), how do I know what their typical execution time is, or how long they have currently been running? Enter the sysjobactivity and sysjobhistory system tables in msdb. The sysjobactivity table is populated with all currently defined Agent jobs each time the Agent service starts; this table feeds the Job Activity Monitor. The sysjobhistory table contains information on executed jobs and, as stated previously, is only updated after a job step completes, which makes it perfect for finding the average duration of successful job executions. With this in mind, you can join these tables to get all the information you need, both current and historical.

In my case, the sysjobactivity table had two records for each Agent job on one of my servers. This can happen when the SQL Server Agent service terminates while jobs are running. To grab the correct rows, note that sysjobactivity has a session_id column corresponding to the same column in the syssessions table, which gets a new session_id and agent_start_date each time the SQL Server Agent service restarts. I joined sysjobactivity to syssessions, using MAX(session_id) to ensure I got the current activity record.
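A sketch of that join, using only the documented msdb columns, might look like:

```sql
-- The latest Agent session wins; its activity rows are the live ones.
SELECT ja.job_id,
       ja.start_execution_date
FROM msdb.dbo.sysjobactivity AS ja
INNER JOIN msdb.dbo.syssessions AS s
        ON s.session_id = ja.session_id
WHERE s.session_id = (SELECT MAX(session_id) FROM msdb.dbo.syssessions)
  AND ja.start_execution_date IS NOT NULL
  AND ja.stop_execution_date IS NULL;  -- still running: started but not stopped
```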

Next, I wanted to utilize a “central management” server to run this script. Sure, you could monitor each server separately, but that just adds more jobs for me to watch and report on in case of failure. Using linked servers and OPENROWSET, I can cycle through each of my SQL Servers, joining a result set from sp_help_job, sysjobactivity, and sysjobhistory. The only question was how to get the server\instance names I wanted to cycle through. At first I declared a table variable, @Servers, and inserted hard-coded values for each server I wanted to spin through. That method, however, is not exactly scalable, so instead I pulled the server\instance names from the sys.servers system view in the master database.
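A sketch of that population step might look like the following; the is_linked filter is an assumption, so adjust it to match which servers you actually want to monitor:

```sql
DECLARE @Servers TABLE
(
    Server_ID  INT IDENTITY(1,1) PRIMARY KEY,
    ServerName SYSNAME
);

-- Pull linked server names from master; server_id 0 is the local server.
INSERT INTO @Servers (ServerName)
SELECT name
FROM master.sys.servers
WHERE is_linked = 1;
```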

So, now to put it all together, the script looks something like this:

  • Declare all the variables you will need.

Notice the execution of sp_configure to turn on the advanced option “Ad Hoc Distributed Queries.” Security policy in our shop dictates that this option stay turned off, so this block enables the option just for the duration of the job; it is turned off again at the conclusion.
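The enabling block is the standard sp_configure sequence:

```sql
-- 'Ad Hoc Distributed Queries' is an advanced option, so expose it first.
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'Ad Hoc Distributed Queries', 1;
RECONFIGURE;
```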

  • Populate the @Servers table variable, and get the first Server_ID for iteration. Use a while loop to spin through all the servers in the @Servers table variable, setting the @ServerName variable for use within the server context.
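The iteration can be sketched like this, assuming the @Servers table variable described above is already populated:

```sql
DECLARE @Server_ID  INT,
        @ServerName SYSNAME;

SELECT @Server_ID = MIN(Server_ID) FROM @Servers;

WHILE @Server_ID IS NOT NULL
BEGIN
    SELECT @ServerName = ServerName
    FROM @Servers
    WHERE Server_ID = @Server_ID;

    -- per-server work (OPENROWSET call, comparisons) happens here

    -- The WHERE clause sees the old value before the assignment overwrites it;
    -- MIN() returns NULL when no higher ID exists, ending the loop.
    SELECT @Server_ID = MIN(Server_ID)
    FROM @Servers
    WHERE Server_ID > @Server_ID;
END;
```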

  • With the server context set, create the connection string for the OPENROWSET call, and create the SQL script that will grab all the data you will need to make your comparisons.
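A hypothetical skeleton of that wrapper is below; @ServerName comes from the server loop, and the provider name and connection options are assumptions for your environment. The real remote query is the longer aggregate the post describes; here a simple sp_help_job call stands in for it. Note the doubled single quotes inside the literal:

```sql
DECLARE @SQL NVARCHAR(MAX);

-- Build the OPENROWSET call as dynamic SQL so the server name can vary.
SET @SQL = N'SELECT * FROM OPENROWSET(''SQLNCLI'',
    ''Server=' + @ServerName + N';Trusted_Connection=yes;'',
    ''EXEC msdb.dbo.sp_help_job @execution_status = 0'')';

EXEC (@SQL);
```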

The script uses the run_duration column of the sysjobhistory table to calculate an average execution time of successful job runs, and adds one standard deviation to that average. Other posts talked about adding more than one STDEV, but in the case of a backup of a VLDB, I have seen the job take 10 to 13 hours; adding two standard deviations would allow for too much variance. The calculations have to be done remotely because 1) the historical time is an aggregate, and 2) the current time could be in a different time zone than my management server. Also, I am grabbing the job name from the sp_help_job system stored procedure to limit the number of joins in the query. In the WHERE clause, I limit the AVG to successful executions and to times for the entire job. Finally, the results go into the @RunningJobTimes table variable so the jobs on each server can be compared one at a time.
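The remote query might be sketched as follows. Two caveats: run_duration is stored as an HHMMSS-encoded integer, so it has to be converted to seconds before averaging, and for a self-contained sketch I join sysjobs for the job name where the original pulls it from sp_help_job:

```sql
SELECT j.name AS JobName,
       DATEDIFF(SECOND, ja.start_execution_date, GETDATE()) AS CurrentRunSec,
       hist.AvgPlusStdevSec
FROM msdb.dbo.sysjobactivity AS ja
INNER JOIN msdb.dbo.sysjobs AS j
        ON j.job_id = ja.job_id
INNER JOIN
(
    -- Decode HHMMSS run_duration into seconds, then aggregate per job.
    SELECT jh.job_id,
           AVG(  (jh.run_duration / 10000) * 3600
               + (jh.run_duration / 100 % 100) * 60
               +  jh.run_duration % 100)
         + STDEV((jh.run_duration / 10000) * 3600
               + (jh.run_duration / 100 % 100) * 60
               +  jh.run_duration % 100) AS AvgPlusStdevSec
    FROM msdb.dbo.sysjobhistory AS jh
    WHERE jh.step_id = 0        -- whole-job rows only
      AND jh.run_status = 1     -- successful runs only
    GROUP BY jh.job_id
) AS hist
        ON hist.job_id = ja.job_id
WHERE ja.session_id = (SELECT MAX(session_id) FROM msdb.dbo.syssessions)
  AND ja.start_execution_date IS NOT NULL
  AND ja.stop_execution_date IS NULL;
```

Because this runs on the remote server, GETDATE() is evaluated in that server's time zone, which is exactly why the calculation has to happen remotely.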

  • If there were rows inserted into the @RunningJobTimes table variable, this means that there are jobs running, and we want to compare the current execution time(s) vs. the historical time calculated in the script above.

Here, we use the @@ROWCOUNT built-in system function to check whether any rows were inserted into the @RunningJobTimes table variable. If yes, run the comparison; if not, skip it. The comparison is simple: if the current execution time is greater than the calculated historical time, build an email and send it using the sp_send_dbmail stored procedure. To make the alert easily readable on a mobile device, the email is short and sweet. The subject line leads with the server name, so the DBA can begin triage without even opening the email. The body includes the job name, the time it has been executing, and the amount of overrun.
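The alert itself is a single sp_send_dbmail call; the profile name, recipient, and example values below are all placeholders:

```sql
EXEC msdb.dbo.sp_send_dbmail
     @profile_name = N'DBAMail',                 -- placeholder mail profile
     @recipients   = N'dba@example.com',         -- placeholder address
     @subject      = N'PRODSQL01: long-running job',
     @body         = N'Job: NightlyBackup  Elapsed: 11:42:10  Overrun: 00:37:55';
```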

  • Clear out the @RunningJobTimes table variable, and move on to the next server. After the last server has been iterated, turn off the “Ad Hoc Distributed Queries” advanced option.
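The teardown can be sketched in two parts, assuming the table variable and option names from the text:

```sql
-- Inside the loop: reset the comparison table for the next server.
DELETE FROM @RunningJobTimes;

-- After the last server: put the security posture back the way we found it.
EXEC sp_configure 'Ad Hoc Distributed Queries', 0;
RECONFIGURE;
EXEC sp_configure 'show advanced options', 0;
RECONFIGURE;
```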

This script can easily be turned into a stored procedure and an Agent job. All you have to do is remove the advanced options blocks, put CREATE PROCEDURE usp_&lt;procedure_name&gt; AS BEGIN at the top, and add a matching END at the bottom in place of the final GO. If your shop is like mine, you can instead put the advanced options blocks into the Agent job as steps one and three, with the stored procedure execution as step two.
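A hypothetical wrapper, with a placeholder procedure name:

```sql
CREATE PROCEDURE usp_AlertLongRunningJobs
AS
BEGIN
    SET NOCOUNT ON;
    -- body of the monitoring script goes here: variable declarations,
    -- the server loop, comparisons, and alerts, minus the sp_configure
    -- enable/disable blocks and the trailing GO
END;
```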