Posts Tagged ‘Troubleshooting’

I have been running Service Manager for quite some time now: from SCSM 2010 through SCSM SP1 and most of the CUs, then SCSM 2012 Beta and RC, and finally RTM (with the help of a Microsoft TAP program). Service Manager is now quite entrenched in our business and has been running well for a very long time. However, a little while ago we began to experience performance issues with the console, the logging of calls, and the application of workflows.

Time for some investigation.

There was a general slowdown across the board, and the business was not happy about it. So let the troubleshooting begin. Since there is a SQL back end, I decided that this was a good place to start.

I ran sp_who2 on my SQL Server and saw a lot of “blocking” happening. Some “blocking” for a very short while is expected while the ServiceManager database is updated. However, I was seeing “blocking” SPIDs consuming processor time for several minutes, and this is NOT normal at all.
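When there are many sessions, it helps to find the SPID at the head of each blocking chain rather than reading the sp_who2 output row by row. The sketch below is only an illustration: it assumes you have already extracted (SPID, BlkBy) pairs from the sp_who2 output, and the sample rows are made up.

```python
from collections import defaultdict

def find_head_blockers(rows):
    """Return {head_spid: [spids blocked directly or indirectly by it]}.

    rows: iterable of (spid, blocked_by) pairs. blocked_by is None when
    the session is not blocked (sp_who2 shows a dot in the BlkBy column).
    """
    blocked_by = {spid: blk for spid, blk in rows}
    victims = defaultdict(list)
    for spid, blk in blocked_by.items():
        if blk is None:
            continue
        # Walk up the chain until we reach a session that is not blocked.
        # The 'seen' set stops the walk if the chain loops back on itself.
        head, seen = blk, {spid}
        while blocked_by.get(head) is not None and head not in seen:
            seen.add(head)
            head = blocked_by[head]
        victims[head].append(spid)
    return {head: sorted(v) for head, v in victims.items()}

# Hypothetical sample: 51 blocks 52, which blocks 53; 54 blocks 55.
sample = [(51, None), (52, 51), (53, 52), (54, None), (55, 54)]
print(find_head_blockers(sample))
```

A head blocker that stays at the top of the list for minutes, as described above, is the session worth investigating first.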

As a result, my workflows were not kicking off properly. Other symptoms included slow pickup of e-mailed incidents from the SMTP “Drop Folder” and extremely slow response when assigning incidents to new people or tiers, to name but a few. In general, Service Manager was SLOW!

After much troubleshooting and some help from Microsoft, the cause was determined to be a corrupt workflow. What had actually happened is that the MP (management pack) containing my workflow had become corrupted.

So I stopped the “Health Service” and disabled all the workflows. I then gradually re-enabled the workflows, starting the “Health Service” again and testing after each change. This way I was able to determine that any workflow or template within one particular MP was causing the SQL blocking. As a result, I re-created the workflows and templates from scratch, and now I have no issues at all.
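The re-enable-and-test routine is really just a loop, and it can be sketched as below. This is a rough outline only: the `enable`, `disable` and `is_healthy` callbacks are hypothetical stand-ins for whatever you do by hand (enabling the workflows, restarting the service, checking for blocking), not Service Manager APIs.

```python
def isolate_offenders(groups, enable, disable, is_healthy):
    """Re-enable groups one at a time and return those that break things.

    groups:     names of the workflow groups (for example, one per MP)
    enable:     hypothetical callback that turns a group on
    disable:    hypothetical callback that turns a group off again
    is_healthy: hypothetical callback, True when no long blocking is seen
    """
    offenders = []
    for name in groups:
        enable(name)
        if not is_healthy():
            # This group brought the problem back: record it and switch
            # it off so it does not mask the result of the next test.
            offenders.append(name)
            disable(name)
    return offenders

# Simulated run with made-up names, where "MP-B" is the corrupt one.
enabled = set()
bad = {"MP-B"}
result = isolate_offenders(
    ["MP-A", "MP-B", "MP-C"],
    enabled.add,
    enabled.discard,
    lambda: not (enabled & bad),
)
print(result)
```

Testing one group at a time is slow, but it leaves no doubt about which group is at fault.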

Slow and painful, but well worth it now.

Follow me.


So, just last night, we had an issue with our cluster.

One of our disks would not come online. We were seeing error event ID 1304.

So we began to investigate. All the TechNet articles state that this error means the drive is corrupted and the disk signatures need to be changed; however, we had made no major changes in the last little while. We were also able to view the CSV volume from our SAN management tools. So we were confused: the disk just would not come online.

1. The quorum was online, or else the cluster would not have come online.

2. All the nodes were online.

3. All networks were online.

4. The cluster virtual name and IP were up.

However, the CSV alone would not come online. So we began to dig a little deeper and discovered that the error message, while a little misleading, did contain some useful information. From the error message we were able to glean the problem node (the CSV owner). We then hopped onto the problem node and, after a little digging around, discovered that the volume had been set to OFFLINE. We set the volume to ONLINE and then tried to bring the resource ONLINE in Failover Cluster Manager, and hey PRESTO! The volume came online and so did my VMs.

I have since created a PowerShell script, which can be found here, to help with the troubleshooting process. All you need to do is provide the list of cluster(s) to be checked. It does the following:

1. It checks whether each node is in an “UP” state. If the node is up, it then checks the status of the Cluster Service and reports whether it is up or down.

2. It then checks the CSV state, reports the current owner of each CSV, and lets you know whether the CSV is online or reserved. Either of these is good.

3. Should it get any other state, it runs a list disk from diskpart on the reported CSV owners to help you find which CSV and which owner to start working on.
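The decision logic in the three steps above can be sketched as follows. This is not the PowerShell script itself, just an illustration of the checks in Python; the `Node` and `Csv` types, the `triage` function and the sample data are all made up for the example, and gathering the real state from the cluster is left out.

```python
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    state: str                      # e.g. "Up", "Down", "Paused"
    cluster_service_running: bool

@dataclass
class Csv:
    name: str
    owner: str                      # node that currently owns the CSV
    state: str                      # e.g. "Online", "Reserved", "Offline"

GOOD_CSV_STATES = {"Online", "Reserved"}

def triage(nodes, csvs):
    """Return (report lines, CSV owners on which to run 'list disk')."""
    report, owners_to_inspect = [], []
    for n in nodes:
        if n.state == "Up":
            # Step 1: only check the Cluster Service on nodes that are up.
            svc = "running" if n.cluster_service_running else "NOT running"
            report.append(f"{n.name}: Up, Cluster Service {svc}")
        else:
            report.append(f"{n.name}: {n.state}")
    for c in csvs:
        # Step 2: report the owner and state of every CSV.
        report.append(f"{c.name}: owner {c.owner}, state {c.state}")
        # Step 3: any other state means the owner needs a closer look.
        if c.state not in GOOD_CSV_STATES and c.owner not in owners_to_inspect:
            owners_to_inspect.append(c.owner)
    return report, owners_to_inspect

# Hypothetical sample cluster with one offline CSV owned by NODE2.
nodes = [Node("NODE1", "Up", True), Node("NODE2", "Up", True)]
csvs = [Csv("CSV1", "NODE1", "Online"), Csv("CSV2", "NODE2", "Offline")]
report, owners = triage(nodes, csvs)
print("\n".join(report))
print("Run 'list disk' on:", owners)
```

In the incident described above, this kind of check would have pointed straight at the owner node where the volume had been set to OFFLINE.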
