In Part 1 of this series, I discussed an issue we were having in one of our SharePoint 2013 farms and how I determined it was caused by a set of event receivers acting on the library. In this post, I will discuss the code being used and what the root cause turned out to be. Stick around, it’s not what you think.

To be terribly honest, nothing jumped out at me while looking over the code. The initial review indicated the issue could be in the part where the event receiver tries to determine whether the user adding the file is a member of the site owner group. The original code was:

Again, normally not a huge issue, except that best practices state that you shouldn’t instantiate SPSite, SPWeb or SPList objects within an event receiver, because doing so causes extra database calls (more information here: https://msdn.microsoft.com/en-us/library/office/ee724407(v=office.14).aspx). I thought this could be the culprit, but wasn’t convinced. If this was the issue, why did it work fine for years and then suddenly stop working? The code instantiates the SPSite and SPWeb objects because it is used elsewhere in the solution and could be called by users who do not have the required access. The same goes for the event receiver. If I do not have access to control security groups in the site, I get an UnauthorizedAccessException.

So I thought, why not just use that? We can safely assume that if an UnauthorizedAccessException is thrown, the user is not in the Owners group. So I updated the code with a try/catch (why one wasn’t already being used I don’t know) and added some logic into the catch. Using an exception for control flow is not generally the best method, but I believe it is acceptable when limited to a targeted exception.

Event Receiver Code (C#)

//Loop through each group in the web. If the group is the owner group, check to see if user exists within.
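
The try/catch idea described above can be sketched in a few lines. This is not the event receiver from our solution: the lookup delegate is a hypothetical stand-in for the real SharePoint group query, and the names OwnerCheck and IsOwner are made up for illustration. It only shows the shape of treating one targeted exception as "not an owner".

```csharp
using System;

public static class OwnerCheck
{
    // "lookup" is a hypothetical stand-in for the real SharePoint group query.
    public static bool IsOwner(Func<bool> lookup)
    {
        try
        {
            return lookup();
        }
        catch (UnauthorizedAccessException)
        {
            // Only this specific exception is treated as "not an owner".
            // Any other exception still propagates to the caller.
            return false;
        }
    }

    public static void Main()
    {
        // A lookup that succeeds, and one that fails the way an
        // under-privileged user's call would.
        Console.WriteLine(IsOwner(() => true));
        Console.WriteLine(IsOwner(() => { throw new UnauthorizedAccessException(); }));
    }
}
```

Catching only UnauthorizedAccessException, rather than a bare Exception, keeps unrelated failures visible instead of silently reporting them as a membership result.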

So I moved the code into Pre-Prod and tried it out. No change. Still hanging, throwing errors and crashing the app pool.

Next step was to install Visual Studio into Pre-Prod and attach to the IIS Worker process. I followed the code until it got into the newly created CheckIfUserInSPGroupEvntRcvr method. There it stayed. It kept looping through the AD users and groups within the SharePoint group. As it was looping I watched the worker process memory usage grow and grow until it finally crashed again. This didn’t make any sense as there are NOT that many users in these groups.

The Cause of it All

I took a look at the ownership group for the site I was testing with. Like most (not all) of our project sites, it contained an AD group that contains our project team. Let’s call that group All-Project. All-Project had about a dozen users within it; however, there was an anomaly. It also contained the Owners group from another project site. This was an oddity. I took a look at that Owners group and it, in turn, contained the same All-Project group. There was the culprit.

As you can see in the code above, it is designed for nested groups, so if the code hits a group it digs down to see if the nested group contains the user. Because this Owners group was added to the All-Project group (in error, I found out while trying to figure out why it was there), the code would dig into All-Project, then into the Owners group, from there back into All-Project and then back into the Owners group… see where I am going with this? By adding that single group to the All-Project group in error, an infinite recursion loop was created in the code.
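
For anyone who does need their code to survive this kind of membership loop, here is a minimal sketch of a cycle-safe nested lookup. It is not the code from this farm: the groups are modelled as a plain in-memory dictionary rather than SharePoint or AD objects, and the names (NestedGroupCheck, IsMember, alice, bob, mallory) are invented for illustration. The idea is simply to remember which groups have already been visited and refuse to enter one twice.

```csharp
using System;
using System.Collections.Generic;

public static class NestedGroupCheck
{
    // Hypothetical model: group name -> members (user names or other group names).
    public static bool IsMember(
        IDictionary<string, List<string>> groups,
        string groupName,
        string user,
        HashSet<string> visited = null)
    {
        visited = visited ?? new HashSet<string>(StringComparer.OrdinalIgnoreCase);

        // Add returns false if the group was already visited, which breaks the cycle.
        if (!visited.Add(groupName)) return false;

        List<string> members;
        if (!groups.TryGetValue(groupName, out members)) return false;

        foreach (string member in members)
        {
            if (string.Equals(member, user, StringComparison.OrdinalIgnoreCase)) return true;

            // A member that is itself a group is searched recursively.
            if (groups.ContainsKey(member) && IsMember(groups, member, user, visited)) return true;
        }
        return false;
    }

    public static void Main()
    {
        // The same circular nesting described above: each group contains the other.
        var groups = new Dictionary<string, List<string>>
        {
            { "All-Project", new List<string> { "alice", "Owners" } },
            { "Owners", new List<string> { "bob", "All-Project" } }
        };

        Console.WriteLine(IsMember(groups, "Owners", "alice"));   // True
        Console.WriteLine(IsMember(groups, "Owners", "mallory")); // False, and it terminates
    }
}
```

Without the visited set, the second call would bounce between the two groups until the process ran out of stack or memory, which matches the behaviour I watched in the debugger.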

The Final Fix

So the final fix was not an environmental change or a code modification. It was simply to remove the Owners group from the All-Project group. Once that was done, the original code functioned as designed. If this becomes a regular occurrence I will have to update the code to handle such an event, but in this case, I didn’t. The farm is in containment (no further development short of break/fix) and the issue had not occurred in the two years before this. I hope the steps I documented in this blog series help others out.

Had a doozy of an issue the other day. A SharePoint farm that had been chugging along with no changes suddenly started having some weird issues. Users could open, view and edit documents, but as soon as they attempted a save or an upload of a new document, things started to go bad. If they were using Windows Explorer they received the error: “The specified network name is no longer available”.

If they were using the GUI, the upload form hung for a while and eventually reverted to “The Page Cannot be Displayed”.

At the same time, we were getting reports of users in other areas of the farm getting a very slow response within SharePoint. What was really confusing about this was that the issue was happening to just a single site collection in the farm.

Errors Received

Windows Event Log

We were receiving a number of errors besides those at the end user level. The server event log indicated our app pool was crashing. The entry was actually logged as a warning (to me, if an app pool is crashing, it should be an error) with the message:

In the multiple WFE environment it was happening back and forth between the two servers, indicating the load balancer was doing its job. It also explained why people were seeing slow response. Each time the app pool failed, it had to restart and then reload the SharePoint environment (like you see after an IIS Reset).

ULS Logs

The ULS logs were something else. In this particular environment our logs usually range from 5 MB to 40 MB in size for a 30 minute period. When I ran a one minute log export using “Merge-SPLogFile” the exported file was 1.3 GB. Nothing screamed error at me; however, there were a couple of things standing out.

So this screamed of some custom code (which we do have) running that is not disposing of the SPSite or SPWeb objects properly. Why it suddenly became a problem I didn’t know. We did have security patches applied to the server over the weekend. I didn’t think they were likely to be the cause, as the environment had been used for a day and a half with no issues. We backed out the patches anyway, but that did not resolve the issue. What was also confusing was that the issue was occurring in our Pre-Prod environment as well. The silver lining was that I could now really do some troubleshooting without affecting sites that were functioning or production data.

I finally tracked down the issue to an event receiver we have running in our environment. The project sites all share the same structure, and it was decided that code would be used to enforce this structure. To that end, event receivers were built to ensure that folders at certain levels (library root, root +1 level and root +2 levels) were not deleted and that no files or folders were added at those levels. I took a guess that these event receivers were causing the issues. Using PowerShell I removed the event receivers from a library being affected. In case you need this for something else, the code to remove a list event receiver is:

In the above code (which removes the event receivers from ALL specified libraries in ALL subsites) I used the event receiver class to find the items I wanted to remove. You can also use .Name and .Assembly if you wish. I used Class simply because when the sites were created and the receivers attached, no names were given. With the event receivers removed, users were now able to upload and save documents. So I had indeed found the culprit. Now to determine why.