[ofw] Opensm & WinMad: a race, causing BSOD722

Discussion:

Smith, Stan

2012-01-17 00:36:39 UTC

Permalink

What are the conditions required to generate the BSOD?

Fab Tillier

2012-01-17 00:46:36 UTC

Permalink

10D is the bugcheck code, WDF_VIOLATION. The next 4 parameters are the bugcheck arguments.

Leonid, there's good detail in this report, but you're missing a critical 'repro steps' section that would allow this to be reproduced and debugged by the bug owner. Just the crashdump information is only usually enough for simple bugs, and your analysis is probably on the right track. Still, if you want someone else to look into it, it's best to tell them how they too can see it - debugging things live is more productive than debugging by code inspection. How are Sean or Stan supposed to test that they have indeed fixed a bug, assuming they find something suspect and make code changes, when they have no idea how you triggered it? Is it a consistent repro? 1 in a 100?

Thanks,
-Fab

From: ofw-***@lists.openfabrics.org [mailto:ofw-***@lists.openfabrics.org] On Behalf Of Smith, Stan
Sent: Monday, January 16, 2012 4:37 PM
To: Leonid Keller; Hefty, Sean
Cc: Uri Habusha; ofw_list; Irena Gannon
Subject: Re: [ofw] Opensm & WinMad: a race, causing BSOD722

What are the conditions required to generate the BSOD?

Fab Tillier

2012-01-17 16:57:18 UTC

Permalink

No, Uri, you're missing the point. How did you stop OpenSM? Was it immediately after it was started? Did you let it run for a fixed amount of time? The fact that it ran in your automated regression test environment is worthless information - it is not actionable. You need to describe *what* your automated regression tests *do* allows someone to try to duplicate the problem, even if it's one in 10000.

I can't believe that we have to argue about having repro steps included in bug reports. This is ridiculous. If you have a bug, tell whoever you are assigning the bug to how you ran into it. It really is not that much to ask for.

Look, we all know you hate WinMad and WinVerbs, but please try to be a little more constructive in your interaction with Sean, Stan, and the community. The animosity is getting tiresome. I'd say the same thing to Sean and Stan, and that was the main reason I responded - to let Leo know that the details in the bug report were good, unlike what Stan had suggested. However, I still stand by my statement that leaving repro steps out makes the bug report near worthless.

-Fab

From: Uri Habusha [mailto:***@mellanox.com]
Sent: Monday, January 16, 2012 9:16 PM
To: Fab Tillier; Smith, Stan; Leonid Keller; Hefty, Sean
Cc: ofw_list; Irena Gannon
Subject: RE: Opensm & WinMad: a race, causing BSOD722

Again you miss the point.

Not always we have a clear repro steps. If we have it the life was simple.
This issue happened during night run of our regression. The same test runs every night and this is the first time it happens, what means it is race/timing issue.

In such a case we need to review the code start thinking what can cause the issue and add instrumental code so next time it happens we understand better the scenario.

Uri

From: Fab Tillier [mailto:***@microsoft.com]<mailto:[mailto:***@microsoft.com]>
Sent: Tuesday, January 17, 2012 2:47 AM
To: Smith, Stan; Leonid Keller; Hefty, Sean
Cc: Uri Habusha; ofw_list; Irena Gannon
Subject: RE: Opensm & WinMad: a race, causing BSOD722

10D is the bugcheck code, WDF_VIOLATION. The next 4 parameters are the bugcheck arguments.

Leonid, there's good detail in this report, but you're missing a critical 'repro steps' section that would allow this to be reproduced and debugged by the bug owner. Just the crashdump information is only usually enough for simple bugs, and your analysis is probably on the right track. Still, if you want someone else to look into it, it's best to tell them how they too can see it - debugging things live is more productive than debugging by code inspection. How are Sean or Stan supposed to test that they have indeed fixed a bug, assuming they find something suspect and make code changes, when they have no idea how you triggered it? Is it a consistent repro? 1 in a 100?

Thanks,
-Fab

From: ofw-***@lists.openfabrics.org<mailto:ofw-***@lists.openfabrics.org> [mailto:ofw-***@lists.openfabrics.org]<mailto:[mailto:ofw-***@lists.openfabrics.org]> On Behalf Of Smith, Stan
Sent: Monday, January 16, 2012 4:37 PM
To: Leonid Keller; Hefty, Sean
Cc: Uri Habusha; ofw_list; Irena Gannon
Subject: Re: [ofw] Opensm & WinMad: a race, causing BSOD722

What are the conditions required to generate the BSOD?

Fab Tillier

2012-01-17 17:12:24 UTC

Permalink

You're telling me that you don't have access to the script that killed OpenSM when it crashed? Really?

Code review might be the only path to fixing this, but claiming you have no idea how it happened is false. You own the scripts that were running at the time of the crash. What were they trying to do? Last we knew, you were starting and killing OpenSM, but now it seems you may have been doing that via shutdown, and not directly? Which is it? How can you not have any idea of the steps that led to this?

From: Uri Habusha [mailto:***@mellanox.com]
Sent: Tuesday, January 17, 2012 9:07 AM
To: Fab Tillier; Smith, Stan; Leonid Keller; Hefty, Sean
Cc: ofw_list; Irena Gannon
Subject: RE: Opensm & WinMad: a race, causing BSOD722

No Fab you miss the point! I don't hate winmad and winverbs, I only love and hate people I don't have any feeling to SW component :)

For the point, we really don't have more information regarding the scenario that caused the BSOD. We don't do any target testing of SM. It can be due to shutdown, hibernate, driver restart, link disconnect/connect test and more.

How do you suggest to continue from this point. We can give up and ignore it. We can code review the code looks what can cause it. Add instrumental code or debug information trying to approve the assumption or collect more information for next time.
I think this is constructive approach...

From: Fab Tillier [mailto:***@microsoft.com]<mailto:[mailto:***@microsoft.com]>
Sent: Tuesday, January 17, 2012 6:58 PM
To: Uri Habusha; Smith, Stan; Leonid Keller; Hefty, Sean
Cc: ofw_list; Irena Gannon
Subject: RE: Opensm & WinMad: a race, causing BSOD722

No, Uri, you're missing the point. How did you stop OpenSM? Was it immediately after it was started? Did you let it run for a fixed amount of time? The fact that it ran in your automated regression test environment is worthless information - it is not actionable. You need to describe *what* your automated regression tests *do* allows someone to try to duplicate the problem, even if it's one in 10000.

I can't believe that we have to argue about having repro steps included in bug reports. This is ridiculous. If you have a bug, tell whoever you are assigning the bug to how you ran into it. It really is not that much to ask for.

Look, we all know you hate WinMad and WinVerbs, but please try to be a little more constructive in your interaction with Sean, Stan, and the community. The animosity is getting tiresome. I'd say the same thing to Sean and Stan, and that was the main reason I responded - to let Leo know that the details in the bug report were good, unlike what Stan had suggested. However, I still stand by my statement that leaving repro steps out makes the bug report near worthless.

-Fab

From: Uri Habusha [mailto:***@mellanox.com]<mailto:[mailto:***@mellanox.com]>
Sent: Monday, January 16, 2012 9:16 PM
To: Fab Tillier; Smith, Stan; Leonid Keller; Hefty, Sean
Cc: ofw_list; Irena Gannon
Subject: RE: Opensm & WinMad: a race, causing BSOD722

Again you miss the point.

Not always we have a clear repro steps. If we have it the life was simple.
This issue happened during night run of our regression. The same test runs every night and this is the first time it happens, what means it is race/timing issue.

In such a case we need to review the code start thinking what can cause the issue and add instrumental code so next time it happens we understand better the scenario.

Uri

From: Fab Tillier [mailto:***@microsoft.com]<mailto:[mailto:***@microsoft.com]>
Sent: Tuesday, January 17, 2012 2:47 AM
To: Smith, Stan; Leonid Keller; Hefty, Sean
Cc: Uri Habusha; ofw_list; Irena Gannon
Subject: RE: Opensm & WinMad: a race, causing BSOD722

10D is the bugcheck code, WDF_VIOLATION. The next 4 parameters are the bugcheck arguments.

Leonid, there's good detail in this report, but you're missing a critical 'repro steps' section that would allow this to be reproduced and debugged by the bug owner. Just the crashdump information is only usually enough for simple bugs, and your analysis is probably on the right track. Still, if you want someone else to look into it, it's best to tell them how they too can see it - debugging things live is more productive than debugging by code inspection. How are Sean or Stan supposed to test that they have indeed fixed a bug, assuming they find something suspect and make code changes, when they have no idea how you triggered it? Is it a consistent repro? 1 in a 100?

Thanks,
-Fab

From: ofw-***@lists.openfabrics.org<mailto:ofw-***@lists.openfabrics.org> [mailto:ofw-***@lists.openfabrics.org]<mailto:[mailto:ofw-***@lists.openfabrics.org]> On Behalf Of Smith, Stan
Sent: Monday, January 16, 2012 4:37 PM
To: Leonid Keller; Hefty, Sean
Cc: Uri Habusha; ofw_list; Irena Gannon
Subject: Re: [ofw] Opensm & WinMad: a race, causing BSOD722

What are the conditions required to generate the BSOD?

Fab Tillier

2012-01-17 17:00:35 UTC

Permalink

Hi Leo,

A user-mode application should not be able to crash the system, so even if OpenSM is misbehaving, it's still a bug in the WinMad kernel driver, no doubt about that.

Can you describe how your regression tests start and stop OpenSM? How long does OpenSM run? Is it killed immediately after startup? Is it killed during a sweep? This kind of information at least gets someone started in trying to reproduce it. Heck, can you share your regression test script?

-Fab

From: Leonid Keller [mailto:***@mellanox.com]
Sent: Tuesday, January 17, 2012 12:57 AM
To: Uri Habusha; Fab Tillier; Smith, Stan; Hefty, Sean
Cc: ofw_list; Irena Gannon
Subject: RE: Opensm & WinMad: a race, causing BSOD722

I believe, there are two questions here.
Why Opensm closes the port while the MAD receiving thread is still in work ?
Why Winmad doesn't expect that WmIoRead and WmProviderCleanup can be called simultaneously from different threads ?

Please, look at the code below.
WmIoRead gets provider without any lock and without checking whether it is valid.
It then calls WmProviderRead, which doesn't increment any reference, but handles MAD under WdfObjectAcquireLock(pProvider->ReadQueue), which can be already destroyed (like it was in our case).

WmFileCleanup removes provider from the list under a mutex, unknown to WmIoRead, and then calls WmProviderCleanup.
The latter waits for some reference, unknown for WmIoRead, and then calls WdfIoQueuePurgeSynchronously(pProvider->ReadQueue) without WdfObjectAcquireLock(pProvider->ReadQueue).

No doubt, I'm missing here something, I just did a quick review of the code.
Maybe WinMad presumes that the application first stops I/O, then sends WM_IOCTL_DEREGISTER and only then closes the file handle and the right way for us is just to fix a bad boy opensm ? :)

As Uri stated, we do not know how to reproduce it: opensm gets started and stopped 1000 times during a night regression run, so it's not 1 to 100, it's 1 to 100000. :(
We'd very appreciate if you could find the time to review the code.
(Of course, the same problem may exist for WmIoWrite)

TIA

static VOID WmIoRead(WDFQUEUE Queue, WDFREQUEST Request, size_t Length)
{
WDFFILEOBJECT file;
WM_PROVIDER *prov;
UNREFERENCED_PARAMETER(Queue);

file = WdfRequestGetFileObject(Request);
prov = WmProviderGetContext(file);
WmProviderRead(prov, Request);
}

void WmProviderRead(WM_PROVIDER *pProvider, WDFREQUEST Request)
{
WM_REGISTRATION *reg;
NTSTATUS status;
WM_IO_MAD *wmad;
size_t outlen, len = 0;

status = WdfRequestRetrieveOutputBuffer(Request, sizeof(WM_IO_MAD), &wmad, &outlen);
if (!NT_SUCCESS(status)) {
goto out;
}

WdfObjectAcquireLock(pProvider->ReadQueue);
if (pProvider->MadHead == NULL) {
status = WdfRequestForwardToIoQueue(Request, pProvider->ReadQueue);
WdfObjectReleaseLock(pProvider->ReadQueue);
if (NT_SUCCESS(status)) {
return;
}
goto out;
}

len = outlen;
status = WmCopyMad(wmad, pProvider->MadHead, &len);
if (status == STATUS_SUCCESS) {
reg = (WM_REGISTRATION *) pProvider->MadHead->context1;
reg->pDevice->IbInterface.put_mad(__FILE__, __LINE__, WmRemoveMad(pProvider));
}
WdfObjectReleaseLock(pProvider->ReadQueue);

out:
WdfRequestCompleteWithInformation(Request, status, len);
}

static VOID WmFileCleanup(WDFFILEOBJECT FileObject)
{
WM_PROVIDER *prov = WmProviderGetContext(FileObject);

KeAcquireGuardedMutex(&Lock);
RemoveEntryList(&prov->Entry);
KeReleaseGuardedMutex(&Lock);
WmProviderCleanup(prov);
}

void WmProviderCleanup(WM_PROVIDER *pProvider)
{
WM_REGISTRATION *reg;

while ((reg = IndexListRemoveHead(&pProvider->RegIndex)) != NULL) {
WmRegFree(reg);
}

if (InterlockedDecrement(&pProvider->Ref) > 0) {
KeWaitForSingleObject(&pProvider->Event, Executive, KernelMode, FALSE, NULL);
}

WdfIoQueuePurgeSynchronously(pProvider->ReadQueue);
WdfObjectDelete(pProvider->ReadQueue);

IndexListDestroy(&pProvider->RegIndex);
}

From: Uri Habusha
Sent: Tuesday, January 17, 2012 7:16 AM
To: Fab Tillier; Smith, Stan; Leonid Keller; Hefty, Sean
Cc: ofw_list; Irena Gannon
Subject: RE: Opensm & WinMad: a race, causing BSOD722

Again you miss the point.

Not always we have a clear repro steps. If we have it the life was simple.
This issue happened during night run of our regression. The same test runs every night and this is the first time it happens, what means it is race/timing issue.

In such a case we need to review the code start thinking what can cause the issue and add instrumental code so next time it happens we understand better the scenario.

Uri

From: Fab Tillier [mailto:***@microsoft.com]<mailto:[mailto:***@microsoft.com]>
Sent: Tuesday, January 17, 2012 2:47 AM
To: Smith, Stan; Leonid Keller; Hefty, Sean
Cc: Uri Habusha; ofw_list; Irena Gannon
Subject: RE: Opensm & WinMad: a race, causing BSOD722

10D is the bugcheck code, WDF_VIOLATION. The next 4 parameters are the bugcheck arguments.

Leonid, there's good detail in this report, but you're missing a critical 'repro steps' section that would allow this to be reproduced and debugged by the bug owner. Just the crashdump information is only usually enough for simple bugs, and your analysis is probably on the right track. Still, if you want someone else to look into it, it's best to tell them how they too can see it - debugging things live is more productive than debugging by code inspection. How are Sean or Stan supposed to test that they have indeed fixed a bug, assuming they find something suspect and make code changes, when they have no idea how you triggered it? Is it a consistent repro? 1 in a 100?

Thanks,
-Fab

From: ofw-***@lists.openfabrics.org<mailto:ofw-***@lists.openfabrics.org> [mailto:ofw-***@lists.openfabrics.org]<mailto:[mailto:ofw-***@lists.openfabrics.org]> On Behalf Of Smith, Stan
Sent: Monday, January 16, 2012 4:37 PM
To: Leonid Keller; Hefty, Sean
Cc: Uri Habusha; ofw_list; Irena Gannon
Subject: Re: [ofw] Opensm & WinMad: a race, causing BSOD722

What are the conditions required to generate the BSOD?

Smith, Stan

2012-01-17 17:55:53 UTC

Permalink

From: Leonid Keller [mailto:***@mellanox.com]
Sent: Tuesday, January 17, 2012 12:57 AM
To: Uri Habusha; Fab Tillier; Smith, Stan; Hefty, Sean
Cc: ofw_list; Irena Gannon
Subject: RE: Opensm & WinMad: a race, causing BSOD722

I believe, there are two questions here.
Why Opensm closes the port while the MAD receiving thread is still in work ?

Hello Leo. I suspect the above is the root of the problem. A long time ago, opensm 3.3.6, opensm would hang in shutdown due to the MAD receiving thread ignoring the destroy thread call if reader thread was in winmad read().
Consequently, the MAD receiver thread destroy was commented out and opensm shutdown cleanly.
Perhaps changes in IBAL have altered timing such that shutdown is delayed enough such that the MAD reader thread runs again/further?
A SWAG at this point - I'll experiment today and let you know.

Stan.

Why Winmad doesn't expect that WmIoRead and WmProviderCleanup can be called simultaneously from different threads ?

Please, look at the code below.
WmIoRead gets provider without any lock and without checking whether it is valid.
It then calls WmProviderRead, which doesn't increment any reference, but handles MAD under WdfObjectAcquireLock(pProvider->ReadQueue), which can be already destroyed (like it was in our case).

WmFileCleanup removes provider from the list under a mutex, unknown to WmIoRead, and then calls WmProviderCleanup.
The latter waits for some reference, unknown for WmIoRead, and then calls WdfIoQueuePurgeSynchronously(pProvider->ReadQueue) without WdfObjectAcquireLock(pProvider->ReadQueue).

No doubt, I'm missing here something, I just did a quick review of the code.
Maybe WinMad presumes that the application first stops I/O, then sends WM_IOCTL_DEREGISTER and only then closes the file handle and the right way for us is just to fix a bad boy opensm ? :)

As Uri stated, we do not know how to reproduce it: opensm gets started and stopped 1000 times during a night regression run, so it's not 1 to 100, it's 1 to 100000. :(
We'd very appreciate if you could find the time to review the code.
(Of course, the same problem may exist for WmIoWrite)

TIA

static VOID WmIoRead(WDFQUEUE Queue, WDFREQUEST Request, size_t Length)
{
WDFFILEOBJECT file;
WM_PROVIDER *prov;
UNREFERENCED_PARAMETER(Queue);

file = WdfRequestGetFileObject(Request);
prov = WmProviderGetContext(file);
WmProviderRead(prov, Request);
}

void WmProviderRead(WM_PROVIDER *pProvider, WDFREQUEST Request)
{
WM_REGISTRATION *reg;
NTSTATUS status;
WM_IO_MAD *wmad;
size_t outlen, len = 0;

status = WdfRequestRetrieveOutputBuffer(Request, sizeof(WM_IO_MAD), &wmad, &outlen);
if (!NT_SUCCESS(status)) {
goto out;
}

WdfObjectAcquireLock(pProvider->ReadQueue);
if (pProvider->MadHead == NULL) {
status = WdfRequestForwardToIoQueue(Request, pProvider->ReadQueue);
WdfObjectReleaseLock(pProvider->ReadQueue);
if (NT_SUCCESS(status)) {
return;
}
goto out;
}

len = outlen;
status = WmCopyMad(wmad, pProvider->MadHead, &len);
if (status == STATUS_SUCCESS) {
reg = (WM_REGISTRATION *) pProvider->MadHead->context1;
reg->pDevice->IbInterface.put_mad(__FILE__, __LINE__, WmRemoveMad(pProvider));
}
WdfObjectReleaseLock(pProvider->ReadQueue);

out:
WdfRequestCompleteWithInformation(Request, status, len);
}

static VOID WmFileCleanup(WDFFILEOBJECT FileObject)
{
WM_PROVIDER *prov = WmProviderGetContext(FileObject);

KeAcquireGuardedMutex(&Lock);
RemoveEntryList(&prov->Entry);
KeReleaseGuardedMutex(&Lock);
WmProviderCleanup(prov);
}

void WmProviderCleanup(WM_PROVIDER *pProvider)
{
WM_REGISTRATION *reg;

while ((reg = IndexListRemoveHead(&pProvider->RegIndex)) != NULL) {
WmRegFree(reg);
}

if (InterlockedDecrement(&pProvider->Ref) > 0) {
KeWaitForSingleObject(&pProvider->Event, Executive, KernelMode, FALSE, NULL);
}

WdfIoQueuePurgeSynchronously(pProvider->ReadQueue);
WdfObjectDelete(pProvider->ReadQueue);

IndexListDestroy(&pProvider->RegIndex);
}

From: Uri Habusha
Sent: Tuesday, January 17, 2012 7:16 AM
To: Fab Tillier; Smith, Stan; Leonid Keller; Hefty, Sean
Cc: ofw_list; Irena Gannon
Subject: RE: Opensm & WinMad: a race, causing BSOD722

Again you miss the point.

Not always we have a clear repro steps. If we have it the life was simple.
This issue happened during night run of our regression. The same test runs every night and this is the first time it happens, what means it is race/timing issue.

In such a case we need to review the code start thinking what can cause the issue and add instrumental code so next time it happens we understand better the scenario.

Uri

From: Fab Tillier [mailto:***@microsoft.com]<mailto:[mailto:***@microsoft.com]>
Sent: Tuesday, January 17, 2012 2:47 AM
To: Smith, Stan; Leonid Keller; Hefty, Sean
Cc: Uri Habusha; ofw_list; Irena Gannon
Subject: RE: Opensm & WinMad: a race, causing BSOD722

10D is the bugcheck code, WDF_VIOLATION. The next 4 parameters are the bugcheck arguments.

Leonid, there's good detail in this report, but you're missing a critical 'repro steps' section that would allow this to be reproduced and debugged by the bug owner. Just the crashdump information is only usually enough for simple bugs, and your analysis is probably on the right track. Still, if you want someone else to look into it, it's best to tell them how they too can see it - debugging things live is more productive than debugging by code inspection. How are Sean or Stan supposed to test that they have indeed fixed a bug, assuming they find something suspect and make code changes, when they have no idea how you triggered it? Is it a consistent repro? 1 in a 100?

Thanks,
-Fab

From: ofw-***@lists.openfabrics.org<mailto:ofw-***@lists.openfabrics.org> [mailto:ofw-***@lists.openfabrics.org]<mailto:[mailto:ofw-***@lists.openfabrics.org]> On Behalf Of Smith, Stan
Sent: Monday, January 16, 2012 4:37 PM
To: Leonid Keller; Hefty, Sean
Cc: Uri Habusha; ofw_list; Irena Gannon
Subject: Re: [ofw] Opensm & WinMad: a race, causing BSOD722

What are the conditions required to generate the BSOD?

Hefty, Sean

2012-01-17 19:57:46 UTC

Permalink

[LK] Hello Stan. It looks like a Smart WAG. J
But I believe, we all agree with Fab, that the problem should be fixed at
WinMad level. We have other WinMad applications except opensm.
I hope, Sean will have a chance to look at it.

I have not looked at the crash report in detail yet, but, yes, I plan on looking at it.

- Sean

Smith, Stan

2012-01-17 20:28:54 UTC

Permalink

Leo,
Please try the enclosed patch which allows the umad_receiver() thread to exit once umad_receiver_stop() has been called [overall the same effect as cl_thread_destroy()].
This is not a fix for winmad() but a way of 'not' calling umad_read() after umad_receiver_stop() has been called.

Stan.

--- ulp/opensm/user/libvendor/osm_vendor_ibumad.c Thu Jan 12 15:27:32 2012
+++ ulp/opensm/user/libvendor/osm_vendor_ibumad.c Tue Jan 17 12:15:11 2012
@@ -67,6 +67,10 @@
#include <opensm/osm_helper.h>
#include <vendor/osm_vendor_api.h>
+#ifdef __WIN__
+static int umad_receiver_run;
+#endif
+
/****s* OpenSM: Vendor UMAD/osm_umad_bind_info_t
* NAME
* osm_umad_bind_info_t
@@ -253,7 +257,11 @@
OSM_LOG_ENTER(p_ur->p_log);
+#ifdef __WIN__
+ for (umad_receiver_run=1; umad_receiver_run;) {
+#else
for (;;) {
+#endif
if (!umad &&
!(umad = umad_alloc(1, umad_size() + MAD_BLOCK_SIZE))) {
OSM_LOG(p_ur->p_log, OSM_LOG_ERROR, "ERR 5403: "
@@ -469,9 +477,7 @@
pthread_join(p_ur->tid, NULL);
p_ur->tid = 0;
#else
- /* XXX hangs current thread - suspect umad_recv() ignoring wakeup.
- * cl_thread_destroy(&p_ur->tid);
- */
+ umad_receiver_run = 0;
#endif
p_ur->p_vend = NULL;
p_ur->p_log = NULL;

From: Leonid Keller [mailto:***@mellanox.com]<mailto:[mailto:***@mellanox.com]>
Sent: Tuesday, January 17, 2012 12:57 AM
To: Uri Habusha; Fab Tillier; Smith, Stan; Hefty, Sean
Cc: ofw_list; Irena Gannon
Subject: RE: Opensm & WinMad: a race, causing BSOD722

I believe, there are two questions here.
Why Opensm closes the port while the MAD receiving thread is still in work ?

Hello Leo. I suspect the above is the root of the problem. A long time ago, opensm 3.3.6, opensm would hang in shutdown due to the MAD receiving thread ignoring the destroy thread call if reader thread was in winmad read().
Consequently, the MAD receiver thread destroy was commented out and opensm shutdown cleanly.
Perhaps changes in IBAL have altered timing such that shutdown is delayed enough such that the MAD reader thread runs again/further?
A SWAG at this point - I'll experiment today and let you know.

Stan.

[LK] Hello Stan. It looks like a Smart WAG. :)
But I believe, we all agree with Fab, that the problem should be fixed at WinMad level. We have other WinMad applications except opensm.
I hope, Sean will have a chance to look at it.

Why Winmad doesn't expect that WmIoRead and WmProviderCleanup can be called simultaneously from different threads ?

Please, look at the code below.
WmIoRead gets provider without any lock and without checking whether it is valid.
It then calls WmProviderRead, which doesn't increment any reference, but handles MAD under WdfObjectAcquireLock(pProvider->ReadQueue), which can be already destroyed (like it was in our case).

WmFileCleanup removes provider from the list under a mutex, unknown to WmIoRead, and then calls WmProviderCleanup.
The latter waits for some reference, unknown for WmIoRead, and then calls WdfIoQueuePurgeSynchronously(pProvider->ReadQueue) without WdfObjectAcquireLock(pProvider->ReadQueue).

No doubt, I'm missing here something, I just did a quick review of the code.
Maybe WinMad presumes that the application first stops I/O, then sends WM_IOCTL_DEREGISTER and only then closes the file handle and the right way for us is just to fix a bad boy opensm ? :)

As Uri stated, we do not know how to reproduce it: opensm gets started and stopped 1000 times during a night regression run, so it's not 1 to 100, it's 1 to 100000. :(
We'd very appreciate if you could find the time to review the code.
(Of course, the same problem may exist for WmIoWrite)

TIA

static VOID WmIoRead(WDFQUEUE Queue, WDFREQUEST Request, size_t Length)
{
WDFFILEOBJECT file;
WM_PROVIDER *prov;
UNREFERENCED_PARAMETER(Queue);

file = WdfRequestGetFileObject(Request);
prov = WmProviderGetContext(file);
WmProviderRead(prov, Request);
}

void WmProviderRead(WM_PROVIDER *pProvider, WDFREQUEST Request)
{
WM_REGISTRATION *reg;
NTSTATUS status;
WM_IO_MAD *wmad;
size_t outlen, len = 0;

status = WdfRequestRetrieveOutputBuffer(Request, sizeof(WM_IO_MAD), &wmad, &outlen);
if (!NT_SUCCESS(status)) {
goto out;
}

WdfObjectAcquireLock(pProvider->ReadQueue);
if (pProvider->MadHead == NULL) {
status = WdfRequestForwardToIoQueue(Request, pProvider->ReadQueue);
WdfObjectReleaseLock(pProvider->ReadQueue);
if (NT_SUCCESS(status)) {
return;
}
goto out;
}

len = outlen;
status = WmCopyMad(wmad, pProvider->MadHead, &len);
if (status == STATUS_SUCCESS) {
reg = (WM_REGISTRATION *) pProvider->MadHead->context1;
reg->pDevice->IbInterface.put_mad(__FILE__, __LINE__, WmRemoveMad(pProvider));
}
WdfObjectReleaseLock(pProvider->ReadQueue);

out:
WdfRequestCompleteWithInformation(Request, status, len);
}

static VOID WmFileCleanup(WDFFILEOBJECT FileObject)
{
WM_PROVIDER *prov = WmProviderGetContext(FileObject);

KeAcquireGuardedMutex(&Lock);
RemoveEntryList(&prov->Entry);
KeReleaseGuardedMutex(&Lock);
WmProviderCleanup(prov);
}

void WmProviderCleanup(WM_PROVIDER *pProvider)
{
WM_REGISTRATION *reg;

while ((reg = IndexListRemoveHead(&pProvider->RegIndex)) != NULL) {
WmRegFree(reg);
}

if (InterlockedDecrement(&pProvider->Ref) > 0) {
KeWaitForSingleObject(&pProvider->Event, Executive, KernelMode, FALSE, NULL);
}

WdfIoQueuePurgeSynchronously(pProvider->ReadQueue);
WdfObjectDelete(pProvider->ReadQueue);

IndexListDestroy(&pProvider->RegIndex);
}

From: Uri Habusha
Sent: Tuesday, January 17, 2012 7:16 AM
To: Fab Tillier; Smith, Stan; Leonid Keller; Hefty, Sean
Cc: ofw_list; Irena Gannon
Subject: RE: Opensm & WinMad: a race, causing BSOD722

Again you miss the point.

Not always we have a clear repro steps. If we have it the life was simple.
This issue happened during night run of our regression. The same test runs every night and this is the first time it happens, what means it is race/timing issue.

In such a case we need to review the code start thinking what can cause the issue and add instrumental code so next time it happens we understand better the scenario.

Uri

From: Fab Tillier [mailto:***@microsoft.com]<mailto:[mailto:***@microsoft.com]>
Sent: Tuesday, January 17, 2012 2:47 AM
To: Smith, Stan; Leonid Keller; Hefty, Sean
Cc: Uri Habusha; ofw_list; Irena Gannon
Subject: RE: Opensm & WinMad: a race, causing BSOD722

10D is the bugcheck code, WDF_VIOLATION. The next 4 parameters are the bugcheck arguments.

Leonid, there's good detail in this report, but you're missing a critical 'repro steps' section that would allow this to be reproduced and debugged by the bug owner. Just the crashdump information is only usually enough for simple bugs, and your analysis is probably on the right track. Still, if you want someone else to look into it, it's best to tell them how they too can see it - debugging things live is more productive than debugging by code inspection. How are Sean or Stan supposed to test that they have indeed fixed a bug, assuming they find something suspect and make code changes, when they have no idea how you triggered it? Is it a consistent repro? 1 in a 100?

Thanks,
-Fab

From: ofw-***@lists.openfabrics.org<mailto:ofw-***@lists.openfabrics.org> [mailto:ofw-***@lists.openfabrics.org]<mailto:[mailto:ofw-***@lists.openfabrics.org]> On Behalf Of Smith, Stan
Sent: Monday, January 16, 2012 4:37 PM
To: Leonid Keller; Hefty, Sean
Cc: Uri Habusha; ofw_list; Irena Gannon
Subject: Re: [ofw] Opensm & WinMad: a race, causing BSOD722

What are the conditions required to generate the BSOD?

Smith, Stan

2012-01-18 18:35:33 UTC

Permalink

From: Leonid Keller [mailto:***@mellanox.com]
Sent: Wednesday, January 18, 2012 12:57 AM
To: Smith, Stan; Uri Habusha; Fab Tillier; Hefty, Sean
Cc: ofw_list; Irena Gannon
Subject: RE: Opensm & WinMad: a race, causing BSOD722

Thank you, Stan.

A quick question after a quick look on the patch:
I understand that with this patch the MAD receiver thread will exit in some time.

The original problem is umad_read()s were repeatedly issued during and after the HCA/port close; infinite for() loop, error handler just continued the loop.
Eventually opensm exit() clobbered the umad_receiver() thread during process tear-down. Your testing exposed a race condition.

The fix allows 0..1 umad_read()'s and then exits the for() loop which then exits the umad_receiver() function which terminates the CL thread.

I do not understand why the crash can't happen during this time.

umad_receiver_stop() is called which clears the umad_receiver() for() loop execution condition.

Possible umad_receiver() outcomes:
umad_receiver () is not in umad_read(), hence the for() loop check will exit.
umad_receiver() is blocked in umad_read() call; when umad_read() returns there will either be a valid MAD or an error (device closed) - either way the for() loop check will exit.

Point is umad_read() will not be called after the HCA/port is closed.
cl_thread_destroy() was unable to interrupt the umad_read() call or if it did another umad_read() was issued before the thread terminated.

I'd expect umad_receiver_stop() to wait till umad_receiver_run variable gets a value "OK, I'm done, dude". :)

Potential deadlock situation as the caller of umad_receiver_stop() eventually closes the HCA/port. In order for umad_receiver_stop() to receive confirmation that umad_receiver() had stopped without first closing the HCA, one MAD would have to be received in order to exit the umad_receiver() for() loop and notify/release umad_receiver_stop().
I'd vote for keep-it-simple (KIS).

BTW, is this problem only for reading MADs ?

I don't know, yet to be discovered.

Is it possible that one thread is sending a MAD while the main one is closing the port ?

One would think the MAD driver (winmad) would disallow closing while a Transmit is in progress.

Stan.

From: Smith, Stan [mailto:***@intel.com]<mailto:[mailto:***@intel.com]>
Sent: Tuesday, January 17, 2012 10:29 PM
To: Leonid Keller; Uri Habusha; Fab Tillier; Hefty, Sean
Cc: ofw_list; Irena Gannon
Subject: RE: Opensm & WinMad: a race, causing BSOD722

Leo,
Please try the enclosed patch which allows the umad_receiver() thread to exit once umad_receiver_stop() has been called [overall the same effect as cl_thread_destroy()].
This is not a fix for winmad() but a way of 'not' calling umad_read() after umad_receiver_stop() has been called.

Stan.

--- ulp/opensm/user/libvendor/osm_vendor_ibumad.c Thu Jan 12 15:27:32 2012
+++ ulp/opensm/user/libvendor/osm_vendor_ibumad.c Tue Jan 17 12:15:11 2012
@@ -67,6 +67,10 @@
#include <opensm/osm_helper.h>
#include <vendor/osm_vendor_api.h>

+#ifdef __WIN__
+static int umad_receiver_run;
+#endif
+
/****s* OpenSM: Vendor UMAD/osm_umad_bind_info_t
* NAME
* osm_umad_bind_info_t
@@ -253,7 +257,11 @@

OSM_LOG_ENTER(p_ur->p_log);

+#ifdef __WIN__
+ for (umad_receiver_run=1; umad_receiver_run;) {
+#else
for (;;) {
+#endif
if (!umad &&
!(umad = umad_alloc(1, umad_size() + MAD_BLOCK_SIZE))) {
OSM_LOG(p_ur->p_log, OSM_LOG_ERROR, "ERR 5403: "
@@ -469,9 +477,7 @@
pthread_join(p_ur->tid, NULL);
p_ur->tid = 0;
#else
- /* XXX hangs current thread - suspect umad_recv() ignoring wakeup.
- * cl_thread_destroy(&p_ur->tid);
- */
+ umad_receiver_run = 0;
#endif
p_ur->p_vend = NULL;
p_ur->p_log = NULL;

From: Leonid Keller [mailto:***@mellanox.com]<mailto:[mailto:***@mellanox.com]>
Sent: Tuesday, January 17, 2012 12:57 AM
To: Uri Habusha; Fab Tillier; Smith, Stan; Hefty, Sean
Cc: ofw_list; Irena Gannon
Subject: RE: Opensm & WinMad: a race, causing BSOD722

I believe, there are two questions here.
Why Opensm closes the port while the MAD receiving thread is still in work ?

Hello Leo. I suspect the above is the root of the problem. A long time ago, opensm 3.3.6, opensm would hang in shutdown due to the MAD receiving thread ignoring the destroy thread call if reader thread was in winmad read().
Consequently, the MAD receiver thread destroy was commented out and opensm shutdown cleanly.
Perhaps changes in IBAL have altered timing such that shutdown is delayed enough such that the MAD reader thread runs again/further?
A SWAG at this point - I'll experiment today and let you know.

Stan.

[LK] Hello Stan. It looks like a Smart WAG. :)
But I believe, we all agree with Fab, that the problem should be fixed at WinMad level. We have other WinMad applications except opensm.
I hope, Sean will have a chance to look at it.

Why Winmad doesn't expect that WmIoRead and WmProviderCleanup can be called simultaneously from different threads ?

Please, look at the code below.
WmIoRead gets provider without any lock and without checking whether it is valid.
It then calls WmProviderRead, which doesn't increment any reference, but handles MAD under WdfObjectAcquireLock(pProvider->ReadQueue), which can be already destroyed (like it was in our case).

WmFileCleanup removes provider from the list under a mutex, unknown to WmIoRead, and then calls WmProviderCleanup.
The latter waits for some reference, unknown for WmIoRead, and then calls WdfIoQueuePurgeSynchronously(pProvider->ReadQueue) without WdfObjectAcquireLock(pProvider->ReadQueue).

No doubt, I'm missing here something, I just did a quick review of the code.
Maybe WinMad presumes that the application first stops I/O, then sends WM_IOCTL_DEREGISTER and only then closes the file handle and the right way for us is just to fix a bad boy opensm ? :)

As Uri stated, we do not know how to reproduce it: opensm gets started and stopped 1000 times during a night regression run, so it's not 1 to 100, it's 1 to 100000. :(
We'd very appreciate if you could find the time to review the code.
(Of course, the same problem may exist for WmIoWrite)

TIA

static VOID WmIoRead(WDFQUEUE Queue, WDFREQUEST Request, size_t Length)
{
WDFFILEOBJECT file;
WM_PROVIDER *prov;
UNREFERENCED_PARAMETER(Queue);

file = WdfRequestGetFileObject(Request);
prov = WmProviderGetContext(file);
WmProviderRead(prov, Request);
}

void WmProviderRead(WM_PROVIDER *pProvider, WDFREQUEST Request)
{
WM_REGISTRATION *reg;
NTSTATUS status;
WM_IO_MAD *wmad;
size_t outlen, len = 0;

status = WdfRequestRetrieveOutputBuffer(Request, sizeof(WM_IO_MAD), &wmad, &outlen);
if (!NT_SUCCESS(status)) {
goto out;
}

WdfObjectAcquireLock(pProvider->ReadQueue);
if (pProvider->MadHead == NULL) {
status = WdfRequestForwardToIoQueue(Request, pProvider->ReadQueue);
WdfObjectReleaseLock(pProvider->ReadQueue);
if (NT_SUCCESS(status)) {
return;
}
goto out;
}

len = outlen;
status = WmCopyMad(wmad, pProvider->MadHead, &len);
if (status == STATUS_SUCCESS) {
reg = (WM_REGISTRATION *) pProvider->MadHead->context1;
reg->pDevice->IbInterface.put_mad(__FILE__, __LINE__, WmRemoveMad(pProvider));
}
WdfObjectReleaseLock(pProvider->ReadQueue);

out:
WdfRequestCompleteWithInformation(Request, status, len);
}

static VOID WmFileCleanup(WDFFILEOBJECT FileObject)
{
WM_PROVIDER *prov = WmProviderGetContext(FileObject);

KeAcquireGuardedMutex(&Lock);
RemoveEntryList(&prov->Entry);
KeReleaseGuardedMutex(&Lock);
WmProviderCleanup(prov);
}

void WmProviderCleanup(WM_PROVIDER *pProvider)
{
WM_REGISTRATION *reg;

while ((reg = IndexListRemoveHead(&pProvider->RegIndex)) != NULL) {
WmRegFree(reg);
}

if (InterlockedDecrement(&pProvider->Ref) > 0) {
KeWaitForSingleObject(&pProvider->Event, Executive, KernelMode, FALSE, NULL);
}

WdfIoQueuePurgeSynchronously(pProvider->ReadQueue);
WdfObjectDelete(pProvider->ReadQueue);

IndexListDestroy(&pProvider->RegIndex);
}

From: Uri Habusha
Sent: Tuesday, January 17, 2012 7:16 AM
To: Fab Tillier; Smith, Stan; Leonid Keller; Hefty, Sean
Cc: ofw_list; Irena Gannon
Subject: RE: Opensm & WinMad: a race, causing BSOD722

Again you miss the point.

Not always we have a clear repro steps. If we have it the life was simple.
This issue happened during night run of our regression. The same test runs every night and this is the first time it happens, what means it is race/timing issue.

In such a case we need to review the code start thinking what can cause the issue and add instrumental code so next time it happens we understand better the scenario.

Uri

From: Fab Tillier [mailto:***@microsoft.com]<mailto:[mailto:***@microsoft.com]>
Sent: Tuesday, January 17, 2012 2:47 AM
To: Smith, Stan; Leonid Keller; Hefty, Sean
Cc: Uri Habusha; ofw_list; Irena Gannon
Subject: RE: Opensm & WinMad: a race, causing BSOD722

10D is the bugcheck code, WDF_VIOLATION. The next 4 parameters are the bugcheck arguments.

Leonid, there's good detail in this report, but you're missing a critical 'repro steps' section that would allow this to be reproduced and debugged by the bug owner. Just the crashdump information is only usually enough for simple bugs, and your analysis is probably on the right track. Still, if you want someone else to look into it, it's best to tell them how they too can see it - debugging things live is more productive than debugging by code inspection. How are Sean or Stan supposed to test that they have indeed fixed a bug, assuming they find something suspect and make code changes, when they have no idea how you triggered it? Is it a consistent repro? 1 in a 100?

Thanks,
-Fab

From: ofw-***@lists.openfabrics.org<mailto:ofw-***@lists.openfabrics.org> [mailto:ofw-***@lists.openfabrics.org]<mailto:[mailto:ofw-***@lists.openfabrics.org]> On Behalf Of Smith, Stan
Sent: Monday, January 16, 2012 4:37 PM
To: Leonid Keller; Hefty, Sean
Cc: Uri Habusha; ofw_list; Irena Gannon
Subject: Re: [ofw] Opensm & WinMad: a race, causing BSOD722

What are the conditions required to generate the BSOD?