
I have a very interesting and frustrating issue. The company that I work for uses an application to do simulation and verification for electrical engineering. This application produces output based on a random seed. Once you have the random seed for a given simulation, you can feed that seed back into the application, and you should get identical results. Unfortunately, we're running into issues where different Linux distributions produce different results given the same random seed. This is problematic because the engineers using these tools rely on being able to reproduce exact results every time. I should mention that we're using exactly the same version of the application across the different distributions - in fact, it is NFS-mounted, so it's exactly the same install point that gets tested every time.

At this point, I'm testing with the following operating systems:

CentOS 5
CentOS 6
SLES 11
openSuSE 11.4
openSuSE 12.1
openSuSE 12.2

The high-level results that I'm seeing are that CentOS 5 and SLES 11 produce identical results, and CentOS 6 and openSuSE 11.4/12.1/12.2 produce identical results. CentOS 5 and SLES 11 are the distributions listed as supported by the software manufacturer, so, at this point, we're looking at the result that those two distributions produce as the "correct" one. I haven't been able to find a consistent set of similarities or differences at this point that tells me what's going on. For example, here are the kernel versions in use on the different systems:

CentOS 5 - 2.6.18
CentOS 6 - 2.6.32
SLES 11 - 3.0.51
openSuSE 11.4 - 2.6.37.6
openSuSE 12.1 - 3.1.10
openSuSE 12.2 - 3.4.11

So, without digging into details about what patches might be applied to which kernel versions in these distributions, I can already say that it isn't a major change between 2.6 and 3.0, for example, because SLES 11 and CentOS 5 produce consistent results, yet have differing major kernel versions. I've looked at glibc, as well, and struck out there - openSuSE 11.4 and SLES 11 have very close versions of glibc packages (2.11.3), yet they fall into different result groups.
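One way to make this kind of comparison less ad hoc is to collect the same set of facts on every host and diff them. The following is only a minimal sketch, assuming Python 3 is available on each system; `host_fingerprint`, `diff_fingerprints` and the list of environment variables are hypothetical choices for illustration, not anything the application is known to depend on.

```python
import json
import os
import platform

# Environment variables that commonly change program behaviour. This is an
# assumed starting list, not something the application is known to read.
ENV_KEYS = ("LANG", "LC_ALL", "LC_NUMERIC", "LD_LIBRARY_PATH", "LD_PRELOAD", "TZ")


def host_fingerprint():
    """Collect a few facts about this host that are worth diffing."""
    libc_name, libc_version = platform.libc_ver()
    return {
        "kernel": platform.release(),
        "machine": platform.machine(),
        "libc": ("%s %s" % (libc_name, libc_version)).strip(),
        "env": {key: os.environ.get(key, "") for key in ENV_KEYS},
    }


def diff_fingerprints(a, b, prefix=""):
    """Return the sorted, dotted keys whose values differ between a and b."""
    differing = []
    for key in sorted(set(a) | set(b)):
        left, right = a.get(key), b.get(key)
        path = prefix + key
        if isinstance(left, dict) and isinstance(right, dict):
            differing.extend(diff_fingerprints(left, right, path + "."))
        elif left != right:
            differing.append(path)
    return differing


if __name__ == "__main__":
    # Save this output on each host, then compare the JSON files.
    print(json.dumps(host_fingerprint(), indent=2, sort_keys=True))
```

Saving the JSON from a "good" host and a "bad" host and passing the two loaded dictionaries to `diff_fingerprints` gives a short list of keys to investigate. The list of collected facts would need extending as new suspects come up.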

My question is this: where should I look next to track down this problem? I've started collecting strace output and trying to compare it, but that's very time-consuming and of little value when the strace output is vastly different even between distributions that produce identical results. I can't very well go to the application manufacturer, since the platforms that I'm having issues with are listed as unsupported, so they will just tell me to get lost. Any hints on where to go from here?
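Raw strace output differs between runs for uninteresting reasons (PIDs, memory addresses), so it may help to normalize it before comparing. Below is a rough sketch, assuming the traces were captured with something like `strace -f -o trace.txt` and without timestamp options; the function names and regular expressions are illustrative assumptions and will not cover every strace output format.

```python
import re

HEX_VALUE = re.compile(r"0x[0-9a-fA-F]+")
# "[pid 1234] " (strace -f to the terminal) or "1234 " (strace -f -o file).
PID_PREFIX = re.compile(r"^(?:\[pid\s+\d+\]\s+|\d+\s+)")
# open("path", ...) = N  or  openat(AT_FDCWD, "path", ...) = N
OPEN_CALL = re.compile(r'^(?:open|openat)\((?:[A-Z_]+, )?"([^"]+)".*=\s*(-?\d+)')


def normalize(line):
    """Strip the PID prefix and mask addresses so runs can be compared."""
    line = PID_PREFIX.sub("", line.strip())
    return HEX_VALUE.sub("0xADDR", line)


def opened_files(lines):
    """Return the set of paths that were opened successfully."""
    paths = set()
    for raw in lines:
        match = OPEN_CALL.match(normalize(raw))
        if match and int(match.group(2)) >= 0:
            paths.add(match.group(1))
    return paths


def compare_traces(good_lines, bad_lines):
    """Return (only_in_good, only_in_bad) as sorted lists of paths."""
    good, bad = opened_files(good_lines), opened_files(bad_lines)
    return sorted(good - bad), sorted(bad - good)
```

Feeding `compare_traces` the lines of a trace from a distribution in the "correct" group and one from the other group reduces the comparison to which files each run actually opened, such as libraries, locale data, or configuration files. That is usually a much smaller list than a line-by-line diff.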

HopelessN00b
Nick
  • Why not attach `gdb` and start stepping through the critical functions, comparing until you see which function calls start diverging? – R. S. Feb 08 '13 at 00:31
  • What libraries is this program linked against? – mattdm Feb 08 '13 at 01:25
  • An ldd on the primary binary shows this: ` linux-vdso.so.1 => (0x00007fff3b199000) libc.so.6 => /lib64/libc.so.6 (0x00007f61ef15c000) /lib64/ld-linux-x86-64.so.2 (0x00007f61ef4c9000) ` However, I'm still trying to figure out how many other binaries might be called in the process, so this may not be a complete list. – Nick Feb 08 '13 at 14:57
  • Please elaborate on the reasoning for closing: this is a legitimate production environment question, and one that an engineer would be required to prove is not the fault of the servers. The code is *likely* to blame, but an engineer would need to know where to go from here. – Andrew B Feb 11 '13 at 20:05
  • @AndrewB "Why is X generating different binary code than Y" is not really a system administration question - it's a software development one (more appropriate to [Stack Overflow](http://stackoverflow.com) than Server Fault). – voretaq7 Feb 11 '13 at 20:29
  • This is something that walks the line IMO (it is in fact coming up in a system administration context), and I'm not sure I like the idea of pushing this sort of troubleshooting skill off into the realm of pure programming. Call me oldschool in that regard I guess. – Andrew B Feb 11 '13 at 20:50
  • @voretaq7 The reason I did not believe this was a software development issue is because I'm not developing the software - this is commercial software that my company licenses and we have no control over the code. I'm just trying to determine, as a sys admin, why this application is behaving differently on different Linux distributions and if there is something I need to do (environment, package install, etc.) to make it behave identically on both systems. – Nick Feb 11 '13 at 21:22

1 Answer


The last time I ran into this particular issue, it ended up being a difference in whether the two operating systems were running in 32-bit mode or 64-bit mode. (@kormoc's suggestion is useful, as gdb was how I isolated it.) Since you did not mention verifying that this is the same across the servers, I would recommend that you begin your search there.

If the architecture does differ, you can then confirm that as the source of your problem by setting up one of your problem OSes with a matching architecture and seeing if the results converge once more. That should give your developers enough information to work on implementing platform-independent seeds.
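To check the architecture without guessing, one option is to read the ELF header of the application binary on each host and compare it with what the OS reports. This is a small sketch, assuming Python 3 and an ELF binary; `elf_class` and `binary_class` are hypothetical helper names, and the path in the usage note is a placeholder.

```python
import platform

ELF_MAGIC = b"\x7fELF"
# The fifth byte of an ELF header (EI_CLASS) is 1 for 32-bit, 2 for 64-bit.
ELF_CLASSES = {1: "32-bit", 2: "64-bit"}


def elf_class(header):
    """Classify an ELF header given at least its first five bytes."""
    if len(header) < 5 or header[:4] != ELF_MAGIC:
        raise ValueError("not an ELF file")
    return ELF_CLASSES.get(header[4], "unknown")


def binary_class(path):
    """Read the start of the file at path and classify it."""
    with open(path, "rb") as handle:
        return elf_class(handle.read(5))


def host_machine():
    """Machine type reported by the running kernel, e.g. x86_64."""
    return platform.machine()
```

Calling `binary_class("/path/to/application")` (a placeholder path) together with `host_machine()` on each system shows whether a 32-bit binary is being run on a 64-bit OS anywhere. Note that wrapper scripts will raise `ValueError`, so the check has to be pointed at the real executable.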

Andrew B
  • Sorry about that - meant to add that, as well. All of the distributions that I'm evaluating are 64-bit distributions, and I'm running them all as Xen PV domUs on a single piece of hardware. So, I don't think it's architecture or hardware at this point - something else. I will work on the gdb stuff tonight or tomorrow... – Nick Feb 08 '13 at 00:35
  • Shucks, was worth a shot. Take a look at [cgdb](http://cgdb.github.com/) if you don't have X installed, it splits your screen and tracks the current execution position within the code in a separate pane. I've never gone back. – Andrew B Feb 08 '13 at 00:39
  • I've taken the liberty of [firing up a chat](http://chat.stackexchange.com/rooms/info/7432/gdb-assistance-for-nick?tab=access) in the event that you need assistance stepping through gdb. If you haven't gotten 4 upvotes by the time you need to use it, visit the URL and reply with your profile URL on that site so that I can override the low rep filter. – Andrew B Feb 08 '13 at 00:59
  • Thanks for the offer of a chat - I think I got gdb working. Unfortunately it seems that the moderators believe my question does not belong here, so I will go on a search for a better place to ask it. – Nick Feb 11 '13 at 19:52