
My client manufactures a medical device that takes various measurements of a given sample and writes the results to a database. The amount of data generated is relatively small.

In the current configuration, each device has its own computer, and that computer runs an instance of a database server. The devices are not networked.

The client wants to modify the device such that roughly fifty of them can be connected to a local area network.

The devices use various consumables that are lot numbered and once used cannot be used again. These lot numbers are written to the database when a sample is measured. This requirement is notable because in the current configuration a device has no way of knowing if a consumable has been used by a different device. In the proposed network configuration, the expectation exists that each device will have immediate access to information about consumables used by other devices.

The devices also need to track the quantity of various chemicals that are used in the testing process. Each bottle of chemical is lot numbered and barcoded. When a bottle is inserted into the machine, the machine reads the database to determine how much liquid has been consumed from the bottle. The expectation is that a lot-numbered bottle can be inserted into any machine and the machine will be able to accurately assess the amount of liquid in the bottle.
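To make the bookkeeping concrete, here is a minimal sketch of the two checks described above, using SQLite and hypothetical table and column names (the real schema belongs to the client):

```python
# Minimal sketch; the schema and names are hypothetical, not the client's actual design.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE consumable_usage (lot TEXT PRIMARY KEY, device_id TEXT, used_at TEXT);
CREATE TABLE reagent_usage    (bottle_lot TEXT, device_id TEXT, volume_ml REAL, used_at TEXT);
""")

def consumable_already_used(lot):
    # A lot-numbered consumable may be used at most once, by any device.
    return conn.execute("SELECT 1 FROM consumable_usage WHERE lot = ?", (lot,)).fetchone() is not None

def remaining_volume_ml(bottle_lot, initial_volume_ml):
    # Remaining volume = initial fill minus everything any device has drawn from this bottle.
    (used,) = conn.execute(
        "SELECT COALESCE(SUM(volume_ml), 0) FROM reagent_usage WHERE bottle_lot = ?",
        (bottle_lot,)).fetchone()
    return initial_volume_ml - used
```

Both checks are only as good as the set of usage rows visible to the querying device at the moment it asks, which is exactly what the two architectures below differ on.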

The client wants a recommendation on which of the two architectures should be used:

1.) Each device will write data to its own local database as it does now. Synchronization software will be installed on each device and synchronization will be performed in real time. Each device will periodically broadcast a heartbeat (intervals of 1 to 5 minutes have been proposed) and this heartbeat will contain a CRC checksum. Every device on the network will listen for heartbeats. A device will initiate a sync if the heartbeat CRC differs from its own. The sync software will be external to, and independent of, the software that runs tests. Therefore it is theoretically possible, but not probable, that a device will run while it is disconnected from the network or while the sync software is not running.
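For clarity, here is a rough sketch of what option 1 implies; the heartbeat format, port, and helper functions are my assumptions, not part of any existing design:

```python
# Rough sketch of option 1 (heartbeat + CRC compare); everything here is hypothetical.
import json, socket, zlib

DEVICE_ID = "device-01"        # assumed identifier
HEARTBEAT_PORT = 50000         # assumed port
HEARTBEAT_INTERVAL_S = 60      # 1-5 minutes proposed; 1 minute shown here

def dump_local_database():
    # Placeholder: a canonical, ordered dump of the local database would go here.
    return []

def start_sync_with(device_id):
    # Placeholder: this would hand off to the external synchronization software.
    print(f"checksums differ, syncing with {device_id}")

def local_db_checksum():
    return zlib.crc32(json.dumps(dump_local_database(), sort_keys=True).encode())

def broadcast_heartbeat(sock):
    msg = json.dumps({"device": DEVICE_ID, "crc": local_db_checksum()}).encode()
    sock.sendto(msg, ("255.255.255.255", HEARTBEAT_PORT))

def handle_heartbeat(raw):
    peer = json.loads(raw)
    if peer["device"] != DEVICE_ID and peer["crc"] != local_db_checksum():
        start_sync_with(peer["device"])

if __name__ == "__main__":
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
    broadcast_heartbeat(sock)
```

Note that a device that is off the network, or whose sync process is not running, simply stops hearing and sending heartbeats; nothing in this scheme prevents it from continuing to run tests against a stale local database.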

2.) The database server on each device will be removed and a single, central database server will be used instead.

The client is concerned that if a database server is used, all devices on the network will be rendered unusable in the event of server failure. Does using a peer topology effectively mitigate this risk? In other words, if one peer on the network fails, is it business as usual for all other peers? Are any data integrity dangers or benefits associated with either approach?

Edit in response to answers from iag and MikeyB:

I can see how my question leaves room for ambiguity, so here it is again, hopefully phrased in a more meaningful way.

In a client-server environment, server failure is catastrophic because if the server fails, all clients are shut down. Given that design feature, why do some highly critical information, inventory, financial, and medical systems implement client-server architecture as opposed to peer-to-peer?

Please note I am NOT asking "How do I mitigate the risk of server failure?" I AM asking "Is peer-to-peer architecture an effective way to mitigate the risk of server failure?" Why or why not? Does the topology of the network influence the design of the application? Does peer-to-peer introduce the possibility of data corruption or ambiguous results?

Is the following a realistic example of what might occur in a peer-to-peer network topology?

DeviceA, DeviceB, and DeviceC are computers on a peer network that share a common agent called agent R. Whenever a peer needs to check how much R is available, it synchronizes with other peers and calculates the availability. One day at about 1pm, the lab technician inserts a bottle of R into DeviceB. DeviceB immediately syncs with DeviceC and confirms that DeviceC has never consumed R from that bottle. DeviceA, however, has not been responding to pings since noon. Can DeviceB reliably calculate the quantity of R available in the bottle?
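To make the arithmetic in that scenario concrete (the numbers and names below are hypothetical), suppose remaining volume is computed as the initial fill minus every usage event the querying device can currently see:

```python
# Hypothetical illustration of the DeviceA/DeviceB/DeviceC scenario.
INITIAL_FILL_ML = 500.0

# Usage logs visible to DeviceB at 1pm. DeviceA has been silent since noon,
# so its log is simply absent from this picture.
visible_logs = {
    "DeviceB": [("R-lot-42", 30.0)],
    "DeviceC": [("R-lot-42", 20.0)],
}

def remaining_ml(bottle_lot, logs):
    used = sum(vol for log in logs.values() for lot, vol in log if lot == bottle_lot)
    return INITIAL_FILL_ML - used

print(remaining_ml("R-lot-42", visible_logs))   # 450.0 -- correct only if DeviceA drew nothing
```

DeviceB can only report "450 ml, assuming DeviceA has not touched this bottle since noon"; it has no way to distinguish "DeviceA used nothing" from "DeviceA used some and is offline."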

I am a software engineer and I will be writing the application that allows these devices to share data over a network. Honestly, I have an opinion about the question I am asking; however, my client does not trust my experience. I want to know the experience of my peers, hence my post here. I do not want to put words in anyone's mouth, so I am trying to be as general as possible while still explaining the issue.

Sam

3 Answers


I'm seeing a lot of possible issues here.

First, you have been provided two half-baked solutions for consideration that are both difficult to manage as presented and fault intolerant.

Second, you appear to be confused about how to construct data services. This is more concerning.

I'm not sure what your engagement situation is with the environment described, but I would recommend doing nothing until better requirements have been defined and there is a better plan for achieving them than random boxes running lots of databases with no backups (live or otherwise).

If your concern is lab inventory, there is plenty of software out there that addresses this. If you're working with proprietary weirdness from a vendor, establish their environmental requirements and find a way to access and retain this data with some level of assurance. I assure you that it has been done before.

None of this is going to happen just by posting vague questions on this forum. If you feel out of your depth, you should get a few hours of a consultant's time to assist you.

iag
  • My question is about network topology, just like it says in the subject. You seem to have some background in inventory management software... do you know of any lab management systems that are peer-to-peer? Can you explain the technical reasons why an inventory management system would or would not use a peer topology as explained in my question? – Sam May 13 '14 at 18:41
  • @Sam the network topology is not the hard part. It really boils down to redundant connections and each node being able to try different paths without needing somebody else to tell it to. The hard part is dealing with failures where one of the devices suddenly drops off the network or, even worse, starts sending corrupted data. It is surprisingly difficult to maintain data consistency in such a situation. One starting point to read is this: http://en.wikipedia.org/wiki/Byzantine_fault_tolerance But remember, this is complicated stuff. – kasperd May 13 '14 at 18:48
  • Your edits make me worry more actually. Do you just want to code up some p2p filesystem? Planning on doing it from scratch, or do you want to use an existing architecture to achieve this? The reason why people centralize data that they care about is pretty clear. Distributed filesystems *could* work if you can deal with their additional problems, but as kasperd points out, more complexity for its own sake wouldn't be something that I would recommend. – iag May 13 '14 at 23:17
  • @kasperd This is a database application. I want to write them a boring ol' client/server n-tier application. They are convinced, however, that the server is a single point of failure (it is) and that peer-to-peer mitigates the risk (it doesn't, unless someone can convince me otherwise, which is why I'm here). My hope is that someone more eloquent than myself can explain the problem from an epistemological standpoint (kasperd?) – Sam May 14 '14 at 15:53

In the given environment, it appears essential that there is a single source of information for the data. Is that true? We can't tell.

There are always going to be failure points - you need to design around what is acceptable.

You have to come up with the constraints around your system. Must there be a single source of data? Can a device use inventory while offline? Can a single server failure be tolerated? Can the system tolerate operating in read-only mode for a short while?

Once you have those constraints you'll find that the how of designing the system arises from the constraints.

MikeyB
  • Please see my edit to my question. Not sure what you mean by "single" source of data. There definitely needs to be an authoritative source of data, which implies singularity in its origin. – Sam May 13 '14 at 19:52

A peer-to-peer software architecture can be an efficient and fault-tolerant way to disseminate information between nodes, assuming you already have redundancy in the underlying network.

A peer-to-peer architecture can also protect you against data loss if multiple nodes keep the data. In typical peer-to-peer systems, nodes keep data out of their own interest. What you want is different, since you want them to keep data because a policy requires it rather than out of individual interest.

Each node storing everything it ever saw is simple as long as the amount of data is limited. But storing everything may not be practical due to storage space (or, in some scenarios, due to legal requirements). Then one needs to be careful about what to delete and what to keep. This is one of the major pitfalls.

But all of this does nothing to address the issue of data integrity and data consistency. If you simply switch to a peer-to-peer architecture without giving correctness of data a thought, then the robustness of the system in that respect will go down. There are simply many more places where corruption can be introduced.

In order to implement such a solution you need to figure out how to validate the integrity of a piece of data.

A piece of data which can only ever be updated by one specific node in the system is the easiest to deal with. But you still have to ask what acceptable behavior of the system is if that node starts misbehaving. Having the node cryptographically sign each update is not sufficient if it could erroneously send out a signed update to delete everything it previously wrote, or send out multiple signed updates that disagree on what the new value of the data is. A simple approach, again, is to store everything and require manual intervention if conflicting updates show up. But if you ever need any sort of automated decision to be made based on the data, then that is insufficient.
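As a rough illustration of the store-everything-and-flag-conflicts approach (the signature check here is just a stand-in, not a real scheme):

```python
# Sketch: a signature proves who wrote an update, not that the writer was consistent.
from collections import defaultdict

seen = defaultdict(dict)   # (writer, key) -> {version: value}

def accept_update(writer, key, version, value, signature_valid):
    if not signature_valid:
        return "reject: bad signature"
    previous = seen[(writer, key)].get(version)
    if previous is not None and previous != value:
        # The same writer signed two different values for the same version.
        return "conflict: flag for manual intervention"
    seen[(writer, key)][version] = value
    return "stored"

print(accept_update("node1", "bottle-42", 1, "used 30 ml", True))   # stored
print(accept_update("node1", "bottle-42", 1, "used 80 ml", True))   # conflict
```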

If only one node can update the data, but you have a strict requirement about everybody else agreeing on what update it did perform, then the problem becomes slightly more difficult.

The solution to this problem is still not extremely complicated, and it gives a good idea of the sorts of methods used to solve such data integrity problems.

  • The updating node signs the updated data and distributes it through the peer-to-peer network.
  • Receiving nodes sign the first version they receive and send it back to the updating node.
  • Once the updating node has signatures from more than 2/3 of all nodes (including itself), it distributes the data through the peer-to-peer network again, together with the collection of signatures.
  • Every node which receives this version, validated by signatures from more than 2/3 of the nodes, will keep retransmitting it (with exponential backoff) to all nodes which have not yet confirmed that they have permanently stored the final version of the data.
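A toy model of the signature-collection step (quorum counting only; the node interface and real cryptography are omitted):

```python
# Toy model of the 2/3-quorum step described above; no real signatures are involved.
import math

class Node:
    def sign_first_version_seen(self, update):
        # Stub: a real node would return a signature over the first version it received,
        # or None if it had already signed a different version.
        return object()

def quorum_size(total_nodes):
    # Strictly more than two thirds of all nodes, counting the updater itself.
    return math.floor(2 * total_nodes / 3) + 1

def collect_signatures(update, nodes, updater):
    signatures = {updater}                              # the updater signs its own update
    for node in nodes:
        if node is updater:
            continue
        if node.sign_first_version_seen(update) is not None:
            signatures.add(node)
        if len(signatures) >= quorum_size(len(nodes)):
            return signatures                           # enough proof to redistribute
    return None                                         # quorum not reached; cannot commit

nodes = [Node() for _ in range(4)]
print(collect_signatures("new value", nodes, updater=nodes[0]) is not None)   # True
```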

The node allowed to send the update in the first place could fail in a way that would prevent the data from ever being updated again. But as long as it sends out a consistent update, that update will end up being stored consistently throughout the peer-to-peer network.

It may sound as if the large number of signatures needed on every piece of data is going to require lots of storage space. Luckily this can be avoided through a method known as threshold signatures.

But if you want to replace a database, it is not enough that one node can update a piece of data. You have multiple nodes that are allowed to update the same piece of data, but you require the entire network to agree on who was first. This is where Byzantine agreement comes into the picture.

The solutions to this are an order of magnitude more complicated than what I described above. But I can mention a few key results to be aware of.

You have to choose between two failure models. You can assume that a failing node simply stops communicating and never sends out a single corrupted message. This model requires less hardware, but it takes just a single flipped bit to bring the system down.

Alternatively, you can choose the Byzantine failure model, which allows a failing node to do anything, and the system will still survive. In order to tolerate t failures in this model, you need 3t+1 nodes in total. In other words, in order to tolerate a single failing node you need four nodes. If you have 10 nodes in total, it is possible to tolerate the failure of 3 nodes.
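The arithmetic, spelled out (just restating the numbers above, with the roughly fifty devices from the question added for scale):

```python
# n = 3t + 1  =>  a network of n nodes tolerates t = floor((n - 1) / 3) Byzantine failures.
def tolerable_byzantine_failures(n):
    return (n - 1) // 3

print(tolerable_byzantine_failures(4))    # 1
print(tolerable_byzantine_failures(10))   # 3
print(tolerable_byzantine_failures(50))   # 16
```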

You also have to choose between a synchronous or an asynchronous communication model. Synchronous communication means you make assumptions about the timing of communication. If packets take longer to reach their destination than assumed, the system breaks down. Moreover, if a node crashes, you have to wait for the maximum allowed delay before the system can continue.

An asynchronous model makes the software design more complicated, but it has some clear advantages. You don't have to wait for timeouts; instead you just wait until you have heard from more than 2/3 of the nodes before you can continue, which can be much faster than a synchronous model where you need a large timeout.

Another drawback of the asynchronous model is that it must be randomized. The running time of the algorithm becomes a stochastic variable without a worst-case bound. There is a theoretical possibility that an update will take infinite time, but the probability of this can be shown to be zero, and in fact the average number of communication round trips can be shown to be constant. To me this looks much more favorable than the synchronous model, which can break down in case of delayed communication.

As you can imagine, getting such a system right is not an easy task. It takes a dedicated development effort to implement this. Moreover, a software bug can still take the system down. If fewer than a third of the nodes fail, the system will survive. But if a bug exists in the software, you might very well install that buggy software on more than one third of the nodes.

kasperd
  • @kasperd thank you very much for your detailed response. I will show this thread to my client and let them come to their own conclusions. Would you kindly address the paragraph in my question that begins with "DeviceA, DeviceB, and DeviceC are computers on a peer network..."? I want to know how a peer can act on a shared resource (i.e. a bottle in my example) if one or more peers are unavailable. My example says the bottle was received at 1 pm but the last communication from the missing peer was at noon. How can a reliable result be calculated if it is not known whether the missing peer acted on the bottle? – Sam May 14 '14 at 20:13