2

We have a data warehouse that stores prices for our products. Once every business day, new data points are added representing the day's price for each product. On the website, a user sees this information on a page load. There are currently 7 products; analytics shows that the amount of page loads is about 100 per day.

The website server is an external server to our data center. The data warehouse server is internal to our data center and has other critical data (in different tables) not related and should not be accessible to the website.

The following approaches have been considered:

  1. The data warehouse could daily push (SFTP) a CSV file containing the daily data to the web server. The web server would have a process running on a crontab every 15 minutes. It would check if the file had changed. If so, it would update its database with the data. Then, on page loads, the web server would query its database to get the data to display on the web page.

    Usually, the push would only be once a day, but more than one push could be possible to communicate (infrequent) price corrections. Even in the price correction scenario, all data would be delivered in the file. The polling process would pick up the change and overwrite the data in the database.

  2. The web server could request data from the data warehouse using JDBC or similar SQL connection technology.

    However, a security concerns have been voiced. The concern is that by allowing the web server access to our data warehouse, SQL injection attack or some other external attack through the website could compromise the data warehouse. Measures could be put in place to reduce the risk, but the easiest and safest approach has been suggested to simply not allow any public facing system to directly access the data warehouse. In other words, the data warehouse can establish a communication with other servers (say to SFTP the file), but no server can initiate a connection with the data warehouse. Do these concerns seem reasonable and hard to mitigate?

  3. A web service could be built, and the web server could call the web service that is hosted in our internal data center.

  4. A process hosted in our internal data center could call the web server when it knows the data warehouse has data available. If so, how should this be done? HTTPS with some way to prevent other unauthorized clients from making the same call?

Which of the above approaches is best or is there a better approach than listed above? What are the pros and cons to an approach?

Gilles 'SO- stop being evil'
  • 50,912
  • 13
  • 120
  • 179
James
  • 123
  • 2

2 Answers2

1

Given that your warehouse contains additional data that you don't want to expose a new attack vector to it sounds like your best option is to implement some kind of basic caching layer for this data on the web server that the warehouse can push to.

You'd have some kind of service running in your secure warehouse environment pushing a serialized representation of your needed data to a key-value store or similar (redis, memcached, whatever) running on your web server. This should be more straightforward and potentially more secure than having to deal with a CSV file. It's just a question of the service on the data warehouse's side knowing when and how to bust the cache.

As you mention there's no reason you can't connect to your web server with a secure HTTPS connection to do the push. Alternatively/additionally you could implement some basic encryption to protect the data while in transit, though it sounds like the data your web server needs isn't actually sensitive data so this step may be overkill.

Again, because your data warehouse has sensitive data in it I'd advise against trying to build some kind of API that the web server can query. There's nothing intrinsically wrong with doing so, but it certainly opens up attack vectors and makes the project much more complicated that it seems like it needs to be as it doesn't sound like you need to do different types of querying, you just need one particular chunk of data. Which really makes this a "one way conversation" where the burden is on the warehouse to make sure its "public" data is pushed out.

If in the future there's a need to query, update, or otherwise have a "two way conversation" with the warehouse from the web server it would be wise to silo that data away from the sensitive data before deploying essentially "open" API endpoint(s).

Good luck!

Matt Surabian
  • 539
  • 3
  • 5
0

Of your options, 3 - Exposing a secure Web-service is best.

Avoiding antiquated (option 1) and unnecessarily dangerous (option 2) options, there are ways to secure webservices. The NIST puts out a really good read at http://csrc.nist.gov/publications/nistpubs/800-95/SP800-95.pdf, and I recommend you give it a thorough read-through before beginning.

You can provide both authentication and authorization mechanisms when implementing these services, usually with the WS-Security suite. (The XML nodes can be individually encrypted and signed). In addition, there are many tutorials as well as questions here towards securing services.

As far as cons (be sure to read in the above PDF for 'Security concerns of WS-Security', which covers pitfalls of improper setup) go, the development cost of a properly secured webservice would be relatively high to your other options, but the benefits of authentication and authorization far outweigh them.

Rubber Duck
  • 516
  • 1
  • 5
  • 16