
I work for a small company and we are in need of a data warehouse.

Our production database has around 50 GB of data (currently growing ~10 GB/year). Our server is running a little over its capacity, and we think we could move some historical data to a data warehouse (around half of that 50 GB could be moved) so it can run smoothly again.

Of course, the data warehouse would have all the data ETL'd to it, not just the historical data. This way we could also run those costly reports and dashboards against the DW instead of the production server.

I intend to ETL the data to the DW and store it using a snowflake schema, and afterwards I plan to create some data marts for reporting and BI. These data marts would use star schemas, to make them simpler (and faster?) to query.

We are inclined to use MariaDB for it, which brings me to my main question: which storage engine best fits our case, InnoDB or ColumnStore? And how much would this decision impact the sizing of the server it will run on?

My guess, from what I've read so far, is that ColumnStore may be faster and more appropriate for our use case, but would also need better hardware. Right now we can't afford more than a single server with 4 CPU cores and 32 GB of RAM (our business has been severely impacted by the global pandemic; we are getting back on our feet, but we are not there yet).

So, given the above server specs and use case, would you still recommend ColumnStore over InnoDB? We are even open to solutions other than MariaDB.

  • Does this answer your question? [Can you help me with my capacity planning?](https://serverfault.com/questions/384686/can-you-help-me-with-my-capacity-planning) – djdomi Oct 08 '21 at 14:33
  • I think my question is more specific than just dimensioning a server. I have a limited budget and would like to know which database solution would work better on it. – Henrique Miranda Oct 08 '21 at 14:37

1 Answer


Engine: InnoDB. Period. (Sure, 1% of use cases are better off with something else, but yours does not seem to indicate the need for a different engine.)

Snowflake: Terrible, especially if you need to search on a "range". Please provide the schema (preferably via SHOW CREATE TABLE); I will be more specific. (Then I may agree that Snowflake is good, but I doubt it.)
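
A minimal sketch of the problem, using the hypothetical chain from the comments below (`Fact` -> `Address` -> `City` -> `Country`): fetching Fact rows for one country must walk the whole chain of joins before any filtering of the big table can happen.

```sql
-- Hypothetical snowflake chain: three joins stand between the filter
-- and the (large) Fact table.
SELECT f.*
FROM Country co
JOIN City    ci ON ci.country_id = co.country_id
JOIN Address a  ON a.city_id     = ci.city_id
JOIN Fact    f  ON f.address_id  = a.address_id
WHERE co.name = 'Portugal';
```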

Star schema -- Good. Normalizing common strings: good. Normalizing 'continuous' values (dates, ints, floats): bad. The purpose of normalizing is to save disk space and hence speed up some queries.
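
As a minimal DDL sketch of that guideline (table and column names are hypothetical, not from the question):

```sql
-- Star-schema Fact table: repeated strings live in dimension tables;
-- 'continuous' values such as the date stay inline.
CREATE TABLE FactSales (
    sale_date  DATE NOT NULL,               -- continuous: keep inline
    product_id SMALLINT UNSIGNED NOT NULL,  -- FK to a Product dimension
    store_id   SMALLINT UNSIGNED NOT NULL,  -- FK to a Store dimension
    qty        INT NOT NULL,
    amount     DECIMAL(10,2) NOT NULL,
    INDEX(sale_date)
) ENGINE=InnoDB;
```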

10GB/year -- that sounds like a "few" rows per second on average. Heavy, but not terribly heavy. That is, it does not sound like you will need help with the ETL processing.
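
(Back-of-the-envelope, assuming ~100-byte rows: 10 GB/year ÷ ~31.5 million seconds/year ≈ 320 bytes/second, so roughly 3 rows per second on average.)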

Data Warehousing -- http://mysql.rjweb.org/doc.php/datawarehouse

Purge old data -- This is one of the few uses for PARTITIONing. http://mysql.rjweb.org/doc.php/partitionmaint
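
A sketch of that pattern (names and date ranges are illustrative; note that the partition key must appear in any PRIMARY or UNIQUE key):

```sql
-- Monthly RANGE partitions: purging a month of history becomes a quick
-- metadata operation instead of a huge DELETE.
ALTER TABLE FactSales
    PARTITION BY RANGE (TO_DAYS(sale_date)) (
        PARTITION p2021_09 VALUES LESS THAN (TO_DAYS('2021-10-01')),
        PARTITION p2021_10 VALUES LESS THAN (TO_DAYS('2021-11-01')),
        PARTITION pFuture  VALUES LESS THAN MAXVALUE
    );

ALTER TABLE FactSales DROP PARTITION p2021_09;  -- drop the oldest month
```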

Splitting into separate tables that are kept online -- likely to be a hassle but with very little benefit.

Costly reports --> Summary tables (http://mysql.rjweb.org/doc.php/summarytables). Summary tables are much smaller than a Fact table; it is even acceptable to denormalize them.
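
A minimal sketch (hypothetical names, matching the Fact-table sketch above): roll the facts up by day, then point reports and dashboards at the roll-up.

```sql
-- Daily roll-up, refreshed after each ETL load; far smaller than FactSales.
CREATE TABLE DailySales (
    sale_date  DATE NOT NULL,
    product_id SMALLINT UNSIGNED NOT NULL,
    total_qty  INT NOT NULL,
    total_amt  DECIMAL(14,2) NOT NULL,
    PRIMARY KEY (sale_date, product_id)
) ENGINE=InnoDB;

INSERT INTO DailySales
SELECT sale_date, product_id, SUM(qty), SUM(amount)
FROM FactSales
WHERE sale_date = CURRENT_DATE - INTERVAL 1 DAY
GROUP BY sale_date, product_id;
```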

ColumnStore -- One big plus is the significant compression it gives. But I don't see your 50GB as being very big. Another benefit of CS is automatic "indexing" of every column. However, only one column can be used for the two-level efficiency of lookup.

4 cores -- plenty for InnoDB; more cores would be useful for CS.

32GB RAM -- With only 50GB of data and 10GB/year -- If all you do is look at the latest year's data, 32GB is more than enough. If you frequently scan all 50GB, then there will be a lot of I/O. If you implement Summary Tables, then 32GB is overkill for most activities. (The Summary Tables might be under 10GB and go back to the beginning of the data; hence very cacheable.)
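
For example (a rule-of-thumb sketch, not a tuned value): on a dedicated 32 GB box, InnoDB is commonly given around 70% of RAM for its buffer pool.

```sql
-- ~20 GB buffer pool on a dedicated 32 GB server; dynamic since MariaDB 10.2.
-- Also set innodb_buffer_pool_size in my.cnf so it survives a restart.
SET GLOBAL innodb_buffer_pool_size = 20 * 1024 * 1024 * 1024;
```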

32GB + CS -- Your 50GB will become about 5GB. (But I don't know if the 32 will be overkill.)

HDD vs. SSD -- SSD is noticeably faster.

Bottom line (and budget) -- The techniques mentioned above can keep InnoDB on 32GB humming along fine for several years.

Rick James
  • Thanks for your comments. I have a better understanding of what I have to do now. As for not using a snowflake schema, what would you suggest instead? My goal is for the DW to contain everything from our production databases and then, from it, I would extract some fact tables and dimensions (also summary tables) for reporting and BI. – Henrique Miranda Oct 08 '21 at 19:29
  • @HenriqueMiranda - re Snowflake: Show me a specific example so I can give some specific comments. One that comes to mind is `Fact` -> `Address` -> `City` -> `Country`; then searching for `Fact` rows for a certain `country_id` is really messy and slow. – Rick James Oct 08 '21 at 19:38
  • I agree, but that data would not be queried very often. Most queries would happen on the data marts that make use of star schemas. – Henrique Miranda Oct 08 '21 at 20:42
  • @HenriqueMiranda - OK. – Rick James Oct 08 '21 at 20:54