Data Engineering

Hot and Cold data

Any sufficiently big system will invariably need to have Hot and Cold data. What is the difference? Hot data is part of the active data set that is used for decision making, while Cold data is data in storage for historical analysis and handling of exceptional situations.

What you do not want in your system? You do not want to have situation where you must use all of your data for decision making! With the growth of data set, each query will need to go through all of the data to get the answer, as dataset gets bigger, answers will take more time. Hardware increase is linear, while this problem is exponential, so you will hit the limit.

This problem is sneaky, because at the start of the development you are not going to see it, everything goes great, until suddenly timeouts starts happening, and systems stops feeling so responsive as in the first days. If you get this problem in first year…. start talking to developers…. whole application architecture is probably not designed for scaling. You are risking massive fail during scale out phase, with a whole application redesign looming over the horizon. Do not go into redesign project with the same guys who designed initial system, they will fail again. Do not go into this redesigning with the new hires, they will fail with the same results as initial development.

What to do? Hire consultants or experts that did similar project in the past to help out. And no, you will not find experts in sub 35 year old range. You will find talents that can implement the system, but not an experts who can design it to last and scale. It is just a matter of having enough failures in the pocket, and enough maturity to learn from these failures.

Data Engineering

Evolution of the information system and data flow

This short novel, is inspired by actual events that happened between May 2005 and May 2012…. all of it is true as far as I remember.

First steps was manual report generation….data analyst way. Unfortunately this was not feasible due to number and frequency of reports. Management wanted to track sales in real-time.

Next step was generating reports that were using live data. At the time it seemed like a great idea, however very soon we found out that you can not have fast insert to database and fast search. Every report requested by anyone in the building was slowing down the data server.

This was the lesson why you need to split OLTP and OLAP databases.

Second step was nightly automatic data parsing and nightly database load, preparing data used by reports. However management still wanted to have some data coming in real-time, so we had implemented probably first version of map-reduce algorithm that I know of. Real-time data was parsed and handled. All reports requiring real-time data used very small tables that were updated multiple times per second.

With time, this reports and web pages supporting them where spread across the company, and everyone was able to see everything. Obviously, top management did not like the idea that coffee lady can check the company earnings etc. Solution was to integrate all this separate pages into single application with log-in, permissions, security etc.

After integration of all web pages into single application, we faced the never ending requests for additional reports from all over the company. In addition to supporting nightly data ETL, reports preparation we also developed new lottery games and integrated our system with external customers. All this efforts, just caused proliferation of reports.

Down the line, top management got saturated with different reports signifying different KPI. And business decision was to focus on providing single set of KPIs aligned with business goals, instead of ad hoc KPIs defined by each department.

Few years later due to falling sales, Marketing and Sales started coming up with requests for reports that can be summed up in: “compare everything to everything else”. Only way to mitigate reports explosion was self service using pivot technology from Microsoft and data warehouse SSAS from that time. Using this technologies dev team got ahead of all requests. Data warehouse got data prepared on nightly basis. Processing during night took around 6 hours. We had multiple redundancies and consistency checks of the system.

Downside of this technology was that one excel file became new Information System, because everyone had a feeling of control and they could do whatever they wanted, some entrepreneurial colleagues started making their reports that used some of the data from DW.

Suddenly we were scrambling across company and we were keeping track of the Excel file version.

After introduction of Excel, we found out that no-one is logging and using IS.

And hilariously we got to the same situation we had at the start, coffee lady complained about her salary when she saw how much money company earned (while she ignored expenses).

Now we had to introduce the permissions on DW tables, to limit who can see what. Whole situation finally blowed when CEO sent the Excel with CEO privileges to payments, and payments accidentally sent the CEO Excel to sales…… and suddenly everyone was using CEO level access.

All this time our chief IT guy was Linux proponent, while the company was running Windows clients and mixed Windows and Linux servers.

This excel debacle was final straw that gave development team enough ammunition to push IT to support move to Active Directory. Introduction of Active Directory removed all the problems we had with permissions and privileges. Suddenly there was no need to log-in into IS, everyone could use any Excel and it worked out of the box.

It was a wild time, we were solving all this problems on the go while the systems are running, games are played by 7 million auditorium. There was no support for management to stop everything for a few days while we put new system, only way to stop anything for few hours was major crisis and breakdown in the system.

Today that majority of the system can be implemented by one person and all reports could be done using power BI, tableau, qlik view etc.