Big Data – data lake versus data leak

Opportunities and dangers of Big Data

By Klaus-Peter Kaul On Jan 11, 2021

Data and information play an increasingly important role in companies and are becoming a significant production factor. The concept of so-called data lakes promises much when it comes to the analysis of such data and information, supported by machine learning and artificial intelligence. However, there are not only advantages.

The advent of Big Data and scalable information retrieval, relying among other things on Lucene-based storage clusters, has led to a renaissance of analytics techniques. Tracking knowledge across the enterprise has become possible, and many have benefited. Being able to determine the relationship between response time and online sales is a huge step. You can find out that productivity goes down when production downtime goes up, or which specific delays have the biggest economic impact on an airline. This is valuable business information that is not obvious in large amounts of raw data. Machine learning now makes this far easier and, thanks to its high performance and volume coverage, can be applied to the immense amounts of data stored in what are known as data lakes.

A data lake can hold very heterogeneous and unstructured data. Examples include photos, videos, emails, Word documents, and data from other, otherwise unrelated systems. Data lakes are particularly popular wherever large amounts of sensor data exist, for example data that records the health status of IoT devices. This variety of data, which must be collected and combined for analysis, is what made the data lake concept so widespread.

However, data lakes also harbor dangers that not everyone is aware of

Perhaps the tail has started wagging the dog. Combining a lot of “strategically selected data” with “selected machine learning algorithms” has created considerable added value for companies. It would therefore seem only logical to dive into more and more data to create even more value from it. In fact, the opposite is the case: the value contributed by each addition decreases as the data grows. Each additional data set overlaps with information already known, so the value added becomes smaller and smaller. However, this has not stopped most companies from simply collecting all the data they can from as many different sources as possible. Many companies hope that machine learning will ultimately generate the added value.
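The overlap argument can be illustrated with a small toy model. The sketch below is not taken from any real analysis: it assumes that "information" consists of items drawn at random from a fixed universe, and the function name `marginal_new_items` and all parameter values are hypothetical. It simply counts how many previously unknown items each additional data set contributes.

```python
import random


def marginal_new_items(num_datasets, items_per_dataset, universe_size, seed=0):
    """Return, for each data set added in turn, the number of items not already known.

    Toy assumption: every data set is a uniform random sample from the same
    fixed universe of possible information items.
    """
    rng = random.Random(seed)  # seeded so that repeated runs give the same numbers
    known = set()
    gains = []
    for _ in range(num_datasets):
        dataset = set(rng.sample(range(universe_size), items_per_dataset))
        gains.append(len(dataset - known))  # only unseen items count as added value
        known |= dataset
    return gains


gains = marginal_new_items(num_datasets=10, items_per_dataset=300, universe_size=1000)
for i, gain in enumerate(gains, start=1):
    print(f"data set {i:2d}: {gain:3d} new items")
```

In this model the first data set is entirely new, while later ones mostly repeat what is already known. Real data is of course not uniformly random, so the numbers only indicate the direction of the effect, not its size.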

Price plays a role as well. Since storage space has become so cheap, the usefulness of storing data is no longer even questioned, and no one asks what harm it might bring with it. After all, deletion is always an option if it turns out that the data is not needed. Nevertheless, we have seen a significant number of attacks targeting these storage clusters. From brute-force attacks on passwords to the exploitation of software flaws, attackers continue to find ways to get at these enterprise data vaults. The more data is centralized in a single “location,” the greater the damage if it ever falls into the wrong hands. A data lake can thus turn into an unfortunate data leak.

Although there is clear added value in bringing data together for analytical purposes, the risk of a possible data leak must still be clearly assessed and taken into account. By comparison, decentralized data comes with an implicit measure of security: it is more difficult for hackers or malicious insiders to walk out with all the crown jewels at once. A company must therefore be aware that, once its data is in a data lake, it is also accepting a loss of control.

Prioritize data protection and network security

Similar arguments have been, and are still being, raised in the discussion around the cloud and around back-up solutions. Other products that have significantly influenced data protection solutions, data centralization and access control also play an important role in this context. For companies this is a clear challenge, and data lakes, or the centralization of data in general, are consequently a double-edged sword.

Ultimately, companies should think carefully about how data lakes are to be provided and used. What flows in can flow out. When deciding on storage and data offloading strategies, the potential impact of a data leak should therefore be considered from the outset. Often there is a way to keep data decentralized without giving up the analysis. Of course, proper network security measures must be followed here as well. Many analytics techniques can leverage existing database APIs, which allows data from many decentralized sources to be analyzed without pulling all of it into the data lake. The decentralized data can then be managed via each system's native access control mechanisms. Even with this arrangement data leakage is not entirely avoidable, but its extent remains significantly smaller.
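As a rough sketch of what analysis over decentralized sources could look like, the example below uses two in-memory SQLite databases as stand-ins for separate systems. The table name `orders`, its columns, and the helper functions are hypothetical and chosen only for illustration. Each source computes its own aggregate through its ordinary query interface, and only the small summaries are combined centrally.

```python
import sqlite3


def make_source(rows):
    """Create an in-memory database standing in for one decentralized system."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (region TEXT, amount REAL)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)
    conn.commit()
    return conn


def local_summary(conn):
    """Aggregate inside the source; only (row count, total amount) leaves it."""
    count, total = conn.execute(
        "SELECT COUNT(*), COALESCE(SUM(amount), 0) FROM orders"
    ).fetchone()
    return count, total


def federated_average(sources):
    """Combine the per-source summaries into one overall average order amount."""
    summaries = [local_summary(conn) for conn in sources]
    n = sum(count for count, _ in summaries)
    total = sum(amount for _, amount in summaries)
    return total / n if n else None


sources = [
    make_source([("CH", 120.0), ("CH", 80.0)]),  # hypothetical Swiss system
    make_source([("AT", 100.0)]),                # hypothetical Austrian system
]
print(federated_average(sources))
```

The raw rows never leave their source system here, so a compromise of the central analysis component would expose only the aggregates. In a real deployment each connection would additionally be subject to the source's own authentication and access control, which this sketch omits.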

Klaus-Peter Kaul

Klaus-Peter Kaul is Regional Sales Director for Alpine (Switzerland and Austria) at Riverbed Technology. A manager well versed in servers, storage, security and networks, he looks back on a career of more than 22 years at leading companies, including McAfee, Secure Computing, Veritas Software and SGI Silicon Graphics.