The volume of data entering a system can be enormous. At some point, however, that data must conform to some architectural standard. Data normalization involves tasks that make data more accessible to users. This includes, but is not limited to, steps such as removing duplicates and conforming data to a specified data model.
These processes can occur at different stages. For example, imagine that you work in a large organization with data scientists and a business intelligence (BI) team who rely on your data. You might store unstructured data in a data lake for your data science customers to use in their research. You might also store normalized data in a relational database, or in a more purpose-built data warehouse, for the BI team to use in its reports.
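As a minimal sketch of what normalization can look like in code, the Python snippet below conforms raw records to a single model and removes duplicates. The `Customer` model, the field names, and the keep-the-first-record rule are all assumptions made for illustration, not a prescribed approach.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Customer:
    """Hypothetical target data model that every record must conform to."""
    customer_id: int
    email: str


def normalize(raw_records):
    """Conform raw dicts to Customer and drop duplicates by customer_id."""
    seen = {}
    for raw in raw_records:
        record = Customer(
            customer_id=int(raw["id"]),
            email=raw["email"].strip().lower(),
        )
        # Keep the first record seen for each id (a deliberately simple rule).
        seen.setdefault(record.customer_id, record)
    return list(seen.values())


raw = [
    {"id": "1", "email": " Ann@Example.com "},
    {"id": 1, "email": "ann@example.com"},
    {"id": "2", "email": "bob@example.com"},
]
customers = normalize(raw)
print(customers)
```

In a real pipeline, the rule for resolving duplicates (first seen, most recent, most complete) would depend on the data model and on what your customers need.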
You may have more or fewer customer teams, or perhaps an application that consumes your data. The image below shows a modified version of the previous pipeline example:
Clients and access to data at various stages of processing
This image shows a hypothetical data pipeline and the stages at which different customer teams typically work.
If your customer is a product team, a well-designed data model is critical. A thoughtful data model can be the difference between a slow, barely responsive application and one that behaves as if it already knows what data the user wants to access. Such decisions are often the result of collaboration between product and data teams.
Data normalization and modeling are usually part of the transform step of ETL, but they are not the only tasks in this category. Another common transform step is data cleansing.
Data cleansing
Data cleansing goes hand in hand with normalization. Some even consider data normalization a subset of data cleansing. But while data normalization mainly focuses on bringing disparate data into line with some model, cleansing involves a number of actions that make the data more uniform and complete, including:
Casting the same data to a single type (for example, converting strings in an integer field to integers).
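The type-casting step can be sketched in a few lines of Python. The `to_int` helper and the `age` field are hypothetical, and mapping unconvertible values to `None` is just one possible policy.

```python
def to_int(value):
    """Return value as an int, or None if it cannot be converted."""
    try:
        return int(str(value).strip())
    except ValueError:
        return None


rows = [{"age": "42"}, {"age": 17}, {"age": " 8 "}, {"age": "n/a"}]
# Rebuild each row with the age field cast to a single type.
cleaned = [{**row, "age": to_int(row["age"])} for row in rows]
print(cleaned)
```

Returning `None` keeps the bad rows visible so they can be reviewed or filled in later, rather than silently dropping them.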
Data cleansing can fit into the deduplication and data model unification stages in the diagram above. In reality, however, each of these stages is very large and can include any number of steps and individual processes.
The exact steps you take to cleanse your data will greatly depend on the input data, the data model, and the desired results. However, the importance of clean data is undeniable:
Data scientists need it to improve the accuracy of their analysis.
Machine learning engineers need it to build accurate and generalizable models.
Business intelligence teams need it to deliver accurate reports and forecasts to the business.
Product teams need it to ensure their product does not crash or give users incorrect information.
Responsibility for cleansing data falls on many shoulders and depends on priorities and on the organization as a whole. As a data engineer, you should strive to automate cleansing as much as possible and perform regular spot checks on incoming and stored data. Your customers and management can provide insight into what counts as clean data for their purposes.
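A regular spot check can be as simple as validating a random sample of records. The sketch below assumes records are dictionaries; `spot_check` and `has_valid_age` are hypothetical names, and the age range is an arbitrary example rule that your customers would define in practice.

```python
import random


def spot_check(records, is_clean, sample_size=100, seed=None):
    """Return the records in a random sample that fail the is_clean check."""
    rng = random.Random(seed)
    sample = rng.sample(records, min(sample_size, len(records)))
    return [record for record in sample if not is_clean(record)]


def has_valid_age(record):
    """Hypothetical cleanliness rule: age is an int within a plausible range."""
    age = record.get("age")
    return isinstance(age, int) and 0 <= age <= 130


stored = [{"age": 34}, {"age": -1}, {"age": "41"}, {"age": 58}]
failures = spot_check(stored, has_valid_age, sample_size=4, seed=0)
print(failures)
```

Running a check like this on a schedule, and alerting when the failure count rises, is one way to catch regressions in upstream data early.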
Data availability
Data availability gets less attention than normalization and cleansing, but it is arguably one of the most important responsibilities of a customer-centric data engineering team.
Data availability refers to how easy it is for customers to access and understand the data. It is defined differently depending on the customer:
Data scientists may simply need data that is accessible through a query language.
Analyst teams may prefer data grouped by some metric available through basic queries or the reporting interface.
Product teams often need data that can be accessed through fast, simple queries that don't change frequently, with product performance and reliability in mind.
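The following sketch shows how one dataset might be exposed in these three ways, using Python's built-in `sqlite3` module. The `orders` table, the `revenue_by_region` view, and the `get_order` function are made up for illustration; a production setup would likely use a dedicated database or warehouse.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, region TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders (order_id, region, amount) VALUES (?, ?, ?)",
    [(1, "north", 20.0), (2, "south", 35.5), (3, "north", 10.0)],
)

# Data scientists: direct access through a query language.
all_rows = conn.execute(
    "SELECT order_id, region, amount FROM orders ORDER BY order_id"
).fetchall()

# Analysts: data grouped by a metric, exposed as a view for basic queries.
conn.execute(
    "CREATE VIEW revenue_by_region AS "
    "SELECT region, SUM(amount) AS revenue FROM orders GROUP BY region"
)
revenue = dict(conn.execute("SELECT region, revenue FROM revenue_by_region"))


# Product teams: a fixed, fast lookup by primary key.
def get_order(order_id):
    """Return one order as a tuple, or None if it does not exist."""
    return conn.execute(
        "SELECT order_id, region, amount FROM orders WHERE order_id = ?",
        (order_id,),
    ).fetchone()


print(all_rows, revenue, get_order(2))
```

The underlying data is the same in all three cases; only the access path changes to suit each customer.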
Because larger organizations provide these and other teams with the same data, many have moved toward building their own internal platforms for their disparate teams. A great mature example of this is Uber, which has shared many details of its impressive big data platform.
In fact, many data engineers are becoming platform engineers, which underscores the enduring importance of data engineering skills for data-driven enterprises.
Since data availability is closely tied to how the data is stored, it is a core concern of the load step of ETL, which determines how the data is stored for future use.
Now that you've got to know some of the typical data engineering customers and understand their needs, it's time to take a closer look at what skills you can develop to meet those needs.