Data munging
Data munging, or data wrangling, is loosely the process of manually converting or mapping data from one “raw” form into another format that allows for more convenient consumption of the data, with the help of semi-automated tools.
The data munging process may require the use of:
- Basic statistical computation: to spot values or patterns that are statistically implausible.
- Data visualization: to spot outliers or elements that fall outside the region of correctness.
- Parsing and sorting techniques: to make the data easy to inspect.
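As a rough illustration of the first and third points, the sketch below sorts a small column of values and flags the ones that fall outside the usual interquartile-range fences. The sample `ages` list, the function name `flag_outliers`, and the factor `k=1.5` are assumptions made for this example, not part of any particular tool.

```python
import statistics


def flag_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    # statistics.quantiles with n=4 returns the three quartile cut points.
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical ages from a survey; 420 is presumably a typing error.
ages = [23, 31, 27, 45, 38, 29, 420, 33, 41, 36]

# Sorting alone already pushes the extreme value to the end of the list.
print(sorted(ages))

# The statistical rule flags the same value without manual inspection.
print(flag_outliers(ages))  # [420]
```

A flagged value is only a candidate error. Whether it is really wrong still depends on what the data wrangler knows about how the data was collected.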
To perform this activity, the data wrangler has to know the data, how it was collected, and what to expect from it. The most common errors in data are:
- Impossible data combinations: values in different columns that cannot occur together in the same record.
- Out of range values.
- Missing values.
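The three kinds of error listed above can each be checked with a few lines of code. The following is a minimal sketch, assuming records stored as dictionaries. The field names (`age`, `has_driving_licence`), the allowed age range, and the minimum driving age of 14 are hypothetical choices for illustration only.

```python
def find_problems(record):
    """Return a list of human-readable problems found in one record."""
    problems = []

    # Missing values: a field that is None or an empty string.
    for field, value in record.items():
        if value is None or value == "":
            problems.append(f"missing value: {field}")

    age = record.get("age")
    if isinstance(age, (int, float)):
        # Out of range values: ages below 0 or above 120 (assumed limits).
        if not 0 <= age <= 120:
            problems.append("out of range: age")
        # Impossible combination: a young child holding a driving licence.
        elif age < 14 and record.get("has_driving_licence") is True:
            problems.append("impossible combination: age and has_driving_licence")

    return problems


# Hypothetical records, one clean and three with a single error each.
records = [
    {"id": 1, "age": 34, "has_driving_licence": True},
    {"id": 2, "age": 7, "has_driving_licence": True},
    {"id": 3, "age": -5, "has_driving_licence": False},
    {"id": 4, "age": None, "has_driving_licence": False},
]

for record in records:
    print(record["id"], find_problems(record))
```

The function reports problems instead of fixing them, which leaves the decision to correct or drop each record to the person who knows the data.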
Activities related to data munging are:
- Data editing: the process involving the review and adjustment of collected survey data. The purpose is to control the quality of the collected data. Data editing can be performed manually, with the assistance of a computer or a combination of both.
- Data cleansing, data cleaning or data scrubbing: the process of detecting and correcting (or removing) corrupt or inaccurate records from a record set, table, or database. Used mainly in databases, the term refers to identifying incomplete, incorrect, inaccurate, or irrelevant parts of the data and then replacing, modifying, or deleting this dirty or coarse data. Data cleansing may be performed interactively with data wrangling tools, or as batch processing through scripting. After cleansing, a data set should be consistent with other similar data sets in the system. The inconsistencies detected or removed may have been originally caused by user entry errors, by corruption in transmission or storage, or by different data dictionary definitions of similar entities in different stores.
- Data validation: the process of ensuring that a program operates on clean, correct and useful data. It uses routines, often called “validation rules”, “validation constraints” or “check routines”, that check for correctness, meaningfulness, and security of data that are input to the system. The rules may be implemented through the automated facilities of a data dictionary, or by the inclusion of explicit application program validation logic.
- Data curation: organization and integration of data collected from various sources, annotation of the data, and publication and presentation of the data such that the value of the data is maintained over time, and the data remains available for reuse and preservation.
- Data pre-processing: the step that precedes machine learning or data mining, where any data that needs it is cleaned.
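To make the data cleansing entry in the list above more concrete, here is a small batch-style sketch that modifies some values and deletes some records. Everything in it is an assumption for illustration: the `cleanse` function, the `name` and `country` fields, and the rule that country codes are stored in upper case.

```python
def cleanse(records, required=("name", "country")):
    """Return a cleaned copy of records; the input list is left untouched."""
    cleaned = []
    seen = set()
    for record in records:
        # Modify: trim stray whitespace from every string value.
        fixed = {k: v.strip() if isinstance(v, str) else v
                 for k, v in record.items()}
        # Modify: store country codes in one consistent (upper) case.
        if isinstance(fixed.get("country"), str):
            fixed["country"] = fixed["country"].upper()
        # Delete: records where a required field is missing or empty.
        if any(not fixed.get(field) for field in required):
            continue
        # Delete: records that duplicate an earlier one on the required fields.
        key = tuple(fixed[field] for field in required)
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(fixed)
    return cleaned


# Hypothetical raw records with inconsistent formatting.
raw = [
    {"name": " Ada ", "country": "uk"},
    {"name": "Ada", "country": "UK"},
    {"name": "", "country": "FR"},
    {"name": "Linus", "country": "fi "},
]

print(cleanse(raw))
```

Here the second record is dropped as a duplicate of the first once both are normalized, and the third is dropped because its name is empty. In practice the choice between correcting and removing a record depends on the data set.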