Exploratory Data Analysis
Exploratory data analysis (EDA) is an approach to analyzing data sets to summarize their main characteristics, often with visual methods. A statistical model can be used or not, but primarily EDA is for seeing what the data can tell us beyond the formal modeling or hypothesis testing task. Exploratory data analysis was promoted by John Tukey to encourage statisticians to explore the data, and possibly formulate hypotheses that could lead to new data collection and experiments. EDA is different from initial data analysis (IDA), which focuses more narrowly on checking assumptions required for model fitting and hypothesis testing, and handling missing values and making transformations of variables as needed. EDA encompasses IDA.
Tasks related to exploratory data analysis:
- Data profiling: examining the available data to understand it, identify valuable patterns, and decide how to extend or condense it to get the most value from it. This is usually done earlier, as part of Data Management and Data Governance.
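- Descriptive statistics: to get to know the data, summarize it with basic statistical tools. The usual starting points are univariate statistics (one variable at a time, such as mean, median and spread) and bivariate statistics (the relationship between two variables, such as correlation).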
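- Data dredging (data fishing, data snooping, equation fitting and p-hacking): the misuse of statistical testing, in which many tests are run on the same data and only the patterns that appear significant are reported. Knowing this pitfall matters in EDA: significance checks on basic patterns help you understand how to model the data, but they should be treated as hypothesis-generating and confirmed on new data to avoid overfitting.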
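
As an illustration of the descriptive statistics step, here is a minimal sketch using only the Python standard library. The function names `univariate_summary` and `pearson_r`, and the sample values, are hypothetical and chosen for this example. In practice a data analysis library would normally do this work.

```python
import math
import statistics


def univariate_summary(values):
    """Return basic univariate statistics for a sequence of numbers."""
    if len(values) < 2:
        raise ValueError("need at least two observations")
    ordered = sorted(values)
    # statistics.quantiles with n=4 returns the three quartile cut points.
    q1, _, q3 = statistics.quantiles(ordered, n=4)
    return {
        "n": len(ordered),
        "mean": statistics.mean(ordered),
        "median": statistics.median(ordered),
        "stdev": statistics.stdev(ordered),  # sample standard deviation
        "min": ordered[0],
        "q1": q1,
        "q3": q3,
        "max": ordered[-1],
    }


def pearson_r(xs, ys):
    """Return the Pearson correlation coefficient of two paired sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length (>= 2)")
    mean_x = statistics.mean(xs)
    mean_y = statistics.mean(ys)
    # Sum of cross-products of deviations from the means.
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cross / (spread_x * spread_y)


if __name__ == "__main__":
    # Made-up sample data, for illustration only.
    hours = [1, 2, 3, 4, 5]
    scores = [52, 55, 61, 64, 70]

    # Univariate: summarize each variable on its own.
    print(univariate_summary(hours))
    print(univariate_summary(scores))

    # Bivariate: measure how the two variables move together.
    print(pearson_r(hours, scores))
```

The univariate summary describes each variable separately, and the correlation coefficient describes one pair of variables. If you compute correlations for many pairs and keep only the strongest, you are drifting toward the data dredging described above. Treat such findings as hypotheses to test on new data.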