Preprocess data as per specifications
Data Scientist
About
This unit is about using a variety of techniques to preprocess data, that is, to clean and transform it
Scope
define the dataset, perform data preprocessing operations
Define the dataset
- define the format and structure for the dataset
- define indexes and organize variables as per the defined format
- identify data types for each variable of the dataset
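The three steps above can be sketched with pandas, one of the tools this unit names. This is a minimal illustration rather than a prescribed method: the records, the column names (customer_id, signup_date, segment, monthly_spend) and the chosen types are all hypothetical.

```python
import pandas as pd

# Hypothetical records; every column name and value is illustrative only.
records = [
    {"customer_id": 101, "signup_date": "2023-01-15", "segment": "retail", "monthly_spend": 42.5},
    {"customer_id": 102, "signup_date": "2023-02-03", "segment": "business", "monthly_spend": 310.0},
    {"customer_id": 103, "signup_date": "2023-02-20", "segment": "retail", "monthly_spend": 18.75},
]

# Format and structure: one row per customer, columns in a fixed order.
columns = ["customer_id", "signup_date", "segment", "monthly_spend"]
df = pd.DataFrame(records, columns=columns)

# Data types: state the intended type of each variable explicitly
# instead of relying on whatever pandas infers.
df = df.astype({
    "customer_id": "int64",
    "segment": "category",
    "monthly_spend": "float64",
})
df["signup_date"] = pd.to_datetime(df["signup_date"], format="%Y-%m-%d")

# Index: use the unique identifier as the row index and fail loudly
# if it turns out not to be unique.
df = df.set_index("customer_id", verify_integrity=True)

print(df.dtypes)
```

Declaring the types up front means a value that does not fit, such as text in a numeric column, raises an error at load time instead of surfacing later in the analysis.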
Perform data preprocessing operations
- identify and fix missing values in each variable of the dataset
- identify and fix incorrect data types in each variable of the dataset
- sort the data and create subsets of the data as required
- perform operations to transform data types of variables as required
- identify and deal with data redundancy by normalizing the dataset
- validate preprocessed data using appropriate tools and processes
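The operations listed above could look roughly like the following in pandas. The sketch rests on several assumptions that are not part of this unit: the data and column names are invented, "redundancy" is read narrowly as exact duplicate rows, and missing values are filled with the median or an "unknown" placeholder, which is only one of several possible policies.

```python
import pandas as pd

# Hypothetical raw data with typical problems: numbers stored as text,
# an empty amount, a missing region, and one exact duplicate row.
raw = pd.DataFrame({
    "order_id": ["3", "1", "2", "2", "4"],
    "region": ["north", "south", "south", "south", None],
    "amount": ["", "15.5", "9.0", "9.0", "7.25"],
})

clean = raw.copy()

# Fix incorrect data types: text columns become numeric columns.
# errors="coerce" turns unparseable entries (here the empty string) into NaN.
clean["order_id"] = clean["order_id"].astype("int64")
clean["amount"] = pd.to_numeric(clean["amount"], errors="coerce")

# Deal with redundancy (assumed here to mean exact duplicate rows).
clean = clean.drop_duplicates()

# Identify missing values: count them per variable.
missing_per_column = clean.isna().sum()

# Fix missing values (assumed policy: median for numbers,
# an explicit placeholder for categories).
clean["amount"] = clean["amount"].fillna(clean["amount"].median())
clean["region"] = clean["region"].fillna("unknown")

# Sort the data and create a subset.
clean = clean.sort_values("order_id").reset_index(drop=True)
south_orders = clean[clean["region"] == "south"]

# Validate the result with simple checks.
assert clean["order_id"].is_unique
assert clean.notna().all().all()
assert (clean["amount"] >= 0).all()

print(missing_per_column)
print(clean)
```

Duplicates are removed before the median is computed so that a repeated row does not bias the fill value.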
Required Knowledge & Understanding
Technical Skills
- the differences between types of data, for example qualitative vs quantitative, processed vs unprocessed, and discrete vs continuous data
- different statistical analysis software, packages, libraries and tools, such as R or Pandas, that can be used to preprocess data
- different functions to identify and remove missing values
- different functions to identify and transform data types of variables such as integer, float, character
- different methodological approaches for normalizing the dataset such as standard score, feature scaling, etc.
- different data formats and structures
- how to index and organize data
- how to identify and report anomalies in data
- how to work on various databases and operating systems
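Two of the normalization approaches named in the list above, standard score and feature scaling, can be written out in a few lines. The sketch below uses plain Python to keep the formulas visible. The function names and the sample values are hypothetical, the standard score uses the population standard deviation (an assumption), and feature scaling is taken to mean min-max scaling to the range 0 to 1.

```python
from statistics import mean, pstdev


def standard_score(values):
    """Return z-scores: (value - mean) / population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        raise ValueError("standard score is undefined for constant data")
    return [(v - mu) / sigma for v in values]


def min_max_scale(values):
    """Return values rescaled linearly so the minimum is 0 and the maximum is 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("min-max scaling is undefined for constant data")
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical measurements, used only to exercise the two functions.
heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]

z = standard_score(heights_cm)
scaled = min_max_scale(heights_cm)

print(z)
print(scaled)
```

Standard scores are centred on zero and are not bounded, while min-max scaling keeps every value between 0 and 1 but is sensitive to extreme values at either end.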
Soft Skills
Analytical Thinking
analyze the impact of the actions performed and share relevant information with others; analyze data and understand its implications for the business
Attention to Detail
check that your work is complete and free from errors