Preprocess data as per specifications

Data Scientist

About

This unit is about using a variety of techniques to preprocess data, that is, to clean and transform it.

Scope

This unit covers defining the dataset and performing data preprocessing operations.

Define the dataset
  • define the format and structure for the dataset
  • define indexes and organize variables as per the defined format
  • identify data types for each variable of the dataset
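The three steps above can be sketched with pandas. This is only an illustration under stated assumptions: the column names, the sample rows and the helper `load_dataset` are hypothetical and not part of the unit.

```python
import io

import pandas as pd

# Hypothetical raw data; the columns are illustrative only.
RAW_CSV = """customer_id,signup_date,plan,monthly_spend
101,2023-01-15,basic,19.99
102,2023-02-01,premium,49.50
103,2023-02-20,basic,21.00
"""

# Define the format: the expected data type of each variable, declared up front.
SCHEMA = {"customer_id": "int64", "plan": "category", "monthly_spend": "float64"}


def load_dataset(csv_text: str) -> pd.DataFrame:
    """Read the CSV text using the declared schema and a fixed column order."""
    df = pd.read_csv(io.StringIO(csv_text), dtype=SCHEMA, parse_dates=["signup_date"])
    # Define the index, then organize the remaining variables in a fixed order.
    df = df.set_index("customer_id")
    return df[["signup_date", "plan", "monthly_spend"]]


dataset = load_dataset(RAW_CSV)

# Identify the data type of each variable.
print(dataset.dtypes)
```

Declaring the schema before loading means that a value which does not match its declared type is reported at read time rather than discovered later.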
Perform data preprocessing operations
  • identify and fix missing values in each variable of the dataset
  • identify and fix incorrect data types in each variable of the dataset
  • sort the data and create subsets of the data as required
  • perform operations to transform data types of variables as required
  • identify and deal with data redundancy by normalizing the dataset
  • validate preprocessed data using appropriate tools and processes
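As a rough sketch, the operations listed above might look like this in pandas. The sample records and the helpers `preprocess` and `validate` are hypothetical. The sketch also assumes that "redundancy" means exact duplicate rows, and it fills missing values with the median or a placeholder label; these are two possible choices among many, not the prescribed method.

```python
import pandas as pd

# Hypothetical raw records: "age" arrived as text, one age is not a number,
# one city is missing, and one row is an exact duplicate.
raw = pd.DataFrame({
    "name": ["Asha", "Ben", "Ben", "Chen", "Dana"],
    "age": ["34", "29", "29", "n/a", "41"],
    "city": ["Pune", "Leeds", "Leeds", "Oslo", None],
})


def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    # Fix the incorrect data type: strings that cannot be parsed become NaN.
    out["age"] = pd.to_numeric(out["age"], errors="coerce")
    # Deal with redundancy (assumed here to mean exact duplicate rows).
    out = out.drop_duplicates()
    # Fix missing values: median for the numeric variable, a label for text.
    out["age"] = out["age"].fillna(out["age"].median())
    out["city"] = out["city"].fillna("unknown")
    # Transform the data type now that no gaps remain.
    out["age"] = out["age"].astype("int64")
    # Sort the data; the second key makes the order of ties deterministic.
    return out.sort_values(["age", "name"]).reset_index(drop=True)


def validate(df: pd.DataFrame) -> list[str]:
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    if df.isna().any().any():
        problems.append("missing values remain")
    if df.duplicated().any():
        problems.append("duplicate rows remain")
    if not pd.api.types.is_integer_dtype(df["age"]):
        problems.append("age is not an integer type")
    return problems


clean = preprocess(raw)
over_30 = clean[clean["age"] > 30]  # create a subset as required
problems = validate(clean)
```

Duplicates are dropped before the median is computed so that repeated rows do not skew the fill value.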

Required Knowledge & Understanding

Technical Skills
  • the differences between various types of data, for example qualitative vs quantitative, processed vs unprocessed, and discrete vs continuous data
  • different statistical analysis software, packages, libraries and tools, such as R or pandas, that can be used to preprocess data
  • different functions to identify and remove missing values
  • different functions to identify and transform data types of variables such as integer, float, character
  • different methodological approaches for normalizing the dataset, such as standard score and feature scaling
  • different data formats and structures
  • how to index and organize data
  • how to identify anomalies in data and refer them for review
  • how to work on various databases and operating systems
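The two normalization approaches named in the list above, standard score and feature scaling, can be written in a few lines of pandas. This is a minimal sketch: the function names and the sample values are hypothetical, and feature scaling is assumed here to mean min-max scaling to the range 0 to 1.

```python
import pandas as pd


def standard_score(values: pd.Series) -> pd.Series:
    """Rescale to mean 0 and standard deviation 1 (z-score)."""
    # ddof=0 uses the population standard deviation; pass ddof=1 for the sample one.
    return (values - values.mean()) / values.std(ddof=0)


def min_max_scale(values: pd.Series) -> pd.Series:
    """Rescale linearly so the minimum maps to 0 and the maximum to 1."""
    span = values.max() - values.min()
    # A constant column has span 0 and would produce NaN; handle it separately.
    return (values - values.min()) / span


heights_cm = pd.Series([150.0, 160.0, 170.0, 180.0, 190.0])  # hypothetical values
z_scores = standard_score(heights_cm)
scaled = min_max_scale(heights_cm)
```

Standard scores suit variables that will be compared in units of spread, while min-max scaling keeps every value inside a fixed range.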
Soft Skills
  • Analytical Thinking: analyze the impact of the actions performed and share relevant information with others; analyze data and understand its implications for the business
  • Attention to Detail: check that your work is complete and free from errors