Sajiron

16 min read · Published on Mar 09, 2025

Understanding Data Preprocessing in JavaScript


1. Introduction

Related Post
📖 If you haven't read it yet, check out the previous blog: What is Machine Learning? A Beginner’s Guide in JavaScript

Machine Learning models are only as reliable as the data they're trained on. Data preprocessing ensures that the data is clean, accurate, and structured before it is fed into a model. Without proper preprocessing, even the best models can produce poor results.

1.1 Why This Guide Matters

Raw data is often messy, containing missing values, duplicates, and inconsistencies. This guide will walk through practical steps for preparing data in JavaScript, helping ensure that your ML models perform optimally.

2. Collecting & Loading Data in JavaScript

2.1 Why Is This Step Important?

Before preprocessing, we need to collect data from reliable sources and load it efficiently. Different formats (CSV, JSON, APIs) require different handling strategies.

2.2 Data Sources

Common data sources include:

CSV Files: Structured tabular data stored in text format.

JSON Files: Lightweight format ideal for JavaScript applications.

APIs: Fetching real-time data from online sources.

Databases: Querying SQL (e.g., MySQL, PostgreSQL) or NoSQL (e.g., MongoDB, Firebase) databases.

2.3 Loading Data Examples

Parsing CSV with PapaParse.js

Since JavaScript lacks native CSV support, PapaParse makes parsing easy:

Papa.parse("data.csv", {
  download: true,
  header: true,
  complete: (results) => {
    console.log(results.data); // Parsed CSV data
  }
});

Fetching Data from an API

Use the Fetch API to retrieve JSON data:

async function fetchData(url) {
  const response = await fetch(url);
  if (!response.ok) throw new Error(`HTTP error: ${response.status}`); // fetch does not reject on HTTP errors
  return response.json();
}

fetchData('https://api.example.com/data').then(data => console.log(data));

3. Cleaning Your Data

3.1 Why Is Data Cleaning Important?

Missing values can lead to incorrect model predictions.

Duplicate data skews the learning process.

Incorrect data types may cause errors in computations.

3.2 Handling Missing Values

Filling Missing Data with Defaults:

const cleanedData = rawData.map(row => ({
  name: row.name,
  age: row.age ?? 30,         // Default to 30 if null/undefined
  salary: row.salary ?? 50000 // ?? keeps a legitimate 0, unlike ||
}));

3.3 Removing Duplicate Data

Duplicates can distort analysis. One way to remove exact duplicate rows is to serialize each row and deduplicate with a Set:

const uniqueData = [...new Set(rawData.map(JSON.stringify))].map(JSON.parse);

Note that this treats two rows as duplicates only if their keys appear in the same order when serialized.

3.4 Type Conversion

Ensure numerical values are correctly formatted:

rawData.forEach(row => {
  row.age = parseInt(row.age, 10);     // NaN if the string is not numeric
  row.salary = parseFloat(row.salary); // Check results with Number.isNaN before training
});

4. Data Transformation

4.1 Why Is Data Transformation Necessary?

Feature values may be on different scales, making ML models unstable.

Some models require standardized inputs to perform optimally.

4.2 Min-Max Scaling

Scales numerical data between 0 and 1:

function minMaxNormalize(arr) {
  const min = Math.min(...arr);
  const max = Math.max(...arr);
  if (max === min) return arr.map(() => 0); // Avoid division by zero for constant arrays
  return arr.map(value => (value - min) / (max - min));
}

4.3 Z-Score Standardization

Standardizes data with mean = 0 and standard deviation = 1:

function zScoreNormalize(arr) {
  const mean = arr.reduce((a, b) => a + b, 0) / arr.length;
  const stdDev = Math.sqrt(arr.reduce((sum, val) => sum + (val - mean) ** 2, 0) / arr.length);
  return arr.map(value => (value - mean) / stdDev);
}

5. Feature Engineering

5.1 Why Is Feature Engineering Important?

Converts raw data into meaningful inputs for ML models.

Helps improve prediction accuracy by making variables more informative.

5.2 Categorical Encoding (One-Hot Encoding)

Convert categorical data into a format suitable for ML:

const categories = ["red", "blue", "green"];

const encodedData = rawData.map(row => ({
  ...row,
  colorOneHot: categories.map(cat => (row.color === cat ? 1 : 0)) // 'red' -> [1, 0, 0]
}));

6. Data Splitting (Train/Test)

6.1 Why Do We Split Data?

To test our model’s performance on unseen data.

To prevent overfitting, where a model memorizes rather than learns.

Split data into training (80%) and testing (20%) sets:

const trainSize = Math.floor(rawData.length * 0.8);
const trainSet = rawData.slice(0, trainSize);
const testSet = rawData.slice(trainSize);

7. Structuring Data for ML

7.1 Why Is Structuring Important?

Ensures the dataset is ready for model training.

Clearly separates features (X) and labels (y).

const X = rawData.map(row => [row.age, row.salary, row.experience]);
const y = rawData.map(row => row.hired ? 1 : 0);

8. Conclusion

By now, you’ve learned how to collect, clean, transform, and structure data for Machine Learning in JavaScript.

📌 What’s Next?

🚀 Stay tuned for the next post: "Mathematical Foundations for ML."

💡 If you found this helpful, don’t forget to like and share! 👍