Sajiron
Related Post
📖 If you haven't read it yet, check out the previous blog: What is Machine Learning? A Beginner’s Guide in JavaScript
Machine Learning models are only as reliable as the data they're trained on. Data preprocessing ensures that the data is clean, accurate, and structured before it is fed into a model. Without proper preprocessing, even the best models can produce poor results.
Raw data is often messy, containing missing values, duplicates, and inconsistencies. This guide will walk through practical steps for preparing data in JavaScript, helping ensure that your ML models perform optimally.
Before preprocessing, we need to collect data from reliable sources and load it efficiently. Different formats (CSV, JSON, APIs) require different handling strategies.
Common data sources include:
CSV Files: Structured tabular data stored in text format.
JSON Files: Lightweight format ideal for JavaScript applications.
APIs: Fetching real-time data from online sources.
Databases: Querying SQL (e.g., MySQL, PostgreSQL) or NoSQL (e.g., MongoDB, Firebase) databases.
Parsing CSV with PapaParse.js
Since JavaScript lacks native CSV support, PapaParse makes parsing easy:
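// Assumes PapaParse is loaded (e.g. via a <script> tag or `import Papa from "papaparse"`).
Papa.parse("data.csv", {
  download: true, // Treat the first argument as a URL to fetch
  header: true, // Use the first row as object keys
  skipEmptyLines: true, // Ignore blank lines such as a trailing newline
  complete: (results) => {
    console.log(results.data); // Parsed rows
    console.log(results.errors); // Row-level parse problems, if any
  },
  error: (err) => console.error("Failed to load CSV:", err),
});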
Fetching Data from an API
Use the Fetch API to retrieve JSON data:
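async function fetchData(url) {
  const response = await fetch(url);
  // fetch only rejects on network failure, so check the HTTP status explicitly
  if (!response.ok) {
    throw new Error(`Request failed with status ${response.status}`);
  }
  return response.json();
}
fetchData('https://api.example.com/data')
  .then(data => console.log(data))
  .catch(error => console.error(error));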
Missing values can lead to incorrect model predictions.
Duplicate data skews the learning process.
Incorrect data types may cause errors in computations.
Filling Missing Data with Defaults:
const cleanedData = rawData.map(row => ({
  name: row.name,
  age: row.age ?? 30, // Default to 30 if null or undefined
  salary: row.salary ?? 50000 // ?? keeps a legitimate 0, unlike ||
}));
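A fixed default such as 30 can bias a column, so a common alternative is to fill gaps with the column mean. The sketch below assumes rows are plain objects with numeric columns and that a missing value shows up as `null`, `undefined`, `NaN`, or an empty string (as CSV parsers often produce). `fillWithMean` and the sample rows are hypothetical names used only for illustration.

```javascript
// Hypothetical helper: replace missing values in one numeric column with the column mean.
function fillWithMean(rows, key) {
  const isMissing = (v) => v === null || v === undefined || v === "" || Number.isNaN(v);
  const present = rows.map((row) => row[key]).filter((v) => !isMissing(v)).map(Number);
  // Fall back to 0 if no row has a value at all, to avoid dividing by zero.
  const mean = present.length > 0
    ? present.reduce((sum, v) => sum + v, 0) / present.length
    : 0;
  // Return new row objects so the input array is left untouched.
  return rows.map((row) => ({
    ...row,
    [key]: isMissing(row[key]) ? mean : Number(row[key]),
  }));
}

const sampleRows = [
  { name: "Ana", age: 20 },
  { name: "Ben", age: null },
  { name: "Cy", age: 40 },
];
const filled = fillWithMean(sampleRows, "age");
console.log(filled[1].age); // 30 (the mean of 20 and 40)
```

Whether the mean, the median, or dropping the row is the right choice depends on the dataset, so treat this as one option rather than a rule.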
Duplicates can distort analysis. We can remove them using Set():
const uniqueData = [...new Set(rawData.map(row => JSON.stringify(row)))].map(str => JSON.parse(str));
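Comparing serialized rows treats `{ name, age }` and `{ age, name }` as different records, because `JSON.stringify` preserves key order. A minimal sketch of one workaround is to build the comparison key from a chosen list of fields in a fixed order. `dedupeBy` and the sample data are hypothetical names for illustration.

```javascript
// Hypothetical helper: keep the first row for each unique combination of the given keys.
function dedupeBy(rows, keys) {
  const seen = new Map();
  for (const row of rows) {
    // Build the signature from the chosen fields only, always in the same order.
    const signature = JSON.stringify(keys.map((key) => row[key]));
    if (!seen.has(signature)) seen.set(signature, row);
  }
  return [...seen.values()];
}

const people = [
  { name: "Ana", age: 20 },
  { age: 20, name: "Ana" }, // same record, different key order
  { name: "Ben", age: 35 },
];
console.log(dedupeBy(people, ["name", "age"]).length); // 2
```

Choosing the key fields explicitly also lets you treat rows as duplicates even when an irrelevant column (for example a timestamp) differs.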
Ensure numerical values are stored as numbers rather than strings:
rawData.forEach(row => {
  row.age = parseInt(row.age, 10);
  row.salary = parseFloat(row.salary);
});
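`parseInt` and `parseFloat` return `NaN` when a value cannot be parsed, and a single `NaN` can silently spread through later calculations. The sketch below, under the assumption that rows are plain objects, converts the listed fields and sets aside any row that fails. `parseNumericFields` and the sample rows are hypothetical names for illustration.

```javascript
// Hypothetical helper: convert the listed fields to numbers and set aside rows that fail.
function parseNumericFields(rows, fields) {
  const valid = [];
  const rejected = [];
  for (const row of rows) {
    const parsed = { ...row };
    for (const field of fields) {
      parsed[field] = Number.parseFloat(row[field]);
    }
    // A row is usable only if every converted field is a finite number.
    const ok = fields.every((field) => Number.isFinite(parsed[field]));
    if (ok) {
      valid.push(parsed);
    } else {
      rejected.push(row); // keep the original row for inspection
    }
  }
  return { valid, rejected };
}

const csvRows = [
  { age: "42", salary: "55000.50" },
  { age: "n/a", salary: "61000" },
];
const { valid, rejected } = parseNumericFields(csvRows, ["age", "salary"]);
console.log(valid.length, rejected.length); // 1 1
```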
Feature values may be on different scales, making ML models unstable.
Some models require standardized inputs to perform optimally.
Min-max normalization scales numerical data to the range 0 to 1:
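function minMaxNormalize(arr) {
  const min = Math.min(...arr);
  const max = Math.max(...arr);
  const range = max - min;
  // If all values are equal, return 0 for each instead of dividing by zero
  return arr.map(value => (range === 0 ? 0 : (value - min) / range));
}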
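In practice the values usually live inside row objects rather than a bare array. A minimal sketch of applying the same min-max idea to one column is shown here. `normalizeColumn` and the sample data are hypothetical names, and the choice to map a constant column to 0 is an assumption.

```javascript
// Hypothetical helper: min-max scale one numeric column of an array of row objects.
function normalizeColumn(rows, key) {
  const values = rows.map((row) => row[key]);
  // reduce avoids the argument-count limit that Math.min(...values) can hit on very large arrays.
  const min = values.reduce((a, b) => Math.min(a, b), Infinity);
  const max = values.reduce((a, b) => Math.max(a, b), -Infinity);
  const range = max - min;
  return rows.map((row) => ({
    ...row,
    // If every value is identical, map the column to 0 instead of dividing by zero.
    [key]: range === 0 ? 0 : (row[key] - min) / range,
  }));
}

const employees = [{ salary: 30000 }, { salary: 45000 }, { salary: 60000 }];
console.log(normalizeColumn(employees, "salary").map((row) => row.salary)); // [ 0, 0.5, 1 ]
```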
Z-score standardization rescales data to have mean = 0 and standard deviation = 1:
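function zScoreNormalize(arr) {
  const mean = arr.reduce((a, b) => a + b, 0) / arr.length;
  // Population standard deviation (divides by n, not n - 1)
  const stdDev = Math.sqrt(arr.reduce((sum, val) => sum + (val - mean) ** 2, 0) / arr.length);
  // If all values are equal, return 0 for each instead of dividing by zero
  return arr.map(value => (stdDev === 0 ? 0 : (value - mean) / stdDev));
}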
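When data is later divided into training and test sets, a common practice is to compute the scaling statistics from the training values only and reuse them on the test values, so no information from the test set leaks into preprocessing. The sketch below illustrates that idea for z-scores. `fitStandardizer`, `applyStandardizer`, and the sample arrays are hypothetical names for illustration.

```javascript
// Hypothetical helpers: learn mean and standard deviation from one array, apply them to another.
function fitStandardizer(values) {
  const mean = values.reduce((sum, v) => sum + v, 0) / values.length;
  const variance = values.reduce((sum, v) => sum + (v - mean) ** 2, 0) / values.length;
  // Use 1 when the data has no spread so the transform never divides by zero.
  const stdDev = Math.sqrt(variance) || 1;
  return { mean, stdDev };
}

function applyStandardizer(values, { mean, stdDev }) {
  return values.map((v) => (v - mean) / stdDev);
}

const trainAges = [20, 30, 40];
const testAges = [30, 50];
const scaler = fitStandardizer(trainAges); // statistics come from training data only
const scaledTest = applyStandardizer(testAges, scaler);
console.log(scaledTest[0]); // 0, because 30 equals the training mean
```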
Feature engineering converts raw data into meaningful inputs for ML models. It helps improve prediction accuracy by making variables more informative.
Convert categorical data into a format suitable for ML:
const categories = ["red", "blue", "green"];
// Build a lookup such as { red: 0, blue: 1, green: 2 }
const categoryMap = Object.fromEntries(categories.map((cat, idx) => [cat, idx]));
const encodedData = rawData.map(row => ({
  ...row,
  categoryIndex: categoryMap[row.color] // Convert 'red' -> 0, 'blue' -> 1, etc.
}));
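Mapping categories to integer indices implies an order (green > blue > red) that usually does not exist, and some models may pick up on it. One-hot encoding is a widely used alternative that gives each category its own 0/1 field. The following is a minimal sketch of that technique, not the approach used above. `oneHotEncode`, the `color_` field naming, and the sample data are assumptions for illustration.

```javascript
// Hypothetical helper: expand one categorical field into 0/1 indicator fields.
function oneHotEncode(rows, key, categories) {
  return rows.map((row) => {
    const encoded = { ...row };
    for (const category of categories) {
      // Exactly one indicator is 1 for a known category; all are 0 for an unknown one.
      encoded[`${key}_${category}`] = row[key] === category ? 1 : 0;
    }
    return encoded;
  });
}

const items = [{ color: "red" }, { color: "green" }];
const oneHot = oneHotEncode(items, "color", ["red", "blue", "green"]);
console.log(oneHot[0]); // { color: 'red', color_red: 1, color_blue: 0, color_green: 0 }
```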
We split the data so we can test the model's performance on unseen data. This also helps detect overfitting, where a model memorizes the training data rather than learning patterns that generalize.
Split data into training (80%) and testing (20%) sets. Note that slice() keeps the original row order, so shuffle the rows first if the data is sorted:
const trainSize = Math.floor(rawData.length * 0.8);
const trainSet = rawData.slice(0, trainSize);
const testSet = rawData.slice(trainSize);
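If the rows are ordered (by date, by label, by source file), slicing them directly can put one kind of example in training and another in testing. A minimal sketch of shuffling before the split, using the Fisher-Yates algorithm, is given below. `shuffleSplit` and its `random` parameter (which defaults to `Math.random` and can be swapped for a seeded generator) are hypothetical names for illustration.

```javascript
// Hypothetical helper: shuffle a copy of the rows (Fisher-Yates), then split by ratio.
function shuffleSplit(rows, trainRatio = 0.8, random = Math.random) {
  const shuffled = [...rows]; // copy so the caller's array keeps its order
  for (let i = shuffled.length - 1; i > 0; i--) {
    const j = Math.floor(random() * (i + 1)); // random index in [0, i]
    [shuffled[i], shuffled[j]] = [shuffled[j], shuffled[i]];
  }
  const trainSize = Math.floor(shuffled.length * trainRatio);
  return {
    trainSet: shuffled.slice(0, trainSize),
    testSet: shuffled.slice(trainSize),
  };
}

const ids = Array.from({ length: 10 }, (_, i) => i);
const { trainSet, testSet } = shuffleSplit(ids);
console.log(trainSet.length, testSet.length); // 8 2
```

Because the shuffle is random, the exact rows in each set differ between runs unless you pass in a deterministic `random` function.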
This final step ensures the dataset is ready for model training by clearly separating features (X) from labels (y).
const X = rawData.map(row => [row.age, row.salary, row.experience]);
const y = rawData.map(row => row.hired ? 1 : 0);
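A stray string or missing value in X tends to surface much later as a confusing training error. The sketch below wraps the same feature and label extraction in a function that checks each feature as it goes. `toFeaturesAndLabels` and the sample rows are hypothetical names, and throwing on the first bad value is just one possible policy.

```javascript
// Hypothetical helper: build X and y, failing fast if a feature is not a finite number.
function toFeaturesAndLabels(rows, featureKeys, labelKey) {
  const X = rows.map((row, index) =>
    featureKeys.map((key) => {
      const value = row[key];
      // Number.isFinite is false for strings, null, undefined, NaN, and Infinity.
      if (!Number.isFinite(value)) {
        throw new Error(`Row ${index}: feature "${key}" is not a finite number`);
      }
      return value;
    })
  );
  const y = rows.map((row) => (row[labelKey] ? 1 : 0));
  return { X, y };
}

const candidates = [
  { age: 30, salary: 50000, experience: 5, hired: true },
  { age: 25, salary: 42000, experience: 2, hired: false },
];
const dataset = toFeaturesAndLabels(candidates, ["age", "salary", "experience"], "hired");
console.log(dataset.X); // [ [ 30, 50000, 5 ], [ 25, 42000, 2 ] ]
console.log(dataset.y); // [ 1, 0 ]
```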
By now, you’ve learned how to collect, clean, transform, and structure data for Machine Learning in JavaScript.
🚀 Stay tuned for the next post: "Mathematical Foundations for ML."
💡 If you found this helpful, don’t forget to like and share! 👍