Android Malware Classification Pipeline

This is a brief overview of the machine learning workflow I built to classify Android software as “malware” or “not malware” for a Kaggle competition that served as the final project for my graduate-level machine learning course. Full implementation details are in the report below and the linked GitHub repository.

Data Loading and Setup

First I imported training and testing datasets and calculated basic descriptive statistics: 3,124 training examples, 241 sparse numeric features, and notable class imbalance (79.8% belonging to the majority class). I established a fixed random seed (67) to ensure reproducible results.
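The descriptive statistics mentioned above can be computed along these lines. This is a minimal sketch on synthetic stand-in data: the `summarize` helper and the demo arrays are hypothetical, and only the seed value comes from the project.

```python
import numpy as np

SEED = 67  # the fixed seed used throughout the project

def summarize(X, y):
    """Return basic descriptive statistics for a feature matrix and its labels."""
    _, counts = np.unique(y, return_counts=True)
    return {
        "n_examples": X.shape[0],
        "n_features": X.shape[1],
        "sparsity": float(np.mean(X == 0)),            # share of zero entries
        "majority_share": float(counts.max() / counts.sum()),
    }

# Synthetic stand-in data (assumption): mostly-zero features, about 20% positives.
rng = np.random.default_rng(SEED)
X_demo = rng.random((1000, 20)) * (rng.random((1000, 20)) < 0.1)
y_demo = (rng.random(1000) < 0.2).astype(int)

stats = summarize(X_demo, y_demo)
print(stats)
```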

Train–Validation Split

Before any modeling, I split the data 80/20. 80% was used to train the model; the remaining 20% was a dedicated validation set that never entered the cross-validation folds, letting it act as the final test for models before they were delivered as Kaggle submissions.
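A split like this can be sketched with scikit-learn's `train_test_split`. The synthetic data is a stand-in, and stratifying on the label is an assumption on my part here (it keeps the class ratio the same in both parts, which is usually what you want with imbalanced data).

```python
import numpy as np
from sklearn.model_selection import train_test_split

SEED = 67

# Synthetic stand-in data (assumption): 500 examples, about 20% positives.
rng = np.random.default_rng(SEED)
X = rng.normal(size=(500, 10))
y = (rng.random(500) < 0.2).astype(int)

# Hold out 20% as a validation set that never enters cross-validation.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=SEED
)

print(X_train.shape, X_val.shape)
print(round(y_train.mean(), 3), round(y_val.mean(), 3))
```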

Establishing a Baseline

To set a minimum performance standard, I tested two simple classifiers: a majority-class classifier (always selects the majority class) and a stratified random classifier (selects a class at random with each class weighted by the proportion of the data it comprises). These established the macro-F1 score floor that any useful machine learning model needed to beat.
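Both baselines are available in scikit-learn as `DummyClassifier` strategies. The sketch below uses synthetic labels, so the printed scores illustrate the idea only and are not the project's numbers.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import f1_score

SEED = 67

# Synthetic stand-in data (assumption): about 80% majority class.
rng = np.random.default_rng(SEED)
X = rng.normal(size=(1000, 5))
y = (rng.random(1000) < 0.2).astype(int)

baselines = {
    # Always predicts the most common class.
    "majority": DummyClassifier(strategy="most_frequent"),
    # Predicts at random, weighted by the class proportions seen in training.
    "stratified": DummyClassifier(strategy="stratified", random_state=SEED),
}

scores = {}
for name, clf in baselines.items():
    clf.fit(X, y)
    scores[name] = f1_score(y, clf.predict(X), average="macro", zero_division=0)

print(scores)
```

The majority-class baseline scores zero F1 on the minority class, so its macro-F1 is about half of its majority-class F1. That is why macro-F1 is a stricter floor than accuracy on imbalanced data.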

Learning Pipeline Construction

I built a reusable pipeline with the imblearn library so that, during cross-validation, every data transformation was fit on the training folds only and never on the held-out fold. The pipeline had four stages:

  • Imputation: Using the Impyute library to fill missing values with column means.
  • Scaling: Applying StandardScaler to standardize numeric features (zero mean, unit variance).
  • Resampling: Using the synthetic minority over-sampling technique (SMOTE) to address class imbalance (or passing through without changes).
  • Classification: Inserting the machine learning algorithm to be tested.

Grid Search and Cross-Validation

I tuned the model hyperparameters using GridSearchCV (or RandomizedSearchCV for models with more hyperparameter options) with stratified 5-fold cross-validation, optimizing for macro-F1 score.
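A search of this kind might look like the sketch below. The data is synthetic, the `C` grid is a hypothetical example, and a plain scikit-learn pipeline stands in for the full imblearn one to keep the block short.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

SEED = 67

# Synthetic stand-in data (assumption): 80/20 class split.
X, y = make_classification(
    n_samples=400, n_features=10, weights=[0.8, 0.2], random_state=SEED
)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Stratified folds keep the class ratio roughly constant in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=SEED)

search = GridSearchCV(
    pipe,
    param_grid={"clf__C": [0.1, 1.0, 10.0]},  # hypothetical grid
    scoring="f1_macro",
    cv=cv,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

`RandomizedSearchCV` takes the same estimator, `scoring`, and `cv` arguments, but samples `n_iter` candidates from `param_distributions` instead of trying every combination.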

Model Implementation and Comparison

I tried four different machine learning algorithms (logistic regression, linear support vector machine, random forest, and XGBoost) and compared the results. Frankly, they all performed about the same, with macro-F1 scores of roughly 0.97.
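A comparison of this kind can be sketched as a loop that scores each model with the same folds and records how long it takes. This is an illustration on synthetic data, so the scores and timings it prints are not the project's results. XGBoost is left out only to keep the example free of extra dependencies.

```python
import time
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import LinearSVC

SEED = 67

# Synthetic stand-in data (assumption): 80/20 class split.
X, y = make_classification(
    n_samples=400, n_features=10, weights=[0.8, 0.2], random_state=SEED
)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "linear_svm": LinearSVC(max_iter=5000, random_state=SEED),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=SEED),
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=SEED)

results = {}
for name, model in models.items():
    start = time.perf_counter()
    fold_scores = cross_val_score(model, X, y, scoring="f1_macro", cv=cv)
    results[name] = {
        "macro_f1": float(fold_scores.mean()),
        "seconds": time.perf_counter() - start,
    }

for name, res in results.items():
    print(f"{name}: macro-F1={res['macro_f1']:.3f}, time={res['seconds']:.2f}s")
```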

Evaluation and Logging

At the start of the project, I implemented a detailed logging system to track each experiment’s parameters, CV scores, and timestamps, which kept the methodology and results organized and transparent.

The key takeaway was that although every model produced similar results, my earlier attempts (made before I had a thorough data transformation pipeline in place) scored significantly worse. At least in this case, preparing the data properly mattered more than the choice of model.

The models were also very unequal in training time despite their near-equal scores. That makes a strong case for starting with simple models and moving to more complex ones, which can take orders of magnitude longer to train, only if necessary.
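An experiment log that records parameters, CV scores, timestamps, and training time could be as simple as one JSON record per line. The sketch below is an assumption about what such a logger might look like, not the project's implementation. The `log_experiment` helper, the file name, and the placeholder score are all hypothetical.

```python
import json
import os
import tempfile
import time
from datetime import datetime, timezone

def log_experiment(path, model_name, params, cv_score, fit_seconds):
    """Append one experiment record to a JSON Lines file and return it."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "model": model_name,
        "params": params,
        "cv_macro_f1": cv_score,
        "fit_seconds": round(fit_seconds, 3),
    }
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")
    return record

# Demo log location (assumption): a file in the system temp directory.
log_path = os.path.join(tempfile.gettempdir(), "experiments_demo.jsonl")

start = time.perf_counter()
# ... the model search would run here ...
elapsed = time.perf_counter() - start

# 0.0 is a placeholder score, not a real result.
rec = log_experiment(log_path, "logistic_regression", {"C": 1.0}, 0.0, elapsed)
print(rec["model"], rec["timestamp"])
```

Logging the fit time next to the score makes it easy to compare training cost across models later.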

Full Report