From Raw Dive Logs to Onboard Sediment Classification Using Open Source ML

My name is Ade. I am a software engineering intern at JAIA Robotics and a final-year Computer Science master's student at Brown University. I have been working at JAIA for two semesters now, and while I dabble in a little bit of everything, I have mostly been applying my AI background to the JaiaBot system.

I wanted to share a concrete example of how open source libraries significantly accelerated development for the JaiaBot.

Goal

We were interested in classifying bottom sediment types during dives using onboard sensors. Specifically, we wanted to distinguish between softer and harder bottom types using only signals available on the vehicle.

Rather than building custom ML infrastructure, we leveraged open source tooling to move quickly from HDF5 dive logs to a deployable model.


Step 1: Data Filtering from HDF5 Logs

JaiaBots log large HDF5 files containing many topics. For ML development we filtered each file down to only the datasets relevant to bottom interaction:

  • jaiabot::imu

  • jaiabot::pressure_adjusted

  • jaiabot::mission_dive

  • jaiabot::task_packet;14

Using h5py allowed us to programmatically extract only the needed groups and create smaller filtered files. This dramatically reduced iteration time when running experiments.
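A minimal sketch of what this filtering step could look like. It assumes the four names above are top-level groups in the log file; the function name filter_h5 and the file paths are hypothetical, and the layout of a real JaiaBot log may differ.

```python
import h5py

# Group names taken from the list above. Treating them as top-level
# groups is an assumption about the log layout.
KEEP_GROUPS = (
    "jaiabot::imu",
    "jaiabot::pressure_adjusted",
    "jaiabot::mission_dive",
    "jaiabot::task_packet;14",
)


def filter_h5(src_path, dst_path, keep=KEEP_GROUPS):
    """Copy only the listed top-level groups into a new, smaller file.

    Returns the names that were actually found and copied.
    """
    copied = []
    with h5py.File(src_path, "r") as src, h5py.File(dst_path, "w") as dst:
        for name in keep:
            if name in src:
                # Group.copy recursively copies the group and its datasets.
                src.copy(src[name], dst, name=name)
                copied.append(name)
    return copied
```

Calling filter_h5("dive.h5", "dive_filtered.h5") would write a file containing only the groups that were present, and the returned list makes it easy to spot logs that are missing a topic.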


Step 2: Feature Extraction from Bottom Events

From each dive file we:

  1. Identified bottom dive events from task packets.

  2. Located the maximum depth point as the likely bottom impact moment.

  3. Extracted a time window around that impact.

  4. Computed physics-motivated features:

    • Peak vertical acceleration

    • Impact duration above threshold

    • Maximum jerk

We used:

  • NumPy for vectorized computation

  • pandas for time-aligned processing

Because these are mature libraries, the signal processing logic was concise and easy to validate.
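As a rough illustration of the three features, here is a NumPy-only sketch. It assumes the depth and vertical acceleration samples have already been aligned onto one time base (the step we used pandas for). The function name impact_features, the one-second half window, and the threshold value are illustrative assumptions, not the values we tuned.

```python
import numpy as np


def impact_features(t, depth, accel_z, half_window_s=1.0, threshold=2.0):
    """Compute impact features in a window centred on the deepest sample.

    t is in seconds, depth in metres, accel_z in m/s^2. The window must
    contain at least two samples. half_window_s and threshold are
    placeholder defaults.
    """
    t = np.asarray(t, dtype=float)
    depth = np.asarray(depth, dtype=float)
    accel_z = np.asarray(accel_z, dtype=float)

    # Maximum depth is used as the likely bottom impact moment.
    t_impact = t[np.argmax(depth)]
    mask = np.abs(t - t_impact) <= half_window_s
    tw, aw = t[mask], accel_z[mask]

    dt = np.gradient(tw)          # time represented by each sample
    above = np.abs(aw) > threshold
    jerk = np.gradient(aw, tw)    # d(accel)/dt by finite differences

    return {
        "peak_accel": float(np.max(np.abs(aw))),
        "duration_above": float(np.sum(dt[above])),
        "max_jerk": float(np.max(np.abs(jerk))),
    }
```

Jerk from finite differences amplifies sensor noise, so in practice some smoothing of the acceleration signal before differentiating may be needed.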


Step 3: Unsupervised Structure Discovery

Instead of immediately training a supervised model, we first explored whether the data naturally clustered.

Using scikit-learn:

  • StandardScaler for normalization

  • KMeans for clustering

  • F1 scoring to compare clusters against ascent type

We searched over combinations of features to find the smallest set that best separated two clusters.

This gave us two emergent groups that strongly correlated with powered vs. unpowered ascent behavior. We treated ascent type as a proxy for soft vs. hard bottom interaction, because a bot that lands on a soft bottom usually needs a powered ascent to wiggle loose.
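The feature subset search can be sketched roughly as follows. This is an assumed reconstruction rather than our exact script: the function name best_feature_subset is hypothetical, ascent type is encoded as 0/1, and because cluster ids are arbitrary the F1 score is taken over both possible cluster-to-label mappings.

```python
from itertools import combinations

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import f1_score
from sklearn.preprocessing import StandardScaler


def best_feature_subset(X, ascent_powered, names):
    """Return (f1, feature_names, labels) for the best two-cluster split.

    Subsets are tried in order of increasing size and only a strictly
    better score replaces the current best, so ties go to the smaller set.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(ascent_powered, dtype=int)
    best = (-1.0, (), None)
    for k in range(1, X.shape[1] + 1):
        for cols in combinations(range(X.shape[1]), k):
            Xs = StandardScaler().fit_transform(X[:, list(cols)])
            km = KMeans(n_clusters=2, n_init=10, random_state=0)
            labels = km.fit_predict(Xs)
            # Cluster 0/1 naming is arbitrary, so score both mappings.
            score = max(f1_score(y, labels), f1_score(y, 1 - labels))
            if score > best[0]:
                best = (score, tuple(names[c] for c in cols), labels)
    return best
```

With only a handful of candidate features an exhaustive search like this is cheap; with many features the number of subsets grows exponentially and a greedy search would be more practical.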


Step 4: Training a Lightweight Classifier

Once clusters were stable, we trained supervised classifiers to reproduce the cluster assignments:

  • RandomForest

  • GradientBoosting

  • LogisticRegression

  • SVM

Cross validation was handled entirely by scikit-learn utilities.

The best model was wrapped in a pipeline with scaling and exported using joblib. This produced a small serialized artifact that can be integrated into the Jaia software stack.
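A sketch of the model comparison and export, under some assumptions: the hyperparameters, the F1 scoring choice for cross validation, the 0/1 cluster labels, the function name train_and_export, and the output filename are all illustrative rather than what we shipped.

```python
import joblib
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def train_and_export(X, cluster_labels, path="sediment_model.joblib", cv=5):
    """Cross-validate four classifiers, refit the best one, and save it.

    cluster_labels are expected to be 0/1 so that scoring="f1" applies.
    Returns the winning model name and the mean CV score of each model.
    """
    candidates = {
        "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
        "gradient_boosting": GradientBoostingClassifier(random_state=0),
        "logistic_regression": LogisticRegression(max_iter=1000),
        "svm": SVC(),
    }
    scores = {}
    for name, clf in candidates.items():
        # Scaling lives inside the pipeline so each CV fold is scaled
        # using only its own training split.
        pipe = make_pipeline(StandardScaler(), clf)
        scores[name] = cross_val_score(
            pipe, X, cluster_labels, cv=cv, scoring="f1"
        ).mean()

    best_name = max(scores, key=scores.get)
    best = make_pipeline(StandardScaler(), candidates[best_name])
    best.fit(X, cluster_labels)
    joblib.dump(best, path)  # one file holding scaler + classifier
    return best_name, scores
```

On the vehicle side, joblib.load(path).predict(features) would then reproduce the scaling and classification in one call, provided the scikit-learn version there is compatible with the one used for training.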


Why Open Source Made This Fast

Key advantages:

  • No custom clustering implementation required

  • Built-in cross validation and metrics

  • Easy feature scaling and pipeline composition

  • Clean model serialization for deployment

  • Well-tested numerical stability

Instead of debugging algorithms, we focused on domain questions:

  • Is the impact window correctly defined?

  • Are the features physically meaningful?

  • Are clusters consistent across dives?

The open source ecosystem allowed us to move from raw dive logs to a reproducible model quickly and confidently.


Lessons for Marine Robotics Teams

  1. Start with unsupervised learning when labels are scarce.

  2. Use physics-informed features before deep models.

  3. Lean heavily on mature open source libraries.

  4. Serialize early and test deployment constraints early.

For small marine robotics teams, this approach avoids building infrastructure that already exists and lets you spend time where it matters most, which is on domain understanding and field validation.

If there is interest, I am happy to share more about:

  • Handling noisy underwater IMU data

  • Validating clusters across missions

  • Transitioning from clustering to fully supervised models

  • Constraints when deploying ML onboard embedded systems

Would love to hear how others are using open source ML in marine field systems.

Hi Ade, welcome to the Open Ocean Software forum! Thank you for sharing your project using ML to do sediment classification.

Do you have a link to source code for this work, so that other people who are interested might be able to use it?