
Machine Learning and Data Science (1)


A compilation of study notes on machine learning and data analysis.
Website:
Teachable Machine (Google)
"The 6-Step Field Guide for Building Machine Learning Projects" by Mr. D Bourke (https://www.mrdbourke.com/a-6-step-field-guide-for-building-machine-learning-projects/)
ML Playground

Course: Udemy Complete Machine Learning & Data Science Bootcamp 2022

Content

  • What is machine learning?

  • Types of machine learning problems

  • Combining machine learning with data science

    • 1. Typical data analysis workflow

    • 2. Evaluation metrics of different machine learning models

    • 3. Modelling the data - Three steps

    • 4. Tools used during data modelling

What is machine learning?


Machine learning uses algorithms (models) to learn patterns in data, so that those patterns can be used to make predictions about future, unseen data.


Difference between a machine learning algorithm and a normal algorithm:
A normal algorithm is essentially a set of instructions.
The output is produced by applying those instructions to the input data.
A machine learning algorithm starts from the input data and the expected outputs, and tries to figure out the set of instructions that connects the two.


A machine learning model may need to repeat this process many times (often more than a thousand iterations) before it finds the instructions that correctly map the inputs to the outputs. The patterns it learns can then be applied to future problems.
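To make the contrast concrete, here is a minimal sketch (not from the course; the spam example and toy data are made up for illustration): a normal algorithm encodes the rule by hand, while a machine learning algorithm is given inputs and expected outputs and works the rule out itself.

```python
# A hand-written rule vs. a rule learned from examples (toy data).
from sklearn.tree import DecisionTreeClassifier

# Normal algorithm: the instructions are written by a person.
def is_spam_rule(num_links: int) -> int:
    return 1 if num_links > 3 else 0

# Machine learning algorithm: given inputs and expected outputs,
# it figures out the instructions (a decision rule) on its own.
X = [[0], [1], [2], [4], [5], [7]]   # input: number of links in an email
y = [0, 0, 0, 1, 1, 1]               # expected output: 0 = not spam, 1 = spam

model = DecisionTreeClassifier().fit(X, y)
print(model.predict([[6]]))          # the learned rule classifies a new email
```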

Types of machine learning problems


Supervised learning (有监督学习): The input data already has categories (it is labelled). Because the test data is labelled, we know whether the model's output is right or wrong.
Unsupervised learning (无监督学习): The input data has no categories (it is not labelled). We let the machine create categories for us.
Reinforcement learning (强化学习): A training method based on rewarding desired behaviours and/or punishing undesired ones. It is all about teaching machines through trial and error, using rewards and penalties.
Transfer learning (迁移学习): A research problem in machine learning that focuses on storing knowledge gained while solving one problem and applying it to a different but related problem. Building a new machine learning model from scratch can be expensive, which is why transfer learning is useful.
Neural networks, decision trees, support vector machines, and k-nearest neighbours are simply algorithms used within these subfields to arrive at such predictions.

What is supervised learning?

While supervised learning works from labelled examples, an example of unsupervised learning is grouping customers into distinct segments based on their purchasing history, without any pre-existing labels.
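A minimal sketch of that idea, assuming two made-up features (annual spend and purchase count); the algorithm is given no labels and invents the segments itself:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
# hypothetical purchasing history: [annual spend, number of purchases]
purchases = rng.integers(1, 100, size=(200, 2)).astype(float)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
segments = kmeans.fit_predict(purchases)   # no labels supplied

print(segments[:10])                # segment id assigned to each customer
print(kmeans.cluster_centers_)      # the "average customer" of each segment
```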


An example of transfer learning: a machine learning model trained to identify cars is reused to help solve a related problem, distinguishing different dog breeds.


Reinforcement learning relies on a computer program performing tasks in a defined environment, rewarding success and penalizing failure. Example: teaching a machine learning algorithm to play chess.
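Chess is far too large to sketch here, but the reward-and-penalty loop itself can be shown with a toy two-armed bandit (all numbers below are made up): the program tries actions, receives rewards, and gradually learns which action pays off.

```python
import random

true_win_rates = [0.3, 0.7]        # hidden from the program
value_estimates = [0.0, 0.0]       # the program's learned estimates
counts = [0, 0]

for _ in range(1000):
    # mostly exploit the best-known action, occasionally explore
    if random.random() < 0.1:
        action = random.randrange(2)
    else:
        action = value_estimates.index(max(value_estimates))
    # reward for success, nothing (a "penalty") for failure
    reward = 1 if random.random() < true_win_rates[action] else 0
    counts[action] += 1
    # running average of rewards observed for this action
    value_estimates[action] += (reward - value_estimates[action]) / counts[action]

print(value_estimates)             # ends up close to the true win rates
```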

Combining machine learning with data science

1. Typical data analysis workflow


A typical data science process begins with launching a Jupyter Notebook session to open a CSV file, followed by exploring and analyzing the data using the Pandas library (a Python library for data analysis) and generating visual representations including graphs, then constructing machine learning models based on the dataset through scikit-learn.
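A minimal sketch of that workflow, assuming a hypothetical heart_disease.csv file with an "age" column and a "target" column (the file name and columns are placeholders, not part of the course material):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

df = pd.read_csv("heart_disease.csv")     # 1. open the CSV in a notebook session
print(df.describe())                      # 2. explore the data with pandas
df["age"].plot(kind="hist")               # 3. visualise a column

X, y = df.drop("target", axis=1), df["target"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

model = RandomForestClassifier()          # 4. build a model with scikit-learn
model.fit(X_train, y_train)
print(model.score(X_test, y_test))        # accuracy on unseen data
```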

2. Evaluation metrics of different machine learning models


Evaluation metrics for different machine learning problems.
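Which metric is appropriate depends on the problem type. A minimal sketch of a few commonly used scikit-learn metrics for classification and regression (the toy values are made up):

```python
from sklearn.metrics import (accuracy_score, f1_score, mean_absolute_error,
                             mean_squared_error, precision_score, recall_score)

# classification: compare predicted class labels with the true labels
y_true_cls, y_pred_cls = [1, 0, 1, 1, 0], [1, 0, 0, 1, 0]
print(accuracy_score(y_true_cls, y_pred_cls))
print(precision_score(y_true_cls, y_pred_cls), recall_score(y_true_cls, y_pred_cls))
print(f1_score(y_true_cls, y_pred_cls))

# regression: compare predicted numeric values with the true values
y_true_reg, y_pred_reg = [3.0, 2.5, 4.0], [2.8, 2.7, 3.5]
print(mean_absolute_error(y_true_reg, y_pred_reg))
print(mean_squared_error(y_true_reg, y_pred_reg))
```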

3. Modelling the data - Three steps

Three steps to modelling the data:

  1. Choosing and training a model
  2. Tuning (调参) a model
  3. Model comparison

There is no single perfect model for every dataset; for any given dataset, some models will simply perform better than others.


The dataset should first be shuffled (randomized) and then split into three sets: training, validation, and test.
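A minimal sketch of shuffling and splitting, using a made-up DataFrame and the 70/15/15 proportions discussed later in these notes:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"feature": np.arange(100), "target": np.random.randint(0, 2, 100)})

shuffled = df.sample(frac=1, random_state=42)         # randomize the row order
n = len(shuffled)
train = shuffled.iloc[: int(0.70 * n)]                # training set
valid = shuffled.iloc[int(0.70 * n): int(0.85 * n)]   # validation set
test = shuffled.iloc[int(0.85 * n):]                  # test set

print(len(train), len(valid), len(test))              # 70, 15, 15
```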

How to choose a model?


Structured data in formats like CSV or Excel may benefit from models such as CatBoost, XGBoost, or Random Forest. Unstructured data types including photos and videos may see success with Deep Learning and Transfer Learning models.
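A minimal sketch of the structured-data case using a Random Forest on one of scikit-learn's built-in toy datasets (CatBoost and XGBoost are separate libraries and would follow the same fit/score pattern):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)            # tabular, "structured" data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))                     # accuracy on held-out rows
```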


When faced with a very large dataset (say, more than a million rows), it is prudent to set aside only a modest subset, perhaps around 1,000 rows, for the initial model training. A small amount of data is enough to get the model-building process started. The rationale is that pouring large amounts of compute into training highly complex models early on often yields negligible accuracy gains; there is little benefit in spending long periods optimizing for incremental improvements at this stage.
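A minimal sketch of that idea, simulating a large DataFrame and sampling 1,000 rows for the first experiments:

```python
import numpy as np
import pandas as pd

# simulate a large dataset (in practice this would be your real data)
big_df = pd.DataFrame({"x": np.random.rand(1_000_000),
                       "target": np.random.randint(0, 2, 1_000_000)})

# work with a small sample until the whole pipeline runs end to end,
# then scale up to the full dataset
small_df = big_df.sample(n=1000, random_state=42)
print(len(big_df), len(small_df))
```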

How to tune a model?


Tuning a model's hyperparameters to improve its performance is a key step. Which parameters can be adjusted depends on the algorithm chosen. For example, in a random forest you can adjust the number of trees, while in a neural network you can adjust the number of layers.

Model tuning can be performed either using the training dataset or the validation dataset.
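A minimal sketch of tuning one hyperparameter (the number of trees in a random forest) against a validation split; the candidate values are arbitrary:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.3, random_state=42)

for n_trees in (10, 50, 100, 200):
    model = RandomForestClassifier(n_estimators=n_trees, random_state=42)
    model.fit(X_train, y_train)
    # keep the value with the best validation score
    print(n_trees, model.score(X_valid, y_valid))
```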

How to test and compare a model?


A good machine learning model should perform similarly on the training and test datasets. A model that underfits or overfits will not generalize well in the real world.
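A minimal sketch of that check: fit a model, then compare its training score with its test score (an unpruned decision tree is used here because it tends to overfit, which makes the gap easy to see):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
print("train score:", model.score(X_train, y_train))   # usually close to 1.0
print("test score: ", model.score(X_test, y_test))     # noticeably lower => overfitting
```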

Overfitting can stem from data leakage, where part of the test set inadvertently makes its way into the training phase. This underscores the need for careful dataset partitioning: the training set is used only for training, the training and validation sets together are used during model optimization, and the test set is reserved exclusively for the final evaluation.
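One common, concrete form of leakage is fitting preprocessing on the whole dataset before splitting. A minimal sketch of the safe order (split first, then fit the preprocessing on the training portion only):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

scaler = StandardScaler().fit(X_train)      # statistics come from the training set only
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)    # the test set never influences the scaler
```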

Underfitting can stem from data mismatch, which arises when the data used for testing differs from the data used for training (for example, it contains different features). This calls for training on the same kind of data as will be used for testing, and as close as possible to what the model will see in future applications.

Methods for fixing overfitting and underfitting:


When comparing models, it is essential that all of them are trained and evaluated on identical data splits. Evaluating performance also involves more than accuracy: metrics such as training time and prediction time (and how reliable the predictions are) matter too. Whether you prioritize speed or precision ultimately depends on the goals of your project.
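A minimal sketch of comparing two models on identical splits, recording accuracy together with training and prediction time (the models and dataset are arbitrary examples):

```python
import time

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

for model in (LogisticRegression(max_iter=5000), RandomForestClassifier(random_state=42)):
    start = time.perf_counter()
    model.fit(X_train, y_train)                 # same training data for both models
    fit_seconds = time.perf_counter() - start

    start = time.perf_counter()
    accuracy = model.score(X_test, y_test)      # same test data for both models
    predict_seconds = time.perf_counter() - start

    print(type(model).__name__, round(accuracy, 3), fit_seconds, predict_seconds)
```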

All experiments should be conducted on different portions of your data.

Training dataset — This dataset is used for model training, with between 70% and 80% of your data considered the standard.

Validation and development datasets — These datasets are utilized for the purpose of model hyperparameter tuning and experimentation evaluation. They typically constitute 10–15% of your data, which is generally regarded as a standard proportion.

Test dataset — Used for the final testing and comparison of your models; typically 10–15% of your data is regarded as standard.

These specific figures may vary marginally based on the problem you're tackling and the data available to you.

Suboptimal performance on the training dataset indicates that the model has not learned adequately and exhibits underfitting behavior. It’s worth considering alternative models to address this issue. By tuning its hyperparameters, we can enhance its performance. Additionally, increasing the size of the training dataset may help mitigate underfitting.

The model shows impressive results on the training dataset but performs poorly on the validation set, indicating limited generalizability. This suggests the model is overfitting the training data. Consider using a simpler model, or make sure the training data is representative of the data used for validation and testing.

Another instance of overfitting can manifest as improved performance on the test dataset compared to the training dataset. This could indicate issues such as improper data partitioning, leading to leakage from the test set into the training process, or excessive focus on optimizing model parameters for specific test examples. To mitigate this, ensure that both your training and testing datasets remain entirely separate to prevent optimizing model performance solely based on the test set. Optimize instead using a combination of training and validation datasets to enhance model effectiveness.

When a model experiences performance degradation (in real-world scenarios), it indicates a discrepancy between the training and testing datasets compared to the actual operational conditions. It is crucial to verify that the data used during experimentation aligns with the operational data employed in production.

To maintain performance, there should be no significant gap between the data used for experimentation and the data encountered in production. This means verifying that the experimental (training and test) datasets accurately represent the operational data, so that data quality and distribution stay consistent across both phases.

Additionally, it is essential to implement robust verification mechanisms for maintaining data consistency across both training and operational phases. By doing so, one can prevent unintended performance drops while ensuring seamless integration of models into production environments.

Finally, regular monitoring of data sources ensures that any deviations from expected patterns are promptly identified. This proactive approach helps maintain system reliability by addressing issues before they impact end-users.

4. Tools used during data modelling

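The tools already mentioned in these notes include Jupyter Notebook, pandas, plotting (for example with Matplotlib), and scikit-learn. A minimal sketch, assuming those libraries are installed, that simply confirms they can be imported and prints their versions:

```python
import matplotlib
import pandas
import sklearn

print("pandas", pandas.__version__)
print("matplotlib", matplotlib.__version__)
print("scikit-learn", sklearn.__version__)
```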
