
Machine Learning and Data Science (1)


A compilation of study notes on machine learning and data analysis.
Website:
Teachable Machine (Google)
"The 6-Step Field Guide for Building Machine Learning Projects" by Mr. D Bourke (https://www.mrdbourke.com/a-6-step-field-guide-for-building-machine-learning-projects/)
ML Playground

Course: Udemy Complete Machine Learning & Data Science Bootcamp 2022

Content

  • What is machine learning?

  • Types of machine learning problems

  • Combining machine learning with data science

    • 1. Typical data analysis workflow

    • 2. Evaluation metrics of different machine learning models

    • 3. Modelling the data - Three steps

    • 4. Tools used during data modelling

What is machine learning?


Machine learning uses algorithms (models) to learn patterns in data, so that those patterns can be used to make predictions about future, unseen data.


Difference between a machine learning algorithm and a normal algorithm:
A normal algorithm is essentially a set of instructions.
The output is produced by applying those instructions to the input data.
A machine learning algorithm starts from the input data and the expected outputs, and tries to figure out the set of instructions that connects the two.


A machine learning model may need to repeat this process many times (often more than a thousand iterations) before it finds the instructions that correctly map the inputs to the outputs. The patterns it learns can then be applied to future problems.
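To make the contrast concrete, here is a minimal sketch (not from the course; the spam example and toy data are made up for illustration): a normal algorithm encodes the rule by hand, while a machine learning algorithm is given inputs and expected outputs and works the rule out itself.

```python
# A hand-written rule vs. a rule learned from examples (toy data).
from sklearn.tree import DecisionTreeClassifier

# Normal algorithm: the instructions are written by a person.
def is_spam_rule(num_links: int) -> int:
    return 1 if num_links > 3 else 0

# Machine learning algorithm: given inputs and expected outputs,
# it figures out the instructions (a decision rule) on its own.
X = [[0], [1], [2], [4], [5], [7]]   # input: number of links in an email
y = [0, 0, 0, 1, 1, 1]               # expected output: 0 = not spam, 1 = spam

model = DecisionTreeClassifier().fit(X, y)
print(model.predict([[6]]))          # the learned rule classifies a new email
```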

Types of machine learning problems


Supervised learning (有监督学习): The input data already has categories (it is labelled). Because the test data is labelled, we know whether the model's output is right or wrong.
Unsupervised learning (无监督学习): The input data has no categories (it is not labelled). We let the machine create categories for us.
Reinforcement learning (强化学习): A training method based on rewarding desired behaviours and/or punishing undesired ones. It is all about teaching machines through trial and error, using rewards and penalties.
Transfer learning (迁移学习): A research problem in machine learning that focuses on storing knowledge gained while solving one problem and applying it to a different but related problem. Building a new machine learning model from scratch can be expensive, which is why transfer learning is useful.
Neural networks, decision trees, support vector machines, and k-nearest neighbours are simply algorithms used within these subfields to arrive at such predictions.

What is supervised learning?

While supervised learning works from labelled examples, an example of unsupervised learning is grouping customers into distinct segments based on their purchasing history, without any pre-existing labels.
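A minimal sketch of that idea, assuming two made-up features (annual spend and purchase count); the algorithm is given no labels and invents the segments itself:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
# hypothetical purchasing history: [annual spend, number of purchases]
purchases = rng.integers(1, 100, size=(200, 2)).astype(float)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
segments = kmeans.fit_predict(purchases)   # no labels supplied

print(segments[:10])                # segment id assigned to each customer
print(kmeans.cluster_centers_)      # the "average customer" of each segment
```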


An example of transfer learning: a machine learning model trained to identify cars is reused to help solve a related problem, distinguishing different dog breeds.


Reinforcement learning relies on a computer program performing tasks in a defined environment, rewarding success and penalizing failure. Example: teaching a machine learning algorithm to play chess.
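Chess is far too large to sketch here, but the reward-and-penalty loop itself can be shown with a toy two-armed bandit (all numbers below are made up): the program tries actions, receives rewards, and gradually learns which action pays off.

```python
import random

true_win_rates = [0.3, 0.7]        # hidden from the program
value_estimates = [0.0, 0.0]       # the program's learned estimates
counts = [0, 0]

for _ in range(1000):
    # mostly exploit the best-known action, occasionally explore
    if random.random() < 0.1:
        action = random.randrange(2)
    else:
        action = value_estimates.index(max(value_estimates))
    # reward for success, nothing (a "penalty") for failure
    reward = 1 if random.random() < true_win_rates[action] else 0
    counts[action] += 1
    # running average of rewards observed for this action
    value_estimates[action] += (reward - value_estimates[action]) / counts[action]

print(value_estimates)             # ends up close to the true win rates
```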

Combining machine learning with data science

1. Typical data analysis workflow


A typical data science process begins with launching a Jupyter Notebook session to open a CSV file, followed by exploring and analyzing the data using the Pandas library (a Python library for data analysis) and generating visual representations including graphs, then constructing machine learning models based on the dataset through scikit-learn.
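A minimal sketch of that workflow, assuming a hypothetical heart_disease.csv file with an "age" column and a "target" column (the file name and columns are placeholders, not part of the course material):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

df = pd.read_csv("heart_disease.csv")     # 1. open the CSV in a notebook session
print(df.describe())                      # 2. explore the data with pandas
df["age"].plot(kind="hist")               # 3. visualise a column

X, y = df.drop("target", axis=1), df["target"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

model = RandomForestClassifier()          # 4. build a model with scikit-learn
model.fit(X_train, y_train)
print(model.score(X_test, y_test))        # accuracy on unseen data
```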

2. Evaluation metrics of different machine learning models


Evaluation metrics for different machine learning problems.
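Which metric is appropriate depends on the problem type. A minimal sketch of a few commonly used scikit-learn metrics for classification and regression (the toy values are made up):

```python
from sklearn.metrics import (accuracy_score, f1_score, mean_absolute_error,
                             mean_squared_error, precision_score, recall_score)

# classification: compare predicted class labels with the true labels
y_true_cls, y_pred_cls = [1, 0, 1, 1, 0], [1, 0, 0, 1, 0]
print(accuracy_score(y_true_cls, y_pred_cls))
print(precision_score(y_true_cls, y_pred_cls), recall_score(y_true_cls, y_pred_cls))
print(f1_score(y_true_cls, y_pred_cls))

# regression: compare predicted numeric values with the true values
y_true_reg, y_pred_reg = [3.0, 2.5, 4.0], [2.8, 2.7, 3.5]
print(mean_absolute_error(y_true_reg, y_pred_reg))
print(mean_squared_error(y_true_reg, y_pred_reg))
```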

3. Modelling the data - Three steps

Three steps to modelling the data:

  1. Choosing and training a model
  2. Tuning (调参) a model
  3. Model comparison

There is no single perfect model for every dataset; for any given dataset, some models will simply perform better than others.


The dataset should first be shuffled (randomized) and then split into three sets: training, validation, and test.
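A minimal sketch of shuffling and splitting, using a made-up DataFrame and the 70/15/15 proportions discussed later in these notes:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"feature": np.arange(100), "target": np.random.randint(0, 2, 100)})

shuffled = df.sample(frac=1, random_state=42)         # randomize the row order
n = len(shuffled)
train = shuffled.iloc[: int(0.70 * n)]                # training set
valid = shuffled.iloc[int(0.70 * n): int(0.85 * n)]   # validation set
test = shuffled.iloc[int(0.85 * n):]                  # test set

print(len(train), len(valid), len(test))              # 70, 15, 15
```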

How to choose a model?


Structured data in formats like CSV or Excel may benefit from models such as CatBoost, XGBoost, or Random Forest. Unstructured data types including photos and videos may see success with Deep Learning and Transfer Learning models.
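A minimal sketch of the structured-data case using a Random Forest on one of scikit-learn's built-in toy datasets (CatBoost and XGBoost are separate libraries and would follow the same fit/score pattern):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)            # tabular, "structured" data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))                     # accuracy on held-out rows
```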


When faced with a very large dataset (say, more than a million rows), it is prudent to set aside only a modest subset, perhaps around 1,000 rows, for the initial model training. A small amount of data is enough to get the model-building process started. The rationale is that pouring large amounts of compute into training highly complex models early on often yields negligible accuracy gains; there is little benefit in spending long periods optimizing for incremental improvements at this stage.
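A minimal sketch of that idea, simulating a large DataFrame and sampling 1,000 rows for the first experiments:

```python
import numpy as np
import pandas as pd

# simulate a large dataset (in practice this would be your real data)
big_df = pd.DataFrame({"x": np.random.rand(1_000_000),
                       "target": np.random.randint(0, 2, 1_000_000)})

# work with a small sample until the whole pipeline runs end to end,
# then scale up to the full dataset
small_df = big_df.sample(n=1000, random_state=42)
print(len(big_df), len(small_df))
```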

How to tune a model?


Tuning a model's hyperparameters to improve its performance is a key step. Which parameters can be adjusted depends on the algorithm chosen. For example, in a random forest you can adjust the number of trees, while in a neural network you can adjust the number of layers.

Model tuning can be performed either using the training dataset or the validation dataset.
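A minimal sketch of tuning one hyperparameter (the number of trees in a random forest) against a validation split; the candidate values are arbitrary:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.3, random_state=42)

for n_trees in (10, 50, 100, 200):
    model = RandomForestClassifier(n_estimators=n_trees, random_state=42)
    model.fit(X_train, y_train)
    # keep the value with the best validation score
    print(n_trees, model.score(X_valid, y_valid))
```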

How to test and compare a model?


A good machine learning model should perform similarly on the training and test datasets. A model that underfits or overfits will not generalize well in the real world.
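A minimal sketch of that check: fit a model, then compare its training score with its test score (an unpruned decision tree is used here because it tends to overfit, which makes the gap easy to see):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
print("train score:", model.score(X_train, y_train))   # usually close to 1.0
print("test score: ", model.score(X_test, y_test))     # noticeably lower => overfitting
```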

Overfitting can stem from data leakage, where part of the test set inadvertently makes its way into the training phase. This underscores the need for careful dataset partitioning: the training set is used only for training, the training and validation sets together are used during model optimization, and the test set is reserved exclusively for the final evaluation.
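One common, concrete form of leakage is fitting preprocessing on the whole dataset before splitting. A minimal sketch of the safe order (split first, then fit the preprocessing on the training portion only):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

scaler = StandardScaler().fit(X_train)      # statistics come from the training set only
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)    # the test set never influences the scaler
```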

Underfitting can stem from data mismatch, which arises when the data used for testing differs from the data used for training (for example, it contains different features). This calls for training on the same kind of data as will be used for testing, and as close as possible to what the model will see in future applications.

Methods for fixing overfitting and underfitting:


When comparing models, it is essential that all of them are trained and evaluated on identical data splits. Evaluating performance also involves more than accuracy: metrics such as training time and prediction time (and how reliable the predictions are) matter too. Whether you prioritize speed or precision ultimately depends on the goals of your project.
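A minimal sketch of comparing two models on identical splits, recording accuracy together with training and prediction time (the models and dataset are arbitrary examples):

```python
import time

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

for model in (LogisticRegression(max_iter=5000), RandomForestClassifier(random_state=42)):
    start = time.perf_counter()
    model.fit(X_train, y_train)                 # same training data for both models
    fit_seconds = time.perf_counter() - start

    start = time.perf_counter()
    accuracy = model.score(X_test, y_test)      # same test data for both models
    predict_seconds = time.perf_counter() - start

    print(type(model).__name__, round(accuracy, 3), fit_seconds, predict_seconds)
```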

All experiments should be conducted on different portions of your data.

Training dataset — This dataset is used for model training, with between 70% and 80% of your data considered the standard.

Validation and development datasets — These datasets are utilized for the purpose of model hyperparameter tuning and experimentation evaluation. They typically constitute 10–15% of your data, which is generally regarded as a standard proportion.

Test dataset — Used for the final testing and comparison of your models; typically 10–15% of your data is regarded as standard.

These specific figures may vary marginally based on the problem you're tackling and the data available to you.

Suboptimal performance on the training dataset indicates that the model has not learned adequately and exhibits underfitting behavior. It’s worth considering alternative models to address this issue. By tuning its hyperparameters, we can enhance its performance. Additionally, increasing the size of the training dataset may help mitigate underfitting.

The model shows impressive results on the training dataset but performs poorly on the validation set, indicating limited generalizability. This suggests the model is overfitting the training data. Consider using a simpler model, or make sure the training data is representative of the data used for validation and testing.

Another instance of overfitting can manifest as improved performance on the test dataset compared to the training dataset. This could indicate issues such as improper data partitioning, leading to leakage from the test set into the training process, or excessive focus on optimizing model parameters for specific test examples. To mitigate this, ensure that both your training and testing datasets remain entirely separate to prevent optimizing model performance solely based on the test set. Optimize instead using a combination of training and validation datasets to enhance model effectiveness.

When a model experiences performance degradation (in real-world scenarios), it indicates a discrepancy between the training and testing datasets compared to the actual operational conditions. It is crucial to verify that the data used during experimentation aligns with the operational data employed in production.

To maintain performance, there should be no significant gap between the data used for experimentation and the data encountered in production. This means verifying that the experimental (training and test) datasets accurately represent the operational data, so that data quality and distribution stay consistent across both phases.

Additionally, it is essential to implement robust verification mechanisms for maintaining data consistency across both training and operational phases. By doing so, one can prevent unintended performance drops while ensuring seamless integration of models into production environments.

Finally, regular monitoring of data sources ensures that any deviations from expected patterns are promptly identified. This proactive approach helps maintain system reliability by addressing issues before they impact end-users.

4. Tools used during data modelling

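The tools already mentioned in these notes include Jupyter Notebook, pandas, plotting (for example with Matplotlib), and scikit-learn. A minimal sketch, assuming those libraries are installed, that simply confirms they can be imported and prints their versions:

```python
import matplotlib
import pandas
import sklearn

print("pandas", pandas.__version__)
print("matplotlib", matplotlib.__version__)
print("scikit-learn", sklearn.__version__)
```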
