Data Modeling and Analysis

The subject is how to represent, group, mine, process, and analyze data, and how to discover knowledge from it.

Prerequisites: a background in linear algebra, probability, and statistics.

Topics covered:

  1. Data types, sources, nature, scales and distributions

  2. Data representations, transformation, dimensionality reduction and normalization

  3. Classification: Statistical based, Distance based, Decision based, Deep Learning.

  4. Clustering: Partitional, Hierarchical, Model and Density based, others.

  5. Retrieval and Mining: Similarity measures and matching techniques.

  6. Reinforcement Learning: Classification, Control and learning patterns over time.

  7. Knowledge discovery in data: Rule induction, Association rules mining, text mining.

Data types include:

Nominal, Ordinal, Interval, Ratio, Structural Data, Graphs or Trees, Databases, and Lists/Vectors/Matrices/Data Cubes.

Summarizing data:

  • Central tendency: mean
  • Dispersion: variance, standard deviation, MAD (Mean Absolute Deviation), IQR (Interquartile Range)
  • Relationship between two variables:

PCC (Pearson Correlation Coefficient):

\[ \begin{aligned} r &=\frac{\operatorname{cov}\left(v_{1}, v_{2}\right)}{s_{1} s_{2}} \\ \operatorname{cov}\left(v_{1}, v_{2}\right) &=\frac{1}{n}\left(v_{1}-\overline{v}_{1}\right)\left(v_{2}-\overline{v}_{2}\right)^{T} \end{aligned}\]
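The summary measures and the Pearson coefficient above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the function names `dispersion_summary` and `pearson` are hypothetical, the 1/n (population) convention is used to match the covariance formula, and the quartiles use the "inclusive" method, which is only one of several common definitions. The input values are made-up toy data.

```python
import math
import statistics

def dispersion_summary(values):
    """Return mean, variance, standard deviation, MAD and IQR of a sample."""
    mean = statistics.fmean(values)
    variance = statistics.pvariance(values)   # 1/n convention
    std = math.sqrt(variance)
    # MAD here means the mean absolute deviation from the mean.
    mad = sum(abs(v - mean) for v in values) / len(values)
    # Quartile definitions vary; "inclusive" is an assumption of this sketch.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {"mean": mean, "variance": variance, "std": std,
            "mad": mad, "iqr": q3 - q1}

def pearson(v1, v2):
    """r = cov(v1, v2) / (s1 * s2), with cov and s computed using 1/n."""
    n = len(v1)
    m1, m2 = statistics.fmean(v1), statistics.fmean(v2)
    cov = sum((a - m1) * (b - m2) for a, b in zip(v1, v2)) / n
    s1, s2 = statistics.pstdev(v1), statistics.pstdev(v2)
    # Raises ZeroDivisionError if either variable is constant.
    return cov / (s1 * s2)

# Toy data, invented for illustration only.
v1 = [1.0, 2.0, 3.0, 4.0, 5.0]
v2 = [2.0, 4.0, 6.0, 8.0, 10.0]
summary = dispersion_summary(v1)
r = pearson(v1, v2)
```

Because `v2` is an exact linear function of `v1` with positive slope, `r` comes out as 1 up to floating-point rounding.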

Cross Correlation

\[ R(s, t)=\frac{E\left[\left(X_{t}-\overline{x}\right)\left(X_{s}-\overline{x}\right)\right]}{\sigma_{t} \sigma_{s}}\]
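The expectation in R(s, t) cannot be computed directly from a single observed series, so in practice it is estimated. The sketch below assumes a stationary series, so that the mean and standard deviation are the same at every time step and R depends only on the lag between s and t. It uses the common biased (1/n) sample estimator; the function name `autocorrelation` is hypothetical.

```python
import statistics

def autocorrelation(x, lag):
    """Biased sample estimate of R(s, t) for a stationary series, lag = |t - s|."""
    n = len(x)
    if not 0 <= lag < n:
        raise ValueError("lag must satisfy 0 <= lag < len(x)")
    mean = statistics.fmean(x)
    # Under stationarity, sigma_t * sigma_s reduces to the variance.
    var = sum((v - mean) ** 2 for v in x) / n
    cov = sum((x[i] - mean) * (x[i + lag] - mean) for i in range(n - lag)) / n
    # Raises ZeroDivisionError for a constant series.
    return cov / var

# Toy series, invented for illustration only.
series = [1.0, 2.0, 3.0, 4.0, 5.0]
r0 = autocorrelation(series, 0)
r1 = autocorrelation(series, 1)
```

At lag 0 the estimate is exactly 1, since the series is compared with itself.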

Data preprocessing

  • Data inspection and cleaning: Accuracy, Completeness, Consistency, Interpretability
  • Data transformation:
    • Filling in Missing Data
    • Smoothing with Bins
    • Smoothing with Windows
    • Normalization - Feature Scaling
    • Scaling
    • Data reduction: Sampling
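The transformation steps listed above can each be sketched in a few lines of plain Python. The notes do not say which variant of each technique is intended, so the choices here are assumptions: missing values are filled with the mean, bins are equal-size and smoothed by their means, windows use a simple moving average, scaling is shown as both min-max and z-score, and sampling is uniform without replacement. All function names are hypothetical.

```python
import random
import statistics

def fill_missing(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = statistics.fmean(observed)
    return [mean if v is None else v for v in values]

def smooth_by_bin_means(values, bin_size):
    """Sort, split into equal-size bins, replace each value by its bin mean.

    The result is returned in sorted order, not the original order.
    """
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        chunk = ordered[start:start + bin_size]
        smoothed.extend([statistics.fmean(chunk)] * len(chunk))
    return smoothed

def moving_average(values, window):
    """Sliding-window mean; returns len(values) - window + 1 points."""
    return [statistics.fmean(values[i:i + window])
            for i in range(len(values) - window + 1)]

def min_max_scale(values):
    """Rescale to [0, 1]; fails if all values are equal."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def z_score(values):
    """Center to mean 0 and scale to unit (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

def random_sample(values, k, seed=0):
    """Data reduction by uniform sampling without replacement."""
    return random.Random(seed).sample(values, k)

# Toy inputs, invented for illustration only.
filled = fill_missing([1.0, None, 3.0])
binned = smooth_by_bin_means([4, 8, 15, 21, 21, 24], bin_size=3)
windowed = moving_average([1, 2, 3, 4, 5], window=3)
scaled = min_max_scale([2, 4, 6])
```

Min-max scaling is sensitive to outliers because a single extreme value sets the range; z-score scaling is often the safer default when the data are roughly bell-shaped.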

Data similarity

  • Data matrix (n samples, d features)
  • Distance matrix: pairwise distances between n data points
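As a small illustration of how the two matrices relate, the sketch below turns an n × d data matrix into an n × n distance matrix. The notes do not name a metric, so Euclidean distance is an assumption here, and `distance_matrix` is a hypothetical helper name.

```python
import math

def distance_matrix(data):
    """Build the symmetric n x n matrix of pairwise Euclidean distances.

    `data` is a list of n rows, each a list of d feature values.
    """
    n = len(data)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(data[i], data[j])   # Euclidean distance
            dist[i][j] = dist[j][i] = d       # the matrix is symmetric
    return dist

# Toy data matrix with n = 3 samples and d = 2 features.
data = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]
dist = distance_matrix(data)
```

The diagonal is zero and the matrix is symmetric, so only n(n - 1)/2 distances actually need to be computed.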
