Introduction to Data Mining (Pang-Ning Tan): Exercise Notes
Chapter 2: Data
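The Chinese edition renders the statistician's reply as "Yes, fields 2 and 3 also have quite a few problems." The correct translation is "fields 2 and 3 have the same meaning." The translator apparently carried over the phrasing "有意思" ("interesting") from the preceding sentence. The author's point, however, is not merely that problems were found in these fields. It is that fields 2 and 3 mean exactly the same thing, so the data contain a redundant attribute column.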
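Exercise 2: the main difficulty is distinguishing interval attributes from ratio attributes.
An interval attribute is one for which the difference between two values is meaningful. With temperature, for example, 95°C is 10°C higher than 85°C. Compared with nominal and ordinal attributes, interval attributes carry more of the properties of numbers.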
An interval measure is one where the magnitude of the difference between two values is meaningful. The difference between temperatures of 100 degrees and 90 degrees is the same as the difference between 90 degrees and 80 degrees.
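A ratio attribute has all the properties of an interval attribute: every ratio attribute is also an interval attribute, but the converse does not hold. To tell the two apart, ask whether the value 0.0 has a real meaning, that is, whether 0.0 means that none of the quantity is present. For an interval attribute, 0.0 is just another point on the scale. For example, mass is a ratio attribute, because a mass of zero means there is no mass. pH is not, because pH = 0 does not mean an absence of acidity; on the contrary, it indicates very strong acidity.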
【A ratio variable has all the properties of an interval variable, and also has a clear definition of 0.0. When the variable equals 0.0, there is none of that variable. Variables like height, weight, and enzyme activity are ratio variables. Temperature, expressed in F or C, is not a ratio variable. A temperature of 0.0 on either of those scales does not mean 'no temperature'. However, temperature in Kelvin is a ratio variable, as 0.0 Kelvin really does mean 'no temperature'. Another counterexample is pH. It is not a ratio variable, as pH=0 just means 1 molar of H+, and the definition of molar is fairly arbitrary. A pH of 0.0 does not mean 'no acidity' (quite the opposite!). When working with ratio variables, but not interval variables, you can look at the ratio of two measurements. A weight of 4 grams is twice a weight of 2 grams, because weight is a ratio variable. A temperature of 100 degrees C is not twice as hot as 50 degrees C, because temperature C is not a ratio variable. A pH of 3 is not twice as acidic as a pH of 6, because pH is not a ratio variable.】
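A small numeric check can make the interval versus ratio distinction concrete. The sketch below is only illustrative: the helper names (celsius_to_kelvin, hydrogen_ion_concentration) are hypothetical and not from the book, and it assumes the usual definitions K = C + 273.15 and pH = -log10 of the hydrogen ion concentration in mol/L. It shows that differences survive a change of temperature scale while ratios do not, and that pH 3 versus pH 6 corresponds to roughly a factor of 1000 in concentration rather than a factor of 2.

```python
def celsius_to_kelvin(c):
    """Convert Celsius to Kelvin (assumes K = C + 273.15)."""
    return c + 273.15


def hydrogen_ion_concentration(ph):
    """Hydrogen ion concentration in mol/L, assuming pH = -log10([H+])."""
    return 10.0 ** (-ph)


hot_c, cold_c = 100.0, 50.0
hot_k, cold_k = celsius_to_kelvin(hot_c), celsius_to_kelvin(cold_c)

# Differences are meaningful on an interval scale: same gap in C and in K.
diff_c = hot_c - cold_c
diff_k = hot_k - cold_k

# Ratios are only meaningful on a ratio scale: the Celsius "ratio" is an
# artifact of where zero was placed, the Kelvin ratio is not.
ratio_c = hot_c / cold_c   # 2.0, but 100 C is not "twice as hot" as 50 C
ratio_k = hot_k / cold_k   # about 1.155

# pH is not a ratio variable: pH 3 vs pH 6 is not "twice as acidic".
conc_ratio = hydrogen_ion_concentration(3) / hydrogen_ion_concentration(6)

print(f"difference: {diff_c:.2f} C vs {diff_k:.2f} K")
print(f"ratio in Celsius: {ratio_c:.3f}, ratio in Kelvin: {ratio_k:.3f}")
print(f"[H+] at pH 3 divided by [H+] at pH 6: {conc_ratio:.1f}")
```

The Kelvin ratio is the one with a physical interpretation, which matches the quoted passage: temperature in Kelvin behaves as a ratio variable, temperature in Celsius only as an interval variable.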
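Exercise 6 centers on one point: association analysis requires the attributes of each tuple to be in binary form, so a basic first step is to binarize them.
Why is binarization needed? What methods are there for binarizing attributes?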
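One common line of reasoning, offered here as a sketch rather than as the book's official answer: association analysis looks for items that occur together, so each attribute value has to become an item that is either present (1) or absent (0) in a record. A categorical attribute can be expanded into one binary attribute per value, and a continuous attribute can first be discretized into bins and then expanded the same way. The functions one_hot and equal_width_bins, the example data, and the choice of equal-width binning are all assumptions made for illustration.

```python
def one_hot(values):
    """Map each categorical value to a row with one binary attribute per category."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


def equal_width_bins(values, n_bins):
    """Discretize continuous values into n_bins equal-width intervals.

    Assumes max(values) > min(values); the maximum falls in the last bin.
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    labels = []
    for v in values:
        idx = min(int((v - lo) / width), n_bins - 1)
        labels.append(f"bin{idx}")
    return labels


# Hypothetical categorical attribute: one binary column per distinct value.
quality = ["poor", "OK", "good", "OK"]
quality_cats, quality_rows = one_hot(quality)

# Hypothetical continuous attribute: discretize first, then binarize.
ages = [20, 35, 50, 65, 80]
age_labels = equal_width_bins(ages, 3)
age_cats, age_rows = one_hot(age_labels)

print(quality_cats, quality_rows)
print(age_labels, age_cats, age_rows)
```

In each output row exactly one column is 1, so every record contributes one item per original attribute. A more compact encoding that assigns each of m categories an integer and writes it with about log2(m) bits is also possible, but a single bit in that encoding does not correspond to the presence of a particular value, which makes it a less natural fit for association analysis.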
