Nomenclature¶
Suhas Somnath
8/8/2017
Let's clarify some nomenclature to avoid confusion.
Data schema¶
Data schema or model refers to the way the data is arranged. This arrangement does not depend on how it is implemented in any particular file format.
File format¶
This corresponds to the kind of file within which information is contained, such as a spreadsheet (.CSV), an image (.PNG), or a text file (.TXT).
Data format¶
Data format is actually a rather broad term. However, we have observed that
people often use it to refer to the combination of a data model implemented within a file format.
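The distinction between the three terms can be sketched with a small example. The snippet below is only an illustration, not part of any particular package: the `records` list and the helper names (`to_csv`, `from_csv`, `to_json`, `from_json`) are hypothetical, and the numbers are placeholder values. It shows one data model (a list of bias / current records) written into two different file formats, each pairing being one "data format" in the loose sense described above.

```python
import csv
import io
import json

# Data schema (model): a list of records, each holding a bias value
# and the current recorded at that bias. Values are placeholders.
records = [
    {"bias": 0.0, "current": 0.1},
    {"bias": 0.5, "current": 0.25},
    {"bias": 1.0, "current": 0.4},
]


def to_csv(rows):
    """Implement the model in the CSV file format (kept in memory here)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["bias", "current"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def from_csv(text):
    """Recover the same model from CSV text."""
    reader = csv.DictReader(io.StringIO(text))
    return [{key: float(value) for key, value in row.items()} for row in reader]


def to_json(rows):
    """Implement the same model in the JSON file format."""
    return json.dumps(rows)


def from_json(text):
    """Recover the same model from JSON text."""
    return json.loads(text)


csv_text = to_csv(records)
json_text = to_json(records)
```

The two strings look nothing alike, yet reading either one back yields the same arrangement of data, which is the sense in which the schema is independent of the file format.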
Measurements¶
In all measurements, some quantity such as voltage, resistance, current, amplitude, or intensity is collected
as a function of (typically all combinations of) one or more independent variables. For example, a gray-scale image records
one quantity, intensity, for all combinations of two variables, row and column. A (simple) spectrum represents
a quantity such as amplitude or phase recorded as a function of a reference variable such as wavelength or frequency.
Data collected from measurements result in N-dimensional datasets where each dimension corresponds to a variable that
was varied. Going back to the above examples, a gray-scale image would be represented by a 2-dimensional dataset whose
dimensions are row and column. Similarly, a simple spectrum would be a 1-dimensional dataset whose sole dimension would
be, for example, frequency.
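As a rough sketch of the two examples above, the snippet below builds a gray-scale image and a simple spectrum as plain nested Python lists and reports how many dimensions each has. The sizes, the zero-valued placeholder data, and the `shape` helper are all assumptions made for illustration.

```python
def shape(nested):
    """Return the size of each dimension of a regularly nested list."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0] if nested else None
    return tuple(dims)


# Gray-scale image: one intensity value for every (row, column) combination.
num_rows, num_cols = 3, 4
image = [[0 for _ in range(num_cols)] for _ in range(num_rows)]

# Simple spectrum: one amplitude value for every frequency.
frequencies = [1.0, 2.0, 3.0, 4.0, 5.0]
spectrum = [0.0 for _ in frequencies]

image_shape = shape(image)        # one entry per dimension: (rows, columns)
spectrum_shape = shape(spectrum)  # a single entry: (frequency,)
```

The number of entries in each shape is the dimensionality of the dataset: two for the image, one for the spectrum.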
Dimensionality¶
We consider data recorded for all combinations of 2 or more variables as multi-dimensional datasets or Nth order tensors.
For example, if a single value of current is recorded as a function of driving / excitation bias or voltage having B values, the dataset is said to be 1-dimensional and the dimension would be Bias.
If the bias is cycled C times, the data is said to be 2-dimensional with dimensions (Bias, Cycle).
If the bias is varied over B values over C cycles at X columns and Y rows in a 2D grid of positions, the resultant dataset would have 4 dimensions: (Rows, Columns, Cycle, Bias).
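The 4-dimensional case can be sketched as follows. This is a minimal illustration with small, arbitrarily chosen sizes for Y, X, C, and B; `measure_current` is a hypothetical stand-in for an instrument reading and simply returns a placeholder value.

```python
from itertools import product

# Assumed sizes for illustration: Y rows, X columns, C cycles, B bias values.
num_rows, num_cols, num_cycles, num_bias = 2, 3, 2, 5
dimension_names = ("Rows", "Columns", "Cycle", "Bias")
dimension_sizes = (num_rows, num_cols, num_cycles, num_bias)


def measure_current(row, col, cycle, bias_index):
    """Hypothetical stand-in for the instrument; returns a placeholder."""
    return 0.0


# One value of current is recorded for EVERY combination of the four variables.
dataset = {
    index: measure_current(*index)
    for index in product(*(range(size) for size in dimension_sizes))
}

num_dimensions = len(dimension_names)
num_values = len(dataset)
```

Because every combination is visited exactly once, the number of recorded values equals the product Y x X x C x B, which is what makes the dataset 4-dimensional instead of a flat list.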
Multi-feature: As a different example, let us suppose that the petal width, length, and weight were measured for F different kinds of flowers. This would result in a 1-dimensional dataset with the kind of flower being the sole dimension. Such a dataset is not a 3-dimensional dataset because the petal width, length, and weight are only different features for each measurement. Some quantity needs to be measured for all combinations of petal width, length, and weight to make this dataset 3-dimensional. Most examples observed in data mining and simple machine learning actually fall into this category.
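The difference between features and dimensions can be sketched as below. All names and numbers are made up for illustration: `flower_dataset` holds F = 3 hypothetical kinds of flowers, and the `widths`, `lengths`, and `weights` lists are arbitrary values used only to show what a genuinely 3-dimensional dataset would require.

```python
from itertools import product

# Multi-feature dataset: F kinds of flowers, three features recorded for each.
# The values are placeholders, not real measurements.
flower_dataset = [
    {"kind": "A", "width": 1.0, "length": 2.0, "weight": 0.1},
    {"kind": "B", "width": 1.5, "length": 2.5, "weight": 0.2},
    {"kind": "C", "width": 1.2, "length": 3.0, "weight": 0.1},
]
feature_names = ["width", "length", "weight"]

# The only variable that was varied is the kind of flower: 1 dimension of size F.
multi_feature_shape = (len(flower_dataset),)

# A 3-dimensional dataset would instead need some quantity recorded for
# every combination of width, length, and weight.
widths = [1.0, 1.5]
lengths = [2.0, 2.5, 3.0]
weights = [0.1, 0.2]
combinations = list(product(widths, lengths, weights))
three_dimensional_shape = (len(widths), len(lengths), len(weights))
```

Adding a fourth feature to `flower_dataset` would widen each record but leave the dataset 1-dimensional, whereas adding a fourth varied variable to the second case would multiply the number of required measurements.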