Nomenclature¶
Suhas Somnath
8/8/2017
Lets clarify some nomenclature to avoid confusion.
Data schema¶
Data schema or model refers to the way the data is arranged. This does not depend on the implementation in a particular file format
File format¶
This corresponds to the kind of file, such as a spreadsheet (.CSV), an image (.PNG), a text file (.TXT) within which information is contained.
Data format¶
data format is actually a rather broad term. However, we have observed that
people often refer to the combination of a data model implemented within a file format as a data format
.
Measurements¶
In all measurements, some quantity
such as voltage, resistance, current, amplitude, or intensity is collected
as a function of (typically all combinations of) one or more independent variables
. For example, a gray-scale image represents the
quantity - intensity being recorded for all combinations of the variables - row and column. A (simple) spectrum represents
a quantity such as amplitude or phase recorded as a function of a reference variable such as wavelength or frequency.
Data collected from measurements result in N-dimensional datasets where each dimension
corresponds to a variable that
was varied. Going back to the above examples a gray-scale image would be represented by a 2 dimensional dataset whose
dimensions are row and column. Similarly, a simple spectrum wold be a 1 dimensional dataset whose sole dimension would
be frequency for example.
Dimensionality¶
We consider data recorded for all combinations of 2 or more variables as
multi-dimensional
datasets orNth order tensors
:For example, if a single value of current is recorded as a function of driving / excitation bias or voltage having B values, the dataset is said to be
1 dimensional
and the dimension would be -Bias
.If the bias is cycled C times, the data is said to be
two dimensional
with dimensions -(Bias, Cycle)
.If the bias is varied over B values over C cycles at X columns and Y rows in a 2D grid of positions, the resultant dataset would have
4 dimensions:
(Rows, Columns, Cycle, Bias)
.
Multi-feature
: As a different example, let us suppose that thepetal width
,length
, andweight
were measured forF
different kinds of flowers. This would result in a1 dimensional dataset
with the kind of flower being the sole dimension. Such a dataset is not a 3 dimensional dataset because thepetal width, length
, andweight
are only differentfeatures
for each measurement. Some quantity needs to be measured for all combinations of petal width, length, and weight to make this dataset 3 dimensional. Most examples observed in data mining, simple machine learning actually fall into this category