What is data, a dataset, and how do we describe it?

We say that data is held in a dataset that contains N observations (rows) and M features (columns). So for each of the N observations, there are M features. Those features could be almost anything; think of the horsepower of a car, its weight, etc. A feature is also known as an attribute, variable, field or characteristic, and these terms are used interchangeably. Observations are also known as objects. If we visualize the dataset as a table, each observation is a row and each feature is a column.

Hint: When using Pandas, the data is held in a dataframe, usually named df, dataframe or train. To get a quick look at some of the data, you can use the function df.head().
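
As a minimal sketch of the hint above, here is a tiny made-up dataframe (the column names and values are assumptions for illustration only) with N = 3 observations and M = 2 features:

```python
import pandas as pd

# Hypothetical toy dataset: 3 observations (rows), 2 features (columns).
df = pd.DataFrame({
    "horsepower": [130, 165, 150],
    "weight": [3504, 3693, 3436],
})

print(df.shape)   # (N, M): number of observations and number of features
print(df.head())  # the first rows of the data (up to 5 by default)
```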

What different types of datasets are there?

There are mainly three types of datasets: record data, relational data and ordered data. The type you will most often encounter is probably record data, as it is the most common.

  • Record data: Data objects and their attributes in a table. An example is house data, where you have size of property, kitchen, how many rooms etc.
  • Relational data: Data objects and their relations in a graph. Think of a graph where you draw lines between people that know each other.
  • Ordered data: Data objects that are sorted, just like a sequence. Often represented as a time series, where one or more values change over time and the observations are sorted by time.

What is a feature in a dataset?

These features can be discrete, continuous or binary (often called discrete variables, continuous variables and binary variables in books).

  • A discrete feature is often represented as an integer and usually takes a finite set of values. Examples are zip codes or word counts in a document.
  • A continuous feature is often represented as a floating point number (or double) and can take any value within a range. Examples are temperature, height or weight.
  • A binary feature is always represented as a binary value, e.g. 0 (false) or 1 (true). An example is a shopping cart: is it empty (0), or does it have items (1)?

Hint: When using Pandas to import your data, you can print df.dtypes, where df is the dataframe holding your data (sometimes called train). It lists all the features and the data type of each one. Text-valued categorical features typically show up with the data type object.
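
A small sketch of what the hint looks like in practice. The dataframe is made up for illustration, and the exact type name printed for the text column can differ between Pandas versions (older versions print object):

```python
import pandas as pd

# Hypothetical house data with one feature of each kind.
df = pd.DataFrame({
    "rooms": [3, 4, 2],                     # discrete (integer)
    "size_m2": [85.5, 120.0, 60.25],        # continuous (float)
    "has_garage": [True, False, True],      # binary (bool)
    "city": ["Berlin", "Rome", "Austin"],   # categorical text
})

# One line per feature, showing the data type Pandas inferred for it.
print(df.dtypes)
```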

Features also come in four different types: nominal, ordinal, interval and ratio.

  1. Nominal: a category, equal or not equal. Think ID numbers, eye color or zip codes. Also called categorical feature.
  2. Ordinal: object that can be ranked, greater than or less than something else. Think grades in school or taste of food from 1-10.
  3. Interval: an object that can be measured, we can apply addition or subtraction. An example is calendar days.
  4. Ratio: an object where we can apply multiplication or division. The number zero means absence of that object. This could be length, time or counts of something.

A feature can be one type and one type only, e.g. a feature cannot be both a ratio and an ordinal feature. But you could argue that, as we go down this list of four items, each item inherits the properties of the ones above it. So a ratio feature supports all the operations of the other types, even though it is not itself a nominal, ordinal or interval feature.
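
To make the difference between "equal or not equal" and "can be ranked" concrete, here is a hedged sketch using an ordered Pandas categorical. The grade scale C < B < A is an assumption chosen for the example:

```python
import pandas as pd

# Hypothetical school grades, declared as ordinal with the order C < B < A.
grades = pd.Series(pd.Categorical(
    ["B", "A", "C"],
    categories=["C", "B", "A"],
    ordered=True,
))

# Because the feature is ordinal, ranking operations are meaningful.
print(grades.min(), grades.max())
print((grades > "C").tolist())  # which grades rank above "C"
```

A nominal feature such as eye color would be declared without ordered=True, and then only equality checks make sense.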

What about quality of data?

Quality of data can be of the utmost importance. For example, if you have many missing values in your dataset, you will have a bad time with that data. You also want to make sure that the dataset you have found is fit for the purpose you want to use it for. The data also has to represent reality correctly, that is, it has to be correct data about some event.

What you don't want in your dataset is 1) noise, 2) outliers and 3) missing values. These will cause problems further down the machine learning pipeline, especially when modelling.

What is feature extraction?

In feature extraction (also called feature transformation), you are eliminating or suppressing some aspects of your data. That might be noise removal in audio, removal of background in images or eliminating common words (e.g. if, and) in text documents.

A typical feature transformation is where you either standardize or binarize your dataset. If you were to standardize your data, you would subtract the mean of a feature from that feature's value in each observation (and usually also divide by the feature's standard deviation). And if you were to binarize your data, you would make a rule that specifies when the value is 1 or 0 for specific features. Such a rule could be: if x > 0, then 1, else 0.
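
Both transformations can be sketched in a few lines of NumPy. The values in x are made up, and dividing by the standard deviation is the common convention for standardization rather than something specific to this post:

```python
import numpy as np

# Hypothetical values of a single feature across four observations.
x = np.array([-2.0, 0.0, 1.0, 5.0])

# Standardize: subtract the mean, then divide by the standard deviation.
centered = x - x.mean()
standardized = centered / x.std()

# Binarize with the rule "if x > 0, then 1, else 0".
binarized = (x > 0).astype(int)

print(standardized)
print(binarized)
```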

Here is another example of feature transformation. Text like that in the column 'Nationality' is hard for a computer to interpret, and we would like an easier way to represent each person's nationality.

Age Weight Nationality
22 68 Germany
43 75 Italy
35 78 USA

What we then do is make a feature out of every nationality, so that each one becomes a column, and we label it with a 1 or 0, for true or false.

Age Weight Germany Italy USA
22 68 1 0 0
43 75 0 1 0
35 78 0 0 1

This way we now have more useful data. This is called one-out-of-K encoding or one-hot encoding.
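
In Pandas, one way to get from the first table to the second is pd.get_dummies. This is a sketch of one possible approach, not the only one; the empty prefix is used only so the new columns are named exactly like the nationalities:

```python
import pandas as pd

# The first table from above.
df = pd.DataFrame({
    "Age": [22, 43, 35],
    "Weight": [68, 75, 78],
    "Nationality": ["Germany", "Italy", "USA"],
})

# One-hot encode 'Nationality': one 0/1 column per distinct value.
encoded = pd.get_dummies(
    df, columns=["Nationality"], prefix="", prefix_sep="", dtype=int
)
print(encoded)
```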

How could you describe your data in matrices?

One way to describe the above table in a matrix would be:

$$ X = \begin{bmatrix} 22 & 68 & 1 & 0 & 0 \\ \vdots &\vdots &\vdots &\vdots & \vdots \\ 35 & 78 & 0 & 0 & 1 \end{bmatrix}^T $$

This is a direct 'translation' from a table to a matrix, where the big T means transpose, which really just means we flipped the matrix so that rows become columns and columns become rows. But what if we wanted to represent a single observation as a row matrix?

You would then index the matrix as $X_{i,j}$, where $i$ is the row (observation) and $j$ is the column (attribute) of the table. So if we were to refer to $X_{2,4}$, we would be referring to the second observation's value for the attribute 'Italy'. We also write $x_i$ for the i'th observation as a row vector; taking the transpose is what changes a column vector into a row vector, and we can always think of a matrix as a collection of such vectors. An example is the second observation in the table above as $x_2$:

$$ x_2 = \begin{bmatrix} 43 & 75 & 0 & 1 & 0 \end{bmatrix} $$
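
The same indexing can be tried out in NumPy. Keep in mind that NumPy counts from 0, so the element written $X_{2,4}$ in the math notation is X[1, 3] in code. This is only a sketch with the table stored row by row:

```python
import numpy as np

# The encoded table: one row per observation, one column per attribute.
X = np.array([
    [22, 68, 1, 0, 0],
    [43, 75, 0, 1, 0],
    [35, 78, 0, 0, 1],
])

print(X.shape)    # (observations, attributes)
print(X[1, 3])    # second observation, fourth attribute ('Italy')

x_2 = X[1]        # the second observation as a row vector
print(x_2)

print(X.T.shape)  # the transpose swaps rows and columns
```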

Could there be issues with your data?

Certainly yes, there could be:

  • Irrelevant attributes, e.g. an ID column, because that attribute only depends on the order of the data.
  • Outliers, e.g. an observation that seems out of place. The price per gallon of fuel can't be -100 dollars or 100 dollars, when the rest of the observations are at about 3-8 dollars.
  • Missing data, e.g. values that are recorded as 0 or NaN (Not a Number).
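
A rough sketch of how the first two issues might be handled in Pandas. The column names and the plausible price range of 0 to 20 dollars are assumptions made up for this example; real bounds depend on your data:

```python
import pandas as pd

# Hypothetical fuel price data with an ID column and two odd values.
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "price_per_gallon": [3.5, -100.0, 4.2, 100.0],
})

# Drop the irrelevant ID attribute.
df = df.drop(columns=["id"])

# Flag observations outside an assumed plausible range as outliers.
plausible = df["price_per_gallon"].between(0, 20)
outliers = df[~plausible]
print(outliers)
```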

What to do about issues with your data?

Sometimes you might discard an attribute if it has too much missing data and is deemed not essential to the dataset. So you have to think about what is essential to a dataset; for a house, examples would be the area of the lot and the price. Those two attributes are essential, whereas a feature like the area of the garage is not.
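
As a sketch of that idea, the share of missing values per attribute can be computed and used to drop attributes above a threshold. The data and the 50% threshold are assumptions for illustration, and whether an attribute is essential remains a judgment call the code cannot make for you:

```python
import numpy as np
import pandas as pd

# Hypothetical house data; garage_area is mostly missing.
train = pd.DataFrame({
    "lot_area": [500.0, 420.0, np.nan, 610.0],
    "price": [250000, 199000, 310000, 275000],
    "garage_area": [np.nan, np.nan, 30.0, np.nan],
})

# Fraction of missing values in each attribute.
missing_share = train.isna().mean()
print(missing_share)

# Drop attributes where more than an assumed 50% of values are missing.
threshold = 0.5
to_drop = missing_share[missing_share > threshold].index
reduced = train.drop(columns=to_drop)
print(reduced.columns.tolist())
```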

You might want an easy way to find out if your data has missing values. I do that by looking at a graph from a module called missingno. I'm using Anaconda Navigator, which gives me access to Jupyter Notebook or Spyder, but the installation is the same either way. Open the Anaconda Prompt and type conda install -c conda-forge missingno. It can take a while, particularly the "Collecting package metadata" step, but it will finish eventually.

Once installed, you use missingno.matrix(train, figsize = (30,10)) where train is your Pandas dataframe. It displays a graph where each black column is a feature. Every white horizontal line in a black column indicates a missing or NaN value. Here is how it looks:

There are other ways to treat missing data, such as Bayesian learning methods, which can account for missing data. Also, other machine learning techniques can apply, but I won't dive into that in this topic, as this is introductory.

If it happens that we have only a few observations with missing data, we could also discard those, as that wouldn't affect the overall dataset. 10 removed observations out of 1000 are not going to make a difference.
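
Discarding observations with missing values is a one-liner in Pandas. A minimal sketch on made-up data:

```python
import numpy as np
import pandas as pd

# Hypothetical data where two observations have a missing value.
df = pd.DataFrame({
    "age": [22, np.nan, 35, 41],
    "weight": [68, 75, np.nan, 80],
})

# Keep only the observations (rows) without any missing values.
cleaned = df.dropna()
print(len(df), "->", len(cleaned))
```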

But what if the attributes or observations with missing data are important?
Then you could make a script that fills in those missing values with some kind of neutral / objective guess. Note that this is not ideal, but it can be a useful tool to keep your data relevant for the problem you are trying to solve with that data. The script could take the mean of an attribute and fill in the empty spots with the mean, of course rounding up or down for integer values. If you have an ordinal attribute, then you could also write a script to find the most common value and use that for all the missing values.
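
Such a script could look roughly like the sketch below. The column names and values are invented for the example, and fillna with the mean or the most common value is just one simple strategy among many:

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in each attribute.
df = pd.DataFrame({
    "rooms": [3, 4, np.nan, 4],        # integer-valued attribute
    "grade": ["B", None, "B", "A"],    # ordinal attribute
})

# Fill missing room counts with the mean, rounded to a whole number.
rooms_mean = df["rooms"].mean()
df["rooms"] = df["rooms"].fillna(round(rooms_mean)).astype(int)

# Fill missing grades with the most common value (the mode).
most_common = df["grade"].mode()[0]
df["grade"] = df["grade"].fillna(most_common)

print(df)
```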

Look here for a reference on how to fill a dataframe:

Statistics Basics - Variance and Standard Deviation
We want to be able to say something about our data using some statistical measures, because it is important information. There are many ways to use statistics in machine learning, and the mean, variance and standard deviation are just some of them. I did briefly explain some of this in my last post, …