Scenario
Predicting wall thickness from below ground diameter for poles
This example demonstrates how to investigate the decay of poles from historical data. It will use the below ground diameter to predict the wall thickness.
It will create a data table from a CSV file, and use Neara's linear regression functionality to regress one feature against another.
Source data table
In the sample project a data table called Regression Data has already been populated.
Download the sample project file from the Introduction article.
Click the (+) to the right of any group of tabs and select Data tables on the popup. On the Data tables panel that appears select the Regression Data entry to inspect the data:
Notice that not every field is fully populated. Linear regression will fail if provided with incomplete data, and there is no default filtering policy, so the data set must be explicitly cleaned before it is used for linear regression.
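The cleaning step can be done before the data ever reaches Neara. The following is a minimal Python sketch, not a Neara feature: it assumes the source is CSV text, borrows the two column names used later in this article, and uses made-up sample values. It drops any row where a required cell is blank or not numeric.

```python
import csv
import io

# Hypothetical sample data; only the column names come from this article.
RAW = """belowgrounddiameter,wallthicknessglfirst
300,150
,140
280,
260,130
"""

def clean_rows(text, required):
    """Keep only rows where every required column holds a parseable number."""
    cleaned = []
    for row in csv.DictReader(io.StringIO(text)):
        try:
            cleaned.append({name: float(row[name]) for name in required})
        except (TypeError, ValueError):
            continue  # blank, missing, or non-numeric cell: drop the whole row
    return cleaned

rows = clean_rows(RAW, ["belowgrounddiameter", "wallthicknessglfirst"])
for row in rows:
    print(row)
```

Only the complete rows survive, and those are the ones worth pasting into a data table for regression.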
Import custom regression data
To create a data table with different regression data, on the Data tables panel click New:
Neara accepts data pasted from apps like Microsoft Excel and Google Sheets.
Open a spreadsheet (or CSV) file containing the source data in one of those apps, then select and copy the cells (including column headers) to the clipboard:
Neara Data Tables support data with column headers, but not row headers. Paste the clipboard data into the Data input box shown above:
Notice the formatting isn’t correct on initial paste. Click Save and Neara will format the pasted data into columns:
Methodology
Similar to the plotting example, this example will create a collection and then define the training data using fields on the collection.
The length of the collection should match the length of the data set. This example trains only one model on a static data set, and fits the regression in the Model space. Create a new field under Model in the schema called u_regression_index:
range(len(dt_regression_data))
In the next steps create X and Y fields on the induced collection c_regression_index, extract them as lists, and pass those into the regression function.
Create Y data
Create a new field under Model in the schema called u_y:
parse_num(
  index(
    model().dt_regression_data[].wallthicknessglfirst,
    regression_index_item
  )
) * unit("mm")
How it works
model().dt_regression_data[] gets the data table with its columns as attributes, so .wallthicknessglfirst yields that column as a list.
To extract one value from that list, use index and the regression index. Since data tables generated from text have their values parsed as strings, wrap this item in parse_num to extract its numeric value.
Finally, multiply by unit("mm") to endow the quantity with a unit, because linear regression is unit sensitive.
Although it is possible to perform linear regression on any (consistent) combination of units including dimensionless, respecting the units of the data during the training process allows the extraction of dimensioned information about the model, for use in dimensioned predictions.
Create X data
Although a regression model usually targets a single quantity, it is often trained on many feature quantities.
Each element of the X data has to be passed in as a collection of observations, one for each of the features.
An obvious container type to use in this case is a list. However lists in Neara require that all entries be of a uniform type. It is not possible to define a list with entries of mixed length and mass types, for example.
In this example the features may have dimensions, and not all may have the same type. A collection that supports heterogeneous types is required, and in this case the Neara collection type best suited to this purpose is a tuple.
Create a new field under Model in the schema called u_x:
tuple(
  parse_num(
    index(
      model().dt_regression_data[].belowgrounddiameter,
      regression_index_item
    )
  ) * unit("mm")
)
The value with its unit is constructed as in u_y, and is then wrapped in tuple() to create a tuple with a single element.
To produce a tuple with multiple features, pass them in as a sequence separated by commas, in the same manner as when constructing a list().
Extract X and Y as lists
With the X and Y values defined for the regression, extract them from their collections as lists and pass them into the linear regression fitting function to train a model.
Y will be a list of a real type, and X will be a list of tuples of a real type.
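The shape of the two inputs can be pictured with a small Python analogue. This is only an illustration of the data layout, not Neara syntax, and the numbers are invented: each X entry is a tuple holding one observation per feature, while each Y entry is a single target value.

```python
# Hypothetical observations in millimetres.
diameters_mm = [300.0, 280.0, 260.0]
thicknesses_mm = [150.0, 140.0, 130.0]

# X: one tuple per observation; here each tuple has a single feature.
xs = [(d,) for d in diameters_mm]

# Y: one plain value per observation.
ys = list(thicknesses_mm)

# Both lists must describe the same observations, in the same order.
assert len(xs) == len(ys)
print(xs[0], ys[0])
```

With two features, each tuple would simply carry two entries, for example `(diameter, height)`, while Y would stay a flat list.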
Create a new field under Model in the schema called u_regression_model:
fit_linear_regression_model(
  c_regression_index[].u_x,
  c_regression_index[].u_y,
  intercept: 0
)
fit_linear_regression_model requires two positional arguments corresponding to the X and Y data (in that order), and accepts an optional named argument intercept.
If intercept is specified, the model's intercept is fixed at that value and only the coefficients of the features are fit. If it is left unspecified, the intercept is fit as well.
Note that if intercept is set, the unit of the provided intercept must match the unit of the Y data. In this case, 0 matches the type of any unit.
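The difference between fixing the intercept and fitting it can be seen in a plain Python sketch of single-feature least squares. This is not Neara's implementation, which the article does not describe; the function name fit_line and the data are hypothetical, and the formulas are the standard closed-form least-squares solutions.

```python
def fit_line(xs, ys, intercept=None):
    """Least-squares fit of y = a + b * x for one feature.

    If intercept is given, a is held fixed and only b is fit.
    Otherwise both a and b are fit.
    """
    if intercept is not None:
        # Minimise sum((y - a - b*x)^2) over b alone.
        slope = sum(x * (y - intercept) for x, y in zip(xs, ys)) / sum(x * x for x in xs)
        return slope, intercept
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical data lying exactly on y = 1 + 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

free_slope, free_intercept = fit_line(xs, ys)
forced_slope, forced_intercept = fit_line(xs, ys, intercept=0.0)
print(free_slope, free_intercept)
print(forced_slope, forced_intercept)
```

Forcing the line through the origin changes the fitted slope whenever the data do not actually pass through the origin, so a fixed intercept is best used when there is a physical reason for it.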
The result of fit_linear_regression_model is a Neara type called a LinearRegressionModel that is parametrised by the types of the X and Y data on which it's trained:
The first parameter of the LinearRegressionModel type is the target type, in this case Span Distance. The second parameter is the coefficient type, in this case a tuple of a single dimensionless type. Note that the coefficient type is not the same as the feature type.
The LinearRegressionModel has a number of attributes reflecting the result of the training process:
coefficients returns the fit coefficients for each of the features, with dimensions, as a tuple. Each element of the tuple has type Y type / X type for the corresponding X type in the feature tuples.
intercept returns the intercept of the model, with dimension equal to the dimension of the target on which it was trained.
r_value is a dimensionless quantity representing the R Squared value.
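Because each coefficient carries the unit of Y divided by the unit of X, its numeric value depends on the units chosen for the feature. The Python sketch below illustrates this with invented data and a hypothetical helper, slope_through_origin. It tracks units only in variable names, whereas Neara tracks them in the type.

```python
def slope_through_origin(xs, ys):
    """Least-squares slope for y = b * x (intercept fixed at 0)."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

# Hypothetical observations.
diameters_mm = [300.0, 280.0, 260.0]
thicknesses_mm = [150.0, 140.0, 130.0]

# The same diameters expressed in metres.
diameters_m = [d / 1000.0 for d in diameters_mm]

# mm / mm: a dimensionless coefficient.
coef_mm_per_mm = slope_through_origin(diameters_mm, thicknesses_mm)

# mm / m: the same physical relationship, a different number.
coef_mm_per_m = slope_through_origin(diameters_m, thicknesses_mm)

print(coef_mm_per_mm, coef_mm_per_m)
```

The two coefficients describe the same fit and differ by the factor of 1000 between millimetres and metres, which is why a coefficient is only meaningful together with its unit.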
Create fields under Model in the schema to inspect these values:
u_coefficients:
u_regression_model.coefficients
u_intercept:
u_regression_model.intercept
u_r_squared:
u_regression_model.r_value
Inspecting these values (using the supplied sample data) returns:
Field | Value |
u_coefficients | A tuple with dimensionless value 0.5 |
u_intercept | 0 |
u_r_squared | 1 |
Interpretation
The coefficient of belowgrounddiameter is 0.5 and the intercept is 0. That is, the trained model predicts the wall thickness by halving the belowgrounddiameter and adding nothing. The zero intercept is not surprising, because it was explicitly requested.
Note that r_squared is 1, i.e. the model is a perfect fit of the training data.
This means that in all the samples observed, the wall thickness is exactly half the below ground diameter, which makes sense geometrically.
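To see why a perfect fit gives an R Squared of 1, the value can be computed by hand. The sketch below uses the common definition 1 - SS_res / SS_tot. Whether Neara uses exactly this definition, particularly when the intercept is fixed, is an assumption, and the data are invented.

```python
def r_squared(ys, predictions):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, predictions))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot

# Model from the interpretation above: thickness = 0 + 0.5 * diameter.
diameters_mm = [300.0, 280.0, 260.0]
predictions_mm = [0.5 * d for d in diameters_mm]

# Hypothetical observations that match the model exactly.
exact_mm = [150.0, 140.0, 130.0]
perfect = r_squared(exact_mm, predictions_mm)

# Hypothetical observations with one value that deviates by 3 mm.
noisy_mm = [150.0, 140.0, 133.0]
imperfect = r_squared(noisy_mm, predictions_mm)

print(perfect, imperfect)
```

When every residual is zero the ratio vanishes and the result is exactly 1. Any deviation between observation and prediction pushes the value below 1.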
Make a prediction
With a trained LinearRegressionModel it is possible to interrogate the results of the fit. However, the goal of a LinearRegressionModel is typically to make predictions.
About LinearModels
A LinearModel is a set of coefficients together with an intercept used to predict a target quantity from a set of feature observations.
If the feature variables are named x1, x2, …xn, and the goal is to predict target variable y, a LinearModel is a set of coefficients: a, b1, b2, …, bn used to make predictions with the following formula:
y_prediction = a + b1 * x1 + b2 * x2 + … + bn * xn
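The formula above translates directly into code. The following Python sketch is an illustration of the arithmetic only. The function name predict is hypothetical and is not the Neara function.

```python
def predict(intercept, coefficients, features):
    """Evaluate a + b1*x1 + ... + bn*xn for one observation."""
    if len(coefficients) != len(features):
        raise ValueError("need exactly one feature value per coefficient")
    return intercept + sum(b * x for b, x in zip(coefficients, features))

# One feature, as in this article: a = 0, b1 = 0.5, x1 = 300.
single = predict(0.0, (0.5,), (300.0,))

# Two features with made-up numbers: 1 + 2*4 + 3*5.
double = predict(1.0, (2.0, 3.0), (4.0, 5.0))

print(single, double)
```

The length check mirrors the requirement that an observation must supply one value for every coefficient in the model.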
One way to generate these coefficients is to perform a linear regression on historical observations of the x1, x2, …, xn, y.
However, any LinearModel can be used to make a prediction. In Neara this is reflected by the fact that a LinearRegressionModel is also a LinearModel object (the reverse is not always true).
The function predict_with_linear_model allows users to make a prediction from a given LinearModel object and a set of observations which match the types of the coefficients and target.
In this case the data provided to predict_with_linear_model will match the types of a provided LinearRegressionModel only if the feature data passed in matches exactly the types of the Feature data it was trained on.
Specifically in this example, to make a prediction with u_regression_model, pass in as a feature observation a tuple with a single element whose type matches belowgrounddiameter.
Since all the supplied belowgrounddiameter data was used to train the model, the next step is to generate data.
Because belowgrounddiameter and wallthickness are features that belong to poles, create a field under Pole in the schema called u_below_ground_diameter to represent the below ground diameter. It is not important how this is done: in this example sample_distribution is used to generate a value randomly. All that is required is that u_below_ground_diameter is a value with length dimension:
index(
  sample_distribution(
    model().u_below_ground_diameter_distribution,
    1,
    seed: u_digest
  ),
  0
) * unit("mm")
To make a prediction:
predict_with_linear_model(
  model().u_regression_model,
  tuple(
    not_null(
      u_below_ground_diameter
    )
  )
)
Remember, the provided feature training data was a list of tuples with a single length value inside. Therefore, the feature observation passed to predict_with_linear_model must also be a tuple with a single length value.
The result is a mock prediction of the wall thickness for every pole.
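The whole prediction step can be mimicked outside Neara with a short Python sketch. Everything here is an assumption made for illustration: the pole identifiers, the uniform distribution and its 200 to 400 mm range (the article does not show u_below_ground_diameter_distribution), and the use of the identifier as the random seed in place of u_digest. The coefficient and intercept are the values reported in the interpretation above.

```python
import random

COEFFICIENT = 0.5    # dimensionless, from the trained model above
INTERCEPT_MM = 0.0   # fixed at 0 during training

def sample_diameter_mm(seed, low=200.0, high=400.0):
    """Draw one reproducible diameter per seed (assumed uniform range)."""
    return random.Random(seed).uniform(low, high)

def predict_wall_thickness_mm(diameter_mm):
    """Apply the linear model to a single-feature observation."""
    if diameter_mm is None:
        raise ValueError("a diameter is required to make a prediction")
    return INTERCEPT_MM + COEFFICIENT * diameter_mm

# Hypothetical pole identifiers standing in for each pole's digest.
pole_ids = ["pole-001", "pole-002", "pole-003"]
predictions = {
    pole_id: predict_wall_thickness_mm(sample_diameter_mm(pole_id))
    for pole_id in pole_ids
}

for pole_id, thickness in predictions.items():
    print(pole_id, round(thickness, 1))
```

Seeding the generator per pole keeps each mock diameter stable between runs, which appears to be the role the seed argument plays in the Neara expression.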