In the last chapter, we discussed the basics of Natural Language Processing. You can check my GitHub repository for updates. Today we’ll try to understand sequence and time series data and how machine learning can be utilized to create accurate predictors.
Common attributes of Time Series
Time series data is a set of values spaced over time; when plotted, the x-axis is usually temporal. It can be found everywhere and is used to forecast weather, stock prices, historical trends, and much more. Even though time series can seem random and noisy, they share some common attributes that help when building machine learning models for prediction.
- Trend: the series tends to move in a specific direction over time, upward or downward.
- Seasonality: a pattern repeats over time at regular intervals, which are called seasons.
- Autocorrelation: some time series behave predictably after a certain event has occurred. For example, you might find clear spikes in a time series, with a deterministic decay after each spike. This is called autocorrelation. A time series that shows a lot of autocorrelation can be a predictable one.
- Noise: it is a set of seemingly random disturbances in a time series that can lead to a high level of unpredictability and can mask trends, seasonal behavior, and autocorrelation.
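To make the autocorrelation idea a bit more concrete, here is a minimal sketch that measures how strongly a series correlates with a copy of itself shifted by one step. The helper `lag_autocorrelation` and the two example series are hypothetical and only for illustration; a spike-and-decay series should score much higher than pure noise.

```python
import numpy as np


def lag_autocorrelation(series, lag=1):
    """Pearson correlation between a series and itself shifted by `lag` steps."""
    if lag < 1:
        raise ValueError("lag must be at least 1")
    series = np.asarray(series, dtype="float64")
    return np.corrcoef(series[:-lag], series[lag:])[0, 1]


# Illustrative data: a spike every 50 steps followed by exponential decay
steps = np.arange(500)
spiky = 10 * np.exp(-(steps % 50) / 10.0)

# Illustrative data: pure random noise with no structure
rng = np.random.RandomState(0)
white_noise = rng.randn(500)

spiky_corr = lag_autocorrelation(spiky, lag=1)
noise_corr = lag_autocorrelation(white_noise, lag=1)
print("spike-and-decay:", spiky_corr)
print("white noise:", noise_corr)
```

A value close to 1 means each step is well explained by the previous one, while a value near 0 suggests there is little to learn from the previous step alone.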
Techniques for Predicting Time Series
Naive prediction is the most basic forecasting method: it uses the last period’s value as the forecast for the next period, with no adjustment. In other words, the predicted value at time t + 1 is the same as the value at time t, which effectively shifts the time series by a single period.
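Here is a tiny sketch of that shift on a hand-made array, so you can see exactly which value predicts which. The function name `naive_predictions` and the sample numbers are made up for illustration.

```python
import numpy as np


def naive_predictions(series):
    """Return predictions for steps 1..n-1: each one is the previous observed value."""
    series = np.asarray(series)
    return series[:-1]


observed = np.array([3.0, 5.0, 4.0, 6.0, 7.0])

predicted = naive_predictions(observed)  # forecasts for observed[1:]
actual = observed[1:]                    # the values those forecasts are compared with
errors = actual - predicted

print("predicted:", predicted)
print("actual:   ", actual)
print("errors:   ", errors)
```

Note that there is no prediction for the very first step, since nothing comes before it.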
To understand this concept, let’s create a time series with trends, seasonality, and noise.
    import tensorflow as tf
    import matplotlib.pyplot as plt
    import numpy as np


    def plot_series(time, series, format="-", start=0, end=None):
        """Plot a slice of the series against time."""
        plt.plot(time[start:end], series[start:end], format)
        plt.xlabel("Time")
        plt.ylabel("Value")
        plt.grid(True)


    def trend(time, slope=0):
        return slope * time


    def seasonal_pattern(season_time):
        """An arbitrary pattern, can be changed"""
        return np.where(season_time < 0.4,
                        np.cos(season_time * 2 * np.pi),
                        1 / np.exp(3 * season_time))


    def seasonality(time, period, amplitude=1, phase=0):
        season_time = ((time + phase) % period) / period
        return amplitude * seasonal_pattern(season_time)


    def noise(time, noise_level=1, seed=None):
        rnd = np.random.RandomState(seed)
        return rnd.randn(len(time)) * noise_level


    time = np.arange(4 * 365 + 1, dtype="float32")
    baseline = 10
    amplitude = 15
    slope = 0.09
    noise_level = 6

    # creating the series
    series = baseline + trend(time, slope) + seasonality(time, period=365, amplitude=amplitude)
    # updating with noise
    series += noise(time, noise_level, seed=42)

    # plotting the graph
    plt.plot(series)
Once you plot the graph this will be the output.
The data created above can then be split into training, validation, and test sets. If the data is seasonal, it is a good idea to make sure each split contains whole seasons.
As an example, you can split the above data at time step 1000, giving a training dataset with the data up to step 1000 and a validation dataset with the data after it.
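If you would rather keep whole seasons in each split, one option is to pick a split point that is a multiple of the season length. The sketch below assumes a period of 365 steps, as in the synthetic series; the helper `split_on_season_boundary` is a hypothetical name, and the stand-in array only mimics the length of the real data.

```python
import numpy as np


def split_on_season_boundary(series, period, train_seasons):
    """Split so that the training part holds exactly `train_seasons` full seasons."""
    split_time = period * train_seasons
    if not 0 < split_time < len(series):
        raise ValueError("split point must fall inside the series")
    return series[:split_time], series[split_time:], split_time


# Stand-in data with the same length as the synthetic series (4 * 365 + 1 steps)
values = np.arange(4 * 365 + 1, dtype="float32")

train, valid, season_split_time = split_on_season_boundary(
    values, period=365, train_seasons=3
)
print(season_split_time, len(train), len(valid))
```

With three seasons for training, the remaining season (plus the one extra step) is left for validation.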
The code below uses the split_time variable to divide the series into training and validation sets and plots each of them.
    split_time = 1000
    time_train = time[:split_time]
    x_train = series[:split_time]
    time_valid = time[split_time:]
    x_valid = series[split_time:]

    plt.figure(figsize=(10, 6))
    plot_series(time_train, x_train)
    plt.show()

    plt.figure(figsize=(10, 6))
    plot_series(time_valid, x_valid)
    plt.show()
The graphs below show the data split into the training and validation datasets.
    # Naive forecast: the prediction for each validation step is the value one step earlier
    naive_forecast = series[split_time - 1:-1]

    plt.figure(figsize=(10, 6))
    plot_series(time_valid, x_valid)
    plot_series(time_valid, naive_forecast)
The graph below shows the validation set from step 1000 onwards with the naive prediction overlaid.
Next, to measure the prediction accuracy, we can use the mean squared error (MSE) and mean absolute error (MAE).
- MSE: takes the difference between the predicted value and the actual value at time t, squares it (to remove negatives), and then finds the average over all of them.
- MAE: calculates the difference between the predicted value and the actual value at time t, takes its absolute value to remove negatives (instead of squaring), and finds the average over all of them.
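The two definitions above can be written out by hand in a few lines of NumPy. This is just a sketch to show the arithmetic on a small made-up example; the function names and sample values are for illustration only.

```python
import numpy as np


def mean_squared_error(actual, predicted):
    """Average of the squared differences between actual and predicted values."""
    actual = np.asarray(actual, dtype="float64")
    predicted = np.asarray(predicted, dtype="float64")
    return np.mean((actual - predicted) ** 2)


def mean_absolute_error(actual, predicted):
    """Average of the absolute differences between actual and predicted values."""
    actual = np.asarray(actual, dtype="float64")
    predicted = np.asarray(predicted, dtype="float64")
    return np.mean(np.abs(actual - predicted))


actual_values = np.array([5.0, 4.0, 6.0, 7.0])
predicted_values = np.array([3.0, 5.0, 4.0, 6.0])

# Differences are 2, -1, 2, 1
mse = mean_squared_error(actual_values, predicted_values)   # squares: 4, 1, 4, 1
mae = mean_absolute_error(actual_values, predicted_values)  # absolutes: 2, 1, 2, 1
print("MSE:", mse)
print("MAE:", mae)
```

Because MSE squares each difference, large errors weigh more heavily in it than they do in MAE.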
You can get the MSE and MAE for the above naive forecast created based on the synthetic time series as below.
    print(tf.keras.metrics.mean_squared_error(x_valid, naive_forecast).numpy())
    print(tf.keras.metrics.mean_absolute_error(x_valid, naive_forecast).numpy())
For the above, the output I got was 76.47491 for MSE and 6.899298 for MAE. If you can reduce the error, you can increase the accuracy of your predictions.
In the next chapter let’s look into another interesting topic related to machine learning. Happy coding! 😃🔥