Title: | For Implementation of Feed Reduction, Learning Examples, NLP and Code Management |
---|---|
Description: | This is for code management functions, NLP tools, a Monty Hall simulator, and for implementing my own variable reduction technique called Feed Reduction. The Feed Reduction technique is not yet published, but is merely a tool for implementing a series of binary neural networks meant for reducing data into N dimensions, where N is the number of possible values of the response variable. |
Authors: | Travis Barton (2018) |
Maintainer: | Travis Barton <[email protected]> |
License: | GPL-2 |
Version: | 1.2.2 |
Built: | 2025-02-17 05:16:55 UTC |
Source: | https://github.com/travis-barton/lilrhino |
Used as a function within Feed_Reduction, Binary_Network uses a three-layer neural network with the Adam optimizer, leaky ReLU activations on the first two layers, and a softmax on the last layer. The loss function is binary_crossentropy. This is a Keras wrapper and uses TensorFlow as the backend.
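As a rough illustration of that architecture, here is a minimal Keras sketch (Adam optimizer, leaky ReLU on the two hidden layers, softmax output, binary cross-entropy loss). The function name, node count, and input dimension are placeholders, not the package's actual internals.

library(keras)

# Illustrative sketch of the architecture described above; not Binary_Network's actual code.
# 'input_dim' and 'nodes' stand in for the values Binary_Network derives from its arguments.
build_sketch_model <- function(input_dim, nodes) {
  model <- keras_model_sequential() %>%
    layer_dense(units = nodes, input_shape = input_dim) %>%
    layer_activation_leaky_relu() %>%
    layer_dense(units = nodes) %>%
    layer_activation_leaky_relu() %>%
    layer_dense(units = 2, activation = 'softmax')   # binary, one-hot response
  model %>% compile(optimizer = optimizer_adam(),
                    loss = 'binary_crossentropy',
                    metrics = 'accuracy')
  model
}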
Binary_Network(X, Y, X_test, val_split, nodes, epochs, batch_size, verbose = 0)
X |
Training data. |
Y |
Training Labels. These must be binary. |
X_test |
The test data. |
val_split |
The validation split for keras. |
nodes |
The number of nodes in the hidden layers. |
epochs |
The number of epochs for the network |
batch_size |
The batch size for the network |
verbose |
Whether or not you want details about the run as it is happening. 0 = silent, 1 = progress bar, 2 = one line per epoch. |
This function is a subset of the larger function Feed_Reduction. The output is a list containing the training and testing data converted into an approximation of probability space for that binary decision.
Train |
The training data in approximate probability space |
Test |
The testing data in 'double' approximate probability space |
Travis Barton
Check out http://wbbpredictions.com/wp-content/uploads/2018/12/Redditbot_Paper.pdf and Keras for details
Feed_Reduction
## Not run:
if(8 * .Machine$sizeof.pointer == 64){
  # Feed Network Testing
  library(keras)
  library(dplyr)
  install_keras()
  dat <- keras::dataset_mnist()
  X_train = array_reshape(dat$train$x/255, c(nrow(dat$train$x/255), 784))
  y_train = to_categorical(dat$train$y, 10)
  X_test = array_reshape(dat$test$x/255, c(nrow(dat$test$x/255), 784))
  y_test = to_categorical(dat$test$y, 10)
  index_train = which(dat$train$y == 6 | dat$train$y == 5)
  index_train = sample(index_train, length(index_train))
  index_test = which(dat$test$y == 6 | dat$test$y == 5)
  index_test = sample(index_test, length(index_test))
  temp = Binary_Network(X_train[index_train,], y_train[index_train, c(7, 6)],
                        X_test[index_test,], .3, 350, 30, 50)
}
## End(Not run)
This function takes a corpus and a set of labels and uses Bootstrap_Vocab to increase the size of each label until they are all the same length. Stop words are not bootstrapped.
Bootstrap_Data_Frame(text, tags, stopwords, min_length = 7, max_length = 15)
text |
text is the collection of textual data to bootstrap up. |
tags |
tags are the collection of tags that will be used to bootstrap. There should be one for every entry in 'text'. They do not have to be unique. |
stopwords |
Stopwords to make sure are not a part of the bootstrapping process. It is advised to eliminate the most common words. See Stopword_Maker(). |
min_length |
The shortest length allowable for bootstrapped words |
max_length |
The longest length allowable for bootstrapped words |
Most of the bootstrapped words will be nonsensical. The intention of this package is not to create new sentences, but instead to trick your model into thinking it has levels of equal size. This method is meant for bag-of-words style models.
A data frame of your original documents and the bootstrapped ones (column 1), along with their tags (column 2).
Travis Barton
test_set = c('I like cats', 'I like dogs', 'we love animals', 'I am a vet',
             'US politics bore me', 'I dont like to vote',
             'The rainbow looked nice today dont you think tommy')
test_tags = c('animals', 'animals', 'animals', 'animals', 'politics', 'politics', 'misc')
Bootstrap_Data_Frame(test_set, test_tags, c("I", "we"), min_length = 3, max_length = 8)
This function takes a selection of documents and bootstraps words from those sentences until there are N total sentences (both pseudo and original).
Bootstrap_Vocab(vocab, N, stopwds, min_length = 7, max_length = 15)
vocab |
The collection of documents to bootstrap. |
N |
The total number of sentences to end up with. |
stopwds |
A list of stopwords to exclude from the bootstrapping process. |
min_length |
The shortest allowable bootstrapped document. |
max_length |
The longest allowable bootstrapped document |
The min and max length arguments do not guarantee that a sentence will reach that length. These sentences will be nonsensical.
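A rough base-R sketch of the idea, assuming pseudo-sentences are built by sampling non-stopword tokens from the corpus; this is illustrative only and not Bootstrap_Vocab's actual implementation.

# Illustrative sketch only; not the package's actual implementation.
bootstrap_vocab_sketch <- function(vocab, N, stopwds, min_length = 7, max_length = 15) {
  words <- setdiff(unlist(strsplit(vocab, ' ')), stopwds)
  extra <- replicate(N - length(vocab), {
    n <- sample(min_length:max_length, 1)
    paste(sample(words, n, replace = TRUE), collapse = ' ')   # nonsensical pseudo-sentence
  })
  c(vocab, extra)
}

bootstrap_vocab_sketch(c('this is test one', 'this is test two'), 5, 'this', 3, 6)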
A vector of bootstrapped sentences.
Travis Barton
testing_set = c(paste('this is test', as.character(seq(1, 10, 1))))
Bootstrap_Vocab(testing_set, 20, c('this'))
For alerting you when your code is done.
Codes_done(title, msg, sound = FALSE, effect = 1)
title |
The title of the notification |
msg |
The message to be sent |
sound |
Optional sound to blurt as well |
effect |
If a sound is played, which should it be? (Check the beepr package for sound options.) |
Only for Linux (as far as I know)
smacdonald (Stack Overflow), with modification by Travis Barton
https://stackoverflow.com/questions/3365657/is-there-a-way-to-make-r-beep-play-a-sound-at-the-end-of-a-script
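For reference, the approach in that answer boils down to something like the sketch below: a desktop notification sent with notify-send plus an optional beep via the beepr package. It assumes a Linux desktop with notify-send installed and is an illustration, not necessarily Codes_done's exact code.

library(beepr)

# Illustrative sketch; assumes a Linux system with notify-send available.
notify_sketch <- function(title, msg, sound = FALSE, effect = 1) {
  system(sprintf("notify-send '%s' '%s'", title, msg))   # pop a desktop notification
  if (sound) beepr::beep(effect)                          # optional audible alert
}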
Codes_done("done", "check it", sound = TRUE, effect = 1)
Codes_done("done", "check it", sound = TRUE, effect = 1)
For making one dataset into two (test and train).
Cross_val_maker(data, alpha)
data |
matrix of data you want to split |
alpha |
The proportion of the data to split off (e.g. .1 for a 10 percent split). |
Returns a list accessible with the '$' operator. The Test and Train sets are labeled as such.
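For intuition, a split of this kind can be sketched in base R as below; this is only a sketch under the assumption that alpha is the fraction held out as the test set, not Cross_val_maker's actual code.

# Illustrative sketch; assumes 'alpha' is the fraction held out as the test set.
split_sketch <- function(data, alpha) {
  idx <- sample(seq_len(nrow(data)), size = floor(alpha * nrow(data)))
  list(Train = data[-idx, ], Test = data[idx, ])
}

dat <- split_sketch(iris, .1)
nrow(dat$Test)   # roughly 10% of the rows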
Travis Barton
dat <- Cross_val_maker(iris, .1)
train <- dat$Train
test <- dat$Test
Feed_Reduction takes the number of unique labels in the training data and fits a one-vs-all binary neural network for each unique label. The output is an approximation of the probability that each individual input matches the label. Travis Barton (2018) http://wbbpredictions.com/wp-content/uploads/2018/12/Redditbot_Paper.pdf
Feed_Reduction(X, Y, X_test, val_split = .1, nodes = NULL, epochs = 15, batch_size = 30, verbose = 0)
X |
Training data |
Y |
Training labels |
X_test |
Testing data |
val_split |
The validation split for the Keras binary neural networks. |
nodes |
The number of nodes for the hidden layers; the default is 1/4 of the length of the training data. |
epochs |
The number of epochs for the fitting of the networks |
batch_size |
The batch size for the networks |
verbose |
Whether or not you want details about the run as it is happening. 0 = silent, 1 = progress bar, 2 = one line per epoch. |
This is a new technique for dimensionality reduction of my own creation. Data is converted to the same number of dimensions as there are unique labels. Each dimension is an approximation of the probability that the data point belongs to that unique label. The return value is a list with the training and test data with their dimensionality reduced.
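Conceptually, the reduction can be sketched as fitting one one-vs-all binary classifier per unique label and column-binding the predicted probabilities. The sketch below uses glm() purely for brevity; the package itself fits a Keras-based Binary_Network for each label.

# Conceptual sketch of the per-label probability reduction; the package uses
# Binary_Network (Keras) rather than glm() for each one-vs-all model.
feed_reduction_sketch <- function(X, Y, X_test) {
  labels   <- sort(unique(Y))
  train_df <- data.frame(X)
  test_df  <- data.frame(X_test)
  names(test_df) <- names(train_df)                       # align predictor names
  probs <- lapply(labels, function(lab) {
    df  <- cbind(train_df, target = as.integer(Y == lab))
    fit <- glm(target ~ ., data = df, family = binomial)
    list(train = predict(fit, train_df, type = 'response'),
         test  = predict(fit, test_df,  type = 'response'))
  })
  list(Train = sapply(probs, `[[`, 'train'),              # one column per unique label
       Test  = sapply(probs, `[[`, 'test'))
}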
Train |
The training data in the new probability space |
Test |
The testing data in the new probability space |
Travis Barton.
Check out http://wbbpredictions.com/wp-content/uploads/2018/12/Redditbot_Paper.pdf for details on the process
Binary_Network
## Not run:
if(8 * .Machine$sizeof.pointer == 64){
  # Feed Network Testing
  library(keras)
  install_keras()
  dat <- keras::dataset_mnist()
  X_train = array_reshape(dat$train$x/255, c(nrow(dat$train$x/255), 784))
  y_train = dat$train$y
  X_test = array_reshape(dat$test$x/255, c(nrow(dat$test$x/255), 784))
  y_test = dat$test$y
  Reduced_Data2 = Feed_Reduction(X_train, y_train, X_test, val_split = .3,
                                 nodes = 350, 30, 50, verbose = 1)
  library(e1071)
  names(Reduced_Data2$test) = names(Reduced_Data2$train)
  newdat = as.data.frame(cbind(rbind(Reduced_Data2$train, Reduced_Data2$test),
                               c(y_train, y_test)))
  colnames(newdat) = c(paste("V", c(1:11), sep = ""))
  mod = svm(V11~., data = newdat, subset = c(1:60000), kernel = 'linear',
            cost = 1, type = 'C-classification')
  preds = predict(mod, newdat[60001:70000, -11])
  sum(preds == y_test)/10000
}
## End(Not run)
Loads GloVe's pretrained 42-billion-token embeddings, trained on the Common Crawl.
Load_Glove_Embeddings(path = 'glove.42B.300d.txt', d = 300)
path |
The path to the embeddings file. |
d |
The dimension of the embeddings file. |
The embeddings file should be the word, followed by numeric values, ending with a carriage return.
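A minimal sketch of reading a file in that format with base R is shown below; it assumes whitespace-separated values with the word in the first column, and may differ in detail from Load_Glove_Embeddings.

# Illustrative sketch: read a GloVe-style text file (word followed by d numbers per line).
read_glove_sketch <- function(path, d = 300) {
  raw <- read.table(path, header = FALSE, quote = "", comment.char = "",
                    stringsAsFactors = FALSE,
                    col.names = c('word', paste0('V', 1:d)))
  emb <- as.matrix(raw[, -1])
  rownames(emb) <- raw$word
  emb
}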
The embeddings matrix.
Travis Barton
# This code only works if you have the ~5 GB file found here:
# https://nlp.stanford.edu/projects/glove/
## Not run: emb = Load_Glove_Embeddings()
A simulator for the famous Monty Hall Problem
Monty_Hall(Games = 10, Choice = "Stay")
Games |
The number of games to run on the simulation |
Choice |
Whether you would like the simulation to 'Stay' with the first chosen door, 'Switch' to the other door, or 'Random', where you randomly decide to either stay or switch. |
This is just a toy example of the famous Monty Hall problem. It returns a ggplot bar chart showing the counts of wins and losses in the simulation.
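As a quick illustration of the logic being simulated (not the package's implementation), here is a base-R sketch of one strategy's win rate:

# Illustrative sketch of the game logic; Monty_Hall() itself returns a ggplot chart.
monty_sketch <- function(games = 10000, choice = 'Stay') {
  wins <- replicate(games, {
    prize <- sample(1:3, 1)
    pick  <- sample(1:3, 1)
    if (choice == 'Stay') pick == prize else pick != prize  # switching wins iff the first pick was wrong
  })
  mean(wins)
}

monty_sketch(10000, 'Switch')   # roughly 2/3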
A ggplot graph is produced. There is no return value.
Travis Barton
Monty_Hall(100, 'Stay')
For Chen's homework, I'll change this when I generalize it.
Nearest_Centroid(X_train, X_test, Y_train)
X_train |
Training data |
X_test |
data to be tested |
Y_train |
training labels |
Based on homework from Guangling Chen's M251 class at SJSU
Travis Barton
A function for separating numbers from letters. 'b4', for example, would be converted to 'b 4'.
Num_Al_Sep(vec)
vec |
The string vector in which you wish to separate the numbers from the letters. |
output |
The separated vector. |
This is a simple function mostly used inside other functions.
Travis Barton
test_vec = 'The most iconic American weapon has to be the AR15'
res = Num_Al_Sep(test_vec)
print(res)
For finding the accuracy of confusion matrices with true/predicted values.
Percent(true, test)
true |
The true values |
test |
the test values |
Make sure your strings have the right values and create a square matrix.
The percent accuracy.
Travis Barton
true <- rep(1:10, 10)
test <- rep(1:10, 10)
test[c(2, 22, 33, 89)] = 1
Percent(true, test)
# or
# Table_percent(table(true, test))
This function goes through a number of pretreatment steps in preparation for vectorization. These steps are designed to help the data become more standard so that there are fewer outliers when training during NLP. The following effects are applied: 1. Non-alphanumerics are removed. 2. Numbers are separated from letters. 3. Numbers are replaced with their word equivalents. 4. Words are stemmed (optional). 5. Words are lowercased (optional).
Pretreatment(title_vec, stem = TRUE, lower = TRUE, parallel = FALSE)
title_vec |
Vector of documents to be pre-treated. |
stem |
Boolean variable to decide whether to stem or not. |
lower |
Boolean variable to decide whether to lowercase words or not. |
parallel |
Boolean variable to decide whether to run this function in parallel or not. |
This function returns a list. It should be able to accept any format that the function lapply would accept. The parallelization is done with the function mclapply from the package 'parallel' and will only work on systems that allow forking (sorry, Windows users). Future updates will allow for socketing.
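A simplified sketch of that parallel branch using parallel::mclapply is given below; the cleaning steps shown are placeholders (stemming and number-to-word conversion are omitted), so it only illustrates the structure, not Pretreatment's exact behaviour.

library(parallel)

# Simplified sketch of the forking branch; the per-document steps are placeholders.
pretreat_sketch <- function(title_vec, lower = TRUE, run_parallel = FALSE) {
  clean_one <- function(doc) {
    doc <- gsub('[^[:alnum:] ]', ' ', doc)                 # drop non-alphanumerics
    doc <- gsub('([[:alpha:]])([0-9])', '\\1 \\2', doc)    # separate letters from digits
    if (lower) doc <- tolower(doc)
    doc
  }
  if (run_parallel) mclapply(title_vec, clean_one, mc.cores = 2) else lapply(title_vec, clean_one)
}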
output |
The list of character strings post-pretreatment |
Travis Barton
## Not run:
# for some reason it takes longer than 5 seconds on CRAN's computers
test_vec = c('This is a test', 'Ahoy!', 'my battle-ship is on... b6!')
res = Pretreatment(test_vec)
print(res)
## End(Not run)
Creates a random forest style collection of neural networks for classification
Random_Brains(data, y, x_test, variables = ceiling(ncol(data)/10), brains = floor(sqrt(ncol(data))), hiddens = c(3, 4))
data |
The data that holds the predictors ONLY. |
y |
The response variable. |
x_test |
The testing predictors |
variables |
The number of predictors to select for each brain in 'data'. The default is one tenth of the number of columns in 'data'. |
brains |
The number of neural networks to create. The default is the square root of the number of columns in 'data'. |
hiddens |
This is a vector with length equal to the desired number of hidden layers. Each entry in the vector corresponds to the number of nodes in that layer. The default is c(3, 4), which is a two-layer network with 3 and 4 nodes in the layers respectively. |
This function is meant to mirror the classic random forest function exactly. The only difference is that it uses shallow neural networks to build the forest instead of decision trees.
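The idea can be sketched as follows: each 'brain' sees a random subset of columns, is fit with the neuralnet package, and the final label is a majority vote across brains. The sketch below is a simplified illustration (numeric-coded labels, regression-style networks rounded to a class), not Random_Brains' exact implementation.

library(neuralnet)

# Conceptual sketch of the random-brains idea; not Random_Brains' actual code.
random_brains_sketch <- function(data, y, x_test, variables = 3, brains = 2, hiddens = c(3, 4)) {
  y_num <- as.numeric(as.factor(y))
  votes <- sapply(seq_len(brains), function(b) {
    cols  <- sample(ncol(data), variables)                 # random column subset for this brain
    train <- data.frame(data[, cols, drop = FALSE], y = y_num)
    test  <- data.frame(x_test[, cols, drop = FALSE])
    names(test) <- names(train)[seq_len(variables)]
    fml <- as.formula(paste('y ~', paste(names(test), collapse = ' + ')))
    fit <- neuralnet(fml, data = train, hidden = hiddens, linear.output = TRUE)
    round(compute(fit, test)$net.result)                   # this brain's predictions
  })
  apply(votes, 1, function(v) as.numeric(names(which.max(table(v)))))  # majority vote
}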
predictions |
The predictions for x_test. |
num_brains |
The number of neural networks used to decide the predictions. |
predictors_per_brain |
The number of variables used by the neural networks to decide the predictions. |
hidden_layers |
The vector describing the hidden layers: how many there were and how many nodes each contained. |
preds_per_brain |
This matrix describes which columns were selected by each brain. Each row is a new brain. Each column gives the index of the column used. |
raw_results |
The matrix of raw predictions from the brains. Each row is the cumulative predictions of all the brains. Which prediction won by majority vote can be seen in 'predictions'. |
The neural networks are created using the neuralnet package!
Travis Barton
dat = Cross_val_maker(iris, .2)
train = dat$Train
test = dat$Test
Final_Test = Random_Brains(train[,-5], train$Species, as.matrix(test[,-5]),
                           variables = 3, brains = 2)
table(Final_Test$predictions, as.numeric(test$Species))
Function for extracting the sentence vector from an embeddings matrix in a fast and convenient manner.
Sentence_Vector(Sentence, emb_matrix, dimension, stopwords)
Sentence |
The sentence to find the vector of. |
emb_matrix |
The embeddings matrix to search. |
dimension |
The dimension of the vector to return. |
stopwords |
Words that should not be included in the averaging process. |
The function splits the sentence into words, eliminates all stopwords, finds the vectors of each word, then averages the word vectors into a sentence vector.
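Those steps can be sketched in base R as below, assuming emb_matrix is a data frame with words as row names; this is an illustration rather than the function's exact code.

# Illustrative sketch: average the embedding rows of the non-stopword tokens.
sentence_vector_sketch <- function(sentence, emb_matrix, dimension, stopwords = character(0)) {
  words <- unlist(strsplit(tolower(sentence), ' '))
  words <- setdiff(words, stopwords)
  words <- words[words %in% rownames(emb_matrix)]          # keep only words with known vectors
  colMeans(emb_matrix[words, seq_len(dimension), drop = FALSE])
}

emb <- data.frame(matrix(1:15, nrow = 3), row.names = c('sentence', 'in', 'question'))
sentence_vector_sketch('the sentence in question', emb, 5, stopwords = 'the')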
The sentence vector from an embeddings matrix.
Travis Barton
emb = data.frame(matrix(c(1, 2, 3, 4, 5, 5, 4, 3, 2, 1, 1, 5, 3, 2, 4), nrow = 3),
                 row.names = c('sentence', 'in', 'question'))
Sentence_Vector(c('this is the sentence in question'), emb, 5, c('this', 'is', 'the'))
This function finds the N most used words in a corpus. This is done to identify stop words to better prune data sets before training.
Stopword_Maker(titles, cutoff = 20)
titles |
The documents in which the most populous words are sought. |
cutoff |
The number N of top most-used words to keep as stop words. |
output |
A vector of the N most populous words. |
Travis Barton
test_set = c('this is a testset', 'I am searching for a list of words', 'I like turtles',
             'A rocket would be a fast way of getting to work, but I do not think it is very practical')
res = Stopword_Maker(test_set, 4)
print(res)
Finds the accuracy of square confusion tables.
Table_percent(in_table)
in_table |
a confusion matrix |
The table must be square
Make sure it is square.
Travis Barton
true <- rep(1:10, 10)
test <- rep(1:10, 10)
test[c(2, 22, 33, 89)] = 1
Table_percent(table(true, test))
Function for extracting word vectors from embeddings. This function is an internal function for 'Sentence_Vector'. It averages the word vectors and returns the average of these vectors.
Vector_Puller(words, emb_matrix, dimension)
words |
The words to be extracted. |
emb_matrix |
The embeddings matrix. It must be a data frame. |
dimension |
The dimension of the embeddings to extract. It does not have to match that of the matrix, but it cannot exceed its maximum column count. |
This is a simple and fast internal function.
The vector that corresponds to the average of the word vectors.
Travis Barton
# This is an example emb_matrix
emb = data.frame(matrix(c(1, 2, 3, 4, 5, 5, 4, 3, 2, 1), nrow = 2),
                 row.names = c('cow', 'moo'))
Vector_Puller(c('cow', 'moo'), emb, 5)