Title: | For Implementation of Feed Reduction, Learning Examples, NLP and Code Management |
---|---|
Description: | This is for code management functions, NLP tools, a Monty Hall simulator, and for implementing my own variable reduction technique called Feed Reduction. The Feed Reduction technique is not yet published, but is merely a tool for implementing a series of binary neural networks meant for reducing data into N dimensions, where N is the number of possible values of the response variable. |
Authors: | Travis Barton (2018) |
Maintainer: | Travis Barton <[email protected]> |
License: | GPL-2 |
Version: | 1.2.2 |
Built: | 2025-02-17 05:16:55 UTC |
Source: | https://github.com/travis-barton/lilrhino |
Used as a function within Feed_Reduction, Binary_Network uses a three-layer neural network with the Adam optimizer, leaky ReLU activations on the first two layers, and a softmax on the last layer. The loss function is binary_crossentropy. This is a Keras wrapper and uses TensorFlow as the backend.
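As a rough illustration of that architecture, here is a minimal Keras sketch (Adam optimizer, leaky ReLU on the two hidden layers, softmax output, binary cross-entropy loss). The function name, node count, and input dimension are placeholders, not the package's actual internals.

library(keras)

# Illustrative sketch of the architecture described above; not Binary_Network's actual code.
# 'input_dim' and 'nodes' stand in for the values Binary_Network derives from its arguments.
build_sketch_model <- function(input_dim, nodes) {
  model <- keras_model_sequential() %>%
    layer_dense(units = nodes, input_shape = input_dim) %>%
    layer_activation_leaky_relu() %>%
    layer_dense(units = nodes) %>%
    layer_activation_leaky_relu() %>%
    layer_dense(units = 2, activation = 'softmax')   # binary, one-hot response
  model %>% compile(optimizer = optimizer_adam(),
                    loss = 'binary_crossentropy',
                    metrics = 'accuracy')
  model
}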
Binary_Network(X, Y, X_test, val_split, nodes, epochs, batch_size, verbose = 0)
X |
Training data. |
Y |
Training Labels. These must be binary. |
X_test |
The test data. |
val_split |
The validation split for keras. |
nodes |
The number of nodes in the hidden layers. |
epochs |
The number of epochs for the network |
batch_size |
The batch size for the network |
verbose |
Whether or not you want details about the run as it is happening. 0 = silent, 1 = progress bar, 2 = one line per epoch. |
This function is a subset of the larger function Feed_Reduction. The output is a list containing the training and testing data converted into an approximation of probability space for that binary decision.
Train |
The training data in approximate probability space |
Test |
The testing data in 'double' approximate probability space |
Travis Barton
Check out http://wbbpredictions.com/wp-content/uploads/2018/12/Redditbot_Paper.pdf and Keras for details
Feed_Reduction
## Not run:
if(8 * .Machine$sizeof.pointer == 64){
  # Feed Network Testing
  library(keras)
  library(dplyr)
  install_keras()
  dat <- keras::dataset_mnist()
  X_train = array_reshape(dat$train$x/255, c(nrow(dat$train$x/255), 784))
  y_train = to_categorical(dat$train$y, 10)
  X_test = array_reshape(dat$test$x/255, c(nrow(dat$test$x/255), 784))
  y_test = to_categorical(dat$test$y, 10)
  index_train = which(dat$train$y == 6 | dat$train$y == 5)
  index_train = sample(index_train, length(index_train))
  index_test = which(dat$test$y == 6 | dat$test$y == 5)
  index_test = sample(index_test, length(index_test))
  temp = Binary_Network(X_train[index_train,], y_train[index_train, c(7, 6)],
                        X_test[index_test,], .3, 350, 30, 50)
}
## End(Not run)
This function takes a corpus and a set of labels and uses Bootstrap_Vocab to increase the size of each label until they are all the same length. Stop words are not bootstrapped.
Bootstrap_Data_Frame(text, tags, stopwords, min_length = 7, max_length = 15)
text |
text is the collection of textual data to bootstrap up. |
tags |
tags are the collection of tags that will be used to bootstrap. There should be one for every entry in 'text'. They do not have to be unique. |
stopwords |
Stopwords to make sure are not a part of the bootstrapping process. It is advised to eliminate the most common words. See Stopword_Maker(). |
min_length |
The shortest length allowable for bootstrapped words |
max_length |
The longest length allowable for bootstrapped words |
Most of the bootstrapped words will be nonsensical. The intention of this package is not to create new sentences, but instead to trick your model into thinking it has levels of equal size. This method is meant for bag-of-words style models.
A data frame of your original documents and the bootstrapped ones (column 1), along with their tags (column 2).
Travis Barton
test_set = c('I like cats', 'I like dogs', 'we love animals', 'I am a vet',
             'US politics bore me', 'I dont like to vote',
             'The rainbow looked nice today dont you think tommy')
test_tags = c('animals', 'animals', 'animals', 'animals', 'politics', 'politics', 'misc')
Bootstrap_Data_Frame(test_set, test_tags, c("I", "we"), min_length = 3, max_length = 8)
This function takes a selection of documents and bootstraps words from those sentences until there are N total sentences (both pseudo and original).
Bootstrap_Vocab(vocab, N, stopwds, min_length = 7, max_length = 15)
vocab |
The collection of documents to bootstrap. |
N |
The total number of sentences to end up with. |
stopwds |
A list of stopwords to exclude from the bootstrapping process. |
min_length |
The shortest allowable bootstrapped document. |
max_length |
The longest allowable bootstrapped document |
The min and max length arguments do not guarantee that a sentence will reach that length. These sentences will be nonsensical.
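A rough base-R sketch of the idea, assuming pseudo-sentences are built by sampling non-stopword tokens from the corpus; this is illustrative only and not Bootstrap_Vocab's actual implementation.

# Illustrative sketch only; not the package's actual implementation.
bootstrap_vocab_sketch <- function(vocab, N, stopwds, min_length = 7, max_length = 15) {
  words <- setdiff(unlist(strsplit(vocab, ' ')), stopwds)
  extra <- replicate(N - length(vocab), {
    n <- sample(min_length:max_length, 1)
    paste(sample(words, n, replace = TRUE), collapse = ' ')   # nonsensical pseudo-sentence
  })
  c(vocab, extra)
}

bootstrap_vocab_sketch(c('this is test one', 'this is test two'), 5, 'this', 3, 6)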
A vector of bootstrapped sentences.
Travis Barton
testing_set = c(paste('this is test', as.character(seq(1, 10, 1))))
Bootstrap_Vocab(testing_set, 20, c('this'))
For alerting you when your code is done.
Codes_done(title, msg, sound = FALSE, effect = 1)
title |
The title of the notification |
msg |
The message to be sent |
sound |
Optional sound to blurt as well |
effect |
If a sound is played, which should it be? (Check the beepr package for sound options.) |
Only for Linux (as far as I know)
smacdonald (Stack Overflow), with modification by Travis Barton
https://stackoverflow.com/questions/3365657/is-there-a-way-to-make-r-beep-play-a-sound-at-the-end-of-a-script
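For reference, the approach in that answer boils down to something like the sketch below: a desktop notification sent with notify-send plus an optional beep via the beepr package. It assumes a Linux desktop with notify-send installed and is an illustration, not necessarily Codes_done's exact code.

library(beepr)

# Illustrative sketch; assumes a Linux system with notify-send available.
notify_sketch <- function(title, msg, sound = FALSE, effect = 1) {
  system(sprintf("notify-send '%s' '%s'", title, msg))   # pop a desktop notification
  if (sound) beepr::beep(effect)                          # optional audible alert
}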
Codes_done("done", "check it", sound = TRUE, effect = 1)
Codes_done("done", "check it", sound = TRUE, effect = 1)
For making one dataset into two (test and train).
Cross_val_maker(data, alpha)
data |
matrix of data you want to split |
alpha |
The proportion of the data to split off (e.g. .1 for a 10 percent split). |
Returns a list accessible with the '$' operator. The Test and Train sets are labeled as such.
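For intuition, a split of this kind can be sketched in base R as below; this is only a sketch under the assumption that alpha is the fraction held out as the test set, not Cross_val_maker's actual code.

# Illustrative sketch; assumes 'alpha' is the fraction held out as the test set.
split_sketch <- function(data, alpha) {
  idx <- sample(seq_len(nrow(data)), size = floor(alpha * nrow(data)))
  list(Train = data[-idx, ], Test = data[idx, ])
}

dat <- split_sketch(iris, .1)
nrow(dat$Test)   # roughly 10% of the rows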
Travis Barton
dat <- Cross_val_maker(iris, .1)
train <- dat$Train
test <- dat$Test
Feed_Reduction takes the number of unique labels in the training data and fits a one-vs-all binary neural network for each unique label. The output is an approximation of the probability that each individual input matches the label. Travis Barton (2018) http://wbbpredictions.com/wp-content/uploads/2018/12/Redditbot_Paper.pdf
Feed_Reduction(X, Y, X_test, val_split = .1, nodes = NULL, epochs = 15, batch_size = 30, verbose = 0)
X |
Training data |
Y |
Training labels |
X_test |
Testing data |
val_split |
The validation split for the Keras binary neural networks. |
nodes |
The number of nodes for the hidden layers; the default is 1/4 of the length of the training data. |
epochs |
The number of epochs for the fitting of the networks |
batch_size |
The batch size for the networks |
verbose |
Whether or not you want details about the run as it is happening. 0 = silent, 1 = progress bar, 2 = one line per epoch. |
This is a new technique for dimensionality reduction of my own creation. Data is converted to the same number of dimensions as there are unique labels. Each dimension is an approximation of the probability that the data point belongs to that unique label. The return value is a list with the training and test data with their dimensionality reduced.
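Conceptually, the reduction can be sketched as fitting one one-vs-all binary classifier per unique label and column-binding the predicted probabilities. The sketch below uses glm() purely for brevity; the package itself fits a Keras-based Binary_Network for each label.

# Conceptual sketch of the per-label probability reduction; the package uses
# Binary_Network (Keras) rather than glm() for each one-vs-all model.
feed_reduction_sketch <- function(X, Y, X_test) {
  labels   <- sort(unique(Y))
  train_df <- data.frame(X)
  test_df  <- data.frame(X_test)
  names(test_df) <- names(train_df)                       # align predictor names
  probs <- lapply(labels, function(lab) {
    df  <- cbind(train_df, target = as.integer(Y == lab))
    fit <- glm(target ~ ., data = df, family = binomial)
    list(train = predict(fit, train_df, type = 'response'),
         test  = predict(fit, test_df,  type = 'response'))
  })
  list(Train = sapply(probs, `[[`, 'train'),              # one column per unique label
       Test  = sapply(probs, `[[`, 'test'))
}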
Train |
The training data in the new probability space |
Test |
The testing data in the new probability space |
Travis Barton.
Check out http://wbbpredictions.com/wp-content/uploads/2018/12/Redditbot_Paper.pdf for details on the process
Binary_Network
## Not run:
if(8 * .Machine$sizeof.pointer == 64){
  # Feed Network Testing
  library(keras)
  install_keras()
  dat <- keras::dataset_mnist()
  X_train = array_reshape(dat$train$x/255, c(nrow(dat$train$x/255), 784))
  y_train = dat$train$y
  X_test = array_reshape(dat$test$x/255, c(nrow(dat$test$x/255), 784))
  y_test = dat$test$y
  Reduced_Data2 = Feed_Reduction(X_train, y_train, X_test, val_split = .3,
                                 nodes = 350, 30, 50, verbose = 1)
  library(e1071)
  names(Reduced_Data2$test) = names(Reduced_Data2$train)
  newdat = as.data.frame(cbind(rbind(Reduced_Data2$train, Reduced_Data2$test),
                               c(y_train, y_test)))
  colnames(newdat) = c(paste("V", c(1:11), sep = ""))
  mod = svm(V11~., data = newdat, subset = c(1:60000), kernel = 'linear',
            cost = 1, type = 'C-classification')
  preds = predict(mod, newdat[60001:70000, -11])
  sum(preds == y_test)/10000
}
## End(Not run)
Loads GloVe's pretrained 42-billion-token embeddings, trained on the Common Crawl.
Load_Glove_Embeddings(path = 'glove.42B.300d.txt', d = 300)
path |
The path to the embeddings file. |
d |
The dimension of the embeddings file. |
The embeddings file should be the word, followed by numeric values, ending with a carriage return.
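A minimal sketch of reading a file in that format with base R is shown below; it assumes whitespace-separated values with the word in the first column, and may differ in detail from Load_Glove_Embeddings.

# Illustrative sketch: read a GloVe-style text file (word followed by d numbers per line).
read_glove_sketch <- function(path, d = 300) {
  raw <- read.table(path, header = FALSE, quote = "", comment.char = "",
                    stringsAsFactors = FALSE,
                    col.names = c('word', paste0('V', 1:d)))
  emb <- as.matrix(raw[, -1])
  rownames(emb) <- raw$word
  emb
}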
The embeddings matrix.
Travis Barton
# This code only works if you have the ~5 GB file found here:
# https://nlp.stanford.edu/projects/glove/
## Not run: emb = Load_Glove_Embeddings()
A simulator for the famous Monty Hall Problem
Monty_Hall(Games = 10, Choice = "Stay")
Games |
The number of games to run on the simulation |
Choice |
Whether you would like the simulation to 'Stay' with the first chosen door, 'Switch' to the other door, or 'Random', where you randomly decide to either stay or switch. |
This is just a toy example of the famous Monty Hall problem. It returns a ggplot bar chart showing the counts of wins and losses in the simulation.
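As a quick illustration of the logic being simulated (not the package's implementation), here is a base-R sketch of one strategy's win rate:

# Illustrative sketch of the game logic; Monty_Hall() itself returns a ggplot chart.
monty_sketch <- function(games = 10000, choice = 'Stay') {
  wins <- replicate(games, {
    prize <- sample(1:3, 1)
    pick  <- sample(1:3, 1)
    if (choice == 'Stay') pick == prize else pick != prize  # switching wins iff the first pick was wrong
  })
  mean(wins)
}

monty_sketch(10000, 'Switch')   # roughly 2/3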
A ggplot graph is produced. There is no return value.
Travis Barton
Monty_Hall(100, 'Stay')
For Chen's homework, I'll change this when I generalize it.
Nearest_Centroid(X_train, X_test, Y_train)
X_train |
Training data |
X_test |
data to be tested |
Y_train |
training labels |
Based on homework from Guangling Chen's M251 class at SJSU
Travis Barton
A function for separating numbers from letters. 'b4', for example, would be converted to 'b 4'.
Num_Al_Sep(vec)
vec |
The string vector in which you wish to separate the numbers from the letters. |
output |
The separated vector. |
This is a simple function mostly used inside other functions.
Travis Barton
test_vec = 'The most iconic American weapon has to be the AR15'
res = Num_Al_Sep(test_vec)
print(res)
For finding the accuracy of confusion matrices with true/predicted values.
Percent(true, test)
true |
The true values |
test |
the test values |
Make sure your strings have the right values and create a square matrix.
The percent accuracy.
Travis Barton
true <- rep(1:10, 10)
test <- rep(1:10, 10)
test[c(2, 22, 33, 89)] = 1
Percent(true, test)
# or
# Table_percent(table(true, test))
This function goes through a number of pretreatment steps in preparation for vectorization. These steps are designed to help the data become more standard so that there are fewer outliers when training during NLP. The following effects are applied: 1. Non-alphanumerics are removed. 2. Numbers are separated from letters. 3. Numbers are replaced with their word equivalents. 4. Words are stemmed (optional). 5. Words are lowercased (optional).
Pretreatment(title_vec, stem = TRUE, lower = TRUE, parallel = FALSE)
title_vec |
Vector of documents to be pre-treated. |
stem |
Boolean variable to decide whether to stem or not. |
lower |
Boolean variable to decide whether to lowercase words or not. |
parallel |
Boolean variable to decide whether to run this function in parallel or not. |
This function returns a list. It should be able to accept any format that the function lapply would accept. The parallelization is done with the function mclapply from the package 'parallel' and will only work on systems that allow forking (sorry, Windows users). Future updates will allow for socketing.
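A simplified sketch of that parallel branch using parallel::mclapply is given below; the cleaning steps shown are placeholders (stemming and number-to-word conversion are omitted), so it only illustrates the structure, not Pretreatment's exact behaviour.

library(parallel)

# Simplified sketch of the forking branch; the per-document steps are placeholders.
pretreat_sketch <- function(title_vec, lower = TRUE, run_parallel = FALSE) {
  clean_one <- function(doc) {
    doc <- gsub('[^[:alnum:] ]', ' ', doc)                 # drop non-alphanumerics
    doc <- gsub('([[:alpha:]])([0-9])', '\\1 \\2', doc)    # separate letters from digits
    if (lower) doc <- tolower(doc)
    doc
  }
  if (run_parallel) mclapply(title_vec, clean_one, mc.cores = 2) else lapply(title_vec, clean_one)
}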
output |
The list of character strings post-pretreatment |
Travis Barton
## Not run:
# for some reason it takes longer than 5 seconds on CRAN's computers
test_vec = c('This is a test', 'Ahoy!', 'my battle-ship is on... b6!')
res = Pretreatment(test_vec)
print(res)
## End(Not run)
Creates a random forest style collection of neural networks for classification
Random_Brains(data, y, x_test, variables = ceiling(ncol(data)/10), brains = floor(sqrt(ncol(data))), hiddens = c(3, 4))
data |
The data that holds the predictors ONLY. |
y |
The response variable. |
x_test |
The testing predictors |
variables |
The number of predictors to select for each brain in 'data'. The default is one tenth of the number of columns in 'data'. |
brains |
The number of neural networks to create. The default is the square root of the number of columns in 'data'. |
hiddens |
This is a vector with length equal to the desired number of hidden layers. Each entry in the vector corresponds to the number of nodes in that layer. The default is c(3, 4), which is a two-layer network with 3 and 4 nodes in the layers respectively. |
This function is meant to mirror the classic random forest function exactly. The only difference is that it uses shallow neural networks to build the forest instead of decision trees.
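The idea can be sketched as follows: each 'brain' sees a random subset of columns, is fit with the neuralnet package, and the final label is a majority vote across brains. The sketch below is a simplified illustration (numeric-coded labels, regression-style networks rounded to a class), not Random_Brains' exact implementation.

library(neuralnet)

# Conceptual sketch of the random-brains idea; not Random_Brains' actual code.
random_brains_sketch <- function(data, y, x_test, variables = 3, brains = 2, hiddens = c(3, 4)) {
  y_num <- as.numeric(as.factor(y))
  votes <- sapply(seq_len(brains), function(b) {
    cols  <- sample(ncol(data), variables)                 # random column subset for this brain
    train <- data.frame(data[, cols, drop = FALSE], y = y_num)
    test  <- data.frame(x_test[, cols, drop = FALSE])
    names(test) <- names(train)[seq_len(variables)]
    fml <- as.formula(paste('y ~', paste(names(test), collapse = ' + ')))
    fit <- neuralnet(fml, data = train, hidden = hiddens, linear.output = TRUE)
    round(compute(fit, test)$net.result)                   # this brain's predictions
  })
  apply(votes, 1, function(v) as.numeric(names(which.max(table(v)))))  # majority vote
}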
predictions |
The predictions for x_test. |
num_brains |
The number of neural networks used to decide the predictions. |
predictors_per_brain |
The number of variables used by the neural networks to decide the predictions. |
hidden_layers |
The vector describing the hidden layers: how many there were and how many nodes each contained. |
preds_per_brain |
This matrix describes which columns were selected by each brain. Each row is a new brain. Each column gives the index of the column used. |
raw_results |
The matrix of raw predictions from the brains. Each row is the cumulative predictions of all the brains. Which prediction won by majority vote can be seen in 'predictions'. |
The neural networks are created using the neuralnet package!
Travis Barton
dat = Cross_val_maker(iris, .2)
train = dat$Train
test = dat$Test
Final_Test = Random_Brains(train[,-5], train$Species, as.matrix(test[,-5]),
                           variables = 3, brains = 2)
table(Final_Test$predictions, as.numeric(test$Species))
Function for extracting the sentence vector from an embeddings matrix in a fast and convenient manner.
Sentence_Vector(Sentence, emb_matrix, dimension, stopwords)
Sentence |
The sentence to find the vector of. |
emb_matrix |
The embeddings matrix to search. |
dimension |
The dimension of the vector to return. |
stopwords |
Words that should not be included in the averaging process. |
The function splits the sentence into words, eliminates all stopwords, finds the vectors of each word, then averages the word vectors into a sentence vector.
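Those steps can be sketched in base R as below, assuming emb_matrix is a data frame with words as row names; this is an illustration rather than the function's exact code.

# Illustrative sketch: average the embedding rows of the non-stopword tokens.
sentence_vector_sketch <- function(sentence, emb_matrix, dimension, stopwords = character(0)) {
  words <- unlist(strsplit(tolower(sentence), ' '))
  words <- setdiff(words, stopwords)
  words <- words[words %in% rownames(emb_matrix)]          # keep only words with known vectors
  colMeans(emb_matrix[words, seq_len(dimension), drop = FALSE])
}

emb <- data.frame(matrix(1:15, nrow = 3), row.names = c('sentence', 'in', 'question'))
sentence_vector_sketch('the sentence in question', emb, 5, stopwords = 'the')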
The sentence vector from an embeddings matrix.
Travis Barton
emb = data.frame(matrix(c(1, 2, 3, 4, 5, 5, 4, 3, 2, 1, 1, 5, 3, 2, 4), nrow = 3),
                 row.names = c('sentence', 'in', 'question'))
Sentence_Vector(c('this is the sentence in question'), emb, 5, c('this', 'is', 'the'))
This function finds the N most used words in a corpus. This is done to identify stop words to better prune data sets before training.
Stopword_Maker(titles, cutoff = 20)
titles |
The documents in which the most populous words are sought. |
cutoff |
The number N of top most-used words to keep as stop words. |
output |
A vector of the N most populous words. |
Travis Barton
test_set = c('this is a testset', 'I am searching for a list of words', 'I like turtles',
             'A rocket would be a fast way of getting to work, but I do not think it is very practical')
res = Stopword_Maker(test_set, 4)
print(res)
Finds the accuracy of square confusion tables.
Table_percent(in_table)
in_table |
a confusion matrix |
The table must be square
Make sure it is square.
Travis Barton
true <- rep(1:10, 10)
test <- rep(1:10, 10)
test[c(2, 22, 33, 89)] = 1
Table_percent(table(true, test))
Function for extracting word vectors from embeddings. This function is an internal function for 'Sentence_Vector'. It averages the word vectors and returns the average of these vectors.
Vector_Puller(words, emb_matrix, dimension)
words |
The words to be extracted. |
emb_matrix |
The embeddings matrix. It must be a data frame. |
dimension |
The dimension of the embeddings to extract. It does not have to match that of the matrix, but it cannot exceed its maximum column count. |
This is a simple and fast internal function.
The vector that corresponds to the average of the word vectors.
Travis Barton
# This is an example emb_matrix
emb = data.frame(matrix(c(1, 2, 3, 4, 5, 5, 4, 3, 2, 1), nrow = 2),
                 row.names = c('cow', 'moo'))
Vector_Puller(c('cow', 'moo'), emb, 5)