# Papaya

## What It Is

A collection of statistics, mathematics, and matrix manipulation related utilities by Adila Faruk for the Processing programming environment. [Last update, March 2013.]

## What It Has

The library contains a number of core statistical analysis methods. Array and matrix utilities are included as well, including Eigenvalue, LU, QR, and SVD decompositions.

Want to know more? Read the Reader's Digest version below or dive right into the JavaDocs.

## Download and Installation

Get papaya version 1.0.0 in .zip format. Unzip and put the (entire) extracted papaya folder into the 'libraries' folder of your Processing sketchbook folder (read the How To Install an External Library wikipage if you have no idea what I'm talking about). Javadoc reference and sketch examples are included in the papaya folder as well.

**Tested on**: Mac OSX using Processing 1.5.1, 2.0.1

**Dependencies**: Your strength, will-power, and continued dedication to the task at hand.

## Overview

Basic **descriptive statistics** methods, such as computing the mean, standard deviation, variance, skew, and kurtosis, are included in the `Descriptive` class. Weighted analyses are available too, but only for the mean, variance, and rms.

The `Correlation` class contains utilities related to the correlation between datasets. If you want to compare datasets (e.g. using a Student T-Test), then look no further than the `Comparison` class. There's also a `Distance` class for computing the various distance metrics.

**Matrix** (or, really, array) related utilities are given in the `Cast`, `Find`, `Mat`, `NaNs`, `Polynomial`, `Rank`, `Sorting`, and `Unique` classes. They all do what their name suggests, except `Mat`, which is an **over-achiever** and contains the matrix and array manipulation methods that don't fit into the other categories. These include mathematical operators extended to arrays and matrices (sums, divisions, multiplications, products, logs), array and matrix manipulation like reverse, copy, and print, and a whole slew of other things that will hopefully cut your code down from 600 lines to a mere 580, leaving you with much more time to focus on making pretty visuals.

Check out the JavaDocs for more detail on these, and other classes.

Here's an xkcd comic to lighten things up a little before we move on.

### Notation

Where relevant, small letters denote arrays of floats or integers, etc. I.e.

`float[] x,y; int[] a;`

Capital letters are reserved for matrices,

`float[][] A,B;`

For example,

```
float[] x = new float[]{44.4, 45.9, 41.9, 53.3, 44.7, 44.1, 50.7, 45.2, 60.1, 32.5, 45.9, 41.9};
float[] y = new float[]{2.6, 3.1, 2.5, 5.0, 3.6, 4.0, 5.2, 2.8, 3.8, 2.6, 5.2, 2.6, 1094.3};
float[][] A = new float[][]{ {1,2,3} , {4.1,2.9,3.2} , {6,1,2} };
```

### Classes

Here's a little more detail on some of the classes, listed in (roughly) alphabetical order.

`Cast`

The `Cast` class is similar to the `toArray(T[] a)` method specified in the Java `Collection` interface, but for `float[]` and `int[]` arrays (as opposed to `Float[]` and `Integer[]`).

- Cast an `ArrayList` of `Integer` values to an `int` array:
`ArrayList<Integer> intArrayList = new ArrayList<Integer>();`
(...fill up `intArrayList` with numbers...)
`int[] intArray = Cast.arrayListToInt(intArrayList);`
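For anyone curious what such a conversion involves, here is a minimal plain-Java sketch of unboxing a list of `Integer` objects into an `int[]`. The names `CastSketch` and `toIntArray` are made up for illustration; this is not papaya's implementation, just the general idea.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class CastSketch {
    // Hypothetical helper: copy each boxed Integer into a primitive int array.
    static int[] toIntArray(List<Integer> list) {
        int[] out = new int[list.size()];
        for (int i = 0; i < out.length; i++) {
            out[i] = list.get(i); // auto-unboxing Integer -> int
        }
        return out;
    }

    public static void main(String[] args) {
        List<Integer> intArrayList = new ArrayList<>();
        intArrayList.add(3);
        intArrayList.add(1);
        intArrayList.add(4);
        int[] intArray = toIntArray(intArrayList);
        System.out.println(Arrays.toString(intArray)); // prints "[3, 1, 4]"
    }
}
```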

`Correlation`

The `Correlation` class can be used to compute different kinds of linear correlation coefficients.

- Pearson's linear correlation coefficient:
`float linearCorr = Correlation.linear(x,y,unbiased);`

- Spearman's rank correlation coefficient:
`float spearmanCorr = Correlation.spearman(x,y,unbiased);`

- Covariance:
`float covariance = Correlation.cov(x,y,unbiased);`
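To make the quantities above concrete, here is a rough plain-Java sketch of the textbook covariance and Pearson formulas. It assumes the `unbiased` flag switches the divisor between `n - 1` and `n`, which is the usual convention but is an assumption about papaya here. The class `CorrelationSketch` and its methods are hypothetical and do not call the library.

```java
class CorrelationSketch {
    static double mean(float[] a) {
        double s = 0;
        for (float v : a) s += v;
        return s / a.length;
    }

    // Covariance: divides by (n - 1) when unbiased is true, by n otherwise.
    static double cov(float[] x, float[] y, boolean unbiased) {
        if (x.length != y.length) throw new IllegalArgumentException("length mismatch");
        double mx = mean(x), my = mean(y), s = 0;
        for (int i = 0; i < x.length; i++) s += (x[i] - mx) * (y[i] - my);
        return s / (unbiased ? x.length - 1 : x.length);
    }

    // Pearson's r = cov(x,y) / (std(x) * std(y)); the divisor cancels out.
    static double pearson(float[] x, float[] y) {
        return cov(x, y, true) / Math.sqrt(cov(x, x, true) * cov(y, y, true));
    }

    public static void main(String[] args) {
        float[] x = {1f, 2f, 3f, 4f};
        float[] y = {2f, 4f, 6f, 8f};
        System.out.println("cov = " + cov(x, y, true));
        System.out.println("r   = " + pearson(x, y));
    }
}
```

Since `y` is an exact multiple of `x` in this toy data, the correlation should come out at (very nearly) 1.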

`Descriptive`

The `Descriptive` class contains some of the basic descriptive statistics methods.

Averages:

```
float meanX = Descriptive.mean(x);
float geometricMeanY = Descriptive.Mean.geometric(y);
float weightedMeanX = Descriptive.Weighted.mean(x,weights);
```

Unbiased estimates of the standard deviation and variance:

```
float standardDeviation = Descriptive.std(x,true);
float variance = Descriptive.var(x,true);
float weightedVar = Descriptive.Weighted.var(x,weights,true);
```
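As a reminder of what the `true` flag is doing, here is a small plain-Java sketch of the biased and unbiased variance plus a weighted mean, using the standard formulas. The class `DescriptiveSketch` and its method names are illustrative only; the assumption is that papaya follows the same textbook definitions.

```java
class DescriptiveSketch {
    static double mean(float[] x) {
        double s = 0;
        for (float v : x) s += v;
        return s / x.length;
    }

    // Variance: sum of squared deviations over (n - 1) if unbiased, else over n.
    static double var(float[] x, boolean unbiased) {
        double m = mean(x), s = 0;
        for (float v : x) s += (v - m) * (v - m);
        return s / (unbiased ? x.length - 1 : x.length);
    }

    static double std(float[] x, boolean unbiased) {
        return Math.sqrt(var(x, unbiased));
    }

    // Weighted mean: sum(w_i * x_i) / sum(w_i).
    static double weightedMean(float[] x, float[] w) {
        double num = 0, den = 0;
        for (int i = 0; i < x.length; i++) {
            num += w[i] * x[i];
            den += w[i];
        }
        return num / den;
    }

    public static void main(String[] args) {
        float[] x = {2f, 4f, 4f, 4f, 5f, 5f, 7f, 9f};
        System.out.println("biased var   = " + var(x, false));
        System.out.println("unbiased var = " + var(x, true));
        System.out.println("biased std   = " + std(x, false));
    }
}
```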

Other misc:

```
float[] tukeyFiveNumSummaryX = Descriptive.tukeyFiveNum(x);
float modY = Descriptive.mod(y);
float sumOfSquaresX = Descriptive.Sum.squares(x);
float[] outliersY = Descriptive.outliers(y);
```

`Distance`

The `Distance` class contains methods for computing various "distance" metrics for multidimensional scaling. These include `chebychev`, `cityblock`, `correlation`, `cosine`, `euclidean`, `mahalanobis`, `minkowski`, `seuclidean`, and `spearman` distances.
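For a feel of what a few of these metrics compute, here is a plain-Java sketch of the Minkowski family between two vectors (cityblock is order 1, euclidean is order 2, and chebychev is the limiting maximum). `DistanceSketch` is a made-up class that uses the standard definitions; papaya's own methods may take different arguments, for instance whole matrices.

```java
class DistanceSketch {
    // Minkowski distance of order p: (sum |a_i - b_i|^p)^(1/p).
    static double minkowski(float[] a, float[] b, double p) {
        double s = 0;
        for (int i = 0; i < a.length; i++) {
            s += Math.pow(Math.abs(a[i] - b[i]), p);
        }
        return Math.pow(s, 1.0 / p);
    }

    static double cityblock(float[] a, float[] b) {
        return minkowski(a, b, 1);
    }

    static double euclidean(float[] a, float[] b) {
        return minkowski(a, b, 2);
    }

    // Chebychev distance: the largest absolute coordinate difference.
    static double chebychev(float[] a, float[] b) {
        double m = 0;
        for (int i = 0; i < a.length; i++) {
            m = Math.max(m, Math.abs(a[i] - b[i]));
        }
        return m;
    }

    public static void main(String[] args) {
        float[] a = {0f, 0f};
        float[] b = {3f, 4f};
        System.out.println("cityblock = " + cityblock(a, b));
        System.out.println("euclidean = " + euclidean(a, b));
        System.out.println("chebychev = " + chebychev(a, b));
    }
}
```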

`Eigenvalue`, `QR`, `LU`, `SVD`

The `Mat` class contains some non-trivial matrix-related utilities like computing the inverse, rank, determinant, and norms of input matrices. These functions actually call the `QR`, `LU`, `SVD`, and `Eigenvalue` decomposition classes.

A few things for the few of you who are still reading this:

- You can solve very large systems of equations of the form `A*x = y`, or `A*X = Y`, by using the `solve()` methods in the `QR` or `LU` classes.

- All these classes take in float arrays but return doubles (for higher precision). You can cast the results back to floats using the `Cast` class if necessary.

Here's a rough idea of how things work.

- Initialize an `Eigenvalue` class. This class contains the eigenvectors and eigenvalues of the input matrix `A`:
`Eigenvalue eigen = new Eigenvalue(A);`

- Get the eigenvector matrix of `eigen`. Each row corresponds to an eigenvector:
`eigen.getV();`

- Get the real or imaginary values of the eigenvalue array:
`eigen.getRealEigenvalues();`
`eigen.getImagEigenvalues();`

- Create a QR decomposition of the input matrix `A`:
`QR qr = new QR(A);`

- Get `Q`, `R`, and whether the object has full rank:
`double[][] Q = qr.getQ();`
`double[][] R = qr.getR();`
`boolean isFullRank = qr.isFullRank();`

- Get the *array*, `x`, corresponding to the least squares solution of `A*x = y`:
`double[] x = qr.solve(y);`

- Get the *matrix*, `X`, corresponding to the least squares solution of `A*X = Y`:
`double[][] X = qr.solve(Y);`

`Find`

The `Find` class *finds* stuff. It's as simple as that.

- Check if an array contains `valueToFind`:
`boolean containsValue = Find.contains(x,valueToFind);`

- Get the index of `valueToFind` in the input array. Returns `-1` if not found:
`int idx = Find.indexOf(x,valueToFind);`

- Get all indices of `valueToFind` in the input array. Returns an empty array if it is not found:
`int[] indices = Find.indicesO(x,valueToFind);`

- Get the number of times `valueToFind` is present in `x`:
`int numRepeats = Find.numRepeats(x,valueToFind);`

`Linear`

The `Linear` class contains methods related to determining the linear relationship between two datasets (arrays of equal length), such as the best-fit line parameters, Box-Cox transformations, etc.

- Get the slope and y-intercept of the best-fit line `z = slope*x + intercept` by minimizing the sum of squared differences between `z` and `y`:
`float[] bestFit = Linear.bestFit(x,y); float slope = bestFit[0]; float intercept = bestFit[1];`

- Get the array of residuals given by `Delta_i = z_i - y_i`, where `z_i = (slope*x_i + intercept)` is the best-fit line:
`float[] residuals = Linear.residuals(x,y);`
If the slope and intercept have already been computed:
`float[] residuals = Linear.residuals(x,y,slope,intercept);`
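The best-fit computation described above can be sketched in plain Java using the closed-form least-squares solution, `slope = Sxy / Sxx` and `intercept = mean(y) - slope * mean(x)`. The class `LinearSketch` is hypothetical and only mirrors the `{slope, intercept}` ordering and the `z_i - y_i` residual convention described in the text; it is not the library's code.

```java
class LinearSketch {
    // Returns {slope, intercept} of the least-squares line through (x_i, y_i).
    static float[] bestFit(float[] x, float[] y) {
        int n = x.length;
        double mx = 0, my = 0;
        for (int i = 0; i < n; i++) {
            mx += x[i];
            my += y[i];
        }
        mx /= n;
        my /= n;
        double sxy = 0, sxx = 0;
        for (int i = 0; i < n; i++) {
            sxy += (x[i] - mx) * (y[i] - my);
            sxx += (x[i] - mx) * (x[i] - mx);
        }
        double slope = sxy / sxx; // undefined (NaN) if all x values are equal
        return new float[]{(float) slope, (float) (my - slope * mx)};
    }

    // Residuals Delta_i = z_i - y_i, with z_i = slope * x_i + intercept.
    static float[] residuals(float[] x, float[] y, float slope, float intercept) {
        float[] r = new float[x.length];
        for (int i = 0; i < x.length; i++) {
            r[i] = (slope * x[i] + intercept) - y[i];
        }
        return r;
    }

    public static void main(String[] args) {
        float[] x = {0f, 1f, 2f, 3f};
        float[] y = {1f, 3f, 5f, 7f};
        float[] fit = bestFit(x, y);
        System.out.println("slope = " + fit[0] + ", intercept = " + fit[1]);
    }
}
```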

`Mat`

The `Mat` class has *a ton* of useful functions, all of which deal with array or matrix related stuff.

- Get a constant array. (Analogous to `Arrays.fill(float[] a, float val)`, but you don't have to initialize your blank array.):
`float[] z = Mat.constant(v,ln);`

- Sum two arrays:
`float[] z = Mat.sum(x,y);`

- Multiply two arrays:
`float[] z = Mat.multiply(x,y);`

- Processing's `map` function, extended to arrays:
`float[] mapped = Mat.map(x,low1,high1,low2,high2);`

- Normalize the input array to its minimum and maximum values. The output array is bounded by [0,1]:
`float[] z = Mat.normalizeToMinMax(x);`

- Normalize the input array to the sum of its values. (Great for transforming values to percentages.):
`float[] z = Mat.normalizeToSum(x);`

- Replace all occurrences of a given value in an array with another value. Here, we're replacing all 0 elements with 1:
`float[] z = Mat.replace(y,0,1);`

- Print an entire array out to the screen, with each element shown to `numDecimals` places:
`Mat.print(y,numDecimals);`

*Note: This is similar to* `System.out.println( join(nfc(y,5), "\t") );`
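The two normalizations mentioned above are simple enough to sketch in plain Java. This assumes the usual definitions, `(x_i - min) / (max - min)` and `x_i / sum(x)`; the class `NormalizeSketch` is made up for illustration and is not the library's source.

```java
class NormalizeSketch {
    // Rescale so that the minimum maps to 0 and the maximum maps to 1.
    // If all elements are equal, max - min is 0 and the result is NaN.
    static float[] toMinMax(float[] x) {
        float min = x[0], max = x[0];
        for (float v : x) {
            min = Math.min(min, v);
            max = Math.max(max, v);
        }
        float[] z = new float[x.length];
        for (int i = 0; i < x.length; i++) {
            z[i] = (x[i] - min) / (max - min);
        }
        return z;
    }

    // Divide each element by the total so that the output sums to 1.
    static float[] toSum(float[] x) {
        double sum = 0;
        for (float v : x) sum += v;
        float[] z = new float[x.length];
        for (int i = 0; i < x.length; i++) {
            z[i] = (float) (x[i] / sum);
        }
        return z;
    }

    public static void main(String[] args) {
        float[] x = {2f, 4f, 6f};
        System.out.println(java.util.Arrays.toString(toMinMax(x))); // prints "[0.0, 0.5, 1.0]"
        System.out.println(java.util.Arrays.toString(toSum(x)));
    }
}
```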

- Dot multiply two matrices, `C[i][j] = A[i][j] * B[i][j]`:
`float[][] C = Mat.dotMultiply(A,B);`

- Get the determinant of the matrix:
`float detA = Mat.det(A);`

- Get the inverse of a matrix if `A` is square; get the pseudo-inverse otherwise:
`float[][] invA = Mat.inverse(A);`

- Get an `n-by-n` identity matrix:
`float[][] identMatrix = Mat.identity(n);`

- Print matrix `A` to the screen with each element shown to `numDecimals` places (compact println):
`Mat.print(A,numDecimals);`

`NaNs`

Files without NaNs exist in the same world as rainbow-breathing unicorns that eat glitter and shit out stars. The `NaNs` class is for the rest of us stuck in this world replete with Not-A-Number.

- Check if an array contains NaNs:
`boolean containsNaNs = NaNs.containsNaNs(x);`

- Remove all NaNs from an array:
`float[] xNoNaN = NaNs.eliminate(x);`

- Get a new array with the NaN elements in the original data set replaced with the value specified by `v`:
`float[] z = NaNs.replaceNewWith(x,v);`

- Replace the NaN values in the original array with `v`:
`NaNs.replaceOriginalWith(x,v);`

- Get the indices in the array corresponding to NaN values:
`int[] indices = NaNs.getNaNPositions(x);`
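If you'd like to see what this kind of pre-processing looks like under the hood, here is a rough plain-Java sketch built on `Float.isNaN` (NaN never compares equal to anything, itself included, so `==` won't find it). The class `NaNSketch` and its method names are hypothetical stand-ins, not papaya's code.

```java
class NaNSketch {
    static boolean containsNaN(float[] x) {
        for (float v : x) {
            if (Float.isNaN(v)) return true;
        }
        return false;
    }

    // Returns a new, shorter array with every NaN dropped.
    static float[] eliminate(float[] x) {
        int n = 0;
        for (float v : x) {
            if (!Float.isNaN(v)) n++;
        }
        float[] out = new float[n];
        int j = 0;
        for (float v : x) {
            if (!Float.isNaN(v)) out[j++] = v;
        }
        return out;
    }

    // Returns a copy with every NaN replaced by v; the input is left untouched.
    static float[] replaceNew(float[] x, float v) {
        float[] out = x.clone();
        for (int i = 0; i < out.length; i++) {
            if (Float.isNaN(out[i])) out[i] = v;
        }
        return out;
    }

    public static void main(String[] args) {
        float[] x = {1f, Float.NaN, 3f};
        System.out.println(containsNaN(x));                                // prints "true"
        System.out.println(java.util.Arrays.toString(eliminate(x)));       // prints "[1.0, 3.0]"
        System.out.println(java.util.Arrays.toString(replaceNew(x, 0f)));  // prints "[1.0, 0.0, 3.0]"
    }
}
```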

`Polynomial`

The `Polynomial` class contains methods for evaluating polynomial equations.

`y = polyval(x,coeff)` returns the value of a polynomial of degree `n` evaluated at `x`. The input argument `coeff` is an array of length `n+1` whose elements are the coefficients in descending powers of the polynomial to be evaluated.

`y = coeff[0]*x^n + coeff[1]*x^(n-1) + ... + coeff[n-1]*x + coeff[n]`

E.g. for a quadratic function,

`y = coeff[0]*x^2 + coeff[1]*x + coeff[2]`

The class accepts both `float` and `double` types.

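```
// Array of values of a polynomial of degree n, evaluated at each element of the array x.
float[] yArray = Polynomial.polyval(x, coeff);
// Value of the same polynomial at a single point x0 (a float).
float ySingle = Polynomial.polyval(x0, coeff);
```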
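For reference, a polynomial with coefficients in descending powers is commonly evaluated with Horner's rule, which avoids computing explicit powers of `x`. The plain-Java sketch below shows that technique; whether papaya uses it internally is an assumption, and `PolySketch` is a made-up class name.

```java
class PolySketch {
    // Horner's rule: ((c0*x + c1)*x + c2)*x + ... for descending-power coefficients.
    static double polyval(double x, double[] coeff) {
        double y = 0;
        for (double c : coeff) {
            y = y * x + c;
        }
        return y;
    }

    // Evaluate the same polynomial at every element of x.
    static double[] polyval(double[] x, double[] coeff) {
        double[] y = new double[x.length];
        for (int i = 0; i < x.length; i++) {
            y[i] = polyval(x[i], coeff);
        }
        return y;
    }

    public static void main(String[] args) {
        double[] coeff = {1, 2, 3};              // x^2 + 2x + 3
        System.out.println(polyval(2.0, coeff)); // prints "11.0"
    }
}
```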

`Visuals`

The `Visuals` class stores most of the general plotting methods like writing the x and y labels, drawing the tick-marks, and coloring in the background. The `BoxPlot`, `CorrelationPlot`, `ScatterPlot`, and `SubPlot` classes actually take in data and plot it. These classes were all built to enable quick visualization rather than maximal user-control (and also, to make my life a little easier when creating the examples in the example folder). Think of these more as classes for getting a quick snapshot of the data, so you can figure out what's important and what's not. For greater control, check out the gicentreUtils Processing library, or even the Java JFreeChart package. Or use the source code in the folder as a template for your soon-to-be-out-of-this-world-awesome-graphic. :)

I mean, seriously. *You can do better than the default*.

## FAQ

**What's the difference between this library and the Apache Commons Math (or any other larger Math/Statistics) library out there?**

The short answer is that it's a lot more lightweight. What I mean by that is that most of the classes in papaya don't store any variables but rather consist of collections of methods for computing and spitting out whatever it is they're supposed to compute and spit out. This eliminates unnecessary memory storage for variables that you're not interested in storing and/or creating additional classes when primitives suffice. Of course, lightweight is also code for "has far fewer methods". :) I've tried to include most of what can be considered core methods, but for computations involving higher-level statistics (e.g. manova as opposed to anova), consider using R.

papaya was also written specifically for Processing, and the way that most people use Processing. Hence, most of the methods take in float arrays or matrices instead of doubles, DoubleArrayList, RealMatrix, BlockMatrix, or one of the many other matrix-related classes available in other more substantial packages. This means that you don't have to rewrite your existing Processing code to use (for example) RealMatrices, DoubleArrayLists, or doubles even.

**Doesn't this sacrifice accuracy?**

Not really. At least as far as I can tell. For methods where accuracy is key (for example in inverting matrices, or computing probability distributions), the internal computations are actually done using doubles and the results either output as doubles or, in some cases, cast back to floats. This overcomes most of the accuracy issues, although I didn't test this against really large data arrays. Of course, if you suspect that your computations are verging on overflow/underflow limits, then you might want to exercise extra caution.
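To illustrate why doing the internal arithmetic in doubles helps, here is a small plain-Java sketch that sums the same `float` data twice: once with a `float` accumulator and once with a `double` accumulator that is cast back to `float` at the end. It is a generic demonstration of the idea with a made-up class name (`PrecisionSketch`), not code taken from papaya.

```java
import java.util.Arrays;

class PrecisionSketch {
    // Accumulate in float: rounding error builds up with every addition.
    static float sumInFloat(float[] data) {
        float s = 0f;
        for (float v : data) s += v;
        return s;
    }

    // Accumulate in double, then cast the final result back to float.
    static float sumInDouble(float[] data) {
        double s = 0.0;
        for (float v : data) s += v;
        return (float) s;
    }

    public static void main(String[] args) {
        float[] data = new float[1_000_000];
        Arrays.fill(data, 0.1f); // the exact sum would be about 100000
        System.out.println("float accumulator : " + sumInFloat(data));
        System.out.println("double accumulator: " + sumInDouble(data));
    }
}
```

On data like this the `float` accumulator drifts visibly away from 100000, while the `double` accumulator stays much closer.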

**My dataset has NaN elements. Can I use the papaya library anyway?**

Yes, and here's how.

In the `setup()` portion of your sketch, include this line of code: `println("DATASET HAS NANS!!!");`

Now, every time you run your sketch, this will print to the screen and you will be reminded of the fact that your dataset has NaNs and needs to be pre-processed (using, for example, the papaya NaNs class) prior to doing anything.

This turned out to be a much more efficient way of doing things than having each function check each element of the input for NaNs individually before performing any computations. It was also more efficient (for me at least) than creating two different functions - one for arrays with NaNs and another for arrays without.

**What's your favorite statistics joke?**

Give me a moment...

## Issues/Bugs

Right now, the quickest method is to just send a gmail to adilapapaya.