Papaya
What It Is
A collection of statistics, mathematics, and matrix manipulation related utilities by Adila Faruk for the Processing programming environment. [Last update, March 2013.]
What It Has
The library contains a number of core statistical analysis methods. Array and matrix related utilities are included as well, among them Eigenvalue, LU, QR, and SVD decompositions.
Want to know more? Read the Reader's Digest version below or dive right into the JavaDocs.
Download and Installation
Get papaya version 1.0.0 in .zip format. Unzip and put the (entire) extracted papaya folder into the 'libraries' folder of your Processing sketchbook folder (read the How To Install an External Library wikipage if you have no idea what I'm talking about). Javadoc reference and sketch examples are included in the papaya folder as well.
Tested on: Mac OSX using Processing 1.5.1, 2.0.1
Dependencies: Your strength, will-power, and continued dedication to the task at hand.
Overview
Basic descriptive statistics methods, such as computing the mean, standard deviation, variance, skew, and kurtosis, are included in the Descriptive class. Weighted analyses are available too, but only for the mean, variance, and rms.
The Correlation class contains utilities related to the correlation between datasets. If you want to compare datasets (e.g. using a Student T-Test), then look no further than the Comparison class. There's also a Distance class for computing various distance metrics.
Matrix (or, really, array) related utilities are given in the Cast, Find, Mat, NaNs, Polynomial, Rank, Sorting, and Unique classes. They all do what their names suggest, except Mat, which is an over-achiever and contains the matrix and array manipulation methods that don't fit into the other categories. These include mathematical operators extended to arrays and matrices (sums, divisions, multiplications, products, logs), array and matrix manipulation like reverse, copy, and print, and a whole slew of other things that will hopefully cut your code down from 600 lines to a mere 580, leaving you with much more time to focus on making pretty visuals.
Check out the JavaDocs for more detail on these, and other classes.
Here's an xkcd comic to lighten things up a little before we move on.
Notation
Where relevant, small letters denote arrays of floats or integers, etc. I.e.
float[] x,y; int[] a;
Capital letters are reserved for matrices,
float[][] A,B;
For example,
float[] x = new float[]{44.4, 45.9, 41.9, 53.3, 44.7, 44.1, 50.7, 45.2, 60.1, 32.5, 45.9, 41.9};
float[] y = new float[]{2.6, 3.1, 2.5, 5.0, 3.6, 4.0, 5.2, 2.8, 3.8, 2.6, 5.2, 2.6, 1094.3};
float[][] A = new float[][]{ {1,2,3} , {4.1,2.9,3.2} , {6,1,2} };
Classes
Here's a little more detail on some of the classes, listed in (roughly) alphabetical order.
Cast
The Cast class is similar to the toArray(T[] a) method specified in the Java Collection interface, but for float[] and int[] arrays (as opposed to Float[] and Integer[]).
Cast an ArrayList of Integers to an int array.
ArrayList<Integer> intArrayList = new ArrayList<Integer>();
...fill up intArrayList with numbers...
int[] intArray = Cast.arrayListToInt(intArrayList);
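If you're curious what a conversion like this boils down to, here's a minimal sketch in plain Java. It is a stand-alone illustration, not papaya's source: the class name CastSketch and the method toIntArray are made up for the example, and Cast.arrayListToInt may well be implemented differently.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-alone illustration; not part of papaya.
class CastSketch {
    // Copy each boxed Integer into a primitive int array of the same length.
    static int[] toIntArray(List<Integer> values) {
        int[] out = new int[values.size()];
        for (int i = 0; i < out.length; i++) {
            // Auto-unboxing; a null element would throw a NullPointerException here.
            out[i] = values.get(i);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Integer> intArrayList = new ArrayList<>();
        intArrayList.add(3);
        intArrayList.add(1);
        intArrayList.add(4);
        System.out.println(Arrays.toString(toIntArray(intArrayList)));
    }
}
```

The reason a helper like this is needed at all: toArray(T[] a) only works with object arrays such as Integer[], so primitives have to be unboxed element by element.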
Correlation
The Correlation class can be used to compute different kinds of linear correlation coefficients.
- Pearson's linear correlation coefficient:
float linearCorr = Correlation.linear(x,y,unbiased);
- Spearman's rank correlation coefficient:
float spearmanCorr = Correlation.spearman(x,y,unbiased);
- Covariance:
float covariance = Correlation.cov(x,y,unbiased);
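For reference, here's roughly what the covariance and Pearson coefficient compute, as a stand-alone plain-Java sketch. The class PearsonSketch is hypothetical and not papaya's code. I'm also assuming that an unbiased flag switches the denominator between n-1 and n; that is the usual convention, but check the JavaDocs for what papaya actually does.

```java
// Hypothetical stand-alone illustration; not part of papaya.
class PearsonSketch {
    static double mean(float[] v) {
        double s = 0;
        for (float f : v) s += f;
        return s / v.length;
    }

    // Covariance of x and y. ASSUMPTION: unbiased=true divides by n-1, false by n.
    static double cov(float[] x, float[] y, boolean unbiased) {
        if (x.length != y.length) {
            throw new IllegalArgumentException("x and y must have the same length");
        }
        double mx = mean(x), my = mean(y), s = 0;
        for (int i = 0; i < x.length; i++) {
            s += (x[i] - mx) * (y[i] - my);
        }
        return s / (unbiased ? x.length - 1 : x.length);
    }

    // Pearson's r = cov(x,y) / sqrt(var(x) * var(y)).
    // The n vs n-1 choice cancels out, so r does not depend on it.
    static double pearson(float[] x, float[] y) {
        return cov(x, y, true) / Math.sqrt(cov(x, x, true) * cov(y, y, true));
    }

    public static void main(String[] args) {
        float[] x = {1f, 2f, 3f, 4f};
        float[] y = {2f, 4f, 6f, 8f};
        System.out.println("cov = " + cov(x, y, true));
        System.out.println("r   = " + pearson(x, y));
    }
}
```

Accumulating in double even though the inputs are float keeps rounding error down on longer arrays.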
Descriptive
The Descriptive class contains some of the basic descriptive statistics methods.
Averages:
float meanX = Descriptive.mean(x);
float geometricMeanY = Descriptive.Mean.geometric(y);
float weightedMeanX = Descriptive.Weighted.mean(x,weights);
Unbiased estimates of the standard deviation and variance:
float standardDeviation = Descriptive.std(x,true);
float variance = Descriptive.var(x,true);
float weightedVar = Descriptive.Weighted.var(x,weights,true);
Other misc:
float[] tukeyFiveNumSummaryX = Descriptive.tukeyFiveNum(x);
float modY = Descriptive.mod(y);
float sumOfSquaresX = Descriptive.Sum.squares(x);
float[] outliersY = Descriptive.outliers(y);
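To make the "unbiased" and "weighted" variants concrete, here's a small plain-Java sketch of the underlying formulas. DescriptiveSketch and its methods are hypothetical names, not papaya's implementation, and I'm assuming the boolean flag selects the n-1 denominator.

```java
// Hypothetical stand-alone illustration; not part of papaya.
class DescriptiveSketch {
    static double mean(float[] x) {
        double s = 0;
        for (float v : x) s += v;
        return s / x.length;
    }

    // Variance. ASSUMPTION: unbiased=true divides by n-1, false divides by n.
    static double var(float[] x, boolean unbiased) {
        double m = mean(x), s = 0;
        for (float v : x) s += (v - m) * (v - m);
        return s / (unbiased ? x.length - 1 : x.length);
    }

    static double std(float[] x, boolean unbiased) {
        return Math.sqrt(var(x, unbiased));
    }

    // Weighted mean: sum(w_i * x_i) / sum(w_i).
    static double weightedMean(float[] x, float[] weights) {
        if (x.length != weights.length) {
            throw new IllegalArgumentException("x and weights must have the same length");
        }
        double num = 0, den = 0;
        for (int i = 0; i < x.length; i++) {
            num += weights[i] * x[i];
            den += weights[i];
        }
        return num / den;
    }

    public static void main(String[] args) {
        float[] x = {2f, 4f, 4f, 4f, 5f, 5f, 7f, 9f};
        System.out.println("mean = " + mean(x));
        System.out.println("var  = " + var(x, true));
        System.out.println("std  = " + std(x, true));
    }
}
```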
Distance
The Distance class contains methods for computing various "distance" metrics for multidimensional scaling. These include chebychev, cityblock, correlation, cosine, euclidean, mahalanobis, minkowski, seuclidean, and spearman distances.
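If some of those names are unfamiliar, here are the textbook definitions of four of them for a pair of equal-length vectors, sketched in plain Java. DistanceSketch is a hypothetical class for illustration only; papaya's Distance methods may take different arguments (for example, a whole data matrix) and return different shapes.

```java
// Hypothetical stand-alone illustration; not part of papaya.
class DistanceSketch {
    private static void checkSameLength(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vectors must have the same length");
        }
    }

    // Euclidean: sqrt(sum (a_i - b_i)^2).
    static double euclidean(float[] a, float[] b) {
        checkSameLength(a, b);
        double s = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            s += d * d;
        }
        return Math.sqrt(s);
    }

    // City block (Manhattan): sum |a_i - b_i|.
    static double cityblock(float[] a, float[] b) {
        checkSameLength(a, b);
        double s = 0;
        for (int i = 0; i < a.length; i++) s += Math.abs(a[i] - b[i]);
        return s;
    }

    // Chebychev: max |a_i - b_i|.
    static double chebychev(float[] a, float[] b) {
        checkSameLength(a, b);
        double m = 0;
        for (int i = 0; i < a.length; i++) m = Math.max(m, Math.abs(a[i] - b[i]));
        return m;
    }

    // Minkowski of order p: (sum |a_i - b_i|^p)^(1/p).
    // p=1 gives city block, p=2 gives Euclidean.
    static double minkowski(float[] a, float[] b, double p) {
        checkSameLength(a, b);
        double s = 0;
        for (int i = 0; i < a.length; i++) s += Math.pow(Math.abs(a[i] - b[i]), p);
        return Math.pow(s, 1.0 / p);
    }

    public static void main(String[] args) {
        float[] a = {0f, 0f};
        float[] b = {3f, 4f};
        System.out.println(euclidean(a, b) + " " + cityblock(a, b) + " " + chebychev(a, b));
    }
}
```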
Eigenvalue, QR, LU, SVD
The Mat class contains some non-trivial matrix-related utilities like computing the inverse, rank, determinant, and norms of input matrices. These functions actually call the QR, LU, SVD, and Eigenvalue decomposition classes.
A few things for the few of you who are still reading this:
You can solve very large systems of equations of the form A*x = y, or A*X = Y, by using the solve() methods in the QR or LU classes.
All these classes take in float arrays but return doubles (for higher precision) -- you can cast the results back to floats using the Cast class if necessary.
Here's a rough idea of how things work.
- Initialize an Eigenvalue class. This class contains the eigenvectors & eigenvalues of the input matrix A.
Eigenvalue eigen = new Eigenvalue(A);
- Get the eigenvector matrix of eigen. Each row corresponds to an eigenvector.
eigen.getV();
- Get the real or imaginary values of the eigenvalue array.
eigen.getRealEigenvalues();
eigen.getImagEigenvalues();
- Create a QR decomposition of the input matrix A.
QR qr = new QR(A);
- Get Q, R, and whether the object has full rank.
double[][] Q = qr.getQ();
double[][] R = qr.getR();
boolean isFullRank = qr.isFullRank();
- Get the array, x, corresponding to the least squares solution of A*x = y.
double[] x = qr.solve(y);
- Get the matrix X corresponding to the least squares solution of A*X = Y.
double[][] X = qr.solve(Y);
Find
The Find class finds stuff. It's as simple as that.
Check if an array contains valueToFind.
boolean containsValue = Find.contains(x,valueToFind);
Get the index of valueToFind in the input array. Returns -1 if not found.
int idx = Find.indexOf(x,valueToFind);
Get all indices of valueToFind in the input array. Returns an empty array if it is not found.
int[] indices = Find.indicesO(x,valueToFind);
Get the number of times valueToFind is present in x.
int numRepeats = Find.numRepeats(x,valueToFind);
Linear
The Linear class contains methods related to determining the linear relationship between two datasets (arrays of equal length), such as the best-fit line parameters, Box-Cox transformations, etc.
Get the slope and y-intercept of the best-fit line z = slope*x + intercept by minimizing the sum of squared differences between z and y.
float[] bestFit = Linear.bestFit(x,y);
float slope = bestFit[0];
float intercept = bestFit[1];
Get the array of residuals given by Delta_i = z_i - y_i, where z_i = (slope*x_i + intercept) is the best-fit line.
float[] residuals = Linear.residuals(x,y);
//If the slope and intercept have already been computed:
float[] residuals = Linear.residuals(x,y,slope,intercept);
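For the curious, the closed-form least-squares solution behind a best-fit line looks like this in plain Java. LinearFitSketch is a hypothetical stand-alone class, not papaya's source; it follows the conventions described above ({slope, intercept} ordering and residuals defined as z_i - y_i).

```java
// Hypothetical stand-alone illustration; not part of papaya.
class LinearFitSketch {
    // Returns {slope, intercept} of the least-squares line through the points (x_i, y_i).
    static float[] bestFit(float[] x, float[] y) {
        if (x.length != y.length || x.length < 2) {
            throw new IllegalArgumentException("need two equal-length arrays with at least 2 points");
        }
        int n = x.length;
        double mx = 0, my = 0;
        for (int i = 0; i < n; i++) {
            mx += x[i];
            my += y[i];
        }
        mx /= n;
        my /= n;
        double sxy = 0, sxx = 0;
        for (int i = 0; i < n; i++) {
            double dx = x[i] - mx;
            sxy += dx * (y[i] - my);
            sxx += dx * dx;
        }
        double slope = sxy / sxx;            // NaN if every x value is identical
        double intercept = my - slope * mx;  // the line passes through (mean x, mean y)
        return new float[] {(float) slope, (float) intercept};
    }

    // Residuals Delta_i = z_i - y_i, with z_i = slope * x_i + intercept.
    static float[] residuals(float[] x, float[] y, float slope, float intercept) {
        float[] r = new float[x.length];
        for (int i = 0; i < x.length; i++) {
            r[i] = (slope * x[i] + intercept) - y[i];
        }
        return r;
    }

    public static void main(String[] args) {
        float[] x = {0f, 1f, 2f, 3f};
        float[] y = {1f, 3f, 5f, 7f};
        float[] fit = bestFit(x, y);
        System.out.println("slope = " + fit[0] + ", intercept = " + fit[1]);
    }
}
```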
Matrix
The Mat class has a ton of useful functions, all of which deal with array or matrix related stuff.
Get a constant array. (Analogous to Arrays.fill(float[] arr, float v), but you don't have to initialize your blank array.)
float[] z = Mat.constant(v,ln);
Sum two arrays.
float[] z = Mat.sum(x,y);
Multiply two arrays.
float[] z = Mat.multiply(x,y);
Processing's map function, extended to arrays.
float[] mapped = Mat.map(x,low1,high1,low2,high2);
Normalize the input array to its minimum and maximum values. Output array is now bounded by [0,1].
float[] z = Mat.normalizeToMinMax(x);
Normalize the input array to the sum of its values. (Great for transforming values to percentages.)
float[] z = Mat.normalizeToSum(x);
Replace all of a given value in an array with another value. Here, we're replacing all 0 elements with 1.
float[] z = Mat.replace(y,0,1);
Print an entire array out to the screen, with each element shown to numDecimals decimal places.
Mat.print(y,numDecimals);
Note: This is similar to System.out.println( join(nfc(y,5), "\t") );
Dot multiply two matrices: C[i][j] = A[i][j] * B[i][j].
float[][] C = Mat.dotMultiply(A,B);
Get the determinant of the matrix.
float detA = Mat.det(A);
Get the inverse of a matrix if A is square; get the pseudo-inverse otherwise.
float[][] invA = Mat.inverse(A);
Get an n-by-n identity matrix.
float[][] identMatrix = Mat.identity(n);
Print matrix A to the screen with each element shown to numDecimals (compact println).
Mat.print(A,numDecimals);
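As a concrete picture of the two normalizations mentioned above, here's a plain-Java sketch of what they compute. NormalizeSketch is a hypothetical stand-alone class and not papaya's implementation; in particular, how papaya handles an array whose values are all equal (or sum to zero) is not something this sketch tells you.

```java
import java.util.Arrays;

// Hypothetical stand-alone illustration; not part of papaya.
class NormalizeSketch {
    // Rescale so the minimum maps to 0 and the maximum maps to 1.
    static float[] normalizeToMinMax(float[] x) {
        float min = x[0], max = x[0];
        for (float v : x) {
            min = Math.min(min, v);
            max = Math.max(max, v);
        }
        float[] z = new float[x.length];
        for (int i = 0; i < x.length; i++) {
            z[i] = (x[i] - min) / (max - min); // 0/0 = NaN when all values are equal
        }
        return z;
    }

    // Divide each element by the sum, so the output adds up to 1.
    static float[] normalizeToSum(float[] x) {
        double sum = 0;
        for (float v : x) sum += v;
        float[] z = new float[x.length];
        for (int i = 0; i < x.length; i++) {
            z[i] = (float) (x[i] / sum); // NaN or infinite when the sum is 0
        }
        return z;
    }

    public static void main(String[] args) {
        float[] x = {2f, 4f, 6f};
        System.out.println(Arrays.toString(normalizeToMinMax(x)));
        System.out.println(Arrays.toString(normalizeToSum(x)));
    }
}
```

Both helpers return a new array and leave the input untouched.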
NaNs
Files without NaNs exist in the same world as rainbow-breathing unicorns that eat glitter and shit out stars. The NaNs class is for the rest of us stuck in this world replete with Not-A-Number.
Check if an array contains NaNs.
boolean containsNaNs = NaNs.containsNaNs(x);
Remove all NaNs from an array.
float[] xNoNaN = NaNs.eliminate(x);
Get a new array with the NaN elements in the original data set replaced with the value specified by v.
float[] z = NaNs.replaceNewWith(x,v);
Replace the NaN values in the original array with v.
NaNs.replaceOriginalWith(x,v);
Get indices in the array corresponding to NaN values.
int[] indices = NaNs.getNaNPositions(x);
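If you ever need to roll your own NaN handling, the one thing to remember is that NaN never compares equal to anything, itself included, so you have to test with Float.isNaN. Here's a plain-Java sketch; NaNSketch and its method names are hypothetical and not papaya's code.

```java
import java.util.Arrays;

// Hypothetical stand-alone illustration; not part of papaya.
class NaNSketch {
    static boolean containsNaNs(float[] x) {
        for (float v : x) {
            // (v == Float.NaN) is always false, so Float.isNaN is required.
            if (Float.isNaN(v)) return true;
        }
        return false;
    }

    // Returns a new, shorter array with the NaN elements dropped.
    static float[] eliminate(float[] x) {
        float[] tmp = new float[x.length];
        int n = 0;
        for (float v : x) {
            if (!Float.isNaN(v)) tmp[n++] = v;
        }
        return Arrays.copyOf(tmp, n);
    }

    // Returns the indices at which x holds a NaN.
    static int[] nanPositions(float[] x) {
        int[] tmp = new int[x.length];
        int n = 0;
        for (int i = 0; i < x.length; i++) {
            if (Float.isNaN(x[i])) tmp[n++] = i;
        }
        return Arrays.copyOf(tmp, n);
    }

    public static void main(String[] args) {
        float[] x = {1f, Float.NaN, 3f, Float.NaN};
        System.out.println(containsNaNs(x));
        System.out.println(Arrays.toString(eliminate(x)));
        System.out.println(Arrays.toString(nanPositions(x)));
    }
}
```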
Polynomial
The Polynomial class contains methods for evaluating polynomial equations.
y = polyval(x,coeff) returns the value of a polynomial of degree n evaluated at x. The input argument coeff is an array of length n+1 whose elements are the coefficients in descending powers of the polynomial to be evaluated.
y = coeff[0]*x^n + coeff[1]*x^(n-1) + .... + coeff[n-1]*x + coeff[n]
E.g. for a quadratic function,
y = coeff[0]*x^2 + coeff[1]*x + coeff[2].
The class accepts both float and double types.
//array of values of a polynomial of degree n evaluated at each element of x.
float[] y = polyval(x, coeff);
//value of the polynomial at a single point x0.
float y0 = polyval(x0, coeff);
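A common way to evaluate a polynomial with coefficients in descending powers is Horner's method, which needs only one multiply and one add per coefficient. Here's a plain-Java sketch of it; HornerSketch is a hypothetical class, and I'm not claiming this is how papaya's polyval is written internally.

```java
import java.util.Arrays;

// Hypothetical stand-alone illustration; not part of papaya.
class HornerSketch {
    // Evaluate coeff[0]*x^n + coeff[1]*x^(n-1) + ... + coeff[n] at a single x
    // using Horner's method: (((c0*x + c1)*x + c2)*x + ...).
    static double polyval(double x, double[] coeff) {
        double y = 0;
        for (double c : coeff) {
            y = y * x + c;
        }
        return y;
    }

    // Evaluate the same polynomial at each element of x.
    static double[] polyval(double[] x, double[] coeff) {
        double[] y = new double[x.length];
        for (int i = 0; i < x.length; i++) {
            y[i] = polyval(x[i], coeff);
        }
        return y;
    }

    public static void main(String[] args) {
        double[] coeff = {1, 2, 3}; // x^2 + 2x + 3
        System.out.println(polyval(2.0, coeff));
        System.out.println(Arrays.toString(polyval(new double[] {0, 1, 2}, coeff)));
    }
}
```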
Visuals
The Visuals class stores most of the general plotting methods like writing the x and y labels, drawing the tick-marks, and coloring in the background. The BoxPlot, CorrelationPlot, ScatterPlot, and SubPlot classes actually take in data and plot it. These classes were all built to enable quick visualization rather than maximal user-control (and also, to make my life a little easier when creating the examples in the example folder). Think of these more as classes for getting a quick snapshot of the data, so you can figure out what's important and what's not. For greater control, check out the gicentreUtils Processing library, or even the Java JFreeChart package. Or use the source code in the folder as a template for your soon-to-be-out-of-this-world-awesome-graphic. :)
I mean, seriously. You can do better than the default.
FAQ
What's the difference between this library and the Apache Commons Math (or any other larger Math/Statistics) library out there?
The short answer is that it's a lot more lightweight. What I mean by that is that most of the classes in papaya don't store any variables but rather consist of collections of methods for computing and spitting out whatever it is they're supposed to compute and spit out. This eliminates unnecessary memory storage for variables that you're not interested in storing and/or creating additional classes when primitives suffice. Of course, lightweight is also code for "has far fewer methods". :) I've tried to include most of what can be considered core methods, though. For computations involving higher-level statistics (e.g. manova as opposed to anova, etc), consider using R.
papaya was also written specifically for Processing, and the way that most people use Processing. Hence, most of the methods take in float arrays or matrices instead of doubles, DoubleArrayList, RealMatrix, BlockMatrix, or one of the many other matrix-related classes available in other more substantial packages. This means that you don't have to rewrite your existing Processing code to use (for example) RealMatrices, DoubleArrayLists, or doubles even.
Doesn't this sacrifice accuracy?
Not really. At least as far as I can tell. For methods where accuracy is key (for example in inverting matrices, or computing probability distributions), the internal computations are actually done using doubles and the results either output as doubles or, in some cases, cast back to floats. This overcomes most of the accuracy issues...although I didn't test this against really large data arrays. Of course, if you suspect that your computations are verging on overflow/underflow error limits, then you might want to practice extra caution.
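To see why doing the internal arithmetic in doubles matters even when the inputs are floats, here's a small plain-Java experiment that sums the same float array with a float accumulator and with a double accumulator. PrecisionSketch is a hypothetical stand-alone class made up for this illustration; it is not taken from papaya.

```java
import java.util.Arrays;

// Hypothetical stand-alone illustration; not part of papaya.
class PrecisionSketch {
    // Accumulate in float: rounding error builds up as the running sum grows.
    static float sumInFloat(float[] x) {
        float s = 0f;
        for (float v : x) s += v;
        return s;
    }

    // Accumulate in double: same float inputs, far less accumulated error.
    static double sumInDouble(float[] x) {
        double s = 0.0;
        for (float v : x) s += v;
        return s;
    }

    public static void main(String[] args) {
        float[] x = new float[1_000_000];
        Arrays.fill(x, 0.1f);
        // The exact answer is close to 100000; compare how far each sum drifts from it.
        System.out.println("float accumulator:  " + sumInFloat(x));
        System.out.println("double accumulator: " + sumInDouble(x));
    }
}
```

On a million copies of 0.1f, the float accumulator drifts away from 100000 by hundreds, while the double accumulator stays within a small fraction of a unit.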
My dataset has NaN elements. Can I use the papaya library anyway?
Yes, and here's how.
In the setup() portion of your sketch, include this line of code: println("DATASET HAS NANS!!!");
Now, every time you run your sketch, this will print to the screen and you will be reminded of the fact that your dataset has NaNs and needs to be pre-processed (using, for example, the papaya NaNs class) prior to doing anything.
This turned out to be a much more efficient way of doing things than having each function check each element of the input for NaNs individually before performing any computations. It was also more efficient (for me at least) than creating two different functions - one for arrays with NaNs and another for arrays without.
What's your favorite statistics joke?
Give me a moment...
Issues/Bugs
Right now, the quickest method is to just send a gmail to adilapapaya.