good and efficient linear algebra for real and complex data, sparse matrices and tensors, and mixed data types, in addition to BLAS, LAPACK, and ARPACK (for real and complex eigenvalues and eigenvectors and for singular values of large sparse matrices). I also developed many algorithms myself, such as quite powerful methods for nonnegative matrix factorization, iterative methods for the solution of large sparse linear systems, and total least squares (DTLS and PTLS by Van Huffel). Together with Robert Hartwig of NC State University I developed a modification of Aasen's algorithm for the solution of linear systems with large, sparse, positive semidefinite (symmetric and possibly singular) coefficient matrices, which works well in applications where iterative methods may have convergence problems.
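The nonnegative matrix factorization methods in CMAT are the author's own; purely as a generic illustration of the technique itself, here is a minimal Python sketch of the classic Lee-Seung multiplicative updates (the function name and defaults are hypothetical, and this is not CMAT's more powerful algorithm):

    import numpy as np

    def nmf_multiplicative(A, k, iters=500, eps=1e-9, seed=0):
        """Approximate a nonnegative A (m x n) by W @ H with W (m x k), H (k x n).

        Classic Lee-Seung multiplicative updates; a generic sketch only,
        not CMAT's algorithm.
        """
        rng = np.random.default_rng(seed)
        m, n = A.shape
        W = rng.random((m, k))
        H = rng.random((k, n))
        for _ in range(iters):
            H *= (W.T @ A) / (W.T @ W @ H + eps)   # update H, stays nonnegative
            W *= (A @ H.T) / (W @ H @ H.T + eps)   # update W, stays nonnegative
        return W, H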
linear, quadratic, and general nonlinear optimization, unconstrained and constrained, for smooth and nonsmooth objective functions. In addition to many algorithms which I wrote myself, similar to those I developed for SAS PROC NLP (e.g. conjugate gradient, (limited-memory) quasi-Newton, Newton-Raphson, trust region, Levenberg-Marquardt), I included many algorithms given to me by other scientists: for example by Stephen Wright (PCx), many from Mike Powell at Cambridge University, especially his derivative-free methods (COBYLA, UOBYQA, BOBYQA, NEWUOA) and his method for linearly constrained problems (LINCOA), some by Kaj Madsen (who obtained his PhD with Mike Powell) from the Technical University of Denmark (linear L1, L-infinity, and minimax estimation), and some from Mustafa Pinar (nonlinear L1 and L-infinity estimation). Thanks also to Richard Brent. For nonsmooth problems I also implemented the reliable Nelder-Mead method and the unconstrained and boundary-constrained bundle trust region methods (developed by the late Prof. Zowe and his group from Bayreuth).
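As a small, hedged illustration of the kinds of derivative-free solvers named above (using SciPy's implementations, not CMAT's own):

    import numpy as np
    from scipy.optimize import minimize

    # Rosenbrock function: a standard smooth test problem.
    rosen = lambda x: (1 - x[0])**2 + 100 * (x[1] - x[0]**2)**2

    x0 = np.array([-1.2, 1.0])

    # Derivative-free simplex search (Nelder-Mead).
    r1 = minimize(rosen, x0, method='Nelder-Mead')

    # Powell's COBYLA handles inequality constraints without derivatives;
    # here: stay inside the unit disk.
    cons = [{'type': 'ineq', 'fun': lambda x: 1.0 - x[0]**2 - x[1]**2}]
    r2 = minimize(rosen, x0, method='COBYLA', constraints=cons)

    print(r1.x, r2.x)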
some well-designed algorithms for basic statistical problems, especially for predictive modeling, including some that are not so easily available elsewhere. With the permission of the original author, Don Hedeker, I implemented his MIXOR, MIXNOM, and MIXREGV algorithms so that they work with my own optimization algorithms (but I was not able to make them significantly faster). The robust methods LMS, LTS, MVE, and MCD, which I had developed (with the help of Peter Rousseeuw) for SAS Institute, I reprogrammed for CMAT, and I added Hotelling's test (maybe I should add more). Logistic regression and general GLIMs, as well as LARS, SCAD, Huber, and very general nonlinear regression were built around my own optimization methods. Also included are some less well-known regression methods, such as orthogonal regression (errors in variables), bidimensional regression for the comparison of two-dimensional maps (for face recognition), univariate Deming regression, etc.
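Univariate Deming regression, mentioned above, has a simple closed form; the following is the standard textbook solution as a sketch (variable names are mine, not CMAT code), assuming the ratio delta of the y-error variance to the x-error variance is known:

    import numpy as np

    def deming(x, y, delta=1.0):
        """Univariate Deming regression y ~ b0 + b1*x.

        delta = var(error in y) / var(error in x); delta = 1 gives
        orthogonal regression. Standard closed-form solution (sketch);
        assumes the sample covariance of x and y is nonzero.
        """
        x, y = np.asarray(x, float), np.asarray(y, float)
        mx, my = x.mean(), y.mean()
        sxx = ((x - mx)**2).mean()
        syy = ((y - my)**2).mean()
        sxy = ((x - mx) * (y - my)).mean()
        b1 = (syy - delta * sxx
              + np.sqrt((syy - delta * sxx)**2 + 4 * delta * sxy**2)) / (2 * sxy)
        b0 = my - b1 * mx
        return b0, b1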
many methods for data mining, covering a large part of the functionality of SAS Enterprise Miner; a longer paper would be needed to describe them. The so-called R2 node in SAS EM is basically a stepwise forward linear regression algorithm which is severely restricted in the number p of variables by its O(p^2) memory requirement. My algorithm for this problem needs only O(p) memory and can therefore solve problems with hundreds of thousands of variables, such as those with microarray data. Hybrid neural networks and support vector machines (thanks to Olvi Mangasarian and Kristin Bennett) are similar to those I developed for SAS Institute.
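The O(p) memory claim rests on never forming the p x p cross-product matrix: each step needs only one pass over the candidate columns to score them against the current residual. A minimal sketch of that general idea (my own illustration, not the actual CMAT or SAS code):

    import numpy as np

    def forward_stepwise(X, y, max_terms=10):
        """Greedy forward selection scoring each candidate column against
        the current residual. Memory per step stays O(p) (one score per
        column); the p x p covariance matrix is never formed. Assumes
        the columns of X are standardized so the scores rank like
        correlations. Illustrative sketch only.
        """
        n, p = X.shape
        selected = []
        resid = y - y.mean()
        for _ in range(max_terms):
            # One O(n*p)-time, O(p)-memory pass over the columns.
            scores = np.abs(X.T @ resid)
            scores[selected] = -np.inf          # exclude already chosen columns
            selected.append(int(np.argmax(scores)))
            # Refit on the selected columns only (small least squares).
            Xs = X[:, selected]
            beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
            resid = y - Xs @ beta
        return selected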
almost every feature available in other programs like AMOS or LISREL is implemented, except the full-information ML treatment of missing values. Polychoric correlations, multiple-sample solutions, and robust (Satorra-Bentler) chi-square statistics and confidence intervals are available, with the help of Albert Maydeu-Olivares and the late Rod McDonald. In addition, I developed a very powerful algorithm for finding the zero patterns of CFA solutions with large p-values, which works for up to about 40 variables. Many rotation methods for factor loadings and principal components, including confidence intervals of the rotated loadings, are available (Jennrich and Ogasawara). There are also some well-designed functions for IRT and MDS (thanks to Jim Ramsay for MULTISCALE) in CMAT.
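Among the rotation methods, varimax is the best known; the following is the standard SVD-based varimax iteration as a textbook sketch (it does not include the confidence intervals of the rotated loadings that CMAT provides):

    import numpy as np

    def varimax(L, gamma=1.0, iters=100, tol=1e-8):
        """Orthogonal varimax rotation of a loading matrix L (p x k).

        Standard SVD-based iteration; returns the rotated loadings and
        the rotation matrix. Textbook sketch, not CMAT's implementation.
        """
        p, k = L.shape
        R = np.eye(k)
        d_old = 0.0
        for _ in range(iters):
            B = L @ R
            U, s, Vt = np.linalg.svd(
                L.T @ (B**3 - (gamma / p) * B @ np.diag((B**2).sum(axis=0))))
            R = U @ Vt
            d = s.sum()
            if d_old != 0.0 and d < d_old * (1 + tol):
                break                      # criterion no longer improving
            d_old = d
        return L @ R, R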
many methods for the encryption of string data (like passwords), of files, and of entire directories of files. The method I developed myself is based on an unpublished algorithm and permits the specification of long passwords, which should be stored externally (thanks to George Marsaglia).
a good selection of CDFs, PDFs, and random number generators: for the uniform distribution, methods by L'Ecuyer and Marsaglia, the Mersenne Twister, AES, RANLUX, etc., and many for other distributions, like Marsaglia's ziggurat method. I enjoyed working with George Marsaglia very much when reviewing his papers for JSS. For the CDFs of the multivariate normal and t distributions I added code developed by Alan Genz and Frank Bretz, which works well up to about 100 dimensions.
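For the multivariate normal CDF, SciPy ships a routine from the same Genz line of work, which can serve as a point of comparison (whether it matches the exact Genz-Bretz code in CMAT is an assumption on my part):

    import numpy as np
    from scipy.stats import multivariate_normal

    # Trivariate orthant probability P(X1 <= 0, X2 <= 0, X3 <= 0) with
    # an equicorrelated covariance matrix (rho = 0.5).
    rho = 0.5
    cov = np.full((3, 3), rho) + (1 - rho) * np.eye(3)
    p = multivariate_normal(mean=np.zeros(3), cov=cov).cdf(np.zeros(3))
    print(p)  # known result: 1/(n+1) = 0.25 for rho = 0.5, n = 3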
Run times in seconds for example scripts in cmat/test on three PCs (PC1, PC2, PC3; "???" and "-" mark unknown and not-run cases):

    Example (cmat/test)    PC1     PC2     PC3
    tborut2.inp           2812    4785    6605
    tlrallv2.inp          3418    6466   12874
    tlrallv3.inp          1853    2752    9917
    tlrallv4.inp          1450    2253    8027
    tmicro2.inp           2736    5627    3576
    tmixrpana.inp          345    1382     ???
    tmixrpan2.inp          342    1386    4839
    trand2.inp            1799    9443    4532
    tranfor2.inp          2389    3891   24425
    tsvm12.inp            1156    3283    7570
    tsvm13.inp            1542    2781       -
    tsvm23.inp            1162    5123       -
    tsvm24.inp           18349  100491       -
    tsmp3.inp            19262   81609       -