CHAPTER 5

Data Reduction

 

This part of data processing is often called scaling in monochromatic methods, where it refers to frame-to-frame scaling and determination of relative temperature factors.  In Laue diffraction, a far more important and complex task, wavelength normalization, plays the major role.  Wavelength normalization results in a wavelength-dependent function often called the λ-curve.  When converted to an energy-dependent function, this curve looks much like the X-ray spectrum of the incident beam.  It is in fact an overall correction for wavelength dependency, including absorption correction (see 5.2.4 for details).

 

Each Laue spot was usually associated with one wavelength, called a single, or with several discrete wavelengths, called a multiple, until Ren et al. (J. Synchrotron Rad. 6, 891-917, 1999) pointed out that this is a gross oversimplification.

 

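Δλ/λ ∝ η / tan θ ≈ η / θ, where Δλ/λ is the relative bandwidth of a Laue reflection, η is the crystal mosaicity, and θ is the Bragg angle.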

 

This equation, reformulated from Ren et al., essentially states that the relative bandwidth of a Laue reflection is proportional to the crystal mosaicity and inversely proportional to the tangent of the Bragg angle, which is approximately the angle itself in nearly all protein cases.  A Laue spot is stimulated by X-rays within an energy band several hundred eV wide, and can often become a partial reflection.  This is particularly true at low Bragg angles.  A partial reflection in monochromatic oscillation occurs because the oscillation range is limited to a small angle of 1° or so.  Such a partial is called an angular partial.  A Laue partial occurs because the available energy range is insufficient, so it is called an energy partial.  Taking into account the energy band of each Laue reflection, together with energy partials, improves the merging of data from different Bragg angles.
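
The relation described above can be made concrete with a short calculation.  The Python sketch below assumes a proportionality constant of 1, that is, a relative bandwidth equal to the mosaicity (in radians) divided by tan θ.  The function names and the sample values (0.1° mosaicity, 1.1 Å wavelength) are illustrative assumptions and are not taken from Epinorm.

```python
import math

HC_KEV_ANGSTROM = 12.398  # photon energy (keV) times wavelength (Å)


def relative_bandwidth(mosaicity_deg, bragg_angle_deg):
    """Approximate relative bandwidth, assuming Δλ/λ = η / tan θ."""
    eta = math.radians(mosaicity_deg)
    theta = math.radians(bragg_angle_deg)
    return eta / math.tan(theta)


def energy_band_ev(wavelength_angstrom, mosaicity_deg, bragg_angle_deg):
    """Width of the energy band, in eV, that stimulates one reflection."""
    energy_ev = 1000.0 * HC_KEV_ANGSTROM / wavelength_angstrom
    # |ΔE/E| equals |Δλ/λ| because E is inversely proportional to λ.
    return energy_ev * relative_bandwidth(mosaicity_deg, bragg_angle_deg)


if __name__ == "__main__":
    for theta_deg in (2.0, 5.0, 10.0, 20.0):
        band = energy_band_ev(1.1, 0.1, theta_deg)
        print(f"theta = {theta_deg:5.1f} deg   band = {band:7.1f} eV")
```

Under these assumptions the band widens as the Bragg angle decreases, which is consistent with energy partials being most common at low Bragg angles.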

 

A separate program, Epinorm (energy partial improved normalization), carries out all the functions described above.  What Epinorm really does is data reduction from integrated intensities to structure factor amplitudes.

 

5.1 Input Data and Control Parameters

 

An example command script for scaling is shown below.

 

diagnostic    off

busy          off

warning       off

prompt        off

result        off

@ m37v3_8us_002.mar3450.inp

@ m37v3_8us_004.mar3450.inp

@ m37v3_8us_062.mar3450.inp

prompt        on

result        on

 

Input

   Image      initial.lam

   Resolution 2.1 100

   Wavelength 1 1.5 1.1

   Chebyshev  64 unimodal

   Spot       8 6

   Quit

 

Scale   3 2 1 m37v3_8us_002.mar3450.ii

Lambda        refined.lam

Apply     1   m37v3_8us.hkl

Stop

   Yes

 

Listing 5.1.0.0.1 Command script to run Epinorm.

 

The command script of Epinorm shown in Listing 5.1.0.0.1 is very similar to that of Precognition.  Misusing a script will generate an error message.  Just like the script for integration, this one first loads the .inp files of the frames to be scaled together.  Bad frames should be excluded from this list.  The global printing switches prompt and result help make the log file more concise.

 

The Input section is optional.  An initial λ-curve can be loaded here.  See Chapter 4 for the format.  If no λ-curve is loaded, a straight line is assumed.  In this case, you should specify a wavelength range; otherwise the default of 0.5 to 1 Å is used.  If an initial λ-curve is loaded, you may still change the wavelength range with the Wavelength command.  If part of the new wavelength range falls outside the λ-curve, 0 intensity is assumed there.  The Wavelength command can take a third, optional number as the reference wavelength.  Note, however, that an explicit wavelength range and reference take effect only if this command comes after the command that loads the initial λ-curve.  If no reference is given, it is determined automatically from the initial λ-curve: the shortest wavelength corresponding to the greatest intensity value is chosen as the reference.
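
The two rules stated above (the automatic choice of reference wavelength, and 0 intensity outside the loaded curve) can be sketched in Python as follows.  The tabulated-curve representation, the linear interpolation between points, and the function names are assumptions made for illustration; the source does not describe how Epinorm stores or interpolates the curve.

```python
def reference_wavelength(wavelengths, intensities):
    """Return the shortest wavelength at which the curve reaches its maximum."""
    peak = max(intensities)
    return min(w for w, i in zip(wavelengths, intensities) if i == peak)


def curve_value(wavelength, wavelengths, intensities):
    """Interpolate the curve linearly; return 0 outside its tabulated range.

    `wavelengths` is assumed to be strictly increasing.
    """
    if wavelength < wavelengths[0] or wavelength > wavelengths[-1]:
        return 0.0
    points = list(zip(wavelengths, intensities))
    for (w0, i0), (w1, i1) in zip(points, points[1:]):
        if w0 <= wavelength <= w1:
            return i0 + (i1 - i0) * (wavelength - w0) / (w1 - w0)
    return intensities[-1]  # single-point curve


if __name__ == "__main__":
    wl = [1.0, 1.1, 1.2, 1.3, 1.4, 1.5]
    inten = [0.2, 1.0, 1.0, 0.6, 0.3, 0.1]  # made-up values
    print(reference_wavelength(wl, inten))
    print(curve_value(0.9, wl, inten))
```

In this made-up example the maximum occurs at both 1.1 and 1.2 Å, and the shorter of the two is returned as the reference.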

 

The command Resolution specifies a resolution range within which data are loaded for scaling.  In most cases, this command is unnecessary, so that data at all resolutions are loaded.  If you explicitly specify a resolution range, use the integration range.  This command is only useful in special cases, and will be explained elsewhere.

 

The command Chebyshev specifies the maximum order of the Chebyshev polynomials.  If it is not specified, the program may set a default depending on the complexity of the initial λ-curve.  If a second integer less than or equal to the first is given, that many Chebyshev terms at the higher degrees are allowed to take frame-specific values, that is, each frame may have its own λ-curve.  If the second integer is missing, it defaults to 0, that is, all frames share a single λ-curve.  The second integer cannot be greater than the first; if it is, it is ignored and a warning message is issued.
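
As a minimal sketch of how a Chebyshev series with shared and frame-specific terms could be evaluated, consider the Python code below.  It assumes that the wavelength range is mapped linearly onto [-1, 1] and that frame-specific values simply replace the highest-degree shared coefficients; both points, and all names, are assumptions for illustration rather than a description of Epinorm's internal parameterization.

```python
def chebyshev_series(coefficients, x):
    """Evaluate sum_k c_k T_k(x) on [-1, 1] with the Clenshaw recurrence."""
    b1 = b2 = 0.0
    for c in reversed(coefficients[1:]):
        b1, b2 = 2.0 * x * b1 - b2 + c, b1
    return coefficients[0] + x * b1 - b2


def lambda_curve(wavelength, wl_min, wl_max, shared, frame_specific=()):
    """Evaluate the curve of one frame.

    `shared` holds the coefficients common to all frames.  `frame_specific`
    replaces the highest-degree terms for this frame (hypothetical layout).
    """
    if len(frame_specific) > len(shared):
        raise ValueError("frame-specific terms cannot exceed the total order")
    x = 2.0 * (wavelength - wl_min) / (wl_max - wl_min) - 1.0
    keep = len(shared) - len(frame_specific)
    return chebyshev_series(list(shared[:keep]) + list(frame_specific), x)


if __name__ == "__main__":
    shared = [1.0, 0.5, -0.2]
    print(lambda_curve(1.25, 1.0, 1.5, shared))           # all frames alike
    print(lambda_curve(1.25, 1.0, 1.5, shared, [-0.1]))   # one frame differs
```

With an empty `frame_specific` sequence every frame evaluates the same curve, which corresponds to the default of 0 for the second integer.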

 

An optional string argument, one of unimodal, bimodal, arbitrary, free, or fix, can be given.  The first three choices hint to the program to find a unimodal, bimodal, or arbitrary spectrum, respectively.  If no string argument is given, the program will try to remove some spiky features at both ends of the derived spectrum.  An explicit argument arbitrary forces the program to leave all spikes unmodified.  See Figure 5.1.0.0.1 for an example of spike removal.  The option fix signals the program not to refine the spectrum, and free reverses this.

Figure 5.1.0.0.1 λ-curves of 128-term Chebyshev approximation derived by Epinorm.  The dotted and solid lines are before and after spike removal at both ends, respectively.

 

The command Spot with two numerical arguments initializes crystal mosaicity.  The streakier the spots, the larger the mosaicity.  If no Spot is given, the default mosaicity is 0.  See 5.2.3 for more.  This command also prevents the program from restoring overall mosaicity from a saved parameter file.  See 5.3 for details.

 

If heavy atoms are present in your crystal and you wish to examine their anomalous scattering signal, an additional control command Anomalous can be used in the Input section (not shown in Listing 5.1.0.0.1).  This command toggles a flag that signals whether anomalous scattering should be considered during data reduction.  The default state is off, which fits most cases.  This is the very first point in the entire process where an explicit option can be given if anomalous scattering is a concern.  Implicitly, however, consistent indexing of all frames in a dataset from the very beginning of data processing is crucial to extracting anomalous signal.  One must take great care to ensure such consistency by all possible means provided in Chapters 2 and 3.  If re-indexing cannot be avoided, specify a desired orientation matrix prior to re-indexing as described in 3.8.  Switching on this flag signals the program to separate Friedel pairs, so that each member of a Friedel pair is considered independent of, instead of equivalent to, the other.  The Rmerge values calculated later will not include discrepancies between Friedel pairs (see 5.2.6).  It is also possible to delay the switch until after scaling and before merging of redundant and equivalent data.  See 5.4 for details.  Note that these two alternatives reflect different strategies for handling anomalous signal.  The former preserves the maximum amount of anomalous signal, but may misidentify some systematic errors as anomalous signal.  The latter guards against possible systematic errors, but may unknowingly attenuate some real anomalous signal.  The choice is left to the user.

 

The command Anomalous in Input section may take an optional string argument on or off.  If no argument or no recognizable one is given, the command negates the current state.

 

5.2 Data Selection and Parameter Fitting

 

The main command is Scale.  It may take three numeric arguments and a string argument.  All these arguments are optional.

 

This command in the current release does not enter a submenu, but this will change in future releases if scaling becomes more complex or has more options.

 

5.2.1 Data selection

 

The first number specifies a σ-cut.  3 is the default.  If I/σ(I) is less than this value, the integrated intensity will not be used in scaling; however, this does not mean that the data point is lost forever.  The minimization process does not require all available data points.  You may watch the reported data-to-parameter ratio during scaling.  If this ratio reaches a few hundred, there should be enough data points to over-determine the parameters.  The σ-cut must be greater than or equal to 0.  A σ-cut of 0 means that all positive, but not 0, integrated intensities will take part in scaling.  The σ-cut is the only control through which the user can intervene in data rejection.  Other data rejection criteria are automatic.  See Listing 5.2.6.0.1 and the text below it.

 

Another way to control data selection is to specify the number of data points loaded from each frame.  If the first numeric argument is equal to or greater than 100, it is no longer treated as a σ-cut, but as the number of data points per frame.  If you have tens of frames to scale at once, a few hundred data points per frame would be sufficient.  If you only scale a few images, you may need more.  Limiting the data points per frame usually makes the program run faster, but it opens the possibility of insufficient data.  The data-to-parameter ratio is not the only thing to consider here.  The distribution of data as a function of resolution and wavelength is more important.  Using only the strongest, and therefore insufficient, data may result in no representation at high resolution or in the two wings of the spectrum.  This could cause arbitrary temperature factors and a noisy λ-curve.
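
The two interpretations of the first numeric argument can be sketched as follows.  This Python example is a simplified illustration under stated assumptions: observations are (intensity, sigma) pairs, "strongest" is taken to mean the largest I/σ(I), and the function name is hypothetical.  The automatic rejection criteria that Epinorm applies in addition are not modeled.

```python
def select_for_scaling(frame, first_argument=3.0):
    """Select the observations of one frame that take part in scaling.

    `frame` is a list of (intensity, sigma) pairs.  A value below 100 is
    treated as a sigma-cut; a value of 100 or more as a per-frame count.
    """
    if first_argument < 0:
        raise ValueError("sigma-cut must be greater than or equal to 0")
    # Only positive intensities can take part, even with a sigma-cut of 0.
    usable = [(i, s) for i, s in frame if i > 0 and s > 0]
    if first_argument >= 100:
        # Assumption: "strongest" means largest I/sigma(I).
        ranked = sorted(usable, key=lambda obs: obs[0] / obs[1], reverse=True)
        return ranked[:int(first_argument)]
    return [(i, s) for i, s in usable if i / s >= first_argument]


if __name__ == "__main__":
    frame = [(100.0, 10.0), (20.0, 10.0), (0.0, 5.0), (-5.0, 5.0), (50.0, 10.0)]
    print(select_for_scaling(frame, 3))    # sigma-cut of 3
    print(select_for_scaling(frame, 0))    # all positive intensities
    print(select_for_scaling(frame, 100))  # up to 100 strongest points
```

Note that a per-frame count keeps only the strongest points, so weak high-resolution data and the wings of the spectrum may go unrepresented, as the text above warns.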

 

5.2.2 Data isotropy

 

The second number can be -1, 0, 1, or 2.  0 is the default, which indicates isotropic scale factors and temperature factors only.  -1 indicates isotropic scale factors only; all temperature factors will be kept as initialized.  1 indicates that anisotropic but linear scale factors and temperature factors can be used.

 

, and

,

 

are scale factor and temperature factor, respectively.

 

2 indicates that nonlinear anisotropic scale factors and temperature factors are allowed.

 

, and

.

 

Anisotropic factors in general help minimize local errors, but they may be refined to some unreasonably large values, if there are not enough data to restrain them.  Use them judiciously.

 

The string argument specifies a reference frame.  The isotropic scale factor a0 of this frame is fixed at 1.  All other factors a’s and b’s of this frame will be fixed as initialized.  If no reference is specified, the first frame is assumed to be the reference.

 

The program initializes all a’s and b’s to 0, except the a0’s, which are initialized to 1.  The second numerical argument and the string argument to the command Scale act as selectors on the initialized values, but they do not reset these factors.  Therefore, these factors can also be initialized to other, user-specified values, and the arguments to the command Scale choose which of them to fix and which to free.  See 5.3 on user-initialized factors.

 

5.2.3 Crystal mosaicity

 

The third numerical argument can be 0, 1, or 2.  0 is the default, which indicates that crystal mosaicity will be fixed as initialized.  1 indicates that an overall mosaicity can be refined, and 2 means frame-by-frame mosaicity.  Combining this option with the Spot command in the Input section covers all the possibilities.
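
To summarize the arguments of Scale described in 5.2.1 through 5.2.3, the Python sketch below parses a Scale line into its three numeric arguments and one string argument, applying the defaults stated in the text.  The parser, its class and function names, and its error handling are hypothetical; it is not Epinorm's own command interpreter.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ScaleOptions:
    first: float = 3.0           # sigma-cut, or points per frame if >= 100
    isotropy: int = 0            # -1, 0, 1 or 2 (see 5.2.2)
    mosaicity: int = 0           # 0 fixed, 1 overall, 2 frame-by-frame
    text: Optional[str] = None   # string argument, e.g. a reference frame


def parse_scale(line):
    """Parse a Scale command line into ScaleOptions (hypothetical parser)."""
    tokens = line.split("#", 1)[0].split()   # drop a trailing comment
    if not tokens or tokens[0] != "Scale":
        raise ValueError("not a Scale command")
    options = ScaleOptions()
    numbers = []
    for token in tokens[1:]:
        try:
            numbers.append(float(token))
        except ValueError:
            options.text = token              # the non-numeric argument
    if len(numbers) > 3:
        raise ValueError("Scale takes at most three numeric arguments")
    if len(numbers) > 0:
        options.first = numbers[0]
    if len(numbers) > 1:
        options.isotropy = int(numbers[1])
    if len(numbers) > 2:
        options.mosaicity = int(numbers[2])
    if (options.first < 0 or options.isotropy not in (-1, 0, 1, 2)
            or options.mosaicity not in (0, 1, 2)):
        raise ValueError("argument out of range")
    return options


if __name__ == "__main__":
    print(parse_scale("Scale   3 2 1 m37v3_8us_002.mar3450.ii"))
    print(parse_scale("Scale"))
```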

 

5.2.4 Absorption correction

 

Box 5.2.4.0.1 Absorption Coefficients

Light intensity after an absorbing material is reduced to a fraction of the original intensity before the material.

I = I0 exp(-μp),

where μ is the linear absorption coefficient, a function of the light wavelength, and p is the path length through the absorbing material.  A path through a longer material is equivalent to a path through a denser one.  More often, the mass absorption coefficient μ/d is used.

I = I0 exp[-(μ/d)dp],

where d is the density of the material.  Mass absorption coefficients of various elements and materials can be found at http://physics.nist.gov/PhysRefData/XrayMassCoef.
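
The attenuation law in the box can be evaluated directly.  The Python sketch below is only an illustration: the function name is hypothetical, and the numerical inputs are placeholders chosen for easy arithmetic, not tabulated NIST values.

```python
import math


def transmitted_fraction(mass_absorption_cm2_per_g, density_g_per_cm3, path_cm):
    """Return I/I0 = exp[-(μ/d) d p] for one homogeneous material."""
    # Linear absorption coefficient μ in 1/cm.
    linear_coefficient = mass_absorption_cm2_per_g * density_g_per_cm3
    return math.exp(-linear_coefficient * path_cm)


if __name__ == "__main__":
    # Placeholder inputs: μ/d = 5 cm^2/g, d = 1 g/cm^3, p = 0.1 cm.
    print(transmitted_fraction(5.0, 1.0, 0.1))
    # Doubling the path or doubling the density has the same effect.
    print(transmitted_fraction(5.0, 1.0, 0.2))
    print(transmitted_fraction(5.0, 2.0, 0.1))
```
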

It is obvious that a λ-curve obtained from wavelength normalization accounts for the total effect of the source spectrum and all absorption by optical elements in the incident beam before the sample crystal, including obstacles such as air and the front wall of the sample capillary.  It is therefore more appropriate to call this wavelength-dependent correction a λ-curve than a spectrum.  It is less obvious that a λ-curve also corrects the overall effect of absorption by elements in the crystal environment, for example the sample crystal itself, surrounding liquid, flow cell, cryoloop, capillary, diamond anvil cell, gasket, air, front layer of the detector, etc.  Why would a λ-curve, without special consideration, already be capable of correcting a large portion of this seemingly complex absorption?  Consider absorption by only one element, for simplicity of argument.  The absorption correction factor is

 

fA = exp[-μ(λ)p(t)],

 

where μ(λ) is the linear absorption coefficient as a function of wavelength, and p(t) is the path length through the absorbing element as a function of the orientation t of a reflected beam.  The path length can be rewritten as a constant mean path length plus a deviation from the mean as a function of orientation:

 

p(t) = p0 + Δp(t).

 

The absorption correction factor then becomes a product of two parts:

 

fA = exp[-μ(λ)p0] exp[-μ(λ)Δp(t)],

 

where the first part is wavelength-dependent only.  This part is automatically corrected by the λ-curve.  When the range of Δp(t) is smaller than p0, which is often the case, most of the absorption effect has already been taken care of by the λ-curve.  What is left uncorrected is the second, orientation-dependent part.  Therefore, the absorption correction factor can be redefined as:

 

fA = exp[-μ(λ)Δp(t)],

 

in one element case.

 

In general, if a reflected beam at orientation t passes through n types of materials, absorption correction factor can be written as:

 

fA = exp[-Σ μi(λ)Δpi(t)], summed over i = 1, ..., n.

 

In the X-ray wavelength range, mass absorption coefficients are roughly proportional to the squared wavelength, so that a generalized path length P(t) can be defined independently of wavelength:

 

fA = exp[-λ²P(t)].

 

The generalized path length P(t) is a spherical function, or simply a 2-dimensional function in detector space, that includes the variation of path lengths, the densities, and the steepness of the mass absorption coefficients of all materials involved.  In contrast to wavelength normalization, absorption correction focuses on the unevenness across the detector space rather than on wavelength dependency.
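
The factorization used in this section can be checked numerically.  The Python sketch below treats a single material and assumes, as in the text, that its linear absorption coefficient is proportional to the squared wavelength, μ(λ) = kλ².  The constant k, the function names, and the numerical inputs are placeholders introduced for illustration only.

```python
import math


def absorption_factor(wavelength, k, mean_path, deviation):
    """Full factor exp[-μ(λ)(p0 + Δp)], assuming μ(λ) = k λ²."""
    mu = k * wavelength ** 2
    return math.exp(-mu * (mean_path + deviation))


def wavelength_part(wavelength, k, mean_path):
    """exp[-μ(λ) p0]: wavelength-dependent only, absorbed by the λ-curve."""
    return math.exp(-k * wavelength ** 2 * mean_path)


def orientation_part(wavelength, generalized_path):
    """exp[-λ² P(t)], where P(t) = k Δp(t) for a single material."""
    return math.exp(-wavelength ** 2 * generalized_path)


if __name__ == "__main__":
    k, p0, dp = 0.8, 0.5, 0.05   # placeholder values
    for wl in (1.0, 1.25, 1.5):
        full = absorption_factor(wl, k, p0, dp)
        product = wavelength_part(wl, k, p0) * orientation_part(wl, k * dp)
        print(f"{wl:4.2f}  full = {full:.6f}  product = {product:.6f}")
```

The two columns agree for every wavelength, which illustrates why only the orientation-dependent part remains to be corrected once the λ-curve has been determined.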

 

Absorption correction is not yet released in the latest version of Epinorm.

 

5.2.5 Initial scaling

 

If a set of integrated intensities has never been scaled, there is an option to initialize the process in a less error-prone way, although this is not always necessary.  To use the initial cycle, give the string argument initial to the command Scale to request an abbreviated scaling, and follow it with normal scaling later.  See 5.3 for saving and restoring intermediate results.

 

diagnostic    off

busy          off

warning       off

prompt        off

result        off

@ m37v3_8us_002.mar3450.inp

@ m37v3_8us_004.mar3450.inp

@ m37v3_8us_062.mar3450.inp

prompt        on

result        on

 

Input

   Image      initial.lam

   Resolution 2.5 100   # lower resolution

   Wavelength 1 1.5 1.1 # NOTE: use same bandwidth as in normal cycles

   Chebyshev  16        # lower Chebyshev order, no frame-specific lambda-curve

 # Spot       8 6       # comment out for 0 mosaicity

   Quit

 

Scale  3 -1 0 initial   # use strong observations, isotropic scaling only

Lambda        refined.lam

Stop

   Yes

 

Listing 5.2.5.0.1 Command script for an initial scaling cycle.

 

5.2.6 Minimization cycle and statistical report

 

====================================================

Scaling Cycle 4

   Isotropic scale factor

   Overall spectrum

====================================================

Total measurements: 123021

  Accepted        : 115161 93.6108%

  Rejected        : 7860 6.38915%

Data-to-parameter : 1251.75

Maximum iteration : 32

Tolerance         : 0.0001

Chi-square        :   2.8666e+07   3.32024e+07  -4.53632e+06      -13.6626%

R.M.S.D.          :      936.884       1008.29      -71.4082       -7.0821%

Quadratic R-factor:      14.2902%

(Current and previous values, absolute and relative changes)

 

______

|      )_

| Report |

| ------ |

| ------ |

| ------ |

| ----   |

|________|

 

R-model          = 0.125352

Weighted R-model = 0.117885

R-models calculated from

115161 accepted integrated intensities.

These R-factors indicate how well the integrated

intensities are modeled by the current parameter set.

 

         R-merge on F^2 = 0.168612

Weighted R-merge on F^2 = 0.127676

         R-merge on F   = 0.0993019

Weighted R-merge on F   = 0.0804078

R-merges calculated from

115131 accepted integrated intensities of

33883 unique reflections with redundant measurements.

These R-factors indicate how well the symmetry-related

reflections agree with each other.

 

Mean F^2 / sigma(F^2)   = 10.7618

Mean F   / sigma(F)     = 21.4314

Signal-to-noise ratio calculated from

9333 unique reflections with highly redundant measurements.

 

Resolution range (A)   Unique refl.   Mean F^2/sigma(F^2)   Mean F/sigma(F)

____________________   ____________   ___________________   _______________

1000.0000 -   4.7877            235                15.97             31.64

   4.7877 -   3.8000            527                19.34             38.30

   3.8000 -   3.3196            615                17.18             34.07

   3.3196 -   3.0161            601                14.53             28.93

   3.0161 -   2.7999            625                12.49             24.85

   2.7999 -   2.6348            588                11.54             23.11

   2.6348 -   2.5028            590                10.65             21.33

   2.5028 -   2.3938            538                 9.58             19.16

   2.3938 -   2.3017            580                 9.95             19.88

   2.3017 -   2.2223            569                 8.51             17.00

   2.2223 -   2.1528            626                 8.93             17.83

   2.1528 -   2.0912            652                 8.12             16.15

   2.0912 -   2.0362            657                 7.85             15.65

   2.0362 -   1.9865            659                 7.75             15.41

   1.9865 -   1.9413            653                 7.62             15.16

   1.9413 -   1.9000            618                 7.18             14.28

 

File light.inp is overwritten.

 

File m37v_1a_004.mar3450.ii.lam is overwritten.

 

File m37v_1a_006.mar3450.ii.lam is overwritten.

 

 

File m37v_1b_062.mar3450.ii.lam is overwritten.

 

Listing 5.2.6.0.1 Statistics report from each cycle of scaling.

 

The minimization process is scheduled in many cycles.  Each cycle generates a report like the one listed above.  First, a title tells which parameters are refined in this cycle, followed by a section of basic statistics on the data.  Data rejection is done automatically based on several criteria.  A non-redundant measurement is rejected, since it cannot contribute to the refinement.  Data points with large errors are also rejected automatically, but once again, data rejected during scaling may still be included in the final output.

 

, and

,

 

where the summation is over N accepted data points.  χ² and R.M.S.D. measure how well the observed integrated intensities are modeled, but they do not give a relative sense.  Rquadratic, in a more statistical sense, and Rmodel, in a more crystallographic sense, defined below, indicate such a relative residual of fitting:

 

, and

.

 

The Rmerge, also known as Rsymm, measures how well the symmetry-related and redundant measurements agree with each other after the current correction factors are applied.  All these statistics should improve cycle by cycle if the refinement is going well.  However, you may notice that some R-factors increase slightly.  This happens because the newly applied data rejection may accept more data points into the refinement as the process converges.
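
For orientation, the conventional unweighted R-merge can be computed as in the Python sketch below.  The formula used, the sum of absolute deviations from the mean of each unique reflection divided by the sum of all contributing intensities, is the standard textbook definition and is an assumption here; the exact expressions and weights used by Epinorm are not given in this text.  Intensities are assumed to be already corrected and indexed in the asymmetric unit.

```python
from collections import defaultdict


def r_merge(observations):
    """Conventional R-merge over (hkl, intensity) pairs.

    Reflections observed only once are skipped, since they carry no
    information about internal agreement.
    """
    groups = defaultdict(list)
    for hkl, intensity in observations:
        groups[hkl].append(intensity)
    numerator = denominator = 0.0
    for values in groups.values():
        if len(values) < 2:
            continue
        mean = sum(values) / len(values)
        numerator += sum(abs(v - mean) for v in values)
        denominator += sum(values)
    if denominator == 0:
        raise ValueError("no redundant measurements")
    return numerator / denominator


if __name__ == "__main__":
    data = [((1, 0, 0), 90.0), ((1, 0, 0), 110.0), ((0, 1, 0), 50.0)]
    print(r_merge(data))
```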

 

Mean F²/σ(F²) and mean F/σ(F) are meant to be objective measures of the signal-to-noise ratio.  The sample standard deviation σ is calculated from observations that are at least 4-fold redundant, so that σ is a lower bound of the real noise content.
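
One plausible reading of this statistic is sketched below in Python.  It assumes that the ratio of mean to sample standard deviation is formed for each unique reflection with at least 4 observations and then averaged over those reflections; whether Epinorm averages ratios in exactly this way is not stated in the text, and the function name is hypothetical.

```python
import statistics
from collections import defaultdict


def mean_signal_to_noise(observations, minimum_redundancy=4):
    """Average mean/sigma over unique reflections with enough observations.

    `observations` is an iterable of (hkl, value) pairs, where value is
    F or F^2 after all correction factors have been applied.
    """
    groups = defaultdict(list)
    for hkl, value in observations:
        groups[hkl].append(value)
    ratios = []
    for values in groups.values():
        if len(values) < minimum_redundancy:
            continue
        sigma = statistics.stdev(values)   # sample standard deviation (n - 1)
        if sigma > 0:
            ratios.append(statistics.mean(values) / sigma)
    if not ratios:
        raise ValueError("no reflection is redundant enough")
    return statistics.mean(ratios)


if __name__ == "__main__":
    data = [((1, 0, 0), v) for v in (8.0, 10.0, 10.0, 12.0)]
    data += [((0, 1, 0), v) for v in (5.0, 6.0, 7.0)]   # only 3: skipped
    print(mean_signal_to_noise(data))
```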

 

5.2.7 Results

 

Results of the minimization process are reported after the final cycle, as in the listing below.

 

______

|      )_

| Report |

| ------ |

| ------ |

| ------ |

| ----   |

|________|

 

Beam polarization: 0.921758

 

Mean crystal mosaicity (degree): 0

   m37v3_8us_002.mar3450.ii     0.00000

   m37v3_8us_004.mar3450.ii     0.00000

  

   m37v3_8us_062.mar3450.ii     0.00000

 

Isotropic scale factor:

   m37v3_8us_002.mar3450.ii     1.00000

   m37v3_8us_004.mar3450.ii     1.15969

  

   m37v3_8us_062.mar3450.ii     1.43031

 

Isotropic temperature factor:

   m37v3_8us_002.mar3450.ii     0.00000

   m37v3_8us_004.mar3450.ii    -4.58414

  

   m37v3_8us_062.mar3450.ii    -7.24519

 

Anisotropic scale factor:

   m37v3_8us_002.mar3450.ii

      0.00000      0.00000      0.00000

      0.00000      0.00000      0.00000

      0.00000      0.00000      0.00000

   m37v3_8us_004.mar3450.ii

      0.00042      0.00791     -0.00124