Zhong's 2014 thesis: September 2014

Wednesday, September 10, 2014

Data extraction

Procedures

1. Perform OpenCV Haar feature-based cascade classification on every frame of the RGB video.
2. Transform (map) every frame of the depth video. This happens in parallel with step 1. After the transformation, the depth pixels are relocated to the same coordinates as the RGB pixels, so the depth of any region of interest can be read directly. The resulting image is referred to as "Mapped" below.
3. For each area (left eye, right eye, mouth) detected in the RGB frame, look up the depth value of every pixel in "Mapped", and store the depth information of each area in a temporary image for later use.
4. Resize the temporary images to 36*24 for the eyes and 40*24 for the mouth (required by the feature extraction and classification algorithm), and save all depth information in txt files.
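
The depth lookup for a detected area can be sketched as follows. This is only an illustration, not the thesis code: it assumes "Mapped" is held as a row-major buffer of 16-bit depth values, and `Rect` and `extractDepthRegion` are hypothetical names standing in for the detector rectangle and the copy routine.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the rectangle returned by the face-part detector.
struct Rect { int x, y, width, height; };

// Copies the depth values covered by `roi` out of the mapped depth frame.
// Pixels of the rectangle that fall outside the frame are left at 0.
std::vector<std::uint16_t> extractDepthRegion(const std::vector<std::uint16_t>& mapped,
                                              int mappedWidth, int mappedHeight,
                                              const Rect& roi) {
    assert(mappedWidth > 0 && mappedHeight > 0);
    assert(roi.width >= 0 && roi.height >= 0);
    assert(mapped.size() == static_cast<std::size_t>(mappedWidth) * mappedHeight);

    std::vector<std::uint16_t> out(static_cast<std::size_t>(roi.width) * roi.height, 0);
    for (int r = 0; r < roi.height; ++r) {
        for (int c = 0; c < roi.width; ++c) {
            const int x = roi.x + c;
            const int y = roi.y + r;
            if (x < 0 || y < 0 || x >= mappedWidth || y >= mappedHeight) continue;
            out[static_cast<std::size_t>(r) * roi.width + c] =
                mapped[static_cast<std::size_t>(y) * mappedWidth + x];
        }
    }
    return out;
}
```

Every index is checked against the frame size, so a rectangle that touches the image border cannot read outside the buffer.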

Problem

The temporary image mentioned in step 3 was created as an OpenCV IplImage. The imageData of an IplImage is aligned to 4 bytes for better performance. For example, for a 50-by-50 image with three 8-bit channels, each row occupies not 50*3 = 150 bytes but 152 bytes (152 % 4 = 0). It is dangerous to manipulate imageData through a raw pointer, because there is no bounds check and values from unrelated memory addresses may be returned. However, pointer access is fast, which matters when hundreds of thousands of pixels need to be processed.
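
The padding arithmetic can be checked with a small sketch. IplImage exposes the padded row size as its `widthStep` field; the code below reproduces the same calculation in plain C++ without OpenCV. `alignedRowBytes` and `PaddedImage` are hypothetical names, and 8-bit samples are assumed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Bytes per row after padding up to a multiple of `align` (8-bit samples assumed).
std::size_t alignedRowBytes(std::size_t width, std::size_t channels, std::size_t align = 4) {
    assert(align > 0);
    const std::size_t raw = width * channels;
    return (raw + align - 1) / align * align;
}

// Minimal image buffer whose rows are padded the same way.
struct PaddedImage {
    std::size_t width, height, channels, step;
    std::vector<std::uint8_t> data;

    PaddedImage(std::size_t w, std::size_t h, std::size_t c)
        : width(w), height(h), channels(c), step(alignedRowBytes(w, c)), data(step * h, 0) {}

    // Bounds-checked access: rows are addressed with `step`, not width * channels.
    std::uint8_t& at(std::size_t x, std::size_t y, std::size_t c) {
        if (x >= width || y >= height || c >= channels)
            throw std::out_of_range("pixel outside image");
        return data[y * step + x * channels + c];
    }
};
```

For the 50-by-50 example, `alignedRowBytes(50, 3)` gives 152. A fast pointer loop stays safe as long as it advances by the row stride between rows and touches only the first width * channels bytes of each row.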

Tuesday, September 2, 2014

IR and RGB camera mapping

Three coordinate systems are relevant when we perform calibration: real-world coordinates, IR camera coordinates, and RGB camera coordinates.

Suppose P is a point in real-world coordinates, P_ir (a 3D point) is the same point expressed in IR camera coordinates, and p_ir (a 2D point) is its projection onto the IR image. The relationship between P_ir and p_ir is shown below:

$P_{ir} = H_{ir}^{-1} p_{ir}$

$p_{ir} = {H_{ir}} P_{ir}$

Here H_ir is the intrinsic matrix of the IR camera, which we obtained from the previous calibration result.

P_ir can be transformed to RGB camera coordinates through the relative transformation R and T:

$P_{rgb} = R * P_{ir} + T$

Here R is the rotation matrix and T is the translation vector.

Then we project P_rgb using H_rgb to obtain p_rgb, the image coordinates of P in the RGB camera.

$p_{rgb} = H_{rgb}P_{rgb}$

When two or more points are projected onto the same pixel, the closest point is chosen.

Note that both p_rgb and p_ir are homogeneous coordinates. Therefore, when forming p_ir, we must multiply the pixel coordinates (x, y) by the depth value z, giving p_ir = (xz, yz, z).
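
The whole mapping chain can be sketched as below. This is a simplified illustration under several assumptions that are not from the calibration itself: a pinhole intrinsic matrix with parameters `fx`, `fy`, `cx`, `cy` and no distortion, equal resolution for both images, depth and T in the same unit, and 0 meaning "no depth". All type and function names are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

using Vec3 = std::array<double, 3>;
using Mat3 = std::array<Vec3, 3>;

// Assumed pinhole intrinsics: H = [fx 0 cx; 0 fy cy; 0 0 1].
struct Intrinsics { double fx, fy, cx, cy; };

// P_ir = H_ir^{-1} * p_ir, with p_ir = (x*z, y*z, z).
Vec3 backProject(const Intrinsics& k, double x, double y, double z) {
    return Vec3{{(x - k.cx) * z / k.fx, (y - k.cy) * z / k.fy, z}};
}

// P_rgb = R * P_ir + T.
Vec3 toRgbFrame(const Mat3& R, const Vec3& T, const Vec3& p) {
    Vec3 out{};
    for (int i = 0; i < 3; ++i)
        out[i] = R[i][0] * p[0] + R[i][1] * p[1] + R[i][2] * p[2] + T[i];
    return out;
}

// Re-projects every depth pixel into the RGB image; the closest point wins.
std::vector<std::uint16_t> mapDepthToRgb(const std::vector<std::uint16_t>& depth,
                                         int width, int height,
                                         const Intrinsics& ir, const Intrinsics& rgb,
                                         const Mat3& R, const Vec3& T) {
    assert(width > 0 && height > 0);
    assert(depth.size() == static_cast<std::size_t>(width) * height);
    std::vector<std::uint16_t> mapped(depth.size(), 0);
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            const std::uint16_t z = depth[static_cast<std::size_t>(y) * width + x];
            if (z == 0) continue;  // no measurement at this pixel
            const Vec3 p = toRgbFrame(R, T, backProject(ir, x, y, z));
            if (p[2] <= 0.0 || p[2] > 65535.0) continue;
            // p_rgb = H_rgb * P_rgb, then divide by the third component.
            const long u = std::lround(rgb.fx * p[0] / p[2] + rgb.cx);
            const long v = std::lround(rgb.fy * p[1] / p[2] + rgb.cy);
            if (u < 0 || v < 0 || u >= width || v >= height) continue;
            const auto d = static_cast<std::uint16_t>(std::lround(p[2]));
            std::uint16_t& cell = mapped[static_cast<std::size_t>(v) * width + u];
            if (cell == 0 || d < cell) cell = d;  // keep the closest point
        }
    }
    return mapped;
}
```

With an identity rotation, zero translation, and identical intrinsics, the output equals the input, which is a convenient sanity check before plugging in real calibration values.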

Find Rotation and Translation matrix

An extrinsic matrix transforms a point from 3D world coordinates to 3D camera coordinates.

The extrinsic matrices we are going to use differ in format from the ones in the previous post. They are 4*4 in size and consist of a 3*3 rotation matrix (R), a 3*1 translation vector (T), and a redundant bottom row. The extrinsic matrices are as follows:

IR camera:

-0.026084 -0.928295 -0.370929 194.092830

-0.999637 0.021708 0.015968 51.210837

-0.006771 0.371211 -0.928524 278.779239

0.000000 0.000000 0.000000 1.000000

RGB camera:

-0.036546 -0.929878 -0.366048 185.288116

-0.998892 0.023121 0.040994 28.217325

-0.029656 0.367140 -0.929693 268.344275

0.000000 0.000000 0.000000 1.000000

A point P in 3D world coordinates can be transformed to 3D camera coordinates using the matrices above. The following relations hold:

$P_{ir} = R_{ir}P+T_{ir}$

$P_{rgb} = R_{rgb}P+T_{rgb}$

Solving equation 1 for P and substituting the result into equation 2, we obtain:

$R = R_{rgb}R_{ir}^{-1}$

$T = T_{rgb} - R_{rgb}R_{ir}^{-1}T_{ir} = T_{rgb} - RT_{ir}$
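
For completeness, the intermediate steps can be written out as follows. This only expands the substitution, using the same symbols as above. First the IR relation is solved for P:

$P = R_{ir}^{-1}(P_{ir} - T_{ir})$

Inserting this into the RGB relation gives:

$P_{rgb} = R_{rgb}R_{ir}^{-1}(P_{ir} - T_{ir}) + T_{rgb} = R_{rgb}R_{ir}^{-1}P_{ir} + (T_{rgb} - R_{rgb}R_{ir}^{-1}T_{ir})$

Comparing term by term with $P_{rgb} = R P_{ir} + T$ yields the expressions for R and T.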

Applying the data above, the resulting R (first three rows) and T (last three rows) are:

[0.99993188 0.01050178 -0.00504891]

[-0.01061398 0.99968556 -0.02271776]

[ 0.00480963 0.02276964 0.9997292 ]

[-7.92176365]

[-14.58407103]

[-12.45903772]
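
The calculation can be reproduced with a short sketch. One assumption is made here that differs from the formula: the inverse of R_ir is replaced by its transpose, which is only valid if the rotation block is orthonormal. With the matrices from the text this should land very close to the values listed above. The names `RelativePose` and `relativePose` are hypothetical.

```cpp
#include <array>
#include <cassert>

using Vec3 = std::array<double, 3>;
using Mat3 = std::array<Vec3, 3>;
using Mat4 = std::array<std::array<double, 4>, 4>;

struct RelativePose { Mat3 R; Vec3 T; };

// Extrinsic matrices copied from the text.
const Mat4 kExtrinsicIr = {{
    {{-0.026084, -0.928295, -0.370929, 194.092830}},
    {{-0.999637,  0.021708,  0.015968,  51.210837}},
    {{-0.006771,  0.371211, -0.928524, 278.779239}},
    {{ 0.000000,  0.000000,  0.000000,   1.000000}}
}};
const Mat4 kExtrinsicRgb = {{
    {{-0.036546, -0.929878, -0.366048, 185.288116}},
    {{-0.998892,  0.023121,  0.040994,  28.217325}},
    {{-0.029656,  0.367140, -0.929693, 268.344275}},
    {{ 0.000000,  0.000000,  0.000000,   1.000000}}
}};

// R = R_rgb * R_ir^T (transpose used as the inverse), T = T_rgb - R * T_ir.
RelativePose relativePose(const Mat4& ir, const Mat4& rgb) {
    assert(ir[3][3] == 1.0 && rgb[3][3] == 1.0);
    RelativePose out{};
    for (int i = 0; i < 3; ++i)
        for (int j = 0; j < 3; ++j)
            for (int k = 0; k < 3; ++k)
                out.R[i][j] += rgb[i][k] * ir[j][k];
    for (int i = 0; i < 3; ++i) {
        out.T[i] = rgb[i][3];
        for (int k = 0; k < 3; ++k)
            out.T[i] -= out.R[i][k] * ir[k][3];
    }
    return out;
}
```

Calling `relativePose(kExtrinsicIr, kExtrinsicRgb)` returns a rotation close to the identity and a translation of roughly (-7.9, -14.6, -12.5), in the same units as the extrinsic translations.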

The result looks reasonable. However, the GML tool manual suggests that calibrating with two patterns at the same time produces a much more accurate result. Because we are going to perform feature extraction at pixel level, a more accurate calibration with two different patterns may be required.

Monday, September 1, 2014

RGB to Depth conversion

Basic information

Depth information captured from the Kinect is stored in a 16-bit data structure, as shown below:

Bits D13 D12 D11 D10 ...	Lowest 3 bits
Real depth information (13 bits)	User indicators (3 bits for 7 users)

User indicators: the 3 lowest bits, which identify the user.

Real depth information: the 13 higher bits, which give the distance between the detected object and the Kinect depth sensor.

The accuracy of the IR sensor is 7 mm, therefore the three lowest bits of the depth value are always zero:

Bits D13 D12 D11 D10 ...
Real depth information (13 bits)
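
Assuming the layout described above (13 depth bits sitting above 3 user bits), separating the two fields is a shift and a mask. The function names below are hypothetical and are not taken from the Kinect SDK.

```cpp
#include <cassert>
#include <cstdint>

constexpr unsigned kUserBits = 3;                      // lowest bits: user indicator
constexpr std::uint16_t kUserMask = (1u << kUserBits) - 1;

// Depth in millimetres: the 13 bits above the user indicator.
std::uint16_t depthMillimetres(std::uint16_t packed) {
    return static_cast<std::uint16_t>(packed >> kUserBits);
}

// User indicator: the 3 lowest bits (0 means no user).
std::uint8_t userIndex(std::uint16_t packed) {
    return static_cast<std::uint8_t>(packed & kUserMask);
}

// Inverse operation, useful for building test values.
std::uint16_t packDepth(std::uint16_t depthMm, std::uint8_t user) {
    assert(depthMm < (1u << 13));
    assert(user <= kUserMask);
    return static_cast<std::uint16_t>((depthMm << kUserBits) | user);
}
```

For example, a depth of 1000 mm seen on user 2 packs to 8002, and both fields are recovered unchanged from that value.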

RGB to depth conversion

In the sample video, the depth value is converted to a 24-bit RGB value using Nui_ShortToQuad_Depth() from the Kinect SDK. The following equations are used, where s is the depth value:

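BYTE red   = s;               // low 8 bits: D8..D4 followed by three zero bits
BYTE green = (s >> 3) & 224;  // D11 D10 D9 in the top three bits
BYTE blue  = (s >> 5) & 192;  // D13 D12 in the top two bits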

Blue								Green								Red
D13	D12	0	0	0	0	0	0	D11	D10	D9	0	0	0	0	0	D8	D7	D6	D5	D4	0	0	0

To extract the real depth value from the RGB representation, we apply the following mapping:

$\mathrm{Depth\ in\ mm} = 2^5 \cdot \mathrm{blue} + 2^3 \cdot \mathrm{green} + \mathrm{red}$
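
The round trip can be verified with a small sketch. It assumes s is the 13-bit depth value, and it uses a hypothetical `Rgb` struct and function names rather than the SDK's own types.

```cpp
#include <cassert>
#include <cstdint>

struct Rgb { std::uint8_t red, green, blue; };

// Same bit layout as the equations in the text; only the low 13 bits of s survive.
Rgb encodeDepth(std::uint16_t s) {
    assert(s < (1u << 13));
    return Rgb{static_cast<std::uint8_t>(s & 0xFF),          // D8..D1
               static_cast<std::uint8_t>((s >> 3) & 224),    // D11 D10 D9
               static_cast<std::uint8_t>((s >> 5) & 192)};   // D13 D12
}

// Depth in mm = 2^5 * blue + 2^3 * green + red.
std::uint16_t decodeDepth(const Rgb& c) {
    return static_cast<std::uint16_t>((c.blue << 5) + (c.green << 3) + c.red);
}
```

For a depth of 1000 mm the encoder produces red = 232, green = 96, blue = 0, and the decoder returns 0 * 32 + 96 * 8 + 232 = 1000.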