Sunday, February 23, 2014

Why GPUs

We’ve seen the meteoric rise of hardware-accelerated video transcoding: a growth in the availability of graphics cards and processors that can decode and subsequently re-encode compressed video, whether using GPU shaders or dedicated transcoding logic. These days, every new GeForce, Radeon, A-series APU, and Core processor has some sort of hardware transcoding, and video conversion software vendors across the world are working hard to support them all. Because both CPUs and GPUs come with their own dedicated transcoding logic, some systems offer multiple paths to hardware acceleration.

Traditional GPUs rely on fast shared local memory, which the programmer needs to manage explicitly, and on hardware scheduling of many tiny threads. These and other differences mean that applications usually benefit from being tuned to the hardware they’re intended to run on.

The folks behind the popular x264 software encoder have been quietly plugging away at an OpenCL-accelerated version of their look-ahead pipeline. Look ahead only accounts for 10-25% of the total encoding time, according to x264 lead developer Jason Garrett-Glaser, but the process allows for nearly unlimited parallelism and is relatively easy to implement in OpenCL. Rewriting all of the x264 encoder in OpenCL, by contrast, would be "very hard." Garrett-Glaser says the accelerated look ahead can increase performance by up to 40% on AMD's new Trinity APUs and by a factor of two on the latest Radeon graphics cards.

Targeted applications for hardware acceleration using GPU cards:
1. UltraHD (3840x2160) & Cinema 4K (4096x2160)
2. Internet and mobile video streaming (VOD)
3. HD video communication
4. Broadcast distribution (OTT)
TECHNOLOGY OVERVIEW: To get maximum utilization out of the available resources, the encoder's modules should be partitioned between the CPU and the GPU as follows.


CPU:
 Formatting
 Pre-analysis and decision making
 Configuration
 Rate control
 CABAC
 Overall operation

GPU:
 Motion estimation
 Intra and inter prediction
 DCT and DST transform and inverse transform
 Quantization and de-quantization
 SAO and de-blocking filters

The current picture/frame's pixels are predicted from the reference frame's pixels:



 The reference picture can be from the past or the future
 The prediction happens on a block-by-block basis, and there can be multiple reference frames for each block

Motion Compensation:


The most compute-intensive part of motion compensation is sub-pixel interpolation:
 Luma: 8- or 7-tap filter
 Chroma: 4-tap filter
Sub-pixel interpolation is data parallel, i.e., the interpolation of each block within a frame can happen in parallel, and it is hence well suited to GPU computing



Challenges in a CPU+GPU Implementation

1. Efficient partitioning of work between the CPU and the GPU
 The effective FPS of the decoder will be the minimum of the FPS achieved by the CPU and the GPU for their respective work
 The partitioning therefore needs to be efficient enough that both sides perform their respective work at almost the same speed (FPS)
2. Efficient pipelining of data between the CPU and the GPU
 The algorithms running on the CPU will depend on the output of the algorithms running on the GPU, and/or vice versa
 A good design should make sure neither the CPU nor the GPU spends any time waiting for the output of the other
3. Cache coherency
 Cache coherency between CPU and GPU data needs to be ensured

Why GPUs will stay ahead

1. Technical reasons:
 SIMD cores (instead of MIMD cores) mean a larger proportion of the chip is devoted to floating-point computation
 Tightly coupled, fast graphics memory means much higher bandwidth
2. Economic reasons:
 CPUs are driven by price-sensitive office/home computing, and it is not clear these need vastly more speed
 The CPU direction may be towards low-cost, low-power chips for mobile and embedded applications
 GPUs are driven by high-end applications that are prepared to pay a premium for high performance
 This gives at least a 10x improvement in energy efficiency and price/performance compared to two quad-core Intel Xeons
 Effectively a personal cluster in a PC under your desk
References:
1. http://ieeexplore.ieee.org/xpl/login.jsp?reload=true&tp=&arnumber=6618412&url=http%3A%2F%2Fieeexplore.ieee.org%2Fiel7%2F6599032%2F6618213%2F06618412.pdf%3Farnumber%3D6618412 
2. http://work.martin.spb.ru/ 
3. http://www.arm.com/support/university/researchers/gpu/graphics-processing-software-tools-for-researchers.php 
4. https://software.intel.com/en-us/articles/opencl-design-and-programming-guide-for-the-intel-xeon-phi-coprocessor 
5. http://x265.org/ 
6. http://www.mainconcept.com/products/sdks/gpu-acceleration.html 
7. http://homepages.laas.fr/elbaz/PCO11.pdf

Saturday, February 8, 2014

HTTP over UDP/TLS

Recently, while investigating traffic between different clients and origin servers, I noticed the following header: “Alternate-Protocol: 80:quic”. On further investigation, it appears QUIC (Quick UDP Internet Connections) is an experimental Google protocol to reduce latency over the internet, distinct from Google's SPDY.
The links below are an interesting read.
http://blog.chromium.org/2013/06/experimenting-with-quic.html
https://docs.google.com/document/d/1RNHkx_VvKWyWg6Lr8SZ-saqsQx7rFV-ev2jRFUoVD34/edit
https://docs.google.com/document/d/1lmL9EF6qKrk7gbazY8bIdvq3Pno2Xj_l_YShP40GLQE/edit#heading=h.h3jsxme7rovm

Monday, February 3, 2014

Tower of Hanoi

The Tower of Hanoi is a classic problem for learning recursion. Here we see how the recursion's base condition arises and how the parameters to each successive call to the same function are modified to give the desired output.

The problem at hand: we have three pegs, A, B, and C. Peg A holds N disks, stacked in such a way that every disk is smaller than the disk below it.
We need to move these N disks to peg B, preserving the same ordering, using peg C as a spare.

The first thing that comes to mind after reading the problem: move N-1 disks from peg A to peg C, then move the Nth disk to the destination peg B. Now we are left with the problem of moving N-1 disks from peg C to peg B, using peg A as the spare.
Easy enough?

Do we see the recursion here?
First: we apply the same process to N-1 disks that we applied to N disks.
Second: the base case is when only one disk is left at the source; we move it directly to the destination.

Pseudo-code
Function Tower_Of_Hanoi (N, source, destination, helper):
If N == 1 : Move disk from source to destination
else
    Tower_Of_Hanoi (N-1, source, helper, destination)
    Move disk from source to destination
    Tower_Of_Hanoi (N-1, helper, destination, source)

Now let's see how a stack can be used to solve this problem. A rule of thumb: if a problem can be solved using recursion, it can also be solved using an explicit stack.
We need to map the recursive calls onto a stack.

Code
The code given here is the recursive implementation.

#include <stdio.h>

/* Move N disks from 'source' to 'dest', using 'help' as the spare peg. */
void tower_of_hanoi(int N, char source, char dest, char help){

        if(N > 0){
                tower_of_hanoi(N-1, source, help, dest);  /* clear N-1 disks out of the way */
                printf("Moving disk from %c to %c\n", source, dest);
                tower_of_hanoi(N-1, help, dest, source);  /* stack them onto the moved disk */
        }

}

int main(void){
        tower_of_hanoi(3, 'A', 'B', 'C');  /* prints the 7 moves for 3 disks */
        return 0;
}