Wednesday, December 22, 2010

Google's AROUND Operator for Proximity Search

via Google Operating System by Alex Chitu on 12/8/10

Google has an undocumented operator called "AROUND" for finding web pages that include words or phrases which are near to each other.

If you want to find results that include both "Steve Jobs" and "Andy Rubin", you might search for ["Steve Jobs" "Andy Rubin"] or even for ["Steve Jobs * Andy Rubin"]. Google's AROUND operator lets you specify the maximum number of words that separate the two names. For example, you could search for ["Steve Jobs" AROUND(3) "Andy Rubin"] and only get web pages that include the two names separated by no more than three words.

"The AROUND operator is a handy trick to use when you're looking for a combination of search terms when one dominates the results, but you're interested in the relationship between two query terms. Note also that if Google can't find anything within the limit, it will just do regular ranking of the terms without the AROUND coming into play. Using AROUND is especially useful when the documents are rather long (think book-length articles). So try this operator in Google Books.... [slavery AROUND(4) indigo]," suggests Google's Daniel Russell.

Barry Schwartz notes that Bing has a similar operator, but it's called "near".

{ via Search Engine Roundtable }

Tuesday, December 21, 2010

Joining all lines in a file with comma

awk '{ printf "%s,", $0 }' file.txt > newfile.txt

# note: this leaves a trailing comma; paste -sd, file.txt joins the lines without one

Monday, December 20, 2010

Mysql db count tables

select count(*) as number_of_tables 
from information_schema.tables 
where table_schema = 'my favourite schema' 

Saturday, December 11, 2010

Find the minimum difference between two arrays

Given two sorted arrays, A and B, find i,j for which |A[i] - B[j]| is minimum
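A two-pointer sketch in Python (function name and test data are mine, not from the post). Since both arrays are sorted, advancing the pointer at the smaller value can never miss a better pair:

```python
def min_abs_difference(a, b):
    """Two-pointer scan over two sorted arrays: O(len(a) + len(b))."""
    i = j = 0
    best = abs(a[0] - b[0])
    while i < len(a) and j < len(b):
        best = min(best, abs(a[i] - b[j]))
        # Advance the pointer at the smaller value: moving the larger
        # one could only widen the gap.
        if a[i] < b[j]:
            i += 1
        else:
            j += 1
    return best

print(min_abs_difference([1, 4, 10], [2, 8, 9]))  # 1
```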

Longest sequence of d-dimensional boxes fitting into each other

Interview question :
You have 'n' d-dimensional boxes.
You have to find out the longest sequence
of boxes which fit into each other.

Hints :
1. How do you tell whether one 3-d box
fits into another?
2. Extend it to d dimensions.
3. It's possible to do a BFS on a directed acyclic graph
without keeping a 'visited' lookup table, since it will
never go into an infinite loop.
4. Finding the longest path in a general graph is NP-hard,
but on a DAG it can be computed in polynomial time.
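Following the hints, a Python sketch (function and test data are mine): after sorting each box's side lengths, "fits strictly inside" is a partial order, so the induced graph is a DAG and a memoized DFS finds the longest chain without any NP-hardness getting in the way.

```python
from functools import lru_cache

def longest_nesting_chain(boxes):
    """Longest sequence of d-dimensional boxes, each fitting inside the next.

    A box fits inside another if, after sorting each box's side lengths,
    every side is strictly smaller. That relation is a partial order, so
    the 'fits into' graph is a DAG; memoized DFS gives the longest chain
    in O(n^2 * d).
    """
    boxes = [tuple(sorted(b)) for b in boxes]

    def fits(inner, outer):
        return all(x < y for x, y in zip(inner, outer))

    @lru_cache(maxsize=None)
    def chain_ending_at(i):
        # 1 for the box itself, plus the best chain of boxes fitting inside it
        return 1 + max((chain_ending_at(j) for j in range(len(boxes))
                        if fits(boxes[j], boxes[i])), default=0)

    return max(chain_ending_at(i) for i in range(len(boxes)))

print(longest_nesting_chain([(1, 2, 3), (2, 3, 4), (5, 6, 7), (3, 3, 3)]))  # 3
```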

Encoding/decoding a tree

How do you encode a tree, send it over
the wire, and decode it on the other side?
Give algorithms for encoding/decoding.
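One common answer is preorder serialization with explicit null markers; a Python sketch, under my assumption that nodes are (value, left, right) tuples with integer values:

```python
def encode(node):
    """Preorder serialization with '#' as the null marker."""
    if node is None:
        return "#"
    value, left, right = node
    return "%d %s %s" % (value, encode(left), encode(right))

def decode(text):
    """Rebuild the tree by consuming tokens in the same preorder."""
    tokens = iter(text.split())
    def build():
        tok = next(tokens)
        if tok == "#":
            return None
        return (int(tok), build(), build())
    return build()

tree = (1, (2, None, None), (3, (4, None, None), None))
assert decode(encode(tree)) == tree
```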

Location of an element where index is same as the value

In a sorted integer array,

find an index, for which the
value is same as the index, i.e. a[i] is same as i.

For e.g. in a = {-2,0,2,3}
a[2] is 2, so the answer is 2.

In {1,2,5,7}, there is no such index.

My answer is along the lines of binary search,
but I would also like to see other approaches.
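A binary-search sketch in Python (my own, assuming the integers are distinct; with duplicates allowed, a[i] - i is no longer strictly increasing and binary search can miss the answer):

```python
def fixed_point(a):
    """Return an index i with a[i] == i, or -1 if none exists.

    Works on a sorted array of distinct integers: a[i] - i is then
    non-decreasing, so binary search applies. O(log n).
    """
    lo, hi = 0, len(a) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if a[mid] == mid:
            return mid
        elif a[mid] < mid:
            lo = mid + 1   # fixed point, if any, lies to the right
        else:
            hi = mid - 1   # fixed point, if any, lies to the left
    return -1

print(fixed_point([-2, 0, 2, 3]))  # 2
print(fixed_point([1, 2, 5, 7]))   # -1
```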

Locker design interview question

You have to design a locker system where people
can deposit their bags of 3 types: small, medium, big.
There are likewise 3 types of lockers: small, medium, big,
with 'n' lockers of each type.

A small bag goes in a small locker,
a big bag in a big locker, and so on.

You need to support two operations:
1. deposit a bag
2. retrieve a bag (when the depositor comes back)
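One possible O(1) design (a sketch of my own, not the canonical answer): keep a free-list stack of locker ids per size, plus a ticket-to-locker map handed out on deposit.

```python
class LockerSystem:
    """Sketch of an O(1) locker system. All names are assumptions:
    a free-list (stack) per size, and a dict mapping tickets to lockers.
    """
    def __init__(self, n):
        # Lockers are identified by (size, index); free ones sit on a stack.
        self.free = {size: list(range(n)) for size in ("small", "medium", "big")}
        self.tickets = {}
        self.next_ticket = 0

    def deposit(self, size):
        """O(1): pop a free locker of the matching size, return a ticket."""
        if not self.free[size]:
            raise RuntimeError("no free %s locker" % size)
        locker = self.free[size].pop()
        ticket = self.next_ticket
        self.next_ticket += 1
        self.tickets[ticket] = (size, locker)
        return ticket

    def retrieve(self, ticket):
        """O(1): look up the locker by ticket, free it, return it."""
        size, locker = self.tickets.pop(ticket)
        self.free[size].append(locker)
        return (size, locker)
```

Usage: `t = system.deposit("small")` on arrival, `system.retrieve(t)` when the depositor comes back; both are constant time.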

interview questions

find the height of a binary tree, iteratively
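A standard iterative approach is level-order traversal with a queue, counting levels; a Python sketch (tuple-based nodes are my assumption):

```python
from collections import deque

def tree_height(root):
    """Iterative height via level-order traversal (BFS); no recursion.
    Nodes are (value, left, right) tuples; an empty tree has height 0."""
    if root is None:
        return 0
    height = 0
    level = deque([root])
    while level:
        height += 1
        for _ in range(len(level)):  # consume exactly one level per pass
            _value, left, right = level.popleft()
            if left is not None:
                level.append(left)
            if right is not None:
                level.append(right)
    return height

leaf = ("c", None, None)
print(tree_height(("a", ("b", leaf, None), None)))  # 3
```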

Tuesday, December 7, 2010

Interview question

I have 6 million coordinates of the type (x,y).
I know the bounding rectangle for those points,
that is, I know min(x), min(y), max(x), max(y)
over the complete range.

Now, write a function closestTen(x,y), which
will take an (x,y) point as input and find its 10 closest
neighbors among the given points.

This input (x,y) can be inside the bounding rectangle
or outside it. It may also be one of the 6 million points,
but not necessarily.
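The post gives no solution; one sketch of my own (all names mine): a linear scan keeping the 10 best in a bounded max-heap. For repeated queries over 6 million points you would instead precompute a spatial index over the known bounding rectangle, such as a uniform grid of cells or a k-d tree, and inspect only nearby cells; the scan below then serves as the reference answer to validate the indexed version.

```python
import heapq

def closest_ten(points, x, y, k=10):
    """Return the k points nearest to (x, y), nearest first.

    Linear scan with a bounded max-heap: O(n log k) per query.
    """
    heap = []  # max-heap of the k best so far, via negated squared distance
    for px, py in points:
        d = (px - x) ** 2 + (py - y) ** 2  # squared distance is enough
        if len(heap) < k:
            heapq.heappush(heap, (-d, px, py))
        elif -heap[0][0] > d:  # closer than the current worst kept
            heapq.heapreplace(heap, (-d, px, py))
    return [(px, py) for _, px, py in sorted(heap, reverse=True)]
```

For example, `closest_ten([(0, 0), (5, 5)], 1, 1, k=1)` returns `[(0, 0)]`; the query point may lie inside or outside the bounding rectangle, which doesn't affect the scan.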

Some Interview questions

1. Given a sorted array which has distinct integers in ascending order,
find an index where a[k] = k;
Solution : modified binary search

2. Give me the data structure and algorithm for a locker system
where there are 3 types of lockers: big, medium and small, with n
lockers of each type.
Customers come with 3 types of bags: big, medium and small.
A small bag goes in a small locker, and so on...

You have to support two operations :
a. insert bag,
b. retrieve bag.

For both of the operations, the expected time complexity is O(1).

3. Given a binary search tree, find two elements which sum up to k.
Solution : convert it into a doubly linked list and then solve the problem
as if you are solving it for an array.
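A sketch of question 3 in Python (names and tuple-based nodes are my assumptions): flattening the BST with an inorder traversal plays the role of the doubly-linked-list conversion, after which the classic two-pointer scan for a sorted array applies.

```python
def bst_pair_with_sum(root, k):
    """Find two BST values summing to k. Nodes are (value, left, right)
    tuples. Inorder traversal yields sorted order; O(n) time and space."""
    values = []
    stack, node = [], root
    while stack or node:          # iterative inorder traversal
        while node:
            stack.append(node)
            node = node[1]
        node = stack.pop()
        values.append(node[0])
        node = node[2]
    lo, hi = 0, len(values) - 1   # classic two-pointer scan on sorted data
    while lo < hi:
        s = values[lo] + values[hi]
        if s == k:
            return values[lo], values[hi]
        elif s < k:
            lo += 1
        else:
            hi -= 1
    return None

bst = (4, (2, (1, None, None), (3, None, None)), (6, (5, None, None), None))
print(bst_pair_with_sum(bst, 9))  # (3, 6)
```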

Monday, December 6, 2010

php echo print difference

echo is marginally faster, since it returns nothing, whereas print
always returns 1 (which lets it appear inside expressions); echo can
also take multiple comma-separated arguments.

PHP how to get the browser information

through $_SERVER['HTTP_USER_AGENT'], optionally coupled with the
get_browser() function, which maps the user-agent string to known
browser capabilities.

GET POST difference

1. A POST request can carry data in both the query string and the
request body, whereas GET carries data only in the query string.
2. Responses to GET can be cached; responses to POST cannot.
3. Sending passwords via GET can expose the password to:
(a) a casual observer looking over your shoulder
(b) someone reading server logs
(c) third-party sites, via the Referer header
4. Through POST, you can send more data.
5. Resubmitting a POST request prompts (or should prompt) the user.
6. GET requests (query strings) are well suited to bookmarking
particular items and to search-engine indexing.

Saturday, December 4, 2010



Binary search tree insertion in PHP

function tree_insert($root, $data) {
    if ($root === null) {
        // empty subtree: start a new node
        return array('data' => $data, 'l' => null, 'r' => null);
    }
    if ($root['data'] < $data) {
        $root['r'] = tree_insert($root['r'], $data);
    } else {
        $root['l'] = tree_insert($root['l'], $data);
    }
    return $root;
}

$root = tree_insert(null, 22);
foreach (array(20, 24, 23, 23.5, 19, 21, 20.5, 21.5,
               20.25, 20.75, 18, 19.5, 18.5) as $value) {
    $root = tree_insert($root, $value);
}

Web Security issues

1. Cross-Site Request Forgery (CSRF) :

The attacker gets the victim's browser to send an authenticated request
to a site where the victim is already logged in. (A related trick is
changing the DNS entry for the domain in the user's ADSL router, so that
requests go to a different server altogether; that is DNS hijacking
rather than CSRF proper.)

For e.g. a URL in one of the forums which Bob uses :
<img src="">
If Bob's session cookies haven't expired, the forged request runs with
his credentials and benefits Mallory.

Prevention :
1. Add an unguessable authentication token to GET/POST requests
2. Check the Referer header
3. Use a per-session secret token
4. Restrict cross-domain access via crossdomain.xml (for Flash clients)

2. Replay attack :
Eavesdrop on a client-server communication and replay it.
Prevention :
The server should send the client a one-time-use token (nonce).
Other Issues : 

3. Cross-site cooking / cross-subdomain cooking (tricking a browser into setting cookies for other domains/subdomains);
if the affected person then logs in, the attacker can reuse that cookie.
General solutions :
generate a new session ID before anything crucial
destroy the session on a malicious referrer
check browser and similar fingerprint information
time out old session IDs.

4. A billion laughs (XML entity-expansion denial of service) :

<!DOCTYPE root [
<!ENTITY ha "Ha! ">
<!ENTITY ha2 "&ha;&ha;">
<!ENTITY ha3 "&ha2;&ha2;">
<!ENTITY ha4 "&ha3;&ha3;">
<!ENTITY ha5 "&ha4;&ha4;">
<!-- ... entities ha6 through ha127 follow the same pattern ... -->
<!ENTITY ha128 "&ha127;&ha127;">
]>
<root>&ha128;</root>

Each entity doubles the previous one, so &ha128; expands to 2^127 copies of "Ha!".

5. Similar to a billion laughs :
ReDoS (Regular expression Denial of Service), where crafted input makes
a backtracking regex engine take exponential time.

Cross site scripting attack (Google example)

The script ( is normally used for redirecting the browser from Google's website to other sites.

For example, the following request will redirect the browser to 

When the parameter (q) is passed to the script in an illegal format (the expected format seems to be: http://domain), a "403 Forbidden" page is returned, informing the user that the query was illegal. The parameter's value appears in the HTML returned to the user.

If is requested, the text in the "403 Forbidden" response would be:
"Your client does not have permission to get URL /url?q=USER_INPUT from this server."

The server response lacks charset encoding enforcement, such as:
* Response headers: "Content-Type: text/html; charset=[encoding]".
* Response body: "<meta http-equiv="Content-Type" (...) charset=[encoding]/>".

Google's 404 NOT FOUND mechanism:
When requesting a page which doesn't exist under, a 404 NOT FOUND response is returned to the user, with the original path requested.

If is requested, the following text appears in the response:
"Not Found The requested URL /NOTFOUND was not found on this server."

The server response lacks charset encoding enforcement, such as:
* Response headers: "Content-Type: text/html; charset=[encoding]".
* Response body: "<meta http-equiv="Content-Type" (...) charset=[encoding]/>".

XSS vulnerabilities:
While the aforementioned mechanisms (URL redirection script, 404 NOT FOUND) escape common characters used for XSS, such as angle brackets and apostrophes, they fail to handle hazardous UTF-7 encoded payloads.

Therefore, when sending an XSS attack payload, encoded in UTF-7, the payload will return in the response without being altered.

For the attack to succeed (script execution), the victim's browser should treat the XSS payload as UTF-7.

IE charset encoding Auto-Selection:
If 'Encoding' is set to 'Auto-Select', and Internet-Explorer finds a UTF-7 string in the first 4096 characters of the response's body, it will set the charset encoding to UTF-7 automatically, unless a certain charset encoding is already enforced.

This automatic encoding selection feature makes it possible to mount UTF-7 XSS attacks on

Google solved the aforementioned issues on 01/12/2005 by enforcing a character encoding.

Friday, December 3, 2010

Some db performance optimizations

1. Materialized views :
Views whose results are cached and refreshed from time to time.

2. Save the result of aggregate operations and refresh it periodically (or use triggers).
(For example, Stack Overflow updates your accept rate not immediately, but after some time.)

3. Vertically split tables (by columns); keep the primary-key column in both partitions.
If the original table is still maintained, there is no need to carry over the referential-integrity constraints.

4. Redundant data :
table1(empid, empname, deptId), table2(deptId, deptName)
If empname and deptName are usually fetched together, consider
copying deptName into the first table.

5. When dealing with repeating groups, you can consider having multiple columns
rather than multiple rows in order to speed up access,
but before doing this, make sure that
-> the data will rarely need to be aggregated (that becomes difficult across columns)
-> the data will be accessed collectively

Data Denormalization guidelines

This page is mirrored from The Data Administration Newsletter. Please visit them for more.

Craig S. Mullins, PLATINUM 
technology, inc.

Normalization is the process of putting one fact in one appropriate place. This optimizes updates at the expense of retrievals. When a fact is stored in only one place, retrieving many different but related facts usually requires going to many different places. This tends to slow the retrieval process. Updating is quicker, however, because the fact you're updating exists in only one place.

It is generally recognized that all relational database designs should be based on a normalized logical data model. With a normalized data model, one fact is stored in one place, related facts about a single entity are stored together, and every column of each entity refers non-transitively to only the unique identifier for that entity. Although an in-depth discussion of normalization is beyond the scope of this article, brief definitions of the first three normal forms follow:

  • In first normal form, all entities must have a unique identifier, or key, that can be composed of one or more attributes. In addition, all attributes must be atomic and non-repeating. (Atomic means that the attribute must not be composed of multiple attributes. For example, EMPNO should not be composed of social security number and last name because these are separate attributes.)
  • In second normal form, all attributes that are not part of the key must depend on the entire key for that entity.
  • In third normal form, all attributes that are not part of the key must not depend on any other non-key attributes.

Frequently, however, performance needs dictate very quick retrieval capability for data stored in relational databases. To accomplish this, sometimes the decision is made to denormalize the physical implementation. Denormalization is the process of putting one fact in numerous places. This speeds data retrieval at the expense of data modification.

It is not the intention of this article to promote the concept of denormalization. Of course, a normalized set of relational tables is the optimal environment and should be implemented whenever possible. Yet, in the real world, denormalization is sometimes necessary. Denormalization is not necessarily a bad decision if implemented wisely. You should always consider these issues before denormalizing:

  • can the system achieve acceptable performance without denormalizing?
  • will the performance of the system after denormalizing still be unacceptable?
  • will the system be less reliable due to denormalization?

If the answer to any of these questions is "yes," then you should avoid denormalization because any benefit that is accrued will not exceed the cost. If, after considering these issues, you decide to denormalize, be sure to adhere to the general guidelines that follow.

If enough DASD is available at your shop, create two sets of tables: one set fully normalized and another denormalized. Populate the denormalized versions by querying the data in the normalized tables and loading or inserting it into the denormalized tables. Your application can access the denormalized tables in a read-only fashion and achieve performance gains. It is imperative that a controlled and scheduled population function is maintained to keep the data in the denormalized and normalized tables synchronized.

If DASD is not available for two sets of tables, then maintain the denormalized tables programmatically. Be sure to update each denormalized table representing the same entity at the same time, or alternately, to provide a rigorous schedule whereby tables will be synchronized. At any rate, all users should be informed of the implications of inconsistent data if it is deemed impossible to avoid unsynchronized data.

When updating any column that is replicated in many different tables, always update it everywhere that it exists simultaneously, or as close to simultaneously as possible given the physical constraints of your environment. If the denormalized tables are ever out of sync with the normalized tables be sure to inform end-users that batch reports and on-line queries may not contain sound data; if at all possible, this should be avoided.

Finally, be sure to design the application so that it can be easily converted from using denormalized tables to using normalized tables.

The Reason for Denormalization

Only one valid reason exists for denormalizing a relational design - to enhance performance. However, there are several indicators which will help to identify systems and tables which are potential denormalization candidates. These are:

  • Many critical queries and reports exist which rely upon data from more than one table. Often times these requests need to be processed in an on-line environment.
  • Repeating groups exist which need to be processed in a group instead of individually.
  • Many calculations need to be applied to one or many columns before queries can be successfully answered.
  • Tables need to be accessed in different ways by different users during the same timeframe.
  • Many large primary keys exist which are clumsy to query and consume a large amount of DASD when carried as foreign key columns in related tables.
  • Certain columns are queried a large percentage of the time. Consider 60% or greater to be a cautionary number flagging denormalization as an option.

Be aware that each new RDBMS release usually brings enhanced performance and improved access options that may reduce the need for denormalization. However, most of the popular RDBMS products on occasion will require denormalized data structures. There are many different types of denormalized tables which can resolve the performance problems caused when accessing fully normalized data. The following topics will detail the different types and give advice on when to implement each of the denormalization types.

Pre-Joined Tables

If two or more tables need to be joined on a regular basis by an application, but the cost of the join is prohibitive, consider creating tables of pre-joined data. The pre-joined tables should:

  • contain no redundant columns (matching join criteria columns)
  • contain only those columns absolutely necessary for the application to meet its processing needs
  • be created periodically using SQL to join the normalized tables

The cost of the join will be incurred only once when the pre-joined tables are created. A pre-joined table can be queried very efficiently because every new query does not incur the overhead of the table join process.

Report Tables

Often times it is impossible to develop an end-user report using SQL or QMF alone. These types of reports require special formatting or data manipulation. If certain critical or highly visible reports of this nature are required to be viewed in an on-line environment, consider creating a table that represents the report. This table can then be queried using SQL, QMF, and/or another report facility. The report should be created using the appropriate mechanism (application program, 4GL, SQL, etc.) in a batch environment. It can then be loaded into the report table in sequence. The report table should:

  • contain one column for every column of the report
  • have a clustering index on the columns that provide the reporting sequence
  • not subvert relational tenets (such as, 1NF and atomic data elements)

Report tables are ideal for carrying the results of outer joins or other complex SQL statements. If an outer join is executed and then loaded into a table, a simple SELECT statement can be used to retrieve the results of the outer join, instead of the complex UNION technique shown in Figure 1. Some RDBMS products support an explicit outer join function which can be used instead of the UNION depicted. However, depending on the implementation, the explicit outer join may be simpler or more complex than the UNION it replaces.

Figure 1. Outer Join Technique Using UNION

Mirror Tables

If an application system is very active it may be necessary to split processing into two (or more) distinct components. This requires the creation of duplicate, or mirror, tables. Consider an application system that has very heavy on-line traffic during the morning and early afternoon hours. This traffic consists of both querying and updating of data. Decision support processing is also performed on the same application tables during the afternoon. The production work in the afternoon always seems to disrupt the decision support processing, causing frequent timeouts and deadlocks.

This situation could be corrected by creating mirror tables. A foreground set of tables would exist for the production traffic and a background set of tables would exist for the decision support reporting. A mechanism to periodically migrate the foreground data to background tables must be established to keep the application data synchronized. One such mechanism could be a batch job executing UNLOAD and LOAD utilities. This should be done as often as necessary to sustain the effectiveness of the decision support processing.

It is important to note that since the access needs of decision support are often considerably different than the access needs of the production environment, different data definition decisions such as indexing and clustering may be chosen for the mirror tables.

Split Tables

If separate pieces of one normalized table are accessed by different and distinct groups of users or applications then consider splitting the table into two (or more) denormalized tables; one for each distinct processing group. The original table can also be maintained if other applications exist that access the entire table. In this scenario the split tables should be handled as a special case of mirror table. If an additional table is not desired then a view joining the tables could be provided instead.

Tables can be split in one of two ways: vertically or horizontally. Refer to Figure 2. A vertical split cuts a table column-wise, such that one group of columns is placed into one new table and the remaining columns are placed in another new table. A horizontally split table is a row-wise split. To split a table horizontally, rows are classified into groups via key ranges. The rows from one key range are placed in one table, those from another key range are placed in a different table, and so on.

Vertically split tables should be created by placing the primary key columns for the old, normalized table into both of the split tables. Designate one of the two, new tables as the parent table for the purposes of referential integrity unless the original table still exists. In this case, the original table should be the parent table in all referential constraints. If this is the case, and the split tables are read only, do not set up referential integrity (RI) for the split tables as they are being derived from a referentially intact source. RI would be redundant.

When a vertical split is being done, always include one row per primary key in each split table. Do not eliminate rows from either of the two tables for any reason. If rows are eliminated the update process and any retrieval process that must access data from both tables will be unnecessarily complicated.

When a horizontal split is being done, try to split the rows between the new tables to avoid duplicating any one row in each new table. This is done by splitting using the primary key such that discrete key ranges are placed in separate split tables. Simply stated, the operation of UNION ALL, when applied to the horizontally split tables, should not add more rows than contained in the original, un-split tables. Likewise, it should not contain fewer rows either.

Combined Tables

If tables exist with a one-to-one relationship consider combining them into a single combined table. Sometimes, even one-to-many relationships can be combined into a single table, but the data update process will be significantly complicated because of the increase in redundant data.

For example, consider an application with two tables: DEPT (containing department data) and EMP (containing employee data). The two tables could be combined into a large table named, for example, EMP_WITH_DEPT. This new table would contain all of the columns of both tables except for the redundant DEPTNO column (the join criteria). So, in addition to all of the employee information, all of the department information would also be contained on each employee row. This will result in many duplicate instances of the department data. Combined tables of this sort should be considered pre-joined tables and treated accordingly. Tables with one-to-one relationships should always be analyzed to determine if combination is useful.

Redundant Data

Sometimes one or more columns from one table are accessed whenever data from another table is accessed. If these columns are accessed frequently with tables other than those in which they were initially defined, consider carrying them in those other tables as redundant data. By carrying these additional columns, joins can be eliminated and the speed of data retrieval will be enhanced. This should only be attempted if the normal access is debilitating.

Consider, once again, the DEPT and EMP tables. If most of the employee queries require the name of the employee's department then the department name column could be carried as redundant data in the EMP table. The column should not be removed from the DEPT table, though (causing additional update requirements if the department name changes).

In all cases columns that can potentially be carried as redundant data should be characterized by the following attributes:

  • only a few columns are necessary to support the redundancy
  • the columns should be stable, being updated only infrequently
  • the columns should be used by either a large number of users or a few very important users

Repeating Groups

When repeating groups are normalized they are implemented as distinct rows instead of distinct columns. This usually results in higher DASD usage and less efficient retrieval because there are more rows in the table and more rows need to be read in order to satisfy queries that access the repeating group.

Sometimes, denormalizing the data by storing it in distinct columns can achieve significant performance gains. However, these gains come at the expense of flexibility. For example, consider an application that is storing repeating group information in the normalized table below:

This table can store an infinite number of balances per customer, limited only by available storage and the storage limits of the RDBMS. If the decision were made to string the repeating group, BALANCE, out into columns instead of rows, a limit would need to be set for the number of balances to be carried in each row. An example of this after denormalization is shown below:

In this example, only six balances may be stored for any one customer. The number six is not important, but the concept that the number of values is limited is important. This reduces the flexibility of data storage and should be avoided unless performance needs dictate otherwise.

Before deciding to implement repeating groups as columns instead of rows be sure that the following criteria are met:

  • the data is rarely or never aggregated, averaged, or compared within the row
  • the data occurs in a statistically well-behaved pattern
  • the data has a stable number of occurrences
  • the data is usually accessed collectively
  • the data has a predictable pattern of insertion and deletion

If any of the above criteria are not met, SQL SELECT statements may be difficult to code making the data less available due to inherently unsound data modeling practices. This should be avoided because, in general, data is denormalized only to make it more readily available.

Derivable Data

If the cost of deriving data using complicated formulae is prohibitive then consider storing the derived data in a column instead of calculating it. However, when the underlying values that comprise the calculated value change, it is imperative that the stored derived data also be changed otherwise inconsistent information could be reported. This will adversely impact the effectiveness and reliability of the database.

Sometimes it is not possible to immediately update derived data elements when the columns upon which they rely change. This can occur when the tables containing the derived elements are off-line or being operated upon by a utility. In this situation, time the update of the derived data such that it occurs immediately when the table is made available for update. Under no circumstances should outdated derived data be made available for reporting and inquiry purposes.


Hierarchies

A hierarchy is a structure that is easy to support using a relational database such as DB2, but is difficult to retrieve information from efficiently. For this reason, applications which rely upon hierarchies very often contain denormalized tables to speed data retrieval. Two examples of these types of systems are the classic Bill of Materials application and a Departmental Reporting system. A Bill of Materials application typically records information about parts assemblies in which one part is composed of other parts. A Department Reporting system typically records the departmental structure of an organization indicating which departments report to which other departments.

A very effective way to denormalize a hierarchy is to create what are called "speed" tables. Figure 3 depicts a department hierarchy for a given organization. The hierarchic tree is built such that the top most node is the entire corporation and each of the other nodes represents a department at various levels within the corporation. In our example department 123456 is the entire corporation. Departments 1234 and 56 report directly to 123456. Departments 12, 3, and 4 report directly to 1234 and indirectly to department 123456. And so on.

The table shown under the tree in Figure 3 is the classic relational implementation of a hierarchy. There are two department columns, one for the parent and one for the child. This is an accurately normalized version of this hierarchy containing everything that is represented in the diagram. The complete hierarchy can be rebuilt with the proper data retrieval instructions.

Figure 3. Classic Relational Implementation of a Department Hierarchy

Even though the implementation effectively records the entire hierarchy, building a query to report all of the departments under any other given department can be time consuming to code and inefficient to process. Figure 4 shows a sample query that will return all of the departments that report to the corporate node 123456. However, this query can only be built if you know in advance the total number of possible levels the hierarchy can achieve. If there are n levels in the hierarchy then you will need n-1 UNIONs.

Figure 4. Querying the Departmental Hierarchy

A "speed" table can be built such as the one in Figure 5. This table depicts the parent department and all of the departments under it regardless of the level. Contrast this to the previous table which only recorded immediate children for each parent. A "speed" table also commonly contains other pertinent information that is needed by the given application. Typical information includes the level within the hierarchy for the given node, whether or not the given node of the hierarchy is a detail node (at the bottom of the tree), and, if ordering within level is important, the sequence of the nodes at the given level.

Figure 5. Speed Table Implementation of a Departmental Hierarchy

After the "speed" table has been built, speedy queries can be written against this implementation of a hierarchy. Figure 6 shows various informative queries that would have been very inefficient to execute against the classical relational hierarchy. These queries work for any number of levels between the top and bottom of the hierarchy.

A "speed" table is commonly built using a program written in COBOL or another high level language. SQL alone is usually either too inefficient to handle the creation of a "speed" table or impractical because the number of levels in the hierarchy is either unknown or constantly changing.

Figure 6. Querying the Speed Table
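The closure-building program mentioned above can be sketched in Python (table layout and names are mine, following the idea of Figure 5): expand the parent-child table into (ancestor, descendant, level) rows by walking the tree, so "all departments under X" becomes a single equality lookup instead of n-1 UNIONs.

```python
def build_speed_table(parent_child_pairs):
    """Expand a (parent, child) adjacency list into "speed" table rows of
    (ancestor, descendant, level), where level is the distance below
    the ancestor. A sketch; real builders would also flag detail nodes."""
    children = {}
    for parent, child in parent_child_pairs:
        children.setdefault(parent, []).append(child)
    rows = []
    def walk(ancestor, node, level):
        # record every descendant of `ancestor`, at any depth
        for child in children.get(node, []):
            rows.append((ancestor, child, level))
            walk(ancestor, child, level + 1)
    for ancestor in children:
        walk(ancestor, ancestor, 1)
    return rows
```

With the Figure 3 departments, asking for everything under 123456 is then just a filter on the ancestor column, regardless of how many levels the hierarchy has.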

Types of Denormalization

We have discussed nine different types of denormalization. The table below will summarize the types of denormalization that are available with a short description of when this type of denormalization is useful.


The decision to denormalize should never be made lightly because it involves a lot of administrative dedication. This dedication takes the form of documenting the denormalization decisions, ensuring valid data, scheduling of data migration, and keeping end users informed about the state of the tables. In addition, there is one more category of administrative overhead: periodic analysis.

Whenever denormalized data exists for an application the data and environment should be periodically reviewed on an on-going basis. Hardware, software, and application requirements will evolve and change. This may alter the need for denormalization. To verify whether or not denormalization is still a valid decision ask the following questions:

Have the processing requirements changed for the application such that the join criteria, timing of reports, and/or transaction throughput no longer require denormalized data?

Did a new DBMS release change performance considerations? For example, did the introduction of a new join method undo the need for pre-joined tables?

Did a new hardware release change performance considerations? For example, does the upgrade to a new high-end processor reduce CPU usage enough that denormalization is no longer necessary? Or did the addition of memory speed up data access enough that the data can be physically normalized?

In general, periodically test whether the extra cost related to processing with normalized tables justifies the benefit of denormalization. You should measure the following criteria:

  • I/O saved
  • CPU saved
  • complexity of update programming
  • cost of returning to a normalized design

It is important to remember that denormalization was initially implemented for performance reasons. If the environment changes it is only reasonable to re-evaluate the denormalization decision. Also, it is possible that, given a changing hardware and software environment, denormalized tables may be causing performance degradation instead of performance gains.

Simply stated, always monitor and periodically re-evaluate all denormalized applications.

Craig S. Mullins is Vice President of Marketing and Operations for the database tools division of PLATINUM technology, inc. He has extensive experience in all facets of database systems development, including developing and teaching database and data modeling classes, systems analysis and design, database and system administration, and data analysis. Craig has worked with DB2 since V1 and has experience in multiple roles including programmer, DBA, instructor, and analyst. His experience spans industries having worked in manufacturing, banking, utilities, commercial software development, consulting and as a computer industry analyst for the Gartner Group. Additionally, Craig authored the popular DB2 Developer's Guide which provides over 1100 pages of tips, techniques, and guidelines to optimize DB2 for MVS.

Craig is also a frequent contributor to computer industry publications having over five dozen articles published during the past few years. His articles have been published in magazines like Byte, DB2 Update, Database Programming & Design, DBMS, Data Management Review, Relational Database Journal, Enterprise Systems Journal, Mainframe Client/Server and others.

Craig graduated cum laude with a degree in Computer Science and Economics from the University of Pittsburgh.


The Data Administration Newsletter
Robert S. Seiner - Publisher and Editor -

Interview question

1. Find two numbers in a binary search tree
which add up to X.

Solution : convert the BST into a doubly linked list in place, using recursion (O(lg n) extra space for the recursion stack on a balanced tree),
then walk head and tail pointers inward, exactly as you would for a sorted array.


2. What are prepared statements in MySql?

3. Given a table (empname, designation, mgrId, empId),
write a query to find each employee's name along with the manager's name.

4. What if the load balancer fails, what is the back up strategy?

5. What happens when you type a URL in the browser and hit enter?

generating all the possible permutations in actionscript/flex

<?xml version="1.0" encoding="utf-8"?>
<mx:Application xmlns:mx="http://www.adobe.com/2006/mxml" layout="absolute" creationComplete="main()">
<mx:Script><![CDATA[
import mx.collections.ArrayCollection;

private function main():void {
    var ac1:ArrayCollection = new ArrayCollection([1, 2, 3]);
    var output:ArrayCollection = permute(ac1);
    for (var i:int = 0; i < output.length; ++i) {
        var tmpAc:ArrayCollection = output.getItemAt(i) as ArrayCollection;
        var singleLine:String = "";
        for (var j:int = 0; j < tmpAc.length; ++j) {
            singleLine += String(tmpAc.getItemAt(j)) + " ";
        }
        trace(singleLine);
    }
}

// returns an ArrayCollection of ArrayCollections, one per permutation
private function permute(inputListOfNumbers:ArrayCollection):ArrayCollection {
    var output:ArrayCollection = new ArrayCollection();
    // base case: a two-element list has exactly two permutations
    if (inputListOfNumbers.length == 2) {
        output.addItem(new ArrayCollection([inputListOfNumbers.getItemAt(0), inputListOfNumbers.getItemAt(1)]));
        output.addItem(new ArrayCollection([inputListOfNumbers.getItemAt(1), inputListOfNumbers.getItemAt(0)]));
        return output;
    }
    for (var i:int = 0; i < inputListOfNumbers.length; ++i) {
        var otherThanIth:ArrayCollection = getElementsOtherThanIth(inputListOfNumbers, i);
        var tmpRes:ArrayCollection =
            prependNumber(inputListOfNumbers.getItemAt(i) as int, permute(otherThanIth));
        for (var j:int = 0; j < tmpRes.length; ++j) {
            output.addItem(tmpRes.getItemAt(j));
        }
    }
    return output;
}

// puts i at the front of every permutation in ac1
private function prependNumber(i:int, ac1:ArrayCollection):ArrayCollection {
    for (var j:int = 0; j < ac1.length; ++j) {
        var tmpAc:ArrayCollection = ac1.getItemAt(j) as ArrayCollection;
        tmpAc.addItemAt(i, 0);
    }
    return ac1;
}

// returns a copy of ac1 without the i-th element
private function getElementsOtherThanIth(ac1:ArrayCollection, i:int):ArrayCollection {
    var output:ArrayCollection = new ArrayCollection();
    for (var j:int = 0; j < ac1.length; ++j) {
        if (i != j) {
            output.addItem(ac1.getItemAt(j));
        }
    }
    return output;
}
]]></mx:Script>
</mx:Application>

Thursday, December 2, 2010

Dijkstra's algorithm in PHP

$d['a']['b'] = 10;
$d['a']['c'] = 2;
$d['b']['d'] = 1;
$d['c']['d'] = 1;
$d['c']['e'] = 7;
$d['d']['e'] = 20;
$d['e']['f'] = 9;

$startNode = 'a';
$d[$startNode][$startNode] = 0;

// label-correcting variant: a node is pushed again whenever its distance improves
$nodeStack = array_keys($d[$startNode]);
while (count($nodeStack) > 0) {
    $curr = array_pop($nodeStack);
    if (!isset($d[$curr])) continue;          // no outgoing edges
    foreach ($d[$curr] as $v3 => $d3) {
        $newDist = $d[$startNode][$curr] + $d3;
        if (!isset($d[$startNode][$v3]) || $d[$startNode][$v3] > $newDist) {
            $d[$startNode][$v3] = $newDist;   // shorter path found
            array_push($nodeStack, $v3);
        }
    }
}
// $d['a'] now holds the shortest distance from 'a' to every reachable node


Yahoo Interview Questions

1. Find the combined median of two sorted arrays.
2. How do you implement T9 word suggestion/spell correction on mobile phones?
3. Given a lot of (x,y) points on the Cartesian plane, draw a straight line such that
the sum of (length of the line) + (perpendicular distances from the points to the line) is minimum.
4. How do you search for an element in a 2-D array where every row and every column is sorted?
5. Detect the start of the loop in a linked list.
6. There is an array with 1 million numbers, containing only 0s, 1s and 2s. How do you sort it?
What if the elements are complex objects having a field "type" with values 0, 1, 2?

Binary Search C recursive and iterative

int arrayBinSearchIter(int* arr, int val, int len) {
    int min = 0;
    int max = len - 1;
    int curr = -1;
    while (min <= max) {
        curr = min + (max - min) / 2;   /* avoids overflow of (min+max)/2 */
        if (arr[curr] == val) return curr;
        if (arr[curr] < val) min = curr + 1;
        if (arr[curr] > val) max = curr - 1;
    }
    return -1;   /* not found */
}

int arrayBinSearchRec(int* arr, int val, int min, int max) {
    if (min > max) return -1;   /* not found */
    int curr = min + (max - min) / 2;
    if (arr[curr] == val) return curr;
    if (arr[curr] < val) min = curr + 1;
    if (arr[curr] > val) max = curr - 1;
    return arrayBinSearchRec(arr, val, min, max);
}

Wednesday, December 1, 2010

Flex Interview Questions

1. What are synchronous/asynchronous errors and how do you handle them?
2. Is an HttpService call synchronous or asynchronous?
3. What is the parent class of Image?
4. How do you add a Sprite to a Canvas?
5. What is HTMLWrapper?
6. What is the compiler argument for specifying the targeted Flash Player version?
7. How do you do pagination for datagrids?

Friday, November 26, 2010

Web Server Issues

1. Network I/O
2. Disk I/O
3. CPU
4. Memory

A. Dig for the root cause: too many processes may start thrashing,
which may look like a disk I/O problem but is actually a memory problem.

Tools : sar, sysstat, iostat, top, vmstat, nstat

Apache performance testing : TSUNG

Apache usually dies at 5000 concurrent connections.
It usually works in pre-fork/worker model.
Pre-fork : OS kernel does the load balancing.
Worker : Apache master process does the load balancing.

Usually every process spawns many threads.
A process has 4GB of virtual address space (on 32-bit systems),
shared by all its threads.
So, if there are too many threads, e.g. 500,
one misbehaving thread can bring down all the others.

So it's better to have 5 processes with 100 threads each.

But in Apache, creating many threads has its own problems :
every thread has its own kernel stack and user stack, which eats
up resources.

Lighttpd works on an event-driven model.
It accepts connections and passes the baton
to a FastCGI backend; when the backend returns,
lighttpd sends the response back to the requester.

Since it is event driven, it scales better.

Work is ongoing to make lighttpd work better on multiple cores,
so that it can spawn a separate process for every core.

Front-end optimization :
In .htaccess, set headers for static images (e.g. a TTL (time to live), the time after which the client should fetch the image again).
Caching proxy layer : varnish/squid : saves db query time and php script running time.
A separate caching layer, as opposed to php's in-built caching, may help diagnose issues better.
- It is a proxy because caching runs at :80 while the server runs at :8080.

you can also look into php opcode caching...


look at 31 yahoo tips, yahoo performance blog, mysql performance blog

If there are too many threads running, and they use external
libraries (code), it's better to kill a thread after it has served
5000 (n) requests, since by then it will have accumulated many memory leaks,
which are hard to find.

MySql General Info

The ib_logfile files are InnoDB's write-ahead (redo) logs (changes are written to the log before being applied to the data files in ibdata)
- used for recovering a crashed server

binlog -> log of queries, written after commit
- used for master/slave replication

Slave replication is async, i.e. there are 2 threads on the slave :
1. the I/O thread reads queries from the master and writes them to the slave's relay log
2. the SQL thread reads from the relay log and executes the queries
hence it's async.

If it were sync, it would hamper the performance of the master, since
the master would have to wait for the slave to write to its db.


To set up a slave :
When you take the dump from the master, specify an option (e.g. mysqldump --master-data) which will
tell the slave from where to start reading the master's logs.

You can specify in the slave config which dbs/tables to include/exclude.

show slave status\G
start slave
keep taking periodic full dumps

Tuesday, November 23, 2010

Intellij Idea : Using framework as a RSL(Remote shared library) in a Flex project

In Flex Builder 3:
Project -> Properties -> Flex Build Path -> Library Path -> set Framework linkage to "Runtime shared library (RSL)".
Flex Builder also copies the required files.

In Intellij Idea 9.0.3 : 
Right click on a Module -> Module Settings -> Flex Compiler Settings -> Advanced->
Select Use Framework as Runtime Shared Library (RSL)

Unlike Flex Builder, Idea doesn't copy the required .swz and .swf files.
You have to copy them manually to the location where your main application swf resides.
Copy them from here : 
{location of your flex sdk}/frameworks/rsls/

To verify if RSL is working for you, clear the Flash Player Cache and run your flex application.
You should get a swz or swf file of roughly 525KB in your cache.

On Windows XP, the location of the Flash Player Cache : 
C:\Documents and Settings\{User name}\Application Data\Adobe\Flash Player\AssetCache\

On linux, it's : 

Monday, November 22, 2010

Installing and running hadoop on windows xp in standalone mode

Prereqs : have cygwin installed
2. unpack the file hadoop-0.20.2.tar.gz in C:/ (using tar xvf filename)
3. Add the following to conf/ in the unpacked folder : 
export JAVA_HOME=/cygdrive/c/Java/jdk1.6.0_23
(assuming that C:/Java/jdk1.6.0_23/bin contains javac.exe and other binaries)

5. Unpack the file tomwhite-hadoop-book-32dae01.tar.gz
6. in the unpacked folder : cd ch02/src/main/java
7. mkdir -p build/classes
8. $ javac -verbose -classpath C:\\hadoop-0.20.2\\hadoop-0.20.2-core.jar MaxTemperature*.java -d build/classes
9. export HADOOP_CLASSPATH=build/classes
10. hadoop MaxTemperature ../../../../input/ncdc/sample.txt output

your output is in output folder

Thursday, November 18, 2010

Shell: showing lines which are present only in one file

comm -13 <(sort a.txt) <(sort b.txt)
will show only the lines which are present in b.txt
but not in a.txt.
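A quick self-contained demo; the file names and contents below are made up for illustration (process substitution requires bash):

```shell
# two throwaway sample files
printf 'apple\nbanana\ncherry\n' > a.txt
printf 'banana\ncherry\ndate\n'  > b.txt

# lines present in b.txt but not in a.txt
comm -13 <(sort a.txt) <(sort b.txt)
# → date
```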

Monday, November 15, 2010

Vim regexp example

I have a file : 

So, line 1 has "alert" without any white spaces.
line 2 has white spaces followed by "alert"
line 3 has white spaces,"//" followed by "alert"

I want to find "alert" where it is not commented out, i.e. not preceded by "//"

Here is the regexp : 


which is basically :
(beginning of line OR absence of "//") followed by "alert"

Shell scripts on Cygwin, xargs in general

Lessons learned today : 
1. Convert the scripts and the files they operate upon to Unix line endings, using d2u.

2. For debugging, add #!/bin/bash -x at the top of the script, so that you can see
which variable gets expanded to what.

3. For debugging xargs 
(i) find out whether the command being used with xargs expects a single operand or multiple operands; e.g.
find . -name expects a single operand, whereas grep -i can operate on multiple operands.

(ii) if you need the command to run once per input item (rather than batching many items into one invocation), do :
xargs --max-args=1

(iii) use the -t option with xargs to see what's going on.
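A small illustration of tips (ii) and (iii), with made-up input:

```shell
# --max-args=1 runs the command once per input item
printf 'one two three\n' | xargs --max-args=1 echo
# → one
#   two
#   three

# -t echoes each constructed command line to stderr before running it
printf 'a b\n' | xargs -t echo 2>&1
```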

Friday, November 12, 2010


cat filename | aspell list
produces the misspelled words

cat filename | aspell munch-list simple
produces the roots of the words

Built-in shell variables

# Number of arguments given to current process.

@ Command-line arguments to current process. Inside double quotes, expands to individual arguments.

* Command-line arguments to current process. Inside double quotes, expands to a single argument.

- (hyphen) Options given to shell on invocation.

? Exit status of previous command.

$ Process ID of shell process.

0 (zero) The name of the shell program.

! Process ID of last background command. Use this to save process ID numbers for later use with the wait command.

ENV Used only by interactive shells upon invocation; the value of $ENV is parameter-expanded. The result
should be a full pathname for a file to be read and executed at startup. This is an XSI requirement.
HOME Home (login) directory.

IFS Internal field separator; i.e., the list of characters that act as word separators. Normally set to space, tab,
and newline.

LANG Default name of current locale; overridden by the other LC_* variables.

LC_ALL Name of current locale; overrides LANG and the other LC_* variables.

LC_COLLATE Name of current locale for character collation (sorting) purposes.

Shell : getopts

use getopts for routine command line option processing
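A minimal sketch of getopts in action; the option letters (-v, -o) and the simulated command line are invented for illustration:

```shell
#!/bin/bash
# simulate a command line: -v (flag), -o FILE (takes an argument), then an operand
set -- -v -o out.txt rest1

verbose=0
outfile=""
while getopts "vo:" opt; do
    case $opt in
        v) verbose=1 ;;
        o) outfile=$OPTARG ;;
        *) echo "usage: $0 [-v] [-o file] args..." >&2; exit 1 ;;
    esac
done
shift $((OPTIND - 1))     # drop the parsed options; operands remain in "$@"

echo "verbose=$verbose outfile=$outfile rest=$*"
# → verbose=1 outfile=out.txt rest=rest1
```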

Shell: exit status of last command

echo $?
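For example:

```shell
true
echo $?    # → 0

false
echo $?    # → 1
```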

Shell: setting unsetting functions

who_is_on () {                            # define a function
    who | awk '{ print $1 }' | sort -u    # generate sorted list of logged-in users
}
. . .
unset -f who_is_on                        # remove the function

Thursday, November 11, 2010

The Pragmatic Programmer : tips

1. Test state coverage, not code coverage.
2. Test early, test often.
3. Refactor early, refactor often.
4. Costly tools don't produce better designs.

The Pragmatic Programmer : tips

1. Invest regularly in your knowledge portfolio
2. Keep knowledge in plain text
3. You can't write perfect software
4. Don't be a slave to formal methods
5. Sign your work 


Wednesday, November 10, 2010

Mysql : percentile of queries completed in a specific duration, on weekly basis

What I have :
A MySql table with 2 columns : load_start_time,load_end_time, both the
columns contain unix timestamps.

What I need :
Weekwise data in the following format :
Week Time_taken_for_load_to_complete percentile_of_loads_completed_in_that_time

Here are the requisite queries :

drop table if exists tbl;

create table tbl as (
  select week(date(from_unixtime(load_start_time))) as w,
         round((load_end_time - load_start_time)/10) as 10SecInterval,
         count(1) as numReqs
  from user_load_data
  where load_start_time is not null
    and load_end_time is not null
  group by w, 10SecInterval
  having (10SecInterval < 240 and 10SecInterval >= 0)
);

-- compute percentile
drop table if exists tbl1;
create table tbl1 as (
  select a.w,
         a.10SecInterval*10 as seconds,
         round(100 * ((select sum(numReqs) from tbl as b
                       where b.w = a.w and b.10SecInterval <= a.10SecInterval)
                      /
                      (select sum(numReqs) from tbl as c where c.w = a.w))) as percentile
  from tbl as a
);

select * from tbl1 limit 200;

Monday, October 25, 2010

Taglist in shell

Tag list source code.


Shell word frequency filter

Also available here .

#!/bin/sh
# Read a text stream on standard input, and output the n (default: 25)
# most frequent words and their counts, on standard output.
# Usage:
# wf [n]
tr -cs A-Za-z\' '\n' |
tr A-Z a-z |
sort |
uniq -c |
sort -k1,1nr -k2 |
sed ${1:-25}q

cat fileName.txt | ./wf 12

Description :

tr -cs A-Za-z\' '\n' |        Replace nonletters with newlines
  tr A-Z a-z |                Map uppercase to lowercase
    sort |                    Sort the words in ascending order
      uniq -c |               Eliminate duplicates, showing their counts
        sort -k1,1nr -k2 |    Sort by descending count, and then by ascending word
          sed ${1:-25}q       Print only the first n (default: 25) lines

$ wf 999999 < filename.txt | awk '$1 >= 5' | wc -l
number of unique words occurring at least 5 times

$ wf 999999 < hamlet | tail -n 12 | pr -c4 -t -w80
some of the least frequent words

$ wf 999999 < hamlet | wc -l
number of unique words

$ wf 999999 < hamlet | egrep -c '[[:space:]]+1[[:space:]]+'
number of words used exactly once


Friday, October 22, 2010

tr shell command example

cat file(s) | tr A-Z a-z | tr -c a-z\' '\n'

The second pipeline stage converts uppercase to
lowercase; the third replaces nonletters with newlines,
treating apostrophes as letters.
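A small demo of the same pipeline on a made-up string (the character set is quoted here instead of escaped, which is equivalent):

```shell
printf "It's A TEST" | tr A-Z a-z | tr -c "a-z'" '\n'
# → it's
#   a
#   test
```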

shell script for converting tab separated values to html table


#!/bin/sh
# Convert a tab-separated value file to grammar-conformant HTML.
# Usage:
#    tsv-to-html < infile > outfile

cat << EOFILE                  # leading boilerplate
<HTML>
  <HEAD>
    <TITLE>Office directory</TITLE>
    <LINK REV="made" HREF="mailto:$USER@`hostname`">
  </HEAD>
  <BODY>
    <TABLE>
EOFILE

sed -e 's=&=\&amp;=g' \        # convert special characters to entities
    -e 's=<=\&lt;=g' \
    -e 's=>=\&gt;=g' \
    -e 's=\t=</TD><TD>=g' \    # and supply table markup
    -e 's=^.*$=      <TR><TD>&</TD></TR>='

cat << EOFILE                  # trailing boilerplate
    </TABLE>
  </BODY>
</HTML>
EOFILE

a fantastic resource for unix command line examples

cut and awk for same job : extract fields and join them

$ cat test.txt | cut -d: -f1,5 --output-delimiter="-"

$ cat test.txt | awk -F: '{print $1 "-" $5}'

Both commands print fields 1 and 5 of each colon-delimited line in test.txt, joined by "-".

shell pattern matching

The problem is that the asterisks can match anything,
 even the commas used as delimiters. However,
 if you add the delimiters to the pattern, you can
 no longer match the ends of the list:

list="orange, apple, banana"
case $list in
*", orange, "*)        echo "The only fruit for which there is no Cockney slang." ;;
esac

[no output]

To resolve this, wrap the list in an extra set of delimiters when expanding it:

case ", $list, " in
*", orange, "*)        echo "The only fruit for which there is no Cockney slang." ;;
esac

The only fruit for which there is no Cockney slang.

shell case switch

#!/bin/sh
pattern="$1"
shift
echo "Matching against '$pattern':"
for string
do
  case $string in
    $pattern) echo "$string: Match." ;;
    *)        echo "$string: No match." ;;
  esac
done
Save this script to a file named pattern, make it executable (chmod a+x pattern), and you 
can use it to perform your own tests:

$ ./pattern '*' 'hello'
Matching against '*':
hello: Match.

$ ./pattern 'hello*' 'hello' 'hello, there' 'well, hello'
Matching against 'hello*':
hello: Match.
hello, there: Match.
well, hello: No match.

Remember to use single quotes around the pattern argument, so the shell doesn't expand it before the script sees it.

shell command line looping (tested on cygwin)

% for file in *.bar ; do
>   echo $file
>   mv $file `echo $file | sed 's/\.bar$/.foo/'`
> done

% for file in `cat file_list` ; do
>   echo $file
>   mv `echo $file | sed 's/\\r//'` `echo $file | sed 's/\.bar$/.foo/'`
> done


An example: To see the list of lines in a file, sorted by the number of times each occurs:

sort file | uniq -c | sort -n
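For example, with a throwaway file:

```shell
printf 'b\na\nb\nc\nb\na\n' > lines.txt
sort lines.txt | uniq -c | sort -n
# least frequent first: c once, a twice, b three times
```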


Wednesday, October 20, 2010

The Pragmatic Programmer

4 key take-aways : 

1. Design with concurrency in mind.
 - i.e. assume that you are designing a multi-threaded application
even though you may be designing a single-threaded app.
It helps you design cleanly and avoid temporal coupling.

Temporal coupling -> a;b; works but b;a; does not,
i.e. the order of execution is critical.

An application designed in such a way (thread-safe) will
scale better and, if need be, can be adapted to a multi-threaded design.

2. Get better at command line tools : awk/sed/gawk/for each/perl/grep etc.

3. Learn a new programming language every year

4. Read a technical book every quarter.

Tuesday, October 19, 2010

Thursday, October 14, 2010

Actionscript singleton class template IntelliJ Idea

Thanks to Gaurav Kushwaha, here is a singleton class template
for actionscript which you can use with Intellij Idea : 

package ${PACKAGE_NAME} {

public class ${NAME} #if (${Super_class_name} != "")extends ${Super_class_name} #end{
    private static var _instance:${NAME};

    public static function getInstance():${NAME} {
        if (_instance == null) {
            _instance = new ${NAME}(new Dummy());
        }
        return _instance;
    }

    // AS3 has no private constructors; the Dummy parameter prevents
    // instantiation from outside this file
    public function ${NAME}(dummy:Dummy) {
    }
}
}

class Dummy {
    public function Dummy() {
    }
}

sorting lines with vim

For e.g.

test.txt : 

If I say ->
:1,4 sort

test.txt will become

:.,+3 sort
will sort from current line to current line + 3.

:1,$ sort u
will sort and remove duplicates

To sort in reverse, use :sort! (for e.g. :1,$ sort!)

PS: This functionality doesn't work with IntelliJ Idea's vim plugin

Shell sample commands

Need to create a list of all the unique package names explicitly imported
by your Java code? The following stores it in a file called "list."

grep '^import ' *.java |
sed -e 's/.*import *//' -e 's/;.*$//' |
sort -u > list

(sed -e is used to give multiple commands to sed)

Shell sample commands

Which Java files have not been modified in the last week?
find . -name '*.java' -mtime +7 -print

Which Java files have been modified in the last week?
find . -name '*.java' -mtime -7 -print

Finding all files newer than a given file

find all c files newer than the Makefile :

find . -name '*.c' -newer Makefile -print

Problems in running shell scripts on cygwin

I copy-pasted this code :

N=1
while test "$N" -le "10"
do
    echo "Number $N"
    N=$((N+1))
done


Then ran it like : 

To get : 
./ line 7: syntax error: unexpected end of file

The issue was that Windows/DOS files don't work well
as shell scripts, due to the difference in newline formatting.

So, the solution is the dos2unix conversion command (d2u on Cygwin) :

And, that's all.
Run happily ever after.

extracting social security number from an xml file using grep and cut


here is the command (the regexp matches dash-separated digit groups) :

grep -o '\([0-9]\+-\)\+[0-9]\+<' file.txt | cut -d "<" -f 1
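A demo with a fabricated XML line (the value shown is obviously not a real SSN):

```shell
printf '<record><ssn>123-45-6789</ssn></record>\n' > file.txt

# grep -o prints only the match (digits-with-dashes followed by "<"),
# and cut strips the trailing "<"
grep -o '\([0-9]\+-\)\+[0-9]\+<' file.txt | cut -d "<" -f 1
# → 123-45-6789
```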

Wednesday, October 13, 2010

Linting php files from command line in Cygwin

$ ls -1 *.php | xargs -n1 ./php.exe -l | grep -i "errors parsing"

Executing vim commands from command line


I executed the following when I wanted to replace
all "include" statements in my php files to "include_once"

vim.exe -c "argdo %s/include /include_once /ge | update" *.php

Friday, October 8, 2010

bookmarklet for changing a page's max size


Flex 4 difference between property and class

Now, how do you know, regardless of the definition style, which is a property
and which is a class?

<s:Group>
  <s:layout>
    <s:VerticalLayout/>
  </s:layout>
  <s:Button label="1"/>
  <s:Button label="2"/>
  <s:Button>
    <s:label>3</s:label>
  </s:Button>
</s:Group>

The G in the <s:Group> tag is uppercase, so it refers to an instance of a Flex class named Group. The l in the <s:layout> tag is lowercase, so it is a property of the Group. The V in the <s:VerticalLayout> tag is uppercase, so it is referring to a new instance of a Flex class named VerticalLayout.
