Tuesday, May 26, 2015

Production Incident 10: Default Proxy, Specific Proxy, or CORS When the Remote Name Cannot Be Resolved

Issue: The remote name cannot be resolved.

Scenario: We call a third-party or cross-domain API from within our Web API.

Resolution:
  1. Enable CORS at the third-party API, or
  2. Include the default proxy, or
  3. Configure a specific proxy in your client Web API.
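
Before changing proxy or CORS settings, it can help to confirm that the failure really is name resolution as seen from the web server itself. The following is a minimal Python sketch, assuming it is run on the server that hosts the client Web API. The function name `can_resolve` and the injectable `resolver` parameter are hypothetical and exist only for this illustration.

```python
import socket

def can_resolve(host, resolver=socket.getaddrinfo):
    """Return True if the host name resolves to at least one address."""
    try:
        # Port 443 is an assumption: the third-party API is expected to use HTTPS.
        return len(resolver(host, 443)) > 0
    except socket.gaierror:
        # This is the Python counterpart of "remote name cannot be resolved".
        return False
```

Calling `can_resolve("api.example.com")` with the real API host in place of the placeholder shows whether the server can resolve the name directly. If it cannot, the direct route is blocked and a proxy is the likely fix.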



"usesystemdefault="True" />

Enable CORS
Error Slab:


Friday, May 1, 2015

Production Support Incident 6: Bad Architecture Design: Database Server Box Hosting an IIS Web Server

Architecture Scenario:
A shared SQL Server box (physical or VM) holds shared databases used by several web servers, each hosting a different website. Assume that IIS is also configured on this database box and hosts a WCF service. The original reasoning was probably that the WCF service stores its data in tables on that server, so it might as well live there. This is bad design.

What is the problem?
When RAM is allocated to the SQL Server box, SQL Server consumes almost all of it and leaves very little for OS activity. With IIS and the WCF service on the same box, little RAM remains for IIS and the OS. The result is memory exhaustion or network I/O waits for the websites that send requests to the WCF service hosted in IIS on the database server.

Workaround: Explicitly allocate RAM for SQL Server and for IIS so that requests can be processed. The flip side is that when SQL Server comes under load, its RAM consumption may shoot up to its peak, which may result in suspended transactions and requests processed with added latency.
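
The allocation in the workaround is simple arithmetic: whatever is reserved for the OS and IIS comes off the total, and the remainder becomes the cap for SQL Server. A minimal Python sketch follows. The function names and the reserve sizes in the usage note are assumptions, not measured values; `max server memory (MB)` is the SQL Server option that sets the cap.

```python
def sql_max_server_memory_mb(total_ram_mb, os_reserve_mb, iis_reserve_mb):
    """Return the RAM (in MB) left for SQL Server after the OS and IIS reserves."""
    remaining = total_ram_mb - os_reserve_mb - iis_reserve_mb
    if remaining <= 0:
        raise ValueError("reserves exceed the RAM on the box")
    return remaining

def sp_configure_statement(max_memory_mb):
    """Build the T-SQL that caps SQL Server memory at the given value."""
    return (
        "EXEC sp_configure 'show advanced options', 1; RECONFIGURE; "
        f"EXEC sp_configure 'max server memory (MB)', {max_memory_mb}; RECONFIGURE;"
    )
```

For example, on a hypothetical 32 GB box with 4 GB reserved for the OS and 4 GB for IIS, `sql_max_server_memory_mb(32768, 4096, 4096)` gives 24576 MB for SQL Server. The right reserves depend on the actual workload and should be checked against monitoring data.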

This is a design flaw at the infrastructure level, so it is always important to assess and revisit the architecture.


Production Support Incident 5: Enable Default Proxy to Access Internet Content or Any API Outside the DMZ

Sometimes production web servers are not allowed to access the outside internet, to reduce the server's exposure to attacks.
Sometimes a private intranet web server needs to call a web API that is hosted externally, for example in the cloud. In that case an internal proxy on the network can be used to make the call across the firewall.
To enable the proxy setting, I added the following to the config file:

...." usesystemdefault="True" />
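
The same idea, sending outbound calls through an internal proxy instead of directly to the internet, can be sketched outside of .NET configuration. The Python example below uses the standard library's `urllib.request.ProxyHandler`. The helper name `build_proxy_opener` and the proxy address in the usage note are placeholders, not values from the incident.

```python
import urllib.request

def build_proxy_opener(proxy_url):
    """Return an opener that sends HTTP and HTTPS requests through proxy_url."""
    proxy_handler = urllib.request.ProxyHandler(
        {"http": proxy_url, "https": proxy_url}
    )
    # Passing our own ProxyHandler replaces the default, environment-based one.
    return urllib.request.build_opener(proxy_handler)
```

A call such as `build_proxy_opener("http://proxy.internal.example:8080").open(url, timeout=10)` would then go out through the proxy, assuming the proxy permits the destination.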

Even after access through the proxy is allowed, the server certificates may not be configured properly, which may result in further exceptions. Ensure the required certificates are installed so that the handshake with the API succeeds.

Check the certificates on the server.

Production Support Incident 3: Disable SSLv3 (POODLE Attack) on Azure Web Role and Worker Role

POODLE attack: SSLv3 enabled

Use a POODLE test tool to check that your website's server is POODLE-free.

Courtesy: http://en.wikipedia.org/wiki/POODLE

The POODLE attack (which stands for "Padding Oracle On Downgraded Legacy Encryption") is a man-in-the-middle exploit which takes advantage of Internet and security software clients' fallback to SSL 3.0.

Websites and servers should be POODLE-free; they are vulnerable if the SSLv3 protocol is enabled. On a regular server SSLv3 can be disabled through regedit, whereas in the cloud it has to be disabled with a startup script referenced in the service definition.

1. Add a .cmd batch file as the startup script in the role project folder.
2. Give the path of the startup script in the ServiceDefinition configuration under the Task tag.
3. Keep the PowerShell file in the root of the website or web role.
4. Ensure the .ps1 file is always copied with the content. Right-click the PowerShell file in the Visual Studio solution and enable this option in its properties.

Even after deployment, your SSL scan may still show a C grade because of a WebSEAL or WAF (web application firewall) in front of the site. In that case you may have to disable SSLv3 on the server that acts as the firewall. Check with your infrastructure team.

Check for any WAF (web application firewall) environment behind which your Azure web role may reside.



Please engage the team that maintains these Linux boxes and follow the instructions below to disable SSLv3.

Web servers


Apache

Put the following line in your configuration file, or replace any existing line starting with SSLProtocol:

SSLProtocol All -SSLv2 -SSLv3

Then run: sudo apache2ctl configtest && sudo service apache2 restart.

Don't forget to test your website.


Nginx

Put the following line in your configuration file, or replace any existing line starting with ssl_protocols:

ssl_protocols TLSv1 TLSv1.1 TLSv1.2;

Then restart the server (in Ubuntu: sudo service nginx restart).

Don't forget to test your website.


Lighttpd releases before 1.4.28 allow you to disable SSLv2 only.

If you are running at least 1.4.29, put the following lines in your configuration file:

ssl.use-sslv2 = "disable"
ssl.use-sslv3 = "disable"

Then restart the server (in Ubuntu: sudo service lighttpd restart).

Don't forget to test your website.
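
As a quick sanity check after a restart, a client can report which protocol it ends up negotiating with the server. The Python sketch below does that with the standard `ssl` module. It is only a smoke test: a modern client never offers SSLv3, so a good result here does not prove SSLv3 is disabled, and a scanner that actively offers SSLv3 is still needed. The function and constant names are hypothetical.

```python
import socket
import ssl

LEGACY_PROTOCOLS = {"SSLv2", "SSLv3"}

def is_legacy_protocol(version):
    """Return True for protocol names that should no longer be negotiated."""
    return version in LEGACY_PROTOCOLS

def negotiated_protocol(host, port=443, timeout=10):
    """Connect to host and return the protocol name the server negotiated."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            # version() returns a name such as "TLSv1.2" or "TLSv1.3".
            return tls.version()
```

Running `is_legacy_protocol(negotiated_protocol("www.example.com"))` against your own host name should return False once the server configuration is correct.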

Wednesday, April 8, 2015

Production Support Incident 2: Never Rely on LINQ Object IEnumerable

If you support an application that uses NHibernate or Entity Framework without stored procedures, there will be speed breakers ahead on your journey. It may work fine for a given capacity and user base, but it can sometimes surprise you.

Something like the line below in your query should ring an alarm bell.
return logEntries.ToList().Take(10);

This query brings the entire result set from the database to the web server and only then takes 10 records. Imagine that, for some combination of data, the database returns thousands of records and the web server then processes that whole result set: your web server CPU will surely spike. There will be intermittent downtime under concurrent users. If there is a caching profile, there could be a race condition when the cache entries are created. Hung and suspended transactions in SQL Server can also occur, and so on.
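
The cost difference is easy to see in a small simulation. The Python sketch below is an analogy, not LINQ: a generator stands in for the database cursor, and a counter records how many rows are actually pulled. All names are hypothetical. In LINQ itself the usual remedy is to apply Take before ToList, assuming `logEntries` is an IQueryable so that the provider can push the limit into the SQL query.

```python
from itertools import islice

def fetch_rows(total, counter):
    """Simulate a database cursor that yields rows one at a time."""
    for i in range(total):
        counter["fetched"] += 1
        yield {"id": i}

def first_ten_eager(total):
    """Materialize every row, then slice: the ToList().Take(10) pattern."""
    counter = {"fetched": 0}
    rows = list(fetch_rows(total, counter))[:10]
    return rows, counter["fetched"]

def first_ten_lazy(total):
    """Stop pulling rows after the tenth: the Take(10).ToList() pattern."""
    counter = {"fetched": 0}
    rows = list(islice(fetch_rows(total, counter), 10))
    return rows, counter["fetched"]
```

With 5000 simulated rows, the eager version pulls all 5000 to return 10, while the lazy version pulls only 10.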

You may even resort to the brute-force method of KILL SPID to buy yourself some time. If you work in support, it is good to at least know what a LINQ query does in the background; surface-level knowledge will not help. You may take the shortcut of scaling up the server configuration, but that is a short-term solution, and the problem may blow out of proportion again next month as users and processes increase.

Stopgap arrangement: keep P1 incidents at bay until you fix the root cause.


      PRINT 'Checking for long running processes'

      DECLARE @TRANSACTION_STATUS varchar(40)
      -- decimal without a scale would round 0.001 to 0, so declare it explicitly
      DECLARE @TimeElapsed decimal(9,4)

      -- Status to look for in the sp_who2 output (assumed from the suspended check below)
      SET @TRANSACTION_STATUS = 'SUSPENDED'
      -- Age threshold as a fraction of a day (0.001 day is about 86 seconds)
      SET @TimeElapsed = 0.001

      -- Drop a leftover table from an earlier failed run
      IF OBJECT_ID('tempdb..##temp') IS NOT NULL DROP TABLE ##temp

      CREATE TABLE ##temp (
      [SPID] [varchar] (13),
      [Status] [varchar] (120),
      [Login] [varchar] (120),
      [HostName] [varchar] (120),
      [BlkBy] [varchar] (13),
      [DBName] [varchar] (120),
      [Command] [varchar] (130),
      [CPUTime] [varchar] (120),
      [DiskIO] [varchar] (120),
      [LastBatch] [varchar] (130),
      [ProgramName] [varchar] (140),
      [SPID2] [varchar] (13),
      [REQUESTID] [varchar] (13)
      )

      --Keep Only Recipe Related Suspended Logs /Details to process further
      INSERT INTO ##temp
      (SPID,[Status],[Login],HostName,BlkBy,DBName,Command,CPUTime,DiskIO,LastBatch,ProgramName,SPID2, REQUESTID)
      EXECUTE sp_who2

      --The host names below are the load-balanced web servers.
      DELETE FROM ##temp WHERE NOT [login] = 'xyz' OR NOT dbname = 'abcDB' OR HostName NOT IN ('01-VM','02-VM','03-VM')

      -- sp_who2 reports LastBatch without a year, so prefix the current year
      UPDATE ##temp SET lastbatch = Convert(DateTime, Convert(VarChar(4), Year(GetDate())) + '/' + lastbatch)

      SELECT spid, lastbatch FROM ##temp WHERE lastbatch < (getdate() - @TimeElapsed)

      --Check for suspended transactions older than @TimeElapsed
      IF (SELECT count(*) FROM ##temp WHERE lastbatch < (getdate() - @TimeElapsed) AND Status = @TRANSACTION_STATUS) <> 0
      BEGIN
            SET NOCOUNT ON
            DECLARE @spid varchar(13)
            DECLARE @cmd varchar(20)

            -- If several rows match, only one SPID is picked per run
            SELECT @spid = ltrim(rtrim(spid)) FROM ##temp WHERE lastbatch < (getdate() - @TimeElapsed) AND Status = @TRANSACTION_STATUS

            PRINT 'Start Processing'
            -- Kill the rogue process
            PRINT 'Process to be killed is: ' + @spid
            SELECT @cmd = 'kill ' + @spid
            PRINT @cmd
            EXEC (@cmd)
      END

      DROP TABLE ##temp



Sunday, March 29, 2015

Production Support Incident 1. SQL Server Suspended Transaction And IO Wait Issue

This is the most critical finding when there is an issue with application downtime.

Application: CMS System- Content Management System

Technology: Custom ASP.NET

Scenario: For any CMS, caching plays an essential role. To improve the overall user experience and the responsiveness of the system, as a rule of thumb and an architectural design norm, the CMS should always be initialized with its cache populated. Content is cached once so that there is no more chatty communication with SQL Server, or with the database in general. This matters because most content on a public-facing CMS website is global and applies to all users. In that scenario the best practice is to cache the content and the most common elements once and reuse them throughout the day.

So when we consider caching, the following design principles must be taken care of:
Lifecycle of the cache: the age of cached entries.
Frequency and timing of content changes by business users, so that the changes are reflected during business as usual.
The warm-up option in IIS, to reduce the impact of cache expiration on users.
Importantly, the amount of data cached: its impact on the w3wp process in IIS, on CPU utilization, and on the heap memory consumed by the results of the SQL query.
For mission-critical applications, keep logic outside the application layer; keep it in the database for quick fixes and resolution. If logic is embedded in LINQ queries within the application layer, expect a huge business impact and application downtime.
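
The first and third principles above, cache age and warm-up, can be sketched in a few lines. The Python example below is a minimal illustration under assumed names (`TtlCache`, `warm_up`, and the injectable `clock` are all hypothetical); it is not how ASP.NET or IIS implement caching.

```python
import time

class TtlCache:
    """Cache values for a fixed number of seconds, reloading them on expiry."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # key -> (expires_at, value)

    def get(self, key, loader):
        """Return the cached value, calling loader() only when it is stale."""
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]
        value = loader()
        self._entries[key] = (now + self._ttl, value)
        return value

    def warm_up(self, loaders):
        """Preload entries so the first visitor does not pay the load cost."""
        for key, loader in loaders.items():
            self.get(key, loader)
```

Calling `warm_up` at application start moves the database round trips away from the first user requests. The time-to-live should be chosen to match how often business users change content.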

1. Quick Checks:

USE master;
EXEC sp_who2 'active';

If there is a suspended-transaction SPID, there is a serious problem. If the suspended transaction does not clear within 10 seconds, there is a potential issue with memory or with the query completing its execution.

2. Quick check: async_network_io waits in SQL Server

3. Quick check: page latches above 20

SELECT session_id, wait_type, resource_description FROM sys.dm_os_waiting_tasks WHERE wait_type LIKE 'PAGELATCH%'

Either optimize the query
or increase the RAM of the SQL Server box.

Thursday, March 19, 2015

Production Support Security Vulnerability Attack

Production Support

Production support is always a tough job. Development is a lean process: it follows a timeline, with planning and execution inside that timeline. There is freedom to estimate and plan, whereas in support planning is never possible. One never knows what comes next.
Security vulnerabilities are sometimes taken lightly in production support, and there is often a disconnect among groups such as application, database, and infrastructure support. When these groups work in isolation and the communication channels among them are unclear, there is a high chance of lapses and paralysis in support.

POODLE Attack

Unused certificates

Check for expired SSL certificates.
Follow the step-by-step procedure to disable SSLv3.



Use the following site to check whether your site is POODLE-free.



You need to get a grade of A after you have applied the fix.

DoS/DDoS: (Distributed) Denial of Service

Look out for requests from the most common sources. Someone may be hammering your system by repeatedly loading or calling your website. Check netstat, IIS logs, the Windows application event log, web stats, or Google Analytics for an indication that something is wrong with your application. These reveal unusual behaviour, such as a large share of requests reaching your server from the same few sources.
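
Finding the most common sources in an IIS log can be automated. The Python sketch below counts requests per client IP in W3C-format log lines, locating the `c-ip` column from the `#Fields:` header. The function name is hypothetical, and the sketch assumes the `c-ip` field is enabled in the site's logging settings.

```python
from collections import Counter

def top_client_ips(log_lines, limit=5):
    """Count requests per client IP in W3C-format log lines."""
    ip_index = None
    counts = Counter()
    for line in log_lines:
        line = line.strip()
        if line.startswith("#Fields:"):
            # The header names the columns; locate the client IP column.
            fields = line[len("#Fields:"):].split()
            ip_index = fields.index("c-ip") if "c-ip" in fields else None
            continue
        if not line or line.startswith("#") or ip_index is None:
            continue
        parts = line.split()
        if len(parts) > ip_index:
            counts[parts[ip_index]] += 1
    return counts.most_common(limit)
```

Passing an open log file to `top_client_ips` returns the busiest client addresses with their request counts. A single address far ahead of the rest is worth a closer look.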

There is a chance that the login attempts of all users will be exhausted, locking their accounts. This has a huge business impact. Just imagine this happening to an e-commerce, banking, or financial site: the loss of a day's business would be enormous. This is why CAPTCHA was introduced in the early days of the web.

User Session
Normal Day
If, for a given day and time period, the number of sessions building up in the system grows exponentially, something serious is going on. Splunk, HP tools, and similar products help you find that out.
Check the size of the IIS log. Comparing it with previous days helps you analyse the situation more clearly.