Image Bug #3445

Very slow processing of MQ records (average ~ 240ms, expected 10ms)

Added by Radim about 1 month ago. Updated about 1 month ago.

Status: Closed
Due date:
Priority: Critical
% Done: 100%
Assignee: Radim
Category: -
Image: WebSphere MQ 9.1
Your Marketplace Account ID: 3334-3009-7275
Operating System: Linux
Marketplace: Amazon Web Services
JRE: Any
Customer State: Czech Republic
Instance Type: c4.xlarge
Customer Country: Czech Republic

Description

Hello, we are running a multi-instance MQ solution in multiple AWS accounts and we are facing a performance issue.
Yesterday (4.3.2020), starting at about 12:16 UTC, we ran a stress test on the MQ servers.
We observed very slow processing of MQ records (average ~240 ms, expected 10 ms). We checked the EFS metrics (PercentIOLimit, BurstCreditBalance, TotalIOBytes, PermittedThroughput) in CloudWatch and all seemed to be OK.
We could see increased IO waits (15-30%) on the Linux side when processing the queue. CPU and RAM seem to be OK.
We also tried temporarily switching EFS from burstable to provisioned throughput mode, but it did not help.
I have attached some logs from the measurements.
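The I/O wait figure mentioned above can be cross-checked without sar. What follows is only a minimal sketch, assuming a Linux host whose /proc/stat uses the standard counter order (user, nice, system, idle, iowait, irq, softirq, steal, ...). The function names are illustrative and are not part of any MQ or AWS tooling.

```python
import time


def parse_cpu_line(line):
    """Return the numeric counters from the aggregate 'cpu' line of /proc/stat."""
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return [int(value) for value in fields[1:]]


def iowait_percent(before, after):
    """Share of elapsed CPU time spent in iowait between two samples, in percent."""
    # Only the first 8 counters (user..steal) are summed. Guest time is
    # already included in user/nice, so adding it would double count.
    deltas = [b - a for a, b in zip(before[:8], after[:8])]
    total = sum(deltas)
    if total <= 0:
        return 0.0
    return 100.0 * deltas[4] / total  # index 4 is iowait


def sample_iowait(interval_seconds=5.0, path="/proc/stat"):
    """Read /proc/stat twice and return the iowait percentage in between."""
    def read():
        with open(path) as handle:
            return parse_cpu_line(handle.readline())

    first = read()
    time.sleep(interval_seconds)
    return iowait_percent(first, read())
```

Calling `sample_iowait()` while the stress test is running gives a figure that can be compared with the 15-30% seen in the attached measurements.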

We mount the filesystem with NFS options recommended by AWS:
defaults,nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2,noresvport,_netdev
This had no positive effect.
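To rule out a mount that silently differs from the intended options, the option string above can be compared with the options actually in effect. This is a minimal sketch, assuming the effective option string is supplied by the caller (for example, copied by hand from /proc/mounts). The kernel may report some options under different names or omit them, so only the keys passed in are compared and any mismatch still needs manual review. The helper names are hypothetical.

```python
# Option string copied from the ticket.
RECOMMENDED = ("defaults,nfsvers=4.1,rsize=1048576,wsize=1048576,"
               "hard,timeo=600,retrans=2,noresvport,_netdev")


def parse_mount_options(option_string):
    """Turn 'hard,rsize=1048576' into {'hard': True, 'rsize': '1048576'}."""
    options = {}
    for item in option_string.split(","):
        item = item.strip()
        if not item:
            continue
        key, separator, value = item.partition("=")
        options[key] = value if separator else True
    return options


def option_mismatches(expected, actual, keys):
    """Return {key: (expected, actual)} for each listed key that differs.

    A key that is absent on one side is reported with None on that side.
    """
    wanted = parse_mount_options(expected)
    found = parse_mount_options(actual)
    return {
        key: (wanted.get(key), found.get(key))
        for key in keys
        if wanted.get(key) != found.get(key)
    }
```

An empty result for keys such as `rsize`, `wsize`, `timeo` and `retrans` suggests the mount matches the intended settings for those options.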

We are running version 9.1.0.3 on 2x c5.xlarge instances with 1x EFS of size 300-400 MiB.
We run the same solution in another environment with version 9.1.0.1, and there we do not have the performance issue that we see in the account with the higher version.

From the AWS side everything is OK, with no HW issues. We are discussing it with our TAM and premium technical support in Case ID 6853506281.

Thank you for your help. This is a critical system for us and we are about to go live.

(1) - sar.txt (21.7 KB) Radim, 03/05/2020 03:41 pm

(3) - AMQ15101.0.FMT.mqlogperf.txt (32.7 KB) Radim, 03/05/2020 03:41 pm

primaryNode.txt (2.62 KB) Radim, 03/06/2020 12:28 pm

MQwithEFS.txt (1.25 KB) Radim, 03/06/2020 12:28 pm

MQwithEBS.txt (1.67 KB) Radim, 03/06/2020 12:28 pm

Screen Shot 2020-03-06 at 14.28.39.png (147 KB) Mariusz, 03/06/2020 01:28 pm

Screen Shot 2020-03-06 at 14.28.58.png (302 KB) Mariusz, 03/06/2020 01:28 pm

sar.txt (294 KB) Radim, 03/06/2020 02:22 pm

AMQ15101.0.FMT.mqlogperf.txt (173 KB) Radim, 03/06/2020 02:22 pm

History

#1 Updated by Mariusz about 1 month ago

  • Assignee set to Mariusz

#2 Updated by Mariusz about 1 month ago

  • Status changed from New to Feedback
  • Assignee changed from Mariusz to Radim

Hi,
Thank you for using our MQ listings from AWS Marketplace.
We will be investigating your issue with the highest priority. To get as much information as we can, could you please provide the data below for both environments: 9.1.0.3 and 9.1.0.1, which performs well?

1) CPU and available memory:
You mentioned you use c5.xlarge (4 CPUs, 8GiB Memory). Do you use the same for 9.1.0.1 version?

2) Disk information:
What is the size of the disk you use for both environments? (IOPS depend on the size of the disk as well)

3) Is there any other software running on these instances? Please check both environments. Is it possible that something is taking up the necessary memory?

4) Is there any other traffic on your network which could cause delays? Compare both environments.

5) Are there any additional firewalls etc.? Also, can you compare the security groups used in both environments?

6) Can you compare both environments when it comes to MQ configuration? (specific channels, queues, topics, and applications involved)

7) Can you provide us some details on how you are measuring MQ performance?

Regards,
Mariusz

#3 Updated by Radim about 1 month ago

Hello Mariusz, here are the answers.
1. Version 9.1.0.1 is running on c5.large - 2 CPU, 8 GiB memory.
2. All instances have gp2 EBS and burstable EFS. EBS has a size of 30 GB; EFS holds 700 MB for 9.1.0.1 and 400 MB for 9.1.0.3 (we also tested with provisioned EFS, without impact).
3. No, just MQ (default AMI from Marketplace). CPU and RAM utilization during the test is OK and there is a large performance reserve on the HW side, as confirmed by AWS support.
4. No, we have 2x 10 Gbit fibres (using Direct Connect) with a latency of 10-13 ms. During the test we run at 10% of total link capacity. The links are shared by the entire Organization in AWS.
5. We communicate with the same systems in both environments, with the same SG for every environment. I double-checked this.
6. The only difference is in the number of active logs. We have more in production due to message persistence. The configuration is the same 1:1. We don't use topics. Channels and queues have the same setup. The testing application is the same.
7. Measurements from the last testing are attached. We tested MQ with EBS and MQ with EFS, and collected primary node metrics.
Summary: with EBS we are able to get 50 messages per second, with EFS 20 messages. With 9.1.0.1 we are able to get 700+ messages (not re-tested yet; measured a few weeks ago).

Radim

#4 Updated by Mariusz about 1 month ago

Thank you for the information.
We contacted IBM support directly.
One more thing - can you check these two attachments are correct? sar.txt and AMQ15101.0.FMT.mqlogperf.txt
They just don't look like they are related to this issue.

#5 Updated by Radim about 1 month ago

Thank you Mariusz, the data are relevant.
They were collected from the primary node (measured 2 days ago). We use a multi-instance solution, with a primary node as the main MQ and a second node that is ready for DR (active).

Radim

#6 Updated by Mariusz about 1 month ago

When I open (1) and (2) I get something like this. Take a look at the screenshots.

#7 Updated by Radim about 1 month ago

Today we ran the test in the same environment as 9.1.0.3, but with new instances based on 9.1.0.0, with the same bad results. We built them from the Marketplace AMI (ami-0faf4e6318b64e0c4) and configured them from scratch. We have another environment that also runs 9.1.0.0, and it works correctly.

#8 Updated by Mariusz about 1 month ago

Hi Radim,
Just to confirm:
1. This environment which works properly is also running on AWS, right?
2. Have you seen the screenshots I attached in my previous comment? Could you please make sure the content of sar.txt and AMQ15101.0.FMT.mqlogperf.txt is correct?

We have already created a PMR with IBM and are working together to solve the issue.

Regards,
Mariusz

#9 Updated by Radim about 1 month ago

1. Correct.
2. You are right. I don't know why; locally I have the right files. Re-uploaded.

Radim

#10 Updated by Mariusz about 1 month ago

  • Status changed from Feedback to In Progress
  • Assignee changed from Radim to IBM Support

I have shared these files with IBM; they are investigating the issue.
In the meantime, try to find any differences between these two environments/instances. Are they in the same region and availability zones?

Is this 9.1.0.1 instance (which works correctly) based on the same AMI as the new one?
Could you also check the size of the messages sent to both environments?

#11 Updated by Radim about 1 month ago

Thank you Mariusz for sharing this with IBM.
The 9.1.0.1 instance that works correctly is based on a different AMI ID: ami-04c97e501eca3fcbc. The 9.1.0.0 instance that also works correctly, in another account, is based on ami-022635c79e041a1e2. We run the same performance test set in all accounts (all instances). We will try to run the test directly on the instance; currently we run it on another machine and send the messages to IBM MQ.

Radim

#12 Updated by Radim about 1 month ago

Hello, if we run the performance test directly on the IBM MQ server, everything works correctly.
We use the same region and AZ under the Organization, so there is no difference.
We will try to figure out whether there is a problem in the performance test itself.

Radim

#13 Updated by Mariusz about 1 month ago

  • Status changed from In Progress to Feedback
  • Assignee changed from IBM Support to Radim

Hi Radim,

Thank you for more information. So if I understand correctly it looks more like connection issues?

We received this answer from IBM. It unfortunately doesn't give us anything:

Hello.

Analysis of the infrastructure supporting a QMGR environment is outside the scope of MQ support.

MQ does not guarantee any particular level of performance as no two QMGR environments are the same.

If there are no specific MQ questions provided this case will be closed in one week.

Thanks.

They also confirmed there were no changes between versions which could affect performance. I will speak to my team about your issue, but I am afraid there are not many ways we can help you.

We also offer consultancy services. One of our experts could review your environment and try to find potential problems.

Let us know if you would like to schedule a call to learn more about the consultancy services we offer.

Regards,
Mariusz

#14 Updated by Radim about 1 month ago

Hello Mariusz, the problem was in the performance test, which was not multi-threaded. I don't know how it was possible that it ran OK in some accounts, but after setting up JMeter to run multi-threaded we are able to get 1200+ messages with the expected latency.

So I am sorry for wasting your time; you can close this ticket.

Have a nice day,
Radim
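This outcome is consistent with a latency-bound sender: a client that waits for each message to complete before sending the next cannot exceed one message per round trip per thread. Below is a minimal sketch of that arithmetic. It assumes (the ticket does not state this) that each test thread sends synchronously and that the per-message round trip stays roughly constant under load. The function names are illustrative.

```python
import math


def max_throughput(threads, round_trip_ms):
    """Upper bound on messages/second for synchronous senders (Little's law)."""
    return threads * 1000.0 / round_trip_ms


def implied_round_trip_ms(messages_per_second, threads=1):
    """Per-message time implied by an observed rate and a known thread count."""
    return threads * 1000.0 / messages_per_second


def threads_needed(target_messages_per_second, round_trip_ms):
    """Smallest number of synchronous senders that can reach the target rate."""
    return math.ceil(target_messages_per_second * round_trip_ms / 1000.0)


if __name__ == "__main__":
    # Rates reported earlier in this ticket for the single-threaded test.
    for label, rate in (("EBS", 50), ("EFS", 20)):
        print(label, implied_round_trip_ms(rate), "ms per message")
    print(threads_needed(1200, 20.0), "threads for 1200 msg/s at 20 ms")
```

Under these assumptions, the 50 and 20 messages per second reported earlier correspond to about 20 ms and 50 ms per message for a single sender, so the throughput ceiling came from the test client's concurrency and not necessarily from the storage or MQ version.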

#15 Updated by Mariusz about 1 month ago

  • Status changed from Feedback to Closed
  • % Done changed from 0 to 100

Hi Radim,

Good to know the problem has been solved.
This is also a good lesson for us to focus on how performance is tested.
Do not hesitate to contact us when you have any issues.

Kind regards,
Mariusz Chwalek
