# Performance Issue Check

After adopting the InsureMO platform, we conduct regular performance inspections to identify APIs performing below expectations.

Upon receiving inspection results, follow this guide to troubleshoot and optimize your system accordingly.

## Identify APIs

Before analyzing, please locate the following key information:  

- Tenant
- Environment
- API path
- Microservice
- API trigger time

<div class="docs-note"><span class="docs-admonitions-text">note</span>

When analyzing, you might find occasional slowness. For example, while most requests respond within 1 second, an occasional request may take up to 4 seconds. Such occurrences are often acceptable in these scenarios:

- Cache clean-up due to the daily expiry.
- Cache clean-up due to Java garbage collection (GC).
- Cache clean-up due to a new configuration deployment.
- High CPU/memory/RDS/ES usage.

When any of the above happens on BFF or platform nodes, occasional slowness is expected.

In a shared public environment with infrastructure cost limits and no traffic control, we are unable to eliminate such occasional slowness and meet the P95 target. Instead of focusing excessively on occasional slowness, we recommend monitoring average response times and analyzing the duration of the most frequent requests.

For rigid performance requirements, please contact the InsureMO support team. 

</div>


## Accesslog Analysis

The accesslog helps analyze performance issues. All our API performance inspections are based on the accesslog, so it should be the first place to look.

Please select the specific environment and enter the necessary information in the log monitor first.

### Search Method

Search criteria:

- Tenant
- Environment: URL
- API path: **request_path**
- Microservice: **current_app_name**
- API trigger time

Search results:

- API path: **request_path**
- Microservice: **current_app_name**
- API trigger time
- Trace ID: **trace_id**
- Duration: **latency**
- Import file size (optional): **request_content_size**
- Export file size (optional): **response_content_size**

![pfm_check_accesslog](./image/tool/pfm_check_accesslog.png)

In accesslog, please pay special attention to the distribution of time and frequency, and investigate the following questions:

### Is it the first batch of API calls each day?

Our cache automatically expires early each morning to reduce memory consumption. After the cache expires, the first few requests may be slow because the cache needs to be rebuilt.

Until the platform further optimizes its caching mechanism, please disregard slow requests from the first few calls each day, according to the environment settings.

![pfm_check_first_request](./image/tool/pfm_check_first_request.png)
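To illustrate why only the first calls after expiry are slow, here is a minimal, hypothetical sketch (not the platform's actual cache implementation): entries are rebuilt lazily on the first lookup after a clear, so only those lookups pay the rebuild cost.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical sketch of a daily-expiring cache: the first lookup after
// a clear (daily expiry or deployment clean-up) triggers a slow rebuild.
public class DailyCache<K, V> {
    private final Map<K, V> entries = new ConcurrentHashMap<>();
    private final Function<K, V> loader; // slow source, e.g. DB or config center
    public int rebuilds = 0;             // counts the slow, cache-miss loads

    public DailyCache(Function<K, V> loader) {
        this.loader = loader;
    }

    public V get(K key) {
        return entries.computeIfAbsent(key, k -> {
            rebuilds++;              // this path is the slow one
            return loader.apply(k);
        });
    }

    // Simulates the early-morning expiry (or a deployment clean-up).
    public void dailyClear() {
        entries.clear();
    }
}
```

Repeated lookups after the first are served from memory; after `dailyClear()`, the next lookup pays the rebuild cost again.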


### Are there any tenant or platform releases (such as reboots, service deployments, and business data releases)?

During a deployment, the cache is cleaned up entirely. After the clean-up, the first few requests may be slow because the cache needs to be rebuilt.

Until the platform further optimizes its caching mechanism, please disregard slow requests that immediately follow a release.

![pfm_check_deploy](./image/tool/pfm_check_deploy.png)

### Is there high concurrency or batch processing occurring at specific times?

With customer consent, please try to reduce concurrency as much as possible or contact the InsureMO team to further evaluate the hardware resources.
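As one way to reduce concurrency on the client side, here is a hedged sketch using a standard `Semaphore` to cap how many batch requests run against the platform at once (the class and method names are hypothetical, not an InsureMO API):

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical client-side throttle: a Semaphore caps the number of
// batch requests that may hit the platform at the same time.
public class BatchThrottle {
    private final Semaphore permits;
    private final AtomicInteger inFlight = new AtomicInteger();
    public final AtomicInteger maxObserved = new AtomicInteger();

    public BatchThrottle(int maxConcurrent) {
        this.permits = new Semaphore(maxConcurrent);
    }

    public void submit(Runnable apiCall) throws InterruptedException {
        permits.acquire();                       // blocks once the cap is reached
        try {
            int now = inFlight.incrementAndGet();
            maxObserved.accumulateAndGet(now, Math::max);
            apiCall.run();                       // the actual platform API call
        } finally {
            inFlight.decrementAndGet();
            permits.release();
        }
    }
}
```

Callers submit their batch items through the throttle instead of firing them all at once, so the platform never sees more than `maxConcurrent` requests in flight from this client.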

### Are there any special cases such as download, upload, or query?

- Download

  Download performance is heavily affected by file size, due to network transmission. Therefore, if it's a download-related or export-related API, please review the **response_content_size** field in the accesslog and agree on reasonable standards based on the file size with the InsureMO support team.

- Upload

  Please also check the **request_content_size** field to evaluate the upload size.

- Query

  By default, our ES queries return at most 1000 records. However, for other query APIs without record limits, performance degrades when the queried data set is large.

  If this causes a performance issue, we suggest imposing pagination on the query APIs so that only a limited number of records are returned per call.

![pfm_check_request_response_size](./image/tool/pfm_check_request_response_size.png)
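The pagination suggestion can be sketched as follows; `Pager` is a hypothetical helper, not a platform class, and `pageIndex` starts at 0:

```java
import java.util.List;

// Hypothetical pagination helper: instead of returning the whole result
// set, each API call returns one bounded page.
public class Pager {
    public static <T> List<T> page(List<T> all, int pageIndex, int pageSize) {
        int from = Math.min(pageIndex * pageSize, all.size());
        int to = Math.min(from + pageSize, all.size());
        return all.subList(from, to);       // at most pageSize records per call
    }
}
```

In a real query API, the page boundaries would be pushed down into the ES or SQL query rather than applied in memory; the point is simply that each call returns a bounded slice.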

For the scenarios identified above, another possible solution is to convert the APIs into asynchronous calls. In this way, an instant API response is returned and the final result is pushed later.
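As a hedged illustration of the asynchronous pattern (the class and method names below are hypothetical): the caller immediately receives a ticket, and the result is collected, or pushed via callback, once the background work completes.

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Hypothetical asynchronous wrapper: the caller gets a ticket at once,
// the heavy work runs in the background, and the result is fetched
// (or pushed via notification) later.
public class AsyncExport {
    private final Map<String, CompletableFuture<String>> jobs = new ConcurrentHashMap<>();

    // Returns immediately with a ticket instead of blocking on the export.
    public String start(Supplier<String> heavyWork) {
        String ticket = UUID.randomUUID().toString();
        jobs.put(ticket, CompletableFuture.supplyAsync(heavyWork));
        return ticket;
    }

    // Blocks until the job is done; a real system would push a
    // notification or webhook instead of polling.
    public String result(String ticket) throws Exception {
        return jobs.get(ticket).get();
    }
}
```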


## Trace Log Analysis

If all the above external factors have been ruled out and specific APIs are determined to be slow, the next step is to analyze those APIs in detail. Please note the trace IDs of the APIs and analyze them in the Jaeger trace monitor.

### Search Method

Search criteria:

- Tenant
- Environment: URL
- Trace ID: **trace_id**

Search results:

Please focus on microservice names and each API's duration. It's better to ask a senior member of your team to check all microservice names for both platform and BFF.

![pfm_check_tracelog](./image/tool/pfm_check_tracelog1.png)


### Is an individual platform API slow?

If you are using APIs that are related to general insurance policies, see [Policy API Performance Improvement](https://docs.insuremo.com/gi_insurance_service/policy_api_pfm) for further checks.

### Is the internal API logic of the BFF slow?

#### Too Many Platform API Calls

Ways to optimize the number of calls can be:

- Adopt methods of batch calling (such as bulk loading of data tables).
- Adopt methods of aggregated calling (such as combining **Underwriting** and **Calculation** into a single action).
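The difference between per-item and batch calling can be sketched as follows (an illustrative stub, not the platform's actual bulk API): loading N records one by one costs N platform calls, while a bulk load costs a single call.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical stub comparing per-item and batch loading; apiCalls
// counts how many round trips to the platform each approach makes.
public class BulkLoad {
    public final AtomicInteger apiCalls = new AtomicInteger();

    // One platform call per id: N ids cost N round trips.
    public String loadOne(String id) {
        apiCalls.incrementAndGet();
        return "policy:" + id;
    }

    // One platform call for all ids: N ids cost a single round trip.
    public Map<String, String> loadBulk(List<String> ids) {
        apiCalls.incrementAndGet();
        return ids.stream()
                .collect(Collectors.toMap(Function.identity(), id -> "policy:" + id));
    }
}
```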

![pfm_check_repetitive_call_trace](./image/tool/pfm_check_repetitive_call_trace.jpg)

![pfm_check_repetitive_call_kibana](./image/tool/pfm_check_repetitive_call_kibana.jpg)


#### Unnecessary Seata calls

Please do not enable global transactions unless transaction consistency is required.

Here is an example of incorrect Seata usage: a BFF that only orchestrates load APIs, where transactions do not need to be considered at all.

```java
@RequestMapping(value = "/commission/loadSettlementPaged", method = RequestMethod.POST)
public PagedResult<CommissionSettlement> loadSettlementPaged(
        @RequestBody SettlementLoadCondition requestBody) {

    globalTransactionService.beginGlobalTransaction(
            AppContext.getTenantCode() + "bff-SalesChannel-Commission");

    return this.salesCommissionResource.loadSettlementPaged(requestBody);
}

@RequestMapping(value = "/commission/loadCommissionlistBySettlementId", method = RequestMethod.GET)
public List<CommissionSettlement> loadCommissionlistBySettlementId(
        @RequestParam("settlementId") Long settlementId) {

    globalTransactionService.beginGlobalTransaction(
            AppContext.getTenantCode() + "bff-SalesChannel-Commission");
    return this.salesCommissionResource.loadCommissionlistBySettlementId(settlementId);
}

```
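The corrected pattern can be sketched with stubs (not the real `globalTransactionService` API): the read-only load paths never open a global transaction, and only a genuine cross-service write flow does.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the fix: begin a global transaction only where
// cross-service consistency is actually required.
public class TransactionGuard {
    public final List<String> begun = new ArrayList<>();   // transactions opened

    public String loadSettlementPaged() {
        // Pure read: no beginGlobalTransaction call at all.
        return "page-of-settlements";
    }

    public String settleAcrossServices() {
        // Cross-service write: the only place a global transaction
        // is justified.
        begun.add("bff-SalesChannel-Commission");
        return "settled";
    }
}
```

Applied to the example above, both `loadSettlementPaged` and `loadCommissionlistBySettlementId` would simply drop the `beginGlobalTransaction` call.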


#### Unnecessary Iteration

Please follow standard coding practices to optimize your code.

![pfm_check_bff_iteration](./image/tool/pfm_check_bff_iteration.png)
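A typical rewrite can be sketched as follows (illustrative only, with hypothetical names): matching two lists with a nested loop is O(n×m), while indexing one side in a `HashMap` first reduces it to O(n+m).

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative comparison of a nested-loop match vs. a map-indexed match.
public class IterationFix {
    // Nested-loop version: scans all settlements for every commission.
    public static int matchNested(List<String> commissions, List<String> settlements) {
        int matches = 0;
        for (String c : commissions)
            for (String s : settlements)
                if (c.equals(s)) matches++;
        return matches;
    }

    // Indexed version: one pass to build the map, one pass to probe it.
    public static int matchIndexed(List<String> commissions, List<String> settlements) {
        Map<String, Boolean> index = new HashMap<>();
        for (String s : settlements) index.put(s, true);
        int matches = 0;
        for (String c : commissions)
            if (index.containsKey(c)) matches++;
        return matches;
    }
}
```

Both versions return the same result; only the amount of work differs, which becomes visible once the lists hold thousands of records.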


### Is the integration with third-party services slow?

#### Optimize Slow Third-party API Calls

You can optimize third-party services or switch to asynchronous calls.
 
#### Improve Middleware Integration

You can increase middleware capacity or skip unnecessary processes, such as calling Elasticsearch APIs to build index structures.

## Applog Analysis

The applog is usually not the key for performance analyses. However, in certain scenarios, the applog will help with performance troubleshooting.

Therefore, if you have located slow APIs with trace IDs, we suggest you take a look at the applog for a comprehensive analysis.

### Search Method

Search criteria:

- Tenant
- Environment: URL
- Trace ID: **trace_id**
- API trigger time

Search results:

- API path: **request_path**
- Microservice: **current_app_name**
- API trigger time
- Trace ID: **trace_id**
- Log message: **message**

![pfm_check_applog](./image/tool/pfm_check_applog.png)

### SQL Execution Too Long

Whether it is a BFF API or a platform API, most APIs interact with the DB to save and load data. A couple of scenarios might cause DB-related performance issues:

- A DB structure that is not well tuned (e.g. a missing index)
- Excessive request pressure
- Insufficient DB resources

Once such logs are found, follow the steps below:  

- For a BFF API, you need to consider optimizing the table structure setup.  
- For a platform API, you can contact the InsureMO support team for help.

![pfm_check_sql_long](./image/tool/pfm_check_sql_long.png)


### Too Many Unnecessary Logs

Print only the necessary logs and change the rest to the debug level. Producing too many logs can impact performance.

Once you find that too many application logs are produced, proceed as follows:  

- For a BFF API, you need to consider optimizing your program and change logs to the debug level.  
- For a platform API, you can contact the InsureMO support team for help.

![pfm_check_too_many_log](./image/tool/pfm_check_too_many_log.png)


### Timeout

If a timeout happens, the overall performance will be impacted. Therefore, you need to locate the problems and find out the exact reasons.


## GClog Analysis

System slowness might be caused by a lack of memory. To check further, you can view the GClog in our log monitor.

By default, the allocated memory for a BFF is 1G. If too many GC logs are produced, or the log shows too many full GC runs, you need to either optimize the program or expand the memory.

![pfm_check_gc_log](./image/tool/pfm_check_gc_log.png)


### Are such fine-grained microservices necessary?

When three separate services (bff-app, integration-app, batch-app) have been developed, you can follow the steps below to merge them (BFF + integration + batch) into one to save infrastructure resources:

1. Take the BFF as the target service.
2. Check the configuration center. You will find that these services share the same **tenant_db** schema, so no data migration is needed.
3. Merge the code into the BFF's git repository.
4. Optional (check whether any APIs are exposed from the integration and batch services): adjust the swagger or static routes for them.
5. Check whether cross-service API calls exist (e.g. the BFF calling the integration service). If they do, either adjust the calls or configure routes via `newer_app_name.integration-app=bff-app`.

### Is assigned memory not big enough?

To assign memory, there is a config parameter called **java.options**.

By default, the platform will set this parameter as follows for all tenant BFFs. Its memory capacity is 1G at maximum:

* 1G: **-XX:-TieredCompilation -Xms100M -Xmx1024M -Dsun.net.client.defaultConnectTimeout=2000 -Dsun.net.client.defaultReadTimeout=30000**

To change the memory capacity, you can add the parameter in the tenant configuration center to increase the capacity to 2G or decrease it to 512M:

* 2G: **-XX:-TieredCompilation -Xms100M -Xmx2048M -Dsun.net.client.defaultConnectTimeout=2000 -Dsun.net.client.defaultReadTimeout=30000**  
* 512M: **-XX:-TieredCompilation -Xms100M -Xmx512M -Dsun.net.client.defaultConnectTimeout=2000 -Dsun.net.client.defaultReadTimeout=30000**

Please remember to check with the SiteOps team about the assigned EC2 capacity before changing the memory capacity of the parameter.


## JFR Analysis

In a traditional Java program, when performance issues emerge, JProfiler is an important tool for developers to perform detailed analyses and find out which part of the program is slow, down to the source-code level.

If you still cannot locate the exact reason why the program is slow after checking all the logs mentioned above, JFR (Java Flight Recorder), a lightweight substitute for JProfiler, can be another choice for further checks.

For more details about how to use JFR, please see [JFR Recording](https://docs.insuremo.com/ics/app_framework/jfr).


## Other Scenarios

### Deprecate APIs 

If an API is planned to be deprecated in future versions, it can be temporarily marked and handled later, once the new API is deployed.

### Timeout Setting

At the InsureMO level, to protect the platform from being dragged down by external requests, multiple timeout settings are imposed:

1. Gateway (total response time of a single API call, whether orchestration or atomic): 2 minutes
2. Seata-server (response time of an orchestration across multiple services via seata-server): 1 minute
3. Ribbon (response time of a single platform API call): 30 seconds

That is to say:

* If a user calls a single InsureMO API, a timeout error occurs when the response time exceeds 30 seconds.
* If a user orchestrates multiple InsureMO APIs using seata-server, a timeout error occurs when the total response time, including the orchestration itself, exceeds 1 minute.
* If a user orchestrates multiple InsureMO APIs without seata-server, a timeout error occurs when the total response time, including the orchestration itself, exceeds 2 minutes.

So when you start to orchestrate your APIs, please take the above limits into serious consideration. If the response time genuinely exceeds them, you should either:

* Change your implementation method, e.g., adopt asynchronous approaches.
* Change your business solution, e.g., use group policies to split a large number of insured individuals.

It is generally not advisable for any API response time to exceed 30 seconds, whether the API is atomic or an orchestration. The InsureMO team will not be responsible for stability issues caused by such long-running requests.

## How to seek platform team support?

If you want to seek support from the platform team for system or performance issues, please follow the guide above to troubleshoot first, and then attach all the evidence from the aspects mentioned to your request. Only then can we carry out a targeted analysis for you.
