Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

onert-micro training api #12996

Draft
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

chunseoklee
Copy link
Contributor

No description provided.

@chunseoklee chunseoklee added the PR/NO TEST Tell CI to not run test label May 14, 2024
Signed-off-by: Chunseok Lee <chunseok.lee@samsung.com>
@chunseoklee chunseoklee force-pushed the onert-micro-api-draft branch 3 times, most recently from dcbf274 to 6da9827 Compare May 16, 2024 10:59
// 4. check loss
float loss;
om_train_get_loss(ctx, 0, &loss);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

5/17일자 녹스 미팅에서 말씀 주신대로, PR기준 EarlyStop 관련 추가 요청 드립니다.
아시겠지만 EarlyStop은 대략 이 시점에서 global minimum loss를 계속 추적하는 기능입니다. (app에서 loss를 쓸지, accuracy를 쓸지 metrics를 설정해줄 수도 있지만)

  1. 이번 step의 loss가 global minimum loss보다 더 좋아지지(낮아지지) 않았으면, count 증가
  2. count가 app에서 지정해준 횟수 N을 넘으면 최대 epoch에 도달하지 않더라도 학습 중지(early stop) -> 더 이상의 학습이 의미 없다고 판단

만약 복잡도나 일정 등의 비용에 크게 부담이 되지 않는다면, EarlyStop 기능은 이 레벨에서 구현되는 것이 구조적으로 좋을 것 같습니다.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to keep onert-micro as simple as possible since a variant introduces more maintenance cost. IMHO, it is better to implement at platform(aifw) level. That is, while onert-micro provides basic feature like this api, AIFW can provide high level features by assembling them.

}
else {
om_train_save_as_inferencemodel(ctx, PATH);
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

모든 체크포인트를 매번 inference가능한 산출물로 저장하면 안 되나요? 비용 차이가 크게 나는지 궁금합니다.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to unify the inference and the checkpoint format. But, not now.
Moreover, once checkpoint is converted into inference model, it is no more trainable.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But app should validate origin model and new model if check point happened.
How it possible without updated file? (IMO, at least some kind of temp file is necessary)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

om_train_save_as_inferencemodel(ctx, PATH); is for producing inference model. To validate the current model(during training), you can use om_train_inference(om_context *context);

// 4. check loss
float loss;
om_train_get_loss(ctx, 0, &loss);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

체크포인트 저장 시점은 보통 유저가 내린 설정에 따라서 결정되는데, 참고로 공유 드립니다.

`
best_model_name = f"./inverter_model/{self.get_model_name()}_{str(now_ts)}.h5"

mc = ModelCheckpoint(best_model_name, monitor="val_loss", save_best_only=True, mode="min", verbose=0)
`

=> monitor 대상인 "val_loss"가 "min"기준으로, best 일 때만 best_model_name 경로에 체크포인트를 저장

@chunseoklee
Copy link
Contributor Author

}
else {
om_train_save_as_inferencemodel(ctx, PATH);
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But app should validate origin model and new model if check point happened.
How it possible without updated file? (IMO, at least some kind of temp file is necessary)

*
* @return @c OM_STATUS_NO_ERROR if successful
*/
OM_STATUS om_train_compile(om_context *ctx);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some more parameter need to be expanded, as I introduced API set in last meeting.(late, loss, metrix...)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will add config api like set_train_info. But note that this config is optional and basically, circle model itself contains training info(loss,...)

* If it is nullptr, it will not change shape and batch size
* @return @c OM_STATUS_NO_ERROR if successful
*/
OM_STATUS om_train_set_input(om_context *ctx, uint32_t index, const void *input, int size);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can be conbined with om_train_inference, I think.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As mentioned before, I'd like to keep one api for one role. Thus, hope that this api will be as it is. Only different between you suggestion and current function is single om_train_set_input call

@chunseoklee
Copy link
Contributor Author

FYI, we will update checkpoint api based on #12997.

@chunseoklee
Copy link
Contributor Author

While transforming to use nnfw's API, it found that we need extra (output buffer) copy to use nnfw_set_output in onert-micro side. onert-micro uses internal buffer for output while onert uses output buffer allocated by user.

@chunseoklee
Copy link
Contributor Author

api implementation based on #13107 will be on https://github.com/chunseoklee/ONE/commits/v3

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
PR/NO TEST Tell CI to not run test
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants